Hidden Markov Model for Splicing Junction Sites Identification in DNA Sequences

Srabanti Maji; Deepak Garg

Abstract

Identifying coding sequences in genomic DNA is the major step in gene identification. In eukaryotic organisms, a gene's structure includes a promoter, start codon, exons, introns, and stop codon, among other elements, and identifying the gene requires accurate labeling of these segments. A splice site is the boundary between an exon and an intron. Although the sequences adjacent to splice sites are highly conserved, splice site prediction accuracy is generally lower than 90%. Because this accuracy is not yet satisfactory, much attention has been paid to improving it, and improving the underlying algorithms is essential. In this manuscript, a Hidden Markov Model (HMM) based splice site predictor is developed and trained using a Modified Expectation Maximization (MEM) algorithm. A 12-fold cross-validation technique is also applied to check the reproducibility of the results and to further improve prediction accuracy. The proposed system achieves an accuracy of 98% for true donor sites and 93% for true acceptor sites in standard DNA (nucleotide) sequences.

Keywords: Algorithms, coding sequence, cross validation, gene finding, hidden Markov model, modified expectation maximization (MEM), splice site.
