Aims: To assess the error profile of NGS data generated by high-throughput sequencing machines.
Background: Short-read sequencing data from Next Generation Sequencing (NGS) are currently being generated by a number of research projects. Characterizing the errors produced by NGS platforms and accurately calling genetic variation from reads are two interdependent tasks. Both are highly significant for analyses such as genome sequence assembly, SNP calling, evolutionary studies, and haplotype inference. Systematic and random errors occur with a profile specific to each sequencing platform, i.e., Illumina sequencing, Pacific Biosciences, 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Ion Torrent sequencing, and Oxford Nanopore sequencing. Advances in NGS deliver enormous volumes of data, along with errors. A fraction of these errors may mimic genuine biological signals, e.g., mutations, and may consequently invalidate downstream results. Various independent applications have been proposed to correct sequencing errors. A systematic analysis of these algorithms shows that state-of-the-art models are lacking.
Objective: In this paper, an efficient computational error estimation model called ESREEM is proposed to assess the error rates in NGS data.
Methods: The proposed model builds on the premise that a linear regression relationship exists between the number of reads containing errors and the number of reads sequenced. The model is based on a probabilistic error model integrated with a Hidden Markov Model (HMM).
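
The linear relationship assumed in the Methods can be illustrated with a minimal ordinary least-squares sketch. This is not the ESREEM implementation: the function name `fit_line` and the read counts are hypothetical and chosen only for illustration, and the HMM component is not modeled here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical counts (illustrative only, not taken from the paper):
# total reads sequenced and reads containing at least one error.
reads_sequenced = [100_000, 200_000, 400_000, 800_000]
reads_with_errors = [2_005, 4_005, 8_005, 16_005]

slope, intercept = fit_line(reads_sequenced, reads_with_errors)
print(f"estimated erroneous-read rate per read sequenced: {slope:.4f}")
print(f"intercept: {intercept:.2f}")
```

Under this assumption, the fitted slope can be read as the expected number of erroneous reads per additional read sequenced.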
Results: The proposed model is evaluated on several benchmark datasets, and the results are compared with those of state-of-the-art algorithms.
Conclusion: Experimental results show that the proposed model estimates errors efficiently and runs in less time than the compared algorithms.