Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

ESREEM: Efficient Short Reads Error Estimation Computational Model for Next-generation Genome Sequencing

Author(s): Muhammad Tahir, Muhammad Sardaraz*, Zahid Mehmood and Muhammad Saud Khan

Volume 16, Issue 2, 2021

Published on: 14 June, 2020

Page: [339 - 349] Pages: 11

DOI: 10.2174/1574893615999200614171832

Abstract

Aims: To assess the error profile of NGS data generated by high-throughput sequencing machines.

Background: Short-read sequencing data from Next Generation Sequencing (NGS) are currently being generated by a number of research projects. Characterizing the errors produced by NGS platforms and inferring accurate genetic variation from reads are two interdependent phases. Both are highly significant in various analyses, such as genome sequence assembly, SNP calling, evolutionary studies, and haplotype inference. Systematic and random errors show an incidence profile specific to each sequencing platform, e.g., Illumina sequencing, Pacific Biosciences, 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Ion Torrent sequencing, and Oxford Nanopore sequencing. Advances in NGS deliver enormous volumes of data, along with errors. A fraction of these errors may mimic genuine biological signals, such as mutations, and may subsequently invalidate the results. Various independent applications have been proposed to correct sequencing errors. Systematic analysis of these algorithms shows that state-of-the-art models are still lacking.

Objective: In this paper, an efficient error estimation computational model called ESREEM is proposed to assess the error rates in NGS data.

Methods: The proposed model rests on the premise that a linear regression relationship exists between the number of reads containing errors and the number of reads sequenced. The model is based on a probabilistic error model integrated with a Hidden Markov Model (HMM).
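
The linear relationship described in the Methods can be illustrated with a small sketch. This is not the ESREEM implementation, which the abstract does not specify; it is only an ordinary least-squares fit showing how a slope relating erroneous reads to sequenced reads could be read as an approximate per-read error rate. The function name `fit_line` and all counts below are hypothetical, chosen purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical counts (not taken from the paper): reads sequenced at
# several sampling depths, and reads found to contain an error at each.
reads_sequenced = [1000, 2000, 4000, 8000]
reads_with_errors = [21, 40, 82, 159]

slope, intercept = fit_line(reads_sequenced, reads_with_errors)
# Under the assumed linear relationship, the slope approximates the
# fraction of sequenced reads that contain an error.
print(f"estimated fraction of erroneous reads: {slope:.4f}")
```

In this toy setting the slope plays the role of the error-rate estimate; how ESREEM combines such a regression with its probabilistic error model and HMM is described in the full paper, not in this abstract.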

Results: The proposed model is evaluated on several benchmark datasets and the results obtained are compared with state-of-the-art algorithms.

Conclusion: Analysis of the experimental results shows that the proposed model estimates errors efficiently and runs in less time than the compared algorithms.

Keywords: NGS, genome, sequencing, error analysis, computational, algorithms.

© 2024 Bentham Science Publishers