An Efficient Tool for Searching Maximal and Super Maximal Repeats in Large DNA/Protein Sequences via Induced-Enhanced Suffix Array

Sanjeev Kumar; Suneeta Agarwal; Ranvijay

Abstract

Background: DNA and Protein sequences of an organism contain many repeated structures of various types. These repeats play an important role in molecular biology because they are related to the genetic background of inherited diseases. They also serve as markers for DNA mapping and DNA fingerprinting. Efficient searching of maximal and super maximal repeats in DNA/Protein sequences can lead to many other applications in genomics. Moreover, these repeats can be used to identify critical diseases by comparing the frequency distributions of repeats in viruses and genomes, without using alignment algorithms.

Objective: The study aims to develop an efficient tool for searching maximal and super maximal repeats in large DNA/Protein sequences.

Methods: The proposed tool uses a newly introduced data structure, the Induced Enhanced Suffix Array (IESA). IESA extends the enhanced suffix array by using an induced suffix array in place of the classical suffix array. It consists of the Induced Suffix Array (ISA) and one additional array, the Longest Common Prefix (LCP) array. ISA holds the starting positions of all suffixes of the input sequence in sorted order, while the LCP array stores the length of the longest common prefix of each pair of consecutive suffixes in the induced suffix array. IESA is known to be efficient with respect to both time and space, and it allows secondary memory to be used when constructing a large suffix array.
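As an illustration of how a suffix array and an LCP array can expose maximal and super maximal repeats, the following Python sketch may help. It is not the MSR-IESA implementation: the suffix array is built here by plain sorting as a stand-in for induced sorting, the scan over LCP intervals is quadratic in the worst case, and all function names are hypothetical. It assumes the usual definitions: a maximal repeat is a substring occurring at least twice that cannot be extended to the left or right without losing an occurrence, and a super maximal repeat is a maximal repeat that is not a substring of any other maximal repeat.

```python
def build_suffix_array(s):
    """Start positions of all suffixes of s in sorted order.

    Assumption: naive sorting stands in for induced sorting, which
    would build the same array in linear time.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_lcp_array(s, sa):
    """lcp[i] = length of the longest common prefix of the suffixes
    at sa[i-1] and sa[i]; lcp[0] = 0 (Kasai-style linear scan)."""
    n = len(s)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for p in range(n):
        if rank[p] == 0:
            h = 0
            continue
        q = sa[rank[p] - 1]  # suffix just before p in sorted order
        while p + h < n and q + h < n and s[p + h] == s[q + h]:
            h += 1
        lcp[rank[p]] = h
        if h > 0:
            h -= 1
    return lcp


def maximal_repeats(s):
    """Map each maximal repeat of s to its sorted start positions."""
    n = len(s)
    sa = build_suffix_array(s)
    lcp = build_lcp_array(s, sa)
    repeats = {}
    for i in range(1, n):
        length = lcp[i]
        if length == 0:
            continue
        # Grow the block of suffixes sharing a prefix of this length.
        # Because lcp[i] equals the length exactly, the shared prefix
        # cannot be extended to the right for every occurrence.
        lo = i - 1
        while lo > 0 and lcp[lo] >= length:
            lo -= 1
        hi = i
        while hi + 1 < n and lcp[hi + 1] >= length:
            hi += 1
        positions = sorted(sa[lo:hi + 1])
        # Left-maximal if the preceding characters differ
        # (None marks an occurrence at the start of the sequence).
        left = {s[p - 1] if p > 0 else None for p in positions}
        if len(left) >= 2:
            repeats[s[positions[0]:positions[0] + length]] = positions
    return repeats


def supermaximal_repeats(repeats):
    """Keep the maximal repeats that occur inside no other one."""
    return {w: pos for w, pos in repeats.items()
            if not any(w != v and w in v for v in repeats)}


if __name__ == "__main__":
    found = maximal_repeats("banana")
    print(found)                        # {'a': [1, 3, 5], 'ana': [1, 3]}
    print(supermaximal_repeats(found))  # {'ana': [1, 3]}
```

For "banana" the sorted suffixes start at positions 5, 3, 1, 0, 4, 2 and the LCP values are 0, 1, 3, 0, 0, 2. The repeat "na" is rejected because both of its occurrences are preceded by "a", so it can be extended to the left.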

Results: An open-source standalone tool named MSR-IESA for searching maximal and super maximal repeats in DNA/Protein sequences is available at https://github.com/sanjeevalg/MSRIESA. Experimental results show that the proposed algorithm outperforms other state-of-the-art works with respect to both time and space.

Conclusion: The proposed tool MSR-IESA is remarkably efficient for analysing DNA/Protein sequences that contain maximal and super maximal repeats of any length. It can be used for the identification of well-known diseases.

Keywords: DNA, protein, maximal repeats, super-maximal repeats, induced enhanced suffix array, LCP array.




Journal: Recent Patents on Computer Science
Publisher: Bentham Science Publisher
DOI: https://dx.doi.org/10.2174/2213275911666181107095645
Print ISSN: 2213-2759
Online ISSN: 1874-4796
