NLP-MeTaxa: A Natural Language Processing Approach for Metagenomic Taxonomic Binning Based on Deep Learning

Brahim       Matougui; Abdelbasset       Boukelia; Hacene       Belhadef; Clovis       Galiez; Mohamed       Batouche

Abstract

Background: Metagenomics is the study of genomic content in mass from an environment of interest such as the human gut or soil. Taxonomy is one of the most important fields of metagenomics, which is the science of defining and naming groups of microbial organisms that share the same characteristics. The problem of taxonomy classification is the identification and quantification of microbial species or higher-level taxa sampled by high throughput sequencing.

Objective: Although many methods exist to deal with the taxonomic classification problem, assignment to low taxonomic ranks remains an important challenge for binning methods as is scalability to Gbsized datasets generated with deep sequencing techniques.

Methods: In this paper, we introduce NLP-MeTaxa, a novel composition-based method for taxonomic binning, which relies on the use of words embeddings and deep learning architecture. The new proposed approach is word-based, where the metagenomic DNA fragments are processed as a set of overlapping words by using the word2vec model to vectorize them in order to feed the deep learning model. NLP-MeTaxa output is visualized as NCBI taxonomy tree, this representation helps to show the connection between the predicted taxonomic identifiers. NLP-MeTaxa was trained on large-scale data from the NCBI RefSeq, more than 14,000 complete microbial genomes. The NLP-MeTaxa code is available at the website: https://github.com/padriba/NLP_MeTaxa/.

Results: We evaluated NLP-MeTaxa with a real and simulated metagenomic dataset and compared our results to other tools' results. The experimental results have shown that our method outperforms the other methods especially for the classification of low-ranking taxonomic class such as species and genus.

Conclusion: In summary, our new method might provide novel insight for understanding the microbial community through the identification of the organisms it might contain.

Keywords: Metagenomics, natural language processing, word2vec, embeddings, multilayer perceptron, language processing.

« Previous Next »

Graphical Abstract

Rights & Permissions Print Cite

Article Metrics

51

Journal Information

For Authors

For Editors

For Reviewers

Explore Articles

Open Access

Open Access Articles

For Visitors

DOI https://dx.doi.org/10.2174/1574893616666210621101150	Print ISSN 1574-8936
Publisher Name Bentham Science Publisher	Online ISSN 2212-392X

Current Bioinformatics

NLP-MeTaxa: A Natural Language Processing Approach for Metagenomic Taxonomic Binning Based on Deep Learning

Abstract

Graphical Abstract

Related Journals

Related Books