Background: Metagenomics is the study of genomic content recovered in bulk from an environment
of interest, such as the human gut or soil. Taxonomy, the science of defining and naming groups of
microbial organisms that share the same characteristics, is one of the most important fields of
metagenomics. The taxonomic classification problem is the identification and quantification of
microbial species or higher-level taxa sampled by high-throughput sequencing.
Objective: Although many methods address the taxonomic classification problem, assignment
to low taxonomic ranks remains an important challenge for binning methods, as does scalability to
Gb-sized datasets generated with deep sequencing techniques.
Methods: In this paper, we introduce NLP-MeTaxa, a novel composition-based method for taxonomic
binning, which relies on word embeddings and a deep learning architecture. The proposed
approach is word-based: each metagenomic DNA fragment is processed as a set of overlapping
words, which the word2vec model vectorizes to provide the input to the deep learning model.
The NLP-MeTaxa output is visualized as an NCBI taxonomy tree, a representation that shows the
connections between the predicted taxonomic identifiers. NLP-MeTaxa was trained on large-scale data
from NCBI RefSeq comprising more than 14,000 complete microbial genomes. The NLP-MeTaxa code is
available at https://github.com/padriba/NLP_MeTaxa/.
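As an illustration of the word-based representation described above, the following minimal sketch splits a DNA fragment into overlapping words (k-mers). It is not taken from the NLP-MeTaxa code base: the function name `kmer_words`, the word length `k=6`, the stride of 1, and the choice to skip words containing ambiguous bases are all assumptions made for the example.

```python
def kmer_words(fragment, k=6, stride=1):
    """Split a DNA fragment into overlapping words of length k.

    Hypothetical helper for illustration only; k, stride, and the
    handling of ambiguous bases are assumptions, not values from the paper.
    """
    seq = fragment.upper()
    words = []
    for start in range(0, len(seq) - k + 1, stride):
        word = seq[start:start + k]
        # Keep only words made entirely of unambiguous nucleotides.
        if set(word) <= set("ACGT"):
            words.append(word)
    return words


# Each fragment becomes a "sentence" of words, the kind of input a
# word2vec-style embedding model is typically trained on.
print(kmer_words("ATGCGTA", k=3))  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA']
```

With a stride of 1, a fragment of length n yields at most n - k + 1 words, so neighbouring words share k - 1 nucleotides.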
Results: We evaluated NLP-MeTaxa on real and simulated metagenomic datasets and compared our
results with those of other tools. The experimental results show that our method outperforms the
other methods, especially for classification at low taxonomic ranks such as species and genus.
Conclusion: In summary, our method may provide novel insight into microbial communities
by identifying the organisms they contain.