Rapid advances in next-generation sequencing technologies have greatly accelerated
metagenomics research, but they also pose substantial challenges for the analysis of metagenomic data.
Metagenomic samples can contain thousands of microbial species, so sequencing datasets can contain
fragments from thousands of different genomes. Clustering the sequencing reads by their genome of
origin, a step known as binning, is therefore usually performed to expedite further studies. Currently, binning
methods are divided into two categories: supervised methods (which require reference genomes) and
unsupervised methods (which do not).
We present an unsupervised binning method that combines a novel sequence feature recognition method with a spectral
clustering algorithm. The sequence feature is a hybrid of sequence correlation and sequence composition analyses.
Experiments on both simulated and actual metagenomic datasets suggest that combining sequence
composition with an intrinsic correlation of oligonucleotides, both extracted from tetranucleotide analyses, performs better
than either feature alone. A spectral clustering algorithm, which is a high-performance unsupervised clustering method, is
also applied in our binning method. The method is available as an open source package called HSS-bin (Hybrid Sequence
feature and Spectral clustering unsupervised metagenomic binning) at http://bioinfo.seu.edu.cn/HSS-bin/.
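To make the two ingredients concrete, the following is a minimal sketch, not the HSS-bin implementation. It assumes a plain 256-dimensional tetranucleotide frequency vector as the composition feature, a Gaussian affinity with a median-distance bandwidth, and a two-way split on the sign of the second eigenvector of the normalized affinity matrix. The oligonucleotide correlation feature is not defined in this text and is therefore omitted. The function names (tetranucleotide_frequencies, spectral_bipartition) and all parameters are hypothetical and chosen only for illustration.

```python
import math
import random
from itertools import product

# All 256 tetranucleotides in a fixed order, mapped to vector positions.
TETRAMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=4))}


def tetranucleotide_frequencies(seq):
    """Return the normalized 256-dimensional tetranucleotide frequency vector."""
    counts = [0.0] * len(TETRAMER_INDEX)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        # Windows containing N or any other non-ACGT symbol are skipped.
        idx = TETRAMER_INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def spectral_bipartition(features, iterations=500, seed=0):
    """Split at least two feature vectors into two groups (labels 0 and 1).

    Assumed design: Gaussian affinity W, normalized matrix
    M = D^-1/2 W D^-1/2, and the sign of the second eigenvector of M,
    found by power iteration on (M + I) after deflating the first one.
    """
    n = len(features)
    dist = [[math.dist(a, b) for b in features] for a in features]
    # Heuristic bandwidth: median of the off-diagonal distances.
    off_diag = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))
    sigma = off_diag[len(off_diag) // 2] or 1.0
    w = [[math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2)) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in w]
    m = [[w[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of M is proportional to sqrt(deg).
    norm = math.sqrt(sum(deg))
    u1 = [math.sqrt(d) / norm for d in deg]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # Remove the component along u1, then apply (M + I) and renormalize.
        dot = sum(a * b for a, b in zip(v, u1))
        v = [a - dot * b for a, b in zip(v, u1)]
        v = [v[i] + sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    return [1 if a > 0 else 0 for a in v]
```

Calling tetranucleotide_frequencies on each read and passing the resulting list to spectral_bipartition yields one label per read. The sketch uses only the standard library to stay self-contained; binning into more than two groups would need several eigenvectors and a final clustering step, which this two-way version does not attempt.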
We evaluated HSS-bin’s performance using both simulated and actual metagenomic datasets. Experimental results
indicate that HSS-bin can handle metagenomic sequencing data with non-uniform species abundance, short sequences,
and complex phylogenetic diversity with high accuracy. Our method performs well on actual metagenomic datasets and
on datasets simulated from a complex metagenomic community.