Title:PuzzleCluster: A Novel Unsupervised Clustering Algorithm for Binning DNA Fragments in Metagenomics
VOLUME: 10 ISSUE: 2
Author(s):Kyler Siegel, Kristen Altenburger, Yu-Sing Hon, Jessey Lin and Chenglong Yu
Affiliation:Mind and Brain Theme, South Australian Health and Medical Research Institute, Adelaide, SA, 5000, Australia.
Keywords:Clustering, expectation maximization, Jensen-Shannon distance, metagenome, quality threshold algorithm, word
agreement.
Abstract:Metagenomic datasets are composed of DNA fragments from large numbers of different and
potentially novel organisms. These datasets can contain up to several million sequences taken from
heterogeneous populations of extremely varied abundance. Unlike traditional genomic studies, metagenomic analysis requires an
additional binning step. This process groups DNA fragments from the same or similar species of origin. However, existing
unsupervised metagenomic binning programs cannot accurately analyze datasets containing a large number of species or with
significantly unbalanced abundance ratios. To improve upon these current limitations, we present PuzzleCluster, a novel
unsupervised binning algorithm. PuzzleCluster incorporates a unique cluster refinement step by automatically grouping reads
which share a nucleotide word (i.e. reverse complement pairs) of a predetermined length. Additionally, the clustering parameters
are estimated by fitting the Jensen-Shannon distance among sequences using the expectation maximization algorithm. Since
clustering parameters are computed based on each dataset, our approach can adapt to the peculiarities of each dataset and is not
confined by universal parameters. Furthermore, PuzzleCluster utilizes no prior assumptions about the genetic makeup or number
of organisms present in the sample, making it well-suited for applications with a large amount of biodiversity and completely
unknown organisms. As a comparison, PuzzleCluster has an accuracy 9%, 19.8%, 15.7%, and 19.5% higher than MetaCluster
3.0 for taxonomic levels phylum, class, order, and family, respectively. PuzzleCluster source code is freely available at
http://math.stanford.edu/~ksiegel/PuzzleCluster.html.