Background: In bioinformatics, estimating k-mer abundance histograms, or simply counting
the number of unique k-mers and the number of singletons, is desirable in many genome sequence
analysis applications. These applications include predicting genome sizes, data pre-processing
for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection,
sequencing coverage estimation, and measuring sequencing error rates. Different methods for cardinality
estimation in sequencing data have been developed in recent years.
Objective: In this article, we present a comparative assessment of the different k-mer frequency estimation
programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and
unique-kmers.py)) to identify their relative merits and demerits.
Methods: Principally, we analyze the miscounts and error rates of these tools experimentally
over a range of k values. We also present experimental results on runtime, scalability to larger
datasets, memory and CPU utilization, and parallelism of the k-mer frequency estimation methods.
Results: The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance
histograms than the other methods. ntCard is also the fastest, but it requires more memory
than KmerGenie.
Conclusion: The results of this evaluation may serve as a roadmap for potential users and practitioners
of streaming algorithms for estimating k-mer coverage frequencies, assisting them in identifying an
appropriate method. Such an analysis also helps researchers discover remaining open research
questions, effective combinations of existing techniques, and possible avenues for future research.
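
For orientation, the quantities that the compared tools estimate can be computed exactly on a toy input. The sketch below is illustrative only and is not how any of the evaluated programs work: it stores every k-mer in memory, which the streaming estimators avoid. It assumes that F0 denotes the number of distinct k-mers and f1 the number of singletons (k-mers seen exactly once). The function names and the example reads are hypothetical, and, unlike many real tools, it does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Exact multiplicity of every k-mer across all reads (not canonicalized)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def abundance_histogram(counts):
    """Map multiplicity m to the number of distinct k-mers occurring m times."""
    return Counter(counts.values())


# Hypothetical toy reads; real inputs would be FASTA/FASTQ records.
reads = ["ACGTACGT", "TTGCA"]
counts = kmer_counts(reads, k=3)
hist = abundance_histogram(counts)

F0 = len(counts)  # assumed meaning: number of distinct k-mers
f1 = hist[1]      # assumed meaning: number of singleton k-mers

print("F0 =", F0)
print("f1 =", f1)
print("histogram =", dict(sorted(hist.items())))
```

An estimator's miscount for a given k can then be expressed relative to these exact values, for example as the absolute difference between the estimated and exact F0 divided by the exact F0.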