Trends in Genome Compression
Sebastian Wandelt, Marc Bux and Ulf Leser
Affiliation: Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität zu Berlin, Germany.
Technological advancements in high throughput sequencing have led to a tremendous increase in the amount of
genomic data produced. With the cost down to 2,000 USD for a single human genome, sequencing dozens of
individuals is an undertaking that is feasible even for smaller projects or organizations. However, generating
the sequence is only one issue; another one is storing, managing, and analyzing it. These tasks become more and more
challenging due to the sheer size of the data sets and are increasingly considered to be the major bottlenecks in larger
genome projects. One possible countermeasure is to compress the data; compression reduces costs by requiring
less hard disk storage and less bandwidth when data is shipped to large compute clusters for parallel
analysis. Accordingly, sequence compression has recently attracted much interest in the scientific community. In this
paper, we explain the different basic techniques for sequence compression, point to distinctions between different
compression tasks (e.g., genome compression versus read compression), and present a comparison of current approaches
and tools. To further stimulate progress in genome compression research, we also identify key challenges for future
Keywords: Genome compression, read compression, survey.