Background: Acquired Immunodeficiency Syndrome (AIDS) is a large-scale pandemic
caused by infection with the Human Immunodeficiency Virus (HIV). This virus infects over 40 million
people worldwide. In the effort to control the pandemic, many drug resistance tests have been performed,
generating a large amount of genomic data. These data are stored in biological databases that
grow daily. However, most of the genomic data in primary databases, e.g. GenBank, lack important
information on virus subtype distribution.
Objective: To present a novel software tool that obtains, indexes and analyzes sequence data from
highly mutable viruses, such as all HIV-1 sequence data from GenBank.
Method: The software aligns all sequences to a complete reference genome (HXB2) for mapping
purposes. In addition, all sequences with subtype references are locally aligned to classify all data into
Results: Our results detail the prevalence of every subtype from a global HIV-1 sequence perspective,
highlighting increases in the number of sequences related to recombinant subtypes. We also identified
the country-level distribution of sequences from their associated geographical data. All data
were analyzed within a reasonable time, particularly in comparison with classic methods.
Conclusion: Our software represents an important contribution to HIV molecular epidemiology and
offers a technique to rapidly classify new sequences, in addition to providing insight into sequence
coverage density, subtype and country distribution. These data, together with cross-referencing, will aid
in the generation of a novel, comprehensive and updated HIV-1 database.
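
The classification step described under Method (local alignment of a sequence against subtype references) can be illustrated with a minimal sketch. This is not the software's actual implementation, which the abstract does not specify: the function names, the scoring parameters (match, mismatch, gap) and the toy reference fragments below are all assumptions made for illustration, and the aligner is a plain Smith-Waterman score with a linear gap penalty.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def classify_subtype(query, references):
    """Return (subtype, score) for the best-scoring subtype reference."""
    scores = {name: local_alignment_score(query, seq)
              for name, seq in references.items()}
    subtype = max(scores, key=scores.get)
    return subtype, scores[subtype]


# Toy, made-up fragments standing in for subtype reference sequences
# (hypothetical data, not real HIV-1 sequences).
references = {
    "A": "ACGTACGTTTGA",
    "B": "GGCCTTAAGGCC",
    "C": "TTTTCACACACA",
}

if __name__ == "__main__":
    print(classify_subtype("CCTTAAGG", references))
```

In this sketch a query is assigned to the subtype whose reference yields the highest local alignment score. A real pipeline would need a faster aligner and a rule for ambiguous or recombinant sequences, neither of which is attempted here.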