Title:VaxiJen Dataset of Bacterial Immunogens: An Update
VOLUME: 15 ISSUE: 5
Author(s):Nevena Zaharieva, Ivan Dimitrov, Darren R. Flower and Irini Doytchinova*
Affiliation:Faculty of Pharmacy, Medical University of Sofia, Sofia, Faculty of Pharmacy, Medical University of Sofia, Sofia, School of Life and Health Sciences, Aston University, Birmingham, Faculty of Pharmacy, Medical University of Sofia, Sofia
Keywords:Immunogenicity prediction, dataset, bacterial immunogen, VaxiJen, FASTA, epitopes.
Abstract:Background: Identifying immunogenic proteins is the first stage in vaccine design and development.
VaxiJen is the most widely used and highly cited server for immunogenicity prediction. As the developers
of VaxiJen, we are obliged to update and improve it regularly. Here, we present an updated dataset
of bacterial immunogens containing 317 experimentally proven immunogenic proteins of bacterial origin,
of which 60% have been reported during the last 10 years.
Methods: PubMed was searched, up to March 2017, for papers reporting novel immunogenic proteins tested on
humans. The corresponding protein sequences were collected from NCBI and UniProtKB. The
set was then curated manually to resolve multiple protein fragments, isoforms, and duplicates.
Results: The final curated dataset consists of 306 immunogenic proteins tested on humans derived from 47
bacterial microorganisms. Certain proteins have several isoforms; all were included, bringing the total number of protein sequences
in the set to 317. The updated set contains 206 new immunogens compared to the previous VaxiJen
bacterial dataset. The average number of immunogens per species is 6.7. The set also contains 12 fusion proteins
and 41 peptide fragments and epitopes. The dataset includes the names of bacterial microorganisms, protein
names, and protein sequences in FASTA format.
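
As an illustration of how a FASTA-formatted set like this might be processed, the minimal Python sketch below reads records and counts immunogens per species. The header layout ("species|protein name"), the helper names read_fasta and immunogens_per_species, and the toy sequences are assumptions made for this example, not the documented format of the VaxiJen files. For reference, dividing the 317 sequences by the 47 species gives about 6.7, which is consistent with the reported average.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def immunogens_per_species(records, sep="|"):
    """Count records per species, ASSUMING headers look like 'species|protein'."""
    return Counter(header.split(sep, 1)[0].strip() for header, _ in records)


# Toy input: species, protein names, and sequences are invented for illustration.
example = StringIO(
    ">Species A|protein 1\nMKTAYIAK\nQRQISFVK\n"
    ">Species A|protein 2\nMSDNELLK\n"
    ">Species B|protein 3\nMAHHHGSL\n"
)
records = list(read_fasta(example))
counts = immunogens_per_species(records)
mean_per_species = len(records) / len(counts)
print(counts, mean_per_species)
```

With the real files, the StringIO object would be replaced by an open file handle, and the header parsing adjusted to whatever layout the downloaded dataset actually uses.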
Conclusion: Currently, the updated VaxiJen bacterial dataset is the best known manually-curated compilation
of bacterial immunogens. It is freely available at http://www.ddg-pharmfac.net/vaxijen/dataset.
It can easily be downloaded, searched, and processed. When combined with an appropriate
negative dataset, this update could also serve as a training set, allowing enhanced prediction of the potential
immunogenicity of unknown protein sequences.
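
The pairing of immunogens with a negative set can be sketched as a simple labelling step. The Python example below is a hedged illustration only: the function name build_training_set, the 1/0 label convention, and the rule of discarding sequences that occur in both sets are assumptions for this sketch, not the procedure used to train VaxiJen. The sequences are invented.

```python
def build_training_set(positives, negatives):
    """Return (sequence, label) pairs with 1 = immunogen, 0 = non-immunogen.

    Sequences are upper-cased and de-duplicated. Any sequence present in both
    inputs is dropped, because its label would be ambiguous (an assumption of
    this sketch, not a documented VaxiJen rule).
    """
    pos = {s.strip().upper() for s in positives if s.strip()}
    neg = {s.strip().upper() for s in negatives if s.strip()}
    ambiguous = pos & neg
    labelled = [(s, 1) for s in sorted(pos - ambiguous)]
    labelled += [(s, 0) for s in sorted(neg - ambiguous)]
    return labelled


# Toy sequences, invented for illustration.
training = build_training_set(
    ["MKTAYIAK", "mktayiak", "MSDNELLK"],
    ["MAHHHGSL", "MSDNELLK"],
)
print(training)
```

The labelled pairs would still need to be converted into numerical descriptors before any classifier could be trained on them; that step is outside the scope of this sketch.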