Background: The Protein Data Bank is a worldwide repository that collects and provides macromolecular
data on protein structures and other molecules for the life sciences community. Manipulating the
vast number of 3D protein structures and exploring their properties requires parsing thousands of flat
files that describe these macromolecular structures every time calculations are performed.
Objective: Expecting more protein structures to appear in open-access repositories such as
the Protein Data Bank, and aiming to meet the expectations of the era of fast data analytics, we propose an
in-memory management system for protein structures that predominantly uses the main memory of the host
server to store, manage and manipulate data. This eliminates the overhead of loading
data from hard drives and storing them in a buffer cache.
Method: In this paper, we present the in-memory protein structure management system (IMPSMS), which
supports various operations, including basic functions such as selecting, inserting, updating and
searching protein structures, and more sophisticated functions such as batch calculation of the
root mean square deviation between proteins stored in the database, batch calculation of torsion angles,
structure comparison, structural alignment, and superposition of a given molecule onto molecules stored
in the in-memory database.
Results: In the experimental part, we show that, with dedicated in-memory data structures, particular
operations on proteins can be performed as much as a hundred times faster than analogous operations preceded
by the traditional loading and parsing of macromolecular data from standard PDB flat files.
Conclusion: Our work demonstrates that designing dedicated data structures and management systems for
frequent protein data manipulations brings significant time savings and increases the capability to run
fast data analytics in bioinformatics.
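
Illustration: The batch root mean square deviation calculation listed under Method can be sketched as
follows. This is a minimal sketch under stated assumptions, not the IMPSMS implementation: the names
`store`, `insert`, `rmsd` and `batch_rmsd` are hypothetical, structures are reduced to plain lists of
(x, y, z) coordinates kept in a Python dictionary, and the deviation is computed on the coordinates
as given, without any prior superposition or alignment.

```python
import math

# Hypothetical in-memory store: structure id -> list of (x, y, z) coordinates.
# Assumption: one coordinate triple per residue, already parsed once.
store = {}


def insert(structure_id, coords):
    """Keep a structure's coordinates in main memory as float tuples."""
    store[structure_id] = [tuple(float(v) for v in c) for c in coords]


def rmsd(a, b):
    """Root mean square deviation between two equally long coordinate lists.

    No superposition is applied; the coordinates are compared as given.
    """
    if not a or len(a) != len(b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(a, b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(a))


def batch_rmsd(query, structures):
    """Compare a query against every stored structure of matching length."""
    return {
        structure_id: rmsd(query, coords)
        for structure_id, coords in structures.items()
        if len(coords) == len(query)
    }


# Toy data with made-up identifiers, not real PDB entries.
insert("A", [(0, 0, 0), (1, 0, 0)])
insert("B", [(0, 0, 1), (1, 0, 1)])
insert("C", [(0, 0, 0)])  # different length, skipped by batch_rmsd

result = batch_rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], store)
print(result)
```

Because the coordinates already reside in memory, the batch loop touches no files; in this sketch the
flat-file parsing cost would be paid once, at insertion time.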