Chemogenomics is an emerging interdisciplinary field that lies in the interface of biology, chemistry, and informatics. Most of the currently used drugs are small molecules that interact with proteins. Understanding protein-ligand interaction is therefore central to drug discovery and design. In the subfield of chemogenomics known as proteochemometrics, protein-ligand-interaction models are induced from data matrices that consist of both protein and ligand information along with some experimentally measured variable. The two general aims of this quantitative multi-structureproperty- relationship modeling (QMSPR) approach are to exploit sparse/incomplete information sources and to obtain more general models covering larger parts of the protein-ligand space, than traditional approaches that focuses mainly on specific targets or ligands. The data matrices, usually obtained from multiple sparse/incomplete sources, typically contain series of proteins and ligands together with quantitative information about their interactions. A useful model should ideally be easy to interpret and generalize well to new unseen protein-ligand combinations. Resolving this requires sophisticated machine-learning methods for model induction, combined with adequate validation. This review is intended to provide a guide to methods and data sources suitable for this kind of protein-ligand-interaction modeling. An overview of the modeling process is presented including data collection, protein and ligand descriptor computation, data preprocessing, machine-learning-model induction and validation. Concerns and issues specific for each step in this kind of data-driven modeling will be discussed.
Keywords: Chemogenomics, proteochemometrics, QSAR, QMSPR, machine learning, protein-ligand interaction, model induction, protein and ligand descriptor computation, data preprocessing, machine-learning-model induction and validation, data-driven modeling, drug discovery