Background: Anatomical Therapeutic Chemical (ATC) classification of unknown compounds has
become highly significant for both drug development and basic research. The ATC system is a multi-label classification
system proposed by the World Health Organization (WHO), which categorizes drugs into classes according
to their therapeutic effects and characteristics. This system comprises five levels and includes several classes
in each level; the first level includes 14 main overlapping classes. Because the ATC classification system simultaneously
considers anatomical distribution, therapeutic effects, and chemical characteristics, predicting the ATC classes of an
unknown compound is an essential problem: such a prediction could be used to deduce not only a
compound’s possible active ingredients but also its therapeutic, pharmacological, and chemical properties. Nevertheless,
the problem of automatic prediction is very challenging due to the high variability of the samples and the
overlap among classes, which results in multiple predictions and makes machine learning extremely difficult.
Methods: In this paper, we propose a multi-label classifier system based on deep learned features to infer the
ATC classification. The system is based on a 2D representation of the samples. First, a 1D feature vector is obtained
by extracting information about a compound’s chemical-chemical interactions and its structural and fingerprint
similarities to other compounds belonging to the different ATC classes. The 1D feature vector is then
reshaped to obtain a 2D matrix representation of the compound. Finally, a convolutional neural network (CNN) is
trained and used as a feature extractor. Two general-purpose classifiers designed for multi-label classification are
trained using the deep learned features, and the resulting scores are fused by the average rule.
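
The reshaping and score-fusion steps described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the row-major square layout, the zero padding, the 0.5 decision threshold, and the function names (reshape_to_square, fuse_by_average, predict_labels) are all assumptions made for illustration, and the CNN feature extractor and the two multi-label classifiers are omitted.

```python
import math


def reshape_to_square(features, pad_value=0.0):
    """Reshape a 1D feature vector into a square 2D matrix (row-major).

    Assumption: the tail is padded with pad_value when the length is not
    a perfect square; the abstract does not specify the layout actually used.
    """
    side = math.isqrt(len(features))
    if side * side < len(features):
        side += 1
    padded = list(features) + [pad_value] * (side * side - len(features))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


def fuse_by_average(score_lists):
    """Average rule: mean of the per-class scores across classifiers."""
    n = len(score_lists)
    return [sum(column) / n for column in zip(*score_lists)]


def predict_labels(scores, threshold=0.5):
    """Return the indices of classes whose fused score reaches the threshold.

    Assumption: a fixed threshold turns scores into a multi-label prediction.
    """
    return [i for i, s in enumerate(scores) if s >= threshold]


if __name__ == "__main__":
    matrix = reshape_to_square([1, 2, 3, 4, 5])
    fused = fuse_by_average([[0.25, 1.0, 0.0], [0.75, 0.5, 0.25]])
    print(matrix)
    print(fused, predict_labels(fused))
```

In this sketch each classifier returns one score per ATC class, so the fused vector has one entry per class and a compound can receive several labels at once.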
Results: Experimental evaluation based on rigorous cross-validation demonstrates the superior prediction quality
of this method compared to other state-of-the-art approaches developed for this problem.
Conclusion: Extensive experiments demonstrate that the new CNN-based predictor outperforms other existing
predictors in the literature on almost all of the five metrics used to assess the performance of multi-label systems,
particularly the “absolute true” rate and the “absolute false” rate, the two most significant indexes.
MATLAB code will be available at https://github.com/LorisNanni.
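
The two rates highlighted in the conclusion are not defined in this abstract. The Python sketch below assumes the definitions commonly used in the multi-label literature: the "absolute true" rate is the fraction of samples whose predicted label set matches the true set exactly, and the "absolute false" rate is the average, over samples, of the number of mismatched labels divided by the total number of classes. The function names and the toy data are hypothetical.

```python
def absolute_true(true_sets, pred_sets):
    """Fraction of samples whose predicted label set equals the true set."""
    matches = sum(1 for t, p in zip(true_sets, pred_sets) if set(t) == set(p))
    return matches / len(true_sets)


def absolute_false(true_sets, pred_sets, num_classes):
    """Mean per-sample share of classes that are labelled incorrectly.

    Assumed definition: (|union| - |intersection|) / num_classes,
    averaged over all samples.
    """
    total = 0.0
    for t, p in zip(true_sets, pred_sets):
        t, p = set(t), set(p)
        total += (len(t | p) - len(t & p)) / num_classes
    return total / len(true_sets)


if __name__ == "__main__":
    # Toy example with the 14 first-level ATC classes indexed 0..13.
    true_sets = [{0, 1}, {2}]
    pred_sets = [{0, 1}, {2, 3}]
    print(absolute_true(true_sets, pred_sets))
    print(absolute_false(true_sets, pred_sets, num_classes=14))
```

Under these assumed definitions, a higher "absolute true" rate and a lower "absolute false" rate both indicate better predictions.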