Decision trees are renowned in the computational chemistry and machine learning
communities for their interpretability. However, their applicability is somewhat limited because
they normally operate on categorical data. Improvements to established decision tree algorithms are usually
made by adding and tuning parameters, or by post-processing the class
assignment. In this work we attempted to address both of these issues. Firstly, conditional mutual
information was used as the criterion for selecting the attribute on which to split instances. The
performance of the algorithm was compared with that of C4.5 (WEKA’s J48) run with default parameters and no restrictions.
Two datasets were used for this comparison: DrugBank compounds for HRH1 binding prediction, and predicted bioactivities of
Traditional Chinese Medicine formulations for therapeutic class annotation. Secondly, an automated binning method for
continuous data, Scott’s normal reference rule, was evaluated as a way to let any decision tree handle continuous data
easily. This was applied to all approved drugs in DrugBank to predict the RDKit SLogP property, using the
remaining RDKit physicochemical attributes as input.
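
As an illustration of the binning step only, the following is a minimal sketch of Scott's normal reference rule, which sets the bin width to h = 3.49 σ n^(-1/3) for n values with standard deviation σ. It is not the implementation used in this work: the function names `scott_bin_edges` and `discretize` are hypothetical, and the use of the sample standard deviation and of equal-width bins spanning the observed minimum and maximum are assumptions made for the example.

```python
import math
import statistics
from bisect import bisect_right


def scott_bin_edges(values):
    """Return equal-width bin edges chosen by Scott's normal reference rule.

    Assumes a non-empty sequence of numbers. The sample standard deviation
    is used here, which is an assumption of this sketch.
    """
    xs = [float(v) for v in values]
    lo, hi = min(xs), max(xs)
    n = len(xs)
    sigma = statistics.stdev(xs) if n > 1 else 0.0
    if sigma == 0.0:
        # Constant (or single-value) attribute: one bin covers everything.
        return [lo, hi]
    width = 3.49 * sigma * n ** (-1.0 / 3.0)      # Scott's bin width
    k = max(1, math.ceil((hi - lo) / width))      # number of bins
    step = (hi - lo) / k
    return [lo + i * step for i in range(k)] + [hi]


def discretize(values, edges):
    """Map each value to an integer bin label in 0 .. len(edges) - 2."""
    interior = edges[1:-1]  # outer edges are not needed for labelling
    return [bisect_right(interior, float(v)) for v in values]


if __name__ == "__main__":
    slogp_like = [0, 1, 2, 3, 4, 5, 6, 7]  # made-up numbers, not real data
    edges = scott_bin_edges(slogp_like)
    print(edges)
    print(discretize(slogp_like, edges))
```

The integer labels returned by `discretize` can then be treated as a categorical attribute by a decision tree learner that does not handle continuous data natively.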
Keywords: Computational chemistry, DrugBank, Chinese Medicine, Cheminformatics, Decision Trees, Conditional Mutual