Motivation: Knowledge of a protein's correct subcellular localization is necessary for understanding
its function and for revealing the mechanisms of the many human diseases caused by protein
subcellular mislocalization, and this knowledge is required before gene therapy can be used to treat such a disease. In addition,
gene therapy is well known to be an effective way to overcome disease by targeting a gene
therapy product to a specific subcellular compartment. Deep neural networks for predicting protein function
have become increasingly popular, owing to large increases in the available genomics data and to their
strong non-linear classification ability. However, they still have drawbacks,
such as a large number of hyper-parameters and the need for a sufficient amount of labeled data.
Results: We present a deep forest-based protein localization algorithm that relies on sequence information.
The prediction model uses a multi-layered network of random forests to identify the
subcellular region of a protein. The model was trained and tested on a protein dataset from the latest UniProt release,
and we demonstrate that our deep forest predicts the subcellular location of proteins, given only
the protein sequence, with high accuracy, outperforming current state-of-the-art algorithms.
Moreover, unlike deep neural networks, it has a significantly smaller number of parameters and
is much easier to train.
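
As an illustration of what a sequence-only, multi-layered forest pipeline might look like, the following minimal Python sketch encodes a protein sequence as a fixed-length feature vector and builds the input for a subsequent forest layer. Both steps are assumptions made for illustration: the abstract does not state which sequence encoding is used (amino acid composition is chosen here only because it is a common one), and the layer-to-layer concatenation follows the general deep forest (cascade forest) design rather than any detail confirmed above. The names `amino_acid_composition` and `next_layer_input` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard residue (20 values).

    Assumed encoding for illustration; non-standard symbols are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def next_layer_input(original_features, class_vectors):
    """Concatenate the original features with each forest's class-probability
    vector, as in the general cascade forest design (assumed here)."""
    augmented = list(original_features)
    for vec in class_vectors:
        augmented.extend(vec)
    return augmented


# Toy sequence and made-up class-probability vectors from two forests.
features = amino_acid_composition("MKTAYIAKQR")
layer_input = next_layer_input(features, [[0.7, 0.3], [0.6, 0.4]])
```

In such a design, each layer would train its forests on `layer_input` and pass the resulting class-probability vectors to the next layer, so depth grows without introducing the many hyper-parameters of a neural network.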