Sequence classification is one of the most fundamental machine learning tasks in computational biology.
With large corpora of annotated sequences now widely available, supervised learning techniques
can greatly speed up the identification of new sequences that share a given function or property. Many methods have
been proposed over the years, and we aim to introduce some of the more prominent ones by focussing on
protease cleavage prediction, a typical representative of this class of problems. The proteolytic modes of action
of cysteine proteases span a broad range of complexity and feature specificity, illustrating the strengths and
limitations of the different machine learning techniques applied to them.
This review briefly introduces the particulars of predicting cleavage by calpains and caspases. We then offer some general
practical considerations on preparing sequences for use with machine learning algorithms, before covering specific
methods. The methods presented range from basic position-based statistical models to more technically advanced methods
such as Markov models and kernel-based algorithms, as well as methods with more restricted goals such as decision trees.
For each family of algorithms, we introduce example implementations and compare their performance, along
with their particular strengths and weaknesses.
With this review, we aim to help readers decide whether to choose an existing method or develop a new
one, based on the complexity and specific needs of a given sequence classification problem.
Keywords: Calpain, caspase, cleavage prediction, cysteine proteases, machine learning, protease, proteolysis, sequence
classification, vector encoding, Markov models