Title:Cancer Diagnosis Through IsomiR Expression with Machine Learning Method
VOLUME: 13 ISSUE: 1
Author(s):Zhijun Liao, Dapeng Li, Xinrui Wang, Lisheng Li and Quan Zou*
Affiliation:Department of Biochemistry and Molecular Biology, Fujian Medical University, Fuzhou, Department of Internal Medicine-Oncology, The Fourth Hospital in Qinhuangdao, Qinhuangdao, Department of Biochemistry and Molecular Biology, Fujian Medical University, Fuzhou, Department of Biochemistry and Molecular Biology, Fujian Medical University, Fuzhou, School of Computer Science and Technology, Tianjin University, Tianjin
Keywords:MicroRNA, isomiR, cancer, machine learning, high-throughput data, RNA-SEQ data.
Abstract:Background: An isomiR is an isoform of a microRNA (miRNA) whose sequence varies from
that of the reference miRNA. With the advancement of deep sequencing, high miRNA
variability has been detected among products of the same miRNA precursor. IsomiRs exist in four main
types, formed through the following processes: 5' or 3' trimming, nucleotide addition, nucleotide removal,
and posttranscriptional RNA editing.
Objective: Cancer diagnosis requires exploring differential expression profiles that can be used to
distinguish cancer from normal cell lines, especially in isomiR-mRNA regulatory networks, because
aberrant isomiR expression profiles may contribute to tumorigenesis.
Method: We extracted five features of the isomiR read counts from RNA-SEQ data in TCGA. Using a
random forest classification algorithm, we applied these features to diagnose six cancers: breast
invasive carcinoma, lung adenocarcinoma, squamous-cell carcinoma of the lung, stomach
adenocarcinoma, thyroid carcinoma, and uterine corpus endometrial carcinoma.
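The classification step described in the Method can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy are available and uses synthetic numbers in place of real data: the five feature columns, the class means, the sample sizes, and the hyperparameters are all hypothetical and are not taken from the study or from TCGA.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a samples-by-features matrix.
# ASSUMPTION: 5 numeric features per sample; the values below are random
# numbers, not isomiR read counts from TCGA.
rng = np.random.default_rng(0)
n_per_class = 100
normal = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 5))
tumor = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 5))

X = np.vstack([normal, tumor])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = normal, 1 = cancer

# Hold out a stratified test split so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# ASSUMPTION: 100 trees and default settings; the paper's settings are not
# stated in the abstract.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
test_accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

In a real analysis, one such binary classifier (cancer versus normal) would presumably be trained per cancer type, with the synthetic matrix replaced by the extracted features.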
Results: Compared with the classifier libD3C, our method can distinguish cancers from
their normal counterparts, as evaluated by the sensitivity (Sn), specificity (Sp), accuracy (ACC), and
Matthews correlation coefficient (MCC) measures.
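The four measures named in the Results have standard definitions in terms of the confusion matrix of a binary classifier. The sketch below computes them from true and predicted labels. The function name `diagnostic_metrics` and the toy labels are illustrative assumptions, not part of the study.

```python
import math

def diagnostic_metrics(y_true, y_pred):
    """Return Sn, Sp, ACC and MCC for binary labels (1 = cancer, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Sensitivity: fraction of cancer samples correctly identified.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of normal samples correctly identified.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all samples correctly classified.
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    # Matthews correlation coefficient; set to 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}

# Toy labels (hypothetical): 4 cancer and 4 normal samples, one error in each class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(y_true, y_pred)
print(metrics)
```

MCC is useful alongside ACC because it stays informative when the cancer and normal classes are imbalanced.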
Conclusion: IsomiRs can be used effectively to diagnose cancer by applying machine
learning methods to high-throughput data.