Background: Identifying non-coding RNA (ncRNA) at the organelle genome level is a
challenging task. In our previous work, we built an ncRNA dataset with less than 80% sequence
identity and proposed a method that combines the increment of diversity with a support vector
machine.
Objective: Based on the ncRNA_361 dataset, we propose a novel decision-making method, an improved
k-nearest neighbor (iKNN) classifier.
Methods: For the iKNN algorithm, the physicochemical features of nucleotides, the degeneracy of
genetic codons, and the topological secondary structure were selected to represent the ncRNA
characteristics. The incremental feature selection method was then used to optimize
the feature set.
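As a rough illustration of incremental feature selection, the sketch below adds features one at a time from a pre-ranked list and keeps the prefix that scores best. The abstract does not state how features were ranked or scored, so `ranked_features`, `evaluate`, and the toy scores are hypothetical placeholders, not the authors' procedure.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set in ranked order and keep the best-scoring prefix.

    ranked_features: feature names, most relevant first (ranking is assumed given).
    evaluate: callable that maps a list of features to a score (higher is better).
    """
    best_subset, best_score = [], float("-inf")
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(subset)
        if score > best_score:  # strict '>' keeps the smaller subset on ties
            best_subset, best_score = list(subset), score
    return best_subset, best_score


# Toy scorer (hypothetical): accuracy peaks once two features are included.
toy_scores = {1: 0.80, 2: 0.90, 3: 0.85}
selected, score = incremental_feature_selection(
    ["f1", "f2", "f3"], lambda subset: toy_scores[len(subset)]
)
```

In practice `evaluate` would wrap a cross-validated classifier run on the candidate subset; here it is a lookup table so the control flow is easy to follow.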
Results: The iKNN results indicate that the mean-value decision rule is distinctly
superior to both the traditional majority-vote decision rule and the Increment of Diversity
Combining Support Vector Machine (ID-SVM). The iKNN algorithm achieved an overall accuracy
of 97.368% in the jackknife test with k = 3.
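To make the contrast between the two decision rules concrete, here is a minimal sketch that compares a majority-vote k-nearest neighbor classifier with one possible mean-value rule under a jackknife (leave-one-out) test. The abstract does not define the mean-value rule, so the version below (assign the class whose k nearest members have the smallest mean Euclidean distance to the query) is an assumption, as are the function names and the toy data.

```python
import math
from collections import Counter, defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def majority_vote_knn(train, query, k):
    """Classic kNN: take the k nearest samples overall and let them vote."""
    nearest = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def mean_value_knn(train, query, k):
    """Assumed mean-value rule: for each class, average the distances to its
    k nearest members (or fewer, if the class is smaller) and pick the class
    with the smallest mean."""
    distances = defaultdict(list)
    for features, label in train:
        distances[label].append(euclidean(features, query))
    means = {}
    for label, dists in distances.items():
        closest = sorted(dists)[:k]
        means[label] = sum(closest) / len(closest)
    return min(means, key=means.get)


def jackknife_accuracy(data, classify, k):
    """Leave each sample out in turn, classify it from the rest, count hits."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if classify(rest, features, k) == label:
            hits += 1
    return hits / len(data)


# Toy two-class data (hypothetical), two features per sample.
toy = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
]
acc_mean = jackknife_accuracy(toy, mean_value_knn, k=3)
acc_vote = jackknife_accuracy(toy, majority_vote_knn, k=3)
```

On well-separated toy data both rules agree; the mean-value rule differs mainly when classes are imbalanced, because every class contributes its own k nearest members instead of competing for the same k slots.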
Conclusion: The triplets of the structure-sequence mode under reading
frames not only contain the entire sequence information but also reflect whether each base is
paired, and the secondary structural topological parameters further describe the ncRNA secondary
structure on the spatial level. The ncRNA dataset and the iKNN classifier are freely available