Aims: This study analyzes feature selection techniques for sentiment classification of text data drawn from heterogeneous sources.
Objectives: The objective of this work is to analyze feature selection techniques for text gathered from different sources in order to increase the accuracy of sentiment classification on microblogs.
Methods: Three feature selection techniques, Bag-of-Words (BOW), TF-IDF, and word2vec, were applied to heterogeneous datasets to find the most suitable one.
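To illustrate how two of these representations differ, the following is a minimal sketch, not the study's implementation. It assumes whitespace tokenization and the unsmoothed weighting tf × log(N/df); the study's actual preprocessing and weighting variant are not stated here. The function names and the three example sentences are hypothetical, and the word2vec and SVM steps are omitted.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count vectors): one raw term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, count in Counter(tokenize(doc)).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, weighted vectors) using tf * log(N / df), unsmoothed."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for vec in counts:
        total = sum(vec)  # document length in tokens
        weighted.append(
            [(c / total) * idf[j] if total else 0.0 for j, c in enumerate(vec)]
        )
    return vocab, weighted


# Hypothetical microblog-style inputs, for illustration only.
docs = ["good phone good battery", "bad battery", "good service"]
vocab, bow = bag_of_words(docs)
_, tfidf = tf_idf(docs)
```

In the first example document, "good" has the highest raw count under BOW, yet under TF-IDF the rarer term "phone" receives a larger weight, because "good" also appears in another document. This down-weighting of common terms is the main practical difference between the two representations.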
Results: With an SVM classifier, TF-IDF outperformed the other two feature selection techniques for sentiment classification.
Conclusion: Feature selection is an integral part of data preprocessing and is important for machine learning algorithms to achieve good classification accuracy, so it is essential to find the most suitable approach for data from heterogeneous sources. Such sources are rich in information and play an important role in developing models for adaptable systems. With this in mind, we compared the three techniques on heterogeneous-source data and found TF-IDF to be the most suitable in every case considered: balanced and imbalanced data, and single-source and multiple-source data. In all of these cases, TF-IDF gave the most promising results for classifying user sentiment.