FEATURE ENGINEERING WITH SENTENCE SIMILARITY USING THE LONGEST COMMON SUBSEQUENCE FOR EMAIL CLASSIFICATION

Aruna Kumara B
Mallikarjun M Kodabagi

Abstract

Feature selection plays a prominent role in email classification, since selecting the most relevant features improves the accuracy and performance of the learning classifier. The exponential growth in email usage has made the classification of such emails a significant problem, so a proper classification system is required. Such a system needs an efficient feature selection method that identifies the most relevant features for accurate classification. This paper proposes a novel feature selection method based on sentence similarity, using the longest common subsequence (LCS), for email classification. The proposed method works in two main phases. First, it builds a longest common subsequence feature vector by comparing each email with all other emails in the dataset. Then, a template is constructed for each class using the closest features of the emails in that class. Email classification is then tested on unseen emails using these templates. The performance of the proposed method is compared with traditional feature selection methods such as TF-IDF, Information Gain, Chi-square, and a semantic approach. The experimental results show that the proposed method performs well, achieving 96.61% accuracy.
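
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes word-level tokens, normalizes the LCS length by the longer email, and substitutes a simple mean-similarity score for the per-class templates described above. The names lcs_length, lcs_similarity, lcs_feature_vector, and classify are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    # Standard dynamic programme, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer sequence (an assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def lcs_feature_vector(emails: List[List[str]], index: int) -> List[float]:
    """Similarity of one email to every other email in the dataset."""
    target = emails[index]
    return [
        lcs_similarity(target, other)
        for i, other in enumerate(emails)
        if i != index
    ]


def classify(email: Sequence[str], labelled: Dict[str, List[List[str]]]) -> str:
    """Assign the class whose examples are most similar on average.

    This stands in for the paper's class templates, whose construction
    is not specified in the abstract.
    """
    scores = {
        label: sum(lcs_similarity(email, e) for e in examples) / len(examples)
        for label, examples in labelled.items()
        if examples
    }
    return max(scores, key=scores.get)
```

For example, under these assumptions, comparing "the meeting is at noon" with "the meeting starts at noon today" gives a common subsequence of four words (the, meeting, at, noon), so lcs_similarity returns 4/6.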

Article Details

How to Cite
B, A. K., & Kodabagi, M. M. (2022). FEATURE ENGINEERING WITH SENTENCE SIMILARITY USING THE LONGEST COMMON SUBSEQUENCE FOR EMAIL CLASSIFICATION. Malaysian Journal of Computer Science, 65–78. https://doi.org/10.22452/mjcs.sp2022no2.6
Section
Articles