Document Classification Using Expectation Maximization with Semi Supervised Learning

Bhawna Nigam; Poorvi Ahirwal; Sonal Salve; Swati Vamney

doi:10.53075/Ijmsirq/145756854374585

Authors

Bhawna Nigam Department of Information Technology, IET, DAVV
Poorvi Ahirwal Department of Information Technology, IET, DAVV
Sonal Salve
Swati Vamney

DOI:

https://doi.org/10.53075/Ijmsirq/145756854374585

Keywords:

Data mining, semi-supervised learning, supervised learning, expectation maximization, document classification

Abstract

As the number of online documents grows, so does the demand for document classification to aid their analysis and management. Text is cheap, but information, in the form of knowing which classes a document belongs to, is expensive. The main purpose of this paper is to explain how the expectation-maximization technique from data mining can be used to classify documents, and to show how a semi-supervised approach improves accuracy. The expectation-maximization algorithm is applied with both supervised and semi-supervised approaches, and the semi-supervised approach is found to be more accurate and effective. Its main advantage is the dynamic generation of new classes. The algorithm first trains a classifier using the labeled documents and then probabilistically classifies the unlabeled documents. The car dataset used for evaluation was taken from the UCI repository, with some modifications made by the authors.
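The train-then-probabilistically-classify loop described in the abstract can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes base classifier with Laplace smoothing, which the abstract does not specify, and it does not attempt the dynamic generation of new classes. All function names and the toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def fit(docs, weights, classes, vocab):
    """M-step: weighted multinomial naive Bayes with Laplace smoothing.

    weights[i][c] is the (possibly fractional) membership of docs[i] in class c.
    """
    prior, word_prob = {}, {}
    total_weight = sum(sum(w.values()) for w in weights)
    for c in classes:
        class_weight = sum(w[c] for w in weights)
        prior[c] = (class_weight + 1.0) / (total_weight + len(classes))
        counts = defaultdict(float)
        for doc, w in zip(docs, weights):
            for token in doc:
                counts[token] += w[c]
        denom = sum(counts.values()) + len(vocab)
        word_prob[c] = {t: (counts[t] + 1.0) / denom for t in vocab}
    return prior, word_prob


def posterior(doc, prior, word_prob, classes):
    """E-step for one document: P(class | doc), computed in log space."""
    log_scores = {}
    for c in classes:
        score = math.log(prior[c])
        for token in doc:
            if token in word_prob[c]:  # tokens outside the vocabulary are ignored
                score += math.log(word_prob[c][token])
        log_scores[c] = score
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def semi_supervised_em(labeled, unlabeled, iterations=10):
    """Train on labeled docs, then alternate E- and M-steps over all docs."""
    classes = sorted({label for _, label in labeled})
    vocab = sorted({t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d})
    labeled_docs = [d for d, _ in labeled]
    # Labeled documents keep fixed one-hot memberships throughout.
    hard = [{c: 1.0 if c == label else 0.0 for c in classes} for _, label in labeled]
    prior, word_prob = fit(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        # E-step: soft class memberships for the unlabeled documents.
        soft = [posterior(d, prior, word_prob, classes) for d in unlabeled]
        # M-step: re-estimate parameters from labeled and unlabeled documents.
        prior, word_prob = fit(labeled_docs + unlabeled, hard + soft, classes, vocab)
    return prior, word_prob, classes


# Toy data (hypothetical): two labeled and four unlabeled tokenized documents.
labeled = [(["engine", "fuel", "wheel"], "car"), (["goal", "team", "match"], "sport")]
unlabeled = [["engine", "brake", "wheel"], ["team", "coach", "match"],
             ["brake", "fuel"], ["coach", "goal"]]

prior, word_prob, classes = semi_supervised_em(labeled, unlabeled)
result = posterior(["brake", "engine"], prior, word_prob, classes)
```

In this toy run the word "brake" never appears in a labeled document, so a purely supervised model has no evidence about it. The EM iterations let the unlabeled documents associate it with the "car" class, which is the kind of gain from unlabeled data that the abstract reports.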

Author Biography

Poorvi Ahirwal, Department of Information Technology, IET, DAVV
