Simultaneous Learning of Sentence Clustering and Class Prediction for Improved Document Classification

Clustering is the process of grouping objects into subsets that are meaningful in the context of a particular problem. It does not rely on predefined classes, and it is called an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed, and the choice among them depends on the application. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels to improve accuracy. A POS tagger is used for tagging and the Porter stemmer for stemming. The WordNet dictionary is used to determine similarity through the Jiang-Conrath and cosine similarity measures. Grouping is performed according to the highest similarity value, with the mean used as a threshold. This paper incorporates many parameters for finding the similarity between words. To identify the disambiguated words, sense identification is performed for the adjectives and the results are compared. The SemCor and machine learning datasets are employed. Compared with previous results for word sense disambiguation (WSD), our work shows a considerable improvement, reaching 91.2%.
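The grouping step described in the abstract (pairwise similarity, with the mean similarity used as a threshold) can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: sentences are split on whitespace into bag-of-words vectors, only cosine similarity is used (no WordNet, Jiang-Conrath measure, POS tagging, or stemming), and pairs above the threshold are merged single-link style. The function names `cosine` and `cluster_sentences` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_sentences(sentences):
    """Group sentences whose similarity exceeds the mean pairwise similarity.

    Assumption: single-link merging via union-find; the paper's multi-level
    hierarchical procedure is not reproduced here.
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    sims = {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(sentences)), 2)
    }
    # Mean pairwise similarity serves as the merge threshold.
    threshold = sum(sims.values()) / len(sims) if sims else 0.0

    parent = list(range(len(sentences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Merge the most similar pairs first; skip pairs at or below the threshold.
    for (i, j), s in sorted(sims.items(), key=lambda kv: -kv[1]):
        if s > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat sat on the rug",
        "stock prices fell sharply today",
        "stock prices rose sharply today",
    ]
    print(cluster_sentences(docs))
```

The result is a list of clusters, each holding the indices of the sentences it contains. Swapping `cosine` for a WordNet-based measure would bring the sketch closer to the setup the abstract describes.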


Nonnegative Matrix Factorization and Document Classification

10.15368/theses.2015.110 ◽

2015 ◽

Author(s):

Stephen Calabrese

Keyword(s):

Matrix Factorization ◽

Nonnegative Matrix Factorization ◽

Nonnegative Matrix ◽

Document Classification


Challenges in Machine Learning for Document Classification in the Real Estate Industry

10.15396/eres2019_370 ◽

2019 ◽

Author(s):

Björn-Martin Kurzrock ◽

Mario Bodenbender

Keyword(s):

Machine Learning ◽

Real Estate ◽

Document Classification ◽

Real Estate Industry ◽

The Real ◽

The Real Estate


A Note on Document Classification with Small Training Data

IEEJ Transactions on Electronics Information and Systems ◽

10.1541/ieejeiss.131.1459 ◽

2011 ◽

Vol 131 (8) ◽

pp. 1459-1466

Author(s):

Yasunari Maeda ◽

Hideki Yoshida ◽

Masakiyo Suzuki ◽

Toshiyasu Matsushima

Keyword(s):

Document Classification ◽

Training Data


Efficient Dynamic Analysis of Low-similarity Proteins for Structural Class Prediction

2020 28th European Signal Processing Conference (EUSIPCO) ◽

10.23919/eusipco47968.2020.9287619 ◽

2021 ◽

Author(s):

M.A. Zervou ◽

E. Doutsi ◽

P. Pavlidis ◽

P. Tsakalides

Keyword(s):

Dynamic Analysis ◽

Class Prediction ◽

Structural Class


Dialogue Management based on Sentence Clustering

10.3115/v1/p15-2131 ◽

2015 ◽

Cited By ~ 2

Author(s):

Wendong Ge ◽

Bo Xu

Keyword(s):

Dialogue Management ◽

Sentence Clustering


Protein Structural Class Prediction Based on Distance-related Statistical Features from Graphical Representation of Predicted Secondary Structure

Letters in Organic Chemistry ◽

10.2174/1570178615666180914110451 ◽

2019 ◽

Vol 16 (4) ◽

pp. 317-324

Author(s):

Liang Kong ◽

Lichao Zhang ◽

Xiaodong Han ◽

Jinfeng Lv

Keyword(s):

Feature Extraction ◽

Secondary Structure ◽

Protein Sequence ◽

Function Analysis ◽

Superior Performance ◽

Support Vector ◽

Chaos Game Representation ◽

Class Prediction ◽

Structural Class ◽

Protein Structural Class

Protein structural class prediction is beneficial to protein structure and function analysis. Exploring good feature representations is a key step for this prediction task. Prior work has demonstrated the effectiveness of feature extraction methods based on secondary structure, especially for low-similarity protein sequences. However, prediction accuracies remain limited. To explore the potential of secondary structure information, a novel feature extraction method based on a generalized chaos game representation of predicted secondary structure is proposed. Each protein sequence is converted into a 20-dimensional distance-related statistical feature vector that characterizes the distribution of secondary structure elements and segments. The feature vectors are then fed into a support vector machine classifier to predict the protein structural class. Our experiments on three widely used low-similarity benchmark datasets (25PDB, 1189 and 640) show that the proposed method achieves superior performance to the state-of-the-art methods. It is anticipated that our method could be extended to other graphical representations of protein sequences and be helpful in future protein research.
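As a rough illustration of the idea, the sketch below maps a predicted secondary structure string onto a basic chaos game representation and derives a few distance statistics from the resulting points. Everything specific here is an assumption rather than the paper's method: the three-state alphabet (H, E, C), the triangle vertex layout, the contraction ratio of 0.5, and the six features (mean and standard deviation of the distance to each vertex). The paper's generalized representation and its 20-dimensional feature set are not specified in the abstract and are not reproduced.

```python
import math

# Hypothetical layout: one corner of an equilateral triangle per state
# (H = helix, E = strand, C = coil).
VERTICES = {
    "H": (0.0, 0.0),
    "E": (1.0, 0.0),
    "C": (0.5, math.sqrt(3) / 2),
}


def cgr_points(ss, ratio=0.5):
    """Return the chaos game trajectory for a secondary structure string.

    Starting from the triangle's centroid, each state moves the current
    point a fraction `ratio` of the way toward that state's vertex.
    """
    x, y = 0.5, math.sqrt(3) / 6  # centroid of the triangle
    points = []
    for state in ss:
        vx, vy = VERTICES[state]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        points.append((x, y))
    return points


def distance_features(ss):
    """Mean and standard deviation of point-to-vertex distances (6 values).

    Illustrative features only; not the 20 features used in the paper.
    """
    if not ss:
        raise ValueError("secondary structure string must be non-empty")
    points = cgr_points(ss)
    feats = []
    for vx, vy in VERTICES.values():
        dists = [math.hypot(px - vx, py - vy) for px, py in points]
        mean = sum(dists) / len(dists)
        var = sum((d - mean) ** 2 for d in dists) / len(dists)
        feats.extend([mean, math.sqrt(var)])
    return feats


if __name__ == "__main__":
    print(distance_features("HHHHEEEECCCCHHHH"))
```

A vector like this could then be passed to any standard classifier; the abstract names a support vector machine for that step.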
