Knowledge-driven graph similarity for text classification

International Journal of Machine Learning and Cybernetics ◽

10.1007/s13042-020-01221-4 ◽

2020 ◽

Author(s):

Niloofer Shanavas ◽

Hui Wang ◽

Zhiwei Lin ◽

Glenn Hawe

Keyword(s):

Text Classification ◽

Structural Information ◽

Similarity Measures ◽

Semantic Knowledge ◽

Exact Matching ◽

Text Documents ◽

Graph Kernel ◽

Word Similarity ◽

Graph Similarity ◽

Automatic Text Classification

AbstractAutomatic text classification using machine learning is significantly affected by the text representation model. The structural information in text is necessary for natural language understanding, which is usually ignored in vector-based representations. In this paper, we present a graph kernel-based text classification framework which utilises the structural information in text effectively through the weighting and enrichment of a graph-based representation. We introduce weighted co-occurrence graphs to represent text documents, which weight the terms and their dependencies based on their relevance to text classification. We propose a novel method to automatically enrich the weighted graphs using semantic knowledge in the form of a word similarity matrix. The similarity between enriched graphs, knowledge-driven graph similarity, is calculated using a graph kernel. The semantic knowledge in the enriched graphs ensures that the graph kernel goes beyond exact matching of terms and patterns to compute the semantic similarity of documents. In the experiments on sentiment classification and topic classification tasks, our knowledge-driven similarity measure significantly outperforms the baseline text similarity measures on five benchmark text classification datasets.

Download Full-text

Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture

Machine Learning and Knowledge Extraction ◽

10.3390/make1020034 ◽

2019 ◽

Vol 1 (2) ◽

pp. 575-589 ◽

Cited By ~ 1

Author(s):

Blaž Škrlj ◽

Jan Kralj ◽

Nada Lavrač ◽

Senja Pollak

Keyword(s):

Text Mining ◽

Language Processing ◽

Text Classification ◽

Deep Neural Networks ◽

Semantic Knowledge ◽

Text Documents ◽

Neural Architecture ◽

Classification Tasks ◽

And Gender ◽

Semantic Resources

Deep neural networks are becoming ubiquitous in text mining and natural language processing, but semantic resources, such as taxonomies and ontologies, are yet to be fully exploited in a deep learning setting. This paper presents an efficient semantic text mining approach, which converts semantic information related to a given set of documents into a set of novel features that are used for learning. The proposed Semantics-aware Recurrent deep Neural Architecture (SRNA) enables the system to learn simultaneously from the semantic vectors and from the raw text documents. We test the effectiveness of the approach on three text classification tasks: news topic categorization, sentiment analysis and gender profiling. The experiments show that the proposed approach outperforms the approach without semantic knowledge, with highest accuracy gain (up to 10%) achieved on short document fragments.

Download Full-text

K-Graph: Knowledgeable Graph for Text Documents

Journal of Konbin ◽

10.2478/jok-2021-0006 ◽

2021 ◽

Vol 51 (1) ◽

pp. 73-89

Author(s):

Varsha Mittal ◽

Durgaprasad Gangodkar ◽

Bhaskar Pant

Keyword(s):

Structural Information ◽

Low Complexity ◽

Semantic Knowledge ◽

Access Time ◽

Graph Databases ◽

Text Documents ◽

Memory Overhead ◽

Frequent Subgraphs ◽

Real World Datasets ◽

Different Levels

Abstract Graph databases are applied in many applications, including science and business, due to their low-complexity, low-overheads, and lower time-complexity. The graph-based storage offers the advantage of capturing the semantic and structural information rather than simply using the Bag-of-Words technique. An approach called Knowledgeable graphs (K-Graph) is proposed to capture semantic knowledge. Documents are stored using graph nodes. Thanks to weighted subgraphs, the frequent subgraphs are extracted and stored in the Fast Embedding Referral Table (FERT). The table is maintained at different levels according to the headings and subheadings of the documents. It reduces the memory overhead, retrieval, and access time of the subgraph needed. The authors propose an approach that will reduce the data redundancy to a larger extent. With real-world datasets, K-graph’s performance and power usage are threefold greater than the current methods. Ninety-nine per cent accuracy demonstrates the robustness of the proposed algorithm.

Download Full-text

Incorporate Syntactic Information for Short Text Classification

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.268-270.697 ◽

2011 ◽

Vol 268-270 ◽

pp. 697-700

Author(s):

Rui Xue Duan ◽

Xiao Jie Wang ◽

Wen Feng Li

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Environment ◽

Text Classification ◽

The Internet ◽

Selection Methods ◽

Text Documents ◽

Short Text ◽

Syntactic Information ◽

Dependency Relations

As the volume of online short text documents grow tremendously on the Internet, it is much more urgent to solve the task of organizing the short texts well. However, the traditional feature selection methods cannot suitable for the short text. In this paper, we proposed a method to incorporate syntactic information for the short text. It emphasizes the feature which has more dependency relations with other words. The classifier SVM and machine learning environment Weka are involved in our experiments. The experiment results show that incorporate syntactic information in the short text, we can get more powerful features than traditional feature selection methods, such as DF, CHI. The precision of short text classification improved from 86.2% to 90.8%.

Download Full-text

Language Semantics Interpretation with an Interaction-Based Recurrent Neural Network

Machine Learning and Knowledge Extraction ◽

10.3390/make3040046 ◽

2021 ◽

Vol 3 (4) ◽

pp. 922-945

Author(s):

Shaw-Hwa Lo ◽

Yiqiao Yin

Keyword(s):

Neural Network ◽

Neural Networks ◽

Language Processing ◽

Text Classification ◽

Search Algorithm ◽

Greedy Search ◽

Text Documents ◽

Engineering Technique ◽

Language Semantics ◽

Sequential Models

Text classification is a fundamental language task in Natural Language Processing. A variety of sequential models are capable of making good predictions, yet there is a lack of connection between language semantics and prediction results. This paper proposes a novel influence score (I-score), a greedy search algorithm, called Backward Dropping Algorithm (BDA), and a novel feature engineering technique called the “dagger technique”. First, the paper proposes to use the novel influence score (I-score) to detect and search for the important language semantics in text documents that are useful for making good predictions in text classification tasks. Next, a greedy search algorithm, called the Backward Dropping Algorithm, is proposed to handle long-term dependencies in the dataset. Moreover, the paper proposes a novel engineering technique called the “dagger technique” that fully preserves the relationship between the explanatory variable and the response variable. The proposed techniques can be further generalized into any feed-forward Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs), and any neural network. A real-world application on the Internet Movie Database (IMDB) is used and the proposed methods are applied to improve prediction performance with an 81% error reduction compared to other popular peers if I-score and “dagger technique” are not implemented.

Download Full-text

Toward a multi-sensor neural net approach to automatic text classification

Advanced IT Tools ◽

10.1007/978-0-387-34979-4_41 ◽

1996 ◽

pp. 367-373

Author(s):

Venu Dasigi ◽

Reinhold C. Mann

Keyword(s):

Text Classification ◽

Neural Net ◽

Automatic Text Classification ◽

Automatic Text

Download Full-text

The Evaluation of Accuracy Performance in an Enhanced Embedded Feature Selection for Unstructured Text Classification

Iraqi Journal of Science ◽

10.24996/ijs.2020.61.12.28 ◽

2020 ◽

pp. 3397-3407

Author(s):

Nur Syafiqah Mohd Nafis ◽

Suryanti Awang

Keyword(s):

Feature Selection ◽

Text Classification ◽

Training Dataset ◽

Recursive Feature Elimination ◽

High Dimensional ◽

Significant Feature ◽

Support Vector ◽

Svm Classifier ◽

Text Documents ◽

Text Document

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant feature from the sparse feature space. Thus, this paper proposed an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high dimensional text classificationhis technique has the ability to measure the feature’s importance in a high-dimensional text document. In addition, it aims to increase the efficiency of the feature selection. Hence, obtaining a promising text classification accuracy. TF-IDF act as a filter approach which measures features importance of the text documents at the first stage. SVM-RFE utilized a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets at the second stage. This research executes sets of experiments using a text document retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing processes are applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is implemented on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is applied for feature ranking as the next feature selection step. Only top-rank features will be selected for text classification using the SVM classifier. Based on the experiments, it shows that the proposed technique able to achieve 98% accuracy that outperformed other existing techniques. In conclusion, the proposed technique able to select the significant features in the unstructured and high dimensional text document.

Download Full-text

Toward a multi-sensor-based approach to automatic text classification

10.2172/130610 ◽

1995 ◽

Author(s):

V.R. Dasigi ◽

R.C. Mann

Keyword(s):

Text Classification ◽

Automatic Text Classification ◽

Automatic Text

Download Full-text

Frequency estimates for statistical word similarity measures

10.3115/1073445.1073477 ◽

2003 ◽

Cited By ~ 55

Author(s):

Egidio Terra ◽

C. L. A. Clarke

Keyword(s):

Similarity Measures ◽

Word Similarity

Download Full-text

A Comparison of Word Similarity Measures for Noun Compound Disambiguation

Artificial Intelligence and Cognitive Science - Lecture Notes in Computer Science ◽

10.1007/978-3-642-17080-5_25 ◽

2010 ◽

pp. 231-240 ◽

Cited By ~ 1

Author(s):

Paul Nulty ◽

Fintan Costello

Keyword(s):

Similarity Measures ◽

Word Similarity

Download Full-text

Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality

Political Analysis ◽

10.1017/pan.2020.1 ◽

2020 ◽

Vol 28 (4) ◽

pp. 445-468 ◽

Cited By ~ 3

Author(s):

Reagan Mozer ◽

Luke Miratrix ◽

Aaron Russell Kaufman ◽

L. Jason Anastasopoulos

Keyword(s):

Propensity Scores ◽

Human Subjects ◽

Medical Intervention ◽

Exact Matching ◽

Match Quality ◽

Text Documents ◽

Text Data ◽

News Sources ◽

Text Matching

Matching for causal inference is a well-studied problem, but standard methods fail when the units to match are text documents: the high-dimensional and rich nature of the data renders exact matching infeasible, causes propensity scores to produce incomparable matches, and makes assessing match quality difficult. In this paper, we characterize a framework for matching text documents that decomposes existing methods into (1) the choice of text representation and (2) the choice of distance metric. We investigate how different choices within this framework affect both the quantity and quality of matches identified through a systematic multifactor evaluation experiment using human subjects. Altogether, we evaluate over 100 unique text-matching methods along with 5 comparison methods taken from the literature. Our experimental results identify methods that generate matches with higher subjective match quality than current state-of-the-art techniques. We enhance the precision of these results by developing a predictive model to estimate the match quality of pairs of text documents as a function of our various distance scores. This model, which we find successfully mimics human judgment, also allows for approximate and unsupervised evaluation of new procedures in our context. We then employ the identified best method to illustrate the utility of text matching in two applications. First, we engage with a substantive debate in the study of media bias by using text matching to control for topic selection when comparing news articles from thirteen news sources. We then show how conditioning on text data leads to more precise causal inferences in an observational study examining the effects of a medical intervention.

Download Full-text