K-Graph: Knowledgeable Graph for Text Documents

Abstract: Graph databases are used in many scientific and business applications because of their low complexity, low overhead, and low time complexity. Graph-based storage has the advantage of capturing semantic and structural information rather than relying on a simple Bag-of-Words representation. An approach called Knowledgeable Graphs (K-Graph) is proposed to capture semantic knowledge. Documents are stored as graph nodes. Using weighted subgraphs, the frequent subgraphs are extracted and stored in the Fast Embedding Referral Table (FERT). The table is maintained at different levels according to the headings and subheadings of the documents, which reduces memory overhead and the retrieval and access time for the required subgraph. The authors state that the approach substantially reduces data redundancy. On real-world datasets, K-Graph's performance and power usage are reported to improve threefold over current methods, and an accuracy of 99% is cited as evidence of the algorithm's robustness.

Knowledge-driven graph similarity for text classification

International Journal of Machine Learning and Cybernetics ◽

10.1007/s13042-020-01221-4 ◽

2020 ◽

Author(s):

Niloofer Shanavas ◽

Hui Wang ◽

Zhiwei Lin ◽

Glenn Hawe

Keyword(s):

Text Classification ◽

Structural Information ◽

Similarity Measures ◽

Semantic Knowledge ◽

Exact Matching ◽

Text Documents ◽

Graph Kernel ◽

Word Similarity ◽

Graph Similarity ◽

Automatic Text Classification

Abstract: Automatic text classification using machine learning is significantly affected by the text representation model. The structural information in text, which is usually ignored in vector-based representations, is necessary for natural language understanding. In this paper, we present a graph kernel-based text classification framework which utilises the structural information in text effectively through the weighting and enrichment of a graph-based representation. We introduce weighted co-occurrence graphs to represent text documents, which weight the terms and their dependencies based on their relevance to text classification. We propose a novel method to automatically enrich the weighted graphs using semantic knowledge in the form of a word similarity matrix. The similarity between enriched graphs, knowledge-driven graph similarity, is calculated using a graph kernel. The semantic knowledge in the enriched graphs ensures that the graph kernel goes beyond exact matching of terms and patterns to compute the semantic similarity of documents. In experiments on sentiment classification and topic classification tasks, our knowledge-driven similarity measure significantly outperforms the baseline text similarity measures on five benchmark text classification datasets.
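The pipeline described in this abstract (co-occurrence graph, enrichment with word similarities, graph kernel) can be illustrated with a small sketch. It is not the authors' implementation: the function names, the sliding-window co-occurrence count used as the edge weight, the threshold-based enrichment rule, and the normalized edge-product kernel are all assumptions chosen for brevity, whereas the paper weights terms by their relevance to classification and uses a word similarity matrix.

```python
import math
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected graph whose edge weight counts how often two
    distinct terms occur within `window` positions of each other.
    (Assumption: plain counts stand in for the paper's relevance weights.)"""
    graph = defaultdict(float)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != term:
                graph[frozenset((term, other))] += 1.0
    return dict(graph)


def enrich(graph, similar, threshold=0.5):
    """Hypothetical enrichment rule: for every edge (u, v), add an edge
    (u', v) for each term u' similar to u, scaled by the similarity score.
    `similar` maps a term to a dict of {similar_term: score}."""
    enriched = defaultdict(float, graph)
    for edge, weight in graph.items():
        u, v = sorted(edge)
        for src, dst in ((u, v), (v, u)):
            for alt, score in similar.get(src, {}).items():
                if score >= threshold and alt != dst:
                    enriched[frozenset((alt, dst))] += weight * score
    return dict(enriched)


def edge_kernel(g1, g2):
    """Sum of weight products over the edges that both graphs share."""
    return sum(w * g2[e] for e, w in g1.items() if e in g2)


def normalized_kernel(g1, g2):
    """Edge kernel scaled to [0, 1] by the self-similarities of both graphs."""
    denom = math.sqrt(edge_kernel(g1, g1) * edge_kernel(g2, g2))
    return edge_kernel(g1, g2) / denom if denom else 0.0


if __name__ == "__main__":
    doc_a = cooccurrence_graph("the film was great".split())
    doc_b = cooccurrence_graph("the movie was great".split())
    # Toy similarity table; the scores are made up for illustration.
    similar = {"film": {"movie": 0.9}, "movie": {"film": 0.9}}
    print("exact matching:", normalized_kernel(doc_a, doc_b))
    print("enriched:", normalized_kernel(enrich(doc_a, similar), enrich(doc_b, similar)))
```

In this toy case the two documents differ only in "film" versus "movie", so exact edge matching misses most of their shared structure, while the enriched graphs score as nearly identical. That is the effect the abstract attributes to going beyond exact matching of terms.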

WFSM-MaxPWS: An Efficient Approach for Mining Weighted Frequent Subgraphs from Edge-Weighted Graph Databases

Advances in Knowledge Discovery and Data Mining - Lecture Notes in Computer Science ◽

10.1007/978-3-319-93040-4_52 ◽

2018 ◽

pp. 664-676 ◽

Cited By ~ 6

Author(s):

Md. Ashraful Islam ◽

Chowdhury Farhan Ahmed ◽

Carson K. Leung ◽

Calvin S. H. Hoi

Keyword(s):

Weighted Graph ◽

Graph Databases ◽

Efficient Approach ◽

Frequent Subgraphs

Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture

Machine Learning and Knowledge Extraction ◽

10.3390/make1020034 ◽

2019 ◽

Vol 1 (2) ◽

pp. 575-589 ◽

Cited By ~ 1

Author(s):

Blaž Škrlj ◽

Jan Kralj ◽

Nada Lavrač ◽

Senja Pollak

Keyword(s):

Text Mining ◽

Language Processing ◽

Text Classification ◽

Deep Neural Networks ◽

Semantic Knowledge ◽

Text Documents ◽

Neural Architecture ◽

Classification Tasks ◽

And Gender ◽

Semantic Resources

Deep neural networks are becoming ubiquitous in text mining and natural language processing, but semantic resources, such as taxonomies and ontologies, are yet to be fully exploited in a deep learning setting. This paper presents an efficient semantic text mining approach, which converts semantic information related to a given set of documents into a set of novel features that are used for learning. The proposed Semantics-aware Recurrent deep Neural Architecture (SRNA) enables the system to learn simultaneously from the semantic vectors and from the raw text documents. We test the effectiveness of the approach on three text classification tasks: news topic categorization, sentiment analysis and gender profiling. The experiments show that the proposed approach outperforms the approach without semantic knowledge, with the highest accuracy gain (up to 10%) achieved on short document fragments.

Unsupervised Keyphrase Extraction for Web Pages

Multimodal Technologies and Interaction ◽

10.3390/mti3030058 ◽

2019 ◽

Vol 3 (3) ◽

pp. 58 ◽

Cited By ~ 1

Author(s):

Tim Haarman ◽

Bastiaan Zijlema ◽

Marco Wiering

Keyword(s):

Language Processing ◽

State Of The Art ◽

Structural Information ◽

Extraction Methods ◽

Web Pages ◽

Keyphrase Extraction ◽

Text Documents ◽

Normal Text ◽

Textual Data ◽

Novel Method

Keyphrase extraction is an important part of natural language processing (NLP) research, although little research has been done in the domain of web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, yet it remains largely untouched in scientific research. Current research is often applied only to clean corpora such as abstracts and articles from academic journals or sets of scraped texts from a single domain. However, textual data from web pages differs from normal text documents, as it is structured using HTML elements and often consists of many small fragments. These elements are furthermore used in a highly inconsistent manner and are likely to contain noise. We evaluated the keyphrases extracted by several state-of-the-art extraction methods and found that they did not transfer well to web pages. We therefore propose WebEmbedRank, an adaptation of a recently proposed extraction method that can make use of structural information in web pages in a robust manner. We compared this novel method to other baselines and state-of-the-art methods using a manually annotated dataset and found that WebEmbedRank achieved significant improvements over existing extraction methods on web pages.

Formalization of Technological Knowledge in the Field of Metallurgy using Document Classification Tools Supported with Semantic Techniques

Archives of Metallurgy and Materials ◽

10.1515/amm-2017-0108 ◽

2017 ◽

Vol 62 (2) ◽

pp. 715-720 ◽

Cited By ~ 2

Author(s):

K. Regulski

Keyword(s):

Knowledge Base ◽

Latent Semantic Indexing ◽

Semantic Knowledge ◽

Technological Knowledge ◽

Semantic Integration ◽

Semantic Indexing ◽

Text Documents ◽

Semantic Techniques ◽

Semantic Knowledge Base

Abstract: The process of knowledge formalization is an essential part of decision support systems development. Creating a technological knowledge base in the field of metallurgy encountered problems in acquiring and codifying reusable computer artifacts based on text documents. The aim of the work was to adapt the algorithms for classification of documents and to develop a method of semantic integration of the created repository. The author used artificial intelligence tools: latent semantic indexing, rough sets, association rule learning, and ontologies as a tool for integration. The developed methodology allowed for the creation of a semantic knowledge base from natural-language documents in the field of metallurgy.

Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '10 ◽

10.1145/1835804.1835885 ◽

2010 ◽

Cited By ~ 62

Author(s):

Zhaonian Zou ◽

Hong Gao ◽

Jianzhong Li

Keyword(s):

Graph Databases ◽

Uncertain Graph ◽

Probabilistic Semantics ◽

Frequent Subgraphs

Molecular Graph Contrastive Learning with Parameterized Explainable Augmentations

10.1101/2021.12.03.471150 ◽

2021 ◽

Author(s):

Yingheng Wang ◽

Yaosen Min ◽

Erzhuo Shao ◽

Ji Wu

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Structural Information ◽

Molecular Graph ◽

Representation Learning ◽

Graph Representation ◽

Input Graph ◽

Recent Success ◽

Real World Datasets ◽

Comparative Results

Abstract: Learning generalizable, transferable, and robust representations for molecule data has always been a challenge. The recent success of contrastive learning (CL) for self-supervised graph representation learning provides a novel perspective to learn molecule representations. The most prevailing graph CL framework is to maximize the agreement of representations in different augmented graph views. However, existing graph CL frameworks usually adopt stochastic augmentations or schemes according to pre-defined rules on the input graph to obtain different graph views in various scales (e.g. node, edge, and subgraph), which may destroy topological semantemes and domain prior in molecule data, leading to suboptimal performance. Therefore, designing parameterized, learnable, and explainable augmentation is quite necessary for molecular graph contrastive learning. A well-designed parameterized augmentation scheme can preserve chemically meaningful structural information and intrinsically essential attributes for molecule graphs, which helps to learn representations that are insensitive to perturbation on unimportant atoms and bonds. In this paper, we propose a novel Molecular Graph Contrastive Learning with Parameterized Explainable Augmentations, MolCLE for brevity, that self-adaptively incorporates chemically significative information from both topological and semantic aspects of molecular graphs. Specifically, we apply deep neural networks to parameterize the augmentation process for both the molecular graph topology and atom attributes, to highlight contributive molecular substructures and recognize underlying chemical semantemes. Comprehensive experiments on a variety of real-world datasets demonstrate that our proposed method consistently outperforms compared baselines, which verifies the effectiveness of the proposed framework.
In detail, our self-supervised MolCLE model surpasses many supervised counterparts while using only hundreds of thousands of parameters to achieve results comparable to the state-of-the-art baseline, which has tens of millions of parameters. We also provide detailed case studies to validate the explainability of augmented graph views.

CCS Concepts: Mathematics of computing → Graph algorithms; Applied computing → Bioinformatics; Computing methodologies → Neural networks; Unsupervised learning.
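The idea of maximizing the agreement of representations in different augmented graph views is commonly expressed as an InfoNCE-style contrastive loss. The form below is a generic sketch and not necessarily the exact objective used by MolCLE; the symbols $z_i$, $z_i'$, $\tau$, $N$, and $\mathrm{sim}$ are introduced here for illustration only.

```latex
\mathcal{L}_i = -\log
  \frac{\exp\!\left(\mathrm{sim}(z_i, z_i') / \tau\right)}
       {\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}(z_i, z_j') / \tau\right)}
```

Here $z_i$ and $z_i'$ are assumed to be the embeddings of two augmented views of the same molecular graph $i$, $\mathrm{sim}$ is a similarity function such as cosine similarity, $\tau$ is a temperature, and $N$ is the batch size. Minimizing $\mathcal{L}_i$ pulls the two views of graph $i$ together and pushes views of the other graphs in the batch away.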

Unsupervised Negative Link Prediction in Signed Social Networks

Mathematical Problems in Engineering ◽

10.1155/2019/7348301 ◽

2019 ◽

Vol 2019 ◽

pp. 1-15 ◽

Cited By ~ 1

Author(s):

Pengfei Shen ◽

Shufen Liu ◽

Ying Wang ◽

Lu Han

Keyword(s):

Link Prediction ◽

Nonnegative Matrix Factorization ◽

Structural Information ◽

Nonnegative Matrix ◽

Sociological Study ◽

Negative Interaction ◽

Signed Social Networks ◽

Real World Datasets ◽

Negative Links ◽

Negative Link

Predicting unknown social links has proved useful in a number of applications, and link prediction has played an important role in sociological study. Although there has been a surge of approaches to link prediction, most of them focus on positive link prediction while paying little attention to the problem of inferring unknown negative links. The inherent characteristics of negative relations present great challenges to traditional link prediction: (1) there is very little negative interaction data; (2) negative links are much sparser than positive links; (3) social data is often noisy, incomplete, and fast-evolving. This paper addresses this novel problem by leveraging structural information alone and proposes the UN-PNMF framework, based on projective nonnegative matrix factorization, to incorporate network embedding and user property embedding into negative link prediction. Empirical experiments on real-world datasets corroborate its effectiveness.

Semantically-Guided Clustering of Text Documents via Frequent Subgraphs Discovery

Lecture Notes in Computer Science - Foundations of Intelligent Systems ◽

10.1007/978-3-642-21916-0_44 ◽

2011 ◽

pp. 407-417 ◽

Cited By ~ 1

Author(s):

Rafal A. Angryk ◽

M. Shahriar Hossain ◽

Brandon Norick

Keyword(s):

Text Documents ◽

Frequent Subgraphs ◽

Guided Clustering

GOW-Stream: A novel approach of graph-of-words based mixture model for semantic-enhanced text stream clustering

Intelligent Data Analysis ◽

10.3233/ida-205443 ◽

2021 ◽

Vol 25 (5) ◽

pp. 1211-1231

Author(s):

Tham Vo ◽

Phuc Do

Keyword(s):

Social Networks ◽

Rapid Change ◽

Online News ◽

Text Documents ◽

Existing Problems ◽

Stream Clustering ◽

Evaluation Approach ◽

Novel Approach ◽

Independent Evaluation ◽

Real World Datasets

Recently, the rapid growth of social networks and online news resources on the Internet has made text stream clustering an important application in multiple domains (e.g., text retrieval diversification, social event detection, and text summarization). Unlike traditional static text clustering, text stream clustering faces specific challenges related to the rapid change of topics/clusters and the high velocity of incoming document batches. Recent well-known model-based text stream clustering models, such as DTM, DCT, and MStream, are word-independent evaluation approaches: they largely ignore the relations between words while sampling clusters/topics. This lowers overall model accuracy, especially for short text documents such as comments and microblogs in social networks. To tackle these problems, we propose a novel graph-of-words (GOW) based text stream clustering approach called GOW-Stream. Applying the common GOWs generated from each document batch while sampling clusters/topics helps overcome the word-independent evaluation challenge. GOW-Stream shows promise of significantly better text stream clustering performance than recent state-of-the-art baselines. Extensive experiments on multiple real-world benchmark datasets demonstrate the effectiveness of the proposed model in terms of both accuracy and running time.
