A Keyphrase-Based Tag Cloud Generation Framework to Conceptualize Textual Data

Author(s):  
Muhammad Abulaish ◽  
Tarique Anwar

Tag clouds have become an effective tool for quickly perceiving the most prominent terms embedded within textual data; they help readers grasp the main theme of a corpus without wading through the pile of documents. However, the effectiveness of tag clouds for conceptualizing text corpora is directly proportional to the quality of the tags. In this paper, the authors propose a keyphrase-based tag cloud generation framework. In contrast to existing tag cloud generation systems, which use single words as tags and their frequency counts to determine the font size of the tags, the proposed framework identifies feasible keyphrases and uses them as tags. The font size of a keyphrase is determined as a function of its relevance weight. Instead of using partial or full parsing, which is inefficient for lengthy sentences and inaccurate for sentences that do not follow a proper grammatical structure, the proposed method applies n-gram techniques followed by various heuristics-based refinements to identify candidate phrases in text documents. A rich set of lexical and semantic features is identified to characterize the candidate phrases and determine their keyphraseness and relevance weights. The authors also propose a font-size determination function, which uses the relevance weights of the keyphrases to determine their relative font size for tag cloud visualization. The efficacy of the proposed framework is established through experimentation and comparison with existing state-of-the-art tag cloud generation methods.
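
The abstract does not spell out the exact heuristics, feature set, or weighting function; the Python sketch below only illustrates the general shape of such a pipeline, assuming a stopword-boundary pruning heuristic for n-gram candidates and a linear font-size mapping (both are assumptions, not the paper's specification).

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "of", "a", "an", "and", "to", "in", "for", "on"})

def candidate_phrases(text, max_n=3):
    """Extract n-gram candidates (n <= max_n), pruning any that start or
    end with a stopword -- a simple heuristics-based refinement."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates[" ".join(gram)] += 1
    return candidates

def font_size(weight, w_min, w_max, f_min=10, f_max=40):
    """Map a keyphrase relevance weight linearly onto a font-size range."""
    if w_max == w_min:
        return f_max
    return f_min + (f_max - f_min) * (weight - w_min) / (w_max - w_min)
```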

Textual content on the internet is growing exponentially, primarily through social websites, and the problems posed by anonymous textual data are growing with it. Researchers are therefore searching for techniques to identify the author of a document of unknown origin. Authorship attribution is one such technique for predicting the authorship of an unknown document. Researchers have extracted various classes of stylistic features, such as character, lexical, syntactic, structural, content, and semantic features, to distinguish authors' writing styles. In this work, the experiments were performed with the most frequent content-specific features and n-grams of characters, words, and POS tags. A standard dataset was used for experimentation, and the combination of content-based and n-gram features was found to achieve the best accuracy for author prediction. Two standard classification algorithms were used: the Random Forest classifier attained better accuracy for author prediction than the Naïve Bayes Multinomial classifier. The achieved results compare favourably with many existing solutions to authorship attribution.
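
As a hedged illustration of this kind of pipeline (not the paper's exact setup), the following scikit-learn sketch builds character n-gram features and fits the two named classifiers; the corpus, n-gram range, and parameters are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: documents with known authors (placeholders, not the paper's dataset).
texts = ["first sample text by author A", "second sample text by author B",
         "another text by author A", "yet another text by author B"]
authors = ["A", "B", "A", "B"]

def char_ngrams():
    # Character n-grams (2-4) as one stylistic feature class.
    return TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

rf = make_pipeline(char_ngrams(), RandomForestClassifier(n_estimators=200, random_state=0))
nb = make_pipeline(char_ngrams(), MultinomialNB())

rf.fit(texts, authors)
nb.fit(texts, authors)
print(rf.predict(["an unseen text of unknown authorship"]))
```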


Author(s):  
Ra'fat Ahmad Al-msie'deen

Legacy software documents are hard to understand and visualize. The tag cloud technique helps software developers visualize the contents of software documents. A tag cloud is a well-known and simple visualization technique. This paper proposes a new method to visualize software documents using a tag cloud, in which tags are displayed in alphabetical order, with the most important tags shown in a larger font size according to their frequency. The originality of this method is that it visualizes the contents of JavaDoc as a tag cloud. To validate the JavaDocCloud method, it was applied to the NanoXML case study; the results of these experiments display the most common and uncommon tags used in the software documents.
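
A minimal sketch of frequency-scaled, alphabetically ordered tag rendering; the JavaDocCloud internals are not given in the abstract, so the HTML output format and linear scaling here are assumptions.

```python
from collections import Counter

def tag_cloud_html(tags, min_px=10, max_px=36):
    """Render tags in alphabetical order, sized linearly by frequency."""
    counts = Counter(tags)
    lo, hi = min(counts.values()), max(counts.values())
    spans = []
    for tag in sorted(counts):
        size = max_px if hi == lo else \
            min_px + (max_px - min_px) * (counts[tag] - lo) / (hi - lo)
        spans.append(f'<span style="font-size:{size:.0f}px">{tag}</span>')
    return " ".join(spans)

print(tag_cloud_html(["parser", "xml", "xml", "reader", "xml", "parser"]))
```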


Author(s):  
Elena Makarova ◽ 
Dmitriy Lagerev ◽ 
Fedor Lozbinev ◽ 
...  

This paper describes text data analysis in the course of managerial decision making. The process of collecting textual data for further analysis, as well as the use of visualization in human control over the correctness of data collection, is considered in depth. An algorithm modification for creating an "n-gram cloud" visualization is proposed, which can help make the visualization accessible to people with visual impairments. A method for visualizing n-gram vector representation models (word embeddings) is also proposed. On the basis of the conducted research, a part of a software package was implemented that is responsible for creating interactive visualizations in a browser and interacting with them.
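
The abstract does not name the projection used to visualize the embedding models; as one standard choice, the sketch below projects stand-in n-gram vectors to 2-D with PCA (the n-grams and vectors are placeholders, not the paper's data).

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in n-grams and embeddings; a real system would use trained vectors.
ngrams = ["data analysis", "decision making", "text mining", "data collection"]
vectors = np.random.default_rng(0).random((len(ngrams), 100))

# Project the 100-dimensional vectors onto two axes for plotting.
xy = PCA(n_components=2).fit_transform(vectors)
for gram, (x, y) in zip(ngrams, xy):
    print(f"{gram}: ({x:+.2f}, {y:+.2f})")
```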


Author(s):  
Ali Daud ◽  
Jamal Ahmad Khan ◽  
Jamal Abdul Nasir ◽  
Rabeeh Ayaz Abbasi ◽  
Naif Radi Aljohani ◽  
...  

In this article we present a new semantic- and syntactic-based method for external plagiarism detection. In the proposed approach, latent Dirichlet allocation (LDA) and part-of-speech (POS) tags are used together to detect plagiarism between a sample document and a number of source documents. The basic hypothesis is that considering both semantic and syntactic information between two text documents may improve the performance of the plagiarism detection task. Our method consists of two steps: a pre-processing step, where we detect the topics of the sentences in documents using LDA and convert each sentence into an array of POS tags, and a post-processing step, where the suspicious cases are verified purely on the basis of semantic rules. For two types of external plagiarism (copy and random obfuscation), we empirically compare our approach to state-of-the-art N-gram-based and stop-word N-gram-based methods and observe significant improvements.
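
A rough sketch of the two views the pre-processing step combines; the paper's semantic rules and thresholds are not reproduced, and the specific LDA and POS tooling here are assumptions.

```python
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = ["The cat sat on the mat.", "A cat was sitting on a mat."]

# Semantic view: per-sentence topic distributions via LDA.
bow = CountVectorizer().fit_transform(sentences)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(bow)

# Syntactic view: each sentence converted into an array of POS tags.
pos_arrays = [[tag for _, tag in nltk.pos_tag(nltk.word_tokenize(s))]
              for s in sentences]

# Suspicious pairs would be flagged when both views agree (rules omitted here).
print(topics)
print(pos_arrays)
```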


2018 ◽  
Vol 9 (2) ◽  
pp. 46-68
Author(s):  
Omar Khrouf ◽  
Kais Khrouf ◽  
Jamel Feki

The number of textual documents generated and stored has exploded in recent years. Effective management of these documents is essential for better exploitation in decisional analyses. In this context, the authors first propose their CobWeb multidimensional model, based on standard facets and dedicated to the OLAP (on-line analytical processing) of XML documents; it aims to provide decision makers with facilities for expressing their analytical queries. Second, they suggest new visualization operators for OLAP query results, introducing the concept of tag clouds as a means to help decision makers display OLAP results in an intuitive format and focus on the main concepts. The authors have developed a software prototype called MQF (Multidimensional Query based on Facets) to support their proposals and have tested it on documents from the PubMed collection.
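
A toy illustration of facet-based filtering feeding a tag cloud; the field names and document structure are hypothetical, not the CobWeb/MQF schema.

```python
docs = [
    {"id": 1, "year": 2010, "journal": "J1", "keywords": ["gene", "protein"]},
    {"id": 2, "year": 2012, "journal": "J2", "keywords": ["gene", "cell"]},
    {"id": 3, "year": 2012, "journal": "J1", "keywords": ["protein", "cell"]},
]

def facet_query(docs, **facets):
    """Keep documents matching every facet value, then count keywords so
    the result set can be rendered as a tag cloud."""
    hits = [d for d in docs if all(d.get(k) == v for k, v in facets.items())]
    counts = {}
    for d in hits:
        for kw in d["keywords"]:
            counts[kw] = counts.get(kw, 0) + 1
    return counts

print(facet_query(docs, year=2012))  # {'gene': 1, 'cell': 2, 'protein': 1}
```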


2020 ◽  
pp. 147387162096663
Author(s):  
Úrsula Torres Parejo ◽  
Jesús R Campaña ◽  
M Amparo Vila ◽  
Miguel Delgado

Tag clouds are tools that have been widely used on the Internet since their conception. The main applications of these textual visualizations are information retrieval, content representation, and browsing of the original text from which the tags are generated. Despite the extensive use of tag clouds, their enormous popularity, and the amount of research related to different aspects of them, few studies have summarized their most important features when they work as tools for information retrieval and content representation. In this paper we present a summary of the main characteristics of tag clouds found in the literature, such as their different functions, designs, and negative aspects. We also present a summary of the most popular metrics used to capture the structural properties of a tag cloud generated from query results, as well as other measures for evaluating the goodness of a tag cloud when it works as a tool for content representation. The different methods for tagging and the semantic association processes in tag clouds are also considered. Finally, we give a list of alternative visual interfaces, which makes this study a useful starting point for researchers who want to study content representation and information retrieval interfaces in greater depth.
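
As one example of the kind of structural metric such surveys cover, the sketch below computes a simple coverage score over query results; this is an illustrative definition, not a specific metric from the paper.

```python
def coverage(cloud_tags, result_docs):
    """Fraction of query-result documents containing at least one cloud tag
    (an illustrative structural metric; surveyed definitions vary)."""
    tagset = {t.lower() for t in cloud_tags}
    covered = sum(1 for doc in result_docs if tagset & set(doc.lower().split()))
    return covered / len(result_docs) if result_docs else 0.0

print(coverage(["cloud", "tag"], ["a tag cloud paper", "unrelated text"]))  # 0.5
```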


2018 ◽  
Vol 29 (1) ◽  
pp. 1109-1121
Author(s):  
Mohsen Pourvali ◽  
Salvatore Orlando

This paper explores a multi-strategy technique that aims at enriching text documents to improve clustering quality. We use a combination of entity linking and document summarization to determine the identity of the most salient entities mentioned in texts. To enrich documents effectively without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, in turn belonging to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed the clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enrichment strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like the BBC News dataset.
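
A minimal sketch of the enrichment step, assuming WordNet lemma expansion restricted to fragments that mention a salient entity; the entity-linking and summarization stages are omitted, and the first-sense restriction is an assumption.

```python
from nltk.corpus import wordnet as wn  # needs the nltk 'wordnet' data package

def enrich(fragment_tokens, salient_entities):
    """Expand only fragments mentioning a salient entity, appending WordNet
    lemmas of each token's first sense; other fragments are left untouched
    to avoid introducing noise."""
    if not salient_entities & set(fragment_tokens):
        return fragment_tokens
    expanded = list(fragment_tokens)
    for tok in fragment_tokens:
        for syn in wn.synsets(tok)[:1]:  # first sense only, to limit drift
            expanded += [l.name() for l in syn.lemmas() if l.name() != tok]
    return expanded

print(enrich(["broadcasting", "news", "bbc"], {"bbc"}))
```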


2019 ◽  
Vol 3 (3) ◽  
pp. 58 ◽  
Author(s):  
Tim Haarman ◽  
Bastiaan Zijlema ◽  
Marco Wiering

Keyphrase extraction is an important part of natural language processing (NLP) research, although little research has been done in the domain of web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, yet it remains largely untouched in scientific research. Current research is often only applied to clean corpora, such as abstracts and articles from academic journals, or sets of scraped texts from a single domain. However, textual data from web pages differs from normal text documents: it is structured using HTML elements and often consists of many small fragments. These elements are, furthermore, used in a highly inconsistent manner and are likely to contain noise. We evaluated the keyphrases extracted by several state-of-the-art extraction methods and found that they did not transfer well to web pages. We therefore propose WebEmbedRank, an adaptation of a recently proposed extraction method that can make use of the structural information in web pages in a robust manner. We compared this novel method to other baselines and state-of-the-art methods using a manually annotated dataset and found that WebEmbedRank achieved significant improvements over existing extraction methods on web pages.
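
A sketch of EmbedRank-style scoring (the general idea WebEmbedRank adapts): candidates are ranked by cosine similarity to the document embedding, optionally reweighted by structural cues such as appearing in headings. The boost values and vector sources here are placeholders, not the paper's method.

```python
import numpy as np

def rank_candidates(doc_vec, cand_vecs, structure_boost=None):
    """Score candidates by cosine similarity to the document embedding,
    optionally reweighted by structural cues; returns indices, best first."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = np.array([cos(doc_vec, c) for c in cand_vecs])
    if structure_boost is not None:
        scores *= structure_boost
    return np.argsort(-scores)

rng = np.random.default_rng(0)
doc, cands = rng.random(300), rng.random((5, 300))
print(rank_candidates(doc, cands, structure_boost=np.array([1.0, 1.3, 1.0, 1.0, 0.8])))
```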


2018 ◽  
Vol 9 (2) ◽  
pp. 1-22 ◽  
Author(s):  
Rafiya Jan ◽  
Afaq Alam Khan

Social networks are considered the most abundant sources of affective information for sentiment and emotion classification. Emotion classification is the challenging task of classifying emotions into different types. Even though emotions are universal, their automatic detection is considered a difficult task to perform. A lot of research is being conducted in the field of automatic emotion detection in textual data streams, but very little attention is paid to capturing the semantic features of the text. In this article, the authors present a technique that uses semantic relatedness for automatic emotion classification in text, based on distributional semantic models. This approach uses semantic similarity to measure the coherence between two emotionally related entities. Before classification, the data is pre-processed to remove irrelevant fields and inconsistencies and to improve performance. The proposed approach achieved an accuracy of 71.795%, which is competitive considering that no training or annotation of the data is required.
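
A minimal sketch of classification by semantic relatedness, assuming precomputed distributional vectors for the text and for emotion prototypes; the vector sources and emotion set are placeholders, not the paper's configuration.

```python
import numpy as np

EMOTIONS = ["joy", "anger", "fear", "sadness"]

def classify(text_vec, emotion_vecs):
    """Pick the emotion whose prototype vector is most similar (cosine)
    to the text vector -- no labelled training data required."""
    sims = [float(text_vec @ e / (np.linalg.norm(text_vec) * np.linalg.norm(e)))
            for e in emotion_vecs]
    return EMOTIONS[int(np.argmax(sims))]

rng = np.random.default_rng(1)
print(classify(rng.random(50), rng.random((4, 50))))
```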


2012 ◽  
Vol 3 (3) ◽  
pp. 1-14 ◽  
Author(s):  
Reda Mohamed Hamou ◽  
Abdelmalek Amine ◽  
Ahmed Chaouki Lokbani

In this paper the authors experiment with and test a new biomimetic approach based on social spiders to solve a combinatorial problem, namely the automatic classification of texts, motivated by the very large data streams flowing, particularly on the web. Textual data were represented by a language-independent method, i.e., character and word n-grams, because there is currently no learning method that can directly represent unstructured data (text). To validate the classification, the authors used an evaluation measure based on recall and precision (the F-measure). During the experiments, the authors found social spiders to be a powerful visualization tool, which they exploit to perform visual classification.
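
For reference, the F-measure used for evaluation combines precision and recall as their weighted harmonic mean; a minimal implementation:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta=1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.8, 0.7))  # 0.7466...
```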

