A Keyphrase-Based Tag Cloud Generation Framework to Conceptualize Textual Data

Author(s):  
Muhammad Abulaish ◽  
Tarique Anwar

Tag clouds have become an effective tool for quickly perceiving the most prominent terms embedded within textual data; they help readers grasp the main theme of a corpus without wading through the pile of documents. However, the effectiveness of tag clouds for conceptualizing text corpora is directly proportional to the quality of the tags. In this paper, the authors propose a keyphrase-based tag cloud generation framework. In contrast to existing tag cloud generation systems, which use single words as tags and their frequency counts to determine the font size of the tags, the proposed framework identifies feasible keyphrases and uses them as tags. The font size of a keyphrase is determined as a function of its relevance weight. Instead of using partial or full parsing, which is inefficient for lengthy sentences and inaccurate for sentences that do not follow a proper grammatical structure, the proposed method applies n-gram techniques followed by various heuristics-based refinements to identify candidate phrases in text documents. A rich set of lexical and semantic features is identified to characterize the candidate phrases and determine their keyphraseness and relevance weights. The authors also propose a font-size determination function, which uses the relevance weights of the keyphrases to determine their relative font size for tag cloud visualization. The efficacy of the proposed framework is established through experimentation and comparison with existing state-of-the-art tag cloud generation methods.
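
The abstract does not spell out the exact heuristics, feature set, or weighting function; the Python sketch below only illustrates the general shape of such a pipeline, assuming a stopword-boundary pruning heuristic for n-gram candidates and a linear font-size mapping (both are assumptions, not the paper's specification).

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "of", "a", "an", "and", "to", "in", "for", "on"})

def candidate_phrases(text, max_n=3):
    """Extract n-gram candidates (n <= max_n), pruning any that start or
    end with a stopword -- a simple heuristics-based refinement."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates[" ".join(gram)] += 1
    return candidates

def font_size(weight, w_min, w_max, f_min=10, f_max=40):
    """Map a keyphrase relevance weight linearly onto a font-size range."""
    if w_max == w_min:
        return f_max
    return f_min + (f_max - f_min) * (weight - w_min) / (w_max - w_min)
```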

Textual content on the internet is growing exponentially, primarily through social websites, and the problems posed by anonymous textual data are growing with it. Researchers are therefore searching for techniques to identify the author of a document of unknown origin. Authorship attribution is one such technique for predicting the authorship of an unknown document. Researchers have extracted various classes of stylistic features, such as character, lexical, syntactic, structural, content, and semantic features, to distinguish authors' writing styles. In this work, the experiments were performed with the most frequent content-specific features and n-grams of characters, words, and POS tags. A standard dataset was used for experimentation, and the combination of content-based and n-gram features was found to achieve the best accuracy for author prediction. Two standard classification algorithms were used: the Random Forest classifier attained better accuracy for author prediction than the Naïve Bayes Multinomial classifier. The achieved results compare favourably with many existing solutions to authorship attribution.
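
As a hedged illustration of this kind of pipeline (not the paper's exact setup), the following scikit-learn sketch builds character n-gram features and fits the two named classifiers; the corpus, n-gram range, and parameters are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: documents with known authors (placeholders, not the paper's dataset).
texts = ["first sample text by author A", "second sample text by author B",
         "another text by author A", "yet another text by author B"]
authors = ["A", "B", "A", "B"]

def char_ngrams():
    # Character n-grams (2-4) as one stylistic feature class.
    return TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

rf = make_pipeline(char_ngrams(), RandomForestClassifier(n_estimators=200, random_state=0))
nb = make_pipeline(char_ngrams(), MultinomialNB())

rf.fit(texts, authors)
nb.fit(texts, authors)
print(rf.predict(["an unseen text of unknown authorship"]))
```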


Author(s):  
Ra'fat Ahmad Al-msie'deen

Legacy software documents are hard to understand and visualize. The tag cloud technique helps software developers visualize the contents of software documents. A tag cloud is a well-known and simple visualization technique. This paper proposes a new method to visualize software documents using a tag cloud, in which tags are displayed in alphabetical order, with the most important tags shown in a larger font size according to their frequency. The originality of this method is that it visualizes the contents of JavaDoc as a tag cloud. To validate the JavaDocCloud method, it was applied to the NanoXML case study; the results of these experiments display the most common and uncommon tags used in the software documents.
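
A minimal sketch of frequency-scaled, alphabetically ordered tag rendering; the JavaDocCloud internals are not given in the abstract, so the HTML output format and linear scaling here are assumptions.

```python
from collections import Counter

def tag_cloud_html(tags, min_px=10, max_px=36):
    """Render tags in alphabetical order, sized linearly by frequency."""
    counts = Counter(tags)
    lo, hi = min(counts.values()), max(counts.values())
    spans = []
    for tag in sorted(counts):
        size = max_px if hi == lo else \
            min_px + (max_px - min_px) * (counts[tag] - lo) / (hi - lo)
        spans.append(f'<span style="font-size:{size:.0f}px">{tag}</span>')
    return " ".join(spans)

print(tag_cloud_html(["parser", "xml", "xml", "reader", "xml", "parser"]))
```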


Author(s):  
Elena Makarova ◽ 
Dmitriy Lagerev ◽ 
Fedor Lozbinev ◽ 
...  

This paper describes text data analysis in the course of managerial decision making. The process of collecting textual data for further analysis, as well as the use of visualization in human control over the correctness of data collection, is considered in depth. An algorithm modification for creating an "n-gram cloud" visualization is proposed, which can help make the visualization accessible to people with visual impairments. A method for visualizing n-gram vector representation models (word embeddings) is also proposed. On the basis of the conducted research, a part of a software package was implemented that is responsible for creating interactive visualizations in a browser and interacting with them.
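
The abstract does not name the projection used to visualize the embedding models; as one standard choice, the sketch below projects stand-in n-gram vectors to 2-D with PCA (the n-grams and vectors are placeholders, not the paper's data).

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in n-grams and embeddings; a real system would use trained vectors.
ngrams = ["data analysis", "decision making", "text mining", "data collection"]
vectors = np.random.default_rng(0).random((len(ngrams), 100))

# Project the 100-dimensional vectors onto two axes for plotting.
xy = PCA(n_components=2).fit_transform(vectors)
for gram, (x, y) in zip(ngrams, xy):
    print(f"{gram}: ({x:+.2f}, {y:+.2f})")
```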


Author(s):  
Ali Daud ◽  
Jamal Ahmad Khan ◽  
Jamal Abdul Nasir ◽  
Rabeeh Ayaz Abbasi ◽  
Naif Radi Aljohani ◽  
...  

In this article we present a new semantic- and syntactic-based method for external plagiarism detection. In the proposed approach, latent Dirichlet allocation (LDA) and part-of-speech (POS) tags are used together to detect plagiarism between a sample document and a number of source documents. The basic hypothesis is that considering both semantic and syntactic information between two text documents may improve the performance of the plagiarism detection task. Our method consists of two steps: a pre-processing step, where we detect the topics of the sentences in documents using LDA and convert each sentence into an array of POS tags, and a post-processing step, where the suspicious cases are verified purely on the basis of semantic rules. For two types of external plagiarism (copy and random obfuscation), we empirically compare our approach to state-of-the-art N-gram-based and stop-word N-gram-based methods and observe significant improvements.
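
A rough sketch of the two views the pre-processing step combines; the paper's semantic rules and thresholds are not reproduced, and the specific LDA and POS tooling here are assumptions.

```python
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = ["The cat sat on the mat.", "A cat was sitting on a mat."]

# Semantic view: per-sentence topic distributions via LDA.
bow = CountVectorizer().fit_transform(sentences)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(bow)

# Syntactic view: each sentence converted into an array of POS tags.
pos_arrays = [[tag for _, tag in nltk.pos_tag(nltk.word_tokenize(s))]
              for s in sentences]

# Suspicious pairs would be flagged when both views agree (rules omitted here).
print(topics)
print(pos_arrays)
```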


2018 ◽  
Vol 9 (2) ◽  
pp. 46-68
Author(s):  
Omar Khrouf ◽  
Kais Khrouf ◽  
Jamel Feki

The number of textual documents generated and stored has exploded in recent years. Effective management of these documents is essential for better exploitation in decisional analyses. In this context, the authors first propose their CobWeb multidimensional model, based on standard facets and dedicated to the OLAP (on-line analytical processing) of XML documents; it aims to provide decision makers with facilities for expressing their analytical queries. Second, they suggest new visualization operators for OLAP query results, introducing the concept of tag clouds as a means to help decision makers display OLAP results in an intuitive format and focus on the main concepts. The authors have developed a software prototype called MQF (Multidimensional Query based on Facets) to support their proposals and have tested it on documents from the PubMed collection.
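
A toy illustration of facet-based filtering feeding a tag cloud; the field names and document structure are hypothetical, not the CobWeb/MQF schema.

```python
docs = [
    {"id": 1, "year": 2010, "journal": "J1", "keywords": ["gene", "protein"]},
    {"id": 2, "year": 2012, "journal": "J2", "keywords": ["gene", "cell"]},
    {"id": 3, "year": 2012, "journal": "J1", "keywords": ["protein", "cell"]},
]

def facet_query(docs, **facets):
    """Keep documents matching every facet value, then count keywords so
    the result set can be rendered as a tag cloud."""
    hits = [d for d in docs if all(d.get(k) == v for k, v in facets.items())]
    counts = {}
    for d in hits:
        for kw in d["keywords"]:
            counts[kw] = counts.get(kw, 0) + 1
    return counts

print(facet_query(docs, year=2012))  # {'gene': 1, 'cell': 2, 'protein': 1}
```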


2020 ◽  
pp. 147387162096663
Author(s):  
Úrsula Torres Parejo ◽  
Jesús R Campaña ◽  
M Amparo Vila ◽  
Miguel Delgado

Tag clouds are tools that have been widely used on the Internet since their conception. The main applications of these textual visualizations are information retrieval, content representation, and browsing of the original text from which the tags are generated. Despite the extensive use of tag clouds, their enormous popularity, and the amount of research related to different aspects of them, few studies have summarized their most important features when they work as tools for information retrieval and content representation. In this paper we present a summary of the main characteristics of tag clouds found in the literature, such as their different functions, designs, and negative aspects. We also present a summary of the most popular metrics used to capture the structural properties of a tag cloud generated from query results, as well as other measures for evaluating the goodness of a tag cloud when it works as a tool for content representation. The different methods for tagging and the semantic association processes in tag clouds are also considered. Finally, we give a list of alternative visual interfaces, which makes this study a useful starting point for researchers who want to study content representation and information retrieval interfaces in greater depth.
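
As one example of the kind of structural metric such surveys cover, the sketch below computes a simple coverage score over query results; this is an illustrative definition, not a specific metric from the paper.

```python
def coverage(cloud_tags, result_docs):
    """Fraction of query-result documents containing at least one cloud tag
    (an illustrative structural metric; surveyed definitions vary)."""
    tagset = {t.lower() for t in cloud_tags}
    covered = sum(1 for doc in result_docs if tagset & set(doc.lower().split()))
    return covered / len(result_docs) if result_docs else 0.0

print(coverage(["cloud", "tag"], ["a tag cloud paper", "unrelated text"]))  # 0.5
```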


2018 ◽  
Vol 29 (1) ◽  
pp. 1109-1121
Author(s):  
Mohsen Pourvali ◽  
Salvatore Orlando

This paper explores a multi-strategy technique that aims at enriching text documents to improve clustering quality. We use a combination of entity linking and document summarization to determine the identity of the most salient entities mentioned in texts. To enrich documents effectively without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, in turn belonging to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed the clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enrichment strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like the BBC News dataset.
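
A minimal sketch of the enrichment step, assuming WordNet lemma expansion restricted to fragments that mention a salient entity; the entity-linking and summarization stages are omitted, and the first-sense restriction is an assumption.

```python
from nltk.corpus import wordnet as wn  # needs the nltk 'wordnet' data package

def enrich(fragment_tokens, salient_entities):
    """Expand only fragments mentioning a salient entity, appending WordNet
    lemmas of each token's first sense; other fragments are left untouched
    to avoid introducing noise."""
    if not salient_entities & set(fragment_tokens):
        return fragment_tokens
    expanded = list(fragment_tokens)
    for tok in fragment_tokens:
        for syn in wn.synsets(tok)[:1]:  # first sense only, to limit drift
            expanded += [l.name() for l in syn.lemmas() if l.name() != tok]
    return expanded

print(enrich(["broadcasting", "news", "bbc"], {"bbc"}))
```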


2019 ◽  
Vol 3 (3) ◽  
pp. 58 ◽  
Author(s):  
Tim Haarman ◽  
Bastiaan Zijlema ◽  
Marco Wiering

Keyphrase extraction is an important part of natural language processing (NLP) research, although little research has been done in the domain of web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, yet it remains largely untouched in scientific research. Current research is often only applied to clean corpora, such as abstracts and articles from academic journals, or sets of scraped texts from a single domain. However, textual data from web pages differs from normal text documents: it is structured using HTML elements and often consists of many small fragments. These elements are, furthermore, used in a highly inconsistent manner and are likely to contain noise. We evaluated the keyphrases extracted by several state-of-the-art extraction methods and found that they did not transfer well to web pages. We therefore propose WebEmbedRank, an adaptation of a recently proposed extraction method that can make use of the structural information in web pages in a robust manner. We compared this novel method to other baselines and state-of-the-art methods using a manually annotated dataset and found that WebEmbedRank achieved significant improvements over existing extraction methods on web pages.
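
A sketch of EmbedRank-style scoring (the general idea WebEmbedRank adapts): candidates are ranked by cosine similarity to the document embedding, optionally reweighted by structural cues such as appearing in headings. The boost values and vector sources here are placeholders, not the paper's method.

```python
import numpy as np

def rank_candidates(doc_vec, cand_vecs, structure_boost=None):
    """Score candidates by cosine similarity to the document embedding,
    optionally reweighted by structural cues; returns indices, best first."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = np.array([cos(doc_vec, c) for c in cand_vecs])
    if structure_boost is not None:
        scores *= structure_boost
    return np.argsort(-scores)

rng = np.random.default_rng(0)
doc, cands = rng.random(300), rng.random((5, 300))
print(rank_candidates(doc, cands, structure_boost=np.array([1.0, 1.3, 1.0, 1.0, 0.8])))
```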


2018 ◽  
Vol 9 (2) ◽  
pp. 1-22 ◽  
Author(s):  
Rafiya Jan ◽  
Afaq Alam Khan

Social networks are considered the most abundant sources of affective information for sentiment and emotion classification. Emotion classification is the challenging task of classifying emotions into different types. Even though emotions are universal, their automatic detection is considered a difficult task to perform. A lot of research is being conducted in the field of automatic emotion detection in textual data streams, but very little attention is paid to capturing the semantic features of the text. In this article, the authors present a technique that uses semantic relatedness for automatic emotion classification in text, based on distributional semantic models. This approach uses semantic similarity to measure the coherence between two emotionally related entities. Before classification, the data is pre-processed to remove irrelevant fields and inconsistencies and to improve performance. The proposed approach achieved an accuracy of 71.795%, which is competitive considering that no training or annotation of the data is required.
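
A minimal sketch of classification by semantic relatedness, assuming precomputed distributional vectors for the text and for emotion prototypes; the vector sources and emotion set are placeholders, not the paper's configuration.

```python
import numpy as np

EMOTIONS = ["joy", "anger", "fear", "sadness"]

def classify(text_vec, emotion_vecs):
    """Pick the emotion whose prototype vector is most similar (cosine)
    to the text vector -- no labelled training data required."""
    sims = [float(text_vec @ e / (np.linalg.norm(text_vec) * np.linalg.norm(e)))
            for e in emotion_vecs]
    return EMOTIONS[int(np.argmax(sims))]

rng = np.random.default_rng(1)
print(classify(rng.random(50), rng.random((4, 50))))
```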


2012 ◽  
Vol 3 (3) ◽  
pp. 1-14 ◽  
Author(s):  
Reda Mohamed Hamou ◽  
Abdelmalek Amine ◽  
Ahmed Chaouki Lokbani

In this paper the authors experiment with and test a new biomimetic approach based on social spiders to solve a combinatorial problem, namely the automatic classification of texts, motivated by the very large data streams flowing, particularly on the web. Textual data were represented by a language-independent method, i.e., character and word n-grams, because there is currently no learning method that can directly represent unstructured data (text). To validate the classification, the authors used an evaluation measure based on recall and precision (the F-measure). During the experiments, the authors found social spiders to be a powerful visualization tool, which they exploit to perform visual classification.
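
For reference, the F-measure used for evaluation combines precision and recall as their weighted harmonic mean; a minimal implementation:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta=1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.8, 0.7))  # 0.7466...
```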

