LSA & LDA topic modeling classification: comparison study on e-books

Author(s):  
Shaymaa H. Mohammed,
Salam Al-augby

With the rapid growth of information technology, the amount of unstructured text data in digital libraries has increased rapidly, and analyzing, organizing, and automatically classifying texts in e-research repositories so that their value can be realized has become a major challenge. Manual categorization of text documents requires substantial financial and human resources, so topic modeling is used to classify documents instead. This paper presents a comparison study on classifying scientific unstructured text documents (e-books) based on their full text, applying the two most popular topic modeling approaches (LDA and LSA) to cluster words into sets of topics that serve as important keywords for classification. Our dataset consists of 300 books containing about 23 million words of full text. In the topic models used (LSA and LDA), each word in the vocabulary of the corpus is associated with one or more topics with a probability estimated by the model. Many LDA and LSA models were built with different numbers of topics, and the one producing the highest coherence value was selected. The results show that LDA performs better than LSA: the best LDA result was a coherence value of 0.592179 at 20 topics, while the best LSA coherence value was 0.5773026 at 10 topics.
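
For readers who want to see the model-selection loop the abstract describes, here is a minimal sketch in Python using the gensim library; the placeholder texts, the topic range, and the choice of c_v coherence are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch: train LDA and LSA over several topic counts and keep
# the model with the highest coherence. Texts below are placeholders.
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel, LsiModel

texts = [["topic", "model", "library", "text"],
         ["digital", "library", "text", "classification"]]  # placeholder corpus

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

def best_by_coherence(model_cls, topic_range):
    """Train one model per topic count; return the highest-coherence one."""
    scored = []
    for k in topic_range:
        model = model_cls(bow_corpus, num_topics=k, id2word=dictionary)
        cm = CoherenceModel(model=model, texts=texts,
                            dictionary=dictionary, coherence="c_v")
        scored.append((cm.get_coherence(), k, model))
    return max(scored, key=lambda s: s[0])

# A real run would use the full 300-book corpus and a wider topic range.
lda_score, lda_k, _ = best_by_coherence(LdaModel, range(2, 6))
lsa_score, lsa_k, _ = best_by_coherence(LsiModel, range(2, 6))
print(f"LDA: coherence {lda_score:.4f} at {lda_k} topics")
print(f"LSA: coherence {lsa_score:.4f} at {lsa_k} topics")
```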

2020, Vol 25 (6), pp. 755-769
Author(s):  
Noorullah R. Mohammed,
Moulana Mohammed

Text data clustering organizes a set of text documents into a desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of derived topics is a vital task in text data clustering. Each tweet is treated as a text document, and various topic models are used to model tweets. In existing topic models, the clustering tendency of tweets is initially assessed using Euclidean dissimilarity features, but the cosine metric is more suitable for a more informative assessment, especially in text clustering. This paper therefore develops a novel cosine-based external and internal validity assessment of clustering tendency to improve the computational efficiency of tweet data clustering. In the experiments, tweet clustering results are evaluated using cluster validity indices, and the results show that the cosine-based internal and external validity metrics outperform their Euclidean counterparts on both benchmark and Twitter-based datasets.
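
As a hedged illustration of the Euclidean-versus-cosine contrast, the sketch below clusters a few placeholder tweets with TF-IDF features and compares silhouette scores under both metrics; this is only the underlying idea, not the paper's specific validity indices.

```python
# A minimal sketch: internal validity of the same clustering under
# Euclidean and cosine dissimilarity. The tweets are placeholder data.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

tweets = ["traffic jam downtown", "heavy traffic on the highway",
          "new phone release today", "phone camera review"]

X = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("euclidean:", silhouette_score(X, labels, metric="euclidean"))
print("cosine:   ", silhouette_score(X, labels, metric="cosine"))
```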


Author(s):  
Byung-Kwon Park,
Il-Yeol Song

As the amount of data inside and outside an enterprise grows rapidly, it becomes important to analyze it seamlessly for total business intelligence. This data falls into two categories, structured and unstructured, and both must be analyzed together. Since most business data consists of unstructured text documents, including web pages on the Internet, a Text OLAP solution is needed to perform multidimensional analysis of text documents in the same way as structured relational data. We first survey representative works that demonstrate how text mining and information retrieval, the major technologies for handling text data, can be applied to multidimensional analysis of text documents. We then survey representative works that demonstrate how unstructured text documents and structured relational data can be associated and consolidated to obtain total business intelligence. Finally, we present a future business intelligence platform architecture along with related research topics. We expect the proposed heterogeneous business intelligence architecture, which integrates information retrieval, text mining, information extraction, and relational OLAP technologies, to provide a better platform toward total business intelligence.
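
To make the Text OLAP idea concrete, here is a minimal sketch that assumes documents have already been tagged with topic and year dimensions by an upstream text mining step; the cube layout and the data are illustrative, not the surveyed systems' designs.

```python
# A minimal sketch: roll up text documents along topic x year dimensions,
# the way a relational OLAP cube aggregates facts. Data is illustrative.
import pandas as pd

docs = pd.DataFrame({
    "topic":  ["finance", "finance", "health", "health"],
    "year":   [2019, 2020, 2020, 2020],
    "doc_id": [1, 2, 3, 4],
})

# Document counts per (topic, year) cell of the cube.
print(docs.pivot_table(index="topic", columns="year",
                       values="doc_id", aggfunc="count", fill_value=0))
```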


Author(s):  
Junzo Watada,
Keisuke Aoki,
Masahiro Kawano,
Muhammad Suzuri Hitam,
...  

The availability of multimedia text document information has spread text mining among researchers. Text documents integrate numerical and linguistic data, making text mining interesting and challenging. We propose text mining based on a fuzzy quantification model and a fuzzy thesaurus. In text mining, we focus on: 1) sentences in Japanese text that are broken down into words; 2) a fuzzy thesaurus for finding words that match keywords in the text; and 3) fuzzy multivariate analysis to analyze semantic meaning in predefined case studies. We use a fuzzy thesaurus to translate words written in Chinese and Japanese characters into keywords, which speeds up processing without requiring a dictionary to separate words. Fuzzy multivariate analysis is used to analyze the processed data and to extract latent mutually related structures in the text data, i.e., to extract otherwise obscured knowledge. We apply dual scaling to mining library and web page text information, and propose integrating the results into Kansei engineering for possible application in sales, marketing, and production.
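
A minimal sketch of the fuzzy-thesaurus step follows; the thesaurus entries, membership grades, and the max aggregation operator are illustrative assumptions, not the paper's actual model.

```python
# A minimal sketch: map words to keywords with fuzzy membership grades and
# aggregate per document with the fuzzy max operator. Data is illustrative.
fuzzy_thesaurus = {
    "novel":   {"literature": 0.9, "fiction": 0.8},
    "story":   {"literature": 0.7, "fiction": 0.9},
    "catalog": {"library": 0.9},
}

def keyword_profile(words):
    """Fuzzy keyword profile of a document: max membership per keyword."""
    profile = {}
    for w in words:
        for kw, grade in fuzzy_thesaurus.get(w, {}).items():
            profile[kw] = max(profile.get(kw, 0.0), grade)
    return profile

print(keyword_profile(["novel", "story", "catalog"]))
# {'literature': 0.9, 'fiction': 0.9, 'library': 0.9}
```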


2021, Vol 21 (3), pp. 3-10
Author(s):  
Petr ŠALOUN,
Barbora CIGÁNKOVÁ,
David ANDREŠIČ,
Lenka KRHUTOVÁ,
...  

For a long time, both professionals and the lay public showed little interest in informal carers, yet these people deal with multiple common issues in their everyday lives. As the population ages, we can observe a change in this attitude, and thanks to advances in computer science we can offer informal carers effective assistance and support by providing necessary information and connecting them with both the professional and lay communities. In this work we describe a project called "Research and development of support networks and information systems for informal carers for persons after stroke", which produces an information system visible to the public as a web portal. The portal does not provide just a simple set of information: using artificial intelligence, text document classification, and crowdsourcing to further improve classification accuracy, it also provides effective visualization of and navigation over content created mostly by the community itself and personalized to the phase of the informal carer's care-taking timeline. It can be beneficial for informal carers because it lets them find content specific to their current situation. This work describes our approach to the classification of text documents and its improvement through crowdsourcing. The goal is to test a text document classifier based on document similarity measured with the N-gram method, and to design an evaluation and crowdsourcing-based mechanism for improving the classification. The crowdsourcing interface was created with the CMS WordPress. In addition to data collection, the interface serves to evaluate classification accuracy, which extends the classifier's test data set and thus makes the classification more successful.
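
Here is a minimal sketch of classification by N-gram document similarity; the character trigrams, the cosine comparison, and the two labeled documents are illustrative assumptions, not the project's actual classifier or data.

```python
# A minimal sketch: classify a document by comparing its character-trigram
# profile against labeled reference profiles. Data is illustrative.
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = (sum(v * v for v in a.values()) ** 0.5) \
         * (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

labeled = {"care advice": "how to support a person after stroke at home",
           "technology":  "web portal built on a content management system"}
profiles = {label: ngrams(doc) for label, doc in labeled.items()}

query = ngrams("home support for stroke survivors")
print(max(profiles, key=lambda lbl: similarity(query, profiles[lbl])))
```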


Author(s):  
Nilupulee Nathawitharana,
Damminda Alahakoon,
Sumith Matharage

Humans are used to expressing themselves in written language, which provides a medium for describing our experiences in detail while incorporating individuality. Even though documents provide a rich source of information, it becomes very difficult to identify, extract, summarize, and search them when vast numbers of documents accumulate, especially over time. Document clustering is a technique widely used to group documents by similarity of content as represented by the words used; once key groups are identified, hierarchical clustering facilitates further drill-down into sub-groupings. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms, which can be repeated across documents associated with different topics. Text data therefore cannot be considered a direct 'coding' of a particular experience or situation, in contrast to numerical and categorical data, and term overlap is a very common characteristic of text clustering. In this paper we propose a new technique and methodology for capturing term overlap in text documents, highlight the different situations such overlap can signify, and discuss why this understanding is important for obtaining value from text clustering. Experiments conducted on a widely used text document collection show that the proposed methodology allows exploring the term diversity of a given document collection and obtaining clusters with minimum term overlap.
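
A minimal sketch of measuring term overlap between clusters follows; summarizing each cluster by a top-term set and using Jaccard overlap are illustrative assumptions, not the paper's proposed technique.

```python
# A minimal sketch: pairwise term overlap between clusters, each summarized
# by its top terms. Note "bank" appears in both, under different senses.
clusters = {
    0: {"bank", "loan", "credit", "interest"},
    1: {"river", "bank", "water", "flood"},
}

def jaccard_overlap(a, b):
    """Share of terms common to two clusters' term sets."""
    return len(a & b) / len(a | b)

for i in clusters:
    for j in clusters:
        if i < j:
            print(f"clusters {i}-{j}: overlap {jaccard_overlap(clusters[i], clusters[j]):.2f}")
```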


Author(s):  
Samson Oluwaseun Fadiya

Text analytics applies to most businesses, particularly in the education segment; for instance, if an association or university suspects that data secrets are being leaked to competitors by its workers, a text analytics investigation can help dissect large numbers of employees' email messages. The massive volume of both structured and unstructured data originates principally from social media and Web 2.0. The analysis of online messages, tweets, and other types of unstructured text data constitutes what we call text analytics, which has developed steadily over the last few years through the emergence of various algorithms and applications for processing data alongside privacy and IT security. This chapter aims to identify common problems faced when using the different media of data in education and shows how such information can be analyzed through sentiment analysis with text analytics, extracting useful information from text documents using IBM's Annotation Query Language (AQL).
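
The chapter's pipeline is built on IBM's AQL, which is not reproduced here; as a stand-in for the sentiment-analysis step, here is a minimal lexicon-based sketch in Python with an entirely illustrative lexicon.

```python
# A minimal lexicon-based sentiment sketch (a stand-in for the chapter's
# AQL-based extraction, not its actual queries). The lexicon is illustrative.
positive = {"good", "great", "helpful", "excellent"}
negative = {"bad", "poor", "confusing", "late"}

def sentiment(text):
    """Label a text by counting positive versus negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("great and helpful feedback"))       # positive
print(sentiment("poor and confusing instructions"))  # negative
```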


2009, Vol 28 (3), pp. 143
Author(s):  
Przemyslaw Skibiński,
Jakub Swacha

In this paper we investigate the possibility of improving the efficiency of data compression, and thus reducing storage requirements, for seven widely used text document formats. We propose an open-source text compression software library, featuring an advanced word-substitution scheme with static and semidynamic word dictionaries. The empirical results show an average storage space reduction as high as 78 percent compared to uncompressed documents, and as high as 30 percent compared to documents compressed with the free compression software gzip.
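
To illustrate the word-substitution idea, here is a minimal sketch in the spirit of the paper's scheme; the static dictionary, the one-character codes, and the use of gzip as the back-end compressor are illustrative assumptions, not the library's actual format.

```python
# A minimal sketch: replace frequent dictionary words with short codes
# before standard compression. Dictionary and codes are illustrative.
import gzip

STATIC_DICT = ["the", "and", "document", "compression", "library"]
CODES = {w: chr(1 + i) for i, w in enumerate(STATIC_DICT)}  # stand-in codes

def substitute(text):
    """Replace dictionary words with one-character codes."""
    return " ".join(CODES.get(tok, tok) for tok in text.split(" "))

text = "the compression library and the document model " * 200
plain = gzip.compress(text.encode())
subst = gzip.compress(substitute(text).encode())
print(f"raw {len(text)} bytes, gzip {len(plain)}, substituted+gzip {len(subst)}")
```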


2020, Vol 10 (11), pp. 4009
Author(s):  
Asmaa M. Aubaid,
Alok Mishra

With the growth of online information and the sudden expansion in the number of electronic documents available on websites and in electronic libraries, categorizing text documents has become difficult. A rule-based approach offers a solution to this problem, and the purpose of this study is to classify documents using one. This paper combines a rule-based approach with the document-to-vector (doc2vec) embedding technique. An experiment was performed on two data sets, Reuters-21578 and 20 Newsgroups, classifying their top ten categories with the document-to-vector rule-based method (D2vecRule). The method achieved good classification results in terms of F-measure and run-time metrics, and it compared favorably with other algorithms such as JRip, OneR, and ZeroR applied to the same Reuters-21578 dataset.
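
Here is a minimal sketch of pairing doc2vec embeddings with a simple classification rule, using gensim; the two-category toy corpus, the similarity rule, and the threshold are illustrative assumptions, not the paper's D2vecRule settings.

```python
# A minimal sketch: embed labeled documents with doc2vec, then apply a
# rule assigning the most similar category above a confidence threshold.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(["wheat", "corn", "harvest"], ["grain"]),
        TaggedDocument(["oil", "barrel", "crude"], ["crude"])]  # toy corpus
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

vec = model.infer_vector(["crude", "oil", "price"])
tag, score = model.dv.most_similar([vec], topn=1)[0]
# Rule: accept the nearest category only if similarity clears a threshold.
print(tag if score > 0.2 else "unclassified", score)
```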


Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
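
Since ldagibbs itself is a Stata command, here is a hedged Python analogue using gensim that shows the two distributions the article describes; the toy corpus is illustrative and the gensim estimator is a stand-in, not the command's Gibbs sampler.

```python
# A minimal sketch: the document-over-topics and topic-over-words
# distributions that latent Dirichlet allocation produces.
from gensim import corpora
from gensim.models import LdaModel

texts = [["tax", "policy", "vote"], ["genome", "cell", "protein"]]  # toy corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Each document as a probability distribution over topics.
print(lda.get_document_topics(corpus[0]))
# Each topic as a probability distribution over words.
print(lda.show_topic(0))
```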

