LSA & LDA topic modeling classification: comparison study on e-books

Author(s):  
Shaymaa H. Mohammed,
Salam Al-augby

With the rapid growth of information technology, the amount of unstructured text data in digital libraries has increased rapidly, and analyzing, organizing, and automatically classifying texts in e-research repositories so that their value can be realized has become a major challenge. Manual categorization of text documents requires substantial financial and human resources, so topic modeling is used to classify documents instead. This paper presents a comparison study on classifying scientific unstructured text documents (e-books) based on their full text, applying the two most popular topic modeling approaches (LDA and LSA) to cluster words into sets of topics that serve as important keywords for classification. Our dataset consists of 300 books containing about 23 million words of full text. In the topic models used (LSA and LDA), each word in the vocabulary of the corpus is associated with one or more topics with a probability estimated by the model. Many LDA and LSA models were built with different numbers of topics, and the one producing the highest coherence value was selected. The results show that LDA performs better than LSA: the best LDA result was a coherence value of 0.592179 at 20 topics, while the best LSA coherence value was 0.5773026 at 10 topics.
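
For readers who want to see the model-selection loop the abstract describes, here is a minimal sketch in Python using the gensim library; the placeholder texts, the topic range, and the choice of c_v coherence are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch: train LDA and LSA over several topic counts and keep
# the model with the highest coherence. Texts below are placeholders.
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel, LsiModel

texts = [["topic", "model", "library", "text"],
         ["digital", "library", "text", "classification"]]  # placeholder corpus

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

def best_by_coherence(model_cls, topic_range):
    """Train one model per topic count; return the highest-coherence one."""
    scored = []
    for k in topic_range:
        model = model_cls(bow_corpus, num_topics=k, id2word=dictionary)
        cm = CoherenceModel(model=model, texts=texts,
                            dictionary=dictionary, coherence="c_v")
        scored.append((cm.get_coherence(), k, model))
    return max(scored, key=lambda s: s[0])

# A real run would use the full 300-book corpus and a wider topic range.
lda_score, lda_k, _ = best_by_coherence(LdaModel, range(2, 6))
lsa_score, lsa_k, _ = best_by_coherence(LsiModel, range(2, 6))
print(f"LDA: coherence {lda_score:.4f} at {lda_k} topics")
print(f"LSA: coherence {lsa_score:.4f} at {lsa_k} topics")
```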

2020, Vol 25 (6), pp. 755-769
Author(s):  
Noorullah R. Mohammed,
Moulana Mohammed

Text data clustering organizes a set of text documents into a desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of derived topics is a vital task in text data clustering. Each tweet is treated as a text document, and various topic models are used to model tweets. In existing topic models, the clustering tendency of tweets is initially assessed using Euclidean dissimilarity features, but the cosine metric is more suitable for a more informative assessment, especially in text clustering. This paper therefore develops a novel cosine-based external and internal validity assessment of clustering tendency to improve the computational efficiency of tweet data clustering. In the experiments, tweet clustering results are evaluated using cluster validity indices, and the results show that the cosine-based internal and external validity metrics outperform their Euclidean counterparts on both benchmark and Twitter-based datasets.
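
As a hedged illustration of the Euclidean-versus-cosine contrast, the sketch below clusters a few placeholder tweets with TF-IDF features and compares silhouette scores under both metrics; this is only the underlying idea, not the paper's specific validity indices.

```python
# A minimal sketch: internal validity of the same clustering under
# Euclidean and cosine dissimilarity. The tweets are placeholder data.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

tweets = ["traffic jam downtown", "heavy traffic on the highway",
          "new phone release today", "phone camera review"]

X = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("euclidean:", silhouette_score(X, labels, metric="euclidean"))
print("cosine:   ", silhouette_score(X, labels, metric="cosine"))
```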


Author(s):  
Byung-Kwon Park,
Il-Yeol Song

As the amount of data inside and outside an enterprise grows rapidly, it becomes important to analyze it seamlessly for total business intelligence. This data falls into two categories, structured and unstructured, and both must be analyzed together. Since most business data consists of unstructured text documents, including web pages on the Internet, a Text OLAP solution is needed to perform multidimensional analysis of text documents in the same way as structured relational data. We first survey representative works that demonstrate how text mining and information retrieval, the major technologies for handling text data, can be applied to multidimensional analysis of text documents. We then survey representative works that demonstrate how unstructured text documents and structured relational data can be associated and consolidated to obtain total business intelligence. Finally, we present a future business intelligence platform architecture along with related research topics. We expect the proposed heterogeneous business intelligence architecture, which integrates information retrieval, text mining, information extraction, and relational OLAP technologies, to provide a better platform toward total business intelligence.
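
To make the Text OLAP idea concrete, here is a minimal sketch that assumes documents have already been tagged with topic and year dimensions by an upstream text mining step; the cube layout and the data are illustrative, not the surveyed systems' designs.

```python
# A minimal sketch: roll up text documents along topic x year dimensions,
# the way a relational OLAP cube aggregates facts. Data is illustrative.
import pandas as pd

docs = pd.DataFrame({
    "topic":  ["finance", "finance", "health", "health"],
    "year":   [2019, 2020, 2020, 2020],
    "doc_id": [1, 2, 3, 4],
})

# Document counts per (topic, year) cell of the cube.
print(docs.pivot_table(index="topic", columns="year",
                       values="doc_id", aggfunc="count", fill_value=0))
```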


Author(s):  
Junzo Watada,
Keisuke Aoki,
Masahiro Kawano,
Muhammad Suzuri Hitam,
...  

The availability of multimedia text document information has spread text mining among researchers. Text documents integrate numerical and linguistic data, making text mining interesting and challenging. We propose text mining based on a fuzzy quantification model and a fuzzy thesaurus. In text mining, we focus on: 1) sentences in Japanese text that are broken down into words; 2) a fuzzy thesaurus for finding words that match keywords in the text; and 3) fuzzy multivariate analysis to analyze semantic meaning in predefined case studies. We use a fuzzy thesaurus to translate words written in Chinese and Japanese characters into keywords, which speeds up processing without requiring a dictionary to separate words. Fuzzy multivariate analysis is used to analyze the processed data and to extract latent mutually related structures in the text data, i.e., to extract otherwise obscured knowledge. We apply dual scaling to mining library and web page text information, and propose integrating the results into Kansei engineering for possible application in sales, marketing, and production.
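
A minimal sketch of the fuzzy-thesaurus step follows; the thesaurus entries, membership grades, and the max aggregation operator are illustrative assumptions, not the paper's actual model.

```python
# A minimal sketch: map words to keywords with fuzzy membership grades and
# aggregate per document with the fuzzy max operator. Data is illustrative.
fuzzy_thesaurus = {
    "novel":   {"literature": 0.9, "fiction": 0.8},
    "story":   {"literature": 0.7, "fiction": 0.9},
    "catalog": {"library": 0.9},
}

def keyword_profile(words):
    """Fuzzy keyword profile of a document: max membership per keyword."""
    profile = {}
    for w in words:
        for kw, grade in fuzzy_thesaurus.get(w, {}).items():
            profile[kw] = max(profile.get(kw, 0.0), grade)
    return profile

print(keyword_profile(["novel", "story", "catalog"]))
# {'literature': 0.9, 'fiction': 0.9, 'library': 0.9}
```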


2021, Vol 21 (3), pp. 3-10
Author(s):  
Petr ŠALOUN,
Barbora CIGÁNKOVÁ,
David ANDREŠIČ,
Lenka KRHUTOVÁ,
...  

For a long time, both professionals and the lay public showed little interest in informal carers, yet these people deal with multiple common issues in their everyday lives. As the population ages, we can observe a change in this attitude, and thanks to advances in computer science we can offer informal carers effective assistance and support by providing necessary information and connecting them with both the professional and lay communities. In this work we describe a project called "Research and development of support networks and information systems for informal carers for persons after stroke", which produces an information system visible to the public as a web portal. The portal does not provide just a simple set of information: using artificial intelligence, text document classification, and crowdsourcing to further improve classification accuracy, it also provides effective visualization of and navigation over content created mostly by the community itself and personalized to the phase of the informal carer's care-taking timeline. It can be beneficial for informal carers because it lets them find content specific to their current situation. This work describes our approach to the classification of text documents and its improvement through crowdsourcing. The goal is to test a text document classifier based on document similarity measured with the N-gram method, and to design an evaluation and crowdsourcing-based mechanism for improving the classification. The crowdsourcing interface was created with the CMS WordPress. In addition to data collection, the interface serves to evaluate classification accuracy, which extends the classifier's test data set and thus makes the classification more successful.
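
Here is a minimal sketch of classification by N-gram document similarity; the character trigrams, the cosine comparison, and the two labeled documents are illustrative assumptions, not the project's actual classifier or data.

```python
# A minimal sketch: classify a document by comparing its character-trigram
# profile against labeled reference profiles. Data is illustrative.
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = (sum(v * v for v in a.values()) ** 0.5) \
         * (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

labeled = {"care advice": "how to support a person after stroke at home",
           "technology":  "web portal built on a content management system"}
profiles = {label: ngrams(doc) for label, doc in labeled.items()}

query = ngrams("home support for stroke survivors")
print(max(profiles, key=lambda lbl: similarity(query, profiles[lbl])))
```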


Author(s):  
Nilupulee Nathawitharana,
Damminda Alahakoon,
Sumith Matharage

Humans are used to expressing themselves in written language, which provides a medium for describing our experiences in detail while incorporating individuality. Even though documents provide a rich source of information, it becomes very difficult to identify, extract, summarize, and search them when vast numbers of documents accumulate, especially over time. Document clustering is a technique widely used to group documents by similarity of content as represented by the words used; once key groups are identified, hierarchical clustering facilitates further drill-down into sub-groupings. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms, which can be repeated across documents associated with different topics. Text data therefore cannot be considered a direct 'coding' of a particular experience or situation, in contrast to numerical and categorical data, and term overlap is a very common characteristic of text clustering. In this paper we propose a new technique and methodology for capturing term overlap in text documents, highlight the different situations such overlap can signify, and discuss why this understanding is important for obtaining value from text clustering. Experiments conducted on a widely used text document collection show that the proposed methodology allows exploring the term diversity of a given document collection and obtaining clusters with minimum term overlap.
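
A minimal sketch of measuring term overlap between clusters follows; summarizing each cluster by a top-term set and using Jaccard overlap are illustrative assumptions, not the paper's proposed technique.

```python
# A minimal sketch: pairwise term overlap between clusters, each summarized
# by its top terms. Note "bank" appears in both, under different senses.
clusters = {
    0: {"bank", "loan", "credit", "interest"},
    1: {"river", "bank", "water", "flood"},
}

def jaccard_overlap(a, b):
    """Share of terms common to two clusters' term sets."""
    return len(a & b) / len(a | b)

for i in clusters:
    for j in clusters:
        if i < j:
            print(f"clusters {i}-{j}: overlap {jaccard_overlap(clusters[i], clusters[j]):.2f}")
```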


Author(s):  
Samson Oluwaseun Fadiya

Text analytics applies to most businesses, particularly in the education segment; for instance, if an association or university suspects that data secrets are being leaked to competitors by its workers, a text analytics investigation can help dissect large numbers of employees' email messages. The massive volume of both structured and unstructured data originates principally from social media and Web 2.0. The analysis of online messages, tweets, and other types of unstructured text data constitutes what we call text analytics, which has developed steadily over the last few years through the emergence of various algorithms and applications for processing data alongside privacy and IT security. This chapter aims to identify common problems faced when using the different media of data in education and shows how such information can be analyzed through sentiment analysis with text analytics, extracting useful information from text documents using IBM's Annotation Query Language (AQL).
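
The chapter's pipeline is built on IBM's AQL, which is not reproduced here; as a stand-in for the sentiment-analysis step, here is a minimal lexicon-based sketch in Python with an entirely illustrative lexicon.

```python
# A minimal lexicon-based sentiment sketch (a stand-in for the chapter's
# AQL-based extraction, not its actual queries). The lexicon is illustrative.
positive = {"good", "great", "helpful", "excellent"}
negative = {"bad", "poor", "confusing", "late"}

def sentiment(text):
    """Label a text by counting positive versus negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("great and helpful feedback"))       # positive
print(sentiment("poor and confusing instructions"))  # negative
```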


2009, Vol 28 (3), pp. 143
Author(s):  
Przemyslaw Skibiński,
Jakub Swacha

In this paper we investigate the possibility of improving the efficiency of data compression, and thus reducing storage requirements, for seven widely used text document formats. We propose an open-source text compression software library, featuring an advanced word-substitution scheme with static and semidynamic word dictionaries. The empirical results show an average storage space reduction as high as 78 percent compared to uncompressed documents, and as high as 30 percent compared to documents compressed with the free compression software gzip.
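
To illustrate the word-substitution idea, here is a minimal sketch in the spirit of the paper's scheme; the static dictionary, the one-character codes, and the use of gzip as the back-end compressor are illustrative assumptions, not the library's actual format.

```python
# A minimal sketch: replace frequent dictionary words with short codes
# before standard compression. Dictionary and codes are illustrative.
import gzip

STATIC_DICT = ["the", "and", "document", "compression", "library"]
CODES = {w: chr(1 + i) for i, w in enumerate(STATIC_DICT)}  # stand-in codes

def substitute(text):
    """Replace dictionary words with one-character codes."""
    return " ".join(CODES.get(tok, tok) for tok in text.split(" "))

text = "the compression library and the document model " * 200
plain = gzip.compress(text.encode())
subst = gzip.compress(substitute(text).encode())
print(f"raw {len(text)} bytes, gzip {len(plain)}, substituted+gzip {len(subst)}")
```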


2020, Vol 10 (11), pp. 4009
Author(s):  
Asmaa M. Aubaid,
Alok Mishra

With the growth of online information and the sudden expansion in the number of electronic documents available on websites and in electronic libraries, categorizing text documents has become difficult. A rule-based approach offers a solution to this problem, and the purpose of this study is to classify documents using one. This paper combines a rule-based approach with the document-to-vector (doc2vec) embedding technique. An experiment was performed on two data sets, Reuters-21578 and 20 Newsgroups, classifying their top ten categories with the document-to-vector rule-based method (D2vecRule). The method achieved good classification results in terms of F-measure and run-time metrics, and it compared favorably with other algorithms such as JRip, OneR, and ZeroR applied to the same Reuters-21578 dataset.
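
Here is a minimal sketch of pairing doc2vec embeddings with a simple classification rule, using gensim; the two-category toy corpus, the similarity rule, and the threshold are illustrative assumptions, not the paper's D2vecRule settings.

```python
# A minimal sketch: embed labeled documents with doc2vec, then apply a
# rule assigning the most similar category above a confidence threshold.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(["wheat", "corn", "harvest"], ["grain"]),
        TaggedDocument(["oil", "barrel", "crude"], ["crude"])]  # toy corpus
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

vec = model.infer_vector(["crude", "oil", "price"])
tag, score = model.dv.most_similar([vec], topn=1)[0]
# Rule: accept the nearest category only if similarity clears a threshold.
print(tag if score > 0.2 else "unclassified", score)
```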


Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
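
Since ldagibbs itself is a Stata command, here is a hedged Python analogue using gensim that shows the two distributions the article describes; the toy corpus is illustrative and the gensim estimator is a stand-in, not the command's Gibbs sampler.

```python
# A minimal sketch: the document-over-topics and topic-over-words
# distributions that latent Dirichlet allocation produces.
from gensim import corpora
from gensim.models import LdaModel

texts = [["tax", "policy", "vote"], ["genome", "cell", "protein"]]  # toy corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Each document as a probability distribution over topics.
print(lda.get_document_topics(corpus[0]))
# Each topic as a probability distribution over words.
print(lda.show_topic(0))
```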

