Deep mixtures of unigrams for uncovering topics in textual data

2021 ◽  
Vol 31 (3) ◽  
Author(s):  
Cinzia Viroli ◽  
Laura Anderlucci

Mixtures of unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we develop a deep version of mixtures of unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for additional latent layers; the proposal is derived in a Bayesian framework. The behavior of the deep mixtures of unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely k-means with cosine distance, k-means with Euclidean distance on data transformed by semantic analysis, partitioning around medoids, mixtures of Gaussians on semantic-based transformed data, hierarchical clustering by Ward's method with cosine dissimilarity, latent Dirichlet allocation, mixtures of unigrams estimated via the EM algorithm, spectral clustering, and affinity propagation clustering. Performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real-data analyses show that going deep in clustering such data substantially improves classification accuracy.
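The shallow mixture-of-unigrams model at the core of this comparison can be sketched with a small EM implementation. The toy corpus, the farthest-point initialisation, and all parameter values below are illustrative assumptions; this is not the paper's Bayesian deep variant:

```python
import numpy as np

def fit_mixture_of_unigrams(X, n_components=2, n_iter=50, smoothing=1e-2):
    """EM for a mixture of unigrams: component k has mixing weight pi_k and
    a multinomial term distribution theta_k over the vocabulary."""
    n_docs, n_terms = X.shape
    # Farthest-point initialisation: seed each component's term distribution
    # from a document dissimilar (in cosine terms) to the seeds chosen so far.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    chosen = [0]
    for _ in range(n_components - 1):
        sim = unit @ unit[chosen].T
        chosen.append(int(sim.max(axis=1).argmin()))
    theta = X[chosen] + smoothing
    theta /= theta.sum(axis=1, keepdims=True)
    pi = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_r = np.log(pi) + X @ np.log(theta).T
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and term distributions.
        pi = resp.mean(axis=0)
        theta = resp.T @ X + smoothing
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, resp

# Toy corpus: two topics over a 10-term vocabulary, using disjoint halves.
rng = np.random.default_rng(1)
docs_a = rng.multinomial(30, [0.2] * 5 + [0.0] * 5, size=10)
docs_b = rng.multinomial(30, [0.0] * 5 + [0.2] * 5, size=10)
X = np.vstack([docs_a, docs_b]).astype(float)
pi, theta, resp = fit_mixture_of_unigrams(X)
labels = resp.argmax(axis=1)
```

With well-separated term distributions, the hard assignments recovered from the responsibilities match the two generating topics.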

2017 ◽  
Vol 11 (03) ◽  
pp. 373-389
Author(s):  
Sara Santilli ◽  
Laura Nota ◽  
Giovanni Pilato

In the present work, Latent Semantic Analysis of textual data was applied to texts related to courage, in order to compare and contrast results and evaluate the opportunity of integrating different data sets. To better understand the definition of courage in the Italian context, 1199 participants were involved in the present study and were asked to complete the sentence “Courage is[Formula: see text]”. The participants' definitions of courage were analyzed with Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), in order to study the fundamental concepts arising from the population. An analogous comparison with Twitter posts was also carried out to analyze whether the public opinion emerging from social media provides a challenging and rich context in which to explore computational models of natural language.
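The LSA step on short free-text definitions can be sketched with scikit-learn; the example answers and the number of latent dimensions below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical free-text completions of "Courage is ...".
answers = [
    "courage is facing fear and acting anyway",
    "courage means acting despite being afraid",
    "courage is standing up for what is right",
    "bravery is doing the right thing even when afraid",
    "overcoming fear through determined action",
]
tfidf = TfidfVectorizer().fit_transform(answers)    # term-document weights
svd = TruncatedSVD(n_components=2, random_state=0)  # the LSA projection
latent = svd.fit_transform(tfidf)                   # one 2-d vector per answer
```

Answers that load on the same latent dimension group around the same underlying concept, which is how the fundamental themes are read off the population's definitions.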


2021 ◽  
Vol 11 (13) ◽  
pp. 6113
Author(s):  
Adam Wawrzyński ◽  
Julian Szymański

To effectively process textual data, many approaches have been proposed to create text representations. The transformation of a text into a form of numbers that can be computed is crucial for applications in downstream tasks such as document classification, document summarization, and so forth. In our work, we study the quality of text representations based on statistical methods and compare them to approaches based on neural networks. We describe in detail nine different algorithms for text representation and evaluate them on five diverse datasets: BBCSport, BBC, Ohsumed, 20Newsgroups, and Reuters. The selected statistical models include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TFIDF) weighting, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). For the second group, based on deep neural networks, Partition-Smooth Inverse Frequency (P-SIF), Doc2Vec-Distributed Bag of Words Paragraph Vector (Doc2Vec-DBoW), Doc2Vec-Memory Model of Paragraph Vectors (Doc2Vec-DM), Hierarchical Attention Network (HAN) and Longformer were selected. The text representation methods were benchmarked on the document classification task, with the BoW and TFIDF models used as a baseline. Based on the identified weaknesses of the HAN method, an improvement in the form of a Hierarchical Weighted Attention Network (HWAN) is proposed. The incorporation of statistical features into HAN latent representations improves results, or provides comparable ones, on four out of five datasets. The article also presents how the length of the processed text affects the results of the HAN model and the HWAN variants.
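The BoW and TFIDF baselines from the benchmark can be sketched as scikit-learn pipelines feeding a document classifier; the toy documents, labels, and classifier choice (multinomial Naive Bayes) are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the league title race is close",
    "the new processor doubles performance",
    "the software update fixes memory bugs",
    "the laptop ships with a faster chip",
]
train_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]
test_docs = ["the goalkeeper saved the match", "the chip runs the software"]

predictions = {}
for name, vectorizer in [("bow", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    # Each pipeline turns raw text into a vector representation, then classifies.
    model = make_pipeline(vectorizer, MultinomialNB()).fit(train_docs, train_labels)
    predictions[name] = list(model.predict(test_docs))
```

Swapping the vectorizer while keeping the classifier fixed is exactly how representation quality is isolated in a classification benchmark.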


Author(s):  
Priyanka R. Patil ◽  
Shital A. Patil

Similarity View is an application for visually comparing and exploring multiple models of text across a collection of documents. Friendbook infers users' lifestyles from user-centric sensor data, measures the similarity of lifestyles among users, and recommends friends to users whose lifestyles are highly similar; motivated by this, a user's daily life is modeled as life documents, from which lifestyles are extracted using the Latent Dirichlet Allocation algorithm. Manual techniques cannot reliably be used for checking research papers, as the assigned reviewer may have insufficient knowledge of the research discipline, and differing subjective views can cause misinterpretations. There is therefore an urgent need for an effective and feasible approach to check submitted research papers with the support of automated software. Text mining methods can solve the problem of automatically checking research papers semantically. The proposed method finds the similarity of text across the collection of documents using the Latent Dirichlet Allocation (LDA) algorithm together with Latent Semantic Analysis (LSA): the LSA-with-synonym variant finds synonyms of indexed terms using the English WordNet dictionary, while LSA without synonyms measures the similarity of text on the index terms alone. The accuracy of LSA with synonyms is greater when synonyms are considered for matching.
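The effect of synonym expansion on text similarity can be sketched as follows; a tiny hand-made synonym table stands in for the English WordNet lookup, and plain TF-IDF cosine similarity stands in for the full LSA pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny hand-made synonym table standing in for the WordNet dictionary.
SYNONYMS = {"automobile": "car", "quick": "fast", "rapid": "fast"}

def canonicalize(text):
    """Replace each word by its canonical synonym before vectorizing."""
    return " ".join(SYNONYMS.get(word, word) for word in text.lower().split())

docs = [
    "the car is fast",
    "the automobile is quick",
    "a cat sleeps on the mat",
    "dogs bark at night",
]

def similarity_matrix(texts):
    # Cosine similarity over index terms (the "without synonym" baseline).
    return cosine_similarity(TfidfVectorizer().fit_transform(texts))

sim_without = similarity_matrix(docs)
sim_with = similarity_matrix([canonicalize(d) for d in docs])
```

After canonicalization, the first two documents become identical term sequences, so their similarity jumps to 1.0, whereas index-wise matching alone sees them as only loosely related.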


Author(s):  
Radha Guha

Background: In the era of information overload, it is very difficult for a human reader to quickly make sense of the vast information available on the internet. Even for a specific domain like a college or university website, it may be difficult for a user to browse through all the links to get relevant answers quickly. Objective: In this scenario, the design of a chat-bot which can answer questions related to college information and compare colleges will be very useful and novel. Methods: In this paper a novel conversational-interface chat-bot application with information retrieval and text summarization skills is designed and implemented. First, the chat-bot has a simple dialog skill: when it understands the user's query intent, it responds from a stored collection of answers. Second, for unknown queries, the chat-bot can search the internet and then perform text summarization using advanced techniques of natural language processing (NLP) and text mining (TM). Results: The NLP capabilities for information retrieval and text summarization using the machine learning techniques of Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Word2Vec, Global Vectors (GloVe) and TextRank are reviewed and compared in this paper before implementing them in the chat-bot design. This chat-bot improves user experience tremendously by answering specific queries concisely, which takes less time than reading an entire document. Students, parents and faculty can more efficiently get answers about admission criteria, fees, course offerings, notice board, attendance, grades, placements, faculty profiles, research papers, patents, etc. Conclusion: The purpose of this paper was to follow the advancements in NLP technologies and implement them in a novel application.
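The TextRank technique named above can be sketched as PageRank over a sentence-similarity graph; the sentences, damping factor, and iteration count below are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, n_keep=2, damping=0.85, n_iter=100):
    """Rank sentences by PageRank over their cosine-similarity graph and
    return the top-ranked ones in their original order."""
    n = len(sentences)
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    np.fill_diagonal(sim, 0.0)
    rowsum = sim.sum(axis=1, keepdims=True)
    # Row-normalised transition matrix; rows with no neighbours get uniform.
    trans = np.divide(sim, rowsum, out=np.full_like(sim, 1.0 / n),
                      where=rowsum > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        scores = (1.0 - damping) / n + damping * trans.T @ scores
    top = sorted(np.argsort(scores)[-n_keep:].tolist())
    return [sentences[i] for i in top]

sentences = [
    "The chatbot answers questions about admission criteria and fees.",
    "For unknown queries it searches the web and summarizes the results.",
    "Summaries are built by ranking sentences with a TextRank-style graph.",
    "Students and parents get concise answers instead of long documents.",
]
summary = textrank_summary(sentences, n_keep=2)
```

The extractive summary is simply the highest-scoring sentences returned in document order, which keeps the shortened answer readable.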


2019 ◽  
Vol 15 (4) ◽  
pp. 41-56 ◽  
Author(s):  
Ibukun Tolulope Afolabi ◽  
Opeyemi Samuel Makinde ◽  
Olufunke Oyejoke Oladipupo

Currently, for content-based recommendation, semantic analysis of text from webpages remains a major problem. In this research, we present a semantic web content mining approach for recommender systems in online shopping. The methodology is based on two major phases. The first phase is the semantic preprocessing of textual data using the combination of a developed ontology and an existing ontology. The second phase uses the Naïve Bayes algorithm to make the recommendations. The output of the system is evaluated using precision, recall and f-measure. The results show that the semantic preprocessing improved the recommendation accuracy of the recommender system by 5.2% over the existing approach. The developed system also provides a platform for content-based recommendation in online shopping. This system has an edge over existing recommender approaches because it is able to analyze the textual content of users' feedback on a product in order to provide the necessary product recommendation.
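The Naïve Bayes recommendation phase and its precision/recall/f-measure evaluation can be sketched as below; the feedback snippets and labels are invented, and the scores are computed on the training texts purely for illustration (the paper's ontology-based preprocessing is omitted):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score, recall_score, f1_score

feedback = [
    "great battery life and fast delivery",
    "battery died after a week",
    "excellent camera and sharp screen",
    "screen cracked and support was slow",
    "fast charging and solid build quality",
    "arrived broken and the refund was slow",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = recommend the product, 0 = do not

model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(feedback, labels)
predicted = model.predict(feedback)  # resubstitution, for illustration only
precision = precision_score(labels, predicted)
recall = recall_score(labels, predicted)
f_measure = f1_score(labels, predicted)
```

In a real evaluation the three scores would of course be computed on held-out feedback, not on the training texts.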


Symmetry ◽  
2021 ◽  
Vol 13 (11) ◽  
pp. 2164
Author(s):  
Héctor J. Gómez ◽  
Diego I. Gallardo ◽  
Karol I. Santoro

In this paper, we present an extension of the truncated positive normal (TPN) distribution to model positive data with high kurtosis. The new model is defined as the quotient of two random variables: a TPN-distributed numerator and a power of a standard uniform distribution as the denominator. The resulting model has greater kurtosis than the TPN distribution. We study some properties of the distribution, such as moments, asymmetry, and kurtosis. Parameter estimation is performed by the method of moments and by maximum likelihood via the expectation-maximization algorithm. We performed simulation studies to assess parameter recovery and illustrate the model with a real data application related to body weight. The computational implementation of this work is included in the tpn package for the R software.
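The quotient construction and its kurtosis-inflating effect can be simulated directly. The sketch below assumes the denominator enters as U^{1/q} with U ~ Uniform(0, 1) (the paper's exact parameterization may differ), and the parameter values are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, mu, sigma, q = 50_000, 1.0, 1.0, 10.0   # illustrative parameters

# Truncated positive normal: N(mu, sigma^2) restricted to (0, inf).
a = (0.0 - mu) / sigma                     # lower bound in standard units
z = stats.truncnorm.rvs(a, np.inf, loc=mu, scale=sigma, size=n,
                        random_state=rng)

# Quotient: TPN numerator over a power of a standard uniform (assumed form).
u = rng.uniform(size=n)
x = z / u ** (1.0 / q)

excess_tpn = stats.kurtosis(z)       # close to that of a truncated normal
excess_quotient = stats.kurtosis(x)  # heavier right tail, larger kurtosis
```

Dividing by a quantity in (0, 1) stretches the right tail, which is exactly the mechanism by which the quotient model gains kurtosis over the plain TPN; q must be large enough for the relevant moments to exist.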


Author(s):  
A.S. Li ◽  
A.J.C. Trappey ◽  
C.V. Trappey

A registered trademark distinctively identifies a company, its products or services. A trademark (TM) is a type of intellectual property (IP) which is protected by the laws of the country where the trademark is officially registered. TM owners may take legal action when their IP rights are infringed upon. TM legal cases have grown in pace with the increasing number of TMs registered globally. In this paper, an intelligent recommender system automatically identifies similar TM case precedents for any given target case to support IP legal research. This study constructs a semantic network representing the TM legal scope and terminology. A system is built to identify similar cases based on machine-readable, frame-based knowledge representations of the judgments/documents. In this research, 4,835 US TM legal cases litigated in the US district and federal courts are collected as the experimental dataset. The computer-assisted system extracts critical features based on the ontology schema. The recommender identifies similar prior cases according to the values of their features embedded in these legal documents, which include the case facts, issues under dispute, judgment holdings, and applicable rules and laws. Term frequency-inverse document frequency is used for text mining to discover the critical features of the litigated cases. A soft clustering algorithm, Latent Dirichlet Allocation, is applied to generate topics and the cases belonging to these topics. Thus, similar cases under each topic are identified for reference. Through the analysis of the similarity between cases based on TM legal semantic analysis, the intelligent recommender provides precedents to support TM legal action and strategic planning.
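The topic-based grouping step can be sketched with scikit-learn's LDA on raw term counts (the paper pairs it with TF-IDF feature extraction); the case snippets below are invented:

```python
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

cases = [
    "plaintiff alleges consumer confusion over a similar mark on shoes",
    "likelihood of confusion between the two apparel marks was found",
    "the famous mark was diluted by the defendant parody advertisement",
    "dilution by blurring of a famous beverage mark was alleged",
    "counterfeit goods bearing the registered mark were seized at the border",
    "the court found willful sale of counterfeit handbags",
]
counts = CountVectorizer(stop_words="english").fit_transform(cases)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per case

# Group cases by their dominant topic; similar cases land in the same group.
groups = defaultdict(list)
for case_id, topic in enumerate(doc_topics.argmax(axis=1)):
    groups[int(topic)].append(case_id)
```

Because LDA is a soft clustering, each case carries a full topic distribution; the hard grouping by dominant topic is only a convenient way to surface candidate precedents.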


2021 ◽  
Vol 18 (1) ◽  
pp. 34-57
Author(s):  
Weifeng Pan ◽  
Xinxin Xu ◽  
Hua Ming ◽  
Carl K. Chang

Mashup technology has become a promising way to develop and deliver applications on the web. Automatically organizing Mashups into functionally similar clusters helps improve the performance of Mashup discovery. Although there are many approaches aiming to cluster Mashups, they focus solely on utilizing semantic similarities to guide the Mashup clustering process and are unable to utilize both the structural and semantic information in Mashup profiles. In this paper, a novel approach to cluster Mashups into groups is proposed, which integrates structural similarity and semantic similarity using fuzzy AHP (fuzzy analytic hierarchy process). The structural similarity is computed from usage histories between Mashups and Web APIs using the SimRank algorithm. The semantic similarity is computed from the descriptions and tags of Mashups using LDA (Latent Dirichlet Allocation). A clustering algorithm based on the genetic algorithm is employed to cluster Mashups. Comprehensive experiments are performed on a real data set collected from ProgrammableWeb. The results show the effectiveness of the approach when compared with two kinds of conventional approaches.
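The fusion step can be sketched as follows; the two similarity matrices hold hypothetical values standing in for SimRank and LDA outputs, the weights are fixed by hand in place of the fuzzy AHP, and average-linkage hierarchical clustering stands in for the paper's genetic-algorithm-based clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise similarities for four Mashups.
structural = np.array([[1.0, 0.8, 0.1, 0.0],   # e.g. from SimRank over
                       [0.8, 1.0, 0.2, 0.1],   # Mashup-API usage links
                       [0.1, 0.2, 1.0, 0.7],
                       [0.0, 0.1, 0.7, 1.0]])
semantic = np.array([[1.0, 0.7, 0.2, 0.1],     # e.g. cosine similarity of
                     [0.7, 1.0, 0.1, 0.2],     # LDA topic distributions
                     [0.2, 0.1, 1.0, 0.8],
                     [0.1, 0.2, 0.8, 1.0]])

w_struct, w_sem = 0.4, 0.6        # weights the fuzzy AHP would supply
combined = w_struct * structural + w_sem * semantic

# Turn similarity into distance and cluster the Mashups.
dist = 1.0 - combined
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=2, criterion="maxclust")
```

Weighting the two views before clustering lets structurally linked but textually dissimilar Mashups (or vice versa) still end up in the same functional group.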


2020 ◽  
pp. 638-657
Author(s):  
Firas Ben Kharrat ◽  
Aymen Elkhleifi ◽  
Rim Faiz

This paper puts forward a new recommendation algorithm based on semantic analysis, along with new measurements. Social networks such as Facebook are among the most prominent Web 2.0 applications, with services that have evolved into practical ways of sharing opinions; social network websites have thereby become valuable data sources for opinion mining. This paper proposes to introduce an external resource, sentiment extracted from comments posted by users, in order to improve recommendation and to lessen the cold-start problem. The originality of the suggested approach lies in the fact that posts are not merely characterized by an opinion score, but instead receive a graded opinion notion. The authors' approach has been implemented with Java and the LensKit framework. It was evaluated on two real data sets, namely MovieLens and TripAdvisor, with positive results. The authors compared their algorithm to the SVD and Slope One algorithms and achieved an improvement of 10% in precision and recall, along with an improvement of 12% in RMSE and nDCG.
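The RMSE and nDCG measures used in the evaluation can be written out directly; the ratings and relevance grades below are illustrative:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def dcg(relevances):
    """Discounted cumulative gain with the 2^rel - 1 gain function."""
    rel = np.asarray(relevances, float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    return float(((2.0 ** rel - 1.0) * discounts).sum())

def ndcg(relevances):
    """DCG normalised by the best achievable ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

error = rmse([4, 3, 5], [3.5, 3, 4])   # rating-prediction error
quality = ndcg([3, 2, 0, 1])           # ranking quality of a result list
```

RMSE judges how close predicted ratings are to actual ones, while nDCG judges whether the most relevant items are ranked first; a recommender can improve on one without the other, which is why the paper reports both.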


Author(s):  
Subhadra Dutta ◽  
Eric M. O’Rourke

Natural language processing (NLP) is the field of decoding human written language. This chapter responds to the growing interest in using machine learning–based NLP approaches for analyzing open-ended employee survey responses. These techniques address scalability and the ability to provide real-time insights to make qualitative data collection equally or more desirable in organizations. The chapter walks through the evolution of text analytics in industrial–organizational psychology and discusses relevant supervised and unsupervised machine learning NLP methods for survey text data, such as latent Dirichlet allocation, latent semantic analysis, sentiment analysis, word relatedness methods, and so on. The chapter also lays out preprocessing techniques and the trade-offs of growing NLP capabilities internally versus externally, points the readers to available resources, and ends with discussing implications and future directions of these approaches.

