Topic Modelling Brazilian Supreme Court Lawsuits

Author(s):  
Pedro Henrique Luz De Araujo ◽  
Teófilo De Campos

The present work proposes the use of Latent Dirichlet Allocation to model Extraordinary Appeals received by Brazil’s Supreme Court. The data consist of a corpus of 45,532 lawsuits manually annotated by the Court’s experts with theme labels, making this a multi-class, multi-label classification task. We initially train models with 10 and 30 topics and analyze their semantics by examining each topic’s most relevant words and most representative texts, aiming to evaluate model interpretability and quality. We also train models with 30, 100, 300 and 1,000 topics, and quantitatively evaluate their potential by using the topics to generate feature vectors for each appeal. These vectors are then used to train a lawsuit theme classifier. We compare traditional bag-of-words approaches (word counts and tf-idf values) with the topic-based text representation to assess topic relevancy. Our semantic analysis of the topics demonstrates that the models with 10 and 30 topics captured some of the legal matters discussed by the Court. In addition, our experiments show that the model with 300 topics was the best text vectoriser and that the interpretable, low-dimensional representations it generates achieve good classification results.
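
A minimal sketch of the pipeline described above, assuming scikit-learn (the abstract does not name the authors' tooling); `documents` and `theme_labels` are hypothetical placeholders for the annotated corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# documents: list of lawsuit texts; theme_labels: binary indicator matrix of
# theme labels (multi-label). Both are hypothetical placeholders.
counts = CountVectorizer(max_features=50_000).fit_transform(documents)

# Fit LDA with 300 topics (the best-performing size reported) and use each
# appeal's topic distribution as a low-dimensional, interpretable feature vector.
lda = LatentDirichletAllocation(n_components=300, random_state=0)
topic_vectors = lda.fit_transform(counts)  # shape: (n_docs, 300)

# One-vs-rest logistic regression handles the multi-label theme classification.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(topic_vectors, theme_labels)
```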

2021 ◽  
Vol 11 (13) ◽  
pp. 6113
Author(s):  
Adam Wawrzyński ◽  
Julian Szymański

To effectively process textual data, many approaches have been proposed to create text representations. The transformation of a text into a form of numbers that can be computed is crucial for further applications in downstream tasks such as document classification, document summarization, and so forth. In our work, we study the quality of text representations based on statistical methods and compare them to approaches based on neural networks. We describe in detail nine different algorithms used for text representation and evaluate them on five diverse datasets: BBCSport, BBC, Ohsumed, 20Newsgroups, and Reuters. The selected statistical models include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TFIDF) weighting, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). For the second group, based on deep neural networks, Partition-Smooth Inverse Frequency (P-SIF), Doc2Vec-Distributed Bag of Words Paragraph Vector (Doc2Vec-DBoW), Doc2Vec-Memory Model of Paragraph Vectors (Doc2Vec-DM), Hierarchical Attention Network (HAN) and Longformer were selected. The text representation methods were benchmarked on the document classification task, with the BoW and TFIDF models used as a baseline. Based on the identified weaknesses of the HAN method, an improvement in the form of a Hierarchical Weighted Attention Network (HWAN) was proposed. Incorporating statistical features into HAN latent representations improves or matches the results on four out of five datasets. The article also presents how the length of the processed text affects the results of the HAN and HWAN variants.
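
A sketch of how such a benchmark can be run for the statistical representations (BoW, TFIDF, LSA, LDA), assuming scikit-learn and using 20Newsgroups, one of the five datasets; the classifier and hyperparameters are illustrative assumptions, not the paper's exact setup:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

# The four statistical representations; feature and component counts are
# illustrative, not the values used in the paper.
representations = {
    "BoW": CountVectorizer(max_features=20_000),
    "TFIDF": TfidfVectorizer(max_features=20_000),
    "LSA": make_pipeline(TfidfVectorizer(max_features=20_000),
                         TruncatedSVD(n_components=300)),
    "LDA": make_pipeline(CountVectorizer(max_features=20_000),
                         LatentDirichletAllocation(n_components=100, random_state=0)),
}

# Benchmark each representation with the same downstream classifier.
for name, rep in representations.items():
    pipe = make_pipeline(rep, LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, train.data, train.target, cv=3).mean()
    print(f"{name}: {score:.3f}")
```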


Author(s):  
Jonathan Radot Fernando ◽  
Raymond Budiraharjo ◽  
Emeraldi Haganusa

Text classification is used in many technologies, such as spam filtering, news categorization, and auto-correction. One of the most popular algorithms for text classification today is Multinomial Naïve-Bayes. This paper explains how the Naïve-Bayes assumption is used to classify comments on 2019 Indonesian Election YouTube videos. The output prediction of the algorithm is spam or not spam, where spam messages are defined as racist comments, advertising comments, and unsolicited comments. The algorithm's text representation uses the bag-of-words method, which defines a text as the multiset of its words. The algorithm then calculates the probability of a word given the class spam or not spam. The main difference between standard Naïve-Bayes and Multinomial Naïve-Bayes is how the data are treated: Multinomial Naïve-Bayes treats the data as word frequencies, which makes it suitable for text classification tasks.
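
A minimal sketch of this classifier, assuming scikit-learn; `comments` and `is_spam` are hypothetical placeholders for the annotated YouTube comment data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# comments: list of YouTube comment strings; is_spam: 0/1 labels.
# Both are hypothetical placeholders.
pipe = make_pipeline(
    CountVectorizer(),  # bag-of-words: each comment becomes word-frequency counts
    MultinomialNB(),    # models those counts with a multinomial likelihood
)
pipe.fit(comments, is_spam)

print(pipe.predict(["subscribe to my channel for free gifts"]))  # e.g. [1] (spam)
```

Because the bag-of-words vectorizer produces raw frequency counts, the multinomial likelihood is the natural fit, which is exactly the distinction from standard (e.g. Gaussian) Naïve-Bayes that the abstract draws.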


Author(s):  
Priyanka R. Patil ◽  
Shital A. Patil

Similarity View is an application for visually comparing and exploring multiple models of text over a collection of documents. Friendbook infers users' lifestyles from user-centric sensor data, measures the similarity of lifestyles between users, and recommends friends to users whose lifestyles are highly similar. Motivated by modelling a user's daily life as life documents, lifestyles are extracted using the Latent Dirichlet Allocation algorithm. Manual techniques cannot be relied on for checking research papers, as the assigned reviewer may have insufficient knowledge of the research disciplines or differing subjective views, causing possible misinterpretations; there is an urgent need for an effective and feasible approach to check submitted research papers with the support of automated software. Text mining methods can solve the problem of automatically checking research papers semantically. The proposed method finds the similarity of texts in a collection of documents using the Latent Dirichlet Allocation (LDA) algorithm together with Latent Semantic Analysis (LSA) with synonyms, where synonyms of indexed terms are found using the English WordNet dictionary; a second variant, LSA without synonyms, finds the similarity of texts based on the index alone. The accuracy of LSA with synonyms is greater when synonyms are considered for matching.
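
The abstract does not spell out the synonym-expansion procedure; the sketch below, under that caveat, expands each token with WordNet synonyms before computing LSA similarity (scikit-learn and NLTK assumed, `docs` a hypothetical placeholder for the paper collection):

```python
from nltk.corpus import wordnet  # requires a prior nltk.download("wordnet")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def expand_with_synonyms(text):
    # Append WordNet synonyms of each token so that documents using
    # different words for the same concept land closer in LSA space.
    tokens = text.lower().split()
    extra = {l.name() for t in tokens for s in wordnet.synsets(t) for l in s.lemmas()}
    return " ".join(tokens) + " " + " ".join(extra)

# docs: hypothetical list of research-paper texts.
expanded = [expand_with_synonyms(d) for d in docs]
tfidf = TfidfVectorizer().fit_transform(expanded)
lsa = TruncatedSVD(n_components=100).fit_transform(tfidf)

# Pairwise similarity between papers in the reduced semantic space.
sims = cosine_similarity(lsa)
```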


Author(s):  
Radha Guha

Background: In the era of information overload it is very difficult for a human reader to quickly make sense of the vast information available on the internet. Even for a specific domain like a college or university website, it may be difficult for a user to browse through all the links to get relevant answers quickly. Objective: In this scenario, the design of a chat-bot which can answer questions related to college information and compare colleges will be very useful and novel. Methods: In this paper a novel conversational-interface chat-bot application with information retrieval and text summarization skills is designed and implemented. First, the chat-bot has a simple dialogue skill: when it understands the user's query intent, it responds from a stored collection of answers. Second, for unknown queries, the chat-bot can search the internet and then perform text summarization using advanced techniques of natural language processing (NLP) and text mining (TM). Results: The NLP techniques for information retrieval and text summarization using machine learning, namely Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Word2Vec, Global Vectors (GloVe) and TextRank, are reviewed and compared before being implemented in the chat-bot. The chat-bot improves user experience tremendously by answering specific queries concisely, which takes less time than reading an entire document. Students, parents and faculty can get answers to a variety of queries about admission criteria, fees, course offerings, notice board, attendance, grades, placements, faculty profiles, research papers and patents more efficiently. Conclusion: The purpose of this paper was to follow the advancement in NLP technologies and implement them in a novel application.
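
Of the reviewed techniques, TextRank is the most self-contained to illustrate; a minimal sketch using networkx and a tf-idf sentence-similarity graph (an assumption — the paper's implementation details are not given in the abstract):

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text, n_sentences=3):
    # Naive sentence split; a real system would use an NLP sentence tokenizer.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)  # sentence-to-sentence similarity graph

    # PageRank over the similarity graph scores each sentence's centrality.
    scores = nx.pagerank(nx.from_numpy_array(sim))
    top = sorted(scores, key=scores.get, reverse=True)[:n_sentences]

    # Return the top sentences in their original document order.
    return ". ".join(sentences[i] for i in sorted(top)) + "."
```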


2021 ◽  
Author(s):  
Tahereh Dehdarirad ◽  
Kalle Karlsson

In this study we investigated whether open access could assist the broader dissemination of scientific research in Climate Action (Sustainable Development Goal 13) via news outlets. We did this by comparing (i) the share of open access and non-open access documents in different Climate Action topics, and their news counts, and (ii) the mean news counts for open access and non-open access documents. The data set comprised 70,206 articles and reviews in Sustainable Development Goal 13, published during 2014–2018 and retrieved from SciVal. The number of news mentions for each document was obtained from the Altmetric Details Page API using DOIs, whereas open access statuses were obtained using Unpaywall.org. The analysis combined Latent Dirichlet Allocation topic modelling, descriptive statistics, and regression analysis. The covariates included in the regression analysis were features related to authors, country, journal, institution, funding, readability, news source category and topic. Using topic modelling, we identified 10 topics, with topics 4 (meteorology) [21%], 5 (adaptation, mitigation, and legislation) [18%] and 8 (ecosystems and biodiversity) [14%] accounting for 53% of the research in Sustainable Development Goal 13. Additionally, the results of the regression analysis showed that, keeping all other variables in the model constant, open access papers in Climate Action had a news count advantage of 8.8% over non-open access papers. Our findings also showed that while a higher share of open access documents in topics such as topic 9 (human vulnerability to risks) might not assist their broader dissemination, in others, such as topic 5 (adaptation, mitigation, and legislation), even a lower share of open access documents might accelerate broad communication via news outlets.
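
The abstract does not state the regression family; below is a hedged sketch with statsmodels, assuming a negative binomial model (a common choice for over-dispersed counts such as news mentions) and a hypothetical dataframe `df` whose column names stand in for the listed covariates:

```python
import statsmodels.formula.api as smf

# df: hypothetical dataframe with one row per paper; the column names are
# illustrative stand-ins for the author/journal/readability/topic covariates.
model = smf.negativebinomial(
    "news_count ~ is_open_access + n_authors + journal_impact"
    " + readability + C(topic)",
    data=df,
).fit()

# Exponentiating the is_open_access coefficient gives the multiplicative
# news-count advantage of open access, holding the other covariates constant.
print(model.summary())
```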


Author(s):  
A.S. Li ◽  
A.J.C. Trappey ◽  
C.V. Trappey

A registered trademark distinctively identifies a company, its products or services. A trademark (TM) is a type of intellectual property (IP) which is protected by the laws of the country where the trademark is officially registered. TM owners may take legal action when their IP rights are infringed upon. TM legal cases have grown in pace with the increasing number of TMs registered globally. In this paper, an intelligent recommender system automatically identifies similar TM case precedents for any given target case to support IP legal research. This study constructs a semantic network representing the TM legal scope and terminology. A system is built to identify similar cases based on machine-readable, frame-based knowledge representations of the judgments/documents. In this research, 4,835 US TM legal cases litigated in the US district and federal courts are collected as the experimental dataset. The computer-assisted system extracts critical features based on the ontology schema. The recommender identifies similar prior cases according to the values of the features embedded in these legal documents, which include the case facts, issues under dispute, judgment holdings, and applicable rules and laws. Term frequency-inverse document frequency is used for text mining to discover the critical features of the litigated cases. A soft clustering algorithm, Latent Dirichlet Allocation, is applied to generate topics and identify the cases belonging to each topic, so that similar cases under each topic can be retrieved for reference. Through this analysis of case similarity based on TM legal semantics, the intelligent recommender provides precedents to support TM legal action and strategic planning.
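
A minimal sketch of the retrieval step, assuming scikit-learn; `case_texts` is a hypothetical placeholder for the 4,835 judgment documents, and the topic count is an illustrative choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# case_texts: hypothetical list of TM judgment documents.
# Tf-idf surfaces the discriminative terms of each litigated case.
tfidf = TfidfVectorizer(max_features=30_000).fit_transform(case_texts)

# LDA as soft clustering: every case gets a distribution over topics
# rather than a single hard cluster assignment.
counts = CountVectorizer(max_features=30_000).fit_transform(case_texts)
topics = LatentDirichletAllocation(n_components=50, random_state=0).fit_transform(counts)

def similar_cases(target_idx, k=5):
    # Rank prior cases by similarity of their topic distributions.
    sims = cosine_similarity(topics[target_idx:target_idx + 1], topics)[0]
    return sims.argsort()[::-1][1:k + 1]  # skip the target case itself
```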


2021 ◽  
Vol 4 ◽  
Author(s):  
Prashanth Rao ◽  
Maite Taboada

We present a topic modelling and data visualization methodology to examine gender-based disparities in news articles by topic. Existing research in topic modelling is largely focused on the text mining of closed corpora, i.e., those that include a fixed collection of composite texts. We showcase a methodology to discover topics via Latent Dirichlet Allocation, which can reliably produce human-interpretable topics over an open news corpus that continually grows with time. Our system generates topics, or distributions of keywords, for news articles on a monthly basis, to consistently detect key events and trends aligned with events in the real world. Findings from two years' worth of news articles in mainstream English-language Canadian media indicate that certain topics feature either women or men more prominently and exhibit different types of language. Perhaps unsurprisingly, topics such as lifestyle, entertainment, and healthcare tend to be prominent in articles that quote more women than men. Topics such as sports, politics, and business are characteristic of articles that quote more men than women. The data shows a self-reinforcing gendered division of duties and representation in society. Quoting female sources more frequently in a caregiving role and quoting male sources more frequently in political and business roles enshrines women’s status as caregivers and men’s status as leaders and breadwinners. Our results can help journalists and policy makers better understand the unequal gender representation of those quoted in the news and facilitate news organizations’ efforts to achieve gender parity in their sources. The proposed methodology is robust, reproducible, and scalable to very large corpora, and can be used for similar studies involving unsupervised topic modelling and language analyses.
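
A sketch of the monthly retraining loop, assuming Gensim (the abstract does not name the tooling); `articles_by_month` is a hypothetical dict mapping a month key to that month's article texts:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def monthly_topics(articles_by_month, n_topics=15):
    # Train a fresh LDA model per month so the discovered topics track
    # real-world events instead of being anchored to a fixed, closed corpus.
    results = {}
    for month, docs in articles_by_month.items():
        tokenized = [d.lower().split() for d in docs]  # naive tokenization
        dictionary = Dictionary(tokenized)
        corpus = [dictionary.doc2bow(t) for t in tokenized]
        lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary)
        # Keep each topic as its top keywords for month-over-month comparison.
        results[month] = lda.show_topics(num_topics=n_topics, num_words=10)
    return results
```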


2019 ◽  
Vol 10 (1) ◽  
pp. 29
Author(s):  
Yulius Denny Prabowo ◽  
Tedi Lesmana Marselino ◽  
Meylisa Suryawiguna

Extracting information from a large amount of structured data requires expensive computation. The Vector Space Model method works by mapping words into a continuous vector space in which semantically similar words are mapped to adjacent vectors. The Vector Space Model assumes that words appearing in the same context have the same semantic meaning. In implementations, there are two different approaches: count-based methods (e.g. Latent Semantic Analysis) and predictive methods (e.g. the Neural Probabilistic Language Model). This study aims to apply the Word2Vec method, using the Continuous Bag of Words approach, to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result of the research is a vector mapping of Indonesian words based on the data used.

Keywords: vector space model, word to vector, Indonesian vector space model.
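
A minimal sketch with Gensim's Word2Vec, where `sg=0` selects the Continuous Bag of Words approach the study uses; `sentences` is a hypothetical placeholder for the tokenized crawled news text:

```python
from gensim.models import Word2Vec

# sentences: hypothetical tokenized Indonesian news sentences obtained from
# crawling, e.g. [["pemilu", "presiden", "indonesia"], ...].
model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window around each target word
    sg=0,             # sg=0 selects the Continuous Bag of Words (CBOW) approach
    min_count=5,      # ignore rare words
)

# Semantically similar words should be mapped to nearby vectors.
print(model.wv.most_similar("presiden"))
```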


Author(s):  
Subhadra Dutta ◽  
Eric M. O’Rourke

Natural language processing (NLP) is the field of decoding human written language. This chapter responds to the growing interest in using machine learning–based NLP approaches for analyzing open-ended employee survey responses. These techniques address scalability and the ability to provide real-time insights to make qualitative data collection equally or more desirable in organizations. The chapter walks through the evolution of text analytics in industrial–organizational psychology and discusses relevant supervised and unsupervised machine learning NLP methods for survey text data, such as latent Dirichlet allocation, latent semantic analysis, sentiment analysis, word relatedness methods, and so on. The chapter also lays out preprocessing techniques and the trade-offs of growing NLP capabilities internally versus externally, points the readers to available resources, and ends with discussing implications and future directions of these approaches.


2016 ◽  
Vol 43 (1) ◽  
pp. 88-102 ◽  
Author(s):  
Sergey I. Nikolenko ◽  
Sergei Koltcov ◽  
Olessia Koltsova

Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, examples of qualitative studies that employ topic modelling as a tool are currently few and far between. In this work, we identify two important problems along the way to using topic models in qualitative studies: the lack of a good quality metric that closely matches human judgement in understanding topics, and the need to target the specific subtopics that a given qualitative study is most interested in mining. For the first problem, we propose a new quality metric, tf-idf coherence, that reflects human judgement more accurately than regular coherence, and conduct an experiment to verify this claim. For the second problem, we propose an interval semi-supervised approach (ISLDA) in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. Our experiments show that ISLDA is better than LDA for topic extraction in terms of tf-idf coherence, the number of topics matched to predefined keywords, and topic stability. We also present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.
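
The exact tf-idf coherence formula is not given in the abstract; the sketch below is one assumed reading — standard UMass-style coherence with raw document counts replaced by tf-idf weights — and not necessarily the authors' definition:

```python
import numpy as np

def tfidf_coherence(top_words, tfidf, vocab_index):
    # Assumed reading of the metric: UMass-style coherence where the
    # co-occurrence and marginal counts are replaced by summed tf-idf weights.
    # tfidf: sparse document-term matrix; vocab_index: word -> column index.
    score = 0.0
    for i, w1 in enumerate(top_words):
        for w2 in top_words[:i]:
            c1 = tfidf[:, vocab_index[w1]]
            c2 = tfidf[:, vocab_index[w2]]
            joint = float(c1.multiply(c2).sum())  # tf-idf weighted co-occurrence
            marginal = float(c2.sum())
            score += np.log((joint + 1.0) / (marginal + 1.0))
    return score
```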

