Applying Text Mining, Clustering Analysis, and Latent Dirichlet Allocation Techniques for Topic Classification of Environmental Education Journals

Facing the big data wave, this study applied artificial intelligence to cite knowledge and find a feasible process to play a crucial role in supplying innovative value in environmental education. Intelligent agents and natural language processing (NLP) are two key areas leading the trend in artificial intelligence; this research adopted NLP to analyze the research topics of environmental education research journals in the Web of Science (WoS) database during 2011–2020 and interpret the categories and characteristics of abstracts for environmental education papers. The corpus data were selected from abstracts and keywords of research journal papers, which were analyzed with text mining, cluster analysis, latent Dirichlet allocation (LDA), and co-word analysis methods. The classification of feature words was determined and reviewed by domain experts, and the associated TF-IDF weights were calculated for the subsequent cluster analysis, which combined hierarchical clustering and K-means analysis. Hierarchical clustering and LDA set the number of required categories at seven, and the K-means cluster analysis classified the documents into seven categories. This study used co-word analysis to check the suitability of the K-means classification, analyzed the terms with high TF-IDF weights for distinct K-means groups, and examined the terms for different topics with the LDA technique. A comparison of the results demonstrated that most categories recognized with the K-means and LDA methods were the same and shared similar words; however, two categories had slight differences. The involvement of field experts assisted with the consistency and correctness of the classified topics and documents.
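
The TF-IDF weighting step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the study's implementation: the abstract does not state which TF-IDF variant was used, so the plain tf * log(N / df) form is an assumption, and the function name `tfidf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes the textbook weighting tf * log(N / df), where tf is the
    term's relative frequency in the document and df is the number of
    documents containing the term.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical toy corpus standing in for abstracts and keywords.
docs = [
    "climate change education in schools".split(),
    "outdoor education and nature experience".split(),
    "climate policy and sustainability behavior".split(),
]
w = tfidf(docs)
```

Terms that occur in only one document (such as "schools") receive a higher weight than terms shared across documents (such as "climate"). The resulting weight vectors would be the input to a clustering step such as hierarchical clustering or K-means.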

Application of Latent Dirichlet Allocation (LDA) for clustering financial tweets

E3S Web of Conferences ◽

10.1051/e3sconf/202129701071 ◽

2021 ◽

Vol 297 ◽

pp. 01071

Author(s):

Sifi Fatima-Zahrae ◽

Sabbar Wafae ◽

El Mzabi Amal

Keyword(s):

Language Processing ◽

Latent Dirichlet Allocation ◽

Sentiment Classification ◽

Research Areas ◽

Preprocessing Method ◽

Long Time ◽

Standard Text ◽

The Given ◽

Dirichlet Allocation

Sentiment classification is one of the hottest research areas among Natural Language Processing (NLP) topics. While it aims to detect the sentiment polarity and classify the given opinion, it requires a large number of aspect extractions. However, extracting aspects takes human effort and a long time. To reduce this cost, Latent Dirichlet Allocation (LDA) methods have emerged recently. In this paper, an efficient preprocessing method for sentiment classification is presented and used for analyzing users' comments on the Twitter social network. For this purpose, different text preprocessing techniques were applied to the dataset to achieve an acceptable standard text. Latent Dirichlet Allocation was then applied to the data obtained from this fast and accurate preprocessing phase. Different sentiment analysis methods were implemented, and their results were compared and evaluated. The experimental results show that combining the preprocessing method of this paper with Latent Dirichlet Allocation yields acceptable results compared to other basic methods.

Two-stage topic modelling of scientific publications: A case study of University of Nairobi, Kenya

PLoS ONE ◽

10.1371/journal.pone.0243208 ◽

2021 ◽

Vol 16 (1) ◽

pp. e0243208

Author(s):

Leacky Muchene ◽

Wende Safari

Keyword(s):

Hierarchical Clustering ◽

Language Processing ◽

Latent Dirichlet Allocation ◽

Topic Modelling ◽

Two Stage ◽

Scientific Publications ◽

Statistical Tool ◽

Second Stage ◽

The University ◽

Dirichlet Allocation

Unsupervised statistical analysis of unstructured data has gained wide acceptance especially in natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal, biomedical documents and journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Kenya. In the first stage, topic modelling with Latent Dirichlet Allocation was applied to derive the per-document topic probabilities. To more succinctly present the topics, in the second stage, hierarchical clustering with Hellinger distance was applied to derive the final clusters of topics. The analysis showed that dominant research themes in the university include: HIV and malaria research, research on agricultural and veterinary services as well as cross-cutting themes in humanities and social sciences. Further, the use of hierarchical clustering in the second stage reduces the discovered latent topics to clusters of homogeneous topics.
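
The distance used in the second stage can be illustrated with a short sketch. It computes the Hellinger distance between rows of topic probabilities, which a hierarchical clustering routine could then consume. The function names `hellinger` and `pairwise_distances` and the example probabilities are hypothetical, and the paper's actual pipeline may differ.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def pairwise_distances(rows):
    """Symmetric matrix of Hellinger distances between all rows."""
    n = len(rows)
    return [[hellinger(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Hypothetical topic probabilities (each row sums to 1).
doc_topics = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.05, 0.10, 0.85],
]
dist = pairwise_distances(doc_topics)
```

The distance is 0 for identical distributions and 1 for distributions with no overlap, so the first two rows above come out much closer to each other than either is to the third.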

Analyzing U.S. Army Officer Evaluation Reports with Natural Language Processing: A Log-Odds and Latent Dirichlet Allocation Exploration

Industrial and Systems Engineering Review ◽

10.37266/iser.2019v7i1.pp44-55 ◽

2019 ◽

Vol 7 (1) ◽

pp. 44-55

Author(s):

Heidy Shi ◽

John Caddell ◽

Julia Lensing

Keyword(s):

Natural Language Processing ◽

Text Mining ◽

Language Processing ◽

Text Analysis ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Data Set ◽

Army Officer ◽

Log Odds ◽

Dirichlet Allocation

Each job field (branch) in the Army requires a unique set of skills and talents of the officers assigned. Officers who demonstrate the required skills are often more successful in their assigned branch. To better understand how success is described across branches, research was conducted using text mining and text analysis of a data set of Officer Evaluation Reports (OERs). This research looked for common trends and discrepancies across varying branches and like groups of branches by analyzing the narrative portion of OERs. Text analysis methods examined words and bigrams commonly used to describe varying degrees of performance by officers. Topic modeling using Latent Dirichlet Allocation (LDA) was also conducted on top rated narratives to investigate trends and discrepancies in clustering narratives. Findings show that qualitative narratives for the top two performance designations fail to differentiate between officers’ varying levels of performance regardless of branch.
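
A log-odds comparison of word use between two groups of narratives might look like the sketch below. The abstract does not specify the exact log-odds formulation, so the additive-smoothing form here is an assumption; the function name `log_odds_ratio`, the `alpha` parameter, and the example phrases are hypothetical.

```python
import math
from collections import Counter


def log_odds_ratio(tokens_a, tokens_b, alpha=1.0):
    """Smoothed log-odds ratio of each word in group A relative to group B.

    Positive values mark words relatively more frequent in A,
    negative values words relatively more frequent in B.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    v = len(vocab)
    if v < 2:
        raise ValueError("need at least two distinct words")
    n_a, n_b = len(tokens_a), len(tokens_b)

    def log_odds(count, total):
        # Add-alpha smoothing keeps the odds finite for unseen words.
        numerator = count + alpha
        denominator = total + alpha * v - numerator
        return math.log(numerator / denominator)

    return {
        word: log_odds(counts_a[word], n_a) - log_odds(counts_b[word], n_b)
        for word in vocab
    }


# Hypothetical narrative fragments for two performance designations.
top_rated = "exceptional leader exceptional performer".split()
next_rated = "solid leader solid performer".split()
scores = log_odds_ratio(top_rated, next_rated)
```

Words used equally often in both groups (here "leader") score zero, which is the pattern one would expect when narratives fail to differentiate between performance levels.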

Does higher education properly prepare graduates for the growing artificial intelligence market? Gaps identification using text mining

Human Systems Management ◽

10.3233/hsm-211179 ◽

2021 ◽

pp. 1-13

Author(s):

Lamiae Benhayoun ◽

Daniel Lang

Keyword(s):

Artificial Intelligence ◽

Natural Language Processing ◽

Text Mining ◽

Natural Language ◽

Language Processing ◽

Academic Training ◽

Market Requirements ◽

Job Advertisements ◽

The Individual

BACKGROUND: The renewed advent of Artificial Intelligence (AI) is inducing profound changes in the classic categories of technology professions and is creating the need for new specific skills. OBJECTIVE: Identify the gaps in terms of skills between academic training on AI in French engineering and Business Schools, and the requirements of the labour market. METHOD: Extraction of AI training contents from the schools’ websites and scraping of a job advertisements’ website. Then, analysis based on a text mining approach with a Python code for Natural Language Processing. RESULTS: Categorization of occupations related to AI. Characterization of three classes of skills for the AI market: Technical, Soft and Interdisciplinary. Skills’ gaps concern some professional certifications and the mastery of specific tools, research abilities, and awareness of ethical and regulatory dimensions of AI. CONCLUSIONS: A deep analysis using algorithms for Natural Language Processing. Results that provide a better understanding of the AI capability components at the individual and the organizational levels. A study that can help shape educational programs to respond to the AI market requirements.

A guided latent Dirichlet allocation approach to investigate real-time latent topics of Twitter data during Hurricane Laura

Journal of Information Science ◽

10.1177/01655515211007724 ◽

2021 ◽

pp. 016555152110077

Author(s):

Sulong Zhou ◽

Pengyu Kan ◽

Qunying Huang ◽

Janet Silbernagel

Keyword(s):

Social Media ◽

Real Time ◽

Language Processing ◽

Disaster Response ◽

Domain Knowledge ◽

Latent Dirichlet Allocation ◽

Situational Awareness ◽

High Performing ◽

Latent Topics ◽

Dirichlet Allocation

Natural disasters cause significant damage, casualties and economic losses. Twitter has been used to support prompt disaster response and management because people tend to communicate and spread information on public social media platforms during disaster events. To retrieve real-time situational awareness (SA) information from tweets, the most effective way to mine text is using natural language processing (NLP). Among the advanced NLP models, the supervised approach can classify tweets into different categories to gain insight and leverage useful SA information from social media data. However, high-performing supervised models require domain knowledge to specify categories and involve costly labelling tasks. This research proposes a guided latent Dirichlet allocation (LDA) workflow to investigate temporal latent topics from tweets during a recent disaster event, the 2020 Hurricane Laura. With integration of prior knowledge, a coherence model, LDA topics visualisation and validation from official reports, our guided approach reveals that most tweets contain several latent topics during the 10-day period of Hurricane Laura. This result indicates that state-of-the-art supervised models have not fully utilised tweet information because they only assign each tweet a single label. In contrast, our model can not only identify emerging topics during different disaster events but also provide multilabel references to the classification schema. In addition, our results can help to quickly identify and extract SA information for responders, stakeholders and the general public so that they can adopt timely responsive strategies and wisely allocate resources during hurricane events.

Intelligent radar software defect classification approach based on the latent Dirichlet allocation topic model

EURASIP Journal on Advances in Signal Processing ◽

10.1186/s13634-021-00761-3 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Xi Liu ◽

Yongfeng Yin ◽

Haifeng Li ◽

Jiabin Chen ◽

Chang Liu ◽

...

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Recall Rate ◽

Defect Classification ◽

Software Defects ◽

Classification Approach ◽

Software Defect ◽

Model Combining ◽

Dirichlet Allocation

Existing intelligent software defect classification approaches do not consider radar characteristics and prior statistical information. Thus, when these approaches are applied to radar software testing and validation, the precision and recall rates of defect classification are poor, which affects how effectively software defects can be reused. To solve this problem, a new intelligent defect classification approach based on the latent Dirichlet allocation (LDA) topic model is proposed for radar software in this paper. The proposed approach includes the defect text segmentation algorithm based on the dictionary of the radar domain, the modified LDA model combining radar software requirements, and the top acquisition and classification approach of radar software defects based on the modified LDA model. The proposed approach is applied to typical radar software defects to validate its effectiveness and applicability. The application results show that the prediction precision rate and recall rate of the proposed approach are improved by up to 15–20% compared with the other defect classification approaches. Thus, the proposed approach can be applied effectively to the segmentation and classification of radar software defects to improve the adequacy of defect identification in radar software.

Modeling Research Topics for Artificial Intelligence Applications in Medicine: Latent Dirichlet Allocation Application Study

Journal of Medical Internet Research ◽

10.2196/15511 ◽

2019 ◽

Vol 21 (11) ◽

pp. e15511 ◽

Cited By ~ 3

Author(s):

Bach Xuan Tran ◽

Son Nghiem ◽

Oz Sahin ◽

Tuan Manh Vu ◽

Giang Hai Ha ◽

...

Keyword(s):

Artificial Intelligence ◽

Health Care ◽

Precision Medicine ◽

Data Science ◽

Latent Dirichlet Allocation ◽

Developed Countries ◽

Theory And Practice ◽

Clinical Practices ◽

Research Topics ◽

Dirichlet Allocation

Background: Artificial intelligence (AI)–based technologies develop rapidly and have myriad applications in medicine and health care. However, there is a lack of comprehensive reporting on the productivity, workflow, topics, and research landscape of AI in this field. Objective: This study aimed to evaluate the global development of scientific publications and constructed interdisciplinary research topics on the theory and practice of AI in medicine from 1977 to 2018. Methods: We obtained bibliographic data and abstract contents of publications published between 1977 and 2018 from the Web of Science database. A total of 27,451 eligible articles were analyzed. Research topics were classified by latent Dirichlet allocation, and principal component analysis was used to identify the construct of the research landscape. Results: The applications of AI have mainly impacted clinical settings (enhanced prognosis and diagnosis, robot-assisted surgery, and rehabilitation), data science and precision medicine (collecting individual data for precision medicine), and policy making (raising ethical and legal issues, especially regarding privacy and confidentiality of data). However, AI applications have not been commonly used in resource-poor settings due to the limit in infrastructure and human resources. Conclusions: The application of AI in medicine has grown rapidly and focuses on three leading platforms: clinical practices, clinical material, and policies. AI might be one of the methods to narrow down the inequality in health care and medicine between developing and developed countries. Technology transfer and support from developed countries are essential measures for the advancement of AI application in health care in developing countries.

A Survey on Intelligence Tools for Data Analytics

Advances in Data Mining and Database Management - Handbook of Research on Engineering, Business, and Healthcare Applications of Data Science and Analytics ◽

10.4018/978-1-7998-3053-5.ch005 ◽

2021 ◽

pp. 73-95

Author(s):

Shatakshi Singh ◽

Kanika Gautam ◽

Prachi Singhal ◽

Sunil Kumar Jangir ◽

Manish Kumar

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Language Processing ◽

Real Life ◽

Learning Tools ◽

The Core ◽

Training Mode ◽

Real Life Situation ◽

Selection Of

Recent developments in artificial intelligence during this decade have been astounding. Machine learning (ML) in particular is one of the core subareas of AI. The ML field is also growing incessantly, and its demand and importance are rising. It has transformed the way data is extracted, analyzed, and interpreted. Computers are trained to enter a self-training mode so that when new data is fed in, they can learn, grow, change, and develop themselves without explicit programming. This helps to make useful predictions that can guide better decisions in real-life situations without human interference. Selecting an ML tool is always a challenging task, since choosing an appropriate tool can save time and make it faster and easier to provide a solution. This chapter provides a classification of various machine learning tools on the following aspects: for non-programmers, for model deployment, for computer vision, natural language processing, and audio, and for reinforcement learning and data mining.

Ensemble Methods for Improving Classification of Data Produced by Latent Dirichlet Allocation

Computer Science and Mathematical Modelling ◽

10.5604/01.3001.0013.1458 ◽

2019 ◽

Vol 0 (8/2018) ◽

pp. 17-28

Author(s):

Maciej Jankowski

Keyword(s):

Text Analysis ◽

Large Scale ◽

Latent Dirichlet Allocation ◽

Previous Analysis ◽

Ensemble Methods ◽

Topic Modelling ◽

New Methods ◽

Data Scientist ◽

Dirichlet Allocation

Topic models are very popular methods of text analysis. The most popular algorithm for topic modelling is LDA (Latent Dirichlet Allocation). Recently, many new methods have been proposed that enable the use of this model in large-scale processing. One problem is that a data scientist has to choose the number of topics manually. This step requires some previous analysis. A few methods have been proposed to automate this step, but none of them works very well if LDA is used as a preprocessing step for further classification. In this paper, we propose an ensemble approach that allows us to use more than one model in the prediction phase while reducing the need to find a single best number of topics. We have also analyzed a few methods of estimating the number of topics.
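
One simple way to combine several models at prediction time is soft voting, sketched below. This is an illustrative assumption rather than the ensemble scheme the paper proposes: it presumes that several classifiers, each trained on LDA features with a different number of topics, have already produced class-probability vectors for the same document. The function name `soft_vote` and the example probabilities are hypothetical.

```python
def soft_vote(prob_vectors):
    """Average class probabilities from several models and pick the best class.

    prob_vectors: one list of class probabilities per model, all the same length.
    Returns (predicted_class_index, averaged_probabilities).
    """
    if not prob_vectors:
        raise ValueError("need at least one model's probabilities")
    n_classes = len(prob_vectors[0])
    if any(len(p) != n_classes for p in prob_vectors):
        raise ValueError("all probability vectors must have the same length")
    n_models = len(prob_vectors)
    averaged = [
        sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)
    ]
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged


# Hypothetical outputs of three classifiers built on LDA models
# with different topic counts (for example 10, 20, and 50 topics).
predictions = [
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
]
label, averaged = soft_vote(predictions)
```

Because the final decision averages over models, no single topic count has to be selected as the best one, which is the motivation stated in the abstract.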

How Artificial Intelligence Can Improve Our Understanding of the Genes Associated with Endometriosis: Natural Language Processing of the PubMed Database

BioMed Research International ◽

10.1155/2018/6217812 ◽

2018 ◽

Vol 2018 ◽

pp. 1-7 ◽

Cited By ~ 7

Author(s):

J. Bouaziz ◽

R. Mashiach ◽

S. Cohen ◽

A. Kedem ◽

A. Baron ◽

...

Keyword(s):

Artificial Intelligence ◽

Natural Language Processing ◽

Text Mining ◽

Natural Language ◽

Language Processing ◽

Data Extraction ◽

Endometrial Tissue ◽

Endometrial Cells ◽

Pubmed Database ◽

Using Data

Endometriosis is a disease characterized by the development of endometrial tissue outside the uterus, but its cause remains largely unknown. Numerous genes have been studied and proposed to help explain its pathogenesis. However, the large number of these candidate genes has made functional validation through experimental methodologies nearly impossible. Computational methods could provide a useful alternative for prioritizing those most likely to be susceptibility genes. Using artificial intelligence applied to text mining, this study analyzed the genes involved in the pathogenesis, development, and progression of endometriosis. The data extraction by text mining of the endometriosis-related genes in the PubMed database was based on natural language processing, and the data were filtered to remove false positives. Using data from the text mining and gene network information as input for the web-based tool, 15,207 endometriosis-related genes were ranked according to their score in the database. Characterization of the filtered gene set through gene ontology, pathway, and network analysis provided information about the numerous mechanisms hypothesized to be responsible for the establishment of ectopic endometrial tissue, as well as the migration, implantation, survival, and proliferation of ectopic endometrial cells. Finally, the human genome was scanned through various databases using filtered genes as a seed to determine novel genes that might also be involved in the pathogenesis of endometriosis but which have not yet been characterized. These genes could be promising candidates to serve as useful diagnostic biomarkers and therapeutic targets in the management of endometriosis.
