Klasifikasi Laporan Keluhan Pelayanan Publik Berdasarkan Instansi Menggunakan Metode LDA-SVM (Classification of Public Service Complaint Reports by Agency Using the LDA-SVM Method)

2021
Vol 8 (6)
pp. 1265
Author(s):
Muhammad Alkaff
Andreyan Rizky Baskara
Irham Maulani

Lapor! is a service system for conveying public aspirations and complaints about Indonesian government services. The government has long used the system to address bureaucratic problems reported by Indonesian citizens. However, the growing volume of reports, combined with manual triage in which operators read every incoming complaint, frequently leads to reports being forwarded to the wrong agency. A solution is therefore needed that can determine a report's context automatically using Natural Language Processing techniques. This study aims to build an automatic classifier that routes reports to the authorized agency based on their topics, combining Latent Dirichlet Allocation (LDA) and Support Vector Machine (SVM). Topic modeling for each report is performed with LDA, which extracts reports to find specific patterns in the documents and outputs topic distribution values. The classification step that determines a report's destination agency is then carried out with an SVM trained on the topic values extracted by LDA. The LDA-SVM model's performance is measured with a confusion matrix, computing accuracy, precision, recall, and F1 score. Test results using a 70:30 train-test split show that the model performs well, with 79.85% accuracy, 79.98% precision, 72.37% recall, and a 74.67% F1 score.
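
The pipeline described here lends itself to a compact implementation. Below is a minimal sketch in Python with scikit-learn; the toy complaint texts, agency labels, topic count, and kernel are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal LDA-SVM sketch: LDA topic distributions as SVM features.
# Toy data and hyperparameters are assumptions, not the paper's setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

reports = [
    "jalan rusak belum diperbaiki oleh dinas",       # toy complaint texts
    "lampu jalan mati di perumahan kami",
    "pelayanan kartu identitas sangat lambat",
    "antrian pembuatan akta kelahiran terlalu lama",
    "jembatan desa retak dan membahayakan warga",
    "petugas kelurahan menolak melayani warga",
]
agencies = ["pekerjaan_umum", "pekerjaan_umum", "dukcapil",
            "dukcapil", "pekerjaan_umum", "dukcapil"]  # toy destination agencies

# Bag-of-words counts feed the LDA topic model.
X_counts = CountVectorizer().fit_transform(reports)

# LDA maps each report to a topic-distribution vector.
lda = LatentDirichletAllocation(n_components=4, random_state=0)  # topic count is a guess
X_topics = lda.fit_transform(X_counts)

# 70:30 train-test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_topics, agencies, test_size=0.3, random_state=0, stratify=agencies)

# SVM routes reports to agencies based on their topic mixtures.
svm = SVC(kernel="rbf").fit(X_tr, y_tr)

# Confusion-matrix-derived metrics: accuracy, precision, recall, F1.
print(classification_report(y_te, svm.predict(X_te), zero_division=0))
```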

Author(s):
Chaudhary Jashubhai Rameshbhai
Joy Paulose

Opinion mining, also known as sentiment analysis, is a technique that uses Natural Language Processing (NLP) to classify the sentiment expressed in text. Various NLP tools are available for processing text data, and opinion mining has been studied extensively for online blogs, Twitter, Facebook, and similar sources. This paper proposes a new opinion mining technique that applies a Support Vector Machine (SVM) and NLP tools to newspaper headlines. Relevant words are generated using Stanford CoreNLP and passed to the SVM through a count vectorizer. Comparing three models using a confusion matrix, the results indicate that TF-IDF with a linear SVM provides better accuracy on the smaller dataset, while on the larger dataset an SGD-trained linear SVM outperforms the other models.
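
A rough sketch of the comparison the abstract describes might look like the following; the headlines, labels, and cross-validation setup are toy stand-ins, and only the TF-IDF + linear SVM and hinge-loss SGD variants are shown.

```python
# Sketch: TF-IDF + LinearSVC vs. SGD-trained linear SVM on headlines.
# Data and evaluation here are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

headlines = [
    "markets rally on strong quarterly earnings",
    "floods devastate coastal towns overnight",
    "local team clinches championship title",
    "economy slides deeper into recession",
]
labels = ["positive", "negative", "positive", "negative"]

# Batch-trained linear SVM: reported to do better on the smaller dataset.
svc_model = make_pipeline(TfidfVectorizer(), LinearSVC())

# SGD with hinge loss is also a linear SVM, trained incrementally:
# reported to do better on the larger dataset.
sgd_model = make_pipeline(TfidfVectorizer(), SGDClassifier(loss="hinge"))

for name, model in [("LinearSVC", svc_model), ("SGD", sgd_model)]:
    acc = cross_val_score(model, headlines, labels, cv=2).mean()
    print(f"{name}: mean accuracy {acc:.2f}")
```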


2015
Vol 23 (3)
pp. 695
Author(s):
Arnaldo Candido Junior
Célia Magalhães
Helena Caseli
Régis Zangirolami

This article evaluates the application of two efficient automatic keyword-extraction methods, used by the Corpus Linguistics and Natural Language Processing communities, to generate keywords from literary texts: WordSmith Tools and Latent Dirichlet Allocation (LDA). The two tools chosen for this work have their own specificities and different extraction techniques, which led us to a performance-oriented analysis. Our aim, then, was to understand how each method works and to evaluate its application to literary texts. To that end, we used human analysis by readers familiar with the domain of the texts. The LDA method was used to extract keywords through its integration with Portal Min@s: Corpora de Fala e Escrita, a general corpus-processing system designed for a range of Corpus Linguistics research. The experimental results confirm the effectiveness of both WordSmith Tools and LDA for extracting keywords from a literary corpus, and indicate that human analysis of the lists is needed at a stage prior to the experiments to complement the automatically generated list, cross-referencing the WordSmith Tools and LDA results. They also indicate that the human analyst's linguistic intuition about the lists generated separately by the two methods favored the WordSmith Tools keyword list.
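
To illustrate the LDA side of the comparison (WordSmith Tools is a proprietary GUI application), a minimal keyword-candidate extraction over a toy corpus could look like this; the excerpts and topic count are assumptions.

```python
# Sketch: LDA top words as keyword candidates for literary texts.
# The excerpts and topic count below are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = [
    "the sea was grey and the ship rolled in the long swell",
    "she walked through the garden where the roses had died",
    "the captain watched the horizon as the storm gathered",
    "letters from the old house lay unopened in the drawer",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
vocab = vectorizer.get_feature_names_out()

# The highest-weight words of each topic serve as keyword candidates,
# to be vetted by a human analyst, as the article recommends.
for k, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}:", ", ".join(top))
```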


2019
Vol 19 (1)
Author(s):
Simon Geletta
Lendie Follett
Marcia Laugerman

Abstract

Background: This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns within research narrative documents that distinguish studies that complete successfully from those that terminate. Recent research findings report that at least 10% of all studies funded by major research funding agencies terminate without yielding useful results. Since scientific studies that receive funding from major agencies are carefully planned and rigorously vetted through the peer-review process, it was somewhat daunting to us that study terminations are this prevalent. Moreover, our review of the literature about study terminations suggested that the reasons for them are not well understood. We therefore aimed to address that knowledge gap by seeking to identify the factors that contribute to study failures.

Method: We used data from the ClinicalTrials.gov repository, from which we extracted both structured data (study characteristics) and unstructured data (the narrative description of the studies). We applied natural language processing techniques to the unstructured data to quantify the risk of termination by identifying distinctive topics that are more frequently associated with terminated versus completed trials. We used the Latent Dirichlet Allocation (LDA) technique to derive 25 "topics" with corresponding sets of probabilities, which we then used to predict study termination via random forest modeling. We fit two distinct models: one using only structured data as predictors, and another using both the structured data and the 25 text topics derived from the unstructured data.

Results: In this paper, we demonstrate the interpretive and predictive value of LDA as it relates to predicting clinical trial failure. The results also demonstrate that the combined modeling approach yields robust predictive probabilities, in terms of both sensitivity and specificity, relative to a model that uses the structured data alone.

Conclusions: Our study demonstrated that topic modeling with LDA significantly raises the utility of unstructured data for predicting the completion versus termination of studies. This study sets the direction for future research to evaluate the viability of the designs of health studies.
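
The two-model comparison can be sketched as below; the synthetic structured features and toy narratives are placeholders, with only the 25-topic LDA and the random forest mirroring choices stated in the abstract.

```python
# Sketch: structured features alone vs. structured + 25 LDA topic features,
# both fed to a random forest. All data below is synthetic.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
structured = rng.normal(size=(n, 3))  # stand-ins for study characteristics

# Toy narratives: completed trials mention enrollment, terminated ones funding.
narratives = (["patients enrolled and completed follow up visits"] * (n // 2) +
              ["sponsor withdrew funding and the site closed early"] * (n // 2))
terminated = np.array([0] * (n // 2) + [1] * (n // 2))

# 25 LDA topics, matching the number used in the study.
counts = CountVectorizer().fit_transform(narratives)
topics = LatentDirichletAllocation(
    n_components=25, random_state=0).fit_transform(counts)

rf = RandomForestClassifier(random_state=0)
print("structured only    :",
      cross_val_score(rf, structured, terminated, cv=5).mean())
print("structured + topics:",
      cross_val_score(rf, np.hstack([structured, topics]), terminated, cv=5).mean())
```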


2020
Author(s):
German Rosati
Laia Domenech
Adriana Chazarreta
Tomas Maguire

We present a first approximation to quantifying social representations of COVID-19 using news comments. A web crawler was developed to construct the dataset of readers' comments. We detect relevant topics in the dataset using Latent Dirichlet Allocation and analyze their evolution over time. Finally, we show a first prototype for predicting the majority topics using FastText.
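
The FastText prototype might be wired up as in the sketch below; the training-file name, topic labels, and hyperparameters are assumptions. FastText's supervised mode expects one "__label__<topic> <text>" line per comment.

```python
# Sketch of the FastText majority-topic predictor; file name, labels, and
# hyperparameters are placeholders. Requires the `fasttext` package.
import fasttext

# comments.train (assumed file) holds one labeled comment per line, e.g.:
#   __label__economy the quarantine is ruining small businesses
#   __label__health hospitals cannot cope with new cases
model = fasttext.train_supervised(input="comments.train",
                                  epoch=25, wordNgrams=2)

labels, probs = model.predict("schools should reopen after the lockdown")
print(labels[0], f"{probs[0]:.2f}")
```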


Detecting the author of each sentence in a collective document can be done by choosing a suitable set of features and applying Natural Language Processing within a machine learning workflow. The basic idea is to train a machine to identify the author of a specific sentence. This is done through eight NLP steps, such as applying a stemming algorithm, removing stop-list words, and preprocessing the data, before passing the result to a machine learning classifier, a Support Vector Machine (SVM), which classifies the dataset into a number of classes specifying the author of each sentence, with an accuracy of 82%. This paper helps readers who are interested in knowing the names of the authors who have written specific words.
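
A compressed sketch of such a pipeline, assuming NLTK for stop-word removal and stemming and scikit-learn for the SVM; the sentences and author labels are toy placeholders.

```python
# Sketch: sentence-level author attribution with stop-word removal,
# stemming, and a linear SVM. Toy data throughout.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(sentence):
    # Lowercase, drop stop-list words, and stem what remains.
    tokens = [w for w in sentence.lower().split() if w not in STOP]
    return " ".join(stemmer.stem(w) for w in tokens)

sentences = [
    "It was the best of times it was the worst of times",
    "The whale surfaced near the doomed ship",
    "A tale of two cities begins in seventeen seventy five",
    "Call me Ishmael said the narrator of the sea",
]
authors = ["dickens", "melville", "dickens", "melville"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit([preprocess(s) for s in sentences], authors)
print(model.predict([preprocess("The ship sank beneath the whale")]))
```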


SAINTEKBU
2020
Vol 12 (2)
pp. 40-44
Author(s):
Iin Kurniasari

Facebook is one of the most frequently used social media platforms, especially during the current COVID-19 pandemic. A great deal of public sentiment circulates there, mainly as comments on the available information about COVID-19, and analyzing it for various purposes is challenging. NLP (Natural Language Processing) techniques consisting of case folding, tokenizing, filtering, and stemming can be applied in this case. This study focuses on developing sentiment analysis of Facebook data using a Lexicon and a Support Vector Machine. The Lexicon-based results achieved lower accuracy than the Support Vector Machine.
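
A minimal sketch of the two approaches being compared, with a hand-made polarity lexicon and a toy comment set standing in for the study's data.

```python
# Sketch: lexicon scoring vs. a trained SVM for comment sentiment.
# The lexicon, comments, and labels are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

POSITIVE = {"good", "helpful", "recover", "safe"}
NEGATIVE = {"bad", "hoax", "scared", "death"}

def lexicon_label(comment):
    # Count polarity words; ties default to positive.
    words = comment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"

comments = [
    "vaccines are safe and helpful",
    "this virus is a hoax",
    "patients recover with good care",
    "so scared of the death toll",
]
labels = ["positive", "negative", "positive", "negative"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(comments, labels)

for c in comments:
    print(c, "| lexicon:", lexicon_label(c), "| svm:", svm.predict([c])[0])
```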


2020
Vol 132 (4)
pp. 738-749
Author(s):
Michael L. Burns
Michael R. Mathis
John Vandervest
Xinyu Tan
Bo Lu
...  

Abstract

Background: Accurate anesthesiology procedure code data are essential to quality improvement, research, and reimbursement tasks within anesthesiology practices. Advanced data science techniques, including machine learning and natural language processing, offer opportunities to develop classification tools for Current Procedural Terminology codes across anesthesia procedures.

Methods: Models were created using a Train/Test dataset including 1,164,343 procedures from 16 academic and private hospitals. Five supervised machine learning models were created to classify anesthesiology Current Procedural Terminology codes, with accuracy defined as the first-choice classification matching the institution-assigned code in the perioperative database. The two best performing models were further refined and tested on a Holdout dataset from a single institution distinct from Train/Test. A tunable confidence parameter was created to identify cases for which the models were highly accurate, with the goal of at least 95% accuracy, above the reported 2018 Centers for Medicare and Medicaid Services (Baltimore, Maryland) fee-for-service accuracy. Actual submitted claim data from billing specialists were used as the reference standard.

Results: The support vector machine and the neural network label-embedding attentive model were the two best performing models, demonstrating overall accuracies of 87.9% and 84.2% (single best code), and 96.8% and 94.0% (within top three), respectively. Within the Train/Test dataset, classification accuracy was 96.4% in 47.0% of cases using the support vector machine and 94.4% in 62.2% of cases using the label-embedding attentive model. In the Holdout dataset, the respective classification accuracies were 93.1% in 58.0% of cases and 95.0% in 62.0% of cases. The most important feature in model training was the procedure text.

Conclusions: Through the application of machine learning and natural language processing techniques, highly accurate real-time models were created for anesthesiology Current Procedural Terminology code classification. The increased processing speed and a priori targeted accuracy of this classification approach may provide performance optimization and cost reduction for quality improvement, research, and reimbursement tasks that rely on anesthesiology procedure codes.
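
The tunable confidence parameter can be sketched as a probability threshold on a calibrated classifier: cases clearing the threshold are auto-coded, and the rest fall back to a billing specialist. Everything below (procedure texts, codes, threshold) is an illustrative assumption, not the authors' implementation.

```python
# Sketch: confidence-thresholded SVM classification of procedure text.
# Toy data; CPT codes shown are examples, and the threshold is a placeholder
# that would be tuned on held-out data in practice.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

procedures = [
    "laparoscopic appendectomy general anesthesia",
    "total knee arthroplasty spinal anesthesia",
    "open appendectomy general anesthesia",
    "knee replacement revision spinal anesthesia",
]
cpt_codes = ["00840", "01402", "00840", "01402"]  # example anesthesia CPT codes

# Calibrate the SVM so its decision scores become usable probabilities.
model = make_pipeline(TfidfVectorizer(),
                      CalibratedClassifierCV(LinearSVC(), cv=2))
model.fit(procedures, cpt_codes)

THRESHOLD = 0.95  # placeholder confidence target
probs = model.predict_proba(
    ["laparoscopic appendectomy under general anesthesia"])[0]
best = int(np.argmax(probs))
if probs[best] >= THRESHOLD:
    print("auto-assign:", model.classes_[best])
else:
    print("low confidence; route to billing specialist")
```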

