text preprocessing
Recently Published Documents


Total documents: 91 (five years: 50)
H-index: 6 (five years: 2)

2021 ◽  
Vol 10 (3) ◽  
pp. 426-431
Author(s):  
Wiyanto Wiyanto ◽  
Zulita Setyaningsih

The COVID-19 pandemic in Indonesia in 2020 led to widespread termination of employment (PHK), which drew a range of public opinion on social media. With the poverty rate high and unemployment rising every year, the layoffs met with public disapproval. It is therefore necessary to classify public opinion on this issue as negative or positive. The purpose of this study is to analyze sentiment towards layoffs, determining whether opinions are negative or positive, using the Naïve Bayes algorithm with added feature selection. The research stages consist of data collection, text preprocessing, feature selection, and application of the algorithm. Testing was carried out in the RapidMiner application. Naïve Bayes alone reached an accuracy of 93.57%, and Naïve Bayes with PSO feature selection reached 93.71%. The best accuracy for sentiment analysis of layoffs during the COVID-19 pandemic was therefore obtained by adding PSO feature selection to Naïve Bayes, an improvement of 0.14 percentage points.
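As an illustration of the classification stage described above, the following is a minimal sketch of a multinomial Naïve Bayes classifier with Laplace smoothing. It is not the study's implementation (the authors used RapidMiner, and PSO feature selection is not shown); the function names `train_nb` and `predict_nb` and the toy training data are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Collect counts from (tokens, label) pairs. Names are illustrative only."""
    class_docs = Counter()              # number of documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for tokens, label in samples:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore terms never seen in training
            score += math.log((term_counts[label][tok] + 1)
                              / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data (already tokenized and preprocessed).
samples = [
    (["good", "job"], "pos"),
    (["bad", "layoff"], "neg"),
    (["good", "support"], "pos"),
    (["bad", "unfair"], "neg"),
]
model = train_nb(samples)
```

With this toy model, `predict_nb(model, ["good"])` selects the positive class because "good" appears only in positive training documents.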


2021 ◽  
Vol 2131 (2) ◽  
pp. 022102
Author(s):  
A Kozyreva ◽  
U Nazarenko ◽  
A Berezhkov ◽  
N Nasyrov

Abstract. This publication focuses on developing the possibilities of machine learning to help students prepare their final qualifying paper. Purpose of the study: to present the possibilities of machine learning for processing the texts of final qualifying papers and checking them for compliance with the requirements. The article shows how papers can be distributed by topic, which can help students find materials on their topic, and presents algorithms for extracting text in Russian for further analysis. The research follows the CRISP-DM methodology and describes all the necessary research steps in detail. The paper shows the process of extracting text from pdf and docx files, the text preprocessing methods needed for further analysis, and the capabilities of machine learning algorithms using the example of LDA analysis.


2021 ◽  
Vol 5 (2) ◽  
pp. 548-556
Author(s):  
Muhammad Fitra Rizki ◽  
Karina Auliasari ◽  
Renaldi Primaswara Prasetya

Twitter is one of the social media platforms currently trending, because it carries a great deal of news and information that can be responded to quickly and accurately from many points of view. As a result, Twitter has not only positive effects but also negative ones, for both users and non-users, and one of these is cyberbullying. Cyberbullying is a form of intimidation that perpetrators use to harass their victims through technological devices. Victims of cyberbullying suffer physical and psychological harm such as loneliness, anxiety, higher levels of depression, and low self-esteem. Victims also feel pressure and consequently show a higher tendency towards suicidal thoughts. This study analyzes cyberbullying sentiment expressed by users on Twitter by developing a web-based system that classifies the sentiment using a support vector machine. The input to the system is tweet content retrieved from Twitter by entering hashtag keywords likely to provoke cyberbullying, such as #cebong or #kadrun, limited to no more than 100 tweets. The output is a classification of each tweet as cyberbullying or non-cyberbullying, after the tweet has passed through text preprocessing and TF-IDF term weighting. Testing with 100 tweets showed that the system completed classification in an average time of 101,100.2 milliseconds, a processing speed of 0.000989 tweets per millisecond. Evaluation with a confusion matrix gave a recall of 64%, a precision of 58%, and an accuracy of 70%.
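The TF-IDF weighting and confusion-matrix evaluation mentioned in this abstract can be sketched as follows. This is a hedged illustration, not the authors' system: the TF-IDF variant (raw term frequency times log(N/df)) is one common choice and an assumption here, the function names `tf_idf` and `evaluate` are hypothetical, and the SVM classifier itself is not shown. The last line only re-derives the reported throughput from the reported average time.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: raw term frequency * log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

def evaluate(tp, fp, fn, tn):
    """Precision, recall and accuracy from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Throughput check using the figures in the abstract:
# 100 tweets processed in 101100.2 ms on average.
throughput = 100 / 101100.2  # tweets per millisecond
```

A term that occurs in every document receives weight 0 under this variant, which is why very common words contribute nothing to the classifier's input.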


Author(s):  
Indra Gunawan

Abstract: Plagiarism is common in education, particularly in higher education. It generally arises from laziness and a wish to finish assignments quickly. The Rabin-Karp algorithm is a string-searching algorithm, used here to detect plagiarism in text. The aim of the study is to evaluate the results obtained with the Rabin-Karp algorithm. The data are processed in two ways: with all preprocessing stages (case folding, tokenizing, filtering, and stemming) and with partial preprocessing (case folding only). The k-grams tested are 2-grams, 3-grams, 4-grams, 5-grams, and 6-grams; each passes through hashing to produce fingerprint values, whose similarity is then measured with the Dice Similarity Coefficient. The research method is text mining, with the stages acquisition, text preprocessing, modeling, and evaluation. On the data used, the average total similarity was 86.84% for 2-grams, 69.56% for 3-grams, 56.06% for 4-grams, 48.71% for 5-grams, and 44.30% for 6-grams. Full preprocessing yields lower similarity percentages than partial preprocessing, because filtering removes words and stemming alters them. It can be concluded that the data show evidence of plagiarism in abstracts, supported by data with similarity values of up to 100%.
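The partial-preprocessing path described above (case folding, k-gram hashing, fingerprint comparison with the Dice coefficient) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the hash base and modulus, the use of sets for fingerprints, the removal of non-alphanumeric characters during case folding, and all function names are choices made for this example.

```python
import re

def case_fold(text):
    """Partial preprocessing: lowercase, then keep only letters and digits.
    (Stripping punctuation and spaces is an assumption of this sketch.)"""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def rolling_hashes(text, k, base=256, mod=1_000_003):
    """Rabin-Karp rolling hash of every k-gram in text.
    base and mod are illustrative values, not taken from the study."""
    if len(text) < k:
        return []
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(k, len(text)):
        # Drop the leftmost character, shift, and append the new one.
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.append(h)
    return hashes

def dice_similarity(a, b, k):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over sets of k-gram hashes."""
    fa = set(rolling_hashes(case_fold(a), k))
    fb = set(rolling_hashes(case_fold(b), k))
    if not fa and not fb:
        return 0.0
    return 2 * len(fa & fb) / (len(fa) + len(fb))
```

Two texts that differ only in case and punctuation score 1.0 under this sketch, and longer k-grams generally leave fewer shared fingerprints between near-duplicates, which is consistent with the falling similarity from 2-grams to 6-grams reported in the abstract.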


2021 ◽  
Author(s):  
Esam Alzahrani ◽  
Leon Jololian

Forensic author profiling plays an important role in indicating possible profiles for suspects. Among the many automated solutions recently proposed for author profiling, transfer learning outperforms many other state-of-the-art techniques in natural language processing. Nevertheless, this sophisticated technique has yet to be fully exploited for author profiling. At the same time, whereas current methods of author profiling, all largely based on feature engineering, have spawned significant variation in each model used, transfer learning usually requires preprocessed text to be fed into the model. We reviewed multiple references in the literature and determined the most common preprocessing techniques associated with author gender profiling. Considering the variations in potential preprocessing techniques, we conducted an experimental study that involved applying five such techniques to measure each technique’s effect while using the BERT model, chosen for being one of the most-used stock pretrained models. We used the Hugging Face Transformers library to implement the code for each preprocessing case. In our five experiments, we found that BERT achieves the best accuracy in predicting the gender of the author when no preprocessing technique is applied. Our best case achieved 86.67% accuracy in predicting the gender of authors.


2021 ◽  
Author(s):  
Vasyl Starko ◽  
Andriy Rysin ◽  
Maria Shvedova

2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Mustafa Mhamed ◽  
Richard Sutcliffe ◽  
Xia Sun ◽  
Jun Feng ◽  
Eiad Almekhlafi ◽  
...  

Sentiment analysis is an essential process in many natural language applications. In this paper, we apply two models for Arabic sentiment analysis to the ASTD and ATDFS datasets, in both 2-class and multiclass forms. Model MC1 is a 2-layer CNN with global average pooling, followed by a dense layer. MC2 is a 2-layer CNN with max pooling, followed by a BiGRU and a dense layer. On the difficult ASTD 4-class task, we achieve 73.17%, compared to 65.58% reported by Attia et al., 2018. For the easier 2-class task, we achieve 90.06% with MC1 compared to 85.58% reported by Kwaik et al., 2019. We carry out experiments on various data splits, to match those used by other researchers. We also pay close attention to Arabic preprocessing and include novel steps not reported in other works. In an ablation study, we investigate the effect of two steps in particular, the processing of emoticons and the use of a custom stoplist. On the 4-class task, these can make a difference of up to 4.27% and 5.48%, respectively. On the 2-class task, the maximum improvements are 2.95% and 3.87%.
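The two ablated preprocessing steps, emoticon processing and a custom stoplist, can be pictured as independent switches in a preprocessing function. The sketch below is purely hypothetical: the abstract does not say how emoticons are processed or what the stoplist contains, so the mapping to placeholder tokens, the English stand-in stopwords, and the name `preprocess` are all assumptions for illustration.

```python
# Hypothetical emoticon-to-token mapping; the paper's actual rules are not given.
EMOTICONS = {":)": "EMO_POS", ":(": "EMO_NEG"}

# Hypothetical English placeholder; the paper uses a custom Arabic stoplist.
STOPLIST = {"the", "a", "is"}

def preprocess(text, use_emoticons=True, use_stoplist=True):
    """Whitespace-tokenize and apply the two optional steps.

    Turning a flag off leaves the corresponding tokens untouched, which is
    how an ablation run would be configured in this sketch.
    """
    out = []
    for tok in text.lower().split():
        if use_emoticons and tok in EMOTICONS:
            out.append(EMOTICONS[tok])  # replace emoticon with a sentiment token
        elif use_stoplist and tok in STOPLIST:
            continue                    # drop stopword
        else:
            out.append(tok)
    return out
```

Running the same classifier on the output of `preprocess` with each flag switched off in turn is one way to measure how much each step contributes.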


2021 ◽  
Vol 39 (3) ◽  
pp. 121-128
Author(s):  
Chulho Kim

Natural language processing (NLP) is a computerized approach to analyzing text that explores how computers can be used to understand and manipulate natural language text or speech to do useful things. In healthcare, NLP techniques are used in a variety of applications, including evaluating the adequacy of treatment, assessing the presence of acute illness, and other clinical decision support. After text has been converted into computer-readable data through text preprocessing, an NLP system can extract valuable information using rule-based algorithms, machine learning, and neural networks. NLP can be used to distinguish subtypes of stroke or to accurately extract critical clinical information such as stroke severity and patient prognosis. If these NLP methods are actively utilized in the future, they will make it possible to get the most out of electronic health records and enable optimal medical judgment.

