Universal Dependencies for Urdu Noisy Text

In this paper, the process of creating a dependency treebank for tweets in Urdu, a morphologically rich and less-resourced language, is described. The 500-tweet Urdu treebank is created by manually annotating the tweets with lemmas, POS tags, and morphological and syntactic relations using the Universal Dependencies annotation scheme, adapted to the peculiarities of Urdu social media text. The annotation process is evaluated through inter-annotator agreement for dependency relations; a total agreement of 94.5% and a resultant weighted Kappa of 0.876 were observed. The treebank is evaluated through 10-fold cross validation using MaltParser with various feature settings. Results show an average UAS score of 74%, an LAS score of 62.9% and an LA score of 69.8%.
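The abstract reports the agreement figures only; as an illustration of how such figures could be computed, here is a minimal scikit-learn sketch over hypothetical dependency-relation labels. It computes raw agreement and an unweighted Cohen's kappa; the paper's weighted Kappa would additionally need a weight matrix over relation labels, which the abstract does not specify.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical dependency-relation labels assigned by two annotators to
# the same tokens; the real data are the 500 manually annotated tweets.
annotator_a = ["nsubj", "obj", "root", "obl", "nmod", "obj"]
annotator_b = ["nsubj", "obj", "root", "obl", "obl",  "obj"]

# Total (raw) agreement: share of tokens with identical labels.
raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects agreement for chance; the paper uses a weighted
# variant, whose weighting scheme is not given in the abstract.
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"total agreement = {raw_agreement:.3f}, kappa = {kappa:.3f}")
```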

Teknika ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 18-26
Author(s):  
Hendry Cipta Husada ◽  
Adi Suryaputra Paramita

Current technological developments have made it easy for many people to obtain and spread information on various social media platforms. Twitter is one of the media frequently used to convey opinions as a form of a person's reaction to something. Opinions found on Twitter can be used by airline companies as a key parameter for gauging public satisfaction as well as evaluation material for the company. Based on this, a method is needed that can automatically classify opinions into positive, negative, or neutral categories through a sentiment analysis process. The sentiment analysis process consists of data preprocessing, term weighting using the TF-IDF method, application of the algorithm, and discussion of the classification results. Opinion classification is performed with a machine learning approach using a multi-class Support Vector Machine (SVM) algorithm. The data used in this study are English-language opinions about airlines from Twitter users. Based on the experiments conducted, the best classification results were obtained using an SVM with an RBF kernel at parameter values C (complexity) = 10 and γ (gamma) = 1, with accuracy values of 84.37% and 80.41% when using 10-fold cross validation.
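The abstract does not name a toolkit; a minimal sketch of the described pipeline (TF-IDF weighting, multi-class SVM with an RBF kernel at C = 10 and gamma = 1, evaluated with 10-fold cross validation), assuming scikit-learn and placeholder tweets, might look like this:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data; the study uses English airline tweets labeled
# positive / negative / neutral. Repeated here only to make the
# 10-fold split runnable.
texts = ["great flight, friendly crew",
         "delayed again, terrible service",
         "flight was on time"] * 10
labels = ["positive", "negative", "neutral"] * 10

# TF-IDF weighting followed by a multi-class SVM with an RBF kernel,
# using the parameter values reported as best (C = 10, gamma = 1).
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="rbf", C=10, gamma=1)),
])

# 10-fold cross validation as in the paper.
scores = cross_val_score(model, texts, labels, cv=10, scoring="accuracy")
print(scores.mean())
```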


2019 ◽  
Vol 2 (1) ◽  
pp. 88-98 ◽  
Author(s):  
Muhammad Zidny Naf'an ◽  
Alhamda Adisoka Bimantara ◽  
Afiatari Larasati ◽  
Ezar Mega Risondang ◽  
Novanda Alim Setya Nugraha

Instagram is a social media platform for sharing images, photos and videos. Instagram has many active users from various circles. In addition to sharing posts, Instagram users can also give likes and comments on other users' posts. However, the comment feature is often misused, for example for cyberbullying, which is an act against the law. Yet, to date, Instagram still does not provide a feature to detect cyberbullying. Therefore, this study aims to create a system that can classify whether or not comments contain elements of cyberbullying. The results of the classification will be used to detect cyberbullying comments. The algorithm used for classification is the Naïve Bayes Classifier. Each comment passes through preprocessing and feature extraction stages with the TF-IDF method. Evaluation and testing use the K-Fold Cross Validation method. The experiment is divided into two settings: with stemming and without stemming. The training data consist of 455 items. The best experimental results achieved an accuracy of 84%, both with and without stemming.
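As an illustration of the described pipeline (preprocessing, TF-IDF features, Naïve Bayes, k-fold cross validation), here is a minimal scikit-learn sketch; the comments, labels and fold count are assumptions, and the stemming step is only indicated by a placeholder rather than a real Indonesian stemmer.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Placeholder comments and labels (1 = cyberbullying, 0 = not);
# the study uses 455 manually labeled Instagram comments.
comments = ["you are so stupid", "nice photo!", "nobody likes you"] * 10
labels = [1, 0, 1] * 10

def preprocess(text):
    # Lowercasing as a stand-in for the full preprocessing step; the
    # "with stemming" experiment would also apply an Indonesian stemmer.
    return text.lower()

model = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocess)),
    ("nb", MultinomialNB()),
])

# K-fold cross validation as in the paper (the fold count is not stated
# in the abstract; 10 is used here).
scores = cross_val_score(model, comments, labels, cv=10, scoring="accuracy")
print(scores.mean())
```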


2020 ◽  
Author(s):  
Kristo Radion Purba ◽  
David Asirvatham ◽  
Raja Kumar Murugesan

In recent years, social media has been growing at an unprecedented rate, and more people have become influencers. Understanding popularity helps ordinary users boost their popularity, and business users choose better influencers. There have been studies predicting the popularity of posted images on social media, but none on a user's popularity as a whole. Furthermore, existing studies have not taken hashtag analysis into consideration, one of the most useful social media features. This research aims to create a model to predict a user's popularity, defined as a combination of engagement rate and follower growth. Six machine learning regression models were tested. The proposed model successfully predicted users' popularity, with R2 up to 0.852, using Random Forest with 10-fold cross-validation. The additional statistical analysis and feature analysis revealed factors that can boost popularity, such as actively posting and following users, completing the user's metadata, and using 11 hashtags. In contrast, it was also found that having a large number of posts and followings in the past does not help in growing popularity, nor does the use of popular hashtags.
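The best-performing setup reported above (Random Forest regression, 10-fold cross-validation, R² scoring) can be illustrated with a short scikit-learn sketch; the features and target below are synthetic stand-ins, not the study's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for user-level features (posting activity, hashtag
# usage, metadata completeness, ...) and the popularity target; the
# real study derives these from users' social media activity.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 0.5 * X[:, 0] + 0.2 * X[:, 3] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# 10-fold cross-validation scored with R^2, as reported in the paper.
r2_scores = cross_val_score(model, X, y, cv=10, scoring="r2")
print(f"mean R^2 = {r2_scores.mean():.3f}")
```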


Computers ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 90
Author(s):  
Amber Baig ◽  
Mutee U Rahman ◽  
Hameedullah Kazi ◽  
Ahsanullah Baloch

Processing of social media text like tweets is challenging for traditional Natural Language Processing (NLP) tools developed for well-edited text, due to the noisy nature of such text. However, demand for tools and resources to correctly process such noisy text has increased in recent years due to its usefulness in various applications. The literature reports various efforts to develop tools and resources for processing such noisy text in various languages, notably for part-of-speech (POS) tagging, an NLP task with a direct effect on the performance of successive text processing steps. Still, no such attempt has been made to develop a POS tagger for Urdu social media content. Thus, the focus of this paper is POS tagging of Urdu tweets. We introduce a new tagset for POS tagging of Urdu tweets along with a POS-tagged Urdu tweets corpus. We also investigate bootstrapping as a potential solution for overcoming the shortage of manually annotated data and present a supervised POS tagger achieving 93.8% precision, 92.9% recall and 93.3% F-measure.
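The bootstrapping procedure is not detailed in this abstract; one common variant is self-training, sketched below with a scikit-learn classifier over simple surface features and hypothetical tweet tokens. The real tagger, features and tagset will differ.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy seed data: (token, tag) pairs; the real work uses manually
# POS-tagged Urdu tweets and a tweet-specific tagset.
seed_tokens = ["kitab", "parhi", "aur", "@user", "#news", "https://t.co/x"]
seed_tags   = ["NOUN", "VERB", "CONJ", "MENTION", "HASHTAG", "URL"]
unlabeled_tokens = ["@friend", "#cricket", "likha", "ya"]

def features(token):
    # Simple surface features; a real tagger would also use context.
    return {"lower": token.lower(), "is_mention": token.startswith("@"),
            "is_hashtag": token.startswith("#"), "is_url": token.startswith("http")}

model = Pipeline([("vec", DictVectorizer()),
                  ("clf", LogisticRegression(max_iter=1000))])

X, y = [features(t) for t in seed_tokens], list(seed_tags)
for _ in range(3):  # bootstrapping iterations
    model.fit(X, y)
    probs = model.predict_proba([features(t) for t in unlabeled_tokens])
    preds = model.classes_[probs.argmax(axis=1)]
    confident = probs.max(axis=1) > 0.6  # keep only confident auto-tags
    X += [features(t) for t, keep in zip(unlabeled_tokens, confident) if keep]
    y += [p for p, keep in zip(preds, confident) if keep]
```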


2015 ◽  
Vol 41 (3) ◽  
pp. 539-548 ◽  
Author(s):  
Sarvnaz Karimi ◽  
Jie Yin ◽  
Jiri Baum

In recent years, many studies have been published on data collected from social media, especially microblogs such as Twitter. However, rather few of these studies have considered evaluation methodologies that take into account the statistically dependent nature of such data, which breaks the theoretical conditions for using cross-validation. Despite concerns raised in the past about using cross-validation for data of similar characteristics, such as time series, some of these studies evaluate their work using standard k-fold cross-validation. Through experiments on Twitter data collected during a two-year period that includes disastrous events, we show that by ignoring the statistical dependence of the text messages published in social media, standard cross-validation can result in misleading conclusions in a machine learning task. We explore alternative evaluation methods that explicitly deal with statistical dependence in text. Our work also raises concerns for any other data for which similar conditions might hold.
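The alternative evaluation methods are not spelled out in this abstract; one standard way to respect the dependence between related messages is to split by group (for example, by event) rather than at random, as in this scikit-learn sketch. The event grouping and the data here are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Synthetic stand-in: 100 messages, each belonging to one of 10 events.
# Messages about the same event are statistically dependent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
event_id = np.repeat(np.arange(10), 10)

# Standard k-fold ignores the dependence: near-duplicate messages from
# the same event can land in both training and test folds.
standard_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Grouped splitting keeps all messages of an event in the same fold,
# one way to respect the dependence structure discussed in the paper.
grouped_cv = GroupKFold(n_splits=5)
for train_idx, test_idx in grouped_cv.split(X, y, groups=event_id):
    assert set(event_id[train_idx]).isdisjoint(event_id[test_idx])
```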


2020 ◽  
Vol 7 (3) ◽  
pp. 501
Author(s):  
Wiga Maulana Baihaqi ◽  
Muliasari Pinilih ◽  
Miftakhul Rohmah

<p class="Abstrak">Tulisan yang disampaikan melalui twitter dinamakan dengan <em>tweets</em> atau dalam bahasa indonesia lebih dikenal dengan kicau, tulisan yang di<em>share</em> memiliki batas maksimum, tulisan tidak boleh lebih dari 140 karakter, karakter disini terdiri dari huruf, angka, dan simbol. Penyalahgunaan dalam berpendapat sering terjadi di media sosial, sering kali pengguna media sosial dengan sadar atau tidak sadar telah membuat konten yang mengandung isu Suku (dalam hal ini menyangkut keturunan), agama, ras (kebangsaan) dan antargolongan (SARA). Perlu adanya analisis yang dapat mengidentifikasi secara otomatis apakah kalimat yang ditulis pada media sosial mengandung unsur SARA atau tidak, akan tetapi korpus tentang kalimat yang mengandung unsur SARA belum ada, selain itu label kalimat yang menandakan kalimat SARA atau bukan tidak ada. Penelitian ini bertujuan untuk membuat <em>corpus</em> kalimat yang mengandung unsur SARA yang didapatkan dari twitter, kemudian melabeli kalimat dengan label mengandung unsur SARA dan tidak,  serta melakukan <em>sentiment</em> klasifikasi.  Algoritme yang digunakan untuk proses pelabelan adalah k-<em>means</em>, sedangkan <em>Support Vector Machine</em> (SVM) digunakan untuk proses klasifikasi. Hasil yang diperoleh berdasarkan k-<em>means</em> antara lain 118 <em>tweet</em> positif SARA dan 83 <em>tweet</em> negatif SARA. Dalam proses klasifikasi menggunakan dua metode validasi, yaitu 5-<em>fold cross validation</em> yang dibandingkan dengan 10-<em>fold cross validation</em>, hasil akurasi dari kedua metode validasi tersebut yaitu, masing-masing 64,18% dan 63,68%. Berdasarkan hasil akurasi yang diperoleh untuk meningkatkan hasil akurasi, data hasil proses k-<em>means</em> diolah kembali dengan validasi pakar bahasa, hasil yang diperoleh menjadi 139 <em>tweet</em> positif SARA dan 62 <em>tweet</em> negatif SARA, hasil akurasi meningkat menjadi 70,15% dan 71,14%. Dari hasil yang didapatkan, twitter dapat dijadikan sumber untuk membuat <em>corpus</em> mengenai kalimat SARA, dan metode yang diusulkan berhasil untuk proses pelabelan dan sentimen klasifikasi, akan tetapi masih perlu peningkatan hasil akurasi.</p><p class="Abstrak"> </p><p class="Abstrak"><em><strong>Abstract</strong></em></p><p class="Abstract"><em>Posts sent via twitter are called tweets or in Indonesian better known as chirping, the posts shared have a maximum limit, the writing cannot be more than 140 characters, the characters here consist of letters, numbers, and symbols. Broadcasting in discussions that often occur on social media, often users of social media consciously or unconsciously have created content that contains issues of ethnicity, religion, race (nationality) and intergroup (SARA). Obtained from the analysis that can automatically contain sentences on social media containing no SARA or not, but the corpus about sentences containing SARA does not yet exist, other than that the sentence label indicates SARA or no sentence. This study aims to make sentence corpus containing SARA elements obtained from twitter, then label sentences with labels containing elements of SARA and not, and conduct group sentiments. The algorithm used for the labeling process is k-means, while Support Vector Machine (SVM) is used for the classification process. The results obtained based on k-means include 118 positive SARA tweets and 83 negative SARA tweets. 
In the classification process using two validation methods, namely cross-fold validation of 5 times compared with 10-fold cross validation, the accuracy of the two validation methods is 64.18% and 63.68%, respectively. Based on the results obtained to improve the results, the k-means process data were reprocessed with linguists, the results obtained were 139 positive SARA tweets and 62 SARA negative tweets, the results of which increased to 70.15% and 71.14%. From the results obtained, Twitter can be used as a source to create a corpus about SARA sentences, and methods that have succeeded in labeling and classification sentiments, but still need to improve the results of accuracy.<strong></strong></em></p><p class="Abstrak"><em><strong><br /></strong></em></p>
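As an illustration of the two-stage setup described above (k-means as a weak labeling step, then SVM classification evaluated with 5-fold and 10-fold cross validation), here is a minimal scikit-learn sketch with placeholder tweets; the vectorizer, kernel and texts are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder tweets; the study collects 201 Indonesian tweets.
tweets = ["contoh kalimat netral", "contoh kalimat bermuatan sara",
          "tweet biasa tentang olahraga"] * 10

# Vectorize, then use k-means with two clusters as a weak labeling step
# (the paper later revalidates these labels with a linguistic expert).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# SVM classification evaluated with both 5-fold and 10-fold CV.
svm = SVC(kernel="linear")
for k in (5, 10):
    acc = cross_val_score(svm, X, labels, cv=k, scoring="accuracy").mean()
    print(f"{k}-fold accuracy: {acc:.4f}")
```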


Healthcare ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. 1679
Author(s):  
Afiq Izzudin A. Rahim ◽  
Mohd Ismail Ibrahim ◽  
Sook-Ling Chua ◽  
Kamarul Imran Musa

While experts have recognised the significance and necessity of social media integration in healthcare, no systematic method has been devised in Malaysia or Southeast Asia to include social media input into the hospital quality improvement process. The goal of this work is to explain how to develop a machine learning system for classifying Facebook reviews of public hospitals in Malaysia by using service quality (SERVQUAL) dimensions and sentiment analysis. We developed a Machine Learning Quality Classifier (MLQC) based on the SERVQUAL model and a Machine Learning Sentiment Analyzer (MLSA) by manually annotating multiple batches of randomly chosen reviews. Logistic regression (LR), naive Bayes (NB), support vector machine (SVM), and other methods were used to train the classifiers. The performance of each classifier was tested using 5-fold cross validation. For topic classification, the average F1-score was between 0.687 and 0.757 for all models. In a 5-fold cross validation of each SERVQUAL dimension and in sentiment analysis, SVM consistently outperformed the other methods. The study demonstrates how to use supervised learning to automatically identify SERVQUAL domains and sentiments from patient experiences on a hospital's Facebook page. Malaysian healthcare providers can gather and assess data on patient care via the use of these content analysis technologies to improve hospital quality of care.
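A minimal sketch of the classifier comparison (LR, NB and SVM over text features, 5-fold cross validation, F1 scoring), assuming scikit-learn, TF-IDF features and placeholder reviews, might look like this; the real feature engineering and SERVQUAL labels come from the study's manual annotation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Placeholder reviews labeled with one SERVQUAL dimension each; the
# study annotates real Facebook reviews of Malaysian public hospitals.
reviews = ["staff were very caring", "waited three hours to be seen",
           "ward was clean and well equipped"] * 10
dimensions = ["empathy", "responsiveness", "tangibles"] * 10

classifiers = {
    "LR":  LogisticRegression(max_iter=1000),
    "NB":  MultinomialNB(),
    "SVM": SVC(kernel="linear"),
}

# 5-fold cross validation with an F1 score, as reported in the paper.
for name, clf in classifiers.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    f1 = cross_val_score(pipe, reviews, dimensions, cv=5, scoring="f1_macro").mean()
    print(f"{name}: macro-F1 = {f1:.3f}")
```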


2018 ◽  
Vol 1 (1) ◽  
pp. 120-130 ◽  
Author(s):  
Chunxiang Qian ◽  
Wence Kang ◽  
Hao Ling ◽  
Hua Dong ◽  
Chengyao Liang ◽  
...  

A Support Vector Machine (SVM) model optimized by K-fold cross-validation was built to predict and evaluate the degradation of concrete strength in a complicated marine environment. Meanwhile, several mathematical models, such as an Artificial Neural Network (ANN) and a Decision Tree (DT), were also built and compared with the SVM to determine which one makes the most accurate predictions. The material factors and environmental factors that influence the results were considered. The material factors mainly involved the original concrete strength and the amount of cement replaced by fly ash and slag. The environmental factors consisted of the concentrations of Mg2+, SO42- and Cl-, temperature and exposure time. It was concluded from the prediction results that the optimized SVM model performs better than the other models in predicting concrete strength. Based on the SVM model, a simulation method of variable limitation was used to determine the sensitivity of the various factors and the degree of their influence on the degradation of concrete strength.
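One plausible reading of "SVM optimized by K-fold cross-validation" is a grid search over support vector regression hyperparameters scored by cross validation; the sketch below assumes scikit-learn, a 5-fold split and synthetic data, none of which come from the paper.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the inputs (original strength, fly ash / slag
# replacement, ion concentrations, temperature, exposure time) and the
# residual-strength target; the study uses measured marine-exposure data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 7))
y = 40 - 10 * X[:, 4] - 5 * X[:, 6] + rng.normal(scale=1.0, size=150)

# SVM regression with hyperparameters chosen by k-fold cross-validation.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```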


2014 ◽  
Author(s):  
Sandeep Soni ◽  
Tanushree Mitra ◽  
Eric Gilbert ◽  
Jacob Eisenstein

2016 ◽  
Vol 7 (2) ◽  
pp. 75-80
Author(s):  
Adhi Kusnadi ◽  
Risyad Ananda Putra

Indonesia is a country with a relatively large population. Over a period of 5 years, the government annually runs a procurement program of 1 million FLPP house units. This program is held in an effort to provide decent homes for low-income people. FLPP housing development requires precision and speed on the part of the developer, which is often hampered by the bank process, because it is difficult to predict the outcome and the speed of data processing at the bank. Knowing a consumer's ability to obtain subsidised credit has many advantages: among others, developers can plan cash flow better, and they can replace consumers who would be rejected before entering the bank process. For that reason, a system was built to help developers. There are many methods that can be used to create this application; one of them is data mining with a classification tree. The application's 10-fold cross-validation results show an accuracy of 92%. Index Terms: Data Mining, Classification Tree, Housing, FLPP, 10-Fold Cross Validation, Consumer Capability
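A minimal sketch of a classification tree evaluated with 10-fold cross validation, assuming scikit-learn and synthetic applicant records (the study's real features are not listed in the abstract), could look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic consumer records (income, age, existing installments, years
# employed) and a binary outcome (credit approved or not); the real
# study would use historical applicant data.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(2, 8, 300),     # monthly income (million IDR)
    rng.integers(21, 55, 300),  # age
    rng.uniform(0, 3, 300),     # existing installments (million IDR)
    rng.integers(0, 20, 300),   # years employed
])
y = (X[:, 0] - X[:, 2] > 3).astype(int)  # toy approval rule

# Classification tree evaluated with 10-fold cross validation,
# mirroring the setup described in the abstract.
tree = DecisionTreeClassifier(random_state=0)
acc = cross_val_score(tree, X, y, cv=10, scoring="accuracy").mean()
print(f"10-fold accuracy: {acc:.2%}")
```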

