Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

Oluwafemi Oriola; Eduan Kotzé

doi:10.18489/sacj.v32i2.847

Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

South African Computer Journal ◽

10.18489/sacj.v32i2.847 ◽

2020 ◽

Vol 32 (2) ◽

Author(s):

Oluwafemi Oriola ◽

Eduan Kotzé

Keyword(s):

Logistic Regression ◽

Supervised Learning ◽

South African ◽

Learning Curves ◽

Training Data ◽

Support Vector ◽

Learning Techniques ◽

Learning Technique ◽

Unlabelled Data ◽

Language Detection

Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, the existing semi-supervised learning methods have been skewed towards small amounts of labelled data, with small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on the majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks are evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique compared to existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier records the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.

Download Full-text

Identifying Botnet on IoT by Using Supervised Learning Techniques

Oriental journal of computer science and technology ◽

10.13005/ojcst12.04.04 ◽

2019 ◽

Vol 12 (4) ◽

pp. 185-193

Author(s):

Amirhossein Rezaei

Keyword(s):

Supervised Learning ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbors ◽

Malicious Software ◽

Learning Techniques ◽

Security Challenges ◽

Learning Technique ◽

Security Challenge ◽

The Moment

The security challenge on IoT (Internet of Things) is one of the hottest and most pertinent topics at the moment especially the several security challenges. The Botnet is one of the security challenges that most impact for several purposes. The network of private computers infected by malicious software and controlled as a group without the knowledge of owners and each of them running one or more bots is called Botnets. Normally, it is used for sending spam, stealing data, and performing DDoS attacks. One of the techniques that been used for detecting the Botnet is the Supervised Learning method. This study will examine several Supervised Learning methods such as; Linear Regression, Logistic Regression, Decision Tree, Naive Bayes, k- Nearest Neighbors, Random Forest, Gradient Boosting Machines, and Support Vector Machine for identifying the Botnet in IoT with the aim of finding which Supervised Learning technique can achieve the highest accuracy and fastest detection as well as with minimizing the dependent variable.

Download Full-text

Performance Analysis of Statistical and Supervised Learning Techniques in Stock Data Mining

Data ◽

10.3390/data3040054 ◽

2018 ◽

Vol 3 (4) ◽

pp. 54 ◽

Cited By ~ 7

Author(s):

Manik Sharma ◽

Samriti Sharma ◽

Gurvinder Singh

Keyword(s):

Logistic Regression ◽

Supervised Learning ◽

Misclassification Rate ◽

Support Vector ◽

P Value ◽

Linear Discriminant ◽

Statistical Measures ◽

Specificity And Sensitivity ◽

Learning Techniques ◽

Topsis Technique

Nowadays, overwhelming stock data is available, which areonly of use if it is properly examined and mined. In this paper, the last twelve years of ICICI Bank’s stock data have been extensively examined using statistical and supervised learning techniques. This study may be of great interest for those who wish to mine or study the stock data of banks or any financial organization. Different statistical measures have been computed to explore the nature, range, distribution, and deviation of data. The different descriptive statistical measures assist in finding different valuable metrics such as mean, variance, skewness, kurtosis, p-value, a-squared, and 95% confidence mean interval level of ICICI Bank’s stock data. Moreover, daily percentage changes occurring over the last 12 years have also been recorded and examined. Additionally, the intraday stock status has been mined using ten different classifiers. The performance of different classifiers has been evaluated on the basis of various parameters such as accuracy, misclassification rate, precision, recall, specificity, and sensitivity. Based upon different parameters, the predictive results obtained using logistic regression are more acceptable than the outcomes of other classifiers, whereas naïve Bayes, C4.5, random forest, linear discriminant, and cubic support vector machine (SVM) merely act as a random guessing machine. The outstanding performance of logistic regression has been validated using TOPSIS (technique for order preference by similarity to ideal solution) and WSA (weighted sum approach).

Download Full-text

Mengenal Machine Learning Dengan Teknik Supervised Dan Unsupervised Learning Menggunakan Python

BINA INSANI ICT JOURNAL ◽

10.51211/biict.v7i2.1422 ◽

2020 ◽

Vol 7 (2) ◽

pp. 156

Author(s):

Endang Retnoningsih ◽

Rully Pramudita

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Clustering Algorithm ◽

Training Data ◽

Abstract Machine ◽

Dbscan Clustering ◽

Learning Techniques ◽

Learning Technique ◽

Learning Programming

Abstrak: Machine learning merupakan sistem yang mampu belajar sendiri untuk memutuskan sesuatu tanpa harus berulangkali diprogram oleh manusia sehingga komputer menjadi semakin cerdas berlajar dari pengalaman data yang dimiliki. Berdasarkan teknik pembelajarannya, dapat dibedakan supervised learning menggunakan dataset (data training) yang sudah berlabel, sedangkan unsupervised learning menarik kesimpulan berdasarkan dataset. Input berupa dataset digunakan pembelajaran mesin untuk menghasilkan analisis yang benar. Permasalahan yang akan diselesaikan bunga iris (iris tectorum) yang memiliki bunga bermaca-macam warna dan memiliki sepal dan petal yang menunjukkan spesies bunga, dibutuhkan metode yang tepat untuk pengelompokan bunga-bunga tersebut kedalam spesiesnya iris-setosa, iris-versicolor atau iris-virginica. Penyelesaian digunakan Python yang menyediakan algoritma dan library yang digunakan membuat machine learning. Penyelesaian dengan teknik supervised learning dipilih algoritma KNN Clasiffier dan teknik unsupervised learning dipilih algoritma DBSCAN Clustering. Hasil yang diperoleh Python menyediakan library yang lengkap numPy, Pandas, matplotlib, sklearn untuk membuat pemrograman machine learning dengan algortima KNN memanggil from sklearn import neighbors termasuk teknik supervised, maupun DBSCAN memanggil from sklearn.cluster import DBSCAN termasuk teknik unsupervised learning. Kemampuan Python memberikan hasil output sesuai input dalam dataset menghasilkan keputusan berupa klasifikasi maupun klusterisasi. Kata kunci: DBSCAN, KNN, machine learning, python. Abstract: Machine learning is a system that is able to learn on its own to decide something without having to be repeatedly programmed by humans so that computers become smarter in learning from the experience of the data they have. Based on the learning technique, supervised learning can be distinguished using a dataset (training data) that is already labeled, while unsupervised learning draws conclusions based on the dataset. The input in the form of a dataset is used by machine learning to produce the correct analysis. The problem to be solved by iris flowers (iris tectorum), which has flowers of various colors and has sepals and petals that indicate the species of flowers, requires an appropriate method for grouping these flowers into iris-setosa, iris-versicolor or iris-virginica species. The solution is used by Python, which provides the algorithms and libraries used to make machine learning. The solution with the supervised learning technique was chosen by the KNN Clasiffier algorithm and the unsupervised learning technique was selected by the DBSCAN Clustering algorithm. The results obtained by Python provide a complete library of numPy, Pandas, matplotlib, sklearn to create machine learning programming with KNN algorithms calling from sklearn import neighbors including supervised techniques, and DBSCAN calling from sklearn.cluster import DBSCAN including unsupervised learning techniques. Python's ability to provide output according to the input in the dataset results in decisions in the form of classification and clustering. Keywords: DBSCAN, KNN, machine learning, python.

Download Full-text

A sentiment analysis system for social media using machine learning techniques: Social enablement

Digital Scholarship in the Humanities ◽

10.1093/llc/fqy037 ◽

2018 ◽

Vol 34 (3) ◽

pp. 569-581 ◽

Cited By ~ 1

Author(s):

Sujata Rani ◽

Parteek Kumar

Keyword(s):

Machine Learning ◽

Social Media ◽

Sentiment Analysis ◽

Media Analysis ◽

Training Data ◽

Machine Learning Techniques ◽

Support Vector ◽

Analysis Tool ◽

Data Set ◽

Learning Techniques

Abstract In this article, an innovative approach to perform the sentiment analysis (SA) has been presented. The proposed system handles the issues of Romanized or abbreviated text and spelling variations in the text to perform the sentiment analysis. The training data set of 3,000 movie reviews and tweets has been manually labeled by native speakers of Hindi in three classes, i.e. positive, negative, and neutral. The system uses WEKA (Waikato Environment for Knowledge Analysis) tool to convert these string data into numerical matrices and applies three machine learning techniques, i.e. Naive Bayes (NB), J48, and support vector machine (SVM). The proposed system has been tested on 100 movie reviews and tweets, and it has been observed that SVM has performed best in comparison to other classifiers, and it has an accuracy of 68% for movie reviews and 82% in case of tweets. The results of the proposed system are very promising and can be used in emerging applications like SA of product reviews and social media analysis. Additionally, the proposed system can be used in other cultural/social benefits like predicting/fighting human riots.

Download Full-text

Classification Based on Unsupervised Learning

Statistical Techniques for Network Security ◽

10.4018/978-1-59904-708-9.ch010 ◽

2011 ◽

pp. 348-395

Author(s):

Yu Wang

Keyword(s):

Network Security ◽

Unsupervised Learning ◽

Supervised Learning ◽

Network Traffic ◽

High Speed ◽

Ad Hoc ◽

Training Data ◽

Traffic Data ◽

Response Variable ◽

Learning Techniques

The requirement for having a labeled response variable in training data from the supervised learning technique may not be satisfied in some situations: particularly, in dynamic, short-term, and ad-hoc wireless network access environments. Being able to conduct classification without a labeled response variable is an essential challenge to modern network security and intrusion detection. In this chapter we will discuss some unsupervised learning techniques including probability, similarity, and multidimensional models that can be applied in network security. These methods also provide a different angle to analyze network traffic data. For comprehensive knowledge on unsupervised learning techniques please refer to the machine learning references listed in the previous chapter; for their applications in network security see Carmines, Edward & McIver (1981), Lane & Brodley (1997), Herrero, Corchado, Gastaldo, Leoncini, Picasso & Zunino (2007), and Dhanalakshmi & Babu (2008). Unlike in supervised learning, where for each vector 1 2 ( , , , ) n X x x x = ? we have a corresponding observed response, Y, in unsupervised learning we only have X, and Y is not available either because we could not observe it or its frequency is too low to be fit ted with a supervised learning approach. Unsupervised learning has great meanings in practice because in many circumstances, available network traffic data may not include any anomalous events or known anomalous events (e.g., traffics collected from a newly constructed network system). While high-speed mobile wireless and ad-hoc network systems have become popular, the importance and need to develop new unsupervised learning methods that allow the modeling of network traffic data to use anomaly-free training data have significantly increased.

Download Full-text

A Study on Machine Vision Techniques for the Inspection of Health Personnels’ Protective Suits for the Treatment of Patients in Extreme Isolation

Electronics ◽

10.3390/electronics8070743 ◽

2019 ◽

Vol 8 (7) ◽

pp. 743 ◽

Cited By ~ 1

Author(s):

Alice Stazio ◽

Juan G. Victores ◽

David Estevez ◽

Carlos Balaguer

Keyword(s):

Logistic Regression ◽

Machine Vision ◽

Training Data ◽

Gradient Boosting ◽

Support Vector ◽

Classification Algorithms ◽

Adaptive Boosting ◽

Blood Stains ◽

Extreme Gradient Boosting ◽

Vector Machines

The examination of Personal Protective Equipment (PPE) to assure the complete integrity of health personnel in contact with infected patients is one of the most necessary tasks when treating patients affected by infectious diseases, such as Ebola. This work focuses on the study of machine vision techniques for the detection of possible defects on the PPE that could arise after contact with the aforementioned pathological patients. A preliminary study on the use of image classification algorithms to identify blood stains on PPE subsequent to the treatment of the infected patient is presented. To produce training data for these algorithms, a synthetic dataset was generated from a simulated model of a PPE suit with blood stains. Furthermore, the study proceeded with the utilization of images of the PPE with a physical emulation of blood stains, taken by a real prototype. The dataset reveals a great imbalance between positive and negative samples; therefore, all the selected classification algorithms are able to manage this kind of data. Classifiers range from Logistic Regression and Support Vector Machines, to bagging and boosting techniques such as Random Forest, Adaptive Boosting, Gradient Boosting and eXtreme Gradient Boosting. All these algorithms were evaluated on accuracy, precision, recall and F 1 score; and additionally, execution times were considered. The obtained results report promising outcomes of all the classifiers, and, in particular Logistic Regression resulted to be the most suitable classification algorithm in terms of F 1 score and execution time, considering both datasets.

Download Full-text

Assessment of the effects of training data selection on the landslide susceptibility mapping: a comparison between support vector machine (SVM), logistic regression (LR) and artificial neural networks (ANN)

Geomatics Natural Hazards and Risk ◽

10.1080/19475705.2017.1407368 ◽

2017 ◽

Vol 9 (1) ◽

pp. 49-69 ◽

Cited By ~ 102

Author(s):

Bahareh Kalantar ◽

Biswajeet Pradhan ◽

Seyed Amir Naghibi ◽

Alireza Motevalli ◽

Shattri Mansor

Keyword(s):

Neural Networks ◽

Support Vector Machine ◽

Logistic Regression ◽

Artificial Neural Networks ◽

Landslide Susceptibility ◽

Training Data ◽

Data Selection ◽

Landslide Susceptibility Mapping ◽

Support Vector ◽

Training Data Selection

Download Full-text

Machine learning versus logistic regression methods for 2-year mortality prognostication in a small, heterogeneous glioma database

10.1101/472555 ◽

2018 ◽

Cited By ~ 2

Author(s):

Sandip S Panesar ◽

Rhett N D’Souza ◽

Fang-Cheng Yeh ◽

Juan C Fernandez-Miranda

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Machine Learning Techniques ◽

World Health ◽

Support Vector ◽

Molecular Characteristics ◽

Regression Methods ◽

Learning Techniques ◽

The World ◽

Health Organization

AbstractBackgroundMachine learning (ML) is the application of specialized algorithms to datasets for trend delineation, categorization or prediction. ML techniques have been traditionally applied to large, highly-dimensional databases. Gliomas are a heterogeneous group of primary brain tumors, traditionally graded using histopathological features. Recently the World Health Organization proposed a novel grading system for gliomas incorporating molecular characteristics. We aimed to study whether ML could achieve accurate prognostication of 2-year mortality in a small, highly-dimensional database of glioma patients.MethodsWe applied three machine learning techniques: artificial neural networks (ANN), decision trees (DT), support vector machine (SVM), and classical logistic regression (LR) to a dataset consisting of 76 glioma patients of all grades. We compared the effect of applying the algorithms to the raw database, versus a database where only statistically significant features were included into the algorithmic inputs (feature selection).ResultsRaw input consisted of 21 variables, and achieved performance of (accuracy/AUC): 70.7%/0.70 for ANN, 68%/0.72 for SVM, 66.7%/0.64 for LR and 65%/0.70 for DT. Feature selected input consisted of 14 variables and achieved performance of 73.4%/0.75 for ANN, 73.3%/0.74 for SVM, 69.3%/0.73 for LR and 65.2%/0.63 for DT.ConclusionsWe demonstrate that these techniques can also be applied to small, yet highly-dimensional datasets. Our ML techniques achieved reasonable performance compared to similar studies in the literature. Though local databases may be small versus larger cancer repositories, we demonstrate that ML techniques can still be applied to their analysis, though traditional statistical methods are of similar benefit.

Download Full-text

Deep Learning Technique to Predict Heart Disease using IoT Based ECG Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b7166.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 2559-2562

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Deep Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Rule Based ◽

Learning Techniques ◽

Learning Technique ◽

Logistic Regression Algorithm ◽

Target Output

Scientific Knowledge and Electronic devices are growing day by day. In this aspect, many expert systems are involved in the healthcare industry using machine learning algorithms. Deep neural networks beat the machine learning techniques and often take raw data i.e., unrefined data to calculate the target output. Deep learning or feature learning is used to focus on features which is very important and gives a complete understanding of the model generated. Existing methodology used data mining technique like rule based classification algorithm and machine learning algorithm like hybrid logistic regression algorithm to preprocess data and extract meaningful insights of data. This is, however a supervised data. The proposed work is based on unsupervised data that is there is no labelled data and deep neural techniques is deployed to get the target output. Machine learning algorithms are compared with proposed deep learning techniques using TensorFlow and Keras in the aspect of accuracy. Deep learning methodology outfits the existing rule based classification and hybrid logistic regression algorithm in terms of accuracy. The designed methodology is tested on the public MIT-BIH arrhythmia database, classifying four kinds of abnormal beats. The proposed approach based on deep learning technique offered a better performance, improving the results when compared to machine learning approaches of the state-of-the-art

Download Full-text

Perbandingan Algoritma Klasifikasi Sentimen Twitter Terhadap Insiden Kebocoran Data Tokopedia

JISKA (Jurnal Informatika Sunan Kalijaga) ◽

10.14421/jiska.2021.6.2.120-129 ◽

2021 ◽

Vol 6 (2) ◽

pp. 120-129

Author(s):

Nadhif Ikbar Wibowo ◽

Tri Andika Maulana ◽

Hamzah Muhammad ◽

Nur Aini Rakhmawati

Keyword(s):

Support Vector Machine ◽

Logistic Regression ◽

Random Forest ◽

Supervised Learning ◽

Support Vector ◽

Data Set ◽

Logistic Regression Classifier

Public responses, posted on Twitter reacting to the Tokopedia data leak incident, were used as a data set to compare the performance of three different classifiers, trained using supervised learning modeling, to classify sentiment on the text. All tweets were classified into either positive, negative, or neutral classes. This study compares the performance of Random Forest, Support-Vector Machine, and Logistic Regression classifier. Data was scraped automatically and used to evaluate several models; the SVM-based model has the highest f1-score 0.503583. SVM is the best performing classifier.

Download Full-text