Recognizing Indonesian Acronym and Expansion Pairs with Supervised Learning and MapReduce

Taufik Fuadi Abidin; Amir Mahazir; Muhammad Subianto; Khairul Munadi; Ridha Ferdhiana

doi:10.3390/info11040210

Information ◽

10.3390/info11040210 ◽

2020 ◽

Vol 11 (4) ◽

pp. 210

Author(s):

Taufik Fuadi Abidin ◽

Amir Mahazir ◽

Muhammad Subianto ◽

Khairul Munadi ◽

Ridha Ferdhiana

Keyword(s):

Supervised Learning ◽

Polynomial Model ◽

Support Vector ◽

Entity Extraction ◽

K Nearest Neighbors ◽

Research Attention ◽

Considerable Research ◽

Learning Techniques ◽

Hadoop Cluster ◽

Candidate Identification

During the previous decades, intelligent identification of acronym and expansion pairs from a large corpus has garnered considerable research attention, particularly in the fields of text mining, entity extraction, and information retrieval. Herein, we present an improved approach to recognize the accurate acronym and expansion pairs from a large Indonesian corpus. Generally, an acronym can be either a combination of uppercase letters or a sequence of speech sounds (syllables). Our proposed approach can be computationally divided into four steps: (1) acronym candidate identification; (2) acronym and expansion pair collection; (3) feature generation; and (4) acronym and expansion pair recognition using supervised learning techniques. Further, we introduce eight numerical features and evaluate their effectiveness in representing the acronym and expansion pairs based on the precision, recall, and F-measure. Furthermore, we compare the k-nearest neighbors (K-NN), support vector machine (SVM), and bidirectional encoder representations from transformers (BERT) algorithms in terms of accurate acronym and expansion pair classification. The experimental results indicate that the SVM polynomial model that considers eight features exhibits the highest accuracy (97.93%), surpassing those of the SVM polynomial model that considers five features (90.45%), the K-NN algorithm with k = 3 that considers eight features (96.82%), the K-NN algorithm with k = 3 that considers five features (95.66%), BERT-Base model (81.64%), and BERT-Base Multilingual Cased model (88.10%). Moreover, we analyze the performance of the Hadoop technology using various numbers of data nodes to identify the acronym and expansion pairs and obtain their feature vectors. The results reveal that the Hadoop cluster containing a large number of data nodes is faster than that with fewer data nodes when processing from ten million to one hundred million pairs of acronyms and expansions.
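As a rough illustration of steps (1) to (3) for the uppercase-letter case only, the following Python sketch finds acronym candidates with a regular expression and computes one numerical feature for a candidate pair. The pattern, the function names, and the feature itself are assumptions made for illustration; the abstract does not list the eight features used by the authors, and syllable-based acronyms are not handled here.

```python
import re

# Assumption: a token of two or more uppercase letters is an acronym candidate.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def find_acronym_candidates(text):
    """Return the uppercase tokens in text that might be acronyms."""
    return ACRONYM_RE.findall(text)


def initial_match_ratio(acronym, expansion):
    """Hypothetical feature: share of acronym letters that equal the
    initial letter of the expansion word at the same position."""
    initials = [word[0].upper() for word in expansion.split()]
    hits = sum(1 for a, b in zip(acronym, initials) if a == b)
    return hits / len(acronym)


text = "Komisi Pemberantasan Korupsi (KPK) mengumumkan hasil penyelidikan."
candidates = find_acronym_candidates(text)
ratio = initial_match_ratio("KPK", "Komisi Pemberantasan Korupsi")
```

A feature vector built from several such ratios could then be passed to a classifier such as K-NN or an SVM; that stage is omitted here.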

Identifying Botnet on IoT by Using Supervised Learning Techniques

Oriental journal of computer science and technology ◽

10.13005/ojcst12.04.04 ◽

2019 ◽

Vol 12 (4) ◽

pp. 185-193

Author(s):

Amirhossein Rezaei

Keyword(s):

Supervised Learning ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbors ◽

Malicious Software ◽

Learning Techniques ◽

Security Challenges ◽

Learning Technique ◽

Security Challenge ◽

The Moment

The security of the IoT (Internet of Things) is one of the hottest and most pertinent topics at the moment, and it involves several security challenges. The Botnet is among the most impactful of these challenges. A Botnet is a network of private computers that are infected by malicious software and controlled as a group without the knowledge of their owners, each of them running one or more bots. Botnets are normally used for sending spam, stealing data, and performing DDoS attacks. One of the techniques that has been used for detecting Botnets is Supervised Learning. This study examines several Supervised Learning methods, namely Linear Regression, Logistic Regression, Decision Tree, Naive Bayes, k-Nearest Neighbors, Random Forest, Gradient Boosting Machines, and Support Vector Machine, for identifying Botnets in the IoT, with the aim of finding which Supervised Learning technique achieves the highest accuracy and fastest detection while minimizing the dependent variable.

Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

South African Computer Journal ◽

10.18489/sacj.v32i2.847 ◽

2020 ◽

Vol 32 (2) ◽

Author(s):

Oluwafemi Oriola ◽

Eduan Kotzé

Keyword(s):

Logistic Regression ◽

Supervised Learning ◽

South African ◽

Learning Curves ◽

Training Data ◽

Support Vector ◽

Learning Techniques ◽

Learning Technique ◽

Unlabelled Data ◽

Language Detection

Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, the existing semi-supervised learning methods have been skewed towards small amounts of labelled data, with small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on the majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks is evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique compared to existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier record the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.
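The majority-voting idea can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: each feature set is taken to propose one label per unlabelled tweet, ties are left unlabelled, and the function name and label strings are hypothetical rather than taken from the paper.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among votes.

    Returns None when there are no votes or when the top two labels tie,
    so that ambiguous items stay unlabelled.
    """
    if not votes:
        return None
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical labels proposed for one tweet by three different feature sets.
votes_for_tweet = ["abusive", "abusive", "non-abusive"]
assigned = majority_label(votes_for_tweet)
```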

Prediction of Liver Diseases by Using Few Machine Learning Based Approaches

Australian Journal of Engineering and Innovative Technology ◽

10.34104/ajeit.020.085090 ◽

2020 ◽

pp. 85-90

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Liver Diseases ◽

Model Building ◽

Medical Science ◽

Machine Learning Techniques ◽

Support Vector ◽

Classification Algorithms ◽

K Nearest Neighbors ◽

Learning Techniques

Advancement in medical science has always been one of the most vital concerns of the human race. As technology progresses, modern techniques and equipment are increasingly applied to treatment. Machine learning techniques are now widely used in medical science to improve accuracy. In this work, we constructed computational models for accurate liver disease prediction. We used several classification algorithms, namely Random Forest, Perceptron, Decision Tree, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), to predict liver diseases. Our work provides an implementation of hybrid model construction and a comparative analysis aimed at improving prediction performance. First, the classification algorithms were applied to the original liver patient datasets collected from the UCI repository. We then analyzed and tuned the features to improve the performance of our predictor and compared the classifiers. We found that the KNN algorithm with feature selection outperformed all other techniques.

The Effect of Weather in Soccer Results: An Approach Using Machine Learning Techniques

Applied Sciences ◽

10.3390/app10196750 ◽

2020 ◽

Vol 10 (19) ◽

pp. 6750

Author(s):

Ditsuhi Iskandaryan ◽

Francisco Ramos ◽

Denny Asarias Palinggi ◽

Sergio Trilles

Keyword(s):

Support Vector Machine ◽

Nearest Neighbors ◽

Research Community ◽

Machine Learning Techniques ◽

Weather Data ◽

Support Vector ◽

K Nearest Neighbors ◽

Task Support ◽

Extremely Randomized Trees ◽

Learning Techniques

The growing popularity of soccer has led to the prediction of match results becoming of interest to the research community. The aim of this research is to detect the effects of weather on the result of matches by implementing Random Forest, Support Vector Machine, K-Nearest Neighbors Algorithm, and Extremely Randomized Trees Classifier. The analysis was executed using the Spanish La Liga and Segunda division from the seasons 2013–2014 to 2017–2018 in combination with weather data. Two tasks were proposed as part of this study: the first was to find out whether the game will end in a draw, a win by the hosts or a victory by the guests, and the second was to determine whether the match will end in a draw or if one of the teams will win. The results show that, for the first task, Extremely Randomized Trees Classifier is a better method, with an accuracy of 65.9%, and, for the second task, Support Vector Machine yielded better results with an accuracy of 79.3%. Moreover, it is possible to predict whether the game will end in a draw or not with 0.85 AUC-ROC. Additionally, for comparative purposes, the analysis was also performed without weather data.

Detection of Loss Zones while Drilling Using Different Machine Learning Techniques

Journal of Energy Resources Technology ◽

10.1115/1.4051553 ◽

2021 ◽

pp. 1-29

Author(s):

Ahmed Alsaihati ◽

Mahmoud Abughaban ◽

Salaheldin Elkatatny ◽

Abdulazeez Abdulraheem

Keyword(s):

Machine Learning ◽

Support Vector Machines ◽

Random Forests ◽

Nearest Neighbors ◽

Machine Learning Techniques ◽

Support Vector ◽

K Nearest Neighbors ◽

Learning Techniques ◽

Vector Machines ◽

Testing Set

Fluid loss into formations is a common operational issue that is frequently encountered when drilling across formations with natural or induced fractures. This could pose significant operational risks, such as well-control, stuck pipe, and wellbore instability, which, in turn, lead to an increase in well time and cost. This research aims to use and evaluate different machine learning techniques, namely support vector machines, random forests, and K-nearest neighbors, in detecting loss circulation occurrences while drilling using solely drilling surface parameters. Actual field data of seven wells, which had suffered partial or severe loss circulation, were used to build predictive models, while Well-8 was used to compare the performance of the developed models. Recall, precision, and F1-score measures were used to evaluate the ability of the developed models to detect loss circulation occurrences. The results showed that the K-nearest neighbors classifier achieved a high F1-score of 0.912 in detecting loss circulation occurrences in the testing set, while the random forests classifier was second best with almost the same F1-score of 0.910. The support vector machines achieved an F1-score of 0.83 in predicting loss circulation occurrences in the testing set. The K-nearest neighbors classifier outperformed the other models in detecting loss circulation occurrences in Well-8 with an F1-score of 0.80. The main contribution of this research as compared to previous studies is that it identifies loss events based on real-time measurements of the active pit volume.
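For reference, precision, recall, and F1-score follow directly from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not the study's data.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 loss events detected, 2 false alarms, 2 missed events.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```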

Cloud Detection in All-Sky Images via Multi-scale Neighborhood Features and Multiple Supervised Learning Techniques

10.5194/amt-2016-169 ◽

2016 ◽

Author(s):

Hsu-Yung Cheng ◽

Chih-Lung Lin

Keyword(s):

Supervised Learning ◽

Support Vector ◽

Detection Accuracy ◽

Cloud Detection ◽

Training Process ◽

Multi Scale ◽

Cloud Models ◽

Learning Techniques ◽

Image Patches ◽

Local Image

Abstract. Cloud detection is important for providing necessary information such as cloud cover in many applications. The classic method for cloud detection is based on thresholding the red-to-blue ratio of an image pixel. However, it is difficult to select a threshold that suits all cloud conditions, and the desired threshold also differs between all-sky cameras. In this paper, we propose to perform cloud detection using supervised learning techniques. The features are extracted from local image patches with different sizes to include local structure and multi-resolution information. The cloud models are learned through the training process. We consider classifiers including random forest, support vector machine and Bayesian classifier. To take advantage of the clues provided by multiple classifiers and various levels of patch sizes, we employ a voting scheme to combine the results to further increase the detection accuracy. In the experiments, we have shown that the proposed method can distinguish cloud and non-cloud pixels more accurately compared with existing works.
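The classic baseline mentioned above can be sketched as a per-pixel rule. This is a minimal illustration, not the authors' method: the threshold value of 0.6 and the function name are assumptions, and in practice the threshold depends on the camera and sky conditions, which is the limitation the abstract points out.

```python
def classify_pixel(red, blue, threshold=0.6):
    """Label a pixel by its red-to-blue ratio.

    Clear sky is strongly blue (low ratio), while clouds scatter red and
    blue more evenly (ratio closer to 1). The threshold is an assumed value.
    """
    ratio = red / blue if blue else float("inf")
    return "cloud" if ratio > threshold else "sky"


# Hypothetical 8-bit channel values for a greyish pixel and a blue pixel.
grey_pixel = classify_pixel(red=200, blue=210)
blue_pixel = classify_pixel(red=60, blue=200)
```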

Analysis of Educational Robotics Activities Using a Machine Learning Approach

Makers at School, Educational Robotics and Innovative Learning Environments - Lecture Notes in Networks and Systems ◽

10.1007/978-3-030-77040-2_27 ◽

2021 ◽

pp. 203-211

Author(s):

Lorenzo Cesaretti ◽

Laura Screpanti ◽

David Scaradozzi ◽

Eleni Mangina

Keyword(s):

Machine Learning ◽

Learning Styles ◽

Machine Learning Techniques ◽

Support Vector ◽

Educational Robotics ◽

School Students ◽

K Nearest Neighbors ◽

Log Files ◽

Learning Techniques ◽

Mixed Approach

This paper presents the preliminary results of using machine learning techniques to analyze educational robotics activities. An experiment was conducted with 197 secondary school students in Italy: the authors updated Lego Mindstorms EV3 programming blocks to record log files with coding sequences students had designed in teams. The activities were part of a preliminary robotics exercise. We used four machine learning techniques, namely logistic regression, support-vector machine (SVM), K-nearest neighbors and random forests, to predict the students’ performance, comparing a supervised approach (using twelve indicators extracted from the log files as input for the algorithms) and a mixed approach (applying a k-means algorithm to calculate the machine learning features). The results showed that the mixed approach with SVM outperformed the other techniques, and that three predominant learning styles emerged from the data mining analysis.

Machine Learning and Data Segmentation for Building Energy Use Prediction—A Comparative Study

Energies ◽

10.3390/en14185947 ◽

2021 ◽

Vol 14 (18) ◽

pp. 5947

Author(s):

William Mounter ◽

Chris Ogwumike ◽

Huda Dawood ◽

Nashwan Dawood

Keyword(s):

Machine Learning ◽

Building Energy ◽

Training Data ◽

Machine Learning Techniques ◽

Support Vector ◽

Energy Usage ◽

Research Attention ◽

Learning Techniques ◽

Term Energy

Advances in metering technologies and emerging energy forecast strategies provide opportunities and challenges for predicting both short and long-term building energy usage. Machine learning is an important energy prediction technique, and is significantly gaining research attention. The use of different machine learning techniques based on a rolling-horizon framework can help to reduce the prediction error over time. Due to the significant increases in error beyond short-term energy forecasts, most reported energy forecasts based on statistical and machine learning techniques are within the range of one week. The aim of this study was to investigate how facility managers can improve the accuracy of their building’s long-term energy forecasts. This paper presents an extensive study of machine learning and data processing techniques and how they can more accurately predict within different forecast ranges. The Clarendon building of Teesside University was selected as a case study to demonstrate the prediction of overall energy usage with different machine learning techniques such as polynomial regression (PR), support vector regression (SVR) and artificial neural networks (ANNs). This study further examined how preprocessing training data for prediction models can impact the overall accuracy, such as via segmenting the training data by building modes (active and dormant), or by days of the week (weekdays and weekends). The results presented in this paper illustrate a significant reduction in the mean absolute percentage error (MAPE) for segmented building (weekday and weekend) energy usage prediction when compared to unsegmented monthly predictions. A reduction in MAPE of 5.27%, 11.45%, and 12.03% was achieved with PR, SVR and ANN, respectively.
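The error metric reported above, the mean absolute percentage error (MAPE), can be computed as follows. This is a small sketch using the standard definition; the function name and the sample readings are hypothetical and do not come from the Clarendon building data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes both sequences have equal, non-zero length and that no
    actual value is zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical daily energy readings (kWh) and model forecasts.
error = mape([100.0, 200.0], [110.0, 180.0])
```

Under the segmentation described in the abstract, the same function would be applied separately to weekday and weekend subsets, each predicted by a model trained on the matching segment.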

Cloud detection in all-sky images via multi-scale neighborhood features and multiple supervised learning techniques

Atmospheric Measurement Techniques ◽

10.5194/amt-10-199-2017 ◽

2017 ◽

Vol 10 (1) ◽

pp. 199-208 ◽

Cited By ~ 11

Author(s):

Hsu-Yung Cheng ◽

Chih-Lung Lin

Keyword(s):

Supervised Learning ◽

Detection Methods ◽

Support Vector ◽

Detection Accuracy ◽

Cloud Detection ◽

Multi Scale ◽

Cloud Models ◽

Learning Techniques ◽

Image Patches ◽

Local Image

Abstract. Cloud detection is important for providing necessary information such as cloud cover in many applications. Existing cloud detection methods include red-to-blue ratio thresholding and other classification-based techniques. In this paper, we propose to perform cloud detection using supervised learning techniques with multi-resolution features. One of the major contributions of this work is that the features are extracted from local image patches with different sizes to include local structure and multi-resolution information. The cloud models are learned through the training process. We consider classifiers including random forest, support vector machine, and Bayesian classifier. To take advantage of the clues provided by multiple classifiers and various levels of patch sizes, we employ a voting scheme to combine the results to further increase the detection accuracy. In the experiments, we have shown that the proposed method can distinguish cloud and non-cloud pixels more accurately compared with existing works.

Performance Analysis of Statistical and Supervised Learning Techniques in Stock Data Mining

Data ◽

10.3390/data3040054 ◽

2018 ◽

Vol 3 (4) ◽

pp. 54 ◽

Cited By ~ 7

Author(s):

Manik Sharma ◽

Samriti Sharma ◽

Gurvinder Singh

Keyword(s):

Logistic Regression ◽

Supervised Learning ◽

Misclassification Rate ◽

Support Vector ◽

P Value ◽

Linear Discriminant ◽

Statistical Measures ◽

Specificity And Sensitivity ◽

Learning Techniques ◽

Topsis Technique

Nowadays, an overwhelming amount of stock data is available, which is only of use if it is properly examined and mined. In this paper, the last twelve years of ICICI Bank’s stock data have been extensively examined using statistical and supervised learning techniques. This study may be of great interest for those who wish to mine or study the stock data of banks or any financial organization. Different statistical measures have been computed to explore the nature, range, distribution, and deviation of data. The different descriptive statistical measures assist in finding different valuable metrics such as mean, variance, skewness, kurtosis, p-value, a-squared, and 95% confidence mean interval level of ICICI Bank’s stock data. Moreover, daily percentage changes occurring over the last 12 years have also been recorded and examined. Additionally, the intraday stock status has been mined using ten different classifiers. The performance of different classifiers has been evaluated on the basis of various parameters such as accuracy, misclassification rate, precision, recall, specificity, and sensitivity. Based upon different parameters, the predictive results obtained using logistic regression are more acceptable than the outcomes of other classifiers, whereas naïve Bayes, C4.5, random forest, linear discriminant, and cubic support vector machine (SVM) merely act as a random guessing machine. The outstanding performance of logistic regression has been validated using TOPSIS (technique for order preference by similarity to ideal solution) and WSA (weighted sum approach).
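Several of the evaluation parameters named above derive from a binary confusion matrix. The sketch below uses the standard definitions; the function name and the counts are hypothetical, not results from the paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, misclassification rate, sensitivity, and specificity.

    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # true negative rate
    }


# Hypothetical counts for an up/down intraday classifier over 100 trading days.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

A classifier that behaves like random guessing on balanced classes would show accuracy, sensitivity, and specificity all close to 0.5.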
