Evaluation of Arabian Vascular Plant Barcodes (rbcL and matK): Precision of Unsupervised and Supervised Learning Methods towards Accurate Identification

Arabia is the largest peninsula in the world, with >3000 species of vascular plants. Little effort has been made to generate a multi-locus barcode library to identify and discriminate the recorded plant species. This study aimed to determine the reliability of the available Arabian plant barcodes (>1500; rbcL and matK) in the public repository (NCBI GenBank) using unsupervised and supervised methods. A comparative analysis was carried out with the standard dataset (FINBOL) to assess the reliability of the methods and markers. Our analysis suggests that, among the unsupervised methods, TaxonDNA’s All Species Barcode (ASB) criterion exhibits the highest accuracy for rbcL barcodes, followed by the matK barcodes, using the aligned dataset (FINBOL). However, for the Arabian plant barcode dataset (GBMA), the supervised method performed better than the unsupervised method, with the Random Forest and K-Nearest Neighbor (gappy kernel) classifiers proving robust. These classifiers successfully recognized true species from both barcode markers in the aligned and alignment-free datasets, respectively. The multi-class classifier showed high species resolution, ranking after these two classifiers, though its performance declined when employed to recognize true species. Similar results were observed for the FINBOL dataset through the supervised learning approach; overall, the matK marker showed higher accuracy than rbcL. However, the lower rate of species identification for matK in the GBMA data could be due to the higher evolutionary rate or to gaps and missing data, as observed for the ASB criterion in the FINBOL dataset. Further, a lower number of sequences and the presence of singletons could also affect the rate of species resolution, as observed in the GBMA dataset. The GBMA dataset lacks sufficient species membership. We encourage taxonomists from the Arabian Peninsula to join our campaign on the Arabian Barcode of Life at the Barcode of Life Data (BOLD) systems. Our efforts together could help improve the rate of species identification for Arabian vascular plants.

Tropical Balls and Its Applications to K Nearest Neighbor over the Space of Phylogenetic Trees

Mathematics ◽ 10.3390/math9070779 ◽ 2021 ◽ Vol 9 (7) ◽ pp. 779

Author(s): Ruriko Yoshida

Keyword(s): Supervised Learning ◽ Phylogenetic Trees ◽ Nearest Neighbor ◽ Nearest Neighbors ◽ High Dimensional ◽ Learning Method ◽ Dimensional Vector ◽ K Nearest Neighbor ◽ K Nearest Neighbors

A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss its application to the K nearest neighbors (KNN) algorithm, a supervised learning method used to classify a high-dimensional vector into given categories by looking at a ball centered at the vector, which contains K vectors in the space.
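
As a minimal sketch of the idea (not the paper's implementation), the tropical metric between two vectors u and v can be computed as max_i(u_i - v_i) - min_i(u_i - v_i), and a KNN classifier then takes a majority vote among the K training vectors closest to the query under that metric. The function names and the toy data below are assumptions made for illustration.

```python
from collections import Counter


def tropical_distance(u, v):
    """Tropical metric: max_i(u_i - v_i) - min_i(u_i - v_i)."""
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


def tropical_knn_predict(train, labels, query, k=3):
    """Majority vote among the k training vectors nearest to `query`."""
    ranked = sorted(range(len(train)),
                    key=lambda i: tropical_distance(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two labeled groups of 3-dimensional vectors.
train = [(0, 0, 0), (0, 1, 1), (0, 5, 6), (0, 6, 5)]
labels = ["a", "a", "b", "b"]
prediction = tropical_knn_predict(train, labels, (0, 5, 5), k=3)
print(prediction)
```

Because the metric only depends on differences of coordinates, adding the same constant to every coordinate of a vector leaves all distances unchanged, which is what makes it well defined on the tropical projective torus.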

Framing Twitter Public Sentiment on Nigerian Government COVID-19 Palliatives Distribution Using Machine Learning

Sustainability ◽ 10.3390/su13063497 ◽ 2021 ◽ Vol 13 (6) ◽ pp. 3497

Author(s): Hassan Adamu ◽ Syaheerah Lebai Lutfi ◽ Nurul Hashimah Ahamed Hassain Malim ◽ Rohail Hassan ◽ Assunta Di Vaio ◽ ...

Keyword(s): Machine Learning ◽ Support Vector Machine ◽ Nearest Neighbor ◽ Primary Objective ◽ Support Vector ◽ Standard English ◽ Emotion Classification ◽ K Nearest Neighbor ◽ The Public ◽ The Government

Sustainable development plays a vital role in information and communication technology. In times of pandemics such as COVID-19, vulnerable people need help to survive. This help includes the distribution of relief packages and materials by the government, with the primary objective of lessening the economic and psychological effects on citizens affected by disasters such as the COVID-19 pandemic. However, there has not been an efficient way to monitor the accountability and transparency of public funds, especially in developing countries such as Nigeria. The government's understanding of public emotions about distributed palliatives is important, as it would indicate the reach and impact of the distribution exercise. Although several studies on English emotion classification have been conducted, these studies are not portable to a wider, inclusive Nigerian case. This is because Informal Nigerian English (Pidgin), which Nigerians widely speak, has quite a different vocabulary from Standard English, which limits the applicability of machine learning models trained for emotion classification on Standard English. An Informal Nigerian English (Pidgin English) emotions dataset is constructed, pre-processed, and annotated. The dataset is then used to classify five emotion classes (anger, sadness, joy, fear, and disgust) on the COVID-19 palliatives and relief aid distribution in Nigeria using standard machine learning (ML) algorithms. Six ML algorithms are used in this study, and a comparative analysis of their performance is conducted. The algorithms are Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), K-Nearest Neighbor (KNN), and Decision Tree (DT). The experiments reveal that Support Vector Machine outperforms the remaining classifiers, with the highest accuracy of 88%. The “disgust” emotion class surpassed the other emotion classes (sadness, joy, fear, and anger), with the highest count in the classification conducted on the constructed dataset. Additionally, the correlation analysis shows a significant relationship between the emotion classes “Joy” and “Fear”, which implies that the public is excited about the palliatives’ distribution but afraid of inequality and a lack of transparency in the distribution process due to reasons such as corruption. In conclusion, the results of this experiment show that public emotions on the distribution of COVID-19 support and relief aid packages in Nigeria were not satisfactory, considering that negative emotions from the public outnumbered public happiness.

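Detection of Parang Batik Using Co-occurrence Matrix and Geometric Moment Invariant Features with KNN Classification (Deteksi Batik Parang Menggunakan Fitur Co-Occurence Matrix Dan Geometric Moment Invariant Dengan Klasifikasi KNN)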

Lontar Komputer Jurnal Ilmiah Teknologi Informasi ◽ 10.24843/lkjiti.2016.v07.i01.p05 ◽ 2016 ◽ pp. 40

Author(s): Ni Luh Wiwik Sri Rahayu Ginantra

Keyword(s): Matrix Method ◽ Nearest Neighbor ◽ Texture Features ◽ Sufficient Information ◽ K Nearest Neighbor ◽ Moment Invariant ◽ The Public ◽ Geometric Moment ◽ Visual Identification ◽ Occurrence Matrix

Batik motifs are the base or blueprint of batik patterns and serve as the core of the batik image design; the meaning of a sign, symbol, or logo in a batik work can therefore be revealed through its motifs. Visual identification requires visual skills and knowledge in classifying the patterns formed in a batik image. A lack of media providing information on batik motifs leaves the public without sufficient information about them. In view of this, this study performs visual identification using a computer to assist in identifying the types of batik. The methods used for batik image recognition are the Co-occurrence Matrix method, which extracts batik texture features, and the Geometric Moment Invariant method, while K Nearest Neighbor is used to classify the batik images. The results reveal an accuracy of 80%, compared with an accuracy of 70% obtained using the Co-occurrence Matrix method.
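
To illustrate the texture-extraction step, here is a minimal sketch of a gray-level co-occurrence matrix and one texture feature derived from it. It is not the study's implementation: the pixel offset, the number of gray levels, the choice of the contrast feature, and the small test image are all assumptions made for illustration.

```python
def cooccurrence_matrix(image, levels, dx=1, dy=0):
    """Count how often gray level i has gray level j at offset (dx, dy)."""
    glcm = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            # Skip pairs whose second pixel falls outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                glcm[image[r][c]][image[r2][c2]] += 1
    return glcm


def contrast(glcm):
    """Contrast feature: (i - j)^2 weighted by the normalized counts."""
    total = sum(sum(row) for row in glcm)
    return sum((i - j) ** 2 * n / total
               for i, row in enumerate(glcm)
               for j, n in enumerate(row))


# Hypothetical 4x4 image with gray levels 0..3.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
glcm = cooccurrence_matrix(image, levels=4)
texture_contrast = contrast(glcm)
print(glcm, texture_contrast)
```

Features such as this one can be collected into a vector per image and passed to a nearest-neighbor classifier.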

Combination of relief feature selection and fuzzy K-nearest neighbor for plant species identification

2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS) ◽ 10.1109/icacsis.2016.7872767 ◽ 2016 ◽ Cited By ~ 1

Author(s): Agus Ambarwari ◽ Yeni Herdiyeni ◽ Taufik Djatna

Keyword(s): Feature Selection ◽ Species Identification ◽ Plant Species ◽ Nearest Neighbor ◽ K Nearest Neighbor ◽ Plant Species Identification

Predictive analytics of university student intake using supervised methods

IAES International Journal of Artificial Intelligence (IJ-AI) ◽ 10.11591/ijai.v8.i4.pp367-374 ◽ 2019 ◽ Vol 8 (4) ◽ pp. 367

Author(s): Muhammad Yunus Iqbal Basheer ◽ Sofianita Mutalib ◽ Nurzeatul Hamimah Abdul Hamid ◽ Shuzlina Abdul-Rahman ◽ Ariff Md Ab Malik

Keyword(s): Nearest Neighbor ◽ Predictive Analytics ◽ University Student ◽ K Nearest Neighbor ◽ Application Form ◽ University Campuses ◽ Campus Visit ◽ Huge Impact ◽ Supervised Methods ◽ Future Outcomes

Predictive analytics extracts important factors and patterns from historical data to predict future outcomes. This paper presents predictive analytics of university student intake using supervised methods. Every year, universities face many rejections of academic offers by applicants. Hence, this research aims to predict student acceptance or rejection of an academic offer given by a university, using supervised methods on past student intake data. To address this problem, past studies from the 1990s to the present were reviewed. From this analysis, two algorithms were selected, namely Decision Tree and k Nearest Neighbor. The dataset of past student intake was obtained with fifteen attributes: the applicant's gender, the stream studied for the Sijil Peperiksaan Malaysia (SPM), university campus, the applicant's hometown, disability, campus visit, course choice order in the application form, the applicant's results in six SPM subjects, orphan status, and status of acceptance. Several experiments were carried out to find the best model for predicting a student's offer acceptance by evaluating model accuracy. Both models yield a best accuracy of 66 percent with the selected attributes. This research can have a substantial impact on selecting which applicants are suitable to receive an offer, as well as on adapting the university's academic offering process in a more intelligent way in the future.

Giving more insight for automatic risk prediction during pregnancy with interpretable machine learning

Bulletin of Electrical Engineering and Informatics ◽ 10.11591/eei.v10i3.2344 ◽ 2021 ◽ Vol 10 (3)

Author(s): Muhammad Irfan ◽ Setio Basuki ◽ Yufis Azhar

Keyword(s): Machine Learning ◽ Nearest Neighbor ◽ Classification Model ◽ K Nearest Neighbor ◽ Pregnancy Risk ◽ The Public ◽ Risk Monitoring ◽ Machine Learning Model ◽ Union Operation ◽ Correlation Based Feature Selection

The maternal mortality rate (MMR) reported in Indonesia's intercensal population survey (SUPAS) was considered high. For pregnancy risk detection, the public health center (puskesmas) applies a Poedji Rochjati screening card (KSPR) comprising 20 features. In addition to the KSPR, pregnancy risk monitoring has been assisted by a pregnancy control card. Because the two cards differ in their number of features, it is necessary to reconcile them. Our objectives are to determine the most influential features, explore the links among features on the KSPR and pregnancy control cards, and build a machine learning model for predicting pregnancy risk. For the first objective, we use correlation-based feature selection (CFS) and the C5.0 algorithm. The next objective was addressed by taking the union of the features produced by the two techniques. In the machine learning experiment on these features, the XGBoost algorithm achieved the highest accuracy of 94%, followed by the random forest, Naïve Bayes, and k-Nearest Neighbor algorithms with 87%, 66%, and 60%, respectively. Interpretability is implemented with SHAP and LIME to provide more insight into the classification model. In conclusion, the similar features generated by the two interpretation approaches confirmed that Cesar was dominant in determining pregnancy risk.

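Comparison of the Accuracy and Processing Time of the K-NN and SVM Algorithms in Twitter Sentiment Analysis (Perbandingan Akurasi dan Waktu Proses Algoritma K-NN dan SVM dalam Analisis Sentimen Twitter)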

Jurnal Informatika ◽ 10.31311/ji.v6i2.5129 ◽ 2019 ◽ Vol 6 (2) ◽ pp. 226-235

Author(s): Muhammad Rangga Aziz Nasution ◽ Mardhiya Hayaty

Keyword(s): Machine Learning ◽ Support Vector Machine ◽ Unsupervised Learning ◽ Supervised Learning ◽ Cross Validation ◽ Nearest Neighbor ◽ Support Vector ◽ K Nearest Neighbor ◽ Fold Cross Validation

Machine learning, a branch of computer science, has become a trend in recent times. Machine learning works by using data and algorithms to build models from the patterns in a dataset. In addition, machine learning studies how a built model can predict outputs based on existing patterns. Two types of machine learning methods can be used for sentiment analysis: supervised learning and unsupervised learning. This study compares two classification algorithms from supervised learning, K-Nearest Neighbor and Support Vector Machine, by building a model from each algorithm with sentiment text as the object. The comparison is carried out to determine which algorithm is better in terms of accuracy and processing time. The accuracy results show that the Support Vector Machine method is superior, with 89.70% without K-Fold Cross Validation and 88.76% with K-Fold Cross Validation. In terms of processing time, the K-Nearest Neighbor method is superior, with a processing time of 0.0160 s without K-Fold Cross Validation and 0.1505 s with K-Fold Cross Validation.
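
The kind of comparison described here, accuracy under K-fold cross-validation together with processing time, can be sketched as follows. This is an illustrative outline only, not the study's code: it uses a plain K-Nearest Neighbor classifier on hypothetical numeric toy data rather than vectorized tweets, and the fold assignment, function names, and parameters are assumptions.

```python
import time
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest points (squared Euclidean distance)."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return Counter(train_y[i] for i in ranked[:k]).most_common(1)[0][0]


def k_fold_accuracy(xs, ys, folds=5, k=3):
    """Mean accuracy over `folds` splits; fold f holds out items f, f+folds, ..."""
    scores = []
    for f in range(folds):
        test_idx = set(range(f, len(xs), folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                   for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / folds


# Hypothetical toy data: two well-separated clusters of 2-D points.
xs = ([(i % 5, i % 3) for i in range(20)]
      + [(10 + i % 5, 10 + i % 3) for i in range(20)])
ys = ["negative"] * 20 + ["positive"] * 20

start = time.perf_counter()
accuracy = k_fold_accuracy(xs, ys, folds=5, k=3)
elapsed = time.perf_counter() - start
print(f"accuracy={accuracy:.2f}, time={elapsed:.4f}s")
```

Running the same timing wrapper around a second classifier would give the two quantities the abstract compares; cross-validation repeats training and prediction once per fold, which is consistent with the longer processing times reported with K-Fold Cross Validation.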

SCMTHP: A New Approach for Identifying and Characterizing of Tumor-Homing Peptides Using Estimated Propensity Scores of Amino Acids

Pharmaceutics ◽ 10.3390/pharmaceutics14010122 ◽ 2022 ◽ Vol 14 (1) ◽ pp. 122

Author(s): Phasit Charoenkwan ◽ Wararat Chiangjong ◽ Chanin Nantasenamat ◽ Mohammad Ali Moni ◽ Pietro Lio’ ◽ ...

Keyword(s): Amino Acids ◽ Propensity Scores ◽ Nearest Neighbor ◽ Biochemical Properties ◽ Least Squares Regression ◽ K Nearest Neighbor ◽ Accurate Identification ◽ Major Drawback ◽ Benchmark Datasets ◽ Tumor Homing

Tumor-homing peptides (THPs) are small peptides that can recognize and bind cancer cells specifically. To gain a better understanding of THPs’ functional mechanisms, the accurate identification and characterization of THPs is required. Although some computational methods for in silico THP identification have been proposed, a major drawback is their lack of model interpretability. In this study, we propose a new, simple, and easily interpretable computational approach (called SCMTHP) for identifying and analyzing the tumor-homing activities of peptides using a scoring card method (SCM). To improve the predictability and interpretability of our predictor, we generated propensity scores of the 20 amino acids as THPs. Informative physicochemical properties were then used, together with the SCMTHP-derived propensity scores, to provide insights into the characteristics that give rise to the bioactivity of THPs. Benchmarking experiments on independent tests indicated that SCMTHP could achieve performance comparable to the state-of-the-art method, with accuracies of 0.827 and 0.798, respectively, when evaluated on two benchmark datasets (the Main and Small datasets). Furthermore, SCMTHP was found to outperform several well-known machine learning-based classifiers (e.g., decision tree, k-nearest neighbor, multi-layer perceptron, naive Bayes, and partial least squares regression), as indicated by both 10-fold cross-validation and independent tests. Finally, the SCMTHP web server was established and made freely available online. SCMTHP is expected to be a useful tool for the rapid and accurate identification of THPs and for providing a better understanding of THP biophysical and biochemical properties.

Performance Analysis of Supervised Learning Models for Product Title Classification

IAES International Journal of Artificial Intelligence (IJ-AI) ◽ 10.11591/ijai.v8.i3.pp228-236 ◽ 2019 ◽ Vol 8 (3) ◽ pp. 228 ◽ Cited By ~ 1

Author(s): Norsyela Muhammad Noor Mathivanan ◽ Nor Azura Md.Ghani ◽ Roziah Mohd Janor

Keyword(s): Supervised Learning ◽ Nearest Neighbor ◽ Computation Time ◽ Classification Problem ◽ Short Description ◽ Support Vector ◽ Data Sets ◽ Learning Models ◽ K Nearest Neighbor ◽ Online Business

Online business development through e-commerce platforms is a phenomenon that is changing how products are promoted and sold in the 21st century. Product title classification is an important task in helping retailers and sellers list a product in a suitable category. Product title classification is a part of the text classification problem, but the properties of a product title differ from those of a general document. This study aims to evaluate the performance of five supervised learning models on data sets consisting of e-commerce product titles, which are very short descriptions and incomplete sentences. The supervised learning models involved in the study are Naïve Bayes, K-Nearest Neighbor (KNN), Decision Tree, Support Vector Machine (SVM), and Random Forest. The results show that the KNN model is the best model, with the highest accuracy and the fastest computation time in classifying the data used in the study. Hence, the KNN model is a good approach for classifying e-commerce products.

Public Sentiment Analysis of Pasar Lama Tangerang Using K-Nearest Neighbor Method and Programming Language R

Jurnal Ilmiah Informatika Komputer ◽ 10.35760/ik.2019.v24i2.2367 ◽ 2019 ◽ Vol 24 (2) ◽ pp. 129-133

Author(s): Hustinawaty ◽ Rama Al Azis Dwiputra ◽ Tavipia Rumambi

Keyword(s): Sentiment Analysis ◽ Programming Languages ◽ Nearest Neighbor ◽ Tourist Attraction ◽ K Nearest Neighbor ◽ The Public ◽ Public Sentiment ◽ A Value ◽ Negative Comments ◽ The City

Pasar Lama Tangerang is a tourist attraction in the city of Tangerang. With current technology, the public can give an overview of the facilities and services provided by expressing opinions on the internet. However, it is difficult to distinguish positive opinions from negative ones. Sentiment analysis is needed to overcome this problem. Sentiment analysis starts with collecting the data, which is then processed. The processed data is then given a sentiment classification using the K-Nearest Neighbor (KNN) algorithm. The classification achieved an accuracy of 83% with k = 1 on 120 data items, divided into 92 positive and 28 negative comments. The sentiment analysis was carried out using the R programming language, with RStudio as supporting software.
