scholarly journals DEVELOPMENT AND COMPARATIVE ANALYSIS OF SEMI-SUPERVISED LEARNING ALGORITHMS ON A SMALL AMOUNT OF LABELED DATA

Author(s):  
Klym Yamkovyi

The paper is dedicated to the development and comparative experimental analysis of semi-supervised learning approaches based on a mix of unsupervised and supervised approaches for the classification of datasets with a small amount of labeled data, namely, identifying to which of a set of categories a new observation belongs using a training set of data containing observations whose category membership is known. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Unlabeled data, when used in combination with a small quantity of labeled data, can produce significant improvement in learning accuracy. The goal is semi-supervised methods development and analysis along with comparing their accuracy and robustness on different synthetics datasets. The proposed approach is based on the unsupervised K-medoids methods, also known as the Partitioning Around Medoid algorithm, however, unlike Kmedoids the proposed algorithm first calculates medoids using only labeled data and next process unlabeled classes – assign labels of nearest medoid. Another proposed approach is the mix of the supervised method of K-nearest neighbor and unsupervised K-Means. Thus, the proposed learning algorithm uses information about both the nearest points and classes centers of mass. The methods have been implemented using Python programming language and experimentally investigated for solving classification problems using datasets with different distribution and spatial characteristics. Datasets were generated using the scikit-learn library. Was compared the developed approaches to find average accuracy on all these datasets. It was shown, that even small amounts of labeled data allow us to use semi-supervised learning, and proposed modifications ensure to improve accuracy and algorithm performance, which was demonstrated during experiments. And with the increase of available label information accuracy of the algorithms grows up. Thus, the developed algorithms are using a distance metric that considers available label information. Keywords: Unsupervised learning, supervised learning. semi-supervised learning, clustering, distance, distance function, nearest neighbor, medoid, center of mass.

2018 ◽  
Vol 4 (7) ◽  
pp. 95 ◽  
Author(s):  
Ioannis Livieris ◽  
Andreas Kanavos ◽  
Vassilis Tampakas ◽  
Panagiotis Pintelas

A critical component in the computer-aided medical diagnosis of digital chest X-rays is the automatic detection of lung abnormalities, since the effective identification at an initial stage constitutes a significant and crucial factor in patient’s treatment. The vigorous advances in computer and digital technologies have ultimately led to the development of large repositories of labeled and unlabeled images. Due to the effort and expense involved in labeling data, training datasets are of a limited size, while in contrast, electronic medical record systems contain a significant number of unlabeled images. Semi-supervised learning algorithms have become a hot topic of research as an alternative to traditional classification methods, exploiting the explicit classification information of labeled data with the knowledge hidden in the unlabeled data for building powerful and effective classifiers. In the present work, we evaluate the performance of an ensemble semi-supervised learning algorithm for the classification of chest X-rays of tuberculosis. The efficacy of the presented algorithm is demonstrated by several experiments and confirmed by the statistical nonparametric tests, illustrating that reliable and robust prediction models could be developed utilizing a few labeled and many unlabeled data.


2019 ◽  
Vol 20 (5) ◽  
pp. 488-500 ◽  
Author(s):  
Yan Hu ◽  
Yi Lu ◽  
Shuo Wang ◽  
Mengying Zhang ◽  
Xiaosheng Qu ◽  
...  

Background: Globally the number of cancer patients and deaths are continuing to increase yearly, and cancer has, therefore, become one of the world&#039;s highest causes of morbidity and mortality. In recent years, the study of anticancer drugs has become one of the most popular medical topics. </P><P> Objective: In this review, in order to study the application of machine learning in predicting anticancer drugs activity, some machine learning approaches such as Linear Discriminant Analysis (LDA), Principal components analysis (PCA), Support Vector Machine (SVM), Random forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB) were selected, and the examples of their applications in anticancer drugs design are listed. </P><P> Results: Machine learning contributes a lot to anticancer drugs design and helps researchers by saving time and is cost effective. However, it can only be an assisting tool for drug design. </P><P> Conclusion: This paper introduces the application of machine learning approaches in anticancer drug design. Many examples of success in identification and prediction in the area of anticancer drugs activity prediction are discussed, and the anticancer drugs research is still in active progress. Moreover, the merits of some web servers related to anticancer drugs are mentioned.


Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 779
Author(s):  
Ruriko Yoshida

A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss its application to the K nearest neighbors (KNN) algorithm, a supervised learning method used to classify a high-dimensional vector into given categories by looking at a ball centered at the vector, which contains K vectors in the space.


2021 ◽  
pp. 1-17
Author(s):  
Ahmed Al-Tarawneh ◽  
Ja’afer Al-Saraireh

Twitter is one of the most popular platforms used to share and post ideas. Hackers and anonymous attackers use these platforms maliciously, and their behavior can be used to predict the risk of future attacks, by gathering and classifying hackers’ tweets using machine-learning techniques. Previous approaches for detecting infected tweets are based on human efforts or text analysis, thus they are limited to capturing the hidden text between tweet lines. The main aim of this research paper is to enhance the efficiency of hacker detection for the Twitter platform using the complex networks technique with adapted machine learning algorithms. This work presents a methodology that collects a list of users with their followers who are sharing their posts that have similar interests from a hackers’ community on Twitter. The list is built based on a set of suggested keywords that are the commonly used terms by hackers in their tweets. After that, a complex network is generated for all users to find relations among them in terms of network centrality, closeness, and betweenness. After extracting these values, a dataset of the most influential users in the hacker community is assembled. Subsequently, tweets belonging to users in the extracted dataset are gathered and classified into positive and negative classes. The output of this process is utilized with a machine learning process by applying different algorithms. This research build and investigate an accurate dataset containing real users who belong to a hackers’ community. Correctly, classified instances were measured for accuracy using the average values of K-nearest neighbor, Naive Bayes, Random Tree, and the support vector machine techniques, demonstrating about 90% and 88% accuracy for cross-validation and percentage split respectively. Consequently, the proposed network cyber Twitter model is able to detect hackers, and determine if tweets pose a risk to future institutions and individuals to provide early warning of possible attacks.


Mathematics ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. 830
Author(s):  
Seokho Kang

k-nearest neighbor (kNN) is a widely used learning algorithm for supervised learning tasks. In practice, the main challenge when using kNN is its high sensitivity to its hyperparameter setting, including the number of nearest neighbors k, the distance function, and the weighting function. To improve the robustness to hyperparameters, this study presents a novel kNN learning method based on a graph neural network, named kNNGNN. Given training data, the method learns a task-specific kNN rule in an end-to-end fashion by means of a graph neural network that takes the kNN graph of an instance to predict the label of the instance. The distance and weighting functions are implicitly embedded within the graph neural network. For a query instance, the prediction is obtained by performing a kNN search from the training data to create a kNN graph and passing it through the graph neural network. The effectiveness of the proposed method is demonstrated using various benchmark datasets for classification and regression tasks.


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1797
Author(s):  
Ján Vachálek ◽  
Dana Šišmišová ◽  
Pavol Vašek ◽  
Jan Rybář ◽  
Juraj Slovák ◽  
...  

The article deals with aspects of identifying industrial products in motion based on their color. An automated robotic workplace with a conveyor belt, robot and an industrial color sensor is created for this purpose. Measured data are processed in a database and then statistically evaluated in form of type A standard uncertainty and type B standard uncertainty, in order to obtain combined standard uncertainties results. Based on the acquired data, control charts of RGB color components for identified products are created. Influence of product speed on the measuring process identification and process stability is monitored. In case of identification uncertainty i.e., measured values are outside the limits of control charts, the K-nearest neighbor machine learning algorithm is used. This algorithm, based on the Euclidean distances to the classified value, estimates its most accurate iteration. This results into the comprehensive system for identification of product moving on conveyor belt, where based on the data collection and statistical analysis using machine learning, industry usage reliability is demonstrated.


Author(s):  
Juan Luis Fernández-Martínez ◽  
Ana Cernea

In this paper, we present a supervised ensemble learning algorithm, called SCAV1, and its application to face recognition. This algorithm exploits the uncertainty space of the ensemble classifiers. Its design includes six different nearest-neighbor (NN) classifiers that are based on different and diverse image attributes: histogram, variogram, texture analysis, edges, bidimensional discrete wavelet transform and Zernike moments. In this approach each attribute, together with its corresponding type of the analysis (local or global), and the distance criterion (p-norm) induces a different individual NN classifier. The ensemble classifier SCAV1 depends on a set of parameters: the number of candidate images used by each individual method to perform the final classification and the individual weights given to each individual classifier. SCAV1 parameters are optimized/sampled using a supervised approach via the regressive particle swarm optimization algorithm (RR-PSO). The final classifier exploits the uncertainty space of SCAV1 and uses majority voting (Borda Count) as a final decision rule. We show the application of this algorithm to the ORL and PUT image databases, obtaining very high and stable accuracies (100% median accuracy and almost null interquartile range). In conclusion, exploring the uncertainty space of ensemble classifiers provides optimum results and seems to be the appropriate strategy to adopt for face recognition and other classification problems.


Author(s):  
Noman Ashraf ◽  
Abid Rafiq ◽  
Sabur Butt ◽  
Hafiz Muhammad Faisal Shehzad ◽  
Grigori Sidorov ◽  
...  

On YouTube, billions of videos are watched online and millions of short messages are posted each day. YouTube along with other social networking sites are used by individuals and extremist groups for spreading hatred among users. In this paper, we consider religion as the most targeted domain for spreading hate speech among people of different religions. We present a methodology for the detection of religion-based hate videos on YouTube. Messages posted on YouTube videos generally express the opinions of users’ related to that video. We provide a novel dataset for religious hate speech detection on Youtube comments. The proposed methodology applies data mining techniques on extracted comments from religious videos in order to filter religion-oriented messages and detect those videos which are used for spreading hate. The supervised learning algorithms: Support Vector Machine (SVM), Logistic Regression (LR), and k-Nearest Neighbor (k-NN) are used for baseline results.


Sign in / Sign up

Export Citation Format

Share Document