A Visual Mining Approach to Improved Multiple-Instance Learning

Algorithms ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 344
Author(s):  
Sonia Castelo ◽  
Moacir Ponti ◽  
Rosane Minghim

Multiple-instance learning (MIL) is a paradigm of machine learning that aims to classify a set (bag) of objects (instances), assigning labels only to the bags. This problem is often addressed by selecting an instance to represent each bag, transforming an MIL problem into standard supervised learning. Visualization can be a useful tool to assess learning scenarios by incorporating the users’ knowledge into the classification process. Considering that multiple-instance learning is a paradigm that cannot be handled by current visualization techniques, we propose a multiscale tree-based visualization called MILTree to support MIL problems. The first level of the tree represents the bags, and the second level represents the instances belonging to each bag, allowing users to understand MIL datasets in an intuitive way. In addition, we propose two new instance selection methods for MIL, which help users improve the model even further. Our methods can handle both binary and multiclass scenarios. In our experiments, SVM was used to build the classifiers. With the support of the MILTree layout, the initial classification model was updated by changing the training set, which is composed of the prototype instances. Experimental results validate the effectiveness of our approach, showing that visual mining by MILTree can support exploring and improving models in MIL scenarios and that our instance selection methods outperform the currently available alternatives in most cases.
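A common way to realize the bag-to-instance reduction described above is to pick, for each bag, the instance closest to the bag centroid. The sketch below illustrates that generic heuristic; it is not the paper's actual selection methods, and the bag names and feature vectors are hypothetical:

```python
import math

def bag_prototype(bag):
    """Pick the instance closest to the bag centroid as its prototype."""
    dim = len(bag[0])
    centroid = [sum(inst[d] for inst in bag) / len(bag) for d in range(dim)]
    return min(bag, key=lambda inst: math.dist(inst, centroid))

# Each bag is a list of feature vectors; only the bag carries a label.
bags = {
    "bag_a": [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1)],
    "bag_b": [(5.0, 5.0), (4.0, 6.0)],
}
prototypes = {name: bag_prototype(bag) for name, bag in bags.items()}
```

A standard supervised classifier (e.g., the SVM used in the paper) can then be trained on the prototype set and refined as users re-select prototypes through the tree layout.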

Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2684 ◽  
Author(s):  
Obed Tettey Nartey ◽  
Guowu Yang ◽  
Sarpong Kwadwo Asare ◽  
Jinzhao Wu ◽  
Lady Nadia Frempong

Traffic sign recognition is a classification problem that poses challenges for computer vision and machine learning algorithms. Although both computer vision and machine learning techniques have constantly been improved to solve this problem, the sudden rise in the number of unlabeled traffic signs has made it even more challenging. Collecting and labeling large datasets are tedious and expensive tasks that demand much time, expert knowledge, and financial resources to satisfy the data demands of deep neural networks. In addition, unbalanced data pose a further challenge for computer vision and machine learning algorithms seeking better performance. These problems raise the need for algorithms that can fully exploit a large amount of unlabeled data, use a small number of labeled samples, and remain robust to data imbalance in order to build an efficient and high-quality classifier. In this work, we propose a novel semi-supervised classification technique that is robust to small and unbalanced data. The framework integrates weakly-supervised learning and self-training with self-paced learning to generate attention maps that augment the training set, and utilizes a novel pseudo-label generation and selection algorithm to generate and select pseudo-labeled samples. The method improves performance by: (1) normalizing the class-wise confidence levels to prevent the model from ignoring hard-to-learn samples, thereby addressing the imbalanced-data problem; (2) jointly learning a model and optimizing pseudo-labels generated on unlabeled data; and (3) enlarging the training set to satisfy the data demands of deep learning models. Extensive evaluations on two public traffic sign recognition datasets demonstrate the effectiveness of the proposed technique and provide a potential solution for practical applications.
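Step (1), class-wise confidence normalization, can be sketched as follows. This is a simplified illustration of the idea (dividing each prediction's confidence by the maximum confidence seen for its class before thresholding), not the paper's full pseudo-label generation and selection algorithm; the class names, probabilities, and threshold are hypothetical:

```python
def select_pseudo_labels(probs, threshold=0.9):
    """Keep unlabeled samples whose class-wise normalized confidence is high.

    probs: one dict of class probabilities per unlabeled sample.
    Normalizing per class keeps hard-to-learn classes from being ignored.
    """
    preds, class_max = [], {}
    for p in probs:
        label = max(p, key=p.get)
        preds.append((label, p[label]))
        class_max[label] = max(class_max.get(label, 0.0), p[label])
    return [(i, label) for i, (label, conf) in enumerate(preds)
            if conf / class_max[label] >= threshold]

probs = [{"stop": 0.95, "yield": 0.05},   # confident "stop"
         {"stop": 0.60, "yield": 0.40},   # weak "stop" -> filtered out
         {"yield": 0.70, "stop": 0.30}]   # best "yield" seen -> kept
selected = select_pseudo_labels(probs, threshold=0.9)
```

Without the per-class normalization, a fixed global threshold would keep only samples of the easy, high-confidence classes, which is exactly the imbalance failure mode the abstract describes.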


2021 ◽  
Vol 12 ◽  
Author(s):  
Burcu Bakir-Gungor ◽  
Osman Bulut ◽  
Amhar Jabeer ◽  
O. Ufuk Nalbantoglu ◽  
Malik Yousef

The human gut microbiota is a complex community of organisms including trillions of bacteria. While these microorganisms are considered essential regulators of our immune system, some of them can cause several diseases. In recent years, next-generation sequencing technologies have accelerated the characterization of the human gut microbiota, and the use of machine learning techniques has become popular for analyzing disease-associated metagenomics datasets. Type 2 diabetes (T2D) is a chronic disease that affects millions of people around the world. Since early diagnosis of T2D is important for effective treatment, there is a pressing need for classification techniques that can accelerate T2D diagnosis. In this study, using T2D-associated metagenomics data, we aim to develop a classification model to facilitate T2D diagnosis and to discover T2D-associated biomarkers. The sequencing data of T2D patients and healthy individuals were taken from a metagenome-wide association study and categorized into disease states. The sequencing reads were assigned to taxa, and the identified species were used to train and test our model. To deal with the high dimensionality of the features, we applied robust feature selection algorithms, including Conditional Mutual Information Maximization, Maximum Relevance and Minimum Redundancy, Correlation-Based Feature Selection, and the select-K-best approach. To compare the classification performance of the features chosen by the different methods, we used a random forest classifier with 100 rounds of Monte Carlo cross-validation. In our experiments, we observed that 15 features selected in common by these methods considerably reduce the set of microbiota needed for T2D diagnosis, and thus the associated time and cost. When we performed biological validation of these identified species, we found that some of them are known to be related to T2D development mechanisms, and we identified additional species as potential biomarkers.
Additionally, we attempted to find subgroups of T2D patients using k-means clustering. In summary, this study utilizes several supervised and unsupervised machine learning algorithms to increase the diagnostic accuracy of T2D, investigates potential biomarkers of T2D, and determines which subset of the microbiota is most informative by applying state-of-the-art feature selection methods.
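As a flavor of the feature selection step, a minimal select-K-best-style filter can be written as below. The score used here (difference of class means over the pooled standard deviation) is a stand-in for the CMIM, mRMR, and CFS criteria actually applied in the study, and the abundance matrix is hypothetical:

```python
from statistics import mean, stdev

def select_k_best(X, y, k):
    """Rank features by a simple two-class separation score, keep the top k."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = stdev([row[j] for row in X]) or 1.0  # guard constant columns
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    ranked = sorted(range(len(X[0])), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Rows: samples (species abundances); y: 1 = T2D, 0 = healthy.
X = [[0.1, 5.0, 0.2],
     [0.2, 5.1, 0.9],
     [0.9, 0.1, 0.3],
     [0.8, 0.2, 0.8]]
y = [0, 0, 1, 1]
top = select_k_best(X, y, 2)   # feature 2 carries no class signal
```

Features selected in common across several such filters, as the study does, are more robust biomarker candidates than any single filter's output.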


2019 ◽  
Vol 29 (1) ◽  
pp. 151-168
Author(s):  
Marcin Blachnik

Abstract Instance selection is often performed as one of the preprocessing methods which, along with feature selection, allows a significant reduction in computational complexity and an increase in prediction accuracy. So far, only a few authors have considered ensembles of instance selection methods, while ensembles of final predictive models attract many researchers. To bridge that gap, in this paper we compare four ensembles adapted to instance selection: Bagging, Feature Bagging, AdaBoost and Additive Noise. The last one is introduced for the first time in this paper. The study is based on an empirical comparison performed on 43 datasets and 9 base instance selection methods. The experiments are divided into three scenarios. In the first one, evaluated on a single dataset, we demonstrate the influence of the ensembles on the compression–accuracy relation; in the second scenario the goal is to achieve the highest prediction accuracy; and in the third one both accuracy and the level of dataset compression constitute a multi-objective criterion. The obtained results indicate that ensembles of instance selection improve on the base instance selection algorithms, except for unstable methods such as CNN and IB3; this improvement comes at the expense of compression. In the comparison, Bagging and AdaBoost lead in most of the scenarios. In the experiments we evaluate three classifiers: 1NN, kNN and SVM. We also note a deterioration in prediction accuracy for robust classifiers (kNN and SVM) trained on data filtered by any of the instance selection methods (including the ensembles) compared with the results obtained when the entire training set was used to train these classifiers.
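To give the flavor of bagging applied to instance selection (as opposed to model ensembling), the sketch below runs a simple ENN-style noise filter on bootstrap resamples and keeps instances by majority vote. The base selector, voting threshold, and toy data are all hypothetical simplifications of the methods compared in the paper:

```python
import math
import random

def enn_keep(X, y):
    """Base selector (edited-nearest-neighbour flavour): keep the indices
    whose nearest neighbour carries the same label."""
    kept = []
    for i, x in enumerate(X):
        nn = min((j for j in range(len(X)) if j != i),
                 key=lambda j: math.dist(x, X[j]))
        if y[nn] == y[i]:
            kept.append(i)
    return kept

def bagging_instance_selection(X, y, base_selector, n_rounds=10, vote=0.5, seed=0):
    """Run the base selector on bootstrap resamples and keep each instance
    retained in at least a `vote` fraction of the rounds where it appeared."""
    rng = random.Random(seed)
    votes, appeared = [0] * len(X), [0] * len(X)
    for _ in range(n_rounds):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        kept = base_selector([X[i] for i in idx], [y[i] for i in idx])
        for orig in set(idx):
            appeared[orig] += 1
        for orig in {idx[pos] for pos in kept}:
            votes[orig] += 1
    return [i for i in range(len(X))
            if appeared[i] and votes[i] / appeared[i] >= vote]

# Two clusters plus one deliberately mislabeled point (index 5).
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (0.3, 0.3)]
y = [0, 0, 0, 1, 1, 1]
kept = bagging_instance_selection(X, y, enn_keep)
```

Voting across resamples stabilizes the selection, which is why the paper finds the ensembles help most for otherwise unstable base methods.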


2020 ◽  
Vol 10 (11) ◽  
pp. 3933 ◽  
Author(s):  
Marcin Blachnik ◽  
Mirosław Kordos

Instance selection and construction methods were originally designed to improve the performance of the k-nearest neighbors classifier by increasing its speed and improving the classification accuracy. These goals were achieved by eliminating redundant and noisy samples, thus reducing the size of the training set. In this paper, the performance of instance selection methods is investigated in terms of classification accuracy and reduction of training set size. The classification accuracy of the following classifiers is evaluated: decision trees, random forest, Naive Bayes, linear model, support vector machine and k-nearest neighbors. The obtained results indicate that for most of the classifiers compressing the training set affects prediction performance, and only a small group of instance selection methods can be recommended as a general-purpose preprocessing step. These are the learning vector quantization based algorithms, along with Drop2 and Drop3. Other methods are less efficient or provide a low compression ratio.
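The learning vector quantization family mentioned above constructs prototypes rather than selecting raw instances; one LVQ1 update step looks roughly like the sketch below. The prototypes, labels, and learning rate are hypothetical, and real LVQ training iterates this over many samples with a decaying rate:

```python
import math

def lvq1_step(prototypes, proto_labels, x, y, lr=0.5):
    """One LVQ1 update: move the nearest prototype toward sample x when the
    labels match, and away from x when they do not."""
    c = min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i], x))
    sign = 1.0 if proto_labels[c] == y else -1.0
    prototypes[c] = [p + sign * lr * (xi - p) for p, xi in zip(prototypes[c], x)]
    return c

protos = [[0.0], [1.0]]
labels = [0, 1]
lvq1_step(protos, labels, [0.2], 0)   # nearest prototype matches -> pulled in
```

The final prototype set replaces the training data, which is how these methods achieve compression while remaining usable by any downstream classifier.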


2021 ◽  
Author(s):  
Erin Chinn ◽  
Rohit Arora ◽  
Ramy Arnaout ◽  
Rima Arnaout

Abstract Deep learning (DL) requires labeled data. Labeling medical images requires medical expertise, which is often a bottleneck. It is therefore useful to prioritize labeling those images that are most likely to improve a model's performance, a practice known as instance selection. Here we introduce ENRICH, a method that selects images for labeling based on how much novelty each image adds to the growing training set. In our implementation, we use cosine similarity between autoencoder embeddings to measure that novelty. We show that ENRICH achieves nearly maximal performance on classification and segmentation tasks using only a fraction of available images, and outperforms the default practice of selecting images at random. We also present evidence that instance selection may perform categorically better on medical vs. non-medical imaging tasks. In conclusion, ENRICH is a simple, computationally efficient method for prioritizing images for expert labeling in DL workflows.
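The selection loop described here (greedily preferring the image whose embedding is least similar to anything already chosen) can be sketched as follows. The cosine-similarity novelty measure matches the abstract's description, but the seeding rule and the toy embeddings are assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def greedy_select(embeddings, n):
    """Greedily add the image whose embedding is least similar (most novel)
    relative to everything already selected."""
    selected = [0]  # seed with an arbitrary first image
    while len(selected) < n:
        candidates = [i for i in range(len(embeddings)) if i not in selected]
        selected.append(max(candidates,
                            key=lambda i: 1 - max(cosine(embeddings[i],
                                                         embeddings[j])
                                                  for j in selected)))
    return selected

# Hypothetical autoencoder embeddings for four unlabeled images.
embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.7, 0.7)]
picked = greedy_select(embeddings, 2)
```

Image 1 is nearly a duplicate of image 0, so the greedy pass skips it in favor of the orthogonal image 2; experts then label only the selected subset.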


2020 ◽  
Vol 15 ◽  
Author(s):  
Shuwen Zhang ◽  
Qiang Su ◽  
Qin Chen

Abstract: Major animal diseases pose a great threat to animal husbandry and human beings. With the deepening of globalization and the abundance of data resources, the prediction and analysis of animal diseases by using big data are becoming more and more important. The focus of machine learning is to make computers learn from data and use the learned experience to analyze and predict. Firstly, this paper introduces the animal epidemic situation and machine learning. Then it briefly introduces the application of machine learning in animal disease analysis and prediction. Machine learning is mainly divided into supervised learning and unsupervised learning. Supervised learning includes support vector machines, naive Bayes, decision trees, random forests, logistic regression, artificial neural networks, deep learning, and AdaBoost. Unsupervised learning includes the expectation-maximization algorithm, principal component analysis, hierarchical clustering, and MaxEnt. Through the discussion in this paper, readers can form a clearer concept of machine learning and understand its application prospects for animal diseases.


Energies ◽  
2021 ◽  
Vol 14 (7) ◽  
pp. 1809
Author(s):  
Mohammed El Amine Senoussaoui ◽  
Mostefa Brahami ◽  
Issouf Fofana

Machine learning is widely used as a panacea in many engineering applications, including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on some aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: the J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: oils that can be maintained in service, oils that should be reconditioned or filtered, oils that should be reclaimed, and oils that must be discarded. Of the two algorithms, random forest exhibited better performance and high accuracy with only a small amount of data. Good performance was achieved not only through the application of the proposed algorithm but also through the data preprocessing approach. Before being fed to the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were again retransformed by conducting principal component analysis and were passed through the CFsSubset filter. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the decrease in the amount of data required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.
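The first preprocessing stage, the simple k-means transformation, can be illustrated with a bare-bones Lloyd's algorithm. The oil-sample feature vectors below are hypothetical, a production pipeline would use an off-the-shelf implementation, and the study's CFsSubset and PCA stages are omitted here:

```python
import math
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the cluster index of each point."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center wins.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centers[c] = [sum(m[d] for m in members) / len(members)
                              for d in range(len(points[0]))]
    return assign

# Hypothetical two-feature aging indicators for four oil samples.
points = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9)]
clusters = kmeans(points, 2)
```

The resulting cluster assignments (or distances to the centers) become the transformed features that are then passed through the CFsSubset filter and PCA, as the abstract describes.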

