Machine learning behind classification tasks in various engineering and science domains

In this paper, we introduce deboost, a Python library devoted to weighted distance ensembling of predictions for regression and classification tasks. Its backbone resides on the scikit-learn library for default models and data preprocessing functions. It offers flexible choices of models for the ensemble as long as they contain the predict method, like the models available from scikit-learn. deboost is released under the MIT open-source license and can be downloaded from the Python Package Index (PyPI) at https://pypi.org/project/deboost. The source scripts are also available on a GitHub repository at https://github.com/weihao94/DEBoost.
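The idea of combining predictions from any models that expose a predict method can be sketched as follows. This is a minimal illustration and does not use the deboost API: the `ConstantModel` class, the `distance_weighted_ensemble` function, and the weighting rule (inverse absolute distance from the mean prediction) are all assumptions made for the example, and the scheme deboost actually implements may differ.

```python
class ConstantModel:
    """Hypothetical stand-in for any fitted model exposing predict()."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


def distance_weighted_ensemble(models, X, eps=1e-9):
    """Combine per-sample predictions, down-weighting outlying models.

    Assumed rule (illustrative only): each model's weight for a sample is
    the inverse of the absolute distance between its prediction and the
    mean prediction of all models for that sample.
    """
    all_preds = [m.predict(X) for m in models]
    combined = []
    for sample_preds in zip(*all_preds):
        center = sum(sample_preds) / len(sample_preds)
        weights = [1.0 / (abs(p - center) + eps) for p in sample_preds]
        total = sum(weights)
        combined.append(sum(w * p for w, p in zip(weights, sample_preds)) / total)
    return combined
```

Because the function relies only on `predict`, any scikit-learn style estimator could stand in for `ConstantModel` here.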

PyDL8.5: a Library for Learning Optimal Decision Trees

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2020/750 ◽ 2020

Author(s): Gaël Aglin ◽ Siegfried Nijssen ◽ Pierre Schaus

Keyword(s): Machine Learning ◽ Decision Trees ◽ Efficient Algorithm ◽ Optimal Decision ◽ Learning Tasks ◽ Explainable Ai ◽ Classification Tasks ◽ Interpretable Models ◽ Limited Depth

Decision Trees (DTs) are widely used Machine Learning (ML) models with a broad range of applications. The interest in these models has increased even further in the context of Explainable AI (XAI), as decision trees of limited depth are very interpretable models. However, traditional algorithms for learning DTs are heuristic in nature; they may produce trees that are of suboptimal quality under depth constraints. We introduce PyDL8.5, a Python library to infer depth-constrained Optimal Decision Trees (ODTs). PyDL8.5 provides an interface for DL8.5, an efficient algorithm for inferring depth-constrained ODTs. The library provides an easy-to-use scikit-learn compatible interface. It can be used not only for classification tasks, but also for regression, clustering, and other tasks. We introduce an interface that allows users to easily implement these other learning tasks. We provide a number of examples of how to use this library.
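To make "optimal under a depth constraint" concrete, the sketch below computes the minimum training error achievable by any decision tree of bounded depth on binary (0/1) features. It is a plain exhaustive search written for illustration; it is not the DL8.5 algorithm and does not use the PyDL8.5 API, and the function names are hypothetical.

```python
from collections import Counter


def leaf_error(labels):
    """Misclassifications when a leaf predicts its majority class."""
    if not labels:
        return 0
    return len(labels) - max(Counter(labels).values())


def optimal_tree_error(rows, labels, depth):
    """Minimum training error over all trees with at most `depth` split levels.

    Exhaustive recursion over binary features: try every feature at the
    root, solve both children optimally, and keep the best total error.
    """
    best = leaf_error(labels)  # option 1: stop and make this node a leaf
    if depth == 0 or best == 0:
        return best
    n_features = len(rows[0]) if rows else 0
    for f in range(n_features):
        left = [i for i, r in enumerate(rows) if r[f] == 0]
        right = [i for i, r in enumerate(rows) if r[f] == 1]
        if not left or not right:
            continue  # this feature does not split the data
        err = optimal_tree_error(
            [rows[i] for i in left], [labels[i] for i in left], depth - 1
        ) + optimal_tree_error(
            [rows[i] for i in right], [labels[i] for i in right], depth - 1
        )
        best = min(best, err)
    return best
```

On XOR-style data, no single split reduces the error, so a greedy learner that judges splits one level at a time gains nothing at the root, whereas the exhaustive search finds the zero-error tree once depth 2 is allowed.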

Classification with Incomplete Data

Handbook of Research on Machine Learning Applications and Trends ◽ 10.4018/978-1-60566-766-9.ch007 ◽ 2010 ◽ pp. 147-175 ◽ Cited By ~ 1

Author(s): Pedro J. García-Laencina ◽ Juan Morales-Sánchez ◽ Rafael Verdú-Monedero ◽ Jorge Larrey-Ruiz ◽ José-Luis Sancho-Gómez ◽ ...

Keyword(s): Machine Learning ◽ Missing Data ◽ Pattern Classification ◽ Incomplete Data ◽ Data Handling ◽ Advantages And Disadvantages ◽ Word Classification ◽ Classification Tasks ◽ Missing Data Techniques ◽ Fundamental Requirement

Many real-world classification scenarios suffer from a common drawback: missing, or incomplete, data. The ability to handle missing data has become a fundamental requirement for pattern classification because the absence of certain values for relevant data attributes can seriously affect the accuracy of classification results. This chapter focuses on incomplete pattern classification. Research on this topic continues to expand, and most solutions based on machine learning are well known to be useful and efficient. This chapter analyzes the most popular and suitable missing data techniques based on machine learning for solving pattern classification tasks, highlighting their advantages and disadvantages.

A Novel Mutual Information Based Feature Set for Drivers’ Mental Workload Evaluation Using Machine Learning

Brain Sciences ◽ 10.3390/brainsci10080551 ◽ 2020 ◽ Vol 10 (8) ◽ pp. 551

Author(s): Mir Riyanul Islam ◽ Shaibal Barua ◽ Mobyen Uddin Ahmed ◽ Shahina Begum ◽ Pietro Aricò ◽ ...

Keyword(s): Machine Learning ◽ Mutual Information ◽ Mental Workload ◽ Contextual Information ◽ Absolute Error ◽ The Novel ◽ Promising Technique ◽ Extensive Evaluation ◽ Target Values ◽ Classification Tasks

Analysis of physiological signals, electroencephalography more specifically, is considered a very promising technique for obtaining objective measures of mental workload. However, it requires a complex recording apparatus and therefore has poor usability for monitoring in-vehicle drivers’ mental workload. This study proposes a methodology for constructing a novel mutual information-based feature set from the fusion of electroencephalography and vehicular signals acquired through a real driving experiment and deployed in evaluating drivers’ mental workload. Mutual information of electroencephalography and vehicular signals was used as the prime factor for the fusion of features. To assess the reliability of the developed feature set, mental workload score prediction, classification, and event classification tasks were performed using different machine learning models. Moreover, features extracted from electroencephalography were used to compare the performance. In the prediction of mental workload score, expert-defined scores were used as the target values. For classification tasks, true labels were set from contextual information of the experiment. An extensive evaluation of every prediction task was carried out using different validation methods. In predicting the mental workload score from the proposed feature set, the lowest mean absolute error was 0.09, and for classifying mental workload the highest accuracy was 94%. According to the outcome of the study, the novel mutual information-based features developed through the proposed approach can be employed to classify and monitor in-vehicle drivers’ mental workload.

Unsupervised star, galaxy, QSO classification

Astronomy and Astrophysics ◽ 10.1051/0004-6361/201936648 ◽ 2020 ◽ Vol 633 ◽ pp. A154

Author(s): C. H. A. Logan ◽ S. Fotopoulou

Keyword(s): Machine Learning ◽ Spatial Clustering ◽ Model Fitting ◽ Attribute Selection ◽ Learning Method ◽ Object Class ◽ Learning Methods ◽ Photometric Redshifts ◽ Classification Tasks ◽ Viable Approach

Context. Classification will be an important first step for upcoming surveys aimed at detecting billions of new sources, such as LSST and Euclid, as well as DESI, 4MOST, and MOONS. The application of traditional methods of model fitting and colour-colour selections will face significant computational constraints, while machine-learning methods offer a viable approach to tackle datasets of that volume.

Aims. While supervised learning methods can prove very useful for classification tasks, the creation of representative and accurate training sets is a task that consumes a great deal of resources and time. We present a viable alternative using an unsupervised machine learning method to separate stars, galaxies and QSOs using photometric data.

Methods. The heart of our work uses Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to find the star, galaxy, and QSO clusters in a multidimensional colour space. We optimized the hyperparameters and input attributes of three separate HDBSCAN runs, each to select a particular object class and, thus, treat the output of each separate run as a binary classifier. We subsequently consolidated the output to give our final classifications, optimized on the basis of their F1 scores. We explored the use of Random Forest and PCA as part of the pre-processing stage for feature selection and dimensionality reduction.

Results. Using our dataset of ∼50 000 spectroscopically labelled objects we obtain F1 scores of 98.9, 98.9, and 93.13 respectively for star, galaxy, and QSO selection using our unsupervised learning method. We find that careful attribute selection is a vital part of accurate classification with HDBSCAN. We applied our classification to a subset of the SDSS spectroscopic catalogue and demonstrated the potential of our approach in correcting misclassified spectra useful for DESI and 4MOST. Finally, we created a multiwavelength catalogue of 2.7 million sources using the KiDS, VIKING, and ALLWISE surveys and published corresponding classifications and photometric redshifts.
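Since each run is treated as a binary classifier scored by F1, a short sketch of how a per-class F1 score can be computed from spectroscopic labels may help. The function name and the one-vs-rest treatment of labels are assumptions for illustration, not the authors' code; the function returns a fraction in [0, 1], whereas the abstract quotes percentages.

```python
def f1_score(y_true, y_pred, positive):
    """One-vs-rest F1 for the class `positive` (e.g. "star")."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

Calling the function once per object class (star, galaxy, QSO) gives one score per binary selection, matching the way three scores are reported.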

Emotion analysis of Arabic tweets using deep learning approach

Journal Of Big Data ◽ 10.1186/s40537-019-0252-x ◽ 2019 ◽ Vol 6 (1)

Author(s): Massa Baali ◽ Nada Ghneim

Keyword(s): Machine Learning ◽ Deep Learning ◽ Media Analysis ◽ Machine Learning Algorithms ◽ Learning Approach ◽ Learning Approaches ◽ Social Media Analysis ◽ Deep Convolutional Neural Networks ◽ Emotion Analysis ◽ Classification Tasks

Nowadays, sharing moments on social networks has become widespread. People share ideas, thoughts, and good memories to express their emotions through text without using a lot of words. Twitter, for instance, is a rich source of data that organizations can use to analyze people’s opinions, sentiments and emotions. Emotion analysis normally gives a more profound overview of the feelings of an author. In Arabic social media analysis, nearly all projects have focused on analyzing expressions as positive, negative or neutral. In this paper we intend to categorize expressions on the basis of emotions, namely happiness, anger, fear, and sadness. Different approaches have been carried out in the area of automatic textual emotion recognition for other languages, but only a limited number were based on deep learning. Thus, we present our approach to classifying emotions in Arabic tweets. Our model implements a deep Convolutional Neural Network (CNN) trained on top of word vectors trained specifically on our dataset for sentence classification tasks. We compared the results of this approach with three other machine learning algorithms: SVM, NB and MLP. The architecture of our deep learning approach is an end-to-end network with word, sentence, and document vectorization steps. The proposed deep learning approach was evaluated on the Arabic tweets dataset provided by SemEval for the EI-oc task, and the results, compared to the traditional machine learning approaches, were excellent.

mAML: an automated machine learning pipeline with a microbiome repository for human disease classification

10.1101/2020.02.11.943316 ◽ 2020

Author(s): Fenglong Yang ◽ Quan Zou

Keyword(s): Machine Learning ◽ Human Disease ◽ High Performance ◽ Model Building ◽ Disease Classification ◽ Benchmark Datasets ◽ Automated Machine Learning ◽ Classification Tasks ◽ Interpretable Models ◽ Multi Class Classification

Due to the concerted efforts to utilize microbial features to improve disease prediction capabilities, automated machine learning (AutoML) systems designed to remove the tedium of manually performing ML tasks are in great demand. Here we developed mAML, an ML model-building pipeline, which can automatically and rapidly generate optimized and interpretable models for personalized microbial classification tasks in a reproducible way. The pipeline is deployed on a web-based platform and the server is user-friendly, flexible, and has been designed to be scalable according to the specific requirements. This pipeline exhibits high performance for 13 benchmark datasets including both binary and multi-class classification tasks. In addition, to facilitate the application of mAML and expand the human disease-related microbiome learning repository, we developed GMrepo ML repository (GMrepo Microbiome Learning repository) from the GMrepo database. The repository involves 120 microbial classification tasks for 85 human-disease phenotypes referring to 12,429 metagenomic samples and 38,643 amplicon samples. The mAML pipeline and the GMrepo ML repository are expected to be important resources for research in microbiology and algorithm development. Database URL: http://39.100.246.211:8050/Home

How Does Knowledge of the AUC Constrain the Set of Possible Ground-Truth Labelings?

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33015425 ◽ 2019 ◽ Vol 33 ◽ pp. 5425-5432

Author(s): Jacob Whitehill

Keyword(s): Machine Learning ◽ Empirical Evidence ◽ Recent Work ◽ Roc Curve ◽ Binary Classification ◽ Ground Truth ◽ Mathematical Structure ◽ Test Set ◽ Classification Tasks ◽ N Vector

Recent work on privacy-preserving machine learning has considered how datamining competitions such as Kaggle could potentially be “hacked”, either intentionally or inadvertently, by using information from an oracle that reports a classifier’s accuracy on the test set (Blum and Hardt 2015; Hardt and Ullman 2014; Zheng 2015; Whitehill 2016). For binary classification tasks in particular, one of the most common accuracy metrics is the Area Under the ROC Curve (AUC), and in this paper we explore the mathematical structure of how the AUC is computed from an n-vector of real-valued “guesses” with respect to the ground-truth labels. Under the assumption of perfect knowledge of the test set AUC c=p/q, we show how knowing c constrains the set W of possible ground-truth labelings, and we derive an algorithm both to compute the exact number of such labelings and to enumerate efficiently over them. We also provide empirical evidence that, surprisingly, the number of compatible labelings can actually decrease as n grows, until a test set-dependent threshold is reached. Finally, we show how W can be efficiently whittled down, through pairs of oracle queries, to infer all the ground-truth test labels with complete certainty.
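The relationship between an n-vector of guesses, the ground-truth labels, and an exact AUC value c = p/q can be illustrated with a small sketch. The AUC is computed here as the fraction of (positive, negative) pairs ranked correctly, with ties counted as one half. The enumeration is a brute-force search over all 2^n labelings, written only to show what "compatible labelings" means; it is not the efficient algorithm derived in the paper, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def auc(guesses, labels):
    """Exact AUC as a Fraction: share of (pos, neg) pairs ranked correctly."""
    pos = [g for g, y in zip(guesses, labels) if y == 1]
    neg = [g for g, y in zip(guesses, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = Fraction(0)
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += Fraction(1, 2)  # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


def compatible_labelings(guesses, c):
    """Brute force: every binary labeling whose AUC equals c exactly."""
    result = []
    for labels in product((0, 1), repeat=len(guesses)):
        if 0 < sum(labels) < len(labels) and auc(guesses, labels) == c:
            result.append(labels)
    return result
```

Using `Fraction` keeps the comparison with c exact, which matters because the argument depends on knowing the AUC precisely rather than as a rounded float.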

Designing deep neural networks for continual learning in an open world

10.21248/gups.62487 ◽ 2021

Author(s): Martin Mundt

Keyword(s): Neural Network ◽ Machine Learning ◽ Deep Learning ◽ Network Architecture ◽ Neural Network Training ◽ Neural Network Architecture ◽ Neural Architecture ◽ Network Training ◽ Classification Tasks ◽ Continual Learning

Deep learning with neural networks seems to have largely replaced traditional design of computer vision systems. Automated methods to learn a plethora of parameters are now used in favor of previously practiced selection of explicit mathematical operators for a specific task. The entailed promise is that practitioners no longer need to take care of every individual step, but rather focus on gathering big amounts of data for neural network training. As a consequence, both a shift in mindset towards a focus on big datasets, as well as a wave of conceivable applications based exclusively on deep learning can be observed. This PhD dissertation aims to uncover some of the only implicitly mentioned or overlooked deep learning aspects, highlight unmentioned assumptions, and finally introduce methods to address respective immediate weaknesses. In the author’s humble opinion, these prevalent shortcomings can be tied to the fact that the involved steps in the machine learning workflow are frequently decoupled. Success is predominantly measured based on accuracy measures designed for evaluation with static benchmark test sets. Individual machine learning workflow components are assessed in isolation with respect to available data, choice of neural network architecture, and a particular learning algorithm, rather than viewing the machine learning system as a whole in context of a particular application. Correspondingly, in this dissertation, three key challenges have been identified: 1. Choice and flexibility of a neural network architecture. 2. Identification and rejection of unseen unknown data to avoid false predictions. 3. Continual learning without forgetting of already learned information. These latter challenges have already been crucial topics in older literature, alas, seem to require a renaissance in modern deep learning literature. 
Initially, it may appear that they pose independent research questions, however, the thesis posits that the aspects are intertwined and require a joint perspective in machine learning based systems. In summary, the essential question is thus how to pick a suitable neural network architecture for a specific task, how to recognize which data inputs belong to this context, which ones originate from potential other tasks, and ultimately how to continuously include such identified novel data in neural network training over time without overwriting existing knowledge. Thus, the central emphasis of this dissertation is to build on top of existing deep learning strengths, yet also acknowledge mentioned weaknesses, in an effort to establish a deeper understanding of interdependencies and synergies towards the development of unified solution mechanisms. For this purpose, the main portion of the thesis is in cumulative form. The respective publications can be grouped according to the three challenges outlined above. Correspondingly, chapter 1 is focused on choice and extendability of neural network architectures, analyzed in context of popular image classification tasks. An algorithm to automatically determine neural network layer width is introduced and is first contrasted with static architectures found in the literature. The importance of neural architecture design is then further showcased on a real-world application of defect detection in concrete bridges. Chapter 2 is comprised of the complementary ensuing questions of how to identify unknown concepts and subsequently incorporate them into continual learning. A joint central mechanism to distinguish unseen concepts from what is known in classification tasks, while enabling consecutive training without forgetting or revisiting older classes, is proposed. Once more, the role of the chosen neural network architecture is quantitatively reassessed. 
Finally, chapter 3 culminates in an overarching view, where developed parts are connected. Here, an extensive survey further serves the purpose to embed the gained insights in the broader literature landscape and emphasizes the importance of a common frame of thought. The ultimately presented approach thus reflects the overall thesis’ contribution to advance neural network based machine learning towards a unified solution that ties together choice of neural architecture with the ability to learn continually and the capability to automatically separate known from unknown data.
