Turning biases into hypotheses through method: A logic of scientific discovery for machine learning

Machine learning (ML) systems have shown great potential for performing or supporting inferential reasoning through analyzing large data sets, thereby potentially facilitating more informed decision-making. However, a hindrance to such use of ML systems is that the predictive models created through ML are often complex, opaque, and poorly understood, even if the programs “learning” the models are simple, transparent, and well understood. ML models become difficult to trust, since lay-people, specialists, and even researchers have difficulties gauging the reasonableness, correctness, and reliability of the inferences performed. In this article, we argue that bridging this gap in the understanding of ML models and their reasonableness requires a focus on developing an improved methodology for their creation. This process has been likened to “alchemy” and criticized for involving a large degree of “black art,” owing to its reliance on poorly understood “best practices”. We soften this critique and argue that the seeming arbitrariness often is the result of a lack of explicit hypothesizing stemming from an empiricist and myopic focus on optimizing for predictive performance rather than from an occult or mystical process. We present some of the problems resulting from the excessive focus on optimizing generalization performance at the cost of hypothesizing about the selection of data and biases. We suggest embedding ML in a general logic of scientific discovery similar to the one presented by Charles Sanders Peirce, and present a recontextualized version of Peirce’s scientific hypothesis adjusted to ML.

Download Full-text

Analysis of Risk Factors in Dementia Through Machine Learning

Journal of Alzheimer s Disease ◽

10.3233/jad-200955 ◽

2020 ◽

pp. 1-17

Author(s):

Francisco Javier Balea-Fernandez ◽

Beatriz Martinez-Vega ◽

Samuel Ortega ◽

Himar Fabelo ◽

Raquel Leon ◽

...

Keyword(s):

Machine Learning ◽

Optimization Algorithms ◽

Progressive Increase ◽

Control Group ◽

Data Sets ◽

Modifiable Factors ◽

Validation Set ◽

The One ◽

And Control ◽

Potential Tool

Background: Sociodemographic data indicate the progressive increase in life expectancy and the prevalence of Alzheimer’s disease (AD). AD is raised as one of the greatest public health problems. Its etiology is twofold: on the one hand, non-modifiable factors and on the other, modifiable. Objective: This study aims to develop a processing framework based on machine learning (ML) and optimization algorithms to study sociodemographic, clinical, and analytical variables, selecting the best combination among them for an accurate discrimination between controls and subjects with major neurocognitive disorder (MNCD). Methods: This research is based on an observational-analytical design. Two research groups were established: MNCD group (n = 46) and control group (n = 38). ML and optimization algorithms were employed to automatically diagnose MNCD. Results: Twelve out of 37 variables were identified in the validation set as the most relevant for MNCD diagnosis. Sensitivity of 100%and specificity of 71%were achieved using a Random Forest classifier. Conclusion: ML is a potential tool for automatic prediction of MNCD which can be applied to relatively small preclinical and clinical data sets. These results can be interpreted to support the influence of the environment on the development of AD.

Download Full-text

Generation of geometric interpolations of building types with deep variational autoencoders

Design Science ◽

10.1017/dsj.2020.31 ◽

2020 ◽

Vol 6 ◽

Author(s):

Jaime de Miguel Rodríguez ◽

Maria Eugenia Villafañe ◽

Luka Piškorec ◽

Fernando Sancho Caparrini

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Large Data ◽

Learning Model ◽

Large Data Sets ◽

Data Sets ◽

Connectivity Map ◽

Data Set ◽

3D Objects ◽

Machine Learning Model

Abstract This work presents a methodology for the generation of novel 3D objects resembling wireframes of building types. These result from the reconstruction of interpolated locations within the learnt distribution of variational autoencoders (VAEs), a deep generative machine learning model based on neural networks. The data set used features a scheme for geometry representation based on a ‘connectivity map’ that is especially suited to express the wireframe objects that compose it. Additionally, the input samples are generated through ‘parametric augmentation’, a strategy proposed in this study that creates coherent variations among data by enabling a set of parameters to alter representative features on a given building type. In the experiments that are described in this paper, more than 150 k input samples belonging to two building types have been processed during the training of a VAE model. The main contribution of this paper has been to explore parametric augmentation for the generation of large data sets of 3D geometries, showcasing its problems and limitations in the context of neural networks and VAEs. Results show that the generation of interpolated hybrid geometries is a challenging task. Despite the difficulty of the endeavour, promising advances are presented.

Download Full-text

Machine Learning Improves the Precision and Robustness of High-Content Screens

CrossRef Listing of Deleted DOIs ◽

10.1177/1087057111414878 ◽

2011 ◽

Vol 16 (9) ◽

pp. 1059-1067 ◽

Cited By ~ 44

Author(s):

Peter Horvath ◽

Thomas Wild ◽

Ulrike Kutay ◽

Gabor Csucs

Keyword(s):

Machine Learning ◽

User Interaction ◽

Texture Features ◽

Large Data ◽

Complete Analysis ◽

Data Sets ◽

Reporter Protein ◽

Fluorescent Intensity ◽

Analysis Workflow

Imaging-based high-content screens often rely on single cell-based evaluation of phenotypes in large data sets of microscopic images. Traditionally, these screens are analyzed by extracting a few image-related parameters and use their ratios (linear single or multiparametric separation) to classify the cells into various phenotypic classes. In this study, the authors show how machine learning–based classification of individual cells outperforms those classical ratio-based techniques. Using fluorescent intensity and morphological and texture features, they evaluated how the performance of data analysis increases with increasing feature numbers. Their findings are based on a case study involving an siRNA screen monitoring nucleoplasmic and nucleolar accumulation of a fluorescently tagged reporter protein. For the analysis, they developed a complete analysis workflow incorporating image segmentation, feature extraction, cell classification, hit detection, and visualization of the results. For the classification task, the authors have established a new graphical framework, the Advanced Cell Classifier, which provides a very accurate high-content screen analysis with minimal user interaction, offering access to a variety of advanced machine learning methods.

Download Full-text

Machine learning-based prediction system for rainfall-induced landslides in Benguet First Engineering District

10.31219/osf.io/csx6r ◽

2019 ◽

Author(s):

Zanya Reubenne D. Omadlao ◽

Nica Magdalena A. Tuguinay ◽

Ricarido Maglaqui Saturay

Keyword(s):

Machine Learning ◽

Daily Rainfall ◽

Predictive Performance ◽

Data Sets ◽

Prediction System ◽

True Positive ◽

Rainfall Thresholds ◽

Cumulative Rainfall ◽

Testing Data ◽

Positive Rate

A machine learning-based prediction system for rainfall-induced landslides in Benguet First Engineering District is proposed to address the landslide risk due to the climate and topography of Benguet province. It is intended to improve the decision support system for road management with regards to landslides, as implemented by the Department of Public Works and Highways Benguet First District Engineering Office. Supervised classification was applied to daily rainfall and landslide data for the Benguet First Engineering District covering the years 2014 to 2018 using scikit-learn. Various forms of cumulative rainfall values were used to predict landslide occurrence for a given day. Following typical machine learning workflows, rainfall-landslide data set was divided into training and testing data sets. Machine learning algorithms such as K-Nearest Neighbors, Gaussian Naïve Bayes, Support Vector Machine, Logistic Regression, Random Forest, Decision Tree, and AdaBoost were trained using the training data sets, and the trained models were used to make predictions based on the testing data sets. Predictive performance of the models vis-a-vis the testing data sets were compared using true positive rates, false positive rates, and the area under the Receiver Operating Characteristic Curve. Predictive performance of these models were then compared to 1-day cumulative rainfall thresholds commonly used for landslide predictions. Among the machine learning models evaluated, Gaussian Naïve Bayes has the best performance, with mean false positive rate, true positive rate and area under the curve scores of 7%, 76%, and 84% respectively. It also performs better than the 1-day cumulative rainfall thresholds. This research demonstrates the potential of machine learning for identifying temporal patterns in rainfall-induced landslides using minimal data input -- daily rainfall from a single synoptic station, and highway maintenance records. Such an approach may be tested and applied to similar problems in the field of disaster risk reduction and management.

Download Full-text

Deep Learning Approaches for Sentiment Analysis Challenges and Future Issues

10.4018/978-1-7998-8161-2.ch003 ◽

2022 ◽

pp. 27-50

Author(s):

Rajalaxmi Prabhu B. ◽

Seema S.

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Model Building ◽

Large Data ◽

Machine Learning Algorithms ◽

Large Data Sets ◽

Data Sets ◽

Learning Approaches ◽

Learning Techniques ◽

Important Challenge

A lot of user-generated data is available these days from huge platforms, blogs, websites, and other review sites. These data are usually unstructured. Analyzing sentiments from these data automatically is considered an important challenge. Several machine learning algorithms are implemented to check the opinions from large data sets. A lot of research has been undergone in understanding machine learning approaches to analyze sentiments. Machine learning mainly depends on the data required for model building, and hence, suitable feature exactions techniques also need to be carried. In this chapter, several deep learning approaches, its challenges, and future issues will be addressed. Deep learning techniques are considered important in predicting the sentiments of users. This chapter aims to analyze the deep-learning techniques for predicting sentiments and understanding the importance of several approaches for mining opinions and determining sentiment polarity.

Download Full-text

Recommendations for Reporting Machine Learning Analyses in Clinical Research

Circulation Cardiovascular Quality and Outcomes ◽

10.1161/circoutcomes.120.006556 ◽

2020 ◽

Vol 13 (10) ◽

Cited By ~ 1

Author(s):

Laura M. Stevens ◽

Bobak J. Mortazavi ◽

Rahul C. Deo ◽

Lesley Curtis ◽

David P. Kao

Keyword(s):

Machine Learning ◽

Clinical Research ◽

Clinical Experience ◽

Clinical Data ◽

Critical Evaluation ◽

Predictive Performance ◽

Structured Reporting ◽

Data Sets ◽

Overwhelming Evidence ◽

Peer Reviewers

Use of machine learning (ML) in clinical research is growing steadily given the increasing availability of complex clinical data sets. ML presents important advantages in terms of predictive performance and identifying undiscovered subpopulations of patients with specific physiology and prognoses. Despite this popularity, many clinicians and researchers are not yet familiar with evaluating and interpreting ML analyses. Consequently, readers and peer-reviewers alike may either overestimate or underestimate the validity and credibility of an ML-based model. Conversely, ML experts without clinical experience may present details of the analysis that are too granular for a clinical readership to assess. Overwhelming evidence has shown poor reproducibility and reporting of ML models in clinical research suggesting the need for ML analyses to be presented in a clear, concise, and comprehensible manner to facilitate understanding and critical evaluation. We present a recommendation for transparent and structured reporting of ML analysis results specifically directed at clinical researchers. Furthermore, we provide a list of key reporting elements with examples that can be used as a template when preparing and submitting ML-based manuscripts for the same audience.

Download Full-text

Precision-Recall versus Accuracy and the Role of Large Data Sets

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014039 ◽

2019 ◽

Vol 33 ◽

pp. 4039-4048 ◽

Cited By ~ 8

Author(s):

Brendan Juba ◽

Hai S. Le

Keyword(s):

Machine Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Large Data ◽

Constant Factor ◽

Data Sets ◽

Data Set ◽

Small Constant ◽

Classifier Performance ◽

Necessary And Sufficient

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider the measures of classifier performance in terms of precision and recall, a measure that is widely suggested as more appropriate to the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.

Download Full-text

INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00198-1 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Hooman Zabeti ◽

Nick Dexter ◽

Amir Hosein Safari ◽

Nafiseh Sedaghat ◽

Maxwell Libbrecht ◽

...

Keyword(s):

Machine Learning ◽

Drug Resistance ◽

Predictive Accuracy ◽

Group Testing ◽

Predictive Performance ◽

Machine Learning Techniques ◽

Evaluation Metrics ◽

Lower Accuracy ◽

Unseen Data ◽

The One

Abstract Motivation Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit a lower accuracy or lack the flexibility needed to generalize them to previously unseen data. Contribution In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions, interpretable results, and is flexible enough to be optimized for various evaluation metrics at the same time. Results We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that it has a higher or comparable accuracy to that of commonly used machine learning models, and is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed via The Python Package Index (Pypi) under ingotdr. This package is also compatible with most of the tools in the Scikit-learn machine learning library.

Download Full-text

Machine Learning for Predicting Mycotoxin Occurrence in Maize

Frontiers in Microbiology ◽

10.3389/fmicb.2021.661132 ◽

2021 ◽

Vol 12 ◽

Author(s):

Marco Camardo Leggieri ◽

Marco Mazzoni ◽

Paola Battilani

Keyword(s):

Machine Learning ◽

Mechanistic Model ◽

Cropping System ◽

Large Data ◽

Predictive Performance ◽

Added Value ◽

Linear Regression Models ◽

Data Set ◽

Input Variables ◽

Aflatoxin B

Meteorological conditions are the main driving variables for mycotoxin-producing fungi and the resulting contamination in maize grain, but the cropping system used can mitigate this weather impact considerably. Several researchers have investigated cropping operations’ role in mycotoxin contamination, but these findings were inconclusive, precluding their use in predictive modeling. In this study a machine learning (ML) approach was considered, which included weather-based mechanistic model predictions for AFLA-maize and FER-maize [predicting aflatoxin B1 (AFB1) and fumonisins (FBs), respectively], and cropping system factors as the input variables. The occurrence of AFB1 and FBs in maize fields was recorded, and their corresponding cropping system data collected, over the years 2005–2018 in northern Italy. Two deep neural network (DNN) models were trained to predict, at harvest, which maize fields were contaminated beyond the legal limit with AFB1 and FBs. Both models reached an accuracy >75% demonstrating the ML approach added value with respect to classical statistical approaches (i.e., simple or multiple linear regression models). The improved predictive performance compared with that obtained for AFLA-maize and FER-maize was clearly demonstrated. This coupled to the large data set used, comprising a 13-year time series, and the good results for the statistical scores applied, together confirmed the robustness of the models developed here.

Download Full-text

Implementation of Supervised Learning towards Optimizing Queries in Database Systems

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b3531.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 1182-1187

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Student Loans ◽

Large Data ◽

Database Systems ◽

Large Data Sets ◽

Data Sets ◽

Human Intervention ◽

Huge Data ◽

Future Direction

Machine learning is a technology which with accumulated data provides better decisions towards future applications. It is also the scientific study of algorithms implemented efficiently to perform a specific task without using explicit instructions. It may also be viewed as a subset of artificial intelligence in which it may be linked with the ability to automatically learn and improve from experience without being explicitly programmed. Its primary intention is to allow the computers learn automatically and produce more accurate results in order to identify profitable opportunities. Combining machine learning with AI and cognitive technologies can make it even more effective in processing large volumes human intervention or assistance and adjust actions accordingly. It may enable analyzing the huge data of information. It may also be linked to algorithm driven study towards improving the performance of the tasks. In such scenario, the techniques can be applied to judge and predict large data sets. The paper concerns the mechanism of supervised learning in the database systems, which would be self driven as well as secure. Also the citation of an organization dealing with student loans has been presented. The paper ends discussion, future direction and conclusion.

Download Full-text