Visualizing Individual and Region-specific Microbial–metabolite Relations by Important Variable Selection Using Machine Learning Approaches [Dedicated to Prof. T. Okada and Prof. T. Nishioka: data science in chemistry]

2017 ◽  
Vol 18 (0) ◽  
pp. 31-41 ◽  
Author(s):  
Satoshi Tsutsui ◽  
Yasuhiro Date ◽  
Jun Kikuchi
Author(s):  
Greg Lawrance ◽  
Raphael Parra Hernandez ◽  
Khalegh Mamakani ◽  
Suraiya Khan ◽  
Brent Hills ◽  
...  

Introduction: Ligo is an open source application that provides a framework for managing and executing administrative data linking projects. Ligo provides an easy-to-use web interface that lets analysts select among data linking methods, including deterministic, probabilistic, and machine learning approaches, and use these in a documented, repeatable, tested, step-by-step process. Objectives and Approach: The linking application has two primary functions: identifying common entities within a dataset (de-duplication) and identifying common entities between datasets (linking). The application is being built from the ground up in a partnership between the Province of British Columbia’s Data Innovation (DI) Program and Population Data BC, with input from data scientists. The simple web interface allows analysts to streamline the processing of multiple datasets in a straightforward and reproducible manner. Results: Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms, with a plugin architecture that supports the simple addition of new formulae. Currently, deterministic approaches to linking have been implemented, and probabilistic methods are in alpha testing. A fully functional alpha, including deterministic and probabilistic methods, is expected to be ready in September, with a machine learning extension expected soon after. Conclusion/Implications: Ligo has been designed with enterprise users in mind. The application is intended to make the processes of data de-duplication and linking simple, fast, and reproducible. By making the application open source, we encourage feedback and collaboration from across the population research and data science community.
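The two linking strategies the abstract names can be sketched in a few lines. This is a hypothetical illustration, not Ligo's actual API: `deterministic_link` requires exact agreement on key fields, while `probabilistic_score` sums Fellegi–Sunter-style agreement weights, with records, field names, and weights invented for the example.

```python
def deterministic_link(a, b, keys):
    """Link records that agree exactly on every key field."""
    index = {tuple(r[k] for k in keys): r for r in b}
    return [(r, index[tuple(r[k] for k in keys)])
            for r in a if tuple(r[k] for k in keys) in index]

def probabilistic_score(r1, r2, weights):
    """Sum agreement weights; a pair is a link if the score passes a threshold."""
    return sum(w for field, w in weights.items() if r1.get(field) == r2.get(field))

a = [{"name": "Ann Lee", "dob": "1980-01-02", "city": "Victoria"}]
b = [{"name": "Ann Lee", "dob": "1980-01-02", "city": "Vancouver"}]

print(deterministic_link(a, b, ["name", "dob"]))  # one exact match on name+dob
print(probabilistic_score(a[0], b[0], {"name": 4.0, "dob": 5.0, "city": 2.0}))  # 9.0
```

The probabilistic pair still scores well despite the city disagreement, which is the point of weighted scoring over exact matching.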


Author(s):  
Omar Farooq ◽  
Parminder Singh

Introduction: The emergence of concepts like Big Data, Data Science, Machine Learning (ML), and the Internet of Things (IoT) has expanded the potential of research in today's world. The growing number of IoT devices and sensors that collect data continuously puts tremendous pressure on the existing IoT network. Materials and Methods: This resource-constrained IoT environment is flooded with data acquired from millions of IoT nodes deployed at the device level. The limited resources of the IoT network have driven researchers towards data management. This paper focuses on data classification at the device level, edge/fog level, and cloud level using machine learning techniques. Results: The data coming from different devices is vast and varied. Therefore, it becomes essential to choose the right approach for classification and analysis, which will help optimize the data at the device and edge/fog levels to improve the network's future performance. Conclusion: This paper presents data classification, machine learning approaches, and a proposed mathematical model for the IoT environment.
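The device/edge/cloud tiering described above can be illustrated with a minimal sketch (thresholds, readings, and category names are hypothetical, not from the paper): raw values are filtered at the device, cheaply classified at the edge, and only flagged readings are forwarded to the cloud.

```python
def device_filter(readings, lo=0.0, hi=100.0):
    """Drop out-of-range sensor values before they leave the device."""
    return [r for r in readings if lo <= r <= hi]

def edge_classify(reading):
    """Cheap edge/fog classification: route only anomalies upstream."""
    return "anomaly" if reading > 80.0 else "normal"

readings = [12.5, 150.0, 85.3, 42.0, -5.0]
kept = device_filter(readings)
labels = [edge_classify(r) for r in kept]
to_cloud = [r for r, l in zip(kept, labels) if l == "anomaly"]
print(kept, labels, to_cloud)
# [12.5, 85.3, 42.0] ['normal', 'anomaly', 'normal'] [85.3]
```

Only one of five raw readings reaches the cloud, which is the resource saving the paper's tiered classification aims for.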


2018 ◽  
Vol 47 (6) ◽  
pp. 1081-1097 ◽  
Author(s):  
Jeremy Auerbach ◽  
Christopher Blackburn ◽  
Hayley Barton ◽  
Amanda Meng ◽  
Ellen Zegura

We estimate the cost and impact of a proposed anti-displacement program in the Westside of Atlanta (GA) with data science and machine learning techniques. This program intends to fully subsidize property tax increases for eligible residents of neighborhoods where there are two major urban renewal projects underway, a stadium and a multi-use trail. We first estimate household-level income eligibility for the program with data science and machine learning approaches applied to publicly available household-level data. We then forecast future property appreciation due to urban renewal projects using random forests with historic tax assessment data. Combining these projections with household-level eligibility, we estimate the costs of the program for different eligibility scenarios. We find that our household-level data and machine learning techniques result in fewer eligible homeowners but significantly larger program costs, due to higher property appreciation rates than the original analysis, which was based on census and city-level data. Our methods have limitations, namely incomplete data sets, the accuracy of representative income samples, the availability of characteristic training set data for the property tax appreciation model, and challenges in validating the model results. The eligibility estimates and property appreciation forecasts we generated were also incorporated into an interactive tool for residents to determine program eligibility and view their expected increases in home values. Community residents have been involved with this work, which has provided greater transparency, accountability, and impact for the proposed program. Data collected from residents can also correct and update the information, which would increase the accuracy of the program estimates and validate the modeling, leading to a novel application of community-driven data science.


2020 ◽  
Author(s):  
Shannon M. VanAken ◽  
Duane Newton ◽  
J. Scott VanEpps

Abstract: With an estimated 440,000 active cases occurring each year, medical device associated infections pose a significant burden on the US healthcare system, costing about $9.8 billion in 2013. Staphylococcus epidermidis is the most common cause of these device-associated infections, which typically involve isolates that are multi-drug resistant and possess multiple virulence factors. S. epidermidis is also frequently a benign contaminant of otherwise sterile blood cultures. Therefore, tests that distinguish pathogenic from non-pathogenic isolates would improve the accuracy of diagnosis and prevent overuse/misuse of antibiotics. Attempts to use multi-locus sequence typing (MLST) with machine learning for this purpose had poor accuracy (~73%). In this study we sought to improve the diagnostic accuracy of predicting pathogenicity by focusing on phenotypic markers (i.e., antibiotic resistance, growth fitness in human plasma, and biofilm forming capacity) and the presence of specific virulence genes (i.e., mecA, ses1, and sdrF). Commensal isolates from healthy individuals (n=23), blood culture contaminants (n=21), and pathogenic isolates considered true bacteremia (n=54) were used. Multiple machine learning approaches were applied to characterize strains as pathogenic vs non-pathogenic. The combination of phenotypic markers and virulence genes improved the diagnostic accuracy to 82.4% (sensitivity: 84.9% and specificity: 80.9%). Oxacillin resistance was the most important variable followed by growth rate in plasma. This work shows promise for the addition of phenotypic testing in clinical diagnostic applications.
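The reported metrics relate to confusion-matrix counts in a standard way. The counts below are hypothetical, chosen only to roughly match the abstract's figures for 54 pathogenic and 44 non-pathogenic isolates; the study's actual confusion matrix is not given here.

```python
def metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 46/54 pathogenic and 36/44 non-pathogenic called correctly
m = metrics(tp=46, fn=8, tn=36, fp=8)
print(m)  # sensitivity ≈ 0.852, specificity ≈ 0.818, accuracy ≈ 0.837
```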


2021 ◽  
Vol 126 (6) ◽  
pp. 477-491
Author(s):  
Michael D. Broda ◽  
Matthew Bogenschutz ◽  
Parthenia Dinora ◽  
Seb M. Prohn ◽  
Sarah Lineberry ◽  
...  

Abstract In this article, we demonstrate the potential of machine learning approaches as inductive analytic tools for expanding our current evidence base for policy making and practice that affects people with intellectual and developmental disabilities (IDD). Using data from the National Core Indicators In-Person Survey (NCI-IPS), a nationally validated annual survey of more than 20,000 nationally representative people with IDD, we fit a series of classification tree and random forest models to predict individuals' employment status and day activity participation as a function of their responses to all other items on the 2017–2018 NCI-IPS. The most accurate model, a random forest classifier, predicted employment outcomes of adults with IDD with an accuracy of 89 percent on the testing sample and 80 percent on the holdout sample. The most important variable in this prediction was whether or not community employment was a goal in this person's service plan. These results suggest the potential of machine learning tools to examine other valued outcomes used in evidence-based policy making to support people with IDD.
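The variable-importance finding can be illustrated with a random forest classifier on synthetic survey-style data (the feature names, sample size, and data-generating rule are invented for illustration; this is not the NCI-IPS data or the authors' model). When the outcome is driven mainly by one binary feature, that feature dominates the forest's importance scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500
employment_goal = rng.integers(0, 2, n)   # employment goal in service plan (0/1)
age = rng.integers(18, 65, n)
other = rng.normal(size=n)
# Toy rule: employment depends almost entirely on the goal variable
employed = ((employment_goal == 1) & (rng.random(n) < 0.85)).astype(int)

X = np.column_stack([employment_goal, age, other])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, employed)
for name, imp in zip(["employment_goal", "age", "other"], clf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

Here `employment_goal` receives the highest importance score, mirroring in miniature how the forest surfaced the service-plan goal as the top predictor.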


2021 ◽  
Vol 5 (1) ◽  
pp. 36
Author(s):  
Gian Marco Paldino ◽  
Jacopo De Stefani ◽  
Fabrizio De Caro ◽  
Gianluca Bontempi

The availability of massive amounts of temporal data opens new perspectives of knowledge extraction and automated decision making for companies and practitioners. However, learning forecasting models from data requires a knowledgeable data science or machine learning (ML) background and expertise, which is not always available to end-users. This gap fosters a growing demand for frameworks that automate the ML pipeline and ensure broader access for the general public. Automatic machine learning (AutoML) provides solutions to build and validate machine learning pipelines while minimizing user intervention. Most of those pipelines have been validated in static supervised learning settings, while an extensive validation in time series prediction is still missing. This issue is particularly important in the forecasting community, where the relevance of machine learning approaches is still under debate. This paper assesses four existing AutoML frameworks (AutoGluon, H2O, TPOT, Auto-sklearn) on a number of forecasting challenges (univariate and multivariate, single-step and multi-step ahead) by benchmarking them against simple and conventional forecasting strategies (e.g., naive and exponential smoothing). The obtained results highlight that AutoML approaches are not yet mature enough to address generic forecasting tasks when compared with faster yet more basic statistical forecasters. In particular, the tested AutoML configurations, on average, do not significantly outperform a naive estimator. These results, though preliminary, should not be interpreted as a rejection of AutoML solutions in forecasting but as an encouragement to a more rigorous validation of their limits and perspectives.
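The baseline the paper benchmarks against is simple enough to sketch directly (toy series and horizon; error metric chosen for illustration): a naive forecaster repeats the last observed value for every step ahead, and any candidate model should beat its error on held-out data.

```python
def naive_forecast(history, horizon):
    """Multi-step naive forecast: repeat the last observed value."""
    return [history[-1]] * horizon

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

series = [10, 12, 11, 13, 12, 14, 13, 15]
train, test = series[:6], series[6:]
pred = naive_forecast(train, horizon=len(test))
print(pred, mae(test, pred))  # [14, 14] 1.0
```

A forecaster whose test MAE is not clearly below the naive model's adds cost without accuracy, which is the comparison underlying the paper's conclusion.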


2020 ◽  
Author(s):  
Yihuan Huang ◽  
Amanda Kay Montoya

Machine learning methods are being increasingly adopted in psychological research. Lasso performs variable selection and regularization, and is particularly appealing to psychology researchers because of its connection to linear regression. Researchers conflate properties of linear regression with properties of lasso; however, we demonstrate that this is not warranted for models with categorical predictors. Specifically, the coding strategy used for categorical predictors impacts lasso's performance but not linear regression's. Group lasso is an alternative to lasso for models with categorical predictors. We demonstrate the inconsistency of lasso and group lasso models using a real data set: lasso performs different variable selection and has different prediction accuracy depending on the coding strategy, and group lasso performs consistent variable selection but has different prediction accuracy. Additionally, group lasso may include many predictors when very few are needed, leading to overfitting. Using Monte Carlo simulation, we show that categorical variables with one group mean differing from all others (one dominant group) are more likely to be included in the model by group lasso than lasso, leading to overfitting. This effect is strongest when the mean difference is large and there are many categories. Researchers primarily focus on the similarity between linear regression and lasso, but pay little attention to their different properties. This project demonstrates that when using lasso and group lasso, the effect of coding strategies should be considered. We conclude with recommended solutions to this issue and future directions of exploration to improve implementation of machine learning approaches in psychological science.
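The coding-strategy effect can be shown on a toy example (synthetic data; the group means, penalty strength, and use of scikit-learn's plain `Lasso` are all choices made for illustration, not the paper's analysis): changing which category serves as the reference level changes the coefficients lasso estimates and shrinks, whereas ordinary least squares would recover the same fitted group means under either coding.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
groups = np.repeat([0, 1, 2], 30)          # one categorical predictor, 3 levels
y = np.array([0.0, 0.2, 2.0])[groups] + rng.normal(0, 0.5, 90)

def dummies(g, reference):
    """Dummy-code g, leaving out the chosen reference level."""
    return np.column_stack(
        [(g == l).astype(float) for l in (0, 1, 2) if l != reference]
    )

coefs = {}
for ref in (0, 2):                         # two coding strategies, same data
    coefs[ref] = Lasso(alpha=0.1).fit(dummies(groups, ref), y).coef_
    print(f"reference level {ref}: coefficients = {coefs[ref].round(2)}")
```

With level 0 as the reference, the small level-1 contrast is shrunk toward (or to) zero while the level-2 contrast survives; with level 2 as the reference, both contrasts are large and survive, so the apparent variable selection differs between codings.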


PLoS ONE ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. e0241457
Author(s):  
Shannon M. VanAken ◽  
Duane Newton ◽  
J. Scott VanEpps

With an estimated 440,000 active cases occurring each year, medical device associated infections pose a significant burden on the US healthcare system, costing about $9.8 billion in 2013. Staphylococcus epidermidis is the most common cause of these device-associated infections, which typically involve isolates that are multi-drug resistant and possess multiple virulence factors. S. epidermidis is also frequently a benign contaminant of otherwise sterile blood cultures. Therefore, tests that distinguish pathogenic from non-pathogenic isolates would improve the accuracy of diagnosis and prevent overuse/misuse of antibiotics. Attempts to use multi-locus sequence typing (MLST) with machine learning for this purpose had poor accuracy (~73%). In this study we sought to improve the diagnostic accuracy of predicting pathogenicity by focusing on phenotypic markers (i.e., antibiotic resistance, growth fitness in human plasma, and biofilm forming capacity) and the presence of specific virulence genes (i.e., mecA, ses1, and sdrF). Commensal isolates from healthy individuals (n = 23), blood culture contaminants (n = 21), and pathogenic isolates considered true bacteremia (n = 54) were used. Multiple machine learning approaches were applied to characterize strains as pathogenic vs non-pathogenic. The combination of phenotypic markers and virulence genes improved the diagnostic accuracy to 82.4% (sensitivity: 84.9% and specificity: 80.9%). Oxacillin resistance was the most important variable followed by growth rate in plasma. This work shows promise for the addition of phenotypic testing in clinical diagnostic applications.


2021 ◽  
Vol 12 ◽  
Author(s):  
Isabel Moreno-Indias ◽  
Leo Lahti ◽  
Miroslava Nedyalkova ◽  
Ilze Elbere ◽  
Gennady Roshchupkin ◽  
...  

The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. Many of the challenges in microbiome research are similar to those in other high-throughput studies: the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Nevertheless, although many statistics and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 “ML4Microbiome” that brings together microbiome researchers and machine learning experts to address current challenges such as standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, improvement, or development of existing and new tools and ontologies.

