Machine Learning for Materials Scientists: An Introductory Guide Towards Best Practices

<div>This Editorial is intended for materials scientists interested in performing machine learning-centered research.</div><div><br></div><div>We cover broad guidelines and best practices regarding the obtaining and treatment of data, feature engineering, model training, validation, evaluation and comparison, popular repositories for materials data and benchmarking datasets, model and architecture sharing, and finally publication.</div><div>In addition, we include interactive Jupyter notebooks with example Python code to demonstrate some of the concepts, workflows, and best practices discussed.</div><div><br></div><div>Overall, the data-driven methods and machine learning workflows and considerations are presented in a simple way, allowing interested readers to more intelligently guide their machine learning research using the suggested references, best practices, and their own materials domain expertise.</div>

Download Full-text

Practices and Infrastructures for ML Systems – An Interview Study

10.36227/techrxiv.16939192.v1 ◽

2021 ◽

Author(s):

Dennis Muiruri ◽

Lucy Ellen Lwakatare ◽

Jukka K. Nurminen ◽

Tommi Mikkonen

Keyword(s):

Machine Learning ◽

Best Practices ◽

Data Management ◽

Management Practices ◽

Interview Study ◽

The State ◽

Data Driven ◽

Software Systems ◽

State Of Practice ◽

Model Training

<div> <div> <div> <p>The best practices and infrastructures for developing and maintaining machine learning (ML) enabled software systems are often reported by large and experienced data-driven organizations. However, little is known about the state of practice across other organizations. Using interviews, we investigated practices and tool-chains for ML-enabled systems from 16 organizations in various domains. Our study makes three broad observations related to data management practices, monitoring practices and automation practices in ML model training, and serving workflows. These have limited number of generic practices and tools applicable across organizations in different domains. </p> </div> </div> </div>

Download Full-text

Machine Learning for Detecting Data Exfiltration

ACM Computing Surveys ◽

10.1145/3442181 ◽

2021 ◽

Vol 54 (3) ◽

pp. 1-47

Author(s):

Bushra Sabir ◽

Faheem Ullah ◽

M. Ali Babar ◽

Raj Gaire

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Data Driven ◽

Feature Engineering ◽

Body Of Knowledge ◽

Large Size ◽

Model Training ◽

And Performance ◽

Size Evaluation ◽

And Behavior

Context : Research at the intersection of cybersecurity, Machine Learning (ML), and Software Engineering (SE) has recently taken significant steps in proposing countermeasures for detecting sophisticated data exfiltration attacks. It is important to systematically review and synthesize the ML-based data exfiltration countermeasures for building a body of knowledge on this important topic. Objective : This article aims at systematically reviewing ML-based data exfiltration countermeasures to identify and classify ML approaches, feature engineering techniques, evaluation datasets, and performance metrics used for these countermeasures. This review also aims at identifying gaps in research on ML-based data exfiltration countermeasures. Method : We used Systematic Literature Review (SLR) method to select and review 92 papers. Results : The review has enabled us to: (a) classify the ML approaches used in the countermeasures into data-driven, and behavior-driven approaches; (b) categorize features into six types: behavioral, content-based, statistical, syntactical, spatial, and temporal; (c) classify the evaluation datasets into simulated, synthesized, and real datasets; and (d) identify 11 performance measures used by these studies. Conclusion : We conclude that: (i) The integration of data-driven and behavior-driven approaches should be explored; (ii) There is a need of developing high quality and large size evaluation datasets; (iii) Incremental ML model training should be incorporated in countermeasures; (iv) Resilience to adversarial learning should be considered and explored during the development of countermeasures to avoid poisoning attacks; and (v) The use of automated feature engineering should be encouraged for efficiently detecting data exfiltration attacks.

Download Full-text

Practices and Infrastructures for ML Systems – An Interview Study

10.36227/techrxiv.16939192 ◽

2021 ◽

Author(s):

Dennis Muiruri ◽

Lucy Ellen Lwakatare ◽

Jukka K. Nurminen ◽

Tommi Mikkonen

Keyword(s):

Machine Learning ◽

Best Practices ◽

Data Management ◽

Management Practices ◽

Interview Study ◽

The State ◽

Data Driven ◽

Software Systems ◽

State Of Practice ◽

Model Training

<div> <div> <div> <p>The best practices and infrastructures for developing and maintaining machine learning (ML) enabled software systems are often reported by large and experienced data-driven organizations. However, little is known about the state of practice across other organizations. Using interviews, we investigated practices and tool-chains for ML-enabled systems from 16 organizations in various domains. Our study makes three broad observations related to data management practices, monitoring practices and automation practices in ML model training, and serving workflows. These have limited number of generic practices and tools applicable across organizations in different domains. </p> </div> </div> </div>

Download Full-text

Dealer

Proceedings of the VLDB Endowment ◽

10.14778/3447689.3447700 ◽

2021 ◽

Vol 14 (6) ◽

pp. 957-969

Author(s):

Jinfei Liu ◽

Jian Lou ◽

Junxu Liu ◽

Li Xiong ◽

Jian Pei ◽

...

Keyword(s):

Machine Learning ◽

Differential Privacy ◽

Optimization Problems ◽

Data Driven ◽

Noise Sensitivity ◽

Efficiency And Effectiveness ◽

Machine Learning Applications ◽

Model Training ◽

Price Functions ◽

Synthetic Datasets

Data-driven machine learning has become ubiquitous. A marketplace for machine learning models connects data owners and model buyers, and can dramatically facilitate data-driven machine learning applications. In this paper, we take a formal data marketplace perspective and propose the first en<u> D </u>-to-end mod <u>e</u> l m <u>a</u> rketp <u>l</u> ace with diff <u>e</u> rential p <u>r</u> ivacy ( Dealer ) towards answering the following questions: How to formulate data owners' compensation functions and model buyers' price functions? How can the broker determine prices for a set of models to maximize the revenue with arbitrage-free guarantee, and train a set of models with maximum Shapley coverage given a manufacturing budget to remain competitive ? For the former, we propose compensation function for each data owner based on Shapley value and privacy sensitivity, and price function for each model buyer based on Shapley coverage sensitivity and noise sensitivity. Both privacy sensitivity and noise sensitivity are measured by the level of differential privacy. For the latter, we formulate two optimization problems for model pricing and model training, and propose efficient dynamic programming algorithms. Experiment results on the real chess dataset and synthetic datasets justify the design of Dealer and verify the efficiency and effectiveness of the proposed algorithms.

Download Full-text

“Garbage In, Garbage Out” Revisited: What Do Machine Learning Application Papers Report About Human-Labeled Training Data?

Quantitative Science Studies ◽

10.1162/qss_a_00144 ◽

2021 ◽

pp. 1-32

Author(s):

R. Stuart Geiger ◽

Dominique Cope ◽

Jamie Ip ◽

Marsha Lotosh ◽

Aayush Shah ◽

...

Keyword(s):

Machine Learning ◽

Best Practices ◽

Ground Truth ◽

Training Data ◽

Supervised Machine Learning ◽

Social Media Platforms ◽

Learning Research ◽

Research And Education ◽

Application Fields

Abstract Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data. This study builds on prior work that investigated to what extent ‘best practices’ around labeling training data were followed in applied ML publications within a single domain (social media platforms). In this paper, we expand by studying publications that apply supervised ML in a far broader spectrum of disciplines, focusing on human-labeled data. We report to what extent a random sample of ML application papers across disciplines give specific details about whether best practices were followed, while acknowledging that a greater range of application fields necessarily produces greater diversity of labeling and annotation methods. Because much of machine learning research and education only focuses on what is done once a “ground truth” or “gold standard” of training data is available, it is especially relevant to discuss issues around the equally-important aspect of whether such data is reliable in the first place. This determination becomes increasingly complex when applied to a variety of specialized fields, as labeling can range from a task requiring little-to-no background knowledge to one that must be performed by someone with career expertise. Peer Review https://publons.com/publon/10.1162/qss_a_00144

Download Full-text

Predicting outcomes in older ED patients with influenza in real time using a big data-driven and machine learning approach to the hospital information system

BMC Geriatrics ◽

10.1186/s12877-021-02229-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Tian-Hoe Tan ◽

Chien-Chin Hsu ◽

Chia-Jung Chen ◽

Shu-Lien Hsu ◽

Tzu-Lan Liu ◽

...

Keyword(s):

Machine Learning ◽

Information System ◽

Real Time ◽

Hospital Information System ◽

Clinical Utility ◽

Data Driven ◽

Health Records ◽

Machine Learning Approach ◽

Testing Data ◽

Model Training

Abstract Background Predicting outcomes in older patients with influenza in the emergency department (ED) by machine learning (ML) has never been implemented. Therefore, we conducted this study to clarify the clinical utility of implementing ML. Methods We recruited 5508 older ED patients (≥65 years old) in three hospitals between 2009 and 2018. Patients were randomized into a 70%/30% split for model training and testing. Using 10 clinical variables from their electronic health records, a prediction model using the synthetic minority oversampling technique preprocessing algorithm was constructed to predict five outcomes. Results The best areas under the curves of predicting outcomes were: random forest model for hospitalization (0.840), pneumonia (0.765), and sepsis or septic shock (0.857), XGBoost for intensive care unit admission (0.902), and logistic regression for in-hospital mortality (0.889) in the testing data. The predictive model was further applied in the hospital information system to assist physicians’ decisions in real time. Conclusions ML is a promising way to assist physicians in predicting outcomes in older ED patients with influenza in real time. Evaluations of the effectiveness and impact are needed in the future.

Download Full-text

Using unsupervised machine learning to model tax practice learning theory

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.4.13019 ◽

2018 ◽

Vol 7 (2.4) ◽

pp. 109 ◽

Cited By ~ 1

Author(s):

Alfred Howard Miller

Keyword(s):

Machine Learning ◽

Factor Model ◽

Bachelor's Degree ◽

Data Driven ◽

Unsupervised Machine Learning ◽

Learning Framework ◽

Reporting Requirements ◽

Data Driven Approach ◽

Learning Research ◽

Tax Practice

The aim of this study was to utilize unsupervised machine learning framework to explore a dataset comprised of assessed output by Bachelors of Business, Taxation learners over four successive semesters. The researcher sought to motivate deployment of an evidence-supported, data-driven approach to understand the scope of student learning from a bachelor’s degree in business class taxation class, as a tool for accreditation reporting purposes. Outcomes from the data analysis identified four factors; two related to tax and two related to learning. These factors are, tax theory, and tax practice, along with practical learning and theoretical learning. Research motivated a grounded theory paradigm that explained taxation class learner’s scope of acquired knowledge. The resulting four factor model is a result of the study. The emergent paradigm further explains accounting student’s readiness for career success upon graduation and provides a novel way to meet outcomes reporting requirements mandated by programmatic business accreditors such as required by the Accreditation Council for Business Schools and Programs (ACBSP).

Download Full-text

AI-Ready Training Datasets for Earth Observation: Enabling FAIR data principles for EO training data.

10.5194/egusphere-egu21-12384 ◽

2021 ◽

Author(s):

Alastair McKinstry ◽

Oisin Boydell ◽

Quan Le ◽

Inder Preet ◽

Jennifer Hanafin ◽

...

Keyword(s):

Machine Learning ◽

Best Practices ◽

Forest Biomass ◽

Earth Observation ◽

Training Data ◽

Training Dataset ◽

Data Provenance ◽

Data Sets ◽

Model Training ◽

Ice Detection

<p>The ESA-funded AIREO project [1] sets out to produce AI-ready training dataset specifications and best practices to support the training and development of machine learning models on Earth Observation (EO) data. While the quality and quantity of EO data has increased drastically over the past decades, availability of training data for machine learning applications is considered a major bottleneck. The goal is to move towards implementing FAIR data principles for training data in EO, enhancing especially the finability, interoperability and reusability aspects.&#160; To achieve this goal, AIREO sets out to provide a training data specification and to develop best practices for the use of training datasets in EO. An additional goal is to make training data sets self-explanatory (&#8220;AI-ready) in order to expose challenging problems to a wider audience that does not have expert geospatial knowledge.&#160;</p><p>Key elements that are addressed in the AIREO specification are granular and interoperable metadata (based on STAC), innovative Quality Assurance metrics, data provenance and processing history as well as integrated feature engineering recipes that optimize platform independence. Several initial pilot datasets are being developed following the AIREO data specifications. These pilot applications include for example&#160; forest biomass, sea ice detection and the estimation of atmospheric parameters.An API for the easy exploitation of these datasets will be provided.to allow the Training Datasets (TDS) to work against EO catalogs (based on OGC STAC catalogs and best practises from ML community) to allow updating and updated model training over time.</p><p>&#160;</p><p>This presentation will present the first version of the AIREO training dataset specification and will showcase some elements of the best-practices that were developed. The AIREO compliant pilot datasets will be presented which are openly accessible and community feedback is explicitly encouraged.&#160;</p><p><br><br>[1] https://aireo.net/</p>

Download Full-text

WaterBench: A Large-scale Benchmark Dataset for Data-Driven Streamflow Forecasting

10.31223/x57w6t ◽

2021 ◽

Author(s):

Ibrahim Demir ◽

Zhongrun Xiang ◽

Bekir Zahit Demiray ◽

Muhammed Sit

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Large Scale ◽

Short Term Memory ◽

Earth Science ◽

Science Research ◽

Benchmark Dataset ◽

Data Driven ◽

Streamflow Forecasting ◽

Learning Research

This study proposes a comprehensive benchmark dataset for streamflow forecasting, WaterBench, that follows FAIR data principles that is prepared with a focus on convenience for utilizing in data-driven and machine learning studies, and provides benchmark performance for state-of-art deep learning architectures on the dataset for comparative analysis. By aggregating the datasets of streamflow, precipitation, watershed area, slope, soil types, and evapotranspiration from federal agencies and state organizations (i.e., NASA, NOAA, USGS, and Iowa Flood Center), we provided the WaterBench for hourly streamflow forecast studies. This dataset has a high temporal and spatial resolution with rich metadata and relational information, which can be used for varieties of deep learning and machine learning research. We defined a sample streamflow forecasting task for the next 120 hours and provided performance benchmarks on this task with sample linear regression and deep learning models, including Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and S2S (Sequence-to-sequence). To some extent, WaterBench makes up for the lack of a unified benchmark in earth science research. We highly encourage researchers to use the WaterBench for deep learning research in hydrology.

Download Full-text