High performance machine learning models can fully automate labeling of camera trap images for ecological analyses

AbstractEcological data are increasingly collected over vast geographic areas using arrays of digital sensors. Camera trap arrays have become the ‘gold standard’ method for surveying many terrestrial mammals and birds, but these arrays often generate millions of images that are challenging to process. This causes significant latency between data collection and subsequent inference, which can impede conservation at a time of ecological crisis. Machine learning algorithms have been developed to improve camera trap data processing speeds, but these models are not considered accurate enough for fully automated labeling of images.Here, we present a new approach to building and testing a high performance machine learning model for fully automated labeling of camera trap images. As a case-study, the model classifies 26 Central African forest mammal and bird species (or groups). The model was trained on a relatively small dataset (c.300,000 images) but generalizes to fully independent data and outperforms humans in several respects (e.g. detecting ‘invisible’ animals). We show how the model’s precision and accuracy can be evaluated in an ecological modeling context by comparing species richness, activity patterns (n = 4 species tested) and occupancy (n = 4 species tested) derived from machine learning labels with the same estimates derived from expert labels.Results show that fully automated labels can be equivalent to expert labels when calculating species richness, activity patterns (n = 4 species tested) and estimating occupancy (n = 3 of 4 species tested) in completely out-of-sample test data (n = 227 camera stations, n = 23868 images). Simple thresholding (discarding uncertain labels) improved the model’s performance when calculating activity patterns and estimating occupancy, but did not improve estimates of species richness.We provide the user-community with a multi-platform, multi-language user interface for running the model offline, and conclude that high performance machine learning models can fully automate labeling of camera trap data.

Download Full-text

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.

Download Full-text

Machine learning with biomedical ontologies

10.1101/2020.05.07.082164 ◽

2020 ◽

Cited By ~ 3

Author(s):

Maxat Kulmanov ◽

Fatima Zohra Smaili ◽

Xin Gao ◽

Robert Hoehndorf

Keyword(s):

Machine Learning ◽

Research Group ◽

Domain Knowledge ◽

Background Knowledge ◽

Search Space ◽

Biological Database ◽

List Type ◽

Biomedical Ontologies ◽

Learning Models ◽

Machine Learning Models

Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in biomedical ontologies, and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.Key pointsOntologies provide background knowledge that can be exploited in machine learning models.Ontology embeddings are structure-preserving maps from ontologies into vector spaces and provide an important method for utilizing ontologies in machine learning. Embeddings can preserve different structures in ontologies, including their graph structures, syntactic regularities, or their model-theoretic semantics.Axioms in ontologies, in particular those involving negation, can be used as constraints in optimization and machine learning to reduce the search space.

Download Full-text

Building High Performance Explainable Machine Learning Models for Social Media-based Substance Use Prediction

International Journal of Artificial Intelligence Tools ◽

10.1142/s021821302060009x ◽

2020 ◽

Vol 29 (03n04) ◽

pp. 2060009

Author(s):

Tao Ding ◽

Fatema Hasan ◽

Warren K. Bickel ◽

Shimei Pan

Keyword(s):

Machine Learning ◽

Social Media ◽

Substance Use ◽

Human Behavior ◽

High Performance ◽

Supervised Machine Learning ◽

Learning Models ◽

Wide Range ◽

And Behavior ◽

Machine Learning Models

Social media contain rich information that can be used to help understand human mind and behavior. Social media data, however, are mostly unstructured (e.g., text and image) and a large number of features may be needed to represent them (e.g., we may need millions of unigrams to represent social media texts). Moreover, accurately assessing human behavior is often difficult (e.g., assessing addiction may require medical diagnosis). As a result, the ground truth data needed to train a supervised human behavior model are often difficult to obtain at a large scale. To avoid overfitting, many state-of-the-art behavior models employ sophisticated unsupervised or self-supervised machine learning methods to leverage a large amount of unsupervised data for both feature learning and dimension reduction. Unfortunately, despite their high performance, these advanced machine learning models often rely on latent features that are hard to explain. Since understanding the knowledge captured in these models is important to behavior scientists and public health providers, we explore new methods to build machine learning models that are not only accurate but also interpretable. We evaluate the effectiveness of the proposed methods in predicting Substance Use Disorders (SUD). We believe the methods we proposed are general and applicable to a wide range of data-driven human trait and behavior analysis applications.

Download Full-text

High Performance Machine Learning Models for Functional Verification of Hardware Designs

10.1109/niles53778.2021.9600502 ◽

2021 ◽

Author(s):

Khaled A. Ismail ◽

Mohamed A. Abd El Ghany

Keyword(s):

Machine Learning ◽

High Performance ◽

Functional Verification ◽

Learning Models ◽

Hardware Designs ◽

Machine Learning Models

Download Full-text

Emergency medicine patient wait time multivariable prediction models: a multicentre derivation and validation study

10.1101/2021.03.19.21253921 ◽

2021 ◽

Author(s):

Katie Walker ◽

Jirayus Jiarpakdee ◽

Anne Loupis ◽

Chakkrit Tantithamthavorn ◽

Keith Joe ◽

...

Keyword(s):

Machine Learning ◽

Emergency Medicine ◽

Prediction Models ◽

Wait Time ◽

Wait Times ◽

List Type ◽

Learning Models ◽

Emergency Patient ◽

Individual Site ◽

Machine Learning Models

AbstractObjectivePatients, families and community members would like emergency department wait time visibility. This would improve patient journeys through emergency medicine. The study objective was to derive, internally and externally validate machine learning models to predict emergency patient wait times that are applicable to a wide variety of emergency departments.MethodsTwelve emergency departments provided three years of retrospective administrative data from Australia (2017-19). Descriptive and exploratory analyses were undertaken on the datasets. Statistical and machine learning models were developed to predict wait times at each site and were internally and externally validated. Model performance was tested on COVID-19 period data (January to June 2020).ResultsThere were 1,930,609 patient episodes analysed and median site wait times varied from 24 to 54 minutes. Individual site model prediction median absolute errors varied from +/−22.6 minutes (95%CI 22.4,22.9) to +/− 44.0 minutes (95%CI 43.4,44.4). Global model prediction median absolute errors varied from +/−33.9 minutes (95%CI 33.4, 34.0) to +/−43.8 minutes (95%CI 43.7, 43.9). Random forest and linear regression models performed the best, rolling average models under-estimated wait times. Important variables were triage category, last-k patient average wait time, and arrival time. Wait time prediction models are not transferable across hospitals. Models performed well during the COVID-19 lockdown period.ConclusionsElectronic emergency demographic and flow information can be used to approximate emergency patient wait times. A general model is less accurate if applied without site specific factors.What is already known on this subject⍰Patients and families want to know approximate emergency wait times, which will improve their ability to manage their logistical, physical and emotional needs whilst waiting⍰There are a few small studies from a limited number of jurisdictions, reporting model methods, important predictor variables and accuracy of derived modelsWhat this study adds⍰Our study demonstrates that predicting wait times from simple, readily available data is complex and provides estimates that aren’t as accurate as patients would like, however rough estimates may still be better than no information⍰We present the most influential variables regarding wait times and advise against using rolling average models, preferring random forest or linear regression techniques⍰Emergency medicine machine learning models may be less generalisable to other sites than we hope for when we read manuscripts or buy commercial off-the-shelf models or algorithms. Models developed for one site lose accuracy at another site and global models built for whole systems may need customisation to each individual site. This may apply to data science clinical decision instruments as well as operational machine learning models.

Download Full-text

Predicting Mechanical Properties of High-Performance Fiber-Reinforced Cementitious Composites by Integrating Micromechanics and Machine Learning

Materials ◽

10.3390/ma14123143 ◽

2021 ◽

Vol 14 (12) ◽

pp. 3143

Author(s):

Pengwei Guo ◽

Weina Meng ◽

Mingfeng Xu ◽

Victor C. Li ◽

Yi Bao

Keyword(s):

Machine Learning ◽

Mechanical Properties ◽

High Performance ◽

Mix Design ◽

Cementitious Composites ◽

Learning Models ◽

Design Variables ◽

High Performance Fiber ◽

Fiber Reinforced Cementitious Composites ◽

Machine Learning Models

Current development of high-performance fiber-reinforced cementitious composites (HPFRCC) mainly relies on intensive experiments. The main purpose of this study is to develop a machine learning method for effective and efficient discovery and development of HPFRCC. Specifically, this research develops machine learning models to predict the mechanical properties of HPFRCC through innovative incorporation of micromechanics, aiming to increase the prediction accuracy and generalization performance by enriching and improving the datasets through data cleaning, principal component analysis (PCA), and K-fold cross-validation. This study considers a total of 14 different mix design variables and predicts the ductility of HPFRCC for the first time, in addition to the compressive and tensile strengths. Different types of machine learning methods are investigated and compared, including artificial neural network (ANN), support vector regression (SVR), classification and regression tree (CART), and extreme gradient boosting tree (XGBoost). The results show that the developed machine learning models can reasonably predict the concerned mechanical properties and can be applied to perform parametric studies for the effects of different mix design variables on the mechanical properties. This study is expected to greatly promote efficient discovery and development of HPFRCC.

Download Full-text

Early prediction of high risk gestational diabetes mellitus via machine learning models

10.1101/2020.03.26.20040196 ◽

2020 ◽

Author(s):

Yan-Ting Wu ◽

Chen-Jie Zhang ◽

Ben Willem Mol ◽

Cheng Li ◽

Lei Chen ◽

...

Keyword(s):

Diabetes Mellitus ◽

Machine Learning ◽

Gestational Diabetes ◽

Prediction Model ◽

Early Pregnancy ◽

Prediction Models ◽

List Type ◽

Discriminative Power ◽

Learning Models ◽

Machine Learning Models

AbstractAimsGestational diabetes mellitus (GDM) is a pregnancy-specific disorder that can usually be diagnosed after 24 gestational weeks. So far, there is no accurate method to predict GDM in early pregnancy.MethodsWe collected data extracted from the hospital’s electronic medical record system included 73 features in the first trimester. We also recorded the occurrence of GDM, diagnosed at 24-28 weeks of pregnancy. We conducted a feature selection method to select a panel of most discriminative features. We then developed advanced machine learning models, using Deep Neural Network (DNN), Support Vector Machine (SVM), K-Nearest Neighboring (KNN), and Logistic Regression (LR), based on these features.ResultsWe studied 16,819 women (2,696 GDM) and 14,992 women (1,837 GDM) for the training and validation group. DNN, SVM, KNN, and LR models based on the 73-feature set demonstrated the best discriminative power with corresponding area under the curve (AUC) values of 0.92 (95%CI 0.91, 0.93), 0.82 (95%CI 0.81, 0.83), 0.63 (95%CI 0.62, 0.64), and 0.85 (95%CI 0.84, 0.85), respectively. The 7-feature (selected from the 73-feature set) DNN, SVM, KNN, and LR models had the best discriminative power with corresponding AUCs of 0.84 (95%CI 0.83, 0.84), 0.69 (95%CI 0.68, 0.70), 0.68 (95%CI 0.67, 0.69), and 0.84 (95% CI 0.83, 0.85), respectively. The 7-feature LR model had the best Hosmer-Lemeshow test outcome. Notably, the AUCs of the existing prediction models did not exceed 0.75.ConclusionsOur feature selection and machine learning models showed superior predictive power in early GDM detection than previous methods; these improved models will better serve clinical practices in preventing GDM.Research in Context sectionEvidence before this studyA hysteretic diagnosis of GDM in the 3rd trimester is too late to prevent exposure of the embryos or fetuses to an intrauterine hyperglycemia environment during early pregnancy.Prediction models for gestational diabetes are not uncommon in previous literature reports, but laboratory indicators are rarely involved in predictive indicators.The penetration of AI into the medical field makes us want to introduce it into GDM predictive models.What is the key question?Whether the GDM prediction model established by machine learning has the ability to surpass the traditional LR model?Added value of this studyUsing machine learning to select features is an effective method.DNN prediction model have effective discrimination power for predicting GDM in early pregnancy, but it cannot completely replace LR. KNN and SVM are even worse than LR in this study.Implications of all the available evidenceThe biggest significance of our research is not only to build a prediction model that surpasses previous ones, but also to demonstrate the advantages and disadvantages of different machine learning methods through a practical case.

Download Full-text

Analyzing Transaction Confirmation in Ethereum Using Machine Learning Techniques

ACM SIGMETRICS Performance Evaluation Review ◽

10.1145/3466826.3466832 ◽

2021 ◽

Vol 48 (4) ◽

pp. 12-15

Author(s):

Vinicius C. Oliveira ◽

Julia Almeida Valadares ◽

Jose Eduardo A. Sousa ◽

Alex Borges Vieira ◽

Heder Soares Bernardino ◽

...

Keyword(s):

Machine Learning ◽

Roc Curve ◽

High Performance ◽

Machine Learning Techniques ◽

Relevant Feature ◽

Learning Models ◽

Performance Models ◽

Learning Techniques ◽

Machine Learning Models

Ethereum has emerged as one of the most important cryptocurrencies in terms of the number of transactions. Given the recent growth of Ethereum, the cryptocurrency community and researchers are interested in understanding the Ethereum transactions behavior. In this work, we investigate a key aspect of Ethereum: the prediction of a transaction confirmation or failure based on its features. This is a challenging issue due to the small, but still relevant, fraction of failures in millions of recorded transactions and the complexity of the distributed mechanism to execute transactions in Ethereum. To conduct this investigation, we train machine learning models for this prediction, taking into consideration carefully balanced sets of confirmed and failed transactions. The results show high-performance models for classification of transactions with the best values of F1-score and area under the ROC curve approximately equal to 0.67 and 0.87, respectively. Also, we identified the gas used as the most relevant feature for the prediction.

Download Full-text

CYPstrate: A Set of Machine Learning Models for the Accurate Classification of Cytochrome P450 Enzyme Substrates and Non-Substrates

Molecules ◽

10.3390/molecules26154678 ◽

2021 ◽

Vol 26 (15) ◽

pp. 4678

Author(s):

Malte Holmer ◽

Christina de Bruyn Kops ◽

Conrad Stork ◽

Johannes Kirchmair

Keyword(s):

Machine Learning ◽

Cytochrome P450 ◽

High Performance ◽

Organic Molecules ◽

Cytochrome P450 Enzyme ◽

Learning Models ◽

Cytochrome P450 Enzymes ◽

Small Organic Molecules ◽

Machine Learning Models

The interaction of small organic molecules such as drugs, agrochemicals, and cosmetics with cytochrome P450 enzymes (CYPs) can lead to substantial changes in the bioavailability of active substances and hence consequences with respect to pharmacological efficacy and toxicity. Therefore, efficient means of predicting the interactions of small organic molecules with CYPs are of high importance to a host of different industries. In this work, we present a new set of machine learning models for the classification of xenobiotics into substrates and non-substrates of nine human CYP isozymes: CYPs 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, and 3A4. The models are trained on an extended, high-quality collection of known substrates and non-substrates and have been subjected to thorough validation. Our results show that the models yield competitive performance and are favorable for the detection of CYP substrates. In particular, a new consensus model reached high performance, with Matthews correlation coefficients (MCCs) between 0.45 (CYP2C8) and 0.85 (CYP3A4), although at the cost of coverage. The best models presented in this work are accessible free of charge via the “CYPstrate” module of the New E-Resource for Drug Discovery (NERDD).

Download Full-text

Improving XGBoost with Imagination Sampling

Communications of the Blyth Institute ◽

10.33014/issn.2640-5652.2.1.holloway.1 ◽

2020 ◽

Vol 2 (1) ◽

pp. 3-6

Author(s):

Eric Holloway

Keyword(s):

Machine Learning ◽

General System ◽

Learning Models ◽

Starting Point ◽

Machine Learning Models

Imagination Sampling is the usage of a person as an oracle for generating or improving machine learning models. Previous work demonstrated a general system for using Imagination Sampling for obtaining multibox models. Here, the possibility of importing such models as the starting point for further automatic enhancement is explored.

Download Full-text