Embracing imperfection: machine-assisted invertebrate classification in real-world datasets

Despite growing concerns over the health of global insect populations, the spatiotemporal breadth of insect population data is severely lacking. Machine-assisted classification has been proposed as a potential solution to quickly gather large amounts of data, but previous studies have often used unrealistic or idealized datasets to train their models. In this study, we describe a practical methodology for including machine learning in ecological data acquisition pipelines. Here we train and test machine learning algorithms to classify over 56,000 bulk terrestrial invertebrate specimens from image data. All specimens were collected in pitfall traps by the National Ecological Observatory Network (NEON) at 27 locations across the United States. Image data was extracted as feature vectors using ImageJ. When classifying specimens that were known and seen by our models, we reached an accuracy of 74.7% at the lowest taxonomic level. We also classified invertebrate taxa that the model was not trained on using zero-shot classification, with an accuracy of 42.1% on these taxa. The general methodology outlined here represents a realistic approach to how machine learning may be used as a tool for ecological studies.

Analyzing the occurrence of environmental indicator minerals using clustering techniques and mineral networks

10.5194/egusphere-egu21-14074, 2021

Author(s): Jason Williams, Sally Potter-McIntyre, Justin Filiberto, Shaunna Morrison, Daniel Hummer

Keyword(s): Machine Learning, Learning Algorithms, The United States, Hydrothermal Systems, Machine Learning Algorithms, Environmental Indicator, Clustering Techniques, Geological Processes, Indicator Mineral, Indicator Minerals

Indicator minerals have special physical and chemical properties that can be analyzed to glean information concerning the composition of host rocks and formational (or altering) fluids. Clay, zeolite, and tourmaline mineral groups are all ubiquitous at the Earth’s surface and shallow crust and distributed through a wide variety of sedimentary, igneous, metamorphic, and hydrothermal systems. Traditional studies of indicator mineral-bearing deposits have provided a wealth of data that could be integral to discovering new insights into the formation and evolution of naturally occurring systems. This study evaluates the relationships that exist between different environmental indicator mineral groups through the implementation of machine learning algorithms and network diagrams. Mineral occurrence data for thousands of localities hosting clay, zeolite, and tourmaline minerals were retrieved from mineral databases. Clustering techniques (e.g., agglomerative hierarchical clustering and density-based spatial clustering of applications with noise) combined with network analyses were used to analyze the compiled dataset in an effort to characterize and identify geological processes operating at different localities across the United States. Ultimately, this study evaluates the ability of machine learning algorithms to act as supplementary diagnostic and interpretive tools in geoscientific studies.
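As a rough illustration of how a mineral network can be derived from occurrence data, the sketch below counts how often pairs of minerals are reported at the same locality; each pair with a nonzero count would become a weighted edge in a network diagram. The abstract does not describe the study's implementation, so the locality names, the mineral lists, and the function name `cooccurrence_edges` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(localities):
    """Count how many localities each unordered mineral pair shares."""
    edges = Counter()
    for minerals in localities.values():
        # Sort and de-duplicate so each pair has one canonical ordering.
        for pair in combinations(sorted(set(minerals)), 2):
            edges[pair] += 1
    return edges


# Hypothetical occurrence data: locality -> minerals reported there.
localities = {
    "locality_A": ["kaolinite", "schorl", "analcime"],
    "locality_B": ["kaolinite", "analcime"],
    "locality_C": ["schorl", "kaolinite"],
}

edges = cooccurrence_edges(localities)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
```

The resulting edge weights could then feed a clustering step or a graph-drawing tool; which of those the authors used, and how, is not stated in the abstract.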

Integrating hierarchical statistical models and machine-learning algorithms for ground-truthing drone images of the vegetation: taxonomy, abundance and population ecological models

10.1101/491381, 2018, Cited By ~ 1

Author(s): Christian Damgaard

Keyword(s): Machine Learning, Statistical Models, Learning Algorithms, Plant Competition, Image Data, Ground Truth, Ecological Models, Machine Learning Algorithms, Ground Truth Data, Ground Truthing

In order to fit population ecological models, e.g. plant competition models, to new drone-aided image data, we need to develop statistical models that may take the new type of measurement uncertainty when applying machine-learning algorithms into account and quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method to statistically model known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.

Classification of masked image data

PLoS ONE, 10.1371/journal.pone.0254181, 2021, Vol 16 (7), pp. e0254181

Author(s): Kamila Lis, Mateusz Koryciński, Konrad A. Ciecierski

Keyword(s): Neural Network, Machine Learning, Image Data, Original Data, Machine Learning Algorithms, General Data Protection Regulation, Additional Information, Classification Of Images, Applications Of Machine Learning

Data classification is one of the most common applications of machine learning. There are many well-developed algorithms that perform this task very well in various environments and for different data distributions. Classification algorithms, like other machine learning algorithms, have one thing in common: in order to operate on data, they must see the data. In the present world, where concerns about privacy, GDPR (General Data Protection Regulation), business confidentiality, and security are growing, this requirement to work directly on the original data might become, in some situations, a burden. This paper presents an approach to the classification of images that cannot be directly accessed during training. It is shown that one can train a deep neural network to create a representation of the original data such that i) without additional information, the original data cannot be restored, and ii) this representation, called a masked form, can still be used for classification purposes. Moreover, it is shown that classification of the masked data can be done using both classical and neural network-based classifiers.

Machine Learning Prediction of Parkinson's Disease Onset and Subtype Using Germline Variants

10.1101/2021.06.14.21258631, 2021

Author(s): Saya R Dennis, Tanya Simuni, Yuan Luo

Keyword(s): Machine Learning, Parkinson's Disease, Neurodegenerative Disorder, Disease Onset, The United States, Machine Learning Algorithms, Progression Rate, High Importance, Germline Variants

Parkinson's Disease is the second most common neurodegenerative disorder in the United States, and is characterized by a largely irreversible worsening of motor and non-motor symptoms as the disease progresses. A prominent characteristic of the disease is its high heterogeneity in manifestation as well as the progression rate. For sporadic Parkinson's Disease, which comprises ~90% of all diagnoses, the relationship between the patient genome and disease onset or progression subtype remains largely elusive. Machine learning algorithms are increasingly adopted to study the genomics of diseases due to their ability to capture patterns within the vast feature space of the human genome that might be contributing to the phenotype of interest. In our study, we develop two machine learning models that predict the onset as well as the progression subtype of Parkinson's Disease based on subjects' germline mutations. Our best models achieved an ROC of 0.77 and 0.61 for disease onset and subtype prediction, respectively. To the best of our knowledge, our models present state-of-the-art prediction performances of PD onset and subtype solely based on the subjects' germline variants. The genes with high importance in our best-performing models were enriched for several canonical pathways related to signaling, immune system, and protein modifications, all of which have been previously associated with PD symptoms or pathogenesis. These high-importance gene sets provide us with promising candidate genes for future biomedical and clinical research.

Characteristics of Twitter Use by State Medicaid Programs in the United States: Machine Learning Approach

Journal of Medical Internet Research, 10.2196/18401, 2020, Vol 22 (8), pp. e18401

Author(s): Jane M Zhu, Abeed Sarker, Sarah Gollust, Raina Merchant, David Grande

Keyword(s): Public Health, United States, Machine Learning, Public Health Education, The United States, Machine Learning Algorithms, Supervised Machine Learning, Care Organization, The Public, The Mean

Background: Twitter is a potentially valuable tool for public health officials and state Medicaid programs in the United States, which provide public health insurance to 72 million Americans. Objective: We aim to characterize how Medicaid agencies and managed care organization (MCO) health plans are using Twitter to communicate with the public. Methods: Using Twitter’s public application programming interface, we collected 158,714 public posts (“tweets”) from active Twitter profiles of state Medicaid agencies and MCOs, spanning March 2014 through June 2019. Manual content analyses identified 5 broad categories of content, and these coded tweets were used to train supervised machine learning algorithms to classify all collected posts. Results: We identified 15 state Medicaid agencies and 81 Medicaid MCOs on Twitter. The mean number of followers was 1784, the mean number of those followed was 542, and the mean number of posts was 2476. Approximately 39% of tweets came from just 10 accounts. Of all posts, 39.8% (63,168/158,714) were classified as general public health education and outreach; 23.5% (n=37,298) were about specific Medicaid policies, programs, services, or events; 18.4% (n=29,203) were organizational promotion of staff and activities; and 11.6% (n=18,411) contained general news and news links. Only 4.5% (n=7142) of posts were responses to specific questions, concerns, or complaints from the public. Conclusions: Twitter has the potential to enhance community building, beneficiary engagement, and public health outreach, but appears to be underutilized by the Medicaid program.
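The category shares quoted in the results can be checked directly against the reported counts. The short sketch below recomputes each percentage from the figures in the abstract; the dictionary keys are shortened labels chosen here for readability, and the "not itemized" remainder is simply what is left after the five reported counts, not a figure stated by the authors.

```python
TOTAL_POSTS = 158_714

# Category counts as reported in the abstract (labels shortened here).
category_counts = {
    "public health education and outreach": 63_168,
    "Medicaid policies, programs, services, or events": 37_298,
    "organizational promotion": 29_203,
    "general news and news links": 18_411,
    "responses to the public": 7_142,
}


def share(count, total=TOTAL_POSTS):
    """Return the percentage share rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(count) for name, count in category_counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")

# Posts not covered by the five counts quoted in the abstract.
unaccounted = TOTAL_POSTS - sum(category_counts.values())
print(f"not itemized: {unaccounted} ({share(unaccounted)}%)")
```

The five recomputed shares agree with the percentages in the abstract, and a small remainder of posts falls outside the counts it lists.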

Application of Machine Learning for Prediction of Lung Cancer using Omics Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue, 10.35940/ijitee.f3625.049620, 2020, Vol 9 (6), pp. 230-236

Keyword(s): Machine Learning, Lung Cancer, Data Mining, Survival Analysis, Early Stage, Image Data, Machine Learning Algorithms, Machine Learning Techniques, Data Sets, Omics Data

Cancer is a deadly disease in many countries. However, cancer can be cured if it is detected at an early stage. Researchers are working on early detection and prevention of cancer. Medical data is reaching its potential by providing researchers with huge data sets collected from all over the globe. Machine learning is now widely used in cancer diagnosis and prognosis. Survival analysis may help in predicting the early onset of disease, relapse, and recurrence, and in biomarker identification. Applications of machine learning and data mining methods in the medical field are currently most widespread in cancer detection and survival analysis. This survey analyzes different ways to detect and predict lung cancer using recent machine learning algorithms combined with data mining. A comparative study of various machine learning techniques and technologies is carried out over different types of data, such as clinical data, omics data, and image data.

Efficient Machine Learning Techniques to Diagnose and Predict Alzheimer’s disease

International Journal of Engineering and Advanced Technology - Regular Issue, 10.35940/ijeat.c6508.029320, 2020, Vol 9 (3), pp. 3953-3960

Keyword(s): Machine Learning, Alzheimer's Disease, Early Diagnosis, Image Data, Machine Learning Algorithms, Machine Learning Techniques, Support Vector, Learning Mechanisms, Learning Techniques

Recent research in computational engineering has led to the design and development of numerous intelligent models that analyze medical data and derive inferences related to early diagnosis and prediction of disease severity. In this context, the prediction and diagnosis of fatal neurodegenerative diseases in the dementia class from medical image data is considered a challenging area of research. Alzheimer’s disease is a major category of dementia that affects a large population. Despite the development of numerous machine learning models for early diagnosis of Alzheimer’s disease, there remains considerable scope for research. To address this, this article presents a systematic literature review of machine learning techniques developed for early diagnosis of Alzheimer’s disease. Furthermore, the article covers major categories of machine learning algorithms, including artificial neural networks, support vector machines, and deep learning based ensemble models, to help new researchers explore the scope of research in predicting Alzheimer’s disease. Implementation results present a comparative analysis of state-of-the-art machine learning mechanisms.

Machine learning algorithms for the detection of spurious white blood cell differentials due to erythrocyte lysis resistance

Journal of Clinical Pathology, 10.1136/jclinpath-2019-205820, 2019, Vol 72 (6), pp. 431-437, Cited By ~ 1

Author(s): Laura Bigorra, Iciar Larriba, Ricardo Gutiérrez-Gallego

Keyword(s): Machine Learning, Liver Disease, Blood Cell, White Blood Cell, Population Data, Machine Learning Algorithms, Support Vector, SVM Algorithm, Erythrocyte Lysis, WBC Count

Aims: Red blood cell (RBC) lysis resistance interferes with white blood cell (WBC) count and differential; still, its detection relies on the identification of an abnormal scattergram, and this is not clearly adverted by specific flags in the Beckman-Coulter DXH-800. The aims were to analyse precisely the effect of RBC lysis resistance interference in WBC counts, differentials and cell population data (CPD) and then to design, develop and implement a novel diagnostic machine learning (ML) model to optimise the detection of samples presenting this phenomenon. Methods: WBC counts, differentials and CPD from 232 patients (anaemia or liver disease) were compared with 100 healthy controls (HC) using analysis of variance. The data were analysed after a corrective action, and the analyser differentials were also compared with the digital leucocyte differentials. The ML support vector machine (SVM) algorithm was trained with 70% of the samples (n=233) and the 30% remaining (n=99) were employed exclusively during the validation phase. Results: We identified that impedance WBC was not affected by the RBC lysis resistance interference while the DXH-800 differentials overestimated lymphoid subpopulations (17.6%), sometimes even yielding spurious lymphocytosis, and the latter were corrected when sample dilution was performed. The ML-SVM algorithm allowed the classification of the pathological groups when compared with HC with validation accuracies corresponding to 97.98%, 100% and 88.78% for the global, anaemia and liver disease groups, respectively. Conclusions: The proposed algorithm has an impressive discriminatory potential and its application would be a valuable support system to detect spurious results due to RBC lysis resistance.
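The 70/30 hold-out split described in the methods can be sketched as follows. This is only an illustration of the split sizes, not the authors' procedure: the function name `holdout_split`, the fixed seed, and the choice to round the training size up are assumptions made here, and the sketch does not stratify by patient group.

```python
import math
import random


def holdout_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    # Assumption: the training size is rounded up, which reproduces n=233.
    n_train = math.ceil(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


n_total = 232 + 100  # patients plus healthy controls, from the abstract
train_idx, valid_idx = holdout_split(n_total)
print(len(train_idx), len(valid_idx))
```

With 332 samples in total, rounding the 70% share up gives the 233 training and 99 validation samples reported in the abstract.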

Analysis of Machine Learning-Based Assessment for Elbow Spasticity Using Inertial Sensors

Sensors, 10.3390/s20061622, 2020, Vol 20 (6), pp. 1622, Cited By ~ 6

Author(s): Jung-Yeon Kim, Geunsu Park, Seong-A Lee, Yunyoung Nam

Keyword(s): Machine Learning, Inertial Sensors, Learning Algorithms, Machine Learning Algorithms, Lower Limbs, Support Vector, Multilayer Perceptrons, Test Machine, Linear Discriminant, Passive Stretch

Spasticity is a frequently observed symptom in patients with neurological impairments. Spastic movements of their upper and lower limbs are periodically measured to evaluate functional outcomes of physical rehabilitation, and they are quantified by clinical outcome measures such as the modified Ashworth scale (MAS). This study proposes a method to determine the severity of elbow spasticity, by analyzing the acceleration and rotation attributes collected from the elbow of the affected side of patients and machine-learning algorithms to classify the degree of spastic movement; this approach is comparable to assigning an MAS score. We collected inertial data from participants using a wearable device incorporating inertial measurement units during a passive stretch test. Machine-learning algorithms—including decision tree, random forests (RFs), support vector machine, linear discriminant analysis, and multilayer perceptrons—were evaluated in combinations of two segmentation techniques and feature sets. A RF performed well, achieving up to 95.4% accuracy. This work not only successfully demonstrates how wearable technology and machine learning can be used to generate a clinically meaningful index but also offers rehabilitation patients an opportunity to monitor the degree of spasticity, even in nonhealthcare institutions where the help of clinical professionals is unavailable.

Integrating Hierarchical Statistical Models and Machine-Learning Algorithms for Ground-Truthing Drone Images of the Vegetation: Taxonomy, Abundance and Population Ecological Models

Remote Sensing, 10.3390/rs13061161, 2021, Vol 13 (6), pp. 1161

Author(s): Christian Damgaard

Keyword(s): Machine Learning, Statistical Models, Learning Algorithms, Plant Competition, Image Data, Ground Truth, Ecological Models, Machine Learning Algorithms, Ground Truth Data, Ground Truthing

In order to fit population ecological models, e.g., plant competition models, to new drone-aided image data, we need to develop statistical models that may take the new type of measurement uncertainty when applying machine-learning algorithms into account and quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method to statistically model known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.
