Environmental DNA gives comparable results to morphology-based indices of macroinvertebrates in a large-scale ecological assessment

Anthropogenic activities are changing the state of ecosystems worldwide, affecting community composition and often resulting in loss of biodiversity. Rivers are among the most impacted ecosystems. Recording their current state with regular biomonitoring is important to assess the future trajectory of biodiversity. Traditional monitoring methods for ecological assessments are costly and time-intensive. Here, we compared monitoring of macroinvertebrates based on environmental DNA (eDNA) sampling with monitoring based on traditional kick-net sampling to assess biodiversity patterns at 92 river sites covering all major Swiss river catchments. From the kick-net community data, a biotic index (IBCH) based on 145 indicator taxa had been established. The index was matched by the taxonomically annotated eDNA data by using a machine learning approach. Our comparison of diversity patterns only uses the zero-radius Operational Taxonomic Units assigned to the indicator taxa. Overall, we found a strong congruence between both methods for the assessment of the total indicator community composition (gamma diversity). However, when assessing biodiversity at the site level (alpha diversity), the methods were less consistent and gave complementary data on composition. Specifically, environmental DNA retrieved significantly fewer indicator taxa per site than the kick-net approach. Importantly, however, the subsequent ecological classification of rivers based on the detected indicators resulted in similar biotic index scores for the kick-net and the eDNA data that was classified using a random forest approach. The majority of the predictions (72%) from the random forest classification resulted in the same river status categories as the kick-net approach. Thus, environmental DNA validly detected indicator communities and, combined with machine learning, provided reliable classifications of the ecological state of rivers. 
Overall, while environmental DNA gives complementary data on the macroinvertebrate community composition compared to the kick-net approach, the subsequently calculated indices for the ecological classification of river sites are nevertheless directly comparable and consistent.
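The reported share of sites receiving the same river status category from both methods can be read as a simple agreement rate. The following is a minimal sketch of that calculation; the function name, the category labels, and the five example sites are hypothetical and are not data from the study.

```python
def agreement_rate(reference, predicted):
    """Return the fraction of sites where both methods give the same category."""
    if len(reference) != len(predicted):
        raise ValueError("both lists must cover the same sites")
    if not reference:
        raise ValueError("at least one site is required")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


# Hypothetical status categories for five sites (illustration only).
kicknet_status = ["good", "moderate", "good", "poor", "very good"]
edna_status = ["good", "good", "good", "poor", "very good"]

rate = agreement_rate(kicknet_status, edna_status)
print(f"{rate:.0%} of sites share a category")
```

Here the kick-net categories are treated as the reference, mirroring how the abstract compares the random forest predictions against the kick-net approach.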


A large-scale ecological assessment of Swiss rivers using environmental DNA for the monitoring of macroinvertebrates

ARPHA Conference Abstracts ◽

10.3897/aca.4.e65307 ◽

2021 ◽

Vol 4 ◽

Author(s):

Jeanine Brantschen ◽

Rosetta Blackman ◽

Jean-Claude Walser ◽

Florian Altermatt

Keyword(s):

Random Forest ◽

Community Composition ◽

Large Scale ◽

Anthropogenic Activities ◽

Ecological Status ◽

Environmental DNA ◽

Biotic Index ◽

Reference Database ◽

Macroinvertebrate Communities ◽

Monitoring Methods

Anthropogenic activities are changing the state of ecosystems worldwide, affecting community composition and often resulting in loss of biodiversity. Riverine ecosystems are among the most impacted ecosystems. Recording their current state with regular biomonitoring is important to assess the future trajectory of biodiversity. However, traditional monitoring methods for ecological assessments are costly and time-intensive. Here, we compare environmental DNA (eDNA) to traditional kick-net sampling in a standardized framework of surface water quality assessment. We use surveys of macroinvertebrate communities to assess biodiversity and the biological state of riverine systems. Both methods were employed to monitor aquatic macroinvertebrate indicator groups at 92 sites across major Swiss river catchments. The eDNA data were taxonomically assigned using a customised reference database. All zero-radius Operational Taxonomic Units (zOTUs) mapping to one of the 142 traditionally used indicator taxon levels were used for subsequent diversity analyses (n = 205). At the site level, eDNA detected fewer indicator taxa than the kick-net method, and alpha diversity correlated only weakly between the methods. However, the methods showed a strong congruence in the overall community composition (gamma diversity), as the same indicator groups were commonly detected. In order to set the community composition in relation to the biotic index, the ecological states of the sampling sites were predicted by a random forest approach. Using all zOTUs mapping to macroinvertebrate indicator groups (n = 693) as predictive features, the random forest models successfully predicted the ecological status of the sampled sites. The majority of the predictions (71%) resulted in the same classification as the kick-net-based scores. Thus, the sampling of eDNA enabled the detection of indicator communities and, when combined with machine learning, provided valuable classifications of the ecological state.
Overall, eDNA-based sampling has the potential to complement traditional surveys of macroinvertebrate communities in routine large-scale assessments in a non-invasive and scalable approach.
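The distinction between site-level (alpha) and overall (gamma) diversity drawn in this abstract can be illustrated with presence sets per site. This is a minimal sketch under the assumption that diversity is measured as taxon richness; the site names and detections are hypothetical, and the function names are illustrative.

```python
def alpha_diversity(site_taxa):
    """Richness per site: number of distinct taxa detected at each site."""
    return {site: len(taxa) for site, taxa in site_taxa.items()}


def gamma_diversity(site_taxa):
    """Overall richness: number of distinct taxa detected across all sites."""
    return len(set().union(*site_taxa.values()))


# Hypothetical detections at two sites for each method (illustration only).
kicknet = {
    "site_A": {"Baetidae", "Gammaridae", "Chironomidae"},
    "site_B": {"Baetidae", "Heptageniidae"},
}
edna = {
    "site_A": {"Baetidae", "Chironomidae"},
    "site_B": {"Gammaridae", "Heptageniidae"},
}

print(alpha_diversity(kicknet), gamma_diversity(kicknet))
print(alpha_diversity(edna), gamma_diversity(edna))
```

In this toy input the two methods recover the same overall set of taxa while differing per site, which is the kind of pattern the abstract describes.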


Random forest and long short-term memory based machine learning models for classification of ion mobility spectrometry spectra

Chemical, Biological, Radiological, Nuclear, and Explosives (CBRNE) Sensing XXII ◽

10.1117/12.2585829 ◽

2021 ◽

Author(s):

Patrick C. Riley ◽

Samir V. Deshpande ◽

Brian S. Ince ◽

Brian C. Hauck ◽

Kyle P. O'Donnell ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Ion Mobility ◽

Short Term Memory ◽

Learning Models ◽

Short Term ◽

Term Memory ◽

Long Short Term Memory ◽

Machine Learning Models


Modified Decision Tree Technique for Ransomware Detection at Runtime through API Calls

Scientific Programming ◽

10.1155/2020/8845833 ◽

2020 ◽

Vol 2020 ◽

pp. 1-10

Author(s):

Faizan Ullah ◽

Qaisar Javaid ◽

Abdu Salam ◽

Masood Ahmad ◽

Nadeem Sarwar ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Decision Tree ◽

Feature Vector ◽

Machine Learning Algorithms ◽

The Novel ◽

Proposed Model ◽

Testing Accuracy ◽

Financial Losses

Ransomware (RW) is a distinctive variety of malware that encrypts files or locks the user’s system, holding the files hostage, which leads to huge financial losses for users. In this article, we propose a new model that extracts novel features from the RW dataset and classifies RW and benign files. The proposed model can detect a large number of RW from various families at runtime and scans the network, registry activities, and file system throughout the execution. The API-call series was reused to represent the behavior-based features of RW. The technique extracts a fourteen-feature vector at runtime and analyzes it by applying online machine learning algorithms to predict the RW. To validate the effectiveness and scalability, we test 78550 recent malicious and benign samples and compare with random forest and AdaBoost; the testing accuracy reaches 99.56%.


Phybrata Sensors and Machine Learning for Enhanced Neurophysiological Diagnosis and Treatment

Sensors ◽

10.3390/s21217417 ◽

2021 ◽

Vol 21 (21) ◽

pp. 7417

Author(s):

Alex J. Hope ◽

Utkarsh Vashisth ◽

Matthew J. Parker ◽

Andreas B. Ralston ◽

Joshua M. Roper ◽

...

Keyword(s):

Machine Learning ◽

Time Series ◽

Random Forest ◽

Binary Classification ◽

Classification Performance ◽

Support Vector ◽

Use Case ◽

Signal Features ◽

Test Population

Concussion injuries remain a significant public health challenge. A significant unmet clinical need remains for tools that allow related physiological impairments and longer-term health risks to be identified earlier, better quantified, and more easily monitored over time. We address this challenge by combining a head-mounted wearable inertial motion unit (IMU)-based physiological vibration acceleration (“phybrata”) sensor and several candidate machine learning (ML) models. The performance of this solution is assessed for both binary classification of concussion patients and multiclass predictions of specific concussion-related neurophysiological impairments. Results are compared with previously reported approaches to ML-based concussion diagnostics. Using phybrata data from a previously reported concussion study population, four different machine learning models (Support Vector Machine, Random Forest Classifier, Extreme Gradient Boost, and Convolutional Neural Network) are first investigated for binary classification of the test population as healthy vs. concussion (Use Case 1). Results are compared for two different data preprocessing pipelines, Time-Series Averaging (TSA) and Non-Time-Series Feature Extraction (NTS). Next, the three best-performing NTS models are compared in terms of their multiclass prediction performance for specific concussion-related impairments: vestibular, neurological, both (Use Case 2). For Use Case 1, the NTS model approach outperformed the TSA approach, with the two best algorithms achieving an F1 score of 0.94. For Use Case 2, the NTS Random Forest model achieved the best performance in the testing set, with an F1 score of 0.90, and identified a wider range of relevant phybrata signal features that contributed to impairment classification compared with manual feature inspection and statistical data analysis. 
The overall classification performance achieved in the present work exceeds previously reported approaches to ML-based concussion diagnostics using other data sources and ML models. This study also demonstrates the first combination of a wearable IMU-based sensor and ML model that enables both binary classification of concussion patients and multiclass predictions of specific concussion-related neurophysiological impairments.
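The F1 scores quoted for both use cases combine precision and recall into a single number. The following is a minimal sketch of the binary case, assuming the standard definition (harmonic mean of precision and recall); the labels are hypothetical and do not come from the study population.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = concussion, 0 = healthy (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

print(f1_score(y_true, y_pred))
```

For the multiclass setting of Use Case 2, a per-class F1 would typically be averaged; the abstract does not state which averaging was used, so that step is left out here.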


Support Vector Machines in Big Data Classification: A Systematic Literature Review

10.21203/rs.3.rs-663359/v1 ◽

2021 ◽

Author(s):

Mohammad Hassan Almaspoor ◽

Ali Safaei ◽

Afshin Salajegheh ◽

Behrouz Minaei-Bidgoli

Keyword(s):

Machine Learning ◽

Big Data ◽

Large Scale ◽

Support Vector ◽

Research Areas ◽

Large Scale Data ◽

Training Samples ◽

Big Data Classification ◽

Scale Data

Classification is one of the most important and widely used tasks in machine learning; its purpose is to create a rule, based on a set of training samples, for assigning data to pre-existing categories. Employed successfully in many scientific and engineering areas, the Support Vector Machine (SVM) is among the most promising methods of classification in machine learning. With the advent of big data, many of the machine learning methods have been challenged by big data characteristics. The standard SVM has been proposed for batch learning in which all data are available at the same time. The SVM has a high time complexity, i.e., increasing the number of training samples will intensify the need for computational resources and memory. Hence, many attempts have been made to adapt the SVM to online learning conditions and large-scale data. This paper focuses on the analysis, identification, and classification of existing methods for SVM compatibility with online conditions and large-scale data. These methods might be employed to classify big data and propose research areas for future studies. Considering its advantages, the SVM can be among the first options for compatibility with big data and classification of big data. For this purpose, appropriate techniques should be developed for data preprocessing in order to convert data into an appropriate form for learning. The existing frameworks should also be employed for parallel and distributed processes so that SVMs can be made scalable and properly online to be able to handle big data.


Damage Classification of Composites Using Machine Learning

Volume 13: Safety Engineering, Risk, and Reliability Analysis ◽

10.1115/imece2019-11851 ◽

2019 ◽

Author(s):

Shweta Dabetwar ◽

Stephen Ekwaro-Osire ◽

João Paulo Dias

Keyword(s):

Machine Learning ◽

Composite Materials ◽

Random Forest ◽

Condition Monitoring ◽

Machine Learning Algorithms ◽

Support Vector ◽

Damage Classification ◽

Combining Data ◽

Ultrasonic Measurements

Composite materials have tremendous and ever-increasing applications in complex engineering systems; thus, it is important to develop non-destructive and efficient condition monitoring methods to improve damage prediction, thereby avoiding catastrophic failures and reducing standby time. Non-destructive condition monitoring techniques, when combined with machine learning applications, can contribute towards the stated improvements. Thus, the research question taken into consideration for this paper is “Can machine learning techniques provide efficient damage classification of composite materials to improve condition monitoring using features extracted from acousto-ultrasonic measurements?” In order to answer this question, acousto-ultrasonic signals in Carbon Fiber Reinforced Polymer (CFRP) composites for distinct damage levels were taken from the NASA Ames prognostics data repository. Statistical condition indicators of the signals were used as features to train and test four traditional machine learning algorithms (K-nearest neighbors, support vector machine, Decision Tree, and Random Forest), and their performance was compared and discussed. Results showed higher accuracy for Random Forest, with a strong dependency on the feature extraction/selection techniques employed. By combining data analysis from acousto-ultrasonic measurements in composite materials with machine learning tools, this work contributes to the development of intelligent damage classification algorithms that can be applied to advanced online diagnostics and health management strategies of composite materials operating under more complex working conditions.
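The abstract does not list which statistical condition indicators were extracted. As an illustration only, the sketch below computes four indicators commonly used for vibration and ultrasonic signals (RMS, peak, crest factor, kurtosis); the choice of indicators, the function name, and the sample signal are all assumptions.

```python
import math


def condition_indicators(signal):
    """Compute a small feature dictionary from one sampled signal.

    The indicator set is an assumption for illustration, not the
    feature set used in the paper.
    """
    n = len(signal)
    if n == 0:
        raise ValueError("signal must contain at least one sample")
    mean = sum(signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    peak = max(abs(x) for x in signal)
    variance = sum((x - mean) ** 2 for x in signal) / n
    fourth_moment = sum((x - mean) ** 4 for x in signal) / n
    return {
        "rms": rms,
        "peak": peak,
        # Crest factor: how far the largest excursion stands above the RMS level.
        "crest_factor": peak / rms if rms > 0 else 0.0,
        # Kurtosis (non-excess): fourth central moment over squared variance.
        "kurtosis": fourth_moment / variance ** 2 if variance > 0 else 0.0,
    }


# Hypothetical four-sample signal (illustration only).
features = condition_indicators([1.0, -1.0, 1.0, -1.0])
print(features)
```

A feature dictionary like this, computed per signal, is the kind of input that could then be passed to the classifiers named in the abstract.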


3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. Then we will identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Krzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, in which 10 people are phenotype positive and 10 are phenotype negative. In order to find a statistical difference, 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in the simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods.
Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
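Two of the classical methods compared here, Bonferroni and Benjamini-Hochberg adjustment, can be sketched in a few lines. The following assumes the standard textbook definitions of both procedures; the four p-values are hypothetical and unrelated to the simulation in the abstract.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes (illustration only).
pvals = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two, while Benjamini-Hochberg controls the false discovery rate, which is one reason the two can rank differently in comparisons such as this one.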


Accurate classification of pediatric colonic IBD subtype using a random forest machine learning classifier

Journal of Pediatric Gastroenterology and Nutrition ◽

10.1097/mpg.0000000000002956 ◽

2020 ◽

Vol Publish Ahead of Print ◽

Author(s):

Jasbir Dhaliwal ◽

Lauren Erdman ◽

Erik Drysdal ◽

Firas Rinawi ◽

Jennifer Muir ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Classifier


Enhanced Changeover Detection in Industry 4.0 Environments with Machine Learning

Sensors ◽

10.3390/s21175896 ◽

2021 ◽

Vol 21 (17) ◽

pp. 5896

Author(s):

Eddi Miller ◽

Vladyslav Borysenko ◽

Moritz Heusinger ◽

Niklas Niedner ◽

Bastian Engelmann ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Binary Classification ◽

Model Performance ◽

Support Vector ◽

Milling Machine ◽

Vector Machines ◽

Changeover Times ◽

Flow Power

Changeover times are an important element when evaluating the Overall Equipment Effectiveness (OEE) of a production machine. The article presents a machine learning (ML) approach that is based on an external sensor setup to automatically detect changeovers in a shop-floor environment. The door statuses, coolant flow, power consumption, and operator indoor GPS data of a milling machine were used in the ML approach. As ML methods, Decision Trees, Support Vector Machines, (Balanced) Random Forest algorithms, and Neural Networks were chosen, and their performance was compared. The best results were achieved with the Random Forest ML model (97% F1 score, 99.72% AUC score). It was also found that model performance is optimal when only a binary classification of a changeover phase and a production phase is considered and fewer subphases of the changeover process are applied.


Comparing sediment DNA extraction methods for assessing organic enrichment associated with marine aquaculture

PeerJ ◽

10.7717/peerj.10231 ◽

2020 ◽

Vol 8 ◽

pp. e10231

Author(s):

John K. Pearman ◽

Nigel B. Keeley ◽

Susanna A. Wood ◽

Olivier Laroche ◽

Anastasija Zaiko ◽

...

Keyword(s):

DNA Extraction ◽

Community Composition ◽

Oncorhynchus Tshawytscha ◽

Environmental DNA ◽

Illumina MiSeq ◽

Ecological Impacts ◽

Biotic Index ◽

Soil DNA ◽

Additional Advantage ◽

Marine Aquaculture

Marine sediments contain a high diversity of micro- and macro-organisms which are important in the functioning of biogeochemical cycles. Traditionally, anthropogenic perturbation has been investigated by identifying macro-organism responses along gradients. Environmental DNA (eDNA) analyses have recently been advocated as a rapid and cost-effective approach to measuring ecological impacts, and efforts are underway to incorporate eDNA tools into monitoring. Before these methods can replace or complement existing methods, robustness and repeatability of each analytical step have to be demonstrated. One area that requires further investigation is the selection of the sediment DNA extraction method. Environmental DNA sediment samples were obtained along a disturbance gradient adjacent to a Chinook (Oncorhynchus tshawytscha) salmon farm in Otanerau Bay, New Zealand. DNA was extracted using four extraction kits (Qiagen DNeasy PowerSoil, Qiagen DNeasy PowerSoil Pro, Qiagen RNeasy PowerSoil Total RNA/DNA extraction/elution and Favorgen FavorPrep Soil DNA Isolation Midi Kit) and three sediment volumes (0.25, 2, and 5 g). Prokaryotic and eukaryotic communities were amplified using primers targeting the 16S and 18S ribosomal RNA genes, respectively, and were sequenced on an Illumina MiSeq. Diversity and community composition estimates were obtained from each extraction kit, as well as their relative performance in established metabarcoding biotic indices. Differences were observed in the quality and quantity of the extracted DNA amongst kits, with the two Qiagen DNeasy PowerSoil kits performing best. Significant differences in richness among kits were observed for both prokaryotes and eukaryotes (p < 0.001). A small proportion of amplicon sequence variants (ASVs) were shared amongst the kits (~3%), although these shared ASVs accounted for the majority of sequence reads (prokaryotes: 59.9%, eukaryotes: 67.2%).
Differences were observed in the richness and relative abundance of taxonomic classes revealed with each kit. Multivariate analysis showed that there was a significant interaction between “distance” from the farm and “kit” in explaining the composition of the communities, with the distance from the farm being a stronger determinant of community composition. Comparison of the kits against the bacterial and eukaryotic metabarcoding biotic index suggested that all kits showed similar patterns along the environmental gradient. Overall, we advocate the use of Qiagen DNeasy PowerSoil kits when characterizing prokaryotic and eukaryotic eDNA from marine farm sediments. We base this conclusion on the higher DNA quality values and richness achieved with these kits compared to the other kits/amounts investigated in this study. The additional advantage of the PowerSoil Kits is that DNA extractions can be performed using an extractor robot, offering additional standardization and reproducibility of results.
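The observation that few ASVs are shared amongst kits while those shared ASVs carry most of the reads rests on two separate ratios. The sketch below shows one way to compute both from per-kit read counts; the function name, the data layout, and the counts are hypothetical and are not the study's data or pipeline.

```python
def shared_asv_summary(kit_counts):
    """Return (fraction of ASVs shared by all kits, fraction of reads in shared ASVs).

    kit_counts maps a kit name to a dict of ASV identifier -> read count.
    """
    # An ASV counts as detected by a kit only if it has at least one read.
    asv_sets = [
        {asv for asv, reads in counts.items() if reads > 0}
        for counts in kit_counts.values()
    ]
    if not asv_sets:
        raise ValueError("at least one kit is required")
    all_asvs = set().union(*asv_sets)
    shared = set.intersection(*asv_sets)
    total_reads = sum(sum(counts.values()) for counts in kit_counts.values())
    if not all_asvs or total_reads == 0:
        raise ValueError("no reads found")
    shared_reads = sum(
        reads
        for counts in kit_counts.values()
        for asv, reads in counts.items()
        if asv in shared
    )
    return len(shared) / len(all_asvs), shared_reads / total_reads


# Hypothetical read counts for two kits (illustration only).
counts = {
    "kit_A": {"asv1": 80, "asv2": 10, "asv3": 10},
    "kit_B": {"asv1": 70, "asv4": 20, "asv5": 10},
}
asv_fraction, read_fraction = shared_asv_summary(counts)
print(asv_fraction, read_fraction)
```

In this toy input one of five ASVs is shared, yet it holds three quarters of all reads, the same qualitative pattern as the ~3% of ASVs and majority of reads reported in the abstract.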
