Prediction of Aquatic Ecosystem Health Indices through Machine Learning Models Using the WGAN-Based Data Augmentation Method

Changes in hydrological characteristics and increases in various pollutant loadings due to rapid climate change and urbanization have a significant impact on the deterioration of aquatic ecosystem health (AEH). Therefore, it is important to effectively evaluate the AEH in advance and establish appropriate strategic plans. Recently, machine learning (ML) models have been widely used to solve hydrological and environmental problems in various fields. However, collecting sufficient data for ML training is generally time-consuming and labor-intensive. In classification problems especially, data imbalance can lead ML models to erroneous predictions. In this study, we propose a method that addresses the data imbalance problem through data augmentation based on a Wasserstein Generative Adversarial Network (WGAN) and efficiently predicts the grades (from A to E) of AEH indices (i.e., Benthic Macroinvertebrate Index (BMI), Trophic Diatom Index (TDI), and Fish Assessment Index (FAI)) with ML models. Raw datasets for the AEH indices, composed of various physicochemical factors (i.e., WT, DO, BOD5, SS, TN, TP, and Flow) and AEH grades, were built and augmented through the WGAN. The performance of each ML model was evaluated through 10-fold cross-validation (CV), and the ML models trained on the raw and WGAN-based training sets were compared through AEH grade prediction on the test sets. For each AEH index, the ML models trained on the WGAN-based training set achieved an average F1-score across grades of 0.9 or greater on the test set, outperforming the models trained only on the raw training set, which contained fewer data than the other datasets. These results confirm that an ML model trained on a WGAN-augmented dataset can predict AEH grades better than a model trained on a limited dataset; the approach also reduces the effort of collecting field data from rivers, which requires enormous time and cost. In the future, the results of this study can serve as basic data for constructing big data on aquatic ecosystems, which is needed to efficiently evaluate and predict AEH in rivers with ML models.
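The abstract above reports an average F1-score over the grades of each AEH index. As a rough illustration of how such a per-grade average (macro F1) can be computed, here is a minimal sketch in plain Python. The function names and the toy labels are hypothetical and are not taken from the study; the authors' actual evaluation code is not described in the abstract.

```python
from collections import Counter

GRADES = ["A", "B", "C", "D", "E"]  # the five AEH grades named in the abstract


def per_class_f1(y_true, y_pred, labels=GRADES):
    """Return {label: F1} from one-vs-rest counts for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted grade gains a false positive
            fn[truth] += 1  # true grade gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the grade never occurs
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=GRADES):
    """Unweighted mean of the per-grade F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Hypothetical toy labels, for illustration only (not data from the study).
y_true = ["A", "A", "B", "C", "D", "E", "E", "E"]
y_pred = ["A", "B", "B", "C", "D", "E", "E", "D"]
print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

Because the macro average weights every grade equally, a rare grade that is predicted poorly pulls the score down as much as a common one, which is why it is a natural metric when class imbalance is the concern.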


A Generative Adversarial Network (GAN) Technique for Internet of Medical Things Data

Sensors ◽

10.3390/s21113726 ◽

2021 ◽

Vol 21 (11) ◽

pp. 3726

Author(s):

Ivan Vaccari ◽

Vanessa Orani ◽

Alessia Paglialonga ◽

Enrico Cambiaso ◽

Maurizio Mongelli

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Monitoring Program ◽

Clinical Decision Support Systems ◽

Direct Access ◽

Generative Adversarial Networks ◽

Chronic Obstructive ◽

Generative Adversarial Network ◽

Internet Of Medical Things ◽

Synthetic Datasets

The application of machine learning and artificial intelligence techniques in the medical world is growing, with a range of purposes: from the identification and prediction of possible diseases to patient monitoring and clinical decision support systems. Furthermore, the widespread use of remote monitoring medical devices, under the umbrella of the “Internet of Medical Things” (IoMT), has simplified the retrieval of patient information as they allow continuous monitoring and direct access to data by healthcare providers. However, due to possible issues in real-world settings, such as loss of connectivity, irregular use, misuse, or poor adherence to a monitoring program, the data collected might not be sufficient to implement accurate algorithms. For this reason, data augmentation techniques can be used to create synthetic datasets sufficiently large to train machine learning models. In this work, we apply the concept of generative adversarial networks (GANs) to perform a data augmentation from patient data obtained through IoMT sensors for Chronic Obstructive Pulmonary Disease (COPD) monitoring. We also apply an explainable AI algorithm to demonstrate the accuracy of the synthetic data by comparing it to the real data recorded by the sensors. The results obtained demonstrate how synthetic datasets created through a well-structured GAN are comparable with a real dataset, as validated by a novel approach based on machine learning.


Characterizing Responses of Biological Trait and Functional Diversity of Benthic Macroinvertebrates to Environmental Variables to Develop Aquatic Ecosystem Health Assessment Index.

Korean Journal of Ecology and Environment ◽

10.11614/ksl.2020.53.1.031 ◽

2020 ◽

Vol 53 (1) ◽

pp. 31-45

Author(s):

Mi Young Moon ◽

Chang Woo Ji ◽

Dae-Seong Lee ◽

Da-Yeong Lee ◽

...

Keyword(s):

Functional Diversity ◽

Benthic Macroinvertebrates ◽

Aquatic Ecosystem ◽

Environmental Variables ◽

Ecosystem Health ◽

Health Assessment ◽

Assessment Index ◽

Biological Trait ◽

Ecosystem Health Assessment


A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems

mBio ◽

10.1128/mbio.00434-20 ◽

2020 ◽

Vol 11 (3) ◽

Cited By ~ 9

Author(s):

Begüm D. Topçuoğlu ◽

Nicholas A. Lesniak ◽

Mack T. Ruffin ◽

Jenna Wiens ◽

Patrick D. Schloss

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Sequence Data ◽

Characteristic Curve ◽

Predictive Performance ◽

Model Complexity ◽

Support Vector ◽

Classification Problems ◽

Microbial Biomarkers

ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, many researchers appear to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.

IMPORTANCE Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
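The abstract above compares models by AUROC. For readers unfamiliar with the metric, the following is a minimal sketch, in plain Python, of one standard way to compute it: the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as one half. The function name and the toy scores are hypothetical; this is not the pipeline the authors describe.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0 (control) or 1 (case)
    scores: iterable of model scores, higher meaning "more likely a case"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above control
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.6]
print(auroc(toy_labels, toy_scores))
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; rank-based implementations give the same value more efficiently. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.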


North American Hardwoods Identification Using Machine-Learning

Forests ◽

10.3390/f11030298 ◽

2020 ◽

Vol 11 (3) ◽

pp. 298 ◽

Cited By ~ 2

Author(s):

Dercilio Junior Verly Lopes ◽

Greg W. Burgreen ◽

Edward D. Entsminger

Keyword(s):

Machine Learning ◽

North American ◽

Mobile Application ◽

Cross Validation ◽

Data Augmentation ◽

Technical Note ◽

Machine Learning Method ◽

Training Set ◽

Hardwood Species ◽

Fold Cross Validation

This technical note determines the feasibility of using an InceptionV4_ResNetV2 convolutional neural network (CNN) to correctly identify hardwood species from macroscopic images. The method uses a commodity smartphone fitted with a 14× macro lens for photography. The end grains of ten different North American hardwood species were photographed to create a dataset of 1869 images. Stratified 5-fold cross-validation was used, in which the number of testing samples varied from 341 to 342. Data augmentation was performed on the fly for each training set by rotating, zooming, and flipping images. The CNN correctly identified hardwood species from macroscopic images of their end grain with an adjusted accuracy of 92.60%. Given the current growth of the machine-learning field, this model can be readily deployed in a mobile application for field wood identification.
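The abstract above mentions on-the-fly augmentation by rotating, zooming, and flipping images. The sketch below illustrates the flip and rotate steps only, on a tiny image stored as a nested list, using plain Python. It is a simplified assumption: rotations are restricted to multiples of 90 degrees, zooming is omitted, and the function names are hypothetical rather than taken from the paper.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate a rectangular image a quarter turn clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def random_augment(image, rng):
    """Return a randomly flipped and rotated copy of the image.

    rng is a random.Random instance so results can be reproduced.
    """
    out = [list(row) for row in image]  # copy, so the input is never mutated
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    for _ in range(rng.randrange(4)):   # 0 to 3 quarter turns
        out = rotate_90_clockwise(out)
    return out


# Hypothetical 2x2 "image" of pixel intensities.
tiny = [[1, 2],
        [3, 4]]
print(flip_horizontal(tiny))      # [[2, 1], [4, 3]]
print(rotate_90_clockwise(tiny))  # [[3, 1], [4, 2]]
print(random_augment(tiny, random.Random(0)))
```

Applying the transform anew each time a training image is drawn means the network rarely sees exactly the same pixels twice, while the label (the species) is unchanged by the transformation.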


A Hybrid Vision-Map Method for Urban Road Detection

Journal of Advanced Transportation ◽

10.1155/2017/7090549 ◽

2017 ◽

Vol 2017 ◽

pp. 1-21 ◽

Cited By ~ 6

Author(s):

Carlos Fernández ◽

David Fernández-Llorca ◽

Miguel A. Sotelo

Keyword(s):

Machine Learning ◽

Urban Environments ◽

Machine Learning Techniques ◽

Learning Approaches ◽

Classification Problems ◽

Road Detection ◽

Training Set ◽

Digital Maps ◽

The Road ◽

Learning Techniques

A hybrid vision-map system is presented to solve the road detection problem in urban scenarios. The standard use of machine learning techniques in classification problems has been merged with digital navigation map information to increase system robustness. The objective of this paper is to create a new environment perception method that detects the road in urban environments by fusing stereo vision with digital maps, detecting road appearance and road limits such as lane markings or curbs. Deep learning approaches make a system tightly coupled to its training set. Even though our approach is based on machine learning techniques, the features are calculated from different sources (GPS, map, curbs, etc.), making our system less dependent on the training set.


A Hybrid Capsule Network for Pneumonia Detection Using Image Augmentation Based on Generative Adversarial Network

Traitement du signal ◽

10.18280/ts.380309 ◽

2021 ◽

Vol 38 (3) ◽

pp. 619-627

Author(s):

Kazim Firildak ◽

Muhammed Fatih Talu

Keyword(s):

Classification Accuracy ◽

Data Augmentation ◽

Model Parameters ◽

Classification Problems ◽

Generative Adversarial Network ◽

X Ray ◽

Discrimination Capability ◽

Adversarial Network ◽

Original Dataset ◽

Chest X Ray

Pneumonia, characterized by inflammation of the air sacs in one or both lungs, is usually detected by examining chest X-ray images. This paper examines classification models that can distinguish between normal and pneumonia images. Pretrained networks like AlexNet and GoogleNet are deep network architectures that are widely adopted to solve many classification problems. Through transfer learning, they have been adapted to target datasets and employed to classify new data. However, the classical architectures are not accurate enough for the diagnosis of pneumonia. Therefore, this paper designs a capsule network with high discrimination capability and trains the network on Kaggle’s online pneumonia dataset, which contains chest X-ray images of many adults and children. The original dataset consists of 1,583 normal images and 4,273 pneumonia images. Two data augmentation approaches were then applied to the dataset, and their effects on classification accuracy were compared in detail. The model parameters were optimized through five different experiments. The results show that the highest classification accuracy (93.91%, even on small images) was achieved by the capsule network coupled with data augmentation by a generative adversarial network (GAN), using optimized parameters. This network outperformed the classical strategies.


Data Augmentation based on Sequence Generative Adversarial Network for Chinese Clinical Named Entity Recognition (Preprint)

10.2196/preprints.17847 ◽

2020 ◽

Author(s):

蓬辉王

Keyword(s):

Data Augmentation ◽

Named Entity Recognition ◽

Basic Element ◽

Entity Recognition ◽

Training Set ◽

Generative Adversarial Network ◽

Named Entity ◽

Adversarial Network ◽

External Resource ◽

Medical Domain

BACKGROUND Chinese clinical named entity recognition, a fundamental task of Chinese medical information extraction, plays an important role in recognizing medical entities contained in Chinese electronic medical records. Limited by the lack of large annotated datasets, existing methods concentrate on employing external resources to improve the performance of clinical named entity recognition, which requires considerable time and efficient rules for adding those resources.

OBJECTIVE To address the lack of large annotated datasets, we employ data augmentation without external resources to automatically generate more medical data from the entities and non-entities in the training set, enlarging the training dataset to improve the performance of named entity recognition.

METHODS In this paper, we propose a data augmentation method, based on a sequence generative adversarial network, to enlarge the training set. Unlike other sequence generative adversarial networks, in which the basic element is a character or word, the basic element of our generated sequence is an entity or non-entity. In our model, the generator produces new sentences composed of entities and non-entities based on the hidden relationship it has learned between the entities and non-entities in the training set, and the discriminator judges whether the generated sentences are positive and gives rewards that help train the generator. The data generated by the sequence adversarial network are used to enlarge the training set and improve the performance of named entity recognition in medical records.

RESULTS Without external resources, we apply our data augmentation method to three datasets, covering both general domains and the medical domain. Experiments show that when generated data are used to expand the training set, the named entity recognition system achieves competitive performance compared with existing methods, which shows the effectiveness of our data augmentation method. In general domains, our method achieves an overall F1-score of 59.42% on the Weibo NER dataset and an F1-score of 95.28% on Resume. In the medical domain, our method achieves 83.40%.

CONCLUSIONS Our data augmentation method can expand the training set based on the hidden relationship between entities and non-entities in the dataset, which alleviates the lack of labeled data while avoiding the use of external resources. At the same time, our method can improve the performance of named entity recognition not only in general domains but also in the medical domain.


Augmented Ultrasonic Data for Machine Learning

Journal of Nondestructive Evaluation ◽

10.1007/s10921-020-00739-5 ◽

2021 ◽

Vol 40 (1) ◽

Author(s):

Iikka Virkkunen ◽

Tuomas Koskinen ◽

Oskari Jessen-Juhler ◽

Jari Rinta-aho

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Flaw Detection ◽

Machine Learning Algorithms ◽

Non Destructive Testing ◽

Classification Problems ◽

Destructive Testing ◽

Complex Signals ◽

Convolutional Network ◽

Non Destructive

Abstract Flaw detection in non-destructive testing (NDT), especially for complex signals like ultrasonic data, has thus far relied heavily on the expertise and judgement of trained human inspectors. While automated systems have been used for a long time, these have mostly been limited to simple decision automation, such as a signal amplitude threshold. Recent advances in various machine learning algorithms have solved many similarly difficult classification problems that had previously been considered intractable. For non-destructive testing, encouraging results have already been reported in the open literature, but the use of machine learning is still very limited in NDT applications in the field. A key issue hindering its use is the limited availability of representative flawed datasets to be used for training. In the present paper, we develop a modern, deep convolutional network to detect flaws from phased-array ultrasonic data. We make extensive use of data augmentation to enhance the initially limited raw data and to aid learning. The data augmentation utilizes virtual flaws, a technique that has successfully been used in training human inspectors and is soon to be used in nuclear inspection qualification. The results from the machine learning classifier are compared to human performance. We show that, using sophisticated data augmentation, modern deep learning networks can be trained to achieve human-level performance.


Generating the Microstructure of Al-Si Cast Alloys Using Machine Learning

Korean Journal of Metals and Materials ◽

10.3365/kjmm.2021.59.11.838 ◽

2021 ◽

Vol 59 (11) ◽

pp. 838-847

Author(s):

In-Kyu Hwang ◽

Hyun-Ji Lee ◽

Sang-Jun Jeong ◽

In-Sung Cho ◽

Hee-Soo Kim

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Training Dataset ◽

Initial Training ◽

Microstructural Characteristics ◽

Generative Adversarial Network ◽

The Real ◽

Adversarial Network ◽

Cast Alloys ◽

The Right

In this study, we constructed a deep convolutional generative adversarial network (DCGAN) to generate microstructural images that imitate the real microstructures of binary Al-Si cast alloys. We prepared four alloy compositions, Al-6wt%Si, Al-9wt%Si, Al-12wt%Si and Al-15wt%Si, for machine learning. The DCGAN is composed of a generator and a discriminator. The discriminator is a typical convolutional neural network (CNN), and the generator is an inverse-shaped CNN. The fake images generated using the DCGAN were similar to real microstructural images. However, they showed some strange morphology, including dendrites without directionality and deformed Si crystals. Verification with Inception V3 revealed that the fake images generated using the DCGAN were well classified into the target categories. Even the visually imperfect images from the initial training iterations showed high similarity to the target. It seems that the imperfect images had enough microstructural characteristics to satisfy the classifier, even though humans cannot recognize the images. Cross-validation was carried out using real, fake and other test images. When the training dataset contained only the fake images, the real and test images showed high similarities to the target categories. When the training dataset contained both the real and fake images, the similarities to the target categories were high enough to yield the right answers. We concluded that the DCGAN developed for microstructural images in this study is highly useful for data augmentation for rare microstructures.


A framework for effective application of machine learning to microbiome-based classification problems

10.1101/816090 ◽

2019 ◽

Cited By ~ 3

Author(s):

Begüm D. Topçuoğlu ◽

Nicholas A. Lesniak ◽

Mack Ruffin ◽

Jenna Wiens ◽

Patrick D. Schloss

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Sequence Data ◽

Predictive Performance ◽

Model Complexity ◽

Support Vector ◽

Classification Problems ◽

16S Rrna Sequence ◽

Microbial Biomarkers

Abstract Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made towards developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, many researchers appear to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs; n=490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, decision trees, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an AUROC of 0.695 [IQR 0.651-0.739] but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 [IQR 0.625-0.735], trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.

Importance Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely over-optimistic. Moreover, there is a trend towards using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step towards developing more reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
