Exploring the Relationship Between Chlorophyll-a and Other Water Quality Parameters by Using Machine Learning Methods: A Case Study of Lake Erie

Input Variables

Chlorophyll-a (CHLA) is a key water quality indicator of eutrophication in Lake Erie. To better predict CHLA concentration, this study divided Lake Erie into United States and Canadian parts along the national boundary and identified the input variable most relevant to CHLA in each part: total phosphorus (TP) for the United States and total nitrogen (TN) for Canada. The analysis attributes the excessive TP and TN content to industrial and agricultural pollution around Lake Erie. Machine learning methods were used to model the water quality of the two parts separately. The data used in the modelling were obtained from the Canadian Environment and Climate Change Agency and cover Lake Erie between 2000 and 2018. Several neural network (NN) models and other machine learning methods were used for data analysis, including standard NN models, simple recurrent neural network (SRN) models, backpropagation neural network (BPNN) models, a jump connection neural network (JCNN) model, random forest (RF), and support vector machine (SVM). The most suitable combinations of input variables for CHLA prediction were also found: TP, TN, DO, and T for the United States, and TP, TN, pH, and DO for Canada. Combining this result with the environmental protection policies of the United States and Canada, the study proposes recommendations for reducing the pollutant content of Lake Erie, which will help reduce the risk of eutrophication in the lake.

Machine Learning Methods to Identify Missed Cases of Bladder Cancer in Population-Based Registries

JCO Clinical Cancer Informatics ◽

10.1200/cci.20.00170 ◽

2021 ◽

pp. 641-653

Author(s):

Anne-Michelle Noone ◽

Clara J. K. Lam ◽

Angela B. Smith ◽

Matthew E. Nielsen ◽

Eric Boyd ◽

...

Keyword(s):

United States ◽

Machine Learning ◽

Bladder Cancer ◽

Cancer Incidence ◽

Cancer Registries ◽

The United States ◽

Population Based ◽

Learning Methods ◽

Classification And Regression

PURPOSE Population-based cancer incidence rates of bladder cancer may be underestimated. Accurate estimates are needed for understanding the burden of bladder cancer in the United States. We developed and evaluated the feasibility of a machine learning–based classifier to identify bladder cancer cases missed by cancer registries, and estimated the rate of bladder cancer cases potentially missed. METHODS Data were from a population-based cohort of 37,940 bladder cancer cases 65 years of age and older in the SEER cancer registries linked with Medicare claims (2007-2013). Cases with other urologic cancers, abdominal cancers, and unrelated cancers were included as control groups. A cohort of cancer-free controls was also selected using the Medicare 5% random sample. We used five supervised machine learning methods: classification and regression trees, random forest, logic regression, support vector machines, and logistic regression, for predicting bladder cancer. RESULTS Registry linkages yielded 37,940 bladder cancer cases and 766,303 cancer-free controls. Using health insurance claims, classification and regression trees distinguished bladder cancer cases from noncancer controls with very high accuracy (95%). Bacille Calmette-Guerin, cystectomy, and mitomycin were the most important predictors for identifying bladder cancer. From 2007 to 2013, we estimated that up to 3,300 bladder cancer cases in the United States may have been missed by the SEER registries. This would result in an average increase of 3.5% in the reported incidence rate. CONCLUSION SEER cancer registries may potentially miss bladder cancer cases during routine reporting. These missed cases can be identified by leveraging Medicare claims and data analytics, leading to more accurate estimates of bladder cancer incidence.

The #MeToo Movement in the United States: Text Analysis of Early Twitter Conversations

Journal of Medical Internet Research ◽

10.2196/13837 ◽

2019 ◽

Vol 21 (9) ◽

pp. e13837 ◽

Cited By ~ 2

Author(s):

Sepideh Modrek ◽

Bozhidar Chakalov

Keyword(s):

United States ◽

Machine Learning ◽

Sexual Assault ◽

Sexual Harassment ◽

Early Life ◽

English Language ◽

Life Experiences ◽

The United States ◽

Learning Methods ◽

Background The #MeToo movement sparked an international debate on sexual harassment, abuse, and assault and has taken many directions since its inception in October of 2017. Much of the early conversation took place on public social media sites such as Twitter, where the hashtag movement began. Objective The aim of this study is to document, characterize, and quantify early public discourse and conversation of the #MeToo movement from Twitter data in the United States. We focus on posts with public first-person revelations of sexual assault/abuse and early life experiences of such events. Methods We purchased full tweets and associated metadata from the Twitter Premium application programming interface between October 14 and 21, 2017 (ie, the first week of the movement). We examined the content of novel English language tweets with the phrase “MeToo” from within the United States (N=11,935). We used machine learning methods, least absolute shrinkage and selection operator regression, and support vector machine models to summarize and classify the content of individual tweets with revelations of sexual assault and abuse and early life experiences of sexual assault and abuse. Results We found that the most predictive words created a vivid archetype of the revelations of sexual assault and abuse. We then estimated that in the first week of the movement, 11% of novel English language tweets with the words “MeToo” revealed details about the poster’s experience of sexual assault or abuse and 5.8% revealed early life experiences of such events. We examined the demographic composition of posters of sexual assault and abuse and found that white women aged 25-50 years were overrepresented in terms of their representation on Twitter. Furthermore, we found that the mass sharing of personal experiences of sexual assault and abuse had a large reach, where 6 to 34 million Twitter users may have seen such first-person revelations from someone they followed in the first week of the movement. Conclusions These data illustrate that revelations shared went beyond acknowledgement of having experienced sexual harassment and often included vivid and traumatic descriptions of early life experiences of assault and abuse. These findings and methods underscore the value of content analysis, supported by novel machine learning methods, to improve our understanding of how widespread the revelations were, which likely amplified the spread and saliency of the #MeToo movement.

The #MeToo Movement in the United States: Text Analysis of Early Twitter Conversations (Preprint)

10.2196/preprints.13837 ◽

2019 ◽

Author(s):

Sepideh Modrek ◽

Bozhidar Chakalov

Keyword(s):

United States ◽

Machine Learning ◽

Sexual Assault ◽

Sexual Harassment ◽

Early Life ◽

English Language ◽

Life Experiences ◽

The United States ◽

Learning Methods ◽

BACKGROUND The #MeToo movement sparked an international debate on sexual harassment, abuse, and assault and has taken many directions since its inception in October of 2017. Much of the early conversation took place on public social media sites such as Twitter, where the hashtag movement began. OBJECTIVE The aim of this study is to document, characterize, and quantify early public discourse and conversation of the #MeToo movement from Twitter data in the United States. We focus on posts with public first-person revelations of sexual assault/abuse and early life experiences of such events. METHODS We purchased full tweets and associated metadata from the Twitter Premium application programming interface between October 14 and 21, 2017 (ie, the first week of the movement). We examined the content of novel English language tweets with the phrase “MeToo” from within the United States (N=11,935). We used machine learning methods, least absolute shrinkage and selection operator regression, and support vector machine models to summarize and classify the content of individual tweets with revelations of sexual assault and abuse and early life experiences of sexual assault and abuse. RESULTS We found that the most predictive words created a vivid archetype of the revelations of sexual assault and abuse. We then estimated that in the first week of the movement, 11% of novel English language tweets with the words “MeToo” revealed details about the poster’s experience of sexual assault or abuse and 5.8% revealed early life experiences of such events. We examined the demographic composition of posters of sexual assault and abuse and found that white women aged 25-50 years were overrepresented in terms of their representation on Twitter. Furthermore, we found that the mass sharing of personal experiences of sexual assault and abuse had a large reach, where 6 to 34 million Twitter users may have seen such first-person revelations from someone they followed in the first week of the movement. CONCLUSIONS These data illustrate that revelations shared went beyond acknowledgement of having experienced sexual harassment and often included vivid and traumatic descriptions of early life experiences of assault and abuse. These findings and methods underscore the value of content analysis, supported by novel machine learning methods, to improve our understanding of how widespread the revelations were, which likely amplified the spread and saliency of the #MeToo movement.

Toward the use of neural networks for influenza prediction at multiple spatial resolutions

Science Advances ◽

10.1126/sciadv.abb1237 ◽

2021 ◽

Vol 7 (25) ◽

pp. eabb1237

Author(s):

Emily L. Aiken ◽

Andre T. Nguyen ◽

Cecile Viboud ◽

Mauricio Santillana

Keyword(s):

Neural Network ◽

Machine Learning ◽

Real Time ◽

The United States ◽

Network Approach ◽

Internet Search ◽

Learning Methods ◽

Neural Network Approach ◽

Search Data

Mitigating the effects of disease outbreaks with timely and effective interventions requires accurate real-time surveillance and forecasting of disease activity, but traditional health care–based surveillance systems are limited by inherent reporting delays. Machine learning methods have the potential to fill this temporal “data gap,” but work to date in this area has focused on relatively simple methods and coarse geographic resolutions (state level and above). We evaluate the predictive performance of a gated recurrent unit neural network approach in comparison with baseline machine learning methods for estimating influenza activity in the United States at the state and city levels and experiment with the inclusion of real-time Internet search data. We find that the neural network approach improves upon baseline models for long time horizons of prediction but is not improved by real-time internet search data. We conduct a thorough analysis of feature importances in all considered models for interpretability purposes.

Possibility of Autonomous Estimation of Shiba Goat’s Estrus and Non-Estrus Behavior by Machine Learning Methods

Animals ◽

10.3390/ani10050771 ◽

2020 ◽

Vol 10 (5) ◽

pp. 771

Author(s):

Toshiya Arakawa

Keyword(s):

Neural Network ◽

Machine Learning ◽

Random Forest ◽

Markov Models ◽

Tracking System ◽

Video Tracking ◽

Training Data ◽

Support Vector ◽

Learning Methods ◽

Mammalian behavior is typically monitored by observation. However, direct observation requires substantial effort and time when the number of mammals to be observed is large or when the observation is conducted over a prolonged period. In this study, machine learning methods such as hidden Markov models (HMMs), random forests, support vector machines (SVMs), and neural networks were applied to estimate whether a goat is in estrus based on the goat’s behavior, and the adequacy of each method was verified. Goat tracking data were obtained using a video tracking system and used to estimate whether goats in “estrus” or “non-estrus” were in either of two states: “approaching the male” or “standing near the male”. Overall, the percentage concordance (PC) of random forest appears to be the highest. However, its PC for goats whose data were not used in the training data sets is relatively low, which suggests that random forest tends to overfit the training data. Apart from random forest, the PC of HMMs and SVMs is high. Considering the calculation time and the HMM’s advantage as a time series model, the HMM is the better method. The PC of the neural network is low overall; however, if more goat data were acquired, the neural network could become an adequate method for estimation.

Landslide susceptibility mapping based on convolutional neural network and conventional machine learning methods

10.21203/rs.3.rs-190195/v1 ◽

2021 ◽

Author(s):

Rui Liu ◽

Xin Yang ◽

Chong Xu ◽

Luyao Li ◽

Xiangqiang Zeng

Keyword(s):

Neural Network ◽

Machine Learning ◽

Convolutional Neural Network ◽

Landslide Susceptibility ◽

Susceptibility Mapping ◽

Landslide Susceptibility Mapping ◽

Support Vector ◽

Learning Methods ◽

Conventional Machine

Abstract Landslide susceptibility mapping (LSM) is a useful tool to estimate the probability of landslide occurrence, providing a scientific basis for natural hazards prevention, land use planning, and economic development in landslide-prone areas. To date, a large number of machine learning methods have been applied to LSM, and recently the advanced Convolutional Neural Network (CNN) has been gradually adopted to enhance the prediction accuracy of LSM. The objective of this study is to introduce a CNN-based model in LSM and systematically compare its overall performance with the conventional machine learning models of random forest, logistic regression, and support vector machine. Herein, we selected the Jiuzhaigou region in Sichuan Province, China as the study area. A total of 710 landslides and 12 predisposing factors were stacked to form spatial datasets for LSM. ROC analysis and several statistical metrics, such as accuracy, root mean square error (RMSE), Kappa coefficient, sensitivity, and specificity, were used to evaluate the performance of the models on the training and validation datasets. Finally, the trained models were used to map the landslide susceptibility zones. Results suggest that both the CNN and the conventional machine learning models have a satisfactory performance (AUC: 85.72%–90.17%). The CNN-based model exhibits excellent goodness-of-fit and prediction capability, and it not only achieves the highest performance (AUC: 90.17%) but also significantly reduces the salt-and-pepper effect, which indicates its great potential for application to LSM.
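As an illustration of the evaluation metrics named in this abstract, the following is a minimal sketch of how accuracy, sensitivity, specificity, and the Kappa coefficient can be computed from binary labels. It is not the authors' code: the function name `binary_metrics` and the toy labels are assumptions made for the example, and Kappa is taken to mean Cohen's kappa.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, and Cohen's kappa for 0/1 labels.

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # share of actual positives that are detected
    specificity = tn / (tn + fp)  # share of actual negatives that are detected

    # Agreement expected by chance, from the marginal totals of the confusion matrix.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Hypothetical labels: 1 = landslide cell, 0 = non-landslide cell.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Kappa corrects raw accuracy for the agreement that would occur by chance, which is why it is often reported alongside accuracy when the two classes are balanced by sampling.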

Communications in Computer and Information Science - Data Stream Mining & Processing ◽

Detecting Items with the Biggest Weight Based on Neural Network and Machine Learning Methods

10.1007/978-3-030-61656-4_26 ◽

2020 ◽

pp. 383-396

Author(s):

Vitaliy Danylyk ◽

Victoria Vysotska ◽

Vasyl Lytvyn ◽

Svitlana Vyshemyrska ◽

Iryna Lurie ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Learning Methods ◽

Convolutional Neural Network Model in Machine Learning Methods and Computer Vision for Image Recognition: A Review

Journal of Applied Sciences Research ◽

10.22587/jasr.2018.14.6.5 ◽

2018 ◽

Keyword(s):

Neural Network ◽

Machine Learning ◽

Computer Vision ◽

Convolutional Neural Network ◽

Network Model ◽

Image Recognition ◽

Neural Network Model ◽

Learning Methods ◽

Machine Learning Models of COVID-19 Cases in the United States: A Study of Initial Lockdown and Reopen Regimes

Applied Sciences ◽

10.3390/app112311227 ◽

2021 ◽

Vol 11 (23) ◽

pp. 11227

Author(s):

Arnold Kamis ◽

Yudan Ding ◽

Zhenzhen Qu ◽

Chenchen Zhang

Keyword(s):

United States ◽

Machine Learning ◽

Additive Model ◽

Regression Tree ◽

Predictor Variable ◽

The United States ◽

Predictor Variables ◽

Future Research ◽

Variance Explained

The purpose of this paper is to model the cases of COVID-19 in the United States from 13 March 2020 to 31 May 2020. Our novel contribution is that we have obtained highly accurate models focused on two different regimes, lockdown and reopen, modeling each regime separately. The predictor variables include aggregated individual movement as well as state population density, health rank, climate temperature, and political color. We apply a variety of machine learning methods to each regime: Multiple Regression, Ridge Regression, Elastic Net Regression, Generalized Additive Model, Gradient Boosted Machine, Regression Tree, Neural Network, and Random Forest. We discover that Gradient Boosted Machines are the most accurate in both regimes. The best models achieve a variance explained of 95.2% in the lockdown regime and 99.2% in the reopen regime. We describe the influence of the predictor variables as they change from regime to regime. Notably, we identify individual person movement, as tracked by GPS data, to be an important predictor variable. We conclude that government lockdowns are an extremely important de-densification strategy. Implications and questions for future research are discussed.
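The abstract reports model quality as variance explained. One common reading of that term is the coefficient of determination, R² = 1 − SS_res / SS_tot; the sketch below assumes that definition, which the abstract itself does not state. The function name `variance_explained` and the sample values are hypothetical.

```python
def variance_explained(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot (assumed definition of variance explained)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


# Made-up observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = variance_explained(observed, predicted)
print(f"{100 * r2:.1f}% variance explained")
```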

A Comparison of Machine Learning Methods in a High-Dimensional Classification Problem

Business Systems Research Journal ◽

10.2478/bsrj-2014-0021 ◽

2014 ◽

Vol 5 (3) ◽

pp. 82-96 ◽

Cited By ~ 3

Author(s):

Marijana Zekić-Sušac ◽

Sanja Pfeifer ◽

Nataša Šarlija

Keyword(s):

Neural Network ◽

Machine Learning ◽

Classification Accuracy ◽

Classification Problem ◽

High Dimensional ◽

Nearest Neighbour ◽

Learning Methods ◽

Dimensional Classification ◽

Artificial Neural

Abstract Background: Large-dimensional data modelling often relies on variable reduction methods in the pre-processing and in the post-processing stage. However, such a reduction usually provides less information and yields a lower accuracy of the model. Objectives: The aim of this paper is to assess the high-dimensional classification problem of recognizing entrepreneurial intentions of students by machine learning methods. Methods/Approach: Four methods were tested on the same dataset: artificial neural networks, CART classification trees, support vector machines, and k-nearest neighbour, in order to compare their efficiency in terms of classification accuracy. The performance of each method was compared on ten subsamples in a 10-fold cross-validation procedure in order to assess the sensitivity and specificity of each model. Results: The artificial neural network model based on a multilayer perceptron yielded a higher classification rate than the models produced by the other methods. The pairwise t-test showed a statistically significant difference between the artificial neural network and the k-nearest neighbour model, while the differences among the other methods were not statistically significant. Conclusions: The tested machine learning methods are able to learn fast and achieve high classification accuracy. However, further advancement can be assured by testing a few additional methodological refinements in machine learning methods.
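The 10-fold cross-validation procedure mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' setup: it uses a plain k-nearest neighbour classifier on synthetic two-cluster data, and the names `k_fold_indices`, `knn_predict`, and `cross_validate` are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k training points closest to x."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_X[i], x))

    nearest = sorted(range(len(train_X)), key=sq_dist)[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


def cross_validate(X, y, n_folds=10, k_neighbors=3):
    """Return the held-out classification accuracy for each fold."""
    accuracies = []
    for held_out in k_fold_indices(len(X), n_folds):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k_neighbors) == y[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return accuracies


# Synthetic data (an assumption for the example): two well-separated clusters.
X = [(0.01 * i, 0.0) for i in range(50)] + [(10 + 0.01 * i, 10.0) for i in range(50)]
y = [0] * 50 + [1] * 50
fold_accuracies = cross_validate(X, y)
print(sum(fold_accuracies) / len(fold_accuracies))
```

Each observation is held out exactly once, so the ten per-fold accuracies can be averaged or compared between methods, for instance with the pairwise t-test that the abstract mentions.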