Uncovering Los Angeles Tourists' Patterns Using Geospatial Analysis and Supervised Machine Learning with Random Forest Predictors

Attack and anomaly detection are growing concerns for Internet of Things (IoT) infrastructures. As IoT infrastructure spreads into every domain, threats and attacks against it grow proportionally. This paper compares the performance of several machine learning algorithms in identifying cyber-attacks (namely SYN-DOS attacks) on IoT systems, both in terms of application performance and in terms of training and application times. We use supervised machine learning algorithms included in the MLlib library of Apache Spark, a fast and general engine for big data processing. We show the implementation details and the performance of those algorithms on public datasets using a training set of up to 2 million instances. We adopt a Cloud environment, emphasizing the importance of scalability and elasticity of use. Results show that all the Spark algorithms used achieve very good identification accuracy (>99%). One of them, Random Forest, achieves an accuracy of 1. We also report a very short training time (23.22 sec for Decision Tree with 2 million rows). The experiments also show a very low application time (0.13 sec for more than 600,000 instances for Random Forest) using Apache Spark in the Cloud. Furthermore, the explicit model generated by Random Forest is very easy to implement in high- or low-level programming languages. In light of the results obtained, both in terms of computation times and identification performance, a hybrid approach for the detection of SYN-DOS cyber-attacks on IoT devices is proposed: the application of an explicit Random Forest model, implemented directly on the IoT device, along with a second level analysis (training) performed in the Cloud.
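
The remark that an explicit Random Forest model is easy to implement can be illustrated with a minimal sketch: each tree becomes a chain of threshold tests and the forest takes a majority vote. The feature names (`syn_rate`, `syn_ack_ratio`, `distinct_src_ports`) and all thresholds below are hypothetical and are not taken from the paper; a real model would be exported from the trained forest.

```python
from collections import Counter

# Hypothetical per-flow features and thresholds, for illustration only.
def tree_1(f):
    return "attack" if f["syn_rate"] > 100.0 else "normal"

def tree_2(f):
    if f["syn_ack_ratio"] > 3.0:
        return "attack" if f["syn_rate"] > 20.0 else "normal"
    return "normal"

def tree_3(f):
    return "attack" if f["distinct_src_ports"] > 50 else "normal"

FOREST = (tree_1, tree_2, tree_3)

def predict(features):
    """Return the majority vote of all trees in the forest."""
    votes = Counter(tree(features) for tree in FOREST)
    return votes.most_common(1)[0][0]
```

Because every tree is plain branching logic with no library dependency, the same structure could be ported to a low-level language on a constrained device, while retraining stays in the Cloud.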

Distinguishing Focal Cortical Dysplasia From Glioneuronal Tumors in Patients With Epilepsy by Machine Learning

Frontiers in Neurology ◽

10.3389/fneur.2020.548305 ◽

2020 ◽

Vol 11 ◽

Author(s):

Yi Guo ◽

Yushan Liu ◽

Wenjie Ming ◽

Zhongjin Wang ◽

Junming Zhu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Focal Cortical Dysplasia ◽

Cortical Dysplasia ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Seizure Onset ◽

Glioneuronal Tumors ◽

Patients With Epilepsy

Purpose: We aim to build a supervised machine learning-based classifier to preoperatively distinguish focal cortical dysplasia (FCD) from glioneuronal tumors (GNTs) in patients with epilepsy. Methods: This retrospective study comprised 96 patients who underwent epilepsy surgery, with a final neuropathologic diagnosis of either FCD or GNTs. Seven classical machine learning algorithms (i.e., Random Forest, SVM, Decision Tree, Logistic Regression, XGBoost, LightGBM, and CatBoost) were trained on our dataset to obtain the classification model. Ten features [i.e., Gender, Past history, Age at seizure onset, Course of disease, Seizure type, Seizure frequency, Scalp EEG biomarkers, MRI features, Lesion location, Number of antiepileptic drugs (AEDs)] were analyzed in our study. Results: We enrolled 56 patients with FCD and 40 patients with GNTs, which included 29 with gangliogliomas (GGs) and 11 with dysembryoplastic neuroepithelial tumors (DNTs). Our study demonstrated that the Random Forest-based machine learning model offered the best predictive performance in distinguishing FCD from GNTs, with an F1-score of 0.9180 and an AUC value of 0.9340. Furthermore, the most discriminative factor between FCD and GNTs was the feature “age at seizure onset”, with a Chi-square value of 1,213.0, suggesting that patients who had a younger age at seizure onset were more likely to be diagnosed with FCD. Conclusion: The Random Forest-based machine learning classifier can accurately differentiate FCD from GNTs in patients with epilepsy before surgery. This might lead to improved clinician confidence in appropriate surgical planning and treatment outcomes.
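
For readers unfamiliar with the F1-score reported above, the sketch below computes it as the harmonic mean of precision and recall from confusion-matrix counts. The function name and the example counts are illustrative assumptions, not values from the study.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so the F1-score is also 0.8.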

Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm

Journal of Translational Medicine ◽

10.1186/s12967-020-02550-2 ◽

2020 ◽

Vol 18 (1) ◽

Author(s):

Kerry E. Poppenberg ◽

Vincent M. Tutino ◽

Lu Li ◽

Muhammad Waqas ◽

Armond June ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Model Performance ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Training Cohort ◽

Network Analyses ◽

Machine Learning Methods

Background: Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods: Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results: Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions: We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.
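
The area under the ROC curve used to assess these models can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, assuming binary labels coded as 1 and 0; the function name is illustrative and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic in the number of samples, which is acceptable for a testing cohort of a few dozen samples; rank-based formulas are preferable for large datasets.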

Application of Natural Language Processing with Supervised Machine Learning Techniques to Predict the Overall Drugs Performance

AJIT-e Online Academic Journal of Information Technology ◽

10.5824/ajite.2020.01.001.x ◽

2020 ◽

Vol 11 (40) ◽

pp. 8-23

Author(s):

Pius MARTHIN ◽

Duygu İÇEN

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Semantic Analysis ◽

Classification Tree ◽

Supervised Machine Learning ◽

Training Dataset ◽

Support Vector ◽

Learning Models ◽

Machine Learning Models

Online product reviews have become a valuable source of information that facilitates customer decisions about a particular product. Given the wealth of information on users' satisfaction and experiences with a particular drug, pharmaceutical companies use online drug reviews to improve the quality of their products. Machine learning has enabled scientists to train more efficient models which facilitate decision making in various fields. In this manuscript we applied a drug review dataset used by (Gräßer, Kallumadi, Malberg, & Zaunseder, 2018), freely available from the machine learning repository of the University of California Irvine (UCI), to identify the machine learning model that best predicts overall drug performance from users' reviews. Apart from several manipulations done to improve model accuracy, all necessary procedures required for text analysis were followed, including text cleaning and transformation of texts to numeric format for training machine learning models. Prior to modeling, we obtained overall sentiment scores for the reviews. Customers' reviews were summarized and visualized using a bar plot and word cloud to explore the most frequent terms. Due to scalability issues, we were able to use only a sample of the dataset. We randomly sampled 15000 observations from the 161297 in the training dataset and 10000 observations from the 53766 in the testing dataset. Several machine learning models were trained using 10-fold cross-validation performed under stratified random sampling. The trained models include Classification and Regression Trees (CART), a classification tree by C5.0, logistic regression (GLM), Multivariate Adaptive Regression Splines (MARS), support vector machines (SVM) with both radial and linear kernels, and classification trees using random forest (Random Forest). Model selection was done through a comparison of accuracies and computational efficiency. The support vector machine (SVM) with a linear kernel performed significantly better than the rest, with an accuracy of 83%. Using only a small portion of the dataset, we managed to attain reasonable accuracy in our models by applying the TF-IDF transformation and the Latent Semantic Analysis (LSA) technique to our term-document matrix (TDM).
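
The TF-IDF transformation mentioned above turns raw term counts into weights that discount terms appearing in many documents. The sketch below uses raw term frequency multiplied by log(N / document frequency), which is one common variant; the abstract does not state the exact weighting used, and the whitespace tokenizer and function name are simplifying assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = raw term count * log(N / document frequency). This is one
    common TF-IDF variant, assumed here for illustration.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Count, for each term, the number of documents that contain it.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every review, such as "drug" in a two-review corpus where both mention it, receives weight zero, so only the distinguishing terms carry signal into the later LSA and classification steps.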

Detecting Real-Time Fall of Elderly People Using Machine Learning

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39635 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1913-1918

Author(s):

Prathima P

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Elderly People ◽

Fall Detection ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Support Vector ◽

False Alarms ◽

Severe Injuries

Abstract: Falls are a significant national health issue for elderly people, often resulting in severe injuries when the person lies on the floor for an extended period without aid after a serious fall. Elders therefore need attentive care. A supervised machine learning-based fall detection approach using an accelerometer and a gyroscope is devised. The system detects falls by classifying different actions as fall or non-fall events, and the caretaker is alerted as soon as the person falls. The public dataset SisFall, with an efficient class of features, is used to identify falls. The Random Forest (RF) and Support Vector Machine (SVM) machine learning algorithms are employed to detect falls with fewer false alarms. The SVM algorithm obtains a higher accuracy, 99.23%, than the RF algorithm. Keywords: Fall detection, Machine learning, Supervised classification, Sisfall, Activities of daily living, Wearable sensors, Random Forest, Support Vector Machine
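
The abstract does not list the features it extracts, so the following is only a sketch of one plausible kind of input to such a classifier: the magnitude of each three-axis accelerometer sample, summarised over a window. The function names and the choice of peak, trough, and range are assumptions for illustration.

```python
import math

def acceleration_magnitudes(samples):
    """Euclidean norm of each (x, y, z) accelerometer sample."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]

def window_features(samples):
    """Summarise one window as (peak, trough, range) of the magnitude."""
    mags = acceleration_magnitudes(samples)
    peak, trough = max(mags), min(mags)
    return peak, trough, peak - trough
```

Feature tuples of this kind, computed per window and paired with fall or non-fall labels, are the sort of input a Random Forest or SVM classifier would be trained on.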

Prediction of Autism Spectrum Disorder Using Supervised Machine Learning Algorithms

Asian Journal of Computer Science and Technology ◽

10.51983/ajcst-2019.8.3.2734 ◽

2019 ◽

Vol 8 (3) ◽

pp. 15-18

Author(s):

T. Lakshmi Praveena ◽

N. V. Muthu Lakshmi

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Autism Spectrum Disorder ◽

Random Forest ◽

Decision Tree ◽

Autism Spectrum ◽

Early Years ◽

Spectrum Disorder ◽

Supervised Machine Learning ◽

Treatment Mechanisms

Autism appears to be a neurodevelopmental disorder that is visible in the early years. It is a wide-spectrum disorder, meaning that severity and symptoms can vary from person to person. The Centre for Disease Control found that one in 68 was diagnosed with autism spectrum disorder, with numbers increasing every year. Detection of autism in adults is a cumbersome procedure because in adults many symptoms can blend with other mental health and motor impairment disorders, so misinterpretation of the actual disease can lead to a terrible life without proper diagnosis and effective treatment mechanisms. Machine learning is a powerful computing tool that supports different application domains by learning complex relationships or patterns from large datasets to draw accurate conclusions. Disease assessment through predictive health data analysis, together with more appropriate treatment mechanisms, is now a hot area of research. Supervised learning is an important part of machine learning which uses a rule-based approach, examining empirical data sets to build accurate predictive models. In this paper, decision tree, random forest, SVM, and neural network algorithms are applied to autism spectrum data collected from the UCI repository, and the results of these algorithms on the autism dataset are presented. Analysis of these results will be useful for making the right decisions in predicting autism spectrum disorder (ASD) at early stages. Thus, early autism intervention using machine learning techniques opens up a new way for autistic individuals to develop the potential to lead a better life by improving their behavioural and emotional skills.

A Machine Learning Approach to Identify Predictors of Frequent Vaping and Vulnerable Californian Youth Subgroups

Nicotine & Tobacco Research ◽

10.1093/ntr/ntab257 ◽

2021 ◽

Author(s):

Rui Fu ◽

Jiamin Shi ◽

Michael Chaiton ◽

Adam M Leventhal ◽

Jennifer B Unger ◽

...

Keyword(s):

Machine Learning ◽

Native American ◽

Random Forest ◽

Los Angeles ◽

Perceived Discrimination ◽

Contributing Factors ◽

Nicotine Concentration ◽

Increased Risk ◽

Twelfth Grade

Introduction: Machine learning presents a unique opportunity to improve electronic cigarette (vaping) monitoring in youth. Here we built a random forest model to predict frequent vaping status among Californian youth and to identify contributing factors and vulnerable populations. Methods: In this prospective cohort study, 1,281 ever-vaping twelfth-grade students from metropolitan Los Angeles were surveyed in Fall and again at a 6-month follow-up in Spring. Frequent vaping was measured at the 6-month follow-up as nicotine-containing vaping on 20 or more days in the past 30 days. Predictors (n=131) encompassed sociodemographic characteristics, substance use and perceptions, health status, and characteristics of the household, school and neighborhood. A random forest was developed to identify the top ten predictors of frequent vaping and interactions by sociodemographic variables. Results: Forty participants (3.1%) reported frequent vaping at the follow-up. The random forest outperformed a logistic regression model in prediction (C-Index=0.87 vs. 0.77). Higher past-month nicotine concentration in vape, more daily vaping sessions, and greater nicotine dependence were the top three of the ten most important predictors of frequent vaping. Interactions were found between age and perceived discrimination, and between age and race/ethnicity: those who were younger than their classmates and either reported experiencing discrimination frequently or identified as Asian or Native American/Pacific Islander were at increased risk of becoming frequent vapers. Conclusions: Machine learning can produce models that accurately predict progression of vaping behaviours among youth. The potential association between frequent vaping and perceived discrimination warrants more in-depth analyses to confirm if discrimination constitutes a cause of increased vaping.

Development of a Machine Learning-Based Damage Identification Method Using Multi-Point Simultaneous Acceleration Measurement Results

Sensors ◽

10.3390/s20102780 ◽

2020 ◽

Vol 20 (10) ◽

pp. 2780 ◽

Cited By ~ 1

Author(s):

Pang-jo Chun ◽

Tatsuro Yamane ◽

Shota Izumi ◽

Naoya Kuramoto

Keyword(s):

Machine Learning ◽

Random Forest ◽

Damage Identification ◽

Supervised Machine Learning ◽

Damage Evaluation ◽

Acceleration Measurement ◽

Multiple Points ◽

Measurement Results ◽

Logarithmic Decay ◽

Damage Types

It is necessary to assess damage properly for the safe use of a structure and for the development of an appropriate maintenance strategy. Although many efforts have been made to measure the vibration of a structure to determine the degree of damage, the accuracy of evaluation is not high enough, so it is difficult to say that damage evaluation based on vibrations in a structure has been put to practical use. In this study, we propose a method to evaluate damage by measuring the acceleration of a structure at multiple points and interpreting the results with a Random Forest, which is a kind of supervised machine learning. The proposed method uses the maximum response acceleration, standard deviation, logarithmic decay rate, and natural frequency to improve the accuracy of damage assessment. We propose a three-step Random Forest method to evaluate various damage types based on the results of these many measurements. The accuracy of the proposed method is then verified based on the results of a cross-validation and a vibration test of an actual damaged specimen.
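
Three of the four features named above can be sketched from a single acceleration record. The code below assumes that "logarithmic decay rate" means the standard logarithmic decrement averaged over successive peak amplitudes, and that the peaks have already been extracted; the function names are illustrative, and natural frequency is omitted because it would need a spectral estimate.

```python
import math
import statistics

def log_decrement(peaks):
    """Average logarithmic decrement over successive positive peak amplitudes."""
    if len(peaks) < 2 or any(p <= 0 for p in peaks):
        raise ValueError("need at least two positive peak amplitudes")
    cycles = len(peaks) - 1
    return math.log(peaks[0] / peaks[-1]) / cycles

def summary_features(acceleration):
    """Maximum absolute response and population standard deviation of a record."""
    return max(abs(a) for a in acceleration), statistics.pstdev(acceleration)
```

For peaks that halve each cycle, such as 8, 4, 2, the decrement is ln 2 per cycle; a change in this decay for the same excitation is the kind of signal a damage classifier could exploit.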

A Supervised Machine Learning Approach to Detect the On/Off State in Parkinson’s Disease Using Wearable Based Gait Signals

Diagnostics ◽

10.3390/diagnostics10060421 ◽

2020 ◽

Vol 10 (6) ◽

pp. 421

Author(s):

Satyabrata Aich ◽

Jinyoung Youn ◽

Sabyasachi Chakraborty ◽

Pyari Mohan Pradhan ◽

Jin-han Park ◽

...

Keyword(s):

Machine Learning ◽

Parkinson's Disease ◽

Random Forest ◽

Wearable Devices ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Support Vector ◽

Healthcare Applications ◽

Reported Data

Fluctuations in motor symptoms are mostly observed in Parkinson’s disease (PD) patients. This characteristic is inevitable, and can affect the quality of life of the patients. However, it is difficult to collect precise data on the fluctuation characteristics using self-reported data from PD patients. Therefore, it is necessary to develop a suitable technology that can detect the medication state, also termed the “On”/“Off” state, automatically using wearable devices; at the same time, this could be used in the home environment. Recently, wearable devices, in combination with powerful machine learning techniques, have shown the potential to be effectively used in critical healthcare applications. In this study, an algorithm is proposed that can detect the medication state automatically using wearable gait signals. A combination of statistical features and spatiotemporal gait features is used as input to four different classifiers: random forest, support vector machine, K nearest neighbour, and Naïve Bayes. In total, 20 PD subjects with definite motor fluctuations were evaluated by comparing the performance of the proposed algorithm with each of the four aforementioned classifiers. It was found that random forest outperformed the other classifiers with an accuracy of 96.72%, a recall of 97.35%, and a precision of 96.92%.
