An Enhanced Machine Learning Topic Classification Methodology for Cybersecurity

2021 ◽  
Author(s):  
Elijah Pelofske ◽  
Lorie M. Liebrock ◽  
Vincent Urias

In this research, we use user-defined labels from three internet text sources (Reddit, Stackexchange, Arxiv) to train 21 different machine learning models for the topic classification task of detecting cybersecurity discussions in natural text. We analyze the false positive and false negative rates of each of the 21 models in a cross-validation experiment. We then present a Cybersecurity Topic Classification (CTC) tool, which takes the majority vote of the 21 trained machine learning models as the decision mechanism for detecting cybersecurity-related text. We also show that the majority-vote mechanism of the CTC tool provides lower false negative and false positive rates on average than any of the 21 individual models. Finally, we show that the CTC tool scales to hundreds of thousands of documents with a wall-clock time on the order of hours.
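The majority-vote decision mechanism described in the abstract can be sketched in a few lines of Python. The 21 trained models are stood in for here by a simple list of binary votes; the function name and the vote counts are illustrative, not taken from the CTC tool itself.

```python
def majority_vote(votes):
    """Return 1 (cybersecurity-related) when more than half of the
    per-model votes are 1, else 0. `votes` holds one binary
    prediction per trained model (21 in the CTC tool)."""
    return 1 if sum(votes) > len(votes) / 2 else 0

# Example: 13 of 21 hypothetical models flag a document as cybersecurity
print(majority_vote([1] * 13 + [0] * 8))  # → 1
```

A strict majority over an odd number of voters (21) conveniently avoids ties.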

2021 ◽  
Vol 13 (3) ◽  
pp. 408
Author(s):  
Charles Nickmilder ◽  
Anthony Tedde ◽  
Isabelle Dufrasne ◽  
Françoise Lessire ◽  
Bernard Tychon ◽  
...  

Accurate information about the available standing biomass on pastures is critical for the adequate management of grazing and its promotion to farmers. In this paper, machine learning models are developed to predict available biomass, expressed as compressed sward height (CSH), from readily accessible meteorological, optical (Sentinel-2) and radar (Sentinel-1) satellite data. This study assumed that combining heterogeneous data sources, data transformations and machine learning methods would improve the robustness and accuracy of the developed models. A total of 72,795 spatially positioned records of CSH, collected in 2018 and 2019, were used and aggregated according to a pixel-like pattern. The resulting dataset was split into a training set of 11,625 pixellated records and an independent validation set of 4952 pixellated records. The models were trained with a 19-fold cross-validation. A wide range of performances was observed (with mean cross-validation root mean square error (RMSE) ranging from 22.84 mm of CSH to infinite-like values), and the four best-performing models were a cubist, a glmnet, a neural network and a random forest. These models had an independent-validation RMSE lower than 20 mm of CSH at the pixel level. To simulate the behavior of the model in a decision support system, performance at the paddock level was also studied, computed according to two scenarios: either the predictions were made at the sub-parcel level and then aggregated, or the data were aggregated at the parcel level and the predictions were made for these aggregated data. The results obtained in this study were more accurate than those found in the literature on pasture budgeting and grassland biomass evaluation. The training of the 124 models resulting from the described framework was part of the realization of a decision support system to help farmers in their daily decision making.
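For reference, the RMSE used to score these models (in mm of CSH) is computed as below; the observed and predicted values are invented for illustration and are not from the study's dataset.

```python
import math

def rmse(observed, predicted):
    """Root mean square error, here expressed in mm of compressed
    sward height (CSH)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

obs = [50.0, 60.0, 55.0]   # measured CSH in mm (illustrative)
pred = [48.0, 63.0, 54.0]  # model output in mm (illustrative)
print(round(rmse(obs, pred), 2))  # → 2.16
```

The "infinite-like" cross-validation values reported above correspond to models whose squared errors blow up on some folds, which this formula makes easy to see.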


2018 ◽  
Vol 124 (5) ◽  
pp. 1284-1293 ◽  
Author(s):  
Alexander H. K. Montoye ◽  
Bradford S. Westgate ◽  
Morgan R. Fonley ◽  
Karin A. Pfeiffer

Wrist-worn accelerometers are gaining popularity for measurement of physical activity. However, few methods for predicting physical activity intensity from wrist-worn accelerometer data have been tested on data not used to create the methods (out-of-sample data). This study utilized two previously collected data sets [Ball State University (BSU) and Michigan State University (MSU)] in which participants wore a GENEActiv accelerometer on the left wrist while performing sedentary, lifestyle, ambulatory, and exercise activities in simulated free-living settings. Activity intensity was determined via direct observation. Four machine learning models (plus two combination methods) and six feature sets were used to predict activity intensity (30-s intervals) from the accelerometer data. Leave-one-out cross-validation and out-of-sample testing were performed to evaluate accuracy in activity intensity prediction, and classification accuracies were used to determine differences among feature sets and machine learning models. In out-of-sample testing, the random forest model (77.3–78.5%) had higher accuracy than other machine learning models (70.9–76.4%) and accuracy similar to the combination methods (77.0–77.9%). Feature sets utilizing frequency-domain features had improved accuracy over other feature sets in leave-one-out cross-validation (92.6–92.8% vs. 87.8–91.9% in the MSU data set; 79.3–80.2% vs. 76.7–78.4% in the BSU data set) but similar or worse accuracy in out-of-sample testing (74.0–77.4% vs. 74.1–79.1% in the MSU data set; 76.1–77.0% vs. 75.5–77.3% in the BSU data set). All machine learning models outperformed the Euclidean norm minus one (ENMO)/GGIR method in out-of-sample testing (69.5–78.5% vs. 53.6–70.6%). From these results, we recommend out-of-sample testing to confirm the generalizability of machine learning models. Additionally, random forest models and feature sets with only time-domain features provided the best accuracy for activity intensity prediction from a wrist-worn accelerometer. NEW & NOTEWORTHY This study includes in-sample and out-of-sample cross-validation of an alternate method for deriving meaningful physical activity outcomes from accelerometer data collected with a wrist-worn accelerometer. This method uses machine learning to directly predict activity intensity. In so doing, this study provides a classification model that may avoid the high errors present with energy expenditure prediction while still allowing researchers to assess adherence to physical activity guidelines.
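The leave-one-out cross-validation used in this study can be sketched as follows. A toy one-dimensional nearest-neighbour classifier stands in for the accelerometer models, and the samples (feature value, intensity label) are invented; only the hold-one-out loop itself reflects the evaluation procedure described above.

```python
def nearest_neighbour(train, query):
    """Predict the label of the training sample closest to `query`."""
    return min(train, key=lambda sample: abs(sample[0] - query))[1]

def loocv_accuracy(samples):
    """Hold each sample out in turn, fit on the rest, score the prediction."""
    correct = 0
    for i, (x, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        correct += nearest_neighbour(train, x) == label
    return correct / len(samples)

data = [(0.1, "sedentary"), (0.2, "sedentary"), (1.9, "vigorous"), (2.1, "vigorous")]
print(loocv_accuracy(data))  # → 1.0
```

Out-of-sample testing, by contrast, would fit once on one data set (e.g. MSU) and score on the other (BSU), with no held-out loop.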


Author(s):  
James M. Holt ◽  
Melissa Kelly ◽  
Brett Sundlof ◽  
Ghunwa Nakouzi ◽  
David Bick ◽  
...  

Abstract Purpose: Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity, it also results in increased turnaround time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing. Methods: We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results with an established set of variants for each genome referred to as a truth set. We then trained machine learning models to identify variants that were labeled as false positives. Results: After training, the models identified 99.5% of the false positive heterozygous single-nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of nonactionable, nonprimary SNVs by 85% and indels by 75%. Employing the algorithm in clinical practice reduced overall orthogonal testing using dideoxynucleotide (Sanger) sequencing by 71%. Conclusion: Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.
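The triage logic this enables can be sketched as below. This is not the STEVE implementation: the variant identifiers, the `score` field (a hypothetical model-estimated true-positive probability), and the threshold are all illustrative; only the idea of routing high-confidence calls past Sanger confirmation comes from the abstract.

```python
def triage(variants, threshold=0.99):
    """Split variant calls into those trusted by the model and those
    still sent for orthogonal (Sanger) confirmation."""
    trusted = [v for v in variants if v["score"] >= threshold]
    confirm = [v for v in variants if v["score"] < threshold]
    return trusted, confirm

calls = [{"id": "chr1:12345A>G", "score": 0.999},  # hypothetical call
         {"id": "chr2:67890C>T", "score": 0.42}]   # hypothetical call
trusted, confirm = triage(calls)
print(len(trusted), len(confirm))  # → 1 1
```

The reported 71% reduction in Sanger sequencing corresponds to the fraction of calls landing in the `trusted` bucket in routine practice.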


2020 ◽  
pp. 98-105
Author(s):  
Darshan Jagannath Pangarkar ◽  
Rajesh Sharma ◽  
Amita Sharma ◽  
Madhu Sharma

Prediction of crop yield can help traders, agri-businesses and government agencies plan their activities accordingly, and can help government agencies manage situations such as over- or under-production. Traditionally, statistical and crop-simulation methods have been used for this purpose; machine learning models can be of great help here. The aim of the present study is to assess the predictive ability of various machine learning models for cluster bean (Cyamopsis tetragonoloba L. Taub.) yield prediction. Various machine learning models were applied and tested on panel data covering 19 years, from 1999-2000 to 2017-18, for the Bikaner district of Rajasthan. Several data mining steps were performed before building a model. K-Nearest Neighbors (K-NN), Support Vector Regression (SVR) with various kernels, and random forest regression were applied. Cross-validation was also performed to assess extra-sample validity. The best-fitted model was chosen based on cross-validation scores and R2 values. Besides the coefficient of determination (R2), the root mean squared error (RMSE), mean absolute error (MAE), and root relative squared error (RRSE) were calculated for the testing set. Support vector regression with a linear kernel had the lowest RMSE (23.19), RRSE (0.14) and MAE (19.27) values, followed by random forest regression and second-degree polynomial support vector regression with gamma = auto. In terms of R2, support vector regression with a linear kernel ranked first (98.31%), followed by second-degree polynomial support vector regression with gamma = auto (89.83%) and with gamma = scale (88.83%). On two-fold cross-validation, support vector regression with a linear kernel had the highest cross-validation score, explaining 71% (+/-0.03) of the variance, followed by second-degree polynomial support vector regression with gamma = auto and random forest regression. K-NN and support vector regression with a radial basis function kernel had negative cross-validation scores. Support vector regression with a linear kernel was found to be the best-fitted model for predicting yield, as it had the highest sample validity (98.31%) and global validity (71%).
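The three error metrics reported for the testing set can be computed as below; the actual and predicted yields are invented values for illustration, not figures from the study.

```python
import math

def regression_metrics(actual, predicted):
    """RMSE, MAE and root relative squared error (RRSE) on a test set.
    RRSE divides the squared error by that of the mean-only baseline."""
    n = len(actual)
    mean_actual = sum(actual) / n
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    rmse = math.sqrt(sq_err / n)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rrse = math.sqrt(sq_err / sum((a - mean_actual) ** 2 for a in actual))
    return rmse, mae, rrse

# Invented yields for illustration
actual, predicted = [100.0, 120.0, 140.0], [105.0, 115.0, 150.0]
print(tuple(round(m, 2) for m in regression_metrics(actual, predicted)))
# → (7.07, 6.67, 0.43)
```

An RRSE below 1, like the 0.14 reported for the linear-kernel SVR, means the model beats simply predicting the mean yield.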


2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Yinghua Zhao ◽  
Lianying Yang ◽  
Changqing Sun ◽  
Yang Li ◽  
Yangzhige He ◽  
...  

Acute appendicitis is one of the most common acute abdomens, but confident preoperative diagnosis remains a challenge. To profile noninvasive urinary biomarkers that could discriminate acute appendicitis from other acute abdomens, we carried out mass spectrometric experiments on urine samples from patients with different acute abdomens and evaluated the diagnostic potential of urinary proteins with various machine-learning models. First, outlier protein pools for acute appendicitis and controls were constructed using the discovery dataset (32 acute appendicitis and 41 control acute abdomens) against a reference set of 495 normal urine samples. Ten outlier proteins were then selected by a feature selection algorithm and applied in the construction of machine-learning models using naïve Bayes, support vector machine, and random forest algorithms. The models were assessed in the discovery dataset by leave-one-out cross-validation and verified in the validation dataset (16 acute appendicitis and 45 control acute abdomens). Among the three models, the random forest model achieved the best performance: accuracy was 84.9% in leave-one-out cross-validation of the discovery dataset and 83.6% (sensitivity: 81.2%, specificity: 84.4%) in the validation dataset. In conclusion, we developed a 10-protein diagnostic panel based on the random forest model that was able to distinguish acute appendicitis from confusable acute abdomens with high specificity, indicating the clinical potential of noninvasive urinary markers in disease diagnosis.
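The validation-set figures above (sensitivity, specificity, accuracy) come from a standard confusion-matrix calculation, sketched below; the labels in the example are invented, not the study's patient data.

```python
def evaluate(y_true, y_pred, positive="appendicitis"):
    """Sensitivity, specificity and accuracy for a binary diagnostic panel."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)

y_true = ["appendicitis"] * 4 + ["other"] * 4
y_pred = ["appendicitis", "appendicitis", "appendicitis", "other",
          "other", "other", "other", "appendicitis"]
print(evaluate(y_true, y_pred))  # → (0.75, 0.75, 0.75)
```

Here sensitivity is the fraction of true appendicitis cases caught, and specificity the fraction of other acute abdomens correctly ruled out.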


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Munetoshi Akazawa ◽  
Kazunori Hashimoto ◽  
Noda Katsuhiko ◽  
Yoshida Kaname

Abstract Postpartum hemorrhage is the leading cause of maternal morbidity. Clinical prediction of postpartum hemorrhage remains challenging, particularly in the case of a vaginal birth. We studied machine learning models to predict postpartum hemorrhage. Women who underwent vaginal birth at the Tokyo Women Medical University East Center between 1995 and 2020 were included. We used 11 clinical variables to predict postpartum hemorrhage, defined as a blood loss of > 1000 mL. We constructed five machine learning models and a deep learning model consisting of a two-layer neural network applied to the ensemble outputs of the five machine learning classifiers, namely logistic regression, a support vector machine, random forest, boosting trees, and a decision tree. To evaluate performance, we used the area under the receiver operating characteristic curve (AUC), accuracy, false positive rate (FPR) and false negative rate (FNR). The importance of each variable was evaluated through a comparison of the feature importance calculated using a boosted tree. A total of 9,894 patients who underwent vaginal birth were enrolled in the study, including 188 cases (1.9%) with blood loss of > 1000 mL. The best model predicted postpartum hemorrhage with an AUC of 0.708, an accuracy of 0.686, an FPR of 0.312, and an FNR of 0.398. The analysis of variable importance showed that gestational age at labor, maternal weight on admission for labor, and maternal weight before pregnancy were the most heavily weighted factors. Machine learning models can predict postpartum hemorrhage during vaginal delivery. Further research should analyze appropriate variables and prepare big data, such as hundreds of thousands of cases.
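One simple way to combine the five base classifiers, assuming the ensemble averages their predicted probabilities before thresholding (the abstract does not specify the exact combination rule), is a soft vote; the probabilities below are invented stand-ins, not outputs of the trained models.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average the per-classifier hemorrhage probabilities and
    threshold. `probabilities` holds one value per base classifier
    (e.g. logistic regression, SVM, random forest, boosting, decision
    tree)."""
    return int(sum(probabilities) / len(probabilities) >= threshold)

print(soft_vote([0.62, 0.55, 0.48, 0.71, 0.60]))  # → 1
```

In the study, a two-layer neural network plays the role of this combiner, learning weights for the base outputs rather than averaging them equally.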


Sensors ◽  
2021 ◽  
Vol 21 (21) ◽  
pp. 7026
Author(s):  
Louis Devillaine ◽  
Raphaël Lambert ◽  
Jérôme Boutet ◽  
Saifeddine Aloui ◽  
Vincent Brault ◽  
...  

Five to ten percent of school-aged children display dysgraphia, a neuro-motor disorder that causes difficulties in handwriting and becomes a handicap in the daily life of these children. Yet the diagnosis of dysgraphia remains tedious, subjective and language-dependent, and typically comes late in a child's schooling. We propose a pre-diagnosis tool for dysgraphia based on drawings called graphomotor tests, recorded using graphical tablets. We evaluate and compare several machine-learning models to build this tool. A database comprising 305 children from the region of Grenoble, including 43 children with dysgraphia, was established and diagnosed by specialists using the BHK test, the gold standard for the diagnosis of dysgraphia in France. We performed classification tests by extracting, correcting and selecting features from the raw data collected with the tablets, and achieved a maximum cross-validated accuracy of 73% with three models. These promising results highlight the relevance of graphomotor tests for diagnosing dysgraphia earlier and more broadly.
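As an illustration of the kind of kinematic feature that can be extracted from tablet recordings, the sketch below computes mean pen speed from timestamped positions. The feature and the sample trace are ours for illustration; the study's exact feature set is not specified here.

```python
import math

def mean_speed(samples):
    """Mean pen speed from a list of (x, y, t) tablet samples,
    in position units per time unit."""
    speeds = []
    for (x0, y0, t0), (x1, y1, t1) in zip(samples, samples[1:]):
        speeds.append(math.hypot(x1 - x0, y1 - y0) / (t1 - t0))
    return sum(speeds) / len(speeds)

trace = [(0, 0, 0.0), (3, 4, 1.0), (6, 8, 2.0)]  # illustrative pen trace
print(mean_speed(trace))  # → 5.0
```

Features like this, computed per test, would then be corrected, selected and fed to the classifiers described above.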


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Cedric Gangloff ◽  
Sonia Rafi ◽  
Guillaume Bouzillé ◽  
Louis Soulat ◽  
Marc Cuggia

Abstract The reverse transcription-polymerase chain reaction (RT-PCR) assay is the accepted standard for coronavirus disease 2019 (COVID-19) diagnosis. Like any test, RT-PCR yields false negative results, which clinicians can rectify by confronting clinical, biological and imaging data. Combining RT-PCR and chest-CT could improve diagnostic performance, but doing so rapidly for all patients with suspected COVID-19 would require considerable resources. The potential contribution of machine learning in this situation has not been fully evaluated. The objective of this study was to develop and evaluate machine learning models using routine clinical and laboratory data to improve the performance of RT-PCR and chest-CT for COVID-19 diagnosis among post-emergency hospitalized patients. All adults admitted to the ED for suspected COVID-19 and then hospitalized at Rennes academic hospital, France, between March 20, 2020 and May 5, 2020 were included in the study. Three model types were created: logistic regression, random forest, and neural network. Each model was trained to diagnose COVID-19 using different sets of variables. The area under the receiver operating characteristic curve (AUC) was the primary outcome used to evaluate model performance. 536 patients were included in the study: 106 in the COVID group and 430 in the NOT-COVID group. With the contribution of machine learning, the AUC values of chest-CT and RT-PCR increased from 0.778 to 0.892 and from 0.852 to 0.930, respectively. If these results generalize, machine learning models could improve the performance of chest-CT and RT-PCR for COVID-19 diagnosis.
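The AUC used as the primary outcome can be computed directly from model scores as the probability that a randomly chosen case outscores a randomly chosen control (ties counting half); the scores below are invented for illustration.

```python
def auc(case_scores, control_scores):
    """AUC via pairwise comparison: fraction of (case, control) pairs
    where the case score is higher, counting ties as half a win."""
    pairs = [(c, n) for c in case_scores for n in control_scores]
    wins = sum(c > n for c, n in pairs) + 0.5 * sum(c == n for c, n in pairs)
    return wins / len(pairs)

print(round(auc([0.9, 0.8, 0.6], [0.7, 0.4, 0.3]), 3))  # → 0.889
```

This pairwise formulation is equivalent to integrating the ROC curve, and makes clear why an AUC of 0.930 means the model ranks a COVID case above a non-COVID patient 93% of the time.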


2020 ◽  
Author(s):  
James M. Holt ◽  
Melissa Wilk ◽  
Brett Sundlof ◽  
Ghunwa Nakouzi ◽  
David Bick ◽  
...  

Abstract Purpose: Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity, it also results in increased turnaround time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing. Methods: We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results to an established set of variants for each genome referred to as a 'truth set'. We then trained machine learning models to identify variants that were labeled as false positives. Results: After training, the models identified 99.5% of the false positive heterozygous single nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of true positive SNVs to 1.67% and indels to 20.29%. Employing the algorithm in clinical practice reduced orthogonal testing using dideoxynucleotide (Sanger) sequencing by 78.22%. Conclusion: Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.

