Data Poisoning against Differentially-Private Learners: Attacks and Defenses

Machine learning to identify lung cancer with tuberculosis from isolated tuberculosis (Preprint)

10.2196/preprints.35101 ◽

2021 ◽

Author(s):

Zhenhao Li

Keyword(s):

Machine Learning ◽

Lung Cancer ◽

Cancer Patients ◽

Tumor Markers ◽

Learning Algorithm ◽

Laboratory Data ◽

Pathological Examination ◽

Training Set ◽

Lung Cancer Patients ◽

Tb Pcr

Tuberculosis (TB) is a precipitating cause of lung cancer. Lung cancer patients with coexisting TB are difficult to differentiate from patients with isolated TB. The aim of this study is to develop a prediction model that distinguishes the comorbidity from TB alone. In this work, based on the laboratory data from 389 patients, 81 features, including the main laboratory examinations (blood test, biochemical test, coagulation assay and tumor markers) and baseline information, were initially used as integrated markers and then reduced to form a discrimination system consisting of 31 top-ranked indices. Patients diagnosed with TB by TB PCR >1 mtb/ml served as negative samples; lung cancer patients with TB, confirmed by pathological examination and TB PCR >1 mtb/ml, served as positive samples. We used the Spatially Uniform ReliefF (SURF) algorithm to determine feature importance, and the predictive model was built using the Random Forest machine learning algorithm. For cross-validation, the samples were randomly split into four training sets and one test set. The selected features are composed of four tumor markers (Scc, Cyfra21-1, CEA, ProGRP and NSE), fifteen blood biochemical indices (GLU, IBIL, K, CL, Ur, NA, TBA, CHOL, SA, TG, A/G, AST, CA, CREA and CRP), six routine blood indices (EO#, EO%, MCV, RDW-S, LY# and MPV) and four coagulation indices (APTT ratio, APTT, PTA, TT ratio). This model presented a robust and stable classification performance, differentiating the comorbidity group from the isolated TB group with AUC, ACC, sensitivity and specificity of 0.8817, 0.8654, 0.8594 and 0.8656 for the training set, respectively. Overall, this work may provide a novel strategy for identifying TB patients with lung cancer from routine admission lab examinations, with the advantages of being timely and economical. It also indicated that our model with enough indices may further increase the effectiveness and efficiency of diagnosis.
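
The split into four training parts and one test part corresponds to five-fold cross-validation. The following is a minimal sketch of such a partition, assuming a plain shuffled split of sample indices; the function name `k_fold_splits`, the seed, and the use of bare indices in place of patient records are illustrative assumptions, not the authors' code.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test

# 389 samples, matching the patient count in the abstract.
# Each round trains on four folds and tests on the remaining one.
splits = list(k_fold_splits(389, k=5))
```

Each sample appears in exactly one test fold, so every patient is used for testing once and for training four times.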

QTG-Finder2: A Generalized Machine-Learning Algorithm for Prioritizing QTL Causal Genes in Plants

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401122 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2411-2421

Author(s):

Fan Lin ◽

Elena Z. Lazarus ◽

Seung Y. Rhee

Keyword(s):

Machine Learning ◽

Linkage Mapping ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Causal Gene ◽

Training Set ◽

Average Precision ◽

Trait Improvement ◽

Causal Genes ◽

Mapping Process

Linkage mapping has been widely used to identify quantitative trait loci (QTL) in many plants and usually requires a time-consuming and labor-intensive fine mapping process to find the causal gene underlying the QTL. Previously, we described QTG-Finder, a machine-learning algorithm to rationally prioritize candidate causal genes in QTLs. While it showed good performance, QTG-Finder could only be used in Arabidopsis and rice because of the limited number of known causal genes in other species. Here we tested the feasibility of enabling QTG-Finder to work on species that have few or no known causal genes by using orthologs of known causal genes as the training set. The model trained with orthologs could recall about 64% of Arabidopsis and 83% of rice causal genes when the top 20% ranked genes were considered, which is similar to the performance of models trained with known causal genes. The average precision was 0.027 for Arabidopsis and 0.029 for rice. We further extended the algorithm to include polymorphisms in conserved non-coding sequences and gene presence/absence variation as additional features. Using this algorithm, QTG-Finder2, we trained and cross-validated Sorghum bicolor and Setaria viridis models. The S. bicolor model was validated by causal genes curated from the literature and could recall 70% of causal genes when the top 20% ranked genes were considered. In addition, we applied the S. viridis model and public transcriptome data to prioritize a plant height QTL and identified 13 candidate genes. QTG-Finder2 can accelerate the discovery of causal genes in any plant species and facilitate agricultural trait improvement.
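
The two figures of merit quoted above, recall within the top 20% of ranked genes and average precision, can be illustrated on a toy ranking. This sketch assumes the standard ranking definition of average precision (the mean of the precision values at the rank of each causal gene); the abstract does not state which definition was used, and the gene names and function names below are hypothetical.

```python
def recall_at_top_fraction(ranked_genes, causal_genes, fraction=0.2):
    """Share of known causal genes found in the top `fraction` of a ranked list."""
    causal = set(causal_genes)
    cutoff = max(1, int(len(ranked_genes) * fraction))
    hits = sum(1 for gene in ranked_genes[:cutoff] if gene in causal)
    return hits / len(causal)

def average_precision(ranked_genes, causal_genes):
    """Mean of the precision values taken at the rank of each causal gene."""
    causal = set(causal_genes)
    hits = 0
    precisions = []
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in causal:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(causal) if causal else 0.0

# Hypothetical ranking of ten candidate genes, best score first.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
causal = {"g2", "g7"}

top20_recall = recall_at_top_fraction(ranked, causal)
ap = average_precision(ranked, causal)
```

Because average precision penalizes every non-causal gene ranked above a causal one, it can be low even when recall in the top 20% is high, which is the usual situation when causal genes are rare among the candidates.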

An Innovative Deep Learning Algorithm for Drowsiness Detection from EEG Signal

Computation ◽

10.3390/computation7010013 ◽

2019 ◽

Vol 7 (1) ◽

pp. 13 ◽

Cited By ~ 10

Author(s):

Francesco Rundo ◽

Sergio Rinella ◽

Simona Massimino ◽

Marinella Coco ◽

Giorgio Fallica ◽

...

Keyword(s):

Machine Learning ◽

Visual Information ◽

Learning Algorithm ◽

Eeg Signal ◽

Training Set ◽

Robust Algorithms ◽

Physiological Measurement ◽

Deep Learning Algorithm ◽

Wide Range ◽

Electroencephalogram Eeg

The development of detection methodologies for reliable drowsiness tracking is a challenging task requiring both appropriate signal inputs and accurate and robust algorithms of analysis. The aim of this research is to develop an advanced method to detect the drowsiness stage in the electroencephalogram (EEG), the most reliable physiological measurement, using promising machine learning methodologies. The methods used in this paper are based on machine learning methodologies such as a stacked autoencoder with softmax layers. Results obtained from 62 volunteers indicate 100% accuracy in drowsy/wakeful discrimination, proving that this approach can be very promising for use in the next generation of medical devices. This methodology can be extended to other everyday uses in which maintaining the level of vigilance is critical. Future work aims to perform extended validation of the proposed pipeline with a wide-range training set in which we integrate the photoplethysmogram (PPG) signal and visual information with EEG analysis in order to improve the robustness of the overall approach.

BREAST CANCER DETECTION USING MAMMOGRAM FEATURES USING RANDOM FOREST ALGORITHM

INTERNATIONAL JOURNAL FOR ADVANCED RESEARCH IN SCIENCE & TECHNOLOGY ◽

10.48047/ijarst/v10/i11/02 ◽

2020 ◽

pp. 12-15

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Random Forest ◽

Cancer Detection ◽

Learning Algorithm ◽

Breast Cancer Dataset ◽

Random Forest Algorithm ◽

Training Set ◽

Cancer Dataset ◽

Breast Cells

Breast cancer is one of the most dangerous diseases for women. This cancer occurs when some breast cells begin to grow abnormally. Machine learning is the subfield of computer science that studies programs that generalize from past experience. This project looks at classification, where an algorithm tries to predict the label for a sample. The machine learning algorithm takes many of these samples, called the training set, and builds an internal model. This model is then used to classify and predict the data. There are two classes, benign and malignant. A Random Forest classifier is used to predict whether the cancer is benign or malignant. The model is trained and tested on the Wisconsin Diagnostic Breast Cancer dataset.

Profiling waitlisted incoming students for future delinquency with an ensemble of statistical machine learning algorithms

10.7287/peerj.preprints.3312 ◽

2017 ◽

Author(s):

Maureen Lyndel C Lauron ◽

Jaderick P Pabico

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Decision Support Tool ◽

Machine Learning Algorithms ◽

Statistical Machine Learning ◽

Training Set ◽

Support Tool ◽

Incoming Freshman ◽

Validation Set ◽

Incoming Students

Given a dataset \(\mathcal{R}=\{R_1, R_2, \dots, R_r\}\) of \(r\)~records of waitlisted incoming freshman students (WIFS), where for any \(i=1, 2, \dots, r\), \(R_i\) is a \((m+1)\)--tuple \((O_i, P_i^{(1)}, P_i^{(2)}, \dots, P_i^{(m)})\), \(O_i\) is one of a set \(\mathcal{O}=\{O_1, O_2, \dots, O_o\}\) of \(o\)~classes, and \(P_i^{(1)}, P_i^{(2)}, \dots, P_i^{(m)}\) are \(m\)~potential predictors for~\(O_i\), our purpose is to find a statistical machine learning algorithm (SMLA) \(\mathbb{A}\) such that \(V_i=\mathbb{A}(P_i^{(1)}, P_i^{(2)}, \dots, P_i^{(m)})\), where \(V_i\) is a class predicted by~\(\mathbb{A}\) that was developed using the correct number \(n\le m\) of predictors for \(O\in\mathcal{O}\), and \(\mathbb{A}\)~is the best algorithm such that the metric \(v^{-1}\sum_{i=1}^v |O_i - V_i|\) is minimal across \(v<r\)~records in the validation set \(\mathcal{V}\subset\mathcal{R}\). Our problem is to find the subset \(\{P_i^{(1)}, P_i^{(2)}, \dots, P_i^{(n)}\}\) and to train \(\mathbb{A}\)~using \(t<r\) records from the training set \(\mathcal{T}\subset\mathcal{R}\), such that \(\mathcal{T}\cap\mathcal{V}=\emptyset\), so that \(\mathbb{A}\)~can predict whether a WIFS trying to enter an undergraduate program at UPLB will incur at least one ``delinquency'' once the student is accepted into the program. \(\mathbb{A}\)~can be a useful decision-support tool for UPLB deans and college secretaries in deciding whether a WIFS will be accepted into the program or not.
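
The selection criterion \(v^{-1}\sum_{i=1}^v |O_i - V_i|\) is a mean absolute difference between observed and predicted classes over the validation set. A minimal sketch of computing it and picking the candidate with the lowest value follows, assuming classes are encoded as numbers; the toy records and the two candidate "algorithms" are hypothetical stand-ins, not the authors' ensemble.

```python
def validation_error(observed, predicted):
    """Mean absolute difference between observed and predicted class codes."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(o - v) for o, v in zip(observed, predicted)) / len(observed)

def best_algorithm(candidates, predictors, observed):
    """Return (name of the lowest-error candidate, per-candidate errors)."""
    errors = {
        name: validation_error(observed, [algo(p) for p in predictors])
        for name, algo in candidates.items()
    }
    return min(errors, key=errors.get), errors

# Hypothetical validation set: each tuple holds two predictors P^(1), P^(2);
# observed classes are coded 0 (no delinquency) or 1 (delinquency).
validation_predictors = [(1, 0), (0, 1), (1, 1), (0, 0)]
validation_observed = [1, 0, 1, 0]

# Two toy candidates: each simply echoes one predictor as its prediction.
candidates = {
    "uses_first_predictor": lambda p: p[0],
    "uses_second_predictor": lambda p: p[1],
}

best_name, errors = best_algorithm(candidates, validation_predictors, validation_observed)
```

With 0/1 class codes the metric reduces to the misclassification rate; with more than two classes it depends on how the classes are numbered, which is an encoding choice the abstract leaves open.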

Deriving Field Scale Soil Moisture from Satellite Observations and Ground Measurements in a Hilly Agricultural Region

Remote Sensing ◽

10.3390/rs11222596 ◽

2019 ◽

Vol 11 (22) ◽

pp. 2596 ◽

Cited By ~ 5

Author(s):

Luca Zappa ◽

Matthias Forkel ◽

Angelika Xaver ◽

Wouter Dorigo

Keyword(s):

Machine Learning ◽

Soil Moisture ◽

Soil Texture ◽

Vegetation Cover ◽

Learning Algorithm ◽

Low Cost ◽

Cost Effective ◽

Training Data ◽

Field Scale ◽

Training Set

Agricultural and hydrological applications could greatly benefit from soil moisture (SM) information at sub-field resolution and (sub-) daily revisit time. However, current operational satellite missions provide soil moisture information at either lower spatial or temporal resolution. Here, we downscale coarse resolution (25–36 km) satellite SM products with quasi-daily resolution to the field scale (30 m) using the random forest (RF) machine learning algorithm. RF models are trained with remotely sensed SM and ancillary variables on soil texture, topography, and vegetation cover against SM measured in the field. The approach is developed and tested in an agricultural catchment equipped with a high-density network of low-cost SM sensors. Our results show a strong consistency between the downscaled and observed SM spatio-temporal patterns. We found that topography has higher predictive power for downscaling than soil texture, due to the hilly landscape of the study area. Furthermore, including a proxy of vegetation cover results in considerable improvements of the performance. Increasing the training set size leads to significant gain in the model skill and expanding the training set is likely to further enhance the accuracy. When only limited in-situ measurements are available as training data, increasing the number of sensor locations should be favored over expanding the duration of the measurements for improved downscaling performance. In this regard, we show the potential of low-cost sensors as a practical and cost-effective solution for gathering the necessary observations. Overall, our findings highlight the suitability of using ground measurements in conjunction with machine learning to derive high spatially resolved SM maps from coarse-scale satellite products.

Comparative study between deep learning and QSAR classifications for TNBC inhibitors and novel GPCR agonist discovery

Scientific Reports ◽

10.1038/s41598-020-73681-1 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Lun K. Tsou ◽

Shiu-Hwa Yeh ◽

Shau-Hua Ueng ◽

Chun-Ping Chang ◽

Jen-Shin Song ◽

...

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Deep Learning ◽

Virtual Screening ◽

Triple Negative ◽

Learning Algorithm ◽

Structural Information ◽

Training Set ◽

G Protein Coupled ◽

Prediction Efficiency

Machine learning is a well-known approach for virtual screening. Recently, deep learning, a machine learning algorithm in artificial neural networks, has been applied to the advancement of precision medicine and drug discovery. In this study, we performed comparative studies between deep neural networks (DNN) and other ligand-based virtual screening (LBVS) methods to demonstrate that DNN and random forest (RF) were superior in hit prediction efficiency. By using DNN, several triple-negative breast cancer (TNBC) inhibitors were identified as potent hits from a screening of an in-house database of 165,000 compounds. In broadening the application of this method, we harnessed the predictive properties of the trained model in the discovery of G protein-coupled receptor (GPCR) agonists, for which computational structure-based design of molecules can be greatly hindered by a lack of structural information. Notably, a potent (~500 nM) mu-opioid receptor (MOR) agonist was identified as a hit from a small-size training set of 63 compounds. Our results show that DNN could be an efficient module in hit prediction and provide experimental evidence that machine learning could identify potent hits in silico from a limited training set.

Spectroscopic observations of the machine-learning selected anomaly catalogue from the AllWISE Sky Survey

Astronomy and Astrophysics ◽

10.1051/0004-6361/202038439 ◽

2020 ◽

Vol 642 ◽

pp. A103

Author(s):

A. Solarz ◽

R. Thomas ◽

F. M. Montenegro-Montes ◽

M. Gromadzki ◽

E. Donoso ◽

...

Keyword(s):

Machine Learning ◽

Near Infrared ◽

Learning Algorithm ◽

Current Data ◽

Sloan Digital Sky Survey ◽

Support Vector ◽

Wide Field ◽

Training Set ◽

Stellar Objects ◽

Sky Survey

We present the results of a programme to search and identify the nature of unusual sources within the All-sky Wide-field Infrared Survey Explorer (WISE) that is based on a machine-learning algorithm for anomaly detection, namely one-class support vector machines (OCSVM). Designed to detect sources deviating from a training set composed of known classes, this algorithm was used to create a model for the expected data based on WISE objects with spectroscopic identifications in the Sloan Digital Sky Survey. Subsequently, it marked as anomalous those sources whose WISE photometry was shown to be inconsistent with this model. We report the results from optical and near-infrared spectroscopy follow-up observations of a subset of 36 bright (gAB < 19.5) objects marked as “anomalous” by the OCSVM code to verify its performance. Among the observed objects, we identified three main types of sources: (i) low redshift (z ∼ 0.03 − 0.15) galaxies containing large amounts of hot dust (53%), including three Wolf-Rayet galaxies; (ii) broad-line quasi-stellar objects (QSOs) (33%) including low-ionisation broad absorption line (LoBAL) quasars and a rare QSO with strong and narrow ultraviolet iron emission; (iii) Galactic objects in dusty phases of their evolution (3%). The nature of four of these objects (11%) remains undetermined due to low signal-to-noise or featureless spectra. The current data show that the algorithm works well at detecting rare but not necessarily unknown objects among the brightest candidates. They mostly represent peculiar sub-types of otherwise well-known sources. To search for even more unusual sources, a more complete and balanced training set should be created after including these rare sub-species of otherwise abundant source classes, such as LoBALs. Such an iterative approach will ideally bring us closer to improving the strategy design for the detection of rarer sources contained within the vast data store of the AllWISE survey.

A machine learning approach to identify correlates of current e-cigarette use in Canada

Exploration of Medicine ◽

10.37349/emed.2021.00033 ◽

2021 ◽

Author(s):

Rui Fu ◽

Nicholas Mitsakakis ◽

Michael Chaiton

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Electronic Cigarettes ◽

Classification Tree ◽

Cigarette Use ◽

Training Set ◽

Tree Model ◽

Cross Sectional ◽

Classification Tree Model ◽

Validation Set

Aim: The popularity of electronic cigarettes (i.e. e-cigarettes) is soaring in Canada. Understanding person-level correlates of current e-cigarette use (vaping) is crucial to guide tobacco policy, but prior studies have not fully identified these correlates due to model overfitting caused by multicollinearity. This study addressed this issue by using a classification tree, a machine learning algorithm. Methods: This population-based cross-sectional study used the Canadian Tobacco, Alcohol, and Drugs Survey (CTADS) from 2017 that targeted residents aged 15 or older. Forty-six person-level characteristics were first screened in a logistic mixed-effects regression procedure for their strength in predicting vaper type (current vs. former vaper) among people who reported having ever vaped. A 9:1 ratio was used to randomly split the data into a training set and a validation set. A classification tree model was developed using the cross-validation method on the training set using the selected predictors and assessed on the validation set using sensitivity, specificity and accuracy. Results: Of the 3,059 people with an experience of vaping, the average age was 24.4 years (standard deviation = 11.0), with 41.9% of them being female and 8.5% of them being aboriginal. There were 556 (18.2%) current vapers. The classification tree model performed relatively well and suggested that attraction to e-cigarette flavors was the most important correlate of current vaping, followed by young age (< 18) and believing vaping to be less harmful to oneself than cigarette smoking. Conclusions: People who vape due to flavors are associated with very high risk of becoming current vapers. The findings of this study provide evidence that supports the ongoing ban on flavored vaping products in the US and suggests a similar regulatory intervention may be effective in Canada.
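
The evaluation protocol described in the Methods (a random 9:1 split, then sensitivity, specificity and accuracy on the held-out part) can be sketched as follows. The function names, the seed, the integer stand-ins for survey records, and the toy labels are all assumptions for illustration; the classification tree itself is not reproduced here.

```python
import random

def split_nine_to_one(records, seed=0):
    """Randomly split records into a 90% training set and a 10% validation set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 9) // 10
    return shuffled[:cut], shuffled[cut:]

def classification_metrics(actual, predicted, positive=1):
    """Sensitivity, specificity and accuracy for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }

# 3,059 placeholder records, matching the number of ever-vapers in the abstract.
train, validation = split_nine_to_one(range(3059))

# Hypothetical labels: 1 = current vaper, 0 = former vaper.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(actual, predicted)
```

Reporting sensitivity and specificity alongside accuracy matters here because only 18.2% of the sample were current vapers, so a model that always predicted "former vaper" would still score high on accuracy alone.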

Profiling waitlisted incoming students for future delinquency with an ensemble of statistical machine learning algorithms

10.7287/peerj.preprints.3312v1 ◽

2017 ◽

Author(s):

Maureen Lyndel C Lauron ◽

Jaderick P Pabico

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Decision Support Tool ◽

Machine Learning Algorithms ◽

Statistical Machine Learning ◽

Training Set ◽

Support Tool ◽

Incoming Freshman ◽

Validation Set ◽

Incoming Students

Given a dataset \(\mathcal{R}=\{R_1, R_2, \dots, R_r\}\) of \(r\)~records of waitlisted incoming freshman students (WIFS), where for any \(i=1, 2, \dots, r\), \(R_i\) is a \((m+1)\)--tuple \((O_i, P_i^{(1)}, P_i^{(2)}, \dots, P_i^{(m)})\), \(O_i\) is one of a set \(\mathcal{O}=\{O_1, O_2, \dots, O_o\}\) of \(o\)~classes, and \(P_i^{(1)}, P_i^{(2)}, \dots, P_i^{(m)}\) are \(m\)~potential predictors for~\(O_i\), our purpose is to find a statistical machine learning algorithm (SMLA) \(\mathbb{A}\) such that \(V_i=\mathbb{A}(P_i^{(1)}, P_i^{(2)}, \dots, P_i^{(m)})\), where \(V_i\) is a class predicted by~\(\mathbb{A}\) that was developed using the correct number \(n\le m\) of predictors for \(O\in\mathcal{O}\), and \(\mathbb{A}\)~is the best algorithm such that the metric \(v^{-1}\sum_{i=1}^v |O_i - V_i|\) is minimal across \(v<r\)~records in the validation set \(\mathcal{V}\subset\mathcal{R}\). Our problem is to find the subset \(\{P_i^{(1)}, P_i^{(2)}, \dots, P_i^{(n)}\}\) and to train \(\mathbb{A}\)~using \(t<r\) records from the training set \(\mathcal{T}\subset\mathcal{R}\), such that \(\mathcal{T}\cap\mathcal{V}=\emptyset\), so that \(\mathbb{A}\)~can predict whether a WIFS trying to enter an undergraduate program at UPLB will incur at least one ``delinquency'' once the student is accepted into the program. \(\mathbb{A}\)~can be a useful decision-support tool for UPLB deans and college secretaries in deciding whether a WIFS will be accepted into the program or not.
