CRISPRIdentify: Identification of CRISPR arrays using machine learning approach

CRISPR-Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR-Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript of CRISPR arrays, and guide Cas protein(s) to the cognate invading nucleic acids, enabling their destruction. Several bioinformatics tools have been developed to detect CRISPR arrays based solely on DNA sequences, but all these tools employ the same strategy of looking for repetitive patterns, which might correspond to CRISPR array repeats. The identified patterns are evaluated using a fixed, built-in scoring function, and arrays exceeding a cut-off value are reported. Here, we instead introduce a data-driven approach that uses machine learning to detect and differentiate true CRISPR arrays from false ones based on several features. Our CRISPR detection tool, CRISPRIdentify, performs three steps: detection, feature extraction and classification based on manually curated sets of positive and negative examples of CRISPR arrays. The identified CRISPR arrays are then reported to the user accompanied by detailed annotation. We demonstrate that our approach identifies not only previously detected CRISPR arrays, but also CRISPR array candidates not detected by other tools. Compared to other methods, our tool has a drastically reduced false-positive rate. In contrast to the existing tools, our approach not only provides the user with the basic statistics on the identified CRISPR arrays but also produces a certainty score as a practical measure of the likelihood that a given genomic region is a CRISPR array.

Download Full-text

CRISPRidentify: identification of CRISPR arrays using machine learning approach

Nucleic Acids Research ◽

10.1093/nar/gkaa1158 ◽

2020 ◽

Author(s):

Alexander Mitrofanov ◽

Omer S Alkhnbashi ◽

Sergey A Shmakov ◽

Kira S Makarova ◽

Eugene V Koonin ◽

...

Keyword(s):

Machine Learning ◽

Dna Sequences ◽

False Positive Rate ◽

Scoring Function ◽

Protein S ◽

Genomic Region ◽

Immune Functions ◽

Crispr Array ◽

Positive Rate ◽

Data Driven Approach

Abstract CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript of CRISPR arrays, and guide Cas protein(s) to the cognate invading nucleic acids, enabling their destruction. Several bioinformatics tools have been developed to detect CRISPR arrays based solely on DNA sequences, but all these tools employ the same strategy of looking for repetitive patterns, which might correspond to CRISPR array repeats. The identified patterns are evaluated using a fixed, built-in scoring function, and arrays exceeding a cut-off value are reported. Here, we instead introduce a data-driven approach that uses machine learning to detect and differentiate true CRISPR arrays from false ones based on several features. Our CRISPR detection tool, CRISPRidentify, performs three steps: detection, feature extraction and classification based on manually curated sets of positive and negative examples of CRISPR arrays. The identified CRISPR arrays are then reported to the user accompanied by detailed annotation. We demonstrate that our approach identifies not only previously detected CRISPR arrays, but also CRISPR array candidates not detected by other tools. Compared to other methods, our tool has a drastically reduced false positive rate. In contrast to the existing tools, our approach not only provides the user with the basic statistics on the identified CRISPR arrays but also produces a certainty score as a practical measure of the likelihood that a given genomic region is a CRISPR array.

Download Full-text

Identification of newborns at risk for autism using electronic medical records and machine learning

10.1101/19008367 ◽

2019 ◽

Author(s):

Rayees Rahman ◽

Arad Kodesh ◽

Stephen Z Levine ◽

Sven Sandin ◽

Abraham Reichenberg ◽

...

Keyword(s):

Machine Learning ◽

Autism Spectrum Disorder ◽

Positive Predictive Value ◽

Electronic Medical Records ◽

Predictive Value ◽

False Positive ◽

Medical Records ◽

False Positive Rate ◽

Autism Spectrum ◽

Positive Rate

AbstractImportanceCurrent approaches for early identification of individuals at high risk for autism spectrum disorder (ASD) in the general population are limited, where most ASD patients are not identified until after the age of 4. This is despite substantial evidence suggesting that early diagnosis and intervention improves developmental course and outcome.ObjectiveDevelop a machine learning (ML) method predicting the diagnosis of ASD in offspring in a general population sample, using parental electronic medical records (EMR) available before childbirthDesignPrognostic study of EMR data within a single Israeli health maintenance organization, for the parents of 1,397 ASD children (ICD-9/10), and 94,741 non-ASD children born between January 1st, 1997 through December 31st, 2008. The complete EMR record of the parents was used to develop various ML models to predict the risk of having a child with ASD.Main outcomes and measuresRoutinely available parental sociodemographic information, medical histories and prescribed medications data until offspring’s birth were used to generate features to train various machine learning algorithms, including multivariate logistic regression, artificial neural networks, and random forest. Prediction performance was evaluated with 10-fold cross validation, by computing C statistics, sensitivity, specificity, accuracy, false positive rate, and precision (positive predictive value, PPV).ResultsAll ML models tested had similar performance, achieving an average C statistics of 0.70, sensitivity of 28.63%, specificity of 98.62%, accuracy of 96.05%, false positive rate of 1.37%, and positive predictive value of 45.85% for predicting ASD in this dataset.Conclusion and relevanceML algorithms combined with EMR capture early life ASD risk. Such approaches may be able to enhance the ability for accurate and efficient early detection of ASD in large populations of children.Key pointsQuestionCan autism risk in children be predicted using the pre-birth electronic medical record (EMR) of the parents?FindingsIn this population-based study that included 1,397 children with autism spectrum disorder (ASD) and 94,741 non-ASD children, we developed a machine learning classifier for predicting the likelihood of childhood diagnosis of ASD with an average C statistic of 0.70, sensitivity of 28.63%, specificity of 98.62%, accuracy of 96.05%, false positive rate of 1.37%, and positive predictive value of 45.85%.MeaningThe results presented serve as a proof-of-principle of the potential utility of EMR for the identification of a large proportion of future children at a high-risk of ASD.

Download Full-text

IoT Dataset Validation Using Machine Learning Techniques for Traffic Anomaly Detection

Electronics ◽

10.3390/electronics10222857 ◽

2021 ◽

Vol 10 (22) ◽

pp. 2857

Author(s):

Laura Vigoya ◽

Diego Fernandez ◽

Victor Carneiro ◽

Francisco Nóvoa

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

False Positive Rate ◽

Machine Learning Techniques ◽

Support Vector ◽

High Detection Rate ◽

Security Vulnerabilities ◽

Smart Systems ◽

Learning Techniques ◽

Positive Rate

With advancements in engineering and science, the application of smart systems is increasing, generating a faster growth of the IoT network traffic. The limitations due to IoT restricted power and computing devices also raise concerns about security vulnerabilities. Machine learning-based techniques have recently gained credibility in a successful application for the detection of network anomalies, including IoT networks. However, machine learning techniques cannot work without representative data. Given the scarcity of IoT datasets, the DAD emerged as an instrument for knowing the behavior of dedicated IoT-MQTT networks. This paper aims to validate the DAD dataset by applying Logistic Regression, Naive Bayes, Random Forest, AdaBoost, and Support Vector Machine to detect traffic anomalies in IoT. To obtain the best results, techniques for handling unbalanced data, feature selection, and grid search for hyperparameter optimization have been used. The experimental results show that the proposed dataset can achieve a high detection rate in all the experiments, providing the best mean accuracy of 0.99 for the tree-based models, with a low false-positive rate, ensuring effective anomaly detection.

Download Full-text

Machine Learning for Automated Polyp Detection in Computed Tomography Colonography

Machine Learning ◽

10.4018/978-1-60960-818-7.ch407 ◽

2012 ◽

pp. 830-850

Author(s):

Abhilash Alexander Miranda ◽

Olivier Caelen ◽

Gianluca Bontempi

Keyword(s):

Machine Learning ◽

Computed Tomography ◽

False Positive ◽

False Positive Rate ◽

Learning Algorithms ◽

Colorectal Polyps ◽

Machine Learning Algorithms ◽

Computed Tomography Colonography ◽

Positive Rate ◽

Independent Features

This chapter presents a comprehensive scheme for automated detection of colorectal polyps in computed tomography colonography (CTC) with particular emphasis on robust learning algorithms that differentiate polyps from non-polyp shapes. The authors’ automated CTC scheme introduces two orientation independent features which encode the shape characteristics that aid in classification of polyps and non-polyps with high accuracy, low false positive rate, and low computations making the scheme suitable for colorectal cancer screening initiatives. Experiments using state-of-the-art machine learning algorithms viz., lazy learning, support vector machines, and naïve Bayes classifiers reveal the robustness of the two features in detecting polyps at 100% sensitivity for polyps with diameter greater than 10 mm while attaining total low false positive rates, respectively, of 3.05, 3.47 and 0.71 per CTC dataset at specificities above 99% when tested on 58 CTC datasets. The results were validated using colonoscopy reports provided by expert radiologists.

Download Full-text

Identification of newborns at risk for autism using electronic medical records and machine learning

European Psychiatry ◽

10.1192/j.eurpsy.2020.17 ◽

2020 ◽

Vol 63 (1) ◽

Author(s):

Rayees Rahman ◽

Arad Kodesh ◽

Stephen Z. Levine ◽

Sven Sandin ◽

Abraham Reichenberg ◽

...

Keyword(s):

Machine Learning ◽

General Population ◽

Electronic Medical Records ◽

False Positive ◽

Medical Records ◽

False Positive Rate ◽

Autism Spectrum ◽

Health Maintenance ◽

C Statistic ◽

Positive Rate

Abstract Background. Current approaches for early identification of individuals at high risk for autism spectrum disorder (ASD) in the general population are limited, and most ASD patients are not identified until after the age of 4. This is despite substantial evidence suggesting that early diagnosis and intervention improves developmental course and outcome. The aim of the current study was to test the ability of machine learning (ML) models applied to electronic medical records (EMRs) to predict ASD early in life, in a general population sample. Methods. We used EMR data from a single Israeli Health Maintenance Organization, including EMR information for parents of 1,397 ASD children (ICD-9/10) and 94,741 non-ASD children born between January 1st, 1997 and December 31st, 2008. Routinely available parental sociodemographic information, parental medical histories, and prescribed medications data were used to generate features to train various ML algorithms, including multivariate logistic regression, artificial neural networks, and random forest. Prediction performance was evaluated with 10-fold cross-validation by computing the area under the receiver operating characteristic curve (AUC; C-statistic), sensitivity, specificity, accuracy, false positive rate, and precision (positive predictive value [PPV]). Results. All ML models tested had similar performance. The average performance across all models had C-statistic of 0.709, sensitivity of 29.93%, specificity of 98.18%, accuracy of 95.62%, false positive rate of 1.81%, and PPV of 43.35% for predicting ASD in this dataset. Conclusions. We conclude that ML algorithms combined with EMR capture early life ASD risk as well as reveal previously unknown features to be associated with ASD-risk. Such approaches may be able to enhance the ability for accurate and efficient early detection of ASD in large populations of children.

Download Full-text

Data Mining Validation of Fluconazole Breakpoints Established by the European Committee on Antimicrobial Susceptibility Testing

Antimicrobial Agents and Chemotherapy ◽

10.1128/aac.00081-09 ◽

2009 ◽

Vol 53 (7) ◽

pp. 2949-2954 ◽

Cited By ~ 18

Author(s):

Isabel Cuesta ◽

Concha Bielza ◽

Pedro Larrañaga ◽

Manuel Cuenca-Estrella ◽

Fernando Laguna ◽

...

Keyword(s):

Machine Learning ◽

Antimicrobial Susceptibility ◽

Roc Curve ◽

False Positive ◽

Statistical Power ◽

Susceptibility Testing ◽

False Positive Rate ◽

Antimicrobial Susceptibility Testing ◽

European Committee ◽

Positive Rate

ABSTRACT European Committee on Antimicrobial Susceptibility Testing (EUCAST) breakpoints classify Candida strains with a fluconazole MIC ≤ 2 mg/liter as susceptible, those with a fluconazole MIC of 4 mg/liter as representing intermediate susceptibility, and those with a fluconazole MIC > 4 mg/liter as resistant. Machine learning models are supported by complex statistical analyses assessing whether the results have statistical relevance. The aim of this work was to use supervised classification algorithms to analyze the clinical data used to produce EUCAST fluconazole breakpoints. Five supervised classifiers (J48, Correlation and Regression Trees [CART], OneR, Naïve Bayes, and Simple Logistic) were used to analyze two cohorts of patients with oropharyngeal candidosis and candidemia. The target variable was the outcome of the infections, and the predictor variables consisted of values for the MIC or the proportion between the dose administered and the MIC of the isolate (dose/MIC). Statistical power was assessed by determining values for sensitivity and specificity, the false-positive rate, the area under the receiver operating characteristic (ROC) curve, and the Matthews correlation coefficient (MCC). CART obtained the best statistical power for a MIC > 4 mg/liter for detecting failures (sensitivity, 87%; false-positive rate, 8%; area under the ROC curve, 0.89; MCC index, 0.80). For dose/MIC determinations, the target was >75, with a sensitivity of 91%, a false-positive rate of 10%, an area under the ROC curve of 0.90, and an MCC index of 0.80. Other classifiers gave similar breakpoints with lower statistical power. EUCAST fluconazole breakpoints have been validated by means of machine learning methods. These computer tools must be incorporated in the process for developing breakpoints to avoid researcher bias, thus enhancing the statistical power of the model.

Download Full-text

Dear Watch, Should I get a COVID Test? Designing deployable machine learning for wearables

10.21203/rs.3.rs-505984/v1 ◽

2021 ◽

Author(s):

Anna Goldenberg ◽

Bret Nestor ◽

Jaryd Hunter ◽

Raghu Kainkaryam ◽

Erik Drysdale ◽

...

Keyword(s):

Public Health ◽

Machine Learning ◽

Real World ◽

False Positive Rate ◽

Wearable Devices ◽

Classification Performance ◽

Wearable Device ◽

Screening Tools ◽

Machine Learning Model ◽

Positive Rate

Abstract Commercial wearable devices are surfacing as an appealing mechanism to detect COVID-19 and potentially other public health threats, due to their widespread use. To assess the validity of wearable devices as population health screening tools, it is essential to evaluate predictive methodologies based on wearable devices by mimicking their real-world deployment. Several points must be addressed to transition from statistically significant differences between infected and uninfected cohorts to COVID-19 inferences on individuals. We demonstrate the strengths and shortcomings of existing approaches on a cohort of 32,198 individuals who experience influenza like illness (ILI), 204 of which report testing positive for COVID-19. We show that, despite commonly made design mistakes resulting in overestimation of performance, when properly designed wearables can be effectively used as a part of the detection pipeline. For example, knowing the week of year, combined with naive randomised test set generation leads to substantial overestimation of COVID-19 classification performance at 0.73 AUROC. However, an average AUROC of only 0.55 +/- 0.02 would be attainable in a simulation of real-world deployment, due to the shifting prevalence of COVID-19 and non-COVID-19 ILI to trigger further testing. In this work we show how to train a machine learning model to differentiate ILI days from healthy days, followed by a survey to differentiate COVID-19 from influenza and unspecified ILI based on symptoms. In a forthcoming week, models can expect a sensitivity of 0.50 (0-0.74, 95% CI), while utilising the wearable device to reduce the burden of surveys by 35%. The corresponding false positive rate is 0.22 (0.02-0.47, 95% CI). In the future, serious consideration must be given to the design, evaluation, and reporting of wearable device interventions if they are to be relied upon as part of frequent COVID-19 or other public health threat testing infrastructures.

Download Full-text

Reconfiguring primase DNA-recognition sequences by using a data-driven approach

10.1101/2020.09.29.317842 ◽

2020 ◽

Author(s):

Adam Soffer ◽

Morya Ifrach ◽

Stefan Ilic ◽

Ariel Afek ◽

Dan Vilenchik ◽

...

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Dna Sequences ◽

Metabolic Pathways ◽

Dna Recognition ◽

Model System ◽

Data Driven ◽

Specific Interactions ◽

Data Driven Approach ◽

Dna Protein Interactions

AbstractDNA–protein interactions are essential in all aspects of every living cell. Understanding of how features embedded in the DNA sequence affect specific interactions with proteins is challenging but important, since it may contribute to finding the means to regulate metabolic pathways involving DNA–protein interactions. Using a massive experimental benchmark dataset of binding scores for DNA sequences and a machine learning workflow, we describe the binding to DNA of T7 primase, as a model system for specific DNA–protein interactions. Effective binding of T7 primase to its specific DNA recognition se-quences triggers the formation of RNA primers that serve as Okazaki fragment start sites during DNA replication.

Download Full-text

Machine Learning Based Technique for Detection of Rank Attack in RPL based Internet of Things Networks

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i3044.0789s319 ◽

2019 ◽

Vol 8 (9S3) ◽

pp. 244-248

Keyword(s):

Machine Learning ◽

Internet Of Things ◽

Nearest Neighbor ◽

False Positive Rate ◽

Security And Privacy ◽

Weak Links ◽

K Nearest Neighbor ◽

Detection Mechanism ◽

Wormhole Attack ◽

Positive Rate

Internet of Things (IoT) is a new Paradiagram in the network technology. It has the vast application in almost every field like retail, industries, and healthcare etc. It has challenges like security and privacy, robustness, weak links, less power, etc. A major challenge among these is security. Due to the weak connectivity links, these Internet of Things network leads to many attacks in the network layer. RPL is a routing protocol which establishes a path particularly for the constrained nodes in Internet of Things based networks. These RPL based network is exposed to many attacks like black hole attack, wormhole attack, sinkhole attack, rank attack, etc. This paper proposed a detection technique for rank attack based on the machine learning approach called MLTKNN, based on K-nearest neighbor algorithm. The proposed technique was simulated in the Cooja simulation with 30 motes and calculated the true positive rate and false positive rate of the proposed detection mechanism. Finally proved that, the performance of the proposed technique was efficient in terms of the delay, packet delivery rate and in detection of the rank attack.

Download Full-text

Population-Based Screening for Endometrial Cancer: Human vs. Machine Intelligence

Frontiers in Artificial Intelligence ◽

10.3389/frai.2020.539879 ◽

2020 ◽

Vol 3 ◽

Author(s):

Gregory R. Hart ◽

Vanessa Yan ◽

Gloria S. Huang ◽

Ying Liang ◽

Bradley J. Nartowt ◽

...

Keyword(s):

Machine Learning ◽

Endometrial Cancer ◽

False Positive Rate ◽

Population Based ◽

Personal Health ◽

Average Risk ◽

Endometrial Cancer Risk ◽

Machine Learning Model ◽

Fold Reduction ◽

Positive Rate

Incidence and mortality rates of endometrial cancer are increasing, leading to increased interest in endometrial cancer risk prediction and stratification to help in screening and prevention. Previous risk models have had moderate success with the area under the curve (AUC) ranging from 0.68 to 0.77. Here we demonstrate a population-based machine learning model for endometrial cancer screening that achieves a testing AUC of 0.96.We train seven machine learning algorithms based solely on personal health data, without any genomic, imaging, biomarkers, or invasive procedures. The data come from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO). We further compare our machine learning model with 15 gynecologic oncologists and primary care physicians in the stratification of endometrial cancer risk for 100 women.We find a random forest model that achieves a testing AUC of 0.96 and a neural network model that achieves a testing AUC of 0.91. We test both models in risk stratification against 15 practicing physicians. Our random forest model is 2.5 times better at identifying above-average risk women with a 2-fold reduction in the false positive rate. Our neural network model is 2 times better at identifying above-average risk women with a 3-fold reduction in the false positive rate.Our machine learning models provide a non-invasive and cost-effective way to identify high-risk sub-populations who may benefit from early screening of endometrial cancer, prior to disease onset. Through statistical biopsy of personal health data, we have identified a new and effective approach for early cancer detection and prevention for individual patients.

Download Full-text