SICaRiO: short indel call filtering with boosting

Abstract Despite impressive improvement in the next-generation sequencing technology, reliable detection of indels is still a difficult endeavour. Recognition of true indels is of prime importance in many applications, such as personalized health care, disease genomics and population genetics. Recently, advanced machine learning techniques have been successfully applied to classification problems with large-scale data. In this paper, we present SICaRiO, a gradient boosting classifier for the reliable detection of true indels, trained with the gold-standard dataset from ‘Genome in a Bottle’ (GIAB) consortium. Our filtering scheme significantly improves the performance of each variant calling pipeline used in GIAB and beyond. SICaRiO uses genomic features that can be computed from publicly available resources, i.e. it does not require sequencing pipeline-specific information (e.g. read depth). This study also sheds lights on prior genomic contexts responsible for the erroneous calling of indels made by sequencing pipelines. We have compared prediction difficulty for three categories of indels over different sequencing pipelines. We have also ranked genomic features according to their predictivity in determining false positives.

Download Full-text

SICaRiO: Short Indel Call filteRing with bOosting

10.1101/601450 ◽

2019 ◽

Author(s):

Md Shariful Islam Bhuyan ◽

Itsik Pe’er ◽

M. Sohel Rahman

Keyword(s):

Large Scale ◽

Variant Calling ◽

Read Depth ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Specific Information ◽

Classification Problems ◽

Genomic Features ◽

Next Generation Sequencing Technology ◽

Reliable Detection

AbstractDespite impressive improvement in the next-generation sequencing technology, reliable detection of indels is still a difficult endeavour. Recognition of true indels is of prime importance in many applications, such as, personalized health care, disease genomics, population genetics etc. Recently, advanced machine learning techniques have been successfully applied to classification problems with large-scale data. In this paper, we present SICaRiO, a gradient boosting classifier for reliable detection of true indels, trained with gold-standard dataset from genome-in-a-bottle (GIAB) consortium. Our filtering scheme significantly improves the performance of each variant calling pipeline used in GIAB and beyond. SICaRiO uses genomic features which can be computed from publicly available resources, hence, we can apply it on any indel callsets not having sequencing pipeline-specific information (e.g., read depth). This study also sheds lights on prior genomic contexts responsible for indel calling error made by sequencing platforms. We have compared prediction difficulty for three indel categories over different sequencing pipelines. We have also ranked genomic features according to their predictivity in determining false indel calls.

Download Full-text

Accuracy and efficiency of germline variant calling pipelines for human genome data

10.1101/2020.03.27.011767 ◽

2020 ◽

Cited By ~ 1

Author(s):

Sen Zhao ◽

Oleg Agafonov ◽

Abdulrahman Azab ◽

Tomasz Stokowy ◽

Eivind Hovig

Keyword(s):

Large Scale ◽

Variant Calling ◽

Performance Comparison ◽

Next Generation Sequencing Technology ◽

Genome Data ◽

Germline Variant ◽

Causal Variants ◽

Variant Detection ◽

Variant Analysis ◽

Human Genome Data

AbstractAdvances in next-generation sequencing technology has enabled whole genome sequencing (WGS) to be widely used for identification of causal variants in a spectrum of genetic-related disorders, and provided new insight into how genetic polymorphisms affect disease phenotypes. The development of different bioinformatics pipelines has continuously improved the variant analysis of WGS data, however there is a necessity for a systematic performance comparison of these pipelines to provide guidance on the application of WGS-based scientific and clinical genomics. In this study, we evaluated the performance of three variant calling pipelines (GATK, DRAGEN™ and DeepVariant) using Genome in a Bottle Consortium, “synthetic-diploid” and simulated WGS datasets. DRAGEN™ and DeepVariant show a better accuracy in SNPs and indels calling, with no significant differences in their F1-score. DRAGEN™ platform offers accuracy, flexibility and a highly-efficient running speed, and therefore superior advantage in the analysis of WGS data on a large scale. The combination of DRAGEN™ and DeepVariant also provides a good balance of accuracy and efficiency as an alternative solution for germline variant detection in further applications. Our results facilitate the standardization of benchmarking analysis of bioinformatics pipelines for reliable variant detection, which is critical in genetics-based medical research and clinical application.

Download Full-text

HyP-ABC: A Novel Automated Hyper-Parameter Tuning Algorithm Using Evolutionary Optimization

10.36227/techrxiv.14714508.v3 ◽

2021 ◽

Author(s):

Leila Zahedi ◽

Farid Ghareh Mohammadi ◽

M. Hadi Amini

Keyword(s):

Real World ◽

Large Scale ◽

Convergence Rates ◽

Parameter Tuning ◽

Population Based ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Wide Range ◽

Extreme Gradient Boosting

<p>Machine learning techniques lend themselves as promising decision-making and analytic tools in a wide range of applications. Different ML algorithms have various hyper-parameters. In order to tailor an ML model towards a specific application working at its best, its hyper-parameters should be tuned. Tuning the hyper-parameters directly affects the performance. However, for large-scale search spaces, efficiently exploring the ample number of combinations of hyper-parameters is computationally expensive. Many of the automated hyper-parameter tuning techniques suffer from low convergence rates and high experimental time complexities. In this paper, we propose HyP-ABC, an automatic innovative hybrid hyper-parameter optimization algorithm using the modified artificial bee colony approach, to measure the classification accuracy of three ML algorithms: random forest, extreme gradient boosting, and support vector machine. In order to ensure the robustness of the proposed method, the algorithm takes a wide range of feasible hyper-parameter values and is tested using a real-world educational dataset. Experimental results show that HyP-ABC is competitive with state-of-the-art techniques. Also, it has fewer hyper-parameters to be tuned than other population-based algorithms, making it worthwhile for real-world HPO problems.</p>

Download Full-text

DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark

BMC Bioinformatics ◽

10.1186/s12859-019-3108-7 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 2

Author(s):

Michael D. Linderman ◽

Davin Chia ◽

Forrest Wallace ◽

Frank A. Nothaft

Keyword(s):

Copy Number ◽

Large Scale ◽

Variant Calling ◽

Copy Number Variant ◽

Read Depth ◽

Lessons Learned ◽

Apache Spark ◽

Sequencing Data ◽

Whole Exome Sequencing Data ◽

Genome Analyses

Abstract Background XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results. Results DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster. Conclusions We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.

Download Full-text

Reliable photometric membership (RPM) of galaxies in clusters – I. A machine learning method and its performance in the local universe

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa486 ◽

2020 ◽

Vol 493 (3) ◽

pp. 3429-3441

Author(s):

Paulo A A Lopes ◽

André L B Ribeiro

Keyword(s):

Machine Learning ◽

Galaxy Evolution ◽

Large Scale ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Validation Data ◽

Membership Probability ◽

Cluster Membership ◽

Stochastic Gradient Boosting

ABSTRACT We introduce a new method to determine galaxy cluster membership based solely on photometric properties. We adopt a machine learning approach to recover a cluster membership probability from galaxy photometric parameters and finally derive a membership classification. After testing several machine learning techniques (such as stochastic gradient boosting, model averaged neural network and k-nearest neighbours), we found the support vector machine algorithm to perform better when applied to our data. Our training and validation data are from the Sloan Digital Sky Survey main sample. Hence, to be complete to $M_r^* + 3$, we limit our work to 30 clusters with $z$phot-cl ≤ 0.045. Masses (M200) are larger than $\sim 0.6\times 10^{14} \, \mathrm{M}_{\odot }$ (most above $3\times 10^{14} \, \mathrm{M}_{\odot }$). Our results are derived taking in account all galaxies in the line of sight of each cluster, with no photometric redshift cuts or background corrections. Our method is non-parametric, making no assumptions on the number density or luminosity profiles of galaxies in clusters. Our approach delivers extremely accurate results (completeness, C $\sim 92{\rm{ per\ cent}}$ and purity, P $\sim 87{\rm{ per\ cent}}$) within R200, so that we named our code reliable photometric membership. We discuss possible dependencies on magnitude, colour, and cluster mass. Finally, we present some applications of our method, stressing its impact to galaxy evolution and cosmological studies based on future large-scale surveys, such as eROSITA, EUCLID, and LSST.

Download Full-text

Predicting synergism of cancer drug combinations using NCI-ALMANAC data

10.1101/504076 ◽

2018 ◽

Cited By ~ 1

Author(s):

Pavel Sidorov ◽

Stefan Naulaerts ◽

Jérémy Ariey-Bonnet ◽

Eddy Pasquier ◽

Pedro J. Ballester

Keyword(s):

Cell Line ◽

In Silico ◽

Large Scale ◽

Alkylating Agents ◽

Drug Combinations ◽

Reliability Estimation ◽

Machine Learning Techniques ◽

Cancer Drug ◽

Gradient Boosting ◽

Extreme Gradient Boosting

AbstractBackgroundDrug combinations are of great interest for cancer treatment. Unfortunately, the discovery of synergistic combinations by purely experimental means is only feasible on small sets of drugs.In silicomodeling methods can substantially widen this search by providing tools able to predict which of all possible combinations in a large compound library are synergistic. Here we investigate to which extent drug combination synergy can be predicted by exploiting the largest available dataset to date (NCI-ALMANAC, with over 290,000 synergy determinations).MethodsEach cell line is modeled using primarily two machine learning techniques, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), on the datasets provided by NCI-ALMANAC. This large-scale predictive modeling study comprises more than 5000 pair-wise drug combinations, 60 cell lines, 4 types of models and 5 types of chemical features. The application of a powerful, yet uncommonly used, RF-specific technique for reliability prediction is also investigated.ResultsThe evaluation of these models shows that it is possible to predict the synergy of unseen drug combinations with high accuracy (Pearson correlations between 0.43 and 0.86 depending on the considered cell line, with XGBoost providing slightly better predictions than RF). We have also found that restricting to the most reliable synergy predictions results in at least two-fold error decrease with respect to employing the best learning algorithm without any reliability estimation. Alkylating agents, tyrosine kinase inhibitors and topoisomerase inhibitors are the drugs whose synergy with other partner drugs are better predicted by the models.ConclusionsDespite its leading size, NCI-ALMANAC comprises an extremely small part of all conceivable combinations. Given their accuracy and reliability estimation, the developed models should drastically reduce the number of requiredin vitrotests by predictingin silicowhich of the considered combinations are likely to be synergistic.

Download Full-text

Intercomparing the robustness of machine learning models in simulation and forecasting of streamflow

Journal of Water and Climate Change ◽

10.2166/wcc.2020.365 ◽

2020 ◽

Author(s):

Parthiban Loganathan ◽

Amit Baburao Mahindrakar

Keyword(s):

Machine Learning ◽

Large Scale ◽

Resource Planning ◽

Machine Learning Techniques ◽

Low Flow ◽

Gradient Boosting ◽

Daily Streamflow ◽

Learning Techniques ◽

Extreme Gradient Boosting ◽

Hydrological Indices

Abstract The intercomparison of streamflow simulation and the prediction of discharge using various renowned machine learning techniques were performed. The daily streamflow discharge model was developed for 35 observation stations located in a large-scale river basin named Cauvery. Various hydrological indices were calculated for observed and predicted discharges for comparing and evaluating the replicability of local hydrological conditions. The model variance and bias observed from the proposed extreme gradient boosting decision tree model were less than 15%, which is compared with other machine learning techniques considered in this study. The model Nash–Sutcliffe efficiency and coefficient of determination values are above 0.7 for both the training and testing phases which demonstrate the effectiveness of model performance. The comparison of monthly observed and model-predicted discharges during the validation period illustrates the model's ability in representing the peaks and fall in high-, medium-, and low-flow zones. The assessment and comparison of hydrological indices between observed and predicted discharges illustrate the model's ability in representing the baseflow, high-spell, and low-spell statistics. Simulating streamflow and predicting discharge are essential for water resource planning and management, especially in large-scale river basins. The proposed machine learning technique demonstrates significant improvement in model efficiency by dropping variance and bias which, in turn, improves the replicability of local-scale hydrology.

Download Full-text

Accuracy and efficiency of germline variant calling pipelines for human genome data

Scientific Reports ◽

10.1038/s41598-020-77218-4 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Sen Zhao ◽

Oleg Agafonov ◽

Abdulrahman Azab ◽

Tomasz Stokowy ◽

Eivind Hovig

Keyword(s):

Large Scale ◽

Variant Calling ◽

Performance Comparison ◽

Superior Performance ◽

Next Generation Sequencing Technology ◽

Genome Data ◽

Germline Variant ◽

Indel Calling ◽

Execution Speed ◽

Variant Detection

AbstractAdvances in next-generation sequencing technology have enabled whole genome sequencing (WGS) to be widely used for identification of causal variants in a spectrum of genetic-related disorders, and provided new insight into how genetic polymorphisms affect disease phenotypes. The development of different bioinformatics pipelines has continuously improved the variant analysis of WGS data. However, there is a necessity for a systematic performance comparison of these pipelines to provide guidance on the application of WGS-based scientific and clinical genomics. In this study, we evaluated the performance of three variant calling pipelines (GATK, DRAGEN and DeepVariant) using the Genome in a Bottle Consortium, “synthetic-diploid” and simulated WGS datasets. DRAGEN and DeepVariant show better accuracy in SNP and indel calling, with no significant differences in their F1-score. DRAGEN platform offers accuracy, flexibility and a highly-efficient execution speed, and therefore superior performance in the analysis of WGS data on a large scale. The combination of DRAGEN and DeepVariant also suggests a good balance of accuracy and efficiency as an alternative solution for germline variant detection in further applications. Our results facilitate the standardization of benchmarking analysis of bioinformatics pipelines for reliable variant detection, which is critical in genetics-based medical research and clinical applications.

Download Full-text

A Comparative Study of Different Machine Learning Algorithms for Disease Prediction

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i7/0177 ◽

2017 ◽

Vol 7 (7) ◽

pp. 172

Author(s):

Anantvir Singh Romana

Keyword(s):

Machine Learning ◽

Subsequent Treatment ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Disease Prediction ◽

Classification Problems ◽

Learning Techniques ◽

Neural Network Classifiers ◽

Diagnostic Detection

Accurate diagnostic detection of the disease in a patient is critical and may alter the subsequent treatment and increase the chances of survival rate. Machine learning techniques have been instrumental in disease detection and are currently being used in various classification problems due to their accurate prediction performance. Various techniques may provide different desired accuracies and it is therefore imperative to use the most suitable method which provides the best desired results. This research seeks to provide comparative analysis of Support Vector Machine, Naïve bayes, J48 Decision Tree and neural network classifiers breast cancer and diabetes datsets.

Download Full-text

BiPSim: a flexible and generic stochastic simulator for polymerization processes

Scientific Reports ◽

10.1038/s41598-021-92833-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Stephan Fischer ◽

Marc Dinh ◽

Vincent Henry ◽

Philippe Robert ◽

Anne Goelzer ◽

...

Keyword(s):

Large Scale ◽

Stochastic Simulation Algorithm ◽

Specific Information ◽

Whole Cell ◽

Simulation Speed ◽

Genome Wide ◽

Cell Simulation ◽

Stochastic Simulator ◽

Modeling Formalisms ◽

Stochastic Phenomena

AbstractDetailed whole-cell modeling requires an integration of heterogeneous cell processes having different modeling formalisms, for which whole-cell simulation could remain tractable. Here, we introduce BiPSim, an open-source stochastic simulator of template-based polymerization processes, such as replication, transcription and translation. BiPSim combines an efficient abstract representation of reactions and a constant-time implementation of the Gillespie’s Stochastic Simulation Algorithm (SSA) with respect to reactions, which makes it highly efficient to simulate large-scale polymerization processes stochastically. Moreover, multi-level descriptions of polymerization processes can be handled simultaneously, allowing the user to tune a trade-off between simulation speed and model granularity. We evaluated the performance of BiPSim by simulating genome-wide gene expression in bacteria for multiple levels of granularity. Finally, since no cell-type specific information is hard-coded in the simulator, models can easily be adapted to other organismal species. We expect that BiPSim should open new perspectives for the genome-wide simulation of stochastic phenomena in biology.

Download Full-text