Analysis of Biological Screening Compounds with Single- or Multi-Target Activity via Diagnostic Machine Learning

Abstract: Compounds with defined multi-target activity (promiscuity) play an increasingly important role in drug discovery. However, the molecular basis of multi-target activity is currently poorly understood. In particular, it remains unclear whether structural features exist that generally characterize promiscuous compounds and set them apart from compounds with single-target activity. We have devised a test system using machine learning to systematically examine structural features that might characterize compounds with multi-target activity. Using this system, more than 860,000 diagnostic predictions were carried out. The analysis provided compelling evidence for the presence of structural characteristics of promiscuous compounds that were dependent on given target combinations, but not generalizable. Feature weighting and mapping identified characteristic substructures in test compounds. Taken together, these findings are relevant for the design of compounds with desired multi-target activity.

Predicting ionizing radiation exposure using biochemically-inspired genomic machine learning

F1000Research ◽ 10.12688/f1000research.14048.1 ◽ 2018 ◽ Vol 7 ◽ pp. 233

Author(s): Jonathan Z.L. Zhao ◽ Eliseos J. Mucaki ◽ Peter K. Rogan

Keyword(s): Machine Learning ◽ Ionizing Radiation ◽ Radiation Exposure ◽ Large Scale ◽ Nearest Neighbor ◽ Error Rates ◽ Support Vector ◽ Dose Estimation ◽ Gene Signatures ◽ Ionizing Radiation Exposure

Background: Gene signatures derived from transcriptomic data using machine learning methods have shown promise for biodosimetry testing. These signatures may not be sufficiently robust for large scale testing, as their performance has not been adequately validated on external, independent datasets. The present study develops human and murine signatures with biochemically-inspired machine learning that are strictly validated using k-fold and traditional approaches.

Methods: Gene Expression Omnibus (GEO) datasets of exposed human and murine lymphocytes were preprocessed via nearest neighbor imputation and expression of genes implicated in the literature to be responsive to radiation exposure (n=998) were then ranked by Minimum Redundancy Maximum Relevance (mRMR). Optimal signatures were derived by backward, complete, and forward sequential feature selection using Support Vector Machines (SVM), and validated using k-fold or traditional validation on independent datasets.

Results: The best human signatures we derived exhibit k-fold validation accuracies of up to 98% (DDB2, PRKDC, TPP2, PTPRE, and GADD45A) when validated over 209 samples and traditional validation accuracies of up to 92% (DDB2, CD8A, TALDO1, PCNA, EIF4G2, LCN2, CDKN1A, PRKCH, ENO1, and PPM1D) when validated over 85 samples. Some human signatures are specific enough to differentiate between chemotherapy and radiotherapy. Certain multi-class murine signatures have sufficient granularity in dose estimation to inform eligibility for cytokine therapy (assuming these signatures could be translated to humans). We compiled a list of the most frequently appearing genes in the top 20 human and mouse signatures. More frequently appearing genes among an ensemble of signatures may indicate greater impact of these genes on the performance of individual signatures. Several genes in the signatures we derived are present in previously proposed signatures.

Conclusions: Gene signatures for ionizing radiation exposure derived by machine learning have low error rates in externally validated, independent datasets, and exhibit high specificity and granularity for dose estimation.
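The k-fold validation mentioned in this abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the helper names (`k_fold_indices`, `cross_validated_accuracy`, `fit_centroids`, `predict_centroid`) are hypothetical, and a nearest-centroid classifier stands in for the SVM purely to keep the example self-contained.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs so each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread samples evenly: the first (n_samples % k) folds get one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def fit_centroids(rows, labels):
    """Stand-in classifier (assumption, not an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, model[c])))


def cross_validated_accuracy(rows, labels, k, fit, predict, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the correct calls."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(predict(model, rows[i]) == labels[i] for i in test_idx)
    return correct / len(labels)
```

Calling `cross_validated_accuracy(rows, labels, 5, fit_centroids, predict_centroid)` on a list of expression vectors and exposure labels returns the pooled held-out accuracy; any other `fit`/`predict` pair with the same shape could be swapped in.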

Logging Analysis and Prediction in Open Source Java Project

Research Anthology on Usage and Development of Open Source Software ◽ 10.4018/978-1-7998-9158-1.ch038 ◽ 2021 ◽ pp. 733-761

Author(s): Sangeeta Lal ◽ Neetu Sardana ◽ Ashish Sureka

Keyword(s): Machine Learning ◽ Content Analysis ◽ Software Development ◽ Anomaly Detection ◽ Open Source ◽ Large Scale ◽ Source Code ◽ Scale Analysis ◽ Large Scale Analysis ◽ Research Questions

Log statements in source code provide important information to software developers and are useful in various software development activities such as debugging, anomaly detection, and remote issue resolution. Most previous studies on logging analysis and prediction provide insights and results after analyzing only a few code constructs. In this chapter, the authors perform an in-depth, focused, and large-scale analysis of logging code constructs at two levels: the file level and the catch-block level. They answer several research questions related to statistical and content analysis. Statistical and content analysis reveals the presence of differentiating properties among logged and nonlogged code constructs. Based on these findings, the authors propose a machine-learning-based model for catch-block logging prediction. The machine-learning-based model is found to be effective in catch-block logging prediction.

Predicting ionizing radiation exposure using biochemically-inspired genomic machine learning

F1000Research ◽ 10.12688/f1000research.14048.2 ◽ 2018 ◽ Vol 7 ◽ pp. 233 ◽ Cited By ~ 13

Author(s): Jonathan Z.L. Zhao ◽ Eliseos J. Mucaki ◽ Peter K. Rogan

Keyword(s): Machine Learning ◽ Ionizing Radiation ◽ Radiation Exposure ◽ Large Scale ◽ Nearest Neighbor ◽ Error Rates ◽ Support Vector ◽ Dose Estimation ◽ Gene Signatures ◽ Ionizing Radiation Exposure

Background: Gene signatures derived from transcriptomic data using machine learning methods have shown promise for biodosimetry testing. These signatures may not be sufficiently robust for large scale testing, as their performance has not been adequately validated on external, independent datasets. The present study develops human and murine signatures with biochemically-inspired machine learning that are strictly validated using k-fold and traditional approaches.

Methods: Gene Expression Omnibus (GEO) datasets of exposed human and murine lymphocytes were preprocessed via nearest neighbor imputation and expression of genes implicated in the literature to be responsive to radiation exposure (n=998) were then ranked by Minimum Redundancy Maximum Relevance (mRMR). Optimal signatures were derived by backward, complete, and forward sequential feature selection using Support Vector Machines (SVM), and validated using k-fold or traditional validation on independent datasets.

Results: The best human signatures we derived exhibit k-fold validation accuracies of up to 98% (DDB2, PRKDC, TPP2, PTPRE, and GADD45A) when validated over 209 samples and traditional validation accuracies of up to 92% (DDB2, CD8A, TALDO1, PCNA, EIF4G2, LCN2, CDKN1A, PRKCH, ENO1, and PPM1D) when validated over 85 samples. Some human signatures are specific enough to differentiate between chemotherapy and radiotherapy. Certain multi-class murine signatures have sufficient granularity in dose estimation to inform eligibility for cytokine therapy (assuming these signatures could be translated to humans). We compiled a list of the most frequently appearing genes in the top 20 human and mouse signatures. More frequently appearing genes among an ensemble of signatures may indicate greater impact of these genes on the performance of individual signatures. Several genes in the signatures we derived are present in previously proposed signatures.

Conclusions: Gene signatures for ionizing radiation exposure derived by machine learning have low error rates in externally validated, independent datasets, and exhibit high specificity and granularity for dose estimation.

Logging Analysis and Prediction in Open Source Java Project

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Optimizing Contemporary Application and Processes in Open Source Software ◽ 10.4018/978-1-5225-5314-4.ch003 ◽ 2018 ◽ pp. 57-85

Author(s): Sangeeta Lal ◽ Neetu Sardana ◽ Ashish Sureka

Keyword(s): Machine Learning ◽ Content Analysis ◽ Software Development ◽ Anomaly Detection ◽ Open Source ◽ Large Scale ◽ Source Code ◽ Scale Analysis ◽ Large Scale Analysis ◽ Research Questions

Log statements in source code provide important information to software developers and are useful in various software development activities such as debugging, anomaly detection, and remote issue resolution. Most previous studies on logging analysis and prediction provide insights and results after analyzing only a few code constructs. In this chapter, the authors perform an in-depth, focused, and large-scale analysis of logging code constructs at two levels: the file level and the catch-block level. They answer several research questions related to statistical and content analysis. Statistical and content analysis reveals the presence of differentiating properties among logged and nonlogged code constructs. Based on these findings, the authors propose a machine-learning-based model for catch-block logging prediction. The machine-learning-based model is found to be effective in catch-block logging prediction.

Introduction to Machine Learning in Digital Healthcare Epidemiology

Infection Control and Hospital Epidemiology ◽ 10.1017/ice.2018.265 ◽ 2018 ◽ Vol 39 (12) ◽ pp. 1457-1462 ◽ Cited By ~ 12

Author(s): Jan A. Roth ◽ Manuel Battegay ◽ Fabrice Juchler ◽ Julia E. Vogt ◽ Andreas F. Widmer

Keyword(s): Machine Learning ◽ Large Scale ◽ Routine Data ◽ Full Potential ◽ Broad Area ◽ Analysis Techniques ◽ Large Scale Analysis ◽ Healthcare Epidemiology ◽ Technology Specialists ◽ Digital Healthcare

Abstract: To exploit the full potential of big routine data in healthcare and to efficiently communicate and collaborate with information technology specialists and data analysts, healthcare epidemiologists should have some knowledge of large-scale analysis techniques, particularly about machine learning. This review focuses on the broad area of machine learning and its first applications in the emerging field of digital healthcare epidemiology.

Predictive Strength of Ensemble Machine Learning Algorithms for the Diagnosis of Large Scale Medical Datasets

Applications of Big Data in Large- and Small-Scale Systems - Advances in Data Mining and Database Management ◽ 10.4018/978-1-7998-6673-2.ch016 ◽ 2021 ◽ pp. 260-281

Author(s): Elangovan Ramanujam ◽ L. Rasikannan ◽ S. Viswa ◽ B. Deepan Prashanth

Keyword(s): Machine Learning ◽ Large Scale ◽ Nearest Neighbor ◽ Learning Algorithms ◽ Weather Forecast ◽ Machine Learning Algorithms ◽ K Nearest Neighbor ◽ Ensemble Machine Learning ◽ Simple Technology ◽ Sensitivity Specificity

Machine learning is not a simple technology but a broad field with much still to explore. It has a number of real-time applications such as weather forecasting, price prediction, gaming, medicine, and fraud detection. Machine learning is increasingly used as data volumes grow, because it can produce mathematical and statistical models that analyze complex data and generate accurate results. To analyze the scalable performance of the learning algorithms, this chapter uses various medical datasets from the UCI Machine Learning repository, ranging from small to large. The performance of learning algorithms such as naïve Bayes, decision tree, k-nearest neighbor, and a stacking ensemble learning method is compared under different evaluation models using metrics such as accuracy, sensitivity, specificity, precision, and f-measure.
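The five metrics named in this abstract all follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the convention of returning 0.0 for an undefined ratio are assumptions made for illustration, not details taken from the chapter.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute common binary classification metrics from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when a ratio is undefined.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        # F-measure (F1): harmonic mean of precision and sensitivity.
        "f_measure": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

For a diagnostic dataset, `binary_metrics(true_labels, predicted_labels)` returns a dictionary keyed by metric name, which makes it straightforward to tabulate several classifiers side by side.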

Structure-specific DNA recombination sites: Design, validation, and machine learning–based refinement

Science Advances ◽ 10.1126/sciadv.aay2922 ◽ 2020 ◽ Vol 6 (30) ◽ pp. eaay2922

Author(s): Aleksandra Nivina ◽ Maj Svea Grieb ◽ Céline Loot ◽ David Bikard ◽ Jean Cury ◽ ...

Keyword(s): Machine Learning ◽ Large Scale ◽ Consensus Sequence ◽ Dna Recombination ◽ Structural Features ◽ Protein Coding ◽ Dna Hairpins ◽ Recombination Efficiency ◽ Recombination System ◽ Generation Sequencing

Recombination systems are widely used as bioengineering tools, but their sites have to be highly similar to a consensus sequence or to each other. To develop a recombination system free of these constraints, we turned toward attC sites from the bacterial integron system: single-stranded DNA hairpins specifically recombined by the integrase. Here, we present an algorithm that generates synthetic attC sites with conserved structural features and minimal sequence-level constraints. We demonstrate that all generated sites are functional, their recombination efficiency can reach 60%, and they can be embedded into protein coding sequences. To improve recombination of less efficient sites, we applied large-scale mutagenesis and library enrichment coupled to next-generation sequencing and machine learning. Our results validated the efficiency of this approach and allowed us to refine synthetic attC design principles. They can be embedded into virtually any sequence and constitute a unique example of a structure-specific DNA recombination system.
