BioData Mining
Latest Publications


TOTAL DOCUMENTS

386
(FIVE YEARS 96)

H-INDEX

28
(FIVE YEARS 5)

Published By Springer (Biomed Central Ltd.)

1756-0381

2022 ◽  
Vol 15 (1) ◽  
Author(s):  
Colinda C.J.M. Simons ◽  
Leo J. Schouten ◽  
Roger W.L. Godschalk ◽  
Frederik-Jan van Schooten ◽  
Monika Stoll ◽  
...  

Abstract

Background: The mTOR-PI3K-Akt pathway influences cell metabolism and (malignant) cell growth. We generated sex-specific polygenic risk scores capturing natural variation in 7 of the 10 top-ranked genes in this pathway. We studied the scores directly and in interaction with energy balance-related factors (body mass index (BMI), trouser/skirt size, height, physical activity, and early-life energy restriction) in relation to colorectal cancer (CRC) risk in the Netherlands Cohort Study (NLCS) (n = 120,852). The NLCS has a case-cohort design and 20.3 years of follow-up. Participants completed a baseline questionnaire on diet and cancer in 1986, when 55-69 years old. Approximately 75% of the cohort returned toenail clippings, which were used for DNA isolation and genotyping (n subcohort = 3,793; n cases = 3,464). Because no SNPs in the top-ranked genes were associated with CRC risk in previous genome-wide association studies at a significance level of p < 1×10−5, the dataset was split in two, and risk alleles were defined and weighted based on sex-specific associations with CRC risk in the other dataset half.

Results: Cox regression analyses showed positive associations between the sex-specific polygenic risk scores and colon, but not rectal, cancer risk in men and women, with hazard ratios for continuously modeled scores close to 1.10. No modifying effect of the scores on associations between the energy balance-related factors and CRC risk was observed. However, BMI (in men), non-occupational physical activity (in women), and height (in men and women) were associated with the risk of CRC, in particular (proximal and distal) colon cancer, in the expected direction in the lower tertiles of the sex-specific polygenic risk scores.

Conclusions: The current data suggest that the mTOR-PI3K-Akt pathway may be involved in colon cancer development. This study thereby sheds more light on colon cancer etiology through the use of genetic variation in the mTOR-PI3K-Akt pathway.
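A polygenic risk score of the kind described above is, at its core, a weighted sum of risk-allele counts, with per-allele weights estimated in the held-out half of a split dataset. The sketch below is illustrative only; the SNP names and weights are hypothetical and not taken from the NLCS analysis.

```python
# Minimal sketch of a polygenic risk score: sum of risk-allele counts
# weighted by effect estimates learned in the other half of the data.
# SNP identifiers and weight values below are hypothetical.

def polygenic_risk_score(genotypes, weights):
    """genotypes: dict SNP -> risk-allele count (0, 1, or 2);
    weights: dict SNP -> per-allele weight from the training half."""
    return sum(weights[snp] * count for snp, count in genotypes.items())

# Hypothetical weights for three SNPs in pathway genes
weights = {"rs_a": 0.12, "rs_b": -0.05, "rs_c": 0.08}
genotypes = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = polygenic_risk_score(genotypes, weights)  # 0.24 - 0.05 + 0.0
```

In the study, scores built this way were then modeled (continuously and in tertiles) as exposures in Cox regression.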


2022 ◽  
Vol 15 (1) ◽  
Author(s):  
Pelin Gundogdu ◽  
Carlos Loucera ◽  
Inmaculada Alamo-Alvarez ◽  
Joaquin Dopazo ◽  
Isabel Nepomuceno

Abstract

Background: Single-cell RNA sequencing (scRNA-seq) data provide valuable insights into cellular heterogeneity, significantly improving current knowledge of biology and human disease. One of the main applications of scRNA-seq data analysis is the identification of new cell types and cell states. Deep neural networks (DNNs) are among the best methods to address this problem. However, this performance comes at the cost of a lack of interpretability in the results. In this work we propose an intelligible pathway-driven neural network that correctly solves cell-type-related problems at single-cell resolution while providing a biologically meaningful representation of the data.

Results: In this study, we explored deep neural networks constrained by several types of prior biological information, e.g. signaling pathway information, as a way to reduce the dimensionality of the scRNA-seq data. We tested the proposed biologically based architectures on thousands of cells of human and mouse origin across a collection of public datasets to check the performance of the model. Specifically, we tested the architecture across different validation scenarios that mimic how unknown cell types are clustered by the DNN and how it correctly annotates cell types by querying a database in a retrieval problem. Moreover, our approach proved comparable to other, less interpretable DNN approaches constrained using protein-protein interaction or gene-regulation data. Finally, we show how the latent structure learned by the network can be used to visualize and interpret the composition of human single-cell datasets.

Conclusions: Here we demonstrate how integrating pathways, which convey fundamental information on functional relationships between genes, with DNNs, which provide an excellent classification framework, yields a biologically meaningful representation of scRNA-seq data. In addition, the introduction of prior biological knowledge into the DNN reduces the size of the network architecture. Comparative results demonstrate the superior performance of this approach with respect to other similar approaches. As an additional advantage, the use of pathways within the DNN structure enables easy interpretation of the results by connecting features to cell functionalities through the pathway nodes, as demonstrated with an example on human melanoma tumor cells.
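A common way to constrain a network layer by pathway membership, as described above, is to mask the weight matrix so each hidden unit connects only to the genes of its pathway. The sketch below illustrates the idea; the gene/pathway sizes and membership matrix are hypothetical and not from the paper.

```python
import numpy as np

# Sketch of a pathway-constrained layer: each hidden unit stands for one
# pathway and sees only its member genes, implemented by elementwise
# masking of the weight matrix. All sizes and memberships are made up.

rng = np.random.default_rng(0)
n_genes, n_pathways = 6, 2

# mask[i, j] = 1 if gene i belongs to pathway j, else 0
mask = np.array([[1, 0], [1, 0], [1, 0],
                 [0, 1], [0, 1], [0, 1]], dtype=float)
W = rng.normal(size=(n_genes, n_pathways))

def pathway_layer(x):
    # only within-pathway weights contribute; ReLU activation
    return np.maximum(x @ (W * mask), 0.0)

x = rng.normal(size=(1, n_genes))   # one cell's expression vector
h = pathway_layer(x)                # one activation per pathway
```

Because the mask zeroes most connections, the layer also has far fewer effective parameters than a dense layer, which is the size reduction the abstract mentions.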


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Arnaud Nguembang Fadja ◽  
Fabrizio Riguzzi ◽  
Giorgio Bertorelle ◽  
Emiliano Trucchi

Abstract

Background: With the increase in the size of genomic datasets describing variability in populations, extracting relevant information becomes increasingly useful as well as complex. Recently, computational methodologies such as Supervised Machine Learning, and specifically Convolutional Neural Networks, have been proposed for making inferences on demographic and adaptive processes from genomic data. Even though it has already been shown to be powerful and efficient in other fields of investigation, Supervised Machine Learning has yet to be fully explored to unfold its enormous potential in evolutionary genomics.

Results: The paper proposes a method based on Supervised Machine Learning for classifying genomic data, represented as windows of genomic sequences from a sample of individuals belonging to the same population. A Convolutional Neural Network is used to test whether a genomic window shows the signature of natural selection. Training performed on simulated data shows that the proposed model can accurately predict neutral and selection processes on portions of genomes taken from real populations with almost 90% accuracy.
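The core operation of such a CNN is a convolution over a genomic window encoded as a binary individuals-by-sites matrix. The toy sketch below shows a single valid-mode convolution followed by global max pooling; the window, filter, and sizes are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

# Sketch: a genomic window as a binary matrix (individuals x sites)
# passed through one hand-written convolution plus global max pooling,
# the basic building block of a CNN selection classifier. Values are
# illustrative only.

def conv_valid(window, kernel):
    """Slide a full-height kernel across the sites axis (valid mode)."""
    n_ind, n_sites = window.shape
    k = kernel.shape[1]
    return np.array([[(window[:, j:j + k] * kernel).sum()
                      for j in range(n_sites - k + 1)]])

window = np.array([[0, 1, 1, 0, 1],   # 3 individuals, 5 sites
                   [0, 1, 1, 1, 1],
                   [0, 1, 1, 0, 0]], dtype=float)
kernel = np.ones((3, 2)) / 6.0        # averages each 3x2 patch

feature_map = conv_valid(window, kernel)
pooled = feature_map.max()            # global max pooling -> scalar feature
```

In a real model, many learned filters produce such pooled features, which feed dense layers that output the neutral-vs-selection decision.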


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Chih-Wei Chung ◽  
Tzu-Hung Hsiao ◽  
Chih-Jen Huang ◽  
Yen-Ju Chen ◽  
Hsin-Hua Chen ◽  
...  

Abstract

Background: Rheumatoid arthritis (RA) and systemic lupus erythematosus (SLE) are autoimmune rheumatic diseases that share a complex genetic background and common clinical features. This study's purpose was to construct machine learning (ML) models for the genomic prediction of RA and SLE.

Methods: A total of 2,094 patients with RA and 2,190 patients with SLE were enrolled from the Taichung Veterans General Hospital cohort of the Taiwan Precision Medicine Initiative. Genome-wide single nucleotide polymorphism (SNP) data were obtained using the Taiwan Biobank version 2 array. The ML methods used were logistic regression (LR), random forest (RF), support vector machine (SVM), gradient tree boosting (GTB), and extreme gradient boosting (XGB). SHapley Additive exPlanation (SHAP) values were calculated to clarify the contribution of each SNP. Human leukocyte antigen (HLA) imputation was performed using the HLA Genotype Imputation with Attribute Bagging package.

Results: Compared with LR (area under the curve [AUC] = 0.8247), the RF (AUC = 0.9844), SVM (AUC = 0.9828), GTB (AUC = 0.9932), and XGB (AUC = 0.9919) approaches exhibited significantly better prediction performance. The top 20 genes by feature importance and SHAP values included HLA class II alleles. We found that imputed HLA-DQA1*05:01, DQB1*02:01, and DRB1*03:01 were associated with SLE, whereas HLA-DQA1*03:03, DQB1*04:01, and DRB1*04:05 were more frequently observed in patients with RA.

Conclusions: We established ML methods for the genomic prediction of RA and SLE. Genetic variations at HLA-DQA1, HLA-DQB1, and HLA-DRB1 were crucial for differentiating RA from SLE. Future studies are required to verify our results and explore their mechanistic explanation.
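SHAP values, used above to rank SNP contributions, approximate the exact Shapley values of game theory. For a tiny model they can be computed exactly, which the sketch below does for a toy two-feature linear "risk model"; the model and inputs are hypothetical, not the study's classifiers.

```python
import math
from itertools import combinations

# Illustrative only: exact Shapley values for a toy two-feature model,
# the quantity SHAP approximates when ranking feature (e.g. SNP)
# contributions. Model and inputs are hypothetical.

def shapley_values(f, x, baseline):
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                # classic Shapley weight for a coalition of size |S|
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                with_i = [x[j] if (j in S or j == i) else baseline[j]
                          for j in range(n)]
                without = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += w * (f(with_i) - f(without))
    return phi

risk = lambda v: 2 * v[0] + 3 * v[1]   # toy linear "risk model"
phi = shapley_values(risk, x=[1.0, 1.0], baseline=[0.0, 0.0])
# for a linear model, each phi equals coefficient * (x - baseline)
```

The exact computation is exponential in the number of features, which is why SHAP uses model-specific approximations (e.g. for tree ensembles) in practice.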


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Lihong Peng ◽  
Ruya Yuan ◽  
Ling Shen ◽  
Pengfei Gao ◽  
Liqian Zhou

Abstract

Background: Long noncoding RNAs (lncRNAs) are densely linked with various biological processes. Identifying interacting lncRNA-protein pairs contributes to understanding the functions and mechanisms of lncRNAs. Wet experiments are costly and time-consuming. Most computational methods fail to account for the imbalanced character of lncRNA-protein interaction (LPI) data. More importantly, they were evaluated on a single dataset, which introduces prediction bias.

Results: In this study, we develop an ensemble framework (LPI-EnEDT) with extra-tree and decision-tree classifiers to perform imbalanced LPI data classification. First, five LPI datasets are assembled. Second, lncRNAs and proteins are separately characterized based on Pyfeat and BioTriangle and concatenated into a vector representing each lncRNA-protein pair. Finally, an ensemble framework with extra-tree and decision-tree classifiers is developed to classify unlabeled lncRNA-protein pairs. Comparative experiments demonstrate that LPI-EnEDT outperforms four classical LPI prediction methods (LPI-BLS, LPI-CatBoost, LPI-SKF, and PLIPCOM) under cross validations on lncRNAs, proteins, and LPIs. The average AUC values on the five datasets are 0.8480, 0.7078, and 0.9066 under the three cross validations, respectively. The average AUPRs are 0.8175, 0.7265, and 0.8882, respectively. Case analyses suggest underlying associations between HOTTIP and Q9Y6M1, and between NRON and Q15717.

Conclusions: By fusing diverse biological features of lncRNAs and proteins and exploiting an ensemble learning model with extra-tree and decision-tree classifiers, this work focuses on imbalanced LPI data classification as well as interaction inference for a new lncRNA (or protein).
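The combining step of an ensemble like the one above can be as simple as majority voting over the base learners' predictions. The sketch below uses toy threshold classifiers as stand-ins for the extra-tree and decision-tree base learners; everything in it is illustrative, not the LPI-EnEDT code.

```python
# Sketch of hard-voting ensembling: each base classifier votes 0/1 on a
# feature vector and the majority wins. The threshold classifiers below
# are toy stand-ins for trained tree-based base learners.

def make_threshold_clf(feature_idx, threshold):
    return lambda x: 1 if x[feature_idx] > threshold else 0

ensemble = [
    make_threshold_clf(0, 0.5),
    make_threshold_clf(1, 0.3),
    make_threshold_clf(0, 0.8),
]

def predict(x):
    votes = sum(clf(x) for clf in ensemble)
    return 1 if votes > len(ensemble) / 2 else 0

pred = predict([0.9, 0.1])   # two of the three toy classifiers vote 1
```

In practice the base learners are trained on resampled subsets, which is one way ensembles mitigate the class imbalance the abstract highlights.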


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Jacqueline Beinecke ◽  
Dominik Heider

Abstract

Clinical data sets have very special properties and suffer from many caveats in machine learning. They typically show a high class imbalance, have a small number of samples and a large number of parameters, and contain missing values. While feature selection approaches and imputation techniques address the latter problems, class imbalance is typically addressed using augmentation techniques. However, these techniques have been developed for big data analytics, and their suitability for clinical data sets is unclear. This study analyzed different augmentation techniques for use in clinical data sets and subsequent machine learning-based classification. It turns out that Gaussian Noise Up-Sampling (GNUS) is not always, but generally, as good as SMOTE and ADASYN, and even outperforms them on some datasets. However, it has also been shown that augmentation does not improve classification at all in some cases.
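The basic idea of Gaussian Noise Up-Sampling is to duplicate minority-class samples with small Gaussian perturbations until the classes are balanced. A minimal sketch follows; the noise scale and data are assumptions for illustration, not values from the study.

```python
import random

# Minimal sketch of Gaussian Noise Up-Sampling (GNUS): minority-class
# samples are resampled with additive Gaussian noise until a target
# class size is reached. sigma and the data below are illustrative.

def gnus(minority, target_size, sigma=0.05, seed=0):
    rng = random.Random(seed)
    out = list(minority)                      # keep originals unchanged
    while len(out) < target_size:
        base = rng.choice(minority)
        out.append([v + rng.gauss(0.0, sigma) for v in base])
    return out

minority = [[0.2, 1.1], [0.3, 0.9]]           # two minority samples
augmented = gnus(minority, target_size=6)     # up-sample to six
```

Unlike SMOTE, which interpolates between minority neighbors, GNUS perturbs individual samples, which is why the two methods can behave differently on small clinical datasets.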


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Mahyar Sharifi ◽  
Toktam Khatibi ◽  
Mohammad Hassan Emamian ◽  
Somayeh Sadat ◽  
Hassan Hashemi ◽  
...  

Abstract

Objectives: To develop and propose a machine learning model for predicting glaucoma and identifying its risk factors.

Methods: The data analysis pipeline for this study was designed based on the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. The main steps of the pipeline are data sampling, preprocessing, classification, and evaluation and validation. Data sampling for providing the training dataset was performed with balanced sampling based on over-sampling and under-sampling methods. Data preprocessing steps were missing-value imputation and normalization. For the classification step, several machine learning models were designed for predicting glaucoma, including decision trees (DTs), k-nearest neighbors (K-NN), support vector machines (SVM), random forests (RFs), extra trees (ETs), and bagging ensemble methods. Moreover, a novel stacking ensemble model built from the superior classifiers is designed and proposed.

Results: The data were from the Shahroud Eye Cohort Study, comprising demographic and ophthalmologic data for 5,190 participants aged 40-64 living in Shahroud, northeast Iran. The main variables considered in this dataset were 67 demographic, ophthalmologic, optometric, perimetry, and biometry features for 4,561 people, including 4,474 non-glaucoma participants and 87 glaucoma patients. Experimental results show that DTs and RFs trained on the under-sampled training dataset predict glaucoma better than the compared single classifiers and bagging ensemble methods, with average accuracies of 87.61 and 88.87, sensitivities of 73.80 and 72.35, specificities of 87.88 and 89.10, and areas under the curve (AUC) of 91.04 and 94.53, respectively. The proposed stacking ensemble has an average accuracy of 83.56, a sensitivity of 82.21, a specificity of 81.32, and an AUC of 88.54.

Conclusions: In this study, a machine learning model was proposed and developed to predict glaucoma among persons aged 40-64. The top predictors for discriminating glaucoma patients from non-glaucoma persons include the number of visual field defects on perimetry, vertical cup-to-disc ratio, white-to-white diameter, systolic blood pressure, pupil barycenter on the Y coordinate, age, and axial length.
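Stacking, as used above, feeds the base classifiers' outputs as features into a meta-learner. The sketch below shows the two-level prediction flow with toy stand-ins for the tree-based base models and the meta-model; all functions are hypothetical.

```python
# Sketch of stacking: level-0 base-classifier predictions become the
# feature vector for a level-1 meta-learner. The lambdas are toy
# stand-ins for trained classifiers, not the study's models.

def stack_predict(x, base_models, meta_model):
    meta_features = [m(x) for m in base_models]   # level-0 outputs
    return meta_model(meta_features)              # level-1 decision

base_models = [
    lambda x: 1 if x[0] > 0.6 else 0,   # stand-in for a decision tree
    lambda x: 1 if x[1] > 0.4 else 0,   # stand-in for a random forest
]
meta_model = lambda feats: 1 if sum(feats) >= 1 else 0   # simple OR rule

pred = stack_predict([0.7, 0.2], base_models, meta_model)
```

In a real pipeline the meta-model is itself trained on out-of-fold base predictions to avoid leaking training labels into level 1.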


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Miquel Ensenyat-Mendez ◽  
Sandra Íñiguez-Muñoz ◽  
Borja Sesé ◽  
Diego M. Marzese
Keyword(s):  

2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Jun Ma ◽  
Alison Motsinger-Reif

Abstract

Background: Cancer is one of the main causes of death worldwide. Combination drug therapy has been a mainstay of cancer treatment for decades and has been shown to reduce host toxicity and prevent the development of acquired drug resistance. However, the immense number of possible drug combinations and the large synergistic space make it infeasible to screen all effective drug pairs experimentally. Therefore, it is crucial to develop computational approaches that predict drug synergy and guide experimental design toward the discovery of rational combinations for therapy.

Results: We present a new deep learning approach to predict synergistic drug combinations by integrating gene expression profiles from cell lines and chemical structure data. Specifically, we use principal component analysis (PCA) to reduce the dimensionality of the chemical descriptor data and gene expression data, and then propagate the low-dimensional data through a neural network to predict drug synergy values. We apply our method to O'Neil's high-throughput drug combination screening data as well as a dataset from the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge. We compare the neural network approach with and without dimension reduction, and we additionally demonstrate the effectiveness of our deep learning approach by comparing its performance with three state-of-the-art machine learning methods (random forests, XGBoost, and elastic net), each with and without PCA-based dimensionality reduction.

Conclusions: Our approach outperforms the other machine learning methods, and the use of dimension reduction dramatically decreases computation time without sacrificing accuracy.
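The PCA preprocessing step described above projects high-dimensional descriptors onto their top principal components before the network sees them. The sketch below implements that projection via SVD on random stand-in data; the matrix sizes and component count are illustrative, not the paper's settings.

```python
import numpy as np

# Sketch of the dimensionality-reduction step: center the feature
# matrix and project it onto the top principal components via SVD.
# The random data and sizes are stand-ins for real descriptor data.

def pca_reduce(X, n_components):
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # scores on top components

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 20))                # 50 samples, 20 descriptors
Z = pca_reduce(X, n_components=5)            # reduced to 5 dimensions
```

The reduced matrix Z would then be (together with the reduced expression features) the input to the synergy-prediction network, which is where the computation-time savings come from.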


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Mila Glavaški ◽  
Lazar Velicki

Abstract

Background: Biomedical knowledge is dispersed across the scientific literature and is growing constantly. Curation is the extraction of knowledge from unstructured data into a computable form and can be done manually or automatically. Hypertrophic cardiomyopathy (HCM) is the most common inherited cardiac disease, with genotype-phenotype associations still incompletely understood. We compared human- and machine-curated models of HCM molecular mechanisms and examined the performance of different machine approaches for that task.

Results: We created six models representing HCM molecular mechanisms using different approaches, made them publicly available, analyzed them as networks, and tried to explain the models' differences by analyzing the factors that affect the quality of machine-curated models (query constraints and reading systems' performance). A further result of this work is the Interactive HCM map, the only publicly available knowledge resource dedicated to HCM. Sizes and topological parameters of the networks differed notably, and low consensus was found between networks in terms of centrality measures. Consensus about the most important nodes was achieved only with respect to one element (calcium). Models with a reduced level of noise were generated, and cooperatively working elements were detected. The REACH and TRIPS reading systems showed much higher accuracy than Sparser, but at the cost of extraction performance. TRIPS proved to be the best single reading system for text segments about HCM, in terms of the compromise between accuracy and extraction performance.

Conclusions: Different curation approaches can produce models of the same disease with diverse characteristics, and these give rise to utterly different conclusions in subsequent analysis. The final purpose of the model should direct the choice of curation techniques. Manual curation remains the gold standard for information extraction in biomedical research and is most suitable when only high-quality elements are required for the models. Automated curation provides more substance, but a high level of noise is to be expected. Different curation strategies can reduce the level of human input needed. Biomedical knowledge, especially given its rapid growth, would benefit overwhelmingly if computers were able to assist in analysis on a larger scale.

