A Gene-Based Machine Learning Classifier Associated to the Colorectal Adenoma—Carcinoma Sequence

Biomedicines ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. 1937
Author(s):  
Antonio Lacalamita ◽  
Emanuele Piccinno ◽  
Viviana Scalavino ◽  
Roberto Bellotti ◽  
Gianluigi Giannelli ◽  
...  

Colorectal cancer (CRC) carcinogenesis is generally the result of the sequential mutation and deletion of various genes; this is known as the normal mucosa–adenoma–carcinoma sequence. The aim of this study was to develop a predictor-classifier during the “adenoma-carcinoma” sequence using microarray gene expression profiles of primary CRC, adenoma, and normal colon epithelial tissues. Four gene expression profiles from the Gene Expression Omnibus database, containing 465 samples (105 normal, 155 adenoma, and 205 CRC), were preprocessed to identify differentially expressed genes (DEGs) between adenoma tissue and primary CRC. The feature selection procedure, using the sequential Boruta algorithm and Stepwise Regression, determined 56 highly important genes. K-Means methods showed that, using the selected 56 DEGs, the three groups were clearly separate. The classification was performed with machine learning algorithms such as Linear Model (LM), Random Forest (RF), k-Nearest Neighbors (k-NN), and Artificial Neural Network (ANN). The best classification method in terms of accuracy (88.06 ± 0.70) and AUC (92.04 ± 0.47) was k-NN. To confirm the relevance of the predictive models, we applied the four models on a validation cohort: the k-NN model remained the best model in terms of performance, with 91.11% accuracy. Among the 56 DEGs, we identified 17 genes with an ascending or descending trend through the normal mucosa–adenoma–carcinoma sequence. Moreover, using the survival information of the TCGA database, we selected six DEGs related to patient prognosis (SCARA5, PKIB, CWH43, TEX11, METTL7A, and VEGFA). The six-gene-based classifier described in the current study could be used as a potential biomarker for the early diagnosis of CRC.
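The k-NN classification step described in this abstract can be illustrated with a minimal sketch. It assumes Euclidean distance and a simple majority vote; the two-gene toy profiles, the function name `knn_predict`, and the choice of `k=3` are hypothetical and are not taken from the study, which worked with 56 selected genes.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Pair each training profile's Euclidean distance to the query with its label
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Count the labels of the k closest profiles and return the most common one
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy expression values for two hypothetical genes (NOT data from the study)
train_X = [
    [1.0, 9.0], [1.2, 8.8], [0.9, 9.1],  # normal mucosa
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],  # adenoma
    [9.0, 1.0], [8.8, 1.2], [9.1, 0.9],  # CRC
]
train_y = ["normal"] * 3 + ["adenoma"] * 3 + ["CRC"] * 3

prediction = knn_predict(train_X, train_y, [5.1, 5.0], k=3)
```

In practice, expression values would normally be scaled before computing distances, and `k` would be tuned by cross-validation.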

2020 ◽  
Author(s):  
Christopher A Mancuso ◽  
Jacob L Canfield ◽  
Deepak Singla ◽  
Arjun Krishnan

Abstract While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96-570 and LINCS), and three imputation tasks (within and across microarray/RNA-seq) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
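The sample-sample idea can be sketched as follows: the measured genes of the target sample are regressed on the same genes in the training samples, so each training sample gets a sparse weight, and those weights are then applied to the training values of the unmeasured genes. This is only an illustration under stated assumptions, not the authors' implementation. It uses a hand-written coordinate-descent LASSO with no intercept and no standardization, and the function names, the `alpha` value, and the toy matrices are all hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty` (the L1 shrinkage step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_fit(X, y, alpha=0.01, n_iter=100):
    """Coordinate-descent LASSO without intercept.

    X has one row per measured gene and one column per training sample;
    y holds the target sample's values for the same measured genes.
    """
    m, n = len(X), len(X[0])
    beta = [0.0] * n
    for _ in range(n_iter):
        for j in range(n):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # squared norm of column j
            for i in range(m):
                partial = sum(X[i][k] * beta[k] for k in range(n) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / m, alpha) / (z / m) if z > 0 else 0.0
    return beta


def impute_sample(train_measured, train_unmeasured, target_measured, alpha=0.01):
    """Return (per-sample weights, imputed values for the unmeasured genes)."""
    weights = lasso_fit(train_measured, target_measured, alpha)
    imputed = [sum(w * v for w, v in zip(weights, row)) for row in train_unmeasured]
    return weights, imputed


# Toy data (hypothetical): columns are three training samples; sample 0 is
# from "tissue A", samples 1 and 2 are from "tissue B".
train_measured = [[5.0, 1.0, 2.0], [1.0, 5.0, 4.0], [3.0, 2.0, 1.0], [2.0, 4.0, 5.0]]
train_unmeasured = [[7.0, 2.0, 3.0], [0.5, 6.0, 5.0]]
target_measured = [5.0, 1.0, 3.0, 2.0]  # resembles the tissue-A sample

weights, imputed = impute_sample(train_measured, train_unmeasured, target_measured)
```

In this toy case the L1 penalty drives the weights of the two tissue-B samples to zero, which mirrors the abstract's observation that the model leans on training samples from the same tissue.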


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Nina Hauptman ◽  
Emanuela Boštjančič ◽  
Margareta Žlajpah ◽  
Branislava Ranković ◽  
Nina Zidar

Colorectal cancer (CRC) is one of the leading causes of death by cancer worldwide. Bowel cancer screening programs enable us to detect early lesions and improve the prognosis of patients with CRC. However, they also generate a significant number of problematic polyps, e.g., adenomas with epithelial misplacement (pseudoinvasion), which can mimic early adenocarcinoma. Therefore, biomarkers that would enable us to distinguish between adenoma with epithelial misplacement (pseudoinvasion) and adenoma with early adenocarcinoma (true invasion) are needed. We hypothesized that the former are genetically similar to adenoma and the latter to adenocarcinoma, and we used a bioinformatics approach to search for candidate genes that might be used to distinguish between the two lesions. We used publicly available data from the Gene Expression Omnibus database and analyzed gene expression profiles of 252 samples of normal mucosa, colorectal adenoma, and carcinoma. In total, we analyzed 122 colorectal adenomas, 59 colorectal carcinomas, and 62 normal mucosa samples. We identified 16 genes with differential expression in carcinoma compared to adenoma: COL12A1, COL1A2, COL3A1, DCN, PLAU, SPARC, SPON2, SPP1, SULF1, FADS1, G0S2, EPHA4, KIAA1324, L1TD1, PCKS1, and C11orf96. In conclusion, our in silico analysis revealed 16 candidate genes with different expression patterns in adenoma compared to carcinoma, which might be used to discriminate between these two lesions.


2020 ◽  
Vol 79 (9) ◽  
pp. 1234-1242 ◽  
Author(s):  
Iago Pinal-Fernandez ◽  
Maria Casal-Dominguez ◽  
Assia Derfoul ◽  
Katherine Pak ◽  
Frederick W Miller ◽  
...  

Objectives: Myositis is a heterogeneous family of diseases that includes dermatomyositis (DM), antisynthetase syndrome (AS), immune-mediated necrotising myopathy (IMNM), inclusion body myositis (IBM), polymyositis and overlap myositis. Additional subtypes of myositis can be defined by the presence of myositis-specific autoantibodies (MSAs). The purpose of this study was to define unique gene expression profiles in muscle biopsies from patients with MSA-positive DM, AS and IMNM as well as IBM. Methods: RNA-seq was performed on muscle biopsies from 119 myositis patients with IBM or defined MSAs and 20 controls. Machine learning algorithms were trained on transcriptomic data and recursive feature elimination was used to determine which genes were most useful for classifying muscle biopsies into each type and MSA-defined subtype of myositis. Results: The support vector machine learning algorithm classified the muscle biopsies with >90% accuracy. Recursive feature elimination identified genes that are most useful to the machine learning algorithm and that are only overexpressed in one type of myositis. For example, CAMK1G (calcium/calmodulin-dependent protein kinase IG), EGR4 (early growth response protein 4) and CXCL8 (interleukin 8) are highly expressed in AS but not in DM or other types of myositis. Using the same computational approach, we also identified genes that are uniquely overexpressed in different MSA-defined subtypes. These included apolipoprotein A4 (APOA4), which is only expressed in anti-3-hydroxy-3-methylglutaryl-CoA reductase (HMGCR) myopathy, and MADCAM1 (mucosal vascular addressin cell adhesion molecule 1), which is only expressed in anti-Mi2-positive DM. Conclusions: Unique gene expression profiles in muscle biopsies from patients with MSA-defined subtypes of myositis and IBM suggest that different pathological mechanisms underlie muscle damage in each of these diseases.


2020 ◽  
Vol 48 (21) ◽  
pp. e125-e125
Author(s):  
Christopher A Mancuso ◽  
Jacob L Canfield ◽  
Deepak Singla ◽  
Arjun Krishnan

Abstract While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96–570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.


2021 ◽  
Author(s):  
Julián González Betancur ◽  
José Guevara-Coto ◽  
Adarli Romero

Abstract Background: Intellectual disabilities (IDs) are a group of developmental disorders with high phenotypic and genotypic heterogeneity. Associations between genetic elements and IDs have typically been established empirically; recently, however, machine learning (ML) has proved to be an excellent instrument for elucidating these associations. miRNAs are short non-coding molecules that participate in spatiotemporal gene regulation, making them relevant for understanding ID causality. Methods: In this study, we used the BrainSpan spatio-temporal expression database to develop a series of machine learning predictors: SVM, RF, FF-ANN, and Stochastic Gradient Descent Classifier. These models were capable of recognizing gene expression profiles. The best classifier was used to label miRNAs associated with NS-IDs using the BrainSpan expression profiles. Results: The best-performing model was an FF-ANN with an F1-score of 0.78, a weighted recall of 0.78, and a weighted precision of 0.78. We used this model to identify miRNAs with a high probability of being associated with NS-IDs using the spatio-temporal gene expression profile in the human brain. Labeled miRNAs that were annotated were associated with processes related to IDs and/or neurodevelopment. Conclusions: The development of a machine learning framework that identified potential NS-ID miRNAs represents an interesting approach for producing a list of potential genes that could be subject to further experimental validation. This study also reinforces the potential of machine learning frameworks for discovering biomarkers that could improve disease detection and management.


2021 ◽  
Author(s):  
Julián González Betancur ◽  
José A Guevara-Coto ◽  
Adarli Romero

Abstract Background: Intellectual disabilities (IDs) are a group of developmental disorders with high phenotypic and genotypic heterogeneity. Associations between genetic elements and IDs have typically been established empirically; recently, however, machine learning (ML) has proved to be an excellent instrument for elucidating these associations. miRNAs are short non-coding molecules that participate in spatiotemporal gene regulation, making them relevant for understanding ID causality. Methods: In this study, we used the BrainSpan spatio-temporal expression database to develop a series of machine learning predictors: SVM, RF, FF-ANN, and Stochastic Gradient Descent Classifier. These models were capable of recognizing gene expression profiles. The best classifier was used to label miRNAs associated with NS-IDs using the BrainSpan expression profiles. Results: The best-performing model was an FF-ANN with an F1-score of 0.78, a weighted recall of 0.78, and a weighted precision of 0.78. We used this model to identify miRNAs with a high probability of being associated with NS-IDs using the spatio-temporal gene expression profile in the human brain. Labeled miRNAs that were annotated were associated with processes related to IDs and/or neurodevelopment. Conclusions: The development of a machine learning framework that identified potential NS-ID miRNAs represents an interesting approach for producing a list of potential genes that could be subject to further experimental validation. This study also reinforces the potential of machine learning frameworks for discovering biomarkers that could improve disease detection and management. Keywords: miRNA association; artificial intelligence; machine learning; intellectual disability; biomarker


Genomics ◽  
2020 ◽  
Vol 112 (3) ◽  
pp. 2524-2534 ◽  
Author(s):  
Lei Chen ◽  
XiaoYong Pan ◽  
Wei Guo ◽  
Zijun Gan ◽  
Yu-Hang Zhang ◽  
...  

2020 ◽  
Author(s):  
Mostafa Abbas ◽  
Yasser EL-Manzalawy

Abstract Background: Differential expression (DE) analysis of transcriptomic data enables genome-wide analysis of gene expression changes associated with biological conditions of interest. Such analyses often provide a wide list of genes that are differentially expressed between two or more groups. In general, identified differentially expressed genes (DEGs) can be subject to further downstream analysis for obtaining more biological insights, such as determining enriched functional pathways or gene ontologies. Furthermore, DEGs are treated as candidate biomarkers, and a small set of DEGs might be identified as biomarkers using either biological knowledge or data-driven approaches. Methods: In this work, we present a novel approach for identifying biomarkers from a list of DEGs by re-ranking them according to the Minimum Redundancy Maximum Relevance (MRMR) criteria using a repeated cross-validation feature selection procedure. Results: Using gene expression profiles for 199 children with sepsis and septic shock, we identify 108 DEGs and propose a 10-gene signature for reliably predicting pediatric sepsis mortality with an estimated Area Under ROC (AUC) score of 0.89. Conclusions: Machine learning based refinement of DE analysis is a promising tool for prioritizing DEGs and discovering biomarkers from gene expression profiles. Moreover, our reported 10-gene signature for pediatric sepsis mortality may facilitate the development of reliable diagnosis and prognosis biomarkers for sepsis.
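The MRMR re-ranking criterion mentioned in this abstract can be sketched as a greedy loop: at each step, pick the gene whose relevance to the outcome most exceeds its average redundancy with the genes already chosen. The sketch below is an assumption-laden illustration, not the authors' procedure. It substitutes absolute Pearson correlation for the mutual-information scores commonly used in MRMR, omits the repeated cross-validation, and uses hypothetical gene names, toy values, and function names.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0


def mrmr_rank(features, labels, n_select):
    """Greedy MRMR ordering using |Pearson r| for relevance and redundancy.

    `features` maps a gene name to its expression values across samples.
    """
    relevance = {g: abs(pearson(v, labels)) for g, v in features.items()}
    selected = []
    remaining = list(features)

    def score(gene):
        # The first pick is driven by relevance alone
        if not selected:
            return relevance[gene]
        redundancy = sum(
            abs(pearson(features[gene], features[s])) for s in selected
        ) / len(selected)
        return relevance[gene] - redundancy

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): six samples, binary outcome, three candidate DEGs.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "geneA": [1, 2, 1, 8, 9, 8],  # strongly associated with the outcome
    "geneB": [2, 3, 1, 7, 9, 8],  # also associated, but nearly a copy of geneA
    "geneC": [3, 5, 2, 6, 4, 7],  # weaker association, less redundant with geneA
}

ranking = mrmr_rank(features, labels, n_select=3)
```

On this toy input, the near-duplicate gene is pushed behind the weaker but less redundant one, which is the behavior that distinguishes MRMR from ranking by relevance alone.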

