Identifying Subgroups of Patients With Autism by Gene Expression Profiles Using Machine Learning Algorithms

2021 ◽  
Vol 12 ◽  
Author(s):  
Ping-I Lin ◽  
Mohammad Ali Moni ◽  
Susan Shur-Fen Gau ◽  
Valsamma Eapen

Objectives: The identification of subgroups of autism spectrum disorder (ASD) may partially remedy the problem of clinical heterogeneity and thereby improve clinical management. The current study uses machine learning algorithms to analyze microarray data and identify clusters with relatively homogeneous clinical features. Methods: Whole-genome gene expression microarray data were used to predict communication quotient (SCQ) scores against all probes in order to select differential expression regions (DERs). Gene set enrichment analysis was performed for DERs with a fold-change >2 to identify hub pathways that play a role in the severity of the social communication deficits inherent to ASD. We then used two machine learning methods, random forest classification (RF) and support vector machine (SVM), to identify two clusters using DERs. Finally, we evaluated how accurately the clusters predicted language impairment. Results: A total of 191 DERs were initially identified, and the 54 with a fold-change >2 were selected for the pathway analysis. Cholesterol biosynthesis and metabolism pathways appear to act as hubs that connect other trait-associated pathways to influence the severity of the social communication deficits inherent to ASD. Both the RF and SVM algorithms yielded a classification accuracy >90% when all 191 DERs were analyzed. The ASD subtypes defined by the presence of language impairment, a strong indicator for prognosis, can be predicted by transcriptomic profiles associated with social communication deficits and with cholesterol biosynthesis and metabolism. Conclusion: The results suggest that both RF and SVM are acceptable machine learning algorithms for identifying ASD subgroups characterized by clinical homogeneity related to prognosis.
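
Two steps named in this abstract, filtering DERs by fold-change and checking how well two clusters line up with language impairment, can be illustrated with a minimal Python sketch. The abstract does not state how fold-change was computed, so the ratio of group means used here is an assumption. The function names, probe names, and values are hypothetical, and this is not the authors' pipeline (it does not implement RF or SVM).

```python
def select_by_fold_change(regions, threshold=2.0):
    """Return names of regions whose fold-change exceeds `threshold`.

    `regions` holds (name, mean_group_a, mean_group_b) tuples on a linear
    (not log) scale. Assumption: fold-change is the ratio of group means,
    counted in either direction (up- or down-regulated).
    """
    selected = []
    for name, mean_a, mean_b in regions:
        ratio = mean_a / mean_b
        if ratio > threshold or ratio < 1.0 / threshold:
            selected.append(name)
    return selected


def cluster_accuracy(clusters, impaired):
    """Fraction of samples where cluster membership matches impairment status.

    Cluster IDs (0/1) are arbitrary, so both label assignments are tried
    and the better agreement is reported.
    """
    agree = sum(c == y for c, y in zip(clusters, impaired))
    n = len(clusters)
    return max(agree, n - agree) / n


# Toy, made-up values purely for illustration (not data from the study).
regions = [("probe_a", 8.0, 2.0), ("probe_b", 3.0, 2.0), ("probe_c", 1.0, 4.0)]
print(select_by_fold_change(regions))
print(cluster_accuracy([0, 0, 1, 1, 1], [1, 1, 0, 0, 1]))
```

Trying both label assignments matters because a clustering method has no notion of which cluster is the "impaired" one; without that step a perfectly separating clustering could score 0% simply because its labels are swapped.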

2020 ◽  
Author(s):  
Ping-I Lin ◽  
Mohammad Ali Moni ◽  
Valsamma Eapen ◽  
Susan Shur-Fen Gau

Abstract Background: Clinical heterogeneity in autism spectrum disorder (ASD) can complicate diagnosis and treatment. The identification of biomarkers may hold the key to the classification of ASD subgroups. Accumulating evidence suggests that genetic or genomic markers may facilitate the clustering of patients with ASD. The goal of the current study is to use machine learning algorithms to analyze microarray data and identify clusters with relatively homogeneous clinical features, such as language function. Methods: Whole-genome gene expression microarray data were used to predict communication quotient (SCQ) scores against all probes in order to select differential expression regions (DERs). Gene set enrichment analysis was performed to identify hub pathways that play a role in the severity of the social communication deficits inherent to ASD. We then used two machine learning methods, random forest classification (RF) combined with partition around medoids (PAM) and support vector machine (SVM), to identify two clusters using DERs. Finally, we evaluated how accurately the clusters predicted language impairment. Results: A total of 191 DERs were identified. Cholesterol biosynthesis and metabolism pathways appear to act as hubs that connect other trait-associated pathways to influence the severity of the social communication deficits inherent to ASD. Both the RF and SVM algorithms yielded a classification accuracy greater than 90% when all 191 DERs were analyzed. Limitations: The primary limitation of the current study is the small sample size. Nevertheless, some machine learning algorithms, such as SVM, can handle a small sample with a large number of features. Additionally, model overfitting may arise because no independent sample was available for validation. Furthermore, unknown confounders may cause spurious associations between the phenotype and genomic markers.
Conclusions: The ASD subtypes defined by the presence of language impairment, a strong indicator for prognosis, can be predicted by transcriptomic profiles associated with social communication deficits and with cholesterol biosynthesis and metabolism. Our proof-of-concept study suggests that both RF and SVM are acceptable machine learning algorithms for identifying ASD subgroups characterized by clinical homogeneity related to prognosis.


2021 ◽  
Author(s):  
Melih Agraz ◽  
Umut Agyuz ◽  
E. Celeste Welch ◽  
Kaymaz Yasin ◽  
Kuyumcu Birol

Abstract Background: Metastasis is one of the most challenging problems in cancer diagnosis and treatment, as its causes have not yet been well characterized. Prediction of the metastatic status of breast cancer is important in cancer research because it has the potential to save lives. However, the systems biology behind metastasis is complex and driven by a variety of factors beyond those that have already been characterized for various cancer types. Furthermore, prediction of cancer metastasis is a challenging task due to the variation in parameters and conditions specific to individual patients and mutation of the sub-types. Results: In this paper, we apply tree-based machine learning algorithms to gene expression data to estimate metastatic potential within a group of 490 breast cancer patients. We use three tree-based algorithms (decision trees, gradient boosting, and extremely randomized trees) to assess variable importance. Conclusions: All three algorithms were highly accurate; gradient boosting achieved the highest accuracy, 0.8901. Finally, we were able to determine the 10 most important genetic variables used in the boosted algorithms, as well as their respective importance scores and biological importance. The important genes common to our algorithms were CD8, PB1, and THP-1. CD8, also known as CD8A, is a receptor for the TCR, or T-cell receptor, which facilitates cytotoxic T-cell activity, and its association with cancer is defined in the paper. PB1, also known as PBRM1 or polybromo 1, is a tumor suppressor gene. THP-1, or GLI2, is a zinc finger protein referred to as “Glioma-Associated Oncogene Family Zinc Finger 2”. This gene encodes a zinc finger protein that binds DNA and mediates Sonic hedgehog (SHH) signaling. Disruptions in the SHH pathway have long been associated with cancer and cellular proliferation.
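
Tree-based methods like those named in this abstract rank variables by how much splitting on them reduces class impurity. As a minimal sketch of that idea only, the Python below scores each gene by the largest Gini impurity decrease a single threshold split can achieve, which is the depth-one special case of a decision tree. It is not the authors' decision tree, gradient boosting, or extremely randomized trees pipeline, and the gene names, values, and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = metastatic, 0 = not)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest Gini impurity decrease from one threshold split on a feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_features(expression, labels):
    """Sort genes by single-split impurity decrease, most informative first."""
    gains = {gene: best_split_gain(vals, labels) for gene, vals in expression.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up expression values for four samples (not data from the study).
expression = {"gene_a": [1, 2, 8, 9], "gene_b": [5, 1, 5, 1]}
labels = [0, 0, 1, 1]
for gene, gain in rank_features(expression, labels):
    print(gene, gain)
```

Full tree ensembles accumulate such impurity decreases over many splits and many trees, so their importance scores reflect interactions that this single-split version cannot capture.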


2021 ◽  
Vol 12 (2) ◽  
pp. 2422-2439

Cancer classification is one of the main objectives of analyzing big biological datasets. Machine learning algorithms (MLAs) have been used extensively to accomplish this task. Several popular MLAs are available in the literature to classify new samples into normal or cancer populations. Nevertheless, most of them yield lower accuracy in the presence of outliers, which leads to incorrect classification of samples. Hence, in this study, we present a robust approach for the efficient and precise classification of samples using noisy gene expression datasets (GEDs). We examine the performance of the proposed procedure in comparison with five popular traditional MLAs (SVM, LDA, KNN, Naïve Bayes, Random forest) using both simulated and real gene expression data. We also considered several rates of outliers (10%, 20%, and 50%). The results obtained from simulated data confirm that, in the presence of outliers, the traditional MLAs produce better results when applied through our proposed procedure to the modified datasets. Further transcriptome analysis found significant involvement of these extra features in cancer. The results indicate that our proposed procedure improves the performance of the traditional MLAs. Hence, we recommend the proposed procedure over the traditional procedure for cancer classification.


Mutagenesis ◽  
2020 ◽  
Vol 35 (2) ◽  
pp. 153-159
Author(s):  
Rhiannon David

Abstract Toxicogenomics, the application of genomics to toxicology, was described as ‘a new era’ for toxicology. Standard toxicity tests typically involve a number of short-term bioassays that are costly, time consuming, require large numbers of animals and generally focus on a single end point. Toxicogenomics was heralded as a way to improve the efficiency of toxicity testing by assessing gene regulation across the genome, allowing rapid classification of compounds based on characteristic expression profiles. Gene expression microarrays could measure and characterise genome-wide gene expression changes in a single study, and while transcriptomic profiles that can discriminate between genotoxic and non-genotoxic carcinogens have been identified, challenges with the approach limited its application. As such, toxicogenomics did not transform the field of genetic toxicology in the way that was predicted. More recently, next generation sequencing (NGS) technologies have revolutionised genomics because hundreds of billions of base pairs can be sequenced simultaneously, more cheaply and quickly than with traditional Sanger methods. In relation to genetic toxicology, thousands of cancer genomes have been sequenced and single-base substitution mutational signatures identified, and mutational signatures have also been identified following treatment of cells with known or suspected environmental carcinogens. RNAseq has been applied to detect transcriptional changes following treatment with genotoxins; modified RNAseq protocols have been developed to identify adducts in the genome; and Duplex sequencing is an example of a technique that has recently been developed to detect mutations accurately. Machine learning, including MutationSeq and SomaticSeq, has also been applied to somatic mutation detection, and improvements in automation and/or the application of machine learning algorithms may allow high-throughput mutation sequencing in the future.
This review will discuss the initial promise of transcriptomics for genetic toxicology, and how the development of NGS technologies and new machine learning algorithms may finally realise that promise.

