Machine Learning and Combinatorial Optimization to Detect Gene-gene Interactions in Genome-wide Real Data: Looking Through the Prism of Four Methods and Two Protocols

Gene-Based Testing of Interactions Using XGBoost in Genome-Wide Association Studies

Frontiers in Cell and Developmental Biology, 2021, Vol 9. DOI: 10.3389/fcell.2021.801113

Author(s): Yingjie Guo, Chenxi Wu, Zhian Yuan, Yansu Wang, Zhen Liang, ...

Keyword(s): Association Studies, Real Data, Gene Interaction, Genome Wide Association, Superior Performance, Gene Interactions, Genome Wide Association Studies, Nucleotide Polymorphisms, Genome Wide, The Difference

Among the many statistical methods that identify gene–gene interactions in qualitative genome-wide association studies, gene-based methods are both statistically powerful and biologically interpretable. However, their detection power is limited by the assumptions they make about the association between traits and single nucleotide polymorphisms. This article therefore proposes a gene-based method derived from XGBoost (GGInt-XGBoost). Assuming that the log odds ratio of a disease trait is additive when a pair of genes does not interact, the difference in error between XGBoost models fitted with and without the additive constraint can indicate a gene–gene interaction; we then used a permutation-based statistical test to assess this difference and to provide a p-value representing the significance of the interaction. Experimental results on both simulated and real data showed that our approach performed better than previous approaches in detecting gene–gene interactions.
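The additivity assumption in this abstract can be illustrated with a small sketch. This is not GGInt-XGBoost itself; it only shows what "additive on the log-odds scale" means for two genes reduced to binary states, which is a simplifying assumption made here. The coefficients (-2.0, 0.8, 0.5, 1.2) are hypothetical values chosen for illustration.

```python
import math

def logit(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))

def interaction_contrast(risk):
    """Departure from additivity on the log-odds scale for a 2x2 risk table.

    risk[(a, b)] is the disease probability when gene A is in state a and
    gene B is in state b. The contrast is zero exactly when the log odds
    can be written as f_A(a) + f_B(b).
    """
    return (logit(risk[(1, 1)]) - logit(risk[(1, 0)])
            - logit(risk[(0, 1)]) + logit(risk[(0, 0)]))

states = [(a, b) for a in (0, 1) for b in (0, 1)]

# Hypothetical additive model: log odds = -2.0 + 0.8*a + 0.5*b
additive = {(a, b): sigmoid(-2.0 + 0.8 * a + 0.5 * b) for a, b in states}

# Same model with a hypothetical interaction term 1.2*a*b
interacting = {(a, b): sigmoid(-2.0 + 0.8 * a + 0.5 * b + 1.2 * a * b)
               for a, b in states}

contrast_additive = interaction_contrast(additive)
contrast_interacting = interaction_contrast(interacting)
```

Under the additive model the contrast vanishes up to floating-point error, while the interacting model recovers the interaction coefficient. The method described above replaces this closed-form contrast with the difference in error between a constrained and an unconstrained model.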


Testing Gene-Gene Interactions Based on a Neighborhood Perspective in Genome-wide Association Studies

Frontiers in Genetics, 2021, Vol 12. DOI: 10.3389/fgene.2021.801261

Author(s): Yingjie Guo, Honghong Cheng, Zhian Yuan, Zhen Liang, Yang Wang, ...

Keyword(s): Association Studies, Real Data, Gene Interaction, Statistical Test, Genome Wide Association, Gene Interactions, Genome Wide Association Studies, Genome Wide, Wide Range, The Difference

Unexplained genetic variation that causes complex diseases is often induced by gene-gene interactions (GGIs). Gene-based methods are one of the current statistical methodologies for discovering GGIs in case-control genome-wide association studies that are not only powerful statistically, but also interpretable biologically. However, most approaches include assumptions about the form of GGIs, which results in poor statistical performance. As a result, we propose gene-based testing based on the maximal neighborhood coefficient (MNC) called gene-based gene-gene interaction through a maximal neighborhood coefficient (GBMNC). MNC is a metric for capturing a wide range of relationships between two random vectors with arbitrary, but not necessarily equal, dimensions. We established a statistic that leverages the difference in MNC in case and in control samples as an indication of the existence of GGIs, based on the assumption that the joint distribution of two genes in cases and controls should not be substantially different if there is no interaction between them. We then used a permutation-based statistical test to evaluate this statistic and calculate a statistical p-value to represent the significance of the interaction. Experimental results using both simulation and real data showed that our approach outperformed earlier methods for detecting GGIs.
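The permutation test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GBMNC implementation: each gene is reduced to a single numeric score per sample, and the absolute Pearson correlation is used as a simple stand-in for the maximal neighborhood coefficient. The function names are hypothetical.

```python
import random
from statistics import mean

def abs_corr(xs, ys):
    """Absolute Pearson correlation; a stand-in for MNC in this sketch."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def dependence_gap(gene_a, gene_b, labels):
    """Absolute difference in dependence between cases (1) and controls (0)."""
    def dep(group):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        return abs_corr([gene_a[i] for i in idx], [gene_b[i] for i in idx])
    return abs(dep(1) - dep(0))

def permutation_p_value(gene_a, gene_b, labels, n_perm=500, seed=0):
    """Estimate a p-value by shuffling the case/control labels."""
    rng = random.Random(seed)
    observed = dependence_gap(gene_a, gene_b, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if dependence_gap(gene_a, gene_b, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (hits + 1) / (n_perm + 1)
```

Shuffling the labels keeps the joint distribution of the two genes intact while breaking any link to disease status, which is what the null hypothesis of no interaction requires in this setup.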


Graph Learning for Combinatorial Optimization: A Survey of State-of-the-Art

Data Science and Engineering, 2021. DOI: 10.1007/s41019-021-00155-3

Author(s): Yun Peng, Byron Choi, Jianliang Xu

Keyword(s): Machine Learning, Combinatorial Optimization, Graph Embedding, Partial Solution, Complex Data, Learning Methods, Graph Learning, Second Stage, End To End, Embedding Methods

Abstract: Graphs have been widely used to represent complex data in many applications, such as e-commerce, social networks, and bioinformatics. Efficient and effective analysis of graph data is important for graph-based applications. However, most graph analysis tasks are combinatorial optimization (CO) problems, which are NP-hard. Many recent studies have explored the potential of using machine learning (ML) to solve graph-based CO problems. Most recent methods follow a two-stage framework. The first stage is graph representation learning, which embeds the graphs into low-dimensional vectors. The second stage uses machine learning to solve the CO problems using the embeddings of the graphs learned in the first stage. The works for the first stage can be classified into two categories, graph embedding methods and end-to-end learning methods. For graph embedding methods, the learning of the embeddings of the graphs has its own objective, which may not rely on the CO problems to be solved. The CO problems are solved by independent downstream tasks. For end-to-end learning methods, the learning of the embeddings of the graphs does not have its own objective and is an intermediate step of the learning procedure for solving the CO problems. The works for the second stage can also be classified into two categories, non-autoregressive methods and autoregressive methods. Non-autoregressive methods predict a solution for a CO problem in one shot. A non-autoregressive method predicts a matrix that denotes the probability of each node/edge being a part of a solution of the CO problem. The solution can be computed from the matrix using search heuristics such as beam search. Autoregressive methods iteratively extend a partial solution step by step. At each step, an autoregressive method predicts a node/edge conditioned on the current partial solution, which is then used to extend that partial solution.
In this survey, we provide a thorough overview of recent studies of graph learning-based CO methods. The survey ends with several remarks on future research directions.
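The contrast between the two second-stage families can be sketched on a toy independent-set problem. This is an illustrative sketch, not code from the survey: the node probabilities and the scoring function `few_conflicts` are hypothetical stand-ins for the outputs of a learned model, and plain greedy search (beam search with width 1) is used in place of a full beam search.

```python
def greedy_from_scores(adj, node_prob):
    """Non-autoregressive style: scores are predicted once for all nodes,
    then a greedy search turns them into a feasible independent set."""
    chosen = set()
    for v in sorted(node_prob, key=node_prob.get, reverse=True):
        if adj[v].isdisjoint(chosen):
            chosen.add(v)
    return chosen

def extend_step_by_step(adj, score_fn):
    """Autoregressive style: every step re-scores the remaining candidates
    conditioned on the current partial solution and adds the best one."""
    chosen = set()
    candidates = set(adj)
    while candidates:
        best = max(sorted(candidates),
                   key=lambda v: score_fn(adj, v, chosen, candidates))
        chosen.add(best)
        # Neighbours of the chosen node can no longer join the solution.
        candidates -= adj[best] | {best}
    return chosen

def few_conflicts(adj, v, chosen, candidates):
    """Hypothetical scorer: prefer nodes that block few remaining candidates."""
    return -len(adj[v] & candidates)

# Path graph 0 - 1 - 2 - 3 - 4
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
one_shot = greedy_from_scores(path, {0: 0.9, 1: 0.2, 2: 0.8, 3: 0.1, 4: 0.7})
stepwise = extend_step_by_step(path, few_conflicts)
```

In the one-shot variant the scores never change once predicted, so all problem structure has to be enforced by the search; in the step-by-step variant the scorer sees the shrinking candidate set at every step.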


Forest and Trees: Exploring Bacterial Virulence with Genome-wide Association Studies and Machine Learning

Trends in Microbiology, 2021. DOI: 10.1016/j.tim.2020.12.002

Author(s): Jonathan P. Allen, Evan Snitkin, Nathan B. Pincus, Alan R. Hauser

Keyword(s): Machine Learning, Association Studies, Bacterial Virulence, Genome Wide Association, Genome Wide Association Studies, Genome Wide


An approach using ddRADseq and machine learning for understanding speciation in Antarctic Antarctophilinidae gastropods

Scientific Reports, 2021, Vol 11 (1). DOI: 10.1038/s41598-021-87244-5

Author(s): Juan Moles, Shahan Derkarabetian, Stefano Schiaparelli, Michael Schrödl, Jesús S. Troncoso, ...

Keyword(s): Machine Learning, Deep Sea, Weddell Sea, Nucleotide Polymorphisms, Sea Levels, Suitable Material, Genetic Lineages, Genome Wide, Gastropod Mollusks, Double Digestion

Abstract: Sampling impediments and paucity of suitable material for molecular analyses have precluded the study of speciation and radiation of deep-sea species in Antarctica. We analyzed barcodes together with genome-wide single nucleotide polymorphisms obtained from double digestion restriction site-associated DNA sequencing (ddRADseq) for species in the family Antarctophilinidae. We also reevaluated the fossil record associated with this taxon to provide further insights into the origin of the group. Novel approaches to identify distinctive genetic lineages, including unsupervised machine learning variational autoencoder plots, were used to establish species hypothesis frameworks. In this sense, three undescribed species and a complex of cryptic species were identified, suggesting allopatric speciation connected to geographic or bathymetric isolation. We further observed that the shallow waters around the Scotia Arc and on the continental shelf in the Weddell Sea present high endemism and diversity. In contrast, likely due to the glacial pressure during the Cenozoic, a deep-sea group with fewer species emerged expanding over great areas in the South-Atlantic Antarctic Ridge. Our study agrees on how diachronic paleoclimatic and current environmental factors shaped Antarctic communities both at the shallow and deep-sea levels, promoting Antarctica as the center of origin for numerous taxa such as gastropod mollusks.


Machine learning and combinatorial optimization, editorial

OR Spectrum, 2021. DOI: 10.1007/s00291-021-00642-z

Author(s): Gianni A. Di Caro, Vittorio Maniezzo, Roberto Montemanni, Matteo Salani

Keyword(s): Machine Learning, Combinatorial Optimization


Machine Learning for the Dynamic Positioning of UAVs for Extended Connectivity

Sensors, 2021, Vol 21 (13), pp. 4618. DOI: 10.3390/s21134618

Author(s): Francisco Oliveira, Miguel Luís, Susana Sargento

Keyword(s): Machine Learning, Cellular Networks, Real Data, Emerging Technology, Machine Learning Algorithms, Base Stations, Aerial Vehicle, Positioning Algorithm, The Military, Better Than

Unmanned Aerial Vehicle (UAV) networks are an emerging technology, useful not only for the military, but also for public and civil purposes. Their versatility provides advantages in situations where an existing network cannot support all requirements of its users, either because of an exceptionally large number of users, or because of the failure of one or more ground base stations. Networks of UAVs can reinforce these cellular networks where needed, redirecting the traffic to available ground stations. Using machine learning algorithms to predict overloaded traffic areas, we propose a UAV positioning algorithm responsible for determining suitable positions for the UAVs, with the objective of a more balanced redistribution of traffic, to avoid saturated base stations and decrease the number of users without a connection. The tests performed with real data of user connections through base stations show that, in less restrictive network conditions, the algorithm to dynamically place the UAVs performs significantly better than in more restrictive conditions, significantly reducing the number of users without a connection. We also conclude that the accuracy of the prediction is a very important factor, not only in the reduction of users without a connection, but also in the number of UAVs deployed.


FRI0046 PHARMACOGENOMICS-DRIVEN INDIVIDUALIZED PREDICTION OF TREATMENT RESPONSE TO METHOTREXATE IN PATIENTS WITH RHEUMATOID ARTHRITIS: A MACHINE LEARNING APPROACH

Annals of the Rheumatic Diseases, 2020, Vol 79 (Suppl 1), pp. 598.2-598. DOI: 10.1136/annrheumdis-2020-eular.4993

Author(s): E. Myasoedova, A. Athreya, C. S. Crowson, R. Weinshilboum, L. Wang, ...

Keyword(s): Rheumatoid Arthritis, Machine Learning, Supervised Machine Learning, Research Support, Eular Response, Learning Methods, Machine Learning Methods, Early Ra, Genome Wide

Background: Methotrexate (MTX) is the most common anchor drug for rheumatoid arthritis (RA), but the risk of missing the opportunity for early effective treatment with alternative medications is substantial given the delayed onset of MTX action and the 30-40% inadequate response rate. There is a compelling need to accurately predict MTX response prior to treatment initiation, which would allow patients at RA onset who are likely to respond to MTX to be identified effectively.

Objectives: To test the ability of machine learning approaches with clinical and genomic biomarkers to predict MTX response, with replication in independent samples.

Methods: Age, sex, clinical, serological and genome-wide association study (GWAS) data on patients with early RA of European ancestry from 647 patients (336 recruited in United Kingdom [UK]; 307 recruited across Europe; 70% female; 72% rheumatoid factor [RF] positive; mean age 54 years; mean baseline Disease Activity Score with 28-joint count [DAS28] 5.65) of the PhArmacogenetics of Methotrexate in RA (PAMERA) consortium were used in this study. The genomics data comprised 160 genome-wide significant single nucleotide polymorphisms (SNPs) with p < 1×10⁻⁵ associated with risk of RA and MTX metabolism. DAS28 score was available at baseline and at the 3-month follow-up visit. Response to MTX monotherapy at a dose of ≥15 mg/week was defined as good or moderate by the EULAR response criteria at the 3-month follow-up visit. Supervised machine-learning methods were trained with 5 repeats of 10-fold cross-validation using data from PAMERA's 336 UK patients. Class imbalance (higher % of MTX responders) in training was accounted for by using simulated minority oversampling technique. Prediction performance was validated in PAMERA's 307 European patients (not used in training).

Results: Age, sex, RF positivity and baseline DAS28 data predicted MTX response with 58% accuracy in UK and European patients (p = 0.7). However, supervised machine-learning methods that combined demographics, RF positivity, baseline DAS28 and genomic SNPs predicted EULAR response at 3 months with an area under the receiver operating characteristic curve (AUC) of 0.83 (p = 0.051) in UK patients, and achieved prediction accuracies (fraction of correctly predicted outcomes) of 76.2% (p = 0.054) in the European patients, with sensitivity of 72% and specificity of 77%. The addition of genomic data improved the predictive accuracy of MTX response by 19% and achieved cross-site replication. Baseline DAS28 scores and the following SNPs were among the top predictors of MTX response: rs12446816, rs13385025, rs113798271, and rs2372536.

Conclusion: Pharmacogenomic biomarkers combined with DAS28 scores predicted MTX response in patients with early RA more reliably than demographics and DAS28 scores alone. Using pharmacogenomic biomarkers to identify MTX responders at early stages of RA may help to guide effective RA treatment choices, including timely escalation of RA therapies. Further studies on personalized prediction of response to MTX and other anti-rheumatic treatments are warranted to optimize control of RA disease and improve outcomes in patients with RA.

Disclosure of Interests: Elena Myasoedova: None declared, Arjun Athreya: None declared, Cynthia S. Crowson Grant/research support from: Pfizer research grant, Richard Weinshilboum Shareholder of: co-founder and stockholder in OneOme, Liewei Wang: None declared, Eric Matteson Grant/research support from: Pfizer, Consultant of: Boehringer Ingelheim, Gilead, TympoBio, Arena Pharmaceuticals, Speakers bureau: Simply Speaking
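The repeated cross-validation scheme mentioned in the Methods of this abstract (5 repeats of 10-fold cross-validation on 336 training patients) can be sketched as an index generator. This is a generic illustration, not the authors' pipeline: the function name, the seed, and the use of plain unstratified folds are assumptions, and the oversampling step is not shown.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat.

    Each repeat reshuffles the sample order, so the same sample lands in
    a different fold from one repeat to the next.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for k in range(n_folds):
            test_idx = order[k::n_folds]   # every n_folds-th sample
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, test_idx

# 336 matches the size of the UK training cohort stated in the abstract.
splits = list(repeated_kfold(336))
```

When oversampling is combined with cross-validation, it is commonly applied to the training indices of each fold only, so that synthetic samples never appear in the held-out fold; whether the study did this is not stated in the abstract.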


Application of a Rough Set-Based Inductive Learning System

Fundamenta Informaticae, 1993, Vol 18 (2-4), pp. 209-220. DOI: 10.3233/fi-1993-182-409

Author(s): Michael Hadjimichael, Anita Wasilewska

Keyword(s): Machine Learning, Rough Set, Presidential Election, Predictive Accuracy, Learning Algorithm, Inductive Learning, Real Data, Semantic Content, Learning System, Voter Preferences

We present here an application of Rough Set formalism to Machine Learning. The resulting Inductive Learning algorithm is described, and its application to a set of real data is examined. The data consists of a survey of voter preferences taken during the 1988 presidential election in the U.S.A. Results include an analysis of the predictive accuracy of the generated rules, and an analysis of the semantic content of the rules.


NIMG-46. RADIOGENOMIC FEATURES PREDICT CLINICALLY RELEVANT GENOME-WIDE ALTERATION SIGNATURES IN GLIOBLASTOMA

Neuro-Oncology, 2020, Vol 22 (Supplement_2), pp. ii158-ii158. DOI: 10.1093/neuonc/noaa215.659

Author(s): Nicholas Nuechterlein, Beibin Li, James Fink, David Haynor, Eric Holland, ...

Keyword(s): Machine Learning, Feature Selection, Molecular Subtypes, Feature Selection Method, Selection Method, Versus Group, Mri Features, Genome Wide, Group 2, Group 1

BACKGROUND: Previously, we have shown that combined whole-exome sequencing (WES) and genome-wide somatic copy number alteration (SCNA) information can separate IDH1/2-wildtype glioblastoma into two prognostic molecular subtypes (Group 1 and Group 2) and that these subtypes cannot be distinguished by epigenetic or clinical features. However, the potential for radiographic features to discriminate between these molecular subtypes has not been established.

METHODS: Radiogenomic features (n=35,400) were extracted from 46 multiparametric, pre-operative magnetic resonance imaging (MRI) of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive, all of whom have corresponding WES and SCNA data in The Cancer Genome Atlas. We developed a novel feature selection method that leverages the structure of extracted radiogenomic MRI features to mitigate the dimensionality challenge posed by the disparity between the number of features and patients in our cohort. Seven traditional machine learning classifiers were trained to distinguish Group 1 versus Group 2 using our feature selection method. Our feature selection was compared to lasso feature selection, recursive feature elimination, and variance thresholding.

RESULTS: We are able to classify Group 1 versus Group 2 glioblastomas with a cross-validated area under the curve (AUC) score of 0.82 using ridge logistic regression and our proposed feature selection method, which reduces the size of our feature set from 35,400 to 288. An interrogation of the selected features suggests that features describing contours in the T2 abnormality region on the FLAIR MRI modality may best distinguish these two groups from one another.

CONCLUSIONS: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups. This algorithm may be applied to future prospective studies to assess the utility of MRI as a surrogate for costly prognostic genomic studies.
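Variance thresholding, one of the baseline feature selection methods named in this abstract, can be sketched in a few lines. This illustrates only that baseline and not the authors' proposed selection method, which the abstract does not describe in detail; the function name and the toy feature matrix are hypothetical.

```python
from statistics import pvariance

def variance_threshold(rows, threshold=0.0):
    """Return the indices of feature columns whose variance exceeds threshold.

    rows is a list of samples, each a list of feature values. Columns that
    barely vary across samples carry little information for a classifier.
    """
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]

# Toy matrix: 3 samples x 3 features (feature 0 is constant).
features = [
    [1.0, 0.5, 3.0],
    [1.0, 0.7, 1.0],
    [1.0, 0.6, 2.0],
]
kept_default = variance_threshold(features)
kept_strict = variance_threshold(features, threshold=0.1)
```

With the default threshold only the constant column is dropped; raising the threshold also removes the low-variance second column. The criterion ignores the class labels entirely, which is one reason it serves as a baseline rather than a competitive method when features vastly outnumber patients.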
