Machine-learning annotation of human splicing branchpoints

2016 ◽  
Author(s):  
Bethany Signal ◽  
Brian S Gloss ◽  
Marcel E Dinger ◽  
Timothy R Mercer

Abstract Background The branchpoint element is required for the first lariat-forming reaction in splicing. However, because branchpoints are difficult to map experimentally at a genome-wide scale, current catalogues are incomplete. Results We have developed a machine-learning algorithm trained with empirical human branchpoint annotations to identify branchpoint elements from primary genome sequence alone. Using this approach, we can accurately locate branchpoint elements in 85% of introns in current gene annotations. Consistent with branchpoints as basal genetic elements, we find our annotation is unbiased towards gene type and expression levels. A major fraction of introns was found to encode multiple branchpoints, raising the prospect that mutational redundancy is encoded in key genes. We also confirmed all deleterious branchpoint mutations annotated in clinical variant databases, and further identified thousands of clinical and common genetic variants with similar predicted effects. Conclusions We propose that the broad annotation of branchpoints constitutes a valuable resource for further investigations into the genetic encoding of splicing patterns, and for interpreting the impact of common and disease-causing human genetic variation on gene splicing.

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Masoud Arabfard ◽  
Mina Ohadi ◽  
Vahid Rezaei Tabar ◽  
Ahmad Delbari ◽  
Kaveh Kavousi

Abstract Background Machine learning can effectively nominate novel genes for various research purposes in the laboratory. On a genome-wide scale, we implemented multiple databases and algorithms to predict and prioritize the human aging genes (PPHAGE). Results We fused data from 11 databases, and used a Naïve Bayes classifier and positive unlabeled learning (PUL) methods, NB, Spy, and Rocchio-SVM, to rank human genes with respect to their implication in aging. The PUL methods enabled us to identify a list of negative (non-aging) genes to use alongside the seed (known age-related) genes in the ranking process. Comparison of the PUL algorithms revealed that none of the methods for identifying a negative sample was advantageous over the others, and that their simultaneous use in the form of a fusion was critical for obtaining optimal results (PPHAGE is publicly available at https://cbb.ut.ac.ir/pphage). Conclusion We predict and prioritize over 3,000 candidate age-related genes in humans, based on significant ranking scores. The identified candidate genes are associated with pathways, ontologies, and diseases that are linked to aging, such as cancer and diabetes. Our data offer a platform for future experimental research on the genetic and biological aspects of aging. Additionally, we demonstrate that fusion of PUL methods and data sources can be successfully used for aging and disease candidate gene prioritization.
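
The abstract above reports that fusing the rankings from several PUL methods mattered more than any single method, but it does not say how the fusion was done. As a rough illustration only, the sketch below uses mean-rank aggregation, which is an assumed stand-in and not the PPHAGE procedure. The function name `fuse_rankings`, the gene labels, and all scores are hypothetical.

```python
from statistics import mean


def fuse_rankings(scores_by_method):
    """Fuse per-method gene scores by averaging ranks (1 = best).

    scores_by_method maps a method name to a {gene: score} dict, where a
    higher score means a stronger predicted link to the trait.
    Mean-rank aggregation is an assumption made for this sketch.
    """
    # Only genes scored by every method can be ranked consistently.
    genes = set.intersection(*(set(s) for s in scores_by_method.values()))
    rank_lists = {gene: [] for gene in genes}
    for scores in scores_by_method.values():
        # Sort by descending score; break ties by gene name for determinism.
        ordered = sorted(genes, key=lambda g: (-scores[g], g))
        for position, gene in enumerate(ordered, start=1):
            rank_lists[gene].append(position)
    fused = {gene: mean(ranks) for gene, ranks in rank_lists.items()}
    # Lowest mean rank first.
    return sorted(fused, key=lambda g: (fused[g], g))


# Hypothetical scores for three placeholder genes under three methods.
example_scores = {
    "NB": {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.7},
    "Spy": {"GENE_A": 0.8, "GENE_B": 0.6, "GENE_C": 0.5},
    "Rocchio-SVM": {"GENE_A": 0.7, "GENE_B": 0.2, "GENE_C": 0.9},
}
fused_order = fuse_rankings(example_scores)
print(fused_order)
```

Averaging ranks rather than raw scores sidesteps the fact that different classifiers produce scores on different scales. Other fusion schemes, such as weighted voting, would fit the same description.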


2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Jessica Nye ◽  
Mayukh Mondal ◽  
Jaume Bertranpetit ◽  
Hafid Laayouni

Abstract After diverging, each chimpanzee subspecies has been the target of unique selective pressures. Here, we employ a machine learning approach to classify regions as under positive selection or neutrality genome-wide. The regions determined to be under selection reflect the unique demographic and adaptive history of each subspecies. The results indicate that effective population size is important for determining the proportion of the genome under positive selection. The chimpanzee subspecies share signals of selection in genes associated with immunity and gene regulation. With these results, we have created a selection map for each population that can be displayed in a genome browser (www.hsb.upf.edu/chimp_browser). This study is the first to use a detailed demographic history and machine learning to map selection genome-wide in chimpanzee. The chimpanzee selection map will improve our understanding of the impact of selection on closely related subspecies and will empower future studies of chimpanzee.


Genes ◽  
2020 ◽  
Vol 11 (10) ◽  
pp. 1154
Author(s):  
Min Jeong Hong ◽  
Jin-Baek Kim ◽  
Yong Weon Seo ◽  
Dae Yeon Kim

Genes of the F-box family play specific roles in protein degradation by post-translational modification in several biological processes, including flowering, the regulation of circadian rhythms, photomorphogenesis, seed development, leaf senescence, and hormone signaling. F-box genes have not been previously investigated on a genome-wide scale; however, the establishment of the wheat (Triticum aestivum L.) reference genome sequence enabled a genome-based examination of the F-box genes to be conducted in the present study. In total, 1796 F-box genes were detected in the wheat genome and classified into various subgroups based on their functional C-terminal domain. The F-box genes were distributed among 21 chromosomes and most showed high sequence homology with F-box genes located on the homoeologous chromosomes because of allohexaploidy in the wheat genome. Additionally, a synteny analysis of wheat F-box genes was conducted in rice and Brachypodium distachyon. Transcriptome analysis during various wheat developmental stages and expression analysis by quantitative real-time PCR revealed that some F-box genes were specifically expressed in the vegetative and/or seed developmental stages. A genome-based examination and classification of F-box genes provide an opportunity to elucidate the biological functions of F-box genes in wheat.


2014 ◽  
Vol 42 (15) ◽  
pp. 9838-9853 ◽  
Author(s):  
Saeed Kaboli ◽  
Takuya Yamakawa ◽  
Keisuke Sunada ◽  
Tao Takagaki ◽  
Yu Sasano ◽  
...  

Abstract Despite systematic approaches to mapping networks of genetic interactions in Saccharomyces cerevisiae, exploration of genetic interactions on a genome-wide scale has been limited. The S. cerevisiae haploid genome has 110 regions that are longer than 10 kb but harbor only non-essential genes. Here, we attempted to delete these regions by PCR-mediated chromosomal deletion technology (PCD), which enables chromosomal segments to be deleted by a one-step transformation. Thirty-three of the 110 regions could be deleted, but the remaining 77 regions could not. To determine whether the 77 undeletable regions are essential, we successfully converted 67 of them to mini-chromosomes marked with URA3 using PCR-mediated chromosome splitting technology and conducted a mitotic loss assay of the mini-chromosomes. Fifty-six of the 67 regions were found to be essential for cell growth, and 49 of these carried co-lethal gene pair(s) that had not previously been detected by synthetic genetic array analysis. This result implies that regions harboring only non-essential genes contain unidentified synthetic lethal combinations at an unexpectedly high frequency, revealing a novel landscape of genetic interactions in the S. cerevisiae genome. Furthermore, this study indicates that segmental deletion might be exploited not only for revealing genome function but also for breeding stress-tolerant strains.


2021 ◽  
Vol 11 ◽  
Author(s):  
Matthew J. Rybin ◽  
Melina Ramic ◽  
Natalie R. Ricciardi ◽  
Philipp Kapranov ◽  
Claes Wahlestedt ◽  
...  

Genome instability is associated with myriad human diseases and is a well-known feature of both cancer and neurodegenerative disease. Until recently, the ability to assess DNA damage—the principal driver of genome instability—was limited to relatively imprecise methods or restricted to studying predefined genomic regions. Recently, new techniques for detecting DNA double strand breaks (DSBs) and single strand breaks (SSBs) with next-generation sequencing on a genome-wide scale with single nucleotide resolution have emerged. With these new tools, efforts are underway to define the “breakome” in normal aging and disease. Here, we compare the relative strengths and weaknesses of these technologies and their potential application to studying neurodegenerative diseases.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Hanlin Liu ◽  
Linqiang Yang ◽  
Linchao Li

A variety of climate factors influence the precision of long-term Global Navigation Satellite System (GNSS) monitoring data. To precisely analyze the effect of different climate factors on long-term GNSS monitoring records, this study combines the extended seven-parameter Helmert transformation and a machine learning algorithm named Extreme Gradient Boosting (XGBoost) to establish a hybrid model. We established a local-scale reference frame called the stable Puerto Rico and Virgin Islands reference frame of 2019 (PRVI19) using ten continuously operating long-term GNSS sites located in the rigid portion of the Puerto Rico and Virgin Islands (PRVI) microplate. The stability of PRVI19 is approximately 0.4 mm/year and 0.5 mm/year in the horizontal and vertical directions, respectively. The stable reference frame PRVI19 can avoid the risk of bias due to long-term plate motions when studying localized ground deformation. Furthermore, we applied the XGBoost algorithm to the postprocessed long-term GNSS records and daily climate data to train the model. We quantitatively evaluated the importance of various daily climate factors on the GNSS time series. The results show that wind is the most influential factor, with a unit-less index of 0.013. Notably, we used the model with climate and GNSS records to predict the GNSS-derived displacements. The results show that the predicted displacements have a slightly lower root mean square error than the results fitted using the spline method (prediction: 0.22 versus fitted: 0.31). This indicates that the proposed model, which takes the climate records into account, yields suitable predictions for long-term GNSS monitoring.
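
The comparison in the abstract above rests on the root mean square error (RMSE) of two sets of displacements measured against the same observations. As a minimal sketch of that metric only, the snippet below computes RMSE for two hypothetical series. The values are invented for illustration and are not the study's data, and the names `rmse`, `predicted`, and `fitted` are placeholders.

```python
import math


def rmse(observed, estimated):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(estimated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - e) ** 2 for o, e in zip(observed, estimated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical displacement series (arbitrary units), not from the paper.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]  # stand-in for model predictions
fitted = [3.0, 4.0, 5.0, 6.0]     # stand-in for a spline fit

rmse_predicted = rmse(observed, predicted)
rmse_fitted = rmse(observed, fitted)
print(rmse_predicted, rmse_fitted)
```

A lower RMSE means the estimates sit closer to the observations on average, with large individual errors penalized more heavily because each error is squared before averaging.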


Author(s):  
Nana Matoba ◽  
Dan Liang ◽  
Huaigu Sun ◽  
Nil Aygün ◽  
Jessica C. McAfee ◽  
...  

Abstract Background Autism spectrum disorder (ASD) is a highly heritable neurodevelopmental disorder. Large genetically informative cohorts of individuals with ASD have led to the identification of three common genome-wide significant (GWS) risk loci to date. However, many more common genetic variants are expected to contribute to ASD risk given the high heritability. Here, we performed a genome-wide association study (GWAS) using the Simons Foundation Powering Autism Research for Knowledge (SPARK) dataset to identify additional common genetic risk factors and molecular mechanisms underlying risk for ASD. Methods We performed an association study on 6,222 case-pseudocontrol pairs from SPARK and meta-analyzed with a previous GWAS. We integrated gene regulatory annotations to map non-coding risk variants to their regulated genes. Further, we performed a massively parallel reporter assay (MPRA) to identify causal variant(s) within a novel risk locus. Results We identified one novel GWS locus from the SPARK GWAS. The meta-analysis identified four significant loci, including an additional novel locus. We observed significant enrichment of ASD heritability within regulatory regions of the developing cortex, indicating that disruption of gene regulation during neurodevelopment is critical for ASD risk. The MPRA identified one variant at the novel locus with strong impacts on gene regulation (rs7001340), and expression quantitative trait loci data demonstrated an association between the risk allele and decreased expression of DDHD2 (DDHD domain containing 2) in both adult and pre-natal brains. Conclusions By integrating genetic association data with multi-omic gene regulatory annotations and experimental validation, we fine-mapped a causal risk variant and demonstrated that DDHD2 is a novel gene associated with ASD risk.


2021 ◽  
Vol 17 (9) ◽  
pp. e1009317
Author(s):  
Ilario De Toma ◽  
Cesar Sierra ◽  
Mara Dierssen

Trisomy of human chromosome 21 (HSA21) causes Down syndrome (DS). The trisomy does not simply result in the upregulation of HSA21-encoded genes but also leads to a genome-wide transcriptomic deregulation, which affects each tissue and cell type differently as a result of epigenetic mechanisms and protein-protein interactions. We performed a meta-analysis integrating the differential expression (DE) analyses of all publicly available transcriptomic datasets, both in human and mouse, comparing trisomic and euploid transcriptomes from different sources. We integrated all these data in a “DS network”. We found that genome-wide deregulation as a consequence of trisomy 21 is not arbitrary, but involves deregulation of specific molecular cascades in which both HSA21 genes and HSA21 interactors are more consistently deregulated compared to other genes. In fact, gene deregulation happens in “clusters”, so that groups of 2 to 13 genes are found consistently deregulated. Most of these events of “co-deregulation” involve genes belonging to the same GO category, and genes associated with the same disease class. The most consistent changes are enriched in interferon-related categories and neutrophil activation, reinforcing the concept that DS is an inflammatory disease. Our results also suggest that the impact of the trisomy might diverge in each tissue due to the different gene set deregulation, even though the triplicated genes are the same. Our original method to integrate transcriptomic data not only confirmed the importance of known genes, such as SOD1, but also detected new ones that could be extremely useful for generating or confirming hypotheses and supporting new putative therapeutic candidates. We created “metaDEA”, an R package that uses our method to integrate every kind of transcriptomic data and therefore could be used with other complex disorders, such as cancer. We also created a user-friendly web application to query Ensembl gene IDs and retrieve all the information on their differential expression across the datasets.


2018 ◽  
Vol 19 (1) ◽  
pp. 223-246 ◽  
Author(s):  
Saffron A.G. Willis-Owen ◽  
William O.C. Cookson ◽  
Miriam F. Moffatt

Asthma is a common, clinically heterogeneous disease with strong evidence of heritability. Progress in defining the genetic underpinnings of asthma, however, has been slow and hampered by issues of inconsistency. Recent advances in the tools available for analysis—assaying transcription, sequence variation, and epigenetic marks on a genome-wide scale—have substantially altered this landscape. Applications of such approaches are consistent with heterogeneity at the level of causation and specify patterns of commonality with a wide range of alternative disease traits. Looking beyond the individual as the unit of study, advances in technology have also fostered comprehensive analysis of the human microbiome and its varied roles in health and disease. In this article, we consider the implications of these technological advances for our current understanding of the genetics and genomics of asthma.


2019 ◽  
Vol 116 (1) ◽  
pp. 138-148 ◽  
Author(s):  
Katra Hadji-Turdeghal ◽  
Laura Andreasen ◽  
Christian M Hagen ◽  
Gustav Ahlberg ◽  
Jonas Ghouse ◽  
...  

Abstract Aims Syncope is a common condition associated with frequent hospitalization or visits to the emergency department. Family aggregation and twin studies have shown that syncope has a heritable component. We investigated whether common genetic variants predispose to syncope and collapse. Methods and results We used genome-wide association data on syncope from 408 961 individuals with European ancestry from the UK Biobank study. In a replication study, we used the Integrative Psychiatric Research Consortium (iPSYCH) cohort (n = 86 189) to investigate the risk of incident syncope stratified by genotype carrier status. We report a genome-wide significant locus located on chromosome 2q32.1 [odds ratio = 1.13, 95% confidence interval (CI) 1.10–1.17, P = 5.8 × 10⁻¹⁵], with lead single nucleotide polymorphism rs12465214 in proximity to the gene zinc finger protein 804a (ZNF804A). This association was also shown in the iPSYCH cohort, where homozygous carriers of the C allele had an increased hazard ratio (1.30, 95% CI 1.15–1.46, P = 1.68 × 10⁻⁵) of incident syncope. Quantitative polymerase chain reaction analysis showed ZNF804A to be expressed most abundantly in brain tissue. Conclusion We identified a genome-wide significant locus (rs12465214) associated with syncope and collapse. The association was replicated in an independent cohort. This is the first genome-wide association study to associate a locus with syncope and collapse.

