Human-lineage-specific genomic elements: relevance to neurodegenerative disease and APOE transcript usage

ABSTRACT Knowledge of genomic features specific to the human lineage may provide insights into brain-related diseases. We leverage high-depth whole genome sequencing data to generate a combined annotation identifying regions simultaneously depleted for genetic variation (constrained regions) and poorly conserved across primates. We propose that these constrained, non-conserved regions (CNCRs) have been subject to human-specific purifying selection and are enriched for brain-specific elements. We find that CNCRs are depleted from protein-coding genes but enriched within lncRNAs. We demonstrate that per-SNP heritability of a range of brain-relevant phenotypes is enriched within CNCRs. We find that genes implicated in neurological diseases have high CNCR density, including APOE, highlighting an unannotated intron-3 retention event. Using human brain RNA-sequencing data, we show the intron-3-retaining transcript(s) to be more abundant in Alzheimer’s disease with more severe tau and amyloid pathological burden. Thus, we demonstrate the importance of human-lineage-specific sequences in brain development and neurological disease. We release our annotation through vizER (https://snca.atica.um.es/browser/app/vizER).

Human-lineage-specific genomic elements are associated with neurodegenerative disease and APOE transcript usage

Nature Communications ◽

10.1038/s41467-021-22262-5 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Zhongbo Chen ◽

David Zhang ◽

Regina H. Reynolds ◽

Emil K. Gustavsson ◽

...

Keyword(s):

Neurological Diseases ◽

Purifying Selection ◽

Whole Genome Sequencing Data ◽

Human Lineage ◽

Sequencing Data ◽

Protein Coding ◽

Potential Association ◽

High Depth ◽

Specific Sequences ◽

Human Specific

Abstract Knowledge of genomic features specific to the human lineage may provide insights into brain-related diseases. We leverage high-depth whole genome sequencing data to generate a combined annotation identifying regions simultaneously depleted for genetic variation (constrained regions) and poorly conserved across primates. We propose that these constrained, non-conserved regions (CNCRs) have been subject to human-specific purifying selection and are enriched for brain-specific elements. We find that CNCRs are depleted from protein-coding genes but enriched within lncRNAs. We demonstrate that per-SNP heritability of a range of brain-relevant phenotypes is enriched within CNCRs. We find that genes implicated in neurological diseases have high CNCR density, including APOE, highlighting an unannotated intron-3 retention event. Using human brain RNA-sequencing data, we show the intron-3-retaining transcript to be more abundant in Alzheimer’s disease with more severe tau and amyloid pathological burden. Thus, we demonstrate a potential association of human-lineage-specific sequences with brain development and neurological disease.
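The CNCR definition above (constrained in humans, yet poorly conserved across primates) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Window` type, the score scales, the thresholds, and the toy values are hypothetical and are not taken from the paper's annotation.

```python
from dataclasses import dataclass


@dataclass
class Window:
    chrom: str
    start: int
    end: int
    constraint: float    # assumed scale: higher = more depleted of human variation
    conservation: float  # assumed scale: higher = more conserved across primates


def flag_cncr(windows, min_constraint, max_conservation):
    """Keep windows that are constrained AND poorly conserved (hypothetical cutoffs)."""
    return [
        w for w in windows
        if w.constraint >= min_constraint and w.conservation <= max_conservation
    ]


# Toy data, invented for this example only.
windows = [
    Window("chr19", 0, 10, constraint=2.1, conservation=0.1),   # constrained, not conserved
    Window("chr19", 10, 20, constraint=2.3, conservation=0.9),  # constrained and conserved
    Window("chr19", 20, 30, constraint=0.2, conservation=0.1),  # neither
]
cncrs = flag_cncr(windows, min_constraint=1.0, max_conservation=0.5)
print([(w.chrom, w.start, w.end) for w in cncrs])
```

Only the first toy window passes both filters; requiring both conditions at once is what separates a CNCR from a region that is simply conserved or simply constrained.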

JuLI: accurate detection of DNA fusions in clinical sequencing for precision oncology

10.1101/521039 ◽

2019 ◽

Author(s):

Hyun-Tae Shin ◽

Nayoung K. D. Kim ◽

Jae Won Yun ◽

Boram Lee ◽

Sungkyu Kyung ◽

...

Keyword(s):

High Throughput Sequencing ◽

False Negative ◽

Detection Algorithm ◽

Clinical Samples ◽

Whole Genome Sequencing Data ◽

Precision Oncology ◽

Sequencing Data ◽

Clinical Sequencing ◽

Accurate Detection ◽

High Depth

ABSTRACT Accurate detection of genomic fusions by high-throughput sequencing in clinical samples with inadequate tumor purity and formalin-fixed paraffin-embedded (FFPE) tissue is an essential task in precision oncology. We developed the fusion detection algorithm Junction Location Identifier (JuLI) for optimization of high-depth clinical sequencing. We implemented novel filtering steps to minimize false positives and a joint calling function to increase sensitivity in clinical settings. We comprehensively validated the algorithm using high-depth sequencing data from cancer cell lines and clinical samples and whole genome sequencing data from NA12878. We showed that JuLI outperformed state-of-the-art fusion callers in cases with high-depth clinical sequencing and rescued a driver fusion from a false-negative call in plasma cell-free DNA. JuLI is freely available via GitHub (https://github.com/sgilab/JuLI).

Purifying Selection in Corvids Is Less Efficient on Islands

Molecular Biology and Evolution ◽

10.1093/molbev/msz233 ◽

2019 ◽

Vol 37 (2) ◽

pp. 469-474 ◽

Cited By ~ 3

Author(s):

Verena E Kutschera ◽

Jelmer W Poelstra ◽

Fidel Botero-Castro ◽

Nicolas Dussex ◽

Neil J Gemmell ◽

...

Keyword(s):

Purifying Selection ◽

Life History Strategies ◽

Whole Genome Sequencing Data ◽

Small Populations ◽

Deleterious Mutations ◽

Mutation Load ◽

Sequencing Data ◽

Effective Population ◽

Extinction Rates ◽

Island Species

Abstract Theory predicts that deleterious mutations accumulate more readily in small populations. As a consequence, mutation load is expected to be elevated in species where life-history strategies and geographic or historical contingencies reduce the number of reproducing individuals. Yet, few studies have empirically tested this prediction using genome-wide data in a comparative framework. We collected whole-genome sequencing data for 147 individuals across seven crow species (Corvus spp.). For each species, we estimated the distribution of fitness effects of deleterious mutations and compared it with proxies of the effective population size Ne. Island species with comparatively smaller geographic range sizes had a significantly increased mutation load. These results support the view that small populations have an elevated risk of mutational meltdown, which may contribute to the higher extinction rates observed in island species.

Pan-cancer analysis of non-coding recurrent mutations and their possible involvement in cancer pathogenesis

NAR Cancer ◽

10.1093/narcan/zcab008 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Chie Kikutake ◽

Minako Yoshihara ◽

Mikita Suyama

Keyword(s):

The Cancer Genome Atlas ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Huge Number ◽

Protein Coding ◽

Coding Regions ◽

Cancer Pathogenesis ◽

Recurrent Mutations ◽

Cancer Genome Atlas ◽

Pan Cancer

Abstract Cancer-related mutations have been mainly identified in protein-coding regions. Recent studies have demonstrated that mutations in non-coding regions of the genome could also be a risk factor for cancer. However, the non-coding regions comprise 98% of the total length of the human genome and contain a huge number of mutations, making it difficult to interpret their impact on the pathogenesis of cancer. To comprehensively identify cancer-related non-coding mutations, we focused on recurrent mutations in non-coding regions using somatic mutation data from COSMIC and whole-genome sequencing data from The Cancer Genome Atlas (TCGA). We identified 21 574 recurrent mutations in non-coding regions that were shared by at least two different samples from both COSMIC and TCGA databases. Among them, 580 candidate cancer-related non-coding recurrent mutations were identified based on epigenomic and chromatin structure datasets. One such mutation was located in an RREB1 binding site that is thought to interact with the TEAD1 promoter. Our results suggest that mutations may disrupt the binding of RREB1 to the candidate enhancer region and increase TEAD1 expression levels. Our findings demonstrate that non-coding recurrent mutations and coding mutations may contribute to the pathogenesis of cancer.
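A minimal sketch of the recurrence filter described above, under one reading of "shared by at least two different samples from both COSMIC and TCGA": a mutation must be seen in at least two distinct samples within each dataset. The record layout, the function name `recurrent_mutations`, and the toy calls are assumptions for illustration and do not reflect the actual COSMIC or TCGA file formats.

```python
from collections import defaultdict


def recurrent_mutations(calls, min_samples=2):
    """Return mutations observed in at least `min_samples` distinct samples.

    `calls` is an iterable of (sample_id, chrom, pos, ref, alt) tuples
    (a hypothetical layout chosen for this example).
    """
    samples_by_mutation = defaultdict(set)
    for sample_id, chrom, pos, ref, alt in calls:
        samples_by_mutation[(chrom, pos, ref, alt)].add(sample_id)
    return {
        mutation
        for mutation, samples in samples_by_mutation.items()
        if len(samples) >= min_samples
    }


# Toy calls, invented for this example only.
cosmic_calls = [
    ("c1", "chr6", 100, "A", "G"),
    ("c2", "chr6", 100, "A", "G"),
    ("c1", "chr1", 500, "C", "T"),  # seen in a single sample: not recurrent
]
tcga_calls = [
    ("t1", "chr6", 100, "A", "G"),
    ("t2", "chr6", 100, "A", "G"),
    ("t1", "chr2", 900, "G", "A"),
    ("t2", "chr2", 900, "G", "A"),  # recurrent in this set only
]

# Keep only mutations that are recurrent in both datasets.
shared = recurrent_mutations(cosmic_calls) & recurrent_mutations(tcga_calls)
print(shared)
```

Collecting sample IDs in a set means a mutation reported twice for the same sample still counts as one observation.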

Extreme purifying selection against point mutations in the human genome

10.1101/2021.08.23.457339 ◽

2021 ◽

Author(s):

Noah Dukler ◽

Mehreen R Mughal ◽

Ritika Ramani ◽

Yi-Fei Huang ◽

Adam Siepel

Keyword(s):

Human Genome ◽

De Novo ◽

Point Mutations ◽

Purifying Selection ◽

Selection Coefficient ◽

Sequencing Data ◽

Protein Coding ◽

Coding Regions ◽

Protein Coding Genes ◽

Selective Effects

Genome sequencing of tens of thousands of human individuals has recently enabled the measurement of large selective effects for mutations to protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring similar selective effects at individual sites in noncoding as well as in coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of strong purifying selection, or "ultraselection" (λs), as the fractional depletion of rare single-nucleotide variants (minor allele frequency <0.1%) in a target set of genomic sites relative to matched sites that are putatively neutrally evolving, in a manner that controls for local variation and neighbor-dependence in mutation rate. We show using simulations that, above an appropriate threshold, λs is closely related to the average site-specific selection coefficient against heterozygous point mutations, as predicted at mutation-selection balance. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find particularly strong evidence of ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. Moreover, our estimated selection coefficient against heterozygous amino-acid replacements across the genome (at 1.4%) is substantially larger than previous estimates based on smaller sample sizes. By contrast, we find weak evidence of ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest evidence in ultraconserved elements and human accelerated regions. We estimate that ~0.3-0.5% of the human genome is ultraselected, with one third to one half of ultraselected sites falling in coding regions. These estimates suggest ~0.3-0.4 lethal or nearly lethal de novo mutations per potential human zygote, together with ~2 de novo mutations that are more weakly deleterious. Overall, our study sheds new light on the genome-wide distribution of fitness effects for new point mutations by combining deep new sequencing data sets and classical theory from population genetics.
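The "fractional depletion" behind λs can be made concrete with a deliberately simplified sketch. It assumes the target and neutral site sets are already matched for mutation rate, which the method described above controls for explicitly and this toy version does not. The function name and the counts are hypothetical.

```python
def fractional_depletion(target_rare, target_sites, neutral_rare, neutral_sites):
    """Naive depletion of rare variants in target sites relative to neutral sites.

    Returns 1 - (target rare-variant rate / neutral rare-variant rate).
    Simplification: no correction for local or neighbor-dependent mutation rate.
    """
    if target_sites <= 0 or neutral_sites <= 0:
        raise ValueError("site counts must be positive")
    neutral_rate = neutral_rare / neutral_sites
    if neutral_rate == 0:
        raise ValueError("neutral set has no rare variants; depletion is undefined")
    target_rate = target_rare / target_sites
    return 1.0 - target_rate / neutral_rate


# Toy counts, invented for this example: 30 rare variants per 1,000 target sites
# versus 50 per 1,000 matched neutral sites.
lam = fractional_depletion(30, 1000, 50, 1000)
print(round(lam, 3))
```

With these toy counts the target set carries 60% of the neutral rare-variant rate, so the naive depletion is 0.4. A value near 0 would indicate no depletion relative to the neutral sites.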

Genomic features and evolution of the Parapoxvirus during the past two decades

10.21203/rs.3.rs-42668/v1 ◽

2020 ◽

Author(s):

Xiaoting Yao ◽

Ming Pang ◽

Tianxing Wang ◽

Xi Chen ◽

Xidian Tang ◽

...

Keyword(s):

Selection Pressure ◽

Control Strategies ◽

Genomic Structure ◽

Purifying Selection ◽

Life Cycles ◽

Comparative Approach ◽

Common Mechanism ◽

Protein Coding ◽

Genomic Features ◽

Natural Hosts

Abstract Parapoxvirus (PPV) has been identified in most mammals and poses a great threat to both livestock production and public health. However, the viral prevalence and evolution of PPV coding sequences are still not fully understood. Here, we used a comparative approach integrating viral genetics, molecular selection pressure and genomic structure to investigate the genomic features and evolution of PPVs. We noticed that although there were significant differences in GC content between ORFV and the other three species of PPVs, all PPVs showed almost identical nucleotide bias, that is, GC richness. This reflected a common mechanism that determines GC composition for viruses with similar life cycles. The structural analysis of PPV genomes showed the divergence of different PPV species, which may be due to specific adaptation to their natural hosts. Additionally, we estimated the phylogenetic diversity of the segmented genome of PPV. Our results suggested that during the 2010–2018 outbreak, the orf virus was the dominant species under the selective pressure of the optimal gene patterns. Furthermore, we found that the mean substitution rates were between 3.56 × 10⁻⁵ and 4.21 × 10⁻⁴ in different PPV segments, and the PPV VIR gene evolved at the highest substitution rate. In these protein-coding regions, purifying selection was the major evolutionary pressure, while the GIF and VIR genes were under the greatest positive selection pressure. These results may provide useful knowledge on the genetic evolution of the virus from a new perspective, which could help create prevention and control strategies.

Genomic Features and Evolution of the Parapoxvirus during the Past Two Decades

Pathogens ◽

10.3390/pathogens9110888 ◽

2020 ◽

Vol 9 (11) ◽

pp. 888

Author(s):

Xiaoting Yao ◽

Ming Pang ◽

Tianxing Wang ◽

Xi Chen ◽

Xidian Tang ◽

...

Keyword(s):

Control Strategies ◽

In Silico Analysis ◽

Purifying Selection ◽

Virus Species ◽

Protein Coding ◽

Genomic Features ◽

Useful Knowledge ◽

Natural Hosts ◽

New Perspective ◽

And Control

Parapoxvirus (PPV) has been identified in some mammals and poses a great threat to both livestock production and public health. However, the prevalence and evolution of this virus are still not fully understood. Here, we performed an in silico analysis to investigate the genomic features and evolution of PPVs. We noticed that although there were significant differences in GC content between orf virus (ORFV) and the other three species of PPVs, all PPVs showed almost identical nucleotide bias, that is, GC richness. The structural analysis of PPV genomes showed the divergence of different PPV species, which may be due to specific adaptation to their natural hosts. Additionally, we estimated the phylogenetic diversity of seven different genes of PPV. According to all available sequences, our results suggested that during 2010–2018, ORFV was the dominant virus species under the selective pressure of the optimal gene patterns. Furthermore, we found the substitution rates ranged from 3.56 × 10⁻⁵ to 4.21 × 10⁻⁴ in different PPV segments, and the PPV VIR gene evolved at the highest substitution rate. In these seven protein-coding regions, purifying selection was the major evolutionary pressure, while the GIF and VIR genes were under the greatest positive selection pressure. These results may provide useful knowledge on the genetic evolution of the virus from a new perspective, which could help to create prevention and control strategies.

Analyzing and Characterizing the Chloroplast Genome of Salix wilsonii

BioMed Research International ◽

10.1155/2019/5190425 ◽

2019 ◽

Vol 2019 ◽

pp. 1-14 ◽

Cited By ~ 3

Author(s):

Yingnan Chen ◽

Nan Hu ◽

Huaitong Wu

Keyword(s):

Chloroplast Genome ◽

Tandem Repeats ◽

Single Copy ◽

rRNA Genes ◽

Whole Genome Sequencing Data ◽

tRNA Genes ◽

Sequencing Data ◽

Protein Coding ◽

Organelle Genomes ◽

Additional Sequence

Salix wilsonii is an important ornamental willow tree widely distributed in China. In this study, an integrated circular chloroplast genome was reconstructed for S. wilsonii based on the chloroplast reads screened from the whole-genome sequencing data generated with the PacBio RSII platform. The obtained pseudomolecule was 155,750 bp long and had a typical quadripartite structure, comprising a large single copy region (LSC, 84,638 bp) and a small single copy region (SSC, 16,282 bp) separated by two inverted repeat regions (IR, 27,415 bp). The S. wilsonii chloroplast genome encoded 115 unique genes, including four rRNA genes, 30 tRNA genes, 78 protein-coding genes, and three pseudogenes. Repetitive sequence analysis identified 32 tandem repeats, 22 forward repeats, two reverse repeats, and five palindromic repeats. Additionally, a total of 118 perfect microsatellites were detected, with mononucleotide repeats being the most common (89.83%). By comparing the S. wilsonii chloroplast genome with those of other rosid plant species, significant contractions or expansions were identified at the IR-LSC/SSC borders. Phylogenetic analysis of 17 willow species confirmed that S. wilsonii was most closely related to S. chaenomeloides and revealed the monophyly of the genus Salix. The complete S. wilsonii chloroplast genome provides an additional sequence-based resource for studying the evolution of organelle genomes in woody plants.
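The reported lengths can be checked against the quadripartite layout (one LSC, one SSC, and two IR copies) with a few lines of arithmetic. The region lengths and the 89.83% share are taken from the abstract; the mononucleotide count is not stated there and is inferred here from that percentage, so treat it as an assumption.

```python
# Region lengths in base pairs, as reported in the abstract.
LSC_BP = 84_638
SSC_BP = 16_282
IR_BP = 27_415  # length of ONE inverted repeat; the genome carries two copies
REPORTED_TOTAL_BP = 155_750


def quadripartite_length(lsc, ssc, ir):
    """Total length of a circular genome with one LSC, one SSC and two IRs."""
    return lsc + ssc + 2 * ir


total_bp = quadripartite_length(LSC_BP, SSC_BP, IR_BP)
print(total_bp, total_bp == REPORTED_TOTAL_BP)

# Inferred (not reported) count of mononucleotide microsatellites:
# 89.83% of the 118 perfect microsatellites.
mono_count = round(0.8983 * 118)
print(mono_count, round(100 * mono_count / 118, 2))
```

The four segments sum to 155,750 bp, matching the reported pseudomolecule length, and 106 of 118 microsatellites reproduces the stated 89.83%.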

Genetic ancestry plays a central role in population pharmacogenomics

Communications Biology ◽

10.1038/s42003-021-01681-6 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Hsin-Chou Yang ◽

Chia-Wei Chen ◽

Yu-Ting Lin ◽

Shih-Kai Chu

Keyword(s):

Principal Component ◽

Genetic Ancestry ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Ancestry Informative Markers ◽

Sequencing Data ◽

Whole Genome Analysis ◽

Protein Coding ◽

Functional Variant ◽

Multiple Drugs

Abstract Recent studies have pointed out the essential role of genetic ancestry in population pharmacogenetics. In this study, we analyzed the whole-genome sequencing data from The 1000 Genomes Project (Phase 3) and the pharmacogenetic information from DrugBank, PharmGKB, PharmaADME, and Biotransformation. Here we show that ancestry-informative markers are enriched in pharmacogenetic loci, suggesting that trans-ancestry differentiation must be carefully considered in population pharmacogenetics studies. Ancestry-informative pharmacogenetic loci are located in both protein-coding and non-protein-coding regions, illustrating that a whole-genome analysis is necessary for an unbiased examination of pharmacogenetic loci. Finally, those ancestry-informative pharmacogenetic loci that target multiple drugs are often functional variants, which reflects their importance in biological functions and pathways. In summary, we develop an efficient algorithm for an ultrahigh-dimensional principal component analysis. We create genetic catalogs of ancestry-informative markers and genes. We explore pharmacogenetic patterns and establish a high-accuracy prediction panel of genetic ancestry. Moreover, we construct a genetic ancestry pharmacogenomic database, Genetic Ancestry PhD (http://hcyang.stat.sinica.edu.tw/databases/genetic_ancestry_phd/).

Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning

Nature Communications ◽

10.1038/s41467-021-21790-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Dimitrios Vitsios ◽

Ryan S. Dhindsa ◽

Lawrence Middleton ◽

Ayal B. Gussow ◽

Slavé Petrovski

Keyword(s):

Deep Learning ◽

Genomic Sequence ◽

Strong Predictor ◽

Whole Genome Sequencing Data ◽

Disease Genes ◽

Mendelian Disease ◽

Human Lineage ◽

Sequencing Data ◽

Coding Regions ◽

Residual Variation

Abstract Elucidating functionality in non-coding regions is a key challenge in human genomics. It has been shown that intolerance to variation of coding and proximal non-coding sequence is a strong predictor of human disease relevance. Here, we integrate intolerance to variation, functional genomic annotations and primary genomic sequence to build JARVIS: a comprehensive deep learning model to prioritize non-coding regions, outperforming other human lineage-specific scores. Despite being agnostic to evolutionary conservation, JARVIS performs comparably to or outperforms conservation-based scores in classifying pathogenic single-nucleotide and structural variants. In constructing JARVIS, we introduce the genome-wide residual variation intolerance score (gwRVIS), applying a sliding-window approach to whole genome sequencing data from 62,784 individuals. gwRVIS distinguishes Mendelian disease genes from more tolerant CCDS regions and highlights ultra-conserved non-coding elements as the most intolerant regions in the human genome. Both JARVIS and gwRVIS capture previously inaccessible human-lineage constraint information and will enhance our understanding of the non-coding genome.
