Local adaptation and archaic introgression shape global diversity at human structural variant loci

2021 ◽  
Author(s):  
Stephanie M. Yan ◽  
Rachel M. Sherman ◽  
Dylan J. Taylor ◽  
Divya R. Nair ◽  
Andrew N. Bortvin ◽  
...  

Abstract Large genomic insertions, deletions, and inversions are a potent source of functional and fitness-altering variation, but are challenging to resolve with short-read DNA sequencing alone. While recent long-read sequencing technologies have greatly expanded the catalog of structural variants (SVs), their costs have so far precluded their application at population scales. Given these limitations, the role of SVs in human adaptation remains poorly characterized. Here, we used a graph-based approach to genotype 107,866 long-read-discovered SVs in short-read sequencing data from diverse human populations. We then applied an admixture-aware method to scan these SVs for patterns of population-specific frequency differentiation, a signature of local adaptation. We identified 220 SVs exhibiting extreme frequency differentiation, including several SVs that were among the lead variants at their corresponding loci. The top two signatures traced to separate insertion and deletion polymorphisms at the immunoglobulin heavy chain locus, together tagging a 325 Kbp haplotype that swept to high frequency and was subsequently fragmented by recombination. Alleles defining this haplotype are nearly fixed (60-95%) in certain Southeast Asian populations, but are rare or absent from other global populations composing the 1000 Genomes Project. Further investigation revealed that the haplotype closely matches sequences observed in two of three high-coverage Neanderthal genomes, providing strong evidence of a Neanderthal-introgressed origin. This extraordinary episode of positive selection, which we infer to have occurred between 1700 and 8400 years ago, corroborates the role of immune-related genes as prominent targets of adaptive archaic introgression. Our study demonstrates how combining recent advances in genome sequencing, genotyping algorithms, and population genetic methods can reveal signatures of key evolutionary events that remained hidden within poorly resolved regions of the genome.

eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
Stephanie M Yan ◽  
Rachel M Sherman ◽  
Dylan J Taylor ◽  
Divya R Nair ◽  
Andrew N Bortvin ◽  
...  

Large genomic insertions and deletions are a potent source of functional variation, but are challenging to resolve with short-read sequencing, limiting knowledge of the role of such structural variants (SVs) in human evolution. Here, we used a graph-based method to genotype long-read-discovered SVs in short-read data from diverse human genomes. We then applied an admixture-aware method to identify 220 SVs exhibiting extreme patterns of frequency differentiation—a signature of local adaptation. The top two variants traced to the immunoglobulin heavy chain locus, tagging a haplotype that swept to near fixation in certain Southeast Asian populations, but is rare in other global populations. Further investigation revealed evidence that the haplotype traces to gene flow from Neanderthals, corroborating the role of immune-related genes as prominent targets of adaptive introgression. Our study demonstrates how recent technical advances can help resolve signatures of key evolutionary events that remained obscured within technically challenging regions of the genome.


2018 ◽  
Author(s):  
Li Fang ◽  
Charlly Kao ◽  
Michael V Gonzalez ◽  
Fernanda A Mafra ◽  
Renata Pellegrino da Silva ◽  
...  

Abstract Linked-read sequencing provides long-range information on short-read sequencing data by barcoding reads originating from the same DNA molecule, and can improve the detection and breakpoint identification for structural variants (SVs). We present LinkedSV for SV detection on linked-read sequencing data. LinkedSV considers barcode overlapping and enriched fragment endpoints as signals to detect large SVs, while it leverages read depth, paired-end signals and local assembly to detect small SVs. Benchmarking studies demonstrate that LinkedSV outperforms existing tools, especially on exome data and on somatic SVs with low variant allele frequencies. We demonstrate clinical cases where LinkedSV identifies disease-causal SVs from linked-read exome sequencing data missed by conventional exome sequencing, and show examples where LinkedSV identifies SVs missed by high-coverage long-read sequencing. In summary, LinkedSV can detect SVs missed by conventional short-read and long-read sequencing approaches, and may resolve negative cases from clinical genome/exome sequencing studies.


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Li Fang ◽  
Charlly Kao ◽  
Michael V. Gonzalez ◽  
Fernanda A. Mafra ◽  
Renata Pellegrino da Silva ◽  
...  

Abstract Linked-read sequencing provides long-range information on short-read sequencing data by barcoding reads originating from the same DNA molecule, and can improve detection and breakpoint identification for structural variants (SVs). Here we present LinkedSV for SV detection on linked-read sequencing data. LinkedSV considers barcode overlapping and enriched fragment endpoints as signals to detect large SVs, while it leverages read depth, paired-end signals and local assembly to detect small SVs. Benchmarking studies demonstrate that LinkedSV outperforms existing tools, especially on exome data and on somatic SVs with low variant allele frequencies. We demonstrate clinical cases where LinkedSV identifies disease-causal SVs from linked-read exome sequencing data missed by conventional exome sequencing, and show examples where LinkedSV identifies SVs missed by high-coverage long-read sequencing. In summary, LinkedSV can detect SVs missed by conventional short-read and long-read sequencing approaches, and may resolve negative cases from clinical genome/exome sequencing studies.
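
The barcode-overlap signal mentioned in the abstract can be illustrated with a small Python sketch. This is not LinkedSV's implementation, and the abstract does not describe its statistical model. The read representation (chromosome, position, barcode), the window tuples, and the function names `barcodes_in_window` and `barcode_overlap` are all assumptions made for illustration. The idea shown is only that two distant reference windows sharing many barcodes may have been adjacent on the same DNA molecules in the sample.

```python
def barcodes_in_window(reads, chrom, start, end):
    """Collect barcodes of reads positioned in [start, end) on chrom.

    `reads` is a hypothetical list of (chrom, position, barcode) tuples.
    """
    return {bc for (c, pos, bc) in reads if c == chrom and start <= pos < end}


def barcode_overlap(reads, window_a, window_b):
    """Return (number of shared barcodes, Jaccard index) for two windows.

    Each window is a (chrom, start, end) tuple. A high overlap between
    windows that are far apart on the reference is the kind of signal
    that could point to a large SV joining them.
    """
    a = barcodes_in_window(reads, *window_a)
    b = barcodes_in_window(reads, *window_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


# Toy data: barcodes B1 and B2 appear in two windows about 900 kbp apart.
toy_reads = [
    ("chr1", 100, "B1"),
    ("chr1", 150, "B2"),
    ("chr1", 900100, "B1"),
    ("chr1", 900200, "B2"),
    ("chr1", 900300, "B3"),
]
shared, jaccard = barcode_overlap(
    toy_reads, ("chr1", 0, 1000), ("chr1", 900000, 901000)
)
```

In practice a real caller would need a background model for how often unrelated windows share barcodes by chance; the raw count and Jaccard index here are only the simplest possible summary.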


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Chong Chu ◽  
Rebeca Borges-Monroy ◽  
Vinayak V. Viswanadham ◽  
Soohyun Lee ◽  
Heng Li ◽  
...  

Abstract Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Xueyi Dong ◽  
Luyi Tian ◽  
Quentin Gouil ◽  
Hasaru Kariyawasam ◽  
Shian Su ◽  
...  

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.


2020 ◽  
Author(s):  
Andrew J. Page ◽  
Nabil-Fareed Alikhan ◽  
Michael Strinden ◽  
Thanh Le Viet ◽  
Timofey Skvortsov

Abstract Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state of the art software and find that its results are identical to those obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.


2021 ◽  
Author(s):  
Ricardo Ramirez ◽  
Nicholas van Buuren ◽  
Lindsay Gamelin ◽  
Cameron Soulette ◽  
Lindsey May ◽  
...  

Hepatitis B virus (HBV) can integrate into the chromosomes of infected hepatocytes, creating potentially oncogenic lesions that can lead to hepatocellular carcinoma (HCC). However, our current understanding of integrated HBV DNA architecture, burden and transcriptional activity is incomplete due to technical limitations. A combination of genomics approaches was used to describe HBV integrations and corresponding transcriptional signatures in three HCC cell lines: huH-1, PLC/PRF/5 and Hep3B. To generate high coverage long-read sequencing data, a custom panel of HBV-targeting biotinylated oligonucleotide probes was designed. Targeted long-read DNA sequencing captured entire HBV integration events within individual reads, revealing that integrations may include deletions and inversions of viral sequences. Surprisingly, all three HCC cell lines contain integrations that are associated with host chromosomal translocations. In addition, targeted long-read RNA sequencing allowed for the assignment of transcriptional activity to specific integrations and resolved the contribution of overlapping HBV transcripts. HBV transcripts chimeric with host sequences were resolved in their entirety and often included >1000 bp of host sequence. This study provides the first comprehensive description of HBV integrations and associated transcriptional activity in three commonly utilized HCC-derived cell lines. The application of novel methods sheds new light on the complexity of these integrations, including HBV bidirectional transcription, nested transcripts, silent integrations and host genomic rearrangements. The observation of multiple HBV-associated chromosomal translocations gives rise to the hypothesis that HBV may be a driver of genetic instability and provides a potential new mechanism for HCC development. Importance: HCC-derived cell lines have served as practical models to study HBV biology for decades. These cell lines harbor multiple HBV integrations and express only HBV surface antigen (HBsAg). To date, an accurate description of the integration burden, architecture and transcriptional profile of these cell lines has been limited due to technical constraints. We have developed a targeted long-read sequencing assay which reveals the entire architecture of integrations in these cell lines. In addition, we identified five chromosomal translocations with integrated HBV DNA at the inter-chromosomal junctions. Incorporation of long-read RNA-Seq data indicated that many integrations and translocations were transcriptionally silent. The observation of multiple HBV-associated translocations has strong implications regarding the potential mechanisms for the development of HBV-associated HCC.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Robert Bücking ◽  
Murray P Cox ◽  
Georgi Hudjashov ◽  
Lauri Saag ◽  
Herawati Sudoyo ◽  
...  

Abstract Background: Traces of interbreeding of Neanderthals and Denisovans with modern humans in the form of archaic DNA have been detected in the genomes of present-day human populations outside sub-Saharan Africa. Up to now, only nuclear archaic DNA has been detected in modern humans; we therefore attempted to identify archaic mitochondrial DNA (mtDNA) residing in modern human nuclear genomes as nuclear inserts of mitochondrial DNA (NUMTs). Results: We analysed 221 high-coverage genomes from Oceania and Indonesia using an approach which identifies reads that map both to the nuclear and mitochondrial DNA. We then classified reads according to the source of the mtDNA, and found one NUMT of Denisovan mtDNA origin, present in 15 analysed genomes; analysis of the flanking region suggests that this insertion is more likely to have happened in a Denisovan individual and introgressed into modern humans with the Denisovan nuclear DNA, rather than in a descendant of a Denisovan female and a modern human male. Conclusions: Here we present our pipeline for detecting introgressed NUMTs in next generation sequencing data that can be used on genomes sequenced in the future. Further discovery of such archaic NUMTs in modern humans can be used to detect interbreeding between archaic and modern humans and can reveal new insights into the nature of such interbreeding events.


2019 ◽  
Vol 8 (34) ◽  
Author(s):  
Natsuki Tomariguchi ◽  
Kentaro Miyazaki

Rubrobacter xylanophilus strain AA3-22, belonging to the phylum Actinobacteria, was isolated from nonvolcanic Arima Onsen (hot spring) in Japan. Here, we report the complete genome sequence of this organism, which was obtained by combining Oxford Nanopore long-read and Illumina short-read sequencing data.


Author(s):  
Shifu Chen ◽  
Changshou He ◽  
Yingqiang Li ◽  
Zhicheng Li ◽  
Charles E Melançon

Abstract In this paper, we present a toolset and related resources for rapid identification of viruses and microorganisms from short-read or long-read sequencing data. We present fastv as an ultra-fast tool to detect microbial sequences present in sequencing data, identify target microorganisms and visualize coverage of microbial genomes. This tool is based on the k-mer mapping and extension method. K-mer sets are generated by UniqueKMER, another tool provided in this toolset. UniqueKMER can generate complete sets of unique k-mers for each genome within a large set of viral or microbial genomes. For convenience, unique k-mers for microorganisms and common viruses that afflict humans have been generated and are provided with the tools. As a lightweight tool, fastv accepts FASTQ data as input and directly outputs the results in both HTML and JSON formats. Prior to the k-mer analysis, fastv automatically performs adapter trimming, quality pruning, base correction and other preprocessing to ensure the accuracy of k-mer analysis. Specifically, fastv provides built-in support for rapid severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) identification and typing. Experimental results showed that fastv achieved 100% sensitivity and 100% specificity for detecting SARS-CoV-2 from sequencing data; and can distinguish SARS-CoV-2 from SARS, Middle East respiratory syndrome and other coronaviruses. This toolset is available at: https://github.com/OpenGene/fastv.
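
The unique k-mer idea described in the abstract can be sketched in a few lines of Python. This is an illustration only, not the code of fastv or UniqueKMER: the function names `kmers`, `unique_kmers` and `count_hits` are hypothetical, the toy sequences are invented, and the sketch ignores reverse complements, sequencing errors and the extension step that the abstract mentions. It shows only the core notion that a k-mer occurring in exactly one genome of a reference set can serve as a marker for that genome.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-length substring of seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def unique_kmers(genomes, k):
    """Map each genome name to the set of k-mers found in no other genome.

    `genomes` is a dict of name -> sequence string.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            owners[km].add(name)
    unique = {name: set() for name in genomes}
    for km, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(km)
    return unique


def count_hits(read, unique, k):
    """Count, per genome, how many k-mers of the read are unique to it."""
    hits = {name: 0 for name in unique}
    for km in kmers(read, k):
        for name, kset in unique.items():
            if km in kset:
                hits[name] += 1
    return hits


# Toy example with two short, invented "genomes".
toy_genomes = {"a": "ACGTAC", "b": "CGTACC"}
toy_unique = unique_kmers(toy_genomes, 3)
toy_hits = count_hits("TACC", toy_unique, 3)
```

With real genomes, k would be much larger than 3 so that unique k-mers are specific, and the per-genome hit counts would be compared against a threshold before calling a detection.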

