Characterization of LINE-1 transposons in a human genome at allelic resolution

Abstract The activity of the retrotransposon LINE-1 has created a substantial portion of the human genome. Most of this sequence comprises fractured and debilitated LINE-1s. An accurate approximation of the number, location, and sequence of the LINE-1 elements present in any single genome has proven elusive due to the difficulty of assembling and phasing the repetitive and polymorphic regions of the human genome. Through an in-depth analysis of publicly available, deep, long-read assemblies of nearly homozygous human genomes, we defined the location and sequence of all intact LINE-1s in these assemblies. We found 148 and 142 intact LINE-1s in two nearly homozygous assemblies. A combination of these assemblies suggests that a diploid human genome contains at least 50% more intact LINE-1s than previous estimates; in this case, 290 intact LINE-1s at 194 loci. We think this is the best approximation, to date, of the number of intact LINE-1s in a single diploid human genome. In addition to counting intact LINE-1 elements, we resolved the sequence of each element, including some LINE-1 elements in unassembled, presumably centromeric regions of the genome. A comparison of the intact LINE-1s in each assembly shows the specific pattern of variation between these genomes, including LINE-1s that remain intact in only one genome, allelic variation in shared intact LINE-1s, and LINE-1s that are unique (presumably young) insertions in only one genome. We found that many old elements (> 6 million years old) remain intact, and comparison of the young and intact LINE-1s across assemblies reinforces the notion that only a small portion of all LINE-1 sequences that may be intact in the genomes of the human population has been uncovered. This dataset provides the first nearly comprehensive estimate of LINE-1 diversity within an individual, an important dataset in the quest to understand the functional consequences of sequence variation in LINE-1 and the complete set of LINE-1s in the human population.
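
The counts quoted in this abstract (148 and 142 intact elements, 290 elements at 194 loci) can be cross-checked with simple set arithmetic. The sketch below is only an illustration: it assumes that each nearly homozygous assembly stands in for one haplotype and that each locus carries at most one intact element per assembly. The abstract states neither assumption, and the function name `partition_loci` is hypothetical.

```python
def partition_loci(n_a: int, n_b: int, n_loci: int) -> dict:
    """Split intact-element counts into shared and assembly-specific loci.

    Assumption (not from the abstract): at most one intact element per
    locus per assembly, so n_a + n_b counts every shared locus twice.
    """
    shared = n_a + n_b - n_loci
    if shared < 0 or shared > min(n_a, n_b):
        raise ValueError("counts are inconsistent with the one-per-locus assumption")
    return {
        "total_elements": n_a + n_b,   # elements summed over both assemblies
        "shared_loci": shared,         # loci intact in both assemblies
        "only_a": n_a - shared,        # loci intact only in the first assembly
        "only_b": n_b - shared,        # loci intact only in the second assembly
    }


result = partition_loci(148, 142, 194)
print(result)
```

Under these assumptions the three locus categories sum back to the 194 loci reported, which is a quick consistency check rather than a result taken from the paper.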

SARS-CoV-2 variants with reduced infectivity and varied sensitivity to the BNT162b2 vaccine are developed during the course of infection

PLoS Pathogens ◽

10.1371/journal.ppat.1010242 ◽

2022 ◽

Vol 18 (1) ◽

pp. e1010242

Author(s):

Dina Khateeb ◽

Tslil Gabrieli ◽

Bar Sofer ◽

Adi Hattar ◽

Sapir Cordela ◽

...

Keyword(s):

Neutralizing Antibodies ◽

Human Population ◽

Natural Infection ◽

Infected Individual ◽

Low Frequencies ◽

Neutralization Activity ◽

Single Genome ◽

Depth Analysis ◽

S Genes

In-depth analysis of SARS-CoV-2 quasispecies is pivotal for a thorough understanding of its evolution during infection. The recent deployment of COVID-19 vaccines, which elicit protective anti-spike neutralizing antibodies, has stressed the importance of uncovering and characterizing SARS-CoV-2 variants with mutated spike proteins. Sequencing databases have made it possible to follow the spread of SARS-CoV-2 variants that are circulating in the human population, and several experimental platforms were developed to study these variants. However, less is known about the SARS-CoV-2 variants that develop in the respiratory system of the infected individual. To gain further insight into SARS-CoV-2 mutagenesis during natural infection, we performed single-genome sequencing of SARS-CoV-2 isolated from nose-throat swabs of infected individuals. Interestingly, intra-host SARS-CoV-2 variants with mutated S genes or N genes were detected in all individuals who were analyzed. These intra-host variants were present at low frequencies in the swab samples and were rarely documented in current sequencing databases. Further examination of representative spike variants identified by our analysis showed that these variants have impaired infectivity and varied sensitivity to neutralization by convalescent plasma and by plasma from vaccinated individuals. Notably, analysis of the plasma neutralization activity against these variants showed that the L1197I mutation in the S2 subunit of the spike can affect the plasma neutralization activity. Together, these results suggest that SARS-CoV-2 intra-host variants should be further analyzed for a more thorough characterization of potential circulating variants.

Towards a better understanding of the low recall of insertion variants with short-read based variant callers

10.1101/2020.06.09.142232 ◽

2020 ◽

Author(s):

Wesley Delage ◽

Julien Thevenon ◽

Claire Lemaitre

Keyword(s):

Gold Standard ◽

Insertion Site ◽

Structural Variants ◽

Genomic Context ◽

Breakpoint Junction ◽

Short Read ◽

Depth Analysis ◽

Long Read ◽

The Impact

Abstract Since 2009, numerous tools have been developed to detect structural variants (SVs) using short-read technologies. Insertions >50 bp are among the hardest types to discover and are drastically underrepresented in gold standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 37% could be discovered with short-read based tools. In this work, we performed an in-depth analysis of these unprecedented insertion callsets in order to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature and size of the inserted sequence, the genomic context of the insertion site and the breakpoint junction complexity. Because these levels are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several SV callers. Simulations showed that the most impactful factor was the insertion type rather than the genomic context, with various difficulties being handled differently among the tested SV callers, and they highlighted the lack of sequence resolution for most insertion calls. Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their [...].
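
To make the notion of per-category recall concrete, here is a minimal sketch of how recall might be tallied by insertion class against a truth set. It is not the authors' evaluation pipeline: the function name `recall_by_class`, the 10 bp position tolerance, the class labels and the toy coordinates are all hypothetical, and a real benchmark would typically also compare inserted sequence and size.

```python
from collections import defaultdict


def recall_by_class(truth, calls, tolerance=10):
    """Return recall per insertion class.

    truth: list of (chrom, pos, ins_class) tuples from a reference callset.
    calls: list of (chrom, pos) tuples reported by a caller.
    A truth insertion counts as found if any call on the same chromosome
    lies within `tolerance` bp of its position (a simplifying assumption;
    one call may match several truth entries in this scheme).
    """
    called = defaultdict(list)
    for chrom, pos in calls:
        called[chrom].append(pos)

    found = defaultdict(int)
    total = defaultdict(int)
    for chrom, pos, ins_class in truth:
        total[ins_class] += 1
        if any(abs(pos - p) <= tolerance for p in called[chrom]):
            found[ins_class] += 1
    return {c: found[c] / total[c] for c in total}


# Hypothetical toy data, for illustration only.
truth = [
    ("chr1", 1000, "novel"),
    ("chr1", 5000, "mobile_element"),
    ("chr2", 300, "tandem_duplication"),
    ("chr2", 9000, "mobile_element"),
]
calls = [("chr1", 1004), ("chr2", 8990), ("chr3", 50)]

recall = recall_by_class(truth, calls)
print(recall)
```

Grouping by class in this way is what lets a benchmark attribute missed calls to a particular insertion type instead of reporting a single pooled figure.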

Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

10.1101/020719 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ivan Sovic ◽

Mile Sikic ◽

Andreas Wilm ◽

Shannon Nicole Fenlon ◽

Swaine Chen ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Error Rates ◽

Nanopore Sequencing ◽

Structural Variants ◽

Specific Identification ◽

Long Reads ◽

Long Read ◽

Specific Error ◽

Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with a 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100 bp to 4 kbp in length, and species- and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.

Centromeric Satellite DNAs: Hidden Sequence Variation in the Human Population

Genes ◽

10.3390/genes10050352 ◽

2019 ◽

Vol 10 (5) ◽

pp. 352 ◽

Cited By ~ 28

Author(s):

Karen H. Miga

Keyword(s):

Human Genome ◽

Satellite Dna ◽

Human Population ◽

Sequence Variation ◽

Tandem Repeats ◽

Association Studies ◽

Unmet Need ◽

Satellite Dnas ◽

Dna Variation ◽

Medical Genomics

The central goal of medical genomics is to understand the inherited basis of sequence variation that underlies human physiology, evolution, and disease. Functional association studies currently ignore millions of bases that span each centromeric region and acrocentric short arm. These regions are enriched in long arrays of tandem repeats, or satellite DNAs, that are known to vary extensively in copy number and repeat structure in the human population. Satellite sequence variation in the human genome is often so large that it is detected cytogenetically, yet due to the lack of a reference assembly and informatics tools to measure this variability, contemporary high-resolution disease association studies are unable to detect causal variants in these regions. Nevertheless, recently uncovered associations between satellite DNA variation and human disease support the view that these regions represent a substantial and biologically important fraction of human sequence variation. Therefore, there is a pressing and unmet need to detect and incorporate this uncharacterized sequence variation into broad studies of human evolution and medical genomics. Here I discuss the current knowledge of satellite DNA variation in the human genome, focusing on centromeric satellites and their potential implications for disease.

Comparison of different technologies for the decipherment of the whole genome sequence of Campylobacter jejuni BfR-CA-14430

Gut Pathogens ◽

10.1186/s13099-019-0340-7 ◽

2019 ◽

Vol 11 (1) ◽

Author(s):

Lennard Epping ◽

Julia C. Golz ◽

Marie-Theres Knüver ◽

Charlotte Huber ◽

Andrea Thürmer ◽

...

Keyword(s):

Campylobacter Jejuni ◽

Genome Sequence ◽

Bacterial Species ◽

Illumina Miseq ◽

Chicken Meat ◽

Whole Genome ◽

Short Read ◽

Plasmid Sequence ◽

Depth Analysis ◽

Long Read

Abstract Background: Campylobacter jejuni is a zoonotic pathogen that infects the human gut through the food chain, mainly by consumption of undercooked chicken meat, ready-to-eat food cross-contaminated by raw chicken, or raw milk. In the last decades, C. jejuni has increasingly become the most common bacterial cause of food-borne infections in high-income countries, costing public health systems billions of euros each year. Currently, different whole genome sequencing techniques, such as short-read bridge amplification and long-read single molecule real-time sequencing, are applied for in-depth analysis of bacterial species, in particular Illumina MiSeq, PacBio and MinION. Results: In this study, we analyzed a recently isolated C. jejuni strain from chicken meat using short- and long-read data from Illumina, PacBio and MinION sequencing technologies. For comparability, this strain is used in the German PAC-CAMPY research consortium in several studies, including phenotypic analysis of biofilm formation, natural transformation and in vivo colonization models. The complete assembled genome sequence most likely consists of a chromosome of 1,645,980 bp covering 1665 coding sequences as well as a plasmid sequence of 41,772 bp that encodes 46 genes. Multilocus sequence typing revealed that the strain belongs to the clonal complex CC-21 (ST-44), which is known to be involved in C. jejuni human infections, including outbreaks. Furthermore, we discovered resistance determinants and a point mutation in the DNA gyrase (gyrA) that render the bacterium resistant to ampicillin, tetracycline and (fluoro-)quinolones. Conclusion: The comparison of Illumina MiSeq, PacBio and MinION sequencing and analyses with different assembly tools enabled us to reconstruct a complete chromosome as well as a circular plasmid sequence of the C. jejuni strain BfR-CA-14430. Illumina short-read sequencing in combination with either PacBio or MinION can substantially improve the quality of the complete chromosome and epichromosomal elements on the level of mismatches and insertions/deletions, depending on the assembly program used.

Acid phosphatase locus 1 (ACP1): Possible relationship of allelic variation to body size and human population adaptation to thermal stress? A theoretical perspective

American Journal of Human Biology ◽

10.1002/1520-6300(200009/10)12:5<688::aid-ajhb14>3.0.co;2-c ◽

2000 ◽

Vol 12 (5) ◽

pp. 688-701 ◽

Cited By ~ 7

Author(s):

Lawrence S. Greene ◽

Nunzio Bottini ◽

Paola Borgiani ◽

Fulvia Gloria-Bottini

Keyword(s):

Thermal Stress ◽

Body Size ◽

Acid Phosphatase ◽

Human Population ◽

Allelic Variation ◽

Theoretical Perspective ◽

Relationship Of ◽

Population Adaptation

Endogenous Retroviruses and Human Evolution

Comparative and Functional Genomics ◽

10.1002/cfg.216 ◽

2002 ◽

Vol 3 (6) ◽

pp. 494-498 ◽

Cited By ~ 30

Author(s):

Konstantin Khodosevich ◽

Yuri Lebedev ◽

Eugene Sverdlov

Keyword(s):

Human Genome ◽

Human Evolution ◽

Endogenous Retroviruses ◽

Regulatory Sequences ◽

Coding Regions ◽

Regulatory Systems ◽

Human Genes ◽

Functional Consequences ◽

Polyadenylation Signals ◽

Human Specific

Humans share about 99% of their genomic DNA with chimpanzees and bonobos; thus, the differences between these species are unlikely to be in gene content but could be caused by inherited changes in regulatory systems. Endogenous retroviruses (ERVs) comprise ∼ 5% of the human genome. The LTRs of ERVs contain many regulatory sequences, such as promoters, enhancers, polyadenylation signals and factor-binding sites. Thus, they can influence the expression of nearby human genes. All known human-specific LTRs belong to the HERV-K (human ERV) family, the most active family in the human genome. It is likely that some of these ERVs could have integrated into regulatory regions of the human genome, and therefore could have had an impact on the expression of adjacent genes, which have consequently contributed to human evolution. This review discusses possible functional consequences of ERV integration in active coding regions.

An Upper Limit on the Functional Fraction of the Human Genome

Genome Biology and Evolution ◽

10.1093/gbe/evx121 ◽

2017 ◽

Vol 9 (7) ◽

pp. 1880-1885 ◽

Cited By ~ 37

Author(s):

Dan Graur

Keyword(s):

Human Genome ◽

Human Population ◽

Mutational Load ◽

Deleterious Mutations ◽

Replacement Level ◽

Functional Regions ◽

Upper Limit ◽

Constant Size ◽

Mean Fitness ◽

The Mean

Abstract For the human population to maintain a constant size from generation to generation, an increase in fertility must compensate for the reduction in the mean fitness of the population caused, among other factors, by deleterious mutations. The required increase in fertility due to this mutational load depends on the number of sites in the genome that are functional, the mutation rate, and the fraction of deleterious mutations among all mutations in functional regions. These dependencies and the fact that there exists a maximum tolerable replacement level fertility can be used to put an upper limit on the fraction of the human genome that can be functional. Mutational load considerations lead to the conclusion that the functional fraction within the human genome cannot exceed 15%.

Illuminating the dark side of the human transcriptome with long read transcript sequencing

10.21203/rs.3.rs-23156/v3 ◽

2020 ◽

Author(s):

Richard Kuo ◽

Yuanyuan Cheng ◽

Runxuan Zhang ◽

John W.S. Brown ◽

Jacqueline Smith ◽

...

Keyword(s):

Data Processing ◽

Error Correction ◽

Human Genome ◽

Parameter Tuning ◽

Dark Side ◽

Sequencing Data ◽

Protein Coding ◽

Human Transcriptome ◽

Model Predictions ◽

Long Read

Abstract Background: The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide additional evidence for rare transcripts and genes such as mono-exonic and non-coding genes that were previously either undetectable or impossible to differentiate from sequencing noise. Results: We developed the Transcriptome Annotation by Modular Algorithms (TAMA) software to leverage the power of long read transcript sequencing and address the issues with current data processing pipelines. TAMA achieved high sensitivity and precision for gene and transcript model predictions in both reference guided and unguided approaches in our benchmark tests using simulated Pacific Biosciences (PacBio) and Nanopore sequencing data and real PacBio datasets. By analyzing PacBio Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using TAMA and other commonly used tools, we found that the convention of using alignment identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction can cause major changes to read mapping, resulting in potentially over 6K erroneous gene model predictions in the Iso-Seq based human genome annotation. Using TAMA’s genome assembly based error correction and gene feature evidence, we predicted 2,566 putative novel non-coding genes and 1,557 putative novel protein coding gene models. Conclusions: Long read transcript sequencing data has the power to identify novel genes within the highly annotated human genome. The use of parameter tuning and extensive output information of the TAMA software package allows for in-depth exploration of eukaryotic transcriptomes. We have found evidence from long read data for thousands of unannotated genes within the human genome. More development in sequencing library preparation and data processing is required for differentiating sequencing noise from real genes in long read RNA sequencing data.

Long-Read-Sequenced Reference Genomes of the Seven Major Lineages of Enterotoxigenic Escherichia Coli (ETEC) Circulating in Modern Time

10.21203/rs.3.rs-237525/v1 ◽

2021 ◽

Author(s):

Astrid von Mentzer ◽

Grace A Blackwell ◽

Derek Pickard ◽

Christine J Boinett ◽

Enrique Joffré ◽

...

Keyword(s):

Escherichia Coli ◽

Large Scale ◽

Bacterial Species ◽

Enterotoxigenic Escherichia Coli ◽

Poor Countries ◽

Depth Analysis ◽

Sequencing Studies ◽

Long Read ◽

To Come ◽

Reference Genomes

Abstract Enterotoxigenic Escherichia coli (ETEC) is an enteric pathogen responsible for the majority of diarrheal cases worldwide. ETEC infections are estimated to cause 80,000 fatalities per year, with the highest burden, ca. 75 million cases per year, amongst children under five years of age in resource-poor countries. It is also the leading cause of diarrhoea in travellers. Previous large-scale sequencing studies have found seven major ETEC lineages currently in circulation worldwide. We used PacBio long-read sequencing combined with Illumina sequencing to create high-quality complete reference genomes for each of the major lineages with manually curated chromosomes and plasmids. We confirm that the major ETEC lineages all harbour conserved plasmids that have been associated with their respective background genomes for decades and that the plasmids and chromosomes of ETEC are both crucial for ETEC virulence and success as pathogens. The in-depth analysis of gene content, synteny and correct annotations of plasmids will elucidate other plasmids with and without virulence factors in related bacterial species. These reference genomes allow for fast and accurate comparison between different ETEC strains, and these data will form the foundation of ETEC genomics research for years to come.
