PGP1 personal genome assembly - a hybrid assembly dataset using ONT's PromethION and PacBio's HiFi sequencing

Mapping Intimacies ◽

10.1101/2021.09.03.458806 ◽

2021 ◽

Author(s):

Hui-Su Kim ◽

Changjae Kim ◽

George McDonald Church ◽

Jong Bhak

Keyword(s):

Genome Assembly ◽

Gene Annotation ◽

Genome Project ◽

Chromosomal Mapping ◽

Personal Genome ◽

Hybrid Assembly ◽

Mapping Data ◽

Contig Assembly ◽

Base Calling ◽

Manual Curation

PGP1 is the first participant of the Personal Genome Project. We present PGP1's chromosome-scale genome assembly. It was constructed using 255 Gb of ultra-long PromethION reads and 97 Gb of short paired-end reads. To reduce base-calling errors, we corrected the PromethION reads using 72 Gb of PacBio HiFi reads. We used 327 Gb of Hi-C chromosomal mapping data to maximize the assembly's contiguity. PGP1's contig assembly was 3.01 Gb in length, comprising 4,234 contigs with an N50 value of 33.8 Mb. After scaffolding with Hi-C data and extensive manual curation, we obtained a chromosome-scale assembly of 3,880 scaffolds with an N50 value of 142 Mb. In the Merqury assessment, the PGP1 assembly achieved a high quality value (QV) of Q45.45. For gene annotation, we predicted 106,789 genes using a liftover from Gencode 38 and an assembly of transcriptome data.
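To put the reported QV in context, the following minimal sketch converts a Phred-scaled quality value into a per-base error rate, assuming the standard relation QV = -10 log10(error rate). The function names are hypothetical, and the sketch does not reproduce how Merqury estimates the error rate from k-mers; it only interprets the final number.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


# QV reported in the abstract above.
rate = qv_to_error_rate(45.45)
print(f"Q45.45 corresponds to about {rate:.2e} errors per base")
print(f"that is roughly one error every {1 / rate:,.0f} bases")
```

Under this relation, each additional 10 QV points corresponds to a tenfold lower error rate.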


Transcript- and annotation-guided genome assembly of the European starling

10.1101/2021.04.07.438753 ◽

2021 ◽

Author(s):

Katarina C. Stuart ◽

Richard J. Edwards ◽

Yuanyuan Cheng ◽

Wesley C. Warren ◽

David W. Burt ◽

...

Keyword(s):

Genome Assembly ◽

Gene Annotation ◽

European Starling ◽

Genomic Research ◽

Important Species ◽

Base Calling ◽

Sequencing Technologies ◽

Long Read ◽

Full Length Transcript ◽

Species Specific

The European starling, Sturnus vulgaris, is an ecologically significant, globally invasive avian species that is also suffering from a major decline in its native range. Here, we present the genome assembly and long-read transcriptome of an Australian-sourced European starling (S. vulgaris vAU), and a second North American genome (S. vulgaris vNA), as complementary reference genomes for population genetic and evolutionary characterisation. S. vulgaris vAU combined 10x Genomics linked-reads, low-coverage Nanopore sequencing, and PacBio Iso-Seq full-length transcript scaffolding to generate a 1050 Mb assembly on 1,628 scaffolds (72.5 Mb scaffold N50). Species-specific transcript mapping and gene annotation revealed high structural and functional completeness (94.6% BUSCO completeness). Further scaffolding against the high-quality zebra finch (Taeniopygia guttata) genome assigned 98.6% of the assembly to 32 putative nuclear chromosome scaffolds. Rapid, recent advances in sequencing technologies and bioinformatics software have highlighted the need for evidence-based assessment of assembly decisions on a case-by-case basis. Using S. vulgaris vAU, we demonstrate how the multifunctional use of PacBio Iso-Seq transcript data and complementary homology-based annotation of sequential assembly steps (assessed using a new tool, SAAGA) can be used to assess, inform, and validate assembly workflow decisions. We also highlight some counter-intuitive behaviour in traditional BUSCO metrics, and present BUSCOMP, a complementary tool for assembly comparison designed to be robust to differences in assembly size and base-calling quality. Finally, we present a second starling assembly, S. vulgaris vNA, to facilitate comparative analysis and global genomic research on this ecologically important species.


The Lithuanian reference genome LT1 - a human de novo genome assembly with short and long read sequence and Hi-C data

10.1101/2021.04.05.438426 ◽

2021 ◽

Author(s):

Hui-Su Kim ◽

Asta Blazyte ◽

Sungwon Jeon ◽

Changhan Yoon ◽

Yeonkyung Kim ◽

...

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Gene Prediction ◽

Chromosomal Mapping ◽

Autosomal SNPs ◽

Contig Assembly ◽

De Novo Genome Assembly ◽

Human Reference Genome ◽

The Baltic

We present LT1, the first high-quality human reference genome from the Baltic States. LT1 is a female de novo human reference genome assembly constructed using 57× of ultra-long nanopore reads and 47× of short paired-end reads. We also utilized 72 Gb of Hi-C chromosomal mapping data to maximize the assembly's contiguity and accuracy. LT1's contig assembly was 2.73 Gbp in length, comprising 4,490 contigs with an N50 value of 13.4 Mbp. After scaffolding with Hi-C data and extensive manual curation, we produced a chromosome-scale assembly with an N50 value of 138 Mbp and 4,699 scaffolds. Our gene prediction quality assessment using BUSCO identified 89.3% of the single-copy orthologous genes included in the benchmarking set. Detailed characterization of LT1 suggested it has 73,744 predicted transcripts, 4.2 million autosomal SNPs, 974,000 short indels, and 12,330 large structural variants. These data are shared as a public resource without any restrictions and can be used as a benchmark for further in-depth genomic analyses of the Baltic populations.
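The contig and scaffold N50 values quoted in these abstracts follow a simple definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation is shown below; the function name `n50` and the toy lengths are illustrative assumptions, not data from any of the listed assemblies.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings coverage to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


# Toy example with made-up lengths (arbitrary units).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about base-level accuracy, which is why the abstracts report it alongside measures such as BUSCO completeness.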


Chromosome-level genome assembly and manually-curated proteome of model necrotroph Parastagonospora nodorum Sn15 reveals a genome-wide trove of candidate effector homologs, and redundancy of virulence-related functions within an accessory chromosome

BMC Genomics ◽

10.1186/s12864-021-07699-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Stefania Bertazzoni ◽

Darcy A. B. Jones ◽

Huyen T. Phan ◽

Kar-Chun Tan ◽

James K. Hane

Keyword(s):

Genome Assembly ◽

Plant Pathogens ◽

Gene Annotation ◽

Specific Gene ◽

Accessory Chromosome ◽

Reference Isolate ◽

Manual Curation ◽

A Genome ◽

Depth Analysis ◽

Parastagonospora Nodorum

Background: The fungus Parastagonospora nodorum causes septoria nodorum blotch (SNB) of wheat (Triticum aestivum) and is a model species for necrotrophic plant pathogens. The genome assembly of reference isolate Sn15 was first reported in 2007. P. nodorum infection is promoted by its production of proteinaceous necrotrophic effectors, three of which are characterised – ToxA, Tox1 and Tox3.
Results: A chromosome-scale genome assembly of P. nodorum Australian reference isolate Sn15, which combined long read sequencing, optical mapping and manual curation, produced 23 chromosomes with 21 chromosomes possessing both telomeres. New transcriptome data were combined with fungal-specific gene prediction techniques and manual curation to produce a high-quality predicted gene annotation dataset, which comprises 13,869 high confidence genes, and an additional 2534 lower confidence genes retained to assist pathogenicity effector discovery. Comparison to a panel of 31 internationally-sourced isolates identified multiple hotspots within the Sn15 genome for mutation or presence-absence variation, which was used to enhance subsequent effector prediction. Effector prediction resulted in 257 candidates, of which 98 higher-ranked candidates were selected for in-depth analysis and revealed a wealth of functions related to pathogenicity. Additionally, 11 out of the 98 candidates also exhibited orthology conservation patterns that suggested lateral gene transfer with other cereal-pathogenic fungal species. Analysis of the pan-genome indicated the smallest chromosome of 0.4 Mbp length to be an accessory chromosome (AC23). AC23 was notably absent from an avirulent isolate and is predominated by mutation hotspots with an increase in non-synonymous mutations relative to other chromosomes. Surprisingly, AC23 was deficient in effector candidates, but contained several predicted genes with redundant pathogenicity-related functions.
Conclusions: We present an updated series of genomic resources for P. nodorum Sn15 – an important reference isolate and model necrotroph – with a comprehensive survey of its predicted pathogenicity content.


Use of SNP chips to detect rare pathogenic variants: retrospective, population based diagnostic evaluation

BMJ ◽

10.1136/bmj.n214 ◽

2021 ◽

pp. n214

Author(s):

Weedon MN ◽

Jackson L ◽

Harrison JW ◽

Ruth KS ◽

Tyrrell J ◽

...

Keyword(s):

Positive Predictive Value ◽

Predictive Value ◽

Population Based ◽

Genome Project ◽

Personal Genome ◽

UK Biobank ◽

Sequencing Data ◽

SNP Chip ◽

Pathogenic Variants ◽

The UK

Objective: To determine whether the sensitivity and specificity of SNP chips are adequate for detecting rare pathogenic variants in a clinically unselected population.
Design: Retrospective, population based diagnostic evaluation.
Participants: 49 908 people recruited to the UK Biobank with SNP chip and next generation sequencing data, and an additional 21 people who purchased consumer genetic tests and shared their data online via the Personal Genome Project.
Main outcome measures: Genotyping (that is, identification of the correct DNA base at a specific genomic location) using SNP chips versus sequencing, with results split by frequency of that genotype in the population. Rare pathogenic variants in the BRCA1 and BRCA2 genes were selected as an exemplar for detailed analysis of clinically actionable variants in the UK Biobank, and BRCA related cancers (breast, ovarian, prostate, and pancreatic) were assessed in participants through use of cancer registry data.
Results: Overall, genotyping using SNP chips performed well compared with sequencing; sensitivity, specificity, positive predictive value, and negative predictive value were all above 99% for 108 574 common variants directly genotyped on the SNP chips and sequenced in the UK Biobank. However, the likelihood of a true positive result decreased dramatically with decreasing variant frequency; for variants that are very rare in the population, with a frequency below 0.001% in UK Biobank, the positive predictive value was very low and only 16% of 4757 heterozygous genotypes from the SNP chips were confirmed with sequencing data. Results were similar for SNP chip data from the Personal Genome Project, and 20/21 individuals analysed had at least one false positive rare pathogenic variant that had been incorrectly genotyped. For pathogenic variants in the BRCA1 and BRCA2 genes, which are individually very rare, the overall performance metrics for the SNP chips versus sequencing in the UK Biobank were: sensitivity 34.6%, specificity 98.3%, positive predictive value 4.2%, and negative predictive value 99.9%. Rates of BRCA related cancers in UK Biobank participants with a positive SNP chip result were similar to those for age matched controls (odds ratio 1.31, 95% confidence interval 0.99 to 1.71) because the vast majority of variants were false positives, whereas sequence positive participants had a significantly increased risk (odds ratio 4.05, 2.72 to 6.03).
Conclusions: SNP chips are extremely unreliable for genotyping very rare pathogenic variants and should not be used to guide health decisions without validation.
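The drop in positive predictive value for rare variants follows from Bayes' rule: when true carriers are scarce, even a small false positive rate produces more false calls than true ones. The sketch below illustrates this using the sensitivity and specificity quoted in the abstract; the carrier frequencies in the loop are hypothetical values chosen for illustration and are not taken from the study, and the function name is an assumption.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability that a positive call is a true positive (Bayes' rule)."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_positive = sensitivity * prevalence
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    if true_positive + false_positive == 0.0:
        raise ValueError("no positive calls are possible with these inputs")
    return true_positive / (true_positive + false_positive)


# Sensitivity and specificity as quoted in the abstract for BRCA1/BRCA2;
# the prevalence values are hypothetical, for illustration only.
for prevalence in (0.1, 0.01, 0.001):
    ppv = positive_predictive_value(0.346, 0.983, prevalence)
    print(f"prevalence {prevalence:.3f}: PPV {ppv:.3f}")
```

With sensitivity and specificity held fixed, the computed PPV falls steadily as the assumed prevalence falls, which mirrors the pattern the abstract describes.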


A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome

Experimental & Molecular Medicine ◽

10.1038/s12276-021-00586-y ◽

2021 ◽

Author(s):

Seyoung Mun ◽

Songmi Kim ◽

Wooseok Lee ◽

Keunsoo Kang ◽

Thomas J. Meyer ◽

...

Keyword(s):

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Personal Genome ◽

Human Populations ◽

Whole Genome ◽

Structural Variations ◽

Insert Size ◽

Human Genomes ◽

Next Generation Sequencing (NGS)

Advances in next-generation sequencing (NGS) technology have made personal genome sequencing possible, and indeed, many individual human genomes have now been sequenced. Comparisons of these individual genomes have revealed substantial genomic differences between human populations as well as between individuals from closely related ethnic groups. Transposable elements (TEs) are known to be one of the major sources of these variations and act through various mechanisms, including de novo insertion, insertion-mediated deletion, and TE–TE recombination-mediated deletion. In this study, we carried out de novo whole-genome sequencing of one Korean individual (KPGP9) via multiple insert-size libraries. The de novo whole-genome assembly resulted in 31,305 scaffolds with a scaffold N50 size of 13.23 Mb. Furthermore, through computational data analysis and experimental verification, we revealed that 182 TE-associated structural variation (TASV) insertions and 89 TASV deletions contributed 64,232 bp in sequence gain and 82,772 bp in sequence loss, respectively, in the KPGP9 genome relative to the hg19 reference genome. We also verified structural differences associated with TASVs by comparative analysis with TASVs in recent genomes (AK1 and TCGA genomes) and reported their details. Here, we constructed a new Korean de novo whole-genome assembly and provide the first study, to our knowledge, focused on the identification of TASVs in an individual Korean genome. Our findings again highlight the role of TEs as a major driver of structural variations in human individual genomes.


The genome sequence of the European peacock butterfly, Aglais io (Linnaeus, 1758)

Wellcome Open Research ◽

10.12688/wellcomeopenres.17204.1 ◽

2021 ◽

Vol 6 ◽

pp. 258

Author(s):

Konrad Lohse ◽

Alexander Mackintosh ◽

Roger Vila ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Sex Chromosome ◽

Gene Annotation ◽

Protein Coding ◽

Individual Male ◽

Protein Coding Genes ◽

A Genome ◽

Inachis Io

We present a genome assembly from an individual male Aglais io (also known as Inachis io and Nymphalis io) (the European peacock; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 384 megabases in span. The majority (99.91%) of the assembly is scaffolded into 31 chromosomal pseudomolecules, with the Z sex chromosome assembled. Gene annotation of this assembly on Ensembl has identified 11,420 protein coding genes.


Loss of critical developmental and human disease-causing genes in 58 mammals

10.1101/819169 ◽

2019 ◽

Author(s):

Yatish Turakhia ◽

Heidi I. Chen ◽

Amir Marcovitz ◽

Gill Bejerano

Keyword(s):

Evolutionary Biology ◽

Large Scale ◽

Gene Annotation ◽

Synonymous Substitution ◽

Specific Gene ◽

High Confidence ◽

Protein Coding ◽

Congenital Diseases ◽

Manual Curation ◽

Human Genes

Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools and protein databases focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (deletion and non-synonymous substitution) as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence protein-coding gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using the hg38 human assembly as a reference, we discovered over 500 unique human genes affected by such high-confidence erosion events in different clades across 58 mammals. While most of these events likely have benign consequences, we also found dozens of clade-specific gene losses that result in early lethality in outgroup mammals or are associated with severe congenital diseases in humans. Our discoveries yield intriguing potential for translational medical genetics and for evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.


Gene Annotation and Transcriptome Delineation on a De Novo Genome Assembly for the Reference Leishmania major Friedlin Strain

Genes ◽

10.3390/genes12091359 ◽

2021 ◽

Vol 12 (9) ◽

pp. 1359

Author(s):

Esther Camacho ◽

Sandra González-de la Fuente ◽

Jose C. Solana ◽

Alberto Rastrojo ◽

Fernando Carrasco-Ramiro ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

Leishmania Major ◽

De Novo ◽

Gene Annotation ◽

Leishmania Species ◽

De Novo Genome Assembly ◽

Sequencing Platforms

Leishmania major is the main causative agent of cutaneous leishmaniasis in humans. The Friedlin strain of this species (LmjF) was chosen when a multi-laboratory consortium undertook the objective of deciphering the first genome sequence for a parasite of the genus Leishmania. The objective was successfully attained in 2005, and this represented a milestone for Leishmania molecular biology studies around the world. Although the LmjF genome sequence was obtained following a shotgun strategy and using classical Sanger sequencing, the results were excellent, and this genome assembly served as the reference for subsequent genome assemblies in other Leishmania species. Here, we present a new assembly for the genome of this strain (named LMJFC for clarity), generated by the combination of two high throughput sequencing platforms, Illumina short-read sequencing and PacBio Single Molecule Real-Time (SMRT) sequencing, which provides long-read sequences. Apart from resolving uncertain nucleotide positions, several genomic regions were reorganized and a more precise composition of tandemly repeated gene loci was attained. Additionally, the genome annotation was improved by adding 542 genes and defining more accurate coding sequences for around two hundred genes, based on the transcriptome delimitation also carried out in this work. As a result, we provide gene models (including untranslated regions and introns) for 11,238 genes. Genomic information ultimately determines the biology of every organism; therefore, our understanding of molecular mechanisms will depend on the availability of precise genome sequences and accurate gene annotations. In this regard, this work provides an improved genome sequence and updated transcriptome annotations for the reference L. major Friedlin strain.


The gene-rich genome of the scallop Pecten maximus

GigaScience ◽

10.1093/gigascience/giaa037 ◽

2020 ◽

Vol 9 (5) ◽

Cited By ~ 4

Author(s):

Nathan J Kenny ◽

Shane A McCarthy ◽

Olga Dudchenko ◽

Katherine James ◽

Emma Betteridge ◽

...

Keyword(s):

Genome Assembly ◽

Gene Annotation ◽

Atlantic Coast ◽

Shallow Waters ◽

Pharmaceutical Companies ◽

Marine Bivalve ◽

Pecten Maximus ◽

Assembly Sequence ◽

Algal Toxins ◽

Large Numbers

Background: The king scallop, Pecten maximus, is distributed in shallow waters along the Atlantic coast of Europe. It forms the basis of a valuable commercial fishery and plays a key role in coastal ecosystems and food webs. Like other filter feeding bivalves it can accumulate potent phytotoxins, to which it has evolved some immunity. The molecular origins of this immunity are of interest to evolutionary biologists, pharmaceutical companies, and fisheries management.
Findings: Here we report the genome assembly of this species, conducted as part of the Wellcome Sanger 25 Genomes Project. This genome was assembled from PacBio reads and scaffolded with 10X Chromium and Hi-C data. Its 3,983 scaffolds have an N50 of 44.8 Mb (longest scaffold 60.1 Mb), with 92% of the assembly sequence contained in 19 scaffolds, corresponding to the 19 chromosomes found in this species. The total assembly spans 918.3 Mb and is the best-scaffolded marine bivalve genome published to date, exhibiting 95.5% recovery of the metazoan BUSCO set. Gene annotation resulted in 67,741 gene models. Analysis of gene content revealed large numbers of gene duplicates, as previously seen in bivalves, with little gene loss, in comparison with the sequenced genomes of other marine bivalve species.
Conclusions: The genome assembly of P. maximus and its annotated gene set provide a high-quality platform for studies on such disparate topics as shell biomineralization, pigmentation, vision, and resistance to algal toxins. As a result of our findings we highlight the sodium channel gene Nav1, known to confer resistance to saxitoxin and tetrodotoxin, as a candidate for further studies investigating immunity to domoic acid.


A fully-automated method discovers loss of mouse-lethal and human-monogenic disease genes in 58 mammals

Nucleic Acids Research ◽

10.1093/nar/gkaa550 ◽

2020 ◽

Vol 48 (16) ◽

pp. e91-e91

Author(s):

Yatish Turakhia ◽

Heidi I Chen ◽

Amir Marcovitz ◽

Gill Bejerano

Keyword(s):

Evolutionary Biology ◽

Large Scale ◽

Gene Annotation ◽

Monogenic Disease ◽

Disease Genes ◽

Congenital Diseases ◽

Manual Curation ◽

Automated Method ◽

Human Ortholog ◽

Early Mouse

Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (amino acid deletions and substitutions) and sister species support as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using human as reference, we discovered over 400 unique human ortholog erosion events across 58 mammals. This includes dozens of clade-specific losses of genes that result in early mouse lethality or are associated with severe human congenital diseases. Our discoveries yield intriguing potential for translational medical genetics and evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.
