human genome assembly
Recently Published Documents


TOTAL DOCUMENTS

27
(FIVE YEARS 16)

H-INDEX

7
(FIVE YEARS 4)

2021 ◽  
Vol 6 ◽  
pp. 239
Author(s):  
Ernesto Lowy ◽  
Susan Fairley ◽  
Paul Flicek

The International Genome Sample Resource (IGSR) repository was established to maximise the utility of human genetic data derived from openly consented samples within the research community. Here we describe variant detection in 505 samples from four populations in The Gambia, using the GRCh38 reference genome, adding to the range of populations for which this has been done and, importantly, making allele frequencies available. A multi-caller site discovery process was applied along with imputation and phasing to produce a phased biallelic single nucleotide variant (SNV) and insertion/deletion (INDEL) call set. Variation had not previously been explored on the GRCh38 human genome assembly for 387 of the samples. Compared to our previous work with the 1000 Genomes Project data on GRCh38, we identified over nine million novel SNVs and over 870 thousand novel INDELs.


2021 ◽  
Author(s):  
Arang Rhie ◽  
Ann Mc Cartney ◽  
Kishwar Shafin ◽  
Michael Alonge ◽  
Andrey Bzikadze ◽  
...  

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
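
To put the reported QV in perspective, the sketch below converts between a quality value and a per-base error rate. It assumes QV is Phred-scaled (QV = -10 · log10 of the error probability), which is the usual convention but is not spelled out in the abstract; the function names are illustrative only.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value (assumed convention)."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    qv = 73.9  # value quoted in the abstract
    rate = qv_to_error_rate(qv)
    print(f"QV {qv}: about one error per {1 / rate:,.0f} bases")
```

Under that assumption, a QV of 73.9 corresponds to roughly one error per tens of millions of bases.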


Sensor Review ◽  
2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Mehdi Habibi ◽  
Yunus Dawji ◽  
Ebrahim Ghafar-Zadeh ◽  
Sebastian Magierowski

Purpose: Nanopore-based molecular sensing and measurement, specifically DNA sequencing, is advancing at a fast pace. Some embodiments have matured from coarse particle counters to enabling full human genome assembly. This evolution has been powered not only by improvements in the sensors themselves, but also by the assisting microelectronic CMOS readout circuitry closely interfaced to them. In this light, this paper aims to review established and emerging nanopore-based sensing modalities considered for DNA sequencing and the CMOS microelectronic methods currently being used. Design/methodology/approach: Readout and amplifier circuits that are potentially appropriate for conditioning and conversion of nanopore signals for downstream processing are studied. Furthermore, the paper focuses on arrayed CMOS readout implementations and reviews the relevant status of nanopore sensor technology. Findings: Ion channel nanopore devices have unique properties compared with other electrochemical cells. Currently, biological nanopores are the only reported variants that can be used for actual DNA sequencing. The translocation rate of DNA through such pores, the current range at which these cells operate and the cell capacitance effect all impose the necessity of using low-noise circuits for signal detection. The requirement for in-pixel low-noise circuits in turn tends to impose challenges in the implementation of large arrays. Originality/value: The study presents an overview of the readout circuits used for signal acquisition in electrochemical cell arrays and investigates the specific requirements for implementing nanopore-type electrochemical cell amplifiers and their associated readout electronics.


2021 ◽  
Author(s):  
Nicolas Altemose ◽  
Glennis Logsdon ◽  
Andrey V Bzikadze ◽  
Pragya Sidhwani ◽  
Sasha A Langley ◽  
...  

Existing human genome assemblies have almost entirely excluded highly repetitive sequences within and near centromeres, limiting our understanding of their sequence, evolution, and essential role in chromosome segregation. Here, we present an extensive study of newly assembled peri/centromeric sequences representing 6.2% (189.9 Mb) of the first complete, telomere-to-telomere human genome assembly (T2T-CHM13). We discovered novel patterns of peri/centromeric repeat organization, variation, and evolution at both large and small length scales. We also found that inner kinetochore proteins tend to overlap the most recently duplicated subregions within centromeres. Finally, we compared chromosome X centromeres across a diverse panel of individuals and uncovered structural, epigenetic, and sequence variation at single-base resolution across these regions. In total, this work provides an unprecedented atlas of human centromeres to guide future studies of their complex and critical functions as well as their unique evolutionary dynamics.


2021 ◽  
Author(s):  
Ann M Mc Cartney ◽  
Kishwar Shafin ◽  
Michael Alonge ◽  
Andrey V Bzikadze ◽  
Giulio Formenti ◽  
...  

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.


2021 ◽  
Author(s):  
Barış Ekim ◽  
Bonnie Berger ◽  
Rayan Chikhi

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where the minimizers rather than DNA nucleotides are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, which are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly allow a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
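
The idea of projecting a sequence into minimizer space and enumerating k-min-mers can be illustrated with a toy sketch. This is not the rust-mdbg implementation: the selection rule (keep an l-mer when its hash falls below a density threshold), the use of CRC32 as the hash, and all function and parameter names are assumptions made for illustration.

```python
import zlib


def select_minimizers(seq: str, l: int = 4, density: float = 0.25) -> list[str]:
    """Return, in sequence order, the l-mers whose hash falls below a density threshold.

    Assumed selection rule for illustration; CRC32 is used only because it is deterministic.
    """
    threshold = density * 2**32
    picked = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if zlib.crc32(lmer.encode()) < threshold:
            picked.append(lmer)
    return picked


def k_min_mers(minimizers: list[str], k: int = 3) -> list[tuple[str, ...]]:
    """Slide a window of k consecutive minimizers; each tuple is one k-min-mer."""
    return [tuple(minimizers[i:i + k]) for i in range(len(minimizers) - k + 1)]


if __name__ == "__main__":
    read = "ACGTTGCATGCCGATAGCTTAGGCTAACGT"
    mins = select_minimizers(read, l=4, density=0.5)
    for node in k_min_mers(mins, k=3):
        # Consecutive k-min-mers share k-1 minimizers, as consecutive k-mers share k-1 bases.
        print(node)
```

In this toy version each k-min-mer plays the role of a de Bruijn graph node whose "letters" are minimizers rather than nucleotides, so the number of tokens per read is much smaller than the number of bases.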


2021 ◽  
Author(s):  
Dmitri S Pavlichin ◽  
HoJoon Lee ◽  
Stephanie U Greer ◽  
Susan M Grimes ◽  
Tsachy Weissman ◽  
...  

K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. Despite these current applications, the wider bioinformatic use of k-mers faces challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of these short sequences. The sheer amount of computation for effective use of k-mer information is enormous, particularly when involving multiple genome assemblies. To address these issues, we developed a new k-mer indexing data structure based on a hash table tuned for the lookup of k-mer keys. This web application, referred to as KmerKeys (https://kmerkeys.dgi-stanford.org/), provides performant, rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact k-mer-based searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massive parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata such as the gnomAD population variant catalog. This feature enables the incorporation of future genomic information into sequencing analysis.
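
A hash-table k-mer index with exact and fuzzy lookup can be sketched in a few lines. This is not the KmerKeys data structure: it uses a plain Python dictionary, and it assumes "fuzzy" means a Hamming distance of at most one, which the abstract does not specify. All names are hypothetical.

```python
from collections import defaultdict


def build_index(seq: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def exact_lookup(index: dict[str, list[int]], kmer: str) -> list[int]:
    """Start positions of an exact k-mer match (empty list if absent)."""
    return index.get(kmer, [])


def fuzzy_lookup(index: dict[str, list[int]], kmer: str,
                 alphabet: str = "ACGT") -> dict[str, list[int]]:
    """Indexed k-mers within Hamming distance 1 of the query (assumed meaning of 'fuzzy')."""
    hits = {}
    if kmer in index:
        hits[kmer] = index[kmer]
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt == base:
                continue
            neighbor = kmer[:i] + alt + kmer[i + 1:]
            if neighbor in index:
                hits[neighbor] = index[neighbor]
    return hits


if __name__ == "__main__":
    idx = build_index("ACGTACGA", k=4)
    print(exact_lookup(idx, "ACGT"))
    print(fuzzy_lookup(idx, "ACGT"))
```

The fuzzy search enumerates the 3k single-substitution neighbors of the query and probes the table for each, which keeps lookups constant-time per probe at the cost of more probes as the allowed distance grows.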


2021 ◽  
Author(s):  
Giuliana Giannuzzi ◽  
Glennis A. Logsdon ◽  
Nicolas Chatron ◽  
Danny E. Miller ◽  
Julie Reversat ◽  
...  

Human centromeres are composed of alpha satellite DNA hierarchically organized as higher-order repeats and epigenetically specified by CENP-A binding. Current evolutionary models assert that new centromeres are first epigenetically established and subsequently acquire an alphoid array. During routine prenatal aneuploidy diagnosis by FISH, we identified a de novo insertion of an alpha satellite DNA array (~50-300 kbp) from the centromere of chromosome 18 (D18Z1) into chromosome 15q26 euchromatin. Although bound by CENP-B, this locus did not acquire centromeric functionality, as demonstrated by lack of constriction and absence of CENP-A binding. We characterized the rearrangement by FISH and sequencing using Illumina, PacBio, and Nanopore adaptive sampling, which revealed that the insertion was associated with a 2.8 kbp deletion and likely occurred in the paternal germline. Notably, the site was located ~10 Mbp distal from the location where a centromere was ancestrally seeded and then became inactive sometime between 20 and 25 million years ago (Mya), in the common ancestor of humans and apes. Long reads spanning either junction showed that the organization of the alphoid insertion followed the 12-mer higher-order repeat structure of the D18Z1 array. Mapping to the CHM13 human genome assembly revealed that the satellite segment transposed from a specific location of the chromosome 18 centromere. The rearrangement did not directly disrupt any gene or predicted regulatory element and did not alter the epigenetic status of the surrounding region, consistent with the absence of phenotypic consequences in the carrier. This case demonstrates a likely rare but new class of structural variation that we name ‘alpha satellite insertion’. It also expands our knowledge about the evolutionary life cycle of centromeres, conveying the possibility that alphoid arrays can relocate near vestigial centromeric sites.


2020 ◽  
Author(s):  
Mohammed O.E Abdallah ◽  
Mahmoud Koko ◽  
Raj Ramesar

Background: The GRCh37 human genome assembly is still widely used in genomics despite the fact that an updated human genome assembly (GRCh38) has been available for many years. A particular issue with relevant ramifications for clinical genetics is the case of the GRCh37 Ensembl gene annotations, which have been archived, and thus not updated, since 2013. These Ensembl GRCh37 gene annotations are just as ubiquitous as the former assembly and are the default gene models used and preferred by the majority of genomic projects internationally. In this study, we highlight the issue of genes with discrepant annotations that have been recognized as protein coding in the new but not the old assembly. These genes are ignored by all genomic resources that still rely on the archived and outdated gene annotations. Moreover, the majority if not all of these discrepant genes (DGs) are automatically discarded and ignored by all variant prioritization tools that rely on the GRCh37 Ensembl gene annotations. Methods: We performed bioinformatics analysis identifying Ensembl genes with discrepant annotations between the two most recent human genome assemblies, GRCh37 and GRCh38. Clinical and phenotype gene curations were obtained and compared for this gene set. Furthermore, matching RefSeq transcripts were collated and analyzed. Results: We found hundreds of genes (N=267) that were reclassified as “protein-coding” in the new hg38 assembly. Notably, 169 of these genes also had a discrepant HGNC gene symbol between the two assemblies. Most genes had RefSeq matches (N=199/267), including all the genes with defined phenotypes in the Ensembl GRCh38 gene annotations (N=10). However, many protein-coding genes remain missing from the current known RefSeq gene models (N=68). Conclusion: We found many clinically relevant genes in this group of neglected genes and we anticipate that many more will be found relevant in the future. For these genes, the inaccurate label of “non-protein-coding” hinders the possibility of identifying any causal sequence variants that overlap them. In addition, important annotations such as evolutionary constraint metrics are not calculated for these genes for the same reason, further relegating them into oblivion.


Author(s):  
David Porubsky ◽  
Peter Ebert ◽  
Peter A. Audano ◽  
Mitchell R. Vollger ◽  
...  

Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing with continuous long-read or high-fidelity sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.
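
For readers unfamiliar with the contiguity metric quoted here, the following sketch computes a contig N50 using the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the example lengths are illustrative and are not taken from the paper.

```python
def n50(lengths: list[int]) -> int:
    """Contig N50: length at which the longest contigs first cover half the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up contig lengths
    print(n50(contigs))
```

A contig N50 above 23 Mbp therefore means that at least half of the assembled bases sit in contigs of 23 Mbp or longer.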

