contig assembly
Recently Published Documents



2021 ◽  
Author(s):  
Ilona Ewa Grabowicz ◽  
Julia Herman-Iżycka ◽  
Marta Fructuoso ◽  
Mara Dierssen ◽  
Bartek Wilczynski

Methods designed for metatranscriptomic studies are still rare and under development. In this paper we present a new analytical pipeline combining contig assembly, gene selection and functional annotation. This pipeline allowed us to reconstruct contigs with very high unique mappability (83%) and to select sequences encoding putative bacterial genes that also reached a high unique mappability (66%) of the NGS reads. We then applied our pipeline to the faecal metatranscriptome of a Down syndrome (DS) mouse model, the Ts65Dn mouse, in order to identify differentially expressed transcripts. Recent studies have implicated dysbiosis of gut microbiota in several central nervous system (CNS) disorders, including DS. Given that DS individuals have an increased prevalence of obesity, we also studied the effects of a high-fat diet (HFD) on transcriptomic changes in mouse gut microbiomes, as the complex symbiotic relationship between the gut microbiome and its host is strongly influenced by diet and nutrition. Using our new pipeline we found that, compared to wild type (WT), Ts65Dn mice showed elevated expression of genes involved in hypoxanthine metabolism, which contributes to oxidative stress, and down-regulated expression of genes involved in interactions with host epithelial cells and virulence. Microbiomes of mice fed HFD showed significantly higher expression of genes involved in membrane lipopolysaccharide and lipid biosynthesis, and decreased expression of osmoprotection and lysine fermentation genes, among others. We also found evidence that the mouse microbiota is capable of expressing genes encoding neuromodulators, which may play a role in the development of compulsive overeating and obesity.
Our results show a DS-specific metatranscriptome profile and show that a high-fat diet affects the metabolism of the mouse gut microbiome by changing the activity of genes involved in lipid, sugar, protein and amino acid metabolism and cell membrane turnover. Our new analytical pipeline combining contig assembly, gene selection and functional annotation provides new insights for metatranscriptomic studies.


2021 ◽  
Author(s):  
Hui-Su Kim ◽  
Changjae Kim ◽  
George McDonald Church ◽  
Jong Bhak

PGP1 is the first participant of the Personal Genome Project. We present PGP1's chromosome-scale genome assembly. It was constructed using 255 Gb of ultra-long PromethION reads and 97 Gb of short paired-end reads. To reduce base-calling errors, we corrected the PromethION reads using 72 Gb of PacBio HiFi reads. 327 Gb of Hi-C chromosomal mapping data were used to maximize the assembly's contiguity. PGP1's contig assembly was 3.01 Gb in length, comprising 4,234 contigs with an N50 value of 33.8 Mb. After scaffolding with Hi-C data and extensive manual curation, we obtained a chromosome-scale assembly comprising 3,880 scaffolds with an N50 value of 142 Mb. In the Merqury assessment, the PGP1 assembly achieved a high QV score of Q45.45. For gene annotation, we predicted 106,789 genes using a liftover from Gencode 38 and an assembly of transcriptome data.
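
A consensus quality value (QV) such as the one reported above is conventionally expressed on the Phred scale, so it can be converted into an approximate per-base error probability. The snippet below is a minimal sketch of that conversion; the function names are illustrative and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # Q45.45 is the value quoted in the abstract above.
    print(f"{qv_to_error_rate(45.45):.2e}")
```

Under this convention Q45.45 corresponds to roughly three errors per 100,000 consensus bases.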


2021 ◽  
Author(s):  
Tom Davot ◽  
Annie Chateau ◽  
Rohan Fossé ◽  
Rodolphe Giroudeau ◽  
Mathias Weller

Background: Scaffolding is a bioinformatics problem aimed at completing the contig assembly process by determining the relative position and orientation of these contigs. It can be seen as a paths and cycles cover problem of a particular graph called the “scaffold graph”.
Results: We provide some NP-hardness and inapproximability results on this problem. We also adapt a greedy approximation algorithm on complete graphs so that it works on a special class aiming to be close to real instances. The described algorithm is the first polynomial-time approximation algorithm designed for this problem on non-complete graphs.
Conclusion: Tests on a set of simulated instances show that our algorithm provides better results than the version on complete graphs.
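
To make the idea of a greedy cover on a scaffold graph concrete, here is a generic textbook-style heuristic. It is not the authors' algorithm: it builds paths only (no cycles), comes with no approximation guarantee, and the data layout (contig extremities named "head" and "tail", links given as weighted pairs) is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Extremity = Tuple[str, str]            # (contig name, "head" or "tail")
Link = Tuple[float, Extremity, Extremity]


def greedy_scaffold(contigs: List[str], links: List[Link]) -> List[Link]:
    """Pick inter-contig links by decreasing weight.

    A link is accepted only if both extremities are still free and it does
    not join two contigs that already belong to the same path.
    """
    parent: Dict[str, str] = {c: c for c in contigs}

    def find(c: str) -> str:
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    used = set()
    chosen: List[Link] = []
    for weight, a, b in sorted(links, key=lambda link: link[0], reverse=True):
        if a in used or b in used:
            continue                   # each extremity joins at most one link
        root_a, root_b = find(a[0]), find(b[0])
        if root_a == root_b:
            continue                   # would close a cycle or self-link
        parent[root_a] = root_b
        used.update((a, b))
        chosen.append((weight, a, b))
    return chosen


if __name__ == "__main__":
    demo_links = [
        (10.0, ("A", "tail"), ("B", "head")),
        (8.0, ("B", "tail"), ("C", "head")),
        (7.0, ("A", "tail"), ("C", "head")),
        (5.0, ("C", "tail"), ("A", "head")),
    ]
    for link in greedy_scaffold(["A", "B", "C"], demo_links):
        print(link)
```

In the demo, the two heaviest links are kept, the third is rejected because the tail of A is already used, and the fourth is rejected because it would close the path A-B-C into a cycle.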


2021 ◽  
Author(s):  
Hui-Su Kim ◽  
Asta Blazyte ◽  
Sungwon Jeon ◽  
Changhan Yoon ◽  
Yeonkyung Kim ◽  
...  

We present LT1, the first high-quality human reference genome from the Baltic States. LT1 is a female de novo human reference genome assembly constructed using 57× of ultra-long nanopore reads and 47× of short paired-end reads. We also utilized 72 Gb of Hi-C chromosomal mapping data to maximize the assembly's contiguity and accuracy. LT1's contig assembly was 2.73 Gbp in length, comprising 4,490 contigs with an N50 value of 13.4 Mbp. After scaffolding with Hi-C data and extensive manual curation, we produced a chromosome-scale assembly with an N50 value of 138 Mbp and 4,699 scaffolds. Our gene prediction quality assessment using BUSCO identified 89.3% of the single-copy orthologous genes included in the benchmarking set. Detailed characterization of LT1 suggested it has 73,744 predicted transcripts, 4.2 million autosomal SNPs, 974,000 short indels, and 12,330 large structural variants. These data are shared as a public resource without any restrictions and can be used as a benchmark for further in-depth genomic analyses of the Baltic populations.
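
The N50 values quoted for contigs and scaffolds follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation, with a hypothetical function name and toy lengths rather than data from the paper:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to >= 50%.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy lengths, not taken from the LT1 assembly.
    print(n50([2, 3, 4, 5, 6]))
```

Because scaffolding joins contigs into longer sequences, the scaffold N50 of an assembly is normally much larger than its contig N50, as in the figures above.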


2021 ◽  
Author(s):  
Lisa M. Pinatti ◽  
Wenjin Gu ◽  
Yifan Wang ◽  
Ahmed El Hossiny ◽  
Apurva D. Bhangale ◽  
...  

Background: Human papillomavirus (HPV) is a well-established driver of malignant transformation in a number of sites including head and neck, cervical, vulvar, anorectal and penile squamous cell carcinomas; however, the impact of HPV integration into the host human genome on this process remains largely unresolved. This is due to the technical challenge of identifying HPV integration sites, which includes limitations of existing informatics approaches to discover viral-host breakpoints from low read coverage sequencing data.
Methods: To overcome this limitation, we developed a new HPV detection pipeline called SearcHPV based on targeted capture technology and applied the algorithm to targeted capture data. We performed an integrated analysis of SearcHPV-defined breakpoints with genome-wide linked read sequencing to identify potential HPV-related structural variations.
Results: Through analysis of HPV+ models, we show that SearcHPV detects HPV-host integration sites with a higher sensitivity and specificity than two other commonly used HPV detection callers. SearcHPV uncovered HPV integration sites adjacent to known cancer-related genes including TP63 and MYC, as well as near regions of large structural variation. We further validated the junction contig assembly feature of SearcHPV, which helped to accurately identify viral-host junction breakpoint sequences. We found that viral integration occurred through a variety of DNA repair mechanisms including non-homologous end joining, alternative end joining and microhomology mediated repair.
Conclusions: In summary, we show that SearcHPV is a new optimized tool for the accurate detection of HPV-human integration sites from targeted capture DNA sequencing data.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e10420
Author(s):  
Jacopo D’Ercole ◽  
Sean W.J. Prosser ◽  
Paul D.N. Hebert

Natural history collections are a valuable resource for molecular taxonomic studies and for examining patterns of evolutionary diversification, particularly in the case of rare or extinct species. However, the recovery of sequence information is often complicated by DNA degradation. This article describes use of the Sequel platform (Pacific Biosciences) to recover the 658 bp barcode region of the mitochondrial cytochrome c oxidase I (COI) gene from 380 butterflies with an average age of 50 years. Nested multiplex PCR was employed for library preparation to facilitate sequence recovery from extracts with low concentrations of highly degraded DNA. By employing circular consensus sequencing (CCS) of short amplicons (circa 150 bp), full-length barcodes could be assembled without a reference sequence, an important advance from earlier protocols which required reference sequences to guide contig assembly. The Sequel protocol recovered COI sequences (499 bp on average) from 318 of 380 specimens (84%), much higher than for Sanger sequencing (26%). Because each read derives from a single molecule, it was also possible to quantify the incidence of substitutions arising from DNA damage. In agreement with past work on sequence changes induced by DNA degradation, the transition C/G → T/A was the most prevalent category of change, but its rate of occurrence (4.58E−4) was so low that it did not impede the recovery of reliable sequences. Because the current protocol recovers COI sequence from most museum specimens, and because sequence fidelity is unaffected by nucleotide misincorporations, large-scale sequence characterization of museum specimens is feasible.


Author(s):  
Ann McCartney ◽  
Elena Hilario ◽  
Seung-Sub Choi ◽  
Joseph Guhlin ◽  
Jessie Prebble ◽  
...  

We used long read sequencing data generated from Knightia excelsa R.Br., a nectar-producing Proteaceae tree endemic to Aotearoa New Zealand, to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome construction. Establishing a high-quality genome for this species has specific cultural importance to Māori, the indigenous people, as well as commercial importance to honey producers in Aotearoa New Zealand. Assemblies were produced by five long read assemblers using data subsampled based on read lengths, two polishing strategies, and two Hi-C mapping methods. Our results from subsampling the data by read length showed that each assembler tested performed differently depending on the coverage and the read length of the data. Assemblies that used longer read lengths (>30 kb) and lower coverage were the most contiguous and the most complete in terms of k-mers and genes. The final genome assembly was constructed into pseudo-chromosomes using all available data assembled with FLYE, polished using Racon/Medaka/Pilon combined, scaffolded using SALSA2 and AllHiC, curated using Juicebox, and validated by synteny with Macadamia. We highlighted the importance of developing assembly workflows based on the volume and type of sequencing data and establishing a set of robust quality metrics for generating high-quality assemblies. Scaffolding analyses highlighted that problems found in the initial assemblies could not be resolved accurately by utilizing Hi-C data and that scaffolded assemblies were more accurate when the underlying contig assembly was of higher accuracy. These findings provide insight into what is required for future high-quality de novo assemblies of non-model organisms.


2020 ◽  
Author(s):  
Ann McCartney ◽  
Elena Hilario ◽  
Seung-Sub Choi ◽  
Joseph Guhlin ◽  
Jessica M. Prebble ◽  
...  

Background: We used long read sequencing data generated from Knightia excelsa R.Br., a nectar-producing Proteaceae tree endemic to Aotearoa New Zealand, to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome construction. Establishing a high-quality genome for this species has specific cultural importance to Māori, the indigenous people, as well as commercial importance to honey producers in Aotearoa New Zealand.
Results: Assemblies were produced by five long read assemblers using data subsampled based on read lengths, two polishing strategies, and two Hi-C mapping methods. Our results from subsampling the data by read length showed that each assembler tested performed differently depending on the coverage and the read length of the data. Assemblies that used longer read lengths (>30 kb) and lower coverage were the most contiguous and the most complete in terms of k-mers and genes. The final genome assembly was constructed into pseudochromosomes using all available data assembled with FLYE, polished using Racon/Medaka/Pilon combined, scaffolded using SALSA2 and AllHiC, curated using Juicebox, and validated by synteny with Macadamia.
Conclusions: We highlighted the importance of developing assembly workflows based on the volume and type of sequencing data and establishing a set of robust quality metrics for generating high-quality assemblies. Scaffolding analyses highlighted that problems found in the initial assemblies could not be resolved accurately by utilizing Hi-C data and that scaffolded assemblies were more accurate when the underlying contig assembly was of higher accuracy. These findings provide insight into what is required for future high-quality de novo assemblies of non-model organisms.


2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Cervin Guyomar ◽  
Wesley Delage ◽  
Fabrice Legeai ◽  
Christophe Mougel ◽  
Jean-Christophe Simon ◽  
...  

Most metazoans are associated with symbionts. Characterizing the effect of a particular symbiont often requires getting access to its genome, which is usually done by sequencing the whole community. We present MinYS, a targeted assembly approach to assemble a particular genome of interest from such metagenomic data. First, taking advantage of a reference genome, a subset of the reads is assembled into a set of backbone contigs. Then, this draft assembly is completed using the whole metagenomic readset in a de novo manner. The resulting assembly is output as a genome graph, enabling different strains with potential structural variants coexisting in the sample to be distinguished. MinYS was applied to 50 pea aphid resequencing samples, with variable diversity in symbiont communities, in order to recover the genome sequence of its obligatory bacterial symbiont, Buchnera aphidicola. It was able to return high-quality assemblies (one contig assembly in 90% of the samples), even when using increasingly distant reference genomes, and to retrieve large structural variations in the samples. Because of its targeted essence, it outperformed standard metagenomic assemblers in terms of both time and assembly quality.


2019 ◽  
Author(s):  
Cervin Guyomar ◽  
Wesley Delage ◽  
Fabrice Legeai ◽  
Christophe Mougel ◽  
Jean-Christophe Simon ◽  
...  

Most metazoans are associated with symbionts. Characterizing the effect of a particular symbiont often requires access to its genome, which is usually obtained by sequencing the whole community. We present MinYS, a targeted assembly approach to assemble one particular genome of interest from such metagenomic data. First, taking advantage of a reference genome, a subset of the reads is assembled into a set of backbone contigs. Then, this draft assembly is completed using the whole metagenomic readset in a de novo manner. The resulting assembly is output as a genome graph, allowing different strains with potential structural variants coexisting in the sample to be distinguished. MinYS was applied to 50 pea aphid re-sequencing samples, with low and high diversity, in order to recover the genome sequence of its obligatory bacterial symbiont, Buchnera aphidicola. It was able to return high-quality assemblies (one contig assembly in 90% of the samples), even when using increasingly distant reference genomes, and to retrieve large structural variations in the samples. Due to its targeted nature, it outperformed standard metagenomic assemblers in terms of both time and assembly quality.

