Chromosome-scale genome assembly of Eustoma grandiflorum, the first complete genome sequence in family Gentianaceae

Abstract Eustoma grandiflorum (Raf.) Shinn. is an annual herbaceous plant native to the southern United States, Mexico, and the Greater Antilles. It has large flowers in a variety of colors and is an important flower crop. In this study, we established a chromosome-scale de novo assembly of E. grandiflorum by integrating four genomic and genetic approaches: (1) Pacific Biosciences (PacBio) Sequel deep sequencing, (2) error correction of the assembly by Illumina short reads, (3) scaffolding by chromatin conformation capture sequencing (Hi-C), and (4) genetic linkage maps derived from an F2 mapping population. A total of 36 pseudomolecules and 64 unplaced scaffolds were created, with a total length of 1,324.8 Mb. Full-length transcript sequences were obtained by PacBio Iso-Seq sequencing for gene prediction on the assembled genome, Egra_v1. A total of 36,619 genes were predicted on the genome as high-confidence (HC) genes. Of these 36,619 genes, 25,936 were assigned functional annotations by ZenAnnotation. Genetic diversity analysis was also performed for nine commercial E. grandiflorum varieties bred in Japan, and 254,205 variants were identified. This is the first report of the construction of reference genome sequences in E. grandiflorum as well as in the family Gentianaceae.

Genome and transcriptome analysis of the latent pathogen Lasiodiplodia theobromae, an emerging threat to the cacao industry

Genome ◽

10.1139/gen-2019-0112 ◽

2020 ◽

Vol 63 (1) ◽

pp. 37-52 ◽

Cited By ~ 1

Author(s):

Shahin S. Ali ◽

Asman Asman ◽

Jonathan Shao ◽

Johnny F. Balidion ◽

Mary D. Strem ◽

...

Keyword(s):

Transcriptome Analysis ◽

Woody Plants ◽

De Novo ◽

Gene Prediction ◽

In Planta ◽

Protein Coding ◽

Lasiodiplodia Theobromae ◽

Diverse Species ◽

The Family ◽

The World

Lasiodiplodia theobromae (Pat.) Griffon & Maubl., a member of the family Botryosphaeriaceae, is becoming a significant threat to crops and woody plants in many parts of the world, including the major cacao growing areas. While attempting to isolate Ceratobasidium theobromae, a causal agent of vascular streak dieback (VSD), from symptomatic cacao stems, 74% of isolated fungi were Lasiodiplodia spp. Sequence-based identification of 52 putative isolates of L. theobromae indicated that diverse species of Lasiodiplodia were associated with cacao in the studied areas, and the isolates showed variation in aggressiveness when assayed using cacao leaf discs. The present study reports a 43.75 Mb de novo assembled genome of an isolate of L. theobromae from cacao. Ab initio gene prediction generated 13 061 protein-coding genes, of which 2862 are unique to L. theobromae, when compared with other closely related Botryosphaeriaceae. Transcriptome analysis revealed that 11 860 predicted genes were transcriptionally active and 1255 were more highly expressed in planta compared with cultured mycelia. The predicted genes differentially expressed during infection were mainly those involved in carbohydrate, pectin, and lignin catabolism, cytochrome P450, necrosis-inducing proteins, and putative effectors. These findings significantly expand our knowledge of the genome of L. theobromae and the genes involved in virulence and pathogenicity.

Chromosomal-level assembly of the blood clam, Scapharca (Anadara) broughtonii, using long sequence reads and Hi-C

GigaScience ◽

10.1093/gigascience/giz067 ◽

2019 ◽

Vol 8 (7) ◽

Cited By ~ 11

Author(s):

Chang-Ming Bai ◽

Lu-Sheng Xin ◽

Umberto Rosani ◽

Biao Wu ◽

Qing-Chen Wang ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Marine Bivalve ◽

Protein Coding ◽

Long Reads ◽

Oxford Nanopore ◽

The Pacific ◽

The Family ◽

Long Read ◽

Blood Clam

Abstract. Background: The blood clam, Scapharca (Anadara) broughtonii, is an economically and ecologically important marine bivalve of the family Arcidae. Efforts to study their population genetics, breeding, cultivation, and stock enrichment have been somewhat hindered by the lack of a reference genome. Herein, we report the complete genome sequence of S. broughtonii, a first reference genome of the family Arcidae. Findings: A total of 75.79 Gb clean data were generated with the Pacific Biosciences and Oxford Nanopore platforms, which represented approximately 86× coverage of the S. broughtonii genome. De novo assembly of these long reads resulted in an 884.5-Mb genome, with a contig N50 of 1.80 Mb and scaffold N50 of 45.00 Mb. Genome Hi-C scaffolding resulted in 19 chromosomes containing 99.35% of bases in the assembled genome. Genome annotation revealed that nearly half of the genome (46.1%) is composed of repeated sequences, while 24,045 protein-coding genes were predicted and 84.7% of them were annotated. Conclusions: We report here a chromosomal-level assembly of the S. broughtonii genome based on long-read sequencing and Hi-C scaffolding. The genomic data can serve as a reference for the family Arcidae and will provide a valuable resource for the scientific community and aquaculture sector.
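The coverage figure quoted in this abstract can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the function name `coverage_depth` is hypothetical, and using the 884.5-Mb assembly length as the denominator is an assumption (the authors may have used an independently estimated genome size).

```python
def coverage_depth(total_bases: float, genome_size: float) -> float:
    """Return average sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Figures taken from the abstract; the choice of denominator is an assumption.
clean_data_bp = 75.79e9   # 75.79 Gb of clean long-read data
assembly_bp = 884.5e6     # 884.5-Mb assembled genome

depth = coverage_depth(clean_data_bp, assembly_bp)
print(f"Estimated depth: {depth:.1f}x")
```

Under that assumption the result rounds to the "approximately 86×" reported in the abstract.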

De novo assembly of Trachidermus fasciatus genome by nanopore sequencing

10.1101/2020.04.18.042093 ◽

2020 ◽

Author(s):

Gangcai Xie ◽

Xu Zhang ◽

Feng Lv ◽

Mengmeng Sang ◽

Hairong Hu ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Gene Prediction ◽

Protein Coding ◽

De Novo Gene ◽

Long Reads ◽

Resource Protection ◽

Trachidermus Fasciatus ◽

Roughskin Sculpin ◽

High Quality Genome

Abstract Trachidermus fasciatus is a roughskin sculpin fish widely distributed in the coastal areas of East Asia. Due to environmental destruction and overfishing, the populations of this species have been under threat. It is important to have a reference genome to study population genetics, domestic farming, and genetic resource protection. However, there is currently no reference genome for Trachidermus fasciatus, which has greatly hindered studies of this species. In this study, we integrated nanopore long-read sequencing, Illumina short-read sequencing, and Hi-C methods to assemble the genome of Trachidermus fasciatus de novo. Our results provided a chromosome-level, high-quality genome assembly with a total length of about 543 Mb and an N50 of 23 Mb. Based on de novo gene prediction and RNA sequencing information, a total of 38,728 genes were found, including 23,191 protein-coding genes, 2,149 small RNAs, 5,572 rRNAs, and 7,816 tRNAs. In addition, about 23% of the genome is covered by repetitive elements. Furthermore, the BUSCO evaluation of the completeness of the assembled genome is more than 96%, and the single-base accuracy is 99.997%. Our study provided the first whole-genome reference for Trachidermus fasciatus, which might greatly facilitate future studies of this species.

The Lithuanian reference genome LT1 - a human de novo genome assembly with short and long read sequence and Hi-C data

10.1101/2021.04.05.438426 ◽

2021 ◽

Author(s):

Hui-Su Kim ◽

Asta Blazyte ◽

Sungwon Jeon ◽

Changhan Yoon ◽

Yeonkyung Kim ◽

...

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Gene Prediction ◽

Chromosomal Mapping ◽

Autosomal Snps ◽

Contig Assembly ◽

De Novo Genome Assembly ◽

Human Reference Genome ◽

The Baltic

We present LT1, the first high-quality human reference genome from the Baltic States. LT1 is a female de novo human reference genome assembly constructed using 57× of ultra-long nanopore reads and 47× of short paired-end reads. We also utilized 72 Gb of Hi-C chromosomal mapping data to maximize the assembly's contiguity and accuracy. LT1's contig assembly was 2.73 Gbp in length, comprising 4,490 contigs with an N50 value of 13.4 Mbp. After scaffolding with Hi-C data and extensive manual curation, we produced a chromosome-scale assembly with an N50 value of 138 Mbp and 4,699 scaffolds. Our gene prediction quality assessment using BUSCO identified 89.3% of the single-copy orthologous genes included in the benchmarking set. Detailed characterization of LT1 suggested it has 73,744 predicted transcripts, 4.2 million autosomal SNPs, 974,000 short indels, and 12,330 large structural variants. These data are shared as a public resource without any restrictions and can be used as a benchmark for further in-depth genomic analyses of the Baltic populations.
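Several abstracts in this listing report an N50 value as a contiguity measure. As a minimal sketch of the standard definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length), the following may help; the function name `n50` and the toy lengths are hypothetical and are not data from any of the listed studies.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from longest to shortest until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-negative")


# Toy example with made-up contig lengths (total = 300, half = 150).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half, so this prints 70
```

A larger N50 for the same total length indicates that more of the assembly sits in long sequences, which is why scaffolding with Hi-C data typically raises the value.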

Whole-Genome Sequencing and Characterization of Buffalo Genetic Resources: Recent Advances and Future Challenges

Animals ◽

10.3390/ani11030904 ◽

2021 ◽

Vol 11 (3) ◽

pp. 904

Author(s):

Saif ur Rehman ◽

Faiz-ul Hassan ◽

Xier Luo ◽

Zhipeng Li ◽

Qingyou Liu

Keyword(s):

Selective Breeding ◽

Reference Genome ◽

De Novo ◽

Phenotypic Diversity ◽

Molecular Data ◽

Genomic Diversity ◽

Production Performance ◽

Phylogeographic Structure ◽

Economic Significance ◽

And Performance

The buffalo was domesticated around 3000–6000 years ago and has substantial economic significance as a meat, dairy, and draught animal. The buffalo has remained underutilized in terms of the development of a well-annotated and well-assembled de novo reference genome. It is essential to explore the genetic architecture of a species to understand the biology that helps to manage its genetic variability, which is ultimately used for selective breeding and genomic selection. Morphological and molecular data have revealed that the swamp buffalo population has strong geographical genomic diversity with low gene flow but strong phenotypic consistency, while the river buffalo population has higher phenotypic diversity with a weak phylogeographic structure. The recent availability of a high-quality reference genome and genotyping marker panels has invigorated many genome-based studies on evolutionary history, genetic diversity, functional elements, and performance traits. Increasing molecular knowledge, combined with selective breeding, should pave the way for genetic improvement in the climatic resilience, disease resistance, and production performance of water buffalo populations globally.

In Search of Species-Specific SNPs in a Non-Model Animal (European Bison (Bison bonasus))—Comparison of De Novo and Reference-Based Integrated Pipeline of STACKS Using Genotyping-by-Sequencing (GBS) Data

Animals ◽

10.3390/ani11082226 ◽

2021 ◽

Vol 11 (8) ◽

pp. 2226

Author(s):

Sazia Kunvar ◽

Sylwia Czarnomska ◽

Cino Pertoldi ◽

Małgorzata Tokarska

Keyword(s):

Reference Genome ◽

De Novo ◽

Bos Taurus ◽

Model Organism ◽

Genotyping By Sequencing ◽

Model Organisms ◽

European Bison ◽

Model Animal ◽

Pcr Duplicates ◽

Species Specific

The European bison is a non-model organism; thus, most of its genetic and genomic analyses have been performed using cattle-specific resources, such as BovineSNP50 BeadChip or Illumina Bovine 800 K HD Bead Chip. The problem with non-specific tools is the potential loss of evolutionarily diversified information (ascertainment bias) and species-specific markers. Here, we have used a genotyping-by-sequencing (GBS) approach for genotyping 256 samples from the European bison population in Bialowieza Forest (Poland) and performed an analysis using two integrated pipelines of the STACKS software: one is de novo (without a reference genome) and the other is a reference pipeline (with a reference genome). Moreover, we used the reference pipeline with two different genomes, i.e., Bos taurus and European bison. GBS is a useful tool for SNP genotyping in non-model organisms due to its cost effectiveness. Our results support GBS with a reference pipeline without PCR duplicates as a powerful approach for studying the population structure and genotyping data of non-model organisms. We found more polymorphic markers in the reference pipeline than in the de novo pipeline. The decreased number of SNPs from the de novo pipeline could be due to the extremely low level of heterozygosity in European bison. It has been confirmed that all SNPs obtained from the de novo/Bos taurus and Bos taurus reference pipelines were unique and not included in the 800 K BovineHD BeadChip.

De Novo SNP Discovery and Genotyping of Iranian Pimpinella Species Using ddRAD Sequencing

Agronomy ◽

10.3390/agronomy11071342 ◽

2021 ◽

Vol 11 (7) ◽

pp. 1342

Author(s):

Shaghayegh Mehravi ◽

Gholam Ali Ranjbar ◽

Ghader Mirzaghaderi ◽

Anita Alice Severn-Ellis ◽

Armin Scheben ◽

...

Keyword(s):

De Novo ◽

Genetic Relationships ◽

Nucleotide Polymorphisms ◽

High Quality ◽

Genomic Resources ◽

High Quality Snps ◽

The Family ◽

Double Digestion ◽

Flanking Sequences ◽

Downstream Analysis

The species of Pimpinella, one of the largest genera of the family Apiaceae, are traditionally cultivated for medicinal purposes. In this study, high-throughput double digest restriction-site associated DNA sequencing technology (ddRAD-seq) was used to identify single nucleotide polymorphisms (SNPs) in eight Pimpinella species from Iran. After double-digestion with the enzymes HpyCH4IV and HinfI, a total of 334,702,966 paired-end reads were de novo assembled into 1,270,791 loci with an average of 28.8 reads per locus. After stringent filtering, 2440 high-quality SNPs were identified for downstream analysis. Analysis of genetic relationships and population structure, based on these retained SNPs, indicated the presence of three major groups. Gene ontology and pathway analyses were performed by comparing SNP-associated flanking sequences with a public non-redundant database. Due to the lack of genomic resources in this genus, our present study is the first report to provide high-quality SNPs in Pimpinella based on a de novo analysis pipeline using ddRAD-seq. These data will enhance the molecular knowledge of the genus Pimpinella and will provide an important source of information for breeders and the research community to enhance breeding programs and support the management of Pimpinella genomic resources.

Transcriptomic and Proteomic Analyses Reveal the Diversity of Venom Components from the Vaejovid Scorpion Serradigitus gertschi

Toxins ◽

10.3390/toxins10090359 ◽

2018 ◽

Vol 10 (9) ◽

pp. 359 ◽

Cited By ~ 14

Author(s):

Maria Romero-Gutiérrez ◽

Carlos Santibáñez-López ◽

Juana Jiménez-Vargas ◽

Cesar Batista ◽

Ernesto Ortiz ◽

...

Keyword(s):

Serine Proteases ◽

De Novo ◽

Sequence Similarity ◽

Scorpion Venom ◽

Secretory Proteins ◽

Host Defense Peptides ◽

The Family ◽

Pathogenesis Related ◽

Molecular Masses

To understand the diversity of scorpion venom, RNA from venomous glands from a sawfinger scorpion, Serradigitus gertschi, of the family Vaejovidae, was extracted and used for transcriptomic analysis. A total of 84,835 transcripts were assembled after Illumina sequencing. From those, 119 transcripts were annotated and found to putatively code for peptides or proteins that share sequence similarities with the previously reported venom components of other species. In accordance with sequence similarity, the transcripts were classified as potentially coding for 37 ion channel toxins; 17 host defense peptides; 28 enzymes, including phospholipases, hyaluronidases, metalloproteases, and serine proteases; nine protease inhibitor-like peptides; 10 peptides of the cysteine-rich secretory proteins, antigen 5, and pathogenesis-related 1 protein superfamily; seven La1-like peptides; and 11 sequences classified as “other venom components”. A mass fingerprint performed by mass spectrometry identified 204 components with molecular masses varying from 444.26 Da to 12,432.80 Da, plus several higher molecular weight proteins whose precise masses were not determined. The LC-MS/MS analysis of a tryptic digestion of the soluble venom resulted in the de novo determination of 16,840 peptide sequences, 24 of which matched sequences predicted from the translated transcriptome. The database presented here increases our general knowledge of the biodiversity of venom components from neglected non-buthid scorpions.

AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract. Summary: Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity; however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation: AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information: Supplementary data are available at Bioinformatics online.

RECORD: Reference-Assisted Genome Assembly for Closely Related Genomes

International Journal of Genomics ◽

10.1155/2015/563482 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Krisztian Buza ◽

Bartek Wilczynski ◽

Norbert Dojer

Keyword(s):

Reference Genome ◽

De Novo ◽

Real Data ◽

Reference Sequence ◽

Individual Genome ◽

Single Experiment ◽

Sequencing Technologies ◽

Sequencing Cost ◽

The Individual ◽

Assembly Software

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation, called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.
