Recommendations for the FAIRification of genomic track metadata

F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 268
Author(s):  
Sveinung Gundersen ◽  
Sanjay Boddu ◽  
Salvador Capella-Gutierrez ◽  
Finn Drabløs ◽  
José M. Fernández ◽  
...  

Background: Many types of data from genomic analyses can be represented as genomic tracks, i.e. features linked to the genomic coordinates of a reference genome. Examples of such data are epigenetic DNA methylation data, ChIP-seq peaks, germline or somatic DNA variants, as well as RNA-seq expression levels. Researchers often face difficulties in locating, accessing and combining relevant tracks from external sources, as well as locating the raw data, reducing the value of the generated information. Description of work: We propose to advance the application of FAIR data principles (Findable, Accessible, Interoperable, and Reusable) to produce searchable metadata for genomic tracks. Findability and Accessibility of metadata can then be ensured by a track search service that integrates globally identifiable metadata from various track hubs in the Track Hub Registry and other relevant repositories. Interoperability and Reusability need to be ensured by the specification and implementation of a basic set of recommendations for metadata. We have tested this concept by developing such a specification in a JSON Schema, called FAIRtracks, and have integrated it into a novel track search service, called TrackFind. We demonstrate practical usage by importing datasets through TrackFind into existing examples of relevant analytical tools for genomic tracks: EPICO and the GSuite HyperBrowser. Conclusion: We here provide a first iteration of a draft standard for genomic track metadata, as well as the accompanying software ecosystem. It can easily be adapted or extended to future needs of the research community regarding data, methods and tools, balancing the requirements of both data submitters and analytical end-users.

BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Shuhua Zhan ◽  
Cortland Griswold ◽  
Lewis Lukens

Abstract Background: Genetic variation for gene expression is a source of phenotypic variation for natural and agricultural species. The common approach to map and to quantify gene expression from genetically distinct individuals is to assign their RNA-seq reads to a single reference genome. However, RNA-seq reads from alleles dissimilar to this reference genome may fail to map correctly, causing transcript levels to be underestimated. Presently, the extent of this mapping problem is not clear, particularly in highly diverse species. We investigated whether mapping bias occurred and whether chromosomal features were associated with mapping bias. Zea mays is a suitable model species for assessing these questions, given that it has genotypically distinct and well-studied genetic lines. Results: In Zea mays, the inbred B73 genome is the standard reference genome and template for RNA-seq read assignments. In the absence of mapping bias, B73 and a second inbred line, Mo17, would each have an approximately equal number of regulatory alleles that increase gene expression. Remarkably, Mo17 had 2–4 times fewer such positively acting alleles than did B73 when RNA-seq reads were aligned to the B73 reference genome. Reciprocally, over one-half of the B73 alleles that increased gene expression were not detected when reads were aligned to the Mo17 genome template. Genes at dissimilar chromosomal ends were strongly affected by mapping bias, and genes at more similar pericentromeric regions were less affected. Biased transcript estimates were higher in untranslated regions and lower in splice junctions. Bias occurred across software and alignment parameters. Conclusions: Mapping bias very strongly affects gene transcript abundance estimates in maize, and bias varies across chromosomal features. Individual genome or transcriptome templates are likely necessary for accurate transcript estimation across genetically variable individuals in maize and other species.


2021 ◽  
Vol 8 ◽  
Author(s):  
Hanhan Yao ◽  
Zhihua Lin ◽  
Yinghui Dong ◽  
Xianghui Kong ◽  
Lin He ◽  
...  

The razor clam, Sinonovacula constricta, is a commercially important bivalve in the western Pacific Ocean, yet little is known about the mechanisms of sex determination/differentiation and gametogenesis. In the present study, a comparative transcriptome analysis of adult gonads (female and male) was conducted to identify potential sex-related genes in S. constricta. The number of reads generated for each target library (three females and three males) ranged from 31,853,422 to 37,750,848, of which 20,489,472 to 26,152,448 could be mapped to the reference genome of S. constricta (mapping percentages ranging from 63.71 to 71.48%). A total of 8,497 genes were identified as differentially expressed between the female and male gonads, of which 4,253 were female-biased (upregulated in females) and 4,244 were male-biased. Forty-five genes were identified as potential sex-related genes, including DmrtA2, Sox9, Fem-1b, and Fem-1c, involved in sex determination/differentiation, and Vg, CYP17A1, SOHLH2, and TSSK, involved in gametogenesis. The expression profiles of 12 genes were validated by qRT-PCR, which further confirmed the reliability and accuracy of the RNA-Seq results. Our results provide basic information about the genes involved in sex determination/differentiation and gametogenesis, and pave the way for further studies on reproduction and breeding in S. constricta and other marine bivalves.


2019 ◽  
Author(s):  
Jiali Ye ◽  
Xuetong Yang ◽  
Sha Li ◽  
Wei Li ◽  
Qi Liu ◽  
...  

Abstract Background: Heat shock transcription factors (HSFs) play crucial roles in resisting heat stress and regulating plant development. Investigating the HSF family is essential for understanding the fertility conversion mechanism in thermo-sensitive male sterile wheat. Previous studies have investigated the HSF family in wheat but it is necessary to conduct more in-depth and systematic analyses based on the newly published reference genome. Results: In the present study, 61 wheat Hsf (TaHsf) genes were identified using two main strategies and renamed based on their physical locations on chromosomes. According to the gene structure and phylogenetic analyses, the 61 TaHsf genes were classified into three categories and eleven subclasses. The genes were unequally distributed on 21 chromosomes, including two pairs of tandem duplication genes and 52 TaHsf segmental duplication genes. According to the cis-elements identified, most of the TaHsfs can be activated by Ca++ and MYB, and they respond to drought, light, copper, and other stresses as well as heat shock. RNA-seq analysis indicated that the A2 class TaHsf genes exhibited persistently upregulated expression levels in the leaves/shoots, roots (except in the vegetative growth and reproductive growth stages), spikes, and grains in wheat under normal conditions. The A and B class TaHsf genes were positively regulated during the resistance to heat, whereas the C class genes were involved in drought regulation in wheat. Only the A and B class TaHsf genes were upregulated under fertile conditions in thermo-sensitive male sterile wheat. Conclusion: In this study, 61 wheat Hsf genes were identified based on the complete wheat reference genome. This comprehensive analysis provides novel insights into the TaHsf genes, including their diverse functions and involvement in metabolic pathways.


2019 ◽  
Vol 26 (1) ◽  
pp. 106-117 ◽  
Author(s):  
M. Ajmal Ali

The order Caryophyllales exhibits wide diversity from morphology to molecules, which leads to taxonomic complexities, especially in circumscribing its families. A comparative analysis of the available chloroplast (cp) genomes to detect patterns of genomic arrangement and variation has been lacking; hence, the alignment pattern and genomic rearrangements across the Caryophyllales were detected, and the phylogenetic relationships among the families of the Caryophyllales were inferred based on the maximum number of cp genes. The comparison of the Caryophyllales cp genomes, based on representatives of 10 families with Taxillus chinensis as the reference genome, revealed that coding regions were more conserved than non-coding regions; however, clpP, rpl16 and ycf15 were the most divergent coding regions among all taxa. Further, genomic rearrangements occurred in the gene organization of taxa among different families of Caryophyllales; extensive rearrangements were observed in Amaranthaceae, Caryophyllaceae, Chenopodiaceae, Droseraceae and Cactaceae.


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Lisong Hu ◽  
Zhongping Xu ◽  
Maojun Wang ◽  
Rui Fan ◽  
Daojun Yuan ◽  
...  

Abstract Black pepper (Piper nigrum), dubbed the ‘King of Spices’ and ‘Black Gold’, is one of the most widely used spices. Here, we present its reference genome assembly by integrating PacBio, 10x Chromium, BioNano DLS optical mapping, and Hi-C mapping technologies. The 761.2 Mb sequences (45 scaffolds with an N50 of 29.8 Mb) are assembled into 26 pseudochromosomes. A phylogenomic analysis of representative plant genomes places magnoliids as sister to the monocots-eudicots clade and indicates that black pepper has diverged from the shared Laurales-Magnoliales lineage approximately 180 million years ago. Comparative genomic analyses reveal specific gene expansions in the glycosyltransferase, cytochrome P450, shikimate hydroxycinnamoyl transferase, lysine decarboxylase, and acyltransferase gene families. Comparative transcriptomic analyses disclose berry-specific upregulated expression in representative genes in each of these gene families. These data provide an evolutionary perspective and shed light on the metabolic processes relevant to the molecular basis of species-specific piperine biosynthesis.
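
For readers unfamiliar with the N50 statistic quoted above, a minimal sketch of the usual definition may help: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the example lengths are hypothetical illustrations and are not taken from the black pepper assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings us to half the total.
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths in Mb (illustrative only).
example_lengths = [40, 30, 20, 10, 5]
print(n50(example_lengths))
```

With these illustrative lengths the total is 105, and the running sum first reaches half of it at the second-longest scaffold, so that scaffold's length is reported.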


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Karen H. Y. Wong ◽  
Walfred Ma ◽  
Chun-Yu Wei ◽  
Erh-Chan Yeh ◽  
Wan-Jia Lin ◽  
...  

Abstract The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.


BMC Genomics ◽  
2011 ◽  
Vol 12 (1) ◽  
Author(s):  
Geng Chen ◽  
Ruiyuan Li ◽  
Leming Shi ◽  
Junyi Qi ◽  
Pengzhan Hu ◽  
...  

2016 ◽  
pp. gkw655 ◽  
Author(s):  
Hélène Lopez-Maestre ◽  
Lilia Brinza ◽  
Camille Marchet ◽  
Janice Kielbassa ◽  
Sylvère Bastien ◽  
...  

2021 ◽  
Vol 12 ◽  
Author(s):  
Lixing Huang ◽  
Ying Qiao ◽  
Wei Xu ◽  
Linfeng Gong ◽  
Rongchao He ◽  
...  

Fish are considered an excellent model for clarifying the evolution and regulatory mechanisms of vertebrate immunity. However, knowledge of distinct immune cell populations in fish is still limited, and further development of techniques for identifying fish immune cell populations and their functions is required. Single cell RNA-seq (scRNA-seq) has provided a new approach for effective in-depth identification and characterization of cell subpopulations. Current approaches for scRNA-seq data analysis usually rely on comparison with a reference genome and hence are not suited for samples without any reference genome, a situation that is currently very common in fish research. Here, we present an alternative, i.e. scRNA-seq data analysis with a full-length transcriptome as a reference, and evaluate this approach on samples from Epinephelus coioides, a teleost without any published genome. We show that it reconstructs most of the transcripts present in the scRNA-seq data well, achieving a sensitivity equivalent to approaches relying on genome alignments of related species. Based on cell heterogeneity and known markers, we characterized four cell types: T cells, B cells, monocytes/macrophages (Mo/MΦ) and NCC (non-specific cytotoxic cells). Further analysis indicated the presence of two subsets of Mo/MΦ, the M1 and M2 types, as well as four subsets of B cells, i.e. mature B cells, immature B cells, pre B cells and early-pre B cells. Our research will provide new clues for understanding the biological characteristics, development and function of immune cell populations in teleosts. Furthermore, our approach provides a reliable alternative for scRNA-seq data analysis in teleosts for which no reference genome is currently available.

