nanotatoR: a tool for enhanced annotation of genomic structural variants

BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Surajit Bhattacharya ◽  
Hayk Barseghyan ◽  
Emmanuèle C. Délot ◽  
Eric Vilain

Abstract Background: Whole genome sequencing is effective at identification of small variants, but because it is based on short reads, assessment of structural variants (SVs) is limited. The advent of Optical Genome Mapping (OGM), which utilizes long fluorescently labeled DNA molecules for de novo genome assembly and SV calling, has allowed for increased sensitivity and specificity in SV detection. However, compared to small variant annotation tools, OGM-based SV annotation software has seen little development, and currently available SV annotation tools do not provide sufficient information for determination of variant pathogenicity. Results: We developed an R-based package, nanotatoR, which provides comprehensive annotation as a tool for SV classification. nanotatoR uses both external (DGV; DECIPHER; Bionano Genomics BNDB) and internal (user-defined) databases to estimate SV frequency. Human genome reference GRCh37/38-based BED files are used to annotate SVs with overlapping, upstream, and downstream genes. Overlap percentages and distances for nearest genes are calculated and can be used for filtration. A primary gene list is extracted from public databases based on the patient’s phenotype and used to filter genes overlapping SVs, providing the analyst with an easy way to prioritize variants. If available, expression of overlapping or nearby genes of interest is extracted (e.g. from an RNA-Seq dataset), allowing the user to assess the effects of SVs on the transcriptome. Most quality-control filtration parameters are customizable by the user. The output is given in an Excel file format, subdivided into multiple sheets based on SV type and inheritance pattern (INDELs, inversions, translocations, de novo, etc.). nanotatoR passed all quality and run time criteria of Bioconductor, where it was accepted in the April 2019 release. We evaluated nanotatoR’s annotation capabilities using publicly available reference datasets: the singleton sample NA12878, mapped with two types of enzyme labeling, and the NA24143 trio. nanotatoR was also able to accurately filter the known pathogenic variants in a cohort of patients with Duchenne Muscular Dystrophy for which we had previously demonstrated the diagnostic ability of OGM. Conclusions: The extensive annotation enables users to rapidly identify potential pathogenic SVs, a critical step toward use of OGM in the clinical setting.
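
The overlap percentages and nearest-gene distances mentioned in the abstract can be illustrated with a minimal Python sketch. This is not nanotatoR's code (the package is written in R). The `Interval` type, the function names, and the choice to express overlap as a share of the gene's length are assumptions made for illustration, using BED-style half-open coordinates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """A genomic interval in BED convention (0-based start, exclusive end)."""
    chrom: str
    start: int
    end: int


def overlap_percent(sv: Interval, gene: Interval) -> float:
    """Percentage of the gene's length covered by the SV (assumed definition)."""
    if sv.chrom != gene.chrom:
        return 0.0
    shared = min(sv.end, gene.end) - max(sv.start, gene.start)
    if shared <= 0:
        return 0.0
    return 100.0 * shared / (gene.end - gene.start)


def distance_bp(sv: Interval, gene: Interval) -> Optional[int]:
    """Gap in base pairs between SV and gene.

    Returns 0 when they overlap or touch, and None when they lie on
    different chromosomes.
    """
    if sv.chrom != gene.chrom:
        return None
    return max(0, gene.start - sv.end, sv.start - gene.end)
```

A filter on these two values (for example, keeping genes within a chosen distance of an SV) would then correspond to the user-adjustable filtration the abstract describes; the thresholds themselves are not given in the source.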

2020 ◽  
Author(s):  
Surajit Bhattacharya ◽  
Hayk Barseghyan ◽  
Emmanuèle C. Délot ◽  
Eric Vilain

Abstract Whole genome sequencing is effective at identification of small variants but, because it is based on short reads, assessment of structural variants (SVs) is limited. The advent of Optical Genome Mapping (OGM), which utilizes long fluorescently labeled DNA molecules for de novo genome assembly and SV calling, has allowed for increased sensitivity and specificity in SV detection. However, compared to small variant annotation tools, OGM-based SV annotation software has seen little development, and currently available SV annotation tools do not provide sufficient information for determination of variant pathogenicity. We developed an R-based package, nanotatoR, which provides comprehensive annotation as a tool for SV classification. nanotatoR uses both external (DGV; DECIPHER; Bionano Genomics BNDB) and internal (user-defined) databases to estimate SV frequency. Human genome reference GRCh37/38-based BED files are used to annotate SVs with overlapping, upstream, and downstream genes. Overlap percentages and distances for nearest genes are calculated and can be used for filtration. A primary gene list is extracted from public databases based on the patient’s phenotype and used to filter genes overlapping SVs, providing the analyst with an easy way to prioritize variants. If available, expression of overlapping or nearby genes of interest is extracted (e.g. from an RNA-Seq dataset), allowing the user to assess the effects of SVs on the transcriptome. Most quality-control filtration parameters are customizable by the user. The output is given in an Excel file format, subdivided into multiple sheets based on SV type and inheritance pattern (INDELs, inversions, translocations, de novo, etc.). nanotatoR passed all quality and run time criteria of Bioconductor, where it was accepted in the April 2019 release. We evaluated nanotatoR’s annotation capabilities using publicly available reference datasets: the singleton sample NA12878, mapped with two types of enzyme labeling, and the NA24143 trio. nanotatoR was also able to accurately filter the known pathogenic variants in a cohort of patients with Duchenne Muscular Dystrophy for which we had previously demonstrated the diagnostic ability of OGM. The extensive annotation enables users to rapidly identify potential pathogenic SVs, a critical step toward use of OGM in the clinical setting.


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Yaoxi He ◽  
Xin Luo ◽  
Bin Zhou ◽  
Ting Hu ◽  
Xiaoyu Meng ◽  
...  

Abstract We present a high-quality de novo genome assembly (rheMacS) of the Chinese rhesus macaque (Macaca mulatta) using long-read sequencing and multiplatform scaffolding approaches. Compared to the current Indian rhesus macaque reference genome (rheMac8), rheMacS increases sequence contiguity 75-fold, closing 21,940 of the remaining assembly gaps (60.8 Mbp). We improve gene annotation by generating more than two million full-length transcripts from ten different tissues by long-read RNA sequencing. We sequence resolve 53,916 structural variants (96% novel) and identify 17,000 ape-specific structural variants (ASSVs) based on comparison to ape genomes. Many ASSVs map within ChIP-seq predicted enhancer regions where apes and macaque show diverged enhancer activity and gene expression. We further characterize a subset that may contribute to ape- or great-ape-specific phenotypic traits, including taillessness, brain volume expansion, improved manual dexterity, and large body size. The rheMacS genome assembly serves as an ideal reference for future biomedical and evolutionary studies.


2020 ◽  
Author(s):  
Marek Cmero ◽  
Breon Schmidt ◽  
Ian J. Majewski ◽  
Paul G. Ekert ◽  
Alicia Oshlack ◽  
...  

Abstract Genomic rearrangements can modify gene function by altering transcript sequences, and have been shown to be drivers in both cancer and rare diseases. Although there are now many methods to detect structural variants from Whole Genome Sequencing (WGS), RNA sequencing (RNA-seq) remains under-utilised as a technology for the detection of gene altering structural variants. Calling fusion genes from RNA-seq data is well established, but other transcriptional variants such as fusions with novel sequence, tandem duplications, large insertions and deletions, and novel splicing are difficult to detect using existing approaches. To identify all types of variants in transcriptomes, we developed MINTIE, an integrated pipeline for RNA-seq data. We take a reference free approach, which combines de novo assembly of transcripts with differential expression analysis, to identify up-regulated novel variants in a case sample. We validated MINTIE on simulated and real data sets and compared it with eight other approaches for finding novel transcriptional variants. We found MINTIE was able to detect all defined variant classes at high rates (>70%) while no other method was able to achieve this. We applied MINTIE to RNA-seq data from a cohort of acute lymphoblastic leukemia (ALL) patient samples and identified several novel clinically relevant variants, including an unpartnered recurrent fusion involving the tumour suppressor gene RB1, and variants in ALL-associated genes: tandem duplications in IKZF1 and PAX5, and novel splicing in ETV6. We further demonstrate the utility of MINTIE to identify rare disease variants using RNA-seq, including the discovery of an inter-chromosomal translocation in the DMD gene in a patient with muscular dystrophy. We posit that MINTIE will be able to identify new disease variants across a range of cancers and other disease types.


Author(s):  
Zerin Hyder ◽  
Eduardo Calpena ◽  
Yang Pei ◽  
Rebecca S. Tooze ◽  
Helen Brittain ◽  
...  

Abstract Purpose: Genome sequencing (GS) for diagnosis of rare genetic disease is being introduced into the clinic, but the complexity of the data poses challenges for developing pipelines with high diagnostic sensitivity. We evaluated the performance of the Genomics England 100,000 Genomes Project (100kGP) panel-based pipelines, using craniosynostosis as a test disease. Methods: GS data from 114 probands with craniosynostosis and their relatives (314 samples), negative on routine genetic testing, were scrutinized by a specialized research team, and diagnoses compared with those made by 100kGP. Results: Sixteen likely pathogenic/pathogenic variants were identified by 100kGP. Eighteen additional likely pathogenic/pathogenic variants were identified by the research team, indicating that for craniosynostosis, 100kGP panels had a diagnostic sensitivity of only 47%. Measures that could have augmented diagnoses were improved calling of existing panel genes (+18% sensitivity), review of updated panels (+12%), comprehensive analysis of de novo small variants (+29%), and copy-number/structural variants (+9%). Recent NHS England recommendations that partially incorporate these measures should achieve 85% overall sensitivity (+38%). Conclusion: GS identified likely pathogenic/pathogenic variants in 29.8% of previously undiagnosed patients with craniosynostosis. This demonstrates the value of research analysis and the importance of continually improving algorithms to maximize the potential of clinical GS.
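
The two headline percentages in this abstract follow from the counts it reports. The short Python sketch below reproduces that arithmetic. It assumes, as the reported figures suggest but the abstract does not state outright, that each of the 34 variants corresponds to one diagnosed proband; the function and variable names are illustrative.

```python
def fraction(part: int, whole: int) -> float:
    """Return part / whole as a float, guarding against an empty denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Counts taken from the abstract.
found_by_100kgp = 16        # variants identified by the 100kGP pipelines
found_by_research = 18      # additional variants found by the research team
probands = 114              # probands with craniosynostosis

all_found = found_by_100kgp + found_by_research  # 34 variants in total

# Sensitivity of the pipeline relative to everything the combined effort found.
pipeline_sensitivity = fraction(found_by_100kgp, all_found)

# Share of probands with a likely pathogenic/pathogenic variant
# (assumes one variant per diagnosed proband).
diagnostic_yield = fraction(all_found, probands)
```

Rounded, `pipeline_sensitivity` is 47% and `diagnostic_yield` is 29.8%, matching the values quoted in the Results and Conclusion.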


2016 ◽  
Author(s):  
Maria Nattestad ◽  
Michael C Schatz

Summary: Assemblytics is a web app for detecting and analyzing structural variants from a de novo genome assembly aligned to a reference genome. It incorporates a unique anchor filtering approach to increase robustness to repetitive elements, and identifies six classes of variants based on their distinct alignment signatures. Assemblytics can be applied both to compare aberrant genomes, such as human cancers, to a reference and to identify differences between related species. Multiple interactive visualizations enable in-depth explorations of the genomic distributions of variants. Availability and Implementation: http://qb.cshl.edu/assemblytics, https://github.com/marianattestad/assemblytics


2021 ◽  
Author(s):  
Xiang Li ◽  
Qian Shi ◽  
Mingfu Shao

Abstract Motivation: The widely used high-throughput RNA-sequencing technologies (RNA-seq) usually produce paired-end reads. We explore whether full fragments can be computationally reconstructed from the two sequenced ends, a problem we refer to here as bridging. Solving this problem provides longer, more informative RNA-seq reads, and hence benefits downstream RNA-seq analysis such as transcriptome assembly and expression quantification. However, bridging is a challenging and complicated task owing to alternative splicing, transcript noise, and sequencing errors. It remains unclear whether the data itself provides sufficient information for accurate bridging, let alone proper models and efficient algorithms that characterize and determine the true bridges. Algorithmic Results: We studied this problem in two settings: reference-based bridging, which assumes read alignments are available and reconstructs the alignments of full fragments, and de novo bridging, which reconstructs sequences of entire fragments from sequences of the two ends. We proposed a novel mathematical formulation that works for both settings: to seek a path in an underlying graph data structure (i.e., splice graph for reference-based bridging, and compacted de Bruijn graph for de novo bridging) such that its bottleneck weight is maximized. This formulation characterizes true bridges and is efficient in filtering out false bridges. This formulation admits the optimal substructure property, and hence efficient dynamic programming algorithms can be designed. For reference-based bridging, we designed such an algorithm to calculate the top N bridging paths, followed by a voting approach to select one using the distribution of fragment length. For de novo bridging, we designed a new truncated Dijkstra’s algorithm. To further speed this up, we proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra’s algorithm from scratch for all vertices. These innovations result in scalable algorithms that can bridge all paired-end reads in a compacted de Bruijn graph with millions of vertices. Experimental Results: We showed that paired-end RNA-seq reads can be accurately bridged to a large extent. Our reference-based bridging tool could correctly bridge more than 79.6% of reads. For de novo bridging, high precision was observed with varied sensitivity. We also showed that bridging can improve reference-based transcript assembly: the improvement was significant (up to 14.4% measured with adjusted precision), and universal in all combinations with different aligners and assemblers. Availability: Implementations of the algorithms for reference-based and de novo bridging are available at https://github.com/Shao-Group/rnabridge-align and https://github.com/Shao-Group/rnabridge-denovo, respectively. Scripts, datasets, and documentation that can reproduce the experimental results in this manuscript are available at https://github.com/Shao-Group/rnabridge-test.
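
The core formulation here, finding a path whose bottleneck (minimum edge) weight is as large as possible, is the classic widest-path problem. The Python sketch below solves it with a standard Dijkstra-style search on a small weighted graph. It is a generic textbook variant, not the authors' truncated Dijkstra algorithm or their top-N dynamic program. The adjacency-dict graph representation and the function name are assumptions, and edge weights are assumed to be positive.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def widest_path(graph: Graph, source: Hashable, target: Hashable):
    """Return (bottleneck, path) for the maximum-bottleneck path.

    graph maps each node to a list of (neighbor, positive_weight) edges.
    Returns (0, []) when target cannot be reached from source.
    """
    best = {source: float("inf")}   # best bottleneck found so far per node
    parent = {source: None}
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    done = set()

    while heap:
        neg_width, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nxt, weight in graph.get(node, []):
            # Bottleneck of the path extended by this edge.
            width = min(-neg_width, weight)
            if width > best.get(nxt, 0):
                best[nxt] = width
                parent[nxt] = node
                heapq.heappush(heap, (-width, nxt))

    if target not in best:
        return 0, []

    # Walk parent pointers back from the target to recover the path.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]
```

In a bridging setting, nodes would be splice-graph or de Bruijn graph vertices and edge weights would reflect read support, so a path with a high bottleneck avoids any weakly supported edge. That mapping is an interpretation of the abstract rather than a description of the released tools.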


Author(s):  
Hailin Liu ◽  
Shigang Wu ◽  
Alun Li ◽  
Jue Ruan

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. It also has been widely used to study structural variants, phase haplotypes and more. Here, we introduce SMARTdenovo, a single-molecule sequencing (SMS) assembler that follows the overlap-layout-consensus (OLC) paradigm. SMARTdenovo (RRID: SCR_017622) was designed to be a fast assembler that did not require highly accurate raw reads for error correction, unlike other, contemporaneous SMS assemblers. It has performed well for evaluating congeneric assemblers and has been successful for a variety of assembly projects. It is compatible with Canu for assembling high-quality genomes, and several of the assembly strategies in this program have been incorporated into subsequent popular assemblers. The assembler has been in use since 2015, and here we provide information on the development of SMARTdenovo and how to implement its algorithms into current projects.


1980 ◽  
Vol 45 (8) ◽  
pp. 2364-2370 ◽  
Author(s):  
Antonín Holý ◽  
Erik De Clercq

Reaction of 3',5'-di-O-benzoyl-6-methyl-2'-deoxyuridine (IIa) with elemental bromine or iodine afforded 5-halogeno derivatives IIc and IId which on methanolysis gave 5-bromo-6-methyl-2'-deoxyuridine (Ic) and 5-iodo-6-methyl-2'-deoxyuridine (Id), respectively. The CD spectra of Ic, Id and 6-methyl-2'-deoxyuridine (Ia) are compared and discussed with regard to determination of the nucleoside conformation. Unlike 5-bromo- and 5-iodo-2'-deoxyuridine, the 6-methyl derivatives Ic and Id exhibit neither antibacterial nor antiviral activity. Nor do they exert any antimetabolic effect on the de novo DNA synthesis in primary rabbit kidney cells.


1995 ◽  
Vol 269 (2) ◽  
pp. E247-E252 ◽  
Author(s):  
H. O. Ajie ◽  
M. J. Connor ◽  
W. N. Lee ◽  
S. Bassilian ◽  
E. A. Bergner ◽  
...  

To determine the contributions of preexisting fatty acid, de novo synthesis, and chain elongation in long-chain fatty acid (LCFA) synthesis, the synthesis of LCFAs, palmitate (16:0), stearate (18:0), arachidate (20:0), behenate (22:0), and lignocerate (24:0), in the epidermis, liver, and spinal cord was determined using deuterated water and mass isotopomer distribution analysis in hairless mice and Sprague-Dawley rats. Animals were given 4% deuterated water for 5 days or 8 wk in their drinking water. Blood was withdrawn at the end of these times for the determination of deuterium enrichment, and the animals were killed to isolate the various tissues for lipid extraction for the determination of the mass isotopomer distributions. The mass isotopomer distributions in LCFA were incompatible with synthesis from a single pool of primer. The synthesis of palmitate, stearate, arachidate, behenate, and lignocerate followed the expected biochemical pathways for the synthesis of LCFAs. On average, three deuterium atoms were incorporated for every addition of an acetyl unit. The isotopomer distribution resulting from chain elongation and de novo synthesis can be described by the linear combination of two binomial distributions. The proportions of preexisting, chain elongation, and de novo-synthesized fatty acids as a percentage of the total fatty acids were determined using multiple linear regression analysis. Fractional synthesis was found to vary, depending on the tissue type and the fatty acid, from 47 to 87%. A substantial fraction (24-40%) of the newly synthesized molecules was derived from chain elongation of unlabeled (recycled) palmitate.
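
The statement that the isotopomer distribution "can be described by the linear combination of two binomial distributions" can be sketched as a forward model in Python. This shows only how such a mixture is composed; the abstract's actual estimation step (multiple linear regression) is not reproduced. All parameter names and any numbers passed in are hypothetical, and the model assumes unlabeled preexisting molecules contribute only to the M0 peak.

```python
from math import comb
from typing import List


def binomial_pmf(n: int, p: float) -> List[float]:
    """Probabilities of incorporating k = 0..n deuterium atoms at enrichment p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def isotopomer_mixture(
    n_de_novo: int,     # labelable positions in a fully de novo molecule
    n_elongation: int,  # labelable positions added by chain elongation only
    p: float,           # deuterium enrichment of the precursor pool
    f_de_novo: float,   # fraction of molecules made de novo
    f_elongation: float,  # fraction made by elongating unlabeled precursor
) -> List[float]:
    """Expected M0..Mn distribution as a weighted sum of two binomials.

    The remaining fraction (1 - f_de_novo - f_elongation) is treated as
    preexisting, unlabeled fatty acid and added to M0.
    """
    if n_elongation > n_de_novo:
        raise ValueError("n_elongation must not exceed n_de_novo")
    f_preexisting = 1.0 - f_de_novo - f_elongation
    if f_preexisting < -1e-12:
        raise ValueError("fractions must sum to at most 1")

    de_novo = binomial_pmf(n_de_novo, p)
    # Pad the shorter distribution with zeros so both cover M0..Mn.
    elongation = binomial_pmf(n_elongation, p) + [0.0] * (n_de_novo - n_elongation)

    mixture = [f_de_novo * a + f_elongation * b for a, b in zip(de_novo, elongation)]
    mixture[0] += max(f_preexisting, 0.0)
    return mixture
```

Fitting the three fractions to a measured distribution would then amount to regressing the observed isotopomer abundances on these component distributions, which is consistent with the regression approach the abstract mentions.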

