Visualizing genome synteny with xmatchview

2017 ◽  
Author(s):  
René L. Warren

Abstract: In genomics research, the visual representation of DNA sequences is of prime importance. When displayed with additional information, or tracks, showing the position of annotated genes, alignments of sequences of interest, etc., these displays facilitate our understanding of genome and gene structure and become powerful tools for assessing the relationship between various sequence data. They can be used for troubleshooting sequence assemblies and for in-depth sequence analysis, and they eventually find their way into publications and oral presentations, as they often convey complex and abundant data succinctly in esthetically pleasing images. Here, I introduce xmatchview and xmatchview-conifer, two Python applications for comparing genomes visually and assessing their synteny. Availability: https://github.com/warrenlr/xmatchview


2017 ◽  
Vol 9 (2) ◽  
pp. 91
Author(s):  
Sunarno Sunarno ◽  
Yuanita Mulyastuti ◽  
Nelly Puspandari ◽  
Kambang Sariadji

BACKGROUND: The dtxR gene is a global regulator that can be used as a marker for detection of Corynebacterium diphtheriae (C. diphtheriae), and it is also a representative tool for mapping (molecular typing) of this bacterium. The aim of this study was to analyze the DNA sequences of the partial dtxR gene of C. diphtheriae causing diphtheria in some regions of Indonesia. DNA sequence analysis was used to verify the accuracy of the in-house multiplex polymerase chain reaction (PCR) method used for detection of C. diphtheriae in clinical specimens, and as a preliminary study to determine the strain diversity of C. diphtheriae circulating in Indonesia. METHODS: Ten PCR products targeting the dtxR gene, previously detected as positive for C. diphtheriae by in-house multiplex PCR, were used as samples in this study. DNA sequencing was carried out by the Sanger method, and the sequence data were analyzed offline with BioEdit software and online with the Basic Local Alignment Search Tool (BLAST). RESULTS: All DNA sequences analyzed in this study were similar or identical to the dtxR gene sequence data of C. diphtheriae registered in GenBank. Within the 162 nucleotides (bases 150-311) of the dtxR gene that were analyzed, at least 2 clones were found among the 10 samples. Substitutions of 2 nucleotides (bases 225 and 273) were detected; both were silent mutations. CONCLUSION: The ten partial DNA sequences of the dtxR gene in this study verify the accuracy of the in-house multiplex PCR used to identify the bacteria causing diphtheria in clinical specimens. The DNA sequences also represent the existing diversity of the bacteria causing diphtheria circulating in Indonesia. KEYWORDS: dtxR, C. diphtheriae, diphtheria, Indonesia



Author(s):  
Kuldeepsingh A. Kalariya ◽  
Ram Prasnna Meena ◽  
Lipi Poojara ◽  
Deepa Shahi ◽  
Sandip Patel

Abstract Background: Squalene synthase (SQS) is a rate-limiting enzyme necessary to produce pentacyclic triterpenes in plants. It is an important enzyme producing the squalene molecules required to run the steroidal and triterpenoid biosynthesis pathways, which work in competitive inhibition mode. Reports are available on the SQS gene in several plants, but detailed information on the SQS gene in Gymnema sylvestre R. Br. is not available. G. sylvestre is a priceless rare vine of the central eco-region known for its medicinally important triterpenoids. Our work aims to characterize the GS-SQS gene in this high-value medicinal plant. Results: A coding DNA sequence (CDS) of 1245 bp representing the GS-SQS gene, predicted from transcriptome data in G. sylvestre, was used for further characterization. The SWISS protein structure modeled for the GS-SQS amino acid sequence had a MolProbity Score of 1.44 and a Clash Score of 3.86. The quality estimates and statistical score of the Ramachandran plot analysis indicated that the homology model was reliable. For full-length amplification of the gene, primers designed from the flanking regions of the CDS encoding GS-SQS were used against genomic DNA as template, which resulted in a single-band product of approximately 6.2 kb. This product was sequenced through NGS, generating 2.32 Gb of data and 3347 scaffolds with an N50 value of 457 bp. These scaffolds were compared to identify similarity with other SQS genes as well as with the GS-SQSs of the transcriptome. Scaffold_3347, representing the GS-SQS gene, harbored two introns of 101 and 164 bp. Both intronic regions were validated with primers designed from the regions adjoining the introns on the scaffold representing the GS-SQS gene. Amplification succeeded when the template was genomic DNA and failed when the template was cDNA, confirming the presence of two introns in the GS-SQS gene in Gymnema sylvestre R. Br. Conclusion: This study shows that the GS-SQS gene is very closely related to those of Coffea arabica and Gardenia jasminoides and that this gene harbors two introns of 101 and 164 bp.
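The N50 value quoted in the abstract above is a standard summary of assembly contiguity. As a minimal sketch, assuming the common definition (the largest length L such that scaffolds of length at least L together cover at least half of the total assembled bases), it can be computed as follows. The function name `n50` and the scaffold lengths in the usage note are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Assumed definition: walk the scaffolds from longest to shortest and
    return the length at which the running total first reaches at least
    half of the summed length of all scaffolds.
    """
    if not lengths:
        raise ValueError("N50 is undefined for an empty set of scaffolds")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional halves.
        if running * 2 >= total:
            return length
```

For example, with hypothetical scaffold lengths `[2, 3, 4, 5, 6]` (total 20), the two longest scaffolds cover 11 bases, which is more than half, so `n50` returns 5.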



Diagnostics ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 974
Author(s):  
Hayfa Sharif ◽  
Caroline L. Hoad ◽  
Nichola Abrehart ◽  
Penny A. Gowland ◽  
Robin C. Spiller ◽  
...  

Background: Functional constipation in children is common. Management of this condition can be challenging and is often based on symptom reports. Increased, objective knowledge of colonic volume changes in constipation compared to health could provide additional information. However, very little data on paediatric colonic volume is available except from methods that are invasive or require unphysiological colonic preparations. Objectives: (1) To measure volumes of the undisturbed colon in children with functional constipation (FC) using magnetic resonance imaging (MRI) and provide initial normal range values for healthy controls, and (2) to investigate possible correlation of colonic volume with whole gut transit time (WGTT). Methods: Total and regional (ascending, transverse, descending, sigmoid, and rectum) colon volumes were measured from MRI images of 35 participants aged 7–18 years (16 with FC and 19 healthy controls), and corrected for body surface area. Linear regression was used to explore the relationship between total colon volume and WGTT. Results: Total colonic volume was significantly higher, with a median (interquartile range) of 309 mL (243–384 mL) for the FC group than for the healthy controls of 227 mL (180–263 mL). The largest increase between patients and controls was in the sigmoid colon–rectum region. In a linear regression model, there was a positive significant correlation between total colonic volume and WGTT (R = 0.56, p = 0.0005). Conclusions: This initial study shows increased volumes of the colon in children with FC, in a physiological state, without use of any bowel preparation. Increased knowledge of colonic morphology may improve understanding of FC in this age group and help to direct treatment.



2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i857-i865
Author(s):  
Derrick Blakely ◽  
Eamon Collins ◽  
Ritambhara Singh ◽  
Andrew Norton ◽  
Jack Lanchantin ◽  
...  

Abstract Motivation: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. Results: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Availability and implementation: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK Supplementary information: Supplementary data are available at Bioinformatics online.
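The idea of splitting a gapped k-mer kernel into independent counting operations, one per choice of positions, can be illustrated with a small sketch. This is not the FastSK implementation or its API. It is a simplified, unnormalized illustration under stated assumptions: every window of length `g` is reduced to `k` kept positions, matching reduced windows are counted between two sequences, and the counts are summed over all position subsets. All function names (`gapped_counts`, `subset_kernel`, `gapped_kernel_exact`, `gapped_kernel_sampled`) are hypothetical. The sampled variant draws position subsets uniformly at random as a stand-in for a Monte Carlo approximation.

```python
import random
from collections import Counter
from itertools import combinations


def gapped_counts(seq, g, keep):
    """Count length-g windows of seq after keeping only the offsets in keep."""
    return Counter(
        tuple(seq[start + offset] for offset in keep)
        for start in range(len(seq) - g + 1)
    )


def subset_kernel(x, y, g, keep):
    """One independent counting operation: matches for a single position subset."""
    cx = gapped_counts(x, g, keep)
    cy = gapped_counts(y, g, keep)
    return sum(n * cy[feature] for feature, n in cx.items())


def gapped_kernel_exact(x, y, g, k):
    """Sum the counting operation over every way of keeping k of g positions."""
    return sum(subset_kernel(x, y, g, keep) for keep in combinations(range(g), k))


def gapped_kernel_sampled(x, y, g, k, n_samples, seed=0):
    """Estimate the exact sum from n_samples uniformly sampled position subsets."""
    rng = random.Random(seed)
    subsets = list(combinations(range(g), k))
    mean = sum(
        subset_kernel(x, y, g, rng.choice(subsets)) for _ in range(n_samples)
    ) / n_samples
    # Scale the per-subset mean back up to the number of subsets.
    return mean * len(subsets)
```

Because each position subset is handled independently, the per-subset counts could in principle be computed in parallel or sampled, which is the property the abstract attributes to its decomposition. A real implementation would also normalize the kernel and avoid enumerating all subsets before sampling.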



Information ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 79 ◽  
Author(s):  
Xiaoyu Han ◽  
Yue Zhang ◽  
Wenkai Zhang ◽  
Tinglei Huang

Relation extraction is a vital task in natural language processing. It aims to identify the relationship between two specified entities in a sentence. Besides the information contained in the sentence, additional information about the entities has been shown to be helpful in relation extraction. Additional information such as the entity type obtained by NER (Named Entity Recognition) and the description provided by a knowledge base both have their limitations. Nevertheless, there is another way to provide additional information that can overcome these limitations in Chinese relation extraction. Because Chinese characters usually have explicit meanings and can carry more information than English letters, we suggest that the characters that constitute the entities can provide additional information that is helpful for the relation extraction task, especially in large-scale datasets. This assumption has never been verified before. The main obstacle is the lack of large-scale Chinese relation datasets. In this paper, first, we generate a large-scale Chinese relation extraction dataset based on a Chinese encyclopedia. Second, we propose an attention-based model using the characters that compose the entities. The results on the generated dataset show that these characters can provide useful information for the Chinese relation extraction task. By using this information, the attention mechanism we use can recognize the crucial part of the sentence that expresses the relation. The proposed model outperforms other baseline models on our Chinese relation extraction dataset.



2003 ◽  
Vol 93 (2) ◽  
pp. 219-228 ◽  
Author(s):  
Béatrice Denoyes-Rothan ◽  
Guy Guérin ◽  
Christophe Délye ◽  
Barbara Smith ◽  
Dror Minz ◽  
...  

Ninety-five isolates of Colletotrichum including 81 isolates of C. acutatum (62 from strawberry) and 14 isolates of C. gloeosporioides (13 from strawberry) were characterized by various molecular methods and pathogenicity tests. Results based on random amplified polymorphic DNA (RAPD) polymorphism and internal transcribed spacer (ITS) 2 sequence data provided clear genetic evidence of two subgroups in C. acutatum. The first subgroup, characterized as CA-clonal, included only isolates from strawberry and exhibited identical RAPD patterns and nearly identical ITS2 sequence analysis. A larger genetic group, CA-variable, included isolates from various hosts and exhibited variable RAPD patterns and divergent ITS2 sequence analysis. Within the C. acutatum population isolated from strawberry, the CA-clonal group is prevalent in Europe (54 isolates of 62). A subset of European C. acutatum isolates isolated from strawberry and representing the CA-clonal and CA-variable groups was assigned to two pathogenicity groups. No correlation could be drawn between genetic and pathogenicity groups. On the basis of molecular data, it is proposed that the CA-clonal subgroup contains closely related, highly virulent C. acutatum isolates that may have developed host specialization to strawberry. C. gloeosporioides isolates from Europe, which were rarely observed, were either slightly pathogenic or nonpathogenic on strawberry. The absence of correlation between genetic polymorphism and geographical origin in Colletotrichum spp. suggests a worldwide dissemination of isolates, probably through international plant exchanges.



Genome ◽  
2004 ◽  
Vol 47 (4) ◽  
pp. 732-741 ◽  
Author(s):  
Wolfgang Staiber

The origin of germline-limited chromosomes (Ks) as descendants of somatic chromosomes (Ss) and their structural evolution was recently elucidated in the chironomid Acricotopus. The Ks consist of large S-homologous sections and of heterochromatic segments containing germline-specific, highly repetitive DNA sequences. Less is known about the molecular evolution and features of the sequences in the S-homologous K sections. More information about this was obtained by comparing homologous gene sequences of Ks and Ss. Genes for 5.8S, 18S, 28S, and 5S ribosomal RNA were chosen for the comparison and were therefore first isolated by PCR from somatic DNA of Acricotopus and sequenced. Specific K DNA was collected by microdissection of monopolar moving K complements from differential gonial mitoses and was then amplified by degenerate oligonucleotide primer (DOP)-PCR. With the sequence data of the somatic rDNAs, the homologous 5.8S and 5S rDNA sequences were isolated by PCR from the DOP-PCR sequence pool of the Ks. In addition, a number of K DOP-PCR sequences were directly cloned and analysed. One K clone contained a section of a putative N-acetyltransferase gene. Compared with its homolog from the Ss, the sequence exhibited few nucleotide substitutions (99.2% sequence identity). The same was true for the 5.8S and 5S sequences from Ss and Ks (97.5%–100% identity). This supports the idea that the S-homologous K sequences may be conserved and do not evolve independently from their somatic homologs. Possible mechanisms effecting such conservation of S-derived sequences in the Ks are discussed. Key words: microdissection, DOP-PCR, germline-limited chromosomes, molecular evolution.



2018 ◽  
Vol 20 (4) ◽  
pp. 1542-1559 ◽  
Author(s):  
Damla Senol Cali ◽  
Jeremie S Kim ◽  
Saugata Ghose ◽  
Can Alkan ◽  
Onur Mutlu

Abstract: Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) The choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of the bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.



mSystems ◽  
2018 ◽  
Vol 3 (3) ◽  
Author(s):  
Gabriel A. Al-Ghalith ◽  
Benjamin Hillmann ◽  
Kaiwei Ang ◽  
Robin Shields-Cutler ◽  
Dan Knights

ABSTRACT: Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, that they need to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control. IMPORTANCE: Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.



2002 ◽  
Vol 76 (14) ◽  
pp. 7094-7102 ◽  
Author(s):  
David J. Griffiths ◽  
Cécile Voisset ◽  
Patrick J. W. Venables ◽  
Robin A. Weiss

ABSTRACT Human retrovirus 5 (HRV-5) represented a fragment of a novel retrovirus sequence identified in human RNA and DNA preparations. In this study, the genome of HRV-5 was cloned and sequenced and integration sites were analyzed. Using PCR and Southern hybridization, we showed that HRV-5 is not integrated into human DNA. A survey of other species revealed that HRV-5 is present in the genomic DNA of the European rabbit (Oryctolagus cuniculus) and belongs to an endogenous retrovirus family found in rabbits. The presence of rabbit sequences flanking HRV-5 proviruses in human DNA extracts suggested that rabbit DNA was present in our human extracts, and this was confirmed by PCR analysis that revealed the presence of rabbit mitochondrial DNA sequences in four of five human DNA preparations tested. The origin of the rabbit DNA and HRV-5 in human DNA preparations remains unclear, but laboratory contamination cannot explain the preferential detection of HRV-5 in inflammatory diseases and lymphomas reported previously. This is the first description of a retrovirus genome in rabbits, and sequence analysis shows that it is related to but distinct from A-type retroelements of mice and other rodents. The species distribution of HRV-5 is restricted to rabbits; other species, including other members of the order Lagomorpha, do not contain this sequence. Analysis of HRV-5 expression by Northern hybridization and reverse transcriptase PCR indicates that the virus is transcribed at a low level in many rabbit tissues. In light of these findings we propose that the sequence previously designated HRV-5 should now be denoted RERV-H (for rabbit endogenous retrovirus H).


