sequence identifier

2021 ◽  
Author(s):  
Chuanyi Zhang ◽  
Palash Sashittal ◽  
Mohammed El-Kebir

Genes in coronaviruses are preceded by transcription regulatory sequences (TRSs), which play a critical role in gene expression mediated by the viral RNA-dependent RNA polymerase via the process of discontinuous transcription. TRSs are crucial for our understanding of the regulation and expression of coronavirus genes, and we demonstrate for the first time how they can also be leveraged to identify gene locations in the coronavirus genome. To that end, we formulate the TRS AND GENE IDENTIFICATION (TRS-GENE-ID) problem of simultaneously identifying TRS sites and gene locations in unannotated coronavirus genomes. We introduce CORSID (CORe Sequence IDentifier), a computational tool to solve this problem. We also present CORSID-A, which solves a constrained version of the TRS-GENE-ID problem, the TRS IDENTIFICATION (TRS-ID) problem, identifying TRS sites in a coronavirus genome with specified gene annotations. We show that CORSID-A outperforms existing motif-based methods in identifying TRS sites in coronaviruses and that CORSID outperforms state-of-the-art gene finding methods in finding genes in coronavirus genomes. We demonstrate that CORSID enables de novo identification of TRS sites and genes in previously unannotated coronaviruses. CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of any prior information.


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0239883
Author(s):  
Reece K. Hart ◽  
Andreas Prlić

Motivation: Access to biological sequence data, such as genome, transcript, or protein sequence, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is inconsequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility.
Results: Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enables access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available.
It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available. Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u as identifiers, thereby seamlessly integrating public and private sequence sets.
Availability: SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.
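The sha512t24u convention mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not SeqRepo's own code: it assumes, as the name suggests, that the SHA-512 digest is truncated to its first 24 bytes and then encoded with the URL-safe Base64 alphabet. The function name and the sample input are chosen here for illustration only.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Return a truncated, URL-safe SHA-512 digest of `blob` (illustrative sketch)."""
    digest = hashlib.sha512(blob).digest()  # full digest: 64 raw bytes
    truncated = digest[:24]                 # assumption: keep the first 24 bytes (192 bits)
    # 24 bytes map to exactly 32 Base64 characters, so no "=" padding is produced.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


if __name__ == "__main__":
    identifier = sha512t24u(b"ACGT")  # hypothetical sequence used as input
    print(identifier)
    print(len(identifier))  # 32
```

Because 24 bytes is a multiple of 3, the encoded form never needs padding, and the URL-safe alphabet replaces "+" and "/" with "-" and "_", so the result can be placed directly in a URL path or query string.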


2020 ◽  
Author(s):  
Reece K. Hart ◽  
Andreas Prlić

Motivation: Access to biological sequence data, such as genome, transcript, or protein sequence, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is inconsequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility.
Results: Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enables access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available.
It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available. Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u as identifiers, thereby seamlessly integrating public and private sequence sets.
Availability: SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.


2019 ◽  
Vol 7 (1) ◽  
pp. 649-651
Author(s):  
Brian Young ◽  
Tom Faris ◽  
Luigi Armogida ◽  
Victor Meles

2015 ◽  
Vol 140 (1) ◽  
pp. 78-87 ◽  
Author(s):  
Karen Harris-Shultz ◽  
Melanie Harrison ◽  
Phillip A. Wadl ◽  
Robert N. Trigiano ◽  
Timothy Rinehart

Little bluestem (Schizachyrium scoparium) is a perennial bunchgrass that is native to North American prairies and woodlands from southern Canada to northern Mexico. Originally used as a forage grass, little bluestem is now listed as a major U.S. native ornamental grass. With the widespread planting of only a few cultivars, we aimed to assess the ploidy level and genetic diversity among some popular cultivars and accessions in the U.S. Department of Agriculture National Plant Germplasm System collection. Ten microsatellite markers that amplified successfully were developed using sequences available in GenBank, and additional simple sequence repeat (SSR) markers were generated by Ion Torrent sequencing of a genomic library created from the cultivar The Blues. A total of 2812 primer sets was designed from high-throughput sequencing, 100 primer pairs were selected, and 82 of these primer pairs successfully amplified DNA from the Schizachyrium accessions. Only 35 primer pairs, generating 102 scored fragments, were polymorphic among S. scoparium accessions. Twenty-two primer pairs generated more than four fragments per accession. A repetitive sequence identifier found that, of 117 examined sequences, only nine had no similarity to DNA transposons, retrotransposons, viruses, or satellite sequences. The most frequently identified fragments were the long terminal repeat retrotransposons Gypsy (177 fragments) and Copia (98 fragments) and the DNA transposon EnSpm (60 fragments). Using the software program Structure, cluster analysis of the SSR data for S. scoparium revealed four groups. The lowest genetic similarity between little bluestem samples was 86%, which was surprising because a high degree of morphological variation is seen in this species. Furthermore, no variation in ploidy level was seen among little bluestem samples. These microsatellite markers are the first sequence-specific markers designed for little bluestem and can serve as a resource for future genetic studies.


1996 ◽  
Vol 13 (1) ◽  
pp. 41-44
Author(s):  
Gautam Sarkar ◽  
Anish Deb ◽  
Sunit K Sen

1971 ◽  
Vol 92 (1) ◽  
pp. 151-152
Author(s):  
A. Kokeš
