KusakiDB v1.0: a novel approach for validation and completeness of protein orthologous groups

2020 ◽  
Author(s):  
Andrea Ghelfi ◽  
Yasukazu Nakamura ◽  
Sachiko Isobe

Summary Plants have quite a low coverage in the major protein databases despite their roughly 350,000 species. Moreover, the agricultural sector is one of the main categories in the bioeconomy. In order to manipulate and/or engineer plant-based products, it is important to understand the essential fabric of an organism, its proteins. Therefore, we created KusakiDB, a database of orthologous proteins in plants that correlates three major databases: OrthoDB, UniProt and RefSeq. KusakiDB provides ortholog assessment and management tools for comparing orthologous groups, which can offer insights not only from an evolutionary point of view but also into the quality and completeness of structural gene prediction among plant species. KusakiDB could be a new approach to reducing error propagation in the functional annotation of plant species. Additionally, this method could potentially bring to light orthologs unique to a few species or families that could have evolved at a high evolutionary rate or could have resulted from horizontal gene transfer. Availability and Implementation The software is implemented in R. It is available at http://pgdbjsnp.kazusa.or.jp/app/kusakidb and at https://hub.docker.com/r/ghelfi/kusakidb under the MIT license. Contact: [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

2016 ◽  
Author(s):  
Stephen G. Gaffney ◽  
Jeffrey P. Townsend

Abstract Summary PathScore quantifies the level of enrichment of somatic mutations within curated pathways, applying a novel approach that identifies pathways enriched across patients. The application provides several user-friendly, interactive graphic interfaces for data exploration, including tools for comparing pathway effect sizes, significance, gene-set overlap and enrichment differences between projects. Availability and Implementation Web application available at pathscore.publichealth.yale.edu. Site implemented in Python and MySQL, with all major browsers supported. Source code available at github.com/sggaffney/pathscore with a GPLv3 license. Contact: [email protected] Supplementary Information Additional documentation can be found at http://pathscore.publichealth.yale.edu/faq.


2019 ◽  
Vol 35 (22) ◽  
pp. 4537-4542 ◽  
Author(s):  
Katelyn McNair ◽  
Carol Zhou ◽  
Elizabeth A Dinsdale ◽  
Brian Souza ◽  
Robert A Edwards

Abstract Motivation Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes gene prediction difficult for the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths, where open reading frames are favorable, and overlaps and gaps are less favorable but still possible. We represent this network of connections as a weighted graph, and use dynamic programming to find the optimal path. Results We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation https://github.com/deprekate/PHANOTATE Supplementary information Supplementary data are available at Bioinformatics online.
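The path-finding idea described in this abstract can be illustrated with a toy sketch. This is not PHANOTATE's implementation: the function name `best_orf_path`, the ORF scores, the linear gap and overlap penalties, and the `max_overlap` cutoff are all assumptions made for illustration. Candidate ORFs are nodes with a negative (favorable) score, gaps and overlaps between consecutive ORFs are penalized edges, and dynamic programming over the ORFs sorted by start position finds the cheapest chain.

```python
from typing import List, Tuple

# (start, end, score): a more negative score marks a more favourable ORF.
# Scores and penalties are hypothetical, chosen only for illustration.
ORF = Tuple[int, int, float]


def best_orf_path(orfs: List[ORF], genome_len: int,
                  gap_cost: float = 0.01, overlap_cost: float = 0.05,
                  max_overlap: int = 30) -> Tuple[float, List[ORF]]:
    """Return the minimum-cost chain (at least one ORF) across the genome."""
    if not orfs:
        return gap_cost * genome_len, []
    orfs = sorted(orfs)  # by start, then end
    n = len(orfs)
    cost = [0.0] * n   # cheapest chain ending with ORF i
    prev = [-1] * n    # predecessor of ORF i on that chain
    for i, (s, e, w) in enumerate(orfs):
        cost[i] = gap_cost * s + w  # ORF i opens the chain after a leading gap
        for j in range(i):
            sj, ej, _ = orfs[j]
            if sj >= s or ej >= e:
                continue  # a predecessor must start and end earlier
            d = s - ej  # positive: gap length; negative: overlap length
            if d >= 0:
                link = gap_cost * d
            elif -d <= max_overlap:
                link = overlap_cost * -d
            else:
                continue  # overlap too long to be allowed
            cand = cost[j] + link + w
            if cand < cost[i]:
                cost[i], prev[i] = cand, j
    # Close each chain with the trailing gap and pick the cheapest one.
    totals = [cost[i] + gap_cost * (genome_len - orfs[i][1]) for i in range(n)]
    end = min(range(n), key=totals.__getitem__)
    path = []
    i = end
    while i != -1:
        path.append(orfs[i])
        i = prev[i]
    path.reverse()
    return totals[end], path


orfs = [(0, 100, -5.0), (50, 150, -1.0), (90, 200, -4.0), (210, 300, -3.0)]
total, path = best_orf_path(orfs, genome_len=300)
print(path)
```

In this toy input the weak ORF at 50-150 is skipped because its overlaps with its neighbours exceed the assumed cutoff, while the short overlap between the first and third ORFs is tolerated at a small cost.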


2019 ◽  
Author(s):  
Magdalena E Strauss ◽  
Paul DW Kirk ◽  
John E Reid ◽  
Lorenz Wernisch

Abstract Motivation Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. Results The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering methods, efficient MCMC sampling, and novel subsampling strategies which aid computation. We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. Availability An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/[email protected] Supplementary information Supplementary materials are available.


2019 ◽  
Author(s):  
Patrick Sorn ◽  
Christoph Holtsträter ◽  
Martin Löwer ◽  
Ugur Sahin ◽  
David Weber

Abstract Motivation Gene fusions are an important class of transcriptional variants that can influence cancer development and can be predicted from RNA sequencing (RNA-seq) data by multiple existing tools. However, the real-world performance of these tools is unclear due to the lack of known positive and negative events, especially with regard to fusion genes in individual samples. Often simulated reads are used, but these cannot account for all technical biases in RNA-seq data generated from real samples. Results Here, we present ArtiFuse, a novel approach that simulates fusion genes by sequence modification to the genomic reference and can therefore be applied to any RNA-seq dataset without the need for any simulated reads. We demonstrate our approach on eight RNA-seq datasets for three fusion gene prediction tools: average recall values peak for all three tools between 0.4 and 0.56 for high-quality and high-coverage datasets. As ArtiFuse affords total control over involved genes and breakpoint position, we also assessed performance with regard to gene-related properties, showing a drop in recall value for low-expressed genes in high-coverage samples and genes with co-expressed paralogues. Overall tool performance assessed from ArtiFusions is lower compared to previously reported estimates on simulated reads. Due to the use of real RNA-seq datasets, we believe that ArtiFuse provides a more realistic benchmark that can be used to develop more accurate fusion gene prediction tools for application in clinical settings. Availability and implementation ArtiFuse is implemented in Python. The source code and documentation are available at https://github.com/TRON-Bioinformatics/ArtiFusion. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Vladislava Milchevskaya ◽  
Alexei M. Nikitin ◽  
Sergey A. Lukshin ◽  
Ivan V. Filatov ◽  
Yuri V. Kravatsky ◽  
...  

Abstract Motivation Local protein structure is usually described by assigning each peptide to a unique element from a set of pre-defined structures. These so-called structural alphabets may differ in the number of structures or the length of peptides. Most methods that predict the local structure of a protein from its sequence rely on this kind of classification. However, since all peptides assigned to the same class are indistinguishable, such an approach may not be sufficient to model protein folding with high accuracy. Results We developed a method that predicts the structural representation of a peptide from its sequence. For 5-mer peptides, we achieved a Q16 classification accuracy of 67.9%, which is higher than what is currently reported in the literature. Importantly, our prediction method does not utilize information about protein homologues; it relies on a comprehensive feature-generation procedure based only on the protein sequence, the physicochemical properties of the amino acids and the statistics of resolved structures. We also show that the 3D coordinates of a peptide can be uniquely recovered from its structural coordinates, and show the required conditions for that under various geometric constraints. Availability The online implementation of the method is provided freely at http://[email protected] or [email protected] Supplementary information Supplementary data are available online at http://pbpred.eimb.ru/S/index.html


2018 ◽  
Author(s):  
Hong-Dong Li ◽  
Yunpei Xu ◽  
Xiaoshu Zhu ◽  
Quan Liu ◽  
Gilbert S. Omenn ◽  
...  

Abstract Motivation Clustering analysis is essential for understanding complex biological data. In widely used methods such as hierarchical clustering (HC) and consensus clustering (CC), expression profiles of all genes are often used to assess similarity between samples for clustering. These methods output sample clusters but cannot indicate which gene sets (functions) contribute most to the clustering, so the interpretability of their results is limited. We hypothesized that integrating prior knowledge of annotated biological processes would not only achieve satisfying clustering performance but also, more importantly, enable potential biological interpretation of clusters. Results Here we report ClusterMine, a novel approach that identifies clusters by assessing functional similarity between samples through integrating known annotated gene sets, e.g., in Gene Ontology. In addition to outputting cluster membership of each sample as conventional approaches do, it outputs gene sets that are most likely to contribute to the clustering, a feature facilitating biological interpretation. Using three cancer datasets, two single cell RNA-sequencing based cell differentiation datasets, one cell cycle dataset and two datasets of cells of different tissue origins, we found that ClusterMine achieved similar or better clustering performance and that top-scored gene sets prioritized by ClusterMine are biologically relevant. Implementation and availability ClusterMine is implemented as an R package and is freely available at: www.genemine.org/[email protected] Supplementary Information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (22) ◽  
pp. 4624-4631 ◽  
Author(s):  
Xin Li ◽  
Samaneh Saadat ◽  
Haiyan Hu ◽  
Xiaoman Li

Abstract Motivation Bacterial haplotype reconstruction is critical for selecting proper treatments for diseases caused by unknown haplotypes. Existing methods and tools do not work well on this task, because they are usually developed for viral instead of bacterial populations. Results In this study, we developed BHap, a novel algorithm based on fuzzy flow networks, for reconstructing bacterial haplotypes from next generation sequencing data. Tested on simulated and experimental datasets, BHap was capable of reconstructing haplotypes of bacterial populations with an average F1 score of 0.87, an average precision of 0.87 and an average recall of 0.88. We also demonstrated that BHap had a low susceptibility to sequencing errors, was capable of reconstructing haplotypes with low coverage and could handle a wide range of mutation rates. Compared with existing approaches, BHap achieved higher F1 scores, better precision, better recall and more accurate estimation of the number of haplotypes. Availability and implementation The BHap tool is available at http://www.cs.ucf.edu/~xiaoman/BHap/. Supplementary information Supplementary data are available at Bioinformatics online.
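For readers unfamiliar with the metrics quoted in this abstract, the F1 score is the harmonic mean of precision and recall. The short sketch below is generic arithmetic, not code from BHap; the function name `f1_score` and the two-dataset example are assumptions for illustration. It also shows why an average of per-dataset F1 scores need not equal the F1 computed from averaged precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 computed directly from the averaged precision and recall in the abstract.
print(round(f1_score(0.87, 0.88), 3))

# Hypothetical per-dataset (precision, recall) pairs: averaging F1 scores
# differs from taking the F1 of the averaged precision and recall.
pairs = [(0.9, 0.6), (0.6, 0.9)]
mean_f1 = sum(f1_score(p, r) for p, r in pairs) / len(pairs)
f1_of_means = f1_score(sum(p for p, _ in pairs) / len(pairs),
                       sum(r for _, r in pairs) / len(pairs))
print(round(mean_f1, 2), round(f1_of_means, 2))
```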


2019 ◽  
Author(s):  
Andrew Whalen ◽  
Gregor Gorjanc ◽  
John M Hickey

Abstract Summary AlphaFamImpute is an imputation package for calling, phasing, and imputing genome-wide genotypes in outbred full-sib families from single nucleotide polymorphism (SNP) array and genotype-by-sequencing (GBS) data. GBS data is increasingly being used to genotype individuals, especially when SNP arrays do not exist for a population of interest. Low-coverage GBS produces data with a large number of missing or incorrect naïve genotype calls, which can be improved by identifying shared haplotype segments between full-sib individuals. Here we present AlphaFamImpute, an algorithm specifically designed to exploit the genetic structure of full-sib families. It performs imputation using a two-step approach. In the first step it phases and imputes parental genotypes based on the segregation states of their offspring (that is, which pair of parental haplotypes the offspring inherited). In the second step it phases and imputes the offspring genotypes by detecting which haplotype segments the offspring inherited from their parents. With a series of simulations we find that AlphaFamImpute obtains high accuracy genotypes, even when the parents are not genotyped and individuals are sequenced at less than 1x coverage. Availability and implementation AlphaFamImpute is available as a Python package from the AlphaGenes website, http://www.AlphaGenes.roslin.ed.ac.uk/[email protected] Supplementary information A complete description of the methods is available in the supplementary information.


2020 ◽  
Vol 17 ◽  
pp. 00124
Author(s):  
Elena P. Polikarpova ◽  
Igor E. Mizikovskiy

Modern science and practice do not have a sufficient set of cost management tools that take into account the duration of the production cycle characteristic of agricultural activity. The implementation of a cycle-oriented approach to building a model of production costs was based on studying the existing options for classifying production costs, which were supplemented with features from the perspective of managing long production cycles. As a result of the study, a model of production costs was built from the point of view of a cycle-oriented approach, as well as a model of production costs from the standpoint of the features of a long production cycle. The model can serve as the basis for forming the information space for cost management, control and cost analysis in the economy of agricultural enterprises.


2004 ◽  
Vol 52 (1) ◽  
pp. 37-44
Author(s):  
M. Shokri ◽  
N. Safaian ◽  
M. Z. Ahmadi

Because wetlands cover considerable areas of the world, the wise, sustainable use of these lands is of major importance for ecologists and agriculturists. As the presence of indicator species and plant communities can be a measure of the compatibility between plants and edaphic conditions in these regions, the ecological niches of plant species in part of the southern coastal areas of the Caspian Sea have been studied to show the correlation of each species with its own habitat. The plant communities were separated with Ward's cluster analysis. The correlation of these communities and plant species with environmental factors was investigated with the CCA method, using PC-Ordination-4 software. The results showed that the soil EC, water table, soil pH, SAR and ESP were 14-157 dS/m, 0-240 cm, 6.5-8.5, 13.4-84.8 and 2-55%, respectively. This range of values, in addition to creating ecological niches for species with different ecological roles, was also effective in the formation of plant communities. The analysis of vegetation and soil data with the CCA method showed the relationships between soil factors and vegetation. In spite of the dominance of the species Halocnemum strobilaceum in all the plant communities, the correlation of this species with plant species such as Aeluropus littoralis, Salicornia europaea, Aeluropus lagopoides, Salsola aurantia and Puccinellia distans in relation to changes in EC, water table, pH, SAR and ESP is important from the point of view of sustaining the physical environment and ecological function. The simplification of these ecosystems (by drainage, agriculture, etc.) may disturb the natural equilibrium. As these ecosystems are susceptible and changes in their use are costly from the ecological and economic points of view, the wise use of ecosystems in their natural forms (rangelands and habitats) is recommended to prevent the spread of salinity and to protect habitats and biodiversity.

