fLPS 2.0: rapid annotation of compositionally-biased regions in biological sequences

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12363
Author(s):  
Paul M. Harrison

Compositionally-biased (CB) regions in biological sequences are enriched for a subset of sequence residue types. These can be shorter regions with a concentrated bias (i.e., those termed ‘low-complexity’), or longer regions that have a compositional skew. These regions comprise a prominent class of the uncharacterized ‘dark matter’ of the protein universe. Here, I report the latest version of the fLPS package for the annotation of CB regions, which includes added consideration of DNA sequences, to label the eight possible biased regions of DNA. In this version, the user is now able to restrict analysis to a specified subset of residue types, and also to filter for previously annotated domains to enable detection of discontinuous CB regions. A ‘thorough’ option has been added which enables the labelling of subtler biases, typically made from a skew for several residue types. In the output, protein CB regions are now labelled with bias classes reflecting the physico-chemical character of the biasing residues. The fLPS 2.0 package is available from: https://github.com/pmharrison/flps2 or in a Supplemental File of this paper.
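To make the idea of a compositionally-biased region concrete, the following minimal sketch scores fixed-length windows by how unlikely the observed count of one residue type is under a binomial model. It is an illustrative assumption, not a description of how fLPS works internally; the function names, the fixed window length, and the background frequency are hypothetical.

```python
from math import comb


def binomial_tail(k, n, f):
    """Return P(X >= k) for X ~ Binomial(n, f)."""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


def most_biased_window(seq, residue, background, window=15):
    """Return (start, count, p_value) for the window most enriched in `residue`.

    `background` is the assumed background frequency of the residue.
    Returns None if the sequence is shorter than the window.
    """
    best = None
    for start in range(len(seq) - window + 1):
        count = seq[start:start + window].count(residue)
        p = binomial_tail(count, window, background)
        # Keep the first window with the strictly lowest probability.
        if best is None or p < best[2]:
            best = (start, count, p)
    return best


# Hypothetical protein fragment containing a glutamine-rich tract.
example = "ACDEFGHIKLQQQQQQQQQQMNPRSTVWY"
print(most_biased_window(example, "Q", background=0.05, window=10))
```

A lower probability indicates a stronger bias; a real annotation tool would also vary the window length and merge overlapping hits.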

2018 ◽  
Author(s):  
Akosua Busia ◽  
George E. Dahl ◽  
Clara Fannjiang ◽  
David H. Alexander ◽  
Elizabeth Dorfman ◽  
...  

Motivation: Inferring properties of biological sequences, such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence, is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach which trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results: We demonstrate this DNN performs at state-of-the-art or above levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k-mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on an experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability: TensorFlow training code is available through GitHub (https://github.com/tensorflow/models/tree/master/research). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/). Contact: [email protected] Supplementary information: Supplementary data are available in a separate document.


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 54 ◽  
Author(s):  
Anil S. Thanki ◽  
Shabhonam Caim ◽  
Manuel Corpas ◽  
Robert P. Davey

Summary: Compositional GC/AT content of DNA sequences is a useful feature in genome analysis. GC/AT content provides useful information about evolution, structure and function of genomes, giving clues about their biological function and organisation. We have developed DNAContentViewer, a BioJS component for visualisation of compositional GC/AT content in raw sequences. DNAContentViewer has been integrated in the BioJS project as part of the BioJS registry of components. DNAContentViewer requires a simple configuration and installation. Its design allows potential interactions with other components via predefined events. Availability: http://github.com/biojs/biojs; doi: 10.5281/zenodo.7722.
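As a rough illustration of the quantity such a viewer plots, the sketch below computes GC content over a sliding window. It is written in Python for brevity and is not the DNAContentViewer code (which is a BioJS JavaScript component); the function names, window size, and step are assumptions.

```python
def gc_content(seq):
    """Fraction of G and C bases in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window=4, step=1):
    """GC content of each full window of length `window`, moving by `step`."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


# Hypothetical raw sequence; AT content is simply 1 - GC for unambiguous bases.
print(sliding_gc("AAGGCC", window=2, step=2))
```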


Author(s):  
Vladimir D. Gusev ◽  
Liubov A. Miroshnichenko

An important quantitative characteristic of symbolic sequences (texts, strings) is complexity, which reflects at the intuitive level the degree of their "non-randomness". A.N. Kolmogorov formulated the most general definition of complexity. He proposed measuring the complexity of an object (symbolic sequence) by the length of the shortest description from which this object can be uniquely reconstructed. Since no program is guaranteed to find the shortest description, in practice various algorithmic approximations, considered in this paper, are used for this purpose. Along with definitions of complexity that imply the possibility of reconstructing a sequence from its "description", a number of measures are considered that do not imply such restoration. They are based on the calculation of some quantitative characteristics. Of interest is not only a quantitative assessment of complexity, but also the identification and classification of the structural regularities that determine its specific value. In one form or another, they are expressed as repetition in the broadest sense. The considered measures of complexity are conventionally divided into "statistical" ones, which take into account the frequency of occurrence of symbols or short "words" in the text; "dictionary" ones, which estimate the number of different "subwords"; and "structural" ones, based on the identification of long repeating fragments of text and the determination of relationships between them. Most of the methods are designed for sequences of an arbitrary linguistic nature. The special attention paid to DNA sequences, reflected in the title of the article, is due to the importance of the object, manifestations of repetition of different types, and numerous examples of using the concept of complexity in solving problems of classification and evolution of various biological objects. 
Local structural features found in the sliding window mode in DNA sequences are of considerable interest, since zones of low complexity in the genomes of various organisms are often associated with the regulation of basic genetic processes.
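Two of the measure families described above can be sketched in a few lines: a "statistical" measure (Shannon entropy of symbol frequencies) and a "dictionary" measure (the number of distinct subwords of length k), with the entropy also evaluated in sliding-window mode. This is a minimal illustration under assumed function names and parameters, not the specific measures defined by the authors.

```python
from collections import Counter
from math import log2


def shannon_entropy(text):
    """Shannon entropy (bits per symbol) of the symbol frequencies in `text`."""
    n = len(text)
    counts = Counter(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def distinct_subwords(text, k):
    """Number of different subwords of length k occurring in `text`."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def windowed_entropy(seq, window):
    """Entropy of every full window; low values flag low-complexity zones."""
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


# Hypothetical sequence: a homopolymer run followed by a mixed segment.
profile = windowed_entropy("AAAAACGT", window=4)
print(profile)
print(distinct_subwords("AAAAACGT", 2))
```

The "structural" measures (long repeats and relationships between them) need suffix-based data structures and are not shown here.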


2017 ◽  
Author(s):  
Sneha Mitra ◽  
Anushua Biswas ◽  
Leelavati Narlikar

A high-throughput chromatin immunoprecipitation (ChIP) experiment is like a black box: it reports all regions that are associated with the profiled protein based on the initial cross-linking step. These regions can be a highly diverse set of DNA sequences, with some making direct contact with the protein, some binding through intermediaries, and some being a result of long-range interactions involving the protein. We present DIVERSITY, a method that identifies the distinct components of such a mixture, leaving no data behind, while at the same time using no prior motif knowledge. Using the example of the REST protein, we show that these different components give insights into the various complexes that may be forming along the chromatin and their regulatory functions. Availability: http://diversity.ncl.res.in/ (webserver); https://github.com/NarlikarLab/DIVERSITY (standalone for Mac OSX/Linux).


2016 ◽  
Author(s):  
Genivaldo Gueiros Z. Silva ◽  
Bas E. Dutilh ◽  
Robert A. Edwards

Summary: Metagenomics approaches rely on identifying the presence of organisms in the microbial community from a set of unknown DNA sequences. Sequence classification has valuable applications in multiple important areas of medical and environmental research. Here we introduce FOCUS2, an update of the previously published computational method FOCUS. FOCUS2 was tested with 10 simulated and 543 real metagenomes, demonstrating that the program is more sensitive, faster, and more computationally efficient than existing methods. Availability: The Python implementation is freely available at https://edwards.sdsu.edu/FOCUS2. Supplementary information: available at Bioinformatics online.


2021 ◽  
Author(s):  
Matthias I Gröschel ◽  
Martin Owens ◽  
Luca Freschi ◽  
Roger Vargas ◽  
Maximilian G Marin ◽  
...  

Introduction: Multidrug-resistant Mycobacterium tuberculosis (Mtb) is a significant global public health threat. Genotypic resistance prediction from Mtb DNA sequences offers an alternative to laboratory-based drug-susceptibility testing. User-friendly and accurate resistance prediction tools are needed to enable public health and clinical practitioners to rapidly diagnose resistance and inform treatment regimens. Methods: We present Translational Genomics platform for Tuberculosis (GenTB), a web-based application to predict antibiotic resistance from next-generation sequence data. The user can choose between two potential predictors, a Random Forest (RF) classifier and a Wide and Deep Neural Network (WDNN), to predict phenotypic resistance to 13 and 10 anti-tuberculosis drugs, respectively. We benchmark GenTB’s predictive performance along with leading TB resistance prediction tools (Mykrobe and TB-Profiler) using a ground truth dataset of 20,408 isolates with laboratory-based drug susceptibility data. Results: All four tools reliably predicted resistance to first-line tuberculosis drugs but had varying performance for second-line drugs. The mean sensitivities for GenTB-RF and GenTB-WDNN across the nine shared drugs were 77.6% (95% CI 76.6 - 78.5%) and 75.4% (95% CI 74.5 - 76.4%) respectively, marginally higher than the sensitivities of TB-Profiler at 74.4% (95% CI 73.4 - 75.3%) and Mykrobe at 71.9% (95% CI 70.9 - 72.9%). The higher sensitivities came at the expense of ≤1.5% lower specificity: Mykrobe 97.6% (95% CI 97.5 - 97.7%), TB-Profiler 96.9% (95% CI 96.7 - 97.0%), GenTB-WDNN 96.2% (95% CI 96.0 - 96.4%), and GenTB-RF 96.1% (95% CI 96.0 - 96.3%). Genotypic resistance sensitivity was 11% and 9% lower for isoniazid and rifampicin respectively on isolates sequenced at low depth (<10x across 95% of the genome), emphasizing the need to quality control input sequence data before prediction. 
We discuss differences between tools in reporting results to the user, including variants underlying the resistance calls and any novel or indeterminate variants. Conclusion: GenTB is an easy-to-use online tool to rapidly and accurately predict resistance to anti-tuberculosis drugs. GenTB can be accessed online at https://gentb.hms.harvard.edu, and the source code is available at https://github.com/farhat-lab/gentb-site.
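The sensitivity and specificity figures quoted above follow the standard definitions, which can be sketched as below. The helper names and the toy labels are hypothetical and are not taken from the GenTB benchmark; True stands for "resistant".

```python
def confusion_counts(truth, predicted):
    """Count (tp, fn, tn, fp) from paired boolean labels, True = resistant."""
    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t and p:
            tp += 1
        elif t and not p:
            fn += 1
        elif not t and p:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """Fraction of truly resistant isolates that were predicted resistant."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of truly susceptible isolates that were predicted susceptible."""
    return tn / (tn + fp)


# Hypothetical phenotypes (truth) and genotypic predictions for 8 isolates.
truth = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, False, False, False, False, True]
tp, fn, tn, fp = confusion_counts(truth, predicted)
print(sensitivity(tp, fn), specificity(tn, fp))
```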


2021 ◽  
Author(s):  
Boqiao Lai ◽  
Sheng Qian ◽  
Hanwen Zhang ◽  
Siwei Zhang ◽  
Alena Kozlova ◽  
...  

Decoding the regulatory effects of non-coding variants is a key challenge in understanding the mechanisms of gene regulation as well as the genetics of common diseases. Recently, deep learning models have been introduced to predict genome-wide epigenomic profiles and effects of DNA variants in various cellular contexts, but they were often trained in cell lines or bulk tissues that may not be related to phenotypes of interest. This is particularly a challenge for neuropsychiatric disorders, since the most relevant cell and tissue types are often missing in the training data of such models. To address this issue, we introduce a deep transfer learning framework termed MetaChrom that takes advantage of both a reference dataset (an extensive compendium of publicly available epigenomic data) and epigenomic profiles of cell types related to specific phenotypes of interest. We trained and evaluated our model on a comprehensive set of epigenomic profiles from fetal and adult brain, and cellular models representing early neurodevelopment. MetaChrom predicts these epigenomic features with much higher accuracy than previous methods, and than models without the use of reference epigenomic data for transfer learning. Using experimentally determined regulatory variants from iPS cell-derived neurons, we show that MetaChrom predicts functional variants more accurately than existing non-coding variant scoring tools. By combining genome-wide association study (GWAS) data with MetaChrom predictions, we prioritized 31 SNPs for Schizophrenia (SCZ). These candidate SNPs suggest potential risk genes of SCZ and the biological contexts where they act. In summary, MetaChrom is a general transfer learning framework that can be applied to the study of regulatory functions of DNA sequences and variants in any disease-related cell or tissue types. 
The software tool is available at https://github.com/bl-2633/MetaChrom and a prediction web server is accessible at https://metachrom.ttic.edu/.


2021 ◽  
Vol 7 ◽  
pp. e770
Author(s):  
Zhonghua Hong ◽  
Ziyang Fan ◽  
Xiaohua Tong ◽  
Ruyan Zhou ◽  
Haiyan Pan ◽  
...  

The COVID-19 pandemic is the most serious catastrophe since the Second World War. To predict the epidemic more accurately under the influence of policies, a framework based on the Independently Recurrent Neural Network (IndRNN) with fine-tuning is proposed to predict the epidemic development trend of confirmed cases and deaths in the United States, India, Brazil, France, Russia, China, and the world to late May 2021. The proposed framework consists of four main steps: data pre-processing, model pre-training and weight saving, weight fine-tuning, and trend prediction and validation. It is concluded that the proposed framework, based on IndRNN and fine-tuning, combines high speed and low complexity with great fitting and prediction performance. The applied fine-tuning strategy can effectively reduce the error by up to 20.94%, as well as the time cost. For most of the countries, the MAPEs of the fine-tuned IndRNN model were less than 1.2% during the testing phase; the minimum MAPE and RMSE, 0.05% and 1.17 respectively, were obtained for deaths in China. According to the prediction and validation results, the MAPEs of the proposed framework were less than 6.2% in most cases, and it generated the lowest MAPE and RMSE values of 0.05% and 2.14, respectively, for deaths in China. Moreover, policies that play an important role in the development of COVID-19 have been summarized. Timely and appropriate measures can greatly reduce the spread of COVID-19; untimely and inappropriate government policies, lax regulations, and insufficient public cooperation are the reasons for the aggravation of the epidemic situation. The code is available at https://github.com/zhhongsh/COVID19-Precdiction. The predictions of the IndRNN model with fine-tuning are now available online (http://47.117.160.245:8088/IndRNNPredict).
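The two error metrics reported above, MAPE and RMSE, have standard definitions that can be sketched as follows. The function names and the toy numbers are assumptions for illustration and are not taken from the paper's code or data.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical cumulative counts and model predictions.
actual = [100, 200]
predicted = [110, 180]
print(mape(actual, predicted), rmse(actual, predicted))
```

MAPE is scale-free, which makes it comparable across countries with very different case counts, whereas RMSE depends on the magnitude of the series.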


2021 ◽  
Author(s):  
Klara Kuret ◽  
Aram Gustav Amalietti ◽  
Jernej Ule

Background: Crosslinking and immunoprecipitation (CLIP) is a method used to identify in vivo RNA–protein binding sites on a transcriptome-wide scale. With the increasing amounts of available data for RNA-binding proteins (RBPs), it is important to understand to what degree the enriched motifs specify the RNA binding profiles of RBPs in cells. Results: We develop positionally-enriched k-mer analysis (PEKA), a computational tool for efficient analysis of enriched motifs from individual CLIP datasets, which minimises the impact of technical and regional genomic biases by internal data normalisation. We cross-validate PEKA with mCross, and show that background correction by size-matched input does not generally improve the specificity of detected motifs. We identify motif classes with common enrichment patterns across eCLIP datasets and across RNA regions, while also observing variations in the specificity and the extent of motif enrichment across eCLIP datasets, between variant CLIP protocols, and between CLIP and in vitro binding data. Thereby we gain insights into the contributions of technical and regional genomic biases to the enriched motifs, and find how motif enrichment features relate to the domain composition and low-complexity regions (LCRs) of the studied proteins. Conclusions: Our study provides insights into the overall contributions of regional binding preferences, protein domains and LCRs to the specificity of protein-RNA interactions, and shows the value of cross-motif and cross-RBP comparison for data interpretation. Our results are presented for exploratory analysis via an online platform in an RBP-centric and motif-centric manner (https://imaps.goodwright.com/apps/peka/). PEKA is available from https://github.com/ulelab/peka.
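The general notion of k-mer enrichment can be illustrated with a simple foreground-versus-background frequency ratio. This sketch is deliberately generic: it does not implement PEKA's positional analysis or its internal normalisation, and the function names, the pseudocount, and the toy sequences are assumptions.

```python
from collections import Counter


def kmer_frequencies(seqs, k):
    """Relative frequency of every k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def kmer_enrichment(foreground, background, k, pseudo=1e-6):
    """Ratio of foreground to background frequency for each foreground k-mer.

    `pseudo` is a small assumed constant that avoids division by zero
    for k-mers absent from the background.
    """
    fg = kmer_frequencies(foreground, k)
    bg = kmer_frequencies(background, k)
    return {kmer: f / (bg.get(kmer, 0.0) + pseudo) for kmer, f in fg.items()}


# Hypothetical sequences around crosslink sites versus control regions.
scores = kmer_enrichment(["UGUGUG"], ["UGUGAA", "AAAAAA"], k=2)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```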


2019 ◽  
Author(s):  
Akshara Pande ◽  
Sumeet Patiyal ◽  
Anjali Lathwal ◽  
Chakit Arora ◽  
Dilraj Kaur ◽  
...  

Motivation: In the last three decades, a wide range of protein descriptors/features have been discovered to annotate a protein with high precision. A wide range of features have been integrated in numerous software packages (e.g., PROFEAT, PyBioMed, iFeature, protr, Rcpi, propy) to predict the function of a protein. These features are not suitable for predicting the function of a protein at the residue level, such as prediction of ligand binding residues, DNA interacting residues, post-translational modifications, etc. Results: To assist the scientific community, we have developed a software package that computes more than 50,000 features important for predicting the function of a protein and its residues. It has five major modules for computing composition-based features, binary profiles, evolutionary information, structure-based features and patterns. The composition-based module allows the user to compute: i) simple compositions like amino acid, dipeptide, tripeptide; ii) property-based compositions; iii) repeats and distribution of amino acids; iv) Shannon entropy to measure the low complexity regions; v) miscellaneous compositions like pseudo amino acid, autocorrelation, conjoint triad, quasi-sequence order. The binary profile of an amino acid sequence provides complete information, including the order and type of residues, and is specifically suitable for predicting the function of a protein at the residue level. Pfeature allows one to compute evolutionary information-based features in the form of a PSSM profile generated using PSIBLAST. The structure-based module allows computing structure-based features, specifically suitable for annotating chemically modified peptides/proteins. Pfeature also allows generating overlapping patterns and features from a whole protein or its parts (e.g., N-terminal, C-terminal). 
In summary, Pfeature comprises almost all features used till now for predicting the function of a protein/peptide, including its residues. Availability: It is available in the form of a web server named Pfeature (https://webs.iiitd.edu.in/raghava/pfeature/), as well as a Python library and standalone package (https://github.com/raghavagps/Pfeature) suitable for Windows, Ubuntu, Fedora, MacOS and CentOS based operating systems.
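The simplest of the composition-based features listed above, amino acid and dipeptide composition, can be sketched as percentages over a sequence. This is a standalone illustration and does not use or reproduce the Pfeature API; the function names and the example peptide are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Percentage of each standard amino acid in `seq` (all zeros if empty)."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    if n == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100 * counts[aa] / n for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Percentage of each observed overlapping dipeptide in `seq`."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {dp: 100 * c / total for dp, c in pairs.items()}


# Hypothetical short peptide.
print(amino_acid_composition("AAGG")["A"])
print(dipeptide_composition("AAGG"))
```

Each protein is thereby mapped to a fixed-length numeric vector (20 values for amino acid composition), which is the form most machine-learning classifiers expect.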

