EPIP: a novel approach for condition-specific enhancer–promoter interaction prediction

Abstract Motivation The identification of enhancer–promoter interactions (EPIs), especially condition-specific ones, is important for the study of gene transcriptional regulation. Existing experimental approaches for EPI identification are still expensive, and available computational methods either do not consider condition-specific EPIs or predict them poorly. Results We developed a novel computational method called EPIP to reliably predict EPIs, especially condition-specific ones. EPIP is capable of predicting interactions in samples with limited data as well as in samples with abundant data. Tested on more than eight cell lines, EPIP reliably identifies EPIs, with an average area under the receiver operating characteristic curve of 0.95 and an average area under the precision–recall curve of 0.73. Tested on condition-specific EPIs, EPIP correctly identified 99.26% of them. Compared with two recently developed methods, EPIP achieves better accuracy. Availability and implementation The EPIP tool is freely available at http://www.cs.ucf.edu/~xiaoman/EPIP/. Supplementary information Supplementary data are available at Bioinformatics online.
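The area under the receiver operating characteristic curve quoted in this abstract can be computed directly from labels and prediction scores. The sketch below is only an illustration of the metric, not EPIP's evaluation code: the function name `auroc` and the labels and scores are hypothetical. It uses the standard identity that the area equals the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = interacting enhancer-promoter pair, 0 = non-interacting pair.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auroc(labels, scores))
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a rank-based implementation would be preferable for large test sets.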


Develop machine learning-based regression predictive models for engineering protein solubility

Bioinformatics ◽

10.1093/bioinformatics/btz294 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4640-4646 ◽

Cited By ~ 10

Author(s):

Xi Han ◽

Xiaonan Wang ◽

Kang Zhou

Keyword(s):

Machine Learning ◽

Protein Solubility ◽

Supplementary Information ◽

Support Vector ◽

Limited Data ◽

Protein Activity ◽

Novel Approach ◽

The Cost ◽

Experimental Improvement

Abstract Motivation Protein activity is a significant characteristic for recombinant proteins which can be used as biocatalysts. High activity of proteins reduces the cost of biocatalysts. A model that can predict protein activity from amino acid sequence is highly desired, as it aids experimental improvement of proteins. However, only limited data for protein activity are currently available, which prevents the development of such models. Since protein activity and solubility are correlated for some proteins, the publicly available solubility dataset may be adopted to develop models that can predict protein solubility from sequence. The models could serve as a tool to indirectly predict protein activity from sequence. In the literature, predicting protein solubility from sequence has been intensively explored, but the developed models all predict solubility as a binary value, which is not suitable for guiding experimental designs to improve protein solubility. Here we propose new machine learning (ML) models for improving protein solubility in vivo. Results We first implemented a novel approach that predicted protein solubility in continuous numerical values instead of binary ones. After combining it with various ML algorithms, we achieved an R² of 0.4115 with the support vector machine algorithm. Continuous values of solubility are more meaningful in protein engineering, as they enable researchers to choose proteins with higher predicted solubility for experimental validation, while binary values fail to distinguish proteins with the same value: there are only two possible values, so many proteins share one. Availability and implementation We present the ML workflow as a series of IPython notebooks hosted on GitHub (https://github.com/xiaomizhou616/protein_solubility). The workflow can be used as a template for analysis of other expression and solubility datasets.
Supplementary information Supplementary data are available at Bioinformatics online.
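To make the reported regression metric concrete, the sketch below computes a coefficient of determination and ranks candidates by predicted solubility. It assumes that the R2 quoted in the abstract is the usual coefficient of determination; the function name, protein names and values are hypothetical and are not taken from the authors' notebooks.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted solubility values on a 0-1 scale.
measured = [0.1, 0.4, 0.6, 0.9]
predicted = [0.2, 0.3, 0.7, 0.8]
print(r_squared(measured, predicted))

# Continuous predictions allow ranking, which binary labels cannot provide.
candidates = {"variant_A": 0.35, "variant_B": 0.72, "variant_C": 0.58}  # hypothetical
ranked = sorted(candidates, key=candidates.get, reverse=True)
print(ranked)
```

The ranking step is the practical point made in the abstract: with continuous outputs, the highest-scoring variants can be picked for validation even when all of them would fall into the same binary class.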


lncLocator 2.0: a cell-line-specific subcellular localization predictor for long non-coding RNAs with interpretable deep learning

Bioinformatics ◽

10.1093/bioinformatics/btab127 ◽

2021 ◽

Author(s):

Yang Lin ◽

Xiaoyong Pan ◽

Hong-Bin Shen

Keyword(s):

Subcellular Localization ◽

Cell Line ◽

Cell Lines ◽

Short Term Memory ◽

Computational Method ◽

Language Models ◽

Supplementary Information ◽

Deep Model ◽

A Cell ◽

Non Coding Rnas

Abstract Motivation Long non-coding RNAs (lncRNAs) are generally expressed in a tissue-specific way, and subcellular localizations of lncRNAs depend on the tissues or cell lines in which they are expressed. Previous computational methods for predicting subcellular localizations of lncRNAs do not take this characteristic into account; they train a unified machine learning model for pooled lncRNAs from all available cell lines. It is important to develop a cell-line-specific computational method to predict lncRNA locations in different cell lines. Results In this study, we present an updated cell-line-specific predictor, lncLocator 2.0, which trains an end-to-end deep model per cell line for predicting lncRNA subcellular localization from sequences. We first construct benchmark datasets of lncRNA subcellular localizations for 15 cell lines. Then we learn word embeddings using natural language models, and these learned embeddings are fed into a convolutional neural network, long short-term memory and multilayer perceptron to classify subcellular localizations. lncLocator 2.0 achieves varying effectiveness for different cell lines and demonstrates the necessity of training cell-line-specific models. Furthermore, we adopt Integrated Gradients to explain the proposed model in lncLocator 2.0, and find some potential patterns that determine the subcellular localizations of lncRNAs, suggesting that the subcellular localization of lncRNAs is linked to some specific nucleotides. Availability lncLocator 2.0 is available at www.csbio.sjtu.edu.cn/bioinf/lncLocator2 and the source code can be found at https://github.com/Yang-J-LIN/lncLocator2. Supplementary information Supplementary data are available at Bioinformatics online.


Prediction of successful de-cannulation of tracheostomised patients in medical intensive care units

Respiratory Research ◽

10.1186/s12931-021-01732-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Chul Park ◽

Ryoung-Eun Ko ◽

Jinhee Jung ◽

Soo Jin Na ◽

Kyeongman Jeon

Keyword(s):

Logistic Regression ◽

Regression Analysis ◽

Logistic Regression Analysis ◽

Goodness Of Fit ◽

Characteristic Curve ◽

Multivariable Logistic Regression Analysis ◽

Limited Data ◽

Underlying Malignancy ◽

Medical Intensive Care ◽

Good Calibration

Abstract Background Limited data are available on practical predictors of successful de-cannulation among patients who undergo tracheostomies. We evaluated factors associated with failed de-cannulations to develop a prediction model that could easily be used at the time of weaning from mechanical ventilation (MV). Methods In a retrospective cohort of 346 tracheostomised patients managed by a standardized de-cannulation program, multivariable logistic regression analysis identified variables that were independently associated with failed de-cannulation. Based on the logistic regression analysis, a new predictive scoring system for successful de-cannulation, referred to as the DECAN score, was developed and then internally validated. Results The model included age > 67 years, body mass index < 22 kg/m², underlying malignancy, non-respiratory causes of MV, presence of neurologic disease, vasopressor requirement, presence of post-tracheostomy pneumonia, and presence of delirium. The DECAN score was associated with good calibration (goodness-of-fit, 0.6477) and discrimination outcomes (area under the receiver operating characteristic curve 0.890, 95% CI 0.853–0.921). The optimal cut-off point for the DECAN score for the prediction of successful de-cannulation was ≤ 5 points, and was associated with a specificity of 84.6% (95% CI 77.7–90.0) and a sensitivity of 80.2% (95% CI 73.9–85.5). Conclusions The DECAN score for tracheostomised patients who are successfully weaned from prolonged MV can be computed at the time of weaning to assess the probability of de-cannulation based on readily available variables.
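The sensitivity and specificity reported for the ≤ 5 cut-off follow from a simple tally of predictions against outcomes. The sketch below shows that arithmetic only. The scores and outcomes are invented for illustration, the function name is hypothetical, and the per-variable point weights of the DECAN score are not given in the abstract, so no attempt is made to compute the score itself.

```python
def sensitivity_specificity(scores, succeeded, cutoff=5):
    """Treat score <= cutoff as a prediction of successful de-cannulation."""
    tp = fn = tn = fp = 0
    for score, ok in zip(scores, succeeded):
        predicted_success = score <= cutoff
        if ok and predicted_success:
            tp += 1          # success correctly predicted
        elif ok:
            fn += 1          # success missed
        elif predicted_success:
            fp += 1          # failure wrongly predicted as success
        else:
            tn += 1          # failure correctly predicted
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical patients: total score and whether de-cannulation succeeded.
scores = [1, 3, 5, 6, 8, 2, 7, 4]
succeeded = [True, True, True, True, False, False, False, True]
sens, spec = sensitivity_specificity(scores, succeeded)
print(sens, spec)
```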


Multi-model inference of network properties from incomplete data

Journal of Integrative Bioinformatics ◽

10.1515/jib-2006-32 ◽

2006 ◽

Vol 3 (2) ◽

pp. 123-136 ◽

Cited By ~ 3

Author(s):

Michael P. H. Stumpf ◽

Thomas Thorne

Keyword(s):

Network Models ◽

Network Data ◽

Global Networks ◽

New Approach ◽

Model Inference ◽

Novel Approach ◽

Eukaryotic Species ◽

Network Properties ◽

Experimental Approaches ◽

Saccharomyces Cerevisiae

Summary It has previously been shown that subnets differ from the global networks from which they are sampled for all but a very limited number of theoretical network models. These differences are of a qualitative as well as a quantitative nature, and the properties of subnets may be very different from the corresponding properties in the true, unobserved network. Here we propose a novel approach which allows us to infer aspects of the true network from incomplete network data in a multi-model inference framework. We develop the basic theoretical framework, including procedures for assessing confidence intervals of our estimates, and evaluate the performance of this approach in simulation studies and against subnets drawn from the presently available protein interaction network (PIN) data in Saccharomyces cerevisiae. We then illustrate the potential power of this new approach by estimating the number of interactions that will be detectable with present experimental approaches in four eukaryotic species, including humans. Encouragingly, where independent datasets are available we obtain consistent estimates from different partial protein interaction networks. We conclude with a discussion of the scope of this approach and areas for further research.


PathScore: a web tool for identifying altered pathways in cancer data

10.1101/067090 ◽

2016 ◽

Cited By ~ 2

Author(s):

Stephen G. Gaffney ◽

Jeffrey P. Townsend

Keyword(s):

Web Application ◽

Somatic Mutations ◽

Supplementary Information ◽

Web Tool ◽

Cancer Data ◽

Link Type ◽

Novel Approach ◽

Supplementary Material ◽

User Friendly ◽

Pathway Effect

Abstract Summary PathScore quantifies the level of enrichment of somatic mutations within curated pathways, applying a novel approach that identifies pathways enriched across patients. The application provides several user-friendly, interactive graphic interfaces for data exploration, including tools for comparing pathway effect sizes, significance, gene-set overlap and enrichment differences between projects. Availability and Implementation Web application available at pathscore.publichealth.yale.edu. Site implemented in Python and MySQL, with all major browsers supported. Source code available at github.com/sggaffney/pathscore with a GPLv3 license. Supplementary Information Additional documentation can be found at http://pathscore.publichealth.yale.edu/faq.


Nasobiliary Drain Diverted through a Percutaneous Endoscopic Gastrostomy Tube: A Novel Approach to Nasobiliary Drainage

Case Reports in Gastroenterology ◽

10.1159/000518057 ◽

2021 ◽

pp. 891-897

Author(s):

Nikolaos Dimitrios Pantzaris ◽

Tim Lord ◽

Robyn Sotheran ◽

John Hutchinson ◽

Charles Millson

Keyword(s):

Percutaneous Endoscopic Gastrostomy ◽

Treatment Options ◽

Primary Biliary Cholangitis ◽

Percutaneous Endoscopic Gastrostomy Tube ◽

Limited Data ◽

Endoscopic Gastrostomy ◽

Novel Approach ◽

Nasobiliary Drainage ◽

Uvb Phototherapy

Intractable pruritus is a common, debilitating symptom and a well-defined entity occurring in chronic cholestatic disorders. Treatment options include cholestyramine, rifampicin, naltrexone, gabapentin, and sertraline, as well as more interventional measures, such as plasmapheresis, extracorporeal albumin dialysis, nasobiliary drains (NBDs), and UVB phototherapy in patients who fail to respond to medical therapy. Despite the limited data, NBD seems to be a highly effective treatment in the relief of refractory cholestatic pruritus. In this article, we present the case of a 73-year-old woman with primary biliary cholangitis and intractable pruritus, refractory to medical treatment. The patient had a complete resolution of her symptoms following an NBD placement, in which, with a novel approach, the nasal end was redirected and exited through a percutaneous endoscopic gastrostomy port, significantly improving her quality of life.


ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions

Bioinformatics ◽

10.1093/bioinformatics/btz431 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4754-4756 ◽

Cited By ~ 29

Author(s):

Egor Dolzhenko ◽

Viraj Deshpande ◽

Felix Schlesinger ◽

Peter Krusche ◽

Roman Petrovski ◽

...

Keyword(s):

Tandem Repeat ◽

Broad Class ◽

Source Code ◽

Computational Method ◽

Supplementary Information ◽

Dna Repeats ◽

Supplementary Data ◽

Sequence Graph ◽

Version 2.0 ◽

Short Tandem

Abstract Summary We describe a novel computational method for genotyping repeats using sequence graphs. This method addresses the long-standing need to accurately genotype medically important loci containing repeats adjacent to other variants or imperfect DNA repeats such as polyalanine repeats. Here we introduce a new version of our repeat genotyping software, ExpansionHunter, that uses this method to perform targeted genotyping of a broad class of such loci. Availability and implementation ExpansionHunter is implemented in C++ and is available under the Apache License Version 2.0. The source code, documentation, and Linux/macOS binaries are available at https://github.com/Illumina/ExpansionHunter/. Supplementary information Supplementary data are available at Bioinformatics online.


An integrative approach for fine-mapping chromatin interactions

Bioinformatics ◽

10.1093/bioinformatics/btz843 ◽

2019 ◽

Vol 36 (6) ◽

pp. 1704-1711

Author(s):

Artur Jaroszewicz ◽

Jason Ernst

Keyword(s):

Gene Regulation ◽

High Resolution ◽

Biological Significance ◽

Computational Method ◽

Supplementary Information ◽

Integrative Approach ◽

Genome Architecture ◽

Open Chromatin ◽

Chromatin Interactions ◽

Genome Wide

Abstract Motivation Chromatin interactions play an important role in genome architecture and gene regulation. The Hi-C assay generates such interaction maps genome-wide, but at relatively low resolutions (e.g. 5–25 kb), which are substantially coarser than the resolution of transcription factor binding sites or open chromatin sites that are potential sources of such interactions. Results To predict the sources of Hi-C-identified interactions at a high resolution (e.g. 100 bp), we developed a computational method that integrates data from DNase-seq and ChIP-seq of TFs and histone marks. Our method, χ-CNN, uses these data to first train a convolutional neural network (CNN) to discriminate between called Hi-C interactions and non-interactions. χ-CNN then predicts the high-resolution source of each Hi-C interaction using a feature attribution method. We show that these predictions recover the original Hi-C peaks after being extended to a coarser resolution. We also show that χ-CNN predictions are enriched for evolutionarily conserved bases, eQTLs and CTCF motifs, supporting their biological significance. χ-CNN provides an approach for analyzing important aspects of genome architecture and gene regulation at a higher resolution than previously possible. Availability and implementation χ-CNN software is available on GitHub (https://github.com/ernstlab/X-CNN). Supplementary information Supplementary data are available at Bioinformatics online.


Studying 3D genome evolution using genomic sequence

Bioinformatics ◽

10.1093/bioinformatics/btz775 ◽

2019 ◽

Author(s):

Raphaël Mourad

Keyword(s):

Genome Evolution ◽

High Throughput Sequencing ◽

Genomic Sequence ◽

Regulation Of Gene Expression ◽

Replication Timing ◽

Three Dimensions ◽

Supplementary Information ◽

3D Genome ◽

Chromatin Looping ◽

Novel Approach

Abstract Motivation The three-dimensional (3D) genome is essential to numerous key processes such as the regulation of gene expression and the replication-timing program. In vertebrates, chromatin looping is often mediated by CTCF, and marked by CTCF motif pairs in convergent orientation. A comparative high-throughput sequencing technique (Hi-C) recently revealed that chromatin looping evolves across species. However, Hi-C experiments are complex and costly, which currently limits their use for evolutionary studies over a large number of species. Results Here, we propose a novel approach to study 3D genome evolution in vertebrates using the genomic sequence only, i.e. without the need for Hi-C data. The approach is simple and relies on comparing the distances between convergent and divergent CTCF motifs by computing a ratio we named the 3D ratio or ‘3DR’. We show that 3DR is a powerful statistic to detect CTCF looping encoded in the human genome sequence, thus reflecting strong evolutionary constraints encoded in DNA and associated with the 3D genome. When comparing vertebrate genomes, our results reveal that 3DR, which underlies CTCF looping and topologically associating domain organization, evolves over time and suggest that ancestral character reconstruction can be used to infer 3DR in ancestral genomes. Availability and implementation The R code is available at https://github.com/morphos30/PhyloCTCFLooping. Supplementary information Supplementary data are available at Bioinformatics online.
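The abstract does not give the exact formula for 3DR, so the following is only a sketch of one plausible ratio of this kind and should not be read as the author's definition. It assumes that adjacent motif pairs are classified by strand, that "+ then −" counts as convergent and "− then +" as divergent, and that the ratio divides the median divergent distance by the median convergent distance. The function name and motif coordinates are hypothetical.

```python
from statistics import median


def orientation_distance_ratio(motifs):
    """motifs: (position, strand) tuples on one chromosome, strand '+' or '-'.

    Returns median distance of adjacent divergent pairs divided by the
    median distance of adjacent convergent pairs (an assumed definition).
    """
    ordered = sorted(motifs)
    convergent, divergent = [], []
    for (pos_a, strand_a), (pos_b, strand_b) in zip(ordered, ordered[1:]):
        distance = pos_b - pos_a
        if strand_a == "+" and strand_b == "-":
            convergent.append(distance)   # motifs point toward each other
        elif strand_a == "-" and strand_b == "+":
            divergent.append(distance)    # motifs point away from each other
    if not convergent or not divergent:
        raise ValueError("need at least one convergent and one divergent pair")
    return median(divergent) / median(convergent)


# Hypothetical CTCF motif positions and strands.
motifs = [(100, "+"), (400, "-"), (1000, "+"), (1200, "-"), (2200, "+")]
ratio = orientation_distance_ratio(motifs)
print(ratio)
```

Under this assumed definition, a ratio above 1 means convergent neighbours sit closer together than divergent ones; whether that matches the direction and normalisation used in the paper would need to be checked against the published R code.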


PHANOTATE: a novel approach to gene identification in phage genomes

Bioinformatics ◽

10.1093/bioinformatics/btz265 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4537-4542 ◽

Cited By ~ 24

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Elizabeth A Dinsdale ◽

Brian Souza ◽

Robert A Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Supplementary Information ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach

Abstract Motivation Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes gene prediction difficult for the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths in which open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use dynamic programming to find the optimal path. Results We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation https://github.com/deprekate/PHANOTATE Supplementary information Supplementary data are available at Bioinformatics online.
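The idea of finding an optimal path through a weighted graph of open reading frames can be sketched with a small dynamic program. This is an illustration of the general technique, not PHANOTATE's scoring or implementation: the function name, the ORF weights, and the linear gap and overlap penalties are all assumptions, and nested ORFs are simply skipped here for brevity.

```python
def best_gene_path(orfs, gap_cost=0.01, overlap_cost=0.05):
    """orfs: (start, end, weight) tuples; more negative weight = more favorable.

    Returns (total_cost, chosen_orfs) for the minimum-cost chain of ORFs,
    where joining two ORFs costs a penalty proportional to the gap or
    overlap between them (hypothetical penalties).
    """
    orfs = sorted(orfs)
    n = len(orfs)
    best = [w for _, _, w in orfs]  # cheapest path ending at each ORF
    prev = [None] * n               # back-pointers for path recovery
    for j in range(n):
        start_j, end_j, weight_j = orfs[j]
        for i in range(j):
            start_i, end_i, _ = orfs[i]
            if end_j <= end_i:
                continue            # skip ORFs nested inside an earlier one
            if start_j >= end_i:
                penalty = (start_j - end_i) * gap_cost      # gap between ORFs
            else:
                penalty = (end_i - start_j) * overlap_cost  # overlapping ORFs
            candidate = best[i] + penalty + weight_j
            if candidate < best[j]:
                best[j], prev[j] = candidate, i
    end = min(range(n), key=best.__getitem__)
    total = best[end]
    path = []
    while end is not None:
        path.append(orfs[end])
        end = prev[end]
    return total, path[::-1]


# Hypothetical ORFs: (start, end, weight).
orfs = [(0, 300, -5.0), (250, 400, -1.0), (310, 900, -6.0), (950, 1500, -4.0)]
cost, path = best_gene_path(orfs)
print(cost, path)
```

Because the ORFs are sorted by start and edges only run from earlier to later ORFs, the graph is acyclic and a single left-to-right pass suffices even with negative weights. In this toy input the weakly scored, heavily overlapping ORF at 250 to 400 is left out of the chosen path.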
