Expression-based prediction of human essential genes and candidate lncRNAs in cancer cells

Author(s):  
Shuzhen Kuang ◽  
Yanzhang Wei ◽  
Liangjiang Wang

Abstract Motivation Essential genes are required for reproductive success at either the cellular or the organismal level. The identification of essential genes is important for understanding the core biological processes and identifying effective therapeutic drug targets. However, experimental identification of essential genes is costly, time-consuming and labor intensive. Although several machine learning models have been developed to predict essential genes, these models are not readily applicable to lncRNAs. Moreover, the currently available models cannot be used to predict essential genes in a specific cancer type. Results In this study, we have developed a new machine learning approach, XGEP (eXpression-based Gene Essentiality Prediction), to predict essential genes and candidate lncRNAs in cancer cells. The novelty of XGEP lies in the utilization of relevant features derived from the TCGA transcriptome dataset through collaborative embedding. When evaluated on the pan-cancer dataset, XGEP was able to accurately predict human essential genes and achieve significantly higher performance than previous models. Notably, several candidate lncRNAs selected by XGEP are reported to promote cell proliferation and inhibit cell apoptosis. Moreover, XGEP also demonstrated superior performance in identifying essential genes on cancer-type-specific datasets. The comprehensive lists of candidate essential genes in specific cancer types may be used to guide experimental characterization and facilitate the discovery of drug targets for cancer therapy. Availability and implementation The source code and datasets used in this study are freely available at https://github.com/BioDataLearning/XGEP. Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Vol 35 (16) ◽  
pp. 2818-2826 ◽  
Author(s):  
Jinyan Chan ◽  
Xuan Wang ◽  
Jacob A Turner ◽  
Nicole E Baldwin ◽  
Jinghua Gu

Abstract Motivation Transcriptome-based computational drug repurposing has attracted considerable interest by bringing about faster and more cost-effective drug discovery. Nevertheless, key limitations of the current drug connectivity-mapping paradigm have long been overlooked, including the lack of effective means to determine optimal query gene signatures. Results The novel approach Dr Insight implements a frame-breaking statistical model for the ‘hand-shake’ between disease and drug data. The genome-wide screening of concordantly expressed genes (CEGs) eliminates the need for subjective selection of query signatures, in addition to providing a better proxy for potential disease-specific drug targets. Extensive comparisons on simulated and real cancer datasets have validated the superior performance of Dr Insight over several popular drug-repurposing methods in detecting known cancer drugs and drug–target interactions. A proof-of-concept trial using the TCGA breast cancer dataset demonstrates the application of Dr Insight for a comprehensive analysis, from redirection of drug therapies to a systematic construction of disease-specific drug-target networks. Availability and implementation The Dr Insight R package is available at https://cran.r-project.org/web/packages/DrInsight/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


2015 ◽  
Vol 32 (6) ◽  
pp. 821-827 ◽  
Author(s):  
Enrique Audain ◽  
Yassel Ramos ◽  
Henning Hermjakob ◽  
Darren R. Flower ◽  
Yasset Perez-Riverol

Abstract Motivation: In any macromolecular polyprotic system (for example protein, DNA or RNA), the isoelectric point, commonly referred to as the pI, can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge, and thus the electrophoretic mobility, of the ampholyte sums to zero. Different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel, and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset and their resulting performance will strongly depend on the quality of that data. In contrast with iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction.
Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
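The definition of pI given in this abstract (the pH at which the net charge sums to zero) can be illustrated with a minimal iterative sketch. This is not the code of the benchmarked pIR package. The pKa values below are illustrative assumptions, and published pKa sets differ, which is one reason calculated pI values vary between methods. The function names `net_charge` and `isoelectric_point` are hypothetical.

```python
# Illustrative pKa values (assumed for this sketch; published sets differ).
PKA_BASIC = {"N_term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_ACIDIC = {"C_term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    seq = sequence.upper()
    # Free N-terminus contributes positive charge, free C-terminus negative.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_BASIC["N_term"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_ACIDIC["C_term"] - ph))
    # Basic side chains are positively charged when protonated.
    for aa in "KRH":
        charge += seq.count(aa) / (1.0 + 10 ** (ph - PKA_BASIC[aa]))
    # Acidic side chains are negatively charged when deprotonated.
    for aa in "DECY":
        charge -= seq.count(aa) / (1.0 + 10 ** (PKA_ACIDIC[aa] - ph))
    return charge


def isoelectric_point(sequence, tol=1e-4):
    """Bisection on pH in [0, 14]; net charge decreases monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at a higher pH
        else:
            high = mid  # zero or negative: pI lies at or below this pH
    return (low + high) / 2.0
```

Calling `isoelectric_point("ACDEFGHIK")` returns the pH where `net_charge` crosses zero under the assumed pKa set. Swapping in a different pKa dictionary is the simplest way to see how sensitive the estimate is to that choice.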


2020 ◽  
Vol 36 (16) ◽  
pp. 4490-4497
Author(s):  
Siqi Liang ◽  
Haiyuan Yu

Abstract Motivation In silico drug target prediction provides valuable information for drug repurposing, understanding of side effects as well as expansion of the druggable genome. In particular, discovery of actionable drug targets is critical to developing targeted therapies for diseases. Results Here, we develop a robust method for drug target prediction by leveraging a class imbalance-tolerant machine learning framework with a novel training scheme. We incorporate novel features, including drug–gene phenotype similarity and gene expression profile similarity, that capture information orthogonal to other features. We show that our classifier achieves robust performance and is able to predict gene targets for new drugs as well as drugs that potentially target unexplored genes. By providing newly predicted drug–target associations, we uncover novel opportunities for drug repurposing that may benefit cancer treatment through action on either known drug targets or currently undrugged genes. Supplementary information Supplementary data are available at Bioinformatics online.


2016 ◽  
Vol 2016 ◽  
pp. 1-9 ◽  
Author(s):  
Hong-Li Hua ◽  
Fa-Zhan Zhang ◽  
Abraham Alemayehu Labena ◽  
Chuan Dong ◽  
Yan-Ting Jin ◽  
...  

Investigation of essential genes is important for understanding the minimal gene set of a cell and for discovering potential drug targets. In this study, a novel approach based on multiple homology mapping and machine learning was introduced to predict essential genes. We focused on 25 bacteria which have characterized essential genes. The predictions yielded a highest area under the receiver operating characteristic (ROC) curve (AUC) of 0.9716 in a tenfold cross-validation test. Proper features were utilized to construct models to make predictions in distantly related bacteria. The accuracy of predictions was evaluated via the consistency between the predictions and the known essential genes of the target species. A highest AUC of 0.9552 and an average AUC of 0.8314 were achieved when making predictions across organisms. An independent dataset from Synechococcus elongatus, which was released recently, was obtained for further assessment of the performance of our model. The AUC score of the predictions is 0.7855, which is higher than that of other methods. This research shows that features obtained by homology mapping alone can achieve results comparable to, or even better than, those of integrated features. Meanwhile, the work indicates that machine learning-based methods can assign more effective weight coefficients than an empirical formula based on biological knowledge.
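The evaluation described here (AUC under tenfold cross-validation) can be sketched in a few lines. This is a generic illustration, not the authors' pipeline: `roc_auc` computes AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative, and `k_fold_indices` is a hypothetical helper that deals sample indices into k folds. Labels are assumed to be 1 for essential and 0 for non-essential genes.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is tested exactly once.

    Indices are dealt round-robin into folds, so shuffle the data first
    if its order is meaningful.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

In a full evaluation, a model would be fitted on each `train` split, scored on the matching `test` split, and the per-fold AUC values averaged or reported individually.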


2019 ◽  
Vol 35 (21) ◽  
pp. 4509-4510
Author(s):  
Neil Pearson ◽  
Karim Malki ◽  
David Evans ◽  
Lewis Vidler ◽  
Cara Ruble ◽  
...  

Abstract Summary We present software to characterize and rank potential therapeutic (drug) targets with data from public databases and present it in a user-friendly format. By understanding potential obstacles to drug development through the gathering and understanding of this information, combined with robust approaches to target validation to generate therapeutic hypotheses, this approach may provide high quality targets, leading the process of drug development to become more efficient and cost-effective. Availability and implementation The information we gather on potential targets concerns small-molecule druggability (ligandability), suitability for large-molecule approaches (e.g. antibodies) or new modalities (e.g. antisense oligonucleotides, siRNA or PROTAC), feasibility (availability of resources such as assays and biological knowledge) and potential safety risks (adverse tissue-wise expression, deleterious phenotypes). This information can be termed ‘tractability’. We provide visualization tools to understand its components. TractaViewer is available from https://github.com/NeilPearson-Lilly/TractaViewer Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (20) ◽  
pp. 3989-3995 ◽  
Author(s):  
Hongjian Li ◽  
Jiangjun Peng ◽  
Pavel Sidorov ◽  
Yee Leung ◽  
Kwong-Sak Leung ◽  
...  

Abstract Motivation Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes. Results We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from training complexes dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the other SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing. Availability and implementation https://github.com/HongjianLi/MLSF Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Coryandar Gilvary ◽  
Neel S. Madhukar ◽  
Kaitlyn Gayvert ◽  
Miguel Foronda ◽  
Alexendar Perez ◽  
...  

Abstract Loss-of-function (LoF) screenings have the potential to reveal novel cancer-specific vulnerabilities, prioritize drug treatments, and inform precision medicine therapeutics. These screenings were traditionally done using shRNAs, but with the recent emergence of CRISPR technology there has been a shift in methodology. However, recent analyses have found large inconsistencies between CRISPR and shRNA essentiality results. Here, we examined the DepMap project, the largest cancer LoF effort undertaken to date, and found a lack of correlation between CRISPR and shRNA LoF results; we further characterized differences between genes found to be essential by either platform. We then introduce ECLIPSE, a machine learning approach that combines genomic, cell line, and experimental design features to predict essential genes and platform-specific essential genes in specific cancer cell lines. We applied ECLIPSE to known drug targets and found that our approach strongly differentiated drugs approved for cancer from those that have not been approved, and can thus be leveraged to identify potential cancer repurposing opportunities. Overall, ECLIPSE allows for a more comprehensive analysis of gene essentiality and drug development, which neither platform can achieve alone.


2019 ◽  
Vol 35 (24) ◽  
pp. 5235-5242 ◽  
Author(s):  
Jun Wang ◽  
Liangjiang Wang

Abstract Motivation Circular RNAs (circRNAs) are a new class of endogenous RNAs in animals and plants. During pre-RNA splicing, the 5′ and 3′ termini of exon(s) can be covalently ligated to form circRNAs through back-splicing (head-to-tail splicing). CircRNAs can be conserved across species, show tissue- and developmental stage-specific expression patterns, and may be associated with human disease. However, the mechanism of circRNA formation is still unclear, although some sequence features have been shown to affect back-splicing. Results In this study, by applying state-of-the-art machine learning techniques, we have developed the first deep learning model, DeepCirCode, to predict back-splicing for human circRNA formation. DeepCirCode utilizes a convolutional neural network (CNN) with nucleotide sequence as the input, and shows superior performance over conventional machine learning algorithms such as support vector machine and random forest. Relevant features learnt by DeepCirCode are represented as sequence motifs, some of which match known human motifs involved in RNA splicing, transcription or translation. Analysis of these motifs shows that their distribution in RNA sequences can be important for back-splicing. Moreover, some of the human motifs appear to be conserved in mouse and fruit fly. The findings provide new insight into the back-splicing code for circRNA formation. Availability and implementation All the datasets and source code for model construction are available at https://github.com/BioDataLearning/DeepCirCode. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (18) ◽  
pp. 3468-3475 ◽  
Author(s):  
Ismail M Khater ◽  
Fanrui Meng ◽  
Ivan Robert Nabi ◽  
Ghassan Hamarneh

Abstract Motivation Network analysis and unsupervised machine learning processing of single-molecule localization microscopy of caveolin-1 (Cav1) antibody labeling of prostate cancer cells identified biosignatures and structures for caveolae and three distinct non-caveolar scaffolds (S1A, S1B and S2). To obtain further insight into low-level molecular interactions within these different structural domains, we now introduce graphlet decomposition over a range of proximity thresholds and show that frequency of different subgraph (k = 4 nodes) patterns for machine learning approaches (classification, identification, automatic labeling, etc.) effectively distinguishes caveolae and scaffold blobs. Results Caveolae formation requires both Cav1 and the adaptor protein CAVIN1 (also called PTRF). As a supervised learning approach, we applied a wide-field CAVIN1/PTRF mask to CAVIN1/PTRF-transfected PC3 prostate cancer cells and used the random forest classifier to classify blobs based on graphlet frequency distribution (GFD). GFD of CAVIN1/PTRF-positive (PTRF+) and -negative Cav1 clusters showed poor classification accuracy that was significantly improved by stratifying the PTRF+ clusters by either number of localizations or volume. Low classification accuracy (<50%) of large PTRF+ clusters and caveolae blobs identified by unsupervised learning suggests that their GFD is specific to caveolae. High classification accuracy for small PTRF+ clusters and caveolae blobs argues that CAVIN1/PTRF associates not only with caveolae but also non-caveolar scaffolds. At low proximity thresholds (50–100 nm), the caveolae groups showed reduced frequency of highly connected graphlets and increased frequency of completely disconnected graphlets. GFD analysis of single-molecule localization microscopy Cav1 clusters defines changes in structural organization in caveolae and scaffolds independent of association with CAVIN1/PTRF. 
Supplementary information Supplementary data are available at Bioinformatics online.
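The graphlet frequency distribution (GFD) idea in this abstract can be illustrated with a small brute-force sketch. It is not the authors' implementation. It assumes a blob is a list of 3D localization coordinates, connects two localizations when their Euclidean distance is within a proximity threshold, and classifies every induced 4-node subgraph. For four nodes, the sorted degree sequence is enough to tell the 11 possible patterns apart. The pattern names and the function `graphlet_frequencies` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Sorted degree sequence -> 4-node pattern (unique for graphs on four nodes).
GRAPHLET_BY_DEGREES = {
    (0, 0, 0, 0): "empty",
    (0, 0, 1, 1): "one_edge",
    (1, 1, 1, 1): "two_disjoint_edges",
    (0, 1, 1, 2): "two_adjacent_edges",
    (0, 2, 2, 2): "triangle_plus_isolated",
    (1, 1, 1, 3): "star",
    (1, 1, 2, 2): "path",
    (2, 2, 2, 2): "cycle",
    (1, 2, 2, 3): "paw",
    (2, 2, 3, 3): "diamond",
    (3, 3, 3, 3): "clique",
}


def graphlet_frequencies(points, threshold):
    """Relative frequency of each induced 4-node pattern in a proximity graph."""
    n = len(points)
    adjacent = [[False] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= threshold:
            adjacent[i][j] = adjacent[j][i] = True
    counts = Counter()
    # Brute force over all 4-node subsets: fine for small blobs only.
    for quad in combinations(range(n), 4):
        degrees = tuple(sorted(
            sum(adjacent[u][v] for v in quad if v != u) for u in quad
        ))
        counts[GRAPHLET_BY_DEGREES[degrees]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total for name in GRAPHLET_BY_DEGREES.values()}
```

Evaluating `graphlet_frequencies` at several thresholds and concatenating the results gives one feature vector per blob, which could then be passed to a classifier such as a random forest.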


2020 ◽  
Vol 36 (10) ◽  
pp. 3185-3191 ◽  
Author(s):  
Edison Ong ◽  
Haihe Wang ◽  
Mei U Wong ◽  
Meenakshi Seetharaman ◽  
Ninotchka Valdez ◽  
...  

Abstract Motivation Reverse vaccinology (RV) is a milestone in rational vaccine design, and machine learning (ML) has been applied to enhance the accuracy of RV prediction. However, ML-based RV still faces challenges in prediction accuracy and program accessibility. Results This study presents Vaxign-ML, a supervised ML classification method to predict bacterial protective antigens (BPAgs). To identify the best ML method with optimized conditions, five ML methods were tested with biological and physicochemical features extracted from well-defined training data. Nested 5-fold cross-validation and leave-one-pathogen-out validation were used to ensure unbiased performance assessment and the capability to predict vaccine candidates against a newly emerging pathogen. The best performing model (eXtreme Gradient Boosting) was compared to three publicly available programs (Vaxign, VaxiJen, and Antigenic), one SVM-based method, and one epitope-based method using a high-quality benchmark dataset. Vaxign-ML showed superior performance in predicting BPAgs. Vaxign-ML is hosted on a publicly accessible web server and a standalone version is also available. Availability and implementation The Vaxign-ML website is at http://www.violinet.org/vaxign/vaxign-ml, the Docker standalone Vaxign-ML is available at https://hub.docker.com/r/e4ong1031/vaxign-ml and the source code is available at https://github.com/VIOLINet/Vaxign-ML-docker. Supplementary information Supplementary data are available at Bioinformatics online.
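Leave-one-pathogen-out validation, as mentioned in this abstract, holds out every protein from one pathogen at a time, so the model is always tested on a pathogen it has never seen. The sketch below shows only the splitting logic, under the assumption that each sample carries a pathogen label. It is not taken from the Vaxign-ML source, and the function name `leave_one_group_out` and the example labels are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group.

    `groups` holds one label per sample (here, the source pathogen).
    All samples of the held-out group form the test set.
    """
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test


# Hypothetical pathogen labels for five proteins.
pathogens = ["A", "A", "B", "C", "B"]
for name, train_idx, test_idx in leave_one_group_out(pathogens):
    print(name, train_idx, test_idx)
```

Because a pathogen never appears in both the training and the test split, the resulting scores approximate performance on a new pathogen more closely than a random split would.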

