Deep-learning approach to identifying cancer subtypes using high-dimensional genomic data

2019 ◽  
Author(s):
Runpu Chen ◽  
Le Yang ◽  
Steve Goodison ◽  
Yijun Sun

Abstract
Motivation: Cancer subtype classification has the potential to significantly improve disease prognosis and to support individualized patient management. Existing methods are limited in their ability to handle extremely high-dimensional data and are susceptible to misleading, irrelevant factors, resulting in ambiguous and overlapping subtypes.
Results: To address these issues, we propose a novel approach to disentangling and eliminating irrelevant factors by leveraging the power of deep learning. Specifically, we designed a deep-learning framework, referred to as DeepType, that performs joint supervised classification, unsupervised clustering and dimensionality reduction to learn a cancer-relevant data representation with cluster structure. We applied DeepType to the METABRIC breast cancer dataset and compared its performance to state-of-the-art methods. DeepType significantly outperformed the existing methods, identifying more robust subtypes while using fewer genes. The new approach provides a framework for deriving more accurate and robust molecular cancer subtypes from increasingly complex, multi-source data.
Availability and implementation: An open-source software package for the proposed method is freely available at http://www.acsu.buffalo.edu/~yijunsun/lab/DeepType.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
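A minimal sketch of the kind of joint objective the abstract describes: supervised classification combined with a k-means-style clustering penalty on a learned low-dimensional representation. This is an illustration under stated assumptions, not the authors' implementation; the layer sizes, the trade-off weight alpha, the number of clusters and the random data are all invented for the example.

```python
# Hedged sketch of a joint classification + clustering objective
# (assumed architecture, not the published DeepType code).
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    def __init__(self, n_genes, n_hidden, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 64), nn.ReLU(),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        z = self.encoder(x)            # low-dimensional representation
        return z, self.classifier(z)

def joint_loss(z, logits, y, centroids, alpha=0.1):
    # Supervised term: cross-entropy on the subtype labels.
    ce = nn.functional.cross_entropy(logits, y)
    # Unsupervised term: squared distance of each embedding to its
    # nearest cluster centroid (a k-means-style penalty).
    d2 = torch.cdist(z, centroids).pow(2)
    cluster = d2.min(dim=1).values.mean()
    return ce + alpha * cluster

# Toy usage on random data (2000 "genes", 5 subtypes, 8 clusters).
x = torch.randn(128, 2000)
y = torch.randint(0, 5, (128,))
model = JointEncoder(2000, 256, 5)
centroids = torch.randn(8, 64)   # would be re-estimated between epochs
z, logits = model(x)
loss = joint_loss(z, logits, y, centroids)
loss.backward()
```

The gene-selection aspect of the abstract would additionally require a sparsity mechanism on the first layer, omitted here for brevity.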


Author(s):  
Neha Warikoo ◽  
Yung-Chun Chang ◽  
Wen-Lian Hsu

Abstract
Motivation: Natural language processing techniques are constantly being advanced to accommodate the influx of data and to provide exhaustive, structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities, known as the Bio-Entity Relation Extraction (BRE) task, plays a critical role in knowledge structuring. Although recent advances in deep learning-based biomedical domain embeddings have improved BRE predictive analytics, these works are often task-selective or rely on external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, LBERT, a Lexically aware Transformer-based Bidirectional Encoder Representation model, which explores both local and global context representations for sentence-level classification tasks.
Results: This article presents one of the most exhaustive BRE studies ever conducted, covering five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein–protein interaction (PPI), drug–drug interaction and protein–bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relations in large corpora such as PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention to prediction performance, learning additional local semantic context alongside the bi-directionally learned global context.
Availability and implementation: GitHub: https://github.com/warikoone/LBERT.
Supplementary information: Supplementary data are available at Bioinformatics online.
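The ablation results single out distance-adjusted attention. As a hedged illustration of the general idea, not LBERT's exact formulation, the sketch below damps scaled dot-product attention scores by token distance so that local syntactic context weighs more heavily; the decay constant gamma is an assumption.

```python
# Hedged sketch of distance-adjusted attention (illustrative, not LBERT's).
import math
import torch

def distance_adjusted_attention(q, k, v, gamma=0.05):
    # q, k, v: (batch, seq_len, dim)
    seq_len, dim = q.shape[1], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(dim)
    # Penalise attention between distant tokens so nearby (local
    # syntactic) context contributes more to each representation.
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()
    scores = scores - gamma * dist
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)
out = distance_adjusted_attention(q, k, v)   # shape (2, 10, 64)
```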


Author(s):  
Vinod Jagannath Kadam ◽  
Shivajirao Manikrao Jadhav

Medical data classification is the process of transforming descriptions of medical diagnoses and procedures into universal medical code numbers. The diagnoses and procedures are usually taken from a variety of sources within the healthcare record, such as transcriptions of physicians' notes, laboratory results and radiologic results. However, many frequency distribution problems exist in these domains. Hence, this paper develops an advanced and precise medical data classification approach for diabetes and breast cancer datasets. Informed by the features and challenges of state-of-the-art classification methods, a deep learning-based medical data classification methodology is proposed here. It is well known that deep learning networks learn directly from the data. In this paper, the medical data are first dimensionally reduced using Principal Component Analysis (PCA). The reduced data are then transformed by multiplication with a weighting factor, optimized using the Whale Optimization Algorithm (WOA), to maximize the distance between the features. As a result, the data are transformed into a label-distinguishable plane, on which a Deep Belief Network (DBN) performs the deep learning process and the data classification. Further, the proposed WOA-based DBN (WOADBN) method is compared with Neural Network (NN), DBN, Genetic Algorithm-based NN (GANN), GA-based DBN (GADBN), PSO-based NN (PSONN), PSO-based DBN (PSODBN) and WOA-based NN (WOANN) techniques; the results show the superiority of the proposed algorithm over conventional methods.
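As a rough sketch of the preprocessing pipeline described above: reduce with PCA, then search for a feature-weighting vector that spreads the classes apart. A simple random search stands in for the Whale Optimization Algorithm, and a between-class-centroid distance stands in for the paper's fitness function; both stand-ins, along with the choice of 10 components, are assumptions rather than the authors' method.

```python
# Hedged sketch: PCA + optimized feature weighting (random search stands
# in for WOA; the separation score is an assumed, simplified fitness).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)
Z = PCA(n_components=10).fit_transform(X)

def separation(w):
    # Distance between the two class centroids in the weighted space.
    Zw = Z * w
    return np.linalg.norm(Zw[y == 0].mean(axis=0) - Zw[y == 1].mean(axis=0))

rng = np.random.default_rng(0)
best_w, best_s = np.ones(10), separation(np.ones(10))
for _ in range(500):                       # stand-in for WOA iterations
    w = rng.uniform(0.0, 2.0, size=10)
    s = separation(w)
    if s > best_s:
        best_w, best_s = w, s

Z_transformed = Z * best_w   # "label-distinguishable" features for the DBN
print(best_s)
```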


2017 ◽  
Vol 7 (4) ◽  
pp. 265-286 ◽  
Author(s):  
Guido Bologna ◽  
Yoichi Hayashi

Abstract
Rule extraction from neural networks is a fervent research topic. In the last 20 years, many authors have presented techniques for extracting symbolic rules from Multi Layer Perceptrons (MLPs). Nevertheless, very few address ensembles of neural networks, and even fewer address networks trained by deep learning. On several datasets we performed rule extraction from ensembles of Discretized Interpretable Multi Layer Perceptrons (DIMLPs) and from DIMLPs trained by deep learning. The results obtained on the Thyroid dataset and the Wisconsin Breast Cancer dataset show that the predictive accuracy of the extracted rules compares very favorably with state-of-the-art results. Finally, in a digit-recognition problem, rules generated from the MNIST dataset can be viewed as discriminatory features in particular digit areas. Qualitatively, with respect to rule complexity in terms of the number of generated rules and the number of antecedents per rule, deep DIMLPs and DIMLPs trained by arcing give similar results on a binary classification problem involving digits 5 and 8. On the whole MNIST problem, we showed that it is possible to determine the feature detectors created by neural networks, and that the complexity of the extracted rulesets can be balanced between accuracy and interpretability.
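DIMLP's own rule-extraction algorithm is not reproduced here; as a hedged illustration of rule extraction in general, the sketch below uses the common "pedagogical" approach instead: fit a small decision tree to a trained network's predictions and read each root-to-leaf path off as an IF-THEN rule.

```python
# Hedged sketch of pedagogical rule extraction (a generic technique,
# not the DIMLP algorithm): a tree mimics the network, paths become rules.
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                    random_state=0).fit(X, y)

# Fit the tree to the network's outputs, not the true labels, so the
# extracted rules describe the network's decision boundary.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, net.predict(X))
print(export_text(tree))   # each root-to-leaf path is an IF-THEN rule
```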


Breast cancer is one of the most dangerous diseases, leading to rapid death among women. Several kinds of cancer affect people, but breast cancer predominantly affects women. In medical practice, removal of the breast or major surgery is often taken as the solution, yet the cancer can recur after surgery. The only way to save women from breast cancer is to detect it at an early stage and provide the necessary treatment. Hence, various research works have focused on finding good solutions for diagnosing cancer and classifying its stages as benign, malignant or severe malignant. Still, classification accuracy needs to be improved on complex breast cancer datasets. Some earlier research works proposed machine learning algorithms, but these are semiautomatic and their accuracy is not high. Thus, to provide a better solution, this paper applies a deep learning algorithm, the Convolutional Neural Network, to diagnose various kinds of breast cancer datasets. The experimental results show that the proposed deep learning algorithm outperforms the other algorithms.
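A minimal sketch of the kind of small convolutional network such a study might use for three-way grading (benign, malignant, severe malignant), assuming image inputs; the layer sizes and the 64x64 grayscale input resolution are assumptions, since the abstract does not specify an architecture.

```python
# Hedged sketch of a small CNN for 3-class grading (assumed architecture).
import torch
import torch.nn as nn

class BreastCancerCNN(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # After two 2x pools, a 64x64 input becomes 32 channels of 16x16.
        self.head = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = BreastCancerCNN()
logits = model(torch.randn(4, 1, 64, 64))   # 4 grayscale 64x64 patches
print(logits.shape)                          # torch.Size([4, 3])
```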


2020 ◽  
Vol 36 (11) ◽  
pp. 3299-3306
Author(s):  
Ziwei Chen ◽  
Fuzhou Gong ◽  
Lin Wan ◽  
Liang Ma

Abstract
Motivation: Single-cell sequencing (SCS) data provide unprecedented insights into intratumoral heterogeneity. With SCS, we can better characterize clonal genotypes and reconstruct phylogenetic relationships of tumor cells/clones. However, SCS data are often error-prone, making their computational analysis challenging.
Results: To infer clonal evolution in tumors from error-prone SCS data, we developed an efficient computational framework, termed RobustClone. It recovers the true genotypes of subclones based on extended robust principal component analysis, a low-rank matrix decomposition method, and reconstructs the subclonal evolutionary tree. RobustClone is a model-free method that can be applied to both single-cell single-nucleotide variation (scSNV) and single-cell copy-number variation (scCNV) data. It is efficient and scalable to large-scale datasets. We conducted a set of systematic evaluations on simulated datasets and demonstrated that RobustClone outperforms state-of-the-art methods on large-scale data in both accuracy and efficiency. We further validated RobustClone on two scSNV and two scCNV datasets and demonstrated that it recovers the genotype matrix and infers the subclonal evolution tree accurately under various scenarios. In particular, RobustClone revealed the spatial progression patterns of subclonal evolution in the large-scale 10X Genomics scCNV breast cancer dataset.
Availability and implementation: RobustClone software is available at https://github.com/ucasdp/RobustClone.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
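RobustClone builds on an extended robust PCA; the sketch below implements the classical principal component pursuit via an augmented-Lagrangian iteration (not the extended variant used in the paper) to show how a low-rank matrix L and a sparse error matrix S are split out of a noisy observation M.

```python
# Hedged sketch: classical robust PCA (principal component pursuit),
# illustrating the low-rank + sparse decomposition RobustClone extends.
import numpy as np

def shrink(X, tau):
    # Elementwise soft-thresholding (proximal step for the L1 norm).
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    # Singular-value thresholding (proximal step for the nuclear norm).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, n_iter=200):
    # Split M into low-rank L plus sparse S via augmented Lagrangian steps.
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = m * n / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
    return L, S

# Toy recovery test: a rank-5 matrix corrupted by sparse large errors.
rng = np.random.default_rng(0)
L0 = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))
S0 = (rng.random((100, 80)) < 0.05) * rng.normal(scale=10, size=(100, 80))
L, S = rpca(L0 + S0)
print(np.linalg.norm(L - L0) / np.linalg.norm(L0))   # relative recovery error
```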


Author(s):  
Tamim Abdelaal ◽  
Paul de Raadt ◽  
Boudewijn P.F. Lelieveldt ◽  
Marcel J.T. Reinders ◽  
Ahmed Mahfouz

Abstract
Motivation: Single-cell data measure multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering these data. Dimensionality-reduction-based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated, requiring an initial manual estimate of the number of clusters. Graph clustering tools provide automated and reliable clustering for single-cell data, but scale poorly to large datasets.
Results: We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data into a hierarchy of datasets containing subsets of data points that follow the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable timeframes. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as to the popular machine learning benchmark dataset MNIST.
Availability and implementation: Implementation is available on GitHub (https://github.com/paulderaadt/HSNE-clustering).
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
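As a hedged illustration of the graph-clustering half of SCHNEL only (the hierarchical-representation step is omitted), the sketch below builds a k-nearest-neighbour graph over a small benchmark dataset and partitions it by modularity, which yields a cluster count automatically. The choice k=10 and the use of networkx's greedy modularity routine are assumptions.

```python
# Hedged sketch: kNN graph + modularity-based community detection,
# the generic graph-clustering idea SCHNEL makes scalable.
import networkx as nx
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph
from networkx.algorithms.community import greedy_modularity_communities

X, _ = load_digits(return_X_y=True)
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity').toarray()
G = nx.from_numpy_array(A)               # symmetrized kNN graph
communities = greedy_modularity_communities(G)
print(len(communities))   # number of clusters, found without specifying k
```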


2019 ◽  
Vol 35 (16) ◽  
pp. 2818-2826 ◽  
Author(s):  
Jinyan Chan ◽  
Xuan Wang ◽  
Jacob A Turner ◽  
Nicole E Baldwin ◽  
Jinghua Gu

Abstract
Motivation: Transcriptome-based computational drug repurposing has attracted considerable interest by enabling faster and more cost-effective drug discovery. Nevertheless, key limitations of the current drug connectivity-mapping paradigm have long been overlooked, including the lack of effective means to determine optimal query gene signatures.
Results: The novel approach Dr Insight implements a frame-breaking statistical model for the 'hand-shake' between disease and drug data. The genome-wide screening of concordantly expressed genes (CEGs) eliminates the need for subjective selection of query signatures, and elicits a better proxy for potential disease-specific drug targets. Extensive comparisons on simulated and real cancer datasets have validated the superior performance of Dr Insight over several popular drug-repurposing methods in detecting known cancer drugs and drug–target interactions. A proof-of-concept trial using the TCGA breast cancer dataset demonstrates the application of Dr Insight for a comprehensive analysis, from redirection of drug therapies to a systematic construction of disease-specific drug–target networks.
Availability and implementation: The Dr Insight R package is available at https://cran.r-project.org/web/packages/DrInsight/index.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
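As a sketch of the general connectivity-mapping idea that Dr Insight refines (not its CEG statistic), the following scores drugs by how strongly their induced expression changes anti-correlate with a disease signature; the random profiles are placeholders for real differential-expression data.

```python
# Hedged sketch of connectivity-mapping-style scoring (generic idea only).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
disease_signature = rng.normal(size=1000)     # disease-vs-normal logFC
drug_profiles = rng.normal(size=(50, 1000))   # 50 drugs x 1000 genes

# A drug whose profile anti-correlates with the disease signature is a
# candidate for "reversing" the disease expression state.
scores = np.array([spearmanr(disease_signature, p).correlation
                   for p in drug_profiles])
candidates = np.argsort(scores)[:5]           # most anti-correlated drugs
print(candidates, scores[candidates])
```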


Electronics ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 1966
Author(s):  
Yoichi Hayashi

Given the complexity of real-world datasets, it is difficult to represent data structures using existing deep learning (DL) models. Most research to date has concentrated on datasets with only one type of attribute: categorical or numerical. Categorical data are common in datasets such as the German (categorical) credit scoring dataset, which contains numerical, ordinal and nominal attributes. The heterogeneous structure of this dataset makes very high accuracy difficult to achieve. DL-based methods have achieved high accuracy (99.68%) on the Wisconsin Breast Cancer Dataset, whereas DL-inspired methods have achieved high accuracy (97.39%) on the Australian credit dataset. However, to our knowledge, no such method has been proposed to classify the German credit dataset. This study aimed to provide new insights into why DL-based and DL-inspired classifiers do not work well for categorical datasets consisting mainly of nominal attributes. We also discuss the problems associated with using nominal attributes in the design of high-performance classifiers. Considering the expanding utility of DL, this study's findings should aid the development of a new type of DL that can handle categorical datasets consisting mainly of nominal attributes, which are commonly used in risk evaluation, finance, banking and marketing.
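The nominal-attribute problem the abstract raises starts with how such attributes reach a classifier at all. As a hedged illustration of the standard preprocessing whose limitations the study discusses, the sketch below one-hot encodes the nominal columns of a toy credit-style table while standardizing the numerical ones; the column names are invented for the example.

```python
# Hedged sketch: standard handling of mixed nominal/numerical attributes
# (toy data with invented column names, not the actual German credit data).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'purpose':  ['car', 'radio/tv', 'education', 'car'],   # nominal
    'housing':  ['own', 'rent', 'own', 'free'],            # nominal
    'duration': [12, 24, 36, 12],                          # numerical
    'amount':   [1000, 5000, 2500, 800],                   # numerical
})

pre = ColumnTransformer([
    ('nominal', OneHotEncoder(), ['purpose', 'housing']),
    ('numeric', StandardScaler(), ['duration', 'amount']),
])
X = pre.fit_transform(df)
print(X.shape)   # each nominal level expands into one indicator column
```

The expansion illustrates the difficulty: every nominal level becomes its own dimension with no inherent ordering for a network to exploit.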


2017 ◽  
Author(s):  
Evelina Gabasova ◽  
John Reid ◽  
Lorenz Wernisch

Abstract
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation, etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and that most of the data samples follow this structure. In practice, however, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others.
In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels.
We evaluated the model on both simulated and real-world datasets. The simulated data exemplify datasets with varying degrees of common structure; in such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to the TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on high-dimensional lung and kidney cancer TCGA datasets, again showing clinically significant results and the scalability of the algorithm.
Author Summary
Integrative clustering is the task of identifying groups of samples by combining information from several datasets. An example of this task is cancer subtyping, where we cluster tumour samples based on several datasets, such as gene expression, proteomics and others. Most existing algorithms assume that all such datasets share a similar cluster structure, with samples outside these clusters treated as noise. The structure can, however, be much more heterogeneous: some meaningful clusters may appear only in some datasets.
In the paper, we introduce the Clusternomics algorithm, which identifies groups of samples across heterogeneous datasets. It models both the cluster structure of individual datasets and the global structure that appears as a combination of local structures. The algorithm uses probabilistic modelling to identify the groups and share information across the local and global levels. We evaluated the algorithm on both simulated and real-world datasets, where it found clinically significant clusters with different survival outcomes.
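As a much simpler, hedged illustration of the local/global idea (not the paper's hierarchical Dirichlet mixture), the sketch below clusters two toy datasets separately and then treats each observed combination of local labels as a candidate global cluster, so clusters can merge in one dataset while splitting in the other; k-means with k=3 per dataset is an assumption.

```python
# Hedged sketch: local clusters per dataset, global clusters as observed
# combinations of local labels (a crude stand-in for Clusternomics).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 50))    # toy "gene expression"
methylation = rng.normal(size=(200, 30))   # toy "methylation"

local_expr = KMeans(n_clusters=3, n_init=10,
                    random_state=0).fit_predict(expression)
local_meth = KMeans(n_clusters=3, n_init=10,
                    random_state=0).fit_predict(methylation)

# Global structure: samples grouped by their combination of local labels,
# so a cluster joined in one dataset can be separated in the other.
global_labels = {pair: i for i, pair in
                 enumerate(sorted(set(zip(local_expr, local_meth))))}
assignments = [global_labels[p] for p in zip(local_expr, local_meth)]
print(len(global_labels))   # number of occupied global clusters
```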

