Cox-nnet v2.0: improved neural-network-based survival prediction extended to large-scale EMR data

Author(s):  
Di Wang ◽  
Zheng Jing ◽  
Kevin He ◽  
Lana X Garmire

Abstract Summary: Cox-nnet is a neural-network-based prognosis prediction method, originally applied to genomics data. Here, we propose version 2 of Cox-nnet, with significant improvements in efficiency and interpretability that make it suitable for predicting prognosis from large-scale population data, including electronic medical record (EMR) datasets. We also add permutation-based feature importance scores and the direction of feature coefficients. When applied to a kidney transplantation dataset, Cox-nnet v2.0 reduces the training time of Cox-nnet by up to 32-fold (n = 10 000) and achieves better prediction accuracy than Cox-PH (P < 0.05). It also achieves similarly superior performance on the publicly available SUPPORT dataset (n = 8000). The high efficiency and accuracy make Cox-nnet v2.0 a desirable method for survival prediction in large-scale EMR data. Availability and implementation: Cox-nnet v2.0 is freely available to the public at https://github.com/lanagarmire/Cox-nnet-v2.0. Supplementary information: Supplementary data are available at Bioinformatics online.
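As a rough illustration of the two ideas named in the abstract, the sketch below computes a Cox negative partial log-likelihood and a permutation-based feature importance score. It is not the Cox-nnet v2.0 implementation: the function names, the `predict` callable, and the choice of the partial likelihood as the score being compared are assumptions made for this example.

```python
import numpy as np

def neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood (Breslow-style handling of ties).

    risk  : predicted log hazard ratio per patient (e.g. a network's output)
    time  : observed follow-up time per patient
    event : 1 if the event was observed, 0 if the patient was censored
    """
    risk, time, event = (np.asarray(a, dtype=float) for a in (risk, time, event))
    # at_risk[i, j] is True when patient j is still under observation at time[i]
    at_risk = time[None, :] >= time[:, None]
    log_risk_set = np.log((np.exp(risk)[None, :] * at_risk).sum(axis=1))
    # Only patients with an observed event contribute a term
    return -np.sum(event * (risk - log_risk_set))

def permutation_importance(predict, X, time, event, n_repeats=5, seed=0):
    """Mean increase in the loss when one feature column is shuffled.

    predict is a hypothetical callable mapping a feature matrix to risk scores.
    """
    rng = np.random.default_rng(seed)
    base = neg_partial_log_likelihood(predict(X), time, event)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the link to the outcome
            scores[j] += neg_partial_log_likelihood(predict(Xp), time, event) - base
    return scores / n_repeats
```

The pairwise `at_risk` matrix needs memory quadratic in the number of patients, so this form is only suitable for small illustrations. A feature the model ignores receives an importance of exactly zero, because shuffling it leaves the predictions unchanged.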

2020 ◽  
Vol 189 (7) ◽  
pp. 717-725 ◽  
Author(s):  
Marnie Downes ◽  
John B Carlin

Abstract Multilevel regression and poststratification (MRP) is a model-based approach for estimating a population parameter of interest, generally from large-scale surveys. It has been shown to be effective in highly selected samples, which is particularly relevant to investigators of large-scale population health and epidemiologic surveys facing increasing difficulties in recruiting representative samples of participants. We aimed to further examine the accuracy and precision of MRP in a context where census data provided reasonable proxies for true population quantities of interest. We considered 2 outcomes from the baseline wave of the Ten to Men study (Australia, 2013–2014) and obtained relevant population data from the 2011 Australian Census. MRP was found to achieve generally superior performance relative to conventional survey weighting methods for the population as a whole and for population subsets of varying sizes. MRP resulted in less variability among estimates across population subsets relative to sample weighting, and there was some evidence of small gains in precision when using MRP, particularly for smaller population subsets. These findings offer further support for MRP as a promising analytical approach for addressing participation bias in the estimation of population descriptive quantities from large-scale health surveys and cohort studies.
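The poststratification step of MRP can be sketched as a population-weighted average of per-cell estimates. The example below assumes that the cell-level estimates have already been produced by a multilevel regression and that census counts exist for the same cells; the function name and the numbers are illustrative only.

```python
import numpy as np

def poststratify(cell_estimates, cell_population_counts):
    """Weight each cell's model-based estimate by its census population count."""
    theta = np.asarray(cell_estimates, dtype=float)
    counts = np.asarray(cell_population_counts, dtype=float)
    return float(np.sum(counts * theta) / np.sum(counts))

# Two hypothetical cells: estimated prevalence 0.2 and 0.4,
# with census counts of 3000 and 1000 people respectively.
estimate = poststratify([0.2, 0.4], [3000, 1000])  # (600 + 400) / 4000 = 0.25
```

An estimate for a population subset follows from the same formula restricted to the cells that make up that subset.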


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Xiujin Yu ◽  
Shengfu Liu ◽  
Hui Zhang

As one of the oldest languages in the world, Chinese has a long cultural history and unique language charm. Multilayer self-organizing neural networks and data mining techniques have been widely used and can achieve high-precision prediction in different fields. However, they have rarely been applied to Chinese language feature analysis. To analyze the characteristics of the Chinese language accurately, this paper uses a multilayer self-organizing neural network and the corresponding data mining technology for feature recognition and compares it with other types of neural network algorithms. The results show that, when many samples are available, the multilayer self-organizing neural network achieves an accuracy, recall, and F1 score of 68.69%, 80.21%, and 70.19%, respectively, for feature recognition. Under strong noise, it maintains high efficiency of feature analysis. This shows that the multilayer self-organizing neural network has superior performance and can provide strong support for Chinese language feature analysis.


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Jun Guo ◽  
Shu Liu ◽  
Bin Zhang ◽  
Yongming Yan

Cloud applications provide access to a large pool of virtual machines for building high-quality applications that satisfy customers’ requirements. A difficult issue is how to predict virtual machine response time, because it determines when dynamically scalable virtual machines can be adjusted. To address this issue, this paper proposes a method for predicting virtual machine response time based on a genetic algorithm-back propagation (GA-BP) neural network. First, we predict component response time from historical virtual machine component usage data: the number of concurrent requests and the response time. Then, we predict the service response time of the virtual machines. The results of large-scale experiments show the effectiveness and feasibility of our method.


Author(s):  
Kai Zheng ◽  
Zhu-Hong You ◽  
Lei Wang ◽  
Leon Wong ◽  
Zhan-Heng Chen ◽  
...  

Abstract Motivation: PIWI proteins and Piwi-interacting RNAs (piRNAs) are commonly detected in human cancers, especially in germline and somatic tissues, and correlate with poorer clinical outcomes, suggesting that they play a functional role in cancer. As the combinatorial explosion of possible ncRNA-disease associations becomes increasingly apparent, new bioinformatics methods for large-scale identification and prioritization of potential associations are of interest. However, in the real world, the network of interactions between molecules is enormously intricate and noisy, which poses a problem for efficient graph mining. This study makes a preliminary attempt at bionetwork-based graph mining. Results: In this study, we present a method based on a graph attention network, called GAPDA, to identify potential and biologically significant piRNA-disease associations (PDAs). The attention mechanism can calculate a hidden representation of an association in the network based on neighbor nodes and assign weights to the input to make decisions. In particular, we introduced attention-based graph neural networks to the field of bio-association prediction for the first time and proposed an abstract network topology suitable for small samples. Specifically, we combined piRNA sequence information and disease semantic similarity with the piRNA-disease association network to construct a new attribute network. In the experiment, GAPDA performed excellently in five-fold cross-validation, with an AUC of 0.9038. It also outperformed methods based on collaborative filtering and attribute features. The experimental results show that GAPDA demonstrates the promise of graph neural networks on such problems and can be an excellent supplement for future biomedical research. Contact: [email protected]; [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Fatima Zohra Smaili ◽  
Xin Gao ◽  
Robert Hoehndorf

Abstract Motivation: Ontologies are widely used in biomedicine for the annotation and standardization of data. One of the main roles of ontologies is to provide structured background knowledge within a domain as well as a set of labels, synonyms, and definitions for the classes within a domain. The two types of information provided by ontologies have been extensively exploited in natural language processing and machine learning applications. However, they are commonly used separately, and thus it is unknown if joining the two sources of information can further benefit data analysis tasks. Results: We developed a novel method that applies named entity recognition and normalization methods on texts to connect the structured information in biomedical ontologies with the information contained in natural language. We apply this normalization both to literature and to the natural language information contained within ontologies themselves. The normalized ontologies and text are then used to generate embeddings, and relations between entities are predicted using a deep Siamese neural network model that takes these embeddings as input. We demonstrate that our novel embedding and prediction method using self-normalized biomedical ontologies significantly outperforms the state-of-the-art methods in embedding ontologies on two benchmark tasks: prediction of interactions between proteins and prediction of gene–disease associations. Our method also allows us to apply ontology-based annotations and axioms to the prediction of toxicological effects of chemicals where our method shows superior performance. Our method is generic and can be applied in scenarios where ontologies consisting of both structured information and natural language labels or synonyms are used. Availability: https://github.com/bio-ontology-research-group/[email protected] and [email protected]


Stroke ◽  
2021 ◽  
Vol 52 (Suppl_1) ◽  
Author(s):  
Gurkamal Kaur ◽  
Jose Dominguez ◽  
Rosa Semaan ◽  
Leanne Fuentes ◽  
Jonathan Ogulnick ◽  
...  

Introduction: Subarachnoid hemorrhage (SAH) can be a devastating neurologic condition that leads to cardiac arrest (CA) and, ultimately, poor clinical outcomes. Existing literature on this subject reveals a dismal prognosis but is based on relatively small sample sizes. We aimed to further elucidate the incidence, mortality rates, and outcomes of CA patients with SAH using large-scale population data. Methods: A retrospective cohort study was conducted using the National Inpatient Sample (NIS) database. Patients hospitalized between 2008 and 2014 were identified using International Classification of Diseases (ICD) 9th and 10th edition codes for non-traumatic SAH, CA of unspecified cause, and CA due to other underlying conditions. For all regression analyses, a p-value of <0.05 was considered statistically significant. Results: We identified 170,869 patients hospitalized for non-traumatic SAH. Among these, the incidence of CA was 3.17%. The mortality rate in CA with SAH was 82% (vs non-CA 18.4%, p<0.001). Among patients with CA and SAH, 15.7% survived and were discharged to special facilities and services (vs non-CA 37.6%, p<0.0001), and the remaining 2.3% were discharged home (vs non-CA 44.0%, p<0.0001). A higher NIS SAH severity score (NIS-SSS) was a predictor of CA in SAH patients (p<0.0001). Patients treated with aneurysm clipping and coiling had lower odds of CA (p<0.0001). Conclusion: The study confirms the poor prognosis of patients with CA and SAH using large-scale population data. Patients who underwent aneurysm treatment showed a lower association with CA. The findings presented here provide useful data for clinical decision making and for guiding goals-of-care discussions with family members. Further studies may identify interventions and protocols for the treatment of these severely ill patients.


Information ◽  
2021 ◽  
Vol 12 (6) ◽  
pp. 242
Author(s):  
Jianlong Xu ◽  
Zicong Zhuang ◽  
Zhiyu Xia ◽  
Yuhui Li

Blockchain is an innovative distributed ledger technology that is widely used to build next-generation applications without the support of a trusted third party. With the ceaseless evolution of the service-oriented computing (SOC) paradigm, Blockchain-as-a-Service (BaaS) has emerged, which facilitates the development of blockchain-based applications. To develop a high-quality blockchain-based system, users must select highly reliable blockchain services (peers) that offer excellent quality-of-service (QoS). Because the vast number of blockchain services leads to sparse QoS data, selecting the optimal personalized services is challenging. Hence, we improve neural collaborative filtering and propose a QoS-based blockchain service reliability prediction algorithm under BaaS, named modified neural collaborative filtering (MNCF). In this model, we combine a neural network with matrix factorization to perform collaborative filtering on the latent feature vectors of users. Furthermore, multi-task learning for sharing different parameters is introduced to improve the performance of the model. Experiments on a large-scale real-world dataset validate its superior performance compared to baselines.


Author(s):  
Ke Wang ◽  
Xin Geng

Label Distribution Learning (LDL) is a general learning paradigm in machine learning, which includes both single-label learning (SLL) and multi-label learning (MLL) as special cases. Recently, many LDL algorithms have been proposed to handle different application tasks such as facial age estimation, head pose estimation and visual sentiment distribution prediction. However, the training time complexity of most existing LDL algorithms is too high, which makes them inapplicable to large-scale LDL. In this paper, we propose a novel LDL method to address this issue, termed Discrete Binary Coding based Label Distribution Learning (DBC-LDL). Specifically, we design an efficient discrete coding framework to learn binary codes for instances. Furthermore, both the pair-wise semantic similarities and the original label distributions are integrated into this framework to learn highly discriminative binary codes. In addition, a fast approximate nearest neighbor (ANN) search strategy is used to predict label distributions for testing instances. Experimental results on five real-world datasets demonstrate its superior performance over several state-of-the-art LDL methods at a lower time cost.
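The prediction step described above, looking up neighbours in binary-code space and reusing their label distributions, might be sketched as follows. This is not DBC-LDL itself: the sketch uses an exact Hamming-distance search rather than an approximate one, and averaging the neighbours' distributions is an assumption, since the abstract does not say how they are combined.

```python
import numpy as np

def predict_label_distribution(query_code, train_codes, train_dists, k=3):
    """Average the label distributions of the k training instances whose
    binary codes are closest to the query code in Hamming distance."""
    query_code = np.asarray(query_code)
    train_codes = np.asarray(train_codes)
    # Hamming distance = number of differing bits per training code
    hamming = np.count_nonzero(train_codes != query_code, axis=1)
    nearest = np.argsort(hamming, kind="stable")[:k]
    pred = np.asarray(train_dists, dtype=float)[nearest].mean(axis=0)
    return pred / pred.sum()  # renormalise so the result sums to 1

codes = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
dists = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
pred = predict_label_distribution([0, 0, 0, 0], codes, dists, k=2)
# The two nearest codes are the first two rows, so pred is [0.75, 0.25]
```

Comparing compact binary codes is cheap, which is the property a large-scale method would exploit. A real system would replace the exhaustive scan here with an indexed or approximate search.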


2021 ◽  
Author(s):  
Yuansong Zeng ◽  
Xiang Zhou ◽  
Zixiang Pan ◽  
Yutong Lu ◽  
Yuedong Yang

Single-cell RNA sequencing (scRNA-seq) techniques provide high-resolution data on cellular heterogeneity in diverse tissues, and a critical step in the data analysis is cell type identification. Traditional methods usually cluster the cells and manually identify cell clusters through marker genes, which is time-consuming and subjective. With the launch of several large-scale single-cell projects, millions of sequenced cells have been annotated, and transferring labels from the annotated datasets to newly generated datasets is promising. One powerful way to transfer labels is to learn cell relations through a graph neural network (GNN), but a vanilla GNN struggles to process millions of cells because of the expensive message-passing procedure at each training epoch. Here, we have developed a robust and scalable GNN-based method for accurate single cell classification (GraphCS), where the graph is constructed to connect similar cells within and between labelled and unlabelled scRNA-seq datasets for propagation of shared information. To overcome the slow information propagation of GNNs at each training epoch, the diffused information is pre-calculated via the approximate Generalized PageRank algorithm, enabling sublinear complexity for high speed and scalability on millions of cells. Compared with existing methods, GraphCS demonstrates better performance on simulated, cross-platform, and cross-species scRNA-seq datasets. More importantly, our model can achieve superior performance on a large dataset with one million cells within 50 minutes.
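The idea of pre-calculating diffused information once, instead of passing messages at every epoch, can be illustrated with a truncated Generalized PageRank sum, Z = Σ_k w_k P^k X. The sketch below computes that sum exactly on a small dense graph. GraphCS uses an approximate algorithm, so this is only a stand-in, and the self-loops, symmetric normalisation and weight schedule are assumptions of this example.

```python
import numpy as np

def gpr_features(adj, X, weights):
    """Return sum_k weights[k] * P^k @ X for a normalised adjacency matrix P."""
    adj = np.asarray(adj, dtype=float)
    adj = adj + np.eye(len(adj))            # add self-loops
    deg = adj.sum(axis=1)
    P = adj / np.sqrt(np.outer(deg, deg))   # D^{-1/2} (A + I) D^{-1/2}
    H = np.asarray(X, dtype=float)          # H holds P^k @ X at step k
    out = np.zeros_like(H)
    for w in weights:
        out += w * H
        H = P @ H
    return out

# Personalised-PageRank-style weights, w_k = alpha * (1 - alpha)^k
alpha = 0.15
weights = [alpha * (1 - alpha) ** k for k in range(10)]
```

The result can be computed once and then fed to an ordinary feed-forward classifier, so training no longer touches the graph.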


Author(s):  
Moritz Herrmann ◽  
Philipp Probst ◽  
Roman Hornung ◽  
Vindi Jurinovic ◽  
Anne-Laure Boulesteix

Abstract Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied to 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups, especially clinical variables, from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: [email protected], +49 89 2180 3198. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
All analyses are reproducible using R code freely available on GitHub.
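For readers unfamiliar with concordance metrics, the sketch below computes Harrell's C-index, a simpler unweighted relative of the Uno's C-index used in the study. Uno's version additionally reweights pairs by the inverse probability of censoring, which is not shown here. The code is a Python illustration, not the authors' R code.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Fraction of comparable pairs in which the patient who fails earlier
    was given the higher predicted risk (ties in risk count as 0.5)."""
    time, event, risk = (np.asarray(a) for a in (time, event, risk))
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored patient cannot be the earlier failure
        for j in range(len(time)):
            if time[j] > time[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Risks perfectly ordered against survival time give a C-index of 1.0
c = harrell_c_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0])
```

A value of 0.5 corresponds to random ranking. The function raises a division-by-zero error when no comparable pair exists, which a production implementation would need to handle.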

