independent test dataset

2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Akitoshi Shimazaki ◽  
Daiju Ueda ◽  
Antoine Choppin ◽  
Akira Yamamoto ◽  
Takashi Honjo ◽  
...  

Abstract We developed and validated a deep learning (DL)-based model using the segmentation method and assessed its ability to detect lung cancer on chest radiographs. Chest radiographs for use as a training dataset and a test dataset were collected separately from January 2006 to June 2018 at our hospital. The training dataset was used to train and validate the DL-based model with five-fold cross-validation. The model sensitivity and mean false positive indications per image (mFPI) were assessed with the independent test dataset. The training dataset included 629 radiographs with 652 nodules/masses and the test dataset included 151 radiographs with 159 nodules/masses. The DL-based model had a sensitivity of 0.73 with 0.13 mFPI in the test dataset. Sensitivity was lower in lung cancers that overlapped with blind spots such as pulmonary apices, pulmonary hila, chest wall, heart, and sub-diaphragmatic space (0.50–0.64) compared with those in non-overlapped locations (0.87). The Dice coefficient for the 159 malignant lesions was on average 0.52. The DL-based model was able to detect lung cancers on chest radiographs, with low mFPI.
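The Dice coefficient reported above measures the overlap between a predicted segmentation mask and the reference mask. The abstract does not say how it was computed, so the following is only a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on binary masks stored as nested lists. The function name `dice_coefficient` and the toy masks are assumptions made for illustration and are not taken from the study.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice overlap between two binary masks given as lists of rows of 0/1.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    pred_sum = 0
    true_sum = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        if len(pred_row) != len(true_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            true_sum += 1 if t else 0
    if pred_sum + true_sum == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / (pred_sum + true_sum)


# Toy 2x3 masks (made-up values): two pixels overlap, each mask has three pixels.
pred = [[0, 1, 1],
        [0, 1, 0]]
truth = [[0, 1, 0],
         [0, 1, 1]]
print(round(dice_coefficient(pred, truth), 3))
```

A per-lesion average such as the one quoted in the abstract would be the mean of this value over all lesions, assuming each lesion is scored separately.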


2021 ◽  
Author(s):  
Ignacio Sarasua ◽  
Sebastian Pölsterl ◽  
Christian Wachinger

Abstract Deep learning offers a powerful approach for analyzing hippocampal changes in Alzheimer's disease (AD) without relying on handcrafted features. Nevertheless, an input format needs to be selected to pass the image information to the neural network, which has wide ramifications for the analysis, but has not been evaluated yet. We compare five hippocampal representations (and their respective tailored network architectures) that span from raw images to geometric representations like meshes and point clouds. We performed a thorough evaluation for the prediction of AD diagnosis and time-to-dementia prediction with experiments on an independent test dataset. Our results show that choosing an appropriate representation of the hippocampus for predicting Alzheimer's disease with deep learning is crucial, since it impacts performance and ease of interpretation.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Phasit Charoenkwan ◽  
Warot Chotpatiwetchkul ◽  
Vannajan Sanghiran Lee ◽  
Chanin Nantasenamat ◽  
Watshara Shoombuatong

Abstract Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906–0.910) and 2–17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP in order to allow easy access to the model.
SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.
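Accuracy and the Matthews correlation coefficient (MCC), the two independent-test metrics quoted above, can both be derived from a binary confusion matrix. The sketch below shows the standard formulas only. The function name and the example counts are hypothetical and do not reproduce the SCMTPP test set.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any marginal is zero the MCC is undefined; 0.0 is a common convention.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Made-up counts for illustration only.
acc, mcc = accuracy_and_mcc(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={acc:.3f}, MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside accuracy when the two classes differ in size, as they do in the 1853 versus 3233 dataset described above.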


2021 ◽  
Vol 12 ◽  
Author(s):  
Yuran Jia ◽  
Shan Huang ◽  
Tianjiao Zhang

A DNA-binding protein (DBP) is a protein with a special DNA-binding domain that is associated with many important molecular biological mechanisms. Rapid development of computational methods has made it possible to predict DBPs on a large scale; however, existing methods do not fully integrate DBP-related features, resulting in imprecise predictions. In this article, we develop a DNA-binding protein identification method called KK-DBP. To improve prediction accuracy, we propose a feature extraction method that fuses multiple PSSM features. The experimental results show a prediction accuracy on the independent test dataset PDB186 of 81.22%, which is the highest of all existing methods.


2021 ◽  
Author(s):  
Anupama Jha ◽  
Mathieu Quesnel-Vallières ◽  
Andrei Thomas-Tikhonenko ◽  
Kristen W. Lynch ◽  
Yoseph Barash

Cancer is a set of diseases characterized by unchecked cell proliferation and invasion of surrounding tissues. The many genes that have been genetically associated with cancer or shown to directly contribute to oncogenesis vary widely between tumor types, but common gene signatures that relate to core cancer pathways have also been identified, signifying that cancer cases display common hallmark molecular features. It is not clear however whether there exist additional sets of genes or transcriptomic features that are less well known in cancer biology but that are also commonly deregulated across several cancer types. Here, in order to agnostically identify transcriptomic features that are commonly shared between cancer types, we used RNA-Seq datasets encompassing thousands of samples from 19 healthy tissue types and 18 solid tumor types to train three feed-forward neural networks, based either on protein-coding gene expression, lncRNA expression or splice junction use, to distinguish between healthy and tumor samples. All three models achieve high precision, recall and accuracy on test sets derived from 13 datasets used during training and on an independent test dataset, indicating that our models recognize transcriptome signatures that are consistent across tumors. Analysis of attribution values extracted from our models reveals that genes that are commonly altered in cancer by expression or splicing variations are under strong evolutionary and selective constraints, suggesting that they have important cellular functions. Importantly, we found that genes composing our cancer transcriptome signatures are not frequently affected by mutations or genomic alterations and that their functions differ widely from the genes genetically associated with cancer. Finally, our results also highlighted that deregulation of RNA-processing genes and aberrant splicing are pervasive features across a large array of solid tumor types. 
The transcriptomic features that we highlight here define cancer signatures that may reflect causal variations or consequences of disease state, or a combination of both.


2021 ◽  
Vol In Press (In Press) ◽  
Author(s):  
Chenao Zhan ◽  
Dazhong Tang ◽  
Lu Huang ◽  
Yayuan Geng ◽  
Tao Ai ◽  
...  

Background: The clinical manifestations of amyloid cardiomyopathy (AC) are not specific; therefore, AC is often misdiagnosed as hypertrophic cardiomyopathy (HCM) or hypertensive heart disease (HHD). A differential diagnosis of these three conditions is often necessary in the clinical setting. Objectives: To investigate the differential diagnostic performance of radiomic analysis, based on cardiac magnetic resonance (CMR) native T1 mapping images for the left ventricular hypertrophy (LVH) etiologies. Methods: This retrospective, case-control study was conducted on 91 participants (68 males and 23 females; mean age: 48 ± 13 years), including 22 patients with HHD, 27 patients with AC, 28 patients with HCM, and 14 controls in Tongji Hospital (Shanghai, China). All participants underwent 3.0T CMR imaging. Besides, radiomic analyses were performed using T1 mapping images. The cases were divided into training and test datasets using a random seed. Next, the models were constructed with the training dataset and evaluated with the test dataset. Results: A total of 1,033 radiomic features were extracted in this study. Overall, 11, 28, 19, and eight features were selected to construct the basal T1 mapping, mid-chamber T1 mapping, apical T1 mapping, and multi-module conjoint models, respectively. Optimal performance was reported in the mid-chamber and basal T1 mapping models. The area under the curve (AUC), precision, recall, and F1 score were 0.96, 0.84, 0.82, and 0.83 for the mid-chamber T1 mapping model and 0.96, 0.90, 0.89, and 0.88 for the basal T1 mapping model in the independent test dataset, respectively. The lowest diagnostic performance was observed in the apical T1 mapping model. The AUC, precision, recall, and F1 score of the apical T1 mapping model were 0.86, 0.71, 0.70, and 0.70 in the independent test dataset, respectively. 
Conclusions: The radiomic analysis of T1 mapping could accurately distinguish the three causes of myocardial hypertrophy, including HCM, HHD, and AC. It may also be a suitable alternative to late gadolinium enhancement-CMR.


Author(s):  
Akila Katuwawala ◽  
Bi Zhao ◽  
Lukasz Kurgan

Abstract Motivation: Intrinsically disordered protein regions interact with proteins, nucleic acids and lipids. Regions that bind lipids are implicated in a wide spectrum of cellular functions and several human diseases. Motivated by the growing amount of experimental data for these interactions and lack of tools that can predict them from the protein sequence, we develop DisoLipPred, the first predictor of the disordered lipid-binding residues (DLBRs). Results: DisoLipPred relies on a deep bidirectional recurrent network that implements three innovative features: transfer learning, bypass module that sidesteps predictions for putative structured residues, and expanded inputs that cover physiochemical properties associated with the protein–lipid interactions. Ablation analysis shows that these features drive predictive quality of DisoLipPred. Tests on an independent test dataset and the yeast proteome reveal that DisoLipPred generates accurate results and that none of the related existing tools can be used to indirectly identify DLBR. We also show that DisoLipPred’s predictions complement the results generated by predictors of the transmembrane regions. Altogether, we conclude that DisoLipPred provides high-quality predictions of DLBRs that complement the currently available methods. Availability and implementation: DisoLipPred’s webserver is available at http://biomine.cs.vcu.edu/servers/DisoLipPred/. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 16 ◽  
Author(s):  
Hoang V. Tran ◽  
Quang H. Nguyen

Background: Reactive oxygen species (ROS) have many roles in the body, such as cell signaling, homeostasis, and protection from harmful bacteria. However, too much ROS in the body will damage lipids, proteins, and DNA. Many studies show that many environmental factors increase the amount of ROS produced in the body. Antioxidant proteins are responsible for neutralizing these ROS or free radicals. Although the amount of data on protein sequences has increased over the last two decades, we still lack bioinformatics tools that can accurately identify antioxidant protein sequences. Biochemical methods for determining antioxidant proteins are very expensive and time consuming, so a machine learning approach is needed to speed up identification. Methods: In this study, we propose a new method that combines a convolutional neural network and Random Forest using two features, the normalized PSSM and the best selected feature of the ProtBert output. Result: Our model gave very good results on the independent test dataset with 97.3% sensitivity and 95.9% specificity. Comparison with current state-of-the-art models shows that our model is superior. Conclusion: We have also made iAnt available as an online website with a friendly interface at http://antixiodant.nguyenhongquang.edu.vn. iAnt has been developed to accurately identify antioxidant proteins. It outperforms the existing state-of-the-art methods, and it is available online.
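Sensitivity and specificity, as reported in the Result above, are the true positive rate and the true negative rate on the independent test dataset. The sketch below computes both from label lists. It assumes labels are encoded as 1 (antioxidant) and 0 (non-antioxidant), which is an illustrative convention and not something the abstract states, and the example labels are made up.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels encoded as 0/1."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of positives that were found
    specificity = tn / (tn + fp)  # fraction of negatives that were rejected
    return sensitivity, specificity


# Made-up labels: 4 positives (3 found), 6 negatives (5 rejected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```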


2021 ◽  
Author(s):  
Huiting Chen ◽  
Zhaozhong Zhu ◽  
Ye Qiu ◽  
Xingyi Ge ◽  
Heping Zheng ◽  
...  

The coronavirus 3C-like (3CL) protease is a cysteine protease. It plays an important role in viral infection and immune escape by not only cleaving the viral polyprotein ORF1ab at 11 sites, but also cleaving the host proteins. However, there is still a lack of effective tools for determining the cleavage sites of the 3CL protease. This study systematically investigated the diversity of the cleavage sites of the coronavirus 3CL protease on the viral polyprotein, and found that the cleavage motifs were highly conserved for viruses in the genera of Alphacoronavirus, Betacoronavirus and Gammacoronavirus. Strong residue preferences were observed at the neighboring positions of the cleavage sites. A random forest (RF) model was built to predict the cleavage sites of the coronavirus 3CL protease based on the representation of residues at cleavage site and neighboring positions by amino acid indexes, and the model achieved an AUC of 0.96 in cross-validations. The RF model was further tested on an independent test dataset composed of cleavage sites on host proteins, and achieved an AUC of 0.88 and a prediction precision of 0.80 when considering the accessibility of the cleavage site. Then, 1,079 human proteins were predicted to be cleaved by the 3CL protease by the RF model. These proteins were enriched in pathways related to neurodegenerative diseases and pathogen infection. Finally, a user-friendly online server named 3CLP was built to predict the cleavage sites of the coronavirus 3CL protease based on the RF model. Overall, the study not only provides an effective tool for identifying the cleavage sites of the 3CL protease, but also provides insights into the molecular mechanism underlying the pathogenicity of coronaviruses.
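The AUC values quoted above summarize how well the model's scores rank true cleavage sites above non-sites. One way to compute it, shown here as a sketch, uses the rank (Mann-Whitney) interpretation: the AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The function name, the 0/1 label encoding, and the scores are assumptions for illustration. The abstract does not describe how the authors computed AUC.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Quadratic in the number of samples, which is fine for a small illustration.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores: one positive (0.6) is ranked below one negative (0.7).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(roc_auc(labels, scores), 3))
```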


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yanjuan Li ◽  
Zhengnan Zhao ◽  
Zhixia Teng

As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) plays a crucial role in controlling gene replication, expression, cell cycle, DNA replication, and differentiation. The accurate identification of 4mC sites is necessary to understand biological functions. In this paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. Firstly, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Secondly, on the basis of the multifeature encoding scheme, we developed a stacked ensemble model, in which four machine learning algorithms, namely, BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, were utilized to implement an ensemble of base classifiers that produce intermediate results as input of the metaclassifier, Logistic. The experimental results on the independent test dataset demonstrate that the overall predictive accuracy of i4mC-EL is 82.19%, which is better than that of the existing methods. The user-friendly website implementing i4mC-EL can be accessed freely at the following.
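The Kmer part of the encoding scheme mentioned above is commonly implemented as the normalized frequency of every length-k nucleotide substring in a sequence. The abstract does not give the value of k or the exact normalization used by i4mC-EL, so the sketch below is an assumption-laden illustration: `kmer_frequencies`, the default `k=2`, and the example sequence are all hypothetical.

```python
from itertools import product


def kmer_frequencies(sequence, k=2):
    """Return the frequency of each DNA k-mer, in lexicographic ACGT order.

    Windows containing characters other than A, C, G, T are skipped.
    """
    kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    valid_windows = 0
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if window in counts:
            counts[window] += 1
            valid_windows += 1
    if valid_windows == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / valid_windows for kmer in kmers]


# Made-up sequence: windows are AC, CG, GT, TA, AC, so "AC" has frequency 2/5.
features = kmer_frequencies("ACGTAC", k=2)
print(len(features), features[1])
```

A vector like this would then be concatenated with other encodings (EIIP in the abstract) before being passed to the base classifiers of the stacked ensemble.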

