PredCID: prediction of driver frameshift indels in human cancer

Author(s):  
Zhenyu Yue ◽  
Xinlu Chu ◽  
Junfeng Xia

Abstract The discrimination of driver from passenger mutations has been a hot topic in cancer biology. Although recent advances have improved the identification of driver mutations in cancer genomic research, there is not yet a computational method specific to cancer frameshift indels (insertions and/or deletions). In addition, existing pathogenic frameshift indel predictors may suffer from many missing values because different transcripts can be chosen during variant annotation. In this study, we propose a computational model, called PredCID (Predictor for Cancer driver frameshift InDels), for accurately predicting cancer driver frameshift indels. Gene-, DNA-, transcript- and protein-level features are combined and selected for classification with an eXtreme Gradient Boosting classifier. Benchmarking results on the cross-validation dataset and an independent dataset showed that PredCID achieves better and more robust performance than existing non-cancer-specific methods in distinguishing cancer driver frameshift indels from passengers, and is therefore a valuable method for a deeper understanding of frameshift indels in human cancer. PredCID is freely available for academic research at http://bioinfo.ahu.edu.cn:8080/PredCID.

2019 ◽  
Vol 20 (5) ◽  
pp. 1925-1933 ◽  
Author(s):  
Zhenyu Yue ◽  
Le Zhao ◽  
Na Cheng ◽  
Hua Yan ◽  
Junfeng Xia

Abstract While recent advances in next-generation sequencing technologies have enabled the creation of a multitude of databases in cancer genomic research, there is not yet a comprehensive database focusing on the annotation of driver indels (insertions and deletions). Therefore, we have developed the database of Cancer driver InDels (dbCID), a collection of known coding indels that are likely to be involved in cancer development, progression or therapy. dbCID contains experimentally supported and putative driver indels derived from manual curation of the literature and is freely available online at http://bioinfo.ahu.edu.cn:8080/dbCID. Using the data deposited in dbCID, we summarized features of driver indels at four levels (gene, DNA, transcript and protein) by comparing them with putative neutral indels. We found that most of the genes containing driver indels in dbCID are known cancer genes that play a role in tumorigenesis. Contrary to expectation, the sequences affected by driver frameshift indels are not larger than those affected by neutral ones. In addition, frameshift and inframe driver indels tend to disrupt highly conserved regions in both DNA sequences and protein domains. Finally, we developed a computational method for discriminating cancer driver from neutral frameshift indels based on the data deposited in dbCID. The proposed method outperformed other widely used non-cancer-specific predictors on an external test set, which demonstrates the usefulness of the data deposited in dbCID. We hope dbCID will serve as a benchmark for improving and evaluating prediction algorithms, and that the characteristics summarized here may assist in investigating the mechanism of indel–cancer association.


2020 ◽  
Vol 21 (S13) ◽  
Author(s):  
Ke Li ◽  
Sijia Zhang ◽  
Di Yan ◽  
Yannan Bin ◽  
Junfeng Xia

Abstract Background: Identification of hot spots in protein-DNA interfaces provides crucial information for research on protein-DNA interaction and drug design. As experimental methods for determining hot spots are time-consuming, labor-intensive and expensive, there is a need for reliable computational methods that predict hot spots on a large scale. Results: Here, we propose a new method named sxPDH, based on supervised isometric feature mapping (S-ISOMAP) and extreme gradient boosting (XGBoost), to predict hot spots in protein-DNA complexes. We obtained 114 features from a combination of protein sequence, structure, network and solvent accessibility information, and systematically assessed various feature selection methods and manifold-learning-based dimensionality reduction methods. The results show that S-ISOMAP is superior to the other feature selection and manifold learning methods. XGBoost was then used to build the hot spot prediction model sxPDH on the three dimensionality-reduced features obtained from S-ISOMAP. Conclusion: Our method sxPDH boosts prediction performance using S-ISOMAP and XGBoost. The AUC of the model is 0.773, and the F1 score is 0.713. Experimental results on the benchmark dataset indicate that sxPDH achieves generally better performance in predicting hot spots than state-of-the-art methods.


Author(s):  
Wirot Yotsawat ◽  
Pakaket Wattuya ◽  
Anongnart Srivihok

Several credit-scoring models have been developed using ensemble classifiers in order to improve the accuracy of assessment. However, among the ensemble models, little attention has been paid to tuning the hyper-parameters of the base learners, although these are crucial to constructing ensemble models. This study proposes an improved credit scoring model based on the extreme gradient boosting (XGB) classifier using Bayesian hyper-parameter optimization (XGB-BO). The model comprises two steps. First, data pre-processing is used to handle missing values and scale the data. Second, Bayesian hyper-parameter optimization is applied to tune the hyper-parameters of the XGB classifier, which are then used to train the model. The model is evaluated on four widely used public datasets, i.e., the German, Australian, Lending Club, and Polish datasets. Several state-of-the-art classification algorithms are implemented for predictive comparison with the proposed method. The proposed model showed promising results, with an improvement in accuracy of 4.10%, 3.03%, and 2.76% on the German, Lending Club, and Australian datasets, respectively. According to the evaluation results, the proposed model outperformed commonly used techniques, e.g., decision tree, support vector machine, neural network, logistic regression, random forest, and bagging. The experimental results confirmed that the XGB-BO model is suitable for assessing the creditworthiness of applicants.


2021 ◽  
Vol 19 (2) ◽  
pp. 33-40
Author(s):  
Muchamad Taufiq Anwar ◽  
Denny Rianditha Arief Permana

Choosing the right data mining technique/model for a given case is essential to obtaining a good model (high accuracy and a good fit to the problem being solved). This study aims to compare the performance of data mining techniques applied to student dropout prediction. The comparison was carried out with the PyCaret library in Python, modeling with 14 data mining models/techniques: Extreme Gradient Boosting, Ada Boost Classifier, Light Gradient Boosting Machine, Random Forest Classifier, Gradient Boosting Classifier, Extra Trees Classifier, Decision Tree Classifier, K Neighbors Classifier, Naive Bayes, Ridge Classifier, Linear Discriminant Analysis, Logistic Regression, SVM - Linear Kernel, and Quadratic Discriminant Analysis. The model performance metrics used were Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC (Matthews correlation coefficient). The experimental results show that student dropout prediction is better modeled with ensemble-learner and decision-tree-based models, with accuracy reaching 99%. Decision trees have an advantage over other models such as SVM - Linear Kernel and Quadratic Discriminant Analysis because they can separate the data into the two target classes in finer detail. After attribute adjustment, removal of records with missing values, and parameter tuning, the various models yielded similar accuracy, about 87%. Differences in accuracy between models become very small when few data attributes are used.
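Most of the evaluation metrics named in this abstract can be derived from a single binary confusion matrix. The following is a minimal sketch using only the Python standard library; the function name `binary_metrics` and the example counts are hypothetical and do not come from the study, and AUC is left out because it needs ranked scores rather than counts.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = (((tp + fp) / n) * ((tp + fn) / n)
                + ((fn + tn) / n) * ((fp + tn) / n))
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa, "mcc": mcc}

# Hypothetical counts for a dropout classifier (illustration only).
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Kappa and MCC are useful alongside accuracy here because dropout data are often imbalanced, and both penalize a model that simply predicts the majority class.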


2020 ◽  
Author(s):  
Qin He ◽  
Kai Qin ◽  
Diego Loyola ◽  
Ding Li ◽  
Jincheng Shi ◽  
...  

Measurements of nitrogen dioxide (NO₂) are essential for understanding air pollution and evaluating its impacts, and satellite remote sensing is an essential approach for obtaining tropospheric NO₂ columns over wide temporal and spatial ranges. However, the Ozone Monitoring Instrument (OMI) onboard Aura has been affected by a loss of spatial coverage (around one-third of 60 viewing positions), commonly referred to as the row anomaly, since June 25th, 2007, and especially after June 5th, 2011. The Global Ozone Monitoring Experiment-2 (GOME-2) onboard MetOp-A/B provides data with a maximum swath width of 1920 km and needs one and a half days to cover the globe. Therefore, it is challenging to obtain diurnal, spatially continuous vertical column densities (VCDs) of tropospheric NO₂, which is limited by the performance of the instruments. Besides, the presence of clouds generates numerous missing and abnormal values that affect the application of VCD data. To fill the data gaps arising from these two causes, this study proposes a framework for reconstructing OMI (afternoon overpass) tropospheric NO₂ VCDs over China by combining GOME-2 (morning overpass) products. First, we investigated ground-based hourly NO₂ concentrations to characterize the diurnal variations, thus deriving the underlying factors that cause the difference between morning and afternoon VCDs. Then, the eXtreme Gradient Boosting (XGBoost) method was applied to estimate the missing values of OMI QA4ECV tropospheric NO₂ VCDs from GOME-2 GDP offline products and other ancillary variables. The spatial coverage of OMI grids (binned to 0.25°) over China from 2015 to 2018 increased on average from 22% to 63%. Furthermore, for those grids that are null in both products, we used an adaptive weighted temporal fitting method to fill the data still missing after the previous step. The reconstructed data set shows spatial and temporal patterns that are coherent with the adjacent areas. Our approach has great potential for reconstructing spatially continuous tropospheric NO₂ columns, which are critical for daily air quality monitoring.


Cancers ◽  
2020 ◽  
Vol 12 (8) ◽  
pp. 2002 ◽  
Author(s):  
Elvin D. de Araujo ◽  
György M. Keserű ◽  
Patrick T. Gunning ◽  
Richard Moriggl

Insights into the mutational landscape of the human cancer genome coding regions defined about 140 distinct cancer driver genes in 2013, which approximately doubled to 300 in 2018 following advances in systems cancer biology studies [...]


2019 ◽  
Vol 40 (Supplement_1) ◽  
Author(s):  
A Agibetov ◽  
B Seirer ◽  
S Aschauer ◽  
D Dalos ◽  
R Rettl ◽  
...  

Abstract Background/Introduction Cardiac amyloidosis (CA) is a rare and complex condition with poor prognosis. Novel therapies have been shown to improve outcome; however, most affected individuals remain undiagnosed, mainly due to a lack of awareness among clinicians. One approach to overcome this issue is to use automated diagnostic algorithms that act on routinely available laboratory results. Purpose We tested the performance of flexible machine learning and traditional statistical prediction models for non-invasive CA diagnosis based on routinely collected laboratory parameters. Since laboratory routines vary between hospitals and other health care providers, special attention was paid to adaptive and dynamic parameter selection, and to dealing with the frequent occurrence of missing values. Methods Our cohort consisted of 376 clinically accepted patients with various types of heart failure. Of these, 69 were diagnosed with CA via endomyocardial biopsy (positives), and 307 had unrelated cardiac disorders (negatives). A total of 63 routine laboratory parameters were collected from these patients, with a high incidence of missing values (on average 60% of patients for each parameter). We tested the performance of two prediction models: logistic regression, and extreme gradient boosting with regression trees. To deal with missing values we adopted two strategies: a) finding an optimal overlap of parameters and deleting all patients with missing values (reduction of parameters and samples), and b) retaining all features and imputing missing values with parameter-wise means. To fairly assess the performance of the prediction models we employed 10-fold cross-validation (stratified to preserve the sample class ratio). Finally, the area under the receiver-operator characteristic curve (ROC AUC) was used as our final performance measure. Results A complex machine learning model based on forests of regression trees proved to be the most performant (ROC AUC 0.94±4%) and robust to missing values. The best regression model was obtained with the 25 most frequent variables and patient deletion in case of missing values (ROC AUC 0.82±0.8%). While progressive inclusion of predictor variables worsened the performance of the logistic regression, it improved that of the machine learning approach. Conclusions Extreme gradient boosting of regression trees on routine laboratory parameters achieved staggering accuracy results for the automated diagnosis of CA. Our data suggest that implementations of such algorithms as independent interpreters of routine laboratory results may help to establish or suggest the diagnosis of CA in patients with heart failure symptoms, even in the absence of specialized experts.
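The evaluation design described above (stratified 10-fold cross-validation plus parameter-wise mean imputation) can be sketched with the standard library alone. This is an illustration, not the authors' pipeline: the helper names are hypothetical, the toy laboratory values are invented, and computing the means on the training rows only is an assumption made here to avoid leaking test information (the abstract does not say how the means were computed).

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds while preserving the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def column_means(rows, train_idx):
    """Mean of each parameter over the training rows, ignoring None."""
    means = []
    for c in range(len(rows[0])):
        vals = [rows[i][c] for i in train_idx if rows[i][c] is not None]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return means

def impute(row, means):
    """Replace missing entries (None) with the parameter-wise mean."""
    return [m if v is None else v for v, m in zip(row, means)]

# Class sizes taken from the abstract: 69 positives, 307 negatives.
labels = [1] * 69 + [0] * 307
folds = stratified_folds(labels, k=10)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]

# Toy laboratory table with missing values (hypothetical numbers).
rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
means = column_means(rows, train_idx=[0, 1, 2])
filled = impute(rows[0], means)
```

With 69 positives spread over 10 folds, every fold receives 6 or 7 positives, so each test fold keeps roughly the cohort's 18% positive rate.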


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yanbo Mai ◽  
Zheng Sheng ◽  
Hanqing Shi ◽  
Qixiang Liao

Atmospheric refraction is a meteorological phenomenon mainly caused by gas molecules and aerosol particles in the atmosphere, which can change the propagation direction of electromagnetic waves. The atmospheric refractive index, a measure of atmospheric refraction, is an important parameter for electromagnetic wave propagation. Given that it is difficult to obtain the atmospheric refractive index between 100 m and 3000 m over the ocean, this paper proposes an improved extreme gradient boosting (XGBoost) algorithm based on a comprehensive learning particle swarm optimization (CLPSO) operator to obtain it. The mean absolute percentage error (MAPE) and root mean-squared error (RMSE) are used as evaluation criteria to compare the prediction results of the improved XGBoost algorithm with those of a backpropagation (BP) neural network and the traditional XGBoost algorithm. The results show that the MAPE and RMSE of the improved XGBoost algorithm are 39% lower than those of the BP neural network and 32% lower than those of the traditional XGBoost. Moreover, the improved XGBoost algorithm has the strongest learning and generalization capability for calculating missing values of the atmospheric refractive index among the three algorithms. The results of this paper provide a new method for obtaining the atmospheric refractive index, which will be a valuable reference for further study of atmospheric refraction.
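The two evaluation criteria used above are MAPE = (100/n) Σ |(yᵢ − ŷᵢ)/yᵢ| and RMSE = sqrt((1/n) Σ (yᵢ − ŷᵢ)²). A minimal sketch of both follows, assuming the standard definitions; the sample values are arbitrary illustrative numbers, not data from the paper.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean-squared error, in the same units as the inputs."""
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

# Arbitrary illustrative values (not from the paper).
actual = [320.0, 310.0, 300.0]
predicted = [316.8, 313.1, 297.0]

print(f"MAPE = {mape(actual, predicted):.2f}%")
print(f"RMSE = {rmse(actual, predicted):.3f}")
```

MAPE is scale-free, which makes it convenient for comparing models across altitudes, while RMSE weights large individual errors more heavily.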


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nicholas Brian Shannon ◽  
Laura Ling Ying Tan ◽  
Qiu Xuan Tan ◽  
Joey Wee-Shan Tan ◽  
Josephine Hendrikson ◽  
...  

Abstract Ovarian cancer is associated with poor prognosis. Platinum resistance contributes significantly to the high rate of tumour recurrence. We aimed to identify a set of molecular markers for predicting platinum sensitivity. A signature predicting cisplatin sensitivity was generated using the Genomics of Drug Sensitivity in Cancer and The Cancer Genome Atlas databases. Four potential biomarkers (CYTH3, GALNT3, S100A14, and ERI1) were identified and optimized for immunohistochemistry (IHC). Validation was performed on a cohort of patients (n = 50) treated with surgical resection followed by adjuvant carboplatin. Predictive models were established to predict chemosensitivity. The four biomarkers were also assessed for their ability to prognosticate overall survival in three ovarian cancer microarray expression datasets from the Gene Expression Omnibus. The extreme gradient boosting (XGBoost) algorithm was selected for the final model, and its accuracy was validated in an independent validation dataset (n = 10). CYTH3 and S100A14, followed by nodal stage, were the features with the greatest importance. The four-gene signature had prognostic performance comparable to that of clinical information for two-year survival. Assessment of tumour biology by means of gene expression can serve as an adjunct for prediction of chemosensitivity and prognostication. Potentially, the assessment of molecular markers alongside clinical information offers a chance to further optimise therapeutic decision making.


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e14530-e14530
Author(s):  
Petri Bono ◽  
Jussi Ekström ◽  
Matti K Karvonen ◽  
Jami Mandelin ◽  
Jussi Koivunen

e14530 Background: Bexmarilimab, an investigational immunotherapeutic antibody targeting Clever-1, is currently being investigated in the phase I/II MATINS study (NCT03733990) for advanced solid tumors. Machine learning (ML) based models combining extensive data could be generated to predict treatment responses to this first-in-class macrophage checkpoint inhibitor. Methods: 58 baseline features from 30 patients included in part 1 of the phase I/II MATINS trial were included in ML modelling. Seven patients were classified as benefitting from the therapy by RECIST 1.1 (PR or SD response in target or non-target lesions). Initial feature selection was done using a combination of domain knowledge and removal of features with several missing values, resulting in 20 clinically relevant features from 25 patients. The remaining data were standardized, and feature selection was performed using analysis of variance (ANOVA) based on F-values between response and features. With this approach, the number of features could be further reduced as the prediction performance increased until the most important features were included in the model. Several prediction models were trained, and prediction performance was evaluated using leave-one-out cross-validation (LOOCV), with and without SMOTE oversampling of the positive class of the training data inside each LOOCV fold. In LOOCV the prediction model was trained 25 times. A stacked meta classifier with SMOTE oversampling, combining three classifiers (elastic-net logistic regression, random forest and extreme gradient boosting), was chosen as the best performing prediction model. Results: Seven baseline features were associated with bexmarilimab treatment benefit. Increasing bexmarilimab dose and high tumor FoxP3 cells showed positive benefit. On the contrary, high baseline blood neutrophils, CD4, T-cells, B-cells, and CXCL10 indicated a negative relationship to the treatment benefit. The ML model trained with these seven features performed well in LOOCV, as 6/7 benefitting and 16/18 non-benefitting patients were classified correctly, and all considered classification performance metrics were good. In feature importance analysis, low baseline CXCL10 and neutrophils were characterized as the most important predictors of treatment benefit, with values of 0.19 and 0.16. Conclusions: This study highlights the possibility of using ML models to predict treatment benefit for novel cancer drugs such as bexmarilimab and to boost clinical development. These findings are in line with the expected immune activation of bexmarilimab treatment. The generated ML models should be further validated in a larger patient cohort. Clinical trial information: NCT03733990.
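The key point of the validation scheme above is that oversampling happens inside each LOOCV fold, on the training data only, so the held-out patient never influences the synthetic samples. The sketch below illustrates that structure with the standard library. It is not the authors' code: plain random duplication of positives is used as a simpler stand-in for SMOTE (which interpolates new samples instead), the helper names are hypothetical, and no classifier is fitted. The final lines only restate the reported counts (6/7 and 16/18) as sensitivity, specificity and accuracy.

```python
import random

def loocv_splits(n):
    """Yield (train_indices, test_index) for leave-one-out cross-validation."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i

def oversample_positives(train_idx, labels, rng):
    """Duplicate random positives until classes are balanced.

    Stand-in for SMOTE (assumption): SMOTE would create interpolated
    synthetic samples rather than exact duplicates.
    """
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))] if pos else []
    return train_idx + extra

# Class sizes from the abstract: 7 benefitting, 18 non-benefitting patients.
labels = [1] * 7 + [0] * 18
rng = random.Random(0)
n_fits = 0
balanced_ok = []
for train, test in loocv_splits(len(labels)):
    balanced = oversample_positives(train, labels, rng)
    assert test not in balanced  # held-out patient never leaks into training
    n_pos = sum(labels[i] for i in balanced)
    balanced_ok.append(n_pos == len(balanced) - n_pos)
    n_fits += 1  # one model fit per held-out patient

# Metrics implied by the reported LOOCV counts.
sensitivity = 6 / 7
specificity = 16 / 18
accuracy = (6 + 16) / (7 + 18)
print(n_fits, round(sensitivity, 3), round(specificity, 3), accuracy)
```

The loop runs 25 times, matching the statement that the prediction model was trained 25 times, and the reported counts correspond to an overall accuracy of 22/25.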

