Sparse input neural networks to differentiate 32 primary cancer types based on somatic point mutations

Mapping Intimacies ◽

10.1101/2020.05.13.092916 ◽

2020 ◽

Author(s):

Nikolaos Dikaios

Keyword(s):

Neural Networks ◽

Point Mutations ◽

The Cancer Genome Atlas ◽

Patient Treatment ◽

Cancer Site ◽

Gradient Boosting ◽

Support Vector ◽

Primary Cancer ◽

Extreme Gradient Boosting ◽

Cancer Types

Abstract: This paper aims to differentiate cancer types from primary tumour samples based on somatic point mutations (SPM). Primary cancer site identification is necessary to perform site-specific and potentially targeted treatment. Current methods such as histopathology and lab tests cannot accurately determine a cancer's origin, which results in empirical patient treatment and poor survival rates. The availability of large deoxyribonucleic acid (DNA) sequencing datasets has allowed scientists to examine the ability of SPM to classify primary cancer sites. These datasets are highly sparse, since most genes will not be mutated; they have a low signal-to-noise ratio; and they are imbalanced, since rare cancers have fewer samples. To overcome these limitations, a sparse-input neural network (spinn) is suggested that projects the input data into a lower-dimensional space, where the more informative genes are used for learning. To train and evaluate spinn, an extensive dataset was collected from The Cancer Genome Atlas containing 7624 samples spanning 32 cancer types. Different sampling strategies were applied to balance the dataset, but none improved the classifiers' performance except for removing Tomek links. This is probably due to the high degree of class overlap. Spinn consistently outperformed algorithms such as extreme gradient boosting, deep neural networks and support vector machines, achieving an accuracy of up to 73% on independent testing data.
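The only resampling step the abstract reports as helpful is Tomek-link removal. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below is illustrative only: it assumes Euclidean distance and drops just the majority-class member of each link, neither of which is stated in the abstract, and the function names and toy data are hypothetical.

```python
from math import dist

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def remove_tomek_majority(X, y, majority_label):
    """Drop the majority-class member of every Tomek link (an assumed policy)."""
    drop = {k for pair in tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]

# Toy one-dimensional data: samples 1 and 2 sit on the class boundary.
X = [(0.0,), (1.0,), (1.2,), (3.0,), (3.1,)]
y = [0, 0, 1, 1, 1]
X_clean, y_clean = remove_tomek_majority(X, y, majority_label=1)
```

On the toy data the only link is the pair (1, 2), so the cleaned set loses sample 2 and the boundary between the classes widens. The brute-force neighbour search is quadratic and is meant for exposition, not for 7624 samples with thousands of gene features.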


DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy

Briefings in Bioinformatics ◽

10.1093/bib/bbaa125 ◽

2020 ◽

Cited By ~ 2

Author(s):

Ruopeng Xie ◽

Jiahui Li ◽

Jiawei Wang ◽

Wei Dai ◽

André Leier ◽

...

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Virulence Factors ◽

Bacterial Genome ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Wide Range ◽

Extreme Gradient Boosting ◽

Hybrid Framework

Abstract: Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is increasingly difficult to identify novel VFs using existing tools that were developed on outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performance, as the majority of tools only focused on extracting very few types of features. By addressing the aforementioned issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome-wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that uses the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks, are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy.
Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user’s viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.
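The stacking strategy named in the abstract trains a meta model on the out-of-fold predictions of the baseline models. The sketch below shows only that mechanism. It is not DeepVF: the baseline learners are toy single-feature stumps rather than the 62 models described, the meta learner is assumed to be a plain logistic regression, and all names and data are hypothetical.

```python
from math import exp

def fit_stump(X, y, feature):
    """Toy baseline model: outputs 1.0 if the feature value is closer
    to the positive-class mean than to the negative-class mean."""
    pos = [x[feature] for x, t in zip(X, y) if t == 1]
    neg = [x[feature] for x, t in zip(X, y) if t == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: 1.0 if abs(x[feature] - mean_pos) < abs(x[feature] - mean_neg) else 0.0

def fit_logistic(Z, y, lr=0.5, epochs=2000):
    """Assumed meta model: logistic regression by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    score = lambda z: 1 / (1 + exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = score(z) - t
            grad_w = [g + err * zk for g, zk in zip(grad_w, z)]
            grad_b += err
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(Z)
    return score

def fit_stacking(X, y, features, n_folds=3):
    """Build out-of-fold baseline predictions, then train the meta model on them."""
    n = len(X)
    Z = [[0.0] * len(features) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for k, f in enumerate(features):
            model = fit_stump(X_tr, y_tr, f)
            for i in held_out:
                Z[i][k] = model(X[i])  # prediction from a model that never saw sample i
    meta = fit_logistic(Z, y)
    bases = [fit_stump(X, y, f) for f in features]  # refit baselines on all data
    return lambda x: meta([m(x) for m in bases])

# Toy data: feature 0 separates the classes, feature 1 is mostly noise.
X = [(0.1, 1), (0.2, 2), (0.3, 5), (0.9, 4), (0.8, 6), (0.7, 3)]
y = [0, 0, 0, 1, 1, 1]
predict = fit_stacking(X, y, features=[0, 1])
```

Using out-of-fold predictions keeps the meta model from simply learning which baseline memorised the training set; here it ends up weighting the informative stump far more heavily than the noisy one.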


Machine Learning Approaches to Classify Primary and Metastatic Cancers Using Tissue of Origin-Based DNA Methylation Profiles

Cancers ◽

10.3390/cancers13153768 ◽

2021 ◽

Vol 13 (15) ◽

pp. 3768

Author(s):

Vijayachitra Modhukur ◽

Shakshi Sharma ◽

Mainak Mondal ◽

Ankita Lawarde ◽

Keiu Kask ◽

...

Keyword(s):

Machine Learning ◽

Dna Methylation ◽

Metastatic Cancer ◽

Gradient Boosting ◽

Support Vector ◽

Cancer Type ◽

Data Repositories ◽

Extreme Gradient Boosting ◽

Cancer Types ◽

Tissue Of Origin

Metastatic cancers account for up to 90% of cancer-related deaths. The clear differentiation of metastatic cancers from primary cancers is crucial for cancer type identification and developing targeted treatment for each cancer type. DNA methylation patterns are suggested to be an intriguing target for cancer prediction and are also considered to be an important mediator for the transition to metastatic cancer. In the present study, we used 24 cancer types and 9303 methylome samples downloaded from publicly available data repositories, including The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO). We constructed machine learning classifiers to discriminate metastatic, primary, and non-cancerous methylome samples. We applied support vector machines (SVM), Naive Bayes (NB), extreme gradient boosting (XGBoost), and random forest (RF) machine learning models to classify the cancer types based on their tissue of origin. RF outperformed the other classifiers, with an average accuracy of 99%. Moreover, we applied local interpretable model-agnostic explanations (LIME) to explain important methylation biomarkers to classify cancer types.


Estimation of daily maize transpiration using support vector machines, extreme gradient boosting, artificial and deep neural networks models

Agricultural Water Management ◽

10.1016/j.agwat.2020.106547 ◽

2021 ◽

Vol 245 ◽

pp. 106547

Author(s):

Junliang Fan ◽

Jing Zheng ◽

Lifeng Wu ◽

Fucang Zhang

Keyword(s):

Neural Networks ◽

Support Vector Machines ◽

Deep Neural Networks ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting ◽

Vector Machines


Detecting Cybersecurity Attacks in Internet of Things Using Artificial Intelligence Methods: A Systematic Literature Review

Electronics ◽

10.3390/electronics11020198 ◽

2022 ◽

Vol 11 (2) ◽

pp. 198

Author(s):

Mujaheed Abdullahi ◽

Yahia Baashar ◽

Hitham Alhussian ◽

Ayed Alwadain ◽

Norshakirah Aziz ◽

...

Keyword(s):

Artificial Intelligence ◽

Neural Networks ◽

Internet Of Things ◽

Literature Review ◽

Systematic Literature Review ◽

Rapid Development ◽

Security And Privacy ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting

In recent years, technology has advanced to the fourth industrial revolution (Industry 4.0), where the Internet of Things (IoT), fog computing, computer security, and cyberattacks have evolved exponentially on a large scale. The rapid development of IoT devices and networks in various forms generates enormous amounts of data, which in turn demand careful authentication and security. Artificial intelligence (AI) is considered one of the most promising methods for addressing cybersecurity threats and providing security. In this study, we present a systematic literature review (SLR) that categorizes, maps and surveys the existing literature on AI methods used to detect cybersecurity attacks in the IoT environment. The scope of this SLR includes an in-depth investigation of the most widely used AI techniques in cybersecurity and state-of-the-art solutions. A systematic search was performed on various electronic databases (SCOPUS, Science Direct, IEEE Xplore, Web of Science, ACM, and MDPI). Out of the identified records, 80 studies published between 2016 and 2021 were selected, surveyed and carefully assessed. This review has explored deep learning (DL) and machine learning (ML) techniques used in IoT security, and their effectiveness in detecting attacks. However, several studies have proposed smart intrusion detection systems (IDS) with intelligent architectural frameworks using AI to overcome the existing security and privacy challenges. It is found that support vector machines (SVM) and random forest (RF) are among the most used methods, owing to their high detection accuracy; efficient memory use may be another reason. In addition, other methods such as extreme gradient boosting (XGBoost), neural networks (NN) and recurrent neural networks (RNN) also provide better performance. This analysis also provides an insight into the AI roadmap to detect threats based on attack categories. Finally, we present recommendations for potential future investigations.


Machine Learning Enables Accurate and Rapid Prediction of Active Molecules Against Breast Cancer Cells

Frontiers in Pharmacology ◽

10.3389/fphar.2021.796534 ◽

2021 ◽

Vol 12 ◽

Author(s):

Shuyun He ◽

Duancheng Zhao ◽

Yanle Ling ◽

Hanxuan Cai ◽

Yike Cai ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Neural Networks ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Breast Cancer Cell Lines ◽

Support Vector ◽

Local Version ◽

Extreme Gradient Boosting

Breast cancer (BC) has surpassed lung cancer as the most frequently occurring cancer, and it is the leading cause of cancer-related death in women. Therefore, there is an urgent need to discover or design new drug candidates for BC treatment. In this study, we first collected a series of structurally diverse datasets consisting of 33,757 active and 21,152 inactive compounds for 13 breast cancer cell lines and one normal breast cell line commonly used in in vitro antiproliferative assays. Predictive models were then developed using five conventional machine learning algorithms, including naïve Bayesian, support vector machine, k-Nearest Neighbors, random forest, and extreme gradient boosting, as well as five deep learning algorithms, including deep neural networks, graph convolutional networks, graph attention network, message passing neural networks, and Attentive FP. A total of 476 single models and 112 fusion models were constructed based on three types of molecular representations including molecular descriptors, fingerprints, and graphs. The evaluation results demonstrate that the best model for each BC cell subtype can achieve high predictive accuracy for the test sets with AUC values of 0.689–0.993. Moreover, important structural fragments related to BC cell inhibition were identified and interpreted. To facilitate the use of the model, an online webserver called ChemBC (http://chembc.idruglab.cn/) and its local version software (https://github.com/idruglab/ChemBC) were developed to predict whether compounds have potential inhibitory activity against BC cells.


Machine learning models to identify low adherence to influenza vaccination among Korean adults with cardiovascular disease

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-01925-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Moojung Kim ◽

Young Jae Kim ◽

Sung Jin Park ◽

Kwang Gi Kim ◽

Pyung Chun Oh ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Influenza Vaccination ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Age Group ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

Abstract: Background. Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination. Methods. Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results. The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions. The machine learning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.


Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival

Scientific Reports ◽

10.1038/s41598-021-86327-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Arturo Moncada-Torres ◽

Marissa C. van Maaren ◽

Mathijs P. Hendriks ◽

Sabine Siesling ◽

Gijs Geleijnse

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Explicit Knowledge ◽

Cox Regression ◽

Metastatic Breast ◽

Gradient Boosting ◽

Support Vector ◽

Netherlands Cancer Registry ◽

Extreme Gradient Boosting ◽

The Impact

Abstract: Cox Proportional Hazards (CPH) analysis is the standard for survival analysis in oncology. Recently, several machine learning (ML) techniques have been adapted for this task. Although they have been shown to yield results at least as good as classical methods, they are often disregarded because of their lack of transparency and little to no explainability, which are key for their adoption in clinical settings. In this paper, we used data from the Netherlands Cancer Registry of 36,658 non-metastatic breast cancer patients to compare the performance of CPH with ML techniques (Random Survival Forests, Survival Support Vector Machines, and Extreme Gradient Boosting [XGB]) in predicting survival using the c-index. We demonstrated that in our dataset, ML-based models can perform at least as well as the classical CPH regression (c-index ∼ 0.63), and in the case of XGB even better (c-index ∼ 0.73). Furthermore, we used Shapley Additive Explanation (SHAP) values to explain the models’ predictions. We concluded that the difference in performance can be attributed to XGB’s ability to model nonlinearities and complex interactions. We also investigated the impact of specific features on the models’ predictions as well as their corresponding insights. Lastly, we showed that explainable ML can generate explicit knowledge of how models make their predictions, which is crucial in increasing the trust and adoption of innovative ML techniques in oncology and healthcare overall.
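The c-index used for the comparison above measures how often a model ranks two patients in the right order of survival. A minimal sketch of Harrell's concordance index follows, assuming higher predicted risk should mean shorter survival and that a pair is comparable only when the patient with the shorter time had an observed event. The abstract does not say how ties or censoring were handled in the study; the function name and toy values are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index.

    times  : observed follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk score, higher means earlier expected event
    """
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable

# Toy cohort: the third patient is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c = concordance_index(times, events, risks)  # 4 of 5 comparable pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of roughly 0.63 and 0.73 should be read.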


Establishing a Credit Risk Evaluation System for SMEs Using the Soft Voting Fusion Model

Risks ◽

10.3390/risks9110202 ◽

2021 ◽

Vol 9 (11) ◽

pp. 202

Author(s):

Ge Gao ◽

Hongxin Wang ◽

Pengbin Gao

Keyword(s):

Credit Risk ◽

Evaluation System ◽

Predictive Accuracy ◽

Assessment System ◽

Gradient Boosting ◽

Support Vector ◽

Fusion Model ◽

Light Gradient ◽

Extreme Gradient Boosting ◽

The Government

In China, SMEs are facing financing difficulties, and commercial banks and financial institutions are the main financing channels for SMEs. Thus, a reasonable and efficient credit risk assessment system is important for credit markets. Based on traditional statistical methods and AI technology, a soft voting fusion model, which incorporates logistic regression, support vector machine (SVM), random forest (RF), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), is constructed to improve the predictive accuracy of SMEs’ credit risk. To verify the feasibility and effectiveness of the proposed model, we use data from 123 SMEs nationwide that worked with a Chinese bank from 2016 to 2020, including financial information and default records. The results show that the accuracy of the soft voting fusion model is higher than that of a single machine learning (ML) algorithm, which provides a theoretical basis for the government to control credit risk in the future and offers important references for banks to make credit decisions.
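Soft voting, as used by the fusion model above, averages the class probabilities predicted by each member model and picks the class with the highest average. The sketch below assumes equal weights unless others are given and uses made-up probabilities from three hypothetical models; it does not reproduce the paper's five models or its data.

```python
def soft_vote(model_probs, weights=None):
    """Average per-class probabilities across models.

    model_probs : list (one entry per model) of lists (one per sample)
                  of class-probability lists
    weights     : optional per-model weights (assumed equal if omitted)
    Returns (averaged probabilities, predicted labels).
    """
    weights = weights or [1.0] * len(model_probs)
    total = sum(weights)
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    averaged = [
        [sum(w * probs[s][c] for w, probs in zip(weights, model_probs)) / total
         for c in range(n_classes)]
        for s in range(n_samples)
    ]
    labels = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, labels

# Hypothetical outputs for two borrowers; columns are P(no default), P(default).
model_a = [[0.90, 0.10], [0.40, 0.60]]
model_b = [[0.60, 0.40], [0.45, 0.55]]
model_c = [[0.20, 0.80], [0.70, 0.30]]
averaged, labels = soft_vote([model_a, model_b, model_c])
```

For the second borrower two of the three models lean towards default, so a hard majority vote would predict class 1, yet the averaged probability favours class 0 because the third model is much more confident. That sensitivity to confidence is the practical difference between soft and hard voting.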


Classification of Hot Spots using XGBoost and LightGBM Algorithms

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9459.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 722-724

Keyword(s):

Computational Methods ◽

Protein Interactions ◽

Hot Spots ◽

Cell Metabolism ◽

Pearson Correlation ◽

Classification Performance ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting ◽

Hub Proteins

Protein-protein interactions (PPIs) play a significant role in biological functions such as cell metabolism, immune response and signal transduction. Hot spots are small fractions of residues in interfaces that provide substantial binding energy in PPIs. Therefore, identification of hot spots is important to discover and analyze molecular medicines and diseases. The current strategy, alanine scanning, is not suitable for large-scale applications since the technique is very costly and tedious. The existing computational methods are poor in classification performance as well as accuracy in prediction. They are concerned with the topological structure and gene expression of hub proteins. The proposed system focuses on hot spots of hub proteins by eliminating redundant as well as highly correlated features using the Pearson correlation coefficient and support vector machine-based feature elimination. Extreme gradient boosting and LightGBM algorithms are used to combine a set of weak classifiers into a strong classifier. The proposed system shows better accuracy than the existing computational methods. The model can also be used to predict accurate molecular inhibitors for specific PPIs.
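The correlation-based part of the feature elimination described above can be sketched as a greedy filter: a feature is kept only if its absolute Pearson correlation with every feature already kept stays at or below a threshold. The threshold of 0.9, the feature names and the values are assumptions for illustration; the abstract gives none of them, and the SVM-based elimination step is not shown.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences.
    Assumes neither sequence is constant (otherwise the denominator is zero)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Greedy filter over a dict {feature name: values}.
    Features are visited in insertion order; the first of a correlated
    pair is the one that survives."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical residue features: "b" is almost a rescaled copy of "a".
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
selected = drop_correlated(columns)  # keeps "a" and "c"
```

Because the filter is order-dependent, which member of a correlated pair survives is arbitrary here; a real pipeline would usually keep the one more strongly related to the target.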


HyP-ABC: A Novel Automated Hyper-Parameter Tuning Algorithm Using Evolutionary Optimization

10.36227/techrxiv.14714508.v2 ◽

2021 ◽

Author(s):

Leila Zahedi ◽

Farid Ghareh Mohammadi ◽

M. Hadi Amini

Keyword(s):

Parameter Optimization ◽

Real World ◽

Optimization Problems ◽

State Of The Art ◽

Parameter Tuning ◽

Gradient Boosting ◽

Support Vector ◽

Wide Range ◽

Extreme Gradient Boosting ◽

Art Techniques

Machine learning techniques lend themselves as promising decision-making and analytic tools in a wide range of applications. Different ML algorithms have various hyper-parameters. In order to tailor an ML model towards a specific application, a large number of hyper-parameters should be tuned. Tuning the hyper-parameters directly affects the performance (accuracy and run-time). However, for large-scale search spaces, efficiently exploring the ample number of combinations of hyper-parameters is computationally challenging. Existing automated hyper-parameter tuning techniques suffer from high time complexity. In this paper, we propose HyP-ABC, an automatic innovative hybrid hyper-parameter optimization algorithm using the modified artificial bee colony approach, to measure the classification accuracy of three ML algorithms, namely random forest, extreme gradient boosting, and support vector machine. Compared to the state-of-the-art techniques, HyP-ABC is more efficient and has a limited number of parameters to be tuned, making it worthwhile for real-world hyper-parameter optimization problems. We further compare our proposed HyP-ABC algorithm with state-of-the-art techniques. In order to ensure the robustness of the proposed method, the algorithm takes a wide range of feasible hyper-parameter values, and is tested using a real-world educational dataset.
