Identifying Potential miRNA Biomarkers for Gastric Cancer Diagnosis Using Machine Learning Variable Selection Approach

2022 ◽  
Vol 12 ◽  
Author(s):  
Neda Gilani ◽  
Reza Arabi Belaghi ◽  
Younes Aftabi ◽  
Elnaz Faramarzi ◽  
Tuba Edgünlü ◽  
...  

Aim: This study aimed to accurately identify potential miRNAs for gastric cancer (GC) diagnosis at the early stages of the disease. Methods: We used the GSE106817 data set, with 2,566 miRNAs, to train the machine learning models. We used the Boruta machine learning variable selection approach to identify the miRNAs strongly associated with GC in the training sample. We then validated the prediction models in the independent GSE113486 data set. Finally, an ontological analysis was performed on the identified miRNAs to elicit the relevant relationships. Results: Of the 2,874 patients in the training sample, 115 (4%) had GC. Boruta identified 30 miRNAs as potential biomarkers for GC diagnosis, with hsa-miR-1343-3p ranked highest. All of the machine learning algorithms showed that, using hsa-miR-1343-3p as a biomarker with a cut-off point of 8.2, GC can be predicted with very high precision (AUC: 100%; sensitivity: 100%; specificity: 100%; ROC: 100%; Kappa: 100). In addition, ontological analysis of the 30 identified miRNAs confirmed their strong relationship with cancer-associated genes and molecular events. Conclusion: hsa-miR-1343-3p could be introduced as a valuable target for studies on GC diagnosis using reliable biomarkers.
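
The reported cut-off of 8.2 for hsa-miR-1343-3p amounts to a one-variable threshold rule. The sketch below shows how sensitivity and specificity could be computed for such a rule. The expression values are made up, and the direction of the rule (values at or above the cut-off are called GC) is an assumption, since the abstract does not state it.

```python
def threshold_metrics(values, labels, cutoff=8.2, positive_if_above=True):
    """Sensitivity and specificity of a single-marker cut-off rule.

    values: expression levels of one marker (hypothetical numbers here).
    labels: 1 for a case, 0 for a control.
    positive_if_above: assumed direction of the rule; not given in the abstract.
    """
    tp = fp = tn = fn = 0
    for value, label in zip(values, labels):
        predicted = (value >= cutoff) if positive_if_above else (value < cutoff)
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up expression values, for illustration only (not study data).
example_values = [9.1, 8.7, 8.3, 7.9, 6.5, 7.2, 8.4, 5.8]
example_labels = [1, 1, 1, 1, 0, 0, 0, 0]
sens, spec = threshold_metrics(example_values, example_labels)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On a real data set the same function would be applied to the independent validation sample, not to the sample on which the cut-off was chosen.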

2018 ◽  
Author(s):  
Liyan Pan ◽  
Guangjian Liu ◽  
Xiaojian Mao ◽  
Huixian Li ◽  
Jiexin Zhang ◽  
...  

BACKGROUND Central precocious puberty (CPP) in girls seriously affects their physical and mental development in childhood. The method of diagnosis, the gonadotropin-releasing hormone (GnRH)-stimulation test or GnRH analogue (GnRHa)-stimulation test, is expensive and makes patients uncomfortable due to the need for repeated blood sampling. OBJECTIVE We aimed to combine multiple CPP-related features and construct machine learning models to predict response to the GnRHa-stimulation test. METHODS In this retrospective study, we analyzed clinical and laboratory data of 1757 girls who underwent a GnRHa test in order to develop XGBoost and random forest classifiers for prediction of response to the GnRHa test. The local interpretable model-agnostic explanations (LIME) algorithm was used with the black-box classifiers to increase their interpretability. We measured sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of the models. RESULTS Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with the AUC ranging from 0.88 to 0.90, sensitivity ranging from 77.91% to 77.94%, and specificity ranging from 84.32% to 87.66%. Basal serum luteinizing hormone, follicle-stimulating hormone, and insulin-like growth factor-I levels were found to be the three most important factors. In the interpretable models of LIME, the abovementioned variables made high contributions to the prediction probability. CONCLUSIONS The prediction models we developed can help diagnose CPP and may be used as a prescreening tool before the GnRHa-stimulation test.
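
The AUC values quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that definition follows. The scores are hypothetical, and the pairwise loop is chosen for clarity, not speed.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities, not data from the study.
auc_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
auc_labels = [1, 1, 1, 0, 0, 0]
print(auc_from_scores(auc_scores, auc_labels))
```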


Author(s):  
Cheng-Chien Lai ◽  
Wei-Hsin Huang ◽  
Betty Chia-Chen Chang ◽  
Lee-Ching Hwang

Predictors of success in smoking cessation have been studied, but a prediction model capable of providing a success rate for each patient attempting to quit smoking is still lacking. The aim of this study is to develop prediction models using machine learning algorithms to predict the outcome of smoking cessation. Data were acquired from patients who underwent a smoking cessation program at one medical center in Northern Taiwan. A total of 4875 enrollments fulfilled our inclusion criteria. Models with artificial neural network (ANN), support vector machine (SVM), random forest (RF), logistic regression (LoR), k-nearest neighbor (KNN), classification and regression tree (CART), and naïve Bayes (NB) were trained to predict the final smoking status of the patients in a six-month period. Sensitivity, specificity, accuracy, and area under the receiver operating characteristic (ROC) curve (AUC or ROC value) were used to determine the performance of the models. We adopted the ANN model, which reached a slightly better performance, with a sensitivity of 0.704, a specificity of 0.567, an accuracy of 0.640, and an ROC value of 0.660 (95% confidence interval (CI): 0.617–0.702) for prediction of smoking cessation outcome. A predictive model for smoking cessation was constructed. The model could aid in providing the predicted success rate for all smokers. It also has the potential to support personalized and precision medicine for treatment of smoking cessation.


2020 ◽  
Vol 8 (Suppl 3) ◽  
pp. A62-A62
Author(s):  
Dattatreya Mellacheruvu ◽  
Rachel Pyke ◽  
Charles Abbott ◽  
Nick Phillips ◽  
Sejal Desai ◽  
...  

Background: Accurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended our work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies. Methods: In-house immunopeptidomic data was generated using stably transfected HLA-null K562 cell lines that express a single HLA allele of interest, followed by immunoprecipitation using W6/32 antibody and LC-MS/MS. Public immunopeptidomics data was downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples utilizing the ImmunoID NeXT Platform. Results: We have generated large-scale and high-quality immunopeptidomics data by using approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles to create our primary models. Briefly, our primary ‘binding’ algorithm models MHC-peptide binding using peptide and binding pockets while our primary ‘presentation’ model uses additional features to model antigen processing and presentation.
Both primary models have significantly higher precision across all recall values in multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve the performance of our model, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data was integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples. Conclusions: Improving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines, and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance compared to a state-of-the-art public algorithm and furthers this objective.
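
The phrase "precision across all recall values" refers to a precision-recall curve. The sketch below shows one generic way such points could be derived from ranked model scores. It is not the SHERPA evaluation code, and the scores and labels are hypothetical.

```python
def precision_recall_points(scores, labels):
    """Precision and recall after each item of a score-ranked list.

    labels: 1 for a truly presented peptide, 0 otherwise (hypothetical here).
    Tied scores are not grouped; each item is treated as its own threshold.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        points.append((true_positives / rank, true_positives / total_positives))
    return points


# Hypothetical presentation scores, for illustration only.
pr_points = precision_recall_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
for precision, recall in pr_points:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

One model dominates another in this sense when its precision is at least as high at every recall level.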


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Chengmao Zhou ◽  
Junhong Hu ◽  
Ying Wang ◽  
Mu-Huo Ji ◽  
Jianhua Tong ◽  
...  

Abstract To explore the predictive performance of machine learning for recurrence in patients with gastric cancer after surgery. The available data were divided into two parts: the first part was used as a training set (such as 80% of the original data) and the second part as a test set (the remaining 20% of the data). Fivefold cross-validation was also used. Ranked by weight, the top four recurrence factors were, in order, BMI, operation time, WGT, and age. In the training group, among the five machine learning models, the accuracy of gbm was 0.891, followed by the gbm algorithm at 0.876; the AUC values of the five algorithms, from high to low, were forest (0.962), gbm (0.922), GradientBoosting (0.898), DecisionTree (0.790), and Logistic (0.748). The precision of forest was the highest (0.957), followed by the GradientBoosting algorithm (0.878). In the test group, Logistic had the highest accuracy (0.801), followed by the forest algorithm and gbm; the AUC values of the five algorithms, from high to low, were forest (0.795), GradientBoosting (0.774), DecisionTree (0.773), Logistic (0.771), and gbm (0.771). Among the five algorithms, Logistic had the highest precision (1.000), followed by gbm (0.487). Machine learning can predict the recurrence of gastric cancer in patients after surgery. In addition, the top four factors affecting postoperative recurrence of gastric cancer were BMI, operation time, WGT, and age.
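
The 80/20 split and fivefold cross-validation described above can be sketched as index bookkeeping. The function names and the fixed seed are illustrative. Running cross-validation inside the training portion is an assumption, since the abstract does not say how the two steps were combined.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and hold out a test fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


train_idx, test_idx = train_test_split_indices(100)
for fit_idx, val_idx in k_fold_indices(train_idx, k=5):
    # A model would be fitted on fit_idx and scored on val_idx here.
    print(len(fit_idx), len(val_idx))
```

The held-out test indices are touched only once, after model selection on the folds is finished.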


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Matthijs Blankers ◽  
Louk F. M. van der Post ◽  
Jack J. M. Dekker

Abstract Background Accurate prediction models for whether patients on the verge of a psychiatric crisis need hospitalization are lacking and machine learning methods may help improve the accuracy of psychiatric hospitalization prediction models. In this paper we evaluate the accuracy of ten machine learning algorithms, including the generalized linear model (GLM/logistic regression) to predict psychiatric hospitalization in the first 12 months after a psychiatric crisis care contact. We also evaluate an ensemble model to optimize the accuracy and we explore individual predictors of hospitalization. Methods Data from 2084 patients included in the longitudinal Amsterdam Study of Acute Psychiatry with at least one reported psychiatric crisis care contact were included. Target variable for the prediction models was whether the patient was hospitalized in the 12 months following inclusion. The predictive power of 39 variables related to patients’ socio-demographics, clinical characteristics and previous mental health care contacts was evaluated. The accuracy and area under the receiver operating characteristic curve (AUC) of the machine learning algorithms were compared and we also estimated the relative importance of each predictor variable. The best and least performing algorithms were compared with GLM/logistic regression using net reclassification improvement analysis and the five best performing algorithms were combined in an ensemble model using stacking. Results All models performed above chance level. We found Gradient Boosting to be the best performing algorithm (AUC = 0.774) and K-Nearest Neighbors to be the least performing (AUC = 0.702). The performance of GLM/logistic regression (AUC = 0.76) was slightly above average among the tested algorithms. In a Net Reclassification Improvement analysis Gradient Boosting outperformed GLM/logistic regression by 2.9% and K-Nearest Neighbors by 11.3%. GLM/logistic regression outperformed K-Nearest Neighbors by 8.7%.
Nine of the top-10 most important predictor variables were related to previous mental health care use. Conclusions Gradient Boosting led to the highest predictive accuracy and AUC while GLM/logistic regression performed average among the tested algorithms. Although statistically significant, the magnitude of the differences between the machine learning algorithms was in most cases modest. The results show that a predictive accuracy similar to the best performing model can be achieved when combining multiple algorithms in an ensemble model.
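
Net reclassification improvement compares two models by counting how often the newer one moves predicted risk in the correct direction. The sketch below implements the category-free form. Which variant the authors used is not stated in the abstract, so this is an assumption, and the probabilities are invented.

```python
def net_reclassification_improvement(old_probs, new_probs, outcomes):
    """Category-free NRI of a new model over an old one.

    old_probs, new_probs: predicted risks from the two models.
    outcomes: 1 if the event occurred, 0 otherwise.
    """
    up_event = down_event = up_nonevent = down_nonevent = 0
    n_event = n_nonevent = 0
    for old, new, outcome in zip(old_probs, new_probs, outcomes):
        if outcome == 1:
            n_event += 1
            up_event += new > old
            down_event += new < old
        else:
            n_nonevent += 1
            up_nonevent += new > old
            down_nonevent += new < old
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need both events and non-events")
    # Events should move up; non-events should move down.
    event_part = (up_event - down_event) / n_event
    nonevent_part = (down_nonevent - up_nonevent) / n_nonevent
    return event_part + nonevent_part


# Invented risks for four events followed by four non-events.
nri_old = [0.4, 0.6, 0.5, 0.3, 0.5, 0.3, 0.2, 0.6]
nri_new = [0.7, 0.8, 0.4, 0.5, 0.2, 0.1, 0.3, 0.6]
nri_outcomes = [1, 1, 1, 1, 0, 0, 0, 0]
print(net_reclassification_improvement(nri_old, nri_new, nri_outcomes))
```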


2019 ◽  
Vol 8 (2) ◽  
pp. 4499-4504

Heart diseases are responsible for the greatest number of deaths all over the world. These diseases are usually not detected in early stages because the cost of medical diagnostics is not affordable for a majority of people. Research has shown that machine learning methods have a great capability to extract valuable information from medical data. This information is used to build prediction models that provide cost-effective technological aid for a medical practitioner to detect heart disease in early stages. However, the presence of irrelevant and redundant features in medical data degrades the performance of the prediction system. This research aimed to improve the accuracy of the existing methods by removing such features. In this study, a brute-force feature selection algorithm was used to determine the relevant, significant features. After experimenting rigorously with 7528 possible combinations of features and 5 machine learning algorithms, 8 important features were identified. A prediction model was developed using these significant features. The accuracy of this model was experimentally calculated to be 86.4%, which is higher than the results of existing studies. The prediction model proposed in this study should help in predicting heart disease efficiently.
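
Brute-force feature selection means scoring every candidate subset and keeping the best. A minimal sketch follows, assuming a caller-supplied scoring function. The feature names and the toy score are hypothetical and stand in for the cross-validated accuracy a real study would compute.

```python
from itertools import combinations


def exhaustive_feature_search(feature_names, score_subset, max_size=None):
    """Score every non-empty feature subset and return the best one.

    score_subset: callable taking a tuple of feature names and returning a
    number where higher is better (e.g. cross-validated accuracy).
    """
    if max_size is None:
        max_size = len(feature_names)
    best_subset, best_score = None, float("-inf")
    for size in range(1, max_size + 1):
        for subset in combinations(feature_names, size):
            score = score_subset(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical feature names and a toy scoring rule, for illustration only:
# reward two "useful" features and charge a small penalty per feature used.
USEFUL = {"age", "chest_pain"}


def toy_score(subset):
    return len(USEFUL & set(subset)) - 0.1 * len(subset)


features = ["age", "chest_pain", "cholesterol", "resting_bp"]
print(exhaustive_feature_search(features, toy_score))
```

With n features there are 2^n - 1 non-empty subsets, so the search is practical only for small n or with a cap such as `max_size`.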


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Fathima Aliyar Vellameeran ◽  
Thomas Brindha

Abstract Objectives To provide a clear literature review of state-of-the-art heart disease prediction models. Methods The review covers 61 research papers and reports the significant findings. Initially, the analysis addresses the contributions of each work and notes the simulation environment. The different types of machine learning algorithms deployed in each contribution are identified. In addition, the datasets used by existing heart disease prediction models are noted. Results The performance measures computed across the papers, such as prediction accuracy, prediction error, specificity, sensitivity, and f-measure, are examined. Further, the best performance is also checked to confirm the effectiveness of the contributions. Conclusions The research challenges and gaps are presented with respect to the development of intelligent methods that address the unresolved challenges in heart disease prediction using data mining techniques.


2021 ◽  
Author(s):  
Kate Bentley ◽  
Kelly Zuromski ◽  
Rebecca Fortgang ◽  
Emily Madsen ◽  
Daniel Kessler ◽  
...  

Background: Interest in developing machine learning algorithms that use electronic health record data to predict patients’ risk of suicidal behavior has recently proliferated. Whether and how such models might be implemented and useful in clinical practice, however, remains unknown. In order to ultimately make automated suicide risk prediction algorithms useful in practice, and thus better prevent patient suicides, it is critical to partner with key stakeholders (including the frontline providers who will be using such tools) at each stage of the implementation process. Objective: The aim of this focus group study was to inform ongoing and future efforts to deploy suicide risk prediction models in clinical practice. The specific goals were to better understand hospital providers’ current practices for assessing and managing suicide risk; determine providers’ perspectives on using automated suicide risk prediction algorithms; and identify barriers, facilitators, recommendations, and factors to consider for initiatives in this area. Methods: We conducted 10 two-hour focus groups with a total of 40 providers from psychiatry, internal medicine and primary care, emergency medicine, and obstetrics and gynecology departments within an urban academic medical center. Audio recordings of open-ended group discussions were transcribed and coded for relevant and recurrent themes by two independent study staff members. All coded text was reviewed and discrepancies resolved in consensus meetings with doctoral-level staff. Results: Though most providers reported using standardized suicide risk assessment tools in their clinical practices, existing tools were commonly described as unhelpful and providers indicated dissatisfaction with current suicide risk assessment methods. Overall, providers’ general attitudes toward the practical use of automated suicide risk prediction models and corresponding clinical decision support tools were positive.
Providers were especially interested in the potential to identify high-risk patients who might be missed by traditional screening methods. Some expressed skepticism about the potential usefulness of these models in routine care; specific barriers included concerns about liability, alert fatigue, and increased demand on the healthcare system. Key facilitators included presenting specific patient-level features contributing to risk scores, emphasizing changes in risk over time, and developing systematic clinical workflows and provider trainings. Participants also recommended considering risk-prediction windows, timing of alerts, who will have access to model predictions, and variability across treatment settings. Conclusions: Providers were dissatisfied with current suicide risk assessment methods and open to the use of a machine learning-based risk prediction system to inform clinical decision-making. They also raised multiple concerns about potential barriers to the usefulness of this approach and suggested several possible facilitators. Future efforts in this area will benefit from incorporating systematic qualitative feedback from providers, patients, administrators, and payers on the use of new methods in routine care, especially given the complex, sensitive, and unfortunately still stigmatized nature of suicide risk.


Energies ◽  
2020 ◽  
Vol 13 (17) ◽  
pp. 4368 ◽  
Author(s):  
Chun-Wei Chen ◽  
Chun-Chang Li ◽  
Chen-Yu Lin

The energy baseline is an important method for measuring the energy-saving benefits of a chiller system, and the benefits can be calculated by comparing prediction models and actual results. Currently, machine learning is often adopted as a prediction model for energy baselines. Common models include regression, ensemble learning, and deep learning models. In this study, we first reviewed several machine learning algorithms, which were used to establish prediction models. Then, the concept of clustering was adopted to preprocess chiller data. Data mining, K-means clustering, and the gap statistic were used to successfully identify the critical variables for clustering chiller modes. Applying these key variables effectively enhanced the quality of the chiller data, and combining the clustering results with the machine learning model effectively improved the prediction accuracy of the model and the reliability of the energy baselines.
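
The clustering step can be illustrated with a plain K-means (Lloyd's algorithm) sketch. The two-dimensional points are hypothetical stand-ins for chiller operating data, and the initial centroids are passed in explicitly to keep the run deterministic. The gap statistic used in the study to choose the number of clusters is not implemented here.

```python
def k_means(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm for points given as equal-length tuples of floats."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(
                range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical operating points forming two well-separated modes.
chiller_points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
modes, centers = k_means(chiller_points, [chiller_points[0], chiller_points[3]])
print(modes)
```

A separate baseline model could then be fitted per cluster, which is the combination the abstract describes.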


Author(s):  
Ruchika Malhotra ◽  
Anuradha Chug

Software maintenance is an expensive activity that consumes a major portion of the cost of the total project. Various activities carried out during maintenance include the addition of new features, deletion of obsolete code, correction of errors, etc. Software maintainability means the ease with which these operations can be carried out. If maintainability can be measured in the early phases of software development, it helps in better planning and optimum resource utilization. Measurement of design properties such as coupling, cohesion, etc. in the early phases of development often allows the corresponding maintainability to be derived with the help of prediction models. In this paper, we performed a systematic review of the existing studies related to software maintainability from January 1991 to October 2015. In total, 96 primary studies were identified, of which 47 were from journals, 36 from conference proceedings, and 13 from other sources. All studies were compiled in structured form and analyzed from numerous perspectives, such as the use of design metrics, prediction model, tools, data sources, prediction accuracy, etc. According to the review results, the use of machine learning algorithms in predicting maintainability has increased since 2005. The use of evolutionary algorithms has also begun in related sub-fields since 2010. We observed that design metrics are still the most favored option for capturing the characteristics of a given software system before it is used in a prediction model to determine the corresponding software maintainability. A significant increase in the use of public datasets for building prediction models has also been observed, and in this regard two public datasets, User Interface Management System (UIMS) and Quality Evaluation System (QUES), proposed by Li and Henry, are quite popular among researchers.
Although machine learning algorithms are still the most popular methods, we suggest that researchers working in the software maintainability area experiment with open source datasets and hybrid algorithms. In this regard, more empirical studies also need to be conducted on a large number of datasets so that a generalized theory can be formed. The current paper will be beneficial for practitioners, researchers, and developers, as they can use these models and metrics for creating benchmarks and standards. The findings of this extensive review will also be useful for novices in the field of software maintainability, as it not only provides explicit definitions but also lays a foundation for further research by providing a quick link to all important studies in the field. Finally, this study also compiles current trends and emerging sub-fields and identifies various opportunities for future research in the field of software maintainability.

