DOP31 Serum protein markers for early and differential IBD diagnosis validated by machine learning approaches

2020 ◽  
Vol 14 (Supplement_1) ◽  
pp. S070-S070
Author(s):  
S Verstockt ◽  
N Verplaetse ◽  
D Raimondi ◽  
B Verstockt ◽  
E Glorieus ◽  
...  

Background: The inflammatory bowel diseases (IBD), Crohn’s disease (CD) and ulcerative colitis (UC), are chronic inflammatory conditions with a polygenic and multifactorial pathogenesis. Intensified treatment early in the disease course of IBD results in better outcomes. This is, however, challenged by the diagnostic delay faced in IBD, and especially in CD. Therefore, markers supporting early and differential diagnosis are needed. In this study, we aimed to discriminate IBD patients from non-IBD controls, and CD from UC patients, using serum protein profiles combined with an IBD polygenic risk score. Methods: Patients naïve for immunosuppressives and biologicals, and without previous IBD-related surgery, were prospectively included within 3 months after diagnosis, across three Belgian IBD referral centres (PANTHER study). We collected serum from 127 patients (88 CD, 39 UC) and 66 age- and gender-matched non-IBD controls. Relative serum levels of 576 unique proteins were quantified (OLINK). Proteins were ranked according to (1) adjusted (adj.) p values obtained from differential expression analysis; (2) importance scores from machine-learning feature-selection algorithms (univariate feature selection, logistic regression with L2 penalty and Random Forest). For all individuals, a weighted IBD polygenic risk score (PRS) was calculated (PRSice 2.0) for the 242 known IBD risk loci. Receiver operating characteristics (ROC) and area under the curve (AUC) analysis were performed to measure the performance of top-ranked proteins and the IBD PRS (R package ROCR). Results: Following statistical analysis, 243 serum proteins were found to be differentially expressed (adj. p < 0.05) between IBD patients and controls. Three top-ranked markers were also identified as top 10 ranked proteins by all feature-selection algorithms, and resulted in a significant AUC of 93% (95% CI: 89–97%) to distinguish IBD from controls. While adding the IBD PRS did not further contribute (AUC 93% [95% CI: 89–97%]), the top-ranked protein on its own had a strong discriminative power with an AUC of 87% (95% CI: 82–92%). When comparing UC and CD, we found 15 differentially expressed proteins. Two proteins ranked within the top 10 across all feature-selection algorithms. This two-marker panel could discriminate UC from CD with an accuracy of 88% (95% CI: 82–96%). Adding the IBD PRS did not further improve the prediction model (AUC = 88% [95% CI: 81–96%]). Conclusion: Machine learning approaches validated top differentially expressed serum proteins with diagnostic potential in IBD. We identified a three-marker panel classifying IBD patients and non-IBD controls, and a two-marker panel discriminating UC from CD.
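
The two ideas this abstract relies on, keeping only markers that rank in the top k under every feature-selection method and scoring a marker by AUC, can be sketched in a few lines. This is an illustrative sketch only, not the authors' pipeline (they report using R and ROCR): the marker names, rankings and scores below are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) definition.

```python
from itertools import product


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (case, control) pairs where the case scores higher.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def consensus_top_k(rankings, k):
    """Return the features found in the top k of every ranking (best first)."""
    top_sets = [set(ranking[:k]) for ranking in rankings]
    return set.intersection(*top_sets)


# Hypothetical marker names and rankings, for illustration only.
rankings = [
    ["P1", "P2", "P3", "P4"],  # e.g. univariate ranking
    ["P2", "P1", "P4", "P3"],  # e.g. L2 logistic-regression coefficients
    ["P1", "P3", "P2", "P4"],  # e.g. random-forest importances
]
shared = consensus_top_k(rankings, k=2)

# Hypothetical marker levels in cases and controls.
example_auc = auc([0.9, 0.4], [0.5, 0.1])
```

With these toy inputs only `P1` survives the consensus filter, and three of the four case-control pairs are ordered correctly.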

Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1226
Author(s):  
Saeed Najafi-Zangeneh ◽  
Naser Shams-Gharneh ◽  
Ali Arjomandi-Nezhad ◽  
Sarfaraz Hashemkhani Zolfani

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.
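
The bootstrap stability check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a one-feature logistic regression fitted by plain gradient descent, and the data, `fit_logistic` and `bootstrap_coefficients` are hypothetical stand-ins for the IBM HR dataset and the full model.

```python
import math
import random
import statistics


def fit_logistic(xs, ys, steps=300, lr=0.1):
    """Fit y ~ sigmoid(w * x + b) by batch gradient descent; returns (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def bootstrap_coefficients(xs, ys, n_boot=50, seed=0):
    """Refit on resampled data; return mean and standard deviation of w."""
    rng = random.Random(seed)
    n = len(xs)
    coefs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        w, _ = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])
        coefs.append(w)
    return statistics.mean(coefs), statistics.stdev(coefs)


# Hypothetical toy data: one feature, binary attrition label.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -1.0, 1.0, 0.2]
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0]
mean_w, sd_w = bootstrap_coefficients(xs, ys)
```

A small `sd_w` relative to `mean_w` is the kind of evidence the abstract reads as parameter stability.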


2020 ◽  
Vol 79 (Suppl 1) ◽  
pp. 891-892
Author(s):  
D. Galbraith ◽  
M. Caliskan ◽  
O. Jabado ◽  
S. Hu ◽  
R. Fleischmann ◽  
...  

Background: RA is a systemic autoimmune disease with heterogeneous manifestation. Recent advances in serum proteomics, such as the SomaScan® platform (SomaLogic, Inc., Boulder, USA), allow for a deeper exploration of the protein biomarkers associated with RA and a better understanding of the molecular aetiology of the disease. Objectives: To characterise the differences in baseline serum proteome of patients with RA (enrolled in the Phase IIIb Abatacept vs adaliMumab comParison in bioLogic-naïvE RA subjects with background MTX [AMPLE] study) [1] compared with a healthy population, and to identify serum protein biomarkers associated with disease severity and radiographic progression. Methods: Patients in the AMPLE study had an inadequate response to MTX and were naïve to biologic DMARDs. Protein abundance was assessed in baseline serum samples from 440 AMPLE study patients and 123 healthy individuals with matching demographics using the SomaScan® platform, with 5000+ slow off-rate modified aptamers and up to 8 log of dynamic range [2]. Differential abundance testing was performed using linear models to identify differences in protein abundance in patients with RA vs healthy individuals. A separate analysis using a linear model was conducted in only the patients with RA to identify the proteins associated with DAS28 (CRP) and TSS. Pathway analyses were performed for proteins significantly (false discovery rate-adjusted p value <0.05) associated with RA and the disease severity measurements to identify over-representation of the molecular pathways. Results: Compared with healthy individuals, >2000 serum proteins were significantly differentially expressed in patients with RA, including many proteins that have been associated with RA (e.g. serum amyloid A [SAA], CRP) and complement. Most of the protein expression differences were of small magnitude (fold change <2).
Proteins that were differentially expressed between patients with RA and healthy individuals were enriched in interleukin signalling, neutrophil degranulation, platelet activation/degranulation and extracellular matrix organisation pathways. DAS28 (CRP) was significantly associated with several biomarkers, including SAA, fibrinogen and CRP; in general, proteins associated with DAS28 (CRP) were most strongly enriched in the platelet activation/degranulation pathways (Figure 1), also seen in patients with RA vs healthy individuals. Additionally, many proteins were significantly associated with TSS, including SAA, matrix metalloproteinase-3 and cartilage acidic protein 1. Here, the proteins were most strongly enriched in the extracellular matrix remodelling pathways (Figure 2). Conclusion: Our study revealed that thousands of serum proteins are differentially expressed and several pathways are dysregulated between patients with RA and healthy individuals. Additional pathways were identified that reflect disease severity, including joint damage, distinct from those pathways associated with the disease. The SomaScan® platform provides a unique proteomic tool with a wide dynamic range for the identification of serum protein biomarkers associated with RA and disease severity.
Proteomic signatures should be considered in clinical trials to better understand disease pathogenesis and predict risk in response to treatment. References: [1] Schiff M, et al. Ann Rheum Dis 2014;73:86–94. [2] Gold L, et al. PLoS One 2010;5:e15004. Acknowledgments: Rachel Rankin (medical writing, Caudex; funding: Bristol-Myers Squibb). Disclosure of Interests: David Galbraith Shareholder of: Bristol-Myers Squibb, Employee of: Bristol-Myers Squibb, Minal Caliskan Employee of: Bristol-Myers Squibb, Omar Jabado Shareholder of: Bristol-Myers Squibb, Employee of: Bristol-Myers Squibb, Sarah Hu Shareholder of: Bristol-Myers Squibb, Employee of: Bristol-Myers Squibb, Roy Fleischmann Grant/research support from: AbbVie, Akros, Amgen, AstraZeneca, Bristol-Myers Squibb, Boehringer Ingelheim, Centrexion, Eli Lilly, EMD Serono, Genentech, Gilead, Janssen, Merck, Nektar, Novartis, Pfizer, Regeneron Pharmaceuticals, Inc., Roche, Samsung, Sandoz, Sanofi Genzyme, Selecta, Taiho, UCB, Consultant of: AbbVie, ACEA, Amgen, Bristol-Myers Squibb, Eli Lilly, Gilead, GlaxoSmithKline, Novartis, Pfizer, Sanofi Genzyme, UCB, Michael Weinblatt Grant/research support from: Amgen, Bristol-Myers Squibb, Crescendo, Lilly, Sanofi/Regeneron, Consultant of: AbbVie, Amgen, Bristol-Myers Squibb, Crescendo, Gilead, Horizon, Lilly, Pfizer, Roche, Sean Connolly Shareholder of: Bristol-Myers Squibb, Employee of: Bristol-Myers Squibb, Michael A Maldonado Shareholder of: Bristol-Myers Squibb, Employee of: Bristol-Myers Squibb, Sheng Gao Shareholder of: Bristol-Myers Squibb, Employee of: Bristol-Myers Squibb


2019 ◽  
Vol 26 (3) ◽  
pp. 1810-1826 ◽  
Author(s):  
Behnaz Raef ◽  
Masoud Maleki ◽  
Reza Ferdousi

The aim of this study is to develop a computational prediction model for implantation outcome after an embryo transfer cycle. Data on 500 patients and 1360 transferred embryos, including cleavage and blastocyst stages and fresh or frozen embryos, were collected from April 2016 to February 2018. A dataset containing 82 attributes and a target label (indicating positive and negative implantation outcomes) was constructed. Six dominant machine learning approaches were examined based on their performance in predicting embryo transfer outcomes. Feature selection procedures were also used to identify effective predictive factors and to determine the optimum number of features based on classifier performance. The results revealed that random forest was the best classifier (accuracy = 90.40% and area under the curve = 93.74%) with optimum features based on a 10-fold cross-validation test. According to the Support Vector Machine-Feature Selection algorithm, the ideal number of features is 78. Follicle stimulating hormone/human menopausal gonadotropin dosage for ovarian stimulation was the most important predictive factor across all examined embryo transfer features. The proposed machine learning-based prediction model could predict embryo transfer outcome and implantation of embryos with high accuracy, before the start of an embryo transfer cycle.
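
A 10-fold cross-validation split of the kind mentioned above can be sketched as below. The sketch assumes a plain row-wise split over the 1360 embryo records; the abstract does not say whether the authors grouped folds by patient, so that detail is left out, and `k_fold_indices` is a hypothetical helper name.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs so each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k interleaved, near-equal folds
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


splits = list(k_fold_indices(1360, k=10))
```

Each of the ten iterations trains on nine folds and evaluates on the held-out one; the reported accuracy and AUC would then be averaged over the folds.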


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Noura AlNuaimi ◽  
Mohammad Mehedy Masud ◽  
Mohamed Adel Serhani ◽  
Nazar Zaki

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time. However, storing and processing large and varied datasets (known as big data) in real time is challenging. In machine learning, streaming feature selection has long been considered a superior technique for selecting the relevant subset of features from high-dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to settings in which features arrive consecutively over time: the total number of features is not known in advance, whereas the number of instances is well established. Many scholars in the field have proposed streaming-feature-selection algorithms in attempts to find a proper solution to this problem. This paper presents an exhaustive and methodological introduction to these techniques. This study provides a review of the traditional feature-selection algorithms and then scrutinizes the current algorithms that use streaming feature selection to determine their strengths and weaknesses. The survey also sheds light on the ongoing challenges in big-data research.


2017 ◽  
Vol 24 (1) ◽  
pp. 3-37 ◽  
Author(s):  
SANDRA KÜBLER ◽  
CAN LIU ◽  
ZEESHAN ALI SAYYED

We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
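
Multi-class information gain scores a feature against all rating classes at once rather than one class versus the rest. The sketch below uses the textbook definition for a binary feature (for example, whether an n-gram occurs in a review); it is not taken from the paper, and the labels and feature vectors are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_present, labels):
    """Entropy reduction over all classes from knowing a binary feature."""
    with_f = [y for f, y in zip(feature_present, labels) if f]
    without_f = [y for f, y in zip(feature_present, labels) if not f]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_f, without_f) if part
    )
    return entropy(labels) - conditional


# Hypothetical ratings (three classes) and two candidate n-gram features.
ratings = [1, 1, 2, 2, 3, 3]
informative = [True, True, False, False, False, False]  # marks class 1 exactly
uninformative = [True, False, True, False, True, False]  # spread over all classes
ig_good = information_gain(informative, ratings)
ig_bad = information_gain(uninformative, ratings)
```

Ranking all n-grams by this score and keeping the top ones is one way to read the "inherently multi-class" selection the abstract describes.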


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Yosef Masoudi-Sobhanzadeh ◽  
Habib Motieghader ◽  
Ali Masoudi-Nejad

2020 ◽  
Author(s):  
Aaron Cardenas-Martinez ◽  
Victor Rodriguez-Galiano ◽  
Juan Antonio Luque-Espinar ◽  
Maria Paula Mendes

Establishing the sources and driving forces of groundwater nitrate pollution is of paramount importance, as it contributes to the implementation and evaluation of agro-environmental measures. High concentrations of nitrates in groundwater occur all around the world, in rich and less developed countries alike.
In the case of Spain, 21.5% of the wells of the groundwater quality monitoring network showed mean concentrations above the quality standard (QS) of 50 mg/l. The objectives of this work were: i) to predict the current probability of having nitrate concentrations above the QS in Andalusian groundwater bodies (Spain) using past-time features, some of them obtained from satellite observations; ii) to assess the importance of features in the prediction; iii) to evaluate different machine learning (ML) approaches and feature selection (FS) techniques.
Several predictive models based on an ML algorithm, the Random Forest, were used, as well as FS techniques. 321 nitrate samples and their respective predictive features were obtained from different groundwater bodies. These predictive features were divided into three groups according to their focus: agricultural production (phenology); livestock pressure (excretion rates); and environmental settings (soil characteristics and texture, geomorphology, and local climate conditions). Models were trained with the features of a year [YEAR(t0)], and then applied to new features obtained for the next year, [YEAR(t0+1)], performing k-fold cross-validation. Additionally, a further prediction was carried out for a present time, [YEAR(t0+n)], validating with an independent test. This methodology examined the use of a model, trained with previous nitrate concentrations and predictive features, for the prediction of current nitrate concentrations based on present features. Our findings showed an improvement in predictive performance when using a wrapper with sequential search for FS compared with using the Random Forest algorithm alone. Phenology features, derived from remotely sensed variables, were the most explanatory features, performing better than static land-use maps or vegetation index images (e.g., NDVI). They also provided much more comprehensive information and, more importantly, relied only on extrinsic features of groundwater bodies.
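
A wrapper with sequential search can be sketched as a greedy forward selection around any scoring function. This is a generic illustration, not the authors' configuration: in a real wrapper `score` would be the cross-validated performance of the Random Forest on the chosen subset, whereas here it is a hypothetical toy function, and the feature names are placeholders.

```python
def sequential_forward_selection(candidates, score, min_gain=1e-9):
    """Greedily add the feature that most improves score(subset); stop when none does."""
    selected = []
    best = score(selected)
    while True:
        trials = [(score(selected + [f]), f) for f in candidates if f not in selected]
        if not trials:
            break
        top_score, top_feature = max(trials)
        if top_score <= best + min_gain:
            break  # no remaining feature improves the model
        selected.append(top_feature)
        best = top_score
    return selected, best


# Hypothetical per-feature contributions; unknown features slightly hurt the score.
contribution = {"f_a": 0.3, "f_b": 0.1}


def toy_score(subset):
    return sum(contribution.get(f, -0.05) for f in subset)


chosen, final_score = sequential_forward_selection(["f_a", "f_b", "f_noise"], toy_score)
```

The search stops before adding `f_noise`, which mirrors how a wrapper can beat a model trained on every available feature.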


2013 ◽  
Vol 22 (04) ◽  
pp. 1350027
Author(s):  
JAGANATHAN PALANICHAMY ◽  
KUPPUCHAMY RAMASAMY

Feature selection is essential in data mining and pattern recognition, especially for database classification. In past years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature to the class variable, and redundancy is calculated from the relationship between candidate features, selected features and class variables. The effectiveness is tested on ten benchmark datasets available in the UCI Machine Learning Repository. The experimental results show better performance than some existing algorithms.
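
The maximum-relevance, minimum-redundancy criterion can be illustrated with the standard greedy form: pick the feature whose mutual information with the class is highest after subtracting its mean mutual information with the features already chosen. This is the generic criterion, not the novel variant the paper proposes, and the discrete toy features below are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedy selection by relevance minus mean redundancy with selected features."""
    selected = []
    remaining = list(features)

    def score(name):
        relevance = mutual_information(features[name], target)
        if not selected:
            return relevance
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical data: the class is determined jointly by "a" and "b";
# "a_dup" is an exact copy of "a" and therefore fully redundant.
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "a_dup": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
target = [0, 1, 2, 3, 0, 1, 2, 3]
picked = mrmr(features, target, k=2)
```

All three features are equally relevant on their own, but the redundancy term makes the second pick `b` rather than the duplicate of `a`.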


2021 ◽  
Vol 11 ◽  
Author(s):  
Qi Wan ◽  
Jiaxuan Zhou ◽  
Xiaoying Xia ◽  
Jianfeng Hu ◽  
Peng Wang ◽  
...  

Objective: To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify SPLs based on magnetic resonance (MR) T2 weighted imaging (T2WI). Material and Methods: A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test datasets (n = 40). A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. The ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), precision-recall plot, and Matthews Correlation Coefficient were used to evaluate the performance of machine learning approaches. Results: The 3D features were significantly superior to 2D features, showing much more machine learning combinations with AUC greater than 0.7 in both validation and test groups (129 vs. 11). The feature selection methods Analysis of Variance (ANOVA) and Recursive Feature Elimination (RFE) and the classifiers Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM) and Gaussian Process (GP) had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed similar results as 3D features. Incorporating clinical features with 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively. Conclusions: After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because of the availability of more machine learning algorithmic combinations with better performance. Feature selection methods ANOVA and RFE, and classifiers LR, LDA, SVM and GP are more likely to demonstrate better diagnostic performance for 3D features in the current study.
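
The figure of 1260 models follows from the grid sizes given in the abstract: 3 × 2 × 3 × 10 × 7 = 1260. The sketch below enumerates such a grid and adds the standard Matthews Correlation Coefficient formula. Only the counts come from the abstract; the component names are placeholders, since the abstract names just some of the methods, and the confusion-matrix values in the test calls are hypothetical.

```python
import math
from itertools import product

# Placeholder names; only the number of options per stage is from the abstract.
normalizations = ["norm_1", "norm_2", "norm_3"]
reducers = ["reduce_1", "reduce_2"]
selectors = ["select_1", "select_2", "select_3"]
classifiers = [f"clf_{i}" for i in range(1, 11)]
feature_counts = range(3, 10)  # 3 to 9 features, i.e. 7 settings

grid = list(product(normalizations, reducers, selectors, classifiers, feature_counts))


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Each tuple in `grid` describes one pipeline to be cross-validated on the training set, which is why the comparison in the abstract counts combinations rather than individual algorithms.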

