Effects of Dataset Size and Interactions on the Prediction Performance of Logistic Regression and Deep Learning Models

Author(s):  
Alexandre Bailly ◽  
Corentin Blanc ◽  
Élie Francis ◽  
Thierry Guillotin ◽  
Fadi Jamal ◽  
...  
Information ◽  
2021 ◽  
Vol 12 (9) ◽  
pp. 374
Author(s):  
Babacar Gaye ◽  
Dezheng Zhang ◽  
Aziguli Wulamu

With the extensive availability of social media platforms, Twitter has become a significant tool for the acquisition of people's views, opinions, attitudes, and emotions towards certain entities. Within this frame of reference, sentiment analysis of tweets has become one of the most fascinating research areas in the field of natural language processing. A variety of techniques have been devised for sentiment analysis, but there is still room for improvement where the accuracy and efficacy of the system are concerned. This study proposes a novel approach that exploits the advantages of the lexical dictionary, machine learning, and deep learning classifiers. We classified the tweets based on the sentiments extracted by TextBlob using a stacked ensemble of three long short-term memory (LSTM) networks as base classifiers and logistic regression (LR) as a meta classifier. The proposed model proved to be effective and time-saving since it does not require feature extraction, as LSTM extracts features without any human intervention. We also compared our proposed approach with conventional machine learning models such as logistic regression, AdaBoost, and random forest, and with state-of-the-art deep learning models. Experiments were conducted on the sentiment140 dataset and were evaluated in terms of accuracy, precision, recall, and F1 score. Empirical results showed that our proposed approach achieved state-of-the-art results, with an accuracy score of 99%.


2021 ◽  
Vol 23 (1) ◽  
Author(s):  
Seulkee Lee ◽  
Seonyoung Kang ◽  
Yeonghee Eun ◽  
Hong-Hee Won ◽  
Hyungjin Kim ◽  
...  

Background: Few studies on rheumatoid arthritis (RA) have generated machine learning models to predict biologic disease-modifying antirheumatic drug (bDMARD) responses; however, these studies included insufficient analysis of important features. Moreover, machine learning is yet to be used to predict bDMARD responses in ankylosing spondylitis (AS). Thus, in this study, machine learning was used to predict such responses in RA and AS patients. Methods: Data were retrieved from the Korean College of Rheumatology Biologics therapy (KOBIO) registry. The training dataset included 625 RA patients and 611 AS patients. We prepared independent test datasets that did not participate in any process of generating machine learning models. Baseline clinical characteristics were used as input features. Responders were defined as those who met the ACR 20% improvement response criteria (ACR20) and ASAS 20% improvement response criteria (ASAS20) in RA and AS, respectively, at the first follow-up. Multiple machine learning methods, including random forest (RF-method), were used to generate models to predict bDMARD responses, and we compared them with the logistic regression model. Results: The RF-method model had superior prediction performance to the logistic regression model (accuracy: 0.726 [95% confidence interval (CI): 0.725–0.730] vs. 0.689 [0.606–0.717], area under curve (AUC) of the receiver operating characteristic curve (ROC) 0.638 [0.576–0.658] vs. 0.565 [0.493–0.605], F1 score 0.841 [0.837–0.843] vs. 0.803 [0.732–0.828], AUC of the precision-recall curve 0.808 [0.763–0.829] vs. 0.754 [0.714–0.789]) with independent test datasets in patients with RA. However, machine learning and logistic regression exhibited similar prediction performance in AS patients. Furthermore, the patient self-reporting scales, which are patient global assessment of disease activity (PtGA) in RA and Bath Ankylosing Spondylitis Functional Index (BASFI) in AS, were revealed as the most important features in both diseases. Conclusions: The RF-method exhibited prediction performance for bDMARD responses superior to that of a conventional statistical method, i.e., logistic regression, in RA patients. In contrast, despite the comparable size of the dataset, machine learning did not outperform logistic regression in AS patients. According to feature importance analysis, the most important features in both diseases were patient self-reporting scales.


2019 ◽  
Vol 40 (Supplement_1) ◽  
Author(s):  
I Korsakov ◽  
A Gusev ◽  
T Kuznetsova ◽  
D Gavrilov ◽  
R Novitskiy

Background: Advances in precision medicine will require an increasingly individualized prognostic evaluation of patients in order to provide each patient with appropriate therapy. The traditional statistical methods of predictive modeling, such as SCORE, PROCAM, and Framingham, according to the European guidelines for the prevention of cardiovascular disease, are not adapted for all patients and require significant human involvement in the selection of predictive variables and in the transformation and imputation of variables. In ROC analysis for prediction of significant cardiovascular disease (CVD), the areas under the curve are 0.62–0.72 for Framingham, 0.66–0.73 for SCORE, and 0.60–0.69 for PROCAM. To improve on this, we apply approaches that predict a CVD event from conventional risk factors, using machine learning and deep learning models, to 10-year CVD event prediction with longitudinal electronic health record (EHR) data. Methods: For machine learning, we applied logistic regression (LR), and as a deep learning algorithm, recurrent neural networks with long short-term memory (LSTM) units. We extracted the following features from the longitudinal EHR: demographics, vital signs, diagnoses (ICD-10-CM: I21-I22.9 and I61-I63.9), and medication. A problem at this step is that nearly 80 percent of the clinical information in the EHR is “unstructured” and contains errors and typos. How missing data are handled matters for the correct training of deep learning and machine learning algorithms. The study cohort included patients between the ages of 21 and 75 with a dynamic observation window. In total, the dataset contained 31517 individuals, but only 3652 individuals had all features or had missing feature values that could be easily imputed. Among these 3652 individuals, 29.4% had CVD, the mean age was 49.4 years, and 68.2% were female. Evaluation: We randomly divided the dataset into a training and a test set with an 80/20 split. The LR was implemented with Python Scikit-Learn and the LSTM model was implemented with Keras using TensorFlow as the backend. Results: We applied machine learning and deep learning models using the same features as the traditional risk scales and using longitudinal EHR features for CVD prediction, respectively. The machine learning model (LR) achieved an AUROC of 0.74–0.76 and the deep learning model (LSTM) 0.75–0.76. With features from the EHR, the logistic regression and deep learning models improved the AUROC to 0.78–0.79. Conclusion: The machine learning models outperformed the traditional clinically used predictive models for CVD risk prediction (i.e., SCORE, PROCAM, and Framingham equations). This approach was used to create a clinical decision support system (CDSS). It uses both traditional risk scales and models based on neural networks. Especially important is the fact that the system can calculate the risks of cardiovascular disease automatically and recalculate them immediately after new information is added to the EHR. The results are delivered to the user's personal account.
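The evaluation protocol described in this abstract (a random 80/20 split followed by AUROC scoring) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `train_test_split_indices` and `auroc`, the fixed seed, and the toy labels and scores are all assumptions introduced here. AUROC is computed through its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Split a cohort the size of the one in the abstract (3652 individuals) 80/20.
train_idx, test_idx = train_test_split_indices(3652)

# Toy labels and scores (hypothetical, not study data).
example_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch; a library routine such as scikit-learn's `roc_auc_score` would be the usual choice in practice.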


2021 ◽  
Vol 23 (07) ◽  
pp. 1129-1139
Author(s):  
Manikantha K ◽  
Aishwarya R Bhat ◽  
Pavani Nerella ◽  
Pooja Baburaj ◽  
...  

Recognising one’s identity to enter a system is called authentication. This process can take various forms in which users supply the system with a set of identifying credentials to gain access. Signatures belong to behavioural biometrics, in which the distinct features of every individual are considered in order to corroborate the person’s identity. The act of falsely imitating one’s signature to impersonate them and gain access to their assets is called signature forgery. Our paper presents a comparative study of various deep learning models using Siamese architecture, over a wide catalogue of signature images. Openly available datasets like CEDAR, Handwritten Signatures dataset from Kaggle, ICDAR 2011 SigComp, and BH-Sig260 signature corpus are used to train the models. A set of classifiers (Support Vector Classifiers (SVC), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), and K-Nearest Neighbours (KNN)) is applied sequentially to classify the signature as genuine or forged.


Entropy ◽  
2021 ◽  
Vol 23 (2) ◽  
pp. 143
Author(s):  
Domjan Barić ◽  
Petar Fumić ◽  
Davor Horvatić ◽  
Tomislav Lipic

The adaptation of deep learning models within safety-critical systems cannot rely only on good prediction performance but needs to provide interpretable and robust explanations for their decisions. When modeling complex sequences, attention mechanisms are regarded as the established approach to support deep neural networks with intrinsic interpretability. This paper focuses on the emerging trend of specifically designing diagnostic datasets for understanding the inner workings of attention mechanism based deep learning models for multivariate forecasting tasks. We design a novel benchmark of synthetically designed datasets with the transparent underlying generating process of multiple time series interactions with increasing complexity. The benchmark enables empirical evaluation of the performance of attention based deep neural networks in three different aspects: (i) prediction performance score, (ii) interpretability correctness, (iii) sensitivity analysis. Our analysis shows that although most models have satisfying and stable prediction performance results, they often fail to give correct interpretability. The only model with both a satisfying performance score and correct interpretability is IMV-LSTM, capturing both autocorrelations and crosscorrelations between multiple time series. Interestingly, while evaluating IMV-LSTM on simulated data from statistical and mechanistic models, the correctness of interpretability increases with more complex datasets.


Author(s):  
Ishtiaque Ahmed ◽  
Manan Darda ◽  
Neha Tikyani ◽  
Rachit Agrawal ◽  
...  

The COVID-19 pandemic has caused large-scale outbreaks in more than 150 countries worldwide, causing massive damage to the livelihood of many people. The capacity to identify infected patients early and give them specific treatment is quite possibly the primary step in the battle against COVID-19. One of the quickest ways to diagnose patients is to use radiography and radiology images to detect the disease. Early studies have shown that chest X-rays of patients infected with COVID-19 have unique abnormalities. To identify COVID-19 patients from chest X-ray images, we used various deep learning models based on previous studies. We first compiled a data set of 2,815 chest radiographs from public sources. The model produces reliable and stable results with an accuracy of 91.6%, a positive predictive value of 80%, a negative predictive value of 100%, a specificity of 87.50%, and a sensitivity of 100%. It is observed that the CNN-based architecture can diagnose COVID-19 disease. These results can be further improved by increasing the dataset size and by developing the CNN-based architecture for training the model.
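The five metrics reported in this abstract all derive from a single binary confusion matrix, and a short sketch makes the relationship concrete. The function name `diagnostic_metrics` and the counts below are assumptions for illustration only; the abstract does not give the study's confusion matrix. The counts were picked because their ratio (4 true positives : 1 false positive : 7 true negatives : 0 false negatives) happens to reproduce the reported percentages.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from binary confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
        "specificity": tn / (tn + fp),  # true negative rate
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
    }


# Hypothetical counts, NOT taken from the study.
metrics = diagnostic_metrics(tp=4, fp=1, tn=7, fn=0)
# accuracy = 11/12 (about 0.917), ppv = 0.8, npv = 1.0,
# specificity = 0.875, sensitivity = 1.0
```

Because sensitivity and negative predictive value share the false-negative count, both reach 100% exactly when no infected case is missed, which is consistent with the figures in the abstract.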


2021 ◽  
Vol 6 ◽  
pp. 248
Author(s):  
Paul Mwaniki ◽  
Timothy Kamanu ◽  
Samuel Akech ◽  
Dustin Dunsmuir ◽  
J. Mark Ansermino ◽  
...  

Background: The success of many machine learning applications depends on knowledge about the relationship between the input data and the task of interest (output), hindering the application of machine learning to novel tasks. End-to-end deep learning, which does not require intermediate feature engineering, has been recommended to overcome this challenge but end-to-end deep learning models require large labelled training data sets often unavailable in many medical applications. In this study, we trained machine learning models to predict paediatric hospitalization given raw photoplethysmography (PPG) signals obtained from a pulse oximeter. We trained self-supervised learning (SSL) for automatic feature extraction from PPG signals and assessed the utility of SSL in initializing end-to-end deep learning models trained on a small labelled data set with the aim of predicting paediatric hospitalization. Methods: We compared logistic regression models fitted using features extracted using SSL with end-to-end deep learning models initialized either randomly or using weights from the SSL model. We also compared the performance of SSL models trained on labelled data alone (n=1,031) with SSL trained using both labelled and unlabelled signals (n=7,578). Results: The SSL model trained on both labelled and unlabelled PPG signals produced features that were more predictive of hospitalization compared to the SSL model trained on labelled PPG only (AUC of logistic regression model: 0.78 vs 0.74). The end-to-end deep learning model had an AUC of 0.80 when initialized using the SSL model trained on all PPG signals, 0.77 when initialized using SSL trained on labelled data only, and 0.73 when initialized randomly. Conclusions: This study shows that SSL can improve the classification of PPG signals by either extracting features required by logistic regression models or initializing end-to-end deep learning models.
Furthermore, SSL can leverage larger unlabelled data sets to improve performance of models fitted using small labelled data sets.


2020 ◽  
Author(s):  
Michael J Horry ◽  
Subrata Chakraborty ◽  
Manoranjan Paul ◽  
Anwaar Ulhaq ◽  
Biswajeet Pradhan ◽  
...  

Detecting COVID-19 early may help in devising an appropriate treatment plan and disease containment decisions. In this study, we demonstrate how pre-trained deep learning models can be adopted to perform COVID-19 detection using X-Ray images. The aim is to provide over-stressed medical professionals with a second pair of eyes through intelligent image classification models. We highlight the challenges (including dataset size and quality) in utilising currently publicly available COVID-19 datasets for developing useful deep learning models. We propose a semi-automated image pre-processing model to create a trustworthy image dataset for developing and testing deep learning models. The new approach aims to reduce unwanted noise in X-Ray images so that deep learning models can focus on detecting diseases with specific features from them. Next, we devise a deep learning experimental framework, where we utilise the processed dataset to perform comparative testing for several popular and widely available deep learning model families such as VGG, Inception, Xception, and Resnet. The experimental results highlight the suitability of these models for the currently available dataset and indicate that models with simpler networks, such as VGG19, perform relatively better, with up to 83% precision. This will provide a solid pathway for researchers and practitioners to develop improved models in the future.



