Weighting Methods for Rare Event Identification From Imbalanced Datasets

2021 ◽  
Vol 4 ◽  
Author(s):  
Jia He ◽  
Maggie X. Cheng

In machine learning, we often face the situation where the event we are interested in has very few data points buried in a massive amount of data. This is typical in network monitoring, where data are streamed continuously from sensing or measuring units but most of the data do not correspond to events. With imbalanced datasets, classifiers tend to be biased in favor of the main class. Rare event detection has received much attention in machine learning, and yet it is still a challenging problem. In this paper, we propose a remedy for this standing problem. Weighting and sampling are the two fundamental approaches to addressing it; we focus on the weighting method in this paper. We first propose a boosting-style algorithm to compute class weights, which is proved to have excellent theoretical properties. We then propose an adaptive algorithm, which is suitable for real-time applications. The adaptive nature of the two algorithms allows a controlled tradeoff between the true positive rate and the false positive rate and avoids placing excessive weight on the rare class, which would lead to poor performance on the main class. Experiments on power grid data and several public datasets show that the proposed algorithms outperform existing weighting and boosting methods, and that their superiority is more noticeable with noisy data.
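
The abstract does not spell out the weight computation, so the sketch below shows only the standard inverse-frequency class-weighting baseline that such boosting-style and adaptive algorithms improve upon; the synthetic data, model choice, and all names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: inverse-frequency class weighting for a rare-event classifier.
# This is the standard baseline; the paper's boosting-style/adaptive weight
# computation is not reproduced here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: ~1% positive (rare event).
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.99, 0.01],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Inverse-frequency weights: w_c = n_samples / (n_classes * n_c).
weights = compute_class_weight("balanced", classes=np.unique(y_tr), y=y_tr)
clf = LogisticRegression(class_weight={0: weights[0], 1: weights[1]}, max_iter=1000)
clf.fit(X_tr, y_tr)

# Report true/false positive rates, the tradeoff that adaptive weighting controls.
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print(f"TPR={tp / (tp + fn):.3f}  FPR={fp / (fp + tn):.3f}")
```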

Sensors ◽  
2020 ◽  
Vol 20 (2) ◽  
pp. 348 ◽  
Author(s):  
Chang-Hee Han ◽  
Euijin Kim ◽  
Chang-Hwan Im

Asynchronous brain–computer interfaces (BCIs) based on electroencephalography (EEG) generally suffer from poor performance in terms of classification accuracy and false-positive rate (FPR). Thus, BCI toggle switches based on electrooculogram (EOG) signals were developed to toggle on/off synchronous BCI systems. The conventional BCI toggle switches exhibit fast responses with high accuracy; however, they have a high FPR or cannot be applied to patients with oculomotor impairments. To circumvent these issues, we developed a novel BCI toggle switch that users can employ to toggle on or off synchronous BCIs by holding their breath for a few seconds. Two states—normal breathing and breath holding—were classified using a linear discriminant analysis with features extracted from the respiration-modulated photoplethysmography (PPG) signals. A real-time BCI toggle switch was implemented with a calibration model trained on only 1 min of PPG data. We evaluated the performance of our PPG switch, in terms of the true-positive rate and FPR, by combining it with a steady-state visual evoked potential-based BCI system designed to control four external devices. The parameters of the PPG switch were optimized through an offline experiment with five subjects, and the performance of the switch system was evaluated in an online experiment with seven subjects. All the participants successfully turned on the BCI by holding their breath for approximately 10 s (100% accuracy), and the switch system exhibited a very low FPR of 0.02 false operations per minute, which is the lowest FPR reported thus far. All participants could successfully control external devices in the synchronous BCI mode. Our results demonstrated that the proposed PPG-based BCI toggle switch can be used to implement practical BCIs.
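
As a rough illustration of the classification step described above, the sketch below trains a linear discriminant analysis model to separate normal breathing from breath holding; the `extract_features` function and the synthetic PPG windows are hypothetical placeholders, since the abstract does not specify how the respiration-modulated features are computed.

```python
# Minimal sketch of the classification step: LDA separating "normal breathing"
# vs. "breath holding" from features of PPG windows. Feature extraction and the
# data are placeholders, not the authors' pipeline.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def extract_features(ppg_window: np.ndarray) -> np.ndarray:
    """Placeholder features: window variance and low-frequency spectral power."""
    return np.array([ppg_window.std(), np.abs(np.fft.rfft(ppg_window))[:5].sum()])

rng = np.random.default_rng(0)
# Synthetic 1-min calibration data: 60 windows per class of 1 s PPG at 100 Hz.
normal = [rng.normal(0, 1.0, 100) for _ in range(60)]
hold = [rng.normal(0, 0.3, 100) for _ in range(60)]   # breath holding damps modulation

X = np.array([extract_features(w) for w in normal + hold])
y = np.array([0] * 60 + [1] * 60)                     # 0 = normal, 1 = breath hold

lda = LinearDiscriminantAnalysis().fit(X, y)
new_window = rng.normal(0, 0.3, 100)
print("toggle" if lda.predict(extract_features(new_window)[None, :])[0] == 1 else "idle")
```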


2021 ◽  
Vol 39 (10) ◽  
Author(s):  
Eka Sudarmaji ◽  
Noer Azam Achsani ◽  
Yandra Arkeman ◽  
Idqan Fahmi

Companies can form their own "ESCO model" with their own capital. Unfortunately, customers' creditworthiness has become increasingly crucial for ESCOs. Machine learning was used to predict the creditworthiness of clients in ESCO financing processes. This research aimed to develop a scoring model that leverages machine learning and life cycle cost analysis (LCCA) to evaluate alternative financing for energy saving in Indonesia. The multinomial logistic regression model achieved an accuracy of 88.3562% on the test data. The prediction rate, i.e., the percentage of correct predictions among all test data, was 91.67%, and the false positive rate (FPR) was 39.44%. The true positive rate (TPR), also called recall or the 'sensitivity rate' because it is defined as the proportion of positive cases that are correctly identified, was 92.20%. We found that machine learning methods for creditworthiness prediction in retrofitting projects are new and worth pursuing, and we hope this practice will grow in popularity and become standard among ESCOs. Unfortunately, current machine-learning-based creditworthiness scoring practices lack explainability and interpretability, and an ESCO may have to penalize the retrofitting project as a result. Since retrofitting is a new industry, the credit approval process is challenging to communicate to consumers, so the most important thing for an ESCO undertaking a project is to build a relationship with, and knowledge of, the client. Research from these case studies led to a clearer understanding of the factors affecting all parties' decisions to implement and continue their ESCO projects.
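
The abstract reports results from a multinomial logistic regression scoring model; the sketch below shows, on synthetic placeholder data, how such a model and the quoted metrics (accuracy, per-class TPR and FPR) might be computed. It is not the authors' implementation.

```python
# Minimal sketch, assuming a tabular matrix of client/LCCA-derived variables and
# an ordinal creditworthiness label; synthetic data stand in for the ESCO dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 creditworthiness grades, 12 illustrative features.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# With the default lbfgs solver, LogisticRegression fits a multinomial model
# for multiclass targets.
model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"accuracy={accuracy_score(y_te, pred):.4f}")

# One-vs-rest TPR/FPR per class, matching the metrics quoted in the abstract.
cm = confusion_matrix(y_te, pred)
for c in range(cm.shape[0]):
    tp = cm[c, c]; fn = cm[c].sum() - tp
    fp = cm[:, c].sum() - tp; tn = cm.sum() - tp - fn - fp
    print(f"class {c}: TPR={tp / (tp + fn):.3f}  FPR={fp / (fp + tn):.3f}")
```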


Wireless networks continuously face challenges in the field of information security, which drives major research in the area of intrusion detection. Intrusion detection is performed mainly by signature-based detection and anomaly-based detection; anomaly-based detection is based on the behavior of the network. One of the major challenges in this domain is to identify and detect malicious nodes in wireless networks. The intrusion detection mechanism has to analyse the behavior of each node in the network by means of the several features it possesses, and intelligent schemes are needed in such scenarios. This paper takes a standard dataset for studying the features of wireless nodes and reduces the features by applying the Correlation Attribute feature selection method. Machine learning algorithms are then applied to obtain an effective training model, which is applied to the testing dataset to validate the model. The accuracy of the model is determined by performance parameters such as the true positive rate, false positive rate, and ROC area. The neural network, bagging, and RepTree decision tree algorithms give promising results in comparison with the other classification algorithms.
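
Assuming the Correlation Attribute method corresponds to ranking features by their absolute Pearson correlation with the class label (as in Weka's CorrelationAttributeEval), the sketch below reproduces the overall pipeline on synthetic data; the plain decision tree and bagging classifier stand in for Weka's RepTree and bagging meta-learner.

```python
# Minimal sketch: correlation-based feature ranking followed by several
# classifiers evaluated on TPR, FPR and ROC area. Data and the top-k cut-off
# are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           random_state=0)

# Rank features by |Pearson correlation| with the label, keep the top k.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top_k = np.argsort(corr)[::-1][:10]
X_tr, X_te, y_tr, y_te = train_test_split(X[:, top_k], y, stratify=y, random_state=0)

# RepTree is Weka-specific; a plain decision tree and bagging stand in here.
models = {"neural_net": MLPClassifier(max_iter=500, random_state=0),
          "bagging": BaggingClassifier(random_state=0),
          "tree": DecisionTreeClassifier(random_state=0)}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, m.predict(X_te)).ravel()
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"{name}: TPR={tp/(tp+fn):.3f} FPR={fp/(fp+tn):.3f} ROC AUC={auc:.3f}")
```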


Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 711-711
Author(s):  
Sanjeet Dadwal ◽  
Zahra Eftekhari ◽  
Tushondra Thomas ◽  
Deron Johnson ◽  
Dongyun Yang ◽  
...  

Abstract Sepsis and severe sepsis contribute significantly to early treatment-related mortality after hematopoietic cell transplantation (HCT), with reported mortality rates of 30% and 55% due to severe sepsis during the engraftment admission for autologous and allogeneic HCT, respectively. Since the clinical presentation and characteristics of sepsis immediately after HCT can differ from those seen in the general population or in patients receiving non-HCT chemotherapy, detecting early signs of sepsis in HCT recipients becomes critical. Herein, we developed and validated a machine-learning-based sepsis prediction model for patients who underwent HCT at City of Hope, using variables within the Electronic Health Record (EHR) data. We evaluated a consecutive case series of 1046 HCTs (autologous: n=491, allogeneic: n=555) at our center between 2014 and 2017. The median age at the time of HCT was 56 years (range: 18-78). For this analysis, the primary clinical event was sepsis diagnosis within 100 days post-HCT, identified based on use of the institutional sepsis management order set and mention of "sepsis" in the progress notes. The time of the sepsis order set was considered the time of sepsis for analyses. To train the model, 829 visits (104 septic and 725 non-septic) and their data were used, while 217 visits (31 septic and 186 non-septic) were used as a validation cohort. At each hour after HCT, when a new data point was available, 47 variables were calculated from each patient's data and a risk score was assigned to each time point. These variables consisted of patient demographics, transplant type, regimen intensity, disease status, the hematopoietic cell transplantation-specific comorbidity index, lab values, vital signs, medication orders, and comorbidities. For the 829 visits in the training dataset, the 47 variables were calculated at 220,889 different time points, resulting in a total of 10,381,783 data points. Lab values and vital signs were expressed as changes from each individual patient's baseline at each time point; the baseline for each lab value and vital sign was the last measured value before HCT. An ensemble of 20 random forest binary classification models was trained to identify and learn patterns of data for HCT patients at high risk for sepsis and differentiate them from patients at lower sepsis risk. To help the model learn patterns of data prior to sepsis, available data from septic patients within the 24 hours preceding the diagnosis of sepsis were used; for the septic visits in the training dataset, there were 5048 such time points, each having 47 variables. Variable importance for the 20 models was assessed using the Gini mean decrease in accuracy method, and the sum of the importance values from each model was calculated for each variable as the final importance value. Figure 1a shows the importance of variables using this method. Testing the model on the validation cohort resulted in an AUC of 0.85 (Figure 1b). At a threshold of 0.6, our model had a sensitivity of 0.32 and a specificity of 0.96. At this threshold, the model identified 10 out of 31 septic patients with a median lead time of 119.5 hours, of which 2 patients were flagged as high risk at the time of transplant and developed sepsis at 17 and 60 days post-HCT. This lead time is what truly sets this predictive model apart from detective models that use organ failure, organ dysfunction, or other deterioration metrics as their detection criteria. At a threshold of 0.4, our model had a sensitivity of 0.9 and a specificity of 0.65.
In summary, a machine-learning sepsis prediction model can be tailored towards HCT recipients to improve the quality of care, prevent sepsis-associated organ damage, and decrease mortality post-HCT. Our model significantly outperforms the widely used Modified Early Warning Score (MEWS), which has an AUC of 0.73 in the general population. Possible applications of our model include showing a "red flag" at a threshold of 0.6 (0.32 true positive rate and 0.04 false positive rate) for antibiotic initiation/modification, and a "yellow flag" at a threshold of 0.4 (0.9 true positive rate and 0.35 false positive rate) suggesting closer monitoring or less aggressive treatments for the patient. Figure 1. Disclosures Dadwal: MERK: Consultancy, Membership on an entity's Board of Directors or advisory committees, Research Funding, Speakers Bureau; Gilead: Research Funding; AiCuris: Research Funding; Shire: Research Funding.
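
A minimal sketch of the modeling idea described in the abstract is given below: an ensemble of 20 random forests whose averaged probability serves as an hourly risk score, with summed feature importances and the 0.6/0.4 flag thresholds. The balanced subsampling, the synthetic feature matrix, and the impurity-based importances (standing in for the Gini/accuracy-decrease measure named in the abstract) are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch: 20-model random forest ensemble producing an averaged sepsis
# risk score, summed feature importances, and red/yellow flag thresholds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 47 EHR-derived variables per time point.
X, y = make_classification(n_samples=8000, n_features=47, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Ensemble of 20 forests; each here is trained on a balanced subsample, one
# plausible construction the abstract does not specify.
rng = np.random.default_rng(0)
pos, neg = np.flatnonzero(y_tr == 1), np.flatnonzero(y_tr == 0)
forests = []
for seed in range(20):
    idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    forests.append(RandomForestClassifier(n_estimators=100, random_state=seed)
                   .fit(X_tr[idx], y_tr[idx]))

# Averaged risk score plus summed (impurity-based) feature importances.
risk = np.mean([f.predict_proba(X_te)[:, 1] for f in forests], axis=0)
importance = np.sum([f.feature_importances_ for f in forests], axis=0)

red, yellow = risk >= 0.6, (risk >= 0.4) & (risk < 0.6)
print(f"red flags: {red.sum()}, yellow flags: {yellow.sum()}")
print("top features:", np.argsort(importance)[::-1][:5])
```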


Author(s):  
Ravinder Ahuja ◽  
Vishal Vivek ◽  
Manika Chandna ◽  
Shivani Virmani ◽  
Alisha Banga

An early diagnosis of insomnia can prevent further medical complications such as anger issues, heart disease, anxiety, depression, and hypertension. Fifteen machine learning algorithms have been applied and 14 leading factors have been taken into consideration for predicting insomnia. Seven performance parameters (accuracy, kappa, true positive rate, false positive rate, precision, f-measure, and AUC) are used, and the implementation is in Python. The support vector machine gives the highest performance of all the algorithms, with an accuracy of 91.6%, an f-measure of 92.13, and a kappa of 0.83. Further, the SVM is applied to another dataset of 100 patients, giving an accuracy of 92%. In addition, variable importance is calculated for CART, C5.0, decision tree, random forest, AdaBoost, and XGBoost. The analysis shows that insomnia primarily depends on vision problems, mobility problems, and sleep disorder. This chapter mainly examines the usefulness and effectiveness of machine learning algorithms in insomnia prediction.
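
As a rough illustration of the evaluation style described above, the sketch below fits a support vector machine on a synthetic 14-feature dataset and reports the same seven metrics (accuracy, kappa, precision, f-measure, TPR, FPR, AUC); it is not the chapter's implementation.

```python
# Minimal sketch, assuming a tabular dataset with 14 predictor columns and a
# binary insomnia label; synthetic data stand in for the patient records.
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             f1_score, precision_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=14, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
svm.fit(X_tr, y_tr)
pred = svm.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print(f"accuracy={accuracy_score(y_te, pred):.3f}",
      f"kappa={cohen_kappa_score(y_te, pred):.3f}",
      f"precision={precision_score(y_te, pred):.3f}",
      f"f-measure={f1_score(y_te, pred):.3f}",
      f"TPR={tp / (tp + fn):.3f}",
      f"FPR={fp / (fp + tn):.3f}",
      f"AUC={roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1]):.3f}")
```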


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 4477-4477
Author(s):  
Zahra Eftekhari ◽  
Sally Mokhtari ◽  
Tushondra Thomas ◽  
Dongyun Yang ◽  
Liana Nikolaenko ◽  
...  

Sepsis contributes significantly to early treatment-related mortality after hematopoietic cell transplantation (HCT). Since the clinical presentation and characteristics of sepsis immediately after HCT can differ from those seen in the general population or in patients receiving non-HCT chemotherapy, detecting early signs of sepsis in HCT recipients becomes critical. Herein, we extended our earlier analyses (Dadwal et al. ASH 2018) and evaluated a consecutive case series of 1806 patients who underwent HCT at City of Hope (2014-2017) to develop a machine-learning sepsis prediction model for HCT recipients, namely Early Sepsis Prediction/Identification for Transplant Recipients (ESPRIT), using variables within the Electronic Health Record (EHR) data. The primary clinical event was sepsis diagnosis within 100 days post-HCT, identified based on the use of the institutional "sepsis management order set" and mention of "sepsis" in the progress notes. The time of the sepsis order set was considered the time of sepsis for the analyses. Data from 2014 to 2016 (108 visits with and 1315 visits without sepsis, 8% sepsis prevalence) were used as the training set, and data from 2017 (24 visits with and 359 visits without sepsis, 6.6% sepsis prevalence) were kept as the holdout dataset for testing the model. From each patient visit, 61 variables were collected, with a total of 862,009 lab values, 3,284,561 vital sign values, and 249,982 medication orders for 1806 visits over the duration of the HCT hospitalization (median: 24.1 days, range: 7-304). An ensemble of 100 random forest classification models was used to develop the prediction model. Last Observation Carried Forward (LOCF) imputation was done to fill missing values with the last observed value of each variable. For model development and optimization, we applied 5-fold stratified cross-validation on the training dataset. Variable importance for the 100 models was assessed using the Gini mean decrease in accuracy, which was averaged to produce the final variable importance. HCT was autologous in 798 and allogeneic in 1008 patients. An ablative conditioning regimen was delivered to 97.3% and 38.3% of patients in the autologous and allogeneic groups, respectively. When the impact of sepsis was analyzed as a time-dependent variable, sepsis development was associated with increased mortality (HR=2.79, 95% CI: 2.14-3.64, p<0.001) by a multivariable Cox regression model. Retrospective evaluation at 0, 4, 8, and 12 hours pre-sepsis showed areas under the ROC curve (AUCs) of 0.98, 0.91, 0.90, and 0.85, respectively (Fig 1a), outperforming the widely used Modified Early Warning Score (MEWS) (Fig 1b). We then simulated ESPRIT's performance on unselected real-world data by running the model every hour from admission until sepsis or discharge, whichever occurred first; this process created an hourly risk score from admission to sepsis or discharge. ESPRIT achieved an AUC of 0.83 on the training dataset and an AUC of 0.82 on the holdout test dataset (Fig 2). An example of risk over time for a septic patient identified by the model with a 27-hour lead time at a threshold of 0.6 is shown in Fig 3. With a risk threshold of 0.6 (sensitivity: 0.4, specificity: 0.93), ESPRIT had a median lead time of 35 and 47 hours on the training and holdout test data, respectively. The model allows users to select any threshold (with the specific false positive/negative rate expected for a given population) for specific purposes.
For example, a red flag can be assigned to a patient when the risk passes the threshold of 0.6; at this threshold the false positive rate is only 7% and the true positive rate is 40%. A yellow flag can then be assigned at the threshold of 0.4, at which the model has a higher (38%) false positive rate but also a high (90%) true positive rate. Using this two-step assessment/intervention system (red flag as an alarm and yellow flag as a warning sign to examine the patient to rule out sepsis), the model would achieve 90% sensitivity and 93% specificity in practice and overcome the low positive predictive value due to the rare incidence of sepsis. In summary, we developed and validated a novel machine learning monitoring system for sepsis prediction in HCT recipients. Our data strongly support further clinical validation of the ESPRIT model as a method to provide real-time sepsis predictions and timely initiation of preemptive antibiotic therapy according to the predicted risks in the era of the EHR. Disclosures Dadwal: Ansun Biopharma: Research Funding; SHIRE: Research Funding; Janssen: Membership on an entity's Board of Directors or advisory committees; Merck: Membership on an entity's Board of Directors or advisory committees; Clinigen: Membership on an entity's Board of Directors or advisory committees. Nakamura: Kirin Kyowa: Other: support for an academic seminar in a university in Japan; Merck: Membership on an entity's Board of Directors or advisory committees; Celgene: Other: support for an academic seminar in a university in Japan; Alexion: Other: support for a lecture at a Japan Society of Transfusion/Cellular Therapy meeting.
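
The sketch below illustrates, on placeholder data, two mechanics mentioned in the abstract: last-observation-carried-forward (LOCF) imputation of hourly EHR features and hourly scoring of a trained model to produce a risk trajectory with a 0.6 red-flag threshold. The single random forest stands in for the 100-model ESPRIT ensemble, and the baseline-delta features are assumptions.

```python
# Minimal sketch: LOCF imputation of an hourly feature series and hourly risk
# scoring with a red-flag threshold. Model and data are synthetic placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hourly vitals/labs (as deltas from the pre-HCT baseline) for one hypothetical
# 48-hour admission, with gaps where nothing was measured (NaN).
raw = pd.DataFrame(
    {"temp_delta": np.where(rng.random(48) < 0.7, np.nan, rng.normal(0, 0.5, 48)),
     "hr_delta": np.where(rng.random(48) < 0.5, np.nan, rng.normal(0, 10, 48))},
    index=np.arange(48))  # hours since admission

# LOCF: carry the last observed value forward; before the first observation,
# fall back to the baseline delta of 0.
features = raw.ffill().fillna(0.0)

# Placeholder model standing in for the trained ensemble.
train_X = rng.normal(size=(500, 2))
train_y = (rng.random(500) > 0.9).astype(int)
model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

# Hourly risk score from admission onward; a red flag fires above 0.6.
risk = model.predict_proba(features.values)[:, 1]
print(f"hours flagged: {(risk >= 0.6).sum()} of {len(risk)}")
```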


Web usage and digitized information are growing every day, and the amount of data generated grows with them. On the other side, security attacks pose numerous threats to networks, websites, and the Internet. Intrusion detection in a high-speed network is a genuinely hard task. A Hadoop implementation is used to address this challenge, namely detecting intrusions in a big data environment in real time. Machine learning approaches are used to classify anomalous packet flows. Naive Bayes performs classification using a vector of feature values drawn from some finite set. The decision tree is another machine learning classifier, also a supervised learning model, structured as a flowchart-like tree. The J48 and Naive Bayes algorithms are implemented in the Hadoop MapReduce framework for parallel processing, using the KDDCup corrected benchmark dataset records. The results obtained are an 89.9% true positive rate and a 0.04% false positive rate for the Naive Bayes algorithm, and a 98.06% true positive rate and a 0.001% false positive rate for the decision tree algorithm.
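
The Hadoop MapReduce parallelisation is not reproduced here, but the underlying comparison can be sketched on a single machine as below, with Gaussian Naive Bayes and a CART decision tree standing in for Weka-style Naive Bayes and J48, and a synthetic matrix standing in for the KDDCup corrected records.

```python
# Minimal single-machine sketch of the Naive Bayes vs. decision tree comparison,
# reporting true and false positive rates on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=41, n_informative=15,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("naive_bayes", GaussianNB()),
                  ("decision_tree", DecisionTreeClassifier(random_state=0))]:
    tn, fp, fn, tp = confusion_matrix(y_te, clf.fit(X_tr, y_tr).predict(X_te)).ravel()
    print(f"{name}: TPR={tp / (tp + fn):.4f}  FPR={fp / (fp + tn):.4f}")
```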


2017 ◽  
Vol 56 (04) ◽  
pp. 308-318 ◽  
Author(s):  
Asli Bostanci ◽  
Murat Turhan ◽  
Selen Bozkurt

Summary. Objectives: The goal of this study is to evaluate the results of machine learning methods for classifying the OSA severity of patients with suspected sleep-disordered breathing as normal, mild, moderate, or severe, based on non-polysomnographic variables: 1) clinical data, 2) symptoms, and 3) physical examination. Methods: In order to produce classification models for OSA severity, five different machine learning methods (Bayesian network, decision tree, random forest, neural networks, and logistic regression) were trained, while relevant variables and their relationships were derived empirically from the observed data. Each model was trained and evaluated using 10-fold cross-validation, and to evaluate the classification performance of all methods, the true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), F-measure, and area under the receiver operating characteristic curve (ROC-AUC) were used. Results: The results of 10-fold cross-validated tests with different variable settings promisingly indicated that the OSA severity of suspected OSA patients can be classified, using non-polysomnographic features, with a true positive rate as high as 0.71 and a false positive rate as low as 0.15. Moreover, the test results for the different variable settings revealed that the accuracy of the classification models was significantly improved when the physical examination variables were added to the model. Conclusions: The study results showed that machine learning methods can be used to estimate the probabilities of no, mild, moderate, and severe obstructive sleep apnea, and such approaches may improve accurate initial OSA screening and help refer only the suspected moderate or severe OSA patients to sleep laboratories for the expensive tests.
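
As a rough illustration of the evaluation protocol (10-fold cross-validation over several classifiers on a four-level severity label), the sketch below uses synthetic data and omits the Bayesian network, for which scikit-learn has no direct counterpart; it is not the study's implementation.

```python
# Minimal sketch: four-class OSA severity classification compared across several
# models with 10-fold cross-validation, reporting weighted TPR, PPV and F-measure.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for clinical/symptom/physical-examination features.
X, y = make_classification(n_samples=1200, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {"decision_tree": DecisionTreeClassifier(random_state=0),
          "random_forest": RandomForestClassifier(random_state=0),
          "neural_net": MLPClassifier(max_iter=500, random_state=0),
          "logistic": LogisticRegression(max_iter=2000)}
for name, m in models.items():
    pred = cross_val_predict(m, X, y, cv=cv)
    p, r, f, _ = precision_recall_fscore_support(y, pred, average="weighted")
    print(f"{name}: TPR={r:.2f}  PPV={p:.2f}  F={f:.2f}")
```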


2021 ◽  
Author(s):  
Faraz Khoshbaktian ◽  
Ardian Lagman ◽  
Dionne M Aleman ◽  
Randy Giffen ◽  
Proton Rahman

Early and effective detection of severe infection cases during a pandemic can significantly help patient prognosis and resource allocation. We develop a machine learning framework for detecting severe COVID-19 cases at the time of RT-PCR testing. We retrospectively studied 988 patients from a small Canadian province who tested positive for SARS-CoV-2, of whom 42 (4%) cases were at-risk (i.e., resulted in hospitalization, admission to the ICU, or death) and 8 (<1%) cases resulted in death. The limited information available at the time of RT-PCR testing included age, comorbidities, and patients' reported symptoms, totaling 27 features. Due to the severe class imbalance and small dataset size, we formulated the problem of detecting severe COVID as anomaly detection and applied three models: a one-class support vector machine (OCSVM), weight-adjusted XGBoost, and weight-adjusted AdaBoost. The OCSVM was the best-performing model for detecting the deceased cases, with an average 95% true positive rate (TPR) and 27.2% false positive rate (FPR), while XGBoost provided the best performance for detecting the at-risk cases, with an average 96.2% TPR and 19% FPR. In addition, we developed a novel extension of SHAP interpretability to explain the outputs of the models. In agreement with conventional knowledge, we found that comorbidities were influential in predicting severity; however, we also found that symptoms were generally more influential, noting that machine learning combines all available data and is not a single-variate statistical analysis.
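
Assuming the one-class SVM is fitted on the non-severe majority class so that severe cases surface as anomalies, and that "weight-adjusted XGBoost" refers to up-weighting the rare class via xgboost's scale_pos_weight, the sketch below illustrates the framing on synthetic stand-ins for the 27 features; it is not the authors' code.

```python
# Minimal sketch: anomaly-detection framing of severe-case detection with an
# OCSVM plus a class-weighted XGBoost baseline. Data and parameters are
# illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=27, weights=[0.96, 0.04],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def rates(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn), fp / (fp + tn)

# One-class SVM: train on non-severe cases only; -1 predictions mark anomalies.
ocsvm = OneClassSVM(nu=0.1, gamma="scale").fit(X_tr[y_tr == 0])
tpr, fpr = rates(y_te, (ocsvm.predict(X_te) == -1).astype(int))
print(f"OCSVM:   TPR={tpr:.3f} FPR={fpr:.3f}")

# Weight-adjusted XGBoost: up-weight the rare severe class.
pos_weight = (y_tr == 0).sum() / (y_tr == 1).sum()
xgb = XGBClassifier(scale_pos_weight=pos_weight).fit(X_tr, y_tr)
tpr, fpr = rates(y_te, xgb.predict(X_te))
print(f"XGBoost: TPR={tpr:.3f} FPR={fpr:.3f}")
```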


Author(s):  
H. Baba ◽  
Y. Akiyama ◽  
T. Tokudomi ◽  
Y. Takahashi

Abstract. Vacant housing detection is an urgent problem that needs to be addressed. It is also a suitable example for promoting the utilisation of smart data stored in municipalities. This study proposes a vacant housing detection model that uses closed municipal data, with a view to accelerating the use of public data to promote smart cities. Employing a machine learning technique, this study ensures high predictive power for vacant housing detection. The model enables us to handle complex municipal data that include non-linear feature characteristics and substantial missing data. In particular, handling missing data is important in the practical use of closed municipal data because not all of the data are necessarily linked to a building unit. The model in this analysis achieved an accuracy of 95.4 percent and a false positive rate of 3.7 percent, which are good enough to detect vacant houses. However, the true positive rate is 77.0 percent; although this rate is not especially low, feature selection and the collection of additional samples may improve it. The geographic distribution of vacant houses further enabled us to check the difference between the actual and estimated numbers of vacant houses: more than 80 percent of the 500-meter grid cells have an error below 10, which we think provides city planners with informative data for roughly grasping geographical tendencies.
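
A minimal sketch of one way to realise the setup described above: a gradient-boosted tree model with native missing-value support (scikit-learn's HistGradientBoostingClassifier) trained on synthetic records with heavy missingness. The choice of model and the data are assumptions for illustration, not the study's method.

```python
# Minimal sketch: vacant-housing-style detection from tabular records with
# substantial missing data, reporting accuracy, TPR and FPR.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=10000, n_features=15, weights=[0.9, 0.1],
                           random_state=0)
# ~30% of entries missing, mimicking municipal data not linked to a building unit.
X[rng.random(X.shape) < 0.3] = np.nan
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# HistGradientBoostingClassifier handles NaN values natively, so no explicit
# imputation step is needed.
clf = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print(f"accuracy={(tp + tn) / (tp + tn + fp + fn):.3f}",
      f"TPR={tp / (tp + fn):.3f}", f"FPR={fp / (fp + tn):.3f}")
```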

