Data Normalization and Standardization: Impacting Classification Model Accuracy

2021
Vol 183 (35)
pp. 6-9
Author(s):  
Mani Butwall
Author(s):  
Dmitrii Borkin
Andrea Némethová
German Michaľčonok
Konstantin Maiorov

Abstract: In this paper, we present the impact of data normalization on classification model performance. In the first part of the paper, we present the structure of our data set, discussing its features and a basic statistical analysis of the data. We worked with medical data on patients with Parkinson's disease. In the second part, we present the data normalization process and the impact of scaling the data on classification model performance. We used the XGBoost model as our classification model. The main classification task was to determine whether or not a patient has Parkinson's disease. Since the data set contains several numerical parameters on different scales, the main aim of this paper was to investigate the impact of data normalization (scaling) on the performance of the classification model.
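The abstract does not say which scaling method was applied, so the following is only a minimal sketch of two common choices, min-max normalization and z-score standardization, written with the Python standard library. The feature names and values are hypothetical and are not taken from the paper's data set.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale a feature column to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Centre a feature column on 0 with unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical columns on very different scales (illustrative values only).
jitter = [0.002, 0.004, 0.006, 0.010]
pitch_hz = [110.0, 150.0, 190.0, 230.0]

scaled_jitter = min_max_scale(jitter)
scaled_pitch = min_max_scale(pitch_hz)
standardized_pitch = z_score_scale(pitch_hz)
```

After scaling, both columns occupy the same range, which is the property the study varies when comparing classifier performance with and without normalization.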


2021
Vol 13 (15)
pp. 2935
Author(s):  
Chunhua Qian
Hequn Qiang
Feng Wang
Mingyang Li

Building a high-precision, stable, and universal automatic extraction model of the rocky desertification information is the premise for exploring the spatiotemporal evolution of rocky desertification. Taking Guizhou province as the research area and based on MODIS and continuous forest inventory data in China, we used a machine learning algorithm to build a rocky desertification model with bedrock exposure rate, temperature difference, humidity, and other characteristic factors and considered improving the model accuracy from the spatial and temporal dimensions. The results showed the following: (1) The supervised classification method was used to build a rocky desertification model, and the logical model, RF model, and SVM model were constructed separately. The accuracies of the models were 73.8%, 78.2%, and 80.6%, respectively, and the kappa coefficients were 0.61, 0.672, and 0.707, respectively. SVM performed the best. (2) Vegetation types and vegetation seasonal phases are closely related to rocky desertification. After combining them, the model accuracy and kappa coefficient improved to 91.1% and 0.861. (3) The spatial distribution characteristics of rocky desertification in Guizhou are obvious, showing a pattern of being heavy in the west, light in the east, heavy in the south, and light in the north. Rocky desertification has continuously increased from 2001 to 2019. In conclusion, combining the vertical spatial structure of vegetation and the differences in seasonal phase is an effective method to improve the modeling accuracy of rocky desertification, and the SVM model has the highest rocky desertification classification accuracy. The research results provide data support for exploring the spatiotemporal evolution pattern of rocky desertification in Guizhou.
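The abstract reports both accuracy and a kappa coefficient for each model. Assuming the coefficient is Cohen's kappa computed from a confusion matrix (the abstract does not say so explicitly), the relationship between the two measures can be sketched as follows. The confusion matrix is invented for illustration and is not the study's data.

```python
def accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are reference classes, columns are predicted classes.
    Kappa is undefined when chance agreement equals 1; that case is not handled here.
    """
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical 3-class confusion matrix (e.g. three desertification grades).
confusion = [
    [40, 5, 5],
    [4, 42, 4],
    [6, 3, 41],
]
accuracy, kappa = accuracy_and_kappa(confusion)
```

Because kappa subtracts chance agreement, it is lower than the raw accuracy for the same predictions, which matches the pattern of the figures quoted in the abstract.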


2019
Vol 8 (2)
pp. 5969-5971

Feature selection is the most important step in developing any modern learning model. As the complexity of learning models increases day by day, there is an increasing demand for selecting the right features to build the model. There are many methods for feature selection. A new feature selection method based on the MANOVA statistical test is implemented. Using the MANOVA test, we select attributes from academic datasets. Using the selected attributes, we build a classification model. The accuracy of the model with feature selection is compared with that of a model using all attributes. Results are discussed. The classification model built with features selected by the MANOVA test is shown to achieve higher accuracy than a model built with all features.


Tele-marketing presents a huge challenge in identifying potential customers, and the lack of an effective marketing strategy may lead a company to succumb to problems such as prolonged marketing campaigns. Various attempts have been made to improve the performance of binary classification models for bank tele-marketing data. Previous research indicates that the neural network is the most commonly employed algorithm and is able to produce commendable results, with higher accuracy than other algorithms. Despite several attempts to improve the model through treatment of the imbalanced dataset and feature selection, this research argues that they are incomplete. Therefore, this research proposes a data pre-processing algorithm for a bank tele-marketing binary classification neural network. Three datasets (with 19, 16, and 20 features) have been employed to evaluate the effect of the algorithm on the classification model. The data pre-processing algorithm is divided into three phases: data cleaning, data imbalance treatment, and finally data normalization. The results indicate that a binary classification model complemented with data cleaning techniques such as Missing common (MC) and Tomek Links (TL) performs better than one using Ignore Missing (IM). In terms of data normalization, techniques such as MaxAbsScaler (MAS) and MinMaxScaler (MMS) consistently performed better than other normalization techniques. The classification model employed in this paper uses the data pre-processing combination MC-TL-MMS. With this approach, the algorithm recorded an area under the receiver operating characteristic curve (AUC) of 0.9129 and 0.9464 using 16 features and 20 features, respectively. This is the highest performance figure compared with previous research.
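As a rough illustration of the Tomek Links step named above, the sketch below finds pairs of samples from different classes that are each other's nearest neighbour. The abstract does not describe how the links were used, so dropping the majority-class member of each link is an assumption (it is one common variant); the data points are hypothetical.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    with different class labels."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


# Hypothetical 2-D samples; label 0 is the majority class.
points = [(0.0, 0.0), (0.0, 0.9), (1.0, 0.0), (1.2, 0.1),
          (5.0, 5.0), (5.0, 6.0), (-1.0, 0.5)]
labels = [0, 0, 0, 1, 1, 1, 0]

links = tomek_links(points, labels)

# Assumed cleaning rule: drop the majority-class member of every link.
majority = max(set(labels), key=labels.count)
to_drop = {idx for pair in links for idx in pair if labels[idx] == majority}
cleaned = [(p, y) for k, (p, y) in enumerate(zip(points, labels))
           if k not in to_drop]
```

Here the only link is between the samples at indices 2 and 3, which sit on the class boundary; removing the majority-class sample leaves a cleaner margin before normalization is applied.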


Energies
2019
Vol 13 (1)
pp. 45
Author(s):  
Eun Ji Choi
Yongseok Yoo
Bo Rang Park
Young Jae Choi
Jin Woo Moon

This study aims to propose a pose classification model using indoor occupant images. For developing the intelligent and automated model, a deep learning neural network was employed. Indoor posture images and joint coordinate data were collected and used to conduct the training and optimization of the model. The output of the trained model is the occupant pose of the sedentary activities in the indoor space. The performance of the developed model was evaluated for two different indoor environments: home and office. Using the metabolic rates corresponding to the classified poses, the model accuracy was compared with that of the conventional method, which considered the fixed activity. The result showed that the accuracy was improved by as much as 73.96% and 55.26% in home and office, respectively. Thus, the potential of the pose classification model was verified for providing a more comfortable and personalized thermal environment to the occupant.


2021
Vol 39 (15_suppl)
pp. e13560-e13560
Author(s):  
Daniel France
Paromita Nath
Sankaran Mahadevan
Jason Slagle
Rajiv Agarwal
...  

e13560 Background: A common cause of preventable harm is the failure to detect and appropriately respond to clinical deterioration. Timely intervention is needed, particularly in cancer patients, to mitigate the effects of adverse events, disease progression, and medical error. This problem requires effective clinical surveillance, early recognition, timely notification of the appropriate clinician, and effective intervention. Methods: Applying a user-centered systems engineering design approach, we designed and implemented a surveillance-and-response system to improve the detection of, and response to, clinical deterioration in cancer outpatients. The surveillance system predicts 7-day risk of UTEs, defined as clinically meaningful changes in the patient’s treatment course or cancer care pathway (e.g., any unplanned or unexpected clinic or ER visit, hospital admission, major treatment change and/or delay, and/or death). Data inputs consist of: 1) patient activity and health data collected by a Fitbit monitor; 2) geolocation data to measure activity outside the home (i.e., locations preselected at study onset); 3) clinical data from the hospital’s electronic health record; and 4) patient-reported outcome measures (PROMs; i.e., the NCCN Distress Thermometer, the Comprehensive OpeN-Ended Survey or CONES, Global Health Score, and items from the Consumer Assessment of Healthcare Providers and Systems (CAHPS)). Herein, we measured the effectiveness of Fitbit data alone in predicting UTEs in a pilot sample of patients. Dimension reduction of Fitbit variables was first carried out using Pearson correlation analysis to eliminate redundant variables. As UTEs are rare events, they were oversampled using the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset. A random forest classification model was trained to predict 7-day UTE risk. Model accuracy was determined by calculating the mean of stratified 5-fold cross-validation with 10 repeats.
Results: Fitbit data were collected over a 6-8-week period from 14 head and neck cancer patients receiving surgical resection, outpatient chemotherapy, and/or radiotherapy. We identified six UTEs in five patients. A random forest classification model was developed from 10 variables derived from 7 Fitbit measures. The following variables were averaged or summed daily: average heart rate (HR), resting HR, below 50% or zone 1 of maximum HR, zone 2 and zone 3 HR combined (i.e., 70-100% of max HR), total daily calories, steps, and sleep in minutes. We achieved a model accuracy of 94% (ROC AUC: 0.984, Precision-Recall AUC: 0.985). Conclusions: Activity and health data collected by a commercial activity monitor demonstrated effectiveness in predicting patient UTEs when an oversampling procedure was used to adjust for class imbalance (i.e., low UTE rate). Future studies are recommended to verify and validate this result in a larger patient sample.
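The oversampling step mentioned in the Methods can be sketched in a few lines. This is a simplified, standard-library illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the implementation used in the study; the function name, the parameters `k` and `seed`, and the sample values are all hypothetical.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by linear interpolation.

    Requires at least two minority samples. Each synthetic point lies on the
    segment between a randomly chosen sample and one of its k nearest
    minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical two-feature samples: 3 rare-event days versus 8 ordinary days.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0)]
majority_count = 8

# Generate enough synthetic samples to balance the two classes.
synthetic = smote(minority, n_new=majority_count - len(minority))
balanced_minority = minority + synthetic
```

Because each new point is a convex combination of two real minority samples, the synthetic data stay inside the region already occupied by the rare class.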

