unbalanced dataset
Recently Published Documents


TOTAL DOCUMENTS: 46 (FIVE YEARS 36)

H-INDEX: 5 (FIVE YEARS 2)

2022 ◽  
Vol 2022 ◽  
pp. 1-15
Author(s):  
Lingfei Mo ◽  
Hongjie Yu ◽  
Wenqi Hua

Human physical activity identification based on wearable sensors is of great significance to human health analysis. A large number of machine learning models have been applied to human physical activity identification and have achieved remarkable results. However, most of these models can only be trained on labeled data, and it is difficult to obtain enough labeled data, which leads to weak generalization ability. This paper proposes a Pruning Growing SOM model to address the limitations of small-scale labeled datasets. The model is unsupervised in the training stage, and only a small amount of labeled data is then used to label neurons, which reduces the dependency on labeled data. In the training stage, a pruning mechanism deletes the inactive neurons in the network, which makes the model more consistent with the data distribution and improves the identification accuracy even on an unbalanced dataset, especially for the action categories that are otherwise poorly identified. In addition, the pruning mechanism speeds up inference by controlling the scale of the model.
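
The pruning and neuron-labeling steps described in this abstract can be illustrated roughly as follows. This is a minimal sketch, not the authors' Pruning Growing SOM: the growing step and the SOM weight updates are omitted, and the function names, the `min_hits` threshold, and the toy data are assumptions made for illustration.

```python
import math

def best_matching_unit(weights, x):
    """Return the index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))

def prune_inactive(weights, samples, min_hits=1):
    """Drop neurons that win fewer than min_hits samples (hypothetical rule)."""
    hits = [0] * len(weights)
    for x in samples:
        hits[best_matching_unit(weights, x)] += 1
    return [w for w, h in zip(weights, hits) if h >= min_hits]

def label_neurons(weights, labeled):
    """Give each neuron the majority label of the labeled samples it wins."""
    votes = [{} for _ in weights]
    for x, y in labeled:
        v = votes[best_matching_unit(weights, x)]
        v[y] = v.get(y, 0) + 1
    return [max(v, key=v.get) if v else None for v in votes]

# Toy data: the third neuron lies far from every sample, so it never wins.
weights = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
samples = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.1), (1.2, 0.8)]
pruned = prune_inactive(weights, samples)

# Only two labeled samples are needed to label the remaining neurons.
labels = label_neurons(pruned, [((0.0, 0.1), "sit"), ((1.0, 0.9), "walk")])
```

Because inference only searches the surviving neurons, pruning also shrinks the best-matching-unit search, which is the speed-up the abstract mentions.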


2021 ◽  
Vol 7 (12) ◽  
pp. 276
Author(s):  
Antonio Galli ◽  
Stefano Marrone ◽  
Gabriele Piantadosi ◽  
Mario Sansone ◽  
Carlo Sansone

The recent spread of Deep Learning (DL) in medical imaging is pushing researchers to explore its suitability for lesion segmentation in Dynamic Contrast-Enhanced Magnetic-Resonance Imaging (DCE-MRI), a complementary imaging procedure increasingly used in breast-cancer analysis. Despite some promising proposed solutions, we argue that a “naive” use of DL may have limited effectiveness, as the presence of a contrast agent results in the acquisition of multimodal 4D images that require thorough processing before training a DL model. We thus propose a pipelined approach in which each stage is intended to handle or exploit a specific characteristic of breast DCE-MRI data: breast-masking pre-processing to remove non-breast tissues; Three-Time-Points (3TP) slices to effectively highlight the contrast agent time course; a motion-correction technique to deal with involuntary patient movements; a modified U-Net architecture tailored to the problem; and a new “Eras/Epochs” training strategy to handle the unbalanced dataset while performing strong data augmentation. We compared our pipelined solution against several works from the literature. The results show that our approach outperforms the competitors by a large margin (+9.13% over our previous solution) while also showing a higher generalization ability.


Author(s):  
Chaolun Ma ◽  
Yongxin Peng ◽  
Lingtao Wu ◽  
Xiaoyu Guo ◽  
Xiubin Wang ◽  
...  

Distraction occurs when a driver’s attention is diverted from driving to a secondary task. The number of distraction-affected crashes has been increasing in recent years. Accurately predicting distraction-affected crashes is critical for roadway agencies to reduce distracted driving behaviors and distraction-affected crashes. Recently, more and more emerging phone-use data and machine learning techniques are available to safety researchers, and can potentially improve the prediction of distraction-affected crashes. Therefore, this study first examines if phone-use events provide essential information for distraction-affected crashes. The authors apply the machine learning technique (i.e., XGBoost) under two scenarios, with and without phone-use events, and compare their performances with two conventional statistical models: logistic regression model and mixed-effects logistic regression model. The comparison demonstrates the superiority of XGBoost over logistic regression with a high-dimensional unbalanced dataset. Further, this study implements SHAP (SHapley Additive exPlanation) to interpret the results and analyze the importance of individual features related to distraction-affected crashes and tests its ability to improve prediction accuracy. The trained XGBoost model achieves a sensitivity of 91.59%, a specificity of 85.92%, and 88.72% accuracy. The XGBoost and SHAP results suggest that: (1) phone-use information is an important factor associated with the occurrences of distraction-affected crashes; (2) distraction-affected crashes are more likely to occur on roadway segments with higher exposure (i.e., length and traffic volume), unevenness of traffic flow condition, or with medium truck volume.
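
The sensitivity, specificity, and accuracy figures quoted above are all derived from the same confusion matrix, which matters on an unbalanced dataset because accuracy alone can hide poor performance on the rare class. The sketch below shows the standard definitions on invented labels; it does not reproduce the study's data or its XGBoost model, and it assumes both classes are present so that no denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # crashes correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # crashes missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-crashes correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return {
        "sensitivity": tp / (tp + fn),   # recall on the positive class
        "specificity": tn / (tn + fp),   # recall on the negative class
        "accuracy": (tp + tn) / len(pairs),
    }

# Invented example: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```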


Author(s):  
Arman Ghavidel ◽  
Rouzbeh Ghousi ◽  
Alireza Atashi

Owing to major advances in health care and biomedicine, hospitals now record a tremendous amount of data. At the same time, the most effective way to reduce disease mortality is to diagnose the disease as early as possible. Data mining with machine learning therefore offers a good opportunity to examine the hidden patterns in these records. An accurate forecast of mortality after heart surgery supports successful medical treatment and lowers costs. This research proposes a new stacking predictive model that, after applying the random forest feature importance method, uses the most practical features to predict mortality after heart surgery on a highly unbalanced dataset. To address the unbalanced data problem, a combination of the SVM-SMOTE over-sampling algorithm and the Edited-Nearest-Neighbor under-sampling algorithm is used. The introduced model is compared with other machine learning classifiers through shuffle hold-out and 10-fold cross-validation strategies. Under both strategies, the proposed model had the highest efficiency among the compared models. Furthermore, the Friedman statistical test is applied to examine the differences between models. The result indicates that the introduced stacking model achieved the most accurate predictive performance after Logistic Regression.
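
The idea of over-sampling the minority class and then cleaning the result with Edited Nearest Neighbours can be sketched as below. This is a simplification under stated assumptions: it uses plain SMOTE-style interpolation between minority neighbours rather than SVM-SMOTE, it applies the ENN rule to every class, and the function names, `k` values, and toy points are invented for illustration. It also assumes the minority class has at least two points.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def edited_nearest_neighbours(points, labels, k=3):
    """Keep a point only if most of its k nearest neighbours share its label."""
    keep = []
    for i, x in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: math.dist(x, points[j]))[:k]
        agree = sum(1 for j in nearest if labels[j] == labels[i])
        if agree * 2 > len(nearest):
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]

# Toy data: the last majority point sits inside the minority cluster.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.0, 0.4), (2.1, 2.0)]
minority = [(2.0, 2.0), (2.2, 2.1), (1.9, 2.3)]

synthetic = smote_like(minority, n_new=3)
points = majority + minority + synthetic
labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
clean_points, clean_labels = edited_nearest_neighbours(points, labels)
```

In this toy run the over-sampling step grows the minority class from three to six points, and the cleaning step removes the majority point that overlaps the minority cluster.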


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5821
Author(s):  
Aleksandr Lapušinskij ◽  
Ivan Suzdalev ◽  
Nikolaj Goranin ◽  
Justinas Janulevičius ◽  
Simona Ramanauskaitė ◽  
...  

Increasing the flying time of unmanned aerial vehicles (UAV) is a relevant and difficult task for UAV designers. It is especially important in tasks such as monitoring, mapping, or signal retranslation. While the majority of research concentrates on increasing the battery capacity, it is also important to utilize natural renewable energy sources, such as solar energy, thermals, etc. This article proposes a method for the automatic recognition of cumuliform clouds. In practice, the method allows a UAV to be diverted towards an identified cumuliform cloud, improving its probability of flying into a thermal flow and thus increasing its flight time, as glider and paraglider pilots do. The proposed method is based on the Hough transform and the Canny edge detector, which have not been used for such a task before. To test the proposed method, a dataset of different clouds was generated and marked by experts. The achieved average accuracy of 87% on the unbalanced dataset demonstrates the practical applicability of the proposed method for detecting thermals related to cumuliform clouds. The article also presents the concept of a UAV developed at VilniusTech that implements the proposed method.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Guoliang Yao ◽  
Xiaobo Mao ◽  
Nan Li ◽  
Huaxing Xu ◽  
Xiangyang Xu ◽  
...  

The diagnosis of electrocardiograms (ECG) is extremely onerous and inefficient, so computer-aided diagnosis of ECG signals is necessary. However, designing high-accuracy ECG algorithms suitable for the medical field is still a challenging problem. In this paper, a method for classifying ECG signals is proposed. Firstly, wavelet transform is used to denoise the original data, and data enhancement technology is used to overcome the problem of an unbalanced dataset. Secondly, an integrated convolutional neural network (CNN) and gated recurrent unit (GRU) classifier is proposed. The proposed network consists of a convolution layer, followed by 6 local feature extraction modules (LFEM), a GRU, a Dense layer, and a Softmax layer. Finally, the processed data were fed into the CNN-GRU network and classified into five categories: nonectopic beats, supraventricular ectopic beats, ventricular ectopic beats, fusion beats, and unknown beats. The MIT-BIH arrhythmia database was used to evaluate the approach, and the average sensitivity, accuracy, and F1-score of the network for the 5 types of ECG were 99.33%, 99.61%, and 99.42%, respectively. The evaluation criteria of the proposed method are superior to those of other state-of-the-art methods, and the model can be applied to wearable devices to achieve high-precision monitoring of ECG.


Sensors ◽  
2021 ◽  
Vol 21 (16) ◽  
pp. 5494
Author(s):  
Chen Zhao ◽  
Jianliang Sun ◽  
Shuilin Lin ◽  
Yan Peng

Rolling mill multi-row bearings are subjected to axial loads, which damage the rolling elements and cages, so the axial vibration signal contains rich fault characteristic information. The vertical shock caused by a failure is weakened because multiple rows of bearings share the radial forces. Considering the special characteristics of rolling mill bearing vibration signals, a fault diagnosis method combining Adaptive Multivariate Variational Mode Decomposition (AMVMD) and a Multi-channel One-dimensional Convolution Neural Network (MC1DCNN) is proposed to improve the diagnosis accuracy. Additionally, a Deep Convolutional Generative Adversarial Network (DCGAN) is embedded in the models to solve the problem of fault data scarcity. DCGAN is used to generate AMVMD reconstruction data to supplement the unbalanced dataset, and the MC1DCNN model is trained on this dataset to diagnose the real data. The proposed method is compared with a variety of diagnostic models, and the experimental results show that it can effectively improve the diagnosis accuracy for rolling mill multi-row bearings under unbalanced dataset conditions. It offers useful guidance for the current problems of insufficient data and low diagnosis accuracy in the fault diagnosis of multi-row bearings such as those in rolling mills.


2021 ◽  
Author(s):  
Alessia Auriemma Citarella ◽  
Luigi Di Biasi ◽  
Michele Risi ◽  
Genoveffa Tortora

Background: SNARE proteins play an important role in different biological functions. This study aims to investigate the contribution of a new class of molecular descriptors (called SNARER) related to the chemical-physical properties of proteins in order to evaluate the performance of binary classifiers for SNARE proteins. Results: We constructed a balanced dataset of SNARE proteins, D128, and an unbalanced one, DUNI, on which we tested and compared the performance of the new descriptors presented here in combination with the feature sets (GAAC, CTDT, CKSAAP and 188D) already present in the literature. The machine learning algorithms used were Random Forest, k-Nearest Neighbors and AdaBoost, and oversampling and subsampling techniques were applied to the unbalanced dataset. The addition of the SNARER descriptors increases the precision for all considered ML algorithms. In particular, on the unbalanced DUNI dataset the accuracy increases in parallel with the increase in sensitivity, while on the balanced dataset D128 the accuracy increases compared to the counterpart without the SNARER descriptors, with a strong improvement in specificity. Our best result is the combination of our SNARER descriptors with the CKSAAP feature on the dataset D128, with 92.3% accuracy, 90.1% sensitivity and 95% specificity using the RF algorithm. Conclusions: The analysis has shown how the introduction of molecular descriptors linked to the chemical-physical and structural characteristics of the proteins can improve the classification performance. Additionally, performance can change depending on whether a balanced or an unbalanced dataset is used. The balanced nature of training can significantly improve forecast accuracy.
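
Random oversampling and subsampling, the two resampling families mentioned in this abstract, can be sketched in a few lines. The abstract does not say which specific variants were applied to DUNI, so the version below is an assumption: it duplicates items of small classes with replacement and draws from large classes without replacement until every class has `target` items. The function name, the class names, and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_resample(samples, labels, target, seed=0):
    """Resample every class to exactly `target` items."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        if len(items) >= target:
            # Subsampling: draw without replacement from a large class.
            chosen = rng.sample(items, target)
        else:
            # Oversampling: keep all items and add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([y] * target)
    return out_samples, out_labels

# Toy unbalanced set: 2 positives against 10 negatives.
samples = list(range(12))
labels = ["snare"] * 2 + ["other"] * 10
balanced_samples, balanced_labels = random_resample(samples, labels, target=5)
class_counts = Counter(balanced_labels)
```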


Risks ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 114
Author(s):  
Paritosh Navinchandra Jha ◽  
Marco Cucculelli

The paper introduces a novel approach to ensemble modeling as a weighted model average technique. The proposed idea is prudent, simple to understand, and easier to implement than the Bayesian and frequentist approaches. The paper provides both theoretical and empirical contributions for assessing credit risk (probability of default) effectively in a new way by creating an ensemble model as a weighted linear combination of machine learning models. The idea is not limited to an unbalanced dataset or credit risk assessment and can be generalized to classification problems in other domains where ensemble-type modeling is of interest. The results suggest a better forecasting performance compared to the single best well-known machine learning model among parametric, non-parametric, and other ensemble models. As a future research direction, the approach can be extended by estimating the weights differently, which may further enhance the performance of the model average.
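
A weighted linear combination of model outputs, as described above, reduces to a few lines of arithmetic. The sketch below assumes three hypothetical models that each return a probability of default per borrower; the weights are invented for illustration, since the abstract does not state how the paper estimates them.

```python
def weighted_average(probabilities, weights):
    """Combine per-model probabilities with weights normalized to sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]
    n_samples = len(probabilities[0])
    return [
        sum(w * model[i] for w, model in zip(norm, probabilities))
        for i in range(n_samples)
    ]

# Rows are models, columns are borrowers (invented numbers).
model_probs = [
    [0.10, 0.80, 0.40],
    [0.20, 0.60, 0.70],
    [0.05, 0.90, 0.55],
]
weights = [0.5, 0.3, 0.2]  # hypothetical model weights
ensemble = weighted_average(model_probs, weights)
```

Normalizing the weights keeps the combined value inside the range spanned by the individual model probabilities, so it can still be read as a probability of default.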

