imbalanced data sets
Recently Published Documents


TOTAL DOCUMENTS

199
(FIVE YEARS 36)

H-INDEX

29
(FIVE YEARS 4)

2021 ◽  
pp. 1-18
Author(s):  
Gaoteng Yuan ◽  
Yinping Dong ◽  
Xiaofeng Zhou

BACKGROUND: Gynecological diseases threaten women’s health, and vaginal microecological testing is a common method for detecting them. Efficient and accurate microecological testing methods have long been a goal of gynecologists. OBJECTIVE: To automatically identify different types of microbial images in vaginal micromorphology detection, this paper proposes a vaginal microecological image recognition method based on Gabor texture analysis combined with a long short-term memory (LSTM) network. METHOD: First, we denoise the microecological morphological images, select the region of interest, and label each microorganism according to the doctor’s annotation. Second, texture analysis is carried out on the region of interest: Gabor filters with 8 orientations and 5 scales are applied to extract texture features from the image. The features of different microbial images are then compared, and suitable features are screened to reduce the number of features. Next, we design an LSTM model to analyze the relationship of image features across different categories of microorganisms. Finally, we use a fully connected layer and a Softmax function to automatically recognize the different microbial images. RESULTS: The experimental results show that the image classification accuracy for 8 common microorganisms is 81.26%. CONCLUSION: Texture analysis combined with an LSTM network can identify different kinds of vaginal microecological images. The Gabor-LSTM model has a better classification effect on imbalanced data sets.
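To make the texture step in the abstract above more concrete, here is a minimal sketch of a Gabor filter bank with 8 orientations and 5 scales. Only those two counts come from the abstract. The kernel size, the wavelengths, the `sigma` and `gamma` values, and the helper names (`gabor_kernel`, `build_bank`, `texture_features`) are assumptions chosen for illustration, and the paper's actual filter parameters and feature screening may differ.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a Gabor filter, returned as a size x size nested list."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinate system by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def build_bank(n_orientations=8, n_scales=5, size=9):
    """Bank ordered scale-major: index = scale * n_orientations + orientation."""
    bank = []
    for s in range(n_scales):
        wavelength = 4.0 * (2 ** (s / 2))  # assumed spacing between scales
        sigma = 0.56 * wavelength          # assumed envelope width
        for k in range(n_orientations):
            theta = k * math.pi / n_orientations
            bank.append(gabor_kernel(size, wavelength, theta, sigma))
    return bank

def filter_response(patch, kernel):
    """Inner product of a patch with a kernel (filter response at the patch centre)."""
    return sum(p * k for prow, krow in zip(patch, kernel) for p, k in zip(prow, krow))

def texture_features(patch, bank):
    """One absolute response per filter: a 40-dimensional texture descriptor."""
    return [abs(filter_response(patch, kernel)) for kernel in bank]

bank = build_bank()
# Synthetic 9 x 9 patch: stripes varying along x with a period of 4 pixels.
patch = [[math.cos(2 * math.pi * x / 4.0) for x in range(-4, 5)] for _ in range(9)]
features = texture_features(patch, bank)
```

For the striped patch, the strongest response comes from the filter whose orientation and wavelength match the stripes. In a pipeline like the one described, a descriptor of this kind would be computed per region of interest before feature screening and classification.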


2021 ◽  
Vol 15 (5) ◽  
pp. 372-390
Author(s):  
Joan N. Vickers

This paper reveals new insights that come from comparing quiet eye (QE) studies within the motor accuracy and motor error paradigms. Motor accuracy is defined by the rules of the sport (e.g., hits versus misses), while motor error is defined by a behavioral measure, such as how far a ball or other object lands from the target (e.g., radial error). The QE motor accuracy paradigm treats accuracy as an independent variable and determines the QE duration during an equal (or near-equal) number of hits and misses per condition per participant, while the motor error QE paradigm combines hits and misses into one data set and determines the correlation between the QE and motor error, which is used as a proxy for accuracy. QE studies within the motor accuracy paradigm consistently find that a longer QE duration is a characteristic of skill, and/or of the interaction of skill by accuracy. In contrast, QE motor error studies do not analyze or report the relationship between the QE duration and accuracy (although this is often claimed), and rarely find a significant correlation between the QE duration and error. Evidence is provided showing that the absence of significant results in QE motor error studies stems from the low number of accurate trials in those studies, itself a consequence of the inherent complexity of all sport skills. Novices in targeting skills make fewer than 20% of their shots and experts fewer than 40% (with some exceptions), creating imbalanced data sets that make it difficult, if not impossible, to find significant QE results (or results for any other neural, perceptual or cognitive variable) related to motor accuracy in sport.


2021 ◽  
Author(s):  
Arinan De Piemonte Dourado ◽  
Felipe Viana

In this contribution, a case study of an unexpected corrosion-fatigue crack propagation issue in an aircraft fleet is used to discuss how to compensate for incomplete knowledge when integrating and extrapolating time-dependent responses. For the considered application, degradation resulting from mechanical fatigue is well understood and accounted for in the damage models. However, the unexpected corrosion effects are not accounted for in damage integration, yielding a large discrepancy between predicted and observed crack lengths. To address this epistemic uncertainty in the fleet damage accumulation model, hybrid neural network cells are formulated, in which physics-informed layers address well-understood aspects of the degradation and data-driven layers are trained to act as correction terms. The case study encompasses highly imbalanced data sets with uncertainties acting asynchronously. To improve overall accuracy, ensemble learning techniques are adapted to merge the predictions of the resulting hybrid neural network cells. Lastly, a heuristic based on optimal ensemble weights is presented to help in the decision-making task of defining safe operation of the fleet. Results show that the proposed approach was capable of compensating for the epistemic uncertainties, and that the proposed heuristic can be used to rank aircraft damage severity, making it possible to prioritize aircraft for inspection and/or route reassignment.


2021 ◽  
Author(s):  
Ming Li ◽  
Dezhi Han ◽  
Dun Li ◽  
Han Liu ◽  
Chin-Chen Chang

Network intrusion detection, which relies mainly on the extraction and analysis of network traffic features, plays a vital role in network security protection. Current network traffic feature extraction and analysis for intrusion detection mostly uses deep learning algorithms. However, deep learning currently requires a lot of training resources and handles imbalanced data sets poorly. In this paper, a deep learning model (MFVT) based on a feature fusion network and the Vision Transformer architecture is proposed, which improves the handling of imbalanced data sets and reduces the amount of sample data needed for training. In addition, to improve on traditional raw traffic feature extraction methods, a new raw traffic feature extraction method (CRP) is proposed; CRP uses the PCA algorithm to reduce all the processed digital traffic features to a specified dimension. On the IDS 2017 and IDS 2012 datasets, ablation experiments show that the proposed MFVT model performs significantly better than other network intrusion detection models, and that its detection accuracy reaches the state-of-the-art level. When the MFVT model is combined with the CRP algorithm, the detection accuracy is further improved to 99.99%.
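The abstract above mentions reducing traffic features to a specified dimension with PCA. The following is a minimal pure-Python sketch of that generic idea, using power iteration with deflation on the covariance matrix. It is not the CRP method itself, whose preprocessing is not described here. The function name `pca_reduce`, the iteration count, and the toy feature rows are assumptions made for illustration.

```python
import math
import random

def pca_reduce(rows, n_components, n_iter=200, seed=0):
    """Project rows onto their top principal components (power iteration + deflation)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflate: remove the found direction so the next pass finds the next one.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centred]

# Toy "traffic feature" rows (hypothetical values): 3 features reduced to 2.
rows = [[float(i), 2.0 * i + (i % 3), float(i % 2)] for i in range(20)]
reduced = pca_reduce(rows, n_components=2)
```

In practice a library routine based on singular value decomposition would usually be preferred; the explicit loop is only meant to show what "reduce to the specified dimension" amounts to.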


Author(s):  
Giovanni Ceccarelli ◽  
Guido Cantelmo ◽  
Marialisa Nigro ◽  
Constantinos Antoniou

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Dominic Simm ◽  
Klas Hatje ◽  
Stephan Waack ◽  
Martin Kollmar

Coiled-coil regions were among the first protein motifs described structurally and theoretically. The simplicity of the motif promises that coiled-coil regions can be detected with reasonable accuracy and precision in any protein sequence. Here, we re-evaluated the most commonly used coiled-coil prediction tools with respect to the most comprehensive reference data set available, the entire Protein Data Bank, down to each amino acid and its secondary structure. Apart from the 30-fold difference between the minimum and maximum number of coiled coils predicted, the tools vary strongly in where they predict coiled-coil regions. Accordingly, there is a high number of false predictions and of missed true coiled-coil regions. The evaluation of the binary classification metrics in comparison with naïve coin-flip models, and the calculation of the Matthews correlation coefficient, the most reliable performance metric for imbalanced data sets, suggest that the tested tools’ performance is close to random. This implies that the tools’ predictions have only limited informative value. Coiled-coil predictions are often used to interpret biochemical data and are part of in-silico functional genome annotation. Our results indicate that these predictions should be treated very cautiously and need to be supported and validated by experimental evidence.
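The abstract above relies on the Matthews correlation coefficient (MCC). A short sketch may help show why accuracy alone can mislead on imbalanced data, using the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The confusion-matrix counts below are made up for illustration and are not taken from the paper.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical case: 1000 residues, only 50 of them in the positive class.
# Predictor A labels everything negative.
acc_a = accuracy(tp=0, tn=950, fp=0, fn=50)
mcc_a = mcc(tp=0, tn=950, fp=0, fn=50)

# Predictor B finds 10 of the 50 positives and raises 40 false alarms.
acc_b = accuracy(tp=10, tn=910, fp=40, fn=40)
mcc_b = mcc(tp=10, tn=910, fp=40, fn=40)
```

Both hypothetical predictors score above 90% accuracy, yet predictor A has an MCC of 0 and predictor B an MCC of roughly 0.16, which is much closer to chance than the accuracy figures suggest.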


Author(s):  
Ghulam Fatima ◽  
Sana Saeed

In the data mining community, data sets with imbalanced class distributions have received growing attention. The evolving field of data mining and knowledge discovery seeks to establish precise and effective computational tools for analyzing such data sets and extracting new knowledge from the data. Sampling methods re-balance imbalanced data sets and consequently improve the performance of classifiers. In the classification of imbalanced data sets, over-fitting and under-fitting are the two most prominent problems. In this study, a novel weighted ensemble method is proposed to reduce the influence of over-fitting and under-fitting when classifying these kinds of data sets. Forty imbalanced data sets with varying imbalance ratios are used to conduct a comparative study. The performance of the proposed method is compared with that of four conventional classifiers: decision tree (DT), k-nearest neighbor (KNN), support vector machine (SVM), and neural network (NN). The evaluation is carried out with two over-sampling procedures, the adaptive synthetic sampling approach (ADASYN) and the synthetic minority over-sampling technique (SMOTE). The proposed scheme proved effective in reducing the impact of over-fitting and under-fitting on the classification of these data sets.
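As a rough illustration of the over-sampling idea behind SMOTE mentioned in the abstract above, the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified version for illustration, not the reference implementation and not the paper's weighted ensemble. The function name `smote`, the toy points, and the parameter values are assumptions.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class with four 2-D points (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class. ADASYN differs mainly in generating more synthetic points near minority samples that are harder to classify, which is not shown here.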


2021 ◽  
Vol 11 (11) ◽  
pp. 4970
Author(s):  
Łukasz Rybak ◽  
Janusz Dudczyk

The history of gravitational classification started in 1977. Over the years, gravitational approaches have been extended many times and adapted to different classification problems. This article is the next stage of research on algorithms that create data particles by geometrical divide. Previous analyses established that the Geometrical Divide (GD) method outperforms the algorithm that creates data particles based on classes by a compound of 1 ÷ 1 cardinality. This occurs in the classification of balanced data sets in which class centroids are close to each other and the groups of objects described by different labels overlap. The purpose of the article was to examine the efficiency of the Geometrical Divide method in the classification of unbalanced data sets, using the example of a real case: occupancy detection. In addition, the paper develops the concept of the Unequal Geometrical Divide (UGD). The approaches were evaluated on 26 unbalanced data sets: 16 with the features of the Moons and Circles data sets and 10 created from a real occupancy data set. In the experiment, the GD method and its unbalanced variant (UGD), as well as the 1CT1P approach, were compared. Each method was combined with three data particle mass determination algorithms: the n-Mass Model (n-MM), the Stochastic Learning Algorithm (SLA) and the Batch-update Algorithm (BLA). k-fold cross-validation, precision, recall, F-measure, and the number of data particles used were applied in the evaluation process. The results showed that the methods based on geometrical divide outperform the 1CT1P approach in the classification of imbalanced data sets. The article’s conclusion describes the observations and indicates potential directions for further research and development of methods that create data particles through geometrical divide.
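The evaluation measures named in the abstract above (k-fold cross-validation, precision, recall, F-measure) can be sketched briefly. The example below is a generic illustration with made-up labels, not the paper's experimental protocol. The stratified fold assignment is an assumption added here because plain k-fold splits can leave some folds without minority samples.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F-measure (F1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def stratified_folds(labels, k):
    """Split sample indices into k folds, dealing each class out round-robin."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical imbalanced labels: 10 positives and 90 negatives.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]

# Metrics on a small made-up prediction vector.
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With this split every fold keeps the 1:9 class ratio of the full label vector, so each held-out fold contains minority samples on which precision and recall can be computed.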

