Multi-Stage Harmonization for Robust AI across Breast MR Databases

Cancers ◽  
2021 ◽  
Vol 13 (19) ◽  
pp. 4809
Author(s):  
Heather M. Whitney ◽  
Hui Li ◽  
Yu Ji ◽  
Peifang Liu ◽  
Maryellen L. Giger

Radiomic features extracted from medical images may demonstrate a batch effect when cases come from different sources. We investigated classification performance using training and independent test sets drawn from two sources using both pre-harmonization and post-harmonization features. In this retrospective study, a database of thirty-two radiomic features, extracted from DCE-MR images of breast lesions after fuzzy c-means segmentation, was collected. There were 944 unique lesions in Database A (208 benign lesions, 736 cancers) and 1986 unique lesions in Database B (481 benign lesions, 1505 cancers). The lesions from each database were divided by year of image acquisition into training and independent test sets, separately by database and in combination. ComBat batch harmonization was conducted on the combined training set to minimize the batch effect on eligible features by database. The empirical Bayes estimates from the feature harmonization were applied to the eligible features of the combined independent test set. The training sets (A, B, and combined) were then used in training linear discriminant analysis classifiers after stepwise feature selection. The classifiers were then run on the A, B, and combined independent test sets. Classification performance was compared using pre-harmonization features to post-harmonization features, including their corresponding feature selection, evaluated using the area under the receiver operating characteristic curve (AUC) as the figure of merit. Four out of five training and independent test scenarios demonstrated statistically equivalent classification performance when compared pre- and post-harmonization. These results demonstrate that translation of machine learning techniques with batch data harmonization can potentially yield generalizable models that maintain classification performance.
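The key procedural point in the abstract above is that harmonization parameters are estimated on the combined training set only and then applied unchanged to the independent test set. The following is a minimal sketch of that idea using a plain per-batch location-scale adjustment for a single feature. It is not ComBat itself: ComBat additionally shrinks the batch estimates with empirical Bayes, which is omitted here. The function names `fit_batch_params` and `harmonize` are hypothetical, and every batch is assumed to have non-zero spread in the training data.

```python
from statistics import mean, pstdev


def fit_batch_params(values_by_batch):
    """Estimate pooled and per-batch mean/std for one feature (training data only)."""
    pooled = [v for values in values_by_batch.values() for v in values]
    return {
        "pooled": (mean(pooled), pstdev(pooled)),
        "batches": {
            batch: (mean(values), pstdev(values))
            for batch, values in values_by_batch.items()
        },
    }


def harmonize(value, batch, params):
    """Map a value from its batch's scale onto the pooled scale.

    Assumes the batch standard deviation is non-zero.
    """
    pooled_mean, pooled_sd = params["pooled"]
    batch_mean, batch_sd = params["batches"][batch]
    return (value - batch_mean) / batch_sd * pooled_sd + pooled_mean


# Hypothetical training values of one feature from two databases.
train = {"A": [1.0, 2.0, 3.0], "B": [11.0, 12.0, 13.0]}
params = fit_batch_params(train)

# Test-set values reuse the training-set estimates; nothing is re-fitted.
test_value_from_b = harmonize(12.5, "B", params)
```

Fitting on the training set alone keeps the test set independent: no statistic computed from test cases influences the transformation applied to them.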

2022 ◽  
Vol 9 (1) ◽  
Author(s):  
Joffrey L. Leevy ◽  
John Hancock ◽  
Taghi M. Khoshgoftaar ◽  
Jared M. Peterson

Recent years have seen a proliferation of Internet of Things (IoT) devices and an associated security risk from an increasing volume of malicious traffic worldwide. For this reason, datasets such as Bot-IoT were created to train machine learning classifiers to identify attack traffic in IoT networks. In this study, we build predictive models with Bot-IoT to detect attacks represented by dataset instances from the Information Theft category, as well as dataset instances from the data exfiltration and keylogging subcategories. Our contribution is centered on the evaluation of ensemble feature selection techniques (FSTs) on classification performance for these specific attack instances. A group or ensemble of FSTs will often perform better than the best individual technique. The classifiers that we use are a diverse set of four ensemble learners (LightGBM, CatBoost, XGBoost, and random forest (RF)) and four non-ensemble learners (logistic regression (LR), decision tree (DT), Naive Bayes (NB), and a multi-layer perceptron (MLP)). The metrics used for evaluating classification performance are area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPRC). For the most part, we determined that our ensemble FSTs do not affect classification performance but are beneficial because feature reduction eases computational burden and provides insight through improved data visualization.
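AUC, one of the two metrics named above, can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition, counting ties as one half. It is an illustration of the metric only, not the evaluation code used in the study, and the function name `auc` is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    `labels` holds 1 for the positive class and 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ranked correctly.
example = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```

The pairwise loop is quadratic in the number of instances, so for a dataset the size of Bot-IoT a rank-based implementation from an established library would be the practical choice.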


Author(s):  
CHANDRALEKHA MOHAN ◽  
SHENBAGAVADIVU NAGARAJAN

Researchers build and train models to classify the presence or absence of a disease, and the accuracy of such classification models is continuously being improved. The process of building and training a model depends on the medical data used. Various machine learning techniques and tools are used to handle different data with respect to disease types and their clinical conditions. Classification is the most widely used technique for disease diagnosis, and the accuracy of the classifier largely depends on the attributes. The choice of attributes strongly affects the diagnosis and the performance of the classifier. Despite the growing volume of medical data across different clinical conditions, methods for choosing relevant attributes and features in datasets that target specific diseases are still lacking. This study uses ensemble-based feature selection with random trees and a wrapper method to improve classification. The proposed ensemble learning classification method derives a feature subset using the wrapper method, bagging, and random trees. The proposed method removes irrelevant features and selects the optimal features for classification through probability weighting criteria. The improved algorithm can distinguish relevant from irrelevant features and improve classification performance. The proposed feature selection method is evaluated using SVM, RF, and NB evaluators, and its performance is compared against the FSNBb, FSSVMb, GASVMb, GANBb, and GARFb methods. The proposed method achieves a mean classification accuracy of 92% and outperforms the other ensemble methods.


2017 ◽  
Vol 2017 ◽  
pp. 1-11 ◽  
Author(s):  
Yunfeng Wu ◽  
Pinnan Chen ◽  
Yuchen Yao ◽  
Xiaoquan Ye ◽  
Yugui Xiao ◽  
...  

Analysis of quantified voice patterns is useful in the detection and assessment of dysphonia and related phonation disorders. In this paper, we first study the linear correlations between 22 voice parameters of fundamental frequency variability, amplitude variations, and nonlinear measures. The highly correlated vocal parameters are combined by using the linear discriminant analysis method. Based on the probability density functions estimated by the Parzen-window technique, we propose an interclass probability risk (ICPR) method to select the vocal parameters with small ICPR values as dominant features and compare with the modified Kullback-Leibler divergence (MKLD) feature selection approach. The experimental results show that the generalized logistic regression analysis (GLRA), support vector machine (SVM), and Bagging ensemble algorithm input with the ICPR features can provide better classification results than the same classifiers with the MKLD selected features. The SVM is much better at distinguishing normal vocal patterns with a specificity of 0.8542. Among the three classification methods, the Bagging ensemble algorithm with ICPR features can identify 90.77% vocal patterns, with the highest sensitivity of 0.9796 and largest area value of 0.9558 under the receiver operating characteristic curve. The classification results demonstrate the effectiveness of our feature selection and pattern analysis methods for dysphonic voice detection and measurement.
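The Parzen-window technique mentioned above estimates a probability density by centering a kernel on every sample and averaging. The sketch below shows a one-dimensional version with a Gaussian kernel. The kernel choice and the bandwidth `h` are assumptions for illustration, since the abstract does not state them, and the ICPR criterion itself is not reproduced because it is not defined there. The function name `parzen_density` is hypothetical.

```python
import math


def parzen_density(x, samples, h):
    """Parzen-window density estimate at x with a Gaussian kernel of bandwidth h."""
    if h <= 0 or not samples:
        raise ValueError("need a positive bandwidth and at least one sample")
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


# Hypothetical values of one vocal parameter for a single class.
class_samples = [0.8, 1.0, 1.1, 1.3]
density_at_one = parzen_density(1.0, class_samples, h=0.2)
```

Estimating one such density per class for a given vocal parameter gives the class-conditional curves that a selection criterion can then compare.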


2020 ◽  
Vol 16 (2) ◽  
pp. 155014772090523
Author(s):  
ZhenLong Li ◽  
HaoXin Wang ◽  
YaoWei Zhang ◽  
XiaoHua Zhao

A method for drunk driving detection using Feature Selection based on the Random Forest was proposed. First, driving behavior data were collected using a driving simulator at Beijing University of Technology. Second, the features were selected according to the Feature Importance in the random forest. Third, a dummy variable was introduced to encode the geometric characteristics of different roads so that drunk driving under different road conditions can be detected with the same classifier based on the random forest. Finally, the linear discriminant analysis, support vector machine, and AdaBoost classifiers were used and compared with the random forest. The accuracy, F1 score, receiver operating characteristic curve, and area under the curve value were used to evaluate the performance of the classifiers. The results show that Accelerator Depth, Speed, Distance to the Center of the Lane, Acceleration, Engine Revolution, Brake Depth, and Steering Angle have important influences on identifying the drivers’ states and can be used to detect drunk driving. Specifically, the classifiers with Accelerator Depth outperformed the other classifiers without Accelerator Depth. This means that Accelerator Depth is an important feature. Both the AdaBoost and random forest classifiers have an accuracy of 81.48%, which verified the effectiveness of the proposed method.
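The third step above, introducing a dummy variable for road geometry so that one classifier covers all road conditions, can be sketched as follows. The road categories, the feature values, and the function name `dummy_encode` are hypothetical; the abstract does not list the actual encoding used.

```python
def dummy_encode(category, categories):
    """Return a 0/1 indicator list marking which category applies."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if category == c else 0 for c in categories]


# Hypothetical road geometries; the study's real categories are not given.
ROAD_TYPES = ["straight", "curve"]

# Hypothetical driving features: accelerator depth, speed, steering angle.
driving_features = [0.42, 63.0, -1.5]

# Appending the indicators lets a single model see the road condition.
row = driving_features + dummy_encode("curve", ROAD_TYPES)
```

With the indicators appended, samples from every road type can be pooled into one training set instead of fitting a separate classifier per road.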


Sensors ◽  
2020 ◽  
Vol 20 (22) ◽  
pp. 6572
Author(s):  
Huan Lu ◽  
Guangjie Yuan ◽  
Jin Zhang ◽  
Guangyuan Liu

Love at first sight is a well-known and interesting phenomenon, and denotes the strong attraction to a person of the opposite sex when first meeting. As far as we know, there are no studies on the changes in physiological signals between the opposite sexes when this phenomenon occurs. Although privacy is involved, knowing how attractive a partner is may be beneficial to building a future relationship in an open society where both men and women accept each other. Therefore, this study adopts the photoplethysmography (PPG) signal acquisition method (already applied in wearable devices) to collect signals that are beneficial for utilizing the results of the analysis. In particular, this study proposes a love pulse signal recognition algorithm based on a PPG signal. First, given the high correlation between the impulse signals of love at first sight and those for physical attractiveness, photos of people with different levels of attractiveness are used to induce real emotions. Then, the PPG signal is analyzed in the time, frequency, and nonlinear domains, respectively, in order to extract its physiological characteristics. Finally, we propose the use of a variety of machine learning techniques (support vector machine (SVM), random forest (RF), linear discriminant analysis (LDA), and extreme gradient boosting (XGBoost)) for identifying the impulsive states of love, with or without feature selection. The results show that the XGBoost classifier has the highest classification accuracy (71.09%) when using feature selection.


Agriculture ◽  
2020 ◽  
Vol 10 (10) ◽  
pp. 465
Author(s):  
Shiuan Wan ◽  
Yi-Ping Wang

The analysis, measurement, and computation of remote sensing images often require enhanced unsupervised/supervised classification approaches. The goal of this study is to have a better understanding of (a) the classification performance of multispectral image and hyperspectral image data; (b) the classification performance of unsupervised and supervised models; and (c) the classification performance of feature selection among different models. More specifically, multispectral and hyperspectral images with high spatial resolution are well accepted for improving land use and classification. Hence, this study used multispectral images (WorldView-2) and hyperspectral images (CASI-1500) and focused on the classifiers K-means, density-based spatial clustering of applications with noise (DBSCAN), linear discriminant analysis (LDA), and back-propagation neural network (BPN). Then the effect of feature selection (principal component analysis, PCA) on the four classifiers is studied. The results show that the classification accuracy for CASI-1500 imagery is slightly better than that for WorldView-2. BPN gives the best overall classification, with a κ value of 0.89 and an overall accuracy of 97%. DBSCAN yields a thematic map with good accuracy and integrity. DBSCAN can attain an accuracy of about 88% and save 95.1% of computational time.
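The κ value reported above is Cohen's kappa, which corrects the observed agreement between predicted and reference classes for the agreement expected by chance. The sketch below computes it from a confusion matrix. The matrix entries are made-up numbers for illustration and are not taken from the study, and the function name `cohen_kappa` is hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: reference, columns: predicted)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / total

    # Chance agreement: product of matching row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)

    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1.0 - expected)


# Made-up two-class example: observed agreement 0.7, chance agreement 0.5.
kappa = cohen_kappa([[20, 5], [10, 15]])  # 0.4
```

Because kappa discounts chance agreement, it can sit well below the overall accuracy, which is why the abstract reports both figures.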


Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2649 ◽  
Author(s):  
Amin Ul Haq ◽  
Jian Ping Li ◽  
Jalaluddin Khan ◽  
Muhammad Hammad Memon ◽  
Shah Nazir ◽  
...  

Significant attention has been paid to the accurate detection of diabetes. It is a major challenge for the research community to develop a diagnosis system that detects diabetes successfully in the e-healthcare environment. Machine learning techniques have an emerging role in healthcare services by delivering systems that analyze medical data for the diagnosis of diseases. Existing diagnosis systems have drawbacks such as high computation time and low prediction accuracy. To handle these issues, we propose a diagnosis system using machine learning methods for the detection of diabetes. The proposed method has been tested on a diabetes data set, a clinical dataset compiled from patients' clinical histories. Further, model validation methods (hold out, K-fold, and leave one subject out) and performance evaluation metrics (accuracy, specificity, sensitivity, F1-score, receiver operating characteristic curve, and execution time) have been used to check the validity of the proposed system. We propose a filter method based on the Decision Tree (Iterative Dichotomiser 3) algorithm for selecting highly important features. Two ensemble learning algorithms, Ada Boost and Random Forest, are also used for feature selection, and we also compared the classifier performance with wrapper-based feature selection algorithms. A Decision Tree classifier has been used for the classification of healthy and diabetic subjects. The experimental results show that the features selected by the proposed feature selection algorithm improve the classification performance of the predictive model and achieve optimal accuracy. Additionally, the performance of the proposed system is high compared to previous state-of-the-art methods. The high performance of the proposed method is due to the different combinations of selected feature sets; Plasma glucose concentrations, Diabetes pedigree function, and Blood mass index are the more significant features in the dataset for the prediction of diabetes. Furthermore, statistical analysis of the experimental results demonstrated that the proposed method would effectively detect diabetes and can be deployed in an e-healthcare environment.
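Several of the evaluation metrics listed above (accuracy, sensitivity, specificity, and F1-score) follow directly from the four cells of a binary confusion matrix. The sketch below shows those standard definitions. The counts are made-up numbers, not results from the study, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes at least one actual positive, one actual negative,
    and one predicted positive, so no denominator is zero.
    """
    sensitivity = tp / (tp + fn)   # recall on diabetic subjects
    specificity = tn / (tn + fp)   # recall on healthy subjects
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts for 100 subjects.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Reporting sensitivity and specificity alongside accuracy matters for a diagnostic task, since a classifier can reach high accuracy on an imbalanced dataset while missing many diabetic subjects.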


2002 ◽  
Vol 24 (2-3) ◽  
pp. 59-67 ◽  
Author(s):  
Josef Smolle ◽  
Armin Gerger ◽  
Wolfgang Weger ◽  
Heinz Kutzner ◽  
Michael Tronnier

Background: Tissue counter analysis is an image analysis tool designed for the detection of structures in complex images at the macroscopic or microscopic scale. As a basic principle, small square or circular measuring masks are randomly placed across the image and image analysis parameters are obtained for each mask. Based on learning sets, statistical classification procedures are generated which facilitate an automated classification of new data sets. Objective: To evaluate the influence of the size and shape of the measuring masks as well as the importance of feature selection, statistical procedures and technical preparation of slides on the performance of tissue counter analysis in microscopic images. As main quality measure of the final classification procedure, the percentage of elements that were correctly classified was used. Study design: H&E-stained slides of 25 primary cutaneous melanomas were evaluated by tissue counter analysis for the recognition of melanoma elements (section area occupied by tumour cells) in contrast to other tissue elements and background elements. Circular and square measuring masks, various subsets of image analysis features and classification and regression trees compared with linear discriminant analysis as statistical alternatives were used. The percentage of elements that were correctly classified by the various classification procedures was assessed. In order to evaluate the applicability to slides obtained from different laboratories, the best procedure was automatically applied in a test set of another 50 cases of primary melanoma derived from the same laboratory as the learning set and two test sets of 20 cases each derived from two different laboratories, and the measurements of melanoma area in these cases were compared with conventional assessment of vertical tumour thickness. Results: Square measuring masks were slightly superior to circular masks, and larger masks (64 or 128 pixels in diameter) were superior to smaller masks (8 to 32 pixels in diameter). As far as the subsets of image analysis features were concerned, colour features were superior to densitometric and Haralick texture features. Statistical moments of the grey level distribution were of least significance. CART (classification and regression tree) analysis turned out to be superior to linear discriminant analysis. In the best setting, 95% of melanoma tissue elements were correctly recognized. Automated measurement of melanoma area in the independent test sets yielded a correlation of r = 0.846 with vertical tumour thickness (p < 0.001), similar to the relationship reported for manual measurements. The test sets obtained from different laboratories yielded comparable results. Conclusions: Large, square measuring masks, colour features and CART analysis provide a useful setting for the automated measurement of melanoma tissue in tissue counter analysis, which can also be used for slides derived from different laboratories.


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Yingji Qi ◽  
Feng Ding ◽  
Fangzhou Xu ◽  
Jimin Yang

Brain-computer interface (BCI) is a communication and control system linking the human brain and computers or other electronic devices. However, irrelevant channels and misleading features unrelated to tasks limit classification performance. To address these problems, we propose an efficient signal processing framework based on particle swarm optimization (PSO) for channel and feature selection, channel selection, and feature selection. Modified Stockwell transforms were used for a feature extraction, and multilevel hybrid PSO-Bayesian linear discriminant analysis was applied to optimization and classification. The BCI Competition III dataset I was used here to confirm the superiority of the proposed scheme. Compared to a method without optimization (89% accuracy), the best classification accuracy of the PSO-based scheme was 99% when less than 10.5% of the original features were used, the test time was reduced by more than 90%, and it achieved Kappa values and F-score of 0.98 and 98.99%, respectively, and better signal-to-noise ratio, thereby outperforming existing algorithms. The results show that the channel and feature selection scheme can accelerate the speed of convergence to the global optimum and reduce the training time. As the proposed framework can significantly improve classification performance, effectively reduce the number of features, and greatly shorten the test time, it can serve as a reference for related real-time BCI application system research.


2020 ◽  
Author(s):  
Nalika Ulapane ◽  
Karthick Thiyagarajan ◽  
sarath kodagoda

Classification has become a vital task in modern machine learning and Artificial Intelligence applications, including smart sensing. Numerous machine learning techniques are available to perform classification. Similarly, numerous practices, such as feature selection (i.e., selection of a subset of descriptor variables that optimally describe the output), are available to improve classifier performance. In this paper, we consider the case of a given supervised learning classification task that has to be performed making use of continuous-valued features. It is assumed that an optimal subset of features has already been selected. Therefore, no further feature reduction, or feature addition, is to be carried out. Then, we attempt to improve the classification performance by passing the given feature set through a transformation that produces a new feature set which we have named the “Binary Spectrum”. Via a case study example done on some Pulsed Eddy Current sensor data captured from an infrastructure monitoring task, we demonstrate how the classification accuracy of a Support Vector Machine (SVM) classifier increases through the use of this Binary Spectrum feature, indicating the feature transformation’s potential for broader usage.

