Exploring the Time-efficient Evolutionary-based Feature Selection Algorithms for Speech Data under Stressful Work Condition

Derry Pramono Adi; Lukman Junaedi;  Frismanda; Agustinus Bimo Gumelar; Andreas Agung Kristanto

doi:10.24003/emitter.v9i1.571

Exploring the Time-efficient Evolutionary-based Feature Selection Algorithms for Speech Data under Stressful Work Condition

EMITTER International Journal of Engineering Technology ◽

10.24003/emitter.v9i1.571 ◽

2021 ◽

Vol 9 (1) ◽

pp. 60-74

Author(s):

Derry Pramono Adi ◽

Lukman Junaedi ◽

Frismanda ◽

Agustinus Bimo Gumelar ◽

Andreas Agung Kristanto

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Customer Service ◽

Harmony Search ◽

Computation Time ◽

Principal Component ◽

Support Vector ◽

Bee Colony ◽

Speech Data ◽

Selection Algorithms

Initially, the goal of Machine Learning (ML) advancements is faster computation time and lower computation resources, while the curse of dimensionality burdens both computation time and resource. This paper describes the benefits of the Feature Selection Algorithms (FSA) for speech data under workload stress. FSA contributes to reducing both data dimension and computation time and simultaneously retains the speech information. We chose to use the robust Evolutionary Algorithm, Harmony Search, Principal Component Analysis, Genetic Algorithm, Particle Swarm Optimization, Ant Colony Optimization, and Bee Colony Optimization, which are then to be evaluated using the hierarchical machine learning models. These FSAs are explored with the conversational workload stress data of a Customer Service hotline, which has daily complaints that trigger stress in speaking. Furthermore, we employed precisely 223 acoustic-based features. Using Random Forest, our evaluation result showed computation time had improved 3.6 faster than the original 223 features employed. Evaluation using Support Vector Machine beat the record with 0.001 seconds of computation time.

Get full-text (via PubEx)

Evaluating Variable Selection and Machine Learning Algorithms for Estimating Forest Heights by Combining Lidar and Hyperspectral Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9090507 ◽

2020 ◽

Vol 9 (9) ◽

pp. 507

Author(s):

Sanjiwana Arjasakusuma ◽

Sandiaga Swahyu Kusuma ◽

Stuart Phinn

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Principal Component ◽

Hyperspectral Data ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forest Height ◽

Extreme Gradient Boosting

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving high- spatial and spectral dimensionality data, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGbtree and XGBdart) and linear (XGBlin) classifiers were evaluated. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R2 = 0.53 and RMSE = 1.7 m (18.4% of nRMSE and 0.046 m of bias) for BO-XGBdart and R2 = 0.51 and RMSE = 1.8 m (15.8% of nRMSE and −0.244 m of bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variables selection; it could reduce 95% of the data to select the 29 most important variables from the initial 516 variables from lidar metrics and hyperspectral data.

Get full-text (via PubEx)

Multivariate Analysis and Machine Learning for Ripeness Classification of Cape Gooseberry Fruits

Processes ◽

10.3390/pr7120928 ◽

2019 ◽

Vol 7 (12) ◽

pp. 928 ◽

Cited By ~ 2

Author(s):

Miguel De-la-Torre ◽

Omar Zatarain ◽

Himer Avila-George ◽

Mirna Muñoz ◽

Jimy Oblitas ◽

...

Keyword(s):

Machine Learning ◽

Principal Component Analysis ◽

Feature Selection ◽

Principal Component ◽

Component Analysis ◽

Support Vector ◽

Color Spaces ◽

Combination Methods ◽

Fruit Samples ◽

Cape Gooseberry

This paper explores five multivariate techniques for information fusion on sorting the visual ripeness of Cape gooseberry fruits (principal component analysis, linear discriminant analysis, independent component analysis, eigenvector centrality feature selection, and multi-cluster feature selection.) These techniques are applied to the concatenated channels corresponding to red, green, and blue (RGB), hue, saturation, value (HSV), and lightness, red/green value, and blue/yellow value (L*a*b) color spaces (9 features in total). Machine learning techniques have been reported for sorting the Cape gooseberry fruits’ ripeness. Classifiers such as neural networks, support vector machines, and nearest neighbors discriminate on fruit samples using different color spaces. Despite the color spaces being equivalent up to a transformation, a few classifiers enable better performances due to differences in the pixel distribution of samples. Experimental results show that selection and combination of color channels allow classifiers to reach similar levels of accuracy; however, combination methods still require higher computational complexity. The highest level of accuracy was obtained using the seven-dimensional principal component analysis feature space.

Get full-text (via PubEx)

An integrated machine learning framework for a discriminative analysis of schizophrenia using multi-biological data

Scientific Reports ◽

10.1038/s41598-021-94007-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Peng-fei Ke ◽

Dong-sheng Xiong ◽

Jia-hui Li ◽

Zhi-lin Pan ◽

Jing Zhou ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Gut Microbiota ◽

Cross Sectional Study ◽

Biological Data ◽

Support Vector ◽

Cross Sectional ◽

Integrated Framework ◽

Learning Framework ◽

Selection Algorithms

AbstractFinding effective and objective biomarkers to inform the diagnosis of schizophrenia is of great importance yet remains challenging. Relatively little work has been conducted on multi-biological data for the diagnosis of schizophrenia. In this cross-sectional study, we extracted multiple features from three types of biological data, including gut microbiota data, blood data, and electroencephalogram data. Then, an integrated framework of machine learning consisting of five classifiers, three feature selection algorithms, and four cross validation methods was used to discriminate patients with schizophrenia from healthy controls. Our results show that the support vector machine classifier without feature selection using the input features of multi-biological data achieved the best performance, with an accuracy of 91.7% and an AUC of 96.5% (p < 0.05). These results indicate that multi-biological data showed better discriminative capacity for patients with schizophrenia than single biological data. The top 5% discriminative features selected from the optimal model include the gut microbiota features (Lactobacillus, Haemophilus, and Prevotella), the blood features (superoxide dismutase level, monocyte-lymphocyte ratio, and neutrophil count), and the electroencephalogram features (nodal local efficiency, nodal efficiency, and nodal shortest path length in the temporal and frontal-parietal brain areas). The proposed integrated framework may be helpful for understanding the pathophysiology of schizophrenia and developing biomarkers for schizophrenia using multi-biological data.

Get full-text (via PubEx)

Machine Learning and Feature Selection Methods for EGFR Mutation Status Prediction in Lung Cancer

Applied Sciences ◽

10.3390/app11073273 ◽

2021 ◽

Vol 11 (7) ◽

pp. 3273

Author(s):

Joana Morgado ◽

Tania Pereira ◽

Francisco Silva ◽

Cláudia Freitas ◽

Eduardo Negrão ◽

...

Keyword(s):

Machine Learning ◽

Lung Cancer ◽

Feature Selection ◽

Egfr Mutation ◽

Feature Selection Method ◽

Principal Component ◽

Image Features ◽

Support Vector ◽

Selection Methods ◽

Mutation Status

The evolution of personalized medicine has changed the therapeutic strategy from classical chemotherapy and radiotherapy to a genetic modification targeted therapy, and although biopsy is the traditional method to genetically characterize lung cancer tumor, it is an invasive and painful procedure for the patient. Nodule image features extracted from computed tomography (CT) scans have been used to create machine learning models that predict gene mutation status in a noninvasive, fast, and easy-to-use manner. However, recent studies have shown that radiomic features extracted from an extended region of interest (ROI) beyond the tumor, might be more relevant to predict the mutation status in lung cancer, and consequently may be used to significantly decrease the mortality rate of patients battling this condition. In this work, we investigated the relation between image phenotypes and the mutation status of Epidermal Growth Factor Receptor (EGFR), the most frequently mutated gene in lung cancer with several approved targeted-therapies, using radiomic features extracted from the lung containing the nodule. A variety of linear, nonlinear, and ensemble predictive classification models, along with several feature selection methods, were used to classify the binary outcome of wild-type or mutant EGFR mutation status. The results show that a comprehensive approach using a ROI that included the lung with nodule can capture relevant information and successfully predict the EGFR mutation status with increased performance compared to local nodule analyses. Linear Support Vector Machine, Elastic Net, and Logistic Regression, combined with the Principal Component Analysis feature selection method implemented with 70% of variance in the feature set, were the best-performing classifiers, reaching Area Under the Curve (AUC) values ranging from 0.725 to 0.737. This approach that exploits a holistic analysis indicates that information from more extensive regions of the lung containing the nodule allows a more complete lung cancer characterization and should be considered in future radiogenomic studies.

Get full-text (via PubEx)

Comparing Methods of Feature Extraction of Brain Activities for Octave Illusion Classification Using Machine Learning

Sensors ◽

10.3390/s21196407 ◽

2021 ◽

Vol 21 (19) ◽

pp. 6407

Author(s):

Nina Pilyugina ◽

Akihiko Tsukahara ◽

Keita Tanaka

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Feature Selection ◽

Principal Component ◽

Machine Learning Algorithms ◽

Recursive Feature Elimination ◽

Support Vector ◽

Selection Methods ◽

Automatic Feature Extraction ◽

Octave Illusion

The aim of this study was to find an efficient method to determine features that characterize octave illusion data. Specifically, this study compared the efficiency of several automatic feature selection methods for automatic feature extraction of the auditory steady-state responses (ASSR) data in brain activities to distinguish auditory octave illusion and nonillusion groups by the difference in ASSR amplitudes using machine learning. We compared univariate selection, recursive feature elimination, principal component analysis, and feature importance by testifying the results of feature selection methods by using several machine learning algorithms: linear regression, random forest, and support vector machine. The univariate selection with the SVM as the classification method showed the highest accuracy result, 75%, compared to 66.6% without using feature selection. The received results will be used for future work on the explanation of the mechanism behind the octave illusion phenomenon and creating an algorithm for automatic octave illusion classification.

Get full-text (via PubEx)

FEATURE SELECTION AND MACHINE LEARNING CLASSIFICATION FOR MALWARE DETECTION

Jurnal Teknologi ◽

10.11113/jt.v77.3558 ◽

2015 ◽

Vol 77 (1) ◽

Cited By ~ 13

Author(s):

Ban Mohammed Khammas ◽

Alireza Monemi ◽

Joseph Stephen Bassi ◽

Ismahani Ismail ◽

Sulaiman Mohd Nor ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Computer Security ◽

Malware Detection ◽

Principal Component ◽

Machine Learning Techniques ◽

Detection Methods ◽

Support Vector ◽

Machine Learning Classification ◽

Minimum Number

Malware is a computer security problem that can morph to evade traditional detection methods based on known signature matching. Since new malware variants contain patterns that are similar to those in observed malware, machine learning techniques can be used to identify new malware. This work presents a comparative study of several feature selection methods with four different machine learning classifiers in the context of static malware detection based on n-grams analysis. The result shows that the use of Principal Component Analysis (PCA) feature selection and Support Vector Machines (SVM) classification gives the best classification accuracy using a minimum number of features.

Get full-text (via PubEx)

Breast Cancer Identification from Patients’ Tweet Streaming Using Machine Learning Solution on Spark

Complexity ◽

10.1155/2021/6653508 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Nahla F. Omran ◽

Sara F. Abd-el Ghany ◽

Hager Saleh ◽

Ayman Nabil

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Real Time ◽

Random Forest Classifier ◽

Streaming Data ◽

Support Vector ◽

Real Time System ◽

Selection Algorithms

Twitter integrates with streaming data technologies and machine learning to add new value to healthcare. This paper presented a real-time system to predict breast cancer based on streaming patient’s health data from Twitter. The proposed system consists of two major components: developing an offline building model and an online prediction pipeline. For the first component, we made a correlation between the features to determine the correlation between features and reduce the number of features from the Breast Cancer Wisconsin Diagnostic dataset. Two feature selection algorithms are recursive feature elimination and univariate feature selection algorithms which are applied to features after correlation to select the essential features. Four decision trees, logistic regression, support vector machine, and random forest classifier have been used on features after correlation and feature selection. Also, hyperparameter tuning and cross-validation have been applied with machine learning to optimize models and enhance accuracy. Apache Spark, Apache Kafka, and Twitter Streaming API are used to develop the second component. The best model with the highest accuracy obtained from the first component predicts breast cancer in real time from tweets’ streaming. The results showed that the best model is the random forest classifier which achieved the best accuracy.

Get full-text (via PubEx)

Application of Machine Learning in Animal Disease Analysis and Prediction

Current Bioinformatics ◽

10.2174/1574893615999200728195613 ◽

2020 ◽

Vol 15 ◽

Author(s):

Shuwen Zhang ◽

Qiang Su ◽

Qin Chen

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Clustering Algorithm ◽

Principal Component ◽

Support Vector ◽

Animal Disease ◽

Human Beings ◽

Animal Diseases ◽

Disease Analysis

Abstract: Major animal diseases pose a great threat to animal husbandry and human beings. With the deepening of globalization and the abundance of data resources, the prediction and analysis of animal diseases by using big data are becoming more and more important. The focus of machine learning is to make computers learn how to learn from data and use the learned experience to analyze and predict. Firstly, this paper introduces the animal epidemic situation and machine learning. Then it briefly introduces the application of machine learning in animal disease analysis and prediction. Machine learning is mainly divided into supervised learning and unsupervised learning. Supervised learning includes support vector machines, naive bayes, decision trees, random forests, logistic regression, artificial neural networks, deep learning, and AdaBoost. Unsupervised learning has maximum expectation algorithm, principal component analysis hierarchical clustering algorithm and maxent. Through the discussion of this paper, people have a clearer concept of machine learning and understand its application prospect in animal diseases.

Get full-text (via PubEx)

A Comparison of Feature Selection and Forecasting Machine Learning Algorithms for Predicting Glycaemia in Type 1 Diabetes Mellitus

Applied Sciences ◽

10.3390/app11041742 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1742

Author(s):

Ignacio Rodríguez-Rodríguez ◽

José-Víctor Rodríguez ◽

Wai Lok Woo ◽

Bo Wei ◽

Domingo-Javier Pardo-Quiles

Keyword(s):

Diabetes Mellitus ◽

Machine Learning ◽

Type 1 Diabetes ◽

Feature Selection ◽

Blood Glucose ◽

Type 1 Diabetes Mellitus ◽

Support Vector ◽

Chronic Hyperglycemia ◽

Predictive Algorithms

Type 1 diabetes mellitus (DM1) is a metabolic disease derived from falls in pancreatic insulin production resulting in chronic hyperglycemia. DM1 subjects usually have to undertake a number of assessments of blood glucose levels every day, employing capillary glucometers for the monitoring of blood glucose dynamics. In recent years, advances in technology have allowed for the creation of revolutionary biosensors and continuous glucose monitoring (CGM) techniques. This has enabled the monitoring of a subject’s blood glucose level in real time. On the other hand, few attempts have been made to apply machine learning techniques to predicting glycaemia levels, but dealing with a database containing such a high level of variables is problematic. In this sense, to the best of the authors’ knowledge, the issues of proper feature selection (FS)—the stage before applying predictive algorithms—have not been subject to in-depth discussion and comparison in past research when it comes to forecasting glycaemia. Therefore, in order to assess how a proper FS stage could improve the accuracy of the glycaemia forecasted, this work has developed six FS techniques alongside four predictive algorithms, applying them to a full dataset of biomedical features related to glycaemia. These were harvested through a wide-ranging passive monitoring process involving 25 patients with DM1 in practical real-life scenarios. From the obtained results, we affirm that Random Forest (RF) as both predictive algorithm and FS strategy offers the best average performance (Root Median Square Error, RMSE = 18.54 mg/dL) throughout the 12 considered predictive horizons (up to 60 min in steps of 5 min), showing Support Vector Machines (SVM) to have the best accuracy as a forecasting algorithm when considering, in turn, the average of the six FS techniques applied (RMSE = 20.58 mg/dL).

Get full-text (via PubEx)

Feature-Selection and Mutual-Clustering Approaches to Improve DoS Detection and Maintain WSNs’ Lifetime

Sensors ◽

10.3390/s21144821 ◽

2021 ◽

Vol 21 (14) ◽

pp. 4821

Author(s):

Rami Ahmad ◽

Raniyah Wazirali ◽

Qusay Bsoul ◽

Tarik Abu-Ain ◽

Waleed Abu-Ain

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Open Field ◽

Network Lifetime ◽

Detection Efficiency ◽

Denial Of Service ◽

Harmony Search ◽

Machine Learning Algorithms ◽

Transport Layer ◽

Feature Selection Techniques

Wireless Sensor Networks (WSNs) continue to face two major challenges: energy and security. As a consequence, one of the WSN-related security tasks is to protect them from Denial of Service (DoS) and Distributed DoS (DDoS) attacks. Machine learning-based systems are the only viable option for these types of attacks, as traditional packet deep scan systems depend on open field inspection in transport layer security packets and the open field encryption trend. Moreover, network data traffic will become more complex due to increases in the amount of data transmitted between WSN nodes as a result of increasing usage in the future. Therefore, there is a need to use feature selection techniques with machine learning in order to determine which data in the DoS detection process are most important. This paper examined techniques for improving DoS anomalies detection along with power reservation in WSNs to balance them. A new clustering technique was introduced, called the CH_Rotations algorithm, to improve anomaly detection efficiency over a WSN’s lifetime. Furthermore, the use of feature selection techniques with machine learning algorithms in examining WSN node traffic and the effect of these techniques on the lifetime of WSNs was evaluated. The evaluation results showed that the Water Cycle (WC) feature selection displayed the best average performance accuracy of 2%, 5%, 3%, and 3% greater than Particle Swarm Optimization (PSO), Simulated Annealing (SA), Harmony Search (HS), and Genetic Algorithm (GA), respectively. Moreover, the WC with Decision Tree (DT) classifier showed 100% accuracy with only one feature. In addition, the CH_Rotations algorithm improved network lifetime by 30% compared to the standard LEACH protocol. Network lifetime using the WC + DT technique was reduced by 5% compared to other WC + DT-free scenarios.

Get full-text (via PubEx)