data preprocessing
Recently Published Documents


TOTAL DOCUMENTS

961
(FIVE YEARS 381)

H-INDEX

40
(FIVE YEARS 8)

Epigenomics ◽  
2022 ◽  
Author(s):  
Ze Zhang ◽  
Min Kyung Lee ◽  
Laurent Perreard ◽  
Karl T Kelsey ◽  
Brock C Christensen ◽  
...  

Aim: Tandem bisulfite (BS) and oxidative bisulfite (oxBS) conversion on DNA followed by hybridization to Infinium HumanMethylation BeadChips allows nucleotide resolution of 5-hydroxymethylcytosine genome-wide. Here, the authors compared data quality acquired from BS-treated and oxBS-treated samples. Materials & methods: Raw BeadArray data from 417 pairs of samples across 12 independent datasets were included in the study. Probe call rates were compared between paired BS and oxBS treatments controlling for technical variables. Results: oxBS-treated samples had a significantly lower call-rate. Among technical variables, DNA-specific extraction kits performed better with higher call rates after oxBS conversion. Conclusion: The authors emphasize the importance of quality control during oxBS conversion to minimize information loss and recommend using a DNA-specific extraction kit for DNA extraction and an oxBSQC package for data preprocessing.
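As an illustration of the paired comparison described above, the following minimal sketch computes a probe call rate per sample and the mean paired difference between BS and oxBS treatments. It assumes a call rate defined as the fraction of probes whose detection p-value falls below a cutoff; the 0.01 cutoff, the function names, and the toy p-values are hypothetical and are not taken from the paper or from the oxBSQC package.

```python
def call_rate(detection_p_values, threshold=0.01):
    """Fraction of probes whose detection p-value is below the cutoff.

    The 0.01 cutoff is an assumed value for illustration only.
    """
    called = sum(1 for p in detection_p_values if p < threshold)
    return called / len(detection_p_values)


def mean_paired_difference(pairs, threshold=0.01):
    """Mean of (BS call rate - oxBS call rate) over paired samples."""
    diffs = [call_rate(bs, threshold) - call_rate(oxbs, threshold)
             for bs, oxbs in pairs]
    return sum(diffs) / len(diffs)


# Toy detection p-values for two hypothetical BS/oxBS sample pairs.
pairs = [
    ([0.001, 0.002, 0.2, 0.003], [0.001, 0.05, 0.2, 0.003]),
    ([0.0001, 0.0002, 0.0003, 0.004], [0.0001, 0.02, 0.0003, 0.004]),
]
difference = mean_paired_difference(pairs)
```

A positive difference means the BS-treated member of each pair retained more callable probes on average; a real analysis would also adjust for the technical variables mentioned in the abstract.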


Author(s):  
Zheqi Yu ◽  
Adnan Zahid ◽  
Shuja Ansari ◽  
Hasan Abbas ◽  
Hadi Heidari ◽  
...  

By exploiting the self-association feature of the Hopfield neural network, we can reduce the need for extensive sensor training samples during human behavior recognition. So that a training algorithm can obtain a general activity feature template from a single round of data preprocessing, this work proposes a data preprocessing framework that is suitable for neuromorphic computing. Using a preprocessing method based on matrix construction and feature extraction, we simplified and improved the classification output of the Hopfield neuromorphic algorithm. We assigned different samples to neurons by constructing a feature matrix, which changed the weights of different categories to classify sensor data. The preprocessing also performs sensor data fusion, which helps improve the classification accuracy and avoids the local optima that arise when data from a single sensor are used. Experimental results show that the framework achieves high classification accuracy with the necessary robustness. Using the proposed method, the classification and recognition accuracy of the Hopfield neuromorphic algorithm on the three classes of human activities is 96.3%. Compared with traditional machine learning algorithms, the proposed framework only needs to learn the samples once to obtain the feature matrix for human activities, compensating for limited sample databases while improving the classification accuracy.
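The self-association behavior mentioned above can be sketched with a textbook Hopfield network: templates are stored once with a Hebbian rule, and a noisy input is pulled back to the nearest stored template. This is a generic illustration only; the bipolar toy templates and function names are hypothetical, and the paper's feature-matrix construction and sensor fusion steps are not reproduced here.

```python
def train_hopfield(patterns):
    """Build Hebbian weights from bipolar (+1/-1) patterns, with a zero diagonal."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += pattern[i] * pattern[j] / n
    return weights


def recall(weights, state, max_sweeps=10):
    """Update neurons one at a time until a full sweep changes nothing."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(weights[i][j] * state[j] for j in range(n))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Two hypothetical activity templates, stored in a single pass.
templates = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hopfield(templates)

noisy = [1, 1, 1, -1, -1, -1, -1, -1]  # first template with one element flipped
restored = recall(weights, noisy)
```

Because the weights are computed in one pass over the templates, no iterative training loop is needed, which is the property the abstract relies on when it says the samples are learned only once.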


Electronics ◽  
2022 ◽  
Vol 11 (1) ◽  
pp. 139
Author(s):  
Juneseo Chang ◽  
Myeongjin Kang ◽  
Daejin Park

Smart homes assist users by providing convenient services based on activity classification with the help of machine learning (ML) technology. However, most conventional high-performance ML algorithms require relatively high power consumption and memory usage due to their complex structure. Moreover, previously studied lightweight ML/DL models for human activity classification still require relatively high resources for extremely resource-limited embedded systems; thus, they are inapplicable to smart homes’ embedded system environments. Therefore, in this study, we propose a low-power, memory-efficient, high-speed ML algorithm for smart home activity data classification suitable for an extremely resource-constrained environment. We propose a method for treating smart home activity data as image data, and hence use the MNIST dataset as a substitute for real-world activity data. The proposed ML algorithm consists of three parts: data preprocessing, training, and classification. In data preprocessing, training data with the same label are grouped into more detailed clusters. The training process generates hyperplanes by accumulating and thresholding each cluster of preprocessed data. Finally, the classification process classifies input data by calculating the similarity between the input data and each hyperplane using a bitwise-operation-based error function. We verified our algorithm on 'Raspberry Pi 3' and 'STM32 Discovery board' embedded systems by loading trained hyperplanes and performing classification on 1000 training samples. Compared to a linear support vector machine implemented with TensorFlow Lite, the proposed algorithm improved memory usage to 15.41%, power consumption to 41.7%, performance up to 50.4%, and power per accuracy to 39.2%. Moreover, compared to a convolutional neural network model, the proposed model improved memory usage to 15.41%, power consumption to 61.17%, performance to 57.6%, and power per accuracy to 55.4%.
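The accumulate, threshold, and bitwise-compare steps described above can be sketched as follows. This is one possible reading under stated assumptions, not the authors' implementation: inputs are assumed to be already binarized and packed into integers, the clusters are assumed to be given, the threshold is a simple per-bit majority vote, and the error function is XOR followed by a bit count. All names and toy values are hypothetical.

```python
def accumulate_and_threshold(samples, n_bits):
    """Build one binary template per cluster by majority vote on each bit."""
    counts = [0] * n_bits
    for sample in samples:
        for bit in range(n_bits):
            counts[bit] += (sample >> bit) & 1
    template = 0
    for bit, count in enumerate(counts):
        if 2 * count > len(samples):  # bit is set in more than half the samples
            template |= 1 << bit
    return template


def bit_error(x, template):
    """Number of differing bits, computed with XOR and a bit count."""
    return bin(x ^ template).count("1")


def classify(x, templates):
    """Return the label of the template with the smallest bit error."""
    return min(templates, key=lambda item: bit_error(x, item[1]))[0]


# Hypothetical pre-clustered, binarized 8-bit samples.
clusters = {
    "walk": [0b11110000, 0b11100000, 0b11110001],
    "sit": [0b00001111, 0b00011111, 0b00001110],
}
templates = [(label, accumulate_and_threshold(samples, 8))
             for label, samples in clusters.items()]
prediction = classify(0b11010000, templates)
```

Storing each template as a packed integer keeps memory small, and XOR plus a bit count avoids floating-point arithmetic, which is the kind of saving that matters on a microcontroller-class target.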


2022 ◽  
pp. 25-75
Author(s):  
Jinyu Chen ◽  
Haoran Zhang ◽  
Wenjing Li ◽  
Ryosuke Shibasaki

2022 ◽  
Vol 2161 (1) ◽  
pp. 012043
Author(s):  
Ananya Devarakonda ◽  
Nilesh Sharma ◽  
Prita Saha ◽  
S Ramya

As most of the population acquires access to the internet, protecting online identity from threats to confidentiality, integrity, and accessibility becomes an increasingly important problem to tackle. By definition, a network intrusion detection system (IDS) helps pinpoint and identify anomalous network traffic to bring forward and classify suspicious activity. It is a fundamental part of network security and provides the first line of defense against a potential attack by alerting an administrator or appropriate personnel of possible malicious network activity. Several academic publications propose various artificial intelligence (AI) methods for an accurate network IDS. This paper outlines and compares four AI methods trained on two benchmark datasets: the KDD’99 and the NSL-KDD. Apart from model selection, data preprocessing plays a vital role in contributing to accurate solutions, and thus, we propose a simple yet effective data preprocessing method. We also evaluate and compare the accuracy and performance of four popular models: decision tree (DT), multi-layer perceptron (MLP), random forest (RF), and a stacked autoencoder (SAE) model. Of the four methods, the random forest classifier showed the most consistent and accurate results.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Xin Li ◽  
HongBo Li ◽  
WenSheng Cui ◽  
ZhaoHui Cai ◽  
MeiJuan Jia

Breast cancer is one of the primary causes of cancer death in the world and has a great impact on women’s health. Most classification methods rely on high-level features. However, features at different levels may not be positively correlated with the final classification results. Inspired by the recent widespread use of deep learning, this study proposes a novel method for classifying benign and malignant breast cancer based on deep features. First, we design Sliding + Random and Sliding + Class Balance Random window slicing strategies for data preprocessing. The two strategies enhance the generalization of the model and improve classification performance on minority classes. Second, feature extraction is based on the AlexNet model. We also discuss the influence of intermediate- and high-level features on classification results. Third, different levels of features are input into different machine-learning models for classification, and the best combination is then chosen. The experimental results show that data preprocessing with the Sliding + Class Balance Random window slicing strategy is effective on the BreaKHis dataset. The classification accuracy ranges from 83.57% to 88.69% at different magnifications. On this basis, combining intermediate- and high-level features with SVM gives the best classification results. The classification accuracy ranges from 85.30% to 88.76% at different magnifications. Compared with the latest results of F. A. Spanhol’s team, who provide the BreaKHis data, the presented method shows better classification performance on image-level accuracy. We believe that the proposed method has promising practical value and research significance.
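A window slicing step of the kind named above might look like the following sketch, which combines fixed-stride sliding windows with a few randomly placed ones. The abstract does not specify window size, stride, or how the class-balanced variant allocates random windows, so every parameter and function name here is a hypothetical choice, and the class-balancing logic is deliberately left out.

```python
import random


def sliding_windows(height, width, size, stride):
    """Top-left corners of all fixed-stride windows that fit inside the image."""
    return [(row, col)
            for row in range(0, height - size + 1, stride)
            for col in range(0, width - size + 1, stride)]


def random_windows(height, width, size, count, rng):
    """Top-left corners of randomly placed windows that fit inside the image."""
    return [(rng.randrange(height - size + 1), rng.randrange(width - size + 1))
            for _ in range(count)]


def crop(image, top, left, size):
    """Cut a size-by-size patch out of an image stored as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


# Toy 4x6 "image" whose pixel values encode their own position.
image = [[r * 6 + c for c in range(6)] for r in range(4)]
rng = random.Random(0)  # seeded so the random windows are reproducible

corners = (sliding_windows(4, 6, size=2, stride=2)
           + random_windows(4, 6, size=2, count=3, rng=rng))
patches = [crop(image, top, left, 2) for top, left in corners]
```

One plausible way to obtain a class-balanced variant would be to draw more random windows from images of the minority class, but that is a guess about the intent rather than the published procedure.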


Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 56
Author(s):  
Hongwei Li ◽  
Hongyan Mao ◽  
Jingzi Wang

Part-of-Speech (POS) tagging is one of the most important tasks in the field of natural language processing (NLP). POS tagging for a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. POS tagging can be an upstream task for other NLP tasks, further improving their performance. Therefore, it is important to improve the accuracy of POS tagging. In POS tagging, bidirectional Long Short-Term Memory (Bi-LSTM) is commonly used and achieves good performance. However, Bi-LSTM is not as powerful as Transformer in leveraging contextual information, since Bi-LSTM simply concatenates the contextual information from left-to-right and right-to-left. In this study, we propose a novel approach for POS tagging to improve the accuracy. For each token, all possible POS tags are obtained without considering context, and then rules are applied to prune out these possible POS tags, which we call rule-based data preprocessing. In this way, the number of possible POS tags of most tokens can be reduced to one, and they are considered to be correctly tagged. Finally, POS tags of the remaining tokens are masked, and a model based on Transformer is used to only predict the masked POS tags, which enables it to leverage bidirectional contexts. Our experimental result shows that our approach leads to better performance than other methods using Bi-LSTM.
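The rule-based preprocessing described above (collect candidate tags, prune with rules, mask what remains ambiguous) can be sketched as below. The lexicon, the single pruning rule, and the `[MASK]` string are hypothetical stand-ins chosen for illustration; the paper's actual tag inventory and rule set are not given in the abstract, and the Transformer that predicts the masked tags is not shown.

```python
MASK = "[MASK]"

# Hypothetical context-free lexicon: every tag each word could take.
LEXICON = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "can": {"AUX", "NOUN", "VERB"},
    "run": {"VERB", "NOUN"},
}


def rule_after_determiner(previous_candidates, candidates):
    """Hypothetical rule: right after an unambiguous determiner, drop verb-like tags."""
    if previous_candidates == {"DET"}:
        pruned = candidates - {"VERB", "AUX"}
        return pruned or candidates  # never prune down to nothing
    return candidates


def preprocess(tokens):
    """Tag tokens whose candidates reduce to one; mask the rest for a model."""
    tags = []
    previous = set()
    for token in tokens:
        candidates = set(LEXICON.get(token.lower(), set()))
        candidates = rule_after_determiner(previous, candidates)
        tags.append(next(iter(candidates)) if len(candidates) == 1 else MASK)
        previous = candidates
    return tags


tagged = preprocess(["The", "can", "can", "run"])
```

In this toy sentence the first "can" is resolved to a noun by the rule, while the second "can" and "run" stay masked, which is the portion a downstream model would be asked to predict from bidirectional context.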


Mathematics ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 29
Author(s):  
Jersson X. Leon-Medina ◽  
Núria Parés ◽  
Maribel Anaya ◽  
Diego A. Tibaduiza ◽  
Francesc Pozo

The classification and use of robust methodologies in sensor array applications of electronic noses (ENs) remain an open problem. Among the several steps used in the developed methodologies, data preprocessing improves the classification accuracy of this type of sensor. Data preprocessing methods, such as data transformation and data reduction, enable the treatment of data with anomalies, such as outliers and features that do not provide quality information; in addition, they reduce the dimensionality of the data, thereby facilitating the tasks of a machine learning classifier. To help solve this problem, in this study, a machine learning methodology is introduced to improve signal processing and develop methodologies for classification when an EN is used. The proposed methodology involves a normalization stage to scale the data from the sensors, using both the well-known min−max approach and the more recent mean-centered unitary group scaling (MCUGS). Next, a manifold learning algorithm for data reduction is applied using uniform manifold approximation and projection (UMAP). The dimensionality of the data at the input of the classification machine is reduced, and an extreme learning machine (ELM) is used as a machine learning classifier algorithm. To validate the EN classification methodology, three datasets of ENs were used. The first dataset was composed of 3600 measurements of 6 volatile organic compounds performed by employing 16 metal-oxide gas sensors. The second dataset was composed of 235 measurements of 3 different qualities of wine, namely, high, average, and low, as evaluated by using an EN sensor array composed of 6 different sensors. The third dataset was composed of 309 measurements of 3 different gases obtained by using an EN sensor array of 2 sensors. A 5-fold cross-validation approach was used to evaluate the proposed methodology. A test set consisting of 25% of the data was used to validate the methodology with unseen data. The results showed an average classification accuracy of 1 (every sample classified correctly) when the MCUGS, UMAP, and ELM methods were used. Finally, the effect of changing the number of target dimensions on the reduction of the number of data was determined based on the highest average classification accuracy.
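Of the normalization options named above, the min−max approach is simple enough to sketch directly: each sensor column is rescaled to the range [0, 1]. The function name and toy readings are hypothetical, the handling of constant columns is an assumed convention, and MCUGS, UMAP, and the ELM classifier are not reproduced here.

```python
def min_max_scale(rows):
    """Rescale each column to [0, 1]; a constant column is mapped to 0.0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in rows
    ]


# Toy readings: three measurements from two hypothetical sensors.
readings = [
    [2.0, 10.0],
    [4.0, 30.0],
    [6.0, 20.0],
]
scaled = min_max_scale(readings)
```

When scaling is combined with cross-validation or a held-out test set, the minimum and maximum are normally computed on the training portion only and then reused on the held-out data, so that no information from unseen samples leaks into preprocessing.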

