EMPLOYMENT DATA CLEANING METHODS IN DATABASE REENGINEERING

Author(s):  
A. I. Baranchikov ◽  
I. I. Yakovlev ◽  
I. A. Klyueva ◽  
...  
Author(s):  
Jinlin Wang ◽  
Hongli Zhang ◽  
Binxing Fang ◽  
Xing Wang ◽  
Lin Ye

2020 ◽  
Vol 62 (3) ◽  
pp. 1053-1075
Author(s):  
Jinlin Wang ◽  
Xing Wang ◽  
Yuchen Yang ◽  
Hongli Zhang ◽  
Binxing Fang

2021 ◽  
Vol 15 ◽  
Author(s):  
Shengjie Liu ◽  
Guangye Li ◽  
Shize Jiang ◽  
Xiaolong Wu ◽  
Jie Hu ◽  
...  

Stereo-electroencephalography (SEEG) utilizes localized, penetrating depth electrodes to directly measure electrophysiological brain activity. The implanted electrodes generally provide a sparse sampling of multiple brain regions, including both cortical and subcortical structures, which has made SEEG neural recordings a potential signal source for brain–computer interfaces (BCIs) in recent years. For SEEG signals, data cleaning is an essential preprocessing step for removing excessive noise before further analysis. However, little is known about what effects different data cleaning methods exert on BCI decoding performance, or what causes these differences. To address these questions, we adopted five data cleaning methods, namely common average reference, gray–white matter reference, electrode shaft reference, bipolar reference, and Laplacian reference, to process SEEG data and evaluated their effect on BCI decoding performance. Additionally, we comparatively investigated the changes these methods induce in the SEEG signals across multiple domains (spatial, spectral, and temporal). The results showed that data cleaning methods could improve the accuracy of gesture decoding, with the Laplacian reference producing the best performance. Further analysis revealed that the superiority of the best-performing data cleaning method might be attributed to increased distinguishability in the low-frequency band. These findings highlight the importance of applying proper data cleaning methods to SEEG signals and support the use of the Laplacian reference for SEEG-based BCIs.
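
To make the referencing schemes above concrete, the following is a minimal NumPy sketch of two of them: common average reference and a nearest-neighbour Laplacian computed along a single electrode shaft. The channel layout, array shapes, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def common_average_reference(x):
    """Subtract the mean across channels from every channel.

    x : array of shape (n_channels, n_samples)
    """
    return x - x.mean(axis=0, keepdims=True)

def laplacian_reference(x):
    """Re-reference each contact to the mean of its immediate neighbours
    on the same electrode shaft (edge contacts use their single neighbour).

    x : array of shape (n_contacts, n_samples), contacts ordered along one shaft
    """
    out = np.empty_like(x)
    n = x.shape[0]
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        neighbours = [j for j in (lo, hi) if j != i]
        out[i] = x[i] - x[neighbours].mean(axis=0)
    return out

# Example: 8 contacts on one shaft, 2 s of simulated data at 1 kHz.
rng = np.random.default_rng(0)
seeg = rng.standard_normal((8, 2000))
car = common_average_reference(seeg)
lap = laplacian_reference(seeg)
```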


2008 ◽  
Vol 53 (10) ◽  
pp. 886-893 ◽  
Author(s):  
Taku Miyagawa ◽  
Nao Nishida ◽  
Jun Ohashi ◽  
Ryosuke Kimura ◽  
Akihiro Fujimoto ◽  
...  

Author(s):  
Yulin Ding ◽  
Hui Lin ◽  
Rongrong Li

Recent breakthroughs in sensor networks have made it possible to collect and assemble increasing amounts of real-time observational data by observing dynamic phenomena at previously impossible time and space scales. Real-time observational data streams present potentially profound opportunities for real-time applications in disaster mitigation and emergency response by providing accurate and timely estimates of the environment's status. However, the data are subject to inevitable anomalies (including errors and anomalous changes/events) caused by various effects in the environment being monitored. These "big but dirty" real-time observational data streams can rarely achieve their full potential in downstream real-time models or applications because of their low data quality. Therefore, timely and meaningful online data cleaning is a necessary prerequisite to ensure the quality, reliability, and timeliness of real-time observational data.

In general, a straightforward streaming data cleaning approach is to define various types of models/classifiers representing the normal behavior of sensor data streams and then declare any deviation from these models as anomalous or erroneous data. The effectiveness of such models is affected by dynamic changes in the deployed environments. Due to the changing nature of the complicated processes being observed, real-time observational data are characterized by diversity and dynamics, showing typical Big (Geo) Data characteristics. Dynamics and diversity are reflected not only in the data values but also in the complicated changing patterns of the data distributions. This means the pattern of the real-time observational data distribution is not stationary or static but changing and dynamic. Once the data pattern has changed, the model must be adapted over time to cope with the changing patterns of real-time data streams; otherwise, the model will no longer fit the subsequent observational data streams, which may lead to large estimation errors. To achieve the best generalization error, an important challenge for a data cleaning methodology is to characterize the behavior of data stream distributions and adaptively update the model to include new information and remove old information. This complicated changing behavior invalidates traditional data cleaning methods, which rely on the assumption of a stationary data distribution, and drives the need for more dynamic and adaptive online data cleaning methods.

To overcome these shortcomings, this paper presents a change-semantics-constrained online filtering method for real-time observational data. Based on the principle that the filter parameter should vary in accordance with the data change patterns, the paper embeds a semantic description that quantitatively depicts the change patterns in the data distribution to self-adapt the filter parameter automatically. Real-time observational water level data streams from different precipitation scenarios are selected for testing. Experimental results show that this method yields more accurate and reliable water level information, which supports scientific and prompt flood assessment and decision-making.
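
As an illustration of the general idea of a filter parameter that follows the change pattern of the stream, the following is a toy Python sketch of an adaptive exponential smoother. The change indicator and parameter bounds are assumptions chosen for illustration and do not reproduce the paper's change-semantics model.

```python
from collections import deque

class AdaptiveStreamFilter:
    """Toy online filter whose smoothing factor adapts to the observed
    rate of change in a short sliding window: faster tracking when the
    stream is changing quickly, stronger smoothing when it is stable."""

    def __init__(self, window=10, alpha_min=0.1, alpha_max=0.9):
        self.window = deque(maxlen=window)
        self.alpha_min = alpha_min
        self.alpha_max = alpha_max
        self.estimate = None

    def update(self, value):
        self.window.append(value)
        if self.estimate is None:
            self.estimate = value
            return self.estimate
        # Change indicator: spread of the recent window relative to the level.
        spread = max(self.window) - min(self.window)
        level = abs(self.estimate) + 1e-9
        change = min(spread / level, 1.0)
        # Map the change indicator onto the smoothing factor.
        alpha = self.alpha_min + (self.alpha_max - self.alpha_min) * change
        self.estimate = alpha * value + (1 - alpha) * self.estimate
        return self.estimate

# Usage on a simulated water-level stream (values in metres):
f = AdaptiveStreamFilter()
cleaned = [f.update(v) for v in [2.0, 2.1, 2.0, 5.9, 6.1, 6.0, 6.2]]
```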


Author(s):  
R. Prabha ◽  
Mohan G Kabadi

Reliable utilization of Global Positioning System (GPS) data demands a high degree of accuracy in the time and positional information delivered to the user. However, various extrinsic and intrinsic factors disrupt data transmission from GPS satellites to GPS receivers, which calls the trustworthiness of such data into question. This manuscript therefore offers a comprehensive insight into the data preprocessing methodologies evolved and adopted by present-day researchers. The discussion covers standard data cleaning methods as well as a diverse range of existing research-based approaches. The review finds that, despite the considerable body of work addressing data cleaning, there are critical loopholes in almost all existing studies. The paper extracts open research problems and offers evidential insight through use cases, showing that there is still a critical need to investigate data cleaning methods.
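
As a concrete example of the kind of standard GPS data cleaning such reviews discuss, the following is a small Python sketch of a speed-based outlier filter over a sequence of GPS fixes; the speed threshold and record layout are assumptions chosen for illustration.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS-84 points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def drop_speed_outliers(fixes, max_speed_mps=60.0):
    """Remove fixes that imply an implausible speed from the previous kept fix.

    fixes : list of (timestamp_s, lat, lon) tuples, sorted by time.
    """
    kept = []
    for t, lat, lon in fixes:
        if kept:
            t0, lat0, lon0 = kept[-1]
            dt = t - t0
            if dt <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_m(lat0, lon0, lat, lon) / dt > max_speed_mps:
                continue  # physically implausible jump
        kept.append((t, lat, lon))
    return kept
```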


Author(s):  
William E. Winkler

Fayyad and Uthurusamy (2002) stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This article provides an overview of two methods for improving quality. The first is data cleaning for finding duplicates within files or across files. The second is edit/imputation for maintaining business rules and for filling in missing data. The fastest data-cleaning methods are suitable for files with hundreds of millions of records (Winkler, 1999b, 2003b). The fastest edit/imputation methods are suitable for files with millions of records (Winkler, 1999a, 2004b).
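
To illustrate the duplicate-detection side of data cleaning at a basic level, the following is a toy Python sketch combining blocking with a string-similarity comparison. It is not Winkler's Fellegi-Sunter-based record linkage method; the field names and threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def candidate_pairs(records, block_key):
    """Blocking: only compare records that share a cheap key (e.g. ZIP code),
    so the duplicate search scales to large files."""
    blocks = {}
    for rec in records:
        blocks.setdefault(block_key(rec), []).append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

def likely_duplicates(records, block_key, threshold=0.8):
    """Flag pairs whose name strings are nearly identical.

    A production system would combine several field comparisons
    probabilistically; this is only an illustrative toy."""
    out = []
    for a, b in candidate_pairs(records, block_key):
        score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if score >= threshold:
            out.append((a, b, score))
    return out

people = [
    {"name": "John A. Smith", "zip": "10001"},
    {"name": "Jon A Smith", "zip": "10001"},
    {"name": "Mary Jones", "zip": "10001"},
]
print(likely_duplicates(people, block_key=lambda r: r["zip"]))
```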


2015 ◽  
Vol 31 (5) ◽  
Author(s):  
Shu Xu ◽  
Bo Lu ◽  
Michael Baldea ◽  
Thomas F. Edgar ◽  
Willy Wojsznis ◽  
...  

In the past decades, process engineers have faced increasingly demanding data analytics challenges and have had difficulty obtaining valuable information from a wealth of process variable data trends. The raw data of different formats stored in databases are not useful until they are cleaned and transformed. Generally, data cleaning consists of four steps: missing data imputation, outlier detection, noise removal, and time alignment and delay estimation. This paper discusses available data cleaning methods that can be used in data pre-processing and help overcome the challenges of "Big Data".
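
The following Python sketch illustrates the four cleaning steps named above on process variable data, using pandas and NumPy; the window sizes, thresholds, and the cross-correlation-based delay estimate are illustrative assumptions rather than the specific methods surveyed in the paper.

```python
import numpy as np
import pandas as pd

def clean_series(s, window=15, z_thresh=3.0):
    """Three of the four steps on one process variable:
    outlier detection (rolling z-score test), missing data imputation
    (interpolation), and noise removal (light median smoothing)."""
    s = s.astype(float).copy()
    # 1. Outlier detection: mark points far from the rolling mean as missing.
    mean = s.rolling(window, center=True, min_periods=1).mean()
    std = s.rolling(window, center=True, min_periods=1).std().fillna(0.0)
    s[(s - mean).abs() > z_thresh * (std + 1e-9)] = np.nan
    # 2. Missing data imputation: interpolate, then fill the edges.
    s = s.interpolate().ffill().bfill()
    # 3. Noise removal: short centred median filter.
    return s.rolling(3, center=True, min_periods=1).median()

def estimate_delay(x, y):
    """4. Delay estimation: lag (in samples) at which x best aligns with y,
    taken from the peak of the cross-correlation of the de-meaned signals."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr) - (len(x) - 1))

# Usage: clean an input/output pair, then estimate the transport delay.
u = pd.Series(np.sin(np.linspace(0, 10, 200)))
y = u.shift(5).fillna(0.0) + 0.01 * np.random.default_rng(1).standard_normal(200)
delay = estimate_delay(clean_series(u), clean_series(y))
```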

