Robust and efficient fuzzy match for online data cleaning

Author(s):  
Surajit Chaudhuri ◽  
Kris Ganjam ◽  
Venkatesh Ganti ◽  
Rajeev Motwani

Author(s):  
Hongzhi Wang ◽  
Jianzhong Li ◽  
Ran Huo ◽  
Li Jia ◽  
Lian Jin ◽  
...  

Author(s):  
Yulin Ding ◽  
Hui Lin ◽  
Rongrong Li

Recent breakthroughs in sensor networks have made it possible to collect and assemble increasing amounts of real-time observational data by observing dynamic phenomena at previously impossible time and space scales. Real-time observational data streams present potentially profound opportunities for real-time applications in disaster mitigation and emergency response by providing accurate and timely estimates of the environment's status. However, the data are always subject to inevitable anomalies (including errors and anomalous changes/events) caused by various effects of the environment being monitored. These "big but dirty" real-time observational data streams can rarely achieve their full potential in downstream real-time models or applications because of their low data quality. Therefore, timely and meaningful online data cleaning is a necessary prerequisite to ensure the quality, reliability, and timeliness of real-time observational data.

In general, a straightforward streaming data cleaning approach is to define models or classifiers representing the normal behavior of sensor data streams and then declare any deviation from the model to be erroneous data. The effectiveness of these models is affected by dynamic changes in the deployed environments. Because of the changing nature of the complicated processes being observed, real-time observational data are characterized by diversity and dynamics, typical Big (Geo) Data characteristics. This diversity and dynamism are reflected not only in the data values but also in the complicated changing patterns of the data distributions. In other words, the distribution of real-time observational data is not stationary or static but changing and dynamic. Once the data pattern changes, the model must be adapted over time to cope with the changing patterns of the real-time data streams; otherwise, the model will no longer fit the subsequent observational data streams, which may lead to large estimation errors. To achieve the best generalization error, a key challenge for a data cleaning methodology is to characterize the behavior of the data stream distributions and to adaptively update the model, incorporating new information and discarding old information. This complicated changing behavior invalidates traditional data cleaning methods, which rely on the assumption of a stationary data distribution, and drives the need for more dynamic and adaptive online data cleaning methods.

To overcome these shortcomings, this paper presents a change-semantics-constrained online filtering method for real-time observational data. Based on the principle that the filter parameter should vary in accordance with the data change patterns, the method embeds a semantic description that quantitatively depicts the change patterns in the data distribution in order to self-adapt the filter parameter automatically. Real-time observational water-level data streams from different precipitation scenarios are selected for testing. The experimental results show that the method yields more accurate and reliable water-level information, which is a prerequisite for scientific and prompt flood assessment and decision-making.
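The abstract does not spell out the filter formulation, but the core idea, an online filter whose parameter self-adapts to the detected change pattern of the stream, can be illustrated with a minimal sketch. The Python code below uses exponential smoothing with an adaptive smoothing factor; the window size, the alpha bounds, and the change statistic are illustrative assumptions, not the paper's actual method.

```python
import statistics
from collections import deque


class AdaptiveOnlineFilter:
    """Minimal sketch of an online filter whose smoothing parameter adapts
    to the recent change pattern of the stream. Illustrative only: the
    window size, alpha bounds, and change score are assumptions, not the
    change-semantics formulation used in the paper."""

    def __init__(self, window=20, alpha_min=0.05, alpha_max=0.8):
        self.window = deque(maxlen=window)
        self.alpha_min = alpha_min
        self.alpha_max = alpha_max
        self.estimate = None

    def _change_score(self):
        # Quantify how strongly the recent distribution is changing:
        # ratio of the latest jump to the recent spread (0 = stable, 1 = strong change).
        if len(self.window) < 3:
            return 0.0
        spread = statistics.pstdev(self.window) or 1e-9
        jump = abs(self.window[-1] - self.window[-2])
        return min(jump / (3.0 * spread), 1.0)

    def update(self, observation):
        self.window.append(observation)
        if self.estimate is None:
            self.estimate = observation
            return self.estimate
        # Larger detected change -> larger alpha, so the filter tracks genuine
        # events quickly; stable periods -> heavier smoothing of noise.
        score = self._change_score()
        alpha = self.alpha_min + score * (self.alpha_max - self.alpha_min)
        self.estimate = alpha * observation + (1 - alpha) * self.estimate
        return self.estimate


# Usage: feed a (noisy) water-level stream reading by reading.
f = AdaptiveOnlineFilter()
for reading in [2.10, 2.11, 2.09, 3.50, 3.52, 3.55]:
    print(round(f.update(reading), 3))
```

The design point being illustrated is only that the filter parameter is driven by a quantitative description of the current change pattern rather than being fixed in advance.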


2020 ◽  
Vol 41 (1) ◽  
pp. 30-36
Author(s):  
Steven V. Rouse

Abstract. Previous research has supported the use of Amazon's Mechanical Turk (MTurk) for online data collection in individual differences research. Although MTurk Masters have reached elite status because of strong approval ratings on previous tasks (and therefore receive higher payment for their work), no research has empirically examined whether researchers actually obtain higher-quality data when they require that their MTurk Workers have Master status. In two online survey studies (one using a personality test and one using a cognitive abilities test), the psychometric reliability of MTurk data was compared between a sample that required the Master qualification type and a sample that placed no status-level qualification requirement. In both studies, the Master samples failed to outperform the standard samples.
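The abstract reports comparing the psychometric reliability of the two samples but does not name the specific reliability index; a common choice for multi-item survey data is Cronbach's alpha. The Python sketch below shows how such a comparison could be computed; the function is standard, but the data are hypothetical placeholders standing in for the Master and standard samples.

```python
import numpy as np


def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = item_scores.shape[1]
    item_var = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)


# Hypothetical 10-item, 5-point-scale responses standing in for the two samples.
rng = np.random.default_rng(0)
master = rng.integers(1, 6, size=(100, 10)).astype(float)
standard = rng.integers(1, 6, size=(100, 10)).astype(float)
print("Master sample alpha:  ", round(cronbach_alpha(master), 3))
print("Standard sample alpha:", round(cronbach_alpha(standard), 3))
```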


2009 ◽  
Vol 129 (11) ◽  
pp. 1290-1298
Author(s):  
Hiroyuki Ishikawa ◽  
Yasuyuki Shirai ◽  
Tanzo Nitta ◽  
Katsuhiko Shibata
