Incremental Interactive Record Linkage using Human Intelligence Tasks (HITs)

Author(s):  
Hye-Chung Kum

ABSTRACT

Objective: When analyzing population data, there is a need to link data about organizations. One challenge in linking organization-level data is that, unlike a person, an organization can have many definitions as an entity. For example, for hospitals, depending on the dataset, an entity might represent any one of the following similar but different semantic types: (1) physical units, (2) billing units, (3) legal units, (4) licensed units, or (5) reporting units. How these different entities relate to each other can be complex: one billing unit can span many physical units, or multiple billing units can exist for one physical unit. Thus, linking organization-level data requires human involvement to sort through these issues in heterogeneous data sources and make informed decisions about the messy data. We design and evaluate a general framework for a hybrid human-machine process for ongoing integration and cleaning of hospital-level data when no common identifiers exist, highlighting the decisions that require human judgment and documenting and tracking the full process to ensure reproducibility. Such ongoing integration is often called incremental record linkage (RL).

Approach: Accurate linkage in big data requires well-defined tasks suited to either automatic or human processing. In the human-computer interaction (HCI) field, Human Intelligence Tasks (HITs) are defined as micro-tasks requiring human judgment and are often used in designing crowdsourcing systems. We designed HITs for linking organization-level data and embedded them into automatic deterministic linkage algorithms that support interactive stepwise RL. The hybrid system is a framework for reproducible incremental RL.

Results: We illustrate this framework by integrating four databases of hospitals in Texas from 2008 to 2014 (N = 664). The IDs used in the databases are the Texas Provider ID, the National Provider ID, the Medicare ID, and the Facility ID. We link the databases using provider names, including dba (i.e., doing business as) names, addresses, and phone numbers. Similarities in hospital names and addresses and the dynamic nature of hospital attributes over time make it impossible to build a fully automated linkage system for hospitals. Using our system to iteratively standardize and clean the data, we linked the hospitals with 100% precision using HITs that required confirming 79 approximate linkages and manually linking 28 hospitals.

Conclusion: Effective software that supports the interactive and iterative process of RL with well-designed HITs can streamline the linkage process and support high-quality, replicable research using big data.
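To make the hybrid workflow concrete, below is a minimal Python sketch (not the authors' implementation; the record fields, similarity threshold, and helper names are assumptions) of a deterministic linkage step that links exact matches automatically and routes approximate matches to a HIT queue for human confirmation.

```python
# A minimal sketch of a hybrid human-machine linkage step: exact matches on
# normalized keys link automatically, while near matches are queued as HITs.
from difflib import SequenceMatcher

def normalize(s):
    """Lower-case and strip punctuation so trivially different strings match."""
    return "".join(c for c in s.lower() if c.isalnum() or c.isspace()).strip()

def link_step(record_a, record_b, threshold=0.85):
    """Return 'link', 'hit' (needs human judgment), or 'no-link'."""
    name_a, name_b = normalize(record_a["name"]), normalize(record_b["name"])
    addr_a, addr_b = normalize(record_a["address"]), normalize(record_b["address"])
    if name_a == name_b and addr_a == addr_b:
        return "link"                    # deterministic: link automatically
    name_sim = SequenceMatcher(None, name_a, name_b).ratio()
    if name_sim >= threshold or record_a.get("phone") == record_b.get("phone"):
        return "hit"                     # approximate: route to a human task
    return "no-link"

hits = []  # queue of HITs awaiting human confirmation (tracked for reproducibility)
a = {"name": "St. Mary's Hospital", "address": "100 Main St", "phone": "5551234"}
b = {"name": "Saint Marys Hospital", "address": "100 Main Street", "phone": "5551234"}
if link_step(a, b) == "hit":
    hits.append((a, b))
```

Logging every confirmed or rejected HIT alongside the automatic decisions is what makes the incremental process documented and replayable.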

2020 ◽  
Vol 12 (14) ◽  
pp. 5595 ◽  
Author(s):  
Ana Lavalle ◽  
Miguel A. Teruel ◽  
Alejandro Maté ◽  
Juan Trujillo

Fostering sustainability is paramount for Smart City development. Lately, Smart Cities have been benefiting from the rise of Big Data coming from IoT devices, leading to improvements in monitoring and prevention. However, monitoring and prevention processes require visualization techniques as a key component. Indeed, in order to prevent possible hazards (such as fires, leaks, etc.) and optimize their resources, Smart Cities require adequate visualizations that provide insights to decision makers. Nevertheless, visualization of Big Data has always been a challenging issue, especially when such data originate in real time. This problem becomes even bigger in Smart City environments, since we have to deal with many different groups of users and multiple heterogeneous data sources. Without a proper visualization methodology, complex dashboards that combine data of different natures are difficult to understand. In order to tackle this issue, we propose a methodology based on visualization techniques for Big Data, aimed at improving the evidence-gathering process by assisting users in decision making in the context of Smart Cities. Moreover, in order to assess the impact of our proposal, a case study based on service calls for a fire department is presented. In this sense, our findings are applied to data coming from citizen calls. Thus, the results of this work contribute to the optimization of resources, namely fire-extinguishing battalions, helping to improve their effectiveness and, as a result, the sustainability of a Smart City, operating better with fewer resources. Finally, in order to evaluate the impact of our proposal, we performed an experiment with users who are non-experts in data visualization.
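As an illustration only, the following Python sketch (the field names and window size are invented, not taken from the paper) shows the kind of pre-aggregation a real-time dashboard layer might perform on citizen-call events, so that decision makers see windowed counts per district rather than raw event streams.

```python
# Bucket incoming citizen-call events by district and 15-minute window,
# producing the aggregate that a dashboard widget would render.
from collections import defaultdict
from datetime import datetime

WINDOW_MINUTES = 15

def window_key(ts: datetime) -> str:
    """Truncate a timestamp to the start of its 15-minute window."""
    minute = (ts.minute // WINDOW_MINUTES) * WINDOW_MINUTES
    return ts.strftime(f"%Y-%m-%d %H:{minute:02d}")

counts = defaultdict(int)   # (district, window) -> number of calls

def ingest(call):
    """Update the windowed aggregate as each event arrives."""
    counts[(call["district"], window_key(call["timestamp"]))] += 1

ingest({"district": "Centro", "timestamp": datetime(2020, 7, 1, 10, 7)})
ingest({"district": "Centro", "timestamp": datetime(2020, 7, 1, 10, 12)})
print(dict(counts))  # {('Centro', '2020-07-01 10:00'): 2}
```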


Author(s):  
Hassan Mehmood ◽  
Ekaterina Gilman ◽  
Marta Cortes ◽  
Panos Kostakos ◽  
Andrew Byrne ◽  
...  

2012 ◽  
Vol 518-523 ◽  
pp. 1334-1339
Author(s):  
Jian Rang Zhang ◽  
Qing Tao Shen

In view of the complexity, redundancy, and uncertainty of the measurement data generated by mine environment monitoring systems, a two-level data fusion structure is presented: an adaptive weighted fusion at the first level and a grey correlation analysis fusion at the second level, thus achieving fusion of the monitoring data from heterogeneous data sources. Application examples show that the fusion model performs stably, offers strong anti-interference, and is easy to apply.
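The two levels can be sketched in a few lines of Python. The sensor values below are invented, and the grey relational coefficient follows the standard formulation with resolution coefficient ρ = 0.5; this is an illustration of the technique, not the paper's code.

```python
# Level one: fuse redundant readings of one quantity with adaptive
# (inverse-variance) weights. Level two: score a fused channel against a
# reference sequence with grey relational analysis.

def adaptive_weighted_fusion(readings, variances):
    """Weight each sensor inversely to its variance; weights sum to 1."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    return sum((w / total) * x for w, x in zip(inv, readings))

def grey_relational_grade(reference, sequence, rho=0.5):
    """Mean grey relational coefficient between a channel and a reference."""
    diffs = [abs(r - s) for r, s in zip(reference, sequence)]
    d_min, d_max = min(diffs), max(diffs)
    coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in diffs]
    return sum(coeffs) / len(coeffs)

# Three gas sensors measuring the same point with different noise levels.
fused = adaptive_weighted_fusion([0.82, 0.79, 0.90], [0.01, 0.02, 0.08])

# Compare the fused channel against a reference (e.g., a safe profile).
grade = grey_relational_grade([0.8, 0.8, 0.8], [0.82, 0.79, fused])
print(round(fused, 3), round(grade, 3))
```

Weighting by inverse variance minimizes the variance of the fused estimate, which is what gives the first level its anti-interference behavior.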


2021 ◽  
Vol 16 (2) ◽  
pp. 1-17
Author(s):  
Kevin O'Hare ◽  
Anna Jurek-Loughrey ◽  
Cassio De Campos

Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for blocking that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB against multiple methods over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss the results in terms of accuracy, use of computational resources, and different characteristics of the datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated ones.
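A simplified Python sketch of the token-blocking idea follows (this is not the paper's exact HVTB algorithm; the whitespace tokenization, the top-k cutoff, and the toy records are assumptions). Each record keeps only its highest-scoring TF-IDF tokens, and records sharing a retained token form a candidate block, so distinctive tokens like a hospital's name drive comparisons while near-stopwords like "hospital" are discarded.

```python
# TF-IDF-based blocking: retain each record's top-k tokens by TF-IDF and
# block together records that share a retained token.
import math
from collections import defaultdict

def blocks_from_records(records, top_k=2):
    token_lists = [r.lower().split() for r in records]
    n = len(records)
    df = defaultdict(int)            # document frequency per token
    for tokens in token_lists:
        for t in set(tokens):
            df[t] += 1
    blocks = defaultdict(set)        # token -> ids of records retaining it
    for rid, tokens in enumerate(token_lists):
        scores = {t: (tokens.count(t) / len(tokens)) * math.log(n / df[t])
                  for t in set(tokens)}
        for t in sorted(scores, key=scores.get, reverse=True)[:top_k]:
            blocks[t].add(rid)
    # only blocks holding 2+ records generate candidate comparisons
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}

records = ["memorial hermann hospital houston",
           "memorial hermann hosp houston",
           "baylor university medical center dallas",
           "texas childrens hospital houston",
           "methodist hospital houston"]
# Records 0 and 1 share a distinctive token ('memorial' or 'hermann') and
# form the only candidate block; common tokens never trigger comparisons.
print(blocks_from_records(records))
```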


2019 ◽  
Vol 2019 ◽  
pp. 1-18 ◽  
Author(s):  
Sebastian Neubert ◽  
André Geißler ◽  
Thomas Roddelkopf ◽  
Regina Stoll ◽  
Karl-Heinz Sandmann ◽  
...  

Investigations in preventive and occupational medicine are often based on the acquisition of data in the customer’s daily routine. This requires convenient measurement solutions covering physiological, psychological, physical, and sometimes emotional parameters. In this paper, the introduction of a decentralized multi-sensor-fusion approach for a preventive health-management system is described. The aim is the provision of a flexible mobile data-collection platform that can be used in many different health-care-related applications. Different heterogeneous data sources can be integrated, and the measured data are prepared and transferred to a superordinate data-science-oriented cloud solution. The presented novel approach focuses on the integration and fusion of different mobile data sources on a mobile data collection system (mDCS). This includes directly coupled wireless sensor devices, indirectly coupled devices offering the datasets via vendor-specific cloud solutions (e.g., Fitbit, San Francisco, USA, and Nokia, Espoo, Finland), and questionnaires to acquire subjective and objective parameters. The mDCS functions as a user-specific interface adapter and data concentrator, decentralized from the data-science-oriented processing cloud. Low-level data fusion in the mDCS includes the synchronization of the data sources, the individual selection of required datasets, and the execution of pre-processing procedures. Thus, the mDCS increases the availability of the processing cloud and, in consequence, of the higher-level data-fusion procedures. The developed system can be easily adapted to changing health-care applications by using different sensor combinations. The complex processing for data analysis can thereby be supported, and intervention measures can be provided.
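As a rough illustration of the low-level fusion step (the source names, timestamps, and tolerance below are invented, not taken from the paper), the Python sketch aligns samples from independently clocked sources onto a common time grid, the kind of synchronization an mDCS would apply before forwarding data to the processing cloud.

```python
# Align samples from independently clocked sources to a common time grid,
# picking the nearest sample within a tolerance (None marks a gap).

def align(samples, grid, tolerance=1.0):
    """For each grid time, return the nearest sample value within tolerance."""
    aligned = []
    for t in grid:
        nearest = min(samples, key=lambda s: abs(s[0] - t))
        aligned.append(nearest[1] if abs(nearest[0] - t) <= tolerance else None)
    return aligned

grid = [0.0, 2.0, 4.0, 6.0]                     # common 2-second timeline
heart_rate = [(0.1, 71), (2.2, 73), (5.9, 75)]  # (timestamp s, value) pairs
step_count = [(0.0, 0), (1.9, 12), (4.1, 30), (6.0, 41)]

fused = {
    "t": grid,
    "heart_rate": align(heart_rate, grid),      # [71, 73, None, 75]
    "steps": align(step_count, grid),           # [0, 12, 30, 41]
}
print(fused)
```

Synchronizing and gap-marking on the device keeps the cloud pipeline simple: downstream fusion procedures receive one regular, merged time series per user.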


2015 ◽  
Vol 26 (2) ◽  
pp. 14-31 ◽  
Author(s):  
Alejandro Maté ◽  
Hector Llorens ◽  
Elisa de Gregorio ◽  
Roberto Tardío ◽  
David Gil ◽  
...  

The huge amount of information available and its heterogeneity have surpassed the capacity of current data management technologies. Dealing with huge amounts of structured and unstructured data, often referred to as Big Data, is a hot research topic and a technological challenge. In this paper, the authors present an approach aimed at enabling OLAP queries over different, heterogeneous data sources. Their approach is based on a MapReduce paradigm, which integrates different formats into the recent RDF Data Cube format. The benefit of their approach is that it can query different sources of information while maintaining, at the same time, an integrated, comprehensive view of the available data. The paper discusses the advantages and disadvantages, as well as the implementation challenges, that such an approach presents. Furthermore, the approach is evaluated in detail by means of a case study.
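A minimal Python sketch of the idea follows (this is not the authors' implementation; the source schemas and the ex: namespace are invented, while qb:Observation is from the W3C RDF Data Cube vocabulary). A map phase normalizes heterogeneous records into (dimensions, measure) pairs, and a reduce phase aggregates them into Data-Cube-style observations that an OLAP layer could query.

```python
# Map heterogeneous records to a common (dimensions, measure) shape, reduce
# by dimension key, and emit RDF Data Cube style observations.
from collections import defaultdict

def map_csv(row):          # source 1: CSV-like dict
    return ((row["region"], row["year"]), float(row["sales"]))

def map_json(doc):         # source 2: JSON-like dict with a different schema
    return ((doc["geo"], doc["period"]), float(doc["amount"]))

mapped = [map_csv({"region": "ES", "year": "2014", "sales": "120.0"}),
          map_json({"geo": "ES", "period": "2014", "amount": "80.0"}),
          map_csv({"region": "FR", "year": "2014", "sales": "95.0"})]

reduced = defaultdict(float)          # reduce: sum the measure per key
for key, value in mapped:
    reduced[key] += value

# Emit qb:Observation triples (Turtle-like strings, for illustration only).
for i, ((region, year), total) in enumerate(sorted(reduced.items())):
    print(f"ex:obs{i} a qb:Observation ;")
    print(f"    ex:refArea \"{region}\" ; ex:refPeriod \"{year}\" ;")
    print(f"    ex:sales {total} .")
```

Because both sources collapse to the same dimension tuples, the ES figures from the CSV and JSON inputs merge into a single observation, which is exactly the integrated view the approach aims for.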

