Better than blocking: building and using multiple variable pattern indexes (MVPi) to screen data sources for possible matches and reduce linkage search space

Author(s):  
Sigurd Hermansen

Introduction
Scaling up data linkage presents a challenging problem that has no straightforward solution. Lacking a prescribed ID in common between two data sources, the number of records to compare increases geometrically with data volume. Data linkers have for the most part resorted to “blocking” on demographic or other identifying variables.

Objectives and Approach
Among the more efficient of the better-blocking methods, carefully constructed multiple variable pattern indexes (MVPi) offer a robust and efficient method for reducing linkage search spaces in Big Data. This realistic, large-scale demonstration of MVPi combines 30,156 SSA Death Master File (DMF) and NDI matches on SSN with equal dates of death (true matches) and 16,332 DMF records with different or missing SSNs, and links the total of 46,448 records on names, date of birth, and postal code (ignoring SSN) to >94MM DMF records. The proportion of true matches not linked tests for loss of information during the blocking phase of data linkage.

Results
Blocking has an obvious cost in terms of completeness of linkage: an error in a single blocking variable means that blocking will miss true matches. Remedies for this problem usually add more blocking variables and stages, or make better use of the information in blocking variables, requiring more time and computing resources. MVPi screening makes fuller use of the information in blocking variables, and does so in this demonstration on one cohort (>30K) and the DMF (>94MM) data set. In this acid-test demonstration with few identifying variables and messy data, MVPi screening failed to link fewer than eight percent of the cohort records to their corresponding true matches in the SSA DMF. MVPi screening reduced trillions of possible pairs requiring comparison to a manageable 83MM.

Conclusion/Implications
The screening phase of a large-scale linkage project reduces the linkage search space to the pairs of records more likely to be true matches, but it may also lead to selectivity bias and underestimates of the sensitivity (recall) of data linkage methods. Efficient MVPi screening allows fuller use of identifying information.
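The abstract describes MVPi only at a high level; below is a minimal Python sketch of the general idea, with hypothetical field names and key patterns. Each record emits several keys, each combining partial information from two identifiers, so that an error in any single field still leaves other keys on which a true match can be found.

```python
from collections import defaultdict

# Hypothetical records; field names and key patterns are illustrative,
# not taken from the paper.
cohort = [
    {"id": 1, "last": "HERMANSEN", "first": "SIGURD", "dob": "1950-03-14", "zip": "20901"},
]
dmf = [
    {"id": 2, "last": "HERMANSON", "first": "SIGURD", "dob": "1950-03-14", "zip": "20901"},
]

def pattern_keys(rec):
    """Emit several keys, each combining partial information from two
    identifiers, so an error in any single field leaves other keys intact."""
    return {
        ("LD", rec["last"][:4], rec["dob"]),   # surname fragment + birth date
        ("LZ", rec["last"][:4], rec["zip"]),   # surname fragment + postal code
        ("FD", rec["first"][:4], rec["dob"]),  # given-name fragment + birth date
        ("DZ", rec["dob"], rec["zip"]),        # birth date + postal code
    }

def build_index(recs):
    """Index every record under each of its pattern keys."""
    index = defaultdict(list)
    for rec in recs:
        for key in pattern_keys(rec):
            index[key].append(rec["id"])
    return index

def screen(index, recs):
    """A pair survives screening if the records share at least one key."""
    return {(a, rec["id"]) for rec in recs
            for key in pattern_keys(rec) for a in index.get(key, ())}

print(screen(build_index(cohort), dmf))  # {(1, 2)} despite the surname typo
```

Because a candidate pair needs only one shared key rather than agreement on every blocking variable, this kind of index tolerates errors in individual identifiers while still discarding the vast majority of the cross-product of record pairs.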

2021 ◽  
Author(s):  
Manuel Fritz ◽  
Michael Behringer ◽  
Dennis Tschechlov ◽  
Holger Schwarz

Abstract
Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results in exploratory clustering analyses, the parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit long runtimes, in particular when large datasets are analyzed using clustering algorithms with super-polynomial runtimes, which repeatedly need to be executed within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time with respect to the defined search space, i.e., it provably requires fewer executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses in order to significantly accelerate the (repeated) execution of clustering algorithms on large datasets. Third, we show how these challenges can be tackled at the same time. To the best of our knowledge, this is the first work that simultaneously addresses all of the above-mentioned challenges. In our comprehensive evaluation, we show that our proposed methods significantly outperform state-of-the-art methods, especially supporting novice analysts in exploratory clustering analyses within large-scale exploration processes.
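The abstract gives no pseudocode; the sketch below is a simplified illustration of a LOG-Means-style search, assuming scikit-learn's KMeans as the underlying algorithm. It probes geometrically spaced values of k and then recursively refines the interval with the sharpest drop in clustering error, so it needs far fewer clustering runs than a linear sweep over the search space.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def sse(X, k):
    """Sum of squared errors for a single k-means run."""
    return KMeans(n_clusters=k, n_init=3, random_state=0).fit(X).inertia_

def estimate_k(X, k_low, k_high):
    """Simplified LOG-Means-style search: probe geometrically spaced k,
    then repeatedly bisect the interval with the largest SSE ratio."""
    ks = sorted({k_low, k_high} | {int(k) for k in np.geomspace(k_low, k_high, num=8)})
    cache = {k: sse(X, k) for k in ks}
    while True:
        ks = sorted(cache)
        # Ratio of consecutive SSEs; a sharp drop marks a plausible true k.
        ratios = [(cache[a] / cache[b], a, b) for a, b in zip(ks, ks[1:])]
        _, a, b = max(ratios)
        if b - a <= 1:
            return b
        mid = (a + b) // 2
        cache[mid] = sse(X, mid)  # one extra clustering run per refinement

X, _ = make_blobs(n_samples=2000, centers=12, random_state=0)
print(estimate_k(X, 2, 64))  # expected to land near 12 on this easy data
```

This is only the search-strategy idea; the paper's contributions around accelerating repeated executions on large datasets are not reproduced here.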


Author(s):  
Vikram Sorathia

In recent years, our sensing capability has increased manifold. Developments in sensor technology, telecommunications, computer networking, and distributed computing have created strong grounds for building sensor networks that are now reaching global scales (Balazinska et al., 2007). As data sources multiply, the task of processing and analysis has gone beyond the capabilities of conventional desktop data processing tools. For quite a long time, data were assumed to be available on a single user's desktop, where handling, processing, and analysis were carried out single-handedly. With the proliferation of streaming data sources and near real-time applications, it has become important to provide automated identification and attribution of data sets derived from such diverse sources. For the sharing and reuse of such diverse data sets, information about the source of the data, its ownership, time-stamps, accuracy-related details, and the processes and transformations applied to it has become essential. The data that provide such information about a given data set are known as metadata. The need is recognized for creating and handling metadata as an integrated part of large-scale systems. In response to the information requirements of the scientific and research community, efforts toward building global data commons have come into existence (Onsrud & Campbell, 2007). A special type of service is required to address issues such as the explication of licensing and Intellectual Property Rights, standards-based automated generation of metadata, data provenance, archival, and peer review. While each of these topics is an individual research area, the present article focuses only on data provenance.
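As a concrete illustration of the kind of record the article calls for, the sketch below attaches provenance metadata, with source, ownership, licensing, time-stamps, and transformation history, to a derived data set. The schema is hypothetical, not a published standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata for one derived data set."""
    source: str                    # where the raw data came from
    owner: str                     # responsible party
    licence: str                   # licensing / IPR statement
    created: str                   # ISO time-stamp of derivation
    transformations: list = field(default_factory=list)

    def add_step(self, description: str):
        """Append a transformation step with its own time-stamp."""
        stamp = datetime.now(timezone.utc).isoformat()
        self.transformations.append({"when": stamp, "what": description})

prov = ProvenanceRecord(
    source="sensor-network-feed-42",        # illustrative identifier
    owner="Example Observatory",
    licence="CC-BY-4.0",
    created=datetime.now(timezone.utc).isoformat(),
)
prov.add_step("dropped readings with missing time-stamps")
prov.add_step("resampled to 5-minute averages")
print(prov)
```

Carrying such a record alongside each derived data set lets downstream users trace lineage and judge fitness for reuse without contacting the original producer.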


2019 ◽  
Vol 26 (2) ◽  
pp. 153-158 ◽  
Author(s):  
Yifan Zhang ◽  
Erin E Holsinger ◽  
Lea Prince ◽  
Jonathan A Rodden ◽  
Sonja A Swanson ◽  
...  

Background
Virtually all existing evidence linking access to firearms to elevated risks of mortality and morbidity comes from ecological and case–control studies. To improve understanding of the health risks and benefits of firearm ownership, we launched a cohort study: the Longitudinal Study of Handgun Ownership and Transfer (LongSHOT).

Methods
Using probabilistic matching techniques, we linked three sources of individual-level, state-wide data in California: official voter registration records, an archive of lawful handgun transactions, and all-cause mortality data. There were nearly 28.8 million unique voter registrants, 5.5 million handgun transfers, and 3.1 million deaths during the study period (18 October 2004 to 31 December 2016). The linkage relied on several identifying variables (first, middle, and last names; date of birth; sex; residential address) that were available in all three data sets, deploying them in a series of bespoke algorithms.

Results
Assembly of the LongSHOT cohort commenced in January 2016 and was completed in March 2019. Approximately three-quarters of the matches identified were exact matches on all link variables. The cohort consists of 28.8 million adult residents of California followed for up to 12.2 years. A total of 1.2 million cohort members purchased at least one handgun during the study period, and 1.6 million died.

Conclusions
Three steps taken early may be particularly useful in enhancing the efficiency of large-scale data linkage: thorough data cleaning; assessment of the suitability of off-the-shelf data linkage packages relative to bespoke coding; and careful consideration of the minimum sample size and matching precision needed to support rigorous investigation of the study questions.
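The bespoke algorithms themselves are not given in the abstract; the sketch below only illustrates the general two-pass shape such a linkage can take, with hypothetical field values and a standard-library similarity measure standing in for probabilistic match weights: an exact pass on all link variables (which in LongSHOT accounted for roughly three-quarters of matches), then a scored fuzzy pass for the remainder.

```python
from difflib import SequenceMatcher

LINK_VARS = ["first", "middle", "last", "dob", "sex", "address"]  # per the abstract

def exact_match(a, b):
    """First pass: agreement on every link variable."""
    return all(a[v] == b[v] for v in LINK_VARS)

def match_score(a, b):
    """Second pass: mean string similarity across link variables,
    a crude stand-in for weighted probabilistic scoring."""
    sims = [SequenceMatcher(None, a[v], b[v]).ratio() for v in LINK_VARS]
    return sum(sims) / len(sims)

voter = {"first": "JANE", "middle": "Q", "last": "DOE",
         "dob": "1970-01-01", "sex": "F", "address": "12 OAK ST"}
buyer = {"first": "JANE", "middle": "Q", "last": "DOE",
         "dob": "1970-01-01", "sex": "F", "address": "12 OAK STREET"}

if exact_match(voter, buyer):
    print("exact match")
elif match_score(voter, buyer) > 0.9:
    print("probable match:", round(match_score(voter, buyer), 3))
```

In a real pipeline the fuzzy pass would run only on blocked candidate pairs, and the acceptance threshold would be tuned against clerically reviewed samples.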


Author(s):  
Lior Shamir

Abstract
Several recent observations using large data sets of galaxies have shown a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to interact gravitationally. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different method. Both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence yields a dipole axis with statistical significance of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\circ}, \delta=47^{\circ})$ and is well within the $1\sigma$ error range of the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\circ}, \delta=61^{\circ})$.
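A toy sketch of the cosine-dependence fit described above, using randomly generated sky positions and spin signs in place of the real catalogue: for each candidate axis, the spin signs are regressed on the cosine of the angular distance to that axis, and the axis with the largest fitted dipole amplitude is reported. The grid resolution and least-squares form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in catalogue: RA/Dec in radians and spin sign (+1 / -1).
ra = rng.uniform(0, 2 * np.pi, 5000)
dec = rng.uniform(-np.pi / 2, np.pi / 2, 5000)
spin = rng.choice([-1, 1], size=5000)

def unit_vectors(ra, dec):
    """Convert sky coordinates to Cartesian unit vectors."""
    return np.stack([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)], axis=-1)

gal = unit_vectors(ra, dec)

def dipole_amplitude(axis_ra, axis_dec):
    """Least-squares amplitude d in the model spin ~ d * cos(angle to axis)."""
    axis = unit_vectors(np.array([axis_ra]), np.array([axis_dec]))[0]
    cos_phi = gal @ axis
    return (spin * cos_phi).sum() / (cos_phi ** 2).sum()

# Coarse grid search over candidate dipole axes.
grid = [(a, d, abs(dipole_amplitude(a, d)))
        for a in np.linspace(0, 2 * np.pi, 36)
        for d in np.linspace(-np.pi / 2, np.pi / 2, 18)]
best = max(grid, key=lambda t: t[2])
print("most likely axis (rad):", best[:2], "amplitude:", round(best[2], 4))
```

On random spins, as here, the fitted amplitude should hover near zero; significance in the paper is assessed by comparing the fitted amplitude to such randomized baselines.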


Epidemiologia ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 315-324
Author(s):  
Juan M. Banda ◽  
Ramya Tekumalla ◽  
Guanyu Wang ◽  
Jingyuan Yu ◽  
Tuo Liu ◽  
...  

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic allows other scientists to learn from local experiences and from data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics in such a unique worldwide event for biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets related to COVID-19 chatter, generated from 1 January 2020 to 27 June 2021 at the time of writing and growing daily. It provides a freely available additional data source for researchers worldwide to conduct a wide and diverse range of research projects, such as epidemiological analyses, studies of emotional and mental responses to social distancing measures, identification of sources of misinformation, and stratified measurement of sentiment towards the pandemic in near real time, among many others.
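A brief sketch of how such a dataset might be used, with a stand-in data frame and illustrative column names (the actual file layout may differ): since Twitter's terms generally permit releasing only tweet IDs, a typical workflow filters the dehydrated records and writes ID lists for a hydration tool such as twarc.

```python
import pandas as pd

# Stand-in for the released file, which ships tweet IDs plus light
# metadata; the column names here are assumptions for illustration.
chatter = pd.DataFrame({
    "tweet_id": [1212345678901234567, 1245678901234567890, 1278901234567890123],
    "date": ["2020-01-01", "2020-04-02", "2020-07-03"],
    "lang": ["en", "en", "es"],
})

# Restrict to English tweets from the first pandemic wave.
wave1 = chatter[(chatter["lang"] == "en") &
                (chatter["date"].between("2020-03-01", "2020-05-31"))]

# Emit a plain ID list suitable for a hydration tool.
wave1["tweet_id"].to_csv("wave1_ids.txt", index=False, header=False)
print(len(wave1), "tweet IDs written")
```

From the hydrated tweets, downstream analyses such as sentiment stratification or misinformation tracing proceed on the full text and user metadata.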


2021 ◽  
Vol 13 (3) ◽  
pp. 1274
Author(s):  
Loau Al-Bahrani ◽  
Mehdi Seyedmahmoudian ◽  
Ben Horan ◽  
Alex Stojcevski

Few non-traditional optimization techniques have been applied to the dynamic economic dispatch (DED) of large-scale thermal power units (TPUs), e.g., 1000 TPUs, considering the effects of valve-point loading with ramp-rate limitations. This is a complicated multi-mode problem. In this investigation, a novel optimization technique, namely a multi-gradient particle swarm optimization (MG-PSO) algorithm with two stages for exploring and exploiting the search space, is employed as the optimization tool. The M particles (explorers) in the first stage are used to explore new neighborhoods, whereas the M particles (exploiters) in the second stage are used to exploit the best neighborhood. The negative gradient variation of the M particles in both stages balances the global and local search capabilities. The algorithm is validated on five medium-scale to very large-scale power systems. The MG-PSO algorithm effectively reduces the difficulty of handling the large-scale DED problem, and simulation results confirm its suitability for such a complicated multi-objective problem across varying fitness, performance measures, and consistency. The algorithm is also applied to estimate the generation required over 24 h to meet load demand changes. This investigation provides useful technical references for economic dispatch operators to update their power system programs in order to achieve economic benefits.
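The MG-PSO update rules are not spelled out in the abstract; the sketch below is a plain global-best PSO on a small static dispatch problem with valve-point cost terms and illustrative coefficients, to show the shape of the objective being optimized. The two-stage explorer/exploiter split and the negative-gradient coefficient schedule of MG-PSO are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three illustrative units: cost F(P) = a + b*P + c*P^2 + |e*sin(f*(Pmin - P))|
a = np.array([550.0, 309.0, 307.0])
b = np.array([8.10, 8.10, 8.10])
c = np.array([0.00028, 0.00056, 0.00056])
e = np.array([300.0, 200.0, 200.0])
f = np.array([0.035, 0.042, 0.042])
p_min = np.array([0.0, 0.0, 0.0])
p_max = np.array([680.0, 360.0, 360.0])
demand = 850.0  # MW

def cost(P):
    """Fuel cost with valve-point ripple plus a power-balance penalty."""
    fuel = a + b * P + c * P**2 + np.abs(e * np.sin(f * (p_min - P)))
    penalty = 1e3 * abs(P.sum() - demand)
    return fuel.sum() + penalty

n, iters = 40, 300
pos = rng.uniform(p_min, p_max, size=(n, 3))
vel = np.zeros((n, 3))
pbest, pbest_val = pos.copy(), np.array([cost(p) for p in pos])
gbest = pbest[pbest_val.argmin()]
for _ in range(iters):
    r1, r2 = rng.random((n, 3)), rng.random((n, 3))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, p_min, p_max)       # enforce unit limits
    vals = np.array([cost(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()]
print("dispatch:", gbest.round(1), "cost:", round(cost(gbest), 1))
```

The non-smooth sine term is what makes the problem multi-modal and motivates population-based methods; a dynamic dispatch adds ramp-rate coupling between consecutive hours on top of this objective.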


Algorithms ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 154
Author(s):  
Marcus Walldén ◽  
Masao Okita ◽  
Fumihiko Ino ◽  
Dimitris Drikakis ◽  
Ioannis Kokkinakis

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data, and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method's efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can expeditiously identify regions of interest, even when multiple metrics are used. Our approach achieved a speedup of 1.29× in a lossless scenario, and data decompression time was sped up by 2× compared to using a single compression method uniformly.
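A toy sketch of importance-driven, mixed compression on a synthetic 2D field standing in for simulation data: each block is scored by its mean gradient magnitude, the top quarter of blocks is compressed losslessly, and the rest are quantized to 8 bits first. The metric, block size, and threshold are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)
field = rng.normal(size=(256, 256)).astype(np.float32)
field[96:160, 96:160] += 10.0  # a "region of interest" with strong gradients

def importance(block):
    """Importance metric: mean gradient magnitude within the block."""
    gy, gx = np.gradient(block)
    return float(np.hypot(gx, gy).mean())

def compress_block(block, important):
    """Lossless zlib for important blocks; 8-bit quantization + zlib otherwise."""
    if important:
        return b"F" + zlib.compress(block.tobytes())
    lo, hi = float(block.min()), float(block.max())
    q = np.round((block - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
    return b"Q" + zlib.compress(q.tobytes())

blocks = [field[i:i+64, j:j+64] for i in range(0, 256, 64) for j in range(0, 256, 64)]
scores = [importance(blk) for blk in blocks]
cutoff = np.quantile(scores, 0.75)  # keep the top quarter lossless
payload = [compress_block(blk, s >= cutoff) for blk, s in zip(blocks, scores)]
print("compressed:", sum(len(p) for p in payload), "bytes of", field.nbytes)
```

In an in-transit setting, the per-block scoring and codec choice would run on staging nodes, with block assignments rebalanced to even out memory use across them.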

