Data Provenance

Author(s):  
Vikram Sorathia

In recent years, our sensing capability has increased manifold. Developments in sensor technology, telecommunications, computer networking, and distributed computing have created strong grounds for building sensor networks that are now reaching global scales (Balazinska et al., 2007). As data sources multiply, the task of processing and analysis has gone beyond the capabilities of conventional desktop data processing tools. For a long time, data was assumed to be available on a single user desktop, and its handling, processing, and analysis were carried out single-handedly. With the proliferation of streaming data sources and near real-time applications, it has become important to provide for the automated identification and attribution of data sets derived from such diverse sources. Considering the sharing and reuse of such diverse data sets, information about the source of the data, its ownership, time-stamps, accuracy-related details, and the processes and transformations applied to it has become essential. Data that provides such information about a given data set is known as metadata. The need is recognized for creating and handling metadata as an integrated part of large-scale systems. Considering the information requirements of the scientific and research community, efforts towards building global data commons have come into existence (Onsrud & Campbell, 2007). A special type of service is required to address issues such as the explication of licensing and Intellectual Property Rights, standards-based automated generation of metadata, data provenance, archival, and peer review. While each of these topics is being addressed as an individual research area, the present article focuses only on Data Provenance.
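
As a purely illustrative example (not drawn from the article), the kind of metadata described above could be captured in a simple provenance record like the following sketch; all field names and values are hypothetical, chosen only to mirror the elements listed: source, ownership, time-stamps, accuracy, licensing, and transformations.

```python
# Illustrative sketch of a provenance metadata record for a derived data set.
# Field names and values are hypothetical, not taken from the article.
provenance_record = {
    "dataset_id": "river-gauge-levels-2007-06",       # hypothetical identifier
    "source": "sensor-network/station-42",            # where the data came from
    "owner": "Regional Hydrology Agency",              # ownership / attribution
    "created": "2007-06-01T00:00:00Z",                 # acquisition time-stamp
    "accuracy": {"unit": "cm", "tolerance": 0.5},      # accuracy-related details
    "license": "CC-BY-4.0",                            # licensing / IPR explication
    "derived_from": ["raw-gauge-readings-2007-06"],    # upstream data sets
    "transformations": [                                # processes applied, in order
        {"step": "outlier-removal", "tool": "qc-filter",
         "timestamp": "2007-06-02T10:15:00Z"},
        {"step": "hourly-aggregation", "tool": "resampler",
         "timestamp": "2007-06-02T10:20:00Z"},
    ],
}
```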

Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a "blind" feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in both accuracy and efficiency under both sequential and parallel conditions.
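
The abstract does not give the details of the CA, so the following is only a minimal sketch of the two-phase idea: a simplified radius-based covering heuristic stands in for the paper's covering algorithm to discover k and the initial centers, followed by standard Lloyd iterations. The radius rule and all parameter values are assumptions, not the published method.

```python
import numpy as np

def covering_init(X, radius=None):
    """Simplified stand-in for the covering algorithm (CA): greedily pick an
    uncovered point and cover all points within `radius`; the number of
    covers found plays the role of k."""
    if radius is None:
        # Assumed heuristic: a fraction of the data's average spread.
        radius = 0.5 * np.mean(np.std(X, axis=0)) * np.sqrt(X.shape[1])
    uncovered = np.ones(len(X), dtype=bool)
    centers = []
    while uncovered.any():
        idx = np.argmax(uncovered)                       # first uncovered point
        dist = np.linalg.norm(X - X[idx], axis=1)
        covered_now = dist <= radius
        centers.append(X[uncovered & covered_now].mean(axis=0))  # local mean as center
        uncovered &= ~covered_now
    return np.array(centers)

def lloyd(X, centers, n_iter=100):
    """Standard Lloyd iterations of K-means starting from the CA centers."""
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Usage sketch: k is discovered by the covering phase, not preselected.
X = np.random.rand(500, 2)
init_centers = covering_init(X)
labels, centers = lloyd(X, init_centers)
print("self-adapted k =", len(centers))
```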


1990 ◽  
Vol 5 ◽  
pp. 262-272
Author(s):  
William Miller

Paleontologists have lavished much time and energy on description and explanation of large-scale patterns in the fossil record (e.g., mass extinctions, histories of monophyletic taxa, deployment of major biogeographic units), while paying comparatively little attention to biologic patterns preserved only in local stratigraphic sequences. Interpretation of the large-scale patterns will always be seen as the chief justification for the science of paleontology, but solving problems framed by long time spans and large areas is rife with tenuous inference and patterns are prone to varied interpretation by different investigators using virtually the same data sets (as in the controversy over ultimate cause of the terminal Cretaceous extinctions). In other words, the large-scale patterns in the history of life are the true philosophical property of paleontology, but there will always be serious problems in attempting to resolve processes that transpired over millions to hundreds-of-millions of years and encompassed vast areas of seafloor or landscape. By contrast, less spectacular and more commonplace changes in local habitats (often related to larger-scale events and cycles) and attendant biologic responses are closer to our direct experience of the living world and should be easier to interpret unequivocally. These small-scale responses are reflected in the fossil record at the scale of local outcrops.


Author(s):  
Sigurd Hermansen

Introduction: Scaling up data linkage presents a challenging problem that has no straightforward solution. Lacking a prescribed ID in common between two data sources, the number of records to compare increases geometrically with data volume. Data linkers have for the most part resorted to "blocking" on demographic or other identifying variables.
Objectives and Approach: Among the more efficient of the better blocking methods, carefully constructed multiple variable pattern indexes (MVPi) offer a robust and efficient method for reducing linkage search spaces in Big Data. This realistic, large-scale demonstration of MVPi combines 30,156 SSA Death Master File (DMF) and NDI matches on SSN with equal dates of death (true matches) and 16,332 DMF records with different or missing SSN, and links the total of 46,448 records on names, date of birth, and postal code (ignoring SSN) to >94MM DMF records. The proportion of true matches not linked tests for loss of information during the blocking phase of data linkage.
Results: Blocking has an obvious cost in terms of completeness of linkage: any error in a single blocking variable means that blocking will miss true matches. Remedies for this problem usually add more blocking variables and stages, or make better use of information in blocking variables, requiring more time and computing resources. MVPi screening makes fuller use of the information in blocking variables, and does so in this demonstration across one cohort (>30K) and the DMF (>94MM) data sets. In this acid-test demonstration with few identifying variables and messy data, MVPi screening failed to link fewer than eight percent of the cohort records to their corresponding true match in the SSA DMF. MVPi screening reduced trillions of possible pairs requiring comparison to a manageable 83MM.
Conclusion/Implications: The screening phase of a large-scale linkage project reduces the linkage search space to the pairs of records more likely to be true matches, but it may also lead to selectivity bias and underestimates of the sensitivity (recall) of data linkage methods. Efficient MVPi screening allows fuller use of identifying information.
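
The abstract does not spell out how the pattern indexes are constructed, so the following is only a hedged sketch of the general screening idea: derive several pattern keys from identifying variables (name, date of birth, postal code), index records by those keys, and compare only record pairs that share at least one key. The specific key definitions are illustrative assumptions, not the MVPi method itself.

```python
from collections import defaultdict
from itertools import combinations

def pattern_keys(rec):
    """Derive several blocking keys from identifying variables.
    These particular patterns are illustrative assumptions."""
    name = (rec.get("last_name") or "").upper()
    dob = rec.get("dob") or ""            # e.g. "1934-07-21"
    zip3 = (rec.get("postal") or "")[:3]
    keys = set()
    if name and dob:
        keys.add(("NAME4+YEAR", name[:4], dob[:4]))
    if name and zip3:
        keys.add(("NAME4+ZIP3", name[:4], zip3))
    if dob and zip3:
        keys.add(("DOB+ZIP3", dob, zip3))
    return keys

def screen_pairs(records):
    """Return candidate pairs: records sharing at least one pattern key.
    This replaces exhaustive all-pairs comparison with a much smaller set."""
    index = defaultdict(list)
    for i, rec in enumerate(records):
        for key in pattern_keys(rec):
            index[key].append(i)
    candidates = set()
    for ids in index.values():
        for a, b in combinations(ids, 2):
            candidates.add((a, b))
    return candidates

# Usage sketch with toy (hypothetical) records:
records = [
    {"last_name": "Smith", "dob": "1934-07-21", "postal": "27514"},
    {"last_name": "Smyth", "dob": "1934-07-21", "postal": "27514"},
    {"last_name": "Jones", "dob": "1950-01-02", "postal": "10001"},
]
print(screen_pairs(records))  # Smith/Smyth share the DOB+ZIP3 key; Jones is screened out
```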


2018 ◽  
Vol 14 (5) ◽  
pp. 155014771877956
Author(s):  
Lu Sun ◽  
Wei Zhou ◽  
Jian Guan ◽  
You He

Approaches to vessel recognition are mostly accomplished by sensing targets and extracting target features, without taking advantage of spatial and temporal motion features. With maritime situation management systems widely applied, vessels' spatial and temporal state information can be obtained from many kinds of distributed sensors; such data are easy to accumulate over long periods but are often left forgotten in databases. In order to extract valuable information from large-scale stored trajectories for unknown vessel recognition, this article proposes a spatial and temporal constrained trajectory similarity model and a mining algorithm based on it, which work by searching for trajectories with similar motion features. Based on the idea of finding matching points between trajectories, baseline matching points are first defined to provide a time reference for trajectories recorded at different times; the almost-matching points are then obtained by applying the spatial and temporal constraints, and the similarity of pairwise almost-matching points is defined, from which the spatial and temporal similarity of trajectories is derived. By searching for matching points between trajectories, the similar motion pattern is extracted. Experiments on real data sets show that the proposed algorithm is useful for mining similar moving behavior from historic trajectories, that the motion feature strengthens as trajectory length increases, and that the support for vessels with unknown properties is larger than that of other models.
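
The exact definitions of the baseline and almost-matching points are not given in the abstract, so the following is only a rough sketch of the general idea: for each point of the shorter trajectory, find its closest point in the other trajectory that satisfies assumed temporal and spatial constraints, score that pair by its distance, and average the scores into a trajectory-level similarity. The per-point best match, the thresholds, and the scoring form are all assumptions standing in for the paper's definitions.

```python
import math

def best_match_score(point, other_traj, dt_max, dist_max):
    """Similarity of one point to its closest almost-matching point in the other
    trajectory, under assumed temporal (dt_max) and spatial (dist_max) constraints.
    Returns 0.0 when no point of the other trajectory satisfies both constraints."""
    t, x, y = point
    best = None
    for (tb, xb, yb) in other_traj:
        if abs(t - tb) <= dt_max:
            d = math.hypot(x - xb, y - yb)
            if d <= dist_max and (best is None or d < best):
                best = d
    return 0.0 if best is None else 1.0 - best / dist_max

def trajectory_similarity(traj_a, traj_b, dt_max=60.0, dist_max=500.0):
    """Average per-point similarities over the shorter trajectory; stays in [0, 1].
    Trajectories are lists of (t, x, y) tuples in a common time reference."""
    short, long_ = (traj_a, traj_b) if len(traj_a) <= len(traj_b) else (traj_b, traj_a)
    scores = [best_match_score(p, long_, dt_max, dist_max) for p in short]
    return sum(scores) / len(scores) if scores else 0.0

# Usage sketch with two toy trajectories (t in seconds, x/y in metres):
a = [(0, 0, 0), (60, 100, 0), (120, 200, 0)]
b = [(5, 10, 20), (65, 110, 10), (125, 600, 0)]
print(round(trajectory_similarity(a, b), 3))
```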


2012 ◽  
Vol 174-177 ◽  
pp. 1927-1930 ◽  
Author(s):  
Tao Shang ◽  
Shui Peng Zhang

Shadow rendering has long faced a trade-off between quality and performance. Various algorithms have been proposed to ameliorate this problem, and the shadow map is representative among them. Although shadow maps are widely used for shadows in three-dimensional scenes, imperfections such as aliasing remain. The focus of this paper is therefore an algorithm, based on shadow maps, that rapidly and intelligently layers the shadow data sets of large-scale buildings. First, the fragments that create the shadow are determined by the two passes of shadow mapping. Second, the floating-point data in the depth buffer are normalized and the two depth layers are rendered into a texture; a Gaussian filter is then applied to blur them. Finally, the BIRCH algorithm clusters the normalized data to improve the blurring and tweening effect. This method reduces the aliasing problem with low overhead and improves performance to a certain extent.
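
As a rough illustration of the post-processing steps described (normalizing the depth data, Gaussian blurring, then clustering with BIRCH), the sketch below operates on a plain NumPy array standing in for the depth buffer; the rendering passes themselves and all parameter values are outside the abstract and are assumptions here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import Birch

# Hypothetical depth buffer (float values), standing in for the shadow-map output.
depth = np.random.rand(256, 256).astype(np.float32)

# Step 1 (assumed form of "uniformization"): normalize depths to [0, 1].
d_min, d_max = depth.min(), depth.max()
depth_norm = (depth - d_min) / (d_max - d_min + 1e-8)

# Step 2: Gaussian filtering to blur hard shadow edges (sigma is an assumption).
depth_blurred = gaussian_filter(depth_norm, sigma=2.0)

# Step 3: BIRCH clustering of the normalized depth values to layer the shadow data.
samples = depth_blurred.reshape(-1, 1)           # one feature per pixel
birch = Birch(n_clusters=3, threshold=0.05)      # 3 layers is an illustrative choice
labels = birch.fit_predict(samples).reshape(depth.shape)

print("pixels per layer:", np.bincount(labels.ravel()))
```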


2020 ◽  
Vol 10 (2) ◽  
pp. 103-106
Author(s):  
ASTEMIR ZHURTOV

Cruel and inhumane acts that harm human life and health, as well as humiliate human dignity, are prohibited in most countries of the world, and Russia is no exception in this respect. The article presents an analysis of the institution of responsibility for torture in the Russian Federation. The author comes to the conclusion that the current criminal law of Russia regulates liability for torture superficially and in a fragmentary manner, in connection with which the author formulates proposals to define such acts as an independent crime. In the context of modern globalization, the world community pays special attention to the protection of human rights, and large-scale international standards were created long ago to this end. The Universal Declaration of Human Rights and other international acts enshrine prohibitions of cruel and inhumane acts that harm human life and health, as well as degrade human dignity. Considering the historical experience of the past, these standards focus on the prohibition of any kind of torture, regardless of the purpose of its implementation.


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim 8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ, \delta=47^\circ)$ and is well within the $1\sigma$ error range of the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ, \delta=61^\circ)$.
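
The fitting procedure is not detailed in the abstract; the following is only a simplified NumPy sketch of the general approach of fitting spin-direction asymmetry to a cosine dependence over a grid of candidate dipole axes. The grid resolution, the one-parameter least-squares form, and the toy data are assumptions, not the paper's exact method or statistics.

```python
import numpy as np

def angular_cos(ra, dec, ra0, dec0):
    """Cosine of the angular distance between galaxies (ra, dec) and a
    candidate dipole axis (ra0, dec0); all angles in radians."""
    return (np.sin(dec) * np.sin(dec0)
            + np.cos(dec) * np.cos(dec0) * np.cos(ra - ra0))

def dipole_fit(ra, dec, spin, n_grid=36):
    """For each candidate axis on a coarse sky grid, fit spin (+1/-1) to
    A * cos(angular distance) by least squares; keep the axis with the
    largest |A|. Returns the best axis (degrees) and its amplitude."""
    best = (None, None, 0.0)
    for ra0 in np.linspace(0, 2 * np.pi, n_grid, endpoint=False):
        for dec0 in np.linspace(-np.pi / 2, np.pi / 2, n_grid // 2):
            c = angular_cos(ra, dec, ra0, dec0)
            amp = np.sum(c * spin) / np.sum(c * c)    # one-parameter least squares
            if abs(amp) > abs(best[2]):
                best = (np.degrees(ra0), np.degrees(dec0), amp)
    return best

# Toy usage: random galaxy positions with a weak injected dipole signal.
rng = np.random.default_rng(0)
ra = rng.uniform(0, 2 * np.pi, 5000)
dec = np.arcsin(rng.uniform(-1, 1, 5000))
true_axis = (np.radians(78.0), np.radians(47.0))      # hypothetical axis
p = 0.5 + 0.05 * angular_cos(ra, dec, *true_axis)     # spin probability
spin = np.where(rng.uniform(size=ra.size) < p, 1, -1)
print(dipole_fit(ra, dec, spin))
```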


Epidemiologia ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 315-324
Author(s):  
Juan M. Banda ◽  
Ramya Tekumalla ◽  
Guanyu Wang ◽  
Jingyuan Yu ◽  
Tuo Liu ◽  
...  

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics in such a unique worldwide event in biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter and generated from 1 January 2020 to 27 June 2021 at the time of writing. This freely available data source allows researchers worldwide to conduct a wide and diverse range of research projects, such as epidemiological analyses, studies of emotional and mental responses to social distancing measures, the identification of sources of misinformation, and stratified measurement of sentiment towards the pandemic in near real time, among many others.


Algorithms ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 154
Author(s):  
Marcus Walldén ◽  
Masao Okita ◽  
Fumihiko Ino ◽  
Dimitris Drikakis ◽  
Ioannis Kokkinakis

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can expeditiously identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario. The data decompression time was sped up by 2× compared to using a single compression method uniformly.
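
The abstract does not define the importance metrics or the compression methods used, so the sketch below is only a hedged illustration of the general idea: score each block of a simulation field with a simple variance-based importance metric (an assumption), then keep high-importance blocks lossless while quantizing low-importance blocks before compression.

```python
import numpy as np
import zlib

def block_importance(block):
    """Assumed importance metric: local variance (regions with more structure
    score higher). The paper's actual metrics are not specified in the abstract."""
    return float(np.var(block))

def compress_field(field, block=32, threshold=1e-3):
    """Split a 2D field into blocks, compress important blocks losslessly and
    quantize unimportant ones to 8 bits before compression."""
    payloads = []
    for i in range(0, field.shape[0], block):
        for j in range(0, field.shape[1], block):
            tile = field[i:i + block, j:j + block]
            if block_importance(tile) >= threshold:
                data = tile.astype(np.float32).tobytes()           # lossless path
                mode = "lossless"
            else:
                lo, hi = tile.min(), tile.max()
                q = np.round(255 * (tile - lo) / (hi - lo + 1e-12)).astype(np.uint8)
                data = q.tobytes()                                  # lossy 8-bit path
                mode = "lossy"
            payloads.append((i, j, mode, zlib.compress(data)))
    return payloads

# Usage sketch on a toy field with one localized feature (region of interest).
field = np.zeros((128, 128), dtype=np.float32)
field[40:70, 40:70] = np.random.rand(30, 30)
payloads = compress_field(field)
total = sum(len(p[-1]) for p in payloads)
print(f"{sum(1 for p in payloads if p[2] == 'lossless')} lossless blocks, "
      f"{total} compressed bytes vs {field.nbytes} raw bytes")
```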

