Network-based Rendering Techniques for Large-scale Volume Data Sets

Author(s):  
Joerg Meyer ◽  
Ragnar Borg ◽  
Bernd Hamann ◽  
Kenneth I. Joy ◽  
Arthur J. Olson


Author(s):  
Sigurd Hermansen

Introduction Scaling up data linkage presents a challenging problem with no straightforward solution. Lacking a shared ID between two data sources, the number of record pairs to compare grows geometrically with data volume. Data linkers have for the most part resorted to “blocking” on demographic or other identifying variables. Objectives and Approach Among the more efficient blocking methods, carefully constructed multiple variable pattern indexes (MVPi) offer a robust and efficient way to reduce linkage search spaces in Big Data. This realistic, large-scale demonstration of MVPi combines 30,156 SSA Death Master File (DMF) and NDI matches on SSN with equal dates of death (true matches) and 16,332 DMF records with different or missing SSN, and links the total of 46,448 records on names, date of birth, and postal code (ignoring SSN) to >94MM DMF records. The proportion of true matches not linked tests for loss of information during the blocking phase of data linkage. Results Blocking has an obvious cost in terms of completeness of linkage: an error in a single blocking variable means that blocking will miss true matches. The usual remedies add more blocking variables and stages, or make better use of the information in blocking variables, requiring more time and computing resources. MVPi screening makes fuller use of the information in blocking variables; in this demonstration it does so on a cohort (>30K) and the DMF (>94MM) data sets. In this acid-test demonstration with few identifying variables and messy data, MVPi screening failed to link fewer than eight percent of the cohort records to their corresponding true matches in the SSA DMF. MVPi screening reduced trillions of possible pairs requiring comparison to a manageable 83MM. Conclusion/Implications The screening phase of a large-scale linkage project reduces the linkage search space to the pairs of records more likely to be true matches, but it may also lead to selectivity bias and underestimates of the sensitivity (recall) of data linkage methods. Efficient MVPi screening allows fuller use of identifying information.
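
As a rough illustration of the blocking idea behind MVPi, the sketch below builds several redundant keys per record from subsets of the identifying variables (names, date of birth, postal code), so that an error in any single field does not hide a true match. The specific key construction is invented for illustration and is not the authors' actual pattern-index implementation.

```python
# Illustrative sketch of multi-variable blocking-key screening
# (not the authors' actual MVPi implementation).
from collections import defaultdict
from itertools import combinations

def pattern_keys(rec):
    """Build redundant blocking keys from name, DOB, and postal code."""
    parts = {
        "ln": rec["last_name"][:4].upper(),
        "fn": rec["first_name"][:1].upper(),
        "dob": rec["dob"],          # e.g. "1942-07-15"
        "zip": rec["postal"][:3],
    }
    # Any 3 of the 4 fields must agree for a candidate pair,
    # tolerating one erroneous or missing field.
    return {tuple(sorted((k, parts[k]) for k in combo))
            for combo in combinations(parts, 3)}

def candidate_pairs(cohort, dmf):
    """Index the large file once, then probe it with the small cohort."""
    index = defaultdict(list)
    for j, rec in enumerate(dmf):
        for key in pattern_keys(rec):
            index[key].append(j)
    pairs = set()
    for i, rec in enumerate(cohort):
        for key in pattern_keys(rec):
            pairs.update((i, j) for j in index.get(key, ()))
    return pairs  # vastly smaller than len(cohort) * len(dmf)
```

Indexing the >94MM-record file once and probing it with the cohort is what turns trillions of potential comparisons into a tractable candidate set.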


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. Both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence yields a dipole axis with significance of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ,\delta=47^\circ)$ and is well within the $1\sigma$ error range of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ,\delta=61^\circ)$.
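
The cosine fitting described above can be sketched as follows. This is not the paper's code: the 5-degree search grid, the ±1 spin encoding, and the chi-square criterion are simplifying assumptions.

```python
# A minimal sketch of fitting spin-direction asymmetry to a dipole:
# for each candidate axis, the signed spins are regressed against the
# cosine of each galaxy's angular distance from the axis, and the
# lowest chi^2 over the sky grid picks the most likely dipole axis.
import numpy as np

def angular_cos(ra, dec, ra0, dec0):
    """Cosine of the angle between each galaxy and a candidate axis (degrees in)."""
    ra, dec, ra0, dec0 = map(np.radians, (ra, dec, ra0, dec0))
    return (np.sin(dec) * np.sin(dec0)
            + np.cos(dec) * np.cos(dec0) * np.cos(ra - ra0))

def dipole_fit(ra, dec, spin):  # spin = +1 / -1 for the two handedness classes
    best = None
    for ra0 in range(0, 360, 5):
        for dec0 in range(-90, 91, 5):
            x = angular_cos(ra, dec, ra0, dec0)
            amp = np.dot(spin, x) / np.dot(x, x)   # least-squares cosine amplitude
            resid = spin - amp * x
            chi2 = np.dot(resid, resid)
            if best is None or chi2 < best[0]:
                best = (chi2, ra0, dec0, amp)
    return best  # (chi2, alpha, delta, amplitude) of the most likely axis
```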


Algorithms ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 154
Author(s):  
Marcus Walldén ◽  
Masao Okita ◽  
Fumihiko Ino ◽  
Dimitris Drikakis ◽  
Ioannis Kokkinakis

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can expeditiously identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario, and the data decompression time was sped up by 2× compared to using a single compression method uniformly.
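
The core idea, scoring regions and assigning each its own compressor, might look roughly like the sketch below. The variance-based importance metric, the zlib calls, and the 8-bit quantization are illustrative stand-ins, not the authors' actual metrics or codecs.

```python
# A sketch of importance-driven per-region compression: blocks that
# score high on an importance metric are kept lossless, while the
# rest take a cheaper lossy path before being shipped in transit.
import numpy as np
import zlib

def importance(block):
    """Example metric: local variance as a proxy for interesting flow features."""
    return float(np.var(block))

def compress_block(block, threshold):
    raw = block.astype(np.float32).tobytes()
    if importance(block) >= threshold:
        return ("lossless", zlib.compress(raw, level=6))
    # Hypothetical lossy path: quantize to 8 bits, then deflate.
    lo, hi = float(block.min()), float(block.max())
    q = np.round((block - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
    return ("lossy", zlib.compress(q.tobytes(), level=1))
```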


GigaScience ◽  
2020 ◽  
Vol 9 (1) ◽  
Author(s):  
T Cameron Waller ◽  
Jordan A Berg ◽  
Alexander Lex ◽  
Brian E Chapman ◽  
Jared Rutter

Abstract Background Metabolic networks represent all chemical reactions that occur between molecular metabolites in an organism’s cells. They offer biological context in which to integrate, analyze, and interpret omic measurements, but their large scale and extensive connectivity present unique challenges. While it is practical to simplify these networks by placing constraints on compartments and hubs, it is unclear how these simplifications alter the structure of metabolic networks and the interpretation of metabolomic experiments. Results We curated and adapted the latest systemic model of human metabolism and developed customizable tools to define metabolic networks with and without compartmentalization in subcellular organelles and with or without inclusion of prolific metabolite hubs. Compartmentalization made networks larger, less dense, and more modular, whereas hubs made networks larger, more dense, and less modular. When present, these hubs also dominated shortest paths in the network, yet their exclusion exposed the subtler prominence of other metabolites that are typically more relevant to metabolomic experiments. We applied the non-compartmental network without metabolite hubs in a retrospective, exploratory analysis of metabolomic measurements from 5 studies on human tissues. Network clusters identified individual reactions that might experience differential regulation between experimental conditions, several of which were not apparent in the original publications. Conclusions Exclusion of specific metabolite hubs exposes modularity in both compartmental and non-compartmental metabolic networks, improving detection of relevant clusters in omic measurements. Better computational detection of metabolic network clusters in large data sets has the potential to identify differential regulation of individual genes, transcripts, and proteins.
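
A minimal sketch of the hub-exclusion step, assuming the network is held as a networkx graph; the degree cutoff and the greedy modularity clustering are illustrative stand-ins for the authors' curated hub list and clustering method.

```python
# Drop prolific hub metabolites by degree, then detect clusters in
# the pruned network, where modularity is no longer masked by hubs.
import networkx as nx

def exclude_hubs(G, max_degree=50):
    """Remove metabolite nodes whose connectivity dominates shortest paths."""
    hubs = [n for n, d in G.degree() if d > max_degree]
    H = G.copy()
    H.remove_nodes_from(hubs)   # e.g. H2O, ATP, NAD+ in a real network
    return H

def metabolite_clusters(G):
    H = exclude_hubs(G)
    return list(nx.algorithms.community.greedy_modularity_communities(H))
```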


2013 ◽  
Vol 12 (6) ◽  
pp. 2858-2868 ◽  
Author(s):  
Nadin Neuhauser ◽  
Nagarjuna Nagaraj ◽  
Peter McHardy ◽  
Sara Zanivan ◽  
Richard Scheltema ◽  
...  

2012 ◽  
Vol 38 (2) ◽  
pp. 57-69 ◽  
Author(s):  
Abdulghani Hasan ◽  
Petter Pilesjö ◽  
Andreas Persson

Global change and GHG emission modelling depend on accurate wetness estimations for predictions of e.g. methane emissions. This study aims to quantify how the slope, drainage area and the topographic wetness index (TWI) vary with the resolution of DEMs for a flat peatland area. Six DEMs with spatial resolutions from 0.5 to 90 m were interpolated with four different search radii. The relationship between the accuracy of the DEM and the slope was tested. The LiDAR elevation data were divided into two data sets. The density of data points made it possible to build an evaluation data set whose points lay no more than 10 mm from the cell centre points of the interpolation data set. The DEM was evaluated using a quantile-quantile test and the normalized median absolute deviation; the accuracy proved independent of resolution when the same search radius was used. The accuracy of the estimated elevation for different slopes was tested using the 0.5 m DEM and showed a higher deviation from the evaluation data in steep areas. Slope estimates differed between resolutions by values exceeding 50%. Drainage areas were tested for three resolutions, with coinciding evaluation points. The model's ability to generate drainage area at each resolution was tested by pairwise comparison of three data subsets and showed differences of more than 50% in 25% of the evaluated points. The results show that consideration of DEM resolution is a necessity for the use of slope, drainage area and TWI data in large-scale modelling.
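
For reference, the wetness index evaluated here is TWI = ln(a / tan β), with specific catchment area a and local slope β; a minimal sketch on precomputed grids follows (the input grids and the flat-cell guard are assumptions of this illustration).

```python
# Topographic Wetness Index on a regular grid, from precomputed
# drainage-area and slope rasters.
import numpy as np

def twi(drainage_area, slope_deg, cell_size):
    """TWI = ln(a / tan(beta)), a = drainage area per unit contour width."""
    a = drainage_area / cell_size               # specific catchment area
    tan_b = np.tan(np.radians(slope_deg))
    return np.log(a / np.maximum(tan_b, 1e-6))  # guard against flat cells
```

Because both a and tan β are taken from the DEM, any resolution-dependent bias in slope or drainage area propagates directly into the TWI, which is the sensitivity the study quantifies.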


2015 ◽  
Vol 8 (1) ◽  
pp. 421-434 ◽  
Author(s):  
M. P. Jensen ◽  
T. Toto ◽  
D. Troyan ◽  
P. E. Ciesielski ◽  
D. Holdridge ◽  
...  

Abstract. The Midlatitude Continental Convective Clouds Experiment (MC3E) took place during the spring of 2011, centered in north-central Oklahoma, USA. The main goal of this field campaign was to capture the dynamical and microphysical characteristics of precipitating convective systems in the US Central Plains. A major component of the campaign was a six-site radiosonde array designed to capture the large-scale variability of the atmospheric state with the intent of deriving model forcing data sets. Over the course of the 46-day MC3E campaign, a total of 1362 radiosondes were launched from the enhanced sonde network. This manuscript provides details on the instrumentation used as part of the sounding array, the data processing activities, including quality checks and humidity bias corrections, and an analysis of the impacts of bias correction and algorithm assumptions on the determination of convective levels and indices. It is found that corrections for known radiosonde humidity biases and assumptions regarding the characteristics of the surface convective parcel result in significant differences in the derived values of convective levels and indices in many soundings. In addition, the impact of including the humidity corrections and quality controls on the thermodynamic profiles used in the derivation of a large-scale model forcing data set is investigated. The results show a significant impact on the derived large-scale vertical velocity field, illustrating the importance of addressing these humidity biases.
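
A toy example of why such humidity corrections matter for derived convective levels: a few-percent shift in surface relative humidity moves the dewpoint and hence the lifting condensation level of the surface parcel. This sketch uses the Magnus and Espy approximations, not the campaign's actual processing.

```python
# Sensitivity of the LCL to a small relative-humidity bias correction.
import numpy as np

def dewpoint_c(temp_c, rh_pct):
    """Magnus approximation for dewpoint from temperature and RH."""
    gamma = np.log(rh_pct / 100.0) + 17.625 * temp_c / (243.04 + temp_c)
    return 243.04 * gamma / (17.625 - gamma)

def lcl_height_m(temp_c, rh_pct):
    """Espy's approximation: ~125 m per degree of dewpoint depression."""
    return 125.0 * (temp_c - dewpoint_c(temp_c, rh_pct))

# A +4% RH correction at 30 C lowers the surface-parcel LCL by ~150 m.
print(lcl_height_m(30.0, 50.0) - lcl_height_m(30.0, 54.0))
```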


2018 ◽  
Vol 22 (6) ◽  
pp. 3105-3124 ◽  
Author(s):  
Zilefac Elvis Asong ◽  
Howard Simon Wheater ◽  
Barrie Bonsal ◽  
Saman Razavi ◽  
Sopan Kurkute

Abstract. Drought is a recurring extreme climate event and among the most costly natural disasters in the world. This is particularly true over Canada, where drought is both a frequent and damaging phenomenon with impacts on regional water resources, agriculture, industry, aquatic ecosystems, and health. However, nationwide drought assessments are currently lacking and impacted by limited ground-based observations. This study provides a comprehensive analysis of historical droughts over the whole of Canada, including the role of large-scale teleconnections. Drought events are characterized by the Standardized Precipitation Evapotranspiration Index (SPEI) over various temporal scales (1, 3, 6, and 12 consecutive months, 6 months from April to September, and 12 months from October to September) applied to different gridded monthly data sets for the period 1950–2013. The Mann–Kendall test, rotated empirical orthogonal function, continuous wavelet transform, and wavelet coherence analyses are used, respectively, to investigate the trend, spatio-temporal patterns, periodicity, and teleconnectivity of drought events. Results indicate that southern (northern) parts of the country experienced significant trends towards drier (wetter) conditions, although substantial variability exists. Two spatially well-defined regions with different temporal evolution of droughts were identified: the Canadian Prairies and northern central Canada. The analyses also revealed the presence of a dominant periodicity of between 8 and 32 months in the Prairie region and between 8 and 40 months in the northern central region. These cycles of low-frequency variability are found to be associated principally with the Pacific–North American (PNA) pattern and the Multivariate El Niño/Southern Oscillation Index (MEI) relative to other considered large-scale climate indices. This study is the first of its kind to identify dominant periodicities in drought variability over the whole of Canada in terms of when the drought events occur, their duration, and how often they occur.
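
The Mann–Kendall trend test applied to the gridded SPEI series can be sketched as below; this uses the normal approximation without tie correction, a simplification of the study's analysis.

```python
# Non-parametric Mann-Kendall trend test for a monthly SPEI series.
import numpy as np
from scipy.stats import norm

def mann_kendall(x):
    x = np.asarray(x)
    n = len(x)
    # S statistic: net count of increasing vs decreasing pairs.
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
    p = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
    return z, p                       # z < 0: trend towards lower SPEI (drier)
```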


2021 ◽  
Author(s):  
Florian Betz ◽  
Magdalena Lauermann ◽  
Bernd Cyffka

In fluvial geomorphology as well as in freshwater ecology, rivers are commonly seen as nested hierarchical systems functioning over a range of spatial and temporal scales. Thus, a comprehensive assessment requires information on various scales. Over the past decade, remote sensing based approaches have become increasingly popular in river science for increasing the spatial scale of analysis. However, data-scarce areas have mostly been ignored so far, despite the fact that most remaining free-flowing (and thus ecologically valuable) rivers worldwide are located in regions that lack data sources like LiDAR or even aerial imagery. High-resolution satellite data could fill this data gap but tends to be too costly for large-scale applications, which limits the ability to study river systems comprehensively in such remote areas. This in turn is a limitation for the management and conservation of these rivers.

In this contribution, we suggest an approach for river corridor mapping based solely on open access data, in order to foster large-scale geomorphological mapping of river corridors in data-scarce areas. To this end, we combine advanced terrain analysis with multispectral remote sensing, using the SRTM-1 DEM along with Landsat OLI imagery. We take the Naryn River in Kyrgyzstan as an example to demonstrate the potential of these open access data sets to derive a comprehensive set of parameters for characterizing this river corridor. The methods are adapted to the specific characteristics of medium-resolution open access data sets and include an innovative, fuzzy-logic-based approach for riparian zone delineation, longitudinal profile smoothing based on constrained quantile regression, and a delineation of the active channel width as needed for specific stream power computation. In addition, an indicator for river dynamics based on Landsat time series is developed. For each derived river corridor parameter, a rigorous validation is performed. The results demonstrate that our open access approach to geomorphological mapping of river corridors is capable of providing results sufficiently accurate to derive reach-averaged information. Thus, it is well suited for large-scale river characterization in data-scarce regions where the river corridors would otherwise remain largely unexplored from an up-to-date riverscape perspective. Such a characterization might be an entry point for further, more detailed research in selected study reaches and can deliver the required comprehensive background information for a range of topics in river science.
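
A schematic sketch of the fuzzy-logic delineation idea, combining a terrain cue with a vegetation cue into a continuous riparian membership grid; the membership functions and thresholds are invented for illustration and are not the authors' calibrated choices.

```python
# Fuzzy riparian-zone membership from two medium-resolution cues:
# height above the river (terrain) and NDVI (vegetation).
import numpy as np

def fuzzy_linear(x, lo, hi):
    """Membership 1 at lo, falling linearly to 0 at hi."""
    return np.clip((hi - x) / (hi - lo), 0.0, 1.0)

def riparian_membership(height_above_river_m, ndvi):
    terrain = fuzzy_linear(height_above_river_m, lo=0.0, hi=10.0)
    vegetation = np.clip((ndvi - 0.2) / 0.4, 0.0, 1.0)  # greener -> more riparian
    return np.minimum(terrain, vegetation)  # fuzzy AND of the two cues
```

The continuous membership grid avoids the hard cutoffs a crisp threshold would impose on coarse SRTM and Landsat data, which is the motivation for a fuzzy approach in the first place.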

