Queryable Compression on Time-evolving Web and Social Networks with Streaming

2022 ◽  
Vol 16 (2) ◽  
pp. 1-21
Author(s):  
Michael Nelson ◽  
Sridhar Radhakrishnan ◽  
Chandra Sekharan ◽  
Amlan Chatterjee ◽  
Sudhindra Gopal Krishna

Time-evolving web and social network graphs are modeled as a set of pages/individuals (nodes) and their arcs (links/relationships) that change over time. Due to their popularity, they have become increasingly massive in terms of their number of nodes, arcs, and lifetimes. However, these graphs are extremely sparse throughout their lifetimes. For example, Facebook is estimated to have over a billion vertices, yet at any point in time it has far less than 0.001% of all possible relationships. The space required to store these large sparse graphs may not fit in most main memories using underlying representations such as a series of adjacency matrices or adjacency lists. We propose a compressed data structure that maintains a compressed binary tree corresponding to each row of each adjacency matrix of the time-evolving graph. We never explicitly construct the adjacency matrices; our algorithms build the structure directly from the time-evolving arc-list representation. Our compressed structure supports directed and undirected graphs and fast arc and neighborhood queries, and allows arcs and frames to be added and removed directly on the compressed structure (streaming operations). We use publicly available network data sets such as Flickr, Yahoo!, and Wikipedia in our experiments and show that our new technique performs as well as or better than our benchmarks on all data sets in terms of compression size and other vital metrics.
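The streaming interface the abstract describes (per-frame, per-row structures that answer arc and neighborhood queries and accept updates in place) can be sketched as follows. This toy version keeps a plain integer bitset per (frame, row) pair rather than the paper's compressed binary trees; it only illustrates the operations, not the compression.

```python
class TimeEvolvingGraph:
    """Toy stand-in for the paper's structure: one bitset per
    (frame, node) pair instead of a compressed binary tree."""

    def __init__(self):
        self.frames = {}  # frame id -> {node -> int bitset of out-neighbours}

    def add_arc(self, t, u, v):
        row = self.frames.setdefault(t, {})
        row[u] = row.get(u, 0) | (1 << v)

    def remove_arc(self, t, u, v):
        row = self.frames.get(t, {})
        if u in row:
            row[u] &= ~(1 << v)

    def has_arc(self, t, u, v):
        return bool(self.frames.get(t, {}).get(u, 0) >> v & 1)

    def neighbours(self, t, u):
        bits = self.frames.get(t, {}).get(u, 0)
        out, v = [], 0
        while bits:
            if bits & 1:
                out.append(v)
            bits >>= 1
            v += 1
        return out
```

An undirected graph would simply mirror each `add_arc(t, u, v)` with `add_arc(t, v, u)`; removing a whole frame is a single `del self.frames[t]`.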

2021 ◽  
Author(s):  
Danila Piatov ◽  
Sven Helmer ◽  
Anton Dignös ◽  
Fabio Persia

Abstract We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates, such as Allen’s relationships and parameterized relationships. Our technique is based on a framework whose components can be flexibly combined to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, by employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
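A minimal plane-sweep overlap join conveys the core idea: sort all interval endpoints, sweep once, and report pairs whose active ranges intersect. The paper's framework generalizes this to arbitrary Allen relations and relies on the Timeline Index and a gapless hash map, none of which appear in this sketch.

```python
def overlap_join(R, S):
    """Plane-sweep join of interval lists R and S (half-open (start, end)
    tuples). Returns index pairs (i, j) with R[i] overlapping S[j]."""
    events = []
    for side, lst in ((0, R), (1, S)):
        for i, (start, end) in enumerate(lst):
            events.append((start, 1, side, i))  # 1 = interval opens
            events.append((end, 0, side, i))    # 0 = interval closes
    # At equal timestamps, closes sort before opens, so touching
    # half-open intervals such as [1,4) and [4,7) do not join.
    events.sort()

    active = (set(), set())  # currently open intervals, per side
    out = []
    for _, kind, side, i in events:
        if kind == 0:
            active[side].discard(i)
        else:
            for j in active[1 - side]:  # joins with everything open opposite
                out.append((i, j) if side == 0 else (j, i))
            active[side].add(i)
    return sorted(out)
```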


2021 ◽  
Vol 13 (2) ◽  
pp. 164
Author(s):  
Chuyao Luo ◽  
Xutao Li ◽  
Yongliang Wen ◽  
Yunming Ye ◽  
Xiaofeng Zhang

The task of precipitation nowcasting is significant in operational weather forecasting, and radar echo map extrapolation plays a vital role in this task. Recently, deep learning techniques such as Convolutional Recurrent Neural Network (ConvRNN) models have been designed to solve the task. These models, albeit performing much better than conventional optical-flow-based approaches, suffer from a common problem of underestimating the high-echo-value parts. The drawback is fatal to precipitation nowcasting, as these parts often correspond to heavy rains that may cause natural disasters. In this paper, we propose a novel interaction dual attention long short-term memory (IDA-LSTM) model to address the drawback. In the method, an interaction framework is developed for the ConvRNN unit to fully exploit short-term context information by constructing a series of coupled convolutions on the input and hidden states. Moreover, a dual attention mechanism on channels and positions is developed to recall the forgotten information in the long term. Comprehensive experiments have been conducted on the CIKM AnalytiCup 2017 data sets, and the results show the effectiveness of IDA-LSTM in addressing the underestimation drawback. The extrapolation performance of IDA-LSTM is superior to that of the state-of-the-art methods.
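The two attention branches (over channels and over spatial positions) can be illustrated with a much-simplified NumPy stand-in: re-weight channels by their global response, then re-weight positions by their cross-channel response. The weighting scheme below is illustrative only and is not the paper's IDA-LSTM formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(feat):
    """feat: array of shape (C, H, W). Applies a channel re-weighting
    followed by a position re-weighting; both are normalised so that a
    uniform feature map passes through unchanged."""
    C, H, W = feat.shape
    # Channel branch: weight each channel by its global average response.
    chan_w = softmax(feat.mean(axis=(1, 2))) * C          # shape (C,)
    feat = feat * chan_w[:, None, None]
    # Position branch: weight each pixel by its cross-channel response.
    pos_w = softmax(feat.mean(axis=0).ravel()).reshape(H, W) * (H * W)
    return feat * pos_w[None, :, :]
```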


Author(s):  
Ned Augenblick ◽  
Matthew Rabin

Abstract When a Bayesian learns new information and changes her beliefs, she must on average become concomitantly more certain about the state of the world. Consequently, it is rare for a Bayesian to frequently shift beliefs substantially while remaining relatively uncertain, or, conversely, to become very confident with relatively little belief movement. We formalize this intuition by developing specific measures of movement and uncertainty reduction given a Bayesian’s changing beliefs over time, showing that these measures are equal in expectation, and creating consequent statistical tests for Bayesianness. We then show connections between these two core concepts and four common psychological biases, suggesting that the test might be particularly good at detecting these biases. We provide support for this conclusion by simulating the performance of our test and other martingale tests. Finally, we apply our test to data sets of individual, algorithmic, and market beliefs.
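The central identity (expected belief movement equals expected uncertainty reduction for a Bayesian) can be checked by simulation. The setting below is an assumed toy version, not the authors': a binary state, conditionally i.i.d. signals of accuracy q, movement measured as the sum of squared belief changes, and uncertainty reduction as the drop in p(1-p) from prior to final belief.

```python
import random

def simulate(T=20, q=0.7, runs=20000, seed=1):
    """Binary state, T conditionally i.i.d. signals of accuracy q.
    Returns (mean belief movement, mean uncertainty reduction)."""
    rng = random.Random(seed)
    tot_move = tot_red = 0.0
    for _ in range(runs):
        theta = rng.random() < 0.5          # true state
        p = p0 = 0.5                        # prior belief that theta is true
        move = 0.0
        for _ in range(T):
            s = rng.random() < (q if theta else 1 - q)  # noisy signal
            like1 = q if s else 1 - q       # P(signal | theta true)
            like0 = 1 - q if s else q       # P(signal | theta false)
            p_new = p * like1 / (p * like1 + (1 - p) * like0)  # Bayes rule
            move += (p_new - p) ** 2        # squared belief movement
            p = p_new
        tot_move += move
        tot_red += p0 * (1 - p0) - p * (1 - p)  # uncertainty reduction
    return tot_move / runs, tot_red / runs
```

Because Bayesian beliefs form a martingale, the increments are uncorrelated, so the two averages coincide up to sampling noise; a non-Bayesian updater (e.g. one that overweights signals) breaks the equality, which is what the statistical test exploits.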


2014 ◽  
Vol 11 (2) ◽  
pp. 68-79
Author(s):  
Matthias Klapperstück ◽  
Falk Schreiber

Summary The visualization of biological data has gained increasing importance in recent years. There are many methods and software tools available that visualize biological data, including the combination of measured experimental data and biological networks. With the growing size of networks, their handling and exploration becomes a challenging task for the user. In addition, scientists are interested in investigating not just a single kind of network, but combinations of different types of networks, such as metabolic, gene regulatory, and protein interaction networks. Therefore, fast access, abstract and dynamic views, and intuitive exploratory methods should be provided to search and extract information from the networks. This paper introduces a conceptual framework for handling and combining multiple network sources that enables abstract viewing and exploration of large data sets, including additional experimental data. It introduces a three-tier structure that links network data to multiple network views, discusses a proof-of-concept implementation, and shows a specific visualization method for combining metabolic and gene regulatory networks in an example.


2007 ◽  
Vol 31 (3) ◽  
pp. 411-433 ◽  
Author(s):  
Alan Czaplicki

This article explains how pasteurization—with few outspoken political supporters during this period—first became a primary milk purification strategy in Chicago and why eight years passed between pasteurization’s initial introduction into law and the city’s adoption of full mandatory pasteurization. It expands the current focus on the political agreement to pasteurize to include the organizational processes involved in incorporating pasteurization into both policy and practice. It shows that the decision to pasteurize did not occur at a clearly defined point but instead evolved over time as a consequence of the interplay of political interest groups, state-municipal legal relations, and the merging of different organizational practices. Such an approach considerably complicates and expands existing accounts of how political interests and agreements shaped pasteurization and milk purification policies and practice.


1995 ◽  
Vol 72 (3) ◽  
pp. 666-681 ◽  
Author(s):  
Robert H. Wicks

This article suggests a theoretical explanation of the processes related to recall and learning of media news information. It does so by linking the concepts of schematic thinking and the Search of Associative Memory (SAM) to the variable of time. It argues that learning from the news may be better than many recent studies suggest. Although humans may have trouble recalling discrete news stories in recall examinations, it seems likely that they acquire “common knowledge” from the news media. Time is an important variable in helping people to remember news if they use it to think about new information in the context of previously stored knowledge.


2020 ◽  
Vol 122 (11) ◽  
pp. 1-32
Author(s):  
Michael A. Gottfried ◽  
Vi-Nhuan Le ◽  
J. Jacob Kirksey

Background It is of grave concern that kindergartners are missing more school than students in any other year of elementary school; therefore, documenting which students are absent and for how long is of utmost importance. Yet, doing so for students with disabilities (SWDs) has received little attention. This study addresses this gap by examining two cohorts of SWDs, separated by more than a decade, to document changes in attendance patterns. Research Questions First, for SWDs, has the number of school days missed or chronic absenteeism rates changed over time? Second, how are changes in the number of school days missed and chronic absenteeism rates related to changes in academic emphasis, presence of teacher aides, SWD-specific teacher training, and preschool participation? Subjects This study uses data from the Early Childhood Longitudinal Study (ECLS), a nationally representative data set of children in kindergarten. We rely on both ECLS data sets: the kindergarten classes of 1998–1999 and 2010–2011. Measures were identical in both data sets, making it feasible to compare children across the two cohorts. Given identical measures, we combined the data sets into a single data set with an indicator for being in the older cohort. Research Design This study examined two sets of outcomes: The first was number of days absent, and the second was likelihood of being chronically absent. These outcomes were regressed on a measure for being in the older cohort (our key measure for changes over time) and numerous control variables. The error term was clustered by classroom. Findings We found that SWDs are absent more often now than they were a decade earlier, and this growth in absenteeism was larger than what students without disabilities experienced. Absenteeism among SWDs was higher for those enrolled in full-day kindergarten, although having attended center-based care mitigates this disparity over time. Implications are discussed.
Conclusions Our study calls for additional attention and supports to combat the increasing rates of absenteeism for SWDs over time. Understanding contextual shifts and trends in rates of absenteeism for SWDs in kindergarten is pertinent to crafting effective interventions and research geared toward supporting the academic and social needs of these students.


Paleobiology ◽  
2002 ◽  
Vol 28 (3) ◽  
pp. 343-363 ◽  
Author(s):  
David C. Lees ◽  
Richard A. Fortey ◽  
L. Robin M. Cocks

Despite substantial advances in plate tectonic modeling in the last three decades, the postulated position of terranes in the Paleozoic has seldom been validated by faunal data. Fewer studies still have attempted a quantitative approach to distance based on explicit data sets. As a test case, we examine the position of Avalonia in the Ordovician (Arenig, Llanvirn, early Caradoc, and Ashgill) to mid-Silurian (Wenlock) with respect to Laurentia, Baltica, and West Gondwana. Using synoptic lists of 623 trilobite genera and 622 brachiopod genera for these four plates, summarized as Venn diagrams, we have devised proportional indices of mean endemism (ME, normalized by individual plate faunas to eliminate area biogeographic effects) and complementarity (C) for objective paleobiogeographic comparisons. These can discriminate the relative position of Avalonia by assessing the optimal arrangement of inter-centroid distances (measured as great circles) between relevant pairs of continental masses. The proportional indices are used to estimate the “goodness-of-fit” of the faunal data to two widely used dynamic plate tectonic models for these time slices, those of Smith and Rush (1998) and Ross and Scotese (1997). Our faunal data are more consistent with the latter model, which we use to suggest relationships between faunal indices for the five time slices and new rescaled inter-centroid distances between all six plate pairs. We have examined linear and exponential models in relation to continental separation for these indices. For our generic data, the linear model fits distinctly better overall. The fits of indices generated by using independent trilobite and brachiopod lists are mostly similar to each other at each time slice and for a given plate, reflecting a common biogeographic signal; however, the indices vary across the time slices. 
Combining groups into the same matrix in a “total evidence” analysis performs better still as a measure of distance for mean endemism in the “Scotese” plate model. Four-plate mean endemism performs much better than complementarity as an indicator of pairwise distance for either plate model in the test case.
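Set-based indices of this kind can be computed directly from per-plate genus lists. The formulations below are common ones (e.g. complementarity as the fraction of the combined fauna unique to one of the two plates, and endemism normalised by each plate's own fauna size); they may differ in detail from the paper's ME and C.

```python
def complementarity(a, b):
    """Fraction of the combined fauna of two plates found on only
    one of them (a common formulation; the paper's C may differ)."""
    a, b = set(a), set(b)
    return len(a ^ b) / len(a | b)

def mean_endemism(faunas):
    """Per-plate endemism (genera found on no other plate), normalised
    by each plate's own fauna size, then averaged across plates."""
    vals = []
    for name, fauna in faunas.items():
        others = set().union(*(f for n, f in faunas.items() if n != name))
        vals.append(len(set(fauna) - others) / len(fauna))
    return sum(vals) / len(vals)
```

Normalising endemism by each plate's own fauna, as the abstract notes, removes the bias that a plate with a larger sampled fauna would otherwise appear more endemic simply by having more genera.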


Weed Science ◽  
2007 ◽  
Vol 55 (6) ◽  
pp. 652-664 ◽  
Author(s):  
N. C. Wagner ◽  
B. D. Maxwell ◽  
M. L. Taper ◽  
L. J. Rew

To develop a more complete understanding of the ecological factors that regulate crop productivity, we tested the relative predictive power of yield models driven by five predictor variables: wheat and wild oat density, nitrogen and herbicide rate, and growing-season precipitation. Existing data sets were collected and used in a meta-analysis of the ability of at least two predictor variables to explain variations in wheat yield. Yield responses were asymptotic with increasing crop and weed density; however, asymptotic trends were lacking as herbicide and fertilizer levels were increased. Based on the independent field data, the three best-fitting models (in order) from the candidate set of models were a multiple regression equation that included all five predictor variables (R² = 0.71), a double-hyperbolic equation including three input predictor variables (R² = 0.63), and a nonlinear model including all five predictor variables (R² = 0.56). The double-hyperbolic, three-predictor model, which did not include herbicide and fertilizer influence on yield, performed slightly better than the five-variable nonlinear model including these predictors, illustrating the large amount of variation in wheat yield and the lack of concrete knowledge upon which farmers base their fertilizer and herbicide management decisions, especially when weed infestation causes competition for limited nitrogen and water. It was difficult to elucidate the ecological first principles in the noisy field data and to build effective models based on disjointed data sets, where none of the studies measured all five variables. To address this disparity, we conducted a five-variable full-factorial greenhouse experiment. Based on our five-variable greenhouse experiment, the best-fitting model was a new nonlinear equation including all five predictor variables and was shown to fit the greenhouse data better than four previously developed agronomic models, with an R² of 0.66.
Development of this mathematical model, through model selection and parameterization with field and greenhouse data, represents the initial step in building a decision support system for site-specific and variable-rate management of herbicide, fertilizer, and crop seeding rate that considers varying levels of available water and weed infestation.
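A five-predictor multiple regression of the kind ranked best on the field data can be sketched with ordinary least squares on synthetic data. All coefficients and the data-generating process here are invented for illustration; only the five predictor names come from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Five hypothetical predictors: wheat density, wild oat density,
# nitrogen rate, herbicide rate, growing-season precipitation
# (all rescaled to [0, 1]; the coefficients below are invented).
X = rng.uniform(0, 1, size=(n, 5))
true_beta = np.array([3.0, -2.0, 1.5, 1.0, 2.5])
y = 1.0 + X @ true_beta + rng.normal(0, 0.5, n)   # synthetic "yield"

A = np.column_stack([np.ones(n), X])              # add intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
resid = y - A @ beta_hat
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
```

The double-hyperbolic and nonlinear candidates from the abstract would be fitted the same way in spirit, but with an iterative nonlinear least-squares routine in place of `lstsq`.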


2003 ◽  
Vol 95 (2) ◽  
pp. 571-576 ◽  
Author(s):  
Yongquan Tang ◽  
Martin J. Turner ◽  
Johnny S. Yem ◽  
A. Barry Baker

Pneumotachographs require frequent calibration. Constant-flow methods allow polynomial calibration curves to be derived but are time consuming. The iterative syringe stroke technique is moderately efficient but results in discontinuous conductance arrays. This study investigated the derivation of first-, second-, and third-order polynomial calibration curves from 6 to 50 strokes of a calibration syringe. We used multiple linear regression to derive first-, second-, and third-order polynomial coefficients from two sets of 6–50 syringe strokes. In part A, peak flows did not exceed the specified linear range of the pneumotachograph, whereas flows in part B peaked at 160% of the maximum linear range. Conductance arrays were derived from the same data sets by using a published algorithm. Volume errors of the calibration strokes and of separate sets of 70 validation strokes (part A) and 140 validation strokes (part B) were calculated by using the polynomials and conductance arrays. Second- and third-order polynomials derived from 10 calibration strokes achieved volume variability equal to or better than conductance arrays derived from 50 strokes. We found that evaluation of conductance arrays using the calibration syringe strokes yields falsely low volume variances. We conclude that accurate polynomial curves can be derived from as few as 10 syringe strokes, and the new polynomial calibration method is substantially more time efficient than previously published conductance methods.
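The polynomial step (fitting flow as a low-order polynomial of the transducer signal by multiple linear regression) can be sketched as follows. The transducer model, noise level, and coefficients are hypothetical, and real calibration would regress integrated stroke volumes against the known syringe volume rather than against pointwise flows.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical transducer: true flow f relates to the measured signal s
# by f = 1.2*s + 0.3*s**2 (mildly nonlinear), observed with noise.
s = rng.uniform(0.1, 2.0, 300)                    # signal samples from strokes
f_obs = 1.2 * s + 0.3 * s**2 + rng.normal(0, 0.02, s.size)

# Second-order polynomial calibration by multiple linear regression,
# with no intercept term (zero signal should mean zero flow).
A = np.column_stack([s, s**2])
coef, *_ = np.linalg.lstsq(A, f_obs, rcond=None)  # [linear, quadratic] terms
```

Unlike a conductance array, the fitted polynomial is continuous in the signal, which is what lets a handful of strokes cover the whole flow range.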

