Tidy Tuples and Flying Start: fast compilation and fast execution of relational queries in Umbra

The VLDB Journal ◽

10.1007/s00778-020-00643-4 ◽

2021 ◽

Author(s):

Timo Kersten ◽

Viktor Leis ◽

Thomas Neumann

Keyword(s):

Code Generation ◽

High Speed ◽

Ad Hoc ◽

Large Data ◽

Database System ◽

Lessons Learned ◽

Small Data ◽

Data Sets ◽

Query Execution ◽

Major Barrier

Abstract Although compiling queries to efficient machine code has become a common approach for query execution, a number of newly created database system projects still refrain from using compilation. It is sometimes claimed that the intricacies of code generation make compilation-based engines too complex. Also, a major barrier for adoption, especially for interactive ad hoc queries, is long compilation time. In this paper, we examine all stages of compiling query execution engines and show how to reduce compilation overhead. We incorporate the lessons learned from a decade of generating code in HyPer into a design that manages complexity and yields high speed. First, we introduce a code generation framework that establishes abstractions to manage complexity, yet generates code in a single fast pass. Second, we present a program representation whose data structures are tuned to support fast code generation and compilation. Third, we introduce a new compiler backend that is optimized for minimal compile time, and simultaneously, yields superior execution performance to competing approaches, e.g., Volcano-style or bytecode interpretation. We implemented these optimizations in our database system Umbra to show that it is possible to unite fast compilation and fast execution. Indeed, Umbra achieves unprecedentedly low query latencies. On small data sets, it is even faster than interpreter engines like DuckDB and PostgreSQL. At the same time, on large data sets, its throughput is on par with the state-of-the-art compiling system HyPer.

49 Current status of genomic selection

Journal of Animal Science ◽

10.1093/jas/skz258.105 ◽

2019 ◽

Vol 97 (Supplement_3) ◽

pp. 52-53

Author(s):

Ignacy Misztal

Keyword(s):

Genomic Selection ◽

Ad Hoc ◽

Large Data ◽

Single Step ◽

Breeding Value ◽

Current Status ◽

Small Data ◽

Data Sets ◽

Effective Population ◽

Early Application

Abstract Early application of genomic selection relied on SNP estimation with phenotypes or de-regressed proofs (DRP). Chips of 50k SNP seemed sufficient. Estimated breeding value was an index with parent average and deduction to eliminate double counting. Use of SNP selection or weighting increased accuracy with small data sets but gave little or no gain with large data sets. Use of DRP with female information required ad-hoc modifications. As BLUP is biased by genomic selection, use of DRP under genomic selection required adjustments. Efforts to include potentially causative SNP derived from sequence analysis showed limited or no gain. Genomic selection was greatly simplified using single-step GBLUP (ssGBLUP) because the procedure automatically creates the index, can use any combination of male and female genotypes, and accounts for preselection. ssGBLUP requires careful scaling for compatibility between pedigree and genomic relationships to avoid biases, especially under strong selection. Large data computations in ssGBLUP were solved by exploiting the limited dimensionality of SNP due to limited effective population size. With such dimensionality ranging from 4k in chicken to about 15k in Holsteins, the inverse of GRM can be created directly (e.g., by the APY algorithm) at linear cost. Due to its simplicity and accuracy, ssGBLUP is routinely used for genomic selection by major companies in chicken, pigs and beef. ssGBLUP can be used to derive SNP effects for indirect prediction, and for GWAS, including computations of the P-values. An alternative single-step method called ssBR exists that uses SNP effects instead of GRM. As BLUP is affected by pre-selection, there is a need for new validation procedures unaffected by selection, and for parameter estimation that accounts for all the genomic data used in selection. Another issue is the reduction in variances due to the Bulmer effect.

Distributed and collaborative visualization of large data sets using high-speed networks

Future Generation Computer Systems ◽

10.1016/j.future.2006.03.026 ◽

2006 ◽

Vol 22 (8) ◽

pp. 1004-1010 ◽

Cited By ~ 22

Author(s):

Andrei Hutanu ◽

Gabrielle Allen ◽

Stephen D. Beck ◽

Petr Holub ◽

Hartmut Kaiser ◽

...

Keyword(s):

High Speed ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

High Speed Networks ◽

Collaborative Visualization

Practical Constraint K-Segment Principal Curve Algorithms for Generating Railway GPS Digital Map

Mathematical Problems in Engineering ◽

10.1155/2013/258694 ◽

2013 ◽

Vol 2013 ◽

pp. 1-11

Author(s):

Dewang Chen ◽

Long Chen

Keyword(s):

Data Storage ◽

Expert Knowledge ◽

Low Cost ◽

Large Data ◽

Gps Data ◽

Small Data ◽

Data Sets ◽

Digital Map ◽

Digital Maps ◽

Practical Algorithms

To balance low-cost, low-accuracy Global Positioning System (GPS) receivers against the high-precision digital maps that modern railways require, we use the concept of constraint K-segment principal curves (CKPCS) and expert knowledge of railways to propose three practical CKPCS generation algorithms with reduced computational complexity, which makes them more suitable for engineering applications. The three algorithms are named ALLopt, MPMopt, and DCopt, in which ALLopt exploits global optimization and MPMopt and DCopt apply local optimization with different initial solutions. We compare the three practical algorithms according to their performance on average projection error, stability, and the fitness for simple and complex simulated trajectories with noisy data. It is found that ALLopt only works well for simple curves and small data sets. The other two algorithms can work better for complex curves and large data sets. Moreover, MPMopt runs faster than DCopt, but DCopt can work better for some curves with cross points. The three algorithms are also applied in generating GPS digital maps for two railway GPS data sets measured on the Qinghai-Tibet Railway (QTR). The results are similar to those obtained on synthetic data. Because the trajectory of a railway is relatively simple and straight, we conclude that MPMopt works best when both the speed of computation and the quality of the generated CKPCS are considered. MPMopt can be used to obtain some key points to represent a large amount of GPS data. Hence, it can greatly reduce the data storage requirements and increase the positioning speed for real-time digital map applications.

The relationship between journal citation impact and citation sentiment: A study of 32 million citances in PubMed Central

Quantitative Science Studies ◽

10.1162/qss_a_00040 ◽

2020 ◽

pp. 1-11

Author(s):

Erjia Yan ◽

Zheng Chen ◽

Kai Li

Keyword(s):

Social Science ◽

Natural Science ◽

Citation Impact ◽

Large Data ◽

Small Data ◽

Data Sets ◽

Data Set ◽

Pubmed Central ◽

Small Data Sets ◽

Sentiment Score

Citation sentiment plays an important role in citation analysis and scholarly communication research, but prior citation sentiment studies have used small data sets and relied largely on manual annotation. This paper uses a large data set of PubMed Central (PMC) full-text publications and analyzes citation sentiment in more than 32 million citances within PMC, revealing citation sentiment patterns at the journal and discipline levels. This paper finds a weak relationship between a journal’s citation impact (as measured by CiteScore) and the average sentiment score of citances to its publications. When journals are aggregated into quartiles based on citation impact, we find that journals in higher quartiles are cited more favorably than those in the lower quartiles. Further, social science journals are found to be cited with higher sentiment, followed by engineering and natural science and biomedical journals, respectively. This result may be attributed to disciplinary discourse patterns in which social science researchers tend to use more subjective terms to describe others’ work than do natural science or biomedical researchers.

An Accurate Substitution Method To Minimize Left Censoring Bias in Serum Steroid Measurements

Endocrinology ◽

10.1210/en.2019-00340 ◽

2019 ◽

Vol 160 (10) ◽

pp. 2395-2400 ◽

Cited By ~ 5

Author(s):

David J Handelsman ◽

Lam P Ly

Keyword(s):

Data Analysis ◽

Ad Hoc ◽

Likelihood Estimation ◽

Large Data ◽

Serum Testosterone ◽

Accurate Method ◽

Estimation Methods ◽

Data Sets ◽

Full Data ◽

Data Set

Abstract Hormone assay results below the assay detection limit (DL) can introduce bias into quantitative analysis. Although complex maximum likelihood estimation methods exist, they are not widely used, whereas simple substitution methods are often used ad hoc to replace the undetectable (UD) results with numeric values to facilitate data analysis with the full data set. However, the bias of substitution methods for steroid measurements is not reported. Using a large data set (n = 2896) of serum testosterone (T), DHT, estradiol (E2) concentrations from healthy men, we created modified data sets with increasing proportions of UD samples (≤40%) to which we applied five different substitution methods (deleting UD samples as missing and substituting UD sample with DL, DL/√2, DL/2, or 0) to calculate univariate descriptive statistics (mean, SD) or bivariate correlations. For all three steroids and for univariate as well as bivariate statistics, bias increased progressively with increasing proportion of UD samples. Bias was worst when UD samples were deleted or substituted with 0 and least when UD samples were substituted with DL/√2, whereas the other methods (DL or DL/2) displayed intermediate bias. Similar findings were replicated in randomly drawn small subsets of 25, 50, and 100. Hence, we propose that in steroid hormone data with ≤40% UD samples, substituting UD with DL/√2 is a simple, versatile, and reasonably accurate method to minimize left censoring bias, allowing for data analysis with the full data set.
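
The substitution methods named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis: the log-normal sample, the seed, the 30% censoring fraction, and the function names `substitute` and `mean_bias` are all assumptions made for illustration, and only the bias of the mean is computed.

```python
import math
import random
import statistics

def substitute(values, dl, method):
    """Return a copy of values with results below the detection limit dl handled by `method`."""
    fill = {"DL": dl, "DL/sqrt2": dl / math.sqrt(2), "DL/2": dl / 2, "zero": 0.0}
    if method == "delete":
        # Treat undetectable (UD) results as missing.
        return [v for v in values if v >= dl]
    # Replace every UD result with the chosen constant.
    return [v if v >= dl else fill[method] for v in values]

def mean_bias(values, dl, method):
    """Mean after handling UD results minus the mean of the uncensored data."""
    return statistics.mean(substitute(values, dl, method)) - statistics.mean(values)

# Hypothetical data: a log-normal sample standing in for hormone concentrations.
rng = random.Random(0)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]

# Hypothetical detection limit chosen so that 30% of the sample falls below it.
dl = sorted(sample)[len(sample) * 30 // 100]

for method in ("delete", "DL", "DL/sqrt2", "DL/2", "zero"):
    print(f"{method:>9}: {mean_bias(sample, dl, method):+.3f}")
```

Deleting UD results always inflates the mean and substituting 0 always deflates it, while the three DL-based constants fall in between. Which constant gives the smallest bias depends on the data distribution, so this sketch makes no claim about that ranking.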

Integral methods for automatic quantification of fast-scan-cyclic-voltammetry detected neurotransmitters

PLoS ONE ◽

10.1371/journal.pone.0254594 ◽

2021 ◽

Vol 16 (7) ◽

pp. e0254594

Author(s):

Leonardo X. Espín ◽

Anders J. Asp ◽

James K. Trevathan ◽

Kip A. Ludwig ◽

J. Luis Lujan

Keyword(s):

Ad Hoc ◽

Time Integration ◽

Large Data ◽

Data Sets ◽

Oxidation Reactions ◽

Fast Scan Cyclic Voltammetry ◽

Integral Methods ◽

Selection Of

Modern techniques for estimating basal levels of electroactive neurotransmitters rely on the measurement of oxidative charges. This requires time integration of oxidation currents at certain intervals. Unfortunately, the selection of integration intervals relies on ad-hoc visual identification of peaks on the oxidation currents, which introduces sources of error and precludes the development of automated procedures necessary for analysis and quantification of neurotransmitter levels in large data sets. In an effort to improve charge quantification techniques, here we present novel methods for automatic selection of integration boundaries. Our results show that these methods allow quantification of oxidation reactions both in vitro and in vivo and of multiple analytes in vitro.
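
The charge measurement that this abstract refers to is a time integral of the oxidation current over a chosen interval. The Python sketch below shows only that integration step, using the trapezoidal rule. It does not reproduce the authors' automatic boundary selection: the boundaries are passed in by hand, and the function name `charge` and the sample trace are hypothetical.

```python
def charge(times, currents, t_start, t_end):
    """Trapezoidal integral of current over the samples with t_start <= t <= t_end."""
    # Keep only the samples that lie inside the integration interval.
    pts = [(t, i) for t, i in zip(times, currents) if t_start <= t <= t_end]
    # Sum the trapezoid areas between consecutive samples.
    return sum((t1 - t0) * (i0 + i1) / 2 for (t0, i0), (t1, i1) in zip(pts, pts[1:]))

# Hypothetical trace: a triangular current peak sampled at unit time steps.
times = [0, 1, 2, 3, 4]
currents = [0.0, 1.0, 2.0, 1.0, 0.0]

print(charge(times, currents, 0, 4))  # whole peak
print(charge(times, currents, 1, 3))  # narrower, hand-picked interval
```

The two calls return different charges for the same peak, which illustrates why the choice of integration boundaries matters.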

Assessment of Master Curve Material Inhomogeneity Using Small Data Sets

Volume 1A: Codes and Standards ◽

10.1115/pvp2018-84297 ◽

2018 ◽

Author(s):

Kim Wallin

Keyword(s):

Modal Analysis ◽

Master Curve ◽

Large Data ◽

Large Data Sets ◽

Small Data ◽

Data Sets ◽

Inhomogeneous Materials ◽

Material Inhomogeneity ◽

Analysis Methods ◽

Small Data Sets

The standard Master Curve (MC) deals only with materials assumed to be homogeneous, but MC analysis methods for inhomogeneous materials have also been developed. Especially the bi-modal and multi-modal analysis methods are becoming more and more standard. Their drawback is that these methods are generally reliable only with sufficiently large data sets (number of valid tests, r ≥ 15–20). Here, the possibility of using the multi-modal analysis method with smaller data sets is assessed, and a new procedure to conservatively account for possible inhomogeneities is proposed.

CRADA Factsheet: Advanced High Speed Networks for Remote Access, Image Processing, and Delivery of Large Data Sets: Sprint Corporation

Fact Sheet ◽

10.3133/fs13095 ◽

1995 ◽

Keyword(s):

Image Processing ◽

High Speed ◽

Large Data ◽

Remote Access ◽

Large Data Sets ◽

Data Sets ◽

High Speed Networks

Fooled Again and Again

The Phantom Pattern Problem ◽

10.1093/oso/9780198864165.003.0005 ◽

2020 ◽

pp. 75-100

Author(s):

Gary Smith ◽

Jay Cordes

Keyword(s):

Test Scores ◽

Large Data ◽

Large Data Sets ◽

Small Data ◽

Data Sets ◽

Small Data Sets ◽

Athletic Competitions

Patterns are inevitable and we should not be surprised by them. Streaks, clusters, and correlations are the norm, not the exception. In a large number of coin flips, there are likely to be coincidental clusters of heads and tails. In nationwide data on cancer, crime, or test scores, there are likely to be flukey clusters. When the data are separated into smaller geographic units like cities, the most extreme results are likely to be found in the smallest cities. In athletic competitions between well-matched teams, the outcome of a small number of games is almost meaningless. Our challenge is to overcome our inherited inclination to think that all patterns are meaningful; for example, thinking that clustering in large data sets or differences among small data sets must be something real that needs to be explained. Often, it is just meaningless happenstance.
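
Two of the claims in this abstract can be checked with a small simulation: long streaks appear in fair coin flips, and the most extreme rates appear in the smallest cities even when every resident has the same underlying risk. The Python sketch below is illustrative only. The seed, the city sizes, the 10% risk, and the function names `longest_run` and `city_rates` are assumptions, not taken from the book.

```python
import random

def longest_run(flips):
    """Length of the longest streak of identical consecutive outcomes."""
    best = current = 0
    previous = None
    for flip in flips:
        # Extend the streak if the outcome repeats, otherwise start a new one.
        current = current + 1 if flip == previous else 1
        best = max(best, current)
        previous = flip
    return best

def city_rates(rng, n_cities, population, p):
    """Observed event rate per city when every resident has the same risk p."""
    return [
        sum(rng.random() < p for _ in range(population)) / population
        for _ in range(n_cities)
    ]

rng = random.Random(42)

# Streaks in 1000 fair coin flips.
flips = [rng.random() < 0.5 for _ in range(1000)]
print("longest streak in 1000 fair flips:", longest_run(flips))

# Same underlying risk everywhere; only the city size differs.
small = city_rates(rng, 100, 50, 0.1)
large = city_rates(rng, 100, 2000, 0.1)
print("rate range, small cities:", min(small), max(small))
print("rate range, large cities:", min(large), max(large))
```

The spread of observed rates among the small cities is much wider than among the large ones, although nothing real distinguishes the cities in this simulation.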

The European Creep Collaborative Committee (ECCC) Approach to Creep Data Assessment

Journal of Pressure Vessel Technology ◽

10.1115/1.2894296 ◽

2008 ◽

Vol 130 (2) ◽

Cited By ~ 9

Author(s):

Stuart Holdsworth

Keyword(s):

Rupture Strength ◽

Large Data ◽

Small Data ◽

Data Sets ◽

Creep Data ◽

Data Set ◽

Data Assessment ◽

Long Time ◽

Small Data Sets ◽

Material Conditions

The European Creep Collaborative Committee (ECCC) approach to creep data assessment has now been established for almost ten years. The methodology covers the analysis of rupture strength and ductility, creep strain, and stress relaxation data, for a range of material conditions. This paper reviews the concepts and procedures involved. The original approach was devised to determine data sheets for use by committees responsible for the preparation of National and International Design and Product Standards, and the methods developed for data quality evaluation and data analysis were therefore intentionally rigorous. The focus was clearly on the determination of long-time property values from the largest possible data sets involving a significant number of observations in the mechanism regime for which predictions were required. More recently, the emphasis has changed. There is now an increasing requirement for full property descriptions from very short times to very long and hence the need for much more flexible model representations than were previously required. There continues to be a requirement for reliable long-time predictions from relatively small data sets comprising relatively short duration tests, in particular, to exploit new alloy developments at the earliest practical opportunity. In such circumstances, it is not feasible to apply the same degree of rigor adopted for large data set assessment. Current developments are reviewed.
