Proceedings of the VLDB Endowment
Latest Publications


Total documents: 3003 (five years: 804)

H-index: 101 (five years: 10)

Published by: VLDB Endowment

ISSN: 2150-8097

2021 · Vol. 14 (13) · pp. 3419-3419
Author(s): Manasi Vartak

The last 5+ years in ML have focused on building the best models: hyperparameter optimization, parallel training, massive neural networks, etc. Now that building models has become easy, models are being integrated into every piece of software and device, from smart kitchens to radiology to detecting the performance of turbines. This shift from training ML models to building intelligent, ML-driven applications has highlighted a variety of problems in going from "a model" to a whole application or business process running on ML. These challenges include operational ones (how to package and deploy different types of models using existing SDLC tools and practices), rethinking what existing abstractions mean for ML (e.g., testing, monitoring, warehouses for ML), collaboration challenges arising from the disparate skill sets involved in ML product development (DS vs. SWE), and brand-new problems unique to ML (e.g., explainability, fairness, retraining). In this talk, I will discuss the slew of challenges that still exist in operationalizing ML to build intelligent applications, describe some solutions that the community has adopted, and highlight open problems that would benefit from the research community's contributions.


2021 · Vol. 15 (1) · pp. 1-10
Author(s): Kang Zhao, Liuyihan Song, Yingya Zhang, Pan Pan, Yinghui Xu, ...

Thanks to the popularity of GPUs and the growth of their computational power, more and more deep learning tasks, such as face recognition, image retrieval, and word embedding, can take advantage of extreme classification to improve accuracy. However, it remains a big challenge to train a deep model with millions of classes efficiently, due to the huge memory and computation consumption in the last layer. By sampling a small set of classes to avoid computing over all classes, sampling-based approaches have proven to be an effective solution. But most of them suffer from two issues: i) important classes are ignored or only partly sampled, as in methods using a random sampling scheme or retrieval techniques with low recall (e.g., locality-sensitive hashing), resulting in degraded accuracy; ii) inefficient implementation owing to incompatibility with GPUs, as in selective softmax, which uses a hashing forest to help select classes but runs the search process on the CPU. To address these issues, we propose a new sampling-based softmax called ANN Softmax in this paper. Specifically, we employ binary quantization with an inverted file system to improve the recall of important classes. With the help of a dedicated kernel design, it can be fully parallelized in mainstream training frameworks. We also find that the number of important classes recalled for each training sample has a great impact on the final accuracy, so we introduce a sample grouping optimization to better approximate full-class training. Experimental evaluations on two tasks (embedding learning and classification) and ten datasets (e.g., MegaFace, ImageNet, SKU datasets) demonstrate that our proposed method maintains the same precision as full softmax for different loss objectives, including cross-entropy loss, ArcFace, CosFace, and D-Softmax loss, with only 1/10 of the classes sampled, outperforming the state-of-the-art techniques. Moreover, we implement ANN Softmax in a complete GPU pipeline that accelerates training by more than 4.3X. Equipped with a cluster of 256 GPUs, our method reduces the time to train a classifier over 300 million classes on our SKU-300M dataset to ten days.
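
The core idea above, sampling a small but "important" subset of classes instead of computing the full softmax, can be illustrated with a short sketch. The code below is a minimal NumPy illustration and an assumption on my part: it recalls important classes by brute-force nearest-neighbour search over the class weight vectors, whereas ANN Softmax uses binary quantization with an inverted file system and fused GPU kernels.

```python
# Minimal sketch of a sampled softmax in the spirit of ANN Softmax (illustrative only).
import numpy as np

def sampled_softmax_loss(feature, label, class_weights, n_important=8, n_random=8, rng=None):
    """Cross-entropy over a small sampled set of classes instead of the full set."""
    rng = rng or np.random.default_rng(0)
    n_classes = class_weights.shape[0]

    # 1) Recall "important" classes: those whose weight vectors score highest for the feature.
    scores = class_weights @ feature
    important = np.argsort(-scores)[:n_important]

    # 2) Add a few random negatives and always keep the true class.
    random_neg = rng.choice(n_classes, size=n_random, replace=False)
    sampled = np.unique(np.concatenate([[label], important, random_neg]))

    # 3) Softmax cross-entropy restricted to the sampled classes.
    logits = class_weights[sampled] @ feature
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[np.where(sampled == label)[0][0]])

# Toy usage: 1,000 classes, 64-dimensional features.
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 64))
x = rng.normal(size=64)
print(sampled_softmax_loss(x, label=42, class_weights=W, rng=rng))
```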


2021 · Vol. 14 (13) · pp. 3322-3334
Author(s): Yunkai Lou, Chaokun Wang, Tiankai Gu, Hao Feng, Jun Chen, ...

Many real-world networks are continually evolving and are naturally modeled as temporal graphs from the viewpoint of graph theory. A temporal graph is informative and always contains two types of information, i.e., temporal information and topological information, where the temporal information reflects the time when relationships are established, and the topological information focuses on the structure of the graph. In this paper, we perform time-topology analysis on temporal graphs to extract useful information. First, a new metric named T-cohesiveness is proposed to evaluate the cohesiveness of a temporal subgraph. It defines the cohesiveness of a temporal subgraph from the time and topology dimensions jointly. Specifically, given a temporal subgraph G_s = (V_s, E_s), cohesiveness in the time dimension reflects whether the connections in G_s happen within a short period of time, while cohesiveness in the topology dimension indicates whether the vertices in V_s are densely connected and have few connections with vertices outside G_s. Then, T-cohesiveness is utilized to perform time-topology analysis on temporal graphs, and two time-topology analysis methods are proposed. In detail, T-cohesiveness evolution tracking traces the evolution of the T-cohesiveness of a subgraph, and combo searching finds all the subgraphs that contain the query vertex and have T-cohesiveness larger than a given threshold. Moreover, a pruning strategy is proposed to improve the efficiency of combo searching. Experimental results confirm the efficiency of the proposed time-topology analysis methods and the pruning strategy.
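
As a rough illustration of scoring a temporal subgraph jointly in the time and topology dimensions, the sketch below combines a time-span term with a density-versus-boundary term. The exact T-cohesiveness formula is defined in the paper; the combination used here is only an assumption that mirrors the intuition in the abstract.

```python
# Illustrative joint time-topology cohesiveness score for a temporal subgraph (not the paper's formula).

def toy_cohesiveness(vertices, temporal_edges):
    """temporal_edges: iterable of (u, v, timestamp) over the whole graph."""
    vs = set(vertices)
    inside = [(u, v, t) for u, v, t in temporal_edges if u in vs and v in vs]
    boundary = [(u, v, t) for u, v, t in temporal_edges if (u in vs) ^ (v in vs)]
    if not inside:
        return 0.0

    # Time dimension: connections that happen within a short period score higher.
    times = [t for _, _, t in inside]
    span = max(times) - min(times)
    time_score = 1.0 / (1.0 + span)

    # Topology dimension: dense inside the subgraph, few edges crossing its boundary.
    max_pairs = len(vs) * (len(vs) - 1) / 2 or 1
    density = len({(min(u, v), max(u, v)) for u, v, _ in inside}) / max_pairs
    topo_score = density / (1.0 + len(boundary))

    return time_score * topo_score

edges = [(1, 2, 10), (2, 3, 12), (1, 3, 11), (3, 4, 50)]
print(toy_cohesiveness({1, 2, 3}, edges))
```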


2021 · Vol. 15 (1) · pp. 98-111
Author(s): Dong He, Maureen Daum, Walter Cai, Magdalena Balazinska

We design, implement, and evaluate DeepEverest, a system for the efficient execution of interpretation-by-example queries over the activation values of a deep neural network. DeepEverest consists of an efficient indexing technique and a query execution algorithm with various optimizations. We prove that the proposed query execution algorithm is instance optimal. Experiments with our prototype show that DeepEverest, using less than 20% of the storage of full materialization, significantly accelerates individual queries by up to 63X and consistently outperforms other methods on multi-query workloads that simulate DNN interpretation processes.
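
One representative interpretation-by-example query is "find the k inputs whose activations on a chosen group of neurons are most similar to a given input." The sketch below answers such a query with a full scan over materialized activations; it is purely illustrative and does not reflect DeepEverest's index or its instance-optimal algorithm.

```python
# Naive full-scan baseline for an interpretation-by-example query (illustrative only).
import numpy as np

def topk_most_similar(activations, ref_idx, neuron_ids, k=5):
    """activations: (n_inputs, n_neurons) matrix of stored layer activations."""
    selected = activations[:, neuron_ids]            # restrict to the neurons of interest
    ref = selected[ref_idx]
    dists = np.linalg.norm(selected - ref, axis=1)   # Euclidean distance to the reference input
    order = np.argsort(dists)
    return [i for i in order if i != ref_idx][:k]

acts = np.random.default_rng(1).normal(size=(1000, 256))
print(topk_most_similar(acts, ref_idx=0, neuron_ids=[3, 17, 42], k=5))
```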


2021 · Vol. 15 (1) · pp. 127-140
Author(s): Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair

Recommender models are commonly used to suggest relevant items to users in e-commerce and online advertising applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models increasingly requires more data and compute resources. The highly parallel neural-network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper takes a deep dive into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000X more often than others. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, which proposes a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the execution of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces overall training time by 2.3X and 1.52X in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
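
The hot-embedding idea can be sketched as a simple placement decision: profile embedding-row accesses in the training data and pin the most frequently accessed rows in GPU memory. The code below is an illustrative sketch with made-up names and thresholds, not the FAE implementation.

```python
# Sketch of hot/cold embedding placement based on observed access skew (illustrative only).
from collections import Counter

def split_hot_cold(access_log, gpu_budget_rows):
    """access_log: iterable of embedding row ids seen while scanning the training data."""
    freq = Counter(access_log)
    ranked = [row for row, _ in freq.most_common()]
    hot = set(ranked[:gpu_budget_rows])      # keep these rows resident in GPU memory
    cold = set(ranked[gpu_budget_rows:])     # the long tail stays in host (CPU) memory
    return hot, cold

# Skewed toy workload: row 7 is accessed far more often than the rest.
log = [7] * 10_000 + list(range(100)) * 3
hot, cold = split_hot_cold(log, gpu_budget_rows=16)
print(7 in hot, len(cold))
```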


2021 · Vol. 15 (1) · pp. 31-45
Author(s): Arjit Jain, Sunita Sarawagi, Prithviraj Sen

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real-world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active learning is a promising approach for ER in low-resource settings. However, the search space for finding informative samples for the user to label grows quadratically for instance-pair tasks, making active learning hard to scale. Previous works in this setting rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss important regions of the product space, leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record-matching dataset show the effectiveness of our approach in terms of precision, recall, and running time.
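
The committee idea can be sketched as follows: score each candidate pair with every committee member and ask the user to label the pairs the committee disagrees on most. The sketch below uses toy scoring functions as committee members; in DIAL the members are transformer-based encoders and candidate pairs come from a nearest-neighbour index over learned embeddings.

```python
# Toy committee-based example selection for active learning over record pairs (illustrative only).
import numpy as np

def most_uncertain_pairs(candidate_pairs, committee, k=10):
    """Pick the pairs the committee disagrees on most (highest score variance)."""
    scores = np.array([[m(a, b) for m in committee] for a, b in candidate_pairs])
    disagreement = scores.var(axis=1)
    order = np.argsort(-disagreement)
    return [candidate_pairs[i] for i in order[:k]]

rng = np.random.default_rng(0)
# Each committee member is a random linear scorer over the feature difference of a pair.
committee = [lambda a, b, w=rng.normal(size=4): float(1 / (1 + np.exp(-(w @ (a - b)))))
             for _ in range(5)]
pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(100)]
print(len(most_uncertain_pairs(pairs, committee, k=10)))
```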


2021 · Vol. 14 (13) · pp. 3253-3266
Author(s): Jian Liu, Kefei Wang, Feng Chen

Time-series databases are becoming an indispensable component in today's data centers. In order to manage rapidly growing time-series data, we need an effective and efficient system solution to handle the huge traffic of time-series data queries. A promising solution is to deploy a high-speed, large-capacity cache system to relieve the burden on the backend time-series databases and accelerate query processing. However, time-series data is drastically different from other traditional data workloads, bringing both challenges and opportunities. In this paper, we present a flash-based cache system design for time-series data, called TSCache. By exploiting the unique properties of time-series data, we have developed a set of optimization schemes, such as slab-based data management, a two-layered data indexing structure, an adaptive time-aware caching policy, and a low-cost compaction process. We have implemented a prototype based on Twitter's Fatcache. Our experimental results show that TSCache can significantly improve client query performance, effectively increasing bandwidth by a factor of up to 6.7 and reducing latency by up to 84.2%.
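
A minimal sketch of slab-based caching for time-series range queries appears below. It groups points into fixed-width time slabs and evicts whole slabs in LRU order; the slab width, eviction policy, and in-memory layout are assumptions made for illustration, whereas TSCache adds a flash-friendly layout, a two-layered index, and an adaptive time-aware policy.

```python
# Illustrative slab-based cache for time-series range queries (not the TSCache design).
from collections import OrderedDict

SLAB_SECONDS = 60  # assumed fixed slab width

class SlabCache:
    def __init__(self, capacity_slabs):
        self.capacity = capacity_slabs
        self.slabs = OrderedDict()                 # (series, slab_start) -> list of (ts, value)

    def put(self, series, points):
        for ts, value in points:
            key = (series, ts - ts % SLAB_SECONDS)
            self.slabs.setdefault(key, []).append((ts, value))
            self.slabs.move_to_end(key)            # LRU is tracked per slab, not per point
        while len(self.slabs) > self.capacity:
            self.slabs.popitem(last=False)         # evict the least recently used slab

    def query(self, series, start, end):
        out, slab = [], start - start % SLAB_SECONDS
        while slab <= end:
            out += [p for p in self.slabs.get((series, slab), []) if start <= p[0] <= end]
            slab += SLAB_SECONDS
        return out

cache = SlabCache(capacity_slabs=1024)
cache.put("cpu.load", [(t, t * 0.1) for t in range(0, 300, 10)])
print(len(cache.query("cpu.load", 60, 180)))
```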


2021 · Vol. 14 (13) · pp. 3376-3388
Author(s): Bailu Ding, Surajit Chaudhuri, Johannes Gehrke, Vivek Narasayya

We describe a new benchmark, DSB, for evaluating both workload-driven and traditional database systems on modern decision support workloads. DSB is adapted from the widely used, industry-standard TPC-DS benchmark. It enhances TPC-DS with complex data distributions and challenging yet semantically meaningful query templates. DSB also introduces configurable and dynamic workloads to assess the adaptability of database systems. Since workload-driven and traditional database systems have different performance dimensions, including the additional resources required for tuning and maintaining the systems, we provide guidelines on evaluation methodology and metrics to report. We present a case study on how to evaluate both workload-driven and traditional database systems with the DSB benchmark. The code for the DSB benchmark is open source and is available at https://aka.ms/dsb.


2021 · Vol. 14 (13) · pp. 3417-3417
Author(s): Nigam Shah

Using evidence derived from previously collected medical records to guide patient care has been a long-standing vision of clinicians and informaticians, and one with the potential to transform medical practice. We offered an on-demand consultation service to derive evidence from millions of other patients' data to answer clinicians' questions and support their bedside decision making. We describe the design and implementation of the service as well as a summary of our experience in responding to the first 100 requests. We will also review a new paradigm for scalable, time-aware clinical data search and describe the design, implementation, and use of a search engine realizing this paradigm.


2021 · Vol. 15 (1) · pp. 46-58
Author(s): Xuanhe Zhou, Guoliang Li, Chengliang Chai, Jianhua Feng

Query rewriting transforms a SQL query into an equivalent one with higher performance. However, SQL rewriting is an NP-hard problem, and existing approaches adopt heuristics to rewrite queries. These heuristics have two main limitations. First, the order in which different rewrite rules are applied significantly affects query performance. However, the search space of all possible rewrite orders grows exponentially with the number of query operators and rules, and it is rather hard to find the optimal rewrite order. Existing methods apply a pre-defined order to rewrite queries and may fall into a local optimum. Second, different rewrite rules have different benefits for different queries. Existing methods work on single plans and cannot effectively estimate the benefit of rewriting a query. To address these challenges, we propose a policy-tree-based query rewrite framework, where the root is the input query and each node is a query rewritten from its parent. We aim to explore the nodes in the policy tree to find the optimal rewritten query. We propose to use Monte Carlo Tree Search to explore the policy tree, navigating it to efficiently reach the optimal node. Moreover, we propose a learning-based model to estimate the expected performance improvement of each rewritten query, which guides the tree search more accurately. We also propose a parallel algorithm that explores the policy tree in parallel to further improve performance. Experimental results show that our method significantly outperforms existing approaches.
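
The policy-tree search can be sketched with a toy Monte Carlo Tree Search in which a "query" is reduced to a scalar cost, rewrite rules are simple cost transformations, and the reward is the cost reduction from the root. Everything in the sketch (the rules, the rollout depth, the UCB constant) is an illustrative assumption rather than the paper's system, which uses a real rule engine and a learned benefit estimator.

```python
# Toy MCTS over rewrite-rule orderings (illustrative only).
import math, random

RULES = [lambda c: c * 0.9, lambda c: c - 5 if c > 50 else c, lambda c: c]  # toy rewrite rules

class Node:
    def __init__(self, cost, parent=None):
        self.cost, self.parent = cost, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(child, parent, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root_cost, iterations=200):
    root = Node(root_cost)
    for _ in range(iterations):
        node = root
        # Selection: walk down by UCB while the node is fully expanded.
        while node.children and len(node.children) == len(RULES):
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # Expansion: try the next untried rewrite rule at this node.
        if len(node.children) < len(RULES):
            rule = RULES[len(node.children)]
            node.children.append(Node(rule(node.cost), parent=node))
            node = node.children[-1]
        # Rollout: apply a few random rules; reward is the cost reduction from the root.
        cost = node.cost
        for _ in range(3):
            cost = random.choice(RULES)(cost)
        reward = root_cost - cost
        # Backpropagation.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).cost

print(mcts(100.0))
```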

