ACM/IMS Transactions on Data Science
Latest Publications

TOTAL DOCUMENTS: 59 (five years: 59)
H-INDEX: 1 (five years: 1)
Published by: Association for Computing Machinery (ACM)
ISSN: 2691-1922

2021, Vol. 2(4), pp. 1-16
Author(s): Zhekai Du, Jingjing Li, Lei Zhu, Ke Lu, Heng Tao Shen

Energy disaggregation, also known as non-intrusive load monitoring (NILM), addresses the problem of separating whole-home electricity usage into appliance-specific individual consumptions, a typical application of data analysis. NILM aims to help households understand how their energy is used and, consequently, how to manage it effectively, thereby enabling energy efficiency, which is considered one of the twin pillars of sustainable energy policy (the other being renewable energy). Although NILM is unidentifiable, it is widely believed that the problem can be addressed by data science. Most existing approaches tackle energy disaggregation with conventional techniques such as sparse coding, non-negative matrix factorization, and hidden Markov models. Recent advances reveal that deep neural networks (DNNs) can achieve favorable performance on NILM, since DNNs can inherently learn the discriminative signatures of different appliances. In this article, we propose a novel DNN-based method named adversarial energy disaggregation. We introduce the idea of adversarial learning into NILM, which is new for the energy disaggregation task. Our method trains a generator and multiple discriminators in an adversarial fashion. The proposed method not only learns shared representations for different appliances but also captures the specific multimode structures of each appliance. Extensive experiments on real-world datasets verify that our method achieves new state-of-the-art performance.
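
To make the training scheme concrete, below is a minimal sketch of a shared generator trained against multiple appliance-specific discriminators, assuming PyTorch; the window length, layer sizes, appliance list, and random stand-in data are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

WINDOW, APPLIANCES, BATCH = 128, ["fridge", "kettle", "washer"], 32

# Generator: maps an aggregate-load window to one trace per appliance.
gen = nn.Sequential(
    nn.Linear(WINDOW, 256), nn.ReLU(),
    nn.Linear(256, WINDOW * len(APPLIANCES)),
)
# One discriminator per appliance, judging real vs. generated traces.
discs = [nn.Sequential(nn.Linear(WINDOW, 64), nn.ReLU(), nn.Linear(64, 1))
         for _ in APPLIANCES]

g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opts = [torch.optim.Adam(d.parameters(), lr=1e-4) for d in discs]
bce = nn.BCEWithLogitsLoss()

aggregate = torch.randn(BATCH, WINDOW)                   # stand-in mains readings
real = [torch.randn(BATCH, WINDOW) for _ in APPLIANCES]  # stand-in submetered traces

for step in range(100):
    # Discriminator updates: tell real appliance traces from generated ones.
    with torch.no_grad():
        fake = gen(aggregate).view(BATCH, len(APPLIANCES), WINDOW)
    for i, (d, d_opt) in enumerate(zip(discs, d_opts)):
        d_loss = (bce(d(real[i]), torch.ones(BATCH, 1))
                  + bce(d(fake[:, i]), torch.zeros(BATCH, 1)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator update: fool all appliance-specific discriminators at once.
    fake = gen(aggregate).view(BATCH, len(APPLIANCES), WINDOW)
    g_loss = sum(bce(d(fake[:, i]), torch.ones(BATCH, 1))
                 for i, d in enumerate(discs))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

The single generator embodies the shared representation across appliances, while the per-appliance discriminators pressure it to reproduce each appliance's specific signature.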


2021, Vol. 2(4), pp. 1-32
Author(s): Chance Desmet, Diane J. Cook

With dramatic improvements in both the ability to collect personal data and the ability to analyze it in large quantities, increasingly sophisticated and personal insights are being drawn. These insights are valuable for clinical applications but also open up possibilities for identification and abuse of personal information. In this article, we survey recent research on classical methods of privacy-preserving data mining. Looking at dominant techniques and recent innovations to them, we examine the applicability of these methods to the privacy-preserving analysis of clinical data. We also discuss promising directions for future research in this area.


2021, Vol. 2(3), pp. 1-28
Author(s): Jie Song, Qiang He, Feifei Chen, Ye Yuan, Ge Yu

In big data query processing, there is a trade-off between query accuracy and query efficiency; for example, sampling-based query approaches trade query completeness for efficiency. In this article, we argue that query performance can be significantly improved by slightly sacrificing the possibility of query completeness, that is, the chance that a query is complete. To quantify this possibility, we define a new concept, the Probability of query Completeness (hereinafter referred to as PC). For example, if a query is executed 100 times, PC = 0.95 guarantees that there are no more than 5 incomplete results among the 100 results. Leveraging probabilistic data placement and scanning, we trade PC for query performance. We propose PoBery (POssibly-complete Big data quERY), a method that supports neither complete queries nor incomplete queries, but possibly-complete queries. Experimental results on HiBench show that PoBery can significantly accelerate queries while ensuring the PC; specifically, the percentage of complete queries is guaranteed to be larger than the given PC confidence. Through comparison with state-of-the-art key-value stores, we show that while Drill-based PoBery performs as fast as Drill on complete queries, it is 1.7×, 1.1×, and 1.5× faster on average than Drill, Impala, and Hive, respectively, on possibly-complete queries.
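
The PC guarantee can be illustrated with a back-of-envelope simulation (our own toy model, not PoBery's placement/scanning algorithm): if a query needs b blocks and the probabilistic scan finds each independently with probability p, the query is complete with probability p**b, so p can be set to meet a target PC.

```python
import random

def simulate_pc(p: float, blocks: int, runs: int = 100_000) -> float:
    """Fraction of simulated queries that see every required block."""
    complete = sum(all(random.random() < p for _ in range(blocks))
                   for _ in range(runs))
    return complete / runs

target_pc, blocks = 0.95, 4
p = target_pc ** (1 / blocks)   # per-block probability meeting the target PC
print(f"per-block p = {p:.4f}, empirical PC = {simulate_pc(p, blocks):.4f}")
```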


2021, Vol. 2(3), pp. 1-24
Author(s): Subhadip Maji, Smarajit Bose

In a Content-Based Image Retrieval (CBIR) system, the task is to retrieve, from a large database, images similar to a given query image. The usual procedure is to extract useful features from the query image and retrieve images that have a similar set of features; for this purpose, a suitable similarity measure is chosen, and images with high similarity scores are returned. Naturally, the choice of these features plays a very important role in the success of such a system, and high-level features are required to reduce the “semantic gap.” In this article, we propose to use features derived from a pre-trained deep convolutional network trained on a large image classification problem. This approach produces vastly superior results on a variety of databases and outperforms many contemporary CBIR systems. We analyse the retrieval time of the method and also propose a pre-clustering of the database based on the above features, which yields comparable results in a much shorter time in most cases.
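
As a sketch of the retrieval pipeline (assuming the deep features have already been extracted by a pre-trained CNN, which is elided here), the following ranks images by cosine similarity and uses a k-means pre-clustering to restrict the search; the matrix sizes and random placeholder data are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 512))   # one deep-feature row per database image
query = rng.normal(size=512)          # deep features of the query image

def top_k(q, feats, k=10):
    """Rank images by cosine similarity to the query."""
    feats_n = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    scores = feats_n @ (q / np.linalg.norm(q))
    return np.argsort(scores)[::-1][:k]

# Pre-clustering: search only the cluster whose centroid is nearest the
# query, trading a little recall for a large drop in retrieval time.
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(db)
cluster = km.predict(query[None, :])[0]
members = np.where(km.labels_ == cluster)[0]
hits = members[top_k(query, db[members], k=10)]
print(hits)
```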


2021, Vol. 2(3), pp. 1-22
Author(s): Yan Leng, Alejandro Noriega, Alex Pentland

Tourism has been an increasingly significant contributor to the economy, society, and the environment. Policy-making and research on tourism have traditionally relied on surveys and economic datasets, which are based on small samples and depict tourism dynamics at low granularity. Anonymized call detail records (CDRs) are a novel data source with enormous potential in areas of high societal value: epidemics, poverty, and urban development. This study demonstrates the added value of CDRs in event tourism, especially for the analysis and evaluation of marketing strategies, event operations, and externalities at the local and national levels. To this end, we formalize 14 indicators at high spatial and temporal resolution that measure both the positive and the negative impacts of tourist events. We exemplify the use of these indicators in a tourism-driven country, Andorra, on 22 high-impact events, including sports competitions, cultural performances, and music festivals, analyzing large-scale CDR data spanning two years. Our approach serves as a prescriptive and diagnostic tool based on mobile phone data and opens up future directions for tourism analytics.
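
As one hypothetical example of such an indicator (our own construction, not one of the paper's 14), the snippet below counts distinct visitors observed at cells near an event venue during the event window; the schema and toy records are invented for illustration.

```python
import pandas as pd

# Toy CDR table: one row per (user, cell tower, timestamp) observation.
cdr = pd.DataFrame({
    "user": ["a", "a", "b", "c", "c"],
    "cell": ["venue", "venue", "venue", "elsewhere", "venue"],
    "time": pd.to_datetime(["2019-07-06 18:05", "2019-07-06 19:40",
                            "2019-07-06 18:30", "2019-07-06 18:10",
                            "2019-07-07 02:00"]),
})

event_cells = {"venue"}   # cells covering the event venue (assumption)
window = (pd.Timestamp("2019-07-06 17:00"), pd.Timestamp("2019-07-06 23:00"))

# Attendance indicator: distinct phones seen at venue cells during the event.
mask = cdr["cell"].isin(event_cells) & cdr["time"].between(*window)
attendance = cdr.loc[mask, "user"].nunique()
print(attendance)   # -> 2 (user "c" is outside the window at the venue)
```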


2021, Vol. 2(3), pp. 1-36
Author(s): Marco Gramaglia, Marco Fiore, Angelo Furno, Razvan Stanica

Datasets of mobile phone trajectories collected by network operators offer an unprecedented opportunity to discover new knowledge from the activity of populations of millions of individuals. However, publishing such trajectories also raises significant privacy concerns, as they contain personal data in the form of individual movement patterns. Privacy risks induce network operators to enforce restrictive confidentiality agreements on the rare occasions when they grant access to collected trajectories, whereas a less constrained circulation of these data would fuel research and enable reproducibility in many disciplines. In this work, we contribute a building block toward the design of privacy-preserving datasets of mobile phone trajectories that are truthful at the record level. We present GLOVE, an algorithm that implements k-anonymity, hence solving the crucial unicity problem that affects this type of data while ensuring that the anonymized trajectories correspond to real-life users. GLOVE builds on original insights about the root causes of the undesirable unicity of mobile phone trajectories, and leverages generalization and suppression to remove them. Proof-of-concept validations with large-scale real-world datasets demonstrate that the approach adopted by GLOVE preserves a substantial level of accuracy in the data, higher than that granted by previous methodologies.
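
To illustrate the two levers (generalization and suppression) in the simplest possible form, here is a toy sketch of trajectory k-anonymity that coarsens spatiotemporal points level by level and suppresses records that remain unique; this is our simplification, not the GLOVE algorithm, and the grids, levels, and data are assumptions.

```python
from collections import Counter

def generalize(traj, level):
    """Coarsen each (x, y, t) point; a higher level means coarser cells/bins."""
    s = 2 ** level
    return tuple((x // s, y // s, t // s) for (x, y, t) in traj)

def anonymize(trajs, k=2, max_level=6):
    # Generalization: coarsen until every trajectory is shared by >= k users.
    for level in range(max_level + 1):
        gen = [generalize(t, level) for t in trajs]
        counts = Counter(gen)
        if all(counts[g] >= k for g in gen):
            return gen, level, 0
    # Suppression: drop records still unique at the coarsest level.
    kept = [g for g in gen if counts[g] >= k]
    return kept, max_level, len(gen) - len(kept)

trajs = [((0, 0, 0), (4, 4, 10)), ((1, 0, 1), (5, 4, 11)), ((40, 40, 2),)]
anon, level, suppressed = anonymize(trajs, k=2)
print(level, suppressed)   # the outlier trajectory ends up suppressed
```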


2021, Vol. 2(3), pp. 1-25
Author(s): Paramita Dey, Subhayan Bhattacharya, Sarbani Roy

Following the popular concept of six degrees of separation, social networks are generally analyzed from the perspective of small-world networks, where node centrality plays a pivotal role in information propagation. However, working with a large dataset of a scale-free network (which follows a power law) may behave differently due to the nature of the social graph. Moreover, deriving centrality may be difficult due to the computational complexity of the centrality measures. This study provides a comprehensive and extensive review and comparison of seven centrality measures (clustering coefficient, node degree, k-core, betweenness, closeness, eigenvector, PageRank) using four information propagation methods (Breadth-First Search, Random Walk, Susceptible-Infected-Removed, Forest Fire). Five benchmark similarity measures (Tanimoto, Hamming, Dice, Sorensen, Jaccard) are used to measure the similarity between the seed nodes identified by the centrality measures and the actual source seeds derived through Google's LargeStar-SmallStar algorithm on Twitter stream data. MapReduce is utilized both to identify the seed nodes based on centrality measures and to simulate information propagation. We observe that most centrality measures compare well with the actual sources in the initial stage but saturate after a certain level of influence maximization, in terms of both affected nodes and propagation level.
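
A condensed sketch of the comparison pipeline, assuming networkx, a synthetic scale-free graph in place of the Twitter stream, and a random placeholder standing in for the LargeStar-SmallStar ground-truth seeds:

```python
import random
import networkx as nx

G = nx.barabasi_albert_graph(500, 3, seed=1)   # scale-free stand-in graph
K = 20                                          # seed-set size

centralities = {
    "degree":      nx.degree_centrality(G),
    "k-core":      nx.core_number(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness":   nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=500),
    "pagerank":    nx.pagerank(G),
    "clustering":  nx.clustering(G),
}

random.seed(1)
actual_seeds = set(random.sample(list(G.nodes), K))  # placeholder ground truth

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Compare each centrality's top-K seed set against the "actual" sources.
for name, scores in centralities.items():
    seeds = set(sorted(scores, key=scores.get, reverse=True)[:K])
    print(f"{name:12s} Jaccard vs. actual seeds: {jaccard(seeds, actual_seeds):.3f}")
```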


2021, Vol. 2(3), pp. 1-37
Author(s): Hans Walter Behrens, K. Selçuk Candan, Xilun Chen, Yash Garg, Mao-Lin Li, ...

Urban systems are characterized by complexity and dynamicity. Data-driven simulations represent a promising approach to understanding and predicting complex dynamic processes in the presence of shifting demands on urban systems. Yet, today’s silo-based, decoupled simulation engines fail to provide an end-to-end view of the complex urban system, preventing informed decision-making. In this article, we present DataStorm to support the integration of existing simulation, analysis, and visualization components into integrated workflows. DataStorm provides a flow engine, DataStorm-FE, for coordinating data and decision flows among multiple actors (each representing a model, analytic operation, or decision criterion) and enables ensemble planning and optimization across cloud resources. DataStorm provides native support for simulation ensemble creation through parameter-space sampling to decide which simulations to run, as well as distributed instantiation and parallel execution of simulation instances on cluster resources. Recognizing that simulation ensembles are inherently sparse relative to the potential parameter space, we also present a density-boosting partition-stitch sampling scheme that increases the effective density of the simulation ensemble through sub-space partitioning, complemented with an efficient stitching mechanism that leverages partial and imperfect knowledge from partial dynamical systems to obtain a global view of the complex urban process being simulated.
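
A minimal sketch of ensemble creation by parameter-space sampling followed by parallel instantiation, in the spirit of the workflow described above; the parameter grid and run_simulation are hypothetical placeholders, and the sampler is plain random sampling rather than the paper's partition-stitch scheme.

```python
import itertools
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical urban-simulation parameter space (an assumption, for illustration).
param_space = {
    "population_growth": [0.5, 1.0, 1.5],
    "transit_budget":    [10, 20, 40],
    "weather":           ["dry", "wet"],
}

def run_simulation(params):
    """Placeholder for one simulation instance; returns a fake output metric."""
    return sum(len(str(v)) for v in params.values())

# Enumerate the full Cartesian grid, sample a sparse ensemble from it, and
# fan the instances out for parallel execution.
grid = [dict(zip(param_space, combo))
        for combo in itertools.product(*param_space.values())]
random.seed(0)
ensemble = random.sample(grid, k=6)   # sparse subset of the 18-point grid

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_simulation, ensemble))
print(list(zip(ensemble, results)))
```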


2021, Vol. 2(3), pp. 1-23
Author(s): Vishal Chakraborty, Theo Delemazure, Benny Kimelfeld, Phokion G. Kolaitis, Kunal Relia, ...

We investigate the practical aspects of computing the necessary and possible winners in elections over incomplete voter preferences. For the necessary winners, we show how to implement and accelerate the polynomial-time algorithm of Xia and Conitzer. For the possible winners, where the problem is NP-hard, we give a natural reduction to Integer Linear Programming (ILP) for all positional scoring rules and implement it in a leading commercial optimization solver. Further, we devise optimization techniques that minimize the number of ILP executions and, oftentimes, avoid them altogether. We conduct a thorough experimental study that includes the construction of a rich benchmark of election data based on real and synthetic data. Our findings suggest that, notwithstanding the worst-case intractability of the possible winners, the algorithmic techniques presented here scale well and can be used to compute the possible winners in realistic scenarios.
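
To make the two notions concrete, here is a brute-force illustration (not the paper's algorithms) that enumerates every completion of a tiny partial profile under plurality and checks which candidates win in some completion (possible winners) or in every completion (necessary winners):

```python
from itertools import permutations, product

candidates = ["a", "b", "c"]
score_vec = [1, 0, 0]   # plurality as the positional scoring rule

# Each partial vote is the set of "x is preferred to y" pairs known so far.
partial_votes = [{("a", "b")}, {("b", "c")}, set()]

def completions(partial):
    """All total orders consistent with a partial vote."""
    return [order for order in permutations(candidates)
            if all(order.index(x) < order.index(y) for x, y in partial)]

def winners(profile):
    """Co-winners of a complete profile under the scoring rule."""
    scores = {c: 0 for c in candidates}
    for order in profile:
        for pos, c in enumerate(order):
            scores[c] += score_vec[pos]
    best = max(scores.values())
    return {c for c in candidates if scores[c] == best}

possible, necessary = set(), set(candidates)
for profile in product(*(completions(v) for v in partial_votes)):
    w = winners(profile)
    possible |= w    # wins in at least one completion
    necessary &= w   # wins in every completion
print("possible:", possible, "necessary:", necessary)
```

The paper's ILP reduction replaces this exponential enumeration, which is only feasible for toy instances.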


2021, Vol. 2(3), pp. 1-21
Author(s): Xiancai Tian, Baihua Zheng, Yazhe Wang, Hsiao-Ting Huang, Chih-Chieh Hung

In this article, we aim to recover the exact routes taken by commuters inside a metro system, which are not captured by the Automated Fare Collection (AFC) system and hence remain unknown. We propose two inference tasks: one infers the travel time of each travel link contributing to the total duration of any trip inside the metro network, and the other infers route preferences based on historical trip records and the link travel times inferred by the first task. Because these two tasks are interrelated, most existing works perform them simultaneously. Our solution, TripDecoder, adopts a totally different approach. It exploits the fact that some trips inside a metro system have only one practical route, and strategically decouples the two tasks: only trip records with a single practical route feed the first inference task (travel time), and the inferred travel times are then fed to the second task as an additional input. This not only improves accuracy but also effectively reduces the complexity of both tasks. Two case studies based on city-scale real trip records captured by the AFC systems in Singapore and Taipei compare the accuracy and efficiency of TripDecoder and its competitors. TripDecoder achieves the best accuracy on both datasets and also demonstrates superior efficiency and scalability.
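
A toy sketch of the decoupling idea under strong simplifying assumptions (not TripDecoder itself): trips with a single feasible route define a linear system over link travel times, and the fitted times then score the candidate routes of an ambiguous trip. The network, trips, and durations below are invented for illustration.

```python
import numpy as np

links = ["s1-s2", "s2-s3", "s3-s4", "s2-s4"]

# Single-route trips: (indices of links used, observed duration in minutes).
single_route_trips = [
    ([0], 3.1), ([1], 4.0), ([0, 1], 7.0),
    ([2], 5.0), ([3], 6.1), ([0, 3], 9.0),
]

# Task 1: infer per-link travel times by least squares over these trips.
A = np.zeros((len(single_route_trips), len(links)))
t = np.zeros(len(single_route_trips))
for row, (used, dur) in enumerate(single_route_trips):
    A[row, used] = 1.0
    t[row] = dur
link_time, *_ = np.linalg.lstsq(A, t, rcond=None)

# Task 2: for a trip with two candidate routes, prefer the route whose
# predicted time best matches the observed duration.
candidates = {"via s3": [0, 1, 2], "direct": [0, 3]}
observed = 9.2
best = min(candidates,
           key=lambda r: abs(link_time[candidates[r]].sum() - observed))
print(dict(zip(links, link_time.round(2))), "->", best)
```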

