ACM/IMS Transactions on Data Science
Latest Publications

TOTAL DOCUMENTS: 59 (five years: 59)
H-INDEX: 1 (five years: 1)
Published by: Association for Computing Machinery (ACM)
ISSN: 2691-1922

2021, Vol. 2(4), pp. 1-16
Author(s): Zhekai Du, Jingjing Li, Lei Zhu, Ke Lu, Heng Tao Shen

Energy disaggregation, also known as non-intrusive load monitoring (NILM), addresses the problem of separating whole-home electricity usage into appliance-specific individual consumptions, a typical application of data analysis. NILM aims to help households understand how their energy is used and, consequently, how to manage it effectively, thereby enabling energy efficiency, which is considered one of the twin pillars of sustainable energy policy (the other being renewable energy). Although NILM is unidentifiable, it is widely believed that the problem can be addressed by data science. Most existing approaches tackle energy disaggregation with conventional techniques such as sparse coding, non-negative matrix factorization, and hidden Markov models. Recent advances reveal that deep neural networks (DNNs) can achieve favorable performance on NILM, since DNNs can inherently learn the discriminative signatures of different appliances. In this article, we propose a novel DNN-based method named adversarial energy disaggregation. We introduce the idea of adversarial learning into NILM, which is new for the energy disaggregation task. Our method trains a generator and multiple discriminators in an adversarial fashion. The proposed method not only learns shared representations for different appliances but also captures the specific multimode structures of each appliance. Extensive experiments on real-world datasets verify that our method achieves new state-of-the-art performance.
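
To make the training scheme concrete, below is a minimal sketch of a shared generator trained against multiple appliance-specific discriminators, assuming PyTorch; the window length, layer sizes, appliance list, and random stand-in data are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

WINDOW, APPLIANCES, BATCH = 128, ["fridge", "kettle", "washer"], 32

# Generator: maps an aggregate-load window to one trace per appliance.
gen = nn.Sequential(
    nn.Linear(WINDOW, 256), nn.ReLU(),
    nn.Linear(256, WINDOW * len(APPLIANCES)),
)
# One discriminator per appliance, judging real vs. generated traces.
discs = [nn.Sequential(nn.Linear(WINDOW, 64), nn.ReLU(), nn.Linear(64, 1))
         for _ in APPLIANCES]

g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opts = [torch.optim.Adam(d.parameters(), lr=1e-4) for d in discs]
bce = nn.BCEWithLogitsLoss()

aggregate = torch.randn(BATCH, WINDOW)                   # stand-in mains readings
real = [torch.randn(BATCH, WINDOW) for _ in APPLIANCES]  # stand-in submetered traces

for step in range(100):
    # Discriminator updates: tell real appliance traces from generated ones.
    with torch.no_grad():
        fake = gen(aggregate).view(BATCH, len(APPLIANCES), WINDOW)
    for i, (d, d_opt) in enumerate(zip(discs, d_opts)):
        d_loss = (bce(d(real[i]), torch.ones(BATCH, 1))
                  + bce(d(fake[:, i]), torch.zeros(BATCH, 1)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator update: fool all appliance-specific discriminators at once.
    fake = gen(aggregate).view(BATCH, len(APPLIANCES), WINDOW)
    g_loss = sum(bce(d(fake[:, i]), torch.ones(BATCH, 1))
                 for i, d in enumerate(discs))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

The single generator embodies the shared representation across appliances, while the per-appliance discriminators pressure it to reproduce each appliance's specific signature.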


2021, Vol. 2(4), pp. 1-32
Author(s): Chance Desmet, Diane J. Cook

With dramatic improvements in both the ability to collect personal data and the ability to analyze it in large quantities, increasingly sophisticated and personal insights are being drawn. These insights are valuable for clinical applications but also open up possibilities for identification and abuse of personal information. In this article, we survey recent research on classical methods of privacy-preserving data mining. Looking at dominant techniques and recent innovations to them, we examine the applicability of these methods to the privacy-preserving analysis of clinical data. We also discuss promising directions for future research in this area.


2021, Vol. 2(3), pp. 1-28
Author(s): Jie Song, Qiang He, Feifei Chen, Ye Yuan, Ge Yu

In big data query processing, there is a trade-off between query accuracy and query efficiency; for example, sampling-based query approaches trade query completeness for efficiency. In this article, we argue that query performance can be significantly improved by slightly sacrificing the possibility of query completeness, that is, the chance that a query is complete. To quantify this possibility, we define a new concept, the Probability of query Completeness (hereinafter referred to as PC). For example, if a query is executed 100 times, PC = 0.95 guarantees that there are no more than 5 incomplete results among the 100 results. Leveraging probabilistic data placement and scanning, we trade PC for query performance. We propose PoBery (POssibly-complete Big data quERY), a method that supports neither complete queries nor incomplete queries, but possibly-complete queries. Experimental results on HiBench show that PoBery can significantly accelerate queries while ensuring the PC; specifically, the percentage of complete queries is guaranteed to be larger than the given PC confidence. Through comparison with state-of-the-art key-value stores, we show that while Drill-based PoBery performs as fast as Drill on complete queries, it is 1.7×, 1.1×, and 1.5× faster on average than Drill, Impala, and Hive, respectively, on possibly-complete queries.
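
The PC guarantee can be illustrated with a back-of-envelope simulation (our own toy model, not PoBery's placement/scanning algorithm): if a query needs b blocks and the probabilistic scan finds each independently with probability p, the query is complete with probability p**b, so p can be set to meet a target PC.

```python
import random

def simulate_pc(p: float, blocks: int, runs: int = 100_000) -> float:
    """Fraction of simulated queries that see every required block."""
    complete = sum(all(random.random() < p for _ in range(blocks))
                   for _ in range(runs))
    return complete / runs

target_pc, blocks = 0.95, 4
p = target_pc ** (1 / blocks)   # per-block probability meeting the target PC
print(f"per-block p = {p:.4f}, empirical PC = {simulate_pc(p, blocks):.4f}")
```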


2021, Vol. 2(3), pp. 1-24
Author(s): Subhadip Maji, Smarajit Bose

In a Content-Based Image Retrieval (CBIR) system, the task is to retrieve, from a large database, images similar to a given query image. The usual procedure is to extract useful features from the query image and retrieve images that have a similar set of features; for this purpose, a suitable similarity measure is chosen, and images with high similarity scores are returned. Naturally, the choice of these features plays a very important role in the success of such a system, and high-level features are required to reduce the “semantic gap.” In this article, we propose to use features derived from a pre-trained deep convolutional network trained on a large image classification problem. This approach produces vastly superior results on a variety of databases and outperforms many contemporary CBIR systems. We analyse the retrieval time of the method and also propose a pre-clustering of the database based on the above features, which yields comparable results in a much shorter time in most cases.
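
As a sketch of the retrieval pipeline (assuming the deep features have already been extracted by a pre-trained CNN, which is elided here), the following ranks images by cosine similarity and uses a k-means pre-clustering to restrict the search; the matrix sizes and random placeholder data are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 512))   # one deep-feature row per database image
query = rng.normal(size=512)          # deep features of the query image

def top_k(q, feats, k=10):
    """Rank images by cosine similarity to the query."""
    feats_n = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    scores = feats_n @ (q / np.linalg.norm(q))
    return np.argsort(scores)[::-1][:k]

# Pre-clustering: search only the cluster whose centroid is nearest the
# query, trading a little recall for a large drop in retrieval time.
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(db)
cluster = km.predict(query[None, :])[0]
members = np.where(km.labels_ == cluster)[0]
hits = members[top_k(query, db[members], k=10)]
print(hits)
```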


2021, Vol. 2(3), pp. 1-22
Author(s): Yan Leng, Alejandro Noriega, Alex Pentland

Tourism has been an increasingly significant contributor to the economy, society, and the environment. Policy-making and research on tourism have traditionally relied on surveys and economic datasets, which are based on small samples and depict tourism dynamics at low granularity. Anonymized call detail records (CDRs) are a novel data source with enormous potential in areas of high societal value: epidemics, poverty, and urban development. This study demonstrates the added value of CDRs in event tourism, especially for the analysis and evaluation of marketing strategies, event operations, and externalities at the local and national levels. To this end, we formalize 14 indicators at high spatial and temporal resolution that measure both the positive and the negative impacts of tourist events. We exemplify the use of these indicators in a tourism-driven country, Andorra, on 22 high-impact events, including sports competitions, cultural performances, and music festivals, analyzing large-scale CDR data spanning two years. Our approach serves as a prescriptive and diagnostic tool based on mobile phone data and opens up future directions for tourism analytics.
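
As one hypothetical example of such an indicator (our own construction, not one of the paper's 14), the snippet below counts distinct visitors observed at cells near an event venue during the event window; the schema and toy records are invented for illustration.

```python
import pandas as pd

# Toy CDR table: one row per (user, cell tower, timestamp) observation.
cdr = pd.DataFrame({
    "user": ["a", "a", "b", "c", "c"],
    "cell": ["venue", "venue", "venue", "elsewhere", "venue"],
    "time": pd.to_datetime(["2019-07-06 18:05", "2019-07-06 19:40",
                            "2019-07-06 18:30", "2019-07-06 18:10",
                            "2019-07-07 02:00"]),
})

event_cells = {"venue"}   # cells covering the event venue (assumption)
window = (pd.Timestamp("2019-07-06 17:00"), pd.Timestamp("2019-07-06 23:00"))

# Attendance indicator: distinct phones seen at venue cells during the event.
mask = cdr["cell"].isin(event_cells) & cdr["time"].between(*window)
attendance = cdr.loc[mask, "user"].nunique()
print(attendance)   # -> 2 (user "c" is outside the window at the venue)
```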


2021, Vol. 2(3), pp. 1-36
Author(s): Marco Gramaglia, Marco Fiore, Angelo Furno, Razvan Stanica

Datasets of mobile phone trajectories collected by network operators offer an unprecedented opportunity to discover new knowledge from the activity of populations of millions of individuals. However, publishing such trajectories also raises significant privacy concerns, as they contain personal data in the form of individual movement patterns. Privacy risks induce network operators to enforce restrictive confidentiality agreements on the rare occasions when they grant access to collected trajectories, whereas a less constrained circulation of these data would fuel research and enable reproducibility in many disciplines. In this work, we contribute a building block toward the design of privacy-preserving datasets of mobile phone trajectories that are truthful at the record level. We present GLOVE, an algorithm that implements k-anonymity, hence solving the crucial unicity problem that affects this type of data while ensuring that the anonymized trajectories correspond to real-life users. GLOVE builds on original insights about the root causes of the undesirable unicity of mobile phone trajectories, and leverages generalization and suppression to remove them. Proof-of-concept validations with large-scale real-world datasets demonstrate that the approach adopted by GLOVE preserves a substantial level of accuracy in the data, higher than that granted by previous methodologies.
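
To illustrate the two levers (generalization and suppression) in the simplest possible form, here is a toy sketch of trajectory k-anonymity that coarsens spatiotemporal points level by level and suppresses records that remain unique; this is our simplification, not the GLOVE algorithm, and the grids, levels, and data are assumptions.

```python
from collections import Counter

def generalize(traj, level):
    """Coarsen each (x, y, t) point; a higher level means coarser cells/bins."""
    s = 2 ** level
    return tuple((x // s, y // s, t // s) for (x, y, t) in traj)

def anonymize(trajs, k=2, max_level=6):
    # Generalization: coarsen until every trajectory is shared by >= k users.
    for level in range(max_level + 1):
        gen = [generalize(t, level) for t in trajs]
        counts = Counter(gen)
        if all(counts[g] >= k for g in gen):
            return gen, level, 0
    # Suppression: drop records still unique at the coarsest level.
    kept = [g for g in gen if counts[g] >= k]
    return kept, max_level, len(gen) - len(kept)

trajs = [((0, 0, 0), (4, 4, 10)), ((1, 0, 1), (5, 4, 11)), ((40, 40, 2),)]
anon, level, suppressed = anonymize(trajs, k=2)
print(level, suppressed)   # the outlier trajectory ends up suppressed
```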


2021, Vol. 2(3), pp. 1-25
Author(s): Paramita Dey, Subhayan Bhattacharya, Sarbani Roy

Following the popular concept of six degrees of separation, social networks are generally analyzed from the perspective of small-world networks, where node centrality plays a pivotal role in information propagation. However, working with a large dataset of a scale-free network (which follows a power law) may behave differently due to the nature of the social graph. Moreover, deriving centrality may be difficult due to the computational complexity of the centrality measures. This study provides a comprehensive and extensive review and comparison of seven centrality measures (clustering coefficient, node degree, k-core, betweenness, closeness, eigenvector, PageRank) using four information propagation methods (Breadth-First Search, Random Walk, Susceptible-Infected-Removed, Forest Fire). Five benchmark similarity measures (Tanimoto, Hamming, Dice, Sorensen, Jaccard) are used to measure the similarity between the seed nodes identified by the centrality measures and the actual source seeds derived through Google's LargeStar-SmallStar algorithm on Twitter stream data. MapReduce is utilized both to identify the seed nodes based on centrality measures and to simulate information propagation. We observe that most centrality measures compare well with the actual sources in the initial stage but saturate after a certain level of influence maximization, in terms of both affected nodes and propagation level.
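
A condensed sketch of the comparison pipeline, assuming networkx, a synthetic scale-free graph in place of the Twitter stream, and a random placeholder standing in for the LargeStar-SmallStar ground-truth seeds:

```python
import random
import networkx as nx

G = nx.barabasi_albert_graph(500, 3, seed=1)   # scale-free stand-in graph
K = 20                                          # seed-set size

centralities = {
    "degree":      nx.degree_centrality(G),
    "k-core":      nx.core_number(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness":   nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=500),
    "pagerank":    nx.pagerank(G),
    "clustering":  nx.clustering(G),
}

random.seed(1)
actual_seeds = set(random.sample(list(G.nodes), K))  # placeholder ground truth

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Compare each centrality's top-K seed set against the "actual" sources.
for name, scores in centralities.items():
    seeds = set(sorted(scores, key=scores.get, reverse=True)[:K])
    print(f"{name:12s} Jaccard vs. actual seeds: {jaccard(seeds, actual_seeds):.3f}")
```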


2021, Vol. 2(3), pp. 1-37
Author(s): Hans Walter Behrens, K. Selçuk Candan, Xilun Chen, Yash Garg, Mao-Lin Li, ...

Urban systems are characterized by complexity and dynamicity. Data-driven simulations represent a promising approach to understanding and predicting complex dynamic processes in the presence of shifting demands on urban systems. Yet, today’s silo-based, decoupled simulation engines fail to provide an end-to-end view of the complex urban system, preventing informed decision-making. In this article, we present DataStorm to support the integration of existing simulation, analysis, and visualization components into integrated workflows. DataStorm provides a flow engine, DataStorm-FE, for coordinating data and decision flows among multiple actors (each representing a model, analytic operation, or decision criterion) and enables ensemble planning and optimization across cloud resources. DataStorm provides native support for simulation ensemble creation through parameter-space sampling to decide which simulations to run, as well as distributed instantiation and parallel execution of simulation instances on cluster resources. Recognizing that simulation ensembles are inherently sparse relative to the potential parameter space, we also present a density-boosting partition-stitch sampling scheme that increases the effective density of the simulation ensemble through sub-space partitioning, complemented with an efficient stitching mechanism that leverages partial and imperfect knowledge from partial dynamical systems to obtain a global view of the complex urban process being simulated.
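
A minimal sketch of ensemble creation by parameter-space sampling followed by parallel instantiation, in the spirit of the workflow described above; the parameter grid and run_simulation are hypothetical placeholders, and the sampler is plain random sampling rather than the paper's partition-stitch scheme.

```python
import itertools
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical urban-simulation parameter space (an assumption, for illustration).
param_space = {
    "population_growth": [0.5, 1.0, 1.5],
    "transit_budget":    [10, 20, 40],
    "weather":           ["dry", "wet"],
}

def run_simulation(params):
    """Placeholder for one simulation instance; returns a fake output metric."""
    return sum(len(str(v)) for v in params.values())

# Enumerate the full Cartesian grid, sample a sparse ensemble from it, and
# fan the instances out for parallel execution.
grid = [dict(zip(param_space, combo))
        for combo in itertools.product(*param_space.values())]
random.seed(0)
ensemble = random.sample(grid, k=6)   # sparse subset of the 18-point grid

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_simulation, ensemble))
print(list(zip(ensemble, results)))
```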


2021, Vol. 2(3), pp. 1-23
Author(s): Vishal Chakraborty, Theo Delemazure, Benny Kimelfeld, Phokion G. Kolaitis, Kunal Relia, ...

We investigate the practical aspects of computing the necessary and possible winners in elections over incomplete voter preferences. For the necessary winners, we show how to implement and accelerate the polynomial-time algorithm of Xia and Conitzer. For the possible winners, where the problem is NP-hard, we give a natural reduction to Integer Linear Programming (ILP) for all positional scoring rules and implement it in a leading commercial optimization solver. Further, we devise optimization techniques that minimize the number of ILP executions and, oftentimes, avoid them altogether. We conduct a thorough experimental study that includes the construction of a rich benchmark of election data based on real and synthetic data. Our findings suggest that, notwithstanding the worst-case intractability of the possible winners, the algorithmic techniques presented here scale well and can be used to compute the possible winners in realistic scenarios.
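
To make the two notions concrete, here is a brute-force illustration (not the paper's algorithms) that enumerates every completion of a tiny partial profile under plurality and checks which candidates win in some completion (possible winners) or in every completion (necessary winners):

```python
from itertools import permutations, product

candidates = ["a", "b", "c"]
score_vec = [1, 0, 0]   # plurality as the positional scoring rule

# Each partial vote is the set of "x is preferred to y" pairs known so far.
partial_votes = [{("a", "b")}, {("b", "c")}, set()]

def completions(partial):
    """All total orders consistent with a partial vote."""
    return [order for order in permutations(candidates)
            if all(order.index(x) < order.index(y) for x, y in partial)]

def winners(profile):
    """Co-winners of a complete profile under the scoring rule."""
    scores = {c: 0 for c in candidates}
    for order in profile:
        for pos, c in enumerate(order):
            scores[c] += score_vec[pos]
    best = max(scores.values())
    return {c for c in candidates if scores[c] == best}

possible, necessary = set(), set(candidates)
for profile in product(*(completions(v) for v in partial_votes)):
    w = winners(profile)
    possible |= w    # wins in at least one completion
    necessary &= w   # wins in every completion
print("possible:", possible, "necessary:", necessary)
```

The paper's ILP reduction replaces this exponential enumeration, which is only feasible for toy instances.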


2021, Vol. 2(3), pp. 1-21
Author(s): Xiancai Tian, Baihua Zheng, Yazhe Wang, Hsiao-Ting Huang, Chih-Chieh Hung

In this article, we aim to recover the exact routes taken by commuters inside a metro system, which are not captured by the Automated Fare Collection (AFC) system and hence remain unknown. We propose two inference tasks: one infers the travel time of each travel link contributing to the total duration of any trip inside the metro network, and the other infers route preferences based on historical trip records and the link travel times inferred by the first task. Because these two tasks are interrelated, most existing works perform them simultaneously. Our solution, TripDecoder, adopts a totally different approach. It exploits the fact that some trips inside a metro system have only one practical route, and strategically decouples the two tasks: only trip records with a single practical route feed the first inference task (travel time), and the inferred travel times are then fed to the second task as an additional input. This not only improves accuracy but also effectively reduces the complexity of both tasks. Two case studies based on city-scale real trip records captured by the AFC systems in Singapore and Taipei compare the accuracy and efficiency of TripDecoder and its competitors. TripDecoder achieves the best accuracy on both datasets and also demonstrates superior efficiency and scalability.
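
A toy sketch of the decoupling idea under strong simplifying assumptions (not TripDecoder itself): trips with a single feasible route define a linear system over link travel times, and the fitted times then score the candidate routes of an ambiguous trip. The network, trips, and durations below are invented for illustration.

```python
import numpy as np

links = ["s1-s2", "s2-s3", "s3-s4", "s2-s4"]

# Single-route trips: (indices of links used, observed duration in minutes).
single_route_trips = [
    ([0], 3.1), ([1], 4.0), ([0, 1], 7.0),
    ([2], 5.0), ([3], 6.1), ([0, 3], 9.0),
]

# Task 1: infer per-link travel times by least squares over these trips.
A = np.zeros((len(single_route_trips), len(links)))
t = np.zeros(len(single_route_trips))
for row, (used, dur) in enumerate(single_route_trips):
    A[row, used] = 1.0
    t[row] = dur
link_time, *_ = np.linalg.lstsq(A, t, rcond=None)

# Task 2: for a trip with two candidate routes, prefer the route whose
# predicted time best matches the observed duration.
candidates = {"via s3": [0, 1, 2], "direct": [0, 3]}
observed = 9.2
best = min(candidates,
           key=lambda r: abs(link_time[candidates[r]].sum() - observed))
print(dict(zip(links, link_time.round(2))), "->", best)
```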

