Detecting Group Anomalies in Tera-Scale Multi-Aspect Data via Dense-Subtensor Mining

2021 ◽  
Vol 3 ◽  
Author(s):  
Kijung Shin ◽  
Bryan Hooi ◽  
Jisu Kim ◽  
Christos Faloutsos

How can we detect fraudulent lockstep behavior in large-scale multi-aspect data (i.e., tensors)? Can we detect it when the data are too large to fit in memory, or even on disk? Past studies have shown that dense subtensors in real-world tensors (e.g., social media, Wikipedia, TCP dumps, etc.) signal anomalous or fraudulent behavior such as retweet boosting, bot activities, and network attacks. Thus, various approaches, including tensor decomposition and search, have been proposed for detecting dense subtensors rapidly and accurately. However, existing methods suffer from low accuracy, or they assume that tensors are small enough to fit in main memory, which is unrealistic in many real-world applications such as social media and the web. To overcome these limitations, we propose D-Cube, a disk-based dense-subtensor detection method that can also run in a distributed manner across multiple machines. Compared to state-of-the-art methods, D-Cube is (1) Memory Efficient: requires up to 1,561× less memory and handles 1,000× larger data (2.6TB), (2) Fast: up to 7× faster due to its near-linear scalability, (3) Provably Accurate: gives a guarantee on the densities of the detected subtensors, and (4) Effective: spotted network attacks from TCP dumps and synchronized behavior in rating data most accurately.
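
The greedy search at the core of this line of work is easy to state: start from the full tensor and repeatedly delete the slice (one attribute value in one mode) with minimum mass, remembering the densest intermediate subtensor seen along the way. The sketch below is a minimal in-memory toy of that idea for a sparse tensor, removing one slice at a time; D-Cube's actual contribution is an out-of-core, distributed variant that deletes many slices per sequential pass, so treat this only as an illustration of the objective, not the paper's algorithm. All names are illustrative.

```python
from collections import defaultdict

def greedy_dense_subtensor(entries, n_modes):
    """Toy greedy dense-subtensor search using arithmetic average mass.

    entries: list of (index_tuple, value) pairs of a sparse tensor with
    n_modes modes.  Repeatedly deletes the minimum-mass slice and keeps
    the intermediate subtensor whose density was highest.
    """
    alive = [set() for _ in range(n_modes)]           # surviving attribute values per mode
    slice_mass = [defaultdict(float) for _ in range(n_modes)]
    total = 0.0
    for idx, val in entries:
        total += val
        for m, a in enumerate(idx):
            alive[m].add(a)
            slice_mass[m][a] += val

    live = list(entries)
    best_density, best_sets = -1.0, None
    while any(alive):
        density = total / sum(len(s) for s in alive)  # mass / number of kept slices
        if density > best_density:
            best_density, best_sets = density, [set(s) for s in alive]
        # delete the slice (mode m, attribute a) with minimum mass
        m, a = min(((m, a) for m in range(n_modes) for a in alive[m]),
                   key=lambda p: slice_mass[p[0]][p[1]])
        alive[m].discard(a)
        kept = []
        for idx, val in live:
            if idx[m] == a:                           # entry falls in the deleted slice
                total -= val
                for mm, aa in enumerate(idx):
                    slice_mass[mm][aa] -= val
            else:
                kept.append((idx, val))
        live = kept
    return best_density, best_sets

# Tiny 2-way example: a dense 2x2 block plus one stray entry.
entries = [((u, i), 1.0) for u in "ab" for i in "xy"] + [(("c", "z"), 1.0)]
print(greedy_dense_subtensor(entries, 2))   # finds the {a,b} x {x,y} block
```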

Crowdsourcing ◽  
2019 ◽  
pp. 663-690 ◽
Author(s):  
Enzo Falco ◽  
Reinout Kleinhans

Renewed interest in citizen co-production of public services has emerged due to financial pressure on governments. While social media are considered an important facilitator, many digital participatory platforms (DPPs) have been developed to facilitate co-production between citizens and governments in the context of urban development. Previous studies have delivered a fragmented overview of DPPs in a few socio-spatial contexts and have failed to take stock of the rise of DPPs. This article aims to provide a more comprehensive picture of the availability and functionalities of DPPs. Through a systematic review, 113 active DPPs have been identified, analysed, and classified within a citizen-government relationship typology. Almost a quarter of these DPPs demonstrate a realistic potential for online and offline co-production between governments and citizens. The article critically analyses the characteristics of these DPPs and explores their real-world applications in urban development. It concludes with directions for further research.


Author(s):  
Chao Qian ◽  
Yang Yu ◽  
Ke Tang

Subset selection is a fundamental problem in many areas; it aims to select the best subset of size at most k from a universe. Greedy algorithms are widely used for subset selection and have shown good approximation performance in deterministic settings. However, their behavior is stochastic in many realistic situations (e.g., large-scale and noisy ones). For general stochastic greedy algorithms, bounded approximation guarantees were previously known only for subset selection with monotone submodular objective functions, while real-world applications often involve non-monotone or non-submodular objectives and can be subject to constraints more general than a size constraint. This work proves approximation guarantees for stochastic greedy algorithms in these cases, and thus largely extends their applicability.
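
As a toy illustration of the setting, the sketch below runs a greedy for maximum coverage (a monotone submodular objective) under a noisy gain oracle, so every evaluation of a marginal gain is perturbed; the paper's contribution is proving guarantees for such stochastic greedy behavior, including beyond this well-behaved case. The noise model and function names here are illustrative, not the paper's.

```python
import random

def stochastic_greedy_coverage(sets, k, noise=0.2, seed=0):
    """Greedy maximum coverage with a noisy gain oracle.

    sets: list of Python sets; choose at most k of them to maximize the
    size of their union.  Every marginal-gain evaluation is perturbed
    multiplicatively, mimicking noisy objective estimates.
    """
    rng = random.Random(seed)
    covered, chosen = set(), []
    for _ in range(k):
        gains = [len(s - covered) * (1.0 + rng.uniform(-noise, noise))
                 for s in sets]
        best = max(range(len(sets)), key=gains.__getitem__)
        if not (sets[best] - covered):   # no (true) improvement: stop
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

# Example: four coverage sets, pick k = 2 under noisy evaluation.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
print(stochastic_greedy_coverage(sets, k=2))
```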


Author(s):  
Rasmus Helles ◽  
Jacob Ørmen ◽  
Klaus Bruhn Jensen ◽  
Signe Sophus Lai ◽  
Ericka Menchen-Trevino ◽  
...  

In recent years, large-scale analysis of log data from digital devices, often termed "big data analysis" (Lazer, Kennedy, King, & Vespignani, 2014), has taken hold in the field of internet research. Through Application Programming Interfaces (APIs) and commercial measurement, scholars have been able to analyze social media users (Freelon, 2014) and web audiences (Taneja, 2016) on an unprecedented scale. And by developing digital research tools, scholars have been able to track individuals across websites (Menchen-Trevino, 2013) and mobile applications (Ørmen & Thorhauge, 2015) in greater detail than ever before. Big data analysis holds unique potential for studying communication in depth and across many individuals (see e.g. Boase & Ling, 2013; Prior, 2013). At the same time, this approach introduces new methodological challenges in the transparency of data collection (Webster, 2014), the sampling of participants, and the validity of conclusions (Rieder, Abdulla, Poell, Woltering, & Zack, 2015).

Firstly, data aggregation is typically designed for commercial rather than academic purposes. The type of data included, as well as how it is presented, depends in large part on the business interests of measurement and advertisement companies (Webster, 2014). Secondly, when relying on this kind of secondary data, it can be difficult to validate the output or the techniques used to generate the data (Rieder et al., 2015). Thirdly, the unit of analysis is often media-centric, taking specific websites or social network pages as the empirical basis instead of individual users (Taneja, 2016). This makes it hard to untangle the behavior of real-world users from aggregate trends. Lastly, variations in what users do might be so large that it is necessary to move from the aggregate to smaller groups of users to make meaningful inferences (Welles, 2014). Internet research is thus faced with a new research approach in big data analysis, with potentials and perils that need to be discussed in combination with traditional approaches.

This panel explores the role of big data analysis in relation to the wider repertoire of methods in internet research. The panel comprises four presentations, each of which sheds light on the complementarity of big data analysis with more traditional qualitative and quantitative methods. The first presentation opens the discussion with an overview of strategies for combining digital traces and commercial audience data with qualitative interviews and quantitative survey methods. The next presentation explores the potential of trace data to improve upon the experimental method: researcher-collected data enables scholars to operate in a real-world setting, in contrast to a research lab, while obtaining informed consent from participants. The third presentation argues that large-scale audience data provide a unique perspective on internet use; by integrating census-level information about users with detailed traces of their behavior across websites, commercial audience data combine the strengths of surveys and digital trace data respectively. Lastly, the fourth presentation shows how multi-institutional collaboration makes it possible to document social media activity (on Twitter) for a whole country (Australia) in a comprehensive manner, a feat not possible through other methods on a similar scale. Through these four presentations, the panel aims to situate big data analysis in the broader repertoire of internet research methods.


2018 ◽  
Vol 2 (2) ◽  
pp. 106-120 ◽  
Author(s):  
Mahalingam Ramkumar

Purpose: The purpose of this paper is to examine the blockchain as a trusted computing platform. Understanding the strengths and limitations of this platform is essential to execute large-scale real-world applications in blockchains. Design/methodology/approach: This paper proposes several modifications to conventional blockchain networks to improve the scale and scope of applications. Findings: Simple modifications to cryptographic protocols for constructing blockchain ledgers, and to digital signatures for authentication of transactions, are sufficient to realize a scalable blockchain platform. Originality/value: The original contributions of this paper are concrete steps to overcome limitations of current blockchain networks.
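
For readers unfamiliar with the baseline structure being modified, the following is a minimal sketch of a conventional hash-chained ledger using only the Python standard library. It shows the starting point the paper argues needs modification, not the paper's proposed protocol; the JSON serialization and field names are simplifying assumptions.

```python
import hashlib
import json
import time

def block_digest(block):
    """SHA-256 over a canonical JSON serialization (a simplifying assumption)."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, transactions):
    """Append a block whose 'prev' field commits to the whole history."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"prev": prev, "time": time.time(), "txs": transactions}
    block["hash"] = block_digest(block)   # digest computed before the field is added
    chain.append(block)
    return block

def verify(chain):
    """Recompute every digest and check the hash links."""
    prev = "0" * 64
    for b in chain:
        body = {k: v for k, v in b.items() if k != "hash"}
        if block_digest(body) != b["hash"] or b["prev"] != prev:
            return False
        prev = b["hash"]
    return True

chain = []
append_block(chain, ["alice->bob:5"])
append_block(chain, ["bob->carol:2"])
print(verify(chain))   # True; tamper with any field and it becomes False
```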


Author(s):  
Chao Qian ◽  
Guiying Li ◽  
Chao Feng ◽  
Ke Tang

The subset selection problem, which selects a few items from a ground set, arises in many applications such as maximum coverage, influence maximization, sparse regression, etc. The recently proposed POSS algorithm is a powerful approximation solver for this problem. However, POSS requires centralized access to the full ground set, and is thus impractical for large-scale real-world applications, where the ground set is too large to be stored on a single machine. In this paper, we propose a distributed version of POSS (DPOSS) with a bounded approximation guarantee. DPOSS can be easily implemented in the MapReduce framework. Our extensive experiments using Spark, on various real-world data sets with sizes ranging from thousands to millions of items, show that DPOSS achieves competitive performance compared with the centralized POSS, and is almost always better than the state-of-the-art distributed greedy algorithm RandGreeDi.
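
DPOSS itself builds on Pareto optimization, whose details are beyond an abstract; the sketch below instead illustrates the generic two-round partition-and-merge pattern that distributed subset-selection solvers such as the RandGreeDi baseline follow under MapReduce: solve locally on each partition, then solve once more over the pooled local solutions. All names are illustrative, with maximum coverage standing in for the objective.

```python
import random

def greedy_cover(sets, k):
    """Plain greedy maximum coverage over a list of sets."""
    chosen, covered = [], set()
    pool = list(sets)
    for _ in range(k):
        best = max(pool, key=lambda s: len(s - covered), default=None)
        if best is None or not (best - covered):
            break
        chosen.append(best)
        covered |= best
    return chosen

def two_round_distributed(sets, k, n_machines=4, seed=0):
    """Round 1 ('map'): random partition, local greedy per machine.
    Round 2 ('reduce'): greedy over the pooled local solutions."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n_machines)]
    for s in sets:
        parts[rng.randrange(n_machines)].append(s)
    candidates = [s for part in parts for s in greedy_cover(part, k)]
    return greedy_cover(candidates, k)

sets = [{i, i + 1, i + 2} for i in range(0, 30, 2)]
print(two_round_distributed(sets, k=3))
```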


2018 ◽  
Vol 7 (3) ◽  
pp. 52-79 ◽  
Author(s):  
Enzo Falco ◽  
Reinout Kleinhans

Renewed interest in citizen co-production of public services has emerged due to financial pressure on governments. While social media are considered an important facilitator, many digital participatory platforms (DPPs) have been developed to facilitate co-production between citizens and governments in the context of urban development. Previous studies have delivered a fragmented overview of DPPs in a few socio-spatial contexts and have failed to take stock of the rise of DPPs. This article aims to provide a more comprehensive picture of the availability and functionalities of DPPs. Through a systematic review, 113 active DPPs have been identified, analysed, and classified within a citizen-government relationship typology. Almost a quarter of these DPPs demonstrate a realistic potential for online and offline co-production between governments and citizens. The article critically analyses the characteristics of these DPPs and explores their real-world applications in urban development. It concludes with directions for further research.


Author(s):  
Iqbal H. Sarker

In the current age of the Fourth Industrial Revolution (4IR, or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding real-world applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is key. Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, this study's key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application areas, such as cybersecurity, smart cities, healthcare, business, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point not only for application developers but also for decision-makers and researchers in various real-world application areas, particularly from the technical point of view.
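
As a minimal illustration of two of the paradigms the survey covers, the snippet below contrasts supervised and unsupervised learning on the same synthetic data using scikit-learn; the library choice and toy setup are assumptions for demonstration, not drawn from the paper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Supervised learning: fit a mapping from features to known labels.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised learning: group the same points without using the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == c).sum()) for c in (0, 1)])
```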


Author(s):  
Zhiying Mu ◽  
Zhihu Li ◽  
Xiaoyu Li

Correct classification and filtering of common libraries in Android applications can effectively improve the accuracy of repackaged-application detection. However, existing common library detection methods can barely meet the requirements of large-scale app markets, because their classification rules lead to low detection speed. To address this problem, a structural-similarity-based common library detection method for Android is presented. Sub-packages with weak association to the main package are extracted as common library candidates from the decompiled APK (Android application package) using a PDG (program dependency graph) method. With package structures and API calls used as features, the candidates are classified through coarse- and fine-grained filtering. Experimental results on real-world applications show that the detection speed of the proposed method is higher while accuracy and false positive rate are both maintained. The method is shown to be efficient and precise.
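
A minimal sketch of the coarse-then-fine filtering idea: first a cheap comparison of package-structure features, then a stricter comparison of API-call sets. The Jaccard measure, field names, and thresholds below are illustrative assumptions, not the paper's exact features or rules.

```python
def jaccard(a, b):
    """Set similarity in [0, 1]."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def match_common_library(candidate, known_libs, coarse_t=0.6, fine_t=0.8):
    """Two-stage matching sketch over hypothetical candidate features."""
    for name, lib in known_libs.items():
        if jaccard(candidate["subpackages"], lib["subpackages"]) < coarse_t:
            continue                      # coarse filter: structure mismatch
        if jaccard(candidate["api_calls"], lib["api_calls"]) >= fine_t:
            return name                   # fine filter: API profile matches
    return None

# Hypothetical candidate extracted from a decompiled APK.
candidate = {"subpackages": {"net", "cache", "ui"},
             "api_calls": {"HttpURLConnection.connect", "Bitmap.decode"}}
known_libs = {"imageloader": {"subpackages": {"net", "cache", "ui"},
                              "api_calls": {"HttpURLConnection.connect",
                                            "Bitmap.decode"}}}
print(match_common_library(candidate, known_libs))   # imageloader
```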


2021 ◽  
Vol 66 ◽  
pp. 101166 ◽
Author(s):  
Mingjun Zhao ◽  
Shengli Yan ◽  
Bang Liu ◽  
Xinwang Zhong ◽  
Qian Hao ◽  
...  

Sensors ◽  
2020 ◽  
Vol 20 (17) ◽  
pp. 4761 ◽  
Author(s):  
Yue Qiu ◽  
Yutaka Satoh ◽  
Ryota Suzuki ◽  
Kenji Iwata ◽  
Hirokatsu Kataoka

This study proposes a framework for describing a scene change in natural language text, based on indoor scene observations conducted before and after the change. The recognition of scene changes plays an essential role in a variety of real-world applications, such as scene anomaly detection. Most scene understanding research has focused on static scenes, and most existing scene change captioning methods detect changes from single-view RGB images, neglecting the underlying three-dimensional structure. Previous three-dimensional scene change captioning methods use simulated scenes consisting of geometric primitives, making them unsuitable for real-world applications. To solve these problems, we automatically generated large-scale indoor scene change caption datasets. We propose an end-to-end framework for describing scene changes from various input modalities, namely RGB images, depth images, and point cloud data, which are available in most robot applications. We conducted experiments with various input modalities and models and evaluated model performance using datasets with various levels of complexity. Experimental results show that models combining RGB images and point cloud data as input achieve high performance in sentence generation and caption correctness, and are robust in change-type understanding on datasets with high complexity. The developed datasets and models contribute to the study of indoor scene change understanding.
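
To make the multi-modal fusion concrete, here is a toy PyTorch sketch that fuses pre-extracted RGB and point-cloud features of the before/after observations and unrolls a small caption decoder; all dimensions and module choices are placeholder assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ChangeCaptionFusion(nn.Module):
    """Toy fusion sketch: combine pre-extracted RGB and point-cloud
    features of the 'before' and 'after' observations, then unroll a
    GRU decoder over the fused change representation.  All dimensions,
    and the upstream feature extractors, are placeholders."""

    def __init__(self, rgb_dim=512, pc_dim=256, hidden=256, vocab=1000):
        super().__init__()
        self.fuse = nn.Linear(rgb_dim + pc_dim, hidden)    # per-observation fusion
        self.change = nn.Linear(2 * hidden, hidden)        # before/after comparison
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, rgb_before, pc_before, rgb_after, pc_after, steps=12):
        before = torch.relu(self.fuse(torch.cat([rgb_before, pc_before], dim=-1)))
        after = torch.relu(self.fuse(torch.cat([rgb_after, pc_after], dim=-1)))
        h = torch.relu(self.change(torch.cat([before, after], dim=-1)))
        inp = h.unsqueeze(1).repeat(1, steps, 1)           # fixed-length unroll
        seq, _ = self.decoder(inp, h.unsqueeze(0))
        return self.out(seq)                               # (batch, steps, vocab) logits

model = ChangeCaptionFusion()
rgb = torch.randn(2, 512)
pc = torch.randn(2, 256)
print(model(rgb, pc, rgb, pc).shape)   # torch.Size([2, 12, 1000])
```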

