Fundamental resource trade-offs for encoded distributed optimization

Information and Inference: A Journal of the IMA ◽

10.1093/imaiai/iaaa026 ◽

2020 ◽

Author(s):

A Salman Avestimehr ◽

Seyed Mohammadreza Mousavi Kalan ◽

Mahdi Soltanolkotabi

Keyword(s):

Computational Time ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Computational Framework ◽

Data Set ◽

Trade Offs ◽

Major Bottleneck ◽

Computing Environments ◽

Analyze Data

Dealing with the sheer size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck in such modern distributed computing environments is that some of the worker nodes may run slowly. These nodes, known as stragglers, can significantly slow down computation because the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler tolerance in this framework.
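To make the idea of data redundancy against stragglers concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a noiseless least-squares problem, a hypothetical 8x4 encoding matrix S built from four columns of a Hadamard matrix (so that S^T S = I, giving 2x redundancy), four workers holding two encoded rows each, and one randomly chosen straggler whose result is ignored in every iteration. All names, the data and the step size are illustrative assumptions.

```python
import math
import random

def matmul(S, M):
    """Plain-list matrix product S @ M."""
    return [[sum(s * m for s, m in zip(row, col)) for col in zip(*M)] for row in S]

def worker_gradient(A, b, w):
    """Gradient of 0.5 * ||A w - b||^2 over one worker's encoded rows."""
    grad = [0.0] * len(w)
    for row, target in zip(A, b):
        residual = sum(r * wi for r, wi in zip(row, w)) - target
        for j, r in enumerate(row):
            grad[j] += r * residual
    return grad

def encoded_gradient_descent(X, y, iters=50, step=1 / 3, seed=0):
    """Gradient descent on encoded data (S X, S y), dropping one worker per step.

    Assumes X has exactly 4 rows; step=1/3 is tuned to the example data below.
    """
    rng = random.Random(seed)
    num_workers, rows_per_worker = 4, 2
    # Assumed encoding: columns 1, 2, 4, 7 of the 8x8 Sylvester Hadamard
    # matrix, scaled so that S^T S is the identity.
    cols = [1, 2, 4, 7]
    S = [[(-1) ** bin(r & c).count("1") / math.sqrt(8) for c in cols]
         for r in range(8)]
    SX = matmul(S, X)
    Sy = [row[0] for row in matmul(S, [[v] for v in y])]
    w = [0.0] * len(X[0])
    for _ in range(iters):
        straggler = rng.randrange(num_workers)  # simulated slow node
        total = [0.0] * len(w)
        for k in range(num_workers):
            if k == straggler:
                continue  # do not wait for the straggler's partial gradient
            lo, hi = k * rows_per_worker, (k + 1) * rows_per_worker
            g = worker_gradient(SX[lo:hi], Sy[lo:hi], w)
            total = [t + gi for t, gi in zip(total, g)]
        w = [wi - step * ti for wi, ti in zip(w, total)]
    return w

# Illustrative data: X^T X = 3 I, and y = X @ W_TRUE with no noise.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_TRUE = [2.0, -1.0]
y = [sum(a * b for a, b in zip(row, W_TRUE)) for row in X]
w_hat = encoded_gradient_descent(X, y)
```

Because the data here are noiseless, every partial gradient vanishes at W_TRUE, so the iterates still approach it even though a different worker is dropped each round. With noisy data one would expect convergence only to a neighborhood of the solution, which is the kind of accuracy versus redundancy trade-off the abstract refers to.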


A methodology for supporting collaborative exploratory analysis of massive data sets in tele-immersive environments

Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469) ◽

10.1109/hpdc.1999.805283 ◽

2003 ◽

Cited By ~ 8

Author(s):

J. Leigh ◽

A.E. Johnson ◽

T.A. DeFanti ◽

S. Bailey ◽

R. Grossman

Keyword(s):

Exploratory Analysis ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Immersive Environments


Massive Data Sets Issues in Earth Observing

Massive Computing - Handbook of Massive Data Sets ◽

10.1007/978-1-4615-0005-6_29 ◽

2002 ◽

pp. 1093-1140 ◽

Cited By ~ 3

Author(s):

Ruixin Yang ◽

Menas Kafatos

Keyword(s):

Massive Data ◽

Data Sets ◽

Massive Data Sets


SKT

Proceedings of the VLDB Endowment ◽

10.14778/3476249.3476287 ◽

2021 ◽

Vol 14 (11) ◽

pp. 2369-2382

Author(s):

Monica Chiosa ◽

Thomas B. Preußer ◽

Gustavo Alonso

Keyword(s):

Frequency Distribution ◽

Empirical Evaluation ◽

Large Data ◽

Cloud Service ◽

Data Sets ◽

Data Set ◽

Single Pass ◽

Trade Offs ◽

Significant Performance ◽

Spatial Architecture

Data analysts often need to characterize a data stream as a first step to its further processing. Some of the initial insights to be gained include, e.g., the cardinality of the data set and its frequency distribution. Such information is typically extracted by using sketch algorithms, now widely employed to process very large data sets in manageable space and in a single pass over the data. Often, analysts need more than one parameter to characterize the stream. However, computing multiple sketches becomes expensive even when using high-end CPUs. Exploiting the increasing adoption of hardware accelerators, this paper proposes SKT, an FPGA-based accelerator that can compute several sketches along with basic statistics (average, max, min, etc.) in a single pass over the data. SKT has been designed to characterize a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes data streams coming either from PCIe or TCP/IP, and it is built to fit emerging cloud service architectures, such as Microsoft's Catapult or Amazon's AQUA. The paper explores the trade-offs of designing sketch algorithms on a spatial architecture and how to combine several sketch algorithms into a single design. The empirical evaluation shows how SKT on an FPGA offers a significant performance gain over high-end, server-class CPUs.
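As a software illustration of computing a sketch together with basic statistics in one pass, here is a minimal sketch using a Count-Min sketch for frequency estimates. It is not the SKT design: the abstract does not say which sketch algorithms SKT implements, and the class name, the table dimensions and the BLAKE2-based hashing are assumptions made for this example.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter; estimates never fall below the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row stands in for `depth` independent hash functions.
        digest = hashlib.blake2b(
            repr(item).encode(), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

def summarize(stream):
    """Single pass over a numeric stream: sketch plus count, min, max and mean."""
    sketch = CountMinSketch()
    count, total = 0, 0.0
    lowest = highest = None
    for value in stream:
        sketch.add(value)
        count += 1
        total += value
        lowest = value if lowest is None else min(lowest, value)
        highest = value if highest is None else max(highest, value)
    mean = total / count if count else None
    return sketch, {"count": count, "min": lowest, "max": highest, "mean": mean}

sketch, stats = summarize([3, 1, 3, 7, 3, 1])
```

Each element is touched once and updates both the sketch and the running statistics, which is the single-pass property the abstract emphasizes. Hash collisions can only inflate a counter, so `sketch.estimate(x)` is an upper bound on the true frequency of `x`.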


Identifying Victims of Human Sex Trafficking in Online Ads

Encyclopedia of Criminal Activities and the Deep Web ◽

10.4018/978-1-5225-9715-5.ch034 ◽

2020 ◽

pp. 497-517

Author(s):

Jessica Whitney ◽

Marisa Hultgren ◽

Murray Eugene Jennex ◽

Aaron Elkins ◽

Eric Frost

Keyword(s):

Social Media ◽

Knowledge Management ◽

Sex Trafficking ◽

Data Sets ◽

Data Set ◽

Viable Approach ◽

Unexpected Outcome ◽

Analyze Data

Social media and the interactive web have enabled human traffickers to lure victims and then sell them faster and in greater safety than ever before. However, these same tools have also enabled investigators in their search for victims and criminals. A prototype was designed to identify victims of human sex trafficking by analyzing online ads. The prototype used knowledge management to generate actionable intelligence by applying a set of strong filters based on an ontology to identify potential victims. The prototype was used to analyze data sets generated from online ads. An unexpected outcome of the second data set was the discovery of the use of emojis in an expanded ontology. The final prototype used the expanded ontology to identify potential victims. The results of applying the prototypes suggest a viable approach to identifying victims of human sex trafficking in online ads.


Neural Network for Big Data Sets

10.4018/978-1-6684-2408-7.ch003 ◽

2022 ◽

pp. 41-67

Author(s):

Vo Ngoc Phu ◽

Vo Thi Ngoc Tran

Keyword(s):

Neural Network ◽

Big Data ◽

Computer Science ◽

Large Scale ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Large Scale Data ◽

Commercial Applications ◽

Novel Model

Machine learning (ML), neural networks (NNs), evolutionary algorithms (EAs), fuzzy systems (FSs), and computer science more broadly have been prominent and significant for many years. They have been applied in many different areas and have contributed much to the development of large-scale corporations and organizations. These corporations and organizations have in turn generated large amounts of information and massive data sets (MDSs). Such big data sets (BDSs) pose challenges for many commercial applications and research efforts. Consequently, many algorithms from ML, NNs, EAs, FSs, and computer science have been developed to handle these massive data sets. To support this process, the authors present NN algorithms for large-scale data sets (LSDSs) in this chapter. Finally, they present a novel NN model for BDSs in a sequential environment (SE) and a distributed network environment (DNE).


Diabetic Data Warehouses

Encyclopedia of Data Warehousing and Mining ◽

10.4018/978-1-59140-557-3.ch069 ◽

2011 ◽

pp. 359-363

Author(s):

Joseph L. Breault

Keyword(s):

United States ◽

Health Care ◽

The United States ◽

Massive Data ◽

Data Sets ◽

National Academy Of Sciences ◽

Massive Data Sets ◽

Number Of Patients ◽

Indicator Variables ◽

Academy Of Sciences

The National Academy of Sciences convened in 1995 for a conference on massive data sets. The presentation on health care noted that “massive applies in several dimensions . . . the data themselves are massive, both in terms of the number of observations and also in terms of the variables . . . there are tens of thousands of indicator variables coded for each patient” (Goodall, 1995, paragraph 18). We multiply this by the number of patients in the United States, which is hundreds of millions.


Gut microbiota and artificial intelligence approaches: A scoping review

Health and Technology ◽

10.1007/s12553-020-00486-7 ◽

2020 ◽

Vol 10 (6) ◽

pp. 1343-1358

Author(s):

Ernesto Iadanza ◽

Rachele Fabbri ◽

Džana Bašić-Čičak ◽

Amedeo Amedei ◽

Jasminka Hasic Telalovic

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Deep Learning ◽

Gut Microbiota ◽

Scoping Review ◽

Massive Data ◽

Data Sets ◽

Research Groups ◽

Human Gut ◽

Massive Data Sets

This article aims to provide a thorough overview of the use of Artificial Intelligence (AI) techniques in studying the gut microbiota and its role in the diagnosis and treatment of some important diseases. The association between microbiota and diseases, together with its clinical relevance, is still difficult to interpret. Advances in AI techniques, such as Machine Learning (ML) and Deep Learning (DL), can help clinicians in processing and interpreting these massive data sets. Two research groups have been involved in this Scoping Review, working in two different areas of Europe: Florence and Sarajevo. The papers included in the review describe the use of ML or DL methods applied to the study of human gut microbiota. In total, 1109 papers were considered in this study. After elimination, a final set of 16 articles was considered in the scoping review. Different AI techniques were applied in the reviewed papers. Eleven papers evaluated only ML algorithms (ranging from one to eight algorithms applied to one dataset). The remaining five papers examined both ML and DL algorithms. The most frequently applied ML algorithm was Random Forest, which also exhibited the best performance.


LCSS-Based Algorithm for Computing Multivariate Data Set Similarity: A Case Study of Real-Time WSN Data

Sensors ◽

10.3390/s19010166 ◽

2019 ◽

Vol 19 (1) ◽

pp. 166 ◽

Cited By ~ 2

Author(s):

Rahim Khan ◽

Ihsan Ali ◽

Saleh M. Altowaijri ◽

Muhammad Zakarya ◽

Atiq Ur Rahman ◽

...

Keyword(s):

Dynamic Programming ◽

DNA Analysis ◽

Multivariate Data ◽

Longest Common Subsequence ◽

Sensor Data ◽

Computational Time ◽

Data Sets ◽

Data Set ◽

Classical Dynamic ◽

Engineering Sciences

Multivariate data sets are common in various application areas, such as wireless sensor networks (WSNs) and DNA analysis. A robust mechanism is required to compute their similarity indexes regardless of the environment and problem domain. This study describes the usefulness of a non-metric-based approach (i.e., longest common subsequence) in computing similarity indexes. Several non-metric-based algorithms are available in the literature; the most robust and reliable one is the dynamic programming-based technique. However, dynamic programming-based techniques are considered inefficient, particularly in the context of multivariate data sets. Furthermore, the classical approaches are not powerful enough in scenarios with multivariate data sets or sensor data, or when the similarity indexes are extremely high or low. To address this issue, we propose an efficient algorithm to measure the similarity indexes of multivariate data sets using a non-metric-based methodology. The proposed algorithm performs exceptionally well on numerous multivariate data sets compared with the classical dynamic programming-based algorithms. The performance of the algorithms is evaluated on the basis of several benchmark data sets and a dynamic multivariate data set, which is obtained from a WSN deployed in the Ghulam Ishaq Khan (GIK) Institute of Engineering Sciences and Technology. Our evaluation suggests that the proposed algorithm can be approximately 39.9% more efficient than its counterparts for various data sets in terms of computational time.
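For reference, the classical dynamic programming baseline that the abstract compares against can be sketched as follows. This is the textbook longest common subsequence recurrence, not the authors' proposed algorithm. Treating each multivariate reading as a tuple and normalizing by the shorter sequence length are assumptions made for this example, since the abstract does not define the similarity index.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (classical DP).

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity_index(a, b):
    """Assumed normalization: LCS length divided by the shorter sequence length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))

# Multivariate readings modeled as tuples, e.g. (temperature, humidity).
series_a = [(20, 55), (21, 54), (22, 53)]
series_b = [(20, 55), (22, 53)]
score = similarity_index(series_a, series_b)
```

The quadratic cost of filling the table is the inefficiency the abstract attributes to dynamic programming-based techniques on large multivariate data sets.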


MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

Nature Biotechnology ◽

10.1038/nbt.3988 ◽

2017 ◽

Vol 35 (11) ◽

pp. 1026-1028 ◽

Cited By ~ 256

Author(s):

Martin Steinegger ◽

Johannes Söding

Keyword(s):

Protein Sequence ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets


Scaling clustering algorithms for massive data sets using data streams

Proceedings. 20th International Conference on Data Engineering ◽

10.1109/icde.2004.1320061 ◽

2004 ◽

Cited By ~ 8

Author(s):

S. Nittel ◽

K.T. Leung ◽

A. Braverman

Keyword(s):

Data Streams ◽

Clustering Algorithms ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Using Data
