A Dynamic Scaling Approach in Hadoop YARN

In cloud-based big data applications, Hadoop has been widely adopted for the distributed processing of large-scale data sets. However, the energy wasted by data centers remains an important axis of research, owing to resource over-provisioning and extra overhead costs. Dynamic scaling of the resources in a Hadoop YARN cluster is a practical way to overcome this challenge. This paper proposes a dynamic scaling approach in Hadoop YARN (DSHYARN) that adds or removes nodes automatically based on workload. It is built on two algorithms (scaling up and scaling down) that automate the scaling process in the cluster. The aim is to ensure both the energy efficiency and the performance of Hadoop YARN clusters. To validate the effectiveness of DSHYARN, a case study is provided: sentiment analysis of tweets about the COVID-19 vaccine posted on Twitter. The results showed improvements in CPU utilization, RAM utilization, and job completion time. In addition, energy consumption was reduced by 16% under an average workload.
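The paper's two scaling algorithms are not reproduced in this abstract; as a rough illustration, a threshold-based autoscaler polling the YARN ResourceManager REST API might look like the minimal Python sketch below. The thresholds, host names, exclude-file path, and the node-provisioning hook are all assumptions, not DSHYARN's actual logic.

```python
import subprocess
import time

import requests

RM_METRICS_URL = "http://resourcemanager:8088/ws/v1/cluster/metrics"  # hypothetical host
SCALE_UP_THRESHOLD = 0.80    # assumed: add a node above 80% memory utilization
SCALE_DOWN_THRESHOLD = 0.30  # assumed: remove a node below 30% utilization
POLL_INTERVAL_S = 60

def memory_utilization():
    """Fetch cluster memory utilization from the ResourceManager REST API."""
    metrics = requests.get(RM_METRICS_URL, timeout=5).json()["clusterMetrics"]
    return metrics["allocatedMB"] / metrics["totalMB"]

def decommission_node(host):
    """Gracefully remove a NodeManager: list it in the exclude file, then refresh."""
    with open("/etc/hadoop/conf/yarn.exclude", "a") as f:  # path is an assumption
        f.write(host + "\n")
    subprocess.run(["yarn", "rmadmin", "-refreshNodes"], check=True)

def commission_node(host):
    """Placeholder: provisioning a new NodeManager is deployment-specific
    (e.g. starting a VM or container that runs a NodeManager daemon)."""
    raise NotImplementedError

while True:
    util = memory_utilization()
    if util > SCALE_UP_THRESHOLD:
        commission_node("worker-new")      # hypothetical host name
    elif util < SCALE_DOWN_THRESHOLD:
        decommission_node("worker-spare")  # hypothetical host name
    time.sleep(POLL_INTERVAL_S)
```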

2014 · Vol 571-572 · pp. 497-501
Author(s): Qi Lv, Wei Xie

Real-time log analysis over large-scale data is important for many applications; here, real-time means a UI latency within 100 ms. Techniques that efficiently support real-time analysis over large log data sets are therefore desired. MongoDB provides good query performance, an aggregation framework, and a distributed architecture, which makes it suitable for real-time data query and massive log analysis. In this paper, a novel implementation approach for an event-driven file log analyzer is presented, and the performance of query, scan, and aggregation operations over MongoDB, HBase, and MySQL is compared. Our experimental results show that HBase delivers the most balanced performance across all operations, while MongoDB answers some queries in under 10 ms, making it the most suitable for real-time applications.
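For context, the kind of indexed point query and aggregation the abstract benchmarks can be expressed with pymongo as in the sketch below; the database, collection, and field names are illustrative assumptions, not the paper's schema.

```python
from datetime import datetime, timedelta

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
logs = client["logdb"]["events"]  # hypothetical database/collection names

# A compound index keeps point queries on recent log events in the
# low-millisecond range, which is what sub-10 ms figures rely on.
logs.create_index([("level", ASCENDING), ("ts", ASCENDING)])

# Point query: recent ERROR events (the fast path).
since = datetime.utcnow() - timedelta(minutes=5)
recent_errors = list(logs.find({"level": "ERROR", "ts": {"$gte": since}}))

# Aggregation: event counts per level over the same window.
pipeline = [
    {"$match": {"ts": {"$gte": since}}},
    {"$group": {"_id": "$level", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
per_level = list(logs.aggregate(pipeline))
```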


Big Data · 2016 · pp. 929-940
Author(s): Amir Basirat, Asad I. Khan, Heinz W. Schmidt

One of the main challenges for large-scale computer clouds dealing with massive real-time data is coping with the rate at which unprocessed data accumulates. Transforming big data into valuable information requires a fundamental rethink of the way in which future data management models are developed on the Internet. Unlike existing relational schemes, pattern-matching approaches can analyze data in ways similar to how our brain links information. Such interactions, when implemented in voluminous data clouds, can assist in finding overarching relations in complex and highly distributed data sets. In this chapter, a different perspective on data recognition is considered. Rather than looking at conventional approaches, such as statistical computations and deterministic learning schemes, the chapter focuses on a distributed processing approach for scalable data recognition and processing.


Complexity · 2018 · Vol 2018 · pp. 1-16
Author(s): Yiwen Zhang, Yuanyuan Zhou, Xing Guo, Jintao Wu, Qiang He, ...

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has long been studied by researchers in numerous fields. However, the value of the cluster number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only acquires efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It comprises two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm thus combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in both accuracy and efficiency, under both sequential and parallel conditions.
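The paper's covering algorithm is not specified in this abstract; the sketch below substitutes a simplified greedy, fixed-radius covering to derive both k and the initial centers (phase 1), then hands them to a standard Lloyd iteration (phase 2). The fixed radius and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def covering_init(X, radius):
    """Greedy covering: repeatedly pick an uncovered point and cover its
    radius-neighborhood; each cover's mean becomes an initial center.
    The fixed radius is a simplification of the paper's covering algorithm."""
    uncovered = np.ones(len(X), dtype=bool)
    centers = []
    while uncovered.any():
        seed = X[uncovered][0]
        dists = np.linalg.norm(X - seed, axis=1)
        members = uncovered & (dists <= radius)
        centers.append(X[members].mean(axis=0))
        uncovered &= ~members
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in ((0, 0), (3, 3), (0, 3))])

centers = covering_init(X, radius=1.0)  # phase 1: k emerges from the data
km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)  # phase 2: Lloyd
print(len(centers), km.inertia_)
```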


2003 · Vol 1836 (1) · pp. 132-142
Author(s): Brian L. Smith, William T. Scherer, James H. Conklin

Many states have implemented large-scale transportation management systems to improve mobility in urban areas. These systems are highly prone to missing and erroneous data, which drastically reduces the data sets available for analysis and real-time operations. Imputation is the practice of filling in missing data with estimated values. Currently, the transportation industry generally does not use imputation to handle missing data. Other disciplines have recognized the importance of addressing missing data, and as a result, methods and software for imputing missing data are becoming widely available. This paper addresses the feasibility and applicability of imputing missing traffic data and performs a preliminary analysis of several heuristic and statistical imputation techniques. The preliminary case-study results were excellent and indicate that the statistical techniques are more accurate while maintaining the natural characteristics of the data.
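The paper's exact techniques are not listed in this abstract; as a hedged illustration of the two families compared, the pandas sketch below contrasts a heuristic fill (the historical time-of-day mean, a common traffic-engineering rule of thumb) with a statistical time-aware interpolation, on synthetic loop-detector volumes.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute traffic volumes over two weeks, with gaps knocked out.
idx = pd.date_range("2003-01-01", periods=14 * 96, freq="15min")
rng = np.random.default_rng(1)
minutes = (idx.hour * 60 + idx.minute).to_numpy()
volume = 200 + 150 * np.sin(2 * np.pi * minutes / 1440) + rng.normal(0, 10, len(idx))
series = pd.Series(volume, index=idx)
series[rng.random(len(idx)) < 0.1] = np.nan  # ~10% missing observations

# Heuristic imputation: fill each gap with the historical mean
# observed at the same time of day.
tod_mean = series.groupby([series.index.hour, series.index.minute]).transform("mean")
heuristic = series.fillna(tod_mean)

# Statistical imputation: time-aware linear interpolation across gaps.
statistical = series.interpolate(method="time")

print(heuristic.isna().sum(), statistical.isna().sum())
```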


Author(s): Jun Huang, Linchuan Xu, Jing Wang, Lei Feng, Kenji Yamanishi

Existing multi-label learning (MLL) approaches mainly assume that all labels are observed and construct classification models with a fixed set of target labels (known labels). However, in some real applications, multiple latent labels may exist outside this set and hide in the data, especially in large-scale data sets. Discovering and exploring the latent labels hidden in the data may not only yield interesting knowledge but also help to build a more robust learning model. In this paper, a novel approach named DLCL (Discovering Latent Class Labels for MLL) is proposed, which can both discover the latent labels in the training data and predict new instances with the latent and known labels simultaneously. Extensive experiments show the competitive performance of DLCL against other state-of-the-art MLL approaches.
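DLCL's formulation is not given in this abstract; the sketch below is a crude stand-in for the general idea rather than the authors' method: train one-vs-rest classifiers on the known labels, flag instances that no known label explains confidently, and cluster those instances into candidate latent labels. The confidence threshold and cluster count are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data; pretend the last label column is "latent"
# (present in the world but absent from the training annotations).
X, Y = make_multilabel_classification(n_samples=600, n_classes=5, random_state=0)
Y_known, y_latent = Y[:, :4], Y[:, 4]

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y_known)

# Instances that every known-label classifier scores low on are candidates
# for carrying labels outside the known set.
probs = clf.predict_proba(X)
unexplained = probs.max(axis=1) < 0.5  # threshold is an assumption

# Cluster the unexplained instances; each cluster is a candidate latent label.
candidates = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[unexplained])
print(unexplained.sum(), np.bincount(candidates))
```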

