Efficient data distribution and results merging for parallel data clustering in MapReduce environment

Abstract Hierarchical phrase-based machine translation systems rely on the synchronous context-free grammar formalism to learn and use translation rules containing gaps. The grammars learned by such systems become unmanageably large even for medium-sized parallel corpora. The traditional approach of preprocessing the training data and loading all possible translation rules into memory does not scale well for hierarchical phrase-based systems. Online grammar extractors address this problem by constructing memory-efficient data structures on top of the source side of the parallel data (often based on suffix arrays), which are used to efficiently match phrases in the corpus and to extract translation rules on the fly during decoding. This paper describes an open source implementation of an online synchronous context-free grammar extractor. Our approach builds on the work of Lopez (2008a) and introduces a new technique for extending the lists of phrase matches for phrases containing gaps that reduces the extraction time by a factor of 4. Our extractor is available as part of the cdec toolkit (Dyer et al., 2010).
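
The suffix-array lookup mentioned in the abstract can be illustrated with a minimal sketch. It is not the cdec extractor's code: the function names `build_suffix_array` and `find_phrase` are hypothetical, the corpus is a toy token list, and only contiguous phrases are handled (the gap-extension technique the paper introduces is not shown).

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction for illustration only; real extractors use far
    more memory- and time-efficient algorithms.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, suffix_array, phrase, upper):
    """Binary search over the suffix array.

    With upper=False, return the first index whose suffix prefix is >= phrase.
    With upper=True, return the first index whose suffix prefix is > phrase.
    """
    n = len(phrase)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = tokens[start:start + n]  # compare only the first n tokens
        if prefix < phrase or (upper and prefix == phrase):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_phrase(tokens, suffix_array, phrase):
    """Return the sorted corpus positions where `phrase` occurs contiguously."""
    first = _bound(tokens, suffix_array, phrase, upper=False)
    last = _bound(tokens, suffix_array, phrase, upper=True)
    return sorted(suffix_array[first:last])


corpus = "the cat sat on the mat near the cat".split()
sa = build_suffix_array(corpus)
matches = find_phrase(corpus, sa, ["the", "cat"])
```

Because all suffixes that begin with the same phrase sit next to each other in the sorted array, two binary searches are enough to locate every occurrence without scanning the corpus.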


Efficient Data Distribution for DWS

Data Warehousing and Knowledge Discovery - Lecture Notes in Computer Science ◽

10.1007/978-3-540-85836-2_8 ◽

2008 ◽

pp. 75-86 ◽

Cited By ~ 5

Author(s):

Raquel Almeida ◽

Jorge Vieira ◽

Marco Vieira ◽

Henrique Madeira ◽

Jorge Bernardino

Keyword(s):

Data Distribution ◽

Efficient Data


Efficient Data Clustering using Fast Choice for Number of Clusters

Journal of Society of Korea Industrial and Systems Engineering ◽

10.11627/jkise.2018.41.2.001 ◽

2018 ◽

Vol 41 (2) ◽

pp. 1-8

Author(s):

Sung-Soo Kim ◽

Bum-Su Kang

Keyword(s):

Data Clustering ◽

Number Of Clusters ◽

Efficient Data


Performance analysis of efficient data distribution in P2P environment using hybrid clustering techniques

Soft Computing ◽

10.1007/s00500-019-03796-9 ◽

2019 ◽

Vol 23 (19) ◽

pp. 9253-9263 ◽

Cited By ~ 3

Author(s):

S. Raju ◽

M. Chandrasekaran

Keyword(s):

Performance Analysis ◽

Data Distribution ◽

Clustering Techniques ◽

Hybrid Clustering ◽

Efficient Data


A heuristic for efficient data distribution management in distributed simulation

10.1117/12.604070 ◽

2005 ◽

Author(s):

Pankaj Gupta ◽

Ratan K. Guha

Keyword(s):

Distributed Simulation ◽

Data Distribution ◽

Distribution Management ◽

Efficient Data


Efficient ensemble data assimilation for coupled models with the Parallel Data Assimilation Framework: Example of AWI-CM

10.5194/gmd-2019-167 ◽

2019 ◽

Cited By ~ 2

Author(s):

Lars Nerger ◽

Qi Tang ◽

Longjiang Mu

Keyword(s):

Data Assimilation ◽

Numerical Models ◽

Computing Time ◽

Coupled Model ◽

Ocean Model ◽

Coupled Models ◽

Parallel Data ◽

Efficient Data ◽

Abstract Data ◽

Assimilation Process

Abstract. Data assimilation integrates information from observational measurements with numerical models. When used with coupled models of Earth system compartments, e.g. the atmosphere and the ocean, consistent joint states can be estimated. Ensemble-based methods, which use an ensemble of state realizations to estimate the state and its uncertainty, are a common approach to data assimilation. These methods are far more costly to compute than a single coupled model because of the required integration of the ensemble. However, with uncoupled models, the methods have also been shown to exhibit particularly good scaling behavior. This study discusses an approach to augment a coupled model with data assimilation functionality provided by the Parallel Data Assimilation Framework (PDAF). Using only minimal changes in the codes of the different compartment models, a particularly efficient data assimilation system is generated that utilizes parallelization and in-memory data transfers between the models and the data assimilation functions and hence avoids most of the file reading and writing and also model restarts during the data assimilation process. The study explains the required modifications of the programs using the example of the coupled atmosphere-sea ice-ocean model AWI-CM. The case of the assimilation of oceanic observations shows that the data assimilation leads to only small overheads in computing time of about 15 % compared to the model without data assimilation, and to very good parallel scalability. The model-agnostic structure of the assimilation software ensures a separation of concerns in that the development of data assimilation methods can be separated from the model application.
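
As a rough illustration of how an ensemble of state realizations yields both a state estimate and its uncertainty, the sketch below computes the per-variable ensemble mean and spread. It does not use PDAF or AWI-CM: the function name `ensemble_estimate` and the toy numbers are assumptions for illustration, and a real filter would go on to update the ensemble with observations.

```python
from statistics import mean, stdev


def ensemble_estimate(ensemble):
    """Estimate the state and its uncertainty from an ensemble.

    `ensemble` is a list of state vectors, one per ensemble member.
    Returns (state, spread): the per-variable ensemble mean and the
    per-variable sample standard deviation.
    """
    columns = list(zip(*ensemble))  # regroup values by state variable
    state = [mean(column) for column in columns]
    spread = [stdev(column) for column in columns]
    return state, spread


# Hypothetical three-member ensemble with two state variables each.
members = [
    [1.0, 10.0],
    [2.0, 12.0],
    [3.0, 14.0],
]
state, spread = ensemble_estimate(members)
```

Each member requires its own model integration, which is why the cost of an ensemble method grows with the ensemble size, as the abstract notes.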
