Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

Author(s):  
Condro Wibawa
Irwan Bastian
Metty Mustikasari

Author(s):
ThanhThuong T. Huynh
TruongAn Phamnguyen
Nhon V. Do

To represent text documents more expressively, a graph-based semantic model is proposed that incorporates semantic information among keyphrases as well as the structural information of the text. The method produces structured representations of texts by utilizing common, popular knowledge bases (e.g., DBpedia, Wikipedia) to acquire fine-grained information about concepts, entities, and their semantic relations, resulting in a knowledge-rich interpretation. We demonstrate the benefits of these representations on the task of document similarity measurement. Relevance between two documents is evaluated by calculating the semantic similarity between the two keyphrase graphs that represent them. Experimental results show that our approach outperforms standard baselines based on traditional document representations and comes close in performance to specialized methods tuned particularly to this task on the specific dataset.
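As a minimal sketch of the kind of keyphrase-graph comparison the abstract describes, the snippet below scores two graphs by overlap of their concept nodes and relation edges. The graph representation, the Jaccard scoring, and the equal node/edge weighting are illustrative assumptions, not the authors' actual formulation.

```python
# Sketch: similarity between two keyphrase graphs, scored as a weighted
# combination of node (concept) overlap and edge (relation) overlap.
# The 0.5/0.5 weighting is an illustrative assumption.

def graph_similarity(nodes_a, edges_a, nodes_b, edges_b,
                     node_weight=0.5, edge_weight=0.5):
    """Jaccard-style similarity over shared concepts and relations."""
    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 0.0

    return (node_weight * jaccard(nodes_a, nodes_b)
            + edge_weight * jaccard(edges_a, edges_b))

# Example: tiny graphs whose nodes are DBpedia-style concept labels and
# whose edges are (concept, relation, concept) triples.
doc1_nodes = {"MapReduce", "Hadoop", "Similarity"}
doc1_edges = {("MapReduce", "implementedBy", "Hadoop")}
doc2_nodes = {"MapReduce", "Similarity", "WordNet"}
doc2_edges = {("MapReduce", "implementedBy", "Hadoop")}

print(graph_similarity(doc1_nodes, doc1_edges, doc2_nodes, doc2_edges))
```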


Author(s):  
Mohammed Erritali
Abderrahim Beni-Hssane
Marouane Birjali
Youness Madani

Semantic indexing and document similarity are important information retrieval problems in Big Data with broad applications. In this paper, we investigate the MapReduce programming model as a framework for managing distributed processing over large collections of documents. We then survey the state of the art in approaches for computing document similarity. Finally, we propose an approach to semantic similarity measurement that uses WordNet as an external semantic resource. For evaluation, we compare the proposed approach against previously presented approaches using our new MapReduce algorithm. Experimental results show that our proposed approach outperforms the state-of-the-art ones in running time and improves the measurement of semantic similarity.
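The sketch below illustrates one plausible shape for such a WordNet-backed map task: each document word is scored against query words via NLTK's WordNet interface, and the mapper emits (doc_id, score) pairs for a reducer to aggregate. The mapper signature and the max-over-synsets scoring are assumptions for illustration, not the paper's exact algorithm.

```python
# Requires NLTK with the WordNet data installed: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    """Best path similarity over all synset pairs of the two words."""
    scores = [s1.path_similarity(s2) or 0.0   # path_similarity may return None
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def map_task(doc_id, words, query_words):
    """Emit (doc_id, score) pairs; a reducer would aggregate per document."""
    for w in words:
        best = max((word_similarity(w, q) for q in query_words), default=0.0)
        yield doc_id, best

# Example map call on one small document.
for key, value in map_task("doc-1", ["car", "journey"], ["automobile", "trip"]):
    print(key, value)
```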


2013
Vol 2013
pp. 1-12
Author(s):
Lei Liu
Dongqing Liu
Shuai Lü
Peng Zhang

Map-Reduce-Merge is an improved parallel programming model based on Map-Reduce for cloud computing environments. Through the new Merge module, Map-Reduce-Merge can process multiple related heterogeneous datasets more efficiently. To demonstrate the validity and effectiveness of this new model, we present a rigorous description of the Map-Reduce-Merge model using Haskell. First, we describe the basic program skeleton of the Map-Reduce-Merge programming model. Second, we give an abstract description of the Merge module by analyzing its structure and function, with Haskell as the description tool. Third, we evaluate the Map-Reduce-Merge model on the basis of our description. Our abstract description captures the functional characteristics of the Map-Reduce-Merge model, providing a theoretical basis for designing more efficient parallel programming models for join operations.
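A minimal sketch of the Map-Reduce-Merge skeleton follows (the paper itself uses Haskell as the description tool; Python is used here for consistency with the other sketches). Two datasets are mapped and reduced independently, then the Merge module joins the two reduced outputs on matching keys. Function names are illustrative.

```python
from collections import defaultdict

def map_reduce(dataset, mapper, reducer):
    """Standard Map-Reduce: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in dataset:
        for k, v in mapper(key, value):
            groups[k].append(v)
    return {k: reducer(k, vs) for k, vs in groups.items()}

def merge(left, right, merger):
    """Merge module: combine two reduced datasets on matching keys (a join)."""
    return {k: merger(left[k], right[k]) for k in left.keys() & right.keys()}

# Example: join per-customer order totals with customer names.
orders = [("c1", 10), ("c1", 5), ("c2", 7)]
names = [("c1", "Alice"), ("c2", "Bob")]

totals = map_reduce(orders, lambda k, v: [(k, v)], lambda k, vs: sum(vs))
labels = map_reduce(names, lambda k, v: [(k, v)], lambda k, vs: vs[0])
print(merge(totals, labels, lambda total, name: (name, total)))
```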


2017
Vol 8 (1)
pp. 45-60
Author(s):
Zakaria Benmounah
Souham Meshoul
Mohamed Batouche

One remarkable consequence of the rapid advances in information technology is the production of data sets so large or complex that available processing methods, cluster analysis among them, are inadequate; clustering has become more challenging and complex. In this paper, the authors describe a highly scalable Differential Evolution (DE) algorithm based on the map-reduce programming model. The traditional use of DE for clustering large data sets is so time-consuming as to be infeasible; map-reduce, on the other hand, is a programming model that emerged recently to enable the design of parallel and distributed approaches. The paper presents a four-stage map-reduce Differential Evolution algorithm termed DE-MRC; each of the four stages is a map-reduce process dedicated to a particular DE operation. DE-MRC has been tested on a real parallel platform of 128 interconnected computers and more than 30 GB of data. Experimental results show the high scalability and robustness of DE-MRC.
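As a sketch of the DE operations that one map phase of such an algorithm might apply to a single candidate solution, the snippet below performs rand/1 mutation followed by binomial crossover. The four-phase orchestration and all parameter values here are illustrative assumptions, not the authors' DE-MRC implementation.

```python
import random

def de_map_step(target, population, f=0.5, cr=0.9):
    """Produce a trial vector for one target vector (one map input record)."""
    r1, r2, r3 = random.sample([p for p in population if p is not target], 3)
    j_rand = random.randrange(len(target))  # guarantee at least one mutated gene
    return [
        r1[j] + f * (r2[j] - r3[j])         # rand/1 mutation
        if random.random() < cr or j == j_rand
        else target[j]                      # otherwise keep the target gene
        for j in range(len(target))
    ]

# Example: one map step over a toy population of 2-D candidate centroids.
population = [[0.0, 1.0], [0.5, 0.2], [1.0, 0.8], [0.3, 0.9]]
print(de_map_step(population[0], population))
```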


2013
Vol 427-429
pp. 2618-2621
Author(s):
Ling Shen
Qing Xi Peng

As emerging data-intensive applications receive more and more attention from researchers, near-duplicate text detection over large-scale data poses a severe challenge. This paper presents an algorithm based on MapReduce and an ontology for near-duplicate text detection, computing pairwise document similarity in large document collections. We map the words in each document to their synonyms and then calculate the similarity between documents. MapReduce is a programming model and an associated implementation for processing and generating large data sets: users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In a large-scale test, experimental results demonstrate that this approach outperforms other state-of-the-art solutions. Advantages such as linear running time and accuracy make the algorithm valuable in practice.
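The synonym-mapping idea can be sketched as follows: normalize each word to a canonical synonym before comparing documents, then score the pair with Jaccard similarity. The tiny synonym table stands in for the ontology, and the pairwise scoring shown here is what the paper distributes across map and reduce tasks; all names are illustrative.

```python
# Toy stand-in for the ontology/synonym resource.
SYNONYMS = {"automobile": "car", "vehicle": "car", "fast": "quick"}

def canonical(words):
    """Map each word to its canonical synonym before comparison."""
    return {SYNONYMS.get(w, w) for w in words}

def near_duplicate_score(doc_a, doc_b):
    """Jaccard similarity over synonym-normalized word sets."""
    a, b = canonical(doc_a), canonical(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = ["the", "quick", "automobile"]
doc2 = ["the", "fast", "car"]
print(near_duplicate_score(doc1, doc2))  # 1.0 after synonym mapping
```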

