Document Similarity for Arabic and Cross-Lingual Web Content

Author(s):  
Ali Salhi ◽  
Adnan H. Yahya
2007 ◽  
Vol 30 (1) ◽  
pp. 135-162 ◽  
Author(s):  
Ralf Steinberger ◽  
Bruno Pouliquen

Named Entity Recognition and Classification (NERC) is a known and well-explored text analysis application that has been applied to various languages. We are presenting an automatic, highly multilingual news analysis system that fully integrates NERC for locations, persons and organisations with document clustering, multi-label categorisation, name attribute extraction, name variant merging and the calculation of social networks. The proposed application goes beyond the state-of-the-art by automatically merging the information found in news written in ten different languages, and by using the aggregated name information to automatically link related news documents across languages for all 45 language pair combinations. While state-of-the-art approaches for cross-lingual name variant merging and document similarity calculation require bilingual resources, the methods proposed here are mostly language-independent and require a minimal amount of monolingual language-specific effort. The development of resources for additional languages is therefore kept to a minimum and new languages can be plugged into the system effortlessly. The presented online news analysis application is fully functional and has, at the end of the year 2006, reached average usage statistics of 600,000 hits per day.


2016 ◽  
Vol 22 (4) ◽  
pp. 627-653 ◽  
Author(s):  
RAZIEH RAHIMI ◽  
AZADEH SHAKERY ◽  
JAVID DADASHKARIMI ◽  
MOZHDEH ARIANNEZHAD ◽  
MOSTAFA DEHGHANI ◽  
...  

Comparable corpora are key translation resources for both languages and domains with limited linguistic resources. The existing approaches for building comparable corpora are mostly based on ranking candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the importance of each evidence in the scores of candidate target documents is determined heuristically. In this paper, we employ a learning to rank method for ranking candidate target documents with respect to each source document. The ranking model is constructed by defining each evidence for similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of features depends on the characteristics of both source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. We employed the proposed learning-based approach to build a multi-domain English–Persian comparable corpus which covers twelve different domains obtained from Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both quality and coverage of alignments.
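
As an illustration of the idea in the abstract above (each piece of similarity evidence becomes a feature whose weight is learned rather than set by hand), the following is a minimal sketch. It uses a simple pairwise perceptron update, which is an assumption made here for brevity and not necessarily the learning to rank method the authors employed. The feature names (`retrieval`, `names`, `date`) and all values are hypothetical.

```python
from typing import Dict, List, Tuple

# A feature vector maps a (hypothetical) evidence name to its value.
Features = Dict[str, float]


def score(weights: Features, feats: Features) -> float:
    """Linear score: weighted sum of the evidence features."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def rank(weights: Features, candidates: Dict[str, Features]) -> List[str]:
    """Return candidate target document ids, best match first."""
    return sorted(
        candidates,
        key=lambda doc_id: score(weights, candidates[doc_id]),
        reverse=True,
    )


def train_pairwise(
    pairs: List[Tuple[Features, Features]], epochs: int = 10, lr: float = 0.1
) -> Features:
    """Learn weights from (better, worse) candidate pairs.

    Perceptron-style update: whenever the better candidate does not
    outscore the worse one, move the weights toward their difference.
    """
    weights: Features = {}
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) <= score(weights, worse):
                for name in set(better) | set(worse):
                    diff = better.get(name, 0.0) - worse.get(name, 0.0)
                    weights[name] = weights.get(name, 0.0) + lr * diff
    return weights


# Hypothetical training pair: the aligned document shares many proper
# names with the source, while the non-aligned one only has a close date.
aligned = {"retrieval": 0.9, "names": 0.8, "date": 0.1}
not_aligned = {"retrieval": 0.5, "names": 0.1, "date": 0.9}

weights = train_pairwise([(aligned, not_aligned)])
ranking = rank(weights, {"doc_a": not_aligned, "doc_b": aligned})
```

With weights learned this way, the relative importance of each evidence type follows from the training pairs for a given language pair instead of being fixed heuristically.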


Author(s):  
Jan Rupnik ◽  
Andrej Muhič ◽  
Gregor Leban ◽  
Blaž Fortuna ◽  
Marko Grobelnik

In today's world, we follow news which is distributed globally. Significant events are reported by different sources and in different languages. In this work, we address the problem of tracking of events in a large multilingual stream. Within a recently developed system Event Registry we examine two aspects of this problem: how to compare articles in different languages and how to link collections of articles in different languages which refer to the same event. Building on previous work, we show there are methods which scale well and can compute a meaningful similarity between articles from languages with little or no direct overlap in the training data. Using this capability, we then propose an approach to link clusters of articles across languages which represent the same event.


2016 ◽  
Vol 55 ◽  
pp. 283-316 ◽  
Author(s):  
Jan Rupnik ◽  
Andrej Muhic ◽  
Gregor Leban ◽  
Primoz Skraba ◽  
Blaz Fortuna ◽  
...  

In today's world, we follow news which is distributed globally. Significant events are reported by different sources and in different languages. In this work, we address the problem of tracking of events in a large multilingual stream. Within a recently developed system Event Registry we examine two aspects of this problem: how to compare articles in different languages and how to link collections of articles in different languages which refer to the same event. Taking a multilingual stream and clusters of articles from each language, we compare different cross-lingual document similarity measures based on Wikipedia. This allows us to compute the similarity of any two articles regardless of language. Building on previous work, we show there are methods which scale well and can compute a meaningful similarity between articles from languages with little or no direct overlap in the training data. Using this capability, we then propose an approach to link clusters of articles across languages which represent the same event. We provide an extensive evaluation of the system as a whole, as well as an evaluation of the quality and robustness of the similarity measure and the linking algorithm.
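
One way to picture a Wikipedia-based cross-lingual similarity of the kind mentioned above is to map each article into a shared space of language-independent concepts and compare the resulting vectors. The sketch below is a simplified concept-vector illustration, not one of the measures evaluated in the paper. The `TERM_CONCEPTS` index, its concept ids, and its weights are hypothetical toy data; in practice such an index would be derived from Wikipedia pages aligned across languages.

```python
import math
from collections import Counter
from typing import Dict

# Hypothetical toy index: per language, term -> {concept id: weight}.
# The concept ids are shared across languages.
TERM_CONCEPTS: Dict[str, Dict[str, Dict[str, float]]] = {
    "en": {
        "earthquake": {"C:Earthquake": 1.0},
        "election": {"C:Election": 1.0},
        "japan": {"C:Japan": 1.0},
    },
    "de": {
        "erdbeben": {"C:Earthquake": 1.0},
        "wahl": {"C:Election": 1.0},
        "japan": {"C:Japan": 1.0},
    },
}


def concept_vector(text: str, lang: str) -> Dict[str, float]:
    """Project a text onto the shared concept space of its language index."""
    vec: Counter = Counter()
    for token in text.lower().split():
        for concept, weight in TERM_CONCEPTS[lang].get(token, {}).items():
            vec[concept] += weight
    return dict(vec)


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(text_a: str, lang_a: str, text_b: str, lang_b: str) -> float:
    """Similarity of two texts that may be written in different languages."""
    return cosine(concept_vector(text_a, lang_a), concept_vector(text_b, lang_b))


same_event = similarity("Earthquake Japan", "en", "Erdbeben Japan", "de")
different_event = similarity("Earthquake Japan", "en", "Wahl", "de")
```

Because both articles are compared in the shared concept space, no direct bilingual dictionary between the two languages is needed for this toy comparison.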


2012 ◽  
Author(s):  
Xin Liu ◽  
Xiaobin Zhou ◽  
Jianjun Zhu ◽  
Jing-Jen Wang
