The data-driven Bulgarian WordNet: BTBWN

Author(s):  
Petya Osenova ◽  
Kiril Simov

The paper presents our work towards the simultaneous creation of a data-driven WordNet for Bulgarian and a manually annotated treebank with semantic information. Such an approach requires synchronization of the word senses in both the syntactic and the lexical resources, without limiting the WordNet senses to the corpus or vice versa. Our strategy focuses on identifying the senses used in BulTreeBank, but the missing senses of a lemma have also been covered through exploration of bigger corpora. The identified senses have been organized into synsets for the Bulgarian WordNet and then aligned to the Princeton WordNet synsets. Various types of mappings between the two resources are considered, both from a cross-lingual perspective and with a view to ensuring maximum connectivity and the potential for incorporating language-specific concepts. The mapping between the two WordNets (English and Bulgarian) is a basis for applications such as machine translation and multilingual information retrieval.

2005 ◽  
Vol 04 (02) ◽  
pp. 133-138
Author(s):  
D. Manjula ◽  
T. V. Geetha

The traditional Boolean word-based approach to information retrieval (IR) considers only words for indexing. Irrelevant information is retrieved because semantic information such as word senses and word context is not included. In this work, the importance of representing documents along another semantic dimension, in addition to sense context information, is considered. The incorporation of semantic relations as an additional dimension gives better insight into the interpretation of the document. The micro-contexts generated from the documents are also used in indexing. The retrieval performance is measured in terms of precision and recall. The tabulated results show better performance.


2017 ◽  
Vol 108 (1) ◽  
pp. 355-366 ◽  
Author(s):  
Ankit Srivastava ◽  
Georg Rehm ◽  
Felix Sasaki

With the ever-increasing availability of linked multilingual lexical resources, there is a renewed interest in extending Natural Language Processing (NLP) applications so that they can make use of the vast set of lexical knowledge bases available in the Semantic Web. In the case of Machine Translation, MT systems can potentially benefit from such a resource. Unknown words and ambiguous translations are among the most common sources of error. In this paper, we attempt to minimise these types of errors by interfacing Statistical Machine Translation (SMT) models with Linked Open Data (LOD) resources such as DBpedia and BabelNet. We perform several experiments based on the SMT system Moses and evaluate multiple strategies for exploiting knowledge from multilingual linked data in automatically translating named entities. We conclude with an analysis of best practices for multilingual linked data sets in order to optimise their benefit to multilingual and cross-lingual applications.


Author(s):  
Vasudeva Varma ◽  
Aditya Mogadala

In this chapter, the authors begin by highlighting the importance of the Cross Lingual and Multilingual Information Retrieval and access research areas. They then discuss the distinction between the Cross Language Information Retrieval (CLIR), Multilingual Information Retrieval (MLIR), Cross Language Information Access (CLIA), and Multilingual Information Access (MLIA) research areas. Further sections outline the issues and challenges in these areas and discuss various approaches to multilingual information access, including machine learning-based and knowledge-based approaches. Drawing on their experience of building an MLIA system, the authors describe its architecture and its various subsystems, ranging from query processing to output generation. The chapter ends with a discussion of the evaluation aspects of MLIA and CLIA systems.


Information ◽  
2020 ◽  
Vol 11 (10) ◽  
pp. 492
Author(s):  
Aishan Wumaier ◽  
Cuiyun Xu ◽  
Zaokere Kadeer ◽  
Wenqi Liu ◽  
Yingbo Wang ◽  
...  

The recognition and translation of organization names (ONs) is challenging due to the complex structures and high variability involved. ONs consist not only of common generic words but also names, rare words, abbreviations and business and industry jargon. ONs are a sub-class of named entity (NE) phrases, which convey key information in text. As such, the correct translation of ONs is critical for machine translation and cross-lingual information retrieval. The existing Chinese–Uyghur neural machine translation systems have performed poorly when applied to ON translation tasks. As there are no publicly available Chinese–Uyghur ON translation corpora, an ON translation corpus is developed here, which includes 191,641 ON translation pairs. A word segmentation approach involving characterization, tagged characterization, byte pair encoding (BPE) and syllabification is proposed here for ON translation tasks. A recurrent neural network (RNN) attention framework and transformer are adapted here for ON translation tasks with different sequence granularities. The experimental results indicate that the transformer model not only outperforms the RNN attention model but also benefits from the proposed word segmentation approach. In addition, a Chinese–Uyghur ON translation system is developed here to automatically generate new translation pairs. This work significantly improves Chinese–Uyghur ON translation and can be applied to improve Chinese–Uyghur machine translation and cross-lingual information retrieval. It can also easily be extended to other agglutinative languages.


2014 ◽  
Vol 61 (3) ◽  
pp. 165-185 ◽  
Author(s):  
Pavel Pecina ◽  
Ondřej Dušek ◽  
Lorraine Goeuriot ◽  
Jan Hajič ◽  
Jaroslava Hlaváčová ◽  
...  
