multilingual corpus
Recently Published Documents


TOTAL DOCUMENTS

82
(FIVE YEARS 24)

H-INDEX

6
(FIVE YEARS 2)

2021 ◽  
pp. 1-35
Author(s):  
MARTIJN VAN DER KLIS ◽  
BERT LE BRUYN ◽  
HENRIËTTE DE SWART

The western European present perfect is subject to substantial crosslinguistic variation. The literature, however, focuses on individual languages or on comparisons of a restricted number of languages. We piece together the puzzle and do so in a data-driven way by comparing the use of the present perfect through a parallel corpus based on the French novel L’Étranger and its translations in Italian, German, Dutch, European Spanish, British English, and Modern Greek. We introduce and showcase Translation Mining, a software suite combining a parallel corpus database with annotation and analysis tools. Translation Mining allows us to generate descriptive statistics of tense use across languages but also to visualize variation through its multidimensional scaling component and to link the variation we find to the underlying data through its integrated setup. We confirm that the present perfect competes with the past and we reveal the fine-grained scalar nature of the variation. To complete the puzzle, we ascertain the dimensions of variation, ranging from lexical and compositional semantics to dynamic semantics and pragmatics.


2021 ◽  
Vol 19 (3) ◽  
pp. e24
Author(s):  
Márcia Barros ◽  
Pedro Ruas ◽  
Diana Sousa ◽  
Ali Haider Bangash ◽  
Francisco M. Couto

Tracking the most recent advances in Coronavirus disease 2019 (COVID-19)‒related research is essential, given the disease's novelty and its impact on society. However, with the publication pace speeding up, researchers and clinicians require automatic approaches to keep up with the incoming information regarding this disease. A solution to this problem requires the development of text mining pipelines, the efficiency of which strongly depends on the availability of curated corpora. However, there is a lack of COVID-19‒related corpora, even more so for languages other than English. This project's main contribution was the annotation of a multilingual parallel corpus and the generation of a recommendation dataset (EN-PT and EN-ES) regarding relevant entities, their relations, and recommendation, providing this resource to the community to improve the text mining research on COVID-19‒related literature. This work was developed during the 7th Biomedical Linked Annotation Hackathon (BLAH7).


2021 ◽  
Vol 11 (1) ◽  
pp. 47-68
Author(s):  
Thomas Egan

This paper presents the results of a study of double object constructions containing the cognate verbs English tell and Norwegian fortelle, based on data from the English–Norwegian Parallel Corpus. The results show that there is a certain degree of correspondence between the two verbs in constructions with nominal direct objects, with less mutual correspondence in constructions with finite clausal objects, very little correspondence in constructions with objects in the form of direct speech, and none whatsoever in the case of non-finite clausal objects, which only occur with tell. The paper then expands the topic to include tell predications in French. The data were retrieved from the Oslo Multilingual Corpus. It transpires that the form of French translations of Norwegian expressions is more similar, at least for some constructions, to the Norwegian originals than are their English counterparts.


2021 ◽  
Vol 11 (15) ◽  
pp. 7160
Author(s):  
Ramon Ruiz-Dolz ◽  
Montserrat Nofre ◽  
Mariona Taulé ◽  
Stella Heras ◽  
Ana García-Fornes

The application of the latest Natural Language Processing breakthroughs in computational argumentation has shown promising results, which have raised interest in this area of research. However, the available corpora with argumentative annotations are often limited to a very specific purpose or are not of adequate size to take advantage of state-of-the-art deep learning techniques (e.g., deep neural networks). In this paper, we present VivesDebate, a large, richly annotated and versatile professional debate corpus for computational argumentation research. The corpus has been created from 29 transcripts of a debate tournament in Catalan and has been machine-translated into Spanish and English. The annotation contains argumentative propositions, argumentative relations, debate interactions and professional evaluations of the arguments and argumentation. The presented corpus can be useful for research on a heterogeneous set of underlying computational argumentation tasks such as Argument Mining, Argument Analysis, Argument Evaluation or Argument Generation, among others. All this makes VivesDebate a valuable resource for computational argumentation research within the context of massive corpora aimed at Natural Language Processing tasks.


2021 ◽  
Vol 66 ◽  
pp. 101155
Author(s):  
Roldano Cattoni ◽  
Mattia Antonino Di Gangi ◽  
Luisa Bentivogli ◽  
Matteo Negri ◽  
Marco Turchi

2021 ◽  
Author(s):  
Ayyoob ImaniGooghari ◽  
Masoud Jalili Sabet ◽  
Philipp Dufter ◽  
Michael Cysouw ◽  
Hinrich Schütze

2021 ◽  
Vol 2 (2) ◽  
Author(s):  
Grant Aiton

This extended abstract details the process of constructing an annotated XML corpus suitable for quantitative analysis of morphosyntactic and phonetic phenomena in the Eibela language of Papua New Guinea. Preliminary results will also be included, which investigate the semantic, phonetic, and discourse correlates of argument realization. The goal of this paper is to illustrate how legacy materials can be enriched and investigated using computational methodologies including forced alignment of phonetic segments using bulk processing of data in Python and R, the Montreal Forced Aligner (MFA), and morphosyntactic annotation developed as part of the Multilingual Corpus of Annotated Spoken Texts (Multi-CAST).


Author(s):  
Rohan Nanda ◽  
Llio Humphreys ◽  
Lorenzo Grossio ◽  
Adebayo Kolawole John

This paper presents a multilingual legal information retrieval system for mapping recitals to articles in European Union (EU) directives and normative provisions in national legislation. Such a system could be useful for purposive interpretation of norms. Previous work on mapping recitals and normative provisions was limited to EU legislation in English and used only one lexical text similarity technique. In this paper, we develop state-of-the-art text similarity models to investigate the interplay between directive recitals, directive (sub-)articles and provisions of national implementing measures (NIMs) on a multilingual corpus (from Ireland, Italy and Luxembourg). Our results indicate that directive recitals do not have a direct influence on NIM provisions, but they sometimes contain additional information that is not present in the transposed directive sub-article, and can therefore facilitate purposive interpretation.


Corpora ◽  
2020 ◽  
Vol 15 (3) ◽  
pp. 273-290
Author(s):  
Matylda Włodarczyk ◽  
Joanna Kopaczyk ◽  
Michał Kozak

This paper introduces the Electronic Repository of Greater Poland Oaths, eROThA (1386–1446), a digitisation project of a diplomatic edition of mediaeval land court oaths recorded in Latin and Old Polish, resulting in a small, lightly tagged specialised bilingual corpus. We present the background, aims, design and methodology of the project. We also discuss the problems and limitations entrenched in turning a printed diplomatic edition into a machine-readable diplomatic edition equipped with a new interpretative layer that is sensitive to the switches between Latin and Old Polish. In addition to the automatic annotation of code-switched items on the basis of typographic characteristics of the printed edition, flexible coding of recurrent language and discourse boundary phenomena has been introduced manually to account for linguistically ambiguous or neutral forms. The project offers a fully multilingual corpus, as well as customised Polish-only and Latin-only datasets, and enables filtered metadata searches in the online front-end. Overall, the report presents a methodology for constructing multilingual corpora in the context of legal cultures in medieval Central Europe that may be extrapolated to datasets originating in other periods and regions.

