french corpus
Recently Published Documents


TOTAL DOCUMENTS

38
(FIVE YEARS 6)

H-INDEX

6
(FIVE YEARS 0)

2021 ◽  
Vol 14 (2) ◽  
pp. 494-508
Author(s):  
Francina Sole-Mauri ◽  
Pilar Sánchez-Gijón ◽  
Antoni Oliver

This article presents Cadlaws, a new English–French corpus built from Canadian legal documents, and describes the corpus construction process and preliminary statistics obtained from it. The corpus contains over 16 million words in each language and has unique features, since it is composed of documents that are legally equivalent in both languages but are not the result of translation. The corpus is built upon enactments co-drafted by two jurists to ensure the legal equality of each version and to reflect the concepts, terms and institutions of two legal traditions. The article also discusses why the corpus is defined as a parallel corpus rather than a comparable one. Cadlaws has been pre-processed for machine translation, and a baseline Bilingual Evaluation Understudy (BLEU) score of a neural machine translation system is given; BLEU compares a candidate translation of a text to a gold-standard translation. To the best of our knowledge, this is the largest parallel corpus of texts that convey the same meaning in this language pair, and it is freely available for non-commercial use.
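
To illustrate the kind of comparison BLEU makes between a candidate translation and a gold-standard one, here is a minimal sketch of a sentence-level, single-reference, unsmoothed variant. It is not the evaluation setup used for Cadlaws, which the abstract does not describe; the function names and the example sentences are hypothetical, and published results are normally computed at corpus level with a standard tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU for one tokenized sentence pair.

    Illustrative only: hypothetical helper, not the Cadlaws baseline code.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example sentences (not taken from the corpus).
reference = "this act comes into force on the day".split()
candidate = "this act comes into force on the date".split()
score = sentence_bleu(candidate, reference)
```

An identical candidate and reference score 1.0, and a candidate sharing no words with the reference scores 0.0; everything else falls in between.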


2021 ◽  
Vol 8 (2) ◽  
Author(s):  
Marie Candito ◽  
Mathieu Constant ◽  
Carlos Ramisch ◽  
Agata Savary ◽  
Bruno Guillaume ◽  
...  

We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decision flowcharts, shedding some light on the interactions between NEs and MWEs. Moreover, in order to cope with the well-known difficulty of drawing a clear-cut frontier between compositional expressions and MWEs, we chose to use sufficient criteria only. As a result, annotated MWEs satisfy a varying number of sufficient criteria, accounting for the scalar nature of the MWE status. In addition to the span of the elements, annotation includes the subcategory of NEs (e.g., person, location) and one matching sufficient criterion for non-verbal MWEs (e.g., lexical substitution). The 3,099 sentences of the treebank were double-annotated and adjudicated, and we paid attention to cross-type consistency and compatibility with the syntactic layer. Overall inter-annotator agreement on non-verbal MWEs and NEs reached 71.1%. The released corpus contains 3,112 annotated NEs and 3,440 MWEs, and is distributed under an open license.


Author(s):  
V.Z. Demyankov ◽  

‘Possibility’ is one of the main concepts of modal logic. It plays a key role in the ‘theory of possible worlds’, which was initiated by G.W. Leibniz, revived in the 1950s, and is still fairly popular in formal logic, in philosophy of mind, and in cognitive semantics. Its main axioms and their consequences are explored here from a linguistic point of view, demonstrating analogies between localist and purely logical approaches to truth values in actual and in metaphoric worlds. Statistical analysis of a French corpus of literary and scholarly texts shows that lexical items of the ‘possible’ / ‘possibilité’ class are almost twice as frequent as their negative counterparts of the ‘impossible’ / ‘impossibilité’ class. The most frequent construction is ‘possible’ / ‘possibilité’ + ‘de’ + infinitive, which requires the maximal array of factual information available to the speaker. The least frequent construction is ‘possible’ / ‘possibilité’ + ‘à’ + infinitive. These and similar data reveal a scale of textual expectations generated by the lexical items belonging to the ‘possibility’ class in French.


2020 ◽  
pp. 175-193
Author(s):  
Agnieszka Dryjańska

The subject of this paper is a corpus analysis of patrimoine in contrastive perspective with its Polish equivalent dziedzictwo, within the framework of the intercultural approach in French language teaching. Its purpose is to reveal the semantic differences and similarities of these two words using the results of a frequency and collocation analysis based on the Polish National Corpus, the French corpus Frantext and the Corpora Collection of Leipzig University. The study showed that one of the strongest and most frequent collocations, indicated by different collocation measures, for both Polish and French, is cultural heritage (Fr. patrimoine culturel, Pl. dziedzictwo kulturowe). Typical Polish collocations are national heritage and Christian heritage, while in French these are patrimoine artistique and patrimoine touristique.


2019 ◽  
Vol 40 (1) ◽  
pp. 61-84
Author(s):  
Elissa Pustka

Focusing on sibilant-stop onsets, this paper deals with syllabic complexity in Romance languages. At its core are two empirical studies that address the complex case of French: a type-level study is based on the Petit Robert, and a token-level study uses Parisian and Southern French corpus data elaborated in the framework of the PFC program (Phonologie du Français Contemporain). The paper identifies three factors behind the emergence of phonotactic complexity: (a) vowel elision, (b) borrowing, and (c) expressivity.


2018 ◽  
Author(s):  
Natalia Grabar ◽  
Vincent Claveau ◽  
Clément Dalloux

2017 ◽  
Vol 28 (3) ◽  
pp. 333-349
Author(s):  
Gabriel Bergounioux

This article, based on a French oral corpus, discusses the standing and the interpretation of some vocalic false starts. It recaps the characteristics of false starts and presents the spoken data on which this study is grounded. It details the false starts in the corpus that begin with a vowel in order to draw some conclusions about their form, focusing on the differences between what would be expected and a few discrepancies. The analysis starts from the auditors' perception and proposes an interpretation of the effects of morphophonological constraints.


2017 ◽  
Vol 22 (2) ◽  
pp. 242-269 ◽  
Author(s):  
Ludivine Crible

While discourse markers (DMs) and (dis)fluency have been extensively studied as separate phenomena, corpus-based research combining large-scale yet fine-grained annotations of both categories has never been carried out before. Integrating these two levels of analysis, while methodologically challenging, is not only innovative but also highly relevant to the investigation of spoken discourse in general and form-meaning patterns in particular. The aim of this paper is to provide corpus-based evidence of the register-sensitivity of DMs and other disfluencies (e.g. pauses, repetitions) and of their tendency to combine in recurrent clusters. These claims are supported by quantitative findings on the variation and combination of DMs with other (dis)fluency devices in DisFrEn, a richly annotated and comparable English-French corpus representative of eight different interaction settings. The analysis uncovers the prominent place of DMs within (dis)fluency and meaningful association patterns between forms and functions, in a usage-based approach to meaning-in-context.

