Many Languages, One Parser

Author(s):  
Waleed Ammar ◽  
George Mulcaire ◽  
Miguel Ballesteros ◽  
Chris Dyer ◽  
Noah A. Smith

We train one multilingual model for dependency parsing and use it to parse sentences in several languages. The parsing model uses (i) multilingual word clusters and embeddings; (ii) token-level language information; and (iii) language-specific features (fine-grained POS tags). This input representation enables the parser not only to parse effectively in multiple languages, but also to generalize across languages based on linguistic universals and typological similarities, making it more effective to learn from limited annotations. Our parser’s performance compares favorably to strong baselines in a range of data scenarios, including when the target language has a large treebank, a small treebank, or no treebank for training.
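The three-part input representation described above can be illustrated with a minimal sketch: each token is represented by concatenating a multilingual word embedding, a language embedding, and a fine-grained POS embedding. All names, table sizes, and dimensions here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def token_representation(word_vec, lang_id, fine_pos_id,
                         lang_table, pos_table):
    """Concatenate (i) a multilingual word embedding, (ii) a language
    embedding looked up from the token's language ID, and (iii) a
    language-specific fine-grained POS embedding."""
    return np.concatenate([word_vec,
                           lang_table[lang_id],
                           pos_table[fine_pos_id]])

# Illustrative lookup tables: 3 languages x 4 dims, 5 POS tags x 4 dims.
rng = np.random.default_rng(0)
lang_table = rng.normal(size=(3, 4))
pos_table = rng.normal(size=(5, 4))

# An 8-dim word vector plus two 4-dim embeddings -> a 16-dim token input.
vec = token_representation(rng.normal(size=8), lang_id=1, fine_pos_id=2,
                           lang_table=lang_table, pos_table=pos_table)
print(vec.shape)  # (16,)
```

Because the word and language embeddings live in shared spaces, typologically similar languages receive similar inputs, which is what lets the parser generalize across them.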

2018 ◽  
Vol 6 ◽  
pp. 667-685 ◽  
Author(s):  
Dingquan Wang ◽  
Jason Eisner

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work’s interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.7 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).
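As a rough illustration of the kind of signal an unparsed corpus of gold POS sequences carries, one can compute surface statistics such as POS-bigram relative frequencies. This is a hypothetical, hand-written feature extractor, far simpler than the learned end-to-end extractor the paper proposes.

```python
from collections import Counter

def pos_bigram_features(corpus):
    """Relative frequencies of adjacent POS-tag pairs in an unparsed
    corpus of gold POS sequences (no trees or lexical items needed)."""
    counts = Counter()
    total = 0
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            counts[(a, b)] += 1
            total += 1
    return {bigram: n / total for bigram, n in counts.items()}

# Two toy POS sequences yield 5 bigrams in total.
corpus = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB", "NOUN"]]
feats = pos_bigram_features(corpus)
print(feats[("DET", "NOUN")])  # 0.4 (2 of 5 bigrams)
```

Statistics like these already hint at word-order properties (e.g., determiners preceding nouns), which is the intuition behind extracting typology-like features from unparsed text.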


2021 ◽  
Vol 31 ◽  
Author(s):  
Thomas Van Strydonck ◽  
Frank Piessens ◽  
Dominique Devriese

Separation logic is a powerful program logic for the static modular verification of imperative programs. However, dynamic checking of separation logic contracts on the boundaries between verified and untrusted modules is hard because it requires one to enforce (among other things) that outcalls from a verified to an untrusted module do not access memory resources currently owned by the verified module. This paper proposes an approach to dynamic contract checking by relying on support for capabilities, a well-studied form of unforgeable memory pointers that enables fine-grained, efficient memory access control. More specifically, we rely on a form of capabilities called linear capabilities for which the hardware enforces that they cannot be copied. We formalize our approach as a fully abstract compiler from a statically verified source language to an unverified target language with support for linear capabilities. The key insight behind our compiler is that memory resources described by spatial separation logic predicates can be represented at run time by linear capabilities. The compiler is separation-logic-proof-directed: it uses the separation logic proof of the source program to determine how memory accesses in the source program should be compiled to linear capability accesses in the target program. The full abstraction property of the compiler essentially guarantees that compiled verified modules can interact with untrusted target language modules as if they were compiled from verified code as well. This article is an extended version of one that was presented at ICFP 2019 (Van Strydonck et al., 2019).
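The "cannot be copied" property of linear capabilities can be mimicked in software with a move-only wrapper, where transferring the capability invalidates the previous holder. This is a toy sketch only: in the paper the non-copyability is enforced by hardware, not by runtime checks, and all names here are illustrative.

```python
class LinearCapability:
    """A toy move-only pointer: accessing the resource requires
    ownership, and transferring invalidates the previous holder."""
    def __init__(self, resource):
        self._resource = resource
        self._valid = True

    def access(self):
        if not self._valid:
            raise PermissionError("capability was moved away")
        return self._resource

    def transfer(self):
        """Hand the capability to another module; the original
        handle becomes unusable afterwards."""
        if not self._valid:
            raise PermissionError("capability was already moved")
        self._valid = False
        return LinearCapability(self._resource)

cap = LinearCapability({"balance": 42})
callee_cap = cap.transfer()   # outcall passes the resource away
print(callee_cap.access())    # {'balance': 42}
# cap.access() would now raise PermissionError: the verified module
# no longer owns the memory resource it handed to the untrusted module.
```

This mirrors the separation-logic intuition: ownership of a memory resource moves with the capability, so a verified module cannot be tricked into touching memory it has already given away.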


Interpreting ◽  
1998 ◽  
Vol 3 (2) ◽  
pp. 163-199 ◽  
Author(s):  
Robin Setton

Existing simultaneous interpretation (SI) process models lack an account of intermediate representation compatible with the cognitive and linguistic processes inferred from corpus descriptions or psycholinguistic experimentation. Comparison of source language (SL) and target language (TL) at critical points in synchronised transcripts of German-English and Chinese-English SI shows how interpreters use procedural and intentional clues in the input to overcome typological asymmetries and build a dynamic conceptual and intentional mental model which supports fine-grained incremental comprehension. An Executive, responsible for overall co-ordination and secondary pragmatic processing, compensates at the production stage for the inevitable semantic approximations and re-injects pragmatic guidance in the target language. The methodological and cognitive assumptions for the study are provided by Relevance Theory and a 'weakly interactive' parsing model adapted to simultaneous interpretation.


2019 ◽  
Author(s):  
Meg Cychosz ◽  
Alejandrina Cristia ◽  
Elika Bergelson ◽  
Marisa Casillas ◽  
Gladys Baudet ◽  
...  

This study evaluates whether early vocalizations develop in similar ways in children across diverse cultural contexts. We analyze data from daylong audio-recordings of 49 children (1-36 months) from five different language/cultural backgrounds. Citizen scientists annotated these recordings to determine if child vocalizations contained canonical transitions or not (e.g., "ba" versus "ee"). Results revealed that the proportion of clips reported to contain canonical transitions increased with age. Further, this proportion exceeded 0.15 by around 7 months, replicating and extending previous findings on canonical vocalization development but using data from the natural environments of a culturally and linguistically diverse sample. This work explores how crowdsourcing can be used to annotate corpora, helping establish developmental milestones relevant to multiple languages and cultures. Lower inter-annotator reliability on the crowdsourcing platform, relative to more traditional in-lab expert annotators, means that a larger number of unique annotators and/or annotations are required and that crowdsourcing may not be a suitable method for more fine-grained annotation decisions. Audio clips used for this project are compiled into a large-scale infant vocal corpus that is available for other researchers to use in future work.


2020 ◽  
Vol 8 ◽  
pp. 361-376
Author(s):  
Ion Madrazo Azpiazu ◽  
Maria Soledad Pera

The alignment of word embedding spaces in different languages into a common crosslingual space has recently been in vogue. Strategies that do so compute pairwise alignments and then map multiple languages to a single pivot language (most often English). These strategies, however, are biased towards the choice of the pivot language, given that language proximity and the linguistic characteristics of the target language can strongly impact the resultant crosslingual space, to the detriment of typologically distant languages. We present a strategy that eliminates the need for a pivot language by learning the mappings across languages in a hierarchical way. Experiments demonstrate that our strategy significantly improves vocabulary induction scores in all existing benchmarks, as well as in a new non-English-centered benchmark we built, which we make publicly available.
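For context, a single pairwise alignment between two embedding spaces is commonly computed as an orthogonal Procrustes mapping. The sketch below shows that standard building block, not the hierarchical method proposed here; the synthetic data and names are illustrative assumptions.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map W minimizing ||XW - Y||_F, given matched rows of
    two embedding matrices. Standard solution: SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: Y is X under a random rotation; alignment recovers it.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))          # 100 "shared" words, 10-dim space
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # random orthogonal matrix
Y = X @ Q
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # True
```

Chaining such pairwise maps through a single pivot language is exactly where the bias discussed above enters; learning the maps hierarchically avoids privileging any one language's space.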


2019 ◽  
Vol 12 (1) ◽  
pp. 89-115
Author(s):  
Eitan Grossman

This paper sketches the integration of Greek-origin loan verbs into the valency and transitivity patterns of Coptic (Afroasiatic, Egypt), arguing that transitivities are language-specific descriptive categories, and that the comparison of donor-language transitivity with target-language transitivity reveals fine-grained degrees of loan-verb integration. Based on a comparison of Coptic Transitivity and Greek Transitivity, it is shown that Greek-origin loanwords are only partially integrated into the transitivity patterns of Coptic. Specifically, while Greek-origin loan verbs have the same coding properties as native verbs in terms of the A domain, i.e., Differential Subject Marking (dsm), they differ in important respects in terms of the P domain, i.e., Differential Object Marking (dom) and Differential Object Indexing (doi). A main result of this study is that language contact – specifically, massive lexical borrowing – can induce significant transitivity splits in a language’s lexicon and grammar. Furthermore, the findings of this study cast doubt on the usefulness of an overarching cross-linguistic category of transitivity.


2009 ◽  
Vol 81 ◽  
pp. 75-85
Author(s):  
S. Winkler

The present paper deals with the acquisition of finiteness in German and Dutch child language. More specifically, it discusses the assumption of fundamental similarities in the development of the finiteness category in German and Dutch L1 as postulated by Dimroth et al. (2003). A comparison of German and Dutch child corpus data will show that Dimroth et al.'s assumption can be maintained as far as the overall development of the finiteness category is concerned. At a more fine-grained level, however, German and Dutch children exhibit different linguistic behaviour. This concerns in particular the means for the expression of early finiteness and the status of the auxiliary hebben/haben 'to have'. The observed differences can be explained as the result of target-language-specific properties of the input.


1981 ◽  
Vol 10 ◽  
pp. 132-137



Author(s):  
Dave Kush ◽  
Charlotte Sant ◽  
Sunniva Briså Strætkvern

Norwegian allows filler-gap dependencies into relative clauses (RCs) and embedded questions (EQs) – domains that are usually considered islands. We conducted a corpus study on youth-directed reading material to assess what direct evidence Norwegian children receive for filler-gap dependencies into islands. Results suggest that the input contains examples of filler-gap dependencies into both RCs and EQs, but such examples are significantly less frequent than long-distance filler-gap dependencies into non-island clauses. Moreover, evidence for island violations is characterized by the absence of forms that are, in principle, acceptable in the target grammar. Thus, although they encounter dependencies into islands, children must generalize beyond the fine-grained distributional characteristics of the input to acquire the full pattern of island-insensitivity in their target language. We conclude by considering how different learning models would fare on acquiring the target generalizations.

