Many Languages, One Parser

Author(s):  
Waleed Ammar ◽  
George Mulcaire ◽  
Miguel Ballesteros ◽  
Chris Dyer ◽  
Noah A. Smith

We train one multilingual model for dependency parsing and use it to parse sentences in several languages. The parsing model uses (i) multilingual word clusters and embeddings; (ii) token-level language information; and (iii) language-specific features (fine-grained POS tags). This input representation enables the parser not only to parse effectively in multiple languages, but also to generalize across languages based on linguistic universals and typological similarities, making it more effective to learn from limited annotations. Our parser’s performance compares favorably to strong baselines in a range of data scenarios, including when the target language has a large treebank, a small treebank, or no treebank for training.
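The three-part input representation described above can be illustrated with a minimal sketch: each token is represented by concatenating a multilingual word embedding, a language embedding, and a fine-grained POS embedding. All names, table sizes, and dimensions here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def token_representation(word_vec, lang_id, fine_pos_id,
                         lang_table, pos_table):
    """Concatenate (i) a multilingual word embedding, (ii) a language
    embedding looked up from the token's language ID, and (iii) a
    language-specific fine-grained POS embedding."""
    return np.concatenate([word_vec,
                           lang_table[lang_id],
                           pos_table[fine_pos_id]])

# Illustrative lookup tables: 3 languages x 4 dims, 5 POS tags x 4 dims.
rng = np.random.default_rng(0)
lang_table = rng.normal(size=(3, 4))
pos_table = rng.normal(size=(5, 4))

# An 8-dim word vector plus two 4-dim embeddings -> a 16-dim token input.
vec = token_representation(rng.normal(size=8), lang_id=1, fine_pos_id=2,
                           lang_table=lang_table, pos_table=pos_table)
print(vec.shape)  # (16,)
```

Because the word and language embeddings live in shared spaces, typologically similar languages receive similar inputs, which is what lets the parser generalize across them.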

2018 ◽  
Vol 6 ◽  
pp. 667-685 ◽  
Author(s):  
Dingquan Wang ◽  
Jason Eisner

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work’s interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.7 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).
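As a rough illustration of the kind of signal an unparsed corpus of gold POS sequences carries, one can compute surface statistics such as POS-bigram relative frequencies. This is a hypothetical, hand-written feature extractor, far simpler than the learned end-to-end extractor the paper proposes.

```python
from collections import Counter

def pos_bigram_features(corpus):
    """Relative frequencies of adjacent POS-tag pairs in an unparsed
    corpus of gold POS sequences (no trees or lexical items needed)."""
    counts = Counter()
    total = 0
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            counts[(a, b)] += 1
            total += 1
    return {bigram: n / total for bigram, n in counts.items()}

# Two toy POS sequences yield 5 bigrams in total.
corpus = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB", "NOUN"]]
feats = pos_bigram_features(corpus)
print(feats[("DET", "NOUN")])  # 0.4 (2 of 5 bigrams)
```

Statistics like these already hint at word-order properties (e.g., determiners preceding nouns), which is the intuition behind extracting typology-like features from unparsed text.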


2021 ◽  
Vol 31 ◽  
Author(s):  
Thomas Van Strydonck ◽  
Frank Piessens ◽  
Dominique Devriese

Separation logic is a powerful program logic for the static modular verification of imperative programs. However, dynamic checking of separation logic contracts on the boundaries between verified and untrusted modules is hard because it requires one to enforce (among other things) that outcalls from a verified to an untrusted module do not access memory resources currently owned by the verified module. This paper proposes an approach to dynamic contract checking by relying on support for capabilities, a well-studied form of unforgeable memory pointers that enables fine-grained, efficient memory access control. More specifically, we rely on a form of capabilities called linear capabilities for which the hardware enforces that they cannot be copied. We formalize our approach as a fully abstract compiler from a statically verified source language to an unverified target language with support for linear capabilities. The key insight behind our compiler is that memory resources described by spatial separation logic predicates can be represented at run time by linear capabilities. The compiler is separation-logic-proof-directed: it uses the separation logic proof of the source program to determine how memory accesses in the source program should be compiled to linear capability accesses in the target program. The full abstraction property of the compiler essentially guarantees that compiled verified modules can interact with untrusted target language modules as if they were compiled from verified code as well. This article is an extended version of one that was presented at ICFP 2019 (Van Strydonck et al., 2019).
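The "cannot be copied" property of linear capabilities can be mimicked in software with a move-only wrapper, where transferring the capability invalidates the previous holder. This is a toy sketch only: in the paper the non-copyability is enforced by hardware, not by runtime checks, and all names here are illustrative.

```python
class LinearCapability:
    """A toy move-only pointer: accessing the resource requires
    ownership, and transferring invalidates the previous holder."""
    def __init__(self, resource):
        self._resource = resource
        self._valid = True

    def access(self):
        if not self._valid:
            raise PermissionError("capability was moved away")
        return self._resource

    def transfer(self):
        """Hand the capability to another module; the original
        handle becomes unusable afterwards."""
        if not self._valid:
            raise PermissionError("capability was already moved")
        self._valid = False
        return LinearCapability(self._resource)

cap = LinearCapability({"balance": 42})
callee_cap = cap.transfer()   # outcall passes the resource away
print(callee_cap.access())    # {'balance': 42}
# cap.access() would now raise PermissionError: the verified module
# no longer owns the memory resource it handed to the untrusted module.
```

This mirrors the separation-logic intuition: ownership of a memory resource moves with the capability, so a verified module cannot be tricked into touching memory it has already given away.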


Interpreting ◽  
1998 ◽  
Vol 3 (2) ◽  
pp. 163-199 ◽  
Author(s):  
Robin Setton

Existing simultaneous interpretation (SI) process models lack an account of intermediate representation compatible with the cognitive and linguistic processes inferred from corpus descriptions or psycholinguistic experimentation. Comparison of source language (SL) and target language (TL) at critical points in synchronised transcripts of German-English and Chinese-English SI shows how interpreters use procedural and intentional clues in the input to overcome typological asymmetries and build a dynamic conceptual and intentional mental model which supports fine-grained incremental comprehension. An Executive, responsible for overall co-ordination and secondary pragmatic processing, compensates at the production stage for the inevitable semantic approximations and re-injects pragmatic guidance in the target language. The methodological and cognitive assumptions for the study are provided by Relevance Theory and a 'weakly interactive' parsing model adapted to simultaneous interpretation.


2019 ◽  
Author(s):  
Meg Cychosz ◽  
Alejandrina Cristia ◽  
Elika Bergelson ◽  
Marisa Casillas ◽  
Gladys Baudet ◽  
...  

This study evaluates whether early vocalizations develop in similar ways in children across diverse cultural contexts. We analyze data from daylong audio-recordings of 49 children (1-36 months) from five different language/cultural backgrounds. Citizen scientists annotated these recordings to determine if child vocalizations contained canonical transitions or not (e.g., "ba" versus "ee"). Results revealed that the proportion of clips reported to contain canonical transitions increased with age. Further, this proportion exceeded 0.15 by around 7 months, replicating and extending previous findings on canonical vocalization development but using data from the natural environments of a culturally and linguistically diverse sample. This work explores how crowdsourcing can be used to annotate corpora, helping establish developmental milestones relevant to multiple languages and cultures. Lower inter-annotator reliability on the crowdsourcing platform, relative to more traditional in-lab expert annotators, means that a larger number of unique annotators and/or annotations are required and that crowdsourcing may not be a suitable method for more fine-grained annotation decisions. Audio clips used for this project are compiled into a large-scale infant vocal corpus that is available for other researchers to use in future work.


2020 ◽  
Vol 8 ◽  
pp. 361-376
Author(s):  
Ion Madrazo Azpiazu ◽  
Maria Soledad Pera

The alignment of word embedding spaces in different languages into a common crosslingual space has recently been in vogue. Strategies that do so compute pairwise alignments and then map multiple languages to a single pivot language (most often English). These strategies, however, are biased towards the choice of the pivot language, given that language proximity and the linguistic characteristics of the target language can strongly impact the resultant crosslingual space, to the detriment of typologically distant languages. We present a strategy that eliminates the need for a pivot language by learning the mappings across languages in a hierarchical way. Experiments demonstrate that our strategy significantly improves vocabulary induction scores in all existing benchmarks, as well as in a new non-English-centered benchmark we built, which we make publicly available.
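For context, a single pairwise alignment between two embedding spaces is commonly computed as an orthogonal Procrustes mapping. The sketch below shows that standard building block, not the hierarchical method proposed here; the synthetic data and names are illustrative assumptions.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map W minimizing ||XW - Y||_F, given matched rows of
    two embedding matrices. Standard solution: SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: Y is X under a random rotation; alignment recovers it.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))          # 100 "shared" words, 10-dim space
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # random orthogonal matrix
Y = X @ Q
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # True
```

Chaining such pairwise maps through a single pivot language is exactly where the bias discussed above enters; learning the maps hierarchically avoids privileging any one language's space.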


2019 ◽  
Vol 12 (1) ◽  
pp. 89-115
Author(s):  
Eitan Grossman

This paper sketches the integration of Greek-origin loan verbs into the valency and transitivity patterns of Coptic (Afroasiatic, Egypt), arguing that transitivities are language-specific descriptive categories, and that the comparison of donor-language transitivity with target-language transitivity reveals fine-grained degrees of loan-verb integration. Based on a comparison of Coptic Transitivity and Greek Transitivity, it is shown that Greek-origin loanwords are only partially integrated into the transitivity patterns of Coptic. Specifically, while Greek-origin loan verbs have the same coding properties as native verbs in terms of the A domain, i.e., Differential Subject Marking (dsm), they differ in important respects in terms of the P domain, i.e., Differential Object Marking (dom) and Differential Object Indexing (doi). A main result of this study is that language contact – specifically, massive lexical borrowing – can induce significant transitivity splits in a language’s lexicon and grammar. Furthermore, the findings of this study cast doubt on the usefulness of an overarching cross-linguistic category of transitivity.


2009 ◽  
Vol 81 ◽  
pp. 75-85
Author(s):  
S. Winkler

The present paper deals with the acquisition of finiteness in German and Dutch child language. More specifically, it discusses the assumption of fundamental similarities in the development of the finiteness category in German and Dutch L1 as postulated by Dimroth et al. (2003). A comparison of German and Dutch child corpus data will show that Dimroth et al.'s assumption can be maintained as far as the overall development of the finiteness category is concerned. At a more fine-grained level, however, German and Dutch children exhibit different linguistic behaviour. This concerns in particular the means for the expression of early finiteness and the status of the auxiliary hebben/haben 'to have'. The observed differences can be explained as the result of target-language-specific properties of the input.


1981 ◽  
Vol 10 ◽  
pp. 132-137



Author(s):  
Dave Kush ◽  
Charlotte Sant ◽  
Sunniva Briså Strætkvern

Norwegian allows filler-gap dependencies into relative clauses (RCs) and embedded questions (EQs) – domains that are usually considered islands. We conducted a corpus study on youth-directed reading material to assess what direct evidence Norwegian children receive for filler-gap dependencies into islands. Results suggest that the input contains examples of filler-gap dependencies into both RCs and EQs, but such examples are significantly less frequent than long-distance filler-gap dependencies into non-island clauses. Moreover, evidence for island violations is characterized by the absence of forms that are, in principle, acceptable in the target grammar. Thus, although they encounter dependencies into islands, children must generalize beyond the fine-grained distributional characteristics of the input to acquire the full pattern of island-insensitivity in their target language. We conclude by considering how different learning models would fare on acquiring the target generalizations.

