Weakly Supervised SVM for Chinese-English Cross-lingual Subcategorization Lexicon Acquisition

Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity

Computational Linguistics ◽

10.1162/coli_a_00391 ◽

2020 ◽

pp. 1-51

Author(s):

Ivan Vulić ◽

Simon Baker ◽

Edoardo Maria Ponti ◽

Ulla Petti ◽

Ira Leviant ◽

...

Keyword(s):

Semantic Similarity ◽

Large Scale ◽

Representation Learning ◽

Data Sets ◽

Word Embeddings ◽

Data Set ◽

Lexical Representations ◽

Language Data ◽

Weakly Supervised ◽

Cross Lingual

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for building consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can help guide future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
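
The figure of 66 crosslingual data sets is consistent with pairing each of the 12 languages with every other one: 12 × 11 / 2 = 66 unordered pairs. A minimal sketch of that count follows; the language identifiers are placeholders invented for illustration, not the benchmark's actual language codes.

```python
from itertools import combinations

# Placeholder identifiers for the 12 languages (hypothetical labels,
# not the codes used by the benchmark itself).
languages = [f"lang{i:02d}" for i in range(1, 13)]

# One crosslingual data set per unordered pair of distinct languages.
pairs = list(combinations(languages, 2))

print(len(languages), "languages ->", len(pairs), "language pairs")
```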

Weakly Supervised Attentional Model for Low Resource Ad-hoc Cross-lingual Information Retrieval

10.18653/v1/d19-6129 ◽

2019 ◽

Cited By ~ 2

Author(s):

Lingjun Zhao ◽

Rabih Zbib ◽

Zhuolin Jiang ◽

Damianos Karakos ◽

Zhongqiang Huang

Keyword(s):

Information Retrieval ◽

Ad Hoc ◽

Attentional Model ◽

Low Resource ◽

Weakly Supervised ◽

Cross Lingual

Weakly supervised spoken term discovery using cross-lingual side information

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2017.7953260 ◽

2017 ◽

Author(s):

Sameer Bansal ◽

Herman Kamper ◽

Sharon Goldwater ◽

Adam Lopez

Keyword(s):

Side Information ◽

Lingual Side ◽

Weakly Supervised ◽

Cross Lingual

Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Representation Projection

10.18653/v1/p17-1135 ◽

2017 ◽

Cited By ~ 7

Author(s):

Jian Ni ◽

Georgiana Dinu ◽

Radu Florian

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Weakly Supervised ◽

Cross Lingual

Cross-lingual Projected Expectation Regularization for Weakly Supervised Learning

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00165 ◽

2014 ◽

Vol 2 ◽

pp. 55-66 ◽

Cited By ~ 10

Author(s):

Mengqiu Wang ◽

Christopher D. Manning

Keyword(s):

Supervised Learning ◽

Model Uncertainty ◽

State Of The Art ◽

New Method ◽

Weakly Supervised Learning ◽

Supervised Methods ◽

Weakly Supervised ◽

Language Boundaries ◽

Standard Chinese ◽

Cross Lingual

We consider a multilingual weakly supervised learning scenario where knowledge from annotated corpora in a resource-rich language is transferred via bitext to guide the learning in other languages. Past approaches project labels across bitext and use them as features or gold labels for training. We propose a new method that projects model expectations rather than labels, which facilitates transfer of model uncertainty across language boundaries. We encode expectations as constraints and train a discriminative CRF model using Generalized Expectation Criteria (Mann and McCallum, 2010). Evaluated on standard Chinese-English and German-English NER datasets, our method demonstrates F1 scores of 64% and 60%, respectively, when no labeled data is used. Attaining the same accuracy with supervised CRFs requires 12k and 1.5k labeled sentences, respectively. Furthermore, when combined with labeled examples, our method yields significant improvements over state-of-the-art supervised methods, achieving best reported numbers to date on Chinese OntoNotes and German CoNLL-03 datasets.
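
The contrast between projecting labels and projecting model expectations can be illustrated with a toy sketch. This is not the authors' implementation: it omits the CRF and the Generalized Expectation training objective entirely, and the tag set, posterior values, word alignment, and function names are all made up for illustration.

```python
LABELS = ["O", "PER"]  # toy tag set, assumed for illustration


def project_hard_labels(src_posteriors, alignment):
    """Label projection: each aligned target token gets only the source argmax tag."""
    projected = {}
    for s, t in alignment:
        dist = src_posteriors[s]
        projected[t] = max(range(len(dist)), key=dist.__getitem__)
    return projected


def project_expectations(src_posteriors, alignment):
    """Expectation projection: each aligned target token gets the full source distribution."""
    return {t: list(src_posteriors[s]) for s, t in alignment}


# Made-up source-side posteriors over LABELS for two source tokens.
src_posteriors = [[0.90, 0.10],
                  [0.55, 0.45]]
# Hypothetical word alignment as (source index, target index) pairs.
alignment = [(0, 1), (1, 0)]

hard = project_hard_labels(src_posteriors, alignment)
soft = project_expectations(src_posteriors, alignment)

for t in sorted(hard):
    print(t, LABELS[hard[t]], soft[t])
```

In this toy case both target tokens receive the hard label O, while the near-tie on the second source token (0.55 versus 0.45) survives only in the projected distribution. That retained uncertainty is what the abstract describes as being transferred across the language boundary.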

Latent Sentiment Model for Weakly-Supervised Cross-Lingual Sentiment Classification

Lecture Notes in Computer Science - Advances in Information Retrieval ◽

10.1007/978-3-642-20161-5_22 ◽

2011 ◽

pp. 214-225 ◽

Cited By ~ 4

Author(s):

Yulan He

Keyword(s):

Sentiment Classification ◽

Weakly Supervised ◽

Cross Lingual

Weakly Supervised Cross-lingual Semantic Relation Classification via Knowledge Distillation

10.18653/v1/d19-1532 ◽

2019 ◽

Author(s):

Yogarshi Vyas ◽

Marine Carpuat

Keyword(s):

Semantic Relation ◽

Knowledge Distillation ◽

Weakly Supervised ◽

Relation Classification ◽

Cross Lingual

Weakly Supervised POS Taggers Perform Poorly on Truly Low-Resource Languages

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6317 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8066-8073

Author(s):

Katharina Kann ◽

Ophélie Lacroix ◽

Anders Søgaard

Keyword(s):

State Of The Art ◽

Sources Of Information ◽

High Coverage ◽

Weak Supervision ◽

Low Resource ◽

Pos Tagging ◽

Part Of Speech ◽

Resource Poor ◽

Weakly Supervised ◽

Cross Lingual

Part-of-speech (POS) taggers for low-resource languages that are based exclusively on various forms of weak supervision – e.g., cross-lingual transfer, type-level supervision, or a combination thereof – have been reported to perform almost as well as supervised ones. However, weakly supervised POS taggers are commonly evaluated only on languages that are very different from truly low-resource languages, and the taggers use sources of information, like high-coverage and almost error-free dictionaries, which are likely not available for resource-poor languages. We train and evaluate state-of-the-art weakly supervised POS taggers for a typologically diverse set of 15 truly low-resource languages. On these languages, given a realistic amount of resources, even our best model gets fewer than half of the words right. Our results highlight the need for new and different approaches to POS tagging for truly low-resource languages.
