Large Scale Intent Detection in Turkish Short Sentences with Contextual Word Embeddings

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
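
The figure of 66 crosslingual data sets is consistent with one data set per unordered pair of the 12 languages. The short check below assumes that pairing scheme, which the abstract implies but does not state explicitly; the variable names are illustrative only.

```python
from itertools import combinations

LANGUAGES = 12  # number of languages stated in the abstract

# Assumption: each unordered pair of distinct languages yields one
# crosslingual data set, so the count is "12 choose 2".
crosslingual_sets = len(list(combinations(range(LANGUAGES), 2)))
print(crosslingual_sets)
```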

Lexicons on Demand: Neural Word Embeddings for Large-Scale Text Analysis

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/677 ◽

2017 ◽

Cited By ~ 1

Author(s):

Ethan Fast ◽

Binbin Chen ◽

Michael S. Bernstein

Keyword(s):

Text Analysis ◽

Large Scale ◽

Data Driven ◽

Word Embeddings ◽

Human Language ◽

Lexical Categories ◽

On Demand ◽

Small Set ◽

Highly Correlated ◽

The Web

Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated such as neglect, government, and social media. We show that Empath's data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
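
The seed-expansion step described above can be sketched roughly as a nearest-neighbour lookup around the seed terms. This is a minimal illustration, not Empath's actual implementation: the three-dimensional vectors are invented, the function names `cosine` and `expand_category` are hypothetical, and the crowd-powered validation filter is not modelled.

```python
import math

# Toy embedding; the vectors are invented for illustration only.
EMBEDDING = {
    "bleed": [0.9, 0.1, 0.0],
    "punch": [0.8, 0.2, 0.1],
    "stab": [0.85, 0.15, 0.05],
    "fight": [0.7, 0.3, 0.1],
    "senate": [0.0, 0.9, 0.2],
    "tweet": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_category(seeds, embedding, top_k=2):
    """Return the top_k non-seed words closest to the centroid of the seeds."""
    dims = len(embedding[seeds[0]])
    centroid = [
        sum(embedding[s][i] for s in seeds) / len(seeds) for i in range(dims)
    ]
    candidates = [w for w in embedding if w not in seeds]
    ranked = sorted(
        candidates, key=lambda w: cosine(centroid, embedding[w]), reverse=True
    )
    return ranked[:top_k]


# Seeds taken from the abstract's "violence" example.
violence = expand_category(["bleed", "punch"], EMBEDDING)
print(violence)
```

In a real setting the candidate terms returned here would still need human validation before being accepted into the category, as the abstract describes.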

Large scale hierarchical text classification

10.12681/eadd/36242 ◽

2015 ◽

Author(s):

Άρης Κοσμόπουλος

Keyword(s):

Principal Component Analysis ◽

Text Classification ◽

Large Scale ◽

Principal Component ◽

Component Analysis ◽

Bag Of Words ◽

Word Embeddings ◽

Medical Text ◽

Hierarchical Text Classification

Hierarchies are used increasingly often to organize texts, and this use is even more frequent on the web. Web directories such as the Yahoo Directory and the Dmoz Directory are typical examples. Their frequent use, however, creates a need for automated ways of classifying new texts into the categories of these hierarchies. In this thesis, we call this problem "large-scale hierarchical text classification". It is large scale because the categories number in the thousands and the texts can range from hundreds of thousands to millions. It is also hierarchical because the categories are connected to one another by parent-child relations. An important issue in hierarchical classification is the evaluation of different classification algorithms, which becomes even harder because of the hierarchy. Various hierarchical measures have been proposed in the past, but without offering a unified view of the problem. In this thesis, we study the problem of evaluation in hierarchical classification, analyzing the key elements of the existing hierarchical measures. We also divide the existing hierarchical measures into two alternative general models and propose two novel measures for each model. The existing and proposed measures are tested on three large text classification data sets. The experimental results show the limitations of the existing measures and how the newly proposed measures overcome them. We then focus on the simplest form of hierarchical classification, in which each text belongs to only one category and the hierarchy has the form of a tree. The most common form of hierarchical classification is the Cascade, in which the hierarchy is traversed from the root of the tree to the proposed leaf.
To carry out this procedure, a classifier must be trained at each node of the tree, but at the higher levels the number of features can become prohibitively high. It is therefore desirable to reduce the dimensionality of the feature space at these levels. Since the most widely used feature reduction method is Principal Component Analysis (PCA), we examine its use in the Cascade, studying its effect on the computational cost as well as on the accuracy of the classifiers. We also propose an alternative, probabilistic Cascade which, by making better use of the classifiers' probabilities, achieves better results than the traditional Cascade. Finally, we examine a more complex problem, known as biomedical semantic classification, in which biomedical texts must be classified into categories that belong to a large biomedical hierarchy. This problem is more complex because the hierarchy is a directed graph rather than simply a tree, and each text may belong to many categories, which need not be leaves of the graph. For this problem, we examine the use of dense word vectors (word embeddings) as a way of reducing the dimensionality of the features. We examine several approaches for moving from word vectors to document vectors and propose a simple centroid-based procedure that is suitable for the problem. We also show that adopting this approach makes large-scale hierarchical classification much more scalable, without falling behind in accuracy compared with the usual bag-of-words approach. In our experiments we examine the use of hierarchical and non-hierarchical k-nearest-neighbor classifiers and study the effect of their various parameters.
We also present a high-accuracy system that is combined with the widely used Medical Text Indexer (MTI) system of the National Library of Medicine with the aim of improving its predictions.
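
The centroid-based step from word vectors to document vectors, followed by a nearest-neighbour classifier, can be sketched as below. This is a simplified, single-label, non-hierarchical illustration under stated assumptions, not the thesis's system: the two-dimensional word vectors, the training documents, the labels, and the function names `doc_vector` and `knn_predict` are all invented for the example.

```python
import math
from collections import Counter

# Toy word vectors; the values are invented for illustration only.
WORD_VECTORS = {
    "tumor": [0.9, 0.1],
    "cancer": [0.8, 0.2],
    "heart": [0.1, 0.9],
    "artery": [0.2, 0.8],
}


def doc_vector(tokens, vectors):
    """Centroid (mean) of the vectors of the tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("document contains no known tokens")
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_predict(query_tokens, labelled_docs, vectors, k=1):
    """Majority label among the k training documents nearest to the query."""
    query = doc_vector(query_tokens, vectors)
    ranked = sorted(
        labelled_docs,
        key=lambda doc: cosine(query, doc_vector(doc[0], vectors)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (tokens, label) pairs.
TRAIN = [
    (["tumor", "cancer"], "Neoplasms"),
    (["heart", "artery"], "Cardiovascular Diseases"),
]

# "patient" is out of vocabulary and is skipped when building the centroid.
prediction = knn_predict(["cancer", "patient"], TRAIN, WORD_VECTORS)
print(prediction)
```

Because every document is reduced to a fixed-length dense vector regardless of vocabulary size, the neighbour search works in far fewer dimensions than a bag-of-words representation would require, which is the scalability benefit the abstract points to.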

Large Scale Intent Detection in Turkish Short Sentences with Contextual Word Embeddings

Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management ◽

10.5220/0010108301810186 ◽

2020 ◽

Author(s):

Enes Dündar ◽

Osman Kılıç ◽

Tolga Çekiç ◽

Yusufcan Manav ◽

Onur Deniz

Keyword(s):

Large Scale ◽

Word Embeddings

Incorporating Word Embeddings into Open Directory Project Based Large-Scale Classification

Advances in Knowledge Discovery and Data Mining - Lecture Notes in Computer Science ◽

10.1007/978-3-319-93037-4_30 ◽

2018 ◽

pp. 376-388 ◽

Cited By ~ 2

Author(s):

Kang-Min Kim ◽

Aliyeva Dinara ◽

Byung-Ju Choi ◽

SangKeun Lee

Keyword(s):

Large Scale ◽

Word Embeddings ◽

Scale Classification

Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora

Political Analysis ◽

10.1017/pan.2019.26 ◽

2019 ◽

Vol 28 (1) ◽

pp. 112-133 ◽

Cited By ~ 7

Author(s):

Ludovic Rheault ◽

Christopher Cochrane

Keyword(s):

Language Processing ◽

Large Scale ◽

Network Models ◽

The United States ◽

Word Embeddings ◽

Party Affiliation ◽

Neural Network Models ◽

Members Of Parliament ◽

Indicator Variables ◽

Quantities Of Interest

Word embeddings, the coefficients from neural network models predicting the use of words in context, have now become inescapable in applications involving natural language processing. Despite a few studies in political science, the potential of this methodology for the analysis of political texts has yet to be fully uncovered. This paper introduces models of word embeddings augmented with political metadata and trained on large-scale parliamentary corpora from Britain, Canada, and the United States. We fit these models with indicator variables of the party affiliation of members of parliament, which we refer to as party embeddings. We illustrate how these embeddings can be used to produce scaling estimates of ideological placement and other quantities of interest for political research. To validate the methodology, we assess our results against indicators from the Comparative Manifestos Project, surveys of experts, and measures based on roll-call votes. Our findings suggest that party embeddings are successful at capturing latent concepts such as ideology, and the approach provides researchers with an integrated framework for studying political language.

Clarifying Ambiguous Keywords with Personal Word Embeddings for Personalized Search

ACM Transactions on Information Systems ◽

10.1145/3470564 ◽

2022 ◽

Vol 40 (3) ◽

pp. 1-29

Author(s):

Jing Yao ◽

Zhicheng Dou ◽

Ji-Rong Wen

Keyword(s):

Large Scale ◽

Training Model ◽

Search Problem ◽

Word Embeddings ◽

Information Need ◽

User Interest ◽

Personalized Search ◽

Individual User ◽

User Interests ◽

Matching Score

Personalized search tailors document ranking lists for each individual user based on her interests and query intent to better satisfy the user’s information need. Many personalized search models have been proposed. They first build a user interest profile from the user’s search history, and then re-rank the documents based on the personalized matching scores between the created profile and candidate documents. In this article, we attempt to solve the personalized search problem from an alternative perspective of clarifying the user’s intention of the current query. We know that there are many ambiguous words in natural language such as “Apple.” People with different knowledge backgrounds and interests have personalized understandings of these words. Therefore, we propose a personalized search model with personal word embeddings for each individual user that mainly contain the word meanings that the user already knows and can reflect the user interests. To learn great personal word embeddings, we design a pre-training model that captures both the textual information of the query log and the information about user interests contained in the click-through data represented as a graph structure. With personal word embeddings, we obtain the personalized word and context-aware representations of the query and documents. Furthermore, we also employ the current session as the short-term search context to dynamically disambiguate the current query. Finally, we use a matching model to calculate the matching score between the personalized query and document representations for ranking. Experimental results on two large-scale query logs show that our designed model significantly outperforms state-of-the-art personalization models.
