Trade-Off for Space Mechanisms Actuator Technology via a General Purpose Language and Domain Specific Simulations Framework

We consider the following problem: given neural language models (embeddings) each of which is trained on an unknown data set, how can we determine which model would provide a better result when used for feature representation in a downstream task such as text classification or entity recognition? In this paper, we assess the word similarity measure through analyzing its impact on word embeddings learned from various datasets and how they perform in a simple classification task. Word representations were learned and assessed under the same conditions. For training word vectors, we used the implementation of Continuous Bag of Words described in [1]. To assess the quality of the vectors, we applied the analogy questions test for word similarity described in the same paper. Further, to measure the retrieval rate of an embedding model, we introduced a new metric (Average Retrieval Error) which measures the percentage of missing words in the model. We observe that scoring a high accuracy of syntactic and semantic similarities between word pairs is not an indicator of better classification results. This observation can be justified by the fact that a domain-specific corpus contributes to the performance better than a general-purpose corpus. For reproducibility, we release our experiments scripts and results.

Download Full-text

Domain-specific Evaluation Dataset Generator for Multilingual Text Analysis

Journal of Intelligent Systems with Applications ◽

10.54856/jiswa.201912084 ◽

2019 ◽

pp. 140-147

Author(s):

Emrah Inan ◽

Vahab Mostafapour ◽

Fatif Tekbacak

Keyword(s):

Text Analysis ◽

General Purpose ◽

Entity Linking ◽

Named Entity ◽

Domain Specific ◽

Benchmark Datasets ◽

Concise Information ◽

Multilingual Text ◽

The Given ◽

Specific Evaluation

Web enables to retrieve concise information about specific entities including people, organizations, movies and their features. Additionally, large amount of Web resources generally lies on a unstructured form and it tackles to find critical information for specific entities. Text analysis approaches such as Named Entity Recognizer and Entity Linking aim to identify entities and link them to relevant entities in the given knowledge base. To evaluate these approaches, there are a vast amount of general purpose benchmark datasets. However, it is difficult to evaluate domain-specific approaches due to lack of evaluation datasets for specific domains. This study presents WeDGeM that is a multilingual evaluation set generator for specific domains exploiting Wikipedia category pages and DBpedia hierarchy. Also, Wikipedia disambiguation pages are used to adjust the ambiguity level of the generated texts. Based on this generated test data, a use case for well-known Entity Linking systems supporting Turkish texts are evaluated in the movie domain.

Download Full-text

Bootstrap Domain-Specific Sentiment Classifiers from Unlabeled Corpora

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00020 ◽

2018 ◽

Vol 6 ◽

pp. 269-285 ◽

Cited By ~ 3

Author(s):

Andrius Mudinas ◽

Dell Zhang ◽

Mark Levene

Keyword(s):

Supervised Learning ◽

Learning Algorithms ◽

General Purpose ◽

Machine Learning Algorithms ◽

Sentiment Classification ◽

Two Phase ◽

Transductive Learning ◽

Domain Specific ◽

Sentiment Lexicon ◽

Supervised Learning Algorithms

There is often the need to perform sentiment classification in a particular domain where no labeled document is available. Although we could make use of a general-purpose off-the-shelf sentiment classifier or a pre-built one for a different domain, the effectiveness would be inferior. In this paper, we explore the possibility of building domain-specific sentiment classifiers with unlabeled documents only. Our investigation indicates that in the word embeddings learned from the unlabeled corpus of a given domain, the distributed word representations (vectors) for opposite sentiments form distinct clusters, though those clusters are not transferable across domains. Exploiting such a clustering structure, we are able to utilize machine learning algorithms to induce a quality domain-specific sentiment lexicon from just a few typical sentiment words (“seeds”). An important finding is that simple linear model based supervised learning algorithms (such as linear SVM) can actually work better than more sophisticated semi-supervised/transductive learning algorithms which represent the state-of-the-art technique for sentiment lexicon induction. The induced lexicon could be applied directly in a lexicon-based method for sentiment classification, but a higher performance could be achieved through a two-phase bootstrapping method which uses the induced lexicon to assign positive/negative sentiment scores to unlabeled documents first, a nd t hen u ses those documents found to have clear sentiment signals as pseudo-labeled examples to train a document sentiment classifier v ia supervised learning algorithms (such as LSTM). On several benchmark datasets for document sentiment classification, our end-to-end pipelined approach which is overall unsupervised (except for a tiny set of seed words) outperforms existing unsupervised approaches and achieves an accuracy comparable to that of fully supervised approaches.

Download Full-text

Dependability Modeling and Assessment in UML-Based Software Development

The Scientific World JOURNAL ◽

10.1100/2012/614635 ◽

2012 ◽

Vol 2012 ◽

pp. 1-11 ◽

Cited By ~ 3

Author(s):

Simona Bernardi ◽

José Merseguer ◽

Dorina C. Petriu

Keyword(s):

Software Development ◽

Formal Model ◽

General Purpose ◽

Software Modeling ◽

Model Driven ◽

Domain Specific ◽

Software Model ◽

Software Models ◽

Assessment Approach ◽

System Properties

Assessment of software nonfunctional properties (NFP) is an important problem in software development. In the context of model-driven development, an emerging approach for the analysis of different NFPs consists of the following steps: (a) to extend the software models with annotations describing the NFP of interest; (b) to transform automatically the annotated software model to the formalism chosen for NFP analysis; (c) to analyze the formal model using existing solvers; (d) to assess the software based on the results and give feedback to designers. Such a modeling→analysis→assessment approach can be applied to any software modeling language, be it general purpose or domain specific. In this paper, we focus on UML-based development and on the dependability NFP, which encompasses reliability, availability, safety, integrity, and maintainability. The paper presents the profile used to extend UML with dependability information, the model transformation to generate a DSPN formal model, and the assessment of the system properties based on the DSPN results.

Download Full-text

Geoscience Language Processing for Exploration

10.2118/207766-ms ◽

2021 ◽

Author(s):

Huseyin Denli ◽

Hassan A Chughtai ◽

Brian Hughes ◽

Robert Gistri ◽

Peng Xu

Keyword(s):

Language Processing ◽

Similarity Search ◽

Question Answering ◽

Language Translation ◽

Automated Analysis ◽

General Purpose ◽

Step Change ◽

Domain Specific ◽

Specific Meaning ◽

Processing Solution

Abstract Deep learning has recently been providing step-change capabilities, particularly using transformer models, for natural language processing applications such as question answering, query-based summarization, and language translation for general-purpose context. We have developed a geoscience-specific language processing solution using such models to enable geoscientists to perform rapid, fully-quantitative and automated analysis of large corpuses of data and gain insights. One of the key transformer-based model is BERT (Bidirectional Encoder Representations from Transformers). It is trained with a large amount of general-purpose text (e.g., Common Crawl). Use of such a model for geoscience applications can face a number of challenges. One is due to the insignificant presence of geoscience-specific vocabulary in general-purpose context (e.g. daily language) and the other one is due to the geoscience jargon (domain-specific meaning of words). For example, salt is more likely to be associated with table salt within a daily language but it is used as a subsurface entity within geosciences. To elevate such challenges, we retrained a pre-trained BERT model with our 20M internal geoscientific records. We will refer the retrained model as GeoBERT. We fine-tuned the GeoBERT model for a number of tasks including geoscience question answering and query-based summarization. BERT models are very large in size. For example, BERT-Large has 340M trained parameters. Geoscience language processing with these models, including GeoBERT, could result in a substantial latency when all database is processed at every call of the model. To address this challenge, we developed a retriever-reader engine consisting of an embedding-based similarity search as a context retrieval step, which helps the solution to narrow the context for a given query before processing the context with GeoBERT. We built a solution integrating context-retrieval and GeoBERT models. Benchmarks show that it is effective to help geologists to identify answers and context for given questions. The prototype will also produce a summary to different granularity for a given set of documents. We have also demonstrated that domain-specific GeoBERT outperforms general-purpose BERT for geoscience applications.

Download Full-text

Word Sense Disambiguation

Emerging Applications of Natural Language Processing ◽

10.4018/978-1-4666-2169-5.ch002 ◽

2013 ◽

pp. 22-51

Author(s):

Pushpak Bhattacharyya ◽

Mitesh Khapra

Keyword(s):

State Of The Art ◽

Word Sense Disambiguation ◽

Current Trend ◽

General Purpose ◽

Word Sense ◽

Domain Specific ◽

Knowledge Based ◽

Current State ◽

Sense Disambiguation ◽

State Of Affairs

This chapter discusses the basic concepts of Word Sense Disambiguation (WSD) and the approaches to solving this problem. Both general purpose WSD and domain specific WSD are presented. The first part of the discussion focuses on existing approaches for WSD, including knowledge-based, supervised, semi-supervised, unsupervised, hybrid, and bilingual approaches. The accuracy value for general purpose WSD as the current state of affairs seems to be pegged at around 65%. This has motivated investigations into domain specific WSD, which is the current trend in the field. In the latter part of the chapter, we present a greedy neural network inspired algorithm for domain specific WSD and compare its performance with other state-of-the-art algorithms for WSD. Our experiments suggest that for domain-specific WSD, simply selecting the most frequent sense of a word does as well as any state-of-the-art algorithm.

Download Full-text

MDA, Metamodeling and Transformation

Advances in Computer and Electrical Engineering - Model Driven Architecture for Reverse Engineering Technologies ◽

10.4018/978-1-61520-649-0.ch003 ◽

2010 ◽

pp. 34-47

Author(s):

Liliana María Favre

Keyword(s):

Programming Languages ◽

General Purpose ◽

Modeling Languages ◽

Domain Specific Languages ◽

Domain Specific ◽

Underlying Principle

MDA requires the ability to understand different languages such as general purpose languages, domain specific languages, modeling languages or programming languages. An underlying principle of MDA for integrating semantically in a unified and interoperable way such languages is using metamodeling techniques.

Download Full-text