IEEE standard upper ontology: a progress report

2002 ◽  
Vol 17 (1) ◽  
pp. 65-70 ◽  
Author(s):  
ADAM PEASE ◽  
IAN NILES

The IEEE Standard Upper Ontology (IEEE, 2001) is an effort to create a large, general-purpose, formal ontology. The ontology will be an open standard that can be reused for both academic and commercial purposes without fee, and it will be designed to support additional domain-specific ontologies. The effort is targeted for use in automated inference, semantic interoperability between heterogeneous information systems and natural language processing applications. The effort was begun in May 2000 with an e-mail discussion list, and since then there have been over 6000 e-mail messages among 170 subscribers. These subscribers include representatives from government, academia and industry in various countries. The effort was officially approved as an IEEE standards project in December 2000. Recently a successful workshop was held at IJCAI 2001 to discuss progress and proposals for this project (IJCAI, 2001).

2021 ◽  
Author(s):  
Huseyin Denli ◽  
Hassan A Chughtai ◽  
Brian Hughes ◽  
Robert Gistri ◽  
Peng Xu

Abstract Deep learning has recently been providing step-change capabilities, particularly using transformer models, for natural language processing applications such as question answering, query-based summarization, and language translation in general-purpose contexts. We have developed a geoscience-specific language processing solution using such models to enable geoscientists to perform rapid, fully quantitative and automated analysis of large corpora of data and gain insights. One of the key transformer-based models is BERT (Bidirectional Encoder Representations from Transformers). It is trained with a large amount of general-purpose text (e.g., Common Crawl). Using such a model for geoscience applications faces a number of challenges. One is the sparse presence of geoscience-specific vocabulary in general-purpose text (e.g., everyday language), and another is geoscience jargon (domain-specific meanings of common words). For example, "salt" is more likely to refer to table salt in everyday language, but it denotes a subsurface entity in geosciences. To alleviate these challenges, we retrained a pre-trained BERT model with our 20M internal geoscientific records. We refer to the retrained model as GeoBERT. We fine-tuned the GeoBERT model for a number of tasks, including geoscience question answering and query-based summarization. BERT models are very large; for example, BERT-Large has 340M trained parameters. Geoscience language processing with these models, including GeoBERT, could incur substantial latency if the entire database were processed at every call of the model. To address this challenge, we developed a retriever-reader engine consisting of an embedding-based similarity search as a context-retrieval step, which narrows the context for a given query before processing it with GeoBERT. We built a solution integrating the context-retrieval and GeoBERT models. Benchmarks show that it effectively helps geologists identify answers and context for given questions. The prototype also produces summaries at different levels of granularity for a given set of documents. We have also demonstrated that the domain-specific GeoBERT outperforms general-purpose BERT for geoscience applications.
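As a rough illustration of the retriever-reader pattern described above (not the authors' implementation), the sketch below narrows the context with an embedding-based similarity search and then runs an extractive question-answering model over the retrieved passages. The encoder, QA model, and toy corpus are stand-in assumptions, since GeoBERT itself is not publicly available.

```python
# Minimal retriever-reader sketch (not the authors' code). A public QA model
# stands in for GeoBERT, and the document list is a toy corpus.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

documents = [
    "The salt dome in the survey area deforms the overlying strata.",
    "Porosity in the reservoir sandstone ranges from 15 to 25 percent.",
]

# Retrieval step: embed documents once, then narrow the context per query.
retriever = SentenceTransformer("all-MiniLM-L6-v2")   # assumed stand-in encoder
doc_vecs = retriever.encode(documents, normalize_embeddings=True)

# Reader step: an extractive QA model (stand-in for the fine-tuned GeoBERT).
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def answer(question: str, top_k: int = 1):
    q_vec = retriever.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec            # cosine similarity (vectors are normalized)
    best = np.argsort(-scores)[:top_k]   # indices of the most similar documents
    context = " ".join(documents[i] for i in best)
    return reader(question=question, context=context)

print(answer("What is the porosity of the reservoir sandstone?"))
```

Restricting the reader to the retrieved context is what keeps latency bounded: the expensive transformer only sees a handful of passages instead of the whole database.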


Author(s):  
Valerie Cross ◽  
Vishal Bathija

Abstract Ontologies are an emerging means of knowledge representation to improve information organization and management, and they are becoming more prevalent in the domain of engineering design. The task of creating new ontologies manually is not only tedious and cumbersome but also time consuming and expensive. Research aimed at addressing these problems in creating ontologies has investigated methods of automating ontology reuse, mainly by extracting smaller application ontologies from larger, more general-purpose ontologies. Motivated by the wide variety of existing learning algorithms, this paper describes a new approach focused on the reuse of domain-specific ontologies. The approach integrates existing software tools for natural language processing with new algorithms for pruning concepts not relevant to the new domain and extending the pruned ontology by adding relevant concepts. The approach is assessed experimentally by automatically adapting a design rationale ontology for the software engineering domain to a new one for the related domain of engineering design. The experiment produced an ontology that exhibits comparable quality to previous attempts to automate ontology creation, as measured by standard content performance metrics such as coverage, accuracy, precision, and recall. However, further analysis of the ontology suggests that the automated approach should be augmented with recommendations presented to a domain expert who monitors the pruning and extending processes in order to improve the structure of the ontology.
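A toy sketch of the pruning idea only (not the paper's algorithm): a concept is kept if its label is attested in a target-domain corpus, and the children of a dropped concept are promoted to its nearest kept ancestor. The concept hierarchy, corpus, and frequency threshold below are all illustrative.

```python
# Toy prune step: keep a concept if its label appears often enough in a
# target-domain corpus, otherwise drop it and promote its children.
from collections import Counter

domain_corpus = "design rationale of the gear assembly and tolerance analysis of the shaft".split()
term_freq = Counter(domain_corpus)

# parent -> children edges of the source ontology (illustrative fragment)
ontology = {
    "artifact": ["software_module", "gear", "shaft"],
    "software_module": [],
    "gear": [],
    "shaft": [],
}

def prune(onto, root, min_freq=1):
    kept = {root: []}
    def visit(node, kept_parent):
        for child in onto.get(node, []):
            if term_freq[child] >= min_freq:      # relevant to the new domain: keep
                kept[child] = []
                kept[kept_parent].append(child)
                visit(child, child)
            else:                                  # irrelevant: skip, promote its children
                visit(child, kept_parent)
    visit(root, root)
    return kept

print(prune(ontology, "artifact"))  # {'artifact': ['gear', 'shaft'], 'gear': [], 'shaft': []}
```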


Author(s):  
SungKu Kang ◽  
Lalit Patil ◽  
Arvind Rangarajan ◽  
Abha Moitra ◽  
Tao Jia ◽  
...  

Formal ontology and rule-based approaches founded on semantic technologies have been proposed as powerful mechanisms to enable early manufacturability feedback. A fundamental unresolved problem in this context is that all manufacturing knowledge is encoded in unstructured text and there are no reliable methods to automatically convert it to formal ontologies and rules. It is impractical for engineers to write accurate domain rules in structured semantic languages such as the Web Ontology Language (OWL) or the Semantic Application Design Language (SADL). Previous efforts in manufacturing research that have targeted extraction of OWL ontologies from text have focused on basic concept names and hierarchies. This paper presents a semantics-based framework for acquiring more complex manufacturing knowledge, primarily rules, in a semantically usable form from unstructured English text such as that found in manufacturing handbooks. The approach starts with existing domain knowledge in the form of OWL ontologies and applies natural language processing techniques to extract dependencies between the words in the text that contains the rule. Domain-specific triples capturing each rule are then extracted from each dependency graph. Finally, new computer-interpretable rules are composed from the triples. The feasibility of the framework has been evaluated by automatically and accurately generating manufacturability rules from a manufacturing handbook. The paper also documents the cases that produce ambiguous results. Analysis of the results shows that the proposed framework can be extended to extract domain ontologies, which forms part of ongoing work that also focuses on automating different steps and improving the reliability of the system.
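A rough sketch of the dependency-to-triple step under stated assumptions: it uses spaCy's off-the-shelf parser (not the ontology-guided pipeline from the paper) and a naive subject-verb-object pattern, which is far simpler than what real manufacturing rules require.

```python
# Extract (subject, relation, object) triples from a parsed sentence.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_triples(sentence):
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

# A handbook-style rule sentence; expected to yield something like
# [('corners', 'increase', 'cost')]
print(extract_triples("Sharp internal corners increase machining cost."))
```

In the paper's framework the extracted triples would then be mapped against the existing OWL ontology before being composed into computer-interpretable rules; that mapping step is omitted here.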


Author(s):  
Nicolás José Fernández-Martínez ◽  
Carlos Periñán-Pascual

Location-based systems require rich geospatial data in emergency and crisis-related situations (e.g. earthquakes, floods, terrorist attacks, car accidents or pandemics) for the geolocation of not only a given incident but also the affected places and people in need of immediate help, which could potentially save lives and prevent further damage to urban or environmental areas. Given the sparsity of geotagged tweets, geospatial data must be obtained from the locative references mentioned in textual data such as tweets. In this context, we introduce nLORE (neural LOcative Reference Extractor), a deep-learning system that detects locative references in English tweets by making use of the linguistic knowledge provided by LORE. nLORE, which captures fine-grained complex locative references of any type, outperforms not only LORE, but also well-known general-purpose and domain-specific off-the-shelf entity-recognizer systems, both qualitatively and quantitatively. However, LORE shows much better runtime efficiency, which is especially important in emergency and crisis-related scenarios that demand quick intervention to send first responders to affected areas and people. This highlights the often undervalued yet very important role of rule-based models in natural language processing for real-life and real-time scenarios.
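For illustration only, here is a tiny pattern-based matcher in the spirit of rule-based extractors such as LORE; the single trigger-plus-capitalized-phrase rule is a made-up stand-in, not LORE's actual rule set, but it shows why such systems are cheap to run at tweet scale.

```python
# Toy rule-based matcher for locative references in tweets (illustrative rule only).
import re

# a locative trigger (preposition) followed by one or more capitalized tokens
LOCATIVE = re.compile(r"\b(?:in|at|near|from|to)\s+((?:[A-Z][\w'-]+\s?)+)")

def locative_references(tweet: str):
    return [m.group(1).strip() for m in LOCATIVE.finditer(tweet)]

print(locative_references("Flooding reported near Main Street in San Juan, stay safe"))
# -> ['Main Street', 'San Juan']
```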


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Pilar López-Úbeda ◽  
Alexandra Pomares-Quimbaya ◽  
Manuel Carlos Díaz-Galiano ◽  
Stefan Schulz

Abstract Background: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain, such as the UMLS Metathesaurus or SNOMED CT, are widely used for this purpose, but they have limitations such as the lexical ambiguity of clinical terms. However, most such terms are unambiguous within text limited to a given clinical specialty. This is one rationale, among others, for classifying clinical text by the clinical specialty to which it belongs. Results: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical sub-domains. We use PubMed queries that generate sub-domain-specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed. The generated term set, called the Spanish core vocabulary about clinical specialties (SCOVACLIS), was made available to the scientific community and used in a text classification problem, obtaining improvements of 6 percentage points in F-measure compared to the baseline using a Multilayer Perceptron, thus supporting the hypothesis that a specialized term set improves NLP tasks. Conclusion: The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity when compared to a specialty-independent and broad-scope vocabulary.
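A schematic of the weighting idea only (not the SCOVACLIS pipeline): candidate unigrams are scored per clinical sub-domain by within-corpus frequency and by how concentrated they are in that sub-domain relative to the others. The corpora, tokenization, and formulas below are simplified assumptions.

```python
# Weight candidate terms per sub-domain by relevance and discriminatory power.
from collections import Counter

subdomain_corpora = {
    "cardiologia": ["insuficiencia cardiaca y fibrilacion auricular",
                    "fibrilacion auricular persistente"],
    "neurologia":  ["cefalea tensional cronica",
                    "crisis epilepticas focales"],
}

def unigram_counts(texts):
    return Counter(w for t in texts for w in t.split())

counts = {d: unigram_counts(texts) for d, texts in subdomain_corpora.items()}
totals = {d: sum(c.values()) for d, c in counts.items()}

def term_weights(domain):
    weights = {}
    for term, freq in counts[domain].items():
        relevance = freq / totals[domain]                # frequency within the sub-domain
        elsewhere = sum(counts[d][term] for d in counts if d != domain)
        discriminatory = freq / (freq + elsewhere)       # 1.0 if the term only occurs here
        weights[term] = relevance * discriminatory
    return sorted(weights.items(), key=lambda kv: -kv[1])

print(term_weights("cardiologia")[:3])  # e.g. 'fibrilacion' and 'auricular' rank first
```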


2021 ◽  
Vol 30 (6) ◽  
pp. 526-534
Author(s):  
Evelina Fedorenko ◽  
Cory Shain

Understanding language requires applying cognitive operations (e.g., memory retrieval, prediction, structure building) that are relevant across many cognitive domains to specialized knowledge structures (e.g., a particular language’s lexicon and syntax). Are these computations carried out by domain-general circuits or by circuits that store domain-specific representations? Recent work has characterized the roles in language comprehension of the language network, which is selective for high-level language processing, and the multiple-demand (MD) network, which has been implicated in executive functions and linked to fluid intelligence and thus is a prime candidate for implementing computations that support information processing across domains. The language network responds robustly to diverse aspects of comprehension, but the MD network shows no sensitivity to linguistic variables. We therefore argue that the MD network does not play a core role in language comprehension and that past findings suggesting the contrary are likely due to methodological artifacts. Although future studies may reveal some aspects of language comprehension that require the MD network, evidence to date suggests that those will not be related to core linguistic processes such as lexical access or composition. The finding that the circuits that store linguistic knowledge carry out computations on those representations aligns with general arguments against the separation of memory and computation in the mind and brain.


2019 ◽  
Vol 1 (3) ◽  
Author(s):  
A. Aziz Altowayan ◽  
Lixin Tao

We consider the following problem: given neural language models (embeddings), each of which is trained on an unknown data set, how can we determine which model would provide a better result when used for feature representation in a downstream task such as text classification or entity recognition? In this paper, we assess the word similarity measure by analyzing its impact on word embeddings learned from various datasets and how they perform in a simple classification task. Word representations were learned and assessed under the same conditions. For training word vectors, we used the implementation of Continuous Bag of Words described in [1]. To assess the quality of the vectors, we applied the analogy-questions test for word similarity described in the same paper. Further, to measure the retrieval rate of an embedding model, we introduced a new metric (Average Retrieval Error), which measures the percentage of missing words in the model. We observe that high accuracy on syntactic and semantic similarities between word pairs is not an indicator of better classification results. This observation can be justified by the fact that a domain-specific corpus contributes more to performance than a general-purpose corpus. For reproducibility, we release our experiment scripts and results.
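A hedged sketch of the evaluation idea: CBOW vectors trained with gensim, plus a metric in the spirit of the Average Retrieval Error, read here as the percentage of a task vocabulary missing from the embedding model. The corpus, vocabulary, and exact formula are my assumptions, not the paper's released code.

```python
# Train CBOW word vectors and compute a missing-word rate over a task vocabulary.
from gensim.models import Word2Vec

corpus = [
    ["the", "patient", "was", "given", "aspirin"],
    ["aspirin", "reduces", "fever", "and", "pain"],
]

# sg=0 selects the Continuous Bag of Words training mode
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0, epochs=20)

def average_retrieval_error(task_vocab, wv):
    missing = [w for w in task_vocab if w not in wv.key_to_index]
    return 100.0 * len(missing) / len(task_vocab)

task_vocab = ["aspirin", "fever", "ibuprofen", "dosage"]
print(f"ARE: {average_retrieval_error(task_vocab, model.wv):.1f}%")  # 50.0% for this toy setup
```

The point the abstract makes follows naturally: a model can score well on analogy-style similarity yet still miss many domain terms, and it is the latter that tends to hurt downstream classification.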


Author(s):  
Emrah Inan ◽  
Vahab Mostafapour ◽  
Fatif Tekbacak

The Web enables the retrieval of concise information about specific entities, including people, organizations and movies, together with their features. However, a large amount of Web content is unstructured, which makes it difficult to find critical information about specific entities. Text analysis approaches such as Named Entity Recognition and Entity Linking aim to identify entities and link them to the corresponding entries in a given knowledge base. To evaluate these approaches, a vast number of general-purpose benchmark datasets are available. However, it is difficult to evaluate domain-specific approaches due to the lack of evaluation datasets for specific domains. This study presents WeDGeM, a multilingual evaluation set generator for specific domains that exploits Wikipedia category pages and the DBpedia hierarchy. Wikipedia disambiguation pages are also used to adjust the ambiguity level of the generated texts. Based on the generated test data, well-known Entity Linking systems supporting Turkish texts are evaluated in a movie-domain use case.
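This is not WeDGeM itself, but a minimal example of the kind of DBpedia query such a generator could build on, pulling candidate gold entities for an (illustrative) movie-domain Wikipedia category via SPARQLWrapper.

```python
# Fetch entities under a Wikipedia category from the public DBpedia endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?film WHERE {
        ?film dct:subject <http://dbpedia.org/resource/Category:Turkish_films> .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["film"]["value"])   # entity URIs usable as gold links in a test set
```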

