Extracting interrogative intents and concepts from geo-analytic questions

2020, Vol. 1, pp. 1-21
Author(s):  
Haiqi Xu ◽  
Ehsan Hamzei ◽  
Enkhbold Nyamsuren ◽  
Han Kruiger ◽  
Stephan Winter ◽  
...  

Abstract. Understanding the syntactic and semantic structure of geographic questions is a necessary step towards true geographic question-answering (GeoQA) machines. The empirical basis for understanding the capabilities expected from GeoQA systems is geographic question corpora. Available corpora in English have mostly been drawn from generic Web search logs or limited user studies, supporting the focus of GeoQA systems on retrieving factoids: factual knowledge about particular places and everyday processes. Yet the majority of questions asked in the spatial sciences go beyond simple place facts, with more complex analytical intents informing the questions. In this paper, we introduce a new corpus of geo-analytic questions drawn from English textbooks and scientific articles. We analyse and compare this corpus with two general-purpose GeoQA corpora in terms of grammatical complexity and semantic concepts, using a new parsing method that allows us to differentiate and quantify patterns of a question's intent.
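The abstract leaves the parsing method itself to the paper, but the kind of structure it targets (interrogative intent markers, concept phrases, and place references) can be sketched with off-the-shelf tools. The snippet below is a minimal, hypothetical illustration using spaCy, not the corpus parser described above.

```python
# Minimal, hypothetical sketch (not the paper's parser): extract rough intent
# markers, concept phrases, and place names from a geographic question with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")

def rough_question_profile(question: str) -> dict:
    doc = nlp(question)
    # Interrogative tokens (what, where, how, which, ...) hint at the intent.
    intent_markers = [t.text.lower() for t in doc if t.tag_ in ("WDT", "WP", "WP$", "WRB")]
    # Noun chunks are a crude stand-in for the question's semantic concepts.
    concepts = [chunk.text for chunk in doc.noun_chunks]
    # GPE/LOC entities flag references to particular places (factoid flavour).
    places = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
    return {"intent_markers": intent_markers, "concepts": concepts, "places": places}

print(rough_question_profile(
    "Which districts have the highest flood risk within 5 km of the Rhine?"))
```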

Author(s):  
Azamat Abdoullaev

Of all possible intelligent NL applications and semantic artifacts, special value is today ascribed to building question answering (Q&A) systems with broad ontological learning (Onto Query Project, 2004), classified as open-domain Q&A knowledge systems (Question Answering, Wikipedia, 2006). This line of research is seen as an upgrade of traditional keyword query processing in database systems, endowing Web search engines with deductive answering capacities. Ideally, such a general-purpose Q&A agent should be able to cover questions (matters, subjects, topics, issues, themes) from any branch of knowledge and domain of interest by giving answers to any meaningful question, like the Digital Aristotle, "an application that will encompass much of the world's scientific knowledge and be capable of answering novel questions and advanced problem-solving" (Project Halo, 2004). The name Digital Aristotle was inspired by the scholar most admired for the depth and breadth of his understanding, whose mind ranged over ontology, physics, logic, epistemology, biology, zoology, medicine, psychology, literary theory, politics, and art.


2021, Vol. 27(6), pp. 763-778
Author(s):  
Kenneth Ward Church ◽  
Zeyu Chen ◽  
Yanjun Ma

Abstract. The previous Emerging Trends article (Church et al., 2021, Natural Language Engineering 27(5), 631–645) introduced deep nets to poets. Poets is an imperfect metaphor, intended as a gesture toward inclusion. The future of deep nets will benefit from reaching out to a broad audience of potential users, including people with little or no programming skills and little interest in training models. That paper focused on inference: using pre-trained models as-is, without fine-tuning. The goal of this paper is to make fine-tuning more accessible to a broader audience. Since fine-tuning is more challenging than inference, the examples in this paper will require modest programming skills, as well as access to a GPU. Fine-tuning starts with a general-purpose base (foundation) model and uses a small training set of labeled data to produce a model for a specific downstream application. There are many examples of fine-tuning in natural language processing (question answering (SQuAD) and the GLUE benchmark), as well as in vision and speech.
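As a concrete illustration of the workflow the abstract describes (general-purpose base model plus a small labeled set for a downstream task), the following sketch fine-tunes a BERT base model on the GLUE SST-2 task with the Hugging Face transformers and datasets libraries. The model name, task, and hyperparameters are illustrative choices, not the article's exact recipe, and a GPU is assumed.

```python
# Illustrative fine-tuning sketch: BERT base + a small labeled GLUE task (SST-2).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"          # general-purpose base (foundation) model
dataset = load_dataset("glue", "sst2")    # small labeled set for the downstream task
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

args = TrainingArguments(
    output_dir="sst2-finetuned",
    per_device_train_batch_size=16,
    num_train_epochs=1,                   # one epoch is enough for a demonstration
    learning_rate=2e-5,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
print(trainer.evaluate())
```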


2021
Author(s):  
Huseyin Denli ◽  
Hassan A Chughtai ◽  
Brian Hughes ◽  
Robert Gistri ◽  
Peng Xu

Abstract. Deep learning has recently been providing step-change capabilities, particularly through transformer models, for natural language processing applications such as question answering, query-based summarization, and language translation in a general-purpose context. We have developed a geoscience-specific language processing solution using such models to enable geoscientists to perform rapid, fully quantitative and automated analysis of large corpora of data and gain insights. One of the key transformer-based models is BERT (Bidirectional Encoder Representations from Transformers). It is trained on a large amount of general-purpose text (e.g., Common Crawl). Using such a model for geoscience applications faces a number of challenges. One is the scarce presence of geoscience-specific vocabulary in general-purpose text (e.g., everyday language); another is geoscience jargon (domain-specific meanings of words). For example, salt is more likely to be associated with table salt in everyday language, but it denotes a subsurface entity in the geosciences. To alleviate these challenges, we retrained a pre-trained BERT model on our 20M internal geoscientific records. We refer to the retrained model as GeoBERT. We fine-tuned the GeoBERT model for a number of tasks, including geoscience question answering and query-based summarization. BERT models are very large; for example, BERT-Large has 340M trained parameters. Geoscience language processing with these models, including GeoBERT, could incur substantial latency if the entire database were processed at every call of the model. To address this challenge, we developed a retriever-reader engine consisting of an embedding-based similarity search as a context-retrieval step, which narrows the context for a given query before it is processed with GeoBERT. We built a solution integrating the context-retrieval and GeoBERT models. Benchmarks show that it effectively helps geologists identify answers and context for given questions. The prototype also produces summaries at different granularities for a given set of documents. We have also demonstrated that domain-specific GeoBERT outperforms general-purpose BERT for geoscience applications.
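GeoBERT and the internal corpus are not publicly available, so the retriever-reader pattern described above can only be sketched with open stand-ins: a sentence-embedding model for the retrieval step and a general-purpose SQuAD-trained reader in place of GeoBERT. The point of the snippet is the two-stage structure, not the specific models.

```python
# Hedged retriever-reader sketch with public stand-in models (not GeoBERT).
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

documents = [
    "The salt dome forms a structural trap above the reservoir sandstone.",
    "Table salt consumption guidelines are unrelated to subsurface geology.",
    "Seismic interpretation identified a faulted anticline in the survey area.",
]

retriever = SentenceTransformer("all-MiniLM-L6-v2")
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

question = "What forms the structural trap above the reservoir?"

# Retrieval step: embed corpus and query, keep only the most similar passages
# so the (expensive) reader never has to process the full database.
doc_emb = retriever.encode(documents, convert_to_tensor=True)
q_emb = retriever.encode(question, convert_to_tensor=True)
hits = util.semantic_search(q_emb, doc_emb, top_k=2)[0]

# Reader step: run extractive QA only over the retrieved context.
for hit in hits:
    context = documents[hit["corpus_id"]]
    answer = reader(question=question, context=context)
    print(answer["answer"], round(answer["score"], 3))
```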


Author(s):  
Alfio Massimiliano Gliozzo ◽  
Aditya Kalyanpur

Automatic open-domain Question Answering has been a long-standing research challenge in the AI community. IBM Research undertook this challenge with the design of the DeepQA architecture and the implementation of Watson. This paper addresses a specific subtask of DeepQA: predicting the Lexical Answer Type (LAT) of a question. Our approach is completely unsupervised and is based on PRISMATIC, a large-scale lexical knowledge base automatically extracted from a Web corpus. Experiments on the Jeopardy! data show that it is possible to correctly predict the LAT in a substantial number of questions. This approach can be used for general-purpose knowledge acquisition tasks such as frame induction from text.
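The PRISMATIC knowledge base is not public, so the unsupervised method itself cannot be reproduced here; the toy heuristic below merely illustrates what a lexical answer type is, by reading the LAT off a dependency parse of a Jeopardy!-style clue or a wh-question. It is a hypothetical sketch, not the paper's approach.

```python
# Toy LAT heuristic (illustration only): the noun modified by "this"/"these"
# in a Jeopardy!-style clue, or the head noun of a wh-phrase in a question.
import spacy

nlp = spacy.load("en_core_web_sm")

def guess_lat(clue):
    doc = nlp(clue)
    for tok in doc:
        # Jeopardy!-style clue: "This *country* hosted the 1988 Summer Olympics"
        if tok.dep_ == "det" and tok.lower_ in ("this", "these"):
            return tok.head.lemma_
        # Question style: "Which *river* flows through Cairo?"
        if tok.tag_ in ("WDT", "WP") and tok.head.pos_ == "NOUN":
            return tok.head.lemma_
    return None

print(guess_lat("This country hosted the 1988 Summer Olympics"))  # -> country
print(guess_lat("Which river flows through Cairo?"))              # -> river
```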


Author(s):  
Xiannong Meng

This chapter surveys the various technologies involved in a Web search engine, with an emphasis on performance-analysis issues. The aspects of a general-purpose search engine covered in this survey include system architectures, information retrieval theories as the basis of Web search, indexing and ranking of Web documents, relevance feedback and machine learning, personalization, and performance measurement. The objectives of the chapter are to review the theories and technologies pertaining to Web search and to help readers understand how Web search engines work and how to use them more effectively and efficiently.
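To make the indexing-and-ranking portion of the survey concrete, here is a toy sketch of an inverted index with TF-IDF scoring. Real Web search engines add crawling, link analysis, personalization, and many more signals; this only shows the skeleton, and all names in it are illustrative.

```python
# Toy inverted index + TF-IDF scoring (illustration of the surveyed concepts).
import math
from collections import Counter, defaultdict

docs = {
    1: "web search engines index web documents",
    2: "machine learning improves ranking and relevance feedback",
    3: "personalization tailors search results to the user",
}

# Inverted index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def score(query):
    n_docs = len(docs)
    scores = Counter()
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))   # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()

print(score("web search ranking"))
```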


Author(s):  
Elmer V. Bernstam ◽  
Funda Meric-Bernstam

This chapter discusses the problem of how to evaluate online health information. The quality and accuracy of online health information is an area of increasing concern for healthcare professionals and the general public. We define relevant concepts including quality, accuracy, utility, and popularity. Most users access online health information via general-purpose search engines; therefore, we briefly review Web search-engine fundamentals. We discuss desirable characteristics for quality-assessment tools and the available evidence regarding their effectiveness and usability. We conclude with advice for healthcare consumers as they search for health information online.


Author(s):  
Kamal Al-Sabahi ◽  
Zhang Zuping

In the era of information overload, text summarization has become a focus of attention in a number of diverse fields, such as question answering systems, intelligence analysis, news recommendation systems, search results in Web search engines, and so on. A good document representation is the key to any successful summarizer, and learning this representation has become a very active research area in natural language processing (NLP). Traditional approaches mostly fail to deliver a good representation, whereas word embeddings have shown excellent performance in learning it. In this paper, a modified BM25 weighting combined with word embeddings is used to build sentence vectors from word vectors. The entire document is represented as a set of sentence vectors. Then, the similarity between every pair of sentence vectors is computed. After that, TextRank, a graph-based model, is used to rank the sentences. The summary is generated by picking the top-ranked sentences according to the compression rate. Two well-known datasets, DUC2002 and DUC2004, are used to evaluate the models. The experimental results show that the proposed models perform considerably better than state-of-the-art methods.
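A compact sketch of the described pipeline follows: sentence vectors built as BM25-weighted averages of word vectors, a cosine-similarity graph over the sentences, and PageRank (TextRank) to select the top-ranked sentences. The word vectors below are random stand-ins for pretrained embeddings, and the BM25 weighting uses the standard formula rather than the paper's specific modification.

```python
# Sketch: BM25-weighted sentence vectors + cosine-similarity graph + TextRank.
import math
import numpy as np
import networkx as nx

sentences = [
    "Text summarization condenses long documents into short summaries.",
    "Good document representation is the key to a successful summarizer.",
    "Word embeddings capture semantic similarity between words.",
    "Graph-based ranking such as TextRank scores sentences by centrality.",
]
tokenized = [s.lower().rstrip(".").split() for s in sentences]

# Toy word vectors (stand-ins for pretrained embeddings such as word2vec/GloVe).
rng = np.random.default_rng(0)
vocab = {w for sent in tokenized for w in sent}
emb = {w: rng.normal(size=50) for w in vocab}

# BM25 term weights (k1 and b are the usual defaults).
k1, b = 1.5, 0.75
n = len(tokenized)
avgdl = sum(len(s) for s in tokenized) / n
df = {w: sum(w in s for s in tokenized) for w in vocab}

def sentence_vector(sent):
    vec = np.zeros(50)
    for w in set(sent):
        tf = sent.count(w)
        idf = math.log((n - df[w] + 0.5) / (df[w] + 0.5) + 1)
        weight = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(sent) / avgdl))
        vec += weight * emb[w]              # BM25-weighted sum of word vectors
    return vec

vecs = np.array([sentence_vector(s) for s in tokenized])
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = np.clip(unit @ unit.T, 0.0, None)     # pairwise cosine similarity, keep >= 0
np.fill_diagonal(sim, 0.0)

# TextRank: PageRank over the similarity graph, then keep the top sentences.
ranks = nx.pagerank(nx.from_numpy_array(sim))
top = sorted(ranks, key=ranks.get, reverse=True)[:2]   # compression: 2 sentences
print([sentences[i] for i in sorted(top)])
```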

