Mapping the semantic organization of the English odor vocabulary using natural language data

2020 ◽  
Author(s):  
Thomas Hörberg ◽  
Maria Larsson ◽  
Jonas Olofsson

Olfactory experiences are hard to verbalize, partly because most languages lack dedicated odor vocabularies. A standardized odor vocabulary is needed, yet no descriptive system covering the full range of odor experiences has been agreed upon. Many studies of the English odor vocabulary have relied on perceptual data such as odor-descriptor ratings, and are therefore limited to a small set of pre-selected descriptors. In the present study, we present a data-driven approach that automatically identifies odor descriptors in English and then derives their semantic organization from their distributions in natural texts. Olfactory descriptors are automatically identified by their degree of olfactory association, and their semantic organization is derived with a distributional-semantic word embedding model. We identify and derive the semantic organization of the descriptors most frequently used to describe odors and flavors in English, both within and across source-based, abstract, and evaluative descriptor categories. Our method largely captures semantic differences between descriptors related to aroma and flavor qualities rather than, for example, functional or linguistic aspects: it primarily differentiates descriptors with respect to valence and edibility, and the semantic space it derives is qualitatively similar to a space derived from perceptual data.
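The distributional-semantic step can be illustrated with a minimal sketch: descriptors that occur in similar textual contexts receive similar vectors, and cosine similarity then places them near each other in the derived semantic space. The vectors below are invented toy values, not the study's actual embeddings.

```python
import numpy as np

# Toy distributional vectors for a few odor descriptors (hypothetical values;
# in practice these would come from a word embedding model trained on corpora).
vectors = {
    "fruity": np.array([0.9, 0.8, 0.1]),
    "sweet":  np.array([0.8, 0.9, 0.2]),
    "rotten": np.array([0.1, 0.0, 0.9]),
}

def cosine(a, b):
    """Cosine similarity, the standard proximity metric in distributional semantics."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Descriptors with similar distributions in text end up close in semantic space;
# here the pleasant descriptors cluster, separated from "rotten" along valence.
sim_pleasant = cosine(vectors["fruity"], vectors["sweet"])
sim_valence = cosine(vectors["fruity"], vectors["rotten"])
```

With these toy vectors, `sim_pleasant` exceeds `sim_valence`, mirroring how valence can emerge as a primary axis of the space.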

2016 ◽  
Author(s):  
Geoffrey Fouad ◽  
André Skupin ◽  
Christina L. Tague

Abstract. Percentile flows are statistics derived from the flow duration curve (FDC) that describe the flow equaled or exceeded for a given percent of time. These statistics provide important information for managing rivers, but are often unavailable since most basins are ungauged. A common approach for predicting percentile flows is to deploy regional regression models based on gauged percentile flows and related independent variables derived from physical and climatic data. The first step of this process identifies groups of basins through a cluster analysis of the independent variables, followed by the development of a regression model for each group. This entire process hinges on the independent variables selected to summarize the physical and climatic state of basins. Distributed physical and climatic datasets now exist for the contiguous United States (US). However, it remains unclear how to best represent these data for the development of regional regression models. The study presented here developed regional regression models for the contiguous US, and evaluated the effect of different approaches for selecting the initial set of independent variables on the predictive performance of the regional regression models. An expert assessment of the dominant controls on the FDC was used to identify a small set of independent variables likely related to percentile flows. A data-driven approach was also applied to evaluate two larger sets of variables that consist of either (1) the averages of data for each basin or (2) both the averages and statistical distribution of basin data distributed in space and time. The small set of variables from the expert assessment of the FDC and two larger sets of variables for the data-driven approach were each applied for a regional regression procedure. Differences in predictive performance were evaluated using 184 validation basins withheld from regression model development. 
The small set of independent variables selected through expert assessment performed as well as, if not better than, the two larger sets of variables. The parsimonious set consisted only of mean annual precipitation, potential evapotranspiration, and baseflow index; additional variables in the two larger sets added little to no predictive information. Regional regression models based on the parsimonious set of variables were developed using 734 calibration basins, and were converted into a tool for predicting 13 percentile flows in the contiguous US. The Supplementary Material for this paper includes an R graphical user interface for predicting the percentile flows of basins within the range of conditions used to calibrate the regression models. The equations and performance statistics of the models are also supplied in tabular form.
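The percentile-flow statistic itself is straightforward to compute from a flow record: the flow equaled or exceeded p percent of the time is the (100 − p)th ordinary percentile of the flows. A minimal sketch with synthetic data (the flow series is invented for illustration):

```python
import numpy as np

# Hypothetical daily streamflow record (m^3/s), ~10 years of lognormal flows.
rng = np.random.default_rng(0)
flows = rng.lognormal(mean=1.0, sigma=0.8, size=3650)

def percentile_flow(q, p):
    """Flow equaled or exceeded p percent of the time, read off the FDC.
    An exceedance percentile p maps to the (100 - p)th ordinary percentile."""
    return np.percentile(q, 100 - p)

q95 = percentile_flow(flows, 95)  # low flow, exceeded 95% of the time
q5 = percentile_flow(flows, 5)    # high flow, exceeded only 5% of the time
```

By construction, Q95 falls below the median flow and Q5 above it, which is the ordering a flow duration curve encodes.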


2020 ◽  
Vol 12 (1) ◽  
pp. 182-202 ◽  
Author(s):  
BILL THOMPSON ◽  
MARCUS PERLMAN ◽  
GARY LUPYAN ◽  
ZED SEVCIKOVA SEHYR ◽  
KAREN EMMOREY

Abstract. A growing body of research shows that both signed and spoken languages display regular patterns of iconicity in their vocabularies. We compared iconicity in the lexicons of American Sign Language (ASL) and English by combining previously collected ratings of ASL signs (Caselli, Sevcikova Sehyr, Cohen-Goldberg, & Emmorey, 2017) and English words (Winter, Perlman, Perry, & Lupyan, 2017) with the use of data-driven semantic vectors derived from English. Our analyses show that models of spoken language lexical semantics drawn from large text corpora can be useful for predicting the iconicity of signs as well as words. Compared to English, ASL has a greater number of regions of semantic space with concentrations of highly iconic vocabulary. There was an overall negative relationship between semantic density and the iconicity of both English words and ASL signs. This negative relationship disappeared for highly iconic signs, suggesting that iconic forms may be more easily discriminable in ASL than in English. Our findings contribute to an increasingly detailed picture of how iconicity is distributed across different languages.
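Semantic density can be operationalized as a word's mean cosine similarity to its nearest neighbours in an embedding space. A minimal sketch with randomly generated vectors (the words and vectors are illustrative placeholders, not the study's data):

```python
import numpy as np

# Hypothetical embedding vectors for a handful of English words; real analyses
# would use data-driven vectors learned from a large text corpus.
rng = np.random.default_rng(42)
words = ["dog", "cat", "wolf", "idea", "theory"]
emb = {w: rng.normal(size=50) for w in words}

def semantic_density(word, k=3):
    """Mean cosine similarity of a word to its k nearest neighbours:
    a proxy for how crowded its region of semantic space is."""
    v = emb[word]
    sims = []
    for other, u in emb.items():
        if other == word:
            continue
        sims.append(float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))))
    return float(np.mean(sorted(sims, reverse=True)[:k]))

densities = {w: semantic_density(w) for w in words}
```

A density estimate of this kind is what the iconicity ratings would be regressed against to test the reported negative relationship.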


2021 ◽  
Author(s):  
Russell J Jarvis ◽  
Patrick M. McGurrin ◽  
Rebecca Featherston ◽  
Marc Skov Madsen ◽  
Shivam Bansal ◽  
...  

Here we present a new text analysis tool that consists of a text analysis service and an author search service. These services were created by using or extending many existing free and open-source tools, including Streamlit, Requests, WordCloud, TextStat, and the Natural Language Toolkit (NLTK). The tool can retrieve journal hosting links and journal article content from APIs and journal hosting websites. Together, these services allow the user to review the complexity of a scientist's published work relative to other online text repositories. Rather than providing feedback on the complexity of a single text, as previous tools have done, the tool presented here shows the relative complexity across many texts from the same author, while also comparing the readability of the author's body of work to a variety of other scientific and lay text types. The goal of this work is to apply a data-driven approach that provides established academic authors with statistical insights into their body of published peer-reviewed work. By monitoring these readability metrics, scientists may be able to tailor their writing to reach broader audiences, contributing to improved global communication and understanding of complex topics.
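Readability metrics like those reported by TextStat can be sketched in a few lines: the Flesch Reading Ease formula combines average sentence length with syllables per word. The syllable counter below is a deliberately naive stand-in for TextStat's heuristic, and the sample texts are invented:

```python
import re

def naive_syllables(word):
    """Very rough syllable count: runs of vowels (TextStat uses a better heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(naive_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

simple = "The cat sat. The dog ran."
dense = "Multidimensional readability quantification necessitates lexicostatistical analysis."
```

Scoring many texts by one author this way, and comparing the distribution to reference corpora, is the kind of aggregate feedback the tool provides.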


Author(s):  
Kangqi Luo ◽  
Xusheng Luo ◽  
Xianyang Chen ◽  
Kenny Q. Zhu

This paper studies the problem of discovering structured knowledge representations of binary natural language relations. The representation, known as a schema, generalizes the traditional predicate path to support more complex semantics. We present a search algorithm that generates schemas over a knowledge base, and propose a data-driven learning approach to discover the most suitable representation for a given relation. Evaluation results show that the inferred schemas represent precise semantics and can be used to enrich manually crafted knowledge bases.


Author(s):  
Sena Assaf ◽  
Mohamad Awada ◽  
Issam Srour

2020 ◽  
pp. 3-17
Author(s):  
Peter Nabende

Natural Language Processing for under-resourced languages is now a mainstream research area. However, there are limited studies of Natural Language Processing applications for many indigenous East African languages. As a contribution toward closing this knowledge gap, this paper evaluates the application of well-established machine translation methods to one heavily under-resourced indigenous East African language, Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba, including both rule-based and data-driven methods. We then apply a state-of-the-art data-driven machine translation method to learn models for automating translation between Lumasaaba and English using a very limited data set of parallel sentences. Automatic evaluation results show that a transformer-based Neural Machine Translation model architecture leads to consistently better BLEU scores than recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and usually correspond to the source-language input.
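BLEU, the automatic metric reported above, scores a candidate translation by clipped n-gram precision against a reference, combined with a brevity penalty. A simplified single-reference sketch with invented example sentences (real evaluations use libraries such as NLTK or sacreBLEU):

```python
import math
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core quantity in BLEU."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions with a brevity penalty (BLEU sketch)."""
    precisions = [modified_ngram_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat is on the mat".split()
good = "the cat is on the mat".split()
bad = "mat the on cat".split()
```

A perfect match scores 1.0, while a scrambled candidate with no matching bigrams scores 0.0; word-order sensitivity is what makes higher-order n-grams informative for comparing model architectures.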


2012 ◽  
Author(s):  
Michael Ghil ◽  
Mickael D. Chekroun ◽  
Dmitri Kondrashov ◽  
Michael K. Tippett ◽  
Andrew Robertson ◽  
...  
