Linking Norms, Ratings, and Relations of Words and Concepts Across Multiple Language Varieties

2020 ◽  
Author(s):  
Annika Tjuka ◽  
Robert Forkel ◽  
Johann-Mattis List

Psychologists and linguists have collected a great diversity of data on word and concept properties. In psychology, many studies accumulate norms and ratings, such as word frequencies or age of acquisition, often for a large number of words. Linguistics, on the other hand, provides valuable insights into the relations between word meanings. We present 'NoRaRe,' a collection of such data sets for norms, ratings, and relations that covers different languages. To enable comparison across these diverse data types, we established workflows that facilitate the expansion of the database. A web application allows convenient access to the data (https://digling.org/norare/). Furthermore, a software API ensures consistent data curation by providing tests to validate the data sets. The NoRaRe collection is linked to the database curated by the Concepticon project (https://concepticon.clld.org), which offers a reference catalog of unified concept sets. The link between words in the data sets and the Concepticon concept sets makes cross-linguistic comparison possible. In three case studies, we test the validity of our approach, the accuracy of our workflow, and the applicability of our database. The results indicate that the NoRaRe database can be applied to the study of word properties across multiple languages. The data can be used by psychologists and linguists to benefit from the knowledge rooted in both research disciplines.

Author(s):  
Annika Tjuka ◽  
Robert Forkel ◽  
Johann-Mattis List

Psychologists and linguists collect various data on word and concept properties. In psychology, scholars have accumulated norms and ratings for a large number of words in languages with many speakers. In linguistics, scholars have accumulated cross-linguistic information about the relations between words and concepts. Until now, however, there have been no efforts to combine information from the two fields, which would allow comparison of psychological and linguistic properties across different languages. The Database of Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts (NoRaRe) is the first attempt to close this gap. Building on a reference catalog that offers standardization of concepts used in historical and typological language comparison, it integrates data from psychology and linguistics, collected from 98 data sets, covering 65 unique properties for 40 languages. The database is curated with the help of manual, automated, and semi-automated workflows and uses a software API to control and access the data. The database is accessible via a web application, the software API, or scripting languages. In this study, we present how the database is structured, how it can be extended, and how we control the quality of the data curation process. To illustrate its application, we present three case studies that test the validity of our approach, the accuracy of our workflows, and the integrative potential of the database. Due to regular version updates, the NoRaRe database has the potential to advance research in psychology and linguistics by offering researchers an integrated perspective on both fields.
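The core linking idea, comparing properties collected for different languages through shared concept identifiers, can be sketched as a simple join. The concept IDs, words, and values below are invented for illustration and are not real Concepticon or NoRaRe data.

```python
# Toy sketch: joining two word-property data sets through shared concept IDs,
# in the spirit of NoRaRe's Concepticon linking. All IDs and values invented.

english_frequency = {
    1234: ("hand", 412.0),   # concept_id -> (English word, frequency per million)
    2057: ("fire", 198.5),
    880:  ("tree", 301.2),
}

german_aoa = {
    1234: ("Hand", 2.4),     # concept_id -> (German word, age of acquisition in years)
    880:  ("Baum", 3.1),
    999:  ("Stern", 4.0),
}

def link_by_concept(a, b):
    """Return rows for concepts present in both data sets."""
    shared = sorted(a.keys() & b.keys())
    return [(cid, a[cid], b[cid]) for cid in shared]

linked = link_by_concept(english_frequency, german_aoa)
for cid, (en, freq), (de, aoa) in linked:
    print(cid, en, freq, de, aoa)
```

Once words in each data set are mapped to a shared concept set, any pair of properties (here, a frequency norm and an age-of-acquisition rating) becomes comparable across languages.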


2020 ◽  
pp. 958-971
Author(s):  
Marcel Ramos ◽  
Ludwig Geistlinger ◽  
Sehyun Oh ◽  
Lucas Schiffer ◽  
Rimsha Azhar ◽  
...  

PURPOSE Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.
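The coordination problem that MultiAssayExperiment solves can be illustrated in a language-neutral way. The Python sketch below mimics only the core behavior, keeping several assays aligned with one clinical table when subsetting by patient; the names and values are toy stand-ins, not the Bioconductor API.

```python
# Minimal sketch of multiomic coordination: several assays, each covering a
# (possibly different) subset of patients, kept consistent with one clinical
# table. All identifiers and values are illustrative.

clinical = {                      # patient_id -> clinicopathological data
    "P1": {"stage": "II"},
    "P2": {"stage": "III"},
    "P3": {"stage": "I"},
}

assays = {
    "RNA-seq":     {"P1": [5.2, 0.1], "P2": [4.8, 0.3]},
    "Methylation": {"P2": [0.91],     "P3": [0.47]},
}

def subset_by_patients(assays, clinical, patients):
    """Subset every assay and the clinical table to the same patients,
    silently dropping patients absent from a given assay."""
    keep = set(patients) & set(clinical)
    new_assays = {
        name: {p: v for p, v in data.items() if p in keep}
        for name, data in assays.items()
    }
    new_clinical = {p: clinical[p] for p in keep}
    return new_assays, new_clinical

sub_assays, sub_clinical = subset_by_patients(assays, clinical, ["P2", "P3"])
```

Keeping the subsetting logic in one place is what removes the "error prone" bookkeeping the abstract mentions: every analysis sees assays and clinical data that are guaranteed to refer to the same patients.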


2020 ◽  
Author(s):  
Mark Naylor ◽  
Kirsty Bayliss ◽  
Finn Lindgren ◽  
Francesco Serafini ◽  
Ian Main

Many earthquake forecasting approaches have developed bespoke codes to model and forecast the spatio-temporal evolution of seismicity. At the same time, the statistics community has been working on a range of point-process modelling codes. For example, motivated by ecological applications, inlabru models spatio-temporal point processes as a log-Gaussian Cox process and is implemented in R. Here we present an initial implementation of inlabru to model seismicity. This fully Bayesian approach is computationally efficient because it uses an integrated nested Laplace approximation: posteriors are assumed to be Gaussian, so their means and standard deviations can be estimated deterministically rather than constructed through sampling. Further, building on existing R packages for handling spatial data, it can construct covariate maps from diverse data types, such as fault maps, in an intuitive and simple manner.

Here we present an initial application to the California earthquake catalogue to determine the relative performance of different data sets for describing the spatio-temporal evolution of seismicity.
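For readers unfamiliar with the model class, a log-Gaussian Cox process represents the point-process intensity as the exponential of a latent Gaussian field. In a generic form (not necessarily the authors' exact parameterization):

```latex
\lambda(s, t) = \exp\Big( \beta_0 + \sum_{j} \beta_j \, x_j(s) + \xi(s, t) \Big)
```

Here $\lambda(s,t)$ is the seismicity intensity, the $x_j(s)$ are spatial covariates (for example, distance to mapped faults), and $\xi(s,t)$ is a Gaussian random field. Because the latent structure is Gaussian, the nested Laplace approximation yields approximately Gaussian posteriors whose means and standard deviations can be computed deterministically, as described above.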


2020 ◽  
Vol 45 (4) ◽  
pp. 737-763 ◽  
Author(s):  
Anirban Laha ◽  
Parag Jain ◽  
Abhijit Mishra ◽  
Karthik Sankaranarayanan

We present a framework for generating natural language descriptions from structured data such as tables; the problem falls under the category of data-to-text natural language generation (NLG). Modern data-to-text NLG systems typically use end-to-end statistical and neural architectures that learn from a limited amount of task-specific labeled data, and therefore exhibit limited scalability, domain-adaptability, and interpretability. Unlike these systems, ours is a modular, pipeline-based approach, and does not require task-specific parallel data. Rather, it relies on monolingual corpora and basic off-the-shelf NLP tools. This makes our system more scalable and easily adaptable to newer domains. Our system utilizes a three-stage pipeline that: (i) converts entries in the structured data to canonical form, (ii) generates simple sentences for each atomic entry in the canonicalized representation, and (iii) combines the sentences to produce a coherent, fluent, and adequate paragraph description through sentence compounding and co-reference replacement modules. Experiments on a benchmark mixed-domain data set curated for paragraph description from tables reveal the superiority of our system over existing data-to-text approaches. We also demonstrate the robustness of our system in accepting other popular data sets covering diverse data types such as knowledge graphs and key-value maps.
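The three stages can be sketched on a single toy record. The canonicalization, template realization, and compounding rules below are simplistic stand-ins for the paper's modules, chosen only to make the pipeline shape concrete.

```python
# Toy three-stage data-to-text pipeline: (i) canonicalize, (ii) realize one
# simple sentence per entry, (iii) compound with pronoun replacement.
# Templates and data are invented for illustration.

record = {"name": "Marie Curie", "born": "1867", "field": "physics"}

def canonicalize(record):
    # Stage (i): structured entries -> (subject, attribute, value) triples
    return [(record["name"], attr, val)
            for attr, val in record.items() if attr != "name"]

def realize(triple):
    # Stage (ii): one simple sentence per triple, via a template
    subj, attr, val = triple
    templates = {"born": "{s} was born in {v}.", "field": "{s} worked in {v}."}
    return templates[attr].format(s=subj, v=val)

def compound(sentences, subject):
    # Stage (iii): naive compounding + co-reference (pronoun) replacement
    first, rest = sentences[0], sentences[1:]
    rest = [s.replace(subject, "She") for s in rest]
    return " ".join([first] + rest)

triples = canonicalize(record)
paragraph = compound([realize(t) for t in triples], "Marie Curie")
print(paragraph)  # -> Marie Curie was born in 1867. She worked in physics.
```

A real system would learn or mine the templates from monolingual corpora and use an actual co-reference module, but the modular structure, which is what makes the approach domain-adaptable, is the same.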


2019 ◽  
Vol 317 (3) ◽  
pp. L347-L360 ◽  
Author(s):  
Yina Du ◽  
Geremy C. Clair ◽  
Denise Al Alam ◽  
Soula Danopoulos ◽  
Daniel Schnell ◽  
...  

Systems biology uses computational approaches to integrate diverse data types to understand cell and organ behavior. Data derived from complementary technologies, for example, transcriptomic and proteomic analyses, are providing new insights into development and disease. We compared mRNA and protein profiles from purified endothelial, epithelial, immune, and mesenchymal cells from normal human infant lung tissue. Signatures for each cell type were identified and compared at both mRNA and protein levels. Cell-specific biological processes and pathways were predicted by analysis of concordant and discordant RNA-protein pairs. Cell clustering and gene set enrichment comparisons identified shared versus unique processes associated with transcriptomic and/or proteomic data. Clear cell-to-cell correlations between mRNA and protein data were obtained for each cell type. Approximately 40% of RNA-protein pairs were coherently expressed. While the correlation between RNAs and their protein products was relatively low (Spearman rank coefficient r_s ≈ 0.4), cell-specific signature genes involved in functional processes characteristic of each cell type were more highly correlated with their protein products. Consistency of cell-specific RNA-protein signatures indicated an essential framework for the function of each cell type. Visualization and reutilization of the protein and RNA profiles are supported by a new web application, "LungProteomics," which is freely accessible to the public.
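The Spearman rank coefficient reported above measures monotonic agreement between paired mRNA and protein abundances. A minimal sketch of the computation on made-up values (using the rank-difference formula for untied data, not the study's actual pipeline):

```python
# Spearman's r_s between paired mRNA and protein abundances (toy values).

def ranks(xs):
    """Rank values 1..n (no tie handling; enough for this toy example)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) for untied data."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

mrna    = [2.1, 5.4, 3.3, 8.0, 1.2]
protein = [1.0, 4.2, 5.1, 6.3, 0.8]
print(round(spearman(mrna, protein), 2))  # -> 0.9
```

Because it works on ranks rather than raw abundances, r_s is insensitive to the very different dynamic ranges of transcriptomic and proteomic measurements, which is why it is the usual choice for this kind of cross-platform comparison.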


Author(s):  
Prerak Desai

The use of systems biology to study complex biological questions is gaining ground due to the ever-increasing amount of genetic tools and genome sequences available. As such, systems biology concepts and approaches are increasingly underpinning our concept of microbial physiology. Three tools for use in functional genomics are gene expression, proteomics, and metabolomics. However, these tools produce such large data sets that we sometimes become paralyzed trying to merge the data and link it to form a consistent biological interpretation. Use of functional groupings has relieved some of the issues in merging data for biological meaning. Statistical analysis and visualization of these multi-dimension data sets are needed to aid the microbiologist, which brings additional methods that are often not familiar. Progress is being made to bring these diverse data types together to understand fundamental metabolic processes and pathways. These efforts are paying tremendous dividends in our understanding of how microbes live, grow, survive, and metabolize nutrients. These insights allow metabolic engineering to progress and allow scientists to further define the mechanisms of metabolism.


2021 ◽  
Author(s):  
Shohre Masoumi ◽  
Maxwell W. Libbrecht ◽  
Kay C. Wiese

Motivation: With the advancement of sequencing technologies, genomic data sets are constantly being expanded by high volumes of different data types. One recently introduced data type in genomic science is genomic signals, which are usually short-read coverage measurements over the genome. An example of genomic signals is epigenomic marks, which are utilized to locate functional and nonfunctional elements in genome annotation studies. To understand and evaluate the results of such studies, one needs to understand and analyze the characteristics of the input data. Results: SigTools is an R-based genomic signal visualization package developed with two objectives: (1) to facilitate genomic signal exploration, in order to uncover insights for later model training, refinement, and development, by including distribution and autocorrelation plots; and (2) to enable genomic signal interpretation by including correlation and aggregation plots. Moreover, SigTools also provides text-based descriptive statistics of the given signals, which can be practical when developing and evaluating learning models. We also include results from two case studies. The first examines several previously studied genomic signals called histone modifications. This use case demonstrates how SigTools can help scientists explore and establish recognized data sets. The second use case examines a data set of novel chromatin state features, which are novel genomic signals generated by a learning model. This use case demonstrates how SigTools can assist in exploring the characteristics and behavior of novel signals towards their interpretation. In addition, our corresponding web application, SigTools-Shiny, extends the accessibility of these modules to people who are more comfortable working with graphical user interfaces than with command-line tools.
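One of the summaries mentioned above, autocorrelation, quantifies how strongly a coverage signal at one position predicts nearby positions. A minimal sketch of the lag-k sample autocorrelation on an invented coverage track (not SigTools' implementation):

```python
# Lag-k sample autocorrelation of a short-read coverage signal (toy values).

def autocorr(signal, lag):
    """Sample autocorrelation at a given lag, normalized by total variance."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((v - mean) ** 2 for v in signal)
    cov = sum((signal[i] - mean) * (signal[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

coverage = [0, 1, 3, 7, 8, 7, 3, 1, 0, 0]   # a single smooth peak
for lag in (1, 2, 3):
    print(lag, round(autocorr(coverage, lag), 2))
```

For a smooth peak like this one, autocorrelation is high at short lags and decays (eventually turning negative) at longer lags; the decay rate is exactly the kind of characteristic a modeler would want to inspect before choosing a bin size or training a model on the signal.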


2021 ◽  
pp. 016555152199863
Author(s):  
Ismael Vázquez ◽  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
José Ramón Méndez ◽  
...  

Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure that those results can be reproduced and compared with results obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following demanding functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities, and (4) providing protection mechanisms for licensing issues and user rights. To demonstrate the introduced functionality, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at https://rdata.4spam.group to facilitate understanding of this study.
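Functionalities (2) and (3) amount to making pre-processing a shareable, re-runnable artifact rather than an undocumented step. A toy sketch of the idea for spam text (the recipe format and step names are invented, not STRep's actual design):

```python
# A declarative pre-processing recipe that can be re-run outside any
# repository, so two techniques can be compared on identically prepared text.
# Corpus, step names, and recipe format are all invented for illustration.

corpus = [
    ("WIN a FREE prize NOW!!", "spam"),
    ("Meeting moved to 3pm",   "ham"),
    ("free tickets, click!!!", "spam"),
]

RECIPE = ["lowercase", "strip_punct"]   # shareable description of the steps

STEPS = {
    "lowercase":   str.lower,
    "strip_punct": lambda s: "".join(c for c in s
                                     if c.isalnum() or c.isspace()),
}

def preprocess(text, recipe):
    """Apply the named steps in order; the recipe travels with the data set."""
    for step in recipe:
        text = STEPS[step](text)
    return text

dataset = [(preprocess(text, RECIPE), label) for text, label in corpus]
print(dataset[0])
```

Because the recipe is plain data, it can be published alongside the customised data set, satisfying reproducibility without requiring the repository itself.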


2011 ◽  
Vol 84 (8) ◽  
Author(s):  
Tracy Holsclaw ◽  
Ujjaini Alam ◽  
Bruno Sansó ◽  
Herbie Lee ◽  
Katrin Heitmann ◽  
...  

2017 ◽  
Vol 6 (2) ◽  
pp. 12
Author(s):  
Abhith Pallegar

The objective of this paper is to elucidate how interconnected biological systems can be better mapped and understood using the rapidly growing area of Big Data. We can harness network efficiencies by analyzing diverse medical data and probe how to effectively lower the economic cost of finding cures for rare diseases. Most rare diseases are due to genetic abnormalities, and many forms of cancer develop due to genetic mutations. Finding cures for rare diseases requires us to understand the biology and biological processes of the human body. In this paper, we explore what the historical shift of focus from pharmacology to biotechnology means for accelerating biomedical solutions. With biotechnology playing a leading role in medical research, we explore how network efficiencies can be harnessed by strengthening the existing knowledge base. Studying rare or orphan diseases provides rich observable statistical data that can be leveraged for finding solutions. Network effects can be gained from working with diverse data sets that enable us to generate the highest-quality medical knowledge with the fewest resources. This paper examines gene-manipulation technologies such as Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) that can prevent diseases of genetic origin. We further explore the role of the emerging field of Big Data in analyzing large quantities of medical data given the rapid growth of computing power, and some of the network efficiencies gained from this endeavor.

