Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity

2020 ◽  
pp. 1-51
Author(s):  
Ivan Vulić ◽  
Simon Baker ◽  
Edoardo Maria Ponti ◽  
Ulla Petti ◽  
Ira Leviant ◽  
...  

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions—the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning—available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
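
A minimal sketch of the standard evaluation protocol on a Multi-SimLex-style benchmark: rank-correlate a model's similarity scores with human ratings via Spearman's rho. The toy vectors and the dict-based embedding lookup are illustrative assumptions, not part of the official release.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate(pairs, gold_scores, embeddings):
    """pairs: list of (word1, word2); gold_scores: human similarity ratings;
    embeddings: dict mapping word -> vector. Pairs with OOV words are skipped."""
    model_scores, kept_gold = [], []
    for (w1, w2), g in zip(pairs, gold_scores):
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            kept_gold.append(g)
    rho, _ = spearmanr(model_scores, kept_gold)
    return rho, len(kept_gold)

# Toy example; a real run would load, e.g., fastText vectors instead.
emb = {"cat": np.array([1.0, 0.2]), "dog": np.array([0.9, 0.3]),
       "car": np.array([-0.5, 1.0])}
rho, n = evaluate([("cat", "dog"), ("cat", "car")], [8.5, 1.2], emb)
print(f"Spearman rho = {rho:.3f} over {n} pairs")
```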

2021 ◽  
pp. 1-48
Author(s):  
Olga Majewska ◽  
Diana McCarthy ◽  
Jasper J. F. van den Bosch ◽  
Nikolaus Kriegeskorte ◽  
Ivan Vulić ◽  
...  

Research into representation learning models of lexical semantics usually utilizes some form of intrinsic evaluation to ensure that the learned representations reflect human semantic judgments. Lexical semantic similarity estimation is a widely used evaluation method, but efforts have typically focused on pairwise judgments of words in isolation, or are limited to specific contexts and lexical stimuli. These approaches are limited: they either provide no context for judgments, thereby ignoring ambiguity, or provide very specific sentential contexts that cannot then be used to generate a larger lexical resource. Furthermore, similarity between more than two items is not considered. We provide a full description and analysis of our recently proposed methodology for large-scale data set construction that produces a semantic classification of a large sample of verbs in the first phase, as well as multi-way similarity judgments made within the resultant semantic classes in the second phase. The methodology uses a spatial multi-arrangement approach proposed in the field of cognitive neuroscience for capturing multi-way similarity judgments of visual stimuli. We have adapted this method to handle polysemous linguistic stimuli and much larger samples than previous work. We specifically target verbs, but the method can equally be applied to other parts of speech. We perform cluster analysis on the data from the first phase and demonstrate how this might be useful in the construction of a comprehensive verb resource. We also analyze the semantic information captured by the second phase and discuss the potential of the spatially induced similarity judgments to better reflect human notions of word similarity. We demonstrate how the resultant data set can be used for fine-grained analyses and evaluation of representation learning models on the intrinsic tasks of semantic clustering and semantic similarity. In particular, we find that stronger static word embedding methods still outperform lexical representations emerging from more recent pre-training methods, both on word-level similarity and clustering. Moreover, thanks to the data set’s vast coverage, we are able to compare the benefits of specializing vector representations for a particular type of external knowledge by evaluating FrameNet- and VerbNet-retrofitted models on specific semantic domains such as “Heat” or “Motion.”
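
A minimal sketch, on assumed toy data, of the intrinsic clustering evaluation described above: cluster verb vectors and compare the induced partition against gold semantic classes, here with the adjusted Rand index. The verbs, class labels, and random vectors are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Toy stand-ins for verb embeddings drawn from two semantic classes.
verbs = ["boil", "simmer", "fry", "run", "walk", "jog"]
gold = [0, 0, 0, 1, 1, 1]  # 0 = Heat-like, 1 = Motion-like
vectors = np.vstack([rng.normal(loc=c, scale=0.1, size=8) for c in gold])

# Induce a partition and score it against the gold classes.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print("ARI:", adjusted_rand_score(gold, pred))
```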


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB, and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and an accuracy of 90.7%, compared to an F1 score of 90.8% but a lower accuracy of 70.89% achieved by Mazajak CBOW with the same architecture. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.
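
A minimal sketch of one traditional baseline of the kind evaluated above: TF-IDF features with logistic regression for binary health-relatedness classification. The tiny inline corpus is a placeholder, and the paper's pre-processing variants (normalization, stemming, etc.) would be applied to the texts before this step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["tweet about flu symptoms", "football match tonight",
               "new vaccine clinic opens", "traffic jam downtown"]
train_labels = [1, 0, 1, 0]  # 1 = health-related

# TF-IDF unigrams and bigrams feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["clinic offers free flu shots"]))  # likely [1] on this toy data
```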


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that the two data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\circ},\delta=47^{\circ})$ and is well within the $1\sigma$ error range of the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\circ},\delta=61^{\circ})$.
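
A minimal sketch, under simplified assumptions, of the cosine-dependence fit mentioned above: for a candidate axis, regress each galaxy's spin sign on the cosine of its angular distance from that axis; the fitted amplitude divided by its standard error gives a sigma level. A real analysis scans all candidate (RA, Dec) axes and typically uses Monte Carlo error estimates; the synthetic catalogue below is purely illustrative.

```python
import numpy as np

def unit_vec(ra_deg, dec_deg):
    """Unit 3-vector for equatorial coordinates in degrees."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.array([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)])

def dipole_sigma(ra, dec, spin, axis_ra, axis_dec):
    """ra, dec, spin: arrays (spin is +1/-1). Returns |amplitude| / sigma
    for the fit spin ~ A * cos(phi) relative to the candidate axis."""
    axis = unit_vec(axis_ra, axis_dec)
    gals = np.stack([unit_vec(r, d) for r, d in zip(ra, dec)])
    cosphi = gals @ axis
    A = np.sum(cosphi * spin) / np.sum(cosphi ** 2)      # least-squares slope
    resid = spin - A * cosphi
    sigma_A = np.sqrt(np.sum(resid ** 2) / (len(spin) - 1) / np.sum(cosphi ** 2))
    return abs(A) / sigma_A

# Synthetic catalogue with a weak dipole injected along (RA=78, Dec=47).
rng = np.random.default_rng(1)
n = 5000
ra, dec = rng.uniform(0, 360, n), rng.uniform(-30, 60, n)
gal_vecs = np.stack([unit_vec(r, d) for r, d in zip(ra, dec)])
p_cw = 0.5 + 0.05 * gal_vecs @ unit_vec(78.0, 47.0)
spin = np.where(rng.random(n) < p_cw, 1.0, -1.0)
print("sigma at injected axis: %.1f" % dipole_sigma(ra, dec, spin, 78.0, 47.0))
```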


2015 ◽  
Vol 8 (1) ◽  
pp. 421-434 ◽  
Author(s):  
M. P. Jensen ◽  
T. Toto ◽  
D. Troyan ◽  
P. E. Ciesielski ◽  
D. Holdridge ◽  
...  

Abstract. The Midlatitude Continental Convective Clouds Experiment (MC3E) took place during the spring of 2011, centered in north-central Oklahoma, USA. The main goal of this field campaign was to capture the dynamical and microphysical characteristics of precipitating convective systems in the US Central Plains. A major component of the campaign was a six-site radiosonde array designed to capture the large-scale variability of the atmospheric state with the intent of deriving model forcing data sets. Over the course of the 46-day MC3E campaign, a total of 1362 radiosondes were launched from the enhanced sonde network. This manuscript provides details on the instrumentation used as part of the sounding array, the data processing activities, including quality checks and humidity bias corrections, and an analysis of the impacts of bias correction and algorithm assumptions on the determination of convective levels and indices. It is found that corrections for known radiosonde humidity biases and assumptions regarding the characteristics of the surface convective parcel result in significant differences in the derived values of convective levels and indices in many soundings. In addition, the impact of including the humidity corrections and quality controls on the thermodynamic profiles used in the derivation of a large-scale model forcing data set is investigated. The results show a significant impact on the derived large-scale vertical velocity field, illustrating the importance of addressing these humidity biases.
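
A minimal sketch of why parcel assumptions matter for convective levels: the lifting condensation level for a surface parcel versus a parcel averaged over the lowest layer, using Espy's approximation of roughly 125 m of LCL height per degree of dewpoint depression. The sounding values below are illustrative, not MC3E data.

```python
def lcl_height_m(t_c, td_c):
    """Approximate LCL height (m AGL) via Espy's formula."""
    return 125.0 * (t_c - td_c)

# Toy lowest-layer sounding: (temperature C, dewpoint C) at several levels.
layer = [(30.0, 18.0), (28.5, 17.5), (27.0, 17.0), (25.5, 16.5)]

surface_t, surface_td = layer[0]
mean_t = sum(t for t, _ in layer) / len(layer)
mean_td = sum(td for _, td in layer) / len(layer)

print("surface-parcel LCL: %.0f m" % lcl_height_m(surface_t, surface_td))
print("mixed-layer parcel LCL: %.0f m" % lcl_height_m(mean_t, mean_td))
```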


2020 ◽  
Vol 223 (2) ◽  
pp. 1378-1397
Author(s):  
Rosemary A Renaut ◽  
Jarom D Hogue ◽  
Saeed Vatankhah ◽  
Shuang Liu

SUMMARY We discuss the focusing inversion of potential field data for the recovery of sparse subsurface structures from surface measurement data on a uniform grid. For the uniform grid, the model sensitivity matrices have a block Toeplitz Toeplitz block (BTTB) structure for each block of columns related to a fixed depth layer of the subsurface. Then, all forward operations with the sensitivity matrix, or its transpose, are performed using the 2-D fast Fourier transform. Simulations are provided to show that the implementation of the focusing inversion algorithm using the fast Fourier transform is efficient, and that the algorithm can be realized on standard desktop computers with sufficient memory for storage of volumes up to size n ≈ 10⁶. The linear systems of equations arising in the focusing inversion algorithm are solved using either Golub–Kahan bidiagonalization or randomized singular value decomposition algorithms. These two algorithms are contrasted for their efficiency when used to solve large-scale problems with respect to the sizes of the projected subspaces adopted for the solutions of the linear systems. The results confirm earlier studies that the randomized algorithms are to be preferred for the inversion of gravity data, and that for data sets of size m it is sufficient to use projected spaces of size approximately m/8. For the inversion of magnetic data sets, we show that it is more efficient to use the Golub–Kahan bidiagonalization, and that it is again sufficient to use projected spaces of size approximately m/8. The presented conclusions are supported by simulations and verified on a magnetic data set obtained over the Wuskwatim Lake region in Manitoba, Canada.
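
A minimal sketch of the FFT trick behind the fast forward operations, shown for an ordinary Toeplitz matrix: embed it in a circulant matrix and multiply via the FFT in O(n log n). The BTTB case nests this idea with a 2-D FFT; this 1-D version conveys the mechanics and is checked against a dense product.

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_matvec_fft(c, r, x):
    """Multiply the Toeplitz matrix with first column c and first row r
    (c[0] == r[0]) by vector x via circulant embedding."""
    n = len(x)
    # First column of the 2n x 2n circulant that embeds the Toeplitz matrix:
    # [c_0, ..., c_{n-1}, 0, r_{n-1}, ..., r_1].
    circ = np.concatenate([c, [0.0], r[:0:-1]])
    spectrum = np.fft.fft(circ)
    y = np.fft.ifft(spectrum * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

rng = np.random.default_rng(0)
n = 256
c, r = rng.normal(size=n), rng.normal(size=n)
r[0] = c[0]
x = rng.normal(size=n)
dense = toeplitz(c, r) @ x
print(np.allclose(toeplitz_matvec_fft(c, r, x), dense))  # True
```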


2009 ◽  
Vol 2 (1) ◽  
pp. 87-98 ◽  
Author(s):  
C. Lerot ◽  
M. Van Roozendael ◽  
J. van Geffen ◽  
J. van Gent ◽  
C. Fayt ◽  
...  

Abstract. Total O3 columns have been retrieved from six years of SCIAMACHY nadir UV radiance measurements using SDOAS, an adaptation of the GDOAS algorithm previously developed at BIRA-IASB for the GOME instrument. GDOAS and SDOAS have been implemented by the German Aerospace Center (DLR) in version 4 of the GOME Data Processor (GDP) and version 3 of the SCIAMACHY Ground Processor (SGP), respectively. The processors are being run at the DLR processing centre on behalf of the European Space Agency (ESA). We first focus on the description of the SDOAS algorithm, with particular attention to the impact of uncertainties on the reference O3 absorption cross-sections. Second, the resulting SCIAMACHY total ozone data set is globally evaluated through large-scale comparisons with results from GOME and OMI as well as with ground-based correlative measurements. The various total ozone data sets are found to agree within 2% on average. However, a negative trend of 0.2–0.4%/year has been identified in the SCIAMACHY O3 columns; this probably originates from instrumental degradation effects that have not yet been fully characterized.
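
A minimal sketch of how a %/yr drift like the one reported above can be estimated: regress the relative difference between two co-located column time series against time. The synthetic monthly series below injects a -0.3%/yr drift and is not SCIAMACHY data.

```python
import numpy as np

years = np.arange(0, 6, 1 / 12.0)                         # six years, monthly
rng = np.random.default_rng(2)
rel_diff = -0.3 * years + rng.normal(0, 0.5, len(years))  # percent units

# Linear fit: the slope is the drift in %/yr.
slope, intercept = np.polyfit(years, rel_diff, 1)
print("estimated drift: %.2f %%/yr" % slope)
```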


2017 ◽  
Vol 14 (4) ◽  
pp. 172988141770907 ◽  
Author(s):  
Hanbo Wu ◽  
Xin Ma ◽  
Zhimeng Zhang ◽  
Haibo Wang ◽  
Yibin Li

Human daily activity recognition has been a hot topic in computer vision for decades. Despite best efforts, activity recognition in naturally uncontrolled settings remains a challenging problem. Recently, by being able to perceive depth and visual cues simultaneously, RGB-D cameras have greatly boosted the performance of activity recognition. However, due to some practical difficulties, the publicly available RGB-D data sets are not sufficiently large for benchmarking when considering the diversity of their activities, subjects, and backgrounds. This severely affects the applicability of complicated learning-based recognition approaches. To address the issue, this article provides a large-scale RGB-D activity data set by merging five public RGB-D data sets that differ from each other in many aspects, such as length of actions, nationality of subjects, or camera angles. This data set comprises 4528 samples depicting 7 action categories (up to 46 subcategories) performed by 74 subjects. To verify how challenging the merged data set is, three feature representation methods are evaluated: depth motion maps, the spatiotemporal depth cuboid similarity feature, and curvature space scale. Results show that the merged large-scale data set is more realistic and challenging and therefore more suitable for benchmarking.
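
A minimal sketch of the depth motion map (DMM) idea evaluated above: accumulate the thresholded absolute differences of consecutive depth frames into a single 2-D motion-energy image (shown here for one projection view). The frame sizes, noise threshold, and random depth values are illustrative assumptions.

```python
import numpy as np

def depth_motion_map(frames, eps=50):
    """frames: (T, H, W) depth sequence. Returns the accumulated motion map,
    keeping only inter-frame changes larger than eps (a sensor-noise threshold)."""
    diffs = np.abs(np.diff(frames.astype(np.int32), axis=0))
    diffs[diffs < eps] = 0
    return diffs.sum(axis=0)

rng = np.random.default_rng(3)
toy = rng.integers(500, 4000, size=(10, 64, 64))  # fake depth frames (mm)
dmm = depth_motion_map(toy)
print(dmm.shape, dmm.max())
```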


2017 ◽  
Vol 44 (2) ◽  
pp. 203-229 ◽  
Author(s):  
Javier D Fernández ◽  
Miguel A Martínez-Prieto ◽  
Pablo de la Fuente Redondo ◽  
Claudio Gutiérrez

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
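
A minimal sketch of structural metrics of the kind proposed above, computed over a toy triple set: out-degree per subject and the ratio of distinct predicates to triples as a crude redundancy indicator. The prefixed names are placeholders, and a real run would parse an RDF serialization instead of an inline list.

```python
from collections import Counter

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob",   "foaf:name", '"Bob"'),
]

# Out-degree: how many triples each subject participates in.
out_degree = Counter(s for s, _, _ in triples)
predicates = {p for _, p, _ in triples}
print("subject out-degrees:", dict(out_degree))
print("predicate/triple ratio: %.2f" % (len(predicates) / len(triples)))
```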


2021 ◽  
Vol 15 ◽  
Author(s):  
Jianwei Zhang ◽  
Xubin Zhang ◽  
Lei Lv ◽  
Yining Di ◽  
Wei Chen

Background: Learning discriminative representations from large-scale data sets has achieved breakthroughs over the past decades. However, it is still a thorny problem to generate representative embeddings from limited examples, for example, a class containing only one image. Recently, deep learning-based Few-Shot Learning (FSL) has been proposed to tackle this problem by leveraging prior knowledge in various ways. Objective: In this work, we review recent advances in FSL from the perspective of high-dimensional representation learning. The results of the analysis can provide insights and directions for future work. Methods: We first present the definition of general FSL. Then we propose a general framework for the FSL problem and give the taxonomy under the framework. We survey two FSL directions: learning policy and meta-learning. Results: We review the advanced applications of FSL, including image classification, object detection, image segmentation, and other tasks, as well as the corresponding benchmarks, to provide an overview of recent progress. Conclusion: FSL needs to be further studied in medical imaging, language models, and reinforcement learning in future work. In addition, cross-domain FSL, successive FSL, and associated FSL are more challenging and valuable research directions.
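
A minimal sketch of one canonical meta-learning approach covered by such surveys, prototypical-network-style classification: average the support embeddings per class and label queries by nearest prototype. The random vectors stand in for the output of a learned encoder.

```python
import numpy as np

rng = np.random.default_rng(4)
n_way, k_shot, dim = 3, 5, 16

# Support set: k_shot embeddings per class, clustered around class means.
means = rng.normal(size=(n_way, dim))
support = means[:, None, :] + 0.1 * rng.normal(size=(n_way, k_shot, dim))
prototypes = support.mean(axis=1)                 # (n_way, dim) class prototypes

# Classify a query embedding by nearest prototype (Euclidean distance).
query = means[1] + 0.1 * rng.normal(size=dim)     # a class-1 query
dists = np.linalg.norm(prototypes - query, axis=1)
print("predicted class:", int(np.argmin(dists)))  # expected: 1
```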


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Hendri Murfi

Purpose: The aim of this research is to develop an eigenspace-based fuzzy c-means method for scalable topic detection. Design/methodology/approach: The eigenspace-based fuzzy c-means (EFCM) method combines representation learning and clustering. The textual data are transformed into a lower-dimensional eigenspace using truncated singular value decomposition. Fuzzy c-means is performed on the eigenspace to identify the centroids of each cluster. The topics are provided by transforming the centroids back into the nonnegative subspace of the original space. In this paper, we extend the EFCM method for scalability using two approaches, i.e., single-pass and online processing. We call the developed topic detection methods oEFCM and spEFCM. Findings: Our simulation shows that both the oEFCM and spEFCM methods provide faster running times than EFCM for data sets that do not fit in memory, although with a decrease in the average coherence score. For data sets that both fit and do not fit into memory, the oEFCM method provides a tradeoff between running time and coherence score that is better than spEFCM's. Originality/value: This research produces a scalable topic detection method. Besides this scalability, the developed method also provides a faster running time for data sets that fit in memory.
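
A minimal sketch of the EFCM pipeline described above: project documents into a low-dimensional eigenspace with truncated SVD, then run fuzzy c-means there. The bare-bones FCM loop below is a stand-in for a library implementation, and the corpus is a placeholder.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["stock markets fall", "shares and markets rally",
        "team wins the match", "player scores in the match"]
X = TruncatedSVD(n_components=2, random_state=0).fit_transform(
        TfidfVectorizer().fit_transform(docs))

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means: alternate centroid and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                            # memberships sum to 1
    for _ in range(iters):
        W = U ** m
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=0)
    return centers, U

centers, U = fuzzy_cmeans(X)
print("cluster memberships per document:\n", U.T.round(2))
```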

