Comparing the Quality and Speed of Sentence Classification with Modern Language Models

2020 ◽  
Vol 10 (10) ◽  
pp. 3386 ◽  
Author(s):  
Krzysztof Fiok ◽  
Waldemar Karwowski ◽  
Edgar Gutierrez ◽  
Mohammad Reza-Davahli

After the advent of GloVe and word2vec, the dynamic development of language models (LMs) used to generate word embeddings has enabled the creation of better text classifier frameworks. With the vector representations of words generated by newer LMs, embeddings are no longer static but context-aware. However, the quality of results provided by state-of-the-art LMs comes at the price of speed. Our goal was to present a benchmark providing insight into the speed–quality trade-off of a sentence classifier framework based on word embeddings provided by selected LMs. We used a recurrent neural network with gated recurrent units to create sentence-level vector representations from the word embeddings provided by an LM, and a single fully connected layer for classification. Benchmarking was performed on two sentence classification data sets: the Sixth Text REtrieval Conference (TREC6) data set and a 1000-sentence data set of our own design. Our Monte Carlo cross-validated results on these two data sources demonstrated that the newest deep learning LMs provided improvements over GloVe and FastText in terms of weighted Matthews correlation coefficient (MCC) scores. We postulate that progress in LMs is more apparent when more difficult classification tasks are addressed.
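The architecture described above, word embeddings from an LM fed to a GRU whose final hidden state passes through a single fully connected layer, can be sketched in a few lines. A minimal PyTorch sketch, assuming pre-computed embeddings arrive as a (batch, seq_len, emb_dim) tensor; all dimensions are illustrative rather than the authors' settings (TREC6 happens to have six coarse classes).

```python
import torch
import torch.nn as nn

class GRUSentenceClassifier(nn.Module):
    """GRU over LM word embeddings + one fully connected layer (illustrative)."""
    def __init__(self, emb_dim=768, hidden_dim=256, n_classes=6):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)  # single FC classification layer

    def forward(self, embeddings):      # embeddings: (batch, seq_len, emb_dim)
        _, h_n = self.gru(embeddings)   # final hidden state: (1, batch, hidden_dim)
        return self.fc(h_n.squeeze(0))  # logits: (batch, n_classes)

# A batch of 4 sentences, 12 tokens each, from a hypothetical LM
model = GRUSentenceClassifier()
print(model(torch.randn(4, 12, 768)).shape)  # torch.Size([4, 6])
```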

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the traditional machine learning classifiers KNN, SVM, Multinomial NB, and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model (F1 score of 75.2%, accuracy of 90.7%), compared with an F1 score of 90.8% but a lower accuracy of 70.89% achieved by Mazajak CBOW with the same architecture. Our results also show that the performance of the best traditional classifier we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second.
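The best-performing configuration, pre-trained word embeddings feeding a BLSTM, follows a standard pattern. A hedged PyTorch sketch: the random embedding matrix stands in for the Mazajak vectors, and the vocabulary size, dimensions, and binary health/non-health output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    """Frozen pre-trained embeddings -> bidirectional LSTM -> linear head (sketch)."""
    def __init__(self, pretrained, hidden_dim=128, n_classes=2):
        super().__init__()
        # freeze=True keeps the pre-trained vectors fixed during training
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.lstm = nn.LSTM(pretrained.size(1), hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        out, _ = self.lstm(self.emb(token_ids))  # (batch, seq_len, 2*hidden_dim)
        return self.fc(out[:, -1, :])            # logits from the final time step
                                                 # (a common simplification)

vectors = torch.randn(50_000, 300)  # stand-in for Mazajak CBOW/Skip-Gram vectors
model = BLSTMClassifier(vectors)
print(model(torch.randint(0, 50_000, (8, 30))).shape)  # torch.Size([8, 2])
```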


2020 ◽  
pp. 1-51
Author(s):  
Ivan Vulić ◽  
Simon Baker ◽  
Edoardo Maria Ponti ◽  
Ulla Petti ◽  
Ira Leviant ◽  
...  

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of the Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in the further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
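Benchmarks of this kind are typically scored by correlating a model's cosine similarities with the human ratings. A minimal sketch of that evaluation loop, assuming a dict of word vectors and (word1, word2, rating) tuples; both are placeholders rather than the released Multi-SimLex data format.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(vectors, pairs):
    """Spearman correlation between model similarities and human ratings."""
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Toy run: random vectors and made-up ratings, so the score is meaningless
rng = np.random.default_rng(0)
vecs = {w: rng.standard_normal(300) for w in ("car", "auto", "tree")}
print(evaluate(vecs, [("car", "auto", 5.8), ("car", "tree", 0.9),
                      ("auto", "tree", 1.1)]))
```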


Author(s):  
Vasile Rus ◽  
Mihai Lintean ◽  
Arthur C. Graesser ◽  
Danielle S. McNamara

Assessing the semantic similarity between two texts is a central task in many applications, including summarization, intelligent tutoring systems, and software testing. Similarity of texts is typically explored at the level of the word, sentence, paragraph, and document. The similarity can be defined quantitatively (e.g., in the form of a normalized value between 0 and 1) or qualitatively, in the form of semantic relations such as elaboration, entailment, or paraphrase. In this chapter, we focus first on quantitatively measuring and then on qualitatively detecting sentence-level text-to-text semantic relations. A generic approach that relies on word-to-word similarity measures is presented, along with experiments and results obtained with various instantiations of the approach. In addition, we provide the results of a study on the role of weighting in Latent Semantic Analysis, a statistical technique for assessing the similarity of texts. The results were obtained on two data sets: a standard data set on sentence-level paraphrase detection and a data set from an intelligent tutoring system.
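One common instantiation of such a generic word-to-word approach is greedy alignment: match each word with its most similar counterpart in the other sentence, average, and symmetrize. A sketch under the assumption of some word-level similarity function word_sim (WordNet-, LSA-, or embedding-based); it illustrates the family of methods, not necessarily the chapter's exact formula.

```python
def sentence_similarity(sent_a, sent_b, word_sim):
    """Greedy word-to-word aggregation: each word matches its best counterpart."""
    def directional(src, dst):
        return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)
    # Symmetrize so that sim(A, B) == sim(B, A)
    return 0.5 * (directional(sent_a, sent_b) + directional(sent_b, sent_a))

# Toy word similarity: 1.0 for identical words, else 0.0 (stands in for a
# WordNet-, LSA-, or embedding-based measure)
exact = lambda w, v: 1.0 if w == v else 0.0
print(sentence_similarity("the cat sat".split(), "the cat slept".split(), exact))
# 0.666... -> two of three words align exactly in each direction
```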


2021 ◽  
Vol 47 (1) ◽  
pp. 141-179
Author(s):  
Matej Martinc ◽  
Senja Pollak ◽  
Marko Robnik-Šikonja

Abstract We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages, and allows adaptation to a specific readability task and data set. By systematic comparison of several neural architectures on a number of benchmark and new labeled readability data sets in two languages, this study also offers a comprehensive analysis of different neural approaches to readability classification. We expose their strengths and weaknesses, compare their performance to current state-of-the-art classification approaches to readability, which in most cases still rely on extensive feature engineering, and propose possibilities for improvements.
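One simple instantiation of the unsupervised idea is to rank documents by the perplexity a pretrained neural language model assigns to them, with harder-to-predict text scored as less readable. A sketch using Hugging Face's GPT-2; it illustrates the general principle, not the authors' exact scoring function.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    """Perplexity of `text` under GPT-2; higher ~ harder to predict."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()

print(perplexity("The cat sat on the mat."))
print(perplexity("Epistemological ramifications notwithstanding, heuristics prevail."))
```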


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5809
Author(s):  
Loris Nanni ◽  
Giovanni Minchio ◽  
Sheryl Brahnam ◽  
Davide Sarraggiotto ◽  
Alessandra Lumini

In this paper, we examine two strategies for boosting the performance of ensembles of Siamese networks (SNNs) for image classification using two loss functions (Triplet and Binary Cross Entropy) and two methods for building the dissimilarity spaces (FULLY and DEEPER). With FULLY, the distance between a pattern and a prototype is calculated by comparing two images using the fully connected layer of the Siamese network. With DEEPER, each pattern is described using a deeper layer combined with dimensionality reduction. The basic design of the SNNs takes advantage of supervised k-means clustering for building the dissimilarity spaces that train a set of support vector machines, which are then combined by sum rule for a final decision. The robustness and versatility of this approach are demonstrated on several cross-domain image data sets, including a portrait data set, two bioimage data sets, and two animal vocalization data sets. Results show that the strategies employed in this work to increase the performance of dissimilarity image classification using SNNs are closing the gap with standalone CNNs. Moreover, when our best system is combined with an ensemble of CNNs, the resulting performance is superior to an ensemble of CNNs alone, demonstrating that our new strategy extracts additional information.
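The dissimilarity-space pipeline can be sketched independently of the specific networks: cluster the training patterns to obtain prototypes, describe every pattern by its distances to those prototypes, and train an SVM on that representation. A minimal sketch in which a Euclidean snn_distance stands in for the trained Siamese network, and plain k-means stands in for the paper's supervised clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))   # stand-in image descriptors
y = rng.integers(0, 3, 200)          # three stand-in classes

# 1. Prototypes via k-means (the paper uses supervised clustering)
prototypes = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_

# 2. Placeholder for the Siamese network's learned distance between two patterns
def snn_distance(a, b):
    return np.linalg.norm(a - b)     # the trained SNN would be queried here

# 3. Map every pattern into the dissimilarity space (distance to each prototype)
D = np.array([[snn_distance(x, p) for p in prototypes] for x in X])

# 4. Train an SVM on the dissimilarity representation
clf = SVC().fit(D, y)
print(clf.score(D, y))
```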


2019 ◽  
Author(s):  
Nampally Tejasri ◽  
Mongkol Ekapanyapong

The classification and recognition of the variety of materials present in our surroundings has become an important challenge for computer vision systems in recent years. Recent developments in the field of Artificial Neural Networks have brought the ability to train various deep architectures to extract features for this challenging task. In this work, state-of-the-art Convolutional Neural Network (CNN) techniques are used to classify materials, and the results obtained by them are compared. The results are gathered over two material data sets using the two popular approaches to Transfer Learning. They show that the fine-tuning approach achieves very good results compared to the approach in which only the information derived from the layer just before the fully connected layer is used. The comparison also indicates an improvement in the performance and accuracy of the system, particularly on the data set that contains a large number of images.
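The two Transfer Learning approaches compared, fine-tuning the whole network versus using frozen features from the layer just before the fully connected layer, differ only in which parameters stay trainable. A sketch with torchvision's ResNet-18 as a stand-in backbone; the class count is illustrative.

```python
import torch.nn as nn
from torchvision import models

N_CLASSES = 10  # illustrative; set to the material data set's class count

# Approach 1: fine-tuning -- replace the head, keep every layer trainable
finetune = models.resnet18(weights="IMAGENET1K_V1")
finetune.fc = nn.Linear(finetune.fc.in_features, N_CLASSES)

# Approach 2: fixed feature extraction -- freeze the backbone, train only the head
frozen = models.resnet18(weights="IMAGENET1K_V1")
for p in frozen.parameters():
    p.requires_grad = False               # features before the FC layer stay fixed
frozen.fc = nn.Linear(frozen.fc.in_features, N_CLASSES)  # new head is trainable
```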


2020 ◽  
Vol 6 (9) ◽  
pp. 91
Author(s):  
Ibtissam Constantin ◽  
Joseph Constantin ◽  
André Bigand

Convolutional neural networks usually require large labeled data sets to construct accurate models. However, in many real-world scenarios, such as global illumination, labeling data is a time-consuming and costly task requiring human intelligence. Semi-supervised learning methods alleviate this issue by making use of a small labeled data set and a larger set of unlabeled data. In this paper, our contributions focus on the development of a robust algorithm that combines active learning and a deep semi-supervised convolutional neural network to reduce the labeling workload and to accelerate convergence in the case of real-time global illumination. While the theoretical concepts of photo-realistic rendering are well understood, the need for the delivery of highly dynamic interactive content in vast virtual environments has increased recently. In particular, the quality measure of computer-generated images is of great importance. The experiments are conducted on global illumination scenes that contain diverse distortions. The learning models' quality measures show good consistency with human psycho-visual thresholds. A comparison has also been made with SVM and other state-of-the-art deep learning models. We perform transfer learning by running the convolution base of these models over our image set; we then use the output features of the convolution base as input to retrain the parameters of the fully connected layer. The obtained results show that our proposed method provides promising efficiency in terms of precision, time complexity, and optimal architecture.
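The transfer-learning step described, running the convolution base once over the image set and retraining only a fully connected layer on its output features, can be sketched as a two-stage pipeline. The backbone choice, shapes, labels, and the tiny training loop below are illustrative stand-ins.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stage 1: run the frozen convolution base over the image set once
base = models.vgg16(weights="IMAGENET1K_V1").features.eval()
images = torch.randn(16, 3, 224, 224)   # stand-in image batch
with torch.no_grad():
    feats = base(images).flatten(1)     # (16, 512*7*7) feature vectors

# Stage 2: retrain only a fully connected layer on those features
head = nn.Linear(feats.size(1), 2)      # e.g., distorted vs. acceptable
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
labels = torch.randint(0, 2, (16,))     # stand-in labels
for _ in range(10):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(head(feats), labels)
    loss.backward()
    opt.step()
```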


2018 ◽  
Vol 154 (2) ◽  
pp. 149-155
Author(s):  
Michael Archer

1. Yearly records of worker Vespula germanica (Fabricius) taken in suction traps at Silwood Park (28 years) and at Rothamsted Research (39 years) are examined. 2. Using the autocorrelation function (ACF), a significant negative 1-year lag followed by a lesser non-significant positive 2-year lag was found in all, or parts of, each data set, indicating an underlying population dynamic of a 2-year cycle with a damped waveform. 3. The minimum number of years before the 2-year cycle with damped waveform was shown varied between 17 and 26, or was not found in some data sets. 4. Ecological factors delaying or preventing the occurrence of the 2-year cycle are considered.
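The autocorrelation signature in point 2, a significant negative lag-1 coefficient followed by a weaker positive lag-2 coefficient, is straightforward to check on any yearly count series. A sketch using statsmodels on synthetic data with a built-in 2-year alternation; the wasp counts themselves are not reproduced here.

```python
import numpy as np
from statsmodels.tsa.stattools import acf

# Synthetic yearly counts with an alternating (2-year) component plus noise
rng = np.random.default_rng(1)
years = 39
counts = 100 + 30 * (-1) ** np.arange(years) + rng.normal(0, 10, years)

# Lag-1 and lag-2 autocorrelations with an approximate 95% significance bound
r = acf(counts, nlags=2)
bound = 1.96 / np.sqrt(years)
print(f"lag-1: {r[1]:+.2f}, lag-2: {r[2]:+.2f}, significant if |r| > {bound:.2f}")
# Expect a significant negative lag-1 and a positive lag-2, as in the paper
```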


2018 ◽  
Vol 21 (2) ◽  
pp. 117-124 ◽  
Author(s):  
Bakhtyar Sepehri ◽  
Nematollah Omidikia ◽  
Mohsen Kompany-Zareh ◽  
Raouf Ghavami

Aims & Scope: In this research, 8 variable selection approaches were used to investigate the effect of variable selection on the predictive power and stability of CoMFA models. Materials & Methods: Three data sets including 36 EPAC antagonists, 79 CD38 inhibitors and 57 ATAD2 bromodomain inhibitors were modelled by CoMFA. First of all, for all three data sets, CoMFA models with all CoMFA descriptors were created then by applying each variable selection method a new CoMFA model was developed so for each data set, 9 CoMFA models were built. Obtained results show noisy and uninformative variables affect CoMFA results. Based on created models, applying 5 variable selection approaches including FFD, SRD-FFD, IVE-PLS, SRD-UVEPLS and SPA-jackknife increases the predictive power and stability of CoMFA models significantly. Result & Conclusion: Among them, SPA-jackknife removes most of the variables while FFD retains most of them. FFD and IVE-PLS are time consuming process while SRD-FFD and SRD-UVE-PLS run need to few seconds. Also applying FFD, SRD-FFD, IVE-PLS, SRD-UVE-PLS protect CoMFA countor maps information for both fields.

