A study of Turkish emotion classification with pretrained language models

Emotion classification is a research field that aims to detect the emotions in a text using machine learning methods. In traditional machine learning (TML) methods, feature engineering processes cause the loss of some meaningful information, and classification performance is negatively affected. In addition, the success of modelling using deep learning (DL) approaches depends on the sample size. More samples are needed for Turkish due to the unique characteristics of the language. However, emotion classification data sets in Turkish are quite limited. In this study, the pretrained language model approach was used to create a stronger emotion classification model for Turkish. Well-known pretrained language models were fine-tuned for this purpose. The performances of these fine-tuned models for Turkish emotion classification were comprehensively compared with the performances of TML and DL methods in experimental studies. The proposed approach provides state-of-the-art performance for Turkish emotion classification.

Transformer Oil Quality Assessment Using Random Forest with Feature Engineering

Energies ◽ 10.3390/en14071809 ◽ 2021 ◽ Vol 14 (7) ◽ pp. 1809

Author(s): Mohammed El Amine Senoussaoui ◽ Mostefa Brahami ◽ Issouf Fofana

Keyword(s): Machine Learning ◽ Random Forest ◽ Oil Quality ◽ Principal Component ◽ Condition Assessment ◽ Classification Performance ◽ Transformer Oil ◽ Classification Model ◽ Insulation Degradation ◽ Transformer Oils

Machine learning is widely used as a panacea in many engineering applications, including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on some aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: the oils that can be maintained in service, the oils that should be reconditioned or filtered, the oils that should be reclaimed, and the oils that must be discarded. Of the two algorithms, random forest exhibited better performance and high accuracy with only a small amount of data. Good performance was achieved through not only the application of the proposed algorithm but also the approach to data preprocessing. Before being fed to the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were then transformed again by principal component analysis and passed through the CFsSubset filter a second time. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the reduction in the number of datasets required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.
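
The correlation-based feature selection step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes Pearson correlation as the association measure and a greedy forward search, and the function names (`pearson`, `cfs_merit`, `cfs_forward_select`) and the toy data are hypothetical. The merit formula itself is the standard CFS heuristic, which rewards features that correlate with the class and penalizes features that correlate with each other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cfs_merit(columns, target, subset):
    """CFS merit k*r_cf / sqrt(k + k*(k-1)*r_ff) for the feature indices in subset."""
    k = len(subset)
    # Mean absolute feature-to-class correlation.
    r_cf = sum(abs(pearson(columns[i], target)) for i in subset) / k
    if k == 1:
        return r_cf
    # Mean absolute feature-to-feature correlation over all pairs.
    pairs = [(i, j) for a, i in enumerate(subset) for j in subset[a + 1:]]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_select(columns, target):
    """Greedy forward search: add the feature that most improves merit, stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        merit, idx = max((cfs_merit(columns, target, selected + [i]), i) for i in remaining)
        if merit <= best:
            break
        selected.append(idx)
        remaining.remove(idx)
        best = merit
    return selected, best

# Hypothetical toy data: column 0 is unrelated to the class, column 1 tracks it.
noise = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
indicator = [0.1, 0.2, 1.0, 1.1, 2.0, 2.1]
oil_class = [0, 0, 1, 1, 2, 2]
selected, merit = cfs_forward_select([noise, indicator], oil_class)
```

In this toy case only the informative column is retained, because adding the noise column lowers the merit of the subset.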

Identification of DNA-binding proteins via Hypergraph based Laplacian Support Vector Machine

Current Bioinformatics ◽ 10.2174/1574893616666210806091922 ◽ 2021 ◽ Vol 16

Author(s): Yuqing Qian ◽ Hao Meng ◽ Weizhong Lu ◽ Zhijun Liao ◽ Yijie Ding ◽ ...

Keyword(s): Machine Learning ◽ Dna Binding ◽ Large Scale ◽ Binding Proteins ◽ Predictive Accuracy ◽ Dna Binding Proteins ◽ Research Field ◽ Support Vector ◽ Data Sets ◽ Independent Test

Background: The identification of DNA-binding proteins (DBP) is an important research field. Experiment-based methods for detecting DBP are time-consuming and labor-intensive. Objective: To solve the problem of large-scale DBP identification, some machine learning methods have been proposed. However, these methods have insufficient predictive accuracy. Our aim is to develop a sequence-based machine learning model to predict DBP. Methods: In our study, we extract six types of features (NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We use Multiple Kernel Learning based on the Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we construct a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, a Laplacian Support Vector Machine (LapSVM) is employed to train the predictive model. Our method is tested on the PDB186, PDB1075, PDB2272 and PDB14189 data sets. Result: Compared with other methods, our model achieves the best results on benchmark data sets. Conclusion: Accuracies of 87.1% and 74.2% are achieved on PDB186 (independent test of PDB1075) and PDB2272 (independent test of PDB14189), respectively.
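
To make the hypergraph step more concrete, here is a minimal sketch of one commonly used normalized hypergraph Laplacian, L = I - Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2). The abstract does not say how the hyperedges are built or which Laplacian variant is used, so both the formula choice and the tiny incidence matrix are assumptions made for illustration, and `hypergraph_laplacian` is a hypothetical name.

```python
import math

def hypergraph_laplacian(incidence, edge_weights):
    """Normalized hypergraph Laplacian L = I - Dv^-1/2 H W De^-1 H^T Dv^-1/2.

    incidence[v][e] is 1 if sample v belongs to hyperedge e, else 0.
    Every sample must lie in at least one hyperedge, and no hyperedge may be empty.
    Returns (laplacian, vertex_degrees).
    """
    n, m = len(incidence), len(edge_weights)
    # Hyperedge degree: number of samples in the hyperedge.
    edge_deg = [sum(incidence[v][e] for v in range(n)) for e in range(m)]
    # Vertex degree: total weight of the hyperedges containing the sample.
    vert_deg = [sum(edge_weights[e] * incidence[v][e] for e in range(m)) for v in range(n)]
    lap = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            theta = sum(
                edge_weights[e] * incidence[u][e] * incidence[v][e] / edge_deg[e]
                for e in range(m)
            ) / math.sqrt(vert_deg[u] * vert_deg[v])
            lap[u][v] = (1.0 if u == v else 0.0) - theta
    return lap, vert_deg

# Hypothetical example: four samples, hyperedges {0, 1, 2} and {2, 3}, unit weights.
H = [[1, 0], [1, 0], [1, 1], [0, 1]]
L, degrees = hypergraph_laplacian(H, [1.0, 1.0])
```

In a LapSVM-style model, a matrix like `L` enters the objective as a smoothness penalty that encourages samples sharing hyperedges, labeled or not, to receive similar decision values.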

Surface-Related Features Responsible for Cytotoxic Behavior of MXenes Layered Materials Predicted with Machine Learning Approach

Materials ◽ 10.3390/ma13143083 ◽ 2020 ◽ Vol 13 (14) ◽ pp. 3083

Author(s): Maciej E. Marchwiany ◽ Magdalena Birowska ◽ Mariusz Popielski ◽ Jacek A. Majewski ◽ Agnieszka M. Jastrzębska

Keyword(s): Machine Learning ◽ Transition Metal Oxides ◽ 2D Materials ◽ Experimental Studies ◽ Biological Data ◽ Data Sets ◽ Toxicological Studies ◽ Two Factors ◽ Health Need

To speed up the implementation of two-dimensional materials in the development of potential biomedical applications, the toxicological aspects toward human health need to be addressed. Because in vitro analysis is time-consuming and expensive, only part of the continuously expanding family of 2D materials can be tested this way. Machine learning methods can be used to extract new insights from available biological data sets and to provide further guidance for experimental studies. This study identifies the most relevant highly surface-specific features that might be responsible for the cytotoxic behavior of 2D materials, especially MXenes. In particular, two factors, namely the presence of transition metal oxides and of lithium atoms on the surface, are identified as cytotoxicity-generating features. The developed machine learning model succeeds in predicting toxicity for other 2D MXenes not previously tested in vitro and hence is able to complement the existing knowledge coming from in vitro studies. Thus, we claim that it might be one of the solutions for reducing the number of toxicological studies needed and for minimizing failures in future biological applications.

Research on the Confidence Regression Based on KNN Algorithm

Applied Mechanics and Materials ◽ 10.4028/www.scientific.net/amm.713-715.1877 ◽ 2015 ◽ Vol 713-715 ◽ pp. 1877-1881

Author(s): Fang Chun Jiang ◽ Sheng Feng Tian

Keyword(s): Machine Learning ◽ Experimental Data ◽ Research Field ◽ Experimental Results ◽ Data Sets ◽ Error Evaluation ◽ Significant Research ◽ Specific Error

Confidence regression is a significant research field of confidence machine learning. This paper adopts the KNN algorithm as a tool and performs error evaluation on the results of regression learning to separate predictions into an accept region and a refuse region, thereby achieving confidence regression. By setting a specific error value, this approach achieves controllable confidence regression, which has been tested on the bodyfat experimental data and other data sets. The experimental results show the feasibility of our approach.
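
One way to picture the accept/refuse idea is the sketch below. The abstract does not specify how the error is evaluated, so the estimate used here (the largest deviation of a neighbour's target from the KNN prediction) is an assumption, as are the function name `knn_confidence_regress`, its parameters, and the toy data.

```python
import math

def knn_confidence_regress(train_x, train_y, query, k=3, max_error=1.0):
    """Return (prediction, estimated_error, accepted) for one query point."""
    # Rank training points by Euclidean distance to the query.
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    neighbour_targets = [y for _, y in ranked[:k]]
    prediction = sum(neighbour_targets) / k
    # Assumed error estimate: worst disagreement between a neighbour and the prediction.
    estimated_error = max(abs(y - prediction) for y in neighbour_targets)
    # Accept only when the estimated error is within the user-chosen bound.
    return prediction, estimated_error, estimated_error <= max_error

# Hypothetical 1-D data: consistent targets near x = 1, scattered targets near x = 11.
xs = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
ys = [1.0, 1.1, 0.9, 5.0, 9.0, 2.0]
stable = knn_confidence_regress(xs, ys, (1.0,), k=3, max_error=0.5)
unstable = knn_confidence_regress(xs, ys, (11.0,), k=3, max_error=0.5)
```

Lowering `max_error` shrinks the accept region, which is one reading of what the abstract calls controllable confidence.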

An investigation on the factors affecting machine learning classifications in gamma-ray astronomy

Monthly Notices of the Royal Astronomical Society ◽ 10.1093/mnras/staa166 ◽ 2020 ◽ Vol 492 (4) ◽ pp. 5377-5390 ◽ Cited By ~ 3

Author(s): Shengda Luo ◽ Alex P Leung ◽ C Y Hui ◽ K L Li

Keyword(s): Machine Learning ◽ Gamma Ray ◽ Small Sample Size ◽ Classification Performance ◽ Small Sample ◽ Classification Model ◽ Machine Learning Techniques ◽ Large Area ◽ Actual Performance ◽ Statistical Fluctuations

We have investigated a number of factors that can have significant impacts on the classification performance of gamma-ray sources detected by the Fermi Large Area Telescope (LAT) with machine learning techniques. We show that a framework of automatic feature selection can construct a simple model with a small set of features that yields better performance than previous results. Secondly, because of the small sample size of the training/test sets of certain classes in gamma-ray, nested re-sampling and cross-validations are suggested for quantifying the statistical fluctuations of the quoted accuracy. We have also constructed a test set by cross-matching the identified active galactic nuclei (AGNs) and the pulsars (PSRs) in the Fermi-LAT 8-yr point source catalogue (4FGL) with those unidentified sources in the previous 3rd Fermi-LAT Source Catalog (3FGL). Using this cross-matched set, we show that some features used for building a classification model with the identified sources can suffer from the problem of covariate shift, which can be a result of various observational effects. This can possibly hamper the actual performance when one applies such a model in classifying unidentified sources. Using our framework, both AGN/PSR and young pulsar (YNG)/millisecond pulsar (MSP) classifiers are automatically updated with the new features and the enlarged training samples in the 4FGL catalogue incorporated. Using a two-layer model with these updated classifiers, we have selected 20 promising MSP candidates with confidence scores > 98 per cent from the unidentified sources in the 4FGL catalogue that can provide inputs for a multiwavelength identification campaign.
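
The nested cross-validation suggested in this abstract can be sketched as follows. This is a generic illustration rather than the authors' framework: the classifier is a toy 1-D k-nearest-neighbour vote, the only tuned hyperparameter is `k`, and every name and data value here is hypothetical. The point is that the inner folds choose the model using training data only, while the spread of the outer-fold scores gives a handle on the statistical fluctuation of the quoted accuracy.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points; train is [(x, label)]."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true label."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def nested_cv(data, candidate_ks=(1, 3, 5), outer=5, inner=3, seed=0):
    """Outer folds estimate accuracy; inner folds choose k using training data only."""
    rng = random.Random(seed)
    scores = []
    for held_out in k_fold_indices(len(data), outer, rng):
        held = set(held_out)
        train = [d for i, d in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        inner_folds = k_fold_indices(len(train), inner, rng)

        def inner_score(k):
            total = 0.0
            for fold in inner_folds:
                in_fold = set(fold)
                fit = [d for i, d in enumerate(train) if i not in in_fold]
                val = [train[i] for i in fold]
                total += accuracy(fit, val, k)
            return total / inner

        best_k = max(candidate_ks, key=inner_score)
        scores.append(accuracy(train, test, best_k))
    # Mean and sample standard deviation across the outer folds.
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical, well-separated two-class data (label 0 near x = 0, label 1 near x = 5).
data = [(i / 10, 0) for i in range(10)] + [(5 + i / 10, 1) for i in range(10)]
mean_acc, spread = nested_cv(data)
```

On real small-sample classes the outer-fold spread would be far from zero, which is exactly the fluctuation the abstract proposes to quantify.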

Integration of Leaky-Integrate-and-Fire Neurons in Standard Machine Learning Architectures to Generate Hybrid Networks: A Surrogate Gradient Approach

Neural Computation ◽ 10.1162/neco_a_01424 ◽ 2021 ◽ pp. 1-26

Author(s): Richard C. Gerum ◽ Achim Schilling

Keyword(s): Machine Learning ◽ Learning Community ◽ Gradient Methods ◽ Image Data ◽ Classification Performance ◽ Data Sets ◽ Neuron Models ◽ Data Set ◽ Integrate And Fire ◽ Gradient Approach

Up to now, modern machine learning (ML) has been based on approximating big data sets with high-dimensional functions, taking advantage of huge computational resources. We show that biologically inspired neuron models such as the leaky-integrate-and-fire (LIF) neuron provide novel and efficient ways of information processing. They can be integrated in machine learning models and are a potential target to improve ML performance. Thus, we have derived simple update rules for LIF units to numerically integrate the differential equations. We apply a surrogate gradient approach to train the LIF units via backpropagation. We demonstrate that tuning the leak term of the LIF neurons can be used to run the neurons in different operating modes, such as simple signal integrators or coincidence detectors. Furthermore, we show that the constant surrogate gradient, in combination with tuning the leak term of the LIF units, can be used to achieve the learning dynamics of more complex surrogate gradients. To prove the validity of our method, we applied it to established image data sets (the Oxford 102 flower data set, MNIST), implemented various network architectures, used several input data encodings and demonstrated that the method is suitable to achieve state-of-the-art classification performance. We provide our method as well as further surrogate gradient methods to train spiking neural networks via backpropagation as an open-source KERAS package to make it available to the neuroscience and machine learning community. To increase the interpretability of the underlying effects and thus make a small step toward opening the black box of machine learning, we provide interactive illustrations, with the possibility of systematically monitoring the effects of parameter changes on the learning characteristics.
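
The effect of the leak term on the operating mode can be illustrated with a minimal discrete-time LIF unit. The abstract does not give the authors' update rules, so the rule below is an assumption: a forward-Euler-style step in which `leak` plays the role of dt/tau, the input is taken to be already scaled, and the potential is reset to zero after a spike. The function name and the input sequences are hypothetical, and the surrogate-gradient training step is not shown.

```python
def simulate_lif(inputs, leak, threshold=1.0):
    """Discrete-time LIF unit: decay the membrane potential, add input, spike and reset.

    leak is the fraction of potential lost per step (0 = perfect integrator).
    Returns the list of time steps at which the unit spiked.
    """
    potential = 0.0
    spikes = []
    for t, current in enumerate(inputs):
        potential = (1.0 - leak) * potential + current
        if potential >= threshold:
            spikes.append(t)
            potential = 0.0  # hard reset after a spike
    return spikes

# Two sub-threshold pulses, either spread over time or arriving in the same step.
spread_pulses = [0.6, 0.0, 0.0, 0.0, 0.0, 0.6]
coincident_pulses = [0.0, 0.0, 1.2, 0.0, 0.0, 0.0]

integrator = simulate_lif(spread_pulses, leak=0.0)        # remembers the first pulse
detector_spread = simulate_lif(spread_pulses, leak=0.5)   # forgets it before the second
detector_hit = simulate_lif(coincident_pulses, leak=0.5)  # fires only on coincidence
```

With no leak the unit sums pulses over time and fires on the second one; with a strong leak it fires only when the pulses coincide, which matches the integrator and coincidence-detector modes the abstract describes.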

Humboldtian Diagnosis of Peach Tree (Prunus persica) Nutrition Using Machine-Learning and Compositional Methods

Agronomy ◽ 10.3390/agronomy10060900 ◽ 2020 ◽ Vol 10 (6) ◽ pp. 900 ◽ Cited By ~ 3

Author(s): Debora Leitzke Betemps ◽ Betania Vahl de Paula ◽ Serge-Étienne Parent ◽ Simone P. Galarça ◽ Newton A. Mayer ◽ ...

Keyword(s): Machine Learning ◽ Euclidean Distance ◽ Prunus Persica ◽ Nutrient Status ◽ Classification Model ◽ Data Sets ◽ Local Conditions ◽ Random Forest Classification ◽ Forest Classification ◽ Euclidean Distances

Regional nutrient ranges are commonly used to diagnose plant nutrient status. In contrast, local diagnosis confronts unhealthy to healthy compositional entities in comparable surroundings. Robust local diagnosis requires well-documented data sets processed by machine learning and compositional methods. Our objective was to customize nutrient diagnosis of peach (Prunus persica) trees at local scale. We collected 472 observations from commercial orchards and fertilizer trials across eleven cultivars of Prunus persica and six rootstocks in the state of Rio Grande do Sul (RS), Brazil. The random forest classification model returned an area under curve exceeding 0.80 and classification accuracy of 80% about yield cutoff of 16 Mg ha−1. Centered log ratios (clr) of foliar defective compositions have appropriate geometry to compute Euclidean distances from closest successful compositions in “enchanting islands”. Successful specimens closest to defective specimens as shown by Euclidean distance allowed reaching trustful fruit yields using site-specific corrective measures. Comparing tissue composition of low-yielding orchards to that of the closest successful neighbors in two major Brazilian peach-producing regions, regional diagnosis differed from local diagnosis, indicating that regional standards may fail to fit local conditions. Local diagnosis requires well-documented Humboldtian data sets that can be acquired through ethical collaboration between researchers and stakeholders.
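
The centered log ratio and the nearest-successful-neighbour search described above can be sketched in a few lines. The clr definition is standard; the function names and the three-part compositions are hypothetical and are not taken from the study's data.

```python
import math

def clr(composition):
    """Centered log ratio: log of each part minus the mean log (all parts must be > 0)."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def closest_successful(defective, successful):
    """Return (index, distance) of the successful composition nearest in clr space."""
    target = clr(defective)
    distances = [math.dist(target, clr(ref)) for ref in successful]
    best = min(range(len(successful)), key=distances.__getitem__)
    return best, distances[best]

# Hypothetical three-part foliar compositions (arbitrary units).
low_yield = [2.0, 1.0, 1.0]
high_yield = [[1.0, 1.0, 2.0], [4.0, 2.1, 2.0]]
index, distance = closest_successful(low_yield, high_yield)
```

Because clr values depend only on the ratios between parts, a composition and any rescaled copy of it map to the same point, so the Euclidean distance compares nutrient balance rather than absolute concentration.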

Rerouting an Organocatalytic Reaction by Intercepting its Reactive Intermediates

10.26434/chemrxiv.13061717 ◽ 2020

Author(s): Santosh C. Gadekar ◽ Vasudevan Dhayalan ◽ Ashim Nandi ◽ Inbal L. Zak ◽ Shahar Barkai ◽ ...

Keyword(s): Machine Learning ◽ Experimental Studies ◽ Aromatic Aldehydes ◽ Reactive Intermediates ◽ Classification Model ◽ Chemical Transformations ◽ Exchange Reactions ◽ Selective Stabilization ◽ Hydrogen Deuterium ◽ Shed Light

Reactive intermediates are key to halting and promoting chemical transformations; however, owing to their elusive nature, they are seldom harnessed for reaction design. Herein, we describe studies aimed at stabilizing reactive intermediates in the N-heterocyclic carbene (NHC) catalytic cycle, which enabled fully shutting down the known benzoin coupling pathway while rerouting its intermediates toward deuteration. The reversible nature of NHC catalysis and the selective stabilization of reaction intermediates facilitated clean hydrogen-deuterium exchange reactions of aromatic aldehydes by D₂O, even for challenging electron-withdrawing substrates. The addition of catalytic amounts of phenyl boronic acid was used to further stabilize highly reactive intermediates and mitigate the formation of benzoin coupling by-products. The mechanistic understanding at the foundation of this work resulted in unprecedented mild conditions with base and catalyst loadings as low as 0.1 mol%, and a scalable deuteration reaction applicable to a broad substrate scope with outstanding functional group tolerance. More importantly, adopting this approach enabled the construction of a machine-learning derived guideline for identifying the appropriate catalyst and conditions for different substrates based on a logistic regression classification model. Experimental studies combined with machine learning and computational methods shed light on the non-trivial mechanistic underpinnings of this reaction.

Enhance Image Classification Performance Via Unsupervised Pre-trained Transformers Language Models

10.21203/rs.3.rs-93060/v1 ◽ 2020

Author(s): Dezhou Shen

Keyword(s): Image Classification ◽ Language Processing ◽ Classification Performance ◽ Language Models ◽ Classification Model ◽ Linear Probe ◽ Image Set ◽ Image Pixels ◽ The Difference ◽ Fully Connected

Image classification and categorization are essential to a machine's ability to tell images apart. As Bidirectional Encoder Representations from Transformers has become popular in many natural language processing tasks in recent years, it is intuitive to use such pre-trained language models to enhance computer vision tasks, e.g., image classification. In this paper, image pixels are encoded using pre-trained transformers and then passed to a fully connected layer; the resulting classification model outperforms the Wide ResNet model and the linear-probe iGPT-L model, achieving an accuracy of 99.60% to 99.74% on the CIFAR-10 image set and an accuracy of 99.10% to 99.76% on the CIFAR-100 image set.

Handwritten digits recognition with decision tree classification: a machine learning approach

International Journal of Electrical and Computer Engineering (IJECE) ◽ 10.11591/ijece.v9i5.pp4446-4451 ◽ 2019 ◽ Vol 9 (5) ◽ pp. 4446

Author(s): Tsehay Admassu Assegie ◽ Pramod Sekharan Nair

Keyword(s): Machine Learning ◽ Decision Tree ◽ Classification Model ◽ Learning Approach ◽ Data Sets ◽ The Past ◽ Decision Tree Classification ◽ Machine Learning Approach ◽ Future Data ◽ Class Labels

Handwritten digit recognition is an area of machine learning in which a machine is trained to identify handwritten digits. One method of achieving this is with a decision tree classification model. Decision tree classification is a machine learning approach that uses the predefined labels of previously seen data sets to predict the classes of future data sets whose class labels are unknown. In this paper, we have used the standard Kaggle digits dataset for the recognition of handwritten digits using a decision tree classification approach, and we have evaluated the accuracy of the model for each digit from 0 to 9.
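
The per-digit evaluation mentioned at the end of this abstract can be sketched as below. The decision tree itself is not shown; the label lists are hypothetical stand-ins for a held-out set and a trained model's predictions, and `per_class_accuracy` is an illustrative name rather than anything from the paper.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified samples for each true class label."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

# Hypothetical true digits and predictions.
true_digits = [0, 0, 1, 1, 1, 7]
predicted_digits = [0, 1, 1, 1, 7, 7]
scores = per_class_accuracy(true_digits, predicted_digits)
```

Reporting accuracy per digit, rather than a single overall figure, shows which digits the tree confuses most often.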
