matthew correlation coefficient
Recently Published Documents


Author(s):  
Yakobus Wiciaputra ◽  
Julio Young ◽  
Andre Rusli

With the large amount of text information circulating on the internet, there is a need for a solution that can help process text data for various purposes. In Indonesia, text information circulating on the internet generally uses two languages, English and Indonesian. This research focuses on building a model that can classify text in more than one language, commonly known as multilingual text classification. The multilingual text classification uses the XLM-RoBERTa model in its implementation. This study applied the transfer learning concept used by XLM-RoBERTa to build a classification model for texts in Indonesian using only the English News Dataset as a training dataset, obtaining a Matthew Correlation Coefficient value of 42.2%. The highest accuracy was obtained when the large Mixed News Dataset (108,190) was used for model training: tested on a large English News Dataset (37,886), the model reached a Matthew Correlation Coefficient value of 90.8%, accuracy of 93.3%, precision of 93.4%, recall of 93.3%, and F1 of 93.3%; tested on a large Indonesian News Dataset (70,304), it reached a Matthew Correlation Coefficient value of 86.4% and accuracy, precision, recall, and F1 values of 90.2%. Keywords: Multilingual Text Classification, Natural Language Processing, News Dataset, Transfer Learning, XLM-RoBERTa


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Michael Bernhofer ◽  
...  

Abstract The emergence of SARS-CoV-2 variants stressed the demand for tools that help interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthew Correlation Coefficient, MCC, for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Scoring without alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) showed our approach to be competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor in a statistically significant manner, independently of the performance measure applied (incl. two-state accuracy: Q2, MCC, Spearman and Pearson correlation). Lastly, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. 
Our method predicted SAV effects for the entire human proteome (~ 20k proteins) within 40 minutes on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
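
For orientation, the two-state MCC reported in this abstract is the Matthews correlation coefficient of a binary confusion matrix. The formula below is the conventional definition, not something quoted from the paper; the symbols TP, TN, FP and FN (true/false positives and negatives) are standard notation introduced here for illustration.

```latex
\mathrm{MCC} =
  \frac{TP \cdot TN - FP \cdot FN}
       {\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction). It is undefined when any factor under the square root is zero, and is often set to 0 by convention in that case.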


2021 ◽  
Vol 38 (5) ◽  
pp. 1439-1447
Author(s):  
Zarith Liyana Zahari ◽  
Mahfuzah Mustafa ◽  
Zaridah Mat Zain ◽  
Rafiuddin Abdubrani ◽  
Faradila Naim

Prolonged stress needs to be identified and controlled before it harms physical and mental condition. This research used questionnaire and physiological approaches to determine stress. The EEG signal is an electrophysiological signal whose features can be analyzed. The standard features used are peak-to-peak value, mean, standard deviation and root mean square (RMS). The features unique to this research are the proposed Matthew Correlation Coefficient Advanced (MCCA) and multimodal capabilities in frequency and time-frequency analysis. In the frequency domain, Power Spectral Density (PSD) techniques were applied, while Short Time Fourier Transform (STFT) and Continuous Wavelet Transform (CWT) were used to extract seven features in the time-frequency domain. Methods applied in previous works are still limited by the stress indices. The second contribution is the merging of quantitative scores with physiological measurements, which extends the stress scale from three levels to six levels based on a music application. To validate the proposed method and enhance performance between electroencephalogram (EEG) signals and stress score, support vector machine (SVM), random forest (RF) and K-nearest neighbor (KNN) classifiers were used. RF achieved the best performance, with an average accuracy of 85% ± 10% in Ten-fold and K-fold techniques, compared with SVM and KNN.


2021 ◽  
Vol 83 (6) ◽  
pp. 53-61
Author(s):  
Mahfuzah Mustafa ◽  
Zarith Liyana Zahari ◽  
Rafiuddin Abdubrani

Music and humans are closely connected because music can reduce stress. The state of stress can be measured using an electroencephalogram (EEG) signal, which contains arousal and valence index values. Previous studies found that the Matthew Correlation Coefficient (MCC) achieves an accuracy of 85 ± 5%. Arousal indicates strong emotion, and valence indicates the positive or negative degree of emotion. Arousal and valence values can be used to measure accuracy performance. This research focuses on enhancing the MCC parameter equation based on arousal and valence values to maximize the accuracy percentage in frequency-domain and time-frequency-domain analysis. Twenty-one features were used to improve the significance of the feature extraction results and the investigated arousal and valence values. The feature extraction involved the alpha, beta, delta and theta frequency bands in measuring the arousal and valence index formula. Based on the results, the arousal and valence indices are accepted as parameters in the MCC equations. However, in certain cases the MCC parameter must be improved to achieve a high accuracy percentage, so this research proposes the Matthew correlation coefficient advanced (MCCA), which uses a six sigma method to improve the performance result. In conclusion, the MCCA equation is established to enhance the existing MCC parameter and improve the accuracy percentage up to 99.9% for the arousal and valence index.


Author(s):  
S. Bhaskara Naik ◽  
B. Mahesh

Malware is any program or file that is harmful to a computer user. Types of malware include computer viruses, worms, Trojan horses and spyware. These malicious programs can perform a variety of functions such as stealing, encrypting or deleting sensitive data, altering or hijacking core computing functions and monitoring users' computer activity. Malware detection is the process of scanning the computer and files to detect malware. It is effective at detecting malware because it involves multiple tools and approaches. It is not a one-way process; it is quite complex. The good thing is that malware detection and removal take less than 50 seconds. The exponential growth of malware poses a great threat to the security of confidential information. The problem with many of the existing classification algorithms is their low performance in terms of their ability to detect and prevent malware from infecting the computer system. There is an urgent need to evaluate the performance of the existing machine learning classification algorithms used for malware detection. This will help in creating more robust and efficient algorithms that can overcome the weaknesses of the existing algorithms. Recently, machine learning techniques have been the main focus of security experts for detecting malware and predicting their families dynamically. However, to the best of our knowledge, there exists no comprehensive work that compares and evaluates a sufficient number of machine learning techniques for classifying malware and benign samples. In this work, we conducted a set of experiments to evaluate machine learning techniques for detecting malware and classifying them into their respective families dynamically. This study evaluated the performance of classification algorithms such as J45, LMT, Naive Bayes, Random Forest, MLP Classifier, Random Tree, AdaBoost and KStar. 
The performance of the algorithms was evaluated in terms of Accuracy, Precision, Recall, Kappa Statistics, F-Measure, Matthew Correlation Coefficient, Receiver Operator Characteristics Area and Root Mean Squared Error using the WEKA machine learning and data mining simulation tool. Our experimental results showed that the Random Forest algorithm produced the best accuracy of 99.2%. This clearly shows that the Random Forest algorithm achieves good accuracy rates in detecting malware.


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Abdul Karim ◽  
Matthew Lee ◽  
Thomas Balle ◽  
Abdul Sattar

Abstract. Motivation: Human ether-a-go-go-related gene (hERG) channel blockade by small molecules is a big concern during drug development in the pharmaceutical industry. Blockade of hERG channels may cause prolonged QT intervals that potentially could lead to cardiotoxicity. Various in-silico techniques including deep learning models are widely used to screen out small molecules with potential hERG related toxicity. Most of the published deep learning methods utilize a single type of features, which might restrict their performance. Methods based on more than one type of features, such as DeepHIT, struggle with the aggregation of extracted information. DeepHIT shows better performance when evaluated against one or two accuracy metrics such as negative predictive value (NPV) and sensitivity (SEN) but struggles when evaluated against others such as Matthew correlation coefficient (MCC), accuracy (ACC), positive predictive value (PPV) and specificity (SPE). Therefore, there is a need for a method that can efficiently aggregate information gathered from models based on different chemical representations and boost hERG toxicity prediction over a range of performance metrics. Results: In this paper, we propose a deep learning framework based on step-wise training to predict hERG channel blocking activity of small molecules. Our approach utilizes five individual deep learning base models with their respective base features and a separate neural network to combine the outputs of the five base models. By using three external independent test sets with potency activity of IC50 at a threshold of 10 μM, our method achieves better performance for a combination of classification metrics. We also investigate the effective aggregation of chemical information extracted for robust hERG activity prediction. 
In summary, CardioTox net can serve as a robust tool for screening small molecules for hERG channel blockade in drug discovery pipelines and performs better than previously reported methods on a range of classification metrics.
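
The metrics compared in this abstract (SEN, SPE, PPV, NPV, ACC and MCC) can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the conventional definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper or from CardioTox net.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration; a ratio with a zero denominator
    is reported as 0.0 by convention.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SEN": ratio(tp, tp + fn),                  # sensitivity (recall)
        "SPE": ratio(tn, tn + fp),                  # specificity
        "PPV": ratio(tp, tp + fp),                  # positive predictive value
        "NPV": ratio(tn, tn + fn),                  # negative predictive value
        "ACC": ratio(tp + tn, tp + fp + tn + fn),   # accuracy
        "MCC": ratio(tp * tn - fp * fn, mcc_den),   # Matthews correlation coefficient
    }


# Made-up counts, only to show the call.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because MCC uses all four counts, a model can score well on NPV or SEN alone and still have a modest MCC, which is the kind of gap the abstract describes.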


2020 ◽  
Vol 5 (2) ◽  
pp. 57
Author(s):  
Novia Hasdyna ◽  
Rozzi Kesuma Dinata

K-Nearest Neighbor (K-NN) is a machine learning algorithm used to classify data. This study aims to measure the performance of the K-NN algorithm using the Matthew Correlation Coefficient (MCC). The data used in this study describe ornamental fish in three classes: Premium, Medium, and Low. The analysis of K-NN with Euclidean distance gave the highest MCC value for the Medium class, 0.786542, followed by the Premium class with 0.567434 and the Low class with 0.435269. Overall, the MCC value is 0.596415, which equals the mean of the three per-class values.
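
The abstract does not say how the per-class MCC values were obtained. One common approach, assumed here, is one-vs-rest: each class in turn is treated as positive and the other two as negative. The sketch below shows that calculation and the mean across classes; the overall figure of 0.596415 above equals the mean of the three reported class values. The confusion matrix and the function name `one_vs_rest_mcc` are hypothetical and are not the study's data or code.

```python
from math import sqrt


def one_vs_rest_mcc(confusion, k):
    """MCC for class k against all other classes.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                        # true k, predicted other
    fp = sum(confusion[i][k] for i in range(n)) - tp   # true other, predicted k
    tn = total - tp - fn - fp
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


labels = ["Premium", "Medium", "Low"]
# Made-up 3x3 confusion matrix, only to illustrate the calculation.
confusion = [
    [8, 1, 1],
    [1, 8, 1],
    [1, 1, 8],
]

per_class = {name: one_vs_rest_mcc(confusion, k) for k, name in enumerate(labels)}
mean_mcc = sum(per_class.values()) / len(per_class)

for name, value in per_class.items():
    print(f"{name}: {value:.6f}")
print(f"Mean MCC: {mean_mcc:.6f}")
```

Averaging one-vs-rest scores is only one way to summarise a multi-class result; a single multi-class MCC computed from the full confusion matrix would generally give a different number.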

