Automatic Misinformation Detection About COVID-19 in Brazilian Portuguese WhatsApp Messages

During the coronavirus pandemic, the problem of misinformation arose once again, quite intensely, through social networks. In Brazil, one of the primary sources of misinformation is the messaging application WhatsApp. However, due to WhatsApp's private messaging nature, there are still few methods of misinformation detection developed specifically for this platform. In this context, automatic misinformation detection (MID) about COVID-19 in Brazilian Portuguese WhatsApp messages becomes a crucial challenge. In this work, we present COVID-19.BR, a data set of WhatsApp messages about coronavirus in Brazilian Portuguese, collected from Brazilian public groups and manually labeled. We then investigate different machine learning methods in order to build an efficient MID for WhatsApp messages. So far, our best result is an F1 score of 0.774, which is held down by the predominance of short texts. However, when texts with fewer than 50 words are filtered out, the F1 score rises to 0.85.
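The effect of the length filter on the F1 score can be illustrated with a small sketch. It is not the authors' pipeline: the function names, the whitespace-based word count, and the toy messages are assumptions made only for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def keep_long_texts(messages, labels, predictions, min_words=50):
    """Drop messages with fewer than min_words whitespace-separated words."""
    kept = [
        (y, p)
        for m, y, p in zip(messages, labels, predictions)
        if len(m.split()) >= min_words
    ]
    return [y for y, _ in kept], [p for _, p in kept]


# Toy data (hypothetical): one short message and two long ones.
messages = ["curta", " ".join(["palavra"] * 50), " ".join(["palavra"] * 60)]
labels = [1, 1, 0]        # 1 = misinformation
predictions = [0, 1, 0]   # the short message is misclassified

f1_all = f1_score(labels, predictions)
long_labels, long_predictions = keep_long_texts(messages, labels, predictions)
f1_long = f1_score(long_labels, long_predictions)
```

In this toy case the only error falls on the short message, so the score on the filtered subset is higher than on the full set.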

Comparison of machine learning methods for crack localization

Acta et Commentationes Universitatis Tartuensis de Mathematica ◽

10.12697/acutm.2019.23.13 ◽

2019 ◽

Vol 23 (1) ◽

pp. 125-142

Author(s):

Helle Hein ◽

Ljubov Jaanuska

Keyword(s):

Machine Learning ◽

Random Forests ◽

Crack Depth ◽

Haar Wavelet ◽

Extensive Investigation ◽

Learning Methods ◽

Data Set ◽

Crack Location ◽

Discrete Transform

In this paper, the Haar wavelet discrete transform, artificial neural networks (ANNs), and random forests (RFs) are applied to predict the location and severity of a crack in an Euler–Bernoulli cantilever subjected to transverse free vibration. An extensive investigation into two data collection sets and machine learning methods showed that the depth of a crack is more difficult to predict than its location. The data set of eight natural frequency parameters produces more accurate predictions of the crack depth, whereas the data set of eight Haar wavelet coefficients produces more precise predictions of the crack location. Furthermore, the analysis of the results showed that an ensemble of 50 ANNs trained by the Bayesian regularization and Levenberg–Marquardt algorithms slightly outperforms RF.
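As a rough illustration of how a signal of eight samples is turned into eight Haar coefficients, the sketch below implements a generic orthonormal discrete Haar transform. It is an assumption-laden example, not the feature extraction used in the paper; the function name and the sample signal are hypothetical.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar transform of a signal whose length is a power of two.

    Returns [overall approximation, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        half = len(approx) // 2
        next_approx = [
            (approx[2 * i] + approx[2 * i + 1]) / math.sqrt(2) for i in range(half)
        ]
        detail = [
            (approx[2 * i] - approx[2 * i + 1]) / math.sqrt(2) for i in range(half)
        ]
        details = detail + details  # coarser levels end up in front
        approx = next_approx
    return approx + details


# Hypothetical eight-sample signal with a local jump between samples 5 and 6.
sample = [0.0, 0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1]
coefficients = haar_dwt(sample)
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared samples, and a local discontinuity shows up as a comparatively large detail coefficient.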

Modelling of diesel engine performance using advanced machine learning methods under scarce and exponential data set

Applied Soft Computing ◽

10.1016/j.asoc.2013.06.006 ◽

2013 ◽

Vol 13 (11) ◽

pp. 4428-4441 ◽

Cited By ~ 25

Author(s):

Ka In Wong ◽

Pak Kin Wong ◽

Chun Shun Cheung ◽

Chi Man Vong

Keyword(s):

Machine Learning ◽

Diesel Engine ◽

Engine Performance ◽

Learning Methods ◽

Data Set ◽

Smart Intelligent Computing and Applications - Smart Innovation, Systems and Technologies ◽

Analysis of Cancer Data Set with Statistical and Unsupervised Machine Learning Methods

10.1007/978-981-13-1921-1_27 ◽

2018 ◽

pp. 267-276

Author(s):

T. Panduranga Vital ◽

K. Dileep Kumar ◽

H. V. Bhagya Sri ◽

M. Murali Krishna

Keyword(s):

Machine Learning ◽

Learning Methods ◽

Data Set ◽

Unsupervised Machine Learning ◽

Cancer Data ◽

Assessing Replicability of Machine Learning Results: An Introduction to Methods on Predictive Accuracy in Social Sciences

Social Science Computer Review ◽

10.1177/0894439319888445 ◽

2019 ◽

pp. 089443931988844

Author(s):

Ranjith Vijayakumar ◽

Mike W.-L. Cheung

Keyword(s):

Machine Learning ◽

Empirical Data ◽

Fixed Effects ◽

Predictive Accuracy ◽

Support Vector ◽

Learning Methods ◽

Data Set ◽

Replication Studies ◽

Accuracy Measure

Machine learning methods have become very popular in diverse fields due to their focus on predictive accuracy, but little work has been conducted on how to assess the replicability of their findings. We introduce and adapt replication methods advocated in psychology to the aims and procedural needs of machine learning research. In Study 1, we illustrate these methods with the use of an empirical data set, assessing the replication success of a predictive accuracy measure, namely, R² on the cross-validated and test sets of the samples. We introduce three replication aims. First, tests of inconsistency examine whether single replications have successfully rejected the original study. Rejection will be supported if the 95% confidence interval (CI) of R² difference estimates between replication and original does not contain zero. Second, tests of consistency help support claims of successful replication. We can decide a priori on a region of equivalence, where population values of the difference estimates are considered equivalent for substantive reasons. The 90% CI of a difference estimate lying fully within this region supports replication. Third, we show how to combine replications to construct meta-analytic intervals for better precision of predictive accuracy measures. In Study 2, R² is reduced from the original in a subset of replication studies to examine the ability of the replication procedures to distinguish true replications from nonreplications. We find that when combining studies sampled from the same population to form meta-analytic intervals, random-effects methods perform best for cross-validated measures while fixed-effects methods work best for test measures. Among machine learning methods, regression was comparable to many complex methods, while support vector machine performed most reliably across a variety of scenarios. Social scientists who use machine learning to model empirical data can use these methods to enhance the reliability of their findings.
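The three replication aims can be written down as simple decision rules. The sketch below is a minimal illustration, assuming the confidence intervals have already been estimated elsewhere; the function names, the example intervals, and the use of standard inverse-variance weighting for the fixed-effects interval are assumptions, not the authors' code.

```python
import math


def is_inconsistent(ci95):
    """Test of inconsistency: the 95% CI of the R² difference excludes zero."""
    low, high = ci95
    return not (low <= 0.0 <= high)


def is_consistent(ci90, equivalence_region):
    """Test of consistency: the 90% CI lies fully inside the equivalence region."""
    low, high = ci90
    region_low, region_high = equivalence_region
    return region_low <= low and high <= region_high


def fixed_effect_interval(estimates, standard_errors, z=1.96):
    """Inverse-variance weighted pooled estimate with a normal-theory CI."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, (pooled - z * pooled_se, pooled + z * pooled_se)


# Hypothetical numbers, for illustration only.
rejects_original = is_inconsistent((0.02, 0.10))
supports_replication = is_consistent((-0.02, 0.03), (-0.05, 0.05))
pooled_r2, pooled_ci = fixed_effect_interval([0.5, 0.7], [0.1, 0.1])
```

A random-effects interval would additionally estimate between-study variance and add it to each study's variance before weighting; that step is omitted here.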

Prediction of Collapsibility of Loess of Construction Sites in Xining Based on Machine Learning Methods

10.21203/rs.3.rs-307514/v1 ◽

2021 ◽

Author(s):

Qifei Zhao ◽

Xiaojun Li ◽

Yunning Cao ◽

Zhikun Li ◽

Jixin Fan

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Training Data ◽

Support Vector ◽

Engineering Practice ◽

Burial Depth ◽

Learning Methods ◽

Data Set ◽

North East

Collapsibility of loess is a significant factor affecting engineering construction in loess areas, and testing the collapsibility of loess is costly. In this study, a total of 4,256 loess samples are collected from the north, east, west and middle regions of Xining. 70% of the samples are used to generate the training data set, and the rest are used to generate the verification data set, so as to construct and validate the machine learning models. The six most important factors are selected from thirteen factors by using grey relational analysis and multicollinearity analysis: burial depth, water content, specific gravity of soil particles, void rate, geostatic stress and plasticity limit. In order to predict the collapsibility of loess, four machine learning methods are studied and compared: Support Vector Machine (SVM), Random Subspace Based Support Vector Machine (RSSVM), Random Forest (RF) and Naïve Bayes Tree (NBTree). The receiver operating characteristic (ROC) curve indicators, standard error (SD) and 95% confidence interval (CI) are used to verify and compare the models in different research areas. The results show that the RF model is the most efficient in predicting the collapsibility of loess in Xining, and its average AUC is above 80%, which means it can be used in engineering practice.
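The 70/30 split and the AUC used for model comparison can be sketched as follows. This is a generic illustration, not the study's code: the function names, the random seed, and the toy scores are assumptions, and the AUC is computed by the pairwise-ranking definition.

```python
import random


def train_validation_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical sample identifiers and model scores.
train, validation = train_validation_split(list(range(10)))
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based formula would scale better on thousands of samples.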

Natural language processing systems for data extraction and mapping on the basis of unstructured text blocks

Proceedings of the International conference “InterCarto/InterGIS” ◽

10.35595/2414-9179-2020-3-26-53-61 ◽

2020 ◽

Vol 26 (3) ◽

pp. 53-61

Author(s):

Pavel Kikin ◽

Alexey Kolesnikov ◽

Alexey Portnov ◽

Denis Grischenko

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Mathematical Models ◽

Optimal Algorithm ◽

The State ◽

Gradient Boosting ◽

Learning Methods ◽

Data Set ◽

Spatio Temporal

The state of ecological systems, along with their general characteristics, is almost always described by indicators that vary in space and time, which significantly complicates the construction of mathematical models for predicting the state of such systems. One way to simplify and automate the construction of such models is the use of machine learning methods. The article compares traditional and neural-network-based algorithms and machine learning methods for predicting spatio-temporal series representing ecosystem data. Analysis and comparison were carried out among the following algorithms and methods: logistic regression, random forest, gradient boosting on decision trees, SARIMAX, and long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks. To conduct the study, data sets were selected that have both spatial and temporal components: the number of mosquitoes, the number of dengue infections, the physical condition of tropical grove trees, and the water level in a river. The article discusses the necessary steps for preliminary data processing, depending on the algorithm used. In addition, Kolmogorov complexity was calculated as one of the parameters that can help formalize the choice of the optimal algorithm when constructing mathematical models of spatio-temporal data for the sets used. Based on the results of the analysis, recommendations are given on the application of certain methods and specific technical solutions, depending on the characteristics of the data set that describes a particular ecosystem.
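Kolmogorov complexity is not computable exactly, and the abstract does not say how it was estimated. One common proxy, assumed here purely for illustration, is the compression ratio of the serialized series; the function name, the rounding precision, and the two toy series are hypothetical.

```python
import random
import zlib


def compression_ratio(series, decimals=2):
    """Compressed size divided by raw size of the series written as text.

    Lower values suggest a more regular (less complex) series.
    """
    text = ",".join(f"{value:.{decimals}f}" for value in series).encode("ascii")
    if not text:
        raise ValueError("series must not be empty")
    return len(zlib.compress(text, 9)) / len(text)


# A strictly periodic series versus a pseudo-random one (both hypothetical).
periodic = [float(i % 7) for i in range(500)]
rng = random.Random(0)
noisy = [rng.random() * 100 for _ in range(500)]

ratio_periodic = compression_ratio(periodic)
ratio_noisy = compression_ratio(noisy)
```

A series that compresses well may be adequately served by a simple seasonal model, whereas a poorly compressible one leaves less regular structure for any model to exploit; this reading is a heuristic, not a result from the article.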

Comparison of Soil Total Nitrogen Content Prediction Models Based on Vis-NIR Spectroscopy

Sensors ◽

10.3390/s20247078 ◽

2020 ◽

Vol 20 (24) ◽

pp. 7078

Author(s):

Yueting Wang ◽

Minzan Li ◽

Ronghua Ji ◽

Minjuan Wang ◽

Lihua Zheng

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Prediction Model ◽

Total Nitrogen ◽

Nir Spectroscopy ◽

Learning Method ◽

Learning Methods ◽

Data Set ◽

Conventional Machine

Visible-near-infrared (Vis-NIR) spectroscopy is one of the most important methods for non-destructive and rapid detection of soil total nitrogen (STN) content. In order to find a practical way to build an STN content prediction model, three conventional machine learning methods and one deep learning approach are investigated, and their predictive performances are compared and analyzed using a public dataset called LUCAS Soil (19,019 samples). The three conventional machine learning methods are ordinary least square estimation (OLSE), random forest (RF), and extreme learning machine (ELM), while for the deep learning method, three different structures of convolutional neural network (CNN) incorporating the Inception module are constructed and investigated. In order to clarify the effectiveness of different pre-treatments for predicting STN content, the three conventional machine learning methods are combined with four pre-processing approaches (baseline correction, smoothing, dimensional reduction, and feature selection), and the combinations are investigated, compared, and analyzed. The results indicate that the baseline-corrected and smoothed ELM model reaches practical precision (coefficient of determination (R2) = 0.89, root mean square error of prediction (RMSEP) = 1.60 g/kg, and residual prediction deviation (RPD) = 2.34). Among the three differently structured CNN models, the one with more 1 × 1 convolutions performs better (R2 = 0.93; RMSEP = 0.95 g/kg; and RPD = 3.85 in the optimal case).
In addition, in order to evaluate the influence of data set characteristics on the model, the LUCAS data set was divided into different data subsets according to dataset size, organic carbon (OC) content and countries, and the results show that the deep learning method is more effective and practical than conventional machine learning methods and, on the premise of enough data samples, it can be used to build a robust STN content prediction model with high accuracy for the same type of soil with similar agricultural treatment.
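The three reported metrics can be computed from reference and predicted values as sketched below. The definitions are the usual ones (RPD taken as the standard deviation of the reference values divided by RMSEP); the use of the sample standard deviation, the function name, and the toy values are assumptions, since the abstract does not spell out the exact formulas.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R2, RMSEP, RPD) for paired reference and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmsep = math.sqrt(ss_res / n)
    sd_reference = math.sqrt(ss_tot / (n - 1))  # sample standard deviation
    rpd = sd_reference / rmsep
    return r2, rmsep, rpd


# Hypothetical STN values in g/kg.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2, rmsep, rpd = regression_metrics(reference, predicted)
```

Higher R2 and RPD and lower RMSEP indicate a better model, which is how the ELM and CNN figures above are compared.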

Predicting The Cricket Match Outcome Using Crowd Opinions On Social Networks: A Comparative Study Of Machine Learning Methods

Malaysian Journal of Computer Science ◽

10.22452/mjcs.vol30no1.5 ◽

2017 ◽

Vol 30 (1) ◽

pp. 63-76 ◽

Cited By ~ 16

Author(s):

Raza Ul Mustafa ◽

M. Saqib Nawaz ◽

M. Ikram Ullah Lali ◽

Tehseen Zia ◽

Waqar Mehmood

Keyword(s):

Machine Learning ◽

Social Networks ◽

Comparative Study ◽

Learning Methods ◽

STACKING OF THE SGTM NEURAL-LIKE STRUCTURE WITH RBF LAYER BASED ON GENERATION OF A RANDOM CURTAIN OF ITS HYPERPARAMETERS FOR PREDICTION TASKS

Ukrainian Journal of Information Technology ◽

10.23939/ujit2021.03.049 ◽

2021 ◽

Vol 3 (1) ◽

pp. 49-55

Author(s):

R. O. Tkachenko ◽

I. V. Izonіn ◽

V. M. Danylyk ◽

V. Yu. Mykhalevych ◽

...

Keyword(s):

Machine Learning ◽

Prediction Accuracy ◽

Experimental Studies ◽

Real Data ◽

Optimal Number ◽

Individual Member ◽

Optimal Parameters ◽

Learning Methods ◽

Data Set ◽

Improving prediction accuracy with artificial intelligence tools is an important task in various industries, economics, and medicine. Ensemble learning is one of the possible options for solving this task. In particular, the construction of stacking models based on different machine learning methods, or using different parts of the existing data set, demonstrates high prediction accuracy. However, the need for proper selection of ensemble members, their optimal parameters, etc., necessitates large time costs for the construction of such models. This paper proposes a slightly different approach to building a simple but effective ensemble method. The authors developed a new model of stacking of nonlinear SGTM neural-like structures, which is based on the use of only one type of ANN as the element base of the ensemble and the use of the same training sample for all members of the ensemble. This approach provides a number of advantages over the procedures for building ensembles based on different machine learning methods, at least in the direction of selecting the optimal parameters for each of them. In our case, a tuple of random hyperparameters for each individual member of the ensemble was used as the basis of the ensemble. That is, each combined SGTM neural-like structure with an additional RBF layer, as a separate member of the ensemble, is trained using different, randomly selected values of RBF centers and centers of mass. This provides the necessary variety of ensemble elements. Experimental studies on the effectiveness of the developed ensemble were conducted using a real data set. The task is to predict the amount of health insurance costs based on a number of independent attributes. The optimal number of ensemble members, which provides the highest prediction accuracy, is determined experimentally. The results of the developed ensemble are compared with the existing methods of this class. The developed ensemble is shown to achieve the highest prediction accuracy with a satisfactory training time.

Machine-learning methods in the classification of water bodies

Environmental & Socio-economic Studies ◽

10.1515/environ-2016-0010 ◽

2016 ◽

Vol 4 (2) ◽

pp. 34-42 ◽

Cited By ~ 1

Author(s):

Marek Sołtysiak ◽

Marcin Blachnik ◽

Dominika Dąbrowska

Keyword(s):

Machine Learning ◽

Water Body ◽

Urban Areas ◽

Water Bodies ◽

Learning Methods ◽

Amphibian Species ◽

Data Set ◽

Nearest Neighbours

Amphibian species have been considered as useful ecological indicators. They are used as indicators of environmental contamination, ecosystem health and habitat quality. Amphibian species are sensitive to changes in the aquatic environment and therefore may form the basis for the classification of water bodies. Water bodies in which there are a large number of amphibian species are especially valuable even if they are located in urban areas. The automation of the classification process allows for a faster evaluation of the presence of amphibian species in the water bodies. Three machine-learning methods (artificial neural networks, decision trees and the k-nearest neighbours algorithm) have been used to classify water bodies in Chorzów, one of 19 cities in the Upper Silesia Agglomeration. In this case, classification is a supervised data mining method consisting of several stages such as building the model, the testing phase and the prediction. Seven natural and anthropogenic features of water bodies (e.g. the type of water body, aquatic plants, the purpose of the water body (destination), position of the water body in relation to any possible buildings, condition of the water body, the degree of littering, the shore type and fishing activities) have been taken into account in the classification. The data set used in this study involved information about 71 different water bodies and 9 amphibian species living in them. The results showed that the best average classification accuracy was obtained with the multilayer perceptron neural network.
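Of the three methods, the k-nearest neighbours algorithm is the simplest to sketch. The example below is a minimal illustration for categorical features, assuming a Hamming distance (the number of differing features); the feature values, labels, and function name are hypothetical and are not taken from the study's data set.

```python
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k training rows that differ from the query
    in the fewest categorical features (Hamming distance)."""
    ranked = sorted(
        (sum(a != b for a, b in zip(row, query)), index)
        for index, row in enumerate(train_features)
    )
    votes = Counter(train_labels[index] for _, index in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical water bodies: (type, aquatic plants, degree of littering).
water_bodies = [
    ("pond", "reeds", "clean"),
    ("pond", "reeds", "littered"),
    ("reservoir", "none", "littered"),
    ("reservoir", "none", "clean"),
    ("pond", "none", "clean"),
]
amphibians = ["present", "present", "absent", "absent", "present"]

prediction_a = knn_predict(water_bodies, amphibians, ("pond", "reeds", "clean"))
prediction_b = knn_predict(water_bodies, amphibians, ("reservoir", "none", "littered"))
```

With an odd k and two classes the vote cannot tie; for more classes or an even k a tie-breaking rule would need to be chosen explicitly.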