A corpus of debunked and verified user-generated videos

2019 ◽  
Vol 43 (1) ◽  
pp. 72-88 ◽  
Author(s):  
Olga Papadopoulou ◽  
Markos Zampoglou ◽  
Symeon Papadopoulos ◽  
Ioannis Kompatsiaris

Purpose As user-generated content (UGC) is entering the news cycle alongside content captured by news professionals, it is important to detect misleading content as early as possible and avoid disseminating it. The purpose of this paper is to present an annotated dataset of 380 user-generated videos (UGVs), 200 debunked and 180 verified, along with 5,195 near-duplicate reposted versions of them, and a set of automatic verification experiments aimed to serve as a baseline for future comparisons. Design/methodology/approach The dataset was formed using a systematic process combining text search and near-duplicate video retrieval, followed by manual annotation using a set of journalism-inspired guidelines. Following the formation of the dataset, the automatic verification step was carried out using machine learning over a set of well-established features. Findings Analysis of the dataset shows distinctive patterns in the spread of verified vs debunked videos, and the application of state-of-the-art machine learning models shows that the dataset poses a particularly challenging problem to automatic methods. Research limitations/implications Practical limitations constrained the current collection to three platforms: YouTube, Facebook and Twitter. Furthermore, there exists a wealth of information that can be drawn from the dataset analysis, which goes beyond the constraints of a single paper. Extension to other platforms and further analysis will be the object of subsequent research. Practical implications The dataset analysis indicates directions for future automatic video verification algorithms, and the dataset itself provides a challenging benchmark. Social implications Having a carefully collected and labelled dataset of debunked and verified videos is an important resource both for developing effective disinformation-countering tools and for supporting media literacy activities. Originality/value Besides its importance as a unique benchmark for research in automatic verification, the analysis also allows a glimpse into the dissemination patterns of UGC, and possible telltale differences between fake and real content.
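
As a rough illustration of the baseline verification experiment described above (a classifier trained over tabular features of the 200 debunked and 180 verified videos), the sketch below fits a random forest on synthetic placeholder features; the feature set and evaluation choices are assumptions, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_debunked, n_verified = 200, 180                    # counts reported in the abstract
X = rng.normal(size=(n_debunked + n_verified, 6))    # placeholder features (e.g. share counts, account age)
y = np.array([1] * n_debunked + [0] * n_verified)    # 1 = debunked, 0 = verified

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("5-fold F1:", cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```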

2020 ◽  
pp. 1-21 ◽  
Author(s):  
Clément Dalloux ◽  
Vincent Claveau ◽  
Natalia Grabar ◽  
Lucas Emanuel Silva Oliveira ◽  
Claudia Maria Cabral Moro ◽  
...  

Abstract Automatic detection of negated content is often a prerequisite in information extraction systems in various domains. The task matters particularly in the biomedical domain, where negation plays an important role. In this work, two main contributions are proposed. First, we work with languages which have been poorly addressed up to now: Brazilian Portuguese and French. We developed new corpora for these two languages, manually annotated with negation cues and their scope. Second, we propose automatic methods based on supervised machine learning for detecting negation cues and their scopes. The methods prove robust in both languages (Brazilian Portuguese and French) and in cross-domain (general and biomedical language) contexts. The approach is also validated on English data from the state of the art: it yields very good results and outperforms other existing approaches. Besides, the application is accessible and usable online. We expect that these contributions (new annotated corpora, an application accessible online, and cross-domain robustness) will improve the reproducibility of the results and the robustness of NLP applications.
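
The supervised setup described above (detect negation cues, then label their scope) can be illustrated with a minimal token-classification sketch; the toy corpus, BIO tag scheme and window features below are assumptions standing in for the richer models of the paper.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy annotated sentences: each token carries a cue/scope tag.
sentences = [
    (["the", "patient", "denies", "chest", "pain"],
     ["O",   "O",       "CUE",    "B-SCOPE", "I-SCOPE"]),
    (["no",  "fever",   "was",    "observed"],
     ["CUE", "B-SCOPE", "I-SCOPE", "I-SCOPE"]),
]

def token_features(tokens, i):
    # Simple window features around the current token.
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

X = [token_features(toks, i) for toks, _ in sentences for i in range(len(toks))]
y = [tag for _, tags in sentences for tag in tags]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict([token_features(["no", "cough"], 0)]))
```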


2015 ◽  
Vol 22 (5) ◽  
pp. 573-590 ◽  
Author(s):  
Mojtaba Maghrebi ◽  
Claude Sammut ◽  
S. Travis Waller

Purpose – The purpose of this paper is to study the implementation of machine learning (ML) techniques to automatically assess the feasibility of performing ready mixed concrete (RMC) dispatching jobs. Design/methodology/approach – Six ML techniques were selected and tested on data that was extracted from a developed simulation model and answered by a human expert. Findings – The results show that most of the selected algorithms performed similarly, achieving an accuracy of around 80 per cent on the examined cases. Practical implications – This approach can be applied in practice to match experts’ decisions. Originality/value – This paper studies the feasibility of handling complex concrete delivery problems with ML techniques. Currently, most of the concrete mixing process is done by machines; however, RMC dispatching still relies on human resources to complete many tasks. The authors therefore aim to reconstruct experts’ decisions as the only practical solution.
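
A minimal sketch of the comparison step, assuming a synthetic stand-in for the dispatch-feasibility data and an arbitrary selection of off-the-shelf classifiers (the abstract does not name the paper's six algorithms):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary "feasible / not feasible" data in place of the simulation output.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("SVM", SVC()),
                  ("naive Bayes", GaussianNB()),
                  ("kNN", KNeighborsClassifier())]:
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: accuracy {acc:.2f}")
```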


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Tressy Thomas ◽  
Enayat Rajabi

Purpose The primary aim of this study is to review studies on novel approaches proposed for data imputation, particularly in the machine learning (ML) area, along several dimensions including the type of method, the experimentation setup and the evaluation metrics used. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness are addressed in the proposals. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How are the experimentation setup, characteristics of the data sets and missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods? Design/methodology/approach The review went through the standard identification, screening and selection process. The initial search of electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage did not describe an MVI technique relevant to this study. The records were first screened by title for relevance, and 306 were identified as appropriate. Upon reviewing the abstracts, 151 records not eligible for this study were dropped. This resulted in 155 research papers suitable for full-text review, from which 117 papers were used to assess the review questions. Findings This study shows that clustering- and instance-based algorithms are the most commonly proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available repositories. A common approach is to treat the complete data set as the baseline and evaluate the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experimentations, while the missing data type and mechanism pertain to the capability of the imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge. Originality/value It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with particular kinds of missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithms. Imputations based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, are popular across various domains.
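
The evaluation protocol summarized above (complete data set as baseline, artificially induced missingness, imputation scored against the known values) can be sketched as follows; the data set, missingness ratio and the choice of a kNN imputer are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import KNNImputer

X_complete = load_diabetes().data            # complete data set used as the baseline
rng = np.random.default_rng(0)

X_missing = X_complete.copy()
mask = rng.random(X_missing.shape) < 0.2     # induce 20% missingness (MCAR assumption)
X_missing[mask] = np.nan

# Impute with k-nearest neighbors and score against the known ground truth.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
rmse = np.sqrt(np.mean((X_imputed[mask] - X_complete[mask]) ** 2))
print("RMSE on the imputed cells:", round(rmse, 4))
```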


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Emmanuel Adinyira ◽  
Emmanuel Akoi-Gyebi Adjei ◽  
Kofi Agyekum ◽  
Frank Desmond Kofi Fugar

Purpose Knowledge of the effect of various cash-flow factors on expected project profit is important to effectively manage productivity on construction projects. This study was conducted to develop and test the sensitivity of a Machine Learning Support Vector Regression Algorithm (SVRA) to predict construction project profit in Ghana. Design/methodology/approach The study relied on data from 150 institutional projects executed within the past five years (2014–2018) to develop the model. Eighty percent (80%) of the data from the 150 projects was used in the hyperparameter selection and final training phases of model development, and the remaining 20% for model testing. Using MATLAB for Support Vector Regression, the parameters available for tuning were the epsilon values, the kernel scale, the box constraint and standardisation. A sensitivity index was computed to determine the degree to which the independent variables impact the dependent variable. Findings The developed model's predictions perfectly fitted the data and explained all the variability of the response data around its mean. An average predictive accuracy of 73.66% was achieved with all the variables on the different projects in validation. The developed SVR model was sensitive to labour and loan. Originality/value The developed SVRA combines variation, defective works and labour with other financial constraints, which have been the variables used in previous studies. It will aid contractors in predicting profit on completion at commencement and also provide information on the effect of changes to cash-flow factors on profit.
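
Since the authors' model was built with MATLAB's Support Vector Regression, the following is only a loose scikit-learn sketch of the same ingredients (standardisation, epsilon, box constraint C, an RBF kernel and an 80/20 split) on synthetic data, not the paper's implementation.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the 150 projects and their cash-flow features.
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)          # 80% training, 20% testing

model = make_pipeline(StandardScaler(),
                      SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X_train, y_train)
print("R^2 on held-out projects:", round(model.score(X_test, y_test), 3))
```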


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lam Hoang Viet Le ◽  
Toan Luu Duc Huynh ◽  
Bryan S. Weber ◽  
Bao Khac Quoc Nguyen

Purpose This paper aims to identify the disproportionate impacts of the COVID-19 pandemic on labor markets. Design/methodology/approach The authors conduct a large-scale survey on 16,000 firms from 82 industries in Ho Chi Minh City, Vietnam, and analyze the data set by using different machine-learning methods. Findings First, job loss and reduction in state-owned enterprises have been significantly larger than in other types of organizations. Second, employees of foreign direct investment enterprises suffer a significantly lower labor income than those of other groups. Third, the adverse effects of the COVID-19 pandemic on the labor market are heterogeneous across industries and geographies. Finally, firms with high revenue in 2019 are more likely to adopt preventive measures, including the reduction of labor forces. The authors also find a significant correlation between firms' revenue and labor reduction, as both traditional econometrics and machine-learning techniques suggest. Originality/value This study has two main policy implications. First, although government support through taxes has been provided, the authors highlight evidence that there may be some additional benefit from targeting firms that have characteristics associated with layoffs or other negative labor responses. Second, the authors provide information on which firm characteristics are associated with particular labor market responses such as layoffs, which may help target stimulus packages. Although the COVID-19 pandemic affects most industries and occupations, heterogeneous firm responses suggest that there could be several varieties of targeted policies: targeting firms that are likely to reduce labor forces or firms likely to face reduced revenue. In this paper, the authors outline several industries and firm characteristics which appear to be more directly reducing employee counts or having negative labor responses, which may point to more cost-effective stimulus.
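
A hedged illustration of the modelling idea (relating firm characteristics to a labor-reduction outcome with both a traditional logit and a machine-learning model) is sketched below; the variables and data are invented and do not reproduce the survey.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
firms = pd.DataFrame({
    "revenue_2019": rng.lognormal(10, 1, n),   # hypothetical firm characteristics
    "state_owned": rng.integers(0, 2, n),
    "fdi": rng.integers(0, 2, n),
})
reduced_labor = rng.integers(0, 2, n)          # 1 = firm reduced its labor force (synthetic)

models = {
    "logit": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, clf in models.items():
    auc = cross_val_score(clf, firms, reduced_labor, cv=5, scoring="roc_auc").mean()
    print(name, round(auc, 3))                 # near 0.5 here, since the labels are random
```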


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Mohamed Nadir Boucherit ◽  
Fahd Arbaoui

Purpose To constitute the input data, the authors carried out electrochemical experiments, performing voltammetric scans in a very cathodic potential region. The authors constituted an experimental table where, for each experiment, the current values recorded at a low polarization range and the pitting potential observed in the anodic region are noted. This study concerns carbon steel used in a nuclear installation, and the properties of the chemical solutions are close to those of the cooling fluid used in the circuit. Design/methodology/approach In a previous study, the authors demonstrated the effectiveness of machine learning in predicting the localized corrosion resistance of a material by considering the physicochemical properties of its environment as input data (Boucherit et al., 2019). In the present study, the authors improve the results by considering cathodic currents as input data. The reason for such an approach is to have input data that integrate both the surface state of the material and the physicochemical properties of its environment. Findings The experimental table was submitted to two neural networks, namely, a recurrent network and a convolution network. The convolution network gives better pitting potential predictions. Results also show that prediction from cathodic currents is better than that obtained by considering the physicochemical properties of the solution. Originality/value The originality of the study lies in the use of cathodic currents as input data. These data contain implicit information on both the chemical environment of the material and its surface condition. This approach appears to be more efficient than considering the chemical composition of the solution as input data. The objective of this study remains, at the same time, to seek the optimal neural architectures and the best input data.
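
A rough sketch of a one-dimensional convolutional regressor mapping a cathodic current scan to a pitting potential, loosely mirroring the convolution network the authors found most accurate; the shapes, layers and hyperparameters are assumptions, and the data below are random placeholders.

```python
import numpy as np
from tensorflow import keras

n_samples, scan_len = 200, 64
# Placeholder cathodic current curves and pitting potentials.
X = np.random.default_rng(0).normal(size=(n_samples, scan_len, 1))
y = np.random.default_rng(1).normal(size=(n_samples,))

model = keras.Sequential([
    keras.layers.Input(shape=(scan_len, 1)),
    keras.layers.Conv1D(16, kernel_size=5, activation="relu"),  # local patterns in the scan
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),                                      # regression output: pitting potential
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```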


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Pooya Tabesh

Purpose While it is evident that the introduction of machine learning and the availability of big data have revolutionized various organizational operations and processes, existing academic and practitioner research within decision process literature has mostly ignored the nuances of these influences on human decision-making. Building on existing research in this area, this paper aims to define these concepts from a decision-making perspective and elaborates on the influences of these emerging technologies on human analytical and intuitive decision-making processes. Design/methodology/approach The authors first provide a holistic understanding of important drivers of digital transformation. The authors then conceptualize the impact that analytics tools built on artificial intelligence (AI) and big data have on intuitive and analytical human decision processes in organizations. Findings The authors discuss similarities and differences between machine learning and two human decision processes, namely, analysis and intuition. While it is difficult to jump to any conclusions about the future of machine learning, human decision-makers seem to continue to monopolize the majority of intuitive decision tasks, which will help them keep the upper hand (vis-à-vis machines), at least in the near future. Research limitations/implications The work contributes to research on rational (analytical) and intuitive processes of decision-making at the individual, group and organization levels by theorizing about the way these processes are influenced by advanced AI algorithms such as machine learning. Practical implications Decisions are building blocks of organizational success. Therefore, a better understanding of the way human decision processes can be impacted by advanced technologies will prepare managers to better use these technologies and make better decisions. By clarifying the boundaries/overlaps among concepts such as AI, machine learning and big data, the authors contribute to their successful adoption by business practitioners. Social implications The work suggests that human decision-makers will not be replaced by machines if they continue to invest in what they do best: critical thinking, intuitive analysis and creative problem-solving. Originality/value The work elaborates on important drivers of digital transformation from a decision-making perspective and discusses their practical implications for managers.


2022 ◽  
pp. ASN.2021040538
Author(s):  
Arthur M. Lee ◽  
Jian Hu ◽  
Yunwen Xu ◽  
Alison G. Abraham ◽  
Rui Xiao ◽  
...  

Background Untargeted plasma metabolomic profiling combined with machine learning (ML) may lead to discovery of metabolic profiles that inform our understanding of pediatric CKD causes. We sought to identify metabolomic signatures in pediatric CKD based on diagnosis: FSGS, obstructive uropathy (OU), aplasia/dysplasia/hypoplasia (A/D/H), and reflux nephropathy (RN). Methods Untargeted metabolomic quantification (GC-MS/LC-MS, Metabolon) was performed on plasma from 702 Chronic Kidney Disease in Children study participants (n: FSGS=63, OU=122, A/D/H=109, and RN=86). Lasso regression was used for feature selection, adjusting for clinical covariates. Four methods were then applied to stratify significance: logistic regression, support vector machine, random forest, and extreme gradient boosting. ML training was performed on 80% total cohort subsets and validated on 20% holdout subsets. Important features were selected based on being significant in at least two of the four modeling approaches. We additionally performed pathway enrichment analysis to identify metabolic subpathways associated with CKD cause. Results ML models were evaluated on holdout subsets with receiver-operating characteristic and precision-recall area under the curve, F1 score, and Matthews correlation coefficient. ML models outperformed no-skill prediction. Metabolomic profiles were identified based on cause. FSGS was associated with the sphingomyelin-ceramide axis, as well as with individual plasmalogen metabolites and the plasmalogen subpathway. OU was associated with gut microbiome–derived histidine metabolites. Conclusion ML models identified metabolomic signatures based on CKD cause. Using ML techniques in conjunction with traditional biostatistics, we demonstrated that sphingomyelin-ceramide and plasmalogen dysmetabolism are associated with FSGS and that gut microbiome–derived histidine metabolites are associated with OU.
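
The modelling pipeline can be compressed into a short sketch: lasso-style feature selection, one of the four classifiers, an 80/20 split, and the reported holdout metrics; everything below runs on synthetic stand-in data rather than the study's metabolomic profiles.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 702 "participants", 200 metabolite features.
X, y = make_classification(n_samples=702, n_features=200, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Lasso-style (L1) feature selection followed by one of the four classifiers.
lasso_select = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
model = make_pipeline(StandardScaler(), lasso_select,
                      RandomForestClassifier(random_state=0))
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]
pred = model.predict(X_te)
print("ROC-AUC", round(roc_auc_score(y_te, proba), 3),
      "F1", round(f1_score(y_te, pred), 3),
      "MCC", round(matthews_corrcoef(y_te, pred), 3))
```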


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lei Li ◽  
Desheng Wu

Purpose The infraction of securities regulations (ISRs) by listed firms in their day-to-day operations and management has become a common problem. This paper proposes several machine learning approaches to forecast the risk of infractions by listed corporates, addressing the lack of effectiveness and precision in current supervision of these financial problems. Design/methodology/approach The overall research framework designed for forecasting infractions (ISRs) includes data collection and cleaning, feature engineering, data splitting, application of the prediction approaches and model performance evaluation. We select Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machines, Artificial Neural Network and Long Short-Term Memory Networks (LSTMs) as ISRs prediction models. Findings The research results show that the proposed models predict ISRs significantly better when prior infractions are included as features than when they are not, especially for large sample sets. The results also indicate that, when judging whether a company has infractions, we should pay attention to novel artificial intelligence methods, previous infractions of the company and large data sets. Originality/value The findings can be utilized to address, to a certain degree, the problem of identifying listed corporates' ISRs. Overall, the results elucidate the value of prior infractions of securities regulations (ISRs). This shows the importance of including more data sources when constructing distress models, rather than only building increasingly complex models on the same data. This is also beneficial to the regulatory authorities.
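
A minimal sketch of the model-comparison step on synthetic, class-imbalanced data; the LSTM variant from the paper is omitted here and the feature set (e.g. a prior-infraction flag plus financial indicators) is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Infractions as the minority class in a synthetic stand-in data set.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "ANN": MLPClassifier(max_iter=500, random_state=0),
}
for name, clf in models.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC {auc:.3f}")
```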


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Nasser Assery ◽  
Yuan (Dorothy) Xiaohong ◽  
Qu Xiuli ◽  
Roy Kaushik ◽  
Sultan Almalki

Purpose This study aims to propose an unsupervised learning model to evaluate the credibility of disaster-related Twitter data and present a performance comparison with commonly used supervised machine learning models. Design/methodology/approach First, historical tweets on two recent hurricane events are collected via the Twitter API. Then a credibility scoring system is implemented in which the tweet features are analyzed to give a credibility score and credibility label to each tweet. After that, supervised machine learning classification is implemented using various classification algorithms and their performances are compared. Findings The proposed unsupervised learning model could enhance the emergency response by providing a fast way to determine the credibility of disaster-related tweets. Additionally, the comparison of the supervised classification models reveals that the Random Forest classifier performs significantly better than the SVM and Logistic Regression classifiers in classifying the credibility of disaster-related tweets. Originality/value In this paper, an unsupervised 10-point scoring model is proposed to evaluate tweets’ credibility based on user-based and content-based features. This technique could be used to evaluate the credibility of disaster-related tweets on future hurricanes and has the potential to enhance emergency response during critical events. The comparative study of different supervised learning methods has revealed effective supervised learning methods for evaluating the credibility of Twitter data.
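
A toy version of a 10-point credibility score built from user-based and content-based tweet features is sketched below; the individual features, weights and thresholds are illustrative assumptions rather than the paper's actual scoring rules.

```python
def credibility_score(tweet: dict) -> int:
    """Return a 0-10 credibility score from simple heuristic signals."""
    score = 0
    # user-based signals (assumed)
    score += 2 if tweet.get("verified_account") else 0
    score += 2 if tweet.get("followers", 0) > 1000 else 0
    score += 1 if tweet.get("account_age_days", 0) > 365 else 0
    # content-based signals (assumed)
    score += 2 if tweet.get("has_url") else 0
    score += 1 if not tweet.get("all_caps") else 0
    score += 1 if tweet.get("retweets", 0) > 10 else 0
    score += 1 if tweet.get("has_media") else 0
    return min(score, 10)

example = {"verified_account": True, "followers": 5400,
           "account_age_days": 900, "has_url": True, "retweets": 42}
print(credibility_score(example))   # a credibility label could be derived from score bands
```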

