Using CATA and Machine Learning to Operationalize Old Constructs in New Ways: An Illustration Using U.S. Governors’ COVID-19 Press Briefings

Increased computing power and greater access to online data have led to rapid growth in the use of computer-aided text analysis (CATA) and machine learning methods. Using “big data”, researchers have advanced not only new streams of research but also new research methodologies. Noting this trend and simultaneously recognizing the value of traditional research methods, we lay out a methodology that bridges the gap between old and new approaches to operationalize old constructs in new ways. Combining web scraping, CATA, and supervised machine learning with ground truth data, we train a model to predict CIP (Charismatic-Ideological-Pragmatic) categorical leadership styles from running text. To illustrate this method, we apply the model to classify U.S. state governors’ COVID-19 press briefings according to their CIP leadership style. In addition, we demonstrate content and convergent validity of the method.

QuestionComb: A Gamification Approach for the Visual Explanation of Linguistic Phenomena through Interactive Labeling

ACM Transactions on Interactive Intelligent Systems ◽

10.1145/3429448 ◽

2021 ◽

Vol 11 (3-4) ◽

pp. 1-38

Author(s):

Rita Sevastjanova ◽

Wolfgang Jentner ◽

Fabian Sperrle ◽

Rebecca Kehlbeck ◽

Jürgen Bernard ◽

...

Keyword(s):

Machine Learning ◽

Information Seeking ◽

Visual Analytics ◽

Evaluation Studies ◽

Model Performance ◽

Ground Truth ◽

Training Data ◽

Supervised Machine Learning ◽

Ground Truth Data ◽

The Creation

Linguistic insight in the form of high-level relationships and rules in text forms the basis of our understanding of language. However, the data-driven generation of such structures often lacks labeled resources that can be used as training data for supervised machine learning. The creation of such ground-truth data is a time-consuming process that often requires domain expertise to resolve text ambiguities and characterize linguistic phenomena. Furthermore, the creation and refinement of machine learning models is often challenging for linguists as the models are often complex, non-transparent, and difficult to understand. To tackle these challenges, we present a visual analytics technique for interactive data labeling that applies concepts from gamification and explainable Artificial Intelligence (XAI) to support complex classification tasks. The visual-interactive labeling interface promotes the creation of effective training data. Visual explanations of learned rules unveil the decisions of the machine learning model and support iterative and interactive optimization. The gamification-inspired design guides the user through the labeling process and provides feedback on the model performance. As an instance of the proposed technique, we present QuestionComb, a workspace tailored to the task of question classification (i.e., into information-seeking vs. non-information-seeking questions). Our evaluation studies confirm that gamification concepts are beneficial to engage users through continuous feedback, offering an effective visual analytics technique when combined with active learning and XAI.

Glean

Proceedings of the VLDB Endowment ◽

10.14778/3447689.3447703 ◽

2021 ◽

Vol 14 (6) ◽

pp. 997-1005

Author(s):

Sandeep Tata ◽

Navneet Potti ◽

James B. Wendt ◽

Lauro Beltrão Costa ◽

Marc Najork ◽

...

Keyword(s):

Machine Learning ◽

Data Management ◽

Real World ◽

Empirical Studies ◽

Ground Truth ◽

Training Data ◽

Ground Truth Data ◽

Document Type ◽

Machine Learning Model ◽

Structured Information

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.

EXPLORING MACHINE LEARNING CLASSIFICATION ALGORITHMS FOR CROP CLASSIFICATION USING SENTINEL 2 DATA

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-3-w6-573-2019 ◽

2019 ◽

Vol XLII-3/W6 ◽

pp. 573-578 ◽

Cited By ~ 3

Author(s):

◽

S. S. Ray

Keyword(s):

Machine Learning ◽

Satellite Data ◽

Classification Accuracy ◽

Ground Truth ◽

Kappa Coefficient ◽

Ground Truth Data ◽

Classification Techniques ◽

Machine Learning Classification ◽

Crop Classification ◽

Sentinel 2

Abstract. Crop classification and recognition is a very important application of remote sensing. In the last few years, machine learning classification techniques have been emerging for crop classification. Google Earth Engine (GEE) is a platform for exploring multiple satellite datasets with different advanced classification techniques without even downloading the satellite data. The main objective of this study is to explore the ability of different machine learning classification techniques, such as Random Forest (RF), Classification And Regression Trees (CART) and Support Vector Machine (SVM), for crop classification. High-resolution optical data from Sentinel-2 MSI (10 m) were used for crop classification in the Indian Agricultural Research Institute (IARI) farm for the Rabi season 2016 for major crops. Around 100 crop fields (~400 hectares) in IARI were analysed. Smartphone-based ground truth data were collected. The best cloud-free image of Sentinel 2 MSI data (5 Feb 2016), selected by automatic filtering on the percentage cloud cover property in GEE, was used for classification. Polygons based on the ground truth data were used as training data sets for crop classification using machine learning techniques. After classification, accuracy assessment was done through the generation of the confusion matrix (producer and user accuracy), kappa coefficient and F value. In this study it was found that, using GEE through a cloud platform, accessing, filtering and pre-processing of satellite data could be done very efficiently. In terms of overall classification accuracy and kappa coefficient, Random Forest (93.3%, 0.9178) and CART (73.4%, 0.6755) classifiers performed better than SVM (74.3%, 0.6867) classifier. For validation, data from the Field Operation Service Unit (FOSU) division of IARI were used and encouraging results were obtained.
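
The accuracy measures named in this abstract (overall accuracy, producer and user accuracy, kappa coefficient) can all be derived from a confusion matrix. The following is a minimal sketch in plain Python, not the study's GEE workflow; the function names, the three-class matrix, and the convention that rows hold ground truth and columns hold predictions are assumptions made for illustration.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def producer_user_accuracy(cm):
    """Per-class (producer, user) accuracy.

    Assumed convention: rows = ground truth, columns = predicted class.
    Producer accuracy = correct / ground-truth total (row sum).
    User accuracy     = correct / predicted total (column sum).
    """
    n = len(cm)
    row_tot = [sum(cm[i]) for i in range(n)]
    col_tot = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    producer = [cm[k][k] / row_tot[k] if row_tot[k] else 0.0 for k in range(n)]
    user = [cm[k][k] / col_tot[k] if col_tot[k] else 0.0 for k in range(n)]
    return producer, user


def cohen_kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = sum(cm[i][i] for i in range(n)) / total
    row_tot = [sum(cm[i]) for i in range(n)]
    col_tot = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    p_expected = sum(row_tot[k] * col_tot[k] for k in range(n)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical three-crop confusion matrix (illustrative numbers only).
example_cm = [
    [45, 5, 0],
    [3, 40, 7],
    [2, 5, 43],
]
print(overall_accuracy(example_cm))
print(producer_user_accuracy(example_cm))
print(cohen_kappa(example_cm))
```

Reporting kappa alongside overall accuracy is useful because kappa discounts the agreement a classifier would reach by chance given the class totals.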

Integrating hierarchical statistical models and machine-learning algorithms for ground-truthing drone images of the vegetation: taxonomy, abundance and population ecological models

10.1101/491381 ◽

2018 ◽

Cited By ~ 1

Author(s):

Christian Damgaard

Keyword(s):

Machine Learning ◽

Statistical Models ◽

Learning Algorithms ◽

Plant Competition ◽

Image Data ◽

Ground Truth ◽

Ecological Models ◽

Machine Learning Algorithms ◽

Ground Truth Data ◽

Ground Truthing

Abstract. In order to fit population ecological models, e.g. plant competition models, to new drone-aided image data, we need to develop statistical models that may take the new type of measurement uncertainty when applying machine-learning algorithms into account and quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method to statistically model known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.

Automated Well-Log Processing and Lithology Classification by Identifying Optimal Features Through Unsupervised and Supervised Machine-Learning Algorithms

SPE Journal ◽

10.2118/202477-pa ◽

2020 ◽

Vol 25 (05) ◽

pp. 2778-2800 ◽

Cited By ~ 1

Author(s):

Harpreet Singh ◽

Yongkoo Seol ◽

Evgeniy M. Myshakin

Keyword(s):

Machine Learning ◽

Case Studies ◽

Ground Truth ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Well Logs ◽

Petroleum Engineering ◽

Well Log ◽

Classification Problems ◽

Rock Types

Summary. The application of specialized machine learning (ML) in petroleum engineering and geoscience is increasingly gaining attention in the development of rapid and efficient methods as a substitute to existing methods. Existing ML-based studies that use well logs contain two inherent limitations. The first limitation is that they start with one predefined combination of well logs that by default assumes that the chosen combination of well logs is poised to give the best outcome in terms of prediction, although the variation in accuracy obtained through different combinations of well logs can be substantial. The second limitation is that most studies apply unsupervised learning (UL) for classification problems, but it underperforms by a substantial margin compared with nearly all the supervised learning (SL) algorithms. In this context, this study investigates a variety of UL and SL ML algorithms applied on multiple well-log combinations (WLCs) to automate the traditional workflow of well-log processing and classification, including an optimization step to achieve the best output. The workflow begins by processing the measured well logs, which includes developing different combinations of measured well logs and their physics-motivated augmentations, followed by removal of potential outliers from the input WLCs. Reservoir lithology with four different rock types is investigated using eight UL and seven SL algorithms in two different case studies. The results from the two case studies are used to identify the optimal set of well logs and the ML algorithm that gives the best matching reservoir lithology to its ground truth. The workflow is demonstrated using two wells from two different reservoirs on Alaska North Slope to distinguish four different rock types along the well (brine-dominated sand, hydrate-dominated sand, shale, and others/mixed compositions).
The results show that the automated workflow investigated in this study can discover the ground truth for the lithology with up to 80% accuracy with UL and up to 90% accuracy with SL, using six routine well logs [vp, vs, ρb, ϕneut, Rt, gamma ray (GR)], which is a significant improvement compared with the accuracy reported in the current state of the art, which is less than 70%.

A Comparative Assessment of Ensemble-Based Machine Learning and Maximum Likelihood Methods for Mapping Seagrass Using Sentinel-2 Imagery in Tauranga Harbor, New Zealand

Remote Sensing ◽

10.3390/rs12030355 ◽

2020 ◽

Vol 12 (3) ◽

pp. 355 ◽

Cited By ~ 10

Author(s):

Nam Thang Ha ◽

Merilyn Manley-Harris ◽

Tien Dat Pham ◽

Ian Hawes

Keyword(s):

Machine Learning ◽

New Zealand ◽

Maximum Likelihood ◽

Ground Truth ◽

Machine Learning Techniques ◽

Ground Truth Data ◽

Seagrass Meadows ◽

Ensemble Machine Learning ◽

Novel Approach ◽

Sentinel 2

Seagrass has been acknowledged as a productive blue carbon ecosystem that is in significant decline across much of the world. A first step toward conservation is the mapping and monitoring of extant seagrass meadows. Several methods are currently in use, but mapping the resource from satellite images using machine learning is not widely applied, despite its successful use in various comparable applications. This research aimed to develop a novel approach for seagrass monitoring using state-of-the-art machine learning with data from Sentinel–2 imagery. We used Tauranga Harbor, New Zealand as a validation site for which extensive ground truth data are available to compare ensemble machine learning methods involving random forests (RF), rotation forests (RoF), and canonical correlation forests (CCF) with the more traditional maximum likelihood classifier (MLC) technique. Using a group of validation metrics including F1, precision, recall, accuracy, and the McNemar test, our results indicated that machine learning techniques outperformed the MLC with RoF as the best performer (F1 scores ranging from 0.75–0.91 for sparse and dense seagrass meadows, respectively). Our study is the first comparison of various ensemble-based methods for seagrass mapping of which we are aware, and promises to be an effective approach to enhance the accuracy of seagrass monitoring.
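
As an illustration of the validation metrics listed in this abstract, the sketch below computes precision, recall, and F1 for one class and a McNemar test comparing two classifiers on the same samples. It is not the authors' code: the function names and toy labels are hypothetical, and the McNemar variant shown (chi-square approximation with continuity correction, one degree of freedom) is an assumption, since the abstract does not say which form was used.

```python
import math


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar chi-square test with continuity correction (assumed variant).

    b = samples classifier A gets right and classifier B gets wrong
    c = samples classifier A gets wrong and classifier B gets right
    Returns (statistic, p_value) under a chi-square with 1 degree of freedom.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, bb in triples if a == t and bb != t)
    c = sum(1 for t, a, bb in triples if a != t and bb == t)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree on correctness
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy labels: 1 = seagrass, 0 = other (illustrative values only).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [1, 1, 1, 0, 0, 0, 0, 1]
model_b = [1, 0, 0, 0, 0, 0, 1, 1]
print(precision_recall_f1(truth, model_a, positive=1))
print(mcnemar_test(truth, model_a, model_b))
```

The McNemar test only looks at samples where the two classifiers disagree on correctness, which is why it suits paired comparisons on a shared validation set. With very few disagreements, as in this toy data, an exact binomial version would be the more appropriate choice.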

On the potential and challenges of using machine-learning for automated quality control of environmental sensor data

10.5194/egusphere-egu2020-20777 ◽

2020 ◽

Author(s):

Lennart Schmidt ◽

Hannes Mollenhauer ◽

Corinna Rebmann ◽

David Schäfer ◽

Antje Claussnitzer ◽

...

Keyword(s):

Machine Learning ◽

Quality Control ◽

Ground Truth ◽

Sensor Data ◽

Small Scale ◽

Ground Truth Data ◽

Starting Point ◽

Environmental Sensor ◽

Spatio Temporal ◽

Automated Quality Control

With more and more data being gathered from environmental sensor networks, the importance of automated quality-control (QC) routines to provide usable data in near-real time is becoming increasingly apparent. Machine-learning (ML) algorithms exhibit a high potential in this respect as they are able to exploit the spatio-temporal relation of multiple sensors to identify anomalies while allowing for non-linear functional relations in the data. In this study, we evaluate the potential of ML for automated QC on two spatio-temporal datasets at different spatial scales: one is a dataset of atmospheric variables at 53 stations across Northern Germany. The second dataset contains time series of soil moisture and temperature at 40 sensors at a small-scale measurement plot. Furthermore, we investigate strategies to tackle three challenges that are commonly present when applying ML for QC: 1) As sensors might drop out, the ML models have to be designed to be robust against missing values in the input data. We address this by comparing different data imputation methods, coupled with a binary representation of whether a value is missing or not. 2) Quality flags that mark erroneous data points to serve as ground truth for model training might not be available. And 3) there is no guarantee that the system under study is stationary, which might render the outputs of a trained model useless in the future. To address 2) and 3), we frame the problem both as a supervised and unsupervised learning problem. Here, the use of unsupervised ML models can be beneficial as they do not require ground truth data and can thus be retrained more easily should the system be subject to significant changes. In this presentation, we discuss the performance, advantages and drawbacks of the proposed strategies to tackle the aforementioned challenges. Thus, we provide a starting point for researchers in the largely untouched field of ML application for automated quality control of environmental sensor data.
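
Challenge 1) pairs imputation with a binary flag that records whether each value was missing. The sketch below shows one simple way this could look. Column-mean imputation is an assumption chosen for brevity (the abstract compares several imputation methods without naming them), and the function name and sample readings are hypothetical.

```python
def impute_with_indicator(rows):
    """Fill missing values (None) with the column mean and append 0/1 flags.

    Each output row holds the imputed values followed by one indicator per
    column: 1 if the original value was missing, 0 otherwise. Column-mean
    imputation is an illustrative choice, not the method from the abstract.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a sensor never reported a value.
        means.append(sum(observed) / len(observed) if observed else 0.0)

    result = []
    for row in rows:
        values = [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        flags = [1 if row[j] is None else 0 for j in range(n_cols)]
        result.append(values + flags)
    return result


# Hypothetical readings from two sensors at three time steps.
readings = [
    [1.0, None],
    [3.0, 4.0],
    [None, 6.0],
]
for features in impute_with_indicator(readings):
    print(features)
```

Keeping the indicator columns lets a downstream model tell a genuinely measured value from a filled-in one, which matters when a sensor dropout is itself informative.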

Validation of ground truth fire debris classification by supervised machine learning

Forensic Chemistry ◽

10.1016/j.forc.2021.100358 ◽

2021 ◽

Vol 26 ◽

pp. 100358

Author(s):

Michael E. Sigman ◽

Mary R. Williams ◽

Nicholas Thurn ◽

Taylor Wood

Keyword(s):

Machine Learning ◽

Ground Truth ◽

Supervised Machine Learning ◽

Fire Debris

Integrating Hierarchical Statistical Models and Machine-Learning Algorithms for Ground-Truthing Drone Images of the Vegetation: Taxonomy, Abundance and Population Ecological Models

Remote Sensing ◽

10.3390/rs13061161 ◽

2021 ◽

Vol 13 (6) ◽

pp. 1161

Author(s):

Christian Damgaard

Keyword(s):

Machine Learning ◽

Statistical Models ◽

Learning Algorithms ◽

Plant Competition ◽

Image Data ◽

Ground Truth ◽

Ecological Models ◽

Machine Learning Algorithms ◽

Ground Truth Data ◽

Ground Truthing

In order to fit population ecological models, e.g., plant competition models, to new drone-aided image data, we need to develop statistical models that may take the new type of measurement uncertainty when applying machine-learning algorithms into account and quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method to statistically model known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.

Collective annotation patterns in learning from crowds

Intelligent Data Analysis ◽

10.3233/ida-200009 ◽

2020 ◽

Vol 24 ◽

pp. 63-86

Author(s):

Francisco Mena ◽

Ricardo Ñanculef ◽

Carlos Valle

Keyword(s):

Machine Learning ◽

Large Scale ◽

Ground Truth ◽

Experimental Results ◽

Ground Truth Data ◽

Satisfactory Performance ◽

Machine Learning Applications ◽

Data Points ◽

Confusion Matrices

The lack of annotated data is one of the major barriers facing machine learning applications today. Learning from crowds, i.e. collecting ground-truth data from multiple inexpensive annotators, has become a common method to cope with this issue. It has been recently shown that modeling the varying quality of the annotations obtained in this way is fundamental to obtaining satisfactory performance in tasks where inexpert annotators may represent the majority but not the most trusted group. Unfortunately, existing techniques represent annotation patterns for each annotator individually, making the models difficult to estimate in large-scale scenarios. In this paper, we present two models to address these problems. Both methods are based on the hypothesis that it is possible to learn collective annotation patterns by introducing confusion matrices that involve groups of data point annotations or annotators. The first approach clusters data points with a common annotation pattern, regardless of the annotators from which the labels have been obtained. Implicitly, this method attributes annotation mistakes to the complexity of the data itself and not to the variable behavior of the annotators. The second approach explicitly maps annotators to latent groups that are collectively parametrized to learn a common annotation pattern. Our experimental results show that, compared with other methods for learning from crowds, both methods have advantages in scenarios with a large number of annotators and a small number of annotations per annotator.
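
For context, the per-annotator modeling that this abstract contrasts its methods with can be sketched as follows: labels are aggregated by majority vote, and one confusion matrix per annotator is then counted against that consensus. This is a simplified baseline under stated assumptions, not either of the two models proposed in the paper; the function names, data layout, and sample annotations are hypothetical.

```python
from collections import Counter


def majority_vote(annotations):
    """Consensus label per item; ties go to the label seen first.

    `annotations` maps item id -> {annotator id -> label}.
    """
    return {
        item: Counter(labels.values()).most_common(1)[0][0]
        for item, labels in annotations.items()
    }


def annotator_confusions(annotations, consensus, classes):
    """Count, per annotator, how often each consensus label got each label.

    Returns {annotator: {consensus_label: {given_label: count}}}.
    """
    confusions = {}
    for item, labels in annotations.items():
        for annotator, label in labels.items():
            matrix = confusions.setdefault(
                annotator, {c: {d: 0 for d in classes} for c in classes}
            )
            matrix[consensus[item]][label] += 1
    return confusions


# Hypothetical crowd labels for three items from three annotators.
crowd = {
    "item1": {"ann_a": "yes", "ann_b": "yes", "ann_c": "no"},
    "item2": {"ann_a": "no", "ann_b": "no", "ann_c": "no"},
    "item3": {"ann_a": "yes", "ann_b": "no", "ann_c": "no"},
}
consensus_labels = majority_vote(crowd)
matrices = annotator_confusions(crowd, consensus_labels, classes=["yes", "no"])
print(consensus_labels)
print(matrices["ann_a"])
```

With one matrix per annotator, the number of parameters grows with the crowd size while each matrix is estimated from only a few labels. That is the scaling problem the abstract describes and the motivation for sharing confusion matrices across groups.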
