Glean

2021 ◽  
Vol 14 (6) ◽  
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in a virtually infinite number of ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
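
The second challenge (generating training data from labeled documents) can be made concrete with a small, hedged sketch. The schema fields, the Candidate type, and the exact-match labeling rule below are illustrative assumptions, not Glean's actual implementation.

```python
# A minimal sketch (not Glean's code) of turning a target schema and a labeled
# document into per-candidate training examples. Field names, the Candidate
# type, and the string-match rule are assumptions for illustration.
from dataclasses import dataclass

TARGET_SCHEMA = {"invoice_date": "date", "total_amount": "price"}  # assumed fields

@dataclass
class Candidate:
    field: str    # schema field this span is a candidate for
    text: str     # extracted span text
    page: int
    bbox: tuple   # (x0, y0, x1, y1) layout coordinates

def make_training_examples(candidates, ground_truth):
    """Label each candidate positive if it matches the annotated value for its field."""
    examples = []
    for cand in candidates:
        gold = ground_truth.get(cand.field)  # annotated value for this field, if any
        label = int(gold is not None and cand.text.strip() == gold.strip())
        examples.append((cand, label))
    return examples
```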

2018 ◽  
Author(s):  
Naihui Zhou ◽  
Zachary D Siegel ◽  
Scott Zarecor ◽  
Nigel Lee ◽  
Darwin A Campbell ◽  
...  

The accuracy of machine learning tasks critically depends on high-quality ground truth data. Producing good ground truth data therefore typically involves trained professionals; however, this can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate a large volume of training data of good quality. We explore an image analysis task involving the segmentation of corn tassels from images taken in a field setting. We investigate the accuracy, speed, and other quality metrics when this task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We conclude that the Amazon MTurk and Master MTurk workers perform significantly better than the for-credit students, but with no significant difference between the two MTurk worker types. Furthermore, the quality of the segmentation produced by Amazon MTurk workers rivals that of an expert worker. We provide best practices to assess the quality of ground truth data and to compare data quality produced by different sources. We conclude that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at a low cost and high quality, especially in the context of high-throughput plant phenotyping. We also provide several metrics for assessing the quality of the generated datasets.

Author Summary
Food security is a growing global concern. Farmers, plant breeders, and geneticists are hastening to address the challenges presented to agriculture by climate change, dwindling arable land, and population growth. Scientists in the field of plant phenomics are using satellite and drone images to understand how crops respond to a changing environment and to combine genetics and environmental measures to maximize crop growth efficiency. However, the terabytes of image data require new computational methods to extract useful information. Machine learning algorithms are effective in recognizing select parts of images, but they require high-quality data curated by people to train them, a process that can be laborious and costly. We examined how well crowdsourcing works in providing training data for plant phenomics, specifically, segmenting a corn tassel (the male flower of the corn plant) from the often-cluttered images of a cornfield. We provided images to students and to Amazon MTurkers, the latter being an on-demand workforce brokered by Amazon.com and paid on a task-by-task basis. We report on best practices in crowdsourcing image labeling for phenomics, and compare the different groups on measures such as fatigue and accuracy over time. We find that crowdsourcing is a good way of generating quality labeled data, rivaling that of experts.
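
One way to make the "rivals an expert" comparison concrete is to score each worker's masks against an expert reference mask. The sketch below assumes binary NumPy masks and uses intersection-over-union as the quality metric; both choices are ours for illustration, not necessarily the metrics used in the study.

```python
# A minimal sketch of scoring crowdsourced segmentations against an expert
# reference mask. Assumes binary masks stored as NumPy arrays and IoU as the
# quality metric.
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union of two binary segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union else 1.0

def score_workers(worker_masks: dict, expert_mask: np.ndarray) -> dict:
    """Per-worker mean IoU against the expert ground truth."""
    return {worker: float(np.mean([iou(m, expert_mask) for m in masks]))
            for worker, masks in worker_masks.items()}
```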


2020 ◽  
Vol 12 (21) ◽  
pp. 3475
Author(s):  
Miae Kim ◽  
Jan Cermak ◽  
Hendrik Andersen ◽  
Julia Fuchs ◽  
Roland Stirnberg

Clouds are one of the major uncertainties of the climate system. The study of cloud processes requires information on cloud physical properties, in particular liquid water path (LWP). This parameter is commonly retrieved from satellite data using look-up table approaches. However, existing LWP retrievals come with uncertainties related to assumptions inherent in physical retrievals. Here, we present a new retrieval technique for cloud LWP based on a statistical machine learning model. The approach utilizes spectral information from geostationary satellite channels of Meteosat Spinning-Enhanced Visible and Infrared Imager (SEVIRI), as well as satellite viewing geometry. As ground truth, data from CloudNet stations were used to train the model. We found that LWP predicted by the machine-learning model agrees substantially better with CloudNet observations than a current physics-based product, the Climate Monitoring Satellite Application Facility (CM SAF) CLoud property dAtAset using SEVIRI, edition 2 (CLAAS-2), highlighting the potential of such approaches for future retrieval developments.
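
The retrieval setup described above can be sketched as a regression from satellite-derived features to the CloudNet-observed LWP. The feature layout, the random forest model choice, and the placeholder data below are our assumptions; the abstract does not specify the statistical model.

```python
# A minimal sketch of a statistical LWP retrieval: a regressor mapping SEVIRI
# channel values plus viewing geometry to LWP, trained against CloudNet
# observations. Feature count, model choice, and data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# One row per matched SEVIRI pixel / CloudNet observation:
# assumed 11 spectral channels + 2 viewing-geometry angles.
X = rng.random((1000, 13))
y = rng.random(1000) * 300.0  # placeholder LWP (g/m^2) from CloudNet

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out matchups:", model.score(X_test, y_test))
```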


2019 ◽  
Author(s):  
Akshay Agarwal ◽  
Gowri Nayar ◽  
James Kaufman

Computational learning methods allow researchers to make predictions, draw inferences, and automate the generation of mathematical models. These models are crucial to solving real-world problems such as antimicrobial resistance, pathogen detection, and protein evolution. Machine learning methods depend upon ground truth data to achieve specificity and sensitivity. Since such data is limited in this case, as we show in the course of this paper, and since the amount of available data is growing super-linearly, it is of paramount importance to understand the distribution of ground truth data, the analyses for which it is suited, and where it may have limitations that bias downstream learning methods. In this paper, we focus on the training data required to model antimicrobial resistance (AR). We report an analysis of bacterial biochemical assay data associated with whole genome sequencing (WGS) from the National Center for Biotechnology Information (NCBI), and discuss important implications of using assay data and genetic features as training data for machine learning models. A complete discussion of the machine learning model implementation is outside the scope of this paper and is the subject of a later publication.

The antimicrobial assay data was obtained from NCBI BioSample, which contains descriptive information about the physical biological specimens from which experimental data is obtained, as well as the results of those experiments themselves [1]. Assay data includes minimum inhibitory concentrations (MIC) of antibiotics, links to associated microbial WGS data, and the treatment of a particular microorganism with antibiotics.

We observe that there is minimal microbial data available for many antibiotics and for targeted taxonomic groups. The antibiotics with the highest number of assays have fewer than 1,500 measurements each. The corresponding bias in available assays makes machine learning problematic for some important microbes and for building more advanced models that can work across microbial genera. In this study we therefore focus on the antibiotic with the most assay data (tetracycline) and the genus with the most available sequence data (Acinetobacter, with 14,000 measurements across 49 antibiotic compounds). Using this data for training and testing, we observed contradictions in the distribution of assay outcomes and report methods to identify and resolve such conflicts. Per antibiotic, we find that up to 30% of measurements can be (resolvably) conflicting. As more data becomes available, automated training data curation will be an important part of creating useful machine learning models to predict antibiotic resistance.

CCS CONCEPTS: • Applied computing → Computational biology; Computational genomics; Bioinformatics
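
The conflict handling mentioned above can be sketched as follows. The column names, example records, and the simple majority-vote rule are illustrative assumptions; the paper's actual resolution methods may differ.

```python
# A minimal sketch of detecting and resolving conflicting assay outcomes for
# the same biosample and antibiotic. Column names and the majority-vote rule
# are assumptions for illustration.
import pandas as pd

assays = pd.DataFrame({
    "biosample":  ["S1", "S1", "S1", "S2", "S2"],
    "antibiotic": ["tetracycline"] * 5,
    "phenotype":  ["resistant", "resistant", "susceptible", "susceptible", "susceptible"],
})

def resolve_conflicts(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one label per (biosample, antibiotic); flag groups with disagreement."""
    def summarize(group):
        counts = group["phenotype"].value_counts()
        return pd.Series({
            "label": counts.idxmax(),        # majority vote across repeated assays
            "conflicting": len(counts) > 1,  # any disagreement at all
        })
    return df.groupby(["biosample", "antibiotic"]).apply(summarize).reset_index()

print(resolve_conflicts(assays))
```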


Animals ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. 2660
Author(s):  
Lara Schmeling ◽  
Golnaz Elmamooz ◽  
Phan Thai Hoang ◽  
Anastasiia Kozar ◽  
Daniela Nicklas ◽  
...  

Monitoring systems assist farmers in monitoring the health of dairy cows by predicting behavioral patterns (e.g., lying) and their changes with machine learning models. However, the available systems were developed either for indoor use or for pasture, and fail to predict behavior in other locations. Therefore, the goal of our study was to train and evaluate a model for the prediction of lying on pasture and in the barn. On three farms, 7–11 dairy cows each were equipped with the prototype of the monitoring system containing an accelerometer, a magnetometer, and a gyroscope. Video observations on the pasture and in the barn provided ground truth data. We used 34.5 h of data from the pasture for training and 480.5 h from both locations for evaluation. In our comparison, a random forest with an orientation-independent feature set, computed on 5 s windows without overlap, achieved the highest accuracy. Sensitivity, specificity, and accuracy were 95.6%, 80.5%, and 87.4%, respectively. Accuracy on the pasture (93.2%) exceeded accuracy in the barn (81.4%). Ruminating while standing was the behavior most often confused with lying. Of the individual lying bouts, 95.6% and 93.4% were identified on the pasture and in the barn, respectively. Adding a model for standing-up and lying-down events could improve the prediction of lying in the barn.
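
A hedged sketch of such a pipeline is shown below: the acceleration magnitude (which is invariant to sensor orientation) is summarized over 5 s windows and fed to a random forest. The sampling rate, feature statistics, and placeholder data are our assumptions, not the study's exact configuration.

```python
# A minimal sketch: orientation-independent features over 5 s windows feeding
# a random forest classifier for lying vs. not-lying. Sampling rate, feature
# choices, and data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FS = 10            # assumed sampling rate (Hz)
WINDOW = 5 * FS    # 5 s windows, no overlap

def window_features(acc_xyz: np.ndarray) -> np.ndarray:
    """Orientation-independent features: statistics of acceleration magnitude per window."""
    mag = np.linalg.norm(acc_xyz, axis=1)        # |a| is invariant to sensor orientation
    n = len(mag) // WINDOW
    wins = mag[: n * WINDOW].reshape(n, WINDOW)
    return np.column_stack([wins.mean(1), wins.std(1), wins.min(1), wins.max(1)])

rng = np.random.default_rng(0)
acc = rng.normal(size=(6000, 3))                 # placeholder tri-axial accelerometer data
X = window_features(acc)
y = rng.integers(0, 2, size=len(X))              # placeholder per-window ground truth
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```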


2021 ◽  
Vol 11 (3-4) ◽  
pp. 1-38
Author(s):  
Rita Sevastjanova ◽  
Wolfgang Jentner ◽  
Fabian Sperrle ◽  
Rebecca Kehlbeck ◽  
Jürgen Bernard ◽  
...  

Linguistic insight in the form of high-level relationships and rules in text builds the basis of our understanding of language. However, the data-driven generation of such structures often lacks labeled resources that can be used as training data for supervised machine learning. The creation of such ground-truth data is a time-consuming process that often requires domain expertise to resolve text ambiguities and characterize linguistic phenomena. Furthermore, the creation and refinement of machine learning models is often challenging for linguists, as the models are often complex, opaque, and difficult to understand. To tackle these challenges, we present a visual analytics technique for interactive data labeling that applies concepts from gamification and explainable Artificial Intelligence (XAI) to support complex classification tasks. The visual-interactive labeling interface promotes the creation of effective training data. Visual explanations of learned rules unveil the decisions of the machine learning model and support iterative and interactive optimization. The gamification-inspired design guides the user through the labeling process and provides feedback on the model performance. As an instance of the proposed technique, we present QuestionComb, a workspace tailored to the task of question classification (i.e., into information-seeking vs. non-information-seeking questions). Our evaluation studies confirm that gamification concepts are beneficial to engage users through continuous feedback, offering an effective visual analytics technique when combined with active learning and XAI.


2022 ◽  
Author(s):  
Maxat Kulmanov ◽  
Robert Hoehndorf

Motivation: Protein functions are often described using the Gene Ontology (GO), which is an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes which have only a few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero
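
The zero-shot idea can be illustrated abstractly: if proteins and GO classes are embedded in a shared space, a protein can be scored against any class embedding, including classes with no annotated proteins in training. The sketch below is our simplified interpretation (cosine similarity, arbitrary dimensions), not the DeepGOZero implementation.

```python
# A minimal sketch of zero-shot scoring against ontology-derived class
# embeddings. Dimensions and the scoring rule are assumptions.
import numpy as np

def zero_shot_scores(protein_emb: np.ndarray, class_embs: np.ndarray) -> np.ndarray:
    """Score one protein against every GO class via cosine similarity."""
    p = protein_emb / np.linalg.norm(protein_emb)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return c @ p

rng = np.random.default_rng(0)
protein = rng.normal(size=128)              # embedding produced from the protein
go_classes = rng.normal(size=(50000, 128))  # embeddings learned from GO axioms
scores = zero_shot_scores(protein, go_classes)
print("top-scoring class index:", int(scores.argmax()))
```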


2021 ◽  
Vol 17 (2) ◽  
pp. 1-20
Author(s):  
Zheng Wang ◽  
Qiao Wang ◽  
Tingzhang Zhao ◽  
Chaokun Wang ◽  
Xiaojun Ye

Feature selection, an effective technique for dimensionality reduction, plays an important role in many machine learning systems, and supervised knowledge can significantly improve performance. However, faced with the rapid growth of newly emerging concepts, existing supervised methods can easily suffer from the scarcity and questionable validity of labeled training data. In this paper, the authors study the problem of zero-shot feature selection (i.e., building a feature selection model that generalizes well to "unseen" concepts with limited training data of "seen" concepts). Specifically, they adopt class-semantic descriptions (i.e., attributes) as supervision for feature selection, so as to utilize the supervised knowledge transferred from the seen concepts. To obtain more reliable discriminative features, they further propose the center-characteristic loss, which encourages the selected features to capture the central characteristics of seen concepts. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness of the method.
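
A hedged reading of the center-characteristic loss is sketched below: with a soft feature-selection weight vector, samples of each seen class should remain close to their class center in the weighted feature space. This is our interpretation for illustration only, not the paper's exact formulation.

```python
# A minimal sketch (our interpretation) of a center-style loss for feature
# selection: distances to class centers are measured on weighted features.
import numpy as np

def center_characteristic_loss(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Mean squared distance to class centers in the feature-weighted space."""
    loss, Xw = 0.0, X * w                      # w in [0, 1]^d softly selects features
    for c in np.unique(y):
        Xc = Xw[y == c]
        loss += np.sum((Xc - Xc.mean(axis=0)) ** 2)
    return loss / len(X)
```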


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Xiaoting Zhong ◽  
Brian Gallagher ◽  
Keenan Eves ◽  
Emily Robertson ◽  
T. Nathan Mundhenk ◽  
...  

Machine-learning (ML) techniques hold the potential of enabling efficient quantitative micrograph analysis, but the robustness of ML models with respect to real-world micrograph quality variations has not been carefully evaluated. We collected thousands of scanning electron microscopy (SEM) micrographs for molecular solid materials, in which image pixel intensities vary due to both the microstructure content and microscope instrument conditions. We then built ML models to predict the ultimate compressive strength (UCS) of consolidated molecular solids, by encoding micrographs with different image feature descriptors and training a random forest regressor, and by training an end-to-end deep-learning (DL) model. Results show that instrument-induced pixel intensity signals can affect ML model predictions in a consistently negative way. As a remedy, we explored intensity normalization techniques. It is seen that intensity normalization helps to improve micrograph data quality and ML model robustness, but microscope-induced intensity variations can be difficult to eliminate.
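
One common form of intensity normalization is sketched below; the specific scheme (per-image z-scoring, clipped and rescaled to [0, 1]) is an assumption for illustration and not necessarily the technique evaluated in the paper.

```python
# A minimal sketch of per-image intensity normalization for micrographs before
# feature extraction or model training. The clipping threshold is an assumption.
import numpy as np

def normalize_micrograph(img: np.ndarray, clip_sigma: float = 3.0) -> np.ndarray:
    """Per-image z-score normalization, clipped and rescaled to [0, 1]."""
    z = (img.astype(np.float32) - img.mean()) / (img.std() + 1e-8)
    z = np.clip(z, -clip_sigma, clip_sigma)
    return (z + clip_sigma) / (2 * clip_sigma)
```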


Author(s):  
Xianping Du ◽  
Onur Bilgen ◽  
Hongyi Xu

Machine learning for classification has been used widely in engineering design, for example, for feasible domain recognition and hidden pattern discovery. Training an accurate machine learning model requires a large dataset; however, high computational or experimental costs are a major obstacle to obtaining a large dataset for real-world problems. One possible solution is to generate a large pseudo dataset with surrogate models, which are established with a smaller set of real training data. However, it is not well understood whether the pseudo dataset benefits the classification model by providing more information or deteriorates the machine learning performance due to the prediction errors and uncertainties introduced by the surrogate model. This paper presents a preliminary investigation of this research question. A classification-and-regression-tree model is employed to recognize design subspaces to support design decision-making. It is applied to the geometric design of a vehicle energy-absorbing structure based on finite element simulations. Based on a small set of real-world data obtained by simulations, a surrogate model based on Gaussian process regression is employed to generate pseudo datasets for training. The results showed that the tree-based method can help recognize feasible design domains efficiently. Furthermore, the additional information provided by the surrogate model enhances the accuracy of classification. One important conclusion is that the accuracy of the surrogate model determines the quality of the pseudo dataset and hence the improvement in the machine learning model.
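
The augmentation idea can be sketched as follows: a Gaussian process surrogate fitted on a few real simulations labels a much larger pseudo dataset, which then trains a decision-tree classifier. The one-dimensional design variable, the response function, the feasibility threshold, and the sample sizes are all illustrative assumptions.

```python
# A minimal sketch of surrogate-generated pseudo data for a tree classifier.
# The design space, response, threshold, and sample sizes are placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_real = rng.uniform(0, 1, size=(30, 1))                           # few expensive simulations
y_real = np.sin(6 * X_real[:, 0]) + 0.05 * rng.normal(size=30)     # placeholder response

surrogate = GaussianProcessRegressor().fit(X_real, y_real)          # surrogate on real data

X_pseudo = rng.uniform(0, 1, size=(5000, 1))                        # cheap pseudo design points
y_pseudo = surrogate.predict(X_pseudo)                              # surrogate-predicted responses
feasible = (y_pseudo > 0.0).astype(int)                             # assumed feasibility threshold

tree = DecisionTreeClassifier(max_depth=4).fit(X_pseudo, feasible)  # design-subspace classifier
```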


2021 ◽  
Author(s):  
Jason Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that being attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially in the case of real-world (unstructured and unlabeled) datasets, is to leverage Snorkel, a tool specifically designed around a paradigm to rapidly create, manage, and model training data. Configured properly, Snorkel can be leveraged to temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, SME input, or a knowledge base) scripted in Python to generate "noisy labels". The functions traverse the entirety of the dataset and feed the labeled data into a generative (conditionally probabilistic) model. The purpose of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution algorithm. This is done by comparing the various labeling functions and the degree to which their outputs are congruent with each other. A labeling function that has a high degree of congruence with other labeling functions will have a high learned accuracy, that is, the fraction of predictions that the model got right. Conversely, a labeling function that has a low degree of congruence with other functions will have a low learned accuracy. The predictions are then combined according to the estimated weighted accuracies, whereby the predictions of functions with higher learned accuracy count more heavily. The result is a transformation from a binary classification of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data is added to this generative model, multi-class inference is made on the response variables (positive, negative, or abstain), assigning probabilistic labels to potentially millions of data points. Thus, we have generated a discriminative ground truth for all further labeling efforts and have improved the scalability of our models. The labeling functions can then be applied to unlabeled data to further machine learning efforts.

Once our datasets are labeled and a ground truth is established, we need to persist the data into our delta lake, since it combines the most performant aspects of a warehouse with the low-cost storage of a data lake. In addition, the lake can accept unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw-ingestion, cleaned, and feature-engineered data layers. By sectioning off the data sources into these "layers", the data engineering portion is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.

The design of the entire ecosystem is to eliminate as much technical debt in machine learning paradigms as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.
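
A hedged sketch of this weak-supervision flow, using the public Snorkel labeling API as we understand it, is shown below. The keyword-based labeling functions and the toy sentiment task are illustrative assumptions, not the pipeline described above.

```python
# A minimal sketch of Snorkel-style weak supervision: programmatic labeling
# functions produce noisy votes, and a generative label model combines them
# into probabilistic labels. The example task and heuristics are assumptions.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_keyword_good(x):
    # heuristic: obvious positive keyword
    return POSITIVE if "good" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_keyword_bad(x):
    # heuristic: obvious negative keyword
    return NEGATIVE if "bad" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": ["good product", "bad service", "arrived on time"]})
L = PandasLFApplier(lfs=[lf_keyword_good, lf_keyword_bad]).apply(df=df)

# The label model weighs the (noisy, possibly conflicting) functions by their
# agreement and outputs probabilistic labels in [0, 1].
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)
probs = label_model.predict_proba(L=L)
```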

