Accurate Segmentation of Bacterial Cells using Synthetic Training Data

We present a novel method of bacterial image segmentation using machine learning based on Synthetic Micro-graphs of Bacteria (SyMBac). SyMBac allows for rapid, automatic creation of arbitrary amounts of training datathat combines detailed models of cell growth, physical interactions, and microscope optics to create synthetic images which closely resemble real micrographs, with access to the ground truth positions of cells. We also demonstrate that models trained on SyMBac data generate more accurate and precise cell masks than those trained on human annotated data, because the model learns the true position of the cell irrespective of imaging artefacts

Download Full-text

P5627Automated detection of calcified plaques in coronary optical coherence tomography images using image segmentation based on machine learning

European Heart Journal ◽

10.1093/eurheartj/ehz746.0571 ◽

2019 ◽

Vol 40 (Supplement_1) ◽

Author(s):

C Gangl ◽

C Roth ◽

D Dalos ◽

G Delle-Karth ◽

T Neunteufl ◽

...

Keyword(s):

Machine Learning ◽

Optical Coherence Tomography ◽

Image Segmentation ◽

Prediction Accuracy ◽

Ground Truth ◽

Training Data ◽

Optical Coherence ◽

Learning Methods ◽

Machine Learning Methods ◽

Plaque Detection

Abstract Background and aim Automated image recognition based on machine learning methods was proven to be feasible in several medical imaging applications recently. Beside image classification methods to categorize input images for example into healthy or suspicious, image segmentation allows accurate localization of pathologies and thereby facilitates a wide area of applications. Because of the unique composition of every machine learning problem the applicability of image segmentation methods for detecting coronary pathologies in optical coherence tomography images remains unclear. Furthermore, the prediction accuracy of deep learning methods usually depends on vast amounts of training data which are often not available for particular medical questions. Therefore special strategies need to be applied to achieve satisfying results with smaller training datasets. We aimed to investigate the applicability of machine learning methods for plaque detection in coronary OCT images, especially considering the challenge of a small training dataset. Methods Originating from a dataset of 104 OCT frames containing calcified plaques, we performed image preprocessing using a custom build OCT image processing software to crop the luminal part as well as the areas outside the circular OCT signal to reduce entropy. Furthermore, plaques were identified and marked by an experienced OCT analyst, drawing plaque-enclosing polygonal masks using the same software. We also performed common image augmentation strategies, primarily applying rotation and zoom operations. Subsequently, we split the samples randomly into training, validation and test datasets (80:10:10%). To train the segmentation model, we fed the training and validation samples into an U-Net Convolutional Neuronal Network implementation with domain-specific adaptions using the RMSprop optimizer based on the publicly available PyTorch library. Results After 50 training epochs, we could achieve a prediction accuracy of 74.4% with the current configuration measured by the Sørensen–Dice coefficient comparing the similarity of the predicted plaque masks with the ground truth samples (figure 1 illustrates an exemplary comparison between predicted and ground truth plaque masks). Exemplary projection of a predicted mask Conclusion We were able to show that image segmentation based on machine learning strategies is a feasible way for automated plaque detection in coronary OCT imaging even based on small training datasets. Larger training datasets are necessary to raise prediction accuracy.

Download Full-text

Unraveling the deep learning gearbox in optical coherence tomography image segmentation towards explainable artificial intelligence

Communications Biology ◽

10.1038/s42003-021-01697-y ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Peter M. Maloca ◽

Philipp L. Müller ◽

Aaron Y. Lee ◽

Adnan Tufail ◽

Konstantinos Balaskas ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Optical Coherence Tomography ◽

Image Segmentation ◽

Convolutional Neural Network ◽

Learning Algorithm ◽

Ground Truth ◽

Optical Coherence Tomography Image ◽

Optical Coherence ◽

Tomography Image

AbstractMachine learning has greatly facilitated the analysis of medical data, while the internal operations usually remain intransparent. To better comprehend these opaque procedures, a convolutional neural network for optical coherence tomography image segmentation was enhanced with a Traceable Relevance Explainability (T-REX) technique. The proposed application was based on three components: ground truth generation by multiple graders, calculation of Hamming distances among graders and the machine learning algorithm, as well as a smart data visualization (‘neural recording’). An overall average variability of 1.75% between the human graders and the algorithm was found, slightly minor to 2.02% among human graders. The ambiguity in ground truth had noteworthy impact on machine learning results, which could be visualized. The convolutional neural network balanced between graders and allowed for modifiable predictions dependent on the compartment. Using the proposed T-REX setup, machine learning processes could be rendered more transparent and understandable, possibly leading to optimized applications.

Download Full-text

Inverse biomechanical modeling via machine learning and synthetic training data

Medical Imaging 2018: Image-Guided Procedures, Robotic Interventions, and Modeling ◽

10.1117/12.2296927 ◽

2018 ◽

Author(s):

Maureen Stone ◽

Aniket Tolpadi ◽

Aaron Carass ◽

Jerry Prince ◽

Arnold Gomez

Keyword(s):

Machine Learning ◽

Training Data ◽

Biomechanical Modeling ◽

Synthetic Training Data

Download Full-text

Glean

Proceedings of the VLDB Endowment ◽

10.14778/3447689.3447703 ◽

2021 ◽

Vol 14 (6) ◽

pp. 997-1005

Author(s):

Sandeep Tata ◽

Navneet Potti ◽

James B. Wendt ◽

Lauro Beltrão Costa ◽

Marc Najork ◽

...

Keyword(s):

Machine Learning ◽

Data Management ◽

Real World ◽

Empirical Studies ◽

Ground Truth ◽

Training Data ◽

Ground Truth Data ◽

Document Type ◽

Machine Learning Model ◽

Structured Information

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges : 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.

Download Full-text

Programmatic Labeling of Dark Data for Artificial Intelligence in Spatial Informatics

10.5194/egusphere-egu21-16326 ◽

2021 ◽

Author(s):

Jason Meil

Keyword(s):

Machine Learning ◽

Binary Classification ◽

Joint Probability ◽

Ground Truth ◽

Training Data ◽

Data Sources ◽

Joint Probability Distribution ◽

Weak Supervision ◽

Response Variable ◽

High Degree

Data preparation process generally consumes up to 80% of the Data Scientists time, with 60% of that being attributed to cleaning and labeling data.[1]&#160; Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially in the case of real world&#8212;unstructured and unlabeled datasets&#8212;is to leverage Snorkel, a tool specifically designed around a paradigm to rapidly create, manage, and model training data. Configured properly, Snorkel can be leveraged to temper this labeling bottle-neck through a process called weak supervision. Weak supervision uses programmatic labeling functions&#8212;heuristics, distant supervision, SME or knowledge base&#8212;scripted in python to generate &#8220;noisy labels&#8221;. The function traverses the entirety of the dataset and feeds the labeled data into a generative&#8212;conditionally probabilistic&#8212;model. The function of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution algorithm. This is done by comparing the various labeling functions and the degree to which their outputs are congruent to each other. A single labeling function that has a high degree of congruence with other labeling functions will have a high degree of learned accuracy, that is, the fraction of predictions that the model got right. Conversely, single labeling functions that have a low degree of congruence with other functions will have low learned accuracy. Each prediction is then combined by the estimated weighted accuracy, whereby the predictions of the higher learned functions are counted multiple times. The result yields a transformation from a binary classification of 0 or 1 to a fuzzy label between 0 and 1&#8212; there is &#8220;x&#8221; probability that based on heuristic &#8220;n&#8221;, the response variable is &#8220;y&#8221;. The addition of data to this generative model multi-class inference will be made on the response variables positive, negative, or abstain, assigning probabilistic labels to potentially millions of data points. Thus, we have generated a discriminative ground truth for all further labeling efforts and have improved the scalability of our models. Labeling functions can be applied to unlabeled data to further machine learning efforts. &#160; Once our datasets are labeled and a ground truth is established, we need to persist the data into our delta lake since it combines the most performant aspects of a warehouse with the low-cost storage for data lakes. In addition, the lake can accept unstructured, semi structured, or structured data sources, and those sources can be further aggregated into raw ingestion, cleaned, and feature engineered data layers.&#160; By sectioning off the data sources into these &#8220;layers&#8221;, the data engineering portion is abstracted away from the data scientist, who can access model ready data at any time.&#160; Data can be ingested via batch or stream.&#160; &#160; The design of the entire ecosystem is to eliminate as much technical debt in machine learning paradigms as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post verification.&#160;

Download Full-text

Key Technology Considerations in Developing and Deploying Machine Learning Models in Clinical Radiology Practice (Preprint)

10.2196/preprints.28776 ◽

2021 ◽

Author(s):

Viraj Kulkarni ◽

Manish Gawali ◽

Amit Kharat

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Ground Truth ◽

Research Literature ◽

Training Data ◽

Software Developers ◽

Intelligent Software ◽

Misclassification Costs ◽

Clinical Radiology ◽

Class Representation

UNSTRUCTURED The use of machine learning to develop intelligent software tools for interpretation of radiology images has gained widespread attention in recent years. The development, deployment, and eventual adoption of these models in clinical practice, however, remains fraught with challenges. In this paper, we propose a list of key considerations that machine learning researchers must recognize and address to make their models accurate, robust, and usable in practice. Namely, we discuss: insufficient training data, decentralized datasets, high cost of annotations, ambiguous ground truth, imbalance in class representation, asymmetric misclassification costs, relevant performance metrics, generalization of models to unseen datasets, model decay, adversarial attacks, explainability, fairness and bias, and clinical validation. We describe each consideration and identify techniques to address it. Although these techniques have been discussed in prior research literature, by freshly examining them in the context of medical imaging and compiling them in the form of a laundry list, we hope to make them more accessible to researchers, software developers, radiologists, and other stakeholders.

Download Full-text

Genus-Physiognomy-Ecosystem Map with 88 Legends Produced at 10m-resolution First-time in a Country Scale through Machine Learning of Multi-temporal Satellite Images

10.20944/preprints202201.0123.v1 ◽

2022 ◽

Author(s):

Ram C. Sharma ◽

Keitarou Hara

Keyword(s):

Machine Learning ◽

Satellite Images ◽

Graphics Processing Unit ◽

Ground Truth ◽

Training Data ◽

Gradient Boosting ◽

Processing Unit ◽

Spectral Bands ◽

Multi Temporal ◽

First Time

This research introduces Genus-Physiognomy-Ecosystem (GPE) mapping at a prefecture level through machine learning of multi-spectral and multi-temporal satellite images at 10m spatial resolution, and later integration of prefecture wise maps into country scale for dealing with 88 GPE types to be classified from a large size of training data involved in the research effectively. This research was made possible by harnessing entire archives of Level-2A product, Bottom of Atmosphere reflectance images collected by MultiSpectral Instruments onboard a constellation of two polar-orbiting Sentinel-2 mission satellites. The satellite images were pre-processed for cloud masking and monthly median composite images consisting of 10 multi-spectral bands and 7 spectral indexes were generated. The ground truth labels were extracted from extant vegetation survey maps by implementing systematic stratified sampling approach and noisy labels were dropped out for preparing a reliable ground truth database. Graphics Processing Unit (GPU) implementation of Gradient Boosting Decision Trees (GBDT) classifier was employed for classification of 88 GPE types from 204 satellite features. The classification accuracy computed with 25% test data varied from 65-81% in terms of F1-score across 48 prefectural regions. This research produced seamless maps of 88 GPE types first time at a country scale with an average 72% F1-score.

Download Full-text

Crowdsourcing Image Analysis for Plant Phenomics to Generate Ground Truth Data for Machine Learning

10.1101/265918 ◽

2018 ◽

Author(s):

Naihui Zhou ◽

Zachary D Siegel ◽

Scott Zarecor ◽

Nigel Lee ◽

Darwin A Campbell ◽

...

Keyword(s):

Machine Learning ◽

Image Analysis ◽

Best Practices ◽

Ground Truth ◽

Training Data ◽

Quality Data ◽

High Quality ◽

Ground Truth Data ◽

Plant Phenomics

AbstractThe accuracy of machine learning tasks critically depends on high quality ground truth data. Therefore, in many cases, producing good ground truth data typically involves trained professionals; however, this can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate a large number of training data of good quality. We explore an image analysis task involving the segmentation of corn tassels from images taken in a field setting. We investigate the accuracy, speed and other quality metrics when this task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We conclude that the Amazon MTurk and Master Mturk workers perform significantly better than the for-credit students, but with no significant difference between the two MTurk worker types. Furthermore, the quality of the segmentation produced by Amazon MTurk workers rivals that of an expert worker. We provide best practices to assess the quality of ground truth data, and to compare data quality produced by different sources. We conclude that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at a low cost and high quality, especially in the context of high throughput plant phenotyping. We also provide several metrics for assessing the quality of the generated datasets.Author SummaryFood security is a growing global concern. Farmers, plant breeders, and geneticists are hastening to address the challenges presented to agriculture by climate change, dwindling arable land, and population growth. Scientists in the field of plant phenomics are using satellite and drone images to understand how crops respond to a changing environment and to combine genetics and environmental measures to maximize crop growth efficiency. However, the terabytes of image data require new computational methods to extract useful information. Machine learning algorithms are effective in recognizing select parts of images, butthey require high quality data curated by people to train them, a process that can be laborious and costly. We examined how well crowdsourcing works in providing training data for plant phenomics, specifically, segmenting a corn tassel – the male flower of the corn plant – from the often-cluttered images of a cornfield. We provided images to students, and to Amazon MTurkers, the latter being an on-demand workforce brokered by Amazon.com and paid on a task-by-task basis. We report on best practices in crowdsourcing image labeling for phenomics, and compare the different groups on measures such as fatigue and accuracy over time. We find that crowdsourcing is a good way of generating quality labeled data, rivaling that of experts.

Download Full-text

The Active Segmentation Platform for Microscopic Image Classification and Segmentation

Brain Sciences ◽

10.3390/brainsci11121645 ◽

2021 ◽

Vol 11 (12) ◽

pp. 1645

Author(s):

Sumit K. Vohra ◽

Dimiter Prodanov

Keyword(s):

Machine Learning ◽

Image Segmentation ◽

Image Classification ◽

Domain Knowledge ◽

Feature Space ◽

Ground Truth ◽

Classification Problem ◽

Data Sets ◽

Learning Approaches ◽

Data Set

Image segmentation still represents an active area of research since no universal solution can be identified. Traditional image segmentation algorithms are problem-specific and limited in scope. On the other hand, machine learning offers an alternative paradigm where predefined features are combined into different classifiers, providing pixel-level classification and segmentation. However, machine learning only can not address the question as to which features are appropriate for a certain classification problem. The article presents an automated image segmentation and classification platform, called Active Segmentation, which is based on ImageJ. The platform integrates expert domain knowledge, providing partial ground truth, with geometrical feature extraction based on multi-scale signal processing combined with machine learning. The approach in image segmentation is exemplified on the ISBI 2012 image segmentation challenge data set. As a second application we demonstrate whole image classification functionality based on the same principles. The approach is exemplified using the HeLa and HEp-2 data sets. Obtained results indicate that feature space enrichment properly balanced with feature selection functionality can achieve performance comparable to deep learning approaches. In summary, differential geometry can substantially improve the outcome of machine learning since it can enrich the underlying feature space with new geometrical invariant objects.

Download Full-text

“Garbage In, Garbage Out” Revisited: What Do Machine Learning Application Papers Report About Human-Labeled Training Data?

Quantitative Science Studies ◽

10.1162/qss_a_00144 ◽

2021 ◽

pp. 1-32

Author(s):

R. Stuart Geiger ◽

Dominique Cope ◽

Jamie Ip ◽

Marsha Lotosh ◽

Aayush Shah ◽

...

Keyword(s):

Machine Learning ◽

Best Practices ◽

Ground Truth ◽

Training Data ◽

Supervised Machine Learning ◽

Social Media Platforms ◽

Learning Research ◽

Research And Education ◽

Application Fields

Abstract Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data. This study builds on prior work that investigated to what extent ‘best practices’ around labeling training data were followed in applied ML publications within a single domain (social media platforms). In this paper, we expand by studying publications that apply supervised ML in a far broader spectrum of disciplines, focusing on human-labeled data. We report to what extent a random sample of ML application papers across disciplines give specific details about whether best practices were followed, while acknowledging that a greater range of application fields necessarily produces greater diversity of labeling and annotation methods. Because much of machine learning research and education only focuses on what is done once a “ground truth” or “gold standard” of training data is available, it is especially relevant to discuss issues around the equally-important aspect of whether such data is reliable in the first place. This determination becomes increasingly complex when applied to a variety of specialized fields, as labeling can range from a task requiring little-to-no background knowledge to one that must be performed by someone with career expertise. Peer Review https://publons.com/publon/10.1162/qss_a_00144

Download Full-text