Superpixel segmentations for thin sections: evaluation of methods to enable the generation of machine learning training data sets

2021 ◽  
Author(s):  
Jiaxin Yu ◽  
Florian Wellmann ◽  
Simon Virgo ◽  
Marven von Domarus ◽  
Mingze Jiang ◽  
...  

Training data is the backbone of developing both Machine Learning (ML) models and specific Deep Learning (DL) algorithms. The paucity of well-labeled training images has significantly impeded ML-based approaches, especially novel DL methods such as Convolutional Neural Networks (CNNs), in mineral thin section image identification. Image annotation, especially pixel-wise annotation, is a costly process: manually creating dense semantic labels for rock thin section images has long been considered a formidable challenge in view of the ubiquitous variety and complexity of minerals in thin sections. To speed up annotation, we propose a human-computer collaborative pipeline in which superpixel segmentation serves as a boundary extractor, avoiding hand delineation of instance boundaries. The pipeline consists of two steps: superpixel segmentation using MultiSLIC, and superpixel labeling through a specifically designed tool. We use a cutting-edge methodology, Virtual Petroscopy (ViP), for automatic image acquisition. A Bentheimer sandstone sample is used for performance testing of the pipeline, and three standard error metrics are used to evaluate MultiSLIC. The results indicate that MultiSLIC extracts compact superpixels with satisfying boundary adherence from multiple input images. According to our test results, large and complex thin section images can be annotated with pixel-wise accurate labels more efficiently with the labeling tool than with conventional, purely manual work, and the resulting data are of high quality.
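MultiSLIC itself is the authors' multi-image adaptation and is not reproduced here, but the single-image superpixel extraction step it generalizes can be sketched with scikit-image's standard SLIC implementation; the file name and parameter values below are illustrative assumptions.

```python
# Minimal sketch of superpixel boundary extraction with standard SLIC
# (scikit-image). MultiSLIC, the authors' multi-image adaptation, is not
# shown; file names and parameters are illustrative.
import numpy as np
from skimage import io, segmentation

image = io.imread("thin_section.png")  # hypothetical input image

# Oversegment into compact superpixels; n_segments and compactness trade
# boundary adherence against regularity.
labels = segmentation.slic(image, n_segments=2000, compactness=10.0,
                           start_label=1)

# Overlay superpixel boundaries for inspection or as an annotation aid.
overlay = segmentation.mark_boundaries(image, labels)
io.imsave("superpixel_overlay.png", (overlay * 255).astype(np.uint8))
```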

2020 ◽  
Author(s):  
Jiaxin Yu ◽  
Joyce Schmatz ◽  
Marven von Domarus ◽  
Mingze Jiang ◽  
Simon Virgo ◽  
...  

Machine learning approaches and deep learning-based methods are efficient tools for problems for which large amounts of observations and data are documented, and they have shown excellent performance in many applications in the geosciences and remote sensing. However, they have not yet been applied to their full potential to one of the most fundamental data types in geoscientific studies: mineral thin sections. Mineral thin sections contain a wealth of information, and it is anticipated that thin section samples can be systematically and quantitatively analyzed with a specifically designed system equipped with ML approaches or deep learning methods such as CNNs. The development of any artificial intelligence application that enables automated image analysis requires consistent and sufficiently large training datasets with ground truth labels. However, a dataset that serves visual object detection in petrographic thin section analysis is still missing. We wish to close this data gap by generating a large dataset of pixel-wise annotated microscopic images of thin sections.

The variation of the optical features of certain minerals under different settings of a petrographic microscope is closely related to crystallographic characteristics that can be indicative of a mineral. In order to fully capture these optical features in digital images, we generated raw microscopic image data for different rock samples using virtual petrographic microscopy (ViP), a cutting-edge methodology that automatically scans entire thin sections at Gigapixel resolution under various polarization angles and illumination conditions. We show that using ViP data results in better segmentation than single-image acquisition.

Image annotation, especially pixel-wise annotation, is a time-consuming and inefficient process, and manually creating dense semantic labels for ViP data would be particularly challenging in view of its size and dimensionality. To address this problem, we propose a human-computer collaborative annotation pipeline in which computers extract image boundaries by splitting images into superpixels, while human annotators subsequently associate each superpixel with a class label by a single mouse click or brush stroke. This frees the human annotator from the burden of painstakingly delineating the exact boundaries of grains by hand and has the potential to significantly speed up the annotation process.

Superpixels align well with region boundaries and greatly reduce image complexity. Their use in the annotation pipeline not only reduces the manual workload for human annotators but also provides a significant dataset reduction by reducing the number of image primitives to operate on. In order to find the most suitable algorithms for superpixel segmentation, we evaluated state-of-the-art superpixel algorithms with regard to standard error metrics, based on scanned ViP images and corresponding hand-traced boundary maps. We also propose a novel adaptation of the SLIC superpixel extraction algorithm that can cope with the multiple information layers of ViP data. We plan to use these superpixel algorithms in our pipeline to generate open datasets of several types of mineral thin sections for the training of ML and DL algorithms.
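The abstract does not name the standard error metrics; boundary recall is one metric commonly used to score superpixel segmentations against hand-traced boundary maps, sketched below under that assumption.

```python
# Sketch: boundary recall, a common superpixel error metric (the abstract
# does not name its metrics, so this is an assumed example). It measures
# the fraction of ground-truth boundary pixels lying within a small
# tolerance of a superpixel boundary.
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.segmentation import find_boundaries

def boundary_recall(labels, gt_boundaries, tol=2):
    """labels: integer superpixel map; gt_boundaries: boolean boundary map."""
    sp_boundaries = find_boundaries(labels, mode="thick")
    # Grow superpixel boundaries by `tol` pixels to allow small offsets.
    sp_zone = binary_dilation(sp_boundaries, iterations=tol)
    hits = np.logical_and(gt_boundaries, sp_zone).sum()
    return hits / max(int(gt_boundaries.sum()), 1)
```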


2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Gaurav Vishwakarma ◽  
Mohammad Atif Faiz Afzal ◽  
Johannes Hachmann

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the more remote areas of the RI distribution: we bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as high-RI organic molecules for applications in opto-electronics.
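The abstract does not name the search-engine ranking metric; normalized discounted cumulative gain (NDCG) is one representative choice, and a minimal sketch under that assumption follows, using synthetic stand-in data.

```python
# Sketch: ranking evaluation with NDCG, a metric popular in web search.
# The abstract does not name its metric, so NDCG is an assumption here,
# and the RI values below are synthetic stand-ins.
import numpy as np
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)
true_ri = rng.normal(1.5, 0.1, size=10_000)            # "ground truth" RIs
pred_ri = true_ri + rng.normal(0, 0.02, size=10_000)   # model predictions

# Treat the true RI as graded relevance and ask how well the predicted
# ordering recovers the top-1000 candidates.
score = ndcg_score(true_ri[None, :], pred_ri[None, :], k=1000)
print(f"NDCG@1000: {score:.3f}")
```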


Author(s):  
Tobias M. Rasse ◽  
Réka Hollandi ◽  
Péter Horváth

Various pre-trained deep learning models for the segmentation of bioimages have been made available as 'developer-to-end-user' solutions. They usually require neither knowledge of machine learning nor coding skills, and are optimized for ease of use and deployability on laptops. However, testing these tools individually is tedious and success is uncertain.

Here, we present the 'Op'en 'Se'gmentation 'F'ramework (OpSeF), a Python framework for deep learning-based instance segmentation. OpSeF aims at facilitating the collaboration of biomedical users with experienced image analysts. It builds on the analysts' knowledge of Python, machine learning, and workflow design to solve complex analysis tasks at any scale in a reproducible, well-documented way. OpSeF defines standard inputs and outputs, thereby facilitating modular workflow design and interoperability with other software. Users play an important role in problem definition, quality control, and manual refinement of results. All analyst tasks are optimized for deployment on Linux workstations or GPU clusters; all user tasks may be performed on any laptop in ImageJ.

OpSeF semi-automates preprocessing, convolutional neural network (CNN)-based segmentation in 2D or 3D, and post-processing. It facilitates benchmarking of multiple models in parallel. OpSeF streamlines the optimization of parameters for pre- and post-processing such that an available model may frequently be used without retraining. Even if sufficiently good results are not achievable with this approach, the intermediate results can inform the analysts in selecting the most promising CNN architecture, for which the biomedical user might then invest the effort of manually labeling training data.

We provide Jupyter notebooks that document sample workflows based on various image collections. Analysts may find these notebooks useful to illustrate common segmentation challenges, as they prepare the advanced user for gradually taking over some of their tasks and completing their projects independently. The notebooks may also be used to explore the analysis options available within OpSeF in an interactive way and to document and share final workflows.

Currently, three mechanistically distinct CNN-based segmentation methods have been integrated within OpSeF: the U-Net implementation used in CellProfiler 3.0, StarDist, and Cellpose. Adding new networks requires little coding skill; adding new models requires none. Thus, OpSeF might soon become an interactive model repository in which pre-trained models may be shared, evaluated, and reused with ease.
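As an illustration of the kind of pre-trained CNN segmentation OpSeF integrates, the sketch below calls StarDist's public Python API directly rather than OpSeF's own interface; the input file name is a placeholder.

```python
# Sketch: pre-trained CNN instance segmentation with StarDist, one of the
# three methods OpSeF integrates. This uses StarDist's own API, not
# OpSeF's interface; the file names are placeholders.
from csbdeep.utils import normalize
from skimage import io
from stardist.models import StarDist2D

# Published pre-trained model, downloaded on first use.
model = StarDist2D.from_pretrained("2D_versatile_fluo")

img = io.imread("nuclei.tif")  # placeholder input image
labels, details = model.predict_instances(normalize(img, 1, 99.8))
io.imsave("nuclei_labels.tif", labels.astype("uint16"))
```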


Author(s):  
Lucile-Morgane Hays ◽  
Adeline Kerner

Digitization and online publishing of museum specimen data are happening worldwide. Studies based solely on online data are becoming increasingly accessible. Current events, such as the drive to reduce our transport-related carbon footprint and the COVID-19 pandemic, provide key opportunities to highlight the full value of digitized collections and their related tools, which allow us to continue our research from home or at least without travelling. Are existing data resources and tools adequate for engaging in a research project from beginning to end? To address this issue, we propose to use the Mexican archaeocyaths digitized collection from the Museum National d'Histoire Naturelle, Paris, France (MNHN) and the freeware Annotate in order to describe and identify all the archaeocyaths from the Mexican Cambrian reef. Archaeocyaths are aspiculate sponges that lived during the Cambrian Period. They were the first animals to build reefs. In the MNHN collection, they are found as thin-sections with several archaeocyaths per thin-section (Fig. 1). Multiple individuals are grouped under a single collection number and a single species name. The list of species in the thin-section is only captured on the paper label, and cannot currently be found online. To study an archaeocyath reef, the archaeocyaths have to be described and identified one by one, and the location of each specimen has to be accurately captured. Is it possible to do this with Annotate? Can a palaeontologist use only digitized specimens and Annotate to study a complete fauna of a given time and space? Annotate is an image annotation tool for the natural sciences. It allows users to measure, count, and tag all the morphological structures of an organism. Photos may be imported from the Recolnat database, or users may import their own photos. Users can measure lengths, surfaces, and angles, count occurrences, and add points of interest. Users can also tag the different individuals to identify them. Morphological terms may be imported as a standardized list from Xper2 or Xper3. Xper3 is a web platform that manages descriptive data and provides interactive identification keys. The results of the measurements and annotations can be exported into CSV format (comma-separated values) or into a structured descriptive data (SDD) format. To identify an archaeocyath to genus level, we need to identify morphological structures and count the occurrence of some of them; for an identification to species level, we need to measure different additional parts. The standardized list of morphological terms has been imported from the archaeocyaths genera knowledge base, and the list of measurements has been created directly in Annotate. Lengths (e.g., pore size, cup diameter), counts (e.g., number of septa, number of pores), and points of interest (e.g., tumuli, canals, septa) are easy to use. What are the key lessons learnt from this study? The digitized archaeocyaths from Mexico have been identified as easily with Annotate as if a microscope and thin sections were used. The CSV export provided quick access to statistics calculations. The main difference between a microscope and Annotate is the working time. Some functionalities of Annotate are not optimized, and their use is time-consuming. For instance, the photo import is not well suited to archaeocyath studies. Two sections (transversal and longitudinal) per specimen are necessary to see all the morphological structures.
These two parts of the same rock are packed together under one collection number. While users can easily switch from one section to another with a microscope, they cannot with Annotate: it allows only one photo per collection number from Recolnat, not images of both sections and their metadata. Although Annotate is not an intuitive tool to use, it is still very powerful; however, some training is required to take full advantage of it, and there is no documentation available. This freeware has great potential, as it can assist researchers in their work and offers an alternative to the need to travel around the world to study a fossil.
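Since the study relies on Annotate's CSV export for its statistics, a minimal sketch of that step with pandas follows; the file and column names are hypothetical, as the actual export schema is not given in the abstract.

```python
# Sketch: quick statistics from an Annotate CSV export. The file name and
# column names ("character", "value", "specimen") are hypothetical; adapt
# them to the actual export schema.
import pandas as pd

df = pd.read_csv("annotate_export.csv")

# Summary statistics per measured character (e.g., cup diameter, pore size).
print(df.groupby("character")["value"].describe())

# Counts per tagged individual, e.g., number of septa per specimen.
print(df[df["character"] == "septum"].groupby("specimen").size())
```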


2022 ◽  
pp. 1559-1575
Author(s):  
Mário Pereira Véstias

Machine learning is the study of algorithms and models with which computing systems perform tasks based on pattern identification and inference. When it is difficult or infeasible to develop an algorithm for a particular task, machine learning algorithms can provide an output based on previous training data. A well-known machine learning approach is deep learning, and the most recent deep learning models are based on artificial neural networks (ANNs). There are several types of artificial neural networks, including the feedforward neural network, the Kohonen self-organizing neural network, the recurrent neural network, the convolutional neural network, and the modular neural network. This article focuses on convolutional neural networks, with a description of the model, the training and inference processes, and their applicability. It also gives an overview of the most used CNN models and what to expect from the next generation of CNN models.
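To make the model description concrete, here is a minimal convolutional network in Keras (an illustrative sketch, not taken from the article): alternating convolution and pooling layers followed by fully connected layers.

```python
# Sketch: a minimal CNN of the kind the article describes, in Keras.
# The architecture and input shape (32x32 RGB, 10 classes) are
# illustrative assumptions, not taken from the article.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),  # convolution: local features
    layers.MaxPooling2D(),                    # pooling: spatial downsampling
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # fully connected layer
    layers.Dense(10, activation="softmax"),   # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```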


2021 ◽  
Author(s):  
Jaydip Sen ◽  
Sidra Mehtab ◽  
Abhishek Dutta

Prediction of stock prices has been an important area of research for a long time. While supporters of the efficient market hypothesis believe that it is impossible to predict stock prices accurately, there are formal propositions demonstrating that accurate modeling and the design of appropriate variables may lead to models with which stock prices and stock price movement patterns can be predicted very accurately. Researchers have also worked on the technical analysis of stocks with the goal of identifying patterns in stock price movements using advanced data mining techniques. In this work, we propose a hybrid modeling approach for stock price prediction, building different machine learning and deep learning-based models. For our study, we used NIFTY 50 index values of the National Stock Exchange (NSE) of India from December 29, 2014 to July 31, 2020. We built eight regression models using training data consisting of NIFTY 50 index records from December 29, 2014 to December 28, 2018. Using these regression models, we predicted the open values of NIFTY 50 for the period from December 31, 2018 to July 31, 2020. We then augmented the predictive power of our forecasting framework by building four deep learning-based regression models using long short-term memory (LSTM) networks with a novel approach of walk-forward validation. Using grid search, the hyperparameters of the LSTM models were optimized to ensure that the validation losses stabilize with an increasing number of epochs and that the validation accuracy converges. We exploit the power of LSTM regression models in forecasting future NIFTY 50 open values using four different models that differ in their architecture and in the structure of their input data. Extensive results are presented on various metrics for all the regression models. The results clearly indicate that the LSTM-based univariate model that uses one week of prior data as input for predicting the next week's open value of the NIFTY 50 time series is the most accurate model.
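Walk-forward validation, as described, refits the model on an expanding window and evaluates it on the step just beyond that window. The sketch below shows the scheme generically, with a naive last-value predictor and a synthetic series standing in for the paper's LSTM models and NIFTY 50 data.

```python
# Sketch: generic walk-forward validation. The LSTM models and window
# sizes from the paper are replaced by stand-ins: a naive last-value
# "model" and a synthetic series.
import numpy as np

def walk_forward(series, train_size, fit, predict_one):
    """Refit on an expanding window, predict the next point, then slide."""
    preds, actuals = [], []
    for t in range(train_size, len(series)):
        model = fit(series[:t])                  # train on all data up to t
        preds.append(predict_one(model, series[:t]))
        actuals.append(series[t])
    return np.array(preds), np.array(actuals)

series = np.sin(np.linspace(0, 20, 300))         # synthetic stand-in series
p, a = walk_forward(series, train_size=250,
                    fit=lambda s: None,          # naive model: nothing to fit
                    predict_one=lambda m, s: s[-1])
print("RMSE:", np.sqrt(np.mean((p - a) ** 2)))
```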


2020 ◽  
Author(s):  
Tim Henning ◽  
Benjamin Bergner ◽  
Christoph Lippert

Instance segmentation is a common task in quantitative cell analysis. While there are many approaches to it using machine learning, the training process typically requires a large amount of manually annotated data. We present HistoFlow, a software for annotation-efficient training of deep learning models for cell segmentation and analysis with an interactive user interface.

It provides an assisted annotation tool to quickly draw and correct cell boundaries and to use biomarkers as weak annotations. It also enables the user to create artificial training data to lower the labeling effort. We employ a universal U-Net neural network architecture that allows accurate instance segmentation and the classification of phenotypes in a single pass of the network. Transfer learning is available through the user interface to adapt trained models to new tissue types.

We demonstrate HistoFlow on fluorescence breast cancer images. The models trained using only artificial data perform comparably to those trained with time-consuming manual annotations. They outperform traditional cell segmentation algorithms and match state-of-the-art machine learning approaches. A user test shows that cells can be annotated six times faster than without the assistance of our annotation tool. Extending a segmentation model for the classification of epithelial cells can be done using only 50 to 1,500 annotations.

Our results show that, contrary to previous assumptions, it is possible to interactively train a deep learning model in a matter of minutes without many manual annotations.
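HistoFlow exposes its transfer-learning step through the user interface; in plain Keras terms, adapting a trained segmentation model to a new tissue type might look like the sketch below, where the file name, layer split, and hyperparameters are all assumptions.

```python
# Sketch: transfer learning of a trained segmentation model to a new
# tissue type, in plain Keras (HistoFlow's UI hides this step; the file
# name, layer split, and hyperparameters here are assumptions).
import tensorflow as tf

base = tf.keras.models.load_model("pretrained_unet.h5")  # hypothetical file

# Freeze all but the last few layers; fine-tune only the head.
for layer in base.layers[:-4]:
    layer.trainable = False

base.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
             loss="binary_crossentropy")
# base.fit(new_images, new_masks, epochs=5)  # adapt on the new tissue data
```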


Author(s):  
A. Wichmann ◽  
A. Agoub ◽  
M. Kada

Machine learning methods have gained in importance through the latest developments in artificial intelligence and computer hardware. In particular, approaches based on deep learning have shown that they can provide state-of-the-art results for various tasks. However, directly applying deep learning methods to improve the results of 3D building reconstruction is often not possible, for example due to the lack of suitable training data. To address this issue, we present RoofN3D, a new 3D point cloud training dataset that can be used to train machine learning models for different tasks in the context of 3D building reconstruction. It can be used, among other things, to train semantic segmentation networks or to learn the structure of buildings and the geometric model construction. Further details about RoofN3D and the developed data preparation framework, which enables the automatic derivation of training data, are described in this paper. Furthermore, we provide an overview of other available 3D point cloud training data and of approaches from the current literature that present solutions for applying deep learning to unstructured and non-gridded 3D point cloud data.


2018 ◽  
Vol 7 (4.36) ◽  
pp. 444 ◽  
Author(s):  
Alan F. Smeaton

One of the mathematical cornerstones of modern data analytics is machine learning, whereby we automatically learn subtle patterns which may be hidden in training data, associate those patterns with outcomes, and apply these patterns to new and unseen data to make predictions about as yet unseen outcomes. This form of data analytics allows us to bring value to the huge volumes of data that are collected from people, from the environment, from commerce, from online activities, from scientific experiments, and from many other sources. The mathematical basis for this form of machine learning has led to tools like Support Vector Machines, which have shown moderate effectiveness and good efficiency in their implementation. Recently, however, these have been usurped by the emergence of deep learning based on convolutional neural networks. In this presentation we will examine the basis for why such deep networks are remarkably successful and accurate, their similarity to the ways in which the human brain is organised, and the challenges of implementing such deep networks on conventional computer architectures.
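As a concrete instance of the learn-from-training-data, predict-on-unseen-data pattern the talk describes, here is a minimal Support Vector Machine example with scikit-learn (an illustrative sketch, not from the presentation).

```python
# Sketch: the classical pattern-learning setup the talk contrasts with
# deep learning, using scikit-learn's SVM on a small built-in dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", gamma=0.001)  # learn patterns from training data
clf.fit(X_train, y_train)
print("accuracy on unseen data:", clf.score(X_test, y_test))
```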


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Dmitrii Bychkov ◽  
Nina Linder ◽  
Aleksei Tiulpin ◽  
Hakan Kücükel ◽  
Mikael Lundin ◽  
...  

The treatment of patients with ERBB2 (HER2)-positive breast cancer with anti-ERBB2 therapy is based on the detection of ERBB2 gene amplification or protein overexpression. Machine learning (ML) algorithms can predict the amplification of ERBB2 based on tumor morphological features, but it is not known whether ML-derived features can predict survival and the efficacy of anti-ERBB2 treatment. In this study, we trained a deep learning model with digital images of hematoxylin–eosin (H&E)-stained formalin-fixed primary breast tumor tissue sections, weakly supervised by ERBB2 gene amplification status. The gene amplification was determined by chromogenic in situ hybridization (CISH). The training data comprised digitized tissue microarray (TMA) samples from 1,047 patients. The correlation between the deep learning–predicted ERBB2 status, which we call the H&E-ERBB2 score, and distant disease-free survival (DDFS) was investigated on a fully independent test set, which included whole-slide tumor images from 712 patients with trastuzumab treatment status available. The area under the receiver operating characteristic curve (AUC) in predicting gene amplification in the test sets was 0.70 (95% CI, 0.63–0.77) on 354 TMA samples and 0.67 (95% CI, 0.62–0.71) on 712 whole-slide images. Among patients with ERBB2-positive cancer treated with trastuzumab, those with a higher-than-median morphology-based H&E-ERBB2 score had more favorable DDFS than those with a lower score (hazard ratio [HR] 0.37; 95% CI, 0.15–0.93; P = 0.034). A high H&E-ERBB2 score was associated with unfavorable survival in patients with ERBB2-negative cancer as determined by CISH. ERBB2-associated morphology correlated with the efficacy of adjuvant anti-ERBB2 treatment and can contribute to treatment-predictive information in breast cancer.
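The survival comparison is a median split of the H&E-ERBB2 score followed by a Cox proportional hazards fit; a minimal sketch of that kind of analysis with the lifelines package follows (the data file and column names are hypothetical).

```python
# Sketch: median-split survival analysis of the kind reported, using the
# lifelines package. The data file and column names are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("scores_and_followup.csv")  # hypothetical data file
df["high_score"] = (df["he_erbb2_score"]
                    > df["he_erbb2_score"].median()).astype(int)

cph = CoxPHFitter()
cph.fit(df[["ddfs_months", "event", "high_score"]],
        duration_col="ddfs_months", event_col="event")
cph.print_summary()  # hazard ratio for high vs. low H&E-ERBB2 score
```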

