PUB-SalNet: A Pre-Trained Unsupervised Self-Aware Backpropagation Network for Biomedical Salient Segmentation

Algorithms ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 126 ◽  
Author(s):  
Feiyang Chen ◽  
Ying Jiang ◽  
Xiangrui Zeng ◽  
Jing Zhang ◽  
Xin Gao ◽  
...  

Salient segmentation is a critical step in biomedical image analysis, aiming to cut out the regions that are most interesting to humans. Recently, supervised methods have achieved promising results in biomedical areas, but they depend on annotated training data sets, which require labor and proficiency in the related background knowledge. In contrast, unsupervised learning makes data-driven decisions by obtaining insights directly from the data themselves. In this paper, we propose a completely unsupervised self-aware network based on pre-training and attentional backpropagation for biomedical salient segmentation, named PUB-SalNet. Firstly, we aggregate a new biomedical data set from several simulated Cellular Electron Cryo-Tomography (CECT) data sets featuring rich salient objects, different SNR settings, and various resolutions, which is called SalSeg-CECT. Based on the SalSeg-CECT data set, we then pre-train a model specially designed for biomedical tasks as a backbone module to initialize the network parameters. Next, we present a U-SalNet network that learns to selectively attend to salient objects. It includes two types of attention modules to facilitate learning saliency through global contrast and local similarity. Lastly, we jointly refine the salient regions together with feature representations from U-SalNet, with the parameters updated by self-aware attentional backpropagation. We apply PUB-SalNet to the analysis of 2D simulated and real images and achieve state-of-the-art performance on simulated biomedical data sets. Furthermore, the proposed PUB-SalNet can be easily extended to 3D images. The experimental results on the 2D and 3D data sets also demonstrate the generalization ability and robustness of our method.
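The abstract distinguishes attention driven by global contrast from attention driven by local similarity. A minimal PyTorch sketch of these two generic kinds of attention block is given below; the class names, reduction ratio, and pooling choices are illustrative assumptions, not the actual U-SalNet modules.

```python
# Illustrative sketch only: a generic global-contrast (channel) attention block and a
# local-similarity (spatial) attention block; not the exact U-SalNet modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContrastAttention(nn.Module):
    """Reweights channels using a globally pooled context vector."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        g = x.mean(dim=(2, 3))                 # global context per channel
        w = self.fc(g).unsqueeze(-1).unsqueeze(-1)
        return x * w

class LocalSimilarityAttention(nn.Module):
    """Highlights pixels whose features agree with their local neighbourhood."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, x):
        local_mean = self.pool(x)
        sim = F.cosine_similarity(x, local_mean, dim=1, eps=1e-6)  # (B, H, W)
        return x * sim.unsqueeze(1)

if __name__ == "__main__":
    feats = torch.randn(2, 32, 64, 64)
    out = LocalSimilarityAttention()(GlobalContrastAttention(32)(feats))
    print(out.shape)   # torch.Size([2, 32, 64, 64])
```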

2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable for improving early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables a clinically relevant explanation of individual predictions. Results: On the independent ADNI test data set, the XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points). The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoE ε4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
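The core mechanism here is data valuation followed by training-set pruning. The sketch below shows a plain Monte Carlo data-Shapley valuation with a logistic-regression valuation model on synthetic data, then retraining after dropping low-valued samples; the paper's exact features, truncation rules, and thresholds may differ, and scikit-learn's GradientBoostingClassifier stands in for XGBoost.

```python
# Minimal Monte Carlo data-Shapley sketch on synthetic data; the published valuation
# protocol (features, models, truncation) may differ from this illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for XGBoost
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=150, n_features=20, flip_y=0.15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def mc_data_shapley(X_tr, y_tr, X_val, y_val, n_perm=20, seed=0):
    """Average marginal contribution of each training point over random permutations."""
    rng = np.random.default_rng(seed)
    n = len(X_tr)
    values = np.zeros(n)
    for _ in range(n_perm):
        perm = rng.permutation(n)
        prev_score = 0.5                       # score of an uninformed classifier
        for k in range(1, n + 1):
            idx = perm[:k]
            if len(np.unique(y_tr[idx])) < 2:  # cannot fit a classifier yet
                continue
            clf = LogisticRegression(max_iter=200).fit(X_tr[idx], y_tr[idx])
            score = clf.score(X_val, y_val)
            values[perm[k - 1]] += score - prev_score
            prev_score = score
    return values / n_perm

shap_vals = mc_data_shapley(X_tr, y_tr, X_val, y_val)
keep = shap_vals > np.quantile(shap_vals, 0.25)            # drop the lowest-valued quarter
full = GradientBoostingClassifier().fit(X_tr, y_tr).score(X_val, y_val)
pruned = GradientBoostingClassifier().fit(X_tr[keep], y_tr[keep]).score(X_val, y_val)
print(f"full training set: {full:.3f}   pruned training set: {pruned:.3f}")
```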


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). In this article, we present an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
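The memory argument is that only an m×m matrix over the landmarks is ever eigendecomposed, so storage drops from O(N²) to O(N·m). A generic landmark-MDS sketch of this idea is shown below; random landmark selection stands in for the paper's curvature-based sampling, and the function and parameter names are illustrative.

```python
# Generic landmark MDS sketch: embed m landmarks exactly, triangulate the remaining
# points against them. Random landmark choice stands in for the paper's
# local-curvature-variation sampling.
import numpy as np
from scipy.spatial.distance import cdist

def landmark_mds(X, n_landmarks=50, n_components=2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    lm = rng.choice(n, size=n_landmarks, replace=False)
    D2 = cdist(X[lm], X[lm]) ** 2                    # m x m squared distances
    J = np.eye(n_landmarks) - np.ones((n_landmarks, n_landmarks)) / n_landmarks
    B = -0.5 * J @ D2 @ J                            # double-centred Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_components]
    L = evecs[:, order] * np.sqrt(evals[order])      # exact landmark embedding
    # distance-based triangulation of all N points, memory O(N * m) only
    Dx2 = cdist(X, X[lm]) ** 2
    pinv = evecs[:, order] / np.sqrt(evals[order])   # pseudo-inverse factor
    Y = -0.5 * (Dx2 - D2.mean(axis=0)) @ pinv
    return Y, L

X = np.random.default_rng(1).normal(size=(2000, 50))   # stand-in for hyperspectral pixels
Y, _ = landmark_mds(X)
print(Y.shape)   # (2000, 2)
```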


2021 ◽  
pp. 1-17
Author(s):  
Luis Sa-Couto ◽  
Andreas Wichert

Abstract Convolutional neural networks (CNNs) evolved from Fukushima's neocognitron model, which is based on the ideas of Hubel and Wiesel about the early stages of the visual cortex. Unlike other branches of neocognitron-based models, the typical CNN is based on end-to-end supervised learning by backpropagation and removes the focus from built-in invariance mechanisms, using pooling not as a way to tolerate small shifts but as a regularization tool that decreases model complexity. These properties of end-to-end supervision and flexibility of structure allow the typical CNN to become highly tuned to the training data, leading to extremely high accuracies on typical visual pattern recognition data sets. However, in this work, we hypothesize that there is a flip side to this capability, a hidden overfitting. More concretely, a supervised, backpropagation based CNN will outperform a neocognitron/map transformation cascade (MTCCXC) when trained and tested inside the same data set. Yet if we take both models trained and test them on the same task but on another data set (without retraining), the overfitting appears. Other neocognitron descendants like the What-Where model go in a different direction. In these models, learning remains unsupervised, but more structure is added to capture invariance to typical changes. Knowing that, we further hypothesize that if we repeat the same experiments with this model, the lack of supervision may make it worse than the typical CNN inside the same data set, but the added structure will make it generalize even better to another one. To put our hypothesis to the test, we choose the simple task of handwritten digit classification and take two well-known data sets of it: MNIST and ETL-1. To try to make the two data sets as similar as possible, we experiment with several types of preprocessing. However, regardless of the type in question, the results align exactly with expectation.
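The evaluation protocol is: train on one digit data set, then test on another without retraining. A minimal PyTorch sketch of that protocol follows; since ETL-1 has no torchvision loader, USPS is used here purely as a stand-in second data set, and the tiny CNN is illustrative rather than the architecture used in the paper.

```python
# Sketch of the train-on-one-set, test-on-another protocol. USPS stands in for ETL-1,
# and the small CNN is illustrative only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

tf = transforms.Compose([transforms.Resize((28, 28)), transforms.ToTensor()])
mnist_train = datasets.MNIST("data", train=True, download=True, transform=tf)
mnist_test = datasets.MNIST("data", train=False, download=True, transform=tf)
usps_test = datasets.USPS("data", train=False, download=True, transform=tf)

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 7 * 7, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for x, y in DataLoader(mnist_train, batch_size=128, shuffle=True):   # one epoch
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

@torch.no_grad()
def accuracy(ds):
    hits = sum((model(x).argmax(1) == y).sum().item()
               for x, y in DataLoader(ds, batch_size=256))
    return hits / len(ds)

# Typically: high in-distribution accuracy, a visible drop across data sets
# without retraining -- the "hidden overfitting" the abstract describes.
print("MNIST test:", accuracy(mnist_test), "  other data set:", accuracy(usps_test))
```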


Genes ◽  
2020 ◽  
Vol 11 (7) ◽  
pp. 717
Author(s):  
Garba Abdulrauf Sharifai ◽  
Zurinahni Zainol

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but a massive number of features (high dimensionality). High-dimensional and imbalanced data sets have posed severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced classes or high-dimensional data sets and came up with various methods. Nonetheless, few approaches reported in the literature address the intersection of the high-dimensional and imbalanced-class problem, due to their complicated interactions. Lately, feature selection has become a well-known technique used to overcome this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation-Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A Binary Grasshopper Optimisation Algorithm (BGOA) is used to formulate the feature selection process as an optimisation problem and to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high-dimensional and imbalanced data sets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
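For reference, the two evaluation metrics named here can be computed as below on an imbalanced toy problem (G-mean as the geometric mean of sensitivity and specificity, AUC from predicted probabilities); this sketch does not implement the rCBR-BGOA feature selection itself.

```python
# Sketch of the G-mean and AUC metrics on an imbalanced toy problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=50, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
sensitivity = tp / (tp + fn)          # recall on the minority class
specificity = tn / (tn + fp)          # recall on the majority class
g_mean = np.sqrt(sensitivity * specificity)

print(f"G-mean: {g_mean:.3f}   AUC: {roc_auc_score(y_te, proba):.3f}")
```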


2008 ◽  
Vol 15 (6) ◽  
pp. 1013-1022 ◽  
Author(s):  
J. Son ◽  
D. Hou ◽  
Z. Toth

Abstract. Various statistical methods are used to process operational Numerical Weather Prediction (NWP) products with the aim of reducing forecast errors, and they often require sufficiently large training data sets. Generating such a hindcast data set for this purpose can be costly, and a well-designed algorithm should be able to reduce the required size of these data sets. This issue is investigated for the relatively simple case of bias correction, by comparing a Bayesian algorithm of bias estimation with the conventionally used empirical method. As available forecast data sets are not large enough for a comprehensive test, synthetically generated time series representing the analysis (truth) and the forecast are used to increase the sample size. Since these synthetic time series retain the statistical characteristics of the observations and operational NWP model output, the results of this study can be extended to real observations and forecasts; this is confirmed by a preliminary test with real data. By using the climatological mean and standard deviation of the meteorological variable in question and the statistical relationship between the forecast and the analysis, the Bayesian bias estimator outperforms the empirical approach in terms of the accuracy of the estimated bias, and it can reduce the required size of the training sample by a factor of 3. This advantage of the Bayesian approach is due to the fact that it is less susceptible to sampling error in consecutive sampling. These results suggest that a carefully designed statistical procedure may reduce the need for the costly generation of large hindcast data sets.
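To make the comparison concrete, the toy sketch below contrasts an empirical bias estimate (sample mean of forecast errors) with a generic conjugate-normal Bayesian estimate that shrinks toward a climatological prior; this is a stand-in formulation for illustration, not necessarily the exact estimator derived in the paper, and all numbers are synthetic.

```python
# Toy comparison of empirical vs. Bayesian (conjugate-normal shrinkage) bias estimation.
import numpy as np

rng = np.random.default_rng(0)
true_bias = 1.5                         # systematic forecast error to be estimated
sigma_err = 3.0                         # day-to-day spread of forecast errors
prior_mean, prior_var = 0.0, 2.0 ** 2   # climatological prior on the bias

def empirical(errors):
    return errors.mean()

def bayesian(errors):
    n = len(errors)
    post_var = 1.0 / (n / sigma_err ** 2 + 1.0 / prior_var)
    return post_var * (errors.sum() / sigma_err ** 2 + prior_mean / prior_var)

for n_train in (5, 15, 45):             # size of the training sample
    se_emp, se_bay = [], []
    for _ in range(2000):
        errors = rng.normal(true_bias, sigma_err, size=n_train)
        se_emp.append((empirical(errors) - true_bias) ** 2)
        se_bay.append((bayesian(errors) - true_bias) ** 2)
    print(f"n={n_train:2d}  empirical RMSE={np.sqrt(np.mean(se_emp)):.2f}"
          f"  Bayesian RMSE={np.sqrt(np.mean(se_bay)):.2f}")
```

In this synthetic setting the Bayesian estimate reaches a given accuracy with a smaller training sample, which is the qualitative behaviour the abstract reports.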


2019 ◽  
Vol 7 (3) ◽  
pp. SE113-SE122 ◽  
Author(s):  
Yunzhi Shi ◽  
Xinming Wu ◽  
Sergey Fomel

Salt boundary interpretation is important for the understanding of salt tectonics and for velocity model building for seismic migration. Conventional methods consist of computing salt attributes and extracting salt boundaries. We have formulated the problem as 3D image segmentation and evaluated an efficient approach based on deep convolutional neural networks (CNNs) with an encoder-decoder architecture. To train the model, we design a data generator that extracts randomly positioned subvolumes from a large-scale 3D training data set, followed by data augmentation, and then feed a large number of subvolumes into the network, using salt/nonsalt binary labels generated by thresholding the velocity model as ground truth. We test the model on validation data sets and compare the blind test predictions with the ground truth. Our results indicate that our method is capable of automatically capturing subtle salt features from the 3D seismic image with little or no need for manual input. We further test the model on a field example to demonstrate the generalization of this deep CNN method across different data sets.
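A sketch of the kind of data generator described is given below: randomly positioned subvolumes cropped from a large 3D volume, binary salt labels obtained by thresholding the velocity model, and simple flip augmentation. The cube size and the velocity threshold are illustrative assumptions, not values from the paper.

```python
# Sketch of a random-subvolume training-data generator with thresholded velocity labels.
import numpy as np

def subvolume_generator(seismic, velocity, cube=64, batch=8,
                        salt_velocity=4.5, seed=0):
    """Yield (images, labels) batches of shape (batch, cube, cube, cube, 1)."""
    rng = np.random.default_rng(seed)
    nz, ny, nx = seismic.shape
    while True:
        imgs, labs = [], []
        for _ in range(batch):
            z = rng.integers(0, nz - cube)
            y = rng.integers(0, ny - cube)
            x = rng.integers(0, nx - cube)
            img = seismic[z:z+cube, y:y+cube, x:x+cube]
            lab = velocity[z:z+cube, y:y+cube, x:x+cube] >= salt_velocity
            for axis in range(3):                          # random flip augmentation
                if rng.random() < 0.5:
                    img, lab = np.flip(img, axis), np.flip(lab, axis)
            imgs.append(img)
            labs.append(lab)
        yield (np.stack(imgs)[..., None].astype("float32"),
               np.stack(labs)[..., None].astype("float32"))

# usage with synthetic stand-ins for a large seismic volume and its velocity model
seismic = np.random.randn(256, 256, 256).astype("float32")
velocity = np.random.uniform(1.5, 5.0, size=seismic.shape).astype("float32")
images, labels = next(subvolume_generator(seismic, velocity))
print(images.shape, labels.shape)   # (8, 64, 64, 64, 1) (8, 64, 64, 64, 1)
```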


2000 ◽  
Vol 20 (1) ◽  
pp. 7-15 ◽  
Author(s):  
R. Heintzmann ◽  
G. Kreth ◽  
C. Cremer

Fluorescent confocal laser scanning microscopy allows an improved imaging of microscopic objects in three dimensions. However, the resolution along the axial direction is three times worse than the resolution in the lateral directions. A method to overcome this axial limitation is tilting the object under the microscope, so that the optical axis points in different directions relative to the sample. A new technique for the simultaneous reconstruction from a number of such axial tomographic confocal data sets was developed and used for high-resolution reconstruction of 3D data, both from experimental and virtual microscopic data sets. The reconstructed images have a highly improved 3D resolution, which is comparable to the lateral resolution of a single deconvolved data set. Axial tomographic imaging in combination with simultaneous data reconstruction also opens the possibility of a more precise quantification of 3D data. The color images of this publication can be accessed at http://www.esacp.org/acp/2000/20-1/heintzmann.htm. At this web address an interactive 3D viewer is additionally provided for browsing the 3D data. This Java applet displays three orthogonal slices of the data set, which are dynamically updated by user mouse clicks or keystrokes.


2016 ◽  
Vol 12 (4) ◽  
pp. 448-476 ◽  
Author(s):  
Amir Hosein Keyhanipour ◽  
Behzad Moshiri ◽  
Maryam Piroozmand ◽  
Farhad Oroumchian ◽  
Ali Moeini

Purpose Learning to rank algorithms inherently face many challenges. The most important challenges are the high dimensionality of the training data, the dynamic nature of Web information resources and the lack of click-through data. High dimensionality of the training data affects the effectiveness and efficiency of learning algorithms. Besides, most learning to rank benchmark data sets do not include click-through data, a very rich source of information about the search behavior of users while dealing with ranked lists of search results. To deal with these limitations, this paper aims to introduce a novel learning to rank algorithm that uses a set of complex click-through features in a reinforcement learning (RL) model. These features are calculated from the existing click-through information in the data set or even from data sets without any explicit click-through information. Design/methodology/approach The proposed ranking algorithm (QRC-Rank) applies RL techniques to a set of calculated click-through features. QRC-Rank is a two-step process. In the first step, the Transformation phase, a compact benchmark data set is created which contains a set of click-through features. These features are calculated from the original click-through information available in the data set and constitute a compact representation of the click-through information. To find the most effective click-through features, a number of scenarios are investigated. The second phase is Model-Generation, in which an RL model is built to rank the documents. This model is created by applying temporal difference learning methods such as Q-Learning and SARSA. Findings The proposed learning to rank method, QRC-Rank, is evaluated on the WCL2R and LETOR4.0 data sets. Experimental results demonstrate that QRC-Rank outperforms state-of-the-art learning to rank methods such as SVMRank, RankBoost, ListNet and AdaRank based on the precision and normalized discounted cumulative gain evaluation criteria. The use of click-through features calculated from the training data set is a major contributor to the performance of the system. Originality/value In this paper, we have demonstrated the viability of the proposed features, which provide a compact representation of the click-through data in a learning to rank application. These compact click-through features are calculated from the original features of the learning to rank benchmark data set. In addition, a Markov Decision Process model is proposed for the learning to rank problem using RL, including the sets of states, actions, the rewarding strategy and the transition function.
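The temporal-difference machinery mentioned (Q-Learning) can be illustrated with a toy formulation of ranking as a sequential decision process: the state is the list position, the action is which remaining document to place there, and the reward is a click-through-derived relevance score. This sketch shows only the Q-learning update; the actual QRC-Rank state, action and reward definitions are richer, and the relevance scores here are hypothetical.

```python
# Toy Q-learning sketch of ranking as a sequential decision process.
import numpy as np

rng = np.random.default_rng(0)
relevance = np.array([0.9, 0.1, 0.6, 0.3, 0.0])    # hypothetical click-through scores
n_docs = len(relevance)
Q = np.zeros((n_docs, n_docs))                     # Q[position, document]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for _ in range(5000):
    remaining = set(range(n_docs))
    for pos in range(n_docs):
        avail = list(remaining)
        if rng.random() < epsilon:                 # epsilon-greedy exploration
            doc = rng.choice(avail)
        else:
            doc = max(avail, key=lambda d: Q[pos, d])
        reward = relevance[doc] / np.log2(pos + 2)     # DCG-style position discount
        remaining.discard(doc)
        next_best = max((Q[pos + 1, d] for d in remaining), default=0.0)
        # temporal-difference (Q-learning) update
        Q[pos, doc] += alpha * (reward + gamma * next_best - Q[pos, doc])

ranking, remaining = [], set(range(n_docs))
for pos in range(n_docs):
    doc = max(remaining, key=lambda d: Q[pos, d])
    ranking.append(doc)
    remaining.discard(doc)
print("learned ranking:", ranking)    # most relevant documents placed first
```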


2019 ◽  
Author(s):  
Jacob Schreiber ◽  
Jeffrey Bilmes ◽  
William Stafford Noble

Abstract
Motivation: Recent efforts to describe the human epigenome have yielded thousands of uniformly processed epigenomic and transcriptomic data sets. These data sets characterize a rich variety of biological activity in hundreds of human cell lines and tissues ("biosamples"). Understanding these data sets, and specifically how they differ across biosamples, can help explain many cellular mechanisms, particularly those driving development and disease. However, due primarily to cost, the total number of assays that can be performed is limited. Previously described imputation approaches, such as Avocado, have sought to overcome this limitation by predicting genome-wide epigenomics experiments using learned associations among available epigenomic data sets. However, these previous imputations have focused primarily on measurements of histone modification and chromatin accessibility, despite other biological activity being crucially important.
Results: We applied Avocado to a data set of 3,814 tracks of data derived from the ENCODE compendium, spanning 400 human biosamples and 84 assays. The resulting imputations cover measurements of chromatin accessibility, histone modification, transcription, and protein binding. We demonstrate the quality of these imputations by comprehensively evaluating the model's predictions and by showing significant improvements in protein binding performance compared to the top models in an ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model, achieving high accuracy at predicting protein binding, even with only a single track of training data.
Availability: Tutorials and source code are available under an Apache 2.0 license at https://github.com/jmschrei/
Contact: [email protected] or [email protected]
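The general flavour of this kind of imputation model can be sketched as a factorization over (biosample, assay, genomic position) triples, with learned embeddings combined by a small network to predict the signal value; adding a new assay or biosample then amounts to learning one new embedding vector. The PyTorch sketch below is only an illustration of that idea; the dimensions and layer sizes are placeholders, not Avocado's published configuration.

```python
# Illustrative embedding-factorization sketch for epigenomic track imputation.
import torch
import torch.nn as nn

class FactorizationImputer(nn.Module):
    def __init__(self, n_biosamples=400, n_assays=84, n_positions=100_000,
                 d_bio=32, d_assay=64, d_pos=25):
        super().__init__()
        self.bio = nn.Embedding(n_biosamples, d_bio)
        self.assay = nn.Embedding(n_assays, d_assay)
        self.pos = nn.Embedding(n_positions, d_pos)
        self.net = nn.Sequential(
            nn.Linear(d_bio + d_assay + d_pos, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, bio_idx, assay_idx, pos_idx):
        z = torch.cat([self.bio(bio_idx), self.assay(assay_idx), self.pos(pos_idx)], dim=-1)
        return self.net(z).squeeze(-1)          # predicted signal value

model = FactorizationImputer()
# a batch of (biosample, assay, position) triples with observed signal values
bio = torch.randint(0, 400, (1024,))
assay = torch.randint(0, 84, (1024,))
pos = torch.randint(0, 100_000, (1024,))
signal = torch.rand(1024)
loss = nn.functional.mse_loss(model(bio, assay, pos), signal)
loss.backward()     # new assays or biosamples extend the embedding tables only
print(float(loss))
```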


2021 ◽  
Vol 11 ◽  
Author(s):  
Jianwen Hu ◽  
Yongchen Ma ◽  
Ju Ma ◽  
Yanpeng Yang ◽  
Yingze Ning ◽  
...  

A good prediction model is useful for accurately predicting patient prognosis. Tumor–node–metastasis (TNM) staging alone often cannot accurately predict prognosis. Some researchers have shown that infiltration of M2 macrophages in many tumors indicates poor prognosis. This approach has the potential to predict prognosis more accurately when used in combination with TNM staging, but there has been less research on this in gastric cancer. A multivariate analysis demonstrated that CD163 expression, TNM staging, age, and gender were independent risk factors for overall survival. Thus, these parameters were used to develop a nomogram on the training data set, which was tested on the validation and whole data sets. The model showed a high degree of discrimination and calibration and good clinical benefit in the training, validation, and whole data sets. In conclusion, we combined CD163 expression in macrophages, TNM staging, age, and gender to develop a nomogram to predict 3- and 5-year overall survival after curative resection for gastric cancer. This model has the potential to provide further diagnostic and prognostic value for patients with gastric cancer.
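The modelling step described here is a multivariable Cox model over CD163 expression, TNM stage, age, and gender, from which 3- and 5-year survival probabilities are read off (the usual basis of a nomogram). The lifelines sketch below uses synthetic data; the variable coding and fitted coefficients are illustrative, not the paper's.

```python
# Sketch of a multivariable Cox model and 3-/5-year survival prediction (lifelines).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "cd163_high": rng.integers(0, 2, n),        # dichotomised macrophage marker
    "tnm_stage": rng.integers(1, 4, n),
    "age": rng.normal(62, 10, n),
    "male": rng.integers(0, 2, n),
})
# synthetic survival times driven by an assumed risk score
risk = 0.8 * df.cd163_high + 0.6 * df.tnm_stage + 0.02 * df.age + 0.2 * df.male
df["months"] = rng.exponential(120 / np.exp(risk - risk.mean()))
df["event"] = (rng.random(n) < 0.7).astype(int)   # 1 = death observed

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="event")
cph.print_summary()

# predicted 3- and 5-year overall survival for a new patient
patient = pd.DataFrame({"cd163_high": [1], "tnm_stage": [3], "age": [70], "male": [1]})
print(cph.predict_survival_function(patient, times=[36, 60]))
```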

