An Efficient Algorithm for the Detection of Outliers in Mislabeled Omics Data

2021, Vol 2021, pp. 1-11
Author(s): Hongwei Sun, Jiu Wang, Zhongwen Zhang, Naibao Hu, Tong Wang

High dimensionality and noise make it difficult to detect related biomarkers in omics data. Previous studies have shown that penalized maximum trimmed likelihood estimation is effective in identifying mislabeled samples in high-dimensional data with labeling errors. However, the algorithm commonly used in these studies is the concentration step (C-step), and when the C-step is applied to robust penalized regression it does not guarantee that the criterion function improves at every iteration, because the regularization parameters change during the iteration. This makes the C-step algorithm run very slowly, especially on high-dimensional omics data. The AR-Cstep algorithm (C-step combined with an acceptance-rejection scheme) is proposed to address this. In simulation experiments, the AR-Cstep algorithm converged faster (the average computation time was only 2% of that of the C-step algorithm) and was more accurate in variable selection and outlier identification than the C-step algorithm. The two algorithms were further compared on triple-negative breast cancer (TNBC) RNA-seq data. AR-Cstep can solve the problem of the C-step not converging and ensures that the iterative process moves in a direction that improves the criterion function. As an improvement of the C-step algorithm, the AR-Cstep algorithm can be extended to other robust models with regularized parameters.
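The idea of pairing a concentration step with an acceptance check can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' AR-Cstep: it uses a ridge-penalized trimmed least-squares criterion with a fixed penalty `lam` (so the check acts only as a safeguard here), and the function names and parameters are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def trimmed_objective(X, y, beta, subset, lam):
    """Penalized squared-error loss evaluated on the retained subset only."""
    r = y[subset] - X[subset] @ beta
    return float(r @ r + lam * (beta @ beta))

def c_step_with_acceptance(X, y, h, lam=1.0, max_iter=50, seed=0):
    """Iterate concentration steps, accepting a move only if the criterion improves."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subset = np.sort(rng.choice(n, size=h, replace=False))  # random starting subset
    beta = ridge_fit(X[subset], y[subset], lam)
    obj = trimmed_objective(X, y, beta, subset, lam)
    history = [obj]
    for _ in range(max_iter):
        # Concentration step: keep the h observations with the smallest residuals.
        sq_resid = (y - X @ beta) ** 2
        cand = np.sort(np.argsort(sq_resid)[:h])
        cand_beta = ridge_fit(X[cand], y[cand], lam)
        cand_obj = trimmed_objective(X, y, cand_beta, cand, lam)
        # Acceptance check: reject the candidate (and stop) unless it strictly improves.
        if cand_obj >= obj:
            break
        subset, beta, obj = cand, cand_beta, cand_obj
        history.append(obj)
    return beta, subset, history
```

Observations left out of the returned `subset` are the candidates for mislabeled or outlying samples. In a setting where the penalty is re-tuned at each iteration, the acceptance check is what would keep the criterion from getting worse between steps.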

2017, Vol 2017, pp. 1-14
Author(s): Anne-Laure Boulesteix, Riccardo De Bin, Xiaoyu Jiang, Mathias Fuchs

As modern biotechnologies advance, it has become increasingly common for different modalities of high-dimensional molecular data (termed “omics” data in this paper), such as gene expression, methylation, and copy number, to be collected from the same patient cohort to predict the clinical outcome. While prediction based on omics data has been widely studied in the last fifteen years, little has been done in the statistical literature on the integration of multiple omics modalities to select a subset of variables for prediction, which is a critical task in personalized medicine. In this paper, we propose a simple penalized regression method to address this problem by assigning different penalty factors to different data modalities for feature selection and prediction. The penalty factors can be chosen in a fully data-driven fashion by cross-validation or by taking practical considerations into account. In simulation studies, we compare the prediction performance of our approach, called IPF-LASSO (Integrative LASSO with Penalty Factors) and implemented in the R package ipflasso, with the standard LASSO and sparse group LASSO. The use of IPF-LASSO is also illustrated through applications to two real-life cancer datasets. All data and code are available on the companion website to ensure reproducibility.
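To make the idea of modality-specific penalty factors concrete, the sketch below fits a LASSO in which each feature's L1 penalty is scaled by its own factor, using plain coordinate descent. It is an illustration only and does not reproduce the ipflasso package; the function names, the objective scaling (1/(2n) times the squared error), and the fixed iteration count are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in LASSO coordinate updates."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso(X, y, lam, penalty_factors, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + lam * sum_j penalty_factors[j] * |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # (1/n) * x_j' x_j for each column
    resid = y - X @ beta                # current residual vector
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual (feature j removed).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * penalty_factors[j]) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)  # keep the residual in sync
            beta[j] = new
    return beta
```

With two modalities of `p1` and `p2` features, a call such as `weighted_lasso(X, y, 0.1, np.r_[np.ones(p1), 2 * np.ones(p2)])` penalizes the second modality twice as heavily as the first, which is the kind of asymmetry the penalty factors are meant to express.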


Author(s): Ervina Varijki, Bambang Krismono Triwijoyo

Breast cancer is one type of cancer that can be identified using MRI technology. It is still a leading cause of death worldwide, so early detection of the disease is needed. To identify breast cancer, a doctor or radiologist analyzes magnetic resonance images stored in the Digital Imaging and Communications in Medicine (DICOM) format. An appropriate and accurate diagnosis requires sufficient skill and experience, so a digital image processing application that uses object segmentation and edge detection is needed to assist the physician or radiologist in identifying breast cancer. The MRI image segmentation proceeds in stages: the image is converted to grayscale, thresholded into a binary image, and finally passed through edge detection using the Roberts operator. The 20 input images tested produced output images in which the boundary line of each region or object is visible and no edges are cut off, with an average computation time of less than one minute.
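The three stages described above (grayscale conversion, binary thresholding, Roberts edge detection) can be sketched with NumPy alone. This is an illustrative sketch, not the authors' application: the luminance weights are the standard BT.601 values, while the threshold of 0.5 and the function names are assumptions, and reading DICOM files is left out.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an RGB image (H x W x 3, values in [0, 1]) to luminance."""
    return rgb[..., :3] @ np.array([0.299, 0.587, 0.114])

def binarize(gray, threshold=0.5):
    """Binary thresholding: 1.0 where the pixel is at or above the threshold."""
    return (gray >= threshold).astype(float)

def roberts_edges(img):
    """Roberts cross operator: gradient magnitude from the two 2x2 diagonal differences."""
    gx = img[:-1, :-1] - img[1:, 1:]   # kernel [[1, 0], [0, -1]]
    gy = img[:-1, 1:] - img[1:, :-1]   # kernel [[0, 1], [-1, 0]]
    return np.hypot(gx, gy)

def segment_edges(rgb, threshold=0.5):
    """Run the full pipeline: grayscale, threshold, edge detection."""
    return roberts_edges(binarize(to_grayscale(rgb), threshold))
```

Because the Roberts operator works on 2x2 neighbourhoods, the edge map is one pixel smaller than the input in each dimension; it is zero inside uniform regions and non-zero along region boundaries.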


2021, Vol 22 (1)
Author(s): Van Hoan Do, Stefan Canzar

Abstract. Emerging single-cell technologies profile multiple types of molecules within individual cells. A fundamental step in the analysis of the produced high-dimensional data is their visualization using dimensionality reduction techniques such as t-SNE and UMAP. We introduce j-SNE and j-UMAP as their natural generalizations to the joint visualization of multimodal omics data. Our approach automatically learns the relative contribution of each modality to a concise representation of cellular identity that promotes discriminative features but suppresses noise. On eight datasets, j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.


Author(s): Yang Xu, Priyojit Das, Rachel Patton McCord

Abstract. Motivation: Deep learning approaches have empowered single-cell omics data analysis in many ways and generated new insights from complex cellular systems. As there is an increasing need for single-cell omics data to be integrated across sources, types, and features of data, the challenges of integrating single-cell omics data are rising. Here, we present SMILE (Single-cell Mutual Information Learning), an unsupervised deep learning algorithm that learns discriminative representations for single-cell data by maximizing mutual information. Results: Using a unique cell-pairing design, SMILE successfully integrates multi-source single-cell transcriptome data, removing batch effects and projecting similar cell types, even from different tissues, into the shared space. SMILE can also integrate data from two or more modalities, such as joint profiling technologies using single-cell ATAC-seq, RNA-seq, DNA methylation, Hi-C, and ChIP data. When paired cells are known, SMILE can integrate data with unmatched features, such as genes for RNA-seq and genome-wide peaks for ATAC-seq. Integrated representations learned from joint profiling technologies can then be used as a framework for comparing independent single-source data. Supplementary information: Supplementary data are available at Bioinformatics online. The source code of SMILE, including analyses of key results in the study, can be found at: https://github.com/rpmccordlab/SMILE.
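One common way to maximize a lower bound on the mutual information between paired representations is a contrastive (InfoNCE-style) loss, in which each cell's embedding should be most similar to the embedding of its own pair. The sketch below shows that loss in NumPy. Whether SMILE uses exactly this formulation is not stated in the abstract, so the loss, the temperature value, and the function name are all assumptions for illustration.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss for paired embeddings: row i of z_a is paired with row i of z_b."""
    # Normalize rows so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct partner of each row sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

The loss is small when true pairs are more similar than mismatched pairs, and it grows when the pairing is scrambled, which is the signal a network would be trained to minimize.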


Author(s): FATHALLAH NOUBOUD, RÉJEAN PLAMONDON

This paper presents a real-time, constraint-free handprinted character recognition system based on a structural approach. After the preprocessing operation, a chain code is extracted to represent the character. The classification is based on the use of a processor dedicated to string comparison. The average computation time to recognize a character is about 0.07 seconds. During the learning step, the user can define any set of characters or symbols to be recognized by the system. Thus there are no constraints on the handprinting. The experimental tests show a high degree of accuracy (96%) for writer-dependent applications. Comparisons with other systems and methods are discussed. We also present a comparison between the processor used in this system and the Wagner and Fischer algorithm. Finally, we describe some applications of the system.
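For reference, the Wagner and Fischer algorithm mentioned above computes the edit distance between two strings by dynamic programming, and a chain code can be classified by picking the stored template at the smallest distance. The sketch below shows this in software. It is not the dedicated processor described in the paper, and the template dictionary and the digit-string encoding of chain codes are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer edit distance with unit costs, keeping only two rows."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def classify(chain_code: str, templates: dict) -> str:
    """Return the label of the template chain code closest to the input."""
    return min(templates, key=lambda label: edit_distance(chain_code, templates[label]))

# Hypothetical templates: chain codes written as strings of direction digits.
templates = {"L": "66660000", "I": "66666666"}
label = classify("6666000", templates)
```

Because the user can register any symbol during the learning step, a template dictionary of this kind would simply grow with each new symbol; no fixed alphabet is assumed.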


2007, Vol 46 (03), pp. 324-331
Author(s): P. Jäger, S. Vogel, A. Knepper, T. Kraus, T. Aach, ...

Summary. Objectives: Pleural thickenings, as a biomarker of exposure to asbestos, may evolve into malignant pleural mesothelioma. For its early stage, pleurectomy with perioperative treatment can reduce morbidity and mortality. The diagnosis is based on a visual investigation of CT images, which is a time-consuming and subjective procedure. Our aim is to develop an automatic image processing approach to detect and quantitatively assess pleural thickenings. Methods: We first segment the lung areas and identify the pleural contours. A convexity model is then used together with a Hounsfield unit threshold to detect pleural thickenings. The assessment of the detected pleural thickenings is based on a spline-based model of the healthy pleura. Results: Tests were carried out on 14 data sets from three patients. In all cases, pleural contours were reliably identified and pleural thickenings detected. PC-based computation times were 85 min for a data set of 716 slices, 35 min for 401 slices, and 4 min for 75 slices, resulting in an average computation time of about 5.2 s per slice. Visualizations of pleurae and detected thickenings were provided. Conclusion: Results obtained so far indicate that our approach is able to assist physicians in the tedious task of finding and quantifying pleural thickenings in CT data. In the next step, our system will undergo an evaluation in a clinical test setting using routine CT data to quantify its performance.
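The per-slice figure can be checked from the three timings quoted above. The short calculation below suggests that the reported value of about 5.2 s is the mean of the three per-data-set rates; pooling all slices instead gives about 6.2 s per slice. The variable names are illustrative, and the interpretation of how the authors averaged is an inference from the numbers, not something the abstract states.

```python
# (minutes, slices) for the three data sets quoted in the abstract.
datasets = [(85, 716), (35, 401), (4, 75)]

# Seconds per slice for each data set individually.
per_slice = [minutes * 60 / slices for minutes, slices in datasets]

# Unweighted mean of the three per-data-set rates.
mean_of_rates = sum(per_slice) / len(per_slice)

# Pooled rate: total seconds divided by total slices.
pooled = sum(minutes * 60 for minutes, _ in datasets) / sum(slices for _, slices in datasets)
```

The two averages differ because the largest data set is also the slowest per slice, so it carries more weight in the pooled figure.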


2010, Vol 3 (6), pp. 1555-1568
Author(s): B. Mijling, O. N. E. Tuinder, R. F. van Oss, R. J. van der A

Abstract. The Ozone Profile Algorithm (OPERA), developed at KNMI, retrieves the vertical ozone distribution from nadir spectral satellite measurements of backscattered sunlight in the ultraviolet and visible wavelength range. To produce consistent global datasets, the algorithm needs to have good global performance, while short computation time facilitates the use of the algorithm in near-real-time applications. To test the global performance of the algorithm, we use the convergence behaviour as a diagnostic tool for the ozone profile retrievals from the GOME instrument (on board ERS-2) for February and October 1998. In this way, we uncover different classes of retrieval problems, related to the South Atlantic Anomaly, low cloud fractions over deserts, desert dust outflow over the ocean, and the intertropical convergence zone. The influence of the first guess and of the external input data, including the ozone cross-sections and the ozone climatologies, on the retrieval performance is also investigated. By using a priori ozone profiles that are selected on the expected total ozone column, retrieval problems due to anomalous ozone distributions (such as in the ozone hole) can be avoided. With the algorithm adaptations applied, the convergence statistics improve considerably, not only increasing the number of successful retrievals but also reducing the average computation time, owing to fewer iteration steps per retrieval. For February 1998, non-convergence was brought down from 10.7% to 2.1%, while the mean number of iteration steps (which dominates the computation time) dropped 26% from 5.11 to 3.79.

