An Efficient Algorithm for the Detection of Outliers in Mislabeled Omics Data

High dimensionality and noise make it difficult to detect related biomarkers in omics data. Previous studies have shown that penalized maximum trimmed likelihood estimation is effective in identifying mislabeled samples in high-dimensional data with labeling errors. However, the algorithm commonly used in these studies is the concentration step (C-step), and the C-step algorithm applied to robust penalized regression does not guarantee that the criterion function improves from one iteration to the next, because the regularization parameters change during the iteration. As a result, the C-step algorithm runs very slowly, especially on high-dimensional omics data. To address this, the AR-Cstep algorithm (C-step combined with an acceptance-rejection scheme) is proposed. In simulation experiments, the AR-Cstep algorithm converged faster (the average computation time was only 2% of that of the C-step algorithm) and was more accurate in terms of variable selection and outlier identification than the C-step algorithm. The two algorithms were further compared on triple negative breast cancer (TNBC) RNA-seq data. AR-Cstep solves the problem of the C-step not converging and ensures that the iterations move in a direction that improves the criterion function. As an improvement of the C-step algorithm, the AR-Cstep algorithm can be extended to other robust models with regularization parameters.
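
The acceptance-rejection idea can be illustrated with a small sketch. This is not the authors' implementation: the ridge penalty, the cycling grid of regularization values, the fallback to the last accepted value, and all function names (`ridge_fit`, `criterion`, `ar_cstep`) are assumptions chosen to keep the example short. It only shows a C-step whose candidate update is accepted when the trimmed penalized criterion decreases and rejected otherwise.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit (assumed stand-in for the penalized estimator)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def criterion(X, y, beta, subset, lam):
    """Trimmed penalized loss: squared residuals on the subset plus the penalty."""
    r = y[subset] - X[subset] @ beta
    return float(r @ r + lam * (beta @ beta))

def ar_cstep(X, y, h, lambdas, max_iter=50, seed=0):
    """C-step iterations with a hypothetical accept/reject rule."""
    rng = np.random.default_rng(seed)
    subset = rng.choice(X.shape[0], size=h, replace=False)
    lam = lambdas[0]
    beta = ridge_fit(X[subset], y[subset], lam)
    best = criterion(X, y, beta, subset, lam)
    history = [best]
    for it in range(1, max_iter + 1):
        # C-step: keep the h observations with the smallest squared residuals.
        cand_subset = np.argsort((y - X @ beta) ** 2)[:h]
        # Assumption: the regularization parameter may change between
        # iterations; cycling through a grid mimics that. If the proposed
        # value is rejected, retry with the last accepted value.
        for cand_lam in (lambdas[it % len(lambdas)], lam):
            cand_beta = ridge_fit(X[cand_subset], y[cand_subset], cand_lam)
            cand = criterion(X, y, cand_beta, cand_subset, cand_lam)
            if cand < best:  # accept only if the criterion improves
                subset, lam, beta, best = cand_subset, cand_lam, cand_beta, cand
                history.append(best)
                break
        else:
            break  # both candidates rejected: stop iterating
    return beta, subset, history
```

By construction, `history` is strictly decreasing, which is the property the plain C-step loses when the regularization parameter is allowed to change.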

IPF-LASSO: Integrative L1-Penalized Regression with Penalty Factors for Prediction Based on Multi-Omics Data

Computational and Mathematical Methods in Medicine ◽ 10.1155/2017/7691937 ◽ 2017 ◽ Vol 2017 ◽ pp. 1-14 ◽ Cited By ~ 25

Author(s): Anne-Laure Boulesteix ◽ Riccardo De Bin ◽ Xiaoyu Jiang ◽ Mathias Fuchs

Keyword(s): Cross Validation ◽ Real Life ◽ Penalized Regression ◽ R Package ◽ Molecular Data ◽ Regression Method ◽ Data Driven ◽ High Dimensional ◽ Omics Data ◽ Simulation Studies

As modern biotechnologies advance, it has become increasingly frequent that different modalities of high-dimensional molecular data (termed “omics” data in this paper), such as gene expression, methylation, and copy number, are collected from the same patient cohort to predict the clinical outcome. While prediction based on omics data has been widely studied in the last fifteen years, little has been done in the statistical literature on the integration of multiple omics modalities to select a subset of variables for prediction, which is a critical task in personalized medicine. In this paper, we propose a simple penalized regression method to address this problem by assigning different penalty factors to different data modalities for feature selection and prediction. The penalty factors can be chosen in a fully data-driven fashion by cross-validation or by taking practical considerations into account. In simulation studies, we compare the prediction performance of our approach, called IPF-LASSO (Integrative LASSO with Penalty Factors) and implemented in the R package ipflasso, with the standard LASSO and sparse group LASSO. The use of IPF-LASSO is also illustrated through applications to two real-life cancer datasets. All data and codes are available on the companion website to ensure reproducibility.
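
As a rough illustration of modality-specific penalty factors, the sketch below minimizes (1/2n)·||y − Xβ||² + λ·Σ_j w_j·|β_j| by coordinate descent, where w_j is the penalty factor of the modality that feature j belongs to. It is not the ipflasso package or its API; the function name `ipf_lasso_sketch`, its arguments, and the absence of an intercept and of cross-validation are simplifying assumptions.

```python
import numpy as np

def ipf_lasso_sketch(X, y, lam, modality_of, penalty_factors, n_iter=500):
    """Weighted LASSO by coordinate descent (hypothetical helper).

    modality_of[j] names the modality of feature j; penalty_factors maps each
    modality to its penalty factor, so feature j is penalized by
    lam * penalty_factors[modality_of[j]].
    """
    n, p = X.shape
    weights = np.array([penalty_factors[m] for m in modality_of], dtype=float)
    beta = np.zeros(p)
    resid = y.astype(float)               # residual y - X @ beta (beta = 0)
    col_norm2 = (X ** 2).sum(axis=0) / n  # per-feature mean squared value
    for _ in range(n_iter):
        for j in range(p):
            if col_norm2[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ resid / n + col_norm2[j] * beta[j]
            # Soft-thresholding with the feature-specific penalty.
            shrunk = np.sign(rho) * max(abs(rho) - lam * weights[j], 0.0)
            new = shrunk / col_norm2[j]
            resid -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta
```

For example, with two modalities and `penalty_factors={0: 1.0, 1: 1000.0}`, features of the second modality are penalized so heavily that they are effectively excluded, while the first modality is fitted as in a standard LASSO.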

AucPR: An AUC-based approach using penalized regression for disease prediction with high-dimensional omics data

BMC Genomics ◽ 10.1186/1471-2164-15-s10-s1 ◽ 2014 ◽ Vol 15 (S10) ◽ Cited By ~ 2

Author(s): Wenbao Yu ◽ Taesung Park

Keyword(s): Penalized Regression ◽ High Dimensional ◽ Disease Prediction ◽ Omics Data

Segmentasi Citra MRI Menggunakan Deteksi Tepi Untuk Identifikasi Kanker Payudara (MRI Image Segmentation Using Edge Detection for Breast Cancer Identification)

Matrik Jurnal Manajemen Teknik Informatika dan Rekayasa Komputer ◽ 10.30812/matrik.v15i2.38 ◽ 2017 ◽ Vol 15 (2) ◽ pp. 17 ◽ Cited By ~ 1

Author(s): Ervina Varijki ◽ Bambang Krismono Triwijoyo

Keyword(s): Breast Cancer ◽ Edge Detection ◽ Object Segmentation ◽ Computation Time ◽ Input Image ◽ Boundary Line ◽ Cancer Breast ◽ Average Computation Time ◽ Breast Cancer Mri ◽ Image Format

Breast cancer is one type of cancer that can be identified using MRI technology. Breast cancer is still the leading cause of death worldwide; therefore, early detection of this disease is needed. To identify breast cancer, a doctor or radiologist analyzes magnetic resonance images stored in the Digital Imaging and Communications in Medicine (DICOM) format. Sufficient skill and experience are required for an appropriate and accurate diagnosis, so a digital image processing application is needed that uses object segmentation and edge detection to assist the physician or radiologist in identifying breast cancer. MRI image segmentation using edge detection for breast cancer identification proceeds in stages: the image is converted to grayscale, then thresholded into a binary image, and finally edges are detected using the Roberts operator. For the 20 input images tested, the method produced images in which the boundary line of each region or object is visible and no edges are cut off, with an average computation time of less than one minute.
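
The three stages described above (grayscale conversion, thresholding, Roberts edge detection) can be sketched as follows. This is a minimal illustration, not the authors' application: DICOM loading is omitted, the input is assumed to be an RGB array, and the luminance weights, the threshold value, and the function names are assumptions.

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted sum of the R, G, B channels (common luminance weights)."""
    return rgb[..., :3] @ np.array([0.299, 0.587, 0.114])

def to_binary(gray, threshold):
    """Binary image: 1.0 where the pixel reaches the threshold, else 0.0."""
    return (gray >= threshold).astype(float)

def roberts_edges(img):
    """Roberts cross operator: gradient magnitude from two 2x2 diagonal differences."""
    gx = img[:-1, :-1] - img[1:, 1:]   # difference along the main diagonal
    gy = img[:-1, 1:] - img[1:, :-1]   # difference along the anti-diagonal
    return np.hypot(gx, gy)            # output is one pixel smaller per axis

# Hypothetical 6x6 test image: a bright 2x2 square on a dark background.
image = np.zeros((6, 6, 3))
image[2:4, 2:4] = 255.0
edges = roberts_edges(to_binary(to_grayscale(image), threshold=128.0))
```

The resulting `edges` array is nonzero only around the boundary of the bright square and zero in flat regions, which is the boundary-line output the abstract refers to.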

A generalization of t-SNE and UMAP to single-cell multimodal omics

Genome Biology ◽ 10.1186/s13059-021-02356-5 ◽ 2021 ◽ Vol 22 (1)

Author(s): Van Hoan Do ◽ Stefan Canzar

Keyword(s): Dimensionality Reduction ◽ Single Cell ◽ Cell Types ◽ High Dimensional ◽ Omics Data ◽ Relative Contribution ◽ Reduction Techniques ◽ Dimensionality Reduction Techniques ◽ Concise Representation ◽ Cellular Identity

Abstract: Emerging single-cell technologies profile multiple types of molecules within individual cells. A fundamental step in the analysis of the produced high-dimensional data is their visualization using dimensionality reduction techniques such as t-SNE and UMAP. We introduce j-SNE and j-UMAP as their natural generalizations to the joint visualization of multimodal omics data. Our approach automatically learns the relative contribution of each modality to a concise representation of cellular identity that promotes discriminative features but suppresses noise. On eight datasets, j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.

Maximum Likelihood Estimation of Regularization Parameters in High-Dimensional Inverse Problems: An Empirical Bayesian Approach Part I: Methodology and Experiments

SIAM Journal on Imaging Sciences ◽ 10.1137/20m1339829 ◽ 2020 ◽ Vol 13 (4) ◽ pp. 1945-1989 ◽ Cited By ~ 1

Author(s): Ana Fernandez Vidal ◽ Valentin De Bortoli ◽ Marcelo Pereyra ◽ Alain Durmus

Keyword(s): Maximum Likelihood ◽ Inverse Problems ◽ Maximum Likelihood Estimation ◽ Bayesian Approach ◽ Likelihood Estimation ◽ High Dimensional ◽ Empirical Bayesian ◽ Regularization Parameters ◽ Empirical Bayesian Approach

SMILE: Mutual Information Learning for Integration of Single-cell Omics Data

Bioinformatics ◽ 10.1093/bioinformatics/btab706 ◽ 2021

Author(s): Yang Xu ◽ Priyojit Das ◽ Rachel Patton McCord

Keyword(s): Deep Learning ◽ Mutual Information ◽ Single Cell ◽ Learning Algorithm ◽ Cellular Systems ◽ Supplementary Information ◽ Omics Data ◽ Learning Approaches ◽ Rna Seq ◽ Integrate Data

Abstract. Motivation: Deep learning approaches have empowered single-cell omics data analysis in many ways and generated new insights from complex cellular systems. As there is an increasing need for single-cell omics data to be integrated across sources, types, and features of data, the challenges of integrating single-cell omics data are rising. Here, we present an unsupervised deep learning algorithm that learns discriminative representations for single-cell data via maximizing mutual information, SMILE (Single-cell Mutual Information Learning). Results: Using a unique cell-pairing design, SMILE successfully integrates multi-source single-cell transcriptome data, removing batch effects and projecting similar cell types, even from different tissues, into the shared space. SMILE can also integrate data from two or more modalities, such as joint profiling technologies using single-cell ATAC-seq, RNA-seq, DNA methylation, Hi-C, and ChIP data. When paired cells are known, SMILE can integrate data with unmatched features, such as genes for RNA-seq and genome-wide peaks for ATAC-seq. Integrated representations learned from joint profiling technologies can then be used as a framework for comparing independent single-source data. Supplementary information: Supplementary data are available at Bioinformatics online. The source code of SMILE, including analyses of key results in the study, can be found at: https://github.com/rpmccordlab/SMILE.

A STRUCTURAL APPROACH TO ON-LINE CHARACTER RECOGNITION: SYSTEM DESIGN AND APPLICATIONS

International Journal of Pattern Recognition and Artificial Intelligence ◽ 10.1142/s0218001491000181 ◽ 1991 ◽ Vol 05 (01n02) ◽ pp. 311-335 ◽ Cited By ~ 3

Author(s): FATHALLAH NOUBOUD ◽ RÉJEAN PLAMONDON

Keyword(s): Character Recognition ◽ Computation Time ◽ Experimental Tests ◽ Recognition System ◽ Structural Approach ◽ Chain Code ◽ String Comparison ◽ Average Computation Time ◽ On Line ◽ A Chain

This paper presents a real-time, constraint-free handprinted character recognition system based on a structural approach. After the preprocessing operation, a chain code is extracted to represent the character. The classification is based on the use of a processor dedicated to string comparison. The average computation time to recognize a character is about 0.07 seconds. During the learning step, the user can define any set of characters or symbols to be recognized by the system. Thus there are no constraints on the handprinting. The experimental tests show a high degree of accuracy (96%) for writer-dependent applications. Comparisons with other systems and methods are discussed. We also present a comparison between the processor used in this system and the Wagner and Fischer algorithm. Finally, we describe some applications of the system.
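
For readers unfamiliar with the Wagner and Fischer algorithm mentioned above, the sketch below computes the edit distance between two strings and uses it to match a chain code against stored templates. It illustrates the software baseline only, not the dedicated string-comparison processor of the paper; the unit edit costs, the `classify` helper, and the example templates are assumptions.

```python
def wagner_fischer(a, b):
    """Edit distance between sequences a and b with unit costs."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def classify(chain_code, templates):
    """Return the label whose template is closest to chain_code (hypothetical helper)."""
    return min(templates, key=lambda label: wagner_fischer(chain_code, templates[label]))

# Hypothetical chain-code templates (direction symbols 0-7).
templates = {"A": "0066", "B": "2244"}
label = classify("0066", templates)
```

Because the user can register any symbol as a template during the learning step, nearest-template matching of this kind places no constraint on the character set.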

Computer-assisted Diagnosis for Early Stage Pleural Mesothelioma

Methods of Information in Medicine ◽ 10.1160/me9050 ◽ 2007 ◽ Vol 46 (03) ◽ pp. 324-331 ◽ Cited By ~ 10

Author(s): P. Jäger ◽ S. Vogel ◽ A. Knepper ◽ T. Kraus ◽ T. Aach ◽ ...

Keyword(s): Early Stage ◽ Computation Time ◽ Hounsfield Unit ◽ Computer Assisted ◽ Clinical Test ◽ Pleural Mesothelioma ◽ Data Sets ◽ Data Set ◽ Average Computation Time ◽ Ct Data

Summary. Objectives: Pleural thickenings as a biomarker of exposure to asbestos may evolve into malignant pleural mesothelioma. For its early stage, pleurectomy with perioperative treatment can reduce morbidity and mortality. The diagnosis is based on a visual investigation of CT images, which is a time-consuming and subjective procedure. Our aim is to develop an automatic image processing approach to detect and quantitatively assess pleural thickenings. Methods: We first segment the lung areas and identify the pleural contours. A convexity model is then used together with a Hounsfield unit threshold to detect pleural thickenings. The assessment of the detected pleural thickenings is based on a spline-based model of the healthy pleura. Results: Tests were carried out on 14 data sets from three patients. In all cases, pleural contours were reliably identified, and pleural thickenings detected. PC-based computation times were 85 min for a data set of 716 slices, 35 min for 401 slices, and 4 min for 75 slices, resulting in an average computation time of about 5.2 s per slice. Visualizations of pleurae and detected thickenings were provided. Conclusion: Results obtained so far indicate that our approach is able to assist physicians in the tedious task of finding and quantifying pleural thickenings in CT data. In the next step, our system will undergo an evaluation in a clinical test setting using routine CT data to quantify its performance.
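
The reported per-slice average can be checked against the three runs quoted in the abstract. The short calculation below is only a consistency check on those three figures (the remaining data sets are not reported, and the variable names are illustrative): averaging the three per-run rates gives about 5.2 s per slice, whereas pooling total time over total slices gives about 6.2 s per slice, so the abstract's figure appears to correspond to the former.

```python
# (minutes, slices) for the three runs quoted in the abstract.
runs = [(85, 716), (35, 401), (4, 75)]

# Seconds per slice for each run, then the unweighted mean of those rates.
per_run = [minutes * 60 / slices for minutes, slices in runs]
mean_of_rates = sum(per_run) / len(per_run)

# Alternative: total seconds divided by total slices (weighted by run size).
pooled = sum(minutes for minutes, _ in runs) * 60 / sum(slices for _, slices in runs)

print(f"mean of per-run rates: {mean_of_rates:.1f} s/slice")
print(f"pooled rate:           {pooled:.1f} s/slice")
```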

Improving ozone profile retrieval from spaceborne UV backscatter spectrometers using convergence behaviour diagnostics

Atmospheric Measurement Techniques ◽ 10.5194/amt-3-1555-2010 ◽ 2010 ◽ Vol 3 (6) ◽ pp. 1555-1568 ◽ Cited By ~ 13

Author(s): B. Mijling ◽ O. N. E. Tuinder ◽ R. F. van Oss ◽ R. J. van der A

Keyword(s): Cross Sections ◽ A Priori ◽ Computation Time ◽ External Input ◽ Computational Time ◽ Ozone Profile ◽ Global Performance ◽ Convergence Behaviour ◽ Low Cloud ◽ Average Computation Time

Abstract. The Ozone Profile Algorithm (OPERA), developed at KNMI, retrieves the vertical ozone distribution from nadir spectral satellite measurements of backscattered sunlight in the ultraviolet and visible wavelength range. To produce consistent global datasets, the algorithm needs to have good global performance, while short computation time facilitates the use of the algorithm in near-real-time applications. To test the global performance of the algorithm, we look at the convergence behaviour as a diagnostic tool for the ozone profile retrievals from the GOME instrument (on board ERS-2) for February and October 1998. In this way, we uncover different classes of retrieval problems, related to the South Atlantic Anomaly, low cloud fractions over deserts, desert dust outflow over the ocean, and the intertropical convergence zone. The influence of the first guess and the external input data, including the ozone cross-sections and the ozone climatologies, on the retrieval performance is also investigated. By using a priori ozone profiles that are selected based on the expected total ozone column, retrieval problems due to anomalous ozone distributions (such as in the ozone hole) can be avoided. By applying the algorithm adaptations, the convergence statistics improve considerably, not only increasing the number of successful retrievals but also reducing the average computation time, owing to fewer iteration steps per retrieval. For February 1998, non-convergence was brought down from 10.7% to 2.1%, while the mean number of iteration steps (which dominates the computational time) dropped 26% from 5.11 to 3.79.

High-Dimensional, Penalized-Regression Models in Time-to-Event Clinical Trials

Textbook of Clinical Trials in Oncology ◽ 10.1201/9781315112084-18 ◽ 2019 ◽ pp. 376-396

Author(s): Federico Rotolo ◽ Nils Ternès ◽ Stefan Michiels

Keyword(s): Clinical Trials ◽ Regression Models ◽ Penalized Regression ◽ High Dimensional ◽ Time To Event
