Extreme data compression while searching for new physics

Alan F Heavens; Elena Sellentin; Andrew H Jaffe

doi:10.1093/mnras/staa2589

Extreme data compression while searching for new physics

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa2589 ◽

2020 ◽

Vol 498 (3) ◽

pp. 3440-3451

Author(s):

Alan F Heavens ◽

Elena Sellentin ◽

Andrew H Jaffe

Keyword(s):

Data Compression ◽

Principal Component ◽

New Physics ◽

Standard Theory ◽

Data Sets ◽

Data Set ◽

Formidable Challenge ◽

Public Data ◽

Bayesian Evidence ◽

Data Points

ABSTRACT Bringing a high-dimensional data set into science-ready shape is a formidable challenge that often necessitates data compression. Compression has accordingly become a key consideration for contemporary cosmology, affecting public data releases, and reanalyses searching for new physics. However, data compression optimized for a particular model can suppress signs of new physics, or even remove them altogether. We therefore provide a solution for exploring new physics during data compression. In particular, we store additional agnostic compressed data points, selected to enable precise constraints of non-standard physics at a later date. Our procedure is based on the maximal compression of the MOPED algorithm, which optimally filters the data with respect to a baseline model. We select additional filters, based on a generalized principal component analysis, which are carefully constructed to scout for new physics at high precision and speed. We refer to the augmented set of filters as MOPED-PC. They enable an analytic computation of Bayesian Evidence that may indicate the presence of new physics, and fast analytic estimates of best-fitting parameters when adopting a specific non-standard theory, without further expensive MCMC analysis. As there may be large numbers of non-standard theories, the speed of the method becomes essential. Should no new physics be found, then our approach preserves the precision of the standard parameters. As a result, we achieve very rapid and maximally precise constraints of standard and non-standard physics, with a technique that scales well to large dimensional data sets.

Download Full-text

Application of multivariable statistical techniques in plant-wide WWTP control strategies analysis

Water Science & Technology ◽

10.2166/wst.2007.586 ◽

2007 ◽

Vol 56 (6) ◽

pp. 75-83 ◽

Cited By ~ 3

Author(s):

X. Flores ◽

J. Comas ◽

I.R. Roda ◽

L. Jiménez ◽

K.V. Gernaey

Keyword(s):

Cluster Analysis ◽

Control Strategies ◽

Treatment Plant ◽

Principal Component ◽

Statistical Techniques ◽

Data Sets ◽

Data Set ◽

Casual Relation ◽

Evaluation Matrix ◽

Natural Groups

The main objective of this paper is to present the application of selected multivariable statistical techniques in plant-wide wastewater treatment plant (WWTP) control strategies analysis. In this study, cluster analysis (CA), principal component analysis/factor analysis (PCA/FA) and discriminant analysis (DA) are applied to the evaluation matrix data set obtained by simulation of several control strategies applied to the plant-wide IWA Benchmark Simulation Model No 2 (BSM2). These techniques allow i) to determine natural groups or clusters of control strategies with a similar behaviour, ii) to find and interpret hidden, complex and casual relation features in the data set and iii) to identify important discriminant variables within the groups found by the cluster analysis. This study illustrates the usefulness of multivariable statistical techniques for both analysis and interpretation of the complex multicriteria data sets and allows an improved use of information for effective evaluation of control strategies.

Download Full-text

Fast and accurate detection of surface defect based on improved YOLOv4

Assembly Automation ◽

10.1108/aa-04-2021-0044 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Jiawei Lian ◽

Junhong He ◽

Yun Niu ◽

Tianze Wang

Keyword(s):

Feature Extraction ◽

Real Time ◽

Surface Defect ◽

Steel Ingot ◽

Industrial Applications ◽

Data Sets ◽

Data Set ◽

Processing Technologies ◽

Content Type ◽

Public Data

Purpose The current popular image processing technologies based on convolutional neural network have the characteristics of large computation, high storage cost and low accuracy for tiny defect detection, which is contrary to the high real-time and accuracy, limited computing resources and storage required by industrial applications. Therefore, an improved YOLOv4 named as YOLOv4-Defect is proposed aim to solve the above problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression processing on the feature extraction network of YOLOv4 to simplify the model and improve the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with more detailed receptive field is added to optimize the model structure, which can improve the detection performance for tiny defects. Findings The effectiveness of the method is verified by public data sets NEU-CLS and DAGM 2007, and the steel ingot data set collected in the actual industrial field. The experimental results demonstrated that the proposed YOLOv4-Defect method can greatly improve the recognition efficiency and accuracy and reduce the size and computation consumption of the model. Originality/value This paper proposed an improved YOLOv4 named as YOLOv4-Defect for the detection of surface defect, which is conducive to application in various industrial scenarios with limited storage and computing resources, and meets the requirements of high real-time and precision.

Download Full-text

A Support Based Initialization Algorithm for Categorical Data Clustering

Journal of Information Technology Research ◽

10.4018/jitr.2018040104 ◽

2018 ◽

Vol 11 (2) ◽

pp. 53-67

Author(s):

Ajay Kumar ◽

Shishir Kumar

Keyword(s):

Categorical Data ◽

Selection Process ◽

Numerical Data ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Data Object ◽

Data Points ◽

Wu Method ◽

Selection Algorithms

Several initial center selection algorithms are proposed in the literature for numerical data, but the values of the categorical data are unordered so, these methods are not applicable to a categorical data set. This article investigates the initial center selection process for the categorical data and after that present a new support based initial center selection algorithm. The proposed algorithm measures the weight of unique data points of an attribute with the help of support and then integrates these weights along the rows, to get the support of every row. Further, a data object having the largest support is chosen as an initial center followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.

Download Full-text

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handle data sets that contain either continuous or categorical variables. However data sets with mixed types of variables are commonly used in data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization mixed data (continuous/binary). The learning of weights and prototypes is done in a simultaneous manner assuring an optimized data clustering. More variables has a high weight, more the clustering algorithm will take into account the informations transmitted by these variables. The learning of these topological maps is combined with a weighting process of different variables by computing weights which influence the quality of clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, Zoo data set and other three mixed data sets. The results show a good quality of the topological ordering and homogenous clustering.

Download Full-text

An Incremental Isomap Method for Hyperspectral Dimensionality Reduction and Classification

Photogrammetric Engineering & Remote Sensing ◽

10.14358/pers.87.7.445 ◽

2021 ◽

Vol 87 (6) ◽

pp. 445-455

Author(s):

Yi Ma ◽

Zezhong Zheng ◽

Yutang Ma ◽

Mingcang Zhu ◽

Ran Huang ◽

...

Keyword(s):

Manifold Learning ◽

Nearest Neighbor ◽

Hyperspectral Image ◽

Hyperspectral Data ◽

Training Data ◽

Support Vector ◽

Data Sets ◽

K Nearest Neighbor ◽

Data Set ◽

Data Points

Many manifold learning algorithms conduct an eigen vector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N2). We pres- ent in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature varia- tion algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k–nearest-neighbor classifier and achieving the second best performance with support vector machine.

Download Full-text

Dimensionality and Its Reduction

Statistics, Data Mining, and Machine Learning in Astronomy ◽

10.23943/princeton/9780691151687.003.0007 ◽

2014 ◽

Author(s):

Andrew J. Connolly ◽

Jacob T. VanderPlas ◽

Alexander Gray ◽

Andrew J. Connolly ◽

Jacob T. VanderPlas ◽

...

Keyword(s):

Principal Component Analysis ◽

Principal Component ◽

Reduction Technique ◽

High Dimensional ◽

Data Sets ◽

Data Set ◽

Gaussian Distributions ◽

Dimensionality Reduction Technique ◽

Alternative Techniques ◽

New Generation

With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the question of the complexity as well as size of the data set. This chapter deals with how we can learn which measurements, properties, or combinations thereof carry the most information within a data set. It describes techniques that are related to concepts discussed when describing Gaussian distributions, density estimation, and the concepts of information content. The chapter begins with an exploration of the problems posed by high-dimensional data. It then describes the data sets used in this chapter, and introduces perhaps the most important and widely used dimensionality reduction technique, principal component analysis (PCA). The remainder of the chapter discusses several alternative techniques which address some of the weaknesses of PCA.

Download Full-text

Decomposing the Apoptosis Pathway Into Biologically Interpretable Principal Components

Cancer Informatics ◽

10.1177/1176935118771082 ◽

2018 ◽

Vol 17 ◽

pp. 117693511877108 ◽

Cited By ~ 4

Author(s):

Min Wang ◽

Steven M Kornblau ◽

Kevin R Coombes

Keyword(s):

Principal Components ◽

Myeloid Leukemia ◽

Principal Component ◽

R Package ◽

Biological Data ◽

Data Sets ◽

Proteomics Data ◽

Data Set ◽

Apoptosis Pathway ◽

Biological Interpretation

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 “biological components,” 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.

Download Full-text

ON THE APPLICATION OF METHODS USED TO CALCULATE THE FRACTAL DIMENSION OF FRACTURE SURFACES

Fractals ◽

10.1142/s0218348x01000464 ◽

2001 ◽

Vol 09 (01) ◽

pp. 105-128 ◽

Cited By ~ 26

Author(s):

TAYFUN BABADAGLI ◽

KAYHAN DEVELI

Keyword(s):

Fractal Dimension ◽

Fractal Dimensions ◽

Natural Fracture ◽

Data Sets ◽

Data Set ◽

Fracture Surfaces ◽

Power Spectral ◽

Data Points ◽

2D Data ◽

Fracturing Mechanism

This paper presents an evaluation of the methods applied to calculate the fractal dimension of fracture surfaces. Variogram (applicable to 1D self-affine sets) and power spectral density analyses (applicable to 2D self-affine sets) are selected to calculate the fractal dimension of synthetic 2D data sets generated using fractional Brownian motion (fBm). Then, the calculated values are compared with the actual fractal dimensions assigned in the generation of the synthetic surfaces. The main factor considered is the size of the 2D data set (number of data points). The critical sample size that yields the best agreement between the calculated and actual values is defined for each method. Limitations and the proper use of each method are clarified after an extensive analysis. The two methods are also applied to synthetically and naturally developed fracture surfaces of different types of rocks. The methods yield inconsistent fractal dimensions for natural fracture surfaces and the reasons of this are discussed. The anisotropic feature of fractal dimension that may lead to a correlation of fracturing mechanism and multifractality of the fracture surfaces is also addressed.

Download Full-text

CNVScope: Visually Exploring Copy Number Aberrations in Cancer Genomes

Cancer Informatics ◽

10.1177/1176935119890290 ◽

2019 ◽

Vol 18 ◽

pp. 117693511989029

Author(s):

James LT Dalgleish ◽

Yonghong Wang ◽

Jack Zhu ◽

Paul S Meltzer

Keyword(s):

Copy Number ◽

High Performance ◽

Data Sets ◽

Data Set ◽

The Public ◽

Public Data ◽

Analysis Package ◽

Cis And Trans ◽

High Performance Computing Cluster ◽

Shiny Application

Motivation: DNA copy number (CN) data are a fast-growing source of information used in basic and translational cancer research. Most CN segmentation data are presented without regard to the relationship between chromosomal regions. We offer both a toolkit to help scientists without programming experience visually explore the CN interactome and a package that constructs CN interactomes from publicly available data sets. Results: The CNVScope visualization, based on a publicly available neuroblastoma CN data set, clearly displays a distinct CN interaction in the region of the MYCN, a canonical frequent amplicon target in this cancer. Exploration of the data rapidly identified cis and trans events, including a strong anticorrelation between 11q loss and17q gain with the region of 11q loss bounded by the cell cycle regulator CCND1. Availability: The shiny application is readily available for use at http://cnvscope.nci.nih.gov/ , and the package can be downloaded from CRAN ( https://cran.r-project.org/package=CNVScope ), where help pages and vignettes are located. A newer version is available on the GitHub site ( https://github.com/jamesdalg/CNVScope/ ), which features an animated tutorial. The CNVScope package can be locally installed using instructions on the GitHub site for Windows and Macintosh systems. This CN analysis package also runs on a linux high-performance computing cluster, with options for multinode and multiprocessor analysis of CN variant data. The shiny application can be started using a single command (which will automatically install the public data package).

Download Full-text

Quantitative comparison of the variability in observed and simulated shortwave reflectance

Atmospheric Chemistry and Physics ◽

10.5194/acp-13-3133-2013 ◽

2013 ◽

Vol 13 (6) ◽

pp. 3133-3147 ◽

Cited By ~ 10

Author(s):

Y. L. Roberts ◽

P. Pilewskie ◽

B. C. Kindel ◽

D. R. Feldman ◽

W. D. Collins

Keyword(s):

Principal Component ◽

Reflectance Spectra ◽

Observation System ◽

Data Sets ◽

Water Vapor Absorption ◽

Data Set ◽

Shared Spaces ◽

Si Traceability ◽

Scanning Imaging ◽

Earth’S Climate

Abstract. The Climate Absolute Radiance and Refractivity Observatory (CLARREO) is a climate observation system that has been designed to monitor the Earth's climate with unprecedented absolute radiometric accuracy and SI traceability. Climate Observation System Simulation Experiments (OSSEs) have been generated to simulate CLARREO hyperspectral shortwave imager measurements to help define the measurement characteristics needed for CLARREO to achieve its objectives. To evaluate how well the OSSE-simulated reflectance spectra reproduce the Earth's climate variability at the beginning of the 21st century, we compared the variability of the OSSE reflectance spectra to that of the reflectance spectra measured by the Scanning Imaging Absorption Spectrometer for Atmospheric Cartography (SCIAMACHY). Principal component analysis (PCA) is a multivariate decomposition technique used to represent and study the variability of hyperspectral radiation measurements. Using PCA, between 99.7% and 99.9% of the total variance the OSSE and SCIAMACHY data sets can be explained by subspaces defined by six principal components (PCs). To quantify how much information is shared between the simulated and observed data sets, we spectrally decomposed the intersection of the two data set subspaces. The results from four cases in 2004 showed that the two data sets share eight (January and October) and seven (April and July) dimensions, which correspond to about 99.9% of the total SCIAMACHY variance for each month. The spectral nature of these shared spaces, understood by examining the transformed eigenvectors calculated from the subspace intersections, exhibit similar physical characteristics to the original PCs calculated from each data set, such as water vapor absorption, vegetation reflectance, and cloud reflectance.

Download Full-text