data transformations
Recently Published Documents

TOTAL DOCUMENTS: 279 (five years: 67)
H-INDEX: 28 (five years: 3)

2021 ◽  
Author(s):  
André Marquardt ◽  
Philip Kollmannsberger ◽  
Markus Krebs ◽  
Markus Knott ◽  
Antonio Giovanni Solimando ◽  
...  

Abstract Personalized oncology is a rapidly evolving area that offers cancer patients therapy options more specific than ever. Yet there is still a lack of understanding regarding the transcriptomic similarities or differences between metastases and their corresponding primary sites. To approach this question, we applied two unsupervised dimension reduction methods (t-SNE and UMAP) to three metastasis datasets (prostate cancer, neuroendocrine prostate cancer, and skin cutaneous melanoma) comprising 682 samples, under three different data transformations (unprocessed FPKM values, log10-transformed FPKM values, and log10(FPKM+1)-transformed FPKM values), to visualize potential underlying clusters. The approaches resulted in the formation of different clusters that were independent of the respective resection sites. Additionally, the data transformation critically affected cluster formation in most cases. Of note, our study revealed no tight link between the metastasis resection site and specific transcriptomic features. Instead, our analysis demonstrates the dependency of cluster formation on the underlying data transformation and the dimension reduction method applied. These observations identify data transformation as another key element in the interpretation of visual clustering approaches, alongside well-known determinants such as initialization and parameters. Furthermore, the results show the need for further evaluation of the underlying data alterations in light of the biological question and the subsequently used methods and applications.
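The practical difference between the three transformations can be sketched with a toy FPKM matrix (the matrix, its size, and its sparsity are invented for illustration; the paper's datasets are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical FPKM matrix: 20 samples x 50 genes, sparse like real RNA-seq.
fpkm = rng.gamma(shape=0.5, scale=20.0, size=(20, 50))
fpkm[rng.random(fpkm.shape) < 0.3] = 0.0  # drop-outs: zero expression

raw = fpkm                                    # unprocessed FPKM
with np.errstate(divide="ignore"):
    log10_only = np.log10(fpkm)               # -inf wherever FPKM == 0
log10_plus1 = np.log10(fpkm + 1.0)            # finite everywhere

# The +1 pseudocount keeps zero-expression genes usable downstream, which
# is one way the choice of transformation reshapes the input space that
# t-SNE or UMAP then embeds.
print(np.isinf(log10_only).any(), np.isinf(log10_plus1).any())
```

Because the three variants hand t-SNE/UMAP geometrically different inputs, divergent cluster structure across transformations is unsurprising.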


Author(s):  
Mieczysław A. Kłopotek ◽  
Robert A. Kłopotek

Abstract Kleinberg introduced an axiomatic system for clustering functions. Of the three axioms he proposed, two (scale invariance and consistency) concern data transformations that should produce the same clustering under the same clustering function. The so-called consistency axiom permits the broadest range of transformations of the data set. Kleinberg claims that k-means, one of the most popular clustering algorithms, does not have the consistency property. We challenge this claim by pointing out an invalid assumption in his proof (infinite dimensionality) and show that in one-dimensional Euclidean space the k-means algorithm does have the consistency property. We also prove that in higher-dimensional spaces, k-means is in fact inconsistent. This result is of practical importance when choosing testbeds for implementations of clustering algorithms, as it tells us under which circumstances clustering after a consistency transformation will return the same clusters. Two remedies are proposed: the gravitational consistency property and the dataset consistency property, both of which hold for k-means and are hence suitable when developing the mentioned testbeds.
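The one-dimensional case can be illustrated with a minimal sketch (scikit-learn's KMeans is our choice here, not the authors'). A consistency transformation shrinks within-cluster distances and widens the between-cluster gap; in 1-D the returned partition does not change:

```python
import numpy as np
from sklearn.cluster import KMeans

def partition(labels):
    # Represent a clustering as a set of frozensets of point indices,
    # so that label permutations compare equal.
    return {frozenset(np.flatnonzero(labels == k)) for k in np.unique(labels)}

# Two well-separated 1-D clusters.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0]).reshape(-1, 1)
before = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)

# A consistency transformation: every within-cluster distance shrinks,
# every between-cluster distance grows.
x2 = np.array([0.0, 0.5, 1.0, 20.0, 20.5, 21.0]).reshape(-1, 1)
after = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x2)

print(partition(before) == partition(after))  # same clusters in 1-D
```

The paper's inconsistency result concerns higher dimensions, where such a transformation can move points so that a different partition attains a lower k-means cost.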


2021 ◽  
Vol 2052 (1) ◽  
pp. 012058
Author(s):  
T V Zhgun

Abstract The features of the data distribution can significantly affect the composite characteristics of objects, so composite indexes of objects must take these features into account. Some types of data are characterized by distributions with a significant anomaly, in which the vast majority of observations are concentrated near the boundary values. Such data cannot always be characterized by the asymmetry coefficient. In addition, if the values of a variable are approximately symmetric about zero or concentrated near zero, the sample also cannot be characterized by the coefficient of variation. The paper proposes a transformation that allows the anomalous nature of variables to be identified using the signal-to-noise ratio. Variables are evaluated in a standard range that is shifted to the right of zero. If a logarithmic transformation is necessary, this shift avoids the pressure of small variable values that, after direct logarithmic transformation, would take large negative values. Applying a logarithmic correction to the detected anomalous variables redistributes the obtained weighting coefficients toward a more correct interpretation and, in particular, solves the problem of negative weighting coefficients.
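One hedged reading of the shifted-range idea, with an invented target range of [0.1, 1.1] (the paper does not fix specific bounds here): rescaling into a range bounded away from zero means a subsequent logarithm is bounded below by log(0.1) rather than diverging toward large negative values.

```python
import numpy as np

def rescale_shifted(x, lo=0.1, hi=1.1):
    # Map a variable into a standard range shifted right of zero, so a
    # subsequent logarithm never sees values at or below zero.
    x = np.asarray(x, dtype=float)
    return lo + (hi - lo) * (x - x.min()) / (x.max() - x.min())

def signal_to_noise(x):
    # Ratio used to flag anomalous (boundary-concentrated) variables.
    return np.mean(x) / np.std(x)

rng = np.random.default_rng(1)
# Hypothetical anomalous variable: mass piled up near the lower boundary.
anomalous = rng.exponential(scale=0.05, size=1000)
scaled = rescale_shifted(anomalous)
snr = signal_to_noise(scaled)

# Logarithmic correction spreads the piled-up values; because the range
# starts at 0.1, no value maps below log(0.1).
corrected = np.log(scaled)
print(round(snr, 2), round(float(corrected.min()), 3))
```

A low signal-to-noise ratio on the rescaled variable would flag it for the logarithmic correction.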


2021 ◽  
Author(s):  
Hadi Hojjati ◽  
Narges Armanfard

We propose an acoustic anomaly detection algorithm based on the framework of contrastive learning. Contrastive learning is a recently proposed self-supervised approach that has shown promising results in image classification and speech recognition. However, its application in anomaly detection is underexplored. Earlier studies have demonstrated that it can achieve state-of-the-art performance in image anomaly detection, but its capability in anomalous sound detection is yet to be investigated. For the first time, we propose a contrastive learning-based framework that is suitable for acoustic anomaly detection. Since most existing contrastive learning approaches are targeted toward images, the effect of other data transformations on the performance of the algorithm is unknown. Our framework learns a representation from unlabeled data by applying audio-specific data augmentations. We show that in the resulting latent space, normal and abnormal points are distinguishable. Experiments conducted on the MIMII dataset confirm that our approach can outperform competing methods in detecting anomalies.
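As an illustration of audio-specific augmentation, here are two common choices, time shifting and noise injection (these particular augmentations and parameters are assumptions for the sketch; the paper's exact augmentation set may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

def time_shift(wave, max_frac=0.1):
    # Circularly shift the waveform by up to 10% of its length.
    shift = int(rng.integers(1, int(len(wave) * max_frac) + 1))
    return np.roll(wave, shift)

def add_noise(wave, snr_db=20.0):
    # Inject Gaussian noise at a chosen signal-to-noise ratio.
    power = np.mean(wave ** 2)
    noise_power = power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=len(wave))

# Two stochastic views of the same clip form a positive pair for
# contrastive training; views of different clips act as negatives.
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
view_a = add_noise(time_shift(clip))
view_b = add_noise(time_shift(clip))
print(view_a.shape == clip.shape and not np.allclose(view_a, view_b))
```

The contrastive objective then pulls the representations of the two views together while pushing apart views of other clips.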


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Christopher T. Lee ◽  
Manolis Maragkakis

Abstract Background The Sequence Alignment/Map Format Specification (SAM) is one of the most widely adopted file formats in bioinformatics, and many researchers use it daily. Several tools, including most high-throughput sequencing read aligners, use it as their primary output, and many more tools have been developed to process it. However, despite its flexibility, SAM-encoded files can be difficult to query and understand even for experienced bioinformaticians. As genomic data grow rapidly, structured and efficient queries on data encoded in SAM/BAM files are becoming increasingly important. Existing tools are very limited in their query capabilities or are not efficient. Critically, new tools that address these shortcomings should not only support existing large datasets but should do so without requiring massive data transformations and file infrastructure reorganizations. Results Here we introduce SamQL, an SQL-like query language for the SAM format with an intuitive syntax that supports complex and efficient queries on top of SAM/BAM files and can replace the Bash one-liners commonly employed by many bioinformaticians. SamQL has high expressive power with no upper limit on query size and, when parallelized, outperforms other, substantially less expressive software. Conclusions SamQL is a complete query language that we envision as a step toward a structured database engine for genomics. SamQL is written in Go and is freely available as a standalone program and as an open-source library under an MIT license: https://github.com/maragkakislab/samql/.
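For a flavor of the kind of query involved, here is a small Python sketch (not SamQL itself, whose syntax is documented in its repository) of the structured filtering that a typical samtools/Bash one-liner performs over SAM fields:

```python
# A SAM alignment record is 11+ tab-separated fields: QNAME, FLAG, RNAME,
# POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL.
SAM_LINES = [
    "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\tFFFF",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",      # unmapped (flag 0x4)
    "read3\t16\tchr2\t200\t10\t50M\t*\t0\t0\tACGT\tFFFF",
]

def records(lines):
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        f = line.split("\t")
        yield {"qname": f[0], "flag": int(f[1]), "rname": f[2],
               "pos": int(f[3]), "mapq": int(f[4])}

# The structured equivalent of a one-liner such as
# `samtools view -q 30 -F 4 in.bam | cut -f1`:
hits = [r["qname"] for r in records(SAM_LINES)
        if r["mapq"] >= 30 and not r["flag"] & 0x4]
print(hits)  # ['read1']
```

A query language like SamQL expresses this kind of predicate declaratively over named fields instead of positional cut/awk indices.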


2021 ◽  
pp. 44-47
Author(s):  
A. V. Prutskov

Industrial information, measuring, and control systems contain a program module designed to convert measurement results into data for display and into control signals. Such a program module is interconnected with other modules through program interfaces, so data transformations are necessary when data are passed between modules. Object-oriented design patterns can be used when programming these data transformations. When converting measurement results into objects, the Adapter design pattern can be applied, while the Command pattern serves to convert objects into control signals. Data processing should be separated from data representation, storage, and transmission; these responsibilities can be divided between modules using the Model–View–Controller pattern. The use of design patterns reduces the time needed to develop and subsequently modify software for information, measuring, and control systems, as well as for systems in other areas of science and the economy.
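A compact sketch of the Adapter and Command roles in this setting (the class names, scale factor, and signal encoding are all hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RawReading:
    # Raw measurement as delivered by a hypothetical acquisition module.
    channel: int
    counts: int  # ADC counts

@dataclass
class Measurement:
    channel: int
    volts: float

class ReadingAdapter:
    """Adapter: converts raw measurement results into display-ready objects."""
    def __init__(self, volts_per_count: float):
        self.volts_per_count = volts_per_count

    def adapt(self, raw: RawReading) -> Measurement:
        return Measurement(raw.channel, raw.counts * self.volts_per_count)

class SetOutputCommand:
    """Command: encapsulates converting an object into a control signal."""
    def __init__(self, channel: int, level: float):
        self.channel, self.level = channel, level

    def execute(self) -> bytes:
        # Encode the control signal for the hypothetical output module.
        return f"SET {self.channel} {self.level:.3f}\n".encode()

m = ReadingAdapter(volts_per_count=0.001).adapt(RawReading(channel=2, counts=1234))
cmd = SetOutputCommand(channel=2, level=m.volts)
print(m.volts, cmd.execute())
```

In a Model–View–Controller split, the Measurement objects would live in the model, while the display formatting and the command dispatch sit in view and controller code respectively.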


Author(s):  
Nguyen Van Thien ◽  
Do Duc Trung

This article presents the results of an empirical study of milling SCM440 steel. The cutting insert used was TiN-coated with a tool tip radius of 0.5 mm. The experimental process comprised 18 runs arranged according to a Box-Behnken matrix, with cutting speed, feed rate, and cutting depth as the input parameters of each experiment and cutting force as the output parameter. Analysis of the experimental results determined the influence of the input parameters, as well as the interactions between them, on the output parameter. From the experimental results, a regression model describing the relationship between cutting force and the input parameters was built. Box-Cox and Johnson data transformations were then applied to construct two further models of cutting force. All three regression models were used to predict the cutting force, and their predictions were compared with the experimental results. The coefficient of determination (R-Sq), the adjusted coefficient of determination (R-Sq(adj)), and the percentage mean absolute error (%MAE) between the predicted and experimental results served as the criteria for comparing the accuracy of the cutting force models. The results show that the two models using data transformations are more accurate than the model without them. A t-test comparing the model using the Box-Cox transformation with the model using the Johnson transformation confirmed that the two have equal accuracy. Finally, the direction of the next study is outlined.
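Such transformations are readily applied before regression fitting. SciPy provides Box-Cox directly; the Johnson system has no single-call SciPy equivalent, so the related Yeo-Johnson power transform stands in here purely for comparison (the force values below are invented, not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical cutting-force readings (N): positive and right-skewed.
force = rng.lognormal(mean=5.0, sigma=0.4, size=100)

# Box-Cox: the power parameter lambda is fitted by maximum likelihood.
force_bc, lam = stats.boxcox(force)

# Yeo-Johnson (a related power transform available in SciPy; it is not
# the Johnson SU/SB system used in the paper) works on any real data.
force_yj, lam_yj = stats.yeojohnson(force)

# Skewness should move toward 0 after transformation, which is what
# improves the fit of a normal-error regression model.
print(round(stats.skew(force), 2), round(stats.skew(force_bc), 2))
```

The regression is then fitted on the transformed response, and predictions are back-transformed for comparison with measured forces.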


2021 ◽  
Vol 13 (17) ◽  
pp. 3466
Author(s):  
Gustavo de Araújo Carvalho ◽  
Peter J. Minnett ◽  
Nelson F. F. Ebecken ◽  
Luiz Landau

Linear discriminant analysis (LDA) is a mathematically robust multivariate data analysis approach that is sometimes used for surface oil slick signature classification. Our goal is to rank the effectiveness of LDAs at differentiating oil spills from look-alike slicks. We explored multiple combinations of (i) variables (size information, Meteorological-Oceanographic (metoc) variables, geo-location parameters) and (ii) data transformations (non-transformed, cube root, log10). Active and passive satellite-based measurements from RADARSAT, QuikSCAT, AVHRR, SeaWiFS, and MODIS were used. Results from two experiments are reported and discussed: (i) an investigation of 60 combinations of several attributes subjected to the same data transformation and (ii) a survey of 54 other data combinations of three selected variables subjected to different data transformations. In Experiment 1, the best discrimination was reached using ten cube-root-transformed attributes: ~85% overall accuracy using six pieces of size information, three metoc variables, and one geo-location parameter. In Experiment 2, two combinations of three variables tied as the most effective: ~81% overall accuracy using area (log-transformed), length-to-width ratio (log- or cube-root-transformed), and number of feature parts (non-transformed). After verifying the classification accuracy of 114 algorithms against expert interpretations, we concluded that applying different data transformations and accounting for metoc and geo-location attributes optimizes the accuracy of binary classifiers (oil spills vs. look-alike slicks) using the simple LDA technique.
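The binary setup can be sketched with scikit-learn's LDA on synthetic slick features (the feature distributions are invented; only the structure, log-transformed size variables feeding a two-class LDA, mirrors the experiments):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
n = 200
# Hypothetical slick features: area (km^2) and length-to-width ratio,
# drawn from overlapping lognormal distributions for the two classes.
area_spill = rng.lognormal(3.0, 0.5, n); ratio_spill = rng.lognormal(1.5, 0.3, n)
area_look  = rng.lognormal(2.0, 0.5, n); ratio_look  = rng.lognormal(0.8, 0.3, n)

X = np.column_stack([
    np.log10(np.concatenate([area_spill, area_look])),    # log-transformed
    np.log10(np.concatenate([ratio_spill, ratio_look])),  # log-transformed
])
y = np.array([1] * n + [0] * n)  # 1 = oil spill, 0 = look-alike slick

lda = LinearDiscriminantAnalysis().fit(X, y)
print(f"overall accuracy: {lda.score(X, y):.2f}")
```

Swapping the `np.log10` calls for a cube root (or the identity) is exactly the kind of transformation comparison the two experiments systematize.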


2021 ◽  
pp. 195-208
Author(s):  
Andy Hector

This chapter revisits a regression analysis to explore the normal least squares assumption of approximately equal variance. It also considers some of the data transformations that can be used to achieve this. A linear regression of transformed data is compared with a generalized linear-model equivalent that avoids transformation by using a link function and non-normal distributions. Generalized linear models based on maximum likelihood use a link function to model the mean (in this case a square-root link) and a variance function to model the variability (in this case the gamma distribution, where the variance increases as the square of the mean). The Box–Cox family of transformations is explained in detail.
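The Box-Cox family is simple to write down directly. This sketch shows the log transform as the limiting case at λ = 0 and λ = 0.5 as the analogue of the square-root link used by the generalized linear model in the chapter:

```python
import numpy as np

def box_cox(y, lam):
    # Box-Cox family: (y^lam - 1) / lam for lam != 0, with the log
    # transform as the continuous limit at lam = 0.
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

y = np.array([1.0, 2.0, 5.0, 10.0])
# lam = 1 leaves the shape unchanged (just recentred), lam = 0 is the
# log, and lam = 0.5 sits between them.
for lam in (1.0, 0.5, 0.0):
    print(lam, np.round(box_cox(y, lam), 3))
```

In practice λ is chosen by profile likelihood, and values near 0, 0.5, or 1 are usually rounded to those interpretable cases.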

