A reversible database watermarking method non-redundancy shifting-based histogram gaps

2020 ◽  
Vol 16 (5) ◽  
pp. 155014772092176 ◽  
Author(s):  
Yan Li ◽  
Junwei Wang ◽  
Xiangyang Luo

In relational databases, embedding watermarks in integer data with the traditional histogram shifting method causes large data distortion. To address this problem, a reversible database watermarking method without redundant shifting distortion is proposed, which exploits the large number of gaps in integer histograms. The method embeds the watermark bit by bit on a per-group basis. First, an integer data histogram is constructed using the absolute value of each datum's prediction error as the variable. Second, the positional relationship between each histogram column and the gaps is analyzed to find all columns adjacent to a gap. Third, the highest such column is selected as the embedding point. Finally, a watermark bit is embedded in each group by non-redundant histogram shifting. Experimental results on the Forest Cover Type data set show that, compared with existing reversible database watermarking methods such as genetic-algorithm-and-histogram-shift watermarking and histogram-gap-based watermarking, the proposed method introduces no distortion from shifting redundant histogram columns and effectively reduces the data distortion rate after watermark embedding.
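To make the embedding step concrete, the following is a minimal Python sketch of gap-based non-redundant shifting for a single group and a single watermark bit. It illustrates the idea described in the abstract, not the authors' reference implementation; the function name, the sign handling, and the exact rule of shifting the peak into an adjacent empty bin are assumptions.

```python
import numpy as np

def embed_bit(errors: np.ndarray, bit: int) -> np.ndarray:
    """Embed one watermark bit into one group of integer prediction errors."""
    signs = np.where(errors >= 0, 1, -1)
    values = np.abs(errors).astype(int)
    hist = np.bincount(values)

    # A "gap" is an empty bin; candidate embedding points are non-empty bins
    # directly adjacent to a gap, so shifting into the gap displaces no other
    # data (the non-redundant property described in the abstract).
    gaps = {v for v in range(len(hist) + 1) if v == len(hist) or hist[v] == 0}
    candidates = [v for v in range(len(hist))
                  if hist[v] > 0 and (v + 1 in gaps or (v > 0 and v - 1 in gaps))]
    peak = max(candidates, key=lambda v: hist[v])  # highest column next to a gap

    # Encode 1 by shifting the peak column into its neighbouring gap;
    # encode 0 by leaving it in place.  Only the peak column ever moves.
    out = values.copy()
    if bit == 1:
        target = peak + 1 if peak + 1 in gaps else peak - 1
        out[values == peak] = target
    return signs * out  # restore the original signs
```

Because only the peak column is moved, and it moves into a bin that was empty, no other histogram column shifts, which is what removes the redundant-shifting distortion of the traditional method.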

Mathematics ◽  
2020 ◽  
Vol 8 (11) ◽  
pp. 1994 ◽
Author(s):  
Yan Li ◽  
Junwei Wang ◽  
Hongyong Jia

Due to the discreteness of integer data, histograms built from integer data contain a large number of gaps and continuous columns. Exploiting these characteristics, this paper presents a robust and reversible watermarking algorithm for relational databases based on continuous columns in the histogram. First, it groups the database tuples according to the watermark length and a grouping key. Second, it calculates the prediction errors and constructs the histogram from their absolute values. Third, it traverses the histogram to find all continuous columns, computes the sum of heights of each continuous run, and selects the run with the largest sum as the positions in which to embed the watermark. The Forest Cover Type data set (FCTD) is used for experimental verification. Extensive experiments show that the method is effective and robust: the data distortion caused by shifting histogram columns is eliminated, and the robustness of the watermark is greatly improved.
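The continuous-column search in the third step can be sketched as follows; the function name and return convention are illustrative, not taken from the paper.

```python
import numpy as np

def largest_continuous_run(hist: np.ndarray) -> tuple[int, int]:
    """Return (start, end) bin indices of the run with the largest height sum."""
    best, best_sum = (0, 0), -1
    start = None
    for i, h in enumerate(np.append(hist, 0)):  # sentinel 0 closes a final run
        if h > 0 and start is None:
            start = i                      # a new run of non-empty bins begins
        elif h == 0 and start is not None:
            run_sum = int(hist[start:i].sum())
            if run_sum > best_sum:         # keep the run with the largest sum
                best, best_sum = (start, i - 1), run_sum
            start = None
    return best

# Example: bins 2..5 form the heaviest continuous run (sum 16).
print(largest_continuous_run(np.array([4, 0, 2, 5, 6, 3, 0, 1])))  # (2, 5)
```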


This chapter provides an implementation of the proposed model on the Forest Cover Type data set. It implements pattern extraction from this dataset by following the series of steps discussed in the proposed-model chapter. It also includes a detailed implementation of pattern prediction on the Automobile dataset for numeric variables, nominal variables, and aggregate data; pattern prediction likewise proceeds as a series of steps, as discussed earlier.


2018 ◽  
Vol 2 ◽  
pp. 31 ◽  
Author(s):  
Greg Finak ◽  
Bryan Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ‘omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high “startup” costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
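DataPackageR itself is an R package, so the sketch below only illustrates the underlying idea in language-agnostic terms using Python: preprocessing runs once, the analysis-ready artifact is recorded with a checksum, and every analyst verifies that checksum before loading. None of these function names correspond to DataPackageR's actual API.

```python
import hashlib
import json
import pathlib

def fingerprint(path: pathlib.Path) -> str:
    """MD5 digest of a processed data file."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def package_data(processed: pathlib.Path, manifest: pathlib.Path) -> None:
    """Record the checksum of an analysis-ready data set (run by the packager)."""
    manifest.write_text(json.dumps({processed.name: fingerprint(processed)}))

def load_verified(processed: pathlib.Path, manifest: pathlib.Path) -> bytes:
    """Load the data only if it matches the recorded checksum (run by analysts)."""
    recorded = json.loads(manifest.read_text())[processed.name]
    if fingerprint(processed) != recorded:
        raise RuntimeError(f"{processed.name}: checksum mismatch; "
                           "re-run the packaging step or fetch the released version")
    return processed.read_bytes()
```

The separation mirrors the abstract's design choice: the expensive processing step is decoupled from analysis, and the checksum is what guarantees every team member is analyzing the same version of the data.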


2004 ◽  
Vol 16 (7) ◽  
pp. 1345-1351 ◽  
Author(s):  
Xiaomei Liu ◽  
Lawrence O. Hall ◽  
Kevin W. Bowyer

Collobert, Bengio, and Bengio (2002) recently introduced a novel approach to using a neural network to provide a class prediction from an ensemble of support vector machines (SVMs). This approach has the advantage that the required computation scales well to very large data sets. Experiments on the Forest Cover data set show that this parallel mixture is more accurate than a single SVM, with 90.72% accuracy reported on an independent test set. Although this accuracy is impressive, their article does not consider alternative types of classifiers. We show that a simple ensemble of decision trees results in a higher accuracy, 94.75%, and is computationally efficient. This result is somewhat surprising and illustrates the general value of experimental comparisons using different types of classifiers.
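A comparable experiment is easy to run today with scikit-learn's copy of the Forest Cover Type data. The sketch below uses a random forest rather than the authors' exact ensemble of decision trees, so the reported 94.75% should not be expected to reproduce exactly; it only illustrates how cheaply such a tree-ensemble baseline can be obtained.

```python
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Forest Cover Type: ~581k samples, 54 features, 7 classes.
X, y = fetch_covtype(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X_tr, y_tr)
print(f"test accuracy: {accuracy_score(y_te, forest.predict(X_te)):.4f}")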


2020 ◽  
Vol 39 (5) ◽  
pp. 6419-6430
Author(s):  
Dusan Marcek

To forecast time series data, two methodological frameworks are considered: statistical and computational-intelligence modelling. The statistical approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with Maximum Likelihood (ML) estimation. As a competitive tool to the statistical forecasting models, we use the popular classic neural network (NN) of perceptron type. To train the NN, the Back-Propagation (BP) algorithm and heuristics such as the genetic and micro-genetic algorithms (GA and MGA) are implemented on a large data set. A comparative analysis of the selected learning methods is performed and evaluated. The experiments indicate that a population size of 20 is likely optimal, giving the lowest training time among all NNs trained by the evolutionary algorithms, with a prediction accuracy that is somewhat lower than BP's but still acceptable for managerial use.
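As a rough illustration of evolutionary NN training with a population of 20, the following toy sketch evolves the weights of a one-hidden-layer perceptron that forecasts the next value of a synthetic series from a lag window. The genetic operators, hyperparameters, and data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.arange(400) * 0.1) + rng.normal(0, 0.05, 400)
LAGS, HID = 4, 6
X = np.lib.stride_tricks.sliding_window_view(series[:-1], LAGS)  # lag windows
y = series[LAGS:]                                                # next values
n_weights = LAGS * HID + HID          # input->hidden plus hidden->output

def mse(w: np.ndarray) -> float:
    """Training error of a perceptron whose weights are the flat vector w."""
    W1 = w[:LAGS * HID].reshape(LAGS, HID)
    w2 = w[LAGS * HID:]
    pred = np.tanh(X @ W1) @ w2       # forward pass of the perceptron
    return float(np.mean((pred - y) ** 2))

pop = rng.normal(0, 1, (20, n_weights))   # population size 20, as in the paper
for gen in range(200):
    fitness = np.array([mse(ind) for ind in pop])
    parents = pop[np.argsort(fitness)[:10]]            # truncation selection
    children = (parents[rng.integers(0, 10, 10)] +
                parents[rng.integers(0, 10, 10)]) / 2  # arithmetic crossover
    children += rng.normal(0, 0.1, children.shape)     # Gaussian mutation
    pop = np.vstack([parents, children])
print(f"best training MSE: {min(mse(ind) for ind in pop):.4f}")
```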


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study predicts the Ki values of thrombin inhibitors from a large data set using machine learning methods. Because machine learning can find non-intuitive regularities in high-dimensional datasets, it can be used to build effective predictive models. A total of 6554 descriptors were collected for each compound, and an efficient descriptor selection method was applied to find the appropriate descriptors. Four methods, multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT), and Support Vector Machine (SVM), were implemented to build prediction models with the selected descriptors. Results: The SVM model performed best, with R2 = 0.84 and MSE = 0.55 on the training set and R2 = 0.83 and MSE = 0.56 on the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
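A minimal sketch of such a pipeline with scikit-learn follows. The random matrices stand in for the non-public descriptor table and measured Ki values, and the univariate filter merely approximates whatever descriptor selection method the authors used; kernel and C are placeholder choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.random.rand(500, 6554)   # placeholder for the 6554 descriptors/compound
y = np.random.rand(500)         # placeholder for (log-transformed) Ki values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(),
                      SelectKBest(f_regression, k=200),  # keep 200 descriptors
                      SVR(kernel="rbf", C=10.0))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"R2={r2_score(y_te, pred):.2f}  MSE={mean_squared_error(y_te, pred):.2f}")
```

On random placeholders the scores are meaningless; the point is the shape of the pipeline: scale, select descriptors, then fit the SVM regressor, with the held-out split playing the role of the paper's test set.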

