Optimization through Bayesian Classification on the k-Anonymized Data

Author(s):  
L Mohana Tirumala ◽  
S. Srinivasa Rao

Privacy preservation in data mining and publishing plays a major role in today's networked world. It is important to preserve the privacy of the vital information contained in a data set, and this can be achieved through a k-anonymization solution for classification. Along with preserving privacy through anonymization, producing optimized data sets in a cost-effective manner is of equal importance. In this paper, a Top-Down Refinement algorithm is proposed that yields optimal results in a cost-effective manner, and Bayesian classification is used to predict class membership probabilities for data tuples whose class labels are unknown.
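The abstract above does not include an implementation; as a rough illustration of the Bayesian classification step, the sketch below trains a naive Bayes classifier on a small, already-generalized (k-anonymized-style) table and predicts class membership probabilities for an unlabeled tuple. The column names, generalized values, and use of scikit-learn's CategoricalNB are assumptions for illustration, not details from the paper.

```python
# Illustrative sketch only: a naive Bayes classifier predicting class membership
# probabilities on a toy, already-generalized (k-anonymized-style) data set.
# Column names and values are hypothetical, not taken from the paper.
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Quasi-identifiers generalized into ranges/categories, as k-anonymization would produce.
data = pd.DataFrame({
    "age_range":    ["20-30", "20-30", "30-40", "30-40", "40-50", "40-50"],
    "zip_prefix":   ["130**", "130**", "148**", "148**", "148**", "130**"],
    "education":    ["Bachelors", "Masters", "Bachelors", "PhD", "Masters", "PhD"],
    "income_class": ["<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"],
})

X = data.drop(columns="income_class")
y = data["income_class"]

enc = OrdinalEncoder()          # CategoricalNB expects ordinal-encoded categories
X_enc = enc.fit_transform(X)

clf = CategoricalNB().fit(X_enc, y)

# Predict class membership probabilities for a tuple whose label is unknown.
new_tuple = pd.DataFrame([{"age_range": "30-40", "zip_prefix": "148**", "education": "Masters"}])
probs = clf.predict_proba(enc.transform(new_tuple))
print(dict(zip(clf.classes_, probs[0])))
```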

2021 ◽  
Vol 14 (11) ◽  
pp. 2519-2532
Author(s):  
Fatemeh Nargesian ◽  
Abolfazl Asudeh ◽  
H. V. Jagadish

Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: that it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation-based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.
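As a much-simplified illustration of the acquisition problem (not the paper's optimal or approximation algorithms), the sketch below greedily buys records from whichever source has the lowest expected cost per still-needed record, assuming known group distributions and per-record costs; all sources, probabilities, and costs are invented.

```python
# Simplified illustration (not the paper's algorithm): greedily buy samples from the
# source with the lowest expected cost per still-needed record, assuming known
# group distributions and per-sample costs. All numbers are made up.
import random

random.seed(0)

# Each source: (probability of drawing group "A", cost per sampled record)
sources = {"S1": (0.8, 1.0), "S2": (0.3, 0.5), "S3": (0.5, 2.0)}
need = {"A": 50, "B": 50}          # desired counts per group
got = {"A": 0, "B": 0}
total_cost = 0.0

def expected_useful_fraction(p_a):
    """Fraction of a draw expected to count toward a still-needed group."""
    frac = 0.0
    if got["A"] < need["A"]:
        frac += p_a
    if got["B"] < need["B"]:
        frac += 1.0 - p_a
    return frac

while got["A"] < need["A"] or got["B"] < need["B"]:
    # Pick the source with the cheapest expected cost per useful record.
    name, (p_a, cost) = min(
        sources.items(),
        key=lambda kv: kv[1][1] / max(expected_useful_fraction(kv[1][0]), 1e-9),
    )
    group = "A" if random.random() < p_a else "B"
    total_cost += cost
    if got[group] < need[group]:
        got[group] += 1

print(got, round(total_cost, 2))
```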


2020 ◽  
Author(s):  
Eung Seok Lee ◽  
Ryan Wolbert

Acid mine drainage (AMD) is considered one of the most prevalent environmental problems worldwide, and remediation of AMD-affected streams remains a major challenge due to the large affected areas, large volumes of polluted water, poor accessibility, and lack of financial support. Advanced oxidation processes (AOPs) have been widely investigated as potential remedial options for contaminated water bodies in a variety of settings, such as groundwater and waste discharges. This study presents a novel, cost-effective approach for using AOPs to improve the quality of AMD-affected streams. Slow-release cylinders and pellets were created using a polymeric binder and reagent salts that release a strong oxidant and alkalinity upon dissolution in water. Results of column tests demonstrated that release durations exceeded 29 days and that up to 100% iron removal was achieved within 20 minutes. Field-scale slow-release forms were manufactured and applied to an AMD site in southeast Ohio, USA, for a 29-day demonstration study. Narrow channels were constructed for installation of the slow-release forms and for characterization of the quality and flow of mine seeps and the AMD stream during low subsurface-flow periods. Results of the field investigations suggest that the slow-release forms can be used to rapidly remove metals from AMD, improve water parameters such as pH, and minimize the ecological impacts of remediation within the system in a cost-effective manner.


2019 ◽  
Vol 16 (2) ◽  
pp. 445-452
Author(s):  
Kishore S. Verma ◽  
A. Rajesh ◽  
Adeline J. S. Johnsana

K-anonymization is one of the most widely used approaches for protecting individual records from privacy-leakage attacks in the Privacy Preserving Data Mining (PPDM) arena. An anonymized dataset typically reduces the effectiveness of data mining results, so PPDM researchers currently direct their efforts toward finding the optimum trade-off between privacy and utility. This work identifies the optimum classifier from a set of strong data mining classifiers capable of generating value-added classification results on utility-aware k-anonymized data sets. The analysis is performed on data sets anonymized with respect to utility factors such as null-value count and transformation pattern loss. Experiments are conducted with three widely used classifiers, HNB, PART, and J48, which are analysed using Accuracy, F-measure, and ROC-AUC, well-established measures of classification performance. Our experimental analysis reveals the best classifiers on utility-aware anonymized data sets produced by Cell-oriented Anonymization (CoA), Attribute-oriented Anonymization (AoA), and Record-oriented Anonymization (RoA).
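The sketch below shows how such a comparison might be run in Python; since HNB, PART, and J48 are Weka classifiers, scikit-learn stand-ins and synthetic data are used here purely to illustrate scoring with Accuracy, F-measure, and ROC-AUC.

```python
# Illustrative sketch: evaluating candidate classifiers on an (already anonymized)
# data set with Accuracy, F-measure and ROC-AUC. The paper uses Weka's HNB, PART
# and J48; here scikit-learn stand-ins and synthetic data are used instead.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # placeholder data

classifiers = {
    "NaiveBayes (HNB stand-in)": GaussianNB(),
    "DecisionTree (J48 stand-in)": DecisionTreeClassifier(random_state=42),
}

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=5,
                            scoring=["accuracy", "f1", "roc_auc"])
    print(name,
          "acc=%.3f" % scores["test_accuracy"].mean(),
          "f1=%.3f" % scores["test_f1"].mean(),
          "auc=%.3f" % scores["test_roc_auc"].mean())
```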


2014 ◽  
Vol 121 (1) ◽  
pp. 131-141 ◽  
Author(s):  
Silvain Bériault ◽  
Abbas F. Sadikot ◽  
Fahd Alsubaie ◽  
Simon Drouin ◽  
D. Louis Collins ◽  
...  

Careful trajectory planning on preoperative vascular imaging is an essential step in deep brain stimulation (DBS) to minimize the risks of hemorrhagic complications and postoperative neurological deficits. This paper compares 2 MRI methods for visualizing cerebral vasculature and planning DBS probe trajectories: a single-data-set T1-weighted scan with double-dose gadolinium contrast (T1w-Gd) and a multi-data-set protocol consisting of a T1-weighted structural scan, susceptibility-weighted venography, and time-of-flight angiography (T1w-SWI-TOF). Two neurosurgeons who specialize in neuromodulation surgery planned bilateral subthalamic nucleus (STN) DBS in 18 patients with Parkinson's disease (36 hemispheres) using each protocol separately. Planned trajectories were then evaluated across all vascular data sets (T1w-Gd, SWI, and TOF) to detect possible intersection with blood vessels along the entire path via an objective vesselness measure. The authors' results show that trajectories planned on T1w-SWI-TOF successfully avoided the cerebral vasculature imaged by conventional T1w-Gd and did not suffer from missing vascular information or imprecise data set registration. Furthermore, with appropriate planning and visualization software, trajectory corridors planned on T1w-SWI-TOF intersected significantly less of the fine vasculature that was not detected on the T1w-Gd (p < 0.01 within 2 mm and p < 0.001 within 4 mm of the track centerline). The proposed T1w-SWI-TOF protocol has minimal effects on the imaging and surgical workflow, improves vessel avoidance, and provides a safe, cost-effective alternative to injection of gadolinium contrast.
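A minimal sketch of the trajectory-checking idea (not the authors' planning software): sample points along a straight entry-to-target track through a precomputed vesselness volume and report the worst value encountered; the volume, coordinates, and threshold below are hypothetical.

```python
# Minimal sketch (not the authors' software): sample a straight DBS track between an
# entry and a target point through a precomputed 3-D "vesselness" volume and report
# the worst value along the path. Volume, coordinates and threshold are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
vesselness = rng.random((128, 128, 128))      # placeholder vesselness volume (0..1)

entry  = np.array([20.0, 30.0, 120.0])        # voxel coordinates of the entry point
target = np.array([64.0, 64.0, 40.0])         # voxel coordinates of the target (e.g. STN)

n_steps = 200
points = entry + np.linspace(0.0, 1.0, n_steps)[:, None] * (target - entry)
idx = np.clip(np.round(points).astype(int), 0, np.array(vesselness.shape) - 1)

values = vesselness[idx[:, 0], idx[:, 1], idx[:, 2]]
print("max vesselness along track:", values.max())
print("unsafe" if values.max() > 0.9 else "acceptable")   # threshold is illustrative
```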


Geophysics ◽  
2001 ◽  
Vol 66 (6) ◽  
pp. 1761-1773 ◽  
Author(s):  
Roman Spitzer ◽  
Alan G. Green ◽  
Frank O. Nitsche

By appropriately decimating a comprehensive shallow 3-D seismic reflection data set recorded across unconsolidated sediments in northern Switzerland, we have investigated the potential and limitations of four different source-receiver acquisition patterns. For the original survey, more than 12 000 shots and 18 000 receivers deployed on a [Formula: see text] grid resulted in common midpoint (CMP) data with an average fold of ∼40 across a [Formula: see text] area. A principal goal of our investigation was to determine an acquisition strategy capable of producing reliable subsurface images in a more efficient and cost-effective manner. Field efforts for the four tested acquisition strategies were approximately 50%, 50%, 25%, and 20% of the original effort. All four data subsets were subjected to a common processing sequence. Static corrections, top-mute functions, and stacking velocities were estimated individually for each subset. Because shallow reflections were difficult to discern on shot and CMP gathers generated with the lowest-density acquisition pattern (20% field effort), dependable top-mute functions could not be estimated, and data from this pattern were not processed to completion. Of the three fully processed data subsets, two (50% field effort and 25% field effort) yielded 3-D migrated images comparable to that derived from the entire data set, whereas the third (50% field effort) produced good-quality images only in the shallow subsurface because of a lack of far-offset data. On the basis of these results, we concluded that all geological objectives associated with our particular study site, which included mapping complex lithological units and their intervening shallowly dipping boundaries, would have been achieved by conducting a 3-D seismic reflection survey that was 75% less expensive than the original one.


Author(s):  
Quanming Yao ◽  
Xiawei Guo ◽  
James Kwok ◽  
Weiwei Tu ◽  
Yuqiang Chen ◽  
...  

To meet the standard of differential privacy, noise is usually added to the original data, which inevitably degrades the predictive performance of subsequent learning algorithms. In this paper, motivated by the success of ensemble learning in improving predictive performance, we propose to enhance privacy-preserving logistic regression by stacking. We show that this can be done either by sample-based or feature-based partitioning. However, we prove that when privacy budgets are the same, feature-based partitioning requires fewer samples than sample-based partitioning, and thus is likely to have better empirical performance. As transfer learning is difficult to integrate with a differential privacy guarantee, we further combine the proposed method with hypothesis transfer learning to address the problem of learning across different organizations. Finally, we not only demonstrate the effectiveness of our method on two benchmark data sets, MNIST and NEWS20, but also apply it to a real application, cross-organizational diabetes prediction on the RUIJIN data set, where privacy is a significant concern.
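A simplified sketch of the feature-based partitioning plus stacking structure is given below; the differential-privacy noise mechanism and the hypothesis transfer learning component from the paper are omitted, and the data set, block count, and models are placeholders.

```python
# Simplified sketch of stacking with feature-based partitioning (the paper's
# differential-privacy noise mechanism is omitted; this only illustrates the
# partition-then-stack structure). Data and split sizes are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature-based partitioning: each base learner sees a disjoint block of features.
feature_blocks = np.array_split(np.arange(X.shape[1]), 4)

base_models, meta_train = [], []
for block in feature_blocks:
    m = LogisticRegression(max_iter=1000).fit(X_tr[:, block], y_tr)
    # A DP variant would add calibrated noise to m's parameters here.
    base_models.append((block, m))
    meta_train.append(m.predict_proba(X_tr[:, block])[:, 1])

# Stacking: a meta-level logistic regression over the base models' predictions.
meta = LogisticRegression().fit(np.column_stack(meta_train), y_tr)

meta_test = np.column_stack(
    [m.predict_proba(X_te[:, block])[:, 1] for block, m in base_models]
)
print("stacked accuracy:", meta.score(meta_test, y_te))
```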


2019 ◽  
Author(s):  
Adam H. Freedman ◽  
John M. Gaspar ◽  
Timothy B. Sackton

Background: Typical experimental design advice for expression analyses using RNA-seq generally assumes that single-end reads provide robust gene-level expression estimates in a cost-effective manner, and that the additional benefits obtained from paired-end sequencing are not worth the additional cost. However, in many cases (e.g., with Illumina NextSeq and NovaSeq instruments), shorter paired-end reads and longer single-end reads can be generated for the same cost, and it is not obvious which strategy should be preferred. Using publicly available data, we test whether short paired-end reads can achieve more robust expression estimates and differential expression results than single-end reads of approximately the same total number of sequenced bases.

Results: At both the transcript and gene levels, 2×40 paired-end reads unequivocally provide expression estimates that are more highly correlated with 2×125 than 1×75 reads; in nearly all cases, those correlations are also greater than for 1×125, despite the greater total number of sequenced bases for the latter. Across an array of metrics, differential expression tests based upon 2×40 consistently outperform those using 1×75.

Conclusion: Researchers seeking a cost-effective approach for gene-level expression analysis should prefer short paired-end reads over a longer single-end strategy. Short paired-end reads will also give reasonably robust expression estimates and differential expression results at the isoform level.
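As a toy illustration of the comparison metric (not the study's actual quantifications), the sketch below computes Spearman correlations between synthetic gene-level expression estimates from two hypothetical read strategies and a reference; all values are simulated.

```python
# Toy sketch of the comparison metric: Spearman correlation between gene-level
# expression estimates (e.g., TPM) from two read strategies against a reference.
# Values are synthetic; the actual study uses real quantifications.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
reference = rng.lognormal(mean=2.0, sigma=1.5, size=5000)     # "2x125" reference estimates

# Hypothetical estimates from the two cheaper strategies, with different noise levels.
paired_2x40 = reference * rng.lognormal(0.0, 0.15, size=5000)
single_1x75 = reference * rng.lognormal(0.0, 0.30, size=5000)

rho_paired, _ = spearmanr(reference, paired_2x40)
rho_single, _ = spearmanr(reference, single_1x75)
print(f"2x40 vs reference rho={rho_paired:.3f}, 1x75 vs reference rho={rho_single:.3f}")
```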


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Jaak Simm ◽  
Lina Humbeck ◽  
Adam Zalewski ◽  
Noe Sturm ◽  
Wouter Heyndrickx ◽  
...  

With the increase in applications of machine learning methods in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal is to achieve a realistic split of chemical structures (compounds) between training, validation and test sets such that the performance on the test set is meaningful for inferring performance in a prospective application. This challenge is interesting and relevant in its own right, but it becomes even more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods which provide a splitting of a data set and are applicable in a federated privacy-preserving setting, namely: a. locality-sensitive hashing (LSH), b. sphere exclusion clustering, c. scaffold-based binning (scaffold network). For evaluation of these splitting methods we consider the following quality criteria (compared to random splitting): bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are: a. both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets; b. in terms of compute cost, sphere exclusion clustering is very expensive in a federated privacy-preserving setting.
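A minimal sketch of an LSH-style split, under the assumption of synthetic binary fingerprints rather than real chemical structures: compounds that share a hash key (a fixed subset of fingerprint bits that all parties agree on) land in the same bucket, and whole buckets are assigned to train, validation, or test.

```python
# Minimal sketch of an LSH-style split: compounds with identical hash keys (here, a
# random subset of fingerprint bits) land in the same bucket, and whole buckets are
# assigned to train/validation/test. Fingerprints are synthetic stand-ins for real
# chemical fingerprints; agreeing on the hash does not require sharing structures.
import numpy as np

rng = np.random.default_rng(42)
fingerprints = rng.integers(0, 2, size=(1000, 256))      # 1000 compounds, 256-bit FPs

hash_bits = rng.choice(256, size=16, replace=False)      # bits all parties agree on

def bucket_key(fp):
    """LSH key: the selected fingerprint bits, packed into a tuple."""
    return tuple(fp[hash_bits])

buckets = {}
for i, fp in enumerate(fingerprints):
    buckets.setdefault(bucket_key(fp), []).append(i)

# Assign whole buckets to folds so near-duplicates never straddle the split.
folds = {"train": [], "valid": [], "test": []}
names, weights = list(folds), [0.8, 0.1, 0.1]
for members in buckets.values():
    folds[rng.choice(names, p=weights)].append(members)

print({k: sum(len(m) for m in v) for k, v in folds.items()})
```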


Geophysics ◽  
2006 ◽  
Vol 71 (5) ◽  
pp. G249-G260 ◽  
Author(s):  
Esben Auken ◽  
Louise Pellerin ◽  
Niels B. Christensen ◽  
Kurt Sørensen

Electrical and electromagnetic (E&EM) methods for near-surface investigations have undergone rapid improvements over the past few decades. Besides the traditional applications in groundwater investigations, natural-resource exploration, and geological mapping, a number of new applications have appeared. These include hazardous-waste characterization studies, precision-agriculture applications, archeological surveys, and geotechnical investigations. The inclusion of microprocessors in survey instruments, the development of new interpretation algorithms, and easy access to powerful computers have supported innovation throughout the geophysical community, and the E&EM community is no exception. Most notable is the development of continuous-measurement systems that generate large, dense data sets efficiently. These have contributed significantly to the usefulness of E&EM methods by allowing measurements over wide areas without sacrificing lateral resolution. The availability of these rich data sets has in turn spurred the development of interpretation algorithms, including laterally constrained 1D inversion as well as innovative 2D and 3D inversion methods. Taken together, these developments can be expected to improve the resolution and usefulness of E&EM methods and permit them to be applied economically. The trend is clearly toward dense surveying over larger areas, followed by highly automated, post-acquisition processing and interpretation to provide improved resolution of the shallow subsurface in a cost-effective manner.

