Data Sets Replicas Placements Strategy from Cost-Effective View in the Cloud

2016 ◽  
Vol 2016 ◽  
pp. 1-13 ◽  
Author(s):  
Xiuguo Wu

Replication technology is commonly used in cloud storage systems to improve data availability and reduce data access latency by providing users with multiple replicas of the same service. Most current approaches focus largely on improving system performance and neglect management cost when deciding how many replicas to create and where to store them. This places a heavy financial burden on cloud users, because the cost of replica storage and consistency maintenance grows with each new replica in a pay-as-you-go paradigm. In this paper, toward achieving an approximately minimal data set management cost in a practical manner, we propose a cost-effective replica placement strategy under the premise that system performance meets requirements. First, we design data set management cost models covering storage cost and transfer cost. Second, we use access frequency and average response time to decide which data sets should be replicated. We then propose a method, based on a location-problem graph, for calculating the number of replicas and their storage locations at minimum management cost. Both theoretical analysis and simulations show that the proposed strategy achieves lower management cost with fewer replicas.
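To make the cost trade-off concrete, the following minimal Python sketch models the two cost components named above (storage and transfer) and the two replication triggers (access frequency and average response time). All rates, thresholds, and site names are illustrative assumptions, not the paper's actual parameters.

```python
# Hypothetical sketch of the cost model: the management cost of a data set is
# the storage cost of its replicas plus the transfer cost of requests served
# from sites that hold no replica. Rates, thresholds, and site names are
# illustrative, not the paper's parameters.

def management_cost(replica_sites, request_origins,
                    storage_rate, transfer_rate, size_gb):
    """Storage cost grows with the replica count; transfer cost is paid only
    for requests whose origin site holds no local replica."""
    storage = len(replica_sites) * storage_rate * size_gb
    transfer = sum(transfer_rate * size_gb
                   for origin in request_origins
                   if origin not in replica_sites)
    return storage + transfer

def should_replicate(access_freq, avg_response_ms,
                     freq_threshold=100, latency_threshold_ms=200):
    # Replicate a data set only when it is both frequently accessed
    # and slow to reach from its requesters.
    return access_freq > freq_threshold and avg_response_ms > latency_threshold_ms

sites = {"eu-west", "us-east"}
origins = ["us-east", "ap-south", "ap-south", "eu-west"]
print(management_cost(sites, origins,
                      storage_rate=0.023, transfer_rate=0.09, size_gb=50.0))
print(should_replicate(access_freq=340, avg_response_ms=420))
```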

Author(s):  
L Mohana Tirumala ◽  
S. Srinivasa Rao

Privacy preservation in data mining and publishing plays a major role in today's networked world. It is important to preserve the privacy of the vital information contained in a data set. This can be achieved through a k-anonymization solution for classification. Along with privacy preservation through anonymization, yielding optimized data sets in a cost-effective manner is of equal importance. In this paper, a Top-Down Refinement algorithm is proposed that yields optimal results in a cost-effective manner. Bayesian classification is also proposed to predict class membership probabilities for a data tuple whose associated class label is unknown.
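As a concrete illustration of the final step, the sketch below implements a minimal naive Bayes classifier that predicts class membership probabilities for a tuple with an unknown label. The toy records and attribute values are invented; the Top-Down Refinement anonymization step itself is not reproduced here.

```python
# Minimal categorical naive Bayes: estimate P(class | tuple) for a tuple
# whose class label is unknown. Records and attributes are toy examples.
from collections import Counter, defaultdict

def fit(records, labels):
    priors = Counter(labels)                 # class counts
    cond = defaultdict(Counter)              # per-class value counts per attribute
    vocab = defaultdict(set)                 # distinct values seen per attribute
    for rec, y in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(y, i)][v] += 1
            vocab[i].add(v)
    return priors, cond, vocab

def posterior(priors, cond, vocab, tup):
    total = sum(priors.values())
    scores = {}
    for y, n_y in priors.items():
        p = n_y / total
        for i, v in enumerate(tup):
            # Laplace smoothing so unseen attribute values do not zero the product.
            p *= (cond[(y, i)][v] + 1) / (n_y + len(vocab[i]))
        scores[y] = p
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

records = [("30-40", "grad"), ("20-30", "hs"), ("30-40", "grad"), ("40-50", "hs")]
labels = ["buys", "no", "buys", "no"]
priors, cond, vocab = fit(records, labels)
print(posterior(priors, cond, vocab, ("30-40", "hs")))
```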


2016 ◽  
Vol 39 (11) ◽  
pp. 1477-1501 ◽  
Author(s):  
Victoria Goode ◽  
Nancy Crego ◽  
Michael P. Cary ◽  
Deirdre Thornlow ◽  
Elizabeth Merwin

Researchers must evaluate the strengths and weaknesses of candidate data sets when choosing a secondary data set for a health care study. This research method review informs the reader of the major issues investigators should consider when incorporating secondary data into their repertoire of potential research designs, and it shows the range of approaches investigators may take to answer nursing research questions across a variety of contexts. The researcher needs expertise in locating and judging data sets and must develop complex data management skills to handle large numbers of records. Important considerations, such as a firm grasp of the research question supported by a conceptual framework and the selection of appropriate databases, guide the researcher in delineating the unit of analysis. Other, more complex issues in secondary data research include data access, management and security, and the construction of complex variables.


Energies ◽  
2020 ◽  
Vol 13 (15) ◽  
pp. 3859
Author(s):  
Benjamin Rösner ◽  
Sebastian Egli ◽  
Boris Thies ◽  
Tina Beyer ◽  
Doron Callies ◽  
...  

Coherent Doppler wind lidar (CWDL) is a cost-effective way to estimate wind power potential at hub height without the need to build a meteorological tower. However, fog and low stratus (FLS) can reduce the availability of lidar measurements. Advance information about such reductions in wind data availability at a prospective lidar deployment site is valuable when planning a measurement strategy. In this paper, we show that availability reductions caused by FLS can be estimated by comparing time series of lidar measurements, conducted with WindCube v1 and v2 instruments, with time series of cloud base altitude (CBA) derived from satellite data. This enables us to compute average maps (2006–2017) of estimated availability, including FLS-induced data losses, for Germany, which can be used for planning purposes. These maps show that the lower mountain ranges and the Alpine regions in Germany often reach the critical data availability threshold of 80% or below. Especially during winter, special care must be taken when using lidar in southern and central regions of Germany. If only shorter lidar campaigns are planned (3–6 months), the representativeness of weather types should be considered as well, because in individual years and under persistent weather types, lowland areas may also be temporarily affected by higher rates of data loss. This is shown by several examples, e.g., radiation fog under anticyclonic weather types.
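A rough sketch of the availability estimate described above: treat a lidar measurement as unavailable whenever the satellite-derived cloud base altitude sits at or below the measurement height, then compare the resulting availability against the 80% threshold named in the abstract. The sample series and measurement height are invented for illustration.

```python
# Illustrative availability estimate: a time step counts as available when
# the scene is cloud-free (None) or the cloud base lies above the
# measurement height. Series and heights are invented.
def estimated_availability(cba_series_m, measurement_height_m):
    """Fraction of time steps with usable returns at the given height."""
    usable = sum(1 for cba in cba_series_m
                 if cba is None or cba > measurement_height_m)
    return usable / len(cba_series_m)

cba = [None, 900.0, 120.0, 80.0, None, 150.0, 60.0, None]  # metres AGL
avail = estimated_availability(cba, measurement_height_m=135.0)
print(f"estimated availability: {avail:.0%}, below critical 80%: {avail <= 0.80}")
```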


2014 ◽  
Vol 121 (1) ◽  
pp. 131-141 ◽  
Author(s):  
Silvain Bériault ◽  
Abbas F. Sadikot ◽  
Fahd Alsubaie ◽  
Simon Drouin ◽  
D. Louis Collins ◽  
...  

Careful trajectory planning on preoperative vascular imaging is an essential step in deep brain stimulation (DBS) to minimize risks of hemorrhagic complications and postoperative neurological deficits. This paper compares 2 MRI methods for visualizing cerebral vasculature and planning DBS probe trajectories: a single-data-set T1-weighted scan with double-dose gadolinium contrast (T1w-Gd) and a multi-data-set protocol consisting of T1-weighted structural imaging, susceptibility-weighted venography, and time-of-flight angiography (T1w-SWI-TOF). Two neurosurgeons who specialize in neuromodulation surgery planned bilateral subthalamic nucleus (STN) DBS in 18 patients with Parkinson's disease (36 hemispheres) using each protocol separately. Planned trajectories were then evaluated across all vascular data sets (T1w-Gd, SWI, and TOF) to detect possible intersections with blood vessels along the entire path via an objective vesselness measure. The authors' results show that trajectories planned on T1w-SWI-TOF successfully avoided the cerebral vasculature imaged by conventional T1w-Gd and did not suffer from missing vascular information or imprecise data set registration. Furthermore, with appropriate planning and visualization software, trajectory corridors planned on T1w-SWI-TOF intersected significantly less fine vasculature that was not detected on the T1w-Gd (p < 0.01 within 2 mm and p < 0.001 within 4 mm of the track centerline). The proposed T1w-SWI-TOF protocol has minimal effects on the imaging and surgical workflow, improves vessel avoidance, and provides a safe, cost-effective alternative to injection of gadolinium contrast.
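The path evaluation step can be illustrated with a small sketch: sample points along a straight entry-to-target line through a 3D vesselness volume and report the highest score encountered, so corridors crossing likely vessels can be rejected. The synthetic volume and the rejection threshold below are placeholders, not the authors' objective vesselness measure.

```python
# Hedged sketch: nearest-neighbour sampling of a vesselness volume along a
# straight trajectory; a high maximum score suggests the path crosses a vessel.
import numpy as np

def max_vesselness_along_path(volume, entry_vox, target_vox, n_samples=200):
    """Sample the volume along the centerline from entry to target
    (both given in voxel coordinates) and return the peak score."""
    entry = np.asarray(entry_vox, float)
    target = np.asarray(target_vox, float)
    scores = []
    for t in np.linspace(0.0, 1.0, n_samples):
        p = np.rint(entry + t * (target - entry)).astype(int)
        p = np.clip(p, 0, np.array(volume.shape) - 1)  # stay inside the volume
        scores.append(volume[tuple(p)])
    return float(max(scores))

rng = np.random.default_rng(0)
vesselness = rng.random((64, 64, 64))          # synthetic stand-in volume in [0, 1]
score = max_vesselness_along_path(vesselness, (5, 5, 0), (40, 38, 60))
print("reject trajectory" if score > 0.95 else "trajectory acceptable", score)
```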


2016 ◽  
Author(s):  
Dorothee C. E. Bakker ◽  
Benjamin Pfeil ◽  
Camilla S. Landa ◽  
Nicolas Metzl ◽  
Kevin M. O'Brien ◽  
...  

Abstract. The Surface Ocean CO2 Atlas (SOCAT) is a synthesis of quality-controlled fCO2 (fugacity of carbon dioxide) values for the global surface oceans and coastal seas with regular updates. Version 3 of SOCAT has 14.5 million fCO2 values from 3646 data sets covering the years 1957 to 2014. This latest version has an additional 4.4 million fCO2 values relative to version 2 and extends the record from 2011 to 2014. Version 3 also significantly increases the data availability for 2005 to 2013. SOCAT has an average of approximately 1.2 million surface water fCO2 values per year for the years 2006 to 2012. Quality and documentation of the data have improved. A new feature is the data set quality control (QC) flag of E for data from alternative sensors and platforms. The accuracy of surface water fCO2 has been defined for all data set QC flags. Automated range checking has been carried out for all data sets during their upload into SOCAT. The upgrade of the interactive Data Set Viewer (previously known as the Cruise Data Viewer) allows better interrogation of the SOCAT data collection and rapid creation of high-quality figures for scientific presentations. Automated data upload has been launched for version 4 and will enable more frequent SOCAT releases in the future. High-profile scientific applications of SOCAT include quantification of the ocean sink for atmospheric carbon dioxide and its long-term variation, detection of ocean acidification, as well as evaluation of coupled-climate and ocean-only biogeochemical models. Users of SOCAT data products are urged to acknowledge the contribution of data providers, as stated in the SOCAT Fair Data Use Statement. This ESSD (Earth System Science Data) "Living Data" publication documents the methods and data sets used for the assembly of this new version of the SOCAT data collection and compares these with those used for earlier versions of the data collection (Pfeil et al., 2013; Sabine et al., 2013; Bakker et al., 2014).
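As an illustration of the automated range checking mentioned above, the sketch below flags fCO2 values outside a plausible surface-water window before a data set would be accepted. The bounds are assumed placeholders, not SOCAT's operational limits.

```python
# Hypothetical range check on upload: flag fCO2 values outside an assumed
# plausible surface-ocean window for manual review. Bounds are illustrative.
PLAUSIBLE_FCO2_UATM = (100.0, 1000.0)   # assumed window, micro-atmospheres

def range_check(fco2_values, bounds=PLAUSIBLE_FCO2_UATM):
    lo, hi = bounds
    return [(i, v) for i, v in enumerate(fco2_values) if not (lo <= v <= hi)]

suspect = range_check([312.4, 385.9, 1450.2, 402.7, 55.0])
print(f"{len(suspect)} value(s) flagged for review: {suspect}")
```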


2019 ◽  
Vol 16 (3) ◽  
pp. 705-731
Author(s):  
Haoze Lv ◽  
Zhaobin Liu ◽  
Zhonglian Hu ◽  
Lihai Nie ◽  
Weijiang Liu ◽  
...  

With the advent of the big data era, data release has become a hot topic in the database community. Meanwhile, data privacy has also drawn the attention of users. Among the privacy protection models proposed so far, the differential privacy model is widely used because of its many advantages over other models. However, for the private release of multi-dimensional data sets, existing algorithms usually publish data with low availability, because the noise in the released data grows rapidly as the number of dimensions increases. In view of this issue, we propose algorithms based on regular and irregular marginal tables of frequent item sets to protect privacy and promote availability. The main idea is to reduce the dimensionality of the data set and to achieve differential privacy protection with Laplace noise. First, we propose a marginal table cover algorithm based on frequent items that considers the effectiveness of query cover combinations, and obtain a regular marginal table cover set of smaller size but higher data availability. Then, a differential privacy model with irregular marginal tables is proposed for application scenarios with low data availability and high cover rates. Next, we derive an approximately optimal marginal table cover algorithm to obtain a query cover set that satisfies the multi-level query policy constraint, achieving a balance between privacy protection and data availability. Finally, extensive experiments on synthetic and real databases demonstrate that the proposed method performs better than state-of-the-art methods in most cases.
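The core mechanism the abstract builds on can be sketched compactly: count a low-dimensional marginal table and add Laplace noise calibrated to epsilon to each cell. The two-attribute marginal and the epsilon value below are illustrative; the paper's cover-set selection over frequent item sets is not reproduced.

```python
# Laplace mechanism on a marginal table: under add/remove-one-record
# neighbouring, one record changes exactly one cell by 1, so the
# sensitivity is 1 and Laplace(1/epsilon) noise per cell suffices.
import numpy as np
from collections import Counter

def noisy_marginal(records, attrs, epsilon):
    """Count joint frequencies of the chosen attributes, then add
    Laplace(scale=1/epsilon) noise to every cell before release."""
    counts = Counter(tuple(rec[a] for a in attrs) for rec in records)
    rng = np.random.default_rng()
    return {cell: c + rng.laplace(scale=1.0 / epsilon)
            for cell, c in counts.items()}

data = [{"age": "30s", "zip": "210"}, {"age": "30s", "zip": "210"},
        {"age": "20s", "zip": "310"}, {"age": "40s", "zip": "210"}]
print(noisy_marginal(data, attrs=("age", "zip"), epsilon=0.5))
```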


2021 ◽  
Author(s):  
Hans Ressl ◽  
Helfried Scheifinger ◽  
Thomas Hübner ◽  
Anita Paul ◽  
Markus Ungersböck

"Phenology – the timing of seasonal activities of animals and plants – is perhaps the simplest process in which to track changes in the ecology of species in response to climate change" (IPCC 2007).

PEP725, the Pan-European Phenological Database, is a European research infrastructure to promote and facilitate phenological research. Its main objective is to build up and maintain a European-wide phenological database with open, unrestricted data access for science, research, and education. So far, 20 European meteorological services and 6 partners from different phenological network operators have joined PEP725.

The PEP725 phenological database (www.pep725.eu) now offers more than 12 million phenological observations, all of them classified according to the BBCH scale. The first data sets in PEP725 date back to 1868; however, only a few observations are available before 1950. Once the PEP725 data policy has been accepted and registration completed, data download is quick and easy and can be filtered by various criteria, e.g., a specific plant or all data from one country. The integration of new data sets from future partners is also easy to perform, thanks to the flexible structure of the PEP725 database and the classification of the observed plants via the so-called gss format (genus, species, and subspecies).

PEP725 is funded by EUMETNET, the network of European meteorological services; ZAMG, the acting host for PEP725; and the Austrian ministry of education, science and research.

The phenological data set has been growing by about 100,000 observations per year. The number of user registrations has also increased continually, amounting to 305 new users and more than 28,000 downloads in 2020. The greatest number of users is found in China, followed by Germany and the US. To date we count 78 peer-reviewed publications based on the PEP725 data set, 18 of them in 2020, with a total of 9 published in Nature and one in Science.

The database statistics demonstrate the great demand for, and potential of, the PEP725 phenological data set, which urgently needs further development, including facilitated access, gridded versions, and near-real-time products, to attract a greater range of users.


2008 ◽  
Vol 55 (3) ◽  
pp. 526-542 ◽  
Author(s):  
Hideki Goda ◽  
Eriko Sasaki ◽  
Kenji Akiyama ◽  
Akiko Maruyama-Nakashita ◽  
Kazumi Nakabayashi ◽  
...  

2021 ◽  
Vol 14 (11) ◽  
pp. 2519-2532
Author(s):  
Fatemeh Nargesian ◽  
Abolfazl Asudeh ◽  
H. V. Jagadish

Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: that it meets desired distribution requirements. Whether data is collected through an experiment or obtained from a data provider, data from any single source may not meet the desired distribution requirements, so a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of the data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation strategy with a reward function that captures the cost and the approximation of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.
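For the known-distribution, equal-cost setting, a minimal sketch of the acquisition loop might look as follows: repeatedly draw from the source most likely to yield whichever binary group is still furthest from its quota. The source proportions and quotas are invented, and this mirrors the problem setting rather than the paper's exact optimal algorithm.

```python
# Sketch of distribution-aware acquisition with two groups, known per-source
# group proportions, and equal per-draw costs. Proportions/quotas are invented.
import random

def acquire(sources, quota):
    """sources: {name: P(record belongs to group 1)}; quota: [need_g0, need_g1].
    Returns the total number of records drawn (each draw costs the same)."""
    need, cost = list(quota), 0
    while need[0] > 0 or need[1] > 0:
        target = 1 if need[1] >= need[0] else 0   # group we still lack most
        # Best source for group 1 maximizes p; for group 0 it maximizes 1 - p.
        best_src, p = max(sources.items(),
                          key=lambda kv: kv[1] if target == 1 else 1 - kv[1])
        g = 1 if random.random() < p else 0       # simulate the drawn record
        if need[g] > 0:
            need[g] -= 1
        cost += 1
    return cost

random.seed(7)
print(acquire({"A": 0.8, "B": 0.3, "C": 0.5}, quota=[50, 50]))
```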


2018 ◽  
Vol 12 (3) ◽  
pp. 100-122
Author(s):  
Benjamin Stark ◽  
Heiko Gewald ◽  
Heinrich Lautenbacher ◽  
Ulrich Haase ◽  
Siegmar Ruff

This article describes how information about an individual's personal health is among one's most sensitive and important intangible belongings. When health information is misused, serious irreversible damage can be caused, e.g., by making intimate details public or leaking them to employers, insurers, etc. Health information therefore needs to be treated with the highest degree of confidentiality. In practice, this goal proves difficult to achieve: in a hospital setting, medical staff across departments often need to access patient data without directly obvious reasons, which makes it difficult to distinguish legitimate from illegitimate access. This article provides a mechanism to classify transactions at a large university medical center into plausible and questionable data access using a real-life data set of more than 60,000 transactions. The classification mechanism works with minimal data requirements and unsupervised data sets. The results were evaluated through manual cross-checks internally and by a group of external experts. As a consequence, the hospital's data protection officer is now able to focus on analyzing questionable transactions instead of checking random samples.
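One plausible form of such an unsupervised classification rule is sketched below: an access counts as plausible when the accessing staff member's department matches a department that treated the patient within a recent window, and is queued for review otherwise. The field names and the 30-day window are assumptions, not the hospital's actual criteria.

```python
# Illustrative plausibility rule for access-log transactions; field names
# and the 30-day window are assumed, not the hospital's actual criteria.
from datetime import datetime, timedelta

def classify(access, encounters, window_days=30):
    """access: dict with staff_dept, patient_id, time.
    encounters: list of (patient_id, dept, time) treatment records."""
    for pid, dept, t in encounters:
        if (pid == access["patient_id"] and dept == access["staff_dept"]
                and abs(access["time"] - t) <= timedelta(days=window_days)):
            return "plausible"
    return "questionable"

now = datetime(2018, 5, 14, 9, 30)
encounters = [("p1", "cardiology", now - timedelta(days=3))]
print(classify({"staff_dept": "cardiology", "patient_id": "p1", "time": now}, encounters))
print(classify({"staff_dept": "billing", "patient_id": "p1", "time": now}, encounters))
```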

