MapReduce-Based Dynamic Partition Join with Shannon Entropy for Data Skewness

2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Donghua Chen ◽  
Runtong Zhang

Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies called dynamic partition join (DPJ) is proposed. DPJ uses the changes of entropy in the partitions of data sets during the Map and Reduce stages to revise the logical partitions, changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can capture the entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.
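As a rough illustration of how Shannon entropy can quantify how evenly join keys spread over reduce partitions, the following Python sketch computes the entropy of partition sizes. It is not the DPJ algorithm from the paper; the function names, the modulo partitioner, and the integer keys are all assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def partition_entropy(keys, num_partitions):
    """Entropy of the record distribution over reduce partitions.

    Hypothetical partitioner: integer key modulo the number of partitions,
    loosely mimicking a default hash partitioner.
    """
    sizes = Counter(k % num_partitions for k in keys)
    return shannon_entropy(sizes.values())


# Eight distinct keys spread evenly over four partitions.
uniform = partition_entropy(range(8), 4)

# Seven records share key 0 and one record has key 1: a skewed input.
skewed = partition_entropy([0] * 7 + [1], 4)
```

With four partitions the entropy is at most 2 bits, reached when every partition receives the same number of records; the skewed input scores far lower because almost all records land in one partition.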

2021 ◽  
Author(s):  
Panagiotis Bouros ◽  
Nikos Mamoulis ◽  
Dimitrios Tsitsigkos ◽  
Manolis Terrovitis

The interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk, and so their focus is to minimize the I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed which greatly outperforms previous work. However, this approach relies on a complex data structure and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive with or even faster than the state-of-the-art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in parallel. Then, we address the drawbacks of the previously proposed hash-based partitioning and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition-joins into mini-joins to be scheduled in the available CPU threads and propose an adaptive domain partitioning, aiming at load balancing. We also investigate how the partitioning phase can benefit from modern parallel hardware. Our thorough experimental analysis demonstrates the advantage of our novel partitioning-based approach for parallel computation.
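For readers unfamiliar with forward scan, a minimal Python sketch of the basic single-threaded variant is given below. It assumes closed intervals represented as `(start, end)` tuples and omits the four optimizations and all of the parallelization the abstract mentions; the function name and data layout are illustrative choices, not taken from the article.

```python
def forward_scan_join(R, S):
    """Return all pairs (r, s) of overlapping closed intervals.

    Both inputs are sorted by start point. The interval with the smaller
    start is then matched against the other input by scanning forward
    until a start point exceeds its end point.
    """
    R = sorted(R)
    S = sorted(S)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] <= S[j][0]:
            r = R[i]
            k = j
            # Every S[k] with k >= j starts at or after r starts, so the
            # two overlap exactly when S[k] starts no later than r ends.
            while k < len(S) and S[k][0] <= r[1]:
                out.append((r, S[k]))
                k += 1
            i += 1
        else:
            s = S[j]
            k = i
            while k < len(R) and R[k][0] <= s[1]:
                out.append((R[k], s))
                k += 1
            j += 1
    return out


R = [(1, 5), (6, 8)]
S = [(4, 7), (9, 10)]
pairs = forward_scan_join(R, S)
```

Each qualifying pair is reported once, by whichever interval starts first, so this basic form needs no deduplication step.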


2021 ◽  
Vol 11 (4) ◽  
pp. 249
Author(s):  
Irene Dogliotti ◽  
Simone Ragaini ◽  
Francesco Vassallo ◽  
Elia Boccellato ◽  
Gabriele De Luca ◽  
...  

Background: Bendamustine is a cytotoxic alkylating drug with a broad range of indications as a single agent or in combination therapy in lymphoid neoplasia patients. However, its tolerability in elderly patients is still debated. Methods: An observational, retrospective study was carried out; patients with chronic lymphocytic leukemia (CLL) or lymphoma, aged ≥ 65 years, treated with bendamustine-based regimens in first or subsequent lines between 2010 and 2020 were considered eligible. Results: Overall, 179 patients aged ≥ 65 years were enrolled, 53% between 71 and 79 years old. Cumulative Illness Rating Scale (CIRS) comorbidity score was ≥ 6 in 54% of patients. Overall survival (OS) at 12 months was 95% (95% confidence interval [CI]: 90–97%); after a median follow-up of 50 months, median OS was 84 months. The overall response rate was 87%, with 56% complete responses; the median time to progression (TTP) was 61 months. The baseline factors affecting OS by multivariable analysis were sex, histological diagnosis, renal function, and planned bendamustine dose, while only type of lymphoma and bendamustine dose impacted TTP. Main adverse events were neutropenia (grade ≥ 3: 43%) and infections (any grade: 36%), with 17% of patients requiring hospital admission. Conclusions: The responses to bendamustine, as well as survival, are relevant even in advanced age patients, with a manageable incidence of acute toxicity.


Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Dealing with the sheer size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop novel mathematical understanding for this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.


2016 ◽  
Vol 2016 ◽  
pp. 1-18 ◽  
Author(s):  
Mustafa Yuksel ◽  
Suat Gonul ◽  
Gokce Banu Laleci Erturkmen ◽  
Ali Anil Sinaci ◽  
Paolo Invernizzi ◽  
...  

Depending mostly on voluntarily sent spontaneous reports, pharmacovigilance studies are hampered by low quantity and quality of patient data. Our objective is to improve postmarket safety studies by enabling safety analysts to seamlessly access a wide range of EHR sources for collecting deidentified medical data sets of selected patient populations and tracing the reported incidents back to original EHRs. We have developed an ontological framework where EHR sources and target clinical research systems can continue using their own local data models, interfaces, and terminology systems, while structural and semantic interoperability are handled through rule-based reasoning on formal representations of different models and terminology systems maintained in the SALUS Semantic Resource Set. The SALUS Common Information Model at the core of this set acts as the common mediator. We demonstrate the capabilities of our framework through one of the SALUS safety analysis tools, namely, the Case Series Characterization Tool, which has been deployed on top of the regional EHR Data Warehouse of the Lombardy Region, containing about 1 billion records from 16 million patients, and validated by several pharmacovigilance researchers with real-life cases. The results confirm significant improvements in signal detection and evaluation compared to traditional methods with the missing background information.


2014 ◽  
Vol 43 (2) ◽  
pp. 355-388 ◽  
Author(s):  
Xixian Han ◽  
Jianzhong Li ◽  
Hong Gao ◽  
Chengyu Yang

2007 ◽  
Vol 41 (9) ◽  
pp. 1240-1265 ◽  
Author(s):  
Benjamin L. Read

Theories of civil society set high expectations for grassroots associations, claiming that they school citizens in democracy and constrain powerful institutions. But when do real-life organizations actually live up to this billing? Homeowner organizations in the United States and elsewhere have sparked debate among political scientists, criticized by some as nonparticipatory and harmful to the overall polity and defended by others as benign manifestations of local self-governance. With this as a backdrop, China's emerging homeowner groups are used as a testing ground for exploring variation in three criteria of performance: self-organization, participation, and the exercising of power. Comparisons are drawn cross-nationally, among 23 cases in four Chinese cities, and over time within neighborhoods. The article puts forward several factors affecting the properties of grassroots groups, highlighting the role of conflict, the political-legal environment, and collective action problems in shaping the way they engage their members and take political action.


2018 ◽  
Vol 8 (10) ◽  
pp. 1730 ◽  
Author(s):  
Md. Safiuddin ◽  
A. Kaish ◽  
Chin-Ong Woon ◽  
Sudharshan Raman

Cracking is a common problem in concrete structures in real-life service conditions. In fact, crack-free concrete structures are very rare in the real world. Concrete can undergo early-age cracking depending on the mix composition, exposure environment, hydration rate, and curing conditions. Understanding the causes and consequences of cracking thoroughly is essential for selecting proper measures to resolve the early-age cracking problem in concrete. This paper will help to identify the major causes and consequences of early-age cracking in concrete. It will also be useful for adopting effective remedial measures to reduce or eliminate the early-age cracking problem in concrete. Different types of early-age crack, the factors affecting the initiation and growth of early-age cracks, the causes of early-age cracking, and the modeling of early-age cracking are discussed in this paper. A number of examples of various early-age cracking problems of concrete found in different structural elements are also shown. Above all, some recommendations are given for minimizing early-age cracking in concrete. It is hoped that the information conveyed in this paper will help to improve the service life of concrete structures. Concrete researchers and practitioners may benefit from the contents of this paper.


Author(s):  
Barinaadaa John Nwikpe ◽  
Isaac Didi Essi

A new two-parameter continuous distribution called the Two-Parameter Nwikpe (TPAN) distribution is derived in this paper. The new distribution is a mixture of gamma and exponential distributions. A few statistical properties of the new probability distribution have been derived. The shape of its density for different values of the parameters has also been established. The first four crude moments and the second and third moments about the mean of the new distribution were derived using the moment generating function. Other statistical properties derived include the distribution of order statistics, the coefficient of variation, and the coefficient of skewness. The parameters of the new distribution were estimated using the maximum likelihood method. The flexibility of the Two-Parameter Nwikpe (TPAN) distribution was shown by fitting the distribution to three real-life data sets. The goodness of fit shows that the new distribution outperforms the one-parameter exponential, Shanker, and Amarendra distributions for the data sets used in this study.

