MapReduce-Based Dynamic Partition Join with Shannon Entropy for Data Skewness

Scientific Programming ◽

10.1155/2021/1602767 ◽

2021 ◽

Vol 2021 ◽

pp. 1-15

Author(s):

Donghua Chen ◽

Runtong Zhang

Keyword(s):

Shannon Entropy ◽

Real Life ◽

Massive Data ◽

Data Sets ◽

Factors Affecting ◽

Join Algorithm ◽

Proper Design ◽

Lower Entropy ◽

Join Algorithms ◽

Measure Entropy

Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies called dynamic partition join (DPJ) is proposed. Leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can measure entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.
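
As a rough illustration of the entropy measure mentioned in the abstract above, the following sketch computes the Shannon entropy of how records are spread over reduce partitions. It is not the DPJ algorithm itself. The function names, the use of integer join keys, and the simple modulo partitioner are assumptions made for this example only.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy += p * math.log2(1 / p)
    return entropy


def partition_sizes(keys, num_partitions):
    """Count records per partition under a hypothetical modulo partitioner.

    Assumes integer join keys; a real MapReduce job would normally hash
    arbitrary keys instead.
    """
    sizes = [0] * num_partitions
    for key in keys:
        sizes[key % num_partitions] += 1
    return sizes
```

Under these assumptions, evenly spread keys give the maximum entropy of log2(num_partitions) bits, while a single hot key that sends every record to one partition gives an entropy of zero, which is one simple way to quantify skew.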

In-Memory Interval Joins

The VLDB Journal ◽

10.1007/s00778-020-00639-0 ◽

2021 ◽

Author(s):

Panagiotis Bouros ◽

Nikos Mamoulis ◽

Dimitrios Tsitsigkos ◽

Manolis Terrovitis

Keyword(s):

Parallel Computation ◽

State Of The Art ◽

Complex Data ◽

Plane Sweep ◽

Join Algorithm ◽

Sweep Algorithm ◽

Join Algorithms ◽

Domain Partitioning ◽

Complex Data Structure ◽

Independent Tasks

The interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk, and so their focus is on minimizing I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed which greatly outperforms previous work. However, this approach relies on a complex data structure and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive with or even faster than the state of the art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in parallel. Then, we address the drawbacks of the previously proposed hash-based partitioning and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition-joins into mini-joins to be scheduled in the available CPU threads and propose an adaptive domain partitioning, aiming at load balancing. We also investigate how the partitioning phase can benefit from modern parallel hardware. Our thorough experimental analysis demonstrates the advantage of our novel partitioning-based approach for parallel computation.
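
The forward scan idea named in the abstract above can be sketched as follows. This is a minimal, unoptimized version: the abstract does not describe the four proposed optimizations, so none are included. Intervals are assumed to be closed `(start, end)` tuples, and the function name is chosen for this example.

```python
def forward_scan_join(r_intervals, s_intervals):
    """Return all overlapping pairs (r, s) of closed intervals.

    Both inputs are sorted by start point. At each step the interval with
    the smaller start is taken, and the other input is scanned forward for
    as long as its intervals start no later than the current interval ends.
    """
    R = sorted(r_intervals)
    S = sorted(s_intervals)
    result = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] <= S[j][0]:
            r = R[i]
            k = j
            # Every S[k] here starts at or after r starts, so it overlaps r
            # exactly when it starts no later than r ends.
            while k < len(S) and S[k][0] <= r[1]:
                result.append((r, S[k]))
                k += 1
            i += 1
        else:
            s = S[j]
            k = i
            # Symmetric case: scan R forward from the current position.
            while k < len(R) and R[k][0] <= s[1]:
                result.append((R[k], s))
                k += 1
            j += 1
    return result
```

Each overlapping pair is reported once, when whichever of its two intervals starts first is processed, so no deduplication step is needed in this single-threaded form.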

Real Life Use of Bendamustine in Elderly Patients with Lymphoid Neoplasia

Journal of Personalized Medicine ◽

10.3390/jpm11040249 ◽

2021 ◽

Vol 11 (4) ◽

pp. 249

Author(s):

Irene Dogliotti ◽

Simone Ragaini ◽

Francesco Vassallo ◽

Elia Boccellato ◽

Gabriele De Luca ◽

...

Keyword(s):

Elderly Patients ◽

Rating Scale ◽

Single Agent ◽

Multivariable Analysis ◽

Real Life ◽

Lymphocytic Leukemia ◽

Factors Affecting ◽

Lymphoid Neoplasia ◽

Baseline Factors ◽

Grade 3

Background: Bendamustine is a cytotoxic alkylating drug with a broad range of indications as a single agent or in combination therapy in lymphoid neoplasia patients. However, its tolerability in elderly patients is still debated. Methods: An observational, retrospective study was carried out; patients with chronic lymphocytic leukemia (CLL) or lymphoma, aged ≥ 65 years, treated with bendamustine-based regimens in first or subsequent lines between 2010 and 2020 were considered eligible. Results: Overall, 179 patients aged ≥ 65 years were enrolled, 53% between 71 and 79 years old. Cumulative Illness Rating Scale (CIRS) comorbidity score was ≥ 6 in 54% of patients. Overall survival (OS) at 12 months was 95% (95% confidence interval [CI]: 90–97%); after a median follow-up of 50 months, median OS was 84 months. The overall response rate was 87%, with 56% complete responses; the median time to progression (TTP) was 61 months. The baseline factors affecting OS by multivariable analysis were sex, histological diagnosis, renal function, and planned bendamustine dose, while only type of lymphoma and bendamustine dose affected TTP. Main adverse events were neutropenia (grade ≥ 3: 43%) and infections (any grade: 36%), with 17% of patients requiring hospital admission. Conclusions: The responses to bendamustine, as well as survival, are relevant even in advanced-age patients, with a manageable incidence of acute toxicity.

Fundamental resource trade-offs for encoded distributed optimization

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaaa026 ◽

2020 ◽

Author(s):

A Salman Avestimehr ◽

Seyed Mohammadreza Mousavi Kalan ◽

Mahdi Soltanolkotabi

Keyword(s):

Computational Time ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Computational Framework ◽

Data Set ◽

Trade Offs ◽

Major Bottleneck ◽

Computing Environments ◽

Analyze Data

Dealing with the sheer size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes, a.k.a. stragglers, can significantly slow down computation as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop novel mathematical understanding for this framework demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.
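
The straggler effect described in the abstract above can be made concrete with a toy calculation. The sketch below assumes, for illustration only, that redundancy lets an iteration finish once the fastest `wait_for` workers have reported. The abstract does not spell out this mechanism, and the function and parameter names are hypothetical.

```python
def completion_time(worker_times, wait_for=None):
    """Time at which an iteration finishes.

    worker_times: per-worker finishing times for one iteration.
    wait_for: number of workers whose results are required; None means
        every worker must finish, so the slowest one dictates the time.
    """
    k = len(worker_times) if wait_for is None else wait_for
    if not 1 <= k <= len(worker_times):
        raise ValueError("wait_for must be between 1 and the number of workers")
    # The k-th smallest finishing time is when the k-th result arrives.
    return sorted(worker_times)[k - 1]
```

For worker times of 1.0, 1.2, 0.9 and 5.0 seconds, waiting for all four workers takes 5.0 seconds, while waiting for any three takes 1.2 seconds. The price is the extra computational load from redundancy, which is one side of the trade-off the abstract refers to.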

An Interoperability Platform Enabling Reuse of Electronic Health Records for Signal Verification Studies

BioMed Research International ◽

10.1155/2016/6741418 ◽

2016 ◽

Vol 2016 ◽

pp. 1-18 ◽

Cited By ~ 5

Author(s):

Mustafa Yuksel ◽

Suat Gonul ◽

Gokce Banu Laleci Erturkmen ◽

Ali Anil Sinaci ◽

Paolo Invernizzi ◽

...

Keyword(s):

Real Life ◽

Case Series ◽

Background Information ◽

Data Sets ◽

Local Data ◽

Lombardy Region ◽

Common Information ◽

Spontaneous Reports ◽

Wide Range ◽

Common Information Model

Depending mostly on voluntarily sent spontaneous reports, pharmacovigilance studies are hampered by low quantity and quality of patient data. Our objective is to improve postmarket safety studies by enabling safety analysts to seamlessly access a wide range of EHR sources for collecting deidentified medical data sets of selected patient populations and tracing the reported incidents back to original EHRs. We have developed an ontological framework where EHR sources and target clinical research systems can continue using their own local data models, interfaces, and terminology systems, while structural interoperability and semantic interoperability are handled through rule-based reasoning on formal representations of different models and terminology systems maintained in the SALUS Semantic Resource Set. The SALUS Common Information Model at the core of this set acts as the common mediator. We demonstrate the capabilities of our framework through one of the SALUS safety analysis tools, namely, the Case Series Characterization Tool, which has been deployed on top of the regional EHR Data Warehouse of the Lombardy Region containing about 1 billion records from 16 million patients and validated by several pharmacovigilance researchers with real-life cases. The results confirm significant improvements in signal detection and evaluation compared to traditional methods with the missing background information.

A methodology for supporting collaborative exploratory analysis of massive data sets in tele-immersive environments

Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469) ◽

10.1109/hpdc.1999.805283 ◽

2003 ◽

Cited By ~ 8

Author(s):

J. Leigh ◽

A.E. Johnson ◽

T.A. DeFanti ◽

S. Bailey ◽

R. Grossman

Keyword(s):

Exploratory Analysis ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Immersive Environments

SEPT: an efficient skyline join algorithm on massive data

Knowledge and Information Systems ◽

10.1007/s10115-014-0734-2 ◽

2014 ◽

Vol 43 (2) ◽

pp. 355-388 ◽

Cited By ~ 2

Author(s):

Xixian Han ◽

Jianzhong Li ◽

Hong Gao ◽

Chengyu Yang

Keyword(s):

Massive Data ◽

Join Algorithm

Assessing Variation in Civil Society Organizations

Comparative Political Studies ◽

10.1177/0010414007302340 ◽

2007 ◽

Vol 41 (9) ◽

pp. 1240-1265 ◽

Cited By ~ 56

Author(s):

Benjamin L. Read

Keyword(s):

Civil Society ◽

Political Action ◽

Real Life ◽

The United States ◽

Civil Society Organizations ◽

High Expectations ◽

Factors Affecting ◽

Grassroots Groups ◽

Collective Action Problems ◽

Power Comparisons

Theories of civil society set high expectations for grassroots associations, claiming that they school citizens in democracy and constrain powerful institutions. But when do real-life organizations actually live up to this billing? Homeowner organizations in the United States and elsewhere have sparked debate among political scientists, criticized by some as nonparticipatory and harmful to the overall polity and defended by others as benign manifestations of local self-governance. With this as a backdrop, China's emerging homeowner groups are used as a testing ground for exploring variation in three criteria of performance: self-organization, participation, and the exercising of power. Comparisons are drawn cross-nationally, among 23 cases in four Chinese cities, and over time within neighborhoods. The article puts forward several factors affecting the properties of grassroots groups, highlighting the role of conflict, the political-legal environment, and collective action problems in shaping the way they engage their members and take political action.

Massive Data Sets Issues in Earth Observing

Massive Computing - Handbook of Massive Data Sets ◽

10.1007/978-1-4615-0005-6_29 ◽

2002 ◽

pp. 1093-1140 ◽

Cited By ~ 3

Author(s):

Ruixin Yang ◽

Menas Kafatos

Keyword(s):

Massive Data ◽

Data Sets ◽

Massive Data Sets

Early-Age Cracking in Concrete: Causes, Consequences, Remedial Measures, and Recommendations

Applied Sciences ◽

10.3390/app8101730 ◽

2018 ◽

Vol 8 (10) ◽

pp. 1730 ◽

Cited By ~ 19

Author(s):

Md. Safiuddin ◽

A. Kaish ◽

Chin-Ong Woon ◽

Sudharshan Raman

Keyword(s):

Service Life ◽

Concrete Structures ◽

Real Life ◽

Early Age ◽

Structural Elements ◽

Remedial Measures ◽

Curing Conditions ◽

Factors Affecting ◽

Different Types ◽

Service Conditions

Cracking is a common problem in concrete structures in real-life service conditions. In fact, crack-free concrete structures are very rare in the real world. Concrete can undergo early-age cracking depending on the mix composition, exposure environment, hydration rate, and curing conditions. Understanding the causes and consequences of cracking thoroughly is essential for selecting proper measures to resolve the early-age cracking problem in concrete. This paper will help to identify the major causes and consequences of early-age cracking in concrete. It will also be useful for adopting effective remedial measures to reduce or eliminate the early-age cracking problem in concrete. Different types of early-age cracks, the factors affecting the initiation and growth of early-age cracks, the causes of early-age cracking, and the modeling of early-age cracking are discussed in this paper. A number of examples of early-age cracking problems found in different structural elements are also shown. Above all, some recommendations are given for minimizing early-age cracking in concrete. It is hoped that the information conveyed in this paper will help to improve the service life of concrete structures. Concrete researchers and practitioners may benefit from the contents of this paper.

Two-Parameter Nwikpe (TPAN) Distribution with Application

Asian Journal of Probability and Statistics ◽

10.9734/ajpas/2021/v12i130279 ◽

2021 ◽

pp. 56-67

Author(s):

Barinaadaa John Nwikpe ◽

Isaac Didi Essi

Keyword(s):

Goodness Of Fit ◽

Continuous Distribution ◽

Real Life ◽

Moment Generating Function ◽

Statistical Properties ◽

Likelihood Method ◽

Data Sets ◽

Method Of Moment ◽

Two Parameter ◽

New Distribution

A new two-parameter continuous distribution called the Two-Parameter Nwikpe (TPAN) distribution is derived in this paper. The new distribution is a mixture of gamma and exponential distributions. A few statistical properties of the new probability distribution have been derived. The shape of its density for different values of the parameters has also been established. The first four crude moments and the second and third moments about the mean of the new distribution were derived using the moment generating function. Other statistical properties derived include the distribution of order statistics, the coefficient of variation, and the coefficient of skewness. The parameters of the new distribution were estimated using the maximum likelihood method. The flexibility of the Two-Parameter Nwikpe (TPAN) distribution was shown by fitting the distribution to three real-life data sets. The goodness of fit shows that the new distribution outperforms the one-parameter exponential, Shanker, and Amarendra distributions for the data sets used in this study.