Scholarly Journals: Effective Summary for Massive Data Set

2015 ◽  
Vol 05 (04) ◽  
pp. 1046-1056
Author(s):  
Radhika A. ◽  
Michael Arock

2021 ◽  
Author(s):  
Ismael Hernández-González ◽  
Valeria Mateo-Estrada ◽  
Santiago Castillo-Ramírez

Abstract Antimicrobial resistance (AR) is a major global threat to public health. Understanding the population dynamics of AR is critical to restraining and controlling this issue. However, no study has provided a global picture of the resistome of Acinetobacter baumannii, a very important nosocomial pathogen. Here we analyze more than 1450 genomes (covering more than 40 countries and more than four decades) to infer the global population dynamics of the resistome of this species. We show that gene flow and horizontal transfer have driven the dissemination of AR genes in A. baumannii. We found considerable variation in AR gene content across lineages. Although the individual AR gene histories have been affected by recombination, the AR gene content has been shaped by the phylogeny. Furthermore, many AR genes have been transferred to other well-known pathogens, such as Pseudomonas aeruginosa or Klebsiella pneumoniae. Finally, despite using this massive data set, we were not able to sample the whole diversity of AR genes, which suggests that this species has an open resistome. Our results highlight the high risk of AR gene mobilization between important pathogens. From a broader perspective, this study provides a framework for an emerging, resistome-centric perspective on the genome epidemiology (and surveillance) of bacterial pathogens.


2001 ◽  
Vol 9 ◽  
pp. 33 ◽  
Author(s):  
Algirdas Zabulionis

From 1991 to 1997, the International Association for the Evaluation of Educational Achievement (IEA) undertook the Third International Mathematics and Science Study (TIMSS), in which data about the mathematics and science achievement of thirteen-year-old students in more than 40 countries were collected. These data provided the opportunity to search for patterns in students' answers to the test items: which groups of items were relatively more difficult (or easier) for students from a particular country (or group of countries). Using this massive data set, an attempt was made to measure the similarities among country profiles of how students responded to the test items.
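The abstract does not state which similarity measure was used, so the following is only a hedged sketch of one natural choice: build a per-country item-difficulty profile (the fraction of students answering each item correctly) and compare profiles with Pearson correlation. The country labels and profile values are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical difficulty profiles: fraction of students answering each of
# five items correctly, per (made-up) country.
profiles = {
    "A": [0.90, 0.80, 0.50, 0.40, 0.20],
    "B": [0.85, 0.75, 0.55, 0.35, 0.25],  # similar ordering of item difficulty
    "C": [0.30, 0.50, 0.90, 0.20, 0.80],  # very different profile
}

sim_ab = pearson(profiles["A"], profiles["B"])  # close to 1: similar profiles
sim_ac = pearson(profiles["A"], profiles["C"])  # dissimilar profiles
```

Correlating difficulty profiles rather than raw mean scores isolates the *pattern* of which items a country finds hard, independently of its overall achievement level, which matches the question the abstract poses.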


2014 ◽  
Vol 962-965 ◽  
pp. 2712-2715
Author(s):  
Wen Chuan Yang ◽  
Zhi Dong Shang ◽  
Zhi Cheng Zhang

Traditional text classification algorithms have a vital impact on information filtering. However, their performance is limited to a large extent on massive data sets. This paper proposes an approach using a MapReduce-based Rocchio relevance feedback algorithm, which recasts the traditional Rocchio algorithm in the MapReduce paradigm, to solve the problem of massive information filtering. Experiments on a Hadoop cluster showed an effective improvement in performance with the new method.
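The abstract gives no implementation details, so here is only a minimal sketch of how Rocchio centroid construction decomposes into map and reduce steps: map emits (class, document-vector) pairs, reduce averages each class's vectors into a centroid. The toy corpus and class names are hypothetical, and the in-process generators stand in for an actual Hadoop job.

```python
from collections import defaultdict

def map_phase(labeled_docs):
    """Map: emit one (class_label, term_vector) pair per training document."""
    for label, vector in labeled_docs:
        yield label, vector

def reduce_phase(pairs, dims):
    """Reduce: average the vectors of each class into its Rocchio centroid."""
    sums = defaultdict(lambda: [0.0] * dims)
    counts = defaultdict(int)
    for label, vec in pairs:
        counts[label] += 1
        for i, v in enumerate(vec):
            sums[label][i] += v
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}

def classify(vector, centroids):
    """Assign the class whose centroid has the largest dot product."""
    return max(centroids,
               key=lambda c: sum(a * b for a, b in zip(vector, centroids[c])))

# Hypothetical toy corpus: 3-term count vectors labeled by class.
train = [("spam", [1, 0, 1]), ("spam", [1, 1, 1]), ("ham", [0, 1, 0])]
centroids = reduce_phase(map_phase(train), dims=3)
label = classify([1, 0, 1], centroids)
```

Because the reduce step only sums and counts vectors, it is associative and commutative, which is exactly what lets Hadoop shard the centroid computation across a cluster.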


2013 ◽  
Vol 397-400 ◽  
pp. 2464-2468
Author(s):  
Li Juan Zhou ◽  
Zhe Xiao

To solve the problem of determining attribute weights when detecting approximately duplicate records, we put forward a method based on fuzzy comprehensive evaluation to obtain attribute weights in a data set. We first analyze the composition factors of each attribute. Then we evaluate their rank. Finally, we determine the attribute weights using the fuzzy comprehensive evaluation method, on the basis of which the approximately duplicate records are detected. Theoretical analysis and experimental results show that the method can objectively determine all attribute weights and effectively detect approximately duplicate records in a massive data set.
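The abstract names fuzzy comprehensive evaluation but not its exact factors or composition operator, so the following is a hedged sketch of the common weighted-average model B = A . R, with hypothetical factor weights and membership grades for a single attribute.

```python
def fuzzy_eval(factor_weights, membership):
    """Weighted-average composition: B_j = sum_i A_i * R_ij."""
    n_grades = len(membership[0])
    return [sum(a * row[j] for a, row in zip(factor_weights, membership))
            for j in range(n_grades)]

def defuzzify(b, grade_scores):
    """Collapse the grade vector B into a single raw attribute weight."""
    return sum(bj * s for bj, s in zip(b, grade_scores)) / sum(b)

# Hypothetical: one attribute judged on three composition factors over the
# grades (high / medium / low importance).
A = [0.5, 0.3, 0.2]                # factor weights, summing to 1
R = [[0.7, 0.2, 0.1],              # membership of factor 1 in each grade
     [0.4, 0.4, 0.2],              # factor 2
     [0.1, 0.3, 0.6]]              # factor 3
B = fuzzy_eval(A, R)               # grade vector for this attribute
score = defuzzify(B, [1.0, 0.6, 0.2])
```

Repeating this per attribute and normalizing the resulting scores would yield the weight vector used when scoring candidate duplicate record pairs.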


Author(s):  
Yang Yang ◽  
Tiezhu Li ◽  
Tao Zhang ◽  
Wanyu Yang

In recent years, a growing number of cities in China have successively rolled out bicycle-sharing systems to facilitate bicycle use, including not only metropolises but also some underdeveloped cities with populations of less than 1 million. One of those underdeveloped cities, Xuchang, launched its bicycle-sharing system in 2014. This service provides a convenient way for members to cycle for some of their short trips. Interest in the bicycle-sharing systems of metropolises is growing rapidly; however, studies on underdeveloped cities are still limited. This study investigated the factors influencing the adoption of a bicycle-sharing system in Xuchang by analyzing massive smart card data from July 2014 to mid-April 2015 and 500 intercept survey questionnaires from April 2015. Different questions were prepared for members and nonmembers in the questionnaires, and the statistical results show the characteristics of users of the Xuchang bicycle-sharing system, including demographic characteristics, travel habits, and degree of satisfaction. Moreover, the space–time distribution characteristics of the Xuchang bicycle-sharing system were analyzed by dividing the massive data set into three groups: weekdays, weekends, and holidays. Results showed that, compared with the clearly defined role of "resolving the last-kilometer problem" in a metropolis, bicycle-sharing in underdeveloped cities acts as an alternative mode of transportation rather than a transfer mode. Results also showed that bicycle-sharing systems gained more popularity in underdeveloped cities than in metropolises because of the smaller extent of egression, residents' travel habits, the traffic environment, and so on.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Giovanni Bonaccorsi ◽  
Francesco Pierri ◽  
Francesco Scotti ◽  
Andrea Flori ◽  
Francesco Manaresi ◽  
...  

Abstract Lockdowns implemented to address the COVID-19 pandemic have disrupted human mobility flows around the globe to an unprecedented extent and with economic consequences which are unevenly distributed across territories, firms and individuals. Here we study socioeconomic determinants of mobility disruption during both the lockdown and the recovery phases in Italy. For this purpose, we analyze a massive data set on Italian mobility from February to October 2020 and we combine it with detailed data on pre-existing local socioeconomic features of Italian administrative units. Using a set of unsupervised and supervised learning techniques, we reliably show that the least and the most affected areas persistently belong to two different clusters. Notably, the former cluster features significantly higher income per capita and lower income inequality than the latter. This distinction persists once the lockdown is lifted. The least affected areas display a swift (V-shaped) recovery in mobility patterns, while poorer, most affected areas experience a much slower (U-shaped) recovery: as of October 2020, their mobility was still significantly lower than pre-lockdown levels. These results are then detailed and confirmed with a quantile regression analysis. Our findings thus show that economic segregation strengthened during the pandemic.


Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Abstract Dealing with the sheer size and complexity of today's massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.
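The paper's encoded-optimization scheme is more general than anything reproducible from the abstract alone, so here is only a sketch of the simplest way redundancy defeats stragglers: replicate each data partition on two workers, so any single straggler can be ignored while the master still recovers the full gradient. The assignment scheme, toy least-squares objective, and data are all illustrative assumptions.

```python
def assign_partitions(n_workers):
    """Worker i holds partitions i and i+1 (mod n): every partition lives on
    two workers, so one straggler per iteration can be tolerated."""
    return {i: [i, (i + 1) % n_workers] for i in range(n_workers)}

def partial_gradients(data_partitions, model):
    """Toy 1-D least squares: per-partition gradient of 0.5*(x*model - y)^2."""
    return {p: sum(x * (x * model - y) for x, y in part)
            for p, part in enumerate(data_partitions)}

def aggregate(assignment, grads, responded):
    """Combine per-partition gradients using only the responding workers."""
    total, seen = 0.0, set()
    for w in responded:
        for p in assignment[w]:
            if p not in seen:
                seen.add(p)
                total += grads[p]
    assert len(seen) == len(grads), "too many stragglers this iteration"
    return total

data = [[(1.0, 2.0)], [(2.0, 2.0)], [(1.0, 0.0)]]  # 3 tiny partitions
grads = partial_gradients(data, model=1.0)
assignment = assign_partitions(3)
# Worker 1 straggles; workers 0 and 2 together still cover all partitions.
g = aggregate(assignment, grads, responded=[0, 2])
full = sum(grads.values())  # gradient with no stragglers, for comparison
```

This replication scheme doubles the computational load per worker, which is the kind of redundancy-versus-straggler-toleration trade-off the abstract says the paper characterizes in general.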


2017 ◽  
Vol 7 (2) ◽  
pp. 251-275
Author(s):  
Edgar Dobriban

Abstract Researchers in data-rich disciplines—think of computational genomics and observational cosmology—often wish to mine large bodies of $P$-values looking for significant effects, while controlling the false discovery rate or family-wise error rate. Increasingly, researchers also wish to prioritize certain hypotheses, for example, those thought to have larger effect sizes, by upweighting, and to impose constraints on the underlying mining, such as monotonicity along a certain sequence. We introduce Princessp, a principled method for performing weighted multiple testing by constrained convex optimization. Our method elegantly allows one to prioritize certain hypotheses through upweighting and to discount others through downweighting, while constraining the underlying weights involved in the mining process. When the $P$-values derive from monotone likelihood ratio families such as the Gaussian means model, the new method allows exact solution of an important optimal weighting problem previously thought to be non-convex and computationally infeasible. Our method scales to massive data set sizes. We illustrate the applications of Princessp on a series of standard genomics data sets and offer comparisons with several previous ‘standard’ methods. Princessp offers both ease of operation and the ability to scale to extremely large problem sizes. The method is available as open-source software from github.com/dobriban/pvalue_weighting_matlab (accessed 11 October 2017).
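Princessp chooses its weights by constrained convex optimization, which is not reproduced here; the following is only a hedged sketch of weighted multiple testing in its simplest form, weighted Bonferroni, showing how upweighting a trusted hypothesis changes which rejections are made. The p-values and weights are hypothetical.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Reject H_i when p_i <= alpha * w_i / m; with mean(w) = 1 this
    controls the family-wise error rate at level alpha."""
    m = len(p_values)
    assert abs(sum(weights) / m - 1.0) < 1e-9, "weights must average to 1"
    return [p <= alpha * w / m for p, w in zip(p_values, weights)]

p = [0.004, 0.020, 0.900]
w_flat = [1.0, 1.0, 1.0]    # unweighted Bonferroni, threshold 0.05/3
w_prior = [0.3, 2.4, 0.3]   # upweight the hypothesis we trust more

flat = weighted_bonferroni(p, w_flat)
prior = weighted_bonferroni(p, w_prior)
```

With flat weights only the first hypothesis is rejected; upweighting the second raises its threshold enough to reject it too, at the cost of stricter thresholds elsewhere, which is the prioritization trade-off the abstract describes.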

