Scholarly Journals: Effective Summary for Massive Data Set

2015 ◽  
Vol 05 (04) ◽  
pp. 1046-1056
Author(s):  
Radhika A. ◽  
Michael Arock

2021 ◽  
Author(s):  
Ismael Hernández-González ◽  
Valeria Mateo-Estrada ◽  
Santiago Castillo-Ramírez

Abstract Antimicrobial resistance (AR) is a major global threat to public health. Understanding the population dynamics of AR is critical to restraining and controlling this issue. However, no study has provided a global picture of the resistome of Acinetobacter baumannii, a very important nosocomial pathogen. Here we analyze more than 1450 genomes (covering more than 40 countries and more than four decades) to infer the global population dynamics of the resistome of this species. We show that gene flow and horizontal transfer have driven the dissemination of AR genes in A. baumannii. We found considerable variation in AR gene content across lineages. Although the individual AR gene histories have been affected by recombination, the AR gene content has been shaped by the phylogeny. Furthermore, many AR genes have been transferred to other well-known pathogens, such as Pseudomonas aeruginosa or Klebsiella pneumoniae. Finally, despite using this massive data set, we were not able to sample the whole diversity of AR genes, which suggests that this species has an open resistome. Our results highlight the high risk of AR gene mobilization between important pathogens. From a broader perspective, this study provides a framework for an emerging, resistome-centric perspective on the genome epidemiology (and surveillance) of bacterial pathogens.


2001 ◽  
Vol 9 ◽  
pp. 33 ◽  
Author(s):  
Algirdas Zabulionis

From 1991 to 1997, the International Association for the Evaluation of Educational Achievement (IEA) undertook the Third International Mathematics and Science Study (TIMSS), in which data about the mathematics and science achievement of thirteen-year-old students in more than 40 countries were collected. These data provided the opportunity to search for patterns in students' answers to the test items: which groups of items were relatively more difficult (or easier) for students from a particular country (or group of countries). Using this massive data set, an attempt was made to measure the similarities among country profiles of how students responded to the test items.
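The abstract does not state which similarity measure was used, so the following is only a hedged sketch of one natural choice: build a per-country item-difficulty profile (the fraction of students answering each item correctly) and compare profiles with Pearson correlation. The country labels and profile values are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical difficulty profiles: fraction of students answering each of
# five items correctly, per (made-up) country.
profiles = {
    "A": [0.90, 0.80, 0.50, 0.40, 0.20],
    "B": [0.85, 0.75, 0.55, 0.35, 0.25],  # similar ordering of item difficulty
    "C": [0.30, 0.50, 0.90, 0.20, 0.80],  # very different profile
}

sim_ab = pearson(profiles["A"], profiles["B"])  # close to 1: similar profiles
sim_ac = pearson(profiles["A"], profiles["C"])  # dissimilar profiles
```

Correlating difficulty profiles rather than raw mean scores isolates the *pattern* of which items a country finds hard, independently of its overall achievement level, which matches the question the abstract poses.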


2014 ◽  
Vol 962-965 ◽  
pp. 2712-2715
Author(s):  
Wen Chuan Yang ◽  
Zhi Dong Shang ◽  
Zhi Cheng Zhang

Traditional text classification algorithms have a vital impact on information filtering. However, their performance is limited to a large extent on massive data sets. This paper proposes an approach using a MapReduce-based Rocchio relevance feedback algorithm, which recasts the traditional Rocchio algorithm in the MapReduce paradigm, to solve the problem of massive information filtering. Experiments on a Hadoop cluster showed an effective improvement in performance with the new method.
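The abstract gives no implementation details, so here is only a minimal sketch of how Rocchio centroid construction decomposes into map and reduce steps: map emits (class, document-vector) pairs, reduce averages each class's vectors into a centroid. The toy corpus and class names are hypothetical, and the in-process generators stand in for an actual Hadoop job.

```python
from collections import defaultdict

def map_phase(labeled_docs):
    """Map: emit one (class_label, term_vector) pair per training document."""
    for label, vector in labeled_docs:
        yield label, vector

def reduce_phase(pairs, dims):
    """Reduce: average the vectors of each class into its Rocchio centroid."""
    sums = defaultdict(lambda: [0.0] * dims)
    counts = defaultdict(int)
    for label, vec in pairs:
        counts[label] += 1
        for i, v in enumerate(vec):
            sums[label][i] += v
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}

def classify(vector, centroids):
    """Assign the class whose centroid has the largest dot product."""
    return max(centroids,
               key=lambda c: sum(a * b for a, b in zip(vector, centroids[c])))

# Hypothetical toy corpus: 3-term count vectors labeled by class.
train = [("spam", [1, 0, 1]), ("spam", [1, 1, 1]), ("ham", [0, 1, 0])]
centroids = reduce_phase(map_phase(train), dims=3)
label = classify([1, 0, 1], centroids)
```

Because the reduce step only sums and counts vectors, it is associative and commutative, which is exactly what lets Hadoop shard the centroid computation across a cluster.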


2013 ◽  
Vol 397-400 ◽  
pp. 2464-2468
Author(s):  
Li Juan Zhou ◽  
Zhe Xiao

To solve the problem of determining attribute weights when detecting approximately duplicate records, we put forward a method based on fuzzy comprehensive evaluation to obtain attribute weights in a data set. We first analyze the composition factors of each attribute. Then we evaluate their rank. Finally, we determine the attribute weights using the fuzzy comprehensive evaluation method, on the basis of which the approximately duplicate records are detected. Theoretical analysis and experimental results show that the method can objectively determine all attribute weights and effectively detect approximately duplicate records in a massive data set.
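The abstract names fuzzy comprehensive evaluation but not its exact factors or composition operator, so the following is a hedged sketch of the common weighted-average model B = A . R, with hypothetical factor weights and membership grades for a single attribute.

```python
def fuzzy_eval(factor_weights, membership):
    """Weighted-average composition: B_j = sum_i A_i * R_ij."""
    n_grades = len(membership[0])
    return [sum(a * row[j] for a, row in zip(factor_weights, membership))
            for j in range(n_grades)]

def defuzzify(b, grade_scores):
    """Collapse the grade vector B into a single raw attribute weight."""
    return sum(bj * s for bj, s in zip(b, grade_scores)) / sum(b)

# Hypothetical: one attribute judged on three composition factors over the
# grades (high / medium / low importance).
A = [0.5, 0.3, 0.2]                # factor weights, summing to 1
R = [[0.7, 0.2, 0.1],              # membership of factor 1 in each grade
     [0.4, 0.4, 0.2],              # factor 2
     [0.1, 0.3, 0.6]]              # factor 3
B = fuzzy_eval(A, R)               # grade vector for this attribute
score = defuzzify(B, [1.0, 0.6, 0.2])
```

Repeating this per attribute and normalizing the resulting scores would yield the weight vector used when scoring candidate duplicate record pairs.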


Author(s):  
Yang Yang ◽  
Tiezhu Li ◽  
Tao Zhang ◽  
Wanyu Yang

In recent years, a growing number of cities in China have successively rolled out bicycle-sharing systems to facilitate bicycle use, including not only metropolises but also some underdeveloped cities with populations of less than 1 million. One of those underdeveloped cities, Xuchang, launched its bicycle-sharing system in 2014. This service provides a convenient way for members to cycle for some of their short trips. Interest in the bicycle-sharing systems of metropolises is growing rapidly; however, studies on underdeveloped cities are still limited. This study investigated the factors influencing the adoption of a bicycle-sharing system in Xuchang by analyzing massive smart card data from July 2014 to mid-April 2015 and 500 intercept survey questionnaires from April 2015. Different questions were prepared for members and nonmembers in the questionnaires, and the statistical results show the characteristics of users of the Xuchang bicycle-sharing system, including demographic characteristics, travel habits, and degree of satisfaction. Moreover, the space–time distribution characteristics of the Xuchang bicycle-sharing system were analyzed by dividing the massive data set into three groups: weekdays, weekends, and holidays. Results showed that, compared with the clearly defined role of "resolving the last-kilometer problem" in a metropolis, bicycle-sharing in underdeveloped cities acts as an alternative mode of transportation rather than a transfer mode. Results also showed that bicycle-sharing systems gained more popularity in underdeveloped cities than in metropolises because of the smaller extent of egression, residents' travel habits, the traffic environment, and so on.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Giovanni Bonaccorsi ◽  
Francesco Pierri ◽  
Francesco Scotti ◽  
Andrea Flori ◽  
Francesco Manaresi ◽  
...  

Abstract Lockdowns implemented to address the COVID-19 pandemic have disrupted human mobility flows around the globe to an unprecedented extent and with economic consequences which are unevenly distributed across territories, firms and individuals. Here we study socioeconomic determinants of mobility disruption during both the lockdown and the recovery phases in Italy. For this purpose, we analyze a massive data set on Italian mobility from February to October 2020 and we combine it with detailed data on pre-existing local socioeconomic features of Italian administrative units. Using a set of unsupervised and supervised learning techniques, we reliably show that the least and the most affected areas persistently belong to two different clusters. Notably, the former cluster features significantly higher income per capita and lower income inequality than the latter. This distinction persists once the lockdown is lifted. The least affected areas display a swift (V-shaped) recovery in mobility patterns, while poorer, most affected areas experience a much slower (U-shaped) recovery: as of October 2020, their mobility was still significantly lower than pre-lockdown levels. These results are then detailed and confirmed with a quantile regression analysis. Our findings thus show that economic segregation strengthened during the pandemic.


Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Abstract Dealing with the sheer size and complexity of today's massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.
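The paper's encoded-optimization scheme is more general than anything reproducible from the abstract alone, so here is only a sketch of the simplest way redundancy defeats stragglers: replicate each data partition on two workers, so any single straggler can be ignored while the master still recovers the full gradient. The assignment scheme, toy least-squares objective, and data are all illustrative assumptions.

```python
def assign_partitions(n_workers):
    """Worker i holds partitions i and i+1 (mod n): every partition lives on
    two workers, so one straggler per iteration can be tolerated."""
    return {i: [i, (i + 1) % n_workers] for i in range(n_workers)}

def partial_gradients(data_partitions, model):
    """Toy 1-D least squares: per-partition gradient of 0.5*(x*model - y)^2."""
    return {p: sum(x * (x * model - y) for x, y in part)
            for p, part in enumerate(data_partitions)}

def aggregate(assignment, grads, responded):
    """Combine per-partition gradients using only the responding workers."""
    total, seen = 0.0, set()
    for w in responded:
        for p in assignment[w]:
            if p not in seen:
                seen.add(p)
                total += grads[p]
    assert len(seen) == len(grads), "too many stragglers this iteration"
    return total

data = [[(1.0, 2.0)], [(2.0, 2.0)], [(1.0, 0.0)]]  # 3 tiny partitions
grads = partial_gradients(data, model=1.0)
assignment = assign_partitions(3)
# Worker 1 straggles; workers 0 and 2 together still cover all partitions.
g = aggregate(assignment, grads, responded=[0, 2])
full = sum(grads.values())  # gradient with no stragglers, for comparison
```

This replication scheme doubles the computational load per worker, which is the kind of redundancy-versus-straggler-toleration trade-off the abstract says the paper characterizes in general.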


2017 ◽  
Vol 7 (2) ◽  
pp. 251-275
Author(s):  
Edgar Dobriban

Abstract Researchers in data-rich disciplines—think of computational genomics and observational cosmology—often wish to mine large bodies of $P$-values looking for significant effects, while controlling the false discovery rate or family-wise error rate. Increasingly, researchers also wish to prioritize certain hypotheses, for example, those thought to have larger effect sizes, by upweighting, and to impose constraints on the underlying mining, such as monotonicity along a certain sequence. We introduce Princessp, a principled method for performing weighted multiple testing by constrained convex optimization. Our method elegantly allows one to prioritize certain hypotheses through upweighting and to discount others through downweighting, while constraining the underlying weights involved in the mining process. When the $P$-values derive from monotone likelihood ratio families such as the Gaussian means model, the new method allows exact solution of an important optimal weighting problem previously thought to be non-convex and computationally infeasible. Our method scales to massive data set sizes. We illustrate the applications of Princessp on a series of standard genomics data sets and offer comparisons with several previous ‘standard’ methods. Princessp offers both ease of operation and the ability to scale to extremely large problem sizes. The method is available as open-source software from github.com/dobriban/pvalue_weighting_matlab (accessed 11 October 2017).
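Princessp chooses its weights by constrained convex optimization, which is not reproduced here; the following is only a hedged sketch of weighted multiple testing in its simplest form, weighted Bonferroni, showing how upweighting a trusted hypothesis changes which rejections are made. The p-values and weights are hypothetical.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Reject H_i when p_i <= alpha * w_i / m; with mean(w) = 1 this
    controls the family-wise error rate at level alpha."""
    m = len(p_values)
    assert abs(sum(weights) / m - 1.0) < 1e-9, "weights must average to 1"
    return [p <= alpha * w / m for p, w in zip(p_values, weights)]

p = [0.004, 0.020, 0.900]
w_flat = [1.0, 1.0, 1.0]    # unweighted Bonferroni, threshold 0.05/3
w_prior = [0.3, 2.4, 0.3]   # upweight the hypothesis we trust more

flat = weighted_bonferroni(p, w_flat)
prior = weighted_bonferroni(p, w_prior)
```

With flat weights only the first hypothesis is rejected; upweighting the second raises its threshold enough to reject it too, at the cost of stricter thresholds elsewhere, which is the prioritization trade-off the abstract describes.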

