massive data set Latest Research Papers

Unprecedented Density and Persistence of Feral Honey Bees in Urban Environments of a Large SE-European City (Belgrade, Serbia)

It is assumed that wild honey bees have become largely extinct across Europe since the 1980s, following the introduction of an exotic ectoparasitic mite (Varroa) and the associated spillover of various pathogens. However, several recent studies have reported unmanaged colonies that survived the Varroa mite infestation. Here we present another case of an unmanaged, free-living population of honey bees in SE Europe, a rare case of feral bees inhabiting a large and highly populated urban area: Belgrade, the capital of Serbia. We compiled a massive data set derived from opportunistic citizen science (>1300 records) during the 2011–2017 period and investigated whether these honey bee colonies and the high incidence of swarms could be the result of a stable, self-sustaining feral population (i.e., not of a regular inflow of swarms escaping from local managed apiaries), and we discuss various explanations for its existence. We also present the possibilities and challenges associated with the detection and effective monitoring of feral/wild honey bees in urban settings, and the role of citizen science in such endeavors. Our results will underpin ongoing initiatives to better understand and support naturally selected resistance mechanisms against the Varroa mite, which should contribute to alleviating current threats and risks to global apiculture and food production security.

Socioeconomic differences and persistent segregation of Italian territories during COVID-19 pandemic

Scientific Reports ◽

10.1038/s41598-021-99548-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Giovanni Bonaccorsi ◽

Francesco Pierri ◽

Francesco Scotti ◽

Andrea Flori ◽

Francesco Manaresi ◽

...

Keyword(s):

Human Mobility ◽

Economic Consequences ◽

Massive Data ◽

Mobility Patterns ◽

Data Set ◽

Economic Segregation ◽

Massive Data Set ◽

Learning Techniques ◽

Quantile Regression Analysis ◽

Administrative Units

Lockdowns implemented to address the COVID-19 pandemic have disrupted human mobility flows around the globe to an unprecedented extent and with economic consequences which are unevenly distributed across territories, firms and individuals. Here we study socioeconomic determinants of mobility disruption during both the lockdown and the recovery phases in Italy. For this purpose, we analyze a massive data set on Italian mobility from February to October 2020 and we combine it with detailed data on pre-existing local socioeconomic features of Italian administrative units. Using a set of unsupervised and supervised learning techniques, we reliably show that the least and the most affected areas persistently belong to two different clusters. Notably, the former cluster features significantly higher income per capita and lower income inequality than the latter. This distinction persists once the lockdown is lifted. The least affected areas display a swift (V-shaped) recovery in mobility patterns, while poorer, most affected areas experience a much slower (U-shaped) recovery: as of October 2020, their mobility was still significantly lower than pre-lockdown levels. These results are then detailed and confirmed with a quantile regression analysis. Our findings show that economic segregation has, thus, strengthened during the pandemic.

Learning Time Acceleration in Support Vector Regression: A Case Study in Educational Data Mining

Stats ◽

10.3390/stats4030041 ◽

2021 ◽

Vol 4 (3) ◽

pp. 682-700

Author(s):

Jonatha Sousa Pimentel ◽

Raydonal Ospina ◽

Anderson Ara

Keyword(s):

Data Mining ◽

Support Vector Regression ◽

National Level ◽

Educational Data Mining ◽

Machine Learning Algorithms ◽

Information Support ◽

Support Vector ◽

Data Set ◽

Learning Time ◽

Massive Data Set

The development of a country involves directly investing in the education of its citizens. Learning analytics/educational data mining (LA/EDM) allows access to big observational structured/unstructured data captured from educational settings and relies mostly on machine learning algorithms to extract useful information. Support vector regression (SVR) is a supervised statistical learning approach that allows modelling and predicting the performance tendency of students, which can direct strategic plans for the development of high-quality education. In Brazil, performance can be evaluated at the national level using the average grades of a student on their National High School Exams (ENEMs) based on their socioeconomic information and school records. In this paper, we focus on increasing the computational efficiency of SVR applied to ENEM for online requisitions. The results are based on an analysis of a massive data set composed of more than five million observations. They indicate computational learning time savings of more than 90%, while providing a prediction of performance that is compatible with traditional modeling.

Infodemic Pathways: Evaluating the Role That Traditional and Social Media Play in Cross-National Information Transfer

Frontiers in Political Science ◽

10.3389/fpos.2021.648646 ◽

2021 ◽

Vol 3 ◽

Author(s):

Aengus Bridgman ◽

Eric Merkley ◽

Oleg Zhilin ◽

Peter John Loewen ◽

Taylor Owen ◽

...

Keyword(s):

Social Media ◽

Information Transfer ◽

Data Set ◽

Social Media Networks ◽

Massive Data Set ◽

Information Spread ◽

Twitter Users ◽

Novel Coronavirus ◽

National Information ◽

Cross National

The COVID-19 pandemic has occurred alongside a worldwide infodemic where unprecedented levels of misinformation have contributed to widespread misconceptions about the novel coronavirus. Conspiracy theories, poorly sourced medical advice, and information trivializing the virus have ignored national borders and spread quickly. This information spread has occurred despite generally strong preferences for domestic national media and social media networks that tend to be geographically bounded. How, then, is (mis)information crossing borders so rapidly? Using social media and survey data, we evaluate the extent to which consumption and propagation patterns of domestic and international traditional news and social media can help inform theorizing about cross-national information spread. In a detailed case study of Canada, we employ a large multi-wave survey and a massive data set of Canadian Twitter users. We show that the majority of misinformation circulating on Twitter that is shared by Canadian accounts is retweeted from U.S.-based accounts. Moreover, exposure to U.S.-based media outlets is associated with COVID-19 misperceptions and increased exposure to U.S.-based information on Twitter is associated with an increased likelihood to post misinformation. We thus theorize and empirically identify a key globalizing infodemic pathway: disregard for national origin of social media posting.

The promiscuous and highly mobile resistome of a superbug

10.1101/2021.02.03.429652 ◽

2021 ◽

Author(s):

Ismael Hernández-González ◽

Valeria Mateo-Estrada ◽

Santiago Castillo-Ramírez

Keyword(s):

Population Dynamics ◽

Gene Content ◽

Massive Data ◽

Nosocomial Pathogen ◽

Data Set ◽

Massive Data Set ◽

The Individual ◽

Global Threat ◽

And Control ◽

Global Population

Antimicrobial resistance (AR) is a major global threat to public health. Understanding the population dynamics of AR is critical to restrain and control this issue. However, no study has provided a global picture of the resistome of Acinetobacter baumannii, a very important nosocomial pathogen. Here we analyze 1450+ genomes (covering > 40 countries and > 4 decades) to infer the global population dynamics of the resistome of this species. We show that gene flow and horizontal transfer have driven the dissemination of AR genes in A. baumannii. We found considerable variation in AR gene content across lineages. Although the individual AR gene histories have been affected by recombination, the AR gene content has been shaped by the phylogeny. Furthermore, many AR genes have been transferred to other well-known pathogens, such as Pseudomonas aeruginosa or Klebsiella pneumoniae. Finally, despite using this massive data set, we were not able to sample the whole diversity of AR genes, which suggests that this species has an open resistome. Our results highlight the high mobilization risk of AR genes between important pathogens. From a broader perspective, this study gives a framework for an emerging, resistome-centric perspective on the genome epidemiology (and surveillance) of bacterial pathogens.

Adoption-Driven Data Science for Transportation Planning: Methodology, Case Study, and Lessons Learned

Sustainability ◽

10.3390/su12156001 ◽

2020 ◽

Vol 12 (15) ◽

pp. 6001 ◽

Cited By ~ 2

Author(s):

Eduardo Graells-Garrido ◽

Vanessa Peña-Araya ◽

Loreto Bravo

Keyword(s):

Transportation Planning ◽

Data Science ◽

Cost Effective ◽

Lessons Learned ◽

Data Driven ◽

Mobile Phone Data ◽

Data Set ◽

Policy Makers ◽

Massive Data Set ◽

Planning Methodology

The rising availability of digital traces provides a fertile ground for data-driven solutions to problems in cities. However, even though a massive data set analyzed with data science methods may provide a powerful and cost-effective solution to a problem, its adoption by relevant stakeholders is not guaranteed due to adoption barriers such as lack of interpretability and interoperability. In this context, this paper proposes a methodology toward bridging two disciplines, data science and transportation, to identify, understand, and solve transportation planning problems with data-driven solutions that are suitable for adoption by urban planners and policy makers. The methodology is defined by four steps where people from both disciplines go from algorithm and model definition to the development of a potentially adoptable solution with evaluated outputs. We describe how this methodology was applied to define a model to infer commuting trips with mode of transportation from mobile phone data, and we report the lessons learned during the process.

Weighted mining of massive collections of P-values by convex optimization

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iax013 ◽

2017 ◽

Vol 7 (2) ◽

pp. 251-275

Author(s):

Edgar Dobriban

Keyword(s):

Convex Optimization ◽

Multiple Testing ◽

Observational Cosmology ◽

Data Sets ◽

Data Set ◽

P Values ◽

False Discovery ◽

Massive Data Set ◽

Optimal Weighting ◽

Weighting Problem

Researchers in data-rich disciplines (think of computational genomics and observational cosmology) often wish to mine large bodies of $P$-values looking for significant effects, while controlling the false discovery rate or family-wise error rate. Increasingly, researchers also wish to prioritize certain hypotheses, for example, those thought to have larger effect sizes, by upweighting, and to impose constraints on the underlying mining, such as monotonicity along a certain sequence. We introduce Princessp, a principled method for performing weighted multiple testing by constrained convex optimization. Our method elegantly allows one to prioritize certain hypotheses through upweighting and to discount others through downweighting, while constraining the underlying weights involved in the mining process. When the $P$-values derive from monotone likelihood ratio families such as the Gaussian means model, the new method allows exact solution of an important optimal weighting problem previously thought to be non-convex and computationally infeasible. Our method scales to massive data set sizes. We illustrate the applications of Princessp on a series of standard genomics data sets and offer comparisons with several previous ‘standard’ methods. Princessp offers both ease of operation and the ability to scale to extremely large problem sizes. The method is available as open-source software from github.com/dobriban/pvalue_weighting_matlab (accessed 11 October 2017).
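As a rough illustration of what upweighting and downweighting mean in multiple testing, the sketch below implements the classical weighted Bonferroni rule, not Princessp itself: with m hypotheses and non-negative weights normalised to average 1, hypothesis i is rejected when its P-value is at most alpha * w_i / m, which keeps the family-wise error rate at alpha. The function name `weighted_bonferroni` and the example weights are assumptions made for this sketch; choosing the weights optimally is the problem the abstract addresses and is not attempted here.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Return one reject/keep decision per hypothesis (weighted Bonferroni).

    Illustrative sketch only: the weights are supplied by the caller and are
    simply rescaled to average 1; no optimisation of weights is performed.
    """
    m = len(p_values)
    if m == 0 or len(weights) != m:
        raise ValueError("p_values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # Rescale so the weights sum to m (average 1); this keeps the
    # family-wise error rate bounded by alpha.
    normalised = [w * m / total for w in weights]

    # Hypothesis i is rejected when p_i <= alpha * w_i / m.
    return [p <= alpha * w / m for p, w in zip(p_values, normalised)]
```

With equal weights this reduces to the ordinary Bonferroni correction; shifting weight toward a hypothesis raises its rejection threshold at the expense of the others.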

Understanding the Utilization Characteristics of Bicycle-Sharing Systems in Underdeveloped Cities: A Case Study in Xuchang City, China

Transportation Research Record Journal of the Transportation Research Board ◽

10.3141/2634-12 ◽

2017 ◽

Vol 2634 (1) ◽

pp. 78-85 ◽

Cited By ~ 3

Author(s):

Yang Yang ◽

Tiezhu Li ◽

Tao Zhang ◽

Wanyu Yang

Keyword(s):

Massive Data ◽

Data Set ◽

Massive Data Set ◽

Smart Card Data ◽

Intercept Survey ◽

Degree Of Satisfaction ◽

Bicycle Sharing System ◽

Survey Questionnaires

In recent years, a growing number of cities in China have successively rolled out bicycle-sharing systems to facilitate bicycle use, including not only metropolises but also some underdeveloped cities with populations of less than 1 million. One of those underdeveloped cities, Xuchang, launched its bicycle-sharing system in 2014. This service provides a convenient way for members to cycle for some of their short trips. Interest in the bicycle-sharing systems of metropolises is growing rapidly; however, studies on underdeveloped cities are still limited. This study investigated the factors influencing the adoption of a bicycle-sharing system in Xuchang by analyzing massive smart card data from July 2014 to mid-April 2015 and 500 intercept survey questionnaires collected in April 2015. The questionnaires contained different questions for members and nonmembers, and the statistical results show the characteristics of users of the Xuchang bicycle-sharing system, including demographic characteristics, travel habits, and degree of satisfaction. Moreover, the space–time distribution characteristics of the Xuchang bicycle-sharing system were analyzed by dividing a massive data set into three groups: weekdays, weekends, and holidays. Results showed that, compared with the clearly defined role of “resolve the last-kilometer problem” in a metropolis, bicycle-sharing in underdeveloped cities acts as an alternative way of transportation rather than a transfer traffic mode. Results also showed that bicycle-sharing systems gained more popularity in underdeveloped cities than in metropolises because of the smaller extent of egression, resident travel habits, the traffic environment, and so on.

EFFECTIVE SUMMARY FOR MASSIVE DATA SET

ICTACT Journal on Soft Computing ◽

10.21917/ijsc.2015.0146 ◽

2015 ◽

Vol 05 (04) ◽

pp. 1046-1056

Author(s):

Radhika A. ◽

Michael Arock

Keyword(s):

Massive Data ◽

Data Set ◽

Massive Data Set

Empirical Evidence of Hurst Exponent Estimation Wavelet Based

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.687-691.1668 ◽

2014 ◽

Vol 687-691 ◽

pp. 1668-1671

Author(s):

Bin Luo ◽

Tong Zhou Zhao ◽

De Hua Li ◽

Dun Bo Cai

Keyword(s):

Long Range ◽

Hurst Exponent ◽

Wavelet Spectrum ◽

Long Range Dependence ◽

Range Analysis ◽

Data Set ◽

Massive Data Set ◽

Rescaled Range ◽

Rescaled Range Analysis ◽

Original Time

In this paper, we study the long-range dependence of hydrological records using a high-frequency, massive data set. To detect breakpoints, we apply the Evolutionary Wavelet Spectrum (EWS), which provides a segmentation of the original time series. Rescaled range analysis (R/S) is then used to estimate the Hurst exponent, which describes the long-range dependence phenomenon. The results affirm that the hydrological records exhibit long-range dependent (LRD) behavior.
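A minimal sketch of rescaled range (R/S) estimation of the Hurst exponent follows; it is a generic textbook version, not the authors' implementation, and it omits the EWS segmentation step. The function names, the doubling window sizes and the `min_window` default are assumptions chosen for illustration. For each window size, the series is cut into non-overlapping chunks, R/S is averaged over the chunks, and the Hurst exponent is taken as the least-squares slope of log(R/S) against log(window size).

```python
import math


def rescaled_range(chunk):
    """R/S statistic of one chunk, or None if the chunk is constant."""
    n = len(chunk)
    mean = sum(chunk) / n
    cumulative = 0.0
    lowest = highest = 0.0
    for value in chunk:
        # Running sum of deviations from the chunk mean.
        cumulative += value - mean
        lowest = min(lowest, cumulative)
        highest = max(highest, cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in chunk) / n)
    return (highest - lowest) / std if std > 0 else None


def hurst_rs(series, min_window=8):
    """Estimate the Hurst exponent as the slope of log(R/S) vs log(window)."""
    total = len(series)
    log_sizes, log_rs = [], []
    window = min_window
    while window <= total // 2:
        values = []
        for start in range(0, total - window + 1, window):
            rs = rescaled_range(series[start:start + window])
            if rs is not None:
                values.append(rs)
        if values:
            log_sizes.append(math.log(window))
            log_rs.append(math.log(sum(values) / len(values)))
        window *= 2  # hypothetical choice: doubling window sizes
    if len(log_sizes) < 2:
        raise ValueError("series too short or too flat for R/S estimation")

    # Ordinary least-squares slope.
    mean_x = sum(log_sizes) / len(log_sizes)
    mean_y = sum(log_rs) / len(log_rs)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_sizes, log_rs))
    denominator = sum((x - mean_x) ** 2 for x in log_sizes)
    return numerator / denominator
```

An estimate near 0.5 is consistent with no long-range dependence, while values well above 0.5 suggest persistence; plain R/S estimates are known to be biased upward on short windows, so small differences should be read cautiously.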
