Big Data in Market Research: Why More Data Does Not Automatically Mean Better Information

2016 ◽  
Vol 8 (2) ◽  
pp. 56-63 ◽  
Author(s):  
Volker Bosch

Abstract Big data will change market research at its core in the long term, because the consumption of products and media can increasingly be logged electronically, making it measurable on a large scale. Unfortunately, big data datasets are rarely representative, even if they are huge. Smart algorithms are needed to achieve high precision and prediction quality for digital and non-representative approaches. Also, big data can only be processed with complex and therefore error-prone software, which leads to measurement errors that need to be corrected. Another challenge is posed by missing but critical variables: the amount of data can be overwhelming, yet it often lacks important information. The missing observations can only be filled in by statistical data imputation, which requires an additional data source containing the missing variables, for example a panel. Linear imputation is a statistical procedure that is anything but trivial. It is an instrument to “transport information”: the more strongly the observed data correlate with the data to be imputed, the better it works. It makes structures visible even if the depth of the data is limited.
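As a rough illustration of the idea, not the article's actual procedure, regression-based linear imputation can be sketched as follows, with a synthetic "panel" that observes both the linking variables and the target variable (all names and data are illustrative):

```python
import numpy as np

# Donor source (e.g., a panel): both the linking variables X and the
# target variable y are observed.
rng = np.random.default_rng(0)
X_panel = rng.normal(size=(500, 3))          # variables shared with the big data source
y_panel = X_panel @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=500)

# Big data source: only the linking variables are observed; y is missing.
X_big = rng.normal(size=(100_000, 3))

# Fit the linear imputation model on the panel (least squares with intercept).
A = np.column_stack([np.ones(len(X_panel)), X_panel])
coef, *_ = np.linalg.lstsq(A, y_panel, rcond=None)

# "Transport" the information: impute y for the big data records.
y_imputed = np.column_stack([np.ones(len(X_big)), X_big]) @ coef
```

The quality of the transport depends directly on how strongly X correlates with y in the donor source, which is the point the abstract makes.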

Epidemiologia ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 315-324
Author(s):  
Juan M. Banda ◽  
Ramya Tekumalla ◽  
Guanyu Wang ◽  
Jingyuan Yu ◽  
Tuo Liu ◽  
...  

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics in such a unique worldwide event for biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter generated between 1 January 2020 and 27 June 2021 at the time of writing. This corpus provides a freely available additional data source for researchers worldwide to conduct a wide and diverse range of research projects, such as epidemiological analyses, studies of emotional and mental responses to social distancing measures, identification of sources of misinformation, and stratified measurement of sentiment towards the pandemic in near real time, among many others.
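As a hedged illustration of how such a corpus might be put to work, the sketch below counts daily tweet volume over an early-pandemic window. The file name and column layout are assumptions; the released data consist of tweet identifiers that must first be hydrated into full records:

```python
import pandas as pd

# Hypothetical layout: hydrated tweets with a date and text column.
# (File name and column names are assumptions, not the dataset's spec.)
chunks = pd.read_csv("covid19_tweets_hydrated.csv", usecols=["date", "text"],
                     parse_dates=["date"], chunksize=1_000_000)

# Example: daily tweet volume during a period of social distancing measures.
daily = pd.Series(dtype="float64")
for chunk in chunks:  # chunked reading keeps memory bounded at this scale
    window = chunk[(chunk["date"] >= "2020-03-01") & (chunk["date"] <= "2020-05-31")]
    daily = daily.add(window.groupby(window["date"].dt.date).size(), fill_value=0)

print(daily.sort_index().head())
```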


Author(s):  
Chunyi Wu ◽  
Gaochao Xu ◽  
Yan Ding ◽  
Jia Zhao

Large-scale task processing based on cloud computing has become crucial to big data analysis and processing in recent years. Most previous work applies conventional methods and architectures designed for general-scale tasks to the disposal of massive task sets, which is limited by issues such as computing capability and data transmission. Based on this argument, a fat-tree-structure-based approach called LTDR (Large-scale Tasks processing using Deep network model and Reinforcement learning) is proposed in this work. Aiming at exploring the optimal task allocation scheme, a virtual network mapping algorithm based on a deep convolutional neural network and Q-learning is presented herein. After feature extraction, we design and implement a policy network to make node mapping decisions. The link mapping scheme is attained by the designed distributed value-function-based reinforcement learning model. Finally, tasks are allocated to suitable physical nodes and processed efficiently. Experimental results show that LTDR can significantly improve the utilization of physical resources and long-term revenue while satisfying task requirements in big data.
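The full LTDR model (a deep convolutional policy network plus a distributed value-function learner) is beyond a short excerpt, but the core Q-learning idea it builds on can be sketched in a deliberately simplified, tabular form. All sizes, the reward function, and the hyperparameters below are illustrative assumptions, not the paper's:

```python
import numpy as np

# Toy stand-in: states are virtual nodes awaiting placement,
# actions are candidate physical nodes.
n_virtual, n_physical = 10, 5
Q = np.zeros((n_virtual, n_physical))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(42)

def reward(p, load):
    # Toy reward: prefer lightly loaded physical nodes
    # (a crude proxy for resource utilization and long-term revenue).
    return 1.0 - load[p]

for episode in range(500):
    load = np.zeros(n_physical)
    for v in range(n_virtual):
        # epsilon-greedy action selection over physical nodes
        p = int(rng.integers(n_physical)) if rng.random() < epsilon else int(Q[v].argmax())
        r = reward(p, load)
        load[p] += 0.2
        # Standard Q-learning update toward the value of the next placement step
        next_best = Q[v + 1].max() if v + 1 < n_virtual else 0.0
        Q[v, p] += alpha * (r + gamma * next_best - Q[v, p])

print("Greedy node mapping:", Q.argmax(axis=1))
```

In the paper, the tabular Q-function is replaced by learned networks over extracted features, and link mapping is handled by a separate distributed learner.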


2020 ◽  
Author(s):  
Namrata Bhattacharya Mis

Agenda 2030 Goal 11 commits to making disaster risk reduction an integral part of sustainable social and economic development. Flooding poses some of the most serious challenges facing developing nations, hitting the most vulnerable hardest. The urban poor, frequently at highest risk, are characterised by inadequate housing and a lack of services and infrastructure, with high population growth and spatial expansion in dense, lower-quality urban structures. Using big data from within these low-quality urban settlement areas can be a useful step forward in generating information for a better understanding of their vulnerabilities. Big data for resilience is a recent field of research that offers tremendous potential for increasing disaster resilience, especially in the context of social resilience. This research focuses on unleashing the unrealised opportunities of big data through the differential social and economic frames that can contribute towards better-targeted information generation in disaster management. The scoping study aims to contribute to the understanding of the potential of big data, particularly in low-income countries, to empower vulnerable populations against natural hazards such as floods. Recognising the potential of providing real-time and long-term information for emergency management in flood-affected large urban settlements, this research concentrates on flood hazard and the use of remotely sensed data (NASA TRMM, LANDSAT) as the big data source for quick disaster response (and recovery) in targeted areas. The research question for the scoping study is: can big data sources provide real-time and long-term information to improve emergency disaster management against floods in urban settlements in developing countries? Previous research has identified several ways in which big data can enable faster response to affected populations, but few attempts have been made to integrate these factors into an aggregated conceptual output. An international review of multi-discipline research, grey literature, grass-roots projects, and emerging online social discourse will appraise the concepts and scope of big data to address the four objectives of the research and answer specific questions around the existing and future potentials of big data, operationalisation and capacity building by agencies, associated risks, and prospects for maximising impact. The research proposes a concept design for undertaking a thematic review of existing secondary data sources, which will be used to provide a holistic picture of how big data can support resilience through technological change within the specific social and environmental contexts of developing countries. The implications of the study lie in system integration and in understanding the socio-economic, political, legal and ethical contexts essential for investment decision-making for strategic impact and resilience-building in developing nations.


2019 ◽  
Author(s):  
Ruud J. Dirksen ◽  
Greg E. Bodeker ◽  
Peter W. Thorne ◽  
Andrea Merlone ◽  
Tony Reale ◽  
...  

Abstract. This paper describes the GRUAN-wide approach to managing the transition from the Vaisala RS92 to the Vaisala RS41 as the operational radiosonde. The goal of the GCOS Reference Upper-Air Network (GRUAN) is to provide long-term high-quality reference observations of upper-air Essential Climate Variables (ECVs) such as temperature and water vapor. With GRUAN data being used for climate monitoring, it is vital that the change of measurement system does not introduce inhomogeneities into the data record. The majority of the 27 GRUAN sites were launching the RS92 as their operational radiosonde, and following the end of production of the RS92 in the last quarter of 2017, most of these sites have now switched to the RS41. Such a large-scale change in instrumentation is unprecedented in the history of GRUAN and poses a challenge for the network. Several measurement programmes have been initiated to characterize differences in biases, uncertainties and noise between the two radiosonde types. These include laboratory characterization of measurement errors, extensive twin sounding studies with RS92 and RS41 on the same balloon, and comparison with ancillary data. This integrated approach is commensurate with the GRUAN principles of traceability and deliberate redundancy. A two-year period of regular twin soundings is recommended, and for sites that are not able to implement this, burden sharing is employed, such that measurements at a certain site are considered representative of other sites with similar climatological characteristics. All data relevant to the RS92-RS41 transition are archived in a database that will be accessible to the scientific community for external scrutiny. Furthermore, the knowledge and experience gained about GRUAN's RS92-RS41 transition will be extensively documented to ensure traceability of the process. This documentation will benefit other networks in managing changes in their operational radiosonde systems. Preliminary analysis of the laboratory experiments indicates that the manufacturer's calibration of the RS41's temperature and humidity sensors is more accurate than for the RS92, with uncertainties of <0.2 K for the temperature sensor and <1.5 % RH for the humidity sensor.


Author(s):  
Usman Iqbal ◽  
Phung Anh Nguyen ◽  
Shabbir Syed-Abdul ◽  
Wen-Shan Jian ◽  
Yu-Chuan Jack Li

ABSTRACT Objective: Rapid change in health information technology systems has dramatically increased the accumulation of health data. We aimed to develop an online informatics tool to evaluate the cancer risk of drugs by utilizing medical big data. Data source: We used Taiwan's National Health Insurance Database, which covers the health information of the 23 million Taiwanese population, including patient characteristics and all drug information such as prescriptions. Front-end development: A web-based interface was developed using PHP and JavaScript. In addition, we incorporated the guidelines of evidence-based medicine (EBM) level 3 for observational studies (cohort, case-control, and/or self-controlled case series) to support users interacting with the system. Back-end development: An Apache, MySQL and PHP stack was used to build the server side of the system. We integrated the Elasticsearch API into the system to search and analyze data immediately. An example of transforming the Taiwan NHI database to person-level data is shown in Box 1. We then integrated an analytics package (i.e., R) to perform the statistical analysis for a given study. This online analytical tool can massively explore and visualize big data on long-term drug use and cancer through the OMOSC system, enabling mass online studies of long-term drug use and cancer risk. It can help health care professionals who lack data-mining skills to lead such studies. The constructed online system automatically generates cases and controls from large databases for long-term drug exposures and cancer risk. Results: The results are shown as odds ratios (OR), and if confounding factors are selected, an adjusted odds ratio (AOR) for risk estimation with 95% confidence intervals (CI) is also produced. We used SAS statistical software on the same dataset to validate the OMOSC system's results. The system supports massive online studies, saving time and cost. Conclusion: Because clinical trials are often impossible to conduct due to cultural, cost, ethical, political or social obstacles, this kind of research model could play an important role in the health care industry, providing excellent opportunities for solving the technological, informatics, and organizational issues in evaluating drugs across other broad domains by utilizing large-scale databases.
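The system's internals are not published here, but the crude odds ratio and Wald confidence interval it reports can be sketched as follows; the function and all counts are illustrative, not taken from the OMOSC system:

```python
import math

def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Crude odds ratio with a Wald 95% confidence interval from a 2x2 table."""
    or_ = (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)
    # Standard error of log(OR) from the four cell counts
    se_log_or = math.sqrt(1 / exposed_cases + 1 / exposed_controls
                          + 1 / unexposed_cases + 1 / unexposed_controls)
    lo = math.exp(math.log(or_) - z * se_log_or)
    hi = math.exp(math.log(or_) + z * se_log_or)
    return or_, (lo, hi)

# Illustrative counts only: long-term drug exposure vs. cancer outcome.
print(odds_ratio_ci(exposed_cases=120, exposed_controls=380,
                    unexposed_cases=240, unexposed_controls=1260))
```

Adjusted odds ratios (AOR) would come from a logistic regression with the selected confounders, which the abstract attributes to the integrated R package.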


Author(s):  
Inga Brentel ◽  
Kristi Winters

Abstract This article details the novel structure developed to handle, harmonize and document big data for reuse and long-term preservation. 'The Longitudinal IntermediaPlus (2014–2016)' big data dataset is uniquely rich: it covers an array of German online media, extendable to cross-media channels and user information. The metadata file for this dataset, and its documentation, were recently deposited as their own MySQL database called charmstana_sample_14-16.sql (https://data.gesis.org/sharing/#!Detail/10.7802/2030) (cs16) and are suitable for generating descriptive statistics. Analogous to the 'Data View' in SPSS, the charmstana_analysis (ca) database contains the dataset's numerical values. Both the cs16 and ca MySQL files are needed to conduct analysis on the full database. The research challenge was to process large-scale datasets into one longitudinal big-data source suitable for academic research, in accordance with FAIR principles. The authors review four methodological recommendations that can serve as a framework for solving big-data structuring challenges, using the harmonization software CharmStats.
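A minimal sketch of combining the two deposited files is given below. The connection details, table names, and columns are assumptions for illustration only; the deposited documentation describes the real schema:

```python
import mysql.connector  # from the mysql-connector-python package

# Hypothetical connection; host, user, and database names are placeholders.
conn = mysql.connector.connect(host="localhost", user="researcher",
                               password="...", database="charmstana")

cur = conn.cursor()
# Join the harmonized metadata (cs16) with the numerical values (ca),
# analogous to combining SPSS's "Variable View" and "Data View".
# Table and column names below are assumed, not the deposit's actual schema.
cur.execute("""
    SELECT m.variable_name, m.value_label, COUNT(*) AS n
    FROM charmstana_analysis AS a
    JOIN charmstana_sample_14_16 AS m
      ON a.variable_id = m.variable_id AND a.value = m.value
    GROUP BY m.variable_name, m.value_label
""")
for variable, label, n in cur.fetchall():
    print(variable, label, n)
conn.close()
```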


2020 ◽  
Vol 9 (2) ◽  
pp. 337-355 ◽  
Author(s):  
Ruud J. Dirksen ◽  
Greg E. Bodeker ◽  
Peter W. Thorne ◽  
Andrea Merlone ◽  
Tony Reale ◽  
...  

Abstract. This paper describes the Global Climate Observing System (GCOS) Reference Upper-Air Network (GRUAN) approach to managing the transition from the Vaisala RS92 to the Vaisala RS41 as the operational radiosonde. The goal of GRUAN is to provide long-term high-quality reference observations of upper-air essential climate variables (ECVs) such as temperature and water vapor. With GRUAN data being used for climate monitoring, it is vital that the change of measurement system does not introduce inhomogeneities to the data record. The majority of the 27 GRUAN sites were launching the RS92 as their operational radiosonde, and following the end of production of the RS92 in the last quarter of 2017, most of these sites have now switched to the RS41. Such a large-scale change in instrumentation is unprecedented in the history of GRUAN and poses a challenge for the network. Several measurement programs have been initiated to characterize differences in biases, uncertainties, and noise between the two radiosonde types. These include laboratory characterization of measurement errors, extensive twin sounding studies with RS92 and RS41 on the same balloon, and comparison with ancillary data. This integrated approach is commensurate with the GRUAN principles of traceability and deliberate redundancy. A 2-year period of regular twin soundings is recommended, and for sites that are not able to implement this, burden-sharing is employed such that measurements at a certain site are considered representative of other sites with similar climatological characteristics. All data relevant to the RS92–RS41 transition are archived in a database that will be accessible to the scientific community for external scrutiny. Furthermore, the knowledge and experience gained regarding GRUAN's RS92–RS41 transition will be extensively documented to ensure traceability of the process. This documentation will benefit other networks in managing changes in their operational radiosonde systems. Preliminary analysis of the laboratory experiments indicates that the manufacturer's calibration of the RS41 temperature and humidity sensors is more accurate than for the RS92, with uncertainties of <0.2 K for the temperature and <1.5 % RH (RH: relative humidity) for the humidity sensor. A first analysis of 224 RS92–RS41 twin soundings at Lindenberg Observatory shows nighttime temperature differences <0.1 K between the Vaisala-processed temperature data for the RS41 (T_RS41) and the GRUAN data product for the RS92 (T_RS92-GDP.2). However, daytime temperature differences in the stratosphere increase steadily with altitude, with T_RS92-GDP.2 up to 0.6 K higher than T_RS41 at 35 km. RH_RS41 values are up to 8 % higher, which is consistent with the analysis of satellite–radiosonde collocations.
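As a sketch of how such a paired (twin-sounding) comparison reduces to per-level statistics, the snippet below takes flight-by-flight differences on a common vertical grid; the synthetic profiles, bias shape, and grid are stand-ins for the real GRUAN data products:

```python
import numpy as np

# Twin soundings: RS92 and RS41 fly on the same balloon, so temperature
# differences can be taken per flight and per altitude level.
rng = np.random.default_rng(1)
altitude_km = np.arange(0, 36)                     # common vertical grid, 1 km steps
n_flights = 224                                    # matches the Lindenberg sample size

t_rs92 = 220 + rng.normal(scale=0.3, size=(n_flights, altitude_km.size))
t_rs41 = t_rs92 - 0.02 * altitude_km               # toy altitude-dependent bias

diff = t_rs92 - t_rs41
mean_diff = diff.mean(axis=0)
# Standard error of the mean difference at each level
sem = diff.std(axis=0, ddof=1) / np.sqrt(n_flights)

for z in (10, 20, 35):
    print(f"{z} km: {mean_diff[z]:+.2f} +/- {sem[z]:.2f} K")
```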


Author(s):  
Niels Brügger ◽  
Janne Nielsen ◽  
Ditte Laursen

This article outlines how the 'digital geography' of a nation, that is, the online presence of a single nation, can be studied. The entire Danish Web domain and its development from 2006 to 2015 is used as a case, based on the holdings of the Danish national Web archive. The following research questions guide the investigation: What has the Danish Web domain looked like in the past, and how has it developed in the period 2006-2015? Methodologically, we investigate to what extent one can delimit 'a nation' on the Web, what characterizes the archived Web as a historical source for academic studies, and the general characteristics of our specific data source. Analytically, the article introduces a design for how this type of big data analysis of an entire national Web domain can be performed. Our findings show some of the ways in which a nation's digital landscape can be mapped, i.e., by size, content types and hyperlinks. On a broader canvas, this study demonstrates that with hardware and software as well as human competencies from different disciplines, it is possible to perform large-scale historical studies of one of the biggest media sources of today, the World Wide Web.
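As an illustration of the kind of aggregation such a design involves, the sketch below tallies a web archive's CDX index by crawl year and MIME type, delimiting 'the nation' by its country-code top-level domain. The file name is illustrative, and the field positions follow the common CDX layout (urlkey, timestamp, original URL, MIME type, ...) but should be checked against the specific archive's CDX specification:

```python
from collections import Counter
from urllib.parse import urlparse

by_year = Counter()
by_mime = Counter()

with open("dk_domain.cdx", encoding="utf-8") as fh:   # file name is illustrative
    for line in fh:
        fields = line.split()
        if len(fields) < 4 or not fields[1][:4].isdigit():
            continue  # skip the CDX header line and malformed records
        timestamp, url, mime = fields[1], fields[2], fields[3]
        host = urlparse(url).hostname or ""
        if host.endswith(".dk"):                      # delimit the nation by ccTLD
            by_year[timestamp[:4]] += 1               # 14-digit timestamp, YYYY prefix
            by_mime[mime] += 1

print(by_year.most_common())
print(by_mime.most_common(10))
```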


Animals ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 1872
Author(s):  
Ashley N. Paynter ◽  
Matthew D. Dunbar ◽  
Kate E. Creevy ◽  
Audrey Ruple

Dogs provide an ideal model for study, as they have the greatest phenotypic diversity and the most known naturally occurring diseases of all non-human land mammals. Thus, data related to dog health present many opportunities to discover insights into health and disease outcomes. Here, we describe several sources of veterinary medical big data that can be used in research. These sources include medical records from primary care centers or referral hospitals, medical claims data from animal insurance companies, and datasets constructed specifically for research purposes. No data source provides information without limitations, but large-scale, prospective, longitudinally collected data from dog populations are ideal for further research, as they offer many advantages over other data sources.

