Big data integration enhancement based on attributes conditional dependency and similarity index method

<abstract> <p>Big data has attracted considerable attention across many domains. The volume of digital data generated in every domain is enormous, and the demand for acquiring that information for analysis and decision making is growing in every field. It is therefore important to integrate related information based on similarity. Existing integration techniques, however, tend to have high processing and time complexity and are constrained in how they interconnect multiple data sources. Because the information comes from many differently distributed sources, it is difficult to determine the relationships between the data, and difficult to identify common data structures that would allow integrated data to be accessed or retrieved effectively for different analyses. This paper proposes a big data integration approach, termed ACD-SI, that computes attribute conditional dependency (ACD) and a similarity index (SI). The ACD-SI mechanism uses an improved Bayesian mechanism to analyze the distribution of attributes in a document in terms of their dependence on other possible attributes. It also uses attribute conversion and selection mechanisms to map and group data for integration, and applies methods such as latent semantic analysis (LSA) to the content of data attributes to extract relevant and accurate data. A series of experiments on a large dataset of bibliographic data from various publications measures the overall purity and normalized mutual information (NMI) of the integrated data. The obtained purity and NMI values confirm the relevance of the clustered data, and the precision, recall, and accuracy measures show the improvement of the proposal over existing approaches.</p> </abstract>
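The abstract evaluates the clustered data with purity and NMI but does not say how they are computed. The sketch below shows the standard textbook definitions in plain Python, as an illustration only. The function names and the toy data are hypothetical, and the arithmetic-mean normalization used for NMI is an assumption, since the abstract does not state which normalization the authors chose.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Fraction of items that carry the majority true label of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def entropy(assignments):
    """Shannon entropy (natural log) of a list of discrete assignments."""
    n = len(assignments)
    return -sum((k / n) * math.log(k / n) for k in Counter(assignments).values())


def nmi(clusters, labels):
    """Mutual information divided by the mean of the two entropies (assumed normalization)."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    pc, py = Counter(clusters), Counter(labels)
    mi = sum(
        (k / n) * math.log((k * n) / (pc[c] * py[y]))
        for (c, y), k in joint.items()
    )
    hc, hy = entropy(clusters), entropy(labels)
    if hc == 0 or hy == 0:
        # Degenerate case: at least one partition has a single group.
        return 1.0 if hc == hy else 0.0
    return 2 * mi / (hc + hy)


# Hypothetical toy data: cluster ids assigned by an integration step,
# and the true source category of each record.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b"]
toy_purity = purity(toy_clusters, toy_labels)
toy_nmi = nmi(toy_clusters, toy_labels)
```

Both scores lie in [0, 1]. Purity rewards clusters dominated by one label but rises trivially as the number of clusters grows, which is one reason it is commonly reported together with NMI.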


Big data and portfolio optimization: A novel approach integrating DEA with multiple data sources

Omega ◽

10.1016/j.omega.2021.102479 ◽

2021 ◽

pp. 102479

Author(s):

Zhongbao Zhou ◽

Meng Gao ◽

Helu Xiao ◽

Rui Wang ◽

Wenbin Liu

Keyword(s):

Big Data ◽

Portfolio Optimization ◽

Data Sources ◽

Multiple Data Sources ◽

Multiple Data ◽

Novel Approach


A spatio-temporally weighted hybrid model to improve estimates of personal PM2.5 exposure: Incorporating big data from multiple data sources

Environmental Pollution ◽

10.1016/j.envpol.2019.07.034 ◽

2019 ◽

Vol 253 ◽

pp. 403-411 ◽

Cited By ~ 2

Author(s):

YuJie Ben ◽

FuJun Ma ◽

Hao Wang ◽

Muhammad Azher Hassan ◽

Romanenko Yevheniia ◽

...

Keyword(s):

Big Data ◽

Hybrid Model ◽

Data Sources ◽

Multiple Data Sources ◽

Multiple Data ◽

PM2.5 Exposure


Three Hierarchical Levels of Big-Data Market Model Over Multiple Data Sources for Internet of Things

IEEE Access ◽

10.1109/access.2018.2845105 ◽

2018 ◽

Vol 6 ◽

pp. 31269-31280 ◽

Cited By ~ 5

Author(s):

Busik Jang ◽

Sangdon Park ◽

Joohyung Lee ◽

Sang-Geun Hahn

Keyword(s):

Big Data ◽

Internet Of Things ◽

Data Sources ◽

Market Model ◽

Multiple Data Sources ◽

Multiple Data ◽

Hierarchical Levels ◽

Data Market


Application of Dempster–Shafer Data Fusion Technique in Support of Decision Making with Big Data

Transportation Research Record: Journal of the Transportation Research Board ◽

10.3141/2645-04 ◽

2017 ◽

Vol 2645 (1) ◽

pp. 32-37 ◽

Cited By ~ 2

Author(s):

Ping Yi ◽

Songling Zhang

Keyword(s):

Decision Making ◽

Big Data ◽

Data Fusion ◽

Incomplete Information ◽

Data Sources ◽

Fusion Technique ◽

Crowd Management ◽

Multiple Data ◽

Data Source ◽

Single Data

This paper introduces applications of the Dempster–Shafer (D-S) data fusion technique in transportation system decision making. D-S inference is a statistics-based data classification technique. It can be used when data sources contribute discontinuous and incomplete information and no single data source can produce an overwhelmingly high probability of certainty for identifying the most probable event. The technique captures and combines the information contributed by the data sources by using Dempster’s rule to find the conjunction of the events and to determine the highest associated probability. The D-S theory is explained and its implementation described through numerical examples of a ride-hailing service and of crowd management at a subway station. Results from the applications show that the technique is very effective in dealing with incomplete information and multiple data sources in the era of big data.
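Dempster's rule of combination, which the abstract names as the fusion step, can be sketched in a few lines. The example below is a generic illustration of the rule, not the paper's implementation. The frame of discernment (`crowded`, `normal`), the two sources, and all mass values are hypothetical.

```python
def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass.
    Mass from pairs of focal sets with an empty intersection is treated
    as conflict and removed by renormalization.
    """
    combined = {}
    conflict = 0.0
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            intersection = b & c
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
            else:
                conflict += mass_b * mass_c
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


# Hypothetical frame of discernment for a station platform.
CROWDED = frozenset({"crowded"})
NORMAL = frozenset({"normal"})
EITHER = CROWDED | NORMAL  # mass assigned here expresses ignorance

# Hypothetical evidence from two incomplete sources.
camera = {CROWDED: 0.6, EITHER: 0.4}
gate_counts = {CROWDED: 0.7, NORMAL: 0.1, EITHER: 0.2}

fused = combine(camera, gate_counts)
most_probable = max(fused, key=fused.get)
```

Assigning mass to the whole frame (`EITHER`) is how a source expresses that it cannot distinguish the hypotheses, which is what lets the rule work with incomplete information.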


Big data solar power forecasting based on deep learning and multiple data sources

Expert Systems ◽

10.1111/exsy.12394 ◽

2019 ◽

Vol 36 (4) ◽

Cited By ~ 11

Author(s):

José F. Torres ◽

Alicia Troncoso ◽

Irena Koprinska ◽

Zheng Wang ◽

Francisco Martínez‐Álvarez

Keyword(s):

Big Data ◽

Deep Learning ◽

Solar Power ◽

Data Sources ◽

Multiple Data Sources ◽

Multiple Data ◽

Solar Power Forecasting ◽

Power Forecasting


Data-Driven Surveillance: Effective Collection, Integration, and Interpretation of Data to Support Decision Making

Frontiers in Veterinary Science ◽

10.3389/fvets.2021.633977 ◽

2021 ◽

Vol 8 ◽

Author(s):

Fernanda C. Dórea ◽

Crawford W. Revie

Keyword(s):

Decision Making ◽

Big Data ◽

Data Integration ◽

Data Access ◽

Data Sources ◽

Data Driven ◽

Multiple Data ◽

Combining Evidence ◽

Complex Variety ◽

Support Decision Making

The biggest change brought about by the “era of big data” to health in general, and epidemiology in particular, relates arguably not to the volume of data encountered, but to its variety. An increasing number of new data sources, including many not originally collected for health purposes, are now being used for epidemiological inference and contextualization. Combining evidence from multiple data sources presents significant challenges, but discussions around this subject often confuse issues of data access and privacy with the actual technical challenges of data integration and interoperability. We review some of the opportunities for connecting data, generating information, and supporting decision-making across the increasingly complex “variety” dimension of data in population health, to enable data-driven surveillance to go beyond simple signal detection and support an expanded set of surveillance goals.


Particular matter prediction using synergy of multiple source urban big data in smart cities

Intelligent Decision Technologies ◽

10.3233/idt-200147 ◽

2021 ◽

pp. 1-15

Author(s):

Ali Reza Honarvar ◽

Ashkan Sami

Keyword(s):

Air Pollution ◽

Big Data ◽

Air Quality ◽

Smart Cities ◽

Data Sources ◽

Quality Data ◽

Data Set ◽

Multiple Data Sources ◽

Multiple Data ◽

Air Quality Data

At present, air quality in populated urban areas is recognized as an environmental crisis. Air pollution affects the sustainability of the city. Air quality data are very important for controlling air pollution and protecting humans from its hazards. However, constructing and maintaining air quality registration infrastructure is very costly, and air quality data recorded at one point cannot be generalized even a few kilometers away. Some gains come from the integration of multiple data sources and can never be achieved through independent single-source processing. Urban organizations in each city independently produce and record data relevant to their own goals and objectives, which creates separate data silos associated with an urban system. These data vary in model and structure, and their integration provides an opportunity to discover knowledge that can be useful in urban planning and decision making. This paper aims to show the generality of our previous research, which proposed a novel model that integrates urban big data resources to predict Particulate Matter (PM), the main factor of air quality, in regions of cities where air quality sensors are not available. It does so by extending the model and running experiments with various configurations for different settings in smart cities. This work extends the evaluation scenarios of the model with the extended dataset of the city of Aarhus, Denmark, and compares the model's performance against various specified baselines. The paper also details how heterogeneity among multiple data sources is removed in the Multiple Data Set Aggregator & Heterogeneity Remover (MDA&HR), and how the operation of the Train Data Splitter (TDS) part of the model is improved by focusing on finding more similar air quality patterns. The acceptable accuracy of the results shows the generality of the model.


Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences

The International Journal of Robotics Research ◽

10.1177/02783649211041652 ◽

2021 ◽

pp. 027836492110416

Author(s):

Erdem Bıyık ◽

Dylan P. Losey ◽

Malayandi Palan ◽

Nicholas C. Landolfi ◽

Gleb Shevchuk ◽

...

Keyword(s):

Data Sources ◽

Mobile Manipulator ◽

Sources Of Information ◽

Multiple Sources ◽

Reward Function ◽

Multiple Data ◽

Preference Queries ◽

Passive Data ◽

Reward Functions ◽

Human Teachers

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data, yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.
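The two-stage idea in the abstract, a belief initialized from demonstrations and then refined by preference answers, can be illustrated with a small Bayesian update over a finite set of candidate reward weights. This is a simplified sketch under stated assumptions, not the authors' algorithm. It assumes a linear reward over trajectory features, a Boltzmann-rational demonstrator choosing from a finite trajectory set, and a logistic preference model, and it omits the paper's query-selection step entirely. All names, candidates, and feature values are hypothetical.

```python
import math


def dot(w, phi):
    return sum(a * b for a, b in zip(w, phi))


def normalize(belief):
    total = sum(belief)
    return [b / total for b in belief]


def update_with_demo(belief, candidates, demo_phi, all_phis, beta=1.0):
    """Bayes update from one demonstration (assumed Boltzmann-rational human)."""
    posterior = []
    for b, w in zip(belief, candidates):
        partition = sum(math.exp(beta * dot(w, phi)) for phi in all_phis)
        posterior.append(b * math.exp(beta * dot(w, demo_phi)) / partition)
    return normalize(posterior)


def update_with_preference(belief, candidates, phi_chosen, phi_rejected):
    """Bayes update from one pairwise preference (assumed logistic choice model)."""
    posterior = []
    for b, w in zip(belief, candidates):
        diff = dot(w, phi_rejected) - dot(w, phi_chosen)
        posterior.append(b / (1.0 + math.exp(diff)))
    return normalize(posterior)


# Hypothetical candidate reward weights and trajectory feature vectors.
candidates = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
trajectories = [(1.0, 0.2), (0.1, 0.9), (-0.8, 0.0)]

prior = [1.0 / len(candidates)] * len(candidates)
# Stage 1: the user demonstrates trajectory 0.
after_demo = update_with_demo(prior, candidates, trajectories[0], trajectories)
# Stage 2: the user answers a query, preferring trajectory 0 over trajectory 1.
after_pref = update_with_preference(after_demo, candidates, trajectories[0], trajectories[1])
best = max(range(len(candidates)), key=lambda i: after_pref[i])
```

In this toy setting the demonstration already favors the first candidate, and the preference answer sharpens the belief further, which mirrors the initialize-then-refine structure the abstract describes.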


An Efficient Multiple Data Sources Selection Algorithm in Data-Sharing Environments

Journal of Software ◽

10.3724/sp.j.1001.2008.00314 ◽

2008 ◽

Vol 19 (2) ◽

pp. 314-322 ◽

Cited By ~ 1

Author(s):

Xiao-Qing WANG

Keyword(s):

Data Sharing ◽

Data Sources ◽

Selection Algorithm ◽

Multiple Data Sources ◽

Multiple Data


Big Data Driven Clinical Informatics & Surveillance (BDD_CIS) – A Multimodal Database Focused Clinical, Community, and Multi-Omics Surveillance Plan for COVID-19: A study Protocol (Preprint)

10.2196/preprints.24504 ◽

2020 ◽

Author(s):

Bankole Olatosi ◽

Jiajia Zhang ◽

Sharon Weissman ◽

Zhenlong Li ◽

Jianjun Hu ◽

...

Keyword(s):

Big Data ◽

South Carolina ◽

Data Science ◽

Age Groups ◽

The Elderly ◽

The United States ◽

Data Sources ◽

Patient Registries ◽

Multiple Partner ◽

Multimodal Data

BACKGROUND: The Coronavirus Disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), remains a serious global pandemic. Currently, all age groups are at risk for infection, but the elderly and persons with underlying health conditions are at higher risk of severe complications. In the United States (US), the pandemic curve is rapidly changing, with over 6,786,352 cases and 199,024 deaths reported. South Carolina (SC) as of 9/21/2020 reported 138,624 cases and 3,212 deaths across the state. OBJECTIVE: The growing availability of COVID-19 data provides a basis for deploying Big Data science to leverage multitudinal and multimodal data sources for incremental learning. Doing this requires the acquisition and collation of multiple data sources at the individual and county level. METHODS: The population for the comprehensive database comes from statewide COVID-19 testing surveillance data (March 2020 to present) for all SC COVID-19 patients (N≈140,000). This project will 1) connect multiple partner data sources for prediction and intelligence gathering, and 2) build a REDCap database that links de-identified multitudinal and multimodal data sources useful for machine learning and deep learning algorithms to enable further studies. Additional data will include hospital-based COVID-19 patient registries, Health Sciences South Carolina (HSSC) data, data from the office of Revenue and Fiscal Affairs (RFA), and Area Health Resource Files (AHRF). RESULTS: The project was funded as of June 2020 by the National Institutes of Health. CONCLUSIONS: The development of such a linked and integrated database will allow for the identification of important predictors of short- and long-term clinical outcomes for SC COVID-19 patients using data science.
