Big data integration enhancement based on attributes conditional dependency and similarity index method

2021 ◽  
Vol 18 (6) ◽  
pp. 8661-8682
Author(s):  
Vishnu Vandana Kolisetty
Dharmendra Singh Rajput

Big data has attracted considerable attention in many domains. The volume of digital data generated in every domain today is enormous, and the demand for such information to support analyses and decisions is growing in every field. It is therefore important to integrate related information based on its similarity. However, existing integration techniques usually suffer from high processing and time complexity and are constrained in interconnecting multiple data sources. Much of this information comes from a variety of sources. Because the many different data sources are distributed in complex ways, it is difficult to determine the relationships between the data and to identify common data structures for integration, so that data can be accessed or retrieved effectively to meet the needs of different analyses. This paper proposes a big data integration approach, termed ACD-SI, that combines the computation of attribute conditional dependency (ACD) with a similarity index (SI). The ACD-SI mechanism uses an improved Bayesian mechanism to analyze the distribution of attributes in a document in terms of their dependence on possible attributes. It also uses attribute conversion and selection mechanisms to map and group data for integration, and uses methods such as latent semantic analysis (LSA) to analyze the content of data attributes and extract relevant and accurate data. A series of experiments on a large dataset of bibliographic data from various publications measures the overall purity and normalized mutual information (NMI) of the integrated data. The purity and NMI values obtained confirm the relevance of the clustered data, and the precision, recall, and accuracy rates show the improvement of the proposal over existing approaches.
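As an illustration of the purity measure mentioned in the abstract, the following is a minimal sketch of the standard clustering-purity definition (each cluster is credited with its most frequent true label). It is not the authors' evaluation code; the function name `purity` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Standard clustering purity: the fraction of items that carry the
    majority true label of the cluster they were assigned to.

    Illustrative helper only, not taken from the ACD-SI paper.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not true_labels:
        raise ValueError("at least one item is required")

    # Count how often each true label occurs inside each cluster.
    label_counts = {}
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts.setdefault(cluster, Counter())[label] += 1

    # Credit every cluster with the size of its largest label group.
    majority_total = sum(c.most_common(1)[0][1] for c in label_counts.values())
    return majority_total / len(true_labels)


# Toy data (hypothetical): two clusters, each with one mislabeled record,
# so 4 of the 6 records match their cluster's majority label.
score = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
print(f"purity = {score:.3f}")
```

Purity alone rewards over-fragmented clusterings (one item per cluster scores 1.0), which is one reason it is usually reported together with NMI, as the abstract does.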

Omega ◽  
2021 ◽  
pp. 102479
Author(s):  
Zhongbao Zhou ◽  
Meng Gao ◽  
Helu Xiao ◽  
Rui Wang ◽  
Wenbin Liu

2019 ◽  
Vol 253 ◽  
pp. 403-411 ◽  
Author(s):  
YuJie Ben ◽  
FuJun Ma ◽  
Hao Wang ◽  
Muhammad Azher Hassan ◽  
Romanenko Yevheniia ◽  
...  

IEEE Access ◽  
2018 ◽  
Vol 6 ◽  
pp. 31269-31280 ◽  
Author(s):  
Busik Jang ◽  
Sangdon Park ◽  
Joohyung Lee ◽  
Sang-Geun Hahn

Author(s):  
Ping Yi ◽  
Songling Zhang

This paper introduces applications of the Dempster–Shafer (D-S) data fusion technique in transportation system decision making. D-S inference is a statistics-based data classification technique, and it can be used when data sources contribute discontinuous and incomplete information and no single data source can produce an overwhelmingly high probability of certainty for identifying the most probable event. The technique captures and combines the information contributed by the data sources by using Dempster’s rule to find the conjunction of the events and to determine the highest associated probability. The D-S theory is explained and its implementation described through numerical examples of a ride-hailing service and of crowd management at a subway station. Results from the applications have shown that the technique is very effective in dealing with incomplete information and multiple data sources in the era of big data.
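Dempster’s rule of combination, which the abstract refers to, can be sketched as follows. This is a minimal illustration of the textbook rule, not the paper's implementation; the hypotheses `congested` and `free` and all mass values are invented for the example.

```python
from itertools import product


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset of hypotheses to its mass, and the
    masses of one function are assumed to sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of the two sets.
            fused[common] = fused.get(common, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contradict each other; track the conflicting mass.
            conflict += mass_a * mass_b
    if conflict >= 1.0 - 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalize so the remaining masses sum to 1 again.
    return {h: m / (1.0 - conflict) for h, m in fused.items()}


# Hypothetical example: two sensors report on the state of a road segment.
CONGESTED = frozenset({"congested"})
FREE = frozenset({"free"})
EITHER = CONGESTED | FREE  # mass left on "don't know"

sensor_1 = {CONGESTED: 0.6, FREE: 0.1, EITHER: 0.3}
sensor_2 = {CONGESTED: 0.5, FREE: 0.2, EITHER: 0.3}

fused = combine(sensor_1, sensor_2)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 3))
```

Mass assigned to the full set `EITHER` represents a source's ignorance, which is how the technique accommodates incomplete information: neither sensor has to commit all of its belief to a single event.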


2019 ◽  
Vol 36 (4) ◽  
Author(s):  
José F. Torres ◽  
Alicia Troncoso ◽  
Irena Koprinska ◽  
Zheng Wang ◽  
Francisco Martínez‐Álvarez

2021 ◽  
Vol 8 ◽  
Author(s):  
Fernanda C. Dórea ◽  
Crawford W. Revie

The biggest change brought about by the “era of big data” to health in general, and epidemiology in particular, relates arguably not to the volume of data encountered, but to its variety. An increasing number of new data sources, including many not originally collected for health purposes, are now being used for epidemiological inference and contextualization. Combining evidence from multiple data sources presents significant challenges, but discussions around this subject often confuse issues of data access and privacy with the actual technical challenges of data integration and interoperability. We review some of the opportunities for connecting data, generating information, and supporting decision-making across the increasingly complex “variety” dimension of data in population health, to enable data-driven surveillance to go beyond simple signal detection and support an expanded set of surveillance goals.


2021 ◽  
pp. 1-15
Author(s):  
Ali Reza Honarvar ◽  
Ashkan Sami

At present, air quality in populated urban areas is recognized as an environmental crisis. Air pollution affects the sustainability of the city. Air quality data are very important for controlling air pollution and protecting humans from its hazards. However, constructing and maintaining air quality monitoring infrastructure is very costly, and air quality data recorded at one point cannot be generalized to locations even a few kilometers away. Some gains come from the integration of multiple data sources and can never be achieved through independent single-source processing. Urban organizations in each city independently produce and record data relevant to their own goals and objectives, which creates separate data silos associated with an urban system. These data vary in model and structure, and their integration provides an opportunity to discover knowledge that can be useful in urban planning and decision making. This paper aims to show the generality of our previous research, which proposed a novel model that predicts particulate matter (PM), the main factor of air quality, in regions of cities where air quality sensors are not available, through the integration of urban big data resources. It does so by extending the model and the experiments with various configurations for different settings in smart cities. This work extends the evaluation scenarios of the model with the extended dataset of the city of Aarhus, Denmark, and compares the model’s performance against various specified baselines. The paper also presents details of removing the heterogeneity of multiple data sources in the Multiple Data Set Aggregator & Heterogeneity Remover (MDA&HR) and of improving the operation of the Train Data Splitter (TDS) part of the model by focusing on finding more similar patterns of air quality. The acceptable accuracy of the results shows the generality of the model.


2021 ◽  
pp. 027836492110416
Author(s):  
Erdem Bıyık ◽  
Dylan P. Losey ◽  
Malayandi Palan ◽  
Nicholas C. Landolfi ◽  
Gleb Shevchuk ◽  
...  

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data, yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.
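The two-stage idea in the abstract (demonstrations initialize a belief, preference queries refine it) can be sketched roughly as below. This is a simplified illustration, not the authors' algorithm. It assumes a linear reward over hand-picked trajectory features, a small finite set of candidate weight vectors, an unnormalized Boltzmann likelihood for demonstrations, and a logistic preference model. The query picker is a simple "most uncertain pair" heuristic swapped in for the paper's optimal query selection. All names and numbers are hypothetical.

```python
import math


def dot(w, phi):
    return sum(wi * pi for wi, pi in zip(w, phi))


def normalize(weights):
    total = sum(weights)
    return [x / total for x in weights]


def init_from_demos(candidates, demo_features, beta=1.0):
    """Stage 1: belief over candidate reward weights from demonstrations.

    Simplification (assumption): the Boltzmann likelihood is used without
    its partition function over alternative trajectories.
    """
    scores = [
        math.exp(beta * sum(dot(w, phi) for phi in demo_features))
        for w in candidates
    ]
    return normalize(scores)


def prob_prefers_a(w, phi_a, phi_b):
    """Logistic model: the higher-reward trajectory is more likely chosen."""
    return 1.0 / (1.0 + math.exp(-(dot(w, phi_a) - dot(w, phi_b))))


def update_with_preference(belief, candidates, phi_a, phi_b, chose_a):
    """Stage 2: Bayesian update of the belief after one preference answer."""
    posterior = []
    for b, w in zip(belief, candidates):
        p = prob_prefers_a(w, phi_a, phi_b)
        posterior.append(b * (p if chose_a else 1.0 - p))
    return normalize(posterior)


def pick_query(belief, candidates, queries):
    """Heuristic (not the paper's criterion): ask about the pair whose
    predicted answer under the current belief is closest to 50/50."""
    def distance_from_even(query):
        phi_a, phi_b = query
        p = sum(b * prob_prefers_a(w, phi_a, phi_b)
                for b, w in zip(belief, candidates))
        return abs(p - 0.5)
    return min(queries, key=distance_from_even)


# Hypothetical setup: two features and two candidate reward functions.
candidates = [(1.0, 0.0), (0.0, 1.0)]
belief = init_from_demos(candidates, demo_features=[(1.0, 0.0)])

queries = [((1.0, 0.0), (0.0, 1.0)), ((1.0, 1.0), (1.0, 0.9))]
phi_a, phi_b = pick_query(belief, candidates, queries)
posterior = update_with_preference(belief, candidates, phi_a, phi_b, chose_a=True)
print(belief, posterior)
```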


2020 ◽  
Author(s):  
Bankole Olatosi ◽  
Jiajia Zhang ◽  
Sharon Weissman ◽  
Zhenlong Li ◽  
Jianjun Hu ◽  
...  

BACKGROUND Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), remains a serious global pandemic. Currently, all age groups are at risk for infection, but the elderly and persons with underlying health conditions are at higher risk of severe complications. In the United States (US), the pandemic curve is rapidly changing, with over 6,786,352 cases and 199,024 deaths reported. South Carolina (SC) as of 9/21/2020 reported 138,624 cases and 3,212 deaths across the state. OBJECTIVE The growing availability of COVID-19 data provides a basis for deploying Big Data science to leverage multitudinal and multimodal data sources for incremental learning. Doing this requires the acquisition and collation of multiple data sources at the individual and county level. METHODS The population for the comprehensive database comes from statewide COVID-19 testing surveillance data (March 2020 to present) for all SC COVID-19 patients (N≈140,000). This project will 1) connect multiple partner data sources for prediction and intelligence gathering, and 2) build a REDCap database that links de-identified multitudinal and multimodal data sources useful for machine learning and deep learning algorithms to enable further studies. Additional data will include hospital-based COVID-19 patient registries, Health Sciences South Carolina (HSSC) data, data from the office of Revenue and Fiscal Affairs (RFA), and Area Health Resource Files (AHRF). RESULTS The project was funded as of June 2020 by the National Institutes of Health. CONCLUSIONS The development of such a linked and integrated database will allow for the identification of important predictors of short- and long-term clinical outcomes for SC COVID-19 patients using data science.

