Distributed mashups: a collaborative approach to data integration

2015 ◽  
Vol 11 (3) ◽  
pp. 370-396 ◽  
Author(s):  
Tuan-Dat Trinh ◽  
Peter Wetz ◽  
Ba-Lam Do ◽  
Elmar Kiesling ◽  
A Min Tjoa

Purpose – This paper aims to present a collaborative mashup platform for the dynamic integration of heterogeneous data sources. The platform encourages sharing and connects data publishers, integrators, developers and end users.

Design/methodology/approach – The approach is based on a visual programming paradigm and follows three fundamental principles: openness, connectedness and reusability. The platform is based on semantic Web technologies and the concept of linked widgets, i.e. semantic modules that allow users to access, integrate and visualize data in a creative and collaborative manner.

Findings – The platform can effectively tackle data integration challenges by allowing users to explore relevant data sources for different contexts, addressing the data heterogeneity problem and facilitating automatic data integration, easing data integration via simple operations and fostering reusability of data processing tasks.

Research limitations/implications – This research has focused exclusively on conceptual and technical aspects so far; a comprehensive user study and extensive performance and scalability testing are left for future work.

Originality/value – A key contribution of this paper is the concept of distributed mashups. These ad hoc data integration applications allow users to perform data processing tasks collaboratively and in a distributed manner, simultaneously on multiple devices. The approach requires no server infrastructure to upload data to, but rather allows each user to keep control over their data and expose only relevant subsets. Distributed mashups can run persistently in the background and are hence ideal for real-time data monitoring or data streaming use cases. Furthermore, we introduce automatic mashup composition as an innovative approach based on an explicit semantic widget model.
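As a rough illustration of the linked-widget idea, the following Python sketch shows how semantically typed widget inputs and outputs could drive automatic composition. The class and concept names are invented for illustration; this is not the platform's actual widget model.

    from dataclasses import dataclass, field

    @dataclass
    class Widget:
        """A module that declares the semantic concepts it consumes/produces."""
        name: str
        inputs: set = field(default_factory=set)    # concepts consumed
        outputs: set = field(default_factory=set)   # concepts produced

    def can_connect(source: Widget, sink: Widget) -> set:
        """Concepts over which source's output can feed sink's input."""
        return source.outputs & sink.inputs

    # Automatic composition: wire widgets whose semantic models match.
    sensor = Widget("AirQualitySource", outputs={"ex:Observation"})
    mapper = Widget("MapVisualizer", inputs={"ex:Observation"})
    if can_connect(sensor, mapper):
        print(f"wire {sensor.name} -> {mapper.name}")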

Author(s):  
Lihua Lu ◽  
Hengzhen Zhang ◽  
Xiao-Zhi Gao

Purpose – Data integration combines data residing at different sources and provides users with a unified interface to these data. An important issue in data integration is the existence of conflicts among the different data sources. Data sources may conflict with each other at the data level, which is defined as data inconsistency. The purpose of this paper is to address this problem and propose a solution for data inconsistency in data integration.

Design/methodology/approach – A relational data model extended with data source quality criteria is first defined. Based on the proposed data model, a data inconsistency solution strategy is then provided. To implement the strategy, a fuzzy multi-attribute decision-making (MADM) approach based on data source quality criteria is applied to obtain the results. Finally, user feedback strategies are proposed to optimize the result of the fuzzy MADM approach into the final solution for the inconsistent data.

Findings – To evaluate the proposed method, data obtained from sensors were extracted. Experiments were designed and performed to demonstrate the effectiveness of the proposed strategy. The results substantiate that the solution performs better than the other methods in terms of correctness, time cost and stability.

Practical implications – Since inconsistent data collected from sensors are pervasive, the proposed method can solve this problem and correct wrong choices to some extent.

Originality/value – In this paper, the authors study for the first time the effect of user feedback on integration results for inconsistent data.
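The following minimal Python sketch conveys the flavor of the decision step, using a simplified crisp weighted-sum stand-in for the paper's fuzzy MADM method plus an invented feedback adjustment. The criteria, weights and values are assumptions, not the authors' implementation.

    # Score conflicting candidate values by the quality criteria of the
    # sources reporting them; user feedback then nudges criterion weights.
    criteria_weights = {"accuracy": 0.5, "timeliness": 0.3, "completeness": 0.2}

    # Each candidate value carries the quality scores of its source.
    candidates = {
        "23.5": {"accuracy": 0.9, "timeliness": 0.6, "completeness": 0.8},
        "25.1": {"accuracy": 0.7, "timeliness": 0.9, "completeness": 0.7},
    }

    def score(quality: dict) -> float:
        """Crisp weighted sum standing in for the fuzzy MADM aggregation."""
        return sum(criteria_weights[c] * quality[c] for c in criteria_weights)

    best = max(candidates, key=lambda v: score(candidates[v]))
    print("chosen value:", best)

    def apply_feedback(criterion: str, delta: float) -> None:
        """Shift weight toward a criterion the user confirms, renormalize."""
        criteria_weights[criterion] += delta
        total = sum(criteria_weights.values())
        for c in criteria_weights:
            criteria_weights[c] /= total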


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Han-Yu Sung ◽  
Yu-Liang Chi

Purpose – This study aims to develop a Web-based application system called Infomediary of Taiwanese Indigenous Peoples (ITIP) that can help individuals comprehend the society and culture of indigenous peoples. The ITIP is based on the use of Semantic Web technologies to integrate a number of data sources, particularly the bibliographic records of a museum. Moreover, an ontology model was developed to help users search cultural collections by topic concepts.

Design/methodology/approach – Two issues were identified that needed to be addressed: the integration of heterogeneous data sources and semantic-based information retrieval. Two corresponding methods were proposed: SPARQL federated queries were designed for data integration across the Web, and ontology-driven queries were designed for semantic search by knowledge inference. Furthermore, to help users perform searches easily, three search interfaces, namely, ethnicity, region and topic, were developed to take full advantage of the content available on the Web.

Findings – Most open government data provide structured but non-resource description framework data; Semantic Web consumers therefore require additional data conversion before the data can be used. On the other hand, although the library, archive and museum (LAM) community has produced some emerging linked data, very few data sets are released to the general public as open data. The Semantic Web's vision of a "web of data" remains challenging.

Originality/value – This study developed data integration across various institutions, including those of the LAM community. The development was conducted in the mode of non-institution members (i.e. institutional outsiders). The challenges encountered included uncertain data quality and the absence of institutional participation.
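To make the federated-query design concrete, here is a minimal sketch of a SPARQL 1.1 federated query issued from Python via the SPARQLWrapper package. The endpoint URLs and vocabulary are placeholders, not the actual ITIP data sources.

    from SPARQLWrapper import SPARQLWrapper, JSON

    query = """
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?item ?title WHERE {
      ?item dc:subject "weaving" .            # local records
      SERVICE <https://example.org/sparql> {  # remote endpoint, joined at query time
        ?item dc:title ?title .
      }
    }
    LIMIT 10
    """

    endpoint = SPARQLWrapper("https://example.org/local/sparql")  # placeholder URL
    endpoint.setQuery(query)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["item"]["value"], "-", row["title"]["value"])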


2020 ◽  
Vol 245 ◽  
pp. 05020
Author(s):  
Vardan Gyurjyan ◽  
Sebastian Mancilla

The hardware landscape used in HEP and NP is changing from homogeneous multi-core systems towards heterogeneous systems with many different computing units, each with its own characteristics. To achieve maximum data-processing performance, the main challenge is to place the right computation on the right hardware. In this paper, we discuss CLAS12 charged-particle tracking workflow orchestration that allows us to utilize both CPU and GPU to improve performance. The tracking algorithm was decomposed into micro-services that are deployed on CPU and GPU processing units, where the best features of both are intelligently combined to achieve maximum performance. In this heterogeneous environment, CLARA aims to match the requirements of each micro-service to the strengths of the CPU or the GPU architecture. A predefined execution of a micro-service on a CPU or a GPU may not be optimal due to the streaming data-quantum size and the data-quantum transfer latency between CPU and GPU. The CLARA workflow orchestrator is therefore designed to dynamically assign micro-service execution to a CPU or a GPU, based on online benchmark results analyzed over a period of real-time data processing.
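A hypothetical Python sketch of the dynamic-assignment idea (not CLARA's actual API): route each micro-service to the backend with the lower observed mean latency, measured online over a benchmark window so that data-transfer overhead is included in the timings.

    import time
    from collections import defaultdict

    latency = defaultdict(lambda: {"cpu": [], "gpu": []})  # observed timings

    def run(service, quantum, backends, window=20):
        """Execute one data quantum, picking CPU or GPU from online benchmarks."""
        stats = latency[service]
        if len(stats["cpu"]) < window or len(stats["gpu"]) < window:
            # Still benchmarking: alternate so both backends get sampled.
            target = "cpu" if len(stats["cpu"]) <= len(stats["gpu"]) else "gpu"
        else:
            mean = lambda xs: sum(xs) / len(xs)
            target = "cpu" if mean(stats["cpu"]) <= mean(stats["gpu"]) else "gpu"
        start = time.perf_counter()
        result = backends[target](quantum)  # timing includes CPU<->GPU transfer
        stats[target].append(time.perf_counter() - start)
        return result

    # Usage with stand-in backends:
    backends = {"cpu": lambda q: sum(q), "gpu": lambda q: sum(q)}
    print(run("tracking", [1, 2, 3], backends))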


2018 ◽  
Author(s):  
Larysse Silva ◽  
José Alex Lima ◽  
Nélio Cacho ◽  
Eiji Adachi ◽  
Frederico Lopes ◽  
...  

A notable characteristic of smart cities is the increase in the amount of available data generated by several devices and computational systems, which augments the challenges related to developing software that integrates large volumes of data. In this context, this paper presents a literature review aimed at identifying the main strategies used in the development of solutions for data integration, relationship and representation in smart cities. This study systematically selected and analyzed eleven studies published from 2015 to 2017. The results reveal gaps regarding solutions for the continuous integration of heterogeneous data sources to support application development and decision-making.


2019 ◽  
pp. 254-277 ◽  
Author(s):  
Ying Zhang ◽  
Chaopeng Li ◽  
Na Chen ◽  
Shaowen Liu ◽  
Liming Du ◽  
...  

Since large amounts of geospatial data are produced by various sources, geospatial data integration is difficult because of the shortage of semantics. Although standardised data formats and data access protocols, such as the Web Feature Service (WFS), enable end-users to access heterogeneous data stored in different formats from various sources, integration remains time-consuming and ineffective due to the lack of semantics. To solve this problem, a prototype for geospatial data integration is proposed that addresses four problems: geospatial data retrieval, modeling, linking and integration. We adopt four kinds of geospatial data sources to evaluate the performance of the proposed approach. The experimental results illustrate that the proposed linking method achieves high performance in generating the matched candidate record pairs in terms of Reduction Ratio (RR), Pairs Completeness (PC), Pairs Quality (PQ) and F-score. The integration results show that each data source gains substantial Complementary Completeness (CC) and Increased Completeness (IC).
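The evaluation metrics named above have standard record-linkage definitions; the following Python sketch computes them over invented example counts.

    def blocking_metrics(total_pairs, candidate_pairs, candidate_matches, true_matches):
        """Standard linkage metrics over generated candidate record pairs."""
        rr = 1 - candidate_pairs / total_pairs     # Reduction Ratio
        pc = candidate_matches / true_matches      # Pairs Completeness
        pq = candidate_matches / candidate_pairs   # Pairs Quality
        f = 2 * pc * pq / (pc + pq)                # F-score combining PC and PQ
        return rr, pc, pq, f

    # e.g. 1,000,000 possible pairs; 5,000 candidates retain 900 of 1,000 matches
    print(blocking_metrics(1_000_000, 5_000, 900, 1_000))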


2019 ◽  
pp. 230-253
Author(s):  
Ying Zhang ◽  
Chaopeng Li ◽  
Na Chen ◽  
Shaowen Liu ◽  
Liming Du ◽  
...  

Since large amounts of geospatial data are produced by various sources and stored in incompatible formats, geospatial data integration is difficult because of the shortage of semantics. Although standardised data formats and data access protocols, such as the Web Feature Service (WFS), enable end-users to access heterogeneous data stored in different formats from various sources, integration remains time-consuming and ineffective due to the lack of semantics. To solve this problem, a prototype for geospatial data integration is proposed that addresses four problems: geospatial data retrieval, modeling, linking and integration. First, we provide a uniform integration paradigm for users to retrieve geospatial data. Then, we align the retrieved geospatial data in the modeling process to eliminate heterogeneity with the help of Karma. Our main contribution focuses on the third problem. Previous work has defined a set of semantic rules for performing the linking process. However, geospatial data has specific geospatial relationships that are significant for linking but cannot be handled by Semantic Web techniques directly. We take advantage of such unique features of geospatial data to implement the linking process. In addition, previous work encounters a complicated problem when the geospatial data sources are in different languages. In contrast, our proposed linking algorithms include a translation function, which saves translation costs across geospatial sources in different languages. Finally, the geospatial data is integrated by eliminating data redundancy and combining the complementary properties from the linked records. We adopt four kinds of geospatial data sources, namely, OpenStreetMap (OSM), Wikimapia, USGS and EPA, to evaluate the performance of the proposed approach. The experimental results illustrate that the proposed linking method achieves high performance in generating the matched candidate record pairs in terms of Reduction Ratio (RR), Pairs Completeness (PC), Pairs Quality (PQ) and F-score. The integration results show that each data source gains substantial Complementary Completeness (CC) and Increased Completeness (IC).
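As a rough sketch of how a geospatial relationship can drive linking across languages, the following Python fragment blocks candidate pairs by great-circle distance and compares names through a translation hook. The translate stub, record layout and thresholds are assumptions, not the authors' algorithm.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two coordinates, in kilometres."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * 6371 * math.asin(math.sqrt(a))

    def translate(name: str, target_lang: str = "en") -> str:
        return name  # stub: a real implementation would call a translation service

    def is_candidate(rec_a, rec_b, max_km=0.5):
        """Pair two records only if spatially close and translated names agree."""
        close = haversine_km(rec_a["lat"], rec_a["lon"],
                             rec_b["lat"], rec_b["lon"]) <= max_km
        same = translate(rec_a["name"]).lower() == translate(rec_b["name"]).lower()
        return close and same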


2019 ◽  
Vol 31 (1) ◽  
pp. 265-290 ◽  
Author(s):  
Ganjar Alfian ◽  
Muhammad Fazal Ijaz ◽  
Muhammad Syafrudin ◽  
M. Alex Syaekhoni ◽  
Norma Latif Fitriyani ◽  
...  

Purpose – The purpose of this paper is to propose customer behavior analysis based on real-time data processing and association rules for a digital signage-based online store (DSOS). Real-time data processing based on big data technology (such as NoSQL MongoDB and Apache Kafka) is utilized to handle the vast amount of customer behavior data.

Design/methodology/approach – In order to extract customer behavior patterns, customers' browsing history and transactional data from digital signage (DS) could be used as the input for decision making. First, the authors developed a DSOS and installed it in different locations, so that customers could have the experience of browsing and buying a product. Second, the real-time data processing system gathered customers' browsing history and transaction data as they occurred. In addition, the authors utilized association rules to extract useful information from customer behavior, so that it may be used by managers to efficiently enhance service quality.

Findings – First, as the number of customers and DS units increases, the proposed system was capable of conveniently processing a gigantic amount of input data. Second, the data set showed that as the number of visits and the shopping duration increase, the chance of products being purchased also increases. Third, by combining purchasing and browsing data from customers, association rules were derived from the frequent transaction patterns. Thus, these products have a high likelihood of being purchased if they are used as recommendations.

Research limitations/implications – This research empirically supports the theory of association rules, namely that frequent patterns, correlations or causal relationships can be found in various kinds of databases. The scope of the present study is limited to the DSOS, although the findings can be interpreted and generalized in a global business scenario.

Practical implications – The proposed system is expected to help management in taking decisions such as improving the layout of the DS and providing better product suggestions to customers.

Social implications – The proposed system may be utilized to promote green products to customers, having a positive impact on sustainability.

Originality/value – The key novelty of the present study lies in the development of a system based on big data technology to handle enormous amounts of data, as well as in analyzing customer behavior in real time in the DSOS. Real-time data processing based on big data technology (such as NoSQL MongoDB and Apache Kafka) is used to handle the vast amount of customer behavior data. In addition, the present study proposes association rules to extract useful information from customer behavior. These results can be used for promotion as well as relevant product recommendations to DSOS customers. Moreover, in today's changing retail environment, analyzing customer behavior in real time in the DSOS helps to attract and retain customers more efficiently and effectively, giving retailers a competitive advantage over their competitors.
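A small Python sketch of the association-rule step, computing support and confidence for a "browsed X, then bought Y" rule over invented sessions. The Kafka/MongoDB plumbing described in the paper is omitted, and the session data is made up for illustration.

    sessions = [
        {"browsed": {"tv", "soundbar"}, "bought": {"tv"}},
        {"browsed": {"tv", "soundbar"}, "bought": {"tv", "soundbar"}},
        {"browsed": {"phone"}, "bought": set()},
    ]

    def rule_stats(antecedent: set, consequent: set):
        """Support and confidence of: browsed antecedent => bought consequent."""
        n = len(sessions)
        both = sum(1 for s in sessions
                   if antecedent <= s["browsed"] and consequent <= s["bought"])
        ante = sum(1 for s in sessions if antecedent <= s["browsed"])
        support = both / n
        confidence = both / ante if ante else 0.0
        return support, confidence

    print(rule_stats({"soundbar"}, {"tv"}))  # -> (support, confidence)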


2014 ◽  
Vol 912-914 ◽  
pp. 1201-1204
Author(s):  
Gang Huang ◽  
Xiu Ying Wu ◽  
Man Yuan

This paper provides an ontology-based distributed heterogeneous data integration framework (ODHDIF). The framework resolves the problem of semantic interoperability between heterogeneous data sources at the semantic level. Metadata specify the distributed, heterogeneous data and describe the semantic information of each data source; with an ontology serving as a common semantic model, semantic matches are established through ontology mapping between heterogeneous data sources, and semantic differences are shielded, so that the semantic heterogeneity problem of heterogeneous data sources can be effectively solved. The framework provides an effective technical means for enterprises' internal information to be shared accurately and in a timely manner.
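A minimal sketch, assuming invented source and field names, of how per-source mappings to a shared ontology vocabulary can shield semantic differences so queries are posed once against common terms; this illustrates the general idea rather than ODHDIF itself.

    # Per-source mappings from local field names to shared ontology terms.
    mappings = {
        "erp": {"emp_no": "onto:employeeId", "dept": "onto:department"},
        "crm": {"staff_id": "onto:employeeId", "division": "onto:department"},
    }

    def to_ontology(source: str, record: dict) -> dict:
        """Rewrite a source-local record into shared ontology terms."""
        m = mappings[source]
        return {m[k]: v for k, v in record.items() if k in m}

    a = to_ontology("erp", {"emp_no": 42, "dept": "R&D"})
    b = to_ontology("crm", {"staff_id": 42, "division": "R&D"})
    assert a == b  # semantic match despite heterogeneous schemas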

