An Unsupervised Approach for Determining Link Specifications

Author(s):  
Khayra Bencherif ◽  
Mimoun Malki ◽  
Djamel Amar Bensaber

This article describes how the Linked Open Data Cloud project allows data providers to publish structured data on the web according to the Linked Data principles. In this context, several link discovery frameworks have been developed for connecting entities contained in knowledge bases. In order to achieve high effectiveness for the link discovery task, a suitable link configuration is required to specify the similarity conditions. Unfortunately, such configurations are specified manually, which makes the link discovery task tedious and more difficult for users. In this article, the authors address this drawback by proposing a novel approach for the automatic determination of link specifications. The proposed approach is based on a neural network model that combines a set of existing metrics into a compound one. The authors evaluate the effectiveness of the proposed approach in three experiments using real data sets from the LOD Cloud. In addition, the proposed approach is compared against existing link specification approaches and is shown to outperform them in most experiments.
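
The abstract does not describe the model in detail; the following minimal sketch is purely illustrative, assuming scikit-learn and two hypothetical similarity metrics (token Jaccard and common-prefix ratio), and shows how several per-pair similarity scores could be combined by a small neural network into a single link/no-link decision.

    # Illustrative sketch only: combine several string-similarity scores for a
    # candidate entity pair into one link/no-link decision with a small neural net.
    # The metric functions and training pairs are hypothetical placeholders.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def jaccard_sim(a, b):
        """Token-level Jaccard similarity between two labels."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def prefix_sim(a, b):
        """Fraction of the shorter label covered by the common prefix."""
        n = min(len(a), len(b))
        k = next((i for i in range(n) if a[i] != b[i]), n)
        return k / n if n else 0.0

    def features(pair):
        a, b = pair
        return [jaccard_sim(a, b), prefix_sim(a, b)]

    # Tiny hand-made training set: (label_from_source, label_from_target), is_link
    pairs = [("Berlin", "Berlin, Germany"), ("Paris", "Paris"),
             ("Berlin", "Dublin"), ("Rome", "Oslo")]
    labels = [1, 1, 0, 0]

    X = np.array([features(p) for p in pairs])
    clf = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
    clf.fit(X, labels)

    print(clf.predict([features(("Paris", "Paris, France"))]))  # likely [1] on this toy data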

Author(s):  
Heiko Paulheim ◽  
Christian Bizer

Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets: SDType adds missing type statements, and SDValidate identifies faulty statements. Neither of the algorithms uses external knowledge, i.e., they operate only on the data itself. We evaluate the algorithms on the DBpedia and NELL knowledge bases, showing that they are both accurate as well as scalable. Both algorithms have been used for building the DBpedia 3.9 release: With SDType, 3.4 million missing type statements have been added, while using SDValidate, 13,000 erroneous RDF statements have been removed from the knowledge base.
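
As a rough illustration of the SDType idea only (not the authors' implementation), the sketch below lets every property an untyped resource uses vote for candidate types according to the type distribution observed for that property in the rest of the data; the toy triples and the acceptance threshold are assumptions.

    # Simplified illustration of the SDType heuristic (not the authors' code):
    # each property an untyped resource uses casts a "vote" for candidate types,
    # weighted by how often that property co-occurs with each type in the data.
    from collections import Counter, defaultdict

    # Toy triples: (subject, property, object); toy type assertions: subject -> types
    triples = [("A", "birthPlace", "Berlin"), ("B", "birthPlace", "Paris"),
               ("C", "birthPlace", "Rome"),   ("C", "foundingYear", "1900"),
               ("X", "birthPlace", "Oslo")]   # X has no type statement
    types = {"A": {"Person"}, "B": {"Person"}, "C": {"Organisation"}}

    # Type distribution per property, estimated from the typed subjects.
    dist = defaultdict(Counter)
    for s, p, _ in triples:
        for t in types.get(s, ()):
            dist[p][t] += 1

    def predict_types(subject, threshold=0.4):
        """Average the per-property type distributions for one subject."""
        votes = Counter()
        props = [p for s, p, _ in triples if s == subject]
        for p in props:
            total = sum(dist[p].values())
            for t, c in dist[p].items():
                votes[t] += (c / total) / len(props)
        return {t: round(score, 2) for t, score in votes.items() if score >= threshold}

    print(predict_types("X"))  # {'Person': 0.67} with this toy data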


2017 ◽  
Vol 108 (1) ◽  
pp. 355-366 ◽  
Author(s):  
Ankit Srivastava ◽  
Georg Rehm ◽  
Felix Sasaki

Abstract With the ever-increasing availability of linked multilingual lexical resources, there is a renewed interest in extending Natural Language Processing (NLP) applications so that they can make use of the vast set of lexical knowledge bases available in the Semantic Web. In the case of Machine Translation, MT systems can potentially benefit from such resources. Unknown words and ambiguous translations are among the most common sources of error. In this paper, we attempt to minimise these types of errors by interfacing Statistical Machine Translation (SMT) models with Linked Open Data (LOD) resources such as DBpedia and BabelNet. We perform several experiments based on the SMT system Moses and evaluate multiple strategies for exploiting knowledge from multilingual linked data in automatically translating named entities. We conclude with an analysis of best practices for multilingual linked data sets in order to optimise their benefit to multilingual and cross-lingual applications.
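
The paper's experiments are built around Moses; purely to illustrate the kind of LOD lookup involved, the sketch below queries the public DBpedia SPARQL endpoint for a target-language label of a named entity. The endpoint's availability, the example entity and the helper name translate_entity are assumptions, not details taken from the paper.

    # Illustrative lookup only: fetch a German label for an English named entity
    # from the public DBpedia SPARQL endpoint (endpoint availability assumed).
    import requests

    ENDPOINT = "https://dbpedia.org/sparql"

    def translate_entity(english_label, lang="de"):
        """Return one label of the entity in the target language, or None."""
        query = f"""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?label WHERE {{
                ?e rdfs:label "{english_label}"@en ;
                   rdfs:label ?label .
                FILTER (lang(?label) = "{lang}")
            }} LIMIT 1
        """
        resp = requests.get(ENDPOINT, timeout=30,
                            params={"query": query,
                                    "format": "application/sparql-results+json"})
        resp.raise_for_status()
        rows = resp.json()["results"]["bindings"]
        return rows[0]["label"]["value"] if rows else None

    print(translate_entity("Munich"))  # should print "München" if the lookup succeeds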


2016 ◽  
Vol 6 (2) ◽  
pp. 1-23 ◽  
Author(s):  
Surbhi Bhatia ◽  
Manisha Sharma ◽  
Komal Kumar Bhatia

Due to the sudden and explosive increase in web technologies, a huge quantity of user-generated content is available online. The experiences of people and their opinions play an important role in the decision-making process. Although facts are relatively easy to search for on a given topic, retrieving opinions is still a challenging task. Opinion mining must therefore be carried out efficiently in order to extract constructive opinionated information from these reviews. The present work focuses on the design and implementation of an Opinion Crawler, which downloads opinions from various sites while ignoring the rest of the web. In addition, it detects web pages that are frequently updated and computes a revisit timestamp for them in order to extract relevant opinions. The performance of the Opinion Crawler is evaluated on real data sets and proves to be considerably more accurate in terms of the precision and recall quality attributes.
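
The abstract does not give the revisit computation itself; the sketch below is one plausible, illustrative way to derive a revisit interval from past change observations (content hashes with a halving/doubling back-off). The class name and interval bounds are assumptions, not the paper's algorithm.

    # Minimal sketch (not the paper's algorithm): derive a revisit interval for a
    # page from whether past fetches found changed content, via content hashes.
    import hashlib
    from datetime import datetime, timedelta

    class RevisitScheduler:
        def __init__(self, base_interval_hours=24, min_hours=1, max_hours=168):
            self.base = timedelta(hours=base_interval_hours)
            self.min, self.max = timedelta(hours=min_hours), timedelta(hours=max_hours)
            self.last_hash = {}   # url -> hash of last fetched content
            self.interval = {}    # url -> current revisit interval

        def record_fetch(self, url, content, now=None):
            """Shrink the interval when the page changed, grow it when it did not."""
            now = now or datetime.utcnow()
            h = hashlib.sha256(content.encode("utf-8")).hexdigest()
            interval = self.interval.get(url, self.base)
            if self.last_hash.get(url) != h:
                interval = max(self.min, interval / 2)   # changed -> revisit sooner
            else:
                interval = min(self.max, interval * 2)   # unchanged -> back off
            self.last_hash[url], self.interval[url] = h, interval
            return now + interval                        # next scheduled visit

    sched = RevisitScheduler()
    print(sched.record_fetch("https://example.org/reviews", "great product!"))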


2014 ◽  
Vol 08 (04) ◽  
pp. 415-439 ◽  
Author(s):  
Amna Basharat ◽  
I. Budak Arpinar ◽  
Shima Dastgheib ◽  
Ugur Kursuncu ◽  
Krys Kochut ◽  
...  

Crowdsourcing is one of the new emerging paradigms that exploit the notion of human computation for harvesting and processing complex heterogeneous data to produce insight and actionable knowledge. Crowdsourcing is task-oriented, and hence the specification and management not only of tasks but also of workflows should play a critical role. Crowdsourcing research can still be considered in its infancy. There is a significant need for crowdsourcing applications to be equipped with well-defined task and workflow specifications, ranging from simple human-intelligence tasks to more sophisticated and cooperative tasks, to handle data and control flow among these tasks. Addressing this need, we have attempted to devise a generic, flexible and extensible task specification and workflow management mechanism for crowdsourcing. We have contextualized this problem to linked data management as our domain of interest. More specifically, we develop CrowdLink, which utilizes an architecture for automated task specification, generation, publishing and reviewing to engage crowdworkers in the verification and creation of triples in the Linked Open Data (LOD) cloud. The LOD incorporates various core data sets in the semantic web, yet is not in full conformance with the guidelines for publishing high-quality linked data on the web. Our approach is not only useful for efficiently processing LOD management tasks; it can also help in enriching and improving the quality of mission-critical links in the LOD. We demonstrate the usefulness of our approach through various link creation and verification tasks and workflows using Amazon Mechanical Turk. Experimental evaluation demonstrates promising results not only in terms of ease of task generation, publishing and reviewing, but also in terms of the accuracy of the links created and verified by the crowdworkers.
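
CrowdLink's own task templates are not reproduced in the abstract; as a hypothetical illustration only, the sketch below turns an RDF triple into a simple yes/no verification-task specification of the kind a platform such as Amazon Mechanical Turk could publish. The field names, reward value and helper function are invented for the example.

    # Illustrative sketch only (not CrowdLink itself): turn an RDF triple into a
    # plain verification-task specification suitable for a yes/no crowd question.
    def verification_task(subject, predicate, obj, reward_usd=0.05, assignments=3):
        question = (f'Is the following statement correct? '
                    f'"{subject.rsplit("/", 1)[-1]}" {predicate.rsplit("/", 1)[-1]} '
                    f'"{obj.rsplit("/", 1)[-1]}"  (answer yes or no)')
        return {
            "title": "Verify a fact from the Linked Open Data cloud",
            "question": question,
            "reward": f"{reward_usd:.2f}",        # platform rewards are usually strings
            "assignments": assignments,           # redundant judgments for majority vote
            "triple": (subject, predicate, obj),  # keep provenance for the review step
        }

    task = verification_task("http://dbpedia.org/resource/Berlin",
                             "http://dbpedia.org/ontology/country",
                             "http://dbpedia.org/resource/Germany")
    print(task["question"])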


2019 ◽  
Vol 32 (5) ◽  
pp. 451-466 ◽  
Author(s):  
Benedikt Simon Hitz-Gamper ◽  
Oliver Neumann ◽  
Matthias Stürmer

Purpose – Linked data is a technical standard to structure complex information and relate independent sets of data. Recently, governments have started to use this technology for bridging separated data ("silos") by launching linked open government data (LOGD) portals. The purpose of this paper is to explore the role of LOGD as a smart technology and strategy to create public value. This is achieved by enhancing the usability and visibility of open data provided by public organizations.

Design/methodology/approach – In this study, three different LOGD governance modes are deduced: public agencies could release linked data via a dedicated triple store, via a shared triple store or via an open knowledge base. Each of these modes has different effects on the usability and visibility of open data. Selected case studies illustrate the actual use of these three governance modes.

Findings – According to this study, LOGD governance modes present a trade-off between retaining control over governmental data and potentially gaining public value through the increased use of open data by citizens.

Originality/value – This study provides recommendations that help public sector organizations develop a data publishing strategy that balances control, usability and visibility, considering also the growing popularity of open knowledge bases such as Wikidata.


Author(s):  
M Perzyk ◽  
R Biernacki ◽  
J Kozlowski

Determination of the most significant manufacturing process parameters using collected past data can be very helpful in solving important industrial problems, such as the detection of root causes of deteriorating product quality, the selection of the most efficient parameters to control the process, and the prediction of breakdowns of machines, equipment, etc. A methodology for determining the relative significances of process variables and possible interactions between them, based on interrogations of generalized regression models, is proposed and tested. The performance of several types of data mining tool, such as artificial neural networks, support vector machines, regression trees, classification trees, and a naïve Bayesian classifier, is compared. Also, some simple non-parametric statistical methods, based on analysis of variance (ANOVA) and contingency tables, are evaluated for comparison purposes. The tests were performed using simulated data sets with assumed hidden relationships, as well as on real data collected in the foundry industry. It was found that the significance and interaction factors obtained from regression models, and in particular from neural networks, perform satisfactorily, while the other methods appeared to be less accurate and/or less reliable.
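
The paper derives significances by interrogating fitted regression models; a roughly analogous, illustrative sketch (not the authors' procedure) is to rank input variables by permutation importance on a fitted neural network regressor, here assuming scikit-learn, a synthetic data set and made-up variable names.

    # Rough, illustrative analogue (not the paper's exact procedure): rank process
    # variables by how much shuffling each one degrades a fitted neural network's
    # predictions of a quality measure (permutation importance).
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(300, 3))                    # three process parameters
    y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.05, size=300)
    # X[:, 2] is pure noise, so it should rank last.

    model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0).fit(X, y)
    result = permutation_importance(model, X, y, n_repeats=20, random_state=0)

    for name, imp in zip(["temperature", "pressure", "noise"], result.importances_mean):
        print(f"{name}: {imp:.3f}")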


1988 ◽  
Vol 64 (5) ◽  
pp. 2074-2082 ◽  
Author(s):  
R. W. Samsel ◽  
P. T. Schumacker

Normally, metabolic need determines tissue O2 consumption (VO2). In states of reduced supply, VO2 declines sharply below a critical level of O2 delivery (QO2 = blood flow × arterial O2 content). Although several investigators have measured a critical O2 delivery in whole animals or in isolated tissues, there is no general agreement over how to determine the critical point from a collection of real data. In this study, we compare three algorithms for finding the critical O2 delivery from a set of experimental data. We also present a technique for estimating the effect of experimental error on the precision of these algorithms. Using 16 data sets collected in normal dogs, we compare single-line, dual-line, and polynomial regression algorithms for identifying the critical O2 delivery. The dual-line and polynomial regression techniques fit the data better (mean residual square deviation 0.024 and 0.031, respectively) than the single-line regression approach (0.110). To investigate the influence of experimental error on the derived critical QO2, we used a Monte Carlo technique, repeatedly perturbing the experimental data to simulate experimental error. We then calculated the variance of the critical QO2 frequency distribution obtained when the three algorithms were applied to the perturbed data. By this analysis, the dual-line regression technique was less sensitive to experimental error than the polynomial technique.
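
The dual-line method fits two least-squares lines whose intersection region marks the critical point; as an illustration only (not the authors' code), the sketch below finds a two-segment fit by brute-force search over candidate breakpoints on synthetic supply-dependency data.

    # Illustrative dual-line fit: for each candidate breakpoint, fit separate
    # least-squares lines to the points below and above it and keep the split
    # with the smallest total squared residual. The breakpoint location
    # estimates the critical O2 delivery.
    import numpy as np

    def dual_line_fit(qo2, vo2):
        order = np.argsort(qo2)
        x, y = np.asarray(qo2)[order], np.asarray(vo2)[order]
        best = (np.inf, None)
        for i in range(2, len(x) - 2):                 # need >= 2 points per segment
            r = 0.0
            for xs, ys in ((x[:i], y[:i]), (x[i:], y[i:])):
                coef = np.polyfit(xs, ys, 1)
                r += np.sum((ys - np.polyval(coef, xs)) ** 2)
            if r < best[0]:
                best = (r, x[i])
        return best[1]                                  # estimated critical QO2

    # Synthetic supply-dependency data: VO2 rises with delivery, then plateaus.
    q = np.linspace(2, 20, 25)
    v = np.where(q < 10, 0.8 * q, 8.0) + np.random.default_rng(1).normal(0, 0.2, 25)
    print(round(dual_line_fit(q, v), 2))                # roughly 10, the true breakpoint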


Radiocarbon ◽  
2010 ◽  
Vol 52 (1) ◽  
pp. 165-170 ◽  
Author(s):  
Ugo Zoppi

Radiocarbon accelerator mass spectrometry (AMS) measurements are always carried out relative to internationally accepted standards with known 14C activities. The determination of accurate 14C concentrations relies on the fact that standards and unknown samples must be measured under the same conditions. When this is not the case, data reduction is either performed by splitting the collected data set into subsets with consistent measurement conditions or by applying correction factors. This paper introduces a mathematical framework that exploits the intrinsic variability of an AMS system by combining arbitrary measurement parameters into a normalization function. This novel approach allows the en masse reduction of large data sets by providing individual normalization factors for each data point. Both general features and practicalities necessary for its efficient application are discussed.
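
The paper's normalization function itself is not given in the abstract; the sketch below is only a loose illustration of the general idea, fitting the standards' measured-to-nominal ratio as a linear function of hypothetical measurement parameters and applying the resulting per-point factor to an unknown sample.

    # Minimal sketch of the general idea (not the paper's formalism): fit the
    # standards' measured/nominal ratio as a linear function of arbitrary
    # measurement parameters, then apply the resulting per-point normalization
    # factor to samples measured under varying conditions.
    import numpy as np

    # Hypothetical per-run parameters for standards: (ion source current, hour of run)
    params_std = np.array([[30.0, 1], [32.0, 3], [35.0, 6], [31.0, 8]])
    measured_over_nominal = np.array([0.98, 1.00, 1.05, 0.99])   # drifts with conditions

    # Least-squares fit of a linear normalization function of the parameters.
    A = np.hstack([params_std, np.ones((len(params_std), 1))])   # add intercept column
    coef, *_ = np.linalg.lstsq(A, measured_over_nominal, rcond=None)

    def normalization_factor(params):
        """Per-data-point factor predicted from that point's own parameters."""
        return float(np.append(params, 1.0) @ coef)

    # Correct an unknown sample measured at 33 uA in hour 5:
    raw_ratio = 0.512
    print(raw_ratio / normalization_factor([33.0, 5]))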

