RDF graph mining for cluster-based theme identification

2020 ◽  
Vol 16 (2) ◽  
pp. 223-247
Author(s):  
Siham Eddamiri ◽  
Asmaa Benghabrit ◽  
Elmoukhtar Zemmouri

Purpose The purpose of this paper is to present a generic pipeline for Resource Description Framework (RDF) graph mining to provide a comprehensive review of each step in the knowledge discovery from data process. The authors also investigate different approaches and combinations to extract feature vectors from RDF graphs to apply the clustering and theme identification tasks. Design/methodology/approach The proposed methodology comprises four steps. First, the authors generate several graph substructures (Walks, Set of Walks, Walks with backward and Set of Walks with backward). Second, the authors build neural language models to extract numerical vectors of the generated sequences by using word embedding techniques (Word2Vec and Doc2Vec) combined with term frequency-inverse document frequency (TF-IDF). Third, the authors use the well-known K-means algorithm to cluster the RDF graph. Finally, the authors extract the most relevant rdf:type from the grouped vertices to describe the semantics of each theme by generating the labels. Findings The experimental evaluation on the state-of-the-art data sets (AIFB, BGS and Conference) shows that the combination of Set of Walks-with-backward with TF-IDF and Doc2Vec techniques gives excellent results. In fact, the clustering results reach more than 97% and 90% in terms of purity and F-measure, respectively. Concerning the theme identification, the results show that by using the same combination, the purity and F-measure criteria reach more than 90% for all the considered data sets. Originality/value The originality of this paper lies in two aspects: first, a new machine learning pipeline for RDF data is presented; second, an efficient process to identify and extract relevant graph substructures from an RDF graph is proposed. The proposed techniques were combined with different neural language models to improve the accuracy and relevance of the obtained feature vectors that will be fed to the clustering mechanism.
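
The abstract reports clustering quality in terms of purity. As a minimal sketch of how that metric is commonly computed (the function name `purity` and the toy cluster and class labels below are illustrative assumptions, not taken from the paper), each cluster is credited with the size of its majority class, and the credits are summed and divided by the number of items:

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of items that fall in the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count how often each reference class occurs inside each cluster.
    counts_per_cluster = {}
    for cluster_id, label in zip(cluster_ids, class_labels):
        counts_per_cluster.setdefault(cluster_id, Counter())[label] += 1

    # Each cluster contributes the size of its most frequent class.
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six vertices, two clusters, two rdf:type-like classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["Person", "Person", "Project", "Project", "Project", "Project"]
score = purity(clusters, labels)
```

In this toy case one of the six vertices sits in a cluster dominated by another class, so the purity is 5/6. Purity alone rewards many small clusters, which is presumably why the abstract reports F-measure alongside it.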

Symmetry ◽  
2019 ◽  
Vol 11 (7) ◽  
pp. 926
Author(s):  
Kyoungsoo Bok ◽  
Junwon Kim ◽  
Jaesoo Yoo

Various resource description framework (RDF) partitioning methods have been studied for the efficient distributed processing of a large RDF graph. The RDF graph has symmetrical characteristics because subject and object can be used interchangeably if predicate is changed. This paper proposes a dynamic partitioning method of RDF graphs to support load balancing in distributed environments where data insertion and change continue to occur. The proposed method generates clusters and subclusters by considering the usage frequency of the RDF graph that are used by queries as the criteria to perform graph partitioning. It creates a cluster by grouping RDF subgraphs with higher usage frequency while creating a subcluster with lower usage frequency. These clusters and subclusters conduct load balancing by using the mean frequency of queries for the distributed server and conduct graph data partitioning by considering the size of the data stored in each distributed server. It also minimizes the number of edge-cuts connected to clusters and subclusters to minimize communication costs between servers. This solves the problem of data concentration to specific servers due to ongoing data changes and additions and allows efficient load balancing among servers. The performance results show that the proposed method significantly outperforms the existing partitioning methods in terms of query performance time in a distributed server.
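
The abstract names two quantities that a partitioning has to trade off: the number of edge-cuts between servers and the amount of data held by each server. The sketch below only illustrates how these two quantities can be measured for a given assignment; it is not the authors' partitioning algorithm, and the names `edge_cut`, `server_loads` and the toy triples are hypothetical.

```python
def edge_cut(triples, assignment):
    """Count triples whose subject and object are stored on different servers."""
    return sum(1 for s, _p, o in triples if assignment[s] != assignment[o])


def server_loads(assignment):
    """Number of vertices stored on each server."""
    loads = {}
    for server in assignment.values():
        loads[server] = loads.get(server, 0) + 1
    return loads


# Hypothetical toy RDF graph as (subject, predicate, object) triples.
triples = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("c", "worksAt", "d"),
    ("d", "locatedIn", "e"),
]

# Two candidate assignments of vertices to servers 0 and 1.
contiguous = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
alternating = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0}

cut_contiguous = edge_cut(triples, contiguous)    # only c -> d crosses servers
cut_alternating = edge_cut(triples, alternating)  # every edge crosses servers
loads_contiguous = server_loads(contiguous)
```

Both assignments place three vertices on one server and two on the other, yet the first cuts one edge and the second cuts all four, which is the kind of difference that drives communication cost between servers.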


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Yuanxin Ouyang ◽  
Hongbo Zhang ◽  
Wenge Rong ◽  
Xiang Li ◽  
Zhang Xiong

Purpose The purpose of this paper is to propose an attention alignment method for opinion mining of massive open online course (MOOC) comments. Opinion mining is essential for MOOC applications. In this study, the authors analyze some of the attention heads of bidirectional encoder representations from transformers (BERT) and explore how to use these attention heads to extract opinions from MOOC comments. Design/methodology/approach The approach proposed is based on an attention alignment mechanism with the following three stages: first, extracting original opinions from MOOC comments with dependency parsing. Second, constructing frequent sets and using the frequent sets to prune the opinions. Third, pruning the opinions and discovering new opinions with the attention alignment mechanism. Findings The experiments on the MOOC comments data sets suggest that the opinion mining approach based on an attention alignment mechanism can obtain a better F1 score. Moreover, the attention alignment mechanism can discover some of the opinions filtered incorrectly by the frequent sets, which means the attention alignment mechanism can overcome the shortcomings of dependency analysis and frequent sets. Originality/value To take full advantage of pretrained language models, the authors propose an attention alignment method for opinion mining and combine this method with dependency analysis and frequent sets to improve the effectiveness. Furthermore, the authors conduct extensive experiments on different combinations of methods. The results show that the attention alignment method can effectively overcome the shortcomings of dependency analysis and frequent sets.


2017 ◽  
Vol 1 (2) ◽  
pp. 84-103 ◽  
Author(s):  
Dong Wang ◽  
Lei Zou ◽  
Dongyan Zhao

Abstract The SPARQL Protocol and RDF Query Language (SPARQL) allows users to issue a structural query over a resource description framework (RDF) graph. However, the lack of a spatiotemporal query language limits the usage of RDF data in spatiotemporal-oriented applications. As the spatiotemporal information continuously increases in RDF data, it is necessary to design an effective and efficient spatiotemporal RDF data management system. In this paper, we formally define the spatiotemporal information-integrated RDF data, introduce a spatiotemporal query language that extends the SPARQL language with spatiotemporal assertions to query spatiotemporal information-integrated RDF data, and design a novel index and the corresponding query algorithm. The experimental results on a large, real RDF graph integrating spatial and temporal information (> 180 million triples) confirm the superiority of our approach. gst-store outperforms its competitors by more than 20%-30% in most cases.


2004 ◽  
Vol 101 (Supplement3) ◽  
pp. 326-333 ◽  
Author(s):  
Klaus D. Hamm ◽  
Gunnar Surber ◽  
Michael Schmücking ◽  
Reinhard E. Wurm ◽  
Rene Aschenbach ◽  
...  

Object. Innovative new software solutions may enable image fusion to produce the desired data superposition for precise target definition and follow-up studies in radiosurgery/stereotactic radiotherapy in patients with intracranial lesions. The aim is to integrate the anatomical and functional information completely into the radiation treatment planning and to achieve an exact comparison for follow-up examinations. Special conditions and advantages of BrainLAB's fully automatic image fusion system are evaluated and described for this purpose. Methods. In 458 patients, the radiation treatment planning and some follow-up studies were performed using an automatic image fusion technique involving the use of different imaging modalities. Each fusion was visually checked and corrected as necessary. The computerized tomography (CT) scans for radiation treatment planning (slice thickness 1.25 mm), as well as stereotactic angiography for arteriovenous malformations, were acquired using head fixation with stereotactic arc or, in the case of stereotactic radiotherapy, with a relocatable stereotactic mask. Different magnetic resonance (MR) imaging sequences (T1, T2, and fluid-attenuated inversion-recovery images) and positron emission tomography (PET) scans were obtained without head fixation. Fusion results and the effects on radiation treatment planning and follow-up studies were analyzed. The precision level of the results of the automatic fusion depended primarily on the image quality, especially the slice thickness and the field homogeneity when using MR images, as well as on patient movement during data acquisition. Fully automated image fusion of different MR, CT, and PET studies was performed for each patient. Only in a few cases was it necessary to correct the fusion manually after visual evaluation. These corrections were minor and did not materially affect treatment planning. 
High-quality fusion of thin slices of a region of interest with a complete head data set could be performed easily. The target volume for radiation treatment planning could be accurately delineated using multimodal information provided by CT, MR, angiography, and PET studies. The fusion of follow-up image data sets yielded results that could be successfully compared and quantitatively evaluated. Conclusions. Depending on the quality of the originally acquired image, automated image fusion can be a very valuable tool, allowing for fast (∼ 1–2 minute) and precise fusion of all relevant data sets. Fused multimodality imaging improves the target volume definition for radiation treatment planning. High-quality follow-up image data sets should be acquired for image fusion to provide exactly comparable slices and volumetric results that will contribute to quality control.


2019 ◽  
Vol 45 (9) ◽  
pp. 1183-1198
Author(s):  
Gaurav S. Chauhan ◽  
Pradip Banerjee

Purpose Recent papers on target capital structure show that debt ratio seems to vary widely in space and time, implying that the functional specifications of target debt ratios are of little empirical use. Further, target behavior cannot be adjudged correctly using debt ratios, as they could revert due to mechanical reasons. The purpose of this paper is to develop an alternative testing strategy to test the target capital structure. Design/methodology/approach The authors make use of a major “shock” to the debt ratios as an event and think of a subsequent reversion as a movement toward a mean or target debt ratio. By doing this, the authors no longer need to identify target debt ratios as a function of firm-specific variables or any other rigid functional form. Findings Similar to the broad empirical evidence in developed economies, there is no perceptible and systematic mean reversion by Indian firms. However, unlike developed countries, proportionate usage of debt to finance firms’ marginal financing deficits is extensive; equity is used rather sparingly. Research limitations/implications The trade-off theory could be convincingly refuted at least for the emerging market of India. The paper stimulates further research on finding reasons for the specific financing behavior of emerging market firms. Practical implications The results show that firms’ financing choices depend not only on their own firm-specific variables but also on the financial markets in which they operate. Originality/value This study attempts to assess mean reversion in debt ratios in a unique but reassuring manner. The results are confirmed by extensive calibration of the testing strategy using simulated data sets.


2014 ◽  
Vol 6 (2) ◽  
pp. 211-233
Author(s):  
Thomas M. Bayer ◽  
John Page

Purpose – This paper aims to analyze the evolution of the marketing of paintings and related visual products from its nascent stages in England around 1700 to the development of the modern art market by 1900, with a brief discussion connecting to the present. Design/methodology/approach – Sources consist of a mixture of primary and secondary sources as well as a series of econometric and statistical analyses of specifically constructed and unique data sets that list nearly more than 50,000 different sales of paintings during this period. One set records sales of paintings at various English auction houses during the eighteenth and nineteenth centuries; the second set consists of all purchases and sales of paintings recorded in the stock books of the late nineteenth-century London art dealer, Arthur Tooth, during the years of 1870/1871. The authors interpret the data under a commoditization model first introduced by Igor Kopytoff in 1986 that posits that markets and their participants evolve toward maximizing the efficiency of their exchange process within the prevailing exchange technology. Findings – We found that artists were largely responsible for a series of innovations in the art market that replaced the prevailing direct relationship between artists and patron with a modern market for which painters produced works on speculation to be sold by enterprising middlemen to an anonymous public. In this process, artists displayed a remarkable creativity and a seemingly instinctive understanding of the principles of competitive marketing that should dispel the erroneous but persistent notion that artistic genius and business savvy are incompatible. Research limitations/implications – A similar marketing analysis could be done of the development of the art markets of other leading countries, such as France, Italy and Holland, as well as the current developments of the art market. 
Practical implications – The same process of the development of the art market in England is now occurring in Latin America and China. Also, the commoditization process continues in the present, now using the Internet and worldwide art dealers. Originality/value – This is the first article to trace the historical development of the marketing of art in all of its components: artists, dealers, artist organizations, museums, curators, art critics, the media and art historians.


2016 ◽  
Vol 12 (2) ◽  
pp. 126-149 ◽  
Author(s):  
Masoud Mansoury ◽  
Mehdi Shajari

Purpose This paper aims to improve the recommendation performance for cold-start users and controversial items. Collaborative filtering (CF) generates recommendations on the basis of similarity between users. It uses the opinions of similar users to generate the recommendation for an active user. As a similarity model or a neighbor selection function is the key element for effectiveness of CF, many variations of CF have been proposed. However, these methods are not very effective, especially for users who provide few ratings (i.e. cold-start users). Design/methodology/approach A new user similarity model is proposed that focuses on improving recommendation performance for cold-start users and controversial items. To show the validity of the authors’ similarity model, they conducted some experiments and showed the effectiveness of this model in calculating similarity values between users even when only a few ratings are available. In addition, the authors applied their user similarity model to a recommender system and analyzed its results. Findings Experiments on two real-world data sets are conducted and compared with some other CF techniques. The results show that the authors’ approach outperforms previous CF techniques in the coverage metric while preserving accuracy for cold-start users and controversial items. Originality/value In the proposed approach, the conditions in which CF is unable to generate accurate recommendations are addressed. These conditions affect CF performance adversely, especially in the cold-start users’ condition. The authors show that their similarity model overcomes CF weaknesses effectively and improves its performance even in the cold-start users’ condition.
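
To make the cold-start problem concrete, the sketch below computes a conventional Pearson similarity over co-rated items. This is a standard CF baseline shown for illustration, not the similarity model the authors propose; the function name `pearson_similarity` and the toy ratings are assumptions.

```python
from math import sqrt


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over items rated by both users.

    Returns None when it is undefined: fewer than two co-rated items,
    or no variance in one user's co-rated ratings.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return None

    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - mean_x) ** 2 for x in xs)) * sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    if denominator == 0:
        return None
    return numerator / denominator


# Hypothetical ratings on a 1-5 scale, keyed by item id.
alice = {"i1": 5, "i2": 3, "i3": 4}
bob = {"i1": 4, "i2": 2, "i3": 3}
new_user = {"i1": 5}  # cold-start user with a single rating

sim_established = pearson_similarity(alice, bob)
sim_cold_start = pearson_similarity(alice, new_user)
```

Alice and Bob rate the same items with the same pattern, so their similarity is close to 1. The new user shares only one rated item with Alice, so the baseline cannot produce a similarity value at all, which is the situation the abstract describes for users who provide few ratings.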


2015 ◽  
Vol 16 (1) ◽  
pp. 62-85 ◽  
Author(s):  
Cheri Jeanette Duncan ◽  
Genya Morgan O'Gara

Purpose – The purpose of this paper is to examine the development of a flexible collections assessment rubric comprised of a suite of tools for more consistently and effectively evaluating and expressing a holistic value of library collections to a variety of constituents, from administrators to faculty and students, with particular emphasis on the use of data already being collected at libraries to “take the temperature” of how responsive collections are in supporting institutional goals. Design/methodology/approach – Using a literature review, internal and external conversations, several collections pilot projects, and a variety of other investigative mechanisms, this paper explores methods for creating a more flexible, holistic collection development and assessment model using both qualitative and quantitative data. Findings – The products of scholarship that academic libraries include in their collections are expanding exponentially and range from journals and monographs in all formats, to databases, data sets, digital text and images, streaming media, visualizations and animations. Content is also being shared in new ways and on a variety of platforms. Yet the framework for evaluating this new landscape of scholarly output is in its infancy. So, how do libraries develop and assess collections in a consistent, holistic, yet agile, manner? Libraries must employ a variety of mechanisms to ensure this goal, while remaining flexible in adapting to the shifting collections environment. Originality/value – Insofar as the authors are aware, this is the first paper to examine an agile, holistic approach to collections using both qualitative and quantitative data.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Tressy Thomas ◽  
Enayat Rajabi

Purpose The primary aim of this study is to review the studies from different dimensions including type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding about how well the proposed framework is evaluated and what type and ratio of missingness are addressed in the proposals. The review questions in this study are (1) what are the ML-based imputation methods studied and proposed during 2010–2020? (2) How the experimentation setup, characteristics of data sets and missingness are employed in these studies? (3) What metrics were used for the evaluation of imputation method? Design/methodology/approach The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers totaling at 2,883. Most of the papers at this stage were not exactly an MVI technique relevant to this study. The literature reviews are first scanned in the title for relevancy, and 306 literature reviews were identified as appropriate. Upon reviewing the abstract text, 151 literature reviews that are not eligible for this study are dropped. This resulted in 155 research papers suitable for full-text review. From this, 117 papers are used in assessment of the review questions. Findings This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are most used evaluation metrics in these studies. For experimentation, majority of the studies sourced the data sets from publicly available data set repositories. A common approach is that the complete data set is set as baseline to evaluate the effectiveness of imputation on the test data sets with artificially induced missingness. 
The data set size and missingness ratio varied across the experimentations, while missing datatype and mechanism are pertaining to the capability of imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge. Originality/value It is understood from the review that there is no single universal solution to missing data problem. Variants of ML approaches work well with the missingness based on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputations based on k-nearest neighbors (kNN) and clustering algorithms which are simple and easy to implement make it popular across various domains.
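
The evaluation setup described above (a complete data set as baseline, artificially induced missingness, kNN imputation, RMSE as the metric) can be sketched as follows. This is a simplified illustration under stated assumptions: neighbours are drawn only from fully observed rows, distances use only the observed columns, and the names `knn_impute`, `rmse` and the toy data are hypothetical rather than taken from any reviewed paper.

```python
from math import sqrt


def knn_impute(rows, k=2):
    """Fill None cells with the column mean over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one fully observed row is required")

    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(other, row=row, observed=observed):
            # Euclidean distance restricted to the columns observed in `row`.
            return sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        imputed.append([
            v if v is not None else sum(n[j] for n in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ])
    return imputed


def rmse(truth, estimate, cells):
    """Root mean square error over the (row, column) cells that were masked."""
    squared = [(truth[i][j] - estimate[i][j]) ** 2 for i, j in cells]
    return sqrt(sum(squared) / len(squared))


# Hypothetical complete baseline data set.
baseline = [[1.0, 2.0], [2.0, 4.5], [3.0, 6.0], [10.0, 20.0]]

# Artificially induce missingness in one cell, then impute and score it.
masked_cells = [(1, 1)]
test_data = [list(r) for r in baseline]
for i, j in masked_cells:
    test_data[i][j] = None

filled = knn_impute(test_data, k=2)
error = rmse(baseline, filled, masked_cells)
```

Here the two nearest complete rows to `[2.0, None]` are `[1.0, 2.0]` and `[3.0, 6.0]`, so the masked cell is filled with 4.0 against a true value of 4.5, giving an RMSE of 0.5 on the masked cells.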


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Gangadhar Ch ◽  
S. Jana ◽  
Sankararao Majji ◽  
Prathyusha Kuncha ◽  
Fantin Irudaya Raj E. ◽  
...  

Purpose For the first time in a decade, a new form of pneumonia virus, coronavirus (COVID-19), appeared in Wuhan, China. To date, it has affected millions of people and resulted in thousands of deaths around the world. Stopping the spread of this virus requires isolating infected people. Computed tomography (CT) imaging is very accurate in revealing the details of the lungs and allows oncologists to detect COVID. However, the analysis of CT scans, which can include hundreds of images, may cause delays in hospitals. The main purpose of this work is to show how the use of artificial intelligence (AI) in radiology could help to detect COVID-19-positive cancer patients. Design/methodology/approach A CT scan is a medical imaging procedure that gives a three-dimensional (3D) representation of the lungs for clinical purposes. The volumetric 3D data sets can be regarded as axial, coronal and transverse data sets. By using AI, the presence of the virus can be diagnosed. Findings The paper discusses the use of AI for COVID-19; the CT classification issue and vaccination details of COVID-19 are also detailed. Originality/value The originality of the work is that all the data were collected genuinely and the research was carried out using the authors' own methodology.

