Breakthroughs and Limitations of XML Grammar Similarity

Author(s):  
Joe Tekli

W3C’s XML (eXtensible Mark-up Language) has recently gained unparalleled importance as a fundamental standard for efficient data management and exchange. The use of XML covers data representation and storage, database information interchange, data filtering, as well as Web application interaction and interoperability. XML has been intensively exploited in the multimedia field as an effective and standard means for indexing, storing, and retrieving complex multimedia objects. SVG, SMIL, X3D, and MPEG-7 are only some examples of XML-based multimedia data representations. With the ever-increasing Web exploitation of XML, there is an emerging need to automatically process XML documents and grammars for similarity classification and clustering, information extraction, and search functions. All these applications require some notion of structural similarity, since XML represents semi-structured data. In this area, most work has focused on estimating similarity between XML documents (i.e., the data layer), whereas few efforts have been dedicated to comparing XML grammars (i.e., the type layer). Computing the structural similarity between XML documents is relevant in several scenarios such as change management (Chawathe, Rajaraman, Garcia-Molina, & Widom, 1996; Cobéna, Abiteboul, & Marian, 2002), XML structural query systems (finding and ranking results according to their similarity) (Schlieder, 2001; Zhang, Li, Cao, & Zhu, 2003), as well as the structural clustering of XML documents gathered from the Web (Dalamagas, Cheng, Winkel, & Sellis, 2006; Nierman & Jagadish, 2002). On the other hand, estimating similarity between XML grammars is useful for data integration purposes, in particular the integration of DTDs/schemas that contain nearly or exactly the same information but are constructed using different structures (Doan, Domingos, & Halevy, 2001; Melnik, Garcia-Molina, & Rahm, 2002).
It is also exploited in data warehousing (mapping data sources to warehouse schemas) as well as in XML data maintenance and schema evolution, where differences/updates between versions of a given grammar/schema must be detected in order to revalidate the corresponding XML documents (Rahm & Bernstein, 2001). The goal of this article is to briefly review XML grammar structural similarity approaches. Here, we provide a unified view of the problem, assessing the different aspects and techniques related to XML grammar comparison. The remainder of this article is organized as follows. The second section presents an overview of XML grammar similarity, otherwise known as XML schema matching. The third section reviews the state of the art in XML grammar comparison methods. The fourth section discusses the main criteria characterizing the effectiveness of XML grammar similarity approaches. Conclusions and current research directions are covered in the last section.

2021 ◽  
pp. 1-13
Author(s):  
Yikai Zhang ◽  
Yong Peng ◽  
Hongyu Bian ◽  
Yuan Ge ◽  
Feiwei Qin ◽  
...  

Concept factorization (CF) is an effective matrix factorization model which has been widely used in many applications. In CF, a linear combination of data points serves as the dictionary, so that CF can be performed both in the original feature space and in a reproducing kernel Hilbert space (RKHS). Conventional CF treats each dimension of the feature vector equally during the data reconstruction process, which may conflict with the intuition that different features have different discriminative abilities and therefore contribute differently to pattern recognition. In this paper, we introduce an auto-weighting variable into the conventional CF objective function to adaptively learn the contributions of different features, and propose a new model termed Auto-Weighted Concept Factorization (AWCF). In AWCF, on one hand, feature importance can be quantitatively measured by the auto-weighting variable, in which features with better discriminative abilities are assigned larger weights; on the other hand, a more efficient data representation can be obtained to depict the semantic information of the data. A detailed optimization procedure for the AWCF objective function is derived, and its complexity and convergence are analyzed. Experiments are conducted on both synthetic and representative benchmark data sets, and the clustering results demonstrate the effectiveness of AWCF in comparison with related models.
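
For orientation, the conventional CF objective referred to in this abstract is usually written as below, with the data matrix X holding one sample per column. The second expression is only an illustrative assumption about how a per-feature weight vector could enter the reconstruction term; the symbols \theta and \Theta are introduced here for illustration and are not taken from the paper, whose exact AWCF objective may differ.

```latex
% Conventional CF: X \in \mathbb{R}^{d \times n}, W \in \mathbb{R}^{n \times k}, V \in \mathbb{R}^{n \times k}
\min_{W \ge 0,\; V \ge 0} \; \left\| X - X W V^{\top} \right\|_F^2

% Illustrative feature-weighted variant (assumption, not the paper's stated objective):
% \Theta = \operatorname{diag}(\theta_1, \dots, \theta_d), \quad \theta_i \ge 0, \quad \sum_{i=1}^{d} \theta_i = 1
\min_{W \ge 0,\; V \ge 0,\; \theta} \; \left\| \Theta^{1/2} \left( X - X W V^{\top} \right) \right\|_F^2
```

In the first expression, the columns of XW are the learned concepts (each a nonnegative combination of data points) and the rows of V give each sample's representation over those concepts.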


Author(s):  
Amanda Galtman

Using XML as the source format for authoring technical publications creates opportunities to develop tools that provide analysis, author guidance, and visualization. This case study describes two web applications that take advantage of the XML source format of documents. The applications provide a browser-based tool for technical writers and editors in a 100-person documentation department of a software company. Compared to desktop tools, the web applications are more convenient for users and less affected by hard-to-predict inconsistencies among users' computers. One application analyzes file dependencies and produces custom reports that facilitate reorganizing files. The other helps authors visualize the network of topics in their documentation sets. Both applications rely on the XQuery language and its RESTXQ web API. The visualization application also uses JavaScript, including the jQuery and D3 libraries. After discussing what the applications do and why, this paper describes some architectural highlights, including how the different technologies fit together and exchange data.
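
The applications described here are written in XQuery. Purely to illustrate the idea of a file-dependency report over XML sources, the following is a minimal Python sketch. The file names, the `link` element, and the `href` attribute are hypothetical and do not come from the paper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical in-memory "files"; a real tool would read them from a repository.
SOURCES = {
    "intro.xml": '<topic><link href="install.xml"/><link href="usage.xml"/></topic>',
    "install.xml": '<topic><link href="usage.xml"/></topic>',
    "usage.xml": "<topic/>",
}


def outgoing_refs(xml_text):
    """Return the sorted href targets found on any element of one document."""
    root = ET.fromstring(xml_text)
    return sorted({el.get("href") for el in root.iter() if el.get("href")})


def dependency_report(sources):
    """Map each file to the files it references and the files that reference it."""
    uses = {name: outgoing_refs(text) for name, text in sources.items()}
    used_by = defaultdict(list)
    for name, targets in uses.items():
        for target in targets:
            used_by[target].append(name)
    return {
        name: {"uses": uses[name], "used_by": sorted(used_by[name])}
        for name in sources
    }


report = dependency_report(SOURCES)
```

A report of this shape shows which files would be affected by moving or renaming a given file, which is the kind of question a reorganization report has to answer.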


Data Mining ◽  
2013 ◽  
pp. 1-27
Author(s):  
Sangeetha Kutty ◽  
Richi Nayak ◽  
Tien Tran

With the increasing number of XML documents in varied domains, it has become essential to identify ways of finding interesting information in these documents. Data mining techniques can be used to derive this information. However, because of the semi-structured nature of XML documents, mining them is affected by the data model used to represent them. In this chapter, we present an overview of the various models of XML document representation, how these models are used for mining, and some of the issues and challenges inherent in these models. In addition, this chapter provides some insights into future data models of XML documents for effectively capturing their two important features, structure and content, for mining.


2009 ◽  
pp. 505-526
Author(s):  
Ji Zhang ◽  
Han Liu ◽  
Tok Wang Ling ◽  
Robert M. Bruckner ◽  
A Min Tjoa

In this article, we propose a framework, called XAR-Miner, for efficiently mining association rules (ARs) from XML documents. In XAR-Miner, raw data in the XML document are first preprocessed and transformed into either an Indexed XML Tree (IX-tree) or Multirelational Databases (Multi-DB), depending on the size of the XML document and the memory constraint of the system, for efficient data selection and AR mining. Concepts that are relevant to the AR mining task are generalized to produce generalized metapatterns. A suitable metric is devised for measuring the degree of concept generalization in order to prevent undergeneralization or overgeneralization. The resulting generalized metapatterns are used to generate large ARs that meet the support and confidence levels. A greedy algorithm is also presented to integrate data selection and large itemset generation, enhancing the efficiency of the AR mining process. The experiments conducted show that XAR-Miner is more efficient in performing a large number of AR mining tasks from XML documents than the state-of-the-art method of repetitively scanning through XML documents to perform each of the mining tasks.
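
The support and confidence levels mentioned above are the standard thresholds used in association rule mining. The sketch below is not XAR-Miner's algorithm: it is a brute-force illustration, on hypothetical toy transactions, of how single-item rules are filtered by those two thresholds. All names in it are assumptions made for the example.

```python
from itertools import combinations

# Toy transactions (hypothetical); each set holds the items of one record.
TRANSACTIONS = [
    {"book", "pen"},
    {"book", "pen", "ink"},
    {"book", "ink"},
    {"pen"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def mine_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # the pair is not a large (frequent) itemset
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


rules = mine_rules(TRANSACTIONS, min_support=0.5, min_confidence=0.6)
```

Each call rescans every transaction, which is the repeated-scan cost that indexing and data-selection structures of the kind described in the abstract aim to avoid.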


Author(s):  
George Pallis ◽  
Konstantina Stoupa ◽  
Athena Vakali

The Internet (and networks overall) is currently the core medium for data and knowledge exchange. XML is currently the most popular standard for Web document representation and is rapidly becoming a standard for data representation and exchange over the Internet. One of the main issues is the management of XML documents, in particular their storage and access. Among data management issues, storage and security techniques are of particular importance, since the performance of the overall XML-based Web information system relies on them. Storage mainly relies on typical database management systems (DBMSs), although XML documents can also be stored in other storage environments (such as file systems and LDAP directories) (Amer-Yahia & Fernandez, 2002; Kanne & Moerkotte, 2000; Silberschatz, Korth & Sudarshan, 2002). Additionally, in order to guarantee the security of XML data located in any of the above storage environments, the majority of implementations also provide appropriate access control. Most storage systems cooperate with access control modules implementing various models (Joshi, Aref, Ghafoor & Spafford, 2001), whereas few commercial access control products are available. However, there are some standardized XML-based access control languages that can be adopted by most tools.

