Strategic Advancements in Utilizing Data Mining and Warehousing Technologies
Latest Publications

Total documents: 23 (five years: 0)
H-index: 1 (five years: 0)
Published by IGI Global
ISBN: 9781605667171, 9781605667188

Author(s): Wei Mingjun, Chai Lei, Wei Renying, Huo Wang

Our team won the Grand Champion award (tie) of the PAKDD-2007 data mining competition. The task was to score the credit card customers of a consumer finance company according to the likelihood that they will take up the home loans offered by the company. This report presents our solution to this business problem. TreeNet and logistic regression are the data mining algorithms used in the project. The final score is based on a cross-algorithm ensemble of two within-algorithm ensembles, one of TreeNet models and one of logistic regression models. Finally, some discussion arising from our solution is presented.
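The two-level combination described above can be sketched as plain score averaging. This is a hypothetical illustration, not the competition code: the model scores below are invented, and TreeNet (a commercial boosted-tree package) is represented only by placeholder numbers.

```python
def within_algorithm_ensemble(scores_per_model):
    """Average the customer scores produced by several runs of one algorithm."""
    n_models = len(scores_per_model)
    n_customers = len(scores_per_model[0])
    return [sum(model[i] for model in scores_per_model) / n_models
            for i in range(n_customers)]

def cross_algorithm_ensemble(treenet_scores, logreg_scores):
    """Average the two within-algorithm ensembles into the final score."""
    ens_tn = within_algorithm_ensemble(treenet_scores)
    ens_lr = within_algorithm_ensemble(logreg_scores)
    return [(a + b) / 2.0 for a, b in zip(ens_tn, ens_lr)]

# Toy take-up scores for three customers, two runs per algorithm.
treenet = [[0.9, 0.2, 0.5], [0.7, 0.4, 0.5]]
logreg = [[0.8, 0.3, 0.6], [0.6, 0.1, 0.4]]
print(cross_algorithm_ensemble(treenet, logreg))  # highest score = most likely to take up the loan
```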


Author(s): Lu Jing, Chen Weiru, Adjei Osei, Keech Malcolm

Sequential pattern mining is an important data mining technique used to identify frequently observed sequential occurrences of items across ordered transactions over time. It has been extensively studied in the literature, and a diversity of algorithms exists. However, more complex structural patterns are often hidden behind sequences. This article begins by introducing a model for the representation of sequential patterns, the Sequential Patterns Graph, which motivates the search for new structural relation patterns. An integrative framework for the discovery of these patterns, Postsequential Patterns Mining, is then described, which underpins the postprocessing of sequential patterns. A corresponding data mining method based on sequential pattern postprocessing is proposed and shown to be effective in the search for concurrent patterns. Experiments conducted on three component algorithms demonstrate that sequential-pattern-based concurrent pattern mining provides an efficient method for structural knowledge discovery.
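A minimal illustration of the support counting that underlies sequential pattern mining (toy data and a naive scan, not the article's algorithms):

```python
def support(pattern, sequences):
    """Fraction of sequences containing `pattern` as an ordered (possibly
    gapped) subsequence, the usual support measure in sequential pattern
    mining."""
    def contains(seq, pat):
        i = 0
        for item in seq:
            if i < len(pat) and item == pat[i]:
                i += 1
        return i == len(pat)
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# Three toy transaction sequences over items a, b, c.
seqs = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(support(["a", "c"], seqs))  # "a" followed later by "c" in all 3: 1.0
```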


Author(s): Zhang Xiaodan, Hu Xiaohua, Xia Jiali, Zhou Xiaohua, Achananuparp Palakorn

In this article, we present a graph-based knowledge representation for biomedical digital library literature clustering. An efficient clustering method is developed to identify the ontology-enriched k-highest density term subgraphs that capture the core semantic relationship information about each document cluster. The distance between each document and the k term graph clusters is calculated. A document is then assigned to the closest term cluster. The extensive experimental results on two PubMed document sets (Disease10 and OHSUMED23) show that our approach is comparable to spherical k-means. The contributions of our approach are the following: (1) we provide two corpus-level graph representations to improve document clustering, a term co-occurrence graph and an abstract-title graph; (2) we develop an efficient and effective document clustering algorithm by identifying k distinguishable class-specific core term subgraphs using terms’ global and local importance information; and (3) the identified term clusters give a meaningful explanation for the document clustering results.
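One of the two corpus-level graphs named above, the term co-occurrence graph, can be sketched as a document-level co-occurrence count. This is a simplified stand-in: the article's actual construction, its ontology enrichment, and the abstract-title graph are not shown here.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(documents):
    """Build a corpus-level term co-occurrence graph: nodes are terms,
    and the weight of edge (t1, t2) counts the documents containing both."""
    edges = defaultdict(int)
    for doc in documents:
        for t1, t2 in combinations(sorted(set(doc)), 2):
            edges[(t1, t2)] += 1
    return dict(edges)

# Toy biomedical term lists standing in for tokenized PubMed abstracts.
docs = [["gene", "protein", "cancer"],
        ["gene", "cancer"],
        ["protein", "enzyme"]]
graph = cooccurrence_graph(docs)
print(graph[("cancer", "gene")])  # 2 documents contain both terms
```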


Author(s): Zhang Xiaodan, Jing Liping, Hu Xiaohua, Ng Michael, Xia Jiali, ...

Recent research shows that an ontology, used as background knowledge, can improve document clustering quality through its concept hierarchy. Previous studies take term semantic similarity as an important measure for incorporating domain knowledge into the clustering process, for example in clustering initialization and term re-weighting. However, few studies have focused on how different types of term similarity measures affect clustering performance for a given domain. In this article, we conduct a comparative study of how different term semantic similarity measures, including path-based, information-content-based and feature-based measures, affect document clustering. Re-weighting the terms of a document vector is an important method for integrating a domain ontology into the clustering process: the weight of a term is augmented by the weights of the concepts it co-occurs with. Spherical k-means is used to evaluate document-vector re-weighting on two real-world datasets, Disease10 and OHSUMED23. Experimental results on nine different semantic measures show that: (1) no single type of similarity measure significantly outperforms the others; (2) several similarity measures have notably more stable performance than the others; and (3) term re-weighting has positive effects on medical document clustering, although these may not be significant when documents contain few terms.
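The term re-weighting step can be sketched as follows, assuming a precomputed similarity score per term pair. The terms and the similarity value below are invented for illustration; the article evaluates nine real measures derived from a domain ontology.

```python
def reweight(doc_vector, similarity):
    """Augment each term's weight with the weights of semantically related
    terms that co-occur in the same document.
    doc_vector: term -> weight; similarity: (term, term) -> score in [0, 1]."""
    new_vector = dict(doc_vector)
    for (t1, t2), sim in similarity.items():
        if t1 in doc_vector and t2 in doc_vector:
            new_vector[t1] += sim * doc_vector[t2]
            new_vector[t2] += sim * doc_vector[t1]
    return new_vector

vec = {"myocardial": 1.0, "infarction": 0.8, "aspirin": 0.5}
sim = {("myocardial", "infarction"): 0.9}  # invented similarity score
print(reweight(vec, sim))  # related pair boosted, "aspirin" unchanged
```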


Author(s): Ravat Franck, Teste Olivier, Tournier Ronan, Zurfluh Gilles

This article deals with multidimensional analyses. The data to be analyzed are designed according to a conceptual model as a constellation of facts and dimensions, where the dimensions are composed of multiple hierarchies. This model supports a query algebra defining a minimal core of operators that produce multidimensional tables for displaying the analyzed data. This user-oriented algebra also supports complex analyses through advanced and binary operators. A graphical language based on this algebra is provided to ease the specification of multidimensional queries: graphical manipulations are expressed over a constellation schema and produce multidimensional tables.
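The end product of such operators, a multidimensional table over a fact and two dimensions, can be sketched as a simple cross-tabulation. This is a minimal stand-in for illustration only; the function name, the sales facts and the dimension names below are all hypothetical and do not come from the article's algebra.

```python
def mtable(facts, row_dim, col_dim, measure):
    """Cross-tabulate a fact table: one cell per (row, column) pair,
    summing the chosen measure."""
    table = {}
    for fact in facts:
        key = (fact[row_dim], fact[col_dim])
        table[key] = table.get(key, 0) + fact[measure]
    return table

# Hypothetical sales facts with two dimensions (year, city) and one measure.
sales = [{"year": 2007, "city": "Paris", "amount": 10},
         {"year": 2007, "city": "Lyon", "amount": 4},
         {"year": 2008, "city": "Paris", "amount": 7}]
print(mtable(sales, "year", "city", "amount"))
```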


Author(s): Zhang Zhi-Zhuo, Chen Qiong, Ke Shang-Fu, Wu Yi-Jun, Qi Fei

Ranking potential customers has become an effective tool for company decision makers when designing marketing strategies. The task of the PAKDD Competition 2007 is a cross-selling problem between credit cards and home loans, which can also be treated as a problem of ranking potential customers. This article proposes a three-level ranking model, named Group-Ensemble, to handle such problems. In our model, Bagging, RankBoost and Expending Regression Tree are applied to solve crucial data mining problems such as data imbalance, missing values and time-variant distributions. The article verifies the model with the data provided by the PAKDD Competition 2007 and shows that Group-Ensemble can make a selling strategy much more efficient.


Author(s): Nikulin Vladimir

Imbalanced data represent a significant problem because the corresponding classifier tends to ignore patterns that have smaller representation in the training set. We propose to consider a large number of balanced training subsets in which representatives of the larger pattern are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent random subsets and whose columns represent features. Based on this matrix, we assess the stability of the influence of each feature, and we propose to keep in the model only features with stable influence. The final model is an average of the single models, which are not necessarily linear regressions. This model proved efficient and competitive during the PAKDD-2007 Data Mining Competition.
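The balanced-subset-and-stability idea can be sketched as follows. For brevity, this stand-in scores each feature on each subset by the difference of class means rather than fitting a full linear regression, and all data are synthetic; it only illustrates the mechanism of measuring sign stability across random balanced subsets.

```python
import random

def feature_stability(pos, neg, n_subsets=100, seed=1):
    """For each of `n_subsets` balanced subsets (majority class randomly
    down-sampled), compute a per-feature score and return, per feature,
    the fraction of subsets agreeing on the score's sign."""
    rng = random.Random(seed)
    n_features = len(pos[0])
    signs = [[] for _ in range(n_features)]
    for _ in range(n_subsets):
        sample = rng.sample(neg, len(pos))  # balance by down-sampling
        for j in range(n_features):
            coef = (sum(row[j] for row in pos) / len(pos)
                    - sum(row[j] for row in sample) / len(sample))
            signs[j].append(1 if coef > 0 else -1)
    stability = []
    for feature_signs in signs:
        frac_pos = feature_signs.count(1) / len(feature_signs)
        stability.append(max(frac_pos, 1 - frac_pos))
    return stability

# Synthetic data: feature 0 separates the classes, feature 1 is noise.
data_rng = random.Random(7)
pos = [[1.0 + data_rng.random(), data_rng.random()] for _ in range(20)]
neg = [[data_rng.random(), data_rng.random()] for _ in range(200)]
print(feature_stability(pos, neg))  # feature 0's influence is stable (1.0)
```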


Author(s): Zhang Junping, Li Guo-Zheng

The PAKDD Competition 2007 involved predicting customers’ propensity to take up a home loan, given a collection of data from credit card users. The problem is rather difficult to address because 1) the data set is extremely imbalanced; 2) the features are of mixed types; and 3) there are many missing values. This article gives an overview of the competition, consisting of three main parts: 1) the background of the database and some statistical results of the participants are introduced; 2) an analysis is given from the viewpoint of the data preparation, resampling/reweighting and ensemble learning employed by different participants; and 3) finally, some business insights are highlighted.


Author(s): A. Gadish David

The quality of vector spatial data can be assessed using the data contained within one or more data warehouses. Spatial consistency includes topological consistency, or conformance to topological rules (Hadzilacos & Tryfona, 1992; Rodríguez, 2005). Detection of inconsistencies in vector spatial data is an important step toward improving spatial data quality (Redman, 1992; Veregin, 1991). An approach for detecting topo-semantic inconsistencies in vector spatial data is presented. Inconsistencies between pairs of neighboring vector spatial objects are detected by comparing the relations between spatial objects to rules (Klein, 2007). A property of spatial objects, called elasticity, is defined to measure the contribution of each object to inconsistent behavior. A method for grouping multiple objects that are inconsistent with one another, based on their elasticity, is proposed. The ability to detect groups of neighboring objects that are inconsistent with one another can later serve as the basis of an effort to increase the quality of spatial data sets stored in data warehouses, as well as the quality of the results of data mining processes.


Author(s): Pighin Maurizio, Ieronutti Lucio

The design and configuration of a data warehouse can be difficult tasks, especially in the case of very large databases and in the presence of redundant information. In particular, the choice of which attributes should be considered dimensions and which measures is not trivial, and it can heavily influence the effectiveness of the final system. In this article, we propose a methodology aimed at supporting this design choice and at deriving information on the overall quality of the final data warehouse. We tested our proposal on three real-world commercial ERP databases.

