Strategic Advancements in Utilizing Data Mining and Warehousing Technologies
Latest Publications

Total documents: 23 (five years: 0)
H-index: 1 (five years: 0)
Published by IGI Global
ISBN: 9781605667171, 9781605667188

Author(s): Wei Mingjun, Chai Lei, Wei Renying, Huo Wang

Our team won the Grand Champion award (tie) of the PAKDD-2007 data mining competition. The task was to score the credit card customers of a consumer finance company according to the likelihood that they will take up the home loans offered by the company. This report presents our solution to this business problem. TreeNet and logistic regression are the data mining algorithms used in the project. The final score is based on a cross-algorithm ensemble of two within-algorithm ensembles, one of TreeNet models and one of logistic regression models. Finally, some discussion arising from our solution is presented.
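The two-level combination described above can be sketched as plain score averaging. This is a hypothetical illustration, not the competition code: the model scores below are invented, and TreeNet (a commercial boosted-tree package) is represented only by placeholder numbers.

```python
def within_algorithm_ensemble(scores_per_model):
    """Average the customer scores produced by several runs of one algorithm."""
    n_models = len(scores_per_model)
    n_customers = len(scores_per_model[0])
    return [sum(model[i] for model in scores_per_model) / n_models
            for i in range(n_customers)]

def cross_algorithm_ensemble(treenet_scores, logreg_scores):
    """Average the two within-algorithm ensembles into the final score."""
    ens_tn = within_algorithm_ensemble(treenet_scores)
    ens_lr = within_algorithm_ensemble(logreg_scores)
    return [(a + b) / 2.0 for a, b in zip(ens_tn, ens_lr)]

# Toy take-up scores for three customers, two runs per algorithm.
treenet = [[0.9, 0.2, 0.5], [0.7, 0.4, 0.5]]
logreg = [[0.8, 0.3, 0.6], [0.6, 0.1, 0.4]]
print(cross_algorithm_ensemble(treenet, logreg))  # highest score = most likely to take up the loan
```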


Author(s): Lu Jing, Chen Weiru, Adjei Osei, Keech Malcolm

Sequential pattern mining is an important data mining technique used to identify frequently observed sequential occurrences of items across ordered transactions over time. It has been extensively studied in the literature, and a diversity of algorithms exists. However, more complex structural patterns are often hidden behind sequences. This article begins by introducing a model for the representation of sequential patterns, the Sequential Patterns Graph, which motivates the search for new structural relation patterns. An integrative framework for the discovery of these patterns, Postsequential Patterns Mining, is then described, which underpins the postprocessing of sequential patterns. A corresponding data mining method based on sequential pattern postprocessing is proposed and shown to be effective in the search for concurrent patterns. Experiments conducted on three component algorithms demonstrate that sequential-pattern-based concurrent pattern mining provides an efficient method for structural knowledge discovery.
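A minimal illustration of the support counting that underlies sequential pattern mining (toy data and a naive scan, not the article's algorithms):

```python
def support(pattern, sequences):
    """Fraction of sequences containing `pattern` as an ordered (possibly
    gapped) subsequence, the usual support measure in sequential pattern
    mining."""
    def contains(seq, pat):
        i = 0
        for item in seq:
            if i < len(pat) and item == pat[i]:
                i += 1
        return i == len(pat)
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# Three toy transaction sequences over items a, b, c.
seqs = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(support(["a", "c"], seqs))  # "a" followed later by "c" in all 3: 1.0
```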


Author(s): Zhang Xiaodan, Hu Xiaohua, Xia Jiali, Zhou Xiaohua, Achananuparp Palakorn

In this article, we present a graph-based knowledge representation for biomedical digital library literature clustering. An efficient clustering method is developed to identify the ontology-enriched k-highest density term subgraphs that capture the core semantic relationship information about each document cluster. The distance between each document and the k term graph clusters is calculated. A document is then assigned to the closest term cluster. The extensive experimental results on two PubMed document sets (Disease10 and OHSUMED23) show that our approach is comparable to spherical k-means. The contributions of our approach are the following: (1) we provide two corpus-level graph representations to improve document clustering, a term co-occurrence graph and an abstract-title graph; (2) we develop an efficient and effective document clustering algorithm by identifying k distinguishable class-specific core term subgraphs using terms’ global and local importance information; and (3) the identified term clusters give a meaningful explanation for the document clustering results.
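One of the two corpus-level graphs named above, the term co-occurrence graph, can be sketched as a document-level co-occurrence count. This is a simplified stand-in: the article's actual construction, its ontology enrichment, and the abstract-title graph are not shown here.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(documents):
    """Build a corpus-level term co-occurrence graph: nodes are terms,
    and the weight of edge (t1, t2) counts the documents containing both."""
    edges = defaultdict(int)
    for doc in documents:
        for t1, t2 in combinations(sorted(set(doc)), 2):
            edges[(t1, t2)] += 1
    return dict(edges)

# Toy biomedical term lists standing in for tokenized PubMed abstracts.
docs = [["gene", "protein", "cancer"],
        ["gene", "cancer"],
        ["protein", "enzyme"]]
graph = cooccurrence_graph(docs)
print(graph[("cancer", "gene")])  # 2 documents contain both terms
```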


Author(s): Zhang Xiaodan, Jing Liping, Hu Xiaohua, Ng Michael, Xia Jiali, ...

Recent research shows that an ontology, used as background knowledge, can improve document clustering quality through its concept hierarchy. Previous studies take term semantic similarity as an important measure for incorporating domain knowledge into the clustering process, for example in clustering initialization and term re-weighting. However, few studies have focused on how different types of term similarity measures affect clustering performance for a given domain. In this article, we conduct a comparative study of how different term semantic similarity measures, including path-based, information-content-based and feature-based measures, affect document clustering. Re-weighting the terms of a document vector is an important method for integrating a domain ontology into the clustering process: the weight of a term is augmented by the weights of the concepts it co-occurs with. Spherical k-means is used to evaluate document-vector re-weighting on two real-world datasets, Disease10 and OHSUMED23. Experimental results on nine different semantic measures show that: (1) no single type of similarity measure significantly outperforms the others; (2) several similarity measures have notably more stable performance than the others; and (3) term re-weighting has positive effects on medical document clustering, although these may not be significant when documents contain few terms.
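The term re-weighting step can be sketched as follows, assuming a precomputed similarity score per term pair. The terms and the similarity value below are invented for illustration; the article evaluates nine real measures derived from a domain ontology.

```python
def reweight(doc_vector, similarity):
    """Augment each term's weight with the weights of semantically related
    terms that co-occur in the same document.
    doc_vector: term -> weight; similarity: (term, term) -> score in [0, 1]."""
    new_vector = dict(doc_vector)
    for (t1, t2), sim in similarity.items():
        if t1 in doc_vector and t2 in doc_vector:
            new_vector[t1] += sim * doc_vector[t2]
            new_vector[t2] += sim * doc_vector[t1]
    return new_vector

vec = {"myocardial": 1.0, "infarction": 0.8, "aspirin": 0.5}
sim = {("myocardial", "infarction"): 0.9}  # invented similarity score
print(reweight(vec, sim))  # related pair boosted, "aspirin" unchanged
```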


Author(s): Ravat Franck, Teste Olivier, Tournier Ronan, Zurfluh Gilles

This article deals with multidimensional analyses. The data to be analyzed are designed according to a conceptual model as a constellation of facts and dimensions, where the dimensions are composed of multiple hierarchies. This model supports a query algebra defining a minimal core of operators that produce multidimensional tables for displaying the analyzed data. This user-oriented algebra also supports complex analyses through advanced and binary operators. A graphical language based on this algebra is provided to ease the specification of multidimensional queries: graphical manipulations are expressed over a constellation schema and produce multidimensional tables.
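The end product of such operators, a multidimensional table over a fact and two dimensions, can be sketched as a simple cross-tabulation. This is a minimal stand-in for illustration only; the function name, the sales facts and the dimension names below are all hypothetical and do not come from the article's algebra.

```python
def mtable(facts, row_dim, col_dim, measure):
    """Cross-tabulate a fact table: one cell per (row, column) pair,
    summing the chosen measure."""
    table = {}
    for fact in facts:
        key = (fact[row_dim], fact[col_dim])
        table[key] = table.get(key, 0) + fact[measure]
    return table

# Hypothetical sales facts with two dimensions (year, city) and one measure.
sales = [{"year": 2007, "city": "Paris", "amount": 10},
         {"year": 2007, "city": "Lyon", "amount": 4},
         {"year": 2008, "city": "Paris", "amount": 7}]
print(mtable(sales, "year", "city", "amount"))
```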


Author(s): Zhang Zhi-Zhuo, Chen Qiong, Ke Shang-Fu, Wu Yi-Jun, Qi Fei

Ranking potential customers has become an effective tool for company decision makers when designing marketing strategies. The task of the PAKDD Competition 2007 is a cross-selling problem between credit cards and home loans, which can also be treated as a problem of ranking potential customers. This article proposes a three-level ranking model, named Group-Ensemble, to handle such problems. In our model, Bagging, RankBoost and Expending Regression Tree are applied to solve crucial data mining problems such as data imbalance, missing values and time-variant distributions. The article verifies the model with the data provided by the PAKDD Competition 2007 and shows that Group-Ensemble can make a selling strategy much more efficient.


Author(s): Nikulin Vladimir

Imbalanced data represent a significant problem because the corresponding classifier tends to ignore patterns that have smaller representation in the training set. We propose to consider a large number of balanced training subsets in which representatives of the larger pattern are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent random subsets and whose columns represent features. Based on this matrix, we assess the stability of the influence of each feature, and we propose to keep in the model only features with stable influence. The final model is an average of the single models, which are not necessarily linear regressions. This model proved efficient and competitive during the PAKDD-2007 Data Mining Competition.
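The balanced-subset-and-stability idea can be sketched as follows. For brevity, this stand-in scores each feature on each subset by the difference of class means rather than fitting a full linear regression, and all data are synthetic; it only illustrates the mechanism of measuring sign stability across random balanced subsets.

```python
import random

def feature_stability(pos, neg, n_subsets=100, seed=1):
    """For each of `n_subsets` balanced subsets (majority class randomly
    down-sampled), compute a per-feature score and return, per feature,
    the fraction of subsets agreeing on the score's sign."""
    rng = random.Random(seed)
    n_features = len(pos[0])
    signs = [[] for _ in range(n_features)]
    for _ in range(n_subsets):
        sample = rng.sample(neg, len(pos))  # balance by down-sampling
        for j in range(n_features):
            coef = (sum(row[j] for row in pos) / len(pos)
                    - sum(row[j] for row in sample) / len(sample))
            signs[j].append(1 if coef > 0 else -1)
    stability = []
    for feature_signs in signs:
        frac_pos = feature_signs.count(1) / len(feature_signs)
        stability.append(max(frac_pos, 1 - frac_pos))
    return stability

# Synthetic data: feature 0 separates the classes, feature 1 is noise.
data_rng = random.Random(7)
pos = [[1.0 + data_rng.random(), data_rng.random()] for _ in range(20)]
neg = [[data_rng.random(), data_rng.random()] for _ in range(200)]
print(feature_stability(pos, neg))  # feature 0's influence is stable (1.0)
```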


Author(s): Zhang Junping, Li Guo-Zheng

The PAKDD Competition 2007 involved predicting customers’ propensity to take up a home loan, given a collection of data from credit card users. The problem is rather difficult to address because 1) the data set is extremely imbalanced; 2) the features are of mixed types; and 3) there are many missing values. This article gives an overview of the competition, consisting of three main parts: 1) the background of the database and some statistical results of the participants are introduced; 2) an analysis is given from the viewpoint of the data preparation, resampling/reweighting and ensemble learning employed by different participants; and 3) finally, some business insights are highlighted.


Author(s): A. Gadish David

The quality of vector spatial data can be assessed using the data contained within one or more data warehouses. Spatial consistency includes topological consistency, or conformance to topological rules (Hadzilacos & Tryfona, 1992; Rodríguez, 2005). Detection of inconsistencies in vector spatial data is an important step toward improving spatial data quality (Redman, 1992; Veregin, 1991). An approach for detecting topo-semantic inconsistencies in vector spatial data is presented. Inconsistencies between pairs of neighboring vector spatial objects are detected by comparing the relations between spatial objects to rules (Klein, 2007). A property of spatial objects, called elasticity, is defined to measure the contribution of each object to inconsistent behavior. A method for grouping multiple objects that are inconsistent with one another, based on their elasticity, is proposed. The ability to detect groups of neighboring objects that are inconsistent with one another can later serve as the basis of an effort to increase the quality of spatial data sets stored in data warehouses, as well as the quality of the results of data mining processes.


Author(s): Pighin Maurizio, Ieronutti Lucio

The design and configuration of a data warehouse can be difficult tasks, especially in the case of very large databases and in the presence of redundant information. In particular, the choice of which attributes should be considered dimensions and which measures is not trivial, and it can heavily influence the effectiveness of the final system. In this article, we propose a methodology aimed at supporting this design choice and at deriving information on the overall quality of the final data warehouse. We tested our proposal on three real-world commercial ERP databases.

