Extracting class diagram from hidden dependencies in data set

Bogumiła Hnatkowska; Zbigniew Huzar; Lech Tuzinkiewicz

doi:10.7494/csci.2020.21.2.3483

Computer Science ◽

10.7494/csci.2020.21.2.3483 ◽

2020 ◽

Vol 21 (2) ◽

Cited By ~ 1

Author(s):

Bogumiła Hnatkowska ◽

Zbigniew Huzar ◽

Lech Tuzinkiewicz

Keyword(s):

Conceptual Model ◽

Graphical Representation ◽

Real Data ◽

Class Diagram ◽

Data Sets ◽

Data Set ◽

Key Concepts ◽

High Level

A conceptual model is a high-level, graphical representation of a specific domain, presenting its key concepts and relationships between them. In particular, these dependencies can be inferred from concepts' instances being a part of big raw data files. The paper aims to propose a method for constructing a conceptual model from data frames encompassed in data files. The result is presented in the form of a class diagram. The method is explained with several examples and verified by a case study in which the real data sets are processed. It can also be applied for checking the quality of the data set.
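The abstract does not spell out how dependencies are read from the data, so the following is only an illustrative sketch of one plausible ingredient: checking whether one column functionally determines another and turning the result into an association multiplicity. The function names, the row format (a list of dictionaries), and the sample columns are all assumptions made for this example, not the authors' method.

```python
from collections import defaultdict


def determines(rows, a, b):
    """Return True if every value of column a co-occurs with exactly one value of column b."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[a]].add(row[b])
    return all(len(values) == 1 for values in seen.values())


def association_kind(rows, a, b):
    """Classify the a-to-b relationship from the two functional-dependency checks."""
    a_to_b = determines(rows, a, b)
    b_to_a = determines(rows, b, a)
    if a_to_b and b_to_a:
        return "one-to-one"
    if a_to_b:
        return "many-to-one"
    if b_to_a:
        return "one-to-many"
    return "many-to-many"


# Hypothetical data frame, invented for illustration only.
rows = [
    {"order_id": 1, "customer": "Ann", "city": "Oslo"},
    {"order_id": 2, "customer": "Ann", "city": "Oslo"},
    {"order_id": 3, "customer": "Bob", "city": "Rome"},
]

print(association_kind(rows, "order_id", "customer"))
print(association_kind(rows, "customer", "city"))
```

A check like this only reflects the rows at hand: a dependency that holds in a sample may fail on more data, which is one reason such checks can double as a data-quality probe.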

ESTIMATION OF EXTREME QUANTILES: EMPIRICAL TOOLS FOR METHODS ASSESSMENT AND COMPARISON

International Journal of Reliability Quality and Safety Engineering ◽

10.1142/s0218539300000079 ◽

2000 ◽

Vol 07 (01) ◽

pp. 75-94 ◽

Cited By ~ 3

Author(s):

J. DIEBOLT ◽

M.-A. EL-AROUI ◽

V. DURBEC ◽

B. VILLAIN

Keyword(s):

Goodness Of Fit ◽

Simulated Data ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Extreme Quantiles ◽

Maintenance Policies ◽

Simulated Data Sets ◽

Industrial Context

When extreme quantiles have to be estimated from a given data set, the classical parametric approach can lead to very poor estimates. This has led to the introduction of specific methods for estimating extreme quantiles (MEEQ's) in a nonparametric spirit, e.g., Pickands excess method, methods based on Hill's estimate of the Pareto index, and the exponential tail (ET) and quadratic tail (QT) methods. However, no practical technique is available for assessing and comparing these MEEQ's when they are to be used on a given data set. This paper is a first attempt to provide such techniques. We first compare the estimates given by the main MEEQ's on several simulated data sets. Then we suggest goodness-of-fit (Gof) tests to assess the MEEQ's by measuring the quality of their underlying approximations. It is shown that Gof techniques provide very relevant tools for assessing and comparing the ET and excess methods. Other empirical criteria for comparing MEEQ's are also proposed and studied through Monte Carlo analyses. Finally, these assessment and comparison techniques are applied to real data sets from an industrial context where extreme quantiles are needed to define maintenance policies.
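As a rough illustration of one family of methods named above, the sketch below computes the textbook Hill estimate of the tail index from the k largest observations and plugs it into a Weissman-type extrapolation for an extreme quantile. This is the standard form of these estimators, not necessarily the variant assessed in the paper; the function names and the choice of k are assumptions made for this example.

```python
import math


def hill_estimate(sample, k):
    """Hill estimate of the tail index (1 / Pareto index) from the k largest values."""
    xs = sorted(sample, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    threshold = xs[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold observation must be positive")
    return sum(math.log(x / threshold) for x in xs[:k]) / k


def weissman_quantile(sample, k, p):
    """Estimate the quantile exceeded with probability p by extrapolating beyond the threshold."""
    xs = sorted(sample, reverse=True)
    n = len(xs)
    gamma = hill_estimate(sample, k)
    return xs[k] * (k / (n * p)) ** gamma


# Tiny made-up sample, for illustration only.
sample = [1.0, 2.0, 4.0, 8.0]
print(hill_estimate(sample, 2))
print(weissman_quantile(sample, 2, 0.5))
```

Both estimates depend strongly on k, the number of upper order statistics used, which is part of why tools for assessing such methods on a given data set are useful.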

Pemetaan Siswa Berprestasi Menggunakan Metode K-Means Clustring (Mapping High-Achieving Students Using the K-Means Clustering Method)

JURTEKSI ◽

10.33330/jurteksi.v4i1.28 ◽

2017 ◽

Vol 4 (1) ◽

pp. 85-92

Author(s):

Mustika Larasati Sibuea ◽

Andy Safta

Keyword(s):

Data Mining ◽

Student Achievement ◽

Manhattan Distance ◽

Data Set ◽

Euclidian Distance ◽

Student Failure ◽

Human Resources Information Systems ◽

High Level

Abstract: A high rate of student success and a low rate of student failure reflect the quality of education. Educational institutions are now expected to compete by making use of all the resources they have. Alongside facilities, infrastructure, and human resources, information systems are one of the resources that can be used to improve competitiveness. Data mining is a process of analyzing data to find patterns in a collection of data, and it can turn large amounts of data into information that is meaningful for decision makers. One data mining process is clustering. The attributes used to group student achievement are name, extracurricular activity, grades (assignment score, midterm exam (UTS) score, and final exam (UAS) score), number of absences, and attitude score. A case study of 20 students, with distances computed using Manhattan distance, Chebyshev distance, and Euclidean distance, yielded 67% accuracy. Keywords: data mining, clustering, k-means, student achievement
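The three distance measures named in the abstract can be sketched as follows, together with the assignment step of k-means that uses them. This is a minimal illustration with hypothetical function names and made-up points; it does not reproduce the study's data or its 67% result.

```python
import math


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_centroid(point, centroids, distance):
    """Index of the centroid closest to point under the given distance (k-means assignment step)."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


# Made-up two-dimensional points, for illustration only.
p, q = (0, 0), (3, 4)
print(manhattan(p, q), chebyshev(p, q), euclidean(p, q))
print(nearest_centroid((1, 1), [(0, 0), (5, 5)], euclidean))
```

Because the three measures weight coordinate differences differently, the same point can be assigned to different clusters depending on which one is used.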

CHOOSING SEEDS FOR SEMI-SUPERVISED GRAPH BASED CLUSTERING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/35/4/14123 ◽

2019 ◽

Vol 35 (4) ◽

pp. 373-384

Author(s):

Cuong Le ◽

Viet Vu Vu ◽

Le Thi Kieu Oanh ◽

Nguyen Thi Hai Yen

Keyword(s):

Learning Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Efficient Data ◽

Graph Based Clustering

Although clustering algorithms have a long history, clustering still attracts a lot of attention because of the need for efficient data analysis tools in many applications such as social networks, electronic commerce, and GIS. Recently, semi-supervised clustering, which uses side information, has received a great deal of attention; examples include semi-supervised K-Means, semi-supervised DBSCAN, and semi-supervised graph-based clustering (SSGC). Generally, there are two forms of side information: the seed form (labeled data) and the constraint form (must-link, cannot-link). By integrating information provided by the user or a domain expert, semi-supervised clustering can produce the expected results. In fact, clustering results usually depend on the side information provided, so different side information will produce different clustering results. In some cases, the performance of clustering may decrease if the side information is not carefully chosen. This paper addresses the problem of efficiently collecting seeds for semi-supervised clustering, especially for graph-based clustering by seeding (SSGC). Properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seed collection task, which identifies candidates to present to users by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.
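The abstract does not describe SKMMM in detail. As a hedged illustration of the min-max idea it mentions, the sketch below implements the generic farthest-first traversal: each new candidate is the point whose minimum distance to the already chosen points is largest. The function name, the Euclidean distance, and the starting index are assumptions for this example, and the K-Means stage and user queries are omitted.

```python
import math


def farthest_first(points, count, start=0):
    """Pick `count` point indices so that each new one maximizes its minimum distance to those chosen."""
    if not 0 < count <= len(points):
        raise ValueError("count must be between 1 and len(points)")
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < count:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen


# Made-up points on a line, for illustration only.
points = [(0, 0), (1, 0), (10, 0), (5, 0)]
print(farthest_first(points, 3))
```

Spreading candidates out in this way tends to reach distinct regions of the data with few queries, which matches the stated goal of minimizing the number of questions put to the user.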

Vocational Development in Emerging Adulthood

10.1093/oso/9780199934263.003.0006 ◽

2018 ◽

Author(s):

Michael W. Pratt ◽

M. Kyle Matsuba

Keyword(s):

Turning Point ◽

Life Stories ◽

Life Story ◽

Emerging Adult ◽

Vocational Development ◽

Data Set ◽

The World ◽

Key Concepts

Chapter 6 reviews research on the topic of vocational/occupational development in relation to the McAdams and Pals tripartite personality framework of traits, goals, and life stories. Distinctions between types of motivations for the work role (as a job, career, or calling) are particularly highlighted. The authors then turn to research from the Futures Study on work motivations and their links to personality traits, identity, generativity, and the life story, drawing on analyses and quotes from the data set. To illustrate the key concepts from this vocation chapter, the authors end with a case study on Charles Darwin’s pivotal turning point, his round-the-world voyage as naturalist for the HMS Beagle. Darwin was an emerging adult in his 20s at the time, and we highlight the role of this journey as a turning point in his adult vocational development.

A Support Based Initialization Algorithm for Categorical Data Clustering

Journal of Information Technology Research ◽

10.4018/jitr.2018040104 ◽

2018 ◽

Vol 11 (2) ◽

pp. 53-67

Author(s):

Ajay Kumar ◽

Shishir Kumar

Keyword(s):

Categorical Data ◽

Selection Process ◽

Numerical Data ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Data Object ◽

Data Points ◽

Wu Method ◽

Selection Algorithms

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to get the support of every row. Further, the data object having the largest support is chosen as an initial center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
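The steps described above can be sketched roughly as follows. Several details are assumptions made for this example because the abstract does not fix them: support is taken as the relative frequency of a value within its attribute, row support is the sum of those values, distance is the Hamming distance, and ties are broken by row order. All names are hypothetical.

```python
from collections import Counter


def row_supports(rows):
    """Support of each row: sum over attributes of the relative frequency of the row's value."""
    n = len(rows)
    counts = [Counter(column) for column in zip(*rows)]
    return [sum(counts[j][value] / n for j, value in enumerate(row)) for row in rows]


def hamming(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


def initial_centers(rows, k):
    """Indices of k initial centers: the highest-support row, then the rows farthest from it."""
    supports = row_supports(rows)
    first = max(range(len(rows)), key=lambda i: supports[i])
    others = sorted(
        (i for i in range(len(rows)) if i != first),
        key=lambda i: hamming(rows[i], rows[first]),
        reverse=True,
    )
    return [first] + others[: k - 1]


# Made-up categorical data, for illustration only.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
]
print(row_supports(rows))
print(initial_centers(rows, 2))
```

Starting from the most typical object and then moving as far from it as possible gives a deterministic initialization, in contrast to random center selection.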

Farmer Preference to High Elevation Rice Technological Packages for Accelerating Technological Dissemination (A case Study in Humbang Hasundutan Regency)

Agro Ekonomi ◽

10.22146/ae.61367 ◽

2021 ◽

Vol 32 (2) ◽

Author(s):

Setia Sari Girsang ◽

Agung B Santosa ◽

Tommy Purba ◽

Deddy R Siagian ◽

Khadijah E Ramija

Keyword(s):

Low Cost ◽

High Elevation ◽

Primary Data ◽

Survey Method ◽

User Friendliness ◽

Level Of Satisfaction ◽

High Level ◽

Level Of Importance

Accelerating the introduction of a new technological package is needed to increase the productivity of high elevation puddled rice in Humbang Hasundutan. The objectives of the study are to find out farmers' perception of existing technological packages and their preference for a new technological package. The study used a survey method with primary data gathered using questionnaires. Location and respondent criteria were used to obtain relevant respondents and data concerning their knowledge of high elevation puddled rice cultivation. The collected data were processed using Importance Performance Analysis in order to find out the level of importance and satisfaction of the indicators and the valued aspects of the technological package components. The results of the study showed that socio-economic aspects had to be heeded in organizing the technological package. Indicators having a high level of importance and a low level of satisfaction consisted of production cost, quality of seeds, farmer group empowerment, technology information institutions, capital cost, agricultural tools and machines, pest control, sales price, irrigation canals, and farm roads. On the other hand, the introduction of new superior seeds, the productivity attribute, and planting age were important indicators for local farmers for improving the quality of existing seeds. Farmer groups expected the technological package to have a high level of productivity, better access to inputs, low cost, and good user-friendliness in its application.

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis, and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized data clustering. The higher the weight of a variable, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process for the different variables, computing weights that influence the quality of clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set, and three other mixed data sets. The results show a good quality of topological ordering and homogeneous clustering.

EFFECTIVENESS OF GUIDANCE AND COUNSELING SERVICES BY USING THE APPLICATIONS INSTRUMENTATION AND ACTIVITY DATA SET IN SMA NEGERI 1 METRO LESSON 2009/2010

GUIDENA Jurnal Ilmu Pendidikan Psikologi Bimbingan dan Konseling ◽

10.24127/gdn.v4i1.366 ◽

2014 ◽

Vol 4 (1) ◽

pp. 35

Author(s):

Agus Wibowo

Keyword(s):

Qualitative Research ◽

Counseling Services ◽

Guidance And Counseling ◽

Data Sets ◽

Research Subjects ◽

Activity Data ◽

Data Set ◽

Research Level ◽

Data Collection Technique ◽

High Level

Abstract: Guidance and counseling services should be based on the needs and problems of students so that the services are as effective as possible. In practice, however, many guidance and counseling services in schools do not take this into account, so students' problems are always handled with the same services. Starting from this, the study examined the effectiveness of guidance and counseling services that used the instrumentation application activity and data sets as the basis for delivering services. The study used a qualitative method; the research subjects were guidance and counseling (BK) teachers and students at SMA Negeri 1 Metro. Data were collected through interviews, observation, and documentation. The results show that, by using instrumentation application activities and data sets, the counseling services achieve a high level of effectiveness. In delivering the services, BK teachers can identify the problems and needs of students, so the assistance provided is more precise and students' problems can be resolved optimally. Keywords: Guidance and Counseling, Instrumentation Applications, Data Association

Quality of Life Modeling at the Regional Level

Regional Development ◽

10.4018/978-1-4666-0882-5.ch111 ◽

2012 ◽

pp. 163-186

Author(s):

Jirí Krupka ◽

Miloslava Kašparová ◽

Pavel Jirava ◽

Jan Mandys

Keyword(s):

Quality Of Life ◽

Czech Republic ◽

Decision Tree ◽

Decision Rules ◽

Real Data ◽

Classification Model ◽

Data Sets ◽

The Czech Republic ◽

First Case

The chapter presents the problem of quality of life modeling in the Czech Republic based on classification methods. It compares two methodological approaches: the first is that of the Institute of Sociology of the Academy of Sciences of the Czech Republic, and the second comes from a project of the civic association Team Initiative for Local Sustainable Development. On the basis of real data sets from the institute and the Team Initiative, the authors synthesized and analyzed quality of life classification models. They used decision tree classification algorithms to generate transparent decision rules and compared the classification results of the decision trees. Classifier models based on the C5.0, CHAID, C&RT, and C5.0 boosting algorithms were proposed and analyzed. The designed classification model was created in Clementine.

Extended Odd Fréchet-G Family of Distributions

Journal of Probability and Statistics ◽

10.1155/2018/2931326 ◽

2018 ◽

Vol 2018 ◽

pp. 1-12 ◽

Cited By ~ 6

Author(s):

Suleman Nasiru

Keyword(s):

Maximum Likelihood Method ◽

Real Data ◽

Likelihood Method ◽

Data Sets ◽

Data Set ◽

New Class ◽

New Family ◽

Rate Functions ◽

The Given ◽

Family Of Distributions

The need to develop generalizations of existing statistical distributions to make them more flexible in modeling real data sets is vital in parametric statistical modeling and inference. Thus, this study develops a new class of distributions called the extended odd Fréchet family of distributions for modifying existing standard distributions. Two special models named the extended odd Fréchet Nadarajah-Haghighi and extended odd Fréchet Weibull distributions are proposed using the developed family. The densities and the hazard rate functions of the two special distributions exhibit different kinds of monotonic and nonmonotonic shapes. The maximum likelihood method is used to develop estimators for the parameters of the new class of distributions. The application of the special distributions is illustrated by means of a real data set. The results revealed that the special distributions developed from the new family can provide a reasonable parametric fit to the given data set compared to other existing distributions.
