IRIS Data Classification Using Tolerant Rough Sets

Author(s):  
Daijin Kim ◽  
Sung-Yang Bang

This paper proposes a new data classification method based on the tolerant rough set, which extends the existing equivalent rough set. The similarity measure between two data points is described by a distance function of all constituent attributes, and two points are defined to be tolerant when their similarity measure exceeds a similarity threshold value. Determining the optimal similarity threshold value is very important for accurate classification, so we determine it by using the genetic algorithm (GA), where the goal of evolution is to balance two requirements: (1) tolerant objects should be included in the same class as much as possible, and (2) objects in the same class should be tolerant of each other as much as possible. After the optimal similarity threshold value is found, a tolerant set of each object is obtained, and the data set is grouped into the lower and upper approximation sets depending on the coincidence of their classes. We propose a two-stage classification method in which all data are classified by using the lower approximation at the first stage, and the data left unclassified at the first stage are then classified by using the rough membership functions obtained from the upper approximation set. The validity of the proposed classification method is tested by applying it to IRIS data classification, and its classification performance and processing time are compared with those of other classification methods such as BPNN, OFUNN, and FCM.
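The tolerance relation and the two approximation sets described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity form `1 / (1 + Euclidean distance)`, the function names, and the toy data are assumptions made for illustration, and the GA search for the threshold is omitted (the threshold is simply passed in).

```python
from math import sqrt

def similarity(x, y):
    """Assumed similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    d = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)

def tolerance_sets(points, threshold):
    """For each object, the indices of all objects whose similarity to it
    exceeds the threshold. With threshold < 1 every object tolerates itself."""
    return [
        [j for j, q in enumerate(points) if similarity(p, q) > threshold]
        for p in points
    ]

def approximations(points, labels, threshold):
    """Lower approximation: objects whose tolerance set contains one class only.
    Upper approximation: objects whose tolerance set touches the class."""
    lower, upper = {}, {}
    for i, tol in enumerate(tolerance_sets(points, threshold)):
        classes = {labels[j] for j in tol}
        for c in classes:
            upper.setdefault(c, set()).add(i)
        if classes == {labels[i]}:
            lower.setdefault(labels[i], set()).add(i)
    return lower, upper

def rough_membership(i, c, points, labels, threshold):
    """Fraction of object i's tolerance set that belongs to class c."""
    tol = tolerance_sets(points, threshold)[i]
    return sum(1 for j in tol if labels[j] == c) / len(tol)

# Toy one-attribute data (hypothetical, not the IRIS data).
points = [(0.0,), (0.1,), (1.0,), (1.1,), (0.7,)]
labels = ["a", "a", "b", "b", "a"]
lower, upper = approximations(points, labels, threshold=0.7)
print(lower, upper)
```

In this toy run, objects 0 and 1 fall in the lower approximation of class "a" and could be classified at the first stage; objects 2, 3, and 4 tolerate objects of both classes, so a second stage would have to rank them with `rough_membership`.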

Author(s):  
Cheng-Chien Kuo ◽  
Horng-Lin Shieh

In this study, a semi-supervised learning algorithm for data classification and a defect type recognition system for a 25 kV cross-linked polyethylene (XLPE) underground power cable joint are proposed. The proposed algorithm integrates the fuzzy-rough set and shared nearest neighbors (SNN) methods to assign labels to unlabeled data. The data set is divided into two subsets: one labeled and the other unlabeled. SNN is adopted for measuring the similarity between the unlabeled and labeled data subsets. Then, according to the levels of similarity, the fuzzy-rough set algorithm is adopted for assigning labels to the unlabeled data. A defect type recognition problem for XLPE cable joints is used to test the proposed algorithm. To demonstrate its performance, the algorithm is applied to two well-known data sets. The experimental results show that the proposed algorithm can obtain outstanding levels of performance.
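The shared-nearest-neighbor similarity mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names and toy data are hypothetical, and `assign_label` is a plain similarity-weighted vote standing in for the fuzzy-rough labeling step, whose details the abstract does not give.

```python
def k_nearest(i, points, k):
    """Indices of the k nearest neighbours of points[i], excluding itself."""
    def dist2(j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    others = [j for j in range(len(points)) if j != i]
    return set(sorted(others, key=dist2)[:k])

def snn_similarity(i, j, points, k):
    """SNN similarity: number of neighbours shared by the k-NN lists of i and j."""
    return len(k_nearest(i, points, k) & k_nearest(j, points, k))

def assign_label(u, points, known_labels, k):
    """Stand-in for the fuzzy-rough step (an assumption): give unlabeled
    point u the class with the largest total SNN similarity."""
    scores = {}
    for j, label in known_labels.items():
        scores[label] = scores.get(label, 0) + snn_similarity(u, j, points, k)
    return max(scores, key=scores.get)

# Toy data: indices 0, 1, 3, 4 are labeled; 2, 5, 6 are unlabeled.
points = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,), (0.3,)]
known_labels = {0: "A", 1: "A", 3: "B", 4: "B"}
print(assign_label(6, points, known_labels, k=3))
```

Because SNN counts shared neighbours rather than raw distance, unlabeled points also contribute to the similarity between a labeled and an unlabeled point, which is one reason the measure suits a semi-supervised setting.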


2011 ◽  
Vol 44 (1) ◽  
pp. 14271-14276 ◽  
Author(s):  
H. Chang ◽  
A. Astolfi
Keyword(s):  
Data Set

Author(s):  
Feng Honghai ◽  
Liu Baoyan ◽  
Yin Cheng ◽  
Li Ping ◽  
Yang Bingru ◽  
...  

2019 ◽  
Vol 71 (1) ◽  
pp. 18-37 ◽  
Author(s):  
Güleda Doğan ◽  
Umut Al

Purpose: The purpose of this paper is to analyze the similarity of the intra-indicators used in research-focused international university rankings (Academic Ranking of World Universities (ARWU), NTU, University Ranking by Academic Performance (URAP), Quacquarelli Symonds (QS) and Round University Ranking (RUR)) over the years, and to show the effect of similar indicators on overall rankings for 2015. The research questions addressed in this study are as follows: At what level are the intra-indicators used in international university rankings similar? Is it possible to group intra-indicators according to their similarities? What is the effect of similar intra-indicators on overall rankings?
Design/methodology/approach: The indicator-based scores of all universities in five research-focused international university rankings, for all years in which they were ranked, form the data set of this study for the first and second research questions. The authors used multidimensional scaling (MDS) and the cosine similarity measure to analyze the similarity of indicators and to answer these two research questions. Indicator-based scores and overall ranking scores for 2015 are used as data, and the Spearman correlation test is applied to answer the third research question.
Findings: The results of the analyses show that the intra-indicators used in ARWU, NTU and URAP are highly similar and that they can be grouped according to their similarities. The authors also examined the effect of similar indicators on the 2015 overall ranking lists for these three rankings. NTU and URAP are affected least by omitting similar indicators, which means that these two rankings could produce overall ranking lists very similar to the existing ones using fewer indicators.
Research limitations/implications: CWTS, Mapping Scientific Excellence, Nature Index, and SCImago Institutions Rankings (until 2015) are not included in the scope of this paper, since they do not create overall ranking lists. Likewise, Times Higher Education, CWUR and US are not included because they do not present indicator-based scores. The required data were not accessible for QS for 2010 and 2011. Moreover, although QS ranks more than 700 universities, only the first 400 universities in the 2012–2015 rankings could be analyzed. Although the QS and RUR data were analyzed in this study, it was not statistically possible to reach any conclusion for these two rankings.
Practical implications: The results of this study may be of interest mainly to ranking bodies and to policy- and decision-makers. Ranking bodies may use the results to review the indicators they use, to decide which indicators to use in their rankings, and to question whether it is necessary to continue producing overall rankings. Policy- and decision-makers may also benefit from the results by reconsidering the use of overall ranking results as an important input to their decisions and policies.
Originality/value: This study is the first to use MDS and the cosine similarity measure to reveal the similarity of indicators. Ranking data are skewed, which requires nonparametric statistical analysis; therefore, MDS is used. The study covers all ranking years and all universities in the ranking lists, and differs from similar studies in the literature, which analyze data for shorter time intervals and for top-ranked universities only. Based on the literature review, the similarity of intra-indicators for URAP, NTU and RUR is analyzed for the first time in this study.
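The two measures named in the methodology can be illustrated with a short sketch: cosine similarity between two indicator score vectors, and Spearman rank correlation between two overall score lists. The function names and the scores are hypothetical, the Spearman formula used here assumes no tied values, and the MDS step is not shown.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two indicator score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores of four universities on two indicators.
indicator_a = [90.0, 70.0, 55.0, 40.0]
indicator_b = [45.0, 36.0, 27.0, 21.0]
print(cosine_similarity(indicator_a, indicator_b))

# Hypothetical overall scores with all indicators vs. with one omitted.
full = [88.0, 75.0, 61.0, 50.0]
reduced = [80.0, 78.0, 55.0, 52.0]
print(spearman(full, reduced))
```

A cosine value near 1 suggests that two indicators carry largely the same information, and a Spearman coefficient near 1 between the full and reduced lists suggests that dropping an indicator barely changes the order of universities.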


2013 ◽  
Vol 443 ◽  
pp. 741-745
Author(s):  
Hu Li ◽  
Peng Zou ◽  
Wei Hong Han ◽  
Rong Ze Xia

Much real-world data is imbalanced, i.e. one category contains significantly more samples than the other categories. Traditional classification methods treat all categories equally and are often ineffective on such data. Based on a comprehensive analysis of existing research, we propose a new imbalanced data classification method based on clustering. The method first clusters both the majority class and the minority class. Then, the clustered minority class is over-sampled by SMOTE, while the clustered majority class is under-sampled randomly. Through clustering, the proposed method can avoid the loss of useful information during resampling. Experiments on several UCI datasets show that the proposed method can effectively improve the classification results on imbalanced data.
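The resampling step can be sketched roughly as below, under several assumptions: the clusters are taken as already computed (the clustering algorithm is not shown), the quota rule that gives each cluster a share proportional to its size is hypothetical, and `smote_like` interpolates between random pairs inside a cluster, which simplifies real SMOTE (SMOTE interpolates toward one of a point's k nearest neighbours).

```python
import random

def smote_like(cluster, n_new, rng):
    """Create n_new synthetic points by interpolating between two points
    drawn from the same cluster (a simplification of SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        if len(cluster) > 1:
            a, b = rng.sample(cluster, 2)
        else:
            a = b = cluster[0]
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

def resample(minority_clusters, majority_clusters, target, seed=0):
    """Bring both classes to roughly `target` samples, cluster by cluster."""
    rng = random.Random(seed)
    n_min = sum(len(c) for c in minority_clusters)
    n_maj = sum(len(c) for c in majority_clusters)
    minority = []
    for c in minority_clusters:
        quota = round(target * len(c) / n_min)      # share proportional to size
        minority += list(c) + smote_like(list(c), max(0, quota - len(c)), rng)
    majority = []
    for c in majority_clusters:
        quota = max(1, round(target * len(c) / n_maj))
        majority += rng.sample(list(c), min(len(c), quota))  # random under-sampling
    return minority, majority

# Hypothetical pre-clustered data: 4 minority and 20 majority samples.
minority_clusters = [[(0.0, 0.0), (0.2, 0.1)], [(3.0, 3.0), (3.1, 2.9)]]
majority_clusters = [[(i * 0.1, 1.0) for i in range(10)],
                     [(i * 0.1, 5.0) for i in range(10)]]
minority, majority = resample(minority_clusters, majority_clusters, target=8)
print(len(minority), len(majority))
```

Resampling inside each cluster keeps every region of the majority class represented after under-sampling and keeps synthetic minority points inside the region spanned by an existing cluster.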


2018 ◽  
Vol 35 (8) ◽  
pp. 1508-1518
Author(s):  
Rosembergue Pereira Souza ◽  
Luiz Fernando Rust da Costa Carmo ◽  
Luci Pirmez

Purpose: The purpose of this paper is to present a procedure for finding unusual patterns in accredited tests using a rapid processing method for analyzing video records. The procedure uses the temporal differencing technique for object tracking and considers only frames not identified as statistically redundant.
Design/methodology/approach: An accreditation organization is responsible for accrediting facilities to undertake testing and calibration activities. Periodically, such organizations evaluate accredited testing facilities. These evaluations could use video records and photographs of the tests performed by a facility to judge their conformity to technical requirements. To validate the proposed procedure, a real-world data set with video records from accredited testing facilities in the field of vehicle safety in Brazil was used. The processing time of the proposed procedure was compared with the time needed to process the video records in a traditional fashion.
Findings: With an appropriate threshold value, the proposed procedure could successfully identify video records of fraudulent services. Processing time was shorter than when a traditional method was employed.
Originality/value: Manually evaluating video records is time consuming and tedious. This paper proposes a procedure to rapidly find unusual patterns in videos of accredited tests with a minimum of manual effort.
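Temporal differencing and the skipping of redundant frames can be illustrated with a minimal sketch. The abstract does not state how statistical redundancy is decided, so the mean-absolute-difference test below, the function names, and the thresholds are all assumptions; frames are represented as small grayscale grids of integers.

```python
def frame_difference(prev, curr):
    """Mean absolute pixel difference between two grayscale frames."""
    total, count = 0, 0
    for row_p, row_c in zip(prev, curr):
        for p, c in zip(row_p, row_c):
            total += abs(c - p)
            count += 1
    return total / count

def motion_mask(prev, curr, pixel_threshold):
    """Temporal differencing: mark pixels whose change exceeds the threshold."""
    return [
        [abs(c - p) > pixel_threshold for p, c in zip(row_p, row_c)]
        for row_p, row_c in zip(prev, curr)
    ]

def select_frames(frames, redundancy_threshold):
    """Keep the first frame, then only frames that differ enough from the
    last kept frame (assumed stand-in for the redundancy test)."""
    kept = [0]
    for i in range(1, len(frames)):
        if frame_difference(frames[kept[-1]], frames[i]) > redundancy_threshold:
            kept.append(i)
    return kept

# Four tiny 2x2 frames: nothing changes until an object appears in frame 2.
still = [[0, 0], [0, 0]]
moved = [[0, 100], [0, 0]]
frames = [still, still, moved, moved]
kept = select_frames(frames, redundancy_threshold=5.0)
print(kept)
print(motion_mask(frames[kept[0]], frames[kept[1]], pixel_threshold=50))
```

Only the kept frames would need to pass through the costlier tracking step, which is the kind of saving in processing time the abstract reports.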


2019 ◽  
Vol 1 (3) ◽  
Author(s):  
A. Aziz Altowayan ◽  
Lixin Tao

We consider the following problem: given neural language models (embeddings), each of which is trained on an unknown data set, how can we determine which model would provide a better result when used for feature representation in a downstream task such as text classification or entity recognition? In this paper, we assess the word similarity measure by analyzing its impact on word embeddings learned from various datasets and how they perform in a simple classification task. Word representations were learned and assessed under the same conditions. For training word vectors, we used the implementation of Continuous Bag of Words described in [1]. To assess the quality of the vectors, we applied the analogy questions test for word similarity described in the same paper. Further, to measure the retrieval rate of an embedding model, we introduced a new metric (Average Retrieval Error), which measures the percentage of missing words in the model. We observe that high accuracy on syntactic and semantic similarities between word pairs is not an indicator of better classification results. This observation can be explained by the fact that a domain-specific corpus contributes more to performance than a general-purpose corpus. For reproducibility, we release our experiment scripts and results.
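One plausible reading of the Average Retrieval Error metric can be sketched as follows. The abstract only says that the metric measures the percentage of missing words in the model, so the per-document averaging, the function names, and the toy vocabulary are assumptions rather than the authors' definition.

```python
def retrieval_error(tokens, vocabulary):
    """Percentage of tokens in one document that the model has no vector for."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return 100.0 * missing / len(tokens)

def average_retrieval_error(documents, vocabulary):
    """Mean of the per-document retrieval errors (assumed aggregation)."""
    return sum(retrieval_error(d, vocabulary) for d in documents) / len(documents)

# Hypothetical embedding vocabulary and a two-document task corpus.
vocabulary = {"the", "cat", "sat"}
documents = [["the", "cat", "sat"], ["the", "dog", "ran", "fast"]]
print(average_retrieval_error(documents, vocabulary))
```

A model trained on a general-purpose corpus may score well on analogy questions and still show a high error on a domain-specific task corpus, which is consistent with the observation reported in the abstract.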

