LOGICAL CLASSIFICATION TREES IN RECOGNITION PROBLEMS

2020 ◽  
Vol 10 (2) ◽  
pp. 12-15
Author(s):  
Igor Povhan

The paper is dedicated to algorithms for constructing logical classification trees. Many such algorithms exist today, but, as a rule, they reduce to building a single classification tree from a fixed training sample. Very few algorithms for constructing recognition trees are designed for large data sets, which pose objective difficulties related to how such complex structures are generated, processed, and stored. In this paper we focus on describing an algorithm for constructing classification trees from a large training set and show how a fixed class of recognition trees can be described in a uniform way. The proposed simple, effective, and economical method of constructing a logical classification tree from the training sample provides the required speed and keeps the complexity of the recognition scheme at a level that guarantees simple and complete recognition of discrete objects.
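For orientation, the kind of construction described above can be sketched as a greedy procedure that grows a tree from the training sample one vertex at a time. The sketch below is a minimal illustration under assumptions of our own: the information-gain criterion, the Node layout, and the depth limit are illustrative choices, not the paper's exact scheme.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

class Node:
    def __init__(self, feature=None, children=None, label=None):
        self.feature = feature    # index of the elementary feature tested at this vertex
        self.children = children  # dict: feature value -> child Node
        self.label = label        # class label if this vertex is a leaf

def build_tree(rows, labels, features, max_depth=10):
    """Greedily grow a logical classification tree: at each vertex pick the
    elementary feature with the largest information gain on the current subsample."""
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return Node(label=Counter(labels).most_common(1)[0][0])

    base = entropy(labels)

    def gain(f):
        remainder = 0.0
        for value, count in Counter(row[f] for row in rows).items():
            subset = [lab for row, lab in zip(rows, labels) if row[f] == value]
            remainder += (count / len(rows)) * entropy(subset)
        return base - remainder

    best = max(features, key=gain)
    children = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [f for f in features if f != best],
                                     max_depth - 1)
    return Node(feature=best, children=children)
```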

2003 ◽  
Vol 12 (03) ◽  
pp. 249-264 ◽  
Author(s):  
Lawrence O. Hall ◽  
Kevin W. Bowyer ◽  
Robert E. Banfield ◽  
Steven Eschrich ◽  
Richard Collins

Error based pruning can be used to prune a decision tree and does not require the use of validation data. It is implemented in the widely used C4.5 decision tree software and is controlled by a parameter, the certainty factor, that affects the size of the pruned tree. Several researchers have compared error based pruning with other approaches and presented results suggesting that it produces larger trees with no increase in accuracy. They further suggest that as more data are added to the training set, the tree size after error based pruning continues to grow even though accuracy does not improve. It appears that these results were obtained with the default certainty factor value. Here, we show that varying the certainty factor allows significantly smaller trees to be obtained with minimal or no loss of accuracy. Also, the growth of tree size with added data can be halted with an appropriate choice of certainty factor. Methods of determining the certainty factor are discussed for both small and large data sets. Experimental results support the conclusion that error based pruning can produce appropriately sized trees with good accuracy when compared with reduced error pruning.
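The role of the certainty factor can be illustrated with the pessimistic (upper-confidence-bound) error estimate that error based pruning relies on. The sketch below uses a Wilson-style normal approximation rather than C4.5's exact binomial computation, so treat it as an approximation; the function name and the pruning comparison in the comment are illustrative.

```python
import math
from statistics import NormalDist

def pessimistic_error(errors, n, cf=0.25):
    """Upper confidence bound on a leaf's true error rate given `errors`
    misclassifications out of `n` training cases.

    `cf` plays the role of the certainty factor: a smaller cf gives a more
    pessimistic (larger) estimate and hence more aggressive pruning.
    Wilson-style normal approximation; C4.5 computes the exact binomial bound."""
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1.0 - cf)            # one-sided z for confidence 1 - cf
    p = errors / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + spread) / (1.0 + z * z / n)

# A subtree would be replaced by a leaf when the leaf's predicted errors,
# n * pessimistic_error(e, n, cf), do not exceed the sum over its children.
print(pessimistic_error(1, 20, cf=0.25))   # default C4.5 certainty factor
print(pessimistic_error(1, 20, cf=0.05))   # smaller cf -> larger estimate -> more pruning
```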


Author(s):  
Igor Povkhan ◽  

Urgency of the research. Several independent approaches (concepts) currently exist for solving the classification problem in its general setting, along with a variety of concepts, methods, and models covering the general issues of artificial intelligence theory and information systems. Within recognition theory, each of these approaches has its advantages and disadvantages, and together they form a single toolkit for solving applied problems of artificial intelligence. This study focuses on the current concept of decision trees (classification trees). The general problem of software (algorithmic) construction of logical recognition (classification) trees is considered. The object of this research is logical classification trees (LCT structures); the subject is current methods and algorithmic schemes for constructing them.

Target setting. The main existing methods and algorithms for working with arrays of discrete information when constructing recognition functions (classifiers) do not allow a predetermined level of accuracy (efficiency) of the classification system to be achieved, nor its complexity to be regulated during construction. This disadvantage is absent in methods and schemes for building recognition systems based on the concept of logical classification trees (decision trees). Covering the training sample with a set of elementary features in the LCT case generates a fixed tree data structure (the LCT model), which compresses and transforms the initial training-sample data and therefore allows substantial optimization and savings of the system's hardware resources. The approach rests on a single methodology: the optimal approximation of the training sample by a set of elementary features (attributes) included in some scheme (operator) constructed during learning.

Actual scientific research and issues analysis. The possibility of an effective and economical software (algorithmic) scheme for constructing a logical classification tree (an LCT structure/model) from source arrays of training samples (arrays of discrete information) of large volume.

The research objective. To develop a simple, high-quality software method (algorithm and software system) for building LCT models (structures) from large arrays of initial samples by synthesizing minimal forms of classification and recognition trees that effectively approximate the training information with a set of ranked elementary features (attributes), based on a scheme of branched feature selection applicable to a wide range of applied problems.

The statement of basic materials. We propose a general program scheme for constructing logical classification tree structures which, for a given initial training sample, builds a tree structure (classification model) consisting of a set of elementary features evaluated at each step of model construction for this sample. The main idea of the method and the accompanying software system is to approximate an initial random sample of a given volume with a set of elementary features. The method selects the most informative (highest-quality) elementary features from the source set when forming the current vertex (node) of the logical tree. This approach significantly reduces the size and complexity of the tree (the total number of branches and tiers of the structure) and improves the quality of its subsequent analysis.

Conclusions. The developed and proposed mathematical support for constructing LCT structures (classification tree models) can be used to solve a wide range of practical recognition and classification problems. Prospects for further research include creating a limited method of logical classification trees (LCT structures), in which the procedure for constructing a logical tree is stopped by a criterion on the depth of the structure, optimizing its software implementations, and experimentally studying the method on a wider range of practical problems.
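One way to make the "approximation of the training sample by a set of elementary features" concrete is a simple coverage check like the sketch below: it measures how impure the cells induced by a fixed, ranked feature set are. The purity-based error criterion is an assumption of this illustration; the abstract does not give an explicit formula.

```python
from collections import Counter, defaultdict

def approximation_error(rows, labels, ranked_features):
    """How well a fixed, ranked set of elementary features approximates the
    training sample: objects that fall into the same feature-value cell but
    carry different class labels count as approximation errors.
    Illustrative criterion only; not a formula from the paper."""
    cells = defaultdict(list)
    for row, label in zip(rows, labels):
        cells[tuple(row[f] for f in ranked_features)].append(label)
    mismatches = sum(len(labs) - max(Counter(labs).values()) for labs in cells.values())
    return mismatches / len(rows)

# Toy sample over three binary features: the first two features alone
# already separate the classes, so the error drops to zero.
rows = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
labels = ["A", "B", "B", "A"]
print(approximation_error(rows, labels, ranked_features=[0]))     # 0.5
print(approximation_error(rows, labels, ranked_features=[0, 1]))  # 0.0
```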


Author(s):  
I. F. Povkhan ◽  

The paper offers an estimate of the complexity of a constructed logical tree structure for classifying an arbitrary case under conditions of strong class separation of the initial training sample. Answering this question is of defining importance for assessing the structural complexity of classification models (tree-like LCT/ACT structures) of discrete objects across a wide range of applied classification and recognition problems, in particular for developing promising schemes and methods for their final optimization (minimization) by post-pruning the structure. The presented research is relevant not only for constructions (structures) of logical classification trees, but also allows the complexity-estimation scheme to be extended to the general case of algorithmic classification tree structures (ACT models), including the concepts of algorithm trees and trees of generalized features (TGF). The paper investigates a topical question within the decision-tree (recognition-tree) concept: estimating the maximum complexity of the general scheme for constructing a logical tree classification procedure by step-by-step selection of sets of elementary features (possibly heterogeneous sets and combinations), which, for a given initial training sample (an array of discrete information), builds a tree structure (classification model) from a set of elementary features (basic attributes) evaluated at each stage of the scheme for this sample, in the case of strong class separation. Modern information systems and technologies based on mathematical models of pattern recognition (structures of logical and algorithmic classification trees) are widely used in socio-economic, environmental, and other systems for the primary analysis and processing of large amounts of information, because this approach eliminates a number of shortcomings of well-known classical methods and schemes and achieves a fundamentally new result. The research addresses models of classification trees (decision trees) and offers an assessment of the complexity of logical tree structures (classification tree models) consisting of selected and ranked sets of elementary features (individual features and their combinations) built on the basis of the general concept of branched feature selection. When forming the current vertex (node) of the logical tree, the method selects the most informative (highest-quality) elementary features from the source set. This significantly reduces the size and complexity of the tree (the total number of branches and tiers of the structure) and improves the quality of its subsequent instrumental analysis (the final decomposition of the model).
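The structural quantities that such complexity estimates refer to (branches, tiers, vertices) can be computed directly from a built tree. The sketch below is purely descriptive and uses a nested-dict tree representation of our own; it is not the analytic estimate derived in the paper.

```python
def tree_complexity(node):
    """Structural complexity of a classification tree given as nested dicts
    (branch value -> child); any non-dict value is a leaf label.
    Returns vertex count, leaf count and number of tiers (depth).
    Purely descriptive measure; the paper's complexity estimate is analytic."""
    if not isinstance(node, dict):
        return {"vertices": 1, "leaves": 1, "tiers": 1}
    stats = {"vertices": 1, "leaves": 0, "tiers": 0}
    for child in node.values():
        sub = tree_complexity(child)
        stats["vertices"] += sub["vertices"]
        stats["leaves"] += sub["leaves"]
        stats["tiers"] = max(stats["tiers"], sub["tiers"])
    stats["tiers"] += 1
    return stats

# A small two-feature tree: one feature is tested at the root, another on one branch.
toy = {0: {0: "A", 1: "B"}, 1: "A"}
print(tree_complexity(toy))   # {'vertices': 5, 'leaves': 3, 'tiers': 3}
```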


Author(s):  
I. F. Povkhan ◽  

We propose an upper estimate of the complexity of the binary logical tree synthesis procedure for classifying an arbitrary case (under conditions of weak and strong class separation in the training sample). Answering this question is of fundamental importance for assessing the structural complexity of classification models (tree structures) of discrete objects across a wide range of applied classification and recognition problems, in particular for developing promising schemes and methods for final optimization (minimization) of the structure. This research is relevant not only for constructions of logical classification trees, but also allows the complexity-estimation scheme itself to be extended to the general case of algorithmic classification tree structures (the concepts of algorithm trees and generalized feature trees). The paper addresses the complexity of the general procedure for constructing a logical classification tree based on step-by-step selection of sets of elementary features (possibly heterogeneous sets and combinations), which, for a given initial training sample (an array of discrete information), builds a tree structure (classification model) from a set of elementary features (basic attributes) evaluated at each stage of the model-construction scheme for this sample. Modern information technologies based on mathematical models of pattern recognition (logical and algorithmic classification trees) are widely used in socio-economic, environmental, and other systems for the primary analysis and processing of large amounts of information, because this approach eliminates a number of shortcomings of well-known classical methods and schemes and achieves a fundamentally new result. The work addresses models of classification trees (decision trees) and offers an assessment of the complexity of logical tree structures (classification tree models) consisting of selected and ranked sets of elementary features built on the basis of the general concept of branched feature selection. When forming the current vertex (node) of the logical tree, the method selects the most informative (highest-quality) elementary features from the source set. This significantly reduces the size and complexity of the tree (the total number of branches and tiers of the structure) and improves the quality of its subsequent analysis.
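For orientation, a generic counting bound (not the specific estimate derived in the paper) shows the kind of quantity involved: in a strictly binary tree every internal vertex has two children, so a tree with L leaves has L - 1 internal vertices, and strong class separation over an m-object sample forces each leaf to isolate at least one object, so L <= m.

```latex
% Generic structural bound for a strictly binary classification tree over an
% m-object training sample (illustrative only, not the paper's estimate):
\[
  |V| \;=\; 2L - 1 \;\le\; 2m - 1,
  \qquad
  \operatorname{depth} \;\le\; L - 1 \;\le\; m - 1 .
\]
```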


Author(s):  
John A. Hunt

Spectrum-imaging is a useful technique for comparing different processing methods on very large data sets that are identical for each method. This paper is concerned with comparing methods of electron energy-loss spectroscopy (EELS) quantitative analysis on the Al-Li system. The spectrum-image analyzed here was obtained from an Al-10 at% Li foil aged to produce δ' precipitates that can span the foil thickness. Two 1024-channel EELS spectra offset in energy by 1 eV were recorded and stored at each pixel in the 80x80 spectrum-image (25 Mbytes). An energy range of 39-89 eV (20 channels/eV) is represented. During processing, the spectra are either subtracted to create an artifact-corrected difference spectrum, or the energy offset is numerically removed and the spectra are added to create a normal spectrum. The spectrum-images are processed into 2D floating-point images using methods and software described in [1].
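The two processing paths (difference spectrum versus offset-corrected sum) can be sketched with array operations on the stated dimensions. The cube shapes, variable names, and the direction of the channel shift below are assumptions for illustration, not taken from the paper.

```python
import numpy as np

# Assumed layout: two (80, 80, 1024) spectrum-image cubes acquired with a
# 1 eV energy offset; at 20 channels/eV the offset is 20 channels.
OFFSET_CHANNELS = 20

def difference_spectrum(cube_a, cube_b):
    """Artifact-corrected difference spectrum: subtracting the two offset
    acquisitions cancels channel-dependent (fixed-pattern) artifacts."""
    return cube_a.astype(np.float32) - cube_b.astype(np.float32)

def normal_spectrum(cube_a, cube_b, offset=OFFSET_CHANNELS):
    """Numerically remove the energy offset and add the acquisitions to
    recover a conventional spectrum on the overlapping channels."""
    a = cube_a[..., offset:].astype(np.float32)
    b = cube_b[..., :-offset].astype(np.float32)
    return a + b

rng = np.random.default_rng(0)
cube_a = rng.poisson(100, size=(80, 80, 1024))       # synthetic stand-ins
cube_b = rng.poisson(100, size=(80, 80, 1024))
diff = difference_spectrum(cube_a, cube_b)            # shape (80, 80, 1024)
norm = normal_spectrum(cube_a, cube_b)                # shape (80, 80, 1004)
```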


Author(s):  
Thomas W. Shattuck ◽  
James R. Anderson ◽  
Neil W. Tindale ◽  
Peter R. Buseck

Individual particle analysis involves the study of tens of thousands of particles using automated scanning electron microscopy and elemental analysis by energy-dispersive x-ray emission spectroscopy (EDS). EDS produces large data sets that must be analyzed using multivariate statistical techniques. A complete study uses cluster analysis, discriminant analysis, and factor or principal components analysis (PCA). The three techniques are used in the study of particles sampled during the FeLine cruise to the mid-Pacific Ocean in the summer of 1990. The mid-Pacific aerosol provides information on long-range particle transport, iron deposition, sea-salt ageing, and halogen chemistry. Aerosol particle data sets present a number of difficulties for pattern recognition using cluster analysis. There is a great disparity in the number of observations per cluster and in the range of the variables in each cluster. The variables are not normally distributed, they are subject to considerable experimental error, and many values are zero because of finite detection limits. Many of the clusters show considerable overlap because of natural variability, agglomeration, and chemical reactivity.
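A minimal sketch of the multivariate side of such a study is given below, assuming synthetic particle-by-element data: standardize the intensities, reduce them with PCA, then cluster the scores. The mock data, component and cluster counts, and the use of k-means are assumptions of the illustration; the study itself also applies discriminant and factor analysis.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.gamma(shape=0.5, scale=10.0, size=(500, 12))   # mock particle-by-element EDS counts
X[X < 1.0] = 0.0                                        # mimic values below detection limits

scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(clusters))                            # particles per cluster
```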


Author(s):  
Mykhajlo Klymash ◽  
Olena Hordiichuk-Bublivska ◽  
Ihor Tchaikovskyi ◽  
Oksana Urikova

This article investigates the features of processing large arrays of information in distributed systems. A method of singular value decomposition is used to reduce the amount of data processed by eliminating redundancy. Dependencies of computational efficiency in distributed systems were obtained using the MPI message-passing protocol and the MapReduce model of node interaction. The efficiency of each technology was analyzed for processing data of different sizes: non-distributed systems are inefficient for large volumes of information because of their low computing performance. It is proposed to use distributed systems that apply singular value decomposition, which reduces the amount of information processed. The study of systems using the MPI protocol and the MapReduce model yielded the dependence of calculation time on the number of processes, which confirms the expediency of distributed computing for processing large data sets. It was also found that distributed systems using the MapReduce model work much more efficiently than MPI, especially with large amounts of data, while MPI performs calculations more efficiently for small amounts of information. As the data sets grow, it is advisable to use the MapReduce model.
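The data-reduction step can be illustrated with a single-node sketch of truncated singular value decomposition; the matrix shape and rank are assumptions, and the comment only indicates where the distributed (MPI or MapReduce) variant would split the work.

```python
import numpy as np

# In a distributed setting the row blocks of A would be factored on separate
# workers (MPI ranks or MapReduce mappers) and the partial factors combined.
rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 200))        # one block of the large data array

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 20                                        # keep only the k largest singular values
A_reduced = U[:, :k] * s[:k]                  # compressed representation, 10_000 x 20
A_approx = A_reduced @ Vt[:k, :]              # low-rank reconstruction when needed

kept = (A_reduced.size + Vt[:k, :].size) / A.size
print(f"stored fraction of the original volume: {kept:.2%}")
```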


2018 ◽  
Vol 2018 (6) ◽  
pp. 38-39
Author(s):  
Austa Parker ◽  
Yan Qu ◽  
David Hokanson ◽  
Jeff Soller ◽  
Eric Dickenson ◽  
...  

Computers ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 47
Author(s):  
Fariha Iffath ◽  
A. S. M. Kayes ◽  
Md. Tahsin Rahman ◽  
Jannatul Ferdows ◽  
Mohammad Shamsul Arefin ◽  
...  

A programming contest generally involves the host presenting a set of logical and mathematical problems to the contestants, who are required to write computer programs capable of solving them. An online judge system is used to automate the judging of the submitted programs: online judges are systems designed for the reliable evaluation of source code submitted by users. Traditional online judging platforms are not well suited to programming labs, as they support neither partial scoring nor efficient detection of plagiarized code. With this in mind, in this paper we present an online judging framework capable of automatically scoring code by efficiently detecting plagiarized content and the level of accuracy of the code. Our system detects plagiarism by extracting fingerprints of programs and comparing the fingerprints instead of the whole files. We use winnowing to select fingerprints among the k-gram hash values of a source file, generated by the Rabin-Karp algorithm. The proposed system is compared with existing online judging platforms to show its superiority in terms of time efficiency, correctness, and feature availability. In addition, we evaluated our system on large data sets and compared its run time with MOSS, a widely used plagiarism detection tool.
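The fingerprinting step described above (k-gram Rabin-Karp hashes winnowed into a fingerprint that is compared instead of whole files) can be sketched as follows; the parameter values k=5 and w=4 and the Jaccard score are illustrative choices, not the paper's settings.

```python
def kgram_hashes(source, k=5, base=256, mod=(1 << 61) - 1):
    """Rolling Rabin-Karp hashes of every k-gram of the whitespace-stripped source."""
    s = "".join(source.split())
    if len(s) < k:
        return []
    high = pow(base, k - 1, mod)
    h, hashes = 0, []
    for i, ch in enumerate(s):
        h = (h * base + ord(ch)) % mod
        if i >= k - 1:
            hashes.append(h)
            h = (h - ord(s[i - k + 1]) * high) % mod
    return hashes

def winnow(hashes, w=4):
    """Winnowing: keep the minimum hash of every window of w consecutive hashes
    (rightmost minimum on ties); the selected hashes form the fingerprint."""
    fingerprint = set()
    for i in range(max(len(hashes) - w + 1, 0)):
        window = hashes[i:i + w]
        j = max(idx for idx, v in enumerate(window) if v == min(window))
        fingerprint.add(window[j])
    return fingerprint

def similarity(code_a, code_b, k=5, w=4):
    """Jaccard overlap of two programs' fingerprints, usable as a plagiarism score."""
    fa, fb = winnow(kgram_hashes(code_a, k), w), winnow(kgram_hashes(code_b, k), w)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

print(similarity("for i in range(10): total += i",
                 "for j in range(10): total += j"))   # high overlap
print(similarity("for i in range(10): total += i",
                 "while queue: node = queue.pop()"))  # low overlap
```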

