SRI: A Scalable Rule Induction Algorithm

Rule induction as a method for constructing classifiers is particularly attractive in data mining applications, where the comprehensibility of the generated models is very important. Most existing techniques were designed for small data sets and thus are not practical for direct use on very large data sets because of their computational inefficiency. Scaling up rule induction methods to handle such data sets is a formidable challenge. This article presents a new algorithm for rule induction that can efficiently extract accurate and comprehensible models from large and noisy data sets. This algorithm has been tested on several complex data sets, and the results prove that it scales up well and is an extremely effective learner.

Download Full-text

Assessment of Master Curve Material Inhomogeneity Using Small Data Sets

Volume 1A: Codes and Standards ◽

10.1115/pvp2018-84297 ◽

2018 ◽

Author(s):

Kim Wallin

Keyword(s):

Modal Analysis ◽

Master Curve ◽

Large Data ◽

Large Data Sets ◽

Small Data ◽

Data Sets ◽

Inhomogeneous Materials ◽

Material Inhomogeneity ◽

Analysis Methods ◽

Small Data Sets

The standard Master Curve (MC) deals only with materials assumed to be homogeneous, but MC analysis methods for inhomogeneous materials have also been developed. Especially the bi-modal and multi-modal analysis methods are becoming more and more standard. Their drawback is that these methods are generally reliable only with sufficiently large data sets (number of valid tests, r ≥ 15–20). Here, the possibility of using the multi-modal analysis method with smaller data sets is assessed, and a new procedure to conservatively account for possible inhomogeneities is proposed.

Download Full-text

Fooled Again and Again

The Phantom Pattern Problem ◽

10.1093/oso/9780198864165.003.0005 ◽

2020 ◽

pp. 75-100

Author(s):

Gary Smith ◽

Jay Cordes

Keyword(s):

Test Scores ◽

Large Data ◽

Large Data Sets ◽

Small Data ◽

Data Sets ◽

Small Data Sets ◽

Athletic Competitions

Patterns are inevitable and we should not be surprised by them. Streaks, clusters, and correlations are the norm, not the exception. In a large number of coin flips, there are likely to be coincidental clusters of heads and tails. In nationwide data on cancer, crime, or test scores, there are likely to be flukey clusters. When the data are separated into smaller geographic units like cities, the most extreme results are likely to be found in the smallest cities. In athletic competitions between well-matched teams, the outcome of a small number of games is almost meaningless. Our challenge is to overcome our inherited inclination to think that all patterns are meaningful; for example, thinking that clustering in large data sets or differences among small data sets must be something real that needs to be explained. Often, it is just meaningless happenstance.

Download Full-text

On the Use of the Inverted Abel Integral for Evaluating Spectroscopic Sources

Applied Spectroscopy ◽

10.1366/0003702814731806 ◽

1981 ◽

Vol 35 (1) ◽

pp. 35-42 ◽

Cited By ~ 13

Author(s):

J. D. Algeo ◽

M. B. Denton

Keyword(s):

Noisy Data ◽

Large Data ◽

Cubic Spline ◽

Low Noise ◽

Large Data Sets ◽

Small Data ◽

Data Sets ◽

Spline Method ◽

Noise Levels ◽

Small Data Sets

A numerical method for evaluating the inverted Abel integral employing cubic spline approximations is described along with a modification of the procedure of Cremers and Birkebak, and an extension of the Barr method. The accuracy of the computations is evaluated at several noise levels and with varying resolution of the input data. The cubic spline method is found to be useful only at very low noise levels, but capable of providing good results with small data sets. The Barr method is computationally the simplest, and is adequate when large data sets are available. For noisy data, the method of Cremers and Birkebak gave the best results.

Download Full-text

The relationship between journal citation impact and citation sentiment: A study of 32 million citances in PubMed Central

Quantitative Science Studies ◽

10.1162/qss_a_00040 ◽

2020 ◽

pp. 1-11

Author(s):

Erjia Yan ◽

Zheng Chen ◽

Kai Li

Keyword(s):

Social Science ◽

Natural Science ◽

Citation Impact ◽

Large Data ◽

Small Data ◽

Data Sets ◽

Data Set ◽

Pubmed Central ◽

Small Data Sets ◽

Sentiment Score

Citation sentiment plays an important role in citation analysis and scholarly communication research, but prior citation sentiment studies have used small data sets and relied largely on manual annotation. This paper uses a large data set of PubMed Central (PMC) full-text publications and analyzes citation sentiment in more than 32 million citances within PMC, revealing citation sentiment patterns at the journal and discipline levels. This paper finds a weak relationship between a journal’s citation impact (as measured by CiteScore) and the average sentiment score of citances to its publications. When journals are aggregated into quartiles based on citation impact, we find that journals in higher quartiles are cited more favorably than those in the lower quartiles. Further, social science journals are found to be cited with higher sentiment, followed by engineering and natural science and biomedical journals, respectively. This result may be attributed to disciplinary discourse patterns in which social science researchers tend to use more subjective terms to describe others’ work than do natural science or biomedical researchers.

Download Full-text

Moving Window Two-Dimensional Correlation Spectroscopy and Determination of Signal-To-Noise Threshold in Correlation Spectra

Applied Spectroscopy ◽

10.1366/000370203322258977 ◽

2003 ◽

Vol 57 (8) ◽

pp. 996-1006 ◽

Cited By ~ 15

Author(s):

Slobodan Šašić ◽

Yukihiro Ozaki

Keyword(s):

Large Data ◽

Correlation Spectroscopy ◽

Data Matrix ◽

Data Sets ◽

Complex Data ◽

Moving Window ◽

Two Dimensional ◽

Complex Data Sets ◽

Coexisting Species ◽

Definition Of

In this paper we report two new developments in two-dimensional (2D) correlation spectroscopy; one is the combination of the moving window concept with 2D spectroscopy to facilitate the analysis of complex data sets, and the other is the definition of the noise level in synchronous/asynchronous maps. A graphical criterion for the latter is also proposed. The combination of the moving window concept with correlation spectra allows one to split a large data matrix into smaller and simpler subsets and to analyze them instead of computing overall correlation. A three-component system that mimics a consecutive chemical reaction is used as a model for the illustration of the two ideas. Both types of correlation matrices, variable–variable and sample–sample, are analyzed, and a very good agreement between the two is met. The proposed innovations enable one to comprehend the complexity of the data to be analyzed by 2D spectroscopy and thus to avoid the risks of over-interpretation, liable to occur whenever improper caution about the number of coexisting species in the system is taken.

Download Full-text

The European Creep Collaborative Committee (ECCC) Approach to Creep Data Assessment

Journal of Pressure Vessel Technology ◽

10.1115/1.2894296 ◽

2008 ◽

Vol 130 (2) ◽

Cited By ~ 9

Author(s):

Stuart Holdsworth

Keyword(s):

Rupture Strength ◽

Large Data ◽

Small Data ◽

Data Sets ◽

Creep Data ◽

Data Set ◽

Data Assessment ◽

Long Time ◽

Small Data Sets ◽

Material Conditions

The European Creep Collaborative Committee (ECCC) approach to creep data assessment has now been established for almost ten years. The methodology covers the analysis of rupture strength and ductility, creep strain, and stress relaxation data, for a range of material conditions. This paper reviews the concepts and procedures involved. The original approach was devised to determine data sheets for use by committees responsible for the preparation of National and International Design and Product Standards, and the methods developed for data quality evaluation and data analysis were therefore intentionally rigorous. The focus was clearly on the determination of long-time property values from the largest possible data sets involving a significant number of observations in the mechanism regime for which predictions were required. More recently, the emphasis has changed. There is now an increasing requirement for full property descriptions from very short times to very long and hence the need for much more flexible model representations than were previously required. There continues to be a requirement for reliable long-time predictions from relatively small data sets comprising relatively short duration tests, in particular, to exploit new alloy developments at the earliest practical opportunity. In such circumstances, it is not feasible to apply the same degree of rigor adopted for large data set assessment. Current developments are reviewed.

Download Full-text

A Semiautomated Framework for Integrating Expert Knowledge into Disease Marker Identification

Disease Markers ◽

10.1155/2013/613529 ◽

2013 ◽

Vol 35 ◽

pp. 513-523 ◽

Cited By ~ 3

Author(s):

Jing Wang ◽

Bobbie-Jo M. Webb-Robertson ◽

Melissa M. Matzke ◽

Susan M. Varnum ◽

Joseph N. Brown ◽

...

Keyword(s):

Biomarker Discovery ◽

Expert Knowledge ◽

Large Data ◽

Large Data Sets ◽

Biological Information ◽

Data Sets ◽

Complex Data ◽

Selection Scheme ◽

Biomarker Identification ◽

Biomarker Selection

Background. The availability of large complex data sets generated by high throughput technologies has enabled the recent proliferation of disease biomarker studies. However, a recurring problem in deriving biological information from large data sets is how to best incorporate expert knowledge into the biomarker selection process.Objective. To develop a generalizable framework that can incorporate expert knowledge into data-driven processes in a semiautomated way while providing a metric for optimization in a biomarker selection scheme.Methods. The framework was implemented as a pipeline consisting of five components for the identification of signatures from integrated clustering (ISIC). Expert knowledge was integrated into the biomarker identification process using the combination of two distinct approaches; a distance-based clustering approach and an expert knowledge-driven functional selection.Results. The utility of the developed framework ISIC was demonstrated on proteomics data from a study of chronic obstructive pulmonary disease (COPD). Biomarker candidates were identified in a mouse model using ISIC and validated in a study of a human cohort.Conclusions. Expert knowledge can be introduced into a biomarker discovery process in different ways to enhance the robustness of selected marker candidates. Developing strategies for extracting orthogonal and robust features from large data sets increases the chances of success in biomarker identification.

Download Full-text

Exception Included, Ordered Rule Induction from the Set of Exemplars (ExIORISE)

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b1039.1292s19 ◽

2019 ◽

Vol 9 (2S) ◽

pp. 57-62

Keyword(s):

Rule Induction ◽

Small Data ◽

Data Sets ◽

Human Beings ◽

Classification Problems ◽

Instance Based Learning ◽

Rule Set ◽

Small Data Sets ◽

Multi Class Classification ◽

Different Levels

An expert system is one which uses collection of data comprising the knowledge to offer guidance or make inferences. Its work in most cases can be seen as classification which is basically the task of assigning objects to different categories or classes, determined by the properties of those objects. Numerous research works have been and being done to develop efficient knowledge acquisition techniques for expert systems. Some state-of-the-art algorithms are great performers but need extensive learning whereas older rule / decision- tree based algorithms perform pretty well with small data sets. Moreover, co-existence of learners of different levels of expertise and accuracy is believed to be encouraged to achieve a cumulative intelligence just like the human beings have. RISE is one such algorithm that infuses both instance-based learning and rule induction. It proved to be quite efficient handling binary and multi-class classification problems for small data sets in terms of accuracy and cost as well. In this work, features like exclusion of inefficient rules, inclusion of exceptions in the rule set and ordering of the rules using weights beforehand are integrated with the classical RISE algorithm to develop a more efficient classifier system named as ExIORISE. Empirical study shows that ExIORISE outperforms RISE, C4.5 and CN2 significantly.

Download Full-text

Topological data analysis : an overview

10.21079/11681/40943 ◽

2021 ◽

Author(s):

Amy Bednar

Keyword(s):

Data Analysis ◽

Large Data ◽

Large Data Sets ◽

Topological Data Analysis ◽

Background Information ◽

High Dimensional ◽

Data Sets ◽

Complex Data ◽

Topological Network ◽

Topological Data

A growing area of mathematics topological data analysis (TDA) uses fundamental concepts of topology to analyze complex, high-dimensional data. A topological network represents the data, and the TDA uses the network to analyze the shape of the data and identify features in the network that correspond to patterns in the data. These patterns extract knowledge from the data. TDA provides a framework to advance machine learning’s ability to understand and analyze large, complex data. This paper provides background information about TDA, TDA applications for large data sets, and details related to the investigation and implementation of existing tools and environments.

Download Full-text

Rules-6: A Simple Rule Induction Algorithm for Handling Large Data Sets

Proceedings of the Institution of Mechanical Engineers Part C Journal of Mechanical Engineering Science ◽

10.1243/095440605x31931 ◽

2005 ◽

Vol 219 (10) ◽

pp. 1119-1137 ◽

Cited By ~ 11

Author(s):

D T Pham ◽

A A Afify

Keyword(s):

Inductive Learning ◽

Empirical Evaluation ◽

Large Data ◽

Rule Induction ◽

Large Data Sets ◽

Data Sets ◽

Practical Tool ◽

Noise Tolerant ◽

Efficient Mechanisms ◽

Rule Evaluation

RULES-3 Plus is a member of the RULES family of simple inductive learning algorithms with successful engineering applications. However, it requires modification in order to be a practical tool for problems involving large data sets. In particular, efficient mechanisms are needed for handling continuous attributes and noisy data. This article presents a new rule induction algorithm called RULES-6, which is derived from the RULES-3 Plus algorithm. The algorithm employs a fast and noise-tolerant search method for extracting IF-THEN rules from examples. It also uses simple and effective methods for rule evaluation and handling of continuous attributes. A detailed empirical evaluation of the algorithm is reported in this paper. The results presented demonstrate the strong performance of the algorithm.

Download Full-text