Calibration of the empirical fundamental relationship using very large databases

TRANSPORTES ◽  
2021 ◽  
Vol 29 (1) ◽  
pp. 212-228
Author(s):  
Juliana Mitsuyama Cardoso ◽  
Lucas Assirati ◽  
José Reynaldo Setti

This paper describes a procedure for fitting traffic stream models using very large traffic databases. The proposed approach consists of four steps: (1) an initial treatment to eliminate noisy, inaccurate data and to homogenize the information over the density range; (2) a first fitting of the model, based on the sum of squared orthogonal errors; (3) a second filter, to eliminate outliers that survived the initial data treatment; and (4) a second fitting of the model. The proposed approach was tested by fitting the Van Aerde traffic stream model to 104 thousand observations collected by a permanent traffic monitoring station on a freeway in the metropolitan region of São Paulo, Brazil. The model fitting used a genetic algorithm to search for the best values of the model parameters. The results demonstrate the effectiveness of the proposed approach.
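To make the four-step procedure concrete, the sketch below is a minimal illustration (not the authors' code) of fitting the Van Aerde speed-density relationship by minimizing the sum of squared orthogonal errors. SciPy's differential_evolution stands in for the genetic algorithm used in the paper, and the parameter bounds and grid resolution are assumptions chosen only for illustration.

```python
# Minimal sketch: orthogonal-error fit of the Van Aerde model, with an
# evolutionary optimizer (scipy's differential_evolution) standing in for
# the paper's genetic algorithm. Bounds and constants are illustrative only.
import numpy as np
from scipy.optimize import differential_evolution

def van_aerde_density(v, c1, c2, c3, vf):
    """Density k (veh/km) as a function of speed v (km/h) in the Van Aerde model."""
    return 1.0 / (c1 + c2 / (vf - v) + c3 * v)

def orthogonal_sse(params, k_obs, v_obs, k_scale, v_scale):
    """Sum of squared (normalized) orthogonal distances from observations to the curve."""
    c1, c2, c3, vf = params
    v_curve = np.linspace(1e-3, vf - 1e-3, 400)          # sample points along the curve
    k_curve = van_aerde_density(v_curve, c1, c2, c3, vf)
    dk = (k_obs[:, None] - k_curve[None, :]) / k_scale    # normalized density offsets
    dv = (v_obs[:, None] - v_curve[None, :]) / v_scale    # normalized speed offsets
    return np.sum(np.min(dk**2 + dv**2, axis=1))          # nearest curve point per observation

def fit_van_aerde(k_obs, v_obs):
    k_scale, v_scale = k_obs.max(), v_obs.max()
    # Illustrative parameter bounds: c1, c2, c3 in km-based units, vf in km/h.
    bounds = [(1e-4, 0.1), (1e-4, 10.0), (1e-5, 0.01), (60.0, 140.0)]
    result = differential_evolution(orthogonal_sse, bounds,
                                    args=(k_obs, v_obs, k_scale, v_scale),
                                    seed=0, maxiter=200, polish=True)
    return result.x

# Steps (3) and (4) of the procedure: after a first call to fit_van_aerde,
# observations whose orthogonal distance exceeds a chosen threshold would be
# dropped and the model refit on the remaining data.
```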


Data Mining ◽  
2013 ◽  
pp. 603-623
Author(s):  
Rahime Belen ◽  
Tugba Taskaya Temizel

Many manually populated very large databases suffer from data quality problems such as missing or inaccurate data and duplicate entries. A recently recognized data quality problem is disguised missing data, which arises when an explicit code for missing data, such as NA (Not Available), is not provided and a legitimate data value is used instead. The presence of these values can severely affect the outcome of data mining tasks: association mining algorithms may produce biased, inaccurate association rules, and clustering techniques may produce invalid clusters. Detecting and eliminating these values is necessary but burdensome to carry out manually. This chapter first explains methods for detecting disguised missing values by visual inspection, then describes methods for detecting them automatically. Finally, a framework for detecting disguised missing data is proposed and demonstrated on spatial and categorical data sets.
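As a rough illustration of automatic detection, the sketch below uses a simple frequency-based screen: a value is flagged as a candidate disguise when it dominates its column far more than other values do. This heuristic and its thresholds are assumptions for illustration only, not the chapter's framework.

```python
# Minimal sketch: a frequency-based screen for disguised missing values.
# Thresholds (ratio, min_share) are illustrative assumptions.
import pandas as pd

def candidate_disguises(df: pd.DataFrame, ratio: float = 5.0, min_share: float = 0.05):
    """Return {column: [suspicious values]} for values whose share of the column
    exceeds min_share and is at least `ratio` times the column's median value share."""
    suspects = {}
    for col in df.columns:
        shares = df[col].value_counts(normalize=True, dropna=True)
        if shares.empty:
            continue
        baseline = shares.median()
        flagged = [v for v, s in shares.items()
                   if s >= min_share and s >= ratio * max(baseline, 1e-9)]
        if flagged:
            suspects[col] = flagged
    return suspects

# Example: a column where "0" or "1900-01-01" was entered in place of a true
# missing value will usually stand out as an implausibly frequent single value.
```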


2018 ◽  
Author(s):  
Josephine Ann Urquhart ◽  
Akira O'Connor

Receiver operating characteristics (ROCs) are plots which provide a visual summary of a classifier’s decision response accuracy at varying discrimination thresholds. Typical practice, particularly within psychological studies, involves plotting an ROC from a limited number of discrete thresholds before fitting signal detection parameters to the plot. We propose that additional insight into decision-making could be gained through increasing ROC resolution, using trial-by-trial measurements derived from a continuous variable, in place of discrete discrimination thresholds. Such continuous ROCs are not yet routinely used in behavioural research, which we attribute to issues of practicality (i.e. the difficulty of applying standard ROC model-fitting methodologies to continuous data). Consequently, the purpose of the current article is to provide a documented method of fitting signal detection parameters to continuous ROCs. This method reliably produces model fits equivalent to the unequal variance least squares method of model-fitting (Yonelinas et al., 1998), irrespective of the number of data points used in ROC construction. We present the suggested method in three main stages: I) building continuous ROCs, II) model-fitting to continuous ROCs and III) extracting model parameters from continuous ROCs. Throughout the article, procedures are demonstrated in Microsoft Excel, using an example continuous variable: reaction time, taken from a single-item recognition memory task. Supplementary MATLAB code used for automating our procedures is also presented in Appendix B, with a validation of the procedure using simulated data shown in Appendix C.
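The sketch below illustrates the three stages in Python rather than the article's Excel/MATLAB: (I) sweep every observed value of the continuous variable as a threshold, (II) fit an unequal-variance signal detection model by least squares, (III) read off the fitted parameters. The variable names, scoring direction, and least-squares formulation are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch: continuous ROC construction and an unequal-variance
# signal detection fit by least squares. Illustrative only.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def continuous_roc(scores_old, scores_new):
    """Hit and false-alarm rates at every observed score, treating larger
    scores as stronger 'old' evidence."""
    thresholds = np.sort(np.concatenate([scores_old, scores_new]))[::-1]
    hr = np.array([(scores_old >= t).mean() for t in thresholds])
    fa = np.array([(scores_new >= t).mean() for t in thresholds])
    return fa, hr

def fit_uvsd(fa, hr):
    """Least-squares fit of the old-item distribution's mean (mu) and SD (sigma),
    with new items fixed as a standard normal."""
    keep = (fa > 0) & (fa < 1) & (hr > 0) & (hr < 1)   # drop degenerate points
    fa, hr = fa[keep], hr[keep]
    crit = -norm.ppf(fa)                               # criterion implied by each FA rate
    def sse(params):
        mu, sigma = params
        pred_hr = norm.cdf((mu - crit) / sigma)        # predicted hit rate at each criterion
        return np.sum((hr - pred_hr) ** 2)
    res = minimize(sse, x0=[1.0, 1.0], bounds=[(-5, 5), (0.1, 5)])
    return res.x  # mu (akin to d'), sigma (old/new SD ratio)
```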


Author(s):  
Gautam Das

In recent years, advances in data collection and management technologies have led to a proliferation of very large databases. These large data repositories typically are created in the hope that, through analysis such as data mining and decision support, they will yield new insights into the data and the real-world processes that created them. In practice, however, while the collection and storage of massive datasets has become relatively straightforward, effective data analysis has proven more difficult to achieve. One reason that data analysis successes have proven elusive is that most analysis queries, by their nature, require aggregation or summarization of large portions of the data being analyzed. For multi-gigabyte data repositories, this means that processing even a single analysis query involves accessing enormous amounts of data, leading to prohibitively expensive running times. This severely limits the feasibility of many types of analysis applications, especially those that depend on timeliness or interactivity.


Author(s):  
Andrew Borthwick ◽  
Stephen Ash ◽  
Bin Pang ◽  
Shehzad Qureshi ◽  
Timothy Jones

Author(s):  
Ivan Bruha

Research in intelligent information systems investigates the possibilities of enhancing their overall performance, particularly their prediction accuracy and time complexity. One such discipline, data mining (DM), usually processes very large databases in a profound and robust way (Fayyad et al., 1996). DM refers to the overall process of deriving useful knowledge from databases, that is, extracting high-level knowledge from low-level data in the context of large databases. This article discusses two newer directions in this field, namely knowledge combination and meta-learning (Vilalta & Drissi, 2002). There exist approaches to combining various paradigms into one robust (hybrid, multistrategy) system which utilizes the advantages of each subsystem and tries to eliminate their drawbacks. There is a general belief that integrating results obtained from multiple lower-level decision-making systems, each usually (but not necessarily) based on a different paradigm, produces better performance. Such multi-level knowledge-based systems are usually referred to as knowledge integration systems. One subset of these systems is called knowledge combination (Fan et al., 1996). We focus on a common topology of the knowledge combination strategy with base learners and base classifiers (Bruha, 2004). Meta-learning investigates how learning systems may improve their performance through experience in order to become flexible. Its goal is to search dynamically for the best learning strategy. We define the fundamental characteristics of meta-learning, such as bias and hypothesis space. Section 2 surveys the various directions in algorithms and topologies utilized in knowledge combination and meta-learning. Section 3 presents the main focus of this article: a description of knowledge combination techniques, meta-learning, and a particular application, including the corresponding flow charts. The last section presents future trends in these topics.
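As a rough illustration of the base-learner / combiner topology described above, the sketch below uses scikit-learn stacking: several base classifiers, each built on a different paradigm, feed a meta-level combiner. This is a generic stand-in for the knowledge-combination idea, not the specific scheme of Bruha (2004); the dataset and classifier choices are assumptions for demonstration.

```python
# Minimal sketch: base classifiers combined by a meta-level learner (stacking).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base classifiers, each based on a different paradigm (tree, probabilistic, ensemble).
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# The combiner (meta-level learner) integrates the base classifiers' outputs.
combiner = StackingClassifier(estimators=base_learners,
                              final_estimator=LogisticRegression(max_iter=1000))
combiner.fit(X_train, y_train)
print("combined accuracy:", combiner.score(X_test, y_test))
```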


Author(s):  
Rashed Mustafa ◽  
Md Javed Hossain ◽  
Thomas Chowdhury

Distributed Database Management Systems (DDBMS) are one of the prime concerns in distributed computing. The driving force behind their development is the demand from applications that need to query very large databases (on the order of terabytes). Traditional client-server database systems are too slow to handle such applications. This paper presents a better way to find the optimal number of nodes in a distributed database management system.
Keywords: DDBMS, Data Fragmentation, Linear Search, RMI
DOI: 10.3329/diujst.v4i2.4362
Daffodil International University Journal of Science and Technology, Vol. 4(2), 2009, pp. 19-22
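To illustrate the kind of trade-off such an optimization involves, the sketch below uses a toy cost model: scanning equal fragments in parallel gets faster with more nodes, while per-node coordination overhead grows. The model, constants, and function names are assumptions for illustration; the paper's actual model is not reproduced here.

```python
# Minimal sketch: a toy cost model for choosing the number of nodes in a DDBMS.
# All constants are illustrative assumptions.
def query_time(n_nodes, table_rows, rows_per_sec=2.0e6, overhead_per_node=0.05):
    """Estimated query time (seconds): parallel linear scan of equal fragments
    plus a per-node coordination/merge overhead."""
    scan = (table_rows / n_nodes) / rows_per_sec
    coordination = overhead_per_node * n_nodes
    return scan + coordination

def optimal_nodes(table_rows, max_nodes=256):
    """Node count that minimizes the estimated query time under this toy model."""
    times = {n: query_time(n, table_rows) for n in range(1, max_nodes + 1)}
    return min(times, key=times.get)

print(optimal_nodes(1_000_000_000))  # rough optimum under the assumed constants
```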

