Variable Binned Scatter Plots

2010 ◽  
Vol 9 (3) ◽  
pp. 194-203 ◽  
Author(s):  
Ming C. Hao ◽  
Umeshwar Dayal ◽  
Ratnesh K. Sharma ◽  
Daniel A. Keim ◽  
Halldór Janetzko

The scatter plot is a well-known method of visualizing pairs of continuous variables. Scatter plots are intuitive and easy to use, but they often suffer from a high degree of overlap, which may occlude a significant portion of the data. Analyzing a dense, non-uniform data set in detail requires a recursive drill-down. In this article, we propose variable binned scatter plots to allow the visualization of large amounts of data without overlap. The basic idea is to use a non-uniform (variable) binning of the x and y dimensions and to plot all data points that are located within each bin into the corresponding square. In the visualization, each data point is then represented by a small cell (pixel). Users are able to interact with individual data points to obtain record-level information. To analyze an interesting area of the scatter plot, variable binned scatter plots with a refined scale for the subarea can be generated recursively as needed. Furthermore, we map a third attribute to color to obtain a visual clustering. We have applied variable binned scatter plots to solve real-world problems in the areas of credit card fraud and data center energy consumption, visualizing their data distributions and cause-effect relationships among multiple attributes. A comparison of our methods with two recent scatter plot variants is included.
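As a rough illustration of the idea, the sketch below bins both axes at quantiles (one possible non-uniform binning; the paper's binning scheme differs in detail) and lays each bin's points out on a small internal grid so no two markers overlap, with a third attribute mapped to color. The function name and the quantile choice are assumptions for illustration; inputs are NumPy arrays.

```python
import numpy as np
import matplotlib.pyplot as plt

def variable_binned_scatter(x, y, c, n_bins=8, ax=None):
    # Non-uniform (quantile) bin edges, so each bin covers roughly the
    # same number of points -- the "variable binning" idea.
    xe = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    ye = np.quantile(y, np.linspace(0, 1, n_bins + 1))
    xi = np.digitize(x, xe[1:-1])   # bin index 0..n_bins-1 for each point
    yi = np.digitize(y, ye[1:-1])
    ax = ax or plt.gca()
    for i in range(n_bins):
        for j in range(n_bins):
            sel = (xi == i) & (yi == j)
            k = int(sel.sum())
            if k == 0:
                continue
            # Lay the bin's points on a small grid inside the bin square
            # so that no two points overlap.
            side = int(np.ceil(np.sqrt(k)))
            u, v = np.meshgrid(np.arange(side), np.arange(side))
            u, v = u.ravel()[:k], v.ravel()[:k]
            px = xe[i] + (u + 0.5) * (xe[i + 1] - xe[i]) / side
            py = ye[j] + (v + 0.5) * (ye[j + 1] - ye[j]) / side
            ax.scatter(px, py, c=c[sel], s=4, marker='s')
    ax.set_xticks(xe)
    ax.set_yticks(ye)
    return ax
```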

2012 ◽  
Vol 9 (1) ◽  
pp. 1335-1343
Author(s):  
V. V. Vetrova ◽  
W. E. Bardsley

Abstract. Data-sparse zones in scatter plots of hydrological variables can be of interest in various contexts. For example, a well-defined data-sparse zone may indicate inhibition of one variable by another. It is of interest therefore to determine whether data-sparse regions in scatter plots are of sufficient extent to be beyond random chance. We consider the specific situation of data-sparse regions defined by a linear internal boundary within a scatter plot defined over a rectangular region. An Excel VBA macro is provided for carrying out a randomisation-based significance test of the data-sparse region, taking into account both the within-region number of data points and the extent of the region. Example applications are given with respect to a rainfall time series from Israel and to validation scatter plots from a seasonal forecasting model for lake inflows in New Zealand.
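The original tool is an Excel VBA macro; the sketch below is a hypothetical Python analogue of the randomisation idea: rescale the scatter to the unit square, count points inside a candidate sparse region, and compare against uniform random scatters. It treats the region as fixed in advance, which is a simplification of the paper's test (which also accounts for the region's extent and its linear internal boundary).

```python
import numpy as np

def sparse_region_test(x, y, in_region, n_sim=10000, rng=None):
    # Monte Carlo p-value for the observed point count inside a candidate
    # data-sparse region of a rectangular scatter plot.
    # in_region(u, v) -> boolean array; u, v are rescaled to [0, 1].
    rng = rng or np.random.default_rng(0)
    u = (x - x.min()) / (x.max() - x.min())
    v = (y - y.min()) / (y.max() - y.min())
    observed = in_region(u, v).sum()
    n, hits = len(u), 0
    for _ in range(n_sim):
        ru, rv = rng.random(n), rng.random(n)
        # Count random scatters at least as sparse in the region.
        if in_region(ru, rv).sum() <= observed:
            hits += 1
    return hits / n_sim
```

For a region below a linear boundary, one would pass, e.g., `in_region = lambda u, v: v < 0.8 * u - 0.1`.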


Author(s):  
Simona Babiceanu ◽  
Sanhita Lahiri ◽  
Mena Lockwood

This study uses a suite of performance measures, developed by taking into consideration various aspects of congestion and reliability, to assess the impacts of safety projects on congestion. Safety projects are necessary to help move Virginia's roadways toward safer operation, but they can contribute to congestion and unreliability during execution and can affect operations after execution. However, safety projects are assessed primarily for safety improvements, not for congestion. This study identifies an appropriate suite of measures and quantifies and compares the congestion and reliability impacts of safety projects on roadways for the periods before, during, and after project execution. The paper presents the performance measures, examines their sensitivity under different operating conditions, defines thresholds for congestion and reliability, and demonstrates the measures on a set of Virginia safety projects. The data set consists of 10 projects totalling 92 mi and more than 1M data points. The study found that, overall, safety projects tended to have a positive impact on congestion and reliability after completion, and that the congestion variability measures were sensitive to the reliability threshold. The study concludes with practical recommendations for primary measures that may be used to gauge the overall impacts of safety projects: percent vehicle miles traveled (VMT) reliable, with a customized threshold for Virginia; percent VMT delayed; and time to travel 10 mi. However, caution should be used when applying the results directly to other situations because of the limited number of projects used in the study.
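The abstract names the three recommended measures without defining them, so the following sketch shows one plausible reading. The 1.5 reliability ratio, the function names, and the exact formulas are assumptions, not Virginia's customized definitions; inputs are per-segment-period NumPy arrays.

```python
import numpy as np

def percent_vmt_reliable(tt, free_flow_tt, vmt, threshold=1.5):
    # A segment-period counts as "reliable" when its travel time stays
    # below `threshold` times free-flow travel time (the customized
    # Virginia threshold would replace the assumed 1.5 here).
    reliable = tt <= threshold * free_flow_tt
    return 100 * vmt[reliable].sum() / vmt.sum()

def percent_vmt_delayed(tt, free_flow_tt, vmt):
    # Share of VMT accumulated while travel time exceeds free flow.
    delayed = tt > free_flow_tt
    return 100 * vmt[delayed].sum() / vmt.sum()

def time_to_travel_10mi(speed_mph):
    # Minutes needed to cover 10 mi at the observed mean speed.
    return 10 / np.mean(speed_mph) * 60
```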


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 37
Author(s):  
Shixun Wang ◽  
Qiang Chen

Boosting in ensemble learning has made great progress, but most methods boost a single modality. For this reason, we extend a simple multiclass boosting framework that uses local similarity as its weak learner to multimodal multiclass boosting. First, with local similarity as the weak learner, the loss function is evaluated to obtain a baseline loss, and the data points are binarized. Then the optimal local similarity and its corresponding loss are found; whichever loss is smaller than the baseline becomes the best so far. Second, the local similarity between pairs of points is computed, and the loss is then calculated from this pairwise similarity. Finally, text and images are retrieved from each other, and the retrieval accuracy is obtained for each modality. Experimental results show that the multimodal multiclass boosting framework with local similarity as the weak learner, evaluated on standard data sets and compared with other state-of-the-art methods, performs competitively.
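The abstract leaves the weak learner's details unspecified, so here is a minimal, hypothetical sketch of SAMME-style multiclass boosting with a similarity-based weak learner. The weighted nearest-prototype rule below is a stand-in of our own, not the authors' local-similarity learner, and the single-modality setting simplifies their multimodal framework.

```python
import numpy as np

def fit_prototype_learner(X, y, w, classes):
    # Weak learner: the highest-weight point of each class becomes its
    # prototype; classification is by the nearest (most similar) prototype.
    return np.array([X[y == c][np.argmax(w[y == c])] for c in classes])

def predict_prototypes(protos, X, classes):
    d = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def samme_boost(X, y, n_rounds=20):
    classes = np.unique(y)
    K, n = len(classes), len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(n_rounds):
        protos = fit_prototype_learner(X, y, w, classes)
        pred = predict_prototypes(protos, X, classes)
        err = w[pred != y].sum()
        if err >= 1 - 1.0 / K:          # no better than chance: stop
            break
        err = max(err, 1e-12)
        alpha = np.log((1 - err) / err) + np.log(K - 1)  # SAMME weight
        w *= np.exp(alpha * (pred != y))                 # reweight mistakes
        w /= w.sum()
        learners.append(protos)
        alphas.append(alpha)
    return learners, alphas, classes

def boost_predict(learners, alphas, classes, X):
    votes = np.zeros((len(X), len(classes)))
    for protos, a in zip(learners, alphas):
        pred = predict_prototypes(protos, X, classes)
        for k, c in enumerate(classes):
            votes[pred == c, k] += a
    return classes[np.argmax(votes, axis=1)]
```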


2015 ◽  
Vol 26 (6) ◽  
pp. 2586-2602 ◽  
Author(s):  
Irantzu Barrio ◽  
Inmaculada Arostegui ◽  
María-Xosé Rodríguez-Álvarez ◽  
José-María Quintana

When developing prediction models for application in clinical practice, health practitioners usually categorise clinical variables that are continuous in nature. Although categorisation is not regarded as advisable from a statistical point of view, due to the loss of information and power, it is common practice in medical research. Consequently, providing researchers with a useful and valid categorisation method is a relevant issue when developing prediction models. Without recommending the categorisation of continuous predictors, our aim is to propose a valid way to do it whenever it is considered necessary by clinical researchers. This paper focuses on categorising a continuous predictor within a logistic regression model, in such a way that the best discriminative ability is obtained in terms of the highest area under the receiver operating characteristic curve (AUC). The proposed methodology is validated when the location of the optimal cut points is known in theory or in practice. In addition, the proposed method is applied to a real data set of patients with an exacerbation of chronic obstructive pulmonary disease, in the context of the IRYSS-COPD study, where a clinical prediction rule for severe evolution was being developed. The clinical variable PCO2 was categorised in both a univariable and a multivariable setting.
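A minimal sketch of the underlying idea for a single cut point: scan candidate cuts, dichotomise the predictor at each, fit a logistic model, and keep the cut with the highest AUC. The grid choice and function name are assumptions, and the authors' estimator is more general (multiple cut points, validation against known optima), so this is only an illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def best_single_cutpoint(x, y, n_grid=50):
    # Candidate cuts at quantiles of x, avoiding the extreme tails.
    candidates = np.quantile(x, np.linspace(0.05, 0.95, n_grid))
    best_cut, best_auc = None, -np.inf
    for c in np.unique(candidates):
        xb = (x >= c).astype(float).reshape(-1, 1)   # dichotomised predictor
        model = LogisticRegression().fit(xb, y)
        auc = roc_auc_score(y, model.predict_proba(xb)[:, 1])
        if auc > best_auc:
            best_cut, best_auc = c, auc
    return best_cut, best_auc
```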


2021 ◽  
Author(s):  
Ahmed Al-Sabaa ◽  
Hany Gamal ◽  
Salaheldin Elkatatny

Abstract The formation porosity of drilled rock is an important parameter that determines the formation storage capacity. The common industrial technique for acquiring rock porosity is the downhole logging tool. Logging while drilling or wireline porosity logging usually provides a complete porosity log for the section of interest; however, operational constraints might preclude the logging job, in addition to its cost. The objective of this study is to provide an intelligent model to predict porosity from drilling parameters. An artificial neural network (ANN), a tool of artificial intelligence (AI), was employed in this study to build the porosity prediction model from drilling parameters such as the weight on bit (WOB), drill string rotating speed (RS), drilling torque (T), stand-pipe pressure (SPP), and mud pumping rate (Q). The novel contribution of this study is a real-time rock porosity model for complex lithology formations using drilling parameters. The model was built using 2,700 data points from Well (A) with a 74:26 training-to-testing ratio. Many sensitivity analyses were performed to optimize the ANN model. The model was validated using an unseen data set (1,000 data points) from Well (B), which is located in the same field and drilled across the same complex lithology. The results showed high performance for the model in the training, testing, and validation processes. The overall accuracy of the model was determined in terms of the correlation coefficient (R) and the average absolute percentage error (AAPE). Overall, R was higher than 0.91 and AAPE was less than 6.1% for model building and validation. Predicting rock porosity in real time while drilling will save logging costs and will also provide a guide for formation storage capacity and interpretation analysis.
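A minimal sketch of this kind of workflow, assuming scaled inputs and a small feed-forward network: the hidden-layer sizes and other hyperparameters below are placeholders, not the configuration the authors reached through their sensitivity analyses. The R and AAPE formulas match the metrics named in the abstract.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X columns: WOB, RS, T, SPP, Q; y: porosity from the reference log.
def train_porosity_model(X_train, y_train):
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(20, 20),  # assumed architecture
                     max_iter=2000, random_state=0),
    )
    return model.fit(X_train, y_train)

def evaluate(model, X, y):
    pred = model.predict(X)
    r = np.corrcoef(y, pred)[0, 1]                 # correlation coefficient R
    aape = np.mean(np.abs((y - pred) / y)) * 100   # average absolute % error
    return r, aape
```

The same `evaluate` call would be run on the training split, the testing split, and the unseen validation well to reproduce the paper's three-way reporting.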


2018 ◽  
Vol 19 (12) ◽  
pp. 3780 ◽  
Author(s):  
Dingxuan He ◽  
Andrew Gichira ◽  
Zhizhong Li ◽  
John Nzei ◽  
Youhao Guo ◽  
...  

The order Nymphaeales, consisting of three families with a record of eight genera, has gained significant interest from botanists, probably due to its position as a basal angiosperm. The phylogenetic relationships within the order have been well studied; however, a few controversial nodes still remain in the Nymphaeaceae. The position of the genus Nuphar and the monophyly of the Nymphaeaceae family remain uncertain. This study adds to the increasing number of completely sequenced plastid genomes of the Nymphaeales and applies a large chloroplast gene data set to reconstructing the intergeneric relationships within the Nymphaeaceae. Five complete chloroplast genomes were newly generated, including the first for the monotypic genus Euryale. Using a set of 66 protein-coding genes from the chloroplast genomes of 17 taxa, the phylogenetic position of Nuphar was determined, and a monophyletic Nymphaeaceae was obtained with convincing statistical support from both partitioned and unpartitioned data schemes. Although comparative genomic analyses revealed a high degree of synteny among the chloroplast genomes of these ancient angiosperms, key minor variations were evident, particularly in the contraction/expansion of the inverted-repeat regions and in RNA-editing events. Genome structure, gene content, and gene arrangement were highly conserved among the chloroplast genomes. The intergeneric relationships defined in this study are congruent with those inferred using morphological data.


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. Further, the data object having the largest support is chosen as the initial center, followed by finding the other centers, which are those at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
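Following the steps the abstract describes, a minimal sketch might look as follows. Taking support as a value's relative frequency and distance as simple matching are our assumptions; the paper's exact definitions and tie-breaking may differ.

```python
import pandas as pd

def support_based_centers(df: pd.DataFrame, k: int) -> pd.DataFrame:
    # Support of a categorical value = its relative frequency in the column.
    supports = {col: df[col].value_counts(normalize=True) for col in df.columns}
    # Row support = sum of the supports of the row's attribute values.
    row_support = sum(df[col].map(supports[col]) for col in df.columns)
    # First center: the row with the largest support.
    centers = [df.loc[row_support.idxmax()]]

    def dist(row, center):
        # Simple matching distance: number of mismatched attributes.
        return (row.values != center.values).sum()

    # Remaining centers: rows farthest from the centers chosen so far.
    for _ in range(k - 1):
        d = df.apply(lambda r: min(dist(r, c) for c in centers), axis=1)
        centers.append(df.loc[d.idxmax()])
    return pd.DataFrame(centers)
```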


2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Recently, anomaly detection has attracted growing attention from data mining scientists, as its reputation has risen steadily in practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. High dimensional data subjected to outlier detection poses exceptional challenges for data mining experts because of the natural problems of the curse of dimensionality and the increasing resemblance between distant and adjoining points. Traditional algorithms and techniques for outlier detection operate on the full feature space. These customary methodologies concentrate largely on low dimensional data and are therefore ineffective at discovering anomalies in data sets comprising a high number of dimensions. Digging out the anomalies present in a high dimensional data set becomes very difficult and tiresome when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of an intrinsic feature of such data: the distance between observations approaches zero as the number of dimensions extends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It opens a new breadth of research towards resolving the inherent problems of high dimensional data, where outliers reside within clusters having different densities. A high dimensional data set from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are then compared with those of density-based techniques to evaluate its efficiency.
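The abstract does not specify how the deviation findings are embedded into the density-based stage, so the sketch below is a hypothetical combination: a z-score-based global deviation signal multiplied into a Local Outlier Factor score. Both the deviation formula and the product combination are our assumptions, not the authors' method.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def deviation_weighted_lof(X, n_neighbors=20):
    # Global deviation: per-feature z-scores aggregated per point.
    z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    deviation = np.sqrt((z ** 2).mean(axis=1))
    # Density-based signal: negative_outlier_factor_ is more negative for
    # stronger outliers, so flip the sign (larger = more anomalous).
    lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X)
    density_score = -lof.negative_outlier_factor_
    # Combine the global and local signals (simple product here).
    return deviation * density_score
```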


2020 ◽  
Vol 12 (2) ◽  
pp. 869-873
Author(s):  
Jari Pohjola ◽  
Jari Turunen ◽  
Tarmo Lipping

Abstract. Postglacial land uplift is a complex process related to the continental ice retreat that took place about 10 000 years ago and thus triggered the viscoelastic response of the Earth's crust as it rebounds to its equilibrium state. To empirically model the land uplift process based on the past behaviour of shoreline displacement, data points of known spatial location, elevation, and dating are needed. Such data can be obtained by studying the isolation of lakes and mires from the sea. Archaeological data on human settlements (i.e., human remains, fireplaces, etc.) are also very useful, as the settlements were indeed situated on dry land and were often located close to the coast. This information can be used to validate and update the postglacial land uplift model. In this paper, a collection of data underlying empirical land uplift modelling in Fennoscandia is presented. The data set is available at https://doi.org/10.1594/PANGAEA.905352 (Pohjola et al., 2019).

