A characteristic turning point between summer and autumn in north China

2020 ◽  
Author(s):  
Yaoqin Xie

Abstract: It is generally believed that seasonal alternation is a gradual process marked by temperature. Drawing on a large data set containing 1,686,528 data points of temperature, humidity and sunshine duration, we established a seasonal dynamic model of north China. Based on the model, we discovered a turning point on the 220th day in the annual average distribution of humidity and sunshine duration, which can be used as a characteristic node to define the date of summer-autumn alternation in north China. Our results demonstrate that the alternation of summer and autumn in this region is not a gradual process but a mutation in the annual distributions of humidity and sunshine duration, thus revealing a statistical invariance grounded in local knowledge. The study also shows that humidity and sunshine duration reflect the climate characteristics of north China better than temperature. Because the model is region-specific, the proposed big-data method can be further extended to quantitatively define other seasonal alternations and explore other climate characteristics in different regions, so as to benefit indigenous knowledge-based climate prediction.
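A minimal sketch of how such a turning point could be located from daily averages, assuming the humidity series is available as a NumPy array; the two-segment least-squares fit and the synthetic data are illustrative stand-ins, not the authors' seasonal dynamic model:

```python
import numpy as np

def find_turning_point(daily_values):
    """Locate the day-of-year where a two-segment linear fit to the
    annual average curve changes behavior (illustrative only)."""
    days = np.arange(1, len(daily_values) + 1)
    best_day, best_sse = None, np.inf
    for split in range(30, len(days) - 30):          # keep both segments non-trivial
        sse = 0.0
        for d, v in ((days[:split], daily_values[:split]),
                     (days[split:], daily_values[split:])):
            coeffs = np.polyfit(d, v, 1)             # straight-line fit per segment
            sse += np.sum((np.polyval(coeffs, d) - v) ** 2)
        if sse < best_sse:
            best_day, best_sse = days[split], sse
    return best_day

# Synthetic humidity averages: flat, then falling after day ~220 (made-up data).
rng = np.random.default_rng(0)
humidity = np.concatenate([np.full(220, 70.0), 70 - 0.3 * np.arange(145)])
humidity += rng.normal(0, 1.5, humidity.size)
print(find_turning_point(humidity))   # expected to land near day 220
```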

2013 ◽  
Vol 12 (3-4) ◽  
pp. 291-307 ◽  
Author(s):  
Ilir Jusufi ◽  
Andreas Kerren ◽  
Falk Schreiber

Ontologies and hierarchical clustering are both important tools in biology and medicine to study high-throughput data such as transcriptomics and metabolomics data. Enrichment of ontology terms in the data is used to identify statistically overrepresented ontology terms, giving insight into relevant biological processes or functional modules. Hierarchical clustering is a standard method to analyze and visualize data to find relatively homogeneous clusters of experimental data points. Both methods support the analysis of the same data set but are usually considered independently. However, a combined view is often desired: visualizing a large data set in the context of an ontology while also taking a clustering of the data into account. This article proposes new visualization methods for this task. They allow for interactive selection and navigation to explore the data under consideration as well as visual analysis of mappings between ontology- and cluster-based space-filling representations. In this context, we discuss our approach together with specific properties of the biological input data and identify features that make our approach easily usable for domain experts.
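As a rough illustration of the two ingredients being combined (not the article's visualization approach), the sketch below clusters a toy expression matrix hierarchically with SciPy and reports which ontology terms fall into each cluster; the gene names, GO identifiers, and annotation mapping are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: rows are genes, columns are conditions (made-up data).
rng = np.random.default_rng(1)
genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
expression = rng.normal(size=(5, 8))

# Hypothetical gene -> ontology-term annotation (placeholder identifiers).
annotation = {"geneA": "GO:0006915", "geneB": "GO:0006915",
              "geneC": "GO:0008152", "geneD": "GO:0008152", "geneE": "GO:0008152"}

# Standard average-linkage hierarchical clustering on Euclidean distances.
Z = linkage(expression, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters

# A combined view in its simplest form: which ontology terms fall in which cluster.
for cluster_id in sorted(set(labels)):
    members = [g for g, c in zip(genes, labels) if c == cluster_id]
    terms = sorted({annotation[g] for g in members})
    print(cluster_id, members, terms)
```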


Author(s):  
Md. Zakir Hossain ◽  
Md.Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding structure in data from large data sets. With this process, decision makers can make particular decisions for further development of real-world problems. Several data clustering techniques are used in data mining to find specific patterns in data. The K-means method is one of the most familiar techniques for clustering large data sets. The K-means clustering method partitions the data set under the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen too small, there is a higher probability of adding dissimilar items to the same group. On the other hand, if the number of clusters is chosen too high, there is a higher chance of adding similar items to different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm. The proposed method performs data clustering dynamically. It initially calculates a threshold value as a centroid of K-means, and based on this value the number of clusters is formed. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, then these two data points will be in the same group; otherwise, the proposed method creates a new cluster for the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
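A minimal sketch of the threshold-driven idea described above; since the abstract does not specify how the threshold is computed, the mean distance to the overall centroid is used here as an assumed stand-in, and centroids are simply seeded by the first dissimilar point:

```python
import numpy as np

def threshold_clustering(points, threshold):
    """Assign each point to the nearest existing centroid if it lies within
    `threshold`; otherwise open a new cluster (sketch of the dynamic idea)."""
    centroids, assignments = [], []
    for p in points:
        if centroids:
            dists = np.linalg.norm(np.asarray(centroids) - p, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= threshold:
                assignments.append(nearest)
                continue
        centroids.append(p.copy())                # dissimilar point: new cluster
        assignments.append(len(centroids) - 1)
    return np.asarray(centroids), np.asarray(assignments)

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
# Assumed threshold: mean distance of points to the overall centroid.
threshold = np.mean(np.linalg.norm(data - data.mean(axis=0), axis=1))
centroids, labels = threshold_clustering(data, threshold)
print(len(centroids), "clusters found")
```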


Author(s):  
Stefan Bamberger ◽  
Felix Krahmer

Abstract: Johnson–Lindenstrauss embeddings are widely used to reduce the dimension and thus the processing time of data. To reduce the total complexity, also fast algorithms for applying these embeddings are necessary. To date, such fast algorithms are only available either for a non-optimal embedding dimension or up to a certain threshold on the number of data points. We address a variant of this problem where one aims to simultaneously embed larger subsets of the data set. Our method follows an approach by Nelson et al. (New constructions of RIP matrices with fast multiplication and fewer rows. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1515–1528, 2014): a subsampled Hadamard transform maps points into a space of lower, but not optimal dimension. Subsequently, a random matrix with independent entries projects to an optimal embedding dimension. For subsets whose size scales at least polynomially in the ambient dimension, the complexity of this method comes close to the number of operations just to read the data under mild assumptions on the size of the data set that are considerably less restrictive than in previous works. We also prove a lower bound showing that subsampled Hadamard matrices alone cannot reach an optimal embedding dimension. Hence, the second embedding cannot be omitted.
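The two-stage construction can be illustrated as follows. For clarity this sketch multiplies by the explicit Hadamard matrix rather than using a fast O(d log d) Walsh–Hadamard transform, and the dimensions m1 and m2 are arbitrary illustrative choices, not the values dictated by the theory:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
d = 1024            # ambient dimension (power of two so the Hadamard matrix exists)
m1, m2 = 256, 64    # intermediate and final embedding dimensions (illustrative)

x = rng.normal(size=d)

# Stage 1: randomized, subsampled Hadamard transform down to dimension m1.
signs = rng.choice([-1.0, 1.0], size=d)            # random diagonal sign flip
H = hadamard(d) / np.sqrt(d)                       # orthonormal Hadamard matrix
rows = rng.choice(d, size=m1, replace=False)       # subsample m1 rows
y = np.sqrt(d / m1) * H[rows] @ (signs * x)

# Stage 2: dense random projection with independent entries to dimension m2.
G = rng.normal(size=(m2, m1)) / np.sqrt(m2)
z = G @ y

print(np.linalg.norm(x), np.linalg.norm(z))        # norms should be comparable
```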


Author(s):  
Yatish H. R. ◽  
Shubham Milind Phal ◽  
Tanmay Sanjay Hukkeri ◽  
Lili Xu ◽  
Shobha G ◽  
...  

Dealing with large samples of unlabeled data is a key challenge in today's world, especially in applications such as traffic pattern analysis and disaster management. DBSCAN, or density-based spatial clustering of applications with noise, is a well-known density-based clustering algorithm. Its key strengths lie in its capability to detect outliers and handle arbitrarily shaped clusters. However, the algorithm, being fundamentally sequential in nature, proves expensive and time consuming when operated on extensively large data chunks. This paper thus presents a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. The approach fully parallelizes the implementation by making use of the HPCC Systems distributed architecture and performing a tree-based union to merge local clusters. The proposed approach was tested on both synthetic and standard datasets (MFCCs Data Set) and found to be completely accurate. Additionally, when compared against a single-node setup, a significant decrease in computation time was observed with no impact on accuracy. The parallelized algorithm performed eight times better for higher numbers of data points, and the relative time savings grew as the number of data points increased.
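The paper's implementation targets the HPCC Systems platform; the single-machine Python sketch below only illustrates the general partition-then-merge idea (local DBSCAN runs via scikit-learn, followed by a union over local clusters that share overlap points) and is not the authors' code:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
data = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(3, 0.2, (100, 2))])
eps = 0.3

# Split on the x-coordinate into two partitions that overlap by 2*eps so that
# clusters crossing the boundary appear in both local results.
cut = np.median(data[:, 0])
left_idx = np.where(data[:, 0] <= cut + eps)[0]
right_idx = np.where(data[:, 0] > cut - eps)[0]

def local_dbscan(idx):
    """Run DBSCAN on one partition; map global point index -> local cluster id."""
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(data[idx])
    return {int(i): int(l) for i, l in zip(idx, labels) if l != -1}

left, right = local_dbscan(left_idx), local_dbscan(right_idx)

# Union-find over (partition, local cluster id) pairs: local clusters that share
# any point in the overlap region are merged into one global cluster.
parent = {}

def find(a):
    parent.setdefault(a, a)
    while parent[a] != a:
        parent[a] = parent[parent[a]]   # path compression
        a = parent[a]
    return a

def union(a, b):
    parent[find(a)] = find(b)

for part, mapping in (("L", left), ("R", right)):
    for i, l in mapping.items():
        find((part, l))                     # register every local cluster
for i in set(left) & set(right):            # points seen by both partitions
    union(("L", left[i]), ("R", right[i]))

global_label = {i: find((part, l))
                for part, mapping in (("L", left), ("R", right))
                for i, l in mapping.items()}
print(len(set(global_label.values())), "merged clusters")   # expected: 2
```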


2020 ◽  
Vol 4 (4) ◽  
pp. 170
Author(s):  
Sunil C. Joshi

In order to build predictive analytics for engineering materials, a large amount of data is required for machine learning (ML). Gathering such data can be demanding due to the challenges involved in producing specialty specimens and conducting ample experiments. Additionally, numerical simulations require effort. Smaller datasets are still viable; however, they need to be boosted systematically for ML. A newly developed, knowledge-based data boosting (KBDB) process, named COMPOSITES, helps in logically enhancing the dataset size without further experimentation or detailed simulation. This process and its successful usage are discussed in this paper, using a combination of mode-I and mode-II inter-ply fracture toughness (IPFT) data on carbon nanotube (CNT) engineered carbon fiber reinforced polymer (CFRP) composites. The relationship between the amount of CNT added to strengthen the mid-ply interface of the CFRP and the improvement in IPFT is studied. A simpler way of combining mode-I and mode-II values of IPFT to predict delamination resistance is presented. Every step of the 10-step KBDB process, along with its significance and implementation, is explained, and the results are presented. The KBDB process helped not only in adding a number of data points reliably, but also in finding the boundaries and limitations of the augmented dataset. Such an authentically boosted dataset is vital for successful ML.


2014 ◽  
Vol 13 (02) ◽  
pp. 1450020
Author(s):  
H. Venkateswara Reddy ◽  
S. Viswanadha Raju ◽  
B. Suresh Kumar ◽  
C. Jayachandra

Clustering is an important technique in data mining. Clustering a large data set is difficult and time consuming. An approach called data labelling has been suggested for clustering large databases, using a sampling technique to improve the efficiency of clustering. A data sample is selected randomly for initial clustering, and data points that are not sampled and remain unclustered are given a cluster label or marked as outliers based on various data labelling techniques. Data labelling is an easy task in the numerical domain because it is performed based on the distance between a cluster and an unlabelled data point. In the categorical domain, however, the distance between data points, and between a data point and a cluster, is not properly defined, so data labelling is a difficult task for categorical data. This paper proposes a method for data labelling using an entropy model in rough sets for categorical data. The concept of entropy, introduced by Shannon with particular reference to information theory, is a powerful mechanism for the measurement of uncertainty in information. In this method, data labelling is performed by integrating entropy with rough sets. The method is also applied to drift detection to establish whether concept drift has occurred when clustering categorical data. Cluster purity is also discussed using rough entropy for data labelling and for outlier detection. The experimental results show that the efficiency and clustering quality of this algorithm are better than those of previous algorithms.
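As a loose illustration of entropy-based labelling for categorical data (the rough-set component of the paper's method is omitted), the sketch below assigns an unlabelled point to the cluster whose attribute-wise entropy increases least; this assignment rule is an assumption for illustration, not the authors' exact criterion:

```python
import math
from collections import Counter

def cluster_entropy(cluster):
    """Sum of attribute-wise Shannon entropies of a categorical cluster."""
    total = 0.0
    n = len(cluster)
    for attr in range(len(cluster[0])):
        counts = Counter(row[attr] for row in cluster)
        total += -sum((c / n) * math.log2(c / n) for c in counts.values())
    return total

def label_point(point, clusters):
    """Assign `point` to the cluster whose entropy increases least (sketch)."""
    deltas = [cluster_entropy(c + [point]) - cluster_entropy(c) for c in clusters]
    return min(range(len(clusters)), key=lambda k: deltas[k])

clusters = [
    [("red", "small"), ("red", "small"), ("red", "medium")],
    [("blue", "large"), ("green", "large"), ("blue", "large")],
]
print(label_point(("red", "small"), clusters))   # expected: 0
```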


2018 ◽  
Vol 315 (3) ◽  
pp. H522-H530 ◽  
Author(s):  
Kristine Y. DeLeon-Pennell ◽  
Rugmani Padmanabhan Iyer ◽  
Yonggang Ma ◽  
Andriy Yabluchanskiy ◽  
Rogelio Zamilpa ◽  
...  

The generation of big data has enabled systems-level dissections into the mechanisms of cardiovascular pathology. Integration of genetic, proteomic, and pathophysiological variables across platforms and laboratories fosters discoveries through multidisciplinary investigations and minimizes unnecessary redundancy in research efforts. The Mouse Heart Attack Research Tool (mHART) consolidates a large data set of over 10 yr of experiments from a single laboratory for cardiovascular investigators to generate novel hypotheses and identify new predictive markers of progressive left ventricular remodeling after myocardial infarction (MI) in mice. We designed the mHART REDCap database using our own data to integrate cardiovascular community participation. We generated physiological, biochemical, cellular, and proteomic outputs from plasma and left ventricles obtained from post-MI and no-MI (naïve) control groups. We included both male and female mice ranging in age from 3 to 36 mo old. After variable collection, data underwent quality assessment for data curation (e.g., eliminate technical errors, check for completeness, remove duplicates, and define terms). Currently, mHART 1.0 contains >888,000 data points and includes results from >2,100 unique mice. Database performance was tested, and an example is provided to illustrate database utility. This report explains how the first version of the mHART database was established and provides researchers with a standard framework to aid in the integration of their data into our database or in the development of a similar database. NEW & NOTEWORTHY The Mouse Heart Attack Research Tool combines >888,000 cardiovascular data points from >2,100 mice. We provide this large data set as a REDCap database to generate novel hypotheses and identify new predictive markers of adverse left ventricular remodeling following myocardial infarction in mice and provide examples of use. The Mouse Heart Attack Research Tool is the first database of this size that integrates data sets across platforms that include genomic, proteomic, histological, and physiological data.


2012 ◽  
Vol 8 (4) ◽  
pp. 82-107 ◽  
Author(s):  
Renxia Wan ◽  
Yuelin Gao ◽  
Caixia Li

Up to now, several algorithms for clustering large data sets have been presented. Most clustering approaches are crisp ones, which are not well suited to the fuzzy case. In this paper, the authors explore a single-pass approach to fuzzy possibilistic clustering over large data sets. The basic idea of the proposed approach (weighted fuzzy-possibilistic c-means, WFPCM) is to use a modified possibilistic c-means (PCM) algorithm to cluster the weighted data points and centroids, with one data segment as a unit. Experimental results on both synthetic and real data sets show that WFPCM can save significant memory usage compared with the fuzzy c-means (FCM) algorithm and the possibilistic c-means (PCM) algorithm. Furthermore, the proposed algorithm shows excellent immunity to noise, avoids splitting or merging the true clusters into inaccurate ones, and ensures the integrity and purity of the natural classes.
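A toy sketch of a weighted possibilistic c-means update on a single data segment, under assumed choices for the scale parameter eta and the initialization; it is not the full single-pass WFPCM algorithm described in the paper:

```python
import numpy as np

def weighted_pcm(X, w, c=2, m=2.0, iters=50):
    """One illustrative run of weighted possibilistic c-means on a data segment.
    X: points (n, d); w: per-point weights (n,)."""
    # Initialize centers from spread-out sample points (illustrative choice).
    centers = X[np.linspace(0, len(X) - 1, c, dtype=int)].astype(float)
    # Heuristic scale parameter eta per cluster (assumed, not from the paper).
    eta = np.full(c, np.mean(np.var(X, axis=0)))
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, c)
        typ = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))            # typicalities
        u = w[:, None] * typ ** m                                      # weighted memberships
        centers = (u.T @ X) / u.sum(axis=0)[:, None]
    return centers, typ

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.4, (60, 2)), rng.normal(4, 0.4, (60, 2))])
w = np.ones(len(X))            # unit weights; a data segment would carry counts
centers, typ = weighted_pcm(X, w)
print(np.round(centers, 2))
```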


2019 ◽  
Vol 8 (2) ◽  
pp. 159
Author(s):  
Morteza Marzjarani

Heteroscedasticity plays an important role in data analysis. In this article, this issue, along with a few different approaches for handling heteroscedasticity, is presented. First, an iterative weighted least squares (IRLS) and an iterative feasible generalized least squares (IFGLS) procedure are deployed and proper weights for reducing heteroscedasticity are determined. Next, a new approach for handling heteroscedasticity is introduced. In this approach, after fitting a multiple linear regression (MLR) model or a general linear model (GLM) to a sufficiently large data set, the data are divided into two parts through inspection of the residuals, based on the results of testing for heteroscedasticity or via simulations. The first part contains the records where the absolute values of the residuals can be assumed small enough that heteroscedasticity is ignorable. Under this assumption, the error variances are small and close to those of their neighboring points, and such error variances can be assumed known (but not necessarily equal). The second, remaining portion of the data is categorized as heteroscedastic. Using real data sets, it is concluded that this approach reduces the number of unusual (such as influential) data points suggested for further inspection and, more importantly, lowers the root mean square error (RMSE), resulting in a more robust set of parameter estimates.
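A hedged illustration of the splitting idea: fit ordinary least squares, separate records by residual magnitude, and refit with weights that down-weight the heteroscedastic part. The median-residual cutoff and the 1/residual^2 weights are assumed, illustrative choices, not the article's procedure:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(0, 10, n)
# Synthetic heteroscedastic data: noise grows with x.
y = 2.0 + 1.5 * x + rng.normal(0, 0.2 + 0.3 * x, n)
X = np.column_stack([np.ones(n), x])

# Ordinary least squares fit and residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# Split: records with small absolute residuals are treated as homoscedastic;
# the cutoff (median |residual|) is an assumed, illustrative choice.
cutoff = np.median(np.abs(resid))
hetero = np.abs(resid) > cutoff

# Weighted least squares on the full data, down-weighting the heteroscedastic part
# with weights proportional to 1 / residual^2 (a common IRLS-style choice).
w = np.ones(n)
w[hetero] = 1.0 / np.maximum(resid[hetero] ** 2, 1e-8)
W = np.sqrt(w)[:, None]
beta_wls, *_ = np.linalg.lstsq(W * X, np.sqrt(w) * y, rcond=None)

print("OLS:", np.round(beta_ols, 3), "WLS:", np.round(beta_wls, 3))
```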


2013 ◽  
Vol 46 (4) ◽  
pp. 960-971 ◽  
Author(s):  
Katja Jöchen ◽  
Thomas Böhlke

Experimental techniques [e.g. electron backscatter diffraction (EBSD)] yield detailed crystallographic information on the grain scale. In both two- and three-dimensional applications of EBSD, large data sets in the range of 10^5–10^9 single-crystal orientations are obtained. With regard to the precise but efficient micromechanical computation of the polycrystalline material response, small representative sets of crystallographic orientation data are required. This paper describes two methods to systematically reduce experimentally measured orientation data. Inspired by the work of Gao, Przybyla & Adams [Metall. Mater. Trans. A (2006), 37, 2379–2387], who used a tessellation of the orientation space in order to compute correlation functions, one method in this work uses a similar procedure to partition the orientation space into boxes, but with the aim of extracting the mean orientation of the data points of each box. The second method to reduce crystallographic texture data is based on a clustering technique. It is shown that, in terms of representativity of the reduced data, both methods deliver equally good results. While the clustering technique is computationally more costly, it works particularly well when the measured data set shows pronounced clusters in the orientation space. The quality of the results and the performance of the tessellation method are independent of the examined data set.
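A toy sketch of the box-tessellation idea: Euler angles are binned into boxes and a per-box mean is taken as the representative orientation. The bin counts are arbitrary, and a rigorous reduction would average quaternions and respect crystal symmetry, both of which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(7)
# Toy "measured" orientations as Bunge Euler angles in radians (made-up data);
# real EBSD data would also require handling crystal symmetry.
eulers = rng.uniform([0, 0, 0], [2 * np.pi, np.pi, 2 * np.pi], size=(100_000, 3))

bins = np.array([12, 6, 12])                       # boxes per Euler-angle axis
upper = np.array([2 * np.pi, np.pi, 2 * np.pi])
idx = np.minimum((eulers / upper * bins).astype(int), bins - 1)
keys = np.ravel_multi_index(idx.T, bins)           # one integer label per box

# Mean orientation per occupied box (naive component-wise mean of Euler angles;
# a rigorous treatment would average quaternions instead).
reduced = np.array([eulers[keys == k].mean(axis=0) for k in np.unique(keys)])
print(eulers.shape[0], "->", reduced.shape[0], "representative orientations")
```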

