data partition
Recently Published Documents


TOTAL DOCUMENTS: 146 (FIVE YEARS: 36)

H-INDEX: 13 (FIVE YEARS: 4)

2022 ◽ Vol 2022 ◽ pp. 1-11
Author(s): Yunsheng Song, Xiaohan Kong, Chao Zhang

Owing to its freedom from assumptions about the underlying data distribution and its strong generalization ability, the k-nearest neighbor (kNN) classification algorithm is widely applied to face recognition, text classification, emotional analysis, and other fields. However, kNN must compute the similarity between the unlabeled instance and all training instances during prediction, so it struggles with large-scale data. To overcome this difficulty, a growing number of acceleration algorithms based on data partition have been proposed, but they lack a theoretical analysis of the effect of data partition on classification performance. This paper provides such an analysis using empirical risk minimization and proposes a large-scale k-nearest neighbor classification algorithm based on neighbor relationship preservation. The search for the nearest neighbors is converted into a constrained optimization problem, and an estimate is derived of the difference in the objective function value at the optimal solution with and without data partition. According to this estimate, minimizing the similarity of instances in different divided subsets largely reduces the effect of data partition. The minibatch k-means clustering algorithm is chosen to perform the data partition for its effectiveness and efficiency. Finally, the nearest neighbors of the test instance are searched from a set generated by successively merging candidate subsets until the neighbors no longer change, where the candidate subsets are selected based on the similarity between the test instance and the cluster centers. Experimental results on public datasets show that the proposed algorithm largely preserves the same nearest neighbors as the original kNN classification algorithm with no significant difference in classification accuracy, and it achieves better results than two state-of-the-art algorithms.
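The search procedure lends itself to a compact sketch. The following is a hypothetical Python illustration of the merge-until-stable idea described in the abstract, not the authors' code; it uses scikit-learn's MiniBatchKMeans, and the function name `partitioned_knn` and all parameter choices are our own assumptions.

```python
# Hypothetical sketch: training data is split with minibatch k-means, then
# clusters are merged in order of center proximity until the k nearest
# neighbors of the test instance stop changing.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def partitioned_knn(X_train, y_train, x_test, k=5, n_clusters=20):
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit(X_train)
    # Rank candidate subsets by similarity of the test instance to cluster centers.
    order = np.argsort(np.linalg.norm(km.cluster_centers_ - x_test, axis=1))
    candidate_idx = np.array([], dtype=int)
    prev_neighbors = None
    for c in order:
        # Successively merge the next closest candidate subset.
        candidate_idx = np.concatenate([candidate_idx, np.where(km.labels_ == c)[0]])
        dists = np.linalg.norm(X_train[candidate_idx] - x_test, axis=1)
        neighbors = candidate_idx[np.argsort(dists)[:k]]
        # Stop once the k nearest neighbors do not change anymore.
        if prev_neighbors is not None and set(neighbors) == set(prev_neighbors):
            break
        prev_neighbors = neighbors
    # Majority vote among the stabilized neighbors.
    vals, counts = np.unique(y_train[prev_neighbors], return_counts=True)
    return vals[np.argmax(counts)]
```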


2022 ◽ Vol 52 (5)
Author(s): Roberta de Amorim Ferreira, Gabriely Teixeira, Luiz Alexandre Peternelli

ABSTRACT: Splitting the whole dataset into training and testing subsets is a crucial part of optimizing models. This study evaluated the influence of the choice of the training subset on the construction of predictive models, as well as on their validation. For this purpose, we assessed the Kennard-Stone (KS) and Random Sampling (RS) methods on near-infrared spectroscopy (NIR) data and on SNP (Single Nucleotide Polymorphism) marker data. It is worth noting that, for SNP data, we are not aware of any reports in the literature on the use of the KS method. For the construction and validation of the models, the partial least squares (PLS) estimation method and the Bayesian Lasso (BLASSO) proved to be the most efficient for the NIR data and the SNP marker data, respectively. The predictive capacity of the models obtained after the data partition was evaluated through the correlation between predicted and observed values, and the corresponding square root of the mean squared error of prediction. For both datasets, results from the KS and RS methods differed statistically from each other by the F test (P-value < 0.01), and the KS method was more efficient than RS in practically all repetitions. The KS method also has the advantage of being easy and fast to apply and of always selecting the same samples, which provides excellent benefits in the subsequent analyses.
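For readers unfamiliar with Kennard-Stone selection, here is a minimal, self-contained sketch of the classic algorithm the abstract refers to; the function and variable names are our own, and the O(n²) distance matrix is only suitable for moderate dataset sizes.

```python
# Kennard-Stone: start from the two most distant samples, then repeatedly
# add the candidate whose minimum distance to the selected set is maximal.
import numpy as np

def kennard_stone(X, n_train):
    """Return indices of n_train samples chosen by the KS algorithm."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    i, j = np.unravel_index(np.argmax(dist), dist.shape)           # two most distant samples
    selected = [i, j]
    remaining = [r for r in range(len(X)) if r not in selected]
    while len(selected) < n_train:
        # For each candidate, distance to its closest already-selected sample...
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        # ...and pick the candidate maximizing that minimum distance.
        nxt = remaining[int(np.argmax(d_min))]
        selected.append(nxt)
        remaining.remove(nxt)
    return np.array(selected)
```

The procedure is deterministic, which is exactly the property the abstract highlights: repeated calls select the same training samples, unlike random sampling.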


Symmetry ◽ 2021 ◽ Vol 13 (10) ◽ pp. 1789
Author(s): Chu Fang, Haiming Liu

Clustering is a major field in data mining and an important method of data partition or grouping. It has been applied in various ways to commerce, market analysis, biology, web classification, and so on. Clustering algorithms include the partitioning method, hierarchical clustering, as well as density-based, grid-based, model-based, and fuzzy clustering. The K-means algorithm is one of the essential clustering algorithms and is based on the partitioning method. This study's aim was to improve the algorithm based on research and, with regard to its application, to use the algorithm for customer segmentation. Customer segmentation is an essential element in an enterprise's utilization of CRM. The first part of the paper elaborates the object of study, its background, and the goal this article aims to achieve; it also discusses the research approach and the overall content. The second part mainly introduces the basic knowledge of clustering and methods for clustering analysis, assessing different algorithms and identifying their advantages and disadvantages through comparison. The third part introduces the application of the algorithm, as the study applies clustering technology to customer segmentation. First, the customer value system is built through AHP; customer value is then quantified, and customers are divided into different classifications using clustering technology, so that efficient CRM can be applied according to the different customer classifications. Currently, there are some systems used to evaluate customer value, but none of them can be put into practice efficiently. To solve this problem, the concept of continuous symmetry is introduced. Detecting the continuous symmetry of a given problem is very important: it allows for the detection of an observable state whose components are nonlinear functions of the original unobservable state. Thus, we built an evaluating system for customer value that is in line with the development of the enterprise, using data mining methods, based on the practical situation of the enterprise and a series of practical evaluating indexes for customer value. The evaluating system can be used to quantify customer value, to segment customers, and to build a decision-support system for customer value management. The fourth part presents the core of the study, mainly an analysis of the typical k-means algorithm; the paper proposes two improvements to it. Improved algorithm A determines K automatically and can, to some degree, ensure that the global optimum is reached. Improved algorithm B, which combines the sampling technique and the agglomeration algorithm, is much more efficient than the k-means algorithm. In conclusion, the main findings of the study and further research directions are presented.
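The paper's improved algorithms are not reproduced here; as a generic illustration of the "determine K automatically" idea, the sketch below scans candidate values of K and keeps the one with the best silhouette score. The function name, the silhouette criterion, and the assumption that X holds AHP-weighted customer-value features are all our own, not the authors' method.

```python
# Generic automatic-K selection for customer segmentation (illustrative only).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def segment_customers(X, k_range=range(2, 10)):
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)  # higher = better-separated clusters
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# X: one row per customer, columns = quantified customer-value indexes;
# the returned labels define the customer segments used for CRM decisions.
```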


2021 ◽ Vol 7 (9) ◽ pp. 174
Author(s): Luís Viegas, Inês Domingues, Mateus Mendes

Mammography is the primary medical imaging method used for routine screening and early detection of breast cancer in women. However, manually inspecting, detecting, and delimiting tumoral masses in 2D images is a very time-consuming task, subject to human error due to fatigue. Therefore, integrated computer-aided detection systems have been proposed, based on modern computer vision and machine learning methods. In the present work, mammogram images from the publicly available INbreast dataset are first converted to pseudo-color and then used to train and test a Mask R-CNN deep neural network. The most common approach is to split the images of a dataset into training and test sets randomly. However, since there are often two or more images of the same case in a dataset, the way the dataset is split may have an impact on the results. Our experiments show that random partition of the data can produce unreliable training, so the dataset must be split using case-wise partition for more stable results. Experimentally, the method achieves an average true positive rate of 0.936 with 0.063 standard deviation using random partition and 0.908 with 0.002 standard deviation using case-wise partition, showing that case-wise partition must be used for more reliable results.
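A case-wise partition like the one described can be expressed with scikit-learn's group-aware splitters. The sketch below is a hedged illustration, not the authors' pipeline; the `case_ids` array (one case identifier per image) is an assumed input.

```python
# Case-wise split: all images belonging to the same case land on the same
# side of the train/test partition.
from sklearn.model_selection import GroupShuffleSplit

def case_wise_split(image_paths, case_ids, test_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(image_paths, groups=case_ids))
    return train_idx, test_idx
```

With a plain random split, two views of the same case can end up on both sides of the partition, leaking case-specific appearance into the test set; grouping by case removes that leak, which is consistent with the much smaller standard deviation the abstract reports.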


CONVERTER ◽ 2021 ◽ pp. 271-280
Author(s): Haoling Meng, et al.

To address the existing problems in targeted poverty alleviation systems, such as unclear poverty alleviation records, inaccurate poverty alleviation targets, easy tampering of poverty alleviation data, and a single data storage structure, a targeted poverty alleviation system based on blockchain + IPFS + a distributed database is designed. For the different links in the industrial field, different consensus schemes are used to balance reliability and performance. The characteristics of blockchain technology ensure that data are traceable and tamper-proof; this technology is then combined with the InterPlanetary File System (IPFS) as data storage to solve the problem of the high cost of storing large files. Furthermore, the distributed database achieves data partition management, solving the problem of a single data storage structure. The data nodes are independent of each other, so they can realize efficient control of resources and improve the concurrency and scalability of the system. Performance tests and comparative analysis prove that the system achieves a great improvement in throughput, data read-write delay, and other indicators.
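The storage split at the heart of such a design can be illustrated with a toy sketch. Everything below is a mock built for illustration only (it is not the authors' system and does not use a real IPFS or blockchain client): large records go to content-addressed storage standing in for IPFS, and only the content hash is anchored in a hash-chained log, keeping on-chain data small and tamper-evident.

```python
# Toy illustration of "file off-chain, fingerprint on-chain".
import hashlib, json, time

class MockIPFS:
    """Content-addressed store: the key is the hash of the data."""
    def __init__(self):
        self.store = {}
    def add(self, data: bytes) -> str:
        cid = hashlib.sha256(data).hexdigest()  # content identifier
        self.store[cid] = data
        return cid

class MockChain:
    """Append-only log where each block commits to its predecessor."""
    def __init__(self):
        self.blocks = []
    def record(self, cid: str):
        prev = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        block = {"prev": prev, "cid": cid, "ts": time.time()}
        block["hash"] = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
        self.blocks.append(block)

ipfs, chain = MockIPFS(), MockChain()
record = json.dumps({"household": "A-001", "subsidy": 1200}).encode()
chain.record(ipfs.add(record))  # tampering with the record changes its CID
```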


2021 ◽ pp. 286-292
Author(s): Kyrylo S. Krasnikov

One of the widely used methods to accelerate a numerical solver is multithreading. The problem with allocating threads on demand at runtime is the latency caused by periodically instantiating threads. This article is devoted to parallelizing a solver for a 3D mathematical model of ore sintering, based on software threads that are reused during computation. The computational domain is shared equally among the available threads, and each thread writes only to its own data partition. A looped barrier is proposed to guarantee synchronization of all threads after each iteration. The method allows performance to scale without recompiling the solver, simply by using a similar CPU with more cores. Measurement of solver performance with 2^20 nodes using different thread counts confirms scalability of around 95% for double- and single-precision arithmetic. Perspective views with three slices of the temperature field show the influence of heat loss from the pallet walls. A cross-section of the temperature field in the layer after 16 minutes of sintering is calculated, with two high-temperature regions appearing inside. Comparison of the temperature field with literature data shows good correspondence. The computer model takes into account important chemical reactions, such as coke burning, carbonate dissolution, and water vaporization, as well as mass and heat transfer inside the sinter layer, and can be used in metallurgical plants to increase the effectiveness of sintering.
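The thread-reuse pattern described here can be sketched with a reusable barrier. The code below is a minimal illustration under our own assumptions (names, the trivial update rule, and the node count are ours): workers are created once, each owns a fixed slice of the domain, and a barrier synchronizes them after every iteration.

```python
# Reusable-barrier iteration loop: threads are created once and reused.
import threading
import numpy as np

N_THREADS, N_NODES, N_ITER = 4, 1 << 20, 100
field = np.zeros(N_NODES)
barrier = threading.Barrier(N_THREADS)  # "looped": resets after each wait

def worker(tid):
    lo = tid * N_NODES // N_THREADS
    hi = (tid + 1) * N_NODES // N_THREADS
    for _ in range(N_ITER):
        field[lo:hi] += 1.0   # each thread writes only to its own partition
        barrier.wait()        # all threads meet here before the next iteration

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads: t.start()
for t in threads: t.join()
```

Note that Python's GIL limits true parallelism for CPU-bound work, so this only demonstrates the synchronization pattern; a production solver would use native threads as the article does.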


2021 ◽ Vol 11 (1)
Author(s): Verônica A. Thode, Caetano T. Oliveira, Benoît Loeuille, Carolina M. Siniscalchi, José R. Pirani

Abstract: We assembled new plastomes of 19 species of Mikania and of Ageratina fastigiata, Litothamnus nitidus, and Stevia collina, all belonging to tribe Eupatorieae (Asteraceae). We analyzed the structure and content of the assembled plastomes and used the newly generated sequences to infer phylogenetic relationships and to study the effects of different data partitions and inference methods on the topologies. Most phylogenetic studies with plastomes ignore that processes like recombination and biparental inheritance can occur in this organelle, using the whole genome as a single locus. Our study sought to compare this approach with multispecies coalescent methods, which assume that different parts of the genome evolve at different rates. We found that the overall gene content, structure, and orientation are very conserved in all plastomes of the studied species. As observed in other Asteraceae, the 22 plastomes assembled here contain two nested inversions in the LSC region. The plastomes show similar length and the same gene content. The two most variable regions within Mikania are rpl32-ndhF and rpl16-rps3, while the three genes with the highest percentage of variable sites are ycf1, rpoA, and psbT. We generated six phylogenetic trees using concatenated maximum likelihood and multispecies coalescent methods and three data partitions: coding and non-coding sequences and both combined. All trees strongly support that the sampled Mikania species form a monophyletic group, which is further subdivided into three clades. The internal relationships within each clade are sensitive to the data partitioning and inference methods employed. The trees resulting from concatenated analysis are more similar to each other than to the corresponding tree generated with the same data partition but a different method. The multispecies coalescent analysis indicates a high level of incongruence between species and gene trees. The lack of resolution and congruence among trees can be explained by the sparse sampling (~ 0.45% of the currently accepted species) and by the low number of informative characters present in the sequences. Our study sheds light on the impact of data partitioning and methods on phylogenetic resolution and brings relevant information for the study of Mikania diversity and evolution, as well as for the Asteraceae family as a whole.
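The per-region variability summary mentioned above (most variable regions, genes with the highest percentage of variable sites) corresponds to a simple computation over an alignment. The sketch below is a hedged illustration with placeholder data, not the study's pipeline: it counts the fraction of non-constant columns in a named partition.

```python
# Fraction of variable (non-constant) alignment columns in [start, end).
def variable_site_fraction(alignment, start, end):
    """alignment: dict of {taxon: aligned sequence string}, 0-based coordinates."""
    seqs = list(alignment.values())
    n_var = sum(
        len({s[col] for s in seqs} - {"-", "N"}) > 1   # ignore gaps/ambiguities
        for col in range(start, end)
    )
    return n_var / (end - start)

# Placeholder three-taxon alignment: columns 3 and 5 vary -> 2/6.
aln = {"sp1": "ATGCAT", "sp2": "ATGAAT", "sp3": "ATGCAC"}
print(variable_site_fraction(aln, 0, 6))  # 0.333...
```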


Complexity ◽ 2021 ◽ Vol 2021 ◽ pp. 1-9
Author(s): Liping Chen, Jiabao Jiang, Yong Zhang

Classical classifiers are ineffective at classifying imbalanced big datasets. Resampling the dataset to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all data samples are partitioned into different data regions. Then, the samples in the noisy minority-sample region are removed, and the samples in the boundary minority-sample region are selected as oversampling seeds to generate synthetic samples. Finally, a weighted oversampling process is conducted, with synthetic samples generated within the same cluster as their oversampling seed. The weight of each selected minority-class sample is the ratio between the proportion of the majority class among that sample's neighbors and the sum of these proportions over all selected samples. Generating synthetic samples in the same cluster as the oversampling seed guarantees that the new samples lie inside the minority-class area. Experiments conducted on eight datasets show that the proposed method, HSDP, is better than or comparable to typical sampling methods in terms of F-measure and G-mean.
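The weighted oversampling step can be sketched as follows. This is a simplified illustration of that one step under our own assumptions, not the full HSDP pipeline: the noise-removal and region-partition stages are omitted, `clusters` (a cluster label per minority sample) is an assumed input, and the SMOTE-style interpolation is our stand-in for the generation rule.

```python
# Weighted oversampling: seeds near the majority class are sampled more often,
# and each synthetic point interpolates toward a minority neighbor drawn from
# the seed's own cluster, keeping it inside the minority-class area.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_oversample(X_min, X_maj, clusters, n_new, k=5,
                        rng=np.random.default_rng(0)):
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.r_[np.zeros(len(X_min)), np.ones(len(X_maj))]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn.kneighbors(X_min)
    # Majority proportion among each minority sample's neighbors (self excluded).
    maj_prop = is_maj[idx[:, 1:]].mean(axis=1)
    weights = maj_prop / maj_prop.sum()           # normalized sampling weights
    seeds = rng.choice(len(X_min), size=n_new, p=weights)
    synthetic = []
    for s in seeds:
        mates = np.where(clusters == clusters[s])[0]  # same-cluster minority samples
        m = rng.choice(mates)
        synthetic.append(X_min[s] + rng.random() * (X_min[m] - X_min[s]))
    return np.array(synthetic)
```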

