Machine Learning K-Means Clustering Algorithm for Interpolative Separable Density Fitting to Accelerate Hybrid Functional Calculations with Numerical Atomic Orbitals

2020 ◽  
Vol 124 (48) ◽  
pp. 10066-10074
Author(s):  
Xinming Qin ◽  
Jielan Li ◽  
Wei Hu ◽  
Jinlong Yang
2020 ◽  
Vol 15 ◽  
Author(s):  
Shuwen Zhang ◽  
Qiang Su ◽  
Qin Chen

Abstract: Major animal diseases pose a great threat to animal husbandry and human beings. With the deepening of globalization and the abundance of data resources, the prediction and analysis of animal diseases using big data are becoming increasingly important. The focus of machine learning is to make computers learn from data and use the learned experience to analyze and predict. This paper first introduces the animal epidemic situation and machine learning, then briefly introduces the application of machine learning in animal disease analysis and prediction. Machine learning is mainly divided into supervised learning and unsupervised learning. Supervised learning includes support vector machines, naive Bayes, decision trees, random forests, logistic regression, artificial neural networks, deep learning, and AdaBoost. Unsupervised learning includes the expectation-maximization algorithm, principal component analysis, hierarchical clustering, and MaxEnt. Through this discussion, readers gain a clearer concept of machine learning and understand its application prospects in animal diseases.


2021 ◽  
Author(s):  
Olusegun Peter Awe ◽  
Daniel Adebowale Babatunde ◽  
Sangarapillai Lambotharan ◽  
Basil AsSadhan

Abstract: We address the problem of spectrum sensing in decentralized cognitive radio networks using a parametric machine learning method. In particular, to mitigate sensing performance degradation due to the mobility of the secondary users (SUs) in the presence of scatterers, we propose and investigate a classifier that uses a pilot based second order Kalman filter tracker for estimating the slowly varying channel gain between the primary user (PU) transmitter and the mobile SUs. Using the energy measurements at SU terminals as feature vectors, the algorithm is initialized by a K-means clustering algorithm with two centroids corresponding to the active and inactive status of PU transmitter. Under mobility, the centroid corresponding to the active PU status is adapted according to the estimates of the channels given by the Kalman filter and an adaptive K-means clustering technique is used to make classification decisions on the PU activity. Furthermore, to address the possibility that the SU receiver might experience location dependent co-channel interference, we have proposed a quadratic polynomial regression algorithm for estimating the noise plus interference power in the presence of mobility which can be used for adapting the centroid corresponding to inactive PU status. Simulation results demonstrate the efficacy of the proposed algorithm.
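The initialization step described in this abstract, K-means with two centroids on energy feature vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice to seed the centroids from the lowest- and highest-energy vectors, and the sample energy values are all assumptions made for illustration, and the Kalman filter tracking and adaptive centroid update are left out.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest(vector, centroids):
    """Index (0 or 1) of the closer centroid; ties go to centroid 0."""
    d0 = squared_distance(vector, centroids[0])
    d1 = squared_distance(vector, centroids[1])
    return 0 if d0 <= d1 else 1


def two_means(energy_vectors, max_iter=100):
    """Lloyd's algorithm with k=2.

    Assumed heuristic: seed centroid 0 with the lowest-total-energy vector
    (PU inactive) and centroid 1 with the highest (PU active).
    """
    ordered = sorted(energy_vectors, key=sum)
    centroids = [list(ordered[0]), list(ordered[-1])]
    labels = []
    for _ in range(max_iter):
        new_labels = [nearest(v, centroids) for v in energy_vectors]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(energy_vectors, labels) if lab == k]
            if members:
                centroids[k] = mean_vector(members)
    return centroids, labels


# Made-up energy measurements from two SU terminals (illustrative only).
samples = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
           [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
centroids, labels = two_means(samples)
decision = nearest([2.8, 3.1], centroids)  # 1 means "PU active" in this sketch
```

A new energy vector is then classified by its nearest centroid, which is the decision rule that the adaptive scheme in the abstract would keep updating under mobility.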


2020 ◽  
Author(s):  
Xiao Lai ◽  
Pu Tian

Abstract: Supervised machine learning, especially deep learning based on a wide variety of neural network architectures, has contributed tremendously to fields such as marketing, computer vision and natural language processing. However, development of unsupervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering of high dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and tree structure, and demonstrate its utility with clustering of protein structural domains. No record comparison, which is an expensive and essential step common to all present clustering algorithms, is involved. Consequently, it achieves hierarchical clustering with linear time and space complexity and is thus applicable to arbitrarily large datasets. The key factor in this algorithm is the definition of composition, which depends on the physical nature of the target data and therefore needs to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high dimensional data with strong nonlinear correlations. We hope this algorithm will inspire a rich research field of encoding based clustering well beyond composition rank vector trees.


2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al. ◽  

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome this shortcoming, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.


2018 ◽  
Author(s):  
Mridul K. Thomas ◽  
Simone Fontana ◽  
Marta Reyes ◽  
Francesco Pomati

Abstract: Scanning flow cytometry (SFCM) is characterized by the measurement of time-resolved pulses of fluorescence and scattering, enabling the high-throughput quantification of phytoplankton morphology and pigmentation. Quantifying variation at the single cell and colony level improves our ability to understand dynamics in natural communities. Automated high-frequency monitoring of these communities is presently limited by the absence of repeatable, rapid protocols to analyse SFCM datasets, where images of individual particles are not available. Here we demonstrate a repeatable, semi-automated method to (1) rapidly clean SFCM data from a phytoplankton community by removing signals that do not belong to live phytoplankton cells, (2) classify individual cells into trait clusters that correspond to functional groups, and (3) quantify the biovolumes of individual cells, the total biovolume of the whole community and the total biovolumes of the major functional groups. Our method involves the development of training datasets using lab cultures, the use of an unsupervised clustering algorithm to identify trait clusters, and machine learning tools (random forests) to (1) evaluate variable importance, (2) classify data points, and (3) estimate biovolumes of individual cells. We provide example datasets and R code for our analytical approach that can be adapted for analysis of datasets from other flow cytometers or scanning flow cytometers.


Author(s):  
Huifang Li ◽  
Rui Fan ◽  
Qisong Shi ◽  
Zijian Du

Recent advancements in machine learning and communication technologies have enabled new approaches to automated fault diagnosis and detection in industrial systems. Given wide variation in occurrence frequencies of different classes of faults, the class distribution of real-world industrial fault data is usually imbalanced. However, most prior machine learning-based classification methods do not take this imbalance into consideration, and thus tend to be biased toward recognizing the majority classes, resulting in poor accuracy for minority ones. To solve such problems, we propose a k-means clustering generative adversarial network (KM-GAN)-based fault diagnosis approach able to reduce imbalance in fault data and improve diagnostic accuracy for minority classes. First, we design a new k-means clustering algorithm and GAN-based oversampling method to generate diverse minority-class samples that follow a distribution similar to that of the original minority data. The k-means clustering algorithm is adopted to divide minority-class samples into k clusters, while a GAN is applied to learn the data distribution of the resulting clusters and generate a given number of minority-class samples as a supplement to the original dataset. Then, we construct a deep neural network (DNN) and deep belief network (DBN)-based heterogeneous ensemble model as a fault classifier to improve generalization, in which DNN and DBN models are trained separately on the resulting dataset, and then the outputs from both are averaged as the final diagnostic result. A series of comparative experiments is conducted to verify the effectiveness of our proposed method, and the experimental results show that our method can improve diagnostic accuracy for minority-class samples.


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yanyang Bai ◽  
Xuesheng Zhang

With the rapid development of science and information technology, traditional ways of cognition are gradually being replaced. Effective data analysis is of great help to all societies, thereby driving the development of better interests. How can the overall development of information resources be expanded in the process of utilization, a mathematical analysis-oriented evidence theory system model be established, the effective utilization of the machine be improved, and the goal of comprehensively predicting target behavior be achieved? The main goal of this article is to use machine learning technology: it defines the main prediction model in the Python programming language, analyzes and forecasts data from previous World Cups, and establishes an analysis and prediction model for football using the K-means and DPC clustering algorithms. Python is used to implement the algorithm. Data from previous World Cup football matches are selected, and the built model is used for predictive analysis on the Python platform; a calculation method based on the DPC-K-means algorithm is used to determine the accuracy and probability of the variables from the calculation results, which yields results for specific competitions. The research shows that the efficiency of the machine learning process and the reliability and accuracy of the prediction results are improved by more than 55%, which the authors take as proof that mobile algorithm technology achieves a high level of predictive analysis for World Cup football.


2019 ◽  
Vol 116 (4) ◽  
pp. 1110-1115 ◽  
Author(s):  
Bingqing Cheng ◽  
Edgar A. Engel ◽  
Jörg Behler ◽  
Christoph Dellago ◽  
Michele Ceriotti

Thermodynamic properties of liquid water as well as hexagonal (Ih) and cubic (Ic) ice are predicted based on density functional theory at the hybrid-functional level, rigorously taking into account quantum nuclear motion, anharmonic fluctuations, and proton disorder. This is made possible by combining advanced free-energy methods and state-of-the-art machine-learning techniques. The ab initio description leads to structural properties in excellent agreement with experiments and reliable estimates of the melting points of light and heavy water. We observe that nuclear-quantum effects contribute a crucial 0.2 meV/H2O to the stability of ice Ih, making it more stable than ice Ic. Our computational approach is general and transferable, providing a comprehensive framework for quantitative predictions of ab initio thermodynamic properties using machine-learning potentials as an intermediate step.


Energies ◽  
2019 ◽  
Vol 12 (13) ◽  
pp. 2483 ◽  
Author(s):  
Jianhao ◽  
Jing ◽  
Longqiang ◽  
Yi ◽  
Hanzhang ◽  
...  

Driver perception, decision, and control behaviors are easily affected by traffic conditions and driving style, showing a tendency toward randomness and personalization. Brake intention and intensity are integrated, control-oriented parameters that are crucial to the development of an intelligent braking system. In this paper, a composite machine learning approach was proposed to predict driver brake intention and intensity with a proper prediction horizon. Various driving data were collected from the Controller Area Network (CAN) bus under real driving conditions, which mainly covered urban and rural road types. The ReliefF and RReliefF algorithms were employed as feature subset selection methods and applied in a preprocessing step before training. The rank importance of selected predictors exhibited different trends or even negative trends when predicting brake intention and intensity. A soft clustering algorithm, Fuzzy C-means, was adopted to label the brake intention into categories, namely slight, medium, intensive, and emergency braking. Data sets with misplaced labels were used for training of an ensemble machine learning method, random forest. It was validated that brake intention could be accurately predicted 0.5 s ahead. An open-loop nonlinear autoregressive with external input (NARX) network was capable of learning the long-term dependencies in comparison to the static neural network and was suggested for online recognition and prediction of brake intensity 1 s in advance. For system redundancy and fault tolerance, a closed-loop NARX network could be adopted for brake intensity prediction in the case of possible sensor failure and loss of CAN messages.


2012 ◽  
Vol 2 (1) ◽  
pp. 11-20 ◽  
Author(s):  
Ritu Vijay ◽  
Prerna Mahajan ◽  
Rekha Kandwal

Cluster analysis has been extensively used in machine learning and data mining to discover distribution patterns in data. Clustering algorithms are generally based on a distance metric in order to partition the data into small groups such that data instances in the same group are more similar than instances belonging to different groups. In this paper the authors have extended the concept of Hamming distance to categorical data. As a data processing step, they have transformed the data into a binary representation. The authors have used the proposed algorithm to group data points into clusters. The experiments are carried out on data sets from the UCI machine learning repository to analyze performance. They conclude by stating that the proposed algorithm shows promising results and can be extended to handle numeric as well as mixed data.
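The idea of converting categorical records to a binary representation and comparing them by Hamming distance can be sketched as follows. The abstract does not specify how the authors extend the distance, so this shows only plain Hamming distance over a one-hot encoding; the helper names and the three example records are hypothetical.

```python
def build_categories(records):
    """Sorted list of distinct values for each attribute (column)."""
    return [sorted(set(column)) for column in zip(*records)]


def to_binary(record, categories):
    """One-hot encode a record: one bit per (attribute, value) pair."""
    bits = []
    for value, values in zip(record, categories):
        bits.extend(1 if value == candidate else 0 for candidate in values)
    return bits


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical categorical records: (colour, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
categories = build_categories(records)
encoded = [to_binary(r, categories) for r in records]

d01 = hamming(encoded[0], encoded[1])  # records differ in one attribute
d02 = hamming(encoded[0], encoded[2])  # records differ in all three attributes
```

With one-hot encoding, each attribute on which two records disagree contributes exactly two differing bits, so the distance is twice the number of mismatched attributes. A clustering algorithm can then use this distance in place of a Euclidean metric.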

