Physical interpretation of hydrologic model complexity revisited

Author(s):  
Saket Pande ◽  
Mehdi Moayeri

<p>It is intuitive that the instability of a hydrological system representation, in the sense of how perturbations in input forcings translate into perturbations in hydrologic response, may depend on its hydrological characteristics. Responses of unstable systems are thus complex to model. We interpret complexity in this context and define it as a measure of instability in hydrological system representation. We quantify model complexity using the algorithms of Pande et al. (2014). We use the Sacramento soil moisture accounting model (SAC-SMA), parameterized for the CAMELS data set (Addor et al., 2017), and quantify the complexities of the corresponding models. Relationships between hydrologic characteristics of CAMELS basins, such as location, precipitation seasonality index, slope, hydrologic ratios, saturated hydraulic conductivity and NDVI, and the respective model complexities are then investigated.</p><p>Recently, Pande and Moayeri (2018) introduced an index of basin complexity based on another, non-parametric, model of least statistical complexity that is needed to reliably model the daily streamflow of a basin. This method essentially interprets complexity in terms of the difficulty of predicting historically similar streamflow events. Daily streamflow is modeled using a k-nearest neighbor model of lagged streamflow. Such models are parameterized by the number of lags and the radius of the neighborhood used to identify similar streamflow events from the past. These parameters need to be selected for each prediction 'query' time step. We use 1) the Tukey half-space data depth function to identify time steps corresponding to 'difficult' queries and 2) Vapnik-Chervonenkis (VC) generalization theory, which trades off model performance against VC dimension (a measure of model complexity), to select the parameters of the k-nearest neighbor model that is of appropriate complexity for modelling difficult queries. Averages of the selected model complexities corresponding to difficult queries are then related to the same hydrologic characteristics of the CAMELS basins as above.</p><p>We find that complexities estimated on the SAC-SMA model using the algorithm of Pande et al. (2014) are correlated with those estimated on the knn model using VC generalization theory. Further, the relationships between the two complexities and hydrologic characteristics are also similar. This indicates that the interpretation of complexity as a measure of instability in hydrological system representation is similar to the interpretation, provided by VC generalization theory, of difficulty in predicting historically similar streamflow events.</p><p>References:</p><p>Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P. (2017) The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, https://doi.org/10.5194/hess-21-5293-2017.</p><p>Pande, S., Arkesteijn, L., Savenije, H. H. G., and Bastidas, L. A. (2014) Hydrological model parameter dimensionality is a weak measure of prediction uncertainty, Hydrol. Earth Syst. Sci. Discuss., 11, 2555–2582, https://doi.org/10.5194/hessd-11-2555-2014.</p><p>Pande, S., and Moayeri, M. (2018) Hydrological interpretation of a statistical measure of basin complexity, Water Resources Research, 54, https://doi.org/10.1029/2018WR022675.</p>
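The k-nearest neighbor model of lagged streamflow described in the abstract can be sketched as follows. This is a minimal illustration, not the cited method: the function and parameter names are invented for the example, and a fixed Euclidean neighborhood radius stands in for the per-query parameter selection via data depth and VC generalization theory.

```python
import numpy as np

def knn_streamflow_forecast(flow, lags, radius, t):
    """Predict flow at time step t from historically similar lag patterns.

    The query is the vector of the last `lags` flows before t; the
    prediction averages the next-day flows of all past lag patterns
    within `radius` (Euclidean distance) of the query.
    """
    query = flow[t - lags:t]
    # Library of historical lag patterns and the flow that followed each.
    patterns = np.array([flow[i - lags:i] for i in range(lags, t)])
    successors = np.array([flow[i] for i in range(lags, t)])
    dists = np.linalg.norm(patterns - query, axis=1)
    neighbors = successors[dists <= radius]
    if neighbors.size == 0:  # no similar event in the record: fall back
        return successors[np.argmin(dists)]
    return neighbors.mean()

# Usage on a synthetic seasonal daily-flow series
rng = np.random.default_rng(0)
flow = 10 + np.sin(np.arange(400) / 20) * 5 + rng.normal(0, 0.5, 400)
print(knn_streamflow_forecast(flow, lags=3, radius=1.0, t=399))
```

In the cited approach, `lags` and `radius` would be re-selected per query by trading off empirical performance against VC dimension, rather than fixed as here.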

Author(s):  
M. Jeyanthi ◽  
C. Velayutham

Brain–computer interface (BCI) research plays a vital role in science and technology development. Classification is a data mining technique used to predict group membership for data instances. Analysis of BCI data is challenging because feature extraction and classification are more difficult for these data than for raw data. In this paper, we extract features from raw EEG data using statistical Haralick features. The features are then normalized, binning is used to improve the accuracy of the predictive models by reducing noise and eliminating some irrelevant attributes, and classification is performed on the BCI data set using several techniques: naïve Bayes, the k-nearest neighbor classifier, and the SVM classifier. Finally, we propose the SVM classification algorithm for the BCI data set.
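The normalize → bin → classify comparison described above can be sketched with scikit-learn. This is a hedged stand-in: the synthetic feature matrix below replaces the Haralick features the paper extracts from EEG, and the bin count and classifier settings are illustrative defaults, not the paper's.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for extracted EEG feature vectors (two classes).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(1, 1, (100, 8))])
y = np.array([0] * 100 + [1] * 100)

# Normalize, bin to reduce noise, then compare the three classifiers.
for name, clf in [("NB", GaussianNB()),
                  ("kNN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC())]:
    pipe = make_pipeline(MinMaxScaler(),
                         KBinsDiscretizer(n_bins=10, encode="ordinal"),
                         clf)
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```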


1997 ◽  
Vol 08 (03) ◽  
pp. 301-315 ◽  
Author(s):  
Marcel J. Nijman ◽  
Hilbert J. Kappen

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, leading to comparable performance without the need to store all data. We show that the RBBM has good classification performance compared to the multilayer perceptron (MLP). The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules on a 'repaired' data set.


Polymers ◽  
2021 ◽  
Vol 13 (21) ◽  
pp. 3811
Author(s):  
Iosif Sorin Fazakas-Anca ◽  
Arina Modrea ◽  
Sorin Vlase

This paper proposes a new method for calculating the monomer reactivity ratios for binary copolymerization based on the terminal model. The original optimization method involves a numerical integration algorithm and an optimization algorithm based on k-nearest neighbour non-parametric regression. The calculation method has been tested on simulated and experimental data sets, at low (<10%), medium (10–35%) and high conversions (>40%), yielding reactivity ratios in good agreement with the usual methods such as intersection, Fineman–Ross, reverse Fineman–Ross, Kelen–Tüdös, extended Kelen–Tüdös and the error-in-variables method. The experimental data sets used in this comparative analysis are the copolymerization of 2-(N-phthalimido) ethyl acrylate with 1-vinyl-2-pyrolidone for low conversion, the copolymerization of isoprene with glycidyl methacrylate for medium conversion, and the copolymerization of N-isopropylacrylamide with N,N-dimethylacrylamide for high conversion. The possibility of estimating experimental errors from a single experimental data set of n points is also shown.
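The k-nearest-neighbour non-parametric regression at the core of the optimization can be sketched in a few lines. This is illustrative only: the paper couples such a regressor with a numerical integrator of the terminal copolymerization model, which is omitted here, and the function name is invented for the example.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k=5):
    """Minimal k-nearest-neighbour non-parametric regression: predict
    at x_query by averaging the y-values of the k training points
    closest to it (uniform weights, 1-D predictor)."""
    order = np.argsort(np.abs(x_train - x_query))
    return y_train[order[:k]].mean()

# Usage on a noisy curve: the estimate tracks the local mean of y.
x = np.linspace(0, 1, 50)
y = x ** 2 + np.random.default_rng(1).normal(0, 0.01, 50)
print(knn_regress(x, y, 0.5, k=5))  # close to 0.25
```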


2018 ◽  
Vol 19 (1) ◽  
pp. 144-157
Author(s):  
Mehdi Zekriyapanah Gashti

Exponential growth of medical data and recorded resources from patients with different diseases can be exploited to establish an optimal association between disease symptoms and diagnosis. The main issue in diagnosis is the variability of the features that can be attributed to particular diseases, since some of these features are not essential for the diagnosis and may even lead to a delay in diagnosis. For instance, diabetes, hepatitis, breast cancer, and heart disease, which express multitudes of clinical manifestations as symptoms, are among the diseases with a higher morbidity rate. Timely diagnosis of such diseases can play a critical role in decreasing their effect on patients' quality of life and on the costs of their treatment. Thanks to the large data sets available, computer-aided diagnosis can be an advanced option for early diagnosis of diseases. In this paper, a new method for diagnosis is proposed using a Flower Pollination Algorithm (FPA) and K-Nearest Neighbor (KNN). The modified model can diagnose diseases more accurately by reducing the number of features: Feature Selection (FS) is done by FPA and data classification is performed using KNN. The results showed higher efficiency of the modified model on diagnosis of diabetes, hepatitis, breast cancer, and heart disease compared to plain KNN models.
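The FS-by-metaheuristic, classify-by-KNN structure can be sketched as a wrapper search. This is a hedged sketch: plain random search stands in for the Flower Pollination Algorithm (whose pollination operators are not reproduced here), and the data set and settings are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Candidate feature subsets are scored by KNN cross-validation accuracy;
# the search keeps the best-scoring subset, as a wrapper FS scheme does.
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
knn = KNeighborsClassifier(n_neighbors=5)

def fitness(mask):
    """KNN accuracy using only the features selected by the mask."""
    return cross_val_score(knn, X[:, mask], y, cv=3).mean() if mask.any() else 0.0

best_mask, best_score = None, 0.0
for _ in range(20):                      # 20 random candidate subsets
    mask = rng.random(X.shape[1]) < 0.5  # keep roughly half the features
    score = fitness(mask)
    if score > best_score:
        best_mask, best_score = mask, score
print(best_mask.sum(), round(best_score, 3))
```

An FPA-based version would evolve the candidate masks via global and local pollination steps instead of sampling them independently.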


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
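The memory argument behind the landmark idea can be made concrete: a full N×N similarity matrix is O(N²), while restricting comparisons to m ≪ N landmarks needs only an N×m matrix. In this sketch the landmarks are sampled uniformly at random, standing in for the paper's curvature-variation sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, d = 10_000, 100, 5           # N points, m landmarks, d dimensions
data = rng.normal(size=(N, d))
landmarks = data[rng.choice(N, size=m, replace=False)]

# An N x m distance matrix instead of N x N: 100x less memory here.
dists = np.linalg.norm(data[:, None, :] - landmarks[None, :, :], axis=2)
print(dists.shape)  # (10000, 100)
```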


Author(s):  
Sikha Bagui ◽  
Arup Kumar Mondal ◽  
Subhash Bagui

In this work the authors present a parallel k-nearest neighbor (kNN) algorithm that uses locality sensitive hashing to preprocess the data before it is classified using kNN in Hadoop's MapReduce framework. This is compared with the sequential (conventional) implementation. Using locality sensitive hashing's similarity measure with kNN, the iterative procedure to classify a data object is performed within a hash bucket rather than over the whole data set, greatly reducing the computation time needed for classification. Several experiments showed that the parallel implementation performed better than the sequential implementation on very large datasets. The study also experimented with a few map- and reduce-side optimization features for the parallel implementation and presents some optimum map- and reduce-side parameters. Among the map-side parameters, the block size and input split size were varied; among the reduce-side parameters, the number of planes was varied; and their effects were studied.
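The bucketing idea can be sketched with random-hyperplane LSH. This is a single-machine sketch of the principle, not the article's Hadoop MapReduce implementation; the "number of planes" here corresponds to the reduce-side parameter the study varied.

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(data, n_planes, rng):
    """Random-hyperplane LSH: each point's signature is the pattern of
    signs of its projections onto n_planes random hyperplanes. Points
    sharing a signature share a bucket, so a kNN query scans only its
    own bucket instead of the whole data set."""
    planes = rng.normal(size=(n_planes, data.shape[1]))
    buckets = defaultdict(list)
    for i, x in enumerate(data):
        sig = tuple((planes @ x) > 0)
        buckets[sig].append(i)
    return planes, buckets

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))
planes, buckets = lsh_buckets(data, n_planes=8, rng=rng)

# A query scans only the candidates in its own bucket.
q = data[0]
candidates = buckets[tuple((planes @ q) > 0)]
print(len(candidates), "candidates out of", len(data), "points")
```

More planes mean smaller buckets (faster queries) but a higher chance of missing true neighbors, which is the trade-off the reduce-side experiments explore.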


Diagnostics ◽  
2019 ◽  
Vol 9 (3) ◽  
pp. 104 ◽  
Author(s):  
Ahmed ◽  
Yigit ◽  
Isik ◽  
Alpkocak

Leukemia is a fatal cancer and has two main types, acute and chronic, each with two subtypes, lymphoid and myeloid; hence, in total, there are four subtypes of leukemia. This study proposes a new approach for the diagnosis of all subtypes of leukemia from microscopic blood cell images using convolutional neural networks (CNN), which require a large training data set. We therefore also investigated the effect of data augmentation, i.e., synthetically increasing the number of training samples. We used two publicly available leukemia data sources: ALL-IDB and ASH Image Bank. We applied seven different image transformation techniques as data augmentation and designed a CNN architecture capable of recognizing all subtypes of leukemia. We also explored other well-known machine learning algorithms such as naive Bayes, support vector machine, k-nearest neighbor, and decision tree. To evaluate our approach, we set up a series of experiments and used 5-fold cross-validation. The results showed that our CNN model achieves 88.25% and 81.74% accuracy in leukemia versus healthy classification and in multiclass classification of all subtypes, respectively. Finally, we also showed that the CNN model performs better than the other well-known machine learning algorithms.
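Image-transformation augmentation of the kind described can be sketched with simple label-preserving transforms. The abstract mentions seven techniques without naming them; the four below are common representative choices, not necessarily the study's.

```python
import numpy as np

def augment(image):
    """Yield label-preserving variants of a normalized ([0, 1]) image:
    flips, a 90-degree rotation, and a small brightness scaling."""
    yield np.fliplr(image)            # horizontal flip
    yield np.flipud(image)            # vertical flip
    yield np.rot90(image)             # 90-degree rotation
    yield np.clip(image * 1.1, 0, 1)  # brightness scaling

# Usage: each training image yields four extra synthetic samples.
image = np.random.default_rng(0).random((32, 32))
augmented = list(augment(image))
print(len(augmented))  # 4
```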


2014 ◽  
Vol 701-702 ◽  
pp. 110-113
Author(s):  
Qi Rui Zhang ◽  
He Xian Wang ◽  
Jiang Wei Qin

This paper reports a comparative study of feature selection algorithms on a hyperlipidemia data set. Three methods of feature selection were evaluated: document frequency (DF), information gain (IG) and the χ2 statistic (CHI). The classification systems use a vector to represent a document and use tfidfie (term frequency, inverse document frequency, and inverse entropy) to compute term weights. In order to compare the effectiveness of feature selection, we used three classification methods: naïve Bayes (NB), k-nearest neighbor (kNN) and support vector machines (SVM). The experimental results show that IG and CHI significantly outperform DF, and that SVM and NB are more effective than kNN when the macro-averaged F1 measure is used. DF is suitable for the task of large-scale text classification.
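The information gain criterion compared above can be sketched directly from its definition: the entropy of the class distribution minus its conditional entropy given the presence or absence of a term. The toy documents and labels below are invented for illustration.

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """IG of a term for class prediction: H(class) minus the expected
    entropy of the class split by whether the term occurs in a doc."""
    def entropy(ls):
        n = len(ls)
        if n == 0:
            return 0.0
        return -sum(c / n * math.log2(c / n) for c in Counter(ls).values())
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    return entropy(labels) - (len(with_t) / n * entropy(with_t)
                              + len(without_t) / n * entropy(without_t))

# A term that perfectly separates the classes has maximal gain.
docs = [{"lipid", "high"}, {"lipid", "low"}, {"normal"}, {"normal", "low"}]
labels = ["sick", "sick", "healthy", "healthy"]
print(information_gain(docs, labels, "lipid"))  # 1.0
```

Terms are then ranked by this score and only the top-ranked ones are kept as features.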


2019 ◽  
Vol 8 (4) ◽  
pp. 9155-9158

Classification is a machine learning task that predicts the class membership of unlabeled examples using a model learned from training examples whose labels are known. Classification tasks span a huge assortment of domains and real-world applications: disciplines such as medical diagnosis, bioinformatics, financial engineering and image recognition, among others, where domain experts can use the learned model to support their decisions. All the classification approaches covered in this paper were evaluated in an appropriate experimental framework in the R programming language, with the major emphasis on the k-nearest neighbor method, support vector machines and decision trees, over a large number of data sets of varied dimensionality, comparing their performance against other state-of-the-art methods. The experimental results obtained were verified by statistical tests, which support the better performance of the methods. In this paper we survey various classification techniques of data mining and compare them using diverse datasets from the University of California, Irvine (UCI) Machine Learning Repository, with accuracy calculations on the Iris data set.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Sarah Simmons ◽  
Grady Wier ◽  
Antonio Pedraza ◽  
Mark Stibich

Abstract Background The role of the environment in hospital acquired infections is well established. We examined the impact of an environmental hygiene intervention, a pulsed xenon ultraviolet (PX-UV) disinfection system, on the rate of hospital onset Clostridioides difficile infection (HO-CDI) in 48 hospitals over a 5-year period. Methods Utilization data was collected directly from the automated PX-UV system and uploaded in real time to a database. HO-CDI data was provided by each facility. Data was analyzed at the unit level to determine compliance with disinfection protocols. The final data set included 5 years of data aggregated to the facility level, resulting in a dataset of 48 hospitals over a date range of January 2015–December 2019. Negative binomial regression with an offset on patient days was used to convert infection counts to rates and to assess HO-CDI rates vs. intervention compliance rate, total successful disinfection cycles, and total rooms disinfected. The K-Nearest Neighbor (KNN) machine learning algorithm was used to compare intervention compliance and total intervention cycles to presence of infection. Results All regression models depict a statistically significant inverse association between the intervention and HO-CDI rates. The KNN model predicts whether an infection will be present or not with greater than 98% accuracy when considering both intervention compliance and total intervention cycles. Conclusions The findings of this study indicate a strong inverse relationship between utilization of the pulsed xenon intervention and HO-CDI rates.
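The KNN step described above, predicting infection presence from compliance and cycle counts, can be sketched with scikit-learn. The data below is synthetic, simulating the reported inverse relationship; the study itself used five years of records from 48 hospitals, and the feature scaling and k are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic unit-level records: compliance rate and disinfection cycles,
# with infection presence inversely related to both (plus noise).
rng = np.random.default_rng(7)
compliance = rng.uniform(0.5, 1.0, 500)
cycles = rng.integers(100, 2000, 500).astype(float)
infection = (compliance + cycles / 2000 + rng.normal(0, 0.1, 500)) < 1.2

# Put both predictors on a comparable scale before distance-based KNN.
X = np.column_stack([compliance, cycles / 2000])
acc = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                      X, infection.astype(int), cv=5).mean()
print(round(acc, 3))
```

Scaling matters here: without it, the raw cycle counts (hundreds to thousands) would dominate the Euclidean distances and the compliance rate would be effectively ignored.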

