VECTOR SPACE INDEXING FOR BIOSEQUENCE SIMILARITY SEARCHES

2005 ◽  
Vol 14 (05) ◽  
pp. 811-826 ◽  
Author(s):  
OZGUR OZTURK ◽  
HAKAN FERHATOSMANOGLU

We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentally compared their (a) approximation quality for k-Nearest Neighbor (k-NN) queries and both (b) pruning ability and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e. Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of transformed vectors and distance functions. Promising results from the experiments on real biosequence data sets are presented.
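
To make the 2-gram frequency transform concrete, here is a minimal sketch of mapping DNA subsequences to frequency vectors and comparing them with a Euclidean distance. It illustrates the idea behind a frequency distance such as FD2, not the paper's exact definition; the function names are ours.

```python
from itertools import product

def frequency_vector(seq, alphabet="ACGT"):
    """Count occurrences of every 2-gram over the alphabet in seq."""
    grams = ["".join(p) for p in product(alphabet, repeat=2)]
    index = {g: i for i, g in enumerate(grams)}
    vec = [0] * len(grams)
    for i in range(len(seq) - 1):
        g = seq[i:i + 2]
        if g in index:
            vec[index[g]] += 1
    return vec

def frequency_distance(a, b):
    """Euclidean distance between the 2-gram frequency vectors of a and b."""
    va, vb = frequency_vector(a), frequency_vector(b)
    return sum((x - y) ** 2 for x, y in zip(va, vb)) ** 0.5

print(frequency_distance("ACGTACGT", "ACGTTGCA"))
```

A k-NN or range query can then be answered approximately by searching a vector index built over such frequency vectors with this distance.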

Polymers ◽  
2021 ◽  
Vol 13 (21) ◽  
pp. 3811
Author(s):  
Iosif Sorin Fazakas-Anca ◽  
Arina Modrea ◽  
Sorin Vlase

This paper proposes a new method for calculating the monomer reactivity ratios for binary copolymerization based on the terminal model. The original optimization method involves a numerical integration algorithm and an optimization algorithm based on k-nearest neighbour non-parametric regression. The calculation method has been tested on simulated and experimental data sets, at low (<10%), medium (10–35%) and high conversions (>40%), yielding reactivity ratios in good agreement with the usual methods such as intersection, Fineman–Ross, reverse Fineman–Ross, Kelen–Tüdös, extended Kelen–Tüdös and the error-in-variables method. The experimental data sets used in this comparative analysis are the copolymerization of 2-(N-phthalimido) ethyl acrylate with 1-vinyl-2-pyrrolidone for low conversion, the copolymerization of isoprene with glycidyl methacrylate for medium conversion and the copolymerization of N-isopropylacrylamide with N,N-dimethylacrylamide for high conversion. The paper also shows how experimental errors can be estimated from a single experimental data set of n measurements.
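
As an illustration of the non-parametric regression building block, the following is a minimal sketch of k-nearest-neighbour regression in one dimension; the feed/composition data and variable names are hypothetical, and the paper's numerical integration and optimization steps are not reproduced.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k=3):
    """Predict the response at x_query as the mean of the k nearest training responses."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    nearest = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[nearest].mean()

# Hypothetical feed/composition data standing in for copolymerization measurements
f1 = np.linspace(0.1, 0.9, 9)            # monomer-1 mole fraction in the feed
F1 = f1**2 / (f1**2 + (1 - f1)**2)       # toy copolymer composition curve
print(knn_regress(f1, F1, 0.45))
```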


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
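
A rough sketch of the landmark idea follows; it substitutes farthest-point sampling for the article's local curvature variation algorithm and applies scikit-learn's Isomap to the landmarks only, so it should be read as an illustration of landmark-based manifold learning rather than the authors' method. The data are a synthetic stand-in, not hyperspectral pixels.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

def farthest_point_landmarks(X, m):
    """Pick m landmarks by farthest-point sampling (a stand-in for the
    local-curvature-variation sampling described in the article)."""
    X = np.asarray(X, dtype=float)
    idx = [0]
    d = np.linalg.norm(X - X[0], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(d))
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)

X, _ = make_swiss_roll(n_samples=2000, random_state=0)   # synthetic stand-in data
landmarks = farthest_point_landmarks(X, 200)
skeleton = Isomap(n_neighbors=10, n_components=2).fit_transform(X[landmarks])
print(skeleton.shape)   # (200, 2): the low-dimensional "manifold skeleton"
```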


Author(s):  
Wei Yan

In cloud computing environments, parallel kNN queries over big data are an important issue. The k-nearest-neighbor query (kNN query), which finds the k nearest neighbors in a dataset S for every object in another dataset R, is a primitive operator widely adopted by applications including knowledge discovery, data mining, and spatial databases. This chapter proposes a parallel method for kNN queries over big data using the MapReduce programming model. First, it proposes an approximate algorithm based on mapping multi-dimensional data sets into two-dimensional ones and transforming kNN queries into a sequence of two-dimensional point searches. Then, in the two-dimensional space, it proposes a partitioning method based on the Voronoi diagram, which it incorporates into an R-tree. Furthermore, it proposes an efficient algorithm for processing kNN queries on the R-tree using the MapReduce programming model. Finally, the chapter presents the results of extensive experimental evaluations, which indicate the efficiency of the proposed approach.
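
The following single-machine sketch mimics the map/reduce decomposition of a kNN join in plain Python; the partitioning key is a simple one-dimensional hash rather than the chapter's Voronoi/R-tree scheme, and no Hadoop/MapReduce runtime is involved.

```python
from collections import defaultdict
import heapq

def knn_join_mapreduce(R, S, k, num_partitions=4):
    """Toy kNN join: for every r in R, find its k nearest points in S.
    Map: bucket S points by a simple 1-D key; Reduce: gather candidates per
    query and keep the best k (a stand-in for the Voronoi/R-tree partitioning)."""
    # "Map" phase: partition S by the integer part of the first coordinate
    buckets = defaultdict(list)
    for s in S:
        buckets[int(s[0]) % num_partitions].append(s)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # "Reduce" phase: each query probes the buckets and keeps its k best candidates
    result = {}
    for r in R:
        candidates = []
        for part in buckets.values():
            candidates.extend((dist2(r, s), s) for s in part)
        result[r] = [s for _, s in heapq.nsmallest(k, candidates)]
    return result

R = [(1.0, 2.0), (5.0, 5.0)]
S = [(1.1, 2.1), (4.9, 5.2), (9.0, 0.0), (0.5, 1.5)]
print(knn_join_mapreduce(R, S, k=2))
```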


Web Mining ◽  
2011 ◽  
pp. 253-275
Author(s):  
Xiaodi Huang ◽  
Wei Lai

This chapter presents a new approach to clustering graphs, and applies it to Web graph display and navigation. The proposed approach takes advantage of the linkage patterns of graphs, and utilizes an affinity function in conjunction with the k-nearest neighbor. This chapter uses Web graph clustering as an illustrative example, and offers a potentially more applicable method to mine structural information from data sets, with the hope of informing readers of another aspect of data mining and its applications.
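
A hedged sketch of the general idea: build an affinity between nodes (here a Jaccard overlap of neighbourhoods, standing in for the chapter's affinity function), keep each node's k most similar nodes, and read clusters off the connected components of the resulting kNN graph.

```python
from collections import defaultdict

def knn_graph_clusters(edges, k=2):
    """Cluster an undirected graph: affinity = Jaccard overlap of neighbourhoods
    (a stand-in for the chapter's affinity function); each node keeps edges to its
    k most similar nodes, and the connected components become the clusters."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = list(adj)

    def jaccard(u, v):
        a, b = adj[u], adj[v]
        return len(a & b) / len(a | b) if a | b else 0.0

    # Keep each node's k most similar neighbours
    knn_adj = defaultdict(set)
    for u in nodes:
        sims = sorted(((jaccard(u, v), v) for v in nodes if v != u), reverse=True)
        for _, v in sims[:k]:
            knn_adj[u].add(v)
            knn_adj[v].add(u)

    # Connected components of the kNN graph are the clusters
    seen, clusters = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(knn_adj[x] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
print(knn_graph_clusters(edges, k=2))
```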


Author(s):  
Amit Saxena ◽  
John Wang

This paper presents a two-phase scheme that selects a reduced number of features from a dataset using a Genetic Algorithm (GA) and tests the classification accuracy (CA) of the dataset with the reduced feature set. In the first phase of the proposed work, an unsupervised approach to selecting a subset of features is applied: GA is used to stochastically select a reduced number of features, with Sammon error as the fitness function, and different subsets of features are obtained. In the second phase, each reduced feature set is used to test the CA of the dataset. The CA of a data set is validated using the supervised k-nearest neighbor (k-nn) algorithm. The novelty of the proposed scheme is that each reduced feature set obtained in the first phase is investigated for CA using k-nn classification with different Minkowski metrics, i.e., non-Euclidean norms, instead of the conventional Euclidean norm (L2). Final results are presented with extensive simulations on seven real and one synthetic data set. The investigation reveals that using different norms produces better CA and hence offers scope for better feature subset selection.
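
The second phase can be sketched as follows with scikit-learn: score a hypothetical GA-selected feature subset with a k-nn classifier under several Minkowski norms, of which p = 2 is the conventional Euclidean case. The data set and subset indices are placeholders, and the GA/Sammon-error phase is omitted.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
feature_subset = [0, 4, 6, 9, 11]          # a hypothetical GA-selected subset
for p in (1, 1.5, 2, 4):                   # Minkowski norms; p=2 is the Euclidean L2 norm
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    acc = cross_val_score(knn, X[:, feature_subset], y, cv=5).mean()
    print(f"p={p}: CA={acc:.3f}")
```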


2019 ◽  
Vol 12 (2) ◽  
pp. 140
Author(s):  
Retsi Firda Maulina ◽  
Anik Djuraidah ◽  
Anang Kurnia

Poverty is a complex and multidimensional problem, making it a development priority. Applications of poverty modeling to discrete data are still few, as are applications of the Bayesian paradigm. The Bayes method is a parameter estimation approach that combines initial (prior) information with sample information, so it can provide predictions with higher accuracy than classical methods. Bayesian inference using the INLA approach provides faster computation than MCMC and makes it possible to use large data sets. This study models poverty in Java using the Bayesian spatial probit with the INLA approach and three weighting matrices, namely K-Nearest Neighbor (KNN), inverse distance, and exponential distance. The results show that the best model, the Bayesian SAR probit with INLA and the KNN weighting matrix, produced the highest classification accuracy, with a specificity of 85.45%, a sensitivity of 93.75%, and an accuracy of 89.92%.
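
For illustration, a row-standardized KNN spatial weight matrix of the kind used as one of the three weighting schemes can be built as below; the district centroids are hypothetical, and the Bayesian SAR probit fit with INLA (typically done in R) is not shown.

```python
import numpy as np

def knn_weight_matrix(coords, k=3):
    """Row-standardized K-nearest-neighbour spatial weight matrix W:
    w_ij = 1/k if j is one of i's k nearest regions, else 0."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    W = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        d[i] = np.inf                        # exclude the region itself
        for j in np.argsort(d)[:k]:
            W[i, j] = 1.0 / k
    return W

# Hypothetical centroids of a few districts (longitude, latitude)
centroids = [(110.4, -7.0), (110.8, -7.2), (111.1, -7.5), (112.0, -7.3), (112.7, -7.9)]
print(knn_weight_matrix(centroids, k=2))
```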


2005 ◽  
Vol 02 (02) ◽  
pp. 167-180
Author(s):  
SEUNG-JOON OH ◽  
JAE-YEARN KIM

Clustering of sequences is relatively less explored but it is becoming increasingly important in data mining applications such as web usage mining and bioinformatics. The web user segmentation problem uses web access log files to partition a set of users into clusters such that users within one cluster are more similar to one another than to the users in other clusters. Similarly, grouping protein sequences that share a similar structure can help to identify sequences with similar functions. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a splice dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.
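
A simplified sketch of the sampling-plus-nearest-neighbour idea: cluster a small random sample greedily under an n-gram distance, then assign every remaining sequence to the cluster of its nearest sampled sequence. The distance function and threshold are placeholders, not the paper's similarity measure.

```python
import random

def ngram_dist(s, t, n=3):
    """A simple sequence distance: 1 minus the Jaccard similarity of n-gram sets."""
    a = {s[i:i + n] for i in range(len(s) - n + 1)}
    b = {t[i:i + n] for i in range(len(t) - n + 1)}
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

def sample_and_knn_cluster(seqs, sample_size=3, threshold=0.7):
    """Greedily cluster a random sample, then assign each remaining sequence
    to the cluster containing its nearest sampled sequence (1-NN assignment)."""
    sample = random.sample(seqs, min(sample_size, len(seqs)))
    clusters = []
    for s in sample:                          # greedy clustering of the sample
        for c in clusters:
            if ngram_dist(s, c[0]) < threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    for s in seqs:                            # nearest-neighbour assignment of the rest
        if any(s in c for c in clusters):
            continue
        best = min(clusters, key=lambda c: min(ngram_dist(s, t) for t in c))
        best.append(s)
    return clusters

random.seed(0)
seqs = ["ACGTACGTAA", "ACGTACGTTT", "TTTTGGGGCC", "TTTTGGGGAA", "ACGTACGACG"]
print(sample_and_knn_cluster(seqs))
```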


2019 ◽  
Vol 8 (4) ◽  
pp. 9155-9158

Classification is a machine learning task that predicts the class membership of unclassified examples, whose labels are unknown, from the properties of examples in a representation learned earlier from training examples whose labels were known. Classification tasks span a wide assortment of domains and real-world purposes: disciplines such as medical diagnosis, bioinformatics, financial engineering and image recognition, among others, where domain experts can use the learned model to support their decisions. All the classification approaches considered in this paper were evaluated in an appropriate experimental framework in the R programming language, with the major emphasis on the k-nearest neighbor method, support vector machines and decision trees over a large number of data sets with varied dimensionality, comparing their performance against other state-of-the-art methods. The experimental results obtained were verified by statistical tests, which support the better performance of these methods. In this paper we survey various data mining classification techniques and then compare them using diverse datasets from the University of California, Irvine (UCI) Machine Learning Repository, with accuracy calculations reported on the Iris data set.
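
The paper works in R; as a language-neutral illustration, the following Python snippet runs the same kind of comparison of k-NN, a support vector machine and a decision tree under cross-validation on the Iris data set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validated accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```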


Author(s):  
Norsyela Muhammad Noor Mathivanan ◽  
Nor Azura Md.Ghani ◽  
Roziah Mohd Janor

Online business development through e-commerce platforms is a phenomenon that has changed the way products are promoted and sold in the 21st century. Product title classification is an important task in assisting retailers and sellers to list a product in a suitable category. Product title classification is part of the text classification problem, but the properties of product titles differ from those of general documents. This study evaluates the performance of five supervised learning models on data sets consisting of e-commerce product titles, which are very short descriptions and incomplete sentences. The supervised learning models involved in the study are Naïve Bayes, K-Nearest Neighbor (KNN), Decision Tree, Support Vector Machine (SVM) and Random Forest. The results show that the KNN model is the best model, with the highest accuracy and the fastest computation time on the data used in the study. Hence, KNN is a good approach for classifying e-commerce products.
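
A minimal sketch of a product-title classifier of the kind compared here: TF-IDF features over short titles feeding a KNN model. The titles, labels and pipeline choices are illustrative, not the study's actual data or preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical short product titles and their categories
titles = ["red cotton t-shirt men", "wireless bluetooth headphones",
          "women leather handbag", "usb-c charging cable 1m"]
labels = ["apparel", "electronics", "apparel", "electronics"]

# TF-IDF over word unigrams/bigrams, then a KNN classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    KNeighborsClassifier(n_neighbors=3))
clf.fit(titles, labels)
print(clf.predict(["mens blue denim jacket", "hdmi cable 2m"]))
```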

