Maximum Variance Hashing via Column Generation

2013 ◽  
Vol 2013 ◽  
pp. 1-10
Author(s):  
Lei Luo ◽  
Chao Zhang ◽  
Yongrui Qin ◽  
Chunyuan Zhang

With the explosive growth of the data volume in modern applications such as web search and multimedia retrieval, hashing is becoming increasingly important for efficient nearest neighbor (similar item) search. Recently, a number of data-dependent methods have been developed, reflecting the great potential of learning for hashing. Inspired by the classic nonlinear dimensionality reduction algorithm—maximum variance unfolding, we propose a novel unsupervised hashing method, named maximum variance hashing, in this work. The idea is to maximize the total variance of the hash codes while preserving the local structure of the training data. To solve the derived optimization problem, we propose a column generation algorithm, which directly learns the binary-valued hash functions. We then extend it using anchor graphs to reduce the computational cost. Experiments on large-scale image datasets demonstrate that the proposed method outperforms state-of-the-art hashing methods in many cases.
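A minimal sketch of the objective described above, under assumptions: maximize the total variance of binary hash codes while penalizing codes that differ between neighboring training points. This illustrates the trade-off only; it does not implement the column generation solver, and the function names and weighting are illustrative.

```python
import numpy as np

def hash_objective(codes, neighbors, lam=1.0):
    """codes: (n, b) matrix of +/-1 hash bits; neighbors: list of (i, j) pairs."""
    variance = codes.var(axis=0).sum()            # total variance over all bits
    penalty = sum(np.sum(codes[i] != codes[j])    # Hamming gap between neighbors
                  for i, j in neighbors)
    return variance - lam * penalty / max(len(neighbors), 1)

rng = np.random.default_rng(0)
codes = rng.choice([-1, 1], size=(100, 16))       # toy hash codes
neighbors = [(i, i + 1) for i in range(99)]       # toy local structure
print(hash_objective(codes, neighbors))
```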

Author(s):  
Althaf Ali A ◽  
Mahammad Shafi R

Performance profiling and testing is one of the interesting topics in big data management and cloud computing. In testing, we use test cases composed of different types of queries to evaluate the performance of an information retrieval (IR) system over a large-scale information collection. This test scenario evaluates retrieval accuracy for ambiguous and non-factoid queries, with the result set serving as training data. It remains difficult to evaluate a retrieval method so as to schedule or optimize the recommendation and prediction techniques of the IR method for real-time queries. Queries are treated as requirement specifications supplied to a search engine or web information provider for information or web page retrieval. In this paper, we propose a novel technique, named the "Test Retrieval Framework", for performance profiling and testing of web search engines on the information retrieved for non-factoid queries. In this technique, we apply the expectation maximization algorithm as an iterative method to find maximum likelihood estimates. We discuss the important aspects of this work: recommendation models integrating domain knowledge and web usage, query optimization for navigational and transactional queries, and query result records. The experimental results demonstrate that the proposed technique outperforms state-of-the-art approaches in terms of set-based measures such as precision, recall, and F-measure, and rank-based measures such as mean average precision and cumulative gain.
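A hedged sketch of the EM step the abstract mentions: iteratively estimating maximum-likelihood parameters of a two-component Gaussian mixture, for example over relevance scores of retrieved records. The mixture model, initialization, and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def em_gaussian_mixture(x, iters=50):
    mu = np.array([x.min(), x.max()])   # initial component means
    sigma = np.array([x.std(), x.std()]) + 1e-6
    pi = np.array([0.5, 0.5])           # mixing weights
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibility-weighted data
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(x)
    return mu, sigma, pi

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.2, 0.05, 300), rng.normal(0.8, 0.1, 200)])
print(em_gaussian_mixture(scores))
```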


2021 ◽  
Vol 6 (12) ◽  
pp. 13931-13953
Author(s):  
Yunfeng Shi ◽  
Shu Lv ◽  
Kaibo Shi ◽  
...

Support vector machine (SVM) is one of the most powerful machine learning technologies and has attracted wide attention for its remarkable performance. However, when dealing with the classification of large-scale datasets, the high complexity of the SVM model leads to low efficiency and becomes impractical. Exploiting the sparsity of SVM in the sample space, this paper presents a new parallel data geometry analysis (PDGA) algorithm to reduce the training set of SVM, which helps to improve the efficiency of SVM training. PDGA introduces the Mahalanobis distance to measure the distance from each sample to its centroid and, based on this, proposes a method that identifies non-support vectors and outliers at the same time to help remove redundant data. When the training set is further reduced, a cosine angle distance analysis method is proposed to determine whether samples are redundant, ensuring that valuable data are not removed. Unlike previous data geometry analysis methods, the PDGA algorithm is implemented in parallel, which greatly reduces the computational cost. Experimental results on an artificial dataset and six real datasets show that the algorithm can adapt to different sample distributions, significantly reducing training time and memory requirements without sacrificing classification accuracy, and that its performance is clearly better than that of five competitive algorithms.
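A minimal sketch, under assumptions, of the two geometric filters described above: Mahalanobis distance to the centroid to flag likely non-support-vectors and outliers, followed by a cosine-angle pass that drops near-duplicate samples. The thresholds, quantile band, and function names are illustrative, not the authors' code, and the sketch is sequential rather than parallel.

```python
import numpy as np

def mahalanobis_to_centroid(X):
    centroid = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - centroid
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

def prune_training_set(X, keep_band=(0.4, 0.95), cos_thresh=0.999):
    d = mahalanobis_to_centroid(X)
    lo, hi = np.quantile(d, keep_band)
    keep = (d >= lo) & (d <= hi)        # drop innermost points and outliers
    Xk = X[keep]
    # cosine-angle pass: drop near-duplicate directions among survivors
    unit = Xk / np.linalg.norm(Xk, axis=1, keepdims=True)
    selected = []
    for i, u in enumerate(unit):
        if all(u @ unit[j] < cos_thresh for j in selected):
            selected.append(i)
    return Xk[selected]

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
print(prune_training_set(X).shape)      # reduced training set for SVM
```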


Sensors ◽  
2020 ◽  
Vol 20 (13) ◽  
pp. 3780 ◽  
Author(s):  
Mustansar Fiaz ◽  
Arif Mahmood ◽  
Ki Yeol Baek ◽  
Sehar Shahzad Farooq ◽  
Soon Ki Jung

CNN-based trackers, especially those based on Siamese networks, have recently attracted considerable attention because of their relatively good performance and low computational cost. For many Siamese trackers, learning a generic object model from a large-scale dataset is still a challenging task. In the current study, we introduce input noise as regularization in the training data to improve generalization of the learned model. We propose an Input-Regularized Channel Attentional Siamese (IRCA-Siam) tracker which exhibits improved generalization compared to the current state-of-the-art trackers. In particular, we exploit offline learning by introducing additive noise for input data augmentation to mitigate the overfitting problem. We propose feature fusion from noisy and clean input channels, which improves target localization. Channel attention integrated with our framework helps find more useful target features, resulting in further performance improvement. Our proposed IRCA-Siam enhances target/background discrimination and improves fault tolerance and generalization. An extensive experimental evaluation on six benchmark datasets, including OTB2013, OTB2015, TC128, UAV123, VOT2016 and VOT2017, demonstrates superior performance of the proposed IRCA-Siam tracker compared to 30 existing state-of-the-art trackers.
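A hedged sketch of two ingredients named above, in PyTorch: additive input noise as a regularizing augmentation, and squeeze-and-excitation style channel attention applied to fused clean/noisy features. This illustrates the idea only; the layer sizes, noise scale, and fusion-by-summation are assumptions, not the IRCA-Siam implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # global average pool -> channel weights
        return x * w[:, :, None, None]         # re-weight feature channels

conv = nn.Conv2d(3, 16, 3, padding=1)          # stand-in feature extractor
attn = ChannelAttention(16)

x = torch.randn(2, 3, 64, 64)                  # clean input crop
noisy = x + 0.1 * torch.randn_like(x)          # additive-noise augmentation
fused = conv(x) + conv(noisy)                  # fuse clean and noisy channels
out = attn(fused)
print(out.shape)                               # torch.Size([2, 16, 64, 64])
```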


Author(s):  
Junjie Chen ◽  
William K. Cheung ◽  
Anran Wang

Hashing is an efficient approximate nearest neighbor search method and has been widely adopted for large-scale multimedia retrieval. While supervised learning is more popular for data-dependent hashing, deep unsupervised hashing methods have recently been developed to learn non-linear transformations for converting multimedia inputs to binary codes. Most existing deep unsupervised hashing methods make use of a quadratic constraint for minimizing the difference between the compact representations and the target binary codes, which inevitably causes severe information loss. In this paper, we propose a novel deep unsupervised method called DeepQuan for hashing. The DeepQuan model utilizes a deep autoencoder network, where the encoder is used to learn compact representations and the decoder is for manifold preservation. In contrast to existing unsupervised methods, DeepQuan learns the binary codes by minimizing the quantization error through a product quantization technique. Furthermore, a weighted triplet loss is proposed to avoid trivial solutions and poor generalization. Extensive experimental results on standard datasets show that the proposed DeepQuan model outperforms the state-of-the-art unsupervised hashing methods for image retrieval tasks.
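A minimal sketch of minimizing quantization error with product quantization, the core idea attributed to DeepQuan above: split each representation into sub-vectors, quantize each sub-vector against its own codebook, and measure the reconstruction error. The codebook sizes, number of sub-spaces, and names are illustrative assumptions (and the codebooks here are random rather than learned).

```python
import numpy as np

def pq_quantize(X, codebooks):
    """X: (n, d); codebooks: list of (k, d/m) arrays, one per sub-space."""
    m = len(codebooks)
    subs = np.split(X, m, axis=1)
    recon = []
    for sub, cb in zip(subs, codebooks):
        # nearest codeword for each sub-vector
        idx = np.argmin(((sub[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
        recon.append(cb[idx])
    Xq = np.hstack(recon)
    return Xq, np.mean((X - Xq) ** 2)           # quantization error to minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 32))                  # stand-in encoder outputs
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]   # m=4 sub-spaces
Xq, err = pq_quantize(X, codebooks)
print(Xq.shape, err)
```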


Author(s):  
Jizhou Huang ◽  
Wei Zhang ◽  
Shiqi Zhao ◽  
Shiqiang Ding ◽  
Haifeng Wang

Providing a plausible explanation for the relationship between two related entities is an important task in some applications of knowledge graphs, such as search engines. However, most existing methods require a large amount of manually labeled training data and therefore cannot be applied to large-scale knowledge graphs, where data annotation is expensive. In addition, these methods typically rely on costly handcrafted features. In this paper, we propose an effective pairwise ranking model that leverages the clickthrough data of a Web search engine to address these two problems. We first construct large-scale training data from the query-title pairs derived from clickthrough data. Then, we build a pairwise ranking model which employs a convolutional neural network to automatically learn relevant features. The proposed model can be easily trained with backpropagation to perform the ranking task. The experiments show that our method significantly outperforms several strong baselines.
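A hedged sketch of pairwise ranking training on clickthrough-derived pairs: for a query, a clicked title should score above a non-clicked one by a margin, and the whole model trains with backpropagation. The small feed-forward encoder is a stand-in for the paper's convolutional network; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

embed_dim = 64
encoder = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 32))

def score(query_vec, title_vec):
    # relevance score as cosine similarity of learned representations
    q, t = encoder(query_vec), encoder(title_vec)
    return nn.functional.cosine_similarity(q, t, dim=-1)

query = torch.randn(8, embed_dim)       # batch of query representations
pos = torch.randn(8, embed_dim)         # clicked (relevant) titles
neg = torch.randn(8, embed_dim)         # non-clicked titles

# pairwise hinge loss: positive must beat negative by a margin of 1
loss = torch.clamp(1.0 - score(query, pos) + score(query, neg), min=0).mean()
loss.backward()                          # trainable end-to-end with backprop
print(float(loss))
```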


Author(s):  
Yutian Lin ◽  
Xuanyi Dong ◽  
Liang Zheng ◽  
Yan Yan ◽  
Yi Yang

Most person re-identification (re-ID) approaches are based on supervised learning, which requires intensive manual annotation of training data. However, it is not only resource-intensive to acquire identity annotations but also impractical to label large-scale real-world data. To relieve this problem, we propose a bottom-up clustering (BUC) approach to jointly optimize a convolutional neural network (CNN) and the relationships among individual samples. Our algorithm considers two fundamental facts in the re-ID task, i.e., diversity across different identities and similarity within the same identity. Specifically, our algorithm starts by regarding each individual sample as a distinct identity, which maximizes the diversity over identities. It then gradually groups similar samples into one identity, which increases the similarity within each identity. We utilize a diversity regularization term in the bottom-up clustering procedure to balance the data volume of each cluster. Finally, the model achieves an effective trade-off between diversity and similarity. We conduct extensive experiments on large-scale image and video re-ID datasets, including Market-1501, DukeMTMC-reID, MARS and DukeMTMC-VideoReID. The experimental results demonstrate that our algorithm is not only superior to state-of-the-art unsupervised re-ID approaches, but also performs favorably against competing transfer learning and semi-supervised learning methods.
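A minimal sketch of bottom-up clustering with a diversity (cluster-size) regularizer, under assumptions: start from singletons and repeatedly merge the pair of clusters with the smallest distance-plus-size penalty, so that no single cluster absorbs everything. The minimum-distance linkage, penalty form, and weight are illustrative, not the paper's exact criterion, and the CNN feature updates are omitted.

```python
import numpy as np

def buc_merge(X, n_clusters, lam=0.1):
    clusters = [[i] for i in range(len(X))]      # each sample starts as its own identity
    while len(clusters) > n_clusters:
        best, best_cost = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(np.linalg.norm(X[i] - X[j])
                           for i in clusters[a] for j in clusters[b])
                # size penalty regularizes toward balanced clusters
                cost = dist + lam * (len(clusters[a]) + len(clusters[b]))
                if cost < best_cost:
                    best, best_cost = (a, b), cost
        a, b = best
        clusters[a] += clusters.pop(b)           # merge the cheapest pair
    return clusters

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.1, size=(10, 2)) for c in (0.0, 1.0, 2.0)])
print([len(c) for c in buc_merge(X, 3)])         # expect three balanced clusters
```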


2021 ◽  
Author(s):  
Wout Bittremieux ◽  
Kris Laukens ◽  
William Stafford Noble ◽  
Pieter C. Dorrestein

Rationale: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data being generated. Here we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra. Methods: falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively compare all spectra to each other. Finally, density-based clustering is performed to group similar spectra into clusters. Results: Using a large draft human proteome dataset consisting of 25 million spectra, falcon is able to generate clusters of similar quality to MS-Cluster and spectra-cluster, two widely used clustering tools, while being considerably faster. Notably, at comparable cluster quality levels, falcon generates larger clusters than the alternative tools, leading to a larger reduction in data volume without loss of relevant information for more efficient downstream processing. Conclusions: falcon is a highly efficient spectrum clustering tool. It is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon.
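A hedged sketch of the first falcon step described above: bin fragment m/z values, then map the high-dimensional bin vector to a low-dimensional one via the feature hashing trick, ready for nearest neighbor indexing. The bin width, output dimension, and hash choice here are illustrative assumptions, not falcon's actual parameters.

```python
import numpy as np

def hash_spectrum(mz, intensity, bin_width=0.05, dim=400, seed=17):
    bins = (np.asarray(mz) / bin_width).astype(int)   # bin high-resolution peaks
    vec = np.zeros(dim)
    for b, inten in zip(bins, intensity):
        h = hash((b, seed)) % dim          # hashing trick: bin index -> low dim
        vec[h] += inten
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec     # unit-normalize for similarity search

mz = [101.07, 175.12, 288.20, 402.25]      # toy fragment m/z values
intensity = [0.3, 1.0, 0.7, 0.2]
print(hash_spectrum(mz, intensity).shape)  # (400,) vector ready for ANN indexing
```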


2018 ◽  
Author(s):  
Roman Zubatyuk ◽  
Justin S. Smith ◽  
Jerzy Leszczynski ◽  
Olexandr Isayev

Atomic and molecular properties can be evaluated from the fundamental Schrödinger equation and therefore represent different modalities of the same quantum phenomena. Here we present AIMNet, a modular and chemically inspired deep neural network potential. We used AIMNet with multitarget training to learn multiple modalities of the state of the atom in a molecular system. On several benchmark datasets, the resulting model shows state-of-the-art accuracy, comparable to the results of DFT methods that are orders of magnitude more expensive. It can simultaneously predict several atomic and molecular properties without an increase in computational cost. With AIMNet we show a new dimension of transferability: the ability to learn new targets utilizing multimodal information from previous training. The model can learn implicit solvation energy (like SMD) utilizing only a fraction of the original training data, and achieves a MAD error of 1.1 kcal/mol compared to experimental solvation free energies in the MNSol database.
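A minimal sketch of multitarget training as described above: one shared network trunk with separate heads for several atomic/molecular properties, trained on a weighted sum of per-target losses. The layer sizes, head names, and loss weights are illustrative assumptions, not the AIMNet architecture.

```python
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(32, 64), nn.SiLU())      # shared representation
heads = nn.ModuleDict({'energy': nn.Linear(64, 1),
                       'charge': nn.Linear(64, 1),
                       'solvation': nn.Linear(64, 1)})   # one head per target
weights = {'energy': 1.0, 'charge': 0.5, 'solvation': 0.5}

x = torch.randn(16, 32)                          # per-atom feature vectors
targets = {k: torch.randn(16, 1) for k in heads} # toy labels per modality
shared = trunk(x)                                # one pass, several predictions
loss = sum(weights[k] * nn.functional.mse_loss(heads[k](shared), targets[k])
           for k in heads)
loss.backward()
print(float(loss))
```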


2020 ◽  
Vol 27 ◽  
Author(s):  
Zaheer Ullah Khan ◽  
Dechang Pi

Background: S-sulfenylation (S-sulphenylation, or sulfenic acid) of proteins is a special kind of post-translational modification that plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. Given this significance, and to complement existing wet-lab methods, several computational models have been developed for predicting sulfenylation cysteine sites. However, the performance of these models has not been satisfactory, due to inefficient feature schemes, severe class imbalance, and the lack of an intelligent learning engine. Objective: In this study, our motivation is to establish a strong and novel computational predictor for discriminating sulfenylation from non-sulfenylation sites. Methods: We report an innovative bioinformatics feature encoding tool, named DeepSSPred, in which encoded features are obtained via an n-segmented hybrid feature scheme; the resampling technique of synthetic minority oversampling (SMOTE) is then employed to cope with the severe imbalance between SC-sites (minority class) and non-SC-sites (majority class). A state-of-the-art 2D convolutional neural network is trained under a rigorous 10-fold jackknife cross-validation scheme for model validation. Results: Following the proposed framework, the strong discrete presentation of the feature space, the machine learning engine, and the unbiased presentation of the underlying training data yield an excellent model that outperforms all existing published studies. The proposed approach is 6% higher in MCC than the previous best; on an independent dataset, the previous best study failed to provide sufficient details. Compared with the second-best method, the model obtains increases of 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp, and 13.12% in MCC on the training data, and 12.13% in ACC, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset. These empirical analyses show the superior performance of the proposed model on both the training and independent datasets in comparison with existing studies. Conclusion: In this research, we have developed a novel sequence-based automated predictor for SC-sites, called DeepSSPred. The empirical results on a training dataset and an independent validation dataset reveal the efficacy of the proposed model. The good performance of DeepSSPred is due to several factors, such as the novel discriminative feature encoding schemes, the SMOTE technique, and the careful construction of the prediction model through a tuned 2D-CNN classifier. We believe this work will provide insight into further prediction of S-sulfenylation characteristics and functionalities, and we hope the developed predictor will be significantly helpful for large-scale discrimination of unknown SC-sites in particular and for designing new pharmaceutical drugs in general.
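A hedged sketch of the SMOTE step described above: synthesize minority-class samples by interpolating between a minority sample and one of its minority-class nearest neighbors. The neighborhood size k, feature dimensions, and sample counts are illustrative, not the paper's settings.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=5):
    rng = np.random.default_rng(seed)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn_idx = np.argsort(d)[1:k + 1]         # k nearest minority neighbors (skip self)
        j = rng.choice(nn_idx)
        gap = rng.random()                      # random interpolation coefficient
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

rng = np.random.default_rng(6)
X_min = rng.normal(size=(40, 10))               # minority (SC-site) feature vectors
X_bal = np.vstack([X_min, smote(X_min, 60)])    # oversampled minority class
print(X_bal.shape)
```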


2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018]. Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model, which enables large-scale precomputation and the use of the inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
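A minimal sketch of the query term independence idea summarized above: if a model scores each query term against a document independently and sums the results, the per-term scores can be precomputed offline and served from an inverted index. The toy term-frequency scorer below is a hypothetical stand-in for a learned per-term model; all names and data are illustrative.

```python
from collections import defaultdict

docs = {'d1': 'neural ranking models for web search',
        'd2': 'inverted index data structures'}

def term_score(term, doc_text):
    # stand-in for a learned per-term relevance model
    return doc_text.split().count(term)

# offline: precompute term -> {doc: score} postings, as an inverted index would
postings = defaultdict(dict)
for doc_id, text in docs.items():
    for term in set(text.split()):
        postings[term][doc_id] = term_score(term, text)

# online: score a query as a sum of independent per-term contributions
query = 'neural search'.split()
scores = defaultdict(float)
for term in query:
    for doc_id, s in postings.get(term, {}).items():
        scores[doc_id] += s
print(dict(scores))                       # {'d1': 2.0}
```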

