Comparison of Apache Solr Search Spellcheck String Distance Measures – Levenshtein, Jaro-Winkler, and N-Gram

Author(s):  
Parameswara Rao Kandregula
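
The three measures named in the title can be contrasted with plain reference implementations. The sketch below is illustrative only: Solr's spellchecker delegates to Lucene string-distance classes, and the exact formulas used there (in particular for the n-gram distance) may differ from the simplified bigram Dice coefficient shown here; the sample word pairs are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def jaro_winkler(a: str, b: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by a shared-prefix bonus (the Winkler modification)."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    window = max(max(la, lb) // 2 - 1, 0)
    a_match, b_match = [False] * la, [False] * lb
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not b_match[j] and b[j] == ca:
                a_match[i] = b_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among the matched characters
    transpositions, k = 0, 0
    for i in range(la):
        if a_match[i]:
            while not b_match[k]:
                k += 1
            transpositions += a[i] != b[k]
            k += 1
    jaro = (matches / la + matches / lb +
            (matches - transpositions // 2) / matches) / 3
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams (a simplified stand-in for an n-gram distance)."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

for pair in [("necessary", "neccessary"), ("receive", "recieve"), ("color", "colour")]:
    print(pair, levenshtein(*pair), round(jaro_winkler(*pair), 3), round(ngram_similarity(*pair), 3))
```

Intuitively, Levenshtein counts raw edit operations, Jaro-Winkler rewards shared prefixes (helpful for typos near the end of a word), and the n-gram overlap is tolerant of local character reordering, which is why spellcheckers often expose all three as interchangeable measures.
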
Author(s):  
Shahliza Abd Halim ◽  
Dayang Norhayati Abang Jawawi ◽  
Muhammad Sahak

To create products for a specific market segment, a Software Product Line (SPL) is implemented to fulfill specific customer needs by managing a set of common features and exploiting the variabilities between products. Testing product by product is not feasible in SPL because of the combinatorial explosion of the number of products; Test Case Prioritization (TCP) is therefore needed to select a small number of test cases that can reveal a high number of faults. Among the most promising TCP techniques is similarity-based TCP, which consists of a similarity distance measure and a prioritization algorithm. The goal of this paper is to propose an enhanced string distance and prioritization algorithm that reorders the test cases to achieve a higher rate of fault detection. A comparative study was conducted between different string distance measures and prioritization algorithms to select the best techniques for similarity-based test case prioritization. The identified enhancements were implemented in both techniques for better adoption in prioritizing SPL test cases. An experiment was carried out to assess the effectiveness of the enhancements to the combination of both techniques. The results show the effectiveness of the combination: it achieved the highest average fault detection rate, attained the fastest execution time for the largest number of test cases, and accomplished a 41.25% average rate of fault detection. The results demonstrate that the combination of both techniques improves SPL testing effectiveness compared with other existing techniques.
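
As a rough illustration of the similarity-based TCP idea described above, the sketch below greedily reorders test cases so that each newly selected case is as dissimilar as possible from those already chosen. The binary feature encoding, the Hamming-style distance, and the farthest-first greedy loop are generic stand-ins chosen for illustration, not the enhanced string distance or prioritization algorithm proposed in the paper.

```python
def hamming(a: str, b: str) -> int:
    """Hamming-style distance between two equal-length feature-selection strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def prioritize(test_cases: dict) -> list:
    """Greedy farthest-first ordering: always pick the case most dissimilar
    from everything already selected (maximize the minimum distance)."""
    remaining = dict(test_cases)
    order = [next(iter(remaining))]          # start from an arbitrary case
    del remaining[order[0]]
    while remaining:
        best = max(remaining,
                   key=lambda t: min(hamming(test_cases[t], test_cases[s]) for s in order))
        order.append(best)
        del remaining[best]
    return order

# Each product/test case encoded as a binary string over the SPL's features
# (feature selected = '1', deselected = '0'); values are made up for illustration.
cases = {"p1": "110010", "p2": "110011", "p3": "001101", "p4": "011100"}
print(prioritize(cases))
```
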


Literator ◽  
2008 ◽  
Vol 29 (1) ◽  
pp. 185-204 ◽  
Author(s):  
P.N. Zulu ◽  
G. Botha ◽  
E. Barnard

Two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically. We also apply the Levenshtein distance measure to the orthographic word transcriptions from the eleven South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.
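
A minimal sketch of the second method, Levenshtein distances between orthographic word transcriptions followed by hierarchical clustering, might look as follows; the tiny parallel word list is a toy illustration, not the data used in the study.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Toy parallel word lists (orthographic transcriptions of the same concepts)
words = {
    "isiZulu":  ["umuntu", "amanzi", "ilanga"],
    "isiXhosa": ["umntu",  "amanzi", "ilanga"],
    "Sesotho":  ["motho",  "metsi",  "letsatsi"],
    "Setswana": ["motho",  "metsi",  "letsatsi"],
}

langs = list(words)
n = len(langs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # average normalized edit distance over the aligned word pairs
        d = np.mean([levenshtein(a, b) / max(len(a), len(b))
                     for a, b in zip(words[langs[i]], words[langs[j]])])
        dist[i, j] = dist[j, i] = d

# Hierarchical clustering and dendrogram of the pairwise language distances
Z = linkage(squareform(dist), method="average")
dendrogram(Z, labels=langs)
plt.show()
```
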


2014 ◽  
Vol 4 (3) ◽  
pp. 34-53 ◽  
Author(s):  
Hadj Ahmed Bouarara ◽  
Reda Mohamed Hamou ◽  
Abdelmalek Amine

Researchers have recently estimated that 90% of the information on the web is in unstructured form (free text). Automatic text classification (clustering) has therefore become a crucial challenge in the computer science community, where most classical techniques suffer from problems of execution time, the multiplicity of data (marketing, biology, economics), and the initialization of the number of clusters. Meanwhile, the bio-inspired paradigm has seen genuine success in several sectors, particularly in data mining. This work presents a novel approach called distances combination by social bees (DC-SB) for text clustering, composed of four steps. Pre-processing uses different text representation methods (bag of words and character n-grams) and TF-IDF weighting to construct the vectors. In the artificial bee life step, the authors imitate the functioning of social bees using three artificial worker bees (cleaner, guardian and forager), each characterized by a distance measure different from the others and computed with respect to the artificial queen (centroid) of the cluster (hive). Clustering uses the concept of filtering, where each filter is controlled by an artificial worker and a document must pass three different obstacles to be added to the cluster. For the experiments, the authors use the Reuters-21578 benchmark and a variety of validation tools (execution time, F-measure and entropy) with varying parameters (threshold, combination of distance measures and text representation). They compare their results with the performance of other methods in the literature (2D cellular automata, Artificial Immune System (AIS) and Artificial Social Spiders (ASS)); the conclusions obtained show that the approach can solve the text clustering problem. Finally, a visualization step provides 3D navigation of the results through a global and detailed view of the hive and the apiary, with zooming and rotation functionality.
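
The pre-processing and filtering steps can be pictured with a short sketch. Everything below is an assumption made for illustration: the abstract does not state which three distance measures or thresholds the worker bees use, so cosine, Euclidean and Manhattan distances with arbitrary thresholds stand in for them, and the toy documents are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine, euclidean, cityblock

docs = [
    "grain prices rose sharply on export news",
    "wheat and grain exports increased this quarter",
    "the central bank raised interest rates again",
]

# Pre-processing: character n-gram representation with TF-IDF weighting
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(docs).toarray()

# Hypothetical cluster "hive" containing the first two documents; the queen is its centroid
centroid = X[:2].mean(axis=0)

# Three hypothetical "worker bee" filters, each with its own distance measure and threshold
filters = [
    ("cleaner", cosine, 0.8),
    ("guardian", euclidean, 1.3),
    ("forager", cityblock, 12.0),
]

def admitted(doc_vector):
    """A document joins the cluster only if it passes all three distance filters."""
    return all(dist(doc_vector, centroid) <= threshold for _, dist, threshold in filters)

for doc, vec in zip(docs, X):
    print(f"{admitted(vec)!s:>5}  {doc}")
```
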


2012 ◽  
Vol 57 (3) ◽  
pp. 829-835 ◽  
Author(s):  
Z. Głowacz ◽  
J. Kozik

The paper describes a procedure for the automatic selection of symptoms accompanying a break in the synchronous motor armature winding coils. This procedure, called feature selection, chooses from the full set of features describing the problem a subset that allows the best distinction between healthy and damaged states. The amplitudes of the spectral components of the motor current signals were used as features. The full spectra of the current signals are considered as multidimensional feature spaces, and their subspaces are tested. Particular subspaces are chosen with the aid of a genetic algorithm, and their quality is assessed using the Mahalanobis distance measure. The algorithm searches for the subspaces for which this distance is greatest. The algorithm is very efficient and, as confirmed by the research, leads to good results. The proposed technique has been successfully applied in many other fields of science and technology, including medical diagnostics.
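
The fitness evaluation at the core of such a procedure can be sketched as follows: given a candidate feature subspace (for example one proposed by the genetic algorithm), compute the Mahalanobis distance between the healthy-state and damaged-state class means using a pooled covariance estimate. The pooling scheme and the pseudo-inverse safeguard are generic choices for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def mahalanobis_fitness(X_healthy, X_damaged, subset):
    """Mahalanobis distance between class means in the selected feature subspace.

    X_healthy, X_damaged: (n_samples, n_features) arrays of spectral amplitudes.
    subset: indices of the candidate feature subspace (e.g. proposed by a GA).
    """
    A = X_healthy[:, subset]
    B = X_damaged[:, subset]
    diff = A.mean(axis=0) - B.mean(axis=0)
    # pooled within-class covariance of the subspace
    cov = (np.cov(A, rowvar=False) + np.cov(B, rowvar=False)) / 2.0
    cov = np.atleast_2d(cov)
    inv_cov = np.linalg.pinv(cov)   # pseudo-inverse guards against singular covariance
    return float(np.sqrt(diff @ inv_cov @ diff))
```

A genetic algorithm would then evolve binary feature masks, ranking candidate subspaces by this distance and keeping those for which it is greatest.
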


Author(s):  
Vitaly Kuznetsov ◽  
Hank Liao ◽  
Mehryar Mohri ◽  
Michael Riley ◽  
Brian Roark

2020 ◽  
Author(s):  
Grant P. Strimel ◽  
Ariya Rastrow ◽  
Gautam Tiwari ◽  
Adrien Piérard ◽  
Jon Webb

2019 ◽  
Vol 5 (6) ◽  
pp. 57 ◽  
Author(s):  
Gang Wang ◽  
Bernard De Baets

Superpixel segmentation can benefit from the use of an appropriate method to measure edge strength. In this paper, we present such a method based on the first derivative of anisotropic Gaussian kernels. The kernels can capture the position, direction, prominence, and scale of the edge to be detected. We incorporate the anisotropic edge strength into the distance measure between neighboring superpixels, thereby improving the performance of an existing graph-based superpixel segmentation method. Experimental results validate the superiority of our method over competing methods in generating superpixels. We also illustrate that the proposed superpixel segmentation method facilitates subsequent saliency detection.
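
A generic version of an oriented, anisotropic Gaussian derivative filter can be sketched as follows; the kernel normalization, the orientation sampling, and the maximum-over-orientations response are illustrative choices and not necessarily the construction used in the paper.

```python
import numpy as np
from scipy.ndimage import convolve

def anisotropic_gaussian_derivative_kernel(sigma_u, sigma_v, theta, size=None):
    """First derivative (along the rotated u-axis) of an anisotropic Gaussian."""
    if size is None:
        size = int(2 * np.ceil(3 * max(sigma_u, sigma_v)) + 1)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates by theta
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(u**2 / (2 * sigma_u**2) + v**2 / (2 * sigma_v**2)))
    # derivative of the anisotropic Gaussian along u
    kernel = -(u / sigma_u**2) * g
    return kernel / np.abs(kernel).sum()

def edge_strength(image, sigma_u=1.0, sigma_v=3.0, n_orientations=8):
    """Maximum absolute filter response over a set of orientations."""
    responses = []
    for theta in np.linspace(0, np.pi, n_orientations, endpoint=False):
        k = anisotropic_gaussian_derivative_kernel(sigma_u, sigma_v, theta)
        responses.append(np.abs(convolve(image.astype(float), k, mode="nearest")))
    return np.max(responses, axis=0)
```
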

