Comparison of Apache Solr Search Spellcheck String Distance Measures – Levenshtein, Jaro-Winkler, and N-Gram

Author(s):  
Parameswara Rao Kandregula
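
The three measures named in the title can be contrasted with plain reference implementations. The sketch below is illustrative only: Solr's spellchecker delegates to Lucene string-distance classes, and the exact formulas used there (in particular for the n-gram distance) may differ from the simplified bigram Dice coefficient shown here; the sample word pairs are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def jaro_winkler(a: str, b: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by a shared-prefix bonus (the Winkler modification)."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    window = max(max(la, lb) // 2 - 1, 0)
    a_match, b_match = [False] * la, [False] * lb
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not b_match[j] and b[j] == ca:
                a_match[i] = b_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among the matched characters
    transpositions, k = 0, 0
    for i in range(la):
        if a_match[i]:
            while not b_match[k]:
                k += 1
            transpositions += a[i] != b[k]
            k += 1
    jaro = (matches / la + matches / lb +
            (matches - transpositions // 2) / matches) / 3
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams (a simplified stand-in for an n-gram distance)."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

for pair in [("necessary", "neccessary"), ("receive", "recieve"), ("color", "colour")]:
    print(pair, levenshtein(*pair), round(jaro_winkler(*pair), 3), round(ngram_similarity(*pair), 3))
```

Intuitively, Levenshtein counts raw edit operations, Jaro-Winkler rewards shared prefixes (helpful for typos near the end of a word), and the n-gram overlap is tolerant of local character reordering, which is why spellcheckers often expose all three as interchangeable measures.
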
Author(s):  
Shahliza Abd Halim ◽  
Dayang Norhayati Abang Jawawi ◽  
Muhammad Sahak

To create products for a specific market segment, a Software Product Line (SPL) is implemented to fulfill specific customer needs by managing a set of common features and exploiting the variabilities between products. Testing product by product is not feasible in SPL because of the combinatorial explosion of the number of products; Test Case Prioritization (TCP) is therefore needed to select a small number of test cases that can reveal a high number of faults. Among the most promising TCP techniques is similarity-based TCP, which consists of a similarity distance measure and a prioritization algorithm. The goal of this paper is to propose an enhanced string distance and prioritization algorithm that reorders the test cases to achieve a higher rate of fault detection. A comparative study was conducted between different string distance measures and prioritization algorithms to select the best techniques for similarity-based test case prioritization. The identified enhancements were implemented in both techniques for better adoption in prioritizing SPL test cases. An experiment was carried out to assess the effectiveness of the enhancements to the combination of both techniques. The results show the effectiveness of the combination: it achieved the highest average fault detection rate, attained the fastest execution time for the largest number of test cases, and accomplished a 41.25% average rate of fault detection. The results demonstrate that the combination of both techniques improves SPL testing effectiveness compared with other existing techniques.
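
As a rough illustration of the similarity-based TCP idea described above, the sketch below greedily reorders test cases so that each newly selected case is as dissimilar as possible from those already chosen. The binary feature encoding, the Hamming-style distance, and the farthest-first greedy loop are generic stand-ins chosen for illustration, not the enhanced string distance or prioritization algorithm proposed in the paper.

```python
def hamming(a: str, b: str) -> int:
    """Hamming-style distance between two equal-length feature-selection strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def prioritize(test_cases: dict) -> list:
    """Greedy farthest-first ordering: always pick the case most dissimilar
    from everything already selected (maximize the minimum distance)."""
    remaining = dict(test_cases)
    order = [next(iter(remaining))]          # start from an arbitrary case
    del remaining[order[0]]
    while remaining:
        best = max(remaining,
                   key=lambda t: min(hamming(test_cases[t], test_cases[s]) for s in order))
        order.append(best)
        del remaining[best]
    return order

# Each product/test case encoded as a binary string over the SPL's features
# (feature selected = '1', deselected = '0'); values are made up for illustration.
cases = {"p1": "110010", "p2": "110011", "p3": "001101", "p4": "011100"}
print(prioritize(cases))
```
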


Literator ◽  
2008 ◽  
Vol 29 (1) ◽  
pp. 185-204 ◽  
Author(s):  
P.N. Zulu ◽  
G. Botha ◽  
E. Barnard

Two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically. We also apply the Levenshtein distance measure to the orthographic word transcriptions from the eleven South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.
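
A minimal sketch of the second method, Levenshtein distances between orthographic word transcriptions followed by hierarchical clustering, might look as follows; the tiny parallel word list is a toy illustration, not the data used in the study.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Toy parallel word lists (orthographic transcriptions of the same concepts)
words = {
    "isiZulu":  ["umuntu", "amanzi", "ilanga"],
    "isiXhosa": ["umntu",  "amanzi", "ilanga"],
    "Sesotho":  ["motho",  "metsi",  "letsatsi"],
    "Setswana": ["motho",  "metsi",  "letsatsi"],
}

langs = list(words)
n = len(langs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # average normalized edit distance over the aligned word pairs
        d = np.mean([levenshtein(a, b) / max(len(a), len(b))
                     for a, b in zip(words[langs[i]], words[langs[j]])])
        dist[i, j] = dist[j, i] = d

# Hierarchical clustering and dendrogram of the pairwise language distances
Z = linkage(squareform(dist), method="average")
dendrogram(Z, labels=langs)
plt.show()
```
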


2014 ◽  
Vol 4 (3) ◽  
pp. 34-53 ◽  
Author(s):  
Hadj Ahmed Bouarara ◽  
Reda Mohamed Hamou ◽  
Abdelmalek Amine

Researchers have recently estimated that 90% of the information on the web is in unstructured form (free text). Automatic text classification (clustering) has therefore become a crucial challenge in the computer science community, where most classical techniques suffer from problems of execution time, the multiplicity of data (marketing, biology, economics), and the initialization of the number of clusters. Meanwhile, the bio-inspired paradigm has seen genuine success in several sectors, particularly in data mining. This work presents a novel approach called distances combination by social bees (DC-SB) for text clustering, composed of four steps. Pre-processing uses different text representation methods (bag of words and character n-grams) and TF-IDF weighting to construct the vectors. In the artificial bee life step, the authors imitate the functioning of social bees using three artificial worker bees (cleaner, guardian and forager), each characterized by a distance measure different from the others and computed with respect to the artificial queen (centroid) of the cluster (hive). Clustering uses the concept of filtering, where each filter is controlled by an artificial worker and a document must pass three different obstacles to be added to the cluster. For the experiments, the authors use the Reuters-21578 benchmark and a variety of validation tools (execution time, F-measure and entropy) with varying parameters (threshold, combination of distance measures and text representation). They compare their results with the performance of other methods in the literature (2D cellular automata, Artificial Immune System (AIS) and Artificial Social Spiders (ASS)); the conclusions obtained show that the approach can solve the text clustering problem. Finally, a visualization step provides 3D navigation of the results through a global and detailed view of the hive and the apiary, with zooming and rotation functionality.
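
The pre-processing and filtering steps can be pictured with a short sketch. Everything below is an assumption made for illustration: the abstract does not state which three distance measures or thresholds the worker bees use, so cosine, Euclidean and Manhattan distances with arbitrary thresholds stand in for them, and the toy documents are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine, euclidean, cityblock

docs = [
    "grain prices rose sharply on export news",
    "wheat and grain exports increased this quarter",
    "the central bank raised interest rates again",
]

# Pre-processing: character n-gram representation with TF-IDF weighting
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(docs).toarray()

# Hypothetical cluster "hive" containing the first two documents; the queen is its centroid
centroid = X[:2].mean(axis=0)

# Three hypothetical "worker bee" filters, each with its own distance measure and threshold
filters = [
    ("cleaner", cosine, 0.8),
    ("guardian", euclidean, 1.3),
    ("forager", cityblock, 12.0),
]

def admitted(doc_vector):
    """A document joins the cluster only if it passes all three distance filters."""
    return all(dist(doc_vector, centroid) <= threshold for _, dist, threshold in filters)

for doc, vec in zip(docs, X):
    print(f"{admitted(vec)!s:>5}  {doc}")
```
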


2012 ◽  
Vol 57 (3) ◽  
pp. 829-835 ◽  
Author(s):  
Z. Głowacz ◽  
J. Kozik

The paper describes a procedure for the automatic selection of symptoms accompanying a break in the synchronous motor armature winding coils. This procedure, called feature selection, chooses from the full set of features describing the problem a subset that allows the best distinction between healthy and damaged states. The amplitudes of the spectral components of the motor current signals were used as features. The full spectra of the current signals are considered as multidimensional feature spaces, and their subspaces are tested. Particular subspaces are chosen with the aid of a genetic algorithm, and their quality is assessed using the Mahalanobis distance measure. The algorithm searches for the subspaces for which this distance is greatest. The algorithm is very efficient and, as confirmed by the research, leads to good results. The proposed technique has been successfully applied in many other fields of science and technology, including medical diagnostics.
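
The fitness evaluation at the core of such a procedure can be sketched as follows: given a candidate feature subspace (for example one proposed by the genetic algorithm), compute the Mahalanobis distance between the healthy-state and damaged-state class means using a pooled covariance estimate. The pooling scheme and the pseudo-inverse safeguard are generic choices for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def mahalanobis_fitness(X_healthy, X_damaged, subset):
    """Mahalanobis distance between class means in the selected feature subspace.

    X_healthy, X_damaged: (n_samples, n_features) arrays of spectral amplitudes.
    subset: indices of the candidate feature subspace (e.g. proposed by a GA).
    """
    A = X_healthy[:, subset]
    B = X_damaged[:, subset]
    diff = A.mean(axis=0) - B.mean(axis=0)
    # pooled within-class covariance of the subspace
    cov = (np.cov(A, rowvar=False) + np.cov(B, rowvar=False)) / 2.0
    cov = np.atleast_2d(cov)
    inv_cov = np.linalg.pinv(cov)   # pseudo-inverse guards against singular covariance
    return float(np.sqrt(diff @ inv_cov @ diff))
```

A genetic algorithm would then evolve binary feature masks, ranking candidate subspaces by this distance and keeping those for which it is greatest.
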


Author(s):  
Vitaly Kuznetsov ◽  
Hank Liao ◽  
Mehryar Mohri ◽  
Michael Riley ◽  
Brian Roark

2020 ◽  
Author(s):  
Grant P. Strimel ◽  
Ariya Rastrow ◽  
Gautam Tiwari ◽  
Adrien Piérard ◽  
Jon Webb

2019 ◽  
Vol 5 (6) ◽  
pp. 57 ◽  
Author(s):  
Gang Wang ◽  
Bernard De Baets

Superpixel segmentation can benefit from the use of an appropriate method to measure edge strength. In this paper, we present such a method based on the first derivative of anisotropic Gaussian kernels. The kernels can capture the position, direction, prominence, and scale of the edge to be detected. We incorporate the anisotropic edge strength into the distance measure between neighboring superpixels, thereby improving the performance of an existing graph-based superpixel segmentation method. Experimental results validate the superiority of our method over competing methods in generating superpixels. We also illustrate that the proposed superpixel segmentation method facilitates subsequent saliency detection.
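
A generic version of an oriented, anisotropic Gaussian derivative filter can be sketched as follows; the kernel normalization, the orientation sampling, and the maximum-over-orientations response are illustrative choices and not necessarily the construction used in the paper.

```python
import numpy as np
from scipy.ndimage import convolve

def anisotropic_gaussian_derivative_kernel(sigma_u, sigma_v, theta, size=None):
    """First derivative (along the rotated u-axis) of an anisotropic Gaussian."""
    if size is None:
        size = int(2 * np.ceil(3 * max(sigma_u, sigma_v)) + 1)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates by theta
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(u**2 / (2 * sigma_u**2) + v**2 / (2 * sigma_v**2)))
    # derivative of the anisotropic Gaussian along u
    kernel = -(u / sigma_u**2) * g
    return kernel / np.abs(kernel).sum()

def edge_strength(image, sigma_u=1.0, sigma_v=3.0, n_orientations=8):
    """Maximum absolute filter response over a set of orientations."""
    responses = []
    for theta in np.linspace(0, np.pi, n_orientations, endpoint=False):
        k = anisotropic_gaussian_derivative_kernel(sigma_u, sigma_v, theta)
        responses.append(np.abs(convolve(image.astype(float), k, mode="nearest")))
    return np.max(responses, axis=0)
```
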

