longest common subsequence
Recently Published Documents


TOTAL DOCUMENTS: 326 (FIVE YEARS: 65)

H-INDEX: 26 (FIVE YEARS: 2)

Author(s): Iqra Muneer, Rao Muhammad Adeel Nawab

Cross-Lingual Text Reuse Detection (CLTRD) has recently attracted the attention of the research community due to the large amount of digital text readily available for reuse in multiple languages through online digital repositories. In addition, efficient machine translation systems are freely available to translate text from one language into another, which makes it easy to reuse text across languages and, consequently, difficult to detect. In the literature, the most prominent and widely used approach for CLTRD is Translation plus Monolingual Analysis (T+MA). For the English-Urdu language pair, T+MA has so far been used only with lexical approaches, namely N-gram Overlap, Longest Common Subsequence, and Greedy String Tiling, which shows that T+MA has not been thoroughly explored for this pair. To fill this gap, this study presents an in-depth and detailed comparison of 26 T+MA-based approaches for the English-Urdu language pair, including semantic similarity approaches (semantic tagger-based and WordNet-based approaches), a probabilistic approach (the Kullback-Leibler distance), monolingual word embedding-based approaches (Siamese recurrent architectures), and monolingual sentence transformer-based approaches. The evaluation was carried out on the CLEU benchmark corpus for both the binary and the ternary classification tasks. Our extensive experimentation shows that our proposed approach, a combination of the 26 approaches, obtained F1 scores of 0.77 and 0.61 for the binary and ternary classification tasks, respectively, outperforming the previously reported approaches [41] (F1 = 0.73 for the binary and F1 = 0.55 for the ternary task) on the CLEU corpus.
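As a point of reference for the lexical T+MA baselines mentioned above, the following is a minimal sketch of a normalized word-level LCS similarity, applied after the Urdu side has (hypothetically) been machine-translated into English. The function names, tokenization, and the 0.5 decision threshold are illustrative assumptions, not the study's implementation.

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming LCS length over word tokens."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def lcs_similarity(source_text, translated_text):
    """Normalized word-level LCS score in [0, 1] between the source and the
    machine-translated candidate (the 'monolingual analysis' step of T+MA)."""
    a, b = source_text.lower().split(), translated_text.lower().split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


# Illustrative use for the binary task (reused vs. independent); the 0.5 cut-off
# is an assumed threshold, not a value reported in the paper.
score = lcs_similarity("the cat sat on the mat", "a cat sat quietly on the mat")
label = "reused" if score >= 0.5 else "independent"
```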


2021, Vol 19 (4), pp. e49
Author(s): Anas Oujja, Mohamed Riduan Abid, Jaouad Boumhidi, Safae Bourhnane, Asmaa Mourhir, ...

Nowadays, genomic data constitutes one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth-largest source of Big Data, thus mandating adequate high-performance computing (HPC) platforms for its processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need of ICT tools to process SARS-CoV-2 RNA data, e.g., by clustering it, thus assisting in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we show how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences, which are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
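For orientation, here is a minimal sketch of the per-pair computation such a MapReduce job would perform: a linear-space LCS length and a length-normalized dissimilarity. This is a hypothetical illustration over plain character sequences; it is not the authors' Hadoop implementation, and the exact dissimilarity formula used in the paper may differ.

```python
def lcs_length_linear_space(a: str, b: str) -> int:
    """LCS length in O(len(a) * len(b)) time but only O(min(len(a), len(b))) memory,
    which matters for genome-scale sequences."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string for the rolling row
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_dissimilarity(seq1: str, seq2: str) -> float:
    """One plausible LCS-based dissimilarity in [0, 1] (assumed normalization)."""
    return 1.0 - lcs_length_linear_space(seq1, seq2) / max(len(seq1), len(seq2))


# Each pair of sequence IDs emitted by a mapper would be scored like this in a
# reducer before the clustering step.
print(lcs_dissimilarity("ACGTACGT", "ACGTTGCA"))
```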


Complexity, 2021, Vol 2021, pp. 1-16
Author(s): Cui-Bin Ji, Gui-Jiang Duan, Jun-Yan Zhou, Wei-Jie Xuan

With the advancement of digital manufacturing technology, data-driven quality management is receiving more and more attention and is developing rapidly under the impetus of both technology and management. Quality data are growing exponentially with the help of increasingly interconnected devices and Internet of Things (IoT) technologies. To address the problems of insufficient quality data acquisition and poor data quality for complex equipment, this paper investigates quality data integration and cleaning based on digital total quality management. A data integration architecture for complex equipment quality based on multiterminal collaboration is constructed; it combines a variety of integration methods and standards, such as XML, OPC-UA, and the QIF protocol. Then, to unify the data view, a cleaning method for complex equipment quality data based on a combination of edit distance and longest common subsequence similarity is proposed, and its effectiveness is verified. This work provides a basis for the design of a digital total quality management system for complex equipment.
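A minimal sketch of the kind of combined score such a cleaning step could use is given below, mixing a normalized edit-distance similarity with a normalized LCS similarity. The equal weighting (alpha = 0.5) and the example records are assumptions for illustration, not the paper's exact formulation.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """LCS length with a rolling row (O(len(b)) memory)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def combined_similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """Weighted mix of edit-distance similarity and LCS similarity; alpha is assumed."""
    if not a and not b:
        return 1.0
    edit_sim = 1.0 - levenshtein(a, b) / max(len(a), len(b))
    lcs_sim = lcs_length(a, b) / max(len(a), len(b))
    return alpha * edit_sim + (1.0 - alpha) * lcs_sim


# Two quality records that refer to the same part but were keyed differently:
print(combined_similarity("Part-NO 10482-A", "PartNo 10482A"))
```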


2021, Vol 11 (21), pp. 9787
Author(s): Martin J.-D. Otis, Julien Vandewynckel

Discretization and feature selection are two relevant techniques for dimensionality reduction. The first transforms a set of continuous attributes into discrete ones, and the second removes irrelevant and redundant features; together, these two methods often lead to more specific and concise data. In this paper, we propose to deal simultaneously with optimal feature subset selection, discretization, and classifier parameter tuning. As an illustration, the proposed problem formulation is addressed using a constrained many-objective optimization algorithm based on dominance and decomposition (C-MOEA/DD) and a limited-memory implementation of the warping longest common subsequence algorithm (WarpingLCSS). In addition, the discretization sub-problem is addressed using a variable-length representation, along with a variable-length crossover, to overcome the need to specify in advance the number of elements defining the discretization scheme. We conduct experiments on a real-world benchmark dataset, compare two discretization criteria as the discretization objective, namely Ameva and ur-CAIM, and analyze recognition performance and reduction capabilities. Our results show that our approach outperforms previously reported results by up to 11% and achieves an average feature reduction rate of 80%.
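The variable-length representation of the discretization sub-problem can be pictured as a list of cut points whose length is itself evolved. Below is a small sketch of one plausible variable-length crossover over such lists; it is an illustrative assumption, not the operator defined in the paper.

```python
import random


def variable_length_crossover(parent_a, parent_b, rng=random):
    """One plausible variable-length one-point crossover on sorted lists of cut points:
    children may have a different number of cut points than either parent, so the size
    of the discretization scheme is searched rather than fixed in advance."""
    i = rng.randint(0, len(parent_a))
    j = rng.randint(0, len(parent_b))
    child_1 = sorted(set(parent_a[:i] + parent_b[j:]))
    child_2 = sorted(set(parent_b[:j] + parent_a[i:]))
    return child_1, child_2


# Two candidate discretization schemes (cut points over one sensor channel).
a = [-1.0, -0.2, 0.5, 1.3]
b = [-0.7, 0.1, 0.9]
print(variable_length_crossover(a, b))
```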


2021, Vol 9 (9), pp. 1037
Author(s): Jinwan Park, Jungsik Jeong, Youngsoo Park

According to maritime accident statistics, most collision accidents have been caused by human factors. In an encounter situation, predicting a ship's trajectory is a good way to infer the intention of the other ship. This paper proposes a methodology for predicting a ship's trajectory that can be used by an intelligent collision avoidance algorithm at sea. To improve prediction performance, density-based spatial clustering of applications with noise (DBSCAN) was first used to recognize patterns in the ship trajectories. Since DBSCAN is a clustering algorithm based on the density of data points, it has limitations in clustering trajectories with nonlinear curves. Thus, we applied a spectral clustering method that can reflect the similarity between individual trajectories, with the similarity measured by the longest common subsequence (LCSS) distance. Based on the clustering results, a prediction model of the ship trajectory was developed using a bidirectional long short-term memory (Bi-LSTM) network, and its performance was compared with that of long short-term memory (LSTM) and gated recurrent unit (GRU) models. The input data were obtained by preprocessing techniques such as filtering, grouping, and interpolation of automatic identification system (AIS) data. As a result of the experiments, the prediction accuracy of Bi-LSTM was found to be the highest, compared to that of LSTM and GRU.
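For reference, a minimal sketch of an LCSS distance between two position sequences is shown below: two samples match when their coordinates differ by less than a tolerance eps and their indices by at most delta. The tolerance values and the toy AIS tracks are illustrative assumptions, not the parameters used in the paper.

```python
def lcss_length(traj_a, traj_b, eps=0.01, delta=10):
    """LCSS between two trajectories (lists of (lat, lon)): two points 'match' when both
    coordinates differ by less than eps and their indices differ by at most delta."""
    m, n = len(traj_a), len(traj_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            (xa, ya), (xb, yb) = traj_a[i - 1], traj_b[j - 1]
            if abs(xa - xb) < eps and abs(ya - yb) < eps and abs(i - j) <= delta:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def lcss_distance(traj_a, traj_b, **kw):
    """Distance in [0, 1]: 0 for identical tracks, 1 for tracks with no matching points."""
    return 1.0 - lcss_length(traj_a, traj_b, **kw) / min(len(traj_a), len(traj_b))


# Two short (lat, lon) tracks after filtering and interpolation (toy values).
t1 = [(35.10, 129.04), (35.11, 129.05), (35.12, 129.06)]
t2 = [(35.10, 129.04), (35.11, 129.06), (35.13, 129.07)]
print(lcss_distance(t1, t2, eps=0.02, delta=2))
```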


Author(s): Anshita Garg

This is a research-based project, and the basic point motivating it is learning and implementing algorithms that reduce time and space complexity. In the first part of the project, we reduce the time taken to search for a given record by using a B/B+ tree rather than indexing and traditional sequential access. Disk-access times are much slower than main-memory access times: typical seek times and rotational delays are on the order of 5 to 6 milliseconds, and typical data transfer rates are in the range of 5 to 10 million bytes per second, so main-memory access is likely to be at least 4 or 5 orders of magnitude faster than disk access on any given system. Therefore, the objective is to minimize the number of disk accesses, and this project is concerned with techniques for achieving that objective, i.e., techniques for arranging the data on disk so that any required piece of data, say some specific record, can be located in as few I/Os as possible. In the second part of the project, dynamic programming problems were solved with recursion, recursion with storage (memoization), iteration with storage, and iteration with smaller storage. The problems solved in these four variations are Fibonacci, Count Maze Path, Count Board Path, and Longest Common Subsequence. Each variation improves on the previous one, so time and space complexity are reduced significantly as we go from recursion to iteration with smaller storage.
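To make the contrast concrete, the following sketch shows the two extreme variations for the LCS sub-problem: plain recursion (exponential time) and iteration with smaller storage (two rows instead of the full table). It is a generic illustration of those variations, not the project's own code.

```python
def lcs_recursive(a, b, i=0, j=0):
    """Plain recursion: exponential time, kept only to contrast with the iterative version."""
    if i == len(a) or j == len(b):
        return 0
    if a[i] == b[j]:
        return 1 + lcs_recursive(a, b, i + 1, j + 1)
    return max(lcs_recursive(a, b, i + 1, j), lcs_recursive(a, b, i, j + 1))


def lcs_small_storage(a, b):
    """Iteration with smaller storage: the full table needs O(len(a) * len(b)) cells,
    but each row depends only on the previous one, so two rows suffice."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


assert lcs_recursive("ABCBDAB", "BDCABA") == lcs_small_storage("ABCBDAB", "BDCABA") == 4
```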


Mathematics, 2021, Vol 9 (13), pp. 1515
Author(s): Bojan Nikolic, Aleksandar Kartelj, Marko Djukanovic, Milana Grbic, Christian Blum, ...

The longest common subsequence (LCS) problem is a prominent NP-hard optimization problem in which, given an arbitrary set of input strings, the aim is to find a longest subsequence common to all input strings. The problem has a variety of applications in bioinformatics, molecular biology, and file plagiarism checking, among others. All previous approaches from the literature are dedicated to solving LCS instances sampled from uniform or near-to-uniform probability distributions of letters in the input strings. In this paper, we introduce an approach that can effectively deal with more general cases, where the occurrence of letters in the input strings follows a non-uniform distribution such as a multinomial distribution. The proposed approach makes use of a time-restricted beam search guided by a novel heuristic named Gmpsum, which combines two complementary scoring functions in the form of a convex combination. Furthermore, apart from the close-to-uniform benchmark sets from the related literature, we introduce three new benchmark sets that differ in their statistical properties; one of these sets concerns a case study in the context of text analysis. We provide a comprehensive empirical evaluation in two distinct settings: (1) short executions with a fixed beam size, in order to evaluate the guidance abilities of the compared search heuristics; and (2) long executions with fixed target duration times, in order to obtain high-quality solutions. In both settings, the newly proposed approach performs comparably to state-of-the-art techniques on close-to-uniform instances and outperforms state-of-the-art approaches on non-uniform instances.
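As a rough picture of how beam search applies to the multi-string LCS problem, the sketch below keeps a beam of partial common subsequences, each represented by a vector of positions, and ranks them with a simple bound on the remaining length. The ranking function is a stand-in, not the paper's Gmpsum heuristic, and the routine as a whole is an illustrative assumption rather than the proposed algorithm.

```python
def beam_search_lcs(strings, beam_width=10):
    """Greedy beam search for a common subsequence of all input strings. Returns a
    feasible (not necessarily optimal) common subsequence."""
    alphabet = set(strings[0]).intersection(*strings[1:])
    # A state is (positions, subsequence built so far); positions[i] is the index in
    # strings[i] from which the remaining characters may still be chosen.
    beam = [(tuple(0 for _ in strings), "")]
    best = ""
    while beam:
        candidates = []
        for positions, seq in beam:
            for ch in alphabet:
                nxt = []
                for s, p in zip(strings, positions):
                    k = s.find(ch, p)
                    if k < 0:
                        break
                    nxt.append(k + 1)
                else:  # ch occurs in every remaining suffix: a feasible extension
                    candidates.append((tuple(nxt), seq + ch))
        if not candidates:
            break
        # Rank by length so far plus a crude upper bound on what can still be added
        # (minimum remaining suffix length); this replaces the Gmpsum guidance.
        candidates.sort(
            key=lambda st: len(st[1]) + min(len(s) - p for s, p in zip(strings, st[0])),
            reverse=True,
        )
        beam = candidates[:beam_width]
        best = max([best] + [seq for _, seq in beam], key=len)
    return best


print(beam_search_lcs(["ACBDEA", "ABCDA", "ACDBEA"], beam_width=5))
```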


2021, Vol 11 (11), pp. 5302
Author(s): Xiaodong Wang, Yining Zhao, Haili Xiao, Xiaoning Wang, Xuebin Chi

Logs record valuable data from different software and systems. Execution logs are widely available and are helpful for monitoring, examination, and understanding of complex applications. However, log files usually contain too many lines for a human to deal with, so it is important to develop methods to process logs automatically. Logs are usually unstructured, which is not conducive to automatic analysis; how to categorize logs and turn them into structured data automatically is therefore of great practical significance. In this paper, the LTmatch algorithm is proposed, which implements a log pattern extraction algorithm based on a weighted word matching rate. Compared with our previous work, this algorithm not only classifies logs according to the longest common subsequence (LCS) but also obtains and updates the log templates in real time. Besides, the pattern warehouse of the algorithm uses a fixed-depth tree to store the log patterns, which optimizes the matching efficiency of log pattern extraction. To verify the advantages of the algorithm, we applied it to open-source datasets with different kinds of labeled log data, using a variety of state-of-the-art log pattern extraction algorithms for comparison. The results show that our method improves average accuracy by 2.67% compared with the best of the other methods.
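A minimal sketch of the LCS-based matching step is shown below: a raw log line is assigned to the template with the highest normalized word-level LCS, and mismatching positions are merged into a wildcard. The threshold, the '<*>' marker, and the merging rule are illustrative assumptions; this is not the LTmatch implementation, and the fixed-depth pattern tree is not reproduced here.

```python
def lcs_tokens(a, b):
    """Word-level LCS length between two token lists."""
    prev = [0] * (len(b) + 1)
    for ta in a:
        curr = [0]
        for j, tb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ta == tb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def match_or_add(templates, line, threshold=0.6):
    """Assign a raw log line to the best-matching template (by normalized word LCS)
    or register it as a new template; '<*>' marks variable fields."""
    tokens = line.split()
    best, best_score = None, 0.0
    for idx, tpl in enumerate(templates):
        score = lcs_tokens(tokens, tpl) / max(len(tokens), len(tpl))
        if score > best_score:
            best, best_score = idx, score
    if best is not None and best_score >= threshold:
        tpl = templates[best]
        if len(tpl) == len(tokens):  # merge: keep common words, wildcard the rest
            templates[best] = [a if a == b else "<*>" for a, b in zip(tpl, tokens)]
        return best
    templates.append(tokens)
    return len(templates) - 1


templates = []
for raw in ["Connection from 10.0.0.5 closed", "Connection from 10.0.0.9 closed"]:
    match_or_add(templates, raw)
print(templates)  # [['Connection', 'from', '<*>', 'closed']]
```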


2021, pp. 2150007
Author(s): Pavan Kumar Perepu

Given a mathematical expression in LaTeX or MathML format, a retrieval algorithm extracts similar expressions from a database. In our previous work, we used the Longest Common Subsequence (LCS) algorithm to match two expressions of lengths m and n, which takes O(mn) time. If there are N database expressions, the total complexity is O(Nmn), and an increase in N also increases this complexity. In the present work, we propose to use a parallel LCS algorithm in our retrieval process. Parallel LCS has O(m + n) time complexity with min(m, n) processors, so the total complexity can be reduced to O(N(m + n)). For our experimentation, an OpenMP-based implementation has been used on a 4-core Intel processor. However, for smaller expressions, the parallel version takes more time, as the implementation overhead dominates the algorithmic improvement. We have therefore proposed to use the parallel version selectively, only on larger expressions, in our retrieval algorithm to achieve better performance. We have compared the sequential and parallel versions of our mathematical expression (ME) retrieval algorithm, and the performance results are reported on a database of 829 mathematical expressions.
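The parallelism exploited here comes from the anti-diagonal structure of the LCS table: all cells with the same i + j depend only on earlier anti-diagonals, so each anti-diagonal can be computed concurrently. The sketch below shows that sweep sequentially in Python for clarity (the paper uses an OpenMP-based implementation), together with an assumed length cut-off for switching between the plain and the parallelizable version, as the abstract describes.

```python
def lcs_rowwise(a, b):
    """Plain row-by-row DP; cheapest for short expressions."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_wavefront(a, b):
    """Same recurrence, swept anti-diagonal by anti-diagonal: cells with equal i + j
    depend only on earlier diagonals, so a parallel version can split each diagonal
    across threads. The inner loop runs sequentially here for clarity."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):                      # d = i + j
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def lcs_selective(a, b, cutoff=256):
    """Use the sequential version for short expressions, where parallel overhead
    dominates, and the parallelizable sweep only for long ones (cut-off is assumed)."""
    return lcs_rowwise(a, b) if max(len(a), len(b)) < cutoff else lcs_wavefront(a, b)
```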

