Reconstruction of a Symbolic Periodic Sequence from a Sequence with Noise

2021 ◽  
Vol 27 (10) ◽  
pp. 531-541
Author(s):  
G. N. Zhukova ◽  
M. V. Ulyanov

The problem of reconstructing a periodic sequence containing at least eight periods is considered, given a sequence obtained from an unknown periodic sequence (also containing at least eight periods) by noise in the form of symbol deletions, substitutions, and insertions. To construct a periodic sequence approximating the noisy one, the length of the repeating fragment (the period) is first estimated. The distorted sequence is then divided into successive sections of equal length, where the length takes integer values from 80% to 120% of the period estimate. Each section is compared with every other section, and the section with the minimum edit distance (Levenshtein distance) to any of the remaining sections is selected; the minimization is carried out first over all sections of a fixed length and then over all lengths from 80% to 120% of the period estimate. For a correct comparison of fragments of different lengths, the ratio of the edit distance to the fragment length is used. The length that minimizes this ratio is taken as the period of the approximating sequence, and the corresponding fragment, repeated the required number of times, forms the approximating sequence; the constructed sequence may end with an incomplete copy of the fragment. The quality of the approximation is estimated by the ratio of the edit distance between the original distorted sequence and the constructed periodic sequence of the same length to that length.

2013 ◽  
Vol 378 ◽  
pp. 546-551 ◽  
Author(s):  
Joanna Strug ◽  
Barbara Strug

Mutation testing is an effective technique for assessing the quality of the tests provided for a system. However, it suffers from the high computational cost of executing mutants of the system. In this paper, a method of classifying such mutants is proposed, based on an edit distance kernel and a k-NN classifier. Using the results of this classification, it is possible to predict whether a mutant would be detected by the tests or not. The approach can thus help to lower the number of mutants that have to be executed, and so reduce the cost of mutation testing.
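
The prediction step can be sketched as follows. Note that the paper uses an edit distance *kernel*; this illustration simplifies by applying the k-NN majority vote directly to Levenshtein distances between mutant source strings:

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def knn_predict(mutant, labeled, k=3):
    """Predict 'killed'/'survived' for a new mutant by majority vote among
    the k labeled mutants nearest under edit distance.

    labeled: list of (mutant_source, label) pairs with known test outcomes.
    """
    neighbors = sorted(labeled, key=lambda pair: levenshtein(mutant, pair[0]))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)
```

A mutant predicted as "killed" need not be executed against the full test suite, which is where the cost saving comes from.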


Author(s):  
Luis Carlos Guillen ◽  
Juliana Domenico ◽  
Kenneth Camargo ◽  
Rejane Pinheiro ◽  
Claudia Coeli

ABSTRACT
Objectives: To assess the match quality of a linkage strategy based on the combined use of a statistical linkage key and the Levenshtein distance to link birth to death records in Brazil.
Approach: First, we evaluated the discrimination power of a statistical linkage key adapted from the Australian SLK-581. The modified statistical linkage key (MSLK-781) was based on the concatenation of the 2nd, 3rd, and 5th letters of the mother's family name, the 2nd and 3rd letters of the mother's given name, the 2nd and 3rd letters of the mother's middle name, the child's date of birth, and sex. We calculated the proportion of records with a unique MSLK-781 value within the 2013 live births (N=224,038 records) and mortality (N=132,646 records) databases for Rio de Janeiro state, as well as the joint unique proportion, i.e., the product of these two proportions. Second, we evaluated the match quality of a linkage strategy based on the combined use of the MSLK-781 and the Levenshtein distance of the mother's name to link the live births database to death records of singleton children younger than one year of age (N=1,488). To assess match quality, we calculated the sensitivity, the positive predictive value (PPV), and the F-measure.
Results: The proportions of records with a unique MSLK-781 value within the live birth and mortality databases were, respectively, 97.5% and 98.8%, yielding a joint unique proportion of 96.1%. The match quality measures of the linkage strategy based only on the MSLK-781 were: sensitivity=83.6%; PPV=98.3%; F-measure=90.4%. Combining agreement on the MSLK-781 with a Levenshtein distance of the mother's name of less than 4 for record pair classification eliminated the false-positive matches (PPV=100%) with a small decline in sensitivity (81.7%) and F-measure (89.9%).
Conclusion: Using the MSLK-781 combined with the Levenshtein distance can serve as a first pass for linking birth to death records in Brazil without having to send pairs of records to clerical review.


2018 ◽  
Author(s):  
Daan Geerards ◽  
Andrea Pusic ◽  
Maarten Hoogbergen ◽  
René van der Hulst ◽  
Chris Sidey-Gibbons

BACKGROUND Quality of life (QoL) assessments, or patient-reported outcome measures (PROMs), are becoming increasingly important in health care and have been associated with improved decision making, higher satisfaction, and better outcomes of care. Some physicians and patients may find questionnaires too burdensome; however, this issue could be addressed by making use of computerized adaptive testing (CAT). In addition, making the questionnaire more interesting, for example by providing graphical and contextualized feedback, may further improve the experience of the users. However, little is known about how shorter assessments and feedback impact user experience. OBJECTIVE We conducted a controlled experiment to assess the impact of tailored multimodal feedback and CAT on user experience in QoL assessment using validated PROMs. METHODS We recruited a representative sample from the general population in the United Kingdom using the Oxford Prolific academic Web panel. Participants completed either a CAT version of the World Health Organization Quality of Life assessment (WHOQOL-CAT) or the fixed-length WHOQOL-BREF, an abbreviated version of the WHOQOL-100. We randomly assigned participants to conditions in which they would receive no feedback, graphical feedback only, or graphical and adaptive text-based feedback. Participants rated the assessment in terms of perceived acceptability, engagement, clarity, and accuracy. RESULTS We included 1386 participants in our analysis. Assessment experience was improved when graphical and tailored text-based feedback was provided along with PROMs (Δ=0.22, P<.001). Providing graphical feedback alone was weakly associated with improvement in overall experience (Δ=0.10, P=.006). Graphical and text-based feedback made the questionnaire more interesting, and users were more likely to report they would share the results with a physician or family member (Δ=0.17, P<.001, and Δ=0.17, P<.001, respectively). 
No difference was found in perceived accuracy of the graphical feedback scores of the WHOQOL-CAT and WHOQOL-BREF (Δ=0.06, P=.05). CAT (stopping rule [SE<0.45]) resulted in the administration of 25% fewer items than the fixed-length assessment, but it did not result in an improved user experience (P=.21). CONCLUSIONS Using tailored text-based feedback to contextualize numeric scores maximized the acceptability of electronic QoL assessment. Improving user experience may increase response rates and reduce attrition in research and clinical use of PROMs. In this study, CAT administration was associated with a modest decrease in assessment length but did not improve user experience. Patient-perceived accuracy of feedback was equivalent when comparing CAT with fixed-length assessment. Fixed-length forms are already generally acceptable to respondents; however, CAT might have an advantage over longer questionnaires that would be considered burdensome. Further research is warranted to explore the relationship between assessment length, feedback, and response burden in diverse populations.


Informatics ◽  
2019 ◽  
Vol 6 (3) ◽  
pp. 35
Author(s):  
Defauw ◽  
Szoc ◽  
Bardadym ◽  
Brabers ◽  
Everaert ◽  
...  

To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that both the Levenshtein distance between the target and the translated source, as well as the cosine distance between sentence embeddings of the source and the target were the two most important features for the task of misalignment detection. Using gold standards for alignment, we demonstrated that our model can increase the quality of alignments in a corpus substantially, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
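
The two most important features reported above can be sketched as follows; real sentence embeddings (and the actual initial translation step) are replaced by toy bag-of-words vectors purely for illustration:

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def misalignment_features(target, translated_source):
    """Length-normalized Levenshtein distance between the target sentence
    and the machine-translated source, plus a cosine distance between
    (here: bag-of-words stand-ins for) sentence embeddings."""
    lev = levenshtein(target, translated_source)
    lev /= max(len(target), len(translated_source), 1)
    emb_t = Counter(target.lower().split())
    emb_s = Counter(translated_source.lower().split())
    return {"norm_levenshtein": lev,
            "cosine_distance": 1.0 - cosine(emb_t, emb_s)}
```

A regressor trained on such features then scores each sentence pair, and low-scoring pairs are filtered out of the training corpus.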


2019 ◽  
Vol 484 (4) ◽  
pp. 401-404
Author(s):  
P. A. Yakovlev

A method for efficient comparison of a symbol sequence with all strings of a set is presented, which performs considerably faster than the naive enumeration of comparisons with all strings in succession. The procedure is accelerated by applying an original algorithm combining a prefix tree and a standard dynamic programming algorithm searching for the edit distance (Levenshtein distance) between strings. The efficiency of the method is confirmed by numerical experiments with arrays consisting of tens of millions of biological sequences of variable domains of monoclonal antibodies.
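
Combining a prefix tree with the dynamic-programming table is a well-known technique; the sketch below (not necessarily the paper's exact algorithm) shares DP rows across all strings with a common prefix and prunes an entire subtree as soon as every cell of the current row exceeds the cost budget:

```python
class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None  # set to the full string at word-ending nodes

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, query, max_cost):
    """Return (word, edit_distance) pairs within max_cost of query.

    Each trie edge extends the Levenshtein DP table by one row; shared
    prefixes are therefore processed once for the whole set of strings.
    """
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, prev_row):
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                row.append(min(row[j - 1] + 1,            # insertion
                               prev_row[j] + 1,           # deletion
                               prev_row[j - 1] + (query[j - 1] != ch)))
            if child.word is not None and row[-1] <= max_cost:
                results.append((child.word, row[-1]))
            if min(row) <= max_cost:  # otherwise prune the whole subtree
                walk(child, row)

    walk(root, first_row)
    return results
```

For large collections such as tens of millions of antibody sequences, the prefix sharing and subtree pruning are what make the search dramatically faster than comparing against every string in turn.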


2021 ◽  
Vol 7 (2) ◽  
pp. 178-180
Author(s):  
Kiran Godse ◽  
Gauri Godse ◽  
Anant Patil

Symptomatic dermographism, a type of physical urticaria, is a common condition affecting patients' quality of life. For its diagnosis, clinicians in India currently use the tip of a ballpoint pen to estimate the provocation threshold. However, because a ballpoint pen has a single tip of fixed length, it cannot differentiate between different grades of symptomatic dermographism, and with variations in the intensity of the stroke there is even a possibility of missing the diagnosis. Hence, there is a need for a better method to diagnose symptomatic dermographism and determine the provocation threshold. The skin writometer, a plastic instrument with three arms of varying length, can be novel in this regard. The instrument is simple, user-friendly, and inexpensive, and can be used both for diagnosis and for assessing treatment response in patients with symptomatic dermographism.


2019 ◽  
Author(s):  
Barbara Iadarola ◽  
Luciano Xumerle ◽  
Denise Lavezzari ◽  
Marta Paterno ◽  
Luca Marcolungo ◽  
...  

Abstract
Whole-exome sequencing (WES) enrichment platforms are usually evaluated by measuring the depth of coverage at target regions. However, variants called in WES are reported in the variant call format (VCF) file, which is filtered by minimum site coverage and mapping quality. Therefore, genotypability (base calling calculated by combining depth of coverage with the confidence of read alignment) should be considered a more informative parameter for assessing the performance of WES. We found that the mapping quality of reads aligned to difficult target regions was improved by increasing the DNA fragment length well above the average exon size. We tested three different DNA fragment lengths on four major commercial WES platforms and found that longer DNA fragments achieved a higher percentage of callable bases in the target regions and thus improved the genotypability of many genes, including several associated with clinical phenotypes. DNA fragment size also affected the uniformity of coverage, which in turn influences genotypability, indicating that different platforms are optimized for different DNA fragment lengths. Finally, we found that although the depth of coverage continued to increase in line with the sequencing depth (overall number of reads), base calling reached saturation at a depth of coverage that depended on the enrichment platform and DNA fragment length. This confirms that genotypability provides better estimates of the optimal sequencing depth for each fragment size/enrichment platform combination.


Author(s):  
Mieradilijiang Maimaiti ◽  
Yang Liu ◽  
Huanbo Luan ◽  
Zegao Pan ◽  
Maosong Sun

Data augmentation is a widely used approach for text generation tasks; in the machine translation paradigm, many data augmentation methods have been proposed, mainly for low-resource language scenarios. The most common approaches generate pseudo data by word omission, random sampling, or replacing some words in the text. However, previous methods barely guarantee the quality of the augmented data. In this work, we build augmented data using paraphrase embeddings and POS tagging. Specifically, we generate a pseudo monolingual corpus by replacing words carrying the four main POS tags (noun, adjective, adverb, and verb), based on both a paraphrase table and embedding similarity. We use the larger word-level paraphrase table, obtain the embedding of each word in the table, and calculate the cosine similarity between these words and the tagged words in the original sequence. In addition, we exploit a ranking algorithm to choose highly similar words, reducing semantic errors, and constrain replacements by POS tag to mitigate syntactic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous SOTA methods on seven low-resource language pairs from four corpora, by 1.16 to 2.39 BLEU points.
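
A toy sketch of the replacement step; the paraphrase table, embedding vectors, and POS tags below are invented stand-ins for the learned resources (paraphrase table, trained word embeddings, POS tagger) the method actually uses:

```python
import math

# Illustrative stand-ins only; real data would come from trained models.
PARAPHRASES = {"quick": ["fast", "rapid"], "dog": ["hound", "puppy"]}
EMBEDDINGS = {"quick": (0.9, 0.1), "fast": (0.85, 0.2), "rapid": (0.4, 0.9),
              "dog": (0.1, 0.9), "hound": (0.15, 0.88), "puppy": (0.5, 0.5)}
REPLACEABLE_POS = {"NOUN", "ADJ", "ADV", "VERB"}  # the four main tags

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def augment(tagged_sentence):
    """For each word tagged noun/adjective/adverb/verb, substitute the
    paraphrase-table candidate with the highest cosine similarity to the
    original word; other words pass through unchanged."""
    out = []
    for word, pos in tagged_sentence:
        candidates = PARAPHRASES.get(word, []) if pos in REPLACEABLE_POS else []
        scored = [(cosine(EMBEDDINGS[word], EMBEDDINGS[c]), c)
                  for c in candidates
                  if word in EMBEDDINGS and c in EMBEDDINGS]
        out.append(max(scored)[1] if scored else word)
    return out
```

Restricting substitutions to the highest-ranked, same-POS paraphrases is what keeps the pseudo data semantically and syntactically close to the original sentence.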


2020 ◽  
pp. bmjqs-2019-010405
Author(s):  
Yifan Zheng ◽  
Yun Jiang ◽  
Michael P Dorsch ◽  
Yuting Ding ◽  
V G Vinod Vydiswaran ◽  
...  

Background: Free-text directions generated by prescribers in electronic prescriptions can be difficult for patients to understand due to their variability, complexity, and ambiguity. Pharmacy staff are responsible for transcribing these directions so that patients can take their medication as prescribed. However, little is known about the quality of the transcribed directions that patients receive.
Methods: A retrospective observational analysis of 529,990 e-prescription directions processed at a mail-order pharmacy in the USA. We measured pharmacy staff editing of directions using string edit distance, and execution time using the Keystroke-Level Model. Using the New Dale-Chall (NDC) readability formula, we calculated NDC cloze scores of the patient directions before and after transcription. We also evaluated the quality of directions (e.g., whether they included a dose, dose unit, and frequency of administration) before and after transcription in a random sample of 966 patient directions.
Results: Pharmacy staff edited 83.8% of all e-prescription directions received, with a median edit distance of 18 per e-prescription. We estimated a median of 6.64 s spent transcribing each e-prescription. The median NDC score increased by 68.6% after transcription (26.12 vs 44.03, p<0.001), indicating a significant readability improvement. In our sample, 51.4% of patient directions on e-prescriptions contained at least one pre-defined direction quality issue; pharmacy staff corrected 79.5% of these issues.
Conclusion: Pharmacy staff put significant effort into transcribing e-prescription directions. Manual transcription removed the majority of quality issues; however, pharmacy staff still missed some issues, or introduced new ones, during manual transcription. The development of tools and techniques such as a comprehensive set of structured direction components or machine learning-based natural language processing techniques may help produce clear directions.


2020 ◽  
Vol 18 (4) ◽  
pp. 31-50
Author(s):  
Vinay Vachharajani ◽  
Jyoti Pareek

The demand for higher education keeps increasing. Information technology and e-learning have, to a large extent, solved the problem of a shortage of skilled and qualified teachers, but this alone does not ensure a high quality of learning. Although delivering learning materials and tests to large numbers of students has become easy by uploading them to the web, assessment can be tedious, so there is a need for tools and technologies for fully automated assessment. In this paper, an innovative algorithm is proposed for matching the structures of two use-case diagrams, drawn by a student and an expert respectively, for automatic assessment. Zhang and Shasha's tree edit distance algorithm is extended to the assessment of use-case diagrams. Results from 445 students' answers, based on 14 different scenarios, are analyzed to evaluate the performance of the proposed algorithm. No comparable study has been reported for any other diagram-assessing algorithm in the research literature.
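
For illustration, here is a memoized version of the textbook recursive formulation of ordered tree edit distance that Zhang and Shasha's algorithm computes more efficiently (unit costs assumed; the encoding of a use-case diagram as a labeled tree is hypothetical):

```python
from functools import lru_cache

# A diagram is reduced to an ordered labeled tree: (label, children_tuple).
# ted() works on forests (tuples of trees) and returns the minimum number
# of node insertions, deletions, and relabelings turning one into the other.

def size(forest):
    """Total number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def ted(f1, f2):
    if not f1:
        return size(f2)          # insert everything remaining in f2
    if not f2:
        return size(f1)          # delete everything remaining in f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        ted(f1[:-1] + c1, f2) + 1,        # delete root of f1's last tree
        ted(f1, f2[:-1] + c2) + 1,        # insert root of f2's last tree
        ted(f1[:-1], f2[:-1]) + ted(c1, c2) + (l1 != l2),  # match roots
    )
```

A student diagram close to the expert's yields a small distance, which can then be mapped to a score; Zhang and Shasha's keyroot decomposition computes the same quantity in polynomial time for larger trees.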

