approximate matching
Recently Published Documents

TOTAL DOCUMENTS: 160 (last five years: 22)
H-INDEX: 16 (last five years: 1)

2021 ◽  
Author(s):  
Anas Al-okaily ◽  
Abdelghani Tbakhi

Abstract Pattern matching is a fundamental process in almost every scientific domain. The problem involves finding the positions of a given pattern (usually short) in a reference stream of data (usually large). The matching can be exact or approximate (inexact). Exact matching searches for the pattern without allowing mismatches (or insertions and deletions) of one or more characters in the pattern, while approximate matching tolerates such differences. For exact matching, several data structures that can be built in linear time and space are in practical use today. For approximate matching, the solutions proposed so far are non-linear and currently impractical. In this paper, we designed and implemented a structure that can be built in linear time and space and solves the approximate matching problem in O(m + (log_Σ n)^k / k! + occ) search cost, where m is the length of the pattern, n is the length of the reference, k is the number of tolerated mismatches (and insertions and deletions), and occ is the number of occurrences.
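The distinction between exact and k-mismatch matching can be illustrated with a naive baseline. This O(n·m) scan is for illustration only, not the linear-time structure the paper proposes, and it handles substitutions only, not insertions or deletions:

```python
def exact_match(text, pattern):
    """Return all start positions where pattern occurs exactly in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

def approximate_match(text, pattern, k):
    """Return start positions where pattern matches with at most k mismatches
    (Hamming distance; insertions and deletions are not handled here)."""
    m = len(pattern)
    positions = []
    for i in range(len(text) - m + 1):
        mismatches = sum(1 for a, b in zip(text[i:i + m], pattern) if a != b)
        if mismatches <= k:
            positions.append(i)
    return positions

# "abru" does not occur exactly, but occurs twice with one mismatch allowed.
print(exact_match("abracadabra", "abra"))        # [0, 7]
print(approximate_match("abracadabra", "abru", 1))  # [0, 7]
```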


2021 ◽  
Vol 44 (1) ◽  
pp. 101-136
Author(s):  
Lidija Iordanskaja ◽  
Igor Mel’čuk

Abstract A formal linguistic model is presented that produces, for a given conceptual representation of an extralinguistic situation, a corresponding semantic representation [SemR], which in turn underlies the deep-syntactic representations of four near-synonymous Russian sentences expressing the starting information. Two full-fledged lexical entries are given for the lexemes besporjadki ‘disturbance’ and stolknovenie ‘clash(N)’, which appear in these sentences. Some principles of lexicalization – that is, of matching formal lexicographic definitions to the starting semantic representation in order to produce the deep-syntactic structures of the corresponding sentences – are formulated and illustrated; the problem of approximate matching is dealt with in sufficient detail.


JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
Abeed Sarker ◽  
Yao Ge

Abstract Our objective was to mine Reddit to discover long-COVID symptoms self-reported by users, compare symptom distributions across studies, and create a symptom lexicon. We retrieved posts from the /r/covidlonghaulers subreddit and extracted symptoms via approximate matching using an expanded meta-lexicon. We mapped the extracted symptoms to standard concept IDs, compared their distributions with those reported in recent literature, and analyzed their distributions over time. From 42,995 posts by 4249 users, we identified 1744 users who expressed at least 1 symptom. The most frequently reported long-COVID symptoms were mental health-related symptoms (55.2%), fatigue (51.2%), general ache/pain (48.4%), brain fog/confusion (32.8%), and dyspnea (28.9%) among users reporting at least 1 symptom. Comparison with recent literature revealed a large variance in reported symptoms across studies. Temporal analysis showed several persistent symptoms up to 15 months after infection. The spectrum of symptoms identified from Reddit may provide early insights into long-COVID.
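As a rough illustration of lexicon-based approximate symptom extraction (not the authors' actual pipeline: the lexicon phrases and concept IDs below are made up, and Python's difflib fuzzy matcher stands in for the expanded meta-lexicon):

```python
import difflib

# Toy lexicon mapping symptom phrases to hypothetical concept IDs.
LEXICON = {
    "fatigue": "SYM001",
    "brain fog": "SYM002",
    "shortness of breath": "SYM003",
    "chest pain": "SYM004",
}

def extract_symptoms(post, cutoff=0.8):
    """Approximately match word n-grams from a post against the lexicon,
    returning the set of matched concept IDs."""
    words = post.lower().split()
    terms = list(LEXICON)
    found = set()
    for n in (1, 2, 3):  # n-gram sizes covering the lexicon phrases
        for i in range(len(words) - n + 1):
            ngram = " ".join(words[i:i + n])
            for hit in difflib.get_close_matches(ngram, terms, n=1, cutoff=cutoff):
                found.add(LEXICON[hit])
    return found

# Tolerates the misspelling "fatigiue" while matching "brain fog" exactly.
print(extract_symptoms("constant fatigiue and brain fog since covid"))
```

Mapping matched phrases to concept IDs (rather than raw strings) is what makes symptom counts comparable across posts and across studies.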




Matching algorithms find exact or approximate matches between a text “T” and a pattern “P”. Because modern processors contain multiple cores, multiple tasks can be performed simultaneously, and this technology allows matching algorithms to run in parallel to improve matching speed. Several exact and approximate string matching algorithms have been parallelized to find correspondences between a text “T” and a pattern “P”. This paper proposes two models: first, a Parallelized Direct Matching Algorithm (PDMA) for multi-core architectures using OpenMP; second, an implementation of the PDMA in Network Intrusion Detection Systems (NIDS) to speed up the NIDS detection engine. The PDMA achieves more than a 19.7% improvement in processing time over sequential matching. In addition, the performance of the NIDS detection engine improves by more than 8% compared to the current SNORT-NIDS detection engine.
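The chunk-and-merge idea behind this kind of parallelization can be sketched as follows. This is an illustrative Python sketch, not the OpenMP/C implementation described in the paper; in CPython, threads will not give a true multi-core speedup for this CPU-bound loop, so a real implementation would use processes or OpenMP as the authors do. The key detail is extending each chunk by pattern length − 1 characters so matches straddling chunk boundaries are not missed:

```python
from concurrent.futures import ThreadPoolExecutor

def match_chunk(args):
    """Find exact occurrences of pattern within one chunk of the text."""
    text, pattern, start, end = args
    m = len(pattern)
    # Extend the chunk by m - 1 characters so an occurrence straddling the
    # boundary is still seen; positions are reported in global coordinates.
    window = text[start:min(end + m - 1, len(text))]
    return [start + i for i in range(len(window) - m + 1)
            if window[i:i + m] == pattern]

def parallel_match(text, pattern, workers=4):
    """Split the text into chunks, match each concurrently, merge results."""
    n = len(text)
    chunk = -(-n // workers)  # ceiling division
    tasks = [(text, pattern, i, min(i + chunk, n)) for i in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(match_chunk, tasks)
    return sorted(p for part in results for p in part)

# The match at position 3 straddles the boundary between chunks [0,2) and [2,4).
print(parallel_match("abcabcab", "abc"))  # [0, 3]
```

Because each chunk only reports positions inside its own half-open range, the merged result contains no duplicates.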


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Van-Kien Bui ◽  
Chaochun Wei

Abstract Background Current taxonomic classification tools use exact string matching algorithms that are effective for data from next-generation sequencing technologies. However, the distinctive error patterns of third-generation sequencing (TGS) technologies can reduce the accuracy of these programs. Results We developed a Classification tool using Discriminative K-mers and an Approximate Matching algorithm (CDKAM). The approximate matching method is used to search for k-mers in two phases: a quick mapping phase and a dynamic programming phase. Simulated datasets as well as real TGS datasets were used to compare the performance of CDKAM with existing methods. We show that CDKAM performs better in many respects, especially when classifying TGS data with an average read length of 1000–1500 bases. Conclusions CDKAM is an effective program with higher accuracy and lower memory requirements for TGS metagenome sequence classification. It achieves high species-level accuracy.
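The core idea of matching k-mers approximately (here, within Hamming distance 1) can be sketched as below. This is an illustrative toy, not CDKAM's discriminative k-mer selection or its two-phase quick-mapping/dynamic-programming search:

```python
from itertools import product

def kmers(seq, k):
    """All k-length substrings of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def neighbors1(kmer, alphabet="ACGT"):
    """All k-mers within Hamming distance 1 of kmer (including itself)."""
    out = {kmer}
    for i, base in product(range(len(kmer)), alphabet):
        out.add(kmer[:i] + base + kmer[i + 1:])
    return out

def approx_kmer_hits(read, ref_kmers, k):
    """Count read k-mers that match some reference k-mer with <= 1 mismatch,
    tolerating the single-base errors typical of long-read (TGS) data."""
    return sum(1 for km in kmers(read, k) if neighbors1(km) & ref_kmers)

# "ACGA" is one substitution away from the reference k-mer "ACGT".
print(approx_kmer_hits("ACGA", {"ACGT"}, 4))  # 1
```

Real tools avoid enumerating neighbor sets per query by indexing the reference k-mers, but the tolerance to single-base sequencing errors is the same.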

