Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur

Information ◽  
2019 ◽  
Vol 10 (8) ◽  
pp. 246 ◽  
Author(s):  
Turdi Tohti ◽  
Jimmy Huang ◽  
Askar Hamdulla ◽  
Xing Tan

Given its generality in applications and its high time-efficiency on big data sets, text filtering through pattern matching has in recent years attracted increasing attention from the information retrieval and natural language processing (NLP) research communities at large. However, it has yet to be seen how this technique and its algorithms (e.g., Wu–Manber, which is also considered in this paper) can be applied and adapted properly and effectively to Uyghur, a low-resource language mostly spoken by the ethnic Uyghur group, with a population of more than eleven million, in Xinjiang, China. We observe that, technically, the challenge is mainly caused by two factors: (1) vowel weakening and (2) mismatching in semantics between affixes and stems. Accordingly, in this paper, we propose Wu–Manber–Uy, a variant of an improved Wu–Manber algorithm, dedicated particularly to working on the Uyghur language. Wu–Manber–Uy implements a stem deformation-based pattern expansion strategy, specifically for reducing the mismatching of patterns caused by vowel weakening and spelling errors. Wu–Manber–Uy also uses a two-way strategy that monitors and controls changes in the lexical meaning of stems during word-building. Extra considerations with respect to Word2vec and the dictionary are incorporated into the system for processing Uyghur. The experimental results we have obtained consistently demonstrate the high performance of Wu–Manber–Uy.
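
The base Wu–Manber algorithm that this abstract builds on can be sketched roughly as follows. This is a minimal illustration of the standard shift-table and bucket idea only, not the Wu–Manber–Uy variant: it has no stem-deformation expansion, no affix handling, and no Word2vec step. The function name `wu_manber_search`, the block size of 2, and the English sample text are assumptions made for the example.

```python
def wu_manber_search(text, patterns, block=2):
    """Simplified Wu-Manber multi-pattern search (illustrative sketch).

    Returns a list of (start_index, pattern) for every occurrence found.
    """
    m = min(len(p) for p in patterns)  # tables use only the first m chars of each pattern
    if m < block:
        raise ValueError("every pattern must be at least `block` characters long")

    default_shift = m - block + 1  # safe shift for a block that occurs in no pattern
    shift = {}                     # block -> how far the window may safely move
    bucket = {}                    # final block of the m-prefix -> candidate patterns

    for p in patterns:
        for end in range(block, m + 1):
            blk = p[end - block:end]
            shift[blk] = min(shift.get(blk, default_shift), m - end)
        bucket.setdefault(p[m - block:m], []).append(p)

    hits = []
    pos = m  # exclusive end index of the current window in `text`
    while pos <= len(text):
        blk = text[pos - block:pos]
        s = shift.get(blk, default_shift)
        if s == 0:
            # The window ends with the last block of some pattern prefix: verify candidates.
            start = pos - m
            for p in bucket.get(blk, []):
                if text.startswith(p, start):
                    hits.append((start, p))
            pos += 1
        else:
            pos += s
    return hits


matches = wu_manber_search("the quick brown fox", ["quick", "brown", "fox"])
print(matches)
```

Most windows are skipped by a shift of more than one character, which is where the time-efficiency mentioned in the abstract comes from; full string comparisons only happen when the shift is zero.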

2015 ◽  
Vol 22 (3) ◽  
pp. 671-681 ◽  
Author(s):  
Azadeh Nikfarjam ◽  
Abeed Sarker ◽  
Karen O’Connor ◽  
Rachel Ginn ◽  
Graciela Gonzalez

Abstract. Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words’ semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable and suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.


2014 ◽  
Vol 11 (3) ◽  
pp. 88-98 ◽  
Author(s):  
Shima Soroushnia ◽  
Masoud Daneshtalab ◽  
Juha Plosila ◽  
Tapio Pahikkala ◽  
Pasi Liljeberg

Summary: Pattern discovery is one of the fundamental tasks in bioinformatics, and pattern recognition is a powerful technique for searching sequence patterns in biological sequence databases. Fast, high-performance algorithms are in high demand in many applications in bioinformatics and computational molecular biology, since the significant increase in the number of DNA and protein sequences expands the need for raising the performance of pattern matching algorithms. For this purpose, heterogeneous architectures can be a good choice due to their potential for high performance and energy efficiency. In this paper we present an efficient implementation of Aho-Corasick (AC), a well-known exact pattern matching algorithm with linear complexity, and of the Parallel Failureless Aho-Corasick (PFAC) algorithm, the massively parallelized version of the AC algorithm without failure transitions, on a heterogeneous CPU/GPU architecture. We progressively redesigned the algorithms and data structures to fit the GPU architecture. Our results on different protein sequence data sets show that the new implementation runs 15 times faster than the original implementation of the PFAC algorithm.
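
For orientation, the classic sequential Aho-Corasick automaton (with failure transitions) can be sketched as below. This is a plain single-threaded illustration and not the GPU or PFAC implementation described in the abstract; the function names `build_automaton` and `search` and the sample strings are assumptions made for the example.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix that is also in the trie
    out = [[]]    # out[state] -> patterns that end at this state

    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first traversal: depth-1 states keep fail = 0, deeper states derive theirs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over the text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


print(search("ushers", ["he", "she", "his", "hers"]))
```

The failure links are what a "failureless" variant such as PFAC drops: as the abstract describes it, PFAC trades them for massive parallelism.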


2017 ◽  
Vol 9 (1) ◽  
pp. 19-24 ◽  
Author(s):  
David Domarco ◽  
Ni Made Satvika Iswari

Technology development has affected many areas of life, especially the entertainment field. One of the fastest growing entertainment industries is anime. Anime has evolved into a trend and a hobby, especially for the population in the regions of Asia. The number of anime fans grows every year, and they try to dig up as much information as possible about their favorite anime. Therefore, a chatbot application was developed in this study as an anime information retrieval medium using the regular expression pattern matching method. This application is intended to help anime fans search for information about the anime they like. By using this application, users gain convenient and interactive anime data retrieval that cannot be found when searching for information via search engines. The chatbot application successfully met the standards of an information retrieval engine with very good results: 72% precision and 100% recall, giving a harmonic mean of 83.7%. As a hedonic application, the chatbot influenced Behavioral Intention to Use by 83% and Immersion by 82%. Index Terms: anime, chatbot, information retrieval, Natural Language Processing (NLP), Regular Expression Pattern Matching
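
The reported harmonic mean can be checked directly from the stated precision and recall. The short sketch below assumes the harmonic mean in the abstract is the usual balanced F-measure (F1); the function name `f_measure` is chosen for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


score = f_measure(0.72, 1.00)
print(f"{score:.1%}")  # 83.7%
```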


2021 ◽  
Vol 29 ◽  
pp. 115-124
Author(s):  
Xinlu Wang ◽  
Ahmed A.F. Saif ◽  
Dayou Liu ◽  
Yungang Zhu ◽  
Jon Atli Benediktsson

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operations for identifying which gene family may contain a given sequence, and pattern matching for DNA sequences has been a fundamental issue in biomedical engineering, biotechnology and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimal multi-pattern matching method with wildcards for DNA sequences. METHODS: The proposed method packs the patterns and a sliding window of the text; the window slides along the packed text, matching against the stored packed patterns. RESULTS: Three data sets were used to test the performance of the proposed algorithm, which was found to be more efficient than its competitors because its operations are close to machine language. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms the state-of-the-art methods and is especially effective for DNA sequences.


Foods ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 644
Author(s):  
Do-Yeong Kim ◽  
Boram Kim ◽  
Han-Seung Shin

The effect of cellulosic aerogel treatments used for adsorption of four polycyclic aromatic hydrocarbons (PAHs), namely benzo[a]anthracene, chrysene, benzo[b]fluoranthene, and benzo[a]pyrene (BaP), generated during the manufacture of sesame oil was evaluated. In this study, a eulalia (Miscanthus sinensis var. purpurascens)-based cellulosic aerogel (adsorbent) was prepared, and high-performance liquid chromatography with fluorescence detection was used for determination of PAHs in sesame oil. In addition, changes in the sesame oil quality parameters (acid value, peroxide value, color, and fatty acid composition) following cellulosic aerogel treatment were evaluated. The four PAHs and their total levels decreased in sesame oil samples roasted under different conditions (p < 0.05) following treatment with cellulosic aerogel. In particular, highly carcinogenic BaP was not detected after treatment with cellulosic aerogel. Moreover, there were no noticeable changes in the quality parameters between treated and control samples. It was concluded that eulalia-based cellulosic aerogel is suitable for the reduction of PAHs in sesame oil and can be used as an eco-friendly adsorbent.


2021 ◽  
Vol 170 ◽  
pp. 112529
Author(s):  
N. Cruz ◽  
A.J.N. Batista ◽  
J.M. Cardoso ◽  
B.B. Carvalho ◽  
P.F. Carvalho ◽  
...  

2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.


Animals ◽  
2021 ◽  
Vol 11 (5) ◽  
pp. 1399
Author(s):  
Karina Yévenes ◽  
Ekaterina Pokrant ◽  
Lina Trincado ◽  
Lisette Lapierre ◽  
Nicolás Galarce ◽  
...  

Tetracyclines, sulphonamides, and quinolones are families of antimicrobials (AMs) widely used in the poultry industry. Birds can excrete up to 90% of the AMs administered, which accumulate in poultry litter. Worryingly, poultry litter is widely used as an agricultural fertilizer, contributing to the spread of AM residues in the environment. The aim of this research was to develop a method that could simultaneously identify and quantify three AM families in poultry litter by high-performance liquid chromatography–tandem mass spectrometry (HPLC–MS/MS). Samples of AM-free poultry litter were used to validate the method according to 657/2002/EC and VICH GL49. Results indicate that the limits of detection (LOD) ranged from 8.95 to 20.86 μg kg−1, while the limits of quantitation (LOQ) were between 26.85 and 62.58 μg kg−1 for tetracycline, 4-epi-tetracycline, oxytetracycline, 4-epi-oxytetracycline, enrofloxacin, ciprofloxacin, flumequine, sulfachloropyridazine, and sulfadiazine. Recoveries obtained ranged from 93 to 108%. The analysis of field samples obtained from seven commercial poultry flocks confirmed the adequacy of the method, since it detected mean concentrations ranging from 20 to 10,364 μg kg−1. This provides an accurate and reliable tool to monitor AM residues in poultry litter and control its use as agricultural fertilizer.


Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1382
Author(s):  
Xiaoying Deng ◽  
Huazhang Li ◽  
Mingcheng Zhu

Based on the idea of the bisection method, a new structure of All-Digital Phase-Locked Loop (ADPLL) with fast locking is proposed. The structure and locking method are different from those of traditional ADPLLs. The control circuit consists of a frequency compare module, a mode-adjust module and a control module, and is responsible for adjusting the frequency control word of the digitally controlled oscillator (DCO) by the bisection method according to the result of the frequency comparison between the reference clock and the restructure clock. With a high-frequency cascade structure, the DCO achieves a wide tuning range and high resolution. The proposed ADPLL was designed in the SMIC 180 nm CMOS process. The measured results show a lock range of 640 to 1920 MHz with a 40 MHz reference frequency. The ADPLL core occupies 0.04 mm², and the power consumption is 29.48 mW with a 1.8 V supply. The longest locking time is 23 reference cycles (575 ns), at 1.92 GHz. When the ADPLL operates at 1.28 GHz–1.6 GHz, the locking time is the shortest, only 9 reference cycles (225 ns). Compared with recent high-performance ADPLLs, our design shows the advantages of small area, short locking time, and wide tuning range.
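
To make the two quantitative ideas in this abstract concrete, the sketch below converts reference cycles to locking time at the stated 40 MHz reference and shows a generic bisection search over a frequency control word. It is only an illustration under loud assumptions: the abstract does not give the control-word width or the DCO transfer curve, so the 8-bit word, the linear `toy_dco` model with 5 MHz steps, and all function names are hypothetical, and the actual hardware control loop may differ.

```python
def lock_time_ns(reference_cycles: int, f_ref_hz: float) -> float:
    """Locking time in nanoseconds for a given number of reference-clock cycles."""
    return reference_cycles * 1e9 / f_ref_hz


def bisect_control_word(target_hz, dco_frequency, bits):
    """Find the smallest control word whose DCO frequency reaches target_hz.

    Assumes dco_frequency(word) is monotonically non-decreasing in `word`.
    Returns (word, number_of_comparisons).
    """
    lo, hi = 0, (1 << bits) - 1
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if dco_frequency(mid) < target_hz:
            lo = mid + 1   # too slow: search the upper half
        else:
            hi = mid       # fast enough: search the lower half
    return lo, steps


def toy_dco(word: int) -> float:
    """Hypothetical linear DCO: 640 MHz base plus 5 MHz per control-word step."""
    return 640e6 + word * 5e6


print(lock_time_ns(23, 40e6))  # 575.0
print(lock_time_ns(9, 40e6))   # 225.0
print(bisect_control_word(1.28e9, toy_dco, bits=8))
```

Because each comparison halves the remaining range, the search needs at most one comparison per control-word bit, which is the general reason a bisection-based loop can lock in few reference cycles.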


2020 ◽  
pp. 1-17
Author(s):  
Francisco Javier Balea-Fernandez ◽  
Beatriz Martinez-Vega ◽  
Samuel Ortega ◽  
Himar Fabelo ◽  
Raquel Leon ◽  
...  

Background: Sociodemographic data indicate a progressive increase in life expectancy and in the prevalence of Alzheimer’s disease (AD). AD has emerged as one of the greatest public health problems. Its etiology is twofold: non-modifiable factors on the one hand and modifiable factors on the other. Objective: This study aims to develop a processing framework based on machine learning (ML) and optimization algorithms to study sociodemographic, clinical, and analytical variables, selecting the best combination among them for an accurate discrimination between controls and subjects with major neurocognitive disorder (MNCD). Methods: This research is based on an observational-analytical design. Two research groups were established: an MNCD group (n = 46) and a control group (n = 38). ML and optimization algorithms were employed to automatically diagnose MNCD. Results: Twelve out of 37 variables were identified in the validation set as the most relevant for MNCD diagnosis. A sensitivity of 100% and a specificity of 71% were achieved using a Random Forest classifier. Conclusion: ML is a potential tool for automatic prediction of MNCD which can be applied to relatively small preclinical and clinical data sets. These results can be interpreted to support the influence of the environment on the development of AD.

