Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur

Given its generality and its time efficiency on large data sets, text filtering through pattern matching has attracted increasing attention in recent years from the information retrieval and Natural Language Processing (NLP) research communities. However, it has yet to be shown how this technique and its algorithms (e.g., Wu–Manber, which is also considered in this paper) can be applied properly and effectively to Uyghur, a low-resource language spoken mostly by the ethnic Uyghur group, with a population of more than eleven million in Xinjiang, China. We observe that the technical challenge stems mainly from two factors: (1) vowel weakening and (2) semantic mismatch between affixes and stems. Accordingly, in this paper we propose Wu–Manber–Uy, a variant of an improved Wu–Manber designed specifically for the Uyghur language. Wu–Manber–Uy implements a stem deformation-based pattern expansion strategy to reduce pattern mismatches caused by vowel weakening and spelling errors. It also uses a two-way strategy that monitors and controls changes in the lexical meaning of stems during word-building. Additional considerations with respect to Word2vec and the dictionary are incorporated into the system for processing Uyghur. The experimental results we obtained consistently demonstrate the high performance of Wu–Manber–Uy.
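
For readers unfamiliar with the baseline, the following is a minimal sketch of the classic Wu–Manber shift-table idea, not the Wu–Manber–Uy variant described in the abstract. It is simplified under stated assumptions: character blocks of size 2 are used directly as dictionary keys instead of being hashed, and the prefix-filter table is omitted. The function name `wu_manber_search` and its parameters are illustrative.

```python
def wu_manber_search(text, patterns, block=2):
    """Return sorted (start, pattern) pairs for every occurrence in text."""
    if not patterns:
        return []
    m = min(len(p) for p in patterns)  # only the first m chars drive shifting
    if m < block:
        raise ValueError("every pattern needs at least `block` characters")
    default_shift = m - block + 1      # shift for blocks seen in no pattern
    shift = {}                         # block -> safe shift distance
    buckets = {}                       # last block of m-prefix -> candidates
    for p in patterns:
        for end in range(block, m + 1):
            key = p[end - block:end]
            shift[key] = min(shift.get(key, default_shift), m - end)
        buckets.setdefault(p[m - block:m], []).append(p)

    hits = []
    pos = m                            # exclusive end of the current window
    while pos <= len(text):
        key = text[pos - block:pos]
        step = shift.get(key, default_shift)
        if step == 0:
            # The window may end a pattern prefix: verify each candidate.
            start = pos - m
            for p in buckets.get(key, []):
                if text.startswith(p, start):
                    hits.append((start, p))
            step = 1
        pos += step
    return sorted(hits)
```

Calling `wu_manber_search("CPM_annual_conference_announce", ["announce", "annual", "annually"])` scans the text once, skipping ahead whenever the last two characters of the window cannot end any pattern prefix.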

Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocu041 ◽

2015 ◽

Vol 22 (3) ◽

pp. 671-681 ◽

Cited By ~ 145

Author(s):

Azadeh Nikfarjam ◽

Abeed Sarker ◽

Karen O’Connor ◽

Rachel Ginn ◽

Graciela Gonzalez

Keyword(s):

Machine Learning ◽

Social Media ◽

Language Processing ◽

High Performance ◽

Conditional Random Fields ◽

Training Data ◽

Data Sets ◽

Social Media Mining ◽

Medical Concepts ◽

Media Mining

Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words’ semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable and suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.
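
The word cluster feature can be pictured with a short sketch. This is not ADRMine's actual feature set: the function `token_features`, the `cluster_of` mapping, the cluster identifiers, and the one-token context window are all hypothetical, chosen only to show how a cluster ID derived from embeddings could be exposed to a CRF as a categorical feature.

```python
def token_features(tokens, i, cluster_of):
    """Build a feature dict for tokens[i], including embedding-cluster IDs.

    cluster_of is a hypothetical mapping from a lowercased word to the ID
    of the embedding cluster it was assigned to in an offline step.
    """
    feats = {"word": tokens[i].lower()}
    for offset in (-1, 0, 1):  # assumed context window of one token per side
        j = i + offset
        if 0 <= j < len(tokens):
            # Words in the same cluster share a feature value, so a rare
            # word can borrow evidence from frequent, similar words.
            feats[f"cluster[{offset}]"] = cluster_of.get(tokens[j].lower(), "UNK")
    return feats
```

With a mapping in which "dizzy" and "drowsy" share a cluster, both words produce the same `cluster[0]` value, which is the generalization effect the abstract attributes to the cluster features.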

High Performance Pattern Matching on Heterogeneous Platform

Journal of Integrative Bioinformatics ◽

10.1515/jib-2014-253 ◽

2014 ◽

Vol 11 (3) ◽

pp. 88-98 ◽

Cited By ~ 1

Author(s):

Shima Soroushnia ◽

Masoud Daneshtalab ◽

Juha Plosila ◽

Tapio Pahikkala ◽

Pasi Liljeberg

Keyword(s):

Pattern Matching ◽

High Performance ◽

Sequence Data ◽

Good Choice ◽

Data Sets ◽

Biological Sequence ◽

Computational Molecular Biology ◽

Heterogeneous Architectures ◽

Protein Sequence Data ◽

GPU Architecture

Summary: Pattern discovery is one of the fundamental tasks in bioinformatics, and pattern recognition is a powerful technique for searching sequence patterns in biological sequence databases. Fast, high-performance algorithms are in high demand in many applications in bioinformatics and computational molecular biology, since the significant increase in the number of DNA and protein sequences expands the need for faster pattern matching algorithms. For this purpose, heterogeneous architectures can be a good choice due to their potential for high performance and energy efficiency. In this paper we present an efficient implementation, on a heterogeneous CPU/GPU architecture, of Aho-Corasick (AC), a well-known exact pattern matching algorithm with linear complexity, and of the Parallel Failureless Aho-Corasick (PFAC) algorithm, a massively parallelized version of AC without failure transitions. We progressively redesigned the algorithms and data structures to fit the GPU architecture. Our results on different protein sequence data sets show that the new implementation runs 15 times faster than the original implementation of the PFAC algorithm.
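
The failureless idea can be sketched sequentially. The code below is a CPU emulation under stated assumptions, not the authors' GPU implementation: each iteration of the outer loop plays the role of one PFAC thread, which starts at its own text position, walks the pattern trie, and simply stops at the first missing transition instead of following a failure link. The names `build_trie` and `pfac_search` are illustrative.

```python
def build_trie(patterns):
    """Return (trie, accepts): trie[state] maps a char to the next state."""
    trie = [{}]        # state 0 is the root
    accepts = {}       # accepting state -> pattern that ends there
    for pat in patterns:
        if not pat:
            continue   # ignore empty patterns
        state = 0
        for ch in pat:
            nxt = trie[state].get(ch)
            if nxt is None:
                trie.append({})
                nxt = len(trie) - 1
                trie[state][ch] = nxt
            state = nxt
        accepts[state] = pat
    return trie, accepts


def pfac_search(text, patterns):
    """Return (start, pattern) pairs, ordered by start position."""
    trie, accepts = build_trie(patterns)
    hits = []
    for start in range(len(text)):      # one "thread" per start position
        state = 0
        for k in range(start, len(text)):
            state = trie[state].get(text[k])
            if state is None:
                break                   # no failure links: the thread ends
            if state in accepts:
                hits.append((start, accepts[state]))
    return hits
```

Because no thread depends on another, the outer loop is trivially parallel, which is the property that makes the failureless variant attractive on a GPU; the cost is that some text characters are read by several threads.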

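Rancang Bangun Aplikasi Chatbot Sebagai Media Pencarian Informasi Anime Menggunakan Regular Expression Pattern Matching [Design and Development of a Chatbot Application as an Anime Information Search Medium Using Regular Expression Pattern Matching]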

Jurnal ULTIMATICS ◽

10.31937/ti.v9i1.559 ◽

2017 ◽

Vol 9 (1) ◽

pp. 19-24 ◽

Cited By ~ 1

Author(s):

David Domarco ◽

Ni Made Satvika Iswari

Keyword(s):

Information Retrieval ◽

Expression Pattern ◽

Pattern Matching ◽

Language Processing ◽

Regular Expression ◽

Technology Development ◽

Data Retrieval ◽

Index Terms ◽

Retrieval Engine ◽

Behavioral Intention To Use

Technology development has affected many areas of life, especially the entertainment field. One of the fastest-growing entertainment industries is anime. Anime has evolved into a trend and a hobby, especially among people in Asia. The number of anime fans grows every year, and fans try to dig up as much information as they can about their favorite anime. Therefore, this study developed a chatbot application as an anime information retrieval medium using the regular expression pattern matching method. The application is intended to help anime fans search for information about the anime they like. With it, users gain convenient and interactive anime data retrieval that is not available when searching via search engines. The chatbot application met the standards of an information retrieval engine with very good results: 72% precision and 100% recall, giving a harmonic mean of 83.7%. As a hedonic application, the chatbot influenced Behavioral Intention to Use by 83% and Immersion by 82%. Index Terms: anime, chatbot, information retrieval, Natural Language Processing (NLP), Regular Expression Pattern Matching
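
The reported harmonic mean follows directly from the stated precision and recall. The short check below uses the standard F1 formula; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: 72% precision, 100% recall.
f1 = f1_score(0.72, 1.00)
print(f"{f1:.1%}")  # 83.7%
```

Here 2 × 0.72 × 1.00 / 1.72 ≈ 0.837, which agrees with the 83.7% figure in the abstract.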

A novel optimal multi-pattern matching method with wildcards for DNA sequence

Technology and Health Care ◽

10.3233/thc-218012 ◽

2021 ◽

Vol 29 ◽

pp. 115-124

Author(s):

Xinlu Wang ◽

Ahmed A.F. Saif ◽

Dayou Liu ◽

Yungang Zhu ◽

Jon Atli Benediktsson

Keyword(s):

DNA Sequence ◽

Pattern Matching ◽

Health Informatics ◽

State Of The Art ◽

Machine Language ◽

Data Sets ◽

Fundamental Issue ◽

Matching Method ◽

DNA Sequence Alignment ◽

The Given

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operations for identifying which gene family may contain a given sequence, and pattern matching for DNA sequences has been a fundamental issue in biomedical engineering, biotechnology and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimal multi-pattern matching method with wildcards for DNA sequences. METHODS: The proposed method packs the patterns and a sliding window of the text; the window slides along the packed text, matching against the stored packed patterns. RESULTS: Three data sets were used to test the performance of the proposed algorithm, which proved more efficient than its competitors because its operations are close to machine language. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms the state-of-the-art methods and is especially effective for DNA sequences.
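
One way to picture packed matching with wildcards is the sketch below. It is an illustration under stated assumptions, not the authors' algorithm: each base is packed into 2 bits, the wildcard symbol is assumed to be `N`, and a per-pattern bit mask zeroes out wildcard positions so that a single XOR and AND compares a whole window at once. The names `pack_pattern` and `find_packed` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2-bit encoding of each base


def pack_pattern(pattern, wildcard="N"):
    """Pack a pattern into (value, mask); wildcard positions get mask 00."""
    value = mask = 0
    for ch in pattern:
        value <<= 2
        mask <<= 2
        if ch != wildcard:
            value |= CODE[ch]
            mask |= 0b11
    return value, mask


def find_packed(text, patterns, wildcard="N"):
    """Return sorted (start, pattern) pairs; text must contain only ACGT."""
    hits = []
    for pat in patterns:
        m = len(pat)
        if m == 0 or m > len(text):
            continue
        value, mask = pack_pattern(pat, wildcard)
        keep = (1 << (2 * m)) - 1        # keeps the last m bases of the window
        window = 0
        for i, ch in enumerate(text):
            window = ((window << 2) | CODE[ch]) & keep
            # XOR exposes differing bits; the mask hides wildcard positions.
            if i >= m - 1 and (window ^ value) & mask == 0:
                hits.append((i - m + 1, pat))
    return sorted(hits)
```

For example, `find_packed("ACGTACGA", ["ACG", "GNA"])` matches `ACG` at positions 0 and 4 and the wildcard pattern `GNA` at position 2. The comparison cost per window is a few integer operations, which is the sense in which packed methods work close to the machine level.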

Reduction of Polycyclic Aromatic Hydrocarbons (PAHs) in Sesame Oil Using Cellulosic Aerogel

Foods ◽

10.3390/foods10030644 ◽

2021 ◽

Vol 10 (3) ◽

pp. 644

Author(s):

Do-Yeong Kim ◽

Boram Kim ◽

Han-Seung Shin

Keyword(s):

Polycyclic Aromatic Hydrocarbons ◽

Aromatic Hydrocarbons ◽

High Performance ◽

Oil Quality ◽

Acid Value ◽

Quality Parameters ◽

Sesame Oil ◽

Miscanthus Sinensis ◽

Polycyclic Aromatic ◽

And Control

The effect of cellulosic aerogel treatments used for adsorption of four polycyclic aromatic hydrocarbons (PAHs), namely benzo[a]anthracene, chrysene, benzo[b]fluoranthene, and benzo[a]pyrene (BaP), generated during the manufacture of sesame oil was evaluated. In this study, eulalia (Miscanthus sinensis var. purpurascens)-based cellulosic aerogel (adsorbent) was prepared, and high-performance liquid chromatography with fluorescence detection was used to determine PAHs in sesame oil. In addition, changes in the sesame oil quality parameters (acid value, peroxide value, color, and fatty acid composition) following cellulosic aerogel treatment were evaluated. The four PAHs and their total levels decreased in sesame oil samples roasted under different conditions (p < 0.05) following treatment with cellulosic aerogel. In particular, highly carcinogenic BaP was not detected after treatment with cellulosic aerogel. Moreover, there were no noticeable changes in the quality parameters between treated and control samples. It was concluded that eulalia-based cellulosic aerogel is suitable for the reduction of PAHs in sesame oil and can be used as an eco-friendly adsorbent.

Advanced high-performance processing tools for diagnostics and control in fusion devices

Fusion Engineering and Design ◽

10.1016/j.fusengdes.2021.112529 ◽

2021 ◽

Vol 170 ◽

pp. 112529

Author(s):

N. Cruz ◽

A.J.N. Batista ◽

J.M. Cardoso ◽

B.B. Carvalho ◽

P.F. Carvalho ◽

...

Keyword(s):

High Performance ◽

Diagnostics And Control ◽

And Control

Learning emotional word embeddings for sentiment analysis

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201993 ◽

2021 ◽

pp. 1-13

Author(s):

Qingtian Zeng ◽

Xishi Zhao ◽

Xiaohui Hu ◽

Hua Duan ◽

Zhongying Zhao ◽

...

Keyword(s):

Sentiment Analysis ◽

Language Processing ◽

State Of The Art ◽

Research Problem ◽

Emotional Word ◽

Classification Model ◽

Data Sets ◽

Word Embeddings ◽

Real World Data ◽

Text Documents

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.

Detection of Antimicrobial Residues in Poultry Litter: Monitoring a Risk through a Selective and Sensitive HPLC–MS/MS Method

Animals ◽

10.3390/ani11051399 ◽

2021 ◽

Vol 11 (5) ◽

pp. 1399

Author(s):

Karina Yévenes ◽

Ekaterina Pokrant ◽

Lina Trincado ◽

Lisette Lapierre ◽

Nicolás Galarce ◽

...

Keyword(s):

High Performance ◽

Limit Of Detection ◽

Poultry Litter ◽

Chromatography Tandem Mass Spectrometry ◽

Antimicrobial Residues ◽

Field Samples ◽

Liquid Chromatography Tandem Mass ◽

Commercial Poultry ◽

Agriculture Fertilizer ◽

And Control

Tetracyclines, sulphonamides, and quinolones are families of antimicrobials (AMs) widely used in the poultry industry; birds can excrete up to 90% of the AMs administered, which accumulate in poultry litter. Worryingly, poultry litter is widely used as an agricultural fertilizer, contributing to the spread of AM residues in the environment. The aim of this research was to develop a method that could simultaneously identify and quantify three AM families in poultry litter by high-performance liquid chromatography–tandem mass spectrometry (HPLC–MS/MS). Samples of AM-free poultry litter were used to validate the method according to 657/2002/EC and VICH GL49. Results indicate that the limits of detection (LOD) ranged from 8.95 to 20.86 μg kg−1, while the limits of quantitation (LOQ) were between 26.85 and 62.58 μg kg−1 for tetracycline, 4-epi-tetracycline, oxytetracycline, 4-epi-oxytetracycline, enrofloxacin, ciprofloxacin, flumequine, sulfachloropyridazine, and sulfadiazine. Recoveries obtained ranged from 93 to 108%. The analysis of field samples obtained from seven commercial poultry flocks confirmed the adequacy of the method, since it detected mean concentrations ranging from 20 to 10,364 μg kg−1. This provides an accurate and reliable tool to monitor AM residues in poultry litter and to control its use as agricultural fertilizer.

A Novel Fast-Locking ADPLL Based on Bisection Method

Electronics ◽

10.3390/electronics10121382 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1382

Author(s):

Xiaoying Deng ◽

Huazhang Li ◽

Mingcheng Zhu

Keyword(s):

High Performance ◽

Tuning Range ◽

Frequency Control ◽

Control Word ◽

CMOS Process ◽

Bisection Method ◽

Reference Frequency ◽

Wide Tuning Range ◽

Cascade Structure ◽

And Control

Based on the bisection method, a new fast-locking All-Digital Phase-Locked Loop (ADPLL) structure is proposed. Its structure and locking method differ from those of traditional ADPLLs. The control circuit consists of a frequency compare module, a mode-adjust module and a control module, and is responsible for adjusting the frequency control word of the digitally controlled oscillator (DCO) by the bisection method according to the result of the frequency comparison between the reference clock and the restructure clock. With a high-frequency cascade structure, the DCO achieves a wide tuning range and high resolution. The proposed ADPLL was designed in the SMIC 180 nm CMOS process. The measured results show a lock range of 640 to 1920 MHz with a 40 MHz reference frequency. The ADPLL core occupies 0.04 mm², and the power consumption is 29.48 mW with a 1.8 V supply. The longest locking time is 23 reference cycles (575 ns), at 1.92 GHz. When the ADPLL operates at 1.28 GHz–1.6 GHz, the locking time is the shortest, only 9 reference cycles (225 ns). Compared with recent high-performance ADPLLs, our design shows the advantages of small area, short locking time, and wide tuning range.
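
The bisection idea and the cycle-to-time arithmetic can be sketched behaviorally. This is not the authors' circuit: the linear DCO model `toy_dco_hz`, its 10-bit control word, and its 1.25 MHz step are hypothetical values chosen only so that the toy range starts at 640 MHz. The search assumes the DCO frequency increases monotonically with the control word.

```python
def toy_dco_hz(word):
    """Hypothetical linear DCO: 640 MHz at word 0, 1.25 MHz per step."""
    return 640_000_000 + word * 1_250_000


def bisection_lock(target_hz, dco_hz, word_bits):
    """Find the smallest control word whose frequency reaches target_hz.

    Returns (word, comparisons). Each comparison halves the search range,
    so at most word_bits comparisons are needed.
    """
    lo, hi = 0, (1 << word_bits) - 1
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if dco_hz(mid) < target_hz:
            lo = mid + 1   # too slow: search the upper half
        else:
            hi = mid       # fast enough: search the lower half
    return lo, comparisons


def cycles_to_ns(cycles, ref_hz):
    """Convert a number of reference cycles to nanoseconds."""
    return cycles * 1e9 / ref_hz


word, steps = bisection_lock(1_280_000_000, toy_dco_hz, word_bits=10)
```

With the 40 MHz reference stated in the abstract, one reference cycle lasts 25 ns, so `cycles_to_ns(23, 40e6)` and `cycles_to_ns(9, 40e6)` reproduce the reported 575 ns and 225 ns locking times.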

Analysis of Risk Factors in Dementia Through Machine Learning

Journal of Alzheimer's Disease ◽

10.3233/jad-200955 ◽

2020 ◽

pp. 1-17

Author(s):

Francisco Javier Balea-Fernandez ◽

Beatriz Martinez-Vega ◽

Samuel Ortega ◽

Himar Fabelo ◽

Raquel Leon ◽

...

Keyword(s):

Machine Learning ◽

Optimization Algorithms ◽

Progressive Increase ◽

Control Group ◽

Data Sets ◽

Modifiable Factors ◽

Validation Set ◽

The One ◽

And Control ◽

Potential Tool

Background: Sociodemographic data indicate a progressive increase in life expectancy and in the prevalence of Alzheimer’s disease (AD). AD is regarded as one of the greatest public health problems. Its etiology is twofold: non-modifiable factors on the one hand and modifiable factors on the other. Objective: This study aims to develop a processing framework based on machine learning (ML) and optimization algorithms to study sociodemographic, clinical, and analytical variables, selecting the best combination among them for accurate discrimination between controls and subjects with major neurocognitive disorder (MNCD). Methods: This research is based on an observational-analytical design. Two research groups were established: an MNCD group (n = 46) and a control group (n = 38). ML and optimization algorithms were employed to automatically diagnose MNCD. Results: Twelve out of 37 variables were identified in the validation set as the most relevant for MNCD diagnosis. A sensitivity of 100% and a specificity of 71% were achieved using a Random Forest classifier. Conclusion: ML is a potential tool for automatic prediction of MNCD which can be applied to relatively small preclinical and clinical data sets. These results can be interpreted to support the influence of the environment on the development of AD.
