Character Sequences: Latest Research Papers

Fake news spreaders profiling using N-grams of various types and SHAP-based feature selection

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219233 ◽

2021 ◽

pp. 1-12

Author(s):

Fazlourrahman Balouchzahi ◽

Grigori Sidorov ◽

Hosahalli Lakshmaiah Shashirekha

Keyword(s):

Language Processing ◽

Support Vector ◽

Feature Engineering ◽

Learning Approaches ◽

Fake News ◽

Shared Task ◽

Kernel Logistic Regression ◽

Average Accuracy ◽

Rbf Kernel ◽

Character Sequences

Complex learning approaches with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite huge progress in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks, such as Text Classification (TC), for which basic Machine Learning (ML) classifiers outperform DL or TL approaches. In addition, an efficient feature engineering step can significantly improve the performance of ML-based systems. To check the efficacy of ML-based systems and feature engineering on TC, this paper explores char, character sequences, syllables, and word n-grams, as well as syntactic n-grams, as features and uses SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting over four ML classifiers, namely Support Vector Machine (SVM) with linear and Radial Basis Function (RBF) kernels, Logistic Regression (LR), and Random Forest (RF), were trained and evaluated on the Fake News Spreaders Profiling (FNSP) shared task dataset from PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish. The proposed models achieved an average accuracy of 0.785 across both languages and outperformed the best models submitted to this task.
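As a rough illustration of the ideas named in this abstract, the sketch below extracts character and word n-grams and combines classifier outputs by hard and soft voting. It is not the authors' pipeline: the function names, the label strings, and the probability dictionaries are hypothetical, and the SHAP-based selection step is omitted.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, splitting on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hard_vote(predictions):
    """Hard voting: return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Soft voting: average class probabilities, then take the best class.

    `probabilities` holds one {label: probability} dict per classifier
    (a hypothetical format chosen for this sketch).
    """
    labels = probabilities[0].keys()
    average = {
        label: sum(p[label] for p in probabilities) / len(probabilities)
        for label in labels
    }
    return max(average, key=average.get)


if __name__ == "__main__":
    print(char_ngrams("fake", 2))
    print(word_ngrams("fake news spreads fast", 2))
    print(hard_vote(["spreader", "spreader", "non-spreader", "spreader"]))
```

Hard voting counts labels only, while soft voting can side with a confident minority, which is why the two schemes may disagree on the same set of classifiers.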

Named Entity Recognition with Gating Mechanism and Parallel BiLSTM

Journal of Web Engineering ◽

10.13052/jwe1540-9589.20413 ◽

2021 ◽

Author(s):

Yenan Yi ◽

Yijie Bian

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Weighted Sum ◽

Named Entity ◽

Gating Mechanism ◽

Final Output ◽

Different Types ◽

Character Sequences ◽

The One ◽

Types Of Information

In this paper, we propose a novel neural network for named entity recognition that is improved in two respects. On the one hand, our model uses a parallel BiLSTM structure to generate character-level word representations. By feeding the character sequences of words into several independent, parallel BiLSTMs, we obtain word representations from different representation subspaces, because the parameters of these BiLSTMs are randomly initialized. This method enhances the expressive power of character-level word representations. On the other hand, we use a two-layer BiLSTM with a gating mechanism to model sentences. Since the features extracted from text by each layer of a multi-layer LSTM contain different types of information, we use the gating mechanism to assign appropriate weights to the outputs of each layer and take the weighted sum of these outputs as the final output for named entity recognition. Our model changes only the network structure and needs no feature engineering or external knowledge source, making it a complete end-to-end NER model. We evaluated our model on the CoNLL-2003 English and German datasets and obtained better results than the baseline models.
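The weighted sum of layer outputs described in this abstract can be sketched as follows. The abstract does not say how the gate is computed, so this sketch assumes one scalar score per layer, normalized with a softmax; both the scoring scheme and the function names are illustrative assumptions, not the paper's formulation.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_sum(layer_outputs, gate_scores):
    """Combine per-layer output vectors into one vector.

    layer_outputs: one vector (list of floats) per layer, all of equal length.
    gate_scores:   one raw score per layer (assumed here to be learned scalars).
    """
    weights = softmax(gate_scores)
    dim = len(layer_outputs[0])
    return [
        sum(w * out[k] for w, out in zip(weights, layer_outputs))
        for k in range(dim)
    ]


if __name__ == "__main__":
    first_layer = [1.0, 2.0]
    second_layer = [3.0, 4.0]
    # Equal scores give equal weights, so the result is the plain average.
    print(gated_sum([first_layer, second_layer], [0.0, 0.0]))
```

With unequal scores the output shifts toward the layer with the higher score, which is the behavior a gate is meant to learn.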

An Approach for Character Recognition in Piston Cavity with Faster R-CNN and Prior Knowledge Library of Character Sequences

2021 International Conference on Computer Communication and Artificial Intelligence (CCAI) ◽

10.1109/ccai50917.2021.9447471 ◽

2021 ◽

Author(s):

Lan Junfeng ◽

Wang Hongyan ◽

Li Jinping

Keyword(s):

Prior Knowledge ◽

Character Recognition ◽

Character Sequences

Finding Better Subwords for Tibetan Neural Machine Translation

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3448216 ◽

2021 ◽

Vol 20 (2) ◽

pp. 1-11

Author(s):

Yachao Li ◽

Jing Jiang ◽

Jia Yangji ◽

Ning Ma

Keyword(s):

Machine Translation ◽

Experimental Results ◽

Linguistic Features ◽

Neural Machine Translation ◽

Word Structure ◽

Low Resource ◽

Segmentation Methods ◽

Character Sequences ◽

First Words

Subword segmentation plays an important role in Tibetan neural machine translation (NMT). Tibetan words have a two-level structure: a word consists of a sequence of syllables, and a syllable consists of a sequence of characters. Based on this word structure, we propose two methods for Tibetan subword segmentation, namely syllable-based and character-based methods. The former generates subwords from Tibetan syllables, and the latter from Tibetan characters. We carry out experiments with both subword segmentation methods on low-resource Tibetan-to-Chinese NMT. The experimental results show that both methods improve translation performance, with the segmentation based on character sequences achieving the better results. Overall, our proposed character-based subword segmentation is simpler and more effective. Moreover, it achieves better experimental results without requiring much attention to the linguistic features of Tibetan.
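The two levels of word structure mentioned in this abstract can be illustrated with a small sketch. It assumes that syllables are delimited by the tsheg mark (U+0F0B) and treats each Unicode code point as one character; the paper's actual subword learning step is not reproduced here, and the function names are hypothetical.

```python
TSHEG = "\u0F0B"  # TIBETAN MARK INTERSYLLABIC TSHEG, the syllable delimiter


def syllable_units(word):
    """Split a Tibetan word into syllables at each tsheg."""
    return [syllable for syllable in word.split(TSHEG) if syllable]


def character_units(word):
    """Split a Tibetan word into code points, dropping the tsheg marks."""
    return [ch for ch in word if ch != TSHEG]


if __name__ == "__main__":
    # "bkra shis", written with escapes so the code points are explicit.
    word = "\u0F56\u0F40\u0FB2\u0F0B\u0F64\u0F72\u0F66"
    print(syllable_units(word))   # syllable-level units
    print(character_units(word))  # character-level units
```

A subword learner would then merge these base units into larger subwords; the choice of base unit is what distinguishes the two methods.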

Data Structures for Ordered Short Character-Sequences

2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC) ◽

10.1109/ccwc51732.2021.9376076 ◽

2021 ◽

Author(s):

Sudarshan S. Chawathe

Keyword(s):

Data Structures ◽

Character Sequences

New Construction of Family of MLCS Algorithms

Journal of Healthcare Engineering ◽

10.1155/2021/6636710 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Haihe Shi ◽

Jun Wang

Keyword(s):

Sequence Analysis ◽

Text Processing ◽

Longest Common Subsequence ◽

Feature Model ◽

Generic Programming ◽

Biological Sequence ◽

Biological Sequence Analysis ◽

Character Sequences ◽

Common Subsequence ◽

Component Assembly

The multiple longest common subsequence (MLCS) problem involves finding all the longest common subsequences of multiple character sequences. This problem is encountered in a variety of areas, including data mining, text processing, and bioinformatics, and is particularly important for biological sequence analysis. Taking the MLCS problem and the algorithms for its solution as the research domain, this study analyzes the domain of MLCS algorithms, extracts the features that algorithms in the domain do and do not have in common, and creates a domain feature model for the MLCS problem using generic programming, domain engineering, abstraction, and related technologies. A component library for the domain is designed based on this feature model, and the partition and recur (PAR) platform is used to ensure that highly reliable MLCS algorithms can be quickly assembled from components. This study provides a valuable reference for obtaining rapid solutions to problems of biological sequence analysis and improves the reliability and flexibility of algorithm assembly.
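To make the problem statement concrete, here is a minimal sketch that returns all longest common subsequences of several strings. It uses plain memoized recursion over position tuples, which is only practical for short inputs; it is not one of the component-assembled algorithms from the study, and the function name is hypothetical.

```python
from functools import lru_cache


def mlcs(sequences):
    """Return the set of all longest common subsequences of `sequences`."""
    seqs = tuple(sequences)
    if not seqs:
        return {""}
    # Only symbols present in every sequence can appear in a common subsequence.
    alphabet = set(seqs[0]).intersection(*(set(s) for s in seqs[1:]))

    @lru_cache(maxsize=None)
    def solve(positions):
        # positions[i] is the first index of seqs[i] that may still be used.
        best_len, best = 0, {""}
        for ch in alphabet:
            next_positions = []
            for seq, pos in zip(seqs, positions):
                found = seq.find(ch, pos)  # earliest usable occurrence of ch
                if found < 0:
                    break
                next_positions.append(found + 1)
            else:
                tails = solve(tuple(next_positions))
                length = 1 + len(next(iter(tails)))
                if length > best_len:
                    best_len, best = length, {ch + t for t in tails}
                elif length == best_len:
                    best |= {ch + t for t in tails}
        return frozenset(best)

    return set(solve((0,) * len(seqs)))


if __name__ == "__main__":
    print(sorted(mlcs(["ABCBDAB", "BDCABA"])))
    print(sorted(mlcs(["abc", "abd", "aeb"])))
```

Matching each symbol at its earliest available position loses nothing, because any common subsequence that starts with that symbol can be shifted to start there.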

A Vision-Based Framework for Spotting and Segmentation of Gesture-Based Assamese Characters Written in the Air

Journal of Information Technology Research ◽

10.4018/jitr.2021010105 ◽

2021 ◽

Vol 14 (1) ◽

pp. 70-91

Author(s):

Ananya Choudhury ◽

Kandarpa Kumar Sarma

Keyword(s):

Character Recognition ◽

Hand Gesture ◽

Automatic Method ◽

Physical Impairments ◽

Statistical Feature ◽

Hand Segmentation ◽

Character Sequences ◽

Continuous Character ◽

Important Constituent ◽

Gesture Spotting

Automatic gesture spotting and segmentation, which determines the meaningful gesture patterns in continuous gesture-based character sequences, is a challenging task. This paper proposes a vision-based automatic method that simultaneously handles hand gesture spotting and segmentation of gestural characters embedded in a continuous character stream by employing a hybrid geometrical and statistical feature set. This framework is intended to form an important constituent of gesture-based character recognition (GBCR) systems, which have recently seen strong demand as assistive aids for overcoming the restraints faced by people with physical impairments. The performance of the proposed system is validated on the vowels and numerals of Assamese vocabulary. Another feature of the proposed system is an effective hand segmentation module, which enables it to handle complex background settings.

FOUR-DIMENSIONAL ENCODING OF CHARACTER SEQUENCES AND EVALUATION OF THEIR SIMILARITIES AND DIFFERENCES

Proceedings of the Technical University of Sofia ◽

10.47978/tus.2020.70.02.008 ◽

2020 ◽

Vol 70 (2) ◽

Author(s):

Martin Marinov

Keyword(s):

Dimensional Space ◽

Two Dimensional ◽

Text Data ◽

Loss Of Information ◽

String Comparison ◽

Main Disadvantage ◽

Character Sequences ◽

Similarities And Differences

This paper describes a string encoding algorithm that produces sparse distributed representations (SDR) of text data. In essence, this is a modified version of a prior algorithm, and the modifications have the following benefits: the ability to decode data without loss of information; greatly increased capacity of the encoding space; and the possibility of performing more detailed comparisons of encoded strings. The main disadvantage compared to the prior algorithm is the increased complexity of the procedure for encoded string comparison. This is due to the use of a four-dimensional encoding space instead of a two-dimensional space.

The Timer Inremental Compression of Data and Information

WSEAS TRANSACTIONS ON MATHEMATICS ◽

10.37394/23206.2020.19.41 ◽

2020 ◽

Vol 19 ◽

Keyword(s):

Human Computer Interaction ◽

Data Compression ◽

Quantitative Data ◽

Discrete Model ◽

Parallel Implementation ◽

Public Key ◽

Analysis And Evaluation ◽

Character Sequences ◽

Computer Interaction

The ability to find short representations, i.e. to compress data, is crucial for many intelligent systems. This paper is devoted to data compression and to a transform-based quantitative data compression technique involving quick enumeration in a unary-binary time-based numeral system (NS). The symbols comprising the alphabets of human-computer interaction languages (HCIL), which are used in an informational message (IM), are collected in primary code tables, such as the ASCII table. We propose a statistically oriented data compression method that uses unconventional timer encryption and information encoding. A probabilistic discrete model of the set of character sequences is constructed, and some probabilistic algorithms associated with recovering text from its public key and its cipher are characterized. We show that the method can be implemented in parallel by building a block of timer tags. The necessary complexity estimates are obtained. The method can be used to compress SMS messages. A probabilistic-statistical analysis and an evaluation of the method's effectiveness are provided.

Characterization of Beta Chain Fibrinogen (FGB) Gene From Maleo (Macrocephalon Maleo S. Muller 1846) Tuva Village Gumbasa Sub-District Sigi Regency Central Sulawesi

Jurnal Riset Pendidikan MIPA ◽

10.22487/j25490192.2019.v3.i2.pp94-100 ◽

2019 ◽

Vol 3 (2) ◽

pp. 94-100

Author(s):

Zakiah Zulfitri Syam ◽

I Made Budiarsa ◽

Astija Astija

Keyword(s):

Molecular Marker ◽

Base Composition ◽

Gene Sequence ◽

Nucleotide Composition ◽

Neighbor Joining ◽

Beta Chain ◽

Character Sequences ◽

Central Sulawesi ◽

Group Species

The FGB gene plays a role in synthesizing β-fibrinogen proteins and is often used as a molecular marker because it is useful for studying bird phylogenetics. The purpose of this study was to describe the character sequences of the FGB gene in Maleo birds (Macrocephalon maleo S. Muller 1846). The study used exploratory laboratory methods. Alignment was performed with the MEGA6 program (Clustal W). Phylogenetic trees were constructed using the Neighbor-Joining algorithm and the Jukes-Cantor evolution model in MEGA6. The sample was 0.3 ml of blood from Maleo birds, with members of the Megapodiidae group used for comparison. The results indicate that the FGB sequence is 576 bp long, with a base composition of 32.2% A, 30.4% T, 17.0% C and 20.3% G, so the sequence is adenine-rich. Analysis of the FGB gene phylogenetic tree produced a reasonably good topology that resolved relationships at the interspecies level and placed the maleo bird on its own lineage relative to other Megapodiidae genera.
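Two quantities mentioned in this abstract, base composition and the Jukes-Cantor model, can be illustrated with a short sketch. The study itself used MEGA6; the code below only shows the standard definitions, namely percentage counts per base and the Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def base_composition(sequence):
    """Return the percentage of A, T, C and G among the unambiguous bases."""
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: 100.0 * counts[base] / total for base in "ATCG"}


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in "ACGT" and y in "ACGT"  # skip gaps and ambiguous symbols
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    print(base_composition("AATG"))
    p = p_distance("ACGTACGTAC", "ACGTACGTAA")
    print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, since it accounts for multiple substitutions at the same site.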
