character sequences
Recently Published Documents


TOTAL DOCUMENTS

36
(FIVE YEARS 14)

H-INDEX

4
(FIVE YEARS 1)

2021 ◽  
pp. 1-12
Author(s):  
Fazlourrahman Balouchzahi ◽  
Grigori Sidorov ◽  
Hosahalli Lakshmaiah Shashirekha

Complex learning approaches with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite major advances in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks, such as Text Classification (TC), for which basic Machine Learning (ML) classifiers outperform DL or TL approaches. In addition, an efficient feature engineering step can significantly improve the performance of ML-based systems. To check the efficacy of ML-based systems and feature engineering on TC, this paper explores characters, character sequences, syllables, word n-grams, and syntactic n-grams as features, and uses SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting over four ML classifiers, namely Support Vector Machine (SVM) with linear and Radial Basis Function (RBF) kernels, Logistic Regression (LR), and Random Forest (RF), were trained and evaluated on the Fake News Spreaders Profiling (FNSP) shared task dataset of PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish. The proposed models achieved an average accuracy of 0.785 across both languages and outperformed the best models submitted to this task.
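
As a rough illustration of two ideas named in this abstract, character-sequence features and hard versus soft voting, the following is a minimal sketch in plain Python. The function names (`char_ngrams`, `hard_vote`, `soft_vote`) and the n-gram range are assumptions made for illustration; the sketch does not reproduce the paper's feature set, SHAP-based selection, or classifiers.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def hard_vote(labels):
    """Majority vote over predicted labels (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas):
    """Average per-class probabilities and return the highest-scoring class.

    probas is a list of dicts mapping class label -> probability,
    one dict per (hypothetical) base classifier.
    """
    classes = list(probas[0])
    avg = {c: sum(p[c] for p in probas) / len(probas) for c in classes}
    return max(avg, key=avg.get)
```

Hard voting counts each base classifier's label once, whereas soft voting lets a confident classifier outweigh a hesitant one, so the two schemes can disagree on the same set of base predictions.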


Author(s):  
Yenan Yi ◽  
Yijie Bian

In this paper, we propose a novel neural network for named entity recognition (NER) that improves on existing models in two respects. First, our model uses a parallel BiLSTM structure to generate character-level word representations. By feeding the character sequences of words into several independent, parallel BiLSTMs, we obtain word representations from different representation subspaces, because the parameters of these BiLSTMs are randomly initialized. This enhances the expressive power of the character-level word representations. Second, we use a two-layer BiLSTM with a gating mechanism to model sentences. Since the features that each layer of a multi-layer LSTM extracts from text contain different types of information, we use the gating mechanism to assign appropriate weights to the outputs of each layer and take the weighted sum of these outputs as the final output for named entity recognition. Our model changes only the network structure and needs no feature engineering or external knowledge source, so it is a complete end-to-end NER model. We evaluated our model on the CoNLL-2003 English and German datasets and obtained better results than the baseline models.
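
The weighted sum over layer outputs described in this abstract can be sketched as follows. This is only an illustration under a stated assumption: the gate is modelled here as a softmax over one scalar score per layer, which the abstract does not specify, and `gated_sum` and its arguments are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_sum(layer_outputs, gate_scores):
    """Combine per-layer output vectors into one vector.

    layer_outputs: list of equal-length vectors, one per BiLSTM layer.
    gate_scores:   one scalar per layer (assumed form of the gate).
    """
    weights = softmax(gate_scores)
    dim = len(layer_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, layer_outputs))
        for d in range(dim)
    ]
```

With equal gate scores the result is the plain average of the layers; raising one layer's score shifts the combined vector toward that layer's output.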


Author(s):  
Yachao Li ◽  
Jing Jiang ◽  
Jia Yangji ◽  
Ning Ma

Subword segmentation plays an important role in Tibetan neural machine translation (NMT). Tibetan words have a two-level structure: a word consists of a sequence of syllables, and a syllable consists of a sequence of characters. Based on this word structure, we propose two methods for Tibetan subword segmentation, one syllable-based and one character-based. The former generates subwords from Tibetan syllables, and the latter from Tibetan characters. We carry out experiments with both subword segmentation methods on low-resource Tibetan-to-Chinese NMT. The experimental results show that both methods improve translation performance, with segmentation based on character sequences achieving the better results. Overall, our proposed character-based subword segmentation is simpler and more effective, and it achieves better experimental results without requiring much attention to the linguistic features of Tibetan.
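
The two-level structure described in this abstract can be illustrated with a small sketch that splits Tibetan text into syllable units and character units. It assumes that syllables are delimited by the tsheg mark (U+0F0B) and treats each Unicode code point as one character; it shows only the base units, not the paper's subword learning procedure, and the function names are hypothetical.

```python
# Tibetan inter-syllable mark (tsheg), assumed here to be the syllable delimiter.
TSHEG = "\u0f0b"


def syllable_units(text):
    """Split text into syllables on the tsheg mark, dropping empty pieces."""
    return [syl for syl in text.split(TSHEG) if syl]


def character_units(text):
    """Split text into single code points, skipping the tsheg delimiter."""
    return [ch for syl in syllable_units(text) for ch in syl]


# Example: a two-syllable word written with escapes to avoid encoding issues.
word = "\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51"
print(syllable_units(word))
print(character_units(word))
```

A subword learner would then merge frequent adjacent units starting from one of these two base vocabularies; the character-level start yields a smaller initial vocabulary and longer sequences.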


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Haihe Shi ◽  
Jun Wang

The multiple longest common subsequence (MLCS) problem involves finding all the longest common subsequences of multiple character sequences. This problem arises in a variety of areas, including data mining, text processing, and bioinformatics, and is particularly important for biological sequence analysis. Taking the MLCS problem and the algorithms for its solution as the research domain, this study analyzes the domain of MLCS algorithms, extracts the features that algorithms in the domain do and do not have in common, and creates a domain feature model for the MLCS problem using generic programming, domain engineering, abstraction, and related technologies. A component library for the domain is designed based on this feature model, and the partition and recur (PAR) platform is used to ensure that highly reliable MLCS algorithms can be assembled quickly from components. This study provides a valuable reference for obtaining rapid solutions to problems of biological sequence analysis and improves the reliability and flexibility of algorithm assembly.
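
To make the problem statement concrete, here is a minimal sketch that returns the length and the full set of longest common subsequences for a handful of short strings. It is a plain memoized search over index tuples, not the component-based PAR approach the abstract describes, and its running time grows quickly with the number and length of the sequences. The name `all_mlcs` is hypothetical.

```python
from functools import lru_cache


def all_mlcs(seqs):
    """Return (length, set of all longest common subsequences) of seqs."""
    seqs = tuple(seqs)
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = set(seqs[0]).intersection(*map(set, seqs[1:]))

    @lru_cache(maxsize=None)
    def solve(pos):
        # pos[i] is the first index of seqs[i] that may still be used.
        best_len, best = 0, {""}
        for ch in alphabet:
            nxt = []
            for s, p in zip(seqs, pos):
                j = s.find(ch, p)  # earliest occurrence of ch at or after p
                if j < 0:
                    break
                nxt.append(j + 1)
            else:
                length, tails = solve(tuple(nxt))
                if length + 1 > best_len:
                    best_len, best = length + 1, {ch + t for t in tails}
                elif length + 1 == best_len:
                    best = best | {ch + t for t in tails}
        return best_len, frozenset(best)

    return solve((0,) * len(seqs))


print(all_mlcs(["ABCBDAB", "BDCABA"]))
```

Matching each chosen character at its earliest remaining occurrence in every sequence loses no solutions, because any common subsequence that starts with that character can be shifted to use the earliest occurrence.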


2021 ◽  
Vol 14 (1) ◽  
pp. 70-91
Author(s):  
Ananya Choudhury ◽  
Kandarpa Kumar Sarma

Automatic gesture spotting and segmentation, that is, determining the meaningful gesture patterns in continuous gesture-based character sequences, is a challenging task. This paper proposes a vision-based automatic method that simultaneously handles hand gesture spotting and segmentation of gestural characters embedded in a continuous character stream by employing a hybrid geometrical and statistical feature set. This framework is intended to form an important constituent of gesture-based character recognition (GBCR) systems, which have recently come into great demand as assistive aids for overcoming the restraints faced by people with physical impairments. The performance of the proposed system is validated on the vowels and numerals of the Assamese vocabulary. Another feature of the proposed system is an effective hand segmentation module, which enables it to handle complex background settings.


Author(s):  
Martin Marinov

This paper describes a string encoding algorithm that produces sparse distributed representations (SDR) of text data. In essence, it is a modified version of a prior algorithm, and the modifications have the following benefits: the ability to decode data without loss of information; greatly increased capacity of the encoding space; and the possibility of performing more detailed comparisons of encoded strings. The main disadvantage compared to the prior algorithm is the increased complexity of the procedure for comparing encoded strings. This is due to the use of a four-dimensional encoding space instead of a two-dimensional one.


2020 ◽  
Vol 19 ◽  

The ability to find short representations, i.e. to compress data, is crucial for many intelligent systems. This paper is devoted to data compression and to a transform-based quantitative data compression technique involving quick enumeration in a unary-binary time-based numeral system (NS). The symbols comprising the alphabets of human-computer interaction languages (HCIL), which are used in an informational message (IM), are collected in primary code tables, such as the ASCII table. We propose a statistically oriented data compression method that uses unconventional timer-based encryption and encoding of information. A discrete probabilistic model of the set of character sequences is constructed, and some probabilistic algorithms associated with the recovery of a text from its public key and its cipher are characterized. We show that the method can be implemented in parallel by building a block of timer tags. The necessary complexity estimates are obtained. The method can be used to compress SMS messages. A probabilistic and statistical analysis and an evaluation of their effectiveness are presented.


2019 ◽  
Vol 3 (2) ◽  
pp. 94-100
Author(s):  
Zakiah Zulfitri Syam ◽  
I Made Budiarsa ◽  
Astija Astija

The FGB gene plays a role in synthesizing β-fibrinogen proteins and is often used as a molecular marker because it is useful for studying bird phylogenetics. The purpose of this study was to describe the character sequences of the FGB gene in Maleo birds (Macrocephalon maleo S. Muller 1864). This study uses exploratory laboratory methods. Alignment was performed with the MEGA6 program (Clustal W). Phylogenetic trees were constructed using the Neighbor-Joining algorithm and the Jukes-Cantor evolution model in MEGA6. The sample was 0.3 ml of blood from Maleo birds; other members of the Megapodiidae were used for comparison. The results indicate that the FGB sequence is 576 bp long, with a base composition of 32.2% A, 30.4% T, 17.0% C, and 20.3% G, so the nucleotide composition of the FGB gene sequence is adenine-rich. Phylogenetic analysis of the FGB gene sequence produced a reasonably good tree topology that discriminated at the interspecies level and placed the maleo in its own lineage relative to other megapode genera.
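
Two of the quantities mentioned in this abstract, base composition and distances under the Jukes-Cantor model, can be computed with a few lines of code. The sketch below assumes gap-free, already aligned DNA strings of equal length; the function names are hypothetical, and it is not a substitute for MEGA6. The Jukes-Cantor distance used is d = -(3/4) ln(1 - (4/3) p), where p is the proportion of differing sites.

```python
import math
from collections import Counter


def base_composition(seq):
    """Return the percentage of A, C, G and T in a DNA string."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: 100.0 * counts[b] / total for b in "ACGT"}


def jukes_cantor_distance(a, b):
    """Jukes-Cantor distance between two aligned, gap-free sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        # The correction is undefined once sites are saturated.
        raise ValueError("proportion of differences too high for Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

A matrix of such pairwise distances is the input that a Neighbor-Joining tree builder works from.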

