species2vec: A novel method for species representation

2018 ◽  
Author(s):  
Boyan Angelov

Abstract: Word embeddings are omnipresent in Natural Language Processing (NLP) tasks. The same technology that defines words by their context can also define biological species. This study showcases this new method, species embedding (species2vec). By proximity sorting 6761594 mammal observations from around the world (2862 different species), we create a training corpus for the skip-gram model. The resulting species embeddings are tested in an environmental classification task. The classifier performance confirms the utility of these embeddings in preserving the relationships between species and in representing the species consortia of an environment.
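As a rough illustration of how a skip-gram model defines an item by its context, the sketch below generates (center, context) training pairs from an ordered sequence of species names. The function name `skipgram_pairs`, the window size, and the four-species toy sequence are assumptions made for illustration; the abstract does not describe the actual corpus construction or training parameters.

```python
def skipgram_pairs(sequence, window):
    """Return (center, context) pairs for every item in the sequence.

    Each item is paired with its neighbours up to `window` positions away,
    which is the kind of training signal a skip-gram model learns from.
    """
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical toy "sentence": species assumed to be already sorted by
# spatial proximity of their observations (illustrative data only).
observations = [
    "Vulpes vulpes",
    "Lepus europaeus",
    "Capreolus capreolus",
    "Sus scrofa",
]
pairs = skipgram_pairs(observations, window=1)
```

With a window of 1, each species is paired only with its immediate neighbours in the sorted sequence; a larger window would treat more distant observations as shared context.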

Author(s):  
Iraj Mantegh ◽  
Nazanin S. Darbandi

Robotic alternatives to many manual operations fall short in application because of the difficulty of capturing the manual skill of an expert operator. One of the main problems to be solved if robots are to become flexible enough for various manufacturing needs is that of end-user programming. An end-user with little or no technical expertise in robotics needs to be able to efficiently communicate their manufacturing task to the robot. This paper proposes a new method for robot task planning using some concepts of Artificial Intelligence. Our method is based on a hierarchical knowledge representation and propositional logic, which allows an expert user to incrementally integrate process and geometric parameters with the robot commands. The objective is to provide an intelligent and programmable agent such as a robot with a knowledge base about the attributes of human behaviors in order to facilitate the commanding process. The focus of this work is on robot programming for manufacturing applications. Industrial manipulators work with low-level programming languages. This work presents a new method based on Natural Language Processing (NLP) that allows a user to generate robot programs using a natural language lexicon and task information. This will enable a manufacturing operator (for example, in painting) who may be unfamiliar with robot programming to easily employ the agent for manufacturing tasks.


2015 ◽  
Author(s):  
Abraham G Ayana

Natural Language Processing (NLP) refers to human-like language processing and is a discipline within the field of Artificial Intelligence (AI). However, the ultimate goal of research on Natural Language Processing, to parse and understand language, has not yet been fully achieved. For this reason, much research in NLP has focused on intermediate tasks that make sense of some of the structure inherent in language without requiring complete understanding. One such task is part-of-speech tagging, or simply tagging. The lack of a standard part-of-speech tagger for Afaan Oromo is a main obstacle for researchers in the areas of machine translation, spell checking, dictionary compilation, and automatic sentence parsing and construction. Even though several works have addressed POS tagging for Afaan Oromo, the performance of the tagger has not yet improved sufficiently. Hence, the aim of this thesis is to improve the lexical and transformation rules of Brill’s tagger for Afaan Oromo POS tagging with a sufficiently large training corpus. Accordingly, Afaan Oromo literature on grammar and morphology is reviewed to understand the nature of the language and to identify possible tagsets. As a result, 26 broad tagsets were identified, and 17,473 words from around 1100 sentences containing 6750 distinct words were tagged for training and testing purposes. Of these, 258 sentences are taken from previous work. Since only a few ready-made standard corpora exist, the manual tagging process used to prepare the corpus for this work was challenging, and it is therefore recommended that a standard corpus be prepared. Transformation-based error-driven learning is adapted for Afaan Oromo part-of-speech tagging. Different experiments are conducted for the rule-based approach, taking 20% of the whole data for testing. A comparison with the previously adapted Brill’s Tagger is made.
The previously adapted Brill’s Tagger shows an accuracy of 80.08%, whereas the improved Brill’s Tagger shows an accuracy of 95.6%, an improvement of 15.52 percentage points. Hence, it is found that the size of the training corpus, the rule-generating system in the lexical rule learner, and, moreover, the use of an Afaan Oromo HMM tagger as the initial-state tagger have a significant effect on the improvement of the tagger.


2020 ◽  
Author(s):  
Masashi Sugiyama

Recently, word embeddings have been used successfully in many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn vectors for all senses of a word separately. Therefore, in this project, we have explored two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-parametric Multi-sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which can learn the vectors of all senses per word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG is better than the other models.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Ivano Lauriola ◽  
Fabio Aiolli ◽  
Alberto Lavelli ◽  
Fabio Rinaldi

Abstract. Background: Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize named entities in textual documents. Several systems exist to solve this task in the biomedical domain, based on Natural Language Processing techniques and Machine Learning algorithms. A crucial step of these applications is the choice of the representation which describes data. Several representations have been proposed in the literature, some of which are based on a strong knowledge of the domain, and they consist of features manually defined by domain experts. Usually, these representations describe the problem well, but they require a lot of human effort and annotated data. On the other hand, general-purpose representations like word embeddings do not require human domain knowledge, but they could be too general for a specific task. Results: This paper investigates methods to learn the best representation from data directly, by combining several knowledge-based representations and word embeddings. Two mechanisms have been considered to perform the combination, which are neural networks and Multiple Kernel Learning. To this end, we use a hybrid architecture for biomedical entity recognition which integrates dictionary look-up (also known as gazetteers) with machine learning techniques. Results on the CRAFT corpus clearly show the benefits of the proposed algorithm in terms of F1 score. Conclusions: Our experiments show that the principled combination of general, domain-specific, word-, and character-level representations improves the performance of entity recognition. We also discussed the contribution of each representation in the final solution.


2018 ◽  
Author(s):  
Paulo Henrique Calado Aoun ◽  
Andre C. A. Nascimento ◽  
Adenilton J. Da Silva

The use of word embeddings is becoming very common in many Natural Language Processing tasks. Most of the time, these require computational resources that are not available on most current mobile devices. In this work, we evaluate a combination of numeric truncation and dimensionality reduction strategies in order to obtain smaller vector representations without substantial losses in performance.
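The general idea of combining dimensionality reduction with numeric truncation can be sketched as follows. This is a minimal illustration and not the authors' procedure: it assumes PCA (computed via SVD) as the reduction step and rounding to a fixed number of decimals, stored in half precision, as the truncation step. The function name `compress_embeddings`, the parameter values, and the random toy matrix are all hypothetical.

```python
import numpy as np


def compress_embeddings(vectors, n_components, decimals):
    """Project vectors onto their top principal components, then round.

    Assumed choices for illustration: PCA as the dimensionality reduction
    and decimal rounding plus float16 storage as the numeric truncation.
    """
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    # Round to a fixed number of decimals and store in half precision.
    return np.round(reduced, decimals).astype(np.float16)


# Random stand-in for a real embedding matrix (1000 words, 300 dimensions).
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(1000, 300)).astype(np.float32)
small = compress_embeddings(toy_vectors, n_components=50, decimals=2)
```

In this toy setting the compressed matrix uses 50 half-precision values per word instead of 300 single-precision values. Whether downstream performance is preserved depends on the embeddings and the task, which is what the study evaluates.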


2021 ◽  
Vol 10 (1) ◽  
pp. 57
Author(s):  
Kazuhisa Fujita

Spherical data is data distributed on a sphere. Such data appears in various fields such as meteorology, biology, and natural language processing. However, methods for analyzing spherical data are not yet sufficiently developed. One important issue is estimating the number of clusters in spherical data. To address this issue, I propose a new method called Spherical X-means (SX-means) that can estimate the number of clusters on a d-dimensional sphere. SX-means is a model-based method that assumes the data is generated from a mixture of von Mises-Fisher distributions. The present paper explains the proposed method and shows its performance in estimating the number of clusters.


2019 ◽  
Author(s):  
William Jin

Recently, word embeddings have been used successfully in many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn vectors for all senses of a word separately. Therefore, in this project, we have explored two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-parametric Multi-sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which can learn the vectors of all senses per word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG is better than the other models.


2021 ◽  
Vol 113 ◽  
pp. 103665
Author(s):  
Timothy L. Chen ◽  
Max Emerling ◽  
Gunvant R. Chaudhari ◽  
Yeshwant R. Chillakuru ◽  
Youngho Seo ◽  
...  
