Domain-Specific Chinese Transformer-XL Language Model with Part-of-Speech Information

Named Entity Recognition is the process of identifying named entities, which are the designators of a sentence. The designators of a sentence are domain specific. The proposed system identifies named entities in Malayalam-language text from the tourism domain, which generally include names of persons, places, organizations, dates, etc. The system uses word, part-of-speech and lexicalized features to find the probability that a word belongs to a named entity category and to classify it accordingly. The probability is calculated by supervised machine learning from the word and part-of-speech features present in a tagged training corpus, together with certain rules applied on the basis of lexicalized features.
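As a rough illustration of the probability step described above, the following minimal sketch estimates how likely a word is to belong to an entity category from counts over (word, part-of-speech) pairs in a tagged corpus, backing off to the part-of-speech tag alone for unseen words. The toy corpus, the tag names, the back-off rule and the function names `train` and `classify` are all assumptions made for illustration; the abstract does not specify the estimator, and the lexicalized rules it mentions are omitted here.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, part-of-speech tag, entity category) triples.
# English place names stand in for Malayalam tokens; "O" means "not an entity".
TRAIN = [
    ("Kochi", "NNP", "PLACE"),
    ("Kochi", "NNP", "PLACE"),
    ("Munnar", "NNP", "PLACE"),
    ("Ravi", "NNP", "PERSON"),
    ("visited", "VBD", "O"),
    ("beach", "NN", "O"),
]


def train(corpus):
    """Count entity categories per (word, POS) pair and per POS tag."""
    by_word_pos = defaultdict(Counter)
    by_pos = defaultdict(Counter)
    for word, pos, category in corpus:
        by_word_pos[(word, pos)][category] += 1
        by_pos[pos][category] += 1
    return by_word_pos, by_pos


def classify(word, pos, model):
    """Return (most likely category, relative-frequency estimate).

    Uses the (word, POS) counts when the pair was seen in training and
    falls back to POS-only counts otherwise (an assumed back-off rule).
    """
    by_word_pos, by_pos = model
    counts = by_word_pos.get((word, pos)) or by_pos.get(pos)
    if not counts:
        return "O", 0.0
    category, n = counts.most_common(1)[0]
    return category, n / sum(counts.values())


model = train(TRAIN)
print(classify("Kochi", "NNP", model))      # seen word: uses word + POS counts
print(classify("Alappuzha", "NNP", model))  # unseen word: backs off to POS counts
```

A real system would add smoothing and contextual features; the point here is only that both the word and its part-of-speech tag contribute evidence to the category decision.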

BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition

10.21203/rs.3.rs-90025/v1 ◽

2020 ◽

Author(s):

Usman Naseem ◽

Matloob Khushi ◽

Vinay Reddy ◽

Sakthivel Rajendran ◽

Imran Razzak ◽

...

Keyword(s):

State Of The Art ◽

Language Model ◽

Named Entity Recognition ◽

Training Data ◽

Entity Recognition ◽

Future Research ◽

Named Entity ◽

Domain Specific ◽

Context Dependent ◽

Biomedical Named Entity Recognition

Abstract Background: In recent years, the growing volume of biomedical documents, coupled with advances in natural language processing algorithms, has led to an exponential increase in research on biomedical named entity recognition (BioNER). However, BioNER research is challenging because NER in the biomedical domain (i) is often restricted by the limited amount of training data, (ii) must handle entities that can refer to multiple types and concepts depending on context, and (iii) relies heavily on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), BioALBERT, an effective domain-specific pre-trained language model trained on a huge biomedical corpus and designed to capture biomedical context-dependent NER. We adopted the self-supervised loss function used in ALBERT, which targets modelling inter-sentence coherence, to better learn context-dependent representations, and incorporated parameter-reduction strategies to minimise memory usage and improve training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets with four different entity types. Performance increased for (i) disease-type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug-chem-type corpora by 4.61% (BC5CDR-Chem) and 3.89% (BC4CHEMD); (iii) gene-protein-type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) species-type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800), leading to state-of-the-art results. Conclusions: The performance of the proposed model on four different biomedical entity types shows that our model is robust and generalizable in recognizing biomedical entities in text. We trained four different variants of BioALBERT models, which are available to the research community for use in future research.

Simulation-Based and Formal Verification of Domain-Specific Language Model

AIAA Scitech 2020 Forum ◽

10.2514/6.2020-0897 ◽

2020 ◽

Author(s):

Bharvi N. Chhaya ◽

Shafagh Jafer

Keyword(s):

Formal Verification ◽

Language Model ◽

Domain Specific Language ◽

Specific Language ◽

Domain Specific ◽

Simulation Based

Word embeddings for application in geosciences: development, evaluation and examples of soil-related concepts

10.5194/soil-2018-44 ◽

2019 ◽

Author(s):

José Padarian ◽

Ignacio Fuentes

Keyword(s):

Language Processing ◽

Dimensional Space ◽

Language Model ◽

Test Suite ◽

Word Embeddings ◽

General Domain ◽

Domain Specific ◽

Descriptive Information ◽

Development Evaluation ◽

Numerical Representations

Abstract. A large amount of descriptive information is available in most disciplines of geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Considering the advances in natural language processing and machine learning, it is possible to utilise descriptive information and encode it as dense vectors. These word embeddings lie in a multi-dimensional space where angles and distances have a linguistic interpretation. We used 280 764 full-text scientific articles related to geosciences to train a domain-specific language model capable of generating such embeddings. To evaluate the quality of the numerical representations, we performed three intrinsic evaluations, namely: the capacity to generate analogies, term relatedness compared with the opinion of a human subject, and categorisation of different groups of words. Since this is the first attempt to evaluate word embeddings for tasks in the geosciences domain, we created a test suite specific to geosciences. We compared our results with general domain embeddings commonly used in other disciplines. As expected, our domain-specific embeddings (GeoVec) outperformed general domain embeddings in all tasks, with an overall performance improvement of 107.9 %. The resulting embeddings and test suite will be made available for other researchers to use and expand.
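To make the idea of angles carrying linguistic meaning concrete, here is a minimal sketch of cosine similarity and of the vector-offset analogy test commonly used for intrinsic evaluation of embeddings. The three-dimensional vectors, the vocabulary and the function names `cosine` and `analogy` are hypothetical and chosen only for illustration; they are not GeoVec values, and the abstract does not state which analogy procedure the authors used.

```python
import math

# Hypothetical 3-dimensional toy vectors; real embeddings have hundreds of
# dimensions and are learned from a corpus rather than written by hand.
VECTORS = {
    "clay":   [1.0, 0.0, 0.0],
    "fine":   [1.0, 1.0, 0.0],
    "sand":   [0.0, 0.0, 1.0],
    "coarse": [0.0, 1.0, 1.0],
    "silt":   [0.5, 0.0, 0.5],
}


def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors=VECTORS):
    """Answer "a is to b as c is to ?" with the vector-offset method.

    Builds the target vector b - a + c and returns the vocabulary word,
    other than the three query words, whose vector is closest by cosine.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(cosine(VECTORS["clay"], VECTORS["silt"]))
print(analogy("clay", "fine", "sand"))
```

With these toy vectors, "clay" and "sand" are orthogonal (cosine 0), and the offset from "clay" to "fine" applied to "sand" lands exactly on "coarse". An analogy test suite scores an embedding by how often such queries return the expected word.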

The effect of word embeddings and domain specific long-range contextual information on a Recurrent Neural Network Language Model

2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA) ◽

10.1109/robomech.2019.8704827 ◽

2019 ◽

Author(s):

Linda Khumalo ◽

Georg I. Schlünz ◽

Quentin Williams

Keyword(s):

Neural Network ◽

Long Range ◽

Recurrent Neural Network ◽

Contextual Information ◽

Language Model ◽

Word Embeddings ◽

Domain Specific ◽

Network Language

Language Model Domain Adaptation Via Recurrent Neural Networks with Domain-Shared and Domain-Specific Representations

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2018.8462631 ◽

2018 ◽

Cited By ~ 3

Author(s):

Tsuyoshi Morioka ◽

Naohiro Tawara ◽

Tetsuji Ogawa ◽

Atsunori Ogawa ◽

Tomoharu Iwata ◽

...

Keyword(s):

Neural Networks ◽

Recurrent Neural Networks ◽

Domain Adaptation ◽

Language Model ◽

Domain Specific ◽

Model Domain

A FINITE STATE COMMA TAGGER

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213004001636 ◽

2004 ◽

Vol 13 (03) ◽

pp. 449-468 ◽

Cited By ~ 1

Author(s):

SEBASTIAN VAN DELDEN ◽

FERNANDO GOMEZ

Keyword(s):

Learning Algorithm ◽

Finite State Automata ◽

Rule Based ◽

Domain Specific ◽

Part Of Speech ◽

System A ◽

Finite State ◽

Rule Based Approach

A method has been developed and implemented that assigns syntactic roles to commas. Text that has been tagged using a part-of-speech tagger serves as the input to the system. A set of Finite State Automata first assigns temporary syntactic roles to each comma in the sentence. A greedy learning algorithm is then used to determine the final syntactic roles of the commas. The system requires no training and is not domain specific. The performance of the system on numerous corpora is given and compared against a rule-based approach.
