The experience of developing a large-scale natural language text processing system

Author(s):  
Stephen D. Richardson ◽  
Lisa C. Braden-Harder
2005 ◽  
Vol 6 (1-2) ◽  
pp. 86-93 ◽  
Author(s):  
Henk Harkema ◽  
Ian Roberts ◽  
Rob Gaizauskas ◽  
Mark Hepple

Recent years have seen a huge increase in the amount of biomedical information that is available in electronic format. Consequently, for biomedical researchers wishing to relate their experimental results to relevant data lurking somewhere within this expanding universe of on-line information, the ability to access and navigate biomedical information sources in an efficient manner has become increasingly important. Natural language and text processing techniques can facilitate this task by making the information contained in textual resources such as MEDLINE more readily accessible and amenable to computational processing. Names of biological entities such as genes and proteins provide critical links between different biomedical information sources and researchers' experimental data. Therefore, automatic identification and classification of these terms in text is an essential capability of any natural language processing system aimed at managing the wealth of biomedical information that is available electronically. To support term recognition in the biomedical domain, we have developed Termino, a large-scale terminological resource for text processing applications, which has two main components: first, a database into which very large numbers of terms can be loaded from resources such as UMLS, and stored together with various kinds of relevant information; second, a finite state recognizer, for fast and efficient identification and mark-up of terms within text. Since many biomedical applications require this functionality, we have made Termino available to the community as a web service, which allows for its integration into larger applications as a remotely located component, accessed through a standardized interface over the web.
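The recognizer component described above can be illustrated with a small sketch. It is not Termino's implementation: it assumes a token-level trie as the finite state structure, a hypothetical in-memory dictionary in place of the term database, and greedy longest-match mark-up. The names `build_trie` and `recognize` and the sample terms are invented for illustration.

```python
import re
from typing import Dict, List, Tuple

# Sentinel key marking that a complete term ends at a trie node.
# It is not a string, so it can never collide with a text token.
END = object()


def tokenize(text: str) -> List[str]:
    """Split text into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text)


def build_trie(terms: Dict[str, str]) -> dict:
    """Build a token-level trie from a {term: semantic class} dictionary."""
    root: dict = {}
    for term, label in terms.items():
        node = root
        for token in tokenize(term.lower()):
            node = node.setdefault(token, {})
        node[END] = label
    return root


def recognize(text: str, trie: dict) -> List[Tuple[str, str]]:
    """Return (matched text, class) pairs, preferring the longest match."""
    tokens = tokenize(text)
    matches: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        node = trie
        best = None  # (end index, label) of the longest term found so far
        j = i
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                best = (j, node[END])
        if best is not None:
            end, label = best
            matches.append((" ".join(tokens[i:end]), label))
            i = end  # continue after the matched term
        else:
            i += 1
    return matches


# Hypothetical terms; a real system would load them from a resource such as UMLS.
TERMS = {"tumor necrosis factor": "protein", "p53": "gene", "tumor": "disease"}
```

Calling `recognize("Tumor necrosis factor regulates p53 in tumor cells.", build_trie(TERMS))` marks the three-token protein name as one term rather than stopping at the shorter entry "tumor", which is the point of tracking the longest match while walking the trie.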


Information ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 294 ◽  
Author(s):  
William Teahan

A novel compression-based toolkit for modelling and processing natural language text is described. The design of the toolkit adopts an encoding perspective: applications are treated as problems of searching for the best encoding of different transformations of the source text into the target text. This paper describes the two-phase 'noiseless channel model' architecture that underpins the toolkit, which models text processing as lossless communication down a noise-free channel. The transformation and encoding performed in the first phase must be both lossless and reversible. The role of the second phase, verification and decoding, is to verify the correctness of the communication of the target text produced by the application. This paper argues that this encoding approach has several advantages over the decoding approach of the standard noisy channel model. The concepts abstracted by the toolkit's design are explained together with details of the library calls. Pseudo-code is also given for a number of the algorithms behind the applications that the toolkit implements, including encoding, decoding, classification, training (model building), parallel sentence alignment, word segmentation and language segmentation. Some experimental results, implementation details, memory usage and execution speeds are also discussed for these applications.
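As a rough illustration of classification by compression, the sketch below assigns a text to the class whose training text lets it be encoded in the fewest additional bytes. It is an assumption-laden stand-in, not the toolkit's algorithm or library calls: it uses Python's general-purpose `zlib` compressor in place of the toolkit's own models, and the function names and sample corpora are invented for illustration.

```python
import zlib
from typing import Dict


def compressed_size(data: str) -> int:
    """Number of bytes zlib needs to encode the text at maximum compression."""
    return len(zlib.compress(data.encode("utf-8"), 9))


def extra_cost(corpus: str, text: str) -> int:
    """Additional bytes needed to encode `text` once `corpus` has been seen."""
    return compressed_size(corpus + " " + text) - compressed_size(corpus)


def classify(text: str, training: Dict[str, str]) -> str:
    """Pick the class whose training text yields the cheapest encoding."""
    return min(training, key=lambda label: extra_cost(training[label], text))


# Hypothetical training texts, one per class.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog near the quiet river bank",
    "french": "le renard brun rapide saute par dessus le chien paresseux pres de la riviere",
}
```

For example, `classify("the lazy dog near the quiet river", TRAINING)` selects `"english"`, because the compressor can encode the text almost entirely as a reference back to the English training text. A compressor with a small window such as `zlib` only suits short training texts; a production system would use a proper statistical model per class.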


Author(s):  
Matheus C. Pavan ◽  
Vitor G. Santos ◽  
Alex G. J. Lan ◽  
Joao Martins ◽  
Wesley Ramos Santos ◽  
...  

2012 ◽  
Vol 30 (1) ◽  
pp. 1-34 ◽  
Author(s):  
Antonio Fariña ◽  
Nieves R. Brisaboa ◽  
Gonzalo Navarro ◽  
Francisco Claude ◽  
Ángeles S. Places ◽  
...  
