Extracting Information from Archaeological Texts

2015 ◽  
Vol 1 (1) ◽  
Author(s):  
Keith W. Kintigh

Abstract: To address archaeology’s most pressing substantive challenges, researchers must discover, access, and extract information contained in the reports and articles that codify so much of archaeology’s knowledge. These efforts will require application of existing and emerging natural language processing technologies to extensive digital corpora. Automated classification can enable development of metadata needed for the discovery of relevant documents. Although it is even more technically challenging, automated extraction of and reasoning with information from texts can provide urgently needed access to contextualized information within documents. Effective automated translation is needed for scholars to benefit from research published in other languages.

Author(s):  
Yinjun Hu ◽  
Mengmeng Chen ◽  
Qian Wang ◽  
Yue Zhu ◽  
Bei Wang ◽  
...  

Abstract [Background] On January 7, 2020, the novel coronavirus named "COVID-19", which has aroused worldwide concern, was identified by Chinese scientists. Many related research works have been produced in response to the emerging, rapidly evolving epidemic. This study aimed to analyze the research literature on SARS, MERS and COVID-19 to retrieve important information for virologists, epidemiologists and policy decision makers. [Methods] We collected data from multiple data sources and compared bibliometric indices among COVID-19, Severe Acute Respiratory Syndrome (SARS), and Middle East Respiratory Syndrome (MERS) up to March 25, 2020. To extract data of comparable quantity and scale, the volume of search results was balanced by limiting the publication years. For further analysis, we extracted 1,480 documents from 1,671 candidates with natural language processing technologies. [Results] In total, 13,945 publications from 7 datasets were selected for analysis. Unlike for other topics, research interest in an epidemic may peak in the first year of the outbreak. The document type distributions of SARS, MERS and COVID-19 are nearly the same (less than a 6-point difference for each type); however, there was notable growth in research quality across these three epidemics (Field-Weighted Citation Impact scores of 3.68, 6.63 and 11.35). Asian countries have less international collaboration (less than 35.1%) than the Occident (more than 49.5%), which deserves as much attention as the research itself. [Conclusions] We found that research interest in epidemics may peak in the first year after an outbreak; however, research on MERS peaked in the third year because of its renewed outbreak in 2015. As for research quality, although we did better than before, especially on COVID-19, research on epidemics that did not start in our own country should not be neglected. Another important and effective strategy for enhancing epidemic prevention in China and other Asian countries is to continue strengthening international collaboration.


2018 ◽  
Vol 11 (3) ◽  
pp. 1-25
Author(s):  
Leonel Figueiredo de Alencar ◽  
Bruno Cuconato ◽  
Alexandre Rademaker

ABSTRACT: One of the prerequisites for many natural language processing technologies is the availability of large lexical resources. This paper reports on MorphoBr, an ongoing project aiming at building a comprehensive full-form lexicon for morphological analysis of Portuguese. A first version of the resource is already freely available online under an open source, free software license. MorphoBr combines analogous free resources, correcting several thousand errors and gaps, and systematically adding new entries. In comparison to the integrated resources, lexical entries in MorphoBr follow a more user-friendly format, which can be straightforwardly compiled into finite-state transducers for morphological analysis, e.g. in the context of syntactic parsing with a grammar in the LFG formalism using the XLE system. MorphoBr results from a combination of computational techniques. Errors and the more obvious gaps in the integrated resources were automatically corrected with scripts. However, MorphoBr's main contribution is the expansion in the inventory of nouns and adjectives. This was carried out by systematically modeling diminutive formation in the paradigm of finite-state morphology. This allowed MorphoBr to significantly outperform analogous resources in the coverage of diminutives. The first evaluation results show MorphoBr to be a promising initiative which will directly contribute to the development of more robust natural language processing tools and applications which depend on wide-coverage morphological analysis.
KEYWORDS: computational linguistics; natural language processing; morphological analysis; full-form lexicon; diminutive formation.


2018 ◽  
pp. 1-8 ◽  
Author(s):  
Alexander P. Glaser ◽  
Brian J. Jordan ◽  
Jason Cohen ◽  
Anuj Desai ◽  
Philip Silberman ◽  
...  

Purpose: Bladder cancer is initially diagnosed and staged with a transurethral resection of bladder tumor (TURBT). Patient survival is dependent on appropriate sampling of layers of the bladder, but pathology reports are dictated as free text, making large-scale data extraction for quality improvement challenging. We sought to automate extraction of stage, grade, and quality information from TURBT pathology reports using natural language processing (NLP). Methods: Patients undergoing TURBT were retrospectively identified using the Northwestern Enterprise Data Warehouse. An NLP algorithm was then created to extract information from free-text pathology reports and was iteratively improved using a training set of manually reviewed TURBTs. NLP accuracy was then validated using another set of manually reviewed TURBTs, and reliability was calculated using Cohen’s κ. Results: Of 3,042 TURBTs identified from 2006 to 2016, 39% were classified as benign, 35% as Ta, 11% as T1, 4% as T2, and 10% as isolated carcinoma in situ. Of 500 randomly selected manually reviewed TURBTs, NLP correctly staged 88% of specimens (κ = 0.82; 95% CI, 0.78 to 0.86). Of 272 manually reviewed T1 tumors, NLP correctly categorized grade in 100% of tumors (κ = 1), correctly categorized if muscularis propria was reported by the pathologist in 98% of tumors (κ = 0.81; 95% CI, 0.62 to 0.99), and correctly categorized if muscularis propria was present or absent in the resection specimen in 82% of tumors (κ = 0.62; 95% CI, 0.55 to 0.73). Discrepancy analysis revealed pathologist notes and deeper resection specimens as frequent reasons for NLP misclassifications. Conclusion: We developed an NLP algorithm that demonstrates a high degree of reliability in extracting stage, grade, and presence of muscularis propria from TURBT pathology reports. Future iterations can continue to improve performance, but automated extraction of oncologic information is promising in improving quality and assisting physicians in delivery of care.
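For readers unfamiliar with the reliability statistic reported above, the following is a minimal sketch of how Cohen's κ can be computed between manual review and an automated classifier. It is not the authors' code: the function name `cohens_kappa` and the eight stage labels are hypothetical and chosen only to illustrate the formula κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for illustration only (not data from the study).
manual = ["Ta", "Ta", "T1", "benign", "T2", "benign", "Ta", "T1"]
nlp = ["Ta", "Ta", "T1", "benign", "T1", "benign", "benign", "T1"]

kappa = cohens_kappa(manual, nlp)
print(f"kappa = {kappa:.3f}")
```

In this toy example the two label sets agree on 6 of 8 items (p_o = 0.75), while chance agreement is p_e = 18/64, so κ is noticeably lower than the raw agreement rate. This is why the abstract reports κ alongside percentage accuracy.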

