text mining system
Recently Published Documents


Total documents: 60 (five years: 10)

H-index: 13 (five years: 2)

Author(s):  
Canhui Li

Background: Filtering is used to improve information efficiency in web text mining. Methods: Augmented information support (AIS), a web content mining technology based on web text mining, is proposed to improve web text mining efficiency. The AIS technology is applied to the Xiangshan Science Conference website, and the AIS4XSSC text mining system is developed. The system is tested for efficiency, and its main functions are discussed. Results: 192 documents are represented as 8352-dimensional vectors, giving a 192 × 8352 matrix; the similarity between the 192 document vectors is calculated using the cosine of the included angle, giving a 192 × 192 symmetric matrix; and 35 categories are formed by hierarchical clustering on the similarity between texts. Conclusion: The results show that the AIS technology can effectively extract information from a large volume of web text. The proposed system improves information retrieval efficiency and can push valuable information to users.
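The pipeline in the Results (document vectors, a symmetric cosine-similarity matrix, then hierarchical clustering) can be sketched as follows. This is a minimal illustration, not the AIS4XSSC implementation: the toy vectors, the function names, and the choice of average linkage are all assumptions, since the abstract does not say which linkage or term weighting was used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty documents (assumption)
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Build the symmetric n x n matrix of pairwise cosine similarities."""
    n = len(vectors)
    sim = [[1.0] * n for _ in range(n)]  # diagonal set to 1.0 by convention
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine_similarity(vectors[i], vectors[j])
            sim[i][j] = sim[j][i] = s
    return sim

def agglomerative_clusters(sim, n_clusters):
    """Merge the two most similar clusters (average linkage, an assumption)
    until only n_clusters remain. Returns lists of document indices."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [sim[i][j] for i in clusters[a] for j in clusters[b]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy term-count vectors for four documents (hypothetical data).
docs = [
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 1],
]
sim = similarity_matrix(docs)
clusters = agglomerative_clusters(sim, 2)
print(clusters)
```

With 192 documents the same steps would produce a 192 × 192 matrix; the stopping point of 35 clusters would be passed as `n_clusters`. For data of that size a library routine such as SciPy's hierarchical clustering would be a more practical choice than this quadratic-per-merge loop.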


2019 ◽ Vol 10 (S1)
Author(s):  
Beatrice Alex ◽  
Claire Grover ◽  
Richard Tobin ◽  
Cathie Sudlow ◽  
Grant Mair ◽  
...  

Background: With improvements in text mining technology and the availability of large unstructured Electronic Healthcare Records (EHR) datasets, it is now possible to extract structured information from raw text contained within EHR at reasonably high accuracy. We describe a text mining system for classifying radiologists’ reports of CT and MRI brain scans, assigning labels indicating occurrence and type of stroke, as well as other observations. Our system, the Edinburgh Information Extraction for Radiology reports (EdIE-R) system, was developed and tested on a collection of radiology reports. The work reported in this paper is based on 1168 radiology reports from the Edinburgh Stroke Study (ESS), a hospital-based register of stroke and transient ischaemic attack patients. We manually created annotations for this data in parallel with developing the rule-based EdIE-R system to identify phenotype information related to stroke in radiology reports. This process was iterative, and domain expert feedback was considered at each iteration to adapt and tune the EdIE-R text mining system, which identifies entities, negation and relations between entities in each report and determines report-level labels (phenotypes). Results: The inter-annotator agreement (IAA) for all types of annotations is high at 96.96 for entities, 96.46 for negation, 95.84 for relations and 94.02 for labels. The equivalent system scores on the blind test set are equally high at 95.49 for entities, 94.41 for negation, 98.27 for relations and 96.39 for labels for the first annotator and 96.86, 96.01, 96.53 and 92.61, respectively, for the second annotator. Conclusion: Automated reading of such EHR data at such high levels of accuracy opens up avenues for population health monitoring and audit, and can provide a resource for epidemiological studies. We are in the process of validating EdIE-R in separate larger cohorts in NHS England and Scotland. The manually annotated ESS corpus will be available for research purposes on application.
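To make the idea of a rule-based step that identifies entities and negation and then derives report-level labels more concrete, here is a deliberately simplified sketch. It is not EdIE-R: the lexicons `ENTITY_TERMS` and `NEGATION_CUES`, the function names, and the rule "a cue anywhere earlier in the same sentence negates the entity" are all hypothetical choices made for illustration.

```python
import re

# Hypothetical mini-lexicons; the actual EdIE-R rules are not given in the abstract.
ENTITY_TERMS = ["haemorrhage", "infarct", "tumour"]
NEGATION_CUES = ["no", "without", "no evidence of"]

def find_entities(sentence):
    """Return (term, negated) pairs for each lexicon term found in the sentence.

    An entity counts as negated if any negation cue appears as a whole
    word or phrase earlier in the same sentence (a crude scope rule).
    """
    text = sentence.lower()
    results = []
    for term in ENTITY_TERMS:
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            preceding = text[:match.start()]
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", preceding)
                for cue in NEGATION_CUES
            )
            results.append((term, negated))
    return results

def report_labels(sentences):
    """Report-level labels: every entity mentioned at least once without negation."""
    labels = set()
    for sentence in sentences:
        for term, negated in find_entities(sentence):
            if not negated:
                labels.add(term)
    return labels

# Invented example report, not taken from the ESS corpus.
report = [
    "There is an old infarct in the left frontal lobe.",
    "No evidence of acute haemorrhage or tumour.",
]
print(find_entities(report[1]))
print(report_labels(report))
```

A production system would need proper tokenisation, bounded negation scope, and relation extraction between entities; the sketch only shows how entity-level decisions can roll up into report-level labels.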


2019 ◽ Vol 10 (1)
Author(s):  
Şenay Kafkas ◽  
Robert Hoehndorf

Background: Infectious diseases claim millions of lives each year, especially in developing countries. Accurate and rapid identification of causative pathogens plays a key role in the success of treatment. To support research on infectious diseases and mechanisms of infection, there is a need for an open resource on pathogen–disease associations that can be used in computational studies. A large number of pathogen–disease associations are available from the literature in unstructured form, and automated methods are needed to extract the data. Results: We developed a text mining system designed for extracting pathogen–disease relations from literature. Our approach uses background knowledge from an ontology and statistical methods to extract associations between pathogens and diseases. In total, we extracted 3420 pathogen–disease associations from literature. We integrated our literature-derived associations into a database that links pathogens to their phenotypes to support infectious disease research. Conclusions: To the best of our knowledge, we present the first study focusing on extracting pathogen–disease associations from publications. We believe the text-mined data can be used as a valuable resource for infectious disease research. All the data is publicly available from https://github.com/bio-ontology-research-group/padimi and through a public SPARQL endpoint from http://patho.phenomebrowser.net/.


2019
Author(s):  
Saman Farahmand ◽  
Todd Riley ◽  
Kourosh Zarringhalam

Background: Transcription factors (TFs) are proteins that are fundamental to transcription and regulation of gene expression. Each TF may regulate multiple genes, and each gene may be regulated by multiple TFs. TFs can act as either activators or repressors of gene expression. This complex network of interactions between TFs and genes underlies many developmental and biological processes and is implicated in several human diseases such as cancer. Hence, deciphering the network of TF-gene interactions with information on mode of regulation (activation vs. repression) is an important step toward understanding the regulatory pathways that underlie complex traits. There are many experimental, computational, and manually curated databases of TF-gene interactions. In particular, high-throughput ChIP-Seq datasets provide a large-scale map of transcriptional regulatory interactions. However, these interactions are not annotated with information on context and mode of regulation. Such information is crucial to gain a global picture of gene regulatory mechanisms and can aid in developing machine learning models for applications such as biomarker discovery, prediction of response to therapy, and precision medicine. Methods: In this work, we introduce a text-mining system to annotate ChIP-Seq derived interactions with such metadata through mining PubMed articles. We evaluate the performance of our system using small-scale, manually curated gold-standard databases. Results: Our results show that the method is able to accurately extract mode of regulation, with an F-score of 0.77 on TRRUST curated interactions and an F-score of 0.96 on the intersection of TRRUST and the ChIP-network. We provide an HTTP REST API for our code to facilitate usage. Availability: Source code and datasets are available for download on GitHub: https://github.com/samanfrm/modex HTTP REST API: https://watson.math.umb.edu/modex/[type query]
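For readers unfamiliar with the F-score reported in the Results, the sketch below shows one common way to compute it when comparing extracted interactions against a curated gold standard. The abstract does not state the exact evaluation protocol, so treating each prediction as a (TF, gene, mode) triple and using the balanced F1 are assumptions; the identifiers `TF_A`, `GENE_1`, and so on are placeholders, not real data.

```python
def precision_recall_f1(predicted, gold):
    """Compare two sets of (tf, gene, mode) triples.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold-standard triples
    F1        = harmonic mean of precision and recall
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Placeholder gold standard: four curated interactions with mode of regulation.
gold = {
    ("TF_A", "GENE_1", "activation"),
    ("TF_A", "GENE_2", "repression"),
    ("TF_B", "GENE_3", "activation"),
    ("TF_C", "GENE_4", "repression"),
}

# Placeholder system output: two triples match, one has the wrong mode.
predicted = {
    ("TF_A", "GENE_1", "activation"),
    ("TF_A", "GENE_2", "repression"),
    ("TF_B", "GENE_3", "repression"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case a prediction with the right TF-gene pair but the wrong mode counts as an error for both precision and recall, which is one reasonable reading of "extracting mode of regulation" but not necessarily the one used in the paper.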

