semistructured data
Recently Published Documents


TOTAL DOCUMENTS: 164 (FIVE YEARS 2)
H-INDEX: 26 (FIVE YEARS 0)

JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
David M Miller ◽  
Sophia Z Shalhout

Abstract

Objectives: Clinico-genomic data (CGD) acquired through routine clinical practice have the potential to improve our understanding of clinical oncology. However, these data often reside in heterogeneous, semistructured sources, which prolongs time-to-analysis.

Materials and Methods: We created GENETEX, an R package and Shiny application for text mining genomic reports from the electronic health record (EHR) and importing the results directly into Research Electronic Data Capture (REDCap).

Results: GENETEX facilitates the abstraction of CGD from the EHR and streamlines the capture of structured data into REDCap. Its functions include natural language processing of key genomic information, transformation of semistructured data into structured data, and importation into REDCap. When evaluated against manual abstraction, GENETEX had >99% agreement and captured CGD in approximately one-fifth the time.

Conclusions: GENETEX is freely available under the Massachusetts Institute of Technology license and can be obtained from GitHub (https://github.com/TheMillerLab/genetex). GENETEX is executed in R and deployed as a Shiny application for non-R users. It produces high-fidelity abstraction of CGD in a fraction of the time.
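The step of transforming semistructured report text into structured records can be illustrated with a minimal sketch. This is not GENETEX code (GENETEX is an R package); the report layout, the regular expression, and the field names `gene`, `variant`, and `vaf_percent` are all assumptions made up for illustration.

```python
import re

# Hypothetical line layout "GENE p.Variant (VAF xx%)"; not a real vendor report format.
VARIANT_LINE = re.compile(
    r"^(?P<gene>[A-Z0-9]+)\s+(?P<variant>p\.[A-Za-z0-9*]+)\s+"
    r"\(VAF\s+(?P<vaf>\d+(?:\.\d+)?)%\)$"
)


def parse_report(text):
    """Return one dict per line that matches the assumed variant layout."""
    records = []
    for line in text.splitlines():
        match = VARIANT_LINE.match(line.strip())
        if match:  # free-text lines that do not fit the pattern are skipped
            records.append(
                {
                    "gene": match["gene"],
                    "variant": match["variant"],
                    "vaf_percent": float(match["vaf"]),
                }
            )
    return records


# Made-up sample report mixing free text with variant lines.
sample = """Genomic findings
BRAF p.V600E (VAF 31.5%)
TP53 p.R175H (VAF 12%)
No other reportable alterations."""

rows = parse_report(sample)
```

Each resulting dict maps onto one structured row, which is the shape a data-capture form would typically expect; a real pipeline would need far more robust handling of report variants than a single pattern.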


2021 ◽  
Vol 1 (2) ◽  
pp. 65-77
Author(s):  
T. E. Vildanov ◽  
N. S. Ivanov

This article explores both popular and newly developed tools for extracting data from websites and converting them into a form suitable for analysis. The paper compares Python libraries, with performance as the key criterion. The results are grouped by site, tool, and number of iterations and presented graphically. The scientific novelty of the research lies in the application domain: the tools are used to retrieve and transform semistructured data from the websites of bookmakers and betting exchanges. The article also describes new tools that are not yet widely used for parsing and web scraping. The study yields quantitative metrics for all the tools tested and identifies the libraries best suited to rapidly extracting and processing large volumes of information.
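A performance comparison of this kind can be sketched with the Python standard library alone. The abstract does not name the libraries the article tested, so the two approaches below (`html.parser` and a regular expression) are stand-ins, and the `PAGE` markup is a made-up odds table, not taken from any real bookmaker site.

```python
import re
import timeit
from html.parser import HTMLParser

# Hypothetical markup: 200 rows of team name and decimal odds.
PAGE = "<table>" + "".join(
    f'<tr><td class="team">Team {i}</td><td class="odds">{1.5 + i / 100:.2f}</td></tr>'
    for i in range(200)
) + "</table>"


class OddsParser(HTMLParser):
    """Collects the text of every <td class="odds"> cell as a float."""

    def __init__(self):
        super().__init__()
        self.in_odds = False
        self.odds = []

    def handle_starttag(self, tag, attrs):
        self.in_odds = tag == "td" and ("class", "odds") in attrs

    def handle_data(self, data):
        if self.in_odds:
            self.odds.append(float(data))

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_odds = False


def extract_with_parser(page):
    parser = OddsParser()
    parser.feed(page)
    parser.close()
    return parser.odds


ODDS_RE = re.compile(r'<td class="odds">([\d.]+)</td>')


def extract_with_regex(page):
    return [float(value) for value in ODDS_RE.findall(page)]


def benchmark(iterations=50):
    """Return total seconds per tool for the given number of iterations."""
    tools = (("html.parser", extract_with_parser), ("regex", extract_with_regex))
    return {
        name: timeit.timeit(lambda: func(PAGE), number=iterations)
        for name, func in tools
    }
```

Calling `benchmark()` returns a dict of timings keyed by tool name, which can then be grouped by site and iteration count in the way the abstract describes. Absolute numbers depend on the machine, so only relative comparisons are meaningful.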


2019 ◽  
pp. 147-175
Author(s):  
Dmitry Anoshin ◽  
Dmitry Shirokov ◽  
Donna Strok

2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Lin Guo ◽  
Wanli Zuo ◽  
Tao Peng ◽  
Lin Yue

The diversity of large-scale semistructured data makes extracting implicit semantic information very difficult. This paper proposes an automatic, unsupervised method of text categorization in which tree-shaped structures represent semantic knowledge and implicit information is explored by mining hidden structures, without cumbersome lexical analysis. Mining implicit frequent structures in trees can discover both direct and indirect semantic relations, which substantially improves the accuracy of matching and classifying texts. The experimental results show that the proposed algorithm markedly reduces the time and effort spent on training and classification and outperforms established competitors in correctness and effectiveness.
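The idea of mining frequent structures that capture both direct and indirect relations can be illustrated with a much simplified stand-in. The sketch below is not the authors' algorithm: it only counts ancestor-descendant label pairs and keeps those that occur in at least `min_support` trees. The tree encoding, the labels, and the sample documents are assumptions chosen for illustration.

```python
from collections import Counter

# A tree is encoded as (label, [children]); this encoding is an assumption.


def ancestor_descendant_pairs(tree, ancestors=()):
    """Return the set of (ancestor, descendant) label pairs in one tree.

    Parent-child pairs are the direct relations; pairs that skip
    intermediate nodes are the indirect ones.
    """
    label, children = tree
    pairs = {(ancestor, label) for ancestor in ancestors}
    for child in children:
        pairs |= ancestor_descendant_pairs(child, ancestors + (label,))
    return pairs


def frequent_pairs(trees, min_support):
    """Keep pairs that appear in at least min_support trees."""
    support = Counter()
    for tree in trees:
        # The per-tree result is a set, so each pair counts once per tree.
        support.update(ancestor_descendant_pairs(tree))
    return {pair: count for pair, count in support.items() if count >= min_support}


# Made-up documents represented as small label trees.
docs = [
    ("article", [("sports", [("football", [])]), ("odds", [])]),
    ("article", [("sports", [("tennis", [])])]),
    ("article", [("finance", [("odds", [])])]),
]

result = frequent_pairs(docs, min_support=2)
```

Here the pair `("article", "odds")` reaches the support threshold even though `odds` is a direct child in the first tree and only an indirect descendant in the third, which is the kind of relation a purely parent-child count would miss.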

