UDAT: Compound quantitative analysis of text using machine learning

Author(s):  
Lior Shamir

Abstract: Computing machines allow quantitative analysis of large databases of text, providing knowledge that is difficult to obtain without automation. This article describes Universal Data Analysis of Text (UDAT), a text analysis method that extracts a large set of numerical text content descriptors from text files and performs various pattern recognition tasks such as classification, similarity between classes, correlation between text and numerical values, and query by example. Unlike several previously proposed methods, UDAT is not based on word frequencies or on links between specific keywords and topics. The method is implemented as an open-source software tool that can provide detailed reports on the quantitative analysis of sets of text files, and can also export the numerical text content descriptors as comma-separated values files for statistical or pattern recognition analysis with external tools. It also allows the identification of specific text descriptors that differentiate between classes or correlate with numerical values, and it can be applied to knowledge discovery problems in domains such as literature and social media. UDAT is implemented as a command-line tool that runs on Windows; the source code is available and can be compiled on Linux systems. UDAT can be downloaded from http://people.cs.ksu.edu/~lshamir/downloads/udat.
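Because UDAT exports its numerical text content descriptors as comma-separated values files, the output can be fed into external statistical or machine-learning packages. The following is a minimal, hypothetical sketch of such a downstream analysis in Python with scikit-learn; the file name udat_descriptors.csv and the presence of a "class" label column are assumptions made for illustration, not part of UDAT's documented output format.

```python
# Hypothetical downstream analysis of a UDAT-exported CSV of numerical
# text content descriptors (one row per text file). The file name and
# the "class" column are assumptions, not UDAT's documented output.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the descriptor table exported by UDAT.
df = pd.read_csv("udat_descriptors.csv")

# Separate the class label from the numerical descriptors.
X = df.drop(columns=["class"])
y = df["class"]

# Estimate classification accuracy with 5-fold cross-validation.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```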

2020
Author(s):
Jean-Rémy Marchand
Bernard Pirard
Peter Ertl
Finton Sirockin

Motivation: The detection of small-molecule binding sites in proteins is central to structure-based drug design. Many tools have been developed over the last 40 years, but only a few of them are available today as open-source software suitable for the analysis of large databases or for integration into automatic workflows. In addition, no existing software can characterize subpockets using the protein structure alone, a concept that is pivotal in fragment-based drug design.

Results: CAVIAR is a new open-source tool for protein cavity identification and rationalization. Protein pockets are detected automatically from the protein structure. CAVIAR includes a subcavity segmentation algorithm that decomposes binding sites into subpockets without requiring the presence of a ligand; the resulting subpockets mimic the empirical definitions of subpockets used in medicinal chemistry projects. A tool like CAVIAR may be valuable to support chemical biology, medicinal chemistry and ligand identification efforts. Our analysis of the entire PDB and the PDBbind confirms that liganded cavities tend to be bigger, more hydrophobic and more complex than apo cavities. Moreover, in line with the paradigm of fragment-based drug design, binding affinity scales relatively well with the number of subcavities filled by the ligand: compounds binding to more than three of the subcavities identified by CAVIAR mostly show nanomolar or better affinities for their targets.

Availability and implementation: Installation notes, the user manual and support for CAVIAR are available at https://jr-marchand.github.io/caviar/. The CAVIAR GUI and CAVIAR command-line tool are available on GitHub at https://github.com/jr-marchand/caviar, and the package is hosted on Anaconda cloud at https://anaconda.org/jr-marchand/caviar under an MIT license. The GitHub repository also hosts the validation datasets.


Planta Medica
2008
Vol 74 (09)
Author(s):
XL Piao
HH Yoo
SY Park
JH Park

2008
Vol 5 (4)
pp. 319-322
Author(s):
Sung Kyu Park
John D Venable
Tao Xu
John R Yates

Author(s):
A. Awaid
H. Al-Muqbali
A. Al-Bimani
Z. Al-Yazeedi
H. Al-Sukaity
...

2021
Vol 65 (2)
pp. 52
Author(s):
Colin Bitter
Yuji Tosaka

The purpose of this paper is to report a quantitative analysis of the Library of Congress Genre/Form Terms (LCGFT) vocabulary within a large set of MARC bibliographic data retrieved from the OCLC WorldCat database. The study aimed to provide a detailed analysis of the outcomes of the LCGFT project, which was launched by the Library of Congress (LC) in 2007. Findings point to a moderate increase in LCGFT use over time; however, the vocabulary has not been applied to the fullest extent possible in WorldCat, and adoption has been inconsistent across the various LCGFT disciplines. These and other findings discussed here suggest that catalogers and other technical services librarians should investigate retrospective application of the vocabulary by automated means. Given the somewhat uneven application of LCGFT observed in the data, and with nearly half a billion records in WorldCat, much of LCGFT's potential for genre/form access and retrieval will remain untapped until innovative solutions are introduced to further increase overall vocabulary usage in bibliographic databases.
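For readers who want to reproduce this kind of count on their own MARC data, the sketch below tallies how many records in a local MARC file carry at least one LCGFT heading, using the pymarc library. It is a hypothetical illustration, not the authors' WorldCat workflow; the input file name is a placeholder, and it assumes LCGFT terms are coded in field 655 with subfield $2 set to "lcgft", the standard practice.

```python
# Hypothetical count of records carrying at least one LCGFT heading.
# "records.mrc" is a placeholder file name, not data from the study.
from pymarc import MARCReader

total = 0
with_lcgft = 0

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:  # skip records pymarc could not parse
            continue
        total += 1
        # LCGFT terms are conventionally coded in 655 fields with $2 "lcgft".
        for field in record.get_fields("655"):
            if "lcgft" in field.get_subfields("2"):
                with_lcgft += 1
                break

print(f"{with_lcgft}/{total} records carry at least one LCGFT heading")
```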


2017
Vol 139
pp. 320-329
Author(s):
Joshua Stuckner
Katherine Frei
Ian McCue
Michael J. Demkowicz
Mitsuhiro Murayama

2012
Vol 51 (05)
pp. 441-448
Author(s):
P. F. Neher
I. Reicht
T. van Bruggen
C. Goch
M. Reisert
...

Summary
Background: Diffusion MRI provides a unique window on brain anatomy and insights into aspects of tissue structure in living humans that could not be studied previously. A major effort in this rapidly evolving field of research is the development of the algorithmic tools necessary to cope with the complexity of the datasets.
Objectives: This work illustrates our strategy, which encompasses the development of a modular, open software tool for data processing, visualization and interactive exploration in diffusion imaging research, and aims at reinforcing sustainable evaluation and progress in the field.
Methods: In this paper, the usability and capabilities of a new application and toolkit component of the Medical Imaging Interaction Toolkit (MITK, www.mitk.org), MITK-DI, are demonstrated using in vivo datasets.
Results: MITK-DI provides a comprehensive software framework for high-performance data processing, analysis and interactive data exploration. It is designed in a modular, extensible fashion (using CTK) and adheres to widely accepted coding standards (e.g. ITK, VTK). MITK-DI is available both as an open-source software development toolkit and as a ready-to-use installable application.
Conclusions: The open-source release of the modular MITK-DI tools will increase verifiability and comparability within the research community and is an important step towards bringing many of the current techniques closer to clinical application.

