Natural Language Compression Optimized for Large Set of Files

Author(s):  
P. Prochazka ◽  
J. Holub
Author(s):  
Pankaj Kailas Bhole ◽  
A. J. Agrawal

Text summarization is an old challenge in text mining, but one still in need of attention from researchers in computational intelligence, machine learning and natural language processing. We extract a set of features from each sentence that helps identify its importance in the document. Reading a full text every time is time-consuming, and a clustering approach helps determine what kind of data a document contains. In this paper we introduce k-means clustering for natural language processing of text for word matching, and we adopt data mining document clustering algorithms in order to extract meaningful information from a large set of offline documents.
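The clustering step described above can be sketched with a minimal bag-of-words k-means; the corpus, the choice of k, and the Euclidean distance are illustrative assumptions, not details from the paper.

```python
# Minimal k-means document clustering over bag-of-words vectors.
# Corpus, k, and distance metric are illustrative choices.
from collections import Counter
import math
import random

def vectorize(docs):
    """Turn documents into term-count vectors over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return [[Counter(d.lower().split())[w] for w in vocab] for d in docs], vocab

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(vectors, k)
    for _ in range(iters):
        # Assign each vector to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[min(range(k), key=lambda i: dist(v, centers[i]))].append(v)
        # Recompute centers as cluster means (keep old center if empty).
        centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [min(range(k), key=lambda i: dist(v, centers[i])) for v in vectors]

docs = ["cats purr and meow", "dogs bark loudly",
        "cats and dogs are pets", "stock prices fell today"]
vecs, vocab = vectorize(docs)
labels = kmeans(vecs, k=2)
```

In practice one would use TF-IDF weighting and a library implementation, but the loop above is the whole algorithm: assign to nearest center, recompute means, repeat.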



2020 ◽  
Author(s):  
Michael Prendergast

Abstract – A Verification Cross-Reference Matrix (VCRM) is a table that depicts the verification methods for requirements in a specification. Usually requirement labels are rows, available test methods are columns, and an “X” in a cell indicates usage of a verification method for that requirement. Verification methods include Demonstration, Inspection, Analysis and Test, and sometimes Certification, Similarity and/or Analogy. VCRMs enable acquirers and stakeholders to quickly understand how a product’s requirements will be tested. Maintaining consistency of very large VCRMs can be challenging, and inconsistent verification methods can result in a large set of uncoordinated “spaghetti tests”. Natural language processing algorithms that can identify similarities between requirements offer promise in addressing this challenge. This paper applies and compares four natural language processing algorithms on the problem of automatically populating VCRMs from natural language requirements: (a) Naïve Bayesian inference, (b) Nearest Neighbor by weighted Dice similarity, (c) Nearest Neighbor with Latent Semantic Analysis similarity, and (d) an ensemble method combining the first three approaches. The VCRMs used for this study are for slot machine technical requirements derived from gaming regulations from the countries of Australia and New Zealand, the province of Nova Scotia (Canada), the state of Michigan (United States) and recommendations from the International Association of Gaming Regulators (IAGR).
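Approach (b) can be sketched as follows; the paper uses a *weighted* Dice similarity, whereas this toy version is unweighted, and the requirements and labels below are invented for illustration.

```python
# Nearest-neighbour assignment of verification methods by Dice similarity
# of token sets. Unweighted sketch; example requirements are invented.
def dice(a, b):
    """Dice coefficient 2|A∩B| / (|A|+|B|) over lowercase token sets."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

labeled = [
    ("the device shall display the current credit balance", "Demonstration"),
    ("the cabinet shall withstand a 15 kV electrostatic discharge", "Test"),
    ("source code shall be reviewed for compliance", "Inspection"),
]

def predict(requirement):
    # Copy the verification method of the most similar labeled requirement.
    return max(labeled, key=lambda rl: dice(requirement, rl[0]))[1]

method = predict("the device shall display the total amount wagered")
```

A new requirement thus inherits the verification method of its nearest labeled neighbour, which is what keeps large VCRMs internally consistent.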



2004 ◽  
Vol 21 ◽  
pp. 287-317 ◽  
Author(s):  
M. J. Nederhof ◽  
G. Satta

We propose a formalism for representation of finite languages, referred to as the class of IDL-expressions, which combines concepts that were only considered in isolation in existing formalisms. The suggested applications are in natural language processing, more specifically in surface natural language generation and in machine translation, where a sentence is obtained by first generating a large set of candidate sentences, represented in a compact way, and then by filtering such a set through a parser. We study several formal properties of IDL-expressions and compare this new formalism with more standard ones. We also present a novel parsing algorithm for IDL-expressions and prove a non-trivial upper bound on its time complexity.
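To illustrate the “large set of candidate sentences, represented in a compact way” that such a formalism denotes, the finite language of a small expression with concatenation, disjunction and interleaving can be enumerated directly. The encoding below is invented for illustration and is unrelated to the IDL syntax or the parsing algorithm the paper studies; indeed, expanding the whole set like this is exactly what a compact representation avoids.

```python
# Toy enumerator for the finite language of an IDL-like expression:
# "cat" = concatenation, "or" = disjunction, "il" = interleaving.
def lang(e):
    if isinstance(e, str):
        return {(e,)}
    op, *args = e
    if op == "cat":
        out = {()}
        for a in args:
            out = {x + y for x in out for y in lang(a)}
        return out
    if op == "or":
        return set().union(*(lang(a) for a in args))
    if op == "il":  # all interleavings of two sub-languages
        return {s for xs in lang(args[0]) for ys in lang(args[1])
                for s in interleave(xs, ys)}
    raise ValueError(op)

def interleave(xs, ys):
    """All shuffles of two word tuples, preserving internal order."""
    if not xs:
        return {ys}
    if not ys:
        return {xs}
    return ({(xs[0],) + r for r in interleave(xs[1:], ys)} |
            {(ys[0],) + r for r in interleave(xs, ys[1:])})

expr = ("cat", "the", ("or", "big", "large"), "dog")
sents = {" ".join(s) for s in lang(expr)}
```

Even this tiny expression denotes two sentences; interleaving makes the denoted set grow combinatorially, which is why parsing the compact form directly matters.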





2021 ◽  
pp. 219256822110269
Author(s):  
Fabio Galbusera ◽  
Andrea Cina ◽  
Tito Bassani ◽  
Matteo Panico ◽  
Luca Maria Sconfienza

Study Design: Retrospective study. Objectives: Huge amounts of images and medical reports are being generated in radiology departments. While these datasets can potentially be employed to train artificial intelligence tools to detect findings on radiological images, the unstructured nature of the reports limits the accessibility of information. In this study, we tested if natural language processing (NLP) can be useful to generate training data for deep learning models analyzing planar radiographs of the lumbar spine. Methods: NLP classifiers based on the Bidirectional Encoder Representations from Transformers (BERT) model able to extract structured information from radiological reports were developed and used to generate annotations for a large set of radiographic images of the lumbar spine (N = 10 287). Deep learning (ResNet-18) models aimed at detecting radiological findings directly from the images were then trained and tested on a set of 204 human-annotated images. Results: The NLP models had accuracies between 0.88 and 0.98 and specificities between 0.84 and 0.99; 7 out of 12 radiological findings had sensitivity >0.90. The ResNet-18 models showed performances dependent on the specific radiological findings with sensitivities and specificities between 0.53 and 0.93. Conclusions: NLP generates valuable data to train deep learning models able to detect radiological findings in spine images. Despite the noisy nature of reports and NLP predictions, this approach effectively mitigates the difficulties associated with the manual annotation of large quantities of data and opens the way to the era of big data for artificial intelligence in musculoskeletal radiology.
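The labelling step can be illustrated with a deliberately naive stand-in: the study uses BERT classifiers, but the weak-supervision idea (turn free-text reports into per-finding labels for images) can be shown with one keyword rule per finding. Findings and phrases below are invented, and the example also shows the kind of noise the Conclusions mention.

```python
# Toy stand-in for the report-labelling step. The paper uses BERT
# classifiers; these keyword rules and finding names are invented.
FINDING_PHRASES = {
    "disc_height_reduction": ["disc space narrowing", "reduced disc height"],
    "spondylolisthesis": ["spondylolisthesis", "anterior slippage"],
    "scoliosis": ["scoliosis", "lateral curvature"],
}

def label_report(report):
    """Map a free-text report to one binary label per finding."""
    text = report.lower()
    return {finding: int(any(p in text for p in phrases))
            for finding, phrases in FINDING_PHRASES.items()}

labels = label_report(
    "Mild scoliosis. Reduced disc height at L4-L5. No spondylolisthesis.")
# Note: "No spondylolisthesis" still matches the keyword, a false
# positive -- negation handling is one reason a transformer classifier
# is used instead of rules, and why the resulting labels are noisy.
```

Each labelled report then annotates its paired radiograph, giving the image model its (noisy) training targets without manual annotation.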



2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Mohammad-Ali Yaghoub-Zadeh-Fard ◽  
Boualem Benatallah

Abstract Objectives Recently natural language interfaces (e.g., chatbots) have gained enormous attention. Such interfaces execute underlying application programming interfaces (APIs) based on the user's utterances to perform tasks (e.g., reporting weather). Supervised approaches for building such interfaces rely upon a large set of user utterances paired with APIs. Collecting such pairs typically starts with obtaining initial utterances for a given API method. Generating initial utterances can be considered a machine translation task in which an API method is translated into an utterance. However, the key challenge is the lack of training samples for training domain-independent translation models. In this paper, we propose a dataset for training supervised models to generate initial utterances for APIs. Data description The dataset contains 14,370 pairs of API methods and utterances. It is built automatically by converting method descriptions of a large number of APIs to user utterances, and it is cleaned manually to ensure quality. The dataset is also accompanied by a set of microservices (e.g., translating API methods to utterances) which can facilitate the process of collecting training samples for building natural language interfaces.
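The description-to-utterance conversion can be sketched with a couple of naive rewrite rules; the dataset itself was built with more careful automatic conversion plus manual cleaning, and the rules and example below are invented.

```python
import re

# Naive sketch of turning an API method description into a candidate
# user utterance. Rules and the example description are invented.
def description_to_utterance(desc):
    utt = desc.strip().lower()
    # Third-person description verbs become imperative request verbs.
    utt = re.sub(r"^(returns|retrieves|gets)\s+", "get ", utt)
    utt = re.sub(r"^(creates|adds)\s+", "add ", utt)
    return utt.rstrip(".")

u = description_to_utterance("Returns the current weather for a city.")
```

Such seed utterances are only a starting point; paraphrasing and manual review are still needed before they can train a robust natural language interface.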



2016 ◽  
Author(s):  
David Barner ◽  
Katharine Chow ◽  
Shu-Ju Yang

We explored children’s early interpretation of numerals and linguistic number marking, in order to test the hypothesis (e.g., Carey, 2004) that children’s initial distinction between one and other numerals (i.e., two, three, etc.) is bootstrapped from a prior distinction between singular and plural nouns. Previous studies have presented evidence that in languages without singular-plural morphology, like Japanese and Chinese, children acquire the meaning of the word one later than in singular-plural languages like English and Russian. In two experiments, we sought to corroborate this relation between grammatical number and integer acquisition within English. We found a significant correlation between children’s comprehension of numerals and a large set of natural language quantifiers and determiners, even when controlling for effects due to age. However, we also found that 2-year-old children, who are just acquiring singular-plural morphology and the word one, fail to assign an exact interpretation to singular noun phrases (e.g., a banana), despite interpreting one as exact. For example, in a truth value judgment task, most children judged that a banana was consistent with a set of two objects, despite rejecting sets of two for the numeral one. Also, children who gave exactly one object for singular nouns did not have a better comprehension of numerals relative to children who did not give exactly one. Thus, we conclude that the correlation between quantifier comprehension and numeral comprehension in children of this age is not attributable to the singular-plural distinction facilitating the acquisition of the word one. We argue that quantifiers play a more general role in highlighting the semantic function of numerals, and that children distinguish between numerals and other quantifiers from the beginning, assigning exact interpretations only to numerals.



Author(s):  
Xudong Weng ◽  
O.F. Sankey ◽  
Peter Rez

Single electron band structure techniques have been applied successfully to the interpretation of the near edge structures of metals and other materials. Among various band theories, the linear combination of atomic orbitals (LCAO) method is especially simple and interpretable. The commonly used empirical LCAO method is mainly an interpolation method, where the energies and wave functions of atomic orbitals are adjusted in order to fit experimental or more accurately determined electron states. To achieve better accuracy, the size of the calculation has to be expanded, for example, to include excited states and more-distant-neighboring atoms. This tends to sacrifice the simplicity and interpretability of the method. In this paper, we adopt an ab initio scheme which combines the conceptual advantage of the LCAO method with the accuracy of ab initio pseudopotential calculations. The so-called pseudo-atomic orbitals (PAOs), computed from a free atom within the local-density approximation and the pseudopotential approximation, are used as the basis of expansion, replacing the usually very large set of plane waves in the conventional pseudopotential method. These PAOs, however, do not constitute a rigorously complete set of orthonormal states.
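Schematically, the expansion replaces the plane-wave basis with atom-centered orbitals; in standard tight-binding notation (assumed here, not taken from the paper), a Bloch state becomes:

```latex
% Bloch state expanded in pseudo-atomic orbitals instead of plane waves:
\psi_{n\mathbf{k}}(\mathbf{r}) \;=\; \sum_{i,\mu}
  c^{n\mathbf{k}}_{i\mu}\, e^{i\mathbf{k}\cdot\mathbf{R}_i}\,
  \phi^{\mathrm{PAO}}_{\mu}(\mathbf{r}-\mathbf{R}_i)
```

where $\mathbf{R}_i$ runs over atomic sites and $\mu$ over the (small, but non-orthonormal and incomplete) PAO basis on each site.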



Author(s):  
Michael Schatz ◽  
Joachim Jäger ◽  
Marin van Heel

Lumbricus terrestris erythrocruorin is a giant oxygen-transporting macromolecule in the blood of the common earthworm (worm "hemoglobin"). In our current study, we use specimens (kindly provided by Drs W.E. Royer and W.A. Hendrickson) embedded in vitreous ice (1) to avoid artefacts encountered with the negative stain preparation technique used in previous studies (2-4). Although the molecular structure is well preserved in vitreous ice, the low contrast and high noise level in the micrographs represent a serious problem for image interpretation. Moreover, in this type of preparation the molecules can exhibit many different orientations relative to the object plane of the microscope. Existing techniques of analysis, which require alignment of the molecular views relative to one or more reference images, thus often yield unsatisfactory results. We use a new method in which rotation-, translation- and mirror-invariant functions (5) are first derived from the large set of input images; these functions are subsequently classified automatically using multivariate statistical techniques (6). The different molecular views in the data set can thereby be found without bias (5). Within each class, all images are aligned relative to the member of the class that contributes least to the class's internal variance (6); this reference image is thus the most typical member of the class. Finally, the aligned images from each class are averaged, resulting in molecular views with enhanced statistical resolution.
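The idea of an alignment-invariant function can be illustrated in one dimension: the power spectrum of a signal is unchanged by circular shifts, so shifted copies of the same view map to the same feature vector before classification. The paper uses rotation-, translation- and mirror-invariant *image* functions; this 1-D analogue, with an invented test signal, only illustrates the principle.

```python
import cmath

# Translation invariance via the power spectrum: |DFT(x)|^2 is the
# same for a signal and any circularly shifted copy of it.
def power_spectrum(x):
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n)]

sig = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]   # invented test signal
shifted = sig[2:] + sig[:2]            # circularly shifted copy
a, b = power_spectrum(sig), power_spectrum(shifted)
same = all(abs(u - v) < 1e-9 for u, v in zip(a, b))
```

Because the invariant feature ignores position (and, in the 2-D image case, rotation and mirroring), classification can group identical molecular views together *before* any alignment is attempted.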



1987 ◽  
Vol 32 (1) ◽  
pp. 33-34
Author(s):  
Greg N. Carlson

