A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics

Entropy ◽  
2020 ◽  
Vol 22 (1) ◽  
pp. 126 ◽  
Author(s):  
Martin Gerlach ◽  
Francesc Font-Clos

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
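
The relation between the three granularity levels can be illustrated with a minimal Python sketch. This is not the authors' published pipeline (which accompanies the corpus); the example text and the simple regular-expression tokenizer are assumptions made purely for illustration.

```python
# Minimal sketch (not the SPGC pipeline itself) of the three granularity levels:
# raw text -> time series of word tokens -> counts of words.
import re
from collections import Counter

def tokenize(raw_text: str) -> list[str]:
    """Lowercase the raw text and split it into word tokens."""
    return re.findall(r"[a-z]+", raw_text.lower())

raw = "The Time Traveller (for so it will be convenient to speak of him) ..."
tokens = tokenize(raw)       # level 2: time series of word tokens
counts = Counter(tokens)     # level 3: counts of words
print(tokens[:5], counts.most_common(3))
```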

2021 ◽  
pp. 1-13
Author(s):  
Lamiae Benhayoun ◽  
Daniel Lang

BACKGROUND: The renewed advent of Artificial Intelligence (AI) is inducing profound changes in the classic categories of technology professions and is creating the need for new specific skills. OBJECTIVE: Identify the gaps, in terms of skills, between academic training on AI in French engineering and Business Schools and the requirements of the labour market. METHOD: Extraction of AI training content from the schools' websites and scraping of a job advertisement website, followed by an analysis based on a text mining approach with Python code for Natural Language Processing. RESULTS: A categorization of occupations related to AI and a characterization of three classes of skills for the AI market: Technical, Soft and Interdisciplinary. Skill gaps concern some professional certifications, the mastery of specific tools, research abilities, and awareness of the ethical and regulatory dimensions of AI. CONCLUSIONS: A deep analysis using algorithms for Natural Language Processing, whose results provide a better understanding of the AI capability components at the individual and organizational levels. The study can help shape educational programs to respond to AI market requirements.
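
As a rough illustration of the text-mining step described above, the following Python sketch compares term frequencies between training content and job advertisements; the corpora, terms, and the simple gap heuristic are placeholders, not the study's actual code or data.

```python
# Illustrative sketch of a term-frequency comparison between course contents
# and job ads; terms present in the market but absent from training hint at gaps.
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpora standing in for scraped school websites and job ads.
course_texts = ["deep learning and python programming for engineers",
                "statistics, machine learning and research methods"]
job_texts = ["python, docker and aws required; gdpr awareness appreciated",
             "nlp engineer with tensorflow certification and soft skills"]

vec = CountVectorizer(stop_words="english")
vec.fit(course_texts + job_texts)
vocab = vec.get_feature_names_out()
course_counts = vec.transform(course_texts).sum(axis=0).A1
job_counts = vec.transform(job_texts).sum(axis=0).A1

# Terms demanded by the market but never mentioned in training contents.
gaps = [t for t, c, j in zip(vocab, course_counts, job_counts) if j > 0 and c == 0]
print(gaps)
```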


2021 ◽  
Vol 21 (2) ◽  
pp. 1-25
Author(s):  
Pin Ni ◽  
Yuming Li ◽  
Gangmin Li ◽  
Victor Chang

Cyber-Physical Systems (CPS), as multi-dimensional complex systems that connect the physical world and the cyber world, have a strong demand for processing large amounts of heterogeneous data. These tasks include Natural Language Inference (NLI) tasks based on text from different sources. However, current research on natural language processing in CPS does not involve exploration in this field. Therefore, this study proposes a Siamese Network structure that combines stacked residual bidirectional Long Short-Term Memory with an Attention mechanism and a Capsule Network for the NLI module in CPS, which is used to infer the relationship between text/language data from different sources. This model mainly implements NLI tasks as the basic semantic understanding module in CPS and is evaluated in detail on three main NLI benchmarks. Comparative experiments show that the proposed method achieves competitive performance, has a certain generalization ability, and can balance performance against the number of trained parameters.
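
A much-simplified PyTorch sketch of the Siamese idea is given below: a shared bidirectional LSTM encoder with attention pooling over premise and hypothesis. The paper's stacked residual layers and Capsule Network head are omitted, and all hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Simplified Siamese NLI sketch: shared BiLSTM encoder + attention pooling,
# followed by a small classifier over [u, v, |u-v|, u*v].
import torch
import torch.nn as nn

class SiameseNLI(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=128, n_labels=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # scores each time step
        self.classifier = nn.Sequential(
            nn.Linear(8 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_labels))

    def encode(self, ids):
        h, _ = self.encoder(self.emb(ids))            # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)        # attention weights over time
        return (w * h).sum(dim=1)                     # attention-pooled sentence vector

    def forward(self, premise_ids, hypothesis_ids):
        u, v = self.encode(premise_ids), self.encode(hypothesis_ids)
        feats = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.classifier(feats)                 # logits: entail/neutral/contradict

model = SiameseNLI()
logits = model(torch.randint(0, 10000, (2, 12)), torch.randint(0, 10000, (2, 10)))
print(logits.shape)  # torch.Size([2, 3])
```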


1990 ◽  
Vol 17 (1) ◽  
pp. 21-29
Author(s):  
C. Korycinski ◽  
Alan F. Newell

The task of producing satisfactory indexes by automatic means has been tackled on two fronts: by statistical analysis of text and by attempting content analysis of the text in much the same way as a human indexer does. Though statistical techniques have a lot to offer for free-text database systems, neither method has had much success with back-of-the-book indexing. This review examines some problems associated with the application of natural-language processing techniques to book texts.
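
For the statistical front, a minimal sketch of index-term extraction is shown below: chapter terms are ranked by a tf-idf-style score against the rest of the book. The toy chapters and the scoring choice are assumptions for illustration only; real back-of-the-book indexing needs far more (phrases, cross-references, sense disambiguation), which is where the review locates the open problems.

```python
# Toy statistical index-term extraction: rank each chapter's terms by tf-idf.
import math
import re
from collections import Counter

chapters = [
    "grammar rules and parsing of natural language sentences",
    "statistical analysis of word frequency in large corpora",
    "evaluation of indexing quality against human indexers",
]
tokenized = [re.findall(r"[a-z]+", c.lower()) for c in chapters]
doc_freq = Counter(t for toks in tokenized for t in set(toks))

def index_terms(chapter_tokens, n_docs, top_k=3):
    """Return the chapter's top-k candidate index terms by tf-idf score."""
    tf = Counter(chapter_tokens)
    score = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    return sorted(score, key=score.get, reverse=True)[:top_k]

for i, toks in enumerate(tokenized):
    print(i, index_terms(toks, len(chapters)))
```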


2019 ◽  
Author(s):  
Fabio Trecca ◽  
Kristian Tylén ◽  
Riccardo Fusaroli ◽  
Christer Johansson ◽  
Morten H. Christiansen

Language processing depends on the integration of bottom-up information with top-down cues from several different sources—primarily our knowledge of the real world, of discourse contexts, and of how language works. Previous studies have shown that factors pertaining to both the sender and the receiver of the message affect the relative weighting of such information. Here, we suggest another factor that may change our processing strategies: perceptual noise in the environment. We hypothesize that listeners weight different sources of top-down information more in situations of perceptual noise than in noise-free situations. Using a sentence-picture matching experiment with four forced-choice alternatives, we show that degrading the speech input with noise compels the listeners to rely more on top-down information in processing. We discuss our results in light of previous findings in the literature, highlighting the need for a unified model of spoken language comprehension in different ecologically valid situations, including under noisy conditions.


Author(s):  
Toluwase Asubiaro

This study investigated whether there is a difference in the number of articles, datasets and computer codes that foreign and Nigerian authors of scientific publications on natural language processing (NLP) of Nigerian languages deposited in digital archives. Relevant articles were systematically retrieved from Google, Web of Science and Scopus. Authorship type and data archiving information were extracted from the full text of the relevant publications. Results show that papers with foreign authorship (80.4%) were published in non-commercial repositories more often than papers with Nigerian authorship (55.3%). Similarly, few papers with foreign authorship deposited research data (19.1%) and computer code (10.4%), while none of the papers with Nigerian authorship did. It is recommended that librarians in Nigeria create awareness of the benefits of digital archiving and open science.


2004 ◽  
Vol 1 (1) ◽  
pp. 1-10 ◽  
Author(s):  
Jean-Luc Verschelde ◽  
Mariana Casella Dos Santos ◽  
Tom Deray ◽  
Barry Smith ◽  
Werner Ceusters

Summary: Successful biomedical data mining and information extraction require a complete picture of biological phenomena such as genes, biological processes, and diseases, as these exist at different levels of granularity. To realize this goal, several freely available heterogeneous databases as well as proprietary structured datasets have to be integrated into a single global customizable scheme. We present a tool to integrate different biological data sources by mapping them to a proprietary biomedical ontology that has been developed for the purpose of making computers understand medical natural language.
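
A toy Python illustration of the mapping idea follows; the paper's tool and ontology are proprietary, so the identifiers, synonym table, and function below are invented placeholders, not the actual system.

```python
# Toy illustration: records from heterogeneous sources are normalized onto a
# single ontology identifier through a synonym table, yielding one scheme.
ONTOLOGY = {
    "GENE:BRCA1": {"brca1", "breast cancer 1", "rnf53"},
    "DISEASE:breast_carcinoma": {"breast carcinoma", "breast cancer"},
}
SYNONYM_TO_ID = {syn: oid for oid, syns in ONTOLOGY.items() for syn in syns}

def map_record(source_name, field_value):
    """Map one raw field value from any source database to an ontology id."""
    return SYNONYM_TO_ID.get(field_value.strip().lower())

print(map_record("swissprot", "RNF53"))            # GENE:BRCA1
print(map_record("clinical_db", "Breast cancer"))  # DISEASE:breast_carcinoma
```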


2004 ◽  
Vol 9 (1) ◽  
pp. 53-68 ◽  
Author(s):  
Montserrat Arévalo Rodríguez ◽  
Montserrat Civit Torruella ◽  
Maria Antònia Martí

In the field of corpus linguistics, Named Entity treatment includes the recognition and classification of different types of discursive elements such as proper names, dates, times, etc. These discursive elements play an important role in different Natural Language Processing applications and techniques such as Information Retrieval, Information Extraction, translation memories, document routers, etc.
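
A minimal rule-based sketch of such recognition and classification is shown below; the regular-expression patterns and example sentence are illustrative assumptions, whereas real Named Entity treatment of the kind described relies on much richer grammars and gazetteers.

```python
# Toy Named Entity recognition and classification with regular expressions.
import re

PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
    "PROPER_NAME": re.compile(r"\b(?:[A-ZÀ-Ú][a-zà-ú]+\s)+[A-ZÀ-Ú][a-zà-ú]+\b"),
}

def recognize(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, surface_form) pairs found in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        found += [(label, m.group()) for m in pattern.finditer(text)]
    return found

print(recognize("Montserrat Arévalo presented the system on 12/01/2004 at 10:30"))
```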


2014 ◽  
Vol 22 (1) ◽  
pp. 135-161 ◽  
Author(s):  
M. MELERO ◽  
M.R. COSTA-JUSSÀ ◽  
P. LAMBERT ◽  
M. QUIXAL

We present research aiming to build tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires the revisiting of the initial steps of Natural Language Processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user-generated texts) presents a number of nonstandard communicative and linguistic characteristics – often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews, and blogs, and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Our aim with this paper is to seize the power of already existing spell and grammar correction engines and endow them with automatic normalization capabilities in order to pave the way for the application of standard Natural Language Processing tools to typical UGC text. Particularly, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker that selects the most plausible correction from an unranked list of candidates provided by the spell-checker. To build this selector module we train four language models, each one containing a different type of linguistic information in a trade-off with its generalization capabilities. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with a parametrized combination of the models by both optimizing directly on the selection task and doing a linear interpolation of the models. The resulting parametrized combinations obtain results close to the best performing model but do not improve on those results, as measured on the test set. The precision of the selector module at ranking the expected correction proposal first on the test corpora reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).
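
The selector idea can be sketched in a few lines of Python: score each unranked spell-checker candidate with a language model and keep the most plausible one. The toy bigram model and training sentence below are placeholders standing in for the paper's four language models, not the authors' implementation.

```python
# Toy candidate selector: a smoothed bigram model scores the spell-checker's
# unranked candidates and the highest-scoring one is chosen.
from collections import Counter

training_tokens = "please call me tomorrow because i will call you later".split()
bigrams = Counter(zip(training_tokens, training_tokens[1:]))
unigrams = Counter(training_tokens)

def bigram_score(prev: str, word: str) -> float:
    """Add-one-smoothed bigram probability P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def select(prev_word: str, candidates: list[str]) -> str:
    """Pick the candidate the language model finds most plausible in context."""
    return max(candidates, key=lambda w: bigram_score(prev_word, w))

# The noisy UGC token "cal" might get these candidates from the spell-checker:
print(select("will", ["cal", "call", "col"]))  # -> "call"
```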


2018 ◽  
Vol 3 (7) ◽  
pp. 42 ◽  
Author(s):  
Omer Salih Dawood ◽  
Abd-El-Kader Sahraoui

The paper aims to address the problem of incompleteness and inconsistency between the requirements and design stages, and how to link these stages efficiently. Since software requirements are written in natural language (NL), Natural Language Processing (NLP) can be used to process them. In our research we built a framework that can generate design diagrams from requirements in a semi-automatic way and establish traceability between the requirements and design phases in both directions. The framework also shows how to manage traceability at different levels and how to apply changes to different artifacts. Many traceability reports can be generated from the developed framework. After applying this model we obtained good results. In our case study, the model generates a class diagram based on a central rule engine, and traceability was built and can be managed in a visual manner. We propose to continue this research, as it is a very critical area, by adding more Unified Modeling Language (UML) diagrams and applying changes directly inside the software requirements document.
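
As a rough illustration of rule-based diagram generation with traceability, the Python sketch below extracts candidate classes and associations from requirement sentences using a single pattern. The pattern, requirement texts, and data structures are invented for illustration and do not reproduce the framework's central rule engine.

```python
# Toy rule: "The <Actor> shall <verb> <Object>" yields candidate UML classes,
# an association between them, and a traceability link back to the requirement.
import re

RULE = re.compile(r"^The (?P<actor>\w+) shall (?P<verb>\w+) (?:the |a )?(?P<obj>\w+)", re.I)

requirements = {
    "REQ-01": "The librarian shall register a member",
    "REQ-02": "The system shall generate the report",
}

classes, associations, trace = set(), [], {}
for req_id, text in requirements.items():
    m = RULE.match(text)
    if m:
        actor, obj = m["actor"].capitalize(), m["obj"].capitalize()
        classes |= {actor, obj}
        associations.append((actor, m["verb"], obj))
        trace[req_id] = (actor, obj)   # requirement -> design traceability link

print(classes)        # candidate class diagram elements
print(associations)   # candidate associations (labelled by the verb)
print(trace)          # links usable for forward/backward traceability
```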

