4. Feature Engineering: Linguistic Annotation and Transformation

2021
pp. 177-220
Author(s):  
José Calvo Tello
Corpora
2011
Vol 6 (2)
pp. 145-158
Author(s):
Jordi Porta Zamorano
Emilio del Rosal García
Ignacio Ahumada Lara

Iberia is a synchronic corpus of scientific Spanish designed mainly for terminological studies. In this paper, we describe its design and the infrastructure for its acquisition, processing and exploitation, including mark-up, linguistic annotation, indexing and the user interface. Two pre-processing tasks affecting a large number of words are described in detail: de-hyphenation and identification of text fragments in other languages. We also show how some of the reported statistics, namely, dispersion and association, are used for research on lexis.
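The abstract names de-hyphenation as a pre-processing task but does not say how it is done. The following is only a minimal sketch of one common approach, assuming a word list is available to decide whether a line-final hyphen belongs to the word; `LEXICON`, `HYPHEN_BREAK`, and `dehyphenate` are hypothetical names and do not come from the Iberia infrastructure.

```python
import re

# Hypothetical mini-lexicon; a real pipeline would use a large word list.
LEXICON = {"terminología", "investigación", "físico-químico"}

# A word fragment, a hyphen, a line break, then the continuation fragment.
HYPHEN_BREAK = re.compile(r"(\w+)-\n[ \t]*(\w+)")


def dehyphenate(text, lexicon=LEXICON):
    """Rejoin words that were split across a line break with a hyphen."""

    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = left + "-" + right
        if joined.lower() in lexicon:
            return joined
        if hyphenated.lower() in lexicon:
            # The hyphen is part of the word itself, so keep it.
            return hyphenated
        # Unknown form: assume the hyphen only marked the line break.
        return joined

    return HYPHEN_BREAK.sub(join, text)
```

For example, `dehyphenate("la termino-\nlogía científica")` rejoins the split word, while a compound listed with its hyphen in the lexicon keeps it.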


Author(s):  
Vera Yakubson ◽  
Victor Zakharov

This paper deals with building specialized corpora, specifically an academic-language corpus in the field of biotechnology. As part of a larger research project devoted to the creation and use of a specialized parallel corpus, it analyzes the initial step of corpus building. Our main research question was which procedures need to be applied to the texts before they are used to develop the corpus. A review of previous research showed a substantial number of papers devoted to corpus creation, including academic specialized corpora. These studies analyzed different aspects of the process, including the types of texts used, the principles of crawling, and the recommended length of texts. With regard to text processing for corpus creation, only linguistic annotation issues had been examined previously. Yet the preliminary cleaning of texts before their use in corpora may have a significant influence on corpus quality and on its utility for linguistic research. In this paper, we considered three small corpora derived from the same set of academic texts in the biotechnology field: a “raw” corpus without any preliminary cleaning and two corpora with different levels of cleaning. Using different Sketch Engine tools, we analyzed these corpora from the perspective of their future users, predominantly as sources for academic wordlists and specialized multi-word units. The research showed very little difference between the two cleaned corpora, meaning that only basic cleaning procedures, such as the removal of reference lists, are likely to be useful in corpus design. At the same time, we found a significant difference between the raw and cleaned corpora and argue that this difference can affect the quality of wordlist and multi-word term extraction; these cleaning procedures are therefore meaningful.
The main limitation of the study is that all texts were taken from a single source, so the conclusions could be affected by this specific journal’s peculiarities. Future work should therefore verify the results on different text collections.
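The basic cleaning step mentioned above, removal of reference lists, might be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes each article is plain text and that the reference list starts at a heading line such as "References". The heading names and the function `strip_reference_list` are hypothetical.

```python
import re

# Assumed heading names; journals differ, so this list is illustrative only.
REFERENCE_HEADING = re.compile(
    r"^[ \t]*(references|bibliography|works cited)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_reference_list(text):
    """Drop everything from the last reference-list heading onward."""
    matches = list(REFERENCE_HEADING.finditer(text))
    if not matches:
        # No recognizable heading: leave the text untouched.
        return text
    # Cut at the last heading so an earlier in-text mention on its own
    # line does not truncate the article body.
    return text[: matches[-1].start()].rstrip() + "\n"
```

Cutting at the last matching heading is a design choice that favors keeping body text over removing every possible reference fragment; a production pipeline would need journal-specific rules.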


2020
Vol 1525
pp. 012107
Author(s):  
M Erdmann ◽  
E Geiser ◽  
Y Rath ◽  
M Rieger

2021
Vol 21 (1)
Author(s):  
Sima Ranjbari
Toktam Khatibi
Ahmad Vosough Dizaji
Hesamoddin Sajadi
Mehdi Totonchi
...  

Abstract
Background: Predicting the outcome of intrauterine insemination (IUI) is a challenge for assisted reproductive technology (ART) practitioners. Predicting the success or failure of IUI from the couples' features can help physicians decide whether to suggest IUI to a couple and whether to continue the treatment. Many previous studies have focused on predicting the outcome of in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) using machine learning algorithms, but, to the best of our knowledge, few have focused on predicting the outcome of IUI. The main aim of this study is to propose an automatic classification and feature scoring method to predict IUI outcome and to rank the most significant features.
Methods: A novel approach combining complex network-based feature engineering and a stacked ensemble (CNFE-SE) is proposed. Three complex networks are extracted based on similarities in the patients' data, and the feature engineering step is performed on these networks. The original feature set and/or the engineered features are fed to the proposed stacked ensemble to classify and predict the IUI outcome for couples per IUI treatment cycle. This is a retrospective study of 5 years of data from couples undergoing IUI. The data were collected from the Reproductive Biomedicine Research Center, Royan Institute, and describe 11,255 IUI treatment cycles for 8,360 couples. The dataset includes the couples' demographic characteristics, historical data about the patients' diseases, the clinical diagnosis, the treatment plans and the drugs prescribed during the cycles, semen quality, laboratory tests, and the clinical pregnancy outcome.
Results: Experimental results show that the proposed method outperforms the compared methods, with an area under the receiver operating characteristic curve (AUC) of 0.84 ± 0.01, sensitivity of 0.79 ± 0.01, specificity of 0.91 ± 0.01, and accuracy of 0.85 ± 0.01 for the prediction of IUI outcome.
Conclusions: The most important predictors of IUI outcome are semen parameters (sperm motility and concentration) and female body mass index (BMI).
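Sensitivity, specificity, and accuracy all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions only; the function name `classification_metrics` is hypothetical, and the counts in the usage note are toy values chosen for illustration, not data from the study. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary metrics from confusion-matrix counts.

    tp/fn: positive cases predicted correctly/incorrectly.
    tn/fp: negative cases predicted correctly/incorrectly.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        # Share of true positives that the classifier detects.
        "sensitivity": tp / (tp + fn),
        # Share of true negatives that the classifier rejects.
        "specificity": tn / (tn + fp),
        # Share of all cases classified correctly.
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```

With the toy counts `tp=8, fp=1, tn=9, fn=2`, the definitions give a sensitivity of 8/10, a specificity of 9/10, and an accuracy of 17/20.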

