All Mixed Up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch

2016 ◽  
Vol 42 (3) ◽  
pp. 457-490 ◽  
Author(s):  
Orphée De Clercq ◽  
Véronique Hoste

Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, although NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and crowdsourcing, we implement different types of text characteristics, ranging from easy-to-compute superficial text characteristics to features requiring deep linguistic processing, resulting in ten different feature groups. Both a regression and a classification set-up are investigated, reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization, using a wrapper-based genetic algorithm optimization approach, is a promising task that provides considerable insights into which feature combinations contribute to the overall readability prediction. Because we also have gold standard information available for those features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, we observe that the performance of our fully automatic readability prediction pipeline is on par with the pipeline using gold-standard deep syntactic and semantic information.
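
The wrapper-based genetic algorithm search mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter values, and the toy fitness function are all assumptions chosen for illustration. In a real wrapper, `fitness` would train and evaluate the learner on the selected features.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Search over feature subsets (0/1 masks) with a simple genetic algorithm.

    `fitness` is the wrapper: it receives a tuple of 0/1 flags and returns a
    score (higher is better), e.g. the cross-validated performance of a
    learner trained on the selected features.
    """
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament(candidates):
        # Pick the fitter of two randomly drawn individuals.
        a, b = rng.sample(candidates, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_population = [best]  # elitism: the best subset always survives
        while len(next_population) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each flag is toggled with a small probability.
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
    return best


def toy_fitness(mask):
    # Hypothetical stand-in for a learner's score: features 0, 2 and 5 help,
    # and every selected feature carries a small cost.
    useful = {0, 2, 5}
    gain = sum(1.0 for i, bit in enumerate(mask) if bit and i in useful)
    return gain - 0.1 * sum(mask)


best_mask = genetic_feature_selection(8, toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

Because the best individual is carried over unchanged, the best score never decreases from one generation to the next. The selected mask can then be inspected to see which feature groups survive together, which is the kind of insight a correlation ranking of single features cannot give.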

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yue Jiao ◽  
Fabienne Lesueur ◽  
Chloé-Agathe Azencott ◽  
Maïté Laurent ◽  
Noura Mebirouk ◽  
...  

Background: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies: GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying cancer risk of BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. Methods: To identify as many as possible of the individuals who participate in both studies but are not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named “PRL + ML”) combined the candidate matches identified by both approaches. We built the ML model using the gold standard on a first version of the two databases as a training dataset. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. Results: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms, including RF, Bagged trees, AdaBoost, Support Vector Machine, and Neural Network. RF was therefore selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined linkage PRL + ML showed a higher recall (range 0.988–0.992) than either PRL (range 0.916–0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). Conclusions: Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.
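
As a rough illustration of why combining candidate matches can raise recall, the sketch below assumes that "combining" means taking the union of the two candidate sets; the abstract does not spell out the exact rule. The record identifiers and match sets are invented for the example and are not data from the study.

```python
def recall(predicted, gold):
    """Fraction of gold-standard matches that appear among the predictions."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(predicted & gold) / len(gold)


# Hypothetical candidate matches, written as (GEMO id, GENEPSO id) pairs.
prl_matches = {("g1", "p1"), ("g2", "p2"), ("g3", "p3")}
ml_matches = {("g2", "p2"), ("g3", "p3"), ("g4", "p4")}
gold = {("g1", "p1"), ("g2", "p2"), ("g3", "p3"), ("g4", "p4"), ("g5", "p5")}

# Assumed combination rule: keep every pair proposed by either method.
combined = prl_matches | ml_matches

print(recall(prl_matches, gold))  # 0.6
print(recall(ml_matches, gold))   # 0.6
print(recall(combined, gold))     # 0.8
```

Under this assumption the combined recall can never fall below that of either method alone, because the union contains both candidate sets. The trade-off is that the union also keeps the false positives of both methods.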


Author(s):  
Moretti Emilio ◽  
Tappia Elena ◽  
Limère Veronique ◽  
Melacini Marco

Abstract: As a large number of companies are resorting to increased product variety and customization, growing attention is being paid to the design and management of part feeding systems. Recent works have proved the effectiveness of hybrid feeding policies, which consist in using multiple feeding policies in the same assembly system. In this context, the assembly line feeding problem (ALFP) refers to the selection of a suitable feeding policy for each part. In the literature, the ALFP is addressed either by developing optimization models or by categorizing the parts and assigning these categories to policies based on some characteristics of both the parts and the assembly system. This paper presents a new approach for selecting a suitable feeding policy for each part, based on supervised machine learning. The developed approach is applied to an industrial case and its performance is compared with that of an optimization approach. The application to the industrial case allows a closer examination of the trade-off between efficiency (i.e., amount of data to be collected and dedicated resources) and quality of the ALFP solution (i.e., closeness to the optimal solution), a discussion of the managerial implications of different ALFP solution approaches, and a demonstration of the potential value of applying machine learning.


2021 ◽  
Vol 5 (2.1) ◽  
pp. 1
Author(s):  
Ting Deng ◽  
Juan Li ◽  
Xiaohua Li ◽  
Xiaobo Li ◽  
Yiming Yan

Objective: To characterize the changes in hematologic parameters associated with subtypes (STs) of Blastocystis sp. infection and with immune status in Sprague Dawley (SD) rats, and to lay the foundation for research on Blastocystis hominis pathogenesis. Methods: Five isolates of ST1, ST3 and ST7 were used: one ST1 isolate from a symptomatic patient, and two isolates each of ST3 and ST7, one from a symptomatic patient and one from an asymptomatic carrier. An immunocompromised model was established using dexamethasone (DEX), infection models were established with the five isolates, and hematologic changes were examined 15 days post infection using a fully automatic hematology analyzer (Sysmex XE-2100). Results: In immunocompetent rats, infection with Blastocystis STs led to increases in platelet indices, including MPV and PDW, compared with the controls (P < 0.05); the exception was ST3 isolated from an asymptomatic carrier, which showed only an increase in PDW. ST7 isolated from an asymptomatic carrier also showed higher PLT values than the controls. In immunocompromised rats, however, infection with ST7 isolated from a symptomatic patient gave rise to higher WBC, LYMP, EO, MCV and RDW-SD values and lower NEU% values than in the controls (P < 0.05). Meanwhile, ST1 infections showed higher WBC and LYMP values and lower NEUT% values than the controls (P < 0.05); NEUT values were lower in ST1 infections and in the controls than in ST3 and ST7 infections, respectively (P < 0.05); and infection with ST3 isolated from a symptomatic patient resulted in higher MCV and RDW-SD values, whereas the asymptomatic ST3 isolate showed only higher RDW-SD (P < 0.05). Conclusion: The virulence of Blastocystis sp. isolated from symptomatic patients is higher than that of the same subtype isolated from asymptomatic carriers. Infection with ST7 isolated from symptomatic patients may result in the most distinct hematologic changes among the STs, followed by the symptomatic ST1 isolate. The severity of Blastocystis sp. infection may be mediated by the immune status of the host.


Over the last decades, fire and smoke detection based on digital image processing has improved steadily, providing more accurate detection results for surveillance security systems. Detecting fire and smoke in surveillance videos is a very challenging task because of the complex structural properties of video frames or images, and existing work needs to be improved by using a feature selection or optimization approach to select an optimal feature set for fire and smoke. This paper presents research that combines various feature extraction techniques with a feature selection approach for fire and smoke detection. We develop a Fire and Smoke Detection (FSD) system using digital image processing, with Speeded-Up Robust Features (SURF) for feature extraction and Intelligent Water Drops (IWD) as the feature selection and optimization algorithm. Here, an Artificial Neural Network (ANN) is used as the Artificial Intelligence (AI) technique, which helps to select a set of optimal features from those extracted by the SURF descriptor from the video frames. Using the optimized ANN increases the detection accuracy of the proposed FSD system and minimizes its percentage of error. Finally, the performance of the FSD system is evaluated to validate the model. The evaluation shows that IWD can be used with SURF as a feature extraction technique to detect fire or smoke in surveillance video with a minimal error rate, and the simulation results clearly show the effectiveness of the proposed FSD system.


2021 ◽  
Vol 15 (2) ◽  
pp. e0009042
Author(s):  
Artemis Koukounari ◽  
Haziq Jamil ◽  
Elena Erosheva ◽  
Clive Shiff ◽  
Irini Moustaki

Various global health initiatives are currently advocating the elimination of schistosomiasis within the next decade. Schistosomiasis is a highly debilitating tropical infectious disease with a severe burden of morbidity, and thus operational research that accurately evaluates diagnostics quantifying the epidemic status, in order to guide effective strategies, is essential. Latent class models (LCMs) have been generally considered in epidemiology, and in particular in recent schistosomiasis diagnostic studies, as a flexible tool for evaluating diagnostics because assessing the true infection status (via a gold standard) is not possible. However, within the biostatistics literature, classical LCMs have already been criticised for real-life problems under violation of the conditional independence (CI) assumption and when applied to a small number of diagnostics (i.e. most often 3-5 diagnostic tests). Solutions that relax the CI assumption and account for zero-inflation, as well as the collection of partial gold standard information, have been proposed, offering the potential for more robust model estimates. In the current article, we examined such approaches in the context of schistosomiasis via analysis of two real datasets and extensive simulation studies. Our main conclusions highlighted poor model fit in low prevalence settings and the necessity of collecting partial gold standard information in such settings in order to improve the accuracy and reduce the bias of sensitivity and specificity estimates.
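
To make the conditional independence assumption concrete, the following sketch computes the probability of one pattern of test results under a classical two-class LCM. Under CI, the tests are independent given the true infection status, so each class contributes a simple product of per-test terms. The prevalence, sensitivity and specificity values are hypothetical and are not estimates from the article.

```python
from itertools import product


def pattern_probability(pattern, prevalence, sensitivities, specificities):
    """P(observed test pattern) under a two-class LCM with conditional independence.

    `pattern` holds one 0/1 result per diagnostic test (1 = positive).
    """
    p_infected = prevalence
    p_uninfected = 1.0 - prevalence
    for y, se, sp in zip(pattern, sensitivities, specificities):
        # Infected class: positive with probability se, negative with 1 - se.
        p_infected *= se if y else (1.0 - se)
        # Uninfected class: positive with probability 1 - sp, negative with sp.
        p_uninfected *= (1.0 - sp) if y else sp
    return p_infected + p_uninfected


# Hypothetical values for three diagnostic tests.
prevalence = 0.2
sensitivities = [0.9, 0.8, 0.7]
specificities = [0.95, 0.9, 0.85]

p_all_positive = pattern_probability((1, 1, 1), prevalence,
                                     sensitivities, specificities)
total = sum(pattern_probability(p, prevalence, sensitivities, specificities)
            for p in product([0, 1], repeat=3))
print(p_all_positive, total)
```

With three tests there are 7 free pattern probabilities and 7 parameters (one prevalence, three sensitivities, three specificities), which leaves no degrees of freedom to check the fit. This is one reason the small number of diagnostics is a concern for classical LCMs.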


Author(s):  
Vasiliki Simaki ◽  
Carita Paradis ◽  
Maria Skeppstedt ◽  
Magnus Sahlgren ◽  
Kostiantyn Kucher ◽  
...  

Abstract: The aim of this study is to explore the possibility of identifying speaker stance in discourse, to provide an analytical resource for it, and to evaluate the level of agreement across speakers. We also explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts was compiled, the Brexit Blog Corpus (BBC). An analytical protocol and interface (Active Learning and Visual Analytics) for the annotations was set up and the data were independently annotated by two annotators. The annotation procedure, the annotation agreements and the co-occurrence of more than one stance in the utterances are described and discussed. The careful, analytical annotation process has returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.
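
The abstract does not name the agreement measure used. As one common option for two annotators, the sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The labels are placeholders rather than the study's stance categories, and the sketch assumes one label per utterance, whereas the corpus allows several stances to co-occur.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (one label per item)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently at random
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Placeholder labels for four utterances.
annotator_1 = ["stance_x", "stance_x", "stance_y", "stance_y"]
annotator_2 = ["stance_x", "stance_y", "stance_y", "stance_y"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.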

