Developing and Analyzing a Spanish Corpus for Forensic Purposes

Author(s):  
Ángela Almela ◽  
Gema Alcaraz-Mármol ◽  
Arancha García-Pinar ◽  
Clara Pallejá

In this paper, the methods for developing a database of Spanish writing that can be used for forensic linguistic research are presented, including our data collection procedures. Specifically, the main instrument used for data collection has been translated into Spanish and adapted from Chaski (2001). It consists of ten tasks, by means of which the subjects are asked to write formal and informal texts about different topics. To date, 93 undergraduates from Spanish universities, as well as prisoners convicted of gender-based abuse, have participated in the study. A twofold analysis has been performed, since the data collected have been approached from a semantic and a morphosyntactic perspective. Regarding the semantic analysis, psycholinguistic categories have been used, many of them taken from the LIWC dictionary (Pennebaker et al., 2001). In order to obtain a more comprehensive depiction of the linguistic data, some other ad hoc categories have been created, based on the corpus itself, using a double-check method for their validation so as to ensure inter-rater reliability. Furthermore, as regards morphosyntactic analysis, the natural language processing tool ALIAS TATTLER is being developed for Spanish. Results show that it is possible to differentiate non-abusers from abusers with high accuracy based on linguistic features.
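
As a rough illustration of dictionary-based semantic analysis of the kind described above, the sketch below computes the share of tokens in a text that fall into each category. The category names and word lists are hypothetical placeholders invented for this example. They are not taken from LIWC or from the authors' ad hoc categories, and the function name `category_rates` is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: category -> set of Spanish word forms.
# These entries are invented for illustration; they are NOT LIWC content.
LEXICON = {
    "negemo": {"odio", "miedo", "triste"},
    "posemo": {"feliz", "amor", "bueno"},
    "i": {"yo", "me", "mi"},
}

def category_rates(text, lexicon=LEXICON):
    """Return the share of tokens in `text` that fall in each category."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {category: counts[category] / total for category in lexicon}
```

Per-category rates like these could then serve as features for a classifier. The abstract does not say which classifier was used, so that step is left out.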

2010 ◽  
Vol 31 (3) ◽  
pp. 439-462 ◽  
Author(s):  
NICHOLAS D. DURAN ◽  
CHARLES HALL ◽  
PHILIP M. MCCARTHY ◽  
DANIELLE S. MCNAMARA

ABSTRACT: The words people use and the way they use them can reveal a great deal about their mental states when they attempt to deceive. The challenge for researchers is how to reliably distinguish the linguistic features that characterize these hidden states. In this study, we use a natural language processing tool called Coh-Metrix to evaluate deceptive and truthful conversations that occur within a context of computer-mediated communication. Coh-Metrix is unique in that it tracks linguistic features based on cognitive and social factors that are hypothesized to influence deception. The results from Coh-Metrix are compared to linguistic features reported in previous independent research, which used a natural language processing tool called Linguistic Inquiry and Word Count. The comparison reveals converging and contrasting alignment for several linguistic features and establishes new insights on deceptive language and its use in conversation.


2020 ◽  
Vol 24 ◽  
pp. 43-62
Author(s):  
Yamel Pérez-Guadarramas ◽  
Manuel Barreiro-Guerrero ◽  
Alfredo Simón-Cuevas ◽  
Francisco P. Romero ◽  
José A. Olivas

Automatic keyphrase extraction from texts is useful for many computational systems in the fields of natural language processing and text mining. Although a number of solutions to this problem have been described, semantic analysis is one of the least exploited linguistic features in the most widely known proposals, causing the results obtained to have low accuracy and performance rates. This paper presents an unsupervised method for keyphrase extraction, based on the use of lexico-syntactic patterns for extracting information from texts, and on fuzzy topic modeling. An OWA operator combining several semantic measures was applied to the topic modeling process. This new approach was evaluated on the Inspec and 500N-KPCrowd datasets. Several approaches within our proposal were evaluated against each other. A statistical analysis was performed to substantiate the best approach of the proposal. This best approach was also compared with other reported systems, giving promising results.
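
For readers unfamiliar with the OWA (ordered weighted averaging) operator mentioned above, a minimal sketch follows. It shows the standard definition only: the inputs are sorted in descending order and the weights are applied by rank. Which semantic measures the authors combine and which weights they use are not stated in the abstract, so the values in the usage note are purely illustrative.

```python
def owa(values, weights):
    """Ordered weighted averaging: weights apply to ranks, not to sources."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to 1")
    ordered = sorted(values, reverse=True)  # largest value first
    return sum(w * v for w, v in zip(weights, ordered))
```

With hypothetical similarity scores `[0.2, 0.9, 0.5]` and weights `[0.5, 0.3, 0.2]`, the call `owa([0.2, 0.9, 0.5], [0.5, 0.3, 0.2])` weights the largest score by 0.5, the middle one by 0.3 and the smallest by 0.2. Weights of `[1, 0, 0]` reduce the operator to the maximum, and `[0, 0, 1]` to the minimum.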


Author(s):  
Ramin Sabbagh ◽  
Farhad Ameri

Abstract: The natural language descriptions of the capabilities of manufacturing companies can be found in multiple locations including company websites, legacy system databases, and ad hoc documents and spreadsheets. To unlock the value of unstructured capability data and learn from it, there is a need for developing advanced quantitative methods supported by machine learning and natural language processing techniques. This research proposes a hybrid unsupervised learning methodology using K-means clustering and topic modeling techniques in order to build clusters of suppliers based on their capabilities, automatically infer topics from the created clusters, and discover nontrivial patterns in manufacturing capability corpora. The capability data is extracted either directly from the websites of manufacturing firms or from their profiles in e-sourcing portals and directories. The feature extraction and dimensionality reduction processes in this work are supported by N-gram extraction and latent semantic analysis (LSA) methods. The proposed clustering method is validated experimentally based on a dataset composed of 150 capability descriptions collected from web-based sourcing directories such as the Thomas Net directory for manufacturing companies. The results of the experiment show that the proposed method creates supplier clusters with high accuracy. Two example applications of the proposed framework, related to supplier similarity measurement and automated thesaurus creation, are introduced in this paper.
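
The pipeline described above (N-gram features followed by K-means clustering) can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the LSA dimensionality-reduction and topic modeling steps are omitted, raw counts stand in for whatever term weighting the paper uses, and all function names are hypothetical.

```python
import random
from collections import Counter

def ngrams(text, n_max=2):
    """Return word n-grams of length 1..n_max from a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def vectorize(docs, n_max=2):
    """Turn each document into a count vector over the shared n-gram vocabulary."""
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[term] for term in vocab] for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # initial centroids: k random documents
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

For example, `kmeans(vectorize(descriptions), 2)` on a handful of made-up capability descriptions would return one integer label per description. In practice a library implementation with TF-IDF weighting and truncated SVD would likely be preferred for a corpus of realistic size.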


Author(s):  
Ramin Sabbagh ◽  
Farhad Ameri

The descriptions of capabilities of manufacturing companies can be found in multiple locations including company websites, legacy system databases, and ad hoc documents and spreadsheets. The capability descriptions are often represented using natural language. To unlock the value of unstructured capability information and learn from it, there is a need for developing advanced quantitative methods supported by machine learning and natural language processing techniques. This research proposes a multi-step unsupervised learning methodology using K-means clustering and topic modeling techniques in order to build clusters of suppliers based on their capabilities, extract and organize the manufacturing capability terminology, and discover nontrivial patterns in manufacturing capability corpora. The capability data is extracted either directly from the websites of manufacturing firms or from their profiles in e-sourcing portals and directories. The feature extraction and dimensionality reduction processes in this work are supported by N-gram extraction and Latent Semantic Analysis (LSA) methods. The proposed clustering method is validated experimentally based on a dataset composed of 150 capability descriptions collected from web-based sourcing directories such as the Thomas Net directory for manufacturing companies. The results of the experiment show that the proposed method creates supplier clusters with high accuracy.


2020 ◽  
Vol 3 (4) ◽  
pp. p94
Author(s):  
Wenli Xu ◽  
Yi Tang

The present study investigated the variations in linguistic features of English academic writing by American and Chinese scientists by building a corpus of 600 English agricultural journal abstracts and using the natural language processing tool Coh-Metrix. Through a one-way Analysis of Variance (ANOVA) and a discriminant function analysis (DFA), we statistically analyzed the corpus texts based on their lexical, syntactic and cohesive features and generated 8 distinguishing linguistic indices. The results indicated that Chinese scientists tended to write abstracts with more frequent words, more similar sentence structures, more modifiers per noun phrase and more agentless passive voice forms, while their American counterparts tended to write abstracts with a wider range of vocabulary, more specific terms, more words with multiple senses and more adversative connectives. These findings offer good guidance for Chinese scientists to write in a style closer to that of the agricultural research field and of native speakers, so as to get their manuscripts better reviewed and more easily published. These findings also have practical implications for the development of agricultural English teaching materials as well as for curriculum design.
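
To make the one-way ANOVA step concrete, the sketch below computes the F statistic (between-group mean square divided by within-group mean square) for one linguistic index measured in two or more groups of texts. It is a textbook formula written out in plain Python. The function name is an assumption, and the abstract does not report the underlying index values.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over lists of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within
```

A large F for an index suggests that the group means differ by more than within-group variation would explain. Turning F into a p-value requires the F distribution, which a statistics library would normally supply.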


Author(s):  
Radha Guha

Background: In the era of information overload, it is very difficult for a human reader to quickly make sense of the vast information available on the internet. Even for a specific domain like a college or university website, it may be difficult for a user to browse through all the links to get the relevant answers quickly. Objective: In this scenario, the design of a chat-bot that can answer questions related to college information and compare colleges will be very useful and novel. Methods: In this paper, a novel conversational interface chat-bot application with information retrieval and text summarization skills is designed and implemented. First, this chat-bot has a simple dialog skill: when it can understand the user query intent, it responds from the stored collection of answers. Second, for unknown queries, this chat-bot can search the internet and then perform text summarization using advanced techniques of natural language processing (NLP) and text mining (TM). Results: The advances in the NLP capabilities of information retrieval and text summarization using the machine learning techniques of Latent Semantic Analysis (LSI), Latent Dirichlet Allocation (LDA), Word2Vec, Global Vectors (GloVe) and TextRank are first reviewed and compared in this paper before being implemented in the chat-bot design. This chat-bot improves user experience tremendously by providing concise answers to specific queries, which takes less time than reading the entire document. Students, parents and faculty can more efficiently get answers on a variety of information such as admission criteria, fees, course offerings, notice board, attendance, grades, placements, faculty profiles, research papers and patents. Conclusion: The purpose of this paper was to follow the advancement in NLP technologies and implement them in a novel application.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Nadzifatul Mu’tamaroh ◽  
Yuni Pantiwati

Abstract: Gender issues must be resolved immediately. The study aims to describe: 1) the implementation of gender-based class segregation policies; 2) the inhibiting factors and solutions in implementing these policies, and the school's efforts to overcome the problems faced in implementing them. The research is descriptive and qualitative. It was carried out at SMP Islam Al Maarif 01 Singosari, an Islamic junior high school. Data were collected through interviews, observation, and documentation. The analysis stages were data collection, data reduction, data presentation, and conclusion drawing. Data validity was checked using data and source triangulation. The results showed that 1) the gender-based class segregation policy was implemented by separating male and female classes, from grades VII, VII and X, but within one building and one organization, supported by the enforcement of school rules. 2) The barriers and solutions faced by the school in carrying out the policy are as follows: male students tend to disagree with the policy, which makes the classroom atmosphere during learning hours less conducive. The school's solution is to provide approaches and direction to students; every teacher, especially the guidance counseling teacher and the subject teachers, must know the problems that often occur among Al Maarif students so that they can be evaluated on an ongoing basis.
Keywords: Gender, Segregation, Policy Implementation


Author(s):  
Vittoria Cuteri ◽  
Giulia Minori ◽  
Gloria Gagliardi ◽  
Fabio Tamburini ◽  
Elisabetta Malaspina ◽  
...  

Abstract. Purpose: Attention has recently been paid to Clinical Linguistics for the detection and support of clinical conditions. Many works have been published on the “linguistic profile” of various clinical populations, but very few papers have been devoted to linguistic changes in patients with eating disorders. Patients with Anorexia Nervosa (AN) share similar psychological features such as disturbances in self-perceived body image, inflexible and obsessive thinking and anxious or depressive traits. We hypothesize that these characteristics can result in altered linguistic patterns and be detected using Natural Language Processing tools. Methods: We enrolled 51 young participants from December 2019 to February 2020 (age range: 14–18): 17 girls with a clinical diagnosis of AN, and 34 normal-weighted peers, matched by gender, age and educational level. Participants in each group were asked to produce three written texts (around 10–15 lines long). A rich set of linguistic features was extracted from the text samples and the statistical significance in pinpointing the pathological process was measured. Results: Comparison between the two groups showed several linguistic indexes to be statistically significant, with syntactic reduction as the most relevant trait of AN productions. In particular, the following features emerge as statistically significant in distinguishing AN girls from their normal-weighted peers: the length of the sentences, the complexity of the noun phrase, and the global syntactic complexity. This peculiar pattern of linguistic erosion may be due to the severe metabolic impairment also affecting the central nervous system in AN. Conclusion: These preliminary data showed the existence of linguistic parameters as probable linguistic markers of AN. However, the analysis of a bigger cohort, still ongoing, is needed to consolidate this assumption. Level of evidence: III, evidence obtained from case–control analytic studies.


2021 ◽  
Author(s):  
Xinxu Shen ◽  
Troy Houser ◽  
David Victor Smith ◽  
Vishnu P. Murty

The use of naturalistic stimuli, such as narrative movies, is gaining popularity in many fields, characterizing memory, affect, and decision-making. Narrative recall paradigms are often used to capture the complexity and richness of memory for naturalistic events. However, scoring narrative recalls is time-consuming and prone to human biases. Here, we show the validity and reliability of using a natural language processing tool, the Universal Sentence Encoder (USE), to automatically score narrative recall. We compared the reliability of scoring between two independent raters (i.e., hand-scored) and between our automated algorithm and individual raters (i.e., automated) on trial-unique video clips of magic tricks. Study 1 showed that our automated segmentation approaches yielded high reliability and reflected measures yielded by hand-scoring, and further that the results using USE outperformed another popular natural language processing tool, GloVe. In Study 2, we tested whether our automated approach remained valid when testing individuals varying on clinically relevant dimensions that influence episodic memory: age and anxiety. We found that our automated approach was equally reliable across both age groups and anxiety groups, which shows the efficacy of our approach for assessing narrative recall in large-scale individual difference analysis. In sum, these findings suggest that machine learning approaches implementing USE are a promising tool for scoring large-scale narrative recalls and performing individual difference analysis for research using naturalistic stimuli.
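
One plausible way to turn sentence embeddings into a recall score is cosine similarity between the embedding of a participant's recall and the embeddings of reference descriptions of the event. The sketch below assumes the embeddings have already been computed (for instance by USE) and are passed in as plain lists of floats. The abstract does not describe the authors' scoring rule, so the "best match" rule and both function names are assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)

def score_recall(recall_vec, reference_vecs):
    """Score a recall as its best cosine match against reference vectors."""
    return max(cosine_similarity(recall_vec, ref) for ref in reference_vecs)
```

Scores produced this way could then be correlated with hand-scored ratings to estimate agreement between automated and human scoring.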

