scholarly journals A World Full of Stereotypes? Further Investigation on Origin and Gender Bias in Multi-Lingual Word Embeddings

2021 ◽  
Vol 4 ◽  
Author(s):  
Mascha Kurpicz-Briki ◽  
Tomaso Leoni

Publicly available off-the-shelf word embeddings that are often used in productive applications for natural language processing have been proven to be biased. We have previously shown that this bias can come in different forms, depending on the language and the cultural context. In this work, we extend our previous work and further investigate how bias varies in different languages. We examine Italian and Swedish word embeddings for gender and origin bias, and demonstrate how an origin bias concerning local migration groups in Switzerland is included in German word embeddings. We propose BiasWords, a method to automatically detect new forms of bias. Finally, we discuss how cultural and language aspects are relevant to the impact of bias on the application and to potential mitigation measures.

2021 ◽  
Vol 12 ◽  
Author(s):  
Emma Pair ◽  
Nikitha Vicas ◽  
Ann M. Weber ◽  
Valerie Meausoone ◽  
James Zou ◽  
...  

Background: Despite a 2010 Kenyan constitutional amendment limiting members of elected public bodies to < two-thirds of the same gender, only 22 percent of the 12th Parliament members inaugurated in 2017 were women. Investigating gender bias in the media is a useful tool for understanding socio-cultural barriers to implementing legislation for gender equality. Natural language processing (NLP) methods, such as word embedding and sentiment analysis, can efficiently quantify media biases at a scope previously unavailable in the social sciences.Methods: We trained GloVe and word2vec word embeddings on text from 1998 to 2019 from Kenya’s Daily Nation newspaper. We measured gender bias in these embeddings and used sentiment analysis to predict quantitative sentiment scores for sentences surrounding female leader names compared to male leader names.Results: Bias in leadership words for men and women measured from Daily Nation word embeddings corresponded to temporal trends in men and women’s participation in political leadership (i.e., parliamentary seats) using GloVe (correlation 0.8936, p = 0.0067, r2 = 0.799) and word2vec (correlation 0.844, p = 0.0169, r2 = 0.712) algorithms. Women continue to be associated with domestic terms while men continue to be associated with influence terms, for both regular gender words and female and male political leaders’ names. Male words (e.g., he, him, man) were mentioned 1.84 million more times than female words from 1998 to 2019. Sentiment analysis showed an increase in relative negative sentiment associated with female leaders (p = 0.0152) and an increase in positive sentiment associated with male leaders over time (p = 0.0216).Conclusion: Natural language processing is a powerful method for gaining insights into and quantifying trends in gender biases and sentiment in news media. We found evidence of improvement in gender equality but also a backlash from increased female representation in high-level governmental leadership.


AERA Open ◽  
2021 ◽  
Vol 7 ◽  
pp. 233285842110286
Author(s):  
Kylie L. Anglin ◽  
Vivian C. Wong ◽  
Arielle Boguslav

Though there is widespread recognition of the importance of implementation research, evaluators often face intense logistical, budgetary, and methodological challenges in their efforts to assess intervention implementation in the field. This article proposes a set of natural language processing techniques called semantic similarity as an innovative and scalable method of measuring implementation constructs. Semantic similarity methods are an automated approach to quantifying the similarity between texts. By applying semantic similarity to transcripts of intervention sessions, researchers can use the method to determine whether an intervention was delivered with adherence to a structured protocol, and the extent to which an intervention was replicated with consistency across sessions, sites, and studies. This article provides an overview of semantic similarity methods, describes their application within the context of educational evaluations, and provides a proof of concept using an experimental study of the impact of a standardized teacher coaching intervention.


Author(s):  
Clifford Nangle ◽  
Stuart McTaggart ◽  
Margaret MacLeod ◽  
Jackie Caldwell ◽  
Marion Bennie

ABSTRACT ObjectivesThe Prescribing Information System (PIS) datamart, hosted by NHS National Services Scotland receives around 90 million electronic prescription messages per year from GP practices across Scotland. Prescription messages contain information including drug name, quantity and strength stored as coded, machine readable, data while prescription dose instructions are unstructured free text and difficult to interpret and analyse in volume. The aim, using Natural Language Processing (NLP), was to extract drug dose amount, unit and frequency metadata from freely typed text in dose instructions to support calculating the intended number of days’ treatment. This then allows comparison with actual prescription frequency, treatment adherence and the impact upon prescribing safety and effectiveness. ApproachAn NLP algorithm was developed using the Ciao implementation of Prolog to extract dose amount, unit and frequency metadata from dose instructions held in the PIS datamart for drugs used in the treatment of gastrointestinal, cardiovascular and respiratory disease. Accuracy estimates were obtained by randomly sampling 0.1% of the distinct dose instructions from source records, comparing these with metadata extracted by the algorithm and an iterative approach was used to modify the algorithm to increase accuracy and coverage. ResultsThe NLP algorithm was applied to 39,943,465 prescription instructions issued in 2014, consisting of 575,340 distinct dose instructions. For drugs used in the gastrointestinal, cardiovascular and respiratory systems (i.e. chapters 1, 2 and 3 of the British National Formulary (BNF)) the NLP algorithm successfully extracted drug dose amount, unit and frequency metadata from 95.1%, 98.5% and 97.4% of prescriptions respectively. However, instructions containing terms such as ‘as directed’ or ‘as required’ reduce the usability of the metadata by making it difficult to calculate the total dose intended for a specific time period as 7.9%, 0.9% and 27.9% of dose instructions contained terms meaning ‘as required’ while 3.2%, 3.7% and 4.0% contained terms meaning ‘as directed’, for drugs used in BNF chapters 1, 2 and 3 respectively. ConclusionThe NLP algorithm developed can extract dose, unit and frequency metadata from text found in prescriptions issued to treat a wide range of conditions and this information may be used to support calculating treatment durations, medicines adherence and cumulative drug exposure. The presence of terms such as ‘as required’ and ‘as directed’ has a negative impact on the usability of the metadata and further work is required to determine the level of impact this has on calculating treatment durations and cumulative drug exposure.


2021 ◽  
Author(s):  
Abigail Matthews ◽  
Isabella Grasso ◽  
Christopher Mahoney ◽  
Yan Chen ◽  
Esma Wali ◽  
...  

2020 ◽  
Vol 10 (8) ◽  
pp. 2824
Author(s):  
Yu-Hsiang Su ◽  
Ching-Ping Chao ◽  
Ling-Chien Hung ◽  
Sheng-Feng Sung ◽  
Pei-Ju Lee

Electronic medical records (EMRs) have been used extensively in most medical institutions for more than a decade in Taiwan. However, information overload associated with rapid accumulation of large amounts of clinical narratives has threatened the effective use of EMRs. This situation is further worsened by the use of “copying and pasting”, leading to lots of redundant information in clinical notes. This study aimed to apply natural language processing techniques to address this problem. New information in longitudinal clinical notes was identified based on a bigram language model. The accuracy of automated identification of new information was evaluated using expert annotations as the reference standard. A two-stage cross-over user experiment was conducted to evaluate the impact of highlighting of new information on task demands, task performance, and perceived workload. The automated method identified new information with an F1 score of 0.833. The user experiment found a significant decrease in perceived workload associated with a significantly higher task performance. In conclusion, automated identification of new information in clinical notes is feasible and practical. Highlighting of new information enables healthcare professionals to grasp key information from clinical notes with less perceived workload.


2020 ◽  
Author(s):  
Masashi Sugiyama

Recently, word embeddings have been used in many natural language processing problems successfully and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn vectors for all senses of word separately. Therefore, in this project, we have explored two multi-sense word embedding models, including Multi-Sense Skip-gram (MSSG) model and Non-parametric Multi-sense Skip Gram model (NP-MSSG). Furthermore, we propose an extension of the Multi-Sense Skip-gram model called Incremental Multi-Sense Skip-gram (IMSSG) model which could learn the vectors of all senses per word incrementally. We evaluate all the systems on word similarity task and show that IMSSG is better than the other models.


Sign in / Sign up

Export Citation Format

Share Document