Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality

2020 ◽  
Vol 28 (4) ◽  
pp. 445-468 ◽  
Author(s):  
Reagan Mozer ◽  
Luke Miratrix ◽  
Aaron Russell Kaufman ◽  
L. Jason Anastasopoulos

Matching for causal inference is a well-studied problem, but standard methods fail when the units to match are text documents: the high-dimensional and rich nature of the data renders exact matching infeasible, causes propensity scores to produce incomparable matches, and makes assessing match quality difficult. In this paper, we characterize a framework for matching text documents that decomposes existing methods into (1) the choice of text representation and (2) the choice of distance metric. We investigate how different choices within this framework affect both the quantity and quality of matches identified through a systematic multifactor evaluation experiment using human subjects. Altogether, we evaluate over 100 unique text-matching methods along with 5 comparison methods taken from the literature. Our experimental results identify methods that generate matches with higher subjective match quality than current state-of-the-art techniques. We enhance the precision of these results by developing a predictive model to estimate the match quality of pairs of text documents as a function of our various distance scores. This model, which we find successfully mimics human judgment, also allows for approximate and unsupervised evaluation of new procedures in our context. We then employ the identified best method to illustrate the utility of text matching in two applications. First, we engage with a substantive debate in the study of media bias by using text matching to control for topic selection when comparing news articles from thirteen news sources. We then show how conditioning on text data leads to more precise causal inferences in an observational study examining the effects of a medical intervention.
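The two-part decomposition described in this abstract (a text representation plus a distance metric) can be illustrated with a minimal sketch. It assumes a term-frequency bag-of-words representation, cosine distance, and nearest-neighbor matching with replacement under a caliper. These are just one combination out of many possible ones, not the method the authors identify as best, and the names `bag_of_words`, `cosine_distance`, `match_documents`, and `caliper` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Representation choice (assumed): lowercase term-frequency counts."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Distance choice (assumed): 1 - cosine similarity of two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def match_documents(treated, controls, caliper=0.5):
    """Pair each treated document with its closest control (with replacement).

    Pairs whose distance exceeds the caliper are dropped, so a tighter
    caliper trades match quantity for match quality.
    """
    if not controls:
        return []
    control_vecs = [bag_of_words(c) for c in controls]
    matches = []
    for i, doc in enumerate(treated):
        vec = bag_of_words(doc)
        dists = [cosine_distance(vec, cv) for cv in control_vecs]
        j = min(range(len(dists)), key=dists.__getitem__)
        if dists[j] <= caliper:
            matches.append((i, j, dists[j]))
    return matches


# Hypothetical toy corpus, for illustration only.
treated_docs = ["the senate passed the budget bill"]
control_docs = ["stocks fell sharply today", "the senate debated the budget bill"]
matches = match_documents(treated_docs, control_docs)
```

Swapping `bag_of_words` for another representation or `cosine_distance` for another metric changes the matching method without touching `match_documents`, which is the sense in which the two choices are separable.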

2019 ◽  
Vol 8 (3) ◽  
pp. 6634-6643 ◽  

Opinion mining and sentiment analysis are valuable for extracting useful subjective information from text documents. Predicting customers' opinions on Amazon products has several benefits, such as reducing customer churn, agent monitoring, handling multiple customers, tracking overall customer satisfaction, quick escalations, and upselling opportunities. However, sentiment analysis is a challenging task: finding users' sentiments in large datasets is difficult because the data are unstructured and contain slang, misspellings, and abbreviations. To address this problem, a new system is proposed in this study. The proposed system comprises four major phases: data collection, pre-processing, keyword extraction, and classification. Initially, the input data were collected from the Amazon customer review dataset. After data collection, pre-processing was carried out to enhance the quality of the collected data. The pre-processing phase comprises three steps: lemmatization, review spam detection, and removal of stop-words and URLs. Then, a topic modelling approach, Latent Dirichlet Allocation (LDA), was applied along with modified Possibilistic Fuzzy C-Means (PFCM) to extract the keywords and help identify the relevant topics. The extracted keywords were classified into three classes (positive, negative, and neutral) by a machine learning classifier, a Convolutional Neural Network (CNN). The experimental results showed that the proposed system improved sentiment analysis accuracy by 6-20% relative to existing systems.


2021 ◽  
pp. 101981
Author(s):  
Nicole Gürtzgen ◽  
Benjamin Lochner ◽  
Laura Pohlan ◽  
Gerard J. van den Berg

Author(s):  
Daniel Häussler ◽  
Stefanie Hüttemann ◽  
Christel Weiß ◽  
Nicole Karoline Rotter ◽  
Haneen Sadick

Purpose: The assessment of the quality of life (QoL) of patients with chronic diseases before and after medical interventions has gained increasing importance in recent decades. Particularly for patients with visible keloid scars in the head and neck region, standardized measurement tools are either absent or have been shown to be insufficient. The aim of the present study was to create a new standardized questionnaire that is specific to auricular keloid patients and reflects their clinical symptoms and QoL. Methods: The Keloid Intervention Benefit Inventory 21 (KIBI-21) questionnaire was developed in two stages. First, a group of experts identified a pool of 26 questions and modified and supplemented the items through a comparison with existing QoL assessments so that they related to keloid-specific clinical symptoms and the QoL of patients with auricular keloids before and after a medical intervention. This questionnaire was distributed to 27 outpatients who had undergone medical interventions for visible auricular keloids. Second, a sequential statistical analysis was conducted. This included a single-item assessment and reduction, analysis for internal consistency, construct validity, and divergence validity as well as a factor analysis. The analyses were performed for the entire questionnaire and for the items in the subcategories General Health, Physical Symptoms, Self-Esteem, and Social Impact. Results: The final version of this newly validated and standardized KIBI questionnaire consisted of 21 items, of which each item was assigned to only one subscale. The questionnaire showed a Cronbach's α of 0.84 with a good internal consistency. In the item correlation validity, strong associations were found in all subscales, except for the Social Impact Subscale.
Conclusion: The keloid-specific QoL questionnaire KIBI-21 proved to be a reliable and reproducible instrument to assess the QoL and clinical symptoms in patients suffering from auricular keloids before and after a medical treatment.
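The internal-consistency statistic reported above, Cronbach's α, has the standard form α = k/(k − 1) · (1 − Σ s_i² / s_T²), where k is the number of items, s_i² is the variance of item i, and s_T² is the variance of respondents' total scores. The sketch below shows how it might be computed. The function name `cronbach_alpha` and the toy scores are hypothetical and are not the KIBI-21 data. It assumes at least two items and non-constant total scores.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items table of scores.

    scores: list of rows, one per respondent, each holding k item scores.
    Uses sample variances; needs k >= 2 and a nonzero total-score variance.
    """
    k = len(scores[0])
    items = list(zip(*scores))  # transpose to one tuple per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical toy responses: three respondents, three items that move
# together perfectly, so alpha reaches its maximum of 1.
toy_scores = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
]
alpha = cronbach_alpha(toy_scores)
```

Values closer to 1 indicate that the items vary together, which is why a result such as 0.84 is read as good internal consistency.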


2018 ◽  
Vol 4 (Supplement 2) ◽  
pp. 99s-99s
Author(s):  
O. Abdalrahman ◽  
E. Almashaikh ◽  
H. Aljarrah

Background: Fatigue interferes with the functioning and quality of life of cancer patients. In particular, fatigue after chemotherapy and post-bone marrow transplantation (BMT) is not adequately addressed or prioritized by health care providers. Aim: The purpose of this study is to determine the severity and prevalence of fatigue among cancer patients post-BMT after receiving chemotherapy. Methods: A descriptive, cross-sectional, correlational design was used. The Arabic version of the Piper Fatigue Scale (PFS) was used to measure participants' level of fatigue; the scale measures four dimensions of subjective fatigue: behavioral, affective, sensory, and cognitive. Patients older than 18 years who received chemotherapy and underwent BMT between Oct 2016 and Oct 2017 were included in this study. Results: One hundred patients participated in this study: 52% (n = 52) were diagnosed with leukemia, 32% (n = 32) with lymphoma, and 16% (n = 16) with hematological disorders. Thirty-nine patients (39%) had no or mild fatigue and did not need medical intervention; 47% (n = 47) and 14% (n = 14) were classified as having moderate and severe fatigue, respectively, so 61% of the total sample needed medical intervention. Across the mild, moderate, and severe categories, severity differed significantly on the sensory and behavioral subscales (P = 0.03 and P = 0.004, respectively), whereas the other subscales did not differ significantly among patients (P > 0.05). The highest mean subscale score occurred in the behavioral dimension (M = 4.8, SD = 2.37), and the lowest in the cognitive dimension (M = 2.59, SD = 2.35). The mean overall fatigue severity score was 45.18 for male patients (n = 74) and 57.03 for female patients (n = 26), a significant difference (t(98) = −2.2, P < 0.05).
Conclusion: Fatigue related to BMT is a serious and prevalent problem among patients with cancer. Fatigue may impair quality of life in this group of patients; further studies may be conducted to assess the effect of fatigue on quality of life and activities of daily living. It is essential to include fatigue assessment as a priority for BMT patients.


2018 ◽  
Vol 11 (1) ◽  
pp. 86
Author(s):  
Narayana Mahendra Prastya

The purpose of this article is to analyse the media relations activities of the Islamic University of Indonesia (UII) during the "Tragedi Diksar Mapala UII" crisis. The incident constituted a crisis because it was unexpected, occurred suddenly, disrupted organizational activities, and put the organization's image at risk. Media relations is one of the important activities in crisis management because mass media can affect public perception of an organization in crisis. In a crisis situation, perception can be stronger than fact. Media relations in this article is limited to information subsidies, which consist of (1) the quality of the news sources provided by the organization and (2) how the organization facilitates the media's news-gathering process.
The data for this article were collected through interviews with journalists from mass media in Yogyakarta who covered Diksar Mapala UII. The results show that the media want the university's top leadership as news sources. The information provided by the university's public relations office was considered insufficient. In its efforts to assist news coverage, UII was judged insufficiently quick and insufficiently open in providing information.


Vector representations for language have been shown to be useful in a number of Natural Language Processing tasks. In this paper, we investigate the effectiveness of word vector representations for the problem of Sentiment Analysis. In particular, we target three sub-tasks: sentiment word extraction, sentiment word polarity detection, and text sentiment prediction. We investigate the effectiveness of vector representations over different text data and evaluate the quality of domain-dependent vectors. Vector representations have been used to compute various vector-based features, and systematic experiments are conducted to demonstrate their effectiveness. Simple vector-based features can achieve better results for text sentiment analysis of APP.


Author(s):  
Caleb Scheffer Sponheim ◽  
Vasileios Papadourakis ◽  
Jennifer Collinger ◽  
John Downey ◽  
Jeffrey M Weiss ◽  
...  

Abstract Objective. Microelectrode arrays are standard tools for conducting chronic electrophysiological experiments, allowing researchers to simultaneously record from large numbers of neurons. Specifically, Utah electrode arrays (UEAs) have been utilized by scientists in many species, including rodents, rhesus macaques, marmosets, and human participants. The field of clinical human brain-computer interfaces currently relies on the UEA as a number of research groups have FDA clearance for this device through the investigational device exemption pathway. Despite its widespread usage in systems neuroscience, few studies have comprehensively evaluated the reliability and signal quality of the Utah array over long periods of time in a large dataset. Approach. We collected and analyzed over 6000 recorded datasets from various cortical areas spanning almost 9 years of experiments, totaling 17 rhesus macaques (Macaca mulatta) and 2 human subjects, and 55 separate microelectrode Utah arrays. The scale of this dataset allowed us to evaluate the average life of these arrays, based primarily on the signal-to-noise ratio of each electrode over time. Main Results. Using implants in primary motor, premotor, prefrontal, and somatosensory cortices, we found that the average lifespan of available recordings from UEAs was 622 days, although we provide several examples of these UEAs lasting over 1000 days and one up to 9 years; human implants were also shown to last longer than non-human primate implants. We also found that electrode length did not affect longevity and quality, but iridium oxide metallization on the electrode tip exhibited superior yield as compared to platinum metallization.


Author(s):  
Sucheta Gupta ◽  
Vinod Gupta ◽  
Akhil Gupta

Background: Allergic rhinitis (AR) is a chronic inflammatory disorder affecting the nasal mucosa. AR has a negative impact on several aspects of day-to-day living and quality of life (QoL), including daily functioning, sleep, absenteeism, school productivity, and academic performance. Almost 40% of children are affected by AR. Method: An observational study was conducted on 100 randomly selected parents of school-going children aged 2 to 15 years attending the OPD at the community health center, Chenani, district Udhampur, J and K, for a period of one year from June 2018 to Nov 2018. Children having frequent episodes of allergic rhinitis were asked about their history of sneezing, runny or itchy nose and eyes, thick mucus, nasal blockage, or breathlessness, and those with associated symptoms were selected. Results: 81% of subjects had worse symptoms during specific months of the year, and 67% had itchy, watery eyes. In 15% of subjects, AR impacted daily activities. A prevalence of 28% for nasal symptoms and 14% for allergic rhino-conjunctivitis was found. The study also showed a significantly higher proportion of blockers (61%) than sneeze runners (39%). 56% of children had one or more comorbidities, whereas 44% had none. The most common allergens were pollens (grass, trees, and weeds), house dust mites, pets, molds, fungi, and food. Conclusions: AR adversely affects patients' quality of life. Further studies should be conducted for more clarity on the subject, and timely medical intervention and treatment could possibly avoid the rising morbidity associated with the disease.


2012 ◽  
Vol 21 (04) ◽  
pp. 383-403 ◽  
Author(s):  
ELENA FILATOVA

Wikipedia is used as a training corpus for many information selection tasks: summarization, question-answering, etc. The information presented in Wikipedia articles as well as the order in which this information is presented, is treated as the gold standard and is used for improving the quality of information selection systems. However, the Wikipedia articles corresponding to the same entry (person, location, event, etc.) written in different languages have substantial differences regarding what information is included in these articles. In this paper we analyze the regularities of information overlap among the articles about the same Wikipedia entry written in different languages: some information facts are covered in the Wikipedia articles in many languages, while others are covered only in a few languages. We introduce a hypothesis that the structure of this information overlap is similar to the information overlap structure (pyramid model) used in summarization evaluation, as well as the information overlap/repetition structure used to identify important information for multidocument summarization. We prove the correctness of our hypothesis by building a summarization system according to the presented information overlap hypothesis. This system summarizes English Wikipedia articles given the articles about the same Wikipedia entries written in other languages. To evaluate the quality of the created summaries, we use Amazon Mechanical Turk as the source of human subjects who can reliably judge the quality of the created text. We also compare the summaries generated according to the information overlap hypothesis against the lead line baseline which is considered to be the most reliable way to generate summaries of Wikipedia articles. The summarization experiment proves the correctness of the introduced multilingual Wikipedia information overlap hypothesis.


2020 ◽  
Vol 25 (6) ◽  
pp. 755-769
Author(s):  
Noorullah R. Mohammed ◽  
Moulana Mohammed

Text data clustering organizes a set of text documents into the desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of derived topics is a vital task in text data clustering. Each tweet is considered a text document, and various topic models are used to model the tweets. In existing topic models, the clustering tendency of tweets is assessed initially based on Euclidean dissimilarity features. The cosine metric is better suited to an informative assessment, especially for text clustering. Thus, this paper develops a novel cosine-based external and internal validity assessment of cluster tendency to improve the computational efficiency of tweet data clustering. In the experiments, tweet clustering results are evaluated using cluster validity indices. The experiments show that the cosine-based internal and external validity metrics outperform the others on benchmark and Twitter-based datasets.
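The contrast between Euclidean and cosine dissimilarity mentioned in this abstract can be made concrete with a small sketch. It assumes tweets are represented as term-count vectors over a shared vocabulary. The vectors and the names `euclidean`, `cosine_dissimilarity`, `short_tweet`, `long_tweet`, and `other_tweet` are hypothetical and do not come from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean dissimilarity: sensitive to document length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_dissimilarity(u, v):
    """Cosine dissimilarity: depends only on the direction of the vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


# Hypothetical term counts over a three-word vocabulary.
short_tweet = [1, 2, 0]
long_tweet = [3, 6, 0]   # same word proportions, three times as long
other_tweet = [0, 1, 2]  # a different mix of words

# Euclidean ranks the unrelated tweet as closer than the longer, on-topic one.
euclid_same_topic = euclidean(short_tweet, long_tweet)
euclid_other_topic = euclidean(short_tweet, other_tweet)

# Cosine ignores length and ranks the on-topic tweet as closer.
cos_same_topic = cosine_dissimilarity(short_tweet, long_tweet)
cos_other_topic = cosine_dissimilarity(short_tweet, other_tweet)
```

In this toy case the two measures order the same pair of neighbors differently, which is one reason a length-insensitive measure may be preferred when assessing cluster tendency for text.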

