Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Political Analysis ◽

10.1017/pan.2017.44 ◽

2018 ◽

Vol 26 (2) ◽

pp. 168-189 ◽

Cited By ~ 72

Author(s):

Matthew J. Denny ◽

Arthur Spirling

Keyword(s):

Feature Selection ◽

Unsupervised Learning ◽

Political Science ◽

Real Data ◽

Statistical Procedure ◽

Science Text ◽

Substantive Theory ◽

Text Preprocessing

Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
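
As a toy illustration of why such decisions matter (this is not the authors' procedure or software), the sketch below enumerates every combination of three common preprocessing steps and records how the Jaccard similarity between two short documents changes. The stopword list, the example documents, and all function names are assumptions made for illustration.

```python
import itertools
import string

# Toy stopword list; a real analysis would use a curated one (assumption).
STOPWORDS = {"the", "a", "of", "and", "to", "is"}

def preprocess(text, lowercase, strip_punct, drop_stopwords):
    """Tokenize text under one preprocessing regime (three on/off switches)."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens

def jaccard(a, b):
    """Jaccard similarity between the vocabularies of two token lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical documents, chosen so that the regimes disagree.
docs = ["The Budget is a priority.", "the budget, and the priority of Congress"]

# Map each regime (lowercase, strip_punct, drop_stopwords) to a similarity.
results = {}
for regime in itertools.product([False, True], repeat=3):
    tokens = [preprocess(d, *regime) for d in docs]
    results[regime] = jaccard(tokens[0], tokens[1])

for regime, similarity in sorted(results.items()):
    print(regime, round(similarity, 3))
```

With no preprocessing the two documents share no tokens, while with all three steps enabled they share two of three distinct tokens, so the same pair looks either unrelated or closely related depending on the regime.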

Talk Show Segmentation System Based on Twitter Using K-Medoids Clustering Algorithm

JURNAL PENDIDIKAN TEKNOLOGI KEJURUAN ◽

10.24036/jptk.v3i3.15123 ◽

2020 ◽

Vol 3 (3) ◽

pp. 158-163

Author(s):

Kharisma Jevi Shafira Sepyanto ◽

Yulison Herry Chrisnanto ◽

Fajri Rakhmat Umbara

Keyword(s):

Public Opinion ◽

Unsupervised Learning ◽

Clustering Algorithm ◽

Talk Show ◽

Analysis Process ◽

Silhouette Coefficient ◽

Competent Person ◽

Long Time ◽

Text Preprocessing ◽

Coefficient Method

Innovations in a television talk show can be a threat: they may divide the audience into groups and lower the program's rating. Program ratings affect the companies that buy advertising services, and because the sale of advertising services is the largest source of income, television companies may go bankrupt. One way to address this is to analyze public opinion, since the results of the analysis indicate how attractive the program is to the community. However, the analysis takes a long time and can be done only by a competent person, so another process is needed that produces results quickly and can be carried out by anyone. This study uses K-Medoids clustering to identify public opinion. The clustering process, a form of unsupervised learning, is combined with a labeling process: tweet data from the previous episode are labeled and then used to obtain predicted labels for the other cluster members. Before clustering, the tweet data pass through a text preprocessing stage and are then transformed into numeric form based on word occurrence. The transformed data are clustered by measuring proximity with cosine similarity. Labels from the cluster medoids are applied to the unlabeled tweet data. The clustering was evaluated with the Silhouette Coefficient method, which gave a value of 0.19. Nevertheless, the method successfully predicted public opinion and achieved an accuracy of 80%.
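
The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tweets and the two known labels are invented, whitespace tokenization stands in for the preprocessing stage, and each cluster takes the majority label of its labeled members (a simplification of using the medoid's label, since a medoid may be an unlabeled tweet).

```python
import math
from collections import Counter

def bag_of_words(text):
    """Word counts for one tweet (stand-in for the preprocessing stage)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def assign_clusters(vectors, medoids):
    """Index of the nearest medoid for every vector."""
    return [min(range(len(medoids)),
                key=lambda c: cosine_distance(v, vectors[medoids[c]]))
            for v in vectors]

def k_medoids(vectors, initial_medoids, max_iter=20):
    """Alternate assignment and medoid update until the medoids stop moving."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        assignment = assign_clusters(vectors, medoids)
        updated = []
        for c, current in enumerate(medoids):
            members = [i for i, a in enumerate(assignment) if a == c]
            if not members:  # keep the old medoid if its cluster is empty
                updated.append(current)
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda i: sum(
                cosine_distance(vectors[i], vectors[j]) for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign_clusters(vectors, medoids)

def mean_silhouette(vectors, assignment):
    """Mean silhouette coefficient under cosine distance."""
    scores = []
    for i, ci in enumerate(assignment):
        by_cluster = {}
        for j, cj in enumerate(assignment):
            if j != i:
                by_cluster.setdefault(cj, []).append(
                    cosine_distance(vectors[i], vectors[j]))
        if ci not in by_cluster or len(by_cluster) < 2:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(by_cluster[ci]) / len(by_cluster[ci])
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != ci)
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

# Hypothetical tweets; indices 0 and 2 play the role of labeled earlier data.
tweets = ["great show tonight love it", "love the show great host",
          "boring episode waste of time", "waste of time so boring",
          "great great show", "boring boring waste"]
known = {0: "positive", 2: "negative"}

vectors = [bag_of_words(t) for t in tweets]
medoids, assignment = k_medoids(vectors, initial_medoids=[0, 2])

# Propagate labels: each cluster takes the majority label of its labeled members.
cluster_label = {}
for c in set(assignment):
    votes = Counter(known[i] for i in known if assignment[i] == c)
    if votes:
        cluster_label[c] = votes.most_common(1)[0][0]
predicted = [cluster_label.get(c) for c in assignment]
score = mean_silhouette(vectors, assignment)
print(predicted, round(score, 3))
```

On this toy data the two groups are cleanly separated, so the silhouette score is far higher than the 0.19 reported in the study; real tweet data are much noisier.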

Prior knowledge and correlational structure in unsupervised learning.

Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale ◽

10.1037/cjep20070012 ◽

2007 ◽

Vol 61 (2) ◽

pp. 109-127 ◽

Cited By ~ 3

Author(s):

John P. Clapper

Keyword(s):

Unsupervised Learning ◽

Prior Knowledge

Unsupervised learning of object identities and their parts in a hierarchical visual memory

Frontiers in Computational Neuroscience ◽

10.3389/conf.neuro.10.2009.14.168 ◽

1970 ◽

Author(s):

Jenia Jitsev ◽

Christoph von der Malsburg

Keyword(s):

Unsupervised Learning ◽

Visual Memory

Classification of Observations through Combination of the Dimension Reduction and the Cluster Analysis

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i8.13 ◽

2017 ◽

Vol 7 (8) ◽

pp. 30

Author(s):

Hyeuk Kim

Keyword(s):

Machine Learning ◽

Principal Component Analysis ◽

Cluster Analysis ◽

Unsupervised Learning ◽

Principal Component ◽

Component Analysis ◽

Baseball Players ◽

Partitioning Around Medoids ◽

Different Characteristics

Unsupervised learning in machine learning divides data into several groups. Observations in the same group have similar characteristics, and observations in different groups have different characteristics. In this paper, we classify data by partitioning around medoids, which has some advantages over k-means clustering. We apply it to baseball players in the Korea Baseball League. We also apply principal component analysis to the data and draw a graph using two components as the axes. Through this procedure, we interpret the meaning of the clustering graphically. The combination of partitioning around medoids and principal component analysis can be applied to any other data, and the approach makes it easy to identify the characteristics of the groups.
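
The dimension-reduction half of this approach can be sketched as below: a projection onto the first two principal components, which would then serve as plot axes for the clusters. This is an illustrative sketch only. The player statistics are invented, the function names are hypothetical, and power iteration with deflation is swapped in for a library eigen-solver to keep the example dependency-free.

```python
import math

def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_eigenvector(m, iterations=500):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    v = [1.0 + i for i in range(len(m))]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    x = center(rows)
    cov = covariance(x)
    pc1 = top_eigenvector(cov)
    lam1 = dot(pc1, mat_vec(cov, pc1))
    # Deflation: remove the first component so the next one can be found.
    d = len(cov)
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2 = top_eigenvector(deflated)
    projected = [[dot(r, pc1), dot(r, pc2)] for r in x]
    return projected, pc1, pc2

# Hypothetical player statistics (three made-up measurements per player).
players = [[2.0, 1.0, 0.5], [4.0, 2.1, 0.4], [6.0, 2.9, 0.7],
           [8.0, 4.2, 0.5], [10.0, 4.8, 0.6]]
projected, pc1, pc2 = pca_2d(players)
for point in projected:
    print([round(c, 3) for c in point])
```

Each player becomes a point in two dimensions; coloring those points by medoid-based cluster membership would give the kind of graph the abstract describes.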

An Unsupervised Learning Algorithm to Compute Fluid Volumes From NMR T1-T2 Logs in Unconventional Reservoirs

Petrophysics – The SPWLA Journal of Formation Evaluation and Reservoir Description ◽

10.30632/pjv59n5-2018a4 ◽

2018 ◽

Vol 59 ◽

pp. 617-632 ◽

Cited By ~ 3

Author(s):

Lalitha Venkataramanan ◽

Noyan Evirgen ◽

David F. Allen ◽

Albina Mutina ◽

...

Keyword(s):

Unsupervised Learning ◽

Learning Algorithm ◽

Unconventional Reservoirs

Klasifikasi pada Tempat Tinggal Menurut Provinsi dan Jenis Kepemilikan Berdasarkan Algoritma K-Means

STRING (Satuan Tulisan Riset dan Inovasi Teknologi) ◽

10.30998/string.v4i3.5932 ◽

2020 ◽

Vol 4 (3) ◽

pp. 247

Author(s):

Dwi Swasono Rachmad

Keyword(s):

Data Mining ◽

Unsupervised Learning ◽

Residential Buildings ◽

Government Agency ◽

Role Of Government ◽

The Republic ◽

Household Processing ◽

Central Statistics

Housing is derived from the word house, meaning a place where a person lives, stays, or stops for a certain time. Housing is a group of residences that has facilities and infrastructure. The problem in this study focuses on the type of residential ownership: SHM ART, SHM Non ART, NON SHM, and others. These four types can be used to determine the percentage of ownership in all provinces in Indonesia. Due to the fact that there is still a lot of information about the type of certificate ownership, there is still not much ownership. Therefore, the k-Means algorithm is used as a data mining technique to form clusters, where the data already has parameters or values and the task falls into the category of unsupervised learning. That data produced the best. The data were obtained from published sources of a government agency of the Republic of Indonesia, the Central Statistics Agency, in the category of household processing with self-owned residential buildings purchased from developers or non-developers, by province and type of ownership, for 2016 throughout Indonesia. To process the dataset, the researchers used the RapidMiner application for clustering. This research shows that more ownership falls under SHM ART, but for other values it is still smaller than the value in other types of ownership, which is the second largest value. In this case, the role of government in providing assistance with the ownership process, so that ownership becomes SHM ART, is very important.

Faculty Opinions recommendation of Unsupervised learning of individuals and categories from images.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1109143.565130 ◽

2008 ◽

Author(s):

Shimon Ullman

Keyword(s):

Unsupervised Learning

Unsupervised Learning: What is a Sports Car?

SSRN Electronic Journal ◽

10.2139/ssrn.3439358 ◽

2019 ◽

Cited By ~ 2

Author(s):

Simon Rentzmann ◽

Mario V. Wuthrich

Keyword(s):

Unsupervised Learning

Application of Machine Learning in Animal Disease Analysis and Prediction

Current Bioinformatics ◽

10.2174/1574893615999200728195613 ◽

2020 ◽

Vol 15 ◽

Author(s):

Shuwen Zhang ◽

Qiang Su ◽

Qin Chen

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Clustering Algorithm ◽

Principal Component ◽

Support Vector ◽

Animal Disease ◽

Human Beings ◽

Animal Diseases ◽

Disease Analysis

Abstract: Major animal diseases pose a great threat to animal husbandry and to human beings. With the deepening of globalization and the abundance of data resources, the prediction and analysis of animal diseases using big data are becoming increasingly important. The focus of machine learning is to make computers learn from data and use the learned experience to analyze and predict. This paper first introduces the animal epidemic situation and machine learning. It then briefly introduces the application of machine learning in animal disease analysis and prediction. Machine learning is mainly divided into supervised learning and unsupervised learning. Supervised learning includes support vector machines, naive Bayes, decision trees, random forests, logistic regression, artificial neural networks, deep learning, and AdaBoost. Unsupervised learning includes the expectation-maximization algorithm, principal component analysis, hierarchical clustering algorithms, and maxent. Through the discussion in this paper, readers can gain a clearer concept of machine learning and understand its application prospects in animal diseases.
