Sampling the Web as Training Data for Text Classification

Author(s):  
Wei-Yen Day ◽  
Chun-Yi Chi ◽  
Ruey-Cheng Chen ◽  
Pu-Jen Cheng

Data acquisition is a major concern in text classification. The considerable human effort that conventional methods require to build a quality training collection is not always available to researchers. In this paper, the authors explore ways to collect training data automatically by sampling the Web with a set of given class names. The basic idea is to generate appropriate keywords and submit them as queries to search engines to acquire training data. The first of the two methods presented samples the concepts common to the classes; the second samples the concepts that discriminate each class. A series of experiments carried out independently on two different datasets shows that the proposed methods significantly improve classifier performance even without manually labeled training data. The authors’ strategy for retrieving Web samples substantially helps conventional document classification in terms of accuracy and efficiency.
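
The abstract does not say how keywords are chosen, so the following is only a toy sketch of the two ideas it names, not the authors' method. It assumes a few text snippets have already been gathered for each class name (the sample data, the function names and the whitespace tokenizer are all hypothetical). Terms found in every class stand in for "common concepts", terms found in only one class stand in for "discriminative concepts", and the search-engine call itself is left out.

```python
from collections import Counter


def term_counts(snippets):
    """Count lowercase alphabetic tokens across a list of text snippets."""
    counts = Counter()
    for text in snippets:
        counts.update(tok for tok in text.lower().split() if tok.isalpha())
    return counts


def common_and_discriminative(snippets_by_class, top_k=3):
    """Return (terms shared by all classes, top terms unique to each class)."""
    counts = {cls: term_counts(s) for cls, s in snippets_by_class.items()}
    # Assumption: a "common concept" is a term seen in every class.
    common = set.intersection(*(set(c) for c in counts.values()))
    discriminative = {}
    for cls, cnt in counts.items():
        others = set().union(*(set(o) for k, o in counts.items() if k != cls))
        # Assumption: a "discriminative concept" is a term seen in this class only.
        unique = [(n, term) for term, n in cnt.items() if term not in others]
        unique.sort(key=lambda pair: (-pair[0], pair[1]))  # frequent first, then A-Z
        discriminative[cls] = [term for _, term in unique[:top_k]]
    return sorted(common), discriminative


def build_queries(class_name, keywords):
    """Pair a class name with each keyword to form search-engine query strings."""
    return [f"{class_name} {kw}" for kw in keywords]
```

The query strings returned by `build_queries` would then be submitted to a search engine of choice, and the retrieved pages labeled with the class name that produced them.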

Author(s):  
Pratiksha Bongale

Today’s world is largely data-driven, and Machine Learning and Data Mining strategies are used to deal with the enormous amount of data. Traditional ML approaches presume that the model is tested on a dataset drawn from the same domain as the training data. Nevertheless, some real-world situations require machines to give good results with very little domain-specific training data. This creates room for models that can predict accurately after being trained on easily found data. Transfer Learning is the key: it applies the knowledge gained while learning one task to another task that is similar in some way. This article focuses on building a model that separates text into two classes, spam and non-spam, using BERT’s pre-trained model (bert-base-uncased). This pre-trained model was trained on Wikipedia and Book Corpus data, and the goal of the paper is to highlight its ability to transfer the knowledge learned from that training to the task of separating spam texts from the rest.


2010 ◽  
Vol 1 (4) ◽  
pp. 24-42 ◽  
Author(s):  
Wei-Yen Day ◽  
Chun-Yi Chi ◽  
Ruey-Cheng Chen ◽  
Pu-Jen Cheng

2020 ◽  
Vol 34 (05) ◽  
pp. 9547-9554
Author(s):  
Mozhi Zhang ◽  
Yoshinari Fujinuma ◽  
Jordan Boyd-Graber

Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (caco) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.
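
To make the intuition behind character-level transfer concrete, here is a minimal sketch of why a representation built from written forms lets similar-looking words in two related languages end up close together. It is not the CACO embedder, which is trained jointly with the classifier; it uses a fixed bag of character trigrams as a stand-in, and the function names and any example words passed to it are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Bag of character n-grams of a word, with boundary markers added."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Under this stand-in, two words that share most of their spelling receive a high similarity while unrelated spellings score near zero, which is the property a learned character-based embedder can exploit when generalizing from source-language words to target-language words.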


2018 ◽  
Vol 2 (1) ◽  
pp. 92-99
Author(s):  
Nigel Rapport

In an earlier work (Anyone: The Cosmopolitan Subject of Anthropology, 2012), I considered a solution to the ‘problem’ of society as identified by Georg Simmel. The fact that we only come to know the interactional ‘Other’ by way of distortion, by virtue of the imposition of alien and alienating labels, categories and taxonomies, Simmel (1971) described as ‘tragic’ (cf. Rapport 2017). We distort the Other’s identity when we ‘know’ them in the conventional and collectivising terms of a symbolic classification of cultural reality. In response, I argued for a linguistic and behavioural style of public address and exchange, and an ethos of good manners, that I termed ‘cosmopolitan politesse’. This was an interactional code by which we presumed the common humanity and the distinct individuality of whomsoever we engaged with, but classified the Other in no more substantive fashion than this. We accepted that in our social interactions we were engaging with an individual human other – ‘Anyone’ – and not with a representative of some more substantive class: ‘a woman’, ‘a Swede’, ‘a Jew’, someone ‘working class’, ‘primitive’ or ‘pious’, and so on.


Author(s):  
Kieron O’Hara ◽  
Harith Alani ◽  
Yannis Kalfoglou ◽  
Nigel Shadbolt

Certain features distinguish killer apps from ordinary applications. This chapter examines those features in the context of the Semantic Web, in the hope that a better understanding of the characteristics of killer apps might encourage their consideration when developing Semantic Web applications. Killer apps are highly transformative technologies that create new e-Commerce venues and widespread patterns of behaviour. Information Technology generally, and the Web in particular, has benefited from killer apps to create new networks of users and increase its value. The Semantic Web community, on the other hand, is still awaiting a killer app that proves the superiority of its technologies. The authors hope that this chapter will help to highlight some of the common ingredients of killer apps in e-Commerce, and discuss how such applications might emerge in the Semantic Web.


Author(s):  
SARAH ZELIKOVITZ ◽  
FINELLA MARQUEZ

This paper presents work that uses Transductive Latent Semantic Indexing (LSI) for text classification. In addition to relying on labeled training data, we improve classification accuracy by incorporating the set of test examples in the classification process. Rather than performing LSI's singular value decomposition (SVD) process solely on the training data, we instead use an expanded term-by-document matrix that includes both the labeled data as well as any available test examples. We report the performance of LSI on data sets both with and without the inclusion of the test examples, and we show that tailoring the SVD process to the test examples can be even more useful than adding additional training data. This method can be especially useful to combat possible inclusion of unrelated data in the original corpus, and to compensate for limited amounts of data. Additionally, we evaluate the vocabulary of the training and test sets and present the results of a series of experiments to illustrate how the test set is used in an advantageous way.
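
The core step described above, running the SVD on a term-by-document matrix that also contains the test examples, can be sketched as follows. This is a minimal illustration assuming NumPy, raw count matrices with terms as rows and documents as columns, and a nearest-neighbour decision by cosine similarity in the reduced space. The abstract does not specify the classification rule or any term weighting, so those choices and all function names are assumptions.

```python
import numpy as np


def lsi_fit(term_doc, k):
    """Truncated SVD of a term-by-document matrix; returns (U_k, s_k)."""
    U, s, _ = np.linalg.svd(np.asarray(term_doc, dtype=float), full_matrices=False)
    return U[:, :k], s[:k]


def lsi_project(U_k, s_k, docs):
    """Fold documents (columns) into the k-dimensional LSI space."""
    return (U_k.T @ np.asarray(docs, dtype=float)) / s_k[:, None]


def unit_columns(m):
    """Scale each column to unit length so dot products become cosines."""
    norms = np.linalg.norm(m, axis=0, keepdims=True)
    return m / np.maximum(norms, 1e-12)


def classify(train, labels, test, k, transductive=True):
    """Label each test column with the label of its nearest training column."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    # Transductive variant: the SVD sees the test columns as well as the training ones.
    basis = np.hstack([train, test]) if transductive else train
    U_k, s_k = lsi_fit(basis, k)
    tr = unit_columns(lsi_project(U_k, s_k, train))
    te = unit_columns(lsi_project(U_k, s_k, test))
    nearest = (te.T @ tr).argmax(axis=1)
    return [labels[i] for i in nearest]
```

Calling `classify` once with `transductive=True` and once with `transductive=False` on the same data gives the with/without comparison the abstract reports. `k` should not exceed the rank of the matrix being decomposed, since the projection divides by the retained singular values.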


1829 ◽  
Vol 119 ◽  
pp. 301-316 ◽  

In the year 1822, when I received from Mr. Barton some very fine specimens of his Iris ornaments, I availed myself of the opportunity of performing a series of experiments on the action of grooved surfaces upon light. As the subject was to a certain extent new, many of the results which I obtained seemed to possess considerable interest, and I accordingly communicated to the Royal Society of Edinburgh a general account of them, which was read on the 3rd of February 1823. The interruptions, however, of professional pursuits prevented me, but at distant intervals, from pursuing the inquiry; and having found that M. Fraunhofer was actively engaged in the very same research, with all the advantages of the finest apparatus and materials, I abandoned the subject, though with some reluctance, to his superior powers and means of investigation. During a visit paid to Edinburgh by the Chevalier Yelin, a friend of Fraunhofer’s and a distinguished member of the Academy of Sciences of Munich, I showed him the general results which I had obtained; and as he assured me that the phenomena which had principally occupied my attention had entirely escaped the notice of his friend, I was thus induced to resume my labours, the results of which, in relation to one branch of the subject, I shall now submit to the consideration of the Society. When a flat and polished metallic surface is covered with equal and equidistant grooves, we may characterize it by the relation of two quantities, one of which m represents the breadth of each groove, or of the surface that is removed, while the other n represents the breadth of the intermediate space, or of the original surface that is left. If the image of a candle is seen by reflexion from such a surface, the trace of the plane of reflexion being parallel to the grooves, we observe the colourless image of a candle in the middle of a row of prismatic images arranged in a line perpendicular to the grooves. 
The colourless image of the candle is formed by the original portions n of the metallic surface, while the prismatic images are formed by the sides of the grooves m . This may be demonstrated ocularly by increasing m , and consequently diminishing n till the latter nearly disappears. In this case the intensity of the prismatic images rises to a maximum, while the ordinary colourless image becomes extremely faint, and vice versâ. The general phenomena of the prismatic images, such as their distance from the common image, and the dispersion of their colours, depend entirely on the magnitude of m + n , or the number of grooves and intervals that occupy any given space; and the laws of these phenomena have been accurately determined by M. Fraunhofer.


This communication consists of three parts. In the first part the author shows that the common deflecting galvanometer, in which the deflecting forces are assumed to be as the tangents of deflection, is founded on false principles, and consequently leads to erroneous results. The wire forming the coil is of considerable thickness, and therefore there is no fixed zero from which the deflections can be reckoned. The length of the coil, also, being generally short, occasions another serious error, as the theoretical investigation is founded on the supposition of an indefinite length. In proof of the inaccuracy of the indications of the common deflecting galvanometer, the author took two elementary batteries, the plates of one being one inch square, and those of the other two inches. The tangents of the deflections of the needle (proper precautions having been taken for the equally free passage of all the electricity evolved in either case) were very nearly as 1 to 2, though it is obvious that the real quantities of voltaic electricity were as 1 to 4. The author’s torsion galvanometer gave the degrees of torsion nearly as 1 to 4. Other experiments led to similar conclusions. The author then examines the laws which were supposed to connect the conducting power of a wire for electricity, with its length and diameter, and which, according to Professors Cumming and Barlow, varies directly as the diameter, and inversely as the square root of the length; but, according to MM. Becquerel and Pouillet, directly as the square of the diameter, and inversely as the length. He points out the false conclusions of M. Becquerel, and that he has, in fact, deduced the value of two unknown quantities from one equation; and that M. Pouillet having arrived at his through the fallacious indications of the common deflecting galvanometer, they are equally erroneous.
The author then shows that the law pointed out by Cumming and Barlow is, in ordinary cases, nearest the truth; though, under certain circumstances, the limits of even that law may be passed. Hence, and from a series of experiments with the torsion galvanometer, he arrives at the unexpected conclusion, that there is no determinate law of conduction, either for the length or diameter of the wire, but that it must vary, in every case, with the size of the plates, and the energy of the acid solution used in exciting them. This conclusion the author shows to be in accordance with the views of conduction which he had previously published; namely, that there is no actual transfer of electricity, but that all the phenomena result from the definite arrangement of the electric fluid essentially belonging to the conducting wire.
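
The two competing laws of conducting power quoted in this abstract can be written out as a short sketch for comparison. The proportionality constant is set to 1 as an assumption, so only ratios between wires are meaningful, and the function names are illustrative. Note that the author's own conclusion is that neither law holds in general.

```python
import math


def cumming_barlow(diameter, length):
    """Conducting power as the diameter, inversely as the square root of the length."""
    return diameter / math.sqrt(length)


def becquerel_pouillet(diameter, length):
    """Conducting power as the square of the diameter, inversely as the length."""
    return diameter ** 2 / length
```

Doubling the diameter at fixed length doubles the predicted conducting power under the first law and quadruples it under the second, which shows how far apart the two laws are.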


2011 ◽  
Vol 131 (8) ◽  
pp. 1459-1466
Author(s):  
Yasunari Maeda ◽  
Hideki Yoshida ◽  
Masakiyo Suzuki ◽  
Toshiyasu Matsushima

Author(s):  
Mauro Rocha Baptista

In this article we analyze the relationship between Religious Education and its evolution in the recent Brazilian context in order to understand the position of the Supreme Federal Court (Supremo Tribunal Federal, STF) in considering the possibility of confessional Religious Education. We first present the legislative perspective created by the 1988 Federal Constitution and its consequences for curricular guidelines. In this context we stress the intention to include Religious Education in the Common Core National Curriculum (Base Nacional Curricular Comum, BNCC), which in the end did not happen. The tendency in the first two versions of the BNCC was toward a non-confessional Religious Education, one that defined the role of Religious Education as debating religion without steering toward any particular religious strand. This position was an advance on the first historical perspective, which was more closely associated with confessional catechesis. It also went beyond the later interpretation of an inter-confessional ecumenism, which maintained the superiority of Christianity over the other religions. We therefore argue that the STF decision, by six votes to five, is a step back from what seemed to us a much more fruitful path.
Keywords: Religious Education. Supreme Federal Court. Confessional. Inter-confessional. Non-confessional.
Submitted: 23-01-2018. Approved and published: 12-2018.

