Web Scraping in Python

Web scraping e diritti di proprietà intellettuale nell'intermediazione di biglietti aerei low cost [Web scraping and intellectual property rights in the intermediation of low-cost airline tickets]

2018 ◽

pp. 120-164

Author(s):

Alessandra Corigliano

Keyword(s):

In the judgment discussed here, the Milan Court of Appeal, addressing Ryanair's decision to exclude any commercial intermediation in the sale of its airline tickets, ruled on the dispute between the Irish airline and the Italian travel agency Viaggiare. At first instance, Viaggiare had complained that Ryanair's conduct hindered the agency in selling Ryanair tickets directly to consumers, forcing the agency to reuse data from Ryanair's database in order to sell the tickets indirectly on its own website. The Court (partially overturning the first-instance judgment) held that the airline's decision to reserve the sale of its tickets to itself did not constitute an abuse of a dominant position under Article 102 of the Treaty on the Functioning of the European Union, since Ryanair held only 10% of the European flights market, a very low share that would rule out a dominant position on that market. From the antitrust standpoint, Ryanair's motion to exclude a dominant position on the European flights market was therefore upheld, whereas from the intellectual property standpoint Ryanair's claim was rejected. In this respect, the Court did not accept Ryanair's argument that Viaggiare's use of its trademarks infringed Ryanair's exclusive rights. The Court also held that Ryanair's database could not be regarded as Ryanair's property because, being entirely unconnected to technical and functional specifications dictating the selection and arrangement of its data, it cannot be regarded as a creative expression and hence as intellectual property under Articles 2, 64-quinquies and 64-sexies of the Copyright Law. The Court further held that the Ryanair database was not protected under the so-called "sui generis" doctrine either, since protection of the database was aimed at preventing the marketing of airline tickets rather than at protecting Ryanair's investment. Viaggiare's "screen scraping" of Ryanair's ticket-offer data was considered lawful because Ryanair, in its website's Terms of Use, granted third parties access to (a licence over) its data

The impact of Airbnb on a non-touristic city. A Case study of short-term rentals in Santa Cruz de Tenerife (Spain)

Erdkunde ◽

10.3112/erdkunde.2020.03.03 ◽

2020 ◽

Vol 74 (3) ◽

pp. 191-204

Author(s):

Marcus Hübscher ◽

Juana Schulze ◽

Felix zur Lage ◽

Johannes Ringel

Keyword(s):

Santa Cruz ◽

Short Term ◽

Rental Market ◽

Rural Space ◽

Use Of Data ◽

Web Scraping ◽

Per Capita ◽

Per Se ◽

The Impact

Short-term rentals such as Airbnb have become a persistent element of today’s urbanism around the globe. The impacts are manifold and differ depending on the context. In cities with a traditionally smaller accommodation market, the impacts might be particularly strong, as Airbnb contributes to ongoing touristification processes. Despite that, small and medium-sized cities have not been in the centre of research so far. This paper focuses on Santa Cruz de Tenerife as a medium-sized Spanish city. Although embedded in the touristic region of the Canary Islands, Santa Cruz is not a tourist city per se but still relies on touristification strategies. This paper aims to expand the knowledge of Airbnb’s spatial patterns in this type of city. The use of data collected from web scraping and geographic information systems (GIS) demonstrates that Airbnb has opened up new tourism markets outside of the centrally established tourist accommodations. It also shows that the price gap between Airbnb and the housing rental market is broadest in neighbourhoods that had not experienced tourism before Airbnb entered the market. In the centre the highest prices and the smallest units are identified, but two peripheral quarters stand out. Anaga Mountains, a natural and rural space, has the highest numbers of Airbnb listings per capita. Suroeste, a suburban quarter, shows the highest growth rates on the rental market, which implies a linkage between Airbnb and suburbanization processes.

Color Trend Analysis using Machine Learning with Fashion Collection Images

Clothing and Textiles Research Journal ◽

10.1177/0887302x21995948 ◽

2021 ◽

pp. 0887302X2199594

Author(s):

Ahyoung Han ◽

Jihoon Kim ◽

Jaehong Ahn

Keyword(s):

Machine Learning ◽

Trend Analysis ◽

Image Data ◽

Fashion Industry ◽

Color Palette ◽

Web Scraping ◽

Design Variables ◽

Sales Organizations ◽

Fashion Designers ◽

Selection Of

Fashion color trends are an essential marketing element that directly affects brand sales. Organizations such as Pantone have global authority over professional color standards by annually forecasting color palettes. However, the question remains whether fashion designers apply these colors in the fashion shows that guide seasonal fashion trends. This study analyzed image data from fashion collections through machine learning to obtain measurable results by web-scraping catwalk images, separating body and clothing elements via machine learning, defining a selection of color chips using k-means algorithms, and analyzing the similarity between the Pantone color palette (16 colors) and the analysis color chips. The gap between the Pantone trends and the colors used in fashion collections was quantitatively analyzed and found to be significant. This study indicates the potential of machine learning within the fashion industry to guide production and suggests that further research expand to other design variables.
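
The color-chip step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the pixel values and the two-color reference palette are hypothetical placeholders, the farthest-point initialization is one of several possible choices, and plain Euclidean distance in RGB stands in for whatever similarity measure the study actually used.

```python
import math

Color = tuple[float, float, float]


def init_centroids(pixels: list[Color], k: int) -> list[Color]:
    """Deterministic farthest-point initialization (an assumption, not from the paper)."""
    centroids = [pixels[0]]
    while len(centroids) < k:
        # Pick the pixel that is farthest from its nearest existing centroid.
        centroids.append(
            max(pixels, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(pixels: list[Color], k: int, iterations: int = 20) -> list[Color]:
    """Cluster RGB pixels and return k centroids, used here as 'color chips'."""
    centroids = init_centroids(pixels, k)
    for _ in range(iterations):
        clusters: list[list[Color]] = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(channel) / len(cluster) for channel in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # assignments are stable, so stop early
            break
        centroids = updated
    return centroids


def palette_gaps(chips: list[Color], palette: list[Color]) -> list[float]:
    """Distance from each chip to its closest reference palette color."""
    return [min(math.dist(chip, ref) for ref in palette) for chip in chips]


# Hypothetical clothing pixels and a hypothetical two-color reference palette.
pixels = [(250, 10, 10), (254, 14, 10), (10, 10, 250), (14, 10, 254)]
palette = [(255, 0, 0), (0, 0, 255)]
chips = kmeans(pixels, k=2)
gaps = palette_gaps(chips, palette)
```

A small gap means a chip sits close to some palette color; a perceptual color space such as CIELAB would likely be a better basis for the distance than raw RGB.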

Web Scraping of COVID-19 News Stories to Create Datasets for Sentiment and Emotion Analysis

The 14th PErvasive Technologies Related to Assistive Environments Conference ◽

10.1145/3453892.3461333 ◽

2021 ◽

Author(s):

Poojitha Thota ◽

Ramez Elmasri

Keyword(s):

Emotion Analysis ◽

News Stories ◽

Web Scraping

Identifying key drivers in airline recommendations using logistic regression from web scraping

Proceedings of the 2020 the 3rd International Conference on Computers in Management and Business ◽

10.1145/3383845.3383870 ◽

2020 ◽

Author(s):

Praowpan Tansitpong

Keyword(s):

Logistic Regression ◽

Web Scraping ◽

Key Drivers

Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling

Journal of Applied Statistics ◽

10.1080/02664763.2021.1919063 ◽

2021 ◽

pp. 1-18

Author(s):

Anton Thielmann ◽

Christoph Weisser ◽

Astrid Krenz ◽

Benjamin Säfken

Keyword(s):

Document Classification ◽

Topic Modelling ◽

Web Scraping

Implementation of Web Application for Disease Prediction Using AI

10.54646/bijdmbd.002 ◽

2020 ◽

pp. 5-9

Author(s):

Manasvi Srivastava ◽

Vikas Yadav ◽

Swati Singh ◽

...

Keyword(s):

Web Application ◽

Ad Hoc ◽

Data Extraction ◽

Extraction Methods ◽

Web Page ◽

Web Based ◽

Web Extraction ◽

Web Scraping ◽

Audio Video ◽

Manual Extraction

The Internet is the largest source of information created by humanity. It contains a variety of materials in formats such as text, audio, and video. Web scraping is one way to collect this information: a set of techniques for retrieving data from a website automatically instead of copying it manually. Many Web-based data extraction methods are designed to solve specific problems and work on ad hoc domains. Various tools and technologies have been developed to facilitate web scraping. Unfortunately, the appropriateness and ethics of using these tools are often overlooked. Hundreds of web scraping tools are available today, most of them designed for Java, Python, and Ruby, including both open-source and commercial software. Web-based tools such as Yahoo Pipes, Google Web Scrapers, and the OutWit extension for Firefox are the best tools for beginners in web scraping. Web extraction is essentially used to replace the manual extraction and editing process and to provide an easier and better way to collect data from a web page, convert it into the desired format, and save it to a local or archive directory. In this paper, among the various kinds of scraping, we focus on techniques that extract the content of a web page. In particular, we apply scraping techniques to a variety of diseases, with their symptoms and precautions.
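
The extract, convert, and save workflow described above can be sketched with Python's standard library alone. The sample HTML, its column names, and the single data row are hypothetical placeholders, not data from the paper; a real page would need its own selectors and should be checked against the site's terms of use first.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical page fragment standing in for a scraped disease table.
SAMPLE_HTML = """
<table>
  <tr><th>Disease</th><th>Symptoms</th><th>Precautions</th></tr>
  <tr><td>Common cold</td><td>Sneezing, sore throat</td><td>Rest, fluids</td></tr>
</table>
"""


def html_table_to_csv(html: str) -> str:
    """Convert the table rows found in `html` into CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = html_table_to_csv(SAMPLE_HTML)
```

Writing `csv_text` to a file with `open(path, "w", newline="")` would complete the save step; the in-memory buffer simply keeps the sketch free of side effects.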

Boca A Boca Online No Turismo: Análise Netnográfica de Avaliações No Setor Hoteleiro [Online Word of Mouth in Tourism: A Netnographic Analysis of Reviews in the Hotel Sector]

Revista Acadêmica Observatório de Inovação do Turismo ◽

10.17648/raoit.v15n1.6292 ◽

2021 ◽

Vol 15 (1) ◽

pp. 58-80

Author(s):

Maurilio Barbosa de Oliveira da Silva ◽

Dyego De Oliveira Arruda ◽

Milton Augusto Pasquotto Mariani

Keyword(s):

Word Of Mouth ◽

Web Scraping

The popularization of social media gave consumers the opportunity to express their opinions to companies and to other consumers. These opinions, positive or negative, are called eWOM (electronic word of mouth), or online word of mouth, and consist of reviews, comments, and product analyses. This article aims to analyze the main attributes valued by tourists who commented on TripAdvisor about the lodging establishments they visited in the city of Bonito-MS, one of the most important tourist destinations in Brazil's Center-West. A qualitative study was conducted in which netnographic principles were used to collect and analyze data, together with web scraping to access 1,635 TripAdvisor comments from 2018. These comments were systematized using the Iramuteq software. The attributes most often observed and subsequently written about on the site were found to be 'room', 'breakfast', 'pool', and 'service'.

A qualitative and quantitative comparison between Web scraping and API methods for Twitter credibility analysis

International Journal of Web Information Systems ◽

10.1108/ijwis-03-2021-0037 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Irvin Dongo ◽

Yudith Cardinale ◽

Ana Aguilera ◽

Fabiola Martinez ◽

Yuni Quintero ◽

...

Keyword(s):

San Francisco ◽

Data Extraction ◽

Qualitative Evaluation ◽

Application Programming Interface ◽

Extraction Methods ◽

Content Type ◽

Qualitative And Quantitative ◽

Advantages And Disadvantages ◽

Web Scraping ◽

Shared Information

Purpose: This paper performs an exhaustive review of relevant and recent related studies, which reveals that both extraction methods are currently used to analyze credibility on Twitter. Thus, there is clear evidence of the need for different options to extract different data for this purpose. Nevertheless, none of these studies performs a comparative evaluation of both extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both alternatives for data extraction and implements a previously proposed credibility model, by adding a qualitative evaluation and a Twitter Application Programming Interface (API) performance analysis from different locations.

Design/methodology/approach: As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of the shared information. To do so, several proposals use either the Twitter API or Web scraping to extract the data for analysis. Qualitative and quantitative evaluations are performed to discover the advantages and disadvantages of both extraction methods.

Findings: The study demonstrates the differences in accuracy and efficiency between the two extraction methods and highlights many more problems in this area that must be addressed to pursue true transparency and legitimacy of information on the Web.

Originality/value: Results report that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e. the tweet). Moreover, in terms of time performance, Web scraping is faster than the Twitter API and more flexible in terms of obtaining data; however, Web scraping is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from the central server in San Francisco.
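
The abstract does not say what its normalization process consists of, so the following is only a guess at the kind of step involved: scraped HTML text and API text often differ in entity encoding, Unicode form, and whitespace, and removing those differences lets the same tweet compare equal. The function name and both sample strings are hypothetical.

```python
import html
import re
import unicodedata


def normalize_tweet(text: str) -> str:
    """Reduce a tweet to a canonical form so both sources can be compared."""
    text = html.unescape(text)                  # "&amp;" -> "&", "&nbsp;" -> U+00A0
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


# Hypothetical example: the same tweet as scraped from HTML and as returned by an API.
scraped = "Breaking:&nbsp;rates&amp;prices\n  rise "
from_api = "Breaking: rates&prices rise"

same_text = normalize_tweet(scraped) == normalize_tweet(from_api)
```

Any credibility score computed from the text would then receive identical input regardless of the extraction method.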

ANALYSIS OF PARSING AND WEBSCRAPING TOOLS WITHIN THE FRAMEWORK DEVELOPMENT OF ARBITRATION INVESTMENT STRATEGY ON THE MARKET SPORT BETS

SOFT MEASUREMENTS AND COMPUTING ◽

10.36871/2618-9976.2021.02.006 ◽

2021 ◽

Vol 1 (2) ◽

pp. 65-77

Author(s):

T. E. Vildanov ◽

N. S. Ivanov

Keyword(s):

Data Extraction ◽

Investment Strategy ◽

Semistructured Data ◽

Graphical Form ◽

Great Demand ◽

Rapid Extraction ◽

Web Scraping ◽

Quantitative Metrics ◽

Framework Development ◽

Number Of Iterations

This article explores both popular and newly developed tools for extracting data from websites and converting it into a form suitable for analysis. The paper compares Python libraries, with performance as the key criterion. The results are grouped by site, tool, and number of iterations, and then presented in graphical form. The scientific novelty of the research lies in the field of application of the data extraction tools: semistructured data are retrieved from the websites of bookmakers and betting exchanges and then transformed. The article also describes new tools that are currently not in great demand in the field of parsing and web scraping. As a result of the study, quantitative metrics were obtained for all the tools used, and the libraries most suitable for the rapid extraction and processing of large quantities of information were selected.
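
A performance comparison of this kind can be sketched with the standard `timeit` module. The abstract does not name the libraries it evaluated, so the two extractors below (the standard-library `html.parser` and a regular expression) are stand-ins, and the odds markup is a hypothetical placeholder for a bookmaker page.

```python
import re
import timeit
from html.parser import HTMLParser

# Hypothetical markup: a list of 50 betting odds.
SAMPLE_HTML = "<ul>" + "".join(
    f'<li class="odd">{1.5 + i / 10:.2f}</li>' for i in range(50)
) + "</ul>"


class OddsParser(HTMLParser):
    """Collects the numeric content of <li class="odd"> elements."""

    def __init__(self):
        super().__init__()
        self.odds = []
        self._in_odd = False

    def handle_starttag(self, tag, attrs):
        self._in_odd = tag == "li" and ("class", "odd") in attrs

    def handle_data(self, data):
        if self._in_odd:
            self.odds.append(float(data))

    def handle_endtag(self, tag):
        self._in_odd = False


def extract_with_parser(html: str) -> list[float]:
    parser = OddsParser()
    parser.feed(html)
    parser.close()
    return parser.odds


ODD_PATTERN = re.compile(r'<li class="odd">([0-9.]+)</li>')


def extract_with_regex(html: str) -> list[float]:
    return [float(match) for match in ODD_PATTERN.findall(html)]


def benchmark(tools, html, iteration_counts=(10, 100)):
    """Return total seconds per (tool name, iteration count) pair."""
    results = {}
    for name, func in tools.items():
        for n in iteration_counts:
            results[(name, n)] = timeit.timeit(lambda: func(html), number=n)
    return results


timings = benchmark(
    {"html.parser": extract_with_parser, "regex": extract_with_regex},
    SAMPLE_HTML,
)
```

The resulting dictionary is keyed by tool and iteration count, which maps directly onto the grouping the article describes; timings vary by machine, so no expected values are shown.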
