Web Scraping in Python

2022 ◽  
pp. 229-254
2018 ◽  
pp. 120-164
Author(s):  
Alessandra Corigliano
Keyword(s):  
Low Cost ◽  

In the judgment discussed below, the Court of Appeal of Milan ruled on Ryanair's decision to exclude any commercial intermediation in the sale of its airline tickets, in the dispute between the Irish airline and the Italian travel agency Viaggiare. At first instance, Viaggiare had complained that Ryanair's conduct hindered the agency from selling Ryanair tickets directly to consumers, forcing it to reuse data from Ryanair's database in order to sell the tickets indirectly on its own website. The Court (partially reforming the judgment of the court of first instance) held that the airline's decision to reserve ticket sales to itself did not constitute an abuse of a dominant position under Article 102 of the Treaty on the Functioning of the European Union, since Ryanair held only 10% of the European flights market, a share low enough to rule out a dominant position on that market. From the antitrust perspective, Ryanair's claim that it held no dominant position on the European flights market was therefore upheld, whereas from the intellectual property perspective Ryanair's claim was rejected. In this regard, the Court did not accept Ryanair's claim that Viaggiare's use of its trademarks infringed Ryanair's exclusive rights. The Court also held that Ryanair's database could not be regarded as its property because, being entirely unconnected to technical and functional specifications dictating the selection and arrangement of the data, it could not be regarded as a creative expression and hence as intellectual property under Articles 2, 64-quinquies and 64-sexies of the Copyright Law. The Court further held that Ryanair's database was not protected under the so-called "sui generis" doctrine either, since the protection sought was aimed at preventing the commercialisation of airline tickets rather than at protecting Ryanair's investment. Viaggiare's "screen scraping" of Ryanair's data on its ticket offers was considered lawful because Ryanair, in its website's Terms of Use, granted third parties access to (a licence for) its data.


Erdkunde ◽  
2020 ◽  
Vol 74 (3) ◽  
pp. 191-204
Author(s):  
Marcus Hübscher ◽  
Juana Schulze ◽  
Felix zur Lage ◽  
Johannes Ringel

Short-term rentals such as Airbnb have become a persistent element of today’s urbanism around the globe. The impacts are manifold and differ depending on the context. In cities with a traditionally smaller accommodation market, the impacts might be particularly strong, as Airbnb contributes to ongoing touristification processes. Despite that, small and medium-sized cities have not been in the centre of research so far. This paper focuses on Santa Cruz de Tenerife as a medium-sized Spanish city. Although embedded in the touristic region of the Canary Islands, Santa Cruz is not a tourist city per se but still relies on touristification strategies. This paper aims to expand the knowledge of Airbnb’s spatial patterns in this type of city. The use of data collected from web scraping and geographic information systems (GIS) demonstrates that Airbnb has opened up new tourism markets outside of the centrally established tourist accommodations. It also shows that the price gap between Airbnb and the housing rental market is broadest in neighbourhoods that had not experienced tourism before Airbnb entered the market. In the centre the highest prices and the smallest units are identified, but two peripheral quarters stand out. Anaga Mountains, a natural and rural space, has the highest numbers of Airbnb listings per capita. Suroeste, a suburban quarter, shows the highest growth rates on the rental market, which implies a linkage between Airbnb and suburbanization processes.


2021 ◽  
pp. 0887302X2199594
Author(s):  
Ahyoung Han ◽  
Jihoon Kim ◽  
Jaehong Ahn

Fashion color trends are an essential marketing element that directly affects brand sales. Organizations such as Pantone have global authority over professional color standards by annually forecasting color palettes. However, the question remains whether fashion designers apply these colors in the fashion shows that guide seasonal fashion trends. This study analyzed image data from fashion collections through machine learning to obtain measurable results by web-scraping catwalk images, separating body and clothing elements via machine learning, defining a selection of color chips using k-means algorithms, and analyzing the similarity between the Pantone color palette (16 colors) and the analysis color chips. The gap between the Pantone trends and the colors used in fashion collections was quantitatively analyzed and found to be significant. This study indicates the potential of machine learning within the fashion industry to guide production and suggests that further research expand to other design variables.
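
The color-chip step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes colors are plain RGB triples, uses Euclidean distance as the similarity measure, and works on a handful of made-up pixels. The names `kmeans`, `nearest_palette_color`, `PIXELS`, and `PALETTE` are hypothetical, and the palette values are placeholders, not Pantone colors.

```python
import math


def distance(a, b):
    """Euclidean distance between two RGB triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(pixels, k, iterations=10):
    """Plain k-means on RGB pixels.

    Assumption: the initial centroids are simply the first k distinct pixels,
    which keeps the sketch deterministic.
    """
    centroids = list(dict.fromkeys(pixels))[:k]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in pixels:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(channel) / len(cluster) for channel in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


def nearest_palette_color(color, palette):
    """Return the name of the closest palette entry and its distance."""
    name = min(palette, key=lambda n: distance(color, palette[n]))
    return name, distance(color, palette[name])


# Made-up pixels standing in for clothing regions of a catwalk image.
PIXELS = [(250, 10, 10), (10, 10, 250), (240, 20, 20),
          (20, 20, 240), (245, 15, 15), (15, 15, 245)]
# Placeholder reference palette (hypothetical values, not real Pantone data).
PALETTE = {"red": (255, 0, 0), "blue": (0, 0, 255)}

color_chips = kmeans(PIXELS, k=2)
for chip in color_chips:
    print(chip, nearest_palette_color(chip, PALETTE))
```

Each centroid plays the role of one color chip. Its distance to the closest palette entry is one simple way to quantify the gap between a forecast palette and the colors actually observed.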


Author(s):  
Anton Thielmann ◽  
Christoph Weisser ◽  
Astrid Krenz ◽  
Benjamin Säfken

2020 ◽  
pp. 5-9
Author(s):  
Manasvi Srivastava ◽  
Vikas Yadav ◽  
Swati Singh ◽  
...  

The Internet is the largest source of information created by humanity. It contains a variety of materials in formats such as text, audio, and video. Web scraping is one way to collect this material: it is a set of techniques for obtaining information from a website automatically instead of copying the data manually. Many Web-based data extraction methods are designed to solve specific problems and work on ad-hoc domains. Various tools and technologies have been developed to facilitate Web scraping. Unfortunately, the appropriateness and ethics of using these tools are often overlooked. Hundreds of Web scraping programs are available today, most of them designed for Java, Python and Ruby, and they include both open source and commercial software. Web-based tools such as YahooPipes, Google Web Scrapers and the Outwit extension for Firefox are the best tools for beginners in Web scraping. Web extraction is used to cut out the manual extraction and editing process and to provide an easier and better way to collect data from a Web page, convert it into the desired format, and save it to a local or archive directory. In this paper, among the various kinds of scraping, we focus on techniques that extract the content of a Web page. In particular, we apply scraping techniques to collect information on a variety of diseases, together with their symptoms and precautions.
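
Extracting the content of a Web page can be sketched with Python's standard-library `html.parser`. This is an illustration only, not the tooling used in the paper: the `TextExtractor` class, the `extract_text` helper, and the sample page in `SAMPLE_HTML` are hypothetical, and a real scraper would first download the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text fragments of an HTML document, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample page; the content is illustrative, not scraped data.
SAMPLE_HTML = """
<html><head><title>Influenza</title><style>p {color: red}</style></head>
<body><h1>Influenza</h1><p>Symptoms: fever, cough.</p>
<script>var x = 1;</script><p>Precautions: rest, fluids.</p></body></html>
"""

chunks = extract_text(SAMPLE_HTML)
for chunk in chunks:
    print(chunk)
```

The extracted fragments can then be converted into whatever format is needed (for example CSV or JSON) and saved locally, which is the conversion step the abstract describes.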


2021 ◽  
Vol 15 (1) ◽  
pp. 58-80
Author(s):  
Maurilio Barbosa de Oliveira da Silva ◽  
Dyego De Oliveira Arruda ◽  
Milton Augusto Pasquotto Mariani

The popularization of social media has given consumers the opportunity to express their opinions to companies and to other consumers. These opinions, positive or negative, are called eWOM (electronic word of mouth), or online word of mouth, and consist of reviews, comments, and product analyses. This article aims to analyze the main attributes valued by tourists who commented on TripAdvisor about accommodation they visited in the city of Bonito-MS, one of the most important tourist destinations in the Brazilian Center-West. A qualitative investigation was carried out in which principles of netnography were used for data collection and analysis, together with web scraping to access 1,635 TripAdvisor comments from 2018. These comments were systematized using the Iramuteq software. The attributes found to be most often observed and subsequently written about on the site were 'room', 'breakfast', 'swimming pool', and 'service'.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Irvin Dongo ◽  
Yudith Cardinale ◽  
Ana Aguilera ◽  
Fabiola Martinez ◽  
Yuni Quintero ◽  
...  

Purpose: This paper aims to perform an exhaustive review of relevant and recent related studies, which reveals that both extraction methods are currently used to analyze credibility on Twitter. Thus, there is clear evidence of the need for different options to extract different data for this purpose. Nevertheless, none of these studies performs a comparative evaluation of both extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both alternatives for data extraction and implements a previously proposed credibility model, by adding a qualitative evaluation and a Twitter Application Programming Interface (API) performance analysis from different locations. Design/methodology/approach: As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of the shared information. To do so, several proposals use either the Twitter API or Web scraping to extract the data for the analysis. Qualitative and quantitative evaluations are performed to discover the advantages and disadvantages of both extraction methods. Findings: The study demonstrates the differences in accuracy and efficiency between the two extraction methods and highlights many more problems in this area that must be addressed to pursue true transparency and legitimacy of information on the Web. Originality/value: Results report that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e. the tweet). Moreover, concerning time performance, Web scraping is faster than the Twitter API and more flexible in terms of obtaining data; however, Web scraping is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from the central server in San Francisco.
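
The abstract does not say what its normalization process consists of. The sketch below is only one plausible reading, under the assumption that text obtained through an API and text scraped from a page differ mainly in HTML entities, Unicode variants, and whitespace. The function `normalize_tweet` and both sample strings are hypothetical.

```python
import html
import re
import unicodedata


def normalize_tweet(text):
    """Bring tweet text from different sources to one canonical form."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # unify compatibility characters
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


# Made-up examples of the same tweet as two extraction methods might return it.
api_text = "Breaking &amp; verified:\nsee the report"
scraped_text = "  Breaking & verified: see the\u00a0report "

print(normalize_tweet(api_text) == normalize_tweet(scraped_text))
```

Once both versions normalize to the same string, any text-based score computed from them is necessarily the same for both extraction methods.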


2021 ◽  
Vol 1 (2) ◽  
pp. 65-77
Author(s):  
T. E. Vildanov ◽  
N. S. Ivanov

This article explores both popular and newly invented tools for extracting data from websites and converting it into a form suitable for analysis. The paper compares Python libraries, with performance as the key criterion. The results are grouped by site, tool used, and number of iterations, and then presented in graphical form. The scientific novelty of the research lies in the field of application of the data extraction tools: semistructured data is retrieved from the websites of bookmakers and betting exchanges and then transformed. The article also describes new tools that are currently not in great demand in the field of parsing and web scraping. As a result of the study, quantitative metrics were obtained for all the tools used, and the libraries best suited to the rapid extraction and processing of large quantities of information were selected.
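
A performance comparison grouped by tool and number of iterations can be sketched with the standard-library `timeit` module. The two extractors below (`html.parser` and a regular expression) are stand-ins chosen so the example has no third-party dependencies; they are not the libraries evaluated in the article. `SAMPLE_HTML` is a synthetic table of odds-like numbers, and all function names are hypothetical.

```python
import re
import timeit
from html.parser import HTMLParser

# Synthetic page: 200 table cells with odds-like values (made-up data).
SAMPLE_HTML = "<table>" + "".join(
    f"<tr><td class='odds'>{1.5 + i / 100:.2f}</td></tr>" for i in range(200)
) + "</table>"


class OddsParser(HTMLParser):
    """Collect the numeric content of <td class='odds'> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "odds") in attrs:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.values.append(float(data))


def extract_with_parser(html):
    parser = OddsParser()
    parser.feed(html)
    parser.close()
    return parser.values


ODDS_PATTERN = re.compile(r"<td class='odds'>([^<]+)</td>")


def extract_with_regex(html):
    return [float(m) for m in ODDS_PATTERN.findall(html)]


def benchmark(tools, html, iteration_counts):
    """Return total run time in seconds keyed by (tool name, iterations)."""
    results = {}
    for name, func in tools.items():
        for n in iteration_counts:
            results[(name, n)] = timeit.timeit(lambda: func(html), number=n)
    return results


TOOLS = {"html.parser": extract_with_parser, "regex": extract_with_regex}
results = benchmark(TOOLS, SAMPLE_HTML, iteration_counts=[10, 100])
for (name, n), seconds in sorted(results.items()):
    print(f"{name:12s} iterations={n:4d} total={seconds:.4f}s")
```

Timings vary by machine, so the sketch prints whatever it measures rather than claiming a ranking. Keying the results by `(tool, iterations)` mirrors the grouping described in the abstract and makes them easy to plot.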

