Extracting Parallel Paragraphs from Common Crawl

2017, Vol 107 (1), pp. 39-56
Author(s): Jakub Kúdela, Irena Holubová, Ondřej Bojar

Abstract: Most current methods for mining parallel texts from the web assume that the web pages of a site share the same structure across languages. We believe that a non-negligible amount of parallel data is still spread across sources that do not satisfy this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing, which allows us to efficiently identify pairs of parallel segments located anywhere on the pages of a given web domain, regardless of their structure. We validate our method by realigning segments from a large parallel corpus. Another experiment with real-world data provided by the Common Crawl Foundation confirms that our solution scales to web-crawled data sets hundreds of terabytes in size.
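The pairing step can be sketched concretely. Assuming segment embeddings already live in a shared bilingual space (the role bivec plays in the paper), random-hyperplane locality-sensitive hashing buckets segments by signature so that only segments falling into the same bucket need to be compared. The sketch below is illustrative only; the dimensions, bit count, and function names are assumptions, not the authors' implementation.

# Minimal sketch: random-hyperplane LSH over averaged segment embeddings.
# Assumes bilingual word vectors (e.g. bivec output) in a shared space;
# DIM and N_BITS are illustrative choices, not values from the paper.
import numpy as np
from collections import defaultdict

DIM, N_BITS = 100, 16
rng = np.random.default_rng(0)
hyperplanes = rng.normal(size=(N_BITS, DIM))

def embed_segment(tokens, word_vectors):
    # Average the word vectors of a segment; zero vector if nothing is known.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def lsh_signature(vec):
    # The sign of the projection onto each hyperplane gives one hash bit.
    return tuple(bool(b) for b in (hyperplanes @ vec) > 0)

def candidate_pairs(src_segments, tgt_segments, word_vectors):
    # Bucket segments by signature; only pairs sharing a bucket are emitted,
    # so the expensive exact comparison runs on a small candidate set.
    buckets = defaultdict(lambda: ([], []))
    for i, seg in enumerate(src_segments):
        buckets[lsh_signature(embed_segment(seg, word_vectors))][0].append(i)
    for j, seg in enumerate(tgt_segments):
        buckets[lsh_signature(embed_segment(seg, word_vectors))][1].append(j)
    for src_ids, tgt_ids in buckets.values():
        for i in src_ids:
            for j in tgt_ids:
                yield i, j

# Toy usage with random vectors (a real run would load trained embeddings,
# so bucket collisions here are not guaranteed).
vocab = {w: rng.normal(size=DIM) for w in ["hello", "world", "ahoj", "svete"]}
print(list(candidate_pairs([["hello", "world"]], [["ahoj", "svete"]], vocab)))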

Author(s): Zhi-Qiang Liu, Ya-Jun Zhang

Many current techniques, such as those behind Google or AltaVista, are available for classifying well-organized, hierarchical crisp categories built from human-constructed web pages such as those in Yahoo. However, given the current rate of web-page production, there is an urgent need for classifiers that can autonomously classify web-page categories that overlap. In this paper, we present a competitive learning method for this problem, based on a new objective function and a gradient descent scheme. Experimental results on real-world data show that the proposed approach performs better at classifying randomly generated, knowledge-overlapped categories as well as hierarchical crisp categories.
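The abstract does not spell out the new objective function, but the flavour of competitive learning can be illustrated with the standard winner-take-all prototype update below. This is a generic sketch, not the authors' gradient scheme; the learning rate, epoch count, and cluster count are all assumptions.

# Generic winner-take-all competitive learning over document vectors.
# This is not the paper's objective function or gradient scheme, only the
# textbook update: the closest prototype moves toward each sample.
import numpy as np

def competitive_learning(X, n_clusters, lr=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))
            prototypes[winner] += lr * (x - prototypes[winner])
    return prototypes

# Toy usage: 200 random 10-dimensional "page vectors" grouped into 3 clusters.
X = np.random.default_rng(1).normal(size=(200, 10))
print(competitive_learning(X, n_clusters=3).shape)  # (3, 10)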


2020, pp. 143-158
Author(s): Chris Bleakley

Chapter 8 explores the arrival of the World Wide Web, Amazon, and Google. The web allows users to display “pages” of information retrieved from remote computers by means of the Internet. Inventor Tim Berners-Lee released the first web software for free, setting in motion an explosion in Internet usage. Seeing the opportunity of a lifetime, Jeff Bezos set up Amazon as an online bookstore. Amazon’s success was accelerated by a product recommender algorithm that selectively targets advertising at users. By the mid-1990s there were so many web sites that users often couldn’t find what they were looking for. Stanford PhD student Larry Page invented an algorithm for ranking search results based on the importance and relevance of web pages. Page and fellow student Sergey Brin established a company to bring their search algorithm to the world. Page and Brin, the founders of Google, are now each worth US$35-40 billion.


Author(s): Ravi P. Kumar, Ashutosh K. Singh, Anand Mohan

In this era of Web computing, cyber security is very important, as more and more data is moving onto the Web. Some data are confidential and important, and there are many threats to data on the Web. Some basic threats can be addressed by designing Web sites properly using Search Engine Optimization techniques. One such threat is the hanging page, which gives room for link spamming. This chapter addresses the issues caused by hanging pages in Web computing and has four main objectives: 1) compare and review the different link-structure-based ranking algorithms for ranking Web pages, with PageRank used as the base algorithm throughout the chapter; 2) study hanging pages, explore their effects on Web security, and compare the existing methods for handling them; 3) study link spam and explore the contribution of hanging pages to it; and 4) study Search Engine Optimization (SEO) / Web Site Optimization (WSO) and explore the effect of hanging pages on SEO.
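Since PageRank serves as the base algorithm throughout the chapter, a minimal power-iteration sketch is given below. Hanging (dangling) pages are handled here by spreading their rank uniformly over all pages, which is one common remedy; the chapter's own treatment may differ, and the damping factor and example graph are illustrative.

# Minimal power-iteration PageRank; pages with no out-links ("hanging"
# pages) redistribute their rank uniformly across all pages.
import numpy as np

def pagerank(out_links, d=0.85, tol=1e-9, max_iter=100):
    # out_links: dict mapping every page to the list of pages it links to.
    pages = sorted(out_links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = np.full(n, (1.0 - d) / n)
        for p, targets in out_links.items():
            if targets:                              # normal page
                share = d * rank[idx[p]] / len(targets)
                for t in targets:
                    new[idx[t]] += share
            else:                                    # hanging page
                new += d * rank[idx[p]] / n
        if np.abs(new - rank).sum() < tol:
            rank = new
            break
        rank = new
    return dict(zip(pages, rank))

# Example: page "c" is a hanging page (no out-links) yet still passes rank on.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": []}))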


Author(s): June Tolsby

How can three linguistic methods be used to identify the Web displays of an organization’s knowledge values and knowledge-sharing requirements? This chapter approaches this question by using three linguistic methods to analyse a company’s Web sites: (a) elements from the community of practice (CoP) theory, (b) concepts from communication theory, such as modality and transitivity, and (c) elements from discourse analysis. The investigation demonstrates how a company’s use of the Web can promote a work attitude that can be considered an endorsement of a particular organizational behaviour. The Web pages display a particular organizational identity that will attract some parties and repel others. In this way, a company’s Web pages represent a window to the world that needs to be handled with care, since it can be interpreted as a projection of the company’s identity.


First Monday, 1997
Author(s): Steven M. Friedman

The power of the World Wide Web, it is commonly believed, lies in the vast information it makes available; "Content is king," the mantra runs. This image creates the conception of the Internet as most of us envision it: a vast, horizontal labyrinth of pages which connect almost arbitrarily to each other, creating a system believed to be "democratic" in which anyone can publish Web pages. I propose a new, vertical and hierarchical conception of the Web, based on the observation that almost everyone searching for information on the Web has to go through filter Web sites of some sort, such as search engines, to find it. The Albert Einstein Online Web site provides a paradigm for this re-conceptualization of the Web, based on a distinction between the wealth of information and that which organizes it and frames the viewers' conceptions of it. This emphasis on organization implies that we need a new metaphor for the Internet: the hierarchical "Tree" would be more appropriate organizationally than a chaotic "Web." The metaphor needs to change because the current one implies an anarchic and random nature to the Web, an implication that may turn off potential Netizens, who can be scared off by such overwhelming anarchy and the difficulty of finding information.


2021, Vol 17 (2), pp. 1-10
Author(s): Hussein Mohammed, Ayad Abdulsada

Searchable encryption (SE) is an interesting tool that enables clients to outsource their encrypted data to external cloud servers with unlimited storage and computing power while retaining the ability to search their data without decryption. Current SE solutions support only single-keyword search, making them impractical in real-world scenarios. In this paper, we design and implement a multi-keyword similarity search scheme over encrypted data using locality-sensitive hashing functions and a Bloom filter. The proposed scheme can tolerate common spelling mistakes and enjoys enhanced security properties, such as hiding the access and search patterns, at the cost of higher latency. To support similarity search, we utilize an efficient bi-gram-based method for keyword transformation, which improves the accuracy of search results. Our scheme employs two non-colluding servers to break the correlation between search queries and search results. Experiments using real-world data illustrate that our scheme is practically efficient, secure, and retains high accuracy.
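The bi-gram and Bloom-filter part of the idea can be pictured with a small, unencrypted sketch: each keyword is broken into character bi-grams, the bi-grams are hashed into a Bloom filter, and similar spellings produce filters that share many bits. The scheme's actual LSH functions, encryption, and two-server protocol are omitted here, and the filter size and hash count below are assumptions.

# Unencrypted sketch of bi-gram keyword transformation into a Bloom filter.
# FILTER_BITS and N_HASHES are illustrative parameters.
import hashlib

FILTER_BITS, N_HASHES = 256, 4

def bigrams(word):
    word = f"#{word.lower()}#"          # pad so first/last characters count
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bloom_filter(word):
    bits = [0] * FILTER_BITS
    for gram in bigrams(word):
        for k in range(N_HASHES):
            h = hashlib.sha256(f"{k}:{gram}".encode()).digest()
            bits[int.from_bytes(h[:4], "big") % FILTER_BITS] = 1
    return bits

def similarity(a, b):
    # Jaccard similarity of two Bloom filters as a fuzzy-match score.
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

# A misspelled query still shares many bits with the indexed keyword.
print(similarity(bloom_filter("network"), bloom_filter("netwrok")))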


2001, Vol 20 (4), pp. 11-18
Author(s): Cleborne D. Maddux

The Internet and the World Wide Web are growing at unprecedented rates, and more and more teachers are authoring school or classroom web pages. Such pages have particular potential for use in rural areas by special educators, children with special needs, and the parents of children with special needs. However, the quality of many of these pages leaves much to be desired. All web pages, especially those authored by special educators, should be accessible to people with disabilities. Many other problems complicate use of the web for all users, whether or not they have disabilities. By taking some simple steps, beginning webmasters can avoid these problems. This article discusses practical solutions to common accessibility problems and other problems commonly seen on the web.


Author(s): Kai-Hsiang Yang

This chapter addresses Uniform Resource Locator (URL) correction techniques in proxy servers. Proxy servers are increasingly important on the World Wide Web (WWW): they cache Web pages so that pages can be browsed quickly and unnecessary network traffic is reduced. Traditional proxy servers use the URL to identify cache entries, and a cache miss occurs when the requested URL is not present in the cache. For most users, however, there is some regularity and scope to their browsing. It would be very convenient if users did not need to enter the whole long URL, or could still see the Web content even when they have forgotten part of the URL, especially for their favorite personal Web sites. We introduce a URL correction mechanism into the personal proxy server to achieve this goal.
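One way to picture such a correction step, though not necessarily the mechanism the chapter introduces, is to match a mistyped or partial request against the URLs already in the personal cache by string similarity. The example below uses difflib from the Python standard library; the cached URLs are placeholders.

# Sketch: on a cache miss, suggest the closest cached URLs.
from difflib import get_close_matches

cached_urls = [
    "http://example.com/docs/index.html",
    "http://example.com/docs/faq.html",
    "http://example.com/blog/archive.html",
]

def correct_url(requested, cache, max_suggestions=3):
    # Return the cached URLs most similar to the mistyped/partial request.
    return get_close_matches(requested, cache, n=max_suggestions, cutoff=0.5)

# A slightly wrong URL still maps back to the cached page.
print(correct_url("http://example.com/doc/index.htm", cached_urls))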


IKON, 2009, pp. 151-175
Author(s): Sara Rigutti, Gisella Paoletti, Laura Blasutig

We examined the consequences of a visualization pattern often chosen by web sites that show textual information within the web pages and the related iconic information within pop-up windows. Presenting information in pop-up windows aims to integrate text and pictures but makes it difficult to analyze both information sources together. We conducted an experiment in which 80 participants read on a computer screen a text with embedded graphs positioned either near to or far from the related textual information, with the graphs either integrated in the text or shown in pop-up windows. The reading behaviour of participants was observed to establish who among them examined the graphs and who did not. Recall of textual and iconic information was measured using a recall questionnaire. Our data show a tendency among students to ignore graphs, in particular when they are visualized in pop-up windows. These results are confirmed by interviews with undergraduate students who analyzed the same materials using the think-aloud method.


2004, Vol 4 (1)
Author(s): David Carabantes Alarcón, Carmen García Carrión, Juan Vicente Beneit Montesinos

Abstract: Quality on the Internet is of great value, and even more so when it concerns a health-related web page such as a resource on drug dependence. This article reviews the most prominent estimators and systems of web quality and uses them to develop a specific system for assessing the quality of web resources on drug dependence. A feasibility test was carried out by analysing the main web pages on this subject (n=60), gathering users' assessments of the quality of the resources. Areas for improvement were identified regarding the accuracy and reliability of the information, authorship, and the development of descriptions and assessments of external links.

