Extracting Parallel Paragraphs from Common Crawl

2017, Vol 107 (1), pp. 39-56
Author(s): Jakub Kúdela, Irena Holubová, Ondřej Bojar

Abstract: Most current methods for mining parallel texts from the web assume that the web pages of a site share the same structure across languages. We believe that a non-negligible amount of parallel data is still spread across sources that do not satisfy this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing, which allows us to efficiently identify pairs of parallel segments located anywhere on the pages of a given web domain, regardless of their structure. We validate our method by realigning segments from a large parallel corpus. Another experiment with real-world data provided by the Common Crawl Foundation confirms that our solution scales to web-crawled data sets hundreds of terabytes in size.
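The pairing step can be sketched concretely. Assuming segment embeddings already live in a shared bilingual space (the role bivec plays in the paper), random-hyperplane locality-sensitive hashing buckets segments by signature so that only segments falling into the same bucket need to be compared. The sketch below is illustrative only; the dimensions, bit count, and function names are assumptions, not the authors' implementation.

# Minimal sketch: random-hyperplane LSH over averaged segment embeddings.
# Assumes bilingual word vectors (e.g. bivec output) in a shared space;
# DIM and N_BITS are illustrative choices, not values from the paper.
import numpy as np
from collections import defaultdict

DIM, N_BITS = 100, 16
rng = np.random.default_rng(0)
hyperplanes = rng.normal(size=(N_BITS, DIM))

def embed_segment(tokens, word_vectors):
    # Average the word vectors of a segment; zero vector if nothing is known.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def lsh_signature(vec):
    # The sign of the projection onto each hyperplane gives one hash bit.
    return tuple(bool(b) for b in (hyperplanes @ vec) > 0)

def candidate_pairs(src_segments, tgt_segments, word_vectors):
    # Bucket segments by signature; only pairs sharing a bucket are emitted,
    # so the expensive exact comparison runs on a small candidate set.
    buckets = defaultdict(lambda: ([], []))
    for i, seg in enumerate(src_segments):
        buckets[lsh_signature(embed_segment(seg, word_vectors))][0].append(i)
    for j, seg in enumerate(tgt_segments):
        buckets[lsh_signature(embed_segment(seg, word_vectors))][1].append(j)
    for src_ids, tgt_ids in buckets.values():
        for i in src_ids:
            for j in tgt_ids:
                yield i, j

# Toy usage with random vectors (a real run would load trained embeddings,
# so bucket collisions here are not guaranteed).
vocab = {w: rng.normal(size=DIM) for w in ["hello", "world", "ahoj", "svete"]}
print(list(candidate_pairs([["hello", "world"]], [["ahoj", "svete"]], vocab)))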

Author(s): Zhi-Qiang Liu, Ya-Jun Zhang

Many current techniques, such as those behind Google or AltaVista, are available for classifying well-organized, hierarchical crisp categories built from human-constructed web pages such as those in Yahoo. However, given the current rate of web-page production, there is an urgent need for classifiers that can autonomously classify web-page categories that overlap. In this paper, we present a competitive learning method for this problem, based on a new objective function and a gradient descent scheme. Experimental results on real-world data show that the proposed approach performs better at classifying randomly generated, knowledge-overlapped categories as well as hierarchical crisp categories.
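The abstract does not spell out the new objective function, but the flavour of competitive learning can be illustrated with the standard winner-take-all prototype update below. This is a generic sketch, not the authors' gradient scheme; the learning rate, epoch count, and cluster count are all assumptions.

# Generic winner-take-all competitive learning over document vectors.
# This is not the paper's objective function or gradient scheme, only the
# textbook update: the closest prototype moves toward each sample.
import numpy as np

def competitive_learning(X, n_clusters, lr=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))
            prototypes[winner] += lr * (x - prototypes[winner])
    return prototypes

# Toy usage: 200 random 10-dimensional "page vectors" grouped into 3 clusters.
X = np.random.default_rng(1).normal(size=(200, 10))
print(competitive_learning(X, n_clusters=3).shape)  # (3, 10)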


2020, pp. 143-158
Author(s): Chris Bleakley

Chapter 8 explores the arrival of the World Wide Web, Amazon, and Google. The web allows users to display “pages” of information retrieved from remote computers by means of the Internet. Inventor Tim Berners-Lee released the first web software for free, setting in motion an explosion in Internet usage. Seeing the opportunity of a lifetime, Jeff Bezos set up Amazon as an online bookstore. Amazon’s success was accelerated by a product recommender algorithm that selectively targets advertising at users. By the mid-1990s there were so many web sites that users often couldn’t find what they were looking for. Stanford PhD student Larry Page invented an algorithm for ranking search results based on the importance and relevance of web pages. Page and fellow student Sergey Brin established a company to bring their search algorithm to the world. Page and Brin, the founders of Google, are now each worth US$35-40 billion.


Author(s): Ravi P. Kumar, Ashutosh K. Singh, Anand Mohan

In this era of Web computing, cyber security is very important, as more and more data is moving onto the Web. Some data are confidential and important, and there are many threats to data on the Web. Some basic threats can be addressed by designing Web sites properly using Search Engine Optimization techniques. One such threat is the hanging page, which gives room for link spamming. This chapter addresses the issues caused by hanging pages in Web computing and has four main objectives: 1) compare and review the different link-structure-based ranking algorithms for ranking Web pages, with PageRank used as the base algorithm throughout the chapter; 2) study hanging pages, explore their effects on Web security, and compare the existing methods for handling them; 3) study link spam and explore the contribution of hanging pages to it; and 4) study Search Engine Optimization (SEO) / Web Site Optimization (WSO) and explore the effect of hanging pages on SEO.
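Since PageRank serves as the base algorithm throughout the chapter, a minimal power-iteration sketch is given below. Hanging (dangling) pages are handled here by spreading their rank uniformly over all pages, which is one common remedy; the chapter's own treatment may differ, and the damping factor and example graph are illustrative.

# Minimal power-iteration PageRank; pages with no out-links ("hanging"
# pages) redistribute their rank uniformly across all pages.
import numpy as np

def pagerank(out_links, d=0.85, tol=1e-9, max_iter=100):
    # out_links: dict mapping every page to the list of pages it links to.
    pages = sorted(out_links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = np.full(n, (1.0 - d) / n)
        for p, targets in out_links.items():
            if targets:                              # normal page
                share = d * rank[idx[p]] / len(targets)
                for t in targets:
                    new[idx[t]] += share
            else:                                    # hanging page
                new += d * rank[idx[p]] / n
        if np.abs(new - rank).sum() < tol:
            rank = new
            break
        rank = new
    return dict(zip(pages, rank))

# Example: page "c" is a hanging page (no out-links) yet still passes rank on.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": []}))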


Author(s): June Tolsby

How can three linguistic methods be used to identify the Web displays of an organization’s knowledge values and knowledge-sharing requirements? This chapter approaches this question by using three linguistic methods to analyse a company’s Web sites: (a) elements from the community of practice (CoP) theory, (b) concepts from communication theory, such as modality and transitivity, and (c) elements from discourse analysis. The investigation demonstrates how a company’s use of the Web can promote a work attitude that can be considered an endorsement of a particular organizational behaviour. The Web pages display a particular organizational identity that will attract some parties and repel others. In this way, a company’s Web pages represent a window to the world that needs to be handled with care, since it can be interpreted as a projection of the company’s identity.


First Monday, 1997
Author(s): Steven M. Friedman

The power of the World Wide Web, it is commonly believed, lies in the vast information it makes available; "Content is king," the mantra runs. This image creates the conception of the Internet as most of us envision it: a vast, horizontal labyrinth of pages which connect almost arbitrarily to each other, creating a system believed to be "democratic" in which anyone can publish Web pages. I propose a new, vertical and hierarchical conception of the Web, based on the observation that almost everyone searching for information on the Web has to go through filter Web sites of some sort, such as search engines, to find it. The Albert Einstein Online Web site provides a paradigm for this re-conceptualization of the Web, based on a distinction between the wealth of information and that which organizes it and frames the viewers' conceptions of it. This emphasis on organization implies that we need a new metaphor for the Internet: the hierarchical "Tree" would be more appropriate organizationally than a chaotic "Web." The metaphor needs to change because the current one implies an anarchic and random nature to the Web, an implication that may turn off potential Netizens, who can be scared off by such overwhelming anarchy and the difficulty of finding information.


2021, Vol 17 (2), pp. 1-10
Author(s): Hussein Mohammed, Ayad Abdulsada

Searchable encryption (SE) is an interesting tool that enables clients to outsource their encrypted data to external cloud servers with unlimited storage and computing power while retaining the ability to search their data without decryption. Current SE solutions support only single-keyword search, making them impractical in real-world scenarios. In this paper, we design and implement a multi-keyword similarity search scheme over encrypted data using locality-sensitive hashing functions and a Bloom filter. The proposed scheme can tolerate common spelling mistakes and enjoys enhanced security properties, such as hiding the access and search patterns, at the cost of higher latency. To support similarity search, we utilize an efficient bi-gram-based method for keyword transformation, which improves the accuracy of search results. Our scheme employs two non-colluding servers to break the correlation between search queries and search results. Experiments using real-world data illustrate that our scheme is practically efficient, secure, and retains high accuracy.
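The bi-gram and Bloom-filter part of the idea can be pictured with a small, unencrypted sketch: each keyword is broken into character bi-grams, the bi-grams are hashed into a Bloom filter, and similar spellings produce filters that share many bits. The scheme's actual LSH functions, encryption, and two-server protocol are omitted here, and the filter size and hash count below are assumptions.

# Unencrypted sketch of bi-gram keyword transformation into a Bloom filter.
# FILTER_BITS and N_HASHES are illustrative parameters.
import hashlib

FILTER_BITS, N_HASHES = 256, 4

def bigrams(word):
    word = f"#{word.lower()}#"          # pad so first/last characters count
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bloom_filter(word):
    bits = [0] * FILTER_BITS
    for gram in bigrams(word):
        for k in range(N_HASHES):
            h = hashlib.sha256(f"{k}:{gram}".encode()).digest()
            bits[int.from_bytes(h[:4], "big") % FILTER_BITS] = 1
    return bits

def similarity(a, b):
    # Jaccard similarity of two Bloom filters as a fuzzy-match score.
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

# A misspelled query still shares many bits with the indexed keyword.
print(similarity(bloom_filter("network"), bloom_filter("netwrok")))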


2001, Vol 20 (4), pp. 11-18
Author(s): Cleborne D. Maddux

The Internet and the World Wide Web are growing at unprecedented rates, and more and more teachers are authoring school or classroom web pages. Such pages have particular potential for use in rural areas by special educators, children with special needs, and the parents of children with special needs. However, the quality of many of these pages leaves much to be desired. All web pages, especially those authored by special educators, should be accessible to people with disabilities. Many other problems complicate use of the web for all users, whether or not they have disabilities. By taking some simple steps, beginning webmasters can avoid these problems. This article discusses practical solutions to common accessibility problems and other problems commonly seen on the web.


Author(s): Kai-Hsiang Yang

This chapter addresses Uniform Resource Locator (URL) correction techniques in proxy servers. Proxy servers are increasingly important on the World Wide Web (WWW): they cache Web pages so that pages can be browsed quickly and unnecessary network traffic is reduced. Traditional proxy servers use the URL to identify cache entries, and a cache miss occurs when the requested URL is not present in the cache. For most users, however, there is some regularity and scope to their browsing. It would be very convenient if users did not need to enter the whole long URL, or could still see the Web content even when they have forgotten part of the URL, especially for their favorite personal Web sites. We introduce a URL correction mechanism into the personal proxy server to achieve this goal.
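One way to picture such a correction step, though not necessarily the mechanism the chapter introduces, is to match a mistyped or partial request against the URLs already in the personal cache by string similarity. The example below uses difflib from the Python standard library; the cached URLs are placeholders.

# Sketch: on a cache miss, suggest the closest cached URLs.
from difflib import get_close_matches

cached_urls = [
    "http://example.com/docs/index.html",
    "http://example.com/docs/faq.html",
    "http://example.com/blog/archive.html",
]

def correct_url(requested, cache, max_suggestions=3):
    # Return the cached URLs most similar to the mistyped/partial request.
    return get_close_matches(requested, cache, n=max_suggestions, cutoff=0.5)

# A slightly wrong URL still maps back to the cached page.
print(correct_url("http://example.com/doc/index.htm", cached_urls))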


IKON, 2009, pp. 151-175
Author(s): Sara Rigutti, Gisella Paoletti, Laura Blasutig

We examined the consequences of a visualization pattern often chosen by web sites that show textual information within the web pages and the related iconic information within pop-up windows. Presenting information in pop-up windows aims to integrate text and pictures but makes it difficult to analyze both information sources together. We conducted an experiment in which 80 participants read on a computer screen a text with embedded graphs positioned either near to or far from the related textual information, with the graphs either integrated in the text or shown in pop-up windows. The reading behaviour of participants was observed to establish who among them examined the graphs and who did not. Recall of textual and iconic information was measured using a recall questionnaire. Our data show a tendency among students to ignore graphs, in particular when they are visualized in pop-up windows. These results are confirmed by interviews with undergraduate students who analyzed the same materials using the think-aloud method.


2004, Vol 4 (1)
Author(s): David Carabantes Alarcón, Carmen García Carrión, Juan Vicente Beneit Montesinos

Abstract: Quality on the Internet is of great value, and even more so when it concerns a health-related web page such as a resource on drug dependence. This article reviews the most prominent estimators and systems of web quality and uses them to develop a specific system for assessing the quality of web resources on drug dependence. A feasibility test was carried out by analysing the main web pages on this subject (n=60), gathering users' assessments of the quality of the resources. Areas for improvement were identified regarding the accuracy and reliability of the information, authorship, and the development of descriptions and assessments of external links.

