Big Data Analysis of Web Data Extraction

2018 ◽  
Vol 7 (4.37) ◽  
pp. 168
Author(s):  
Nadia Ibrahim ◽  
Alaa Hassan ◽  
Marwah Nihad

In this study, we examine large-scale data extraction techniques, which include detecting patterns and hidden relationships among numerous factors and retrieving the required information. Rapid analysis of massive data can lead to innovation and to concepts of theoretical value. Compared with mining traditional data sets, mining the vast amount of interdependent, heterogeneous data on the web can expand knowledge and ideas about the target domain. This research therefore studies data mining on the Internet. The networks used to extract data from different locations can be complex, and web technology has previously been used for data extraction and analysis (Marwah et al., 2016). In this work, we extracted information from large numbers of web pages, examined the pages of each site using Java code, and stored the extracted information in a dedicated database for the web pages. We used a data network function to evaluate and categorize the pages found, identifying trusted and risky web pages, and exported the data to a CSV file. These data were then examined and classified using WEKA to obtain accurate results. The results indicate that the applied data mining algorithms outperform other techniques in classifying and extracting the data, with high performance.
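As an illustrative sketch of the extraction-and-export step described above (the authors' exact attributes and libraries are not specified; jsoup, the feature set, and the file names here are assumptions), a small Java program can read a list of page URLs, pull simple features from each page, and write them to a CSV file that can later be loaded into WEKA:

```java
// Minimal sketch, assuming jsoup is on the classpath and "pages.txt" holds one URL per line.
// The features (title length, link count, form count) are illustrative placeholders.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class PageFeatureExtractor {
    public static void main(String[] args) throws Exception {
        List<String> urls = Files.readAllLines(Paths.get("pages.txt"));
        try (PrintWriter csv = new PrintWriter("pages.csv")) {
            csv.println("url,titleLength,linkCount,formCount");   // header row for WEKA's CSVLoader
            for (String url : urls) {
                Document doc = Jsoup.connect(url).timeout(10_000).get();
                int titleLength = doc.title().length();
                int linkCount = doc.select("a[href]").size();
                int formCount = doc.select("form").size();
                csv.printf("%s,%d,%d,%d%n", url, titleLength, linkCount, formCount);
            }
        }
    }
}
```

The resulting pages.csv can then be opened with WEKA's CSVLoader or the Explorer GUI for the classification step.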

2020 ◽  
Vol 17 (1) ◽  
pp. 6-9
Author(s):  
Ramya G. Franklin ◽  
B. Muthukumar

The growth of science is a priceless asset to humans and society. The plethora of high-end machines has made life more sophisticated, which in turn is paid back as health issues. Health care data are complex and large, and these heterogeneous data are used to diagnose patients' diseases. It is better to predict diseases at an earlier stage, which can save lives and give an upper hand in controlling the disease. Data mining approaches are very useful in analyzing complex, heterogeneous, and large data sets; the mining algorithms extract the essential data from the raw data. This paper presents a survey of the various data mining algorithms used in predicting a very common disease of day-to-day life, diabetes mellitus. Over 246 million people in the world are diabetic, with a majority of them being women, and the WHO reports that by 2025 this number is expected to rise to over 380 million.
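For readers unfamiliar with how such algorithms are typically applied, the following hedged Java/WEKA sketch runs a single classifier over a diabetes data set (assuming the diabetes.arff sample distributed with WEKA is in the working directory); the choice of the J48 decision tree is only illustrative and is not a recommendation from the survey:

```java
// Minimal sketch: 10-fold cross-validation of a decision tree on a Pima-style diabetes set.
// "diabetes.arff" is assumed to be present (WEKA ships a sample file with this name).
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class DiabetesPrediction {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class = tested_positive / tested_negative

        J48 tree = new J48();                           // C4.5 decision tree learner
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));  // 10-fold cross-validation

        System.out.println(eval.toSummaryString("\n10-fold CV results\n", false));
    }
}
```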


2016 ◽  
Vol 4 (8) ◽  
pp. 118-135
Author(s):  
Rajendra Gupta

Phishing is a kind of e-commerce lure that tries to steal a web user's confidential information by creating a website nearly identical to a legitimate one, in which the content and images remain almost the same as on the legitimate website, with only small changes. Another form of phishing makes minor changes to the URL or domain of the legitimate website. In this paper, a number of anti-phishing toolbars are discussed and a system model is proposed to tackle phishing attacks. The proposed anti-phishing system is based on a plug-in tool for the web browser. The performance of the proposed system is studied with three different data mining classification algorithms: Random Forest, Nearest Neighbour Classification (NNC), and the Bayesian Classifier (BC). To evaluate the proposed anti-phishing system for the detection of phishing websites, 7690 legitimate websites and 2280 phishing websites were collected from authorised sources such as the APWG database and PhishTank. After analyzing the data mining algorithms over the phishing web pages, it is found that the Bayesian algorithm responds fastest and gives more accurate results than the other algorithms.
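A minimal sketch of this kind of classifier comparison in Java with WEKA is shown below; the file name phishing.arff and the attribute layout are assumptions rather than the paper's actual dataset, and the parameters are library defaults rather than the tuned settings of the proposed system:

```java
// Minimal sketch: cross-validate three classifiers on an assumed "phishing.arff" file
// whose last attribute is the nominal class (legitimate / phishing).
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class PhishingClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("phishing.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new RandomForest(), new IBk(3), new NaiveBayes() };
        String[] names = { "Random Forest", "Nearest Neighbour (IBk, k=3)", "Naive Bayes" };

        for (int i = 0; i < models.length; i++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(models[i], data, 10, new Random(1));  // 10-fold CV
            System.out.printf("%-30s accuracy: %.2f%%%n", names[i], eval.pctCorrect());
        }
    }
}
```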


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the web is growing significantly. Information on the web exists in several forms: structured, semi-structured, and unstructured. The majority of information on the web is presented in web pages, and the information presented in web pages is semi-structured. However, the information required for a given context is scattered across different web documents, and it is difficult to analyze these large volumes of semi-structured information and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various web sources and to perform an effective analysis of the extracted data. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few example applications. The framework has been implemented and tested for effectiveness, and the results are promising.
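A minimal sketch of the crawl-and-extract stage of such a framework is given below, using the jsoup library as an assumed crawler and parser (the paper does not name its components); it performs a small breadth-first crawl from a seed URL and records page titles for later analysis:

```java
// Minimal sketch: breadth-first crawl limited to one site, collecting page titles.
// The seed URL, page limit, and use of jsoup are illustrative assumptions.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        String seed = "https://example.com/";
        int maxPages = 20;

        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;                 // skip already-visited pages
            try {
                Document doc = Jsoup.connect(url).timeout(10_000).get();
                System.out.println(url + " -> " + doc.title());
                for (Element link : doc.select("a[href]")) {
                    String next = link.attr("abs:href");     // resolve relative links
                    if (next.startsWith(seed)) frontier.add(next);  // stay within the seed site
                }
            } catch (Exception e) {
                System.err.println("skipped " + url + ": " + e.getMessage());
            }
        }
    }
}
```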


Nowadays, various data mining algorithms can produce a specific set of results, known as patterns, from a huge data repository, but there is no infrastructure or system to keep the generated patterns in persistent storage. A pattern warehouse provides a foundation for preserving these patterns in a dedicated environment for long-term use. Most organizations are more interested in the information or patterns than in raw, unprocessed data, because the extracted knowledge plays a vital role in making the right decisions for the growth of an organization. We have examined the sources of patterns generated from large data sets. This paper briefly covers the application areas of patterns and the idea of the pattern warehouse, the architecture of a pattern warehouse, the relationship between the data warehouse and data mining, the association between data mining and the pattern warehouse, and a critical evaluation of existing, previously published approaches, with particular emphasis on association rule related review elements. We also analyze pattern warehouses and data warehouses with respect to various factors such as storage space, type of storage unit, and characteristics, and identify several research directions.
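As a rough illustration of the persistence idea (not the architecture proposed in the paper), the following Java sketch stores mined association rules in a relational table using an embedded H2 database; the schema, file name, and sample rule are all assumptions:

```java
// Minimal sketch: persist mined association rules instead of re-mining the raw data.
// Assumes the H2 JDBC driver is on the classpath; the table layout is illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class PatternWarehouse {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./patternwarehouse")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS association_rule (" +
                           "id INT AUTO_INCREMENT PRIMARY KEY, antecedent VARCHAR(255), " +
                           "consequent VARCHAR(255), support DOUBLE, confidence DOUBLE)");
            }
            // Insert one rule; in practice these rows would come from the mining step.
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO association_rule(antecedent, consequent, support, confidence) " +
                    "VALUES (?, ?, ?, ?)")) {
                ps.setString(1, "bread, butter");   // example antecedent itemset
                ps.setString(2, "milk");            // example consequent
                ps.setDouble(3, 0.12);              // example support
                ps.setDouble(4, 0.85);              // example confidence
                ps.executeUpdate();
            }
        }
    }
}
```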


Author(s):  
Shalin Hai-Jew

Understanding Web network structures may offer insights on various organizations and individuals. These structures are often latent and invisible without special software tools; the interrelationships between various websites may not be apparent with a surface perusal of the publicly accessible Web pages. Three publicly available tools may be “chained” (combined in sequence) in a data extraction sequence to enable visualization of various aspects of HTTP network structures in an enriched way (with more detailed insights about the composition of such networks, given their heterogeneous and multimodal contents). Maltego Tungsten™, a penetration-testing tool, enables the mapping of Web networks, which are enriched with a variety of information: the technological understructure and tools used to build the network, some linked individuals (digital profiles), some linked documents, linked images, related emails, some related geographical data, and even the in-degree of the various nodes. NCapture with NVivo enables the extraction of public social media platform data and some basic analysis of these captures. The Network Overview, Discovery, and Exploration for Excel (NodeXL) tool enables the extraction of social media platform data and various evocative data visualizations and analyses. With the size of the Web growing exponentially and new domains (like .ventures, .guru, .education, .company, and others) emerging, the ability to map widely will offer a broad competitive advantage to those who would exploit this approach to enhance knowledge.


Author(s):  
Xiaohui Liu

Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the effective analysis of data. IDA draws its techniques from diverse fields, including artificial intelligence, databases, high-performance computing, pattern recognition, and statistics. These fields often complement each other (e.g., many statistical methods, particularly those for large data sets, rely on computation, but brute computing power is no substitute for statistical knowledge) (Berthold & Hand, 2003; Liu, 1999).


Author(s):  
Ahmed El Azab ◽  
Mahmood A. Mahmood ◽  
Abd El-Aziz

Web usage mining techniques and their applications across industries are still exploratory and, despite an increase in academic research, there remains the challenge of analyzing the web in a way that quantitatively captures web users' common interests and characterizes their underlying tasks. This chapter addresses the problem of how to support web usage mining techniques and applications across industries by combining the language of web pages with the algorithms used in web data mining. Existing research on web usage mining tends to focus on finding out how each technique can be applied in a particular industry field; however, there is little evidence that researchers have approached the issue of web usage mining across industries. Consequently, the aim of this chapter is to provide an overview of how web usage mining techniques and applications across industries can be supported.
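As a concrete, if simplified, illustration of the preprocessing common to most web usage mining techniques, the following Java sketch parses a Common Log Format access log (an assumed file named access.log) and counts requests per page, a first step toward quantifying users' common interests:

```java
// Minimal sketch: count page requests from a Common Log Format web server log.
// Common Log Format: host ident user [time] "METHOD /path HTTP/x.y" status bytes
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class UsageMiningPreprocess {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> pageHits = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("access.log"))) {
            int start = line.indexOf('"');
            int end = line.indexOf('"', start + 1);
            if (start < 0 || end < 0) continue;                 // skip malformed lines
            String[] request = line.substring(start + 1, end).split(" ");
            if (request.length < 2) continue;
            pageHits.merge(request[1], 1, Integer::sum);        // count hits per requested path
        }
        pageHits.forEach((page, hits) -> System.out.println(page + "\t" + hits));
    }
}
```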


2017 ◽  
Vol 7 (1.1) ◽  
pp. 286
Author(s):  
B. Sekhar Babu ◽  
P. Lakshmi Prasanna ◽  
P. Vidyullatha

In recent years, the World Wide Web has grown into a familiar medium for investigating new information, business trends, trading strategies, and so on. Several organizations and companies are also using the web in order to present their products or services across the world. E-commerce is a kind of business or commercial transaction that involves the transfer of information across the web or Internet. In this situation, a huge amount of data is produced and dumped into web services. This data overload makes it difficult to determine accurate and valuable information, hence web data mining is used as a tool to discover and mine knowledge from the web. Web data mining technology can be applied by e-commerce organizations to offer personalized e-commerce solutions and better meet the desires of customers. A data mining algorithm such as ontology-based association rule mining using the Apriori algorithm extracts various useful information from large data sets. We implement this data mining technique in Java; the data sets are generated dynamically while transactions are processed, and various patterns are extracted.
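A hedged sketch of the association rule mining step in Java, using WEKA's Apriori implementation, is shown below; the file transactions.arff and the thresholds are assumptions, and the ontology-based extension described above is not reproduced:

```java
// Minimal sketch: Apriori association rule mining over an assumed nominal transaction
// dataset "transactions.arff"; thresholds are illustrative defaults.
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EcommerceRuleMining {
    public static void main(String[] args) throws Exception {
        Instances transactions = DataSource.read("transactions.arff");

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.1);   // keep itemsets appearing in >= 10% of transactions
        apriori.setMinMetric(0.8);              // minimum rule confidence
        apriori.setNumRules(20);                // report the 20 strongest rules
        apriori.buildAssociations(transactions);

        System.out.println(apriori);            // prints the discovered association rules
    }
}
```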


2016 ◽  
Vol 2016 ◽  
pp. 1-11 ◽  
Author(s):  
Ivan Kholod ◽  
Ilya Petukhov ◽  
Andrey Shorov

This paper describes the construction of a Cloud for Distributed Data Analysis (CDDA) based on the actor model. The design maps data mining algorithms onto decomposed functional blocks, which are assigned to actors. Using actors allows the computation to be moved close to the stored data. The process does not require loading data sets into the cloud and allows users to analyze confidential information locally. The results of experiments show that the proposed approach outperforms established solutions in efficiency.
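The following plain-Java sketch illustrates the actor idea behind such a design (it does not reproduce the CDDA decomposition or its actor framework): each data-node actor owns a mailbox and computes a partial result over its local partition, so the computation moves to the data, and a collector merges the partial results:

```java
// Minimal actor-model sketch: mailboxes, message passing, and local computation per partition.
// The partitions, messages, and the mean computation are illustrative assumptions.
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ActorSketch {
    record Compute() {}                                   // request sent to a data-node actor
    record Partial(double sum, long count) {}             // reply sent to the collector

    /** A minimal actor: a private mailbox drained by its own thread, state kept local. */
    static class DataNodeActor {
        private final BlockingQueue<Compute> mailbox = new LinkedBlockingQueue<>();
        private final double[] localPartition;

        DataNodeActor(double[] localPartition, BlockingQueue<Partial> collector) {
            this.localPartition = localPartition;
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        mailbox.take();                    // wait for a Compute message
                        double sum = 0;
                        for (double v : localPartition) sum += v;   // compute where the data lives
                        collector.add(new Partial(sum, localPartition.length));
                    }
                } catch (InterruptedException ignored) { }
            });
            t.setDaemon(true);
            t.start();
        }

        void tell(Compute msg) { mailbox.add(msg); }
    }

    public static void main(String[] args) throws Exception {
        List<double[]> partitions = List.of(
                new double[]{1, 2, 3}, new double[]{10, 20}, new double[]{5});
        BlockingQueue<Partial> collectorMailbox = new LinkedBlockingQueue<>();

        for (double[] p : partitions) new DataNodeActor(p, collectorMailbox).tell(new Compute());

        double sum = 0; long count = 0;                    // collector merges the partial results
        for (int i = 0; i < partitions.size(); i++) {
            Partial part = collectorMailbox.take();
            sum += part.sum(); count += part.count();
        }
        System.out.println("global mean = " + sum / count);
    }
}
```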

