Big data as a challenge for man and statistics

2015 ◽  
Vol 60 (8) ◽  
pp. 1-11
Author(s):  
Mirosław Szreder

The phenomenon of "big data", understood as the collection and processing of large data sets in order to extract new knowledge from them, develops independently of the will of individuals and societies. The driving force behind this development is, on the one hand, rapid technological progress in the field of IT and, on the other, the desire of many organizations to gain access to the knowledge accumulated in ever larger electronic databases of Internet, Facebook, or Twitter users. This article addresses the challenge that this phenomenon poses for people and for statistics, whose methodology may prove less adequate under these conditions. The author argues that, with respect to the protection of individuals and of a society deprived of privacy and anonymity, technological progress raises previously unknown threats. Just as statisticians' analytical work can hardly keep up with the possibilities offered by "big data", so the protection of human rights is merely a belated response to the dynamic world of electronic data.

Author(s):  
A. Sheik Abdullah ◽  
R. Suganya ◽  
S. Selvakumar ◽  
S. Rajaram

Classification is considered to be one of the data analysis techniques that can be used in many applications. A classification model predicts categorical (discrete) class labels, whereas clustering mainly deals with grouping observations based on similar characteristics. Classification models are evaluated by comparing the predicted values with the known target values in a set of test data. Data classification has many applications in business modeling, marketing analysis, credit risk analysis, biomedical engineering, and drug response modeling. The extension of data analysis and classification provides insight into big data, with an exploration of processing and managing large data sets. This chapter deals with various techniques and methodologies that correspond to the classification problem in the data analysis process and their methodological impacts on big data.
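As a minimal illustration of how a classification model is evaluated by comparing predicted labels with the known target values of a test set, the following sketch trains a simple classifier on a held-out split; the dataset (Iris) and the choice of a decision tree are assumptions made for illustration, not taken from the chapter.

```python
# Minimal sketch: evaluating a classification model on held-out test data.
# The dataset (Iris) and the decision-tree classifier are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Compare predicted class labels with the known target values of the test set.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```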


2021 ◽  
Author(s):  
Kristia M. Pavlakos

Big Data¹ is a phenomenon that has been increasingly studied in the academy in recent years, especially in technological and scientific contexts. However, it is still a relatively new field of academic study; because it has been previously considered in mainly technological contexts, more attention needs to be drawn to the contributions made in Big Data scholarship in the social sciences by scholars like Omar Tene and Jules Polonetsky, Bart Custers, Kate Crawford, Nick Couldry, and Jose van Dijk. The purpose of this Major Research Paper is to gain insight into the issues surrounding privacy and user rights, roles, and commodification in relation to Big Data in a social sciences context. The term "Big Data" describes the collection, aggregation, and analysis of large data sets. While corporations are usually responsible for the analysis and dissemination of the data, most of this data is user generated, and there must be considerations regarding the user's rights and roles. In this paper, I raise three main issues that shape the discussion: how users can be more active agents in data ownership, how consent measures can be made to actively reflect user interests instead of focusing on benefitting corporations, and how user agency can be preserved. Through an analysis of social sciences scholarly literature on Big Data, privacy, and user commodification, I wish to determine how these concepts are being discussed, where there have been advancements in privacy regulation and the prevention of user commodification, and where there is a need to improve these measures. In doing this, I hope to discover a way to better facilitate the relationship between data collectors and analysts, and user-generators.

¹ While there is no definitive resolution as to whether or not to capitalize the term "Big Data", in capitalizing it I chose to conform with such authors as boyd and Crawford (2012), Couldry and Turow (2014), and Dalton and Thatcher (2015), who do so in the scholarly literature.


Author(s):  
Saranya N. ◽  
Saravana Selvam

After an era of managing data collection difficulties, the issue these days has turned into the problem of how to process these vast amounts of information. Scientists as well as researchers think that today probably the most essential topic in computing science is Big Data. Big Data is used to describe the huge volume of data that can exist in any structure, which makes it difficult for standard approaches to mine the relevant data from such large data sets. Classification in Big Data is a procedure for summarizing data sets based on various examples, and there are distinct classification frameworks which help us to classify data collections. A few methods discussed in the chapter are Multi-Layer Perceptron, Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The target of this chapter is to provide a comprehensive evaluation of the classification methods that are currently in common use.
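As a rough illustration of how several of the classifiers named above can be evaluated side by side, the sketch below cross-validates a few of them with scikit-learn; the dataset, hyperparameters, and preprocessing choices are assumptions made for the example only.

```python
# Minimal sketch: comparing a few classifiers named in the chapter
# (Random Forest, KNN, SVM, and a CART-style decision tree).
# Dataset, hyperparameters, and preprocessing are illustrative assumptions.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "CART decision tree": DecisionTreeClassifier(random_state=0),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```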


Author(s):  
B. K. Tripathy ◽  
Hari Seetha ◽  
M. N. Murty

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed in this direction. These include fuzzy c-means, rough c-means, and intuitionistic fuzzy c-means, as well as algorithms based on hybrid models such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions, and possibilistic kernelised versions. However, none of the above algorithms is effective on big data, for various reasons, so researchers have been trying for the past few years to improve them so that they can be applied to cluster big data. Such algorithms are still relatively few in comparison to those for data sets of reasonable size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms which can be developed further.
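For reference, the sketch below implements standard fuzzy c-means in plain NumPy; it is the basic algorithm that the rough, intuitionistic, and hybrid variants discussed in the chapter extend, and the synthetic data, fuzzifier m, and iteration count are assumptions made for the example.

```python
# Minimal sketch of standard fuzzy c-means (NumPy only). The synthetic data,
# fuzzifier m, and fixed iteration count are illustrative assumptions; the
# chapter's rough/intuitionistic/hybrid variants build on this basic scheme.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two blobs

c, m, n_iter = 2, 2.0, 100                      # clusters, fuzzifier, iterations
U = rng.random((X.shape[0], c))
U /= U.sum(axis=1, keepdims=True)               # random initial memberships

for _ in range(n_iter):
    Um = U ** m
    centers = (Um.T @ X) / Um.sum(axis=0)[:, None]             # weighted means
    dist = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-10
    # u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2 / (m - 1))
    U = 1.0 / ratio.sum(axis=2)

print("Cluster centers:\n", centers)
```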


Big Data ◽  
2016 ◽  
pp. 2249-2274
Author(s):  
Chinh Nguyen ◽  
Rosemary Stockdale ◽  
Helana Scheepers ◽  
Jason Sargent

The rapid development of technology and the interactive nature of Government 2.0 (Gov 2.0) are generating large data sets for government, resulting in a struggle to control, manage, and extract the right information. Therefore, research into these large data sets (termed Big Data) has become necessary. Governments are now spending significant sums on storing and processing vast amounts of information because of the huge proliferation and complexity of Big Data and a lack of effective records management. On the other hand, there is an established method, Electronic Records Management (ERM), for controlling and governing the important data of an organisation. This paper investigates the challenges identified from reviewing the literature on Gov 2.0, Big Data, and ERM in order to develop a better understanding of how ERM can be applied to Big Data to extract useable information in the context of Gov 2.0. The paper suggests that a key building block in providing useable information to stakeholders could potentially be ERM with its well-established governance policies. A framework is constructed to illustrate how ERM can play a role in the context of Gov 2.0. Future research is necessary to address the specific constraints and expectations placed on governments in terms of data retention and use.


2020 ◽  
pp. 1826-1838
Author(s):  
Rojalina Priyadarshini ◽  
Rabindra K. Barik ◽  
Chhabi Panigrahi ◽  
Harishchandra Dubey ◽  
Brojo Kishore Mishra

This article describes how machine learning (ML) algorithms are very useful for analysing data and extracting meaningful information from it, which can then be used in various other applications. In the last few years, explosive growth has been seen in the dimension and structure of data, and conventional ML algorithms face several difficulties when dealing with such highly voluminous and unstructured big data. Modern ML tools are designed to deal with all sorts of complexities in the data. Deep learning (DL) is one of these modern tools, commonly used to find the hidden structure and cohesion in large data sets; with proper training on parallel platforms and intelligent optimization techniques, it can further analyse and interpret the data for future prediction and classification. This article focuses on the DL tools and software that have been used in the past couple of years in various areas, especially in healthcare applications.
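As one concrete example of the kind of DL tooling surveyed, the sketch below trains a small feed-forward network with Keras on synthetic tabular data; the data, architecture, and hyperparameters are illustrative assumptions and are not taken from the article.

```python
# Minimal sketch: a small feed-forward deep network trained with Keras.
# The synthetic tabular data, architecture, and hyperparameters are
# illustrative assumptions, not taken from the article.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")      # 20 numeric features
y = (X[:, :5].sum(axis=1) > 0).astype("float32")       # synthetic binary label

model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),              # binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)

loss, acc = model.evaluate(X, y, verbose=0)
print(f"Accuracy on the training data: {acc:.3f}")
```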


2017 ◽  
pp. 83-99
Author(s):  
Sivamathi Chokkalingam ◽  
Vijayarani S.

The term Big Data refers to large-scale information management and analysis technologies that exceed the capability of traditional data processing technologies. Big Data is differentiated from traditional technologies in three ways: the volume, velocity, and variety of the data. Big data analytics is the process of analyzing large data sets containing a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. Since Big Data is a newly emerging field, there is a need for the development of new technologies and algorithms for handling big data. The main objective of this paper is to provide knowledge about the various research challenges of Big Data analytics. A brief overview of the various types of Big Data analytics is given; for each type, the paper describes its process steps and tools and gives a banking application. Some of the research challenges of big data analytics, and possible solutions to them, are also discussed.


2016 ◽  
pp. 1220-1243
Author(s):  
Ilias K. Savvas ◽  
Georgia N. Sofianidou ◽  
M-Tahar Kechadi

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for the distributed processing of large data sets; HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique prove its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
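To make the map/reduce formulation concrete, here is a minimal single-machine sketch of K-means expressed as map and reduce steps in plain Python; it only mimics the MapReduce programming model and is not the authors' Hadoop/HDFS implementation, and the data and number of clusters are assumptions.

```python
# Minimal sketch of K-means in MapReduce style (plain Python, single machine).
# It imitates the map / shuffle / reduce phases only; the authors' actual
# version runs on Hadoop/HDFS. Data and k are illustrative assumptions.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2)) + rng.choice([0.0, 6.0], size=(200, 1))
centers = points[rng.choice(len(points), size=2, replace=False)]    # k = 2

def map_phase(point, centers):
    """Map: emit (nearest-center id, (point, 1)) for one input record."""
    cid = int(np.argmin(np.linalg.norm(centers - point, axis=1)))
    return cid, (point, 1)

def reduce_phase(values):
    """Reduce: combine the partial sums for one center id into a new centroid."""
    total = sum(p for p, _ in values)
    count = sum(n for _, n in values)
    return total / count

for _ in range(10):                        # a few iterations for illustration
    grouped = defaultdict(list)
    for p in points:                       # "map" over the input split
        cid, value = map_phase(p, centers)
        grouped[cid].append(value)         # shuffle/sort: group values by key
    centers = np.array([reduce_phase(grouped[cid]) for cid in sorted(grouped)])

print("Final centers:\n", centers)
```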


2014 ◽  
Vol 931-932 ◽  
pp. 1353-1359
Author(s):  
Sutheetutt Vacharaskunee ◽  
Sarun Intakosum

Processing of large data sets, known today as big data processing, is still a problem without a well-defined solution. The data can be both structured and unstructured. For the structured part, eXtensible Markup Language (XML) is a major tool that freely allows document owners to describe and organize their data using their own markup tags. One major problem behind this freedom, however, lies in the big data retrieval process: the same or similar information described using different tags or different structures may not be retrieved if the query statements contain keywords different from those used in the markup tags. The best way to solve this problem is to specify a standard set of markup tags for each problem domain. Creating such a standard set manually requires a lot of hard work and is time consuming; in addition, it may be hard to define terms that are acceptable to everyone. This research proposes a model for a new technique, XML Tag Recommendation (XTR), that aims to solve this problem. The technique applies the idea of Case-Based Reasoning (CBR) by collecting the most used tags in each domain as a case. These tags come from collections of related words in WordNet, and WordCount, a web site for finding the frequency of words, is applied to choose the most used one. The input (problem) to the XTR system is an XML document containing the tags specified by the document owner; the solution is a set of recommended tags, the most used ones, for the problem domain of the document. Document owners are free to change or not change the tags in their documents and can provide feedback to the XTR system.
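The sketch below illustrates only the core idea of tag recommendation (not the XTR system itself): map each tag in an XML document to its most frequently used WordNet synonym. The tiny sample document, the NLTK WordNet interface, and the hard-coded frequency table standing in for the WordCount web site are all illustrative assumptions.

```python
# Minimal sketch of the core idea behind XML tag recommendation: replace each
# tag with the most-used word among its WordNet synonyms. The sample document,
# the NLTK WordNet interface, and the hard-coded frequency table (a stand-in
# for the WordCount web site) are illustrative assumptions, not the XTR system.
import xml.etree.ElementTree as ET
from nltk.corpus import wordnet as wn    # requires: nltk.download("wordnet")

# Stand-in for a word-frequency lookup such as WordCount.
word_frequency = {"car": 95_000, "auto": 30_000, "automobile": 12_000,
                  "price": 120_000, "cost": 80_000, "monetary_value": 1_000}

def recommend_tag(tag):
    """Return the most frequently used synonym of `tag` found in WordNet."""
    synonyms = {tag}
    for synset in wn.synsets(tag):
        synonyms.update(lemma.lower() for lemma in synset.lemma_names())
    return max(synonyms, key=lambda w: word_frequency.get(w, 0))

doc = ET.fromstring("<automobile><cost>12000</cost></automobile>")
for element in doc.iter():
    element.tag = recommend_tag(element.tag)

print(ET.tostring(doc, encoding="unicode"))   # e.g. <car><price>12000</price></car>
```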

