Introduction to Data Science

Author(s):  
M. Govindarajan

This chapter introduces the field of data science. Data science is the area of study that involves extracting insights from vast amounts of data using scientific methods, algorithms, and processes. The term emerged from the evolution of mathematical statistics, data analysis, and big data. Data science helps to discover hidden patterns in raw data. It makes it possible to translate a business problem into a research project and then translate the results back into a practical solution. The purpose of this chapter is to emphasize the integration and synthesis of concepts, techniques, applications, and tools that deal with various facets of data science practice, including data collection and integration, exploratory data analysis, predictive modeling, descriptive modeling, data product creation, evaluation, and effective communication.

Sensors ◽  
2019 ◽  
Vol 19 (12) ◽  
pp. 2772 ◽  
Author(s):  
Aguinaldo Bezerra ◽  
Ivanovitch Silva ◽  
Luiz Affonso Guedes ◽  
Diego Silva ◽  
Gustavo Leitão ◽  
...  

Alarm and event logs are an immense but latent source of knowledge commonly undervalued in industry. However, the current landscape of massive data exchange, high efficiency, and strong competitiveness, boosted by the Industry 4.0 and IIoT (Industrial Internet of Things) paradigms, does not accommodate such misuse of data and demands more incisive approaches to analyzing industrial data. Advances in Data Science and Big Data (or, more precisely, Industrial Big Data) have enabled novel approaches to data analysis that can be great allies in extracting hitherto hidden information from plant operation data. In response, this work proposes the use of Exploratory Data Analysis (EDA) as a promising data-driven approach to industrial alarm and event analysis. This approach proved able to increase industrial perception by extracting insights and valuable information from real-world industrial data without making prior assumptions.


2021 ◽  
Author(s):  
Gunnar Carlsson ◽  
Mikael Vejdemo-Johansson

The continued and dramatic rise in the size of data sets has meant that new methods are required to model and analyze them. This timely account introduces topological data analysis (TDA), a method for modeling data by geometric objects, namely graphs and their higher-dimensional versions: simplicial complexes. The authors outline the necessary background material on topology and data philosophy for newcomers, while more complex concepts are highlighted for advanced learners. The book covers all the main TDA techniques, including persistent homology, cohomology, and Mapper. The final section focuses on the diverse applications of TDA, examining a number of case studies drawn from monitoring the progression of infectious diseases to the study of motion capture data. Mathematicians moving into data science, as well as data scientists or computer scientists seeking to understand this new area, will appreciate this self-contained resource which explains the underlying technology and how it can be used.


2021 ◽  
pp. 379-402
Author(s):  
Clément Gautrais ◽  
Yann Dauxais ◽  
Stefano Teso ◽  
Samuel Kolb ◽  
Gust Verbruggen ◽  
...  

Everybody wants to analyse their data, but only a few possess the data science expertise to do so. Motivated by this observation, we introduce VisualSynth, a novel framework and system for human-machine collaboration in data science. Its aim is to democratize data science by allowing users to interact with standard spreadsheet software in order to perform and automate various data analysis tasks, including data wrangling, data selection, clustering, constraint learning, predictive modeling, and auto-completion. VisualSynth relies on the user providing colored sketches, i.e., coloring parts of the spreadsheet, to partially specify data science tasks, which are then determined and executed using artificial intelligence techniques.


Author(s):  
Meenakshi Bhrugubanda ◽  
A.V. L. Prasuna

Cyber security refers to a body of technologies, processes, and practices designed to prevent attack, damage, or unauthorized access to networks, devices, programs, and data. It can also be referred to as information security. Users must understand and follow the basic principles of information security, such as caution with email attachments, strong passwords, and data backup. Implementing effective cyber security measures is particularly challenging today because there are more devices than people and attackers are becoming more innovative. Data science is a subset of AI, and it refers more to the overlapping areas of statistics, scientific methods, and data analysis, all of which are used to extract meaning and insights from data. This paper analyses the role of data science in cyber security.


Author(s):  
Sylvain Lespinats ◽  
Michaël Aupetit ◽  
Anke Meyer-Baese

Multidimensional scaling techniques are unsupervised Dimension Reduction (DR) techniques which use multidimensional data pairwise similarities to represent data into a plane enabling their visual exploratory analysis. Considering labeled data, the DR techniques face two objectives with potentially different priorities: one is to account for the data points' similarities, the other for the data classes' structures. Unsupervised DR techniques attempt to preserve original data similarities, but they do not consider their class label hence they can map originally separated classes as overlapping ones. Conversely, the state-of-the-art so-called supervised DR techniques naturally handle labeled data, but they do so in a predictive modeling framework where they attempt to separate the classes in order to improve a classification accuracy measure in the low-dimensional space, hence they can map as separated even originally overlapping classes. We propose ClassiMap, a DR technique which optimizes a new objective function enabling Exploratory Data Analysis (EDA) of labeled data. Mapping distortions known as tears and false neighborhoods cannot be avoided in general due to the reduction of the data dimension. ClassiMap intends primarily to preserve data similarities but tends to distribute preferentially unavoidable tears among the different-label data and unavoidable false neighbors among the same-label data. Standard quality measures to evaluate the quality of unsupervised mappings cannot tell about the preservation of within-class or between-class structures, while classification accuracy used to evaluate supervised mappings is only relevant to the framework of predictive modeling. We propose two measures better suited to the evaluation of DR of labeled data in an EDA framework. 
We use these two label-aware indices and four other standard unsupervised indices to compare ClassiMap to other state-of-the-art supervised and unsupervised DR techniques on synthetic and real datasets. ClassiMap appears to provide a better tradeoff between pairwise similarities and class structure preservation according to these new measures.
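
As a rough illustration of the general idea behind multidimensional scaling mentioned above (using pairwise dissimilarities to place points in a plane), the following is a minimal sketch of classical (Torgerson) MDS. It is not ClassiMap and does not use class labels; the function names `classical_mds` and `pairwise_distances` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(distances, n_components=2):
    """Classical (Torgerson) MDS from a symmetric distance matrix.

    Hypothetical helper for illustration only; not the ClassiMap objective.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Synthetic data (an assumption): 20 points that lie on a 2-D plane
# rotated into a 5-D space, so a 2-D embedding can preserve all distances.
rng = np.random.default_rng(0)
planar = rng.normal(size=(20, 2))
padded = np.hstack([planar, np.zeros((20, 3))])
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
high_dim = padded @ rotation

original_distances = pairwise_distances(high_dim)
embedding = classical_mds(original_distances, n_components=2)
```

Because the synthetic points are intrinsically two-dimensional, the embedding reproduces the original pairwise distances up to numerical error. With genuinely high-dimensional data some distortion is unavoidable, which is the setting in which the abstract's tears and false neighborhoods arise.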


2019 ◽  
Vol 27 (3) ◽  
pp. 233-251
Author(s):  
Sabeena Jalal ◽  
Marshall E Lloyd ◽  
Faisal Khosa ◽  
Grace I-Hsuan Hsu ◽  
Savvas Nicolaou

2019 ◽  
Vol 631 ◽  
pp. A116 ◽  
Author(s):  
P. Giommi ◽  
C. H. Brandt ◽  
U. Barres de Almeida ◽  
A. M. T. Pollock ◽  
F. Arneodo ◽  
...  

Aims. Open Universe for Blazars is a set of high-transparency multi-frequency data products for blazar science, and the tools designed to generate them. Blazars are drawing growing interest following the consolidation of their position as the most abundant type of source in the extragalactic very high-energy γ-ray sky, and because of their status as prime candidate sources in the nascent field of multi-messenger astrophysics. As such, blazar astrophysics is becoming increasingly data driven, depending on the integration and combined analysis of large quantities of data from the entire span of observational astrophysics techniques. The project was therefore chosen as one of the pilot activities within the United Nations Open Universe Initiative, whose objective is to stimulate a large increase in the accessibility and ease of utilisation of space science data for the worldwide benefit of scientific research, education, capacity building, and citizen science. Methods. Our aim is to deliver innovative data science tools for multi-messenger astrophysics. In this work we report on a data analysis pipeline called Swift-DeepSky based on the Swift XRTDAS software and the XIMAGE package, encapsulated into a Docker container. Swift-DeepSky downloads and reads low-level data, generates higher level products, detects X-ray sources, and estimates several intensity and spectral parameters for each detection, thus facilitating the generation of complete and up-to-date science-ready catalogues from an entire space-mission data set. Results. As a first application of our innovative approach, we present the results of a detailed X-ray image analysis based on Swift-DeepSky that was run on all Swift-XRT observations including a known blazar, carried out during the first 14 years of operations of the Neil Gehrels Swift Observatory. 
Short exposures executed within one week of each other have been added to increase sensitivity, which ranges between ∼1 × 10^−12 and ∼1 × 10^−14 erg cm^−2 s^−1 (0.3–10.0 keV). After cleaning for problematic fields, the resulting database includes over 27 000 images integrated in different X-ray bands, and a catalogue, called 1OUSXB, that provides intensity and spectral information for 33 396 X-ray sources, 8896 of which are single or multiple detections of 2308 distinct blazars. All the results can be accessed online in a variety of ways, from the Open Universe portal through Virtual Observatory services, via the VOU-Blazar tool and the SSDC SED builder. One of the most innovative aspects of this work is that the results can be easily reproduced and extended by anyone using the Docker version of the Swift-DeepSky pipeline, which runs on Linux, Mac, and Windows machines, and does not require any specific experience in X-ray data analysis.

