The nonequilibrium quantum many-body problem as a paradigm for extreme data science

2014 ◽  
Vol 28 (31) ◽  
pp. 1430021 ◽  
Author(s):  
J. K. Freericks ◽  
B. K. Nikolić ◽  
O. Frieder

Generating big data pervades much of physics. But some problems, which we call extreme data problems, are too large to be treated within big data science. The nonequilibrium quantum many-body problem on a lattice is just such a problem, where the Hilbert space grows exponentially with system size and rapidly becomes too large to fit on any computer (and can be effectively thought of as an infinite-sized data set). Nevertheless, much progress has been made on this problem with computational methods, which serve as a paradigm for how one can approach and attack extreme data problems. In addition, viewing these physics problems from a computer-science perspective leads to new approaches that can be tried to solve them more accurately and for longer times. We review a number of these different ideas here.
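
As a minimal illustration of the scale the abstract describes (not from the paper itself): for a lattice of spin-1/2 sites the Hilbert-space dimension is 2^N, so even storing a single state vector is hopeless beyond a few dozen sites.

```python
# Minimal sketch: the Hilbert-space dimension of a spin-1/2 lattice grows as 2^N,
# so a single state vector quickly exceeds any conceivable memory.
for n_sites in (10, 30, 50, 100, 300):
    dim = 2 ** n_sites
    gb = 16 * dim / 1e9  # one complex double (16 bytes) per amplitude
    print(f"{n_sites:>4} sites: dim = 2^{n_sites} ≈ {dim:.3e}, state vector ≈ {gb:.3e} GB")
```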

Author(s):  
Yang-Hui He

Calabi-Yau spaces, or Kähler spaces admitting zero Ricci curvature, have played a pivotal role in theoretical physics and pure mathematics for the last half century. In physics, they constituted the first and natural solution to compactification of superstring theory to our 4-dimensional universe, primarily due to one of their equivalent definitions being the admittance of covariantly constant spinors. Since the mid-1980s, physicists and mathematicians have joined forces in creating explicit examples of Calabi-Yau spaces, compiling databases of formidable size, including the complete intersection (CICY) data set, the weighted hypersurfaces data set, the elliptic-fibration data set, the Kreuzer-Skarke toric hypersurface data set, generalized CICYs, etc., totaling at least on the order of 10^10 manifolds. These all contribute to the vast string landscape, the multitude of possible vacuum solutions to string compactification. More recently, this collaboration has been enriched by computer science and data science, the former in benchmarking the complexity of the algorithms in computing geometric quantities, and the latter in applying techniques such as machine learning to extract unexpected information. These endeavours, inspired by the physics of the string landscape, have rendered the investigation of Calabi-Yau spaces one of the most exciting and interdisciplinary fields.
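
To make the data-science angle concrete, here is a hedged sketch of the kind of machine-learning experiment described (the configuration matrices and Hodge-number targets below are random stand-ins, not the actual CICY data):

```python
# Hedged sketch of ML on Calabi-Yau data: flatten each configuration matrix into a
# feature vector and fit a regressor to a topological invariant such as h^{1,1}.
# The data below are random placeholders; real inputs would come from the CICY set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(1000, 12 * 15)).astype(float)  # flattened 12x15 matrices
y = rng.integers(0, 20, size=1000).astype(float)            # placeholder Hodge numbers

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))  # meaningless on random stand-ins
```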


2021 ◽  
Vol 9 ◽  
Author(s):  
Andrea Rau

Data collected in very large quantities are called big data, and big data has changed the way we think about and answer questions in many different fields, like weather forecasting and biology. With all this information available, we need computers to help us store, process, analyze, and understand it. Data science combines tools from fields like statistics, mathematics, and computer science to find interesting patterns in big data. Data scientists write step-by-step instructions called algorithms to teach computers how to learn from data. To help computers understand these instructions, algorithms must be translated from the original question asked by a data scientist into a programming language—and the results must be translated back, so that humans can understand them. That means that data scientists are data detectives, programmers, and translators all in one!
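
For instance, a tiny "algorithm" of the kind the article describes, written in Python (an illustrative example, not from the article):

```python
# Step-by-step instructions (an "algorithm") translated into Python: find the most
# common word in a text, then translate the answer back into a human sentence.
from collections import Counter

text = "big data changes how we ask and answer big questions about big data"
counts = Counter(text.split())       # step 1: count every word
word, n = counts.most_common(1)[0]   # step 2: pick the most frequent one
print(f"The most common word is '{word}'; it appears {n} times.")
```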


Author(s):  
Volodymyr Sokol ◽  
Vitalii Krykun ◽  
Mariia Bilova ◽  
Ivan Perepelytsya ◽  
Volodymyr Pustovarov ◽  
...  

The demand for the creation of information systems that simplify and accelerate work has greatly increased in the context of the rapid informatization of society and all its branches. This provokes the emergence of more and more companies involved in the development of software products and information systems in general. In order to ensure the systematization, processing and use of this knowledge, knowledge management systems are used. One of the main tasks of IT companies is continuous training of personnel, which requires exporting content from the company's knowledge management system to the learning management system. The main goal of the research is to choose an algorithm that solves the problem of marking up the text of articles close to those used in the knowledge management systems of IT companies. To achieve this goal, it is necessary to compare various topic segmentation methods on a dataset of computer science texts. Inspec is one such dataset, originally used for keyword extraction, and in this research it has been adapted to the structure of the datasets used for the topic segmentation problem. The TextTiling and TextSeg methods were compared on some well-known data science metrics and on specific metrics that relate to the topic segmentation problem. A new generalized metric was also introduced to compare the results for the topic segmentation problem. All software implementations of the algorithms were written in the Python programming language and represent a set of interrelated functions. The results show the advantages of the TextSeg method in comparison with TextTiling when compared using classical data science metrics and special metrics developed for the topic segmentation task. From all the metrics, including the introduced one, it can be concluded that the TextSeg algorithm performs better than the TextTiling algorithm on the adapted Inspec test data set.
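
For orientation, a minimal TextTiling run using NLTK's implementation (an assumed setup, not the authors' exact configuration; TextSeg and the evaluation metrics would slot into the same loop):

```python
# Minimal TextTiling sketch with NLTK; requires `pip install nltk` and
# nltk.download('stopwords'). The resulting boundaries would then be scored
# against the adapted Inspec ground truth with segmentation metrics.
from nltk.tokenize import TextTilingTokenizer

# TextTiling expects a reasonably long text with blank lines between paragraphs.
para_a = "Knowledge management systems collect the articles an IT company writes. " * 20
para_b = "Topic segmentation places boundaries where the vocabulary shifts. " * 20
document = para_a + "\n\n" + para_b

segments = TextTilingTokenizer().tokenize(document)
print(f"{len(segments)} topical segments found")
```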


2017 ◽  
Vol 13 (02) ◽  
pp. 119-143 ◽  
Author(s):  
Claude E. Concolato ◽  
Li M. Chen

As an emergent field of inquiry, Data Science serves both the information technology world and the applied sciences. Data Science is a known term that tends to be treated as synonymous with the term Big-Data; however, Data Science is the application of solutions found through mathematical and computational research, while Big-Data Science describes problems concerning the analysis of data with respect to volume, variation, and velocity (3V). Even though little theory has yet been developed for Data Science from a scientific perspective, there is still great opportunity for tremendous growth. Data Science is proving to be of paramount importance to the IT industry due to the increased need for understanding the immense amount of data being produced and in need of analysis. In short, data is everywhere and comes in various formats. Scientists are currently using statistical and AI analysis techniques, like machine learning methods, to understand massive sets of data, and naturally, they attempt to find relationships among datasets. In the past 10 years, the development of software systems within the cloud computing paradigm, using tools like Hadoop and Apache Spark, has aided in making tremendous advances to Data Science as a discipline [Z. Sun, L. Sun and K. Strang, Big data analytics services for enhancing business intelligence, Journal of Computer Information Systems (2016), doi: 10.1080/08874417.2016.1220239]. These advances have enabled both scientists and IT professionals to use cloud computing infrastructure to process petabytes of data on a daily basis. This is especially true for large private companies such as Walmart, Nvidia, and Google. This paper seeks to address pragmatic ways of looking at how Data Science (with respect to Big-Data Science) is practiced in the modern world. We also examine how mathematics and computer science help shape Big-Data Science’s terrain. We will highlight how mathematics and computer science have significantly impacted the development of Data Science approaches and tools, and how those approaches pose new questions that can drive new research areas within these core disciplines, involving data analysis, machine learning, and visualization.
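
As a hedged sketch of the Spark-style workflow the paper refers to (the file path and column names below are placeholders, not a real data set):

```python
# Illustrative PySpark aggregation: read a (hypothetical) CSV, group, and summarize.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # hypothetical file
summary = (df.groupBy("region")                                  # hypothetical column
             .agg(F.sum("revenue").alias("total_revenue"))
             .orderBy(F.desc("total_revenue")))
summary.show()
spark.stop()
```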


2021 ◽  
Author(s):  
Alex Luscombe ◽  
Jamie Duncan ◽  
Kevin Walby

Computational methods are increasingly popular in criminal justice research. As more criminal justice data becomes available in big data and other digital formats, new means of embracing the computational turn are needed. In this article, we propose a framework for data collection and case sampling using computational methods, allowing researchers to conduct thick qualitative research – analyses concerned with the particularities of a social context or phenomenon – starting from big data, which is typically associated with thinner quantitative methods and the pursuit of generalizable findings. The approach begins by using open-source web scraping algorithms to collect content from a target website, online database, or comparable online source. Next, researchers use computational techniques from the field of natural language processing to explore themes and patterns in the larger data set. Based on these initial explorations, researchers algorithmically generate a subset of data for in-depth qualitative analysis. In this computationally driven process of data collection and case sampling, the larger corpus and subset are never entirely divorced, a feature we argue has implications for traditional qualitative research techniques and tenets. To illustrate this approach, we collect, subset, and analyze three years of news releases from the Royal Canadian Mounted Police website (N = 13,637) using a mix of web scraping, natural language processing, and visual discourse analysis. To enhance the pedagogical value of our intervention and facilitate replication and secondary analysis, we make all data and code available online in the form of a detailed, step-by-step tutorial.
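
A hedged sketch of the scrape-explore-subset workflow (the URL, CSS selector, and keyword are illustrative placeholders, not the authors' actual pipeline):

```python
# 1. Collect, 2. Explore, 3. Subset: the computational sampling pattern described.
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Collect: scrape release texts from a listing page (hypothetical selector).
html = requests.get("https://example.org/news-releases", timeout=30).text
releases = [d.get_text(strip=True)
            for d in BeautifulSoup(html, "html.parser").select("div.release")]

# 2. Explore: surface the corpus's most characteristic terms with TF-IDF.
vec = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vec.fit_transform(releases)
top = sorted(zip(vec.get_feature_names_out(), tfidf.sum(axis=0).A1),
             key=lambda t: -t[1])[:20]
print(top)

# 3. Subset: algorithmically pull the cases touching a theme of interest.
subset = [r for r in releases if "firearm" in r.lower()]
```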


2018 ◽  
Vol 4 (3) ◽  
pp. 205630511878450 ◽  
Author(s):  
Annette N Markham ◽  
Katrin Tiidenberg ◽  
Andrew Herman

This is an introduction to the special issue of “Ethics as Methods: Doing Ethics in the Era of Big Data Research.” Building on a variety of theoretical paradigms (i.e., critical theory, [new] materialism, feminist ethics, theory of cultural techniques) and frameworks (i.e., contextual integrity, deflationary perspective, ethics of care), the Special Issue contributes specific cases and fine-grained conceptual distinctions to ongoing discussions about ethics in data-driven research. In the second decade of the 21st century, a grand narrative is emerging that posits knowledge derived from data analytics as true, because of the objective qualities of data, their means of collection and analysis, and the sheer size of the data set. The by-product of this grand narrative is that the qualitative aspects of behavior and experience that form the data are diminished, and the human is removed from the process of analysis. This situates data science as a process of analysis performed by the tool, which obscures human decisions in the process. The scholars involved in this Special Issue problematize the assumptions and trends in big data research and point out the crisis in accountability that emerges from using such data to make societal interventions. Our collaborators offer a range of answers to the question of how to configure ethics through a methodological framework in the context of the prevalence of big data, neural networks, and automated, algorithmic governance of much of human socia(bi)lity.


Author(s):  
Li Chen ◽  
Lala Aicha Coulibaly

Data science and big data analytics are still at the center of computer science and information technology. Students and researchers outside computer science often find real data analytics difficult when using programming languages such as Python and Scala, especially when they attempt to use Apache Spark in cloud computing environments (Spark with Scala, and PySpark). At the same time, students in information technology can find it difficult to deal with the mathematical background of data science algorithms. To overcome these difficulties, this chapter will provide a practical guideline for different users in this area. The authors cover the main algorithms for data science and machine learning, including principal component analysis (PCA), support vector machine (SVM), k-means, k-nearest neighbors (kNN), regression, neural networks, and decision trees. A brief description of these algorithms will be given, and the related code will be selected to fit both simple data sets and real data sets. Some visualization methods, including 2D and 3D displays, will also be presented in this chapter.
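
In the chapter's spirit, a minimal self-contained example combining two of the listed algorithms, PCA and k-means, on a simple built-in data set:

```python
# PCA projects the 4-feature iris data to 2D; k-means then clusters the projection.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X2 = PCA(n_components=2).fit_transform(X)                       # 4 features -> 2D
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)

plt.scatter(X2[:, 0], X2[:, 1], c=labels)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title("k-means clusters in PCA space")
plt.show()
```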


2016 ◽  
Vol 311 (4) ◽  
pp. F787-F792 ◽  
Author(s):  
Yue Zhao ◽  
Chin-Rang Yang ◽  
Viswanathan Raghuram ◽  
Jaya Parulekar ◽  
Mark A. Knepper

Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: “How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?” This is the type of problem that has motivated the “Big-Data” revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/.
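
This is not BIG's actual interface, but the underlying pattern (fan one gene query out over several indexed data sets and gather every hit) can be sketched as follows, with hypothetical tables and values:

```python
# Hedged sketch of the BIG pattern: one query, answered across all indexed data sets.
import pandas as pd

# Hypothetical stand-ins for the indexed proteomic/transcriptomic databases.
datasets = {
    "proteome": pd.DataFrame({"gene": ["AQP2", "AVPR2"], "abundance": [9.1, 4.3]}),
    "transcriptome": pd.DataFrame({"gene": ["AQP2", "SLC14A2"], "tpm": [120.0, 35.5]}),
}

def big_query(gene):
    """Return every row mentioning `gene` across all indexed data sets."""
    return {name: df[df["gene"] == gene] for name, df in datasets.items()}

for name, hits in big_query("AQP2").items():
    print(f"-- {name} --\n{hits}\n")
```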


2019 ◽  
Vol 15 (2) ◽  
pp. 97-109 ◽  
Author(s):  
Jennifer Lewis Priestley ◽  
Robert J. McGrath

Is data science a new field of study or simply an extension or specialization of a discipline that already exists, such as statistics, computer science, or mathematics? This article explores the evolution of data science as a potentially new academic discipline, which has evolved as a function of new problem sets that established disciplines have been ill-prepared to address. The authors find that this newly evolved discipline can be viewed through the lens of a new mode of knowledge production and is characterized by transdisciplinary collaboration with the private sector and increased accountability. Lessons from this evolution can inform knowledge production in other traditional academic disciplines as well as inform established knowledge management practices grappling with the emerging challenges of Big Data.

