Jupyter: Thinking and Storytelling with Code and Data

Author(s):  
Brian Granger ◽  
Fernando Pérez

Project Jupyter is an open-source project for interactive computing widely used in data science, machine learning, and scientific computing. We argue that even though Jupyter helps users perform complex, technical work, Jupyter itself solves problems that are fundamentally human in nature: it helps humans think and tell stories with code and data. We develop this argument by describing three dimensions of Jupyter: interactive computing, computational narratives, and the idea that Jupyter is more than software. We then illustrate the impact of these dimensions on a community of practice in Earth and climate science.
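As an illustration of the interactive, narrative workflow described above (not taken from the paper), the following minimal sketch shows the kind of step-by-step, notebook-style analysis Jupyter supports; the CSV file and column names are hypothetical placeholders.

```python
# Minimal sketch of an interactive, notebook-style analysis: each step can be
# run, inspected, and re-run, with narrative text interleaved in Markdown cells.
# The CSV file and column names are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

temps = pd.read_csv("station_temperatures.csv", parse_dates=["date"])

# Inspect the data interactively before committing to an analysis.
print(temps.describe())

# Aggregate to annual means and plot; the figure becomes part of the narrative.
annual = temps.groupby(temps["date"].dt.year)["temp_c"].mean()
annual.plot(title="Annual mean temperature (hypothetical station)")
plt.ylabel("Temperature (°C)")
plt.show()
```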


Author(s):  
Kevin Schmaltz

The Western Kentucky University Mechanical Engineering program partnered with ASME to host an Open Source student design project to develop a prototype water purification device in 2008. The project was funded by an ASME grant and is part of a continuing initiative by ASME to extend the relevance of their annual Student Design Competitions (SDC) and link student projects to societal concerns. The Open Source Project extended the 2007 SDC, which required students to design and construct human-powered devices to purify water. The design challenge was inspired by Hurricane Katrina-like temporary disasters, but also addresses one of the National Academy of Engineering's Grand Challenges for Engineering: provide access to clean water. Affordable and practical solutions are needed to provide drinkable water to people who do not have the equipment, power, or other resources necessary to assure safe water supplies. During the spring and early summer of 2008, five students from various teams that qualified for the 2007 SDC finals used their competition experience to develop a new design for a human-powered water purification system. Team members were distributed at universities from Sweden to Venezuela to New Mexico, and therefore interacted via the internet and teleconferences to refine the design. Ongoing work was posted to the ASME website, allowing people external to the team a chance to critique or contribute to the design. The team met at WKU in May to construct and test a prototype of the design. The initial prototype was able to purify water at 10 times the rate of any SDC device, using a combination of passive sand filtration, solar heat collection, and mechanical friction heating. While this was a marked improvement, the reality is that the human effort required to purify this water is still excessive. The second-generation prototype was completed by faculty, staff, and students at WKU during the summer of 2009 with the information learned and experience gained from the initial prototype of the distributed team. This paper will discuss the evolution of the project design from the SDC through the second prototype and the impact of the open-source approach on the design process. The project represents ASME's first attempt at executing an "Open Source" project, providing a forum for mechanical engineers around the world to contribute to solutions of critical social, economic, and environmental problems. If the final design proves technically feasible, the Open Source team will seek support from the ASME Center for Engineering Entrepreneurship and Innovation to commercialize the design.
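To make the claim that "the human effort to purify this water is still excessive" concrete, here is a hedged back-of-envelope estimate that is not taken from the paper; the pasteurization temperature, starting temperature, and sustained human power output are assumed values, and losses are ignored.

```python
# Back-of-envelope estimate of the human effort needed to heat-treat water.
# Assumed values (not from the paper): pasteurization at ~65 °C, water drawn at
# 20 °C, sustained human mechanical output of ~75 W, perfect heat transfer.
SPECIFIC_HEAT_WATER = 4186      # J/(kg·K)
volume_l = 1.0                  # litres of water (~1 kg)
delta_t = 65.0 - 20.0           # required temperature rise in K
human_power_w = 75.0            # sustained mechanical output of one person

energy_j = SPECIFIC_HEAT_WATER * volume_l * delta_t    # ~188 kJ per litre
minutes_per_litre = energy_j / human_power_w / 60.0    # ~42 minutes

print(f"Energy per litre: {energy_j / 1000:.0f} kJ")
print(f"Pedalling time per litre (ideal): {minutes_per_litre:.0f} min")
```

Even under these idealized assumptions, heating alone costs tens of minutes of hard effort per litre, which is why the combination with passive filtration and solar heat collection matters.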


Author(s):  
Greg Lawrance ◽  
Raphael Parra Hernandez ◽  
Khalegh Mamakani ◽  
Suraiya Khan ◽  
Brent Hills ◽  
...  

Introduction: Ligo is an open source application that provides a framework for managing and executing administrative data linking projects. Ligo provides an easy-to-use web interface that lets analysts select among data linking methods, including deterministic, probabilistic, and machine learning approaches, and use these in a documented, repeatable, tested, step-by-step process. Objectives and Approach: The linking application has two primary functions: identifying common entities within a dataset (de-duplication) and identifying common entities between datasets (linking). The application is being built from the ground up in a partnership between the Province of British Columbia's Data Innovation (DI) Program and Population Data BC, with input from data scientists. The simple web interface allows analysts to streamline the processing of multiple datasets in a straightforward and reproducible manner. Results: Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms, with a plugin architecture that supports the simple addition of new formulae. Currently, deterministic approaches to linking have been implemented and probabilistic methods are in alpha testing. A fully functional alpha, including deterministic and probabilistic methods, is expected to be ready in September, with a machine learning extension expected soon after. Conclusion/Implications: Ligo has been designed with enterprise users in mind. The application is intended to make the processes of data de-duplication and linking simple, fast, and reproducible. By making the application open source, we encourage feedback and collaboration from across the population research and data science community.
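For readers unfamiliar with the distinction, here is a library-agnostic sketch of the two linking styles mentioned above: deterministic exact matching and a simple probabilistic (similarity-score) comparison. The field names, data, weights, and threshold are illustrative assumptions; this is not Ligo's actual API.

```python
# Illustrative sketch of deterministic vs. probabilistic record linkage.
# Field names, data, weights, and the 0.85 threshold are assumptions for
# illustration; this is not Ligo's API.
from difflib import SequenceMatcher

records_a = [{"id": 1, "name": "Jon Smith",  "dob": "1980-02-01"}]
records_b = [{"id": 9, "name": "John Smith", "dob": "1980-02-01"}]

def deterministic_link(a, b):
    # Exact agreement on all key fields.
    return a["name"] == b["name"] and a["dob"] == b["dob"]

def probabilistic_link(a, b, threshold=0.85):
    # Combine an exact date comparison with a fuzzy name similarity score.
    name_score = SequenceMatcher(None, a["name"], b["name"]).ratio()
    dob_score = 1.0 if a["dob"] == b["dob"] else 0.0
    return (0.6 * name_score + 0.4 * dob_score) >= threshold

for a in records_a:
    for b in records_b:
        print(a["id"], b["id"],
              "deterministic:", deterministic_link(a, b),
              "probabilistic:", probabilistic_link(a, b))
```

The example pair fails the deterministic test but passes the probabilistic one, which is exactly the kind of trade-off an analyst chooses between in a linking workflow.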


2021 ◽  
Vol 14 (1) ◽  
pp. 158
Author(s):  
Ricardo-Adán Salas-Rueda ◽  
Jesús Ramírez-Ortega

The objective of this mixed research is to analyze, through data science and machine learning, the impact of the flipped classroom on the educational process for combinational circuits. The flipped classroom facilitates the organization of active learning activities before, during, and after class. The participants are 17 students of Electronic Electrical Engineering who took the Digital Design course at the National Autonomous University of Mexico during the 2019 school year. The machine learning results indicate that consulting the videos before class, solving the exercises in class with the Quartus software, and carrying out the laboratory practices after class positively influence the development of circuit design skills. The data science analysis identifies three predictive models of flipped classroom use through the decision tree technique. Finally, teachers can improve the learning process, develop students' skills, build new educational spaces, and carry out creative activities through the flipped classroom.
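The following is a hedged sketch of the decision-tree modelling referred to above; the feature names (videos before class, Quartus exercises in class, lab practices after class) follow the abstract, but the data are synthetic assumptions, not the study's dataset.

```python
# Illustrative decision-tree model of skill development from flipped-classroom
# activities. The data are synthetic assumptions, not the study's dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: videos watched before class, Quartus exercises solved in class,
# lab practices completed after class (1 = done, 0 = not done).
X = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
    [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0],
]
# Target: whether circuit-design skills improved (1) or not (0).
y = [1, 1, 1, 1, 0, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["videos", "quartus", "lab"]))
```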


2021 ◽  
Vol 5 (4) ◽  
pp. 55
Author(s):  
Anastasiia Kolevatova ◽  
Michael A. Riegler ◽  
Francesco Cherubini ◽  
Xiangping Hu ◽  
Hugo L. Hammer

A general issue in climate science is the handling of big data and running complex and computationally heavy simulations. In this paper, we explore the potential of using machine learning (ML) to spare computational time and optimize data usage. The paper analyzes the effects of changes in land cover (LC), such as deforestation or urbanization, on local climate. Along with greenhouse gas emissions, LC changes are known to be an important cause of climate change. ML methods were trained to learn the relation between LC changes and temperature changes. The results showed that random forest (RF) outperformed other ML methods, and in particular the linear regression models that represent current practice in the literature. Explainable artificial intelligence (XAI) was further used to interpret the RF method and analyze the impact of different LC changes on temperature. The results mainly agree with the climate science literature, but also reveal new and interesting findings, demonstrating that ML methods in combination with XAI can be useful in analyzing the climate effects of LC changes. All parts of the analysis pipeline are explained, including data pre-processing, feature extraction, ML training, performance evaluation, and XAI.
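As a minimal sketch of the kind of pipeline described above, the code below trains a random forest on synthetic land-cover-change features and uses permutation importance as a simple stand-in for the XAI step; the features, data, and choice of interpretability method are illustrative assumptions and do not reproduce the paper's setup.

```python
# Minimal sketch of an RF-plus-XAI pipeline: synthetic land-cover-change
# fractions as features, local temperature change as the target. Feature names,
# data, and the use of permutation importance are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Fractions of a grid cell converted to forest, cropland, urban.
X = rng.uniform(0, 1, size=(n, 3))
# Synthetic response: deforestation and urbanization warm, afforestation cools.
y = -0.8 * X[:, 0] + 0.3 * X[:, 1] + 1.2 * X[:, 2] + rng.normal(0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out data:", round(rf.score(X_test, y_test), 3))

# Permutation importance as a simple model-agnostic interpretability step.
imp = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for name, score in zip(["forest", "cropland", "urban"], imp.importances_mean):
    print(f"{name}: {score:.3f}")
```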


2019 ◽  
Vol 1 ◽  
pp. 1-2
Author(s):  
Jan Wilkening

Abstract. Data is regarded as the oil of the 21st century, and the concept of data science has received increasing attention in recent years. These trends are mainly caused by the rise of big data – data that is big in terms of volume, variety, and velocity. Consequently, data scientists are required to make sense of these large datasets. Companies have problems acquiring talented people to solve data science problems. This is not surprising, as employers often expect skill sets that can hardly be found in one person: not only does a data scientist need a solid background in machine learning, statistics, and various programming languages, but often also in IT systems architecture, databases, and complex mathematics. Above all, she should have strong non-technical domain expertise in her field (see Figure 1).

As it is widely accepted that 80% of data has a spatial component, developments in data science could provide exciting new opportunities for GIS and cartography: cartographers are experts in spatial data visualization, and often also very skilled in statistics, data pre-processing, and analysis in general. Cartographers' skill levels often depend on the degree to which cartography programs at universities focus on the "front end" (visualization) of spatial data and leave the "back end" (modelling, gathering, processing, analysis) to GIScientists. In many university curricula, these front-end and back-end distinctions between cartographers and GIScientists are not clearly defined, and the boundaries are somewhat blurred.

In order to become good data scientists, cartographers and GIScientists need to acquire certain additional skills that are often beyond their university curricula. These skills include programming, machine learning, and data mining. These are important technologies for extracting knowledge from big spatial datasets, and thereby the logical advancement of "traditional" geoprocessing, which focuses on "traditional" (small, structured, static) datasets such as shapefiles or feature classes.

To bridge the gap between the spatial sciences (such as GIS and cartography) and data science, we need an integrated framework of "spatial data science" (Figure 2).

The spatial sciences focus on causality, using theory-based approaches to explain why things are happening in space. In contrast, the scope of data science is to find similar patterns in big datasets with techniques of machine learning and data mining – often without considering spatial concepts (such as topology, spatial indexing, spatial autocorrelation, the modifiable areal unit problem, map projections and coordinate systems, uncertainty in measurement, etc.).

Spatial data science could become the core competency of GIScientists and cartographers who are willing to integrate methods from the data science knowledge stack. Moreover, data scientists could enhance their work by integrating important spatial concepts and tools from GIS and cartography into data science workflows. A non-exhaustive knowledge stack for spatial data scientists, including typical tasks and tools, is given in Table 1.

There are many interesting ongoing projects at the interface of spatial and data science. Examples from the ArcGIS platform include:
- Integration of Python GIS APIs with machine learning libraries, such as scikit-learn or TensorFlow, in Jupyter Notebooks (see the sketch below)
- Combination of R (advanced statistics and visualization) and GIS (basic geoprocessing, mapping) in ModelBuilder and other automation frameworks
- Enterprise GIS solutions for distributed geoprocessing operations on big, real-time vector and raster datasets
- Dashboards for visualizing real-time sensor data and integrating it with other data sources
- Applications for interactive data exploration
- GIS tools for machine learning tasks such as prediction, clustering, and classification of spatial data
- GIS integration for Hadoop

While the discussion about proprietary (ArcGIS) vs. open-source (QGIS) software is beyond the scope of this article, it has to be stated that (a) many ArcGIS projects are actually open source and (b) using a complete GIS platform instead of several open-source pieces has several advantages, particularly in efficiency, maintenance, and support (see Wilkening et al. (2019) for a more detailed consideration). At any rate, cartography and GIS tools are the essential technology blocks for solving the (80% spatial) data science problems of the future.
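The sketch below gives a minimal example of the first item in the list above: spatial data read with a Python GIS API and clustered with a scikit-learn model inside a notebook. The GeoPackage file, column names, and clustering task are hypothetical assumptions, not a specific ArcGIS workflow.

```python
# Minimal sketch of combining spatial data with a machine learning library in a
# notebook. The GeoPackage file and column names are hypothetical placeholders.
import geopandas as gpd
from sklearn.cluster import KMeans

# Read point features (e.g., sensor locations) with geopandas.
points = gpd.read_file("sensors.gpkg")

# Use coordinates plus an attribute as clustering features.
features = list(zip(points.geometry.x, points.geometry.y, points["reading"]))
points["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)

# Map the clusters directly in the notebook.
points.plot(column="cluster", legend=True)
```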


2021 ◽  
Author(s):  
Luc Thomès ◽  
Rebekka Burkholz ◽  
Daniel Bojar

As a biological sequence, glycans occur in every domain of life and comprise monosaccharides that are chained together to form oligo- or polysaccharides. While glycans are crucial for most biological processes, existing analysis modalities make it difficult for researchers with limited computational background to include information from these diverse and nonlinear sequences into standard workflows. Here, we present glycowork, an open-source Python package that was designed for the processing and analysis of glycan data by end users, with a strong focus on glycan-related data science and machine learning. Glycowork includes numerous functions to, for instance, automatically annotate glycan motifs and analyze their distributions via heatmaps and statistical enrichment. We also provide visualization methods, routines to interact with stored databases, trained machine learning models, and learned glycan representations. We envision that glycowork can extract further insights from any glycan dataset and demonstrate this with several workflows that analyze glycan motifs in various biological contexts. Glycowork can be freely accessed at https://github.com/BojarLab/glycowork/.
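As a library-agnostic illustration of the motif-counting idea that glycowork automates, the snippet below counts a disaccharide motif across a few glycan strings written in an IUPAC-condensed-like notation; the glycans and motif are small illustrative examples, and this is not glycowork's actual API.

```python
# Library-agnostic sketch of glycan motif counting (the kind of annotation
# glycowork automates). Glycan strings and the motif are illustrative examples
# in an IUPAC-condensed-like notation; this is not glycowork's API.
from collections import Counter

glycans = [
    "Gal(b1-4)GlcNAc(b1-2)Man(a1-3)Man(b1-4)GlcNAc",
    "Gal(b1-4)GlcNAc(b1-2)Man(a1-6)Man(b1-4)GlcNAc",
    "Man(a1-3)Man(b1-4)GlcNAc(b1-4)GlcNAc",
]

motif = "Gal(b1-4)GlcNAc"  # LacNAc-like disaccharide motif

# Count how many glycans in the dataset contain the motif as a substring.
counts = Counter("present" if motif in g else "absent" for g in glycans)
print(counts)  # Counter({'present': 2, 'absent': 1})
```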


Author(s):  
Sean Kross ◽  
Roger D Peng ◽  
Brian S Caffo ◽  
Ira Gooding ◽  
Jeffrey T Leek

Over the last three decades data has become ubiquitous and cheap. This transition has accelerated over the last five years, and training in statistics, machine learning, and data analysis has struggled to keep up. In April 2014 we launched a program of nine courses, the Johns Hopkins Data Science Specialization, which has now had more than 4 million enrollments over the past three years. Here we describe the program and compare it to both standard and more recently developed data science curricula. We show that novel pedagogical and administrative decisions introduced in our program are now standard in online data science programs. The impact of the Data Science Specialization on data science education in the US is also discussed. Finally, we conclude with some thoughts about the future of data science education in a data-democratized world.


2021 ◽  
Vol 3 ◽  
Author(s):  
Ahmed Al-Hindawi ◽  
Ahmed Abdulaal ◽  
Timothy M. Rawson ◽  
Saleh A. Alqahtani ◽  
Nabeela Mughal ◽  
...  

The SARS-CoV-2 virus, which causes the COVID-19 pandemic, has had an unprecedented impact on healthcare, requiring multidisciplinary innovation and novel thinking to minimize impact and improve outcomes. A wide range of disciplines have collaborated, including diverse clinicians (radiology, microbiology, and critical care), who are working increasingly closely with data science. This has been leveraged through the democratization of data science, with the increasing availability of easy-to-access open datasets, tutorials, programming languages, and hardware that makes it significantly easier to create mathematical models. To address the COVID-19 pandemic, such data science has enabled modeling of the impact of the virus on the population and individuals for diagnostic, prognostic, and epidemiological ends. This has led to two large systematic reviews on this topic that have highlighted the two different ways in which this feat has been attempted: one using classical statistics and the other using more novel machine learning techniques. In this review, we debate the relative strengths and weaknesses of each method toward the specific task of predicting COVID-19 outcomes.
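To make the contrast concrete, here is a hedged, synthetic-data sketch of the two modelling styles the review compares: a classical logistic regression and a machine learning ensemble. The data are generated for illustration only and are not drawn from any COVID-19 dataset.

```python
# Illustrative comparison of the two modelling styles discussed in the review:
# classical logistic regression vs. a machine learning ensemble. Data are
# synthetic and purely for illustration; no COVID-19 data are used.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

The classical model trades some flexibility for interpretable coefficients, while the ensemble can capture nonlinearities at the cost of transparency, which is the core of the strengths-and-weaknesses debate in the review.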


2021 ◽  
Author(s):  
Maira Callupe ◽  
Luca Fumagalli ◽  
Domenico Daniele Nucera

Technology has created a vast array of educational tools readily available to educators, but it has also created a shift in the skills and competences demanded from new graduates. As data science and machine learning are becoming commonplace across all industries, computer programming is emerging as one of the fundamental skills engineers will require to navigate the current and future workplace. It is, thus, the responsibility of educational institutions to rise to this challenge and to provide students with appropriate training that facilitates the development of these skills. The purpose of this paper is to explore the potential of open source tools to introduce students to the more practical side of Smart Maintenance. By developing a learning pilot based mainly on computational notebooks, students without a programming background are walked through the relevant techniques and algorithms in an experiential format. The pilot highlights the superiority of Colab notebooks for the remote teaching of subjects that deal with data science and programming. The resulting insights from the experience will be used for the development of subsequent iterations during the current year.

