Developing a Prototype Opioid Surveillance System at a 2-Day Virginia Hackathon

2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Catherine Ordun ◽  
Jessica Bonnie ◽  
Jung Byun ◽  
Daewoo Chong ◽  
Richard Latham

Objective
A team of data scientists from Booz Allen competed in an opioid hackathon and developed a prototype opioid surveillance system using data science methods. This presentation intends to 1) describe the positives and negatives of our data science approach, 2) demo the prototype applications built, and 3) discuss next steps for local implementation of a similar capability.

Introduction
At the Governor's Opioid Addiction Crisis Datathon in September 2017, a team of Booz Allen data scientists participated in a two-day hackathon to develop a prototype surveillance system for business users to locate areas of high risk across multiple indicators in the State of Virginia. We addressed 1) how different geographic regions experience the opioid overdose epidemic differently, by clustering similar counties on socioeconomic indicators, and 2) how to facilitate better data sharing between health care providers and law enforcement. We believe this inexpensive, open-source surveillance approach could be applied in states across the nation, particularly those with high rates of death due to drug overdoses and those with significant increases in deaths.

Methods
The Datathon provided a combination of publicly available data and State of Virginia datasets consisting of crime data, treatment center data, funding data, and mortality and morbidity data for opioid, prescription drug (e.g., oxycodone, fentanyl), and heroin cases, with records dating back as early as 2010. The team focused on three data sources: the U.S. Census Bureau (American Community Survey), State of Virginia opioid mortality and overdose data, and State of Virginia Department of Corrections data. All data were cleaned and mapped to the county level using FIPS codes. The prototype system allowed users to cluster similar counties based on socioeconomic indicators so that underlying demographic patterns such as food stamp usage and poverty levels might be revealed as indicative of mortality and overdose rates. This was important because neighboring counties such as Goochland and Henrico, while sharing a border, do not necessarily share similar behavioral and population characteristics. As a result, counties in close proximity may require different approaches to community messaging, law enforcement, and treatment infrastructure. The prototype also ingests crime and mortality data at the county level for dynamic data exploration across multiple time and geographic parameters, a potential vehicle for real-time data exchange.

Results
The team wrote an agglomerative clustering algorithm (similar in spirit to k-means) in Python, with a Flask API back end, and visualized the results by FIPS county code in R Shiny. Users could select 2 to 5 clusters for visualization. The second part of the prototype featured two dashboards built in Elasticsearch and Kibana, open-source software built on a NoSQL database designed for information retrieval. Annual data on the number of criminal commitments and major offenses, together with opioid mortality and overdose data, were ingested and displayed using multiple descriptive charts and basic NLP. The clustering algorithm indicated that, when using five clusters, counties in the east of Virginia are more dissimilar to each other than counties in the west. The farther west, the more socioeconomically homogeneous counties become, which may explain why counties in the west have greater rates of opioid overdose than those in the east, where cases involve more recreational use of non-prescription drugs.
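To make the clustering step concrete, the following is a minimal sketch of how counties could be grouped by standardized socioeconomic indicators with scikit-learn's agglomerative clustering. It is a reconstruction, not the hackathon code; the input file, column names, and the choice of five clusters are assumptions.

# Minimal sketch of the county clustering step described above.
# Assumes a CSV of ACS socioeconomic indicators keyed by county FIPS code;
# the file name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

counties = pd.read_csv("va_county_indicators.csv", dtype={"fips": str})
features = counties[["poverty_rate", "snap_usage_rate", "median_income", "unemployment_rate"]]

# Standardize indicators so no single variable dominates the distance metric.
X = StandardScaler().fit_transform(features)

# The prototype let users choose 2 to 5 clusters; 5 is used here as an example.
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
counties["cluster"] = model.fit_predict(X)

# The FIPS-to-cluster table is what a choropleth front end (e.g., R Shiny) would consume.
counties[["fips", "cluster"]].to_csv("county_clusters.csv", index=False)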
The dashboards indicated that between 2011 and 2017, the majority of crimes associated with heavy drug use included Larceny/Fraud, Drug Sales, Assault, Burglary, Drug Possession, and Sexual Assault. Filtering by year, county, and offense allowed for very focused analysis at the county level.

Conclusions
Data science methods using geospatial analytics, unsupervised machine learning, and NoSQL databases for unstructured data offer powerful and inexpensive ways for local officials to develop their own opioid surveillance system. Our clustering approach could be advanced by including several dozen socioeconomic features, tied to a potential risk score that the group was considering calculating. Further, as the team became more familiar with the data, they considered building a supervised machine learning model not only to predict overdoses in each county but, more importantly, to extract from the model which features are most predictive from county to county. Next, because of the fast-paced nature of an overnight hackathon, a variety of open-source applications were used to build solutions quickly. The team recommends generating a single architecture that would seamlessly tie together Python, R Shiny, and Elasticsearch/Kibana into one system. Ultimately, the goal of the entire prototype is to ingest and update the models with real-time data dispatched by police, public health, emergency departments, and medical examiners.

References
https://data.virginia.gov/datathon-2017/
https://vimeo.com/236131006?ref=tw-share
https://vimeo.com/236131182?ref=tw-share
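Returning to the dashboards described in the Results, here is a minimal sketch of how county-level records might be pushed into Elasticsearch for Kibana to visualize. This is not the team's code; the index name, field names, input file, and use of the elasticsearch-py 8.x client are assumptions.

# Sketch of loading county-level crime and overdose records into Elasticsearch
# so Kibana dashboards can filter by year, county, and offense.
import csv
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

with open("va_overdose_records.csv", newline="") as f:
    for row in csv.DictReader(f):
        es.index(
            index="va-opioid-surveillance",   # hypothetical index name
            document={
                "fips": row["fips"],
                "county": row["county"],
                "year": int(row["year"]),
                "offense": row["offense"],
                "overdose_deaths": int(row["overdose_deaths"]),
            },
        )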

2020 ◽  
Author(s):  
Saeed Nosratabadi ◽  
Amir Mosavi ◽  
Puhong Duan ◽  
Pedram Ghamisi ◽  
Ferdinand Filip ◽  
...  

This paper provides a state-of-the-art investigation of advances in data science in emerging economic applications. The analysis covers novel data science methods in four classes: deep learning models, hybrid deep learning models, hybrid machine learning models, and ensemble models. Application domains include a wide and diverse range of economics research, from the stock market, marketing, and e-commerce to corporate banking and cryptocurrency. The PRISMA method, a systematic literature review methodology, was used to ensure the quality of the survey. The findings reveal that the trends follow the advancement of hybrid models, which, based on the accuracy metric, outperform other learning algorithms. It is further expected that the trends will converge toward the advancement of sophisticated hybrid deep learning models.
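To illustrate the kind of accuracy-based comparison the survey refers to, here is a small hedged sketch pitting an ensemble learner against a single baseline on held-out accuracy. The data are synthetic and the models are generic stand-ins, not the specific hybrid architectures reviewed in the paper.

# Illustrative accuracy comparison: ensemble vs. single learner on synthetic data.
# Results are illustrative only and not drawn from the reviewed studies.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("gradient boosting ensemble", GradientBoostingClassifier())]:
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))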


Author(s):  
Ihor Ponomarenko ◽  
Oleksandra Lubkovska

The subject of the research is the use of data science methods in health care for integrated data processing and analysis in order to optimize economic and specialized processes. The purpose of this article is to address issues related to the specifics of applying data science methods in health care on the basis of comprehensive information obtained from various sources. Methodology. The research methodology comprises system-structural and comparative analyses (to study the application of BI systems when working with large data sets), the monographic method (to study various software solutions in the business intelligence market), and economic analysis (to assess the possibility of using business intelligence systems to strengthen companies' competitive positions). The scientific novelty lies in identifying the main sources of data on key processes in the medical field. Examples of innovative methods of collecting information in health care, which are becoming widespread in the context of digitalization, are presented. The main sources of health care data used in data science are revealed. The specifics of applying machine learning methods in health care are presented in the context of increasing competition between market participants and growing demand for relevant products from the population. Conclusions. The intensification of the integration of data science in the medical field is due to the increase in digitized data (statistics, textual information, visualizations, etc.). Through the use of machine learning methods, doctors and other health professionals have new opportunities to improve the efficiency of the health care system as a whole. Keywords: data science, efficiency, information, machine learning, medicine, Python, healthcare.
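As a hedged illustration of the machine learning applications discussed above, the sketch below fits a classifier to tabular patient data. The data file, columns, and outcome variable ("readmitted") are hypothetical placeholders, not taken from the article.

# Hypothetical example: predicting hospital readmission from tabular records.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

records = pd.read_csv("patient_records.csv")   # hypothetical dataset
X = records[["age", "num_prior_visits", "length_of_stay", "chronic_conditions"]]
y = records["readmitted"]                       # hypothetical binary outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))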


Author(s):  
Brian Granger ◽  
Fernando Pérez

Project Jupyter is an open-source project for interactive computing widely used in data science, machine learning, and scientific computing. We argue that even though Jupyter helps users perform complex, technical work, Jupyter itself solves problems that are fundamentally human in nature. Namely, Jupyter helps humans to think and tell stories with code and data. We illustrate this by describing three dimensions of Jupyter: interactive computing, computational narratives, and the idea that Jupyter is more than software. We then describe the impact of these dimensions on a community of practice in Earth and climate science.


2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Howard Burkom ◽  
Joseph Downs ◽  
Raghav Ramachandran ◽  
Wayne Loschen ◽  
Laurel Boyd ◽  
...  

Objective
In a partnership between the Public Health Division of the Oregon Health Authority (OHA) and the Johns Hopkins Applied Physics Laboratory (APL), our objective was to develop an analytic fusion tool using streaming data and report-based evidence to improve the targeting and timing of evidence-based interventions in the ongoing opioid overdose epidemic. The tool is intended to enable practical situational awareness in the ESSENCE biosurveillance system to target response programs at the county and state levels. Threats to be monitored include emerging events and gradual trends of overdoses in three categories: all prescription and illicit opioids, heroin, and especially high-mortality synthetic drugs such as fentanyl and its analogues. Traditional sources included emergency department (ED) visits and emergency medical services (EMS) call records. Novel sources included poison center calls, death records, and report-based information such as bad-batch warnings on social media. Using available data and requirements analyses thus far, we applied and compared Bayesian networks, decision trees, and other machine learning approaches to derive robust tools to reveal emerging overdose threats and identify at-risk subpopulations.

Introduction
Unlike other health threats of recent concern, for which widespread mortality was hypothetical, the high fatality burden of the opioid overdose crisis is present, steadily growing, and affecting young and old, rural and urban, military and civilian subpopulations. While the background of many public health monitors is mainly infectious disease surveillance, these epidemiologists seek to collaborate with behavioral health and injury prevention programs and with law enforcement and emergency medical services to combat the opioid crisis. Recent efforts have produced key terms and phrases in available data sources and numerous user-friendly dashboards allowing inspection of hundreds of plots. The current effort seeks to distill and present combined fusion alerts of greatest concern from numerous stratified data outputs. Near-term plans are to implement the best-performing fusion methods as an ESSENCE module for the benefit of OHA staff and other user groups.

Methods
By analyzing historical OHA data, we formed features to monitor in each data source by adapting diagnosis codes and text strings suggested by CDC's injury prevention division, published EMS criteria [Reference 1], and generic product codes from CDC toxicologists, with guidance from OHA Emergency Services Director David Lehrfeld and Oregon Poison Center Director Sandy Giffen. These features included general and specific opioid abuse indicators, such as daily counts of records labelled with the "poisoning" subcategory and containing "fentanyl" or other keywords in the free text. Matrices of corresponding time series were formed for each of 36 counties and the entire state as inputs to region-specific fusion algorithms. To obtain truth data for detection, OHA staff provided guidance and design help to generate plausible overdose threat scenarios, quantified as realistic data distributions of monitored features accounting for time delays and historical distributions of counts in each data source. We sampled these distributions to create 1000 target sets for detection based on the event duration and affected counties for each event scenario. We used these target datasets to compare the detection performance of fusion detection algorithms.
Tested algorithms included Bayesian networks built with the R package gRain, as well as random forest, logistic regression, and support vector machine models implemented with the Python scikit-learn package using default settings. The first 800 days of the data were used for model training and the last 400 days for testing. Model results were evaluated with the metrics:

Sensitivity = (number of target event days signaled) / (all event days)
Positive predictive value (PPV) = (number of target event days signaled) / (all days signaled)

These metrics were combined with specificity, regarded as the expected fusion alert rate calculated from the historical dataset with no simulated cases injected.

Results
The left half of Figure 1 illustrates a threat scenario along Oregon's I-5 corridor in which a string of fentanyl overdoses with a few fatalities affects the monitored data streams in three counties over a seven-day period. The right half of the figure charts the performance metrics for random forest and Bayesian network machine learning methods applied to both training and test datasets, assuming total case counts of 50, 20, and 10 overdoses. Sensitivity values were encouraging, especially for the Bayesian networks and even for the 10-case scenario. Computed PPV levels suggested a manageable public health investigation burden.

Conclusions
The detection results were promising for a threat scenario of particular concern to OHA, based on a data scenario deemed plausible and realistic from historical data. Gaining trust in and acceptance of outputs from supervised machine learning methods, beyond traditional statistical methods, among public health surveillance practitioners will require user experience and similar evaluation with additional threat scenarios and authentic event data. Credible truth data can be generated for testing and evaluation of analytic fusion methods, with the advantages of several years of historical data from multiple sources and the expertise of experienced monitors. The collaborative generation process may be standardized and extended to other threat types and data environments. Next steps include adding report-based data that can influence data interpretation to the analytic fusion capability, including mainstream and social media reports, events in neighboring regions, and law enforcement data.

References
1. Rhode Island Enhanced State Opioid Overdose Surveillance (ESOOS) Case Definition for Emergency Medical Services (EMS), http://www.health.ri.gov/publications/guidelines/ESOOSCaseDefinitionForEMS.pdf, last accessed: Sept. 9, 2018.
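Returning to the Methods above, the following is a minimal sketch of the scikit-learn portion of the comparison, using the 800-day training / 400-day testing split and the sensitivity and PPV definitions given earlier. The feature-matrix file and column layout are assumptions, and the Bayesian network (gRain, R) component is omitted because the sketch is in Python.

# Sketch of the scikit-learn model comparison with sensitivity and PPV scoring.
# Assumes one row per day of numeric counts per data source plus a binary
# "event_day" label; the file name and columns are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

daily = pd.read_csv("county_feature_matrix.csv")
X = daily.drop(columns=["event_day"]).values
y = daily["event_day"].values

X_train, y_train = X[:800], y[:800]        # first 800 days for training
X_test, y_test = X[800:1200], y[800:1200]  # last 400 days for testing

for name, model in [("random forest", RandomForestClassifier()),
                    ("logistic regression", LogisticRegression()),
                    ("support vector machine", SVC())]:
    pred = model.fit(X_train, y_train).predict(X_test)
    tp = np.sum((pred == 1) & (y_test == 1))
    sensitivity = tp / max(np.sum(y_test == 1), 1)  # signaled event days / all event days
    ppv = tp / max(np.sum(pred == 1), 1)            # signaled event days / all days signaled
    print(f"{name}: sensitivity={sensitivity:.2f}, PPV={ppv:.2f}")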


Author(s):  
Greg Lawrance ◽  
Raphael Parra Hernandez ◽  
Khalegh Mamakani ◽  
Suraiya Khan ◽  
Brent Hills ◽  
...  

Introduction
Ligo is an open-source application that provides a framework for managing and executing administrative data linking projects. Ligo provides an easy-to-use web interface that lets analysts select among data linking methods, including deterministic, probabilistic and machine learning approaches, and use these in a documented, repeatable, tested, step-by-step process.

Objectives and Approach
The linking application has two primary functions: identifying common entities within a dataset (de-duplication) and identifying common entities between datasets (linking). The application is being built from the ground up in a partnership between the Province of British Columbia's Data Innovation (DI) Program and Population Data BC, with input from data scientists. The simple web interface allows analysts to streamline the processing of multiple datasets in a straightforward and reproducible manner.

Results
Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms, with a plugin architecture that supports the simple addition of new formulae. Currently, deterministic approaches to linking have been implemented and probabilistic methods are in alpha testing. A fully functional alpha, including deterministic and probabilistic methods, is expected to be ready in September, with a machine learning extension expected soon after.

Conclusion/Implications
Ligo has been designed with enterprise users in mind. The application is intended to make the processes of data de-duplication and linking simple, fast and reproducible. By making the application open source, we encourage feedback and collaboration from across the population research and data science community.
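To illustrate the simplest of the linking methods mentioned (deterministic matching), the sketch below performs an exact-match link on normalized fields with pandas. It is not Ligo's code or API; the datasets and field names are hypothetical.

# Minimal deterministic linkage sketch: exact match on normalized key fields.
import pandas as pd

def normalize(df):
    # Simple standardization before exact matching.
    out = df.copy()
    out["surname"] = out["surname"].str.strip().str.upper()
    out["dob"] = pd.to_datetime(out["dob"]).dt.date.astype(str)
    return out

left = normalize(pd.read_csv("dataset_a.csv"))
right = normalize(pd.read_csv("dataset_b.csv"))

# Deterministic rule: exact match on surname + date of birth + postal code.
links = left.merge(right, on=["surname", "dob", "postal_code"],
                   suffixes=("_a", "_b"))
print(f"{len(links)} candidate links found")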


2019 ◽  
Vol 1 ◽  
pp. 1-2
Author(s):  
Jan Wilkening

Abstract. Data is regarded as the oil of the 21st century, and the concept of data science has received increasing attention in recent years. These trends are mainly caused by the rise of big data: data that is big in terms of volume, variety and velocity. Consequently, data scientists are required to make sense of these large datasets. Companies have problems acquiring talented people to solve data science problems. This is not surprising, as employers often expect skillsets that can hardly be found in one person: not only does a data scientist need a solid background in machine learning, statistics and various programming languages, but often also in IT systems architecture, databases and complex mathematics. Above all, she should have strong non-technical domain expertise in her field (see Figure 1).

As it is widely accepted that 80% of data has a spatial component, developments in data science could provide exciting new opportunities for GIS and cartography: cartographers are experts in spatial data visualization, and often also very skilled in statistics, data pre-processing and analysis in general. Cartographers' skill levels often depend on the degree to which cartography programs at universities focus on the "front end" (visualization) of spatial data and leave the "back end" (modelling, gathering, processing, analysis) to GIScientists. In many university curricula, these front-end and back-end distinctions between cartographers and GIScientists are not clearly defined, and the boundaries are somewhat blurred.

In order to become good data scientists, cartographers and GIScientists need to acquire certain additional skills that are often beyond their university curricula. These skills include programming, machine learning and data mining. These are important technologies for extracting knowledge from big spatial data sets, and thereby the logical advancement of "traditional" geoprocessing, which focuses on "traditional" (small, structured, static) datasets such as shapefiles or feature classes.

To bridge the gap between spatial sciences (such as GIS and cartography) and data science, we need an integrated framework of "spatial data science" (Figure 2).

Spatial sciences focus on causality, using theory-based approaches to explain why things happen in space. In contrast, the scope of data science is to find similar patterns in big datasets with techniques of machine learning and data mining, often without considering spatial concepts (such as topology, spatial indexing, spatial autocorrelation, the modifiable areal unit problem, map projections and coordinate systems, uncertainty in measurement, etc.).

Spatial data science could become the core competency of GIScientists and cartographers who are willing to integrate methods from the data science knowledge stack. Moreover, data scientists could enhance their work by integrating important spatial concepts and tools from GIS and cartography into data science workflows. A non-exhaustive knowledge stack for spatial data scientists, including typical tasks and tools, is given in Table 1.

There are many interesting ongoing projects at the interface of spatial and data science. Examples from the ArcGIS platform include:
- Integration of Python GIS APIs with machine learning libraries, such as scikit-learn or TensorFlow, in Jupyter Notebooks
- Combination of R (advanced statistics and visualization) and GIS (basic geoprocessing, mapping) in ModelBuilder and other automation frameworks
- Enterprise GIS solutions for distributed geoprocessing operations on big, real-time vector and raster datasets
- Dashboards for visualizing real-time sensor data and integrating it with other data sources
- Applications for interactive data exploration
- GIS tools for machine learning tasks such as prediction, clustering and classification of spatial data (see the sketch after this list)
- GIS integration for Hadoop

While the discussion about proprietary (ArcGIS) vs. open-source (QGIS) software is beyond the scope of this article, it has to be stated that a) many ArcGIS projects are actually open source and b) using a complete GIS platform instead of several open-source pieces has several advantages, particularly in efficiency, maintenance and support (see Wilkening et al. (2019) for a more detailed consideration). At any rate, cartography and GIS tools are essential technology blocks for solving the (80% spatial) data science problems of the future.
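As a concrete example of the machine-learning tasks on spatial data listed above, the sketch below clusters point locations with DBSCAN using a haversine distance. The input file, column names, and the 500 m neighborhood radius are assumptions made for illustration.

# Density-based clustering of point locations with scikit-learn's DBSCAN.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

points = pd.read_csv("incident_locations.csv")    # hypothetical lat/lon in degrees
coords = np.radians(points[["lat", "lon"]].values)

# The haversine metric keeps distances meaningful on the sphere;
# eps of ~500 m expressed in radians (Earth radius ~6371 km).
db = DBSCAN(eps=0.5 / 6371.0, min_samples=10,
            metric="haversine", algorithm="ball_tree").fit(coords)
points["cluster"] = db.labels_                     # -1 marks noise points
print(points["cluster"].value_counts())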


2021 ◽  
Author(s):  
Luc Thomès ◽  
Rebekka Burkholz ◽  
Daniel Bojar

Abstract
As biological sequences, glycans occur in every domain of life and comprise monosaccharides that are chained together to form oligo- or polysaccharides. While glycans are crucial for most biological processes, existing analysis modalities make it difficult for researchers with limited computational background to include information from these diverse and nonlinear sequences into standard workflows. Here, we present glycowork, an open-source Python package designed for the processing and analysis of glycan data by end users, with a strong focus on glycan-related data science and machine learning. Glycowork includes numerous functions to, for instance, automatically annotate glycan motifs and analyze their distributions via heatmaps and statistical enrichment. We also provide visualization methods, routines to interact with stored databases, trained machine learning models, and learned glycan representations. We envision that glycowork can extract further insights from any glycan dataset and demonstrate this with several workflows that analyze glycan motifs in various biological contexts. Glycowork can be freely accessed at https://github.com/BojarLab/glycowork/.
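To give a flavor of the motif-annotation analyses that glycowork automates, here is a deliberately simplified, plain-Python sketch. It does not use the glycowork API; it treats motifs as substrings of IUPAC-condensed-like strings, whereas glycowork handles true branched, nonlinear glycan structures.

# Toy motif annotation on linear glycan strings (not the glycowork API).
glycans = [
    "Gal(b1-4)GlcNAc(b1-2)Man(a1-3)Man(b1-4)GlcNAc",
    "Neu5Ac(a2-3)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)Man(b1-4)GlcNAc",
]
motifs = {"LacNAc": "Gal(b1-4)GlcNAc", "sialylated": "Neu5Ac"}

for g in glycans:
    found = [name for name, pattern in motifs.items() if pattern in g]
    print(g, "->", found)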


2021 ◽  
Author(s):  
Maira Callupe ◽  
Luca Fumagalli ◽  
Domenico Daniele Nucera

Technology has created a vast array of educational tools readily available to educators, but it has also created a shift in the skills and competences demanded of new graduates. As data science and machine learning become commonplace across all industries, computer programming is emerging as one of the fundamental skills engineers will require to navigate the current and future workplace. It is, thus, the responsibility of educational institutions to rise to this challenge and to provide students with appropriate training that facilitates the development of these skills. The purpose of this paper is to explore the potential of open source tools to introduce students to the more practical side of Smart Maintenance. Through a learning pilot based mainly on computational notebooks, students without a programming background are walked through the relevant techniques and algorithms in an experiential format. The pilot highlights the superiority of Colab notebooks for the remote teaching of subjects that deal with data science and programming. Insights from the experience will inform the development of subsequent iterations during the current year.
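As a hedged example of the kind of notebook exercise such a pilot might include, the sketch below flags anomalous machine-vibration readings with a rolling-statistics threshold. The data file, column names, and threshold are hypothetical teaching placeholders, not material from the paper.

# Typical Smart Maintenance notebook exercise: simple anomaly flagging
# on time-stamped vibration data using rolling statistics.
import pandas as pd

vibration = pd.read_csv("machine_vibration.csv", parse_dates=["timestamp"])
vibration = vibration.set_index("timestamp").sort_index()

rolling_mean = vibration["rms"].rolling("1h").mean()
rolling_std = vibration["rms"].rolling("1h").std()

# Flag readings more than three standard deviations above the rolling mean.
vibration["anomaly"] = vibration["rms"] > (rolling_mean + 3 * rolling_std)
print(vibration["anomaly"].sum(), "anomalous readings flagged")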

