Data Science in Chemical Engineering: Applications to Molecular Science

Chemical engineering is being rapidly transformed by the tools of data science. On the horizon, artificial intelligence (AI) applications will impact a huge swath of our work, ranging from the discovery and design of new molecules to operations and manufacturing and many areas in between. Early adoption of data science, machine learning, and early examples of AI in chemical engineering has been rich with examples of molecular data science, that is, the application of these tools to molecular discovery and property optimization at the atomic scale. We summarize key advances in this nascent subfield while introducing molecular data science for a broad chemical engineering readership. We introduce the field through the concept of a molecular data science life cycle and discuss relevant aspects of five distinct phases of this process: creation of curated data sets, molecular representations, data-driven property prediction, generation of new molecules, and feasibility and synthesizability considerations. Expected final online publication date for the Annual Review of Chemical and Biomolecular Engineering, Volume 12 is June 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Biotic and Abiotic Controls on the Phanerozoic History of Marine Animal Biodiversity

Annual Review of Ecology, Evolution, and Systematics ◽ 10.1146/annurev-ecolsys-012021-035131 ◽ 2021 ◽ Vol 52 (1)

Author(s): Andrew M. Bush ◽ Jonathan L. Payne

Keyword(s): Biotic Interactions ◽ Annual Review ◽ Publication Date ◽ Data Sets ◽ Marine Animal ◽ Marine Animals ◽ Marine Habitat ◽ Positive Feedbacks ◽ Abiotic Controls ◽ Time Changes

During the past 541 million years, marine animals underwent three intervals of diversification (early Cambrian, Ordovician, Cretaceous–Cenozoic) separated by nondirectional fluctuation, suggesting diversity-dependent dynamics with the equilibrium diversity shifting through time. Changes in factors such as shallow-marine habitat area and climate appear to have modulated the nondirectional fluctuations. Directional increases in diversity are best explained by evolutionary innovations in marine animals and primary producers coupled with stepwise increases in the availability of food and oxygen. Increasing intensity of biotic interactions such as predation and disturbance may have led to positive feedbacks on diversification as ecosystems became more complex. Important areas for further research include improving the geographic coverage and temporal resolution of paleontological data sets, as well as deepening our understanding of Earth system evolution and the physiological and ecological traits that modulated organismal responses to environmental change. Expected final online publication date for the Annual Review of Ecology, Evolution, and Systematics, Volume 52 is November 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Perspective on Data Science

Annual Review of Statistics and Its Application ◽ 10.1146/annurev-statistics-040220-013917 ◽ 2021 ◽ Vol 9 (1)

Author(s): Roger D. Peng ◽ Hilary S. Parker

Keyword(s): Iterative Process ◽ Data Science ◽ Annual Review ◽ Publication Date ◽ Broad Definition ◽ The Core ◽ Fields Of Study ◽ Points Of Interest ◽ New Knowledge

The field of data science currently enjoys a broad definition that includes a wide array of activities which borrow from many other established fields of study. Having such a vague characterization of a field in the early stages might be natural, but over time maintaining such a broad definition becomes unwieldy and impedes progress. In particular, the teaching of data science is hampered by the seeming need to cover many different points of interest. Data scientists must ultimately identify the core of the field by determining what makes the field unique and what it means to develop new knowledge in data science. In this review we attempt to distill some core ideas from data science by focusing on the iterative process of data analysis and develop some generalizations from past experience. Generalizations of this nature could form the basis of a theory of data science and would serve to unify and scale the teaching of data science to large audiences. Expected final online publication date for the Annual Review of Statistics, Volume 9 is March 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Modern Clinical Text Mining: A Guide and Review

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-030421-030931 ◽ 2021 ◽ Vol 4 (1)

Author(s): Bethany Percha

Keyword(s): Machine Learning ◽ Text Mining ◽ Data Science ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Clinical Text ◽ Quality Improvement Research ◽ Comprehensive Survey ◽ Technical Advances

Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g., physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, this review describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation in health systems and in industry. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 4 is July 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Smartphones and the Neuroscience of Mental Health

Annual Review of Neuroscience ◽ 10.1146/annurev-neuro-101220-014053 ◽ 2021 ◽ Vol 44 (1)

Author(s): Claire M. Gillan ◽ Robb B. Rutledge

Keyword(s): Mental Health ◽ Annual Review ◽ Publication Date ◽ Data Sets ◽ Daily Lives ◽ Neuroscience Research ◽ Health And Illness ◽ Passive Data ◽ Smartphone Sensors ◽ Rich Data

Improvements in understanding the neurobiological basis of mental illness have unfortunately not translated into major advances in treatment. At this point, it is clear that psychiatric disorders are exceedingly complex and that, in order to account for and leverage this complexity, we need to collect longitudinal datasets from much larger and more diverse samples than is practical using traditional methods. We discuss how smartphone-based research methods have the potential to dramatically advance our understanding of the neuroscience of mental health. This, we expect, will take the form of complementing lab-based hard neuroscience research with dense sampling of cognitive tests, clinical questionnaires, passive data from smartphone sensors, and experience-sampling data as people go about their daily lives. Theory- and data-driven approaches can help make sense of these rich data sets, and the combination of computational tools and the big data that smartphones make possible has great potential value for researchers wishing to understand how aspects of brain function give rise to, or emerge from, states of mental health and illness. Expected final online publication date for the Annual Review of Neuroscience, Volume 44 is July 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Perspectives on Allele-Specific Expression

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-021621-122219 ◽ 2021 ◽ Vol 4 (1)

Author(s): Siobhan Cleary ◽ Cathal Seoighe

Keyword(s): Gene Expression ◽ Genetic Variants ◽ Data Science ◽ Genetic Diseases ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Specific Expression ◽ Cis Acting ◽ Gene Copies

Diploidy has profound implications for population genetics and susceptibility to genetic diseases. Although two copies are present for most genes in the human genome, they are not necessarily both active or active at the same level in a given individual. Genomic imprinting, resulting in exclusive or biased expression in favor of the allele of paternal or maternal origin, is now believed to affect hundreds of human genes. A far greater number of genes display unequal expression of gene copies due to cis-acting genetic variants that perturb gene expression. The availability of data generated by RNA sequencing applied to large numbers of individuals and tissue types has generated unprecedented opportunities to assess the contribution of genetic variation to allelic imbalance in gene expression. Here we review the insights gained through the analysis of these data about the extent of the genetic contribution to allelic expression imbalance, the tools and statistical models for gene expression imbalance, and what the results obtained reveal about the contribution of genetic variants that alter gene expression to complex human diseases and phenotypes. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 4 is July 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Human-Centric Data Science for Urban Studies

ISPRS International Journal of Geo-Information ◽ 10.3390/ijgi8120584 ◽ 2019 ◽ Vol 8 (12) ◽ pp. 584 ◽ Cited By ~ 2

Author(s): Bernd Resch ◽ Michael Szell

Keyword(s): Urban Studies ◽ Smart City ◽ Data Science ◽ City Planning ◽ Data Driven ◽ Data Sets ◽ Use Of Technology ◽ New Perspective ◽ Research Initiatives ◽ Adequate Data

Due to the widespread use of disruptive digital technologies like mobile phones, cities have transitioned from data-scarce to data-rich environments. As a result, the field of geoinformatics is being reshaped and challenged to develop adequate data-driven methods. At the same time, the term "smart city" is increasingly being applied in urban planning, reflecting the aims of different stakeholders to create value out of the new data sets. However, many smart city research initiatives are promoting techno-positivistic approaches which do not adequately account for citizens’ needs. In this paper, we review the state of quantitative urban studies under this new perspective, and critically discuss the development of smart city programs. We conclude with a call for a new anti-disciplinary, human-centric urban data science, and a well-reflected use of technology and data collection in smart city planning. Finally, we introduce the papers of this special issue, which focus on providing a more human-centric view on data-driven urban studies, spanning topics from cycling and wellbeing to mobility and land use.

Synthetic Data

Annual Review of Statistics and Its Application ◽ 10.1146/annurev-statistics-040720-031848 ◽ 2020 ◽ Vol 8 (1)

Author(s): Trivellore E. Raghunathan

Keyword(s): Synthetic Data ◽ Future Research ◽ Annual Review ◽ Publication Date ◽ Data Sets ◽ Sensitive Information ◽ Inferential Justification ◽ Widespread Access ◽ Access To Data ◽ Privacy And Confidentiality

Demand for access to data, especially data collected using public funds, is ever growing. At the same time, concerns about the disclosure of the identities of and sensitive information about the respondents providing the data are making the data collectors limit the access to data. Synthetic data sets, generated to emulate certain key information found in the actual data and provide the ability to draw valid statistical inferences, are an attractive framework to afford widespread access to data for analysis while mitigating privacy and confidentiality concerns. The goal of this article is to provide a review of various approaches for generating and analyzing synthetic data sets, inferential justification, limitations of the approaches, and directions for future research. Expected final online publication date for the Annual Review of Statistics, Volume 8 is March 8, 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Deciphering Temperature Seasonality in Earth's Ancient Oceans

Annual Review of Earth and Planetary Sciences ◽ 10.1146/annurev-earth-032320-095156 ◽ 2021 ◽ Vol 50 (1)

Author(s): Linda C. Ivany ◽ Emily J. Judd

Keyword(s): Climate Model ◽ Temperature Data ◽ Temperature Cycle ◽ Annual Review ◽ Publication Date ◽ Mean Annual Temperature ◽ Data Sets ◽ Proxy Data ◽ Seamless Integration ◽ Seasonal Data

Ongoing global warming due to anthropogenic climate change has long been recognized, yet uncertainties regarding how seasonal extremes will change in the future persist. Paleoseasonal proxy data from intervals when global climate differed from today can help constrain how and why the annual temperature cycle has varied through space and time. Records of past seasonal variation in marine temperatures are available in the oxygen isotope values of serially sampled accretionary organisms. The most useful data sets come from carefully designed and computationally robust studies that enable characterization of paleoseasonal parameters and seamless integration with mean annual temperature data sets and climate models. Seasonal data sharpen interpretations of—and quantify overlooked or unconstrained seasonal biases in—the more voluminous mean temperature data and aid in the evaluation of climate model performance. Methodologies to rigorously analyze seasonal data are now available, and the promise of paleoseasonal proxy data for the next generation of paleoclimate research is significant. Expected final online publication date for the Annual Review of Earth and Planetary Sciences, Volume 50 is May 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Data Science in the Food Industry

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-020221-123602 ◽ 2021 ◽ Vol 4 (1)

Author(s): George-John Nychas ◽ Emma Sims ◽ Panagiotis Tsakanikas ◽ Fady Mohareb

Keyword(s): Food Safety ◽ Food Chain ◽ Food Industry ◽ Data Science ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Constant State ◽ Food Integrity ◽ Multi Stakeholder

Food safety is one of the main challenges of the agri-food industry that is expected to be addressed in the current environment of tremendous technological progress, where consumers’ lifestyles and preferences are in a constant state of flux. Food chain transparency and trust are drivers for food integrity control and for improvements in efficiency and economic growth. Similarly, the circular economy has great potential to reduce wastage and improve the efficiency of operations in multi-stakeholder ecosystems. Throughout the food chain cycle, all food commodities are exposed to multiple hazards, resulting in a high likelihood of contamination. Such biological or chemical hazards may be naturally present at any stage of food production, whether accidentally introduced or fraudulently imposed, risking consumers’ health and their faith in the food industry. Nowadays, a massive amount of data is generated, not only from the next generation of food safety monitoring systems and along the entire food chain (primary production included) but also from the internet of things, media, and other devices. These data should be used for the benefit of society, and the scientific field of data science should be a vital player in helping to make this possible. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 4 is July 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Recent Challenges in Actuarial Science

Annual Review of Statistics and Its Application ◽ 10.1146/annurev-statistics-040120-030244 ◽ 2021 ◽ Vol 9 (1)

Author(s): Paul Embrechts ◽ Mario V. Wüthrich

Keyword(s): Probability Theory ◽ Generalized Linear Models ◽ Life Insurance ◽ Data Science ◽ Linear Models ◽ Annual Review ◽ Publication Date ◽ Actuarial Science ◽ Employment Opportunities ◽ Cross Fertilization

For centuries, mathematicians and, later, statisticians have found natural research and employment opportunities in the realm of insurance. By definition, insurance offers financial cover against unforeseen events that involve an important component of randomness, and consequently, probability theory and mathematical statistics enter insurance modeling in a fundamental way. In recent years, a data deluge, coupled with ever-advancing information technology and the birth of data science, has revolutionized or is about to revolutionize most areas of actuarial science as well as insurance practice. We discuss parts of this evolution and, in the case of non-life insurance, show how combining classical tools from statistics, such as generalized linear models, with, e.g., neural networks contributes to better understanding and analysis of actuarial data. We further review areas of actuarial science where the cross-fertilization between stochastics and insurance holds promise for both sides. Of course, the vastness of the field of insurance limits our choice of topics; we mainly focus on topics closer to our main areas of research. Expected final online publication date for the Annual Review of Statistics, Volume 9 is March 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
