A public data set of spatio-temporal match events in soccer competitions

2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Luca Pappalardo ◽  
Paolo Cintia ◽  
Alessio Rossi ◽  
Emanuele Massucco ◽  
Paolo Ferragina ◽  
...  

Abstract Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of sensing technologies that provide high-fidelity data streams for every match. Unfortunately, these detailed data are owned by specialized companies and hence are rarely publicly available for scientific research. To fill this gap, this paper describes the largest open collection of soccer-logs ever released, containing all the spatio-temporal events (passes, shots, fouls, etc.) that occurred during each match for an entire season of seven prominent soccer competitions. Each match event contains information about its position, time, outcome, player and characteristics. The nature of team sports like soccer, halfway between the abstraction of a game and the reality of complex social systems, combined with the unique size and composition of this dataset, provides an ideal ground for tackling a wide range of data science problems, including the measurement and evaluation of performance, both at individual and at collective level, and the determinants of success and failure.

2020 ◽  
Vol 8 ◽  
Author(s):  
Devasis Bassu ◽  
Peter W. Jones ◽  
Linda Ness ◽  
David Shallcross

Abstract In this paper, we present a theoretical foundation for a representation of a data set as a measure in a very large hierarchically parametrized family of positive measures, whose parameters can be computed explicitly (rather than estimated by optimization), and illustrate its applicability to a wide range of data types. The preprocessing step then consists of representing data sets as simple measures. The theoretical foundation consists of a dyadic product formula representation lemma, and a visualization theorem. We also define an additive multiscale noise model that can be used to sample from dyadic measures and a more general multiplicative multiscale noise model that can be used to perturb continuous functions, Borel measures, and dyadic measures. The first two results are based on theorems in [15, 3, 1]. The representation uses the very simple concept of a dyadic tree and hence is widely applicable, easily understood, and easily computed. Since the data sample is represented as a measure, subsequent analysis can exploit statistical and measure theoretic concepts and theories. Because the representation uses the very simple concept of a dyadic tree defined on the universe of a data set, and the parameters are simply and explicitly computable and easily interpretable and visualizable, we hope that this approach will be broadly useful to mathematicians, statisticians, and computer scientists who are intrigued by or involved in data science, including its mathematical foundations.
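As a rough illustration of parameters that are "computed explicitly (rather than estimated by optimization)", the sketch below builds a dyadic tree on the unit interval and assigns each dyadic cell one coefficient, the normalized difference between the mass of its left and right halves. The one-dimensional setting, the function names, and the exact form of the coefficient are assumptions made for illustration; the abstract does not state the formula, so this should not be read as the authors' construction.

```python
def dyadic_coefficients(samples, depth):
    """Return {(level, index): a} for dyadic cells of [0, 1) down to `depth`.

    Assumed coefficient: a = (mass_left - mass_right) / mass_cell, in [-1, 1].
    Empty cells get a = 0.0 by convention (an assumption of this sketch).
    """
    coeffs = {}
    for level in range(depth):
        cells = 2 ** level
        for idx in range(cells):
            lo, hi = idx / cells, (idx + 1) / cells
            mid = (lo + hi) / 2
            left = sum(1 for x in samples if lo <= x < mid)
            right = sum(1 for x in samples if mid <= x < hi)
            total = left + right
            coeffs[(level, idx)] = (left - right) / total if total else 0.0
    return coeffs


def cell_mass(coeffs, level, idx):
    """Rebuild the normalized mass of one cell as a product of (1 +/- a) / 2."""
    mass = 1.0
    for step in range(level):
        ancestor = idx >> (level - step)           # index of the ancestor cell
        goes_right = (idx >> (level - step - 1)) & 1
        a = coeffs[(step, ancestor)]
        mass *= (1 - a) / 2 if goes_right else (1 + a) / 2
    return mass


samples = [0.1, 0.2, 0.6, 0.9]
coeffs = dyadic_coefficients(samples, depth=2)
quarter_masses = [cell_mass(coeffs, 2, i) for i in range(4)]
```

Each coefficient is a simple count ratio, so no optimization is involved, and the product in `cell_mass` recovers the fraction of samples in a cell from the coefficients of its ancestors.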


2021 ◽  
Author(s):  
Max-Marcel Theilig ◽  
Ashley A Knapp ◽  
Jennifer M Nicholas ◽  
Rüdiger Zarnekow ◽  
David C Mohr

BACKGROUND Using mobile health technology has sparked a broad engagement of data science and machine learning methods to leverage the complex, assorted amount of data for mental health purposes. Despite many studies, there is a reported underdevelopment of user engagement concepts, and the desire for high accuracy or significance has been shown to lead to low explicability and irreproducibility. OBJECTIVE To overcome such causes of poor analysis input and facilitate the reproducibility and credibility of artificial intelligence applications, we aim to explore principal characteristics of user interaction with digital mental health. METHODS We generated five latent features based on previous research, expert opinions from digital mental health, and informed by data. The features were analyzed with descriptive statistics and data visualization. We carried out two rounds of evaluations with data from 12,400 users of IntelliCare, a mental health platform with 12 apps. First, we focused on proof of concept and second, we assessed reproducibility by drawing conclusions from distribution differences. User data was drawn from both research trials and public deployment on Google Play. RESULTS Our algorithms showed advantages over commonly used concepts and were reproducible in our public data set with different underlying behavioral strategies. These measures relate to the distribution of a user’s allocated attention, users’ circadian behavior, their consecutive commitment to a specific strategy, and users’ interaction trajectory. Because distributions between research trial and public deployment were similar, consistency was implied regarding the underlying behavioral strategies: psychoeducation and goal setting are used as a catalyst to overcome the users’ primary obstacles, sleep hygiene is addressed most regularly, while regular self-reflective thinking is avoided.
Relaxation as well as cognitive reframing have increased variance in commitment among public users, indicating the challenging nature of these apps. The relative course of users’ engagement is similar in research and public data. CONCLUSIONS We argue that deliberate, a-priori feature engineering is essential for reproducible, tangible, and explainable study analyses. Our features enable improved results as well as interpretability, providing an increased understanding of how people engage with multiple mental health apps over time. Since we based the generation of features on generic interaction, these methods are applicable to further methods of study analysis and digital health.


Author(s):  
Mikhail Karasikov ◽  
Harun Mustafa ◽  
Daniel Danciu ◽  
Marc Zimmermann ◽  
Christopher Barber ◽  
...  

Abstract The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by the index and its query performance. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework’s scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI’s Sequence Read Archive, representing a total input of more than three petabases. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including those over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 20,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries.
All indexes created from public data comprising in total more than 1 million records are available for download or usage in the cloud. As an example of our indexes’ integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and used them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.
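The idea behind differential assembly, extracting sequences present in a foreground set but absent from a background set, can be sketched on plain k-mer sets. This toy example is not MetaGraph's API or algorithm: the function names are hypothetical, the requirement that a k-mer occur in every foreground sample is an assumption of the sketch, and assembly of the surviving k-mers into contigs is left out.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def differential_kmers(foreground, background, k):
    """k-mers found in every foreground sample and in no background sample."""
    if not foreground:
        return set()
    shared = set.intersection(*(kmers(seq, k) for seq in foreground))
    excluded = set()
    for seq in background:
        excluded |= kmers(seq, k)
    return shared - excluded


result = differential_kmers(["ACGTAC", "CGTACG"], ["ACGT"], k=3)
```

In this example the two foreground sequences share the 3-mers ACG, CGT, GTA and TAC; the background contributes ACG and CGT, so only GTA and TAC remain.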


2021 ◽  
Author(s):  
Valeria Gelardi ◽  
Alain Barrat ◽  
Nicolas Claidière

Networks are well-established representations of social systems, and temporal networks are widely used to study their dynamics. Temporal network data often consist of a succession of static networks over consecutive time windows whose length, however, is arbitrary, not necessarily corresponding to any intrinsic timescale of the system. Moreover, the resulting view of social network evolution is unsatisfactory: short time windows contain little information, whereas aggregating over large time windows blurs the dynamics. Going from a temporal network to a meaningful evolving representation of a social network therefore remains a challenge. Here we introduce a framework for that purpose: transforming temporal network data into an evolving weighted network where the weights of the links between individuals are updated at every interaction. Most importantly, this transformation takes into account the interdependence of social relationships due to the finite attention capacities of individuals: each interaction between two individuals not only reinforces their mutual relationship but also weakens their relationships with others. We study a concrete example of such a transformation and apply it to several data sets of social interactions. Using temporal contact data collected in schools, we show how our framework highlights specificities in their structure and temporal organization. We then introduce a synthetic perturbation into a data set of interactions in a group of baboons to show that it is possible to detect a perturbation in a social group on a wide range of timescales and parameters. Our framework brings new perspectives to the analysis of temporal social networks.
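A minimal sketch of such an update scheme follows. The abstract does not give the update rule, so the specific form used here is an assumption: the interacting pair's weight moves toward 1 by a fraction `alpha`, and every other link touching either participant decays by the same fraction. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def evolve_weights(interactions, alpha=0.1):
    """Turn (time, i, j) contact events into evolving link weights.

    Assumed rule: on each contact between i and j, their link is reinforced
    toward 1, and all other links of i or j are weakened by a factor (1 - alpha).
    """
    weights = defaultdict(float)
    for _, i, j in sorted(interactions, key=lambda event: event[0]):
        pair = frozenset((i, j))
        for edge in list(weights):
            if edge != pair and (i in edge or j in edge):
                weights[edge] *= 1 - alpha      # finite attention: others fade
        weights[pair] += alpha * (1 - weights[pair])  # reinforce this link
    return dict(weights)


events = [(1, "a", "b"), (2, "a", "c")]
final = evolve_weights(events, alpha=0.5)
```

Because weights change only when an interaction occurs, no time-window length has to be chosen, which is the point the abstract makes about arbitrary aggregation windows.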


2020 ◽  
Author(s):  
Max-Marcel Theilig ◽  
Ashley Arehart Knapp ◽  
Jennifer Nicholas ◽  
Rüdiger Zarnekow ◽  
David Curtis Mohr

Abstract Background: Using smartphones and wearable sensor technology has sparked a broad engagement of data science and machine learning methods to leverage the complex, assorted amount of data. Despite verified processes, there is a reported underdevelopment of user engagement concepts, and the desire for high accuracy or significance has been shown to lead to low explicability and irreproducibility. To overcome these issues, we aim to analyze principal characteristics of everyday behavior in digital mental health. Methods: We generated five latent features based on previous research, expert opinions from digital mental health, and informed by data. The features were analyzed with descriptive statistics and data visualization. We carried out two rounds of evaluations with data from 12,400 users of IntelliCare, a mental health platform with 12 apps. First, we focused on proof of concept and second, we assessed reproducibility by drawing conclusions from distribution differences. User data was drawn from both research trials and public deployment on Google Play. Results: Our algorithms showed increased rationale for the basic usage of apps with different underlying behavioral strategies. Measures of the distribution of user’s allocated attention, the user’s circadian behavior, their consecutive commitment to a specific strategy, and users’ interaction trajectory are perceived as transferable to the public data set. Because distributions between research trial and public deployment were similar, consistency was shown regarding the underlying behavioral strategies: psychoeducation and goal setting are used as a catalyst to overcome the users’ primary obstacles, sleep hygiene is addressed most regularly, while regular self-reflective thinking is avoided. Relaxation as well as cognitive reframing have increased variance in commitment among public users, indicating the challenging nature of these apps.
The relative course of the engagement (learning curve) is similar in research and public data. Conclusions: The deliberate, a-priori engineered features were reproducible across app users from both data sets. These features led to improved results as well as increased interpretability, providing an increased understanding of how people engage with multiple mental health apps over time. Since we based the generation of features on generic interaction proxies, these methods are applicable to other cases in artificial intelligence and digital health.


2021 ◽  
Vol 2 (2(58)) ◽  
pp. 16-19
Author(s):  
Ihor Polovynko ◽  
Lubomyr Kniazevich

The object of research is low-quality digital images. The presented work is devoted to the problem of digital processing of low quality images, which is one of the most important tasks of data science in the field of extracting useful information from a large data set. It is proposed to carry out the process of image enhancement by means of tonal processing of their Fourier images. The basis for this approach is the fact that Fourier images are described by brightness values in a wide range of values, which can be significantly reduced by gradation transformations. In this work, the Fourier transform of the image was carried out with the separation of the amplitude and phase. The important role of the phase in the process of forming the image obtained after the implementation of the inverse Fourier transform is shown. Although the information about the signal amplitude is lost during the phase analysis, nevertheless all the main details correspond accurately to the initial image. This suggests that when modifying the Fourier spectra of images, it is necessary to take into account the effect on both the amplitude and the phase of the object under study. The effectiveness of the proposed method is demonstrated by the example of space images of the Earth's surface. It is shown that after the gradation logarithmic Fourier transform of the image and the inverse Fourier transform, an image is obtained that has higher contrast than the original one, which will facilitate the work with it in the process of visual analysis. To explain the results obtained, the obtained gradation transformation was expanded into the Mercator series. It is shown that the resulting image consists of two parts. The first of them corresponds to the reproduction of the original image obtained by the inverse Fourier transform, and the second performs smoothing of its brightness, similar to the action of the combined method of spatial image enhancement.
When using the proposed method, preprocessing is also necessary, which, as a rule, includes operations necessary for centering the Fourier image, as well as converting the original data into floating point format.
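The processing chain described in this abstract (Fourier transform, separation of amplitude and phase, logarithmic gradation of the amplitude, inverse transform) can be sketched as below. The specific mapping log(1 + |F|) is an assumed, commonly used form of the logarithmic gradation transformation, not a formula given in the abstract. A naive transform over a tiny floating-point grid is used so that the example needs only the standard library; the centering step mentioned above matters mainly for displaying the spectrum and is omitted.

```python
import cmath
import math


def dft2(grid, inverse=False):
    """Naive 2-D discrete Fourier transform of a small rectangular grid."""
    rows, cols = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = sign * 2 * math.pi * (u * x / rows + v * y / cols)
                    total += grid[x][y] * cmath.exp(1j * angle)
            # The inverse transform carries the 1/(rows*cols) normalisation.
            out[u][v] = total / (rows * cols) if inverse else total
    return out


def log_enhance(image):
    """Compress the Fourier amplitude with log(1 + |F|), keep the phase, invert."""
    spectrum = dft2(image)
    compressed = [
        [cmath.rect(math.log1p(abs(c)), cmath.phase(c)) for c in row]
        for row in spectrum
    ]
    restored = dft2(compressed, inverse=True)
    return [[z.real for z in row] for row in restored]


image = [[0.0, 1.0], [2.0, 3.0]]   # pixel values already in floating point
enhanced = log_enhance(image)
```

The phase is passed through unchanged while only the amplitude is compressed, in line with the abstract's remark that the phase carries the main details of the image.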


Information ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 403
Author(s):  
Jiang Wu ◽  
Jiale Wang ◽  
Ao Zhan ◽  
Chengyu Wu

Falls are one of the main causes of elderly injuries. If the faller can be found in time, further injury can be effectively avoided. In order to protect personal privacy and improve the accuracy of fall detection, this paper proposes a fall detection algorithm using the CNN-Casual LSTM network based on three-axis acceleration and three-axis rotation angular velocity sensors. The neural network in this system includes an encoding layer, a decoding layer, and a ResNet18 classifier. Furthermore, the encoding layer includes three layers of CNN and three layers of Casual LSTM. The decoding layer includes three layers of deconvolution and three layers of Casual LSTM. The decoding layer maps spatio-temporal information to a hidden variable output that is more conducive to the work of the classification network, which is performed by ResNet18. Moreover, we used the public data set SisFall to evaluate the performance of the algorithm. The results of the experiments show that the algorithm achieves an accuracy of up to 99.79%.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Jann M. Weinand ◽  
Russell McKenna ◽  
Kai Mainzer

Abstract In the context of the energy transition, municipalities are increasingly attempting to exploit renewable energies. Socio-energetic data are required as input for municipal energy system analyses. This Data Descriptor provides a compilation of 40 indicators for all 11,131 German municipalities. In addition to census data such as population density, mobility data such as the number of vehicles and data on the potential of renewables such as wind energy are included. Most of the data set also contains public data, the allocation of which to municipalities was an extensive task. The data set can support addressing a wide range of energy-related research challenges. A municipality typology has already been developed with the data, and the resulting municipality grouping is also included in the data set.


2021 ◽  
Author(s):  
Annika Vogel ◽  
Ghazi Alessa ◽  
Robert Scheele ◽  
Lisa Weber ◽  
Stephanie Fiedler

Aerosols are known to affect atmospheric processes on a wide range of spatio-temporal scales, from dust storms reducing incoming solar radiation to aerosol-climate feedbacks. Although plenty of studies address aerosol radiative forcing, there are persistent differences in current aerosol estimates from both observations and models. Global reanalyses are able to provide consistent estimates of aerosol distributions by combining these two data sources. However, continuous assimilation of single satellite products forces the analyses towards the satellite's climatology, including possible inaccuracies. This study investigates differences between current estimates of aerosol optical depth (AOD) by addressing two questions: (1) How well do we know the large-scale spatio-temporal pattern of present-day AOD across state-of-the-art data? (2) How do current global aerosol reanalyses perform in comparison to other model- and observation-based data sets? To answer these questions, AOD from the global CAMS and MERRA-2 reanalyses is compared to 8 satellite products, 1 established climatology and 4 multi-model ensembles. The comprehensive data set used in this study allows the performance of individual products to be evaluated with respect to different spatial and temporal aspects. The evaluation covers results from 1998 to 2019, including the most recently available products like the climate model inter-comparison project CMIP6.
Spatially and temporally averaged AOD from MERRA-2 agrees well with the mean satellite climatology, while the CAMS climatology is higher than most other products. With relative standard deviations of about 11%, temporal variations of CAMS and MERRA-2 agree well with the mean satellite variation. However, averaged AOD from the individual satellites shows large differences, ranging from 0.124 for MISR to 0.164 for MODIS. In addition to average differences, spatial patterns vary significantly between the individual data sets. Because the CAMS reanalysis only assimilates AOD from MODIS, it remains close to the MODIS climatology, which overestimates AOD in most regions in comparison to other products. This overestimation is considerably increased over eastern China, where CAMS simulates regional values of more than 1.2 during summer. By assimilating both MODIS and MISR data, the MERRA-2 reanalysis is closer to the satellite mean under most conditions. Although annual deviations remain small compared to other models, MERRA-2 tends to underestimate AOD at the equator and overestimate AOD at higher latitudes, especially during the winter season. The spatio-temporal differences between individual aerosol data sets underline the need for further research on both satellite retrievals and model simulations for aerosols. For example, integrating multiple observations in a reanalysis system would make it possible to compensate for inaccuracies of the individual products. Further developing the multi-scale coupled ICON-ART system at the German Weather Service provides a promising environment to achieve accurate aerosol climatologies at high spatial resolution.


2020 ◽  
Author(s):  
Max-Marcel Theilig ◽  
Ashley Arehart Knapp ◽  
Jennifer Nicholas ◽  
Rüdiger Zarnekow ◽  
David Curtis Mohr

Abstract Background: Using smartphones and wearable sensor technology has sparked a broad engagement of data science and machine learning methods to leverage the complex, assorted amount of data. Despite verified processes, there is a reported underdevelopment of user engagement concepts, and the desire for high accuracy or significance has been shown to lead to low explicability and irreproducibility. To overcome these issues, we aim to analyze principal characteristics of everyday behavior in digital mental health. Methods: We generated five latent features based on previous research, expert opinions from digital mental health, and informed by data. The features were analyzed with descriptive statistics and data visualization. We carried out two rounds of evaluations with data from 12,400 users of IntelliCare, a mental health platform with 12 apps. First, we focused on proof of concept and second, we assessed reproducibility by drawing conclusions from distribution differences. User data was drawn from both research trials and public deployment on Google Play. Results: Our algorithms showed increased rationale for the basic usage of apps with different underlying behavioral strategies. Measures of the distribution of user’s allocated attention, the user’s circadian behavior, their consecutive commitment to a specific strategy, and users’ interaction trajectory curve are perceived as transferable to the public data set. Because distributions between research trial and public deployment were similar, consistency was shown regarding the underlying behavioral strategies: psychoeducation and goal setting are used as a catalyst to overcome the users’ primary obstacles, sleep hygiene is addressed most regularly, while regular emotional exposure is avoided. Relaxation as well as cognitive reframing have increased variance in commitment among public users, indicating the challenging nature of these apps.
The relative course of the engagement (learning curve) is similar in research and public data. Conclusions: The deliberate, a-priori engineered features were reproducible across app users from both data sets. These features led to improved results as well as increased interpretability, providing an increased understanding of how people engage with multiple mental health apps over time. Since we based the generation of features on generic interaction proxies, these methods are applicable to other cases in artificial intelligence and digital health.

