The Structure and Dynamics of Modern United States Federal Case Law

2022 ◽  
Vol 9 ◽  
Author(s):  
Keerthi Adusumilli ◽  
Bradford Brown ◽  
Joey Harrison ◽  
Matthew Koehler ◽  
Jason Kutarnia ◽  
...  

The structure and dynamics of modern United States Federal Case Law are examined here. The analyses utilize large-scale network analysis tools, natural language processing techniques, and information theory to examine all the federal opinions in the Court Listener database, containing approximately 1.3 million judicial opinions and 11.4 million citations. The analyses focus on modern United States Federal Case Law, as cases in the Court Listener database span approximately 1926–2020 and include most Federal jurisdictions. We examine the data set from a structural perspective using the citation network, overall and by time and space (jurisdiction). In addition to citation structure, we examine the data set from a topical and information-theoretic perspective, again overall and by time and space.
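
A minimal sketch of the kind of citation-network analysis described above, assuming a simple (citing, cited) edge list; the opinion identifiers and the choice of centrality measures here are illustrative assumptions, not the authors' code or the Court Listener schema.

```python
# Sketch: build a citation network of judicial opinions and summarize
# its structure. Toy edges stand in for ~11.4M real citation pairs.
import networkx as nx

citations = [("op1", "op2"), ("op1", "op3"), ("op2", "op3"), ("op4", "op3")]

G = nx.DiGraph()
G.add_edges_from(citations)  # edge u -> v means opinion u cites opinion v

# Structural summaries analogous to those used in the paper.
in_degree = dict(G.in_degree())        # how often each opinion is cited
pagerank = nx.pagerank(G, alpha=0.85)  # influence within the network

most_cited = max(in_degree, key=in_degree.get)
print(most_cited, in_degree[most_cited], pagerank[most_cited])
```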

2021 ◽  
Vol 11 (7) ◽  
pp. 3094
Author(s):  
Vitor Fortes Rey ◽  
Kamalveer Kaur Garewal ◽  
Paul Lukowicz

Human activity recognition (HAR) using wearable sensors has benefited much less from recent advances in Deep Learning than fields such as computer vision and natural language processing. This is, to a large extent, due to the lack of large-scale (compared to computer vision) repositories of labeled training data for sensor-based HAR tasks. Thus, for example, ImageNet has images for around 100,000 categories (based on WordNet), with on average 1000 images per category (up to 100,000,000 samples in total). The Kinetics-700 video activity data set has 650,000 video clips covering 700 different human activities (over 1800 h in total). By contrast, the total length of all sensor-based HAR data sets in the popular UCI machine learning repository is less than 63 h, with around 38 h consisting of simple modes of locomotion such as walking, standing, or cycling. In our research, we aim to facilitate the use of online videos, which exist in ample quantities for most activities and are much easier to label than sensor data, to simulate labeled wearable motion sensor data. In previous work we demonstrated preliminary results in this direction, focusing on very simple, activity-specific simulation models and a single sensor modality (acceleration norm). In this paper, we show how we can train a regression model on generic motions for both accelerometer and gyro signals and then apply it to videos of the target activities to generate synthetic Inertial Measurement Unit (IMU) data (acceleration and gyro norms) that can be used to train and/or improve HAR models. We demonstrate that systems trained on simulated data generated by our regression model come to within around 10% of the mean F1 score of a system trained on real sensor data. Furthermore, we show that by either including a small amount of real sensor data for model calibration or simply leveraging the fact that, in general, we can generate much more simulated data from video than we can collect as real recordings, the remaining advantage of real sensor data can eventually be equalized.
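
A minimal sketch of the regression idea: map video-derived motion features to a sensor-signal target. The feature design, regressor choice, and synthetic arrays are assumptions for illustration, not the authors' actual model.

```python
# Sketch: regress a simulated acceleration norm from per-frame features
# (e.g., pose-keypoint velocity/acceleration components from video).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-in for features extracted from "generic motion" videos.
X_train = rng.normal(size=(5000, 6))
# Stand-in for the time-aligned real sensor target: acceleration norm.
y_train = np.linalg.norm(X_train[:, :3], axis=1) + rng.normal(0, 0.1, 5000)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Applied to pose features from target-activity videos, the model emits
# simulated acceleration norms usable as HAR training data.
X_video = rng.normal(size=(100, 6))
simulated_accel_norm = model.predict(X_video)
```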


Biostatistics ◽  
2020 ◽  
Author(s):  
W Katherine Tan ◽  
Patrick J Heagerty

Summary: Scalable and accurate identification of specific clinical outcomes has been enabled by machine learning applied to electronic medical record systems. The development of classification models requires the collection of a complete labeled data set, where true clinical outcomes are obtained by human expert manual review. For example, the development of natural language processing algorithms requires the abstraction of clinical text data to obtain the outcome information necessary for training models. However, if the outcome is rare, then simple random sampling results in very few cases and insufficient information to develop accurate classifiers. Since large-scale detailed abstraction is often expensive, time-consuming, and not feasible, more efficient strategies are needed. Under such resource-constrained settings, we propose a class of enrichment sampling designs, where selection for abstraction is stratified by auxiliary variables related to the true outcome of interest. Stratified sampling on highly specific variables results in targeted samples that are more enriched with cases, which we show translates to increased model discrimination and better statistical learning performance. We provide mathematical details and simulation evidence that links sampling designs to their resulting prediction model performance. We discuss the impact of our proposed sampling on both model training and validation. Finally, we illustrate the proposed designs for outcome label collection and subsequent machine learning, using radiology report text data from the Lumbar Imaging with Reporting of Epidemiology study.
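
A toy sketch of the enrichment-sampling idea: stratify chart selection on a cheap auxiliary flag (for instance an NLP keyword hit) that is highly specific for the rare outcome, rather than sampling at random. The budget split and the flag prevalence below are illustrative assumptions, not the paper's designs.

```python
# Sketch: oversample the auxiliary-flagged stratum so manual review
# sees more true cases; keep per-stratum fractions for reweighting.
import random

random.seed(1)
records = [{"id": i, "aux_flag": random.random() < 0.05} for i in range(10000)]

flagged = [r for r in records if r["aux_flag"]]
unflagged = [r for r in records if not r["aux_flag"]]

budget = 500
n_flagged = min(len(flagged), int(0.6 * budget))  # oversample enriched stratum
n_unflagged = budget - n_flagged

to_abstract = random.sample(flagged, n_flagged) + random.sample(unflagged, n_unflagged)
# `to_abstract` goes to expert review; the known sampling fractions per
# stratum allow unbiased reweighting during model training/validation.
print(len(to_abstract), n_flagged, n_unflagged)
```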


Atmosphere ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 482 ◽  
Author(s):  
Abdoulaye Sy ◽  
Christophe Duroure ◽  
Jean-Luc Baray ◽  
Yahya Gour ◽  
Joël Van Baelen ◽  
...  

The rain statistics of the 0–45° N area, including equatorial, Sahelian, and mid-latitude regions, are studied using the probability distributions of the durations of rainy and dry events. Long-term daily data sets from ground measurements and satellite observations of rain fields are used. This technique highlights a sharp latitudinal transition in the statistics between the equatorial region and all other regions (Sahel, mid-latitude). The probability distribution of the 8° S to 8° N latitude band shows large-scale organization, with slowly decreasing (power-law) distributions for the temporal and spatial sizes of rain events. This observation is in agreement with a scaling, or macro-turbulent, behavior of rain fields in equatorial regions. For the Sahelian and mid-latitude regions, our observations are clearly not in agreement with this behavior. They show that the largest rain systems have a limited temporal and spatial size, well described by a decreasing exponential distribution. For these non-equatorial regions it is possible to define a local characteristic duration and a characteristic horizontal size of large rain events. These characteristic time and space scales of observed mesoscale convective systems could be a sensitive indicator for detecting possible trends in rain distribution properties due to anthropogenic influence.
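
A minimal sketch of the distributional comparison described above: fit an exponential and a power law to event durations by maximum likelihood and compare fits. The estimators are textbook MLEs; the synthetic data and cutoff are illustrative assumptions.

```python
# Sketch: exponential vs power-law fit to event durations.
import numpy as np

rng = np.random.default_rng(0)
durations = rng.exponential(scale=2.0, size=2000)  # stand-in for event lengths (days)
durations = durations[durations >= 0.5]            # minimum resolvable duration

# Exponential MLE: rate = 1 / mean.
lam = 1.0 / durations.mean()
ll_exp = np.sum(np.log(lam) - lam * durations)

# Power-law (Pareto) MLE above a cutoff x_min (Hill estimator).
x_min = durations.min()
alpha = 1.0 + len(durations) / np.sum(np.log(durations / x_min))
ll_pow = np.sum(np.log((alpha - 1) / x_min) - alpha * np.log(durations / x_min))

# Higher power-law likelihood suggests the scale-free "macro-turbulent"
# regime; higher exponential likelihood suggests a characteristic event
# size, as found outside the equatorial band.
print(f"exponential ll={ll_exp:.1f}, power-law ll={ll_pow:.1f}")
```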


Land ◽  
2022 ◽  
Vol 11 (1) ◽  
pp. 123
Author(s):  
Nathan Morrow ◽  
Nancy B. Mock ◽  
Andrea Gatto ◽  
Julia LeMense ◽  
Margaret Hudson

Localized, actionable evidence for addressing threats to the environment and human security lacks a comprehensive conceptual frame that incorporates the challenges associated with active conflicts. A protective-pathways frame, linking the previously disciplinarily divided literatures on environmental security, human security, and resilience and identifying their key relationships, is used to analyze a novel, unstructured data set of Global Environment Facility (GEF) programmatic documents. Sub-national geospatial analysis of GEF documentation relating to projects in Africa finds that 73% of districts with GEF land degradation projects were co-located with active conflict events. This study applies Natural Language Processing to a unique data set of 1500 GEF evaluations to identify text entities associated with conflict, as sketched below. Additional project case studies explore the sequences and relationships of environmental and human security concepts that lead to project success or failure. Differences between biodiversity and climate change projects are discussed, but political crisis, poverty, and disaster emerged as the entities most frequently extracted in association with conflict in environmental protection projects. Insecurity weakened institutions and fractured communities, leading both directly and indirectly to conflict-related damage to environmental programming and desired outcomes. Simple causal explanations, found to be inconsistent in previous large-scale statistical studies, also inadequately describe the dynamics and relationships found in the extracted text entities and case summaries. Emergent protective pathways that emphasize poverty and conflict reduction, facilitated by institutional strengthening and inclusion, present promising possibilities. Future research with innovative machine learning and other techniques for working with unstructured data may provide additional evidence for implementing actions that address climate change and environmental degradation while strengthening resilience and human security. Resilient, participatory, and polycentric governance is key to fostering this process.
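
The paper's exact NLP pipeline is not specified in the abstract, so the following is a hedged illustration of one way to extract conflict-associated entities from evaluation text, using spaCy NER plus keyword co-occurrence; the term list and model choice are assumptions.

```python
# Sketch: count named entities in sentences that also mention conflict.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
CONFLICT_TERMS = {"conflict", "violence", "insecurity", "crisis"}  # illustrative

def conflict_entities(doc_text: str) -> Counter:
    """Count entity labels in sentences containing conflict terms."""
    counts = Counter()
    for sent in nlp(doc_text).sents:
        if any(tok.lower_ in CONFLICT_TERMS for tok in sent):
            counts.update(ent.label_ for ent in sent.ents)
    return counts

sample = ("The land degradation project stalled after the political crisis. "
          "Violence displaced staff from the district office in Mali.")
print(conflict_entities(sample))
```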


2021 ◽  
Author(s):  
Ari Z. Klein ◽  
Steven Meanley ◽  
Karen O’Connor ◽  
José A. Bauermeister ◽  
Graciela Gonzalez-Hernandez

Abstract
Background: Pre-exposure prophylaxis (PrEP) is highly effective at preventing the acquisition of Human Immunodeficiency Virus (HIV). There is a substantial gap, however, between the number of people in the United States who have indications for PrEP and the number who are prescribed PrEP. While Twitter content has been analyzed as a source of PrEP-related data (e.g., barriers), methods have not been developed to enable the use of Twitter as a platform for implementing PrEP-related interventions.
Objective: Men who have sex with men (MSM) are the population most affected by HIV in the United States. Therefore, the objective of this study was to develop and assess an automated natural language processing (NLP) pipeline for identifying men in the United States who have reported on Twitter that they are gay, bisexual, or MSM.
Methods: Between September 2020 and January 2021, we used the Twitter Streaming Application Programming Interface (API) to collect more than 3 million tweets containing keywords that men may include in posts reporting that they are gay, bisexual, or MSM. We deployed handwritten, high-precision regular expressions on the tweets and their user profile metadata, designed to filter out noise and identify actual self-reports. We identified 10,043 unique users geolocated in the United States, and drew upon a validated NLP tool to automatically identify their ages.
Results: Based on manually distinguishing true and false positive self-reports in the tweets or profiles of 1000 of the 10,043 users identified by our automated pipeline, our pipeline has a precision of 0.85. Among the 8756 users for whom a United States state-level geolocation was detected, 5096 (58.2%) are in the 10 states with the highest numbers of new HIV diagnoses. Among the 6240 users for whom a county-level geolocation was detected, 4252 (68.1%) are in counties or states considered priority jurisdictions by the Ending the HIV Epidemic (EHE) initiative. Furthermore, the majority of the users are in the same two age groups as the majority of MSM in the United States with new HIV diagnoses.
Conclusions: Our automated NLP pipeline can be used to identify MSM in the United States who may be at risk for acquiring HIV, laying the groundwork for using Twitter on a large scale to target PrEP-related interventions directly at this population.
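
The study's handwritten expressions are not reproduced in the abstract; the single pattern below is a hedged illustration of the high-precision, first-person style of matching described, trading recall for precision.

```python
# Sketch: one illustrative high-precision self-report pattern.
import re

SELF_REPORT = re.compile(
    r"\bi(?:'m|\s+am)\s+(?:a\s+)?(?:gay|bi(?:sexual)?)\s+(?:man|guy|dude)\b",
    re.IGNORECASE,
)

tweets = [
    "proud that i'm a gay man in tech",
    "my brother is gay",                     # third person: should not match
    "I am a bisexual guy, ask me anything",
]
for t in tweets:
    if SELF_REPORT.search(t):
        print("self-report:", t)
```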


2011 ◽  
Vol 37 (4) ◽  
pp. 753-809 ◽  
Author(s):  
David Vadas ◽  
James R. Curran

Noun phrases (NPs) are a crucial part of natural language, and can have a very complex structure. However, this NP structure is largely ignored by the statistical parsing field, as the most widely used corpus is not annotated with it. This lack of gold-standard data has restricted previous efforts to parse NPs, making it impossible to perform the supervised experiments that have achieved high performance in so many Natural Language Processing (NLP) tasks. We comprehensively solve this problem by manually annotating NP structure for the entire Wall Street Journal section of the Penn Treebank. The inter-annotator agreement scores that we attain dispel the belief that the task is too difficult, and demonstrate that consistent NP annotation is possible. Our gold-standard NP data is now available for use in all parsers. We experiment with this new data, applying the Collins (2003) parsing model, and find that its recovery of NP structure is significantly worse than its overall performance. The parser's F-score is up to 5.69% lower than a baseline that uses deterministic rules. Through much experimentation, we determine that this result is primarily caused by a lack of lexical information. To solve this problem we construct a wide-coverage, large-scale NP Bracketing system. With our Penn Treebank data set, which is orders of magnitude larger than those used previously, we build a supervised model that achieves excellent results. Our model performs at 93.8% F-score on the simple task that most previous work has undertaken, and extends to bracket longer, more complex NPs that are rarely dealt with in the literature. We attain 89.14% F-score on this much more difficult task. Finally, we implement a post-processing module that brackets NPs identified by the Bikel (2004) parser. Our NP Bracketing model includes a wide variety of features that provide the lexical information that was missing during the parser experiments, and as a result, we outperform the parser's F-score by 9.04%. These experiments demonstrate the utility of the corpus, and show that many NLP applications can now make use of NP structure.
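
A minimal sketch of a deterministic right-branching baseline of the kind such parsers are compared against; the paper's exact rules are not reproduced here, so this is an illustration of why lexical information matters.

```python
# Sketch: bracket an NP right-recursively, e.g. (w1 (w2 (w3 w4))).
def right_branching(tokens):
    if len(tokens) <= 2:
        return tuple(tokens)
    return (tokens[0], right_branching(tokens[1:]))

# "lung cancer deaths": right branching gives (lung, (cancer, deaths)),
# but the correct analysis is ((lung cancer) deaths), a left branch;
# such cases are exactly where lexical information is required.
print(right_branching(["lung", "cancer", "deaths"]))
```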


Data & Policy ◽  
2021 ◽  
Vol 3 ◽  
Author(s):  
Francisco Rowe ◽  
Michael Mahony ◽  
Eduardo Graells-Garrido ◽  
Marzia Rango ◽  
Niklas Sievers

Abstract Large-scale coordinated efforts have been dedicated to understanding the global health and economic implications of the COVID-19 pandemic. Yet, the rapid spread of discrimination and xenophobia against specific populations has largely been neglected. Understanding public attitudes toward migration is essential to counter discrimination against immigrants and promote social cohesion. Traditional data sources for monitoring public opinion are often limited, notably due to slow collection and release activities. New forms of data, particularly from social media, can help overcome these limitations. While some bias exists, social media data are produced at an unprecedented temporal frequency and geographical granularity, are collected globally, and are accessible in real time. Drawing on a data set of 30.39 million tweets and natural language processing, this article aims to measure shifts in public sentiment about migration during the early stages of the COVID-19 pandemic in Germany, Italy, Spain, the United Kingdom, and the United States. Results show an increase in migration-related tweets along with COVID-19 cases during national lockdowns in all five countries. Yet, we found no evidence of a significant increase in anti-immigration sentiment, as rises in the volume of negative messages were offset by comparable increases in positive messages. Additionally, we present evidence of growing social polarization concerning migration, with high concentrations of strongly positive and strongly negative sentiments.
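
A minimal sketch of the measurement idea: score tweets daily and track volume alongside positive and negative counts. VADER stands in for the article's unspecified sentiment model, and the thresholds and example tweets are illustrative assumptions.

```python
# Sketch: daily tweet volume and sentiment counts.
# Requires: pip install nltk && python -c "import nltk; nltk.download('vader_lexicon')"
from collections import defaultdict
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
tweets = [  # stand-ins for geolocated, migration-related tweets
    ("2020-03-15", "immigrants are carrying this community through the crisis"),
    ("2020-03-15", "close the borders now"),
    ("2020-03-16", "grateful for migrant health workers"),
]

daily = defaultdict(lambda: {"n": 0, "pos": 0, "neg": 0})
for day, text in tweets:
    score = sia.polarity_scores(text)["compound"]  # in [-1, 1]
    daily[day]["n"] += 1
    daily[day]["pos"] += score >= 0.05
    daily[day]["neg"] += score <= -0.05

# Offsetting rises in `pos` and `neg` with growing `n` match the
# pattern reported: more volume, no net anti-immigration shift.
print(dict(daily))
```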


2021 ◽  
Vol 19 (3) ◽  
pp. e23
Author(s):  
Sizhuo Ouyang ◽  
Yuxing Wang ◽  
Kaiyin Zhou ◽  
Jingbo Xia

Currently, the coronavirus disease 2019 (COVID-19) literature has been increasing dramatically, and the growing volume of text makes large-scale text mining and knowledge discovery possible. Curation of these texts has therefore become a crucial issue for the biomedical Natural Language Processing (BioNLP) community, in order to retrieve important information about the mechanisms of COVID-19. PubAnnotation is an aligned annotation system that provides an efficient platform for biological curators to upload their annotations or merge other external annotations. Motivated by the value of integrating multiple useful COVID-19 annotations, we merged three annotation resources onto the LitCovid data set and constructed a cross-annotated corpus, LitCovid-AGAC. This corpus consists of 12 labels (Mutation, Species, Gene, and Disease from PubTator; GO and CHEBI from OGER; Var, MPA, CPA, NegReg, PosReg, and Reg from AGAC) over 50,018 COVID-19 abstracts in LitCovid. It contains abundant information that may help unveil hidden knowledge about the pathological mechanisms of COVID-19.
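
A minimal sketch of merging annotation sets in PubAnnotation-style JSON, where each document carries a "denotations" list of {id, span, obj} entries; the merge policy (union with exact-duplicate removal) is an assumption, not the documented LitCovid-AGAC procedure.

```python
# Sketch: union denotations from several annotation sources over one
# text, dropping exact duplicates (re-numbering ids is omitted).
def merge_annotations(base_doc, *other_docs):
    merged = dict(base_doc)
    merged["denotations"] = list(base_doc.get("denotations", []))
    seen = {(d["span"]["begin"], d["span"]["end"], d["obj"])
            for d in merged["denotations"]}
    for doc in other_docs:
        for d in doc.get("denotations", []):
            key = (d["span"]["begin"], d["span"]["end"], d["obj"])
            if key not in seen:
                seen.add(key)
                merged["denotations"].append(d)
    return merged

pubtator = {"text": "ACE2 receptor", "denotations": [
    {"id": "T1", "span": {"begin": 0, "end": 4}, "obj": "Gene"}]}
oger = {"text": "ACE2 receptor", "denotations": [
    {"id": "T1", "span": {"begin": 5, "end": 13}, "obj": "GO"}]}
print(merge_annotations(pubtator, oger))
```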


2021 ◽  
Author(s):  
Francisco Rowe ◽  
Michael Mahony ◽  
Eduardo Graells-Garrido ◽  
Marzia Rango ◽  
Niklas Sievers

In 2020, the world faced an unprecedented challenge to tackle and understand the spread and impacts of COVID-19. Large-scale coordinated efforts have been dedicated to understanding the global health and economic implications of the pandemic. Yet, the rapid spread of discrimination and xenophobia against specific populations, particularly migrants and individuals of Asian descent, has largely been neglected. Understanding public attitudes towards migration is essential to counter discrimination against immigrants and promote social cohesion. Traditional data sources for monitoring public opinion – ethnographies, interviews, and surveys – are often limited due to small samples, high cost, low temporal frequency, slow collection and release, and coarse spatial resolution. New forms of data, particularly from social media, can help overcome these limitations. While some bias exists, social media data are produced at an unprecedented temporal frequency and geographical granularity, are collected globally, and are accessible in real time. Drawing on a data set of 30.39 million tweets and natural language processing, this paper aims to measure shifts in public sentiment about migration during the early stages of the COVID-19 pandemic in Germany, Italy, Spain, the United Kingdom, and the United States. Results show an increase in migration-related tweets along with COVID-19 cases during national lockdowns in all five countries. Yet, we found no evidence of a significant increase in anti-immigration sentiment, as rises in the volume of negative messages were offset by comparable increases in positive messages. Additionally, we present evidence of growing social polarisation concerning migration, with high concentrations of strongly positive and strongly negative sentiments.
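
Complementing the sentiment-tracking sketch after the published version of this abstract above, the following is a hedged illustration of one way to quantify the polarisation finding: the share of tweets at the extremes of the sentiment scale. The cut-offs (|score| >= 0.6) and scores are illustrative assumptions.

```python
# Sketch: mass in the sentiment-distribution tails as a polarisation signal.
import numpy as np

scores = np.array([-0.9, -0.8, -0.7, 0.1, 0.0, 0.8, 0.9, 0.85, -0.75, 0.05])

strong_pos = np.mean(scores >= 0.6)
strong_neg = np.mean(scores <= -0.6)
moderate = 1.0 - strong_pos - strong_neg

# Rising mass in both tails with a thinning middle indicates the
# growing polarisation the analysis reports.
print(f"strong+ {strong_pos:.0%}, strong- {strong_neg:.0%}, middle {moderate:.0%}")
```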

