Is there such a thing as landscape genetics?

Mapping Intimacies ◽

10.1101/018192 ◽

2015 ◽

Author(s):

Rodney J. Dyer

Keyword(s):

Population Genetics ◽

Natural Language Processing ◽

Language Processing ◽

Landscape Genetics ◽

Population Based ◽

Scientific Discipline ◽

Genetic Studies ◽

Analytical Work ◽

Statistical Approaches ◽

Landscape Genetic

Abstract For a scientific discipline to be interdisciplinary it must satisfy two conditions: it must consist of contributions from at least two existing disciplines, and it must be able to provide insights, through this interaction, that neither progenitor discipline could address. In this paper, I examine the complete body of peer-reviewed literature self-identified as landscape genetics using the statistical approaches of text mining and natural language processing. The goal here is to quantify the kinds of questions being addressed in landscape genetic studies, the ways in which questions are evaluated mechanistically, and how they are differentiated from the progenitor disciplines of landscape ecology and population genetics. I then circumscribe the main factions within published landscape genetic papers, examining the extent to which emergent questions are being addressed and highlighting a deep bifurcation between existing individual- and population-based approaches. I close by providing some suggestions on where theoretical and analytical work is needed if landscape genetics is to serve as a real bridge connecting evolution and ecology sensu lato.

P311 Detection and characterisation of extra-intestinal manifestations of IBD in clinical office notes using natural language processing

Journal of Crohn's and Colitis ◽

10.1093/ecco-jcc/jjz203.440 ◽

2020 ◽

Vol 14 (Supplement_1) ◽

pp. S309-S310

Author(s):

R Stidham ◽

D Yu ◽

S Lahiri ◽

V Vydiswaran

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Past History ◽

Model Development ◽

Population Based ◽

Equal Weight ◽

Disease Experience ◽

Status Classification ◽

Therapeutic Decision Making

Abstract Background Extra-Intestinal Manifestations (EIM) occur in nearly 40% of patients with IBD and impact both disease experience and therapeutic decision-making, but are not well captured by administrative codes. We aimed to pilot computational natural language processing (NLP) methods to characterise EIMs using consultant notes. Methods Subjects with a diagnosis of IBD were identified in a single-centre retrospective review of electronic health records (EHR) between 2014 and 2017. Gastroenterology (GI) notes were annotated by two reviewers for the presence and activity of EIMs. EIM concepts were identified using NLP methods leveraging UMLS libraries and hand-crafted features. EIM characterisation occurred within a ±25-word window around identified EIMs, with classifications including inactive concepts (negated, historical, resolved) and active concepts (improved, worsened, active but unchanged). When an EIM was referenced repeatedly in a document, status inference used section-based weighting, with weights ranked from greatest to least for assessment/plan, subjective, past history, exam, and other. EIM status was classified as ambiguous when multiple conflicting references of approximately equal weight were present within the same document. Model development and testing used an 80/20 dataset split. Results In 4108 unique IBD patients, 1640 (39.9%) had at least 1 EIM identified. The mean age was 41.9 years, 47.2% were male, and 27.0% had biologic exposure. The 1240 manually annotated documents (first GI notes) comprised 51.1% arthritis, 16.5% ocular, 16.2% psoriasis, with erythema nodosum (EN), pyoderma gangrenosum (PG), and hidradenitis suppurativa (HS) together comprising 16.2% of the cohort.
NLP models performed well for correctly classifying both EIM presence and status in a testing set, with overall accuracy, sensitivity, and specificity of 91.2%, 92.9% and 81.8% across all EIMs in notes automatically classified as non-ambiguous (Table 1). NLP methods identified EIM status classification as ambiguous in 38.9% of cases. Conclusion NLP methods can detect and classify EIMs with reasonable performance and efficiency compared with traditional manual chart review. Though source document variation and ambiguity present challenges, NLP offers exciting possibilities for population-based research and decision support.
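The section-based weighting described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: only the ordering of the sections comes from the abstract, while the numeric weights, the summing rule, the `tolerance` threshold, and the names `SECTION_WEIGHTS` and `infer_status` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical weights: the abstract gives only the ranking
# (assessment/plan > subjective > past history > exam > other).
SECTION_WEIGHTS = {
    "assessment_plan": 5.0,
    "subjective": 4.0,
    "past_history": 3.0,
    "exam": 2.0,
    "other": 1.0,
}


def infer_status(mentions, tolerance=0.5):
    """Infer one status for an EIM from (section, status) mentions in a document.

    Weights of mentions that agree on a status are summed (an assumption).
    If the two best-supported statuses differ by no more than `tolerance`,
    the result is "ambiguous". Returns None when there are no mentions.
    """
    totals = defaultdict(float)
    for section, status in mentions:
        totals[status] += SECTION_WEIGHTS.get(section, SECTION_WEIGHTS["other"])
    if not totals:
        return None
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= tolerance:
        return "ambiguous"
    return ranked[0][0]
```

With these assumed weights, an "active" mention in the assessment/plan section outweighs a "historical" mention in past history, while two conflicting mentions in the exam section yield "ambiguous".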

Haplotype-level DNA metabarcoding from freshwater macroinvertebrate community samples

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64738 ◽

2021 ◽

Vol 4 ◽

Author(s):

Joeselle Serrana ◽

Kozo Watanabe

Keyword(s):

Population Genetics ◽

Relative Abundance ◽

Intraspecific Variability ◽

Population Based ◽

Freshwater Ecosystems ◽

Genetic Studies ◽

Cycle Number ◽

Dna Metabarcoding ◽

Dna Template ◽

Mixed Community

DNA metabarcoding is a robust method for environmental impact assessments of freshwater ecosystems that enables the simultaneous multi-species identification of complex mixed community samples from different origins using extracellular and total genomic DNA. The development and evaluation of DNA metabarcoding protocols for haplotype-level resolution require attention, specifically for basic population genetic applications, i.e., analyses that allow estimation of the genetic diversity and dispersal abilities of the species present in bulk community samples. Several studies have proposed using DNA metabarcoding for population genetics, and a few have provided preliminary applications and proofs of concept, each restricted to particular taxa. However, further exploration and assessment of the laboratory and bioinformatics strategies are required to unlock the potential of metabarcoding-based population-level ecological assessments. Here, we assessed the ability to infer haplotype information of freshwater macroinvertebrate species from DNA metabarcoding community sequence data. Using mock samples with known Sanger-sequenced haplotypes, we also assayed the effects of PCR cycle number on the detection and reduction of spurious haplotypes obtained from DNA metabarcoding. We tested our haplotyping strategy on a mock sample containing 20 specimens from four species with known haplotypes based on the 658-bp Folmer region of the mitochondrial cytochrome c oxidase (mtCOI) gene. The read processing and denoising step resulted in 14 zero-radius operational taxonomic units (ZOTUs) of 421-bp length, with 12 ZOTUs having a 100% match with 12 of the mock haplotype sequences. The remaining eight haplotypes that were not detected from the DNA metabarcoding dataset were all the A. decemseta samples (0.01, 0.05, 0.10 ng/μL DNA template concentrations), two E. bulba (0.01 and 0.05 ng/μL), E. latifolium (0.01 ng/μL), and two K. tibialis (0.01 and 0.10 ng/μL).
Given that most of the undetected samples had low concentrations, we report the influence of initial DNA template concentration on the amplification from a mock community sample. Our observation is in accordance with previous studies that reported that samples or taxa with low DNA template concentrations have lower detection probability. Accordingly, abundant taxa or samples with high biomass tend to have higher detection probabilities than those that are rare, small, or low in biomass in mixed-community samples. The difference in biomass affects haplotype detection, since most of the large specimens would be retained after read processing. Hence, these factors need to be addressed when metabarcoding-based haplotyping is to be used for abundance-based analyses in population genetics applications. The phylogenetic analysis (Fig. 1) revealed that the two ZOTUs without taxonomic matches clustered with one of the species from the mock sample. This supports our observation that only the samples with low concentration were unrepresented in the DNA metabarcoding data. Although we still reported false positive detections because two of the 14 ZOTUs failed to have a 100% match with the mock reference sequences, we could at least identify them as A. decemseta sequences based on the phylogenetic approach. The number of quality-passing reads increased with increasing cycle number, and the relative abundance of each ZOTU was consistent across cycle numbers. This suggests that increasing the cycle number from 24 to 64 did not affect the relative abundance of quality-passing reads. Our study demonstrated that DNA metabarcoding data could be used to infer intraspecific variability, showing promise for possible applications in population-based genetic studies.
As DNA metabarcoding becomes more established and laboratory protocols and bioinformatics pipelines are continuously being developed, our proof-of-concept study demonstrated that the method could be used to infer intraspecific variability, showing promise for possible applications in population-based genetic studies.
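The comparison of denoised ZOTUs against known mock haplotypes can be illustrated with a small sketch. It assumes that a "100% match" means the shorter ZOTU (421 bp in the abstract) occurs as an exact substring of a longer reference haplotype (658 bp); the function names `match_zotus` and `summarize` and the toy sequences in the usage note are hypothetical and are not taken from the study.

```python
def match_zotus(zotus, references):
    """Map each ZOTU id to the reference ids that contain it exactly.

    Both arguments are dicts of id -> nucleotide sequence. Matching is
    case-insensitive exact substring search (an assumed definition of a
    100% match for a fragment shorter than the reference).
    """
    hits = {}
    for zotu_id, zotu_seq in zotus.items():
        query = zotu_seq.upper()
        hits[zotu_id] = [
            ref_id for ref_id, ref_seq in references.items()
            if query in ref_seq.upper()
        ]
    return hits


def summarize(hits, references):
    """List ZOTUs with no reference match and references hit by no ZOTU."""
    matched_refs = {ref_id for ref_ids in hits.values() for ref_id in ref_ids}
    return {
        "unmatched_zotus": sorted(z for z, ref_ids in hits.items() if not ref_ids),
        "undetected_references": sorted(set(references) - matched_refs),
    }
```

For example, with toy references `{"hapA": "ACGTACGTAA", "hapB": "ACGTTCGTAA", "hapC": "GGGTACGTCC"}` and ZOTUs `{"Z1": "CGTACGTA", "Z2": "CGTTCGTA", "Z3": "CGTTCGAA"}`, `Z3` is reported as unmatched and `hapC` as undetected. A ZOTU can match more than one reference when the references differ only outside the sequenced fragment, so the hit lists are not guaranteed to have a single entry.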

Using natural language processing for identification of herpes zoster ophthalmicus cases to support population-based study

Clinical and Experimental Ophthalmology ◽

10.1111/ceo.13340 ◽

2018 ◽

Vol 47 (1) ◽

pp. 7-14 ◽

Cited By ~ 2

Author(s):

Chengyi Zheng ◽

Yi Luo ◽

Cheryl Mercado ◽

Lina Sy ◽

Steven J Jacobsen ◽

...

Keyword(s):

Natural Language Processing ◽

Herpes Zoster ◽

Natural Language ◽

Language Processing ◽

Population Based ◽

Herpes Zoster Ophthalmicus ◽

Population Based Study

Using natural language processing to construct a metastatic breast cancer cohort from linked cancer registry and electronic medical records data

JAMIA Open ◽

10.1093/jamiaopen/ooz040 ◽

2019 ◽

Vol 2 (4) ◽

pp. 528-537 ◽

Cited By ~ 2

Author(s):

Albee Y Ling ◽

Allison W Kurian ◽

Jennifer L Caswell-Jin ◽

George W Sledge ◽

Nigam H Shah ◽

...

Keyword(s):

Breast Cancer ◽

Metastatic Breast Cancer ◽

Natural Language Processing ◽

Electronic Medical Records ◽

Language Processing ◽

Cancer Diagnosis ◽

Medical Records ◽

De Novo ◽

Metastatic Breast ◽

Population Based

Abstract Objectives Most population-based cancer databases lack information on metastatic recurrence. Electronic medical records (EMR) and cancer registries contain complementary information on cancer diagnosis, treatment and outcome, yet are rarely used synergistically. To construct a cohort of metastatic breast cancer (MBC) patients, we applied natural language processing techniques within a semisupervised machine learning framework to linked EMR-California Cancer Registry (CCR) data. Materials and Methods We studied all female patients treated at Stanford Health Care with an incident breast cancer diagnosis from 2000 to 2014. Our database consisted of structured fields and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results Program (SEER). We identified de novo MBC patients from CCR and extracted information on distant recurrences from patient notes in EMR. Furthermore, we trained a regularized logistic regression model for recurrent MBC classification and evaluated its performance on a gold standard set of 146 patients. Results There were 11 459 breast cancer patients in total and the median follow-up time was 96.3 months. We identified 1886 MBC patients, 512 (27.1%) of whom were de novo MBC patients and 1374 (72.9%) were recurrent MBC patients. Our final MBC classifier achieved an area under the receiver operating characteristic curve (AUC) of 0.917, with sensitivity 0.861, specificity 0.878, and accuracy 0.870. Discussion and Conclusion To enable population-based research on MBC, we developed a framework for retrospective case detection combining EMR and CCR data. Our classifier achieved good AUC, sensitivity, and specificity without expert-labeled examples.
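The sensitivity, specificity, and accuracy figures reported above follow from a binary confusion matrix by their standard definitions, which the sketch below computes. The counts in the usage note are hypothetical: they were picked only because they sum to 146 and reproduce the rounded values in the abstract, and they should not be read as the study's actual confusion matrix. The function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, and accuracy from confusion-matrix counts.

    tp/fn: true cases classified as positive/negative.
    tn/fp: non-cases classified as negative/positive.
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "sensitivity": tp / positives,          # true positive rate
        "specificity": tn / negatives,          # true negative rate
        "accuracy": (tp + tn) / (positives + negatives),
    }
```

Calling `classification_metrics(tp=62, fn=10, tn=65, fp=9)` on these illustrative counts gives values that round to 0.861, 0.878, and 0.870. The AUC is not derivable from a single confusion matrix, since it depends on the ranking of scores across all thresholds.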

Microsatellite markers for Bokermannohyla species (Anura, Hylidae) from the Brazilian Cerrado and Atlantic Forest domains

Amphibia-Reptilia ◽

10.1163/15685381-00002950 ◽

2014 ◽

Vol 35 (3) ◽

pp. 355-360 ◽

Cited By ~ 7

Author(s):

Renato C. Nali ◽

Kelly R. Zamudio ◽

Cynthia P.A. Prado

Keyword(s):

Microsatellite Markers ◽

Atlantic Forest ◽

Landscape Genetics ◽

Mating Systems ◽

Brazilian Cerrado ◽

Genetic Studies ◽

Focal Species ◽

Central Brazil ◽

Polymorphic Microsatellite ◽

Landscape Genetic

We characterized 22 polymorphic microsatellite markers for the Brazilian treefrog Bokermannohyla ibitiguara and tested their cross-amplification in B. alvarengai, B. circumdata and B. hylax. Our focal species occurs in protected and disturbed Brazilian Cerrado landscapes, a highly threatened savanna in central Brazil. Fourteen markers successfully cross-amplified for at least one congener. These microsatellites will be useful for studies of mating systems, relatedness and landscape genetics of Cerrado populations under various deforestation levels. Moreover, variable markers for B. circumdata and B. hylax will also be useful for landscape genetic studies of taxa typical of the threatened Atlantic Forest domain.

Identifying Cases of Shoulder Injury Related to Vaccine Administration (SIRVA) Using Natural Language Processing

10.1101/2021.05.05.21256555 ◽

2021 ◽

Author(s):

Chengyi Zheng ◽

Jonathan Duffy ◽

In-Lu Amy Liu ◽

Lina S. Sy ◽

Ronald A. Navarro ◽

...

Keyword(s):

Health Care ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Chart Review ◽

Population Based ◽

Care Organization ◽

Reference Standard ◽

Shoulder Injury ◽

Vaccine Administration

Background: Shoulder injury related to vaccine administration (SIRVA) accounts for more than half of all claims received by the National Vaccine Injury Compensation Program. However, there is a lack of population-based studies due to the challenge of identifying SIRVA cases in large health care databases. Objective: To develop a natural language processing (NLP) method to identify SIRVA cases from clinical notes. Methods: We conducted the study among members of a large integrated health care organization who were vaccinated between 04/1/2016 and 12/31/2017 and had subsequent diagnosis codes indicative of shoulder injury. Based on a training dataset with a chart review reference standard of 164 individuals, we developed an NLP algorithm to extract shoulder disorder information, including prior vaccination, anatomic location, temporality and causality. The algorithm identified three groups of positive SIRVA cases (definite, probable and possible) based on the strength of evidence. We compared NLP results to a chart review reference standard of 100 vaccinated individuals. We then applied the final automated NLP algorithm to a broader cohort of vaccinated individuals with a shoulder injury diagnosis code and performed manual chart confirmation on a random sample of NLP-identified definite cases and all NLP-identified probable and possible cases. Results: In the validation sample, the NLP algorithm had 100% accuracy for identifying 4 SIRVA cases and 96 individuals without SIRVA. In the broader cohort of 53,585 individuals, the NLP algorithm identified 291 definite, 124 probable, and 52 possible SIRVA cases. The chart-confirmation rates for these groups were 95.3%, 67.7% and 18.9%, respectively. Conclusions: The algorithm performed with high sensitivity and reasonable specificity in identifying positive SIRVA cases. 
The NLP algorithm can potentially be used in future population-based studies to identify this rare adverse event, avoiding labor-intensive chart review validation.

Natural Language Processing, Statistical Approaches to

Encyclopedia of Cognitive Science ◽

10.1002/0470018860.s00080 ◽

2006 ◽

Author(s):

Christopher D Manning

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Statistical Approaches

P3.07-013 Determining EGFR and ALK Status in a Population-Based Cancer Registry: A Natural Language Processing Validation Study

Journal of Thoracic Oncology ◽

10.1016/j.jtho.2016.11.2204 ◽

2017 ◽

Vol 12 (1) ◽

pp. S1438 ◽

Cited By ~ 1

Author(s):

Bernardo Goulart ◽

Emily Silgard ◽

Christina Baik ◽

Aasthaa Bansal ◽

Mikael Greenwood-Hickman ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Cancer Registry ◽

Validation Study ◽

Population Based

CloudLM: a Cloud-based Language Model for Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2016-0002 ◽

2016 ◽

Vol 105 (1) ◽

pp. 51-61 ◽

Cited By ~ 1

Author(s):

Jorge Ferrández-Tordera ◽

Sergio Ortiz-Rojas ◽

Antonio Toral

Keyword(s):

Big Data ◽

Natural Language Processing ◽

Machine Translation ◽

Language Processing ◽

State Of The Art ◽

Language Model ◽

Essential Element ◽

Language Models ◽

Language Modelling ◽

Statistical Approaches

Abstract Language models (LMs) are an essential element in statistical approaches to natural language processing for tasks such as speech recognition and machine translation (MT). The advent of big data has made massive amounts of data available for building LMs; in fact, for the most prominent languages, it is not feasible with current techniques and hardware to train LMs on all the data now available. At the same time, it has been shown that the more data is used for an LM, the better the performance, e.g. for MT, without any indication yet of reaching a plateau. This paper presents CloudLM, an open-source cloud-based LM intended for MT, which allows querying distributed LMs. CloudLM relies on Apache Solr and provides the functionality of state-of-the-art language modelling (it builds upon KenLM), while allowing queries against massive LMs (as the use of local memory is drastically reduced), at the expense of slower decoding speed.
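To make concrete what querying a statistical language model involves, here is a minimal bigram model with add-one (Laplace) smoothing. It is a deliberately simplified stand-in: it is not CloudLM, it does not use Apache Solr or KenLM, and add-one smoothing is used in place of the stronger smoothing that production toolkits apply. The names `train_bigram` and `log_prob` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigram contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Natural-log probability of a sentence under add-one smoothing."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    size = len(vocab)
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + size))
        for prev, word in zip(tokens, tokens[1:])
    )
```

Trained on "the cat sat" and "the dog sat", the model assigns a higher score to "the cat sat" than to the scrambled "sat cat the". Words outside the training vocabulary still receive a small nonzero probability here, but the distribution is then no longer properly normalized, which a real toolkit handles with an explicit unknown-word token.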

Subjective Answer Evaluator

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39090 ◽

2021 ◽

Vol 9 (11) ◽

pp. 1740-1744

Author(s):

Sarthak Kagliwal

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Text Summarization ◽

Language Models ◽

Automatic Evaluation ◽

Automatic Assessment ◽

Similarity Matching ◽

Unsupervised Approach ◽

Statistical Approaches

Abstract: The automatic assessment of subjective replies necessitates the use of Natural Language Processing and automated assessment. Ontology, semantic similarity matching, and statistical approaches are among the strategies employed. But most of the methods are based on an unsupervised approach. The proposed system uses an unsupervised method and is divided into two modules. The first extracts the essential data through text summarization; the second applies various natural language models to the text retrieved in the first step and assigns marks to it. Keywords: Automatic Evaluation, NLP, Text Summarization, Similarity Measure, Marks Scoring
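One simple form of the similarity matching mentioned above is cosine similarity between bag-of-words vectors, scaled to a mark. The sketch below is an assumption-laden illustration, not the proposed system: the abstract does not specify the similarity measure, the tokenization, or the scoring rule, and the names `tokenize`, `cosine_similarity`, and `score_answer` are hypothetical. It also skips the summarization module entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between term-count vectors; 0.0 if either is empty."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_answer(answer, reference, max_marks=10):
    """Scale similarity to a mark out of `max_marks` (an assumed scoring rule)."""
    return round(cosine_similarity(answer, reference) * max_marks, 1)
```

Word-count overlap rewards shared vocabulary rather than shared meaning, so a paraphrased but correct answer would score low under this measure; semantic similarity methods aim to close that gap.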