Automatic Classification of Research Papers Using Machine Learning Approaches and Natural Language Processing

Author(s):  
Ortiz Yesenia ◽  
Segarra-Faggioni Veronica


2020 ◽
Vol 4 (Supplement_1) ◽  
Author(s):  
Lina Sulieman ◽  
Jing He ◽  
Robert Carroll ◽  
Lisa Bastarache ◽  
Andrea Ramirez

Abstract: Electronic Health Records (EHRs) contain rich data for identifying and studying diabetes. Many phenotype algorithms have been developed to identify research subjects with type 2 diabetes (T2D), but very few accurately identify type 1 diabetes (T1D) cases or the rarer monogenic and atypical metabolic presentations. Polygenic risk scores (PRS) quantify disease risk using common genomic variants and perform well for both T1D and T2D. In this study, we apply validated phenotyping algorithms to EHRs linked to a genomic biobank to understand the independent contribution of PRS to classification of diabetes etiology, and we generate additional novel markers to distinguish subtypes of diabetes in EHR data. Using a de-identified mirror of our medical center's electronic health record, we applied published algorithms for T1D and T2D to identify cases, and used natural language processing and chart-review strategies to identify cases of maturity-onset diabetes of the young (MODY) and other rarer presentations. This novel approach included additional data types such as medication sequencing, the ratio and temporality of insulin and non-insulin agents, clinical genetic testing, and ratios of diagnostic codes. Chart review was performed to validate etiology. To calculate PRS, we used genome-wide genotyping from our BioBank, a de-identified biobank linking EHRs to genomic data, applying coefficients of 65 published T1D SNPs and 76,996 T2D SNPs with PLINK in Caucasian subjects. In the dataset, we identified 82,238 cases of T2D but only 130 cases of T1D using the most-cited published algorithms. Adding novel structured elements and natural language processing identified an additional 138 cases of T1D and distinguished 354 cases as MODY. Among over 90,000 subjects with genotyping data available, we included 72,624 Caucasian subjects, since the PRS coefficients were generated in Caucasian cohorts. Among those subjects, 248, 6,488, and 21 subjects were identified as T1D, T2D, and MODY cases, respectively, in our final PRS cohort. The T1D PRS discriminated significantly between cases and controls (Mann-Whitney p = 3.4e-17). The T2D PRS did not significantly discriminate between cases and controls identified using the published algorithms. The atypical case count was too low to assess PRS discrimination. Calculation of the PRS was limited by the quality of the variants available for inclusion, and discrimination may improve in larger datasets. Additionally, blinded physician case review is ongoing to validate the novel classification scheme and to provide a gold standard for machine learning approaches that can be applied in validation sets.
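A PRS of the kind described above is typically a weighted sum of risk-allele dosages, which is the quantity PLINK's scoring computes per subject. The sketch below is a minimal illustration with hypothetical variant IDs and effect weights, not the published T1D/T2D coefficients used in the study:

```python
# Minimal polygenic risk score sketch: a weighted sum of risk-allele dosages.
# Variant IDs and effect weights are hypothetical placeholders.

def polygenic_risk_score(dosages, weights):
    """dosages: {variant: 0, 1, or 2 copies of the risk allele}
    weights: {variant: published effect size, e.g. log odds ratio}.
    Variants without a genotype call are simply skipped."""
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)

subject = {"rs0001": 2, "rs0002": 0, "rs0003": 1}         # allele counts
betas = {"rs0001": 0.45, "rs0002": 0.30, "rs0003": -0.12}  # effect sizes
print(round(polygenic_risk_score(subject, betas), 2))      # 0.45*2 - 0.12 = 0.78
```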


Sentiment classification is one of the best-known and most popular domains of machine learning and natural language processing: an algorithm is developed to understand the opinion expressed about an entity, much as a human would. This research article presents work along those lines. Natural language processing concepts are used for text representation, and a novel word-embedding model is then proposed for effective classification of the data. TF-IDF and the common bag-of-words (BoW) representation models are considered for representing the text data, and the importance of these models is discussed in the respective sections. The proposed model is tested on the IMDB dataset, using a 50% training / 50% testing split with three random shufflings of the data to evaluate the model.
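As a rough illustration of the two representations named above, here is a minimal pure-Python sketch of bag-of-words counts and TF-IDF weighting. Whitespace tokenization and the standard log(N/df) weighting are assumptions; the article does not specify its exact variant:

```python
import math
from collections import Counter

def bow(docs):
    """Bag-of-words: raw term counts per document."""
    return [Counter(d.split()) for d in docs]

def tf_idf(docs):
    """TF-IDF: term frequency scaled by inverse document frequency log(N/df)."""
    counts = bow(docs)
    n = len(docs)
    df = Counter(term for c in counts for term in c)  # docs containing each term
    return [{t: (c[t] / sum(c.values())) * math.log(n / df[t]) for t in c}
            for c in counts]

docs = ["great movie truly great",
        "dull boring movie",
        "boring plot great acting"]
vecs = tf_idf(docs)
# "movie" appears in 2 of 3 documents, so its idf is log(3/2);
# "plot" appears in only one, so it gets the larger idf log(3).
```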


2021 ◽  
Author(s):  
Alaa Hussainalsaid

This thesis proposes automatic classification of the emotional content of web documents using Natural Language Processing (NLP) algorithms. We used online documents, such as general web pages and news articles, to verify the performance of the algorithm. The experiments used sentiment analysis to extract the sentiment of web documents. We used the unigram and bigram approaches, special cases of the N-gram model with N=1 and N=2, respectively. The unigram model treats the probability of each word in the corpus independently, whereas the bigram model conditions the probability of a word on the previous word. Our results show that the unigram model outperforms the bigram model for automatic classification of the emotional content of web documents.
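The unigram and bigram probabilities described above can be sketched in a few lines on a toy corpus, assuming simple whitespace tokenization and maximum-likelihood estimates:

```python
from collections import Counter

tokens = "the movie was good the plot was weak".split()

# Unigram model: P(w) = count(w) / total tokens (each word independent).
uni = Counter(tokens)
total = len(tokens)
p_unigram = {w: c / total for w, c in uni.items()}

# Bigram model: P(w | prev) = count(prev, w) / count(prev).
bi = Counter(zip(tokens, tokens[1:]))
p_bigram = {pair: c / uni[pair[0]] for pair, c in bi.items()}

print(p_unigram["was"])           # "was" is 2 of 8 tokens -> 0.25
print(p_bigram[("was", "good")])  # "was good" once out of two "was" -> 0.5
```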


Author(s):  
Ayushi Mitra

Sentiment analysis, also known as opinion mining or emotion AI, is an ongoing field that uses natural language processing and text analysis to extract, quantify, and study emotional states from a given piece of text data. It is an area of text mining that continues to develop. Sentiment analysis is used in many corporations for product reviews and social-media comments, often simply to check whether a text is positive, negative, or neutral. In this research work we adopt a rule-based approach, which defines a set of rules over inputs from classic natural language processing techniques (stemming, tokenization, part-of-speech tagging, and parsing) combined with machine learning for sentiment analysis, to be implemented in Python.
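A minimal rule-based sentiment scorer along these lines might look as follows. The lexicons and the negation window are illustrative assumptions, not the resources used in this work:

```python
import re

# Toy rule-based sentiment: tokenize, look words up in hand-built lexicons,
# and flip polarity for a few tokens after a negation word.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    score, negated = 0, 0          # negated = tokens remaining under negation
    for tok in tokens:
        if tok in NEGATORS:
            negated = 3            # negation applies to the next few tokens
            continue
        if tok in POSITIVE:
            score += -1 if negated else 1
        elif tok in NEGATIVE:
            score += 1 if negated else -1
        negated = max(0, negated - 1)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The acting was great"))      # positive
print(sentiment("This is not a good movie"))  # negative
```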


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e18093-e18093
Author(s):  
Christi French ◽  
Maciek Makowski ◽  
Samantha Terker ◽  
Paul Alexander Clark

e18093 Background: Pulmonary nodule incidental findings challenge providers to balance resource efficiency and high clinical quality. Incidental findings tend to be undertreated, with studies reporting appropriate follow-up rates as low as 29%. Ensuring appropriate follow-up on all incidental findings is labor-intensive: it requires clinically reading and classifying radiology reports to identify high-risk lung nodules. We tested the feasibility of automating this process with natural language processing (NLP) and machine learning (ML). Methods: In cooperation with Sarah Cannon Research Institute (SCRI), we conducted a series of data science experiments utilizing NLP and ML techniques on 8,879 free-text, narrative CT (computed tomography) radiology reports. Reports were dated Dec 8, 2015 to April 23, 2017, came from SCRI-affiliated emergency department, inpatient, and outpatient facilities, and were a representative, random sample of the patient populations. Reports were divided into a development set for model training and validation and a test set to evaluate model performance. Two models were developed: a "Nodule Model" was trained to detect the reported presence of a pulmonary nodule, and a rules-based "Sizing Model" was developed to extract the size of the nodule in millimeters. Reports were bucketed into three prediction groups: >= 6 mm, < 6 mm, and no size indicated. Nodules were considered positive and placed in a queue for follow-up if the nodule was predicted >= 6 mm, or if the nodule had no size indicated and the radiology report contained the word "mass." The Fleischner Society Guidelines and clinical review informed these definitions. Results: Precision and recall metrics were calculated for multiple model thresholds. A threshold was selected based on the validation-set calculations, with a success criterion of 90% queue precision chosen to minimize false positives. On the test dataset, the F1 measure of the entire pipeline (lung nodule classification model and size extraction model) was 72.9%, recall was 60.3%, and queue precision was 90.2%, exceeding the success criterion. Conclusions: These experiments demonstrate the feasibility of NLP and ML technology for automating the detection and classification of pulmonary nodule incidental findings in radiology reports. This approach promises to improve healthcare quality by increasing the rate of appropriate lung nodule incidental-finding follow-up and treatment without excessive labor or risk of overutilization.
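The rules-based "Sizing Model" and the follow-up queue logic can be sketched roughly as below. The regex and report phrasing are illustrative assumptions, not SCRI's actual implementation:

```python
import re

# Extract a nodule size in millimeters from free-text report language and
# bucket it, then apply the follow-up rule described in the abstract.
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.I)

def size_bucket(report_text):
    m = SIZE_RE.search(report_text)
    if not m:
        return "no size indicated"
    mm = float(m.group(1)) * (10 if m.group(2).lower() == "cm" else 1)
    return ">= 6 mm" if mm >= 6 else "< 6 mm"

def needs_followup(report_text):
    """Queue a report if the nodule is >= 6 mm, or if no size is given
    but the report mentions a 'mass'."""
    bucket = size_bucket(report_text)
    return bucket == ">= 6 mm" or (bucket == "no size indicated"
                                   and "mass" in report_text.lower())

print(size_bucket("8 mm nodule in the right upper lobe"))   # >= 6 mm
print(needs_followup("ill-defined mass, size not stated"))  # True
```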


2014 ◽  
Vol 8 (3) ◽  
pp. 227-235 ◽  
Author(s):  
Cíntia Matsuda Toledo ◽  
Andre Cunha ◽  
Carolina Scarton ◽  
Sandra Aluísio

Discourse production is an important aspect in the evaluation of brain-injured individuals. We believe that studies comparing the performance of brain-injured subjects with that of healthy controls must use groups with comparable education. A pioneering application of machine learning methods using Brazilian Portuguese for clinical purposes is described, highlighting education as an important variable in the Brazilian scenario.
OBJECTIVE: The aims were to describe how to: (i) develop machine learning classifiers using features generated by natural language processing tools to distinguish descriptions produced by healthy individuals into classes based on their years of education; and (ii) automatically identify the features that best distinguish the groups.
METHODS: The approach proposed here extracts linguistic features automatically from the written descriptions with the aid of two natural language processing tools, Coh-Metrix-Port and AIC. It also includes nine task-specific features (three new ones, two extracted manually), besides description time; type of scene described (simple or complex); presentation order (which type of picture was described first); and age. In this study, the descriptions by 144 of the subjects studied in Toledo18 were used, drawn from 200 healthy Brazilians of both genders.
RESULTS AND CONCLUSION: A Support Vector Machine (SVM) with a radial basis function (RBF) kernel is the most recommended approach for the binary classification of our data, classifying three of the four initial classes. CfsSubsetEval (CFS) is a strong candidate to replace manual feature-selection methods.
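The recommended classifier setup can be sketched with scikit-learn (assuming its availability). The feature values below are synthetic stand-ins, not Coh-Metrix-Port or AIC output:

```python
# SVM with an RBF kernel on numeric linguistic features, as recommended
# above. The two toy features (e.g. words per sentence, type-token ratio)
# and their values are hypothetical illustrations.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = [[8.2, 0.41], [9.1, 0.44], [8.7, 0.40],     # class 0: fewer school years
     [15.3, 0.62], [14.8, 0.65], [16.0, 0.60]]  # class 1: more school years
y = [0, 0, 0, 1, 1, 1]

# Scaling matters for RBF kernels, since the kernel is distance-based.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict([[8.5, 0.42], [15.5, 0.63]]))  # expected: [0 1]
```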

