Classify Refugee Status Using Common Features in EMR

Author(s):  
Malia Morrison ◽  
Crista E. Johnson-Agbakwu ◽  
Celeste Bailey ◽  
Li Liu

Objective: Automated and accurate identification of refugees in healthcare databases is a critical first step toward investigating the healthcare needs of this vulnerable population and reducing health disparities. This study developed a machine-learning method, named the refugee identification system (RIS), that uses features commonly collected in healthcare databases to classify refugees and non-refugees. Materials and Methods: We compiled a curated data set consisting of 103 refugees and 930 non-refugees in Arizona. For each person in the curated data set, we collected age, primary language, and home address. We supplemented individual-level data with state-level refugee resettlement statistics and world language statistics, then performed feature engineering to convert primary language and home address into quantitative features. Finally, we built a random forest model to classify refugee status. Results: Evaluated on holdout testing data, RIS achieved a classification accuracy of 0.97, specificity of 0.98, sensitivity of 0.88, positive predictive value of 0.83, and negative predictive value of 0.99. The receiver operating characteristic curve had an area under the curve value of 0.96. Discussion and Conclusion: RIS is an automated, accurate, generalizable, and scalable method that can be used to identify refugees in healthcare databases. It enables large-scale investigation of refugee healthcare needs and reduction of health disparities.
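The five classification measures reported above all derive from one confusion matrix. The sketch below shows how they relate. The counts are hypothetical placeholders chosen for easy arithmetic, not the study's holdout data, and the `Confusion` and `summarize` names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # positives correctly flagged (e.g. refugees classified as refugees)
    fp: int  # negatives wrongly flagged as positive
    tn: int  # negatives correctly left unflagged
    fn: int  # positives that were missed


def summarize(c: Confusion) -> dict[str, float]:
    """Return the standard binary-classification measures for one confusion matrix."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),  # true-positive rate
        "specificity": c.tn / (c.tn + c.fp),  # true-negative rate
        "ppv": c.tp / (c.tp + c.fp),          # positive predictive value
        "npv": c.tn / (c.tn + c.fn),          # negative predictive value
    }


# Hypothetical counts for illustration only; NOT taken from the study.
example = Confusion(tp=8, fp=2, tn=88, fn=2)
metrics = summarize(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty rows or columns; a matrix with no predicted positives, for instance, would raise `ZeroDivisionError` when computing PPV.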

Heart ◽  
2018 ◽  
Vol 104 (23) ◽  
pp. 1921-1928 ◽  
Author(s):  
Ming-Zher Poh ◽  
Yukkee Cheung Poh ◽  
Pak-Hei Chan ◽  
Chun-Ka Wong ◽  
Louise Pun ◽  
...  

Objective: To evaluate the diagnostic performance of a deep learning system for automated detection of atrial fibrillation (AF) in photoplethysmographic (PPG) pulse waveforms. Methods: We trained a deep convolutional neural network (DCNN) to detect AF in 17 s PPG waveforms using a training data set of 149 048 PPG waveforms constructed from several publicly available PPG databases. The DCNN was validated using an independent test data set of 3039 smartphone-acquired PPG waveforms from adults at high risk of AF at a general outpatient clinic against ECG tracings reviewed by two cardiologists. Six established AF detectors based on handcrafted features were evaluated on the same test data set for performance comparison. Results: In the validation data set (3039 PPG waveforms) consisting of three sequential PPG waveforms from 1013 participants (mean (SD) age, 68.4 (12.2) years; 46.8% men), the prevalence of AF was 2.8%. The area under the receiver operating characteristic curve (AUC) of the DCNN for AF detection was 0.997 (95% CI 0.996 to 0.999) and was significantly higher than all the other AF detectors (AUC range: 0.924–0.985). The sensitivity of the DCNN was 95.2% (95% CI 88.3% to 98.7%), specificity was 99.0% (95% CI 98.6% to 99.3%), positive predictive value (PPV) was 72.7% (95% CI 65.1% to 79.3%) and negative predictive value (NPV) was 99.9% (95% CI 99.7% to 100%) using a single 17 s PPG waveform. Using the three sequential PPG waveforms in combination (<1 min in total), the sensitivity was 100.0% (95% CI 87.7% to 100%), specificity was 99.6% (95% CI 99.0% to 99.9%), PPV was 87.5% (95% CI 72.5% to 94.9%) and NPV was 100% (95% CI 99.4% to 100%). Conclusions: In this evaluation of PPG waveforms from adults screened for AF in a real-world primary care setting, the DCNN had high sensitivity, specificity, PPV and NPV for detecting AF, outperforming other state-of-the-art methods based on handcrafted features.
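The gap between a 99.0% specificity and a 72.7% PPV follows from the low prevalence (2.8%). The sketch below applies Bayes' rule to the single-waveform figures quoted above. The function names are illustrative, and the inputs are the rounded percentages from the abstract, so the result should only approximately match the reported PPV and NPV.

```python
def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value via Bayes' rule."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Rounded single-waveform figures quoted in the abstract.
ppv = ppv_from_rates(sensitivity=0.952, specificity=0.990, prevalence=0.028)
npv = npv_from_rates(sensitivity=0.952, specificity=0.990, prevalence=0.028)
print(f"PPV ~ {ppv:.3f}, NPV ~ {npv:.4f}")
```

With these rounded inputs the computed PPV comes out near 0.73 and the NPV near 0.999, in line with the reported 72.7% and 99.9%. The small PPV difference is what one would expect from rounding in the published rates.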


PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0241239
Author(s):  
Kai On Wong ◽  
Osmar R. Zaïane ◽  
Faith G. Davis ◽  
Yutaka Yasui

Background: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features. Methods: Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, area under the receiver operating characteristic curve, and accuracy. Results: The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged from 63% to 67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%). Conclusions: The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians, with performance varying by ethnic category.
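The abstract lists "substrings" among the name features without specifying how they were generated. One common way to derive substring features from a name is character n-grams, sketched below. The n-gram lengths, lowercasing, and the `^` and `$` boundary markers are assumptions made here for illustration and are not described in the source.

```python
def char_ngrams(name: str, n_min: int = 2, n_max: int = 3) -> list[str]:
    """Return character n-grams of a name, shortest first.

    The name is lowercased and wrapped in '^' and '$' so that prefixes and
    suffixes become distinct features (an illustrative choice, not the paper's).
    """
    padded = f"^{name.strip().lower()}$"
    grams: list[str] = []
    for n in range(n_min, n_max + 1):
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams


# A short made-up surname, bigrams only.
print(char_ngrams("Roy", n_min=2, n_max=2))  # ['^r', 'ro', 'oy', 'y$']
```

Features like these would typically be counted or one-hot encoded before being passed to a classifier such as regularized logistic regression.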


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Zi-Hui Tang ◽  
Fangfang Zeng ◽  
Zhongtao Li ◽  
Linuo Zhou

Background. The purpose of this study was to evaluate the predictive value of DM and resting HR on CAN in a large sample derived from a Chinese population. Materials and Methods. We conducted a large-scale, population-based, cross-sectional study to explore the relationships of CAN with DM and resting HR. A total of 387 subjects were diagnosed with CAN in our dataset. The associations of CAN with DM and resting HR were assessed by a multivariate logistic regression (MLR) analysis (using subjects without CAN as a reference group) after controlling for potential confounding factors. The area under the receiver-operating characteristic curve (AUC) was used to evaluate the predictive performance of resting HR and DM. Results. A tendency toward increased CAN prevalence with increasing resting HR was reported (P for trend < 0.001). MLR analysis showed that DM and resting HR were very significantly and independently associated with CAN (P < 0.001 for both). Resting HR alone or combined with DM (DM-HR) both strongly predicted CAN (AUC = 0.719, 95% CI 0.690–0.748 for resting HR and AUC = 0.738, 95% CI 0.710–0.766 for DM-HR). Conclusion. Our findings signify that resting HR and DM-HR have a high value in predicting CAN in the general population.
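For a single continuous predictor such as resting HR, the AUC can be read as the probability that a randomly chosen case has a higher value than a randomly chosen non-case, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The heart-rate values are made up for illustration and are not from the study.

```python
def auc_pairwise(case_values: list[float], control_values: list[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    if not case_values or not control_values:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical resting heart rates (beats per minute); NOT study data.
cases = [82.0, 90.0, 76.0]
controls = [70.0, 76.0, 68.0, 85.0]
auc = auc_pairwise(cases, controls)
print(f"AUC = {auc:.3f}")  # 9.5 correctly ordered pairs out of 12
```

This quadratic-time form is meant for clarity; for large samples a rank-based computation gives the same value more efficiently.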


2019 ◽  
Vol 79 (5) ◽  
pp. 931-961 ◽  
Author(s):  
Cengiz Zopluoglu

Researchers frequently use machine-learning methods in many fields. In the area of detecting fraud in testing, there have been relatively few studies that have used these methods to identify potential testing fraud. In this study, a technical review of a recently developed state-of-the-art algorithm, Extreme Gradient Boosting (XGBoost), is provided and the utility of XGBoost in detecting examinees with potential item preknowledge is investigated using a real data set that includes examinees who engaged in fraudulent testing behavior, such as illegally obtaining live test content before the exam. Four different XGBoost models were trained using different sets of input features based on (a) only dichotomous item responses, (b) only nominal item responses, (c) both dichotomous item responses and response times, and (d) both nominal item responses and response times. The predictive performance of each model was evaluated using the area under the receiver operating characteristic curve and several classification measures such as the false-positive rate, true-positive rate, and precision. For comparison purposes, the results from two person-fit statistics on the same data set were also provided. The results indicated that XGBoost successfully classified the honest test takers and fraudulent test takers with item preknowledge. In particular, the classification performance of XGBoost was reasonably good when the response time information and item responses were both taken into account.


2021 ◽  
pp. 1-24
Author(s):  
Arnstein Vestre ◽  
Azzeddine Bakdi ◽  
Erik Vanem ◽  
Øystein Engelhardtsen

Abstract Economic and technological development has increased the amount, density and complexity of maritime traffic, which has resulted in new challenges. One challenge is conforming to the distinct evasion manoeuvres required by vessels entering into near-collision situations (NCSs). Existing rules are vague and do not precisely dictate which, when and how collision avoidance manoeuvres (CAMs) should be executed. The automatic identification system (AIS) is widely used for vessel monitoring and traffic control. This paper presents an efficient, scalable method for processing large-scale raw AIS data using the closest point of approach (CPA) framework. NCSs are identified to create a database of historical traffic data. Important features describing CAMs are defined, estimated and analysed. Applications on a high-quality real-world data set show promising results for a subset of the identified situations. Future applications may play a significant role in the maritime regulatory framework, navigation protocol compliance evaluation, risk assessment, automatic collision avoidance, and algorithm design and testing for autonomous vessels.
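The closest point of approach (CPA) for two vessels on constant courses reduces to a short vector calculation. The sketch below assumes positions have already been projected from AIS latitude and longitude into a local planar frame (metres) and that velocities are in metres per second. That preprocessing, the function name, and the example values are assumptions for illustration, not the paper's pipeline.

```python
import math

Vec = tuple[float, float]


def cpa(p1: Vec, v1: Vec, p2: Vec, v2: Vec) -> tuple[float, float]:
    """Return (time to CPA in seconds, distance at CPA in metres).

    Assumes straight-line motion at constant velocity. A negative time
    means the closest approach already happened.
    """
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]  # relative position
    vx, vy = v2[0] - v1[0], v2[1] - v1[1]  # relative velocity
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        # Identical velocities: the separation never changes.
        return 0.0, math.hypot(rx, ry)
    tcpa = -(rx * vx + ry * vy) / speed_sq
    dcpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return tcpa, dcpa


# Hypothetical head-on encounter with a 100 m lateral offset.
tcpa, dcpa = cpa(p1=(0.0, 0.0), v1=(5.0, 0.0), p2=(1000.0, 100.0), v2=(-5.0, 0.0))
print(f"TCPA = {tcpa:.1f} s, DCPA = {dcpa:.1f} m")  # TCPA = 100.0 s, DCPA = 100.0 m
```

A near-collision filter could then keep only vessel pairs whose time and distance at CPA fall below chosen thresholds; the abstract does not state which thresholds were used.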


Author(s):  
Ishtiaque Ahmed ◽  
Manan Darda ◽  
Neha Tikyani ◽  
Rachit Agrawal ◽  
...  

The COVID-19 pandemic has caused large-scale outbreaks in more than 150 countries worldwide, causing massive damage to the livelihood of many people. The ability to identify infected patients early and give them appropriate treatment is arguably the most important step in the fight against COVID-19. One of the quickest ways to diagnose patients is to use radiography and radiology images to detect the disease. Early studies have shown that chest X-rays of patients infected with COVID-19 have unique abnormalities. To identify COVID-19 patients from chest X-ray images, we used various deep learning models based on previous studies. We first compiled a data set of 2,815 chest radiographs from public sources. The model produces reliable and stable results with an accuracy of 91.6%, a positive predictive value of 80%, a negative predictive value of 100%, a specificity of 87.50%, and a sensitivity of 100%. We observe that the CNN-based architecture can diagnose COVID-19. These metrics could be further improved by increasing the data set size and by further developing the CNN-based architecture used to train the model.


2018 ◽  
Vol 611 ◽  
pp. A2 ◽  
Author(s):  
C. Schaefer ◽  
M. Geiger ◽  
T. Kuntzer ◽  
J.-P. Kneib

Context. Future large-scale surveys with high-resolution imaging will provide us with approximately 10^5 new strong galaxy-scale lenses. These strong-lensing systems will, however, be contained in large amounts of data that are beyond the capacity of human experts to visually classify in an unbiased way. Aims. We present a new strong gravitational lens finder based on convolutional neural networks (CNNs). The method was applied to the strong-lensing challenge organized by the Bologna Lens Factory. It achieved first and third place, respectively, on the space-based data set and the ground-based data set. The goal was to find a fully automated lens finder for ground-based and space-based surveys that minimizes human inspection. Methods. We compared the results of our CNN architecture and three new variations ("invariant", "views", and "residual") on the simulated data of the challenge. Each method was trained separately five times on 17 000 simulated images, cross-validated using 3000 images, and then applied to a test set with 100 000 images. We used two different metrics for evaluation: the area under the receiver operating characteristic curve (AUC) score, and the recall with no false positive (Recall0FP). Results. For ground-based data, our best method achieved an AUC score of 0.977 and a Recall0FP of 0.50. For space-based data, our best method achieved an AUC score of 0.940 and a Recall0FP of 0.32. Adding dihedral invariance to the CNN architecture diminished the overall score on space-based data, but achieved a higher no-contamination recall. We found that using committees of five CNNs produced the best recall at zero contamination and consistently scored better AUC than a single CNN. Conclusions. We found that every variation of our CNN lens finder achieved an AUC score within 6% of 1. A deeper network did not outperform simpler CNN models either. This indicates that more complex networks are not needed to model the simulated lenses. To verify this, more realistic lens simulations with more lens-like structures (spiral galaxies or ring galaxies) are needed to compare the performance of deeper and shallower networks.
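The recall with no false positive (Recall0FP) can be read as the fraction of true lenses that score strictly above the highest-scoring non-lens. The sketch below implements that reading. The challenge's exact definition, including its tie handling, may differ, and the scores and labels shown are invented for illustration.

```python
def recall_at_zero_fp(scores: list[float], labels: list[int]) -> float:
    """Fraction of positives scored strictly above every negative.

    labels: 1 for a true lens, 0 for a non-lens. Ties with the top-scoring
    negative are not counted as recovered (an assumption of this sketch).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives:
        raise ValueError("at least one positive example is required")
    threshold = max(negatives) if negatives else float("-inf")
    recovered = sum(1 for s in positives if s > threshold)
    return recovered / len(positives)


# Hypothetical classifier outputs; NOT challenge data.
scores = [0.99, 0.95, 0.90, 0.80, 0.60, 0.30]
labels = [1, 1, 0, 1, 0, 0]
r0fp = recall_at_zero_fp(scores, labels)
print(f"Recall0FP = {r0fp:.3f}")  # 2 of 3 lenses outrank the best non-lens
```

Because the metric depends entirely on the single highest-scoring negative, it is much more sensitive to outliers than AUC, which may help explain why the two metrics can rank architectures differently.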


2018 ◽  
Vol 8 (11) ◽  
pp. 2089 ◽  
Author(s):  
Juha Niemi ◽  
Juha Tanttu

An automatic bird identification system is required for offshore wind farms in Finland. Radar is the obvious choice for detecting flying birds, but external information is required for actual identification. We used visual camera images as external data. The proposed system for automatic bird identification consists of a radar, a motorized video head and a single-lens reflex camera with a telephoto lens. A convolutional neural network trained with a deep learning algorithm is applied to the image classification. We also propose a data augmentation method in which images are rotated and converted in accordance with the desired color temperatures. The final identification is based on a fusion of parameters provided by the radar and the predictions of the image classifier. The sensitivity of the proposed system as an image classifier is 0.9463 on a data set of 9312 manually taken original images, augmented to 2.44 × 10^6 images. The area under the receiver operating characteristic curve for two key bird species is 0.9993 (the White-tailed Eagle) and 0.9496 (the Lesser Black-backed Gull), respectively. We proposed a novel system for automatic bird identification as a real-world application. We demonstrated that our data augmentation method is suitable for the image classification problem and that it significantly increases the performance of the classifier.


1997 ◽  
Vol 78 (02) ◽  
pp. 794-798 ◽  
Author(s):  
Bowine C Michel ◽  
Philomeen M M Kuijer ◽  
Joseph McDonnell ◽  
Edwin J R van Beek ◽  
Frans F H Rutten ◽  
...  

Summary. Background: In order to improve the use of information contained in the medical history and physical examination in patients with suspected pulmonary embolism and a non-high probability ventilation-perfusion scan, we assessed whether a simple, quantitative decision rule could be derived for the diagnosis or exclusion of pulmonary embolism. Methods: In 140 consecutive symptomatic patients with a non-high probability ventilation-perfusion scan and an interpretable pulmonary angiogram, various clinical and lung scan items were collected prospectively and analyzed by multivariate stepwise logistic regression analysis to identify the most informative combination of items. Results: The prevalence of proven pulmonary embolism in the patient population was 27.1%. A decision rule containing the presence of wheezing, previous deep venous thrombosis, recently developed or worsened cough, body temperature above 37 °C and multiple defects on the perfusion scan was constructed. For the rule, the area under the receiver operating characteristic curve was larger than that of the prior probability of pulmonary embolism as assessed by the physician at presentation (0.76 versus 0.59; p = 0.0097). At the cut-off point with the maximal positive predictive value, 2% of the patients scored positive; at the cut-off point with the maximal negative predictive value, pulmonary embolism could be excluded in 16% of the patients. Conclusions: We derived a simple decision rule containing 5 easily interpretable variables for the patient population specified. The optimal use of the rule appears to be in the exclusion of pulmonary embolism. Prospective validation of this rule is indicated to confirm its clinical utility.

