ABSTRACT

Objective
Automated and accurate identification of refugees in healthcare databases is a critical first step toward investigating the healthcare needs of this vulnerable population and reducing health disparities. This study developed a machine-learning method, named the refugee identification system (RIS), that uses features commonly collected in healthcare databases to classify refugees and non-refugees.

Materials and Methods
We compiled a curated data set consisting of 103 refugees and 930 non-refugees in Arizona. For each person in the curated data set, we collected age, primary language, and home address. We supplemented the individual-level data with state-level refugee resettlement statistics and world language statistics, then performed feature engineering to convert primary language and home address into quantitative features. Finally, we built a random forest model to classify refugee status.

Results
Evaluated on holdout testing data, RIS achieved a classification accuracy of 0.97, specificity of 0.98, sensitivity of 0.88, positive predictive value of 0.83, and negative predictive value of 0.99. The area under the receiver operating characteristic curve was 0.96.

Discussion and Conclusion
RIS is an automated, accurate, generalizable, and scalable method for identifying refugees in healthcare databases. It enables large-scale investigation of refugee healthcare needs and supports efforts to reduce health disparities.
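
The five evaluation metrics reported in the Results all derive from the four cells of a binary confusion matrix, with "refugee" treated as the positive class. The following is a minimal sketch of those standard definitions, not the authors' evaluation code. The names `ConfusionCounts` and `binary_metrics` are hypothetical, and the counts are illustrative placeholders rather than the study's holdout data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix (positive class = refugee)."""
    tp: int  # refugees classified as refugees
    fn: int  # refugees classified as non-refugees
    fp: int  # non-refugees classified as refugees
    tn: int  # non-refugees classified as non-refugees


def binary_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the standard metrics from confusion-matrix counts."""
    total = c.tp + c.fn + c.fp + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),  # true positive rate
        "specificity": c.tn / (c.tn + c.fp),  # true negative rate
        "ppv": c.tp / (c.tp + c.fp),          # positive predictive value
        "npv": c.tn / (c.tn + c.fn),          # negative predictive value
    }


# Hypothetical counts for illustration only; NOT taken from the study.
example = ConfusionCounts(tp=8, fn=2, fp=1, tn=89)
metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

When the positive class is rare, as with roughly one refugee per nine non-refugees in the curated data set, accuracy and negative predictive value tend to be dominated by the majority class, so sensitivity and positive predictive value are usually the more informative figures.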