Abstract
Alzheimer’s disease (AD) and AD-related dementias (ADRD) are a class of neurodegenerative diseases affecting about 5.7 million Americans. There is no cure for AD/ADRD, and current interventions have modest effects and focus on attenuating cognitive impairment. Detecting patients at high risk of AD/ADRD is crucial for timely interventions that modify risk factors, support primary prevention of cognitive decline and dementia, and thereby enhance quality of life and reduce health care costs. This study investigates both knowledge-driven approaches (where domain experts identify useful features) and data-driven approaches (where machine learning models select useful features from all available data elements) for early prediction of AD/ADRD using real-world electronic health records (EHR) data from the University of Florida (UF) Health system. We identified a cohort of 59,799 patients and examined four widely used machine learning algorithms in a standard case-control design. We also examined early prediction of AD/ADRD using patient information from 0, 1, 3, and 5 years before the disease onset date. The experimental results showed that models based on gradient boosting trees (GBT) achieved the best performance for the data-driven approach, and models based on random forests (RF) achieved the best performance for the knowledge-driven approach. Among all models, GBT with the data-driven approach achieved the best area under the curve (AUC) scores: 0.7976, 0.7192, 0.6985, and 0.6798 for 0-, 1-, 3-, and 5-year prediction, respectively. We also examined the top features identified by the machine learning models and compared them with the knowledge-driven features identified by domain experts. Our study demonstrates the feasibility of using EHR data for the early prediction of AD/ADRD and identifies potential challenges for future investigations.
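For readers less familiar with the reported metric, AUC can be interpreted as the probability that a model assigns a higher risk score to a randomly chosen case than to a randomly chosen control, with ties counted as one half. The following is a minimal Python sketch of that definition. The function name `auc_score` and the toy labels and scores are hypothetical illustrations, not study data or the study's implementation.

```python
from itertools import product


def auc_score(labels, scores):
    """AUC as the probability that a random case outranks a random control."""
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    # Count case/control pairs where the case scores higher; ties count half.
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c, k in product(case_scores, control_scores)
    )
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical toy data (assumption): 1 = case, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
toy_auc = auc_score(labels, scores)
print(f"AUC = {toy_auc:.4f}")
```

Under this reading, an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls. The pairwise loop is quadratic in cohort size, so a rank-based implementation would be preferable for a cohort of the size described here.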