Embracing imperfection: machine-assisted invertebrate classification in real-world datasets
Despite growing concern over the health of global insect populations, insect population data remain severely limited in spatial and temporal coverage. Machine-assisted classification has been proposed as a way to gather large amounts of data quickly, but previous studies have often trained their models on unrealistic or idealized datasets. In this study, we describe a practical methodology for incorporating machine learning into ecological data acquisition pipelines. We trained and tested machine learning algorithms to classify over 56,000 bulk terrestrial invertebrate specimens from image data. All specimens were collected in pitfall traps by the National Ecological Observatory Network (NEON) at 27 locations across the United States. Image features were extracted as feature vectors using ImageJ. When classifying specimens from taxa that our models had been trained on, we reached an accuracy of 74.7% at the lowest taxonomic level. We also used zero-shot classification to classify invertebrate taxa that our models were not trained on, reaching an accuracy of 42.1% on these taxa. The general methodology outlined here represents a realistic approach to using machine learning as a tool in ecological studies.
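The two accuracy figures above are computed over different subsets of specimens: those whose taxon appeared in training and those whose taxon did not. The following is a minimal sketch of how such a split evaluation might be computed, not the procedure used in this study. The function name `accuracy_by_split`, the `(true_taxon, predicted_taxon)` record layout, and the example labels are all hypothetical, and the sketch assumes a prediction counts as correct only when it matches the true label exactly.

```python
def accuracy_by_split(records, trained_taxa):
    """Return accuracy separately for taxa seen and not seen in training.

    records: iterable of (true_taxon, predicted_taxon) pairs (hypothetical layout).
    trained_taxa: set of taxon labels that were present in the training data.
    """
    # Each entry holds [number correct, number total] for that split.
    counts = {"seen": [0, 0], "unseen": [0, 0]}
    for true_taxon, predicted_taxon in records:
        split = "seen" if true_taxon in trained_taxa else "unseen"
        counts[split][1] += 1
        if predicted_taxon == true_taxon:
            counts[split][0] += 1
    # A split with no specimens has undefined accuracy, reported as None.
    return {
        split: (correct / total if total else None)
        for split, (correct, total) in counts.items()
    }


# Made-up labels for illustration only; these are not data from the study.
example_trained_taxa = {"Carabidae", "Formicidae"}
example_records = [
    ("Carabidae", "Carabidae"),
    ("Formicidae", "Carabidae"),
    ("Carabidae", "Carabidae"),
    ("Araneae", "Araneae"),
    ("Isopoda", "Araneae"),
]
example_result = accuracy_by_split(example_records, example_trained_taxa)
```

Reporting the two subsets separately keeps performance on unseen taxa from being masked by the typically larger number of specimens from trained taxa.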