Improving the completeness of public metadata accompanying omics studies
Over the last decade there has been tremendous progress in improving the sharing of genomics data, which allows researchers to easily access various types of data across a wide range of phenotypes. Some of the best-known public repositories are the Gene Expression Omnibus, the Sequence Read Archive and ArrayExpress. However, although raw data is widely available, the metadata accompanying it is often missing. Incomplete and improperly annotated metadata on repositories hinders the reuse and reproduction of existing data, especially for making novel discoveries. Previously published data can be fully leveraged for novel biological discoveries only if the metadata that accompanies the raw omics data is complete and presented in a standardized format. Existing literature has explored how data sharing should be FAIR (Findable, Accessible, Interoperable and Reusable) and has considered accuracy, completeness and consistency as three vital parameters for assessing the quality of available metadata, although few studies have examined metadata quality specifically in the context of omics studies. In our study, we perform a systematic assessment of the completeness of public metadata accompanying omics data. We have performed our analysis on sepsis cohorts and are currently extending it to tuberculosis and cystic fibrosis cohorts. Comparing the two, we observed discrepancies between the omics data and the corresponding metadata on public repositories. Our results for the sepsis cohorts support the need for a standardized "checklist" that researchers follow when submitting their study results and data to public repositories. Our study opens a wider discussion about such a checklist as a potential solution to bridge the gap between omics data and metadata on repositories.
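As a purely illustrative sketch of what a checklist-based completeness assessment could look like, the Python snippet below computes, for each field in a checklist, the fraction of samples that report a usable value. The field names, the list of placeholder strings treated as missing, and the two toy records are all assumptions made for this example; they are not the checklist, the criteria or the data used in the study.

```python
from typing import Dict, Mapping, Optional, Sequence

# Hypothetical checklist of metadata fields (an assumption, not the study's list).
CHECKLIST = ("age", "sex", "tissue", "disease_status")

# Placeholder strings counted as missing (also an assumption).
MISSING = {"", "na", "n/a", "none", "unknown", "not available"}


def is_present(value: Optional[str]) -> bool:
    """Return True if a metadata value is reported and is not a placeholder."""
    if value is None:
        return False
    return str(value).strip().lower() not in MISSING


def completeness(
    samples: Sequence[Mapping[str, Optional[str]]],
    fields: Sequence[str] = CHECKLIST,
) -> Dict[str, float]:
    """Return, per field, the fraction of samples with a usable value."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {
        field: sum(is_present(s.get(field)) for s in samples) / len(samples)
        for field in fields
    }


# Toy records invented for illustration only.
toy_samples = [
    {"age": "54", "sex": "F", "tissue": "whole blood", "disease_status": "case"},
    {"age": "NA", "sex": "M", "tissue": "whole blood"},
]

scores = completeness(toy_samples)
for field, score in scores.items():
    print(f"{field}: {score:.2f}")
```

Running the same function on the metadata from two sources for the same cohort and comparing the per-field scores would be one simple way to surface discrepancies of the kind described above.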