redkmer: an assembly-free pipeline for the identification of abundant and specific X-chromosome target sequences for X-shredding by CRISPR endonucleases
Abstract

CRISPR-based synthetic sex ratio distorters, which operate by shredding the X chromosome during male meiosis, are promising tools for the area-wide control of harmful insect pests and disease vectors. However, selecting gRNA targets, in the form of high-copy sequence repeats on the X chromosome of a given species, is difficult because such repeats are not accurately resolved in genome assemblies and cannot be assigned to chromosomes with confidence. We have therefore developed the redkmer computational pipeline, designed to identify short and highly abundant sequence elements occurring uniquely on the X chromosome. Redkmer was designed to use exclusively raw whole-genome sequencing (WGS) data from males and females as input. We tested redkmer with suitable short- and long-read WGS data of An. gambiae, the major vector of human malaria, in which the X-shredding paradigm was originally developed. Redkmer establishes long reads as chromosomal proxies with excellent correlation to the genome assembly and uses them to rank X-candidate kmers by their level of X-specificity and abundance. Redkmer identified a high-confidence set of 25-mers, many of which belong to previously known X-chromosome-specific repeats of An. gambiae, including the ribosomal gene array and the selfish genetic elements harbored within it. WGS data from a control strain in which these repeats are also present on the Y chromosome confirmed that these kmers are eliminated in the filtering steps. Finally, we show that redkmer output can be linked directly to gRNA selection and can also inform gRNA off-target prediction. The redkmer pipeline is designed to enable the generation of synthetic sex ratio distorters for the control of harmful insect species of medical or agricultural importance. It proceeds from WGS input data to deliver candidate X-specific CRISPR gRNA target sequences.
In addition, the output of redkmer, including the predicted chromosomal origin of single-molecule long reads and the chromosome-specific kmers, could be used to characterize other biologically relevant sex chromosome sequences, a task that is frequently hampered by the repetitiveness of sex chromosome sequence content.
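
To make the idea of ranking kmers by X-specificity and abundance concrete, the following is a minimal Python sketch, not the redkmer implementation. It assumes that X-linked kmers appear at roughly twice the normalised coverage in female (XX) reads as in male (XY) reads, and it uses that female-to-male ratio as a stand-in for X-specificity. The function names, the ratio window (`min_ratio`, `max_ratio`), and the use of total count as the abundance score are all hypothetical choices made for illustration. The long-read chromosomal-proxy step described above is omitted entirely.

```python
from collections import Counter

K = 25  # kmer length reported in the abstract


def count_kmers(reads, k=K):
    """Count every overlapping kmer of length k in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rank_x_candidates(female_reads, male_reads, k=K,
                      min_ratio=1.5, max_ratio=2.5):
    """Return (kmer, female/male ratio, total count), most abundant first.

    Assumption (illustrative only): a kmer is an X candidate when its
    depth-normalised female/male count ratio lies within
    [min_ratio, max_ratio], i.e. close to the 2:1 ratio expected for
    X-linked sequence in XX females versus XY males.
    """
    female = count_kmers(female_reads, k)
    male = count_kmers(male_reads, k)
    # Normalise by total kmer count so unequal sequencing depth cancels out.
    f_total = sum(female.values()) or 1
    m_total = sum(male.values()) or 1

    ranked = []
    for kmer, f_count in female.items():
        m_count = male.get(kmer, 0)
        if m_count == 0:
            continue  # not observed in males, so no ratio can be computed
        ratio = (f_count / f_total) / (m_count / m_total)
        if min_ratio <= ratio <= max_ratio:
            ranked.append((kmer, ratio, f_count + m_count))

    # Rank the surviving candidates by abundance (total observed count).
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked
```

On real data the reads would come from FASTQ files and the counting would be delegated to a dedicated kmer counter; the in-memory `Counter` here is only meant to show the logic of the ratio filter followed by an abundance ranking.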