September 28, 2016
BATCH-GE: Batch analysis of Next-Generation Sequencing data for genome editing assessment.
Targeted mutagenesis by the CRISPR/Cas9 system is currently revolutionizing genetics. The ease of this technique has enabled genome engineering in vitro and in a range of model organisms and has pushed experimental dimensions to unprecedented proportions. Owing to tremendous progress in speed, read length, throughput and cost, Next-Generation Sequencing (NGS) is increasingly used for the analysis of CRISPR/Cas9 genome editing experiments. However, the current tools for genome editing assessment lack flexibility and fall short in the analysis of large amounts of NGS data. Therefore, we designed BATCH-GE, an easy-to-use bioinformatics tool for batch analysis of NGS-generated genome editing data, available from https://github.com/WouterSteyaert/BATCH-GE.git. BATCH-GE detects and reports indel mutations and other precise genome editing events and calculates the corresponding mutagenesis efficiencies for a large number of samples in parallel. Furthermore, the tool provides flexibility by allowing the user to adapt a number of input variables. The performance of BATCH-GE was evaluated in two genome editing experiments, aiming to generate knock-out and knock-in zebrafish mutants. This tool will not only contribute to the evaluation of CRISPR/Cas9-based experiments, but will be of use in any genome editing experiment and can analyze data from any organism with a sequenced genome.
Figure 1. Implementation of BATCH-GE. Multiple singleplex PCR products (S1, S2, …, Sn) (upper panel, left) that correspond to different genomic sequences in one or in several genomes are pooled in equimolar amounts. Subsequently, the pools are used as DNA input for NGS library preparation using the Nextera XT library preparation kit, which simultaneously fragments and tags input DNA (upper panel, middle). The tagging adds unique adapter sequences that provide sequencing indices on both sides of the amplicons (depicted by yellow, grey, light and dark blue bars). In a final step, all molecules are pooled in a single tube prior to sequencing (upper panel, right). BATCH-GE analyses the data sample by sample in an automated batchwise manner. The experimental specifications needed to run BATCH-GE are supplied via two input files (middle panel, E (Experiment.csv) and C (Cutsites.bed) icons). First, raw sequencing data are converted into the SAM file format. Second, BATCH-GE screens the reads in the SAM file for their coverage of the region(s) of interest. These are user-defined regions that encompass the theoretical CRISPR/Cas9 cut site, which lies 3 base pairs upstream of the PAM sequence (middle panel, grey sequence). Third, reads that do not fully cover the region of interest are discarded from the analysis, since they lack information about the presence or absence of indels in this region (middle panel, indicated by a cross). Subsequently, the remaining reads (indicated by a tick) are screened for insertions and deletions initiated within the same user-defined region of interest (middle panel, grey dash-lined box). The detected indel variants, along with their position, type, length and frequency, are written to a ‘Variants’ text file. Reads that do not contain any indel are screened for the presence of intended base pair alterations. Frequencies of partial and full repairs are listed in the ‘RepairReport’ file. Additionally, general indel and repair rates are indicated in the ‘Efficiencies’ file. Lastly, URLs (‘URL’ file) enable read visualization in the freeware UCSC Genome Browser database [22].
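The read-screening steps in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the BATCH-GE implementation: the function names, the `(position, CIGAR)` read representation and the convention of reporting an insertion at the next reference base are all assumptions made for illustration. It only shows how a SAM alignment position and CIGAR string suffice to discard reads that do not span a region of interest and to find indels that start inside it.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a SAM CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def reference_span(pos, cigar):
    """Return the 1-based inclusive reference interval covered by a read."""
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos, pos + ref_len - 1

def indels_in_region(pos, cigar, start, end):
    """List indels that start inside [start, end] (1-based, inclusive)."""
    found = []
    ref = pos  # next reference base to be consumed
    for n, op in parse_cigar(cigar):
        if op == "D":
            if start <= ref <= end:
                found.append(("deletion", ref, n))
            ref += n
        elif op == "I":
            # Assumed convention: an insertion is reported at the
            # reference base that follows it.
            if start <= ref <= end:
                found.append(("insertion", ref, n))
        elif op in "MN=X":
            ref += n
    return found

def analyse(reads, start, end):
    """Return (retained reads, indel rate, variant counts) for one region.

    reads: iterable of (pos, cigar) pairs taken from SAM records.
    """
    retained = 0
    with_indel = 0
    variants = Counter()
    for pos, cigar in reads:
        lo, hi = reference_span(pos, cigar)
        if lo > start or hi < end:
            continue  # read does not fully cover the region: discard
        retained += 1
        calls = indels_in_region(pos, cigar, start, end)
        if calls:
            with_indel += 1
            variants.update(calls)
    rate = with_indel / retained if retained else 0.0
    return retained, rate, variants

# Hypothetical reads against a region of interest spanning 115..135.
example_reads = [
    (100, "50M"),       # covers the region, no indel
    (100, "20M5D30M"),  # deletion starting at position 120
    (100, "25M2I23M"),  # insertion before position 125
    (130, "50M"),       # starts inside the region: discarded
    (100, "5M3D42M"),   # deletion outside the region
]
retained, rate, variants = analyse(example_reads, 115, 135)
```

Counting each read at most once in the numerator keeps the indel rate bounded by 1 even when a read carries several indels; whether BATCH-GE counts reads this way is not stated in the caption.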
Figure 2. BATCH-GE output files for a specific genome editing experiment targeting the tprkb gene. (a) The ‘Variants’ text file lists the chromosome, the chromosomal location of the variant, the type of the variant, its length, the reference sequence surrounding the indel (10 bp upstream and 10 bp downstream of the indel) with the inserted sequence marked or with [deleted base pairs] marking the deleted sequence, and the absolute and relative frequency of the variants. (b) In case of HDR analysis, the reads that do not contain any indel are screened for the presence of the intended base pair alterations. BATCH-GE can distinguish between full and partial repair when multiple base pair alterations are intended to be introduced in the region of interest. If partial repair is encountered, the specific sequence of the partial repair is listed. (c) General indel and repair rates are shown in the ‘Efficiencies’ file. (d) URLs are generated (‘URL’ file) which allow visualization of the reads in the freeware UCSC Genome Browser database [26]. However, if the total number of reads (including the reads that are discarded by the tool) exceeds 1000, visualization via UCSC is no longer possible. As an alternative, raw NGS result files (FASTQ) can be uploaded into the Integrative Genomics Viewer (IGV) [19,20].
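Two of the reporting ideas in the Figure 2 caption can be sketched in Python as follows. This is an illustrative sketch only, not BATCH-GE code: `deletion_context` and `classify_repair` are hypothetical helpers, the 0-based coordinates are an assumption, and the labels "full repair", "partial repair" and "no repair" are chosen here for readability rather than taken from the tool's output.

```python
def deletion_context(reference, start, length, flank=10):
    """Show a deletion with its flanking reference sequence.

    reference: reference sequence as a string
    start:     0-based index of the first deleted base (assumed convention)
    length:    number of deleted bases
    flank:     bases shown on each side (10 bp, as in the caption)
    """
    left = reference[max(0, start - flank):start]
    deleted = reference[start:start + length]
    right = reference[start + length:start + length + flank]
    return f"{left}[{deleted}]{right}"

def classify_repair(read, intended):
    """Classify an indel-free read by the intended base changes it carries.

    read:     read sequence aligned to the region of interest
    intended: dict mapping 0-based offsets in the region to the intended base
    """
    matched = sum(1 for offset, base in intended.items() if read[offset] == base)
    if matched == len(intended):
        return "full repair"
    if matched > 0:
        return "partial repair"
    return "no repair"

# Hypothetical inputs for illustration.
reference = "ACGTACGTACGTACGTACGTACGTACGTAC"
context = deletion_context(reference, 12, 3)
intended_changes = {2: "G", 5: "A"}
labels = [classify_repair(r, intended_changes)
          for r in ("ACGTTACA", "ACCTTACA", "ACCTTCCA")]
```

Restricting `classify_repair` to reads without indels keeps a fixed offset-to-base mapping valid, which matches the caption's statement that only indel-free reads are screened for intended alterations.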
Figure 3. Indel rates and read number, as a function of the size of the region of interest used in BATCH-GE. The raw sequencing data derived from CRISPR/Cas9 assays (slc2a10, pls3, tapt1a, myt1la, tprkb) injected with 25 pg sgRNA and 250 pg Cas9 and analysed at 1 dpf were reanalysed while varying the size of the region of interest from 20 to 100 bp. The blue bars represent the number of reads retained by BATCH-GE when screened for coverage of the user-defined region of interest. The red line represents the indel rate as a function of the size of the region of interest.
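The trade-off examined in Figure 3 can be reproduced on toy data with a short Python sketch. The read tuples, the window centred on the cut site and the `sweep_window` helper are assumptions for illustration, and the numbers below are invented rather than taken from the experiments in the figure. The sketch shows why widening the region of interest can lower the number of retained reads while also capturing indels that start further from the cut site.

```python
def sweep_window(reads, cut_site, sizes):
    """Recompute retained reads and indel rate for several window sizes.

    reads:    list of (ref_start, ref_end, indel_starts) tuples, with
              1-based inclusive coordinates and a list of indel start positions
    cut_site: reference position of the theoretical cut site
    sizes:    window sizes in bp; each window is centred on the cut site
              (an assumption made for this sketch)
    """
    rows = []
    for size in sizes:
        half = size // 2
        start, end = cut_site - half, cut_site + half - 1
        # Keep only reads that span the whole window.
        retained = [r for r in reads if r[0] <= start and r[1] >= end]
        # A retained read counts as mutated if any indel starts in the window.
        mutated = [r for r in retained
                   if any(start <= p <= end for p in r[2])]
        rate = len(mutated) / len(retained) if retained else 0.0
        rows.append((size, len(retained), rate))
    return rows

# Invented reads around a cut site at position 150.
toy_reads = [
    (100, 200, [150]),  # long read, indel at the cut site
    (100, 200, []),     # long read, wild type
    (130, 170, [148]),  # short read, lost once the window widens
    (100, 200, [120]),  # indel far from the cut site
    (145, 200, []),     # never spans the window
]
rows = sweep_window(toy_reads, cut_site=150, sizes=[20, 60, 100])
for size, n_reads, rate in rows:
    print(f"{size:>3} bp  reads={n_reads}  indel rate={rate:.2f}")
```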