Double-strand breaks (DSBs) arise when both DNA strands are damaged, with sources including ionizing radiation and reactive oxygen species. DSBs can cause abnormal chromosomal rearrangements, which are linked to cancer development, and are therefore important to characterize. Recent techniques allow genome-wide mapping of DSBs at high resolution, enabling a comprehensive study of how DSBs originate. However, these techniques are costly and challenging.
We devise a computational approach to predict DSBs using the epigenomic and chromatin context, for which public data are available from the ENCODE project.
Our predictions achieve excellent accuracy (AUC > 0.97) at high resolution (< 1 kb) using available ChIP-seq and DNase-seq data from public databases. DNase, CTCF binding and H3K4me1/2/3 are among the best predictors of DSBs, reflecting the importance of chromatin accessibility, activity and long-range contacts in determining DSB sites and their subsequent repair. We also successfully predict DSB sites using DNA motif occurrences only (AUC = 0.839) and identify the CTCF motif as a strong predictor. In addition, DNA shape analysis reveals the importance of structure-based readout in determining DSB sites, complementary to sequence-based readout (motifs).
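To make the AUC figures concrete, the following is a minimal sketch of how an area under the ROC curve can be computed when genomic bins are labeled as DSB or non-DSB and each bin receives a prediction score. It is an illustration only, not the models or pipeline used in this work: the function name `roc_auc`, the toy labels and the toy scores are all hypothetical, and the computation uses the standard rank-sum (Mann-Whitney) identity.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    The result equals the probability that a randomly chosen positive bin
    (label 1) is scored higher than a randomly chosen negative bin
    (label 0), with ties counted as one half.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, averaging the ranks of tied scores.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Hypothetical toy data: 1 = bin overlaps a DSB, 0 = it does not.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical prediction scores for the same eight bins.
toy_scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1, 0.7]

toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.4f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of DSB from non-DSB bins, which is the scale on which the values reported above should be read.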
Double-strand breaks represent a major threat to the cell and are associated with cancer development. Here we show, for the first time, that DSBs can be computationally predicted using public epigenomic data, even when data availability is limited (e.g. DNase I and H3K4me1 only). Using state-of-the-art computational models, we achieve excellent prediction accuracy, paving the way for a better understanding of how DSB formation depends on developmental-stage-specific or cell-type-specific epigenetic marks. In addition, our work represents a first step toward predicting DSBs from DNA information alone, which could guide further locus-specific genome editing.