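Preface
Probe-based targeted capture sequencing has become a routine technique in clinical testing. The first problem to solve when setting up targeted capture sequencing is therefore how to design the capture panel probes.
In 2014, the paper "An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage" offered a reference scheme for designing the targeted capture region: the CAPP-Seq Selector.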
Terminology
- RI: the Recurrence Index (RI) is defined as the number of unique patients (i.e., tumors) with somatic mutations per kilobase of a given genomic unit (here, an exon). RI = (n × 1000) ÷ L, where n is the number of patients with mutations in the exon and L is the exon length in bp.
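The RI formula can be written as a one-line function. The sketch below is illustrative only: the function name `recurrence_index` and the two example exons are assumptions, not values from the paper.

```python
def recurrence_index(n_patients: int, exon_length_bp: int) -> float:
    """RI = (n * 1000) / L: mutated patients per kilobase of exon."""
    if exon_length_bp <= 0:
        raise ValueError("exon length must be positive")
    return n_patients * 1000 / exon_length_bp


# Hypothetical exons: the same 12 patients carry mutations in both,
# but the shorter exon scores higher because RI normalizes by length.
print(recurrence_index(12, 300))   # 40.0
print(recurrence_index(12, 1200))  # 10.0
```

Because RI divides by length, a short exon hit in many patients ranks above a long exon hit in the same number of patients, which is what keeps the panel small.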
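Method overview
This method was designed for NSCLC, but it can be extended to other cancer types whose high-frequency mutations are already well characterized.
- First, select important exons: those containing recurrent mutations in potential driver genes from COSMIC and other sources (Somatic mutations affect key pathways in lung adenocarcinoma [PubMed: 18948947]; Identifying cancer driver genes in tumor genome sequencing studies [PubMed: 21169372]).
- Then, use the TCGA database to obtain WES data from 407 NSCLC samples.
- Finally, apply an iterative algorithm that maximizes the number of missense mutations per patient while minimizing the overall panel size.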
Most human cancers are relatively heterogeneous for somatic mutations in individual genes. Specifically, in most human tumors, recurrent somatic alterations of single genes account for a minority of patients, and only a minority of tumor types can be defined using a small number of recurrent mutations (<5-10) at predefined positions. Therefore, the design of the selector is vital to the CAPP-Seq method because (1) it dictates which mutations can be detected with high probability for a patient with a given cancer, and (2) the selector size (in kb) directly impacts the cost and depth of sequence coverage. For example, the hybrid selection libraries available in current whole exome capture kits range from 51-71 Mb, providing ~40-60 fold maximum theoretical enrichment versus whole genome sequencing. The degree of potential enrichment is inversely proportional to the selector size such that for a ~100 kb selector, >10,000 fold enrichment should be achievable.
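The inverse relationship between selector size and enrichment can be checked with simple arithmetic. The sketch below assumes that maximum theoretical enrichment is the genome size divided by the selector size, and it assumes a rounded human genome size of 3,000 Mb; neither the constant nor the helper name `max_enrichment` comes from the paper.

```python
GENOME_MB = 3000.0  # assumption: rounded human genome size in Mb


def max_enrichment(selector_mb: float, genome_mb: float = GENOME_MB) -> float:
    """Maximum theoretical fold enrichment versus whole genome sequencing."""
    return genome_mb / selector_mb


targets = [
    ("exome kit, 51 Mb", 51.0),
    ("exome kit, 71 Mb", 71.0),
    ("selector, 100 kb", 0.1),
]
for label, size_mb in targets:
    print(f"{label}: ~{max_enrichment(size_mb):,.0f}x")
```

Under that assumption the exome kits come out at roughly 42 to 59 fold and a 100 kb selector at about 30,000 fold, which is consistent with the ~40-60 fold and >10,000 fold figures quoted above.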
We employed a six-phase design strategy to identify and prioritize genomic regions for the CAPP-Seq NSCLC selector as detailed below. Three phases were used to incorporate known and suspected NSCLC driver genes, as well as genomic regions known to participate in clinically actionable fusions (phases 1, 5, 6), while another three phases employed an algorithmic approach to maximize both the number of patients covered and SNVs per patient (phases 2–4). The latter relied upon a metric that we termed “Recurrence Index” (RI), defined as the number of NSCLC patients with SNVs that occur within a given kilobase of exonic sequence (i.e., No. of patients with mutations / exon length in kb). RI thus serves to measure patient-level recurrence frequency at the exon level, while simultaneously normalizing for gene or exon size. As a source of somatic mutation data uniformly genotyped across a large cohort of patients, in phases 2–4, we analyzed non-silent SNVs identified in TCGA whole exome sequencing data from 178 patients in the Lung Squamous Cell Carcinoma dataset (SCC) [10] and from 229 patients in the Lung Adenocarcinoma (LUAD) datasets (TCGA query date was March 13, 2012). Thresholds for each metric (i.e. RI and patients per exon) were selected to statistically enrich for known/suspected drivers in SCC and LUAD data (Supplementary Fig. 1). RefSeq exon coordinates (hg19) were obtained via the UCSC Table Browser (query date was April 11, 2012).
The following algorithm was used to design the CAPP-Seq selector (parenthetical descriptions match design phases noted in Fig. 1b).
- Phase 1 (Known drivers)
Initial seed genes were chosen based on their frequency of mutation in NSCLCs. Analysis of COSMIC (v57) identified known driver genes that are recurrently mutated in ≥9% of NSCLC (denominator ≥500 cases). Specific exons from these genes were selected based on the pattern of SNVs previously identified in NSCLC. The seed list also included single exons from genes with recurrent mutations that occurred at low frequency but had strong evidence for being driver mutations, such as BRAF exon 15, which harbors V600E mutations in <2% of NSCLC.
- Phase 2 (Max. coverage)
For each exon with SNVs covering ≥5 patients in LUAD and SCC, we selected the exon with highest RI that identified at least 1 new patient when compared to the prior phase. Among exons with equally high RI, we added the exon with minimum overlap among patients already captured by the selector. This was repeated until no further exons met these criteria.
- Phase 3 (RI ≥ 30)
For each remaining exon with an RI ≥ 30 and with SNVs covering ≥3 patients in LUAD and SCC, we identified the exon that would result in the largest reduction in patients with only 1 SNV. To break ties among equally best exons, the exon with highest RI was chosen. This was repeated until no additional exons satisfied these criteria.
- Phase 4 (RI ≥ 20)
Same procedure as phase 3, but using RI ≥ 20.
- Phase 5 (Predicted drivers)
We included all exons from additional genes previously predicted to harbor driver mutations in NSCLC [12,13].
- Phase 6 (Add fusions)
For recurrent rearrangements in NSCLC involving the receptor tyrosine kinases ALK, ROS1, and RET, the introns most frequently implicated in the fusion event and the flanking exons were included.
All exons included in the selector, along with their corresponding HUGO gene symbols and genomic coordinates, as well as patient statistics for NSCLC and a variety of other cancers, are provided in Supplementary Table 1, organized by selector design phase.
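The greedy selection in phases 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data layout (exon id mapped to its length and a per-patient SNV count), the function names, and the toy exons are all hypothetical. Two readings of the text are also assumed: the "reduction in patients with only 1 SNV" counts patients who currently have exactly one covered SNV and would gain another, and phases 3 and 4 stop once no eligible exon reduces that number.

```python
def ri(exon):
    """Recurrence Index: mutated patients per kilobase of exon."""
    length_bp, patients = exon
    return len(patients) * 1000 / length_bp


def snvs_per_patient(exons, selected):
    """Total SNVs per patient across the exons already in the selector."""
    counts = {}
    for exon_id in selected:
        for patient, n in exons[exon_id][1].items():
            counts[patient] = counts.get(patient, 0) + n
    return counts


def phase2(exons, selected, min_patients=5):
    """Add the highest-RI exon that covers at least one new patient; repeat."""
    selected = list(selected)
    while True:
        covered = set(snvs_per_patient(exons, selected))
        candidates = [
            e for e, (_, pts) in exons.items()
            if e not in selected and len(pts) >= min_patients and set(pts) - covered
        ]
        if not candidates:
            return selected
        # Highest RI wins; ties go to the exon overlapping the fewest covered patients.
        selected.append(max(
            candidates,
            key=lambda e: (ri(exons[e]), -len(set(exons[e][1]) & covered)),
        ))


def reduce_singletons(exons, selected, min_ri, min_patients=3):
    """Phases 3 and 4: add the exon that rescues the most 1-SNV patients; repeat."""
    selected = list(selected)
    while True:
        counts = snvs_per_patient(exons, selected)
        singles = {p for p, n in counts.items() if n == 1}

        def gain(e):
            return len(singles & set(exons[e][1]))

        candidates = [
            e for e, (_, pts) in exons.items()
            if e not in selected and len(pts) >= min_patients
            and ri(exons[e]) >= min_ri and gain(e) > 0
        ]
        if not candidates:
            return selected
        # Largest reduction wins; ties go to the exon with the highest RI.
        selected.append(max(candidates, key=lambda e: (gain(e), ri(exons[e]))))


def hits(*patients):
    """Toy helper: one SNV per listed patient."""
    return {p: 1 for p in patients}


# Hypothetical exons: id -> (length in bp, {patient: SNV count}).
exons = {
    "seed": (200, hits("P1", "P2")),               # stands in for a phase 1 exon
    "A": (100, hits("P1", "P3", "P4", "P5", "P6")),
    "B": (250, hits("P3", "P4", "P5", "P6", "P7")),
    "C": (100, hits("P2", "P7", "P8")),
    "D": (200, hits("P1", "P8", "P9", "P10")),
    "E": (1000, hits("P1", "P2", "P3")),           # RI too low for any phase
}

selector = phase2(exons, ["seed"])
selector = reduce_singletons(exons, selector, min_ri=30)  # phase 3
selector = reduce_singletons(exons, selector, min_ri=20)  # phase 4
print(selector)  # ['seed', 'A', 'B', 'C', 'D']
```

In the toy data, phase 2 adds A and B to widen patient coverage, phase 3 adds C because it gives P2 and P7 a second SNV, and phase 4 adds D once the RI threshold drops to 20.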
References
Literature
An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage