N.B.: This simulation illustrates the paper pipeline when the number of biologically meaningful variants that are difficult to cluster is high compared to the number of mutations passing stringent filters.
This script shows the importance of clustering high-confidence variants first and then attributing the meaningful ones to the identified clusters.
All functions are stored in reproduce.R to avoid displaying long code listings. Below are the values used throughout the tests.
number_iterations <- 20
number_mutations <- 150
ndrivers <- 100
We first create a test set with QuantumCat: 6 clones, 150 variants, diploid, an average depth of 100X, and two samples with respective purities of 70% and 60%. We make sure these variants pass the stringent filters (i.e. depth \(\geq 50\)X).
toy.data<-QuantumCat_stringent(number_of_clones = 6,number_of_mutations = number_mutations,
ploidy = "AB",depth = 100,
contamination = c(0.3,0.4),min_depth = 50)
We check that all these variants pass the stringent filters (i.e. depth \(\geq 50\)X) in both samples, and display the first six rows of the first sample:
sum(toy.data[[1]]$Depth<50 | toy.data[[2]]$Depth<50)
## [1] 0
kable(toy.data[[1]][1:6,])
Chr | Start | Genotype | Cellularit | number_of_copies | Frequency | Depth | Alt |
---|---|---|---|---|---|---|---|
1 | 1 | AB | 100 | 1 | 35.00 | 91 | 20 |
6 | 2 | AB | 49 | 1 | 17.15 | 143 | 32 |
2 | 3 | AB | 28 | 1 | 9.80 | 121 | 10 |
1 | 4 | AB | 100 | 1 | 35.00 | 343 | 133 |
2 | 5 | AB | 28 | 1 | 9.80 | 51 | 8 |
6 | 6 | AB | 49 | 1 | 17.15 | 69 | 11 |
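The Frequency column above is consistent with the usual relation between cellularity, purity and copy number. The sketch below is an illustration only: `expected_vaf` and its arguments are hypothetical names, not functions from reproduce.R or QuantumCat, and it assumes that contaminating normal cells are diploid.

```r
# Expected variant allele frequency (in %) of a mutation.
# Assumption: normal (contaminating) cells carry 2 copies of the locus.
expected_vaf <- function(cellularity, purity, copies = 1, tumor_copy_number = 2) {
  # Average number of alleles per cell in the sequenced sample
  total_alleles <- purity * tumor_copy_number + (1 - purity) * 2
  cellularity * purity * copies / total_alleles
}

# First sample: contamination 0.3, hence purity 0.7, diploid (AB) loci
vaf_sample1 <- expected_vaf(c(100, 49, 28), purity = 0.7)
```

With a purity of 0.7, cellularities of 100, 49 and 28 give 35, 17.15 and 9.8, which are the frequencies shown in the first three rows of the table.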
Then we create 150 mutations that pass only the permissive filters. For that we take 38 mutations with a depth between 30 and 50, 75 with a depth \(\geq 30\) in triploid (AAB) loci, and 38 with a depth \(\geq 30\) in tetraploid (AABB) loci.
permissive<-QuantumCat_permissive(fromQuantumCat = toy.data ,number_of_mutations = number_mutations,
ploidy = "AB",depth = 100,
contamination = c(0.3,0.4),max_depth = 50, min_depth = 30)
kable(permissive[[1]][1:6,])
Chr | Start | Cellularit | Genotype | number_of_copies | Depth | Frequency | Alt |
---|---|---|---|---|---|---|---|
5 | 151 | 6 | AB | 1 | 49 | 2.10 | 1 |
4 | 152 | 13 | AB | 1 | 45 | 4.55 | 1 |
1 | 153 | 100 | AB | 1 | 48 | 35.00 | 19 |
3 | 154 | 25 | AB | 1 | 43 | 8.75 | 1 |
3 | 155 | 25 | AB | 1 | 41 | 8.75 | 3 |
3 | 156 | 25 | AB | 1 | 33 | 8.75 | 5 |
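The two filter levels described so far can be summarised as a simple rule on depth and genotype. This is a minimal sketch, assuming that the stringent filter means a diploid (AB) locus with depth \(\geq 50\) and that the permissive filter accepts any locus with depth \(\geq 30\); `classify_filter` is a hypothetical helper, not part of reproduce.R.

```r
# Assign each variant to a filter level from its depth and genotype.
# Assumption: thresholds of 50 (stringent) and 30 (permissive), as in the text.
classify_filter <- function(depth, genotype,
                            stringent_depth = 50, permissive_depth = 30) {
  ifelse(genotype == "AB" & depth >= stringent_depth, "stringent",
         ifelse(depth >= permissive_depth, "permissive", "rejected"))
}

filter_demo <- classify_filter(depth = c(91, 49, 60, 20),
                               genotype = c("AB", "AB", "AAB", "AB"))
```

Under these assumptions a diploid variant at 91X is stringent, a diploid variant at 49X or a triploid variant at 60X is permissive, and a variant at 20X passes neither filter.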
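We now select 100 drivers among the 300 variants (indices 1 to 150 are stringent, 151 to 300 are permissive). The sampling weights make each permissive variant 19 times more likely to be picked than a stringent one, i.e. a probability of \(19/20\) of drawing a permissive variant on the first draw.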
drivers_id<-sample(1:(2*number_mutations),size = ndrivers,prob = rep(c(1/(20*number_mutations),
19/(20*number_mutations)),
each = number_mutations)
)
drivers_id<-drivers_id[order(drivers_id)]
drivers_id
## [1] 1 3 29 33 108 123 145 151 152 154 157 158 159 160 163 165 166
## [18] 168 169 170 172 173 175 177 178 179 181 182 183 188 189 190 191 192
## [35] 193 195 196 197 199 201 203 204 206 207 208 210 211 213 214 216 217
## [52] 218 219 220 222 223 225 226 227 228 229 230 231 232 234 236 237 239
## [69] 240 241 246 247 249 250 251 252 253 254 256 259 262 263 267 269 271
## [86] 272 278 279 282 283 287 289 292 293 295 296 297 298 299 300
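The weights passed to `sample()` can be checked directly. The sketch below only reuses the weights from the call above; the names `weights`, `is_permissive` and `total_by_filter` are introduced here for illustration.

```r
number_mutations <- 150

# Same weights as in the sample() call: stringent variants first, then permissive
weights <- rep(c(1 / (20 * number_mutations), 19 / (20 * number_mutations)),
               each = number_mutations)
is_permissive <- rep(c(FALSE, TRUE), each = number_mutations)

# Total weight carried by each group of variants
total_by_filter <- tapply(weights, is_permissive, sum)
```

The permissive group carries 19/20 of the total weight and the stringent group 1/20. Because `sample()` draws without replacement, the realised share of permissive drivers can drift from 19/20 as permissive variants are used up; in the run shown above, 93 of the 100 drivers have an index above 150.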
We now want to cluster mutations using only the filtered mutations (paper pipeline), the filtered mutations plus the drivers (extended), or all mutations together (All), and compare the clustering quality of these three approaches.
ext<-extended(filtered = toy.data,
permissive = permissive,
drivers_id = drivers_id)
all<-All(filtered = toy.data,
permissive = permissive,
drivers_id = drivers_id
)
pap<-paper_pipeline(filtered = toy.data,
permissive = permissive,
drivers_id = drivers_id)
We now compare the quality of clustering using the Normalized Mutual Information (NMI), the number of clusters found (the truth being 6), and the maximal and average error in the distance of a driver to its real position.
Quality<-compare_qual(paper = pap,
extended = ext,
all = all,
drivers_id = drivers_id)
kable(Quality)
Pipeline | NMI | Max.Distance.to.clone | nclusters | mean.mut.error | mean.driv.error | time |
---|---|---|---|---|---|---|
paper | 0.4124101 | 0.0360513 | 4 | 0.1695241 | 0.4055140 | 4.34 |
extended | 0.4290459 | 0.4155007 | 4 | 0.1395963 | 0.4116186 | 13.12 |
all | 0.4151676 | 0.4311365 | 4 | 0.3337322 | 0.4306405 | 15.77 |
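For readers unfamiliar with the NMI column, the following base R sketch shows one common way to compute it from two label vectors. It is not the code behind `compare_qual`: the function name `nmi` is hypothetical, and the normalization by the geometric mean of the two entropies is an assumption (other normalizations exist).

```r
# Normalized Mutual Information between a true and a predicted clustering.
# Assumption: NMI = I(truth; predicted) / sqrt(H(truth) * H(predicted)).
nmi <- function(truth, predicted) {
  stopifnot(length(truth) == length(predicted))
  joint <- table(truth, predicted) / length(truth)  # joint label distribution
  p_truth <- rowSums(joint)
  p_pred <- colSums(joint)

  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }

  # Mutual information, summed over non-empty cells only
  expected <- outer(p_truth, p_pred)
  nonzero <- joint > 0
  mutual_info <- sum(joint[nonzero] * log(joint[nonzero] / expected[nonzero]))

  h_truth <- entropy(p_truth)
  h_pred <- entropy(p_pred)
  # Convention chosen here: return 0 when either clustering has a single cluster
  if (h_truth == 0 || h_pred == 0) return(0)
  mutual_info / sqrt(h_truth * h_pred)
}

nmi_identical <- nmi(c(1, 1, 2, 2), c("a", "a", "b", "b"))
nmi_independent <- nmi(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

Two clusterings that agree up to a relabelling score 1, and two independent clusterings score 0, so values around 0.4 as in the table indicate partial agreement with the simulated clones.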
We now repeat this test 19 more times, for a total of 20 iterations.
Quality<-rbind(Quality,
reproduce(number_iterations-1,
number_mutations,
ndrivers)
)
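Once several iterations are stacked in one table, the metrics can be averaged per pipeline. The sketch below assumes a data frame with the same `Pipeline`, `NMI` and `nclusters` columns as `Quality`; `quality_demo` holds placeholder values made up for illustration, not results of the simulation.

```r
# Placeholder values for illustration only (two fake iterations per pipeline)
quality_demo <- data.frame(
  Pipeline = rep(c("paper", "extended", "all"), times = 2),
  NMI = c(0.40, 0.45, 0.35, 0.50, 0.55, 0.45),
  nclusters = c(4, 4, 4, 6, 5, 3)
)

# Mean of each metric per pipeline across iterations
summary_by_pipeline <- aggregate(cbind(NMI, nclusters) ~ Pipeline,
                                 data = quality_demo, FUN = mean)
```

The same call applied to `Quality` would give one row per pipeline, which may be easier to read than the raw table of 60 rows.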
We can plot these results: