N.B.: This simulation illustrates the paper pipeline when the biologically meaningful variants that are difficult to cluster are numerous compared with the variants passing the stringent filters.

Creating data

This script shows the importance of clustering high-confidence variants first and then attributing the biologically meaningful ones to the identified clusters.

All functions are stored in reproduce.R to keep the displayed code short. The values below are used throughout the test.

number_iterations <- 20
number_mutations <- 150
ndrivers <- 100

We first create a test set with QuantumCat: 6 clones, 150 diploid variants, an average depth of 100X, and two samples with purities of 70% and 60% respectively. We make sure that these variants pass the stringent filters (i.e. depth \(\geq 50\)X).

# contamination = 1 - purity, one value per sample (purities of 70% and 60%)
toy.data <- QuantumCat_stringent(number_of_clones = 6, number_of_mutations = number_mutations,
                                 ploidy = "AB", depth = 100,
                                 contamination = c(0.3, 0.4), min_depth = 50)

We check that all these variants pass the stringent filters (i.e. depth \(\geq 50\)X in both samples), and display the first six rows of the first sample:

sum(toy.data[[1]]$Depth<50 | toy.data[[2]]$Depth<50)
## [1] 0
kable(toy.data[[1]][1:6,])
Chr Start Genotype Cellularit number_of_copies Frequency Depth Alt
1 1 AB 100 1 35.00 91 20
6 2 AB 49 1 17.15 143 32
2 3 AB 28 1 9.80 121 10
1 4 AB 100 1 35.00 343 133
2 5 AB 28 1 9.80 51 8
6 6 AB 49 1 17.15 69 11
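
The Frequency column above is consistent with a simple expectation for diploid (AB) loci: the cellularity scaled by the sample purity and by the fraction of copies carrying the mutation. The sketch below is only a reading of the displayed rows, not the QuantumCat source; `expected_vaf` is a hypothetical helper, and the formula is assumed to hold for diploid loci only.

```r
# Hypothetical helper (not part of QuantumClone): expected variant allele
# frequency, in percent, for a mutation carried by `number_of_copies` copies
# out of `total_copies` at the locus.
# Assumption: diploid locus, so normal and tumour cells both carry 2 copies.
expected_vaf <- function(cellularity, purity,
                         number_of_copies = 1, total_copies = 2) {
  cellularity * purity * number_of_copies / total_copies
}

# Sample 1 has contamination 0.3, i.e. purity 0.7.
purity <- 1 - 0.3

# Cellularities (in percent) of the clones appearing in the rows above.
cellularity <- c(100, 49, 28)
vaf <- expected_vaf(cellularity, purity)
print(round(vaf, 2))
```

For the three cellularities shown (100, 49 and 28), this gives 35, 17.15 and 9.8, which matches the Frequency values in the table.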

We then create 150 mutations that pass only the permissive filters. To do so, we take 38 mutations with a depth between 30 and 50, 75 with a depth \(\geq 30\) in triploid (AAB) loci, and 38 with a depth \(\geq 30\) in tetraploid (AABB) loci.

permissive<-QuantumCat_permissive(fromQuantumCat = toy.data ,number_of_mutations = number_mutations,
                               ploidy = "AB",depth = 100,
                               contamination = c(0.3,0.4),max_depth = 50, min_depth = 30)
kable(permissive[[1]][1:6,])
Chr Start Cellularit Genotype number_of_copies Depth Frequency Alt
5 151 6 AB 1 49 2.10 1
4 152 13 AB 1 45 4.55 1
1 153 100 AB 1 48 35.00 19
3 154 25 AB 1 43 8.75 1
3 155 25 AB 1 41 8.75 3
3 156 25 AB 1 33 8.75 5

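We now select 100 drivers among the 300 mutations. The sampling weights make each draw 19 times more likely to pick a permissive mutation than a filtered one, i.e. a probability of \(19/20\) of being in the permissive filters on the first draw. Since sampling is without replacement, this share decreases as permissive mutations are drawn.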

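# Weights: 1/20 of the total for the 150 filtered mutations (ids 1-150),
# 19/20 for the 150 permissive ones (ids 151-300); sampled without replacement
drivers_id <- sample(1:(2 * number_mutations), size = ndrivers,
                     prob = rep(c(1 / (20 * number_mutations),
                                  19 / (20 * number_mutations)),
                                each = number_mutations))
drivers_id <- sort(drivers_id)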
drivers_id
##   [1]   1   3  29  33 108 123 145 151 152 154 157 158 159 160 163 165 166
##  [18] 168 169 170 172 173 175 177 178 179 181 182 183 188 189 190 191 192
##  [35] 193 195 196 197 199 201 203 204 206 207 208 210 211 213 214 216 217
##  [52] 218 219 220 222 223 225 226 227 228 229 230 231 232 234 236 237 239
##  [69] 240 241 246 247 249 250 251 252 253 254 256 259 262 263 267 269 271
##  [86] 272 278 279 282 283 287 289 292 293 295 296 297 298 299 300
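
To see what the `prob` argument implies, the weights can be inspected directly. The sketch below rebuilds the weight vector with the values used in this document and checks the share given to permissive mutations; `weights`, `permissive_share` and `observed_permissive` are illustrative names. Because `sample()` draws without replacement by default, the weight share applies exactly only to the first draw.

```r
number_mutations <- 150

# Same weights as in the sampling call: positions 1..150 are the filtered
# mutations, positions 151..300 the permissive ones.
weights <- rep(c(1 / (20 * number_mutations),
                 19 / (20 * number_mutations)),
               each = number_mutations)

# Total weight, and the part of it carried by permissive mutations.
total_weight <- sum(weights)
permissive_share <- sum(weights[(number_mutations + 1):(2 * number_mutations)])

# Drivers from the printed output that fall among the filtered mutations.
filtered_drivers <- c(1, 3, 29, 33, 108, 123, 145)
observed_permissive <- (100 - length(filtered_drivers)) / 100

print(c(total_weight = total_weight,
        permissive_share = permissive_share,
        observed_permissive = observed_permissive))
```

The weights sum to 1 and permissive mutations carry 0.95 of the total, while 93 of the 100 drivers printed above have an id above 150.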

We now cluster the mutations using only the filtered mutations (paper pipeline), the filtered mutations plus the drivers (extended), or all mutations together (all), and compare the clustering quality of these three approaches.

ext<-extended(filtered = toy.data,
              permissive = permissive,
              drivers_id = drivers_id)

all<-All(filtered = toy.data,
         permissive = permissive,
         drivers_id = drivers_id
)

pap<-paper_pipeline(filtered = toy.data,
                    permissive = permissive,
                    drivers_id = drivers_id)

We now compare the clustering quality using the Normalized Mutual Information (NMI), the number of clusters found (the truth being 6), and the maximal and average error in the distance of a driver to its true position.

Quality<-compare_qual(paper = pap,
                      extended = ext,
                      all = all,
                      drivers_id = drivers_id)

kable(Quality)
Pipeline NMI Max.Distance.to.clone nclusters mean.mut.error mean.driv.error time
paper 0.4124101 0.0360513 4 0.1695241 0.4055140 4.34
extended 0.4290459 0.4155007 4 0.1395963 0.4116186 13.12
all 0.4151676 0.4311365 4 0.3337322 0.4306405 15.77
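
For readers unfamiliar with the NMI column, the sketch below shows one common way to compute a Normalized Mutual Information between a true and an inferred clustering in base R. It is not the code behind `compare_qual`: `entropy` and `nmi` are hypothetical helpers, and the normalisation by the arithmetic mean of the two entropies is an assumption (other normalisations exist and give different values).

```r
# Hypothetical helper: Shannon entropy (in nats) of a vector of cluster labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log(p))
}

# Hypothetical helper: NMI between two clusterings of the same items,
# normalised by the arithmetic mean of the two entropies (an assumption).
nmi <- function(truth, found) {
  joint <- table(truth, found) / length(truth)     # joint label frequencies
  marg <- outer(rowSums(joint), colSums(joint))    # product of the marginals
  nz <- joint > 0                                  # skip empty cells
  mi <- sum(joint[nz] * log(joint[nz] / marg[nz])) # mutual information
  h <- (entropy(truth) + entropy(found)) / 2
  if (h == 0) return(1)  # both clusterings consist of a single cluster
  mi / h
}

truth <- c(1, 1, 2, 2, 3, 3)
relabelled <- c(2, 2, 3, 3, 1, 1)  # same partition with different labels
merged <- c(1, 1, 1, 1, 1, 1)      # everything merged into one cluster

print(nmi(truth, relabelled))
print(nmi(truth, merged))
```

With this definition, a clustering that recovers the true partition up to relabelling scores 1, and one that merges everything into a single cluster scores 0.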

We now repeat this test 19 more times, for a total of number_iterations = 20 runs.

Quality<-rbind(Quality,
               reproduce(number_iterations-1,
                         number_mutations,
                         ndrivers)
               )

We can plot these results:
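
A minimal sketch of one way to summarise and plot such a table, using base R only. It is not the plotting code of this document: the data frame below holds just the three rows displayed in the quality table, so each box collapses to a single line, whereas the full `Quality` table with 20 iterations would give one complete box per pipeline. The name `summary_table` is illustrative.

```r
# The three rows of the quality table shown earlier (first iteration only).
Quality <- data.frame(
  Pipeline = c("paper", "extended", "all"),
  NMI = c(0.4124101, 0.4290459, 0.4151676),
  mean.driv.error = c(0.4055140, 0.4116186, 0.4306405)
)

# Average each metric per pipeline. With one row per pipeline this simply
# returns the same values, ordered alphabetically by pipeline name.
summary_table <- aggregate(cbind(NMI, mean.driv.error) ~ Pipeline,
                           data = Quality, FUN = mean)
print(summary_table)

# One box per pipeline for the NMI.
pdf(NULL)  # null graphics device; drop this line to plot interactively
boxplot(NMI ~ Pipeline, data = Quality,
        xlab = "Pipeline", ylab = "NMI")
invisible(dev.off())
```

The same two calls apply unchanged to the full table once all iterations have been bound together, and other columns such as `mean.driv.error` can be plotted by swapping the left-hand side of the formula.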