Brings Seurat to the tidyverse!
website: stemangiola.github.io/tidyseurat/
Please have a look also to
Seurat-compatible Functions | Description |
---|---|
all |
After all tidyseurat is a Seurat object, just better |
tidyverse Packages | Description |
---|---|
dplyr |
All dplyr APIs like for any tibble |
tidyr |
All tidyr APIs like for any tibble |
ggplot2 |
ggplot like for any tibble |
plotly |
plot_ly like for any tibble |
Utilities | Description |
---|---|
tidy |
Add tidyseurat invisible layer over a Seurat object |
as_tibble |
Convert cell-wise information to a tbl_df |
join_transcripts |
Add transcript-wise information, returns a tbl_df |
From CRAN (in submission)
install.packages("tidyseurat")
From Github
devtools::install_github("stemangiola/tidyseurat")
tidyseurat
, the best of both worlds!This is a seurat object but it is evaluated as tibble. So it is fully compatible both with Seurat and tidyverse APIs.
pbmc_small_tidy = tidyseurat::pbmc_small %>% tidy()
It looks like a tibble
pbmc_small_tidy
## # A tibble: 80 x 16
## cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
## <chr> <fct> <dbl> <int> <fct> <fct> <chr>
## 1 ATGC… SeuratPro… 70 47 0 A g2
## 2 CATG… SeuratPro… 85 52 0 A g1
## 3 GAAC… SeuratPro… 87 50 1 B g2
## 4 TGAC… SeuratPro… 127 56 0 A g2
## 5 AGTC… SeuratPro… 173 53 0 A g2
## 6 TCTG… SeuratPro… 70 48 0 A g1
## 7 TGGT… SeuratPro… 64 36 0 A g1
## 8 GCAG… SeuratPro… 72 45 0 A g1
## 9 GATA… SeuratPro… 52 36 0 A g1
## 10 AATG… SeuratPro… 100 41 0 A g1
## # … with 70 more rows, and 9 more variables: RNA_snn_res.1 <fct>, file <chr>,
## # PC_1 <dbl>, PC_2 <dbl>, PC_3 <dbl>, PC_4 <dbl>, PC_5 <dbl>, tSNE_1 <dbl>,
## # tSNE_2 <dbl>
But it is a Seurat object after all
pbmc_small_tidy@assays
## $RNA
## Assay data with 230 features for 80 cells
## Top 10 variable features:
## PPBP, IGLL5, VDAC3, CD1C, AKR1C3, PF4, MYL9, GNLY, TREML1, CA2
We may have a column that contains the directory each run was taken from. We may want to extract the run/sample name out of it.
pbmc_small_polished =
pbmc_small_tidy %>%
extract(file, "sample", "../data/([a-z0-9]+)/outs.+", remove = F)
pbmc_small_polished %>%
select(sample, everything())
## # A tibble: 80 x 17
## cell sample orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents
## <chr> <chr> <fct> <dbl> <int> <fct> <fct>
## 1 ATGC… sampl… SeuratPro… 70 47 0 A
## 2 CATG… sampl… SeuratPro… 85 52 0 A
## 3 GAAC… sampl… SeuratPro… 87 50 1 B
## 4 TGAC… sampl… SeuratPro… 127 56 0 A
## 5 AGTC… sampl… SeuratPro… 173 53 0 A
## 6 TCTG… sampl… SeuratPro… 70 48 0 A
## 7 TGGT… sampl… SeuratPro… 64 36 0 A
## 8 GCAG… sampl… SeuratPro… 72 45 0 A
## 9 GATA… sampl… SeuratPro… 52 36 0 A
## 10 AATG… sampl… SeuratPro… 100 41 0 A
## # … with 70 more rows, and 10 more variables: groups <chr>,
## # RNA_snn_res.1 <fct>, file <chr>, PC_1 <dbl>, PC_2 <dbl>, PC_3 <dbl>,
## # PC_4 <dbl>, PC_5 <dbl>, tSNE_1 <dbl>, tSNE_2 <dbl>
We can treat pbmc_small_polished
effectively as a normal tibble for plotting.
Here we plot number of transcripts per cell
pbmc_small_polished %>%
tidyseurat::ggplot(aes(nFeature_RNA, fill=groups)) +
geom_histogram() +
my_theme
Here we plot total transcriptional material per cell
pbmc_small_polished %>%
tidyseurat::ggplot(aes(groups, nCount_RNA, fill=groups)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1) +
my_theme
Here we plot abundance of two transcripts for each group
pbmc_small_polished %>%
join_transcripts(transcripts = c("HLA-DRA" , "LYZ" )) %>%
ggplot(aes(groups, abundance_RNA + 1, fill=groups)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(size=nCount_RNA), alpha=0.5, width = 0.2) +
scale_y_log10() +
my_theme
Also you can treat the object as Seurat object and proceed with data processing.
pbmc_small_pca =
pbmc_small_polished %>%
SCTransform(verbose = FALSE) %>%
FindVariableFeatures(verbose = FALSE) %>%
RunPCA(verbose = FALSE)
pbmc_small_pca
## # A tibble: 80 x 19
## cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
## <chr> <fct> <dbl> <int> <fct> <fct> <chr>
## 1 ATGC… SeuratPro… 70 47 0 A g2
## 2 CATG… SeuratPro… 85 52 0 A g1
## 3 GAAC… SeuratPro… 87 50 1 B g2
## 4 TGAC… SeuratPro… 127 56 0 A g2
## 5 AGTC… SeuratPro… 173 53 0 A g2
## 6 TCTG… SeuratPro… 70 48 0 A g1
## 7 TGGT… SeuratPro… 64 36 0 A g1
## 8 GCAG… SeuratPro… 72 45 0 A g1
## 9 GATA… SeuratPro… 52 36 0 A g1
## 10 AATG… SeuratPro… 100 41 0 A g1
## # … with 70 more rows, and 12 more variables: RNA_snn_res.1 <fct>, file <chr>,
## # sample <chr>, nCount_SCT <dbl>, nFeature_SCT <int>, PC_1 <dbl>, PC_2 <dbl>,
## # PC_3 <dbl>, PC_4 <dbl>, PC_5 <dbl>, tSNE_1 <dbl>, tSNE_2 <dbl>
If a tool is not included in the tidyseurat collection, we can use as_tibble
to permanently convert tidyseurat
into tibble
pbmc_small_pca %>%
as_tibble %>%
select(contains("PC"), everything()) %>%
GGally::ggpairs(columns = 1:5, ggplot2::aes(colour=groups))
We proceed with cluster identification
pbmc_small_cluster =
pbmc_small_pca %>%
FindNeighbors(verbose=FALSE) %>%
FindClusters(method = "igraph", verbose=FALSE )
pbmc_small_cluster
## # A tibble: 80 x 21
## cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
## <chr> <fct> <dbl> <int> <fct> <fct> <chr>
## 1 ATGC… SeuratPro… 70 47 0 A g2
## 2 CATG… SeuratPro… 85 52 0 A g1
## 3 GAAC… SeuratPro… 87 50 1 B g2
## 4 TGAC… SeuratPro… 127 56 0 A g2
## 5 AGTC… SeuratPro… 173 53 0 A g2
## 6 TCTG… SeuratPro… 70 48 0 A g1
## 7 TGGT… SeuratPro… 64 36 0 A g1
## 8 GCAG… SeuratPro… 72 45 0 A g1
## 9 GATA… SeuratPro… 52 36 0 A g1
## 10 AATG… SeuratPro… 100 41 0 A g1
## # … with 70 more rows, and 14 more variables: RNA_snn_res.1 <fct>, file <chr>,
## # sample <chr>, nCount_SCT <dbl>, nFeature_SCT <int>, SCT_snn_res.0.8 <fct>,
## # seurat_clusters <fct>, PC_1 <dbl>, PC_2 <dbl>, PC_3 <dbl>, PC_4 <dbl>,
## # PC_5 <dbl>, tSNE_1 <dbl>, tSNE_2 <dbl>
Now we can interrogate the object as if it was a regular tibble data frame
pbmc_small_cluster %>%
tidyseurat::count(groups, seurat_clusters)
## # A tibble: 8 x 3
## groups seurat_clusters n
## <chr> <fct> <int>
## 1 g1 0 17
## 2 g1 1 14
## 3 g1 2 9
## 4 g1 3 4
## 5 g2 0 13
## 6 g2 1 12
## 7 g2 2 6
## 8 g2 3 5
We can identify cluster markers using Seurat
# Identify markers
markers =
pbmc_small_cluster %>%
FindAllMarkers(only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25) %>%
group_by(cluster) %>%
top_n(10, avg_logFC)
# Plot heatmap
pbmc_small_cluster %>%
DoHeatmap(features = markers$gene)
We can calculate the first 3 UMAP dimensions using Seurat framework
pbmc_small_UMAP =
pbmc_small_cluster %>%
RunUMAP(reduction = "pca", dims = 1:15, n.components = 3L, )
and we can plot them using 3D plot using plotly
pbmc_small_UMAP %>%
plot_ly(
x = ~`UMAP_1`,
y = ~`UMAP_2`,
z = ~`UMAP_3`,
color = ~ seurat_clusters
)
We can infer cell type identities using singleR
blueprint = SingleR::BlueprintEncodeData()
cell_type_df =
pbmc_small_UMAP@assays[["SCT"]]@counts %>%
log1p %>%
Matrix::Matrix(sparse = TRUE) %>%
SingleR::SingleR(
ref = blueprint ,
labels = blueprint$label.main,
method = "single"
) %>%
as.data.frame() %>%
as_tibble(rownames="cell") %>%
select(cell, first.labels)
pbmc_small_cell_type =
pbmc_small_UMAP %>%
left_join(cell_type_df, by="cell")
pbmc_small_cell_type %>%
tidyseurat::select(cell, first.labels, everything())
## # A tibble: 80 x 25
## cell first.labels orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8
## <chr> <chr> <fct> <dbl> <int> <fct>
## 1 ATGC… CD4+ T-cells SeuratPro… 70 47 0
## 2 CATG… CD8+ T-cells SeuratPro… 85 52 0
## 3 GAAC… CD8+ T-cells SeuratPro… 87 50 1
## 4 TGAC… CD4+ T-cells SeuratPro… 127 56 0
## 5 AGTC… CD4+ T-cells SeuratPro… 173 53 0
## 6 TCTG… CD4+ T-cells SeuratPro… 70 48 0
## 7 TGGT… CD4+ T-cells SeuratPro… 64 36 0
## 8 GCAG… CD4+ T-cells SeuratPro… 72 45 0
## 9 GATA… CD4+ T-cells SeuratPro… 52 36 0
## 10 AATG… CD4+ T-cells SeuratPro… 100 41 0
## # … with 70 more rows, and 19 more variables: letter.idents <fct>,
## # groups <chr>, RNA_snn_res.1 <fct>, file <chr>, sample <chr>,
## # nCount_SCT <dbl>, nFeature_SCT <int>, SCT_snn_res.0.8 <fct>,
## # seurat_clusters <fct>, PC_1 <dbl>, PC_2 <dbl>, PC_3 <dbl>, PC_4 <dbl>,
## # PC_5 <dbl>, tSNE_1 <dbl>, tSNE_2 <dbl>, UMAP_1 <dbl>, UMAP_2 <dbl>,
## # UMAP_3 <dbl>
We can easily summarise the results. For example, see how cell type classification overlaps with cluster classification.
pbmc_small_cell_type %>%
count(seurat_clusters, first.labels)
## # A tibble: 9 x 3
## seurat_clusters first.labels n
## <fct> <chr> <int>
## 1 0 CD4+ T-cells 8
## 2 0 CD8+ T-cells 10
## 3 0 NK cells 12
## 4 1 Macrophages 1
## 5 1 Monocytes 25
## 6 2 B-cells 10
## 7 2 Macrophages 1
## 8 2 Monocytes 4
## 9 3 Erythrocytes 9
We can easily reshape the data for building information-rish faceted plots
pbmc_small_cell_type %>%
# Reshaping
pivot_longer(
cols=c(seurat_clusters, first.labels),
names_to = "classifier", values_to = "label"
) %>%
# Plotting
ggplot(aes(UMAP_1, UMAP_2, color=label)) +
geom_point() +
facet_wrap(~classifier) +
my_theme
We can easily plot gene correlation per cell category, adding multi-layer annotations
pbmc_small_cell_type %>%
# Add mitochondrial abundance
mutate(mitochondrial = rnorm(n())) %>%
# Plot correlation
join_transcripts(transcripts = c("CST3" , "LYZ" ), shape = "wide") %>%
ggplot(aes(CST3 +1, LYZ + 1, color=groups, size=mitochondrial)) +
geom_point() +
facet_wrap(~first.labels, scales = "free") +
scale_x_log10() +
scale_y_log10() +
my_theme
A powerful tool we can use with tidyseurat is nest
. We can easily perform independent analyses on subsets of the dataset. First we classify cell types in lymphoid and myeloid; then, nest based on the new classification
pbmc_small_nested =
pbmc_small_cell_type %>%
filter(first.labels != "Erythrocytes") %>%
mutate(cell_class = if_else(`first.labels` %in% c("Macrophages", "Monocytes"), "myeloid", "lmphoid")) %>%
nest(data = -cell_class)
pbmc_small_nested
## # A tibble: 2 x 2
## cell_class data
## <chr> <list>
## 1 lmphoid <tidysert>
## 2 myeloid <tidysert>
Now we can independently (i) find variable features, (ii) reduce dimensions, and (iii) cluster. Using both tidyverse and Seurat seamlessy
pbmc_small_nested_reanalysed =
pbmc_small_nested %>%
mutate(data = map(
data, ~ .x %>%
FindVariableFeatures(verbose = FALSE) %>%
RunPCA(npcs = 10, verbose = FALSE) %>%
FindNeighbors(verbose = FALSE) %>%
FindClusters(method = "igraph" , verbose = FALSE) %>%
RunUMAP(reduction = "pca", dims = 1:10, n.components = 3L, verbose = FALSE)
))
pbmc_small_nested_reanalysed
## # A tibble: 2 x 2
## cell_class data
## <chr> <list>
## 1 lmphoid <tidysert>
## 2 myeloid <tidysert>
Now we can unnest and plot the new clasification
pbmc_small_nested_reanalysed %>%
# Convert to tibble otherwise Seurat drops reduced dimensions when unifying data sets.
mutate(data = map(data, ~ .x %>% as_tibble)) %>%
unnest(data) %>%
# Define unique clusters
unite("cluster", c(cell_class, seurat_clusters), remove=FALSE) %>%
# Plotting
ggplot(aes(UMAP_1, UMAP_2, color=cluster)) +
geom_point() +
facet_wrap(~cell_class) +
my_theme
We can perform a large number of functional analyses on data subsets. For example intra-sample cell-cell interactions, for then comparing/testing whether interactions have strenghten/weaken across conditions. In this small dataset we have just two sample (one for condition) but you can imagine how you can use tidyverse to perform t-tests and visualisation.
pbmc_small_nested_interactions =
pbmc_small_nested_reanalysed %>%
# Unnest based on cell category
unnest(data) %>%
# Create unambiguous clusters
mutate(integrated_clusters = first.labels %>% as.factor %>% as.integer) %>%
# Nest based on sample
tidyseurat::nest(data = -sample) %>%
tidyseurat::mutate(interactions = map(data, ~ {
# Produce variables. Yuck!
cluster = .x@meta.data$integrated_clusters
data = data.frame(.x[["SCT"]]@data)
# Ligand/Receptor analysis using SingleCellSignalR
data %>%
cell_signaling(genes=rownames(data),cluster=cluster) %>%
inter_network(data = data, signal = ., genes = rownames(data), cluster = cluster) %$%
`individual-networks` %>%
map_dfr(~bind_rows(as_tibble(.x)))
}))
pbmc_small_nested_interactions %>%
select(-data) %>%
unnest(interactions)
If the data set was not so small, and interactions could be identified, you would see something as below.
## # A tibble: 100 x 9
## sample ligand receptor ligand.name receptor.name origin destination
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 sampl… clust… cluster… PTMA VIPR1 clust… cluster 2
## 2 sampl… clust… cluster… B2M KLRD1 clust… cluster 2
## 3 sampl… clust… cluster… IL16 CD4 clust… cluster 2
## 4 sampl… clust… cluster… HLA-B KLRD1 clust… cluster 2
## 5 sampl… clust… cluster… CALM1 VIPR1 clust… cluster 2
## 6 sampl… clust… cluster… HLA-E KLRD1 clust… cluster 2
## 7 sampl… clust… cluster… GNAS VIPR1 clust… cluster 2
## 8 sampl… clust… cluster… B2M HFE clust… cluster 2
## 9 sampl… clust… cluster… PTMA VIPR1 clust… cluster 3
## 10 sampl… clust… cluster… CALM1 VIPR1 clust… cluster 3
## # … with 90 more rows, and 2 more variables: interaction.type <chr>,
## # LRscore <dbl>