Please report comments or bugs to Pierre Lefeuvre - pierre.lefeuvre@cirad.fr
Phylogenetic placements represents mapping of query sequences to a reference phylogenetic tree. Put it another way, it corresponds to the position in a tree where a query sequence fits. Different tools exits to infer phylogenetic placements with the EPA algorithm and pplacer. Importantly, both EPA and pplacer produces placements under a common file format. Placements can later be analysed using the guppy software from the pplacer suite to obtain statistically based taxonomic classification of sequences. The BoSSA package implements functions to reads, plots and summarizes phylogentic placements. This vignette is intended to provide functional examples of placements analyses using BoSSA.
Assuming you dispose of some query sequences (either full sequences,contigs or short reads) you want to assign to a phylogeny/taxonomy, the process to obtain phylogenetic placement is to (1) build a reference package that contains an align set of reference sequences and a reference phylogenetic tree, (2) align query sequences to the reference alignment, (3) use EPA or pplacer to infer placements (jplace file output, format describe here) and optionally (4) infer the classification of each sequences using guppy (sqlite file output).
Let’s say you now dispose of a reference package (refpkg), a placement file (jplace file) and a guppy classification output (sqlite file). The example files presented here are derived from other reference packages and jplaces files from the Matsen group pplacer tutorials. The sqlite file was obtained using the following command:
guppy classify --bayes-cutoff 0 --multiclass-min 0 --cutoff 0 -c example.refpkg --sqlite example.sqlite example.jplace
Let’s start by loading the `BoSSA-package:
library("BoSSA")
A good practice would be to inspect the refpkg content.
refpkg_path <- paste(find.package("BoSSA"),"/extdata/example.refpkg",sep="")
refpkg(refpkg_path)
## ### Reference package summary
##
## Path:/tmp/RtmpdNIu3c/Rinst31784f226967/BoSSA/extdata/example.refpkg
##
## Tree with 652 tips 650 nodes
##
## Classification:
## root 1
## below_root 1
## superkingdom 1
## below_superkingdom 1
## below_below_superkingdom 1
## superphylum 1
## phylum 6
## subphylum 1
## class 11
## subclass 2
## order 15
## below_order 3
## below_below_order 1
## suborder 3
## family 28
## below_family 5
## genus 45
## species_group 6
## species_subgroup 1
## species 138
It is possible to extract the taxonomy of the sequences included in the refpkg.
taxo <- refpkg(refpkg_path,type="taxonomy")
head(taxo)
## tax_id parent_id rank tax_name
## 1 1 1 root root
## 2 131567 1 below_root cellular organisms
## 3 2 131567 superkingdom Bacteria
## 4 2323 2 below_superkingdom unclassified Bacteria
## 5 95818 2323 below_below_superkingdom candidate division TM7
## 6 68336 2 superphylum Bacteroidetes/Chlorobi group
## root below_root superkingdom below_superkingdom below_below_superkingdom
## 1 1
## 2 1 131567
## 3 1 131567 2
## 4 1 131567 2 2323
## 5 1 131567 2 2323 95818
## 6 1 131567 2
## superphylum phylum subphylum class subclass order below_order
## 1
## 2
## 3
## 4
## 5
## 6 68336
## below_below_order suborder family below_family genus species_group
## 1
## 2
## 3
## 4
## 5
## 6
## species_subgroup species below_species
## 1
## 2
## 3
## 4
## 5
## 6
or display a pie chart that summarize the taxonomy…
refpkg(refpkg_path,type="pie",cex.text=0.5)
… or a subset of the taxonomy levels. Here, an example with the “class”, “order” and “family” levels. Note there is a slight decay between the text labels and slices… this will need a fix for next package update.
refpkg(refpkg_path,type="pie",rank_pie=c("class","order","family"),cex.text=0.6)
Finally, a tree display with branch colored according to a given taxonomic level is available. Here tips are colored according to the “order” classification.
refpkg(refpkg_path,type="tree",rank_tree="class",cex.text=0.5)
The BoSSA package come along with examples of phylogenetic placements from the Mastens group.
sqlite_file <- system.file("extdata", "example.sqlite", package = "BoSSA")
jplace_file <- system.file("extdata", "example.jplace", package = "BoSSA")
To read the data, use the read_sqlite
function.
pplace <- read_sqlite(sqlite_file,jplace_file)
pplace
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --bayes-cutoff 0 --multiclass-min 0 --cutoff 0 -c vaginal_16s.refpkg --sqlite example.sqlite example.jplace
## sequence nb: 100
## placement nb: 30
On screen a summary of the object is printed with the number of runs, the command line, the number of placements and the number of sequences being placed. Pplace objects are list with five components :
str(pplace)
## List of 5
## $ run :'data.frame': 1 obs. of 2 variables:
## ..$ run_id: int 1
## ..$ params: chr "/home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --bayes-cutoff 0 --multiclass-min 0 --c"| __truncated__
## $ taxo :'data.frame': 281 obs. of 3 variables:
## ..$ tax_id : chr [1:281] "1" "103621" "109790" "113286" ...
## ..$ tax_name: chr [1:281] "root" "Actinomyces urogenitalis" "Lactobacillus jensenii" "Pseudoramibacter" ...
## ..$ rank : chr [1:281] "root" "species" "species" "genus" ...
## $ multiclass:'data.frame': 100 obs. of 6 variables:
## ..$ placement_id: int [1:100] 28 22 1 28 5 3 9 28 17 28 ...
## ..$ name : chr [1:100] "GLKT0ZE01A0Y5X" "GLKT0ZE01A1B65" "GLKT0ZE01A608I" "GLKT0ZE01A6F1M" ...
## ..$ want_rank : chr [1:100] "species" "species" "species" "species" ...
## ..$ rank : chr [1:100] "species" "species" "species" "species" ...
## ..$ tax_id : chr [1:100] "40543" "187101" "906_1" "40543" ...
## ..$ likelihood : num [1:100] 1 0.857 1 1 1 ...
## $ placement :'data.frame': 193 obs. of 9 variables:
## ..$ placement_id : int [1:193] 1 1 1 1 1 1 1 2 2 2 ...
## ..$ location : num [1:193] 555 556 553 552 557 554 550 111 115 113 ...
## ..$ ml_ratio : num [1:193] 0.143 0.143 0.143 0.143 0.143 ...
## ..$ log_like : num [1:193] -11608 -11608 -11608 -11608 -11608 ...
## ..$ distal_bl : num [1:193] 5.15e-07 5.15e-07 5.15e-07 7.80e-06 5.78e-06 ...
## ..$ pendant_bl : num [1:193] 0.0584 0.0584 0.0584 0.0584 0.0584 ...
## ..$ tax_id : chr [1:193] "906_1" "906_1" "906_1" "906_1" ...
## ..$ map_identity_ratio: num [1:193] NA NA NA NA NA NA NA NA NA NA ...
## ..$ map_identity_denom: int [1:193] NA NA NA NA NA NA NA NA NA NA ...
## $ arbre :List of 5
## ..$ edge : int [1:1301, 1:2] 653 654 655 656 657 658 658 657 656 659 ...
## ..$ Nnode : int 650
## ..$ tip.label : chr [1:652] "S000010758" "S002034883" "S000006177" "S001188095" ...
## ..$ edge.length: num [1:1301] 0.0014 0.2102 0.0728 0.1295 0.1131 ...
## ..$ root.edge : num 0
## ..- attr(*, "class")= chr "phylo"
## ..- attr(*, "order")= chr "cladewise"
## - attr(*, "class")= chr "pplace"
run
element contains the run id and the command line summarytaxo
element is a data frame with the whole taxonomy available in the reference packagemulticlass
element is a data frame with the taxonomic assignation of each placementplacement
element is a data frame with the position of each placement over the reference phylogenetic treearbre
element is the class phylo
object of the reference phylogenetic treeThree different tree plots are available to display placements:
number
plot. Placement number associated to each branch is indicated. Note that this representation may be hard to read due to overlaps between number boxes. Placement numbers are obtained after the multiplication of their sizes with the ML ratio of the placement probabilities. Placement sizes are later round. A zero indicates a size superior to 0 but inferior to 1.plot(pplace,type="number",main="number",cex.number=1.5)
color
plot is the best option. Branches with placement are colored according to the number of sequences they bear.plot(pplace,type="color",main="color",edge.width=2)
precise
plot dots are drawn at the exact placement positions. Whereas the color of the dots depend of the pendant branch length, their sizes depend on the placement sizes. Note that placements are drawn one above the other.plot(pplace,type="precise",main="precise")
Note that it is possible to apply a function to modify the dot size using the transfo
option. In the following example, the dot size is multiplied by 2. In some other cases log
or log10
transformations could be usefull. Beware that when using the transfo
option, the legend does not anymore correspond to the placement size but to the transform dot size (i.e. the transform function applied to the dot size).
plot(pplace,type="precise",main="precise",transfo=function(X){X*2})
Placement object can be subseted. This could be done using placements ids…
sub1 <- sub_pplace(pplace,placement_id=1:100)
sub1
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --bayes-cutoff 0 --multiclass-min 0 --cutoff 0 -c vaginal_16s.refpkg --sqlite example.sqlite example.jplace
## sequence nb: 100
## placement nb: 30
…or using placements names.
ids <- sample(pplace$multiclass$name,50)
sub2 <- sub_pplace(pplace,ech_id=ids)
sub2
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --bayes-cutoff 0 --multiclass-min 0 --cutoff 0 -c vaginal_16s.refpkg --sqlite example.sqlite example.jplace
## sequence nb: 50
## placement nb: 23
Using the pplace_to_table
function produces a table taht contains the placement information along with the classification for each sequence. The output can be limited to the “best” placement (as in the example, i.e. the placements with the highest likelihood for each sequence).
pplace_table <- pplace_to_table(pplace,type="best")
head(pplace_table,n=3)
## placement_id name want_rank rank tax_id likelihood location
## 1 28 GLKT0ZE01A0Y5X species species 40543 1.0000000 161
## 2 22 GLKT0ZE01A1B65 species species 187101 0.8571653 152
## 3 1 GLKT0ZE01A608I species species 906_1 1.0000000 555
## ml_ratio log_like distal_bl pendant_bl tax_id.1
## 1 0.1428572 -11098.20 5.149085e-07 6.113515e-06 40543
## 2 0.1428611 -11098.20 5.149085e-07 6.113515e-06 187101
## 3 0.1428572 -11607.93 5.149085e-07 5.842901e-02 906_1
## map_identity_ratio map_identity_denom
## 1 NA NA
## 2 NA NA
## 3 NA NA
The pplace_to_matrix
produces a contingency table. Let say the first 50 sequences in the multiclass table correspond to sequence from “sample 1” and the following 50 correspond to “sample 2”, the function output a ocntingency table for these two samples. You can either have the taxonomic names (tax_name=TRUE, in the example) or keep the taxonomic ids (tax_name=FALSE).
example_contingency <- pplace_to_matrix(pplace,c(rep("sample1",50),rep("sample2",50)),tax_name=TRUE)
example_contingency
## Sneathia sanguinegens Leptotrichia amnionii Megasphaera sp. type 1
## sample1 18 10 6
## sample2 14 14 5
## Atopobium vaginae Gardnerella vaginalis Dialister sp. type 2
## sample1 1 2 1
## sample2 1 3 0
## Prevotella genogroup 1 Lactobacillus iners BVAB1
## sample1 3 4 1
## sample2 5 0 2
## Prevotella genogroup 2 Sneathia Prevotella genogroup 3
## sample1 1 2 1
## sample2 0 0 1
## Aerococcus christensenii Parvimonas micra Eggerthella
## sample1 0 0 0
## sample2 1 2 2
Using the pplace_to_taxonomy
function, a taxonomy table is compiled for each sequences with the taxonomy levels defined in the reference package. The taxonomy levels can be limited to a set of levels using the rank
option.
example_taxo <- pplace_to_taxonomy(pplace,taxo,tax_name=TRUE,rank=c("order","family","genus","species"))
head(example_taxo)
## order family genus
## GLKT0ZE01A0Y5X "Fusobacteriales" "Fusobacteriaceae" "Sneathia"
## GLKT0ZE01A1B65 "Fusobacteriales" "Fusobacteriaceae" "Sneathia"
## GLKT0ZE01A608I "Clostridiales" "Veillonellaceae" "Megasphaera"
## GLKT0ZE01A6F1M "Fusobacteriales" "Fusobacteriaceae" "Sneathia"
## GLKT0ZE01AH4NR "Coriobacteriales" "Coriobacteriaceae" "Atopobium"
## GLKT0ZE01AHXIQ "Bifidobacteriales" "Bifidobacteriaceae" "Gardnerella"
## species
## GLKT0ZE01A0Y5X "Sneathia sanguinegens"
## GLKT0ZE01A1B65 "Leptotrichia amnionii"
## GLKT0ZE01A608I "Megasphaera sp. type 1"
## GLKT0ZE01A6F1M "Sneathia sanguinegens"
## GLKT0ZE01AH4NR "Atopobium vaginae"
## GLKT0ZE01AHXIQ "Gardnerella vaginalis"
Assuming the sequences in the pplace object represent centroids of sequence cluster you obtained from multiple samples, using the taxonomy table and an appropriate OTU file, you can create a phyloseq object.
example_OTU <- matrix(sample(1:100, 500, replace = TRUE), nrow = 100, ncol = 5,dimnames=list(pplace$multiclass$name,paste("sample",1:5,sep="_")))
head(example_OTU)
## sample_1 sample_2 sample_3 sample_4 sample_5
## GLKT0ZE01A0Y5X 42 96 34 74 36
## GLKT0ZE01A1B65 23 8 33 17 28
## GLKT0ZE01A608I 43 49 77 75 15
## GLKT0ZE01A6F1M 34 68 71 6 96
## GLKT0ZE01AH4NR 62 2 21 62 29
## GLKT0ZE01AHXIQ 77 96 20 92 85
library(phyloseq)
example_phyloseq <- phyloseq(otu_table(example_OTU,taxa_are_rows=TRUE),tax_table(example_taxo))
example_phyloseq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 100 taxa and 5 samples ]
## tax_table() Taxonomy Table: [ 100 taxa by 4 taxonomic ranks ]
If you find phyloseq and/or its tutorials useful, you may cite BoSSA:
citation("BoSSA")
##
## To cite package 'BoSSA' in publications use:
##
## Pierre Lefeuvre (2017). BoSSA: A Bunch of Structure and Sequence
## Analysis. R package version 2.3.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {BoSSA: A Bunch of Structure and Sequence Analysis},
## author = {Pierre Lefeuvre},
## year = {2017},
## note = {R package version 2.3},
## }
##
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.