SubVis

Scott Barlowe

2016-06-15

Description:

Software tool for visual analysis of substitution matrix effects on pairwise protein sequence     alignment.  Utilizes Shiny for interface components, R for alignment processing, and              JavaScript for visualization.

Project Pages:

  1.  https://github.com/sabarlowe/SubVis (includes instructional video)
  2.  https://cran.r-project.org/

Known Issues:

  1.  Zooming and Text - In the detail view, some text representing substitutions and scores
      becomes unaligned with the corresponding matrix type with zooming.

Included Data (inst\extdata\Example_custom_matrix and inst\extdata\Example_FASTA_files)

  1.  Two protein sequences in FASTA format (gpr12.fasta, gpr6.fasta)
  2.  A custom matrix developed by Rios et al. (gpcrtm.txt)

Getting started:

1.  Install R Studio (R version  version 3.3.0 or later).

2.  Install required R packages (and any dependencies):

      'shiny' 
    
      'Biostrings'

3.  Install and load SubVis package

4.  From a RStudio command line lauch 
    the SubVis application with

      '> startSubVis()'

Main tabs: Options and Viz

After launching, there are two main tabs:  Options and Viz

Options tab: Loading data, matrices, and parameters

1.  Protein sequences (one per file) must be 
    in FASTA format.  There are two example 
    FASTA files in the test folder.

2.  BLOSUM and PAM matrices can be selected 
    by checking the appropriate box.

3.  A custom matrix in the form of the 
    predefined matrices can be loaded from 
    a text file.  The basic form is 
  
      First row --> Characters representing lookup table.
  
      First col --> Transpose of first row starting at row 2
    
    All other entries are substitution values.
    

4.  The extension penalty, gap penalty, and score 
    type can be selected by text boxes and a drop-down 
    menu.

Note: Each time a change in the options tab is made, 
      the GO button (in the Overview or detail view) 
      must be clicked.

Viz tab: Overview and Detail view

Overview

Each row (except the last) represents a percent identity 
type.  Within each row, the pid for the selected matrices 
are sorted.  The last row is the sorted overall score.  

Interaction:  Display initiation
  
  1.  Click 'Go' to initiate display.

Interaction:  Mouse move

  1.  When the mouse moves over a matrix type, 
      the corresponding matrix in the other 
      rows are identified.

Interaction:  Size

  1.  '+' enlarges the display and '-' shrinks the 
      display.

Detail

Score is located under the matrix type for 
each alignment pair.  Amino acids are represented
by a color-coded box.

Interaction:  Size

  1.  '+' enlarges the display.

  2.  '-' shrinks the display.

Interaction:  Mouse over an amino acid.

  1.  Histogram showing the number of each amino 
      acid in that column.

  2.  Log-odds score and the amino acid substitution 
      displayed under the score for each alignment.

  3.  Amino acid, log-odds score, amino acid subsitution 
      is shown in the top right.

Interaction: Classification

  1.  Classify amino acids according to minor modification to [1] and
      whether an amino acid substitution is conserved (log-odds > 0)
      or not (log-odds < 0)

Interaction: Nagivation

  1.  'Start' - Go to position 1.

  2.  'End' - Go to the end of the maximum alignment.

  3.  'Pair' - Show both the pattern and subject for each matrix type.

  4.  'Patt' - Show only the pattern for each matrix type.

  5.  'Subj' - Show only the subject for each matrix type.

  6.  'Up' - Scroll up if all matrix types will not fit onto display space.

  7.  'Dn' - Scroll down if all matrix types will not fit onto display space.

  8.  '<-' - Go backward in the alignment.

  9.  '->' - Go forward in the alignment.

  10. 'Alpha' - Show amino acid instead of boxes

  11. 'Pos' - Go to a position by entering column in text box. (Requires clicking 'Go')

Interaction: Search (All require clicking 'Go')

  1.  'NONE' - No search function on (default).

  2.  'INDEL' - Shows the insertions and deletions in red.

  3.  'MATCH' - Shows the positions that match in both the pattern and subject.

  4.  'SEQ' - The amino acid to be searched for.

Notes:

1.  Log-odds score for pairs involving a gap are
    reported as undefined.

2.  Most error checking is delegated to the individual 
    R and Biostrings functions.  For example, it is the
    user's responsibility to ensure a valid custom
    matrix.

References:

1.  Pages, H., Aboyoun, P., Gentleman, R., DebRoy, S.: Biostrings: String
    Objects Representing Biological Sequences, and Matching Algorithms.
    R package version 2.32.0

2.  Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence
    comparison. Proc. Natl. Acad. Sci. USA 85(8), 2444{2448 (1988)

3.  Pommie, C., Levadoux, S., Sabatier, R., Lefranc, G., Lefranc, M.: Imgt
    standardized criteria for statistical analysis of immunoglobulin v-region
    amino acid properties. Journal of Molecular Recognition 17(1), 17{32
    (2004)

4.  R Core Team: R: A Language and Environment for Statistical
    Computing. R Foundation for Statistical Computing, Vienna, Austria
    (2013). R Foundation for Statistical Computing.
    http://www.R-project.org/

5.  RStudio, Inc.: Shiny: Web Application Framework for R. (2014). R
    package version 0.10.1. http://CRAN.R-project.org/package=shiny
    
6.  Rios, S., Fernandez, M.F., Caltabiano, G., Campillo, M., Pardo, L.,
    Gonzalez, A.: Gpcrtm: An amino acid substitution matrix for the
    transmembrane region of class a g protein-coupled receptors. BMC
    Bioinformatics 16(206) (2015)