backbone_introduction

library(backbone)

Weighted Graphs, Bipartite Projections, and Signed Graphs

Introduction

Signed graphs, those with both positive and negative edges, are a useful tool for understanding social dynamics. However, collecting signed graph data is challenging for several reasons. Often, people do not want to report negative ties or have strategic reasons for not reporting them, and in some cases, such as studies with children, they cannot be asked negative questions.

One solution is to infer a signed graph from either a weighted graph or bipartite projection. A weighted graph is one where each edge has a numeric value indicating its strength. Bipartite graph data records activity participation such as co-attendance or co-sponsorship. Inferring a signed graph from either of these graph types is called extracting the backbone.

The backbone package provides users with several different methods of backbone extraction which are in use in the current literature. It allows the user to input their own weighted graph or bipartite graph data to find a graph with only the significant edges retained between nodes of a unipartite set. The backbone matrix \(G\) is a positive or signed matrix in which \(G_{ij}=1\) if agents \(i\) and \(j\) have a significant positive relationship and \(G_{ij}=-1\) if they have a significant negative relationship.

We will outline the use of the backbone package with Davis, Gardner, and Gardner’s Southern Women dataset (Davis, Gardner, and Gardner 1941), accessed via the UCI Network Data Repository (Repository, n.d.). This dataset contains bipartite data on 18 women and their attendance at 14 social events over a nine-month period.

Bipartite Projections

A bipartite graph \(B\) contains two sets of vertices, with edges existing only between the two sets, not within them. In network analysis, these two sets are often referred to as agents and artifacts. Viewing \(B\) as a bipartite matrix, \(B_{ik}=1\) if the \(i\)th agent is associated with the \(k\)th artifact, and is zero otherwise. Let’s take a look at the Davis dataset included in this package to see that it is bipartite.

We see that our two sets of vertices are women and events attended. We want to use this bipartite data to infer a signed graph between the women, indicating who attends events together and who avoids attending events together. Let’s first look at the graph projection.

A projection \(P\) of bipartite graph \(B\) is defined by \(BB'\), where \(B'\) is the transpose of \(B\). \(P\) is a unipartite graph of the agents, where \(P_{ij}\) equals the number of artifacts that agents \(i\) and \(j\) have in common. Note that the projection \(P\) is a weighted graph. From looking at the matrix of southern women above, we see that Evelyn and Laura have attended six of the same events. This means that \(P_{1 2} = 6\) in the projection.
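As a small illustration of the projection, the sketch below builds a made-up 3 x 4 bipartite matrix and computes \(P = BB'\) in base R. The object names `B_toy` and `P_toy`, the agents, and the events are hypothetical and are not the davis data.

```r
# Hypothetical toy data: 3 agents (rows) by 4 artifacts (columns).
B_toy <- matrix(
  c(1, 1, 1, 0,
    1, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), paste0("E", 1:4))
)

# Projection: P[i, j] counts the artifacts that agents i and j share.
P_toy <- B_toy %*% t(B_toy)
P_toy

# The diagonal P[i, i] is simply agent i's degree (the row sum of B).
diag(P_toy)
```

Here agents A and B share two artifacts, A and C share one, and B and C share none, so the off-diagonal entries of `P_toy` are 2, 1, and 0 respectively.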

General Backbone Methods

In this section, we will describe backbone methods that work for any weighted graph. Finding the backbone of a matrix/graph/network means that we only retain the significant edges. Since weighted graphs can be difficult to work with, we want to find a binary or signed (\(\pm 1\)) matrix that represents our data. The purpose of the backbone package is to provide multiple methods to extract these significant edges and obtain the positive or signed matrices. To illustrate general backbone methods, we will use the weighted graph \(P = BB'\) where \(B\) is the bipartite matrix davis included in the package.

Universal Backbone: universal( )

One way to do this is through the universal() function, which lets the user pick threshold values (one for positive matrices, two for signed matrices). If the projection value \(P_{ij}\) is greater than or equal to the upper (positive) threshold, the edge is set to \(+1\) in the backbone. If \(P_{ij}\) is less than or equal to the lower (negative) threshold, the edge is set to \(-1\). Edges whose values fall strictly between the two thresholds are set to zero.

The universal( ) function has seven parameters:

* B, Matrix: Bipartite network
* upper, Real: upper threshold value, to be multiplied by the chosen function applied to the projection
* lower, Real: lower threshold value, to be multiplied by the chosen function applied to the projection
* signed, Boolean: If the backbone should be signed, or positive only
* weighted, Boolean: TRUE if weighted matrix, FALSE if bipartite to be projected
* by_row, Boolean: If model should be evaluated by row or column. If false, the matrix is transposed before the projection (if bipartite) and threshold is applied.
* FUN, Function: A function to be applied to the projection, e.g. mean, sd, max, etc. If trivial, valued threshold is used.

The function universal() returns the backbone matrix, a signed (or positive) adjacency matrix of a graph.

In our example using the davis dataset, if we input the projected matrix P <- davis%*%t(davis), we can use the universal threshold on the weighted matrix. If we set an upper threshold of 1, then if two women have attended any event together, there will be an edge between the two. We can plot this graph with the igraph package.

We can also use the universal() function on the original bipartite data. For the same threshold as before, input the bipartite data as below. Notice we have gotten the same backbone matrix.

The threshold used above creates a very dense matrix. If we want to impose a stricter condition, say women have to attend at least 3 events together, we do so as follows:

Or, perhaps we only keep edges that are more than twice the mean. We can do this with the following code.

To create a signed network, we can apply both an upper and a lower threshold value. For instance, we could choose to retain a positive edge if the women attended at least 5 events together, and a negative edge if they only attended 1 event together. We can do this with the following code.
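As a rough base R illustration of this kind of rule (not the package's universal() implementation), the hypothetical helper `threshold_backbone()` below applies an upper and a lower threshold to a made-up weighted matrix `P_toy`. It assumes the "greater than or equal to" and "less than or equal to" convention, so a lower threshold of 1 also marks pairs with no shared events as negative.

```r
# Hypothetical helper, not part of the backbone package.
threshold_backbone <- function(P, upper, lower = NULL) {
  bb <- matrix(0, nrow(P), ncol(P), dimnames = dimnames(P))
  bb[P >= upper] <- 1                         # strong ties become +1
  if (!is.null(lower)) bb[P <= lower] <- -1   # weak ties become -1
  diag(bb) <- 0                               # ignore self-ties
  bb
}

# Made-up symmetric weighted projection for four agents.
P_toy <- matrix(
  c(8, 6, 1, 3,
    6, 7, 5, 0,
    1, 5, 4, 2,
    3, 0, 2, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(LETTERS[1:4], LETTERS[1:4])
)

# Positive edge for at least 5 shared events, negative edge for 1 or fewer.
bb_signed <- threshold_backbone(P_toy, upper = 5, lower = 1)
bb_signed
```

Leaving `lower` as NULL in this sketch gives a positive-only backbone, which mirrors the one-threshold case.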

We can also choose to threshold by multiples of a chosen function, such as mean, max, or min. The function is applied to the projection \(P\), and the result is then multiplied by the upper and lower thresholds. Any \(P_{ij}\) values above the upper threshold are counted as a positive \(+1\) value in the backbone, and any below the lower threshold are counted as a negative \(-1\) value in the backbone. The following code will return a backbone where a positive edge indicates two women attended more than 4 times the mean number of events together, and a negative edge indicates two women attended fewer than half the mean number of events together.
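The function-based variant can be sketched in the same spirit. The helper `threshold_by_fun()` below is hypothetical and is not the package's code. It assumes the function is applied to every entry of \(P\), diagonal included, and that the comparisons are inclusive. The multipliers 1.5 and 0.5 are chosen only so that the made-up `P_toy` produces both kinds of edge.

```r
# Hypothetical helper, not part of the backbone package.
threshold_by_fun <- function(P, upper, lower, FUN = mean) {
  ref <- FUN(P)                 # reference value, e.g. the mean entry of P
  bb <- matrix(0, nrow(P), ncol(P), dimnames = dimnames(P))
  bb[P >= upper * ref] <- 1     # at or above upper * FUN(P)
  bb[P <= lower * ref] <- -1    # at or below lower * FUN(P)
  diag(bb) <- 0                 # ignore self-ties
  bb
}

# Made-up symmetric weighted projection for four agents.
P_toy <- matrix(
  c(8, 6, 1, 3,
    6, 7, 5, 0,
    1, 5, 4, 2,
    3, 0, 2, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(LETTERS[1:4], LETTERS[1:4])
)

mean(P_toy)   # 3.6875, so the cut-offs are 5.53125 and 1.84375
bb_mean <- threshold_by_fun(P_toy, upper = 1.5, lower = 0.5, FUN = mean)
bb_mean
```

Swapping `FUN = mean` for `max` or `sd` changes only the reference value, which is the design point of thresholding by a multiple of a summary statistic.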

While this threshold creates a less dense, more informative graph, it may be missing out on key relationships between agents. We move to more sophisticated methods of backbone extraction.

Bipartite Projection Backbone Methods

When examining a bipartite projection, we need to interpret the \(P_{ij}\) values. Larger values indicate that agents \(i\) and \(j\) share many artifacts, but this may not be enough to claim they have any sort of relationship. If both agents are associated with a large number of artifacts, or if those artifacts are associated with a large number of agents, then a large \(P_{ij}\) value may not be meaningful. To tell whether \(P_{ij}\) is large enough or small enough to be an interesting finding, we compare the \(P\) matrix to a null distribution that is conditioned on the number of artifacts with which each agent is associated and the number of agents associated with each artifact.

The backbone package provides three different ways to create these distributions: the hypergeometric distribution using hyperg(), the stochastic degree sequence method using sdsm(), and the fixed degree sequence method using fdsm(). Each of these three functions returns two matrices, labeled positive and negative. The positive matrix contains, for each edge, the number of times the observed \(P_{ij}\) is above the corresponding value in the null distribution, divided by the number of trials. The negative matrix contains the number of times the observed edge is below the corresponding value in the null distribution, divided by the number of trials. To obtain the backbone matrix, we use the backbone.extract() function.

Extracting the Backbone: backbone.extract( )

The backbone.extract() function takes matrices of proportions and returns a backbone. It is intended to be used with the functions hyperg, sdsm, and fdsm. It takes two matrices, denoted positive and negative, and a significance level \(\alpha\). The entries of these matrices should be proportions (each between 0 and 1) of times an observed edge \(P_{ij}\) of a bipartite projection was above (in the positive matrix) or below (in the negative matrix) the corresponding value in the null distribution. One can adjust the significance level \(\alpha\) to make the backbone more or less conservative. We will demonstrate this function’s use in the following sections.

Hypergeometric Backbone: hyperg( )

The hypergeometric method compares an edge’s observed weight, \(P_{ij}\), to the distribution of weights expected in a projection obtained from a random bipartite network in which the row sums are fixed but the column sums are allowed to vary. This method of backbone extraction was developed in (Neal 2013). For documentation on the hypergeometric distribution, see stats::phyper. The hyperg() function returns matrices of the probabilities of edge weights above and below the observed \(P_{ij}\) value, based on the hypergeometric distribution. One must then use the backbone.extract() function to find the backbone at a given significance level \(\alpha\).
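To make the role of stats::phyper concrete, here is a minimal base R sketch of hypergeometric tail probabilities for a projection. It is not the hyperg() source. The toy matrix, the object names, and the choice to treat small tail probabilities as significant at \(\alpha/2\) per tail are all assumptions made for illustration.

```r
# Hypothetical toy data: 3 agents by 4 artifacts.
B_toy <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), paste0("E", 1:4))
)

P        <- B_toy %*% t(B_toy)   # observed co-occurrence counts
deg      <- rowSums(B_toy)       # agent degrees (the fixed row sums)
n_art    <- ncol(B_toy)          # total number of artifacts
n_agents <- nrow(B_toy)

d_i <- matrix(deg, n_agents, n_agents)   # entry [i, j] holds deg[i]
d_j <- t(d_i)                            # entry [i, j] holds deg[j]

# Agent j picks d_j of the n_art artifacts at random; d_i of them belong to
# agent i. X is the number of shared artifacts under that null model.
upper_tail <- matrix(phyper(P - 1, d_i, n_art - d_i, d_j, lower.tail = FALSE),
                     n_agents, dimnames = dimnames(P))   # Pr(X >= P_ij)
lower_tail <- matrix(phyper(P, d_i, n_art - d_i, d_j),
                     n_agents, dimnames = dimnames(P))   # Pr(X <= P_ij)

# Two-tailed extraction; alpha is deliberately loose for this tiny example.
alpha <- 0.4
bb <- (upper_tail < alpha / 2) - (lower_tail < alpha / 2)
diag(bb) <- 0
bb
```

With two artifacts each out of four, the chance that A and B share both is 1/6, and the chance that A and C share none is also 1/6, so at this loose level the pair A-B is marked positive and the pairs involving C are marked negative.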

The Degree Sequence Methods: sdsm( ) and fdsm( )

Both the Stochastic Degree Sequence Method (sdsm) and Fixed Degree Sequence Method (fdsm) involve the same series of steps. They construct a null distribution by:

1. Constructing \(B^*\), a new bipartite matrix which retains the agent and artifact degrees of \(B\) but is otherwise random.
2. Computing the projection of \(B^*\), called \(P^*\).
3. Saving \(P^*_{ij}\) and repeating steps 1 and 2.
4. Comparing \(P_{ij}\) to the distribution of \(P^*_{ij}\).

The models differ in how they construct \(B^*\). The fdsm method constructs \(B^*\) so that the agent and artifact degrees are exactly the same as in \(B\). This method is due to Zweig and Kaufmann (Zweig and Kaufmann 2011). There are several different ways to randomly sample from this space of \(B^*\) matrices. The function fdsm( ) uses the curveball algorithm (Strona et al. 2014).
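The idea behind a curveball-style randomization can be sketched in a few lines of base R. This is an illustrative reading of the algorithm, not the code used inside fdsm( ). The helper name `curveball_trade()`, the toy matrix, and the number of trades are assumptions. Each trade picks two agents, pools the artifacts that only one of them holds, and reshuffles that pool so each agent keeps the same number of artifacts.

```r
# One curveball-style "trade" between two randomly chosen rows
# (hypothetical helper, not part of the backbone package).
curveball_trade <- function(B) {
  rows   <- sample.int(nrow(B), 2)
  a      <- which(B[rows[1], ] == 1)
  b      <- which(B[rows[2], ] == 1)
  only_a <- setdiff(a, b)        # artifacts held by the first agent only
  only_b <- setdiff(b, a)        # artifacts held by the second agent only
  if (length(only_a) > 0 && length(only_b) > 0) {
    pool <- c(only_a, only_b)
    pool <- pool[sample.int(length(pool))]        # shuffle tradable artifacts
    B[rows, pool] <- 0
    B[rows[1], pool[seq_along(only_a)]] <- 1      # first agent keeps its count
    B[rows[2], pool[length(only_a) + seq_along(only_b)]] <- 1
  }
  B
}

set.seed(42)
B_toy  <- matrix(rbinom(40, 1, 0.5), nrow = 5)    # made-up 5 x 8 bipartite matrix
B_star <- B_toy
for (k in 1:100) B_star <- curveball_trade(B_star)

# Both degree sequences are unchanged by the trades.
identical(rowSums(B_star), rowSums(B_toy))
identical(colSums(B_star), colSums(B_toy))
```

Because every traded artifact was held by exactly one of the two agents before the trade and is held by exactly one afterwards, column sums are preserved along with row sums.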

Alternatively, the sdsm constructs \(B^*\) so that the agent and artifact degrees are approximately the same as in \(B\), using maximum likelihood estimation to fit the parameters of a binomial regression equation \(Pr(B_{ij}=1) = \beta_0 +\beta_1 B_i + \beta_2 B_j +\beta_3 (B_i \times B_j)\), where \(B_i\) and \(B_j\) are the agent and artifact degrees in \(B\), respectively. The fitted parameters are then used to compute the predicted probability that \(B_{ij}=1\). Then, \(B^*\) is constructed such that \(B_{ij}^*\) is the outcome of a single Bernoulli trial with probability of success \(Pr(B_{ij}=1)\). This method is due to (Neal 2014).
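A minimal sketch of this procedure in base R follows. It is not the sdsm( ) source. It assumes a logit link, so the regression terms above enter on the log-odds scale rather than the probability scale. The toy matrix, the object names, the number of trials, and the choice of dyad are all made up for illustration.

```r
set.seed(1)
B_toy <- matrix(rbinom(60, 1, 0.4), nrow = 10)    # made-up 10 x 6 bipartite matrix

# One row per cell of B, with the agent (row) and artifact (column) degrees.
cells <- data.frame(
  y  = as.vector(B_toy),
  rd = rowSums(B_toy)[as.vector(row(B_toy))],
  cd = colSums(B_toy)[as.vector(col(B_toy))]
)

# Binomial regression with main effects and interaction (logit link assumed).
fit  <- glm(y ~ rd * cd, family = binomial(link = "logit"), data = cells)
prob <- matrix(fitted(fit), nrow = nrow(B_toy))   # predicted Pr(B_ij = 1)

# Steps 1-3: draw B*, project it, and save P*_ij for one dyad (agents 1 and 2).
trials <- 200
p_star <- numeric(trials)
for (k in seq_len(trials)) {
  B_star    <- matrix(rbinom(length(prob), 1, prob), nrow = nrow(prob))
  p_star[k] <- sum(B_star[1, ] * B_star[2, ])     # entry [1, 2] of B* %*% t(B*)
}

# Step 4: compare the observed weight with the simulated distribution.
p_obs <- sum(B_toy[1, ] * B_toy[2, ])
c(observed          = p_obs,
  share_at_or_above = mean(p_star >= p_obs),
  share_at_or_below = mean(p_star <= p_obs))
```

Unlike the fixed degree sequence approach, each draw here matches the observed degrees only on average, which is why the row and column sums of `B_star` vary from trial to trial.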

fdsm

The fdsm( ) function has the following parameters:

* B, Matrix: Bipartite network
* trials, Integer: Number of random bipartite graphs generated
* by_row, Boolean: If model should be evaluated by row or column. If false, the matrix is transposed before the projection and threshold is applied.
* sparse, Boolean: If sparse matrix manipulations should be used
* maxiter, Integer: Maximum number of iterations if “model” is a GLM
* dyad, vector length 2: two row entries i,j to save Pij values

It returns a list of the following:

* positive: matrix of proportion of ties above expected
* negative: matrix of proportion of ties below expected
* dyad_values: list of edge weight in each trial for dyad

The dyad_values is a list of all of the \(P_{ij}\) values in each random bipartite graph generated for a given \(i,j\) pair.

We can find the backbone via the fdsm method as follows:

fdsm_props$dyad
#>   [1] 3 3 4 3 3 3 3 3 2 3 2 4 4 2 2 4 1 3 1 4 2 3 3 3 2 3 3 3 3 3 4 4 3 3 3
#>  [36] 3 3 4 1 3 2 3 2 2 2 2 2 3 3 3 1 1 3 2 2 2 3 3 2 3 3 1 2 3 2 2 3 3 3 1
#>  [71] 2 3 3 3 3 2 1 2 4 3 2 3 2 3 3 0 2 4 2 4 3 3 3 2 3 3 4 2 4 2
fdsm_bb <- backbone.extract(fdsm_props$positive, fdsm_props$negative, alpha = 0.05)
fdsm_bb
#>           EVELYN LAURA THERESA BRENDA CHARLOTTE FRANCES ELEANOR PEARL RUTH
#> EVELYN         0     1       1      1         0       1       0     1    0
#> LAURA          1     0       1      1         0       1       1     0    0
#> THERESA        1     1       0      1         1       1       1     1    1
#> BRENDA         1     1       1      0         1       1       1     0    0
#> CHARLOTTE      0     0       1      1         0       0       0    -1    0
#> FRANCES        1     1       1      1         0       0       1     1    0
#> ELEANOR        0     1       1      1         0       1       0     0    1
#> PEARL          1     0       1      0        -1       1       0     0    0
#> RUTH           0     0       1      0         0       0       1     0    0
#> VERNE          0     0       0      0         0       0       0     0    1
#> MYRNA          0    -1       0     -1        -1       0       0     1    0
#> KATHERINE     -1    -1      -1     -1        -1      -1      -1     0    0
#> SYLVIA        -1    -1       0     -1         0      -1       0     0    0
#> NORA          -1    -1      -1     -1        -1      -1       0     0    0
#> HELEN         -1     0       0      0        -1       0       0     0    0
#> DOROTHY        1     0       1      0        -1       0       0     1    1
#> OLIVIA         0    -1       0     -1        -1      -1      -1     0    0
#> FLORA          0    -1       0     -1        -1      -1      -1     0    0
#>           VERNE MYRNA KATHERINE SYLVIA NORA HELEN DOROTHY OLIVIA FLORA
#> EVELYN        0     0        -1     -1   -1    -1       1      0     0
#> LAURA         0    -1        -1     -1   -1     0       0     -1    -1
#> THERESA       0     0        -1      0   -1     0       1      0     0
#> BRENDA        0    -1        -1     -1   -1     0       0     -1    -1
#> CHARLOTTE     0    -1        -1      0   -1    -1      -1     -1    -1
#> FRANCES       0     0        -1     -1   -1     0       0     -1    -1
#> ELEANOR       0     0        -1      0    0     0       0     -1    -1
#> PEARL         0     1         0      0    0     0       1      0     0
#> RUTH          1     0         0      0    0     0       1      0     0
#> VERNE         0     1         0      1    0     1       1      0     0
#> MYRNA         1     0         1      1    0     0       1      0     0
#> KATHERINE     0     1         0      1    0     0       1      0     0
#> SYLVIA        1     1         1      0    1     0       1      0     0
#> NORA          0     0         0      1    0     0       0      1     1
#> HELEN         1     0         0      0    0     0       0      0     0
#> DOROTHY       1     1         1      1    0     0       0      0     0
#> OLIVIA        0     0         0      0    1     0       0      0     1
#> FLORA         0     0         0      0    1     0       0      1     0

Note that since we have provided both a positive and a negative matrix of proportions, backbone.extract() returns a signed matrix. The significance test is now two-tailed, so \(\alpha = 0.05\) is split into \(0.025\) in each tail of the distribution. The fdsm output above also includes a list of the \(P_{3,6}^*\) values for each of the 100 trials. The dyad parameter defaults to NULL, in which case these values are not returned.

sdsm

The sdsm( ) function has nine parameters:

* B, Matrix: Bipartite network
* trials, Integer: Number of random bipartite graphs generated
* model, String: Model used to generate random bipartite graphs
* by_row, Boolean: If model should be evaluated by row or column. If false, the matrix is transposed before the projection and threshold is applied.
* sparse, Boolean: If sparse matrix manipulations should be used
* maxiter, Integer: Maximum number of iterations if “model” is a GLM
* dyad, vector length 2: two row entries i,j to save Pij values
* row_marg, Integer: row in which to save B* row marginals
* col_marg, Integer: column in which to save B* column marginals

It returns a list of the following:

* positive: matrix of proportion of ties above expected
* negative: matrix of proportion of ties below expected
* dyad_values: list of edge weight in each trial for dyad
* row_marginals: list of row sum for row in row_marg
* col_marginals: list of column sum for column in col_marg

The sdsm function not only computes the positive and negative proportion matrices, but because sdsm doesn’t fix agent and artifact degrees, we can also return a list of the \(P_{i,j}^*\) values for given agents \(i,j\), the agent degree for a given agent, and the artifact degree for a given artifact.

We can find the backbone via the sdsm method as follows:

sdsm_props$dyad
#>   [1] 5 1 3 1 4 3 2 2 3 5 2 3 3 4 4 3 1 5 3 3 2 4 1 2 0 3 1 2 2 3 3 3 3 4 1
#>  [36] 5 4 2 0 2 2 5 2 2 2 2 4 2 5 2 2 1 3 2 5 1 4 3 1 4 4 3 2 3 0 3 6 2 1 4
#>  [71] 2 4 2 5 2 2 3 2 5 3 2 3 6 1 5 2 2 3 3 4 3 2 2 2 5 4 3 2 2 2
sdsm_props$row_marg
#>   [1] 10  8  8 10  7  8  7  9  9  9  7 10  7 12  6  7  7  8  9  6  5  8  7
#>  [24]  8 10  9  6  9  8  8  9  7  8  8  5 13 12 10  8  8  6  8  8  9  5  7
#>  [47]  7  8  9  7  8  8  6  9  9  9  9  7  3  8  9  8 11  9  6  8  8  7 10
#>  [70]  7  6 11  8 11 12 10  9 10  7 10  8 10 10  6  9 11  8 11  7  9  7  7
#>  [93] 12  8 11  9  7  8 10  7
sdsm_props$col_marg
#>   [1] 4 3 3 5 2 4 2 3 4 2 1 3 3 4 3 3 3 3 5 5 6 3 3 5 6 4 3 3 3 5 8 2 2 4 4
#>  [36] 6 3 3 4 2 2 4 3 3 3 2 3 3 4 3 3 2 0 2 2 4 3 3 0 3 4 4 6 1 4 3 3 1 4 3
#>  [71] 3 3 3 5 3 2 3 3 3 1 1 4 1 4 4 2 3 4 4 4 2 3 5 1 3 4 3 4 0 3
sdsm_bb <- backbone.extract(sdsm_props$positive, sdsm_props$negative, alpha = 0.05)
sdsm_bb
#>           EVELYN LAURA THERESA BRENDA CHARLOTTE FRANCES ELEANOR PEARL RUTH
#> EVELYN         0     0       0      0         0       0       0     0    0
#> LAURA          0     0       0      0         0       0       0     0    0
#> THERESA        0     0       0      0         0       0       0     0    0
#> BRENDA         0     0       0      0         0       0       1     0    0
#> CHARLOTTE      0     0       0      0         0       0       0    -1    0
#> FRANCES        0     0       0      0         0       0       0     0    0
#> ELEANOR        0     0       0      1         0       0       0     0    0
#> PEARL          0     0       0      0        -1       0       0     0    0
#> RUTH           0     0       0      0         0       0       0     0    0
#> VERNE          0     0       0      0         0       0       0     0    1
#> MYRNA          0    -1       0      0        -1       0       0     0    0
#> KATHERINE      0    -1       0     -1        -1       0       0     0    0
#> SYLVIA        -1     0       0      0         0       0       0     0    0
#> NORA          -1     0       0     -1         0       0       0     0    0
#> HELEN         -1     0       0      0         0       0       0     0    0
#> DOROTHY        0     0       0      0        -1       0       0     0    0
#> OLIVIA         0    -1       0     -1        -1      -1      -1     0    0
#> FLORA          0    -1       0     -1        -1      -1      -1     0    0
#>           VERNE MYRNA KATHERINE SYLVIA NORA HELEN DOROTHY OLIVIA FLORA
#> EVELYN        0     0         0     -1   -1    -1       0      0     0
#> LAURA         0    -1        -1      0    0     0       0     -1    -1
#> THERESA       0     0         0      0    0     0       0      0     0
#> BRENDA        0     0        -1      0   -1     0       0     -1    -1
#> CHARLOTTE     0    -1        -1      0    0     0      -1     -1    -1
#> FRANCES       0     0         0      0    0     0       0     -1    -1
#> ELEANOR       0     0         0      0    0     0       0     -1    -1
#> PEARL         0     0         0      0    0     0       0      0     0
#> RUTH          1     0         0      0    0     0       0      0     0
#> VERNE         0     0         0      0    0     0       0      0     0
#> MYRNA         0     0         1      0    0     0       0      0     0
#> KATHERINE     0     1         0      1    0     0       0      0     0
#> SYLVIA        0     0         1      0    0     0       0      0     0
#> NORA          0     0         0      0    0     0       0      0     0
#> HELEN         0     0         0      0    0     0       0      0     0
#> DOROTHY       0     0         0      0    0     0       0      0     0
#> OLIVIA        0     0         0      0    0     0       0      0     0
#> FLORA         0     0         0      0    0     0       0      0     0

Here we have returned a list of the numbers of events Evelyn and Charlotte attended together in each generated random matrix, the number of events attended by Evelyn in each generated random matrix, and the number of people who attended event 1 in each generated random matrix.

Future

The backbone package will be updated to contain additional backbone extraction methods that are used in the current literature.

References

Davis, Allison, Burleigh B Gardner, and Mary R Gardner. 1941. Deep South. Chicago, IL: The University of Chicago Press.

Neal, Zachary. 2013. “Identifying Statistically Significant Edges in One-Mode Projections.” Social Network Analysis and Mining 3 (4). Springer: 915–24.

———. 2014. “The Backbone of Bipartite Projections: Inferring Relationships from Co-Authorship, Co-Sponsorship, Co-Attendance and Other Co-Behaviors.” Social Networks 39. Elsevier: 84–97.

Repository, UCI Network Data. n.d. “Southern Women Data Set.” https://networkdata.ics.uci.edu/netdata/html/davis.html.

Strona, Giovanni, Domenico Nappo, Francesco Boccacci, Simone Fattorini, and Jesus San-Miguel-Ayanz. 2014. “A Fast and Unbiased Procedure to Randomize Ecological Binary Matrices with Fixed Row and Column Totals.” Nature Communications 5. Nature Publishing Group: 4114.

Zweig, Katharina Anna, and Michael Kaufmann. 2011. “A Systematic Approach to the One-Mode Projection of Bipartite Graphs.” Social Network Analysis and Mining 1 (3). Springer: 187–218.