22  Special Matrices

There are various other types of matrices we can use to understand and measure things in graphs beyond the adjacency matrix. We consider the most important ones in this lesson.

22.1 The Reachability Matrix

Consider the directed graph shown in Figure 11.2. Earlier, we derived an asymmetric adjacency matrix from a graph similar to this one. But directed graphs do not only encode information about adjacency relations between nodes. As we saw in the graph theory lesson, in directed graphs (and as we will later, in disconnected undirected graphs), there is another pairwise relationship between nodes we may be interested in; namely, reachability (Krackhardt 1994):

In a graph, node B is said to be reachable by node A if there is path (of any length) that has A as the origin node and B as the destination node. In that case, we say that A can reach B.

Sometimes, it is useful to encode reachability relations between nodes to in order to compute some important graph metrics. This is done using the reachability matrix, written \(D^r\). The reachability matrix is just like the adjacency matrix, except that instead of putting a one the corresponding matrix cell if the node in the row sends a tie to the node in the respective column, we put a one in the corresponding matrix cell if the node in the rode can reach the node in the column via a path. Table 22.1 shows the reachability matrix corresponding to the directed graph shown in Figure 11.2.

Note that if a node sends a regular old directed edge (a path of length one!) that counts for reachability too. This means that if the cell corresponding to the relationship between two nodes has a one in the adjacency matrix, then it should also have a one in the reachability matrix.

A B C D E F G
A 1 0 1 0 1 1
B 1 0 1 0 1 1
C 1 1 1 0 1 1
D 1 1 0 0 1 1
E 1 1 1 1 1 1
F 1 1 0 1 0 1
G 1 1 0 1 0 1
Table 22.1: Reachability matrix corresponding to a directed graph.

So show does this work? Let us take a look at how the first row (corresponding to whether node A can reach the other nodes in the graph) was filled out. First we ask, can node A reach node B? The answer is yes, because they are is a direct link between them! Node A sends a tie to node B, so the corresponding cell \(d^r_{12} = 1\).

Then we ask, can node A reach node C? The answer is no. Note that there is no directed path we can trace that would start from node A and end in node C. So we put a zero in the corresponding cell (row 1, column 3) of the reachability matrix (\(d^r_{13} = 0\)).

Further, we can ask, can node A reach node D? Note that here the answer is yes! While node A is not directly connected to node D, they are indirectly connected, so node A can reach node D in two ways: First, via a path of length two that goes: \(A \rightarrow B, B \rightarrow D\), and via path of length three that goes: \(A \rightarrow F, F \rightarrow G, G \rightarrow D\).1 So \(d^r_{13} = 1\).

We can continue like this and finish the row for node A and the the rows for all the other nodes. If we do we end up with numbers shown in $tbl-reachmat. There are some interesting things about this matrix. Reading across the rows for each node, we can figure out the number of other nodes that that particular node can reach.

Note than an interesting feature of the graph shown in Figure 11.2 is that while some nodes (like node A) cannot reach all the nodes in the graph, other nodes (like node E) can! It seems like E has access to everyone in the network, whether directly or indirectly. Maybe they are pretty important (Krackhardt 1994). Node C is almost like node E. They can reach almost everyone in the network, except for E. Maybe they are the second in command.

In the same way, note that reading across the columns, tell us whether a particular node is reachable by the other nodes. So we see that some nodes, like A, B, D, F, and G are reachable by everyone. Other nodes like E are reachable by no one. Finally, a node like C is not reachable by almost anyone else, except node E. It seems like reachability can encode some interesting properties, and can be used to develop some graph metrics related to power and hierarchy.

A B C D E F G H I J K L M
A 1 1 1 1 1 1 1 1 1 1 1 1
B 0 0 0 1 1 1 0 0 0 0 0 0
C 0 0 0 0 0 0 1 1 1 0 0 0
D 0 0 0 0 0 0 0 0 0 1 1 1
E 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0
Table 22.2: Reachability matrix corresponding to a tree graph.

For instance, the reachability matrix corresponding to a perfectly hierarchical tree graph containing only antisymmetric relations, such as the one shown in Figure 16.1, has an interesting property. This is shown in Table 22.2. If you look at the reachability matrix’s lower-triangle, it is full of zeroes! The only ones present in the matrix are contained in the matrix’s upper-triangle. This means that when looking at a reachability matrix of any directed graph, we can get a sense of much they approximate a pure antisymmetric hierarchy by counting the number of ones that appear in the reachability matrix’s lower-triangle.

22.2 The Geodesic Distance Matrix

Consider the directed graph shown in Figure 11.1 again. In Table 18.1 we derived an symmetric adjacency matrix from the same graph. However, as noted in the graph theory lesson and our previous discussion of indirect connections, adjacency is only one way (the direct way) in which nodes can be connected in graph. A particularly important way in which two nodes can be connected is via shortest paths. The length of shortest path between two nodes is called the geodesic distance between them. So it is possible to create a matrix D in which each cell d\(_{ij}\) contains the length of the shortest path between the row node i and the column node j. This is called the distance matrix for the corresponding graph. The D matrix corresponding to the graph shown in Figure 11.1, is shown in Table 22.3.

A B C D E F G H I
A 0 1 1 1 1 2 3 3 3
B 1 0 1 1 2 2 3 3 3
C 1 1 0 1 1 2 3 3 3
D 1 1 1 0 1 1 2 2 2
E 1 2 1 1 0 2 3 3 3
F 2 2 2 1 2 0 1 1 1
G 3 3 3 2 3 1 0 1 1
H 3 3 3 2 3 1 1 0 1
I 3 3 3 2 3 1 1 1 0
Table 22.3: Distance matrix for an undirected graph.

The distance matrix reveals a number of things about the network. First, note that nodes that are adjacent in Figure 11.1 have a geodesic distance of 1.0 by definition. In addition, nodes are at minimum distance from themselves, so we put a value of 0 in the diagonal cells; d\(_{ij} = 0\) for all \(i=j\). Second, note that the maximum geodesic distance between any two nodes in the graph is 3.0. As we saw in the lesson on indirect connections, this is an important graph metric, called the graph diameter. Finally, note that just like an undirected graph yields a symmetric adjacency matrix, it also yields a symmetric distance matrix. If the geodesic distance between nodes A and G in an the undirected graph is 3.0, then the geodesic distance between G and A is also 3.0: d\(_{ij}\) = d\(_{ji}\) for all \(i\) and \(j\).

We can also generate geodesic distance matrices for directed graphs, such as the one shown in Figure 11.2. The corresponding distance matrix for this graph is shown in Table 22.4.

A B C D E F G
A 0 1 Inf 2 Inf 1 Inf
B 1 0 Inf 1 Inf 2 Inf
C 2 1 0 2 Inf 3 Inf
D 2 1 Inf 0 Inf 3 Inf
E 3 2 1 1 0 4 Inf
F 1 2 Inf 3 Inf 0 Inf
G 2 2 Inf 1 Inf 1 0
Table 22.4: Distance matrix for a directed graph.

This distance matrix is very different from the one corresponding to the undirected graph. First note that some shortest paths are not defined, because some pairs of nodes are disconnected in the directed graph; that is there is directed path linking them. So the corresponding cells are noted with Inf in the matrix.2 For instance, node A cannot reach nodes C, E or G. Note also that now the matrix is asymmetric, so the numbers above the diagonal do not have to match the numbers below the diagonal. Thus, while node A cannot reach node G via a shortest path, node G can reach node A via shortest path of length 2.

22.3 The Shortest Paths Matrix

Recall that in our discussion of shortest paths in the indirect connectivity lesson, we noted that nodes can be connected by more than one shortest path at the same time. So sometimes it is useful to create a matrix that records this number for each pair of nodes. This is called the shortest paths matrix (S). Each cell in the matrix s\(_{ij}\) gives us the number of shortest paths connecting the row node i with the column node j. The shortest path matrix corresponding to the graph in Figure 11.1, is shown in Table 22.5.

A B C D E F G H I
A 0 1 1 1 1 2 3 3 3
B 1 0 1 1 2 2 3 3 3
C 1 1 0 1 1 2 3 3 3
D 1 1 1 0 1 1 2 2 2
E 1 2 1 1 0 2 3 3 3
F 2 2 2 1 2 0 1 1 1
G 3 3 3 2 3 1 0 1 1
H 3 3 3 2 3 1 1 0 1
I 3 3 3 2 3 1 1 1 0
Table 22.5: Shortest paths matrix for an undirected graph.

The S matrix contains useful information. For instance, it tell us that some pairs of nodes in the graph, have multiple ways of reaching other nodes to which they are not directly connected using shortest paths. For instance, actor A can get to actor G via three distinct shortest paths. So this gives us a sense of the capacity of that actor to reach the other one in an efficient way; if one of those shortest paths were to be compromised, A would still be able to send something to G via the other non-compromised paths.

22.4 The Neighborhood Overlap Matrix

As we noted in the original graph theory lesson, it is possible for the neighborhood of two nodes in a graph to overlap. Recall that for each node, we define its neighborhood as the set of other nodes that they are adjacent to. That means the neighborhood between two nodes can have members in common.

This can be used as a measure of the overlap of the neighborhood between two nodes. For instance, imagine you have a friend and that friend knows all your friends and you know all their friends. In which case we would say that the overlap between your node neighborhoods is pretty high; in fact the two neighborhoods overlap completely.

Now imagine you just met a new person online who lives in a far away country, and as far as you know, they know none of your friends and you know none of their friends. In which case, we would say that the overlap if the two neighborhoods is nil or as close to zero as it can get.

\[ o_{ij} = \frac{|\mathcal{N}(i) \cap \mathcal{N}(j)|}{|\mathcal{N}(i) \cup \mathcal{N}(j)|} \tag{22.1}\]

Given a graph, we can construct the neighborhood overlap matrix for the graph O, containing such overlap scores between the neighborhood sets of each pair of nodes in the graph. The overlap score ranges from 0 (not overlap), to 1.0 (complete overlap), with values in-between for partial overlap (which is the more common case). Each cell in the matrix is filled in using equation Equation 35.2.

This equation says that the overlap between node i and node j, written \(o_{ij}\), is equivalent to the cardinality (||) of the set defined by the intersection (\(\cap\)) of i’s neighborhood (\(\mathcal{N}(i)\)) and j’s neighborhood (\(\mathcal{N}(j)\)), or the number of common neighbors, divided by the cardinality of the set defined by the union (\(\cup\)) of i’s neighborhood (\(\mathcal{N}(i)\)) and j’s neighborhood (\(\mathcal{N}(j)\)), or the total number of neighbors.

Thus, the overlap is the number of common neighbors, divided by the number of total neighbors. For instance, the neighborhood overlap for the undirected graph shown in Figure 11.1 is shown in Table 22.6.

A B C D E F G H I
A 1.00 0.40 0.60 0.50 0.40 0.14 0.00 0.00 0.00
B 0.40 1.00 0.40 0.33 1.00 0.17 0.00 0.00 0.00
C 0.60 0.40 1.00 0.50 0.40 0.14 0.00 0.00 0.00
D 0.50 0.33 0.50 1.00 0.33 0.00 0.14 0.14 0.14
E 0.40 1.00 0.40 0.33 1.00 0.17 0.00 0.00 0.00
F 0.14 0.17 0.14 0.00 0.17 1.00 0.40 0.40 0.40
G 0.00 0.00 0.00 0.14 0.00 0.40 1.00 0.50 0.50
H 0.00 0.00 0.00 0.14 0.00 0.40 0.50 1.00 0.50
I 0.00 0.00 0.00 0.14 0.00 0.40 0.50 0.50 1.00
Table 22.6: Neighborhood Overlap Matrix for an undirected graph.

Looking at the first row of the matrix, we can see that nodes A and C have a pretty high neighborhood overlap score \(o_{AC} = 0.60\). But the node neighborhood A has no overlap with that of nodes G, H and I, as is evident by looking at Figure 11.1.

We can examine the overlap pattern of each node in Figure 11.1 by going down each row of the matrix. Note that the common neighbors matrix is symmetric: If A has an overlap of o with B, then B necessarily has the same overlap score with A. So all the information in the neighborhood overlap matrix is contained in either the lower or upper triangle.

We can think of neighborhood overlap as a measure of structural similarity of two nodes in a graph based on their pattern of social connections. In fact, the formula shown in equation @ref(eq:overlap) is called Jaccard’s Coefficient (named after the French Botanist Paul Jaccard, who introduced it) and it is generally used (along with many variations) as a measure of similarity between two sets (Jaccard 1901).

So looking at Table 22.6, we can see node F is most similar to nodes G, H and I in the graph shown in Figure 11.1, and least similar to node D.

Can you think of which of your friends you are most similar to in terms of neighborhood overlap?

As we noted in the lesson on graph theory, common neighbors are defined for all the dyads in the network, whether they are connected or null. So that means that two nodes can have overlapping neighborhoods even if they do not have tie between them! Sometimes it happens that you meet someone new and then you realize that you had friends in common. This is such a common occurrence that it has a name: the small world phenomenon (Milgram 1967).

References

Jaccard, Paul. 1901. “Distribution of the Alpine Flora in the Dranse’s Basin and Some Neighbouring Regions.” Bulletin de La Societe Vaudoise Des Sciences Naturelles 37 (1): 241–72.
Krackhardt, David. 1994. “Graph Theoretical Dimensions of Informal Organizations.” Computational Organization Theory 89 (112): 123–40.
Milgram, Stanley. 1967. “The Small World Problem.” Psychology Today 2 (1): 60–67.

  1. Because nodes can be indirectly connected by more than one path, it is possible to come up with a weighted version of the reachability matrix, that encodes the number of paths of any length via which the row node can reach the column node.↩︎

  2. If the shortest path between a pair of nodes does not exist, then technically their geodesic distance is infinity! (\(\infty\))↩︎