Network Clustering (Community Detection)

Patrick O. Perry, NYU Stern School of Business

Preliminaries

Computing environment

We will use the following R packages.

library("igraph")
library("igraphdata")

To ensure consistent runs, we set the seed before performing any analysis.

set.seed(0)

Zachary's Karate Club

We will first cluster Zachary's Karate Club, one of the most famous network clustering datasets.

# Zachary's Karate Club data
data(karate)
summary(karate)
IGRAPH UNW- 34 78 -- Zachary's karate club network
+ attr: name (g/c), Citation (g/c), Author (g/c), Faction (v/n), name (v/c), label (v/c),
| color (v/n), weight (e/n)

The nodes in the graph represent the 34 members of a college karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because, while Zachary was collecting his data, a dispute split the club into two factions: one led by “Mr. Hi” and one led by “John A”. It turns out that the two factions can be recovered using only the connectivity information (the edges).

The igraph package produces a plot of the graph:

plot(karate)

[Figure: the karate club network, drawn by plot(karate)]

The faction labels are stored in a vertex attribute:

vertex_attr(karate, "Faction")
 [1] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2

Here are the counts of members in each faction:

table(vertex_attr(karate, "Faction"))

 1  2 
16 18 

Clustering

The igraph package implements a variety of network clustering methods, most of which are based on Newman-Girvan modularity, a score that is high when a partition places more edges within clusters than would be expected by chance given the node degrees. To see them all, refer to the ?communities documentation. The simplest such algorithm is the “fast greedy” method, which starts with every node in its own cluster and then repeatedly merges the pair of clusters whose merger increases modularity the most.

# Fast greedy modularity-based clustering
cfg <- cluster_fast_greedy(karate)
plot(cfg, karate)

[Figure: the karate club network with the fast greedy clusters highlighted]
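
To make the objective concrete, the sketch below computes the modularity Q = (1/2m) sum_ij [A_ij - k_i k_j / (2m)] * 1(c_i = c_j) by hand and checks it against igraph's modularity function. It is an illustration under simplifying assumptions rather than part of the analysis above: it uses the unweighted copy of the karate network that ships with igraph itself (make_graph("Zachary")), not the weighted igraphdata version, and the variable names are ours.

```r
library("igraph")

# Unweighted karate club graph built into igraph (assumption: not the
# weighted igraphdata version used in the rest of this document)
g <- make_graph("Zachary")

# Any membership vector would do; here we take the fast greedy result
cfg_unw <- cluster_fast_greedy(g)
memb <- as.integer(membership(cfg_unw))

A <- as_adjacency_matrix(g, sparse = FALSE)  # 0/1 adjacency matrix
k <- rowSums(A)                              # node degrees
m <- sum(A) / 2                              # number of edges

# TRUE where nodes i and j are in the same cluster
same <- outer(memb, memb, "==")

# Observed minus expected within-cluster edge weight, normalized by 2m
Q <- sum((A - outer(k, k) / (2 * m)) * same) / (2 * m)

all.equal(Q, modularity(g, memb))
```

For a weighted graph the same formula applies with the edge weights in A and the weighted degrees (strengths) in k.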

Specifying number of clusters

The cluster_fast_greedy function does not take the number of clusters as an argument. However, because the method is hierarchical, we can cut its merge tree to obtain a smaller number of clusters using the cutat function:

cutat(cfg, 2)
 [1] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
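
As a rough sketch of how the cut behaves, the loop below cuts one merge tree at several sizes and reports the cluster sizes and modularity of each cut. Two assumptions to note: it uses the unweighted karate graph built into igraph so that it runs without igraphdata (so the numbers need not match the weighted analysis above), and it uses cut_at, which is the spelling of cutat in igraph 1.0 and later.

```r
library("igraph")

# Unweighted built-in karate graph (assumption, see the note above)
g <- make_graph("Zachary")
cfg_unw <- cluster_fast_greedy(g)

for (k in 2:5) {
    # Membership vector with exactly k clusters
    memb <- cut_at(cfg_unw, k)
    cat("k =", k,
        " sizes:", paste(table(memb), collapse = " "),
        " modularity:", round(modularity(g, memb), 3), "\n")
}
```

Comparing the modularity values across k is one informal way to judge how much is lost by forcing a coarser partition.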

There is no plot method for a bare membership vector, but we can wrap the vector in a communities object by hand, which suffices for this purpose:

plot(structure(list(membership=cutat(cfg, 2)), class="communities"), karate)

[Figure: the karate club network with the two-cluster partition highlighted]
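
A possible alternative to building the object by hand is igraph's make_clusters function, which constructs a communities object from a membership vector. The sketch below assumes an igraph version that provides make_clusters and cut_at (1.0 or later), and again uses the unweighted built-in karate graph so that it is self-contained.

```r
library("igraph")

# Unweighted built-in karate graph (assumption, not the igraphdata version)
g <- make_graph("Zachary")
cfg_unw <- cluster_fast_greedy(g)

# Wrap the two-cluster membership vector in a communities object
two <- make_clusters(g, membership = cut_at(cfg_unw, 2))

sizes(two)     # number of nodes in each cluster
plot(two, g)   # same plot method as for the full clustering
```

Because the result is a regular communities object, functions such as membership and sizes work on it as well.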

Comparison with ground truth

We can see that the clustering method perfectly recovers the two karate club factions:

table(cutat(cfg, 2), vertex_attr(karate, "Faction"))

     1  2
  1 16  0
  2  0 18

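This demonstrates that, using only edge information (here the connections and their weights, which cluster_fast_greedy uses by default when the graph has a weight attribute), we are able to recover a meaningful node attribute.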
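
The cross-tabulation is easy to read here because the cluster labels happen to line up with the faction labels, but cluster labels are arbitrary in general. As a sketch of a label-invariant check, igraph's compare function can score the agreement between two membership vectors. The vectors below are copied from the output shown earlier in this document, and the variable names are ours.

```r
library("igraph")

# Two-cluster result and faction labels, copied from the output above
found   <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1,
             1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
faction <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1,
             1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

# Swapping the labels 1 <-> 2 describes exactly the same partition
swapped <- 3 - found

# Adjusted Rand index and normalized mutual information
ari_same    <- compare(found, faction, method = "adjusted.rand")
ari_swapped <- compare(swapped, faction, method = "adjusted.rand")
nmi_swapped <- compare(swapped, faction, method = "nmi")

c(ari_same, ari_swapped, nmi_swapped)
```

All three scores equal 1 for identical partitions, regardless of how the clusters are numbered.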

Session information

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] igraphdata_1.0.1 igraph_1.0.1     knitr_1.12.3    

loaded via a namespace (and not attached):
[1] magrittr_1.5     formatR_1.1      tools_3.2.3      codetools_0.2-14 stringi_1.0-1   
[6] digest_0.6.8     stringr_1.0.0    evaluate_0.8