Patrick O. Perry, NYU Stern School of Business
We will use the following R packages.
library("igraph")
library("igraphdata")
To ensure consistent runs, we set the seed before performing any analysis.
set.seed(0)
We will first cluster Zachary's Karate Club, one of the most famous network clustering datasets.
# Zachary's Karate Club data
data(karate)
summary(karate)
IGRAPH UNW- 34 78 -- Zachary's karate club network
+ attr: name (g/c), Citation (g/c), Author (g/c), Faction (v/n), name (v/c), label (v/c),
| color (v/n), weight (e/n)
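The summary line can be cross-checked with a few accessor functions. The following is a minimal sketch, assuming the igraphdata package is installed; the `upgrade_graph` call is a precaution we add for newer igraph versions and is not part of the original analysis.

```r
library("igraph")
library("igraphdata")

data(karate)
# Precaution (our assumption): convert the stored object to the internal
# format of the installed igraph version.
karate <- upgrade_graph(karate)

n_members <- vcount(karate)        # number of nodes; the summary reports 34
n_ties    <- ecount(karate)        # number of edges; the summary reports 78
directed  <- is_directed(karate)   # "U" in the summary stands for undirected
weighted  <- is_weighted(karate)   # "W" means a weight edge attribute exists

print(c(nodes = n_members, edges = n_ties))
print(c(directed = directed, weighted = weighted))
```

The letters in `UNW-` describe the graph as undirected, named, and weighted, which is why the clustering functions used later can take edge weights into account.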
The nodes in the graph represent the 34 members of a college karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.
The igraph package produces a plot of the graph:
plot(karate)
The faction labels are stored in a vertex attribute:
vertex_attr(karate, "Faction")
[1] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
Here are the counts of members in each faction:
table(vertex_attr(karate, "Faction"))
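To see who belongs to each faction, and not only how many, the member names can be grouped by the faction label. This is a small sketch that relies only on the `name` and `Faction` vertex attributes listed in the summary; the variable names are our own.

```r
library("igraph")
library("igraphdata")

data(karate)
karate <- upgrade_graph(karate)  # precaution for newer igraph versions

# Group the member names by their recorded faction (1 or 2)
members_by_faction <- split(V(karate)$name, V(karate)$Faction)

# Faction sizes, equivalent to the table() call above
faction_sizes <- lengths(members_by_faction)

print(members_by_faction)
print(faction_sizes)
```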
The igraph package implements a variety of network clustering methods, most of which are based on Newman-Girvan modularity. To see them all, refer to the ?communities documentation. The simplest such algorithm is the “fast greedy” method, which starts with each node in its own cluster and then repeatedly merges the pair of clusters whose merger yields the largest gain in modularity.
# Fast greedy modularity-based clustering
cfg <- cluster_fast_greedy(karate)
plot(cfg, karate)
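Since the method optimizes modularity, it can help to look at the modularity values themselves. The sketch below, with variable names of our own choosing, reports the modularity of the partition returned by the fast greedy method and, for comparison, the modularity of the recorded two-faction split computed with the same edge weights.

```r
library("igraph")
library("igraphdata")

data(karate)
karate <- upgrade_graph(karate)  # precaution for newer igraph versions

cfg <- cluster_fast_greedy(karate)

# Modularity of the partition selected by the algorithm
q_greedy <- modularity(cfg)

# Modularity of the recorded faction split, using the edge weights
q_faction <- modularity(karate, V(karate)$Faction,
                        weights = E(karate)$weight)

# Number of clusters found and their sizes
n_clusters    <- length(cfg)
cluster_sizes <- sizes(cfg)

print(c(greedy = q_greedy, faction = q_faction))
print(cluster_sizes)
```

Modularity is at most 1, and larger values indicate that more of the edge weight falls within clusters than would be expected if edges were placed at random with the same node degrees.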
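The cluster_fast_greedy function does not allow us to specify the number of clusters in advance; it returns the partition with the highest modularity along its sequence of merges. However, because the merges form a hierarchy, we can simplify the result to a smaller number of clusters using the cutat function: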
cutat(cfg, 2)
[1] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
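Because the two-cluster solution comes from cutting the merge hierarchy, every cluster in the default result should sit entirely inside one of the two coarser clusters. The sketch below checks this with a cross-tabulation. It uses `cut_at`, which we take to be the newer name for `cutat` in igraph 1.0 and later; if your version lacks it, `cutat(cfg, 2)` should be a drop-in substitute.

```r
library("igraph")
library("igraphdata")

data(karate)
karate <- upgrade_graph(karate)  # precaution for newer igraph versions

cfg <- cluster_fast_greedy(karate)

# Membership at the modularity-maximizing level of the hierarchy
memb_default <- as.vector(membership(cfg))

# Membership after cutting the hierarchy down to two clusters
memb_two <- cut_at(cfg, no = 2)

# Rows: default clusters; columns: the two coarser clusters
nesting <- table(default = memb_default, two = memb_two)
print(nesting)
```

Each row of the printed table should have exactly one nonzero entry, which is what it means for the finer clusters to be nested in the coarser ones.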
igraph does not provide a convenient interface for plotting the result, but the following command suffices for this purpose:
plot(structure(list(membership=cutat(cfg, 2)), class="communities"), karate)
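An alternative that avoids constructing a `communities` object by hand is to color the vertices by cluster membership. This is only a sketch: the two color names are arbitrary choices of ours, and `cut_at` is assumed to be available (otherwise use `cutat`).

```r
library("igraph")
library("igraphdata")

data(karate)
karate <- upgrade_graph(karate)  # precaution for newer igraph versions

cfg <- cluster_fast_greedy(karate)
memb_two <- cut_at(cfg, no = 2)

# Map cluster 1 and cluster 2 to two arbitrary colors
cluster_colors <- c("orange", "skyblue")[memb_two]

# vertex.color overrides the color vertex attribute stored in the data
plot(karate, vertex.color = cluster_colors)
```

Unlike the `communities` plot, this version does not draw shaded regions around the clusters, so it shows membership only through the vertex colors.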
We can see that the clustering method perfectly recovers the two karate club factions:
table(cutat(cfg, 2), vertex_attr(karate, "Faction"))
     1  2
  1 16  0
  2  0 18
This demonstrates that using only edge information, we are able to recover a meaningful node attribute.
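The agreement between the clustering and the factions can also be summarized in a single number with igraph's `compare` function. The sketch below computes the Rand index and the normalized mutual information; both equal 1 when two partitions coincide, as the table above reports for this dataset. The variable names are ours, and `cut_at` is assumed to be available (otherwise use `cutat`).

```r
library("igraph")
library("igraphdata")

data(karate)
karate <- upgrade_graph(karate)  # precaution for newer igraph versions

cfg <- cluster_fast_greedy(karate)
memb_two <- cut_at(cfg, no = 2)
faction  <- V(karate)$Faction

# Agreement scores between the two-cluster solution and the factions
rand_index <- compare(memb_two, faction, method = "rand")
nmi        <- compare(memb_two, faction, method = "nmi")

print(c(rand = rand_index, nmi = nmi))
```

These scores do not depend on how the clusters are numbered, so they remain valid even when cluster 1 happens to correspond to faction 2.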
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] methods stats graphics grDevices utils datasets base
other attached packages:
[1] igraphdata_1.0.1 igraph_1.0.1 knitr_1.12.3
loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.1 tools_3.2.3 codetools_0.2-14 stringi_1.0-1
[6] digest_0.6.8 stringr_1.0.0 evaluate_0.8