5.31.2013

Community detection in graphs

Community detection in graphs
Santo Fortunato
2010
http://arxiv.org/abs/0906.0612

// This is a major literature review, totaling over 100 pages (including references), about methods for detecting communities and clusters in graphs. There are many different methods and algorithms for defining and identifying a "community", and there are no universally agreed-upon definitions or methods, but reviews like this are very useful for understanding the state of network science.

// I went through and clipped the majority of the 40+ figures and example networks, and uploaded them to the photo album archive on my G+ stream. I've also curated a few pages of key information, especially concerning modularity and hierarchy, for easy browsing and reference here.

// I strongly encourage people to check out the original paper, which includes an appendix introducing basic terms and concepts in graph theory.

Abstract:

The modern science of networks has brought significant advances to our understanding of complex systems. One of the most relevant features of graphs representing real systems is community structure, or clustering, i.e. the organization of vertices in clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters. Such clusters, or communities, can be considered as fairly independent compartments of a graph, playing a role similar to that of, e.g., the tissues or the organs in the human body. Detecting communities is of great importance in sociology, biology and computer science, disciplines where systems are often represented as graphs. This problem is very hard and not yet satisfactorily solved, despite the huge effort of a large interdisciplinary community of scientists working on it over the past few years. We will attempt a thorough exposition of the topic, from the definition of the main elements of the problem to the presentation of most methods developed, with a special focus on techniques designed by statistical physicists, from the discussion of crucial issues like the significance of clustering and how methods should be tested and compared against each other, to the description of applications to real networks.

1. Basics
The first problem in graph clustering is to look for a quantitative definition of community. No definition is universally accepted. As a matter of fact, the definition often depends on the specific system at hand and/or application one has in mind. From intuition and the examples of Section II we get the notion that there must be more edges "inside" the community than edges linking vertices of the community with the rest of the graph.

This is the reference guideline at the basis of most community definitions. But many alternative recipes are compatible with it. Moreover, in most cases, communities are algorithmically defined, i.e. they are just the final product of the algorithm, without a precise a priori definition.

2. Local definitions
Communities are parts of the graph with few ties to the rest of the system. To some extent, they can be considered as separate entities with their own autonomy. So, it makes sense to evaluate them independently of the graph as a whole. Local definitions focus on the subgraph under study, possibly including its immediate neighborhood, but neglecting the rest of the graph. We start with a listing of the main definitions adopted in social network analysis, for which we shall closely follow the exposition of (Wasserman and Faust, 1994). There, four types of criteria were identified: complete mutuality, reachability, vertex degree and the comparison of internal versus external cohesion. The corresponding communities are mostly maximal subgraphs, which cannot be enlarged with the addition of new vertices and edges without losing the property which defines them.
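// To make the strictest of these criteria concrete: complete mutuality corresponds to cliques, maximal subgraphs in which every pair of vertices is adjacent. A minimal sketch, assuming Python with networkx (the karate club example graph is my choice for illustration, not from the paper):

```python
import networkx as nx

G = nx.karate_club_graph()  # a standard small social network

# Each maximal clique cannot be enlarged without losing complete mutuality,
# so every one is a candidate (very strict) community.
cliques = [c for c in nx.find_cliques(G) if len(c) >= 4]
for c in sorted(cliques, key=len, reverse=True):
    print(len(c), sorted(c))
```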

3. Global definitions

Communities can also be defined with respect to the graph as a whole. This is reasonable in those cases in which clusters are essential parts of the graph, which cannot be taken apart without seriously affecting the functioning of the system. The literature offers many global criteria to identify communities. In most cases they are indirect definitions, in which some global property of the graph is used in an algorithm that delivers communities at the end. However, there is a class of proper definitions, based on the idea that a graph has community structure if it is different from a random graph. A random graph à la Erdős–Rényi, for instance, is not expected to have community structure, as any two vertices have the same probability of being adjacent, so there should be no preferential linking involving special groups of vertices. Therefore, one can define a null model, i.e. a graph which matches the original in some of its structural features, but which is otherwise a random graph.

The null model is used as a term of comparison, to verify whether the graph under study displays community structure or not. The most popular null model is that proposed by Newman and Girvan, and consists of a randomized version of the original graph, where edges are rewired at random under the constraint that the expected degree of each vertex matches the degree of the vertex in the original graph (Newman and Girvan, 2004). This null model is the basic concept behind the definition of modularity, a function which evaluates the goodness of partitions of a graph into clusters.

Modularity will be discussed at length in this review, because it has the unique privilege of being at the same time a global criterion to define a community, a quality function and the key ingredient of the most popular method of graph clustering. In the standard formulation of modularity, a subgraph is a community if the number of edges inside the subgraph exceeds the expected number of internal edges that the same subgraph would have in the null model. This expected number is an average over all possible realizations of the null model. Several modifications of modularity have been proposed. A general class of null models, including modularity as a special case, has been designed by Reichardt and Bornholdt.
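// For reference (the formula is standard but not spelled out in the excerpts above), the Newman-Girvan modularity of a partition can be written as

```latex
Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
```

// where A is the adjacency matrix, k_i the degree of vertex i, m the total number of edges, and c_i the cluster of vertex i; the term k_i k_j / 2m is the expected number of edges between i and j in the degree-preserving null model described above.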

4. Definitions based on vertex similarity

It is natural to assume that communities are groups of vertices similar to each other. One can compute the similarity between each pair of vertices with respect to some reference property, local or global, no matter whether they are connected by an edge or not. Each vertex ends up in the cluster whose vertices are most similar to it. Similarity measures are at the basis of traditional methods, like hierarchical, partitional and spectral clustering.

Another important class of measures of vertex similarity is based on properties of random walks on graphs. One of these properties is the commute time between a pair of vertices, which is the average number of steps needed for a random walker, starting at either vertex, to reach the other vertex for the first time and to come back to the starting vertex.
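// A sketch of how commute time can be computed, using the standard identity C(u, v) = 2m (L+_uu + L+_vv - 2 L+_uv), where L+ is the Moore-Penrose pseudoinverse of the graph Laplacian (the identity is from spectral graph theory, not from the excerpt; Python with networkx/numpy assumed):

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
L = nx.laplacian_matrix(G).toarray().astype(float)
Lp = np.linalg.pinv(L)           # pseudoinverse of the Laplacian
m = G.number_of_edges()

def commute_time(u, v):
    # Average round-trip time of a random walker between u and v.
    return 2 * m * (Lp[u, u] + Lp[v, v] - 2 * Lp[u, v])

print(commute_time(0, 1), commute_time(0, 33))
```

// Low commute time is then read as high similarity between the two vertices.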

IV. TRADITIONAL METHODS

A. Graph partitioning

The problem of graph partitioning consists in dividing the vertices into g groups of predefined size, such that the number of edges lying between the groups is minimal. The number of edges running between clusters is called the cut size. Fig. 9 presents the solution of the problem for a graph with fourteen vertices, for g = 2 and clusters of equal size.

Specifying the number of clusters of the partition is necessary. If one simply imposed a partition with the minimal cut size, and left the number of clusters free, the solution would be trivial, corresponding to all vertices ending up in the same cluster, as this would yield a vanishing cut size. Specifying the size is also necessary, as otherwise the most likely solution of the problem would consist in separating the lowest degree vertex from the rest of the graph, which is quite uninteresting. This problem can actually be avoided by choosing a different measure to optimize for the partitioning, one which accounts for the size of the clusters.
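// A minimal sketch of equal-size bisection with the classic Kernighan-Lin heuristic, which networkx happens to ship (my example, not the paper's):

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

G = nx.karate_club_graph()

# Split into two groups of (roughly) equal size while reducing the cut size.
A, B = kernighan_lin_bisection(G, seed=42)
print(len(A), len(B), nx.cut_size(G, A, B))  # cut size = edges between groups
```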

B. Hierarchical clustering

In general, very little is known about the community structure of a graph. It is uncommon to know the number of clusters in which the graph is split, or other indications about the membership of the vertices. In such cases clustering procedures like graph partitioning methods can hardly be of help, and one is forced to make some reasonable assumptions about the number and size of the clusters, which are often unjustified. On the other hand, the graph may have a hierarchical structure, i.e. may display several levels of grouping of the vertices, with small clusters included within large clusters, which are in turn included in larger clusters, and so on. Social networks, for instance, often have a hierarchical structure. In such cases, one may use hierarchical clustering algorithms, i.e. clustering techniques that reveal the multilevel structure of the graph. Hierarchical clustering is very common in social network analysis, biology, engineering, marketing, etc.

The starting point of any hierarchical clustering method is the definition of a similarity measure between vertices. After a measure is chosen, one computes the similarity for each pair of vertices, no matter if they are connected or not. At the end of this process, one is left with a new n × n matrix X, the similarity matrix. In Section III.B.4 we have listed several possible definitions of similarity. Hierarchical clustering techniques aim at identifying groups of vertices with high similarity, and can be classified into two categories:

1. Agglomerative algorithms, in which clusters are iteratively merged if their similarity is sufficiently high;

2. Divisive algorithms, in which clusters are iteratively split by removing edges connecting vertices with low similarity.

The two classes refer to opposite processes: agglomerative algorithms are bottom-up, as one starts from the vertices as separate clusters (singletons) and ends up with the graph as a unique cluster; divisive algorithms are top-down as they follow the opposite direction.
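// A sketch of the agglomerative variant, assuming Python with SciPy; the similarity used here (number of common neighbours) is just one of the options listed in Section III.B.4:

```python
import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

G = nx.karate_club_graph()
n = G.number_of_nodes()

# Similarity matrix X: number of common neighbours of each vertex pair.
X = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        X[i, j] = X[j, i] = len(set(G[i]) & set(G[j]))

D = X.max() - X              # turn similarity into a dissimilarity
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D), method="average")     # the full dendrogram
print(fcluster(Z, t=2, criterion="maxclust"))    # cut it into two clusters
```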

C. Partitional clustering

Partitional clustering indicates another popular class of methods to find clusters in a set of data points. Here, the number of clusters is preassigned, say k. The points are embedded in a metric space, so that each vertex is a point and a distance measure is defined between pairs of points in the space. The distance is a measure of dissimilarity between vertices. The goal is to separate the points into k clusters so as to maximize/minimize a given cost function based on distances between points and/or from points to centroids, i.e. suitably defined positions in space.
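// The best-known instance is k-means, which minimizes the total squared distance from each point to the centroid of its cluster. A minimal sketch with scikit-learn (the data here is my own illustrative example):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated groups of points in the plane.
points = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(3.0, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_[:5])
print(km.cluster_centers_)
```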

D. Spectral clustering

Let us suppose we have a set of n objects x1, x2, ..., xn with a pairwise similarity function S defined between them, which is symmetric and non-negative. Spectral clustering includes all methods and techniques that partition the set into clusters by using the eigenvectors of matrices, like S itself or other matrices derived from it. In particular, the objects could be points in some metric space, or the vertices of a graph. Spectral clustering consists of a transformation of the initial set of objects into a set of points in space, whose coordinates are elements of eigenvectors: the set of points is then clustered via standard techniques, like k-means clustering. One may wonder why it is necessary to cluster the points obtained through the eigenvectors, when one can directly cluster the initial set of objects, based on the similarity matrix. The reason is that the change of representation induced by the eigenvectors makes the cluster properties of the initial data set much more evident. In this way, spectral clustering is able to separate data points that could not be resolved by applying k-means clustering directly, for instance, as the latter tends to deliver convex sets of points.
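// A sketch of one common variant of the pipeline on a graph (my example, assuming Python with networkx/scipy/scikit-learn): embed the vertices with the low eigenvectors of the normalized Laplacian, then run k-means on the embedded points.

```python
import networkx as nx
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

G = nx.karate_club_graph()
L = nx.normalized_laplacian_matrix(G).toarray()

vals, vecs = eigh(L)          # eigenvalues/eigenvectors, ascending order
k = 2
coords = vecs[:, 1:k + 1]     # skip the trivial bottom eigenvector

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
print(labels)
```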

VIII. DYNAMIC ALGORITHMS

This Section describes methods employing processes running on the graph, focusing on spin-spin interactions, random walks and synchronization.

A. Spin models

The Potts model is among the most popular models in statistical mechanics (Wu, 1982). It describes a system of spins that can be in q different states. The interaction is ferromagnetic, i.e. it favours spin alignment, so at zero temperature all spins are in the same state. If antiferromagnetic interactions are also present, the ground state of the system may not be the one where all spins are aligned, but a state where different spin values coexist, in homogeneous clusters. If Potts spin variables are assigned to the vertices of a graph with community structure, and the interactions are between neighbouring spins, it is likely that the structural clusters could be recovered from like-valued spin clusters of the system, as there are many more interactions inside communities than outside.
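// Schematically, in the spin-glass formulation of Reichardt and Bornholdt (my sketch of the general form, not a quotation from the review), one looks for the spin configuration minimizing a Hamiltonian of the type

```latex
\mathcal{H}(\{\sigma\}) = -\sum_{i<j} \left( A_{ij} - \gamma\, p_{ij} \right) \delta(\sigma_i, \sigma_j)
```

// where σ_i is the Potts spin (cluster label) of vertex i, p_ij the probability of an edge between i and j in the chosen null model, and γ a resolution parameter; like-valued spins in the ground state are read off as communities. For γ = 1 and the Newman-Girvan null model, minimizing H is equivalent to maximizing modularity.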

B. Random walk

Random walks (Hughes, 1995) can also be useful to find communities. If a graph has a strong community structure, a random walker spends a long time inside a community, due to the high density of internal edges and the consequently large number of paths that can be followed. Here we describe the most popular clustering algorithms based on random walks. All of them can be trivially extended to the case of weighted graphs.
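// The basic object behind these methods is the transition matrix P = D^{-1} A, whose powers give the t-step visit probabilities. A quick sketch (my example) showing how, for small t, a walker's probability mass stays mostly within its starting community:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

t = 3
Pt = np.linalg.matrix_power(P, t)      # probabilities after t steps
print(Pt[0].round(3))                  # walker started at vertex 0
```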

C. Synchronization

Synchronization (Pikovsky et al., 2001) is an emergent phenomenon occurring in systems of interacting units and is ubiquitous in nature, society and technology. In a synchronized state, the units of the system are in the same or similar state(s) at all times. Synchronization has also been applied to find communities in graphs. If oscillators are placed at the vertices, with initial random phases, and have nearest-neighbour interactions, oscillators in the same community synchronize first, whereas full synchronization requires a longer time. So, if one follows the time evolution of the process, states with synchronized clusters of vertices can be quite stable and long-lived, so they can be easily recognized.
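// A common concrete choice in this literature is the Kuramoto model. A minimal Euler-integration sketch (my example; parameters are illustrative):

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
n = len(A)

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, n)   # random initial phases
K, dt = 1.0, 0.01                      # coupling strength, time step

# Identical oscillators (natural frequencies set to zero), so
# dtheta_i/dt = K * sum_j A_ij * sin(theta_j - theta_i).
for _ in range(2000):
    coupling = (A * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    theta += dt * K * coupling

# Clusters of similar phase values hint at the community structure.
print(np.round(theta % (2 * np.pi), 2))
```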

XVII. APPLICATIONS ON REAL-WORLD NETWORKS

The ultimate goal of clustering algorithms is to infer properties of, and relationships between, vertices that are not available from direct observation/measurement. Still, there are also applications aiming at understanding real systems. Some results have been actually mentioned in the previous sections. This section is supposed to give a flavor of what can be done by using clustering algorithms. Therefore, the list of works presented here is by no means exhaustive. Most studies focus on biological and social networks. We mention a few applications to other types of networks as well.

A. Biological networks

The recent abundance of genomic data has allowed us to explore the cell at an unprecedented depth. A wealth of information is available on interactions involving proteins and genes, metabolic processes, etc. In order to study cellular systems, the graph representation is regularly used. Protein-protein interaction networks (PIN), gene regulatory networks (GRN) and metabolic networks (MN) have meanwhile become standard objects of investigation in biology and bioinformatics.

Biological networks are characterized by a remarkable modular organization, reflecting functional associations between their components. For instance, proteins tend to be associated in two types of cellular modules: protein complexes and functional modules. A protein complex is a group of proteins that mutually interact at the same time and place, forming a sort of physical object. Examples are transcription factor complexes, protein transport and export complexes, etc. Functional modules instead are groups of proteins taking part in the same cellular process, even if the interactions may happen at different times and places. Examples are the CDK/cyclin module, responsible for cell-cycle progression, the yeast pheromone response pathway, etc. Identifying cellular modules is fundamental to uncover the organization and dynamics of cell functions. However, the information on cell units (e.g. proteins, genes) and their interactions is often incomplete, or even incorrect, due to noise in the data produced by the experiments. Therefore, inferring modules from the topology of cellular networks enables one to restrict the set of possible scenarios and can be a safe guide for future experiments.

B. Social networks

Networks depicting social interactions between people have been studied for decades. Recently, modern Information and Communication Technology (ICT) has opened new modes of interaction between individuals, like mobile phone communications and online interactions enabled by the Internet. Such new social exchanges can be accurately monitored for very large systems, including millions of individuals, and their study represents a huge opportunity for social science. Communities of social networks can be friendship circles, groups of people sharing common interests and/or activities, etc.

Collaboration networks, in which individuals are linked if they are (or were) involved in a common activity, have often been studied because they embed an implicit objective concept of acquaintance, which is not easy to capture in direct social experiments/interviews. For instance, somebody may consider another individual a friend, while the latter may disagree. A collaboration, instead, is a proof of a social relationship between individuals. The analysis of the structure of scientific collaboration networks has exerted a big influence on the development of modern network science. Scientific collaboration is associated with coauthorship: two scientists are linked if they have coauthored at least one paper together. Information about coauthorships can be extracted from different databases of research papers.

C. Other networks

Citation networks have been regularly used to understand the citation patterns of authors and to disclose relationships between disciplines. Rosvall and Bergstrom used a citation network of over 6000 scientific journals to derive a map of science. They used a clustering technique based on compressing the information on random walks taking place on the graph. A random walk follows the flow of citations from one field to another, and the fields emerge naturally from the clustering analysis (Fig. 40). The structure of science resembles the letter U, with the social sciences and engineering at the terminals, joined through a chain including medicine, molecular biology, chemistry and physics.

Reichardt and Bornholdt performed a clustering analysis on a network built from bidding data taken from the German version of eBay (www.ebay.de), the most popular online auction site. The vertices are bidders and two vertices are connected if the corresponding bidders have expressed interest in the same item. Clusters were detected with the multiresolution modularity optimization developed by the authors themselves. In spite of the variety of items that can be purchased through eBay, about 85% of bidders were classified into a few major clusters, reflecting bidders' broad categories of interests. eBay data were also examined by Jin et al., who considered bidding networks where the vertices are the individual auctions and edges are placed between auctions having at least one common bidder. Communities, detected with greedy modularity optimization, make it possible to identify substitute goods, i.e. products that have value for the same bidder, so that they can be purchased together or alternatively.

Legislative networks enable one to deduce associations between politicians through their parliamentary activity, which may or may not be related to party affiliation. Porter and coworkers have carried out numerous studies on the subject, using data on the Congress of the United States. In Refs., they examined the community structure of networks of committees in the US House of Representatives. Committees sharing common members are connected by edges, which are weighted by dividing the number of common members by the number one would expect to have if committee memberships were randomly assigned. Hierarchical clustering reveals close connections between some of the committees. In another work Zhang et al. analyzed networks of legislation cosponsorship, in which vertices are legislators and two legislators are linked if they support at least one common bill. Communities, identified with a modification of Newman's spectral optimization of modularity, are correlated with party affiliation, but also with geography and committee memberships of the legislators.

Networks of correlations between time series of stock returns have received growing attention in the past few years. In early studies, scholars found clusters of correlated stocks by computing the maximum spanning tree of the network, and realized that such clusters match the economic sectors of the stocks quite well. More recently, the community structure of the networks has been investigated by means of proper clustering algorithms. Farkas et al. have applied the weighted version of the Clique Percolation Method and found that the presence of two strong (i.e. carrying high correlation) edges in triangles is usually accompanied by the presence of a strong third edge. Heimo et al. used the weighted version of the multiresolution method by Reichardt and Bornholdt. Clusters correspond to relevant business sectors, as indicated by Forbes classification; moreover, smaller clusters at lower hierarchical levels seem to correspond to (economically) meaningful substructures of the main clusters.

XVIII. OUTLOOK

Despite its remote origins and its great popularity in recent years, research on graph clustering has not yet given a satisfactory solution to the problem and leaves us with a number of important open issues. From our exposition it appears that the field has grown in a rather chaotic way, without a precise direction or guidelines. In some cases, interesting new ideas and tools have been presented; in others, existing methods have been improved, becoming more accurate and/or faster.

What the field lacks the most is a theoretical framework that defines precisely what clustering algorithms are supposed to do. Everybody has his/her own idea of what a community is, and most ideas are consistent with each other, but, as long as there is still disagreement, it remains impossible to decide which algorithm does the best job and there will be no control over the creation of new methods. Therefore, we believe that the first and foremost task that the scientific community working on graph clustering has to solve in the future is defining a set of reliable benchmark graphs, against which algorithms should be tested (Section XV.A). Defining a benchmark goes far beyond the issue of testing. It means designing practical examples of graphs with communities, and, in order to do that, one has to agree on the fundamental concepts of community and partition. Clustering algorithms have to be devised consistently with such definitions, in order to give the best performance on the set of designated benchmarks, which represent a sort of ground truth. The explosion in the number of algorithms we have witnessed in recent times is due precisely to the present lack of reliable mechanisms for controlling their quality and comparing their performance. If the community agrees on a benchmark, the future development of the field will be more coherent and the progress brought by new methods can be evaluated in an unbiased manner. The planted l-partition model is the easiest recipe one can think of when it comes to defining clusters, and is the criterion underlying well-known benchmarks, like that by Girvan and Newman. We believe that the new benchmarks have to be defined along the same lines.
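// For concreteness, a sketch of the planted l-partition model: l groups of k vertices each, with edge probability p_in inside groups and p_out between them. networkx provides a generator (the parameters below are my own illustrative choice):

```python
import networkx as nx

l, k = 4, 32                 # four planted groups of 32 vertices each
p_in, p_out = 0.3, 0.02      # p_in > p_out, so communities are well defined
G = nx.planted_partition_graph(l, k, p_in, p_out, seed=0)

# The planted groups are stored with the graph (in recent networkx versions).
print(G.graph["partition"])
```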

Defining a benchmark implies specifying the "natural" partition of a graph, the one that any algorithm should find. This issue in turn involves the concept of quality of a partition, which has characterized a large part of the development of the field, in particular after the introduction of Newman-Girvan modularity. Estimating the quality of a partition allows one to discriminate among the large number of partitions of a graph. In some cases this is not difficult. For instance, in the benchmark by Girvan and Newman there is a single meaningful partition, and it is hard to argue with that. But most graphs of the real world have a hierarchical structure, with communities including smaller communities and so on. Hence there are several meaningful partitions, corresponding to different hierarchical levels, and discriminating among them is hard, as they may be all relevant, in a sense.

If we consider the human body, we cannot say that the organization of the cells in tissues is more important than the organization in organs. We have seen that there are recent methods dealing with the problem of finding meaningful hierarchical levels. Such methods rank the hierarchical partitions based on some criterion and one can assess their relevance through the ranking. One may wonder whether it makes sense to sort out levels, which means introducing a kind of threshold on the quality index chosen to rank partitions (to distinguish "good" from "bad" partitions), or whether it is more appropriate to keep the information given by the whole set of hierarchical partitions.

The work by Clauset et al. on hierarchical random graphs, discussed in Section XII.B, indirectly raises this issue. There it was shown that the ensemble of model graphs, represented by dendrograms, encodes most of the information on the structure of the graph under study, like its degree distribution, transitivity and distribution of shortest path lengths. At the same time, by construction, the model reveals the whole hierarchy of communities, without any distinction between good and bad partitions. The information given by a dendrogram may become redundant and confusing when the graph is large, as then there is a large number of partitions. This is actually the reason why quality functions were originally introduced. However, in that case, one was dealing with artificial hierarchies, produced by techniques that systematically yield a dendrogram as a result of the analysis (like, e.g., hierarchical clustering), regardless of whether the graph actually has a hierarchical structure or not.

Here instead we speak of real hierarchy, which is a fundamental element of real graphs and, as such, must be considered in any serious approach to graph clustering. Any good clustering method must be able to tell whether a graph has community structure or not, and, in the first case, whether the community structure is hierarchical (i.e. with two or more levels) or flat (one level). We expect that the concept of hierarchy will become a key ingredient of future clustering techniques. In particular, assessing the consistency of the concepts of partition quality and hierarchical structure is a major challenge.

Since the paper by Palla et al., overlapping communities have received a lot of attention. However, there is still no consensus about a quantitative definition of the concept of overlapping community, and most definitions depend on the method adopted. Intuitively, one would expect that clusters share vertices lying at their borders, and this idea has inspired most algorithms. However, clusters detected with the Clique Percolation Method often share central vertices of the clusters, which makes sense in specific instances, especially in social networks. So, it is still unclear how to characterize overlapping vertices. Moreover, the concept of overlapping clusters seems at odds with that of hierarchical structure. No dendrogram can be drawn if there are overlapping vertices, at least in the standard way. Due to the relevance of both features in real networks, it is necessary to adapt them to each other in a consistent way. Overlapping vertices pose problems as well when it comes to comparing the results of different methods on the same graph. Most similarity measures are defined only in the case of partitions, where each vertex is assigned to a single cluster. It is then necessary to extend such definitions to the case of overlapping communities, whenever possible.

Another issue that is getting increasingly more popular is the study of graphs evolving in time. This is now possible due to the availability of timestamped network data sets. Tracking the evolution of community structure in time is very important, to uncover how communities are generated and how they interact with each other.

Finally, while there has been a tremendous effort in the design of clustering algorithms, basically nothing has been done to make sense of their results. What shall we do with communities? What can they tell us about a system? The hope is that they will enable one to disclose "hidden" relationships between vertices, due to features that are not known, for instance because they are hard to measure. It is quite possible that the scientific community will converge sooner or later to an a posteriori definition of community. Already now, most algorithms yield similar results in practical applications. But what is the relationship between the vertex classification given by the algorithms and real classifications? This is the main question beneath the whole endeavor.
