On twitter galaxies

2009-05-03 graphs mysql research

For the past few weeks I’ve been nerding around with a subset of the Twitter follower graph. It’s hard to visualize this graph because it’s really, really large (about 5MM nodes and 50MM edges), and the tool I’ve been using (neato) can only really handle graphs with edges in the thousands. It also takes forever to sample randomly from my edge database (MySQL). But I’ve been poking at it nonetheless, and I came up with a pretty neat little image this weekend.

(Click the image for a PDF.) I think it’s a fun image. It reminds me of a little petri dish, or of the view through a telescope into a cartoon universe. The stars and other shapes are like little organisms or galaxies, swimming around in this chaotic sea of online social networking.

This particular image was created by repeatedly selecting 100 rows at random offsets from a database table that stores graph adjacency information. Because the degrees of the nodes in the follower graph follow a power law, you sometimes select 100 rows that are all centered around one user, and you get these really dense communities. Happily, neato lays out these communities together in the center of the image. Often, though, you end up picking 100 rows that include several tiny communities involving just a few users. The graph layout tool puts these little galaxies near the border of the image, which makes the whole thing look a little bit artistic. Fun!
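
For the curious, here is a rough Python sketch of that kind of block sampling. It is not the script behind the image: it uses an in-memory SQLite table as a stand-in for MySQL, and the table name `edges`, its columns, and the helper names are all made up for illustration. It also assumes that rows for the same user sit next to each other in the table, which is what would let a 100-row block land inside a single community.

```python
import random
import sqlite3


def build_toy_edge_table(conn, num_users=200, seed=0):
    """Fill a hypothetical `edges` table with heavy-tailed follower counts."""
    rng = random.Random(seed)
    conn.execute("CREATE TABLE edges (followee INTEGER, follower INTEGER)")
    rows = []
    for followee in range(num_users):
        # Pareto-distributed degree, capped so the toy table stays small.
        degree = min(int(rng.paretovariate(1.0)), 500)
        # Rows for one followee are inserted back to back (an assumption
        # about how the adjacency table is laid out).
        rows.extend((followee, rng.randrange(10_000)) for _ in range(degree))
    conn.executemany("INSERT INTO edges VALUES (?, ?)", rows)
    return len(rows)


def sample_block(conn, rng, block_size=100):
    """Return `block_size` consecutive rows starting at a random offset."""
    (total,) = conn.execute("SELECT COUNT(*) FROM edges").fetchone()
    offset = rng.randrange(max(1, total - block_size + 1))
    cursor = conn.execute(
        "SELECT followee, follower FROM edges ORDER BY rowid LIMIT ? OFFSET ?",
        (block_size, offset),
    )
    return cursor.fetchall()


def sample_edges(conn, num_blocks, block_size=100, seed=None):
    """Repeat the block sample and pool the distinct edges."""
    rng = random.Random(seed)
    edges = set()
    for _ in range(num_blocks):
        edges.update(sample_block(conn, rng, block_size))
    return edges


conn = sqlite3.connect(":memory:")
build_toy_edge_table(conn)
edges = sample_edges(conn, num_blocks=5, seed=1)
```

One caveat with this approach: with a large `OFFSET`, the database generally has to walk past all the skipped rows before returning anything, which may be part of why sampling this way feels slow on a big table.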

Smaller samples

I generated a few other images from this dataset, using a smaller number of edges, and a different sampling strategy. For this set, I sampled 1MM edges randomly from the database into a Python script, and then from that sample I repeatedly drew samples of a fixed size. For instance, here’s one of the images drawn using 20,000 of the 1MM edges:
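
A two-stage sample like this is easy to sketch in Python. The version below is only an illustration under my own assumptions, not the original script: the edge pool is a plain list of `(follower, followee)` tuples, the function names are hypothetical, and the output is a Graphviz DOT string that a layout tool such as neato could read.

```python
import random


def draw_sample(edge_pool, sample_size, seed=None):
    """Draw `sample_size` distinct edges, uniformly, from an in-memory pool."""
    rng = random.Random(seed)
    return rng.sample(edge_pool, sample_size)


def to_dot(edges, name="followers"):
    """Render (follower, followee) pairs as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{", "  node [shape=point];"]
    for follower, followee in edges:
        lines.append(f'  "{follower}" -> "{followee}";')
    lines.append("}")
    return "\n".join(lines)


# Toy pool standing in for the 1MM edges pulled out of the database.
pool = [(i, i % 7) for i in range(1000)]
sample = draw_sample(pool, 50, seed=1)
dot_text = to_dot(sample)
# Saved to a file, this could be laid out with something like:
#   neato -Tpdf sample.dot -o sample.pdf
```

Because the pool lives in memory, drawing a second or third sample of a different size costs almost nothing compared with going back to the database.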

And another version with just 10,000 edges:

I think these are even more interesting (but somewhat less “cute”) than the first image, because they show the normal galaxy pattern, but there are also a few “link” users in there who join galaxies together, sometimes even forming long, narrow chains of follower relationships. This is particularly true in the 10k sample.

In addition, you can get a visual idea of the power law in effect by comparing the 20k sample to the 10k sample. There are twice as many edges in the 20k sample as in the 10k sample, and about half of the extra edges tend to be allocated to new pairs (around the edges in the visualization). The other half tend to be allocated to existing communities (closer to the center).
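
That split is eyeballed from the pictures, but it could be checked roughly in code. Here is one possible way, as a sketch: treat each connected component with exactly two nodes as an isolated pair, treat everything larger as a community, and count how many edges fall into each bucket. The function names and the two-node cutoff are my own assumptions for illustration.

```python
from collections import Counter


def component_of(edges):
    """Map each node to a representative of its connected component."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            # Path halving keeps the union-find trees shallow.
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return {node: find(node) for node in list(parent)}


def pair_vs_community(edges):
    """Count edges in two-node components versus edges in larger ones."""
    root = component_of(edges)
    size = Counter(root.values())
    in_pairs = sum(1 for a, _ in edges if size[root[a]] == 2)
    return in_pairs, len(edges) - in_pairs


# Two isolated pairs plus one three-node community.
toy_edges = [(1, 2), (3, 4), (5, 6), (6, 7), (7, 5)]
pairs, community = pair_vs_community(toy_edges)
```

Running `pair_vs_community` on a 10k sample and a 20k sample and comparing the two counts would show how the extra edges divide between new pairs and existing communities.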

Now, of course, these are all subsets of the whole dataset, and so they do not tell even close to the whole story. (This is particularly true now, in our post-Ashton era, because Twitter has many, many more follower relationships than just these 50MM.) But I enjoy looking at these images all the same.
