Attributing data to latent generators

2011-09-06 · machine-learning · research · statistics

I’m somehow fascinated by Gibbs sampling for Dirichlet Processes. Here’s how I think of the process at a high level:

  1. You have some data that you assume came from a bunch of unseen (latent) clusters from the same family, like a collection of Gaussians in a Gaussian mixture model.

  2. In theory, there might have been infinitely many clusters that could have produced your data. But neither computers nor humans can deal well with infinity in practice, so instead, for starters, you just go ahead and sample a finite set of clusters. Say 9 or 121.

  3. Assign each data point to a cluster at random.

  4. Then, take one data point, remove it from its current cluster, and assess the likelihood that it was generated by each of the existing clusters. But here’s the sneaky bit: while you’re doing this step, also sample a few new clusters from your prior over clusters. This will let you assign the current point to a new cluster, maybe!

  5. Assign the point to a cluster using some strategy: softmax, argmax, whatevs.

  6. Every once in a while (or concurrently, if you have a lock!), choose a cluster and update its parameters based on the data currently assigned to it, using Bayes’ rule to invert the likelihood of the data given the cluster (a worked Gaussian example follows the list).

  7. Repeat 4 through 6 until, say, no more points change clusters, or until cluster parameters stop changing very much, or the like.
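
To make step 6 concrete, here’s the textbook Normal-Normal conjugate update for the simplest case I can think of: one-dimensional Gaussian clusters with known variance and a Normal prior on each cluster mean. (This is my own illustrative choice; nothing about it is specific to the DP machinery.)

$$
\mu \sim \mathcal{N}(\mu_0, \tau_0^2), \qquad x_1, \dots, x_n \mid \mu \sim \mathcal{N}(\mu, \sigma^2)
$$

$$
\mu \mid x_{1:n} \sim \mathcal{N}(\mu_n, \tau_n^2), \qquad
\tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}, \qquad
\mu_n = \tau_n^2 \left(\frac{\mu_0}{\tau_0^2} + \frac{n \bar{x}}{\sigma^2}\right)
$$

“Inverting the likelihood” then just means drawing the cluster’s new mean from the posterior $\mathcal{N}(\mu_n, \tau_n^2)$, where $\bar{x}$ is the mean of the $n$ points currently assigned to that cluster.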

So, I’ve yet to implement this myself, and as such am likely to have missed important details. But, as far as I can tell, it works in theory.
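
Since I haven’t implemented it, here’s only a rough sketch of what I have in mind, in Python/NumPy, for one-dimensional Gaussian clusters with known variance. The constants (SIGMA, PRIOR_MU, PRIOR_TAU, N_FRESH) are placeholders of my own, and the assignment step uses the naive softmax from step 5 rather than proper CRP weights, so treat it as a cartoon of the recipe above, not a faithful DP sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

SIGMA = 1.0                      # assumed, known within-cluster std dev
PRIOR_MU, PRIOR_TAU = 0.0, 5.0   # prior over cluster means: N(PRIOR_MU, PRIOR_TAU^2)
N_FRESH = 3                      # brand-new clusters proposed per point (step 4)

def log_lik(x, mu):
    """Log-likelihood of a point under a Gaussian cluster with mean mu."""
    return -0.5 * ((x - mu) / SIGMA) ** 2

def sample_fresh_clusters(k=N_FRESH):
    """Step 4's sneaky bit: draw candidate cluster means from the prior."""
    return list(rng.normal(PRIOR_MU, PRIOR_TAU, size=k))

def resample_mean(points):
    """Step 6: conjugate Normal-Normal update for a cluster's mean."""
    n = len(points)
    prec = 1.0 / PRIOR_TAU**2 + n / SIGMA**2
    mean = (PRIOR_MU / PRIOR_TAU**2 + np.sum(points) / SIGMA**2) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec))

def gibbs(data, n_sweeps=50):
    means = sample_fresh_clusters(5)                   # step 2: finite starting set
    assign = rng.integers(len(means), size=len(data))  # step 3: random assignment

    for _ in range(n_sweeps):
        for i, x in enumerate(data):
            # Step 4: the point's old assignment never enters the likelihood,
            # so "removing it from its cluster" is implicit here.
            candidates = means + sample_fresh_clusters()
            logp = np.array([log_lik(x, mu) for mu in candidates])
            probs = np.exp(logp - logp.max())
            probs /= probs.sum()
            choice = rng.choice(len(candidates), p=probs)  # step 5: softmax draw
            if choice >= len(means):                       # it picked a fresh cluster
                means.append(candidates[choice])
                choice = len(means) - 1
            assign[i] = choice

        for k in range(len(means)):                        # step 6: update parameters
            pts = data[assign == k]
            if len(pts):
                means[k] = resample_mean(pts)

        # Drop clusters that lost all their points and relabel the rest.
        keep = [k for k in range(len(means)) if np.any(assign == k)]
        remap = {k: j for j, k in enumerate(keep)}
        means = [means[k] for k in keep]
        assign = np.array([remap[k] for k in assign])

    return means, assign

if __name__ == "__main__":
    data = np.concatenate([rng.normal(-4, 1, 100), rng.normal(3, 1, 100)])
    means, assign = gibbs(data)
    print("recovered cluster means:", np.round(means, 2))
```

On this toy data the sampler should settle on two well-separated means near -4 and 3, with the extra candidate clusters dying off once they stop attracting points.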

None of this is particularly new; I just find it a cool model for unsupervised clustering. What I do think could be new is extending the spirit of this model to other basis inference tasks.

For example, take the basis induction problem addressed by Smith & Lewicki (2006). They take a bunch of sound data and pipe it through matching pursuit, using the residual errors to adjust the codebook vectors. But they also want to infer the lengths of their basis vectors, and they do that with a sort of hacky heuristic: whenever a basis vector has peripheral activity surpassing some fixed threshold, a bunch of zero padding gets added to its end. Instead of hacking this together, wouldn’t it be nice to sample basis vectors from some principled prior, and use a similar clustering methodology to assign data to the basis vectors that best explain them?
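
To be concrete about the loop I mean, here’s a toy sketch (mine, not Smith & Lewicki’s actual algorithm): pick the codebook vector and offset with the largest correlation against the signal, subtract its contribution, then nudge that vector toward the residual it left behind. The fixed-length unit-norm codebook and the learning rate lr are assumptions for illustration; the length-growing heuristic, and the principled prior over lengths I’m wishing for, are exactly the parts left out.

```python
import numpy as np

def matching_pursuit_step(signal, codebook):
    """Greedy step: find the (unit-norm) codebook vector and offset with the
    largest correlation, subtract its contribution, and return the residual."""
    best_k, best_t, best_c = 0, 0, 0.0
    for k, phi in enumerate(codebook):
        corr = np.correlate(signal, phi, mode="valid")  # scores at every valid offset
        t = int(np.argmax(np.abs(corr)))
        if abs(corr[t]) > abs(best_c):
            best_k, best_t, best_c = k, t, corr[t]
    residual = signal.copy()
    residual[best_t:best_t + len(codebook[best_k])] -= best_c * codebook[best_k]
    return best_k, best_t, best_c, residual

def nudge_codebook(codebook, k, t, coeff, residual, lr=0.01):
    """Residual-driven update: move basis vector k toward the residual left in
    its support, then renormalize so correlations stay projection coefficients."""
    codebook[k] = codebook[k] + lr * coeff * residual[t:t + len(codebook[k])]
    codebook[k] /= np.linalg.norm(codebook[k])
    return codebook

# Toy usage: random unit-norm codebook, random stand-in for a sound signal.
rng = np.random.default_rng(0)
codebook = [v / np.linalg.norm(v) for v in rng.normal(size=(4, 32))]
signal = rng.normal(size=1000)
k, t, c, residual = matching_pursuit_step(signal, codebook)
codebook = nudge_codebook(codebook, k, t, c, residual)
```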