Thursday 7 July 2005, 10.30 am, Room 325, McCrea Building
Shai Ben-David, University of Waterloo, Ontario
The goal of this talk is to offer a theoretical analysis of some statistical aspects of clustering and, in particular, of some model selection considerations. We work in a framework where the data to be clustered have been sampled from some unknown probability distribution, and the aim is to gain insight into the structure of that distribution. We address the question of how to verify that a sample-based clustering reveals genuine structure in the underlying distribution rather than artifacts of the random sampling process.
We develop a formal notion of stability for sample-based clustering, which captures a necessary requirement for the 'meaningfulness' of a clustering, and prove that stability can be reliably estimated from samples. We go on to show that this notion reflects a fit, or alignment, between a clustering algorithm and a given data distribution, and thus can serve as a model selection tool.
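As a rough illustration of how stability might be estimated from samples, the sketch below draws pairs of independent samples, clusters each, and measures how often the two resulting clusterings disagree on fresh points. It is only a sketch under stated assumptions, not the formal definition used in the talk: k-means stands in for an arbitrary clustering algorithm, the disagreement measure is one simple choice, and every name (`kmeans`, `disagreement`, `instability`, `two_blobs`) is hypothetical.

```python
import random
from itertools import permutations


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm on 1-D points (stand-in clustering algorithm)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            groups[j].append(x)
        # Keep the old center if a group happens to be empty.
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return centers


def assign(centers, x):
    """Index of the center nearest to x."""
    return min(range(len(centers)), key=lambda c: (x - centers[c]) ** 2)


def disagreement(c1, c2, eval_points):
    """Fraction of points labelled differently, minimised over label permutations."""
    k = len(c1)
    a = [assign(c1, x) for x in eval_points]
    b = [assign(c2, x) for x in eval_points]
    best = min(
        sum(perm[i] != j for i, j in zip(a, b))
        for perm in permutations(range(k))
    )
    return best / len(eval_points)


def instability(sampler, k, n, trials, seed=0):
    """Average disagreement between clusterings of two independent samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s1 = [sampler(rng) for _ in range(n)]
        s2 = [sampler(rng) for _ in range(n)]
        fresh = [sampler(rng) for _ in range(n)]
        total += disagreement(kmeans(s1, k, rng), kmeans(s2, k, rng), fresh)
    return total / trials


def two_blobs(rng):
    """Assumed toy distribution: two well-separated Gaussians on the line."""
    return rng.gauss(-5.0, 1.0) if rng.random() < 0.5 else rng.gauss(5.0, 1.0)


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, instability(two_blobs, k, n=200, trials=5))
```

Used for model selection, one would compare the estimated instability across candidate values of `k` (or across algorithms) and prefer the settings for which independent samples lead to nearly the same clustering.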