Practical On-line Sampling Methods for Hypothesis Selection
Dr Carlos Domingo, Universitat Politecnica de Catalunya, Barcelona
Work with Ricard Gavalda (UPC, Barcelona) and Osamu Watanabe (Tokyo Institute of Technology, Tokyo)
Abstract: One of the core applications of machine learning to knowledge discovery consists of building a hypothesis (such as a decision tree or neural network) from a given amount of data, so that we can later use it to predict new instances of the data. In this paper, we focus on a particular situation where we assume that the hypothesis we want to use for prediction belongs to a hypothesis class of feasible size. We study the problem of how to determine which of the hypotheses in the class is almost the best one, a problem that we call hypothesis selection. We present two on-line sampling methods for hypothesis selection, give theoretical bounds on the number of examples needed, and analyze them experimentally with synthetic and real-world data. We compare them with the simple batch sampling approach commonly used in learning theory and show that in nearly all situations our algorithms use a much smaller number of examples. Our algorithms are particularly suitable for current data mining applications where the amount of data is supposed to be huge.
This seminar was held at the Department of Computer Science, Royal Holloway, University of London on 16 September 1998.