Overview of Learning Regular Languages

This page summarises some recent work on learning regular languages and finite state automata. A few salient facts to put this work in context:

  1. Non-deterministic finite state automata and deterministic finite state automata define the same classes of languages: the regular languages.
  2. Converting a non-deterministic automaton to a deterministic one that accepts the same language can cause an exponential increase in the number of states.
  3. Distribution-free learning of regular languages is not feasible, whether from positive and negative samples or from positive samples only.
  4. The stochastic variants of these two sorts of automata define classes of distributions that are not the same. Probabilistic Non-deterministic Finite State Automata (PFA) in general can define distributions not defined by any Probabilistic Deterministic Finite State Automata (PDFA).
  5. PFAs are roughly equivalent to Hidden Markov Models (HMMs).
  6. Training arbitrary HMMs or PFAs is computationally intractable: training means finding the HMM or PFA that maximises the likelihood of some data, and it is possible to define sets of strings that encode hard problems (Abe and Warmuth).
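To illustrate point 5, here is a minimal Python sketch of the standard construction that turns an HMM into a PFA: each PFA transition (s --sym--> t) gets weight B[s][sym] * A[s][t], and the two models then assign the same probability to every fixed-length string. The two-state model and all its numbers are invented for illustration.

```python
# Tiny hand-picked HMM over {a, b}; all probabilities are made up.
states = [0, 1]
pi = [0.6, 0.4]                       # initial state probabilities
A = [[0.7, 0.3], [0.2, 0.8]]          # A[s][t]: transition probability s -> t
B = [{"a": 0.9, "b": 0.1},            # B[s][sym]: emission probability in state s
     {"a": 0.25, "b": 0.75}]

def hmm_prob(obs):
    """Forward algorithm: probability that the HMM emits obs (fixed length)."""
    f = [pi[s] * B[s][obs[0]] for s in states]
    for sym in obs[1:]:
        f = [sum(f[s] * A[s][t] for s in states) * B[t][sym] for t in states]
    return sum(f)

def pfa_prob(obs):
    """The same probability under the derived PFA, whose transition
    (s --sym--> t) carries weight B[s][sym] * A[s][t]."""
    f = list(pi)
    for sym in obs:
        f = [sum(f[s] * B[s][sym] * A[s][t] for s in states) for t in states]
    return sum(f)
```

Summing over paths shows the two quantities agree term by term; note this sketch only covers the easy HMM-to-PFA direction, not the equivalence in full.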

In collaboration with Franck Thollard, I have proved that:

  1. Regular languages can be PAC-learned from positive examples alone, when the distributions of the samples are restricted to those generated by one of a large family of related PDFAs, using only a polynomial amount of data and polynomial computation.
  2. Given any target PDFA, you can learn a PDFA that approximates the target distribution under the Kullback-Leibler divergence, using polynomial data and computation, provided the sample complexity polynomial is given two extra parameters: a bound on the expected string length and a bound on the distinguishability of states.
  3. It isn't necessary for the algorithms to be provided with any information other than the alphabet and a sample of strings.
  4. In practice these algorithms are efficient: although the theoretical bounds are large, you do not need very much data.

This work draws heavily on the work of Rafael C. Carrasco, José Oncina and Dana Ron.
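As a rough illustration of what the distinguishability parameter in point 2 measures, here is a sketch in Python: a PDFA assigns each string a probability, and two states are distinguishable if the distributions over strings generated from them differ by some margin in the L-infinity sense. The two-state automaton and its probabilities below are invented for illustration; "$" is an end-of-string marker.

```python
import itertools

# A hypothetical 2-state PDFA over {a, b}. delta is the (deterministic)
# transition function; probs gives each state's distribution over
# "emit a", "emit b", and "stop" ("$").
delta = {0: {"a": 0, "b": 1},
         1: {"a": 1, "b": 0}}
probs = {0: {"a": 0.5, "b": 0.3, "$": 0.2},
         1: {"a": 0.1, "b": 0.6, "$": 0.3}}

def string_prob(q, s):
    """Probability that the PDFA, started in state q, generates exactly s."""
    p = 1.0
    for sym in s:
        p *= probs[q][sym]
        q = delta[q][sym]
    return p * probs[q]["$"]

def distinguishability(q1, q2, max_len=3):
    """L-infinity distance between the string distributions generated from
    q1 and q2, approximated over strings of length <= max_len."""
    best = 0.0
    for n in range(max_len + 1):
        for s in itertools.product("ab", repeat=n):
            best = max(best, abs(string_prob(q1, s) - string_prob(q2, s)))
    return best
```

A lower bound on this gap between every pair of distinct states is what the algorithm's sample complexity depends on: the smaller the gap, the more data is needed to decide whether two states should be merged.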

What does this mean for learning regular expressions from a set of strings? It depends on whether there is a small deterministic FSA that represents the language. Expressions like (a|b)*b(a|b){10} will cause problems: the determinisation algorithm has to expand the automaton out fully, which here requires on the order of 2^11 states.
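The blow-up can be checked directly with a small subset-construction sketch for the family (a|b)*b(a|b){n}, of which the expression above is the n = 10 case. The NFA below simply guesses which b is the marked one; the number of reachable DFA states grows as 2^(n+1), so n = 10 gives 2048 states.

```python
def nfa_delta(state, sym, n):
    """NFA for (a|b)*b(a|b){n}: state 0 loops on a/b and, on b, guesses
    that this b is the one followed by exactly n more symbols."""
    succ = set()
    if state == 0:
        succ.add(0)
        if sym == "b":
            succ.add(1)
    elif state <= n:          # counting states 1..n; state n+1 is accepting
        succ.add(state + 1)
    return succ

def determinize(n):
    """Subset construction; returns the number of reachable DFA states."""
    start = frozenset([0])
    seen, stack = {start}, [start]
    while stack:
        S = stack.pop()
        for sym in "ab":
            T = frozenset(t for s in S for t in nfa_delta(s, sym, n))
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return len(seen)

for n in range(1, 6):
    print(n, determinize(n))   # 4, 8, 16, 32, 64: grows as 2**(n+1)
```

Intuitively, the deterministic automaton must remember the last n+1 symbols it has read, and there are 2^(n+1) possibilities, so no smaller DFA exists.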

Alexander Clark