Royal Holloway, University of London

CURRENT ISSUES IN PARSING NATURAL LANGUAGE
Dr Ted Briscoe, Computer Laboratory, University of Cambridge

Abstract: Parsing systems able to analyse natural language text robustly and accurately would be of great value in computer applications ranging from document style checking or information retrieval/extraction to message understanding and automatic translation.

However, despite over three decades of research effort, no practical domain-independent parser of unrestricted text has been developed. Such a parser should return the correct, or a usefully 'close', interpretation for at least 95% of input sentences. It would need to solve the following three problems, which create severe difficulties for conventional parsers that use standard parsing algorithms with a generative grammar: segmentation, that is, dividing text appropriately into syntactically parsable units; disambiguation, that is, selecting the (unique) semantically and pragmatically correct analysis from the potentially large number of syntactically legitimate ones returned; and undergeneration, that is, dealing with input outside the system's lexical or syntactic coverage.

I'll describe our wide-coverage parsing system for English and our attempts to resolve these three problems. Punctuation, it turns out, is a neglected but significant source of constraint for text segmentation and disambiguation, removing as much as 30% of the ambiguity in typical input and increasing coverage by about 10%. Statistical language modelling techniques, widely used in speech recognition systems, are poorly suited to disambiguation, but probabilistic parsing models, designed to distinguish competing interpretations directly, have proved successful. Undergeneration remains the most significant problem for natural language parsing, but we are able to learn useful grammatical rules and lexical entries incrementally using statistical (Bayesian) inference techniques. The parser currently yields analyses for about 90% of unseen input, and the correct interpretation is selected from this set about 85% of the time.
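As a rough illustration of how the two closing figures relate to the 95% target, the sketch below multiplies them together. This rests on an assumption the abstract does not state: that the 85% selection rate applies only to the roughly 90% of sentences that receive any analysis, and that every unparsed sentence counts as a failure. The function name and variables are hypothetical, chosen for the example.

```python
def end_to_end_accuracy(coverage: float, selection_accuracy: float) -> float:
    """Fraction of all input sentences that end up with the correct analysis.

    Assumption (not stated in the abstract): selection_accuracy is measured
    only over sentences that received at least one analysis, and sentences
    with no analysis are counted as wrong.
    """
    return coverage * selection_accuracy


TARGET = 0.95              # the abstract's threshold for a practical parser
COVERAGE = 0.90            # about 90% of unseen input receives an analysis
SELECTION_ACCURACY = 0.85  # correct analysis chosen about 85% of the time

overall = end_to_end_accuracy(COVERAGE, SELECTION_ACCURACY)
shortfall = TARGET - overall

print(f"estimated end-to-end accuracy: {overall:.1%}")
print(f"shortfall against the 95% target: {shortfall:.1%}")
```

Under that reading, the system would be correct on roughly three quarters of unseen sentences, which suggests why undergeneration and disambiguation are both still described as open problems.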

This seminar was held at the Department of Computer Science, Royal Holloway, University of London on 18 February 1998

Last updated Mon, 15-Dec-2008 14:35 GMT / PS
Department of Computer Science, University of London, Egham, Surrey TW20 0EX
Tel/Fax : +44 (0)1784 443421 /439786