Dry run of PhD defense on Generating and Simplifying Sentences

Date: 31 October, 2014, 10:00, Room B011

Speaker: Shashi Narayan

Abstract: Depending on the input representation, Natural Language Generation can be categorized into two classes: data-to-text generation and text-to-text generation. This dissertation investigates issues from both classes and is accordingly divided into two parts: the first part (data-to-text generation, "Generating Sentences") centres on the task of generating natural language text from shallow dependency trees, and the second part (text-to-text generation, "Simplifying Sentences") addresses the task of generating one or more simple sentences from a complex sentence.

Generating Sentences. In comparison with statistical surface realisers, symbolic surface realisers are usually brittle and/or inefficient. In this thesis, we investigate how to make symbolic grammar-based surface realisation robust and efficient. We propose an efficient symbolic approach to surface realisation that uses a Feature-based Tree-Adjoining Grammar and takes as input shallow structures provided in the format of dependency trees from the Surface Realisation Shared Task. Our algorithm combines techniques and ideas from the head-driven and lexicalist approaches. In addition, the input structure is used both to filter the initial search space, using a concept called local polarity filtering, and to parallelise processes. We show that our algorithm drastically reduces generation times compared to a baseline lexicalist approach which explores the whole search space. To further improve robustness, we propose two error mining algorithms: first, an algorithm for mining dependency trees rather than sequential data; and second, an algorithm that structures the output of error mining into a tree (called a suspicion tree) to present it in a more meaningful way. We use these error mining algorithms to identify problems in our generation system. We show that our realiser, together with these error mining algorithms, improves its coverage by a wide margin. Finally, we focus on generating a more complex linguistic phenomenon, elliptic coordination. We extend our realiser to represent and generate different kinds of ellipsis such as gapping, subject sharing, right node raising and non-constituent coordination.

Simplifying Sentences. Earlier handcrafted rule-based simplification systems are limited to purely syntactic rules, and machine learning systems start either from the input sentence or from its phrasal parse tree. In this thesis, we argue for using rich linguistic information in the form of deep semantic representations to improve the sentence simplification task. We use Discourse Representation Structures (DRS) as the deep semantic representation of the input. We propose two methods for sentence simplification: a supervised approach to hybrid simplification using deep semantics and statistical machine translation, and an unsupervised approach to sentence simplification using the comparable Wikipedia corpus. Both approaches use DRS simplification models to perform semantically governed splitting and deletion operations. We show that our supervised approach is significantly ahead of existing state-of-the-art systems in producing simple, grammatical and meaning-preserving sentences, and that our unsupervised approach is competitive with them.

