Chapter 1 Introduction




Why Dependency Grammar?

According to Covington (2001), the constituency-based approach appears to have been invented only once, by the ancient Stoics, and to have been passed through formal logic to modern linguistics. The dependency-based approach, on the other hand, appears to have been invented many times in many places (Covington, 2001). Nevertheless, the constituency-based discourse has overshadowed every other view of syntactic representation. Mel’cuk argues that the constituency-based approach is particularly suitable for English, which was the mother tongue of its founding fathers (Mel’cuk, 1988). Furthermore, Mel’cuk summarizes a few reasons why the dependency model is preferable:

  1. A phrase-structure tree focuses on the grouping of words, that is, which words go together in the sentence, but does not represent the relations between those words.

  2. A dependency tree is based on relations. It shows which words are related and in what way. The sentence is “built out of words, linked by dependencies”. The relations can be described in more detail by giving them meaningful labels.

  3. A dependency tree also represents grouping. A phrase is represented by a word and its entire sub-tree of dependents.

  4. In a phrase-structure tree, most nodes are usually nonterminal, representing intermediate groupings. A dependency tree consists only of terminal nodes; there is no need for an abstract representation of grouping.

  5. In a phrase-structure tree, the linear order of the nodes is relevant: it must be kept to retain the meaning of the sentence. In a dependency tree this is not the case, since all information is preserved in the (possibly labeled) connections.
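The properties above can be illustrated with a minimal sketch. The sentence, head assignments and labels below are hypothetical examples, not drawn from any treebank; each word simply records the index of its governor, and the grouping of point 3 falls out of the tree for free:

```python
# A toy dependency tree for "The cat chased a mouse".
# Indices are 0-based; -1 marks the root of the sentence.
words = ["The", "cat", "chased", "a", "mouse"]
heads = [1, 2, -1, 4, 2]                 # each word's governor
labels = ["det", "subj", "root", "det", "obj"]

def phrase(i):
    """Yield of the subtree rooted at word i: the word plus all its
    (transitive) dependents. This is how a dependency tree also
    encodes grouping: a phrase is a word and its entire subtree."""
    deps = [j for j, h in enumerate(heads) if h == i]
    span = [i] + [k for j in deps for k in phrase(j)]
    return sorted(span)

print([words[k] for k in phrase(1)])     # the subject phrase
```

Here the labeled head links carry the relational information (points 1 and 2), every node is a terminal (point 4), and `phrase(1)` recovers the noun phrase "The cat" without any nonterminal nodes (point 3).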

  1. Notion of Treebanking

The improvement in natural language parsing during the last two decades has generally been attributed to the emergence of statistical and machine learning approaches (Collins, 1999; Charniak, 2000). However, such approaches became possible only with the availability of large-scale machine-readable syntactic trees, either handcrafted or automatically generated and manually corrected. The art or science of crafting or generating and organising machine-readable syntactic trees is called treebanking. In the following sub-sections, the concept of a treebank, the principles of treebanking and a review of various dependency treebanks are given.

    1. Some Background

The term ‘treebank’ was probably introduced by Geoffrey Leech (Sampson, 2003). The pioneering work in treebanking started in the early 1970s in Sweden with the inception of Talbanken (Teleman, 1974; Einarsson, 1976), which was developed at Lund University by manually annotating a Swedish corpus with phrase structure and grammatical functions. However, serious work in this area started in the 1980s, as recounted by Frederick Jelinek of IBM in his Lifetime Achievement talk at the Association for Computational Linguistics (ACL):

“We were not satisfied with the crude n-gram language model we were using and were “sure” that an appropriate grammatical approach would be better. Because we wanted to stick to our data-centric philosophy, we thought that what was needed as training material was a large collection of parses of English sentences. We found out that researchers at the University of Lancaster had hand-constructed a “treebank” under the guidance of Professors Geoffrey Leech and Geoffrey Sampson (Garside, Leech, and Sampson 1987). Because we wanted more of this annotation, we commissioned Lancaster in 1987 to create a treebank for us. Our view was that what we needed above all was quantity, possibly at some expense of quality …… We wanted to extract the grammatical language model statistically and so a large amount of data was required.” (Marcus, 1995)



Actually, it was the Linguistic Data Consortium (LDC), established at the University of Pennsylvania, that started massive and sophisticated efforts in developing treebanks for European languages. These efforts were later extended to non-European languages as well. Thus, there are Penn treebanks for various languages, such as the Penn English Treebank (Marcus et al., 1993), the Penn Arabic Treebank (Maamouri et al., 2004) and the Penn Chinese Treebank (Xue et al., 2004). The Penn English Treebank is one of the largest and most widely used English-language treebanks and has contributed greatly to the creation of important English NLP resources. Moreover, it is well documented and its documentation is freely available; consequently, it provides a solid template methodology for researchers attempting to produce treebanks in other languages. Similar efforts were made at Charles University in Prague and elsewhere, and various treebanks were created, including the Prague Dependency Treebank of Czech (Hajicova & Hajic, 1998; Böhmova et al., 2003), the Turkish Treebank (Oflazer et al., 2003), the Danish Dependency Treebank (Kromann, 2003) and the Turin University Treebank of Italian (Bosco & Lombardo, 2004). The AnnCorra Treebank (Bharati et al., 1995) is a similar effort for Indian languages at the LTRC Lab, and work is presently going on in some major Indian languages, e.g. Hindi, Telugu and Urdu (Begum et al., 2008; Vempaty et al., 2010; Bhat, 2012). Moreover, a Bangla Treebank has been constructed at IIT Kharagpur (Chatterji et al., 2009), and some efforts towards a dependency treebank for Kashmiri (KashTreeBank) have already been initiated (Bhat, 2012).

    1. What is a Treebank?

A treebank is a set of corpora annotated with skeletal syntactic information, such as POS labels at the word level and syntactic labels beyond the word level (Kristin Jacque, 2006). A treebank is a text corpus annotated with syntactic, semantic and sometimes even inter-sentential relations (Hajicova et al., 2010). It is essentially a machine-readable repository of annotated syntactic structures of a language that predominantly serves as a bank of training and testing data for the development of various computational tools and applications that use some form of supervised learning, e.g. deep syntactic parsers, chunkers, POS taggers, etc. Although the term ‘treebank’ initially referred to a bare collection of syntactic trees, its contemporary usage has been extended to corpora with all kinds of structural annotations, such as constituent structure, functional structure or predicate-argument structure (Nivre, 2005; Smedt & Volk, 2005). Currently, treebanks are augmented with different types of structural representations, and restricting a treebank to a single type of structural representation is no longer the state of the art. However, a basic skeletal treebank is a prerequisite for any further augmentation, such as multiple representations. Earlier treebanking efforts were based on manual annotation, which is laborious, time-consuming and error-prone. These limitations of manual annotation have led to the development of several alternative approaches, such as automatic annotation or automatic conversion, but these alternatives do not work for resource-poor languages like Kashmiri. We have to rely on manual annotation first, as no previous treebank resources are available for Kashmiri. Therefore, starting from scratch with manual methods is unavoidable until sufficient resources are created to train a parsing system for automatic annotation. Moreover, a large number of treebanks have been developed and many are currently under construction.
Many treebanks implement formats similar to those of the major treebanks, and new models are rarely devised. For instance, the English dependency treebank (Rambow et al., 2002) follows the model of the Prague Dependency Treebank but uses a mono-layered representation centered on the notion of predicate-argument structure instead of the multi-layered approach of Prague. Similarly, the Spanish Treebank adheres to the model of the Penn Treebank (Moreno et al., 2000).
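Machine-readable dependency treebanks are commonly distributed in tab-separated formats such as the CoNLL family, where each token carries its POS tag, the index of its head and a relation label. The sketch below reads a simplified five-column entry; the column set is reduced for illustration (real CoNLL-style files carry more columns), and the sentence is an invented example:

```python
# One sentence in a simplified CoNLL-style format.
# Columns: ID, FORM, POS, HEAD, DEPREL; HEAD 0 marks the root.
sample = (
    "1\tThe\tDT\t2\tdet\n"
    "2\tdog\tNN\t3\tnsubj\n"
    "3\tbarks\tVB\t0\troot"
)

def read_sentence(block):
    """Parse one tab-separated sentence block into token dicts."""
    rows = [line.split("\t") for line in block.splitlines()]
    return [
        {"id": int(i), "form": form, "pos": pos,
         "head": int(head), "rel": rel}
        for i, form, pos, head, rel in rows
    ]

tokens = read_sentence(sample)
root = next(t for t in tokens if t["head"] == 0)
print(root["form"])   # the root word of the sentence
```

A parser trained by supervised learning consumes exactly this kind of record: the FORM/POS columns are its input features and the HEAD/DEPREL columns are the labels it learns to predict.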

  1. Dependency Treebanks: A Brief Review

Most languages have a relatively free word order, and for treebanking in free-word-order languages dependency-based annotation schemes are used. It is because of this that the number of dependency treebanks across the world is ever expanding. Many of these dependency treebanks are briefly reviewed here:


    1. Prague Dependency Treebank (PDT)

The PDT for Czech is the largest of the existing dependency treebanks. Its corpus has been annotated on the basis of a multi-layer annotation scheme consisting of a morphological layer, an analytical (i.e. syntactic) layer and a tectogrammatical (i.e. semantic) layer (Hajic, 1998; Bohmova and Hajikova, 1999; Böhmova et al., 2003). It consists of approximately 90,000 sentences from newspaper articles on diverse topics (e.g. politics, sports, culture) and texts from popular science magazines, selected from the Czech National Corpus (Kakkonen, 2006). There are 3,030 morphological tags in the morphological tagset (Hajic, 1998). The syntactic annotation comprises 23 dependency relations. The annotation of the three levels has been done separately, by different groups of annotators. The morphological tagging was performed by two human annotators selecting the appropriate tag from a list proposed by a tagging system; a third annotator then resolved any differences between the two annotations. The syntactic annotation was at first done completely manually, with the help of the ambiguous morphological tags and a graphical user interface. After the annotation of about 19,000 sentences, the Collins Lexicalized Stochastic Parser (Nelleke et al., 1999) was trained on the data, achieving 80% accuracy. Thereafter, the work of the annotators changed from building trees from scratch to post-editing (checking and correcting) the parses assigned by the parser, except for the analytical functions, which still had to be assigned manually. Other treebank projects use the same framework developed for the PDT. For instance, the Prague Arabic Dependency Treebank (Hajic et al., 2004) is a treebank of Modern Standard Arabic consisting of around 49,000 tokens of newswire text from the Arabic Gigaword and the Penn Arabic Treebank. The Slovene Dependency Treebank consists of around 500 annotated sentences obtained from the MULTEXT-East Corpus (Erjavec, 2005).

    1. Russian Dependency Treebank

The Dependency Treebank for Russian is based on the Uppsala University Corpus (Lonngren, 1993). The texts have been collected from contemporary Russian prose, newspapers and magazines (Boguslavsky et al., 2000; 2002). The treebank consists of about 12,000 annotated sentences. There are 78 syntactic relations, divided into 6 subgroups (e.g. attributive, quantitative and coordinative). The annotation is layered, in the sense that the different levels of annotation are independent and can be extracted or processed independently. The treebank was developed automatically with the help of a morphological analyzer and a syntactic parser (Apresjan et al., 1992), followed by manual post-editing.

    1. Italian Dependency Treebank

The Italian Dependency Treebank is known as the Turin University Treebank. It consists of 1,500 sentences divided into 4 sub-corpora (Bosco, 2000; Lesmo et al., 2002; Bosco and Lombardo, 2003). The majority of the text is from the civil law code and newspaper articles. The annotation format is based on the Augmented Relational Structure (ARS). The POS tagset consists of 16 categories and 51 subcategories. There are around 200 dependency types, organized in 5 levels. The scheme provides the annotator with the possibility of marking a relation as under-specified if a correct relation type cannot be determined. The annotation process consists of automatic tokenization, morphological analysis, POS disambiguation and syntactic parsing (Lesmo et al., 2002).

    1. German Treebank

It is known as the TIGER Treebank (Brants et al., 2002). It was developed on the basis of the NEGRA Corpus (Skut et al., 1998) and consists of complete articles, covering diverse topics, collected from a German newspaper. It comprises approximately 50,000 sentences. It combines phrase structure and dependency annotation, organized so that phrase categories are marked on the non-terminals, POS information on the terminals and syntactic functions on the edges. The syntactic annotation is rather simple and flat.

    1. English Dependency Treebank

The Dependency Treebank of English consists of dialogues between a travel agent and customers (Rambow et al., 2002) and is the only dependency treebank with spoken-language annotation. The treebank has about 13,000 words. The annotation is a direct representation of lexical predicate-argument structure: arguments and adjuncts are dependents of their predicates, and all function words are attached to their lexical heads. The annotation is done at a single syntactic level, without a separate representation for surface syntax, the aim being to keep the annotation process as simple as possible. The trained annotators have access to an on-line manual and work off the transcribed speech without access to the speech files. The dialogues are parsed with a dependency parser, the Supertagger and Lightweight Dependency Analyzer (Bangalore and Joshi, 1999). The annotators correct the output of the parser using a graphical tool developed by the Prague Dependency Treebank project.

    1. Basque and Danish Dependency Treebanks

The Basque Dependency Treebank (Aduriz et al., 2003) consists of 3,000 manually annotated sentences from newspaper articles. The syntactic tags are organized as a hierarchy. The annotation is done with the aid of an annotation tool providing tree visualization and automatic checking of tag syntax.

The annotation of the Danish Dependency Treebank is based on the Discontinuous Grammar formalism, which is closely related to Word Grammar (Kromann, 2003). The treebank consists of 5,540 sentences covering a wide range of topics. The morpho-syntactically annotated corpus is obtained from the PAROLE Corpus (Keson and Norling-Christensen, 2005), so no morphological analyzer or POS tagger is applied. The dependency links are marked manually using a command-line interface with a graphical parse view.



    1. Turkish Dependency Treebank

It is known as the METU-Sabanci Turkish Treebank. It consists of 5,000 morphologically and syntactically annotated sentences. The treebank is represented in the XML-based Corpus Encoding Standard format (Anne and Romary, 2003). Due to the morphological complexity of Turkish, morphological information is encoded as sequences of inflectional groups (IGs). An IG is a sequence of inflectional morphemes, and successive IGs are divided by derivation boundaries. The dependencies between IGs are annotated with the following ten link types: subject, object, modifier, possessor, classifier, determiner, dative adjunct, locative adjunct, ablative adjunct and instrumental adjunct. The annotation is done in a semi-automated fashion, though a lot of manual work is also involved. First, a morphological analyzer based on the two-level morphology model (Oflazer, 1994) is applied to the texts. The morphologically analyzed and pre-processed text is input to an annotation tool. The tagging process requires two steps: morphological disambiguation and dependency tagging. The annotator selects the correct tag from the list of tags proposed by the morphological analyzer. After the whole sentence has been disambiguated, dependency links are specified manually.
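The splitting of a word into IGs can be sketched as follows. This is a hedged illustration, assuming the "^DB" derivation-boundary convention used in Oflazer-style Turkish morphological analyses; the analysis string itself is an invented example, not taken from the treebank:

```python
# Illustrative morphological analysis of a Turkish word:
# a nominal root with its inflections, then a derivation into a verb.
# "^DB" marks the derivation boundary (assumed convention).
analysis = "kitap+Noun+A3sg+Pnon+Nom^DB+Verb+Acquire+Pos+Past+A3sg"

def inflectional_groups(analysis):
    """Each IG is a run of inflectional morphemes; a new IG starts
    after every derivation boundary (^DB)."""
    return analysis.split("^DB")

igs = inflectional_groups(analysis)
print(len(igs))   # 2 IGs: the nominal root and the derived verb
```

In the treebank, dependency links are then drawn between such IGs rather than between whole word forms, which is why the ten link types above attach at the IG level.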

    1. Danish, Portuguese and Estonian Treebanks

The Danish, Portuguese and Estonian treebanks are called Arboretum, Floresta Sintactica and Arborest, respectively. These are sibling treebanks, of which Arboretum is the oldest. The treebanks are hybrids, with both constituent and dependency annotation organized into two separate levels; the levels share the same morphological tagset. The dependency annotation is based on Constraint Grammar (CG) (Karlsson, 1990) and consists of 28 dependency types. For creating each of these treebanks, a CG-based morphological analyzer and parser has been applied. The annotation process consists of CG parsing of the texts, followed by conversion to constituent format and manual checking of the structures. The Danish treebank (Bick, 2003; Bick, 2005) has around 21,600 sentences annotated with dependency tags, of which 12,000 sentences have also been marked with constituent structures. The annotation is available in both TIGER-XML and Penn export formats. The Portuguese treebank (Afonso et al., 2002) consists of around 9,500 manually checked and around 41,000 fully automatically annotated sentences obtained from a newspaper corpus. The Estonian treebank (Bick et al., 2005) consists of 149 sentences from newspaper articles. The morpho-syntactic and CG-based surface-syntactic annotations are obtained from an existing corpus, which is converted semi-automatically to the Arboretum-style format.

    1. AnnCorra : Treebanks for Indian Languages

AnnCorra (the Hyderabad treebanks) for Indian languages (ILs) is a set of dependency treebanks which use an indigenous karaka-theory-based grammatical scheme, known as Paninian Computational Grammar, for syntactic annotation (Bharati et al., 1996; Begum et al., 2008). Currently, treebanks of four ILs, namely Hindi, Urdu, Bangla and Telugu, following this grammatical scheme, are under development. The Hindi dependency treebank consists of 20,705 sentences, the Urdu dependency treebank of 3,226 sentences from a newspaper corpus, the Bangla dependency treebank of 1,279 sentences and the Telugu dependency treebank of 1,635 sentences (Bhat & Sharma, 2012; Vempaty et al., 2010), annotated with linguistic information at the morpho-syntactic (morphological, part-of-speech and chunk information) and syntactico-semantic (dependency) levels. The annotation schemes in all these treebanks consider the verb to be the root of the sentence. The relationship between a participant and the event/activity/state denoted by the verb is marked using relations called karaka. It has been shown that the notion of karaka incorporates the local semantics of a verb in a sentence and that it is syntactico-semantic. Indian languages are morphologically rich and have a relatively free constituent order. Unlike karaka relations, structural relations like the subject and the objects are considered less relevant for the grammatical description of ILs due to the less configurational nature of these languages (Bhat, 1991; Begum et al., 2008).

  1. Principles of Treebanking

According to Huang (2003), there are four general principles that have been considered important for the design and development of a treebank. These principles are given below:

    1. Maximal Resource Sharing

The resources for developing a treebank include corpora, tools, annotation schemes, guidelines and human annotators. Since developing these resources from scratch can be a very expensive and time-consuming process, one should make maximum use of existing resources, where available. For instance, in order to achieve maximal resource sharing, the Sinica Treebank (Chen et al., 1996) was bootstrapped from existing Chinese computational linguistic resources. The textual material was extracted from the tagged Sinica Corpus (ibid.). Moreover, the same research team that carried out the POS annotation of the Sinica Corpus also annotated the Sinica Treebank, to ensure consistency in the interpretation of texts and tags.

    1. Minimal Structural Complexity

The criterion of minimal structural complexity is motivated by the idea that the annotated structural information should be shareable regardless of users’ theoretical orientations. Theory-internal motivations often require abstract intermediate phrasal levels, like the intermediate phrasal category X' in X-bar theory, and abstract covert categories, like INFL in GB theory. Although these phrasal categories are well motivated within their theories, their significance cannot be maintained across theoretical frameworks. Since the minimal basic-level structures are shared by all theories, it is better to annotate the information that is most commonly shared among theories, such as the canonical phrasal categories.

    1. Optimal Semantic Information

The most critical issue in treebanking, as in the theories related to NLP, is how much semantic information should be incorporated. The original Penn Treebank used a purely syntactic approach. A purely semantic approach is yet to be attempted. A third approach involves the annotation of partial semantic information, especially that encoded in argument relations. It is this third approach that is shared by most treebanks; e.g., the Prague Dependency Treebank (Bohmova and Hajikova, 1999) and the AnnCorra treebanks (Bharati et al., 1994) use a syntacto-semantic approach. In this approach, the thematic relation between a predicate and an argument is marked in addition to the grammatical category. This allows optimal semantic information to be incorporated in a treebank and, subsequently, in an NLP system such as a syntactic parser.

    1. Minimal Granularity

A more important parameter is the granularity (depth) of the analyses in a treebank. While some of the earliest syntactically annotated corpora contain information only about syntactic boundaries, others contain constituent structures (Abeille, Clement, and Toussenel, 2003), functional dependency structures (Hajic, 1998) or, in addition to the syntactic structures, predicate-argument structures (Marcus et al., 1994). The present KashTreeBank (S. Bhat, 2012) contains inter-chunk dependency relations in addition to POS and chunk labels.

  1. Summary

In this chapter a review of the literature was presented, with a focus on dependency grammar (DG), dependency parsing and treebanking. First, different grammar formalisms that are considered close to DG were briefly presented, in order to compare their fundamental notions with those of DG and to identify the common ground between them, since, like DG, they all contrast with constituency-based formalisms. Sample representations for each of these formalisms, i.e. for DG, RG and LFG, were also given. A review of PSG-based formalisms was deliberately avoided, as their representations and notions were hardly required once it was confirmed that DG-based formalisms are preferred for treebanking in languages with relatively variable word order, for reasons some of which are given in section five of this chapter. After the grammar formalisms, the notion of non-configurationality was elaborated, along with some modifications made to the original PSG-based formalisms in order to minimize the operational apparatus and incorporate notions of dependency, e.g. the incorporation of the VP shell. This was done to justify the suitability of DG for inflectionally rich languages. Next, the history of dependency-based representations was charted, its roots were traced and its development in different grammatical traditions was described. In the next section, the notion of treebanking was introduced, along with some background on what triggered the wave of treebank creation. The notion of a treebank was also defined. Further, some important dependency treebanks were introduced, and finally the principles that should govern treebanking efforts were presented.