Table.1. Cumulative frequencies (fx) of POS
-
Summary
This chapter has explored the fundamental layer of annotation of the dependency treebank of Kashmiri, i.e. POS tagging, with reference to the four datasets taken from KashCorpus and discussed in Chapter-III. The task handled in this chapter has been introduced in section.1, and the notion of POS tagging has been explained at the beginning of section.2. In the same section, some important corpus annotation standards have been discussed and the various existing POS tagsets have been briefly reviewed. The section has also given a category-wise description of the Kashmiri POS tagset used in the current work, together with comparative statistical information about the various sub-categories involved. In section.3, the prerequisites for actual POS tagging have been discussed first, namely the annotation interface and a particular data storage format. The SA-Interface of the Sanchay platform has been used for the current task, and the procedure for using it has been illustrated with various snapshots. The storage format called SSF has also been discussed, along with the need to rely on such a format in any annotation pipeline. Later, the actual POS annotation has been discussed along with its results, in the form of the linguistic issues raised and their solutions, which have been presented as a mini-guideline. Finally, statistical results such as the frequency and cumulative frequency of the various POS categories have been given in the same section.
Overall, the chapter has explored and discussed various annotation schemes and tools that have been found relevant to the present work, and has laid the foundations for building the dependency treebank (KashTreeBank), using four samples of data taken from KashCorpus. The next chapter addresses two further layers of annotation, which revolve around syntactic dependencies.
Chapter.5. Chunking of KashCorpus
Judgments are inherently unreliable because of their
unavoidable meta-cognitive overtones, because grammaticality
is better described as a graded quantity, and for a host of other reasons.
Edelman and Christianson (2003)
-
Introduction
Chunking is the second level of annotation in developing a dependency treebank based on the HTB guidelines (Bharati et al., 2012). It involves annotating clusters of words, formed on the basis of local dependencies, with predefined chunk labels. The chunk layer encodes an intermediate level of linguistic information between the POS level and the dependency level. In fact, it covers all those dependency relations which dependents form with their heads, except those formed with the verbal head. Although it covers all the lower-level dependencies which do not belong to the argument-adjunct level, these dependencies are not overtly labeled. Nevertheless, the chunk layer is crucial for the annotation of inter-chunk dependency relations.
This chapter is mainly concerned with describing the second layer of annotation of KashTreeBank. The second section deals with the notion of chunk, the third discusses the rationale behind chunking, and the fourth describes the chunk tagset. Section five describes the process of manual chunking carried out with the help of the Sanchay SA Interface, section six discusses the issues encountered during the annotation process, and section seven presents the results, both statistical and theoretical. Section eight presents the guidelines and section nine summarizes the chapter.
-
The Notion of Chunk
The term ‘chunk’ appears similar to the term ‘phrase’, and both refer to a group of words, but a chunk and a phrase differ considerably. The former is a general term which has been widely used across various disciplines for a perceptually compact group of entities; in linguistics, it refers to a non-recursive group of words. The latter is a purely syntactic term which refers to constituents that are often recursive in nature. According to Abney (1991), a chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template. For example, in the Kashmiri noun chunk [huth/DMD baagas/NN manz/PP].NC (in/PP that/QT garden/NN), the content word baagas/N (garden) is surrounded by the function words huth/DM (that) and manz/PP (in).
Abney (1995) also defines a chunk as “the non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head but not including post-head dependents.” There is psychological evidence for the existence of chunks. Gee and Grosjean (1983) hold that chunks are performance structures of word clustering that emerge from a variety of psychological experimental data, such as pause durations in reading and naive sentence diagramming. They argued that performance structures are best predicted by what they called Ø-phrases, which are created by breaking the input string after each syntactic head that is a content word. They do not assign syntactic structure to chunks and assume that pre-nominal adjectives do not qualify as syntactic heads; otherwise, a phrase like ‘a big dog’ would comprise not one chunk but two. In contrast, Abney (1994) argued that a chunk has syntactic structure, comprising a connected sub-graph of the global parse tree of a sentence, and that chunks are represented in terms of major heads. Major heads are all content words except those that appear between a function word and the content word, e.g. ‘proud’ is a major head in ‘a man proud of his son’ but not in ‘the proud man’, because there it appears between the function word ‘the’ and the content word ‘man’.
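The Ø-phrase procedure described above (break the input after each content-word head) can be illustrated with a minimal sketch. The simplified tags (DT, JJ, NN, VB), the function name and the choice of nouns and verbs as the only head tags are assumptions made for illustration; they are not taken from Gee and Grosjean (1983) or from the tagset used in this work.

```python
# Hypothetical, simplified tags used only for this illustration.
DEFAULT_HEAD_TAGS = frozenset({"NN", "VB"})  # assumption: nouns and verbs are the content-word heads

def phi_phrases(tagged, head_tags=DEFAULT_HEAD_TAGS):
    """Split a list of (word, tag) pairs after every token whose tag is a head tag."""
    phrases, current = [], []
    for word, tag in tagged:
        current.append(word)
        if tag in head_tags:
            phrases.append(current)
            current = []
    if current:  # trailing words that are not followed by any head
        phrases.append(current)
    return phrases

sentence = [("a", "DT"), ("big", "JJ"), ("dog", "NN"), ("barked", "VB")]
one_chunk = phi_phrases(sentence)                                  # JJ is not a head
two_chunks = phi_phrases(sentence, head_tags={"NN", "VB", "JJ"})   # JJ treated as a head
```

Under the first setting, ‘a big dog’ stays in one group; if adjectives are counted as heads, it is split in two, which is the outcome the assumption about pre-nominal adjectives is meant to avoid.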
However, the practical considerations involved in implementing a framework on corpus samples can lead to a variety of word constellations that may or may not be psychologically real chunks of the kind discussed above. Chunks may therefore not conform to the well-known definitions of a chunk, and may merely be ad hoc solutions to more practical problems, e.g. non-contiguity. Thus, a chunk is either a sub-tree within a syntactic phrase structure tree, corresponding to a nominal, prepositional, adjectival, adverbial or verbal phrase (Abney, 1991, 1992, 1995), or simply a word group based only on local surface information, e.g. the noun group and the verb group (Bharati et al., 1995). Sometimes even the simplest notion of a chunk as a word group may be problematic when handling discontinuity (see Bhat, 2012).
-
Rationale for Chunking
It has already been discussed in Chapter-I and Chapter-II that dependency relations are asymmetrical grammatical relations, i.e. head-dependent or modifier-modified relations between words. These relations hold at two levels: at the chunk level, between the words of a minor POS class (secondary dependents) and the word of a major POS class (secondary head), and at the sentential level, between the secondary heads (primary dependents) and the primary head, i.e. the finite verb. The rationale behind dividing dependency relations into two levels is that it incorporates the popular notion of phrase, though crudely, and thereby permits a division of labor that helps achieve consistency in syntactic annotation. Moreover, at the POS level the focus was more on the form of words than on the function they perform in a sentence. Positing the intermediate chunk level therefore restores the scope of function, which is the key force in constructing a sentence. At the first level, however, the dependency relations between dependent words and their head word, which constitute secondary modifier-modified relations, have not been labeled explicitly. Instead, the cluster of dependent words and the head word has been annotated with chunk tags, which have been devised on the basis of the notion of head. The relations can therefore be easily predicted by computing the head from the chunk label, e.g. in an NP chunk, N will be the head, but not JJ, DM, QT, RP or PP. Similarly, in RBP, RB will be the head. There is thus no need to label relations explicitly at this level, as they can be computed from the information encoded in the tags.
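The idea of computing the head from the chunk label can be sketched as follows. This is only an illustration under stated assumptions: the mapping `HEAD_POS`, the function names and the choice of the rightmost candidate as head are hypothetical. Only the NP and RBP cases are stated in the text, and the inclusion of NNP and PRP among the nominal head tags is an assumption.

```python
# Assumed mapping from chunk label to the POS tags that may head it.
# Only "N heads NP" and "RB heads RBP" come from the text; NNP and PRP are assumptions.
HEAD_POS = {
    "NP": {"NN", "NNP", "PRP"},
    "RBP": {"RB"},
}

def head_index(chunk_label, tokens):
    """Return the index of the rightmost token that can head the chunk, or None."""
    head_tags = HEAD_POS.get(chunk_label, set())
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i][1] in head_tags:
            return i
    return None

def secondary_relations(chunk_label, tokens):
    """Return unlabeled (dependent, head) pairs inside one chunk."""
    h = head_index(chunk_label, tokens)
    if h is None:
        return []
    head_word = tokens[h][0]
    return [(word, head_word) for i, (word, _) in enumerate(tokens) if i != h]

# Example chunk taken from Table.1: [Akis/QT bADis/JJ palas/NN peTh/PP] NP
np_chunk = [("Akis", "QT"), ("bADis", "JJ"), ("palas", "NN"), ("peTh", "PP")]
relations = secondary_relations("NP", np_chunk)
```

Here the noun palas is picked as the head and the quantifier, the adjective and the postposition are attached to it, without any relation label being stored in the annotation itself.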
Further, it is worth mentioning that a chunk tag is assigned not only to clusters of words formed by dependency relations but also to clusters formed by non-dependency relations. For example, the clusters JJ + N, DM + N and QT + N are clearly formed by dependency relations and have been tagged as NP chunks, whereas the clusters of N and PP, PRP and PP, N and RP, V and RP, etc. are not formed by modifier-modified relations (hence non-dependency) and have been tagged with the chunk labels NP and VGF, respectively. Similarly, predicative adjectives and quantifiers, non-contiguous adverbs, conjuncts and other discontinuous elements, such as the tensed and non-tensed verbal elements, have been assigned chunk tags despite the fact that they do not form a cluster of words through any dependency or non-dependency relation; rather, they are solitary elements and have been treated on a par with clusters of words. This flexibility of treating solitary words on a par with clusters of words is meant to account for the discontinuity and flexibility of surface word order, which is the hallmark of sentences taken from a corpus. Positing the chunk level is therefore important not only for dealing with one set of dependency relations but also for settling most of the surface-level problems and preparing the ground for the next level of annotation. It also brings the notion of chunk closer to the performance structures proposed in Gee and Grosjean (1983) than to the standard notion of phrase, which is visible only in NP chunks.
Just as POS tagging is a prerequisite for chunking, chunking is a prerequisite for deep syntactic parsing and annotation, i.e. for annotating inter-chunk dependency relations. However, in order to chunk POS-annotated data consistently, a chunk tagset and a chunking interface are a must. The chunk tagset used in the current work is described in the next section.
-
Description of the Chunk Tagset
Though parsing by chunking is common practice in IL treebanks, there is as yet no standardized tagset of chunks and other higher dependency relations for ILs comparable to the BIS standards for POS tagging. However, there has been some work on chunking for Indian languages, particularly Hindi, Bangla, Urdu and Telugu (see Bharati et al., 1995; Ray et al., 2003; Singh et al., 2005; Das et al., 2005; Bharati et al., 2006). The current chunk tagset is based on the POS tagging and chunking guidelines used in ILMT (Bharati et al., 2006), but the notion of verb group posited in those guidelines has been rejected in the current work as not applicable to Kashmiri. It has been found that the POS-annotated words of the Kashmiri corpus can be grouped and classified into ten chunk categories (Bhat, 2012 & 2013). These ten chunk categories, along with their chunk tags, are given in Table.1 and described below.
-
Noun Chunk (NP)
Noun Chunk is the name assigned to the cluster of words which a noun forms with its dependents, such as JJ, DM and QT, or even with PP, which is also considered a dependent although it is not a modifier like the others. The notion of noun chunk is similar to that of noun phrase, except that a noun chunk is a single entity and cannot be recursive, i.e. it cannot embed any sub-phrase, e.g. kwr-i hund (of the girl), su boDbaarI bag (that big orchard), Ak-is bAd-is maqaan-as manz (in one big house), su ti (he too), etc.
Further examples of NP are given in Table.1 and Table.3, and the proportion of NP in the Kashmiri corpus is given in Figure.3.b.
-
Auxiliary Chunk (AUXP)
In Kashmiri, as in other Indo-Aryan (IA) languages, the tense, aspectual and lexical information of verbs is distributed over three verb tokens, known as the tense auxiliary, the modal auxiliary and the main verb, respectively. Unlike in those languages, however, these three verb tokens are non-contiguous in Kashmiri, with other elements, particularly the arguments, intervening between them. Thus, Auxiliary Chunk is the name assigned to a solitary tense auxiliary or to a cluster of tense and aspectual auxiliaries, both tagged as VAUX at the POS level, rather than to the cluster of three verb tokens that forms a Verb Group in Urdu and Hindi (see Bharati, 2006). The AUXP tag has been assigned to such a solitary tense auxiliary or cluster of auxiliary tokens occurring away from the main verb, e.g. aasi (will), Os-nI (was not), chi-nI aasaan (do not keep), chi heykaan (can), etc. Further examples of AUXP are given in Table.1 and Table.3, and the proportion of AUXP in the Kashmiri corpus is given in Figure.3.b.
-
Verb Chunk Finite (VGF)
Verb Chunk Finite is the name assigned to solitary tense-less or tensed main verbs, tagged as VM at the POS level, or to the clusters RB-VM, VM-RP or RP-RB-VM. When VM is tense-less, it is either the lexical part of the auxiliary verb occupying the V2 position (or both occupying the V2 and V3 positions) in the sentence, or it is itself a full-fledged verb with both the lexical part and the mood information condensed in a single token, e.g. gatsh (go), khey (eat), chey (drink), etc. When it is tensed, tense is either clearly inflected, e.g. in khe-yi (will eat), che-yi (will drink) and shongi (will sleep), or not inflected at all. In the latter case, tense information can be said to be morphologically unmarked or underspecified but contextually encoded, with aspect providing the most crucial cue. Another possibility is that aspectual information is conflated with tense, with both expressed through a single inflection (a portmanteau morpheme). For instance, there are two perfective forms of finite verbs in Kashmiri, the ‘-mut’ form and the ‘-ov’ form. The ‘-mut’ forms, e.g. khey-mut (eat-prf), chey-mut (drink-prf), shong-mut (sleep-prf), etc., co-occur with a tense auxiliary, which occupies the V2 position in the sentence. One can therefore easily determine whether a ‘-mut’ form is a present or simple past perfective by looking at the V2 position, where its tense information is located. The tense and aspectual information is thus disjunctive in such cases. However, the ‘-ov’ forms, e.g. khey-ov (ate), chey-ov (drank), shong (slept), etc., neither co-occur with a tense auxiliary at the V2 position nor are inflected for tense. Since tense information is underspecified in such forms, it should be difficult to determine whether they are present or past perfectives, but by default native speakers perceive them as past perfectives.
It is thus evident that in the ‘-ov’ forms, ‘-ov’ either carries tense information in addition to aspectual information (hence a portmanteau) or merely provides a cue to tense that is encoded in the context. Whatever the more convincing explanation may be, such forms have been tagged as VGF, e.g. natsaan (dances), khe’ (eat), shong (slept), etc. Further examples of VGF are given in Table.1 and Table.3, and the proportion of VGF in the Kashmiri corpus is given in Figure.3.b.
-
Verb Chunk Non-finite (VGNF)
Verb Chunk Non-finite is the name assigned to solitary participle forms, ‘-vol’ forms and clusters of reduplicated progressive forms, which are essentially de-verbal in nature and function as modifiers of either an event or an entity. Such forms are generally known as non-finite verbs, but non-finite verbs also include gerunds and infinitives. However, as mentioned in Chapter-IV, the task of determining finiteness has been avoided at the POS level because the grammatical information of verbs is distributed over multiple tokens rather than condensed in a single token. The task thus becomes very complex if one goes by the standard definition of finiteness, but it has been found that the notion of ‘de-verbal’ forms simplifies it. This is addressed in more detail in the forthcoming section on issues. It is important to mention that gerunds and infinitives, though de-verbal in nature, do not play a modifying role and are thus not tagged as VGNF, unlike the other de-verbal forms mentioned above, e.g. shong-ith (sleeping), bih-ith (sitting), pakaan pakaan (while walking), etc. Further examples of VGNF are given in Table.1 and Table.3, and the proportion of VGNF in the Kashmiri corpus is given in Figure.3.b.
-
Verb Chunk Gerund (VGNN)
Verb Chunk Gerund is the name assigned to those de-verbal forms which function as nominals. These include solitary direct gerundial forms and clusters of oblique gerundial forms with postpositions. Infinitives have also been tagged as VGNN, as they form arguments of finite verbs like gerunds. As mentioned above, such forms are distinguished from the other de-verbal forms only in terms of their function; otherwise, the constituent verbs of both VGNF and VGNN are devoid of any verbal feature except argument structure, which remains intact even when they play non-verbal roles in the sentence, e.g. shong-nI sI:t’ (because of sleeping), nats-un (dancing), asn-an (laugh-ERG), etc. Further examples of VGNN are given in Table.1 and Table.3, and the proportion of VGNN in the Kashmiri corpus is given in Figure.3.b.
-
Conjunct Chunk (CCP)
Conjunct Chunk is the name assigned to conjunctions, both coordinating and subordinating, which have been tagged as CCD and CCS, respectively, at the POS level. Most of the sentences in the corpus are compound, complex or compound-complex in nature; conjunctions play a key structural role in them, and their frequency in the corpus is therefore high. Since conjunctions neither enter into a modifier-modified relation nor bear a part-whole relation with any other POS category, they cannot, unlike postpositions, be part of any other chunk. They are therefore solitary and are projected as separate chunks, e.g. tI (and), ya (or), kinI (or), zi (that). Further examples of CCP are given in Table.1 and Table.3, and the proportion of CCP in the Kashmiri corpus is given in Figure.3.b.
-
Adjectival Chunk (JJP)
The name Adjectival Chunk has been given to solitary adjectives and quantifiers, or to adjectival or quantifier clusters such as RP-JJ and RP-QT, which cannot be part of any noun chunk. It is worth mentioning here that although all adjectives have been tagged as JJ and all quantifiers as QT at the POS level, not all adjectives and quantifiers can be raised to the chunk level as JJP. Adjectives and quantifiers occur either in attributive position, as part of an NP, or in predicative position, as solitary elements or clusters. It is only in predicative position that adjectives and quantifiers have the status of head; since they do not constitute what are popularly known as discontinuous phrases, they can easily be posited as adjectival chunks and have been tagged as JJP, e.g. su chu rut (he is nice). Further examples of JJP are given in Table.1 and Table.3, and the proportion of JJP in the Kashmiri corpus is given in Figure.3.b.
-
Adverbial Chunk (RBP)
The name Adverbial Chunk has been assigned to solitary adverbs or adverbial clusters (RB-RP) which cannot be part of any verb chunk. It must be noted that although all adverbs are tagged as RB at the POS level, not all can be raised to the chunk level and tagged as RBP: sometimes they are adjacent to their head and can be part of VGF, but mostly they occur non-contiguously with their head and are tagged as RBP, e.g. su os vaarI vaarI garI kun pakaan (he was moving towards home slowly). Further examples of RBP are given in Table.1 and Table.3, and the proportion of RBP in the Kashmiri corpus is given in Figure.3.b.
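The contrast between adverbs that join a VGF and those that are raised to RBP can be illustrated with a rough sketch. It is not the procedure followed in this work, where chunking was done manually; the adjacency rule, the function name and the POS tags given to su (PRP), garI (NN) and kun (PP) in the example sentence are all assumptions made for illustration.

```python
def classify_adverb_runs(tagged):
    """Find runs of RB (optionally followed by RP) and say whether each run is
    immediately followed by a main verb (VM) or stands apart as an RBP."""
    results = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] != "RB":
            i += 1
            continue
        j = i
        while j < len(tagged) and tagged[j][1] in ("RB", "RP"):
            j += 1
        words = [word for word, _ in tagged[i:j]]
        followed_by_verb = j < len(tagged) and tagged[j][1] == "VM"
        results.append((words, "part of VGF" if followed_by_verb else "RBP"))
        i = j
    return results

# Example sentence from the text, with assumed POS tags.
sentence = [("su", "PRP"), ("os", "VAUX"), ("vaarI", "RB"), ("vaarI", "RB"),
            ("garI", "NN"), ("kun", "PP"), ("pakaan", "VM")]
non_contiguous = classify_adverb_runs(sentence)

# Cluster from Table.1, where the adverb is adjacent to the main verb.
adjacent = classify_adverb_runs([("vaarI", "RB"), ("parihaa", "VM")])
```

In the first case the reduplicated adverb is separated from pakaan by the noun chunk and is therefore treated as RBP; in the second it sits next to the verb and stays inside the VGF.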
-
Negation Chunk (NEGP)
The name Negation Chunk has been given to those negative particles that occur as solitary elements without an obvious head and hence may be treated as heads projected as chunks, e.g. na su yiyi-nI vaapas (no, he won’t come back). Further examples of NEGP are given in Table.1, and the proportion of NEGP in the Kashmiri corpus is given in Figure.3.b.
-
Other Chunk (BLK)
The name Other Chunk is reserved for all those solitary POS-tagged words or clusters of such words which do not fit into the aforementioned chunks. It is in effect a bag for all elements that do not conform to the chunking scheme, either because they are unrelated to the sentence structure, e.g. serial numbers, or because they belong to the discourse level, connecting one sentence with another, e.g. khA:r tAm’ vAn’-nI zahn ti titsh kath (however, he never said anything like that). Further examples of BLK are given in Table.1, and the proportion of BLK in the Kashmiri corpus is given in Figure.3.b.
S. No | Chunk Name | Tag | Examples
I | Noun Chunk | NP | [su/DM badI/RP rut/JJ shakhIts/NN] NP (that very big man), [farooq/NNP ti/RP] NP (farooq also), [farooq/NNP nI/RP] NP (not farooq), [Akis/QT bADis/JJ palas/NN peTh/PP] NP (on one big rock)
II | Auxiliary Chunk | AUXP | [chu/VAUX] AUXP (is), [chu/VAUX aasaan/VAUX] AUXP (keeps), [Os/VAUX] AUXP (was), [Os/VAUX rOzaan/VAUX] AUXP (used to), [aav/VAUX] AUXP (was), [yiyi/VAUX] AUXP (will be)
III | Verb Chunk Finite | VGF | [kheyovum/VM] VGF (ate + 1st person clitic), [vaarI/RB parihaa/VM] VGF (may should have read nicely), [dav/VM haz/RP] VGF (run + honorific), [variyaa/RP zorI/RB pakh/VM] VGF (walk very fast), [yiyi-nI/VM kehn/RP] VGF (will not come + emphasis), *[chu/VAUX vonmut/VM] VGF (has said)
IV | Verb Chunk Non-Finite | VGNF | [kheyth/VM] VGNF (after eating), [kheyth/VM cheyth/VM] VGNF (after eating drinking), [pakaan/VM pakaan/VM] VGNF (during walking), [kheynIvol/VM] VGNF (eater)
V | Verb Chunk Gerund | VGNN | [paknas/VM peyTh/PP] VGNN (for walking), [kheynI/VM sI:t’/PP] VGNN (with eating), [natsnI/VM kin’/PP] VGNN (due to dancing), [khenIch/VM] VGNN (of eating), [kheyon/VM] VGNN (eating/ to eat), [kheynIvol/VM] VGNN (one who eats)
VI | Conjunct Chunk | CCP | [tI/CCD] CCP (and), [yaa/CCD] CCP (or), [kinI/CCD] CCP (or), [natI/CCD] CCP (or), [ki/CCS] CCP (that), [zi/CCS] CCP (that), [yodvai/CCS] CCP (if), [agar/CCS] CCP (if), [magar/CCD] CCP (but)
VII | Adjectival Chunk | JJP | [variyaa/INT rut/JJ] JJP (very good), [pantsah/QC kiluu/NN] JJP (fifty kilo), [pandhA:yim/QO] JJP (fifteenth), [zyuuTh/JJ] JJP (tall or lengthy)
VIII | Adverbial Chunk | RBP | [teyz/RB teyz/RB] RBP (quickly or fast), [zorI/RB ti/RP] RBP (loudly), [lot/RB] RBP (slowly), [ti/RP kyaazi/RB] RBP (because), [chunki/CCS] RBP (because), [tawai/RB] RBP (because of that), [teli/RB] RBP (then)
IX | Negation Chunk | NEGP | [na/RP] NEGP (no), [na/RP saa/RP na/RP] NEGP (no + honorific not)
X | Other | BLK | [khA:r/RP] BLK (however), [teli/RP] BLK (so)
Table.1. Kashmiri Chunk Tagset
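The bracketed notation used in the examples of Table.1, i.e. [word/POS ...] followed by a chunk tag, can be read into a simple data structure. The sketch below is an illustration only: the regular expression and the function name are hypothetical, it is unrelated to the SSF format or the Sanchay interface, and it assumes that chunk tags consist of capital letters and that the last slash in a token separates the word from its POS tag.

```python
import re

# Assumption: a chunk is written as "[tok/POS tok/POS ...] LABEL", with LABEL in capitals.
CHUNK_RE = re.compile(r"\[([^\]]+)\]\s*([A-Z]+)")

def parse_chunks(text):
    """Return a list of (chunk_label, [(word, pos), ...]) for each bracketed chunk."""
    chunks = []
    for body, label in CHUNK_RE.findall(text):
        # Split each token on its last slash into (word, POS tag).
        tokens = [tuple(item.rsplit("/", 1)) for item in body.split()]
        chunks.append((label, tokens))
    return chunks

example = "[su/DM badI/RP rut/JJ shakhIts/NN] NP (that very big man), [chu/VAUX] AUXP (is)"
parsed = parse_chunks(example)
```

The English glosses in parentheses are ignored because they are not preceded by a bracketed group, so only the chunk label and its tagged tokens are kept.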