
Summary

In this chapter, the most important annotation layer of the Kashmiri dependency treebank, i.e. syntactic parsing and annotation, has been discussed. First, the notion of parsing was introduced, as it forms the key syntactic operation that produces dependency trees out of input sentences carrying some degree of previously annotated grammatical information. Since the Paninian grammatical sketch for Sanskrit has been adopted for sentence parsing in IL treebanks, the basic tenets of Paninian Computational Grammar (PCG) were introduced in order to show what kind of syntactic parsing is involved in developing a dependency treebank for Kashmiri. As already mentioned, PCG is a reinterpretation of Paninian grammar by one of the leading NLP groups in India, known as Bharati. PCG appears to be a blend of ideas current in the dependency tradition throughout the world. It is not purely Paninian, as the name might suggest; some key notions like Noun Group and Verb Group also appear in Abney (1996). Moreover, the reinterpretation has been made in terms of modern grammatical notions, either by complete equivalence or by mere approximation, e.g. karta is interpreted as roughly equivalent to agent or SUB. It is thus through the popular notions of agent or SUB that annotators equipped with modern linguistic jargon interpret terms like karta. It is for this reason that PCG sometimes appears to be merely a matter of ancient labels, which, however, is not the case. The fundamental ideas at the heart of PCG have been taken from the ancient Sanskritic tradition, which is more semantics-oriented. It is essentially a syntactico-semantic model which incorporates more semantics than syntax, and this is why it lacks the popular notions like SUB, OBJ, argument, adjunct, etc. It must be noted, however, that the elements corresponding to such notions can easily be extracted from the treebank, since the semanto-syntactic attachment labels can be classified in terms of the popular notions of syntax, e.g. a k1 attachment always attaches an argument, a SUB.
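
The extraction of such popular notions from karaka-style attachment labels can be pictured with a small sketch. The mapping below is only an illustrative subset and the English relation names are assumptions chosen for the example, not the treebank's full label scheme:

```python
# A minimal sketch (not part of the treebank toolchain) showing how karaka-style
# attachment labels could be mapped onto popular notions of syntax.
# The label subset and the target names below are illustrative assumptions.

KARAKA_TO_GR = {
    "k1": "SUB",      # karta      ~ agent / subject
    "k2": "OBJ",      # karma      ~ patient / object
    "k4": "IOBJ",     # sampradana ~ recipient / indirect object
    "k7": "ADJUNCT",  # adhikarana ~ locus, treated here as an adjunct
}

def extract_grs(dependencies):
    """dependencies: list of (dependent_chunk, head_chunk, label) triples."""
    for dep, head, label in dependencies:
        yield dep, head, KARAKA_TO_GR.get(label, "OTHER")

# Example: a k1 attachment is recovered as a SUB relation.
sample = [("NP1", "VGF1", "k1")]
print(list(extract_grs(sample)))   # [('NP1', 'VGF1', 'SUB')]
```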

Next, the inventory of grammatical relations that need to be annotated has been given, with the original Paninian terms, their interpretation in modern terms, their labels and their variant labels. The description of each GR mentioned in the inventory has been given with a variety of examples, in such a manner that the description also serves as annotation guidelines. Then, the procedure for annotating the various inter-chunk GRs has been given with graphic illustrations, so that it becomes obvious how all the dependency structures have been produced using the Sanchay SA Interface. The various annotation issues encountered while annotating the Kashmiri corpus were also discussed and illustrated, particularly the V2 phenomenon, which brings Kashmiri closer to the Germanic languages; it is this V2 factor in particular that distinguishes Kashmiri dependency structures from the dependency structures of other ILs.

Finally, the notion of inter-annotator agreement has been introduced, and an experiment for measuring inter-annotator agreement vis-à-vis consistency in the treebank annotation has been presented. The confusion matrix shows the disagreements or conflicting labels, and the rest of the tables in this section show that the inter-annotator agreement is substantial as per the interpretation matrix of Landis and Koch (1977). The observed agreement was found to be 0.777 and the kappa value 0.7555. In short, the experiment conducted to check inter-annotator agreement has shown that the annotators agree quite considerably on labels as well as on attachments, which means that both have a similar understanding of the issues and the guidelines. It also indicates that there will be quite considerable consistency in the syntactic annotations of the current treebank.
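
As a rough illustration of how such figures are obtained, the sketch below computes observed agreement and Cohen's kappa from a toy label confusion matrix between two annotators. The matrix values are invented for the example and are not the actual treebank counts:

```python
# Observed agreement and Cohen's kappa from an annotator-vs-annotator
# confusion matrix; the 2x2 matrix here is purely illustrative.

def cohens_kappa(matrix):
    total = sum(sum(row) for row in matrix)
    # proportion of items on which the two annotators chose the same label
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # agreement expected by chance from the annotators' marginal distributions
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return observed, (observed - expected) / (1 - expected)

# rows: annotator A's labels, columns: annotator B's labels
confusion = [
    [80, 10],
    [ 8, 62],
]
po, kappa = cohens_kappa(confusion)
print(f"observed agreement = {po:.3f}, kappa = {kappa:.3f}")
```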

Chapter 7. Conclusion

“Computers are incredibly fast, accurate and stupid!!!

Human beings are incredibly slow, inaccurate and brilliant!!!

Together they are powerful beyond imagination.”

Albert Einstein

Corpus-based investigation of natural languages has become a hallmark of contemporary linguistic research. It not only presents an alternative to the popular introspection-based investigations, particularly of natural language syntax, but also gives research a more interdisciplinary and applied orientation. The contemporary age is considered an age of information, in which knowledge creation, dissemination and acquisition are no longer restricted to traditional means and privileged persons: with the advent of the computer, the internet, the world wide web and social media, under the force of globalization, even underdogs can produce, share and disseminate knowledge. Since the vehicles of knowledge, whether at the representational (cognitive) level or at the transactional (communicative) level, are concrete natural languages rather than an ideal natural language (which provides space for notions like universalism and lingua franca and undermines the creative potential of individual languages vis-à-vis their communities), the need for online representation of all languages has been felt acutely in the last few decades. This thirst on the part of speech communities cannot be quenched through introspection-based research alone; it needs a boom of empirical research on natural language so that probabilistic methods can be harnessed for technological purposes. Further, the need for human-machine interaction through natural languages has also increased considerably, and it is met by an increase in resource creation (both linguistic and computational) and by interdisciplinary research on natural language, which has resulted in the inception of entirely new fields of inquiry such as computational linguistics (CL), natural language processing (NLP) and language technology (LT). The current research is an effort of this kind: to create a small-scale syntactically annotated corpus, i.e. a treebank, for Kashmiri, and to lay down the basic methodology for creating a large-scale syntactically annotated corpus which can be used for training various NLP algorithms such as syntactic parsers.

However, for creating a syntactically annotated corpus, the grammar formalism is of paramount importance, and one is astonished to see the wide range of competing grammatical models/formalisms. It becomes very difficult to prefer one model over another, as all models claim the flexibility and universality to cater to a wide range of language data. The fact is that the choice of framework vis-à-vis grammar formalism is, in itself, an interesting and less explored research area of experimental syntax which is beyond the scope of this dissertation. Nevertheless, dependency-based representations have been considered more suitable for inflectionally rich, relatively free-word-order languages, i.e. less configurational or positional languages such as Czech, Turkish, Hindi, Urdu and Kashmiri. Since dependency relations are essentially syntactico-semantic in nature and directly encode the predicate-argument structure, i.e. the participatory relations of the various arguments or adjuncts, it has been argued that dependency-based representations are more suitable for annotated resource creation, not only because they cater to free word order but also because they are considered more suitable for a number of NLP applications (Covington 1995; Culotta & Sorensen 2004; Reichartz et al. 2009). Further, most languages of the world are inflectionally rich, relatively free-word-order languages (Covington 1995), and thus, in most treebanks, there is a tendency to take into account morpho-syntactic cues, i.e. obliqueness, overt case marking or relational words (pre/postpositions), during sentence parsing. This also provides clear semantic information, crucial for various NLP applications like MT. Since Kashmiri is an inflectionally rich language, there are also clear-cut morpho-syntactic cues associated with NPs or VGNNs, i.e. with the entities which are the participants of an action or event, and these are crucial in sentence parsing. It has been observed that even though there is no hundred percent one-to-one correspondence between the case relations and the case markers or pre/postpositions which mark the dependents, such morpho-syntactic cues, along with TAM features, are nevertheless very helpful for syntactic parsing in relatively free-word-order languages. On the other hand, constituency-based representations have been considered better for fixed-word-order languages, where there are few morpho-syntactic cues and the positions of the constituents dictate their grammatical relations, e.g. English and French.

Finally, it is very important to recognize the advantages and disadvantages of both frameworks while applying them to particular language data. It is equally important to approach the problem of framework selection on the basis of certain research questions: what essentially are these formalisms able to capture? Are they complementary to each other? How can they be helpful in developing treebanks and, subsequently, robust syntactic parsers? However, the choice of framework or formalism in this research is determined not so much by theoretical motivations and other technicalities related to the formalism as by the unavailability of the required resources in the Indian scenario. For instance, if one wishes to use the annotation scheme of the Prague or TIGER treebanks, one needs to be in constant touch with the people actually working in that area in order to obtain resources and expert opinions. So it is quite impractical to use a grammar formalism for which resources are unavailable or not easily accessible. Further, there is no need to reinvent the wheel, as other representations can be added later, and algorithms are now available to convert a dependency treebank into a phrase structure one. Therefore, on practical considerations as well as on the basis of the principles for treebanking given in chapter two, the model of the AnnCorra treebanks, i.e. those of Hindi, Urdu, Telugu and Bengali, has been followed.

Apart from the grammatical model, the most important requirement for developing the Kashmiri treebank was the primary source data, i.e. a Kashmiri text corpus. The major bottleneck in obtaining the corpus was the unavailability of any online resource, such as a newspaper, from which data could have been drawn. There is a complete vacuum of commercially important text domains (like medical and tourism) in Kashmiri. Therefore, KashCorpus was built for developing KashTreeBank. The selected sets of the corpus were manually pre-processed, i.e. sanitized, normalized and finally tokenized.
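
The pre-processing itself was carried out manually, but a comparable sanitize, normalize and tokenize pipeline can be sketched as follows. The regular expressions and the punctuation set are assumptions for illustration, not the procedure actually used for KashCorpus:

```python
# Illustrative pre-processing pipeline for Perso-Arabic Kashmiri text.
import re
import unicodedata

def sanitize(text):
    # drop control characters and collapse whitespace picked up from web/PDF sources
    text = re.sub(r"[\x00-\x1f]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def normalize(text):
    # canonical Unicode composition so visually identical characters compare equal
    return unicodedata.normalize("NFC", text)

def tokenize(text):
    # whitespace tokenization, splitting off common Perso-Arabic punctuation marks
    return re.findall(r"[^\s۔،؟!؛:()\"]+|[۔،؟!؛:()\"]", text)

raw = "اِٹ   اِز  فاین ۔"   # sample tokens taken from Appendix-I (RD__UNK examples)
print(tokenize(normalize(sanitize(raw))))
```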

The selected sets of the corpus were converted into the Shakti Standard Format (SSF) with the help of the Sanchay platform. Its SA Interface was used to build three annotation layers, i.e. the parts-of-speech layer, the chunk layer and the inter-chunk dependency layer. What is important is not merely adding annotation layers to the corpus but the arrangement in which each lower layer of annotation facilitates the higher one. This arrangement is provided by the SSF, which is also important for the machine readability of the dependency trees created during the annotation process. The fundamental annotation layer added to KashCorpus was the POS layer. Each word in a sentence was assigned a POS tag according to the BIS tagset for Kashmiri, which is a coarse-grained hierarchical tagset consisting of 11 top-level categories and 32 type-level tags. In the process, a total of 14852 words were classified into 11 POS categories with the frequency order N > V > PP > RD > JJ > PR > CC > RP > DM > QT > RB. During the annotation process, several issues were raised and resolved, and annotation guidelines were laid down to achieve consistency and intra-annotator agreement. Finally, the frequency of each category was calculated and cumulative frequencies were obtained.
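
As a rough illustration of the frequency calculation, the sketch below collapses each word's BIS tag to its top-level category and counts the categories. The simplified three-column layout (address, token, tag) stands in for the actual SSF, which is richer, and the tiny sample is invented from tags in Appendix-I:

```python
# Count top-level POS categories from a simplified SSF-like column layout.
from collections import Counter

ssf_lines = """\
1\tku:r\tN__NN
2\tchi\tV__VAUX
3\tpaka:n\tV__VM
4\t۔\tRD__PUNC
"""

counts = Counter()
for line in ssf_lines.splitlines():
    addr, token, tag = line.split("\t")
    counts[tag.split("__")[0]] += 1      # N__NN -> N, V__VM -> V, ...

for category, freq in counts.most_common():
    print(category, freq)                # e.g. V 2, N 1, RD 1
```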

In the second layer of annotation, the same interface was used to add chunk-level information to the same sets of the POS-annotated KashCorpus. Earlier there were four sets of corpus from three domains, of which the two sets belonging to the same domain were combined. Therefore, three sets of POS-annotated corpus were chunked on the basis of local dependencies and discontinuities. A cluster of POS-tagged words which were contiguous and stood in a dependency or part-whole relation with each other was assigned a single chunk label. Not only groups of words were assigned chunk labels but also some solitary or discontinuous elements which defy the intuitive notion of a chunk. The V2 phenomenon, which constitutes 5.009% of the grammatical phenomena at the chunk level in the Kashmiri data, was also handled by positing the AUXP chunk. The most crucial issues, related to finiteness and complex predication, were resolved, and annotation guidelines were laid down for consistency in chunking. All 14852 POS-annotated words were chunked and classified into 10 chunk types. The increasing frequency order of the chunks is NEGP < BLK < RBP < VGNN < VGNF < JJP < CCP < VGF < NP.
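
Although the chunking was done manually through the SA Interface, the underlying idea of grouping contiguous, locally dependent POS-tagged words under a single chunk label can be sketched as follows. The two grouping rules are illustrative assumptions, not the KashTreeBank chunking guidelines:

```python
# Toy grouping of contiguous POS-tagged words into chunks; the tag sets and the
# nominal/verbal rules below are illustrative only.

NOMINAL = {"N__NN", "N__NNP", "N__NST", "PR__PRP", "DM__DMD", "QT__QTC", "JJ"}
VERBAL = {"V__VM", "V__VAUX"}

def chunk(tagged):
    """tagged: list of (token, tag); returns list of (chunk_label, tokens)."""
    chunks, current, label = [], [], None
    for token, tag in tagged:
        kind = "NP" if tag in NOMINAL else "VGF" if tag in VERBAL else "BLK"
        if kind != label and current:
            chunks.append((label, current))
            current = []
        label = kind
        current.append(token)
    if current:
        chunks.append((label, current))
    return chunks

print(chunk([("su", "PR__PRP"), ("chu", "V__VAUX"), ("paka:n", "V__VM")]))
# [('NP', ['su']), ('VGF', ['chu', 'paka:n'])]
```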

Finally, the third layer of linguistic information was added to the three datasets. The 682 POS-annotated sentences of varying lengths from three domains, i.e. newspaper editorials (ASL = 25 words or 15 chunks), short stories (ASL = 11 words or 8 chunks) and critical discourse (ASL = 16 words or 10 chunks), were partially parsed into 8125 chunks. The inter-chunk GRs holding among the 8125 chunks were then annotated. In aggregate, 4287 GRs were found across the 682 dependency structures, and these were further classified under 25 labels. The inter-annotator agreement was also measured for the syntactic annotations and was found to be substantial as per the interpretation matrix of Landis and Koch (1977). The observed agreement was 0.777 and the kappa value 0.7555. In short, the experiment conducted to check inter-annotator agreement has shown that the annotators agree quite considerably on labels as well as on attachments, which indicates that both annotators have a similar understanding of the issues and the guidelines.



Appendix-I: BIS POS Tagset for Kashmiri

(Each entry gives the serial number, the category name at the top level and subtype level 1, the annotation convention, and examples.)

1. Noun (N)
   1.1 Common (N__NN): gu:r (milkman), kul (tree), ku:r (girl)
   1.2 Proper (N__NNP): gulI marg (Gulmarg), Pahal gha:m (Pahalgam), huzaif (Huzaif)
   1.3 Nloc (N__NST): heri (in the upper storey), bonI (in the lower storey)
2. Pronoun (PR)
   2.1 Pronominal (PR__PRP): su (he-NOM), bI (I-NOM), tse (you-ERG), hom-is (he-DAT), yi (this), ti (that)
   2.2 Reflexive (PR__PRF): panun (self's-MASC), panIn' (self's-FEM)
   2.3 Relative (PR__PRL): yus (who-SG), yim (who-PL)
   2.4 Reciprocal (PR__PRC): akh Akis (to one another), pa:nIvan' (amongst each other)
   2.5 WH (PR__PRQ): kus (who-SG), kIm (who-PL)
   2.6 Indefinite (PR__PID): kenh (something), kanh (someone)
3. Demonstrative (DM)
   Deictic (DM__DMD): hu (he), so` (she), hum (those), yi (this)
   Relative (DM__DMR): yus (who-SG), yim (who-PL), yAmy` (who-ERG), yimav (who-PL-ERG)
   WH (DM__DMQ): kus (who), kIm (who)
   Indefinite (DM__DMI): kenh (something), kanh (someone)
4. Verb (V)
   4.1 Main (V__VM): paka:n (walks/walking), thovmut (kept), pari (will read), gindun (to play), tulun (to lift), tsalun/davun (to run), gatshith (having gone), gindnuk (of playing)
   4.2 Auxiliary (V__VAUX): chi/chu (is), Os/A:s (was), a:si (will)
5. Adjective (JJ): tshoT (dwarf), z'u:Th (tall), zabar (nice), asIl (good)
6. Adverb (RB): jaldi: (quickly), va:rI va:rI (slowly)
7. Postposition (PP): peTh (on), manz (in), tal (under), nish (near)
8. Conjunction (CC)
   8.1 Co-ordinator (CC__CCD): tI (and), ya:/natI (or), magar (but)
   8.2 Subordinator (CC__CCS): zi/ki (that), agar (if), zanti (as if), teli/adI (then)
9. Particles (RP)
   9.1 Default (RP__RPD): ti (too), sirif/mAhaz (only), hish/hiuv (like)
   9.3 Interjection (RP__INJ): alie!, Oho!
   9.4 Intensifier (RP__INTF): seTha: (very), va:riyah (very)
   9.5 Negation (RP__NEG): na (no), ma (don't), kehn (not)
10. Quantifiers (QT)
   10.1 General (QT__QTF): kam (little), zya:dI (more), kehn (some)
   10.2 Cardinals (QT__QTC): akh (one), zI (two), tsor hath (four hundred)
   10.3 Ordinals (QT__QTO): Akim (first), doyim (second)
11. Residuals (RD)
   11.1 Foreign word (RD__RDF): It is fine
   11.2 Symbol (RD__SYM): ؒ ، ؀ ، ؓ ، ء ، ؂
   11.3 Punctuation (RD__PUNC): ۔ ، ؟ ! ( ) " ؛ :
   11.4 Unknown (RD__UNK): ڑگُ ، فاین ، اِز ، اِٹ
   11.5 Echo words (RD__ECH): tre:lI ve:lI (apple and the like), cha:y va:y (tea and the like), batI vatI (rice and the like), ma:z va:z (meat and the like)

