‘Where shall I begin, please your Majesty?’ he asked.
‘Begin at the beginning,’ the King said gravely, ‘and go on till you come to the end: then stop.’
It is well established that developing NLP tools and applications requires a substantial amount of linguistic knowledge, which can take the form of either a computational grammar1 or a syntactically annotated, machine-readable corpus known as a treebank.2 This research is an effort to create a treebank for the Kashmiri language [KashTreeBank].3 It investigates the theoretical as well as the practical issues involved in creating a small-scale dependency treebank of Kashmiri, using a simple grammar formalism for syntactic parsing and annotation.
Treebank creation is a Promethean task that requires different types of resources and substantial funding, both for developing or acquiring the corpus and tools and for labor-intensive annotation, expert opinion, and validation. The current research is an initiative to build language resources for Kashmiri so that a baseline syntactic parser of Kashmiri can be developed. Its findings can serve as a basis for carrying out treebanking for Kashmiri on a large scale.
The next section discusses the motivations behind the current research and highlights its social relevance. Section three gives a brief introduction to the Kashmiri language. Section four, on the research problem, introduces the spectrum of issues associated with treebanking in general and with the development of KashTreeBank in particular. Section five, on theoretical preliminaries, elaborates the theoretical framework used in this work.
A treebank is a rich language resource for research on grammar development and grammar engineering. Grammar engineering is the practice of building elaborate linguistic models on computers. It has been used for practical purposes for many years, for instance in developing grammar checkers such as the Microsoft grammar checker and Boeing’s Simplified English grammar checker, but contemporary grammar engineering also involves the extraction and induction of probabilistic grammars. Besides, if a treebank is created as a reference work rather than as an application-oriented repository, it can serve multiple functions in various subfields of linguistics as well as in language technology. Theoretical linguists can search a treebank for illustrations of the syntactic phenomena under investigation, whereas psycholinguists can use it to find the relative frequencies of the possible PP attachments or relative clauses (Abeille, 2003). Similarly, formal and computational linguists can evaluate the correctness and coverage of grammars and lexicons against the analyses stored in a treebank and, at a more general level, assess the adequacy of linguistic theories and formalisms.
Further, treebanking is not a goal in itself; treebank-driven parsers are an important component of artificial intelligence (AI) systems such as machine translation (MT) systems, question-answering systems and grammar checkers. A treebank is therefore a valuable resource not only for Computational Linguistics (CL) and Natural Language Processing (NLP) tasks, such as automatic syntactic parsing, grammar-induction4 and grammar-extraction,5 but also for non-technological academic research such as experimental syntax. The evaluation of NLP systems or their components is yet another field that is currently very active, and treebanks are much in demand for testing and optimizing syntactic parsers. Treebanks can also be used for pedagogical purposes, both in teaching language and in teaching linguistic theory. For example, the Visual Interactive Syntax Learning (VISL) project, established at the University of Southern Denmark, has developed teaching treebanks for twenty-two languages, with a number of different teaching tools, including interactive games such as Syntris6. Treebanks are also being used for empirical linguistic research in theoretical syntax and historical linguistics. For instance, the creation of historical treebanks for Middle English (Kroch & Taylor, 2000), Old English (Taylor et al., 2003), Early New High German (Demske et al., 2004), etc. has revolutionized historical linguistics and comparative philology over the last decade or two. Given the capacity of treebanks to hold a vast amount of empirical grammatical knowledge, and given their commercial utility as language data in research and development, there is a pressing need to develop large-scale treebanks for all resource-poor languages, and Kashmiri is one of them.
Kashmiri, locally known as “Koshur,” is one of the 22 scheduled languages of the Indian Union, as per the 8th Schedule of its Constitution. It is mainly spoken in the greater region called “Kashmir”, which includes the State of Jammu and Kashmir (JK) and Pakistan-administered Kashmir. JK is located at a strategically important geographical point, bordered by Tibet in the east, China in the north, Pakistan in the west and south-west, and the rest of India in the south-east (Hussein, 1987). There are approximately six million Kashmiri speakers scattered across India, Pakistan, the UK, the USA and the Gulf countries (Ethnologue 2006). Kashmiri is a Dardic language. It was once considered genealogically distinct from the Indo-Aryan and Indo-Iranian languages (Grierson, 1915), but it was later classified under the Dardic group within the Indo-Aryan language family (Morgenstierne, 1961). It is closely related to Shina and some other languages of the North-West frontier (Koul 2006).
It is a highly inflectional language with pronominal clitics and, like the Germanic languages, a predominant V2 phenomenon. Kashmiri is the only Dardic language which has a written tradition. It is mainly written in a modified Perso-Arabic script and, like Urdu, runs from right to left. Although Perso-Arabic is the officially approved script, Kashmiri is also written in Devanagari, and the Sharda and Roman scripts have been used for it from time to time. The modified Perso-Arabic script uses an additional set of distinguishing diacritics and letters to represent a system of central vowels and secondary articulations, e.g. palatalization in token-initial, medial and final positions. The script is therefore fully capable of representing all the sounds of Kashmiri. It has two writing styles, Naskh and Nastaliq. Kashmiri is mainly written in the Nastaliq style, either by hand by calligraphers (kA:tib) or with a word processor, e.g. Inpage-Urdu. It can also be typed directly into Microsoft Word, where it is displayed in the Naskh style, like Arabic, because the available Unicode fonts exist only in that style. It is worth mentioning that readers of Kashmiri are not normally used to this style and find it difficult to read.
The Research Problem
Kashmiri is a highly inflectional language with relatively variable word order, extensive pronominal cliticisation and a predominant V2 phenomenon. As far as computational resources are concerned, it is a resource-poor language, lagging far behind other Indian languages such as Hindi, Urdu, Punjabi, Bengali, Telugu and Tamil. Developing a treebank requires several kinds of resources, such as annotation guidelines that state the conventions and guide the annotators throughout their work, and a software tool to aid the annotation. Since constructing syntactic trees manually is a very slow and error-prone process, semi-automatic annotation can be opted for, but semi-automatic treebank annotation needs a whole battery of NLP modules: a tokenizer, a POS tagger, a morphological analyzer, a chunker (shallow parser) and a syntactic parser.
The development of KashTreeBank involves many challenges, ranging from preliminary decisions about the framework and the associated formalism to the actual syntactic annotations and their representation in a particular format. This multi-dimensional problem of “Creating KashTreeBank” is therefore best addressed by breaking it down into the smaller problems related to its design and development: the choice of corpus, the selection of a framework and its associated grammar formalism, the choice of annotation scheme, the nature of the annotation process, the representation of the treebank and the choice of annotation tool.
Choice of Corpus
A treebank cannot be created in a vacuum. One needs primary source data (machine-readable text) to work on and to annotate with the required linguistic information. Either corpus resources already created under various projects can be used, or new resources can be created for this purpose. In both cases the choice should be a principled one. The principles that determine the choice of corpus for treebanking are as follows:
The corpus should be freely available for research and development under an easy licensing policy.
The licensing policy for the distribution of the corpus should not undermine the developer's rights over the treebank.
The corpus should have been developed following certain encoding standards, preferably Unicode for character encoding and XML for text encoding.
The corpus should be sanitized and normalized, i.e. free of typographical errors, tokenization problems and missing (crucial) diacritics.
The corpus should be balanced, i.e. with samples from all existing domains.
The corpus should represent almost all construction types of the language, so that the treebank has wide coverage and yields a robust annotation scheme and parsing model.
The corpus should preferably be annotated with morph, POS and chunk information so that one can start parsing sentences directly.
A sufficient quantity of corpus should be available (at least 1500-2000 sentences). A smaller quantity may suffice for developing the annotation scheme and guidelines, but it will not be sufficient to train a baseline parser.
It is not obligatory to follow all the above criteria strictly. They can vary from language to language, but one certainly has to think along these lines when acquiring or developing a corpus to create a treebank. It is important to note that, in the initial stage of the research, a corpus of shorter sentences (with considerable complexity) is needed to lay down a basic annotation scheme.
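Two of the criteria above, the minimum corpus size and the preference for shorter sentences in the initial stage, lend themselves to a simple mechanical check. The following is only a minimal sketch under stated assumptions: the function name `screen_corpus`, the whitespace tokenization and the cut-off of 12 tokens for a "shorter" sentence are hypothetical and are not taken from this work; only the lower bound of 1500 sentences comes from the criteria listed above.

```python
import unicodedata

MIN_SENTENCES = 1500     # lower bound quoted above for training a baseline parser
MAX_TOKENS_INITIAL = 12  # hypothetical cut-off for "shorter sentences" (an assumption)


def screen_corpus(sentences):
    """Report whether a list of raw sentences meets the size criterion."""
    # Normalize to Unicode NFC so that visually identical strings compare equal.
    cleaned = [unicodedata.normalize("NFC", s).strip() for s in sentences]
    cleaned = [s for s in cleaned if s]  # drop empty lines
    # Naive whitespace tokenization, adequate only for a rough length screen.
    short = [s for s in cleaned if len(s.split()) <= MAX_TOKENS_INITIAL]
    return {
        "total": len(cleaned),
        "enough_for_parser": len(cleaned) >= MIN_SENTENCES,
        "short_for_initial_stage": short,
    }
```

Such a screen says nothing about balance, coverage or licensing, which still have to be judged by hand.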
Treebanks and Linguistic Theory
The choice of a suitable framework and an implementable formalism is of paramount importance in any treebanking endeavor, as it determines the nature of all the data (trees) in the treebank and consequently the value and utility of the entire treebank. Since a number of grammatical frameworks and formalisms exist, one of them has to be chosen for implementation on the selected Kashmiri corpus. The choice of annotation scheme for a treebank is influenced by several factors, one of the most central being its relationship with linguistic theory. It has to be decided whether the annotation scheme should be theory-specific or theory-neutral. If the first alternative is taken, which theoretical framework should be adopted? If the second is opted for, how is a broad consensus on framework selection to be achieved, given that true theory-neutrality is almost impossible? Although it has been argued that theoretical neutrality should be maintained while creating a treebank (Fei Xia, 2008), in reality a theoretically neutral treebank is a myth. However, if theory neutrality is interpreted as NLP friendliness, one can choose, for preparing the annotation guidelines, the framework that is most advantageous for Natural Language Processing. Ultimately, the solution to the problem of framework selection and annotation scheme design comes from the interaction between the different factors that govern treebanking, in particular the nature of the language being analyzed (configurational or non-configurational). Also, researchers, particularly those working in resource-poor scenarios, cannot afford to disregard existing resources and tools for automatic and interactive annotation. The following criteria can be posited to guide the selection of a grammar formalism and, with it, an annotation scheme.
The formalism should be simple & elegant with fewer abstractions, i.e. it should be NLP friendly.
The associated resources (tools and schemes) should be accessible.
It should suit the nature of the language under investigation.
It shouldn’t disregard the grammatical tradition for the language.
It should have some cognitive reality.
Two types of frameworks, constituency and dependency, have been used in framing annotation schemes for different treebanks. A constituency-based annotation scheme posits the structure of a sentence as hierarchically organized phrases (IP = Spec + X’ & X’ = X + Comp.), where the annotations are confined to phrasal tags (such as S, JJP, NP, PP, VP, etc.). Such schemes do not explicitly represent the grammatical relations between and within the constituents. A dependency-based annotation scheme, on the other hand, posits a sentence as a dependency graph, i.e. a structure built from head-dependent pairs, each connected by a labeled arc (which can also be directed) denoting the grammatical relation (GR) between them. The relations in the syntactic structure can be labeled not only with GRs but also with other specifications of the function of the dependent. The syntactic units are words in the more lexicalized dependency frameworks (Hudson, 1984; Mel’cuk, 1988), but dependency annotation schemes sometimes rely on units of several words or word clusters, e.g. chunks in Abney (1991) and Bharati et al. (1994).
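The notion of a dependency graph described above can be made concrete with a small data structure. The sketch below is illustrative only: the class `Arc`, the function `is_tree`, the English toy sentence and the relation labels `subj`, `obj` and `root` are all hypothetical and do not belong to the KashTreeBank annotation scheme. The well-formedness conditions checked here (one head per word, a single root, no cycles) are the constraints commonly assumed for dependency trees; they are an assumption of this sketch rather than a claim made in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    head: int       # position of the governing word (0 = artificial root)
    dependent: int  # position of the dependent word (1-based)
    label: str      # grammatical relation carried by the arc


def is_tree(n_words, arcs):
    """Check that the arcs form a single-rooted, acyclic dependency tree."""
    heads = {}
    for arc in arcs:
        if arc.dependent in heads:  # a word may have only one head
            return False
        heads[arc.dependent] = arc.head
    if set(heads) != set(range(1, n_words + 1)):  # every word needs a head
        return False
    if any(h < 0 or h > n_words for h in heads.values()):
        return False
    if list(heads.values()).count(0) != 1:  # exactly one word hangs off the root
        return False
    for start in heads:  # follow head links upwards; a repeat means a cycle
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


# Toy example with hypothetical labels: "birds eat seeds"
WORDS = ["birds", "eat", "seeds"]
ARCS = [Arc(2, 1, "subj"), Arc(0, 2, "root"), Arc(2, 3, "obj")]
```

In a chunk-based scheme of the kind mentioned above, the same structure would apply with chunks rather than single words as the numbered units.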
The annotation schemes used in different treebanks can be compared and contrasted on the basis of the following parameters, proposed in Bosco and Lombardo (2004).
The explicit representation of semantic information
On the one hand, the Penn Treebank (PTB) (Marcus et al., 1993) uses a mono-stratal (single-layered) annotation scheme that combines the annotation of syntax and semantics on the same level of representation. Its syntactic annotation is based on constituency but has been enriched with a small set of grammatical relations and semantic information. On the other hand, the Prague Dependency Treebank (PDT) uses a multi-stratal annotation scheme that consists of three separate layers: morphological, analytical and tecto-grammatical (or semantic). The NEGRA treebank also uses a mono-stratal annotation scheme, one that combines phrase-structure and dependency representations, allowing for the direct representation of both phrases, for fixed-word-order constructions, and syntactic dependencies (predicate-argument structures). The PDT annotates the relational structure more richly than the others. Since the number of relations annotated in the NEGRA treebank and the PTB is quite low, their representation of the relational structure is quite poor. Nevertheless, the relational structure can be recovered more easily, all at once, from mono-stratal representations such as those of NEGRA and the PTB than from multi-stratal representations such as the PDT, where the information is spread over several structurally different layers. The major limitation of mono-stratal representation is that phenomena which require structurally different levels are represented on a single level, e.g. semantics and syntax are represented as coordinated rather than disjoint.
Some scholars claim that dependency-based annotation is more suitable for relatively free-word-order languages (Hudson, 1984; Mel’Cuk, 1988; Covington, 1990; Bharati et al., 1995), while others make their choice on the basis of application requirements, and in some cases the annotation scheme follows the linguistic tradition. To annotate corpora of relatively fixed-word-order languages like English, the principle of constituency is usually employed. However, in treebanks like the TIGER Treebank for German (Brants et al., 2002) and the Quranic Arabic Treebank (Kais Dukes & Tim Buckwalter, 2010), dependency is combined with PSG. Recently, efforts have also been made to annotate relatively free-word-order languages like Hindi-Urdu with dependency structure, lexical predicate structure and phrase structure in a coordinated manner (Palmer et al., 2009). Further, a treebank can have multiple representations rooted in different linguistic theories, so as to maintain theory equality rather than theory neutrality. For instance, the Multi-Representational and Multi-Layered Treebank for Hindi-Urdu (Bhatt et al., 2009) has both phrase structure (PS) and dependency structure (DS) representations. In fact, multiple representations are the current state of the art in treebanking, but one still has to start with one type of representation.
Treebanking primarily involves syntactic parsing and annotation of a POS-tagged corpus, which can be done in different ways. The most commonly used method for developing a treebank is a combination of automatic and manual processing. Some treebanks have, however, been created entirely manually, even though taggers and parsers were available to automate some of the work; such a method is rarely employed in state-of-the-art treebanking. There are three main techniques for carrying out the annotation process, viz.:
Supervised Technique: In this technique, the annotation process is carried out manually by human annotators, preferably by syntacticians.
Un-supervised Technique: In this technique, the annotation process is carried out automatically by an intelligent system called syntactic parser (developed without any training data).
Semi-supervised Technique: In this technique, the annotation process is partly done automatically by a trained parser & the parses are partly done or corrected by human intervention.
Traditionally, parsing or syntactic annotation was mostly confined to manual methods. With the development of more sophisticated grammar formalisms, such as context-free grammars like PSG, it became possible to automate syntactic annotation, either on the basis of a computational grammar, in which hand-crafted grammar rules (morphological, phrasal and sentential) are used to develop a parser, or on the basis of statistical modeling, in which a syntactically annotated electronic corpus is used to train a parser. Hybrid techniques, involving both grammar rules and statistical modeling, are also used to develop parsers. Treebank creation on the basis of automatic parsing, using a probabilistic grammar or statistical modeling (Bod, 1998; Collins, 1999; Charniak, 2000), is desirable for both practical and theoretical reasons, since manual annotation is time-consuming, labor-intensive, costly and error-prone. It is also difficult to achieve satisfactory consistency both within and between human annotators (Van Der Beek et al., 2002). However, for a resource-poor language like Kashmiri, the automatic approach is impractical at the outset. Manual annotation is therefore the only choice, and the treebank so obtained serves as data for training and testing state-of-the-art parsers like the Stanford Parser (Dan Klein & Christopher D. Manning, 2003), MaltParser (Nivre et al., 2006) or MSTParser (McDonald, 2006). Training results in the induction of a language model, which in turn yields a baseline Kashmiri parser. Once the baseline parser is ready, it can be employed to parse more and more Kashmiri text automatically and to learn more and more structures by boot-strapping7; only then can labor-intensive manual annotation be avoided. Nevertheless, the automatically annotated corpus still needs to be validated manually.
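The cycle described above (manual seed annotation, training of a baseline parser, automatic parsing of new text, manual validation, retraining) can be summarized as a loop. The sketch below is schematic and makes several assumptions: the function `bootstrap`, its parameters and the default batch size and number of rounds are hypothetical, and `train` and `validate` are placeholders to be supplied by the user. They stand in for the training routine of a real parser and for a human annotator respectively, and do not correspond to the interface of any particular parsing toolkit.

```python
def bootstrap(gold, unlabeled, train, validate, batch_size=100, rounds=3):
    """Grow a treebank by alternating training, automatic parsing and validation.

    gold      -- manually annotated sentences (the seed treebank)
    unlabeled -- raw sentences still to be annotated
    train     -- callable(treebank) -> parser, where parser is callable(sentence) -> parse
    validate  -- callable(parse) -> corrected parse (stands in for the human annotator)
    """
    treebank = list(gold)
    pool = list(unlabeled)
    parser = train(treebank)  # baseline parser induced from the manual seed
    for _ in range(rounds):
        if not pool:
            break
        batch, pool = pool[:batch_size], pool[batch_size:]
        automatic = [parser(sentence) for sentence in batch]
        # Automatic parses enter the treebank only after manual validation.
        treebank.extend(validate(parse) for parse in automatic)
        parser = train(treebank)  # retrain on the enlarged treebank
    return parser, treebank
```

The design keeps the human validation step inside the loop, reflecting the point made above that automatically annotated data still has to be checked manually.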
Since automatic syntactic parsing is currently predominantly the domain of machine learning (engineering), where consistency in annotation matters more than granularity, i.e. the depth of analysis, annotation guidelines need to be prepared and followed strictly during the annotation process to avoid frequent inconsistencies and the propagation of errors to other annotation layers.
Representation of Treebank
Treebanking involves not only the deep syntactic analysis of a natural language corpus (sentences) according to a particular grammar formalism but also the representation of that analysis (trees) in a certain format, so that the annotated information can be read by an algorithm during the training process. A format is, generally, a kind of matrix that represents the various levels of annotated grammatical information in different data types or fields (columns) in such a way that a link is maintained between them. Many such formats have been devised so far, for instance CONLL-X (see Table 2) and the Shakti Standard Format (SSF) (see Table 1). SSF was originally devised for the Shakti machine translation system for Indian languages and is mostly used in India, whereas CONLL-X is a widely used standard. It has ten data types (fields), of which seven are utilized in the analysis. Recently, algorithms have been developed to convert SSF into CONLL so that experiments can be done on a wider range of parsers.
In the CONLL-X format, each word-form and punctuation mark is presented on a separate line. Each word has a numerical address (NA) within the sentence in Column 1. Column 2 holds the actual word-form (WF), followed by its base form (BF) in Column 3. The morphological description is given in a short, coarse-grained manner (POS) in Column 4 and as a fine-grained analysis (Morph) in Column 5. The dependency relations (dRel) are marked in Column 7 by indicating the governing word (Head/Root/Regent) through its sentence-internal numerical address from Column 1. The dependency functions (dFn) of the word-forms are presented in Column 8. Columns 6, 9 and 10 are unused and are marked with an underscore (_).
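The column layout just described can be read with a few lines of code. The following is a minimal sketch, not a full CONLL-X reader: it assumes tab-separated columns and a head value of 0 for the root word, and it follows the seven-column usage described above, with field names taken from the abbreviations in the text (NA, WF, BF, POS, Morph, dRel, dFn). The class `Token`, the functions `parse_conll_line` and `parse_sentence`, and the two-word English sample with its tags are hypothetical illustrations rather than KashTreeBank data.

```python
from typing import NamedTuple


class Token(NamedTuple):
    na: int     # Column 1: numerical address within the sentence
    wf: str     # Column 2: word-form
    bf: str     # Column 3: base form
    pos: str    # Column 4: coarse-grained POS
    morph: str  # Column 5: fine-grained morphological analysis
    head: int   # Column 7: address of the governing word (assumed 0 for the root)
    dfn: str    # Column 8: dependency function


def parse_conll_line(line):
    """Turn one tab-separated ten-column line into a Token."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, found {len(cols)}")
    # Columns 6, 9 and 10 (indices 5, 8, 9) are unused and are skipped.
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 int(cols[6]), cols[7])


def parse_sentence(block):
    """Parse a block of lines, one token per line, ignoring blank lines."""
    return [parse_conll_line(line) for line in block.splitlines() if line.strip()]


# Hypothetical two-word sample; the tags and labels are illustrative only.
SAMPLE = (
    "1\tbirds\tbird\tN\tpl\t_\t2\tsubj\t_\t_\n"
    "2\tsing\tsing\tV\tpres\t_\t0\troot\t_\t_\n"
)
```

Calling `parse_sentence(SAMPLE)` yields one `Token` per line, from which the head-dependent pairs can be read off through the `na` and `head` fields.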
In the present work, the annotated data is represented in SSF (Bharati et al., 2007). SSF consists of four columns: Column 1 (C1) carries the address of the token (1, 2, 3, ..., n); Column 2 (C2) carries the actual tokens, one token per line (see Fig. 1); Column 3 (C3) carries the POS category of the node; and Column 4 (C4) carries other features, such as the dependency relations. Any further information, such as morph information, can be represented in C4 using attribute-value pairs. Thus the POS and chunk information of the tokens appears in C3, and the morph, dependency and any other information pertaining to a node appears in C4 (see Table 1).
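The four-column layout of SSF can be illustrated in the same way. The sketch below rests on assumptions that go beyond the description above: it takes the columns to be tab-separated and the attribute-value pairs in C4 to be written as `name='value'` inside a feature structure such as `<fs ...>`. The function `parse_ssf_line`, the attribute names `drel` and `num` and the sample values are hypothetical; chunk brackets and nested addresses, which a complete SSF reader would have to handle, are left out.

```python
import re

# Matches attribute-value pairs of the assumed form name='value'.
_PAIR = re.compile(r"(\w+)\s*=\s*'([^']*)'")


def parse_ssf_line(line):
    """Split one SSF token line into its four columns (C1 to C4)."""
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (4 - len(cols))  # C4 may be absent on some lines
    address, token, pos, feats = cols[:4]
    return {
        "address": address,                       # C1: address of the token
        "token": token,                           # C2: the token itself
        "pos": pos,                               # C3: POS (or chunk) category
        "features": dict(_PAIR.findall(feats)),   # C4: attribute-value pairs
    }
```

For a line such as `1<TAB>birds<TAB>N<TAB><fs drel='subj:2' num='pl'>`, the `features` entry would hold the two pairs `drel` and `num`. A conversion from SSF to CONLL of the kind mentioned earlier would map these fields onto the corresponding CONLL columns.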