Annotation Standards
POS standards provide a framework within which a tagset can be designed for annotating corpora. The first task in corpus annotation is therefore to decide whether to adopt one of the existing standards or to lay down a new one that draws on them.
Standardization in POS tagset design is important not only for achieving consistency in annotation across related languages and research projects, but also for maximizing resource sharing and minimizing the wastage of annotated language resources, particularly in resource-poor scenarios such as that of Indian Languages. For European languages, such steps were taken more than a decade ago in the form of EAGLES, ELRA and ISLE. For Indian Languages, standardization is a quite recent development (only 3-4 years old) in the form of the BIS scheme, although there were earlier efforts in this direction in the form of ILPOSTS and ILMT. The EAGLES and BIS POS annotation schemes have been instrumental in bringing consensus among NLP groups with divergent interests and approaches, enabling them to take up annotation projects and solve various CL, NLP or LT problems. These two standard frameworks are briefly described below from the point of view of POS annotation.
-
EAGLES Framework
It is a widely used framework for POS tagset design whose main aim is the standardization of the POS tagsets used for annotating corpora of various European languages. The importance of standardizing tagsets is pointed out by Leech and Wilson (1999: 55-56):
“In the interests of interchangeability and re-usability of annotated corpora, it is important to avoid a ‘free-for-all’ or ‘re-invention of the wheel’ every time a new project begins………… At the cross-linguistic level, annotations used for one language should as far as possible be compatible with annotations used for another. Compatibility here means that where there are descriptive categories common in between different languages, these should be recognised in the annotation scheme and recoverable from the annotations applied to texts in different languages.”
The EAGLES guidelines provide a set of features and an encoding scheme that different tagsets are expected to include. For morphosyntactic annotation, the guidelines specify: 1) what is obligatory, 2) what is recommended, and 3) what are optional extensions. At each level, tags are defined as morphosyntactic Attribute-Value (A-V) pairs; e.g. gender is an attribute that can take the values masculine, feminine or neuter. These A-V pairs are structured as a hierarchy, though not strictly so. The obligatory property of any POS tagset under the EAGLES guidelines is the set of thirteen major word classes: noun, verb, adjective, pronoun/determiner, article, adverb, adposition, conjunction, numeral, interjection, unassigned/unique, residual, and punctuation. The recommended properties are then organised according to these major word classes; e.g. the attribute Type, with values Common, Proper, etc., is for nouns; Person, with values First, Second and Third, is for verbs; and Degree, with values Positive, Comparative and Superlative, is for adverbs. The recommended attributes also include number, gender, case, finiteness, tense, voice, and other sub-categorisation features. The optional extensions consist of similar attributes of lesser applicability, and some additional language-specific values for the recommended attributes.
The value of this framework is that it promotes consistency and reusability of linguistic resources across languages and discourages “wheel reinvention”. The main drawback of the EAGLES guidelines, however, is that they cover only a tiny fraction of the world’s languages. As a project of the European Union, EAGLES covers only English, Dutch, German, Danish, French, Spanish, Portuguese, Italian and Greek: nine languages of Western Europe which are, moreover, typologically similar. It is worth mentioning that ILPOSTS, on the basis of which the LDCIL POS tagsets were made for the annotation of Indian Language corpora, was based on EAGLES; it can be regarded as an Indian extension of EAGLES. As pointed out by Leech and Wilson (1999: 58):
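The obligatory and recommended levels described above can be pictured as a small data structure. The following is only a minimal sketch, not the EAGLES encoding itself: the names `MAJOR_CLASSES`, `RECOMMENDED` and `Annotation` are hypothetical, and the recommended table lists only the attribute examples mentioned in the text.

```python
from dataclasses import dataclass, field

# Obligatory level: the thirteen major word classes named in the text.
MAJOR_CLASSES = {
    "noun", "verb", "adjective", "pronoun/determiner", "article", "adverb",
    "adposition", "conjunction", "numeral", "interjection",
    "unassigned/unique", "residual", "punctuation",
}

# Recommended level: attribute-value pairs organised by word class.
# Only the examples given in the text are listed (illustrative, not complete).
RECOMMENDED = {
    "noun": {"Type": {"Common", "Proper"}},
    "verb": {"Person": {"First", "Second", "Third"}},
    "adverb": {"Degree": {"Positive", "Comparative", "Superlative"}},
}


@dataclass
class Annotation:
    """A word class plus a set of attribute-value (A-V) pairs."""
    word_class: str
    attributes: dict = field(default_factory=dict)

    def validate(self) -> bool:
        # The word class is obligatory and must be one of the major classes.
        if self.word_class not in MAJOR_CLASSES:
            return False
        # Every attribute must be licensed for this class with a known value.
        allowed = RECOMMENDED.get(self.word_class, {})
        return all(
            attr in allowed and value in allowed[attr]
            for attr, value in self.attributes.items()
        )
```

For instance, `Annotation("noun", {"Type": "Proper"}).validate()` succeeds, while attaching `Person` to a noun is rejected because this sketch licenses that attribute only for verbs.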
“It remains to be seen how far these guidelines can be extended, without substantial revision, to other languages”.
-
BIS Framework
It is the latest framework for the annotation of Indian Languages and is recognised by the Bureau of Indian Standards (BIS). Its foundation was laid at the first meeting of the POS tagset standardization committee, held at the Department of IT, New Delhi, on 19th Nov. 2009. It evolved by taking insights from earlier efforts such as ILPOSTS and ILMT, with the aim of bringing consensus among different NLP groups in India. It incorporates the set of POS labels from the ILMT POS tagset (Bharati et al., 2006) and the notion of hierarchical structure from ILPOSTS (Baskaran et al., 2008), but avoids the fine granularity proposed by ILPOSTS.
In line with the ILMT tagset, it assumes separate layers for morphological analysis and POS annotation, so that grammatical information is captured efficiently and better results are obtained in manual as well as automatic annotation. It further holds that the input to the POS tagger (the text corpus) should already have undergone pre-processing. Thus, every token (word) to be assigned a POS tag is a single lexical item, not a token that internally contains more than one lexical item, as can be seen in agglutinative languages, or less than one, as in languages with split orthography (Bhat, 2010). It also keeps to the assumption that there must be an MWE identifier layer after POS tagging. Since POS tagging is a lexical-level annotation process, any unit that involves more than one lexical item, such as conjunct verbs and compounds, will not be captured at the POS level. Accordingly, BIS proposes hierarchical and coarse-grained tagsets for all Indian Languages. These tagsets have a three-level hierarchy: Type, Subtype-I and Subtype-II. The first level (Type) includes 11 main categories: Noun, Pronoun, Demonstrative, Adjective, Quantifier, Verb, Adverb, Postposition, Conjunction, Particle, and Residual. The second level (Subtype-I) includes 32 subcategories. The third level (Subtype-II), which is optional, includes 3 sub-subcategories, for verbs only. The main principles that were taken into consideration while developing the POS tagsets for the annotation of Indian Language corpora are as follows:
i. The scheme should be generic, i.e. it should work for all the Indian Languages and shouldn’t be oriented towards any one language or a group of languages.
ii. A layered approach should be followed for annotating the various types of linguistic information available in a text. Each type of information, such as morphological, POS and chunk information, should be annotated in a separate layer.
iii. The scheme should be flexible enough to incorporate or drop a category, either at the top level of the hierarchy or as a sub-category of an existing type, so that the scheme can be extended from one language to another.
iv. The annotation scheme should be annotator friendly, avoiding ambiguous tags, which put a cognitive load on the annotators and lead to inconsistency in the annotation.
v. The scheme should be mappable with pre-existing annotation schemes of Indian Languages to avoid the wastage of the resources.
vi. The scheme should support all types of NLP research efforts independent of a particular technology and development approach.
-
POS Tagsets
The POS tagsets that have been designed for English and Indian Languages are described below.
-
POS Tagsets for English
“POS tagging has been a hot research topic since the early 1980s” (Voutilainen, 1999), although the research actually originated in the 1960s for European languages. In India, however, research on POS tagging is a quite recent development, and the concept of tagset design and its standardization is therefore also very recent compared with its European and American counterparts. The main efforts in POS tagging resulted in various POS tagsets such as Brown, CLAWS1 and UPenn (mainly designed for English), but these tagsets are mostly simple inventories of tags corresponding to morpho-syntactic features, and they vary greatly in granularity (Hardie, 2004). The CLAWS2 and CLAWS7 tagsets are considered landmarks in the history of tagset design (Leech 1997). CLAWS7 marked an important change in the structure of tagsets, from a flat structure to a hierarchical structure.
According to Daniel Jurafsky and James H. Martin (1999), “There are a small number of popular tagsets for English, many of which evolved from the 87-tag tagset used for the Brown corpus (Francis, 1979; Francis and Kučera, 1982). Three of the most commonly used are the small 45-tag Penn Treebank tagset (Marcus et al., 1993), the medium-sized 61 tag C5 tagset used by the Lancaster UCREL project's CLAWS (the Constituent Likelihood Automatic Word-tagging System) tagger to tag the British National Corpus (BNC) (Garside et al., 1997), and the larger 146-tag C7 tagset (Leech et al., 1994).” Irrespective of popularity, however, a brief description of several POS tagsets of English and Indian Languages is given below.
-
CGC Tagset
The earliest work on POS tagging started with the CGC (Computational Grammar Coder) of Klein and Simmons (1963) for English in the USA. The tagset consists of thirty tags, of which only the pronoun tags are decomposable. Their CGC program also outputs information, external to the main tag, on the number of nouns and verbs, and it notes whether a noun is possessive, so that the actual number of categories distinguished is considerably greater (Hardie, 2004). It also incorporates tags for punctuation marks, which are treated as words; treating punctuation marks in this manner can be a significant aid in tagging nearby words (Leech, 1997).
-
TAGGIT Tagset
Klein and Simmons’s work inspired that of Greene and Rubin (1971). Their tagset contains 77 POS tags and, unlike the CGC, their TAGGIT program encodes number information as an integral part of the main tag itself (ibid). The CGC and TAGGIT tagsets share some consistent design features. The TAGGIT tagset also incorporates tags for punctuation marks, which are treated as words. The definition of each tag is based on the syntactic function that a given word form performs in a particular context. The tags show more of a tendency to be decomposable. For example, in the tag WPO, W is Wh-word, P is pronoun and O is objective form. However, unlike some later tagsets, this tagset was not hierarchical; neither was the earlier tagset of Klein and Simmons (1963). Both of these early projects also had some means of dealing with ambiguity. Some of the TAGGIT tags were exclusively for dealing with ambiguous words. For example, the CI tag marks a word which is either a subordinating conjunction or a preposition, such as ‘before’. There are also tags for subordinating conjunction (CS) and preposition (IN). Only the CS and IN tags are needed for an exhaustive classification, but CI is necessary on pragmatic grounds.
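The role of an ambiguity tag such as CI can be illustrated with a short sketch. This is a hypothetical illustration, not part of TAGGIT: the names `AMBIGUITY_TAGS`, `candidate_tags` and `resolve` are assumptions, and only the CI, CS and IN tags mentioned in the text are used.

```python
# An ambiguity tag stands for a set of "real" tags (only CI is given in the text).
AMBIGUITY_TAGS = {"CI": frozenset({"CS", "IN"})}


def candidate_tags(tag):
    """Return the unambiguous tags that `tag` may resolve to."""
    return AMBIGUITY_TAGS.get(tag, frozenset({tag}))


def resolve(tag, chosen):
    """Replace an ambiguity tag by one of its candidates, e.g. after
    a later disambiguation step has inspected the context."""
    if chosen not in candidate_tags(tag):
        raise ValueError(f"{chosen!r} is not a possible reading of {tag!r}")
    return chosen
```

A word such as ‘before’ could thus be tagged CI in a first pass and resolved to either CS or IN later; an unambiguous tag simply resolves to itself.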
-
CLAWS Tagset
As mentioned above, POS tagging has been a well-known research topic since the early 1980s. In the decade from the 1980s to the 1990s, a number of tagsets were devised for English at Lancaster University for use in the CLAWS tagger (Garside 1987). The C1 tagset, also known as the LOB tagset, was used in the annotation of the LOB Corpus. Since this corpus was designed to parallel the structure of the Brown Corpus, the tags were also parallel, and C1 is very similar to the later version of the Brown tagset (Francis and Kučera 1982). The development of the C2 tagset was motivated by:
“providing distinct codings for all classes of words, having distinct grammatical behavior, and making the tagset more systematic, in the way, that tags are built up from individual characters” (Sampson 1987).
This means that greater decomposability and a more hierarchical nature were brought into the C2 tagset (166 tags). For example, all verbal tags have V as their first character and, as their second character, either V again (for a main verb) or another character (for an auxiliary verb). The major subsequent developments in the CLAWS tagsets were the C5 and C7 tagsets, developed for the annotation of the BNC (see Leech, Garside & Bryant 1994, Leech 1997b, Garside and Smith 1997). The C7 tagset (146 tags) is the more fine-grained of the two and can be regarded as a further refinement of the CLAWS2 tagset. The C5 tagset is something of a departure from the others, since it has fewer tags (61), in order to make it useful to the largest number of end users (Hardie, 2004). The C5 tagset has been characterized as a flat tagset (Cloeren, 1999). In fact, although none of the CLAWS tagsets is laid out in the hierarchical fashion described by Cloeren, the C7 tagset is hierarchical in conceptual terms (Leech, 1997). Furthermore, both C5 and C7 are largely decomposable, C7 again to a greater extent. For example, in the tag PPHO2, the first ‘P’ is pronoun, the second ‘P’ is personal, ‘H’ is third person, ‘O’ is objective case and ‘2’ is plural.
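Decomposability means that a tag can be read character by character. The sketch below illustrates this under stated assumptions: `PRONOUN_GLOSS` contains only the reading of PPHO2 given in the text, the function names are hypothetical, and the tags VVD and VBZ are used purely as illustrative inputs for the verb rule.

```python
# One lookup table per character position, covering only the PPHO2 example.
PRONOUN_GLOSS = [
    {"P": "pronoun"},
    {"P": "personal"},
    {"H": "third person"},
    {"O": "objective case"},
    {"2": "plural"},
]


def decompose(tag, gloss):
    """Gloss each character of `tag` using the table for its position."""
    if len(tag) > len(gloss):
        raise ValueError("tag is longer than the gloss table")
    return [gloss[i].get(ch, "unknown") for i, ch in enumerate(tag)]


def verb_kind(tag):
    """Apply the rule for verbal tags: first character V, second character
    V for a main verb, any other character for an auxiliary verb."""
    if len(tag) < 2 or tag[0] != "V":
        return None  # not a verbal tag under this rule
    return "main verb" if tag[1] == "V" else "auxiliary verb"


print(decompose("PPHO2", PRONOUN_GLOSS))
print(verb_kind("VVD"), "/", verb_kind("VBZ"))
```

A flat tagset offers no such positional reading: each tag has to be looked up as an unanalysable whole.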
-
UPenn Tagset
The POS tagset used in the Penn Treebank (Marcus et al., 1993) is also based on the Brown Corpus tagset. However, it was modified in the direction of simplification rather than greater complexity, as was the case with the CLAWS tagsets (Hardie, 2004). Thus, there are considerably fewer tags (36 POS tags, not counting the tags for punctuation and other symbols). It makes fewer of what have been described as “lexically recoverable distinctions” (Marcus et al., 1993); for example, the distinction between lexical verbs and the auxiliary verbs (be, do and have) is not retained in this tagset, since it can be recovered from the word forms themselves. Information that could be recovered from the parse has also been excluded from the tagset to avoid the risk of inconsistency in tagging. “It is clear that reducing the size of the tagset reduces the chances of such tagging inconsistencies” (ibid).
-
Lund Tagset
This tagset, designed for the annotation of the London-Lund Corpus of Spoken English, differs significantly from the Brown Corpus/CLAWS tagset tradition (Svartvik 1990). It is more fine-grained, consisting of just over 200 tags. Because it was designed for spoken texts, it includes tags for a variety of discourse-element adverbs not usually distinguished in the tagging of written texts, as well as tags for other features of speech such as swearing. For the same reason, it lacks punctuation tags. Moreover, this tagset is hierarchical and decomposable into single characters (or 2-3 character strings) that indicate given features.
-
POS Tagsets for Indian Languages
Despite being a relatively new field, research on POS annotation in Indian Languages has also produced a number of tagsets and common frameworks. These include the AU-KBC tagset for Tamil (2001), Hardie's tagset for Urdu (Hardie, 2005), the IIIT-ILMT tagset for Hindi (Bharati et al., 2006), the MSRI-JNU tagset for Sanskrit (Chandra Shekhar, 2007), MSRI-ILPOSTS for Hindi and Bangla (Baskaran et al., 2008), the CSI-HCU tagset for Telugu (Sree R.J et al., 2008), the Nelralec tagset for Nepali (Hardie et al., 2005), the LDCIL tagsets for ILs (Malikarjun et al., 2010; Bhat et al., 2010), and the BIS tagsets for all ILs (Ms. 2009). Some of the important POS tagsets relevant to the current work are briefly described below.
-
EMILLE Tagset for Urdu
Urdu, written in the Perso-Arabic script, offers a different set of challenges in POS tagset design. Hardie (2005) designed the Urdu tagset for the EMILLE project on the basis of the Urdu grammar of Schmidt (1999) and in accordance with the EAGLES guidelines. However, designing a tagset for Urdu was not a straightforward task, particularly with respect to the orthographic conventions and the presence of borrowed Arabic and Persian forms, which are structurally quite distinct from the Indo-Aryan forms. The issues highlighted in Hardie (2005) include tokenisation and the idiosyncratic features of Urdu. In Urdu orthography, many elements described as suffixes in traditional grammars are actually written as independent tokens. Hence, the arbitrary decision was taken to treat every orthographic space as a word break, even if it occurs within a lexical item. This, however, makes it necessary to have some means of tagging those elements which do not constitute free forms (words). For example, "zimmah daar" (responsible) consists of two tokens, a root and a derivational suffix. The same suffix appears fused to the root in other contexts, as in "samajhdaar" (sensible), and further suffixation can take place, as in "zimmah daarii" (responsibility). Against the background of such orthographic conventions, a syntactically null tag was introduced for a token whose grammar depends on the subsequent token, e.g. samajhdaar\JJU and zimmah\LL daar\JJU. The major categories in the Urdu tagset are virtually identical to the equivalent categories defined in EAGLES: Nouns, Pronouns, Verbs, Adjectives, Adverbs, Postpositions and Conjunctions. The tagset handles the tokenization problem (for details see chapter 3) at the POS level and thus tries to deal with two separate problems, tokenization and POS tagging, simultaneously.
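The effect of the syntactically null tag can be sketched as follows. This is a minimal illustration rather than Hardie's procedure: the function names are hypothetical, and the only assumption taken from the text is that a token tagged LL depends on the token that follows it.

```python
def parse_tagged(text):
    """Split a string of token\\TAG items into (token, tag) pairs."""
    pairs = []
    for item in text.split():
        token, _, tag = item.rpartition("\\")
        pairs.append((token, tag))
    return pairs


def merge_null_tagged(pairs, null_tag="LL"):
    """Attach each null-tagged token to the token that follows it,
    so that the lexical item is recovered with a single tag."""
    merged, pending = [], []
    for token, tag in pairs:
        if tag == null_tag:
            pending.append(token)  # wait for the host token
        else:
            merged.append((" ".join(pending + [token]), tag))
            pending = []
    # Null-tagged tokens with no following host are kept unchanged.
    merged.extend((token, null_tag) for token in pending)
    return merged


print(merge_null_tagged(parse_tagged(r"zimmah\LL daar\JJU")))
print(merge_null_tagged(parse_tagged(r"samajhdaar\JJU")))
```

Both the split and the fused spelling end up as one adjective-tagged item, which shows how the null tag lets a POS-level annotation absorb a tokenization problem.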
-
ILMT POS Tagset for Hindi
The ILMT POS tagset was developed by the Akshar Bharati Group for annotating a Hindi corpus. It is based on the principle of simplicity, with the motivation of extending it as a framework for all ILs. Another important dimension taken into account in its design is the division of labour between the POS tagger and the Morph Analyser: the POS tagger is supposed merely to disambiguate the multiple tags generated by the Morph Analyser. Finer distinctions have been avoided in order to keep the number of tags small, which facilitates efficient machine learning and accuracy in automatic annotation. This has resulted in a flat tagset comprising 21 POS tags; other inflectional information associated with the tokens can be obtained from the Morph Analyser. Form-function duality is one of the crucial issues in tagset design. This tagset is mainly form-based, as pointed out in Bharati et al. (2006): “the syntactic function of a word is not considered for POS tagging.....the word is tagged always according to its lexical category...” Hence, the function of a token in context is not considered the primary basis for POS tagging. As far as the tags are concerned, the UPenn tags have been used along with some newly devised tags. The most important point is that the tagset innovatively leaves finiteness to be dealt with at the next level of annotation, i.e. at the level of the word group or chunk rather than the token. Participles and gerunds are tagged as VM (though they function differently), and all auxiliaries are tagged as VAUX. A variable tag (XC) has also been introduced, where X stands for the category of a token that is part of a compound and C stands for compound. Finally, it is worth mentioning that although form has been chosen as the primary basis for POS annotation, adherence to semantic as well as syntactic function is often evident in the tagset.
-
ILPOSTS
It is a POS tagset framework designed to cover the fine-grained morphosyntactic details of Indian Languages. It proposes a three-level hierarchy of categories, types and attributes. It was developed by Microsoft Research India on the basis of the EAGLES guidelines (Leech & Wilson 1999), and language-specific POS tagsets have been customised from it. Tagsets for Sanskrit (C. Shekhar, 2007) and for Hindi and Bangla (Baskaran et al., 2008) were customised first; later the scheme was further refined and tagsets for all ILs were developed at LDCIL (Malikarjun et al., 2010; Bhat et al., 2010). These tagsets are hierarchical in nature and consist of decomposable tags.
A general guiding principle has been formulated to handle form-function duality. A set of ‘Attributes’ has been devised on the basis of morpho-syntactic or simply orthographic practices; the attributes are marked according to their form, while the ‘Types’ are marked on the basis of their function. It is worth mentioning that, on the one hand, ‘Attributes’ are tagged according to their morphological visibility (like tense, aspect, etc.) as well as their semantics (like number, gender, etc.). On the other hand, ‘Types’ are exclusively based on semantics (like common noun, proper noun, etc.). A combination of form and distribution-based function is applied for tagging categories like Demonstrative (DEM), Pronoun (P), Quantifier (JQ), Noun (NC), Noun denoting Space & Time (NST) and Adverb of Location (ALC), and the orthographic convention is taken as the basis for annotating Postposition and Case marker. Although finiteness is defined on the basis of inflection for person, number, gender, tense, aspect and mood, the verb is not dealt with as neatly as in later schemes. Further, where forms are similar, a distributional basis is used for distinguishing and annotating categories such as pronoun and demonstrative, or pronoun and quantifier. A token is tagged as a demonstrative if it is followed by an adjective or a noun, and as a pronoun if it is not followed by a noun or another such part of speech. Similarly, a token is tagged as a nominal modifier if it is followed by a noun, and as a noun if it is not. Case marker and Postposition are assumed to be instances of the same phenomenon of marking dependents. However, due to orthographic conventions, the dependent marker is written in two ways: together with its host and separately from it. These two ways are tagged as case marker and postposition, respectively.
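The distributional and orthographic tests described above can be sketched as two small functions. This is only an illustration under stated assumptions: the distributional test is read as a look-ahead at the category of the next token, the function names and the category labels "noun" and "adjective" are hypothetical, and only the tags DEM and P are taken from the text.

```python
# Categories that, when they come next, make the current token a modifier.
NOMINAL_FOLLOWERS = {"noun", "adjective"}


def demonstrative_or_pronoun(next_category):
    """Distributional test: DEM if a noun or adjective follows, else P.
    `next_category` is None when the token is the last one in its unit."""
    return "DEM" if next_category in NOMINAL_FOLLOWERS else "P"


def dependent_marker(written_attached):
    """Orthographic test: a marker written together with its host is a
    case marker; one written as a separate token is a postposition."""
    return "case marker" if written_attached else "postposition"


print(demonstrative_or_pronoun("noun"))   # same form, modifier use
print(demonstrative_or_pronoun("verb"))   # same form, standalone use
print(dependent_marker(False))
```

The sketch makes explicit that the same word form can receive different tags depending only on its context or its spelling convention, which is the sense in which the scheme mixes form and function.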
-
BIS Tagset
Designing and developing a POS tagset is a prerequisite for any POS annotation work, whether carried out in isolation or as an integral part of a larger annotation pipeline such as the one involved in building a treebank. As mentioned above (see the second section), BIS is an annotation framework recognized by the Bureau of Indian Standards. The framework has not adopted the Indic system of descriptive categories; rather, like most annotation schemes in the world, it relies on the descriptive categories of the Techne. Thus Dionysius Thrax’s Techne (c. 100 B.C.), a grammatical sketch of Greek, has served as a model not only for contemporary POS descriptions of European languages but also for the POS descriptions of South Asian languages. The Techne includes an inventory of eight POS categories (noun, verb, participle, article, pronoun, preposition, adverb, and conjunction).
The BIS-recommended POS tagsets for Indian Languages use the same basic set of POS categories that was used by earlier tagsets like ILPOSTS and ILMT. The 32 parts-of-speech categories recommended by BIS for Kashmiri are given in Table 1 (for the detailed tagset see appendix-I). It is worth mentioning that, at the POS level, the verb subcategories of Kashmiri have been kept in line with Hindi-Urdu, i.e. fine-grained distinctions (finite, non-finite and infinitive) have been avoided and the category Verb has been sub-divided only into main verb and auxiliary.
| No. | Category | Tag | No. | Category | Tag |
| 1 | Noun Common | N_NN | 22 | Quantifier General | QT_QTF |
| 2 | Noun Proper | N_NNP | 23 | Quantifier Cardinals | QT_QTC |
| 3 | Noun Locative | N_NST | 24 | Quantifier Ordinals | QT_QTO |
| 4 | Pronoun Pronominal | PR_PRP | 25 | Residual Foreign-word | RD_RDF |
| 5 | Pronoun Reflexive | PR_PRF | 26 | Residual Symbol | RD_SYM |
| 6 | Pronoun Relative | PR_PRL | 27 | Residual Punctuation | RD_PUNC |
| 7 | Pronoun Reciprocal | PR_PRC | 28 | Residual Unknown | RD_UNK |
| 8 | Pronoun WH | PR_PRQ | 29 | Residual Echo-words | RD_ECH |
| 9 | Pronoun Indefinite | PR_PID | 30 | Adverb Manner | RB_RB |
| 10 | Demonstrative Deictic | DM_DMD | 31 | Adjective | JJ_JJ |
| 11 | Demonstrative Relative | DM_DMR | 32 | Postposition | PP_PSP |
| 12 | Demonstrative WH | DM_DMQ | | | |
| 13 | Demonstrative Indefinite | DM_DMI | | | |
| 14 | Verb Main | V_VM | | | |
| 15 | Verb Auxiliary | V_VAUX | | | |
| 16 | Conjunction Coordinating | CC_CCD | | | |
| 17 | Conjunction Subordinating | CC_CCS | | | |
| 18 | Particle Default | RP_RPD | | | |
| 19 | Particle Interjection | RP_INJ | | | |
| 20 | Particle Intensifier | RP_INTF | | | |
| 21 | Particle Negation | RP_NEG | | | |
Table 1. BIS POS Tagset of Kashmiri
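The two-level structure of the tags in the table (a top-level type and a subtype joined by an underscore) can be made explicit with a short sketch. The tag inventory is copied from the table; the names `BIS_KASHMIRI`, `split_tag` and `group_by_type` are hypothetical helpers, not part of the BIS scheme.

```python
# Tag inventory copied from the table, in table order.
BIS_KASHMIRI = {
    "N_NN": "Noun Common", "N_NNP": "Noun Proper", "N_NST": "Noun Locative",
    "PR_PRP": "Pronoun Pronominal", "PR_PRF": "Pronoun Reflexive",
    "PR_PRL": "Pronoun Relative", "PR_PRC": "Pronoun Reciprocal",
    "PR_PRQ": "Pronoun WH", "PR_PID": "Pronoun Indefinite",
    "DM_DMD": "Demonstrative Deictic", "DM_DMR": "Demonstrative Relative",
    "DM_DMQ": "Demonstrative WH", "DM_DMI": "Demonstrative Indefinite",
    "V_VM": "Verb Main", "V_VAUX": "Verb Auxiliary",
    "CC_CCD": "Conjunction Coordinating", "CC_CCS": "Conjunction Subordinating",
    "RP_RPD": "Particle Default", "RP_INJ": "Particle Interjection",
    "RP_INTF": "Particle Intensifier", "RP_NEG": "Particle Negation",
    "QT_QTF": "Quantifier General", "QT_QTC": "Quantifier Cardinals",
    "QT_QTO": "Quantifier Ordinals",
    "RD_RDF": "Residual Foreign-word", "RD_SYM": "Residual Symbol",
    "RD_PUNC": "Residual Punctuation", "RD_UNK": "Residual Unknown",
    "RD_ECH": "Residual Echo-words",
    "RB_RB": "Adverb Manner", "JJ_JJ": "Adjective", "PP_PSP": "Postposition",
}


def split_tag(tag):
    """Split a tag such as 'V_VAUX' into (type, subtype)."""
    top, _, sub = tag.partition("_")
    return top, sub


def group_by_type(tagset):
    """Group the tags under their top-level type prefix."""
    groups = {}
    for tag in tagset:
        groups.setdefault(split_tag(tag)[0], []).append(tag)
    return groups


groups = group_by_type(BIS_KASHMIRI)
print(len(BIS_KASHMIRI), "tags under", len(groups), "top-level types")
```

Grouping by prefix recovers the first level of the hierarchy from the tags alone, which is what allows a coarse-grained view (type only) and a finer view (type plus subtype) of the same annotation.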
-