Chapter. 1 Introduction


POS Tagging of KashCorpus




POS tagging can be performed in the SA-interface by four methods, which differ in ease of use. As shown in Figure.13.a, one sentence is displayed at a time in vertical order, so that the first word of the horizontal configuration (right-to-left or left-to-right) corresponds to the topmost word of the vertical configuration. In SSF, each word is represented by a node. Once a node has been selected, either by clicking on it or by moving the cursor with the keyboard, it can be tagged in one of the following ways: (i) by selecting the tag from a drop-down list, as shown in Figure.13.b; (ii) by right-clicking to open a context menu, selecting ‘Node Name’ from the sub-menu and then selecting the tag from the sub-sub-menu; (iii) by typing the first letter of the tag on the keyboard one or more times; or (iv) by clicking on a button labelled with the intended tag.

Figure.13.b. SA-Interface Showing a Method of POS Tagging


For POS tagging, 812 sentences of KashCorpus were taken and converted into SSF: 226 sentences from the newspaper domain, 286 from short stories and 300 from literary criticism. All 812 sentences were tagged with POS tags in four phases. Twenty-nine POS tags were assigned across the three domains, which were divided into four data sets. The results are given in the next section.
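The composition of the annotated sample can be checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are the ones reported above, while the variable names are chosen here for illustration.

```python
# Sentence counts per domain, as reported in the text.
domain_counts = {
    "newspaper": 226,
    "short stories": 286,
    "literary criticism": 300,
}

# The three domains should add up to the 812 annotated sentences.
total = sum(domain_counts.values())

# Percentage share of each domain, rounded to one decimal place.
domain_share = {
    domain: round(100 * count / total, 1)
    for domain, count in domain_counts.items()
}

print(total)
for domain, share in domain_share.items():
    print(f"{domain}: {share}%")
```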

Figure.13.c. SA-Interface Showing a POS Annotated Sentence




  1. POS Tagging Issues

The POS annotation of four samples of the Kashmiri corpus raised various linguistic issues and led to their understanding and resolution. The main issues are discussed below, and their solutions are given in the form of annotation guidelines. Statistical information about the various POS categories is a byproduct of this work and is also given below.

Some general decisions need to be taken at the time of tagset design. These concern whether the tagset should be flat or hierarchical, fine-grained or coarse-grained, form-based or function-based, etc. Only after deciding on these dualities can one proceed with customizing the tagset for a particular language. However, not everything can be decided at the time of customization, and certain things cannot be decided categorically, in a binary yes-no manner, at all. Some things therefore need to be decided during the actual corpus annotation, and these decisions need to be documented to form what is called an annotation guideline. One must keep in mind that the decisions taken to resolve some issues may not be theoretically appealing; they may be merely shallow ad hoc solutions, meant either to postpone the immediate problem to the next level or to provide the best available solution that prevents further problems. The issues raised and addressed at this level of corpus annotation are classified under the following headings:



    1. Fuzzy Items in Complex predicates (FI)

POS categories are hardly like the elements of the periodic table, which always retain their unique identity. In certain contexts they lose their grammatical identity, i.e. their morpho-syntactic features, either through a neighborhood effect or through grammaticalisation. For instance, in some complex predicates (see Butt for explanation), it is hard to decide the grammatical category of the words other than the light verb (in V2 or V-final position), as in the examples kor dafah, kor hA:sil, kor pA:dI, darguzar korun, kor tabdiil, kor fanah, etc. In these examples, the words accompanying the light verb (dafah, hA:sil, pA:dI, etc.) are most likely either adjectives or nouns, although their nominal features such as number, gender and case have been bleached. Nevertheless, their category is not as clear as that of the corresponding words in the following complex predicates: gov khosh, tuj’ dav, dits kreykh, nyuv kheyth, etc.

In these cases it is easy to decide whether the words are adjectives, nouns or something else, as they clearly retain their nominal features at either the morphological or the semantic level. Here, khosh is an adjective (mas/fem, agree), dav is a noun (fem), kreykh is a noun (fem) and kheyth is a verb (participle).



    1. Zwitter Ion of Natural Language (ZI)

The term Zwitter Ion is taken from chemistry to illustrate the dual nature of gerunds. Chemical particles are usually either positively or negatively charged at any one instant, but Zwitter Ions have a dual nature and carry positive and negative charges simultaneously. By analogy, nouniness and verbiness are polar opposites, like positive and negative charge: if a word tends towards being a noun, its verbal properties have declined, and vice versa. The gerund is the only class of words that retains nominal and verbal properties simultaneously. On the one hand, gerunds take case markers and function like nominals; on the other hand, they retain their predicate-argument structure like a typical verb.

Now the question arises of how to tag a gerund: should it be classified by its form or by its function? By form it is a verb, though a nonfinite one; by function it is a noun. In ILPOSTs it was placed under the category Noun as Verbal Noun, perhaps with the focus on function, but in ILMT and subsequently in BIS it has been placed under the category Verb as gerund, given that by form it is a verb and its predicate-argument structure frame remains intact, though it can never be inflected for other typical verbal features such as tense, aspect, mood or voice, e.g. kheyn-I sI:t’, vandn-I kin’, cheyn-as peyTh, marn-an, etc.

Here, on the one hand, kheyn-I, vandn-I and cheyn-as are gerunds in the oblique form, followed by a postposition like nominals, whereas the gerund marn-an is not in the oblique form but is inflected with a case marker (-an). On the other hand, transitive gerunds such as kheyn-I and cheyn-as can also take their arguments, as in batI kheyn-I sI:t’ or chai cheyn-as peyTh.


    1. izaafat Constructions (IC)

These constructions are multiword expressions borrowed from Persian, similar to compounds and named entities but with a more coherent internal structure. Usually two nouns, or a noun and an adjective, are combined by means of a marker called “izaafe” to form an izaafat construction. The izaafe behaves like a genitive in Urdu, but in Persian it behaves more like a linker (see Butt). In Kashmiri, however, the construction seems to behave more like a compound with a less conspicuous internal structure, e.g. aab-i hayaat, vaziir-i aazam, hoquuq-o frA:yiz, dast-i shafaa, habiibi paakh, hoquumat-i hind, shariiq-i hayaat, etc.

The diacritic marker that represents izaafe in Kashmiri is mostly zer (ِ), unlike in Persian and Urdu, where hamzah (ء), vaav (و) and badii ye (ے) also represent izaafe. Izaafat constructions are either NN-NN or NN-JJ combinations. In NN-NN combinations, the diacritic is on the first element (NN), but it seems to belong to the second element (NN) when the construction is simplified (nativized) for interpretation. The Kashmiri newspaper corpus is replete with such expressions. Given the writing conventions of Arabic, i.e. the omission of diacritics, and their influence on Urdu and thereby on Kashmiri writing, such markers may or may not be present in written expressions, but they remain intact in spoken forms.



Here, the problem is how to tag the two constituent words of an izaafat construction. Should the words be tagged separately (like aabi/NN Hayaat/NN), or should the constituents be joined and then tagged together as a unit (like aabi-hayaat/NN)? In the first case, there will be less clarity in determining the POS category of the first word, which is marked with izaafe.
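The two options can be made concrete with a small sketch. The function names and the hyphen joiner are illustrative assumptions, not part of the annotation tool or of SSF; only the example words and the NN tag come from the text.

```python
def tag_separately(first, second, first_tag="NN", second_tag="NN"):
    """Option 1: keep two tokens, each carrying its own POS tag."""
    return [(first, first_tag), (second, second_tag)]


def tag_as_unit(first, second, unit_tag="NN", joiner="-"):
    """Option 2: join the constituents into one token with a single tag."""
    return [(f"{first}{joiner}{second}", unit_tag)]


# The izaafat construction used as the example in the text.
print(tag_separately("aabi", "hayaat"))  # two tokens, two tags
print(tag_as_unit("aabi", "hayaat"))     # one token, one tag
```

Under the first option the annotator still has to choose a tag for the izaafe-marked first word; under the second that choice disappears, at the cost of losing the constituent categories.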

    1. Identification of Proper Nouns (PNI)

The noun as a POS category does not in itself pose any problem, but the inclusion of common noun, proper noun and Nloc as subcategories has proved confusing, and the noun has thus become the most debated category in the tagset. At times it is very difficult to distinguish between NN and NNP by relying on traditional notions. For instance, mevIh (fruit) does not refer to any specific fruit and is not the name of any fruit; hence it is NN. By this logic, amb (mango) or tsuunTh (apple), being names of specific fruits, should be NNP, yet they are considered NN. To address this issue properly, one needs concrete standards that can be generalized with the fewest exceptions. It has therefore been posited that NN is a noun that denotes a class of things, concrete or abstract (a set or sub-set), whereas NNP denotes an instance of a class (a member of a set or sub-set). For example, mango is a name, but the name of a class with different varieties or instances such as Alphanso, Baadaamii, etc. Similarly, different varieties of apple such as Amriican, Chomuuriyah, Deylshas, etc. are instances of the class apple; hence ‘Chomuuriyah’ is NNP and ‘tsuunTh’ is NN. This position solves the problem to some extent but raises other questions, such as whether zuun (moon) is NN or NNP, given that other planets also have moons with specific names, so that zuun is a class, not an instance. Likewise, in the above examples, Alphanso mangoes or Chomuuriyah apples are also names of classes rather than particular instances. By the same logic, the Alphanso mango tree is also a class of different Alphanso mango trees, not an instance. Determining whether a thing is a class or an instance is in fact a very hard ontological problem. Looking from the top of an ontological tree to the bottom, an object such as the Alphanso mango appears to be a class, but looking at the same object from the bottom up, it appears to be an instance.
It is hard to determine where one should stop dividing a class into subclasses, sub-subclasses, etc. in order to take one level as the level of instances. The problem of indeterminacy comes to the fore as soon as the issue of definiteness creeps into the already vexed problem of looking for instances. One can ask whether the notion of NNP incorporates the notion of definiteness or is independent of it. For example, the person name “Umar” is no doubt an NNP but is not definite, as “Umar” can be Umar Farooq (Hurriyat leader), Umar Abdullah (CM), Baba Umar (journalist), Umar ibni Khataab (second Khaliifah), Umar Gull (cricketer) or any other Umar. Hence person names can themselves be a class (indefinite) rather than an instance (definite). To ease the problem, one can keep definiteness apart from the notion of an instance while looking for ‘instances’ within a class. Only then may one be able to distinguish between NN and NNP; otherwise the status of person names or some place names as NNP becomes questionable. Nevertheless, one can propose various diagnostic features, as given below, to help determine whether a noun is NN, NNP or NST.

1.  If a noun can’t be pluralized and quantified it is likely to be NNP.

2.  If there is room to ask a question such as “which thing?” about the thing under consideration, it is likely to be NN; if there is no room for such a question, it is likely to have NNP as its POS category. For example, aaftaab (sun) is a specific instance of a star and hence NNP: there is no room for asking the question “which sun?”
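The two diagnostics can be combined into a rough decision aid. The sketch below is hypothetical: it reads the first diagnostic as "neither pluralizable nor quantifiable", treats the two tests as equally weighted, and returns "UNDECIDED" when they disagree. The yes/no judgments themselves still have to come from the annotator.

```python
def guess_noun_tag(can_pluralize, can_quantify, admits_which_question):
    """Suggest NN or NNP from the two diagnostics; a heuristic, not a rule."""
    # Diagnostic 1: a noun that can be neither pluralized nor quantified
    # is likely to be NNP (assumed reading of the guideline).
    nnp_by_form = not can_pluralize and not can_quantify
    # Diagnostic 2: if "which X?" cannot sensibly be asked, NNP is likely.
    nnp_by_question = not admits_which_question

    if nnp_by_form and nnp_by_question:
        return "NNP"
    if not nnp_by_form and not nnp_by_question:
        return "NN"
    # The diagnostics disagree: leave the decision to the annotator.
    return "UNDECIDED"


# Illustrative judgments (assumed here): "which sun?" is not askable for
# aaftaab, whereas "which apple?" is askable for tsuunTh.
print(guess_noun_tag(False, False, False))  # aaftaab-like noun
print(guess_noun_tag(True, True, True))     # tsuunTh-like noun
```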


    1. Named Entities (NEs)

As the name itself suggests, NEs include the names of companies, institutions, persons, places and things that are multiword in nature, for example vaziir-i aazam manmohan singh, islamic university of science and technology, Microsoft India Private Limited, etc.

The problem with named entities is that they form long chains of words which in isolation refer to nothing specific but as a whole refer to specific entities. As a whole, therefore, they are multiword proper nouns, though their constituent words can be of any category. Izaafat constructions can also be among their constituents, as in vaziir-i aazam manmohan singh.

The question arises of how to tag them at this level. There are two options: one is to tag each constituent word with its own POS category; the other is to tag all the constituent words with the tag used for proper nouns, because as a whole they are proper nouns.


    1. Compound Words (CW)

Compound words are also problematic multiword expressions, but they are comparatively simpler than izaafat constructions and named entities in that, unlike named entities, they cannot have more than two constituent words and, unlike izaafat constructions, they have no internal linker such as izaafe. However, they have their own complexities. They can be endocentric, with compositional meaning, or exocentric, with non-compositional meaning. Like the outward drift in the meaning of exocentric compounds, there can also be heterogeneity in their POS compositionality, i.e. words of two different POS categories can form a new word which may or may not have the category of one of its constituents. For example: Akis/QT akh/QT (pronoun), pA:n’/?? paanai/PR (pronoun), shinI/?? baal/N (noun), gA:r/JJ zimIdaar/JJ (adjective), kheyn/V cheyn/V (noun), khosh/JJ nasiib/N (adjective), zorI/RB zorI/RB (adverb), As’/?? As’/?? (adverb), heyokun/V kheyth/V (verb), etc.

Of all the compounds, compound nouns and verbs are by far the most productive in Kashmiri and pose the most challenges to the annotator. In some of the compounds shown above, the constituent words without POS tags (marked ??) are intuitively difficult to classify, as their original form has been changed and reduced to a sort of bound form (like pA:n’/?? and As’/??). However, some words (like shinI/??) seem to have assumed the oblique form, the form a noun takes under the influence of a following postposition, as in shinI/N peyTh’/PSP, where “shiin” changes to “shinI”. Such forms therefore have an independent existence, unlike pA:n’/?? and As’/??, which are bound to these contexts (compounds) and do not exist outside them. Such forms have been classified and tagged on the basis of their original form vis-à-vis category.



In addition to such problems, compounds in general are like multiword expressions. It is therefore important to decide whether the constituent words of a compound should be joined by some convention, e.g. a dash (-), to form a single token, or kept as they are (two tokens). If the former approach is followed, the compound needs to be tagged as a whole unit, which ignores the categories of its constituent words. If the latter approach is followed, the constituents need to be tagged separately according to their respective categories, which ignores the category of the compound as a whole. One must also consider whether the POS information of the individual constituents or that of the entire compound is more important at this level of annotation. If both are considered important, one must also ask whether capturing both is achievable at this level.
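The trade-off between the two approaches can be illustrated with a hypothetical record that stores both levels of information and projects either view from it. This is only a sketch: the class and its method names are assumptions made here, and it does not describe the SSF format or the annotation actually carried out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CompoundAnnotation:
    # (word, tag) pairs; a tag of None stands for the unclear "??" cases.
    constituents: List[Tuple[str, Optional[str]]]
    # Category of the compound as a whole.
    compound_tag: str

    def as_single_token(self, joiner: str = "-") -> Tuple[str, str]:
        """Joined view: one token, constituent categories are dropped."""
        form = joiner.join(word for word, _ in self.constituents)
        return (form, self.compound_tag)

    def as_separate_tokens(self) -> List[Tuple[str, Optional[str]]]:
        """Split view: two tokens, the compound's category is dropped."""
        return list(self.constituents)


# khosh/JJ nasiib/N forms an adjective, as in the examples above.
khosh_nasiib = CompoundAnnotation([("khosh", "JJ"), ("nasiib", "N")], "JJ")
print(khosh_nasiib.as_single_token())
print(khosh_nasiib.as_separate_tokens())
```

Each view discards exactly the information the other one keeps, which is the dilemma described above.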


    1. Numeric Dates (NDs)

Various instances of dates occur in the corpus. They, too, are like named entities. As they represent particular points in time, it is possible to label them as Nloc, but whether they should be classified under Nloc is debatable. A date is the name of a particular point in time, like the name of a place or a person; it is not unnamed like the typical temporal Nloc (adverbs of time). Therefore, dates are classified under proper nouns and not under Nloc. However, a problem arises when they are followed by a case marker that occurs as a separate token, e.g. 16 January 1950 has, 1847 huk, 1947 yas (۱۹۴۷یَس ،۱۹۴۷ہُک، ۱۹۵٠ہَس،). In these examples, -has, -huk and -yas are basically bound forms, but they cannot be attached to numerals naturally. Similarly, dates such as 1850 (۱۸۵٠) sometimes occur with the symbol (ء) for iisvi (AD) and the case marker yas (یَس), e.g. in 1850 iisvi yas (۱۸۵٠ء یَس) manz. There are also occurrences in the corpus where the initial and final dates of a period are enclosed in brackets and the case marker occurs outside the brackets, e.g. تام ہَس (۱۹۴۷- ۱۹۵٠) or (1842-1857) as taam. Such cases, though typically a tokenization problem, have been handled at the POS level: they were left as they were at the time of tokenization because they came to the fore only during the annotation process.
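How a year can be separated from a following case marker is sketched below for romanized tokens only. The marker list is limited to the three forms cited above (-has, -huk, -yas); the function name, the four-digit year pattern and the restriction to romanized input are assumptions made for illustration, not the procedure used in the annotation.

```python
import re

# Case markers cited in the text; the real inventory may be larger.
CASE_MARKERS = ("has", "huk", "yas")

# A four-digit year, optional whitespace, then one of the case markers.
_DATE_WITH_MARKER = re.compile(
    r"^(\d{4})\s*(" + "|".join(CASE_MARKERS) + r")$"
)


def split_date_marker(token):
    """Return (year, marker) if the token is a year plus a case marker."""
    match = _DATE_WITH_MARKER.match(token)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_date_marker("1947yas"))   # year and marker written together
print(split_date_marker("1847 huk"))  # year and marker separated by a space
print(split_date_marker("1947"))      # no marker: nothing to split
```

Once split, the year can receive the proper-noun tag as described above, and the marker can be tagged on its own.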

    1. Underspecified Verbs (UVs)

As mentioned in subsection 2.2.6 (d), the fine-grained sub-classification of verbs has been avoided because it is based on the notion of finiteness (finite, non-finite and infinitive verbs), and the notion itself is controversial at a deeper level. It is not only tense that contributes to finiteness; sometimes aspect, mood and agreement determine it. The morpho-syntactic information that constitutes finiteness is usually distributed over two or three verb tokens (auxiliary and main verbs). For instance, in su chu batI khevaan (he is eating rice), the auxiliary verb (chu) carries tense (present) and PNG agreement (mas.SG.3rd), while the main verb (khevaan) carries aspectual information (progressive) in addition to the lexical semantics. In such sentences it would be absurd to say that the auxiliary verb (chu) is finite and the main verb (khevaan) is nonfinite. If tense alone determined finiteness, all –tense verbs would be nonfinite. A very important question then arises: are perfective sentences like “arshidan kheyo batI” (arshid ate rice) and imperative sentences like “tsI khe batI” (you eat rice) basically nonfinite clauses? Is it not the case that only de-verbal verb forms such as participles and gerunds are basically non-finite?

It is evident that the main verb in the sentence su chu batI khevaan (he is eating rice) is also finite, despite the fact that it does not carry tense (-tense) or PNG information (-PNG). It is verbal in nature and plays a key role in the sentence by providing the lexical semantics of the action and its aspectual information, unlike nonfinite verbs (such as its khey-th form), which are de-verbal in nature and thus play a marginal (modifying) role in the sentences in which they occur. Finiteness therefore needs to be determined by taking into account all the verb tokens of a sentence (except de-verbal ones), irrespective of whether they are contiguous or non-contiguous. In view of this complicated nature of finiteness, verb classification has been kept underspecified at this stage, and only two types (main and auxiliary) have been posited, so as to avoid resolving the finiteness puzzle now and to postpone it to the next level of annotation, i.e. the chunking level.
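The idea that finiteness is a property of the pooled verb tokens rather than of a single token can be sketched as follows. The feature labels, the class name and the rule that any one pooled feature suffices are assumptions made for illustration; the text itself deliberately leaves the finiteness question open until the chunking level.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set

# Features that the text says can contribute to finiteness.
FINITENESS_FEATURES = {"tense", "aspect", "mood", "agreement"}


@dataclass
class VerbToken:
    form: str
    features: Set[str] = field(default_factory=set)
    deverbal: bool = False  # participles, gerunds and similar forms


def pooled_features(verbs: Iterable[VerbToken]) -> Set[str]:
    """Union of the features of all verb tokens except de-verbal ones."""
    pooled: Set[str] = set()
    for verb in verbs:
        if not verb.deverbal:
            pooled |= verb.features
    return pooled


def clause_is_finite(verbs: Iterable[VerbToken]) -> bool:
    """Assumed rule: finite if the pooled tokens carry any such feature."""
    return bool(pooled_features(verbs) & FINITENESS_FEATURES)


# su chu batI khevaan: the auxiliary carries tense and agreement,
# the main verb carries aspect; contiguity of the tokens plays no role.
progressive_clause = [
    VerbToken("chu", {"tense", "agreement"}),
    VerbToken("khevaan", {"aspect"}),
]
print(clause_is_finite(progressive_clause))
```

Neither token needs to be labelled finite or nonfinite on its own, which is why only the main/auxiliary distinction is recorded at the POS level.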



    1. Non-manner Adverbs (NMVs)

The notion of adverb has been simplified by restricting it to manner adverbs and placing time and location adverbs under noun as Nloc. However, there are many words that seem to be adverbs but are neither manner nor locative adverbs. Since only the manner adverb has been posited in the current POS tagset, the label needs to be neutralized and expanded to accommodate both manner and non-manner adverbs, the latter signifying reason, frequency and some quantification, as well as sentential adverbs, e.g. kyaazi (why), beyi (again), dohdish/ dohai/ rozaanI (everyday), hameshI (forever), zyadI (more), vaariya (lot), kam (less), shaayad (perhaps), yaqiinan (surely), lA:ziman (necessarily), etc.

The rationale for including reason, frequency, unique quantification and sentential adverbs among adverbs is well grounded. The reason word kyaazi (why) modifies the whole clause, like the sentential modifiers shaayad (perhaps), yaqiinan (surely) and lA:ziman (necessarily). Frequency words like dohdish/ dohai/ rozaanI (everyday) and hameshI (forever) resemble manner adverbs, expressing a sort of temporal manner. Not all quantifiers modify verbs, but some certainly do. This role is most evident when they are used with intransitive verbs, e.g. zyadI (more) in the sentence su shong zyadI (he slept more), kam (less) in the sentence kam osun (s/he laughed less), vaariya (lot) in the sentence vaariya kheyvun (s/he ate a lot), etc.

Another problem related to adverbs is that some are multiword, like compounds, though this is far from compounding. It mostly arises from writing convention and can be handled like other multiword expressions or taken care of at the time of tokenization. For instance, certain adverbs are composed of two tokens, in which the first token is an adjective or noun and the second is mostly a postposition, e.g. Thiikh pA:Th’ (nicely), khOsh pA:Th’ (happily), vaarI pA:Th’ (safely), dor pA:Th’ (strongly), khushii saan (with happiness), etc.

Not all multiword, or rather multi-token, adverbs are problematic at this level of annotation, as they can be handled like any other multiword expression. However, some, in which the POS status of the constituent tokens is not clear, are really challenging. For example, the status of pA:Th’ in the expressions Thiikh pA:Th’ (nicely), khOsh pA:Th’ (happily), vaarI pA:Th’ (safely) and dor pA:Th’ (strongly) is not clear. Although it has been treated as a postposition in some previous annotation work, it is more likely a bound form. It is no doubt a separate token in the corpus, but intuitively it is not a word; rather, it is part of the preceding word and resembles an adverbial morpheme, except in instances like misaali pA:Th’ (for example), where it is clearly a postposition, though such instances are very infrequent in the corpus. The instances in which it appears to be a bound form are highly frequent in the corpus and were not handled at the time of tokenization, where a bound form is usually attached to the preceding token (see chapter-III). It was left as it is in the tokenization process because of its high frequency and unclear status.



    1. Paradox in POS Annotation

As mentioned above, form versus function is one of the important dualities. It is crucial for tagset design as well as for corpus annotation. In theory, one needs to stick to one aspect and carry out the entire annotation task on that principle, without occasionally switching to the alternative. In practice, however, this seems implausible: annotation is not carried out in isolation for its own sake, but its product is to be used for some bigger task ahead, and thus one cannot ignore the formal aspect of a word and focus entirely on its functional aspect, or vice versa, as theory demands. Both aspects need to be taken into account, and one needs to weigh the use of each aspect in the task at hand and in the tasks ahead, so that a particular aspect can be ignored if it is not very important. For instance, on the one hand, demonstratives would not have been a POS category had only the formal approach been followed, since by form demonstratives are actually pronouns but by function they are demonstratives: the word su (he) is a pronoun in the sentence su aav (he came), but the same word is a demonstrative in the sentence su shur aav (that kid came). On the other hand, gerunds would have been nouns rather than verbs had their formal aspect been ignored and only their functional aspect taken into account. It therefore seems contradictory that the functional aspect of a word was taken into account in positing the demonstrative as a POS category, whereas the same functional approach was set aside in positing gerunds as a subclass of verbs. Such decisions are a matter of expertise and experience, and one need not follow theory strictly as long as departing from it does not undermine the goal of the task in hand. In the present task, i.e. POS annotation, the goal is to lay the foundation of a dependency treebank, which can be further augmented with anaphoric or other discourse-level information. Thus, this dual or hybrid approach to corpus annotation is justified. Nevertheless, opting for hybridity in corpus annotation under the influence of some practical usage is indeed paradoxical, as capturing the functional aspect of words in the corpus is an alternative way of looking at data and is thus the essence of corpus linguistics vis-à-vis corpus annotation.
