Guidelines for POS Annotation
Some important guidelines that were framed and followed for POS tagging of KashCorpus are given below:
-
All NEs are essentially proper nouns (NNPs), as they refer to specific entities that have been named; however, a name may be composed of more than one word, each with a different POS category. NEs are actually phrases rather than words, but they need to be handled like words at this level. Since NEs as a whole are NNPs, all the words composing an NE are tagged as NNP, e.g. the NE “vaziiri aazam manmohan singh” is tagged as “vaziiri/NNP aazam/NNP manmohan/NNP singh/NNP”, so that a chain of NNPs is obtained which can be easily identified in the annotated corpus. This might look like an absurd decision, and one can argue that the original POS information is suppressed; but, as mentioned earlier, it is a strategy to evade the problem at this level and to keep track of the problem items so that they can be handled at another, intermediate level that deals with multiword expressions (MWEs).
-
CWs are handled slightly differently, though like NEs they too are treated as MWEs. They are composed of only two words and mostly include compound nouns, compound adjectives, compound adverbs (reduplications), etc. The words which form a compound are assigned their respective POS tags, but in a specialized form with ‘C’ to indicate a compound. For example, compound nouns like “shinI baal, zaril zaal, masI vaal” and compound adjectives like “khOsh qIsmath, gA:r zoruurii” are tagged as “shinI/NNC baal/NNC, zaril’/NNC zaal/NNC, masI/NNC vaal/NNC” and “khOsh/JJC qIsmat/NNC, gA:r/JJC zoruurii/JJC”, respectively. Unlike the treatment of NEs, the overall POS information of the compound word has been suppressed, but the ‘C’ marker has been added to the tag to make the compounds identifiable and extractable. It must be noted that compounding information has not been captured in this way for verbs, given that verbs involve other complications which are handled at the chunking level. It has also been avoided for pronouns, as they have very few compound forms.
-
ICs are handled like CWs: the words which are linked together by izaafe are tagged with their respective POS categories, ignoring the change that the izaafe brings about in the word to which it is bound, e.g. the ICs “aabi hayaat, khuuni jigar, habiibi paakh, hoquuqo faraayiz” are tagged as “aabi/NNC hayaat/NNC, khuuni/NNC jigar/NNC, habiibi/NNC paakh/JJC, hoquuqo/NNC faraayiz/NNC”. Here, information about izaafat has been suppressed because, as in many other cases, it is not much needed for sentence parsing, and ICs behave more like compound words.
-
NCs are actually a kind of numeral NE and are hence tagged as NNPs, but the other complications associated with them have been handled by joining the bound forms to the numeric date with a dash (for details, see the discussion above), e.g. numeric dates like “1845 has manz”, “1845 ء yas manz”, “(1845 ء) yas manz”, etc. are tagged as “1845-has/NNP manz/PSP”, “1845/NNP ء-yas/NNP manz/PSP” and “(/PUNC 1845/NNP ء-yas/NNP )/PUNC manz/PSP”. It should be noted that some unexpected tokenization problems, like the one discussed under the issues mentioned above, have had to be tackled even at this stage.
-
As mentioned earlier, de-verbal non-finite forms such as perfective participles (-ith forms), e.g. kheyth, pArith, shongith, bihith, etc.; progressive participles (-vun forms), e.g. zeyvvun, shongvun, natsvun, asvun, gindvun, khasvun, bozvun, etc.; and gerundial forms, e.g. shongun, shongnI, shongas, shongnan, shongnuk, vothnI, natsnas, etc., have been assigned the underspecified POS tag VM. Verbal non-finite forms (infinitives like shongun) and verbal finite forms (main verbs) have also been assigned the same tag, i.e. VM. It is important to mention that no distinction has been maintained at this level; the reasons for this are discussed above.
-
Pronominal forms which are followed by a verb or a postposition have been assigned the POS tag PRP, but if the same form is followed by an intensifier, quantifier, adjective or noun, it is a demonstrative and is tagged as DMD.
-
Adverbs which are not essentially manner adverbs, such as those expressing frequency, quantification and reason (beyi, vaariyah, kyaazi, etc.), have also been tagged as RB, like the manner adverbs.
-
In addition to traditional adverbs of time and space, some vague time-words like subhas (in the morning), shaamas (in the evening), dohli (in the day) and ‘roth’ in “roth kyuth” are also potential NSTs, as long as they provide a temporal location for an action/event. But when they are inflected with a genitive marker, as in subhuk (of the morning), shaamuk (of the evening), etc., and no longer provide a temporal location for an action/event, they cease to be NSTs and are NNs.
-
Usually, NSTs do not take demonstratives, are marked with locative, ablative or terminative markers, and cannot be pluralized or quantified.
-
Besides, words like “kA:shur” or “kashmiri” are most likely to be NNP when used, in isolation or with other words, to refer to a language, but are likely to be NN when used in isolation to refer to people (koshur, PunjA:b’, BangA:l’, etc.). However, they are likely to be JJ when used with other words, as in koshur saqaafat (Kashmiri culture), koshur geyav (Kashmiri ghee), etc.
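The point of tagging every word of an NE as NNP, and of adding the ‘C’ marker to compounds, is that such items can later be pulled out of the annotated corpus. A minimal sketch of how this might be done is given below, assuming the annotated text is stored as whitespace-separated word/TAG tokens as in the examples above. The function names, the `COMPOUND_TAGS` set (limited to the two ‘C’ tags that appear in the examples) and the sample string (a concatenation of the examples above, not a real Kashmiri sentence) are illustrative assumptions, not part of the KashCorpus tooling.

```python
# Only the 'C' tags that occur in the examples above; other compound
# tags may exist in the full tagset (assumption: this set is incomplete).
COMPOUND_TAGS = {"NNC", "JJC"}


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if not sep:  # no slash found: keep the word and mark it untagged
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


def nnp_chains(pairs, min_len=2):
    """Return runs of at least min_len consecutive NNP-tagged words."""
    chains, current = [], []
    for word, tag in pairs:
        if tag == "NNP":
            current.append(word)
            continue
        if len(current) >= min_len:
            chains.append(" ".join(current))
        current = []
    if len(current) >= min_len:  # flush a chain that ends the text
        chains.append(" ".join(current))
    return chains


def compounds(pairs):
    """Return adjacent two-word pairs in which both words carry a 'C' tag."""
    found = []
    i = 0
    while i < len(pairs) - 1:
        (w1, t1), (w2, t2) = pairs[i], pairs[i + 1]
        if t1 in COMPOUND_TAGS and t2 in COMPOUND_TAGS:
            found.append((w1, w2))
            i += 2  # compounds consist of exactly two words
        else:
            i += 1
    return found


sample = ("vaziiri/NNP aazam/NNP manmohan/NNP singh/NNP "
          "khOsh/JJC qIsmat/NNC 1845-has/NNP manz/PSP shinI/NNC baal/NNC")
pairs = parse_tagged(sample)
print(nnp_chains(pairs))  # ['vaziiri aazam manmohan singh']
print(compounds(pairs))   # [('khOsh', 'qIsmat'), ('shinI', 'baal')]
```

Two limitations of this sketch follow directly from the guidelines: two NEs that happen to be adjacent would be merged into one NNP chain, and single-word NNPs such as 1845-has are skipped unless `min_len` is set to 1.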
So far, the linguistic information that emerged from the analysis-cum-annotation process has been presented in the form of a short guideline for problematic cases. The following sub-section presents the statistical information, which is another outcome of the annotation process.
-
Statistical Results
As mentioned earlier, the data has been divided into four sets for annotation. In each dataset, the words have been classified into eleven classes. Table 1 shows the cumulative frequency of each POS category across all four datasets, while Fig. 14.a shows the share of each POS category as a percentage. The figure has been prepared from the frequencies in Table 1 to show the contribution of each POS category to KashCorpus and to compare the percentages of the categories, in order to identify the most frequent and the least frequent POS categories.
Figure 14.a. Total quantum of POS categories in terms of percentage (%)
| S. No | POS Type (x) | Data Set-1 (f1) | Data Set-2 (f2) | Data Set-3 (f3) | Data Set-4 (f4) | Grand Total fx = (f1+f2+f3+f4) |
|---|---|---|---|---|---|---|
| 1 | N | 953 | 2296 | 868 | 1042 | 5159 |
| 2 | V | 793 | 1045 | 597 | 278 | 2713 |
| 3 | PP | 190 | 665 | 210 | 285 | 1350 |
| 4 | RD | 394 | 345 | 251 | 190 | 1180 |
| 5 | JJ | 176 | 384 | 212 | 185 | 957 |
| 6 | PR | 333 | 234 | 176 | 196 | 939 |
| 7 | CC | 169 | 313 | 208 | 208 | 898 |
| 8 | RP | 207 | 115 | 99 | 96 | 517 |
| 9 | DM | 50 | 146 | 119 | 123 | 438 |
| 10 | QT | 68 | 183 | 64 | 57 | 372 |
| 11 | RB | 48 | 48 | 96 | 157 | 349 |
| Total f = | | 3381 | 5774 | 2900 | 2817 | 14872 |
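The percentages plotted in Fig. 14.a are derived from the grand totals in Table 1. Since the figure itself is not reproduced here, the following sketch recomputes them from the table instead of reading them off the figure. It assumes the share of a category is simply its grand total fx divided by the overall total; the variable names are illustrative.

```python
# Frequencies per POS category in Data Sets 1-4, copied from Table 1.
table = {
    "N":  (953, 2296, 868, 1042),
    "V":  (793, 1045, 597, 278),
    "PP": (190, 665, 210, 285),
    "RD": (394, 345, 251, 190),
    "JJ": (176, 384, 212, 185),
    "PR": (333, 234, 176, 196),
    "CC": (169, 313, 208, 208),
    "RP": (207, 115, 99, 96),
    "DM": (50, 146, 119, 123),
    "QT": (68, 183, 64, 57),
    "RB": (48, 48, 96, 157),
}

# Grand total per category: fx = f1 + f2 + f3 + f4.
grand_totals = {tag: sum(freqs) for tag, freqs in table.items()}

# Overall number of tagged words across all four datasets.
total = sum(grand_totals.values())

# Share of each category as a percentage of the overall total.
shares = {tag: 100.0 * fx / total for tag, fx in grand_totals.items()}

# Print the categories from most frequent to least frequent.
for tag in sorted(shares, key=shares.get, reverse=True):
    print(f"{tag:>2}  {grand_totals[tag]:>5}  {shares[tag]:5.1f}%")
```

Recomputing the row and column sums in this way also serves as a consistency check on the table: the derived overall total should equal the 14872 given in the last row.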