Table.3 Sample of Unclean KashCorpus
-
Copyright Issue
Copyright legislation is one of the serious problems in building and using large-scale text corpora, as authors and publishers protect their rights over their texts through copyright laws. The main concern for the corpus builder is that any text to be digitized and included in the corpus is likely to be under copyright protection, so permission has to be obtained for its use. If the corpus is going to be used in developing systems or applications for commercial purposes, one has to obtain permission and enter into an agreement with the authors or publishers in which a royalty is fixed for each text. However, if the corpus is to be used for research purposes, there is hardly any need to obtain permission or enter into any agreement. Since the current KashCorpus is meant not for any commercial purpose but only for research, permission for using the texts has not been taken. The next section describes the procedures involved in polishing the corpus.
-
Preprocessing
Once the inputting of the text is finished and the corpus is ready, it cannot be used directly for annotation; it first has to go through some further manipulation. For instance, the current KashCorpus was manually sanitized, normalized and tokenized before it was used for POS annotation. All the manipulations applied to the corpus (manually or automatically) prior to annotation can be collectively called preprocessing.
-
Corpus Sanitization
Corpus cleaning involves proofreading the digitized corpus files for typos, spelling errors and grammatical mistakes. During this process it is necessary to remain faithful to the text, as what one takes to be a mistake on the part of the writer could in fact be a variation. The sources of the errors, and the way they were corrected in the process of sanitization, are given below:
-
The data inputter's limited expertise in the Kashmiri script, in spite of good typing experience, resulted in many errors and consequently in a more unclean corpus. Moreover, it was found that the highest-scoring day (in terms of number of words typed per day) was also the day on which the data inputter committed the most errors: the percentage of erroneous words was higher than on days with an average word count. Taking the required time therefore seems to be a good strategy, as hurrying to finish more and more words per day can increase the percentage of erroneous words.
-
Sometimes the poor quality of the print and errors in the original text led to misjudgment of letters or words and consequently to mistakes on the part of the inputter.
-
The Kashmiri script uses many diacritics to represent different phonetic subtleties of the language. Sometimes a diacritic appears above or below one character when it actually belongs to the preceding or following character. In the text it is therefore sometimes hard to decide which letter a diacritic belongs to unless the native speaker’s intuitions are taken into consideration. These apparently misplaced diacritics generally confuse the data inputter and result in many spelling mistakes, i.e. an unclean corpus. For instance, in the word مخالفتکُ (mukhaalfatku), given above in Table 2, the ‘pesh’ (ُ) diacritic appears misplaced on the final letter of the word (ک), whereas it actually belongs on the preceding letter (ت), and the actual word is مخالفتُک (mukhaalfatuk). Such mistakes are regular and hence quite predictable.
-
Various key combinations are used to input different characters, e.g. pressing Shift+P types (ُ). Sometimes, however, only one key is pressed (e.g. only ‘P’) and an entirely different character (e.g. پ) gets typed. This has resulted in various errors, which are more or less predictable.
-
It was decided to use some diacritics in a consistent way, despite the variations in the text. When two diacritics are typed contiguously (one after the other), where the first joins two consonants to function as a unit and the second represents the vowel on that unit, it was decided to type the sequence in this order: first the first consonant, second the diacritic representing the vowel, third the second consonant, and fourth the conjoining diacritic. For instance, in the words (دیُٛمُت) & (تیُٛت) the splitting occurs when the two diacritics ( ُ & ٛ ) come contiguously. To avoid this, the vowel-representing diacritic ( ُ ) is typed after the first consonant (د), which forms a unit with (ی) with the help of the linking and shortening diacritic ( ٛ ). The above words were therefore corrected as دُیٛمُت & تُیٛتھ, and this pattern was followed throughout the cleaning process.
-
It was also decided not to treat an aspirated consonant as a unit with the diacritic placed after that unit, but instead to put the diacritic on or under the letter representing the consonant. For instance, in the word (چُھ), the vowel-representing diacritic ( ُ ) is actually associated not only with the word-initial letter (چ) but with the unit (چھ). However, it is written after the first letter (چ), so that (چُھ) is typed everywhere instead of (چھُ).
-
It was also decided to put the diacritic representing a vowel under the unpronounced pseudo-character, instead of directly under the preceding letter representing a consonant, which is not a writing convention. For instance, in the words (تہٕ) & (تہِ), the letter (ہ) remains unpronounced and is used as a supporting character for the diacritics (ٕ & ِ ), so that they are written as (ہٕ & ہِ), respectively. These are mere orthographic conventions and have nothing to do with phonological rules.
All these types of errors were rectified during cleaning, keeping in mind the principle of faithfulness to the text along with some additional decisions to maintain consistency. The next sub-section describes the normalization of the corpus.
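The two ordering conventions described above (vowel diacritic before the second consonant of a linked unit, and vowel diacritic before the aspiration letter) are regular enough to be sketched as rewrite rules. The following Python fragment is only an illustration and not the procedure used for KashCorpus, which was cleaned manually. The function name, the restriction of the second consonant to (ی), and the choice of Unicode code points (U+064F for pesh, U+065B for the linking diacritic, U+06BE for the aspiration letter) are assumptions made for this example.

```python
import re

# Code points assumed for this illustration.
YEH = "\u06CC"                      # second consonant of the unit in the examples
LINKER = "\u065B"                   # conjoining / shortening diacritic
ASPIRATE = "\u06BE"                 # letter marking an aspirated consonant
VOWEL_MARKS = "\u064E\u064F\u0650"  # zabar, pesh, zer (assumed subset)

# consonant + YEH + vowel + LINKER  ->  consonant + vowel + YEH + LINKER
UNIT_RULE = re.compile(f"{YEH}([{VOWEL_MARKS}]){LINKER}")
# consonant + ASPIRATE + vowel      ->  consonant + vowel + ASPIRATE
ASPIRATE_RULE = re.compile(f"{ASPIRATE}([{VOWEL_MARKS}])")


def normalize_diacritic_order(word: str) -> str:
    """Move a vowel diacritic in front of the letter it was typed after (hypothetical helper)."""
    word = UNIT_RULE.sub(lambda m: m.group(1) + YEH + LINKER, word)
    word = ASPIRATE_RULE.sub(lambda m: m.group(1) + ASPIRATE, word)
    return word


if __name__ == "__main__":
    # The first example word from the text, written with explicit code points.
    typed = "\u062F\u06CC\u064F\u065B\u0645\u064F\u062A"
    print(normalize_diacritic_order(typed))
```

Rules of this kind would only cover the predictable cases; errors caused by poor print quality or by genuine spelling variation would still need a native speaker's judgment.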
-
Corpus Normalization
Corpus normalization involves all the necessary manipulations of the corpus not covered under cleaning and tokenization. It primarily involves filling in omitted diacritics. As mentioned above, the Kashmiri script uses fourteen diacritics (e.g. َ ِ ُ ٗ ٖ ٔ ٕ) to represent different phonetic subtleties of the language. Urdu also uses a modified Perso-Arabic script but drops the three crucial vowel-representing diacritics, namely zer, zabar & pesh (َ ِ ُ). Kashmiri writers, like Urdu writers, also tend to drop these three diacritics, though the tendency is not as severe as in Urdu texts, where all diacritics are left out. The remaining diacritics are essential ones, specific to Kashmiri, and cannot be inferred from the context. Dropping diacritics creates a big text normalization problem that needs to be taken care of: all diacritics need to be restored in the text, or at least those that are crucial for word identification and disambiguation. This has been done for the current KashCorpus; all the crucial diacritics have been inserted manually. In practice, corpus cleaning and normalization were carried out simultaneously.
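Since the diacritics were restored by hand, a small helper that lists the tokens carrying no diacritic at all could help a proofreader decide where to look first. The sketch below is an assumption of how such a check might be written, not a tool used in this work. It treats every Unicode nonspacing mark (category `Mn`) as a diacritic, so letters that encode a vowel sign in a single precomposed code point would not be counted. The function names are hypothetical.

```python
import unicodedata


def has_diacritic(token: str) -> bool:
    """True if the token contains at least one nonspacing combining mark."""
    return any(unicodedata.category(ch) == "Mn" for ch in token)


def flag_undiacritized(text: str) -> list[str]:
    """Return the whitespace-delimited tokens that carry no combining mark."""
    return [tok for tok in text.split() if not has_diacritic(tok)]


if __name__ == "__main__":
    # Two tokens written with explicit code points: the first has diacritics,
    # the second has none and is therefore flagged for manual review.
    sample = "\u0645\u064E\u0636\u0645\u0648\u0657\u0646 \u0646\u06AF\u0627\u0631"
    for token in flag_undiacritized(sample):
        print(token)
```

A flagged token is not necessarily wrong, since many words need no diacritic; the list only narrows down what has to be checked manually.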
-
Tokenization
A token is a string of characters delimited by spaces, and tokenization is a preprocessing procedure that removes mismatches so as to achieve a one-to-one mapping between tokens and words or major grammatical categories, either by concatenation (joining) or by segmentation. A natural one-to-one correspondence between a word (simple or complex) and a token is generally observed in isolating and inflectional languages, but hardly in agglutinating languages, where one token usually corresponds to many grammatical words (POS categories). However, in some inflectional languages, particularly those which use a modified Perso-Arabic script and borrow heavily from Persian, e.g. Urdu and Kashmiri, there is no one-to-one correspondence between words and tokens in many complex words (bound + free morphemes). The root of such a word is written as one token and the affix as a separate token. This is common practice in Kashmiri and Urdu and mostly occurs with affixes borrowed from Persian, as shown in Table.4. In Kashmiri, this practice is observed even in some simple words, where the two parts of the word are written as two separate tokens and the blank space between them may or may not represent a morphemic boundary. Moreover, case markers added to such words give rise to three-token words, as in examples 4, 5 & 6 of Table.5. The second part of the word therefore may or may not be a bound morpheme, but the third token is surely a bound morpheme. This orthographic convention of writing bound morphemes or parts of a word as separate tokens, to avoid unacceptable word shapes arising from the context-sensitive script, is called split-orthography. The Kashmiri-specific examples of split-orthography are mostly taken from the corpus sample given in Table.3.
The concept of space as a word boundary is weak in Urdu script (Durrani and Hussain, 2010), but it is far weaker in Kashmiri script. A zero-width non-joiner (a character that stops adjacent letters from joining without adding the visible space seen between Roman-script words) is primarily required to generate acceptable word shapes on the one hand and to join the various parts of a word and rectify the tokenization problem on the other. It has already been implemented for Urdu (G. Lehal, 2010), but for Kashmiri it has been implemented only very recently and is compatible with windows-08 only. Instead of the zero-width non-joiner, the underscore (_) has been used in the tokenization of the Urdu Dependency Treebank (R. Bhat 2012) and in the manual preprocessing of the Urdu and Kashmiri corpora at LDCIL (S.Bhat 2012). In the current work, however, the dash (-) has been used instead of the underscore or a zero-width character to join the parts of a word, as shown in examples 1-3 of Table 4 and 10-15 of Table 5.
S. No. | Root (Token I) | Affix (Token II) | Words | Urdu | Kashmiri
1 | aqIl (عقل) | -mand (مند) | aqIl–mand | عقل_مند | عقل-مَنٛد
2 | mazmoon | -nigar | mazmoon–nigaar | مضمون_نگار | مَضموٗن-نگار
3 | tA:liim | -yaaftah | tA:liim–yaaftah | تعلیم_یافتھ | تٲلیٖم-یافتہٕ
4 | khatIm | -shudah | khatIm–shudah | حاصل_شدہ | حٲصل شدہ
5 | hA:sil | -kardah | hA:sil–kardah | حاصل_کردہ | حٲصل کردٕ
6 | gonah | -gaar | gonah–gaar | گناہ_گار | گۄنہ گار
7 | qosuur | -vaar | qosuur–vaar | قصور_وار | قصوٗر وار
8 | khosh | -go | khosh–go | خوش_گو | خوش گو
9 | tarqii | -paziir | tarqii–paziir | ترقی_پزیر | ترقی پزیٖر
Table.4: Tokenization Problem Common in Kashmiri & Urdu
S.NO. | Root (Token I) | Affix (Token II) | Affix (Token III) | Words | Kashmiri
1 | butaan | Chi | - | buuTaan-chi | بھوٹان چِہ
2 | iisvii | k’n | - | isvii-k’n | عیسوی کٮ۪ن
3 | sekretrii | Yan | - | sekretrii-yan | سیکریٹری یَن
4 | zimI | dA:rii | yan | zimI-dA:rii-yan | ذِمہٕ دٲری یَن
5 | tariiqI | Kaar | k’n | tAriiqI-kaar-k’n | طریقہٕ کار کٮ۪ن
6 | fal | safI | kis | fal-safI-kis | فل سفہٕ کِس
7 | paanI | van’ | - | paanI van’ | پانہٕ ونۍ
8 | mukhaal | fAts | - | mukhaal-fAts | مخال فٔژ
9 | misrI | Kis | - | misrI-kis | مصرٕ کِس
10 | sapIz | mIts | - | sapIz-mIts | سپز-مِژ
11 | vAr’ | Yas | - | vAr’-yas | ؤرۍ-یَس
12 | Ak’ | sIY | - | Ak’-sIY | أکۍ-سٕے
13 | pAt’ | Mis | - | pAt’-mis | پٔتۍ-مِس
14 | anan | vA:l’ | - | anan-vA:l’ | اَنَن-وٲلۍ
15 | yithI | pA:Th’ | - | yithI-pA:Th’ | یتھہٕ-پٲٹھۍ
Table.5 Tokenization Problem Specific to Kashmiri
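The dash-joining convention illustrated in Tables 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure followed for KashCorpus, where the joining was done manually. The function name `join_split_tokens` and the idea of a hand-compiled set of second and third word parts are hypothetical, and the example uses the Roman transliterations from the tables for readability rather than the Kashmiri script.

```python
def join_split_tokens(tokens, bound_parts, joiner="-"):
    """Attach each token listed in bound_parts to the token before it.

    Consecutive bound parts attach to the same word, so a three-token
    word such as root + second part + case marker becomes one token.
    """
    joined = []
    for token in tokens:
        if joined and token in bound_parts:
            joined[-1] = joined[-1] + joiner + token
        else:
            joined.append(token)
    return joined


if __name__ == "__main__":
    # Hypothetical list of second/third parts, taken from the tables above.
    bound_parts = {"gaar", "vaar", "dA:rii", "yan"}
    print(join_split_tokens(["gonah", "gaar", "qosuur", "vaar"], bound_parts))
    print(join_split_tokens(["zimI", "dA:rii", "yan"], bound_parts))
```

A purely list-based join would over-apply wherever the same string also occurs as a free word, which is one reason a manual pass, as used here, or a context-aware model is needed in practice.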
-
Summary
In the current corpus linguistics scenario, with its boom in empirical studies, the development of a Kashmiri corpus is the need of the hour. It is needed not only to feed data-hungry research and development initiatives for the technological enhancement of the Kashmiri language, but also to carry out quantitative studies that uncover facts which have remained unexplored so far owing to the unavailability of a corpus. Although this chapter describes the building of KashCorpus from a specific point of view, i.e. for developing KashTreeBank, the corpus can also be used for other types of studies. This work is the most basic part of a larger attempt at resource creation to put the Kashmiri language on the map of current language technology. Like any other corpus-building endeavor, the creation of KashCorpus was not a straightforward process; issues such as the selection of text domains and the representativeness of the language in the selected samples had to be scrutinized and resolved before the actual work started. Other major problems included the unavailability of any online resource from which data could be obtained, the total absence of commercially important text domains such as medical and tourism texts, and the lack of well-trained data inputters who are well versed not only in the Perso-Arabic script in general but in the Kashmiri script and its Unicode-based input setup in particular. Data inputters usually use “Inpage” rather than Microsoft Office for Kashmiri input. Finally, several processes were carried out to make the corpus fit for adding further value through various types of annotation. These preprocessing steps are corpus cleaning, normalization and tokenization. Although tokenization is sometimes treated as a separate problem lying between corpus building and corpus annotation, in this work it is included as part of corpus building, as it was carried out manually along with cleaning and normalization.
In its present form, KashCorpus is ready to be used for future work.
Chapter.4 POS Tagging of KashCorpus
“The definitions of the parts-of-speech are
very far from having attained the degree of
exactitude found in Euclidean geometry.”
Otto Jespersen, The Philosophy of Grammar, 1924
-
Introduction
Part-of-speech (POS) tagging constitutes the fundamental layer of annotation in treebanking, on the basis of which further annotation layers are built. The next layer of annotation is chunking, which is important for determining dependency relations, the most crucial task in building a dependency treebank. The POS category which forms the head of the chunk can be further augmented with crucial morphological information like PNGC (person, number, gender, case) and TAM (tense, aspect, modality), but in the current work the addition of morphological information has been avoided in order to concentrate on inter-chunk dependencies and get the skeletal dependency trees ready. It is worth mentioning that morphological information can easily be added later in order to get better results in automatic syntactic annotation.
This chapter describes the first level of annotation of KashCorpus, i.e. POS tagging and chunking, and the associated resources, technicalities and data manipulations that were required to start POS annotation. Section two introduces the notion of POS tagging. Section three briefly discusses the important annotation standards. Section four presents the POS tagsets developed mainly for English and Indian languages, and elaborates only on the most relevant ones. Section five describes the Kashmiri BIS tagset. Sections six, seven, eight and nine discuss the requirements, the process, the issues and the guidelines of POS tagging, respectively. Section ten provides the statistical results, and finally section eleven summarizes the chapter.
-
The Notion of POS Tagging
The notion of parts-of-speech (POS) tagging has been given very elegantly in Daniel Jurafsky and James H. Martin (2000):
“Words are traditionally grouped into equivalence classes called parts-of-speech (POS), word classes, morphological classes, or lexical tags. In traditional grammars there were generally only a few parts of speech (noun, verb, adjective, preposition, adverb, conjunction, etc.). More recent models have much larger numbers of word classes (45 for the Penn Treebank (Marcus et al., 1993), 87 for the Brown corpus (Francis, 1979; Francis and Kučera, 1982), and 146 for the C7 tagset (Garside et al., 1997)).
The part of speech for a word gives a significant amount of information about the word and its neighbors. This is clearly true for major categories, (verb versus noun), but is also true for the many fine distinctions. For example these tagsets distinguish between possessive pronouns (my, your, his, her, its) and personal pronouns (I, you, he, me). Knowing whether a word is a possessive pronoun or a personal pronoun can tell us what words are likely to occur in its vicinity (possessive pronouns are likely to be followed by a noun, personal pronouns by a verb). This can be useful in a language model for speech recognition.”
POS tagging is the process of assigning a part-of-speech tag to each and every word in continuous text after morphological analysis and grammatical interpretation (Garside, 1995). A set of specially designed tags carrying grammatical information is assigned to words to indicate their part-of-speech category with regard to their use in the text (Leech and Garside, 1982). In other words, POS tagging labels the words in a running corpus with their grammatical categories (optionally with morpho-syntactic features), based on both their form and their contextual function. It is essentially a classification problem in which words are classified on the basis of a predefined inventory of part-of-speech categories called a POS tagset. For morphologically rich languages, it plays the limited role of syntactic category disambiguation in the pipeline of NLP modules: the morphological analyzer provides all possible POS categories for a word, and the POS tagger disambiguates by selecting only one of them according to the context. It is the fundamental level of corpus annotation; in fact, it is the first stage towards the syntactic annotation needed to develop a treebank. Apart from its role in treebanking, a POS-annotated corpus alone can be used in a wide range of NLP applications such as information extraction, information retrieval, parsing (shallow as well as deep), machine translation, speech synthesis and speech recognition.
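The division of labour described above, where an analyzer proposes candidate categories and the tagger selects one according to context, can be illustrated with a toy example. The sketch below is not the tagging method used in this work. It uses a greedy left-to-right choice with invented tag-to-tag preference scores, whereas practical taggers typically rely on statistical sequence models. The tags are Penn Treebank tags (PRP personal pronoun, PRP$ possessive pronoun, NN noun, VB and VBD verbs); the candidate lists, the scores and the function name are assumptions made for illustration.

```python
# Candidate tags per word, as a morphological analyzer or lexicon might supply them.
CANDIDATES = {
    "I": ["PRP"],
    "read": ["VBD"],
    "her": ["PRP", "PRP$"],   # personal vs. possessive pronoun
    "book": ["VB", "NN"],     # verb vs. noun
}

# Invented preference scores for (previous tag, current tag) pairs.
TRANSITION = {
    ("<s>", "PRP"): 1.0,
    ("PRP", "VBD"): 1.0,
    ("VBD", "PRP"): 0.4,
    ("VBD", "PRP$"): 0.6,
    ("PRP$", "NN"): 1.0,
    ("PRP$", "VB"): 0.0,
}


def disambiguate(words, candidates, transition):
    """Greedily pick, for each word, the candidate tag best fitting the previous tag."""
    previous = "<s>"  # sentence-start marker
    tags = []
    for word in words:
        best = max(candidates[word], key=lambda tag: transition.get((previous, tag), 0.0))
        tags.append(best)
        previous = best
    return tags


if __name__ == "__main__":
    sentence = ["I", "read", "her", "book"]
    print(list(zip(sentence, disambiguate(sentence, CANDIDATES, TRANSITION))))
```

In this toy sentence the possessive reading of "her" is chosen, which in turn favours the noun reading of "book", mirroring the point in the quotation that a possessive pronoun is likely to be followed by a noun.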
-