Description of Kashmiri BIS POS Tagset
POS categories and subcategories as given in the tagset (Appendix-I) are briefly discussed below with reference to KashTreeBank Dataset-4.
-
Noun (N)
A noun is an open-class item or content word that refers to people, places, animals, objects, substances, ideas, concepts, feelings, etc. Nouns have the inherent characteristics of number, gender and case, and they are usually inflected for such information. Noun is the first top-level category in the BIS tagset, with three sub-types belonging to level-1 of the hierarchy. Its sub-types are Common Noun (NN), Proper Noun (NNP) and Spatio-temporal Noun (NST).
NN is the first subtype of noun and includes those nouns that denote classes rather than particular instances. Most nouns are common nouns, which can easily be quantified or pluralized, e.g. kitaab (book), gagur (rat), insaan (human), etc. The common nouns extracted from dataset-4 are given in Appendix-II. The list includes not only simple nouns but also multiword expressions such as compound nouns and Izafe. NNP is the second subtype and includes nouns that denote particular instances (such as person, place and institution names) rather than general classes. These can’t be quantified or pluralized, e.g. zA:kir hussain (Zakir Hussain), Jiil-i-Dal (Dal Lake), ladaakh (Ladakh), Kashmir University, Curfewed Night, etc. In some languages, such as English, common and proper nouns can be told apart with the help of orthographic cues like initial-letter capitalization, but many languages, including the Indian languages, lack this luxury. Moreover, proper nouns are also used as common nouns; hence, extracting either of them is a very tough task. The proper nouns extracted from dataset-4 are given in Appendix-II. The list includes not only simple single-token proper nouns but also multi-word/multi-token expressions such as Compound Nouns, Izafe and other Named Entities, e.g. company names, institution names, book names, person names, etc.
NSTs are the third subtype of noun and are also called Nouns of Location (Nloc). This subcategory was originally introduced in the ILMT tagset to register the distinctive nature of some locational nouns which also function as part of complex postpositions (e.g. ke uupar, ke niiche, etc.) in Indian languages, but in the current tagset the notion is used a little differently. Here, NSTs are treated as equivalent to the traditional adverbs of time and place. Since there is no place for the traditional adverbs of time and place in this tagset, these have been classified under NST, which basically refers to particular points in space or time, e.g. hoteyth/tateyth (there), yeteyth (here), bronThI (front), peyThI (top), etc. (also see Appendix-II). Fig.1 shows the frequency distribution of the subcategories of noun and reveals which subcategory is the most frequent in Kashmiri.
Figure.1 Subtype Frequency of Noun
-
Pronoun (PR)
A pronoun is a closed-class item which, like a noun, has the inherent property of being inflected for person, number, gender and case (PNGC) and can substitute for a noun or a noun phrase. The question of whether to introduce pronominals as a separate category or as a subtype of noun has been well explored, and it has been decided that a separate tag for pronouns will be helpful for anaphora resolution. Moreover, a pronoun is not a subtype of noun but rather a variable which need not necessarily refer to a noun. The top-level category of pronouns (PR) has six sub-types: Pronominal (PRP), e.g. bI (I), tsI (you), su (he), sw (she), yi (this), ti/hu (that/it), etc.; Reflexive (PRF), e.g. paanI (herself/himself); Relative (PRL), e.g. yus (who), yi (which), yeli (when), etc.; Reciprocal (PRC), e.g. paanIvan’ (each other); WH or Interrogative (PRQ), e.g. kus (who/which), kyaa (what), kar (when); and Indefinite (PRI), e.g. kahn (someone), kuni (somewhere). It is important to mention that, unlike the other traditional pronominal sub-classes, the possessive pronoun has not been introduced as a sub-type in this tagset. The reason is that possession, as an attribute (genitive), can be marked on other sub-types as well, as in his, whose, etc. The pronouns extracted from dataset-4 are given in Appendix-II. Fig.2 shows the frequency distribution of the subcategories of pronoun and reveals which subcategory is the most frequent in Kashmiri.
Figure.2 Subtype Frequency of Pronoun
-
Demonstrative (DM)
Demonstratives are closed-class items that perform a deictic function for a noun. A demonstrative is always followed by a noun, a pronoun, an intensifier or an adjective. Demonstratives are a distinct category of determiners: they can neither substitute for a noun nor specify a noun, but they can point out a noun. Therefore, one must not confuse them with pronouns or adjectives, though they resemble pronouns in form and are traditionally treated as adjectives. This is why demonstratives are posited as a separate top-level category in this tagset. The category consists of Deictic or Default Demonstratives (DMD), Relative Demonstratives (DMR), WH-Demonstratives (DMQ) and Indefinite Demonstratives (DMI), e.g. yi in yi laDkI (this boy), kahn in kahn chiiz (something), kus in kus insaan (which man), etc. The demonstratives extracted from dataset-4 are given in Appendix-II. Fig.3 shows the frequency distribution of the subcategories of demonstrative and reveals which subcategory is the most frequent in Kashmiri.
Figure.3 Subtype Frequency of Demonstrative
-
Verb (V)
A verb is an open-class item that refers to actions, events, occurrences or states. Verbs have the inherent properties of Tense, Aspect, Mood and Voice and are inflected for such information. They also show inflections for Person, Number, Gender and Case due to their agreement properties. In the present tagset, Verb is a top-level category with two subtypes: Verb Main (VM) and Verb Auxiliary (VAUX). As mentioned above, the finer distinctions of finite, nonfinite and infinitive have been postponed to be tackled at the chunk level. The rationale for using these underspecified tags is that the morphosyntactic information which determines the status of a verb as finite or nonfinite is distributed over two or three tokens. Therefore, it is impossible to decide upon the status of the verb unless all the constituent tokens are taken into consideration. Examples include kheyvaan (eating), shong (slept), chu (is), os (was), etc. The verbs extracted from dataset-4 are given in Appendix-II. Fig.4 shows the frequency distribution of the subcategories of verb and reveals which subcategory is the most frequent in Kashmiri.
Figure.4 Subtype Frequency of Verb
-
Adjective (JJ)
An adjective is an open-class item that modifies a noun or pronoun by representing one of its properties. Adjectives agree in number, gender and case with the nouns they modify. Therefore, adjectives (both attributive and predicative) are inflected for PNGC. In the present tagset there is no further distinction of subtypes, but a distinction has been made between simple adjectives and those adjectives which are constituents of compound words or izafe. The tag for a simple adjective is JJ, whereas the tag for a constituent adjective is JJC. For example: zyuuTh (tall), asIl (fine), byuuTh (waste), vozul (red), etc. The adjectives extracted from dataset-4 are given in Appendix-II. Fig.5 shows the frequency distribution of the subcategories of adjective and reveals which subcategory is the most frequent in Kashmiri.
Figure.5 Subtype Frequency of Adjective
-
Adverb (RB)
An adverb is an open-class item that modifies a verb. Adverbs form an important top-level category of this tagset. Unlike adjectives, which agree with the nouns they modify, adverbs do not agree with the verb they modify. They are indeclinable, i.e. they do not have any inflectional property. They are floating elements in the sentence and do not necessarily occur adjacent to the verb they modify; their distribution in a sentence varies considerably. In the present tagset, only the adverb of manner (RB) has been taken into consideration, as adverbs of time and place have already been classified under noun as NST (Nloc). The adverbs extracted from dataset-4 are given in Appendix-II. Fig.6 shows the frequency distribution of the subcategories of adverb and reveals which subcategory is the most frequent in Kashmiri.
Figure.6 Subtype Frequency of Adverb
-
Postposition (PSP)
Postpositions are closed-class items which, like prepositions, represent case relations between a verb and its dependent nouns in a sentence. The forms that represent case relations are either free forms or bound forms. The free forms are called pre/postpositions, whereas the bound forms (inflectional categories) are called case-markers. Postpositions, as their name suggests, are always preceded by nominals (nouns or pronouns) and always trigger obliqueness either in their head nominals (common in Indo-Aryan languages) or in the entire noun phrase (as in Kashmiri). However, in the literature the notion of pre/postposition is entangled with the notion of case or case-marker. For instance, a case-marker is considered to be a purely syntactic inflectional category and a pre/postposition an independent word representing semantic relations, but in fact orthographic conventions defy these norms: a purely syntactic form may be bound (inflectional) in one language and free (independent) in another. To simplify, all the free forms that represent some sort of relation (not necessarily semantic) between nominals and a verb, or between two nominals, are considered postpositions. It is worth mentioning that Kashmiri has very few relation-representing forms that occur before nouns, e.g. bamutaabiq Farooq (according to Farooq). Such forms can be considered prepositions, but in the current tagset they are classified as postpositions: because of their negligible frequency, no further sub-division has been made in this category. Postpositions are far more frequent, e.g. sund (of), sI:t’ (with), khA:trI (for), etc. Fig.7 shows the frequency of pre/postpositions in dataset-4.
Figure.7 Type Frequency of Postposition
-
Conjunction (CC)
Conjunctions are closed-class items or function words that conjoin two or more lexical items, phrases or clauses. In the current tagset, conjunctions have been introduced as a top-level category with two sub-types: coordinators (CCD) and subordinators (CCS). If the conjoining operation is symmetrical, the conjunction is a coordinator; if it is asymmetrical, the conjunction is a subordinator. Coordinators form compound sentences, while subordinators form complex sentences. In the former, the constituent clauses are symmetrical (both are independent in nature), while in the latter the constituent clauses are asymmetrical (one is the principal, independent or matrix clause, and the other, introduced by the subordinator, is the subordinate, dependent or embedded clause).
Since conjunctions are indeclinable in nature, they were classified under particles in the ILMT and ILPOST tagsets, vis-à-vis the LDCIL tagset, as far as the definition of particle is concerned. However, given that they have key syntactic functions, unlike other particles, they have been introduced as a top-level category in the BIS tagset and are tagged as CC, e.g. tI (and), zi (that), etc. It is worth mentioning that this decision may be helpful in the conversion of a dependency treebank into a phrase-structure treebank. The CCs extracted from dataset-4 are given in Appendix-II. Fig.8 shows the frequency distribution of the subcategories of conjunction and reveals which subcategory is the most frequent in Kashmiri.
Figure.8 Subtype Frequency of Conjunction
-
Particle (RP)
Particles are closed-class items or function words which are generally indeclinable in nature and carry the least significance in a construction. Particle constitutes a top-level category in the current tagset and has Default (RPD), Intensifier (INT), Interjection (INJ) and Negation (NEG) as its sub-types. There is an elaborate list of particles (Emphatic, Similative, Dedative, Inclusive, Exclusive, etc.) which have been assigned a single underspecified label, ‘default’, given that their finer distinction is not very significant at this level. Particles generally have a limited syntactic function but encode key semantic and pragmatic information, e.g. seThaa (very), na (no), hata (hey), etc. Fig.9 shows the frequency distribution of the subcategories of particle and reveals which subcategory is the most frequent in Kashmiri.
Figure.9 Subtype Frequency of Particles
-
Quantifier (QT)
Quantifiers are also closed-class items or function words; they quantify nominals. Quantifier is a top-level category in the current tagset with General (QTF), Cardinal (QTC) and Ordinal (QTO) as its sub-types, e.g. akh (one), pI:ntsin (fifth), vaariyaa (lot), etc. General quantifiers are non-numeric quantifiers that show highness or lowness in the quantum of countable nouns or simply show the quantity of mass nouns, whereas Cardinals are numeral quantifiers that specify the quantum of countable nouns numerically. The former are less precise in quantification than the latter. Ordinals, on the other hand, do not quantify at all; rather, they specify the position of an item in a series. They modify nominals and, like adjectives, can occur in both attributive and predicative position. By form, however, they are generally derivatives of the numerals (cardinals) of a language. In Kashmiri, QTF and QTO show agreement with their phrasal heads in terms of case, like their adjective or demonstrative counterparts. Fig.10 shows the frequency distribution of the subcategories of quantifier and reveals which subcategory is the most frequent in Kashmiri.
Figure.10 Subtype Frequency of Quantifier
-
Residual (RD)
Residual is not a POS category as such; it has been introduced as a separate top-level category, with five sub-types, to accommodate the remaining elements of the corpus (text) which do not fit into the scheme discussed so far. Its sub-types are Foreign Word (RDF), Unknown Word (UNK), Echo Word (ECO), Symbol (SYM) and Punctuation (PUNC). RDF covers words which are written in another script, whereas UNK covers words which the annotators do not know, are unsure about, or which apparently do not fit anywhere. UNK is therefore a kind of catch-all bin for words that cannot be classified. ECO covers partially reduplicated non-words that play a definite grammatical role. Symbols are neither words nor punctuation marks but elements of a text which encode certain information about some entities, and which can prove crucial for named entity recognition (NER). Punctuation marks are closed-class items, though not words, that play a crucial grammatical function in organizing a discourse. They mark phrase, clause and sentence boundaries and sometimes play the role of coordinators. Fig.11 shows the frequency distribution of the subcategories of residual and reveals which subcategory is the most frequent in Kashmiri.
Figure.11 Subtype Frequency of Residual
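The two-level hierarchy described above, and the kind of subtype count plotted in Figures 1 to 11, can be pictured with a small Python sketch. The tag labels are the ones used in this description; the names BIS_TAGSET, SUBTYPE_TO_CATEGORY and subtype_frequency are hypothetical, and the sample tokens are a toy list built from the noun examples quoted earlier, not KashTreeBank data.

```python
from collections import Counter

# Top-level categories and their subtypes, as named in the description above.
# (JJ and JJC are listed together although the text calls them a simple vs.
# constituent distinction rather than true subtypes.)
BIS_TAGSET = {
    "N": ["NN", "NNP", "NST"],
    "PR": ["PRP", "PRF", "PRL", "PRC", "PRQ", "PRI"],
    "DM": ["DMD", "DMR", "DMQ", "DMI"],
    "V": ["VM", "VAUX"],
    "JJ": ["JJ", "JJC"],
    "RB": ["RB"],
    "PSP": ["PSP"],
    "CC": ["CCD", "CCS"],
    "RP": ["RPD", "INT", "INJ", "NEG"],
    "QT": ["QTF", "QTC", "QTO"],
    "RD": ["RDF", "UNK", "ECO", "SYM", "PUNC"],
}

# Reverse lookup: subtype tag -> top-level category.
SUBTYPE_TO_CATEGORY = {
    sub: top for top, subs in BIS_TAGSET.items() for sub in subs
}


def subtype_frequency(tagged_tokens):
    """Count subtype tags per top-level category for (token, tag) pairs."""
    freq = {top: Counter() for top in BIS_TAGSET}
    for token, tag in tagged_tokens:
        if tag not in SUBTYPE_TO_CATEGORY:
            raise ValueError(f"unknown tag {tag!r} for token {token!r}")
        freq[SUBTYPE_TO_CATEGORY[tag]][tag] += 1
    return freq


# Toy input only: noun examples quoted in the text with the tags given there.
sample = [
    ("kitaab", "NN"),
    ("gagur", "NN"),
    ("insaan", "NN"),
    ("ladaakh", "NNP"),
    ("yeteyth", "NST"),
]
noun_freq = subtype_frequency(sample)["N"]
print(noun_freq.most_common())
```

Rejecting unknown tags outright, rather than silently counting them, is one possible design choice; it keeps typing errors in the tag column from inflating any subtype.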
-
Requirements for POS Tagging
There are two main requirements for POS annotation besides the availability of corpus and POS tagset. These include an annotation interface and the storage format which are elaborated below:
-
POS Annotation Interface
The best way to perform consistent and error-free POS annotation is to use specialized, user-friendly interfaces designed for this purpose. Several POS annotation interfaces have been developed in India, e.g. one developed by MSRI and another developed by LDCIL, but these are POS annotation interfaces only: no other level of annotation can be carried out with them, and no link can be maintained between two or more levels. Since the current POS annotation is the first level of annotation in KashTreeBank and an integral part of it, we need a specialized platform that can consolidate all levels of annotation in a single format. One such platform is Sanchay, which has been developed by writing approximately 300,000 lines of Java code over many years. Sanchay is a collection of tools and APIs (Application Programming Interfaces) for various language processing purposes (Singh 2006). It is an open-source platform for carrying out various NLP tasks for South Asian Languages (SALs). So far, it has been used extensively for Indian Languages (ILs) at various NLP research labs and in various research projects. The background of Sanchay has been summed up as follows:
“It has already been used for the creation of POS tagged corpora for several Indian languages. In fact, the beginning of treebank creation work in India coincides with that of the beginning of the development of this interface and much of the treebank annotation work for Indian languages has been accomplished on various versions of this interface.” Singh (2011)
The Sanchay Syntactic Annotation (SA) interface, as shown in Fig.12.A & Fig.12.B, is a specialized interface for syntactic annotation, but it has been generalized for various kinds of annotation: morphological annotation, POS tagging, chunking, PSG annotation, dependency annotation, named entity annotation and PropBank annotation. It was first developed when preparations for creating a Hindi treebank started at LTRC. In fact, work on the whole platform started with this interface; as pointed out by A. K. Singh, the developer of Sanchay, “It was not just the first annotation interface, but also the first graphical user interface in Sanchay” (ibid 2011).
The same interface with the same mechanisms can be used for these different kinds of annotations. This is made possible by a data representation that is in terms of threaded trees with feature structures (multiple and/or nested). The different threads in the base tree allow different layers of annotation (for details see Singh, 2011).
Sanchay is a generalized platform which needs customization before it can work for a language that has not yet been included. Customization related to encoding and font had already been done, but the same still needed to be done for the tagset. The BIS tagset is a recent development, and Sanchay had been customized for the previous tagsets only; for the current work, it needed to be customized for the BIS scheme. The properties files (pos-tags.txt, pos-tags-ben.txt, non-terminals.txt), which contain the lists of tags for POS tagging and chunking, need to be customized. These plain text files contain simple listings of tags in alphabetical order, with one tag per line. In these files, the ILMT tags were simply replaced with the BIS tags. The tags have been sorted alphabetically so that Method-3 can be employed conveniently for POS tagging.
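As an illustration of the file layout described above (plain text, one tag per line, alphabetical order), the following Python sketch writes and re-reads such a list. Only the file name pos-tags.txt comes from the text; the helper names are hypothetical, the tag labels are the short forms used in this chapter, and the file is written to a temporary directory because the actual location inside the Sanchay installation is not given here. Whether Sanchay expects anything beyond this simple layout is an assumption.

```python
import tempfile
from pathlib import Path

# Subtype labels as they appear in this chapter's description of the tagset.
BIS_POS_TAGS = [
    "NN", "NNP", "NST",
    "PRP", "PRF", "PRL", "PRC", "PRQ", "PRI",
    "DMD", "DMR", "DMQ", "DMI",
    "VM", "VAUX",
    "JJ", "JJC",
    "RB",
    "PSP",
    "CCD", "CCS",
    "RPD", "INT", "INJ", "NEG",
    "QTF", "QTC", "QTO",
    "RDF", "UNK", "ECO", "SYM", "PUNC",
]


def write_tag_file(tags, path):
    """Write the tags to `path`, de-duplicated, sorted, one per line (UTF-8)."""
    unique_sorted = sorted(set(tags))
    Path(path).write_text("\n".join(unique_sorted) + "\n", encoding="utf-8")
    return unique_sorted


def read_tag_file(path):
    """Read a one-tag-per-line file back into a list of tags."""
    return Path(path).read_text(encoding="utf-8").splitlines()


# Demonstration in a throwaway directory (hypothetical location).
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "pos-tags.txt"
    written = write_tag_file(BIS_POS_TAGS, target)
    reloaded = read_tag_file(target)

print(reloaded[:5])
```

Generating the file from a single list, rather than editing it by hand, keeps the alphabetical order intact whenever a tag is added or renamed.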
-
Storage Format
As mentioned in Chapter one, the data has to be converted into the storage format before actual POS tagging can start. The format encodes the threaded-tree representation, which allows multiple layers of annotation to be stored in a single structure or a single file; this file is either directly readable by various algorithms or convertible into a format which in turn is readable. The default format which Sanchay uses for the storage and linking of various levels of grammatical information is called SSF (for details, see section-3.2, Chapter one). However, the interface also supports XML and several other formats that are commonly used for computational purposes, such as preparing input data for Machine Learning tools.
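A rough picture of a four-column layout of this kind can be given with a short Python sketch. It is a simplified illustration only: the column order (address, token, tag, features), the tab separator and the sentence wrapper are assumptions made here for the sake of the example, and the files actually produced by Sanchay may differ in headers, chunk brackets and feature-structure syntax. The function name to_ssf_like is hypothetical.

```python
def to_ssf_like(sentences):
    """Render untagged sentences as tab-separated four-column rows.

    Columns (assumed): address, token, POS tag, feature structure.
    The last two columns are left empty, to be filled in during annotation.
    """
    lines = []
    for sent_id, sentence in enumerate(sentences, start=1):
        lines.append(f'<Sentence id="{sent_id}">')
        for address, token in enumerate(sentence.split(), start=1):
            lines.append("\t".join([str(address), token, "", ""]))
        lines.append("</Sentence>")
    return "\n".join(lines)


# Toy input: two "sentences" made of words quoted earlier in this chapter.
ssf_text = to_ssf_like(["kitaab gagur", "insaan"])
print(ssf_text)
```

Keeping the tag and feature columns present but empty shows how later layers of annotation can be added to the same rows without changing the file structure.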
To convert the corpus into SSF, the text first needs to be split into separate sentences so that each sentence occupies a separate line. This was done in MS Word by using a special case of the “Find and Replace” (Ctrl+H) option, in which the sentence delimiters (؟ & ۔) were replaced with paragraph markers (^p), giving a one-sentence-per-line arrangement. Secondly, the doc-files need to be converted to plain txt-files by saving the content in a plain text editor (Notepad) with UTF-8 encoding. Finally, the resulting text file needs to be opened in the SA-Interface of Sanchay and then saved there. Clicking on the save button automatically converts the raw text into four-column SSF, provided the sentences are arranged one per line.
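The sentence-splitting step was carried out manually in MS Word; the same one-sentence-per-line arrangement could be obtained with a few lines of Python, sketched below under stated assumptions. The function name split_sentences and the keep_delimiter option are hypothetical. By default the delimiters are dropped, mirroring the literal replacement of ؟ and ۔ with paragraph markers described above.

```python
import re

# Sentence delimiters named in the text:
# U+061F ARABIC QUESTION MARK and U+06D4 ARABIC FULL STOP.
DELIMITER_PATTERN = re.compile("([\u061F\u06D4])")


def split_sentences(text, keep_delimiter=False):
    """Split running text into sentences at the two delimiters.

    Returns a list of sentences with whitespace normalized. Empty pieces
    (e.g. after a trailing delimiter) are skipped.
    """
    # With a capturing group, re.split returns text and delimiters alternately:
    # [text, delim, text, delim, ..., text]
    parts = DELIMITER_PATTERN.split(text)
    sentences = []
    for i in range(0, len(parts), 2):
        body = " ".join(parts[i].split())
        if not body:
            continue
        delimiter = parts[i + 1] if i + 1 < len(parts) else ""
        sentences.append(body + delimiter if keep_delimiter else body)
    return sentences


# Placeholder text (not corpus data), using the real delimiter characters.
raw = "sentence one\u06D4  sentence two\u061F sentence three\u06D4"
one_per_line = "\n".join(split_sentences(raw))
print(one_per_line)
```

The resulting string can then be saved as a UTF-8 text file, for example with open(path, "w", encoding="utf-8"), before it is opened in the SA-Interface.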
The following screenshots, Figures 12.a, 12.b and 12.c, depict the step-wise opening of Sanchay and of the corpus file <kashmiri_treebank_IASNLP_286.txt>.
Step-I: Open the Sanchay folder and double-click on the Sanchay.bat file, an executable file which starts running and results in the opening of the Sanchay Shell, as shown in the Figure.12.a & 12.b.
Figure.12.a. Opening of Sanchay Shell
Figure.12.b. Sanchay Shell with Multiple API Tabs
Step-II: Clicking on the SA-button in the Sanchay Shell opens the SA-interface, as shown in Figure.12.c. Only this API is needed in the entire course of this work.
Figure.12.c. Sanchay Syntactic Annotation (SA) Interface
Step-III: Clicking on the Open button in the SA-interface opens a small Browsing-window, in which the Browse button is used to browse for the required task-file. In this window, one needs to set the language in the language drop-down list and the encoding in the encoding drop-down list, as shown below in Figure.12.d.
Figure.12.d. SA-Interface with Browsing Window
Step-IV: Clicking on the Browse button opens a small Open-window, which lists the files in a particular directory that are in text format and can be opened. One needs to select the required file by clicking on it and then click on the Open-button so that the path of the file is selected in the Browsing-window. This is shown in Figures 12.e & 12.f.
Figure.12.e. SA-Interface Showing Browsing Window
Step-V: The path (C:\ Users\ Shanu\ Desktop\ Sanchay-16-02-11\ KashDTreeBabk_03-Aug 2013\ 1.kashmiri_treebank_IASNLP_286.tx) of the required file is selected, as shown in Figure 12.f. The selected file opens as soon as one clicks on the OK-button. The file opens in the interface in such a way that only one sentence is displayed at a time, as shown in Figure.13.a.
Figure.12.f. SA-Interface Showing Annotation Task-Setup
Figure.13.a. SA-Interface Showing a Sentence (Before POS Tagging)