Chapter. 1 Introduction


Chunking POS Tagged Corpus Samples




Chunking POS Tagged Corpus Samples

As mentioned earlier, chunking means labeling a cluster of POS-annotated words (with an obvious head), or a solitary POS-annotated word (which itself acts as a head), with a higher-level tag. During chunking, words were clustered together and assigned a particular chunk tag on the basis of their POS tags, their adjacency and the dependency relations between them that make them perceptually closed entities. Each chunk therefore has a definite internal structure: the words constituting a chunk are asymmetrically related to each other, with one word as the head and the remaining words as its dependents; a solitary word is itself the head and has no dependents. There are, however, certain cases in which a word that has been given chunk status is neither a head nor a dependent as far as semantic dependency is concerned, e.g. AUXP, as discussed above. The chunking was carried out in the same interface that was used for POS tagging. The chunk layer is built on top of the POS layer, as illustrated below in three steps for sentence 43 and sentence 42 (taken from the corpus), which are given in Table.2 along with their English translations and chunk information. To carry out manual chunking, the POS-annotated file in SSF format is opened in the Sanchay SA Interface (GUI), as shown in Fig.1.a.

Kashmiri Sentence 43

تٔمۍ ووٛن زِ جوہری طاقتہٕ چھٚے پَننہِ فٲیدٕ خٲطرٕ ٹیکنالوجی پٮ۪ٹھ اجارٕ-دٲری یَژھان تہٕ نَہٕ چِھ بیٚین ملکن امن مقصدَو خٲطرٕ جوہری توانٲیی پراونس اجازت دِوان ۔

Translation

He said that the atomic powers want a monopoly on the technology for their own benefit and do not allow other countries to use atomic energy for peaceful purposes.


Chunks

[[ [[ تٔمۍ_PR_PRP ]]_NP [[ ووٛن_V_VM ]]_VGF [[ زِ_CC_CCS ]]_CCP [[ جوہری_JJ_JJ طاقتہٕ_N_NN ]]_NP [[ چھٚے_V_VAUX ]]_VGF [[ پَننہِ_PR_PRF ]]_NP [[ فٲیدٕ_N_NN خٲطرٕ_PP_PSP ]]_NP [[ ٹیکنالوجی_N_NN پٮ۪ٹھ_PP_PSP ]]_NP [[ اجارٕ-دٲری_N_NN ]]_NP [[ یَژھان_V_VM ]]_VGF [[ تہٕ_CC_CCD ]]_CCP [[ نَہٕ_RP_NEG چِھ_V_VAUX ]]_VGF [[ بیٚین_JJ_JJ ملکن_N_NN ]]_NP [[ امن_N_NNC مقصدَو_N_NNC خٲطرٕ_PP_PSP ]]_NP [[ جوہری_JJ_JJ توانٲیی_N_NN ]]_NP [[ پراونس_V_VM ]]_VGF [[ اجازت_N_NN ]]_NP [[ دِوان_V_VM ۔_RD_PUNC ]]_VGF ]]_SSF

Sentence 42

ایرانٕکۍ صدر محمود احمدی نژادَن چُھ ووٛنمُت زِ تٔمۍ سٕنٛدِس مُلکَس خلاف چُھنہٕ اقوام متحد-کِس تازٕ قراردادَس کانٛہہ اہمیت تہٕ سلامتی کونسل چھٚے امریکہ-ہُک ‘ آلہٕ بنیمٕژ ‘ ۔

Translation

The President of Iran, Mahmoud Ahmadinejad, has said that the United Nations resolution against his country has no significance and that the Security Council has become an instrument in the hands of America.


Chunks

[[ [[ ایرانٕکۍ_N_NNP ]]_NP [[ صدر_N_NNPC محمود_N_NNPC احمدی_N_NNPC نژادَن_N_NNPC ]]_NP [[ چُھ_V_VAUX ووٛنمُت_V_VM ]]_VGF [[ زِ_CC_CCS ]]_CCP [[ تٔمۍ_PR_PRP سٕنٛدِس_PP_PSP ]]_NP [[ مُلکَس_N_NN خلاف_PP_PSP ]]_NP [[ چُھنہٕ_V_VAUX ]]_VGF [[ اقوام_N_NNPC متحد-کِس_N_NNPC ]]_NP [[ تازٕ_JJ_JJ قراردادَس_N_NN ]]_NP [[ کانٛہہ_DM_DMI اہمیت_N_NN ]]_NP [[ تہٕ_CC_CCD ]]_CCP [[ سلامتی_N_NNPC کونسل_N_NNPC ]]_NP [[ چھٚے_V_VAUX ]]_VGF [[ امریکہ-ہُک_N_NNP ]]_NP [[ ‘_RD_PUNC آلہٕ_N_NN ]]_NP [[’_RD_PUNC بنیمٕژ_V_VM ۔_RD_PUNC ]]_VGF ]]_SSF

Table.2. Showing Example Sentence
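The bracketed chunk notation used in Table.2 can be read mechanically. The following is a minimal sketch in Python, assuming that every chunk is written as `[[ tokens ]]_LABEL` and every token as `word_CATEGORY_TAG`, as in the two sentences above; the function name `parse_bracketed` is hypothetical and is not part of Sanchay.

```python
import re

# One inner chunk: "[[ token token ... ]]_LABEL". The outer "[[ ... ]]_SSF"
# wrapper never matches, because its body contains further brackets.
CHUNK_RE = re.compile(r"\[\[([^\[\]]*)\]\]_([A-Z]+)")

def parse_bracketed(sentence):
    """Return a list of (chunk_label, [(word, category, tag), ...]) pairs."""
    chunks = []
    for body, label in CHUNK_RE.findall(sentence):
        tokens = []
        for item in body.split():
            # Split from the right so that hyphens or other marks inside
            # the word itself are left untouched.
            word, category, tag = item.rsplit("_", 2)
            tokens.append((word, category, tag))
        chunks.append((label, tokens))
    return chunks

# The first four chunks of sentence 43, copied from Table.2.
sample = (
    "[[ [[ تٔمۍ_PR_PRP ]]_NP [[ ووٛن_V_VM ]]_VGF "
    "[[ زِ_CC_CCS ]]_CCP [[ جوہری_JJ_JJ طاقتہٕ_N_NN ]]_NP ]]_SSF"
)
chunks = parse_bracketed(sample)
for label, tokens in chunks:
    print(label, [tag for _, _, tag in tokens])
```

Splitting each token from the right keeps the sketch independent of the script in which the word is written; only the two trailing underscore-separated fields are interpreted.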

Figure.1.a. SA Interface Showing POS Tagged Sentence



Step-1: In this step, the contiguous words that form a chunk are selected by holding down the Control key and clicking on their nodes, so that all the contiguous nodes are selected simultaneously, as shown in Fig.1.a. Although the first three chunks (NP, VGF and CCP) consist of solitary words, they are chunked by the same steps as the fourth chunk (the highlighted one in Fig.1.b), i.e. by selecting the nodes, adding a layer and changing the name of the layer (the chunk name) for the selected nodes.

Figure.1.b. SA Interface Showing Step-1 of Chunking

Step-2: In this step, right-clicking on the selected chunk opens a drop-down list of actions, from which the ‘Add Layer’ option is selected; a new chunk layer is then added in the format shown in Fig.1.b.

Figure.1.c. SA Interface Showing Step-2 in Chunking

Step-3: The newly added layer carries a default tag (NP), which can easily be changed by clicking on the chunk tag itself and pressing the key for the first letter of the desired chunk tag. The letter key can be pressed repeatedly until the desired chunk tag is assigned to the newly added chunk layer, as shown in Fig.1.d.

Figure.1.d. SA Interface Showing Step-3 in Chunking

As shown above, sentence-43 has 26 tokens, which have been grouped into 18 chunks, and sentence-42 has 29 tokens, which have been grouped into 16 chunks. The ratio of tokens/words to chunks is not very large (approx. 1.6), which indicates a high frequency of solitary words that have been given chunk status. The 18 chunks of sentence-43, as viewed in the tree viewer of the interface, are shown in Fig.2.a and Fig.2.b.
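The ratio quoted above can be checked with a few lines of arithmetic. The sketch below uses only the token and chunk counts stated in the text; the variable names are illustrative.

```python
# (tokens, chunks) per sentence, as stated in the text above.
counts = {
    "sentence-43": (26, 18),
    "sentence-42": (29, 16),
}

total_tokens = sum(tokens for tokens, _ in counts.values())
total_chunks = sum(chunks for _, chunks in counts.values())

# Pooled tokens-per-chunk ratio over both sentences: 55 / 34, about 1.62.
ratio = total_tokens / total_chunks

# Per-sentence ratios, for comparison with the pooled figure.
per_sentence = {name: t / c for name, (t, c) in counts.items()}

print(round(ratio, 2))
print({name: round(value, 2) for name, value in per_sentence.items()})
```

A ratio close to 1 would mean that nearly every token forms a chunk of its own, so a value of about 1.6 is consistent with the large number of single-word chunks seen in the two sentences.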

Figure.2.a. SA Interface Showing Chunks in Sentence 43



Figure.2.b. SA Interface Showing Chunks in Sentence 43



Sentence 42

1 (( NP
1.1 ایرانٕکۍ N_NNP
))
2 (( NP
2.1 صدر N_NNPC
2.2 محمود N_NNPC
2.3 احمدی N_NNPC
2.4 نژادَن N_NNPC
))
3 (( VGF
3.1 چُھ V_VAUX
3.2 ووٛنمُت V_VM
))
4 (( CCP
4.1 زِ CC_CCS
))
5 (( NP
5.1 تٔمۍ PR_PRP
5.2 سٕنٛدِس PP_PSP
))
6 (( NP
6.1 مُلکَس N_NN
6.2 خلاف PP_PSP
))
7 (( VGF
7.1 چُھنہٕ V_VAUX
))
8 (( NP
8.1 اقوام N_NNPC
8.2 متحد-کِس N_NNPC
))
9 (( NP
9.1 تازٕ JJ_JJ
9.2 قراردادَس N_NN
))
10 (( NP
10.1 کانٛہہ DM_DMI
10.2 اہمیت N_NN
))
11 (( CCP
11.1 تہٕ CC_CCD
))
12 (( NP
12.1 سلامتی N_NNPC
12.2 کونسل N_NNPC
))
13 (( VGF
13.1 چھٚے V_VAUX
))
14 (( NP
14.1 امریکہ-ہُک N_NNP
))
15 (( NP
15.1 ‘ RD_PUNC
15.2 آلہٕ N_NN
))
16 (( VGF
16.1 بنیمٕژ V_VM
16.2 ’ RD_PUNC
16.3 ۔ RD_PUNC
))





Sentence 43

1 (( NP
1.1 تٔمۍ PR_PRP
))
2 (( VGF
2.1 ووٛن V_VM
))
3 (( CCP
3.1 زِ CC_CCS
))
4 (( NP
4.1 جوہری JJ_JJ
4.2 طاقتہٕ N_NN
))
5 (( VGF
5.1 چھٚے V_VAUX
))
6 (( NP
6.1 پَننہِ PR_PRF
))
7 (( NP
7.1 فٲیدٕ N_NN
7.2 خٲطرٕ PP_PSP
))
8 (( NP
8.1 ٹیکنالوجی N_NN
8.2 پٮ۪ٹھ PP_PSP
))
9 (( NP
9.1 اجارٕ-دٲری N_NN
))
10 (( VGF
10.1 یَژھان V_VM
))
11 (( CCP
11.1 تہٕ CC_CCD
))
12 (( VGF
12.1 نَہٕ RP_NEG
12.2 چِھ V_VAUX
))
13 (( NP
13.1 بیٚین JJ_JJ
13.2 ملکن N_NN
))
14 (( NP
14.1 امن N_NNC
14.2 مقصدَو N_NNC
14.3 خٲطرٕ PP_PSP
))
15 (( NP
15.1 جوہری JJ_JJ
15.2 توانٲیی N_NN
))
16 (( VGF
16.1 پراونس V_VM
))
17 (( NP
17.1 اجازت N_NN
))
18 (( VGF
18.1 دِوان V_VM
18.2 ۔ RD_PUNC
))



Table.3. Showing Chunked Sentences in SSF
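The SSF layout shown in Table.3 is regular enough to be read line by line. The sketch below is an illustration only: it assumes the three-column layout printed above (address, token, tag), with `((` opening a chunk and `))` closing it, and does not attempt to cover the full SSF specification, which may carry further columns. The name `parse_ssf` is hypothetical.

```python
def parse_ssf(lines):
    """Group SSF lines into (chunk_label, [(word, tag), ...]) pairs."""
    chunks = []
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 3 and fields[1] == "((":
            # Chunk opener such as "3 (( VGF".
            current = (fields[2], [])
        elif fields[0] == "))":
            # Chunk closer: store the finished chunk.
            if current is not None:
                chunks.append(current)
            current = None
        elif current is not None and len(fields) == 3:
            # Token line such as "3.1 word V_VAUX".
            _, word, tag = fields
            current[1].append((word, tag))
    return chunks

# Chunks 3 and 4 of sentence 42, copied from Table.3.
sample_lines = [
    "3 (( VGF",
    "3.1 چُھ V_VAUX",
    "3.2 ووٛنمُت V_VM",
    "))",
    "4 (( CCP",
    "4.1 زِ CC_CCS",
    "))",
]
ssf_chunks = parse_ssf(sample_lines)
for label, tokens in ssf_chunks:
    print(label, len(tokens))
```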

    1. Chunking Issues

As discussed above, chunking covers half of the dependency relations, although they are not explicitly marked with relational labels, as is the general practice in dependency treebanks and as can be clearly seen in the works of Nivre (2009), who has translated Tesnière’s (1959) seminal work on dependency grammar. However, when the framework designed for treebanking in ILs (Bharati et al. 1995, 2006) is applied to Kashmiri, entirely new issues come to the fore. These issues stem partly from the underlying theory and partly from the peculiar morphosyntactic and syntactic properties of Kashmiri, which distinguish it from the rest of the ILs and bring it closer to Germanic languages such as German and Yiddish. The main issues encountered during the manual chunking of the Kashmiri corpus are briefly described below.

    1. V2 and V3 Phenomena

It has been found that the notion of verb group proposed for ILs does not hold for the Kashmiri corpus because of a distinctive syntactic feature of Kashmiri known as the V2 phenomenon. The V2 phenomenon occurs in all tensed clauses, whether matrix or embedded, in both active and passive configurations. Because of it, the tense auxiliary and the main verb are no longer contiguous. The tense auxiliary (VAUX) occurs in the second (V2) position and the main verb (VM) occurs in the final position of the sentence; if the sentence also contains a modal auxiliary, it occupies the third (V3) position. For example:

farooq/NNP chu/VAUX batI/NN khevaan/VM

Farooq is eating rice.

farooq/NNP chu/VAUX aasaan/VAUX batI/NN khevaan/VM

Farooq keeps eating rice.

However, in an interrogative sentence the tense auxiliary can also occur in the third (V3) position; if an auxiliary carrying aspectual information is also present, it occurs in the fourth (V4) position. For example:

farooq/NNP kya/WH chu/VAUX reyaazas/NNP divaan/VM

What is Farooq giving to Riyaz?

farooq/NNP kya/WH chu/VAUX aasaan/VAUX reyaazas/NNP divaan/VM

What does Farooq keep giving to Riyaz?



The problem with finite clauses in Kashmiri is that, because of the V2 phenomenon, they cannot be chunked as easily as in other IA languages, e.g. Hindi, Urdu or Punjabi. The tensed verb is discontinuous from its main verb, as shown in the above examples. Usually, a group or cluster of words is assigned a chunk label if the words are adjacent or contiguous and also stand in an asymmetric relation of dependence, or simply have unequal category status, so that the one with the higher status can be projected as the head. In this case, however, the VAUXs and the VMs are not adjacent to each other, nor is the relationship between them a dependency relation in the real sense. Dependencies are essentially modifier-modified relations, and between the discontinuous VAUX and VM in finite clauses there is hardly any modifier-modified relation, though there is definitely a part-whole relationship. Therefore, it is impossible to posit the verb as a chunk in the way that noun and adjective chunks have been posited. Some ad hoc decisions need to be taken to tackle the V2 problem, as the language data is far from an ideal fit for our perceived notion of chunk.
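The discontinuity described above can be made concrete with the four transliterated examples. The following Python sketch reads the `word/TAG` notation used in those examples, lists the 1-based positions of the verbal tokens (VAUX and VM) and reports whether other material intervenes between them. It counts token positions, which coincide with constituent positions in these examples only because every preverbal constituent is a single word; the function names are hypothetical.

```python
VERB_TAGS = {"VAUX", "VM"}

def parse_tagged(sentence):
    """Split 'word/TAG' items into (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in sentence.split()]

def verb_slots(tagged):
    """Return the 1-based positions of all VAUX and VM tokens."""
    return [i for i, (_, tag) in enumerate(tagged, start=1) if tag in VERB_TAGS]

def has_gap(slots):
    """True if two successive verbal tokens are separated by other tokens."""
    return any(later - earlier > 1 for earlier, later in zip(slots, slots[1:]))

# The four example sentences given in the text.
examples = [
    "farooq/NNP chu/VAUX batI/NN khevaan/VM",
    "farooq/NNP chu/VAUX aasaan/VAUX batI/NN khevaan/VM",
    "farooq/NNP kya/WH chu/VAUX reyaazas/NNP divaan/VM",
    "farooq/NNP kya/WH chu/VAUX aasaan/VAUX reyaazas/NNP divaan/VM",
]

results = []
for sentence in examples:
    slots = verb_slots(parse_tagged(sentence))
    results.append((slots, has_gap(slots)))
    print(slots, has_gap(slots))
```

In all four examples the check reports a gap, which is why the auxiliary and the main verb cannot be enclosed in one contiguous verb chunk.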

    1. Headless Head

Adverbs are considered the most floating or movable elements in a sentence. They frequently occur discontinuously, away from their heads (VMs): at the beginning, at the final position or elsewhere in the sentence. Sometimes, however, they occur adjacent to the VM they modify and thus become part of a VGF, VGNF or VGNN. When adverbs (RB) occur discontinuously, they have no governing or influencing head adjacent to them and are an authority in themselves. Under such circumstances, RBs can be considered heads, though they are pseudo-heads and still have a clear-cut dependency relation with their distant head, which is also the ultimate head, the root. Evidently, a dependency relation of the lower level (chunk level) has been promoted to a dependency relation of the higher level (argument structure level) in order to handle the discontinuous verb chunk.

    1. PP No More a Head

Since the notion of functional heads is well known in both constituency-based and dependency-based frameworks, at least for exocentric constructions, those frameworks distinguish between a cluster of words (N-PP) that has the noun as its head and one that has the post/preposition as its head. In other words, they properly distinguish when an N-PP cluster is a noun phrase/chunk and when it is a post/prepositional phrase/chunk. Here, however, no such distinction has been drawn on a functional basis, i.e. on the basis of functioning as an argument or an adjunct. In all clusters of words containing a noun, the noun has been treated as the head and never the pre/postposition, irrespective of the fact that some of these clusters perform core functions (subject or object) and many perform merely subsidiary (adverbial) functions in the sentence. This uniformity has been maintained at this level because the underlying notion of head in PCG (Bharati, 1995) is essentially a semantic notion, with few exceptions. Function words are devoid of semantic content and, under this theory, cannot be treated as heads. Therefore, pre/postpositional phrases or chunks cannot exist here, unlike what was originally posited for exocentric constructions in the Bloomfieldian and Post-Bloomfieldian literature, as already discussed in chapter two, and later appeared in PSG (Chomsky, 1956) and DG (Tesnière, 1959). It is worth mentioning that in these works NPs are generally arguments and PPs are adjuncts; this distinction has been avoided here for the sake of the theory and is encoded at the next level of annotation.

    1. Junction Still a Head

The notion of dependency does not always provide unambiguous solutions for exocentric constructions. Dependency representation is at a loss when it comes to paratactic phenomena such as coordination, whose nature is symmetric (two or more conjuncts play the same role), as opposed to the head-modifier asymmetry of dependencies (Popel et al., 2013). In other words, coordination is a pending problem in the analysis of natural languages, and both PSG and DG struggle with it (Hudson, 1988; Covington, 1980). Conjuncts are also exocentric constructions but, given their crucial role in the structural organization of sentences, they have not been treated as endocentric constructions in the way that pre/postpositional phrases/chunks have been.

    1. Negation and Double Negation

Kashmiri has negative elements in free form, such as na and nI (no and not), as well as in bound form, such as -nI (not) in khe-yi-nI (will not eat). Sometimes there is also double negation, e.g. khe-yi-nI kehn (will not eat no) and na saa na (no +honorific not). The negative markers do not belong to this level, and certain negative particles in double negation constructions (see the above examples), which do have obvious heads, are of no concern here and cannot be projected as chunks. However, some negative particles, either solitary or in clusters (RP-RP), do not have any obvious head and have the potential to be heads themselves.

    1. Discourse Elements

Discourse elements are the particles that have been tagged as particle default (RPD) at the POS level. They conjoin sentences at the semantic or discourse level to bring cohesion to the text. They are extraneous to the existing set of chunks and, like conjunctions, do not seem to be dependents of any existing semantic head, in spite of being function words. It must be noted that discourse elements have also been treated as heads (connectives) in discourse treebanks.

    1. Relational Confusion

As mentioned earlier, two kinds of grammatical relations have to be handled at the chunk level: lower-level dependencies, e.g. between JJ and N or RB and VM, and part-whole relations, e.g. between N and PP, N and RP, or VAUX and VM. It is more productive to focus on one type of relation at a time. Therefore, one needs to keep track of the kind of relation being handled, without confusing dependencies with part-whole relations.
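One simple way to keep the two kinds of relation apart is an explicit lookup table. The sketch below encodes only the tag pairs named in the paragraph above; the table, the constants and the function name are illustrative assumptions and are not part of the annotation scheme itself.

```python
DEPENDENCY = "dependency"
PART_WHOLE = "part-whole"

# Tag pairs named in the text, mapped to the kind of relation they hold.
RELATION_TYPE = {
    ("JJ", "N"): DEPENDENCY,
    ("RB", "VM"): DEPENDENCY,
    ("N", "PP"): PART_WHOLE,
    ("N", "RP"): PART_WHOLE,
    ("VAUX", "VM"): PART_WHOLE,
}

def relation_type(first, second):
    """Look up the relation kind for a tag pair, in either order.

    Returns None for pairs that the text does not mention.
    """
    return RELATION_TYPE.get((first, second)) or RELATION_TYPE.get((second, first))

print(relation_type("JJ", "N"))
print(relation_type("VM", "VAUX"))
print(relation_type("N", "VM"))
```

Returning None for unlisted pairs makes it visible when a pair falls outside the two kinds of relation that this level is meant to handle.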


  1. Statistical Results

The quantitative results are given in terms of chunk statistics and qualitative results are given in terms of a miniature guideline.

The four datasets that had been used for POS tagging were reduced to three by merging the second and third ones. These three datasets were used for chunking, which was carried out with the SA Interface of Sanchay as mentioned earlier, and the chunk frequency of each dataset was obtained with the help of the same interface. The resulting frequency distribution table was later used to calculate the cumulative frequency and the percentage of the chunks. The same data is represented in the bar charts given in Fig.3.a and Fig.3.b.
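The calculation described above, from per-dataset frequencies to cumulative frequencies and percentages, can be sketched as follows. The counts in this example are invented placeholders chosen for easy arithmetic; they are not the corpus figures, which are available only in the charts.

```python
from collections import Counter

# Placeholder chunk frequencies for three datasets (NOT the real corpus counts).
datasets = [
    Counter({"NP": 50, "VGF": 20, "CCP": 10}),
    Counter({"NP": 55, "VGF": 25, "CCP": 10}),
    Counter({"NP": 45, "VGF": 15, "CCP": 20}),
]

# Cumulative frequency: add the per-dataset counts label by label.
cumulative = sum(datasets, Counter())

# Percentage of each chunk label in the pooled data.
total = sum(cumulative.values())
percentages = {label: 100 * count / total for label, count in cumulative.items()}

# Labels in ascending order of cumulative frequency.
ascending = sorted(cumulative, key=cumulative.get)

print(dict(cumulative))
print(percentages)
print(" < ".join(ascending))
```

The same three steps apply unchanged to the real frequency table once the counts exported from the interface are substituted for the placeholders.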

Figure.3.a. Showing Cumulative Frequency of Chunks

The three datasets consist of 682 POS-annotated sentences, which in turn consist of 8125 chunks classified into ten chunk categories. The most frequent chunk is NP and the least frequent chunk is NEGP, as shown in Fig.3.a. The height of each bar in the figure is directly proportional to the frequency of the item it represents. The ascending order of frequency of the chunks is therefore as follows:

NEGP < BLK < RBP < VGNN < VGNF < JJP < CCP < VGF < NP

That NP is the most frequent chunk and VGF the second most frequent is expected from the empirical facts about POS categories given in Chapter-IV. The chunk statistics reveal an important empirical fact: 27.630% of clauses show the V2 phenomenon, and 72.369% of clauses are devoid of it, with tense condensed in the main verb itself. It must be noted, however, that not only tensed verbs have been considered finite but all verbs that have not become de-verbal and that carry aspectual or modal information; this is why a comparatively small percentage of finite clauses has been found with the V2 phenomenon, a percentage which could otherwise have been larger. The statistical results reveal another empirical fact: 78.438% of verbs in Kashmiri are finite and 21.565% are non-finite. Of the non-finite forms, 46.913% are gerunds and the remaining 53.086% are other non-finite forms.

The bar diagram in Fig.3.b shows the data of Fig.3.a in terms of percentages. This was done to reveal the striking quantitative similarities among the three datasets and to put forward a numerical generalization about the percentage of the various chunks in the corpus, so that one can reliably claim that NPs constitute more than 50% of the chunks in Kashmiri.

Figure.3.b. Showing Relative Proportion of Chunks




  1. Chunking Guidelines

The chunking guidelines comprise the decisions that were taken to resolve the issues raised while chunking the data. These guidelines can be followed to achieve consistency in future chunking tasks.

  1. The auxiliaries and main verbs need to be projected independently as chunks (AUXP and VGF), so that the non-adjacency problem can be settled at the next level by positing a relation between them in which the VM is the head of the VAUX. The solution may sound odd to anyone preoccupied with the popular notions of syntax, but it is the surface form that is being accounted for here, through surface-level manipulations, without positing abstract layers and categories, which has been the popular practice. Moreover, the purpose here is not to contribute to or challenge any theoretical paradigm but simply to produce a well-grounded, data-driven grammar which a parser can learn or from which a probabilistic grammar can be extracted.

  2. Though a conjunction cannot be a semantic head, it has been decided that conjunctions should be treated as heads and projected as chunks under the label CCP.

  3. The negative particles have scope over the entire sentence rather than over a single word or phrase; that is, they are involved in sentential negation. Such particles should be projected as chunks under the label NEGP.

  4. Discontinuous adverbs have quite a high frequency but, as mentioned earlier, in spite of occurring at long distances from the semantic head, they are still dependents of the verb at the lower level. They need to be projected as chunks under the label RBP only to handle the discontinuity.

  5. Since discourse particles have no role in the internal organization of a sentence, they cannot belong to any of the other chunks proposed in the tagset, which are needed to account for the internal organization of a clause or a sentence. Therefore, they must be projected as separate chunks under the label BLK.

  6. MWEs, which include named entities, compound words and izaafat constructions, are POS-level problems that have been handled by concatenating ‘C’ to the tag, but they remain separate tokens, which can be confusing. Care must be taken that all adjacent or contiguous POS-tagged tokens with a ‘C’-marked tag are considered one word, so that together they are either a head or a dependent. It should not be seen as a problem that they apparently give rise to very big chunks.

  7. It has been found that discontinuous noun phrases are a rarity, unlike discontinuous verb phrases. Adjectives, however, do occur either as predicative adjectives or as the adjectival component of complex predicates; these are genuinely heads and should be projected as chunks and assigned the label JJP.
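Guideline 6 lends itself to a small illustration. The sketch below groups runs of adjacent tokens whose tag carries the ‘C’ marker into single units, so that each run can later be treated as one head or one dependent. The set of ‘C’-marked tags is an assumption based on the tags that occur in the two sample sentences (NNC and NNPC); testing for a trailing ‘C’ alone would wrongly include PUNC. The function name is hypothetical, and neighbouring runs with different ‘C’-marked tags would be merged by this simplified version.

```python
# Assumption: only these 'C'-marked tags are treated as parts of one word.
COMPOUND_TAGS = {"NNC", "NNPC"}

def merge_compounds(tokens):
    """Group adjacent compound-tagged tokens; other tokens stay on their own."""
    units = []
    run = []
    for word, tag in tokens:
        if tag in COMPOUND_TAGS:
            run.append((word, tag))
            continue
        if run:
            units.append(run)  # close the pending compound run
            run = []
        units.append([(word, tag)])
    if run:
        units.append(run)
    return units

# Tokens of chunks 2 and 3 of sentence 42 (fine-grained tags only).
tokens = [
    ("صدر", "NNPC"),
    ("محمود", "NNPC"),
    ("احمدی", "NNPC"),
    ("نژادَن", "NNPC"),
    ("چُھ", "VAUX"),
    ("ووٛنمُت", "VM"),
]
units = merge_compounds(tokens)
print([len(unit) for unit in units])
```

Here the four NNPC tokens form one unit, which matches the single four-word NP chunk given for this sentence.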
