Chapter 1. Introduction



Date: 07.08.2018


25 Talbanken was recently reconstructed into Talbanken-05 (Nivre et al. 2006).

26 Funded by the Linguistic Data Consortium for Indian Languages (LDCIL).

27 KashTreeBank started as a summer school project at the IIIT Hyderabad Advanced Summer School for Natural Language Processing (IASNLP 2011).

28 The treebanks given here are mainly taken from Kakkonen (2006).


29 Note that a hierarchical structure has been avoided and a flat structure preferred, in order to reduce the number of attachment ambiguities.

30 The corpus for the treebank was obtained from the METU Turkish Corpus (Atalay et al., 2003), hence the name of the treebank.

31 Taken from Huang et al. (2003*), “Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface”.

32 Linguists such as Firth (1930s), Jespersen (1940s), Franz Boas (1940s), Sapir (1950s), Bloomfield (1950s), Harris (1950s) and Fries (1950s) practised this empirical brand of linguistic research (see Biber and Finegan 1991: 207).

33 Unsophisticated (Pen & Paper technology)

34 The method used in empirical research is quantitative in nature: in addition to documenting structural and functional analysis, it stores numbers such as frequency counts and probability weights with the items of analysis. This augmentation with statistical information makes linguistic data richer in information. The information richness and machine readability of such data make it the preferred kind of data for language technology and NLP research.
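As an illustration of the kind of statistical augmentation described in note 34, the following minimal sketch attaches a frequency count and a relative frequency (a simple probability weight) to each item of analysis. The function name `frequency_profile` and the toy tag sequence are assumptions made for illustration; they are not taken from the study.

```python
from collections import Counter


def frequency_profile(items):
    """Map each item to (count, relative frequency) over the given sequence."""
    counts = Counter(items)
    total = sum(counts.values())
    # An empty input yields an empty profile, so no division by zero occurs.
    return {item: (n, n / total) for item, n in counts.items()}


# Hypothetical toy sample of POS tags, not data from the study.
sample_tags = ["N", "V", "N", "ADJ", "N", "V"]
profile = frequency_profile(sample_tags)

for tag, (count, weight) in sorted(profile.items()):
    print(f"{tag}\t{count}\t{weight:.3f}")
```

Storing the count and the weight next to each item is what makes such data directly usable by statistical taggers and parsers.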

35 Kolhapur Corpus of Indian English

36 Technology Development for Indian Languages

38 In the Department of Linguistics, Kashmir University, under the DIT-funded project “Development of Kashmiri Language Technology Tools” (see kashmirizaban.com).

39 LDCIL stands for Linguistic Data Consortium for Indian Languages, which is set up at CIIL. It is a scheme of MHRD, Govt. of India, with the goal of creating annotated language resources for all ILs (for details see ldcil.org).

40 It is worth mentioning that Kashmiri was intermittently introduced into and removed from the school curriculum, and that it has recently been re-introduced as a school subject. This is probably the reason why most young people are unable to read and write Kashmiri, whereas elders and children are well versed in it.

41 It was observed during fieldwork that there is hardly any text from domains other than Aesthetics (Bhat 2012). The fieldwork was done for LDCIL, and data were collected from 270 books in order to develop a balanced corpus for Kashmiri.

42 Unicode Transformation Format; see http://www.unicode.org and http://www.unicode.org/versions/Unicode5.2.0


43 Extensible Markup Language

44 In Kashmiri, the letter (ہ) is used in word-final position solely to support the preceding diacritic, which cannot stand on its own. In such cases (ہ) is a pseudo-character, as it does not represent any part of the phonological word.

45 They have a low morpheme-per-word ratio; the lower this ratio, the more isolating the language is said to be. Purely isolating languages, e.g. Mandarin, have a 1:1 word-to-morpheme ratio, i.e. a one-to-one correspondence between words and morphemes.

46 In contrast to isolating languages, they have a high morpheme-per-word ratio; the higher this ratio, the more inflectional the language is said to be, e.g. Indo-European languages (also known as low synthetic languages).

47 They have the highest morpheme-per-word ratio and, in addition, a low degree of fusion of major grammatical categories, e.g. Turkish, Tamil, Malayalam, Telugu, etc. (also known as polysynthetic languages).
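The morpheme-per-word ratio that notes 45 to 47 rely on can be sketched as follows. This is a minimal illustration that assumes the words have already been segmented into morphemes by hand; the function name `morphemes_per_word` and the two toy inputs (romanised, without tone marks) are assumptions added here, not material from the study.

```python
def morphemes_per_word(segmented_words):
    """Average number of morphemes per word.

    segmented_words: a list of words, each given as a list of its morphemes.
    """
    if not segmented_words:
        raise ValueError("at least one word is required")
    total_morphemes = sum(len(word) for word in segmented_words)
    return total_morphemes / len(segmented_words)


# Hand-segmented toy inputs, for illustration only.
mandarin_like = [["wo"], ["ai"], ["ni"]]      # three one-morpheme words
turkish_like = [["ev", "ler", "im", "de"]]    # one word built from four morphemes

print(morphemes_per_word(mandarin_like))  # ratio of a purely isolating sample
print(morphemes_per_word(turkish_like))   # ratio of a heavily affixing sample
```

A ratio of exactly 1 corresponds to the purely isolating case of note 45; higher values move towards the types described in notes 46 and 47.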

48 In such scripts some letters (joiners, but not non-joiners) take different shapes when joined to adjacent letters. A joiner can take one of three shapes, depending on whether it occurs in initial, medial or final position (context) in a concatenated sequence of letters within a word. The letters of the other set, called non-joiners, do not take these three shapes: they join only with the letter immediately preceding them and thus have only word-final and isolated variants. Examples of joiners are the Arabic letters ‘te, miim, ye, be, siin’ (ت م ی ب س); examples of non-joiners are ‘vaav’ and ‘re’ (و ر ۄ ڑ).
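The joining behaviour described in note 48 can be sketched as a small lookup over the two letter sets given there. This is a simplified illustration, not a full shaping engine: it only knows the nine letters listed in the note, treats any other character as isolated, and the names `JOINERS`, `NON_JOINERS` and `positional_forms` are assumptions introduced here.

```python
# Letter sets from note 48, written as code points to avoid bidirectional
# display problems in source code.
JOINERS = set("\u062A\u0645\u06CC\u0628\u0633")   # te, miim, ye, be, siin
NON_JOINERS = set("\u0648\u0631\u06C4\u0691")     # vaav, re and their variants


def positional_forms(word):
    """Return (letter, form) pairs, form being initial/medial/final/isolated."""
    result = []
    for i, letter in enumerate(word):
        prev_letter = word[i - 1] if i > 0 else None
        next_letter = word[i + 1] if i + 1 < len(word) else None
        known = letter in JOINERS or letter in NON_JOINERS
        # Any known letter attaches to a preceding joiner.
        joins_prev = known and prev_letter in JOINERS
        # Only joiners attach to a following known letter.
        joins_next = letter in JOINERS and (
            next_letter in JOINERS or next_letter in NON_JOINERS
        )
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        result.append((letter, form))
    return result


# be + siin + miim: every letter is a joiner.
for letter, form in positional_forms("\u0628\u0633\u0645"):
    print(f"U+{ord(letter):04X}\t{form}")
```

Because a non-joiner never attaches to the letter that follows it, it can only come out as final or isolated, which matches the two variants mentioned in the note.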

49 The term split-orthography is used because the existing literature offers no technical term for the splitting tendency in the Perso-Arabic script (an orthographic convention), whereby affixes and roots are written separately and even some roots are written as two tokens, forming multi-token words. The term is, in a way, a new coinage to describe the tokenization problem of Kashmiri, Urdu, etc. (S. Bhat, 2010 & 2012).



50 Person, Number, Gender, Case

51 Tense, Aspect, Mood

52 The EAGLES (Expert Advisory Group for Language Engineering Standards) guidelines provide recommendations for the standardization of a range of language engineering resources. The recommendations referred to here are solely those on the morpho-syntactic annotation of texts.

53 The European Language Resources Association (ELRA)

54 International Standards for Language Engineering (ISLE)

55 The principles are given in ‘Linguistic Resource Standards for POS Tag Set for Indian Languages’, documentation by D. M. Sharma, May 2010 (MS).


56 Simple inventory of unrelated POS tags

57 The term “hierarchical”, when used for a tagset, means that the categories in that tagset are structured relative to one another. Rather than a large number of independent categories, a hierarchical tagset will contain a small number of categories, each of which contains a number of sub-categories, each of which may contain sub-sub-categories, and so on, in a tree-like structure (ibid).
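The tree-like structure described in note 57 can be sketched as nested dictionaries, with one level per degree of sub-categorisation. The category labels below are hypothetical placeholders chosen for illustration; they are not the tagset used in the study, and `tag_paths` is an assumed helper name.

```python
# Hypothetical three-level tagset: category -> sub-category -> sub-sub-category.
TAGSET = {
    "N": {"NN": {}, "NNP": {}},
    "V": {"VM": {"VF": {}, "VNF": {}}, "VAUX": {}},
}


def tag_paths(tree, prefix=()):
    """Yield every category as a path from the top level down to that category."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from tag_paths(children, path)


for path in tag_paths(TAGSET):
    print(" > ".join(path))
```

A flat tagset would correspond to a dictionary with a single level, in which no category contains another.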

58 The Computational Grammar Coder (CGC; Klein and Simmons 1963) was designed as a component of a parser (in turn a component of a system to synthesize human language behaviour).

59 A tag is considered “decomposable” if the string that represents it consists of one or more characters (or groups of characters), each of which represents the same feature wherever it occurs within the same tagset. For example, any noun tag that combines an N for “noun” with other characters indicating further features of the word is decomposable (e.g. N.SG.MAS.dir).
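The decomposability described in note 59 can be illustrated by splitting the example tag into its parts and looking each part up in a single table. Only the reading of N as “noun” comes from the note; the readings of SG, MAS and dir as singular, masculine and direct case, as well as the names `FEATURE_CODES` and `decompose`, are assumptions made for this sketch.

```python
# Each code has one meaning throughout the tagset, which is what makes
# tags built from these codes decomposable.
FEATURE_CODES = {
    "N": ("category", "noun"),        # stated in the note
    "SG": ("number", "singular"),     # assumed reading
    "MAS": ("gender", "masculine"),   # assumed reading
    "dir": ("case", "direct"),        # assumed reading
}


def decompose(tag, separator="."):
    """Split a decomposable tag into a dict of attribute -> value."""
    features = {}
    for code in tag.split(separator):
        attribute, value = FEATURE_CODES[code.strip()]
        features[attribute] = value
    return features


print(decompose("N.SG.MAS.dir"))
```

An unknown code raises `KeyError`, which is a cheap way of catching tags that do not follow the tagset's conventions.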

60 It was Greene & Rubin’s POS tagset that was used in annotating the Brown Corpus; it was refined slightly at a later stage of the project (see Francis and Kučera 1982: 3-15) and came to be known as the Brown tagset. It consists of 87 tags; allowing for compound tags, the number of potential analyses for any given orthographic form is 179 (Sampson 1987).

61 Constituent Likelihood Automatic Word Tagging System

62 The CLAWS2 tagset was the basis for the much larger, much finer-grained SUSANNE Word-tag Set (Sampson 1995: 79-149; circa 360 tags).

63 University of Pennsylvania, USA

64 Enabling Minority Language Engineering

65 Indian Language Machine Translation is a consortium project for developing MT systems for major IL pairs. It has been set up at IIIT Hyderabad and is funded by DIT.

67 Bureau of Indian Standards

68 Note: Two tagsets with considerable differences were developed for Kashmiri on the basis of the BIS format: one at KU (which proposed a fine-grained distinction in verb classification) and the other at LDCIL (which, like Hindi-Urdu, avoided the fine-grained distinction). They were later combined at the National Workshop on BIS & ILCI (2011), held at the LTRC Lab, IIIT Hyderabad. The present Kashmiri tagset is this unpublished collaborative work, as proposed by LDCIL, and this is the first time it has been used in any research.

69 It is a cover term that was originally used in LDCIL tagsets. It includes personal pronouns (I, you, he, etc.) that have persons (+human) as antecedents; pronouns (this, that, it) that have animates (-human) or inanimates as antecedents; and discourse-deictic pronouns (this, that, it) that have a whole proposition as antecedent, e.g. “John abused Mary. It is clearly a violation.”

70 It literally means pointing out.

71 http://sanchay.co.in


72 http://shakti.iiit.ac.in


73 The parse-tree of a chunk is a sub-graph of the global parse-tree (ibid, 1994).

74 It can be considered equivalent to the popular phrasal level.

75 It must be noted that some of the chunks, though conceptually different from those of other ILs, have been assigned the same tags as in other IL treebanks, on the understanding that tags, like words, are arbitrary in nature. There is therefore no point in objecting, for instance, that the finite verb chunk is tagged VGF rather than VCF, or that noun chunks are tagged NP rather than NC. This was done purely to keep the door open for easy resource sharing.

76 The strict notion that only lexical items can be heads seems to be diluted by projecting certain chunks from function words, e.g. CCP, NEGP and BLK.

