These days, Unicode has become the prime choice of character encoding for the creation of text corpora. Unicode is the universal character encoding standard: it defines a consistent scheme for encoding multilingual text and assigns a numeric value (code point) and a name to each of its characters. Unicode characters are represented in three UTF forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16) and an 8-bit form (UTF-8). UTF-8 has been designed for ease of use with existing ASCII and ISCII-based systems. The standard contains more than 1 million code points, most of which are available for the encoding of characters (Allen et al., 2009). The availability of a Unicode-compatible font is a prerequisite for developing a Unicode-compatible corpus. As mentioned earlier, Kashmiri has only one Unicode-compatible font with minimal issues, i.e. Afan Koshur Naksh (Aadil 2011), which is being used for major NLP-related work in many projects. It has also been used for the development of the current corpus. Table 2 shows the encoding of Kashmiri characters employed in developing KashCorpus.
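| S. No | Characters | Unicode Values | S. No | Characters | Unicode Values |
|---|---|---|---|---|---|
| 1 | ا | 0627 | 30 | ل | 0644 |
| 2 | ب | 0628 | 31 | م | 0645 |
| 3 | پ | 067E | 32 | ن | 0646 |
| 4 | ت | 062A | 33 | و | 0648 |
| 5 | ٹ | 0679 | 34 | ہ | 06C1 |
| 6 | ث | 062B | 35 | ھ | 06BE |
| 7 | ج | 062C | 36 | ء | 0621 |
| 8 | چ | 0686 | 37 | ی | 06CC |
| 9 | ح | 062D | 38 | ے | 06D2 |
| 10 | خ | 062E | 39 | َ | 064E |
| 11 | د | 062F | 40 | آ | 0622 |
| 12 | ڈ | 0688 | 41 | ٲ | 0672 |
| 13 | ذ | 0630 | 43 | ِ | 0650 |
| 14 | ر | 0631 | 44 | ٖ | 0656 |
| 15 | ڑ | 0691 | 45 | ُ | 064F |
| 16 | ز | 0632 | 46 | ٗ | 0657 |
| 17 | ژ | 0698 | 47 | ٕ | 0654 |
| 18 | س | 0633 | 48 | ٔ | 0655 |
| 19 | ش | 0634 | 49 | ٚ | 065A |
| 20 | ص | 0635 | 50 | ٛ | 065B |
| 21 | ض | 0636 | 51 | ً | 064D |
| 22 | ط | 0637 | 52 | ، | 061B |
| 23 | ظ | 0638 | 53 | ۔ | 06D4 |
| 24 | ع | 0639 | 54 | ؟ | 061F |
| 25 | غ | 063A | 55 | ۄ * | 1732 |
| 26 | ف | 0641 | 56 | ۭ * | 1773 |
| 27 | ق | 0642 | 57 | ۍ * | 1741 |
| 28 | ک | 06A9 | 58 | ٮ۪ * | 1646 + 1770 |
| 29 | گ | 06AF | | | |

Table 2. Kashmiri Unicode Chart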
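The relationship between a character, its code point, its name and its three UTF forms can be checked with a short Python sketch. This is only an illustration, not part of the KashCorpus tooling: the helper name `describe` is hypothetical, and the reading of the starred rows as decimal values is an assumption based on the fact that 1732 equals hexadecimal 06C4.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return the code point, Unicode name and encoded sizes of one character."""
    return {
        "code_point": f"{ord(ch):04X}",             # hexadecimal, as in the chart
        "name": unicodedata.name(ch, "<unnamed>"),
        "utf8_bytes": len(ch.encode("utf-8")),      # 8-bit form
        "utf16_bytes": len(ch.encode("utf-16-be")), # 16-bit form, no byte-order mark
        "utf32_bytes": len(ch.encode("utf-32-be")), # 32-bit form, no byte-order mark
    }

# Row 3 of the chart: the letter listed with value 067E.
info = describe("\u067E")

# Assumption: the starred rows (55-58) give decimal values.
# Converting them to hexadecimal yields the usual four-digit notation.
starred_decimal = [1732, 1773, 1741, 1646, 1770]
starred_hex = [f"{n:04X}" for n in starred_decimal]
```

Running `describe` over every character in the chart is a quick way to cross-check the listed values against the characters actually typed.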
Text Encoding
The term text encoding refers to the practice of representing textual and linguistic data in a particular format in a corpus. A standard encoding format provides the greatest possible generality and flexibility (McEnery & Wilson, 1996). XML is the emerging standard for data representation and exchange on the World Wide Web (Bray, Paoli & Sperberg-McQueen, 1998). At the fundamental level, XML is a document markup language derived directly from SGML, with additional features that make it a far more powerful tool for data representation and access. XML is therefore the natural choice these days for storing a corpus. It provides the needed standardization, so that a user who is unfamiliar with the corpus but familiar with XML DTDs can easily interface with it. For the current KashCorpus, however, no markup language or XML DTDs were used. Instead, the entire corpus has been rendered in plain document (.doc) format, because the corpus had only one purpose, namely syntactic annotation, and for that purpose an XML format was not necessary; a plain text (.txt) format in UTF-8 was sufficient. The corpus can nevertheless be easily converted into XML format.
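A minimal sketch of such a conversion is given here in Python, using the standard library module `xml.etree.ElementTree`. It does not describe a procedure used for KashCorpus: the function name and the element names (`corpusFile`, `metadata`, `source`, `text`, `s`, `wordCount`) are hypothetical, no DTD is assumed, and the sample text is a Latin-script placeholder rather than corpus data.

```python
import xml.etree.ElementTree as ET

def corpus_file_to_xml(file_id: str, source: str, text: str) -> str:
    """Wrap one plain-text corpus file in a small XML document (hypothetical schema)."""
    root = ET.Element("corpusFile", attrib={"id": file_id})
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "source").text = source
    body = ET.SubElement(root, "text")
    # One <s> element per non-empty line; real sentence splitting is out of scope.
    for line in text.splitlines():
        line = line.strip()
        if line:
            ET.SubElement(body, "s").text = line
    # Word count covers the data only, not the metadata.
    ET.SubElement(root, "wordCount").text = str(len(text.split()))
    return ET.tostring(root, encoding="unicode")

# Placeholder content standing in for the UTF-8 text of one corpus file.
xml_doc = corpus_file_to_xml("KashCorp11", "Sangarmaal", "first line\nsecond line here\n")
```

Because the plain text is already UTF-8, the characters pass into the XML unchanged; only the markup around them is new.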
Data Entry
Data entry is the cornerstone of any corpus-building endeavour. It is a time-consuming task, especially for a language such as Kashmiri, whose users are accustomed to word processors (like InPage) that use different encoding standards and are not compatible with Unicode, and who are not yet very familiar with Microsoft Word. The manually marked-up news items and articles from Sangarmaal and the short stories from an anthology were then typed in. It took a professional data inputter 8 days to input 46394 words of Kashmiri newspaper text in Microsoft Word, with an average of 5051 words per day (5-7 hrs). The resulting corpus was found to be unclean, i.e. it contained many typos and spacing problems, and was still unfit for the next level of processing. It is well-established general practice that a corpus must be sanitized and preprocessed before it is put to actual use. A sample of the unclean corpus is given below in Table 2. It contains three parts: a) Metadata (information about the data available in the corpus); b) Data (the text on which the actual work is done); c) Word Count (the number of words in the actual data, excluding metadata).
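As a rough illustration of the kind of sanitizing meant here, the Python sketch below collapses irregular spacing and counts the words of the data part only. It rests on assumptions: "space problems" is taken to mean runs of extra whitespace, words are taken to be whitespace-delimited tokens, and the function names and the placeholder record are hypothetical. Typos cannot be repaired this way and still need manual correction.

```python
import re

def normalize_spaces(text: str) -> str:
    """Collapse every run of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

def word_count(data: str) -> int:
    """Count whitespace-delimited tokens in the data part, excluding metadata."""
    return len(normalize_spaces(data).split())

# Placeholder record with the three parts described above (not real corpus text).
record = {
    "metadata": "File ID No. KashCorp11",
    "data": "  word1   word2\tword3 \n word4 ",
}
cleaned = normalize_spaces(record["data"])
record["word_count"] = word_count(record["data"])
```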
METADATA
File ID No.: KashCorp11
Newspaper Details: سنگرمال سرینگر : جلد: ۵ شمارِ : ۱۷ ، ۳ تا ۹ میٖٔ ۲٠۱٠