Chapter 1. Introduction




Table.1 Metadata of Sample Newspaper corpus

    1. Character Encoding

These days, Unicode has become the prime choice of character encoding for text corpus creation. Unicode is the universal character encoding standard; it defines a consistent scheme for encoding multilingual text and assigns a numeric value (code point) and a name to each of its characters. Unicode characters are represented in three encoding forms (UTF): a 32-bit form (UTF-32), a 16-bit form (UTF-16) and an 8-bit form (UTF-8). UTF-8 has been designed for ease of use with existing ASCII and ISCII-based systems. The standard contains more than 1 million code points, most of which are available for the encoding of characters (Allen et al., 2009). The availability of a Unicode-compatible font is a prerequisite for developing a Unicode-compatible corpus. As mentioned earlier, Kashmiri has only one Unicode-compatible font with minimal issues, namely Afan Koshur Naksh (Aadil 2011), which is used in many projects for major NLP-related work. It has also been used for the development of the current corpus. Table 2 shows the encoding of the Kashmiri characters employed in developing KashCorpus.
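The difference between the three encoding forms can be illustrated with a minimal Python sketch. It is not part of the corpus tooling; the helper name `encoded_lengths` is hypothetical, and the big-endian variants are chosen only so that no byte-order mark is added to the byte counts.

```python
# Compare how many bytes one character occupies in each Unicode encoding form.
# The "-be" (big-endian) codecs are used so that no byte-order mark is counted.

def encoded_lengths(s: str) -> dict:
    """Return the byte length of s in UTF-8, UTF-16 and UTF-32."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

kashmiri_letter = "\u06C4"   # a letter from the Arabic block used for Kashmiri
ascii_letter = "A"

print(encoded_lengths(kashmiri_letter))  # 2 bytes, 2 bytes, 4 bytes
print(encoded_lengths(ascii_letter))     # 1 byte, 2 bytes, 4 bytes

# For ASCII characters, the UTF-8 bytes are identical to the ASCII bytes,
# which is what makes UTF-8 easy to use with ASCII-based systems.
print(ascii_letter.encode("utf-8") == ascii_letter.encode("ascii"))
```

A character from the Arabic block takes two bytes in UTF-8, while an ASCII character takes one and keeps its original byte value.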


S. No    Characters    Unicode Values    S. No    Characters    Unicode Values
1        ا             0627              30       ل             0644
2        ب             0628              31       م             0645
3        پ             067E              32       ن             0646
4        ت             062A              33       و             0648
5        ٹ             0679              34       ہ             06C1
6        ث             062B              35       ھ             06BE
7        ج             062C              36       ء             0621
8        چ             0686              37       ی             06CC
9        ح             062D              38       ے             06D2
10       خ             062E              39       َ             064E
11       د             062F              40       آ             0622
12       ڈ             0688              41       ٲ             0672
13       ذ             0630              43       ِ             0650
14       ر             0631              44       ٖ             0656
15       ڑ             0691              45       ُ             064F
16       ز             0632              46       ٗ             0657
17       ژ             0698              47       ٕ             0654
18       س             0633              48       ٔ             0655
19       ش             0634              49       ٚ             065A
20       ص             0635              50       ٛ             065B
21       ض             0636              51       ً             064D
22       ط             0637              52       ،             061B
23       ظ             0638              53       ۔             06D4
24       ع             0639              54       ؟             061F
25       غ             063A              55       ۄ *           1732
26       ف             0641              56       ۭ *           1773
27       ق             0642              57       ۍ *           1741
28       ک             06A9              58       ٮ۪ *           1646 + 1770
29       گ             06AF

Table.2 Kashmiri Unicode Chart
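Most values in the chart are four-digit hexadecimal code points, whereas the starred entries (rows 55 to 58) appear to be written in decimal; for example, 1732 in decimal equals 06C4 in hexadecimal. The following Python sketch shows one way to check chart entries against the Unicode character database. The decimal reading of the starred rows is an assumption, and the helper names `to_char` and `describe` are hypothetical.

```python
import unicodedata

def to_char(value: str, base: int = 16) -> str:
    """Turn a code point written as digits (hexadecimal by default) into a character."""
    return chr(int(value, base))

def describe(ch: str) -> str:
    """Return the code point in U+XXXX notation together with its Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# Hexadecimal entries taken from the chart (rows 1 and 3).
for value in ("0627", "067E"):
    print(describe(to_char(value)))

# Starred entries, read here as decimal (an assumption, see the text above).
starred = ["1732", "1773", "1741", "1646", "1770"]
starred_hex = [f"{int(v):04X}" for v in starred]
print(starred_hex)

# The same character is obtained from either notation.
print(to_char("1732", base=10) == to_char("06C4"))
```

Running `describe` on a character typed into the corpus is also a quick way to confirm that it matches the value listed for it in the chart.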


    1. Text Encoding

The term text encoding refers to the practice of representing textual and linguistic data in a certain format in a corpus. A standard encoding format provides the greatest possible generality and flexibility (McEnery & Wilson, 1996). XML is the emerging standard for data representation and exchange on the World Wide Web (Bray, Paoli & Sperberg-McQueen, 1998). At the fundamental level, XML is a document markup language directly derived from SGML, with various additional features that make it a far more powerful tool for data representation and access. The natural choice these days is therefore to store a corpus in an XML format. An XML format provides the needed standardization, so that a user who is not familiar with the corpus but is familiar with XML DTDs can easily interface with it. For the current KashCorpus, however, no markup language or XML DTDs were used. Instead, the entire corpus has been rendered in plain document (.doc) format, because the corpus had only one purpose, i.e. to be used for syntactic annotation. For that purpose it was not necessary to have the corpus in XML format; a plain text (.txt) format in UTF-8 was sufficient. Nevertheless, the corpus can easily be converted into XML format.
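As an illustration of how such a conversion might look, the sketch below wraps one plain-text corpus file in a simple XML structure using Python's standard library. The element names (`document`, `metadata`, `field`, `text`) and the function `record_to_xml` are hypothetical; KashCorpus does not define an XML schema or DTD, so this is only one possible layout.

```python
import xml.etree.ElementTree as ET

def record_to_xml(file_id: str, metadata: dict, text: str) -> str:
    """Wrap one corpus file (metadata plus running text) in a small XML document.

    The element and attribute names are illustrative assumptions only.
    """
    doc = ET.Element("document", attrib={"id": file_id})
    meta = ET.SubElement(doc, "metadata")
    for name, value in metadata.items():
        field = ET.SubElement(meta, "field", attrib={"name": name})
        field.text = value
    body = ET.SubElement(doc, "text")
    body.text = text  # special characters such as < and & are escaped on output
    return ET.tostring(doc, encoding="unicode")

xml_string = record_to_xml(
    "KashCorp11",
    {"Item Type": "\u0633\u0646\u06af\u0631\u0645\u0627\u0644"},  # sample value in Perso-Arabic script
    "plain UTF-8 text of the news item goes here",
)
print(xml_string)
```

Because the plain text is already in UTF-8, the conversion only adds structure; the characters themselves are carried over unchanged.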

    1. Data Entry

Data entry is the cornerstone of any corpus-building endeavour. It is a time-consuming task, especially for a language such as Kashmiri, in which people are accustomed to word processors (like InPage) that use different encoding standards and are not compatible with Unicode, and are not yet very familiar with using Microsoft Word. Finally, the manually marked-up news items and articles from Sangarmaal and short stories from an anthology were typed in. It took a professional data inputter 8 days to input 46394 words of Kashmiri newspaper text in Microsoft Word, with an average of 5051 words per day (5-7 hrs). The resulting corpus was found to be unclean, i.e. it contains a lot of typos and spacing problems, and is still unfit for the next level of processing. It is a well-established general practice that a corpus needs to be sanitized and preprocessed before it is put to actual use. A sample of the unclean corpus is given below in Table 2. It contains three parts: a) Metadata (information about the data available in the corpus); b) Data (the text on which the actual work is done); c) Word Count (the number of words of the actual data, excluding metadata).


M E T A D A T A

File ID No.

KashCorp11

Newspaper Details

سنگرمال سرینگر : جلد: ۵ شمارِ : ۱۷ ، ۳ تا ۹ میٖٔ ۲٠۱٠

News Item Title

نَو ٮ۪ن امکانَن لوٚگ زایُن

امن عمل ترٛاوِ کٔشیٖرِ ہٕنٛدِس صورتحالَس پٮ۪ٹھ مثبت اثر، علیحدگی پسندَن ہٕنٛز چَلو سیاست چھنہٕ کارگر

Item Type

سنگرمال تجزیہ

سحالےٕ بھوٹان چِہ رازد انِہ تِمپھو سارک سربراہ اجلاسَسس دوران بھارت تہٕ پاکستان کٮ۪ن وزیراعظمن درمیان سپز مِژ باہمی میٹنِگہ پتہٕ چُھ دۄن ملکن درمیان جامع کَتھ باتھِ ہٕنٛزع مل بٛال سپد نُک اماکن پٲدٕ گوٚمُت۔ حالانکہ دۄشوٕ یَو لیدرَوچھٚے کَتھ باتِھ ہٕنٛز ضرورت واضح کران خٲرجی وزیرن تہٕ خارنہ سیکریٹری یَن یِہ ذِمہٕ دٲری دِژمِژ زِتِم کرَن امِہ با پَتھ ماحول سازگارتہٕ وطریقہٕ کار تلاش۔ مصرٕ کِس شرم الشیخ شہرس منٛز اَکھ ؤری برٛ ونٛہہ دۄن لیڈرَن درمیان ملاقاتہٕ پتہٕ منٛز کَتھ مسلہٕ حل کرنٕچ وٲحد وَتھ قراردِنٕہ آمٕش ٲس، چُھ یِہ ملاقات تہٕ امٕہ پتہٕ دٕنہٕ آمُت بیان زیادٕ محتاط تہٕ منظۭم۔ وزیر اعظم ڈاکٹر منموہن سنگٔھس اوس شرم الشیخ اعلانیہ پتہٕ پارلیمنٹس منٛز زبردست مخالفتکُ بُتھ وُچُھن پیوٚمُت تہٕ امی مخال فٔژ سببہٕ روٗد پٔتۍ مِس پوٗرٕ أ کِس ؤرۍ یَس جامع مذاکرٲتی عمل بحال کرنَس متلق نوِدلہِ ہُنٛد موقف شِٹھہٕ تہٕ ۲۶ ن نومبر ۲٠٠۸ عیسوی کٮ۪ن ممبیٔ حملَن منٛز ملوث نفرن خلاف ٹھوس کاروٲیی کرنہٕ یِنَس سۭتۍ مشروط۔ دوٛیِمہ طرفہٕ روٗد پاکستان اَتھ أکۍ سٕے معاملَس پٮ۪ٹھ تمام امن عمل روٹہٕ کرنہٕ یِنَس پٮ۪ٹھ واو یلا کران تہٕ وَنان زِ پٔتۍ مِس أکِس ؤرۍ یَس دوران چُھ تَتھ ملکس منٛز درجنہٕ وادتِتھی حملہٕ سپدان آمٕتۍ یِتھۍ حملہٕ مُمبیٔ منٛز سپدُن مُت اوس ۔

W-241
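A first sanitizing pass over such a file could normalize spacing before the words are counted. The Python sketch below is only an illustration under stated assumptions: it treats a word as a whitespace-delimited token, which may differ from the convention behind the word count given in the sample, and the function names `normalize_spacing` and `word_count` are hypothetical. It cannot repair typos or spaces wrongly inserted inside a word; those still need manual correction.

```python
import re

# Arabic-script full stop, comma and question mark.
PUNCTUATION = "\u06D4\u060C\u061F"

def normalize_spacing(text: str) -> str:
    """Collapse runs of whitespace and drop a space placed before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(rf" ([{PUNCTUATION}])", r"\1", text)
    return text

def word_count(text: str) -> int:
    """Count whitespace-delimited tokens after spacing has been normalized."""
    return len(normalize_spacing(text).split())

sample = "  word1   word2  word3 \u06D4 "
print(repr(normalize_spacing(sample)))
print(word_count(sample))
```

Only the data part of a file would be passed to `word_count`, since the word count excludes the metadata.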
