Chapter. 1 Introduction


Chapter.3 Creating Corpus for Kashmiri Treebank



Date: 07.08.2018


“There are and can exist but two ways of investigating and discovering truth. The one hurries on rapidly from the senses and particulars to the most general axioms, and from them derives and discovers the intermediate axioms. The other constructs its axioms from the senses and particulars, by ascending continually and gradually, till it finally arrives at the most general axioms.”

Francis Bacon, Novum Organum, Book 1.19 (1620)


  1. Introduction

In rationalistic discourse, competence, the underlying ideal grammatical system in the mind of a native speaker, is considered the only legitimate source of grammatical knowledge, and it can be accessed only through the native speaker's grammatical intuitions (Chomsky 1956). Although performance, the actual real-world utterances of a native speaker, is also a source of grammatical information, it is not regarded as a legitimate one; it has been treated merely as an inferior copy of the tacit knowledge, the competence. In empirical discourse, however, the alternative stance is taken: real-world, observable and verifiable language, i.e. actual writing, speech and signs, which fall under the purview of performance, is given prime legitimacy as the basis for building a linguistic theory. Since a corpus is a real-world linguistic artifact (written, spoken or signed) that stores linguistic knowledge, it is extensively used in empirical research in Linguistics, CL and NLP. As mentioned in Chapter 1, the linguistic knowledge contained in a corpus is crucial for creating various NLP tools and applications. Such knowledge can be captured either by building a computational grammar (hand-crafted linguistic rules) or by annotating a large electronic corpus to create a treebank. It is from such treebanks that grammatical knowledge can be induced in a machine through statistical modelling. Hence, the need for a treebank as an empirical basis for research on grammar is well established.

Further, corpus-based empirical research, which was not much in practice for a long time after the late 1950s, was almost completely marginalized by the strong rationalistic discourse and the formalisms that developed from it. For instance, Francis (1992), one of the pioneers of the Brown Corpus (BC), recounts the response that its development received: it was considered “a useless and foolhardy enterprise”, because the intuition of the native speaker was held to be the only legitimate source of grammatical knowledge of a language, knowledge that could not be obtained from a corpus. However, with the progress in corpus linguistics itself and the achievements in Speech Recognition and NLP, particularly in Statistical Machine Translation (SMT), it is now well established that corpus-based empirical grammar products like treebanks are of crucial importance not only for linguistic research and language technology (Nivre et al., 2005) but also for cognitive and historical linguistic studies.

The next section introduces the notion of corpus linguistics and provides its background. Section three discusses the status of the Kashmiri text corpus. Section four describes the methodology for developing the Kashmiri text corpus. Section five looks into various problems of corpus development, such as corpus sanitization, normalization and tokenization, both in general and for the Kashmiri corpus (KashCorpus) in particular. Finally, section six summarizes the chapter.


  2. Notion of Corpus Linguistics

The Latin term ‘corpus’ means ‘body’. It was traditionally applied to various collections of linguistic or non-linguistic items. In linguistics, however, the term ‘corpus’ refers to a finite collection of naturally occurring utterances. More precisely, a corpus is a machine-readable, principled and organized collection of text, speech and sign samples that represents a particular language or a variety of that language (Leech 1992, Sinclair 1996). Corpus Linguistics, however, is not a branch of linguistics in the way that inter-disciplinary branches such as psycholinguistics and sociolinguistics, or core branches such as morphology and syntax, are; rather, it is an alternative, corpus-based empirical methodology that percolates through all the branches of linguistics.

    1. Some Background

The term Corpus Linguistics was not much in use until the early 1980s; it came into the limelight with the publication of The Recent Trends in English Corpus Linguistics (Aarts & Meijs, 1984). Corpus-based linguistic research itself, however, predates the rationalistic generative era (late 1950s); before that era it was practiced by many linguists32. Although they used hard copies of text for manual analysis and paper slips or cardboard for data storage33, their methodology was purely empirical, based on real-world data. As mentioned above, the underlying notion of language in corpus linguistics is an empiricist and probabilistic one: language is considered a real-life object which can only be modelled probabilistically, i.e. the correspondence between linguistic structures and grammatical rules is a matter of frequency and hence of probability.

“If it is correct to describe linguistic behavior as rule-governed, this is much more like the sense in which car-drivers’ behavior is governed by the Highway Code than the sense in which the behavior of material objects is governed by the laws of physics, which can never be violated” (Sampson, 1992).



This period, prior to the 1950s, is considered the golden era of old-fashioned corpus linguistics, which has been termed Early Corpus Linguistics (ECL) (McEnery & Wilson, 1996). In ECL, the corpus was collected, stored and analyzed by linguists by hand, with pen and paper as the only aids. Consequently, corpora were never as large as today's and were rarely free of errors. The corpus-based methodology required data storage (memory devices) and processing abilities that were not available at that time. In the 1950s, under the influence of logical positivism and behaviorism, several linguists regarded the corpus as the primary source of linguistic information. The corpus was deemed both necessary and sufficient for the task at hand, and intuitive evidence was sometimes rejected altogether. A small number of researchers applying corpus-based methodology did make weaker claims, suggesting that the purpose of the linguist's work is not simply to account for all the utterances included in the corpus but rather to account for those which are not in the corpus at a given time (Leech, 1992). In spite of its intrinsic limitations (theoretical and technological), the corpus-based approach was considered a scientific methodology for language study. ECL was widespread among linguists until the early 1950s (McEnery & Wilson, 2001). At the end of the 1950s, the corpus-based empirical method was severely criticized and almost overshadowed by rationalistic discourse (ibid). The criticism was partly genuine, given the crude techniques available at that time. Finally, with the advent of computing machines and their use in corpus processing, owing to their large storage and computing capacities, modern corpus linguistics came into existence, and in the early 1960s the first modern corpus, known as the Brown Corpus, was compiled for American English.
Modern corpus linguistics, known as Computerized Corpus Linguistics (CCL), received further impetus from the ground-breaking successes in automatic speech recognition and machine translation, achieved with various techniques of statistical language modelling. The success in building various NLP applications based on different modern-day corpora rekindled hope in empiricism, and by the early 1990s the magic spell of rationalism was almost reversed.

    2. Text and Grammatical Knowledge

Writers write without being conscious that, apart from conveying their intent, they carve their grammatical knowledge and mastery of the language into the patterns of the text. It is well established that this grammatical knowledge can be harnessed. To be used in developing real-world NLP tools and applications, the grammatical information in a text corpus needs to be annotated at various levels; linguistic knowledge is then induced (learned) directly from the annotated corpora. The annotated corpora used for this purpose are treebanks, in which the implicit linguistic information has been made explicit through various levels of annotation. Several NLP modules, such as the Part-of-Speech Tagger, Chunker and Parser, and various NLP applications, such as Machine Translation, Question Answering and Information Extraction, are trained and tested on treebanks, i.e. these systems learn linguistic knowledge from treebank samples and their performance is also evaluated on such samples. Training of a system consists of two stages: (a) classifying the linguistic structures (i.e. words and chunks) occurring in the corpus, and (b) assigning them probabilities of occurrence according to a probabilistic language model.
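
The two training stages described above can be illustrated with a very small sketch. It assumes a toy annotated corpus of (word, tag) pairs and a simple relative-frequency estimate of P(tag | word); the sentences, the tagset and the function names `train` and `most_likely_tag` are invented for illustration and do not describe any system used in this work.

```python
from collections import Counter, defaultdict

# Toy annotated corpus: each sentence is a list of (word, tag) pairs.
# The sentences and the tagset are invented for illustration only.
TAGGED_SENTENCES = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("peels", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]


def train(tagged_sentences):
    """Stage (a): record the class (tag) of every word occurrence.
    Stage (b): convert the counts into relative-frequency estimates P(tag | word)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    model = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        model[word] = {tag: n / total for tag, n in tag_counts.items()}
    return model


def most_likely_tag(model, word, default="NOUN"):
    """Return the most probable tag for a word, or a fallback tag for unseen words."""
    if word not in model:
        return default
    return max(model[word], key=model[word].get)


model = train(TAGGED_SENTENCES)
print(model["bark"])                  # probabilities of each tag seen with "bark"
print(most_likely_tag(model, "the"))
```

Real taggers and parsers condition on context as well as on the word itself; the sketch only shows how annotation turns implicit grammatical knowledge into countable events.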

  3. Status of Kashmiri Text Corpus

The Kashmiri language presents unique challenges to descriptive, theoretical and historical linguistics. It is a fascinating language not only for linguists who base their research on rationalism but also for corpus linguists and NLP practitioners who base their research on empiricism. Though Kashmiri is fairly well explored from a rationalistic orientation, it is yet to be explored from an empirical perspective. A brief overview of the existing corpus resources for Kashmiri is given in this section.

Since the corpus is the primary source of data for empirical research34, corpus building is to be seen as part and parcel of corpus linguistics, which in the post-1980 scenario has become an essential enterprise for the quantitative analysis and technological development of any language. This has led to the development of huge corpora in many languages of the world, such as English, French, German, Arabic and Chinese, which are hence loosely called resource-rich languages. Some languages, however, like most of the South Asian Languages (SALs), still lack such resources on a large scale and are hence called resource-poor languages. Indian Languages (ILs) present a good example of the resource-poor scenario.

The work of corpus building in ILs first started at the individual level thirty three years ago at Kolhapur University, where KCIE35 (Shastri 1988) came into being; it consists of approx. one million words of Indian English in ISCII encoding. The next initiative in this direction was taken by the Department of Electronics, Govt. of India, in the form of a project, TDIL36, in 1991. The project was launched to develop three-million-word text corpora for all ILs included in the 8th Schedule (cf. Ganesan, 1999). The corpora were compiled from text materials published between 1981 and 1990. For Kashmiri, Urdu and Sindhi the initiatives were taken at AMU, and similar efforts were made at different institutions throughout the country. Another effort was made under the EMILLE project to build multilingual corpora for South Asian Languages (McEnery et al., 2000), which released a 200,000-word parallel corpus of English, Urdu, Bengali, Gujarati, Hindi and Punjabi. Nevertheless, ILs still need large-scale language resources in order to develop adequate language technology and thereby enhance their online representation. To advance technological development, many corpus projects have been launched recently, for instance LDCIL and ILCI37, which are still ongoing. The former aims at producing a quality annotated corpus for all 22 scheduled ILs, while the latter aims at producing a parallel corpus in the tourism domain for all major ILs, with the Hindi corpus as the pivot that is translated into the other languages.



Corpus building efforts started in Europe and America after the 1980s on a large scale and with a considerable amount of standardization, whereas in India these efforts started only a decade back, on a small scale, in isolated projects and with less emphasis on standardization. As a result, resources were created only for a few major languages, and even those without proper standardization. Consequently, until 2008-2009, in spite of the efforts made under TDIL, there were hardly any language resources for Kashmiri, and hence no corpus-based research on Kashmiri was possible. It was only after some initial efforts in this direction, first at the Central Institute of Indian Languages (CIIL) and then at Kashmir University (KU)38, that some corpus-based studies were carried out. These corpus building efforts resulted in some basic language resources and computational tools, such as a Unicode-compatible font, a text corpus, a POS-annotated corpus, a speech corpus, annotation tools, transliteration tools and some lexical resources, such as a trilingual dictionary and a frequency dictionary for Kashmiri. Besides, C-DAC Pune, which is also involved in the localization of various software such as Open-Office, has developed a software package for all Indian Languages, including Kashmiri, which consists of a word processor, a browser, a spreadsheet, etc. Although a considerable amount of Kashmiri text corpus was built at AMU, more than one million words have been built under LDC-IL39 at CIIL, and about 2-5 lakh words at KU (Bhat 2012), no existing corpus has so far been made available to researchers. Therefore, instead of trying to obtain the existing corpora, new small-scale resources were created for developing KashTreeBank. The next section describes the methodology used in building the Kashmiri corpus.

  4. Methodology for Building Kashmiri Text Corpus (KashCorpus)

Theoretically, text corpora can be developed by typing in printed texts, by using OCR, or through speech recognition. OCR and speech technologies are far from perfect, especially for ILs, so the only workable method is to key in the texts. For the development of the Kashmiri Text Corpus (KashCorpus) too, raw text was collected and digitized by entering the data manually in Microsoft Word (.doc format). After certain procedures, like cleaning and normalization, the corpus is deemed fit for linguistic scrutiny and for different types of annotation. The entire procedure adopted for the development of KashCorpus is explained below, along with the associated issues.
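
As a rough illustration of what a cleaning and normalization pass over keyed-in text might involve, the following sketch assumes that the text has already been exported from the word processor as plain Unicode text. The choice of Unicode NFC normalization, the set of invisible characters removed, and the function names `clean_line` and `clean_text` are assumptions made for this example; they are not the procedure actually applied to KashCorpus.

```python
import re
import unicodedata

# Characters treated as noise in this sketch (an assumption): zero-width space,
# byte-order mark and soft hyphen. The zero-width non-joiner (U+200C) is
# deliberately kept, since it can affect letter joining in Perso-Arabic script.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\ufeff\u00ad"))


def clean_line(line):
    """Normalize one line of keyed-in text."""
    # Canonical composition: a letter plus diacritic gets one consistent encoding.
    line = unicodedata.normalize("NFC", line)
    # Drop invisible characters that typically creep in during typing or export.
    line = line.translate(INVISIBLE)
    # Collapse runs of whitespace (tabs, multiple spaces) into a single space.
    line = re.sub(r"\s+", " ", line)
    return line.strip()


def clean_text(raw):
    """Clean every line and discard lines that end up empty."""
    lines = (clean_line(line) for line in raw.splitlines())
    return [line for line in lines if line]


print(clean_text("  first\u200b line \t here \n\n\ufeffsecond line\n"))
```

Keeping such steps in small, separate functions makes it easier to adjust them once the script-specific issues of a language become clear.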

    1. Planning Corpus

Planning is a very important, indeed decision-making, stage in corpus building. It is at this stage that the source and nature of the text and the purpose for which the corpus is to be built are decided. Once the purpose of the corpus is clear, other specifications, such as character encoding, text encoding and the format for storage and usage, are also laid down. The general practice in treebanking is to use newspapers as the primary source of data, because they are easily available and can be freely downloaded, e.g. the Wall Street Journal (WSJ) part of the Penn Treebank. However, newspapers in most ILs are yet to be digitized, and where they are digitized at all, they are mostly in image format, which cannot be used directly as a corpus; one can only download or copy them and key in the text. For the current work the situation was even more difficult, as only a few newspapers are available in Kashmiri, and these are hard to obtain and not digitized.

    2. Selecting Text Domains

Theoretically, one can identify different text domains, such as Aesthetics (Literature and Fine Arts), Natural, Physical and Professional Sciences, Social Sciences, Commerce, Government Documents, etc., which are very important for creating a balanced corpus, but the availability of such domains varies from language to language. Moreover, certain domains, like government documents, medical and tourism texts, have more day-to-day relevance than others. These domains are more useful in developing technology for e-governance and are hence much in demand these days for commercial NLP applications. However, such text domains, whether important for building a balanced corpus or for commercial purposes, are not available in Kashmiri. This is because Kashmiri has never been used as an official language or as the medium of instruction40; currently too, the official language of the state is Urdu and the alternative official language is English. Therefore, text production is confined to limited domains41, predominantly literature. Since, as mentioned earlier, the current corpus is meant for developing KashTreeBank, it was decided that newspaper text should be used. The rationale for using a newspaper corpus was not only that it is in line with general practice in treebanking; an additional reason was that the textual material collected from books (Bhat 2012) shows little standardization, whereas newspapers use comparatively standard forms. However, when the newspaper corpus was initially used on an experimental basis, it proved very difficult to annotate at sentence level, as the sentences were very complex and lengthy. Consequently, it became very hard to lay down the first version of the annotation guidelines. To ease this difficulty, some short story text was also selected and added to the existing corpus. The current KashCorpus consists of the following domains:

S. No | Domain                        | Word Count (WC) | %age
01    | Short Stories (SS)            | 3384            | 7.29
02    | News Articles Political (NAP) | 14395           | 31.02
03    | News Articles (NA)            | 7001            | 15.09
04    | News Report Political (NRP)   | 14263           | 30.74
05    | News Report (NR)              | 2997            | 6.45
06    | Editorials (ED)               | 4354            | 9.38
      | Total WC                      | 46394           |

Table.1 Text Domains
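
The percentage column of Table 1 can be re-derived from the word counts. The short sketch below does this with integer arithmetic; it assumes that the shares were truncated, not rounded, to two decimal places, since truncation appears to reproduce the printed values, which would also explain why they add up to slightly less than 100. The helper names are hypothetical.

```python
# Word counts per domain, copied from Table 1.
WORD_COUNTS = {
    "SS": 3384,
    "NAP": 14395,
    "NA": 7001,
    "NRP": 14263,
    "NR": 2997,
    "ED": 4354,
}


def share_in_hundredths(count, total):
    """Percentage share expressed in hundredths of a percent, truncated (not rounded)."""
    return count * 10000 // total


def format_share(hundredths):
    """Render hundredths of a percent as a string with two decimal places."""
    return f"{hundredths // 100}.{hundredths % 100:02d}"


total = sum(WORD_COUNTS.values())
shares = {
    domain: format_share(share_in_hundredths(count, total))
    for domain, count in WORD_COUNTS.items()
}

print("Total WC:", total)
for domain, share in shares.items():
    print(domain, share)
```

With ordinary rounding instead of truncation, the NAP and NR shares would differ from the table in the last digit.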

    3. Data Collection

For building KashCorpus, data collection was carried out through field work. As mentioned above, it was not possible to collect newspaper data online, as can be done for English, Urdu, Hindi, Tamil, etc., which have a fairly good online presence. Further, it was decided to use text from Sangarmaal, the only well-known newspaper in Kashmiri, which was formerly a weekly and has recently started daily publication. The other Kashmiri newspapers, kAhvaTh, soon miiraas, arnimaal, miiraas and kA:shur times, are not widely circulated. Sangarmaal too is not a widely read paper, as very few people can read and write Kashmiri, whereas English and Urdu newspapers are widely read in Kashmir. Therefore, it became necessary to go to the field to collect newspapers. Some issues of Sangarmaal (covering a period of 6 months) were purchased, and news items, editorials and articles, mainly from the political domain, were marked up. In addition, short stories were included in the corpus, most of them taken from an anthology of prose used to teach Kashmiri at NRLC. The decision to add short stories was taken at the last stage because, as mentioned above, the average sentence length in the newspaper corpus was found to be high, approx. 27 words, and the sentences are also quite complex. The average sentence length of the short story corpus, on the other hand, is approx. 12 words, though still with considerable complexity. Ideally, the data to be collected should follow a proper sampling scheme, as is done at LDCIL for building text corpora for all scheduled languages (each nth page from an n-page book/magazine/journal); in the present case, however, random sampling was done, with no explicit criterion for choosing the text. Care was nevertheless taken to use the least possible number of newspapers, to avoid wastage. The sample details of the newspaper data collected during the field trip in 2011 are given in the table below.

File ID | Metadata | Words | Domain
KashCorp 01 | سنگرمال سرینگر : جِلد: ۵ شمارٕ : ۲۱ ، ۱۴ تا ۲٠ جون ، ۲٠۱٠ / وزیر اعظمہٕ سُنٛد دورٕ سِیٲسی اعتبارٕ مویوٗس کُن: اِقتصٲدی مراعات ہٮ۪کَن نہٕ اَصل سِیٲسی سوال گال کھاتَس ترٛاونُک ول بٔنِتھ / سنگرمال تجزیہ | 204 | ED
KashCorp 02 | سنگرمال سرینگر :جِلد: ۵، شمارٕ :۱۷ ، ۳ تا ۹ میٖٔ ، ۲٠۱٠ / سری ینگر شہر گوٚژھ ڈنجہِ پٮ۪ٹھ اَننہٕ یُن ۔۔۔ سمیر رشیدبٹ | 132 | NA
KashCorp 04 | سنگرمال سرینگر :جِلد: ۵ ، شمارٕ :۱۲ ، ۲۹ مارچ تا ۴:اپریل، ۲٠۱٠ / ہندوستان مَہ لٲگِن کٔشیٖرِ ہُنٛد مَسلہٕ أنٛز راونَس ژیر | 172 | NAP
KashCorp 06 | ۲٠٠۹ ، سنگرمال سرینگر :جِلد: ۴ ، شمارٕ : ۹٠ ، ۱۶ تا ۲۲ نومبر / ہندوستان چُھ چیٖنس سۭتۍ رلن وٲلۍ سَرحد مضبوٗط کران | 216 | NRP
KashCorp 19 | سنگرمال سرینگر :جِلد: ۵ ، شمارٕ :۱۹ ، ۲۴ تا ۳٠:میٔ ،۲٠۱٠ / بہترین کتابَن ہٕنٛدٮ۪ن لکھارٮ۪ن اعزاز | 202 | NR
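
The data collection discussion above contrasts systematic sampling (each nth page) with random sampling, and compares domains by average sentence length. The sketch below shows one simple way of expressing these notions. The function names, the list of page identifiers and the set of sentence-final marks (including the Arabic-script full stop U+06D4 and question mark U+061F) are assumptions for illustration; they do not reproduce the LDCIL procedure or the exact counting method used for KashCorpus.

```python
import random
import re


def systematic_sample(pages, step, start=0):
    """Take every `step`-th page, beginning at index `start`."""
    return pages[start::step]


def random_sample(pages, k, seed=None):
    """Pick k pages uniformly at random, without replacement, in original order."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(pages)), k))
    return [pages[i] for i in chosen]


# Sentence-final marks assumed here: Latin ".", "?", "!" and Arabic-script
# full stop (U+06D4) and question mark (U+061F).
SENTENCE_END = re.compile(r"[.?!\u06D4\u061F]+")


def average_sentence_length(text):
    """Mean number of whitespace-separated words per sentence."""
    sentences = [part.split() for part in SENTENCE_END.split(text)]
    sentences = [words for words in sentences if words]
    if not sentences:
        return 0.0
    return sum(len(words) for words in sentences) / len(sentences)


pages = [f"page-{i}" for i in range(1, 21)]
print(systematic_sample(pages, 5))
print(random_sample(pages, 4, seed=1))
print(average_sentence_length("One two three. Four five!"))
```

A fixed seed makes a random sample reproducible, which is useful when the selection of texts has to be documented.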
