

Proposal of a method of enriching queries for improving of the precision in information retrieval in Arabic

Souheyl Mallat
Research Unit LATICE, Department of Computer Sciences
Faculty of Sciences of Monastir, University of Monastir
Monastir, Tunisia
Mallatsou.issat@yahoo.fr

Anis Zouaghi
Research Unit LATICE, Department of Computer Science, Superior Institute of Applied Science and Technologies of Sousse
Sousse, Tunisia
anis.zouaghi@gmail.com

Emna Hkiri
Research Unit LATICE, Department of Computer Sciences
Faculty of Sciences of Monastir, University of Monastir
Monastir, Tunisia
emna.hkiri@gmail.com

Mounir Zrigui
Research Unit LATICE, Department of Computer Sciences
Faculty of Sciences of Monastir, University of Monastir
Monastir, Tunisia
mounir.zrigui@fsm.rnu.tn

Abstract— In this paper, we propose a query enrichment method to improve the performance of information retrieval systems in Arabic. The method proceeds in three steps. First, the significant terms (simple and composed) present in the query are identified. Then, a descriptive list is generated and assigned to each term identified as significant in the query. A descriptive list is a set of linguistic knowledge of different types (morphological, syntactic and semantic). Finally, Salton's weighting functions TF-IDF and TF-IEF are applied to the list generated in the previous step. The TF-IDF function identifies relevant documents, while the role of TF-IEF is to identify the relevant sentences and to assign a weight to every term belonging to them. The terms of high weight (terms that may be correlated with the context of the response) are incorporated into the original query.

The application of this method is based on a corpus of documents belonging to a closed domain.
Keywords— information retrieval; Arabic NL; Query enrichment; weighting.
Introduction
The objective of an information retrieval system (IRS) is to retrieve relevant documents that meet the needs of users. The user of such a system seeks precision in the answers, and prefers a small number of documents that meet his needs rather than many documents in which the answer is drowned in a set of irrelevant ones [1]. Besides, the information found in documents is not always correlated with the user query. This is why information retrieval systems are an important issue today.
This work aims to improve the performance of IRS through query enrichment, applied to the domain of environmental pollution. Enrichment consists in adding terms which may be correlated with the query response.

This method is based mainly on the composed terms identified in the query. These units are more precise and less ambiguous than simple isolated terms [2]. They facilitate the linguistic and statistical treatments on which the enrichment method is based.
The linguistic treatment of a text (user query or corpus text) consists of syntactic, morphological and semantic analysis:

The morphological module covers the inflectional and derivational variations of the significant terms of the query, in order to increase the number of occurrences associated with these terms during the search.
The syntactic module is based on grammatical labeling, which associates each word with its grammatical category (noun, verb, adjective, particle, ...). The purpose of labeling is to identify the simple and composed terms, as a first step in term disambiguation.
The semantic module associates synonymy and hyperonymy relations with the significant terms of the query. These relations are extracted using two dictionaries (simple and composed), which give the relations corresponding to the simple and composed terms of the initial query.

The result of the linguistic processing is a description of the significant terms (desc), in the form of a list containing the significant terms with their semantic and morphological variations. This list improves the similarity measures between the query and the documents of the corpus in the statistical processing.

The statistical treatment consists in determining the similarity between the pairs (desc-list of the initial query, documents) and (desc-list of the initial query, sentences belonging to relevant documents). This measure is based on Salton's weighting functions TF-IDF and TF-IEF [3]. It serves as the criterion for ranking documents, as well as sentences, in decreasing order of relevance. This criterion also assigns a weight to each term that belongs to the relevant sentences, so that the terms of highest weight can be integrated into the initial query.

Problematic

In this work, we are interested in ambiguities that have a direct impact on information retrieval (IR).

Methods that rely on keywords as the means of IR are considered insufficient: for example, if the query and a document share a key term, the document is seen as more or less corresponding to the query subject.

The insufficiency of these methods is due to the fact that the terms used in the query vary morphologically and semantically compared to those of the documents in the knowledge base. This variation degrades the effectiveness of IR systems in terms of precision [4]. These variations occur at several levels, for example:

Morphological variation: the query does not cover the variations that produce keywords in different numbers, for example "مدرسة" (school) and "مدرستان" (two schools), "خيل" (horse) and "خيول" (horses);
Lexical variation: different words are used for the same sense. As a result, a query with the keyword "فرس" (horse) does not retrieve documents that contain its synonym "خيل" (horse);
Semantic variation: the query does not distinguish between the senses of a word that has several. A user searching for the word "الحجر" in the sense "الصخر" (stone) will also be faced with "الحجر" in the sense "انثى الخيل" (female horse).

In this context, a query enrichment method is indispensable to remedy the problems presented above [5, 6].
Approaches to Information Retrieval

Linguistic Approach

This approach is used to identify the significant terms representing the query. It is based on a set of modules whose number and nature vary according to the language treated. Some modules are only used in specific cases, depending on the language in question [7]. The modules used for IR are usually syntactic, morphological and semantic. Some of them rely on resources (dictionaries, corpora, thesauri, semantic networks).

Statistical Approach

The statistical approach analyzes documents to extract the significant words that describe their contents. It is based on computing the frequency of appearance of words. The weight of a significant word reflects the frequency of the term in the document: it is highest when the term is concentrated in a single document and lowest when the term does not discriminate that document from the rest of the collection [8]. The calculation of similarity is an example of statistical analysis. It allows relevant documents to be identified in relation to the query terms.

In some cases, linguistic and statistical analyses are not enough to produce a query that is expressive enough to return relevant documents. In these cases, new significant terms must be added to the process of linguistic and statistical analysis, that is, the query must be extended by an enrichment method.

Query Enrichment Method

Enriching a query means extending its number of words to make it more representative of the information sought, so that relevant documents are returned. Enrichment is the automatic addition of terms [9] using a knowledge base (corpus database, thesaurus, etc.). Enrichment can also be carried out from the documents returned by a first search: a document selected as relevant can serve as a reference for adding new terms [10].

In this context, the enrichment method may be an effective solution to the problems discussed above. Query enrichment techniques can be differentiated at three levels [11], [12]:

The source of the terms used in the enrichment, which can come from the results of previous searches or from a knowledge base (semantic networks, thesauri, corpus, etc.).
The choice of a method for selecting the terms to be added to the initial query.
The role of the user in the term selection process, which can be active or passive.

In this work, the enrichment is done automatically, so we are interested in the first two levels.
Proposed Method

Principle of the method

The search result depends on the choice of words used to express the query. If these words are poorly chosen, for example on the web, the user is faced with a large mass of documents (relevant and irrelevant), hence the need to expand the query in order to reduce the classic noise and silence of IRS.

The proposed method aims to improve the efficiency of search engines using an enrichment corpus. Figure 1 below presents the proposed method for enriching queries.

[Figure 1 (flowchart). Linguistic analysis of the question: segmentation into simple and composed terms (Diab's tool: segmentation + grammatical labeling); removal of empty words (anti-dictionary); lemmatization (Khoja's algorithm); semantic module (extraction of synonymy and hyperonymy relations from the specialized dictionaries of simple and composed terms); morphological module (inflectional and derivational variation, NooJ environment); descriptive list; normalization. Preprocessing of the enrichment corpus: segmentation of each document (d) into sentences (S) (Anne Dister's algorithm); segmentation of sentences into simple and composed terms; removal of empty words (anti-dictionary); normalization. Statistical analysis: similarity function Sim(d, desc-list) with weighting function TF-IDF for the extraction of relevant documents; similarity function Sim(S, desc-list) with weighting function TF-IEF for the extraction and filtering of relevant sentences [S1], [S2], [S3], ... [Sn]; weighting of the terms of all relevant sentences; extraction of the terms of high weight (tr-susc); integration of tr-susc into the initial query; enriched query sent to the search engine; results.]

Figure 1. Principle of the proposed enrichment method

Analysis and Interpretation of the question: The user expresses his needs to the IRS as a text message in natural language, through a user interface. The question is analyzed and interpreted in the following steps:

a) Segmentation of the question into simple and composed terms: The user's question is divided into simple and composed terms. This is based on the assumption that a composed term is less ambiguous than a single isolated term. The treatment begins with the identification of the terms of the initial query. To do this, we use a linguistic analysis, and more precisely a syntactic analyzer that categorizes every isolated word of the query.

The identification of composed terms is performed by Diab's tool [13], which relies on a hierarchical structure. This structure is based on finding typical patterns, such as:

A noun and an adjective. Ex: "التلوث الكمياى" (chemical pollution)
A noun and another noun. Ex: "تلوث الهواء" (air pollution)
A noun, a preposition and another noun. Ex: "المعالجة ل السموم" (treatment for poison), "التخلص من النفاىات" (disposal of waste).

For example, for the question "كيف يتم تلوث الهواء" (How does air pollution occur), Diab's extractor keeps the sequences "noun + noun", "verb" and "adverb", that is "كيف" (how), "يتم" (is done) and "تلوث الهواء" (air pollution), and stops at three sequences.
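The pattern matching described above can be illustrated with a minimal Python sketch. It is not Diab's tool: the tag names (`NN`, `JJ`, `IN`, `VBP`, `WRB`), the function name `group_terms` and the ASCII transliterations are assumptions made for the example, and determiners are ignored for simplicity.

```python
# Hypothetical tag patterns for composed terms, longest first:
# noun + preposition + noun, noun + adjective, noun + noun.
PATTERNS = [("NN", "IN", "NN"), ("NN", "JJ"), ("NN", "NN")]


def group_terms(tagged):
    """Group a list of (word, tag) pairs into simple and composed terms.

    Returns a list of tuples of words; a tuple of length 1 is a simple term.
    """
    terms = []
    i = 0
    while i < len(tagged):
        for pattern in PATTERNS:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                terms.append(tuple(word for word, _ in window))
                i += len(pattern)
                break
        else:
            # No pattern starts here: keep the word as a simple term.
            terms.append((tagged[i][0],))
            i += 1
    return terms


# Transliterated stand-in for the example question (how / is done / air pollution).
question = [("kyf", "WRB"), ("ytm", "VBP"), ("tlwv", "NN"), ("AlhwA", "NN")]
sequences = group_terms(question)
```

On this input the sketch yields three sequences, the last one being the composed term made of the two nouns, which mirrors the example in the text.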

The next step is to remove the empty words to obtain the significant terms.
b) Extraction of significant terms: The step following segmentation extracts the significant terms of the question, i.e. the meaningful words. It first removes all empty (non-significant) words.
These words, such as "كيف" and "يتم", often appear in all documents. In practice, eliminating them increases the precision of the search for relevant documents, hence the idea of creating an anti-dictionary. The latter generally contains prepositions, pronouns and some adjectives. The terms of the question are compared with the entries of the anti-dictionary; those found there are eliminated and are not taken into consideration when word frequencies are calculated.
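As a rough illustration of this filtering step, the sketch below removes the terms found in an anti-dictionary. The set `ANTI_DICTIONARY` is a tiny hypothetical sample (the two words quoted in the text plus a few common prepositions); the actual anti-dictionary used by the authors is not given.

```python
# ASSUMPTION: illustrative entries only; the paper's anti-dictionary is not listed.
ANTI_DICTIONARY = {"كيف", "يتم", "من", "في", "على"}


def remove_empty_words(terms):
    """Keep only the terms that are absent from the anti-dictionary."""
    return [term for term in terms if term not in ANTI_DICTIONARY]


significant = remove_empty_words(["كيف", "يتم", "تلوث الهواء"])
```

For the example question, only the composed term "تلوث الهواء" survives the filter.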
Then, the significant terms are lemmatized. These lemmas will be enriched in order to find morphological and semantic relations with the terms of the corpus texts.
c) Lemmatization of the terms of the initial query: Lemmatization is one of the most important treatments for the Arabic language in IR. It is the process of extracting the canonical form of the terms of the initial query.

It removes the problems caused by morphological variations. For example, if a text contains the sentence:

"أنواع التلوث تلوث التربة إن التربة التي تعتبر مصدرا للخير والثمار" (Types of pollution: soil pollution; the soil, which is considered the source of welfare and fruits). This text can be retrieved by a query containing the word "نوع", the lemmatized form of the term "أنواع" (types).

Arabic words are usually formed by a sequence of {antefix, prefix, suffix, postfix}.

To obtain the lemma, we used Khoja's algorithm [14], which is suitable for Arabic IR. Several forms may be found for the same word, as in the following example: "أنواع", "تنوعات" (diversity), "تنوع" (variety), etc. The objective of lemmatization is to eliminate the inflectional marks and endings and keep only the root or lemma.
d) Semantic enrichment of lemmas: After the significant terms of the initial query have been lemmatized by Khoja's algorithm, we assign semantic relations (synonymy, hyperonymy) to these terms through a specialized dictionary. The lemmatized terms extracted from the query are in simple and composed form; we therefore need to build two dictionaries: a dictionary of simple terms (smpt-Dic) and a dictionary of composed terms (cmp-Dict).

Depending on the form of the term, the corresponding dictionary generates new enriched lemmas. They are semantically related [15] to the significant terms by the relations of synonymy and hyperonymy.

The synonymy relation brings together semantically close terms. For example, for the simple term "طرق" (methods) the simple dictionary generates the words "سبل" (ways) and "وسيلة" (means), and for the composed term "حرق الفضلات" (burning of waste) the composed dictionary generates "حرق النفايات" (burning of waste). The hyperonymy relation expresses generalization: for example, for the term "غازي النتروجين" (gaseous nitrogen) the composed dictionary generates the term "المخلوط الغازي" (gaseous mixture).
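The two-dictionary lookup can be sketched as follows. The data structures, the names `SIMPLE_DICT`, `COMPOSED_DICT` and `enrich`, and the rule "a lemma containing a space is a composed term" are assumptions for illustration; the entries are the examples quoted in the text, not the authors' full dictionaries.

```python
# Hypothetical layout: lemma -> related lemmas, one dictionary per term form.
SIMPLE_DICT = {
    "طرق": {"synonyms": ["سبل", "وسيلة"], "hypernyms": []},
}
COMPOSED_DICT = {
    "حرق الفضلات": {"synonyms": ["حرق النفايات"], "hypernyms": []},
    "غازي النتروجين": {"synonyms": [], "hypernyms": ["المخلوط الغازي"]},
}


def enrich(lemma):
    """Return the synonyms and hypernyms recorded for a lemma, if any."""
    # ASSUMPTION: composed terms are recognized by the space between their words.
    dictionary = COMPOSED_DICT if " " in lemma else SIMPLE_DICT
    entry = dictionary.get(lemma, {})
    return entry.get("synonyms", []) + entry.get("hypernyms", [])


simple_result = enrich("طرق")
composed_result = enrich("غازي النتروجين")
```

A lemma that is missing from both dictionaries simply yields an empty list, so the original term is kept without enrichment.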
e) Morphological variation: Experiments carried out in French [16] show the influence of morphological variation in IR: it increases the chances of a match between a query and the documents of the corpus.

The lemmas previously generated by the dictionaries in the semantic enrichment phase are varied morphologically using inflectional and derivational analyses. These variations can take many forms: noun, verb, adjective and adverb.

For the derivational and inflectional variations of the generated lemmas, we used the morphological engine NooJ [17], which uses finite-state transducers.

Inflectional morphology: Arabic is an inflected language. For the conjugation of verbs and the declension of nouns, it employs marks of aspect, mood, tense, person, gender and number, which are in general suffixes and prefixes [18]. Generally, these inflectional marks distinguish the mood of verbs and the function of nouns. For example, the lexical entry "كتب" (has written) has 122 morphological variations, such as "اكتب" (I write), "ىكتبان" (they write), "كتبا" (they have written), etc.
Derivational morphology: It studies the construction of words and their transformation according to the desired sense. It relies on morphological semantics: different words are derived from the same lemma according to patterns (صيغ). Generally, this type of morphology can improve recall.

The significant terms of the query, the lemmatized terms and the enriched lemmas (semantically and morphologically) are stored in a descriptive list of the query (desc-list).
f) Normalization: Normalization transforms a text into a standard format that is easier to manipulate. This step is necessary because of the variations that may exist in the writing of the same term.

A query or document is normalized as follows:

- Remove special characters and numbers;

- Replace [...] with ا;

- Replace the final letter ي with ى;

- Replace the final letter ة with ه.

In our case, normalization is applied to the desc-list and to the texts of the corpus.
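The normalization rules listed above could be sketched as below. This is only an illustration under stated assumptions: the letters replaced by ا are missing from the text, so the set `ALIF_VARIANTS` (alif with hamza or madda) is a guess based on common practice, and "special characters" is read here as anything that is not an Arabic letter or whitespace. The function name `normalize` is hypothetical.

```python
import re

# ASSUMPTION: the source letters of this rule are not recoverable from the text;
# these alif variants are a common choice, not confirmed by the paper.
ALIF_VARIANTS = "أإآ"


def normalize(text):
    """Apply the four normalization rules to an Arabic-script string."""
    # Rule 1: drop digits and special characters (keep Arabic letters and spaces).
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
    # Rule 2: map the assumed alif variants to a bare alif.
    text = re.sub(f"[{ALIF_VARIANTS}]", "ا", text)
    words = []
    for word in text.split():
        if word.endswith("ي"):
            # Rule 3, as stated in the text: final ي becomes ى.
            word = word[:-1] + "ى"
        elif word.endswith("ة"):
            # Rule 4: final ة becomes ه.
            word = word[:-1] + "ه"
        words.append(word)
    return " ".join(words)


normalized = normalize("أسباب 12 مدرسة!")
```

Applying the same function to both the desc-list and the corpus texts keeps the two sides comparable, which is the point of the step.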
B. Pretreatment of the enrichment corpus

1) Segmentation of the corpus texts into sentences: After the pretreatment of the initial query, our method requires an important step that prepares the texts of the enrichment corpus for information retrieval. This step structures the texts of the corpus into sentences.

In this work we use the segmenter of Anne Dister, based on the INTEX system [19] and developed for French texts. It relies on punctuation marks, which provide accurate formal indications for structuring texts.

To divide the text into sentences, Anne Dister's segmentation tool uses a transducer. This transducer is an automaton that reads a sequence in a text, compares it with the information associated with it, and inserts an end mark or a not-end mark for that sequence. The end marks used in our segmentation phase are ".", "!" and "?".
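A much simpler stand-in for this segmentation step is sketched below. It is not the INTEX transducer: it only splits after the three end marks named in the text when they are followed by whitespace, and it makes no attempt to recognize "not-end" cases such as abbreviations. The names `END_MARKS` and `split_sentences` are hypothetical.

```python
import re

END_MARKS = ".!?"  # the three end marks listed in the text


def split_sentences(text):
    """Split a text into sentences after '.', '!' or '?' followed by whitespace."""
    pattern = rf"(?<=[{re.escape(END_MARKS)}])\s+"
    return [part for part in re.split(pattern, text.strip()) if part]


sentences = split_sentences("First sentence. Second one! A third?")
```

Each returned sentence keeps its end mark, so later steps can still see how the sentence was terminated.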
2) Segmentation of sentences into simple and composed terms: Once the text of the corpus is cut into sentences, each one is divided into simple and composed terms. Simple and composed terms are identified in the same manner as for the query (see Paragraph 4.2.1.1).

Here is a description of the input and output of Diab's ASVM tool. The text to be analyzed must be encoded in Buckwalter transliteration, through a correspondence table between Arabic characters and ASCII.

Input text of the ASVM tool:

ولم يحتسب الحكم المجري ساندور بول ركلة جزاء صحيحة اثر عرقلة داخل المنطقة من قبل اليساندرو

"And the Hungarian referee Sandor Paul did not award a valid penalty after a trip inside the area by Alessandro"

Output text of the ASVM tool:

" wlm yHtsb AlHkm Almjry sAndwr bwl rklp jzA’ SHyHp Avr Erqlp dAxl AlmnTqp mn qbl AlysAndrw. "

The segmentation process is as follows:

Segmentation into words and grammatical tagging

In the output, the text is divided into words, each followed by a slash and its grammatical category. The coordinating conjunctions (fa-) and (wa-), the preposition (bi-), etc. are tagged independently.

" w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ sAndwr/NNP bwl/NNP rklp/NN jzA’/NN

SHyHp/JJ Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN mn/IN qbl/NN Al/DT ysAndrw/NNP ./PUNC "
Identification of simple and composed terms

" w/CC#lm/RP#yHtsb/VBP#Al/DT Hkm/NN Al/DT mjry/JJ# sAndwr/NNP bwl/NNP# rklp/NN jzA’/NN#SHyHp/JJ#Avr/IN#Erqlp/NN dAxl/IN Al/DT mnTqp/NN#mn/IN#qbl/NN Al/DT ysAndrw/NNP#./PUNC "
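The `#`-delimited output shown above is easy to load into a data structure for the later statistical steps. The following sketch assumes only what is visible in the example (terms separated by `#`, tokens separated by spaces, and `word/TAG` tokens); the function name `parse_terms` is hypothetical and is not part of the ASVM tool.

```python
def parse_terms(tagged_sentence):
    """Parse 'w/CC#lm/RP#Al/DT Hkm/NN ...' into a list of terms.

    Each term is a list of (word, tag) pairs; a composed term has several pairs.
    """
    terms = []
    for chunk in tagged_sentence.strip().split("#"):
        tokens = []
        for token in chunk.split():
            # Split on the last slash so that the tag is always the final part.
            word, _, tag = token.rpartition("/")
            tokens.append((word, tag))
        if tokens:
            terms.append(tokens)
    return terms


terms = parse_terms("w/CC#lm/RP#yHtsb/VBP#Al/DT Hkm/NN Al/DT mjry/JJ# sAndwr/NNP bwl/NNP")
```

On this fragment of the example, the fourth term groups the four tokens of the noun phrase and the fifth groups the two proper nouns.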

A second pretreatment is applied to the documents of the corpus. Documents and queries undergo almost the same preprocessing, which consists in removing the empty words stored in the anti-dictionary.

After removing empty words with the anti-dictionary, the number of terms decreases from 11 to 9, and we obtain a filtered sentence, for example "yHtsb/VBP#Al/DT Hkm/NN Al/DT mjry/JJ# sAndwr/NNP bwl/NNP# rklp/NN jzA’/NN#SHyHp/JJ#Erqlp/NN dAxl/IN Al/DT mnTqp/NN#mn/IN#qbl/NN Al/DT ysAndrw/NNP".

First, the pretreatment phase facilitates the similarity measure between the sentences of the corpus and the list of significant terms of the query. Second, it reduces the computation of term scores over all relevant sentences.
C. Enrichment Process

The quality of an IRS depends on its ability to retrieve the documents relevant to the user. It is therefore related to the choice of the query terms that the user has used to express his needs. This is why we must resolve the problems relating to lack of precision.

Enrichment consists in adding terms related to those of the desc-list (significant terms, lemmas, semantic relations of the lemmas, and morphological variations of the lemmas). The integration of these terms is based on the extraction of relevant documents from an enrichment corpus.
1) Extraction of relevant documents: The extraction of relevant documents consists in comparing each document with the desc-list. The comparison is founded on a similarity calculation, which relies on the size of the intersection, or degree of correspondence, between the two. The function assigns a similarity score to each document with respect to the desc-list.

In the literature, there are many models of information retrieval, such as the Boolean model, the probabilistic model and the vector model (algebraic approach).

In this work, we use an IR system based on the vector model [20].

The vector model is an algebraic model in which documents and queries are represented by vectors in a multidimensional space whose dimensions are the descriptor terms resulting from pretreatment.

Thus, the documents and the desc-list are represented as vectors whose coordinates are terms. This reduces semantic proximity to a measure of geometric distance. Let R be the vector space defined by the set of terms (t1, t2, ..., tn). A document d and a query q may be represented by weight vectors as follows: d: (tf1, tf2, ..., tfn), q: (q1, q2, ..., qn).

Here tfi and qi are the weights of term ti in document d and in query q respectively, and n is the number of terms of the space. The degree of correspondence of these two vectors is determined by their similarity. There are many similarity measures, such as the cosine or the scalar product; for details see [21].

In this work, we have used the scalar product:

$$\mathrm{Sim}(d, q) = \sum_{i=1}^{n} tf_i \cdot q_i$$

Figure 2. Measure of similarity between document and query.
Using the similarity function (document, desc-list) with the binary presence/absence formula in the vectors does not evaluate the relevance of a document to the needs of the user: a document in which a term occurs many times is not necessarily more relevant than a document that contains a small number of the query terms.

To solve this problem, several functions can be explored; among them, we use Salton's TF-IDF function (Term Frequency-Inverse Document Frequency) [22], which assigns a weight to each term of the desc-list in each document.

It transforms the occurrence of a term in a document into a combination of local and global weights. The local weight L(i, j), noted tf (term frequency), indicates the importance of term i within document j. It is based on the logarithmic weighting formula [23]. The global weight G(i), noted idf (inverse document frequency), indicates the importance of term i in the collection of documents, favoring the less frequent terms.

The functions chosen for our experiments are:

- Formula for local weighting based on the logarithm of frequency:

$$tf(t, d) = \begin{cases} 1 + \log(freq(t, d)) & \text{if } freq(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}$$

Figure 3. Formula of local weight.

- Formula for the global weight:

$$idf(t) = \log\left(\frac{N}{df(t)}\right)$$

Figure 4. Formula of global weight.
where N is the size of the collection, and df(t) is the number of documents containing t.
2) Classification of relevant documents: The classification of the relevant documents is based on the combination of the two weights, local and global.

In this work, we use the weighting tf x idf (the product of the two weights). The advantage of this weighting is that it does not take into consideration the different lengths of text.

The tf x idf weighting is integrated into the similarity function.

The scores obtained by the similarity function with tf x idf weights are used to rank documents in descending order of proximity to the descriptor terms of the desc-list: in this way, the first documents are the most similar to the initial query.
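The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' system: it assumes the logarithmic tf (1 + log of the frequency, 0 for absent terms) and idf = log(N / df(t)) described in the text, base-10 logarithms, query weights equal to 1 for every term of the desc-list, and documents given as lists of already normalized terms. The function names are hypothetical.

```python
import math
from collections import Counter


def tf(freq):
    """Local weight: 1 + log10(freq) for present terms, 0 otherwise."""
    return 1.0 + math.log10(freq) if freq > 0 else 0.0


def idf(n_docs, doc_freq):
    """Global weight: log10(N / df); terms absent from the collection get 0."""
    return math.log10(n_docs / doc_freq) if doc_freq > 0 else 0.0


def rank_documents(docs, desc_list):
    """Return (document index, score) pairs sorted by decreasing tf x idf score."""
    counts = [Counter(doc) for doc in docs]
    terms = set(desc_list)
    doc_freq = {t: sum(1 for c in counts if c[t] > 0) for t in terms}
    scores = []
    for index, count in enumerate(counts):
        # Scalar product with a query vector whose weights are all 1 (assumption).
        score = sum(tf(count[t]) * idf(len(docs), doc_freq[t]) for t in terms)
        scores.append((index, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [["air", "pollution", "air"], ["water"], ["air", "soil"]]
ranking = rank_documents(docs, ["air", "pollution"])
```

In this toy collection the first document ranks highest because it contains both terms of the list, and the document sharing no term with the list gets a score of 0.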

3) Extraction of relevant sentences: Segmentation structures the text into sentences. The purpose of information retrieval in such structured documents is to return to the user the sentences that focus on his need, called relevant sentences.

To that end, we propose a process for selecting relevant sentences using a weighting function. This function gives a weight to each term of the desc-list, based on two hypotheses:

For each term of the desc-list, we seek its relevant sentences in the enrichment corpus. This hypothesis uses the binary formula (presence/absence) to determine the intersection of each sentence with the query. These relevant sentences contain the terms that are most related to the response.
The second hypothesis is based on searching for the terms of the desc-list that have a very low frequency across the relevant sentences. For this we use the function ief (Inverse Element Frequency), which has been proposed by many authors [24]. In this manner, we favor the terms found in a single relevant sentence or in a very small number of them.

$$ief(t) = \log\left(\frac{N}{df(t)}\right)$$

Figure 5. Formula of weighting for a term in a sentence.
where N is the number of sentences in the corpus, and df(t) is the number of elements (sentences) that contain t.

Then we promote the terms that are frequent in the relevant sentences and that are not part of the initial query, by assigning them a weight (called the weight of relevant sentence, WRS). For a term, this weight depends on the number of words of the query that are associated with the relevant sentence to which the term belongs. WRS_{i,j}, for a term i of a relevant sentence j, is calculated as follows:

$$WRS_{i,j} = \frac{1}{\log(M / n_j)}$$

Figure 6. Formula of weight of relevant sentence.
where M and n_j are respectively the number of terms in the query and the number of terms of the query that are associated with the relevant sentence j. The final score of a term, noted TRQ (Term Relatedness to Query), is calculated using the following formula:
$$TRQ = \alpha \cdot WRS + (1 - \alpha) \cdot ief$$
Figure 7. Formula of the score of Term Relatedness to Query.
where the parameter α = 0.25.
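The three scores can be combined in a short Python sketch. It rests on assumptions: ief is taken as log(N / df(t)) and WRS as 1 / log(M / n_j), with base-10 logarithms, which is the reading that reproduces the numbers of the worked example given in the evaluation section (log(5/3) = 0.22 and 1/log(2/1) = 3.33). The convention WRS = 0 when no query term occurs in the sentence follows that example; the case n_j = M is not covered by the text and is rejected here. The function names are hypothetical.

```python
import math

ALPHA = 0.25  # value of the parameter given in the text


def ief(n_sentences, sentence_freq):
    """Inverse element frequency of a term over a set of sentences."""
    return math.log10(n_sentences / sentence_freq)


def wrs(n_query_terms, n_query_terms_in_sentence):
    """Weight of relevant sentence for a term located in sentence j."""
    if n_query_terms_in_sentence == 0:
        # Convention used in the paper's example: 1 / log(M / 0) counts as 0.
        return 0.0
    if n_query_terms_in_sentence == n_query_terms:
        # ASSUMPTION: log(1) = 0 leaves the ratio undefined; the text is silent here.
        raise ValueError("WRS is undefined when every query term is in the sentence")
    return 1.0 / math.log10(n_query_terms / n_query_terms_in_sentence)


def trq(wrs_value, ief_value, alpha=ALPHA):
    """Term Relatedness to Query: weighted combination of WRS and ief."""
    return alpha * wrs_value + (1 - alpha) * ief_value


# A term seen in three sentences out of five, with one query term in the first
# sentence and none in the other two; the best WRS over the sentences is kept.
best_wrs = max(wrs(2, 1), wrs(2, 0), wrs(2, 0))
score = trq(best_wrs, ief(5, 3))
```

With unrounded intermediate values the score is about 0.997, close to the value the paper obtains from the rounded figures 3.33 and 0.22.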
4) Selection of susceptible terms: Once the TRQ scores of the terms of all relevant sentences have been calculated, we select the terms with the n highest scores. However, some of the selected terms may have the same score. To distinguish them, we evaluate their tendency to appear close to the terms of the initial query, that is, we calculate their dependence on the query terms in context. For this, we propose to use the Dice measure. For a term T and a term M of the query, the Dice measure [25] is calculated as follows:

$$Dice(T, M) = \frac{2 \cdot |(T, M)|}{|T| + |M|}$$

Figure 8. Measure of Dice.

where |(T, M)| is the number of times the term T and the query term M occur in the same sentence of a corpus document, and |T| and |M| are the total numbers of occurrences of T and M respectively in the enrichment corpus.
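As an illustration of this tie-breaking measure, the sketch below computes the Dice coefficient over a corpus given as a list of sentences, each sentence being a list of terms. The representation of the corpus and the function name `dice` are assumptions; co-occurrence is counted once per sentence, following the definition above.

```python
def dice(sentences, term, query_term):
    """Dice measure between a candidate term and a query term."""
    # Number of sentences in which both terms appear together.
    together = sum(1 for s in sentences if term in s and query_term in s)
    # Total number of occurrences of each term in the corpus.
    n_term = sum(s.count(term) for s in sentences)
    n_query = sum(s.count(query_term) for s in sentences)
    total = n_term + n_query
    return 2 * together / total if total else 0.0


corpus = [["a", "b"], ["a", "c"], ["b"], ["a", "b", "a"]]
value = dice(corpus, "a", "b")
```

Here the two terms share two sentences and occur four and three times respectively, so the measure is 4/7; among terms with equal TRQ, the one with the higher Dice value would be preferred.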

The last step of the enrichment process is to integrate the selected terms into the initial query and send the enhanced query to the search engine, so that the system returns results that better meet the needs of the user.
Experiments and Evaluation Method

A. Experiment on small collections
To evaluate the performance of the proposed method, we used a corpus of documents and queries. This corpus was constructed manually and consists of a set of Arabic articles extracted from the web, covering the environmental domain. The texts of the articles are segmented into sentences.

Each question is expressed in natural language; the relevance of its terms is evaluated by the statistical method. The following table shows the components of our enrichment corpus.
TABLE I. DESCRIPTION OF THE COMPONENTS OF THE ENRICHMENT CORPUS

Number of documents	100
Number of queries	20
Number of sentences	2646
Number of terms per sentence	11

The questions of our evaluation consist of two terms. After applying the proposed enrichment method (adding the terms that may be correlated with the response), the queries become more expressive and longer (about 6 terms per query).

Here is an example describing the selection of terms related to an initial query. We have an initial query composed of two terms, "أسباب تلوث الهواء" (Causes of air pollution), Q_initial = {term1, term2}. The sentences corresponding to Q_initial are expected to share those terms. We get 10 sentences from different documents:

S1: [\تلوث الهواء\] ([\ Air pollution \])
S2: [\تلوث الهواء\:\يتلوث الهواء\ بالعديد\ الملوثات \يمكن\ اجمالها\ في\ أسباب \] ([\ Air pollution \: \ air is polluted \ by many \ pollutants \ can \ be summed up \ in \ causes \])
S3: [\ تلوث الهواء\ و\تأثيره\ على \صحة الإنسان ] ([\ Air pollution \ and \ its effect \ on \ human health \])
S4: [\ملوثات\ ناتجة\ عن\ احتراق الوقود\ ومخلفات الصناعة\ :\ينتج\ عن\ احتراق الوقود\ احتراقاً\ غير كامل\ غازات\ و\مركبات مختلفة\ تلوث الهواء\ مثل\ :-\أول وثاني أكسيد الكربون\ و\الهيدروكربونات\] ([\ Pollutants \ resulting \ from \ fuel combustion \ and industrial waste \: \ results \ from \ fuel combustion \ a combustion \ incomplete \ gases \ and \ various compounds \ air pollution \ such as \: - \ carbon monoxide and dioxide \ and \ hydrocarbons \])
S5: [\تؤثر\ على \تأين الهواء\] ([\ Affects \ on \ air ionization \])
S6: [\نقل\ الأمراض المعدية\ -\تؤثر\ بطريقة مباشرة\ على\ الجهاز التنفسي \] ([\ Transmission \ infectious diseases \ - \ affects \ directly \ on \ the respiratory system \])
S7: [\سبل مكافحة\ هذا \التلوث\ لحماية\ صحة الانسان\ ] ([\ Ways to fight \ this \ pollution \ to protect \ human health \])
S8: [\أحد\ مسببات\ المطر\ المضي\ سرعة\ صدأ المعادن\ مثل\ أعمدة الإنارة\ ] ([\ One of \ causes \ rain \ go \ speed \ rust of metals \ such as \ lamp posts \])
S9: [ \طرق المعالجة\ بتصنيفها\ حسب\ نوع الإستفادة\ منها\ في \صناعات جديدة\و\في \توليد الطاقة\ ] ([\ Treatment methods \ by classifying them \ according to \ type of benefit \ from them \ in \ new industries \ and \ in \ power generation \])

S10: [\مركبات الكبريت\-\مركبات النيتروجين\\خفض\ درجة حرارة\ الأرض\تغير\ الطيف الشمسي] ([\ Sulfur compounds \ - \ nitrogen compounds \ lowering \ temperature \ Earth \ change \ the solar spectrum \])
S1, S2, S3, S4, S8 constitute the set of relevant sentences. Table 2 below shows the relevant sentences associated with each significant term of Q_initial:
TABLE II. EXTRACTION OF THE RELEVANT SENTENCES CORRESPONDING TO THE TERMS OF THE QUERY

Terms of query	Relevant sentences
term 1 "أسباب" (causes)	S8
term 2 "تلوث الهواء" (air pollution)	S1, S2, S3, S4

^{In this example, of the term T 1}^«^أسباب^"( ^causses)^{of the Q}_initial^{, is associated with a single relevant sentence S8. Note that the term}^"^تأثير^{" (Impact)}^{is associated to the three relevant sentences so we measure its}^ief^{, it equal to log (5 /3)=0,22, as this term exists in three sentences (S3, S5, S6) of the five relevant.}

Thus, the WRS value of the term "تأثير", found in the third position of sentence S3, is calculated as follows: WRS_{3,3} = 1/log(2/1) = 3.33. Indeed, one term of Q_initial, "تلوث الهواء" (air pollution), is associated with the relevant sentence S3 (in which the term "تأثير" occurs).

Note that the WRS value of a term depends on the relevant sentence in which it is located, so the same term can receive different WRS values in different relevant sentences.

For example, the WRS value of the term "تأثيره" in S5 is WRS_{1,5} = 1/log(2/0) = 0 (treating 2/0 as infinite), because no term of Q_initial is associated with S5. The same holds for the WRS value of the term in S6: WRS_{2,6} = 1/log(2/0) = 0.

Interpretation: the term "تأثيره" is more significant in S3 than in sentences S5 and S6; therefore the WRS of the term "تأثيره" is taken to be 3.33.

The final score of the term "تأثيره" is TRQ = 0.25 * 3.33 + 0.75 * 0.22 = 0.99.
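The arithmetic of this worked example can be checked with a short Python sketch. It assumes base-10 logarithms (inferred from log(5/3) = 0.22), treats 1/log(2/0) as 0 as the text does, takes the weights 0.25 and 0.75 from the TRQ line above, and keeps the largest WRS across sentences, following the interpretation given for S3. The function names and the parameterized form of WRS are hypothetical and extrapolated from the two cases shown in the example.

```python
import math

def ief(n_relevant_sentences, n_sentences_with_term):
    """Inverse sentence frequency, assuming a base-10 logarithm."""
    return math.log10(n_relevant_sentences / n_sentences_with_term)

def wrs(n_query_terms_for_sentence):
    """WRS as it appears in the worked example: 1 / log10(2 / n).

    n = 0 is treated as the limiting case 1 / log(infinity) = 0.
    n = 2 would give log10(1) = 0 and is not covered by the example.
    """
    if n_query_terms_for_sentence == 0:
        return 0.0
    return 1.0 / math.log10(2 / n_query_terms_for_sentence)

def trq(wrs_value, ief_value, w_wrs=0.25, w_ief=0.75):
    """Weighted combination used for the final score in the example."""
    return w_wrs * wrs_value + w_ief * ief_value

ief_impact = ief(5, 3)                      # term occurs in 3 of 5 sentences
wrs_s3, wrs_s5, wrs_s6 = wrs(1), wrs(0), wrs(0)
score = trq(max(wrs_s3, wrs_s5, wrs_s6), ief_impact)

print(round(ief_impact, 2), round(wrs_s3, 2), round(score, 2))
```

With unrounded intermediate values the sketch gives about 0.22, 3.32 and 1.00, which agrees with the figures in the text (0.22, 3.33, 0.99) up to rounding.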

Likewise, for the term "سبل مكافحة" (ways to fight), equivalent to "طرق المعالجة" (treatment methods), the final score is TRQ = 0.3, and for the term "صحة الإنسان" (human health) the final score is TRQ = 1.12.

The score given by the TRQ function makes it possible to rank the terms belonging to the relevant sentences in descending order. The three top-ranked terms, which are the most likely to be correlated with the response, are added to the initial query, which gives the enriched query:

"أسباب تلوث الهواء تأثيره صحة الإنسان طرق المعالجة" (causes of air pollution, its impact, human health, treatment methods).
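The ranking and enrichment step can be sketched in Python as follows. This is only an illustration: the function name `enrich_query` and the `top_k` parameter are hypothetical, the scores are the three TRQ values reported in the example, and the sketch appends the added terms in descending score order, which is not necessarily the order in which they appear in the enriched query printed above.

```python
def enrich_query(initial_terms, trq_scores, top_k=3):
    """Append the top_k highest-scoring candidate terms to the initial query."""
    # Rank candidate terms by TRQ score, highest first.
    ranked = sorted(trq_scores, key=trq_scores.get, reverse=True)
    # Skip candidates that are already part of the query.
    additions = [term for term in ranked if term not in initial_terms][:top_k]
    return list(initial_terms) + additions

initial_query = ["أسباب", "تلوث الهواء"]
trq_scores = {
    "تأثيره": 0.99,
    "طرق المعالجة": 0.3,
    "صحة الإنسان": 1.12,
}

enriched_query = enrich_query(initial_query, trq_scores)
print(" ".join(enriched_query))
```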

B. Performance Evaluation
To evaluate the proposed method, we use the average measure of precision (AMP) [26]. This measure assesses the quality of the ranking produced by an IRS by taking into account the rank at which each correct answer is found, so that noisy answers ranked ahead of correct ones lower the score.

An IR system computes a relevance score for every document in the test collection and ranks the documents in descending order of relevance, in the same way as search engines (for example Google). Going through the list of documents returned by Google, the precision is computed at each relevant document. AMP is obtained by averaging these precision values. For a set of queries, AMP is calculated as follows:

Figure 9. Formula of the average measure of precision.

where d_ij is the j-th relevant document for query Q_i, rank(d_ij) is the rank of this document in the list of responses of the system, n_i is the number of documents relevant to query Q_i, and N is the number of queries.
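Since the formula of Figure 9 is not reproduced in the text, the following Python sketch assumes the usual mean-average-precision form that matches the symbols defined above: for each query Q_i, the precision at the j-th relevant document is j / rank(d_ij), these values are averaged over the n_i relevant documents, and the result is averaged over the N queries. The function name and the sample ranks are hypothetical.

```python
def average_measure_of_precision(ranks_per_query):
    """Compute AMP from the ranks of the relevant documents.

    ranks_per_query: one list per query, holding the 1-based ranks
    rank(d_ij) of the relevant documents in the system's result list.
    """
    total = 0.0
    for ranks in ranks_per_query:
        if not ranks:
            continue  # a query with no relevant document contributes 0
        ordered = sorted(ranks)
        # Precision at the j-th relevant document is j / rank(d_ij).
        precisions = [j / rank for j, rank in enumerate(ordered, start=1)]
        total += sum(precisions) / len(ordered)  # average over n_i
    return total / len(ranks_per_query)          # average over N

# Hypothetical ranks: query 1 has relevant documents at ranks 1 and 3,
# query 2 has a single relevant document at rank 2.
amp = average_measure_of_precision([[1, 3], [2]])
print(round(amp, 2))
```

In this sketch a relevant document found lower in the list reduces the score, which is the behaviour the text attributes to AMP.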

We first evaluate the performance of the system using the initial queries as they are (without enrichment). We then apply the same evaluation method using queries enriched with terms that may be correlated with the response:

Figure 10. Precision of the IRS for enriched queries compared with initial queries.
The AMP obtained for the initial queries without enrichment is 0.33. After applying the enrichment method to the initial queries, the AMP rises to 0.81. This confirms that enriching the initial query significantly improves the relevance of the results, and thus the performance of document retrieval in meeting user needs.

Conclusion

In this paper, we focused on the quality of the response of an IR system in Arabic, which depends largely on the quality of the constructed query. We therefore proposed a method for enriching the initial query that combines two types of processing: linguistic and statistical.

To test this method, we used the Google search engine to obtain the average measure of precision (AMP) for the initial query and for the enriched query. Despite the difficulties of the Arabic language, the method improves retrieval performance: the AMP is 0.81 with enriched queries, compared with 0.33 with the initial queries alone.
References
[1] Mitra M., Buckley C., Singhal A., Cardie C., «An analysis of statistical and syntactic phrases». In 5ème Conférence de Recherche d’Information Assistée par Ordinateur (RIAO’1997), Montreal, Canada, p. 200-214, (June 1997).

[2] Boulaknadel S., «Automatic language processing and information retrieval in Arabic in specialty domain», (2008).

[3] Salton G., Fox E. and Wu H., «Extended Boolean Information Retrieval». Communications of the ACM, 26(11):1022–1036, (1983).

[4] Yannick Prié, «Modélisation de documents audiovisuels en Strates Interconnectées par les Annotations pour l'exploitation contextuelle». Thesis, available at: http://lisi.insa-lyon.fr/~yprie/these/node1.html, (2000).

[5] Nicola Guarino, Claudio Masolo, and Guido Vetere. «OntoSeek: content based access to the web». IEEE Intelligent Systems, (1999).

[6] Ereteo, G., M. Buffa, F. Gandon, and O. Corby. «Analysis of a real online social network using semantic web frameworks». In Proc. International Semantic Web Conference, ISWC’09, Washington, USA, (2009).

[7] Laib, M., Semmar, N., Fluhr, C.: «Using a linguistic approach for indexing and interrogation in natural language text of databases multilingual». Acts of the Prime International Symposium on Automatic Treatment of Arabic Language, (2006).

[8] Andreewsky, A., Binquet, J.P., Debili, F., Fluhr, C., Pouderoux, B.: «Linguistic and statistical processing texts and its application in the legal documentation». Proceedings of the Sixth Symposium on Legal Data Processing in Europe, Thessaloniki, Greece. (1981).

[9] Attar, R. and Fraenkel, A. «Local feedback in full-text retrieval systems». Journal of the ACM, 24(3):397–417, (1977).

[10] Robertson, S. and Sparck-Jones, K. «Relevance weighting of search terms», volume 27, pages 129–146, (1976).

[11] Ihadjaden M., « Design, implementation and evaluation of a retrieval system and automatic categorization of textual information on the Internet», Thesis, University Paris IV, (1994).

[12] Gauch S., Smith J.B. «An expert system for automatic query reformulation». Technical report, University of North Carolina, (1992).

[13] M. DIAB, K. HACIOGLU and D. JURAFSKY. «Automatic tagging of Arabic text: From raw text to base phrase chunks». In Proceedings of NAACL-HLT, pages 149–152, Boston, USA, (2004).

[14] S. Khoja and S. Garside. «Stemming Arabic Text». Technical report, Computing Department, Lancaster University, http://www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps, (1999).

[15] Hadouche. « Integration of syntactic - semantic information in terminology databases». annotation methodology and automation perspective, (2000).

[16] E. GAUSSIER, G. GREFENSTETTE and M. SCHULZE. «Natural language processing and information retrieval: some experiments on the French». In Proceedings of 1st day scientific and technical network of Francophone of language engineering AUPELF-UREF, pages 33-45, Avignon, France, (1997).

[17] M. Silberztein, «INTEX: a finite state transducer toolbox». Theoretical Computer Science, vol. 231(1), p. 33-46, (1999).

[18] R. Blachere and M. Gaudefroy-Demombynes, «Classical Arabic grammar (morphology and syntax)», G.-P. Maisonneuve & Larose, Paris, 508 p., (1975).

[19] Silberztein M., «Electronic dictionaries and automatic text analysis: the INTEX system», Paris, Masson, (1993).

[20] Salton G., Wong A. and Yang C. S. «A vector space model for automatic indexing». Communications of the ACM, 18(11):613–620, (1975).

[21] Baeza-Yates, R. and Ribeiro-Neto, B. «Modern information retrieval». ACM Press, Addison-Wesley, (1999).

[22] Salton G., McGill M. J., «Introduction to Modern Information Retrieval», McGraw-Hill, Inc., New York, NY, USA, (1986).

[23] C. BUCKLEY, G. SALTON and J. ALLAN. «Automatic retrieval with locality information using SMART». In TREC, pages 59–72, (1992).

[24] GRABS T., SCHECK H.-J., « Flexible information retrieval from XML with PowerDB XML », Proceedings in the First INEX Workshop, p. 26-32, (2002)

[25] Roche M., Kodratoff Y., « Text and Web Mining Approaches in Order to Build Specialized Ontologies », Journal of Digital Information, vol. 10, n°4, p. 6, (2009).

[26] VOORHEES E. «The TREC-8 question answering track report». In Proceedings of TREC-8, (1999).
