
B.1.2 EU dimension

The GALATEAS project is at the crossroads of three main lines of European and national investment, namely:



  • Digital Libraries;

  • Machine Translation;

  • Web Services and Software as a Service

These lines of R&D have received strong support at the European, national and even regional level, shaping the objectives of numerous funding frameworks. This importance is widely justified by the fact that these activities all strengthen European cohesion and represent the building blocks of the i2010 strategy:



  • Support to digital libraries means an effort to integrate European richness in terms of art, history, culture and science into a single, unified information space.

  • Support to Machine Translation means an effort to overcome language barriers, a challenge that became even more compelling with the arrival of new official languages in the Union.

  • Support to Web Services (with implicit support to semantic web initiatives) implies the creation of a network of applications which easily communicate across borders without the burden of integration, installation, complicated licensing schemes, etc.

The GALATEAS project will support these policies in the following way:



  • Digital Libraries: they represent the major class of users of GALATEAS web services: they will gain insight into the information needs of their end-users and they will achieve cross-linguality in a simple, non-intrusive way.

  • Machine Translation: one of the services offered by GALATEAS concerns query translation. The advantage of the GALATEAS approach is that it will be possible to train query translation systems for virtually any European language, as long as query log resources become available.

  • Web Services and Software as a Service: not only will GALATEAS expose its solutions as web services with publicly available specifications, but all NLP and semantic components will themselves be accessed as web services. This will allow the creation of a European network of GALATEAS service providers, chosen on the basis of quality and competition.

In relation to the last aspect, it is worth mentioning that GALATEAS promotes openness of software and infrastructure. This is reflected by the fact that some technology is developed under the project using an open source development model, while other technology is accessed using a web service model.


In order to enhance the European dimension and to promote cross-fertilization among different funded initiatives, GALATEAS will seek collaboration with projects on machine translation and digital libraries. Among the optimal candidates, with which one or more partners have already established contacts, particular attention will be paid to:

  • Machine Translation:

    • SMART

    • EUROMATRIX(PLUS)

    • MONOTRANS

    • Any project that may be approved in the context of Call 4.

  • Digital Libraries:

    • NEEO

    • CACAO

    • TELplus

    • ENRICH

    • GAMA

    • Europeana



B.1.3 Maturity of the technical solution



B.1.3.a The overall picture


The main objective of GALATEAS is to assemble a set of innovative technologies in order to derive a simple and cost effective solution to the challenges raised by multilingual query log analysis and query translation.
The main technologies adopted in the project are summarized in the following diagram:

The inner circle contains third party technologies (available to consortium partners) which will be accessed as web services. These technologies are assumed to be mature, by virtue of the fact that they have already been tested in a number of research projects, international competitions and, at least as far as XEROX and CELI are concerned, commercial applications. The role of GALATEAS will mainly consist in wrapping them as web services functionally serving GALATEAS.

The outer circle contains technologies which are not owned by the consortium, but which the consortium has the expertise to parameterize and customize in an optimal way. For each of these technologies GALATEAS will perform customizations of both open source and commercial proprietary software (e.g. Mondrian/BO for data mining, Tomcat/IIS for web services, AWSTATS/WEBTRENDS for structured log analysis, and MySQL/Microsoft SQL Server for database technologies).



In the middle circle (white shapes) are those technologies which will constitute the central aspect of innovation in GALATEAS. In general they are all considered quite mature from a research point of view, but only in a few cases have they undergone real optimization and industrial take-up. Some of them have open source incarnations (such as MOSES for Machine Translation), while for others only “foundational” libraries are available (such as WEKA for automatic classification). The consortium will adopt these technologies with the following goals:

  • Integrate them with the other technologies in order to achieve the proposed solutions (LangLog and QueryTrans);

  • Parameterize them, in order to be able to deal with linguistic objects such as queries, characterized by brevity, poor syntax and lack of structure.

In the remainder of this section we provide a description of the innovative aspects of the adoption of the latter group of technologies.

B.1.3.b The Components

B.1.3.b.1 Classification


The automatic classification problem is here regarded as a way to answer the specific need of an information provider who wants to know which categories are most often targeted by user queries. By “categories” we mean a classification schema into which the contents of the information provider are classified. It can be represented by structured objects such as subject headings for digital libraries (e.g. LCSH), pure classification systems such as the Dewey Decimal Classification (DDC), unstructured sets of labels, or even implicit organizations such as sections in a web site. Many sites offer classification-based navigation that allows the user to browse information by category. In these cases the information provider can understand which categories are most interesting simply by analyzing standard transaction logs. However, when the information is searched via a textual query to a search engine, transaction logs can only reveal the type of digital items which were clicked by the user (if any); they cannot disclose what the user was looking for (click-through is not necessarily proof that the user found what s/he was looking for; indeed it depends a lot on the quality of the snippets associated to hits). The goal of the classification system in GALATEAS is to match user queries as extracted from logs with the categories in use by the information provider, in order to disclose knowledge about real user needs. In GALATEAS we propose the tailoring of two different algorithms, depending on the nature of the information provider. These are described in the following sections.
B.1.3.b.1.1 Click Through Classifier

The most “traditional” way to perform automatic classification is via machine learning (ML). Several ML algorithms have been proven successful in this task, and the availability of open source ML libraries will give us the opportunity to experiment with different algorithms in order to detect the optimal configuration when dealing with query-topic associations. In GALATEAS we will evaluate at least Support Vector Machines, and Independent Component Analysis coupled with Latent Semantic Analysis or other semantic vector generator models (Pu and Yang 2006).

Classical categorization algorithms presuppose the availability of a manually built training set in order to train the classifier for a specific domain/classification system. In our default situation, however, we do not have such a set, unless we engage in a costly operation of manual tagging, which is not the case. The only way we can gather data about query-category membership is via click-through. We could assume that if a user issues a query and then clicks on a specific digital item, the query can be assigned to the category to which that object belongs. However, as we just mentioned, this method is error prone in all those cases in which the quality of the snippet (i.e. the information presented about a specific digital object in the hit page) is such that we cannot be sure that it represents the content of the document. To make a practical example, if a user types “jaguar Africa environment” and then, because of an imprecise or missing snippet, clicks on a document which deals with cars, we would use the association “jaguar Africa environment”-automotive to train a statistical classifier, with the risk of producing a rather “confused” model.

There are cases, however, where the quality of the information presenting the hit is almost perfect. The typical case is represented by multimedia (image) digital libraries: for instance, if in a repository of photos the user clicks on a thumbnail, it is highly probable that such a photo is in line with his/her query. Consequently the association “query-category_of_the_digital_object” can be safely used to train the classifier.

The case of multimedia digital libraries also applies to other kinds of collections, such as “traditional” digital libraries with high quality metadata and certain merchant sites. In all these cases the information provider will transmit past transaction logs to the GALATEAS system, which will train a specific classifier on the basis of the set of available query click-through associations.
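The training loop can be sketched as follows. For concreteness a simple bag-of-words perceptron stands in for the SVM and LSA-based models named above (the choice of learner is an assumption of this sketch, not a project decision), and the toy click-through log is invented for illustration:

```python
# Toy sketch: learning query -> category associations from click-through
# pairs.  A multi-class perceptron stands in for the SVM/LSA models.
from collections import defaultdict

def train_clickthrough(log, epochs=10):
    """log: list of (query, clicked_category) pairs from transaction logs."""
    weights = defaultdict(lambda: defaultdict(float))  # category -> term -> w
    cats = sorted({c for _, c in log})
    for _ in range(epochs):
        for query, gold in log:
            terms = query.lower().split()
            score = lambda c: sum(weights[c][t] for t in terms)
            pred = max(cats, key=score)
            if pred != gold:                  # mistake-driven update
                for t in terms:
                    weights[gold][t] += 1.0
                    weights[pred][t] -= 1.0
    return weights, cats

def classify(query, weights, cats):
    terms = query.lower().split()
    return max(cats, key=lambda c: sum(weights[c][t] for t in terms))

# Hypothetical click-through log: each query is paired with the category
# of the digital item the user clicked on.
log = [("jaguar africa habitat", "zoology"),
       ("lion savanna prey", "zoology"),
       ("jaguar xk engine", "automotive"),
       ("used car price", "automotive")]
weights, cats = train_clickthrough(log)
print(classify("lion habitat africa", weights, cats))  # → zoology
```

Note how the ambiguous term “jaguar” ends up with near-zero weight in both categories, while unambiguous terms carry the decision; the quality of the resulting model depends entirely on the reliability of the click-through signal, as discussed above.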


B.1.3.b.1.2 Semantic based Classifier

In all cases in which no reliable query-category association is available, we must rely on unsupervised methods not based on machine learning. Basically we must capitalize on information derived directly from the classification system, such as category descriptions, category names, and category examples. In certain cases we will also be able to derive this information from metadata-category associations. Each category will then be characterized by a set of linguistic features, which will in turn be enriched by calling some of the GALATEAS utility web services such as semantic thesauri (either manually coded (EWN) or vector based). Ultimately each category will be represented by a vector of semantic features. Each user query is transformed into the same kind of vector by applying exactly the same methodology. Standard vector distance techniques (such as the cosine) will then be used to measure query membership in a classification category.
It should be noted that the maturity of this approach was already largely demonstrated in the CACAO project (Bernardi et al. 2009, CACAO deliverable 2.3), where it was applied to semantic matching between subject headings and metadata records.
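The core matching step can be sketched in a few lines. Here each category is represented by a hand-written bag of descriptive terms (in GALATEAS these would come from category names, descriptions and thesaurus expansion via the utility web services), and a query is assigned to the category whose term vector is closest by cosine similarity:

```python
# Minimal sketch of the semantic classifier: cosine matching between a
# query vector and per-category feature vectors.  Term sets are invented.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def classify(query, categories):
    """Return the category whose feature vector best matches the query."""
    q = Counter(query.lower().split())
    return max(categories, key=lambda c: cosine(q, categories[c]))

categories = {
    "wine": Counter("wine grape vineyard cellar vintage".split()),
    "history": Counter("war empire century king battle".split()),
}
print(classify("vintage grape wine", categories))  # → wine
```

Thesaurus enrichment would simply add weighted synonym terms to the category vectors before matching; the distance computation stays the same.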


B.1.3.b.2 Clustering


Certain information providers might have no classification system, or might be interested in knowing how user queries group together irrespective of the mapping onto a specific classification system. The classical technology to achieve this result is clustering. GALATEAS will pursue two different approaches: one based on Topic Models (e.g. Newman et al. 2007, Griffiths et al. 2004) and one based on more traditional spherical k-means (Dhillon et al. 2002).
Statistical topic modelling is a recently developed machine learning technique, based on the seminal LDA (Latent Dirichlet Allocation) model of Blei et al. (2003), extended by the by now popular Gibbs sampling method for inference (Griffiths and Steyvers, 2004). It has already been used in a range of domains, from named entity clustering to metadata grouping. Topic Models simultaneously discover a set of topics or subjects covered by a collection of text documents (in our case user queries) and determine the mix of topics associated with each document (or query). These topic models are gaining popularity because they produce easy-to-interpret topics and can effectively categorize the contents of large scale collections.

The consortium will use the open source version of LDA GibbsLDA++ (http://gibbslda.sourceforge.net/) or its java counterpart JGibbLDA (http://jgibblda.sourceforge.net/).


Topic modelling is a promising technique that, according to some authors (Newman et al. 2007), can provide up to 83% usable clusters (over a set of 500 target clusters). However, the phase of cluster production can be very time consuming (for instance, 10 days of computing time on a 3GHz processor were required to process 7.5 million queries). Even though these inefficiencies can be reduced by algorithm optimisation and input cleaning, it is extremely unlikely that Topic Models will ever be able to provide real time results on query clustering. Therefore GALATEAS will exploit in parallel a more traditional clustering technique. In particular, positive real time acceptable results (i.e. meaningful to the analyst) have been obtained in the context of the QCS system (Dunlavy et al. 2006) by using document vectors clustered by a spherical k-means algorithm employing first variation and splitting (Dhillon et al. 2002).
In the context of GALATEAS the GPL software GMEANS (http://www.cs.utexas.edu/users/dml/Software/gmeans.htm) will be tailored to enable information providers to access real time clustering of user queries.
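The essence of spherical k-means is easy to sketch: query vectors are L2-normalized and assigned to the centroid with the highest dot product (equivalently, the smallest cosine distance). The following toy version, which is an illustration of the technique and not the GMEANS code (it omits first variation and splitting, and uses a simple deterministic farthest-point initialization), clusters four invented queries:

```python
# Toy spherical k-means over bag-of-words query vectors.
import math
from collections import Counter

def unit(vec):
    """L2-normalize a sparse term-frequency vector."""
    n = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / n for t, v in vec.items()} if n else dict(vec)

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def spherical_kmeans(queries, k, iters=10):
    vecs = [unit(Counter(q.lower().split())) for q in queries]
    # Deterministic farthest-point initialization.
    centroids = [vecs[0]]
    while len(centroids) < k:
        far = min(vecs, key=lambda v: max(dot(v, c) for c in centroids))
        centroids.append(dict(far))
    assign = [0] * len(vecs)
    for _ in range(iters):
        # Assign each vector to the most similar centroid.
        assign = [max(range(k), key=lambda c: dot(v, centroids[c]))
                  for v in vecs]
        # Recompute each centroid as the normalized sum of its members.
        for c in range(k):
            merged = Counter()
            for v, a in zip(vecs, assign):
                if a == c:
                    merged.update(v)
            if merged:
                centroids[c] = unit(merged)
    return assign

queries = ["cheap wine bottle", "red wine grape",
           "roman empire history", "history of the empire"]
print(spherical_kmeans(queries, k=2))  # → [0, 0, 1, 1]
```

The wine-related and history-related queries fall into separate clusters after a single pass; the production system additionally needs the incremental refinements (first variation, splitting) that make GMEANS fast enough for real time use.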

B.1.3.b.3 TLike


The TLike algorithm has the goal of aligning translation equivalent user queries extracted from transaction logs.

An analysis of the web transaction logs of the library catalogues of partners in the project showed that in an academic context, such as that of a university library, about 34% of the queries are "duplicated" in at least two languages (normally the local language and English; e.g. the user wrote first vino and then, as a second search, wine). In a library operating in a multicultural context, we observed that about 10% of the queries are written in three languages, namely Italian, German and English (e.g. the user first typed vino, then wine and finally Wein).


This fact is a major opportunity to update the translation lexicon. Indeed, if we could store the translations implicitly provided by users in the system's translation dictionaries, we could add items which are (i) relevant to users, and (ii) reflective of users' perception of the translation of a given word in a different language. Even more importantly, if we could draw not only on query translations implicitly provided by the same user, but were also able to detect which queries in a transaction log file are translations of each other, we would have at our disposal a large aligned corpus for training specialized machine translation systems.
The TLike algorithm (Bosca and Dini 2009, CACAO deliverable 2.4) is meant to retrieve pairs of translation equivalent queries from transaction logs. It presupposes the existence of three main components (GALATEAS utility web services):

  • A system for Natural Language Processing able to perform for each relevant language basic tasks such as part of speech disambiguation, lemmatization and named entity recognition.

  • A set of word based bilingual translation modules.

  • A semantic component able to associate a semantic vector representation to words. (Bosca and Dini, 2008).

The basic idea behind the TLike algorithm is to estimate the probability that two queries are translations of each other. In the simplest case, we expect that if all the words in query QS have a translation in query QT and if QS and QT have the same number of terms, then QS and QT are translation equivalent. Things are of course more complex than this, due to the following facts:



  • The presence of compound words makes the constraint on the cardinality of search terms defeasible (e.g. the Italian carta di credito vs. the German Kreditkarte).

  • One or more words in QS could be absent from translation dictionaries.

  • One or more words in QS could be present in the translation dictionaries, but the contextually correct translation might be missing.

  • There might be items which do not need to be translated, notably Named Entities.

In order to face this complexity, the TLike algorithm (schematically described in the pseudo-code at the end of this section) exploits both NLP techniques, such as Named Entity extraction and compound recognition, and statistical techniques, such as vector based cross language matching. See Bosca and Dini (2009) for a full description.



In the context of the GALATEAS project, the TLike algorithm will be optimised in order to handle millions of candidate translation pairs efficiently, and to take into account extra-linguistic features such as user session, query time, click-through, etc.
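To make the basic intuition concrete, the following is a deliberately simplified sketch of the TLike idea (not the project's actual pseudo-code): two queries are candidate translations when every source word either translates into a target word via a bilingual dictionary, or is a named entity that passes through unchanged. The dictionary and named entity list are invented; the real algorithm adds compound handling and vector-based semantic matching for out-of-dictionary words.

```python
# Simplified TLike-style coverage check between two queries.
IT_EN = {"carta": {"card", "paper"}, "credito": {"credit"},
         "vino": {"wine"}, "storia": {"history", "story"}}
NAMED_ENTITIES = {"mozart", "roma", "rome"}

def tlike_match(src_words, tgt_words, dictionary):
    """Fraction of source words covered by a translation or NE in the target."""
    tgt = {w.lower() for w in tgt_words}
    covered = 0
    for w in src_words:
        w = w.lower()
        if w in NAMED_ENTITIES and w in tgt:
            covered += 1                  # named entity passes through
        elif dictionary.get(w, set()) & tgt:
            covered += 1                  # dictionary translation found
    return covered / len(src_words)

print(tlike_match(["carta", "credito"], ["credit", "card"], IT_EN))  # → 1.0
```

A pair scoring at or near 1.0 in both directions (and with compatible term counts, modulo compounds) would be emitted as a candidate aligned pair; in the real system the score is further conditioned on session, time and click-through features.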

B.1.3.b.4 Machine Translation


Statistical machine translation (SMT) is an approach to MT that is characterized by the use of machine learning methods. In less than two decades, SMT has come to dominate academic MT research, and has gained a share of the commercial MT market. Progress is rapid, and the state of the art is a moving target. We refer to Lopez (2008) for a comprehensive survey of the state of the art in SMT.
SMT treats translation as a machine learning problem. This means that it applies a learning algorithm to a large body of previously translated text (a parallel corpus). The learner is then able to translate previously unseen sentences. With an SMT toolkit and enough parallel text, an MT system for a new language pair can be built within a very short period of time. For example, Oard and Och (2003) report constructing a Cebuano-to-English SMT system in a matter of weeks. Workshops have shown that translation systems can be built for a wide variety of language pairs within similar time frames (Koehn and Monz 2005; Koehn and Monz 2006; Callison-Burch et al. 2007). The accuracy of these systems depends crucially on the quantity, quality, and domain of the data, but there are many tasks for which even poor translation is useful (Church and Hovy 1993): cross-language information retrieval is among these tasks.
In the context of GALATEAS we will train a publicly available machine translation system to learn how to translate user queries on the basis of a parallel corpus of queries aligned by the TLike algorithm. After a first scouting phase, the consortium is oriented towards adopting MOSES as its SMT toolkit, but the choice might be revised should a more suitable technology emerge during the project. MOSES is based on the theory of factored translation models. Factored translation models extend phrase-based ones by taking into account additional annotation, especially linguistic information. A number of experiments have shown that, instead of dealing with linguistic markup at pre- or post-processing steps, such information can be integrated into the decoding process to better guide the search (Koehn and Hoang, 2007). The main markup information used includes POS tags and the output of morphological analysis. The use of richer information generates multiple input hypotheses with varying degrees of confidence and accuracy; techniques used with MOSES to deal with this include n-best lists and confusion networks (Koehn et al. 2007).
The fact that query logs are characterized by short sentences poses further challenges to MT systems. Factor-based systems may have an advantage here when compared to traditional phrase-based ones, since they are not as dependent on the surface forms of the words occurring in the training data. The more training data we have, the less important this issue becomes. Other advantages of factor-based systems, such as being able to distinguish between different linguistic contexts, may be of limited value because of the short context represented in a query and the flexible word order.
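Concretely, a factored MOSES model consumes input in which each token carries its factors separated by a vertical bar (e.g. surface|lemma|POS). A small sketch of preparing a query in that format follows; the lemma/POS triples would in practice come from the GALATEAS NLP web services, and the example query is invented:

```python
# Building a factored Moses input line from annotated query tokens.
def to_factored(tokens):
    """tokens: list of (surface, lemma, POS) triples -> one Moses input line."""
    return " ".join("|".join(t) for t in tokens)

# Hypothetical French query, already lemmatized and POS-tagged upstream.
query = [("livres", "livre", "NOUN"), ("anciens", "ancien", "ADJ")]
print(to_factored(query))  # → livres|livre|NOUN anciens|ancien|ADJ
```

Because the decoder can back off from the surface factor to the lemma factor, a surface form unseen in the training queries can still be translated via its lemma, which is exactly the advantage claimed above for short, sparse query text.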

In GALATEAS the chosen SMT toolkit will have to be duly parameterized in order to answer the challenges raised by the specific text type. Parameterization steps include:



  • Word alignment factors: different operations are possible (such as intersection, union, grow, etc.). In MOSES the stem/lemma or the surface form can be used. Naturally, the quality and quantity of the available aligned or parallel corpora is crucial.

  • Lexicalised reordering models: these can be parameterised according to the orientation types of phrase pairs supported by MOSES (i.e. bidirectional, monotonicity, e and f).

  • Different language modelling toolkits: currently the SRI, IRST, and RandLM toolkits are available in MOSES. The IRST model supports huge language models and memory mapping, which enables the use of disk storage in addition to memory. RandLM supports even bigger language models, but at a cost in decoding speed. MOSES also supports distributed processing (Koehn et al. 2007).

  • Multiple decoding paths can be used (or more generally different variants of factor-based models can be applied).

An interesting feature of MOSES is that it is straightforward to generate n-best lists (the top n translations found by the search according to the model), which gives us access to different translation variants. This feature will be important, as in CLIR contexts one can easily benefit from translational variants (Monz 2006).

Another issue to be carefully evaluated is the size of the parallel corpus of queries. There are experiments reporting the use of 40K sentence pairs (to illustrate the advantage of factor-based models over phrase-based ones) up to 950K and 1.3M sentence pairs. However, these figures are difficult to evaluate on an absolute basis, as a crucial factor is the lexical diversity of the domain that is queried (assuming that this will also be reflected by the queries).




B.1.3.b.5 Named Entity Extraction


Named Entity extraction is a crucial task in GALATEAS for at least three reasons:

  • Named Entities represent the type of information that many managers of digital collections and vendor sites are willing to monitor and control (LangLog service).

  • Named Entity identification has been proven by Bosca and Dini (2009) to be a crucial aspect for performing query alignment correctly.

  • The quality of the query oriented machine translation systems will benefit from the integration of an advanced named entity recognition component (QueryTrans service).

Named entity recognition (NER: the identification and classification of references to entities of a fixed set of types of interest, such as PERSON, LOCATION, ORGANIZATION, a task closely related to term extraction) is an essential building block of information extraction. Work on the task took off with the Message Understanding initiative in the USA, which made the first annotated English language corpora available. Language-dependent, supervised methods were quickly shown to achieve near-human performance: the best system entering MUC-7 (Mikheev et al. 1999) scored an f-measure of 93.39%, while human annotators scored 97.60% and 96.95%; that is, the algorithms had roughly twice the error rate (6.61%) of human annotators (2.40% and 3.05%). Supervised methods were also used for language-independent NER (Cucerzan & Yarowsky, 1999), e.g. in the CONLL 2002 shared task (http://www.cnts.ua.ac.be/conll2002/ner/), with reasonable results. However, for Web-scale, multilingual named entity recognition, unsupervised methods using a combination of text patterns (Etzioni et al. 2008) and gazetteers (often found on the web itself, as in the case of Wikipedia, Freebase for people, and GeoNames for geographical locations) have often been found to be more practical and more extensible to new languages (Steinberger and Pouliquen, 2007).


The University of Trento and their collaborators at FBK have developed several state of the art supervised NER systems for Italian and English (e.g. Giuliano, 2009). XEROX has developed several rule based NER systems based on XIP (Aït-Mokhtar et al. 2001): they have been tested in several international competitions, and recently ranked as the best system in the ESTER 2 French NER competition. Finally, CELI was in charge of the NER task for user queries in the context of the CACAO project, and has adopted a heuristic-based approach (by means of voters, implementing an extensible pool of different NER approaches) which performs quite well in the specific CLIR setting. In the context of GALATEAS we will integrate supervised, heuristic-based, and rule based approaches as appropriate, using CELI's voting system, to maximize results in identifying Named Entities in queries for the whole set of GALATEAS languages.
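The voting scheme just mentioned can be sketched as follows. Here two toy voters (a gazetteer lookup and a capitalization heuristic, standing in for the supervised, heuristic-based and rule based systems) each nominate entity tokens, and tokens reaching the agreement threshold are accepted; the gazetteer content is invented for illustration:

```python
# Toy voter-based NER: independent recognizers nominate tokens,
# and tokens with enough votes are accepted as named entities.
from collections import Counter

GAZETTEER = {"paris", "mozart", "xerox"}  # hypothetical gazetteer entries

def gazetteer_voter(tokens):
    return {t for t in tokens if t.lower() in GAZETTEER}

def capitalization_voter(tokens):
    return {t for t in tokens if t[:1].isupper()}

def vote(tokens, voters, threshold=2):
    counts = Counter()
    for voter in voters:
        counts.update(voter(tokens))
    return {t for t, n in counts.items() if n >= threshold}

tokens = ["biography", "of", "Mozart", "in", "Paris"]
print(sorted(vote(tokens, [gazetteer_voter, capitalization_voter])))
# → ['Mozart', 'Paris']
```

The design is deliberately extensible: adding a new NER approach for a new GALATEAS language only means registering one more voter, and the threshold trades precision against recall.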


