Every day, millions of search queries are issued to content providers ranging from all-purpose web information sites (such as Google and Yahoo!) to digital library sites (such as Europeana and TEL) to merchant sites (such as Kelkoo and PriceGrabber). These queries are a precious resource for understanding user behaviour with respect to a collection of documents. From a careful analysis of such queries, content providers can understand what information users are really looking for, which strategies they currently use to retrieve digital objects, and how well the content offered by the web site matches the needs of end users. It is thus astonishing that no provider of log analysis products or services has attempted to go beyond the kind of support offered by services such as Google Analytics, to mention only the most popular. These services provide tools to segment user queries into words and to compute statistics about the occurrences of single words, but this is far from satisfying the needs of content providers:
Existing analysis tools treat words only as strings of characters. Any generalization over searches about, for instance, “sea” and “ocean” is therefore lost. Moreover, the ambiguities intrinsic to language are not resolved, so that a word such as “bank” is ranked irrespective of its meaning.
They do not match user searches against the informative backbone of the content aggregation, be it a standard classification system, subject headings, product types, or a plain list of categories.
They do not provide any hint about search episodes. Each search is seen as an isolated event, and there is no attempt to determine sequential patterns of search activity.
Queries as recorded in log files, besides being a precious resource for understanding users’ behaviour, could also become a key resource for achieving cross-lingual access to information, provided that appropriate alignment algorithms are applied to them. Indeed, given a sufficiently large number of pairs of queries that are translation equivalents, it would be relatively easy to derive a machine translation system specially adapted to deal with queries. It would consequently be possible to set up a service for plain query translation: such a service could be accessed by any kind of monolingual search engine in order to acquire cross-language functionality.
The objective of GALATEAS is to fill these gaps by setting up two web services:
LangLog: It will analyze transaction logs containing the queries submitted to the search engine of a given content provider. By applying statistical technologies coupled with language-oriented services, it will produce reports on the information needs of the users accessing that particular aggregation. In other words, in the same way that standard log analysis systems provide generalizations of the paths users follow inside a web site, LangLog will provide generalizations of the actions that information seekers perform in order to find content inside a searchable collection of digital objects.
QueryTrans: It will translate queries coming from an external search engine into several target languages: the external search engine will use these translations to return to the user results in languages different from the one in which the query was formulated. To be clear, QueryTrans is not a cross-language information retrieval system that performs indexing and search, but just a query translation service.
These two web services are tightly linked. Apart from the fact that they access the same range of NLP-based web services, LangLog is essential to the continuous acquisition of large quantities of queries in different languages: only on the basis of such an acquisition can the machine translation systems that constitute the backbone of QueryTrans be trained, and thus the service itself be provided.
The LangLog service will provide query log analysis for at least 7 languages, namely Italian, French, English, German, Dutch, Modern Arabic and Polish. It is therefore conceivable that a query-oriented machine translation system could be made available for each combination of these languages (42 directed pairs). In practice, it is unlikely that during the project the LangLog service will be able to gather enough translationally equivalent pairs to cover every language combination. The project will therefore focus on the Italian, French, German and English combinations (all pairs). For Dutch, Modern Arabic and Polish it will instead concentrate on finding translation equivalents with respect to English and on training the corresponding machine translation systems: in this way the missing pairs can be covered by transitive translation using English as a pivot language.
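The coverage arithmetic above can be checked with a minimal sketch. It assumes ISO 639-1 codes as language identifiers and a hypothetical `route` helper; neither is part of the GALATEAS specification. The sketch enumerates the directly trained pairs and falls back on English as a pivot for the rest.

```python
from itertools import permutations

# Assumption: ISO 639-1 codes stand in for the seven project languages.
LANGS = ["it", "fr", "en", "de", "nl", "ar", "pl"]
CORE = ["it", "fr", "de", "en"]   # all directed pairs trained directly
PIVOT_ONLY = ["nl", "ar", "pl"]   # trained only to and from English
PIVOT = "en"


def direct_pairs():
    """Return the set of directed pairs covered by a dedicated MT system."""
    pairs = set(permutations(CORE, 2))
    for lang in PIVOT_ONLY:
        pairs.add((lang, PIVOT))
        pairs.add((PIVOT, lang))
    return pairs


def route(src, tgt, direct):
    """Hypothetical helper: list the translation hops from src to tgt."""
    if (src, tgt) in direct:
        return [(src, tgt)]
    if (src, PIVOT) in direct and (PIVOT, tgt) in direct:
        return [(src, PIVOT), (PIVOT, tgt)]  # transitive translation via English
    return None


direct = direct_pairs()
all_pairs = list(permutations(LANGS, 2))
pivoted = [p for p in all_pairs if p not in direct]

print(len(all_pairs), len(direct), len(pivoted))
print(route("nl", "pl", direct))
```

Under these assumptions, 18 of the 42 directed pairs are served directly (12 among the four core languages, plus 6 to and from English), and the remaining 24 go through the pivot.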
In summary, from a technological point of view, the main steps through which the objectives above are to be reached are the following:
Taking advantage of the involvement of several partners in different digital library (DL) initiatives, start gathering constantly increasing amounts of search episodes.
Use session and user-id information together with semantic query analysis in order to determine information paths within search episodes. These information paths will be delivered to information providers, which will be able to tailor both resource acquisition and search strategies on the basis of observed patterns (LangLog service).
Improve and apply the technologies delivered by the CACAO project (three GALATEAS partners were also involved in CACAO) in order to identify queries that can be considered translations of each other. This step will produce a continuously evolving parallel corpus.
Provide tools to integrate data extracted from the parallel corpus into resources for cross-language information retrieval. In particular, we plan to improve the quality of the bilingual translation dictionaries available to at least three partners by adding query-derived translation equivalents.
Use the corpus to train a hybrid statistical MT system. This phase will deliver an MT system specially designed to translate queries: it would be the first one on the market (QueryTrans service).
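As an illustration of the second step above (grouping queries into search episodes from session and user-id information), here is a minimal sketch. The record layout `(user_id, session_id, timestamp, query)` and the 30-minute inactivity threshold are assumptions made for the example, not details taken from the project; the semantic query analysis that LangLog would add on top is not modelled.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a gap longer than this starts a new search episode.
MAX_GAP = timedelta(minutes=30)


def search_episodes(records, max_gap=MAX_GAP):
    """Group (user_id, session_id, timestamp, query) records into episodes.

    Returns a list of ((user_id, session_id), [queries in time order]).
    """
    by_session = defaultdict(list)
    for user_id, session_id, ts, query in records:
        by_session[(user_id, session_id)].append((ts, query))

    episodes = []
    for key in sorted(by_session):
        events = sorted(by_session[key])          # order by timestamp
        current = [events[0][1]]
        for (prev_ts, _), (ts, query) in zip(events, events[1:]):
            if ts - prev_ts > max_gap:            # long pause: close the episode
                episodes.append((key, current))
                current = []
            current.append(query)
        episodes.append((key, current))
    return episodes


# Invented sample log, for illustration only.
log = [
    ("u1", "s1", datetime(2010, 1, 1, 10, 0), "sea maps"),
    ("u1", "s1", datetime(2010, 1, 1, 10, 2), "ocean charts"),
    ("u1", "s1", datetime(2010, 1, 1, 12, 0), "bank history"),
    ("u2", "s9", datetime(2010, 1, 1, 10, 1), "dante"),
]
episodes = search_episodes(log)
for key, queries in episodes:
    print(key, queries)
```

Each resulting query sequence is the raw material from which sequential patterns could then be mined.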
Customers of GALATEAS will be organizations running content-delivery web sites powered by a search engine (Digital Libraries, Content Aggregators, Merchant Sites). We believe that for these organizations, understanding what their users are looking for and how to better match their expectations is a matter of survival. For many of them, achieving cross-lingual access to content is also a crucial competitive factor.
GALATEAS, with its web-service-based query translation technology, aims to provide an easy cross-language information retrieval solution to all organizations belonging to one or more of these categories: the “resident” search engine simply delivers the original query to the QueryTrans service, specifying at the same time the languages into which it should be translated. Each translation returned by the service is then processed as if it were a query issued by a user speaking that particular language.
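The interaction described above can be sketched as follows. The source does not specify the QueryTrans interface, so the sketch uses a local stand-in (`query_trans_stub`, a toy lookup table) and a hypothetical `search` callback representing the resident monolingual search engine; all names and data are assumptions for illustration.

```python
def query_trans_stub(query, source_lang, target_langs):
    """Stand-in for the QueryTrans service: returns {target_lang: translation}.

    Assumption: a toy lookup table replaces the real MT systems.
    """
    table = {("sea", "en"): {"it": "mare", "fr": "mer", "de": "Meer"}}
    known = table.get((query, source_lang), {})
    return {lang: known[lang] for lang in target_langs if lang in known}


def cross_language_search(query, source_lang, target_langs, search,
                          translate=query_trans_stub):
    """Run the original query, then each translation as an ordinary query."""
    results = {source_lang: search(query, source_lang)}
    translations = translate(query, source_lang, target_langs)
    for lang, translated in translations.items():
        # The resident engine treats the translation like any user query.
        results[lang] = search(translated, lang)
    return results


# Hypothetical monolingual index: (language, term) -> document ids.
INDEX = {
    ("en", "sea"): ["doc-en-1"],
    ("it", "mare"): ["doc-it-7"],
    ("fr", "mer"): ["doc-fr-3", "doc-fr-4"],
}


def toy_search(query, lang):
    return INDEX.get((lang, query), [])


hits = cross_language_search("sea", "en", ["it", "fr"], toy_search)
print(hits)
```

The point of this design is that the resident search engine itself stays unchanged: it only gains a translation step in front of its usual query processing.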