Description of the issue and proposed service/solution
With the growth of digital libraries and digital library federations (as well as partially unstructured collections of documents such as web sites), many vendors now offer engines that retrieve content and metadata in response to search requests (queries) typed by the end user. In most cases these queries are just unstructured fragments of text in a specific language.
The first service offered by GALATEAS (LangLog) focuses on extracting meaning from these lists of queries and is addressed to library, federation and site managers. Unlike mainstream services in this field, GALATEAS services will not consider the standard structured information in web logs (e.g. click rate, visited pages, users’ paths inside the document tree) but the information contained in the queries themselves, from the point of view of language interpretation. By subscribing to the LangLog service, federation administrators and managers of content-providing web sites will be able to answer questions such as: “Which topics are most commonly searched for in my collection in a given language?”; “How do these topics relate to my catalogue?”; “Which named entities (people, places) are most popular among my users?”. LangLog will be available in at least 7 languages, namely Italian, French, English, German, Dutch, Modern Arabic and Polish.
The second problem addressed by GALATEAS is Cross Language Information Retrieval (CLIR), i.e. the capability of typing a query in one language and retrieving documents that are available in other languages. The CACAO consortium (an EU project funded under eContentPlus) is already successfully providing services for indexing and searching over digital libraries and metadata repositories. During commercial exploration for marketing CACAO, it emerged that certain institutions prefer to keep indexing and searching on their own premises (using their own preferred search engine) and would be perfectly satisfied with a plain query translation service.
The second service offered by GALATEAS (QueryTrans) has the ambitious and innovative goal of providing the first translation service specifically tailored to query translation.
The two services are tightly connected: it is only by virtue of a successful launch of LangLog that the consortium will gather enough multilingual queries to train the Statistical Machine Translation system adopted by QueryTrans.
Target users and their needs
Indirect “users” of the service are information seekers, who would benefit from improved, possibly cross-lingual, search services. However, the GALATEAS services are provided not directly to end users, but to administrators and managers of federations and search engine installations. GALATEAS thus targets a high-end B2B market in which customers will mostly be organizations running medium-sized and large federations of content. Principally, it will answer the following needs:
Need for a federation manager to understand what users are looking for, irrespective of the contents they actually access.
Need for a content provider to understand the directions in which the collection should be extended.
Need for a library administrator to understand which categories in the catalogue match the expectations of end users more closely and which less closely.
Need for a library manager to understand the behaviour of the library's users.
Need to provide cross language information retrieval in a seamless way, without changing anything in the way in which documents are indexed and managed.
Usage
LangLog: In a typical context the customer will sign a contract for the provision of a log analysis service with one of the commercially active GALATEAS partners. At the agreed frequency LangLog will harvest log files from the federation and start processing them. After the processing time (which depends almost linearly on the number of queries) it will return either a report describing all the information needs that have been detected, or extended log files to be integrated into the federation’s preferred log reporting system.
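As a rough illustration of the "extended log files" output, the sketch below adds a topic field to each query record and counts topics per language. The record layout, the field names (`query`, `lang`, `topic`) and the keyword-based `guess_topic` helper are assumptions made for this example only; the actual LangLog processing relies on NLP components, thesauri and classification schemas that are not reproduced here.

```python
from collections import Counter, defaultdict

# Hypothetical keyword-to-topic table standing in for real
# NLP and classification components.
TOPIC_KEYWORDS = {
    "history": {"war", "empire", "guerra"},
    "music": {"opera", "symphony", "sinfonia"},
}


def guess_topic(query):
    """Return the first topic whose keywords overlap the query tokens."""
    tokens = set(query.lower().split())
    for topic, keywords in TOPIC_KEYWORDS.items():
        if tokens & keywords:
            return topic
    return "unknown"


def extend_log(records):
    """Return copies of the log records with an added 'topic' field."""
    return [dict(r, topic=guess_topic(r["query"])) for r in records]


def topics_by_language(extended):
    """Count how often each topic occurs, grouped by query language."""
    counts = defaultdict(Counter)
    for record in extended:
        counts[record["lang"]][record["topic"]] += 1
    return counts


# Toy log; the language of each query is assumed to be already known.
log = [
    {"query": "roman empire maps", "lang": "en"},
    {"query": "verdi opera", "lang": "it"},
    {"query": "seconda guerra mondiale", "lang": "it"},
]

extended = extend_log(log)
report = topics_by_language(extended)
print(report["it"].most_common())
```

The extended records keep every original field, so under these assumptions they could be fed back into an existing log reporting system, while the per-language counts correspond to the report option.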
QueryTrans: The customer will negotiate with one of the commercially active GALATEAS partners concerning the level of the service. Example parameters for such a negotiation are:
Number of covered languages;
Domain of the translation (whether generic or in a specific domain);
Possible training of a specific query translation system;
…
After the negotiation phase (which should be kept as light as possible), the customer will benefit from a service that, for any query in a given language, returns its translations into n other languages (the targeted languages are Italian, French, English, German, Dutch, Modern Arabic and Polish). It is up to the content provider to intercept the user's query and to determine whether the user needs cross-lingual results. It is also the responsibility of the content provider to match the translations of the query against its own indexes and/or databases.
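The division of responsibilities described above can be sketched as follows: the content provider intercepts the query, obtains translations, and matches each translation against its own index. Everything here is hypothetical: `translate_query` is a local stand-in for the remote service (whose actual interface is not specified in this document), and `INDEX` and `search` represent the provider's own search engine.

```python
# Hypothetical stand-in for the remote translation service; a real
# deployment would call the negotiated web service instead.
TOY_TRANSLATIONS = {
    ("en", "it"): {"history of music": "storia della musica"},
    ("en", "fr"): {"history of music": "histoire de la musique"},
}


def translate_query(query, source, targets):
    """Return {language: translation} for the targets the toy table covers."""
    result = {}
    for target in targets:
        table = TOY_TRANSLATIONS.get((source, target), {})
        if query in table:
            result[target] = table[query]
    return result


# The content provider's own index: language -> {document id: title}.
INDEX = {
    "en": {1: "A short history of music"},
    "it": {2: "Storia della musica italiana"},
    "fr": {3: "Histoire de la peinture"},
}


def search(query, lang):
    """Naive matching: a title matches if it contains every query token."""
    tokens = query.lower().split()
    return [
        doc_id
        for doc_id, title in INDEX.get(lang, {}).items()
        if all(token in title.lower().split() for token in tokens)
    ]


def cross_language_search(query, source, targets):
    """Search in the source language, then once per translated query."""
    hits = {source: search(query, source)}
    for lang, translated in translate_query(query, source, targets).items():
        hits[lang] = search(translated, lang)
    return hits


results = cross_language_search("history of music", "en", ["it", "fr"])
print(results)
```

Note that indexing and document management are untouched: only the query is translated, which is the point of offering translation as a separate service.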
Technology
Both LangLog and QueryTrans will be based on standard web service technology, adopting, at least in a first phase, Axis2 as an application server. In this phase we will also adopt a hosting strategy in which a single processing unit serves 6 customers on average. The main technological value, however, lies not in the way the services are provided but in the underlying components, which represent a commercial take-up of the most innovative software available for machine translation, corpus alignment, automatic classification and clustering.
Content
The LangLog service will need access to specific software and resources such as NLP components, thesauri, classification schemas etc. These are all available to partners of the consortium. The training of the machine translation system to be used by QueryTrans will initially require access to massive quantities of log data, possibly multilingual. As detailed elsewhere, we will acquire this data in exchange for the royalty-free provision of the LangLog service for a certain period. Moreover, already at time t0 the consortium can rely on a substantial amount of query logs derived from Excite, Yahoo!, Europeana, TEL, and all libraries federated in the course of the CACAO project.
Sustainability
Commercial partners of the project will exploit the GALATEAS technology, jointly or separately, under a fee-based model for services (the technology underlying the services will be available either under open source licenses or under proprietary licenses). GALATEAS exploitation will therefore be supported by on-line content providers, which in 2010 are expected to generate 8.3 billion in revenue in Europe alone (not counting the 170 billion Euros generated yearly by eCommerce sites, which generally rely on a resident search engine). Under the provisional business plan described in the relevant section, it is expected that already in the first year after project termination GALATEAS technology will produce a yearly revenue stream of 2 million Euros, with an NPV (Net Present Value) of about 3 million Euros (considering project costs) three years after project termination.
Ownership
The ownership of technology produced under GALATEAS is based on a mixed model of open source and proprietary software development and fee-based access to proprietary services. In short, all software developed in the project using programs originally available under an open source license will continue to be licensed as open source. This open source software relies, however, on proprietary software and web services (mostly NLP and dictionary-based translation services), which will remain proprietary and will be accessed under standard commercial conditions once commercial exploitation starts. By decoupling GALATEAS from its ancillary software and services, we will open service provision to competition: it is not unreasonable to think that in the future a partner company providing e.g. NLP for a certain language will be replaced by another company that provides better software or services at lower fees.
Concerning log file ownership, title will remain with the content provider. However, the content provider must authorize GALATEAS to produce (and use) derivative works on the basis of such logs: most crucially, query logs (once queries have been aligned) will constitute the training set for the query-centred machine translation system.