B2 Impact
B.2.1 Target outcome and expected impact
Expected GALATEAS impact on specific objectives is supported by the following facts:
-
The proposed solution is engineered to significantly reduce the cost of ownership of integrating cross-language retrieval solutions: we expect that, with the adoption of SOAP-based web services, the personnel cost of integrating query translation into the information provider's search engine will not exceed five working days (by a non-specialized software engineer). The effort needed to benefit from the intelligent log analysis service is even smaller, as GALATEAS automatically harvests log files from the information provider.
-
The proposed solution is relatively low-cost compared with full-fledged CLIR solutions.
-
The proposed solution maximizes retrieval quality compared with standard translation services, which are known to perform badly on syntax-poor texts such as queries.
If we combine all these enabling factors with the huge market dimension (digital libraries, content federators, merchant sites) and with some project-specific aspects (e.g. openness of software, distributed providers network), we expect that GALATEAS will contribute substantially to the Call objectives, i.e. Multilingual Web Services and Machine Translation.
GALATEAS is mostly a commercially driven project, based on a rather linear market penetration strategy:
-
Initially provide free log analysis services in order to gather the necessary quantity of query logs;
-
On the basis of aligned queries, provide fee-based query translation logs;
-
Convert free log analysis services into fee-based services.
In order to maximize exploitation opportunities and to simplify commercial agreements, the following action lines will be implemented:
-
Software produced by the project that is subject to an open source license will be made available under an open source license: besides enabling community take-up, this will maximize the commercial capabilities of the partners involved. It will also make it possible for competitors to emerge outside the consortium: this could have a negative impact on GALATEAS commercial partners, but a positive impact overall.
-
As a consequence of the openness of some of the software (and the possibility of creating competitors), commercial partners are naturally encouraged to reach a consortium-wide agreement based on geographical areas, language coverage, marketing investments, maintenance strategy, etc.
-
Proprietary components are wrapped behind web services: this will allow easy substitution of language technology providers, should the technical and economic conditions become unsatisfactory for GALATEAS sellers.
Commercial activities will be supported by focused dissemination actions. We already mentioned open source software licensing, which may apply to some of the GALATEAS technology; it will be supported by a set of open source dissemination actions. The rest of the dissemination activities are oriented towards both scientific communities and potential customers, with a set of initiatives targeted at web-based advertisement and networking (two web sites, massive search engine optimisation, etc.). Moreover, an important outcome of the project, namely the user study on search episodes, will play a crucial role in cross-area dissemination, especially because of its “large audience” orientation.
Contribution to Programme objectives
GALATEAS addresses theme 5 of the call, namely “Multilingual Web”. The main objective of this theme is overcoming language barriers across Europe. As will be shown, all GALATEAS efforts concentrate in that direction, with the goal of providing the first query-oriented machine translation system. Such a technology will allow any user speaking one of the GALATEAS languages to type a query in his/her mother tongue and retrieve documents/metadata in several languages.
The other crucial contribution to Theme 5 is multilingual web content management. By providing support for query log analysis in at least seven languages, GALATEAS will enable content providers to understand what their users are looking for (as opposed to what they found, which can be computed simply by counting user clicks) and in which language. For content providers, this means the possibility of a much more targeted acquisition of new content; for users, it implies access to more pertinent content.
The project also addresses issues concerning theme 2 of the call, namely Digital Libraries: indeed, digital libraries play a central role as users of the cross-language technology. In this respect, in order to strengthen the link with the DL world, at least two partners in the consortium maintain digital collections. The site manager of UBER, Viviane Petras, is also WP manager of the multilinguality WP of EuropeanaConnect, and three partners are involved in both EuropeanaConnect and Europeana 1.0. Finally, one month of subcontracting has been foreseen to pay for consultancy from Dr Ulrich Kampffmeyer, the founder and president of PROJECT CONSULT Unternehmensberatung Dr. Ulrich Kampffmeyer GmbH, a leading and impartial management consulting company specialised in all topics of document-related technologies (such as records management, enterprise content management, compliance, digital preservation, document lifecycle management) in the German-speaking countries of Europe. Dr Ulrich Kampffmeyer is currently a consultant for Europeana 1.0 as an experienced specialist.
In the following, we detail the GALATEAS responses to the issues raised in the “ICT PSP WORK PROGRAMME 2009”, to which page numbers and headings refer.
Section | Quote | GALATEAS answer
Theme 5, p. 27 | There is therefore an urgent need for actions aimed at easing and speeding up the deployment of a wider range of Web based machine translation systems of adequate quality and coverage, together with the associated technical infrastructure and resources (e.g. lexica and corpora, Web standards and conventions; flexible and interoperable tools), which would make multilingual Web content and services truly accessible across languages. | This is the core of GALATEAS: the query translation service will enable easy access to content in at least 7 different languages.
Theme 5.1, p. 27 | The aim is to make the best possible use and extend the coverage of existing methods and techniques so that Web content can be made available in a given language and yet be searched and understood by users speaking other languages. | GALATEAS builds on existing methods to build a query-specific machine translation system (the QueryTrans service).
Theme 5.1, p. 27 | machine translation and other multilingual solutions for information access and analysis; | The log analysis service (LangLog) has exactly the goal of providing solutions for multilingual intelligent analysis of query logs.
Theme 5.1, p. 27 | The above actions may include where appropriate collaborative platforms and infrastructures designed to ease and support the uptake of multilingual solutions, in particular resource sharing and user driven evaluation. | As explained elsewhere, all language resources are wrapped as web services with public specifications. Any language resource vendor will be able to integrate new languages or quality-improved repositories into the GALATEAS platform.
Theme 5.1, p. 27 | The emphasis is on open access to source code, systems and datasets, and on community-based collaboration and evaluation. | Software developed in the context of GALATEAS that is subject to open source licenses will be made available under open source licenses. For example, GALATEAS may make use of open source software programs such as MOSES, MONDRIAN, etc. GALATEAS will also run a query translation experiment in the context of CLEF.
Theme 5.1, p. 28 | Pilots should investigate and validate novel approaches and cost-effective methods that allow the usability, performance and language coverage of existing machine translation systems and applications to be significantly enhanced, in various online use situations. | Training a Machine Translation system on automatically aligned queries is cost-effective and allows retraining on different log files from libraries covering different topics.
Theme 5.1, p. 28 | Emphasis is placed on exploring the Web and other online information sources, harvesting resources that can be used to enhance the quality or extend the coverage of machine translation and multilingual systems in general, in a particular domain or for a given task. | User queries are typical online resources, and they will be gathered by mining web transaction log files.
Theme 5.1, p. 28 | Although the emphasis is on European languages, particularly those not adequately covered by machine translation and multilingual systems, the proposed actions may also address world languages where appropriate and justified by their economic importance. | GALATEAS addresses Italian, French, English, German, Dutch, Modern Arabic and Polish.
Theme 5.1, p. 28 | The emphasis is on collaborative schemes yielding publicly available methods, processes, tools and resources. The proposed business model and exploitation plans should reflect this approach. | Every deliverable of the project (with the exception of business-oriented deliverables) will be available to a public audience. Developed software subject to open source licenses will be licensed under an open source model and delivered via an open source development platform.
Theme 5.1, p. 28 | The consortia should include or address a wide range of stakeholders, such as public sector and industrial/commercial organisations, content and service providers, | The consortium includes 3 universities and 5 private companies. Among them are two NLP companies, two companies in the domain of web services, and a multimedia digital library. Two partners play the role of users, one from the public sector and one from the private sector.
Theme 5.1, p. 28 | The actions falling under this objective should where appropriate support and liaise with actions established under the Digital Libraries | Digital libraries are one of the two main user targets of the project. 7 out of 8 partners are involved in European projects on digital libraries. Three partners are involved in Europeana 1.0 and EuropeanaConnect.
Theme 5.1, p. 28 | New business opportunities will be stimulated as the language coverage of online content and services expands. | We present a business plan which shows how the solution can be expanded.
Theme 5.3, p. 29 | The aim is to define and validate innovative and effective methods, processes and workflows for multilingual Web content development and management | Multilingual log analysis is an important tool for web administrators and content managers to understand the behaviour of their users and to enrich their content in a user-driven way.
European Dimension
The objectives of GALATEAS can be attained only through a European-level effort. Several reasons justify this claim:
-
Both multilingual log analysis and query translation need access to language resources (in this specific case, for 7 languages) which cannot be found in a single member state.
-
The technology behind the “base” algorithms (TLike, classification and clustering) has been developed in different countries (notably France, Italy and The Netherlands).
-
The business impact of the proposed services (LangLog and QueryTrans) only makes sense if pursued at the European level, where multilingual issues are more central.
-
The user behaviour study needs to focus on several search profiles from different member states: an English information seeker and an Italian one might indeed have very different search behaviours.
-
As already mentioned, GALATEAS needs to collaborate with several European initiatives in the fields of Digital Libraries, Natural Language Processing and Machine Translation.
Risks
Even for a medium-sized project like GALATEAS, it is important to foresee all possible risks which might affect the achievement of the proposed results. We divide these risks into management risks, technical risks, and exploitation risks. These are detailed in the following sections.
Management Risks
-
Partner withdrawal.
-
Technical partner: this event is quite unlikely, as all technical partners have a strong commercial commitment. Should it happen, the service-based paradigm adopted by the project, together with a development and licensing model that may rely on programs available under an open source license, will make partner substitution quite easy.
-
Users: if a partner involved as a user or in a user-related task quits the project, it will be replaced by one from the same country. During project preparation, many candidates were excluded for budget reasons; they will be reintegrated.
-
Failures in communication: if failures in communication are observed, generating delays and misunderstandings, the communication strategies will be revised and new channels will be adopted. However, if the communication failure concerns a single partner, the site manager will be officially asked to remedy the situation as soon as possible.
-
Economic risks.
-
Delays: as all partners have a stable financial situation, no delay in payments (due, for instance, to late delivery of cost statements) should affect the project.
-
Single partner overspending: overspending by a single partner will be monitored by the project co-ordinator before each payment. Unbalanced personnel allocation (with respect to the project goals) will also be closely monitored on a quarterly basis.
Technical Risks
In general, the consortium builds on robust and consolidated technologies, so the risks of technical failure are minimal. Here we list those aspects of the system that involve more advanced and innovative technologies and that might present some risks:
-
Sparseness of bilingual queries: for certain languages, the set of aligned queries might not be large enough to train a machine translation system (while 50,000 aligned queries represent the minimum, we would rather aim at training on sets of 100,000 aligned queries). In this case, the delivery of the specific MT system must wait until a sufficiently large dataset exists. If not all pairs are covered at the end of the project, the QueryTrans service will rely on transitive translation using English as a pivot language.
-
Failures in TLike: the TLike algorithm has already been tested in the context of CACAO, and relevant results have already been published. In any case, the consortium will work on alternative algorithms should TLike perform less well than expected in the current application context.
-
Poor usability of clustering algorithm results: given the unsupervised nature of the clustering algorithm, results might be difficult for library and site managers to exploit. For instance, clusters might be too densely or too sparsely populated, or there might be too many or too few of them. In this case, technical partners will explore available cluster post-processing methodologies.
-
Poor quality of the MT output: poor output might result for language pairs with few translation equivalents. It should be noted, however, that the issue here is not the quality of the translation itself, but its capability of retrieving documents in a certain language. Therefore, stylistically or syntactically bad translations are acceptable as long as they are effective in retrieving hits. In case of unacceptable retrieval capability for certain languages, the consortium might fall back to dictionary-based translation, which is already provided as a utility GALATEAS web service.
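The fallback logic described in the risks above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the GALATEAS implementation: the function name choose_strategy, the language codes, and the use of the dictionary service as the last resort for uncovered pairs are hypothetical. Only the 50,000-query minimum, the English pivot, and the existence of a dictionary-based fallback come from the text.

```python
from typing import Dict, Tuple

# Minimum number of aligned queries needed to train an MT system (from the text).
MIN_ALIGNED_QUERIES = 50_000
# Pivot language for transitive translation (from the text).
PIVOT = "en"


def choose_strategy(
    src: str,
    tgt: str,
    aligned_counts: Dict[Tuple[str, str], int],
) -> str:
    """Pick a translation strategy for one language pair.

    aligned_counts maps (source, target) to the number of aligned queries
    available for that direction. Returns "direct", "pivot" or "dictionary".
    Hypothetical helper: the names and the final fallback are assumptions.
    """

    def trainable(a: str, b: str) -> bool:
        return aligned_counts.get((a, b), 0) >= MIN_ALIGNED_QUERIES

    # Enough aligned queries: use the MT system trained for this pair.
    if trainable(src, tgt):
        return "direct"
    # Otherwise translate source -> English -> target if both legs are covered.
    if PIVOT not in (src, tgt) and trainable(src, PIVOT) and trainable(PIVOT, tgt):
        return "pivot"
    # Last resort (assumption): dictionary-based translation.
    return "dictionary"


counts = {("it", "fr"): 100_000, ("it", "en"): 60_000, ("en", "pl"): 120_000}
strategies = {
    pair: choose_strategy(pair[0], pair[1], counts)
    for pair in [("it", "fr"), ("it", "pl"), ("pl", "it")]
}
```

With these made-up counts, Italian to French is handled directly, Italian to Polish goes through English, and Polish to Italian falls back to the dictionary service because neither leg has enough aligned queries.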
Exploitation Risks
The main exploitation risks concern failure to attract content providers and failure to complete business activities successfully.
-
There is no risk concerning the market dimension as this was computed on the basis of reliable figures. The risks connected to business activities are therefore the following ones:
-
Failure of stakeholders to find an acceptable commercial agreement: a set of bilateral licensing agreements will be delivered, and exploitation will be either outsourced to third parties (e.g. OPAC vendors) or carried out by individual partners in competition.
-
Failure to achieve the necessary market impact: if the second year of activities does not produce the planned uptake of the commercial service, a new “critical” version of the business plan will be delivered.
-
Emergence of competitors during the project lifecycle: the USP (Unique Selling Point) will have to be revised, as well as the marketing strategies. An addendum to the business plan will be produced.
Concerning the last point, we believe that the major risks might come from search engines such as Yahoo! and Google, which run research laboratories where both Machine Translation and log analysis are under study. However, for the time being, Google concentrates on MT for full documents and not for queries, whereas Yahoo! Lab, while producing high-level research on log analysis, is not likely to offer log analysis services the way Google Analytics does. In any case, the activity of the two labs will be closely monitored during project execution.