We can distinguish three basic building blocks of the proposed system, namely:
The log analysis subsystem: it implements the LangLog service by providing language-based log analysis;
The MT training subsystem: it performs machine translation training based on the received query logs;
The query translation subsystem: it implements the QueryTrans service and translates queries into several languages using the appropriately trained MT systems.
These subsystems will be detailed in sections 3.2.a.1.1 through 3.2.a.2. The last two sections describe a) the kind of external services the system needs to access; b) the non-functional features of the system.
The goal of this component is to periodically harvest log files from customers' premises and provide customers' personnel with language-based log analysis. The subsystem is composed of three modules (figure 2).
Log acquisition module;
Log analysis module;
Data delivery and discovery module.
Figure 2: Architecture of the LOG Analysis Subsystem
Log acquisition module
This module is in charge of harvesting log files from customer sites, converting them to the W3C extended common log format and storing them in the central log DB.
It will be triggered by a scheduling agent that fixes the temporal granularity of the analysis (for standard cases one month might be a reasonable interval).
Conversion from non-standard formats will be achieved by standard format-conversion techniques and presents no technical challenge.
Migration to the database is also a standard application of data import techniques. The structure of the database is likely to resemble the one described in Jansen (2006), with the addition of customer-oriented information (the provider, the collection, the click on a specific digital resource, etc.) and GALATEAS-delivered information (basically clustering, language, and topic association).
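As an illustration of the conversion step, the sketch below maps one line in Apache's common log format onto a small subset of W3C extended log fields. The class name, the chosen field subset and the input format are illustrative assumptions, not the project's actual converter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal sketch of log format conversion: an Apache "common" log line
 *  is mapped onto a few W3C extended log fields (c-ip, date-time,
 *  cs-method, cs-uri, sc-status). Field selection is an assumption. */
public class LogConverter {

    // e.g. 127.0.0.1 - - [10/Oct/2010:13:55:36 +0200] "GET /search?q=dante HTTP/1.0" 200 2326
    private static final Pattern COMMON = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) \\S+");

    /** Returns one space-separated W3C extended log record for the given line. */
    public static String toW3c(String commonLogLine) {
        Matcher m = COMMON.matcher(commonLogLine);
        if (!m.find()) {
            throw new IllegalArgumentException("unrecognized log line: " + commonLogLine);
        }
        return String.join(" ", m.group(1), m.group(2), m.group(3), m.group(4), m.group(5));
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [10/Oct/2010:13:55:36 +0200] "
                    + "\"GET /search?q=dante HTTP/1.0\" 200 2326";
        System.out.println(toW3c(line));
    }
}
```

A production converter would of course also carry session identifiers and customer-specific fields into the schema stored in the log DB.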
Log analysis module
This module has the goal of producing (and storing in the log DB) language-related information concerning the queries issued to a certain content provider over a certain period. It is composed of two basic modules, namely Topic-Query matching and Clustering.
The Topic-Query matching component has the responsibility of associating queries with the classification structure used at the customer's site (e.g. LCSH, Dewey, etc.). It can deliver this service according to two modalities:
Click-through classification: it is based on the fact that, after querying the search engine, the end user might click on some hit, and hits are classified in the catalogue according to the adopted classification system. The query/category-of-the-clicked-hit association is used by this component to train a machine-learning-based classifier, which will learn how to classify “unseen” queries (details in 1.3.b.1.1).
Language-based classification: it assumes that we have no query-category association base for training a learning-based classifier. It therefore exploits “linguistic” matching methods such as synonym matching, semantic vector distance, etc. (details in 1.3.b.1.2).
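The click-through modality can be sketched as follows: (query, clicked-category) pairs harvested from the logs train a simple unigram scorer that then labels unseen queries. The toy training pairs, whitespace tokenization and add-one smoothing are illustrative assumptions, not the project's actual learner.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy Naive-Bayes-style click-through classifier over query unigrams. */
public class ClickThroughClassifier {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // category -> token -> count
    private final Map<String, Integer> categoryTotals = new HashMap<>();      // category -> token total

    /** Records one query/category-of-the-clicked-hit association. */
    public void train(String query, String clickedCategory) {
        Map<String, Integer> tok = counts.computeIfAbsent(clickedCategory, k -> new HashMap<>());
        for (String t : query.toLowerCase().split("\\s+")) {
            tok.merge(t, 1, Integer::sum);
            categoryTotals.merge(clickedCategory, 1, Integer::sum);
        }
    }

    /** Returns the category whose token model gives the unseen query the best smoothed score. */
    public String classify(String query) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : counts.keySet()) {
            double score = 0.0;
            int total = categoryTotals.get(cat);
            for (String t : query.toLowerCase().split("\\s+")) {
                int c = counts.get(cat).getOrDefault(t, 0);
                score += Math.log((c + 1.0) / (total + 1.0)); // add-one smoothing
            }
            if (score > bestScore) { bestScore = score; best = cat; }
        }
        return best;
    }

    public static void main(String[] args) {
        ClickThroughClassifier clf = new ClickThroughClassifier();
        clf.train("dante inferno", "Literature");
        clf.train("divine comedy", "Literature");
        clf.train("anatomy of the heart", "Medicine");
        System.out.println(clf.classify("dante divine comedy"));
    }
}
```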
The Clustering component has the goal of grouping end-user queries into meaningful sets to be presented to the library administrator/manager. Contrary to classification, clustering does not assume the presence of a classification structure but tries to group documents (queries, in our case) together according to different parameters. There are a number of parameters that can influence a clustering algorithm (see section 1.3.b.2), and these will be controlled by the library administrator through an administration console.
In the project we will evaluate two different approaches, namely dynamic clustering (the parameters are dynamically changed by the administrator and the results of the changes are immediately available) and static clustering (the parameters are changed by the administrator but the results become available only in the next report, or in any case after a certain time, due to processing bottlenecks). The choice between the two will depend heavily on the quality/processing-time ratio.
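As a minimal sketch of the static-clustering case, the following greedily groups queries whose token-set Jaccard similarity exceeds a threshold; the threshold stands in for one of the parameters exposed on the administration console, and both the algorithm and the value 0.5 are toy assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy greedy query clusterer based on token-set Jaccard similarity. */
public class QueryClusterer {

    private static Set<String> tokens(String q) {
        return new HashSet<>(Arrays.asList(q.toLowerCase().split("\\s+")));
    }

    private static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    /** Single pass: each query joins the first cluster whose seed query is similar enough. */
    public static List<List<String>> cluster(List<String> queries, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String q : queries) {
            List<String> home = null;
            for (List<String> c : clusters) {
                if (jaccard(tokens(q), tokens(c.get(0))) >= threshold) { home = c; break; }
            }
            if (home == null) { home = new ArrayList<>(); clusters.add(home); }
            home.add(q);
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList(
            "dante inferno", "inferno dante alighieri", "impressionist painting"), 0.5));
    }
}
```

Under the dynamic-clustering modality, the administrator would re-run this computation interactively with different thresholds; under the static one, results would be recomputed at report time.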
Both Topic-Query matching and Clustering will access several NLP and semantic resources to improve their results (lemmatization, chunking, named entity recognition, semantic information stores, etc.). These are accessed via the web service layer. As detailed elsewhere, this technology allows NLP vendors to be kept separate from vendors of language-based log analysis.
Data delivery and discovery module
The goal of this component is to present to the user (i.e. the library administrator/manager) the results of log processing. GALATEAS foresees four different ways of delivering knowledge to the user:
Web based data mining: the results of language-based query analysis, as well as structured data extracted from queries (e.g. query time, language of the browser, click-through), will be fed to two OLAP-based data mining systems (candidates are the commercial BusinessObjects, the open source Pentaho/Mondrian/JPivot (http://community.pentaho.com/) or the open source BIRT (http://www.eclipse.org/birt)). These will provide the user with the ability to discover facts about end-user queries in a dynamic way. It is likely that the integration of extracted data and data-mining environments will take place via the log DB.
Reporting: not every institution/company can allocate highly skilled personnel to use a data mining environment. In such cases the best solution is a report (with analytic details) summarizing the results of query analysis. Technically, reports are produced as static exports of business intelligence systems into predefined templates.
Integration with the customer’s log analysis software: under this modality the results of language-based log analysis will be integrated with the log analysis system already in use at the customer’s premises. The easiest way to perform such an integration is to send enriched logs back to the customer’s site. These enriched logs are “captured” by a thin client which feeds them to the log reporting system.
Integration with the customer’s log data mining software: some companies might be willing to cross language-based log analysis with corporate data. This is typically the case for merchant sites, where data concerning the web site need to be crossed with sales data and other financial information. The integration will be done via CSV files with predefined semantics. Again, a thin client at the customer’s site might take charge of the communication between the GALATEAS web service and the customer’s data mining software.
B.3.2.a.2 Machine Translation Training subsystem
This subsystem (figure 3) has the goal of selecting translation candidates from queries stored in the logs DB and training a query-tailored machine translation system. This subsystem is therefore administered only by GALATEAS personnel.
Its main components are the TLike module and the machine translation module. Both access a set of NLP and semantic web services in order to increase their accuracy.
Figure 3: Architecture of the Machine Translation Training subsystem
The TLike module
This module is an optimized implementation of the TLike algorithm described in section 1.3.b.3. In its simplest incarnation it retrieves sets of queries from the log DB and identifies queries that can be considered translation candidates. Once retrieved, these are stored in the aligned queries DB. The TLike algorithm uses a language identifier to handle queries whose language is not known. As an alternative, the language identifier can be called directly when the queries are stored in the log DB, so as to minimize the computational load of the algorithm. This simple behaviour is complicated by a set of parameters that will need to be appropriately tuned. Examples of such settings are:
Should the algorithm consider translation candidates only inside a single session, or should it extend the search to queries submitted in different sessions (thus, possibly, by different users)?
Should the algorithm consider candidates only collection-wise, or should it also try to detect possible candidates across collections? With what degree of confidence?
How should it treat the fact that potentially matching candidates are repeated in different periods and in different sessions?
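The first of these settings, restriction to a single session, can be sketched as follows: adjacent queries of the same session whose (known or identified) languages differ are emitted as translation candidates. The data layout is an invented simplification and the actual TLike scoring of candidates is out of scope here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of session-restricted candidate generation feeding TLike. */
public class CandidateGenerator {

    /** A query record as it might come out of the log DB (simplified). */
    public static final class LoggedQuery {
        final String session, lang, text;
        public LoggedQuery(String session, String lang, String text) {
            this.session = session; this.lang = lang; this.text = text;
        }
    }

    /** Pairs adjacent same-session queries whose languages differ. */
    public static List<String[]> candidates(List<LoggedQuery> log) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 1; i < log.size(); i++) {
            LoggedQuery a = log.get(i - 1), b = log.get(i);
            if (a.session.equals(b.session) && !a.lang.equals(b.lang)) {
                pairs.add(new String[] { a.text, b.text });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<LoggedQuery> log = Arrays.asList(
            new LoggedQuery("s1", "it", "divina commedia"),
            new LoggedQuery("s1", "en", "divine comedy"),
            new LoggedQuery("s2", "en", "divine comedy"));
        // Only the within-session pair is emitted.
        System.out.println(candidates(log).size());
    }
}
```

Relaxing the session constraint, or extending the scan across collections, simply widens the candidate set that TLike must then score, which is exactly the quality/computation trade-off the parameters above control.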
The Machine Translation Module
The Machine Translation (MT) module, in this case MOSES, will access the DB of aligned queries, which will be used as the training corpus (if a factored model is used, POS and morphological analysis data are needed along with the aligned queries). The MT module has to be optimally tuned according to the following parameters:
Different language models can be used. Currently the SRI, IRST and RandLM language model toolkits are available in MOSES. The IRST toolkit supports huge language models and memory mapping, which enables disk storage to be used in addition to memory. RandLM supports even bigger language models, but at a cost in decoding speed. Other techniques, such as parallelizing the processing or segmenting the data, can be used to overcome performance problems.
Different translation models can be used. This includes the selection of an appropriate tuning algorithm (e.g. minimum error rate training) but also depends on other features such as the lexicalized reordering model, the n-best strategy followed (number of n-best hypotheses), and the decoding path.
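As a hedged illustration of the language model choice, a classic moses.ini fragment might select the toolkit as follows. All paths and weights are placeholders, and the exact type codes should be checked against the Moses documentation (type 1 conventionally denotes IRSTLM in classic configurations):

```ini
; Illustrative moses.ini fragment -- paths, weights and codes are placeholders
[ttable-file]
0 0 0 5 /galateas/model/phrase-table.gz

[lmodel-file]
; <type> <factor> <order> <file>; type 1 = IRSTLM in classic Moses setups
1 0 3 /galateas/model/queries.irstlm

[weight-l]
0.5
```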
B.3.2.a.3 The Query Translation Subsystem
This subsystem represents the implementation of the QueryTrans service, whose goal is to receive a query in one language and to return its translation into the languages requested by the client (i.e. the customer’s search engine).
In terms of integration, it is up to the customer to “intercept” the user’s query in its search engine application and send it to the GALATEAS QueryTrans service. The service will respond by providing translated versions of the original query, which the customer’s client will use to consult the search engine, thus achieving the cross-linguality objective.
The main modules of the query translation web service are:
The language identification module, with the goal of automatically identifying the language of the query;
The MT selection module, with the goal of deciding which trained machine translation system to invoke on the basis of the input language, the output translations to be provided and the customer’s domain of activity.
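The customer-side integration step can be sketched as a thin client that forwards the intercepted query to QueryTrans. The endpoint URL and parameter names below are invented placeholders; the real interface will be the one described by the public WSDL.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch of a customer-side QueryTrans client (hypothetical REST-style view). */
public class QueryTransClient {

    // Placeholder endpoint; the real service location and WSDL-defined
    // interface are assumptions here.
    static final String ENDPOINT = "https://example.org/galateas/QueryTrans";

    /** Builds the request URL asking for the query in the listed target languages. */
    public static String buildRequestUrl(String query, String... targetLangs) {
        StringBuilder url = new StringBuilder(ENDPOINT)
            .append("?q=").append(URLEncoder.encode(query, StandardCharsets.UTF_8));
        for (String lang : targetLangs) {
            url.append("&target=").append(lang);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // The customer's client would issue this request and feed each returned
        // translation back into its own search engine.
        System.out.println(buildRequestUrl("divina commedia", "en", "fr"));
    }
}
```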
Figure 4: Architecture of the Query Translation Subsystem
The language identification module
Open source software for plain n-gram-based text categorization, such as TextCat (http://textcat.sourceforge.net/), will be used for language identification. For cases where the statistical models are not discriminative enough, the language identification module will resort to dictionary-based language identification, provided either remotely, via access to NLP web services, or locally, via plain word-language association lists.
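The n-gram approach can be illustrated in miniature: character trigram profiles are built from sample text per language and an input is assigned to the language whose profile it shares most trigrams with. TextCat's actual rank-order statistic is replaced here by plain set overlap, and the training snippets are toy assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy character-trigram language identifier (overlap-based, unlike TextCat's rank-order metric). */
public class LanguageIdentifier {
    private final Map<String, Set<String>> profiles = new HashMap<>();

    private static Set<String> trigrams(String text) {
        String s = " " + text.toLowerCase().replaceAll("\\s+", " ") + " ";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) grams.add(s.substring(i, i + 3));
        return grams;
    }

    /** Adds the sample's trigrams to the profile of the given language. */
    public void train(String lang, String sample) {
        profiles.computeIfAbsent(lang, k -> new HashSet<>()).addAll(trigrams(sample));
    }

    /** Returns the language whose profile shares most trigrams with the input. */
    public String identify(String text) {
        Set<String> grams = trigrams(text);
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            Set<String> inter = new HashSet<>(grams);
            inter.retainAll(e.getValue());
            if (inter.size() > bestOverlap) { bestOverlap = inter.size(); best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        LanguageIdentifier id = new LanguageIdentifier();
        id.train("en", "the quick brown fox jumps over the lazy dog");
        id.train("it", "nel mezzo del cammin di nostra vita");
        System.out.println(id.identify("the lazy dog"));
    }
}
```

On very short queries such overlaps are often ties, which is exactly why the module falls back to dictionary-based identification.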
The MT selection module
As described, there will be several machine translation modules. The first distinction is represented by language pairs. On top of this, the consortium will explore the possibility of having dedicated domain dependent machine translation systems. Certain big information providers might indeed have enough multilingual queries in specific domains (e.g. medicine, arts, law) to train domain specific machine translation systems in order to maximize the quality of the response. Should this scenario occur, the MT selection module will have the task of selecting the most appropriate model, depending on the nature of the information provider asking for query translation.
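The selection policy just described can be sketched as a registry keyed by (source language, target language, domain), falling back to the generic engine of a language pair when no domain-tuned system exists. The engine names and the "general" fallback key are illustrative placeholders.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of MT engine selection by language pair and customer domain. */
public class MtSelector {
    private final Map<String, String> engines = new HashMap<>();

    private static String key(String src, String tgt, String domain) {
        return src + ">" + tgt + "@" + domain;
    }

    public void register(String src, String tgt, String domain, String engine) {
        engines.put(key(src, tgt, domain), engine);
    }

    /** Prefers a domain-tuned engine, otherwise the general one for the pair. */
    public String select(String src, String tgt, String domain) {
        String e = engines.get(key(src, tgt, domain));
        return e != null ? e : engines.get(key(src, tgt, "general"));
    }

    public static void main(String[] args) {
        MtSelector sel = new MtSelector();
        sel.register("it", "en", "general", "moses-it-en");
        sel.register("it", "en", "medicine", "moses-it-en-med");
        System.out.println(sel.select("it", "en", "medicine"));
        System.out.println(sel.select("it", "en", "arts"));
    }
}
```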
B.3.2.a.4 GALATEAS web services
Web services are used by several subsystems of GALATEAS. The reason for such heavy recourse to web services is that we want to keep the layer of software developed within GALATEAS (which may be made available as open source) separate from components which are already commercially available (which may be proprietary to a partner or a third-party service provider).
Since some of the partners have already gained experience in wrapping NLP services in the context of the CACAO project, GALATEAS web services will be built on the backbone of the CACAO web services. The only difference will be a higher degree of standardization, due to the fact that the specifications of the GALATEAS web services (WSDL) need to be public in order to allow the integration of third parties (UIMA is likely to be chosen as the annotation standard, so type systems will be public as well).
For every involved language the following web services will be deployed:
It should be noted that only the services in bold are necessary for the integration of a language pair. All other services provide increased accuracy for log analysis and query translation.
It should also be noted that component-based re-use of CACAO modules holds for all services except the Named Entity Extraction service. Although such a service is already present in the CACAO architecture, GALATEAS would like to benefit from an improved version of named entity recognition. This component will be delivered by UNITN and is described in section 1.3.b.5.
Table showing the language resources on which the web services will be based:
Concerning responsibilities of partners according to the language, this is the preliminary distribution:
–XEROX: French, English, Italian, German, Dutch;
–CELI: French, Italian, Arabic, Polish;
–UNITN: language independent NE extraction and language identification
Although the adopted technologies have already undergone a number of academic evaluations (in particular in the fields of machine translation, CLIR and automatic classification), the outcome of the parameterization and optimization within GALATEAS will be systematically monitored in a dedicated work package. Evaluation will focus on the following components:
These three components are indeed the ones deemed most crucial for an organic development of GALATEAS, in the sense that:
The topic computation algorithm is crucial for attracting customers willing to access accurate analysis of their logs based on their catalogues.
The TLike algorithm is crucial in determining the quality of the output of the MT system.
The query-based MT system is one of the principal outcomes of the project and one of the pillars of its commercial exploitation.
For these reasons they will undergo both in-lab and real-data evaluation. The remaining algorithms will undergo only in-lab evaluation (to be performed in the relevant WPs), as long as gold standards are available.
The users of GALATEAS are mainly institutions: therefore we do not plan any end-user, on-the-field evaluation. The key players for evaluation are UBER, BAL and OD. OD will in particular monitor non-functional aspects of the system, such as scalability, performance and portability. It will also evaluate the system in the context of a third-party library (to be identified).
Evaluation of “Topic Computation”
The goal of the topic computation algorithm is to associate queries with the specific categories in which the contents of the information provider are organized. This is a standard classification problem, and in-lab evaluation will be performed against community gold standards such as the Reuters corpus for English. Real-data evaluation will be performed by manually assigning categories to a set of 1000 queries (in two languages) and then verifying how the algorithm performs. This real-data evaluation is particularly important, as there is no evidence that an algorithmic tuning optimal for news classification can perform comparably on short text queries.
The core of the evaluation will be represented by queries obtained from BAL’s query logs, which will be manually assigned to one or more categories in the Bridgeman classification system. In the case of the machine-learning-based classification system, the gold standard will be split into 2/3 for training and 1/3 for testing. Manual assignment will be performed by BAL personnel and the test set will always remain “unknown” to the technology partners. BAL will simply be provided with standard evaluation software that allows the production of reports on new assignments by the classification system.
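The evaluation protocol above reduces to two small operations, sketched here: an in-order 2/3 / 1/3 split of the labelled queries, and an accuracy report on the held-out third. The split and metric follow the text; the in-order splitting policy and everything else are toy assumptions (real evaluations would typically shuffle and stratify).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the 2/3 train / 1/3 test evaluation protocol. */
public class ClassificationEvaluation {

    /** Splits labelled data into a 2/3 training part and a 1/3 test part, in list order. */
    public static <T> List<List<T>> split(List<T> labelled) {
        int cut = labelled.size() * 2 / 3;
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(labelled.subList(0, cut)));
        parts.add(new ArrayList<>(labelled.subList(cut, labelled.size())));
        return parts;
    }

    /** Accuracy of predicted categories against the gold test labels. */
    public static double accuracy(List<String> gold, List<String> predicted) {
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) correct++;
        }
        return gold.isEmpty() ? 0.0 : (double) correct / gold.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(Arrays.asList(1, 2, 3, 4, 5, 6));
        System.out.println(parts.get(0).size() + "/" + parts.get(1).size());
        System.out.println(accuracy(Arrays.asList("a", "b", "c", "d"),
                                    Arrays.asList("a", "b", "x", "d")));
    }
}
```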
Evaluation of the language-based classification system will proceed in an analogous way, the only difference being that the learning procedure will be replaced by fine-tuning of the CACAO-derived Word2Category algorithm.
Evaluation of TLike
In-lab evaluation of the TLike algorithm will be performed using CLEF topics, which form a corpus of about 1,000 pairs available in several languages. For real-data evaluation the gold standard will be built manually, composed of 250 matching and 250 non-matching query pairs for at least three language pairs. These queries will be obtained from EUROPEANA logs and aligned by the partner UBER. As the TLike algorithm is not based on learning, the whole set of aligned queries will remain unknown to the technology partners and periodic evaluations will be run directly at UBER. It will be up to the technology partners (in particular CELI) to create a gold standard for performing the most appropriate parameterization. This gold standard will be derived either from the Yahoo! logs available at UVA or from the publicly available Excite logs.
Evaluation of QueryTrans
The query translation algorithm is ultimately used for cross-language information retrieval applications. Therefore, in-lab evaluation will be performed against the CLEF infrastructure. Real-data evaluation will be performed by building a gold standard formally identical to the CLEF format, where topics are replaced by real user queries. The TEL@CLEF corpus will be used as the evaluation base. The gold standard will be jointly produced by UBER and BAL. The optimality of query translation will be measured indirectly: rather than verifying whether the MT-derived translation of a query matches the manually coded translation, we will verify whether (all search parameters being identical) it returns the same set of documents. For this purpose, the gold standard already defined for the evaluation of TLike will be completed with a further set of 250 queries for at least three languages. The final gold standard will be donated to the CLEF campaign for inclusion in future evaluation tracks.
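The indirect measure just described can be sketched as a comparison of result sets: the documents retrieved for the reference translation against those retrieved for the MT translation, with full overlap counting as a perfect translation. Scoring by the fraction of reference hits recovered is an illustrative choice, not the project's defined metric.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch of indirect MT evaluation via retrieved-document overlap. */
public class IndirectMtEvaluation {

    /** Fraction of the reference result set also retrieved by the MT-translated query. */
    public static double overlap(Set<String> referenceHits, Set<String> mtHits) {
        if (referenceHits.isEmpty()) return mtHits.isEmpty() ? 1.0 : 0.0;
        Set<String> inter = new HashSet<>(referenceHits);
        inter.retainAll(mtHits);
        return (double) inter.size() / referenceHits.size();
    }

    public static void main(String[] args) {
        Set<String> ref = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> mt = new HashSet<>(Arrays.asList("d1", "d2", "d3"));
        System.out.println(overlap(ref, mt));
    }
}
```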
B.3.2.a.6 User behaviour study
As a result of the LangLog service, GALATEAS will have the possibility of analyzing millions of user queries (besides those already available to consortium members, such as the Europeana, Yahoo! and TEL logs). Moreover, it will have the necessary tools to analyze these queries in their linguistic and multilingual dimensions. The goal of this task is therefore to produce the first study of search episodes centred on user behaviour as it appears from search logs.
The study is particularly challenging in that it must take into account the dynamic (temporal) dimension of search operations, as well as the user characterization emerging from the session identifier. It will take as its methodological point of departure the seminal work described in Jansen (2006) and Jones et al. (2000), augmented with the log analysis tools provided by GALATEAS technologies. Hopefully it will provide clear and quantitatively documented answers to questions such as “How do the semantics of queries evolve inside a single search episode?”, “What are the multilingual variations inside a single search episode?”, “How can search episodes be clustered and categorized?”, “What kind of user profile emerges when data mining techniques are applied to search episodes?”, etc.
The study is part of the Sustainability work package for two reasons:
It will be mainly oriented towards decision makers and will have an essentially non-technical distribution: as such it is conceived as an awareness-raising initiative, focused on attracting interest towards the services offered by GALATEAS.
It is a scouting activity aimed at checking whether the provision of customized reports on search episodes in a particular digital collection could become a sustainable source of business.
The software will preferentially be developed and optimized in the Java programming language. Where applicable, a project will be opened on an open source development platform such as http://sourceforge.net/.
All NLP web services will be UIMA-compliant web services deployed on an Axis2 server.
The LangLog and QueryTrans web services will also initially be deployed on an Axis2 server, but the consortium will evaluate the advantages of deploying to non-open-source containers.
The GALATEAS software must access a database. The bridge to the database will be independent of the underlying database management system. The consortium will experiment with the integration of both an open source DB (MySQL) and a commercial one (Microsoft SQL Server).
Testing of single components will be realized using JUnit, and builds of the system will be produced using either Ant or Maven.