The project and its objectives
HyghTra (Hybrid High-Quality Translation System) is a collaborative FP7 Marie Curie
Industry-Academia Partnership and Pathways project between the Centre for Translation Studies of
the University of Leeds and Lingenio GmbH, a Language Engineering company based in Heidelberg,
Germany. The project ran in two parts: 2010-12 and 2012-14. The project website is:
Objectives: The project's principal goal is a technology for the fast development of high-quality Machine
Translation (MT) systems that translate texts between different languages, such
as German, French, Dutch, Spanish, Russian, Ukrainian, English. The project team has developed a
novel architecture for building MT systems, which is designed to overcome existing technological
limitations of current approaches to MT. For instance, our architecture allows the developers to
combine wider coverage of linguistic phenomena with higher accuracy of linguistic analysis, at the
same time achieving a faster development cycle and keeping the computational requirements of MT
systems low, which can potentially lead to smaller solutions that also run in off-line mode on
mobile devices. Overcoming these technological limitations allowed the industrial partner (Lingenio)
to create translation systems and dictionaries for new translation directions, as well as a new range of
innovative translation solutions and services for new markets and applications.
Traditionally, MT systems have been built within one of two major architectures: rule-based
machine translation (RBMT) and statistical machine translation (SMT). The RBMT systems
explicitly represent and process linguistic knowledge about languages (grammar, lexicon) and about
translation equivalents between source and target (translation dictionaries, corresponding
grammatical structures). These systems have higher accuracy of linguistic analysis (e.g., they can
successfully handle multiword linguistic constructions, re-arrangements of word order,
long-distance dependencies between words and the overall syntactic structure of sentences); they are also
smaller and use less computational power. However, they have a slower development cycle and
need manually built dictionaries, grammars and processing algorithms, which seriously limits the
number of supported languages. SMT systems, on the other hand, are built using large collections of
previously translated texts (parallel text corpora), which are automatically aligned on the sentence
and word levels and which are stored as large databases of phrases that are translations of each other;
translations for new texts are generated by intelligent search algorithms that recombine the segments
from the database into a faithful and natural translation. SMT systems have a faster development cycle,
are more accurate in resolving ambiguities, but have more problems in handling sentence-level
linguistic phenomena. In addition, they require more storage space and use more computational
power: typically they run as web services on powerful servers or computer clusters, which limits
their use for off-line mobile applications. More recently, researchers have attempted to combine RBMT and
SMT approaches (so-called Hybrid MT), usually adding some linguistic rules, features and sentence
structure representations on top of SMT systems.
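The phrase-database idea behind SMT can be illustrated with a minimal sketch. It assumes a tiny hand-made phrase table (all entries and probabilities are invented) and uses a greedy, left-to-right longest-match lookup; real SMT decoders typically use beam search with a language model and reordering, so this shows only the recombination of stored segments, not a full system.

```python
# Toy phrase table: source phrase -> list of (target phrase, probability).
# All entries and probabilities are invented for illustration.
PHRASE_TABLE = {
    ("das", "haus"): [("the house", 0.8), ("the home", 0.2)],
    ("das",): [("the", 0.7), ("that", 0.3)],
    ("haus",): [("house", 0.9), ("home", 0.1)],
    ("ist",): [("is", 1.0)],
    ("klein",): [("small", 0.7), ("little", 0.3)],
}


def translate(sentence, table=PHRASE_TABLE, max_len=3):
    """Greedy, monotone, longest-match lookup (no reordering, no language model)."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest source segment first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            segment = tuple(words[i:i + n])
            if segment in table:
                # Pick the most probable stored translation of this segment.
                best_target, _ = max(table[segment], key=lambda pair: pair[1])
                output.append(best_target)
                i += n
                break
        else:
            # Unknown word: copy it through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


print(translate("Das Haus ist klein"))
```

Because each segment is translated independently, a lookup of this kind has no view of sentence-level structure, which is the weakness of SMT noted above.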
The major scientific contribution of our HyghTra project is developing a new way of building Hybrid
MT systems, where statistical techniques are added on top of an existing wide-coverage RBMT:
statistical methods support rapid development of grammars and dictionaries for new translation
directions and perform run-time disambiguation, but the core system architecture remains rule-based,
preserving the accuracy of the linguistic analysis and smaller computational footprint of the system.
The project has developed a methodology for rapidly creating adequate linguistic resources for
rule-based MT systems on the basis of statistical analysis of linguistic data. Building the resources for
new languages (dictionaries, tools for linguistic annotation and analysis, transfer between different
languages and resolving linguistic ambiguities) has so far been the major obstacle for the
developers of RBMT systems. Our HyghTra project has filled this gap, which allowed Lingenio to
speed up the development cycle and enhance the quality of its rule-based MT systems with statistical
analysis and disambiguation techniques. From the commercial perspective these novel Hybrid MT
solutions resulted in an increased range of products and services currently offered by Lingenio.
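As an illustration of statistics layered on a rule-based core, the sketch below lets a hand-written dictionary propose the candidate translations and uses corpus co-occurrence counts only to choose among them. It is not Lingenio's implementation: the dictionary entry, the toy corpus, the function names and the assumption that the context words are already available in the target language are all hypothetical.

```python
from collections import Counter

# Hypothetical rule-based dictionary entry: German "Bank" has two English senses.
RULE_BASED_CANDIDATES = {"bank": ["bank", "bench"]}

# Toy target-language corpus standing in for a large text collection.
CORPUS = [
    "she sat on the bench in the park",
    "the bench in the garden was wet",
    "he opened an account at the bank",
    "the bank raised the interest on the account",
]


def cooccurrence_counts(corpus):
    """Count how often two distinct words appear in the same sentence."""
    counts = Counter()
    for sentence in corpus:
        tokens = set(sentence.split())
        for a in tokens:
            for b in tokens:
                if a != b:
                    counts[(a, b)] += 1
    return counts


COUNTS = cooccurrence_counts(CORPUS)


def choose_translation(source_word, context_words,
                       candidates=RULE_BASED_CANDIDATES, counts=COUNTS):
    """Pick the rule-based candidate that co-occurs most with the context."""
    options = candidates[source_word]

    def score(option):
        return sum(counts[(option, c)] for c in context_words)

    # On a tie, max() keeps the first candidate listed in the dictionary.
    return max(options, key=score)


print(choose_translation("bank", ["park"]))
print(choose_translation("bank", ["account", "interest"]))
```

The statistical component here never generates output itself; it only ranks options that the rule-based dictionary already licenses, which keeps the run-time model small.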
The project team has created a methodology and a set of computational tools and
resources for rapidly integrating new languages and translation directions into Lingenio's rule-based
MT system using statistical MT techniques. This work also produced a modular
development infrastructure, which led to a new range of products and services that Lingenio
now offers to new markets beyond traditional users of MT. The team also worked on novel uses of
MT technology in other areas, such as language learning and translator training, and proposed
pedagogically grounded methods and scenarios for using MT with advanced language learners to
support the learning process.
Main results of the project
- A range of new products and services offered by Lingenio, such as modules for rich linguistic
analysis and generation, terminology extraction, and support for the collaborative translation process
(Text Simplifier, Standardizer, Summarizer, Intelligent Concordancer, Dictionary Builder,
Translation Templates Builder, TM Multiplier, TM Standardization, Dictionary Standardization,
Lemmatizer, Morphological Annotator, Syntactic Annotator, Semantic Annotator, Discourse
Analyzer, Text Generation).
Full description of the services is available on Lingenio's website: Configuration of Tools
- New languages and translation directions developed for Lingenio's flagship rule-based MT
products (Translate Pro/Plus/Quick)
- A methodology for induction of richly annotated dictionaries and grammars from large text
collections (text corpora)
- A methodology for extracting databases of translation equivalents for Lingenio's rule-based MT
systems from parallel and comparable corpora
- A methodology for bootstrapping electronic dictionaries and grammars for new closely related
languages from existing Lingenio resources
- A methodology for statistical disambiguation (evaluation) of competing applications of parsing
- Pedagogically motivated scenarios for using MT to generate negative linguistic evidence for
advanced language learning and translator training, which were tested in teaching the University-level
module English for Translators.
- An on-going series of HyTra workshops (HyTra-2 at ACL-2013, Sofia; HyTra-3 at EACL-2014, Gothenburg; HyTra-4 at ACL-2015, Beijing) co-located with leading international conferences on Computational Linguistics, which bring together a community of MT researchers and industrial MT developers interested in hybrid approaches to machine translation.
Socio-economic impact and wider implications of the project: There are two main socio-economic
impacts of our HyghTra project. Firstly, HyghTra has moved technological boundaries in Hybrid
Machine Translation beyond the established paradigm. The focus of the project's innovative
technology was specifically on the needs of industrial developers and users of MT systems. The new
technology enables developers to fill an existing market gap with new products, which
combine SMT's superior disambiguation techniques and its rapid development cycle with RBMT's
accuracy of linguistic analysis, smaller system size and lower computational cost. This enables the
development of MT systems for new markets, such as the market for mobile devices, where highly
accurate off-line translation is needed with a small computational footprint (with applications for
emergency services, security, social support, tourism, etc., where a stable internet connection to
translation web services is either too expensive or cannot be guaranteed).
Secondly, HyghTra has brought Lingenio's new range of products and the underlying MT technology into new areas, which go beyond the two main traditional markets for MT systems, i.e., the professional translation automation market and the end-user MT market. Specifically, Lingenio's modular development infrastructure for RBMT systems has packaged new combinations of individual workflow components into a new range of products and services for text analytics, bilingual terminology extraction, dictionary creation, intelligent text processing, intelligent linguistic search in large corpora, language teaching and translator training. These new products and services address a much broader range of markets: from foreign language teaching to intelligent big data mining for industry, government, defense and security. The development of technological foundations for these innovative RBMT-based solutions is one of the major successes of the project, since such development is currently unique in the MT industry. We expect that within 5-10 years it will be widely used by other companies and will have made an important impact on the Language Technology industry as a whole.