The Terminology Coordination Unit (TermCoord) of the European Parliament takes an original approach among the coordination services of the ten EU institutions. The Unit offers terminology management in the multilingual and multinational EU, whose law accounts for some 80% of the national legislation of its 28 member states.
Terminology management serves some 5000 translators and interpreters working in 24 languages across ten institutions and 58 specialised agencies. It is necessary to ensure the linguistic consistency of legislative procedures, which are drawn up in a process that relies on translation and interpretation in 552 possible language combinations. TermCoord undertakes this highly important and complex institutional task while also taking into account the requirements and benefits of global communication, the evolution of academic research in terminology, and the development of IT tools that aid and complement linguistic work. Communication about EU terminology takes place on three levels:
1. internally, to raise awareness of the importance of terminology to ensure quality of the original texts and translations that become official legislative texts in each country.
2. among the ten EU institutions managing the IATE database, into which translators insert around 300 terms daily. The new interinstitutional terminology portal, EurTerm, developed by TermCoord on behalf of all institutions, allows cooperation, sharing of resources, storage and automated consultation of research results, and communication at both concept and term level in each language.
3. externally, with the wider world of terminology: international organisations such as the UN, NGOs, universities, research centres and networks, and the big multinational companies that produce and offer very important terminology resources in specialised fields. Communicating EU terminology to the world benefits the quality and consistency of EU legislation, which covers all fields and is translated into 24 languages, and offers the public a very rich resource in the form of the IATE database, which contains 8.5 million terms in 110 domains and receives an average of 3500 hits per hour. This part of the presentation will focus on methods of coining terminology using tools and software such as e-newspapers, on the communication strategy, and on methods of collecting, selecting and sharing the precious resources gathered through this outward-looking approach.
In my talk, I describe ongoing efforts towards the development of a workbench that will enable academic and non-academic users to access multilingual dictionaries on multiple levels, i.e., both on the level of lexis (gloss-based search) and on the level of phonology or orthography (form-based search). In this project, we apply Linked Open Data formalisms to improve access to dictionaries for low-resourced languages, historical language stages, and languages lacking an established orthography; hence the title Linked Open Dictionaries (LiODi).
When studying low-resource languages, historical documents or dialectal variation, researchers often face the problem that lexical resources for the specific variety under consideration are sparse, dated, or simply unavailable. At the moment, this problem is addressed by initiatives that either aggregate language resources in a central repository (e.g., DOBES) or collect metadata about them (e.g., META-SHARE).
The availability of this huge and diverse amount of material, often in different formats and with a highly specialised focus on selected language varieties, poses the challenge of how to access and search this wealth of information. LiODi addresses both aspects:
- uniform access to lexical resources: at the moment, most resources are distributed across different providers. Platforms to query or browse this data are available, but they use different representation formalisms and remain isolated from each other. Consequently, potentially relevant material for the language(s) under consideration is virtually dispersed in the web of data. Here, we employ (Linguistic) Linked Open Data to develop interoperable representations for distributed resources that can be queried in a uniform fashion, despite their physical separation, and that provide cross-references (links) to each other.
- search across multilingual resources: when thinking about less-resourced languages, we are not only interested in lexical resources for the specific language we are currently studying, but also in resources in related varieties. This is because much of the material we have is sparse, and one strategy to address gaps in our lexical knowledge is to consult background information about the forms and meanings of possible cognates in other languages in the broader context of the language under consideration. For this purpose, we develop technologies to detect possible cognates based on their phonological form and to query for cognates in related varieties.
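The form-based cognate lookup sketched above can be illustrated in a few lines of Python. Here plain normalised edit distance stands in for a proper phonological similarity measure, and all word lists, language names and thresholds are invented for illustration only:

```python
# Sketch: retrieve candidate cognates for a query form from word lists of
# related varieties, ranked by normalised Levenshtein distance.
# (Toy data; a real system would use phonological feature distances.)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def candidate_cognates(form, lexicons, max_dist=0.5):
    """Return (language, form, distance) tuples below a distance threshold."""
    hits = []
    for lang, words in lexicons.items():
        for w in words:
            d = levenshtein(form, w) / max(len(form), len(w))
            if d <= max_dist:
                hits.append((lang, w, round(d, 2)))
    return sorted(hits, key=lambda h: h[2])

# Invented toy word lists for two related varieties:
lexicons = {
    "Kyrgyz": ["tash", "suu", "kol"],
    "Tatar":  ["tash", "su", "kul"],
}
print(candidate_cognates("tas", lexicons))  # query: Kazakh 'tas' (stone)
```

A production system would replace the distance function with learned sound-correspondence models, which is exactly the extension foreseen for the main phase of the project.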
LiODi will provide both functionalities as web services, together with a prototypical web interface that allows different lexical resources to be queried on their basis. This Lexicographic-Comparativist Workbench will provide access to Linked Data versions of open lexical resources. For resources with restricted access (academic, proprietary or unclear licenses) that cannot be directly integrated into the system (e.g., the Starling dictionaries), we will provide an external index that can be queried but does not store any textual information. Instead, the user will be redirected to the original platforms maintaining this data.
The Lexicographic-Comparativist Workbench provides two primary search functionalities that extend the functionality of existing platforms: form-based search and gloss- (meaning-)based search. Both will be provided over a web frontend:
- gloss- (meaning-)based search: in a traditional dictionary, the lemma is complemented with a gloss that paraphrases its meaning; e.g., in a Kazakh-Russian dictionary, the Russian translation of Kazakh forms may be provided. If we are querying for a meaning in English, however, a Russian-English dictionary has to be consulted in addition. The Linked Data formalisms applied here allow querying over sequences of dictionaries (transitive search). In addition, it is possible to aggregate over different sequences of dictionaries, e.g., by assessing the (English) meaning of a Kazakh word through both, say, Kazakh-Turkish/Turkish-English and Kazakh-Russian/Russian-English dictionaries and highlighting the English words they agree upon. Search results will then be marked for the sequence of dictionaries considered.
- form-based search: given a lexeme in a particular language, say, Kazakh, and a set of related languages, say, the Turkic languages in general, the system retrieves phonologically similar lexemes from the respective target languages. This functionality may be used to narrow down or confirm the meaning of an unknown word, based on the meaning of potential cognates in related languages. In the main phase of the project, this search functionality is to be extended with machine learning methods that approximate systematic sound correspondences resulting, e.g., from diachronic sound changes. Where explicitly given, etymological links can be queried during form-based search, in analogy with gloss-based search.
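The transitive, aggregating gloss-based search can be illustrated with a minimal Python sketch; the dictionary chains and entries below are invented toy data, not actual LiODi resources:

```python
# Sketch: transitive gloss-based search over a chain of bilingual
# dictionaries, and aggregation over two independent chains.
# (All dictionary entries below are invented for illustration.)

def translate(word, *dictionaries):
    """Follow a word through a sequence of bilingual dictionaries,
    returning every gloss reachable at the end of the chain."""
    candidates = {word}
    for d in dictionaries:
        candidates = {g for w in candidates for g in d.get(w, [])}
    return candidates

# Toy bilingual dictionaries (source form -> target glosses):
kk_ru = {"tas": ["камень"]}
ru_en = {"камень": ["stone", "rock"]}
kk_tr = {"tas": ["taş"]}
tr_en = {"taş": ["stone"]}

via_russian = translate("tas", kk_ru, ru_en)   # transitive: Kazakh -> Russian -> English
via_turkish = translate("tas", kk_tr, tr_en)   # transitive: Kazakh -> Turkish -> English

# Glosses on which both chains agree would be highlighted as most reliable:
agreed = via_russian & via_turkish
print(agreed)
```

In the actual workbench, the chains are expressed as Linked Data links between dictionary entries and queried with graph-query machinery rather than in-memory dictionaries, but the composition and intersection logic is the same.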
In this talk we will describe our experience when publishing and, more crucially, consuming Linked Data at the Spanish CLARIN Knowledge Centre (http://lod.iula.upf.edu). The centre includes a Catalog of NLP resources and tools that aims to promote the use of language technology among researchers in the Humanities and Social Sciences. Though the original data set followed an XML/XSD schema, it was rewritten in accordance with the LOD approach in order to maximise the information contained in our repositories and to be able to enrich the data there.
We will address some critical aspects of RDFying XSD/XML data, focusing on the strategy followed when mapping controlled vocabularies expressed as XML enumerations; when dealing with certain unstructured data (where input strings may generate relevant instances); and when addressing identity resolution and linking tasks once the eventual instances are RDFied. We will also report on data cleansing, a crucial and unavoidable task that we addressed as an incremental process in which SPARQL played an important role. We will see that some of the decisions taken depend on the eventual application we have in mind. The requirements of our Catalog (implemented as a web browser) include: displaying data to the user in a comprehensive way; aggregating external data in a sensible manner; and making hidden implicit relations explicit. In addition, the system needs to provide fresh (regularly updated) data with quick response times.
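As an illustration of the enumeration-mapping step, the following minimal Python sketch turns an XML attribute whose values come from a closed XSD enumeration into RDF triples that point at controlled-vocabulary URIs. The element names, record contents and vocabulary URIs are invented for the example, not the actual IULA schema:

```python
# Sketch: RDFying an XML enumeration into controlled-vocabulary URIs.
# (Element names, identifiers and URIs are invented for illustration.)
import xml.etree.ElementTree as ET

# Mapping from XSD enumeration values to vocabulary class URIs:
RESOURCE_TYPE = {
    "corpus":  "http://example.org/vocab/Corpus",
    "lexicon": "http://example.org/vocab/Lexicon",
    "tool":    "http://example.org/vocab/Tool",
}

xml_record = """
<catalog>
  <resource id="r1" type="corpus"><name>Example Corpus</name></resource>
  <resource id="r2" type="tool"><name>Example Tagger</name></resource>
</catalog>
"""

def rdfize(xml_text, base="http://example.org/id/"):
    """Emit N-Triples: one rdf:type triple per <resource>, resolving the
    enumerated 'type' attribute against the controlled vocabulary,
    plus one literal-valued name triple."""
    triples = []
    for res in ET.fromstring(xml_text).iter("resource"):
        subj = f"<{base}{res.get('id')}>"
        cls = RESOURCE_TYPE[res.get("type")]   # enumeration lookup
        triples.append(
            f"{subj} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <{cls}> .")
        triples.append(
            f'{subj} <http://example.org/vocab/name> "{res.findtext("name")}" .')
    return triples

for t in rdfize(xml_record):
    print(t)
```

Once the instances are RDFied in this way, cleansing queries (e.g., finding resources whose type triple is missing or inconsistent) can be expressed directly in SPARQL against the resulting graph.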
Finally, we will report on our experiences with data integration and enrichment (via data mashup). We experimented with different strategies (e.g., using external URIs vs. caching data locally) and faced various problems (time latency, dereferencing external URIs) that may be useful to share.
The EuroWordNet project launched the model of an Inter-Lingual-Index (ILI) to connect wordnets for all languages in the world. Since then, many wordnets have been built, but the notion of the ILI remained underdeveloped. Most of these wordnets have been linked to some version of the Princeton WordNet, and the fund of concepts has not been open to other languages. The Global Wordnet Grid (GWG: http://data.lider-project.eu/ili) is an initiative to revive the original idea using a Linked Open Data platform. In this presentation, I will describe the principles behind the GWG and its current status. Finally, I will briefly demonstrate how the capacity of the GWG has been exploited in the European project NewsReader to achieve semantic interoperability when processing news in three different languages in order to generate event-centric knowledge graphs in RDF.