• Antonio Moreno Sandoval (PI)
  • Manuel Alcántara
  • Iván Cantador
  • Pablo Castells
  • Paula Gozalo
  • Michael O'Donnell
  • Doroteo Torre
  • David Vallet
  • Alejandro Bellogín
  • Fernando Díez
  • Marta Garrote
  • Saúl Vargas
  • Ignacio Fernández


The Human Language Technologies & Information Retrieval Group is a multidisciplinary team of linguists, mathematicians, and computer and telecommunication engineers from different units (the Laboratorio de Lingüística Informática, the Information Retrieval Group and the Biometric Recognition Group) at the Universidad Autónoma de Madrid. Owing to their experience in creating multilingual resources in electronic format, some of the group's researchers already participated in the previous MAVIR programme (S-0505/TIC-0267). For this proposal, new members have joined to cover other research areas.


In the last five years, the group's research has produced a number of publications with scientific impact: 9 articles in JCR-indexed journals and 3 chapters in academic books by international publishers (John Benjamins, Rodopi), in addition to several contributions to high-impact, top-tier conferences in their respective areas, such as Hypertext, ISWC, ECIR, SIGIR and LREC. On the training side, 9 doctoral theses have been completed within the group in these years. Its members participate actively in the Master in Computer and Telecommunication Engineering at the Polytechnic School of the UAM, as well as in the Ph.D. programme “The Human Language” of the Faculty of Philosophy. In this period, the group has supervised 15 MA theses.

The current lines of research are:

  1. Compilation of multilingual resources in electronic format. The Computational Linguistics Lab excels in this activity both nationally and internationally. Among the corpora compiled, those devoted to spontaneous speech stand out: CORLEC, C-ORAL-ROM and CHIEDE. These corpora include the transcription, the alignment of sound and transcription, and multilayer annotation. Written corpora in Spanish have also been compiled, such as the Reference Corpus of the Spanish Language in Argentina and Chile and the Spanish UAM Treebank. Another important research line is the creation of resources in other languages, such as a parallel corpus in Arabic, English and Spanish, an electronic bilingual French-Spanish lexicon, and spoken corpora in Japanese, Arabic and Chinese. Some of these resources are commercialised through the Evaluations and Language resources Distribution Agency (ELDA); others are distributed free of charge for research purposes. These resources are essential for designing, training and evaluating any type of NLP system for processing multimedia content.
  2. Information Retrieval and Extraction, with special emphasis on user modeling and personalisation, recommender systems, context modeling and semantic-based technologies. This research line has produced technology (software modules, prototypes, demos), datasets and methodologies in the areas involved (user modeling, context modeling, recommender systems and semantic-based search engines). Several of these resources are currently being prepared for distribution under a public licence.
  3. Automatic transcription of spontaneous speech, focused on tasks such as automatic speaker recognition (i.e. recognising the speaker from a recording) and language recognition (i.e. recognising the language spoken in a given speech segment).

In the group's track record over the last five years, we may highlight: participation in European FP6 projects such as aceMedia (FP6-001765) and MESH (FP6-027685); nationally funded R&D projects on multimedia, namely Recuperación de Información en modelos multidimensionales: relevancia, novedad, personalización y contexto (TIN2008-06566-C04-02), Scalable semantic personalised search of spoken and written contents on the semantic web (TIN2005-06885), Automatic knowledge organisation, data analysis and dynamic document generation on the semantic web (TIC2002-1948), RILARIM (TIN2004-07588-C03-02) and BRAVO-RL (TIN2007-67407-C03-02); the CENIT project i3media (CENIT-2007-1012); the national thematic network Web Semántica (TSI2006-26928-E); and several collaborative projects with industrial partners funded by PROFIT and other national programmes of the CDTI.

Along with the projects, research stays and exchanges with centres of excellence are another priority in the group's activities. Members of the group have conducted predoctoral and postdoctoral research visits to MIT, New York University, DFKI, University of Edinburgh, Universität Wien, University of Glasgow, University of Southampton, Open University and University of Maryland, and maintain contacts with the research centres of technology companies (IBM, Telefónica I+D, Yahoo! Research).


