
24th Annual Conference of the Sociedad Española para el Procesamiento del Lenguaje Natural

Escuela Politécnica Superior, Leganés Campus, 10, 11 and 12 September 2008

Organized by: Advanced Databases Group, Department of Computer Science · Universidad Carlos III de Madrid


http://basesdatos.uc3m.es/sepln2008/web/

MAVIR Consortium

Thematic Network on Multilingual and Multimodal Information Processing

Edited in Leganés, 2008

Published by: Sociedad Española para el Procesamiento del Lenguaje Natural, Departamento de Informática, Universidad de Jaén, Edificio A3, Despacho 127, 23071 Jaén, [email protected]

EDITED BY

Paloma Martínez Fernández, Dolores Cuadra Fernández, F. Javier Calle Gómez

PROGRAM COMMITTEE

Chair: Prof. Paloma Martínez Fernández (Universidad Carlos III de Madrid)

Members

Prof. Ferrán Pla (Universitat Politécnica de Valencia)
Prof. Raquel Martínez (Universidad Nacional de Educación a Distancia)
Prof. José Gabriel Amores Carredano (Universidad de Sevilla)
Prof. Toni Badia i Cardús (Universitat Pompeu Fabra)
Prof. Manuel de Buenaga Rodríguez (Universidad Europea de Madrid)
Prof. Fco. Javier Calle Gómez (Universidad Carlos III de Madrid)
Prof. Irene Castellón Masalles (Universitat de Barcelona)
Prof. Arantza Díaz de Ilarraza (Euskal Herriko Unibertsitatea)
Prof. Antonio Ferrández Rodríguez (Universitat d'Alacant)
Prof. Mikel Forcada Zubizarreta (Universitat d'Alacant)
Prof. Ana María García Serrano (Universidad Nacional de Educación a Distancia)
Prof. Koldo Gojenola Galletebeitia (Euskal Herriko Unibertsitatea)
Prof. Xavier Gómez Guinovart (Universidade de Vigo)
Prof. Julio Gonzalo Arroyo (Universidad Nacional de Educación a Distancia)
Prof. José Miguel Goñi Menoyo (Universidad Politécnica de Madrid)
Prof. José B. Mariño Acebal (Universitat Politécnica de Catalunya)
Prof. M. Antonia Martí Antonín (Universitat de Barcelona)
Prof. Mª Teresa Martín Valdivia (Universidad de Jaén)
Prof. Patricio Martínez Barco (Universitat d'Alacant)
Prof. Lidia Ana Moreno Boronat (Universitat Politécnica de Valencia)
Prof. Lluis Padró (Universitat Politécnica de Catalunya)
Prof. Manuel Palomar Sanz (Universitat d'Alacant)
Prof. Germán Rigau (Euskal Herriko Unibertsitatea)
Prof. Horacio Rodríguez Hontoria (Universitat Politécnica de Catalunya)
Prof. Kepa Sarasola Gabiola (Euskal Herriko Unibertsitatea)
Prof. Emilio Sanchís (Universitat Politécnica de Valencia)
Prof. L. Alfonso Ureña López (Universidad de Jaén)
Prof. Mª Felisa Verdejo Maillo (Universidad Nacional de Educación a Distancia)
Prof. Manuel Vilares Ferro (Universidade de Vigo)

LOCAL ORGANIZING COMMITTEE

Chair

Prof. Paloma Martínez Fernández

Members

Prof. Javier Calle Gómez
Prof. Elena Castro Galán
Prof. Dolores Cuadra Fernández
Prof. César de Pablo Sánchez
Prof. Harith Al-Jumaily
Prof. Ana Iglesias Maqueda
Prof. Isabel Segura Bedmar
Prof. Lourdes Moreno López
Prof. Mª. Teresa Vicente Díez
Prof. José Luís Martínez Fernández
Prof. David del Valle Agudo
Prof. Jessica Rivero Espinosa

EXTERNAL REVIEWERS

Alberto Diaz, Ana Iglesias, Andrés Montoyo, Anselmo Peñas, Antoni Oliver, Antonio Bonafonte, Antonio Molina, Antonio Moreno, Antonio Toral, Antonio Vaquero, Arantza Casillas, Arturo Montejo Raez, Belen Ruiz, Borja Navarro, Carlos Gómez, César de Pablo, David Griol, David Tomás, Dolores Cuadra, Doroteo T. Toledano, Emili Sapena, Enrique Amigó, Estela Saquete, Fco. Mario Barcala, Fernando Martínez, Francisco José Valverde, Gloria Vazquez, Gorka Labaka, Ignacio Giráldez, Inés M. Galván, Jesús Giménez, Jesús Peral, Joaquim Moré, Jordi Atserias, José Carlos González, José Luís Martínez, Kepa Bengoetxea, Laura Alonso, Lourdes Araujo, Maite Oronoz, Manuel Carlos Díaz, Manuel García, Manuel J. Maña, Manuel Montes y Gomez, Maria Fuentes, Miguel A. García, Milagros Fernández, Montserrat Cuadros, Montserrat Marimon, Pablo Gervás, Paloma Moreda, Paolo Rosso, Rafael Muñoz, Roxana Danger, Víctor Fresno, Victor J. Diaz, Víctor Manuel Darriba, Yassine Benajiba

ISSN: 1135-5948 Legal deposit: B-3941/1991 Distributed by: Sociedad Española para el Procesamiento del Lenguaje Natural

ARTICLES

Morphosyntactic Analysis
Chunk and Clause Identification for Basque by Filtering and Ranking with Perceptrons. Iñaki Alegría, Bertol Arrieta, Xavier Carreras Pérez, Arantza Díaz de Ilarraza and Larraitz Uria.
Analysis of Noun-Noun sequences: a rule based approach. José Mari Arriola, Juan Carlos Odriozola.
Dependency Grammars in Freeling. Jordi Carrera, Irene Castellón, Marina Lloberes, Lluis Padró, Nevena Tincova.
Towards a Dependency Parser for Greek Using a Small Training Data Set. Jesús Herrera de la Cruz, Pablo Gervás.
Análisis sintáctico profundo del español: un ejemplo del procesamiento de secuencias idiomáticas. Jorge Antonio Leoni de León, Sandra Schwab, Eric Wehrli.

Question Answering
Un sistema de búsqueda de respuestas basado en ontologías, implicación textual y entornos reales. Oscar Ferrández, Rubén Izquierdo-Bevia, Sergio Ferrández, José Luis Vicedo.
The influence of Semantic Roles in QA: A comparative analysis. Paloma Moreda, Héctor Llorens, Estela Saquete, Manuel Palomar.

Text Categorization
Aproximación a la Categorización Textual en español basada en la Semántica de Marcos. Mario Crespo Miguel, Antonio Frías Delgado.
Clasificación de documentos basada en la opinión: experimentos con un corpus de críticas de cine en español. Fermín L. Cruz Mata, José Antonio Troyano, Fernando Enríquez, F. Javier Ortega.
Density-based clustering of short-text corpora. Diego Ingaramo, Marcelo Errecalde, Paolo Rosso.
Clasificación de Páginas Web en Dominio Específico. Francisco Manuel Rangel, Anselmo Peñas.
MIDAS: An Information-Extraction Approach to Medical Text Classification. Anastasia Sotelsek-Margalef, Julio Villena-Román.

Computational Lexicography
Mutual terminology extraction using a statistical framework. Le An Ha, Ruslan Mitkov, Gloria Corpas.
Comparing languages from vocabulary growth to inflection paradigms - A study run on parallel corpora and multilingual lexicons. Helena Blancafort, Claude de Loupy.
Multilingual Evaluation of KnowNet. Montse Cuadros, German Rigau.
Extensión y corrección semi-automática de léxicos morfo-sintácticos. Lionel Nicolas, Benoît Sagot, Miguel Angel Molinero, Jacques Farré, Eric de la Clergerie.
A cognitive approach to qualities for NLP. Carlos Periñán-Pascual, Francisco Arcas-Túnez.

Corpus Linguistics
From Dependencies to Constituents in the Reference Corpus for the Processing of Basque (EPEC). Izaskun Aldezabal, Maxux Aranzabe, Arantza Díaz de Ilarraza, Enrique Fernández.
A Web-Platform for Preserving, Exploring, Visualising and Querying Linguistic Corpora and other Resources. Georg Rehm, Oliver Schonefeld, Andreas Witt, Christian Chiarcos, Timm Lehmberg.

Information Retrieval
Sistema de Recomendación para la Recuperación Automática de Enlaces Web Rotos. Lourdes Araujo, Juan Martínez Romo.
Funciones de Ranking basadas en Lógica Borrosa para IR estructurada. Joaquín Pérez-Iglesias, Víctor Fresno, José R. Pérez-Agüera.

Text Summarization
Integración del reconocimiento de la implicación textual en tareas automáticas de resúmenes de textos. Elena Lloret, Oscar Ferrández, Rafael Muñoz, Manuel Palomar.
Uso de Grafos de Conceptos para la Generación Automática de Resúmenes en Biomedicina. Laura Plaza Morales, Alberto Díaz, Pablo Gervás.

Procesamiento del Lenguaje Natural, nº 41, Septiembre 2008 ISSN 1135-5948

Semantics and Pragmatics
Determining the Semantic Orientation of Opinions on Products – a Comparative Analysis. Alexandra Balahur, Andres Montoyo.
Methodological approach for pragmatic annotation. Francisco Javier Calle, David del Valle Agudo, Jessica Rivero, Dolores Cuadra.
Descripción de Entidades y Generación de Expresiones de Referencia en la Generación Automática de Discurso. Raquel Hervás, Pablo Gervás.
Natural Language Processing meets User Modeling for automatic and adaptive free-text scoring. Diana Pérez-Marín, Ismael Pascual-Nieto, Pilar Rodríguez.
Algunos problemas concretos en la anotación de papeles semánticos. Breve estudio comparativo a partir de los datos de AnCora, SenSem y ADESSE. Gael Vaamonde.

Machine Translation
Reutilización de datos lingüísticos para la creación de un sistema de traducción automática para un nuevo par de lenguas. Carme Armentano-Oller, Mikel Forcada.
Aplicación de métodos estadísticos para la traducción de voz a Lengua de Signos. Beatriz Gallo, Rubén San-Segundo, Juan Manuel Lucas, Roberto Barra Chicote, Luis Fernando DHaro, Fernando Fernández.
Comparación y combinación de los sistemas de traducción automática basados en n-gramas y en sintaxis. Maxim Khalilov, José A. R. Fonollosa.
Generación de múltiples hipótesis ponderadas de reordenamiento para un sistema de traducción automática estadística. Marta Ruiz Costa-Jussa, José A. R. Fonollosa.
Mining Term Translations from Domain Restricted Comparable Corpora. Xabier Saralegi, Iñaki San Vicente, Maddalen López de Lacalle.
Bilingual Terminology Extraction based on Translation Patterns. Alberto Simões, Jose Joao Almeida.

DEMONSTRATIONS
AnCoraPipe: A tool for multilevel annotation. Manuel Bertrán, Oriol Borrega, Marta Recasens, Bárbara Soriano.
Plataforma de Interacción Natural para el Acompañamiento Virtual. David del Valle, Jesica Rivero, Daniel Conde, Garazi Olaziregi, Julián Moreno, Javier Calle, Dolores Cuadra.
El programa de búsqueda con lenguaje natural de Q-go aplicado a un sitio web multilingüe. Carolina Fraile, Leonoor Van der Beek.
CHIEDE, Corpus de Habla Infantil Espontánea del Español. Marta Garrote Salazar, José M. Guiraó Miras.
MOSTAS: Un Etiquetador Morfo-Semántico, Anonimizador y Corrector de Historiales Clínicos. Ana Iglesias, Elena Castro, Rebeca Pérez, Leonardo Castaño, Paloma Martínez, José Manuel Gómez-Pérez, Sandra Kohler, Ricardo Melero.
Herramientas de anotación de corpus de habla espontánea del Laboratorio de Lingüística Informática de la UAM. Antonio Moreno, José Ma. Guirao, Doroteo Torre.
TMILG (Tesouro Medieval Informatizado da Lingua Galega). Antonio de Carlos Moura, Ángel López, José Ramón Pichel.
Subtitulado Cerrado para la Accesibilidad de Personas con Discapacidad Auditiva en Entornos Educativos. Pablo Revuelta, Javier Jiménez, José Manuel Sánchez, Belén Ruiz.
ESEDA: Tool for enhanced speech emotion detection and analysis. Julia Sidorova, Toni Badia i Cardús.

PROJECTS
CLARIN: Common Language Resources and Technology Infrastructure. Núria Bel, Montserrat Marimon.
SOPAT - Servicio de orientación personalizada y accesible para turismo. Víctor Codina, Luigi Ceccaroni.
GODO: Generación inteligente de Objetivos para el Descubrimiento de servicios web semánticos. Juan Miguel Gómez, Javier Chamizo.
TEXT-MESS: Minería de Textos Inteligente, Interactiva y Multilingüe basada en Tecnología del Lenguaje Humano. Patricio Martínez-Barco, Manuel Palomar, Julio Gonzalo, Anselmo Peñas, L. Alfonso Ureña, Mª Teresa Martín, Ferrán Pla, Paolo Rosso, Alicia Ageno, Jordi Turmo, M. Antònia Martí, Mariona Taulé.
TECNOPARLA - Speech technologies for Catalan and its application to Speech-to-speech Translation. Henrik Schulz, Marta R. Costa-Jussá, José A. R. Fonollosa.

©2008 Sociedad Española para el Procesamiento del Lenguaje Natural


Preamble

Issue 41 of the journal of the Sociedad Española para el Procesamiento del Lenguaje Natural contains the scientific papers, together with the summaries of research projects and tool demonstrations, accepted by the Scientific Committee for presentation at the 24th Conference of the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN'08). This edition of the conference was organized by the Advanced Databases Group of the Department of Computer Science at Universidad Carlos III de Madrid.

The research community's interest in natural language technologies resulted in 66 scientific contributions being submitted to SEPLN'08, of which 34 were accepted. These contributions have been grouped into the following areas, which are not mutually exclusive: morphological analysis (5), corpus linguistics (2), machine translation (6), text categorization (5), text summarization (2), computational lexicography (5), information retrieval (2), semantics and pragmatics (5) and question answering (2). Each submitted paper was reviewed by three members of the Scientific Committee. The conference proceedings also include 5 summaries of research projects and 9 descriptions of demonstrations of tools related to the automatic processing of natural language.

This edition also features an invited talk by Dr. Jan Alexandersson (DFKI, German Research Centre for Artificial Intelligence, Saarbrücken, Germany) and two round tables, "Applying semantics in industry" and "NLP tools and free resources", organized in collaboration with the MAVIR consortium (Mejorando el Acceso y la Visibilidad de la Información multilingüe en Red) and the TIMM network (red Temática para el Tratamiento de la Información Multilingüe y Multimodal).

Finally, I would like to thank Universidad Carlos III de Madrid, the Ministerio de Ciencia e Innovación, the Comunidad de Madrid, DAEDALUS, CESYA, the MAVIR consortium, the SEPLN board and the members of the scientific committee for their help in making everything work out. I would also like to thank all my colleagues in the Advanced Databases group, without whose collaboration it would not have been possible to organize this conference.

September 2008, Paloma Martínez Fernández

ARTICLES

Morphosyntactic Analysis

Chunk and Clause Identification for Basque by Filtering and Ranking with Perceptrons

Identificación de cláusulas y chunks para el Euskera, usando Filtrado y Ranking con el Perceptron

Iñaki Alegria, University of the Basque Country, 649 pk, 20018 Donostia, [email protected]
Bertol Arrieta, University of the Basque Country, 649 pk, 20018 Donostia, [email protected]
Xavier Carreras, MIT CSAIL, 32 Vassar St., Cambridge MA 02139 USA, [email protected]
Arantza Díaz de Ilarraza, University of the Basque Country, 649 pk, 20018 Donostia, [email protected]
Larraitz Uria, University of the Basque Country, 649 pk, 20018 Donostia, [email protected]

Resumen: Este artículo presenta sistemas de identificación de chunks y cláusulas para el euskera, combinando gramáticas basadas en reglas con técnicas de aprendizaje automático. Más concretamente, se utiliza el modelo de Filtrado y Ranking con el Perceptron (Carreras, Màrquez y Castro, 2005): un modelo de aprendizaje que permite identificar estructuras sintácticas parciales en la oración, con resultados óptimos para estas tareas en inglés. Este modelo permite incorporar nuevos atributos, y posibilita así el uso de información de diferentes fuentes. De esta manera, hemos añadido información lingüística en los algoritmos de aprendizaje. Así, los resultados del identificador de chunks han mejorado considerablemente y se ha compensado la influencia del relativamente pequeño corpus de entrenamiento que disponemos para el euskera. En cuanto a la identificación de cláusulas, los primeros resultados no son demasiado buenos, debido probablemente al orden libre del euskera y al pequeño corpus del que disponemos actualmente. Palabras clave: euskera, análisis parcial, chunking, identificación de cláusulas, aprendizaje automático, aprendizaje discriminatorio, perceptron

Abstract: This paper presents systems for syntactic chunking and clause identification for Basque, combining rule-based grammars with machine-learning techniques. Specifically, we use Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005): a learning model that recognizes partial syntactic structures in sentences, obtaining state-of-the-art performance for these tasks in English. This model allows a rich set of features to be incorporated to represent syntactic phrases, making it possible to use information from different sources. We used this property to include more linguistic features in the learning model, and the chunking results improved greatly. In this way, we have made up for the relatively small amount of training data available for learning a chunking model for Basque. In the case of clause identification, our preliminary results are low, probably because of the free word order of Basque and the small corpus available.

Keywords: Basque language, shallow parsing, chunking, clause identification, machine learning, discriminative learning, perceptron

1 Background

1.1 Basque syntactic parser: an important step toward the grammar checker

In recent years, several efforts have been made to build a grammar checker for the Basque language (Ansa et al., 2004; Diaz De Ilarraza et al., 2005). To that end, a Basque shallow syntactic parser was created using finite-state technologies, constraint grammar rules (Aduriz and Díaz de Ilarraza, 2003) and stochastic rules based on Hidden Markov Models (Ezeiza et al., 1998). Based on the syntactic information extracted by this shallow syntactic parser, a set of rules was written to detect some types of grammatical errors. In this way, a first version of the Basque grammar checker was developed.

Procesamiento del Lenguaje Natural, nº 41 (2008), pp. 5-12; received 7-05-2008; accepted 16-06-2008

ISSN: 1135-5948 © 2008 Sociedad Española para el Procesamiento del Lenguaje Natural

This shallow syntactic parser is divided into several modules, each dealing with a different task. First, the text is tokenized and analysed morphologically. Next, a tagger/lemmatiser obtains the lemma and the category corresponding to each word form, and another module disambiguates the proposed tags. Then, a rule-based chunker identifies verb and noun phrases, and, finally, a dependency-based syntactic tree is obtained by means of a rule-based module.

This parser only recognizes the sentences which are separated by a full stop. Recently, a set of rules has been developed in order to tag sentence and clause splits. However, it has not been integrated in the parser yet.

In contrast, the rule-based chunker is integrated in this parser. It contains 560 rules: 479 related to noun phrases and 81 related to verb phrases (Aduriz et al., 2004). Table 1 presents the results of this chunker.

          precision   recall    f-measure
Np        86.92%      80.68%    83.68%
Vp        84.19%      87.77%    85.94%
Chunks    85.92%      83.46%    84.67%

Table 1: Results of the rule-based chunker

In this context, our main goal was to improve the identification of chunks and clauses by using machine learning techniques and combining them with the existing rules. This would improve both the parser and the grammar checker, because the syntactic information used by the grammar checker would be more reliable.

1.2 EPEC: a manually tagged corpus for Basque

In recent years, a great effort has been made to build a manually tagged corpus for the Basque language. This corpus, named EPEC, is intended to be the reference corpus for the automatic processing of Basque. EPEC is a corpus of standard written Basque which has been manually tagged at different levels: first morphology, surface syntax and phrases, and later deep syntax (Aduriz et al., 2006). Half of this corpus was obtained from the Statistical Corpus of 20th Century Basque (www.euskaracorpusa.net). The other half was extracted from Euskaldunon Egunkaria (www.egunero.info), the only daily newspaper written entirely in standard Basque.

The corpus was tagged semi-automatically. First, it was processed by MORFEUS (Alegria, Artola and Sarasola, 1996), a robust morphological analyser for Basque. In this way, the corpus was morphosyntactically analysed, assigning all possible analyses to each word form. Then, this output was manually disambiguated; that is, the correct morphological and syntactic tag was chosen for each word. A similar technique was used to tag the noun and verb chains as well as the sentences and clauses: a rule-based grammar did the first tagging, and the tags were then corrected manually.

In this way, 56,000 words (3,708 sentences) were tagged at the morphosyntactic level (an average of 15 words per sentence). Chunks and clauses were tagged only in the first 25,000 words, so this is the corpus we used in these experiments. Our corpus thus contains the following linguistic information: lemma, part of speech, subcategory, declension, subordinate clause marks, chunk and clause start-end marks, and syntactic functions. Currently, 300,000 words are being tagged at the deep syntax level.

We divided the 25,000-word corpus into three parts: 60% for training, 20% for development and 20% for testing. We used the development data to evaluate all the models presented here.
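The 60/20/20 partition can be sketched as follows. This is only an illustration: the text does not say whether the split was made over sentences or words, nor whether it was sequential or random, so the function name `split_corpus` and the choice of a sequential split over sentences are assumptions.

```python
from typing import List, Sequence, Tuple


def split_corpus(
    sentences: Sequence[str], train_pct: int = 60, dev_pct: int = 20
) -> Tuple[List[str], List[str], List[str]]:
    """Split sentences sequentially into train/development/test parts.

    Assumption: the split is sequential and sentence-based; the
    remaining sentences (here 20%) form the test part.
    """
    n = len(sentences)
    train_end = n * train_pct // 100
    dev_end = train_end + n * dev_pct // 100
    return (
        list(sentences[:train_end]),
        list(sentences[train_end:dev_end]),
        list(sentences[dev_end:]),
    )


# Toy corpus of ten placeholder "sentences".
toy = [f"sentence {i}" for i in range(10)]
train, dev, test = split_corpus(toy)
```

Keeping the split sequential avoids placing neighbouring sentences of one document in different parts, but a random split would be equally compatible with the description above.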

1.3 Chunk and clause identification: state of the art

In recent years, machine learning techniques have been applied to different tasks within the NLP field. With respect to chunk and clause identification, the main idea is to recognize partial syntactic structures in a sentence. Since these structures are not very complex, machine learning techniques have been applied successfully to this kind of task. The chunk and clause identification shared tasks of CoNLL 2000 and 2001, respectively (Tjong Kim Sang and Buchholz, 2000; Tjong Kim Sang and Déjean, 2001), and the good results obtained with different machine learning techniques provide clear evidence of their effectiveness.

A key concept behind syntactic chunking is that the chunks constituting a sentence can be represented as a sequence of labels, one for each word of the sentence (see Figure 1).

Figure 1: BIO representation for chunking

Therefore, chunking may be solved using sequential learning models, which predict the most likely sequence of chunk labels given the input sentence. At the heart of these models, there are classifiers which predict the chunk label for a word, given the surrounding context of that word (including the chunk label of the surrounding words). Under this general paradigm, many different algorithms have been applied to chunking. The best systems use discriminative algorithms such as Support Vector Machines (SVM) (Kudo and Matsumoto, 2001), Winnow (Zhang, Damerau and Johnson, 2002), Conditional Random Fields (Sha and Pereira, 2003) or the Averaged Perceptron (Carreras, Màrquez and Castro, 2005). All these algorithms share important properties. First, in order to represent the data it is possible to incorporate a great many different features from many types of sources. Second, they are very efficient algorithms which scale up to the order of tens of thousands of examples and millions of feature dimensions. Third, there are theoretical results that guarantee good performance of the learned models on unseen data, even in the presence of very large feature sets.
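To make the BIO representation concrete, here is a minimal sketch that converts between chunk spans and per-word labels. The function names `chunks_to_bio` and `bio_to_chunks` are hypothetical and not part of any system described in the paper; the label sequences are the ones given for the English and Basque example sentences of Figure 1.

```python
from typing import List, Tuple

Chunk = Tuple[int, int, str]  # (first word index, last word index, chunk type)


def chunks_to_bio(n_words: int, chunks: List[Chunk]) -> List[str]:
    """Encode chunk spans as one B-/I-/O label per word."""
    labels = ["O"] * n_words
    for start, end, kind in chunks:
        labels[start] = "B-" + kind
        for i in range(start + 1, end + 1):
            labels[i] = "I-" + kind
    return labels


def bio_to_chunks(labels: List[str]) -> List[Chunk]:
    """Decode a BIO label sequence back into chunk spans."""
    chunks: List[Chunk] = []
    start, kind = None, None
    # A sentinel "O" closes any chunk still open at the end.
    for i, label in enumerate(labels + ["O"]):
        if start is not None and label != "I-" + kind:
            chunks.append((start, i - 1, kind))
            start, kind = None, None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            # An "I-" with no open chunk is treated as a chunk start.
            start, kind = i, label[2:]
    return chunks


english = ["B-NP", "B-VP", "B-NP", "I-NP"]  # This / is / an example.
basque = ["B-NP", "B-NP", "I-NP", "B-VP"]   # Hau / adibide bat / da.
english_chunks = bio_to_chunks(english)
basque_chunks = bio_to_chunks(basque)
```

Because each word receives exactly one label, a classifier that predicts labels word by word implicitly predicts a complete, non-overlapping segmentation of the sentence.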

Chunking is evaluated with precision and recall measures of the recognized chunks. To compare the performance of systems, it is common to use the F1 measure (also called F-measure), which is the harmonic mean of precision and recall, and is computed as:

F1 = 2 * precision * recall / (precision + recall)

In this paper, we will use the F1 measure to compare the different results. In English, the best systems for chunk identification reach an F1 of about 94%.
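As a worked illustration of the measure, the sketch below computes precision, recall and F1 over sets of chunks and recomputes the F-measure column of Table 1 from its precision and recall columns. The function names are hypothetical, and counting a predicted chunk as correct only when both its span and its type match the gold chunk exactly is an assumption in line with the usual CoNLL-style evaluation.

```python
from typing import Iterable, Tuple

Chunk = Tuple[int, int, str]  # (first word index, last word index, chunk type)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate_chunks(
    gold: Iterable[Chunk], predicted: Iterable[Chunk]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match chunks."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall, f1_score(precision, recall)


# Precision and recall values taken from Table 1 (in percent).
table1 = {"Np": (86.92, 80.68), "Vp": (84.19, 87.77), "Chunks": (85.92, 83.46)}
for name, (p, r) in table1.items():
    print(name, round(f1_score(p, r), 2))
```

Rounded to two decimals, the computed values agree with the F-measure column of Table 1 (83.68, 85.94 and 84.67).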

The success of these methods motivated further research in machine learning systems for recognizing the clause structure of a sentence, a much more difficult problem due to the recursive nature of these structures (see Figure 2). The best systems to date for clause identification reach an F1 of about 84%. Both the chunking and the clause identification systems use a corpus of 200,000 tokens.

Figure 2: Recursive representation of a sentence

Carreras, Màrquez and Castro (2005) took into account the recursive character of clauses and developed a system that treats both non-recursive and recursive phrases. They suggest a global learning strategy for the general task of recognizing phrases. They propose a filtering-ranking architecture based on perceptrons, and they achieve good results in most of the relevant NLP problems related to recognizing phrases.

2 Phrase recognition using filtering and ranking with perceptrons

As we have seen, Carreras, Màrquez and Castro (2005) suggested a global learning strategy for the general task of recognizing phrases, taking into account the recursive character of some phrases, such as clauses.

The system recognizes structures of phrases in a sentence, and it works in two layers. The filtering layer applies simple classifiers to detect boundaries of phrases in the sentence, producing a set of phrase candidates. The ranking layer applies a second set of classifiers that evaluate the phrase candidates produced in the first layer. The final solution is computed with a dynamic programming algorithm that builds the best structure of phrases for the sentence. Depending on the problem at hand, this algorithm will search for sequential or recursive structures of phrases (Carreras, 2005).
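The final search step can be illustrated for the simpler, sequential case. The sketch below is not the authors' implementation: it assumes that the ranking layer has already produced a score for each phrase candidate (the scores here are invented), and it uses a standard dynamic programme to pick the best-scoring set of non-overlapping phrases. The recursive case needed for clauses is not covered.

```python
def best_sequential_structure(n_tokens, candidates):
    """Pick the highest-scoring set of non-overlapping phrases.

    candidates: list of (start, end, label, score), token indices inclusive.
    Returns (total_score, phrases), with phrases as (start, end, label).
    """
    # Group the candidates by the position of their last token.
    ending_at = {}
    for candidate in candidates:
        ending_at.setdefault(candidate[1], []).append(candidate)

    # best[i] holds the best (score, phrases) for the first i tokens.
    best = [(0.0, [])]
    for i in range(1, n_tokens + 1):
        # Option 1: token i-1 is left outside any phrase.
        score, phrases = best[i - 1]
        # Option 2: some candidate phrase ends exactly at token i-1.
        for start, end, label, cand_score in ending_at.get(i - 1, []):
            prev_score, prev_phrases = best[start]
            if prev_score + cand_score > score:
                score = prev_score + cand_score
                phrases = prev_phrases + [(start, end, label)]
        best.append((score, phrases))
    return best[n_tokens]


# Hypothetical candidates and scores for a four-token sentence.
candidates = [
    (0, 0, "NP", 1.0),
    (1, 1, "NP", 0.8),
    (1, 2, "NP", 2.0),
    (2, 2, "NP", -0.3),
    (3, 3, "VP", 1.2),
]
total, phrases = best_sequential_structure(4, candidates)
```

With these invented scores, the candidate covering tokens 1 to 2 is preferred over the two shorter candidates that compete with it.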

All the classifiers are developed using a variant of the Perceptron algorithm: the Averaged Perceptron (Freund and Schapire, 1999), which is a simple improvement of the traditional Perceptron algorithm that learns an averaged combination of classifiers during training. This algorithm has obtained very good results in NLP (Collins, 2002).

((Euria ari zuen arren,) oinez joan ginen.) ((Although it was raining,) we went on foot.)

This    is     an     example  .
B-NP    B-VP   B-NP   I-NP

Hau     adibide    bat    da  .
B-NP    B-NP       I-NP   B-VP
(this)  (example)  (an)   (is)



Basically, the algorithm visits the training examples repeatedly, in a number of passes or “epochs” over the training set. At each example the algorithm predicts the best phrase structure, and corrects the classifiers if the prediction was wrong, using a very simple rule. We will see in the experiments that the number of passes (epochs) is not critical at all.
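To make the training loop concrete, here is a minimal sketch of an averaged perceptron for a plain binary classifier. It is a simplification under stated assumptions: in the filtering-ranking system the updates are driven by errors in the predicted phrase structure, whereas this example uses invented feature vectors and labels in {-1, +1}.

```python
def train_averaged_perceptron(examples, n_features, epochs=10):
    """examples: list of (feature_vector, label), label in {-1, +1}.
    Returns the average of the weight vectors seen during training."""
    weights = [0.0] * n_features
    totals = [0.0] * n_features  # running sum of the weight vector
    steps = 0
    for _ in range(epochs):
        for features, label in examples:
            activation = sum(w * x for w, x in zip(weights, features))
            if label * activation <= 0:
                # Mistake: move the weights towards the correct label.
                weights = [w + label * x for w, x in zip(weights, features)]
            totals = [t + w for t, w in zip(totals, weights)]
            steps += 1
    return [t / steps for t in totals]


def predict(weights, features):
    """Sign of the dot product between weights and features."""
    return 1 if sum(w * x for w, x in zip(weights, features)) > 0 else -1


# Hypothetical, linearly separable toy data: [bias, x1, x2].
examples = [
    ([1, 2, 0], 1),
    ([1, 0, 2], -1),
    ([1, 3, 1], 1),
    ([1, 1, 3], -1),
]
averaged = train_averaged_perceptron(examples, n_features=3, epochs=10)
```

On this toy data the weights stop changing after the first pass, which is in line with the observation that further epochs bring little change once the model has stabilised.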

Carreras, Màrquez and Castro (2005) obtained, to date, the third best results for chunking and the best ones for clause identification with this system.

3 Experimental setup

3.1 The corpus

The part of the EPEC corpus mentioned above, with around 25,000 tokens, was used in this experiment: a very small corpus compared with the 200,000-token corpus used in the CoNLL 2000 and 2001 shared tasks (Tjong Kim Sang and Buchholz, 2000; Tjong Kim Sang and Déjean, 2001). The EPEC corpus was then converted to the CoNLL format in order to use the filtering-ranking architecture.

For chunking, the train and the test data consisted initially of three columns (word, part-of-speech and chunk tag). The chunk tags contain the name of the chunk type: B-NP or I-NP for noun phrase words, and B-VP or I-VP for verb phrase words. B-CHUNK tags mark the first word of a chunk, and I-CHUNK tags mark every other word in the chunk. The O chunk tag is used for those tokens which are not part of any chunk. Figure 3 shows an example of the file format for the sentence “Niregana abiatu zen” (“He/She came to me”):

Figure 3: CoNLL 2000 file format for learning chunks: word, pos and chunk tags in each line.
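The three-column format can be read with a few lines of code. The following sketch, whose function name and output representation are assumptions made for illustration, parses lines in this format and groups the B-/I- tags into chunks, using the sentence of Figure 3 as input.

```python
def read_conll_chunks(lines):
    """Parse 'word pos chunk_tag' lines.

    Returns (tokens, chunks), where chunks are (type, start, end)
    tuples with inclusive token indices.
    """
    tokens, chunks = [], []
    current = None  # [type, start index] of the chunk being built
    for line in lines:
        word, pos, tag = line.split()
        index = len(tokens)
        tokens.append((word, pos, tag))
        if tag.startswith("I-") and current and current[0] == tag[2:]:
            continue  # the open chunk simply grows by one token
        if current:  # any other tag closes the open chunk
            chunks.append((current[0], current[1], index - 1))
            current = None
        if tag.startswith(("B-", "I-")):
            current = [tag[2:], index]  # a new chunk starts here
    if current:
        chunks.append((current[0], current[1], len(tokens) - 1))
    return tokens, chunks


figure3 = [
    "Niregana IOR B-NP",
    "abiatu ADI B-VP",
    "zen ADL I-VP",
    ". PUNT_ O",
]
tokens, chunks = read_conll_chunks(figure3)
```

For this sentence the result is one noun chunk covering the first token and one verb chunk covering the second and third tokens.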

We have already mentioned that only noun chains and verb chains were tagged as chunks in the Basque corpus. It has to be taken into account that Basque is an agglutinative language and, therefore, prepositions come attached to the nouns or adjectives; that is, the prepositions of other languages such as English or Spanish are expressed in Basque as declension marks. That is the reason why prepositional chains were not explicitly tagged. When the 300,000-word corpus, tagged at deep level, becomes available, we will be able to detect all types of chunks as in CoNLL 2000.

For clause identification, we used the same corpus as the one used for the chunking task. In this case, the train and the test data consisted, initially, of four columns separated by spaces (word, part of speech, chunk and clause tag). The clause tag may contain “(S*” as a start mark, “*S)” as an ending mark, or “*” for neither a start nor an ending mark. These tags may be combined recursively.

In Figure 4, we present a real example of the initial training corpus for the sentence “Ogia egunekoa al den galdetzen du.” (“He/She asks whether the bread is daily”). Word by word translation: “Ogia (the bread) egunekoa (daily) al den (whether is) galdetzen du (asks)”.

Figure 4: CoNLL 2001 format for clause identification: word, pos, chunk and clause tags in each line
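Because the clause tags can be combined recursively, they are conveniently decoded with a stack. The sketch below is an illustration only (the function name and the span representation are assumptions); it turns the clause column of the sentence in Figure 4 into a list of clause spans.

```python
def clause_spans(tags):
    """Turn CoNLL 2001 style clause tags into (start, end) token spans,
    with inclusive indices. Inner clauses are returned before outer ones."""
    spans, open_starts = [], []
    for index, tag in enumerate(tags):
        before, _, after = tag.partition("*")
        # Every "(S" before the asterisk opens a clause at this token.
        for _ in range(before.count("(S")):
            open_starts.append(index)
        # Every "S)" after the asterisk closes the most recent open clause.
        for _ in range(after.count("S)")):
            spans.append((open_starts.pop(), index))
    if open_starts:
        raise ValueError("unbalanced clause brackets")
    return spans


# Clause column of the sentence "Ogia egunekoa al den galdetzen du ."
figure4_tags = ["(S(S*", "*", "*", "*S)", "*", "*", "*S)"]
spans = clause_spans(figure4_tags)
```

The embedded clause covers tokens 0 to 3 and the main clause covers the whole sentence, tokens 0 to 6.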

3.2 Baselines

For chunking, the baseline results were obtained by selecting the chunk tag which was most frequently associated with the current part-of-speech tag, as in CoNLL 2000. We achieved 54.10% in F1, while 77.07% was obtained with the English training data in CoNLL 2000.
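The chunking baseline is simple enough to sketch in full. In the following example the function names are ours and the training pairs are invented for illustration; only the part-of-speech labels are borrowed from the figures in this paper.

```python
from collections import Counter, defaultdict


def train_pos_baseline(tagged_tokens):
    """tagged_tokens: iterable of (pos, chunk_tag) pairs.
    Returns a dict mapping each POS tag to its most frequent chunk tag."""
    counts = defaultdict(Counter)
    for pos, chunk_tag in tagged_tokens:
        counts[pos][chunk_tag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def apply_pos_baseline(model, pos_tags, default="O"):
    """Assign to each POS tag the chunk tag learned for it
    (or the default tag if the POS tag was never seen)."""
    return [model.get(pos, default) for pos in pos_tags]


# Hypothetical training pairs; the counts are invented.
training = [
    ("IZE", "B-NP"), ("IZE", "B-NP"), ("IZE", "I-NP"),
    ("ADI", "B-VP"), ("ADL", "I-VP"), ("PUNT", "O"),
]
model = train_pos_baseline(training)
predicted = apply_pos_baseline(model, ["IZE", "ADI", "ADL", "XYZ"])
```

Since such a baseline ignores all context, it cannot distinguish the first word of a chunk from the following ones when they share a part of speech, which is one reason why its scores are low.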

For clause identification, the baseline results were produced by a system which only put clause brackets around sentences, as in CoNLL 2001. We got 37.24% in F1, while 47.71% was obtained with the English training data in CoNLL 2001.

The difference between our results and the CoNLL ones, using the same baselines, shows the difficulty of our starting point. The small amount of training data available for Basque and the fact that it is an agglutinative, free-word-order language may explain this difference.

Niregana  IOR    B-NP
abiatu    ADI    B-VP
zen       ADL    I-VP
.         PUNT_  O

Ogia       IZE   B-NP  (S(S*
egunekoa   ADJ   B-NP  *
al         PRT   B-VP  *
den        ADT   I-VP  *S)
galdetzen  ADI   B-VP  *
du         ADL   I-VP  *
.          PUNT  O     *S)



4 Chunk identification for Basque

4.1 Initial experiments using filtering and ranking with perceptrons

The same features as those used in CoNLL 2000 were used in the initial experiments with FR-perceptrons: word, part of speech and chunk information. Table 2 shows the best results for the mentioned corpus, with the basic features and 10 epochs.

Table 2: Chunker results using the basic features (word, part of speech and chunk tag)

We noticed that the results do not vary much after epoch 10, and the improvements obtained by testing the model with further epochs are minimal. That is why we decided to tune the system using 10 epochs and to test the best system with more epochs at the end. Figure 5 shows the evolution of the performance of the best chunking system, using from 1 to 30 epochs. Note that the result does not improve by more than 0.5 points after epoch 10.

Figure 5: evolution of the performance (f-measure, y-axis from 86 to 90) depending on the number of epochs (x-axis from 1 to 29)

4.2 Improvements

We tried to do the stacking with new features extracted from the corpus: lemma, subcategory, declension information, subordinate clause marks and chunking information given by the rule sets. Table 3 shows the results for the chunks, which are a combination of the results of noun phrases and verb phrases:

features                    precision  recall   f-measure
bf                          72.78%     74.21%   73.49%
bf + sc                     75.33%     76.61%   75.96%
bf + decl                   87.91%     91.05%   89.45%
bf + l                      74.49%     76.79%   75.62%
bf + soc                    73.45%     76.00%   74.70%
bf + sc + decl. + l + soc   85.59%     90.48%   87.97%

Table 3: Basque chunker results, using different type of information (bf: basic features; sc: subcategory inform.; decl: declension inform.; l: lemma; soc: subordinate clause inform.)

As Table 3 shows, the best results are obtained using the declension information. Therefore, we decided to do the final experiment using the basic information plus the declension information and including, as a new feature, the information provided by the rule-based chunker (see section 1.1). This way we combined machine learning techniques with rule-based grammars and improved the results: 90.16% of f-measure.

4.3 Interpretation of results

The best results obtained in the chunking task are closely related to the target language. The fact that the best results are those in which we added the declension information is clear evidence of this. In fact, at least one of the words of a noun phrase in Basque has a declension mark. Moreover, the declension mark is easily detected by the morphosyntactic analyser. For example, “big dog” is “zakur handi”, and “with the big dog” would be “zakur handiarekin”. As the lemma of “handiarekin” is “handi”, we know that the word “handiarekin” has a declension mark. For the same reason, we know that “zakur” has no declension mark. Therefore, “zakur handiarekin” has to be a noun phrase.
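The reasoning in this example can be written as a tiny feature extractor. Note the assumption: approximating “has a declension mark” by “the word form differs from its lemma” is a deliberate simplification for illustration. In the experiments the declension information comes from the morphosyntactic analysis, not from this heuristic.

```python
def has_declension_mark(word, lemma):
    """Crude stand-in (an assumption of this sketch): treat any
    difference between word form and lemma as a declension mark."""
    return word.lower() != lemma.lower()


def declension_features(tokens):
    """tokens: list of (word, lemma) pairs.
    Returns a list of (word, has_mark) pairs."""
    return [(word, has_declension_mark(word, lemma)) for word, lemma in tokens]


# The example from the text: "zakur handiarekin" ("with the big dog").
features = declension_features([("zakur", "zakur"), ("handiarekin", "handi")])
```

Here the last word carries the mark, which is the cue that closes the noun phrase in the example.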

It seems clear that delimiting the part of speech and the declension mark facilitates the identification of the chunk. On one hand, the part of speech is important to detect verb phrases. On the other hand, the declension mark is crucial to detect noun phrases, as we can see in the detailed results of noun phrases, when adding the declension information as a new feature:

         precision  recall   f-measure
np       89.60%     92.49%   91.02%
vp       84.55%     88.64%   86.55%
chunks   87.91%     91.05%   89.45%

Table 4: Chunker detailed results using the basic features + declension info

         precision  recall   f-measure
np       68.12%     68.07%   68.09%
vp       81.42%     86.51%   83.88%
chunks   72.70%     74.03%   73.36%



Taking into account that the English corpus is more than 8 times bigger than the Basque one, the results for chunking in Basque are good.

Besides, we have improved the results of the rule-based chunker. However, it has to be taken into account that the test corpus used in both cases is not the same one.

Finally, the best results are obtained when stacking the rule-based chunker information on our learning algorithm. This way, we have shown that combining rule-based grammars with machine learning techniques improves the results. In summary, the most important results are compared in Table 5:

technique            features                   pr.      rec.     F1
Dependency grammar   -                          85.92%   83.46%   84.67%
FR-perceptron        bf                         72.78%   74.21%   73.49%
FR-perceptron        bf + decl                  87.91%   91.05%   89.45%
FR-perceptron        bf + decl + r.b.ch.info    88.29%   92.11%   90.16%

Table 5: Summary of the Basque chunking results (bf: basic features; decl: declension information; r.b.ch.info: rule based chunker information). The corpus used to evaluate the dependency grammar is not the same as the one used to evaluate the FR-perceptron.

We have also analysed the influence of the corpus size to deduce how much the results could increase if we could get a bigger corpus. Best results with the 25%, the 50%, the 75% and the 100% of the initial training corpus are in Table 6:

        Precision  Recall   F-Measure
25%     84.73%     85.56%   85.14%
50%     86.97%     89.78%   88.35%
75%     88.02%     90.95%   89.46%
100%    87.91%     91.05%   89.45%

Table 6: evolution of the performance depending on the size of the training corpus (with basic features and declension information)

Although the results show little improvement, we think that the corpus is too small to draw firm conclusions: the training corpus only has 15,000 tokens. Therefore, we are planning to try with a considerably bigger corpus.

However, the results presented here are not fully realistic, since the training corpus was manually tagged. For novel texts, we will have to use the morphosyntactic analyser for Basque in order to get the necessary linguistic information, which will cause a small decrease in the results.

5 Clause identification for Basque

5.1 Initial experiments using filtering and ranking

The same features as those used in CoNLL 2001 were used in our initial experiments in clause identification with FR-perceptrons: word, part of speech, chunk information and clause information. We will call them the basic features.

Initially, we trained the filtering-ranking algorithm for 10 epochs only (see Table 7).

          Precision  Recall   F-Measure
clauses   63.67%     41.67%   50.37%

Table 7: Results for basic features

5.2 Improvements

We tried to improve the results by stacking the system with new features obtained from the Basque corpus: subcategory, declension information, lemma, information of subordinate clauses and the combination of all the features.

We also did the stacking, adding the information of clause splits, provided by the rule-based grammar (see section 1.1), which improves the results considerably.

Finally, we adapted to Basque a set of features of FR-Perceptron that look for lexical units that trigger clauses. For English, these features look for relative pronouns such as “that”, “which” or “who”. We created the Basque counterparts for these features, with patterns looking for “non”, “zein”, “zeinaren”... We call these features “Basque trigger words”. The results are shown in Table 8.
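A trigger-word feature of this kind could look as follows. This is a hedged sketch rather than the feature set actually used: the trigger list contains only the three examples named in the text, the function name is ours, and the token list is an arbitrary sequence of tokens, not a real sentence.

```python
# Only the examples named in the text; the full pattern list is not given.
BASQUE_TRIGGERS = {"non", "zein", "zeinaren"}


def trigger_features(words, triggers=BASQUE_TRIGGERS):
    """Return one binary feature per token: 1 if the token is one of
    the trigger words (case-insensitive match), 0 otherwise."""
    return [1 if word.lower() in triggers else 0 for word in words]


# Arbitrary token list used only to exercise the function.
flags = trigger_features(["Zein", "liburu", "non", "dago"])
```

Such a feature would be passed to the classifiers together with the other columns.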

features                          Prec.    Rec.     F1
bf                                63.67%   41.67%   50.37%
bf + sc                           63.43%   44.85%   52.55%
bf + d                            63.70%   43.87%   51.96%
bf + l                            63.18%   45.22%   52.71%
bf + soc                          64.13%   44.48%   52.53%
bf + sc + d + l + soc             65.21%   49.39%   56.21%
bf + sc + d + l + soc + cl        67.47%   51.35%   58.32%
bf + sc + d + l + soc + cl + b    69.43%   51.23%   58.96%

Table 8: Stacking clause identification system (bf: basic features; sc: subcategory info; d: declension info; l: lemma; soc: subordinate clauses info; cl: rule-based clause identification system’s info; b: Basque triggers)

5.3 Interpretation of the results

It seems that the small corpus we have for the Basque language is the main cause of the low results in comparison with the English ones (see Table 9). Besides, our preliminary experiments suggest that the clause structure for Basque is very difficult to recognize with partial parsing methods. It has to be pointed out that Basque is a free order language, and therefore sentences may be structured in many different ways. The recursive character of clauses does not make this task easier either.

                                Prec.    Rec.     F1
English clause identification   87.99%   81.01%   84.36%
Basque clause identification    69.43%   51.23%   58.96%

Table 9: Comparing Basque and English results on clause identification task

The linguistic features added one by one (subcategory, lemma, declension mark, subordinate clause mark) do not improve the results much. However, when adding them all together, we get an improvement of 6 points with regard to the results obtained using the basic features. Our hypothesis that subordinate clause marks would notably improve the results has not proved completely correct: we obtain the same improvement by adding, for instance, subcategory information. It seems that the more linguistic information we add, the better the results we obtain. In this sense, we plan to add dependency information, once the Basque automatic parser provides it.

On the other hand, an improvement of two points is achieved when adding the information of the rule-based grammar developed to detect clause splits. This is not a major improvement either, but it is another small step forward.

But, as mentioned, the results are quite low compared with the English ones, and one of the reasons seems to be the size of the corpus. That is why we have analysed its influence, measuring the difference between the results obtained with the entire training corpus and the ones obtained using different proportions of the initial corpus. We wanted to deduce how much the results could increase if we could try with a bigger corpus.

Our corpus might be too small to draw any important conclusion, but there seems to be considerable room to improve the results by increasing its size. In fact, there is a 2-point improvement between using 50% of the training corpus and using 100%: quite a big improvement after adding only about 7500 tokens. See Table 10 for more details.

        Precision  Recall   F-Measure
25%     67.94%     48.04%   56.28%
50%     69.31%     48.16%   56.83%
75%     67.99%     50.24%   57.79%
100%    69.43%     51.22%   58.96%

Table 10: influence of the size of the corpus, for clause identification

6 Conclusions and future work

We have used the filtering-ranking architecture with perceptrons to obtain a competitive chunker and a clause identification system for Basque. In spite of using a corpus 8 times smaller than the English one, we have achieved good results for chunking by adding new linguistic features. The results for the clause identification system are quite low, although we have improved the initial results by stacking the system with linguistic information, derived sometimes from rule-based grammars. Nevertheless, our preliminary experiments suggest that, Basque being a free order language, this task is more difficult for successful learning, given the available resources.

We have also shown that, both in chunk and in clause identification, results are improved by combining rule-based grammars with machine learning techniques.

In the future, we plan to use a bigger corpus to improve the results. We hope that the 300,000-word corpus will be tagged within a short period of time. Besides, we are going to add more features, once the Basque automatic parser provides more linguistic information.

We are also going to include the chunker presented here in the shallow parser for Basque, and we will do the same with the clause identification system, if we obtain competitive results. As a consequence, we hope that the grammar checker will also be improved. Besides, a good clause identification tool would help us to detect incorrect commas. For that purpose we would have to take into account that all commas would have to be removed from the training corpus when learning clauses.

These experiments were done using information extracted from a manually tagged corpus. In order to get realistic results, we will have to use a corpus where the linguistic information is obtained with the automatic parser for Basque.



Acknowledgments

We would like to thank Edurne Aldasoro for her help when tagging the corpus.

Research partly funded by the Basque Government (Department of Education, University and Research, IT-397-07), the Spanish Ministry of Education and Science (TIN2007-63173) and the ETORTEK-ANHITZ project from the Basque Government (Department of Culture and Industry, IE06-185).

Xavier Carreras was supported by the Catalan Ministry of Innovation, Universities and Enterprise.

Bibliography

Aduriz I., Aranzabe M., Arriola J., Atutxa A., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Oronoz M., Soroa A., Urizar R. 2006. Methodology and steps towards the construction of EPEC: a corpus of written Basque, tagged at morphological and syntactic levels for the automatic processing. Corpus Linguistics around the World.Book series: Language and Computers. Vol 56 (1-15). Ed.Wilson, Rayson and Archer. Netherlands.

Aduriz I., Aranzabe M., Arriola J., Díaz de Ilarraza A., Gojenola K., Oronoz M., Uria L. 2004. A Cascaded Syntactic Analyser for Basque. Computational Linguistics and Intelligent Text Processing.Pgs.124-135. LNCS Series. Springer Verlag. Berlin.

Aduriz I., Díaz de Ilarraza A. 2003. Morphosyntactic disambiguation and shallow parsing in Computational Processing of Basque. Inquiries into the lexicon-syntax relations in Basque. Bernard Oyharçabal (Ed.). University of the Basque Country. Bilbo.

Alegria I., Artola X., Sarasola K. 1996. Automatic morphological analysis of Basque. Literary & Linguistic Computing Vol. 11, No. 4, 193-203. Oxford University Press. Oxford. 1996.

Ansa O., Arregi X., Arrieta B., Ezeiza N., Fernandez I., Garmendia A., Gojenola K., Laskurain B., Martínez E., Oronoz M., Otegi A., Sarasola K., Uria L. 2004. Integrating NLP Tools for Basque in Text Editors. Workshop on International Proofing Tools and Language Technologies. University of Patras. Greece.

Carreras X., Màrquez L. and Castro J. 2005. Filtering-Ranking Perceptron Learning for Partial Parsing. Machine Learning Journal, Special Issue on Learning in Speech and Language Technologies, Vol. 60, Issue 1-3, pgs. 41-71.

Carreras X. 2005. Learning and Inference in Phrase Recognition: A Filtering-Ranking Architecture using Perceptron. PhD. Polytechnic University of Catalunya.

Collins M. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002.

Díaz de Ilarraza A., Gojenola K., Oronoz M. 2005. Design and Development of a System for the Detection of Agreement Errors in Basque. CICLing-2005. Mexico.

Ezeiza N., Aduriz I., Alegria I., Arriola J.M., Urizar R. 1998. Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages. COLING-ACL'98. Vol.1. Pgs.380-384 Montreal (Canada).

Freund Y. and Schapire R. E. 1999. Large margin classification using the perceptron algorithm. Machine Learning:37(3):277-296

Kudo T. and Matsumoto Y. 2001. Chunking with Support Vector Machines. Proceedings of NAACL 2001, Pittsburgh, PA, USA.

Tjong Kim Sang E.F. and Buchholz S. 2000. Introduction to the CoNLL-2000 Shared Task: Chunking. Proceedings of CoNLL-2000 and LLL-2000. Lisbon. Portugal.

Tjong Kim Sang E.F. and Déjean H. 2001. Introduction to the CoNLL-2001 Shared Task: Clause Identification. In: Proceedings of CoNLL-2001, Toulouse, France.

Sha F. and Pereira F. 2003. Shallow parsing with conditional random fields. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Zhang, T., Damerau F. and Johnson D. 2002. Text Chunking based on a Generalization of Winnow. In Journal of Machine Learning Research, vol.2. Pgs. 615-637.



Analysis of Noun-Noun sequences1: a rule based approach

Análisis de secuencias N-N: un enfoque con gramáticas basadas en reglas

Jose Mari Arriola

UPV/EHU-Basque Philology Department School of Economy and Business

Oñati Plaza 1, 20018 Donostia [email protected]

Juan Carlos Odriozola

UPV/EHU-Basque Philology Department Science and Technology Faculty

Leioa 48940 (Bizkaia) [email protected]

Abstract: This paper reports on work in progress to improve shallow parsing for Basque. The practical goal of our work is to enrich the information of the shallow parser with linguistic information for analyzing sequences containing an N that instantiates a kind of quantification of the other nominal constituent, by means of some different syntactical structures.

Keywords: shallow parsing, noun phrase chunking.

Resumen: El artículo presenta el trabajo para mejorar el parser superficial del euskara. El objetivo práctico del mismo, consiste en enriquecer dicho parser con la información lingüística pertinente para analizar secuencias que contienen un elemento nominal que instancia por medio de diversas estructuras sintácticas algún tipo de cuantificación de un segundo N.

Palabras clave: parsing superficial, chunks de sintagmas nominales.

1 This research is supported by grants no. HUM2004-05658-C02-01, UPV 1/UPV 00113.310-H-15921/2004 and EHU06/16, HUM2004-05658-C02-01 and EHU06/16. We also acknowledge the support of the Government of the Basque Country to the IXA group.

1 Introduction

The general framework of the research work and implementation reported here is the syntactic-processing system of Basque (see Figure 1). We are working on a robust parsing scheme that provides syntactic annotation in an incremental fashion: once textual input has been tokenized, morphologically analyzed and disambiguated, syntactic annotation is added in two distinct stages of processing. First, a chunk parser provides a partial constituent analysis. In a second stage, the chunked input is further annotated by dependency links.

[Figure 1 diagram. Recoverable labels: Raw data; Morphosyntactic parsing; Morpheus; EUSLEM; Disambiguation using linguistic information; Disambiguation using statistical information; Syntactic tagging; Shallow syntactic parsing; Chunker; Named Entities; Postpositions (CG); Noun and verb chains (CG); Dependencies; Tagging of syntactic dependencies (CG); CG; xfst; Shallow parsing; Deep parsing; Analysed text]

Figure 1: General framework

Procesamiento del lenguaje Natural, no. 41 (2008), pp. 13-19. Received 7-05-2008; accepted 16-06-2008


This paper addresses one specific subtask in the overall parsing scheme: improvement of the shallow analysis of a subclass of nouns involved in two specific syntactic structures that bear a kind of quantification. These syntactic structures require specific syntactic rules that qualitatively improve the NP chunking step, in that these rules are applied only to a subclass of nouns that are close to quantifiers. We will therefore be concerned with a lexical/functional class that is not as wide as the entire noun class. Nevertheless, this noun class is wider than the closed functional class of quantifiers. We will describe in detail the linguistic information that can be applied for dealing with quantifying N-N sequences instantiated in both quantifying compounds and measure phrases.

Noun sequences have certain characteristics that hinder automatic interpretation. Firstly, the creation of noun sequences is highly productive in Basque; it is not possible to store all the noun sequences that will be encountered while processing text. Basque N-N sequences could be said to pattern with English counterparts, and hence, most of the N-N sequences can be interpreted or processed as restriction readings of the second N, which is the head in the constructions of both languages. Indeed, an itsasgizon, lit. ‘seaman’, is a kind of gizon ‘man’. Secondly, their interpretation is not always recoverable from syntactic or morphological analysis. It is well known that most of these sequences bear a non-compositional meaning: baserri, lit. ‘wood town’, means ‘farm’. Crucially, some of the Basque N-N sequences contain nominal heads that rarely appear in English counterparts. These Ns have not undergone lexicalisation. On the contrary, they are similar to quantifiers, in that they force quantifying readings, i.e., not lexical but functional features seem to appear, for instance: esne pixka bat, lit. ‘milk bit a’, or mutil mordo bat, lit. ‘boy lot a’.

Thirdly, some of the right Ns in the sequences supposed to be compounds (esne-botila bat, lit. ‘milk bottle one’) bear two kinds of readings (lit. ‘to drink/to break one milk bottle’), but they also appear in a totally different structure of the type botila bat esne, lit. ‘bottle one milk’, which takes a single content reading. For all the other nouns outside this subclass (cf. itsasgizon ‘seaman’, baserri ‘farm’), the first structure is not available, and in the second structure they bear a unique reading corresponding to themselves.

In this work, sequences such as esne-botila bat and botila bat esne will all be called quantifying N-N sequences, and botila, pixka and mordo will be quantifying nouns.

The practical goal of our work is to enrich the information in mapping rules of the chunk parser with linguistic information dealing with quantifying N-N sequences.

This approach could be useful for additional processing (deep syntax) or for end applications (data mining, IR, etc.).

The rest of this paper is organised as follows: Section 2 describes the previous work on chunking; Section 3 describes the linguistic knowledge needed for improving the NP chunking; Section 4 is devoted to specifying how to apply the linguistic information in a rule-based approach; Section 5 shows the experiments performed. Finally, some conclusions are outlined in Section 6.

2 Previous work

There is an extensive bibliography on the processing of nominal sequences, based on (sub)categorization features of the constituents (Barker, 1998), (Takeuchi et al., 2001), (Flickinger & Bond, 2003). Ngai and Yarowsky (2000), on the other hand, present a comprehensive empirical comparison between two approaches for developing a base noun phrase chunker: human rule writing and active learning using interactive real-time human annotation.

In this section we describe the main steps followed in our shallow syntactic analysis of the corpus. The basis of the analysis is the morphological analyser (Alegria et al. 1996) and the disambiguation grammar (Aduriz et al. 2000). Using eliminative linguistic rules or constraints, contextually illegitimate alternative analyses are discarded by means of Constraint Grammar (CG) (Karlsson et al. 1995). This gives us almost fully disambiguated sentences, with one interpretation per word-form and one syntactic-tag label. But there are word-forms that are still morphologically and syntactically ambiguous. At this point we are aware that shallow syntax is the best approach for robust syntactic parsing. As our base, we took the surface-oriented syntactic tags in order to analyze noun chains and verb chains. Despite the remaining ambiguity and errors, the identification of various kinds of chunks is reasonably straightforward. For this purpose we based our work on the syntactic function tags designed for Basque (Aduriz et al. 1997). We can divide these tags into three types: main function syntactic tags, modifier function syntactic tags and verb function syntactic tags. This distinction of the syntactic functions was essential for the CG-style subgrammars that contain mapping rules.

The first version of the shallow grammar was applied over a sample of 300 sentences (extracted at random from Euskal Hiztegia). This was manually checked and the proportion of sentences that had the noun and verb chains tags correctly assigned was 75% (Arriola et al., 1999). This grammar has recently been improved (Aranzabe et al. 2004).

At this stage we are concerned with noun chunks: those phrase units headed by a noun.

For this reason, we will explain the subgrammar for noun chunks. The assumption is that any word having a modifier function tag is linked to some word with a main syntactic function tag. Moreover, a word with a main syntactic function tag can by itself constitute a chunk or phrase unit.

The syntactic representation of noun chunks was based on the following syntactic tags:

• @ID>/ @<ID: pre/postmodifying determiner.

• @IA>/@<IA: pre/postmodifying adjective.

• @IZLG>/@<IZLG: pre/postmodifying noun complement.

• @KM>: modifier2 of the element containing the case and determination. This is the element with a main syntactic function tag.

Using this assumption we established three tags to detect this kind of chunk:

• %NCH: this tag is attached to words with main syntactic function tags that constitute a chunk by themselves.

• %INIT_NCH: this tag is attached to words with main syntactic function tags that are linked to other words with modifier syntactic function tags and constitute the initial element of a phrase unit.

• %FIN_NCH: this tag is attached to words with main syntactic function tags that are linked to other words with modifier syntactic function tags and constitute the end of a chunk.

2 Basque is a head-final language provided with postpositions, so that @<KM is not needed.

The aim of this subgrammar is to attach to each word-form one of those three tags in order to delimit the noun chunks. They make explicit the linking relations expressed by the syntactic functions. In Fig. 2 there is an example3 that shows how the analysis of the chunker is equivalent to the analysis of a sentence into phrases4:

"<Hipoteka-kreditu>" <INIT_CAP>   "hipoteka-kreditu" N @KM> %INIT_NCH     mortgage

"<zati>"   "zati" N @KM>                                                  piece

"<handi>"   "desobedientzia" N @<IA                                       big

"<bat>"   "bat" ADJ @<ID %FIN_NCH                                         one

"<ordainatzeke>"   "ordaindu" V @-FMAINV %INIT_VCH                        unpaid

"<dugu>"   "*edun" AUXV @+FAUXV %FIN_VCH                                  we have

"<$.>" <PUNCT_PUNCT>

Fig. 2. Analysis of chains. English translation on the right.
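Once the delimiting tags have been assigned, recovering the noun chunks is a matter of reading them off. The following sketch is only an illustration of that last step: the tagging itself is done by CG mapping rules in the system described here, and the function name and the (word, tag) input representation are assumptions of this example.

```python
def noun_chunks(tagged_words):
    """tagged_words: list of (word, chunk_tag) pairs, where chunk_tag is
    '%NCH', '%INIT_NCH', '%FIN_NCH', another tag, or None.
    Returns the noun chunks as lists of words."""
    chunks, start = [], None
    for index, (word, tag) in enumerate(tagged_words):
        if tag == "%NCH":
            chunks.append([word])  # a chunk made of a single word
        elif tag == "%INIT_NCH":
            start = index  # remember where the chunk begins
        elif tag == "%FIN_NCH" and start is not None:
            chunks.append([w for w, _ in tagged_words[start:index + 1]])
            start = None
    return chunks


# Chunk tags of the sentence analysed in Fig. 2.
figure2 = [
    ("Hipoteka-kreditu", "%INIT_NCH"),
    ("zati", None),
    ("handi", None),
    ("bat", "%FIN_NCH"),
    ("ordainatzeke", "%INIT_VCH"),
    ("dugu", "%FIN_VCH"),
]
chunks = noun_chunks(figure2)
```

The words between the initial and the final tag carry no chunk tag of their own; they belong to the chunk because of the modifier function tags that link them to it.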

3 Linguistic Knowledge

The linguistic information summarized in this section is based on the linguistic data provided by Odriozola (2006, 2007, 2008). Henceforth, we shall talk about both “quantifying nouns” and “quantifying sequences” as long as a measure noun and a measured noun are involved in a Basque syntactical structure.

3 Each syntactic function tag is prefixed by “@” in contradistinction to other types of tags. Some tags include an angle bracket, “<” or “>”. The angle bracket indicates the direction where the head of the word is to be found.

4 The syntactic structure of the noun chunk in other terms: [[[[hipoteka-kreditu] zati]Iz handi]NP bat D]DS.

Analysis of Noun-Noun sequences: a rule based approach


Some of the quantifying nouns appear both in constructions taken as compounds and in measure phrases. Others appear in only one of the two constructions. The distribution of the several nominal constituents will be taken into account here.

Following Solé (2002) and Odriozola (2008), we assume that there is a kind of Basque (measure) noun that individualizes mass nouns by means of the following syntactical patterns.

Content nouns are involved both in measure phrases headed by the mass noun (1) and in structures headed by the content noun itself, which have usually been taken as compounds (2):

(1) hiru botila esne (gozo)

three bottle milk (sweet)

‘three bottles of (sweet) milk’

(2) a Hiru esne (*gozo) botila apurtu ditugu

three milk sweet bottle broken AUX.

‘We broke three bottles of (sweet) milk’

b Hiru esne (*gozo) botila edan ditugu

three milk sweet bottle drunk AUX.

‘We drank three bottles of (sweet) milk’

It should be noted that the measure phrase in (1) actually contains two phrases, [hiru botila] and [esne (gozo)]. In any case, double readings are common in human languages (Castillo 2001). This is not so in the second Basque option. Furthermore, the so-called compound may bear either a container reading (2a) or a content reading (2b).

Unit nouns are involved in the measure phrases described above (3a). They rarely appear in quantifier compounds (3b):

(3) a bi litro esne

two liter milk

‘two liters of milk’

b %bi esne-litro

two milk liter

Some nouns are claimed to be grammaticalized to (a complex) quantifier, since they can only appear with the quantifier/determiner bat ‘one/a’:

(4) a *esne pittina

milk bit-DET

b esne pittin bat

milk bit one

‘a little bit of milk’

It is worth remarking that (quantifier) compound-like constructions of this kind can only take a conceptual reading corresponding to the left constituent, which is presumed not to be the head.

Following Solé (2002) we assume that there are some Basque collectivizing nouns that head (quantifier) compounds similar to those headed by the individualizing nouns. Needless to say, the left constituent here is a countable noun, and the reading often corresponds to the non-head constituent:

(5) a mutil mordo bat

boy lot one

‘a lot of boys’

b mutil mordoa

boy lot-DET

It should be observed that this kind of quantifying noun is never totally grammaticalized and always accepts the attached determiner –a, as standard nouns do. On the other hand, some such nouns may take a reading that is somewhat independent of the left constituent and may even force singular agreement in the verb:

(6) mutil mordoa etorri dira

boy lot-DET come-PERF AUX-PL

‘A lot of boys came’

(7) mutil taldeak ondo jokatu du

boy team-DET well play-PERF AUX-SING

Nouns like parte, zati ‘piece’ and tarte ‘interval’ express a non-specific part of a whole that can be a mass, as in ogi zati (lit. ‘bread piece’), or something that is subcategorized as a mass, as in opil zati (lit. ‘muffin piece’). Sequences of this type rarely allow more than two elements in Basque. However, the language allows this kind of left component when the right component is either a part noun or a collective noun. The ability of both part and collective nouns to allow a noun phrase to the left is evidence of the non (clear) compound nature of Basque quantifying N-N sequences.

4 Rule based grammar

The linguistic information described above has been implemented by means of CG style mapping rules for adequately analyzing the cases established before.

In order to maintain coherence in quantifying relation when the element carrying the quantifying information is a noun, we decided to include new syntactic function tags: @<NQ which stands for a noun quantifier that modifies the noun to the left; and @NQ> for a noun quantifier that modifies a noun to the right.

In the case of the quantifying compounds, the @<NQ tag will be attached to the second element in the construction as a quantifier of the first nominal element. We deal with particular N-N sequences where the second nominal element, supposed to be the head of the construction (see footnote 5), somehow instantiates a quantification of the first nominal element, so the reading actually corresponds to the non-head constituent. This function will be attached to those quantifying nouns that have been detected, for instance: part nouns (zati), collective nouns (mordo) or complex determiners (pittin bat). Here are some examples:

i) [Hipoteka-kreditu @KM> zati @<NQ bat @OBJ]

ii) [Mutil @KM> mordo @<NQ bat @OBJ @SUBJ]

iii) [Esne @KM> pittin @<NQ bat @OBJ @SUBJ]

5 N1-N2 sequences described in Basque are of two types: either N1 syntactically and semantically depends on N2, or the dependency cannot be checked. Unlike Romance languages such as Spanish, Basque rarely produces left-headed N-N sequences.

The @NQ> tag will be attached to the first element of the construction as a quantifier of the second nominal element. This function will be attached to those quantifying nouns that have been detected, for instance: content nouns that can also be involved in measure phrases (botila) or unit nouns (litro). For instance:

iv) [Botila @NQ> bat @ID> esne @OBJ @SUBJ]

v) [Bi @ID> litro @NQ> esne @OBJ @SUBJ]

The analysis introduces some idiosyncratic constructions, the noun quantifying rules, which link together syntax and morphology. Combined with existing rules, the new rules account for both the distributional and the agreement idiosyncrasies.
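The effect of these mapping rules can be illustrated with a minimal Python sketch. It is not Constraint Grammar code and it is not the rule set described in this paper: the function name, the two word lists (filled only with nouns taken from the examples above) and the purely positional conditions are assumptions made for illustration.

```python
# Hypothetical word lists, filled only with nouns mentioned in the examples above.
LEFT_QUANTIFIERS = {"zati", "mordo", "pittin"}   # quantify the noun to their left  -> @<NQ
RIGHT_QUANTIFIERS = {"botila", "litro"}          # quantify the noun to their right -> @NQ>


def tag_quantifying_nouns(chunk):
    """Return (word, tag) pairs for a noun chunk; tag is None when no rule applies."""
    words = [w.lower() for w in chunk]
    tagged = []
    for i, word in enumerate(words):
        tag = None
        if word in LEFT_QUANTIFIERS and i > 0:
            # Something stands to the left that can be quantified.
            tag = "@<NQ"
        elif word in RIGHT_QUANTIFIERS and i < len(words) - 1:
            # Something stands to the right that can be quantified.
            tag = "@NQ>"
        tagged.append((chunk[i], tag))
    return tagged


print(tag_quantifying_nouns(["Mutil", "mordo", "bat"]))
print(tag_quantifying_nouns(["Bi", "litro", "esne"]))
```

A real mapping rule would also check the morphosyntactic context (category, case, neighbouring function tags) before attaching a tag; the sketch only shows the direction encoded by each of the two tags.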

5 Experiments

We divided the available data into a training set and a test set, trained the CG grammar on the training set and compared the results on the test set. These rules were formulated, implemented and tested using selected examples from the Twentieth Century Basque Corpus (see footnote 6).

In addition, we have taken a sample of 1,737 noun chunks from EPEC, a corpus of standard written Basque that has been manually tagged at different levels (morphology, surface syntax, phrases).

Our results were obtained after applying the mapping rules to the output of the noun chunker. The parser labels the selected examples by attaching every quantifying noun to the noun head by means of the corresponding quantifying function tag (@<NQ, @NQ>). The previous noun chunk function tags did not distinguish between quantifying nouns (@<NQ, @NQ>) and case-marker modifier nouns (@KM>), but this difference does not affect the segmentation into noun chunks. An example is the [Hipoteka-kreditu zati bat] chunk mentioned above. The operation of text chunking, which consists of dividing a text into syntactically correlated groups of words, has not changed.

6 http://www.euskaracorpusa.net/XXmendea/Konts_arrunta_fr.html.



However, since there were no syntactic function tags to distinguish quantifying nouns, the parser did not know which element was the head: “hipoteka-kreditu” or “zati”. The nouns labelled with the noun quantifying tags are similar to quantifiers, in that they force quantifying readings. In fact, zati cannot be the lexical head. From a more formal point of view, zati would be the functional head of the head construction, whereas the other nominal would be the lexical head.

We have evaluated the precision (correctly tagged quantifying nouns/total number of quantifying nouns) and recall (relevant tagged quantifying nouns/actual quantifying nouns in the corpus). For the quantifying noun tags, precision and recall were 93% and 100% respectively. The errors are due to the remaining ambiguity in the morphosyntactic analysis.
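The two measures can be written down as a short Python sketch. We assume here that the denominator of precision is the number of nouns that actually received a quantifying tag, which is our reading of the definition above; the function name and the counts in the usage line are made up for illustration and are not the figures of the experiment.

```python
def precision_recall(correct, tagged, actual):
    """Compute precision and recall for the quantifying noun tags.

    correct: quantifying nouns that received the right tag
    tagged:  all nouns that received a quantifying tag (assumed denominator of precision)
    actual:  quantifying nouns really present in the corpus
    """
    precision = correct / tagged if tagged else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall


# Made-up counts, for illustration only.
p, r = precision_recall(correct=9, tagged=10, actual=9)
print(f"precision={p:.2f} recall={r:.2f}")
```

With this reading, a recall of 100% means that no quantifying noun in the sample was missed, while a precision below 100% means that some nouns received a quantifying tag they should not have.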

At the same time, we performed an experiment to test the idea that the noun quantifying tags give better results in determining the head of N-N sequences. We tried a CG style grammar to attach the head tag (&Head) and concluded that in most cases this grammar gives good results. However, there are also many examples where the shallow syntactic information we are using is not sufficient to determine the lexical head. For instance, in the case of container nouns that can appear in both quantifying compounds and measure phrases we have two readings (content and container). In these cases we need information about the verb. Similarly, in the case of collective nouns like talde we have two interpretations, for example:

1. neska taldea etorri zait ‘many girls have come to me’

2.a. zuzendari talde bat aukeratu dute

‘They have chosen some managers’

b. zuzendari-taldea aukeratu dute

‘They have chosen the management team’

3. c. Zuzendaritza-taldea aukeratu dute

‘They have chosen the direction team’

Both (1) and (2a) bear a quantified reading of the left constituent. (2b) seems not to bear a quantifying reading, and talde has a specific reading. Finally, (3c) takes a reading that is clearly non-quantifying.

We consider these results satisfactory as a first approach, even more so if we take into account the fact that the work is still in progress and also that, in some cases there is a lack of sufficient data in our corpus.


The rules have been implemented and tested in the CG grammar, a broad-coverage grammar of Basque. Our analysis supports the position that broad-coverage grammars will necessarily contain both highly schematic and highly idiosyncratic rules. Our approach to improving parsing is to modify the syntactic tagset and to add mapping rules that are used for attaching those new syntactic tags to quantifying nouns.

The results are satisfactory in the case of tagging quantifying nouns with the new syntactic function tags (@<NQ, @NQ>), and we achieve considerable improvement when head labelling is performed on noun chunks.

Apart from the evaluation of the results obtained by the mapping rules for detecting noun-modifier structures, we wanted to highlight the benefits of using the enhanced syntactic analysis for detecting the lexical heads of NPs. The previous shallow analysis of NPs did not include the noun-modifier function, so that in the case of those specific structures there is more than one candidate head. With the near perfect rates of recall and precision obtained in the analysis of those noun-modifier structures, the automated extraction of term candidates from text will be improved. Unfortunately, we have not yet been able to use the enhanced version in real situations of terminology work, so we cannot give exact figures.

Moreover, as mentioned before, the creation of noun sequences is highly productive in Basque; it is not possible to store all the noun sequences that will be encountered while processing. For this reason, as future work we plan to write some grammar rules for detecting previously undetected quantifying nouns.



Indeed, as far as these nouns are involved in quantification, we could assume that they belong to a subclass that is closed, although it must be large.

The information associated with these grammar rules is as follows:

• Lexical information: mass nouns, combined with the syntactic information on numerals and the overt plural agreement in the finite verb.

• Morphological information: particular morphology acting as an indicator, as in goilarakada bat azukre (lit. ‘spoonful one sugar’) or bi opil hiruren (lit. ‘one third of the muffin’), or specific collective nouns such as bikote ‘pair’ and hirukote ‘trio’.

• In hyponym/hypernym relationships, the class of beings expressed by one of the nouns is a subclass of that expressed by the other noun. This article is concerned with meronymic information, where one of the nouns expresses a part of that expressed by the other noun. We accept that both individualizing nouns and countable nouns that are to be collectivized bear a meronymic relation with mass nouns and collective nouns.

Finally, we wanted to emphasize the benefits of linguistically sound methods and formalisms as the core of the linguistic processors.

Bibliography

Aduriz I., Arriola J., Artola X., Díaz de Ilarraza A., Gojenola K., Maritxalar M., Euskararako murriztapen-gramatika: mapaketak, erregela morfosintaktikoak eta sintaktikoak, UPV/EHU/LSI/TR12-2000

Aduriz I., Arriola J., Artola X., Díaz de Ilarraza A., Gojenola K., Maritxalar M. 1997. Morphosyntactic disambiguation for Basque based on the Constraint Grammar Formalism. Proceedings of Recent Advances in NLP (RANLP97), 282-288. Tzigov Chark, Bulgaria.

Aranzabe M., Arriola J.M., Díaz de Ilarraza A. 2004. Towards a Dependency Parser of Basque. Proceedings of the Coling 2004 Workshop on Recent Advances in Dependency Grammar. Geneva, Switzerland.

Barker, K. 1998. A trainable bracketer for noun modifiers. Advances in artificial intelligence, vol. 1418.

Castillo, J.C. 2001. Thematic Relations between Nouns. Doctoral Dissertation, University of Maryland.

Flickinger, D. & Bond F., 2003. A Two-Rule Analysis of Measure Noun Phrases. Proceedings of the HPSG03 Conference, Michigan State University, East Lansing, ed. Stefan Müller.

Karlsson F., Voutilainen A., Heikkila J., Anttila A. 1995. Constraint Grammar: Language-independent System for Parsing Unrestricted Text. Mouton de Gruyter.

Kiyoko U., Koichi T., Masaharu Y., Kyo K., Teruo K. 2001. A Study of Grammatical Categories Based on Grammatical Features for Analysis of Compound Nouns in Specialized Field. Mathematical Linguistics, vol. 23, nº 1.

Ngai, G. and Yarowsky D. 2000. Rule writing or annotation: cost-efficient resource usage for noun phrase chunking. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 117-125, Hong Kong.

Odriozola, J.C., 2007. ‘(Basque) natural phrases for artificial languages’. Andolin Gogoan: Essays in Honour of Prof. Eguzkitza: 707-724.

Odriozola, J. C. 2007. Measure phrases in Basque. Lakarra & José Ignacio Hualde (eds.), Studies in Basque and Historical Linguistics in memory of R. L. Trask. Supplements of International Journal of Basque Linguistics and Philology, 40 (1-2): 739-762.

Odriozola, J.C., 2008. ‘Quantifier Compounds’ X.Artiagoitia & J.Lakarra (eds.). Goenagarentzako omenaldia: 503-518 (in press).

Solé, E., 2002. ‘Els noms col·lectius catalans. Descripció i reconeixement’. Doctoral Dissertation. Universitat Pompeu Fabra.



Dependency Grammars in Freeling


Jordi Carrera Univ. Politécnica de Cataluña

Dep. Lenguajes y Sistemas Campus Nord UPC, C/ Jordi Girona 1-3

[email protected]

Irene Castellón Universidad Barcelona

Departament de Lingüistica General, Gran Via de les Corts Catalanes 585

[email protected]

Marina Lloberes Universidad Barcelona

Dep. de Lingüistica General, Gran Via de les Corts

Catalanes 585 [email protected]

Lluís Padró Univ. Politécnica de Cataluña

Dep. Lenguajes y Sistemas Campus Nord UPC, C/ Jordi Girona 1-3

[email protected]

Nevena Tinkova Universidad Barcelona

Dep. de Lingüistica General, Gran Via de les Corts

Catalanes 585 [email protected]

Abstract: Automatic deep parsing is necessary for any NLP application requiring a certain level of semantic representation. One of the goals of the KNOW project is the development of wide-coverage deep parsing grammars whose outcome will be open to the scientific community. In this article we present an implementation of Spanish, Catalan and English grammars in the FreeLing environment. These three languages, together with Basque, are those we work on in KNOW. Keywords: NLP, automatic parsing, deep parsing, parsing grammar, semantic representation, Catalan, Spanish, English








1. Some Words on Dependency Parsing

Automatic deep parsing is necessary for any NLP application requiring some level of semantic representation. Although for some languages, such as English, there are several resources, such as Minipar (Lin, D. 1998), VISL (Bick, E. 2006), Connexor (Jarvinen, T. et al 1998) or Link Parser (Sleator, D. et al 1993), few broad-coverage grammars exist for Spanish and Catalan that deliver consistently good quality and can be efficiently embedded in NLP applications.

One of the goals of the KNOW project is the development of wide-coverage, deep parsing grammars whose outcome will be open to the scientific community. FreeLing (Atserias, J. et al 2006) includes a module for rule-based dependency parsing, named TXALA (Atserias, J. et al 2005). This module has been developed in the framework of OpenTrad, an Open-Source Machine Translation project funded by the Spanish Industry Ministry which aims at developing transfer translators for all official languages in Spain (Spanish, Catalan, Galician, and Basque), as well as English.

The results of wide-coverage analysers for Spanish (Bick, E., 2006; Ferrández, A. et al 2000; Marimon, M. et al 2007; Tapanainen, P., 1996) show that, although in many cases the analysis is correct, there are some shortcomings, such as the treatment of discontinuous constituents, infinitive clauses, the doubling of arguments in syntactic realization and the detection of multiword expressions.

On the other hand, these analysers are not open-source: Connexor grants a licence to researchers, but Hispal, which is the most refined, provides only parsed texts. This is why we believe it both a good idea and a necessary endeavor to create wide-coverage, open-source grammars for English, Catalan and Spanish.

In this article we present the parsing grammar implemented for each of these three languages which, together with Basque (Aranzabe, M. et al 2004; Bengoetxea, K. et al 2007), are those we are working on in the framework of the KNOW project.

The rest of the article is structured as follows: Section 2 briefly describes recent improvements in the TXALA analyser. Section 3 succinctly describes the problems posed by deep syntactic analysis and the resources needed to deal with them; each of the grammars is also broadly examined in this section. Section 4 includes some comments concerning evaluation aspects and, finally, in Section 5 we draw conclusions and trace out some ideas for further research.

2. Dependency Parsing with FreeLing

The TXALA parser is the last step in the FreeLing processing chain, and is preceded by:

• Sentence splitting

• Morphological analysis

• Shallow parsing

After the shallow parser produces sequences of subtrees (one for each chunk in the sentence), the dependency parser performs three actions:

1. Completion of the tree sequence into a full parsing tree.

This is done by means of manually defined rules. Each rule applies to a pair of consecutive subtrees, and is assigned a priority value. At each step, the rule with the highest priority is applied, and the affected pair of consecutive subtrees is fused into a single subtree.

The linguist defining the rules can specify conditions on each subtree head regarding its form, lemma, PoS, or word class (word classes may be defined by the grammarian as lemmata lists). Conditions on the context where the pair of chunks appears can also be specified such that the rule does not apply if conditions are not met.

2. Conversion of syntax tree to dependency tree.

At each level, the head node (marked as such in the manually defined rules) is set as the parent of all the trees below it.

3. Functional labelling of dependencies.

After the parse has been converted to a dependency tree, each dependency is labelled with its syntactic function. Another set of rules is applied where conditions are stated on both head and dependent nodes. Conditions range from morphosyntactic checks (e.g. lemma, relative position) to semantic properties (e.g. predefined classes, WordNet semantic files, EuroWordNet top-ontology features).
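The first two actions can be sketched in a few lines of Python. This is a simplified illustration, not the TXALA implementation: the `Subtree` class, the two rules, the reading of `top_left` and `top_right` as "the left (or right) subtree becomes the parent" and the convention that a lower number means a higher priority are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Subtree:
    label: str                      # chunk label, e.g. "sn" or "grup-verb"
    head: str                       # head word of the chunk
    children: list = field(default_factory=list)


# Hypothetical rules: (left label, right label, operation, priority).
# Assumption: a lower number means a higher priority.
RULES = [
    ("grup-verb", "sn", "top_left", 10),   # the verb takes the noun phrase to its right
    ("sn", "grup-verb", "top_right", 20),  # the verb takes the noun phrase to its left
]


def complete_tree(subtrees, rules=RULES):
    """Action 1: fuse adjacent subtrees, best-priority rule first, until no rule applies."""
    trees = list(subtrees)
    while len(trees) > 1:
        candidates = [
            (prio, i, op)
            for i in range(len(trees) - 1)
            for left, right, op, prio in rules
            if trees[i].label == left and trees[i + 1].label == right
        ]
        if not candidates:
            break
        _, i, op = min(candidates)
        left, right = trees[i], trees[i + 1]
        if op == "top_left":            # assumed meaning: the left subtree is the parent
            left.children.append(right)
            trees[i:i + 2] = [left]
        else:                           # "top_right": the right subtree is the parent
            right.children.insert(0, left)
            trees[i:i + 2] = [right]
    return trees


def dependencies(tree):
    """Action 2: list the (head word, dependent word) pairs of a completed tree."""
    pairs = []
    for child in tree.children:
        pairs.append((tree.head, child.head))
        pairs.extend(dependencies(child))
    return pairs


chunks = [Subtree("sn", "operaris"), Subtree("grup-verb", "pugen"), Subtree("sn", "caixes")]
(root,) = complete_tree(chunks)
print(dependencies(root))
```

Conditions on form, lemma, PoS, word class or context would be additional filters on the list of candidate rules; they are left out here to keep the control flow visible.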

The version of the parser presented in this paper contains several improvements with respect to the version described in (Atserias, J. et al 2005). As regards tree completion rules:

• Extension of the repertory of subtree-fusion operations.

• Possibility of specifying form, lemma, PoS or word class conditions on subtrees.

• Possibility of specifying context conditions (stated as labels corresponding to subtrees).

• Defining word classes via lists in external files.

Regarding dependency labelling rules, new conditions on headwords bounded by dependencies are allowed, including:

• EWN Top Ontology properties

• WN semantic file

• Synonyms

• Hypernyms’ synonyms

3. Deep parsing

When carrying out full syntactic analysis, sentences must be assigned some sort of semantic representation (more than one if ambiguous). A study carried out on data from Spanish concerning difficulties stemming from deep analysis (Tinkova, N. et al 2007) showed that the most complex phenomena to be solved were coordination, prepositional phrase attachment, inversion or constituent displacement, distinguishing between arguments and adjuncts and parsing subordinate clauses.

From a lexicalist standpoint, and as regards prepositional phrase attachment, crucial knowledge is provided by lexical heads. This kind of knowledge can be integrated in the form of a repertoire of syntactico-semantic structures (i.e. diathesis schemes) containing possible combinations of lexical heads with satellites.

Concerning coordination, this is a syntactic phenomenon with which we have dealt only partly and which requires extreme inter-rule synchronization. Complex situations arise in which coordinations must be resolved either before noun phrases (e.g. to create a compound subject) or after noun phrases and verb phrases and before sentence rules (e.g. not to create a compound subject but to coordinate two contiguous sentences). The coordinated elements must be abstracted away from, and the context of the conjunction taken into account, in order to prioritize some rules over the others.

3.1. Catalan dependency grammar

The Catalan dependency grammar consists of a set of 2,914 rules, of which 2,565 complete the parse tree by creating dependencies and the remaining 349 label these dependencies.

Catalan grammar treats dependency recursion and dependency relations between a) phrases, b) clauses headed by conjunctions or relative pronouns, c) non-finite clauses and d) punctuation marks.

Verb subcategorization frames created on the basis of the Volem Multilingüe database (Fernández et al 2002) determine chunk selection and chunk labelling conditions for transitive verbs, verbs with a wh- clause as an argument, ditransitive verbs, intransitive verbs, verbs modified by one prepositional phrase argument, verbs modified by two prepositional phrase arguments, impersonal verbs, copulative verbs, verbs with a second predicate and motion verbs.

One problem arose during deep parsing regarding prepositional phrase attachment and, specifically, the attachment of the preposition de (‘of’ or ‘from’). In Catalan, de-headed prepositional phrases can modify either a noun phrase or a verb phrase. Adding information about both verb behaviour and context allowed us to partly account for these problematic cases.

Sometimes, motion verbs code the source of the movement, which is expressed with a prepositional phrase headed by de. Therefore, defining a class of motion verbs allows dependency rules to be more fine-grained. However, given that de-phrases appear mostly after noun phrases, dependency rules for motion verbs yield only a partial solution. In this case, context conditions become essential to discriminate prepositional phrase attachment (a).

(a) Rules for attaching prepositional phrases to verb phrases:

grup-verb[mov] sp-de - top_left $_sn_$_grup-sp 17

grup-verb[mov] sp-de - top_left $$_grup-sp 17

Rule for attaching prepositional phrases to nominal phrases:

sn sp-de - top_left - 20

Although de-phrases with a verbal head have a higher priority than de-phrases with a noun head (a), there is one exception to this rule: it is possible to attach a de-phrase to a nominal head after a motion verb. Thus, the rule accounting for this case ((b), below) has a higher priority than rules dealing with prepositional phrases attached to verb phrases (the first rule in (a)):

(b) grup-verb[mov] sp-de - top_left $_sn_sp-de_$_grup-sp 21

sn sp-de - top_left $$_sp-de_grup-sp 13

This way, prepositional attachment is solved in a wide range of cases. Figure 1 shows the analysis of sentence (c).

(c) Els operaris pugen les caixes de les eines del soterrani a la terrassa. [Catalan]

Workers are taking the toolboxes up from the cellar to the balcony. [English]

Figure 1. Textual output of example (c)

grup-verb/top/(pugen pujar VMIP3P0 -) [
  sn/ncsubj-subjecte/(opreraris opreraris NCMP000 -) [
    espec-mp/det/(Els el DA0MP0 -)
  ]
  sn/dobj-objecte_directe/(caixes caixa NCFP000 -) [
    espec-fp/det/(les el DA0FP0 -)
    sp-de/ncmod/(de de SPS00 -) [
      sn/dobj-prep/(eines eina NCFP000 -) [
        espec-fp/det/(les el DA0FP0 -)
      ]
    ]
  ]
  sp-de/iobj-prep/(de de SPS00 -) [
    sn/dobj-prep/(soterrani soterrani AQ0MS0 -) [
      j-ms/det/(el el DA0MS0 -)
    ]
  ]
  grup-sp/iobj-prep/(a a SPS00 -) [
    sn/dobj-prep/(terrassa terrassa NCFS000 -) [
      espec-fs/det/(la el DA0FS0 -)
    ]
  ]
  F-no-c/ta/(. . Fp -)
]

Another troublesome analysis concerned wh-particles having multiple values. Some wh-particles introducing indirect questions can also introduce adverbial clauses, but whereas in the former case they must receive a direct object tag, in the latter case they must be labelled as verbal modifiers. In order to distinguish between both structures, a feasible solution consisted in listing verbs which usually take clausal direct objects (e.g. verba dicendi) and creating specific labelling rules for this type of verbs (d).

(d) Rules that assign direct object tag to wh-chunks:

grup-verb dobj d.label=subord d.side=right p.class=que

inf dobj d.label=subord d.side=right p.class=que

infinitiu dobj d.label=subord d.side=right p.class=que

subord-ger dobj d.label=subord d.side=right p.class=que

subord-part dobj d.label=subord d.side=right p.class=que

Rules that assign verbal modifier tag to wh-chunks:

grup-verb cc d.label=subord d.lemma!=que|qui p.class!=que

verb-pass cc d.label=subord d.lemma!=que|qui p.class!=que

inf cc d.label=subord d.lemma!=que|qui p.class!=que

infinitiu cc d.label=subord d.lemma!=que|qui p.class!=que

subord-ger cc d.label=subord d.lemma!=que|qui p.class!=que

subord-part cc d.label=subord d.lemma!=que|qui p.class!=que

These rules allow indirect speech to be labelled with direct object tags (d) and adverbial clauses with wh-particles to be labelled with verbal modifier tags (e), as can be seen in Figure 2.

(e) El consell econòmic assenyala quan va començar la recessió econòmica. [Catalan]

The economic council points out when the economic recession began. [English]

3.2. Spanish dependency grammar

As for Spanish, the TXALA dependency parser consists of 9,600 rules (9,245 parsing rules and 355 dependency rules) acting on a number of categories, such as noun, verb and prepositional phrases, pronouns, coordination, passive voice, punctuation and subordination. A rule applying to noun phrases is shown in (f). Its fields are, in order, the head, a modifier, a label denoting one child of the head, the function applied, the conditions (none here) and, finally, a priority index:

(f) sn grup-sp sn last_left - 200
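A rule line such as (f) can be split into its six fields with a short Python sketch. The field names below are our own reading of the description given for (f); they are not identifiers taken from FreeLing, and the parser shown here only illustrates the layout of the rule lines printed in this paper.

```python
from typing import NamedTuple, Optional


class CompletionRule(NamedTuple):
    left: str                 # first chunk label, optionally with a word class: grup-verb[mov]
    right: str                # second chunk label
    child: Optional[str]      # label denoting one child of the head ("-" when unused)
    operation: str            # fusion operation, e.g. top_left or last_left
    context: Optional[str]    # context condition ("-" when there is none)
    priority: int


def parse_rule(line):
    """Split one whitespace-separated rule line into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    left, right, child, operation, context, priority = fields
    return CompletionRule(
        left,
        right,
        None if child == "-" else child,
        operation,
        None if context == "-" else context,
        int(priority),
    )


rule = parse_rule("sn grup-sp sn last_left - 200")
print(rule)
```

The same function accepts the Catalan rules in (a) and (b), where the third field is empty and the fifth field carries the context condition.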



Figure 2. Parsing output of example (e)

Tag assignment is also carried out in the Spanish grammar using labelling rules:

(g) grup-verb sp-obj d.label=grup-sp|grup-sp-inf d.side=right d.lemma=a|al|para|hacia p.class=mov

(h) grup-verb iobj d.label=grup-sp|grup-sp-inf d.side=right d.lemma=a|para p.class=ditr

Rules (g) and (h) state that a prepositional phrase following a verb and introduced by one of the listed prepositions (a, al, para or hacia in (g); a or para in (h)) is assigned the prepositional object label (g) or the indirect object label (h), depending on the class of the verb. Before this distinction was set up, whenever TXALA found a prepositional phrase introduced by any of the aforementioned prepositions, it invariably labelled it as an indirect object.
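How such labelling rules select a function can be sketched in Python. The sketch assumes that `d.` conditions refer to the dependent node, `p.` conditions to the parent node, `|` separates alternative values, `!=` negates a condition (as in the Catalan rules) and the first matching rule wins; the dictionary representation of nodes and the function names are hypothetical.

```python
def parse_labelling_rule(line):
    """Split 'parent-label function cond cond ...' into its parts."""
    parent_label, function, *conditions = line.split()
    parsed = []
    for cond in conditions:
        negated = "!=" in cond
        key, values = cond.split("!=" if negated else "=", 1)
        parsed.append((key, set(values.split("|")), negated))
    return parent_label, function, parsed


def rule_applies(rule, parent, dependent):
    """parent and dependent are dicts such as {'label': ..., 'lemma': ..., 'class': ...}."""
    parent_label, _, conditions = rule
    if parent["label"] != parent_label:
        return False
    for key, values, negated in conditions:
        side, attribute = key.split(".", 1)      # 'd.lemma' -> ('d', 'lemma')
        node = dependent if side == "d" else parent
        if (node.get(attribute) in values) == negated:
            return False
    return True


def label_dependency(rules, parent, dependent):
    """Return the function of the first rule that applies, or None."""
    for rule in rules:
        if rule_applies(rule, parent, dependent):
            return rule[1]
    return None


RULES = [parse_labelling_rule(r) for r in (
    "grup-verb sp-obj d.label=grup-sp|grup-sp-inf d.side=right d.lemma=a|al|para|hacia p.class=mov",
    "grup-verb iobj d.label=grup-sp|grup-sp-inf d.side=right d.lemma=a|para p.class=ditr",
)]

pp = {"label": "grup-sp", "side": "right", "lemma": "a"}
print(label_dependency(RULES, {"label": "grup-verb", "class": "mov"}, pp))
print(label_dependency(RULES, {"label": "grup-verb", "class": "ditr"}, pp))
```

The same prepositional phrase thus receives a different function depending only on the class of the governing verb, which is the distinction rules (g) and (h) introduce.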

The Spanish grammar is being constantly updated. Incorporation of more complex subordination rules and verb subcategorization frames will result in increased coverage. Taking as a departure point the SenSem databank (Fernández, A. et al 2004), a ninefold typology of verbs was described (i.e. impersonal, intransitive, transitive, ditransitive, predicative, copulative, verbs followed by an argument wh-clause and verbs followed by either one or two argument prepositional phrases).

Solving prepositional phrase attachment is essential, since it is the cause of most syntactic misanalyses. As was the case for Catalan, attachment of de-phrases is of particular concern for Spanish as well, since these can act either as noun or as verb modifiers. Subcategorization information, together with context information, is expected to rule out wrong parses. In (i) and (j), PP-attachment rules are shown which have been enriched with contextual information. One screenshot of the output of the rule in (i) is given in Figure 3:

(i) sn sp-de - top_left $$_grup-verb 34

(j) grup-verb[mov] coor-sp - top_left $_sp-de_$ 741

As for function assignment, the parse in Figure 3 resulted from applying the rule in (k).

(k) sp-de prepos d.label=sn*

This rule labels the relation between the prepositional head and the head of the noun phrase immediately to its right.

Figure 3. Graphical output of rule (i) application

3.3. English dependency grammar

Dependency rules for the English grammar amount to circa 1,340. They proceed in the following way: <noun chunk, verb> pairs are combined first. After that, rules apply recursively until another such pair is found, and the process goes on iteratively until a full stop is found. Rules have been provided for all major kinds of clauses: declaratives, imperatives, interrogatives, completives, relatives, adverbial and existential. Analogously, separate verb phrase rules have been provided for intransitive, transitive and ditransitive sentences, including specific sets of rules for dealing with completive sentences. Sentences with ditransitive or higher valencies are treated formally as a subtype of transitive sentences.

A section was included in the grammar which contained a special kind of default dependency rules. These consisted broadly in heuristics intended to deal with relatively widespread cases of relatively unsystematic phenomena, i.e.:

• adjoining adverbs, prepositional phrases, etc. to their potential heads when in ambiguous syntactic positions, e.g. I_x approached the man_y on the chariot_x/y;

• preventing main verbs from taking other clauses’ direct objects as their subjects whenever they took as their subjects either other clauses having direct objects, or nominal subjects with embedded clauses, as in (l) and (m) (potentially mistakenly combined terms appear in bold):

l) The man who brought the book was interesting.

m) That he saw the man was uninteresting.

Besides, rules more often than not had to be multiplied. Since one given set of dependency rules would apply to a pair of chunks with a given priority, the same rules would not apply to plausible candidate expressions embedded in those chunks.

For instance, consider the example in (n), taken from Google:

n) The Astrakhan Region is capable of making products having an assured solvent demand in external market.

In (n), each -ing verb form takes its own direct object. However, the first pair of chunks, <making, products>, should be grouped only after the second pair, <having, demand>, has been grouped. With a single set of rules, since our algorithm proceeds from left to right, the leftmost <participle, noun chunk> pair is combined first, and the second modifier is left behind.

This forced us to use several sets of multiplier rules performing virtually identical operations at different priorities, thus causing a remarkable grammar redundancy.

Another distinctive feature of English grammar as opposed to Spanish and Catalan grammars consisted in subordinate clauses’



lacking subordinating connectors for either completive clauses (e.g. I’ve said he broke the car) or relative clauses (e.g. The man you saw was tall).

For these cases, long-range rules were created that swept for series of concatenated <noun chunk, verb> pairs along with any noun phrases intervening between them (including null events). This yielded more reliable <subject, verb> dependency extraction and, when conditioned on verbs taking completive sentences (e.g. say, think, etc.), this heuristic resolved a fairly large number of ambiguities, which is remarkable given its simplicity.

4. Evaluation

As yet, we have just finished version 1.1 of the Spanish, Catalan and English grammars, which we now intend to evaluate.

As for qualitative evaluation, a corpus has been created for each language. Text was extracted from newspaper articles and Internet websites. The corpora thus created vary in size: 50 sentences for Catalan, 100 for Spanish and 120 for English (these figures depend on the notion of sentence used). All of them contain a number of syntactic phenomena, e.g. clausal embedding, coordination, different subcategorization frames, different phrase structures, etc. During development, the corpora have been regularly analyzed as a testbed for the grammars, with the analysis results guiding subsequent implementations.

For the time being, the grammars are unable to tackle the following phenomena:

• Lexical coordination. Only some coordinations have been dealt with. We will keep expanding the number of cases covered with each successive update.

• Function assignment. When dependencies are assigned functional labels, information is necessary that the system is currently not sensitive to (e.g. PoS and morphological information for pronouns). New versions of the analyzer able to use this kind of data will have to be developed parallel to newer versions of the grammars.

• Constituent displacement has not been dealt with.

• Neither adverbial phrases nor adverbial sentences have been treated (i.e. the system is unable to tell adjuncts from arguments).

As for quantitative evaluation, so far we have been unable to carry out any such complete evaluation.

One of the main problems we face lies in the fact that analyses can differ substantially despite all of them being descriptively adequate.

In order to overcome this problem, corpora annotated according to the same formalism, in the same language and following the same grammatical criteria are required, which are usually unavailable.

Another problem lies in the fact that syntactic analysis takes as input the output of several previous processes (e.g. multiword detection, named entity recognition, morphological labelling, etc.). Since none of these is completely error-free, mistakes made at one point propagate to each subsequent step, so each of these processes requires an evaluation of its own prior to grammar evaluation proper.

For the languages we are currently working with, there exist several corpora that we intend to use (3LB, WSJ, CoNLL corpora). Our goal is to carry out the evaluation using some subset of each of these, but we must still study whether their formalism and criteria can be adapted to those used in the grammars presented here.

5. Conclusions and future work

In this article we have presented version 1.1 of the Spanish, Catalan and English grammars to be used in the framework of the KNOW project in order to develop a broad-coverage deep parser to be distributed as open source. We have also presented the most recent update of the TXALA parser, which features a number of improvements over its predecessor. There is ample room for improvement, however, especially as regards coordinations and constituent displacement for all three languages, and subcategorization frames for Spanish and English in particular. Likewise, subsequent improvements to the databases the grammars rely on will also lead to better performance.



On the other hand, coming up with evaluation metrics resting on a well-founded evaluation methodology constitutes another appealing line to deepen our present research.

Acknowledgements

This research has been funded by the Spanish Industry Ministry through the projects KNOW (TIN MEC 2006-1549-C03-02) and OpenTrad (PROFIT FIT-350401-2006-5), and by a Predoctoral Scholarship FI-IQUC granted by the Generalitat de Catalunya to Nevena Tinkova (2004FI-IQUC1/00084).

References

Aranzabe M., J.M. Arriola, and A. Díaz de Ilarraza. 2004. Towards a Dependency Parser of Basque. Proceedings of the Coling 2004 Workshop on Recent Advances in Dependency Grammar.

Atserias, J., E. Comelles and A. Mayor. 2005. TXALA un analizador libre de dependencias para el castellano. Procesamiento del Lenguaje Natural, n. 35, p. 455-456.

Atserias, J., B. Casas, E. Comelles, M. González, L. Padró and M. Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06).

Sleator, D. and D. Temperley. 1993. Parsing English with a Link Grammar. Third International Workshop on Parsing Technologies.

Bengoetxea K. and K. Gojenola. 2007. Desarrollo de un analizador sintáctico estadístico basado en dependencias para el euskera. XXIII Congreso de la SEPLN.

Bick, Eckhard. 2006. A Constraint Grammar-Based Parser for Spanish. In: Proceedings of TIL 2006 - 4th Workshop on Information and Human Language Technology.

Briscoe, E., J. Carroll, and R. Watson. 2006. The Second Release of the RASP System. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney.

Fernández, A., P. Saint-Dizier, G. Vázquez, F. Benamara and M. Kamel. 2002. The VOLEM Project: a Framework for the Construction of Advanced Multilingual Lexic. Proceedings of the Language Engineering Conference

Fernández, A., G. Vázquez and I. Castellón. 2004. Sensem: base de datos verbal del español. In G. de Ita, O. Fuentes, M. Osorio (eds.), IX Ibero-American Workshop on Artificial Intelligence, IBERAMIA.

Ferrández, A., M. Palomar and L. Moreno. 2000. Slot Unification Grammar and anaphora resolution. In: Recent Advances in Natural Language Processing. Nicolas Nicolov & Ruslan Mitkov (eds). John Benjamins: Amsterdam & Philadelphia, pp. 155-166.

Jarvinen T. and P. Tapanainen. 1998. Towards an implementable dependency grammar. CoLing-ACL'98 workshop 'Processing of Dependence-Based Grammars', Kahane and Polguere (eds), p. 1-10, Montreal, Canada.

Lin D. 1998. Dependence-based Evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems.

Marimon, M., N. Bel and N. Seghezzi. 2007. Test Suite Construction for a Spanish Grammar. In Tracy Holloway King and Emily M. Bender (eds.), Proceedings of the Grammar Engineering Across Frameworks (GEAF-2007) Workshop, CSLI Studies in Computational Linguistics ONLINE, pp. 250-264.

Tinkova, N. and I. Castellón. 2007. A Comparative Study of Parsers Outputs for Spanish. International Conference'07 Recent Advances in Natural Language (RANLP 2007).


http://visl.sdu.dk/pdf/TIL2006.pdf

Towards a Dependency Parser for Greek Using a Small Training Data Set∗

Jesús Herrera†, Pablo Gervás‡

†Dep. de Ingeniería del Software e Inteligencia Artificial

‡Instituto de Tecnología del Conocimiento

Universidad Complutense de Madrid

C/ Profesor José García Santesmases, s/n, E-28040 [email protected], [email protected]

Resumen: Se han llevado a cabo experimentos con el fin de determinar estrategias de interés para la construcción de un corpus de pequeño tamaño con el que poder entrenar un analizador de dependencias de precisión para el griego, mediante una herramienta de aprendizaje automático. Para ello se han estudiado empíricamente diferentes problemas como la cobertura sintáctica, el efecto del orden de las palabras o el efecto de la morfología en lenguas en las que los roles sintácticos se expresan morfológicamente. En función de los resultados obtenidos se pretende establecer los fundamentos para el desarrollo sistemático y efectivo para el desarrollo de analizadores de dependencias cuando no se dispone de grandes corpora de entrenamiento. Las ideas presentadas podrían ser utilizadas no sólo para el griego sino también para otras lenguas.

Palabras clave: Análisis de dependencias, corpus de entrenamiento, griego, aprendizaje automático

Abstract: Several experiments have been carried out in order to determine strategies for building a small corpus capable of accurately training a dependency parser for Greek, using a Machine Learning tool. To this end, several problems, such as syntactic coverage, the effect of word order and the effect of morphology in languages whose syntactic roles are expressed morphologically, are studied empirically. With the results presented we would like to lay the foundations for a systematic and effective way to develop dependency parsers when huge training corpora are lacking. The ideas outlined could be used not only for Greek but also for other languages.

Keywords: Dependency analysis, training corpus, Greek, machine learning

1 Introduction

In recent years, dependency parsing has been considered a useful tool for Natural Language Processing. This can be observed in several international meetings, such as the Recognizing Textual Entailment Challenge¹ or the Cross Language Evaluation Forum². Initially, dependency parsing tools were available only for English; for instance, Minipar (Lin, 1998) is perhaps the most widely used software for English dependency parsing. But the need for dependency parsing tools for every language considered in Natural Language Processing research led to the organization of international evaluation tasks devoted to dependency parsing tools for several languages. For example, in the CoNLL Shared Task dependency parsers were evaluated for the following languages: Arabic, Bulgarian, Czech, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish, in the 2006 edition; and Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian and Turkish, in the 2007 edition.

∗ We are very grateful to Xριστινα Γιαµαλη and Bασιλικη Γιαµαλη for their contribution to this work. This work has been partially supported by the Spanish Ministry of Education and Science (TIN2006-14433-C02-01 project).

¹ http://www.pascal-network.org/Challenges/RTE/

² http://www.clef-campaign.com

Dependency parsing tools for Greek were not documented until the CoNLL Shared Task 2007³. This suggests that there is room for research in this area. But annotated corpora with dependency analyses for Greek, necessary for training systems, are not freely available. In addition, as an example of the effort necessary for building a dependency-annotated corpus, Prokopidis et al. reported in (Prokopidis et al., 2005) the process of building the Greek Dependency Treebank, which was used to train Maltparser for its participation in the CoNLL Shared Task 2007. It required a full month of work by thirty annotators. The Greek Dependency Treebank contains 70,000 words. This situation gave rise to the idea of relying on an alternative kind of training based on small corpora. It is not easy to determine the wide range of cases involved in dependency parsing for a language in order to generate accurate sets of training samples. For this reason, relatively big corpora are used for training, obtaining a “statistically guaranteed” high recall. Nivre et al. (Nivre et al., 2007) reached interesting results for Italian by using a “small”⁴ training corpus of 1,500 sentences. But this is still a respectable number of annotated sentences. The starting point for the approach presented here is the assumption that syntactic patterns can be found in every language; then, if such patterns can be identified in some way, a single example (or a few examples) for every pattern should be sufficient for training an accurate model for dependency parsing. In fact, one of the strategies used when humans learn languages is the memorization of syntactic patterns, which is widely exploited in learning methods. A good way to obtain syntactic patterns for the Greek language could be to analyze the sentences contained in a method for learning Greek. For this reason, the texts used in these first experiments were obtained from the online Greek course Φιλoγλωσσια (Filoglossia)⁵, provided by the Iνστιτoυτo Eπεξεργασιας τoυ Λoγoυ (Institute for Language and Speech Processing, ILSP)⁶. In addition, the method Eπικoινωνηστε Eλληνικα (Communicate in Greek) (Arvanitakis and Arvanitaki, 2003) was also used.

³ http://depparse.uvt.nl/depparse-wiki/SharedTaskWebsite/

In summary, the goal of the present work is to analyze the possibilities for obtaining a dependency parser for Greek from a good tool for dependency parser generation based on Machine Learning, but with the disadvantage of lacking a big corpus annotated with dependency analyses. This is a prospective study which does not try to demonstrate the validity of the techniques proposed here but to show a set of them that seem promising for the stated objectives. While our efforts were focused on obtaining a dependency parser for Greek, the basis given here could be used to generate training corpora for different languages.

⁴ This is the smallest training corpus reported by Nivre et al. in (Nivre et al., 2007).

⁵ http://www.xanthi.ilsp.gr/filog/default.htm

⁶ http://www.ilsp.gr/

2 Training JBeaver

JBeaver is a publicly available tool originally configured as a dependency parser for Spanish (Herrera et al., 2007). But JBeaver provides features that make it easy to reconfigure it as a dependency parser for virtually any language. This is because JBeaver is powered by Maltparser (Nivre et al., 2007) (Nivre et al., 2006a) (Nivre et al., 2006b), a dependency parser generator based on Machine Learning techniques that acts as a module of JBeaver. By supplying appropriate corpora to JBeaver, Maltparser models can be trained; these models are used by JBeaver as the core of its parsing action. Maltparser models need as input not only the text to be analyzed but also every word's part-of-speech (POS) tag. Since one of the goals when developing JBeaver was to offer the user the possibility of parsing plain text, without any tagging at all, JBeaver needs to perform the POS tagging of its input text. This action is delegated to another Machine Learning tool acting as a module of JBeaver: Treetagger (Schmid, 1994). Treetagger is a tool based on Decision Trees for annotating text with POS and lemma information.

Since JBeaver uses Maltparser as its Machine Learning core, the training files must contain, for every word, its POS tag, a tag describing its syntactic function, and a pointer to the word that it modifies, as required by Maltparser. Figure 1 shows an excerpt of a training corpus for JBeaver (Maltparser). The fields of every line of the file, from left to right, are the following: the word form, its POS tag, the numeric identifier of the word in the sentence that acts as head of the current word, and its syntactic function tag.
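The four-field line layout described above can be illustrated with a minimal reader. This is only a sketch under stated assumptions: the fields are assumed to be tab-separated, a head value of 0 is assumed to mark the root, and the POS and function tags in the sample (ADV, PRON, ADVMOD, etc.) are hypothetical labels chosen for illustration, not the tagset actually used in the experiments.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str    # word form
    pos: str     # part-of-speech tag
    head: int    # 1-based position of the head word; 0 = root (assumed convention)
    deprel: str  # syntactic function tag


def parse_sentence(block: str) -> List[Token]:
    """Read one sentence, one token per line, four tab-separated fields per line."""
    tokens = []
    for line in block.strip().splitlines():
        form, pos, head, deprel = line.split("\t")
        tokens.append(Token(form, pos, int(head), deprel))
    return tokens


def heads_are_valid(tokens: List[Token]) -> bool:
    """Every head must point inside the sentence and no token may head itself."""
    n = len(tokens)
    return all(0 <= t.head <= n and t.head != i
               for i, t in enumerate(tokens, start=1))


# Hypothetical annotation of the sentence "Πως την λενε;" (tags are illustrative only).
SAMPLE = "\n".join([
    "Πως\tADV\t3\tADVMOD",
    "την\tPRON\t3\tOBJ",
    "λενε\tVERB\t0\tROOT",
    ";\tPUNCT\t3\tPUNCT",
])

if __name__ == "__main__":
    for token in parse_sentence(SAMPLE):
        print(token)
```

A check such as `heads_are_valid` is cheap to run on every hand-annotated sentence before training, which matters when the whole corpus is only a handful of sentences.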

In addition, a pair of files containing the complete set of POS tags and syntactic function tags must be prepared alongside the file mentioned above.

Figure 1: Excerpt of a training corpus for JBeaver.

Maltparser provides two Machine Learning methods, i.e., a Memory-based Learning algorithm and a Support Vector Machine one (Nivre et al., 2007). The latter showed better results than the former in other works on dependency parsing, but Memory-based Learning was used because it is the only one supported by JBeaver and obtaining highly accurate results was not a goal of the present work.

The set of features used to train a Maltparser model can be arbitrarily defined by the user, combining POS features, lexical features and dependency type features. The experiments presented here were developed using several sets of features, in order to study possible different behaviors when analyzing sentences. More specifically, the sets m2, m4 and m7⁷ were used. The m2 set contains features related to: the POS of two items from the input string (I), the POS of one item from the stack of partially processed tokens (S), the dependency relation linked to one item from I and the dependency relation linked to three items from S; i.e., this set does not consider word forms at all. The m4 set contains features related to: the POS of four items from I, the POS of one item from S, the dependency relation linked to one item from I and the dependency relation linked to three items from S, the complete word form of one item from I and the complete word form of one item from S. The m7 set contains features related to: the POS of four items from I, the POS of two items from S, the dependency relation linked to one item from I and the dependency relation linked to three items from S, the complete word form of two items from I and the complete word form of two items from S. Alternatively, lexical features can be defined not only as complete word forms but also as suffix features (Nivre et al., 2006b). Following the idea given by Nivre et al. in (Nivre et al., 2006b), we trained JBeaver considering first of all the m2 set, because for very small datasets it may be useful to exclude lexical features, in order to counter the sparse data problem.

⁷ See http://w3.msi.vxu.se/~nivre/research/MaltParser.html for a detailed description of m2, m4 and m7.
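The composition of the three feature sets, as described in the text, can be summarised in a small table-like structure. The sketch below only records the counts stated above (items taken from the input string I and from the stack S); the class and field names are hypothetical and do not correspond to Maltparser's own feature specification syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSet:
    pos_input: int  # POS features taken from the input string (I)
    pos_stack: int  # POS features taken from the stack (S)
    dep_input: int  # dependency relation features taken from I
    dep_stack: int  # dependency relation features taken from S
    lex_input: int  # complete word forms taken from I
    lex_stack: int  # complete word forms taken from S

    @property
    def uses_lexical(self) -> bool:
        return self.lex_input + self.lex_stack > 0

    @property
    def size(self) -> int:
        return (self.pos_input + self.pos_stack + self.dep_input
                + self.dep_stack + self.lex_input + self.lex_stack)


# Counts as given in the description of m2, m4 and m7.
FEATURE_SETS = {
    "m2": FeatureSet(2, 1, 1, 3, 0, 0),
    "m4": FeatureSet(4, 1, 1, 3, 1, 1),
    "m7": FeatureSet(4, 2, 1, 3, 2, 2),
}

if __name__ == "__main__":
    for name, fs in FEATURE_SETS.items():
        print(name, "features:", fs.size, "lexical:", fs.uses_lexical)
```

Laid out this way, it is easy to see that m2 is the only set without lexical features, which is why it is the natural first choice for very small training sets.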

3 First Step: Could a working system be trained using a small data set?

Previous work on training a statistical dependency parser for Chinese using small training data sets has been carried out (Jinshan et al., 2004), obtaining interesting results (80.25% precision) with a training corpus of 5,300 sentences. Furthermore, similar results were obtained for Italian by Nivre et al. (Nivre et al., 2007), with a training corpus of 1,500 sentences. Thus, the size of the corpus should not necessarily be a problem.

Some preliminary work showed us that, when analyzing a huge set of sentences, some syntactic structures occur frequently. Thus, if a single example for every kind of syntactic structure could be captured, the complete set of possible statements in a language might be modeled by means of a restricted set of samples that cover all of their syntactic patterns.

To study this approach, a first experiment was carried out in order to determine whether, having supplied a single analyzed sentence to JBeaver for training, the resulting model could accurately parse a set of different sentences showing the same syntactic structure as the first one. A model was obtained by training JBeaver with the following sentence: Πως την λενε; (What is her name?), annotated as seen in Figure 2.

Figure 2: Example of one–sentence trainingfile for JBeaver.

The following set of sentences was successfully parsed with the learned model: Πoτε τo ειδες; (When did you see it?), Πoυ τoν βρηκες; (Where did you find him?), Πoιoς τo θελει; (Who wants it?). The model was trained using the m2 set of features, which means that only POS and dependency type features were considered. This (simple) set of features applied to a single training sample was sufficient for capturing a wide range of phenomena. It can be observed that past and present tenses could be treated; this is because no auxiliary is necessary for building these tenses in Greek. In contrast, other tenses such as the future need different models for training because an auxiliary is mandatory (θα, να). In addition, every kind of adverb and personal pronoun can be successfully analyzed.

However, one of the characteristic features of the Greek language presented an additional problem: a sentence's components do not necessarily follow a strict order. For instance, ειµαι o Kωστας and o Kωστας ειµαι contain the same words in different orders while meaning the same (I am Kostas). This relative independence of word order is a feature of some languages that makes dependency parsing more useful than constituency parsing to capture their full complexity, so it is important that the benefits of simplifying training do not come at the cost of losing the ability to model this feature. In search of a solution to this problem within the scarce-data approach to training, the following experiment was carried out: having trained JBeaver with the dependency analysis of the sentence Mετα θα παµε στo θεατρo (We will go to the theater later), the sentences:

• Θα παµε µετα στo θεατρo

• Θα παµε στo θεατρo µετα

(both meaning “We will go to the theater later”) were analyzed, and the results showed some errors. This suggests that, when selecting a sentence as a model for a given syntactic structure, every possible reordering of its words must be considered if the training corpus is to have adequate coverage. Following this lesson, satisfactory experiments were carried out with training models capable of dealing with word reordering in the same dependency structure. For all these experiments, the models were trained using the m2 and m4 sets of features, obtaining correct results in both cases. Therefore, whether or not lexical features are considered seems not to be relevant for this kind of sample.
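Adding a reordered variant of an already annotated sentence does not require annotating it again: the tokens move, and the head indices are remapped so that the dependency tree stays the same. The following is a minimal sketch of that bookkeeping, assuming 1-based head indices with 0 for the root; the POS and function tags are hypothetical, and the word orders to generate are supplied by hand, since not every permutation of a sentence is grammatical.

```python
from typing import List, Sequence, Tuple

# (form, POS, head, function); heads are 1-based, 0 marks the root (assumed convention).
Token = Tuple[str, str, int, str]


def reorder(tokens: Sequence[Token], order: Sequence[int]) -> List[Token]:
    """Place tokens in `order` (old 1-based positions) and remap every head
    index so that the dependency tree itself is unchanged."""
    if sorted(order) != list(range(1, len(tokens) + 1)):
        raise ValueError("order must be a permutation of 1..n")
    new_index = {old: new for new, old in enumerate(order, start=1)}
    new_index[0] = 0  # the root marker is not a token and never moves
    result = []
    for old in order:
        form, pos, head, deprel = tokens[old - 1]
        result.append((form, pos, new_index[head], deprel))
    return result


# Hypothetical annotation of "Mετα θα παµε στo θεατρo" (tags are illustrative only).
BASE: List[Token] = [
    ("Mετα", "ADV", 3, "ADVMOD"),
    ("θα", "PART", 3, "AUX"),
    ("παµε", "VERB", 0, "ROOT"),
    ("στo", "PREP", 3, "OBL"),
    ("θεατρo", "NOUN", 4, "POBJ"),
]

# The two alternative orders discussed in the text, given as old positions.
VARIANTS = [reorder(BASE, order) for order in ([2, 3, 1, 4, 5], [2, 3, 4, 5, 1])]

if __name__ == "__main__":
    for variant in VARIANTS:
        print(" ".join(token[0] for token in variant), [token[2] for token in variant])
```

Capitalisation of the sentence-initial word is not adjusted here; whether that matters depends on whether word forms are used as features.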

This last experiment shows that, although parsing errors were produced when word reordering was not considered for training, some sections of the sentence were correctly analyzed. These sections correspond to dependency subtrees that are included in the training sample. For instance, the subtree having as nodes the first four words of the sentence Θα παµε στo θεατρo µετα was correctly analyzed because this subtree was present in the training sample. This observation led to the design of the experiment described in the following section.

4 Second Step: Could some samples for training include others?

Some syntactic substructures are common to a wide range of sentences; for instance (as shown in Figure 3), the dependency tree of the sentence Kαθε πρωι o Πετρoς και η Aντιγoνη θα πινoυν καφε στη θαλασσα (Petros and Antigoni will drink coffee by the sea every morning) includes the dependency tree of every one of these other sentences:

• Kαθε πρωι η Θεoδωρα βλεπει τηλεoραση (Theodora watches television every morning).

• O Φoιβoς θα γραψει τραγoυδια (Phoibos will write songs).

• H Eλενη και o Kωστας κανανε βoλτα στo βoυνo (Eleni and Kostas went for a walk on the mountains).

Two inverse experiments were carried out in order to determine whether it is better to use for training a sentence with a dependency tree as general as possible, or several sentences with a simpler dependency tree but such that their intersection covers a dependency tree equal to the more general one.

The first experiment consisted of training a model with the sentence Kαθε πρωι o Πετρoς και η Aντιγoνη θα πινoυν καφε στη θαλασσα, and using it to parse the other sentences, i.e.: Kαθε πρωι η Θεoδωρα βλεπει τηλεoραση. O Φoιβoς θα γραψει τραγoυδια. H Eλενη και o Kωστας κανανε βoλτα στo βoυνo.

In the second experiment, the model was obtained by training with the three sentences used before in the parsing step: Kαθε πρωι η Θεoδωρα βλεπει τηλεoραση. O Φoιβoς θα γραψει τραγoυδια. H Eλενη και o Kωστας κανανε βoλτα στo βoυνo. The model was used to parse the sentence that in the first experiment was used for training, i.e., Kαθε πρωι o Πετρoς και η Aντιγoνη θα πινoυν καφε στη θαλασσα.

Figure 3: Included dependency trees.

As a result, one sentence in the first experiment (O Φoιβoς θα γραψει τραγoυδια) was not correctly analyzed even when the set of features used for training was changed, i.e., considering different features did not improve the parsing action. But the second experiment gave satisfactory results, i.e., the sentence Kαθε πρωι o Πετρoς και η Aντιγoνη θα πινoυν καφε στη θαλασσα was correctly analyzed, but only when using an m7 set of features, excluding the lexical ones, for training. In conclusion, when considering a relatively complex syntactic structure, it seems better to use for training several sentences with a simpler dependency tree but such that their intersection covers a dependency tree equal to the more general one. In addition, a rich set of features that tries to exclude lexical features should be used. Further studies should be carried out in order to determine whether this approach is appropriate for more complex structures or whether new strategies should be found.

5 Third Step: What about declension?

Greek is an inflected language that uses case to encode grammatical relations. For this reason, we developed an experiment to observe the effect of declension when trying to train JBeaver. Let us consider the following two sentences: To λεωφoρειo τoυ KTEΛ ειναι τoυ Θαναση (The KTEL's bus belongs to Thanasis) and O ανδρας της Eλενης ειναι o Θανασης (Thanasis is Eleni's husband). Both sentences follow the same order considering POS, i.e., determiner, common noun, determiner, common noun, verb, determiner, proper noun. But very important differences exist between them. For instance, in the first sentence the word Θαναση lacks the last letter with respect to the same word in the second sentence (Θανασης); this missing letter indicates a possessive function and this proper noun acts as object in the first sentence, while in the second sentence the same proper noun acts as subject. Case in Greek is expressed by means of the word's suffix. Then, given that the training sets for JBeaver must contain every word's POS, a training set consisting of these two sentences should induce errors at parsing time if lexical features were not considered when training. Thus, we trained a model using these two sentences, labeled as shown in Figure 4.

Figure 4: Training sample for studying declension.

A sentence that should be parsed like the second one used for training (O ανδρας της Eλενης ειναι o Θανασης) was submitted to the learned model. This new sentence was the following: O πατερας της Θεoδωρας ειναι o Λεωνιδας (Leonidas is Theodora's father). As a result, a correct parse was obtained only when training with an m7 set of features. This means that fully accounting for the complexity of the information contained in declension phenomena requires a relatively large set of training features containing, of course, lexical features.

But the lexical features considered in the standard m7 are complete word forms of the sentence. This could negatively affect the attempts to obtain a training corpus using small data sets, because under such circumstances the set of different word forms for every syntactic structure would not be very rich. This led us to evaluate the possibility of training a model capable of correctly analyzing declension phenomena without relying on complete word forms. We then successfully repeated the experiment shown in this section using a modified m7 model, in which every lexical feature consisted of the last character of every word form involved in the standard m7.
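The idea behind the modified m7 model, replacing each complete word form by its last character, can be sketched in a few lines. The function name and the `length` parameter are hypothetical and only illustrate the transformation; how suffix features are actually declared to Maltparser is not shown here.

```python
from typing import List


def suffix_feature(form: str, length: int = 1) -> str:
    """Return the last `length` characters of a word form (the whole form if shorter)."""
    return form[-length:] if form else ""


def lexical_features(forms: List[str], length: int = 1) -> List[str]:
    """Replace every complete word form by its suffix feature."""
    return [suffix_feature(form, length) for form in forms]


if __name__ == "__main__":
    # The two case forms of the proper noun discussed in this section.
    print(suffix_feature("Θαναση"), suffix_feature("Θανασης"))
    # Two different nouns in the same case end up with the same feature value.
    print(lexical_features(["Θανασης", "Λεωνιδας"]))
```

With one-character suffixes, Θανασης and Λεωνιδας share a feature value even though the second form never appears in the training data, which is the kind of generalisation a very small corpus needs.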

6 Evaluation

As seen in the previous sections, there are some facts that indicate that a working system could be trained using a small data set. But while the examples presented here work properly in isolation, it is important to verify that all of them can work together, in order to build an effective training corpus. The only drawback we could find is that different sets of features were needed for accurate training in each case studied, while the model must be trained using a common set of features. Thus, a complementary test was carried out: the complete set of training samples seen in the previous sections was used to train a new model, and the set of features selected was the most restrictive of those considered throughout this work, i.e., the modified m7 model where every lexical feature consisted of the last character of every word form involved in the standard m7. This training corpus consisted of the following sentences:

• Πως την λενε; (What is her name?)

• Mετα θα παµε στo θεατρo (We will go to the theater later)

• Θα παµε µετα στo θεατρo (We will go to the theater later)

• Θα παµε στo θεατρo µετα (We will go to the theater later)

• Kαθε πρωι η Θεoδωρα βλεπει τηλεoραση (Theodora watches television every morning)

• O Φoιβoς θα γραψει τραγoυδια (Phoibos will write songs)

• H Eλενη και o Kωστας κανανε βoλτα στo βoυνo (Eleni and Kostas went for a walk on the mountains)

• To λεωφoρειo τoυ KTEΛ ειναι τoυ Θαναση (The KTEL's bus belongs to Thanasis)

• O ανδρας της Eλενης ειναι o Θανασης (Thanasis is Eleni's husband)

After that, the following sentences were correctly parsed:

• Πoτε τo ειδες; (When did you see it?)

• Πoυ τoν βρηκες; (Where did you find him?)

• Πoιoς τo θελει; (Who wants it?)

• Θα παµε πρωτα στo θεατρo (We will go to the theater sooner)

• Θα παµε στo θεατρo µετα (We will go to the theater later)

• Kαθε πρωι o Πετρoς και η Aντιγoνη θα πινoυν καφε στη θαλασσα (Petros and Antigoni will drink coffee by the sea every morning)

• O πατερας της Θεoδωρας ειναι o Λεωνιδας (Leonidas is Theodora's father)

In conclusion, the modified m7 set of features that we used for the experiment described in Section 5 seems valid for training with a complete training corpus for Greek, considering a relatively important range of syntactic and morphological phenomena.

After this last experiment, it was interesting to determine empirically whether a small training set could be sufficient to produce an accurate parser able to deal with a wide range of sentences. For that, a training set of 15 sentences was used to train a new parser. They were selected by following the recommendations learned from Sections 3 to 5. The 15 sentences were, apart from the ones of the previous experiment, the following:

• H ωρα ειναι δυo (It is two o’clock)

• Mε λενε Γιωργo Oικoνoµoυ (My name is Giorgo Oikonomou)

• Tις λενε Mαρια και Eλενη (Their names are Maria and Eleni)

• Eιµαι σε ενα ξενoδoχειo (I am in a hotel)

• Tρεχει στo γηπεδo (It runs on the ground)

• Tωρα βλεπω τηλεoραση (Now I am watching television)

The lexical features considered were, again, the ones pertaining to the modified m7 model where every lexical feature consisted of the last character of every word form involved in the standard m7. 82 sentences were selected at random from the methods for learning Greek referred to in Section 1, covering different levels of complexity pertaining to the A1 and A2 levels of the Common European Framework of Reference for Languages⁸. They were submitted to the generated parser and the following measures were computed: Labeled Attachment Score (LAS), Unlabeled Attachment Score (UAS) and Label Accuracy (LA). 39 of these 82 sentences were perfectly analyzed, i.e., they scored 100% LAS, UAS and LA. The overall values obtained (LAS = 67.32%, UAS = 77.27% and LA = 75.54%) are close to the results reported for Greek dependency parsing in the CoNLL Shared Task 2007⁹. This does not mean that the works are comparable, but it could be interpreted as a positive sign for the strategies presented in this paper.
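The three measures can be computed from token-level comparisons of gold and predicted analyses. The sketch below follows the usual definitions: UAS is the fraction of tokens with the correct head, LA the fraction with the correct function label, and LAS the fraction with both correct. Whether punctuation tokens were counted in the figures above is not stated, so this sketch simply scores every token it is given; the tags in the example are hypothetical.

```python
from typing import Sequence, Tuple

Arc = Tuple[int, str]  # (head index, function label) for one token


def attachment_scores(gold: Sequence[Arc],
                      predicted: Sequence[Arc]) -> Tuple[float, float, float]:
    """Return (LAS, UAS, LA) as fractions over all tokens."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n  # head correct
    la = sum(g[1] == p[1] for g, p in zip(gold, predicted)) / n   # label correct
    las = sum(g == p for g, p in zip(gold, predicted)) / n        # both correct
    return las, uas, la


if __name__ == "__main__":
    gold = [(3, "ADVMOD"), (3, "OBJ"), (0, "ROOT"), (3, "PUNCT")]
    pred = [(3, "ADVMOD"), (3, "SUBJ"), (0, "ROOT"), (2, "PUNCT")]
    las, uas, la = attachment_scores(gold, pred)
    print(f"LAS={las:.2%} UAS={uas:.2%} LA={la:.2%}")
```

A sentence counts as perfectly analyzed in the sense used above exactly when all three values equal 1.0 for its tokens.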

The next question to answer is whether the parser is able to correctly analyze every sentence with a syntactic structure equal to one of those that were previously parsed without errors. For this, the following experiment was carried out: every one of the 39 syntactic structures that were perfectly parsed was replicated several times by obtaining a set of different sentences for every syntactic structure. 100 sentences were thus generated and parsed, again scoring 100% LAS, UAS and LA.

⁸ http://www.coe.int/t/dg4/linguistic/CADRE_EN.asp

⁹ http://depparse.uvt.nl/depparse-wiki/AllScores/

7 Discussion and Future Work

After the set of experiments presented here, we can conclude that it could be possible to obtain an accurate dependency parser for Greek by means of a small training corpus. For this, we rely on the use of tools such as JBeaver and we propose the following basic strategies to develop such a training corpus:

• To select an adequate source for sentences in the language considered. This source should provide a wide range of samples containing syntactic patterns that are as varied as possible. Such a source could be a method for learning the language, which usually presents sentences with incremental syntactic complexity.

• To extract from the source the sentences to be analyzed. These sentences should cover typical cases of syntactic patterns in the language that are as varied as possible. If a sentence shows a complex syntactic structure, it is advisable to use for training several sentences with a simpler dependency tree but such that their intersection covers a dependency tree equal to the more general one.

• To build the training corpus incrementally, verifying that the new sentences added in a step do not affect the overall performance of the trained model. In addition, the set of features used for training should be as simple as possible, trying to avoid lexical features or using only words' suffixes. If a new syntactic pattern needs a richer set of features than the ones considered so far, it is necessary to verify that this does not come at the cost of losing accuracy during the parsing action.

• To evaluate specific phenomena of the language considered such as, for example, declension.

The present work covers preliminary stud-ies on the question showing positive resultsthat suggest that there is room for more re-search on it. The effective development ofa training corpus, under the considerationsexposed here, should reveal new problems todeal with. The solutions given to treat themshould conform a useful and complete guidefor the development of training corpora fordependency parsing using small training datasets. In addition, the results obtained wouldbe used to empirically evaluate the advan-tages and disadvantages of the method ex-posed here versus prior existing ones.

References

K. Arvanitakis and F. Arvanitaki (Κ. Αρβανιτάκης και Φ. Αρβανιτάκη). Επικοινωνήστε Ελληνικά [Communicate in Greek]. Deltos Publishing, 2003.

J. Herrera, P. Gervás, P.J. Moriano, A. Muñoz, and L. Romero. JBeaver: Un Analizador de Dependencias para el Español Basado en Aprendizaje. In Proceedings of the XII Conference of the Spanish Association for Artificial Intelligence, Salamanca, Spain, 2007.

M. Jinshan, Z. Yu, L. Ting, and L. Sheng. A Statistical Dependency Parser of Chinese Under Small Training Data. In IJCNLP-04 Workshop: Beyond Shallow Analyses - Formalisms and Statistical Modeling for Deep Analyses, 2004.

D. Lin. Dependency-based Evaluation of MINIPAR. In Proceedings of the Workshop on Evaluation on Parsing Systems, Granada, Spain, 1998.

J. Nivre, J. Hall, and J. Nilsson. MaltParser: A Data-Driven Parser Generator for Dependency Parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC-2006, Genoa, Italy, 2006.

J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and E. Marsi. MaltParser: A Language-Independent System for Data-Driven Dependency Parsing. In Natural Language Engineering 13 (2). Cambridge University Press, Cambridge, United Kingdom, 2007.

J. Nivre, J. Hall, J. Nilsson, G. Eryigit, and S. Marinov. Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines. In Proceedings of the CoNLL-X Shared Task on Multilingual Dependency Parsing, New York, USA, 2006.

P. Prokopidis, E. Desipri, M. Koutsombogera, H. Papageorgiou, and S. Piperidis. Theoretical and Practical Issues in the Construction of a Greek Dependency Corpus. In Proceedings of The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), Barcelona, Spain, 2005.

A. Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, United Kingdom, 1994.

Análisis sintáctico profundo del español: un ejemplo del procesamiento de secuencias idiomáticas∗

Spanish deep parsing: the example of idiomatic sequences processing

Jorge Antonio Leoni de León, Sandra Schwab and Éric Wehrli
LATL - Department of Linguistics
University of Geneva
2, rue de Candolle
CH-1211 Geneva 4, Switzerland

[jorge.leonideleon,sandra.schwab,eric.wehrli]@lettres.unige.ch

Abstract: FIPS, a multilingual deep parser that is still a work in progress, has been developed at the Language Technology Laboratory (LATL) of the University of Geneva (Switzerland). This parser, inspired by Chomskyan generative theories, is based on the idea that sets of syntactic structures are common to several languages (either to all languages or to language families). In this paper we present an introduction to the general strategy of FIPS, illustrated with Spanish, together with a sample application to the processing of idiomatic sequences (multiword expressions). Such sequences, although generally processed as static lexical sequences, can undergo various lexical-syntactic transformations, such as the clitic pronominalization of an internal argument or the substitution of elements. Capturing the meaning of such sequences in a sentence requires a deep syntactic representation that establishes the links between the base form and its realization (or surface form).

Keywords: deep parsing, idiomatic expressions, multiword expressions

1. Introduction

For several years, the Language Technology Laboratory (LATL, 2008; Laenzlinger and Wehrli, 1991) of the University of Geneva has been developing the multilingual deep parser FIPS (Wehrli, 2004; Wehrli, 2007).¹ It draws mainly on the Chomskyan theoretical framework (Chomsky, 1995, chapter 1 with Howard Lasnik), with free adaptations of the Minimalist model (Chomsky, 2004), of Simpler Syntax (Culicover and Jackendoff, 2005) and of Lexical-Functional Grammar (Bresnan, 2001). FIPS thus has a grammatical core common to all the languages of the system, to which specialized modules are added for groups of languages that behave alike with respect to certain phenomena, such as clitic pronouns in the Romance languages. This strategy reduces the time needed to introduce new languages into the system, since a set of syntactic conditions and phenomena is predefined, both for all the languages and for a subset of them.

∗ This research was supported by the Fonds National Suisse pour la Recherche Scientifique (Swiss National Science Foundation), project no. 101412-103999.

¹ An online version of the parser is available (LATL, 2008).

The advantage of a deep parser over shallow parsers, such as Atserias et al. (2006), is its ability to identify long-distance relations in the sentence effectively. For example, the constituent elements of idiomatic expressions are not always adjacent to one another, although the co-occurrence of those elements clearly matters. Such is the case of the collocation “explotar un mito” ('to exploit a myth'), which, besides its basic transitive form, can appear in a passive form, “el mito ha sido explotado” ('the myth has been exploited'), or in a nominal form, “la explotación del mito” ('the exploitation of the myth'). In this paper we describe how FIPS works and give a general account of its advantages for processing idiomatic expressions.

2. The Fips parser

The implementation of FIPS has concentrated on six languages: German, Spanish, French, Greek, English and Italian.² Other languages have nevertheless been treated as well, though only partially, such as Romanian, Russian, Polish and Sursilvan Romansh. The lexical database of FIPS has always been a priority, so the lexicon has reached a notable level both in quality and in quantity; Table 1 summarizes the lexical coverage of FIPS in absolute figures:

Language    Lemmas    Forms      Collocations
English     54 000     90 000     5 000
French      37 000    227 000    12 500
German      39 000    410 000     2 000
Italian     31 000    220 000     2 500
Spanish     25 100    265 000     1 500
Greek       12 000     90 000       225

Table 1: Number of entries in FIPSBD

The lexical database of FIPS thus contains lemmas, which are the canonical forms used to access lexical entries; forms, which are all the declined or conjugated instances of a lexical entry; and collocations, which are discussed in section 3.

FIPS analyses require combining the results of three interdependent systems: a lexical database (FIPSBD), a morphosyntactic tagger (FIPSTG) and a syntactic parser (FIPSSYN).

² As regards the automatic processing of Spanish, we can cite both the work of La Serna (2004) and that of Bick (2008); the latter describes a parser based on constraint grammars.

Drawing on Chomskyan grammar, FIPS maximizes the grammatical features common to languages through several modules ranging from the most general to the most specific, the latter being a set of rules particular to a single language (Wehrli, 2004).³ For example, the clitic pronouns (“le di el libro”, 'I gave him/her the book') of the Romance languages and Greek are handled by the Romance module (Leoni de León and Michou, 2006).⁴

FIPSBD specifies, among other things, subcategorization and selection data, thematic roles and syntactically relevant semantic features. For example, for a verb such as “ver” ('to see'), we have the set of partially specified values in Table 2, where “ID” refers to the unique identification number of the verb “ver” in the database, “Inflection” indicates the corresponding conjugation paradigm and “Subcategorization” specifies the subject and direct object positions, which must be filled by a noun phrase (“NP”). These positions are associated, respectively, with “Argument 1” and “Argument 2”, where the grammatical and thematic functions are declared. This information is provided to the user by the FIPSTG tagger. In addition, all the possible forms of a lexical entry have been entered into the database by means of a generator.

Label                       Value
Lemma                       ver
ID
Inflection                  1
Subcategorization           [NP_NP]
Arguments
  Argument 1
    Grammatical function    subject
  Argument 2
    Grammatical function    direct object
    Thematic function       theme

Table 2: FIPSBD: “ver”

³ Similar efforts have been made within the framework of Head-driven Phrase Structure Grammar, HPSG (LinGO Lab, 2008).

⁴ The Romance module is responsible for: (i) identifying clitic sequences; (ii) associating those sequences with the host verb (or another category, depending on the language); (iii) checking features between the clitic sequence and the arguments of the verb; and (iv) interpreting the clitic sequence. The interpretation of clitic sequences takes the form of an empty category in argument position, coindexed with the clitic pronoun in a higher position; the chain formed between the two allows the relevant case and thematic features to be checked. The FIPS tagger (FIPSTG) shows the corresponding values (direct object, indirect object, etc.) for each word of the sentence.

For the other grammatical categories, the information stored in FIPSBD can be very similar; for example, certain adjectives are subcategorized (“orgulloso de ALGO”, 'proud of SOMETHING'). There is also lexical-semantic information that is particularly relevant (selectional features); for example, the property [+human] is added to nouns referring to human beings in order to account for the use of the Spanish preposition “a” to mark a direct object that refers to a human. Thus (1) contrasts with (2): although both “estudiante” ('student') and “edificio” ('building') are direct objects, in the latter the preposition “a” is absent:

(1) Vi al estudiante. ('I saw the student.')

(2) Vi el edificio. ('I saw the building.')

The partial values of “estudiante” in FIPSBD are given in Table 3, among which is “human”. The value “noArg” under “Subcategorization” indicates that “estudiante” has no subcategorized element. By contrast, the lexical entry for “edificio” (Table 4) lacks the feature “human” but has the optional feature “Physical object”. Based on the information in Tables 3 and 4, FIPSSYN, which we will see below, assigns a direct-object value to “al estudiante” after analysing the preposition in light of the structure of the verb and of the noun phrase in question.⁵

Label                Value
Lemma                estudiante
ID
Gender               masculine, feminine
Number               singular
Inflection           7
Features             human
Subcategorization    noArg

Table 3: FIPSBD: “estudiante”

Label                Value
Lemma                edificio
ID
Gender               masculine
Number               singular
Inflection           1
Features             Physical object
Subcategorization    noArg

Table 4: FIPSBD: “edificio”

⁵ The treatment of the Spanish direct object in FIPS, for both animate and inanimate entities, undoubtedly deserves a more thorough discussion, especially with regard to the phenomena of clitic pronominalization. However, that topic requires more space than we can devote to it here.

Now, given a sentence such as “vi el edificio” ('I saw the building'), and according to the information specified in Table 2, FIPSSYN will try to relate the verb “ver” to the noun phrase “el edificio”, since, according to the subcategorization of the verb, the postverbal position corresponds to the direct object, which must be a noun phrase. The conditions for combining the phrases are thereby satisfied.
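The role of such lexical features can be illustrated with a small sketch. It is not the FIPSBD schema: the class, the field names and the helper function are hypothetical, and the example runs in the generation direction (choosing the surface form of a direct object) merely to show how the feature “human” conditions the preposition “a”.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class LexicalEntry:
    """Toy lexical entry loosely modelled on Tables 2 to 4 (field names are assumptions)."""
    lemma: str
    category: str                          # "V" or "N" in this sketch
    subcat: Tuple[str, ...] = ()           # empty tuple plays the role of "noArg"
    features: FrozenSet[str] = frozenset()


LEXICON = {
    "ver": LexicalEntry("ver", "V", subcat=("NP", "NP")),
    "estudiante": LexicalEntry("estudiante", "N", features=frozenset({"human"})),
    "edificio": LexicalEntry("edificio", "N", features=frozenset({"physical object"})),
}


def direct_object_phrase(noun: str, determiner: str = "el") -> str:
    """Return the surface form of a direct object headed by `noun`."""
    entry = LEXICON[noun]
    if "human" in entry.features:
        # Human direct objects take "a"; "a" + "el" contracts to "al".
        return f"al {noun}" if determiner == "el" else f"a {determiner} {noun}"
    return f"{determiner} {noun}"


print("Vi", direct_object_phrase("estudiante"))
print("Vi", direct_object_phrase("edificio"))
```

A parser would use the same feature in the opposite direction, accepting “al estudiante” as a direct object because the noun carries “human”.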

FIPSSYN assumes that phrases are endocentric and consist of three elements: the head of the phrase (X), a list of subconstituents to its left (Left) and another list of subconstituents to its right (Right). Schematically:

[ Left X Right ]

Any of these elements may be empty. The variable “X” can stand for any lexical category: adverb (Adv), adjective (A), complementizer (C), determiner (D), interjection (Inter), preposition (P), noun (N), verb (V). In addition there is the functional category of tense (T), which contains the whole sentence, as well as a functional projection (F), used to represent predicative objects, whose head is an adjective, an adverb, a noun or a preposition. Consequently, the graphical representation of a phrase, or even of a sentence, is necessarily ternary (Figure 1).

          XP
   Left    X    Right

Figure 1: Basic structure of FIPSSYN
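The ternary schema can be sketched as a small data structure. The class and method names below are hypothetical, not taken from the FIPS implementation; the sketch only shows a head with two lists of subconstituents and a bracketed rendering in the style used for the examples in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    """A phrase with a head and left/right subconstituent lists (any part may be empty)."""
    label: str                          # phrase label such as "DP", "NP", "TP"
    head: Optional[str] = None          # the lexical head X, or None if empty
    left: List["Constituent"] = field(default_factory=list)
    right: List["Constituent"] = field(default_factory=list)

    def brackets(self) -> str:
        """Render the constituent in labelled-bracket notation."""
        parts = [child.brackets() for child in self.left]
        if self.head is not None:
            parts.append(self.head)
        parts += [child.brackets() for child in self.right]
        return "[" + " ".join([self.label] + parts) + "]"


# "vi el edificio": empty subject DP on the left, VP with the object DP on the right.
np = Constituent("NP", "edificio")
dp = Constituent("DP", "el", right=[np])
vp = Constituent("VP", right=[dp])
tp = Constituent("TP", "vi", left=[Constituent("DP")], right=[vp])
print(tp.brackets())
```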

FIPS results are available in several output formats (text, XML and bracketed); all consist of an enriched version of the original sentence with phrase delimiters built around a two-letter label that denotes the grammatical category to which the phrase belongs (“NP” for a noun, “AP” for an adjective, etc.). In this paper we use the format based solely on brackets (phrase labels). Accordingly, if we enter the sentence “vi el edificio” into the system, we obtain the labelled version in example (3):

(3) [TP [DP ] vi [VP [DP el [NP edificio]]]]

Figure 2 represents this same structure graphically:

TP
  DP
    e
  vi_i
  VP
    e_i
    DP
      el
      NP
        edificio

Figure 2: Representation of an analysis

To make the analysis in (3) easier to follow, we have added an empty category, coindexed where applicable with the verb (vi_i ... e_i). FIPSTG, for its part, summarizes the lexical information, as shown in Table 5.

Word         vi
Features     VER-IND-PRS-1-SIN
Unique ID
Lemma        ver

Word         el
Features     DET-SIN-MAS
Unique ID
Lemma        el
Function     OBJ

Word         edificio
Features     NOM-SIN-MAS
Unique ID
Lemma        edificio

Table 5: Tagger results

The operations that allow FIPS to reach these results rest on three methods: Project, Merge and Move.

2.1. The Project method

The “Project” method creates a syntactic constituent from a lexical object or from another syntactic constituent. Every lexical element identified by FIPS from the information in FIPSBD is projected as a phrase with a lexical item as its head.

In Spanish, as example (4) shows, the definite article “el” projects a determiner phrase (4a.), while “edificio” projects a noun phrase (4b.). Personal pronouns (4c.), which FIPSBD treats as a special kind of noun, perform what is called a “metaprojection”, that is, they immediately project their higher structure, which in this case is a DP (in FIPS every noun phrase is contained in a determiner phrase). Metaprojection is also used, in the analysis of Romance languages, for conjugated verbs, which become TPs (4d.); the purpose of this operation is to check agreement between subject and verb in languages where the subject can be expressed by the verb ending.

(4) a. Determiners:
       el → [DP el]
    b. Nouns:
       edificio → [NP edificio]
    c. Pronouns:
       tú → [DP [NP tú]]
    d. Verbs:
       vi → [TP vi_i [VP e_i]]

In English (5a.), metaprojection does not take place, since the operation is not justified given the poor morphology of that language. There are also languages that require a more complex metaprojection, such as German (5b.), which in our scheme needs a metaprojection above the tense phrase (TP) to account for the verb in sentence-final position, which is considered its canonical position.

(5) a. English:
       reads → [TP [VP reads_i]]
    b. German:
       regnet⁶ → [CP regnet_i [TP [VP e_i]]]
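The projections in (4) and (5) can be summarized in a short sketch. The function below is a hypothetical illustration that returns bracketed strings; the category codes and the language switch are assumptions made for the example and do not describe how FIPS encodes its rules.

```python
def project(word: str, category: str, language: str = "es") -> str:
    """Return the bracketed projection of a word, following examples (4) and (5).

    category: "D" (determiner), "N" (noun), "PRON" (personal pronoun),
              "V" (conjugated verb). These codes are assumptions of this sketch.
    language: "es" stands in for the Romance pattern, "en" for English,
              "de" for German.
    """
    if category == "D":
        return f"[DP {word}]"
    if category == "N":
        return f"[NP {word}]"
    if category == "PRON":
        # Metaprojection: the pronoun projects its NP and the enclosing DP at once.
        return f"[DP [NP {word}]]"
    if category == "V":
        if language == "en":
            return f"[TP [VP {word}_i]]"
        if language == "de":
            # Projection above TP, leaving a trace in the clause-final position.
            return f"[CP {word}_i [TP [VP e_i]]]"
        # Romance: the conjugated verb is metaprojected to TP with a trace in VP.
        return f"[TP {word}_i [VP e_i]]"
    raise ValueError(f"unsupported category: {category}")


for args in [("el", "D"), ("edificio", "N"), ("tú", "PRON"), ("vi", "V"),
             ("reads", "V", "en"), ("regnet", "V", "de")]:
    print(args[0], "→", project(*args))
```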

2.2. The Merge method

The “Merge” method is the phrase-combination mechanism of FIPS. Each time the parser reads a word, the word is turned into a constituent, that is, into a projection like those explained in section 2.1. The projection can be combined (“merged”) with complete or partial constituents in any of its contexts. At that point two possibilities arise: attachment to the left or attachment to the right.

⁶ In English, “it is raining”.

Left attachment is the typical case of subject and verb. It consists in inserting a constituent into the left context of another projection with which it is compatible. For example, in (6), the personal pronoun (6a.), once the verb has been recognized (6b.), is attached as a left subconstituent of the new verbal projection (that is, as a subject), yielding (6c.).

(6) a. ella → [DP ella]
    b. duerme → [TP duerme [VP ]]
    c. [TP [DP ella] duerme [VP ]]

By contrast, right attachment corresponds to the situation in which a projection is attached as a right subconstituent of its own left context. This is the typical case of determiner phrases, in which the noun phrase is inserted to the right of the determiner phrase (DP); in other words, determiner phrases host a noun phrase to the right of the head of the constituent.

For example, in (7), the word “el” projects a DP constituent (7b.); in the FIPS grammar, DPs occupy a higher position than NPs. In other words, a DP can take an NP as its argument. Thus the projection (7c.) is combined with (7b.) (that is, inserted to the right of the latter), which produces the phrase (7d.). The procedure for satisfying the arguments of a verb is basically the same. Thus, once the phrase (7a.) has been recognized, the DP is incorporated to the right of the verb phrase. The result of the whole operation is shown in (7e.).

(7) a. vi → [TP vi [VP ]]
    b. el → [DP el]
    c. edificio → [NP edificio]
    d. [DP el [NP edificio]]
    e. [TP [DP ] vi [VP [DP el [NP edificio]]]]

The “Merge” operation must be validated either by lexical properties, such as selectional features, or by certain general properties (as with adverbs, adjuncts and parentheticals, which can freely modify projections).

According to Table 2, the verb “ver” combines with a noun in postverbal position, which is a direct object, while in preverbal position it combines with another noun, which is a subject with which it must check person and number features (although in languages such as Spanish this position may be empty). Since the postverbal position holds the phrase “el edificio”, recognized as a noun phrase (and therefore compatible with the information for “ver”), FIPS recognizes it in this position as a direct object.
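Left and right attachment, together with a validation step, can be sketched as follows. The `Constituent` class, the `merge` function and the `LICENSED` table are hypothetical; in particular, the table is a toy stand-in for the lexical and general properties that validate “Merge” in FIPS.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    label: str
    head: Optional[str] = None
    left: List["Constituent"] = field(default_factory=list)
    right: List["Constituent"] = field(default_factory=list)

    def brackets(self) -> str:
        parts = [c.brackets() for c in self.left]
        if self.head is not None:
            parts.append(self.head)
        parts += [c.brackets() for c in self.right]
        return "[" + " ".join([self.label] + parts) + "]"


# (host label, incoming label, side) triples licensed in this toy grammar (assumption).
LICENSED = {("TP", "DP", "left"), ("DP", "NP", "right"), ("VP", "DP", "right")}


def merge(host: Constituent, incoming: Constituent, side: str) -> Constituent:
    """Attach `incoming` to the left or right of `host` if the grammar licenses it."""
    if (host.label, incoming.label, side) not in LICENSED:
        raise ValueError(f"cannot attach {incoming.label} to the {side} of {host.label}")
    (host.left if side == "left" else host.right).append(incoming)
    return host


# Example (6): the subject pronoun is attached to the left of the verbal projection.
tp = Constituent("TP", "duerme", right=[Constituent("VP")])
merge(tp, Constituent("DP", "ella"), "left")
print(tp.brackets())

# Example (7d): the NP is attached to the right of the DP headed by "el".
dp = merge(Constituent("DP", "el"), Constituent("NP", "edificio"), "right")
print(dp.brackets())
```

Keeping the licensing check inside `merge` mirrors the idea that every combination must be validated before it is accepted.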

2.3. The Move method

The general surface structure results from combining the “Project” and “Merge” operations. However, an additional mechanism is needed to satisfy uniformity conditions such as the assignment of thematic roles. This is the purpose of the “Move” method, which handles the relation between extracted or dislocated elements and the positions they originally occupied. A typical case is that of wh-questions (partial interrogatives), such as the English sentence in (8):

(8) a. Who did you invite?
    b. [CP [DP who]_j did_k [TP [DP you] e_k [VP invite e_j]]]

The “Move” method consists in creating a chain of coindexations. In example (8b.) there are two displaced elements: the interrogative pronoun “who” and the auxiliary “did”. Two facts justify using this mechanism for the pronoun “who”. First, to be interpreted correctly, the pronoun “who” needs to be associated with a verb, which lies far away in the sentence; for this reason its interpretation is deferred and the pronoun is placed in a temporary structure (a stack). Then the verb needs to satisfy both its subcategorization (_NP) and the corresponding assignment of case and thematic role. Although the postverbal position is empty, the stack holds an element that meets the requirements for being interpreted with respect to the verb. A chain of coindexed empty categories (“e”) is then created between the argument position (postverbal) and the pronoun “who”. Second, a coreference is created between the auxiliary “did” and its original position. This is the way of representing subject inversion in interrogatives, a phenomenon typical of English.
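The deferred interpretation of a fronted element can be sketched with an explicit stack. This is a deliberately crude illustration: the word lists are hypothetical, and the object gap is approximated by "the transitive verb is the last token", whereas a real parser would detect the empty position from the syntactic structure.

```python
from typing import List, Tuple

WH_WORDS = {"who", "what"}        # toy list of fronted interrogatives (assumption)
TRANSITIVE = {"invite", "see"}    # toy list of verbs subcategorizing for an object NP


def resolve_gaps(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (filler, verb) pairs linking each fronted wh-word to the verb
    whose empty object position it fills."""
    pending: List[str] = []       # stack of fillers awaiting interpretation
    chains: List[Tuple[str, str]] = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in WH_WORDS:
            pending.append(word)  # defer interpretation until a verb needs it
        elif word in TRANSITIVE:
            object_missing = i + 1 == len(tokens)  # crude gap test (see lead-in)
            if object_missing and pending:
                chains.append((pending.pop(), word))
    return chains


print(resolve_gaps("Who did you invite".split()))
print(resolve_gaps("You invite Ana".split()))
```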

2.4. Complete example

Let us consider the analysis of the sentence “Ana vio el edificio” ('Ana saw the building') to illustrate the mechanisms described:

Step 1 The parser reads “Ana” and metaprojects the structure [DP [NP Ana]].

Step 2 The parser reads “vio” and metaprojects a clause structure [TP vio_i [VP e_i]].

Step 3 A “Merge” operation is performed between the TP and the DP, which is placed to the left of the tense projection: [TP [DP [NP Ana]] vio_i [VP e_i]].

Step 4 The parser identifies the determiner “el” and projects the structure [DP el].

Step 5 A “Merge” operation is performed between the TP phrase on the left and the DP just identified; [DP el] is attached to the right of the TP: [TP [DP [NP Ana]] vio_i [VP e_i [DP el]]].

Step 6 The parser identifies the noun “edificio” and projects the structure [NP edificio].

Step 7 A “Merge” operation is performed on the right-hand DP of the TP, in which “edificio” is attached as the right constituent of the DP “el”: [TP [DP [NP Ana]] vio_i [VP e_i [DP el [NP edificio]]]].

The last step produces the complete structure.
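The steps above can be traced with a toy left-to-right loop. Everything in this sketch is an assumption made for illustration (the four-word lexicon, the category codes, the attachment rules hard-wired for this sentence pattern); it records one intermediate structure per word, so each projection and the merge that follows it are collapsed into a single stage.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    label: str
    head: Optional[str] = None
    left: List["Constituent"] = field(default_factory=list)
    right: List["Constituent"] = field(default_factory=list)

    def brackets(self) -> str:
        parts = [c.brackets() for c in self.left]
        if self.head is not None:
            parts.append(self.head)
        parts += [c.brackets() for c in self.right]
        return "[" + " ".join([self.label] + parts) + "]"


CATEGORY = {"Ana": "PROPN", "vio": "V", "el": "D", "edificio": "N"}  # toy lexicon


def project(word: str) -> Constituent:
    category = CATEGORY[word]
    if category == "PROPN":   # metaprojection to DP over NP
        return Constituent("DP", right=[Constituent("NP", word)])
    if category == "V":       # metaprojection to TP with a trace inside VP
        return Constituent("TP", word + "_i", right=[Constituent("VP", "e_i")])
    if category == "D":
        return Constituent("DP", word)
    return Constituent("NP", word)


def parse(words: List[str]) -> List[str]:
    """Return the bracketed analysis after each word has been projected and merged."""
    stages: List[str] = []
    analysis: Optional[Constituent] = None  # left context built so far
    open_dp: Optional[Constituent] = None   # DP still waiting for its NP
    for word in words:
        proj = project(word)
        if analysis is None:
            analysis = proj
        elif proj.label == "TP":
            proj.left.append(analysis)           # left attachment: subject
            analysis = proj
        elif proj.label == "DP" and analysis.label == "TP":
            analysis.right[0].right.append(proj)  # right attachment under VP
            open_dp = proj
        elif proj.label == "NP" and open_dp is not None:
            open_dp.right.append(proj)            # right attachment under DP
        else:
            raise ValueError(f"pattern not covered by this sketch: {word}")
        stages.append(analysis.brackets())
    return stages


for stage in parse("Ana vio el edificio".split()):
    print(stage)
```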

3. FIPS and the recognition of idiomatic expressions: a proposal

The technologies developed at LATL (2008) include several lines of research on the processing of idiomatic expressions and collocations. For example, Nerima, Seretan, and Wehrli (2006) and Seretan (2008) use a hybrid, multilingual, syntactic-statistical procedure for extracting and recognizing collocations. Leoni de León (2008), for his part, has worked on a proposal for a lexical-syntactic representation aimed at recognizing and reproducing the behaviour of idiomatic expressions, from a perspective closer to lexicography. These proposals address the interfaces between lexicon and syntax from a computational perspective. In the same vein, it is worth mentioning the terminology assistance system TwicPen (Wehrli, 2006), which makes it possible to limit the number of translations between two language pairs on the basis of a linguistic analysis of a text selected for translation. TwicPen exploits the morphosyntactic resources of FIPS (as well as its available languages), combined with a syntactic treatment of collocations, which makes it possible to retrieve these units even when their constituent elements are morphologically modified or stand in long-distance relations. All this research is in progress, although it has already led to some publications (those mentioned above).

Idiomatic expressions, often regarded as static elements, can display quite a rich morphosyntax (Leoni de León, 2008). A good example is the idiomatic expression “meter la pata” ('to put one's foot in it'), common in colloquial Spanish. This expression is notable for allowing almost all the syntactic options available to an idiomatic expression. For example, the (verbal) head of “meter la pata” can be nominalized (9a.), or its internal argument can be pronominalized (9b.) in a discourse context, an operation that involves the adjunction of a complement:

(9) a. Metida de pata.
    b. La metió hasta el fondo.

These operations are hard for statistical extraction systems to take into account, an impression reinforced by the agreement relations between the head of an adjectival expression and a noun. For example, in the sequence “hecho polvo” ('worn out', literally 'made dust') it is the participle that agrees in gender and number, while the collocate undergoes no change:

(10) a. Él estaba hecho polvo.
     b. Ella estaba hecha polvo.

It is worth adding that the expression “hecho polvo” actually derives from the verbal form “hacer polvo”. This illustrates a cross-categorial relation running from a verbal form to an adjectival form. We thus have two idiomatic phenomena: one whose constituent elements may stand in long-distance relations, as in “meter la pata”, and one that can not only appear under different categories (the verbal form “hacer polvo” becomes an adjective, “hecho polvo”) but can also agree in gender and number, for example. The architecture of FIPS makes it possible to capture many of these phenomena.

In the case of clitic pronominalizations, such as example (9b.), identifying the expression as an instance of “meter la pata” requires establishing the relation between the direct-object clitic pronoun, “la”, and the argument position, which we take to be occupied by an empty category coindexed with the clitic. The adjunction of an adverbial complement (“hasta el fondo” in this case) must be recorded in the idiomatic knowledge base, as Leoni de León (2008) points out. As for the expression in (10), the key point is the need to establish a relation between the participle and the nominal element it refers to, independently of the noun “polvo”.

Idiomatic expressions are relatively easy to identify when their realization is linear. Such is the case of example (11), for which FIPS produces the analysis in (12). We know that the expression “romper un récord” ('to break a record') has been correctly identified by FIPS because FIPSTG reports the value “” for the “Collocation” tag, which is the unique identification number of this expression in FIPSBD (Table 6). Moreover, FIPS has no difficulty identifying the expression even if the indefinite article “un” is replaced by the definite article “el”; for this it was enough to state in FIPSBD that the expression requires the presence of an article.

(11) Él rompió un récord.

(12) [TP [DP El] rompio [VP [DP un [NP record]]]]

Word           rompió
ID
Lemma          romper
Collocation

Word           un
ID
Lemma          un
Function       OBJ

Word           récord
ID
Lemma          récord
Collocation

Table 6: Values of a transitive expression

Now, the ability of FIPS to recognize the expression in (11) is unaffected even if the direct object is modified by a prepositional phrase (“Él rompió el récord de Claudia”, 'He broke Claudia's record') or even if, in addition, the expression is realized as a passive sentence, “El récord de Claudia ha sido roto” ('Claudia's record has been broken'). Thus, as both the analysis in (13) and the FIPSTG results (Table 7) show, FIPS has no difficulty recognizing an expression even when long-distance relations have been established. This is achieved, on the one hand, by creating a chain of coindexations running from the empty category in postverbal position, “[DP e_i]”, to the determiner phrase containing the subject “El récord de Claudia”; on the other hand, the deep analysis of FIPSSYN identifies the prepositional phrase “de Claudia” as a subconstituent of the subject determiner phrase, “El récord”.

(13) [TP [DP El [NP record [PP de [DP Claudia]]]]_i ha [VP sido [VP roto [DP e_i]]]]

Word           el
ID
Lemma          el
Function       SUBJ

Word           récord
ID
Lemma          récord
Collocation    −

Word           roto
ID
Lemma          romper
Collocation    −
Function       SUB:récord

Table 7: Values of a passive expression

Among the values in Table 7 we find “SUB:récord” for “roto”. This value indicates that the parser recognized the lemma as the subject of “romper”; moreover, the value is also associated with the passive form of the verb, so the information is easy to retrieve. It is information about the grammatical subject of the sentence.
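How such function values make a collocation retrievable in both the active and the passive realization can be sketched as follows. The token structure, the collocation table and the lookup function are hypothetical and much simpler than the FIPSTG output; the point is only that the lookup relies on lemmas and grammatical functions rather than on adjacency.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    lemma: str
    category: str                   # "V", "AUX", "N", "PRON", ... (toy tag set)
    function: Optional[str] = None  # "SUBJ", "OBJ" or None


# Toy collocation table keyed by (verb lemma, noun lemma); contents are assumptions.
COLLOCATIONS: Dict[Tuple[str, str], str] = {
    ("romper", "récord"): "romper un récord",
    ("explotar", "mito"): "explotar un mito",
}


def find_collocation(tokens: List[Token], passive: bool = False) -> Optional[str]:
    """Look up a verb-object collocation from lemmas and grammatical functions.

    In a passive clause the deep object surfaces as the grammatical subject,
    so the noun is searched under SUBJ instead of OBJ.
    """
    verb = next((t.lemma for t in tokens if t.category == "V"), None)
    wanted = "SUBJ" if passive else "OBJ"
    noun = next((t.lemma for t in tokens
                 if t.category == "N" and t.function == wanted), None)
    return COLLOCATIONS.get((verb, noun))


# "Él rompió un récord."
active = [Token("él", "PRON", "SUBJ"), Token("romper", "V"), Token("récord", "N", "OBJ")]
# "El récord de Claudia ha sido roto."
passive = [Token("récord", "N", "SUBJ"), Token("Claudia", "N"),
           Token("haber", "AUX"), Token("ser", "AUX"), Token("romper", "V")]

print(find_collocation(active))
print(find_collocation(passive, passive=True))
```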

FIPS offers considerable scope for handling idiomatic expressions. Another strategy exists (Leoni de León, 2008): it proposes a correlational formalism, called Tsool, that encodes the morphosyntactic behaviour of idiomatic expressions. The formalism is implemented computationally in a system called Mulkin, which interacts with FIPS to exploit its syntactic analyses, combining them with stored phraseological information in order to recognize idiomatic expressions. Both Tsool and Mulkin are at an early stage of development and, as noted earlier, both aim at a representation closer to lexicography. The features considered include rhyme relations and the possibilities of commutation and permutation within expressions. One planned application of this system is filtering candidate sequences after an extraction run over large corpora.

4. Conclusion

FIPS is a syntactic parser able to identify deep relations between the constituents of a sentence. Its multilingual architecture is built from modules specialized in sets of syntactic phenomena by language family or group, which makes it easier to add new languages while maximizing reuse of the application's code. These properties are particularly useful for recognizing idiomatic sequences, which are not necessarily fixed: they can undergo modifications that cause their constituents to be realized discontinuously rather than linearly (long-distance relations).

Bibliography

Atserias, Jordi, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. Freeling 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), ELRA, Genoa, Italy, May.

Bick, Eckhard. 2008. A constraint grammar parser for Spanish. Web page. [URL: http://beta.visl.sdu.dk/pdf/TIL2006.pdf ; accessed 2 May 2008].

Bresnan, J. 2001. Lexical Functional Syntax. Blackwell, Oxford.

Chomsky, Noam. 1995. The Minimalist Program. MIT Press, Cambridge.

Chomsky, Noam. 2004. Beyond Explanatory Adequacy. In A. Belletti, editor, The Cartography of Syntactic Structures. Oxford University Press, Oxford.

Culicover, Peter and Ray Jackendoff. 2005. Simpler Syntax. Oxford University Press, Oxford.

La Serna, Nora. 2004. Un analizador sintáctico eficiente para gramáticas del español. Rev. investig. sist. inform., 1(1):19–26.

Laenzlinger, Christopher and Éric Wehrli. 1991. FIPS : Un analyseur interactif pour le français. TA Informations, 32(2):35–49.

LATL. 2008. Laboratoire d’Analyse et de Technologie du Langage. Web page. [URL: http://www.latl.unige.ch/ ; accessed 28 April 2008].

Leoni de León, Jorge Antonio. 2008. Modèle d’analyse lexico-syntaxique des locutions espagnoles. Thesis in linguistics, Université de Genève, Geneva, Switzerland, May.

Leoni de León, Jorge Antonio and Athina Michou. 2006. Traitement des clitiques dans un environement multilingue. In Piet Mertens, Cédrick Fairon, Anne Dister, and Patrick Watrin, editors, Verbum ex machina : Actes de la 13e conférence sur le traitement automatique des langues naturelles (TALN 2006), volume 1 of Cahiers du Cantal 2.1, pages 541–550, Louvain-la-Neuve, Belgique, 10-13 avril. Association pour le Traitement Automatique des Langues, UCL Presses Universitaires de Louvain.

LinGO Lab, CSLI. 2008. CSLI Linguistic Grammars Online. Web page. [URL: http://lingo.stanford.edu/ ; accessed 2 May 2008].

Nerima, Luka, Violeta Seretan, and Éric Wehrli. 2006. Le problème des collocations en TAL. Nouveaux cahiers de linguistique française, (27):95–115.

Seretan, Violeta. 2008. Collocation Extraction in Syntactic Parsing. Ph.D. thesis, Université de Genève, June.

Wehrli, Éric. 2004. Un modèle multilingue d’analyse syntaxique. In Antoine Auchlin, Marcel Burger, Laurent Filliettaz, Anne Grobet, Jacques Moeschler, Laurent Perrin, Corinne Rossari, and Louis de Saussure, editors, Structures et discours : Melanges offerts à Eddy Roulet, Langue et pratiques discursives. Éditions Nota bene, Canada, pages 311–332.

Wehrli, Éric. 2006. Twicpen: hand-held scanner and translation software for non-native readers. In Proceedings of the COLING/ACL on Interactive presentation sessions, pages 61–64, Morristown, NJ, USA. Association for Computational Linguistics.

Wehrli, Éric. 2007. Fips, a “Deep” Linguistic Multilingual Parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic, June. Association for Computational Linguistics.

Question Answering

Un sistema de búsqueda de respuestas basado en ontologías, implicación textual y entornos reales ∗

An User–centred Ontology– and entailment–based Question Answering System

Oscar Ferrández, Rubén Izquierdo, Sergio Ferrández and José Luis Vicedo

Dept. de Lenguajes y Sistemas Informáticos (Universidad de Alicante)

Carretera San Vicente s/n, 03690 Alicante, España

{ofe, ruben, sferrandez, vicedo}@dlsi.ua.es

Resumen: Este artículo presenta un sistema de Búsqueda de Respuestas basado en ontologías, implicación textual y requerimientos de usuario. Se propone una metodología para la construcción de una base de conocimiento de usuario que nos permite asociar preguntas en lenguaje natural con una representación formal de datos. El núcleo de nuestra estrategia se basa en la implicación textual, la cual permite detectar implicaciones entre preguntas y la base de conocimiento. El sistema ha sido desarrollado para el español y sobre el dominio de cine obteniendo unos resultados prometedores para su utilización en entornos reales.

Palabras clave: Búsqueda de Respuestas, Implicación textual, Interfaces de Lenguaje Natural, Modelado de Ontologías

Abstract: This paper presents a user-centred, ontology- and entailment-based Question Answering system. A methodology is proposed for building the user knowledge base, which lets us bridge the gap between natural language expressions and formal expressions such as database queries. The core of the system relies on an entailment engine capable of drawing inferences between queries and the knowledge base. The system has been developed for Spanish, covers the cinema domain, and obtains very promising results in real environments.

Keywords: Question Answering, Textual Entailment, Natural Language Interfaces, Ontology Modelling

1. Introduction

Question Answering (QA) arises from the need to retrieve specific information requested by users through natural language questions. In ontology-based QA systems, the data in which answers must be located have a structure defined by an ontology. An ontology specifies a representation of concepts and their relations in a specific domain. Ontologies play an essential role in the so-called Semantic Web¹, enabling knowledge exchange. However, to exploit this knowledge properly, tools are needed that connect the natural language used by users with the logical representation of ontologies.

To address this task, this paper presents an ontology-based QA system (which we call QACID) built around two main components: a knowledge base generated from tests with real users, and a textual entailment module. To create the knowledge base, the system collects user questions about a specific domain. These questions are analysed and grouped according to the information they request. Each group is manually associated with a SPARQL² query that gives access to the information requested by the user. This knowledge base models the interaction of users with the system.

Once the textual entailment module has been developed, natural language questions posed to the system are processed by this module, which infers semantic entailments between those questions and the previously grouped questions in order to associate each new question with its corresponding SPARQL query.

The rest of the paper is structured as follows. Section 2 presents the state of the art. Section 3 gives a detailed description of the system. Section 4 describes how the system's knowledge base is created from interaction with users. Section 5 presents the textual entailment module, and section 6 describes the evaluation and the results obtained. Finally, section 7 presents conclusions and future work.

∗ This research has been partially funded by the QALL-ME project, within the Sixth Framework Programme of the European Union, reference FP6-IST-033860, and by the Spanish Government, CICyT project number TIN2006-15265-C06-01.

¹ The Semantic Web is an extended Web endowed with greater meaning, supported by universal languages, that will let users find answers to their questions more quickly and easily thanks to better-defined information.

² SPARQL is a standard query language for retrieving information from RDF data (www.w3.org/TR/rdf-sparql-query).

2. State of the Art

Natural language interfaces to databases have been studied extensively (Androutsopoulos, 1996; Copestake and Jones, 1990; Chan and Lim, 2003; Popescu, Etzioni, and Kautz, 2003; Filipe and Mamede, 2000; Minock, 2005). Such tools let users pose requests to knowledge bases through natural language queries. There are two types of natural language interface, depending on their ability to process informal user queries:

1. Full natural language interfaces: systems that process unrestricted natural language questions. Our approach falls into this category.

2. Restricted natural language interfaces: systems that process questions formulated in a controlled language. Users must first learn that language before they can query the knowledge base.

A typical example is found in (Androutsopoulos, Ritchie, and Thanisch, 1993), which processes the natural language question and transforms it into an intermediate logical form that is later translated into SQL. Other approaches (Zelle and Mooney, 1996; Thompson and Mooney, 1999; Zhang and Yu, 2001) use machine learning methods to transform informal questions into structured logical representations. These systems are trained on domain-specific data provided by Mooney³ and achieve results close to 90%, but they need a large amount of data to be trained effectively. Other work (Rodrigo et al., 2005; Kang et al., 2004) tries to split the question into keywords and compose a formal query from them. Finally, we highlight two systems (Popescu, Etzioni, and Kautz, 2003; Wang et al., 2007) that transform natural language questions into structured SPARQL queries.

Processing natural language questions is usually complex and ambiguous. For this reason, many systems work only on questions formulated in a limited language with lexical and grammatical restrictions. Using these systems, however, requires the user to learn the restricted language and its syntax. The system presented in (Bernstein et al., 2005) uses “Attempto Controlled English” to formulate queries. Along the same lines, (Popescu, Etzioni, and Kautz, 2003) defines the notion of semantically tractable queries, trying to determine a priori the requirements of the question in order to generate the SQL query. In this approach, however, questions containing unknown words are not semantically tractable and cannot be processed.

Pattern-based methods are also used for tasks related to natural language interfaces. The system in (Lopez et al., 2007) processes questions and classifies them into 23 categories. If the input question can be classified into one of these 23 categories, the system can process it correctly. However, the limited coverage of the patterns means that many questions cannot be answered. By contrast, although QACID is also pattern-based, it avoids this problem by using a textual entailment module, which finds relations between pattern and question in most cases. Finally, the system in (Bernstein and Kaufmann, 2006) helps the user build the query, avoiding ambiguity by means of a natural language search engine. Once again, however, the capability of this system is limited by the restrictions of the language offered.

³ www.cs.utexas.edu/users/ml/nldata.html.

3. General Description

This section presents our QA system based on ontologies and on experience with real users: QACID (Question Answering on CInema Domain). QACID rests on four main components: the ontology, the structured data, the user knowledge base and the textual entailment module.

The ontology

An ontology-based QA system processes formally structured information about a specific domain determined by the ontology. In our case, we used OWL⁴ to design an ontology for the tourism domain (work carried out within the QALL-ME research project⁵); QACID uses the classes and relations of the Cinema sub-domain.

The data

The ontology has been populated with cinema-domain information provided by the company LaNetro⁶, with the data stored in RDF⁷ format. The RDF data serve as the database from which answers are extracted by means of SPARQL queries.

⁴ OWL (Web Ontology Language) is a markup language for publishing and sharing data using ontologies, www.w3.org/TR/owl-features/.

⁵ Described at qallme.itc.it/.

⁶ www.lanetro.com.

⁷ RDF (Resource Description Framework) is a metadata model that proposes a general method for modelling information, www.w3.org/RDF.

User Knowledge Base

To learn the many different ways in which information about the Cinema domain can be requested, we decided to acquire this knowledge from tests with real users. A group of people was asked to request data about the Cinema domain, yielding a set of questions representative of the domain.

The questions were then analysed and grouped automatically according to the information they request, forming question groups. Each group was manually associated with a SPARQL query that gives access to the information requested by the user. The final result is a set of question–SPARQL query pairs, which forms the user knowledge base of the QACID system.

Textual Entailment Module

The textual entailment module is the backbone of our QA strategy. It implements textual entailment techniques in order to infer semantic entailments between input questions and the previously obtained groups of the user knowledge base. This process associates SPARQL queries with input questions and thus retrieves the answers from the RDF data.

Figure 1 shows the general architecture of QACID. The following subsections detail the complete QA process, which consists of two main phases.

[Figure 1 diagram. Box labels: Data processing; Ontology; Data provider; RDF data; Lexicon of ontological concepts; Question analysis; Entity tagger; Morphological analysis; Pattern generator; Answer extraction; Textual entailment engine; SPARQL processing; User knowledge base; Patterns–SPARQL.]

Figure 1: Architecture of the QACID system.

3.1. Question Analysis

This module processes the input question to obtain a formal representation of it (following the format of the patterns in the user knowledge base).

Questions are analysed morphologically⁸, and the entities in the question are detected and tagged. To tag them, QACID applies fuzzy matching techniques⁹ between the words in the question and a lexicon generated from the data stored in the ontology. The lexicon is obtained automatically and includes the instances of the ontology with their respective ontological classes (for example, “Casino Royale” ⇔ [MOVIE]; “Alicante” ⇔ [DESTINATION]; “Cinesa Panoramis” ⇔ [CINEMA]).

The output of this module is a natural language question tagged with morphological information and with ontology concepts (see the third row of Table 1).
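The entity tagging step can be illustrated with a minimal sketch. It is not the QACID implementation: the paper uses the SecondString library, whereas this sketch assumes Python's standard `difflib.SequenceMatcher` as the fuzzy matcher, a hypothetical three-entry lexicon taken from the examples above, an arbitrary threshold of 0.85, and it tags only the single best-scoring span.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lexicon: ontology instance -> ontology class (examples from the text).
LEXICON = {
    "Casino Royale": "MOVIE",
    "Alicante": "DESTINATION",
    "Cinesa Panoramis": "CINEMA",
}


def tag_entities(question, lexicon=LEXICON, threshold=0.85):
    """Replace the best fuzzy match of a lexicon entry with its [CLASS] tag.

    Returns (tagged_question, bindings), where bindings maps the class name
    to the canonical lexicon entry that was matched.
    """
    words = list(re.finditer(r"\w+", question))
    best = None  # (score, start_char, end_char, entity, concept)
    for entity, concept in lexicon.items():
        n = len(entity.split())
        # Slide a window of n words over the question.
        for i in range(len(words) - n + 1):
            start, end = words[i].start(), words[i + n - 1].end()
            span = question[start:end]
            score = SequenceMatcher(None, span.lower(), entity.lower()).ratio()
            if score >= threshold and (best is None or score > best[0]):
                best = (score, start, end, entity, concept)
    if best is None:
        return question, {}
    _, start, end, entity, concept = best
    tagged = question[:start] + f"[{concept}]" + question[end:]
    return tagged, {concept: entity}
```

Because the comparison is approximate, a misspelt title such as “Casino Royal” is still mapped to the lexicon entry “Casino Royale”, and the returned binding keeps the canonical form for later use.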

3.2. Answer Extraction

The question is processed by the textual entailment module to determine the semantic entailments between it and the set of patterns in the user knowledge base. If this process succeeds, a generic SPARQL query is obtained (see the fourth row of Table 1) that retrieves the answer to the question. Then, before the RDF data are queried, the SPARQL query is instantiated with the original data from the user's question. Table 1 shows an example of the complete process.

Input question
¿Dónde puedo ver Casino Royale?

Morphological analysis
¿ [Fia] Dónde [donde PT000000] puedo [poder VMIP1S0] ver [ver VMN0000] Casino Royale [casino royale NP00000] ? [Fit]

Ontology concept tagging
¿Dónde puedo ver [MOVIE]?

SPARQL from textual entailment
SELECT DISTINCT ?nameCinema WHERE { ?movie name [MOVIE]. ?event hasEvent ?movie. ?event isInSite ?cinema. ?cinema name ?nameCinema }

Final SPARQL
SELECT DISTINCT ?nameCinema WHERE { ?movie name "Casino Royale". ?event hasEvent ?movie. ?event isInSite ?cinema. ?cinema name ?nameCinema }

Table 1: Example of the QA process.
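The last step of the example, turning the generic query into the final one, amounts to substituting each concept placeholder with the entity found in the question. A minimal sketch follows, assuming plain string substitution; the function name `instantiate` and the escaping of quotes are illustrative choices, not something the paper describes. The template is copied from the example, including its abbreviated predicates.

```python
# Generic query from the example above; predicates are abbreviated as in the paper.
TEMPLATE = (
    "SELECT DISTINCT ?nameCinema WHERE { "
    "?movie name [MOVIE]. ?event hasEvent ?movie. "
    "?event isInSite ?cinema. ?cinema name ?nameCinema }"
)


def instantiate(template, bindings):
    """Replace each [CONCEPT] placeholder with a quoted string literal."""
    query = template
    for concept, value in bindings.items():
        # Escape backslashes and double quotes so the literal stays well-formed.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        query = query.replace(f"[{concept}]", f'"{escaped}"')
    return query


final_query = instantiate(TEMPLATE, {"MOVIE": "Casino Royale"})
```

Keeping the template generic means that one stored query serves every question in a group, whatever film, cinema or destination the user names.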

The following sections explain in detail the creation of the user knowledge base and the textual entailment module.

⁸ For this task we use the Freeling toolkit, available at garraf.epsevg.upc.es/freeling/.

⁹ Specifically the SecondString library, available at secondstring.sourceforge.net.

4. User Knowledge Base

The main goal of the knowledge base is to hold a representative sample of questions reflecting users' interests and needs in the domain. Building it involves three steps:

1. Generate a set of meaningful questions for the domain of the ontology. For this, 50 users of different age, gender and nationality are selected. They are shown the ontology together with a list of real entities extracted from our data and asked to generate questions about any data of interest. This process produced a total of 500 questions.

2. Detect and tag the entities that appear in the question set. The Entity Annotator is applied to the 500 questions, replacing each detected entity with its corresponding ontology concept. For example, the question “¿Dónde puedo ver Saw 3?” (“Where can I see Saw 3?”) becomes “¿Dónde puedo ver [MOVIE]?”. As mentioned above, the Entity Annotator is implemented with a fuzzy string matching technique that tries to match substrings of free natural language against entities in our lexicon. Once the process is complete, repeated questions are removed, leaving a set of 348 distinct questions.

3. The annotated questions are grouped manually according to their semantic equivalence. Two questions are semantically equivalent when both request the same information and both contain the same ontological concepts. For example, the questions:

“¿Cuál es el número de teléfono del cine [CINEMA]?” (“What is the phone number of the cinema [CINEMA]?”)

“¿Cuál es el teléfono de contacto del cine [CINEMA]?” (“What is the contact phone of the cinema [CINEMA]?”)

belong to the same semantic cluster. A total of 54 semantically distinct groups are created. Once the groups have been created, a SPARQL query is associated with each one; it retrieves the answer to any of the questions in the group.
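The duplicate removal in step 2 can be sketched as follows. The paper does not say how two annotated questions are judged to be repeated; this sketch assumes they are compared after lowercasing and collapsing whitespace, which is only one possible choice.

```python
def deduplicate(annotated_questions):
    """Keep the first occurrence of each annotated question.

    Two questions count as repeated when they are identical after
    lowercasing and whitespace normalization (an assumption of this sketch).
    """
    seen = set()
    unique = []
    for question in annotated_questions:
        key = " ".join(question.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique
```

Tagging entities before deduplicating is what makes the reduction effective: questions about different films become the same string once the title is replaced by [MOVIE].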

Within a given group we have several questions requesting the same information, so it is straightforward to extract knowledge about how that information is requested. With this in mind, we developed a resource called the characterization of ontology attributes, which records the different ways in which users ask about a particular attribute. For example, among the questions in the group that requests the phone number of a cinema, we find three different ways users referred to this attribute: “número de teléfono”, “número de contacto” and “número telefónico”. This knowledge helps the textual entailment module detect paraphrases among the attributes that appear in questions.

In summary, the knowledge base consists of 54 groups. Each contains an average of 6.44 equivalent questions (348 questions over 54 groups), the corresponding SPARQL query, and the attribute characterization information.

Although the knowledge base was built from questions posed spontaneously by users, most attributes and relations of the ontology (88%) are covered by the final set of groups; the uncovered ones include, for example, the e-mail, fax and GPS coordinates of a cinema, about which no questions were asked. This high coverage indicates that the methodology for building the knowledge base is robust and appropriate, and that the attributes left out are very probably of little interest to users.

5. Textual Entailment Module

The textual entailment system establishes lexical-semantic inferences between a question and the set of predefined patterns in the knowledge base (see section 4). Relations between two questions are treated as unidirectional, following the methodology proposed in (Glickman, 2005) for textual entailment relations. The system is an extension of the one presented in (Ferrández et al., 2007), adapting and adding inferences relevant to the new setting.

5.1. Lexical Inferences

These consist of a set of lexical measures¹⁰ based on word co-occurrences and the context in which words appear. To compute them, questions are treated as bags of words, which is simple yet accurate, is computationally faster, and achieves results competitive with other approaches (see (Giampiccolo et al., 2007)). The measures are:

– Smith–Waterman algorithm
– Matching of consecutive substrings
– Jaro distance
– Euclidean distance
– Jaccard similarity coefficient
– Matching of interrogative terms
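As an illustration of one of these measures, the Jaccard similarity coefficient over bags of words can be sketched as below. The paper relies on the SimMetrics library for some measures; the tokenization here (lowercased word characters, treated as a set rather than a multiset) is an assumption of this sketch.

```python
import re


def bag_of_words(question):
    """Lowercased word tokens of a question, as a set."""
    return set(re.findall(r"\w+", question.lower()))


def jaccard(question_a, question_b):
    """Jaccard coefficient: shared tokens divided by all distinct tokens."""
    a, b = bag_of_words(question_a), bag_of_words(question_b)
    if not a and not b:
        return 1.0  # two empty questions are treated as identical
    return len(a & b) / len(a | b)


score = jaccard(
    "¿Cuál es el número de teléfono del cine [CINEMA]?",
    "¿Cuál es el teléfono de contacto del cine [CINEMA]?",
)
```

For the two phone-number questions from section 4, eight of the ten distinct tokens are shared, so the coefficient is 0.8.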

¹⁰ For some of them, the implementation in the SimMetrics library was used, http://www.dcs.shef.ac.uk/~sam/simmetrics.html.

5.2. Ontology-based Inferences

These correspond to knowledge derived directly from the ontology.

Concept restriction: both questions and patterns are tagged with ontology concepts. Consequently, a restriction is imposed: any pair of questions between which an entailment relation can be deduced must contain the same concepts, in both number and type.

Inference based on ontological attributes: relying on the knowledge acquired about the different ways users refer to ontology attributes (i.e. the characterization of ontology attributes, see section 4), the system implements an inference over the ontological attribute or attributes requested by the questions posed to the system. The procedure uses the attribute characterization to detect the presence of attributes (normally these are the requested information). Patterns containing attributes equivalent to attributes of the input question are scored positively in textual entailment detection. Two attributes are equivalent if they are expressed in the same way or through one of their paraphrases stored in the attribute characterization. The final weight obtained is:

$$Att_{sim} = \frac{\sum_{a_i \in Pg,\ a_j \in Pt} Eql(a_i, a_j)}{|Pg|} \qquad (1)$$

where Pg and Pt contain the attributes of the input question and of the pattern being processed, and Eql(a_i, a_j) takes the value:

$$Eql(a_i, a_j) = \begin{cases} 1 & \text{if } a_i = a_j \text{ or they are paraphrases,} \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

Therefore, two or more attributes belonging to the same question have the same importance, while patterns that contain no equivalent attributes are considered less relevant during textual entailment.
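Equations (1) and (2) translate almost directly into code. In this sketch the attribute characterization is assumed to be a dictionary from a hypothetical attribute identifier to its known paraphrases (the three phone-number variants come from section 4); the names `PARAPHRASES`, `eql` and `att_sim` are illustrative, and returning 0.0 when the question has no attributes is an added convention, since the formula is undefined in that case.

```python
# Hypothetical attribute characterization: attribute id -> known paraphrases.
PARAPHRASES = {
    "phone": {"número de teléfono", "número de contacto", "número telefónico"},
}


def eql(a_i, a_j, paraphrases=PARAPHRASES):
    """Equation (2): 1 if the attributes are equal or paraphrases, else 0."""
    if a_i == a_j:
        return 1
    return int(any(a_i in group and a_j in group for group in paraphrases.values()))


def att_sim(question_attrs, pattern_attrs, paraphrases=PARAPHRASES):
    """Equation (1): equivalent attribute pairs divided by |Pg|."""
    if not question_attrs:
        return 0.0  # convention of this sketch; equation (1) is undefined here
    total = sum(
        eql(a_i, a_j, paraphrases)
        for a_i in question_attrs
        for a_j in pattern_attrs
    )
    return total / len(question_attrs)
```

With these definitions, a question asking for “número de contacto” scores 1.0 against a pattern asking for “número de teléfono”, and a question with two attributes of which only one is matched scores 0.5.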

Finally, each inference yields a similarity factor between zero and one, so the final entailment coefficient is the sum of all factors divided by the number of inferences considered. To decide on entailments, a threshold is set empirically on a set of training questions; new questions whose entailment coefficient exceeds the threshold are considered correct textual entailment deductions. The next section details the training phase of the system and the results obtained in its evaluation.
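The averaging and thresholding just described can be sketched as follows. The paper specifies the average and the threshold; picking the highest-scoring pattern, accepting a coefficient equal to the threshold, and the function names are assumptions of this sketch.

```python
def entailment_coefficient(factors):
    """Mean of the per-inference similarity factors, each expected in [0, 1]."""
    if not factors:
        raise ValueError("at least one inference factor is required")
    return sum(factors) / len(factors)


def select_pattern(factors_by_pattern, threshold):
    """Return (pattern_id, coefficient) for the best pattern.

    Returns None when no pattern reaches the threshold, i.e. the question
    is treated as uncertain.
    """
    best_id, best_score = None, -1.0
    for pattern_id, factors in factors_by_pattern.items():
        score = entailment_coefficient(factors)
        if score > best_score:
            best_id, best_score = pattern_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score
```

A caller would compute one list of factors per knowledge-base pattern (lexical measures plus ontology-based inferences) and then use the SPARQL query of the selected pattern.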

6. Evaluation and Results

To check the effectiveness of QACID, we developed an evaluation framework for the cinema domain. Because the system's results depend directly on the textual entailment module, we focused on evaluating that module's ability to detect entailments between questions correctly.

6.1. Evaluation Framework

This evaluation checks how representative and useful our knowledge base is for making deductions, and how precise the textual entailment module is. To this end, 10 new users are asked to generate one new question for each cluster using the entities stored in our lexicon. This gives 540 new questions in total (10 users × 54 clusters), which are split into two groups: the training set and the test set. The first is used to tune the decision threshold of the textual entailment module, so that when the similarity value returned by the module is below the threshold, the question is considered uncertain¹¹. The test set is used as a blind set for the final evaluation of the system. The sets contain 378 questions (7 users) and 162 questions (3 users) respectively. Note also that repeated questions are not removed from these sets, and that all questions belong to the cinema domain (this evaluation does not consider out-of-domain questions).

¹¹ Although uncertain questions are counted as errors when computing precision, recall and F, analysing them will help guide how knowledge in the textual entailment module should be enriched.

6.2. Analysis of Results

Figure 2 shows how precision, recall and F-measure evolve with the decision threshold. Each plot corresponds to an experiment using the inferences described in sections 5.1 and 5.2:

Lexical Baseline (BL): implements all the lexical inferences (see section 5.1) and is the baseline against which the improvement from ontology-based inferences is measured.

BL+Concept Restriction (BL+RC): adds to the lexical baseline the restriction on matching ontology concepts.

BL+RC+Attribute-based Inference (BL+RC+IbA): implements all lexical and ontological inferences, including the one based on paraphrases of ontology attributes.

[Figure 2: three plots of precision, recall and F-measure (vertical axis, up to 100) against the decision threshold (horizontal axis, 0.4 to 0.9; 0.3 to 0.9 in panel c). Panels: (a) Lexical Baseline (BL); (b) BL+Concept Restriction (BL+RC); (c) BL+RC+Attribute-based Inference (BL+RC+IbA).]

Figure 2: Precision, recall and F-measure for each inference on the training set.

As can be seen, the system obtains good results for all thresholds because almost all new questions have a correct associated pattern in the knowledge base. This supports the effectiveness of the knowledge base construction. On the training questions, the decision threshold that gives the best precision without compromising recall is 0.6 for the BL and BL+RC experiments, and 0.5 for the BL+RC+IbA experiment.

Table 2 shows the results on the test set. Although precision, recall and F-measure are slightly lower than on the training set, the system behaves similarly in both cases.

Exp.         Threshold   Prec.   Rec.    F
BL           0.6         57.61   93.21   71.21
BL+RC        0.6         76.64   84.57   80.41
BL+RC+IbA    0.5         89.24   97.53   93.2

Table 2: Results on the test set.
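The F column can be checked against the precision and recall columns. The paper does not state which F-measure it uses; the sketch below assumes the balanced harmonic mean (F1), which reproduces the reported values to within rounding.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) for each experiment on the test set.
TEST_RESULTS = {
    "BL": (57.61, 93.21, 71.21),
    "BL+RC": (76.64, 84.57, 80.41),
    "BL+RC+IbA": (89.24, 97.53, 93.2),
}

for name, (p, r, reported) in TEST_RESULTS.items():
    print(f"{name}: computed F = {f_measure(p, r):.2f}, reported F = {reported}")
```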

As expected, the BL experiment obtains the lowest results, but these improve substantially when ontology-based knowledge is included. The restriction based on concept matching (BL+RC) improves the F-measure by 12.91% relative to BL, while adding the attribute alignment inference (BL+RC+IbA) yields a relative improvement of 30.88% over BL.
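The two improvement figures are relative gains over the BL F-measure, not differences in percentage points. A short check, assuming the usual definition of relative gain and the F values of the test set:

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100


gain_rc = relative_gain(71.21, 80.41)   # BL -> BL+RC
gain_iba = relative_gain(71.21, 93.2)   # BL -> BL+RC+IbA
print(f"BL+RC: {gain_rc:.2f}%  BL+RC+IbA: {gain_iba:.2f}%")
```

The first value comes out at about 12.92%, which matches the reported 12.91% up to the last digit, and the second at about 30.88%. In absolute terms the gains are 9.2 and 21.99 F-measure points.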

This demonstrates the appropriate use of: (1) the lexical measures without ontological information, and (2) the additional semantic knowledge extracted from the ontology.

7. Conclusion y trabajos futuros

La principal aportacion de este artıculo esel desarrollo de una metodologıa para la crea-cion de un sistema de BR sobre dominios res-tringidos. Dicha metodologıa hace uso de unaontologıa que modela el dominio y de una ba-se de conocimiento creada a partir de las ne-cesidades e intereses de los usuarios. Ademas,el sistema utiliza un modulo de implicaciontextual para deducir inferencias semanticasentre una nueva pregunta y el conjunto depatrones que almacena en su base de conoci-miento.

The results obtained show a high precision for the system, and in particular for the textual entailment module. Moreover, these values were obtained without using complex semantic resources that could compromise the efficiency of the system.

The proposed methodology could be extrapolated to other domains or other languages; this would require an ontology that models the new domain and a set of users to create the knowledge base in the required language.

Finally, as future work we plan to extend the system by detecting temporal and spatial expressions. Questions such as “Which is the nearest cinema where I can see [MOVIE]?” or “Where can I see [MOVIE] tomorrow?” would fall within this line of work. In addition, although detection of the expected answer is, to a greater or lesser extent, covered by our attribute characterization, we also plan to add a specific module that incorporates this knowledge into the textual entailment module. For example, in the question “Tell me where the film [MOVIE] was shot” this information would be of great value, since the attribute “studio” is never mentioned. However, if it were formulated as “Tell me in which studio the film [MOVIE] was shot”, our attribute characterization would provide enough knowledge for the textual entailment module.




The influence of Semantic Roles in QA: A comparative analysis∗

La influencia de los roles semánticos en BR: Un análisis comparativo

P. Moreda, H. Llorens, E. Saquete, M. Palomar
Natural Language Processing Research Group
University of Alicante
Alicante, Spain

{paloma,hllorens,stela,mpalomar}@dlsi.ua.es

Resumen: Los conjuntos de preguntas utilizados normalmente para evaluar los sistemas de búsqueda de respuestas (BR) están principalmente constituidos por preguntas cuyas respuestas son entidades nombradas (NE); por tanto, la mayoría de estos sistemas usan reconocedores de entidades para extraer las posibles respuestas. Últimamente, el etiquetado de roles semánticos y su contribución a la BR es un tema de especial interés. Sin embargo, los sistemas basados en NEs siempre funcionarán mejor que los basados en roles a la hora de extraer respuestas para preguntas cuya respuesta sea una NE. El objetivo de este artículo es evaluar ambos métodos para preguntas de lugar bajo las mismas condiciones, usando no solo preguntas basadas en nombres propios sino también basadas en nombres comunes. Para ello se presentan tres propuestas diferentes de un módulo de extracción de respuestas embebidas en un sistema de BR: una basada en entidades nombradas y dos basadas en roles semánticos. Los resultados obtenidos indican que mientras la propuesta de NE contesta mejor las preguntas basadas en nombres propios (+49,57% MRR), las propuestas de roles obtienen los mejores resultados en las preguntas basadas en nombres comunes (+223,48% MRR), siendo sus resultados de una precisión más alta para ambos tipos de preguntas.
Palabras clave: Roles Semánticos, Entidades Nombradas, Búsqueda de Respuestas

Abstract: Question sets normally used to evaluate QA systems consist mainly of questions whose answers are named entities (NEs); therefore, most of these systems rely on named entity recognizers (NERs) to extract possible answers. Semantic role labeling and its contribution to question answering have recently become a topic of special interest. Nevertheless, NE-based systems will always work better than semantic role (SR)-based ones when extracting answers for questions with NE-based answers. The aim of this paper is to evaluate both approaches for location questions under the same conditions, using not only NE-based questions but also common noun-based ones. To this end, we present three different proposals for an answer extraction module embedded in a QA system: one based on named entities and two based on semantic roles. Results show that while the NE-based approach performs better on NE-based questions (MRR +49.57%), the SR-based approaches show the best results on common noun-based ones (MRR +223.48%) and obtain a higher precision on both types of questions.
Keywords: Semantic Roles, Named Entities, Question Answering

1 Introduction

Nowadays, the question answering (QA) task represents one of the main lines of research in natural language processing (NLP). Its goal is for computers to answer precise or arbitrary questions formulated by users in natural language (NL). In summary, the main objective of a QA system is to determine “WHO did WHAT to WHOM, WHERE, WHEN, HOW and WHY?” (Hacioglu and Ward, 2003).

∗ This paper has been partially supported by the Spanish government, project TIN-2006-15265-C06-01 and project GV06/028, and by the framework of the project QALL-ME, which is a 6th Framework Research Programme of the European Union (EU), contract number: FP6-IST-033860.

Conferences such as TREC¹ and CLEF² aim to evaluate these systems by requiring all participants to use the same corpus to answer a specific question set provided by the organizers. The question sets used to evaluate QA systems are built mainly from questions whose answer is a named entity (NE) (hereafter referred to as NE-based questions). In contrast, questions whose answer is composed of common nouns (hereafter referred to as common noun-based questions) are not easy to find in these corpora.

¹ http://trec.nist.gov/

Due to this fact, most QA systems have used named entity recognizers (NERs) to extract possible answers for a question (Pizzato and Mollá-Aliod, 2005; Mollá, 2006). NERs identify entities and classify them into different categories. For each question, once the question type is recognized, NE-based QA systems extract NEs of this type as potential answers.

Recently, semantic role labeling (SRL) has received much attention, and question answering (QA) has been pointed out as one of the areas where the contribution of semantic roles (SR) will be most interesting (Gildea and Jurafsky, 2002). For each predicate in a sentence, semantic roles identify all constituents, determining their roles (agent, patient, instrument, etc.) and also their adjuncts (locative, temporal, manner, etc.). In this way, semantic roles represent “WHO did WHAT to WHOM, WHERE, WHEN, HOW and WHY?” in a sentence (see Figure 1), which indicates that their use in answer extraction could be very useful.

[TEMP Yesterday], [PATIENT John] was hit [INSTRUMENT with a baseball] [AGENT by Mary] [LOC in the park]

[AGENT Mary] hit [PATIENT John] [INSTRUMENT with a baseball] [TEMP yesterday] [LOC in the park]

(Question words marked on both sentences in the figure: WHO, WHOM, WHAT, WHEN, WHERE.)

Figure 1: Application of semantic roles in QA

² http://www.clef-campaign.org/

Several works use SR in the answer extraction modules of QA systems (Ofoghi, Yearwood, and Ghosh, 2006; Kaisser, 2007; Lo and Lam, 2006; Shen et al., 2007), but all of them have been evaluated using NE-based questions.

To compare both kinds of approaches, we present a fair benchmark that uses a balanced location question set containing both types of questions (common noun-based and NE-based). Moreover, we present three different proposals for an answer extraction module embedded in a QA system: one based on named entities, one based on semantic roles using rules, and one based on semantic roles using patterns. In this manner, the influence of using semantic roles in QA systems is analyzed and compared to an NE-based solution.

The paper is structured as follows: Section 2 introduces the background of SR applied to QA systems; Section 3 describes our QA system and the three proposals for an answer extraction module: a) NEs, b) SR using rules, and c) SR using patterns; Section 4 analyzes the evaluation results for the different approaches. Finally, some conclusions and directions for future work are presented.

2 Background

Ever since the first automatic SRL system (Gildea and Jurafsky, 2002), the application of semantic roles to QA systems has been proposed. One of the first works using SR in QA was presented in (Narayanan and Harabagiu, 2004).

All QA systems have a very similar architecture which, as described in the literature of the field (Ferrández, 2003), can be summarized in the following modules:

• Question analysis: The main objective of this module is to extract all the useful information from the question (Pizzato and Mollá-Aliod, 2005), such as the type of question, the type of answer, the question focus and information related to the content of the question (keywords, syntactic and semantic information, question topic and so on).

• Document retrieval: This module uses information retrieval techniques to obtain a set of relevant documents, thereby removing most of the documents in the collection from further processing.

• Passage retrieval: Only the relevant passages, or any other information unit such as documents or snippets, within the relevant documents are selected, using different natural language processing techniques.

• Answer extraction: In this module, the objective is to determine which parts of the selected sentences are potential answers. To date, one of the simplest ways to perform this task is to return the text of the sentence that is labeled as a named entity with the same type as the expected answer type. However, semantic role labeling and its contribution to question answering have recently become a topic of interest.
Finally, all possible answers found are scored and re-ranked in order to determine the exact answer to the question.
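The four modules can be read as a chain of functions in which only the last stage changes between the approaches compared in this paper. The following is a minimal sketch of that reading, not the implementation used here: the function names, the dictionary-based question description and the toy stages are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

QuestionInfo = Dict[str, object]   # hypothetical: {"type": ..., "keywords": [...]}
Candidate = Tuple[str, float]      # (answer text, score)


def answer_question(
    question: str,
    analyze: Callable[[str], QuestionInfo],
    retrieve_documents: Callable[[QuestionInfo], List[str]],
    retrieve_passages: Callable[[QuestionInfo, List[str]], List[str]],
    extract_answers: Callable[[QuestionInfo, List[str]], List[Candidate]],
) -> List[Candidate]:
    """Run the four modules in order and return candidates, best score first."""
    info = analyze(question)
    documents = retrieve_documents(info)
    passages = retrieve_passages(info, documents)
    candidates = extract_answers(info, passages)
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Toy stages, only to show how the pieces plug together.
def toy_analyze(question: str) -> QuestionInfo:
    return {"type": "location", "keywords": question.rstrip("?").split()}


def toy_documents(info: QuestionInfo) -> List[str]:
    return ["Berlin is the largest city in Germany. Munich is in Bavaria."]


def toy_passages(info: QuestionInfo, documents: List[str]) -> List[str]:
    return [s.strip() for d in documents for s in d.split(".") if s.strip()]


def toy_extract(info: QuestionInfo, passages: List[str]) -> List[Candidate]:
    # Score each passage by keyword overlap; use its first word as the "answer".
    keywords = info["keywords"]
    return [(p.split()[0], float(sum(k in p for k in keywords))) for p in passages]


ranked = answer_question(
    "What is the largest city in Germany?",
    toy_analyze, toy_documents, toy_passages, toy_extract,
)
```

Passing the answer extraction stage as a parameter mirrors the design choice of this paper: the same pipeline is run with interchangeable extraction modules.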

Regarding the use of semantic roles in QA systems, systems can be divided into two main groups: a) systems using semantic roles to obtain extra information that complements other methods, and b) systems using semantic roles as the core method of a module in the QA architecture.

2.1 Roles as a complementary method

In this case, QA systems are based on NERs, and semantic roles only provide additional information, in order to analyze the possible improvement in the results of the QA system (Sun et al., 2005; Lo and Lam, 2006; Shen et al., 2007; Melli et al., 2006). These approaches only give information about whether or not semantic roles are able to complement an NER approach.

2.2 Roles as a core method

These approaches use semantic roles to perform one module of the QA system. A brief summary of the main systems is presented in Table 1.

System        QSet           Use                                    Method
Narayanan     ad-hoc         Type answer                            Map. Q. Pattern - Answer Pattern
Stenchikova   TREC, Trivia   Answer Extrac.                         Rules type Q. - Answer role
Ofoghi        TREC           Answer Extrac.                         Map. Q. Pattern - Answer Pattern
Kaisser       TREC           Answer Extrac.                         Map. Q. Pattern - Answer Pattern
Moschitti     TREC           Q. Classif., A. Classif., A. Rerank.   Supervised Machine Learning
Fliedner      ad-hoc         Answer Extrac.                         Map. Q. Frame - Answer Frame

Table 1: Summary of the use of semantic roles in QA systems

As shown in the table, most of the systems use a mapping between the semantic information of the question and the semantic information of candidate answers. The system of (Narayanan and Harabagiu, 2004) was the first proposal to use SR in QA systems, applying them to determine the type of the answer of complex questions. Their evaluation over an ad-hoc set of 400 questions indicated a precision of 73.5% in detecting the type of the answer. The work of (Ofoghi, Yearwood, and Ghosh, 2006) carried out a manual test over a set of 15 questions in order to extract candidate answers to a question using semantic roles. The evaluation of this approach on the TREC2004 question corpus showed an MRR of 38.89%. Kaisser’s system (Kaisser, 2007) is a very similar proposal. It was evaluated on a subset of the TREC2002 question corpus and obtained a precision of 36.70%. Fliedner (Fliedner, 2007) proposes representing both the question and the passages containing a possible answer as FrameNet-style structures. The answer is obtained by a mapping process between both structures. Results for open-domain questions showed a precision of 66% and a recall of 33%.

In addition, another system (Stenchikova, Hakkani-Tur, and Tur, 2006) establishes a set of rules that relate some types of questions (who, when, where or what) to the role type of the expected answer. In this case, the evaluation of the system yields an MRR of 30%.

In contrast, Moschitti (Moschitti et al., 2007) proposes a supervised learning algorithm that uses information from a semantic analysis tree composed of the sentence predicate and its arguments tagged with SR. The results obtained show the usefulness of this information for the classification (MRR 56.21%) and re-ranking (MRR 81.12%) of answers, but not for question classification.

One of the most important problems of all these systems is the extraction of the semantic roles of the question. This is because semantic role labeling tools have serious problems annotating questions, since the corpora used to train them do not contain many questions.

Once the different proposals have been analyzed, it seems clear that the main contribution of SR to QA systems lies in the answer extraction module. However, NE-based systems will always work better than SR-based ones when extracting answers for questions with NE-based answers. Therefore, a balanced evaluation using not only NE-based questions but also common noun-based ones is proposed.

3 Implementation: three answer extraction approaches in the same QA system

To make a fair comparative analysis of the influence of SR and NE in QA systems, a QA system has been implemented and three different answer extraction approaches have been embedded in it. Each approach is then evaluated separately in order to compare their results.

A simple QA system has been developed following the steps indicated in (Pizzato and Mollá-Aliod, 2005). The information retrieval module uses snippets obtained from several Internet search engines, and the answer extraction module has been modified in order to add the two SR approaches.

Since the behavior of a QA system could differ depending on the type of SR, this work analyzes only location questions to minimize external influences. The same analysis could be carried out for other kinds of questions simply by defining appropriate rules or patterns for each answer extraction approach.

The first proposal is based on NEs, so that its results can be compared to those of the SR-based approaches. The second and third proposals are both based on SR. The second one uses semantic rules that establish relationships between the type of question and SR, and the third one uses semantic patterns built using SR information. Figure 2 shows a diagram of the implemented system architecture.

3.1 NE-based answer extraction

Figure 2: Architecture of the QA system with three different answer extraction modules

This is the simplest approach and the one used in the QA system described by Pizzato et al. Once the question type has been inferred by the system and the relevant snippets have been collected as the corpus for answering a question, all tagged NEs in the corpus that match the question type are selected as potential answers. We used the LingPipe³ named-entity recognizer to identify location names.
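As an illustration of this selection step, the sketch below filters already-tagged entities by the expected answer type. It does not call LingPipe: the tagger output format (a list of (text, type) pairs per snippet) and the `LOCATION` label are hypothetical, and ranking candidates by frequency is an assumption added for the example, not something the description above specifies.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical tagger output: one list of (entity text, entity type) per snippet.
TaggedSnippet = List[Tuple[str, str]]


def ne_candidates(snippets: Iterable[TaggedSnippet], expected_type: str) -> List[Tuple[str, int]]:
    """Keep entities whose type matches the question type, most frequent first."""
    counts = Counter(
        text
        for snippet in snippets
        for text, ne_type in snippet
        if ne_type == expected_type
    )
    return counts.most_common()


snippets = [
    [("Berlin", "LOCATION"), ("Angela Merkel", "PERSON")],
    [("Berlin", "LOCATION"), ("Munich", "LOCATION")],
]
location_answers = ne_candidates(snippets, "LOCATION")
```

Because every entity of the right type is kept, this style of extraction tends to favor recall, and an answer that is not a named entity can never be returned.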

3.2 SR-based answer extraction using rules

For each type of question, and its expected answer type, a different set of SR could be considered as a possible answer. It is possible to define a set of semantic rules that establishes relationships between the type of the question and an SR. A summary of these semantic relationships is shown in Table 2 (Moreda, Navarro, and Palomar, 2007).

Using these rules, the answer extraction module selects as possible answers all the arguments of the snippets returned by the information retrieval module that play the location role (AM-LOC). We used the SemRol tool (Moreda and Palomar, 2006) to determine the roles of sentence arguments.
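A rule of this kind can be sketched as a lookup from question type to the role expected to hold the answer. The code below is illustrative only: it assumes role-labeled sentences are available as (role, text) pairs and does not reproduce the actual output format of SemRol. Only the location rule used in this work is filled in.

```python
from typing import Dict, List, Tuple

# One sentence = list of (role label, argument text) pairs (hypothetical format).
Argument = Tuple[str, str]

# Question type -> role expected to contain the answer. Only the location
# rule is given here; other question types would be added analogously.
ANSWER_ROLE: Dict[str, str] = {"where": "AM-LOC"}


def rule_based_candidates(question_type: str, sentences: List[List[Argument]]) -> List[str]:
    """Return every argument whose role is the one the rule associates with the question."""
    role = ANSWER_ROLE.get(question_type)
    if role is None:
        return []
    return [text for args in sentences for label, text in args if label == role]


sentences = [
    [("A0", "Mary"), ("AM-LOC", "in the park")],
    [("A0", "Mary"), ("A4", "to the park")],
]
where_answers = rule_based_candidates("where", sentences)
```

Note that the second sentence contributes nothing, because its location is carried by a numbered argument rather than by AM-LOC.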

³ http://www.alias-i.com/lingpipe/



Question                                        Role                                             No Role
Where, In where, In what + exp, At what + exp   Location                                         ProtoAgent, Mode, Temporal, Cause, ProtoPatient
When, In what + exp, What + exp                 Temporal                                         ProtoAgent, Mode, Location, Cause, ProtoPatient
How                                             Mode, Theme (if it is a diction verb)            ProtoAgent, Location, Temporal, Cause, Patient, Beneficiary
Who                                             [Proto]Agent, [Proto]Patient                     Mode, Temporal, Location, Theme, Beneficiary
What                                            Cause, Theme
Whose                                           Receiver, Beneficiary, Patient, ProtoPatient     Agent, Location, Mode, Temporal, Theme, Cause

Table 2: Set of semantic relationships

3.3 SR-based answer extraction using patterns

The motivation for this third approach is that not all location arguments are represented by the specific location role (AM-LOC), so the previous approach does not consider all the possibilities.

For instance, the sentences in example 1 and example 2 each have an argument with a location meaning (“to the John’s house” and “to the park”, respectively), but it is not labeled with the AM-LOC role. Instead, the location is represented by the A2 role in one case (example 1) and by the A4 role in the other (example 2).

(1) [A0 Mary] is going [A2 to the John’s house].

(2) [A0 Mary] is going [A4 to the park].

As shown in (Moreda, Navarro, and Palomar, 2007), in PropBank the location can be represented by the A2, A3, A4 or AM-LOC semantic roles.

Therefore, the answer extraction module based on rules is not able to detect all the possibilities. A first idea would be to consider AM-LOC when it appears, and the other roles when it does not. This is possible because when the A2, A3 or A4 roles represent location, no other argument can have the location role. The problem is determining which of the roles A2, A3 or A4 represents the location if several of them appear in the same sentence.
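The "first idea" and its limitation can be made concrete with a short sketch. It assumes sentences are given as (role, text) pairs, which is an illustrative format rather than the output of any particular labeler, and it reports an ambiguity flag instead of guessing when several numbered arguments compete.

```python
from typing import List, Tuple

Argument = Tuple[str, str]                     # (role label, argument text)
NUMBERED_LOCATION_ROLES = ("A2", "A3", "A4")   # roles that may encode location


def location_candidates(args: List[Argument]) -> Tuple[List[str], bool]:
    """Prefer AM-LOC; otherwise fall back to A2/A3/A4.

    Returns (candidates, ambiguous). `ambiguous` is True when the fallback
    finds more than one numbered argument, which is exactly the case this
    simple heuristic cannot resolve.
    """
    am_loc = [text for label, text in args if label == "AM-LOC"]
    if am_loc:
        return am_loc, False
    numbered = [text for label, text in args if label in NUMBERED_LOCATION_ROLES]
    return numbered, len(numbered) > 1


clear_case = location_candidates([("A0", "Mary"), ("A4", "to the park")])
unclear_case = location_candidates([("A0", "Mary"), ("A2", "a book"), ("A4", "to the park")])
```

In the second call both A2 and A4 are returned and flagged, since nothing in the role labels alone says which one denotes the location.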

To solve this problem, and considering the work presented in (Yousefi and Kosseim, 2006) on an answer extraction module based on patterns using named entities, we propose the automatic construction of a set of semantic patterns based on semantic roles. This set of semantic patterns will cover most of the ways in which semantic roles represent location. The process consists of four stages:

1. Snippet retrieval. For each question-answer pair, the set of terms that a relevant document should contain is defined. Then, a query using these terms is submitted to the Web, and the retrieved snippets containing some of the terms are selected.

(a) The set of relevant terms is composed of 1) the noun phrases of the question, and 2) all the possible combinations of sub-phrases of the answer.

(b) The search engines used to submit the terms to the Web are MSN⁴, AskJeeves⁵, Google⁶, Altavista⁷ and Gigablast⁸.

(c) For each search engine, the first 100 retrieved snippets that contain, in the same sentence, the terms of the question and at least one of the terms of the answer are selected.

2. Semantic filtering of snippets. Sentences of snippets containing synonyms, hypernyms or hyponyms of the question verb are selected. This semantic information is obtained from WordNet (Miller et al., 1990).

3. Generating the answer pattern. The selected sentences are generalized into semantic patterns using information about semantic roles.

⁴ http://es.msn.com/ (March 2008)
⁵ http://es.ask.com/ (March 2008)
⁶ http://www.google.es/ (March 2008)
⁷ http://es.altavista.com/ (March 2008)
⁸ http://beta.gigablast.com/ (March 2008)



(a) Each sentence is annotated with semantic roles using the SemRol tool (Moreda and Palomar, 2006) in order to identify location arguments (AM-LOC, A2, A3 or A4 semantic roles).

(b) The argument corresponding to one of the sub-phrases of the answer is replaced by its semantic role tag.

(c) Arguments corresponding to the noun phrases of the question are replaced by <QARGn> tags, where n is the phrase counter.

(d) Other arguments of the sentence are replaced by <ARGn> tags, where n is the argument counter. The rest of the data is discarded.

4. Pattern clustering. Regardless of the position of the tags, if two patterns have the same tags but different verbs, they are merged into a single pattern containing the set of tags and a list of those verbs.

Once this process has been completed, the answer extraction module operates as follows: when a new location question is formulated, the sentences of the snippets returned for this question are annotated with semantic roles using the SemRol tool (Moreda and Palomar, 2006) and generalized into patterns, one for each location semantic role (AM-LOC, A2, A3, A4) in the sentence. These patterns are matched against the set of patterns in our database. If there is a match, the text corresponding to the semantic role tag in the pattern is retrieved as an answer.
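Stages 3 and 4 and the matching step can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: sentences are already annotated as a verb lemma plus (role, text) pairs, phrase matching is plain case-insensitive substring search, a pattern is stored as an unordered set of tags, and a match additionally requires the sentence verb to be in the verb list of the pattern. Snippet retrieval and WordNet filtering are omitted.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Argument = Tuple[str, str]                    # (role label, argument text)
LOCATION_ROLES = {"AM-LOC", "A2", "A3", "A4"}
PatternDB = Dict[FrozenSet[str], Set[str]]    # tag set -> verbs seen with it


def contains_any(text: str, phrases: Iterable[str]) -> bool:
    return any(p.lower() in text.lower() for p in phrases)


def generalize(args: List[Argument], question_phrases: List[str], slot: int) -> FrozenSet[str]:
    """Stage 3: the answer slot keeps its role tag; other arguments are numbered."""
    tags, q_count, other_count = [], 0, 0
    for i, (role, text) in enumerate(args):
        if i == slot:
            tags.append(role)
        elif contains_any(text, question_phrases):
            q_count += 1
            tags.append(f"QARG{q_count}")
        else:
            other_count += 1
            tags.append(f"ARG{other_count}")
    return frozenset(tags)   # a set ignores tag position, as stage 4 requires


def location_slots(args: List[Argument]) -> List[int]:
    return [i for i, (role, _) in enumerate(args) if role in LOCATION_ROLES]


def build_db(examples) -> PatternDB:
    """examples: (verb, args, question_phrases, answer_phrases) tuples."""
    db: PatternDB = {}
    for verb, args, q_phrases, a_phrases in examples:
        for slot in location_slots(args):
            if contains_any(args[slot][1], a_phrases):
                # Stage 4: identical tag sets share one entry and pool their verbs.
                db.setdefault(generalize(args, q_phrases, slot), set()).add(verb)
    return db


def extract(db: PatternDB, verb: str, args: List[Argument], q_phrases: List[str]) -> List[str]:
    """Try each location argument as the answer slot and keep those whose pattern is known."""
    answers = []
    for slot in location_slots(args):
        if verb in db.get(generalize(args, q_phrases, slot), set()):
            answers.append(args[slot][1])
    return answers


db = build_db([
    ("locate", [("A1", "the Eiffel Tower"), ("AM-LOC", "in Paris")], ["Eiffel Tower"], ["Paris"]),
    ("situate", [("A1", "the Colosseum"), ("AM-LOC", "in Rome")], ["Colosseum"], ["Rome"]),
])
answers = extract(db, "situate", [("A1", "the pancreas"), ("AM-LOC", "in the abdomen")], ["pancreas"])
```

In this toy run the two training sentences collapse into a single pattern with two verbs, and the new sentence is answered because its generalized form and verb both appear in that pattern.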

4 Comparative evaluation and results analysis

4.1 Evaluation Environment

A set of 100 location questions has been used for testing. The first 50 questions are based on NEs and form a subset of the TREC1999 and TREC2000 factoid location questions and answers. Examples of these questions are:

What is the largest city in Germany? Berlin
Where is the actress, Marion Davies, buried? Hollywood Memorial Park

The last 50 questions are based on location common nouns and were created by our team. Examples of these questions are:

Where is pancreas located? abdomen
Where are sheets put on? bed

Before carrying out the test, a patterns database (DB) for the SR pattern-based answer extraction module has to be built, as explained above. It was built using a set of 200 questions, composed of a subset of TREC2003, TREC2006 and OpenTrivia.com factoid location questions and answers.

As explained in Section 3, our system uses the results of Internet search engines as the corpus for answering the questions. We judged answers to be correct if they represent or contain the correct answer. The measures used to evaluate the system are Precision (questions answered correctly / total questions answered), Recall (questions answered correctly / total questions), F1 ((2 * Precision * Recall) / (Precision + Recall)) and MRR (the Mean Reciprocal Rank measure used in TREC).
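The four measures can be computed from per-question judgements as in the sketch below. Two points are assumptions rather than statements from the text: a question counts as "answered correctly" when its top-ranked answer is correct, and MRR averages the reciprocal rank of the first correct answer over all questions, with unanswered questions contributing zero. The input format is hypothetical.

```python
from typing import Dict, List, Optional


def evaluate(judgements: List[Optional[List[bool]]]) -> Dict[str, float]:
    """Each item is None (no answer returned) or the correctness of the ranked answers."""
    total = len(judgements)
    answered = [j for j in judgements if j]        # questions with at least one answer
    correct = sum(1 for j in answered if j[0])     # assumption: top answer must be right

    precision = correct / len(answered) if answered else 0.0
    recall = correct / total if total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    reciprocal_ranks = 0.0
    for j in answered:
        for rank, is_correct in enumerate(j, start=1):
            if is_correct:
                reciprocal_ranks += 1.0 / rank     # only the first correct answer counts
                break
    mrr = reciprocal_ranks / total if total else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "mrr": mrr}


# Four toy questions: correct at rank 1, correct at rank 2, unanswered, wrong.
scores = evaluate([[True], [False, True], None, [False]])
```

Under these definitions MRR can never fall below recall, which is consistent with the values reported in the results table.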

4.2 Results Analysis

The QA system has been executed with each of the three implemented answer extraction modules. No manual review of sub-process outputs and no post-execution adjustments were made to the automatic processes of the presented system.

Table 3 shows the results obtained in the evaluation of the three approaches, emphasizing the best MRR scores.

                            Answer type
Approach      Measure (%)   NE       Common noun
N. Entities   Pre           87.50    15.62
              Rec           84.00    10.00
              F1            85.70    12.19
              MRR           87.25*   12.52
SR Rules      Pre           91.54    75.00
              Rec           52.00    30.00
              F1            66.32    42.85
              MRR           52.25    30.33
SR Patt.      Pre           93.54    95.23
              Rec           58.00    40.00
              F1            71.60    56.33
              MRR           58.33    40.50*

* best MRR for each answer type

Table 3: Evaluation Results for implemented approaches



The results clearly confirm that while the NE-based approach works better for NE-based questions (MRR +66.98% over SR rules and +49.57% over SR patterns), the SR-based approaches clearly surpass it for common noun-based questions (MRR +142.25% for rules and +223.48% for patterns).
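These percentages appear to be relative differences between the MRR values of the results table. The worked arithmetic below is a reconstruction under that assumption (the formula is not given in the text; the block needs the amsmath package). It reproduces the reported figures up to the last digit, which suggests that the reported values are truncated rather than rounded.

```latex
\[
\Delta_{\mathrm{MRR}} =
  \frac{\mathrm{MRR}_{\text{better}} - \mathrm{MRR}_{\text{worse}}}{\mathrm{MRR}_{\text{worse}}}
  \times 100
\]
\begin{align*}
\text{NE questions, NE vs. SR rules:}
  &\quad \frac{87.25 - 52.25}{52.25} \times 100 \approx 66.99 \\
\text{NE questions, NE vs. SR patterns:}
  &\quad \frac{87.25 - 58.33}{58.33} \times 100 \approx 49.58 \\
\text{Common noun questions, SR rules vs. NE:}
  &\quad \frac{30.33 - 12.52}{12.52} \times 100 \approx 142.25 \\
\text{Common noun questions, SR patterns vs. NE:}
  &\quad \frac{40.50 - 12.52}{12.52} \times 100 \approx 223.48
\end{align*}
```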

On the one hand, the results of the SR approaches are more stable across the two question types, showing an average MRR of 55.29% on NE-based questions and an average MRR of 35.41% on common noun-based ones. The difference could be due to the high availability of information on the Internet for typical NE-based questions and the sparseness of information for some common noun-based questions. Furthermore, the precision achieved by the SR approaches, especially the pattern-based one, is higher than that of the NE-based approach. This is because the SR approaches only tag as possible answers the arguments that represent a location role in a sentence, whereas the NE approach selects every location entity, which increases recall but sacrifices precision.

On the other hand, the NE-based approach, while being the best approach for NE-based questions (87.25% MRR), suffers a drastic drop on common noun-based ones (12.52% MRR). NE-based approaches therefore have an important limitation in detecting non-entity-based answers: since only named entities are selected as possible answers, only NE-based questions can be answered. In fact, the common noun-based questions answered correctly by the presented NE approach should not have been answered, because the answers retrieved are not location NEs. We analyzed the reason for this and concluded that it is caused by NER errors in both the detection and classification processes.

SRL tools identify the roles of arguments in sentences, alleviating the handicap of detecting only entities. In this manner, as indicated by several studies, SR can be used to advantage in the QA task and, as these results show, especially for common noun-based questions.

Comparing the two presented SR approaches, we can observe that the pattern-based approach improves on the rule-based one in recall, because A2, A3 and A4 roles are included as possible answers for some patterns and because synonym, hypernym and hyponym verbs are considered when building the patterns DB. Patterns also obtained the highest precision because, while the other approaches extract all locations as possible answers, this approach only considers location roles whose pattern matches one of those contained in the patterns DB. In this way, the pattern-based approach performs a kind of semantic filtering of sentences, resulting in a more precise extraction of answers.

5 Conclusions and Further work

The aim of this paper is to analyze the influence of using semantic roles in question answering systems by comparing the results obtained for both NE-based and common noun-based questions by different methods of answer extraction. To reach this goal, a simple QA system has been implemented and three proposals for a QA answer extraction module have been embedded in it. The first proposal is based on named entities, while the second and the third are based on semantic roles.

All proposals have been evaluated under the same conditions using a balanced location question set consisting of 50 questions based on NEs (TREC subset) and 50 questions based on common nouns.

Results from the evaluation show that while the NE-based approach answers NE-based questions better (MRR +49.57% over SR patterns), the SR-based approaches show the best results on common noun-based ones (MRR +223.48% for patterns) and obtain a higher precision on both types of questions.

Analyzing the results obtained, we can conclude that the use of SR in the QA task, specifically in the answer extraction module, can be very valuable, especially for common noun-based questions.

As further work, some possible improvements are proposed:

• Implementing the same system for other languages such as Spanish or Catalan, in order to study whether semantic roles have the same effect.

• Extending the QA system and the SR-based extraction modules to other types of questions such as Person, Organization or Time-Date.

• Improving the QA system by implementing an Answer Clustering module based on semantic information from WordNet.

References

2007. Deep Linguistic Processing Workshop in 45th Annual Meeting of the Association for Computational Linguistics (ACL), Prague, Czech Republic, June.

Ferrández, A. 2003. Sistemas de pregunta y respuesta. Technical report, Universidad de Alicante.

Fliedner, G. 2007. Linguistically Informed Question Answering, volume XXIII of Saarbrücken Dissertations in Computational Linguistics and Language Technology. Universität des Saarlandes und DFKI GmbH, Saarbrücken.

Gildea, D. and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Hacioglu, K. and W. Ward. 2003. Target Word Detection and Semantic Role Chunking Using Support Vector Machines. In Proceedings of the Human Language Technology Conference (HLT-NAACL), Edmonton, Canada, June.

Kaisser, M. 2007. Question Answering based on Semantic Roles. In Proceedings of the Deep Linguistic Processing Workshop in 45th Annual Meeting of the Association for Computational Linguistics (ACL2007) (acl, 2007).

Lo, K.K. and W. Lam. 2006. Using semantic relations with world knowledge for question answering. In Proceedings of The Fifteenth Text Retrieval Conference (TREC 2006).

Melli, G., Y. Wang, Y. Liu, M.M. Kashani, Z. Shi, B. Gu, A. Sarkar, and F. Popowich. 2006. Description of squash, the sfu question answering summary handler for the duc-2005 summarization task. In Proceedings of the Document Understanding Conference 2006 (DUC2006), New York City, June.

Miller, G., R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1990. Five Papers on WordNet. CSL Report 43. Technical report, Cognitive Science Laboratory, Princeton University.

Mollá, D. 2006. Sistemas de búsqueda de respuestas. Technical report, Centre for Language Technology, Division of Information and Communication Sciences, June.

Moreda, P., B. Navarro, and M. Palomar. 2007. Corpus-based semantic role approach in information retrieval. Data and Knowledge Engineering, 61(3):467–483.

Moreda, P. and M. Palomar. 2006. The Role of Verb Sense Disambiguation in Semantic Role Labeling. In Proceedings of The 5th International Conference on Natural Language Processing in the series of the TAL conferences (FINTAL), volume 4139, pages 684–695, August.

Moschitti, A., S. Quarteroni, R. Basili, and S. Manandhar. 2007. Exploiting Syntactic and Shallow Semantic Kernels for Question Answer Classification. In Proceedings of the Deep Linguistic Processing Workshop in 45th Annual Meeting of the Association for Computational Linguistics (ACL2007) (acl, 2007), pages 776–783.

Narayanan, S. and S. Harabagiu. 2004. Question answering based on semantic structures. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), Switzerland, August.

Ofoghi, B., J. Yearwood, and R. Ghosh. 2006. A hybrid question answering schema using encapsulated semantics in lexical resources. In Advances in Artificial Intelligence, 19th Australian Joint Conference on Artificial Intelligence, pages 1276–1280, Hobart, Australia, December.

Pizzato, L.A. Sangoi and D. Mollá-Aliod. 2005. Extracting Exact Answers using a Meta Question answering System. In Proceedings of the Australasian Language Technology Workshop 2005 (ALTW05), Sydney, Australia, December.

Shen, D., M. Wiegand, A. Merkel, S. Kazalski, S. Hunsicker, J.L. Leidner, and D. Klakow. 2007. The alyssa system at trec qa 2007: Do we need blog06? In Proceedings of The Sixteenth Text Retrieval Conference (TREC 2007), Gaithersburg, MD, USA.

Stenchikova, S., D. Hakkani-Tur, and G. Tur. 2006. Qasr: Question answering using semantic role for speech interface. In Proceedings of the International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP), Pittsburgh, PA, September.

Sun, R., J. Jiang, Y.F. Tan, H. Cui, T. Chua, and M. Kan. 2005. Using Syntactic and Semantic Relation Analysis in Question Answering. In Proceedings of The Fourteenth Text Retrieval Conference (TREC 2005).

Yousefi, J. and L. Kosseim. 2006. Using semantic constraints to improve question answering. In Proceedings of 11th International Conference on Natural Language Processing and Information Systems (NLDB2006), pages 118–128, Klagenfurt, Austria, May.

Categorización de Textos

Aproximación a la Categorización Textual en español basada en la Semántica de Marcos

Frame Semantics-based Approach to Spanish Textual Categorization

Mario Crespo Miguel

University of Cadiz

Avda. Gómez Ulla, s/n

[email protected]

Antonio Frías Delgado

University of Cádiz

Avda. Gómez Ulla, s/n

[email protected]

Resumen: FrameNet es un recurso basado en la Semántica de Marcos que trata de representar

el modo por el que diferentes lenguas dan cuenta lingüísticamente de situaciones cotidianas. Los

marcos funcionan al modo de paquetes de información sobre cómo hablar de una determinada

situación. Este trabajo presenta un procedimiento para categorizar documentos a partir del

análisis de las situaciones de FrameNet que concurren en un texto determinado. El conjunto de

marcos situacionales es usado como un vector de rasgos en el que la presencia o ausencia de

determinados marcos situacionales en un texto sirve para establecer su categoría. Los resultados

muestran cómo nuestro sistema fue capaz de categorizar textos en español con gran precisión.

Palabras clave: FrameNet, Categorización textual, Recuperación de información.

Abstract: FrameNet is a resource based on Frame Semantics that describes how languages account linguistically for everyday situations. Frames act as information packets about how to talk about a given situation. This paper presents an approach to categorizing texts by analysing the range of FrameNet situations that co-occur in a particular text. The set of FrameNet situations is used as a feature vector in which the presence or absence of certain frames in a text determines its category. Results show that our system was able to categorize texts in Spanish with high accuracy.

Keywords: FrameNet, Textual Categorization, Information Retrieval.

1. Introduction.

FrameNet (Ruppenhofer et al., 2006) is a lexical semantics project designed to account for how languages describe everyday situations through their lexical units, and for how speakers express and understand information through them. Frames thus act as linguistic packets holding the information needed to talk about a given situation.

Fillmore (1982, 1985) argues that people understand things by performing mental operations on what they already know. This knowledge can be described by means of frames, each of which is associated with a set of words that evoke it when they appear in discourse. If we assume that the vocabulary of a given language is limited, then the frames that support it must also be finite. However, the number of topics we can talk about is unlimited. Frames must therefore combine with one another in discourse to convey information about any everyday topic: medicine, politics, family, etc. If a certain set of frames is used to talk about a topic, then it should be feasible to categorize a text from the situations associated with it.

The following sections present a procedure that determines the topic of a document by analysing the frames that turn out to be statistically significant in it.



1.1. Text categorization through concepts linked to the text.

Current approaches to text categorization are usually based on machine learning techniques (Sebastiani, 2002) that learn the categories into which a set of documents is classified. These techniques typically carry out a statistical analysis of the frequency of the terms in a document to determine which are the most relevant. The terms are arranged in a feature vector that records their weight (based on frequency) in a given document. Frequency is the main indicator that a document belongs to a specific category.

This vector-space classification can perform poorly when relevant documents do not contain one of the terms that cause a document to be retrieved. Moreover, retrieval based exclusively on terms can be vague and noisy. This has led some lines of research to explore information retrieval based on the concepts and ideas reflected in the text rather than on the terms used as indexes. Figure 1 illustrates document retrieval based on concepts instead of terms:

Figure 1. Document retrieval by terms and by concepts.

On the left of Figure 1, the terms (t1, t2, etc.) point directly to a thematic domain (d1, d2, etc.). On the right, the terms of a document point to concepts, and from there to a thematic domain.

The main problem is how to obtain the concept space on which the classification rests. The latent semantic model of information retrieval (Dumais et al., 1988; Deerwester et al., 1990) is one of the working alternatives that address this problem. Its guiding idea is that the general topic of a text (its latent semantics) is linked more deeply to the concepts expressed in the document than to the index terms used to describe it. The proposal tries to uncover the latent semantics of documents by identifying the concepts linked to them, so that documents and queries are matched at the level of concepts rather than terms. This aims to minimize the impact of noise and silence on retrieval: documents that were not represented by the query terms can be retrieved, and documents that contain those terms but are not associated with the concepts expressing the information need can be excluded.

The analysis proposed by this model relies on singular value decomposition, which automatically decomposes the document-term occurrence matrix into several associative matrices that define the correspondence between documents and concepts, and between terms and concepts.
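As an illustration of the decomposition just described, the following is a minimal sketch using NumPy. The toy matrix, the term list and the choice of k = 2 concepts are assumptions made for the example; they do not come from the paper or from the cited latent semantic work.

```python
import numpy as np

# Toy document-term occurrence matrix (rows: documents, columns: terms).
# The counts are invented for illustration only.
terms = ["medicina", "catéter", "economía", "mercado"]
X = np.array([
    [3, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
], dtype=float)

# Singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                              # number of latent concepts kept (assumed)
doc_concepts = U[:, :k] * s[:k]    # documents x concepts
term_concepts = Vt[:k, :].T        # terms x concepts

def cosine(a, b):
    """Cosine similarity between two concept-space vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 share vocabulary; documents 0 and 2 do not.
sim_same_topic = cosine(doc_concepts[0], doc_concepts[1])
sim_other_topic = cosine(doc_concepts[0], doc_concepts[2])
```

In this toy case the two medical-looking documents end up close together in concept space, while documents with disjoint vocabulary stay apart, which is the kind of concept-level matching the model aims for.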

Among the work in this area we highlight Huang (2003), who proposes text classification with Support Vector Machines based on the latent semantic model.

2. Methodology.

This work uses frames as conceptual units with which a series of vocabulary terms are associated. The 795 frames that make up FrameNet are used as a feature vector to compute whether or not a document belongs to a given category. As we will see, the presence or absence of each feature is estimated by analysing which frames are statistically significant in a given document or text. Related work includes Petridis et al. (2001) and Gómez et al. (2004), who propose a text classification model that uses WordNet synsets as indexes and applies support vector machines. Among the approaches that use lexical-semantic resources for text categorization, Shehata et al. (2007) train a PropBank-based semantic role labeller to annotate sentences and so determine the relevant information that is later used to classify the document.

2.1. Corpus

The corpus was extracted from medlineplus.gov, a health website of the United States National Library of Medicine¹, the largest medical library in the world, and from umm.edu, the web domain of the University of Maryland Medical Center, since both provide information in Spanish for the Hispanic community in the United States. To these we added the sources and sections available in Spanish on elmundo.es, ecosumer.es and 100cia.es. The resulting corpus contains 7730 documents, 25674497 words and 76616 different lemmas, spread over 4 thematic areas:

DOMAIN                      DOCUMENTS
Medicine and health         3623
El Mundo                    4107
Science                     950
Products and consumption    462

Table 1. Number of documents for each thematic area.

Each document was then analysed with the TreeTagger analyser for Spanish from the University of Stuttgart². Only the lemma information was kept from this analysis, and the frequency of each lemma in each document was computed.

¹ http://en.wikipedia.org/wiki/MedlinePlus
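The lemma counting step can be sketched as follows. This is a minimal example that assumes the tagger output is available as text with one `token<TAB>tag<TAB>lemma` triple per line; the sample sentence and its tags are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

def lemma_frequencies(tagged_text):
    """Count lemmas in tagger output with one 'token<TAB>tag<TAB>lemma' per line."""
    counts = Counter()
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, _tag, lemma = parts
        # Fall back to the token when the tagger could not find a lemma
        key = token if lemma == "<unknown>" else lemma
        counts[key.lower()] += 1
    return counts

def relative_frequencies(counts):
    """Turn absolute lemma counts into relative frequencies."""
    total = sum(counts.values())
    return {lemma: n / total for lemma, n in counts.items()}

# Invented four-token sample, for illustration only
sample = "Los\tART\tel\ncatéteres\tNC\tcatéter\nson\tVSfin\tser\nel\tART\tel\n"
freqs = lemma_frequencies(sample)
relative = relative_frequencies(freqs)
```

Relative rather than absolute frequencies are what make a single document comparable with a much larger reference corpus.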

2.2. Data processing

2.2.1. Frame selection.

In FrameNet terms, triggers are the lexical units that "trigger" or activate a frame in the mind of speakers when they appear in discourse. It is therefore natural to select frames starting from vocabulary units. Each frame comes with a list of triggers, which are used to determine whether the frame they belong to is representative of the corpus. This procedure uses the Spanish translation of the English triggers of each frame proposed by Crespo and Buitelaar (2008) at LREC'08. Figure 2 shows the triggers of the frames Fall_asleep and Medical_instruments:

Fall asleep
  A Sleeper goes from wakefulness to the altered state of consciousness called sleep.
  Lexical Units: adormecerse, dormirse

Medical instruments
  It includes words for medical instruments.
  Lexical Units: broncoscopio, algalia, catéter, endoscopio, ...

Figure 2. View of the triggers of two different frames.

As shown, 'Lexical Units' lists the units or triggers that activate the frame linguistically. The number of triggers varies from frame to frame, both in the English FrameNet and in our translation.

Different methods can be applied to each set of triggers to determine whether the frame they belong to should be regarded as representative of the analysed text. A frequency analysis of the triggers of each frame can help decide whether the frame should be selected. Specifically, the distribution of the relative frequencies of the triggers in a general corpus can be compared with the distribution of the same units in the text under analysis. The assumption is that words oriented towards the topic of the corpus or document will have a proportionally higher frequency.

² http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger

Table 2 compares the relative frequencies of the units of two different frames in a medical text and in a 5.5-million-word corpus from the Universidad Politécnica de Cataluña³, used as the reference corpus. There are clear differences between the two. For the frame Economy, the frequencies of its triggers are much higher in the reference corpus than in the medical text, which leads to a higher mean for the reference corpus. For Active substance the opposite holds: the frequencies of its triggers are higher in the medical text than in the reference corpus, which gives the medical text a higher mean.

ECONOMY              MEDICAL TEXT    REFERENCE CORPUS
económico.a          0.3e-06         27e-06
economía.n           0               140e-06
MEAN                 0.15e-06        97e-06

ACTIVE SUBSTANCE     MEDICAL TEXT    REFERENCE CORPUS
medicina.n           931e-06         107.4e-06
químico.a            173e-06         60e-06
irritante.n          15.5e-06        0
MEAN                 373.1e-06       55.8e-06

Table 2. Comparison of the frequencies and means of two frames.

³ http://www.lsi.upc.edu/%7Epadro/index.php?page=nlp
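The means in Table 2 can be recomputed directly from the listed frequencies. The sketch below does this for the Active substance frame only, using the values as printed in the table; the variable names are hypothetical.

```python
from statistics import mean

# Relative frequencies of the Active substance triggers, copied from Table 2
medical_text = {"medicina.n": 931e-06, "químico.a": 173e-06, "irritante.n": 15.5e-06}
reference = {"medicina.n": 107.4e-06, "químico.a": 60e-06, "irritante.n": 0.0}

mean_medical = mean(medical_text.values())    # about 373.2e-06
mean_reference = mean(reference.values())     # about 55.8e-06

# The frame looks topic-oriented when its triggers are, on average,
# more frequent in the text than in the reference corpus.
more_frequent_in_text = mean_medical > mean_reference
```

A simple comparison of means like this says nothing yet about whether the gap is large enough to matter, which is the question the rest of the section addresses.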

This kind of analysis could be used to select the most representative frames of a document. The difficulty lies in setting the threshold at which a difference between means and frequencies is large enough to select the frame. The fact that the mean of the values of a set of lexical units in the medical corpus is higher than the mean of the same values in the reference corpus will in many cases not be enough to conclude that a frame is representative.

Our problem resembles one found in many other studies, in which certain characteristics of two or more groups of subjects must be compared to determine whether the differences observed between them are only apparent or are in fact significant. Such analyses usually start from an initial hypothesis (the null hypothesis); in our case, that the relative frequencies of the words in two different corpora are actually equal. Among the available tests, the t-test, or Student's test, analyses whether the means of two groups are statistically different from each other relative to the variability of the values of the individuals in each group. The procedure differs from case to case. Ours is one of the most common statistical analyses in scientific practice, since it is the one used to compare two samples from independent groups with respect to a numerical variable. The formula applied here is Student's t-test for two independent samples:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} \qquad [1]$$

where

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2 + s_2^2}{n}} \qquad [2]$$

The numerator of formula [1] is easy to compute, since it is the difference between the two means (that of the text under analysis and that of the reference corpus). The denominator combines the variance of each group and divides it by the number of individuals in each group. This number is the same for both groups, because we take only the triggers whose frequency in the text under analysis is greater than zero and discard those with frequency zero.

Once the t-value has been calculated, it is checked against a significance table. If it exceeds the value given in the table, we can state that the difference between the values of the group of triggers in the two corpora is not due to chance and that the two groups really differ. The difference arises because the group of words analysed is significantly oriented towards the topic of the document and does not behave as would be expected in a general corpus.

The t-test requires the observations in each group to come from a normal distribution with similar variability. Our case is actually a binomial distribution, but by the Central Limit Theorem it can be approximated by a normal one: a binomial distribution converges to a normal distribution as the number of observations (word frequencies) extracted from the corpus grows. This allows us to use a parametric method for frame selection.
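The frame selection test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the trigger frequencies are invented, the function names are hypothetical, and the critical value is left as a parameter because the text does not state the significance level used. The value 2.447 in the example is the two-sided 5% critical value for 6 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def t_value(text_freqs, ref_freqs):
    """t statistic for two equally sized independent samples (formulas [1] and [2])."""
    n = len(text_freqs)
    if n != len(ref_freqs) or n < 2:
        raise ValueError("both samples need the same size, at least 2")
    # Sample variances of each group, pooled as in formula [2]
    s = sqrt((variance(text_freqs) + variance(ref_freqs)) / n)
    if s == 0:
        raise ValueError("no variability in either sample")
    return (mean(text_freqs) - mean(ref_freqs)) / s

def frame_is_significant(text_freqs, ref_freqs, critical_value):
    """Select the frame when its triggers are significantly more frequent in the text."""
    return t_value(text_freqs, ref_freqs) > critical_value

# Invented relative frequencies of four triggers of one frame
text = [900e-06, 850e-06, 950e-06, 880e-06]
ref = [100e-06, 120e-06, 90e-06, 110e-06]

t = t_value(text, ref)
selected = frame_is_significant(text, ref, critical_value=2.447)
```

Because both samples contain the same triggers, their sizes match by construction, which is what allows the single `n` in the denominator.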

2.2.2. Training phase.

Of the 7730 documents in our corpus, approximately 75% were randomly selected to train the system: 2654 medical texts, 2105 news texts, 708 science documents and 322 articles on food and consumer products in general. In this phase the documents are analysed as explained in the previous section. From the distribution of their words, we computed the frames that were significant after applying the t-test. This yields a list of frames for each text. The set of frames is used as features indicating whether or not the document belongs to a given thematic domain.

Once we had the list of frames associated with each document, the naive Bayes classifier was computed. This classifier is used to assign an instance described by a set of attributes (the a_i), in our case the set of frames associated with each document, to one of a finite set of classes (V). It assumes that the attribute values are conditionally independent given the class value, so that:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) \qquad [3]$$

The values P(a_i | v_j) are estimated from the frequencies in the observed data. The category that maximizes the formula is taken as the most appropriate.
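Formula [3] can be sketched as a small naive Bayes classifier over sets of frames. The version below is an illustration under assumptions that are not in the paper: it treats each frame as a present/absent attribute, it uses add-one smoothing so that unseen frames do not produce zero probabilities, and the four training documents are invented. Function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def train(docs):
    """docs: list of (set_of_significant_frames, category) pairs."""
    class_counts = Counter(cat for _, cat in docs)
    frame_counts = defaultdict(Counter)
    vocabulary = set()
    for frames, cat in docs:
        frame_counts[cat].update(frames)   # each frame counted once per document
        vocabulary |= frames
    return class_counts, frame_counts, vocabulary

def classify(frames, model):
    """Return the category v_j that maximizes P(v_j) * prod_i P(a_i | v_j)."""
    class_counts, frame_counts, vocabulary = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for cat, n_cat in class_counts.items():
        score = log(n_cat / total)                         # log P(v_j)
        for frame in vocabulary:
            # P(frame present | v_j), with add-one smoothing (an assumption)
            p = (frame_counts[cat][frame] + 1) / (n_cat + 2)
            score += log(p if frame in frames else 1 - p)  # log P(a_i | v_j)
        if score > best_score:
            best, best_score = cat, score
    return best

# Invented training documents, using frame names mentioned in the text
training = [
    ({"Medical_instruments", "Active_substance"}, "medicine"),
    ({"Medical_instruments", "Fall_asleep"}, "medicine"),
    ({"Economy"}, "news"),
    ({"Economy", "Fall_asleep"}, "news"),
]
model = train(training)
prediction = classify({"Medical_instruments", "Active_substance"}, model)
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when the feature vector has hundreds of frames.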

3. Results and evaluation.

The system was evaluated on the remaining 25% of the documents extracted from the Internet: 969 medical, 242 science, 590 news and 140 on consumer products. The results on this set of documents are as follows:

                Medicine    El Mundo    Science    Products and consumption
% Correct       95.98       96.44       92.98      81.43
% Errors        4.02        3.56        7.02       18.57

Table 3. Percentage of correct and incorrect classifications in each thematic area.

According to these results, we obtain an overall accuracy of 94.6% in identifying these four types of document.
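As a worked check, the overall figure can be approximated by weighting each category's accuracy by its number of test documents. The sketch below assumes that the 590 news documents correspond to the El Mundo column of Table 3; the variable names are hypothetical.

```python
# Test documents per category (from the text) and accuracy per category (Table 3)
test_docs = {"Medicine": 969, "El Mundo": 590, "Science": 242,
             "Products and consumption": 140}
accuracy = {"Medicine": 95.98, "El Mundo": 96.44, "Science": 92.98,
            "Products and consumption": 81.43}

total = sum(test_docs.values())
# Document-weighted average of the per-category accuracies
overall = sum(test_docs[c] * accuracy[c] for c in test_docs) / total
```

Under that assumption the weighted average comes out at roughly 94.7%, within about a tenth of a percentage point of the reported overall figure.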


4. Discussion.

The results support the plausibility of our procedure. The topic of a document can be determined from the frames that co-occur in it. One of the simplest classifiers, naive Bayes, gives good results and good performance.

It is worth noting that Products and consumption is the topic with proportionally the most errors (81.43% correct). However, it is also the one with the least training material, and it can be confused with the documents from elmundo.es, which include not only news but also feature articles and assorted sections.

This procedure could be used instead of the classification proposed by Latent Semantic Indexing. Because FrameNet is a manually built resource, it avoids the errors that arise when conceptual groups are created automatically.

5. Future work.

Our future work includes extending the topics beyond the four proposed here, which implies enlarging the corpus, and experimenting with other linear classifiers more sophisticated than naive Bayes.

Along the same lines, it would also be worth testing how accurately frames can classify texts within a single topic. This methodology might make it possible to discriminate subtopics within a domain, or to classify different genres or styles of language, such as colloquial versus formal, or press and informative versus literary. Different styles of language draw on different linguistic resources, so a FrameNet-based analysis may be able to tell styles apart even when they deal with the same general topic.

6. Conclusions.

FrameNet can be used in semantic tasks such as text categorization. It offers an analysis of language based on how speakers understand and use language to talk about the world. Text categorization is therefore feasible from an analysis of the frames that appear in a text. Frames serve as a link between the terms that appear in a document and its general topic. The only limitation is that FrameNet is not available for all languages, so categorization techniques such as Latent Semantic Indexing, which groups words into concepts automatically, remain the most feasible option.

References

Crespo Miguel, M. and Buitelaar, P. 2008. "Domain-Specific English-To-Spanish Translation of FrameNet". Proceedings of LREC (Language Resources and Evaluation Conference).

Fillmore, Charles J. 1982. Frame semantics. In Linguistics in the Morning Calm, Seoul: Hanshin Publishing Co., pp. 111-137.

Fillmore, Charles J. 1985. Frames and the semantics of understanding. Quaderni di Semantica 6.2:222-254.

Gómez Hidalgo, J.M., Cortizo Pérez, J.C., Puertas Sanz, E., Buenaga Rodríguez, M. de. Experimentos en indexación conceptual. In Gutiérrez, J.M., Martínez, J.J., Isaías, P. (Eds) Actas de la Conferencia Ibero-Americana WWW/Internet 2004, Madrid, Spain, October 7-8, 2004, pp. 251-258.

Huang, Y. 2003. Support vector machines for text categorization based on latent semantic indexing. Technical report, Electrical and Computer Engineering Department, The Johns Hopkins University.

Petridis, V., V.G. Kaburlasos, P. Fragkou, and A. Kehagias. 2001. Text classification using the s-FLNMAP neural network. Proceedings of the 2001 International Joint Conference on Neural Networks.

Ruppenhofer, J., Ellsworth, M., Petruck, M. R. L., Johnson, C. R., Scheffczyk, J. 2006. FrameNet II: Extended Theory and Practice.

Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34(1): 1-47.

Shehata, S., Karray, F. and Kamel, M. 2007. "A concept-based model for enhancing text categorization". 13th ACM KDD, August 2007, pp. 629-637.

Clasificación de documentos basada en la opinión: experimentos con un corpus de críticas de cine en español

Experiments in sentiment classification of movie reviews in Spanish

Fermín L. Cruz, Jose A. Troyano, Fernando Enriquez, Javier Ortega
Universidad de Sevilla

Av. Reina Mercedes s/n, Sevilla
{fcruz,troyano,fenros,javierortega}@us.es

Resumen: En los últimos años se ha producido un creciente interés por el procesamiento automático de las opiniones contenidas en documentos de texto, en parte como consecuencia del aumento exponencial de contenidos generados por usuarios en la Web 2.0, y por el interés entre otros de empresas y gobiernos en analizar, filtrar o detectar automáticamente las opiniones vertidas por sus clientes o ciudadanos. Tomando como punto de partida trabajos de otros autores para el inglés, en el presente artículo exponemos los resultados obtenidos en la experimentación con un clasificador no supervisado de documentos basado en la opinión para el español. Proponemos también una versión supervisada del clasificador que obtiene un resultado sensiblemente mejor. Como paso previo a la experimentación, y ante la ausencia de recursos en español para desarrollar nuestro trabajo, presentamos un corpus de críticas de cine en español, que ha sido puesto a disposición de la comunidad científica.

Palabras clave: Clasificación de documentos basada en la opinión, orientación semántica, construcción de corpus

Abstract: In recent years, the automatic processing of opinions in text documents has received growing interest. Possible causes include the exponential increase of user-generated content in Web 2.0, and the interest of companies and governments in automatically analysing, filtering or detecting the opinions of their customers or citizens. Building on similar work for English by other authors, in this paper we present the results of experiments with an unsupervised sentiment classifier for Spanish. We also propose a supervised version of the classifier that performs significantly better. The experiments were carried out on a corpus that we extracted from a website of movie reviews in Spanish. We have made this corpus available to the research community.

Keywords: Sentiment analysis, sentiment classification, opinion mining, semantic orientation, corpus building

1. Introduction

Initially regarded as a subdiscipline of document classification, opinion-based document classification (known in English as sentiment classification, sentiment analysis or opinion mining) has attracted growing interest from the natural language processing research community in recent years. In classic document classification the problem is to decide the topic of a document from a set of possible topics (for example, in the domain of newspaper articles, telling whether a text is about politics, society or sports). In opinion-based classification the aim is to determine whether the text expresses negative or positive opinions. It is from this point of view, taking "negative opinion" and "positive opinion" as the two output classes of the task, that opinion-based classification is considered a subdiscipline of document classification.

However, the subjective nature of the documents involved (product reviews, film or music reviews, political speeches, user-generated content such as blogs or forums, ...) makes the task harder and calls for solutions different from those used in classic document classification (Pang, Lee, and Vaithyanathan, 2002). Opinion-based classification involves language phenomena that are not only lexical, syntactic and semantic but also pragmatic, and that depend to a large extent on world knowledge. For example, to determine the polarity of the opinion "En esta película el director nos regala otra de las joyas a las que nos tiene acostumbrados" ("In this film the director treats us to another of the gems we have come to expect from him"), one has to consider questions such as: which director are we talking about? What other films has the director made? Are those films good? Only on the basis of this prior knowledge (whether or not it is expressed in the same document) can we decide whether the opinion is positive, as the semantics seems to suggest, or negative with a considerable dose of irony.

As often happens in the first years of research on a new natural language processing task, the work published to date focuses exclusively on English. In this paper we describe our first steps into the problem of opinion-based classification of texts in Spanish. Our main interest is to reproduce the initial experiments carried out by Peter D. Turney (Turney, 2002; Turney and Littman, 2003), applying them to a corpus of film reviews in Spanish that we built ourselves. We also propose a supervised version of the classifier that significantly improves accuracy.

The rest of this paper is structured as follows. The next section introduces some work related to the concept of semantic orientation, which is central to this paper. The third section describes the construction process and the characteristics of the corpus of film reviews used in our experiments. The following section describes the task and the architecture of the classifier, based on (Turney, 2002) with some proposed variations. The fifth section presents the results of the experiments we carried out. Finally, the last section summarizes the main conclusions from our first experiences in opinion-based document classification for Spanish and outlines some lines of future work.

2. Antecedentes

El concepto central que utilizaremos ennuestros experimentos sera el de orientacionsemantica. La orientacion semantica de unapalabra o conjunto de palabras (que a par-tir de ahora llamaremos termino) se definecomo un valor real que siendo positivo in-dica que el termino en cuestion tiene impli-caciones subjetivas positivas (opinion favora-ble), y siendo negativo indica lo contrario.Distintos valores absolutos de la medida in-forman ademas sobre distintos grados de in-tensidad en dichas implicaciones. Los prime-ros intentos de clasificar adjetivos segun suorientacion semantica de manera automaticafueron llevados a cabo en (Hatzivassiloglouy McKeown, 1997) basandose en las conjun-ciones entre adjetivos. En (Kamps y Marx,2002) se utilizan las distancias semanticas enWordNet(Fellbaum, 1998) entre la palabracuya orientacion semantica se desea conocery las palabras good y bad.

(Turney, 2002) describes an unsupervised opinion-based classifier. This classifier decides whether a document is positive or negative based on the semantic orientation of the terms that appear in it. Semantic orientation is computed with the PMI-IR algorithm, which is described later. With a relatively simple approach, this system correctly classifies 84% of the documents in a small corpus of car reviews used by the author. However, when classifying film reviews, the accuracy obtained is only 65%, which suggests that classifying film reviews is an especially difficult task.

We take this article as the starting point for our experiments, first trying to reproduce the classifier it describes and adapting it to Spanish.

3. A corpus for opinion-based classification in Spanish

To experiment with opinion-based classification in Spanish, the first step is to find a suitable resource.

Fermín L. Cruz, Jose A. Troyano, Fernando Enriquez y Javier Ortega


We are not aware of any resource of the required kind for Spanish, so we set out to build one. Having decided to focus our efforts on classifying film reviews, we looked for a website devoted to the subject from which the corpus could be extracted automatically. Our choice had to meet the following requirements:

- A large number of available reviews (at least 2,000).
- If the content is user-generated, a minimum level of text quality must be guaranteed.
- Each review must carry the score that the author gives to the film, which lets us tell whether a review expresses a favorable or an unfavorable opinion.
- The site's publication license must allow us to use the content freely.

Under these conditions, the website chosen was Muchocine¹.

3.1. Construction of the corpus

The first step in building the corpus was to download the HTML page of each review on Muchocine², as of February 2008. The film reviews on this site are submitted by users rather than by professional critics. This adds some difficulty to the task, since the texts may contain spelling mistakes, inconsistencies between what is written and the final score assigned, large differences in length between reviews, and so on.

The HTML pages extracted from the site were converted into XML files (one per review) that contain, in addition to the review text, the name of the review's author, the title of the film reviewed, the score assigned (values from 1 to 5) and a short summary of the review, in the form of a headline, also written by the author. Each review was processed with the FreeLing tool (Atserias et al., 2006) to tokenize the text and split it into sentences, and to produce additional files with lexical, syntactic and semantic information. All this additional information is stored as part of the corpus: lexemes, morphosyntactic tags, dependency trees and WordNet synsets (Fellbaum, 1998). The resulting corpus contains 3,878 reviews and approximately 2 million words, with an average of 546 words per review. The distribution of scores is shown in Table 1.

¹ www.muchocine.net

² The content extracted from MuchoCine has been used under the terms of the Creative Commons license under which it is published (http://creativecommons.org/licenses/by/2.1/es/).

The corpus is available³ for free use by researchers who wish to carry out opinion-based document classification experiments in Spanish.

| Score | No. of reviews |
|-------|----------------|
| 1     | 351            |
| 2     | 923            |
| 3     | 1,253          |
| 4     | 890            |
| 5     | 461            |
| Total | 3,878          |

Table 1: Distribution of the corpus by score.

4. Classifying opinion documents in Spanish

In this section we describe the opinion-based document classification process used in our experiments with film reviews. An opinion document is any unit of text that contains a critical assessment of some object, which may be a commercial product, a film, a law or any other entity that can be subjected to criticism.

4.1. Task definition

Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of opinion documents, and let $C = \{negative, positive\}$ be the output classes of the classifier. The task consists of assigning to each document in $D$ a class from $C$, according to the negative or positive character of the opinions expressed in it.

The task requires some simplifying assumptions about the nature of the documents. For example, all the opinions in a document are assumed to be unequivocally negative or positive throughout the document, and all of them are assumed to refer to the same object. In practice, of course, this does not hold, which makes the task harder.

³ http://www.lsi.us.es/~fermin/corpusCine.zip

Clasificación de documentos basada en la opinión: experimentos con un corpus de críticas de cine en español

4.2. Classifier architecture

According to its author, the classification process presented in (Turney, 2002) is completely unsupervised, since it requires no training stage. We believe, however, that the web searches used by the system (see the section Computing the semantic orientation) are to some extent an external resource; although it is not used in a training process as such but directly during classification, it should at least qualify the unsupervised character of the system.

The proposed classification algorithm is as follows:

- Given a review, extract bigrams using a set of simple morphosyntactic patterns. It is assumed that this set contains at least some bigrams that express opinion (along with many others that do not).
- For each extracted bigram, compute its semantic orientation (a real value, positive or negative) with the PMI-IR algorithm (Pointwise Mutual Information - Information Retrieval).
- From the sum of the semantic orientations obtained, classify the review as positive if the value is greater than or equal to zero, and as negative otherwise.

In the following sections we detail how each step is carried out, discuss possible weaknesses of the system and propose some modifications.

4.2.1. Bigram extraction

The first step consists of extracting a set of bigrams from the review text. This step is fundamental, since the semantic orientations that determine the output class for the review are computed on these bigrams. To extract the bigrams, Turney uses five morphosyntactic patterns, which we have modified slightly to adapt them to Spanish syntax. The patterns specify the parts of speech of the bigrams to extract and place restrictions on the part of speech of the word that follows each bigram. The patterns used are shown in Table 2.

|    | First word | Second word | Third word (not extracted) |
|----|------------|-------------|----------------------------|
| 1. | adjective  | noun        | any                        |
| 2. | noun       | adjective   | not a noun                 |
| 3. | adverb     | adjective   | not a noun                 |
| 4. | adverb     | verb        | any                        |
| 5. | verb       | adverb      | any                        |

Table 2: Morphosyntactic patterns for bigram extraction.
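As a rough illustration of the five patterns, the sketch below scans a POS-tagged token sequence and keeps every bigram that matches one of them. The tag names (ADJ, NOUN, ADV, VERB) and the function name `extract_bigrams` are assumptions made for this example; they are not the tag set produced by FreeLing, and the authors' implementation may differ. When a bigram has no following word, the sketch assumes the third-word restriction is satisfied.

```python
from typing import List, Optional, Tuple

# (first tag, second tag, tag the third word must NOT have; None = any word).
# Tag names are hypothetical coarse labels chosen for this illustration.
PATTERNS: List[Tuple[str, str, Optional[str]]] = [
    ("ADJ", "NOUN", None),     # 1. adjective + noun, third word: any
    ("NOUN", "ADJ", "NOUN"),   # 2. noun + adjective, third word: not a noun
    ("ADV", "ADJ", "NOUN"),    # 3. adverb + adjective, third word: not a noun
    ("ADV", "VERB", None),     # 4. adverb + verb, third word: any
    ("VERB", "ADV", None),     # 5. verb + adverb, third word: any
]


def extract_bigrams(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (word, word) bigrams matching any pattern.

    `tagged` is a list of (word, tag) pairs for one review.
    """
    bigrams = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # Tag of the word after the bigram, or None at the end of the text.
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, forbidden in PATTERNS:
            if t1 == first and t2 == second and (forbidden is None or t3 != forbidden):
                bigrams.append((w1, w2))
                break  # one match per position is enough
    return bigrams


if __name__ == "__main__":
    sentence = [("una", "DET"), ("estupenda", "ADJ"), ("dirección", "NOUN"),
                ("muy", "ADV"), ("recomendable", "ADJ"), (".", "PUNCT")]
    print(extract_bigrams(sentence))
```

Because the patterns only look at adjacent words, related words that are separated in the text are never paired.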

In our view, this bigram extraction method has two main shortcomings. The first is that some syntactic relations are not captured by these patterns whenever the two related lexical units do not appear next to each other in the text. This could be solved by using patterns based on dependency trees. Although we initially implemented this idea in our experiments, we had to discard it because of the poor quality of the dependency trees available to us. Bear in mind that we work with spontaneous text written by non-professional users, which often contains sentences of doubtful grammatical construction, spelling mistakes and other peculiarities that make syntactic analysis difficult.

Second, many of the extracted bigrams do not express any opinion (some examples are shown in the Experiments section). A preliminary stage that classifies sentences as objective or subjective could help with this problem, but that would be enough material for a separate line of research, which we do not address for now.

4.2.2. Computing the semantic orientation

Semantic orientation is computed with the PMI-IR algorithm, which estimates the Pointwise Mutual Information between the term in question and a pair of seed words that serve as unequivocal representatives of positive and negative semantic orientation, using a web search engine to carry out the estimation. Pointwise Mutual Information is defined between two words w1 and w2 and measures statistically the information we gain about the possible occurrence of one term from the occurrence of the other:

$$PMI(w_1, w_2) = \log_2 \left( \frac{p(w_1 \,\&\, w_2)}{p(w_1)\,p(w_2)} \right)$$

From this statistical measure, the semantic orientation of a term t is computed as follows:

$$SO(t) = PMI(t, \mathit{excellent}) - PMI(t, \mathit{poor})$$

The PMI measure is estimated through web searches: the co-occurrence probability of two terms that appears in the numerator of the PMI formula is approximated by the number of web pages in which both terms appear close to each other. After some algebraic transformations, the final formula proposed by Turney for computing the semantic orientation of a term, SO(t), is the following:

$$SO(t) = \log_2 \left( \frac{hits(t \ \mathrm{NEAR} \ \mathit{excellent}) \, hits(\mathit{poor})}{hits(t \ \mathrm{NEAR} \ \mathit{poor}) \, hits(\mathit{excellent})} \right)$$

where hits(t) is the number of pages returned by the AltaVista⁴ search engine when searching for t. AltaVista's NEAR operator (which is not available in other search engines such as Google, the reason we chose AltaVista) returns the pages in which both terms appear at a maximum distance of 10 words from each other. A value of 0.01 is added to the number of pages obtained in order to avoid a possible division by zero.
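The algebraic transformations mentioned above can be made explicit. The following is a sketch, assuming each probability is estimated as a page count divided by the total number of indexed pages $N$ (the symbol $N$ is introduced here and does not appear in the original), and that the co-occurrence count is the one returned by the NEAR query:

$$
\begin{aligned}
SO(t) &= \log_2 \frac{p(t \,\&\, \mathit{excellent})}{p(t)\,p(\mathit{excellent})} - \log_2 \frac{p(t \,\&\, \mathit{poor})}{p(t)\,p(\mathit{poor})} \\
      &= \log_2 \frac{p(t \,\&\, \mathit{excellent})\,p(\mathit{poor})}{p(t \,\&\, \mathit{poor})\,p(\mathit{excellent})} \\
      &\approx \log_2 \frac{\dfrac{hits(t \ \mathrm{NEAR} \ \mathit{excellent})}{N} \cdot \dfrac{hits(\mathit{poor})}{N}}{\dfrac{hits(t \ \mathrm{NEAR} \ \mathit{poor})}{N} \cdot \dfrac{hits(\mathit{excellent})}{N}}
       = \log_2 \frac{hits(t \ \mathrm{NEAR} \ \mathit{excellent})\,hits(\mathit{poor})}{hits(t \ \mathrm{NEAR} \ \mathit{poor})\,hits(\mathit{excellent})}
\end{aligned}
$$

The factor $p(t)$ cancels in the difference of the two PMI values, and so does $N$, which is why only four hit counts are needed per term.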

Intuitively, the idea behind this computation is that expressions that convey a positive opinion will appear more frequently near a word with clearly positive connotations such as excellent, and much less frequently near a word with negative connotations such as poor.

⁴ www.altavista.com

Adapting the PMI-IR algorithm to Spanish comes down to choosing two suitable Spanish seeds. The seeds chosen were excelente and malo.
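A minimal sketch of the hit-count formula with the Spanish seeds is given below. It takes the four page counts as plain numbers, since the search engine itself is outside the scope of the example. The function and parameter names are hypothetical, and applying the 0.01 smoothing only to the two NEAR counts is an assumption about how the correction described above is applied.

```python
import math

SMOOTHING = 0.01  # added to the NEAR counts to avoid division by zero (assumed placement)


def semantic_orientation(hits_near_excelente: float,
                         hits_near_malo: float,
                         hits_excelente: float,
                         hits_malo: float) -> float:
    """Estimate SO(t) from four page counts.

    hits_near_excelente: pages where the term appears NEAR "excelente"
    hits_near_malo:      pages where the term appears NEAR "malo"
    hits_excelente:      pages containing "excelente" (assumed > 0)
    hits_malo:           pages containing "malo" (assumed > 0)
    """
    numerator = (hits_near_excelente + SMOOTHING) * hits_malo
    denominator = (hits_near_malo + SMOOTHING) * hits_excelente
    return math.log2(numerator / denominator)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(semantic_orientation(120, 15, 1_000_000, 2_000_000))
```

With equal seed counts, a term that co-occurs more often with excelente than with malo gets a positive value, and the value is zero when both co-occurrence counts are equal.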

4.2.3. Using multiple seeds to compute the semantic orientation

In a later article by the same author, two sets of positive and negative seeds are used instead of a single seed of each kind (Turney and Littman, 2003). We have also run experiments using a set of seeds instead of a single word. The sets of positive and negative seeds used were the following:

- Positive: excelente, buenísimo, buenísima, superior, extraordinario, extraordinaria, magnífico, magnífica, exquisito, exquisita
- Negative: malo, mala, pésimo, pésima, deplorable, detestable, atroz, fatal

The Experiments section compares the semantic orientation values obtained for some example terms using either a single seed per category or the sets listed above.
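One way to combine several seeds, following the general definition in (Turney and Littman, 2003), is to sum the association of the term with every positive seed and subtract the sum of its association with every negative seed. The text does not state whether the experiments used exactly this combination, so the sketch below is an assumption; `pmi` is a hypothetical callable supplied by the caller that returns an association score for a (term, seed) pair.

```python
from typing import Callable, Sequence

POSITIVE_SEEDS = ["excelente", "buenísimo", "buenísima", "superior",
                  "extraordinario", "extraordinaria", "magnífico",
                  "magnífica", "exquisito", "exquisita"]
NEGATIVE_SEEDS = ["malo", "mala", "pésimo", "pésima",
                  "deplorable", "detestable", "atroz", "fatal"]


def so_multi(term: str,
             pmi: Callable[[str, str], float],
             positive: Sequence[str] = POSITIVE_SEEDS,
             negative: Sequence[str] = NEGATIVE_SEEDS) -> float:
    """Sum of association with positive seeds minus sum with negative seeds."""
    pos_total = sum(pmi(term, seed) for seed in positive)
    neg_total = sum(pmi(term, seed) for seed in negative)
    return pos_total - neg_total


if __name__ == "__main__":
    # Toy association function, for illustration only.
    toy_pmi = lambda term, seed: 1.0 if seed in POSITIVE_SEEDS else 0.5
    print(so_multi("gusto exquisito", toy_pmi))
```

Note that the two seed sets have different sizes (ten positive, eight negative), so with this plain sum a term that is equally associated with every seed does not score exactly zero; averaging each sum would be one possible alternative.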

4.2.4. Classification algorithm

Once the semantic orientations of all the extracted bigrams have been computed, the classification process proposed by Turney sums all the values obtained and classifies the review as positive if the result is greater than or equal to zero, and as negative otherwise.

In our experiments, in addition to this simple approach, we have implemented an alternative supervised solution, which consists of computing an optimal value to use as the positive threshold from a set of training reviews. The idea is to find a real value u that maximizes the number of positive reviews in the training corpus whose total semantic orientation is greater than or equal to u plus the number of negative reviews whose total semantic orientation is less than u. Once this value has been computed, a review is classified as positive if its total semantic orientation equals or exceeds it, and as negative otherwise.
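Both decision rules can be sketched in a few lines. The example below assumes that each review is already reduced to its total semantic orientation, and it searches for the threshold by trying every observed training score as a candidate, which is one straightforward way to find a maximizing value. The function names are hypothetical, and the search strategy is an assumption, not a description of the authors' implementation.

```python
from typing import Sequence


def classify(orientations: Sequence[float], threshold: float = 0.0) -> str:
    """Sum the bigram orientations and compare the total with the threshold."""
    return "positive" if sum(orientations) >= threshold else "negative"


def fit_threshold(scores: Sequence[float], labels: Sequence[str]) -> float:
    """Return a threshold u that maximizes the number of training reviews
    classified correctly (positive if score >= u, negative otherwise).

    `scores` holds the total semantic orientation of each training review,
    `labels` the gold class ("positive" or "negative") in the same order.
    """
    if not scores:
        raise ValueError("at least one training review is required")
    # Only thresholds at observed scores can change the outcome; one extra
    # candidate above the maximum covers the all-negative case.
    candidates = sorted(set(scores))
    candidates.append(candidates[-1] + 1.0)
    best_u, best_correct = candidates[0], -1
    for u in candidates:
        correct = sum((s >= u) == (lab == "positive")
                      for s, lab in zip(scores, labels))
        if correct > best_correct:
            best_u, best_correct = u, correct
    return best_u


if __name__ == "__main__":
    train_scores = [-3.0, 1.0, 2.0, 5.0, 8.0]
    train_labels = ["negative", "negative", "negative", "positive", "positive"]
    u = fit_threshold(train_scores, train_labels)
    print(u, classify([4.0, 2.5], threshold=u))
```

When several thresholds tie, this sketch keeps the smallest one; other tie-breaking choices are equally valid.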



5. Experiments

For our classification experiments we randomly selected 400 reviews from the corpus. Of these, 200 have a score of 1 or 2 and are considered negative reviews; the other 200 have a score of 4 or 5 and are considered positive reviews. We did not use all the reviews in the corpus because of time constraints: computing the semantic orientation is slow because it depends on the AltaVista search engine (we must wait five seconds between requests so as not to overload the AltaVista server, which would otherwise deny us service). In any case, we believe that 400 reviews are sufficient to obtain results, considering that Turney's article (Turney, 2002) used only 120 reviews. We are currently generating the semantic orientations for the remaining reviews in the corpus for use in future experiments.
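The selection described above can be sketched as a balanced random sample. The example assumes each review is represented by a hypothetical (identifier, score) pair and that reviews with a score of 3 are simply left out, which the text implies but does not state; the function names and the fixed random seed are also assumptions.

```python
import random
from typing import List, Optional, Tuple


def label_for(score: int) -> Optional[str]:
    """Map a 1-5 score to a class; a score of 3 gets no class (assumption)."""
    if score in (1, 2):
        return "negative"
    if score in (4, 5):
        return "positive"
    return None


def balanced_sample(reviews: List[Tuple[str, int]],
                    per_class: int = 200,
                    seed: int = 0) -> List[Tuple[str, str]]:
    """Randomly pick `per_class` negative and `per_class` positive reviews.

    Raises ValueError if a class has fewer than `per_class` reviews.
    """
    rng = random.Random(seed)
    by_label = {"negative": [], "positive": []}
    for review_id, score in reviews:
        label = label_for(score)
        if label is not None:
            by_label[label].append(review_id)
    sample = []
    for label, ids in by_label.items():
        sample.extend((rid, label) for rid in rng.sample(ids, per_class))
    return sample


if __name__ == "__main__":
    # Toy corpus: 50 reviews with scores cycling through 1..5.
    toy = [(f"review-{i}", (i % 5) + 1) for i in range(50)]
    print(balanced_sample(toy, per_class=5))
```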

5.1. Results for the semantic orientation computation

Table 3 shows some of the bigrams extracted from the corpus used in the classification process, together with the semantic orientation obtained using a single seed per class (SOv1) or several seeds per class (SOv2). The examples were chosen to illustrate situations that recur throughout the corpus. The last bigram is an example of a bigram that provides no information about the opinions expressed in the review.

In general, we observe that the semantic orientation computed from several seeds seems to work better. For example, for the bigram mínima originalidad, version 1 of the computation yields a slightly positive value, which is wrong; this is corrected in version 2. Some terms have an a priori ambiguous semantic orientation, such as efectos especiales decentes. In those cases it is unclear which version of the semantic orientation behaves better.

5.2. Results of the unsupervised classifier

Table 4 shows the results obtained when classifying the reviews with both versions of the semantic orientation computation, using the unsupervised process of summing the semantic orientations and classifying the review as positive if the result is greater than or equal to 0 (and negative otherwise). The results are comparable to those obtained by Turney for English, with a significant improvement when several seeds are used to compute the semantic orientation. Note also that, in the corpus used by Turney, the 120 film reviews corresponded to only two films, whereas our corpus contains reviews of many films.

5.3. Results of the supervised classifier

The clear imbalance between the results for negative and positive reviews (with a single seed, 35.5% for negative reviews and 91.5% for positive ones) suggests looking for a better threshold than 0 for deciding the class of a review from its semantic orientation. Table 5 shows the results of the supervised classification. To optimize the threshold parameter we used 80% of the 400-review corpus described above, leaving the remaining 20% for evaluation. The results of this second approach are significantly better than those obtained previously, with the system correctly classifying 77.5% of the reviews. It is striking that, in the supervised version, using a single seed per class to compute the semantic orientations leads to better results than using several seeds per class. We think that, when a single seed is used, optimizing the threshold parameter compensates for an apparent asymmetry between the intensity of the semantic orientations expressed by the seeds malo and excelente (these seeds are direct translations of those used in English by Turney). This asymmetry seems to be mitigated when multiple seeds are used, which is why, with this latter version of the semantic orientation computation, optimizing the threshold parameter does not improve the classifier's results so dramatically.

| Term                       | SOv1  | SOv2  |
|----------------------------|-------|-------|
| mínima originalidad        | 0.23  | -6.08 |
| insufrible sucesión        | 0.23  | -0.37 |
| efectos especiales decentes | -5.08 | -7.18 |
| película típica            | -0.87 | -2.32 |
| estupenda dirección        | 3.59  | 2.62  |
| gusto exquisito            | 7.58  | 3.37  |
| altamente recomendable     | 5.49  | 6.99  |
| fantástico currículum      | 4.24  | 0.23  |
| banda sonora               | 1.61  | 3.63  |

Table 3: Semantic orientations of some of the extracted bigrams.

|      | Correct (positive) | Correct (negative) | Correct (total) |
|------|--------------------|--------------------|-----------------|
| SOv1 | 35.5%              | 91.5%              | 63.5%           |
| SOv2 | 70%                | 69%                | 69.5%           |

Table 4: Results for the unsupervised classification.

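|      | Threshold | Correct (positive) | Correct (negative) | Correct (total) |
|------|-----------|--------------------|--------------------|-----------------|
| SOv1 | 13.0      | 72.5%              | 82.5%              | 77.5%           |
| SOv2 | -2.25     | 75%                | 72.5%              | 73.75%          |

Table 5: Results for the supervised classification.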

6. Conclusions

In this paper we have described our first experiences in opinion-based document classification for Spanish. We believe that this task, and others related to the automatic processing of opinions, offer great research opportunities, especially when applied to Spanish. The results we have obtained, although they meet our expectations for a first approach to the problem, are still far from what can be expected of a reliable classification system. In view of the increase in accuracy achieved by the supervised version of our classifier, we believe that more sophisticated supervised approaches than the one proposed here, based on the classical solutions to document classification but enriched with the information provided by semantic orientation, may bring considerable improvements. In addition to this line of continuation, we plan to experiment with other algorithms for computing semantic orientation that do not depend on an external service such as AltaVista: algorithms based on word similarity and WordNet, as in (Kamps et al., 2004) or (Hu and Liu, 2004), or based on the relative frequencies of occurrence in the different classes of a corpus, as in (Cane Wing-ki Leung and lai Chung, 2006).

To carry out our experiments we have presented a corpus of film reviews in Spanish, built from the reviews submitted by users to the Muchocine website. The corpus is available⁵ for free use by researchers who wish to carry out opinion-based document classification experiments in Spanish.

There is a wealth of user-generated content on the web that is well suited to creating resources for training or evaluating systems related to the automatic processing of opinions. We believe that using this content to create resources is fundamental to the progress of research in automatic opinion processing, so it will also be one of our preferred lines of future work.

⁵ http://www.lsi.us.es/~fermin/corpusCine.zip

References

Atserias, J., B. Casas, E. Comelles, M. González, L. Padró, and M. Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), pages 48–55.

Cane Wing-ki Leung, Stephen Chi-fai Chan, and Fu lai Chung. 2006. Integrating collaborative filtering and sentiment analysis: A rating inference approach. In Proceedings of the ECAI 2006 Workshop on Recommender Systems, in conjunction with the 17th European Conference on Artificial Intelligence, pages 62–66.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, May.

Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1997. Predicting the semantic orientation of adjectives. In Philip R. Cohen and Wolfgang Wahlster, editors, Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 174–181, Somerset, New Jersey. Association for Computational Linguistics.

Hu, Minqing and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177, New York, NY, USA. ACM.

Kamps, J., M. Marx, R. Mokken, and M. de Rijke. 2004. Using WordNet to measure semantic orientation of adjectives.

Kamps, Jaap and Maarten Marx. 2002. Words with attitude. In 1st International WordNet Conference, pages 332–341.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Turney, Peter D. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In ACL, pages 417–424.

Turney, Peter D. and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Trans. Inf. Syst., 21(4):315–346.



Density-based clustering of short-text corpora∗

Agrupamiento de textos cortos basado en densidad

Diego A. Ingaramo, Marcelo L. Errecalde
LIDIC, UNSL, San Luis, Argentina
Avda. Ejército de los Andes 950
{daingara,merreca}@unsl.edu.ar

Paolo Rosso
NLE Lab., DSIC, UPV, España
Camino de Vera s/n
[email protected]

Resumen: En este trabajo investigamos el desempeño de diferentes algoritmos de agrupamiento basados en densidad en colecciones de textos cortos y textos cortos de dominios restringidos. Nuestro objetivo es analizar en qué medida las características de este tipo de colecciones impactan en el cálculo de la densidad de los agrupamientos y cuán robustos son este tipo de algoritmos a los distintos niveles de complejidad.
Palabras clave: agrupamiento de textos cortos, algoritmos basados en densidad.

Abstract: In this work, we analyse the performance of different density-based algorithms on short-text and narrow-domain short-text corpora. We attempt to determine to what extent the features of this kind of corpora affect the density computation of the clusterings obtained, and how robust these algorithms are to the different complexity levels.
Keywords: short-text clustering, density-based algorithms.

1 Introduction

In realistic document clustering problems, results cannot usually be evaluated with typical external measures like the F-measure, because the correct categorizations specified by a human editor are not available. Therefore, the quality of the resulting groups is evaluated with respect to structural properties expressed in different Internal Clustering Validity Measures (ICVM). ICVM used as cluster validity measures include the classical Dunn and Davies-Bouldin indexes and newer graph-based measures such as the Density Expected Measure (DEM) and the λ-Measure (Stein, Meyer zu Eissen, and Wißbrock, 2003).¹ A central question in these cases is which ICVM show an adequate degree of correlation with the categorization criteria of a human editor.

In recent works (Stein, Meyer zu Eissen, and Wißbrock, 2003; Ingaramo et al., 2008), density-based ICVM like DEM have obtained the best correlation values with respect to the (external) F-measure, outperforming other more popular ICVM in experiments with samples of the RCV1 Reuters collection (Stein, Meyer zu Eissen, and Wißbrock, 2003) and short-text corpora (Ingaramo et al., 2008). According to these results, an algorithm that tends to produce groupings with high density values would achieve results that satisfy the information needs of users.

∗ This work has been partially supported by the MCyT TIN2006-15265-C06-04 project, the ANPCyT and the Universidad Nacional de San Luis.

¹ See (Ingaramo et al., 2008) and (Stein, Meyer zu Eissen, and Wißbrock, 2003) for more detailed descriptions of these ICVM.

Density-based algorithms are supposed to exhibit that tendency, and they are the main focus of the present work. We are interested in testing these algorithms on problems with different degrees of complexity, and in particular on collections containing very short texts, where additional difficulties arise from the low frequencies of the document terms. This problem and other features, such as a high level of vocabulary overlap among the categories of a corpus, can negatively affect the computation of the similarity between documents and the density of the document clusterings.

Work on “short-text clustering” is relevant, particularly given the current and likely future tendency of people to use ‘small-language’ media, e.g. blogs, text messaging, snippets, etc. Potential applications in different areas of natural language processing include re-ranking of snippets in information retrieval and automatic clustering of scientific texts available on the Web (Alexandrov, Gelbukh, and Rosso, 2005).

In order to obtain a better understanding of the adequacy of density-based clustering algorithms for short-text corpora, a deeper analysis of the relation between the features and difficulties of these corpora and the performance of different density-based algorithms is required. Specifically, we are interested in answering the following questions:

1. How do a low frequency of words and vocabulary overlap affect the similarity estimation and the density of the clustering?

2. Which density-based algorithms are robust to these features?

3. How dense are the results obtained by density-based algorithms when applied to short-text collections? Are these results good from a user's viewpoint?

To answer these questions we use three popular density-based algorithms: MajorClust, DBscan and Chameleon. They are tested on two very short-text corpora that differ in the degree of overlap of their vocabularies. Results are also compared with those for a corpus that contains longer documents on well differentiated topics. In a nutshell, we want to consider situations where these algorithms have to deal with different levels of complexity in the document collections under consideration. This complexity (or hardness) is estimated with respect to the DEM value of the “correct” clustering.

The remainder of the paper is organized as follows. Section 2 presents our criteria for determining the hardness of short-text corpora from a density perspective; here, we also analyse the corpora used in the experiments according to these criteria. The experimental results are shown in Section 3. Finally, some general conclusions are drawn and possible future work is discussed.

2 Density Estimation and Complexity of Short-text Corpora

In order to analyse the performance of different density-based algorithms we have to consider how they work on corpora with different levels of complexity. The term “hardness” has recently been used (Pinto and Rosso, 2007; Errecalde, Ingaramo, and Rosso, 2008) to refer to the complexity that a given corpus presents for clustering problems. In (Pinto and Rosso, 2007) this hardness is estimated from the vocabulary overlap among the categories of a corpus, and in (Errecalde, Ingaramo, and Rosso, 2008) it is determined with respect to the difficulty of establishing an accurate similarity measure among its documents.

In the present work we take a different perspective and focus on the density of the “correct” clustering defined by a human editor in order to estimate how complex a corpus is for a density-based algorithm. The rationale behind this idea is simple: if it is hard to identify dense groups in a document grouping defined by an expert, the collection will be similarly difficult for clustering algorithms that search for regions of high density. We consider three corpora that are assumed to have different levels of difficulty from this perspective; in particular, we are interested in how well the different density-based algorithms work with short-text corpora and narrow-domain short-text corpora compared with more standard corpora. These corpora are introduced in the following subsection.

2.1 Data Sets

The complexity of clustering problems with short-text corpora demands a meticulous analysis of the features of each collection used in the experiments. For this reason, we focus on specific characteristics of the collections, such as document length and the closeness of the topics covered by the documents. With this decision we attempt to avoid introducing other factors that could make the results incomparable.

With the exception of the CICling-2002 collection, which has already been used in previous works (Makagonov, Alexandrov, and Gelbukh, 2004; Alexandrov, Gelbukh, and Rosso, 2005; Pinto, Benedí, and Rosso, 2007), the remaining two corpora were artificially generated with the goal of obtaining corpora with different levels of complexity with respect to document length and vocabulary overlap. Our intention was that, in each corpus, the similarity measure used to quantify the “closeness” between documents would face a different level of difficulty in detecting the conceptual proximity between two texts. In that way, the accuracy of the similarity measure will differ across collections, and this will also affect the density estimation of the document clusterings. However, other features, such as the number of groups and the number of documents per group, were kept the same for all collections in order to obtain comparable results.

It could be argued that our analysis is limited to small collections. However, we believe that short-text clustering in general, and clustering of narrow-domain abstracts in particular, demand a detailed understanding of each collection that would be difficult to achieve with large standard corpora.

In the following subsections, a general description of the new collections used in this work is presented. The collections are introduced in increasing order of complexity. We begin with the Micro4News corpus, a collection of medium-length documents about well differentiated topics (low complexity). Then the EasyAbstracts corpus, with short documents (scientific abstracts) and well differentiated topics, is presented (medium complexity). These two new collections were created with similar general characteristics (number of groups and number of documents per group).² The CICling-2002 corpus, with relatively high complexity, was also used in our work. This collection is considered harder to cluster than the previous corpora since its documents are narrow-domain abstracts (see (Pinto, Benedí, and Rosso, 2007) for a more detailed description of the corpus).

2.1.1 The Micro4News Corpus

This first collection was constructed with medium-length documents that correspond to four very different topics. Consequently, the similarity measure is not expected to have any problem in determining whether two documents are semantically related. Its documents are significantly longer than those of CICling-2002 and deal with well-differentiated topics. Documents were selected from four very different groups of the popular 20Newsgroups corpus (Lang, 1993): 1) sci.med, 2) soc.religion.christian, 3) rec.autos and 4) comp.os.ms-windows.misc. For each topic, the largest documents were selected. Thus, it was ensured that the average length of its documents was seven times (or more) the length of the abstracts of the remaining two corpora.

² A detailed description of the distribution and features of these two corpora is available in (Errecalde and Ingaramo, 2008), which also explains how to access the corpora for research purposes.

2.1.2 The EasyAbstracts Corpus

This collection can be considered harder than the previous one because its documents are scientific abstracts (the same characteristic as CICling-2002) and are consequently short texts. It differs from CICling-2002 in the degree of overlap of the documents' vocabulary. EasyAbstracts documents also refer to a shared theme (intelligent systems), but its groups are not as closely related as the CICling-2002 groups are. EasyAbstracts was constructed with abstracts publicly available on the Internet that correspond to articles from four international journals in the following fields: 1) Machine Learning, 2) Heuristics in Optimization, 3) Automated Reasoning and 4) Autonomous Intelligent Agents. It is possible to select abstracts for these disciplines in such a way that two abstracts from two different categories are not related at all. However, some degree of complexity can be introduced if abstracts of articles related to two or more EasyAbstracts categories are used.³ A few documents with these features were included in the EasyAbstracts corpus in order to increase its complexity with respect to the Micro4News corpus. Nevertheless, the majority of documents in this collection clearly belong to a single group. This allows us to assume that a similarity measure should have little trouble representing the proximity among documents, compared with the complexity of the CICling-2002 corpus.

2.2 Density of Clusterings

Our study of different density-based algorithms takes as reference the ICVM named Density Expected Measure (DEM), usually denoted ρ. In recent works (Stein, Meyer zu Eissen, and Wißbrock, 2003; Ingaramo et al., 2008), this measure has shown an interesting correlation with the (external) F-measure, which is based on the information of a correct clustering specified by an expert. In what follows, some preliminary concepts and the definition of DEM are introduced.
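As a hedged illustration of the external F-measure mentioned above, the following Python sketch scores a clustering against an expert classification. The paper does not spell out the formula, so the sketch assumes the class-weighted, best-match variant that is common in the clustering literature; the function name `clustering_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-weighted best-match F-measure (an assumed, common definition)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    # overlap[(class, cluster)] = number of documents shared by both
    overlap = Counter(zip(true_labels, cluster_labels))

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            shared = overlap[(cls, clu)]
            if shared == 0:
                continue
            precision = shared / clu_size
            recall = shared / cls_size
            best = max(best, 2 * precision * recall / (precision + recall))
        # Each expert class contributes its best-matching cluster,
        # weighted by the fraction of documents it contains.
        total += (cls_size / n) * best
    return total


# Toy example: two expert classes of three documents each.
truth = ["a", "a", "a", "b", "b", "b"]
perfect = [0, 0, 0, 1, 1, 1]   # clustering identical to the expert grouping
merged = [0, 0, 0, 0, 0, 0]    # everything placed in a single cluster
```

Under this definition a clustering identical to the expert grouping scores 1, while merging everything into one cluster is penalized through the precision term.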

³ For instance, abstracts which refer to learning intelligent agents or agents with high-level reasoning capabilities.

Density-based clustering of short-text corpora

Let us consider a data collection as a weighted graph G = ⟨V, E, w⟩ with node set V (representing documents), edge set E (representing similarity between documents) and weight function w : E → [0, 1] (representing a similarity function between documents).

A graph $G = \langle V, E, w \rangle$ may be called sparse if $|E| = O(|V|)$, whereas it is called dense if $|E| = O(|V|^2)$. Then we can compute the density $\theta$ of a graph from the equation $|E| = |V|^{\theta}$, generalized to weighted graphs through $w(G) = |V| + \sum_{e \in E} w(e)$, in the following manner:

\[
w(G) = |V|^{\theta} \iff \theta = \frac{\ln w(G)}{\ln |V|} \tag{1}
\]

$\theta$ can be used to compare the density of each induced subgraph $G' = \langle V', E', w' \rangle$ with respect to the density of the initial graph $G$: $G'$ is sparse (dense) compared to $G$ if $\frac{w(G')}{|V'|^{\theta}}$ is smaller (bigger) than 1. Formally (Stein, Meyer zu Eissen, and Wißbrock, 2003), let $C = \{C_1, \ldots, C_k\}$ be a clustering of a weighted graph $G = \langle V, E, w \rangle$ and let $G_i = \langle V_i, E_i, w_i \rangle$ be the induced subgraph of $G$ with respect to cluster $C_i$. Then the Density Expected Measure $\rho$ of a clustering $C$ is obtained as shown in Eq. 2. A high value of $\rho$ should indicate a good clustering.

\[
\rho(C) = \sum_{i=1}^{k} \frac{|V_i|}{|V|} \cdot \frac{w(G_i)}{|V_i|^{\theta}} \tag{2}
\]
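The two equations can be turned into a short computation. The following Python sketch is only an illustration under stated assumptions: the collection is given as a symmetric similarity matrix, every pair of documents is treated as an edge whose weight is its similarity (a zero similarity contributing nothing), and the function names and the toy matrix are hypothetical.

```python
import math


def graph_weight(nodes, sim):
    """w(G) = |V| + sum of edge weights, summed over unordered node pairs."""
    nodes = list(nodes)
    total = float(len(nodes))
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            total += sim[nodes[a]][nodes[b]]
    return total


def density_theta(sim):
    """Eq. 1: theta = ln(w(G)) / ln(|V|) for the whole collection."""
    n = len(sim)
    if n < 2:
        raise ValueError("at least two documents are needed")
    return math.log(graph_weight(range(n), sim)) / math.log(n)


def expected_density(clusters, sim):
    """Eq. 2: rho(C) = sum_i (|V_i| / |V|) * w(G_i) / |V_i|^theta."""
    n = len(sim)
    theta = density_theta(sim)
    rho = 0.0
    for cluster in clusters:
        size = len(cluster)
        rho += (size / n) * graph_weight(cluster, sim) / (size ** theta)
    return rho


# Toy collection: documents 0-1 and 2-3 are similar within each pair.
similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
rho_good = expected_density([[0, 1], [2, 3]], similarity)
rho_bad = expected_density([[0, 2], [1, 3]], similarity)
rho_whole = expected_density([[0, 1, 2, 3]], similarity)
```

By construction, the clustering that keeps the whole collection in one group has ρ = 1, so values above 1 indicate clusters that are denser than the collection as a whole and values below 1 indicate sparser ones.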

As can be observed, the computation of ρ depends heavily on the similarity measure used to determine how close two documents are. Therefore, we should consider different similarity measures in order to observe which DEM values are obtained in each case.

There are two main factors that usually affect a similarity measure between documents: the document representation and the procedure used for computing the similarity between documents under this representation. One of the most widely used models for document representation is the Vector Space Model, which has an associated family of weighting schemes that we will refer to as the “SMART codifications” (Salton, 1971). Here, vector (document) similarity is usually measured by the cosine measure, but other similarity measures derived from the Euclidean distance can also be used with this representation. Another popular document representation approach is the set model, which considers a document as a set whose elements are the document's terms. In this case, proximity between documents is often quantified by set intersection ratios, the Jaccard coefficient being one of the most popular schemes for measuring set similarity.
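For the set model, the Jaccard coefficient is the size of the intersection of two term sets divided by the size of their union. The sketch below is a minimal Python illustration; lower-casing and whitespace tokenization are assumptions made here for brevity, not the preprocessing used in the experiments.

```python
def jaccard(doc_a, doc_b):
    """Jaccard coefficient between the term sets of two documents."""
    terms_a = set(doc_a.lower().split())
    terms_b = set(doc_b.lower().split())
    union = terms_a | terms_b
    if not union:
        return 0.0  # two empty documents: define the similarity as 0
    return len(terms_a & terms_b) / len(union)


# Two of three terms are shared, out of four distinct terms overall.
example = jaccard("density based clustering", "density based algorithms")
```

Because repeated terms collapse into a single set element, this measure ignores term frequencies entirely, unlike the vector space codifications.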

In our work, we used the Jaccard coefficient and the conventional SMART coding scheme with the cosine similarity measure. In the SMART system, each codification is composed of three letters: the first two letters refer, respectively, to the TF (Term Frequency) and IDF (Inverse Document Frequency) components, whereas the third one (NORM) indicates whether normalization is employed or not. Following standard SMART nomenclature, we consider five alternatives for the TF component: n (natural), b (binary), l (logarithm), m (max-norm) and a (aug-norm); two alternatives for the IDF component: n (none) and t; and two alternatives for normalization: n (no normalization) and c (cosine). In this way, the codification ntc refers to the popular scheme where the weight of the i-th component of the vector for document d is computed as $tf_{d,i} \times \log(N / df_i)$ and then cosine normalization is applied. Here, N denotes the number of documents in the collection, $tf_{d,i}$ is the term frequency of the i-th term in document d and $df_i$ is the document frequency of the i-th term over the collection (see Manning and Schütze for a more detailed explanation). With this representation scheme we can generate 20 different codifications, but we will only consider results with the 10 normalized codifications (“**c” codifications), because codifications without normalization give equivalent results when cosine similarity is used as the proximity measure.
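The ntc codification and the cosine measure can be sketched in a few lines of Python. This is an illustration only: the natural logarithm is assumed (the base only rescales every weight), documents are assumed to be already tokenized, and the names `smart_vectors` and `cosine` as well as the toy documents are hypothetical.

```python
import math
from collections import Counter


def smart_vectors(docs, normalize=True):
    """Natural tf times log idf ("nt"); cosine-normalized ("c") if requested."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
        if normalize:
            norm = math.sqrt(sum(w * w for w in weights.values()))
            if norm > 0:
                weights = {term: w / norm for term, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["agent", "learning", "agent"],
    ["learning", "heuristic"],
    ["heuristic", "search", "search"],
]
ntc = smart_vectors(docs, normalize=True)    # "ntc" codification
ntn = smart_vectors(docs, normalize=False)   # same weights, no normalization
```

Since the cosine measure divides by the vector lengths, `cosine(ntc[0], ntc[1])` and `cosine(ntn[0], ntn[1])` agree up to rounding, which is why only the normalized codifications need to be reported.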

As explained previously, in this work we use the density of the “correct” clustering defined by a human editor to estimate how complex a corpus is for a density-based algorithm. Table 1 presents these density values, which correspond to the DEM values obtained with the correct clusterings of the three corpora, using in each case: a) SMART codifications with cosine similarity and b) the Jaccard coefficient (denoted Jac).

Here, it can be observed that the traditional ntc codification with cosine similarity gives the highest DEM values in each collection. In that sense, it should be noted that the mtc codification, which obtains the same values, is another valid candidate to be selected as the “best” codification. From now on, we will refer to the DEM value obtained with a correct clustering of a collection as the “intrinsic” DEM value of the collection. Obviously, different codifications and similarity measures used with a collection will produce different intrinsic DEM values.

Codif.  M4N   EasyAb  CiC02
atc     0.9   0.88    0.85
btc     0.9   0.88    0.84
mtc     1.07  0.93    0.87
ntc     1.07  0.93    0.87
Jac     0.78  0.74    0.79
anc     0.77  0.72    0.76
ltc     0.92  0.89    0.85
bnc     0.77  0.72    0.75
lnc     0.78  0.73    0.76
mnc     0.82  0.75    0.8
nnc     0.82  0.75    0.8

Table 1: “Intrinsic” density values

As can be observed in Table 1, the complexity of the corpora directly affects the similarity measure and, indirectly, the intrinsic DEM values obtained in each case. Considering the highest DEM values, obtained with the ntc codification, it is evident that a very good DEM value (1.07) is achieved for the Micro4News corpus (denoted M4N in the table). However, the short-text collections exhibit intrinsic DEM values that decrease with their complexity: 0.93 for the EasyAbstracts collection and the lowest DEM value (0.87) for the CICling-2002 corpus.

It is important to note that the impact of the hardness of corpora on the similarity measure can also be appreciated in the results delivered by other internal validity measures on the “correct” clustering. For example, Figure 1 shows the silhouette graphics (Rousseeuw, 1987) for the best SMART codification (ntc) with cosine similarity for the three collections used in our study. In the first case (Micro4News), each document shows an evident degree of membership in its group, but the results with EasyAbstracts are not so good and, in the CICling-2002 case, the silhouette graphics are definitely bad. These results, and the intrinsic DEM values obtained, are clear evidence of the complexity of short-text and narrow-domain short-text corpora for clustering purposes, with respect to standard document collections. In the next section, we will show how robust three density-based algorithms are to the difficulties that each collection presents.
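For readers unfamiliar with silhouettes, the value plotted for each document can be sketched as follows. The Python code assumes that dissimilarity is taken as one minus the similarity, which is an assumption of this illustration; the function name and the toy matrix are hypothetical.

```python
def silhouette_values(labels, sim):
    """Silhouette s(i) = (b - a) / max(a, b), with dissimilarity 1 - sim."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouettes need at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # convention for singleton clusters
            continue
        # a: mean dissimilarity to the other members of the same cluster
        a = sum(1 - sim[i][j] for j in own) / len(own)
        # b: smallest mean dissimilarity to the members of another cluster
        b = min(
            sum(1 - sim[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        top = max(a, b)
        values.append((b - a) / top if top > 0 else 0.0)
    return values


toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
well_separated = silhouette_values([0, 0, 1, 1], toy_sim)
mixed_up = silhouette_values([0, 1, 0, 1], toy_sim)
```

Values close to 1 correspond to documents that clearly belong to their group, while negative values correspond to documents that are on average closer to another group.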

3 Experimental results

In this section, we analyse the performance of the different density-based algorithms, organizing the discussion around the results obtained with each collection. For the experiments we consider the representation schemes that showed the highest intrinsic DEM values for each corpus (see Table 1). Consequently, the results presented below correspond to the ntc codification with cosine similarity for the three collections considered.

We used three algorithms which are considered in different works (Meyer zu Eissen, 2007; Stein and Busch, 2005) as representative of the density-based approach to the clustering problem: MajorClust (Stein and Niggemann, 1999), DBSCAN (Ester et al., 1996) and Chameleon (Karypis, Han, and Vipin, 1999).⁴ Basically, these algorithms attempt to separate the set of objects (documents) into subsets of similar density. However, a significant difference between them is whether the algorithm requires the correct number of groups (k) or not. This information has to be provided to the Chameleon algorithm, whereas MajorClust and DBSCAN determine the number of clusters k automatically. Space limitations prevent us from giving a more detailed explanation of the algorithms, but the interested reader can find more information in (Stein and Niggemann, 1999; Ester et al., 1996; Karypis, Han, and Vipin, 1999).
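To make concrete how a density-based algorithm can discover the number of clusters without being given k, the following Python sketch implements a minimal DBSCAN-style procedure over a similarity matrix. It is an illustration and not the implementation used in the experiments: the parameters `min_sim` and `min_pts` are hypothetical names playing the roles of the Eps and MinPts parameters of Ester et al. (1996), with a similarity threshold in place of a distance radius.

```python
def dbscan(sim, min_sim, min_pts):
    """Return one label per document: 0, 1, ... for clusters, -1 for noise."""
    n = len(sim)
    # Neighbourhood of i: all documents at least min_sim similar (i included).
    neighbours = [[j for j in range(n) if sim[i][j] >= min_sim] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1        # i is a core point: start a new cluster
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core point: border
            if labels[j] is not None:
                continue              # already assigned, do not expand again
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                frontier.extend(neighbours[j])  # j is core: keep expanding
    return labels


toy = [
    [1.0, 0.9, 0.1, 0.1, 0.2],
    [0.9, 1.0, 0.1, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.8, 0.2],
    [0.1, 0.1, 0.8, 1.0, 0.2],
    [0.2, 0.2, 0.2, 0.2, 1.0],
]
toy_labels = dbscan(toy, min_sim=0.5, min_pts=2)
```

In this toy run two clusters emerge and the last document is left as noise, although the number of clusters was never supplied; the result does depend on the two thresholds, which is consistent with the sensitivity to parameters discussed in the following subsections.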

3.1 Micro4News

In Table 2, we can observe that in this corpus MajorClust obtains the highest DEM values. Another interesting aspect observed during the experiments is that, despite considering different parameters that influence the way the algorithm obtains its results (threshold values), MajorClust usually yields similar (or the same) results, with density values in the interval [1.05, 1.1]. Only 6 different results were obtained with different threshold values, and 5 of them had DEM values greater than the intrinsic DEM of the collection (1.07). The F-measure values corresponding to these five DEM values were also significantly high (in the interval [0.88, 0.96]). On the other hand, Chameleon obtained lower DEM values, with more variance, than the other algorithms. However, it produced clusterings with F-measure values as good as the results obtained by MajorClust (0.96). The DEM values obtained with DBSCAN are higher than the values obtained with Chameleon and lower than the values corresponding to MajorClust. Nonetheless, its Fmax value is lower than the values obtained with the other algorithms.

⁴ We actually use a variant of Chameleon provided in the CLUTO toolkit: www.cs.umn.edu/~karypis/cluto.

Figure 1: Silhouette graphics for the three collections (panels: Micro4News, EasyAbstracts, CiCLing2002).

3.2 EasyAbstracts

The results corresponding to this collection (Table 3) confirm the tendency previously exhibited by MajorClust to obtain groupings with the highest DEM values.⁵ However, it is important to observe that in this case the differences with respect to the values obtained with the other algorithms are small, with Chameleon and DBSCAN giving very similar results. Considering the F-measure values, we can see that the highest Fmax is obtained by MajorClust (0.98), with a similar performance by Chameleon (0.96) and a poor performance by DBSCAN (0.72). If only this information is considered, these results could be interpreted as a better performance of MajorClust with respect to Chameleon. However, it is important to also take into account the Favg and Fmin values, where Chameleon clearly outperforms MajorClust. This can be appreciated graphically in Figure 2, which shows DEM values versus F-measure values of the groupings obtained with MajorClust (left) and Chameleon (right). We can observe that MajorClust achieves the highest F-measure value (0.98), but the remaining values exhibit great variation and oscillate in the interval [0.44, 0.96]. Furthermore, only a weak correlation can be observed between the DEM values and the corresponding F-measure values. Chameleon obtains very different (and interesting) results in this case. We can observe that a small number of results were obtained. However, a considerable proportion of them reached F-measure values greater than 0.82. For this algorithm, the good correlation between the DEM values and the F-measure values is also evident.

⁵ All the DEM values obtained were higher than the intrinsic DEM of the collection (0.93).

Algorithm    DEMavg  DEMmin  DEMmax  Favg  Fmin  Fmax
MajorClust   1.08    1.05    1.1     0.90  0.76  0.96
Chameleon    1.03    0.97    1.07    0.76  0.46  0.96
DBSCAN       1.05    1.01    1.1     0.82  0.71  0.88

Table 2: Micro4News: results with different density-based algorithms

Table 3: EasyAbstracts: results with different density-based algorithms

These differences between the results of both algorithms, with respect to the correlation between DEM and F-measure values, require a deeper and more detailed analysis. In this case it may be useful to consider the values of the Spearman rank correlation index (Myers and Well, 2002), which are shown in Table 4 for each collection and algorithm used in the experiments. Here, we can note that Chameleon shows the best correlations between DEM and F-measure values for all the collections considered. These results indicate that, when using EasyAbstracts with an algorithm like Chameleon, we can expect results with high DEM values to correspond to high F-measure values. This cannot be guaranteed with MajorClust, which exhibits the worst correlation value for this collection (0.15). In order to understand the causes of this poor performance of MajorClust, we analysed the groupings produced by this algorithm. An important aspect observed was that only 15% of the results had 4 groups (the correct number of groups of the collection). This is indicative: in those collections where the intrinsic DEM is not as high as in the previous collection, MajorClust will have problems generating groupings with the correct number of clusters. Therefore, it will have little chance of producing a result with a high F-measure value. As an argument in favor of MajorClust, we can say that, in those cases where the results had the correct number of clusters (4), the F-measure values achieved by MajorClust were comparable to those obtained with Chameleon.
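The Spearman rank correlation index used to relate DEM and F-measure values can be computed as the Pearson correlation of the ranks of the two series. The following Python sketch assumes average ranks for ties; the function names and the toy series are hypothetical and are not results from the experiments.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = rank
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the two rank series."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy series: the second grows whenever the first does.
toy_dem = [1.0, 2.0, 3.0, 4.0]
toy_f = [10.0, 20.0, 30.0, 40.0]
```

A value near 1 means that ranking the clusterings by DEM gives nearly the same order as ranking them by F-measure, which is the property of interest when DEM is used as a proxy for quality.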

3.3 CICLing2002

In this collection it is evident that the similarity measure does not adequately reflect the conceptual proximity between documents and, therefore, the DEM values are not reliable indicators of the quality of the results. An important consequence of this fact is that algorithms which explicitly attempt to achieve high values of this ICVM cannot offer any guarantee of obtaining good results from the user's viewpoint. This can be verified in the results shown in Table 5, where high DEM values⁶ do not correspond to high F-measure values, which are, in general, very low. Nonetheless, we can observe an interesting result in these last experiments. Chameleon reaches, as in the previous collection, better F-measure values than the other algorithms considered and also shows the best Spearman correlation value. Based on this information we can conclude that Chameleon does not always reach the highest DEM values, but the correlation between the density of the clusterings it obtains and the F-measure values is very good in collections with diverse complexity levels. An important additional observation is that Chameleon achieved, in all the collections considered, the highest (or close to the highest) F-measure value. This suggests that the mechanisms used by this algorithm for clustering the documents usually agree with the grouping criteria of a human expert, and it deserves additional research work.

⁶ Greater than 0.87, the intrinsic DEM of this collection.

4 Conclusions and Future Work

In this research work we investigated the relations between the hardness of short-text corpora, the density expected measure and the robustness of density-based algorithms. Our first conclusion is that, in collections of medium-length documents whose groups correspond to very different topics, we will probably observe high intrinsic DEM values. In these cases the three density-based algorithms will usually be able to reach these high density values in the results obtained and will also obtain good F-measure values.

In short-text corpora, the intrinsic DEM is negatively affected by the low frequencies of the document terms. This negative influence increases when narrow domains are involved. In these situations, all the algorithms were affected, but Chameleon showed a very interesting level of correlation between the DEM values and the F-measure values. These strengths of Chameleon, combined with its good F-measure results, deserve further research work employing this clustering algorithm.

With respect to the density values of the results, MajorClust reached the highest DEM values and sometimes also obtained good F-measure values. However, in collections with low intrinsic DEM values it generated a considerable number of results with a wrong number of groups, thereby affecting the F-measure values obtained.

Figure 2: EasyAbstracts: DEM vs F-measure for MajorClust (left) and Chameleon (right). Both panels plot the F-measure (vertical axis) against the Density Expected Measure (horizontal axis), together with an approximation line.

Algorithm    Micro4News (4MNG)  EasyAbstract  CICLing2002
MajorClust   -0.08              0.15          0.13
Chameleon    0.88               0.74          0.32
DBSCAN       0.87               0.5           -0.24

Table 4: Spearman correlation between DEM and F-measure

Table 5: CICLing2002: results with different density-based algorithms

References

Alexandrov, M., A. Gelbukh, and P. Rosso. 2005. An approach to clustering abstracts. In Proc. of NLDB-05, volume 3513 of LNCS, pages 8–13.

Errecalde, M. and D. Ingaramo. 2008. Short-text corpora for clustering evaluation. http://www.dirinfo.unsl.edu.ar/~ia/resources/shortexts.pdf. Technical report, LIDIC.

Errecalde, M., D. Ingaramo, and P. Rosso. 2008. Proximity estimation and hardness of short-text corpora. In TIR-08 (to appear).

Ester, M., H. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of KDD-96, pages 226–231.

Ingaramo, D., D. Pinto, P. Rosso, and M. Errecalde. 2008. Evaluation of internal validity measures in short-text corpora. In Proc. of CICLing 2008, volume 4919 of LNCS, pages 555–567.

Karypis, G., E.-H. Han, and K. Vipin. 1999. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75.

Lang, K. 1993. 20 newsgroups, the original data set. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html.

Makagonov, P., M. Alexandrov, and A. Gelbukh. 2004. Clustering abstracts instead of full texts. In Proc. of the TSD-2004, volume 3206 of LNAI, pages 129–135.

Manning, C. D. and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press.

Meyer zu Eissen, S. 2007. On Information Need and Categorizing Search. Dissertation, University of Paderborn, Feb.

Myers, J. and A. Well. 2002. Research Design and Statistical Analysis. Lawrence Erlbaum Associates, second edition.

Pinto, D., J. M. Benedí, and P. Rosso. 2007. Clustering narrow-domain short texts by using the Kullback-Leibler distance. In Proc. of CICLing 2007, volume 4394 of LNCS, pages 611–622.

Pinto, D. and P. Rosso. 2007. On the relative hardness of clustering corpora. In Proc. of TSD07, volume 4629 of LNAI, pages 155–161.

Rousseeuw, P. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20(1):53–65.

Salton, G. 1971. The Smart Retrieval System: Experiments in Automatic Document Processing. Prentice Hall.

Stein, B. and M. Busch. 2005. Density-based Cluster Algorithms in Low-dimensional and High-dimensional Applications. In TIR 05, pages 45–56.

Stein, B., S. Meyer zu Eissen, and F. Wißbrock. 2003. On cluster validity and the information need of users. In 3rd IASTED, pages 216–221.

Stein, B. and O. Niggemann. 1999. On the Nature of Structure and its Identification. In Proc. of WG99, volume 1665 of LNCS, pages 122–134.


Francisco Manuel Rangel Pardo y Anselmo Peñas Padilla

Clasificación de Páginas Web en Dominio Específico

91

�

�� ;2. � 1��

��+�� *�

9�� +��

P��<�� %8Q� %�� 6��

�� 1�� ;2.� ��

� � 1�� P***��*�<�Q� ��

P��?��*� ��Q*�

.�� 5��6�� 1�� 1��

��6�� 5��1� � 1��

�5��1��1�� '<��1)��

'��: )� �� '��)�� 1�� =��1��

�+�1�� +��6�� %��

<�� 1��1��@A.A;*�

"� ��

�

@�� 1�� 1��

��/�� 1�� 7 �5� ��1� �

�� 1�� 8�� 1��

��*�

.��6��1�� 1�� / ��

1� 1��1� �� 1�� 1��1� 1��

��1�� 1��%�?�� 1��

��1�� 1� ��1��

��+��8� �� 1��1��

��6��1�� 1� 1��

��1�� =��1��1��1�1��

�5�� /�� 1��

�� 1��+��1��

1�� 1�1�� 1�� %�� 1� ��8��

��5�/�*�

�� 1��/��1��

#�� 7 �5 � ��1� � �� #� ��+��8� �

1�� *�

4��1��C��<� ��

1�� 1�� 1�� 5��

��/��1��"3! ��+�� 7 �5*�

.�� 6�� %�� <�� 1��1��

-�� +�� +�� 1� ��5�6�� 1��

��+�� +��8�>�

�� ,�!�� -

�� G#� !�R"H�

.�� RRL� �RGH�

�� G# � R"�L3H�

��!� �� "�� R�R#H�

�� G�!� R�"GH�

�� !�� R� �R#H�

/�� L!!� #�GRH�

,�0!�� 3G� L��H�

��!��!�� !�LRH�

,1,�( "3! � !!H�

,�.(�'��/�

��?��

��1��6��1��6��

�� 1�� 5�� 1��

�=��1��7 �� 1��1��6��?�1��

�� 5�� +�� 1� � ��

� ��18 �� '�� 6�� R� 1��

��1��+8��1��6�)>�

� �0��

�� !� LR�

.�� !��#�

�� !��!!�

��!� �� !��"R�

�� !��L#�

�� !�� !�G�#�

/��!��!� �� !�#R��

,�0!�� !��L#�

,�.(�*��!��!��.��!��

.� � �� %��

��6�� 1�1� ��1��

�� 1�� 6�� 8�� 1��

��+��8� � �� 1� �� $ ��

2�� %�� 1� ��

1�� *�

#� $��%��&��

�'��

2&' ��

.� � �=��1� � 1�� 1�?�/�� 1��

�� 6�� 1�� 1�� %��

+��?��1�� 1��

1�1� ��1��1��

1�� *�4�� 1�� 5�/��

��1�� 1�� 5��

��1� � �� 1� �� 1��

1��1��

��+��8��<�� 1��1�*�

9�� <�� +�1�� =��1�� 1��

��1�?�/�� %�� 1��

�� 1� � �� 5�� O�S��

�� 6��

7 �:�T7�:��G!!#U

2&* ,3��

4�� 1�� 1��

1�� =�� 1��

��+��6��1��<��6�� 5� �1��

�� 6��1��1�� E�� /��1��

�� %�� <��6��

��1�� +8��

��1�?�/�� ?�� E�� 1��


92

�

�� 1� ��

1�� '@��1�?�M��G!!")�

9�� ?�� 6��

1�� >� �� 1�� 1��

�� 1�� 6��

�� 5� �� 1��6��

�?�1�� =�� %��

��E��1��+�� %�� 1��

�1��1��+��8� ��+��

��1��1��1��1�� %��

��*�4 �� %�� 1��6��

�?�1�� 1�� <� � 1��

��1��1�?�/�� / ��1��

�� 1�=�1� �� +��

��+��5��1��1�1��

�� '-��<� ��3)� %��

��1��6��=��%��

�� ?�� 5��/�� 1��

�� 1��1�� *� 9��

�� /�1�1� 1�� 6�� <��

�� 1��6��%��

G�G��1��1�� ?��6��

��1��6��1��1��

�� /�� GR��RH� �� ?��

1�5�� 1��6�� 1��

��1�� 1� �� 5��1� ��

�� 1�*�

2&% � ��

.��6��1��7 �:�� 5� ��

�� 6�*� 9�� ?�� 1��

��1�� B9� '�� )�� 9�

'�� )��

� ��18 �� *� �� 1�� ?� 1��

�� 6�� 16�1�� 1�� B9�

�� 9� 1�� 1� ��

��1��1� �� 1�� 1��

�� 1�� ?�� 1��

��1�� *��

.��?��6��1��1� ��

�� 1��%��

�� <�� 1�� 5��6��

��1��1��1��5� �� E��

� ��18 ��+��1�� 1��

��6��1��5� >��

�

��

+

=

G�

�45+��'��!��!��

�� 1�� 1�� 1� �

�� 1�� 1�� 18 �� 1��

�=��1� � ��1� � �� 1��

�+��6�� 1�� RH�� 16�1�� <��6��

��@!� ��%��5� � �� +�� *��

1�<�� 18 ��

��5��1��'G�L#R)�%��1��1��1��1��

'�� 1�� 1�� 1� � �� )�� 1�� 1��

�+��6�� '�RH)� �� 1�� +��1� � 1��

��5��1� 'R� �� )� �� <�?��

<��6�� %�� +��%��1��

�� =��1� �1�� 6��

��1��1��1��1�� +��

��*�

2&6 � ��

M5�� 1�� ?�� 1��

��6��1�1�� 1��

�/�� 1�� 1�� 6��

�5/��1� ��5�6��-�� 5��

�� 1�� ?��

��1�1��'��'<))�1��<��6��

1��1�� '��'<))*��1��?��1��H�

�� 1�� 1�� 1��

��>�

��


MIDAS: An Information-Extraction Approach to Medical Text Classification

MIDAS: Un enfoque de extracción de información para la clasificación de texto médico

Anastasia Sotelsek-Margalef
Universidad Carlos III de Madrid, Departamento de Ingeniería Telemática
Av. de la Universidad 30, Leganés (Spain)
[email protected]

Julio Villena-Román
DAEDALUS, S.A.
Av. de la Albufera 321, Madrid (Spain)
[email protected]
Universidad Carlos III de Madrid, Departamento de Ingeniería Telemática
Av. de la Universidad 30, Leganés (Spain)
[email protected]

Abstract: This article describes MIDAS (Medical Diagnosis Assistant), an advanced expert system that suggests a medical diagnosis from a patient's radiological/clinical records, based on information extraction and machine learning from the clinical histories of previously diagnosed patients. MIDAS was designed to participate in the 2007 Medical Natural Language Processing Challenge. Specifically, it automates the assignment of ICD-9-CM (International Classification of Diseases) codes to radiology reports, achieving good precision rates.

Keywords: expert system, medical diagnosis, medical text, natural language, information extraction, automatic classification, ICD-9-CM codes.

1 Introduction

The idea that clinical information systems can improve medical care and reduce health costs has been on the academic agenda for quite some time. Nonetheless, many hospitals still store patient data in narrative form, which produces a large quantity of information that has limited utility beyond the clinical visit because of its high volume and poor accessibility. Attempts to address the problem of free-text processing have led to a demand for software that simulates and complements what people are able to do.

This article describes MIDAS (Medical Diagnosis Assistant), an advanced expert system that suggests a medical diagnosis from a patient's radiological/clinical records, based on information extraction and machine learning from the clinical histories of previously diagnosed patients. For this task, free text is turned into actionable knowledge using Natural Language Processing (NLP) techniques, and this knowledge is then used to train machine-learning systems that classify clinical free text.

MIDAS was specifically designed to participate in the 2007 Medical Natural Language Processing Challenge (CMC, 2007), an international challenge task on the automated processing of clinical free text hosted by the Computational Medicine Center, a collaborative medical research centre between Cincinnati Children’s Hospital Medical Center and the University of Cincinnati Medical Center.

MIDAS can be considered one of the latest successors of MYCIN, one of the earliest expert systems, which was developed in the early 1970s at Stanford University to diagnose infectious blood diseases (Shortliffe, 1976).

2 Background and related work

The task of classifying physicians’ diagnoses has been addressed before. Gundersen et al. (1996) presented a system designed to assign diagnostic ICD-9-CM codes to the free text of admission diagnoses. This system encoded the diagnoses using categories from a standard classification scheme, relying on a text parsing technique informed by semantic information derived from a Bayesian network.

Yang and Chute developed ExpNet (Yang, 1994), a machine learning method for the automatic coding of medical diagnoses. This system offered improvements in scalability and computational training efficiency using Linear Least Squares Fit and Latent Semantic Indexing. Pakhomov et al. (2006) built on this groundwork with a hybrid approach that combines example-based classification with a simple but robust classification algorithm (naive Bayes) in order to improve the efficiency of diagnostic coding.

Other machine learning algorithms have been used to investigate classification problems related to medical reports. These include decision trees and symbolic rule induction (Johnson, 2002) and maximum entropy (Nigam, 1999), among others.

As far as information extraction is concerned, many systems rely on patterns. Early pattern-based work such as AutoSlog (Riloff, 1993) addressed the need for domain-specific dictionaries by building dictionaries of concepts automatically from text. Other systems such as MedLEE (Friedman, 1994) use patterns to represent particular scenarios or events; the desired information is found by mapping clinical information into a structured representation containing clinical terms.

Linguistic variations of existing patterns have also been explored to enlarge the set of domain patterns (Hobbs, 2003). Systems of this type have the advantage of being able to “learn” patterns without massive amounts of hand-tagged training data. Other groups have worked on the problem of automated biomedical concept recognition. The SAPHIRE system designed by Hersh and Hickam (1995) automatically encodes UMLS concepts using lexical mapping. The lexical approach is computationally fast and useful for real-time applications. More recently, Zou et al. (2003) developed IndexFinder, which adds syntactic and semantic filtering on top of lexical mapping to improve performance.

3 Description of Data

We used the data provided for the Computational Medicine Center’s 2007 Medical Natural Language Processing Challenge (CMC, 2007). The corpus was collected from the Cincinnati Children’s Hospital and includes a repertoire of codes covering a substantial proportion of actual paediatric radiology activity. It was originally developed to train machine learning systems for the automatic billing of medical records and related activities. The sample is representative of the problem: it has enough data in the well-represented classes for an automatic labeller to perform adequately, and it provides a proportionate representation of low-frequency classes.

An ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification) code is a three- to five-digit number with a decimal point after the third digit. Codes are organized in a hierarchy whose highest levels group related codes under consecutive numbers, e.g.:

(580-629) GENITOURINARY SYSTEM
  - (580-589) NEPHRITIS AND NEPHROSIS
    - 580 Acute glomerulonephritis
      - 580.8 Other specified pathological lesion in kidney
        - 580.81 Acute glomerulonephritis in diseases classified elsewhere
        - 580.89 Other

Two sections in a radiology report are fundamental for assigning ICD-9-CM codes: the clinical history, provided by the ordering physician before a radiological procedure, and the impression, reported by a radiologist after the procedure. The language of clinicians is fundamental to patient care, but it lacks the structure and clarity necessary for natural language analysis. These clinical annotations are dense with medical jargon and acronyms that often have multiple meanings. To resolve the ambiguities found in the free text, a series of clinical disambiguation rules was developed with the help of clinical experts to translate the ambiguous terms, clinical acronyms and abbreviations.
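The ICD-9-CM code layout described earlier in this section (three to five digits, with a decimal point after the third) can be checked mechanically. The following is a minimal sketch, assuming purely numeric codes as in the examples of this paper; the names `is_icd9cm` and `icd9cm_ancestors` are illustrative and are not part of MIDAS.

```python
import re

# Three digits, optionally followed by a decimal point and one or two digits.
# Assumption: only numeric codes are handled here.
ICD9CM_PATTERN = re.compile(r"\d{3}(\.\d{1,2})?")


def is_icd9cm(code):
    """Return True if `code` has the three- to five-digit layout."""
    return ICD9CM_PATTERN.fullmatch(code) is not None


def icd9cm_ancestors(code):
    """List the less specific codes above `code`, most general first."""
    if not is_icd9cm(code):
        raise ValueError(f"not a numeric ICD-9-CM code: {code!r}")
    category, _, detail = code.partition(".")
    if not detail:
        return []  # a bare three-digit category has no numeric ancestor
    # 580.81 -> ["580", "580.8"]
    return [category] + [f"{category}.{detail[:i]}" for i in range(1, len(detail))]
```

For the hierarchy fragment shown above, `icd9cm_ancestors("580.81")` yields the category 580 followed by the subcategory 580.8.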

Finally, the data was converted to XML with two top-level subdivisions: texts and codes. Figure 1 shows a fragment of the patient record file.

Figure 1: Example of patient data
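Records in the XML layout of Figure 1 can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the enclosing `<docs>` root element and the function name `read_reports` are assumptions made for illustration, since the paper shows only `<doc>` fragments.

```python
import xml.etree.ElementTree as ET

# One record in the layout of Figure 1.
# Assumption: the <docs> wrapper is hypothetical; it is only needed so that
# the sample is a well-formed document with a single root.
SAMPLE = """<docs>
  <doc id="97636670" type="RADIOLOGY_REPORT">
    <codes>
      <code type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text type="CLINICAL_HISTORY">Eleven year old with ALL, bone marrow transplant on Jan. 2, now with three day history of cough.</text>
      <text type="IMPRESSION">1. No focal pneumonia. Likely chronic changes at the left lung base. 2. Mild anterior wedging of the thoracic vertebral bodies.</text>
    </texts>
  </doc>
</docs>"""


def read_reports(xml_text):
    """Yield one dict per <doc>: its id, its codes and its text sections."""
    root = ET.fromstring(xml_text)
    for doc in root.iter("doc"):
        yield {
            "id": doc.get("id"),
            "codes": [(c.text or "").strip() for c in doc.iter("code")],
            "texts": {t.get("type"): (t.text or "").strip() for t in doc.iter("text")},
        }


reports = list(read_reports(SAMPLE))
```

Each resulting dictionary keeps the two top-level subdivisions, texts and codes, side by side, which is the form needed to pair extracted features with training labels.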

4 System Architecture

The system follows a modular cascade architecture (Figure 2). The first module extracts and structures clinical information from textual radiology reports and translates it into terms of a controlled vocabulary, so that the clinical information can be accessed by further automated procedures. The objective is to automate enough understanding of the contents of clinical records to label every phrase that carries information about symptoms and signs of diseases; these labels are later used to train the classification algorithm. Each symptom, which is not always a single word, is labelled as present, absent, family, history, past or unknown following a set of linguistic context rules.

The information extraction task is based on semantic pattern matching, which identifies particular values of interest embedded in free text and determines how each value is categorized. Keyword extraction from free-text reports is susceptible to all the problems that arise from the complexities of natural language, such as grammatical ambiguities, synonymy, negation of concepts and distribution of concepts (Sager, 1997).

Finally, a classifier is built based on a suitable ML algorithm. Weka (Witten, 2005) was used for the experiments.

4.1 Linguistic Preprocessor

Clinical documents usually contain syntactic structures that would generally be considered incorrect. Shorthand and telegraphic writing styles are common in radiology reports (in both fields, clinical history and impression). In addition, syntactic tagging requires every word or phrase to be tagged, whereas in our case only the targeted information needs to be identified. Sentences that are irrelevant to the domain can be ignored without affecting the final classification. Therefore, no syntactic parser is used in our system.

The first step to translate all relevant information into structured form is to standardize the character representation of the text and remove custom text formatting. Simple heuristic rules eliminate or modify line feeds, sequences of blanks between words and punctuation marks.
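The normalization step just described can be sketched with a few regular expressions. The specific rules below (Unicode folding, whitespace collapsing, tidying of punctuation) are assumptions chosen for illustration; the paper does not list its heuristic rules, and `normalize_report` is a hypothetical name.

```python
import re
import unicodedata


def normalize_report(text):
    """Standardize characters and layout of a free-text report (illustrative rules)."""
    # Fold compatibility characters (e.g. non-breaking spaces) into plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace line feeds and sequences of blanks with a single space.
    text = re.sub(r"\s+", " ", text)
    # Remove blanks before punctuation marks.
    text = re.sub(r"\s+([.,;:])", r"\1", text)
    # Collapse a repeated punctuation mark such as ".." into one.
    text = re.sub(r"([.,;:])\1+", r"\1", text)
    return text.strip()
```

The order of the substitutions matters: blanks are collapsed first so that the punctuation rules only have to deal with single spaces.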

<doc id="97636670" type="RADIOLOGY_REPORT">
  <codes>
    <code type="ICD-9-CM">786.2</code>
  </codes>
  <texts>
    <text type="CLINICAL_HISTORY">Eleven year old with ALL, bone marrow transplant on Jan. 2, now with three day history of cough.</text>
    <text type="IMPRESSION">1. No focal pneumonia. Likely chronic changes at the left lung base. 2. Mild anterior wedging of the thoracic vertebral bodies.</text>
  </texts>
</doc>
<doc id="99636934" type="RADIOLOGY_REPORT">
  <codes>
    <code type="ICD-9-CM">593.70</code>
    <code type="ICD-9-CM">599.0</code>
  </codes>
  <texts>
    <text type="CLINICAL_HISTORY">10-year 5-month - old female with history of urinary tract infection. Patient had nuclear cystogram and was found to have left grade II vesicoureteral reflux. Last ultrasound of Jan. 27, 2001 demonstrated little growth of the right kidney compared to the left, otherwise stable renal ultrasound.</text>
    <text type="IMPRESSION">1. Normal renal ultrasound with interval growth of the both kidneys.</text>
  </texts>
</doc>

[Diagram labels: Clinical reports, Linguistic, Information Extraction, Feature matrix, Classification, ICD code, Clinical reports database]

Figure 2: Overview of System Architecture

Then the structural analyzer segments the report into sections (e.g., clinical history and impression), sentences and words. Stop words are filtered according to their usage and their usefulness in this context. Words such as also and or are eliminated since they are not useful in the labelling process.
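Segmentation and stop-word filtering can be sketched as follows. The stop-word list here is a small assumption made for illustration (the paper does not publish its list), and the sentence splitter is deliberately naive: it would also split after abbreviations such as "Jan." or list numbers such as "1.".

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one used in MIDAS.
STOP_WORDS = {"also", "the", "a", "an"}


def split_sentences(section_text):
    """Split a report section at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", section_text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lower-case a sentence and return its words without stop words."""
    words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]
```

A production splitter would need an abbreviation list for this domain; the sketch only shows where the filtering sits in the cascade.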

The lexicon was developed manually. It includes both single words and multiword phrases (multiword units). Multiword combinations capture the content of the documents better and therefore improve retrieval performance. In addition, abbreviations, proper names and descriptive adjectives that may not be found in electronic medical glossaries were also included.

A lexical lookup is then performed to identify multiword phrases. For instance, the phrase “history of pneumonia” is treated as a sequence of two terms, “history of” and “pneumonia”, because the first term is listed as a multiword phrase in the lexicon. In the following stages, these multiword phrases are treated as single entities.
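One common way to implement such a lookup is greedy longest-match chunking against the lexicon. The sketch below takes that approach as an assumption (the paper does not state which matching strategy MIDAS uses), with a toy lexicon and a hypothetical function name `chunk_terms`.

```python
# Assumption: a toy lexicon; entries are tuples of lower-cased words.
LEXICON = {
    ("history", "of"),
    ("pneumonia",),
    ("pleural", "effusion"),
    ("urinary", "tract", "infection"),
}
MAX_LEN = max(len(entry) for entry in LEXICON)


def chunk_terms(words):
    """Group words into lexicon terms, trying the longest candidate first."""
    terms, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            # A single word that is not in the lexicon is kept as it is.
            if candidate in LEXICON or n == 1:
                terms.append(" ".join(candidate))
                i += n
                break
    return terms
```

With this lexicon, the words of "history of pneumonia" are grouped into the two terms named in the text.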

The next phase maps the different forms of the same word or multiword unit onto a single term of the controlled vocabulary lexicon. To this end, a synonym knowledge base containing standard forms and their corresponding synonyms is used: if a value matches a synonym entry in the knowledge base, it is replaced by the corresponding controlled vocabulary concept.
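The synonym knowledge base can be pictured as a lookup table from surface forms to standard forms. The sketch below uses a few entries from Appendix 3 and assumes that the first item of each entry is the standard form; that reading of the appendix, and the name `standardize`, are assumptions.

```python
# Each entry lists a standard form followed by its synonyms (from Appendix 3).
# Assumption: the first item of an entry is the controlled-vocabulary term.
SYNONYM_ENTRIES = [
    ["cough", "coughing"],
    ["fever", "febrile"],
    ["urinary tract infection", "uti", "utis"],
    ["vomiting", "emesis"],
    ["tuberculosis", "tb"],
]

# Invert the entries into a lookup table: surface form -> standard form.
TO_STANDARD = {
    variant: entry[0]
    for entry in SYNONYM_ENTRIES
    for variant in entry
}


def standardize(terms):
    """Replace every term that has a synonym entry with its standard form."""
    return [TO_STANDARD.get(term.lower(), term.lower()) for term in terms]
```

Terms without an entry pass through unchanged (apart from lower-casing), so the mapping can be applied to every term produced by the lexical lookup.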

Although the system was designed to consider every reference made to a symptom, phrases like “rule out”, “evaluate for” or “look for” do not appear to be useful for classification, since the same report contains another reference to the sign or symptom that indicates its diagnosis (e.g., “no findings consistent with acute pneumonia”). These phrases are therefore eliminated in the pre-processing stage without loss of relevant information.

4.2 Structured Representation

Our data model can be described as a set of attributes (e.g., signs and symptoms) with their corresponding values. Our objective is to extract information on the existence and diagnostic interpretation of findings. Looking up isolated word meanings is not enough to distinguish whether a symptom is present, absent, or not mentioned at all. Furthermore, other tags such as family were added so that a symptom is not attributed to the patient when it was actually suffered by, for example, a sibling.

Rules specific to the writing style of medical reports are used to assign the different tags. Negation, a particularly troublesome aspect of natural language processing, is represented by the atomic category absent. The target structure for negation is a finding qualifier whose value belongs to a given list of key words (e.g., no, without).

Since more than one value may apply to an attribute, the value that best corresponds to it must be determined. To resolve such inconsistencies, the labels are looked for in the text in a fixed order of priority. The label absent takes priority over history: in the phrase “no history of pneumonia”, the attribute pneumonia is therefore correctly labelled as absent.

For multi-valued attributes such as the age of the patient, regular expressions are used. Regular expressions have been widely used for lexical pattern matching tasks. Each attribute is assigned a set of regular expressions that represent every way in which a valid value for that attribute can be lexically expressed within a document (Meng, 2004).

The label suspected is associated with certainty information related to the finding. Semantic relations such as “could represent”, “suggesting” and “consistent with” are recognized, and the finding to which they refer is assigned this label. Because many words and phrases are linked to this type of information, and because their underlying meanings are vague, they are all mapped into a single category. We considered extracting more detailed information in terms of low, moderate or high certainty, but finally rejected that idea. In other applications, handling qualitative information more precisely may be important, in which case more labels could be desirable.
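Both mechanisms described above, regular expressions for a multi-valued attribute such as age and cue phrases for the suspected label, are pattern-based. The sketch below illustrates them with patterns written for the two clinical histories shown in Figure 1; the patterns, the number-word table and the function names are assumptions and not the expressions used in MIDAS.

```python
import re

# Assumption: ages written in words are limited to this small table.
WORD_TO_NUMBER = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
NUMBER = r"\d+|" + "|".join(WORD_TO_NUMBER)
UNIT = r"year|month|week|day"

# Matches "Eleven year old", "3-month-old" and "10-year 5-month - old".
# The trailing "old" keeps durations such as "three day history" from matching.
AGE = re.compile(
    rf"\b(?P<value>{NUMBER})[\s-]*(?P<unit>{UNIT})s?"
    rf"(?:[\s-]+(?:{NUMBER})[\s-]*(?:{UNIT})s?)?"
    rf"[\s-]*old\b",
    re.IGNORECASE,
)

# Certainty cues quoted in the text; all map to the single label "suspected".
SUSPECTED = re.compile(r"\b(?:could represent|suggesting|consistent with)\b", re.IGNORECASE)


def extract_age(text):
    """Return (value, unit) for the leading component of the first age found."""
    match = AGE.search(text)
    if match is None:
        return None
    raw = match.group("value").lower()
    value = int(raw) if raw.isdigit() else WORD_TO_NUMBER[raw]
    return value, match.group("unit").lower()


def is_suspected(sentence):
    """Return True if the sentence contains one of the certainty cues."""
    return SUSPECTED.search(sentence) is not None
```

Only the first component of a compound age is returned here; keeping the months of "10-year 5-month" would require a second pair of named groups.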

Parallel findings, such as “hyperinflated lungs without pleural effusion”, are represented as independent findings, the first labelled with the tag present and the second with the tag absent. Sentences containing “or” and “and”, such as “no pneumonia or atelectasis” and “history of cough and fever”, are interpreted as two findings: in the former case both pneumonia and atelectasis are labelled as absent, and in the latter cough and fever are labelled as history.
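The behaviour described in this section (qualifiers that carry over conjunctions, and absent taking priority over history) can be reproduced with a simple left-to-right scan. This is a sketch under assumptions: the finding and cue inventories are small illustrative samples, the scope of a qualifier is taken to extend to the end of the sentence, and `label_findings` is a hypothetical name.

```python
import re

# Assumption: small illustrative inventories, not the MIDAS lexicon.
FINDINGS = ["hyperinflated lungs", "pleural effusion", "pneumonia",
            "atelectasis", "cough", "fever"]
CUES = {"no": "absent", "without": "absent", "history of": "history"}
RANK = {"absent": 0, "history": 1}  # lower rank wins when cues are stacked

# Longer alternatives come first so that "history of" is tried before "no".
TOKEN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(FINDINGS + list(CUES), key=len, reverse=True))
    + r")\b"
)


def label_findings(sentence):
    """Label each finding with the strongest qualifier seen to its left."""
    label, pending, result = "present", [], {}
    for match in TOKEN.finditer(sentence.lower()):
        token = match.group(1)
        if token in CUES:
            pending.append(CUES[token])
            continue
        if pending:
            # Several cues before one finding: the highest-priority label wins.
            label = min(pending, key=RANK.get)
            pending = []
        # With no new cue, the previous label carries over "and" / "or".
        result[token] = label
    return result
```

Applied to the examples in the text, the scan labels both findings of "no pneumonia or atelectasis" as absent and splits "hyperinflated lungs without pleural effusion" into one present and one absent finding.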



4.3 Classification

The classifier was built using Weka, a suite of machine learning software that implements numerous learning algorithms. The first problem we encountered was how to handle a multi-labelled data set, as Weka does not support multi-label learning. The chosen solution was to create new artificial classes, each corresponding to a combination of labels (e.g., 780.6-786.2).
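Turning each combination of labels into one artificial class is a small transformation. The sketch below assumes that the codes of a report are joined with a hyphen in sorted order, which matches class names such as 780.6-786.2 used in this paper; the function names are illustrative.

```python
from collections import Counter


def combined_class(codes):
    """Join the ICD-9-CM codes of one report into a single class name."""
    # Sorting makes the name independent of the order in which codes are listed.
    # Assumption: plain string ordering is enough for these numeric codes.
    return "-".join(sorted(codes))


def split_class(name):
    """Recover the individual codes from an artificial class name."""
    return name.split("-")


def class_distribution(reports):
    """Count the reports per artificial class; `reports` is a list of code lists."""
    return Counter(combined_class(codes) for codes in reports)
```

After this step every report has exactly one class, so any single-label learner can be trained on the data; the price is that combinations unseen in training can never be predicted.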

Several algorithms were evaluated, and after preliminary evaluations two of them were selected: the classical C4.5 decision tree algorithm (Quinlan, 1993), implemented as J48 in Weka, and the k-Nearest-Neighbour classifier (Mitchell, 1997), implemented as IBk in Weka.
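To make the nearest-neighbour idea concrete, the sketch below classifies a report from the labels of its attributes. It is a plain-Python stand-in and not Weka's IBk: the training rows are hypothetical, and the distance simply counts the attributes on which two rows disagree.

```python
from collections import Counter

# Hypothetical training rows: labels of (cough, fever, pneumonia) -> class.
TRAIN = [
    (("present", "unknown", "absent"), "786.2"),
    (("unknown", "present", "absent"), "780.6"),
    (("present", "present", "absent"), "780.6-786.2"),
    (("present", "present", "present"), "486"),
]


def overlap_distance(a, b):
    """Number of attributes on which two rows disagree."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=1):
    """Return the majority class among the k training rows closest to `query`."""
    nearest = sorted(train, key=lambda row: overlap_distance(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With k=1 the prediction is the class of the single closest training report, which corresponds to the first of the two IBk settings evaluated in the next section.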

5 Evaluation

The 1,954 patient reports provided contain 29 different ICD-9-CM labels (e.g., 780.6) that form 89 distinct combinations (e.g., 780.6-786.2).

Code            Description                                                No.
786.2           Cough                                                      155
599.0           Urinary tract infection                                    114
593.70          Unspecified or w/o reflux nephropathy                      80
780.6-786.2     Fever-Cough                                                76
486             Pneumonia                                                  66
780.6           Fever                                                      41
591             Hydronephrosis                                             40
786.50          Chest pain                                                 32
596.54          Neurogenic bladder                                         31
788.30          Urinary incontinence                                       29
599.7           Hematuria                                                  25
786.07          Wheezing                                                   24
795.5           Nonspecific reaction to tuberculin test w/o tuberculosis   16
591-593.89      Hydronephrosis-disorders of kidney and ureter              16
493.90          Asthma                                                     15
277.00          Cystic Fibrosis                                            15
518.0           Pulmonary collapse                                         12
786.07-786.2    Wheezing-Cough                                             12
759.89          Congenital malformation                                    11
596.54-741.90   Neurogenic bladder-w/o hydrocephalus                       11

Table 1: Distribution of radiology reports in the largest categories of the training set.

Table 1 shows the number of reports per category, the ICD-9-CM code and its description for those categories with more than 10 reports in the training set.

Three different experiments were performed, one based on J48 (decision trees) and the other two based on IBk (kNN), using two values for k (number of neighbours). Experiments were run using a 10-fold cross validation test. Results are shown in Table 2. The standard evaluation metric of F-Measure, the weighted harmonic mean of precision and recall, was calculated, using the micro-averaged figure (value is first calculated for each category and then averaged). J48 achieves the best performance.
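The F-Measure and the two usual ways of averaging it can be written down in a few lines. Averaging the per-category values, as the parenthetical above describes, is commonly called macro-averaging, whereas micro-averaging pools the counts of all categories before computing a single value; the sketch computes both without assuming which one underlies Table 2. The counts used in the usage note are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Balanced F-Measure: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(counts):
    """Average of the per-category F-Measures."""
    return sum(f_measure(*c) for c in counts.values()) / len(counts)


def micro_f(counts):
    """F-Measure of the counts pooled over all categories."""
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return f_measure(tp, fp, fn)


# Hypothetical (true positive, false positive, false negative) counts.
EXAMPLE_COUNTS = {"786.2": (9, 1, 1), "518.0": (1, 1, 3)}
```

On these counts the two averages differ noticeably, because the pooled figure is dominated by the large category while the per-category average gives the small one equal weight.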

ML algorithm    (micro-averaged) F-Measure
J48             0.8004
IBk (k=1)       0.7671
IBk (k=2)       0.7625

Table 2: F-Measure values

Table 3 shows the detailed accuracy per class of the J48 algorithm. Notice that precision and recall tend to be better for those categories with a high number of instances (shown in Table 1).

Code            Precision   Recall   F-Measure
786.2           0.913       0.91     0.911
599.0           0.872       0.93     0.9
593.70          0.84        0.882    0.861
780.6-786.2     0.842       0.954    0.894
486             0.841       0.879    0.859
780.6           0.837       0.878    0.857
591             0.729       0.765    0.747
786.50          0.87        0.923    0.896
596.54          0.757       0.903    0.824
788.30          0.94        0.81     0.87
599.7           0.796       0.86     0.827
786.07          0.816       0.833    0.825
795.5           0.875       0.875    0.875
591-593.89      0.92        0.719    0.807
493.90          0.762       0.533    0.627
277.00          1           1        1
518.0           0.625       0.4      0.488
786.07-786.2    0.75        0.875    0.808
759.89          1           0.818    0.9
596.54-741.90   0.615       0.364    0.457

Table 3: Detailed accuracy per class.

If categories with fewer than 5 reports are filtered out of the data, the percentage of correctly classified instances is noticeably higher (Table 4).

Algorithm    F-Measure   Increment
J48          0.8586      7.3%
IBk (k=1)    0.8255      7.6%
IBk (k=2)    0.7959      4.3%

Table 4: Results for categories with 5 or more instances.



Regrettably, we were not able to submit any run to the challenge because of delays in system development. In fact, only 44 of the more than 120 registered participants finally submitted results. The best and worst systems achieved F-Measure values of 0.8908 and 0.1541, respectively. The average value was 0.7670, with a standard deviation of 0.1340. In addition, 21 systems obtained F-Measure values between 0.81 and 0.90.

The groups in 1st and 3rd position used machine learning approaches, whereas the system in 2nd position was based on symbolic methods. In fact, the best system was based on a particular implementation of the C4.5 algorithm, the same algorithm used in our system.

6 Conclusions and Future Work

The expected potential of such systems is to make available a large body of clinical information that would otherwise be inaccessible to applications other than manual physician review. We do not intend to replace coded data entry; rather, we offer a solution for the virtual enrolment of previously evaluated patients that would benefit research studies, teaching hospitals and physicians with a large workload in emergency situations.

The accuracy and hence the utility of a medical natural language processor relies heavily on the number and diversity of high-quality training examples. Furthermore, the accuracy of a language system depends on the specific information that it extracts. The important types of information for a given type of study should be established a priori, allowing system developers to emphasize training on high-priority information items.

Natural language used within patient documents is limited in word and phrasal variation. Thus the linguistic context in which the information to be extracted resides takes only a few basic structural forms. With a reasonable amount of training, which in MIDAS means labelling domain-specific symptoms, a system built with the described methodology can be expected to obtain good results.

We believe that our system could allow medical experts, after making the necessary configuration changes, to tune the processor to their particular field without expertise in the technical aspects of the system.

Moreover, although MIDAS has been specifically applied to the radiology domain, the proposed methodology is modular and extensible and can be ported to other clinical domains. The system’s adaptability to new clinical domains will be explored in future work.

References

Computational Medicine Center (CMC). 2007. Medical Natural Language Processing Challenge. http://computationalmedicine.org/challenge

Friedman C, Alderson P, Austin J, Cimino JJ and Johnson SB. 1994. A general natural language text processor for clinical radiology. Journal of American Medical Informatics Association, March 1994, 1(2):161-174.

Gundersen ML, Haug PJ, Pryor TA, et al. 1996. Development and evaluation of a computerized admission diagnoses encoding system. Comp Biomed Res; 29(5): 351–72.

Hersh WR, Hickam D. 1995. Information retrieval in medicine: the SAPHIRE experience. Medinfo, 8 Pt 2:1433–7.

Hobbs JR. 2003. Information extraction from biomedical text. Journal of Biomedical Informatics.

Johnson D., et al. 2002. A decision tree based symbolic rule induction system for text categorization. IBM Systems Journal, 41(3).

Meng F, Chen AA, Son RY, Taira RK, Churchill BM, Kangarloo H. 2004. Information Extraction Using Semantic Patterns for Populating Clinical Data Models. METMBS’04: 10-16.

Mitchell TM. 1997. Machine Learning. McGraw-Hill.

Nigam K, Lafferty J, McCallum A. 1999. Using Maximum Entropy for Text Classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering.

Pakhomov S, Buntrock J, Chute CG. 2006. Automating the assignment of diagnosis codes to patient encounters. Journal of American Medical Informatics Association, 13: 516-525.

Quinlan JR. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

Riloff E. 1993. Automatically constructing a dictionary for information extraction tasks. Proceedings of the 11th National Conference on Artificial Intelligence, AAAI Press, pp. 811-816.

Sager N. 1997. Medical Language Processing: Computer Management of Narrative Data. Springer-Verlag, New York.

Shortliffe E. 1976. MYCIN: Computer-Based Medical Consultations. Elsevier, New York.

Witten IH, Frank E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier Inc.

Yang Y, Chute CG. 1994. An application of Expert Network to clinical classification and MEDLINE indexing. Journal of American Medical Informatics Association 18; 157-61.

Zou Q, Chu WW, Morioka C, et al. 2003. IndexFinder: a method of extracting key concepts from clinical texts for indexing. Proceedings of AMIA Symposium; 763–7.

A Appendix 1: Web interface

The web interface of the system is shown in Figure 3. There are two textboxes for entering the clinical history (physician information) and the impression (radiologist report), and the diagnosis is shown in real time after clicking on the “Diagnose” button.

Figure 3: Web interface

B Appendix 2: List of symptoms

The list of symptoms in the MIDAS system covers the whole range of illnesses included in the CMC challenge, a substantial proportion of actual paediatric radiology activity.

abdominal pain, air space disease, anomal, anuresis, asthma, atelectasis

Beckwith Wiedemann syndrome, bronchiectasis

calculi, cardiopulmonary disease, chest pain, chest tightness, congestion, consolidation, cough, cystic fibrosis

deflux, difficulty breathing, dilatation, distended bladder, duplication

enuresis

fever, flank pain

hematuria, hemihypertrophy, horseshoe kidney, hydronephrosis, hydroureter, hydroureteronephrosis, hyperinflat, hypertrophy, hypoventilation

infiltrate, interval growth

lobe collapse, loss of appetite, lymphadenopathy

mass, myelomeningocele

neurogenic bladder, normal chest, normal heart, normal kidney, normal lungs

peribronchial cuffing, peribronchial thickening, pleural effusion, pneumonia, pneumothorax, positive PPD, post void residual, proteinuria, pyelectasis, pyelocaliectasis, pyeloplasty

reactive airway, reflux, renal transplant

shortness of breath, sore throat, spina bifida

tachypnea, tuberculosis, Turner syndrome

ureteropelvic junction obstruction, unilateral kidney, ureterocele, urinary incontinence, urinary tract infection, urothelial thickening

vesicoureteral reflux, voiding dysfunction, vomiting

wheezing, Williams syndrome, Wiskott Aldrich

C Appendix 3: List of synonyms

airway disease, reactive airway
calculi, calculus, calcifications
cough, coughing
cystic fibrosis, CF
difficulty breathing, work of breathing
disease, illness
duplication, duplicated kidney
examination, evaluation, exam, study
family, history of, siblings, brothers
fever, febrile
hyperinflation, hyperinflated lungs
interval growth, interval renal growth
may, could
normal, unremarkable, stable, clear, normal radiographic appearance of the, normal radiographs of the, normal sonographic appearance of the, normal examination of the, normal sonographic examination of the
normal heart, heart normal
normal kidney, kidney normal, normal renal
normal lungs, lungs normal
post void, postvoid
positive PPD, reactive PPD
prior, previous, past, status post, had
probable, may represent, likely representing, likely represent, probably representing, favored to represent, raising the question of, can be associated, may be related, sometimes associated with, consistent with
probable, worrisome, questionable, suggesting, suggest, suggests, suggestive, suspected, presumed, suspicion, possible, likely, unsure
radiograph, x ray
represent, reflect
shortness of breath, breathlessness
sonography, ultrasound, sonogram
tuberculosis, TB
urinary incontinence, wetting
urinary tract infection, UTI, UTIs
viral disease, viral infection
vomiting, emesis
vs, versus, is favored over, favored over



Computational Lexicography

Mutual terminology extraction using a statistical framework

Extracción mutua de términos utilizando un marco estadístico

Le An Ha
University of Wolverhampton
[email protected]

Ruslan Mitkov
University of Wolverhampton
[email protected]

Gloria Corpas
Universidad de Malaga
[email protected]

Resumen: El presente trabajo aborda la utilización de un marco estadístico para la extracción de terminología bilingüe por asociación o información mutua. Se proponen tres modelos probabilísticos para evaluar si el alineamiento automático puede desempeñar un papel activo en la extracción de terminología bilingüe y si ello es extrapolable a la extracción de terminología bilingüe por información mutua. Los resultados indican que dichos modelos son válidos y que la extracción de terminología bilingüe por información mutua puede ser un enfoque viable.

Palabras clave: Extracción automática de términos, extracción bilingüe de términos.

Abstract: In this paper, we explore a statistical framework for mutual bilingual terminology extraction. We propose three probabilistic models to assess the proposition that automatic alignment can play an active role in bilingual terminology extraction and translate it into mutual bilingual terminology extraction. The results indicate that such models are valid and can show that mutual bilingual terminology extraction is indeed a viable approach.

Keywords: Automatic terminology extraction, bilingual terminology extraction.

1 Introduction

The identification of terms in scientific and technical documents is crucial for any application dealing with the analysis, understanding, generation, and translation of such documents. Over the last decade, computational linguists, translators, lexicographers, and computer engineers, among other specialists, have been interested in automatically identifying terminology in texts, and software tools for terminology-related tasks have been designed and implemented. There is also increasing interest in bilingual terminology extraction (BLTE) (detailed in Section 2), whose usual approach is monolingual terminology extraction followed by automatic alignment. Recently, it has been suggested that automatic alignment can play a bigger role in bilingual terminology extraction, by assuming that a noun phrase in the target language that is aligned to a term in the source language is more likely to be a term (Ha et al. 2008). In that paper, the authors provide an ad hoc framework to assess the effect of the term scores of the source language noun phrases on the term scores of the target language noun phrases. Although such an ad hoc assessment is a promising approach, it relies on experiments to find the optimised settings.

In this paper, we provide a statistical framework to examine this assumption, through several models in which the probability that a noun phrase in the target language is a term is affected by the probability of its alignment to a term in the source language. Our statistical models therefore provide a better foundation for mutual bilingual terminology extraction.

This paper is organised as follows: after this introduction, we discuss terminology extraction in general (Section 2). Our models are then presented in Section 3, and their evaluation in Section 4. Conclusions and future directions are given in Section 5.



2 Terminology extraction (monolingual and bilingual)

2.1 Monolingual terminology extraction

The main stages in terminology work can be summarised as: extraction of terms from a corpus, validation of terms found, and organisation of validated terms by domain and sub-domain (Sauron, 2002). In this respect, a number of projects have created automatic extraction tools, which identify term candidates by starting from a corpus in electronic form. Some projects go one step further: on the basis of parallel corpora of texts and their translations they propose not only candidate terms but also possible equivalents in a target language.

Approaches to term extraction (TE) are usually classified as linguistic, statistical, or hybrid. Linguistic and statistical approaches can be further subdivided into term-based (intrinsic) and context-based (extrinsic) methods (cf. Bourigault et al., 2001; Streiter et al., 2003).

Terminology Extraction tools (TETs) following a linguistic approach try to identify terms by their linguistic (morphological and syntactic) structure. For this purpose, texts are annotated with linguistic information with the help of morphological analysers, part-of-speech taggers and parsers. Then, term candidates (TCs) following certain syntactic structures are filtered from the annotated text using pattern matching techniques. Intrinsic methods try to filter TCs according to their internal (i.e. morphological) structures (Ananiadou 1994). Extrinsic methods, on the other hand, try to identify TCs by analysing the morpho-syntactic structure of a word or phrase, for example by looking for part-of-speech sequences such as NP = noun + noun (e.g. computer science). An example of this kind is the program LEXTER (Bourigault, 1992). Another commonly used technique consists in filtering TCs by looking for commonly used text structures such as definitions and explanatory contexts like “X is defined as …” or “X is composed of …” (cf. Pearson, 1998).
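As an illustration of the part-of-speech pattern matching described above, the following minimal sketch collects noun + noun sequences from a sentence that has already been tagged. The tag label "NOUN", the function name and the input format (a list of word/tag pairs) are assumptions made for this example; they are not taken from LEXTER or any other tool mentioned here.

```python
from typing import List, Tuple

def noun_noun_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return every adjacent noun + noun pair as a term candidate.

    `tagged` is a list of (word, tag) pairs; the tag name "NOUN" is an
    assumption of this sketch, not a tagset used in the paper.
    """
    candidates = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 == "NOUN" and t2 == "NOUN":
            candidates.append(f"{w1} {w2}")
    return candidates

# Hypothetical tagged sentence used only for illustration.
sentence = [("she", "PRON"), ("studies", "VERB"),
            ("computer", "NOUN"), ("science", "NOUN")]
print(noun_noun_candidates(sentence))
```

A real linguistic TET would support richer patterns (adjectives, prepositions, longer sequences); this sketch covers only the single noun + noun pattern named in the text.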

The general assumption underlying the statistical approaches to TE is that specialised documents are characterised by the repeated use of certain lexical units or morpho-syntactic constructions. TETs based on statistics try to filter out words and phrases having a certain frequency-based statistic higher than a given threshold (see Manning & Schütze 1999 for an overview). Another common method is to compare the frequency of words and phrases in a specialised text to their frequency in general language texts, assuming that terms tend to appear more often in specialised texts than in general language texts.
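The comparison between specialised and general language frequencies can be sketched as a simple ratio of relative frequencies. The function name, the add-one smoothing and the toy counts below are assumptions of this sketch; the text does not prescribe a particular formula.

```python
from collections import Counter

def relative_frequency_ratio(word: str,
                             special_counts: Counter,
                             general_counts: Counter) -> float:
    """Ratio of a word's relative frequency in a specialised corpus to its
    relative frequency in a general corpus (higher = more term-like)."""
    special_total = sum(special_counts.values())
    general_total = sum(general_counts.values())
    special_rel = special_counts[word] / special_total
    # Add-one smoothing (an assumption of this sketch) avoids division by
    # zero for words that never occur in the general corpus.
    general_rel = (general_counts[word] + 1) / (general_total + 1)
    return special_rel / general_rel

# Toy counts, invented for illustration only.
special = Counter({"alignment": 5, "the": 5})
general = Counter({"the": 9})
print(relative_frequency_ratio("alignment", special, general))
print(relative_frequency_ratio("the", special, general))
```

In this toy setting the domain word scores higher than the function word, which is the behaviour the assumption in the text relies on.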

Different evaluation criteria exist for TETs, involving among others accuracy, as well as supported file formats and languages. The most frequently used criteria are noise and silence, as well as recall and precision. While noise refers to the ratio between discarded TCs and the accepted ones, silence refers to the number of terms not detected by a TET. Recall and precision are two measures frequently used in information retrieval (IR), the former being defined as the ratio between the number of correctly retrieved terms and the number of existing terms, the latter as the ratio between the number of correctly extracted terms and the number of proposed TCs (cf. Zielinski, 2002).
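The definitions of recall, precision and silence given above can be written down directly. This is a minimal sketch in which the sets of proposed TCs and of existing (reference) terms are hypothetical inputs.

```python
from typing import Set, Tuple

def precision_recall(proposed: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Precision = correct / proposed TCs; recall = correct / existing terms."""
    correct = proposed & reference
    precision = len(correct) / len(proposed) if proposed else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall

def silence(proposed: Set[str], reference: Set[str]) -> int:
    """Number of existing terms that were not detected."""
    return len(reference - proposed)

# Hypothetical extractor output and reference term list.
proposed_tcs = {"asylum seeker", "member state", "this paper", "third country"}
reference_terms = {"asylum seeker", "member state", "residence permit"}
print(precision_recall(proposed_tcs, reference_terms))
print(silence(proposed_tcs, reference_terms))
```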

TETs following a purely linguistic approach tend to produce too many irrelevant TCs (noise), whereas those following a purely statistical approach tend to miss TCs that appear with a low frequency value (silence, cf. Clematide, 2003). Linguistic-based TETs often provide better delimited TCs than statistical-based ones. However, the disadvantage of linguistically based TETs is that they are language-dependent and thus only available for major languages. Statistical TETs, on the other hand, can be used for lesser-used languages that lack computational resources, such as minority languages (cf. Streiter et al., 2003).

More recently, approaches to automatic TE and term recognition (TR) have moved towards using both statistical and linguistic information (Daille et al., 1994; Justeson & Katz, 1996; Frantzi, 1998). Generally, the main part of the algorithm is the statistical part, but shallow linguistic information is incorporated in the form of a syntactic filter which only permits phrases having certain syntactic structures to be considered as candidate terms.

2.2 Bilingual terminology extraction

Most of what has been discussed so far applies to monolingual TE and TR. Lately, research has evolved towards the automatic extraction of bilingual terms. This process involves automatically capturing bilingual terminology from existing technical texts and their translations (parallel corpora), validating the candidate term pairs generated, and generating terminological records in an automatic or semi-automatic manner. Several works have focused on the extraction of knowledge from bilingual corpora. All of them address the problem of aligning units across languages. Although very successful methods have been designed to align paragraphs and sentences in two different languages, aligning units smaller than a sentence remains a real challenge.


Thus, Gaussier (1998) relies on corpora aligned at the sentence level. Association probabilities between single words are calculated on the basis of bilingual co-occurrences of words in aligned sentences. These probabilities are then used to find the French equivalents of English terms through a flow network model. Hull (1998) differs from Gaussier (1998) in that single-word alignment, term extraction and term alignment are three independent modules. Terms and words are aligned through an algorithm that scores the candidate bilingual pairs according to probabilistic data, chooses the highest scored pair, removes it from the pool, and repeatedly recomputes the scores and removes pairs until all the pairs are chosen. Further improvements on Gaussier’s first model can be found in Gaussier et al. (2000) and Dejean et al. (2003).

Chambers (2000) describes a project launched in 1999 whose main aims include the automatic capture of bilingual terminology from parallel corpora, the manual validation of bilingual term pairs and the automatic generation of terminological records. The whole process has three major operations: monolingual extraction in the source text, monolingual extraction in the target text and bilingual matching to produce candidate term pairs.

Many methods have been proposed for extracting translation pairs from bilingual corpora, but most are based on word frequency and are, therefore, not effective in extracting low-frequency pairs. Word-frequency-based methods are language-pair-independent. Examples include Melamed (2000) and Hiemstra (1997). While popular and well-known translation pairs may already be included in existing bilingual dictionaries, newly coined and minor translation pairs are not well covered in available resources. In order to tackle this problem, Tsuji & Kageura (2004) present a method for extracting low-frequency translation pairs from Japanese-English bilingual corpora. Their method uses transliteration patterns that are observed in actual loan-word pairs, thus incorporating language-pair-dependent knowledge.

More recently (Ha et al. 2008), the use of automatic term alignment was proposed to help propagate the strengths of terminology extraction from one language into another. The availability of parallel corpora aligned at sentence level makes the alignment process more accurate, and thus makes this possible. The overall process of the mutual bilingual terminology extraction methodology can be described as follows: first, a list of term candidates is extracted for the first language; then term candidates from the second language are aligned to this list. If a term candidate in the second language is aligned to a term candidate in the first language, its term score is increased, and the candidate is promoted. This process can be repeated many times. In that study, as no suitable mathematical framework was employed, different settings had to be tried experimentally in order to choose the best ones. To overcome this weakness, we propose in this paper several probabilistic models which can be used to propagate the term score of a noun phrase in the source language to its aligned noun phrase in the target language.

3 Mutual bilingual terminology extraction

3.1 Three probabilistic models

Let P(Ns) be the probability that the noun phrase Ns in the source language is a term, P(Nt) the probability that the noun phrase Nt in the target language is a term, and P(Nt=Ns) the probability that the noun phrase Ns is translated into the noun phrase Nt. Let Pm(Ns) and Pm(Nt) be the probabilities that Ns and Nt, respectively, are terms in their monolingual contexts.

We use the name “model 0” to refer to automatic terminology extraction in the monolingual context. In model 0, P(Ns)=Pm(Ns) and P(Nt)=Pm(Nt).

In model 1, we assume that the probability that the noun phrase Nt in the target language is a term depends only on whether it is a translation of a term in the source language, or in other words: P(Nt)=P(Ns is a term and Ns is translated into Nt). Treating “Ns is a term” and “Ns is translated into Nt” as two independent events, P(Nt) is calculated as:

$$P(N_t) = P(N_s)\,P(N_t = N_s) = P_m(N_s)\,P(N_t = N_s) \tag{1}$$

This model is similar to the approach suggested by Gaussier (1998). This approach assumes that the target language plays only a passive role in terminology extraction.

In the next model (model 2), we assume that the probability that the noun phrase Nt in the target language is a term depends not only on whether it is a translation of a term in the source language, but also on whether it is a term in the target language in the context of monolingual terminology processing. In this model, P(Nt) = P(Nt is a term in the target language or [Ns is a term in the source language and Ns is translated into Nt]). Treating [Nt is a term in the target language] and [Ns is a term in the source language and Ns is translated into Nt] as two overlapping but independent events, the probability of their union is calculated as

$$P(N_t) = P_m(N_t) + P(N_s)\,P(N_t = N_s) - P_m(N_t)\,P(N_s)\,P(N_t = N_s) \tag{2}$$

in which Pm(Nt) is the probability that Nt is a term in the target language in a monolingual context.
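Models 1 and 2 are simple enough to be written as two small functions. The following is a minimal sketch; the function and parameter names are chosen for this example, and in model 2 the source-side probability P(Ns) is taken to be Pm(Ns), as in models 0 and 1.

```python
def model1(pm_s: float, p_align: float) -> float:
    """Equation (1): P(Nt) = Pm(Ns) * P(Nt=Ns)."""
    return pm_s * p_align

def model2(pm_t: float, pm_s: float, p_align: float) -> float:
    """Equation (2): probability that Nt is a term monolingually OR is the
    translation of a source term, treating the two events as independent."""
    via_source = pm_s * p_align
    return pm_t + via_source - pm_t * via_source

# Hypothetical input probabilities, for illustration only.
print(model1(pm_s=0.8, p_align=0.5))
print(model2(pm_t=0.3, pm_s=0.8, p_align=0.5))
```

Because equation (2) is the probability of a union, model 2 never returns less than the monolingual estimate Pm(Nt), whereas model 1 ignores Pm(Nt) altogether.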

In the third model (model 3), we propose that the probability that a noun phrase Ns in the source language is a term is also affected by the probability that its translation is a term in the target language. In this way, P(Ns) in (2) should be calculated as

$$P(N_s) = P_m(N_s) + P(N_t = N_s)\,P(N_t) - P_m(N_s)\,P(N_t = N_s)\,P(N_t) \tag{3}$$

Formula (3) is recursive, as P(Nt) is itself calculated using P(Ns). As a result, (3) should be rewritten as:

$$\begin{aligned}
P_0(N_s) &= P_m(N_s) \\
P_0(N_t) &= P_m(N_t) \\
P_{n+1}(N_s) &= P_n(N_s) + P(N_t = N_s)\,P_n(N_t) - P_n(N_s)\,P(N_t = N_s)\,P_n(N_t) \\
P_{n+1}(N_t) &= P_n(N_t) + P(N_t = N_s)\,P_n(N_s) - P_n(N_t)\,P(N_t = N_s)\,P_n(N_s)
\end{aligned}$$
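The recursion above can be sketched as a short loop that updates both probabilities simultaneously. The stopping tolerance `tol` and the iteration cap `max_iter` are assumptions of this sketch, since the text does not specify a convergence criterion; the function and parameter names are likewise illustrative.

```python
from typing import Tuple

def model3(pm_s: float, pm_t: float, p_align: float,
           tol: float = 1e-6, max_iter: int = 100) -> Tuple[float, float]:
    """Iterate the model 3 updates starting from P0(Ns)=Pm(Ns), P0(Nt)=Pm(Nt).

    Returns the pair (P(Ns), P(Nt)) after the last iteration performed.
    """
    p_s, p_t = pm_s, pm_t
    for _ in range(max_iter):
        # Both updates use the values from the previous iteration.
        new_s = p_s + p_align * p_t - p_s * p_align * p_t
        new_t = p_t + p_align * p_s - p_t * p_align * p_s
        converged = abs(new_s - p_s) < tol and abs(new_t - p_t) < tol
        p_s, p_t = new_s, new_t
        if converged:
            break
    return p_s, p_t

# One update step with hypothetical inputs.
print(model3(pm_s=0.5, pm_t=0.4, p_align=0.5, max_iter=1))
```

For inputs in [0, 1], each update as written can only increase the two values, so in this sketch the tolerance and the iteration cap determine where the iteration stops.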

The calculation should be repeated until it converges.

3.2 Calculating component probabilities

In the previous section, we proposed three probabilistic models to calculate the probability that a noun phrase Nt is a term, given the probability that it is the translation of a noun phrase Ns and the probability that Ns is a term. The next step is to determine how the probability that Ns is a term can be calculated in the monolingual context. As discussed in Section 2, statistical measures have been derived to calculate the “termhood” of a term candidate. Although the value of these termhood functions is related to the probability that a noun phrase is a term (i.e. the higher the value, the more likely it is a term), the actual probability is rarely calculated explicitly. To calculate these probabilities from a known termhood function, we use linear regression, as described below.

Let F(N) be a termhood function of the noun phrase N, C(F(N)) the number of noun phrases Ni with F(Ni)>=F(N), and T(F(N)) the number of confirmed terms Ti with F(Ti)>=F(N). The probability that a noun phrase N is a term in a monolingual context can then be estimated as:

$$P_m(N) = \frac{T(F(N))}{C(F(N))}$$
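The ratio T(F(N))/C(F(N)) can be computed directly from a list of termhood scores and a parallel list of flags saying whether each noun phrase is a confirmed term. The following is a minimal, quadratic-time sketch; the names and the toy data are illustrative assumptions.

```python
from typing import List

def empirical_pm(scores: List[float], is_term: List[bool]) -> List[float]:
    """For each noun phrase N, return T(F(N)) / C(F(N)).

    C counts noun phrases whose score is >= F(N); T counts the confirmed
    terms among them.
    """
    result = []
    for f in scores:
        c = sum(1 for g in scores if g >= f)
        t = sum(1 for g, ok in zip(scores, is_term) if ok and g >= f)
        result.append(t / c)  # c >= 1 because N itself is always counted
    return result

# Toy termhood scores and confirmation flags, invented for illustration.
print(empirical_pm([3.0, 2.0, 1.0], [True, True, False]))
```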

Our task is to find a function G(F(N)) that can be used as a good estimate of T(F(N))/C(F(N)). To find such a function, T(F(N))/C(F(N)) can be plotted against F(N). Figure 1 shows the relation between F(N) and T(F(N))/C(F(N)) (i.e. Pm(N)) when F(N) is calculated as the log of the frequency of N, for 400 English noun phrases found in a parallel corpus of English and Spanish law (see Section 4). The graph indicates that T(F(N))/C(F(N)) and log(Fre(N)) appear to have a linear relation whose coefficients can be estimated using linear regression (assuming the relationship is y=ax+b, in this case a=0.384 and b=-0.064).
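Estimating the coefficients of y=ax+b and then using the fitted line as G(F(N)) can be sketched with ordinary least squares. Clipping the prediction to the interval [0, 1] is an assumption of this sketch, added so that the result can be read as a probability; the text does not mention it.

```python
from typing import List, Tuple

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def pm_from_score(score: float, a: float, b: float) -> float:
    """Estimate Pm(N) from a termhood score using the fitted line.

    The clipping to [0, 1] is an assumption of this sketch.
    """
    return min(1.0, max(0.0, a * score + b))

# Using the coefficients reported in the text for English (a=0.384, b=-0.064).
print(pm_from_score(2.0, a=0.384, b=-0.064))
```

With the reported coefficients, a log-frequency score of 2.0 gives an estimate of about 0.70, which falls within the range shown on the vertical axis of Figure 1.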

The use of linear regression also has another benefit: the standard errors from the linear regression can also be used for estimating the predictive power of termhood functions. A small standard error indicates that the termhood function is a good indicator of the probability that a noun phrase is a term, and vice versa.

We have experimented with several termhood functions, and frequency proved to remain a very good termhood function (i.e. it produces the smallest standard error when linear regression is used).

Having established how to calculate Pm(N), we now turn to calculating the probability that the noun phrase Ns is translated into the noun phrase Nt. Using the sentence-aligned parallel corpus, we can use contingency tables to estimate this probability by employing a log likelihood calculation (Manning and Schütze 1999).
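A log likelihood calculation over a 2x2 contingency table can be sketched as follows. The cell layout (k11 = aligned sentence pairs containing both Ns and Nt, k12 = Ns only, k21 = Nt only, k22 = neither) is an assumption of this sketch, and the step that turns the statistic into the probability P(Nt=Ns) is not specified in the text, so it is not shown here.

```python
import math

def log_likelihood_ratio(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 log-likelihood statistic for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    row = [k11 + k12, k21 + k22]
    col = [k11 + k21, k12 + k22]
    observed = [[k11, k12], [k21, k22]]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row[i] * col[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

# Hypothetical counts: no association vs. perfect co-occurrence.
print(log_likelihood_ratio(10, 10, 10, 10))
print(log_likelihood_ratio(20, 0, 0, 20))
```

A value near zero means the two noun phrases co-occur about as often as chance would predict; larger values indicate a stronger association.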

[Plot: “Relation between Log(Frequency(N)) and Pm(N) for English”; x-axis: Log(Frequency(N)), from 1 to 2.5; y-axis: Pm(N), from 0.3 to 1]

Figure 1: Relation between a termhood function and Pm(N)



4 The experiment

We compiled a parallel corpus in the domain of EU immigration law in English and Spanish. The corpus contains 4390 segments, 121534 English words and 136585 Spanish words. We use InterActive Terminology for Europe (IATE) as an authoritative source to confirm whether or not a noun phrase is a term in the domain. This confirmation is done for both English and Spanish.

In order to evaluate the three models, we calculate the standard errors between the probability predicted by each model and the maximum likelihood probability that a noun phrase whose predicted probability is greater than or equal to the current one is a term. This may seem an unusual way to evaluate the performance of automatic terminology extraction, but given that our main objective is to evaluate our probabilistic models, it is an appropriate choice: the smaller the standard error, the better the model is at predicting the probability that a noun phrase is a term.
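One possible reading of this evaluation measure is sketched below. For each noun phrase, the maximum likelihood probability is taken as the proportion of confirmed terms among all noun phrases whose predicted probability is at least as high, and the standard error is computed as the root mean squared difference between the two. Both the root-mean-square form and the names used are assumptions of this sketch; the text does not give the exact formula.

```python
import math
from typing import List

def standard_error(predicted: List[float], is_term: List[bool]) -> float:
    """Root mean squared difference between predicted probabilities and the
    proportion of confirmed terms at or above each predicted value."""
    squared = 0.0
    for p in predicted:
        c = sum(1 for q in predicted if q >= p)
        t = sum(1 for q, ok in zip(predicted, is_term) if ok and q >= p)
        squared += (p - t / c) ** 2
    return math.sqrt(squared / len(predicted))

# Toy predictions and confirmation flags, invented for illustration.
print(standard_error([1.0, 0.5], [True, False]))
print(standard_error([0.9, 0.5], [True, False]))
```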

Table 1 shows the standard errors calculated as described above using the three proposed models (see Section 3), in which Log(Frequency) is used to estimate the initial probability that a noun phrase is a term in the monolingual context. The results indicate that, out of the three models, model 1 yields the largest standard errors, whereas model 3 is slightly better than model 2. This confirms our mathematical prediction.

The results also indicate that weaknesses in term extraction in one language can be overcome by employing a corpus aligned at sentence level. In our case, the part-of-speech sequence pattern used for Spanish is not as good as the pattern used for English, resulting in a higher standard error in model 0 (monolingual terminology extraction) for Spanish. When mutual bilingual terminology extraction is applied, the standard errors for Spanish are reduced to values much closer to those for English.

| Model | Spanish | English |
|---|---|---|
| Model 0 | 0.056 | 0.026 |
| Model 1 | 0.053 | 0.044 |
| Model 2 | 0.04 | 0.024 |
| Model 3 | 0.035 | 0.022 |

Table 1: Standard errors between predicted probability and maximum likelihood probability

In order to show that our probabilistic models also work with other types of termhood function, we tried another combination in which F(Ns)=Log(Fre(Ns))*Length(Ns) (where Length(Ns) is the number of words in Ns). F(Nt) is still Log(Fre(Nt)). The results are shown in Table 2.

These results also indicate that, out of the three models, model 3 gives the most accurate prediction of the probability that a noun phrase is a term. Nevertheless, they show that our models can propagate weaknesses as well as strengths: the use of a less accurate termhood function in English can result in higher standard errors in Spanish. We also performed other experiments using different combinations of F(Ns) and F(Nt). None of these experiments yielded better results (in terms of standard errors) than those given in Table 1.

| Model | Spanish | English |
|---|---|---|
| Model 0 | 0.056 | 0.044 |
| Model 1 | 0.065 | 0.054 |
| Model 2 | 0.059 | 0.042 |
| Model 3 | 0.057 | 0.04 |

Table 2: Standard errors when F(Ns)=Log(Fre(Ns))*Length(Ns)

5 Conclusions and future directions

In this paper, we propose three probabilistic models to incorporate alignment scores in automatic term extraction. The proposed probabilistic models have advantages over the Ha et al. (2008) approach in that they are built on a sound mathematical basis, and the remaining problem shifts to calculating the probability that a noun phrase is a term in the monolingual context, rather than performing different experiments to find an optimised way to normalise and incorporate different termhood functions. With this approach, any termhood function can be used, provided it can be converted into a probabilistic function predicting the probability that a noun phrase is a term.

In the future, we will explore different ways to calculate the alignment probability, and propose new models to account for the fact that a term in the source language may have multiple translations.

References

Ananiadou, S. 1994. A methodology for Automatic Term Recognition. In Proceedings of the 15th International Conference on Computational Linguistics (COLING94), pp. 1034-1038. Kyoto, Japan.

Bourigault, D., C. Jacquemin, and M. C L'Homme (ed.) 2001. Recent Advances in Computational Terminology. Amsterdam: John Benjamins Publishing Company.

Chambers, D. 2000. Automatic Bilingual Terminology Extraction: A Practical Approach. In Proceedings of Translating and the Computer 22, Aslib/IMI.

Daille, B., E. Gaussier, and J.-M. Lange. 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of COLING 1994.

Dejean, H., E. Gaussier, C. Goutte, and K. Yamada. 2003. Reducing parameter space for word alignment. In Proceedings of HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, Alberta.

Frantzi, K. T. 1998. Automatic Recognition of Multi-Word Terms. PhD Thesis. Manchester Metropolitan University, UK.

Gaussier, E. 1998. Flow Network Models for Word Alignment and Terminology Extraction from Bilingual Corpora. In Proceedings of Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pp. 444--450. San Francisco, California.

Gaussier, E., D. Hull, and S. At-Mokthar. 2000. Term alignment in use: Machine-aided human translation. In J. Veronis (ed.). Parallel text processing: Alignment and use of translation corpora, pp. 253--274. Dordrecht: Kluwer Academic Publishers.

Ha, L. A., G. Fernandez, R. Mitkov, and G. Corpas. 2008. Mutual bilingual terminology extraction. To appear in LREC 2008.

Hiemstra, D. 1997. Deriving a bilingual lexicon for cross language information retrieval. In Proceedings of Gronics 1997, pp. 21-26.

Hull, D. 1998. A practical approach to terminology alignment. In Proceedings of CompuTerm 1998, pp. 1-7.

Justeson, J. S., and S. L. Katz. 1996. Technical Terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 3(2): 259-289.

Manning, C. D., and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Melamed, I. D. 2000. Models of translational equivalence among words. Computational Linguistics 26(2): 221-249.

Pearson, J. 1998. Terms in context. Amsterdam: John Benjamins.

Sauron, V. A. 2002. Tearing out the terms: evaluating terms extractors. In Proceedings of Translating and the Computer 24, London, Britain.

Streiter, O., D. Zielinski, I. Ties, and L. Voltmer. 2003. Term extraction for Ladin: An example-based approach. In Proceedings of TANL 2003 Workshop on Natural Language Processing of Minority Languages with few computational linguistic resources, Batz-sur la Mer.

Tsuji, K., and K. Kageura. 2004. Extracting low-frequency translation pairs from Japanese-English bilingual corpora. In Proceedings of CompuTerm 2004, pp. 23-30.

Zielinski, D., and Y. R. Safar. 2005. t-survey 2005: An Online Survey on Terminology Extraction and Terminology Management. In Proceedings of Translating and the Computer 27, London, Britain.



Comparing languages from vocabulary growth to inflection paradigms: A study run on parallel corpora and multilingual lexicons

Comparando lenguas desde el léxico a paradigmas de flexión: un estudio sobre corpus paralelo y léxicos multilingües

Helena Blancafort (1,2), Claude de Loupy (1,3)

(1) Syllabs, 2 rue de Fontarabie, 75020 Paris, France, {blancafort,loupy}@syllabs.com

(2) Universitat Pompeu Fabra, La Rambla, 30-32, 08002 Barcelona, España

(3) Laboratoire Modyco, Université de Paris 10, 200 av. de la République, 92001 Nanterre, France

Abstract: In this paper we report on a comparative study of corpora and lexicons that compares the difficulties posed by five languages (English, German, Spanish, French and Italian) for morphosyntactic analysis and the development of lexicographic resources. Experiments were conducted on two different sets of multilingual parallel corpora and two different morphosyntactic lexicons per language. We measure and compare statistics on dynamic and static coverage, and on form-lemma and morphosyntactic ambiguities in the lexicon and the corpus. In addition, we use the lexicons to automatically generate inflection paradigms and calculate how many inflection paradigms are needed per language. The results show the difficulty of working with multilingual resources and parallel corpora, and offer some surprising quantitative results on differences between languages.

Keywords: computational lexicography, morphosyntactic lexicons, computational morphology, inflection, multilingual parallel corpora, comparison of languages for NLP.

Resumen: En este artículo presentamos un estudio comparativo de corpus y de léxicos con el objetivo de comparar las dificultades que representan cinco lenguas (inglés, alemán, español, francés e italiano) para el análisis morfosintáctico y el desarrollo de recursos lexicográficos. Para ello hemos llevado a cabo varios experimentos utilizando dos corpus paralelos multilingües y dos léxicos morfosintácticos por lengua. Primero comparamos los resultados cuantitativos respecto a la cobertura dinámica y estática, y las ambigüedades morfosintácticas de los léxicos y corpus. Además, a partir de los léxicos hemos generado paradigmas de flexión para calcular cuántos son necesarios en cada lengua. Los resultados muestran la dificultad de trabajar con recursos multilingües y corpus paralelos. También ofrecen resultados cuantitativos sorprendentes respecto a las diferencias entre lenguas. Palabras clave: lexicografía computacional, léxicos morfosintácticos, morfología computacional, flexión, corpus paralelos multilingües, comparación de lenguas para el PNL.

1 Introduction

In recent years the amount of multilingual data on the Web has been growing in leaps and bounds. As multilingual processing gains in importance, it is becoming urgent, for NLP purposes, to better understand the differences between languages. In this article we present current work on how to compare the difficulties of five languages (English, German, Spanish, French and Italian) for morphosyntactic analysis and the development of lexicographic resources.

It is known, e.g., that Latin languages have a richer verbal inflection than English and that German has a richer nominal inflection. Traditional morphological typological studies already describe several linguistic phenomena for the comparison of languages, but do not provide any quantitative information about them.

In this paper we present a comparative study of corpora and lexicons conducted on two sets of multilingual parallel corpora, the JRC-Acquis (Steinberger et al., 2006) and the Bible (Resnik et al., 1999), using two different morphosyntactic lexicons per language: MulText (Ide and Véronis, 1994) for each language, plus FreeLing (Atserias et al., 2006) for English, Spanish and Italian, Lefff (Sagot et al., 2006) for French, and Morphy (Lezius, 2000) for German. We measure and compare statistics on dynamic and static coverage, and on form-lemma and morphosyntactic ambiguities in the lexicon and the corpus. In addition, we calculate how many inflection paradigms are needed to handle the inflection of open class categories in each lexicon.

The paper is organized as follows: first, we give a short overview of the state of the art; in section three we describe the resources we used. Next, we report on vocabulary growth and coverage comparison. In section five we tackle the issues of morphosyntactic complexity, ambiguity and also compare inflection paradigms. Finally, we draw conclusions and discuss further work.

2 State of the art

2.1 Comparing languages

Traditional typology distinguishes four types of languages: isolating, agglutinative, inflectional and polysynthetic. As observed by Trost (2003), this classification is quite artificial and real languages rarely fall into one of those classes: Chinese, e.g., is an isolating language but does have some suffixes. Pirkola (2001) expresses the need for a language typology for IR and suggests using the index of synthesis and fusion (Comrie, 1989; Whaley, 1997) to measure morphological phenomena. Furthermore, he suggests finer-grained indexes and semantic analysis. He claims that by combining these variables it would be possible to predict the performance of morphological processing and hence, the difficulties that a given language represents for IR.

2.2 Induction of morphological rules

Lexicographic resources are needed for basic morphosyntactic analysis such as lemmatization. The difficulty of developing them and the time it requires depend on the characteristics of a language. Latin languages, e.g., are expected to take longer to encode than English because of their verbal inflection paradigms. Hence, it is quite common to develop an inflection engine using hand-encoded inflection rules. This can be a time-consuming task for languages with rich inflection. In the case of Spanish, e.g., a verb paradigm can contain more than 40 forms. Besides, the number of inflected forms per lemma is irregular, which implies that two verbs will not always have the same number of forms. Furthermore, there can be variants for the same inflectional form (e.g., two different participles such as imprimido and impreso in Spanish, and orthographic variants such as the French verb forms essaie and essaye).

More recently, some work has been carried out on the automatic induction of morphology. Schone & Jurafsky (2001) designed an algorithm for inducing inflection rules in German, English and Dutch from a corpus without any human intervention. As far as we are aware, they have obtained the best results for a knowledge-free algorithm. Clément et al. (2004) present work carried out to build a French lexicon from a large corpus using morphological information. They apply a manually developed verbal inflection engine that follows the inflection patterns for open classes described in French grammars. We are not aware of any studies concerning the induction of inflection rules directly from a morphosyntactic lexicon. This is what we have carried out for the quantitative comparison of inflection paradigms (section 5.2).

3 Description of the resources: lexica and corpora

3.1 Description of the lexica

To minimize the bias introduced by the lexicons, we used two different lexicons per language, the MulText and the FreeLing lexicons (v2.0). As FreeLing is not available for French and German, we took other large-coverage lexicons: the Lefff for French and Morphy for German.


One of the main goals of MulText was to develop monolingual and multilingual linguistic resources and to ensure the comparability and harmonization of tagsets in several European languages. Linguistic information is coded in a simple form-lemma-tag format. The tags are common to all languages. Defining a tagset for all languages is not a straightforward task, as the tagsets are intrinsically incomparable because of the specific features of each language. Indeed, some tags are language specific. When this is the case, the attribute is marked with “-”, as for Latin languages, which have no case attribute.

Despite the great effort put into harmonizing the multilingual tagsets and lexical resources, the MulText lexicons contain some inconsistencies that obliged us to modify each lexicon to some extent. An example is the treatment of epicene nouns and adjectives such as Spanish periodista, Italian giornalista and French journaliste, which are not encoded in the same way, as shown in Figure 1.

Language  Form          Lemma        Tag
FR        journaliste   =            Ncms--
FR        journaliste   journaliste  Ncfs--
FR        journalistes  journaliste  Ncfp--
FR        journalistes  journaliste  Ncmp--
ES        periodista    periodista   Nc.s-
ES        periodistas   periodista   Nc.p-
IT        giornalista   giornalista  Nccs-
IT        giornaliste   giornalista  Ncfp-
IT        giornalisti   giornalista  Ncmp-

Figure 1: Epicene nouns in MulText

We corrected some of these inconsistencies and errors, as they had a negative effect on the statistics computed on the lexicons, especially with respect to the inflection paradigms. Some inflected forms, for example, were missing and produced incomplete paradigms.

As for FreeLing, it is an open-source library providing multilingual NLP services such as lemmatization and PoS tagging. The English dictionary was automatically extracted from WSJ with minimal manual post-editing and tends to be a little noisy. The Spanish dictionary is hand coded, whereas the Italian dictionary has been extracted from Morph-it!. Morphy is freely available software for morphological analysis and PoS tagging of German. Lefff 2.1 is a freely available wide-coverage morphosyntactic and syntactic lexicon for French.

The number of lemmas and lexicon entries is given in Table 1. We removed the entries containing proper names to avoid the bias introduced by this type of entry.

3.2 Description of the corpora set

For our study we have used parallel corpora. Unfortunately, multilingual parallel corpora are hard to come by. As lexical studies on corpora are always biased by the type of discourse represented in the corpus, we used two different sets: the JRC-Acquis v.3.0 and the aligned Bible. The XML-encoded JRC-Acquis is a freely available parallel corpus containing EU documents of a mostly legal nature in more than 20 languages. Unfortunately, its monolingual documents include sentences or paragraphs in one or more other languages that are not always marked up and therefore cannot be removed automatically. This drastically decreases coverage.

As we can see in Table 2, German is the language with the smallest number of word occurrences (tokens) in both corpora and, in the JRC corpus, with the highest number of different words (types). As we are working with parallel corpora, this indicates that German uses fewer words to express the same content, while Spanish and French in the JRC corpus and French in the Bible corpus are the languages that use the most words. This is not surprising, since German has very productive compounding that enables the creation of new words.

A curious fact is that English is the language with the smallest proportion of types, indicating that the vocabulary used is less varied than in the other languages. Italian shows a more varied vocabulary, especially in the Bible corpus.

4 Comparing vocabulary size and coverage

4.1 Comparing vocabulary growth

Vocabulary growth is an indicator of how difficult it is to build an appropriate lexicon for a given language. Table 3 gives the number of words (forms) needed to reach a certain static coverage. Static coverage is the percentage of tokens in the corpus covered by the lexicon, while dynamic coverage is the percentage of types covered (Mérialdo, 1988). Cells in grey indicate the largest number of words needed to reach the given coverage, while cells in italics indicate the smallest number of required words.

Comparing languages from vocabulary growth to inflection paradigms


We can see that German has an extensive vocabulary. English needs the smallest vocabulary once the coverage exceeds 70%. The vocabulary needed can be twice as large in one language as in another.
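As a rough illustration of how figures like those in Table 3 can be obtained, the following Python sketch ranks the forms of a tokenized corpus by frequency and counts how many of the most frequent forms are needed to reach a target static coverage. The function name forms_needed and the toy token list are assumptions made for this example; they are not the tools used in the study.

```python
from collections import Counter


def forms_needed(tokens, target):
    """Return how many of the most frequent forms cover `target` of the tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk through the forms from most to least frequent.
    for rank, (_form, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


# Toy corpus (hypothetical): 9 tokens, 6 types.
tokens = "the cat saw the dog and the cat ran".split()
needed_50 = forms_needed(tokens, 0.5)   # "the" and "cat" cover 5 of 9 tokens
needed_100 = forms_needed(tokens, 1.0)  # every type is needed
```

Running the same function with targets from 0.6 to 1.0 on each corpus would give one row of the table per target, under the assumption that forms are compared as plain strings.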

4.2 Comparing coverage

In this section we present the dynamic and static coverage of each lexicon described in 3.1 on the corpora described in 3.2. As we can see in Table 4, results on the Bible are better than on the JRC corpus, both because the JRC corpus represents a rather technical discourse and because of the noise reported in 3.2.

A good coverage seems to be more difficult to achieve in German than in the other languages. On the Bible corpus, the German MulText and Morphy lexicons reach a dynamic coverage of 0.35 and 0.59 and a static coverage of 0.83 and 0.89, while the highest dynamic and static coverage is achieved in French for both lexicons (0.82 and 0.83 dynamic coverage and 0.96 and 0.95 static coverage). Italian is the Latin language with the weakest coverage: while the Spanish FreeLing and the French Lefff achieve a dynamic coverage of 0.78 and 0.83, the Italian FreeLing reaches only 0.70. Its static coverage is also lower than for the other Latin languages (0.91 in the Bible corpus against 0.94 for Spanish and 0.95 for French). The question arises as to whether this difference is due to the quality of the lexicon or to the language itself.
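The two coverage measures can be sketched in a few lines of Python. This is a minimal illustration that assumes the lexicon is reduced to a set of known forms and that tokens are matched as exact strings; the function name coverage and the toy data are hypothetical. As a sanity check against Table 4, the German MulText figures for the Bible give 9111 / (9111 + 17238) ≈ 0.35 for dynamic coverage.

```python
from collections import Counter


def coverage(tokens, lexicon_forms):
    """Return (static, dynamic) coverage of a set of lexicon forms on a corpus."""
    counts = Counter(tokens)
    known = [form for form in counts if form in lexicon_forms]
    # Dynamic coverage: share of the corpus types found in the lexicon.
    dynamic = len(known) / len(counts)
    # Static coverage: share of the corpus tokens found in the lexicon.
    static = sum(counts[form] for form in known) / sum(counts.values())
    return static, dynamic


# Toy data (hypothetical): 6 tokens, 4 types, one unknown type.
corpus = ["la", "casa", "la", "puerta", "la", "xyz"]
lexicon_forms = {"la", "casa", "puerta"}
static, dynamic = coverage(corpus, lexicon_forms)
```

Because frequent words weigh more in the token count, static coverage is normally higher than dynamic coverage, which matches the pattern in Table 4.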

4.3 Comparing statistics on lexicons with the same coverage

In order to compare the lexicons on the same basis, we extracted from the original lexicons the sub-lexicons needed to cover 60% of the tokens in the JRC corpus and 70% in the Bible corpus. To build them, we first extracted all the lemmas that could be associated with a given form (in French, portes is associated with the noun porte and the verb porter). We then automatically derived all the inflected forms associated with these lemmas, so that each extracted lemma appears with its complete set of inflections.
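The two extraction steps described above can be sketched as follows. The sketch assumes a lexicon stored as a list of (form, lemma, tag) triples whose tag starts with the PoS letter; the function name reduced_lexicon, the shortened tags and the toy entries are hypothetical.

```python
def reduced_lexicon(lexicon, covering_forms):
    """Keep every entry whose lemma can be associated with a covering form."""
    # Step 1: all (lemma, PoS) pairs that can be associated with a covering form.
    lemmas = {
        (lemma, tag[0])
        for form, lemma, tag in lexicon
        if form in covering_forms
    }
    # Step 2: all the inflected forms of those lemmas.
    return [
        (form, lemma, tag)
        for form, lemma, tag in lexicon
        if (lemma, tag[0]) in lemmas
    ]


# Toy French lexicon (hypothetical entries and tags).
lexicon = [
    ("portes", "porte", "Ncfp"),
    ("porte", "porte", "Ncfs"),
    ("portes", "porter", "Vmip2s"),
    ("porte", "porter", "Vmip1s"),
    ("portons", "porter", "Vmip1p"),
    ("chat", "chat", "Ncms"),
    ("chats", "chat", "Ncmp"),
]
reduced = reduced_lexicon(lexicon, {"portes"})
```

With the single covering form portes, both the noun porte and the verb porter are selected, and all their forms (including portons) are kept, while chat is left out.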

Note that we have limited our study to open classes (without adverbs). Table 5 shows that in German more than twice as many lemmas are needed to achieve the same coverage as in Spanish. For a coverage of 70% of the Bible corpus, we needed 482 MulText lemmas and 478 Morphy lemmas in German, whereas in French we only needed 119 MulText lemmas and 202 Lefff lemmas. At the same time, the results indicate that German is the language with the largest number of open-class tags (393 in MulText, 175 in Morphy, JRC) and English the one with the smallest (36 in MulText, 12 in FreeLing, JRC). Latin languages do not all have the same number of tags, but they all have more than English and fewer than German. Surprisingly, Italian has many more tags than the other Latin languages.

5 Analysis of the morphosyntactic complexity

5.1 Comparing morphosyntactic ambiguity

Concerning ambiguity, Table 6 gives the average number of possible tags for a given form using MulText. This is evaluated considering both types and tokens. Moreover, both simple PoS (A, N or V) and complete tags (for instance Ncms-) are considered.

Spanish seems to be the least ambiguous language. The number of tags per form for German is very high due to the choices made when building the original lexicons, as explained below (5.2). Italian is shown to be the most ambiguous language regarding the number of PoS. Again the question arises as to whether this result is a consequence of the quality of the lexicons, especially since the Italian FreeLing has been built automatically.
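The ambiguity rates can be illustrated with a short Python sketch. It assumes that "in lexicon" means averaging over the distinct forms (types) and "in corpus" means averaging over the running words (tokens) that the lexicon knows, which is our reading of the description above; the function name ambiguity and the toy data are hypothetical.

```python
from collections import defaultdict


def ambiguity(lexicon, tokens, simple_pos=False):
    """Return the average number of tags per form, by type and by token."""
    tags_by_form = defaultdict(set)
    for form, _lemma, tag in lexicon:
        # Keep only the first character (A, N or V) when simple PoS is requested.
        tags_by_form[form].add(tag[0] if simple_pos else tag)
    per_type = sum(len(tags) for tags in tags_by_form.values()) / len(tags_by_form)
    # Only tokens known to the lexicon contribute to the corpus average.
    known = [tok for tok in tokens if tok in tags_by_form]
    per_token = sum(len(tags_by_form[tok]) for tok in known) / len(known)
    return per_type, per_token


# Toy French lexicon and corpus (hypothetical entries and tags).
lexicon = [
    ("portes", "porte", "Ncfp"),
    ("porte", "porte", "Ncfs"),
    ("portes", "porter", "Vmip2s"),
    ("porte", "porter", "Vmip1s"),
    ("portons", "porter", "Vmip1p"),
]
tokens = ["porte", "porte", "porte", "portons"]
per_type, per_token = ambiguity(lexicon, tokens)
```

In this toy case the corpus average (1.75) is higher than the lexicon average (about 1.67) because the ambiguous form porte is also the most frequent one.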

5.2 Comparing inflection paradigms

After generating the lexicons needed for a given coverage (see section 4.3), we automatically generated the inflection paradigms and the rules needed to inflect the lemmas. These rules were induced from the reduced lexicon. The idea is to obtain a lexicon made of lemmas, each paired with an inflection rule that can be applied to generate a form-lemma-tag lexicon.
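One simple way to induce such rules, given here only as a sketch and not as the procedure actually used in the study, is to take the longest common prefix of lemma and form as the stem and to record how many characters are deleted from the lemma and which ending is added. Lemmas sharing exactly the same set of rules then fall into the same paradigm. The function names, the tags and the toy Spanish entries are assumptions made for this example.

```python
import os
from collections import defaultdict


def induce_rule(lemma, form):
    """Return (characters deleted from the lemma, ending added to the stem)."""
    stem_len = len(os.path.commonprefix([lemma, form]))
    return len(lemma) - stem_len, form[stem_len:]


def apply_rule(lemma, rule):
    """Generate the inflected form from a lemma and a (deleted, ending) rule."""
    deleted, ending = rule
    return lemma[: len(lemma) - deleted] + ending


def induce_paradigms(lexicon):
    """Group lemmas that share exactly the same set of (deleted, ending, tag) rules."""
    rules_by_lemma = defaultdict(set)
    for form, lemma, tag in lexicon:
        rules_by_lemma[lemma].add(induce_rule(lemma, form) + (tag,))
    paradigms = defaultdict(list)
    for lemma, rules in rules_by_lemma.items():
        paradigms[frozenset(rules)].append(lemma)
    return paradigms


# Toy Spanish lexicon (hypothetical tags, two forms per verb).
lexicon = [
    ("canto", "cantar", "Vmip1s"), ("cantaba", "cantar", "Vmii1s"),
    ("amo", "amar", "Vmip1s"), ("amaba", "amar", "Vmii1s"),
    ("como", "comer", "Vmip1s"), ("comía", "comer", "Vmii1s"),
]
paradigms = induce_paradigms(lexicon)
```

Here cantar and amar end up in one paradigm and comer in another. The number of distinct endings, their average length and the average number of deleted characters can all be read off the induced rules.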

Table 7 presents the number of paradigms per language in the Bible corpus, the number of paradigms per PoS, the average number of inflections over all paradigms and the average number of inflections per paradigm for each PoS. We also report the number of endings that are added to the stem, their average length, and the average number of characters that are removed in the inflection process.

To our surprise, Spanish needs fewer paradigms than most other languages; depending on the corpus and lexicon used, it is on a par with English. We expected that only English would show a small number of paradigms. Regarding verbal inflection, although Spanish has fewer verbal paradigms than English (between 18 and 22 for Spanish in the Bible corpus against 31 to 32 for English), each paradigm generates a high number of forms: while English has fewer than 7 inflected forms per paradigm, Spanish has an average of between 65 (FreeLing) and 157 (MulText)! Again, the striking difference between the lexicons can be explained by the fact that the encoding philosophy diverges a lot from one project to another. Whereas FreeLing handles Spanish verbal cliticization with a special morphological analysis module, MulText includes in the lexicon all the inflected forms with clitics, such as bebiéndolo.

The same can be argued for the differences between the German MulText and Morphy lexicons. In Morphy, for example, all verbs with a separable particle are lemmatized to the verb without the particle: zurückgekommen is lemmatized to kommen instead of to its infinitive zurückkommen, as it is encoded in MulText. This explains the big differences in verbal paradigms. The number of verbal endings in Morphy, for instance, is about 30 times higher than in MulText!

Yet another interesting observation concerns the number of characters to be deleted in the inflection process. The average number in Spanish is lower than in French and Italian, but English is still the language with the lowest average.

6 Conclusions and perspectives

The different figures presented in this paper provide a great deal of information on languages and are sometimes quite surprising, such as the low number of inflection paradigms needed for Spanish.

But beyond the figures themselves, our results indicate how difficult it is to build harmonized multilingual lexicons, that is, to create lexicons according to a common tagset (even when language-specific attributes are foreseen). Even lexicons developed for the same purposes and within the same project, as in MulText and FreeLing, do not always fulfil this requirement.

The errors found in the lexicons are another problem for our study. Sometimes they are due to the use of automatic procedures, as was the case for the English FreeLing, which was generated from an automatically created lexicon, whereas for Spanish these tasks were handled by linguists.

In this paper we present a first approach to the automatic comparison of the difficulties that different languages pose for NLP applications. Lexicons are indeed of paramount importance for NLP. The development of these resources is a complex task, and it is interesting to find clues that help predict its degree of difficulty. The approach presented here makes use of existing lexicons and shows how encoding differences and errors make it hard to obtain reliable results. A more challenging method would be to predict the difficulty without previous knowledge. We plan to run further studies using tools that automatically induce morphology from corpora, such as Linguistica (Goldsmith, 2006), and to compare the results obtained with the ones presented here.

Moreover, as multilingual parallel corpora are too specific and difficult to come by, further work will be carried out using comparable corpora. Here again, the comparison between the results obtained with parallel and comparable corpora will enable us to determine whether it is possible to evaluate the difficulty of creating morphosyntactic lexicons without previous resources.

7 References

Atserias, J., Casas, B., Comelles, E., González, M., Padró, L., Padró, M., 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. Proceedings of the fifth international conference on Language Resources and Evaluation (LREC 2006), ELRA. Genoa, Italy. May, 2006.

Comrie, B., 1989. Language universals and linguistic typology. Chicago: The University of Chicago Press.

Goldsmith, J., 2006. An algorithm for the unsupervised learning of morphology. Natural Language Engineering 12. 1-19.

Ide, N., Véronis, J., 1994. MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 588-92.

Lezius, W., 2000. Morphy - German Morphology, Part-of-Speech Tagging and Applications. In Ulrich Heid, Stefan Evert, Egbert Lehmann and Christian Rohrer, editors, Proceedings of the 9th EURALEX International Congress, pp. 619-623. Stuttgart, Germany.

Mérialdo, B., 1988. Multilevel decoding for very-large-size-dictionary speech recognition, IBM Journal of Research and Development, v.32 n.2, p.227-237, March 1988.

Pirkola, A., 2001. "Morphological Typology of Languages for IR", Journal of Documentation, 57, 2001, 330-348.

Resnik, P., Broman Olsen, M., and Diab, M., 1999. The Bible as a parallel corpus: Annotating the “Book of 2000 Tongues.” Computers and the Humanities 33, 1–2 (1999) 363–379.

Sagot, B., Clément, L., Villemonte de la Clergerie, E., Boullier, P., 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. Proceedings of the Language Resources and Evaluation Conference, LREC'06, Genoa, Italy.

Schone, P., & Jurafsky, D., 2001. Knowledge-Free Induction of Inflectional Morphologies. Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01).

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D., 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy, 24-26 May 2006.

Trost, H., 2003. Computational Morphology. In: Ruslan Mitkov (editor), The Oxford Handbook of Computational Linguistics, pp. 25-47. Oxford University Press.

Whaley, L.J., 1997. Introduction to typology: the unity and diversity of language. Thousand Oaks - London - New Delhi: Sage Publications.



8 Annex: Tables

Lexicon   Language  Lemmas  Entries
MulText   EN        14,639  66,215
FreeLing  EN        40,219  67,213
MulText   ES        18,027  510,711
FreeLing  ES        76,201  668,816
MulText   IT        10,238  232,079
FreeLing  IT        40,277  437,399
MulText   DE        12,733  233,858
Morphy    DE        91,311  4,055,789
MulText   FR        28,627  306,795
Lefff     FR        56,917  472,582

Table 1: Number of lexicon entries

Corpus  Language  Types   Tokens
Bible   de        26,380  649,488
Bible   en        14,679  816,270
Bible   es        25,238  841,765
Bible   fr        21,385  929,211
Bible   it        30,498  855,329
JRC     de        58,800  1,458,661
JRC     en        45,079  1,524,011
JRC     es        50,202  1,634,317
JRC     fr        47,858  1,612,744
JRC     it        50,388  1,557,464

Table 2: Number of words (tokens) and types in the corpora

          Bible                                   JRC
Coverage  de      en      es      fr      it      de      en      es      fr      it
60%       202     106     108     102     186     368     253     217     238     321
70%       502     235     264     220     416     792     538     515     527     652
80%       1321    562     742     621     1117    2488    1257    1307    1291    1665
90%       4069    1606    2675    2274    3822    10631   5074    5738    5460    6774
99%       20847   8672    18054   13617   23230   50832   36365   40499   38345   41515
100%      26380   14679   25238   21385   30498   58800   45079   50202   47858   50388

Table 3: Vocabulary growth according to the static coverage

Lexicon           MulText   Morphy    MulText   FreeLing  MulText   FreeLing  MulText   Lefff     MulText   FreeLing
Language          de        de        en        en        es        es        fr        fr        it        it

JRC
Known types       9190      13154     9733      8862      13296     15574     13768     14538     11123     14285
Unknown types     49533     45569     35329     36199     36817     34538     33971     33201     39235     36073
Dynamic coverage  0.16      0.22      0.22      0.20      0.27      0.31      0.29      0.30      0.22      0.28
Static coverage   0.63      0.68      0.74      0.75      0.76      0.81      0.82      0.80      0.76      0.74

Bible
Known types       9111      15446     9747      8498      16267     19779     17515     17804     13582     21287
Unknown types     17238     10904     4932      6180      8969      5456      3824      3535      16914     9209
Dynamic coverage  0.35      0.59      0.66      0.58      0.64      0.78      0.82      0.83      0.45      0.70
Static coverage   0.83      0.89      0.93      0.90      0.90      0.94      0.96      0.95      0.84      0.91

Table 4: Coverage of the lexicons



Lexicon           MulText   Morphy    MulText   FreeLing  MulText   FreeLing  MulText   Lefff     MulText   FreeLing
Language          de        de        en        en        es        es        fr        fr        it        it

JRC
Lemmas (A, N, V)  1533      915       577       524       393       328       397       373       557       748
Tags (A, N, V)    393       175       36        12        131       194       176       112       123       277
Adjectives        484       204       109       126       81        41        82        48        106       165
Nouns             742       563       303       258       214       209       228       248       302       398

Bible
Lemmas (A, N, V)  482       478       229       280       129       205       119       202       510       350
Tags (A, N, V)    391       173       38        12        131       188       170       112       122       278
Adjectives        146       122       32        55        11        17        12        28        85        73
Nouns             225       253       112       136       80        136       82        138       272       190
Verbs             111       103       85        89        38        52        25        36        153       87

Table 5: Number of lexicon entries for a coverage of 60% for the JRC corpus and 70% for the Bible

Bible                            de    en    es    fr    it
Average tags by form in lexicon  5.80  1.44  1.23  1.58  1.61
Average tags by form in corpus   4.10  1.51  1.33  2.21  2.09
Average PoS by form in lexicon   1.32  1.29  1.09  1.11  1.26
Average PoS by form in corpus    1.52  1.33  1.24  1.42  1.67

JRC                              de    en    es    fr    it
Average tags by form in lexicon  6.90  1.35  1.29  1.54  1.73
Average tags by form in corpus   4.90  1.43  1.37  2.06  2.01
Average PoS by form in lexicon   1.28  1.26  1.15  1.15  1.37
Average PoS by form in corpus    1.40  1.34  1.26  1.42  1.64

Table 6: Grammatical ambiguity rates (MulText)

Bible                             MulText   Morphy    MulText   FreeLing  MulText   FreeLing  MulText   Lefff     MulText   FreeLing
                                  de        de        en        en        es        es        fr        fr        it        it
Total paradigms                   292       184       56        50        34        51        40        54        79        61
Paradigms (A)                     119       17        8         7         5         7         6         12        11        9
Paradigms (N)                     65        64        16        12        11        22        15        21        27        22
Paradigms (V)                     108       103       32        31        18        22        19        21        41        30
Inflections per paradigm          55.97     139.04    4.66      4.98      85.03     29.47     24.93     23.33     29.51     36.21
Inflections per paradigm (A)      104.22    152.12    2.75      2.00      5.20      2.14      4.67      4.75      5.73      3.56
Inflections per paradigm (N)      9.14      9.30      2.12      1.92      2.91      2.45      2.73      3.05      2.59      2.00
Inflections per paradigm (V)      30.98     217.50    6.41      6.84      157.39    65.18     48.84     54.24     53.61     71.10
Endings (A, N, V)                 1161      14258     54        47        704       416       274       342       370       355
Endings (A)                       752       102       6         2         4         4         5         13        14        9
Endings (N)                       50        58        7         2         10        9         8         16        10        7
Endings (V)                       483       14110     43        43        698       411       267       327       356       347
Average endings length (A, N, V)  4.59      8.99      1.28      1.21      6.12      3.88      3.36      3.35      2.99      3.08
Average endings length (A)        5.40      4.24      0.91      0.36      0.85      0.80      0.71      1.14      2.06      1.66
Average endings length (N)        1.40      1.54      0.85      0.22      1.28      1.07      0.95      1.16      0.64      0.43
Average endings length (V)        2.13      9.74      1.40      1.38      6.22      4.02      3.54      3.58      3.10      3.16
Average deleted char (A, N, V)    -2.33     -5.05     -0.46     -0.50     -1.66     -1.78     -2.41     -2.53     -2.14     -2.30
Average deleted char (A)          -2.54     -1.07     -0.23     -0.21     -0.35     -0.13     -0.04     -0.54     -0.70     -1.41
Average deleted char (N)          -0.46     -0.56     -0.29     -0.04     -0.31     -0.43     -0.29     -1.48     -0.54     -0.39
Average deleted char (V)          -1.89     -5.63     -0.51     -0.57     -1.69     -1.84     -2.57     -2.68     -2.23     -2.35

Table 7: Inflection paradigms for a lexicon covering 70% of the Bible



Multilingual Evaluation of KnowNet

Evaluación Multilingüe de KnowNet

Montse Cuadros

TALP Research Center, UPC

Barcelona, [email protected]

Germán Rigau

IXA NLP Group, UPV/EHU

Donostia, [email protected]

Resumen: Este artículo presenta un nuevo método totalmente automático de construcción de bases de conocimiento muy densas y precisas a partir de recursos semánticos preexistentes. Básicamente, el método usa un algoritmo de Interpretación Semántica de las palabras preciso y de amplia cobertura para asignar el sentido más apropiado a grandes conjuntos de palabras de un mismo tópico que han sido obtenidas de la web. KnowNet, la base de conocimiento resultante que conecta grandes conjuntos de conceptos semánticamente relacionados, es un paso importante hacia la adquisición automática de conocimiento a partir de corpus. De hecho, KnowNet es varias veces más grande que cualquier otro recurso de conocimiento disponible que codifique relaciones entre sentidos, y el conocimiento que KnowNet contiene supera cualquier otro recurso cuando es empíricamente evaluado en un marco multilingüe común.

Palabras clave: Bases de Conocimiento de amplia cobertura, Interpretación Semántica de las Palabras, Adquisición de Conocimiento.

Abstract: This paper presents a new fully automatic method for building highly dense and accurate knowledge bases from existing semantic resources. Basically, the method uses a wide-coverage and accurate knowledge-based Word Sense Disambiguation algorithm to assign the most appropriate senses to large sets of topically related words acquired from the web. KnowNet, the resulting knowledge base, which connects large sets of semantically related concepts, is a major step towards the autonomous acquisition of knowledge from raw corpora. In fact, KnowNet is several times larger than any available knowledge resource encoding relations between synsets, and the knowledge KnowNet contains outperforms any other resource when it is empirically evaluated in a common multilingual framework.

Keywords: Large-Scale Knowledge Resources, Word Sense Disambiguation, Knowledge Acquisition

1 Introduction

Using large-scale knowledge bases, such as WordNet (Fellbaum, 1998), has become a usual, often necessary, practice for most current Natural Language Processing (NLP) systems. Even now, building large and rich enough knowledge bases for broad-coverage semantic processing takes a great deal of expensive manual effort involving large research groups during long periods of development. In fact, hundreds of person-years have been invested in the development of wordnets for various languages (Vossen, 1998). For example, in more than ten years of manual construction (from 1995 to 2006, that is, from version 1.5 to 3.0), WordNet grew from 103,445 to 235,402 semantic relations (symmetric relations are counted only once). But this data does not seem to be rich enough to support advanced concept-based NLP applications directly. It seems that applications will not scale up to working in open domains without more detailed and rich general-purpose (and also domain-specific) semantic knowledge built by automatic means. Obviously, this fact has severely hampered the state of the art of advanced NLP applications.

However, the Princeton WordNet (WN) is by far the most widely used knowledge base (Fellbaum, 1998). In fact, WordNet is being used world-wide for anchoring different types of semantic knowledge, including wordnets for languages other than English (Atserias et al., 2004), domain knowledge (Magnini and Cavaglia, 2000) or ontologies like SUMO (Niles and Pease, 2001) or the EuroWordNet Top Concept Ontology (Alvez et al., 2008). It contains manually coded information about English nouns, verbs, adjectives and adverbs and is organized around the notion of a synset. A synset is a set of words with the same part-of-speech that can be interchanged in a certain context. For example, <party, political party> form a synset because they can be used to refer to the same concept. A synset is often further described by a gloss, in this case "an organization to gain political power", and by explicit semantic relations to other synsets.

Fortunately, in recent years the research community has devised a large set of innovative methods and tools for large-scale automatic acquisition of lexical knowledge from structured and unstructured corpora. Among others we can mention eXtended WordNet (Mihalcea and Moldovan, 2001), large collections of semantic preferences acquired from SemCor (Agirre and Martinez, 2001; Agirre and Martinez, 2002) or acquired from the British National Corpus (BNC) (McCarthy, 2001), large-scale Topic Signatures for each synset acquired from the web (Agirre and de Lacalle, 2004) or knowledge about individuals from Wikipedia (Suchanek, Kasneci, and Weikum, 2007). Obviously, all these semantic resources have been acquired using a very different set of processes, tools and corpora. As expected, each semantic resource has different volume and accuracy figures when evaluated in a common and controlled framework (Cuadros and Rigau, 2006).

However, not all these large-scale resources encode semantic relations between synsets. In some cases, only relations between synsets and words have been acquired. This is the case of the Topic Signatures acquired from the web (Agirre and de Lacalle, 2004). This is one of the largest semantic resources ever built, with around one hundred million relations between synsets and semantically related words (http://ixa.si.ehu.es/Ixa/resources/sensecorpus).

A knowledge net or KnowNet (KN) is an extensible, large and accurate knowledge base, which has been derived by semantically disambiguating small portions of the Topic Signatures acquired from the web. Basically, the method uses a robust and accurate knowledge-based Word Sense Disambiguation algorithm to assign the most appropriate senses to the topic words associated to a particular synset. The resulting knowledge base, which connects large sets of topically related concepts, is a major step towards the autonomous acquisition of knowledge from raw text.

Knowledge Resources                   #relations
Princeton WN3.0                       235,402
Selectional Preferences from SemCor   203,546
eXtended WN                           550,922
Co-occurring relations from SemCor    932,008
New KnowNet-5                         231,163
New KnowNet-10                        689,610
New KnowNet-15                        1,378,286
New KnowNet-20                        2,358,927
New KnowNet-5 (es)                    144,493
New KnowNet-10 (es)                   447,317
New KnowNet-15 (es)                   922,256
New KnowNet-20 (es)                   1,606,893

Table 1: Number of synset relations

Table 1 compares the number of semantic relations between synset pairs in the available knowledge bases and in the newly created KnowNets for English, together with their relations ported to Spanish (es). These KnowNet versions are available at http://adimen.si.ehu.es.

Varying the number of processed words from each Topic Signature from five to twenty, we automatically created four different KnowNets with millions of new semantic relations between synsets. In fact, KnowNet is several times larger than WordNet, and when evaluated empirically across languages, the knowledge it contains outperforms any other semantic resource.

After this introduction, section 2 describes the Topic Signatures acquired from the web. Section 3 presents the approach we followed for building highly dense and accurate knowledge bases from the Topic Signatures. In section 4, we present the evaluation framework used in this study and we describe the results obtained when evaluating different versions of KnowNet for English and Spanish in a multilingual framework. Finally, section 5 presents some concluding remarks and future work.

2 Topic Signatures

Topic Signatures (TS) are word vectors related to a particular topic (Lin and Hovy, 2000). Topic Signatures are built by retrieving context words of a target topic from large corpora. In our case, we consider word senses as topics. Basically, the acquisition of TS consists of a) acquiring the best possible corpus examples for a particular word sense (usually characterizing each word sense as a query and performing a search on the corpus for those examples that best match the queries), and then b) building the TS by selecting the context words that best represent the word sense from the selected corpora.

Word              Weight
tammany#n         0.0319
federalist#n      0.0315
whig#n            0.0300
missionary#j      0.0229
Democratic#n      0.0218
nazi#j            0.0202
republican#n      0.0189
constitutional#n  0.0186
conservative#j    0.0148
socialist#n       0.0140

Table 2: TS of party#n#1 (first 10 out of 12,890 total words)

The Topic Signatures acquired from the web (hereinafter TSWEB) constitute one of the largest semantic resources available, with around 100 million relations (between synsets and words) (Agirre and de Lacalle, 2004). Inspired by the work of Leacock, Chodorow, and Miller (1998), TSWEB was constructed using monosemous relatives from WN (synonyms, hypernyms, direct and indirect hyponyms, and siblings), querying Google and retrieving up to one thousand snippets per query (that is, per word sense), and extracting the salient words with distinctive frequency using TFIDF. Thus, TSWEB consists of a large ordered list of words with weights associated with each of the polysemous nouns of WN1.6. The number of constructed topic signatures is 35,250, with an average size per signature of 6,877 words. When evaluating TSWEB, we used at most the first 700 words, while for building KnowNet we used at most the first 20 words.
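The weighting step can be illustrated with a small Python sketch. It is only one possible reading of "salient words with distinctive frequency using TFIDF": here the snippets retrieved for each word sense are treated as one document, term frequency is the raw count, and the inverse document frequency is the logarithm of the number of senses divided by the number of senses whose snippets contain the word. The function name topic_signatures, the top_k parameter and the toy snippets are assumptions, not the actual TSWEB procedure.

```python
import math
from collections import Counter


def topic_signatures(snippets_by_sense, top_k=None):
    """Return, for each sense, its context words sorted by decreasing TF-IDF weight."""
    # Term frequency: how often each word occurs in the snippets of a sense.
    tf = {
        sense: Counter(word for snippet in snippets for word in snippet.lower().split())
        for sense, snippets in snippets_by_sense.items()
    }
    # Document frequency: number of senses whose snippets contain the word.
    df = Counter(word for counts in tf.values() for word in counts)
    n_senses = len(tf)
    signatures = {}
    for sense, counts in tf.items():
        weighted = [(w, c * math.log(n_senses / df[w])) for w, c in counts.items()]
        weighted.sort(key=lambda pair: (-pair[1], pair[0]))
        signatures[sense] = weighted[:top_k]
    return signatures


# Toy snippets (hypothetical) for two senses of "party".
snippets = {
    "party#n#1": ["whig party federalist", "whig election"],
    "party#n#2": ["birthday party cake"],
}
signatures = topic_signatures(snippets)
```

In this toy example the word party gets a weight of zero because it occurs with both senses, while whig, which is frequent and specific to party#n#1, ranks first. Passing top_k=20 would keep only the first 20 words, as done when building KnowNet.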

For example, Table 2 presents the first words (lemmas and part-of-speech) and weights of the Topic Signature acquired for party#n#1 (this format stands for word#pos#sense).

3 Building highly connected and dense knowledge bases

We acquired by fully automatic means highly connected and dense knowledge bases by disambiguating small portions of the Topic Signatures obtained from the web, increasing the total number of semantic relations from less than one million (the current number of available relations) to millions of new and accurate semantic relations between synsets. We applied a knowledge-based all-words Word Sense Disambiguation algorithm to the Topic Signatures for deriving a sense vector from each word vector.

3.1 SSI-Dijkstra

We have implemented a version of the Structural Semantic Interconnections algorithm (SSI), a knowledge-based iterative approach to Word Sense Disambiguation (Cuadros and Rigau, to appear 2008). The SSI algorithm is very simple and consists of an initialization step and a set of iterative steps (Navigli and Velardi, 2005).

Given W, an ordered list of words to be disambiguated, the SSI algorithm proceeds as follows. During the initialization step, all monosemous words are included in the set I of already interpreted words, and the polysemous words are included in P (all of them pending disambiguation). At each step, the set I is used to disambiguate one word of P, selecting the word sense which is closest to the set I of already disambiguated words. Once a sense is selected, the word is removed from P and its selected sense is added to I. The algorithm finishes when no more pending words remain in P.

Initially, the list I of interpreted words should include the senses of the monosemous words in W, or a fixed set of word senses. If no monosemous words are found or if no initial senses are provided, the algorithm could make an initial guess based on the most probable sense of the least ambiguous word of W. However, when disambiguating a TS of a word sense s (for instance party#n#1), the list I already includes s.
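The initialization and iterative steps can be sketched in Python as follows. The sketch makes several assumptions that the description leaves open: pending words are processed in the order of W, the closeness of a sense to I is measured as the sum of its distances to every sense in I, and the distance function is supplied by the caller. The function names, the toy distance table and the second sense republican#n#2 are hypothetical.

```python
def ssi(words, senses_of, distance, seed=()):
    """Pick, for each polysemous word, the sense closest to the interpreted set I."""
    interpreted = list(seed)  # I: senses already interpreted
    pending = []              # P: words still to be disambiguated
    chosen = {}
    # Initialization: monosemous words go straight into I.
    for word in words:
        senses = senses_of[word]
        if len(senses) == 1:
            chosen[word] = senses[0]
            interpreted.append(senses[0])
        else:
            pending.append(word)
    if not interpreted:
        raise ValueError("I is empty: provide a seed sense or a monosemous word")
    # Iterative steps (assumption: P is processed in the order of W).
    for word in pending:
        best = min(
            senses_of[word],
            key=lambda sense: sum(distance(sense, i) for i in interpreted),
        )
        chosen[word] = best
        interpreted.append(best)
    return chosen


# Toy distances (hypothetical) between candidate senses and interpreted senses.
toy = {
    ("republican#n#1", "party#n#1"): 1,
    ("republican#n#1", "federalist#n#1"): 2,
    ("republican#n#2", "party#n#1"): 4,
    ("republican#n#2", "federalist#n#1"): 3,
}


def toy_distance(sense, interpreted_sense):
    return toy[(sense, interpreted_sense)]


senses_of = {
    "federalist": ["federalist#n#1"],
    "republican": ["republican#n#1", "republican#n#2"],
}
result = ssi(["federalist", "republican"], senses_of, toy_distance, seed=["party#n#1"])
```

The seed plays the role of the word sense s whose Topic Signature is being disambiguated, so I is never empty in that setting.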

In order to measure the proximity of one synset to the rest of the synsets of I, we use part of the knowledge already available to build a very large connected graph with 99,635 nodes (synsets) and 636,077 edges. This graph includes the set of direct relations between synsets gathered from WordNet and eXtended WordNet. On that graph, we used a very efficient graph library, BoostGraph (http://www.boost.org), to compute the Dijkstra algorithm. The Dijkstra algorithm is a greedy algorithm for computing the shortest path distance between one node and the rest of the nodes of a graph. In that way, we can compute very efficiently the shortest distance between any two given nodes of a graph. We call this version of the SSI algorithm SSI-Dijkstra.

SSI-Dijkstra has very interesting properties. For instance, it always provides the minimum distance between two synsets. That is, the algorithm always provides an answer, namely the minimum distance, whether close or far. In contrast, the original SSI algorithm does not always provide a path distance because it depends on a predefined grammar of semantic relations. In fact, the SSI-Dijkstra algorithm compares the distances between the synsets of a word and all the synsets already interpreted in I. At each step, the SSI-Dijkstra algorithm selects the synset which is closest to I (the set of already interpreted words).
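The distance computation can be sketched in pure Python with a binary heap instead of the Boost Graph Library used in the actual implementation. The graph is represented as a dictionary of neighbours with edge weights; unit weights, the summed distance to I, and the intermediate toy synsets are assumptions made for this example.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest path distance from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


def closest_sense(graph, candidates, interpreted):
    """Select the candidate synset with the smallest summed distance to I."""
    def total(sense):
        dist = dijkstra(graph, sense)
        return sum(dist.get(i, float("inf")) for i in interpreted)
    return min(candidates, key=total)


# Toy undirected graph with unit weights (hypothetical synsets and relations).
edges = [
    ("party#n#1", "federalist#n#1"),
    ("party#n#1", "republican#n#1"),
    ("republican#n#2", "x#n#1"),
    ("x#n#1", "y#n#1"),
    ("y#n#1", "party#n#1"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, {})[b] = 1
    graph.setdefault(b, {})[a] = 1

best = closest_sense(
    graph,
    ["republican#n#1", "republican#n#2"],
    ["party#n#1", "federalist#n#1"],
)
```

Since the graph is connected, every candidate receives a finite distance to every synset in I, which is the property highlighted above.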

Furthermore, this approach is completely language independent. It could be repeated for any language having words connected to WordNet (for instance, Spanish).

3.2 Building KnowNet

We developed KnowNet (KN), a large-scale and extensible knowledge base, by applying SSI-Dijkstra to each topic signature from TSWEB.

We have generated four different versions of KnowNet by applying SSI-Dijkstra to only the first 5, 10, 15 and 20 words of each TS. SSI-Dijkstra used only the knowledge present in WordNet and eXtended WordNet, which consists of a very large connected graph with 99,635 nodes (synsets) and 636,077 edges (semantic relations).

We generated each KN by applying the SSI-Dijkstra algorithm to the whole TSWEB (processing the first words of each of the 35,250 topic signatures). For each TS, we obtained the direct relations from the topic (a word sense) to the disambiguated word senses of the TS (for instance, party#n#1 –> federalist#n#1), but also the indirect relations between disambiguated words from the TS (for instance, federalist#n#1 –> republican#n#1). Finally, we removed symmetric and repeated relations.
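The relation-building step for a single topic signature can be sketched in Python as follows. The function name and the reading of "removing symmetric relations" as treating a pair and its reverse as one relation are assumptions, not details given in the text.

```python
from itertools import combinations

def knownet_relations(topic, disambiguated_senses):
    """Direct (topic, sense) and indirect (sense, sense) relations for one TS.

    Each relation is stored as an unordered pair, so a relation and its
    symmetric counterpart collapse into one, and the set drops repeats.
    """
    relations = set()
    for sense in disambiguated_senses:
        if sense != topic:
            relations.add(frozenset((topic, sense)))   # direct relation
    for a, b in combinations(disambiguated_senses, 2):
        if a != b:
            relations.add(frozenset((a, b)))           # indirect relation
    return relations

rels = knownet_relations(
    "party#n#1",
    ["federalist#n#1", "republican#n#1", "federalist#n#1"],
)
```

With n distinct senses in a signature this yields at most n direct and n(n-1)/2 indirect relations, which is consistent with the relation counts growing much faster than the window size in the KnowNet versions.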

Table 3 shows the percentage of overlap between each KnowNet and the knowledge contained in WordNet and eXtended WordNet, together with the total number of relations and synsets of each resource. For instance, only 8.6% of the total relations included in WN+XWN are also present in KnowNet-20. This means that the rest of the relations from KnowNet-20 are new. As expected, each KnowNet is very large, ranging from hundreds of thousands to millions of new semantic relations among increasing sets of synsets.

KB      WN+XWN   #relations   #synsets
KN-5    3.2%       231,164     39,837
KN-10   5.4%       689,610     45,770
KN-15   7.0%     1,378,286     48,461
KN-20   8.6%     2,358,927     50,705

Table 3: Size and percentage of overlapping relations between KnowNet versions and WN+XWN

4 Evaluation framework

In order to empirically establish the relative quality of these new semantic resources, we used the evaluation framework of task 16 of SemEval-2007: Evaluation of wide coverage knowledge resources (Cuadros and Rigau, 2007).

All knowledge resources are evaluated on a WSD task. In particular, in section 4.5 we used the noun-set of the Senseval-3 English Lexical Sample task, which consists of 20 nouns, and in section 4.6 we used the noun-set of the Senseval-3 Spanish Lexical Sample task, which consists of 21 nouns. For Spanish, the MiniDir dictionary was specially developed for the task. Most of the MiniDir word senses have links to WN1.5 (which in turn are linked by the MCR to the Spanish WordNet (Atserias et al., 2004)). All performances are evaluated on the test data using the fine-grained scoring system provided by the organizers. We use the noun-set only because TSWEB is available only for nouns, and because the English Lexical Sample uses the WordSmyth dictionary (Mihalcea, Chklovski, and Kilgarriff, 2004) as a sense repository for verbs instead of WordNet.

Furthermore, trying to be as neutral as possible with respect to the resources studied, we systematically applied the same disambiguation method to all of them. Recall that our main goal is to establish a fair comparison of the knowledge resources rather than to provide the best disambiguation technique for a particular knowledge base. All knowledge bases are evaluated as topic signatures, that is, word vectors with weights associated to a particular synset, which are obtained by collecting those word senses appearing in the synsets directly related to the topics. This simple representation tries to be as neutral as possible with respect to the resources used.

A common WSD method has been applied to all knowledge resources. A simple word overlap count is performed between the topic signature representing a word sense and the test example (we also consider those multiword terms appearing in WN). The synset having the highest overlapping word count is selected. In fact, this is a very simple WSD method which only considers the topical information around the word to be disambiguated. Finally, we should remark that the results are not skewed (for instance, for resolving ties) by the most frequent sense in WN or any other statistically predicted knowledge.
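A minimal Python sketch of this overlap method follows. The signatures and the example sentence are hypothetical, weights and multiword terms are ignored, and abstaining when senses tie or nothing overlaps is an assumption: the text only states that ties are not resolved with most-frequent-sense information.

```python
def disambiguate(topic_signatures, context_words):
    """Return the sense whose topic signature shares most words with the context."""
    context = set(context_words)
    scores = {sense: len(context & set(words))
              for sense, words in topic_signatures.items()}
    best = max(scores.values(), default=0)
    winners = [sense for sense, score in scores.items() if score == best]
    # Assumption: abstain when nothing overlaps or when several senses tie.
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]

# Hypothetical topic signatures for two senses of "party".
signatures = {
    "party#n#1": ["election", "vote", "political", "government"],
    "party#n#2": ["birthday", "music", "celebration", "dance"],
}
answer = disambiguate(signatures, "the party won the election after the vote".split())
```

Under this assumption a system can answer fewer examples than it is given, which is one way precision and recall can differ for the same resource.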

4.1 KnowNet Evaluation

We evaluated KnowNet using the same framework explained in section 4, that is, the noun part of the test set from the English and Spanish Senseval-3 lexical sample tasks.

4.2 English Baselines

We have designed a number of baselines in order to establish a complete evaluation framework for comparing the performance of each semantic resource when evaluated on the English WSD task.

RANDOM: For each target word, this method selects a random sense. This baseline can be considered as a lower bound.

SEMCOR-MFS: This baseline selects the most frequent sense of the target word in SemCor.

WN-MFS: This baseline is obtained by selecting the most frequent sense (the first sense in WN1.6) of the target word. WordNet word senses were ranked using SemCor and other sense-annotated corpora. Thus, WN-MFS and SEMCOR-MFS are similar, but not equal.

TRAIN-MFS: This baseline selects the most frequent sense of the target word in the training corpus.

TRAIN: This baseline uses the training corpus to directly build a Topic Signature using the TFIDF measure for each word sense, selecting at most the first 450 words. Note that in WSD evaluation frameworks, this is a very basic baseline. However, in our evaluation framework, this "WSD baseline" could be considered as an upper bound. We do not expect to obtain better topic signatures for a particular sense than from its own annotated corpus.
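The construction of such a baseline can be sketched in Python as follows. The exact TFIDF weighting is not given in this section, so the sketch assumes one common variant in which each sense plays the role of a document; the function name and the toy training examples are hypothetical.

```python
import math
from collections import Counter

def topic_signatures(examples_by_sense, max_words=450):
    """Build one TF-IDF-ranked word list per sense from sense-tagged examples.

    Assumption: tf is the word count over a sense's examples and idf is
    log(number of senses / number of senses whose examples contain the word).
    """
    tf = {sense: Counter(w for example in examples for w in example)
          for sense, examples in examples_by_sense.items()}
    n_senses = len(tf)
    df = Counter(w for counts in tf.values() for w in counts)
    signatures = {}
    for sense, counts in tf.items():
        weights = {w: c * math.log(n_senses / df[w]) for w, c in counts.items()}
        # Highest weight first; alphabetical order breaks ties deterministically.
        ranked = sorted(weights, key=lambda w: (-weights[w], w))
        signatures[sense] = ranked[:max_words]
    return signatures

# Hypothetical sense-tagged training examples (already tokenised).
training = {
    "party#n#1": [["election", "vote", "the"], ["vote", "the"]],
    "party#n#2": [["birthday", "the"], ["cake", "birthday"]],
}
sigs = topic_signatures(training, max_words=2)
```

Words shared by all senses, such as "the" in the toy data, receive weight zero and fall to the end of the ranking, so the size cut-off removes them first.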

4.3 Spanish Baselines

As for English, we have designed a number of baselines in order to establish a complete evaluation framework for comparing the performance of each semantic resource when evaluated on the Spanish WSD task.

RANDOM: For each target word, this method selects a random sense. Again, this baseline can be considered as a lower bound.

Minidir-MFS: This method selects the most frequent sense (the first sense in Minidir) of the target word. Since Minidir is a special dictionary built for the task, the word-sense ordering corresponds to their frequency in the training data. Thus, for Spanish, Minidir-MFS is equal to TRAIN-MFS.

TRAIN: This baseline uses the training corpus to directly build a Topic Signature using the TFIDF measure for each word sense. As for English, this baseline can be considered as an upper bound of our evaluation.

Note that the Spanish WN does not encode word-sense frequency information, and for Spanish there is no all-words sense-tagged corpus available in the style of the Italian one (http://multisemcor.itc.it/).

In the Spanish evaluation, only sense-disambiguated relations can be ported without introducing extra noise. For instance, TSWEB has not been tested on the Spanish side. TSWEB relates synsets to words, not synsets to synsets. As this resource is not word-sense disambiguated, translating the English words to Spanish would introduce a large amount of noise (Spanish words not related to the original synset).

4.4 Other Large-scale Knowledge Resources

In order to measure the relative quality of the new resources, we include in the evaluation a wide range of large-scale knowledge resources connected to WordNet.

WN (Fellbaum, 1998): This resource uses the different direct relations encoded in WN1.6 and WN2.0. We also tested WN2 using relations at distances 1 and 2, WN3 using relations at distances 1 to 3 and WN4 using relations at distances 1 to 4.
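The idea of collecting relations up to a given distance can be sketched as a bounded breadth-first search. This is only an illustration on a hypothetical four-node graph; the function name and the adjacency-list representation are assumptions.

```python
from collections import deque

def related_within(graph, synset, max_distance):
    """Synsets reachable from `synset` in 1..max_distance relation steps."""
    seen = {synset: 0}            # node -> number of steps from the start
    queue = deque([synset])
    while queue:
        node = queue.popleft()
        if seen[node] == max_distance:
            continue              # do not expand beyond the distance limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return {node for node, steps in seen.items() if steps >= 1}

# Hypothetical chain of synsets a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
wn2_like = related_within(chain, "a", 2)
```

Each extra step enlarges the set of related synsets, which matches the growing average signature sizes reported for WN, WN2, WN3 and WN4 in Table 4.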

XWN (Mihalcea and Moldovan, 2001): This resource uses the direct relations encoded in eXtended WN.

spBNC (McCarthy, 2001): This resource contains 707,618 selectional preferences acquired for subjects and objects from the BNC.

spSemCor (Agirre and Martinez, 2002): This resource contains the selectional preferences acquired for subjects and objects from SemCor.

MCR (Atserias et al., 2004): This resource integrates the direct relations of WN, XWN and spSemCor.

TSSEM (Cuadros, Rigau, and Castillo, 2007): These Topic Signatures have been constructed using SemCor. For each word sense appearing in SemCor, we gather all sentences for that word sense, building a TS using TFIDF for all word senses co-occurring in those sentences.

4.5 Evaluating each resource in English

Table 4 presents, ordered by F1 measure, the performance in terms of precision (P), recall (R) and F1 measure (F1, harmonic mean of recall and precision) of each knowledge resource on Senseval-3, together with the average size of its TS per word sense. The different KnowNet versions appear marked in bold and the baselines appear in italics.
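For reference, the three measures can be computed as in the following Python sketch. The function names and the raw counts in the example are hypothetical; the only figures taken from the paper are the P and R values of the WN and MCR rows of Table 4, used to check the harmonic mean.

```python
def precision_recall_f1(correct, attempted, total):
    """Precision, recall and F1 (in %) from raw counts.

    correct:   examples answered correctly
    attempted: examples for which the system gave an answer
    total:     all examples in the test set
    """
    precision = 100.0 * correct / attempted if attempted else 0.0
    recall = 100.0 * correct / total if total else 0.0
    return precision, recall, f1(precision, recall)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r, f = precision_recall_f1(correct=30, attempted=60, total=100)
wn_f1 = round(f1(44.9, 18.4), 1)    # WN row of Table 4
mcr_f1 = round(f1(45.3, 43.7), 1)   # MCR row of Table 4
```

A resource that answers only a few examples, as WN does, can therefore combine a comparatively high precision with a low F1.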

In this table, TRAIN has been calculated with a vector size of at most 450 words. As expected, the RANDOM baseline obtains the poorest result. The most frequent senses obtained from SemCor (SEMCOR-MFS) and WN (WN-MFS) are both below the most frequent sense of the training corpus (TRAIN-MFS). However, all of them are far below the Topic Signatures acquired using the training corpus (TRAIN).

The best resources would be those obtaining better performances with a smaller number of related words per synset. The best results are obtained by TSSEM (with F1 of 52.4). The lowest result is obtained by the knowledge directly gathered from WN, mainly because of its poor coverage (R of 18.4 and F1 of 26.1). Interestingly, the knowledge integrated in the MCR, although partly derived by automatic means, performs much better in terms of precision, recall and F1 measures than its components used separately (F1 18.4 points higher than WN, 9.1 higher than XWN and 3.7 higher than spSemCor).

KB            P      R      F1     Av. Size
TRAIN         65.1   65.1   65.1   450
TRAIN-MFS     54.5   54.5   54.5
WN-MFS        53.0   53.0   53.0
TSSEM         52.5   52.4   52.4   103
SEMCOR-MFS    49.0   49.1   49.0
MCR2          45.1   45.1   45.1   26,429
MCR           45.3   43.7   44.5   129
KnowNet-20    44.1   44.1   44.1   610
KnowNet-15    43.9   43.9   43.9   339
spSemCor      43.1   38.7   40.8   56
KnowNet-10    40.1   40.0   40.0   154
(WN+XWN)2     38.5   38.0   38.3   5,730
WN+XWN        40.0   34.2   36.8   74
TSWEB         36.1   35.9   36.0   1,721
XWN           38.8   32.5   35.4   69
KnowNet-5     35.0   35.0   35.0   44
WN3           35.0   34.7   34.8   503
WN4           33.2   33.1   33.2   2,346
WN2           33.1   27.5   30.0   105
spBNC         36.3   25.4   29.9   128
WN            44.9   18.4   26.1   14
RANDOM        19.1   19.1   19.1

Table 4: P, R and F1 fine-grained results for the resources evaluated at Senseval-3, English Lexical Sample Task.

Despite their small size, the resources derived from SemCor obtain better results than their counterparts using much larger corpora (TSSEM vs. TSWEB and spSemCor vs. spBNC).

Regarding the baselines, all knowledge resources surpass RANDOM, but none reaches WN-MFS, TRAIN-MFS or TRAIN. Only TSSEM obtains better results than SEMCOR-MFS, and it is very close to the most frequent sense of WN (WN-MFS) and of the training corpus (TRAIN-MFS).

The different versions of KnowNet consistently obtain better performances when increasing the window size of processed words of TSWEB. As expected, KnowNet-5 obtains the lowest results. However, it performs better than WN (and all its extensions) and spBNC. Interestingly, from KnowNet-10 onwards, all KnowNet versions surpass the knowledge resources used for their construction (WN, XWN, TSWEB and WN+XWN).

These initial results seem to be very promising. If we do not consider the resources derived from manually sense-annotated data (spSemCor, MCR, TSSEM, etc.), KnowNet-10 performs better than any knowledge resource derived by manual or automatic means. In fact, KnowNet-15 and KnowNet-20 outperform spSemCor, which was derived from manually annotated corpora. This is a very interesting result since these KnowNet versions have been derived only with the knowledge coming from WN and the web (that is, TSWEB), and with WN and XWN as a knowledge source for SSI-Dijkstra (eXtended WordNet only has 17,185 manually labeled senses).

KB            P      R      F1     Av. Size
TRAIN         81.8   68.0   74.3   450
MiniDir-MFS   67.1   52.7   59.2
KnowNet-15    54.7   48.9   51.6   176
KnowNet-20    51.8   49.6   50.7   319
KnowNet-10    53.5   43.1   47.7   81
MCR           46.1   41.1   43.5   66
WN2           56.0   29.0   42.5   51
(WN+XWN)2     41.3   41.2   41.3   1,892
KnowNet-5     58.5   26.9   36.8   22
TSSEM         33.6   33.2   33.4   208
XWN           42.6   27.1   33.1   24
WN            65.5   13.6   22.5   8
RANDOM        21.3   21.3   21.3

Table 5: P, R and F1 fine-grained results for the resources evaluated individually on Spanish.

4.6 Evaluating each resource on Spanish

Table 5 presents, ordered by F1 measure, the performance of each knowledge resource on the Senseval-3 Spanish Lexical Sample task, together with the average size of its TS per word sense. Obviously, the average size in this case also differs from that of the English evaluation. The best results for precision, recall and F1 measures are shown in bold. We also mark in italics the results of the different baselines.

As for English, TRAIN has been calculated with a vector size of at most 450 words. As expected, the RANDOM baseline obtains the poorest result, and the most frequent sense obtained from Minidir (Minidir-MFS, which is also TRAIN-MFS) is far below the Topic Signatures acquired using the training corpus (TRAIN).

Among the knowledge resources, WN obtains the highest precision (P of 65.5) but, due to its poor coverage (R of 13.6), also the lowest result (F1 of 22.5). It is also interesting that the knowledge integrated in the MCR outperforms the results of TSSEM in terms of precision, recall and F1 measures, possibly indicating that the knowledge currently uploaded in the MCR is more robust than TSSEM and that the topical knowledge gathered from a sense-annotated corpus of one language cannot be directly ported to another language. Possible explanations of these low results could be the smaller size of the resources (approximately half the size), the differences in the evaluation frameworks, including the dictionary (sense distinctions and mappings), etc.

Regarding the baselines, all knowledge resources surpass RANDOM, but none reaches Minidir-MFS (equal to TRAIN-MFS) or TRAIN.

Interestingly, the knowledge contained in the MCR (F1 of 43.5), partially derived by automatic means and ported from English resources, almost doubles the results of the original Spanish WN (F1 of 22.5).

Regarding the KnowNet versions ported to Spanish, KnowNet-5 performs better than WN, XWN and the TS acquired from SemCor. Starting from KnowNet-10, all KnowNet versions perform better than any other knowledge resource on Spanish derived by manual or automatic means (including the MCR). Interestingly, the best result is obtained by the ported relations of KnowNet-15, which performs slightly better than KnowNet-20 (while having far fewer relations).

5 Conclusions and future research

It is our belief that accurate semantic processing (such as WSD) would rely not only on sophisticated algorithms but also on knowledge-intensive approaches. The results presented in this report suggest that much more research on acquiring and using large-scale semantic resources should be undertaken.

The initial results obtained for the different versions of KnowNet seem to be a major step towards the autonomous acquisition of knowledge from raw corpora, since they are several times larger than the available knowledge resources which encode relations between synsets, and the knowledge they contain outperforms any other resource when empirically evaluated in a common multilingual framework. In fact, when comparing the ranking of the different knowledge resources, the different versions of KnowNet seem to be more robust and stable across languages.

Acknowledgments

We want to thank Aitor Soroa for his technical support and the anonymous reviewers for their comments. This work has been supported by KNOW (TIN2006-15049-C03-01) and KYOTO (ICT-2007-211423).

References

Agirre, E. and O. Lopez de Lacalle. 2004. Publicly available topic signatures for all wordnet nominal senses. In Proceedings of LREC, Lisbon, Portugal.

Agirre, E. and D. Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of CoNLL, Toulouse, France.

Agirre, E. and D. Martinez. 2002. Integrating selectional preferences in wordnet. In Proceedings of GWC, Mysore, India.

Alvez, J., J. Atserias, J. Carrera, S. Climent, A. Oliver, and G. Rigau. 2008. Consistent annotation of eurowordnet with the top concept ontology. In Proceedings of Fourth International WordNet Conference (GWC'08).

Atserias, J., L. Villarejo, G. Rigau, E. Agirre, J. Carroll, B. Magnini, and Piek Vossen. 2004. The meaning multilingual central repository. In Proceedings of GWC, Brno, Czech Republic.

Cuadros, M. and G. Rigau. 2006. Quality assessment of large scale knowledge resources. In Proceedings of the EMNLP.

Cuadros, M. and G. Rigau. 2007. Semeval-2007 task 16: Evaluation of wide coverage knowledge resources. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007).

Cuadros, M. and G. Rigau. to appear 2008. KnowNet: Building a large net of knowledge from the web. In COLING'08.

Cuadros, M., G. Rigau, and M. Castillo. 2007. Evaluating large-scale knowledge resources across languages. In Proceedings of RANLP.

Fellbaum, C., editor. 1998. WordNet. An Electronic Lexical Database. The MIT Press.

Leacock, C., M. Chodorow, and G. Miller. 1998. Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1):147–166.

Lin, C. and E. Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of COLING, Strasbourg, France.

Magnini, B. and G. Cavaglia. 2000. Integrating subject field codes into wordnet. In Proceedings of LREC, Athens, Greece.

McCarthy, D. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

Mihalcea, R. and D. Moldovan. 2001. extended wordnet: Progress report. In Proceedings of NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh, PA.

Mihalcea, R., T. Chklovski, and A. Kilgarriff. 2004. The senseval-3 english lexical sample task. In Proceedings of ACL/SIGLEX Senseval-3, Barcelona.

Navigli, R. and P. Velardi. 2005. Structural semantic interconnections: a knowledge-based approach to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27(7):1063–1074.

Niles, I. and A. Pease. 2001. Towards a standard upper ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), pages 17–19. Chris Welty and Barry Smith, eds.

Suchanek, Fabian M., Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In 16th international World Wide Web conference (WWW 2007), New York, NY, USA. ACM Press.

Vossen, P., editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers.



Extensión y corrección semi-automática de léxicos morfo-sintácticos∗
Semi-automatic extension and correction of morpho-syntactic lexicons

Lionel Nicolas♦, Benoît Sagot♣, Miguel A. Molinero♠, Jacques Farré♦, Éric de La Clergerie♣

♦Team RL, Laboratory I3S - UNSA + CNRS, 2000 routes des lucioles, B.P. 121, 06903 Sophia Antipolis, France
{lnicolas, jf}@i3s.unice.fr

♣Project ALPAGE, INRIA Rocquencourt + Paris 7, Domaine de Voluceau, B.P. 105, 78153 Le Chesnay, France
{benoit.sagot, Eric.De_La_Clergerie}@inria.fr

♠Grupo LYS, Univ. de A Coruña, Dpto. de Computación, Fac. de Informática, Campus de Elviña S/N, 15071 A Coruña, España
mmolinero@udc.es

Abstract: This paper describes a set of techniques for the extension and correction of wide-coverage lexicons based on detection of erroneous entries and automatic generation of correction hypotheses using the syntactical context. We report the results achieved on a French lexicon and we consider the application of our techniques on a Spanish lexicon.

Keywords: Linguistic resource acquisition, parsing, morpho-syntactic lexicons, error mining

∗ Partially funded by the Ministerio de Educación y Ciencia (HUM2007-66607-C04-02) and the Xunta de Galicia ("Red gallega para el procesamiento del lenguaje y recuperación de información" 2006-2009). We also thank the COLE group of the Univ. de Vigo for allowing us to use their computing systems.

1. Introduction

Increasing the coverage and precision of non-trained parsers depends fundamentally on improving the lexicons and grammars they use.

The manual construction of wide-coverage linguistic resources is a laborious, complex and error-prone task that requires the intervention of expert staff. In order to minimise human intervention, simplify the process and increase the quality of the results, automatic or semi-automatic tools can be used. In this work we present a set of tools that detect defects in morpho-syntactic lexicons and propose corrections for them, all taking plain text as the input of the process.

The extension and correction of a lexicon can be divided into two phases: first, identifying erroneous or incomplete entries in the lexicon, and second, proposing corrections for those entries.

We tackle the first step using two techniques that uncover suspicious forms, that is, those that seem to cause parsing errors in a set of sentences.

The solution to the second step rests on the following principle: we can find usage patterns for a suspicious form by studying several sentences that could not be parsed and seeing what information the grammar would have needed in order to produce complete parses. These patterns can then be put forward as correction hypotheses. In a way, we could say that we know the problem lies in the lexicon, and we ask the grammar to express what information it would have accepted for a suspicious form.

The set of techniques presented is completely independent of the language and of the platform. It can be applied to any parser. The only condition is to guarantee that the text used as input is lexically and grammatically correct. This ensures that the rejection of a sentence is due solely to errors in some component (typically the lexicon and/or the grammar).

This article is organised as follows. We first introduce the theoretical concepts on which our techniques are based (Sec. 2). We then detail in Sec. 3 and Sec. 4 the techniques used to detect erroneous information in the lexicon. Next we explain how to generate (Sec. 5) and rank (Sec. 6) correction hypotheses. In Sec. 7 we discuss the differences and similarities with previous work. Afterwards, we present the results achieved (Sec. 8). Finally, we discuss future work (Sec. 9), just before concluding (Sec. 10).

2. Theoretical concepts

The forms of a language are usually described in a lexicon by means of one or more entries that include several types of information: the part of speech, morphological information, syntactic information (subcategorisation frames) and semantic information.

A given form will cause a parsing error if its description in the lexicon leads to a conflict with the grammar.
That is, when the grammar and the lexicon do not agree on the usage pattern of a form.

For practical reasons we distinguish between conflicts related to parts of speech, which we call categorisation defects, and conflicts related to subcategorisation frames, which we call feature conflicts.

Categorisation defects refer to the fact that a given form does not have all of its possible parts of speech represented in the lexicon entries. For example, the form "ficha" could appear as a verb (fichar) and not as a noun. This kind of error is usually associated with homonymy: the lemmas involved can fill several parts of speech and one of the less common ones has been overlooked.

Feature conflicts reflect inconsistencies in the description of the subcategorisation frame of some lexicon entry. They result from the difficulty of exhaustively describing the syntactic behaviour of a form. If the most common usage is also the most restrictive one, it leads to over-specification, that is, the syntactic frame does not allow all the functions that the form can fulfil in practice.

Take any suspicious form associated with a set of unparsable sentences, in which that form is the main suspect for causing the parse failure. Generating lexical corrections for this form requires obtaining data from the grammar for each of the associated sentences, that is, obtaining parses of unparsable sentences. We look for the set of parses that the grammar would have generated for those sentences with an error-free lexicon.

We achieve this by removing the syntactic restrictions of the suspicious form, that is, by enlarging the set of possible parts of speech (virtually adding new entries to the lexicon) and/or relaxing the syntactic restrictions of a lexicon entry. Although the suspicious form is sometimes not the only reason for all the parse errors, this process usually increases the percentage of completed parses.

The removal of restrictions can be viewed as follows: during parsing, each time the lexical information of a suspicious form is accessed, the lexicon is ignored and all syntactic restrictions are considered satisfied.
In this way, the form becomes whatever the grammar wants it to be, that is, it fits any morphological and syntactic pattern that the grammar needs in order to produce a complete parse. These patterns are the data we use to generate the corrections.

We remove the restrictions of suspicious forms by replacing them in the sentences with a special form that we call a wildcard.

3. Detecting categorisation defects

In order to uncover categorisation defects in the lexicon, we have developed a technique based on a stochastic tagger (Graña, Chappelier, and Vilares, 2001; Molinero et al., 2007). The idea is to try to guess new parts of speech for the forms of the input corpus using a specially configured tagger. This tagger treats as unknown all the words that belong to open categories (1). As a consequence, the tagger proposes candidate tags for each of these words, and the ones most likely to be correct are chosen by the stochastic tagging process itself. In this way, new parts of speech emerge for some forms of the input corpus.

To obtain this tagger we used two training corpora. The first is a corpus of sentences (330K words), manually tagged and extracted from the Treebank of the University of Paris 7 (Abeillé, 2003). The second consists of a list of forms belonging to the closed classes (2). The tagger was modified to treat the forms belonging to the second corpus as known. The rest are treated as unknown.

We ran the input corpus through the tagger and extracted the form/tag pairs. Pairs that did not exist in the lexicon were proposed as candidate categorisation defects. The appearance of false positives was mitigated by ranking the candidates according to the following measure:

$(n_{wt} / n_w) \cdot \log(n_{wt})$,

where $n_{wt}$ is the number of occurrences of the form $w$ tagged as $t$ and $n_w$ is the total number of occurrences of the form $w$.

(1) Adjectives, nouns, adverbs, verbs and proper nouns.
(2) Prepositions, determiners, pronouns and punctuation marks.

4. Detecting feature conflicts

The technique described here extends the ideas described in Sagot and Villemonte de La Clergerie (2006), where the authors detect suspicious forms by statistically analysing the results of a parser. This technique yields a list of forms, each with a suspicion coefficient and a set of associated sentences in which that form is the main suspect for causing the parse failure.

Since there is no automatic and unambiguous way to decide whether a parse failure is due to an error in the lexicon or in another component of the parser, the error mining technique for detecting suspicious forms is based on the following idea: when studying the parsing results for a sufficiently large corpus of correct sentences, the less a form appears in parsable sentences and the more it appears in unparsable ones, the more likely it is that the lexical entries of that form are incorrect, especially if the form appears in unparsable sentences together with other forms that appear in parsable sentences.

The main drawback is that the results depend to a large extent on the quality of the grammar used. Indeed, if a given form is associated with certain syntactic constructions not handled by the grammar, this form will appear in unparsable sentences and will be wrongly considered suspicious. This drawback can be limited by applying two improvements:

- Using several parsers, as described in Sagot and Villemonte de La Clergerie (2006), based on different grammars, and combining their results to avoid the systematic errors of each of them.

- Looking for syntactic patterns not covered by the grammar and filtering out the unparsable sentences in which they appear. To do this, each input sentence can be reduced to a sequence of parts of speech by means of a tagger, and a maximum entropy classifier (Daumé III, 2004) can then be trained using the possible trigrams. This classifier makes it possible to identify each sentence, a priori, as parsable or unparsable. Although the result is not perfect (the tagger or the classifier can make mistakes), this filtering notably increases the quality of the suspects obtained through error mining.
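The intuition behind error mining can be illustrated with a deliberately naive Python sketch: for each form, compute the fraction of the sentences containing it that failed to parse. This is only a first approximation and not the technique of the cited work, which also takes into account the other forms occurring in the same sentences; the function name, the minimum-count threshold and the toy corpus are assumptions.

```python
from collections import Counter

def suspicion_ratios(sentences, min_count=2):
    """Rank forms by the share of their sentences that could not be parsed.

    sentences: iterable of (list_of_forms, parsed_ok) pairs.
    Forms seen in fewer than min_count sentences are ignored (assumption).
    """
    total, failed = Counter(), Counter()
    for forms, parsed_ok in sentences:
        for form in set(forms):        # count each form once per sentence
            total[form] += 1
            if not parsed_ok:
                failed[form] += 1
    ratios = {f: failed[f] / total[f] for f in total if total[f] >= min_count}
    return sorted(ratios.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical parsing results: (forms of the sentence, did it parse?).
corpus = [
    (["el", "gato", "duerme"], True),
    (["el", "perro", "ficha"], False),
    (["la", "ficha", "roja"], False),
    (["el", "perro", "duerme"], True),
]
ranking = suspicion_ratios(corpus)
```

In the toy corpus the form "ficha" only occurs in unparsable sentences and therefore ranks first, whereas "perro" is partly cleared by also occurring in a parsable sentence.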


5. Generating corrections

Once the suspicious forms have been detected and ranked, the next step is to suggest corrections automatically. The simplest way to generate correction hypotheses would be to use wildcards that contain no restriction whatsoever. This would avoid every kind of conflict and would notably increase the coverage of the parser.

However, as explained in Fouvry (2003), this generates unnecessary ambiguity and leads to an explosion in the number of possible parses, or even to no parse at all for lack of memory or time. Metaphorically, as we said before, we want the grammar to give us the lexical information it would have accepted for the suspicious forms. With unrestricted wildcards, the grammar would generate so much information that we would not know which to take as correct, or it might even have so much to say that it could not express anything.

We therefore refine the wildcards by introducing data that restrict their use and reduce the ambiguity. For practical reasons, we use wildcards with a defined part of speech.

To obtain hypotheses about categorisation defects, we need the parser to explore grammar rules other than those visited when the parse failed. Therefore, for each suspicious form we generate wildcards with parts of speech different from those present in the lexicon.

To obtain hypotheses about feature conflicts, we need the parser to explore the same grammar rules again but without stopping on feature unification failures. Therefore we generate wildcards with the same part of speech as those already present in the lexicon.

The parses obtained after introducing the wildcards are passed to a conversion module, developed for each parser, which extracts the instantiated lexical entry of each wildcard in the format of the lexicon. This way of proceeding has three advantages:

- There is no need to understand the output format of the parser in order to study the corrections;

- The proposed corrections consist exclusively of data relating to the lexicon;

- The results produced by several parsers can be combined, which is an efficient solution for overcoming some limitations of the process (see Sec. 6).

6. Ranking the hypotheses

Natural languages are ambiguous, and so are the grammars that model them. For example, in some Romance languages an adjective can be used as a noun and a noun as an adjective. Consequently, a wildcard with an incorrect part of speech can lead to complete parses and offer incorrect corrections. To alleviate this problem we first classify the correction hypotheses according to their corresponding categorised wildcards. By studying the percentage of complete parses produced by each type of wildcard and the sentences that become parsable thanks to them, it is simple for a human to identify the valid wildcard.

When a single parser is used, ranking the corrections is a simple task, but the results depend entirely on the quality of the grammar. Using correction hypotheses from several parsers alleviates this problem, but requires more sophisticated ranking techniques.

6.1. Simple ranking with a single parser

The correction hypotheses obtained after introducing a wildcard are generally irrelevant, that is, many of them are parasitic corrections that stem from the ambiguity introduced by the wildcard. However, among all the corrections, some are valid, or at least useful for discovering the true ones. Within the scope of a single sentence, there is no reliable way to determine which are parasitic and which are valid. But if we simultaneously consider many sentences that include the same suspicious form in different syntactic constructions recognised by different grammar rules, we can observe a wide dispersion of the parasitic hypotheses. In contrast, the correct corrections, which represent the true sense of the word according to the grammar, appear recurrently. We therefore rank the correction hypotheses according to the number of sentences that produce them.

6.2. Advanced ranking with several parsers

Using more than one parser not only improves the detection of suspicious forms but also makes it possible to combine correction hypotheses so as to minimise the influence of each grammar. When a form is related to a syntactic construction that is not correctly covered by a grammar, this form appears in unparsable sentences and will therefore be suspicious. Replacing it with wildcards will only lead to incorrect corrections because the problem does not lie in the lexicon.

Therefore, using several parsers yields several sets of unparsable sentences and several sets of correction hypotheses. The hypotheses can be discarded (or considered less relevant) according to three principles:

- If a suspicious form really corresponds to an error in the lexicon, no sentence containing it in the syntactic function associated with the error can be parsed. The hypotheses produced by sentences that are parsable by at least one of the parsers can be discarded, since the error generally comes not from the lexicon but from the grammars.

- For the same reason, the correction hypotheses produced from sentences in which only one parser has identified the form as suspicious should also be eliminated.

- Finally, the correction hypotheses proposed by only one of the parsers (or proposed many more times by one of the parsers than by the others) may simply be a consequence of the ambiguity of the grammar. After all, the grammars describe the same language, so they should offer common results on the usage of a form.

We then use the following ranking scheme: given a suspicious form, we keep only the correction hypotheses obtained from sets of sentences that were originally unparsable but become parsable by all the parsers when the same wildcard is introduced. Next, we rank the hypotheses of each parser separately and finally combine the results.

7. Related work

Having presented our techniques, we discuss the similarities and differences between our research and previously published work.

The acquisition/extension/correction of lexicons has been a very active research topic in recent years, especially since lexical and grammatical formalisms suitable for representing deep linguistic knowledge have been developed.

The idea of drawing on the syntactic context to acquire lexical data began in 1990 (Erbach, 1990). The technique for identifying suspicious forms described in van Noord (2004) was combined with this idea from 2006 onwards (van de Cruys, 2006; Yi and Kordoni, 2006). Except in Nicolas, Farré, and Villemonte de La Clergerie (2007), the improvement described in Sagot and Villemonte de La Clergerie (2006) has not been used. Nor has anyone so far attempted to filter the input sentences (Sec. 4) to improve the identification.

The generation of wildcards began to be refined from 1998 (Barg and Walther, 1998). Since then, partial wildcards are usually built for the open classes. Yi and Kordoni (2006) use an elegant entropy-based classification technique to choose the most suitable wildcards before introducing them.

Hypotheses are usually classified by means of a trained tool (van de Cruys, 2006; Yi and Kordoni, 2006), such as an entropy classifier, but no attempt has ever been made to evaluate the hypotheses over several sentences in order to discriminate the parasitic ones.

In short, no concrete result was obtained in lexicon correction until 2005.
van de Cruys (2006) and, above all, Yi and Kordoni (2006) report acceptable results based on sentences extracted from a treebank. van de Cruys (2006) breaks the results down by syntactic category, and they clearly show, especially for complex lemmas such as verbs, that this kind of technique cannot be applied automatically without harming the quality of the lexicon. Except for Nicolas, Farré, and Villemonte de La Clergerie (2007), no work explicitly discusses the dependence on the quality of the grammars used, which represents the limit of this line of work and explains why we proceed semi-automatically rather than automatically.

8. Results

Below we present the results achieved by applying the techniques described in this article to the French lexicon Lefff 3. We first describe the practical context and then measure the effectiveness of the correction process.

8.1. Practical context

Lefff is a wide-coverage morphosyntactic lexicon that was partially built using automatic acquisition techniques (Sagot et al., 2006). At the time of writing, it contains more than 520,000 forms.

We used two parsers, each based on its own grammar:

- FRMG (French Meta-Grammar) is a meta-grammar (Thomasset and Villemonte de La Clergerie, 2005) that we compile into a hybrid TAG/TIG parser.

- SxLFG-Fr (Boullier and Sagot, 2005; Boullier and Sagot, 2006) is a deep, non-probabilistic LFG grammar.

The input corpus comes from a political newspaper, Le monde diplomatique, and consists of more than 280,000 sentences of fewer than 25 words. In total, it contains 4.3 million words.

8.2. Efficiency of the corrections

There are several ways to measure the quality of a set of corrections. In our case, we chose to measure the efficiency of the process by studying the increase in the percentage of parsable sentences achieved during our experiments. In any case, we must bear in mind that the corrections are validated and added manually, so the notable increase in parser coverage is due overall to the improvement of the lexicon.

3 Lexique des formes fléchies du français (lexicon of French inflected forms).

[Plot: x-axis, session number (0 to 3); y-axis, number of complete parses (150,000 to 158,000); series: FRMG and SxLFG.]

Figure 1: Number of sentences parsed after each correction session.

Session     1     2     3   total
n          30    99     1     130
adj        66   694    27     787
verb     1183     0   385    1568
adv         1     7     0       8
total    1280   800   413    2493

Table 1: Forms updated in the lexicon in each correction session.

Figure 1 shows this gain as the number of sentences parsable by each parser after each correction session. Table 1 shows the number of forms updated in the lexicon in each session.

All correction sessions were carried out using the error detection and hypothesis generation techniques, except the second one. In that session only the technique for detecting categorization defects was applied, which has not yet been connected to the automatic hypothesis generation module for lack of time. In any case, the list of suspicious forms produced by this technique was simple enough to be reviewed without the help of the hypothesis generation module.

As we feared, the results were quickly limited by the quality of the grammars and of the corpus. Indeed, the lexicon and the grammars used have been developed jointly over recent years using the same corpus as a test bed. As a result, the error detection technique yields irrelevant corrections after a few sessions. Moreover, the technique for detecting categorization defects can be used only once per input corpus. To carry out new sessions, it is necessary to improve or change the grammars used or to obtain new input corpora.

Even so, in this experiment we corrected 254 lemmas corresponding to 2493 forms. The percentage of parsable sentences increased by 3.41% (5141 sentences) for FRMG and by 1.73% (2677 sentences) for SXLFG. It is worth noting that, thanks to the efficiency of the error detection and hypothesis generation techniques presented here, these results were achieved with only a few hours of human work.

9. Future work

Our efforts will focus on two tasks.

9.1. Application to Spanish

The Universitat Pompeu Fabra 4 pioneered the development of a wide-coverage morphosyntactic lexicon for Spanish: SRG (Spanish Resource Grammar) (Marimon, Seghezzi, and Bel, 2006), which is currently the most extensive and developed. In Yi and Kordoni (2006), the authors point to lexicon failures as the cause of most parsing errors on general-domain texts written in English: around 70% of parses stop because lexical information for some word is unavailable. The distance between English and Spanish prevents us from extending this conclusion. But if we consider French, a language much closer to Spanish in linguistic terms, we see that Lefff describes more than 110,000 lemmas, while SRG describes only 50,000. It seems reasonable to consider that this resource still needs to be extended.

4 http://www.iula.upf.edu/

We plan to extend it by applying the following methodology:

- We will increase the number of lemmas by applying a semi-automatic acquisition technique (Clément, Sagot, and Lang, 2004) that has proven effective in languages as different as French, Slovak and Czech. This will give us new lemmas with morphological information.

- We will then apply the technique described in this article to obtain their syntactic information.

In theory, this methodology can be applied even to languages with very small lexica. But the lexicon must make it possible to find in the input corpus a good number of sentences with a single suspicious form. SRG is extensive enough to yield many sentences that meet this condition, which makes the use of this methodology viable.

9.2. Extending the techniques

Although the technique for detecting and correcting categorization defects has given acceptable results, it is still at a preliminary stage. The ambiguity introduced by the high number of unknown words that our technique induces must be reduced. We plan to modify it to consider, whenever possible, a single unknown word per sentence. It also needs to be connected to the correction hypothesis generation module to form an integrated tool.

One advantage of the process is related to its main disadvantage: the dependence on the grammar used. If, in an unparsable sentence, none of the corrections proposed for the suspicious forms could be validated, then the sentence can be considered lexically correct for the current state of the grammar. That is, the sentence represents an error in the grammar. Therefore, successively improving the lexicon until it no longer gives rise to new correct correction hypotheses will make it possible to obtain a corpus that is representative of the grammar's shortcomings. This corpus could be the basis of another tool for improving the grammar. Once this is done, the same corpus could be used again to detect lexical errors. In this way, an alternating and incremental process for the joint improvement of lexica and grammars could be carried out.
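The partition described in Section 9.2 (sentences explained by the lexicon versus sentences that point to a gap in the grammar) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Sentence` record, its field names and the `split_corpus` helper are hypothetical, and the manual validation step is assumed to have been carried out beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """An unparsable sentence (hypothetical record used only for this sketch)."""
    text: str
    # Maps each suspicious form to the corrections a human validated for it.
    # An empty list means that no proposed correction was accepted.
    validated: dict[str, list[str]] = field(default_factory=dict)


def split_corpus(unparsable: list[Sentence]) -> tuple[list[Sentence], list[Sentence]]:
    """Separate sentences explained by the lexicon from those blamed on the grammar."""
    lexical, grammatical = [], []
    for sentence in unparsable:
        if any(sentence.validated.values()):
            lexical.append(sentence)      # at least one validated lexical correction
        else:
            grammatical.append(sentence)  # lexically correct for the current grammar
    return lexical, grammatical


# Made-up sample data, for illustration only.
corpus = [
    Sentence("sentence A", {"form1": ["add a noun entry"]}),
    Sentence("sentence B", {"form2": []}),
]
lexical, grammatical = split_corpus(corpus)
```

Under these assumptions, the `grammatical` list plays the role of the corpus that is representative of the grammar's shortcomings.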



10. Conclusions

In conclusion, the set of techniques presented has proven relevant and efficient in practice on a French lexicon. Applying them to a Spanish lexicon will allow us, on the one hand, to improve the linguistic resources available for Spanish and, on the other, to detect shortcomings in our techniques that have not yet been identified.

The point reached in the development of the techniques presented is not an end. There are still improvements we can implement but, above all, it is the goal of grammar correction that draws our attention. Indeed, the techniques presented in this work constitute an effective system for extending and correcting morphosyntactic lexica. But they also make it possible to build a corpus representative of the grammar's shortcomings, which opens a path towards extending and correcting the grammar used.

References

Abeillé, Anne. 2003. Annotation morpho-syntaxique. Paper available at http://www.llf.cnrs.fr/Gens/Abeille/guide-morpho-synt.02.pdf, January.

Barg, Petra and Markus Walther. 1998. Processing unknown words in HPSG. In Proceedings of the 36th Conference of the ACL and the 17th International Conference on Computational Linguistics.

Boullier, Pierre and Benoît Sagot. 2005. Efficient and robust LFG parsing: SxLfg. In Proceedings of IWPT'05, pages 1-10.

Boullier, Pierre and Benoît Sagot. 2006. Efficient parsing of large corpora with a deep LFG parser. In Proceedings of LREC'06.

Clément, Lionel, Benoît Sagot, and Bernard Lang. 2004. Morphology based automatic acquisition of large-coverage lexica. In Proceedings of the LREC'04.

Daumé III, Hal. 2004. Notes on CG and LM-BFGS optimization of logistic regression. Paper available at http://pub.hal3.name/daume04cg-bfgs, implementation available at http://hal3.name/megam/, August.

Erbach, Gregor. 1990. Syntactic processing of unknown words. In IWBS Report 131.

Fouvry, Frederik. 2003. Lexicon acquisition with a large coverage unification-based grammar. In Companion to the 10th of EACL.

Graña, Jorge, Jean-Cédric Chappelier, and Manuel Vilares. 2001. Integrating external dictionaries into stochastic part-of-speech taggers. EuroConference Recent Advances in Natural Language Processing (RANLP). Proceedings, pp. 122-128.

Marimon, Montserrat, Natalia Seghezzi, and Núria Bel. 2006. An open-source lexicon for Spanish. In XXIII Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural.

Molinero, Miguel A., Fco. Mario Barcala, Juan Otero, and Jorge Graña. 2007. Practical application of one-pass Viterbi algorithm in tokenization and POS tagging. Recent Advances in Natural Language Processing (RANLP). Proceedings, pp. 35-40.

Nicolas, Lionel, Jacques Farré, and Éric Villemonte de La Clergerie. 2007. Correction mining in parsing results. In Proceedings of LTC'07.

Sagot, Benoît, Lionel Clément, Éric Villemonte de La Clergerie, and Pierre Boullier. 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of LREC'06.

Sagot, Benoît and Éric Villemonte de La Clergerie. 2006. Error mining in parsing results. In Proceedings of ACL/COLING'06, pages 329-336. Association for Computational Linguistics.

Thomasset, François and Éric Villemonte de La Clergerie. 2005. Comment obtenir plus des méta-grammaires. In Proceedings of TALN'05.

van de Cruys, Tim. 2006. Automatically extending the lexicon for parsing. In Proceedings of the eleventh ESSLLI student session.

van Noord, Gertjan. 2004. Error mining for wide-coverage grammar engineering. In Proceedings of ACL 2004.

Yi, Zhang and Valia Kordoni. 2006. Automated deep lexical acquisition for robust open texts processing. In Proceedings of LREC-2006.



A cognitive approach to qualities for NLP

Un enfoque cognitivo de las cualidades para el PLN

Resumen: En muchas bases de conocimiento para el PLN prevalece actualmente un enfoque lexicista. En cambio, FunGramKB utiliza la ontología como módulo pivote entre los niveles léxicos y cognitivos, convirtiéndose así en el componente más importante. El propósito de este artículo es la descripción de los tipos semánticos que se asignan a las cualidades de FunGramKB y cómo el enfoque cognitivo que se adopta facilita tanto la estructuración de la base de conocimiento como el razonamiento en sistemas del PLN. Palabras clave: Base de conocimiento, representación del conocimiento, ontología, postulado de significado, razonamiento, comprensión del lenguaje natural.

Abstract: Unlike most current NLP knowledge bases, where the lexicalist approach prevails, FunGramKB is ontology-oriented, since the ontology plays a pivotal role between the lexical and the cognitive levels. The objective of this paper is to describe the semantic types assigned to qualities in FunGramKB ontology and how the cognitive approach adopted facilitates the structuring of the knowledge base as well as the reasoning in NLP systems. Keywords: Knowledge base, knowledge representation, meaning postulate, ontology, reasoning, natural language understanding.

Carlos Periñán-Pascual, Francisco Arcas-Túnez

Universidad Católica San Antonio
Campus de los Jerónimos s/n
30107 Guadalupe - Murcia (Spain)
{jcperinan, farcas}@pdi.ucam.edu

1 FunGramKB ontology

FunGramKB (Periñán-Pascual and Arcas-Túnez, 2004, 2005, 2006, 2007) is a multipurpose lexico-conceptual knowledge base for natural language processing (NLP) systems. It is multipurpose in the sense that it is both multifunctional and multilanguage. In other words, FunGramKB can be reused in various NLP tasks (e.g. information retrieval and extraction, machine translation, dialogue-based systems, etc.) and with several natural languages.1

FunGramKB is made up of five independent but interrelated modules: lexicon, morphicon, onomasticon, cognicon and ontology. The most important module is the ontology, since it is deemed a pivotal component. For example, lexical units are assigned syntactic, pragmatic and collocational information in the lexicon, but their meaning representations are conceived as semantic properties in the ontology, so that every sense of a lexical unit is linked to a conceptual unit.

FunGramKB ontology is presented as a hierarchical structure of all the concepts that a person has in mind when talking about everyday situations. This ontology distinguishes three different conceptual levels, each one of them with concepts of a different kind:

a. Metaconcepts, which constitute the upper level, are used as cognitive dimensions.

b. Basic concepts, which constitute the intermediary level, are used as defining units which allow the construction of interlinguistic meaning postulates for basic concepts and terminals.

c. Terminals constitute the leaf nodes of the ontology, so hierarchization at this level is practically non-existent. The borderline between basic concepts and terminals is based on their definitory potential to take part in meaning postulates.

Basic concepts and terminals are not stored as atomic symbols but are provided with a rich internal structure consisting of properties such as the semantic types and the meaning postulate. This paper focuses on the knowledge representation of qualities in FunGramKB ontology.2 More particularly, section 2 gives an accurate description of their semantic types and section 3 presents the benefits of the cognitive approach adopted for NLP.

1 Although the current version of FunGramKB deals with just two languages (i.e. English and Spanish), we intend to develop lexica for French, German and Italian.

2 The semantic types of qualities

FunGramKB assigns four different semantic types to every concept linked to the metaconcept #QUALITY, i.e. the cognitive dimension for the qualities: intersective/subsective, dynamic/static, gradable/non-gradable and polar/serial/none. Although these types have already been used in other NLP models, namely Generalized Upper Model (Bateman, Henschel and Rinaldi, 1995), Mikrokosmos (Raskin and Nirenburg, 1995, 1998), EAGLES (1999) and SIMPLE (Peters and Peters, 2000), a cognition-oriented approach is adopted in FunGramKB. Thus, lexical-syntactic substitution tests characteristic of traditional semantic analyses (Quirk et al., 1985; Cruse, 1986) are not suitable for the diagnosis of concepts. For instance, supposing that α and β are lexical realizations of a noun and an adjective respectively, the following validation tests have usually been employed to verify that the adjective is [i] dynamic or [ii] gradable:

[i] α is being β

[ii] α is very β or How β is α?

On the contrary, the semantic types of FunGramKB qualities are exclusively determined on the basis of semantic criteria, regardless of their surface realizations.3

2 Currently, FunGramKB contains approximately 750 and 630 adjectives in the English and Spanish lexica respectively, organized into 320 full-featured concepts in the ontology.

2.1 Intersective/subsective

This parameter takes into account the speaker’s standpoint on the truth value expressed by the quality, resulting in the dichotomy of intersectivity (i.e. absolute truth-value) and subsectivity (i.e. relative truth-value). For example, the concept NAKED,4 understood as “not wearing any clothes”, is shared by all people in such an identical way that it causes no disagreement when describing the same reality. On the contrary, not all individuals can perceive the same instance of an entity as INTERESTING, maybe because, for some speakers, the instance (e.g. a theory, a class, etc.) provides information that they already know. Therefore, NAKED and INTERESTING are intersective and subsective concepts respectively.

2.2 Dynamic/static

A quality is dynamic when, for the same instance of entity, the quality can change over time, either because of the nature of the entity itself or because of an action exerted by an external entity. Otherwise, the quality is static. For example, HOT describes a quality that can be temporarily present in an instance of entity, so the concept is dynamic. On the contrary, GERMAN, understood as “born in Germany”, is static, since it refers to a quality which will never be altered in an instance of entity.

2.3 Gradable/non-gradable

A quality is gradable (e.g. EXPENSIVE) when, for the same instance of entity, the quality can take varying degrees of intensity over time. Otherwise, the quality is non-gradable (e.g. ALIVE).

2.4 Polar/serial/none

FunGramKB describes meaning oppositions by locating them in cognitive spaces, where positive and negative focal concepts are determined. Here terms such as “positive” and “negative” are not applied to refer to a kind of meaning connotation, but to the presence or not of the negation operator in the meaning representation. In other words, the negative focal concept is defined as the negation of the positive one: e.g. false means not true. Evidently, if A is the opposing concept of B, then there is no need to state that B is the opposing concept of A. A priori, either of the two focal concepts in a dimension is liable to be deemed positive. However, FunGramKB knowledge engineers follow the arbitrary criterion of taking as positive the concept to which the lexical unit with the highest frequency index5 is linked.6 Figure 1, which has been captured from FunGramKB Suite,7 illustrates a case of non-gradable polarity.

If there is gradation within a semantic dimension, the concepts involved are described around the two focal concepts, which are determined by comparing the frequency indices of the lexical units linked to all those concepts. More particularly, the positive one is selected on the basis of the highest index, and the negative one follows the same criterion but taking into account just those concepts located on the opposing side of the dimension. For the remaining concepts, quantifying operators m (many/much) and p (few/little) are used to describe different degrees of intensity around the focal concepts. Figure 2 displays FunGramKB framework for the semantic representation of oppositions involving qualities. As can be seen, a cognitive dimension in which qualities are involved in a meaning opposition can be divided into up to seven semantic zones, where the central one results from the negation of both focal concepts. Indeed, the canonical structure of these dimensions is determined by the combination of two semantic types (Table 1).

zones   type                       examples8
7       gradable + serial          BEAUTIFUL, HAPPY, INTELLIGENT
6       gradable + polar           ANGRY, SENSITIVE, TIRED
4       quasi-gradable9 + polar    DIRTY, NOISY, OPEN
2       non-gradable + polar       ARTIFICIAL, MALE, WRONG

Table 1: Structuring meaning oppositions.

3 This is the reason why a careful distinction is made in this paper between the terms ‘noun-adjective’ (lexical labels) and ‘entity-quality’ (conceptual labels) respectively.

4 Although their names are represented by English words in block letters, concepts are not language dependent. Thus, lexical units such as naked and nude (English), or desnudo and en cueros (Spanish), are all linked to NAKED.

5 This frequency index is obtained from WordNet.

6 In order to facilitate meaning representation, this criterion is not applied in the case that standard dictionaries use a less frequent concept to define the opposing one. Some examples are natural-artificial and different-similar, where the first adjective is more frequent but the second one is preferred as defining word.

7 FunGramKB Suite has been developed in order to assist engineers in the acquisition and maintenance of the knowledge base. In this case, the tool can automatically reconstruct a whole semantic dimension from the meaning postulate of a particular quality.

One of the key features of these semantic zones is their “cognitive feasibility”, which does not necessarily imply “lexicalization”. In other words, every semantic zone can be represented by a concept, but it is possible for a particular language to have no lexical realization for that concept. In fact, the difference between series and polarities lies in the cognitive feasibility of the central semantic zone, regardless of the possibility of lexicalization in that zone. For example, Figure 3 illustrates the dimension of size. Although not all its semantic zones are lexicalized in English or Spanish, it is treated as a series, since any zone is “potentially lexicalizable” when embedding other natural languages in the knowledge base.
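The zone counts in Table 1 can be reproduced with a short sketch. It assumes that each gradable side contributes three zones (the focal concept plus its m and p neighbours), that a non-gradable side contributes only its focal concept, and that a series adds the central zone obtained by negating both focal concepts. The function name, the label strings and the positive-to-negative ordering of the list are illustrative assumptions, not part of FunGramKB.

```python
def semantic_zones(gradable_positive: bool, gradable_negative: bool, serial: bool) -> list[str]:
    """List the zone labels of a cognitive dimension (ordering is an assumption)."""
    # A gradable side has the focal concept plus the "m" (much) and "p" (little) zones.
    positive = ["m focal+", "focal+", "p focal+"] if gradable_positive else ["focal+"]
    negative = ["p focal-", "focal-", "m focal-"] if gradable_negative else ["focal-"]
    # Only a series has a cognitively feasible central zone (negation of both focal concepts).
    centre = ["n focal+ & n focal-"] if serial else []
    return positive + centre + negative


gradable_series = semantic_zones(True, True, serial=True)         # 7 zones
gradable_polarity = semantic_zones(True, True, serial=False)      # 6 zones
quasi_gradable = semantic_zones(True, False, serial=False)        # 4 zones
non_gradable_polarity = semantic_zones(False, False, serial=False)  # 2 zones
```

In the quasi-gradable call the positive side is taken to be the gradable one; the opposite configuration yields the same number of zones.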

2.5 Final remarks

Since the FunGramKB approach is markedly conceptualist, validation tests based on the linguistic behaviour of adjectives are replaced by questions concerning the conceptual perception of qualities in the real world:

a. Do all individuals share the same view of the quality when present in an instance of entity? (Y: intersective/N: subsective)

b. Can the quality be affected over time when present in an instance of entity? (Y: dynamic/N: static)

c. Can the quality take varying degrees of intensity over time when present in an instance of entity? (Y: gradable/N: non-gradable)

d. Does the quality take part in a meaning opposition? (Y: next question/N: none) Does the absence of that quality in an instance of entity necessarily imply the presence of the opposing quality? (Y: polar/N: serial)

8 Each example is represented by the positive focal concept of a cognitive dimension.

9 A cognitive dimension is “quasi-gradable” if one side is gradable but the other isn’t. For example, an instance of entity can be open in different degrees (i.e. gradable quality), but that instance can only stay in one position if it is closed (i.e. non-gradable quality).

Figure 1: A sample of non-gradable polarity.

Figure 2: Cognitive framework for the representation of meaning oppositions. [Diagram of a cognitive dimension; zone labels: m focal +, focal +, p focal +, n focal + / n focal -, p focal -, focal -, m focal -.]

Figure 3: A sample of gradable series.
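The four diagnostic questions of Section 2.5 amount to a small decision procedure, which can be sketched as below. The function and parameter names are hypothetical, and the answers passed in the usage lines are illustrative rather than taken from the knowledge base.

```python
def semantic_types(shared_view: bool,
                   changes_over_time: bool,
                   varies_in_degree: bool,
                   in_opposition: bool,
                   absence_implies_opposite: bool = False) -> dict[str, str]:
    """Map the yes/no answers to questions a-d onto the four semantic types."""
    if not in_opposition:
        opposition = "none"  # question d, first part answered "no"
    else:
        # Second part of question d is only asked when there is an opposition.
        opposition = "polar" if absence_implies_opposite else "serial"
    return {
        "truth": "intersective" if shared_view else "subsective",    # question a
        "time": "dynamic" if changes_over_time else "static",        # question b
        "gradation": "gradable" if varies_in_degree else "non-gradable",  # question c
        "opposition": opposition,
    }


# Illustrative answers only: all "yes" except the last follow-up question.
example = semantic_types(True, True, True, True, absence_implies_opposite=False)
```

With these answers, `example` describes an intersective, dynamic, gradable quality that belongs to a series.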

3 NLP and the cognitive approach to qualities

This view of the semantic types of qualities facilitates the structuring of the NLP knowledge base (e.g. ontological modelling and composition of meaning postulates) as well as the reasoning in natural language understanding systems, among other advantages.



3.1 Ontological modelling

On the one hand, the distribution of basic qualities within meaning oppositions (i.e. polarity or series) made engineers re-arrange concepts between the basic and terminal levels of the ontology. The source of the inventory of FunGramKB basic concepts was the defining vocabulary in Longman Dictionary of Contemporary English (1988).10 However, some of these concepts were finally demoted to terminals because of their place in the cognitive dimension. For example, the adjectives big and small are included in the Longman vocabulary, so initially they were going to be treated as basic concepts (i.e. +BIG and +SMALL). However, since +SMALL plays the role of negative focal concept, it was finally stored as a terminal concept. But a problem arose when building the meaning representation of the concept to which the adjectives midget, minuscule, minute or tiny are linked, since it is impossible to describe them accurately without using the concept SMALL. The solution to this and other similar problems in the conceptualization of the Longman vocabulary lay in the adoption of the following protocol:

a. All concepts in a polarity or series are stored as terminal, except for the positive focal concept. For example, in the dimension of size +BIG_00 is the positive focal concept and $BIG_N_00 is the negative one, which has been demoted to terminal concept (Figure 3).

b. However, if any terminal concept in the dimension serves as a co-superordinate of another concept, then the former is promoted to basic concept. The reason is that only basic concepts can be used as defining units, and superordinates appear in the meaning postulate of subordinate concepts. For example, +BIG_01, which is one of the superordinates of $MONSTROUS_00, was promoted to basic concept (Figures 3 and 4).

c. The names of all terminal concepts in a dimension will be formed out of that of the positive focal concept plus the infix _N_ (i.e. not). The reason is that the only case in which FunGramKB permits terminal concepts to be included in meaning postulates occurs when the terminal concept owns the same name as the definiendum. In this way, both focal concepts can be used as descriptors in gradable cognitive dimensions. For example, $BIG_N_00 is used to describe the meaning of $BIG_N_01 (Figure 3).

10 In the conceptualization phase, lexical units from the Longman vocabulary were mapped into basic cognitive units.
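The naming convention in step c of the protocol can be illustrated with a minimal sketch. It assumes, from the examples in the text, that basic concepts carry the prefix `+`, terminal concepts the prefix `$`, and that names end in a two-digit index; the helper name `negative_name` and the regular expression are hypothetical.

```python
import re


def negative_name(positive_focal: str) -> str:
    """Derive the terminal name on the negative side, e.g. +BIG_00 -> $BIG_N_00."""
    # Assumed name shape: "+", an upper-case stem, "_", a two-digit index.
    match = re.fullmatch(r"\+([A-Z]+)_(\d{2})", positive_focal)
    if match is None:
        raise ValueError(f"not a basic concept name: {positive_focal!r}")
    stem, index = match.groups()
    # Terminal prefix "$" plus the infix "_N_" (i.e. "not") before the index.
    return f"${stem}_N_{index}"


negative_focal = negative_name("+BIG_00")
```

Because the derived name shares its stem with the positive focal concept, the pair stays visibly linked inside a dimension, which is what allows $BIG_N_00 to act as a descriptor for $BIG_N_01.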

On the other hand, the gradation and polarity parameters shape the IS-A taxonomy of qualities. Firstly, both focal concepts are siblings. Secondly, neighbouring concepts around the focal ones should be subsumed by their corresponding focal concept. For example, Figure 4 displays the hierarchical structure of the dimension of size (Figure 3).

Figure 4: A sample of the quality taxonomy. [Tree diagram; node labels: #QUALITY, #PHYSICAL, #QUANTITATIVE, +BIG_00, $BIG_N_00, +HEAVY_00, +UGLY_00, +BIG_01, $BIG_N_01, $BULKY_00, $MONSTROUS_00.]

Therefore, these two parameters help to formally describe qualities around two complementary axes: a horizontal axis in which meaning oppositions are organised internally and a vertical axis in which qualities are related by subsumption.
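The vertical axis can be pictured as a small IS-A graph in which a concept may have more than one superordinate. The sketch below encodes only the relations stated in the text (focal concepts as siblings, neighbouring concepts under their focal concept, +BIG_01 as one superordinate of $MONSTROUS_00). Two points are assumptions made for illustration: the focal concepts are attached directly to #QUALITY, skipping any intermediate metaconcepts, and +UGLY_00 is taken as the second superordinate of $MONSTROUS_00. The names `PARENTS` and `subsumes` are hypothetical.

```python
# Child -> list of direct superordinates (a concept may have several).
PARENTS: dict[str, list[str]] = {
    "+BIG_00": ["#QUALITY"],      # assumption: intermediate metaconcepts omitted
    "$BIG_N_00": ["#QUALITY"],    # sibling of the positive focal concept
    "+UGLY_00": ["#QUALITY"],     # assumption, as above
    "+BIG_01": ["+BIG_00"],       # neighbour subsumed by the positive focal concept
    "$BIG_N_01": ["$BIG_N_00"],   # neighbour subsumed by the negative focal concept
    "$MONSTROUS_00": ["+BIG_01", "+UGLY_00"],  # second superordinate is an assumption
}


def subsumes(ancestor: str, concept: str) -> bool:
    """Return True if `ancestor` is `concept` itself or is reachable through IS-A links."""
    stack, seen = [concept], set()
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return False
```

For instance, `subsumes("+BIG_00", "$MONSTROUS_00")` holds through +BIG_01, whereas `subsumes("$BIG_N_00", "$MONSTROUS_00")` does not.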

3.2 The composition of meaning postulates

A meaning postulate is a set of one or more logically connected predications (e1, e2... en), which are cognitive constructs carrying the generic features of the concept.11 Concepts, and not words, are the building blocks for the formal description of meaning postulates, so a meaning postulate becomes a language-independent semantic knowledge representation. To illustrate, some predications in the meaning postulate of STRONG are presented:

11 The formal grammar of well-formed predications for meaning postulates in FunGramKB is described in Periñán-Pascual and Arcas-Túnez (2004).



+STRONG_00
*(e1: +BE_01 (x1: +HUMAN_00 ^ +ANIMAL_00)Theme (x2: +STRONG_00)Attribute)
*(e2: +HAVE_00 (x1)Theme (x3: m +ENERGY_00)Referent (f1: ((e3: pos +MOVE_00 (x1)Agent (x4)Theme (x5)Location (x6)Origin (x7)Goal) (e4: +BE_01 (x4)Theme (x8: +HEAVY_00)Attribute)))Result)

These predications have the following natural language equivalent: A person or animal that is strong has a lot of physical power so that they can move heavy things.

In FunGramKB, semantic types of qualities (mainly, gradation and polarity parameters) help to determine the structural pattern of predications in their meaning postulates. More particularly, the canonical layout structure is as follows:

a. The first predication provides information about prototypical entities to which the quality is usually assigned. This predication is structured as follows:

(e1: +BE_01 (x1: <entity>)Theme (x2: <quality>)Attribute)

For example: +ROUND_00 *(e1: +BE_01 (x1: +CORPUSCULAR_00)Theme (x2: +ROUND_00)Attribute)

b. In case of cognitive dimension, the second predication can explicitly state the entity which best describes that dimension. This predication is structured as follows:

(e2: +BE_00 (x2: <quality>)Theme (x3: <entity>)Referent)

For example:

+RED_00 +(e2: +BE_00 (x2: +RED_00)Theme (x3: +COLOUR_00)Referent)

c. In case of meaning opposition, the third predication describes the quality in relation to a focal concept. This predication can be structured by one of the following patterns:

[i] (e3: n +BE_01 (x1: <entity>)Theme (x4: <quality>)Attribute)

[ii] (e3: +BE_01 (x1: <entity>)Theme (x4: m <quality>)Attribute)

[iii] (e3: +BE_01 (x1: <entity>)Theme (x4: p <quality>)Attribute)

For example: [i] $INTELLIGENT_N_00 [daft, idiotic, imbecile, silly, stupid] *(e2: n +BE_01 (x1: +HUMAN_00)Theme (x3: +INTELLIGENT_00)Attribute)

[ii] $INTELLIGENT_01 [brainy, brilliant, gifted] *(e2: +BE_01 (x1: +HUMAN_00)Theme (x3: m +INTELLIGENT_00)Attribute)

[iii] $INTELLIGENT_N_01 [simple] *(e2: +BE_01 (x1: +HUMAN_00)Theme (x3: p $INTELLIGENT_N_00)Attribute)

The choice of one particular type of pattern depends on the location of the definiendum within the meaning opposition. In other words, [i] is used when describing the negative focal concept—where x4 is the positive one, and [ii]-[iii] for the description of any other concept in the dimension except for the focal ones—indeed, one of the focal concepts should be referenced by x4. When the definiendum plays the role of positive focal concept (i.e. +INTELLIGENT_00), there is no need to include this third predication, so that redundancy is minimized. On the contrary, the concept located in the central zone of the opposition requires two predications for the negation of both focal concepts:

$MATURE_00 *(e3: n +BE_01 (x1: $MATURE_00)Theme (x4: +YOUNG_00)Attribute) *(e4: n +BE_01 (x1)Theme (x5: +YOUNG_N_00)Attribute)

d. Further features about the differentiae can be described in other predications of the meaning postulate.
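The choice among the opposition patterns in (c) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of FunGramKB: the function name `opposition_predications` and the position labels are hypothetical, and the reasoning operator that precedes each predication is left out.

```python
def opposition_predications(entity, position, positive_focal, negative_focal,
                            reference=None):
    """Return the opposition predication(s) of a quality as strings.

    position is a hypothetical label: 'positive_focal', 'negative_focal',
    'central', 'ii' or 'iii' (the last two name patterns [ii] and [iii]).
    reference is the focal concept referenced by x4 in patterns [ii]/[iii].
    """
    theme = f"+BE_01 (x1: {entity})Theme"
    if position == "positive_focal":
        # No third predication is needed, so redundancy is minimised.
        return []
    if position == "negative_focal":
        # Pattern [i]: negation of the positive focal concept.
        return [f"(e3: n {theme} (x4: {positive_focal})Attribute)"]
    if position == "central":
        # Central zone: both focal concepts are negated.
        return [
            f"(e3: n {theme} (x4: {positive_focal})Attribute)",
            f"(e4: n +BE_01 (x1)Theme (x5: {negative_focal})Attribute)",
        ]
    if position in ("ii", "iii"):
        if reference is None:
            raise ValueError("patterns [ii] and [iii] need a focal concept")
        operator = "m" if position == "ii" else "p"
        return [f"(e3: {theme} (x4: {operator} {reference})Attribute)"]
    raise ValueError(f"unknown position: {position}")
```

Called with `"$MATURE_00"`, `"central"`, `"+YOUNG_00"` and `"+YOUNG_N_00"`, the sketch yields the two negated predications shown for $MATURE_00 above.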

This description of the structural pattern of meaning postulates is aimed at qualities with a single parent node in the ontology. In case of multiple inheritance (i.e. multi-parent qualities), the pattern is slightly different, since the second predication is used to list all superordinate concepts, connected by means of logical operators:

(e2: +BE_01 (x1)Theme (x3: <superordinate1> <op> <superordinate2> <op>… <superordinaten>)Attribute)

For example:

$CLAMMY_00 *(e1: +BE_01 (x1: +CORPUSCULAR_00)Theme (x2: $CLAMMY_00)Attribute) *(e2: +BE_01 (x1)Theme (x3: +HOT_N_00 & +WET_01)Attribute)
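As a minimal sketch, the second predication of a multi-parent quality could be assembled as follows. The helper name `superordinate_predication` is hypothetical, and the sketch assumes, for simplicity, that the same logical operator connects all superordinate concepts.

```python
def superordinate_predication(superordinates, op="&"):
    """Build the e2 predication listing all superordinate concepts.

    Simplifying assumption: one operator (default '&') joins every pair.
    """
    if len(superordinates) < 2:
        raise ValueError("multiple inheritance needs at least two parents")
    joined = f" {op} ".join(superordinates)
    return f"(e2: +BE_01 (x1)Theme (x3: {joined})Attribute)"


# Reproduces the second predication of $CLAMMY_00.
clammy_e2 = superordinate_predication(["+HOT_N_00", "+WET_01"])
```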

3.3 Reasoning in Natural Language Understanding

Some NLP systems, e.g. machine translation or dialogue-based systems, attempt to “understand” the input text by translating it into a formal language-independent representation. This approach requires a knowledge base with conceptual representations which reflect the structure of human beings’ cognitive system. However, this type of knowledge-based system should also be provided with a reasoning engine. In this respect, semantic types of FunGramKB qualities (mainly, intersectivity and dynamism) can enhance reasoning results in AI systems.

A key issue in natural language understanding is the treatment of non-monotonicity. In FunGramKB, each predication taking part in a meaning postulate is preceded by a reasoning operator in order to state whether the predication is strict (+) or defeasible (*). The FunGramKB inference engine handles predications as rules, allowing monotonic reasoning with strict predications and non-monotonic reasoning with defeasible predications. Strict predications are law-like rules, which have no exceptions: e.g. whales are mammals, circles are round. On the other hand, defeasible predications can be defeated by contrary evidence: e.g. birds typically fly.12

12 An accurate account of how these operators work in common-sense reasoning is described in Periñán-Pascual and Arcas-Túnez (2004).

The intersectivity parameter of qualities can determine the choice of the reasoning operator in their meaning postulates. Indeed, in the case of subjective qualities, all predications in the meaning postulate are headed by the defeasible operator. The reason lies in the fact that, because the truth value of some qualities is subject to speakers’ interpretation, it cannot really be assured that all individuals will share the same truth value when perceiving the same instance of an entity. To illustrate, the meaning postulate of HEAVY is presented:

+HEAVY_00
*(e1: +BE_01 (x1)Theme (x2: +HEAVY_00)Attribute)
*(e2: +WEIGH_00 (x1)Theme (x3: +MUCH_00)Attribute (f1: (e3: pos +MOVE_00 (x4: +HUMAN_00)Agent (x1)Theme (x5)Location (x6)Origin (x7)Goal (f2: +DIFFICULT_00)Manner))Result)

When you say that an instance of an entity is heavy, it is supposed to weigh a lot, but what do you mean by ‘a lot’? The answer is directly dependent on the individual’s physical strength and/or the weight of other instances of that entity. In other words, the assessment of weight is conditioned by the context of the world model, leading to the relativism of the truth value of the predications in the above example. Therefore, strict operators are not appropriate for this case.

With regard to temporal reasoning in natural language understanding, FunGramKB also contributes to mitigating the effects of the persistence problem in temporal projection. Until the 1980s, much debate was raised over the “frame problem” (McCarthy and Hayes, 1969), i.e. the construction of a logic-based model to efficiently represent the things which remain the same as actions are performed. Many solutions were then proposed, and the classical problem was solved. Most of these proposals were based on the “common sense law of inertia”, whereby “a fluent13 is assumed to persist unless there is reason to believe otherwise” (Shanahan, 1997). However, when reasoning over time with dynamic domains, many of the solutions proposed to the classical frame problem do not work for the persistence problem, i.e. the difficulty of determining which things remain the same in a changing world (Morgenstern, 1996). The problem is that some properties change even when no event occurs that interrupts them. In FunGramKB, the dynamism parameter helps to predict which qualities are liable to change and which ones remain the same in a given situation, which is particularly valuable in understanding unexpected changes.

13 In temporal logics, a fluent is usually understood as anything whose truth value is subject to change over time.
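The difference between the strict (+) and defeasible (*) reasoning operators described in this section can be illustrated with a toy rule checker. It is a deliberately simplified sketch and not the FunGramKB inference engine: the `Rule` class, the `holds` function and the representation of contrary evidence as a set of pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    operator: str  # "+" for strict, "*" for defeasible
    concept: str
    prop: str


def holds(rules, concept, prop, contrary_evidence=frozenset()):
    """Return True if prop can be concluded for concept.

    contrary_evidence holds (concept, prop) pairs known to contradict a
    rule; it can only defeat defeasible rules, never strict ones.
    """
    for rule in rules:
        if rule.concept != concept or rule.prop != prop:
            continue
        if rule.operator == "+":
            return True  # strict: law-like, no exceptions
        if rule.operator == "*" and (concept, prop) not in contrary_evidence:
            return True  # defeasible: holds unless defeated
    return False


RULES = [
    Rule("+", "whale", "mammal"),  # whales are mammals
    Rule("*", "bird", "fly"),      # birds typically fly
]
```

With these rules, `holds(RULES, "bird", "fly")` succeeds by default but fails once `("bird", "fly")` appears in the contrary evidence, whereas the strict whale rule is unaffected by any such evidence.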

4 Conclusions

Currently most NLP systems adopt a relational approach to represent lexical meanings, since it is easier to state associations among lexical units by means of meaning relations than to describe the cognitive content of lexical units formally. Although large-scale development of deep-semantic resources requires a lot of time and effort, the expressive power of conceptual meanings is much more robust (Periñán-Pascual and Arcas-Túnez, 2007). Within this framework, describing semantic properties of qualities according to the syntactic criteria of traditional lexical semantics would have been an inconsistent decision. Moreover, the cognitive approach to semantic types benefits the construction of a sound NLP knowledge base. On the one hand, parameters such as gradation and polarity improve cognitive economy in conceptual description and organization, and facilitate the efficient management and maintenance of the knowledge base. On the other hand, parameters such as intersectivity and dynamism contribute to improving the performance of reasoning engines in natural language understanding systems.

Bibliography

Bateman, J.A., R. Henschel, and F. Rinaldi. 1995. The Generalized Upper Model 2.0. Technical report. IPSI/GMD, Darmstadt.

Cruse, D.A. 1986. Lexical semantics. Cambridge University Press, Cambridge.

EAGLES. 1999. Preliminary recommendations on lexical semantic encoding. Technical report EAGLES LE3-4244.

McCarthy, J. and P.J. Hayes. 1969. Some philosophical problems from the standpoint of AI. Machine Intelligence, 4: 463-502.

Morgenstern, L. 1996. The problem with solutions to the frame problem. K.M. Ford and Z.W. Pylyshyn (eds.) The robot’s dilemma revisited: the frame problem in AI. Ablex, Norwood: 99-133.

Periñán-Pascual, C. and F. Arcas-Túnez. 2004. Meaning postulates in a lexico-conceptual knowledge base. Proceedings of the 15th International Workshop on Databases and Expert Systems Applications. 38-42, IEEE, Los Alamitos.

Periñán-Pascual, C. and F. Arcas-Túnez. 2005. Microconceptual-knowledge spreading in FunGramKB. Proceedings of the 9th IASTED International Conference on Artificial Intelligence and Soft Computing. 239-244, ACTA Press, Anaheim-Calgary-Zurich.

Periñán-Pascual, C. and F. Arcas-Túnez. 2006. Reusing computer-oriented lexica as foreign-language electronic dictionaries. Anglogermánica Online, 4: 69-93.

Periñán-Pascual, C. and F. Arcas-Túnez. 2007. Cognitive modules of an NLP knowledge base for language understanding. Procesamiento del Lenguaje Natural, 39: 197-204.

Periñán-Pascual, C. and F. Arcas-Túnez. 2007. Deep semantics in an NLP knowledge base. Proceedings of the 12th Conference of the Spanish Association for Artificial Intelligence. 279-288, Universidad de Salamanca, Salamanca.

Peters, I. and W. Peters. 2000. The treatment of adjectives in SIMPLE: theoretical observations. Proceedings of the Second International Conference on Language Resources and Evaluation, Athens.

Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik. 1985. A comprehensive grammar of the English language. Longman, London.

Raskin, V. and S. Nirenburg. 1995. Lexical semantics of adjectives: a microtheory of adjectival meaning. Technical report MCCS-95-288, Computing Research Laboratory- New Mexico State University.

Raskin, V. and S. Nirenburg. 1998. An applied ontological semantic microtheory of adjective meaning for natural language processing. Machine Translation, 13: 135-227.

Shanahan, M. 1997. Solving the frame problem: a mathematical investigation of the common sense law of inertia. MIT Press, Cambridge (MA).



Corpus Linguistics

From Dependencies to Constituents in the Reference Corpus for the Processing of Basque (EPEC)

Arantza Diaz de Ilarraza Sánchez Enrique Fernández Terrones

University of the Basque Country LSI Department

Manuel Lardizabal pasealekua z/g 20018 Donostia (Gipuzkoa)

{a.diazdeilarraza}{sisfetek}@ehu.es

Izaskun Aldezabal Roteta Maria Jesús Aranzabe Urruzola University of the Basque Country

Basque Philology Department Sarriena auzoa z/g

48940 Leioa (Bizkaia) {izaskun.aldezabal}{maxux.aranzabe}@ehu.es

Summary: This article describes the process adopted to transform a treebank annotated with dependencies into a treebank annotated with constituents. The work first takes into account the characteristics of both formalisms and then proposes the corresponding linguistic equivalences. Finally, it briefly explains how the conversion was developed through successive refinements of those equivalences. The evaluation of the work is satisfactory, since it is now possible to exploit and work with corpora annotated in the two formalisms normally used in syntactic tagging. If the linguistic equivalences are the same, the conversion can be extended to other corpora; otherwise, new equivalences would have to be defined. Keywords: treebank, dependency formalism, constituent formalism, formalism conversion, linguistic equivalences, converter

Abstract: In this paper the process for turning a dependency-based corpus into a constituent-based one is explained. For this purpose, both the dependency and the constituent formalisms are first analyzed, and then the corresponding equivalences of linguistic phenomena are treated. This process has had different phases in which the linguistic equivalences have been improved. Finally, the evaluation process is briefly explained and, as a result, we get corpora annotated in the two different formalisms usually proposed for syntactic tagging. If the linguistic equivalences are the same, the conversion process could be extended to other corpora; otherwise, new equivalences should be defined. Keywords: treebank, dependency-based, constituent-based, conversion of formalisms, linguistic equivalences, converter

1 Introduction

In this paper we present the process followed to build CBT (Constituent-based Basque Treebank). CBT is a new syntactically annotated resource built semiautomatically from the manually annotated Dependency-based Basque Treebank (DBT). It is a resource motivated by the CESS-ECE Project (HUM2004-21127; http://clic.ub.edu/cessece), with the aim of making the resources developed for Spanish, Catalan and Basque compatible. As a result, we have the corpus syntactically tagged following the two models generally used in the annotation task, which gives us flexibility when interchanging information for the development of different parser types. This kind of work has been addressed in Xia & Palmer (2001) and Civit et al. (2006), among others. In this paper we discuss the decisions taken during the automatic translation from the dependency-based to the constituent-based model.

A treebank is a text corpus in which each sentence has been annotated with its syntactic structure. The construction of a treebank, although expensive, is indispensable for the development of real applications in the field of Natural Language Processing (NLP). At a purely linguistic level, a treebank is an essential database for the study of a language, given that it provides analyzed/annotated examples of real language. Kakkonen (2005) and Abeillé (2003) present the state of the art of dependency-based treebanks.

2 EPEC Corpus

The Basque Dependency Treebank (BDT) is actually the Reference Corpus for the Processing of Basque (EPEC) annotated following the dependency model. The EPEC Corpus of Basque is a 300,000-word collection of written standard Basque that has been manually tagged at different levels (morphology, surface syntax, phrases). A small part of this collection has been obtained from the EEBS project (http://www.euskaracorpusa.net), and the rest from issues of Euskaldunon Egunkaria (not accessible at this moment), the only daily newspaper written entirely in standard Basque, dating from the second half of 1999 and from 2000. The articles were chosen so that they covered an assorted range of topics (economics, culture, international, local, opinion, politics, sports, entertainment …). This corpus is being used for Natural Language Processing and, despite its small size, it is a strategic resource for a minority language like Basque. The corpus has been morphosyntactically analyzed by means of MORFEUS (Alegria et al., 1996). Thus, each word-form of the whole corpus was assigned all of its possible segmentations, without taking into account the context in which it appeared. After that, we carried out the manual disambiguation process (Aldezabal et al., 2007a) by selecting the correct segmentation and analysis. This manually disambiguated corpus was used both to improve a Constraint Grammar disambiguator and to develop a stochastic tagger. We chose the Constraint Grammar (CG) formalism (Karlsson et al., 1995; Tapanainen & Voutilainen, 1997).

These two automatic taggers helped us in the task of manually disambiguating the corpus at the lemmatization level. The corpus manually disambiguated at the lemmatization level is then processed sequentially by means of two tools that we briefly explain below: EIHERA and IXATI.

• EIHERA identifies entities corresponding to the categories: Institution, Person and Location (Alegria et al, 2006).

• IXATI Chunker (Aduriz et al., 2006). Besides verb chains and noun phrase units, the IXATI chunker identifies complex postpositions. As far as the manual tagging process is concerned, only the detection of the latter, complex postpositions, is useful.

The dependency tagging process starts with the outcome of these tools. The linguistic information obtained in all the processes has been represented following a general stand-off schema that uses TEI-conformant feature structures (FS) coded in XML (Artola et al., 2005).

3 Two models: dependency-based and constituent-based

Phrase-structure theory and dependency theory are two different methods of conceptualizing the linguistic structure of sentences. Focusing on dependency theory, we should stress that in grammars constructed following dependencies (e.g., Hudson, 1990; Mel'cuk, 1988), syntax is handled in terms of grammatical relations between pairs of individual words, such as the relation between the subject and the predicate or between a modifier and a common noun. Grammatical relations are seen as subtypes of a general, asymmetrical dependency relation: one of the words (the head) determines the syntactic and semantic features of the combination. In addition, the head also controls the characteristics and placement of the other word (the dependent). The syntactic structure of a sentence as a whole is built up from such dependency relations between individual pairs of words. On the one hand, based on a number of tests set out in Skut et al. (1997), Tapanainen & Järvinen (1998) and Oflazer et al. (1999) to deal with free word-order languages, we decided to follow the dependency-based procedure rather than the phrase-structure one. On the other hand, the requirements for integrating the Catalan, Spanish and Basque treebanks imposed in the framework of the CESS-ECE project led us to perform the translation to the constituent-based model.

It should be noted that the formalization of syntactic tagging following the Dependency Model was the first such approach for Basque. The syntactic description of Basque has been mainly developed within the generative framework by Goenaga (1991), Eguzkitza (1993), Laka (1993), Artiagoitia (2002) and Trask (2003), and other attempts have been made in general and applied linguistics (Odriozola & Zabala, 1993; Zabala, I., 2003).

3.1.1 Constituency-based formalism

In this type of formalism, every single constituent that makes up a syntactic constituent is tagged, including the syntactic category itself; thus, the final result derives from defining the emerging constituents and their categories (noun phrases, sentences, etc.). The most complete and most widely-used English corpus, namely the Penn Treebank (Marcus et al., 1993), employs this sort of tagging. This method has two outstanding properties:

1. It is based on linear word order; that is to say, the order of syntactic components reflects the order in which they appear in the sentence.

2. Hierarchical information is made explicit.

3.1.2 Dependency-based formalism

Unlike the constituency-based approach, the dependency-based formalism (Järvinen & Tapanainen, 1997) describes the relations between the components. This tagging formalism has been used for German (NEGRA) (Brants et al., 2003) and Czech (PDT) corpora1 (Böhmová et al., 2003), among others. The properties of this method are:

1. The relevance of word order is minimized.

2. It is a method strongly based on hierarchical relations.

3. The functional information is extremely important.

4 Equivalences of linguistic phenomena

In this section we explain the two steps followed in the conversion process. First of all, we established the equivalences between constituent and dependency tags. The tags used differ depending on the criteria adopted. The constituent-based system we rely on is the one developed for Catalan and Spanish in the CESS-ECE project. We explain these equivalences in subsection 4.1.

Secondly, the tree format has to be changed to the constituent format. This process is briefly described in subsection 4.2.

1 http://ufal.mff.cuni.cz/pcedt/doc/PCEDT_main.html

4.1 Table of equivalences

Taking the dependency-based annotation of the treebank as our starting point, we have split our study of equivalences into three groups. In the first one, we deal with the tags related to those elements that are classified as non-clauses; in the second one, with those related to subordinated clauses; and finally, we focus on coordination. In addition, we mention some other equivalences needed for elements that are not considered to belong to the phrase level.

Before giving details about the equivalences established in each group, let us show an example annotated following both formalisms.

(1) Dima Arratiako bailaran dago.

(‘Dima is in the valley of Arratia’)

Dependency-style
ncmod (gel, bailaran, Arratiako, Arratiako)
ncmod (ine, dago, bailaran, bailaran)
ncsubj (abs, dago, Dima, Dima, subj)

Constituent-style
(S
  (sn =func:SUJ= (grup.nom (w62 Dima Dima)))
  (sp =func:CC= (sp =func:CC= (grup.nom (w63 Arratiako Arratia))) (grup.nom (w64 bailaran bailara)))
  (gv (w65 dago egon)))
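For readers who want to process dependency-style lines such as those in example (1), a minimal reader could look like the sketch below. The function name `parse_relation` is hypothetical, and the arguments are returned as a plain list because the source does not spell out the meaning of every position.

```python
import re

# A relation looks like: tag (arg1, arg2, ...), e.g. "ncmod (ine, dago, bailaran, bailaran)".
RELATION = re.compile(r"^\s*(\w+)\s*\(([^()]*)\)\s*$")


def parse_relation(line):
    """Split one dependency-style line into (tag, [arguments])."""
    match = RELATION.match(line)
    if match is None:
        raise ValueError(f"not a dependency relation: {line!r}")
    tag = match.group(1)
    args = [arg.strip() for arg in match.group(2).split(",")]
    return tag, args


example_1 = [
    "ncmod (gel, bailaran, Arratiako, Arratiako)",
    "ncmod (ine, dago, bailaran, bailaran)",
    "ncsubj (abs, dago, Dima, Dima, subj)",
]
relations = [parse_relation(line) for line in example_1]
```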

The set of tags used in the dependency model is based on the proposal made in (Carroll et al., 1998) and is thoroughly explained in (Aldezabal et al., 2007b; Aranzabe et al., 2003). As for the constituent style, some references on the tags used and the syntactic functions defined can be found in (Civit et al., 2006). Here we only mention the syntactic functions employed in the dependency-style system, namely: subject, associated with SUJ; direct object, associated with CD; indirect object, associated with CI; predicative and attribute, associated with CPRED and ATR; and circumstantial complements, associated with CC. Other functions such as CAG, C.REG, CCT and CCL, used only in the constituents, are not treated in this step.

4.1.1 Non-clausal phrases

In the dependency-style tags used for Basque, we have not distinguished among phrases headed by a noun, an adjective or an adverb, nor according to whether or not the phrase contains a preposition. We generalize and consider all of them non-clausal phrases (nc). These non-clausal phrases are marked with their respective declension case. The constituent style, on the other hand, distinguishes phrases that have a preposition (sp) from those that do not (sn, sa, sadv).

In non-clausal phrases quite a range of categories can be the head: noun (IZE), determiner or adjective when the noun is omitted (DET, ADJ), pronoun (IOR), adverb (ADB), and ellipsis just after the verb (ADI_IZEELI, ADT_IZEELI). Therefore, all of them have to be taken into account.

The dependency tag also encodes the function (subject, object, indirect object, predicative and modifier). This information is given separately in the constituent-based treebank, so all the combinations have to be defined, e.g. ncsubj -> sn-SUJ / sa-SUJ.

There is one dependency tag (gradmod) that, although not an “nc” tag, has the same equivalence as a non-clausal modifier (“ncmod”) headed by an adverb.

1 (dep. tag)2   2 (head category / case)                  3 (const. tag)   4 (function)
ncsubj          IZE ADI_IZEELI ADT_IZEELI DET ADJ IOR     sn               SUJ
ncsubj          -                                         sn               SUJ
ncsubj          ADJ                                       sa               SUJ
ncobj           IZE ADI_IZEELI ADT_IZEELI DET ADJ IOR     sn               CD
ncobj           -                                         sn               CD
ncobj           ADJ                                       sa               CD
nczobj          -                                         sp               CI
ncmod           ADJ                                       sa               CC
ncmod           ADB                                       sadv             CC
ncmod           -                                         sp               CC
ncpred                                                    sn               ATR
gradmod                                                   sadv

Table 1: equivalences for non-clausal phrases

2 The meaning of the numbers is the following: 1- The dependency tag. 2- The category of the head, and sometimes also the case of the phrase. 3- The constituent tag. 4- The function assigned to the constituent.

For instance, in the previous example (1) the second “ncmod” in the dependencies (“bailaran” ‘in the valley’) is equivalent to the most prominent “sp-CC” in the constituents; this “ncmod” has, at the same time, another “ncmod” inside (“Arratiako”, ‘of Arratia’), which we have decided to map as an “sp-CC” as well. On the other hand, the “ncsubj” (“Dima” ‘Dima’) is equivalent to the “sn-SUJ” of the constituents.
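The equivalences for non-clausal phrases lend themselves to a simple lookup. The sketch below transcribes only part of Table 1 and is not the authors' converter: the names `BY_HEAD`, `FALLBACK` and `constituent_for` are hypothetical, and reading the rows marked “-” as the default for any other head category is an assumption.

```python
# (dependency tag, head category) -> (constituent tag, function); partial transcription.
BY_HEAD = {
    ("ncsubj", "IZE"): ("sn", "SUJ"),
    ("ncsubj", "IOR"): ("sn", "SUJ"),
    ("ncobj", "IZE"): ("sn", "CD"),
    ("ncobj", "IOR"): ("sn", "CD"),
    ("ncmod", "ADJ"): ("sa", "CC"),
    ("ncmod", "ADB"): ("sadv", "CC"),
}

# Rows whose second column is "-", read here (an assumption) as the fallback.
FALLBACK = {
    "ncsubj": ("sn", "SUJ"),
    "ncobj": ("sn", "CD"),
    "nczobj": ("sp", "CI"),
    "ncmod": ("sp", "CC"),
}


def constituent_for(dep_tag, head_cat=None):
    """Return (constituent tag, function) for a non-clausal dependency tag."""
    if (dep_tag, head_cat) in BY_HEAD:
        return BY_HEAD[(dep_tag, head_cat)]
    if dep_tag in FALLBACK:
        return FALLBACK[dep_tag]
    raise KeyError(f"no equivalence defined for {dep_tag!r}")
```

Under these assumptions, an “ncmod” headed by an adverb maps to sadv-CC, while an “ncmod” with any other head falls back to sp-CC.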

4.1.2 Subordinated clauses

Regarding subordinated clauses, in the dependency tags we distinguish between finite (c) and non-finite (x) clauses, and then the function is added (e.g. xcomp_obj for non-finite subordinated clauses that have object function). In the constituent tags there is no finiteness distinction and only the “S” tag is used. The function is added separately, so all the combinations again have to be taken into account.

1 (dep. tag)   2 (head category / case)   3 (const. tag)   4 (function)
cmod                                      S                CC
xmod                                      S                CC
xcomp_obj                                 S                CD
ccomp_obj                                 S                CD
xcomp_subj                                S                SUJ
ccomp_subj                                S                SUJ
xcomp_zobj                                S                CI
xpred                                     S                ATR

Table 2: equivalences for subordinated clauses

For instance, in example (2) the subordinated clause “ekitaldi guztiak eguraldiaren beldurrik gabe egin ahal izateko” (‘so that all the events could be held without any problem’) is tagged as “xmod”, and it is equivalent to the “S-CC” tag of the constituents; this “xmod” has, at the same time, an “xcomp_obj” inside (“ekitaldi guztiak eguraldiaren beldurrik gabe egin”, ‘be held without any problem’) that is equivalent to the “S-CD” of the constituents.

(2) Lau estalpe ezarri dituzte Zumeltzako zelaietan, ekitaldi guztiak eguraldiaren beldurrik gabe egin ahal izateko.

(‘Four shelters have been put in the field of Zumeltza, so that all the events could be held without any problem’)

Dependency-style
auxmod (-, ezarri, dituzte)
detmod (-, ekitaldi, guztiak)
detmod (-, estalpe, Lau)
ncmod (gel, zelaietan, Zumeltzako, Zumeltzako)
ncmod (gen, beldurrik_gabe, eguraldiaren, eguraldiaren)
ncmod (ine, ezarri, zelaietan, zelaietan)
ncmod (par_post_zero, egin, beldurrik_gabe, beldurrik_gabe)
ncobj (abs, egin, ekitaldi, guztiak, obj)
ncobj (abs, ezarri, estalpe, estalpe, obj)
xcomp_obj (konpl, ahal_izateko, egin, egin)
xmod (helb, ezarri, ahal_izateko, ahal_izateko)

Constituent-style
(S
  (sn =func:CD= (espec (w93 Lau lau)) (grup.nom (w94 estalpe estalpe)))
  (gv (w95 ezarri ezarri) (w96 dituzte *edun))
  (sp =func:CC= (sp =func:CC= (grup.nom (w97 Zumeltzako Zumeltza))) (grup.nom (w98 zelaietan zelai)))
  (S =func:CC=
    (S =func:CD=
      (sn =func:CD= (grup.nom (w100 ekitaldi ekitaldi)) (espec (w101 guztiak guzti)))
      (sp =func:CC= (sp =func:CC= (grup.nom (w102 eguraldiaren eguraldi))) (pos3 beldurrik_gabe gabe))
      (gv (w105 egin egin)))
    (gv (mw1 ahal_izateko ahal_izan))))
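The mapping for subordinated clauses is a flat table, so it can be sketched directly as a dictionary. The names `CLAUSE_FUNCTION`, `clause_label` and `is_finite` are hypothetical; the entries themselves are taken from Table 2.

```python
# Dependency tag of a subordinated clause -> function of the "S" constituent.
CLAUSE_FUNCTION = {
    "cmod": "CC",
    "xmod": "CC",
    "xcomp_obj": "CD",
    "ccomp_obj": "CD",
    "xcomp_subj": "SUJ",
    "ccomp_subj": "SUJ",
    "xcomp_zobj": "CI",
    "xpred": "ATR",
}


def clause_label(dep_tag):
    """Return the constituent label, e.g. 'S-CC'; finiteness is dropped."""
    return "S-" + CLAUSE_FUNCTION[dep_tag]


def is_finite(dep_tag):
    """Finite clauses start with 'c', non-finite ones with 'x'."""
    if dep_tag not in CLAUSE_FUNCTION:
        raise KeyError(dep_tag)
    return dep_tag.startswith("c")
```

For example (2), `clause_label("xmod")` gives "S-CC" and `clause_label("xcomp_obj")` gives "S-CD", matching the tags discussed in the text.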

4.1.3 Coordination

The coordinated elements are marked as “lot” in the dependencies, and the conjunction is their head, taking the corresponding function. In the constituents, the conjunction marks the coordination, and the coordinated elements receive their corresponding phrasal category and function.

Since almost all the main category elements can be coordinated, all the specifications must be made. For example, a “lot” element will be “sp” if the head of the phrase is a noun (IZE) and the case is neither absolutive (ABS) nor ergative (ERG); conversely, a “lot” element will be “sn” if the head of the phrase is a noun (IZE) and the case is either ABS or ERG. Then, the function is specified (subj-SUJ, obj-CD…).

1 (dep. tag)   2 (head category / case)      3 (const. tag)
lot            ADI ADT                       S
lot            IZE                           -
lot            IZE, neither ABS nor ERG      sp
lot            ADB                           sadv

Table 3: equivalences for coordination
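The rule for coordinated elements can be written as a small decision function. This is a sketch based on the prose description and Table 3 (noun heads become “sn” with ABS or ERG case and “sp” otherwise); the function name `lot_category` is hypothetical.

```python
def lot_category(head_cat, case=None):
    """Return the constituent tag for a coordinated ('lot') element."""
    if head_cat in ("ADI", "ADT"):
        return "S"
    if head_cat == "ADB":
        return "sadv"
    if head_cat == "IZE":
        if case is None:
            raise ValueError("noun heads need a declension case")
        # Absolutive and ergative nouns are sn; any other case gives sp.
        return "sn" if case in ("ABS", "ERG") else "sp"
    raise ValueError(f"no equivalence defined for head category {head_cat!r}")
```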

For instance, in example (3) “trenak” ‘trains’ and “autobusak” ‘buses’ are coordinated objects. Thus, in the dependencies the tag in both cases is “lot” and the conjunction “eta” is tagged as “ncobj”. In the constituents, the conjunction “coord” coordinates the two nominal groups (“grup.nom”), which together are tagged as an “sn-CD”.

(3) Eusko Trenbideak trenak eta autobusak jarriko ditu Donostiako geltokian.

(‘Eusko Trenbideak is going to put trains and buses in the Donostia station’)

Dependency-style
auxmod (-, jarriko, ditu)
lot (emen, eta, autobusak)
lot (emen, eta, trenak)
ncmod (gel, geltokian, Donostiako, Donostiako)
ncmod (ine, jarriko, geltokian, geltokian)
ncobj (abs, jarriko, eta, autobusak, obj)
ncsubj (erg, jarriko, Eusko_Trenbideak, Eusko_Trenbideak, subj)

Constituent-style
(S
  (sn =func:SUJ= (grup.nom (ent8 Eusko_Trenbideak eusko_trenbide)))
  (sn =func:CD= (sn (grup.nom (w70 trenak tren))) (coord (w71 eta eta)) (sn (grup.nom (w72 autobusak autobus))))
  (gv (w73 jarriko jarri) (w74 ditu *edun))
  (sp =func:CC= (sp =func:CC= (grup.nom (w75 Donostiako Donostia))) (grup.nom (w76 geltokian geltoki))))

4.1.4 Non-phrase-level elements

Sometimes elements below the phrase level have to be tagged, and they need to be mapped element by element. Some of them, such as “grup.nom” and “gv”, can be coordinated. Therefore, they have to be grouped.

Dep. element     Const. element   Grouped (y/n)
IZE              grup.nom         y
DET              espec            n
ITJ              interjeccio      n
LOT              coord            n
ADI ADT ADL      gv               y
PRT&lema=ez      neg              n

Table 4: non-phrase-level elements

For instance, in the above example (3) “trenak” ‘trains’ and “autobusak” ‘buses’ are not tagged with phrase-level tags (as seen in section 4.1.3) because they are in coordination; so they have to be identified by their category before the equivalence is applied. In the example, “trenak” ‘trains’ and “autobusak” ‘buses’ are nouns (IZE), and their constituent equivalents are two nominal groups (“grup.nom”).

4.2 From tree to constituent format

Once the equivalences are well defined, a program analyses the dependency tree from the top to the branches. As dependency tags are found, their corresponding constituent-based tags are created by opening brackets. Once a branch ends, the bracket is closed. Thus, we get the constituent hierarchy from the top level (sentence and phrase level) down to word level. However, the hierarchical structure of some intermediate levels (such as grup.nom) must be analyzed in more depth.
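The bracket-opening and bracket-closing traversal described above can be sketched as a recursive top-down walk. This is not the authors' program: the `Node` class and `to_brackets` function are hypothetical, and the sketch assumes that every node has already been given its constituent tag and function through the equivalence tables.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                  # constituent tag or word identifier, e.g. "sn", "w62"
    func: str = ""              # syntactic function such as "SUJ"; empty if none
    words: tuple = ()           # (word form, lemma) for word-level nodes
    children: list = field(default_factory=list)


def to_brackets(node):
    """Open a bracket for the node, emit its branch, then close the bracket."""
    parts = [node.label]
    if node.func:
        parts.append(f"=func:{node.func}=")
    parts.extend(node.words)
    parts.extend(to_brackets(child) for child in node.children)
    return "(" + " ".join(parts) + ")"


# Example (1), "Dima Arratiako bailaran dago", built by hand for illustration.
tree = Node("S", children=[
    Node("sn", "SUJ", children=[
        Node("grup.nom", children=[Node("w62", words=("Dima", "Dima"))])]),
    Node("sp", "CC", children=[
        Node("sp", "CC", children=[
            Node("grup.nom", children=[Node("w63", words=("Arratiako", "Arratia"))])]),
        Node("grup.nom", children=[Node("w64", words=("bailaran", "bailara"))])]),
    Node("gv", children=[Node("w65", words=("dago", "egon"))]),
])
bracketed = to_brackets(tree)
```

Because each bracket is closed exactly when its branch ends, the output is balanced by construction, which makes unbalanced brackets in converter output easy to detect.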

5 Evaluation

The process has been carried out through successive refinements. In the first step, general equivalences were established and the conversion was performed according to them. After examining a sample of the resulting output, mistakes were corrected and new refinements were introduced. This sequence of steps was repeated until satisfactory results were obtained.

As a first approach, we have manually evaluated 25 sentences of the corpus, and 5 of them failed to yield a correct constituent structure. We briefly explain the main reasons:

- Sentence connectors have not been treated in detail, so many of them are not represented in the constituents.

- We have not studied in depth the correct representation of multiword and discontinuous expressions, so they appear separated and sometimes without any tag.

- Some phrases do not get any function, since the equivalences do not cover all the possible contexts.

- Punctuation marks that mark coordination are not treated as such, so they are simply placed as tokens in the hierarchy without any other information.

In any case, with 80% correctness, we can say that the conversion tool is quite robust at this stage. In the future we should improve the results by solving the phenomena we have found and others that we have probably not detected yet.

6 Conclusions

In this paper the process for turning a dependency-based corpus into a constituent-based one has been explained. For this purpose, the corresponding equivalences of linguistic phenomena are treated. The process has had different phases in which the linguistic equivalences have been improved. The 300,000 words contained in EPEC have been converted. The treebanks in both formats are freely available for research purposes.

Furthermore, the tool can be useful for other corpora if the linguistic equivalences are the same.



Bibliography

Abeillé, A. (2003). Treebanks: Building and Using Parsed Corpora, Kluwer Academic Publishers, Dordrecht, The Netherlands.

Aduriz, I.; Aranzabe, M. J.; Arriola, J. M.; Atutxa, A.; Díaz de Ilarraza, A.; Ezeiza, N.; Gojenola, K.; Oronoz, M.; Soroa, A.; Urizar, R. (2006). Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for automatic processing. Corpus Linguistics Around the World. Book series: Language and Computers. Vol. 56 (pp. 1-15). ISBN 90-420-1836-4. Ed. Andrew Wilson, Paul Rayson, and Dawn Archer. Rodopi. Netherlands.

Aldezabal I., Ceberio K., Esparza I., Estarrona A., Etxeberria J., Iruskieta Quintian M., Izagirre E., Uria L. (2007a). EPEC (Euskararen Prozesamendurako Erreferentzia Corpusa) segmentazio-mailan etiketatzeko eskuliburua. UPV/EHU / LSI / TR 11-2007.

Aldezabal I., Aranzabe M., Arriola J., Díaz de Ilarraza A., Estarrona A., Fernandez K., Uria L., Quintian M. (2007b). EPEC (Euskararen Prozesamendurako Erreferentzia Corpusa) dependentziekin etiketatzeko eskuliburua. UPV/EHU / LSI / TR 12-2007.

Alegria I., Arregi O., Ezeiza N., Fernandez I. (2006). Lessons from the Development of a Named Entity Recognizer. Procesamiento del Lenguaje Natural, ISSN 1135-5948 Revista nº 36, pag. 25-37.

Alegria I., X. Artola & K. Sarasola. (1996). Automatic morphological analysis of Basque. Literary & Linguistic Computing Vol. 11, No. 4. Oxford: Oxford University Press. 193-203.

Aranzabe M., Arriola J., Atutxa A., Balza I., Uria L. (2003) Guía para la anotación sintáctica manual de Eus3LB (corpus del euskera anotado a nivel sintáctico, semántico y pragmático). UPV/EHU/LSI/TR 13-2003

Artola X., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Labaka G., Sologaistoa A., Soroa A. A framework for representing and managing linguistic annotations based on typed feature structures. RANLP 2005. ISBN: 954-91743-3-6.

Artiagoitia, X. (2002). The functional structure of the Basque noun phrase. Erramu Boneta: Festschrift for Rudolf P.G. de Rijk. Supplements of International Journal of Basque Linguistics and Philology: 73-90.

Böhmová, A.; Hajič, J.; Hajičová, E. and Hladká, B. (2003). The PDT: a 3-level annotation scenario. In Abeillé, editor, Treebanks: Building and Using Parsed Corpora. Kluwer Academic Publishers, Dordrecht, The Netherlands.

Brants, T.; Skut, W.; Krenn, B. and Uszkoreit, H. (2003). Syntactic annotation of a German newspaper corpus. In Abeillé, editor, Treebanks: Building and Using Parsed Corpora, Kluwer Academic Publisher, Dordrecht, The Netherlands.

Carroll, J.; Briscoe, T. and Sanfilippo, A. (1998). Parser evaluation: a survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, 447-454. Granada, Spain.

Civit M., Martí A., Bufí N. (2006). Cat3lb and Cast3lb: from constituents to dependencies. Advances in Natural Language Processing (LNAI 4139), pp. 141-153. Springer Verlag, Berlin.

Eguzkitza, A. (1993). Adnominals in the Grammar of Basque. In J. I. Hualde and J. Ortiz de Urbina, eds., Generative Studies in Basque Linguistics, 163-187. Amsterdam: John Benjamins.

Goenaga, P. (1991). Gramatika bideetan. Donostia: Erein.

Hudson, R. (1990). Word Grammar. Oxford, England: Basil Blackwell Publishers Limited.

Järvinen, T. and Tapanainen, P. (1997). A Dependency Parser for English. Technical Report, No. TR-1, Department of General Linguistics. University of Helsinki.

Kakkonen, T. (2005). Dependency Treebanks: Methods, Annotation Schemes and Tools. In Proceedings of the 15th Nordic Conference of Computational Linguistics. Finland.

Karlsson, F., Voutilainen, A., Heikkilä, J. and Anttila, A. (1995). Constraint Grammar. Mouton de Gruyter. Berlin.



Laka, I. (1993). Unergatives that Assign Ergative, Unaccusatives that Assign Accusative. In J. Bobaljik and C. Phillips, eds., Papers on Case & Agreement 1, 149-172. Cambridge: MIT Working Papers in Linguistics, Volume 18.

Marcus, M.; Santorini, B. and Marcinkiewicz, M.A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.

Mel’čuk, I.A. (1988). Dependency syntax: theory and practice. Albany: State University of New York Press.

Odriozola, J. C. and Zabala, I. (1993). Izen-sintagma. Idazkera teknikoa II. EHUko argitalpen zerbitzua, Bilbo.

Oflazer, K.; Zynep, D. and Tür, G. (1999). Design for a Turkish Treebank. Proceedings of Workshop on Linguistically Interpreted Corpora, at EACL’99, Bergen.

Skut, W.; Krenn, B.; Brants, T. and Uszkoreit, H. (1997). An Annotation Scheme for Free Word Order Languages, Fifth Conference on Applied Natural Language Processing (ANLP’97), Washington, DC, USA, 88-95.

Tapanainen, P. and Järvinen, T. (1997). A non-projective dependency parser. Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP'97), 64-71

Tapanainen, P. and Järvinen, T. (1998). Dependency concordances. International Journal of Lexicography, 11 (3): 187-203. September.

Trask, R. L. (2003). The Noun Phrase: nouns, determiners and modifiers; pronouns and names. In José Ignacio Hualde & Jon Ortiz de Urbina (eds.), A Grammar of Basque. Mouton de Gruyter. Berlin-New York.

Xia, F., Palmer, M. (2001). Converting dependency structures to phrase structures. In Proc. Int. Conf. on Human Language Technology, HLT-2001. San Diego, CA.

Zabala I. (2003). Nominal Predication. In J.I. Hualde and J. Ortiz de Urbina (eds.), A Grammar of Basque: 426-446. Mouton de Gruyter. Berlin.



A Web-Platform for Preserving, Exploring, Visualising, and Querying Linguistic Corpora and other Resources

Plataforma web para el mantenimiento, exploración, visualización y búsqueda en corpus lingüísticos y en otros recursos

Georg Rehm, Oliver Schönefeld, Andreas Witt

SFB 441 Linguistic Data Structures

University of Tübingen

Nauklerstrasse 35

72074 Tübingen, Germany

[email protected]

[email protected]

[email protected]

Christian Chiarcos

SFB 632 Information Structure

University of Potsdam

Karl-Liebknecht-Strasse 24–25

14476 Potsdam, Germany

[email protected]

Timm Lehmberg

SFB 538 Multilingualism

University of Hamburg

Max-Brauer-Allee 60

22765 Hamburg, Germany

[email protected]

Resumen: Presentamos SPLICR, una plataforma de sostenibilidad para corpus y recursos lingüísticos basada en web. El sistema está destinado a personas que trabajan en el campo de la lingüística o de la lingüística computacional. Consiste en una base de datos extensa para metadatos que puede ser explorada para buscar recursos lingüísticos que pudieran ser apropiados para las necesidades específicas de una investigación. SPLICR también ofrece una interfaz gráfica que permite a los usuarios buscar y visualizar los corpus. El proyecto en el que se ha desarrollado el sistema aspira a archivar de modo sostenible aproximadamente sesenta recursos lingüísticos, que han sido construidos mediante la colaboración de tres centros de investigación. Nuestro proyecto tiene dos metas principales: (a) Procesar y archivar recursos de forma sostenible, de manera que los recursos sigan siendo accesibles para la comunidad científica dentro de cinco, diez, o incluso veinte años. (b) Permitir a los investigadores buscar en los recursos tanto a nivel de metadatos como a nivel de anotaciones lingüísticas. En términos más generales, nuestro objetivo es proporcionar soluciones que posibiliten la interoperabilidad, reutilización y sostenibilidad de compilaciones heterogéneas de recursos de lenguaje.

Palabras clave: sostenibilidad, mantenimiento, búsqueda, corpus, recursos, XML

Abstract: We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one’s specific research needs. SPLICR also provides a graphical interface that enables users to query and to visualise corpora. The project in which the system is developed aims at sustainably archiving the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) To process and to archive sustainably the resources so that they are still available to the research community in five, ten, or even 20 years’ time. (b) To enable researchers to query the resources both on the level of their metadata as well as on the level of linguistic annotations. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collections of language resources.

Keywords: Sustainability, Preservation, Querying, Corpora, Resources, XML



1. Introduction

This contribution presents SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources, aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored and searched in order to find language resources that could be appropriate for one’s specific research needs. SPLICR also provides a graphical interface that enables users to query and to visualise corpora.

The project in which SPLICR is developed aims at sustainably archiving (Trilsbeek and Wittenburg, 2006) the language resources that have been constructed or are still work in progress in three collaborative research centres. The groups in Tübingen (SFB 441: “Linguistic Data Structures”), Hamburg (SFB 538: “Multilingualism”), and Potsdam/Berlin (SFB 632: “Information Structure”) built a total of 56 resources, mostly corpora and treebanks. According to estimates it took more than one hundred person years to collect and to annotate these datasets. Our project has two goals: (a) To process and to sustainably archive the resources so that they are still available to the research community and other interested parties in five, ten, or even 20 years’ time (Schmidt et al., 2006). (b) To enable researchers to query the resources both on the level of their metadata as well as on the level of linguistic annotations. In more general terms, our main goal is to enable solutions that leverage the interoperability, reusability, and sustainability of a large collection of heterogeneous language resources.

The remainder of this paper is structured as follows: section 2 introduces our approach to normalising corpus data (section 2.1) and metadata records (section 2.2). SPLICR’s architecture is described in section 3, although we are only able to highlight selected parts of the system due to space restrictions. The staging area is briefly discussed in section 3.1, while section 3.2 gives an overview of our approach to representing knowledge about linguistic annotation schemes using ontologies. A third major component of the system is the graphical corpus query and visualisation front-end (section 3.3). The article ends with concluding remarks (section 4).

2. Data Normalisation and Representation

One of the obstacles we are confronted with is providing homogeneous means of accessing a large collection of diverse and complex linguistic resources. For this purpose we developed several custom tools in order to normalise the corpora (section 2.1) and their metadata records (section 2.2).

2.1. Normalisation of Linguistic Resources

Language resources are usually built using XML-based languages nowadays (Ide et al., 2000; Sperberg-McQueen and Burnard, 2002; Wörner et al., 2006; Lehmberg and Wörner, 2007) and contain several concurrent annotation layers that correspond to multiple levels of linguistic description (e.g., part-of-speech, syntax, coreference). Our approach includes the normalisation of XML-annotated resources, e.g., for cases in which corpora use PCDATA content to capture both primary data (i.e., the original text or transcription) as well as annotation information (e.g., POS tags). We use a set of tools to ensure that only primary data is encoded in PCDATA content and that all annotations proper are encoded using XML elements and attributes.
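The PCDATA normalisation described above can be illustrated with a minimal sketch. It assumes a hypothetical input format in which each token element `w` fuses the word and its POS tag with a slash (`house/NN`); the element name, the separator, and the `pos` attribute are assumptions made for illustration and are not taken from the project's tools.

```python
import xml.etree.ElementTree as ET

def normalise_tokens(root, sep="/"):
    """Move a POS tag fused into PCDATA ('house/NN') into a 'pos' attribute."""
    for tok in root.iter("w"):            # 'w' is a hypothetical token element
        text = (tok.text or "").strip()
        # Split on the LAST separator so that words containing it survive.
        word, found, tag = text.rpartition(sep)
        if found and word and tag:        # only rewrite tokens that carry a tag
            tok.text = word               # PCDATA now holds primary data only
            tok.set("pos", tag)           # annotation lives in an attribute
    return root

raw = "<s><w>the/DT</w><w>house/NN</w><w>1/2/CD</w></s>"
root = normalise_tokens(ET.fromstring(raw))
print(ET.tostring(root, encoding="unicode"))
```

After the call, the text content of every `w` element is primary data only, which is the invariant the normalisation step is meant to establish.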

Another reason for the normalisation procedure is that both hierarchical and timeline-based corpora (Bird and Liberman, 2001; Schmidt, 2005) need to be transformed into a common annotation approach, because we want our users to be able to query both types of resources at the same time and in a uniform way. Our approach (Dipper et al., 2006; Schmidt et al., 2006; Wörner et al., 2006) can be compared to the NITE Object Model (Carletta et al., 2003): we developed tools that semiautomatically split hierarchically annotated corpora that typically consist of a single XML document instance into individual files, so that each file represents the information related to a single annotation layer (Witt et al., 2007; Rehm et al., 2008a); this approach also guarantees that overlapping structures can be represented straightforwardly. Timeline-based corpora are also processed in order to separate graph annotations. This approach enables us to represent arbitrary types of XML-annotated corpora as individual files, i.e., individual XML element trees. These are encoded as regular XML document instances, but, as a single corpus comprises multiple files, there is a need to go beyond the functionality offered by typical XML tools to enable us to process multiple files, as regular tools work with single files only (Rehm et al., 2007a; Rehm et al., 2008b).
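The idea of splitting one inline-annotated document into separate annotation layers over the same primary data can be sketched as follows. This is only an illustration under stated assumptions: the mapping `LAYER_OF` from element names to layers is hypothetical, and the sketch records character spans per layer rather than writing one XML file per layer as the tools described above do.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mapping from element names to annotation layers.
LAYER_OF = {"s": "syntax", "np": "syntax", "w": "pos"}

def split_layers(root):
    """Return (primary_text, {layer: [(tag, attrs, start, end), ...]})."""
    chunks = []                    # pieces of primary text, in document order
    layers = defaultdict(list)
    pos = 0                        # running character offset into primary text

    def visit(elem):
        nonlocal pos
        start = pos
        if elem.text:
            chunks.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            visit(child)
            if child.tail:         # text that follows the child element
                chunks.append(child.tail)
                pos += len(child.tail)
        layer = LAYER_OF.get(elem.tag)
        if layer:                  # record the span this element covers
            layers[layer].append((elem.tag, dict(elem.attrib), start, pos))

    visit(root)
    return "".join(chunks), dict(layers)

doc = ('<s><np><w pos="DT">the</w> <w pos="NN">house</w></np> '
       '<w pos="VBZ">stands</w></s>')
text, layers = split_layers(ET.fromstring(doc))
print(text)
for name, spans in layers.items():
    print(name, spans)
```

Because every layer refers to the same primary text by offset, spans from different layers may overlap freely, which is the property the multi-file representation is meant to guarantee.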

2.2. Normalisation of Metadata Records

The separation of the individual annotation layers contained in a corpus has serious consequences with regard to legal issues (Zimmermann and Lehmberg, 2007; Lehmberg et al., 2007a; Lehmberg et al., 2007b; Lehmberg et al., 2008; Rehm et al., 2007b): due to copyright and personal rights specifics that usually apply to a corpus’s primary data we provide a fine-grained access control layer to regulate access by means of user accounts and access roles. We have to be able to explicitly specify that a certain user only has access to the set of, say, six annotation layers (in this example they might be available free of charge for research purposes) but not to the primary data, because they might be copyright-protected (Rehm et al., 2007b; Rehm et al., 2007c).

Our generic metadata schema, eTEI, is based on the TEI P4 header (Sperberg-McQueen and Burnard, 2002) and extended by a set of additional requirements. Both eTEI records and the corpora are stored in an XML database. The underlying assumption is that XML-annotated datasets are more sustainable than, for example, data stored in a proprietary relational DBMS. The main difference between eTEI and other approaches is that the generic eTEI metadata schema, currently formalised as a single document type definition (DTD), can be applied to five different levels of description (Trippel, 2004; Himmelmann, 2006). One eTEI file contains information on one of the following levels: (1) setting (recordings or transcripts of spoken language, describes the situation in which the speech or dialogue took place); (2) raw data (e.g., a book, a piece of paper, an audio or video recording of a conversation etc.); (3) primary data (transcribed speech, digital texts etc.); (4) annotations; (5) a corpus (consists of primary data with one or more annotation levels).

We devised a workflow that helps users edit eTEI records (Rehm et al., 2008a). The workflow’s primary components are the eTEI DTD and the Oxygen XML editor. Based on structured annotations contained in the DTD we can automatically generate an empty XML document with embedded documentation and a Schematron schema. The Schematron specification can be used to check whether all elements and attributes instantiated in an eTEI document conform to the current level of metadata description.
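The kind of level-dependent check that the Schematron schema performs can be approximated in a few lines. The sketch below is not Schematron and does not use the real eTEI element set: the `level` attribute, the element names, and the `ALLOWED` table are all hypothetical and serve only to show the principle of validating a record against the level of description it declares.

```python
import xml.etree.ElementTree as ET

# Hypothetical: which elements may occur at which level of description.
ALLOWED = {
    "setting":    {"record", "title", "participants", "location"},
    "raw":        {"record", "title", "medium"},
    "primary":    {"record", "title", "language", "encoding"},
    "annotation": {"record", "title", "tagset", "annotator"},
    "corpus":     {"record", "title", "language", "layers"},
}

def check_level(xml_text):
    """Return a list of messages for elements not allowed at the record's level."""
    root = ET.fromstring(xml_text)
    level = root.get("level")
    if level not in ALLOWED:
        return [f"unknown level: {level!r}"]
    return [f"element <{e.tag}> not allowed at level '{level}'"
            for e in root.iter() if e.tag not in ALLOWED[level]]

ok = '<record level="annotation"><title>x</title><tagset>STTS</tagset></record>'
bad = '<record level="raw"><title>x</title><tagset>STTS</tagset></record>'
print(check_level(ok))
print(check_level(bad))
```

An empty result means the record only uses elements permitted at its level; each message otherwise points at one offending element.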

3. Architecture

The sustainability platform consists of a front-end and a back-end. The front-end is the user-visible part and is realised using JSP (Java Server Pages) and Ajax technology. It runs in the user’s browser and provides functions for searching and exploring metadata records and corpus data. The back-end hosts the JSP files and related data. It accesses two different databases, the corpus database and the system database, as well as a set of ontologies and additional components.[1] The corpus database is an XML database, extended by the AnnoLab system (Eckart and Teich, 2007), in which all resources and metadata are stored. The system database is a relational database that contains all data about user accounts, resources (i.e., annotation layers), resource groups (i.e., corpora) and access rights. A specific user can only access a specific resource if the permissions for this user/resource tuple allow it.

[1] In the file vault area of the system, SPLICR contains additional data about a resource, such as the original corpus data files, PDF files that act as documentation, and transformation scripts, amongst others. These additional files are available through the user interface as well by providing access via HTTP.
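The permission model of the system database (resources, resource groups, and per-user access rights) can be sketched with an in-memory SQLite database. The table and column names below are assumptions chosen for illustration, and SQLite merely stands in for the relational database used by the platform; the point is only that access is granted when, and only when, a row exists for the user/resource tuple.

```python
import sqlite3

# Hypothetical schema: resource groups are corpora, resources are layers.
SCHEMA = """
CREATE TABLE resource_group (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE resource (
    id INTEGER PRIMARY KEY, name TEXT UNIQUE,
    group_id INTEGER REFERENCES resource_group(id));
CREATE TABLE permission (
    username TEXT, resource_id INTEGER REFERENCES resource(id),
    PRIMARY KEY (username, resource_id));
"""

def can_access(conn, username, resource_name):
    """True only if a permission row exists for this user/resource tuple."""
    row = conn.execute(
        "SELECT 1 FROM permission p JOIN resource r ON r.id = p.resource_id "
        "WHERE p.username = ? AND r.name = ?",
        (username, resource_name),
    ).fetchone()
    return row is not None

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO resource_group VALUES (1, 'demo-corpus')")
conn.executemany("INSERT INTO resource VALUES (?, ?, 1)",
                 [(1, "primary-data"), (2, "pos-layer"), (3, "syntax-layer")])
# 'alice' may read two annotation layers but not the primary data.
conn.executemany("INSERT INTO permission VALUES ('alice', ?)", [(2,), (3,)])

print(can_access(conn, "alice", "pos-layer"))
print(can_access(conn, "alice", "primary-data"))
```

This mirrors the example in section 2.2, where a user can be given access to annotation layers while the possibly copyright-protected primary data stays closed.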


[Figure 1: architecture diagram. Labelled components: corpus processing and normalisation at the three sites (SFB 441, University of Tübingen; SFB 538, University of Hamburg; SFB 632, University of Potsdam) with the eTEI editor and custom normalisation tools; remote copy into the staging area (a directory tree on a server holding eTEI metadata records, original corpus data, normalised corpus data separated into individual annotation layers, and XML-based manifest files); importer service; XML database with AnnoLab; SQL system database (user data, access restrictions); file vault (for additional files); dispatcher (JSP); HTTP/REST; front-end/browser; administration interface.]

Figure 1: Resource normalisation and SPLICR’s staging area

The following subsections describe three selected parts of SPLICR’s architecture: the staging area (section 3.1), a set of ontologies of linguistic annotations (section 3.2) and the querying front-end (section 3.3).

3.1. Staging Area

A new resource is imported into the sustainability platform by (remotely) copying all corresponding files into the staging area, whose directory structure is defined in a technical specification. Strict naming rules apply for the processed files (see section 2) and for the directories so that the whole directory tree can be traversed and processed automatically. Each corpus contains a manifest file that is represented in a simple XML format and that acts as a corpus inventory. Manifest files are automatically generated by the normalisation tools; their contents are used by the GUI and by the import and export tools. The importer traverses the staging area, checks the data for consistency (among other things), and imports the corpus data and metadata records into the XML database (we currently use eXist but are exploring several alternatives) using a REST-style HTTP interface. At the same time, new resource and resource group records as well as permissions are set up in the system database (MySQL). Permissions are chosen based on the restrictions defined in metadata records.
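The traversal and consistency check performed by the importer can be sketched as below. The manifest format (a `manifest.xml` per corpus directory with `file` elements carrying a `path` attribute) and the base URL are assumptions made for this example; the sketch only plans the uploads and does not send any HTTP request.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def plan_import(staging_root, base_url):
    """Walk the staging area; return (uploads, problems).

    uploads  -- list of (local_file, target_url) pairs that would be PUT
    problems -- files listed in a manifest but missing on disk
    """
    uploads, problems = [], []
    for manifest in sorted(Path(staging_root).rglob("manifest.xml")):
        corpus_dir = manifest.parent
        for item in ET.parse(manifest).getroot().iter("file"):
            rel = item.get("path")
            local = corpus_dir / rel
            if not local.is_file():           # consistency check
                problems.append(f"missing: {corpus_dir.name}/{rel}")
                continue
            uploads.append((local, f"{base_url}/{corpus_dir.name}/{rel}"))
    return uploads, problems

# Demonstration with a throw-away staging area (all names hypothetical).
BASE = "http://localhost:8080/exist/rest/db/splicr"   # assumed collection URL
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "demo"
    corpus.mkdir()
    (corpus / "manifest.xml").write_text(
        '<manifest><file path="meta.xml"/><file path="pos.xml"/></manifest>')
    (corpus / "meta.xml").write_text("<record/>")      # pos.xml is left out
    uploads, problems = plan_import(tmp, BASE)

print([url for _, url in uploads])
print(problems)
```

Separating the planning step from the actual transfer keeps the consistency check testable without a running database; a real importer would follow the plan with one HTTP request per entry.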

3.2. Ontologies of Linguistic Annotation

The corpora that we process are marked up using several different markup languages and linguistic tag sets. As we want to enable users to query multiple corpora at the same time, we need to provide a unifying view of the markup languages used in the original resources. For this sustainable operationalisation of existing annotation schemes we employ the ontologies of linguistic annotation (OLiA) approach: we built an OWL DL ontology that serves as a terminological reference. This reference model is based on the EAGLES recommendations for morphosyntax, the general ontology for linguistic description (Farrar and Langendoen, 2003), and the SFB632 annotation standard (Dipper et al., 2007). It covers reference specifications for word classes and morpho-syntax (Chiarcos, 2007), and is currently being extended to syntax and information structure. The reference model represents a terminological backbone that different annotations are linked to and consists of three components: a taxonomy of linguistic categories (OWL classes such as Noun, CommonNoun), a taxonomy of grammatical features (OWL classes, e.g., Accusative), and relations (OWL properties, e.g., hasCase).

An annotation model is an ontology that represents one specific annotation scheme (see figure 2). We built, among others, annotation models for the SFB632 annotation format (Dipper et al., 2007) used in typological research, TIGER/STTS (Schiller et al., 1999; Brants et al., 2003), two tag sets for Russian and five tag sets for English, e.g., Susanne (Sampson, 1995), and PTB (Marcus et al., 1993). The linking between annotation models and the reference model is specified in separate OWL files.

Figure 2: The Susanne tag APPGf, its representation within the annotation model and linking with the reference model

Any tag from an annotation model can be retrieved from the reference model by a description in terms of OWL classes and properties. For this task we developed OntoClient, a query preprocessor implemented in Java that uses an OWL DL reasoner to retrieve the set of individuals that conform to a particular description with regard to the reference model. OntoClient enables us to use abstract linguistic concepts such as Verb or Noun in a query. By means of an XQuery extension function, these concepts are expanded into the concrete tag names used in the annotation schemes of the corpora that are currently in the user’s focus.
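The expansion of an abstract concept into concrete tags can be illustrated with plain dictionaries standing in for the reference model and the linking files. This is a toy substitute, not OntoClient and not an OWL DL reasoner: `SUBCLASS_OF` and `LINKS` are hand-written assumptions that cover only a handful of STTS and Penn Treebank tags.

```python
# Toy stand-in for the reference model: class -> direct superclass.
SUBCLASS_OF = {
    "CommonNoun": "Noun",
    "ProperNoun": "Noun",
    "Noun": "Word",
    "Verb": "Word",
}

# Toy stand-in for the linking files: per annotation model, tag -> class.
LINKS = {
    "STTS": {"NN": "CommonNoun", "NE": "ProperNoun", "VVFIN": "Verb"},
    "PTB":  {"NN": "CommonNoun", "NNS": "CommonNoun",
             "NNP": "ProperNoun", "VB": "Verb"},
}

def is_a(cls, ancestor):
    """Follow superclass links upwards until the ancestor is found or we run out."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

def expand(concept, models):
    """Map an abstract concept to the concrete tags of each annotation model."""
    return {m: sorted(tag for tag, cls in LINKS[m].items() if is_a(cls, concept))
            for m in models}

print(expand("Noun", ["STTS", "PTB"]))
print(expand("Verb", ["STTS"]))
```

A query that mentions Noun can thus be rewritten, per corpus in focus, into a disjunction over that corpus's own tag names.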

3.3. The Corpus Query Front-End

As we cannot expect our target users (i.e., linguists) to be proficient in XML query languages such as XQuery, we provide an intuitive user interface that abstracts from the underlying data structures and querying methods actually used. The ontology of linguistic annotations (section 3.2) provides abstract representations of linguistic concepts (e.g., Noun, Verb, Preposition) that may have a specific set of features; operands can be used to glue together the linguistic concepts by dragging and dropping these graphical representations onto a specific area of the screen, building a query step by step. We collected a set of requirements and functions that the front-end should have (such as the ones briefly sketched at the beginning of this section) by conducting in-depth interviews with the staff members of SFB 441 and by asking them to fill out a questionnaire (Soehn et al., 2008).

Figure 3: The front-end in tree fragment query mode (upper left), single-tree corpus browsing mode (upper right) and multi-rooted tree display mode (below)

The front-end is implemented in JavaScript extended by the frameworks Prototype (http://www.prototypejs.org) and script.aculo.us (http://script.aculo.us). One of its central components is a graphical tree fragment query editor that supports the processing of multi-layer annotations and that interprets and translates graphical queries into XQuery. The front-end communicates with the back-end via Ajax, posting XQuery requests to a servlet running on the back-end. The servlet responds with the matches encoded in an XML format, which is then interpreted by a variety of display modules. Four different major display modes are already implemented: plain text view, XML view, graphical tree view and timeline view.

The tree fragment query editor (figure 3) involves dragging and dropping elements on an assembly pane, so that queries can be constructed in a step-by-step fashion. At the moment, structural nodes can be combined by dominance, precedence, and secondary edge relations. The structures defined by these graphs mirror the structures to be found. Each node may contain one or more conditions linked by boolean connectives that help to refine the node classes allowed in the structures. We plan to realise a set of functions that can be roughly compared to TIGERSearch’s feature set (Lezius, 2002) enhanced by our specific requirements, i.e., multi-layer querying and query expansion through ontologies.
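How a graphical tree fragment might be turned into a path expression can be sketched as follows. The sketch assumes, purely for illustration, that syntactic categories are encoded as XML element names; it handles dominance and precedence only (no secondary edges, no boolean conditions) and emits an XPath 1.0 expression of the kind that can be embedded in an XQuery. It is not the translation implemented in the front-end.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    cat: str                                         # category, e.g. "NP"
    dominates: list = field(default_factory=list)    # nodes required below
    precedes: list = field(default_factory=list)     # nodes required later

def to_xpath(node, top=True):
    """Translate a query tree into an XPath expression with nested predicates."""
    # Dominance: the other node occurs somewhere among the descendants.
    preds = [f".//{to_xpath(c, top=False)}" for c in node.dominates]
    # Precedence: the other node starts after this node has ended.
    preds += [f"following::{to_xpath(c, top=False)}" for c in node.precedes]
    step = node.cat + "".join(f"[{p}]" for p in preds)
    return f"//{step}" if top else step

# "An S that dominates an NP which precedes a VP."
query = Node("S", dominates=[Node("NP", precedes=[Node("VP")])])
print(to_xpath(query))
```

Each relation in the graphical query becomes one predicate, so the nesting of predicates mirrors the shape of the tree fragment the user has drawn.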

4. Concluding Remarks and Future Work

The research presented in this contribution is still work in progress. We want to highlight some of the aspects that we plan to realise by the end of 2008. The corpus normalisation and preprocessing phase is, with only minor exceptions, finished, and the process of transforming the existing metadata records into the eTEI format was completed in June. Work on the querying engine and integration of the XML database, metadata exploration and on the graphical visualisation and querying front-end (Rehm et al., 2008b) as well as on the back-end is ongoing; we plan to finish work on the first prototype of the platform by September.

In addition we plan several extensions and modifications for the eTEI schema. Most notably, we plan to replace the current DTD, based on TEI P4, with an XML Schema description that is based on the current version of the guidelines (P5) and realised by means of an ODD (“one document does it all”) specification. XML Schema has better and more appropriate facilities for including embedded documentation than the rather simple and unstructured comments available in DTDs. Another area that needs further work is the query front-end that we plan to upgrade and to enhance. In addition to a substantial overhaul of the interface in order to improve its usability, we will integrate query templates and saved searches that act like bookmarks in a web browser.

Acknowledgments

The authors would like to thank Solange Otermin and Wolfgang Maier for translating this paper’s abstract into Spanish. We would also like to thank the three anonymous reviewers for their helpful comments.

References

S. Bird and M. Liberman. 2001. A Formal Framework for Linguistic Annotation. Speech Communication, 33(1/2):23–60.

S. Brants, S. Dipper, P. Eisenberg, S. Hansen-Schirra, E. König, W. Lezius, C. Rohrer, G. Smith, and H. Uszkoreit. 2003. TIGER: Linguistic Interpretation of a German Corpus. Journal of Language and Computation.

J. Carletta, J. Kilgour, T. J. O’Donnell, S. Evert, and H. Voormann. 2003. The NITE Object Model Library for Handling Structured Linguistic Annotation on Multimodal Data Sets. In Proc. of the EACL Workshop on Language Technology and the Semantic Web (3rd Workshop on NLP and XML).

C. Chiarcos. 2007. An Ontology of Linguistic Annotation: Word Classes and Morphology. In Proc. of DIALOG 2007.

S. Dipper, E. Hinrichs, T. Schmidt, A. Wagner, and A. Witt. 2006. Sustainability of Linguistic Resources. In E. Hinrichs, N. Ide, M. Palmer, and J. Pustejovsky, editors, Proc. of the LREC 2006 Workshop Merging and Layering Linguistic Information, pages 48–54, Genoa, Italy, May.

S. Dipper, M. Götze, and S. Skopeteas, editors. 2007. Information Structure in Cross-Linguistic Corpora: Annotation Guidelines for Phonology, Morphology, Syntax, Semantics, and Information Structure, volume 7 of ISIS.

R. Eckart and E. Teich. 2007. An XML-Based Data Model for Flexible Representation and Query of Linguistically Interpreted Corpora. In G. Rehm, A. Witt, and L. Lemnitzer, editors, Data Structures for Linguistic Resources and Applications. Gunter Narr, Tübingen, Germany.

S. Farrar and D. T. Langendoen. 2003. A Linguistic Ontology for the Semantic Web. GLOT International, 3:97–100.

N. P. Himmelmann. 2006. Daten und Datenhuberei. Keynote speech, 28th annual meeting of the DGfS, University of Bielefeld, February.

N. Ide, P. Bonhomme, and L. Romary. 2000. XCES: An XML-based Standard for Linguistic Corpora. In Proc. of the Second Language Resources and Evaluation Conf. (LREC), pages 825–830, Athens.

T. Lehmberg and K. Wörner. 2007. Annotation Standards. In A. Lüdeling and M. Kytö, editors, Corpus Linguistics, Handbücher zur Sprach- und Kommunikationswissenschaft (HSK). de Gruyter, Berlin, New York. In press.



T. Lehmberg, C. Chiarcos, E. Hinrichs, G. Rehm, and A. Witt. 2007a. Collecting Legally Relevant Metadata by Means of a Decision-Tree-Based Questionnaire System. In Digital Humanities 2007, pages 164–166, Urbana-Champaign, IL, USA, June. ACH, ALLC.

T. Lehmberg, C. Chiarcos, G. Rehm, and A. Witt. 2007b. Rechtsfragen bei der Nutzung und Weitergabe linguistischer Daten. In G. Rehm, A. Witt, and L. Lemnitzer, editors, Datenstrukturen für linguistische Ressourcen und ihre Anwendungen – Data Structures for Linguistic Resources and Applications: Proc. of the Biennial GLDV Conf. 2007, pages 93–102. Gunter Narr, Tübingen.

T. Lehmberg, G. Rehm, A. Witt, and F. Zimmermann. 2008. Preserving Linguistic Resources: Licensing – Privacy Issues – Mashups. Library Trends. In print.

W. Lezius. 2002. Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. Ph.D. thesis, University of Stuttgart.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, June.

G. Rehm, R. Eckart, and C. Chiarcos. 2007a. An OWL- and XQuery-Based Mechanism for the Retrieval of Linguistic Patterns from XML-Corpora. In Int. Conf. Recent Advances in Natural Language Processing (RANLP 2007), pages 510–514, Borovets, Bulgaria, September.

G. Rehm, A. Witt, H. Zinsmeister, and J. Dellert. 2007b. Corpus Masking: Legally Bypassing Licensing Restrictions for the Free Distribution of Text Collections. In Digital Humanities 2007, pages 166–170, Urbana-Champaign, IL, USA, June. ACH, ALLC.

G. Rehm, A. Witt, H. Zinsmeister, and J. Dellert. 2007c. Masking Treebanks for the Free Distribution of Linguistic Resources and Other Applications. In Proc. of the Sixth Int. Workshop on Treebanks and Linguistic Theories (TLT 2007), number 1 in Northern European Association for Language Technology Proc. Series, pages 127–138, Bergen, Norway, December.

G. Rehm, O. Schönefeld, A. Witt, T. Lehmberg, C. Chiarcos, H. Bechara, F. Eishold, K. Evang, M. Leshtanska, A. Savkov, and M. Stark. 2008a. The Metadata-Database of a Next Generation Sustainability Web-Platform for Language Resources. In Proc. of the 6th Language Resources and Evaluation Conf. (LREC 2008), Marrakech, Morocco, May.

G. Rehm, R. Eckart, C. Chiarcos, and J. Dellert. 2008b. Ontology-Based XQuery’ing of XML-Encoded Language Resources on Multiple Annotation Layers. In Proc. of the 6th Language Resources and Evaluation Conf. (LREC 2008), Marrakech, Morocco, May.

G. Sampson. 1995. English for the Computer. The SUSANNE Corpus and Analytic Scheme. Clarendon, Oxford.

A. Schiller, S. Teufel, and C. Stöckert. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report, University of Stuttgart, University of Tübingen.

T. Schmidt, C. Chiarcos, T. Lehmberg, G. Rehm, A. Witt, and E. Hinrichs. 2006. Avoiding Data Graveyards: From Heterogeneous Data Collected in Multiple Research Projects to Sustainable Linguistic Resources. In Proc. of the E-MELD 2006 Workshop on Digital Language Documentation: Tools and Standards – The State of the Art, East Lansing, Michigan, June.

T. Schmidt. 2005. Time Based Data Models and the Text Encoding Initiative’s Guidelines for Transcription of Speech. Working Papers in Multilingualism, Series B, 62.

J.-P. Soehn, H. Zinsmeister, and G. Rehm. 2008. Requirements of a User-Friendly, General-Purpose Corpus Query Interface. In L. Burnard, K. Choukri, G. Rehm, T. Schmidt, and A. Witt, editors, Proc. of the LREC 2008 Workshop Sustainability of Language Resources and Tools for Natural Language Processing, Marrakech, Morocco, May 31.

C. M. Sperberg-McQueen and L. Burnard, editors. 2002. TEI P4: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium. XML Version: Oxford, Providence, Charlottesville, Bergen.

P. Trilsbeek and P. Wittenburg. 2006. Archiving Challenges. In J. Gippert, N. P. Himmelmann, and U. Mosel, editors, Essentials of Language Documentation, pages 311–335. Mouton de Gruyter, Berlin, New York.

T. Trippel. 2004. Metadata for Time Aligned Corpora. In Proc. of the LREC Workshop: A Registry of Linguistic Data Categories within an Integrated Language Repository Area, Lisbon.

A. Witt, O. Schönefeld, G. Rehm, J. Khoo, and K. Evang. 2007. On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees. In B. T. Usdin, editor, Proc. of Extreme Markup Languages 2007, Montreal, Canada, August.

K. Wörner, A. Witt, G. Rehm, and S. Dipper. 2006. Modelling Linguistic Data Structures. In B. T. Usdin, editor, Proc. of Extreme Markup Languages 2006, Montreal, Canada, August.

F. Zimmermann and T. Lehmberg. 2007. Language Corpora – Copyright – Data Protection: The Legal Point of View. In Digital Humanities 2007, pages 162–164, Urbana-Champaign, IL, USA, June. ACH, ALLC.


Information Retrieval

Sistema de Recomendación para la Recuperación Automática de Enlaces Web Rotos ∗

Recommendation System for Automatic Recovering Broken Web Links

Juan Martinez-Romo, UNED

[email protected]

Lourdes Araujo, UNED

[email protected]

Abstract: Both in the Web pages we visit while browsing the Internet and in our own pages, we sometimes find links that are no longer valid, and finding the page those links pointed to is often not easy. In this work we investigate different ways of recovering such pages automatically, so that the user can be offered a list of candidate Web addresses to replace the broken link. Specifically we use, alternatively or in combination depending on the characteristics of the page and the link, the anchor text and information extracted from the Web page that contains the broken link. The information extracted from these sources is used to issue a query to a standard search engine, such as Google or Yahoo. The system then ranks the retrieved pages by their content, using information retrieval techniques, and finally presents the result to the user. We report the results of an analysis of a large number of randomly selected links, which has allowed us to determine the conditions under which a recommendation can be made with a high degree of reliability.

Keywords: information retrieval, World Wide Web, broken links, link integrity

1. Introduction

The Web is a highly dynamic system in which pages of information constantly disappear, are created or are moved. As a result, some of the links in those pages break some time, longer or shorter, after they were created. We come across this situation frequently on the Internet. It also forces us to check our own Web sites periodically to verify that all their links are still valid. Finding the new location of the page that a broken link pointed to is not always trivial. Recovering links in our own pages should be easy, although it can be tedious.

∗ Work funded by project TIN2007-67581-C02-01.

There is some previous work on link recovery, although it relies on information annotated in advance in the link. The Webvise system (Grønbæk, Sloth, and Ørbæk, 1999), integrated with Microsoft software, allows a certain degree of recovery of broken Web links by using redundant information about the links that is stored in databases on Internet servers. The information is stored when the link is created or modified. Davis (2000) analyzes the causes of the broken-link problem and proposes solutions focused on collecting information about the structure of the link network. Nakamizo et al. (2005) have developed a link recovery system based on what they call the “authoritative links” of a page, which are other pages that link to it with links that are always updated when the Web page moves. To this end they use servers of this kind of pages. Shimada and Futakata (1998) proposed the creation of a link database, SEDB, in which certain repair operations on the stored links are possible. SEDB manages documents using typed links between them. Only the links are stored centrally, while the documents remain in their original locations. This system applies an automatic link repair designed to preserve the topology of the link network.

Other work, although not aimed at recovering broken links, has investigated mechanisms for extracting information from links and their contexts. Some of the mechanisms used in that work have been explored in our link recovery system. McBryan (1994) proposed using the anchor text as an aid to search. That work describes WWWW, a resource location tool. The program explores the Internet locating all kinds of Web resources, with which it builds a database. The WWWW search engine runs when a user accesses the page of the service and fills in a search form. From this information, which can be of several kinds, including URLs and anchors, string pattern searches are performed. Chakrabarti et al. (1998) have developed an algorithm for the automatic compilation of authoritative Web resources on any sufficiently broad topic. The algorithm is based on a combination of information extracted from a local analysis of the text and of the links of the pages.

Our work differs from the above in that it does not assume that any information about the links has been stored beforehand, and it is applicable to any page on the Internet.

When dealing with pages that we have reached by browsing the Internet, we can sometimes recover them by querying a Web search engine with the terms of the anchor of the broken link. In many cases, however, the anchor text is not informative enough to recover the desired page. We can then issue queries that complement the anchor information with data extracted from other sources: the Web page in which the link was found, the page stored by the search engine at its last indexing, the URL, etc.

In this work we have designed a system to automate this process. Our system checks the links of the page it is given. If any of them is broken, it proposes to the user a set of candidate pages to replace the broken link. The candidate pages are obtained through Internet searches composed of terms extracted from different sources. The pages retrieved by the Web search then go through a ranking process that refines the results before the recommendation is made to the user. Figure 1 shows a diagram of the proposed system.

We began this work by analyzing a large number of Web pages and their links in order to determine which sources of information, and which combinations of them, are most appropriate in each case. This analysis has allowed us to derive criteria for deciding when it makes sense to make a recommendation to the user, and when the available information is insufficient to carry out the recovery. In the latter case, the user is informed of the situation. If the information is sufficient, a recommendation of candidate pages ranked by relevance is made.

The rest of the paper is organized as follows: section 2 describes the methodology followed to study the usefulness of the different sources of information considered, while section 3 presents the results of these studies and analyzes their usefulness for our problem. Section 4 describes the process of ranking the documents obtained. Section 5 presents the scheme resulting from the previous analyses and its evaluation on a set of broken links, and finally section 6 summarizes the conclusions of the work.

[Figure 1 diagram. Elements: Web page; broken link; link information; page information; extraction of relevant terms; terms; search engine; Web pages; ranking; recommended pages.]

Figure 1: Diagram of how the recommendation system for the recovery of broken links works.

2. Methodology

If we analyze the usefulness of the different sources of information directly on broken links, it is very difficult to assess the quality of the candidate pages proposed to replace the link. For this reason, in this analysis phase we work with Web links taken at random that are not really broken, which we call supposedly broken. In this way we have the page they point to, and we can evaluate the recommendation made with each source of information.

2.1. Selection of the links to recover

To carry out the analysis, we take the links from pages selected at random through successive requests to www.randomwebsite.com, a site that provides random Web pages.

We have imposed certain requirements on our test pages:

- We try to restrict the language to English by considering the following domains: “.com”, “.org”, “.net”, “.gov” and “.edu”.
- We look for pages with at least 250 words, so that this text can be used to characterize the page. In addition, the text must contain at least ten terms that are not stop words, that is, common words such as articles, pronouns, etc., which are of no use for discriminating.

We also require the page to contain at least five potentially analyzable links, which means:

- The link must be online if we are looking for active links, or broken if we are looking for broken links. This parameter changes with the type of study.
- The system analyzes external links, so links pointing to the same site are discarded.
- The anchor text must not be empty, a number or a URL.
- If the anchor text is a single character that is also a punctuation mark, the link is discarded.

Pages that do not have these characteristics are discarded, and the selection process does not end until a total of one hundred pages has been collected, which means at least 500 links to study. Some preliminary experiments showed that it is common to find pages in which most of the links are correct and others in which most of the links are incorrect. When such pages have many links, they bias the results in one direction or the other. We therefore decided to limit the number of links taken from each page to ten. This subset of links is chosen for each test page following a uniform random distribution over the whole set of its “analyzable links”.
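The link selection rules of this section can be illustrated with a minimal sketch. It covers only the anchor-text and external-link filters and the uniform sampling of at most ten links per page; the liveness check, the domain restriction and the page-length requirements are left out. The function names and the way numbers and URLs are detected are assumptions made for the example, not details taken from the system.

```python
import random
import string
from urllib.parse import urlparse

MAX_LINKS_PER_PAGE = 10  # limit stated in the text


def is_analyzable(anchor_text, link_url, page_url):
    """Anchor and locality filters only; checking that the link is online is omitted."""
    text = anchor_text.strip()
    if not text:
        return False  # empty anchor
    if text.replace(".", "", 1).replace(",", "").isdigit():
        return False  # anchor is a number
    if text.lower().startswith(("http://", "https://", "www.")):
        return False  # anchor is itself a URL
    if len(text) == 1 and text in string.punctuation:
        return False  # single punctuation character
    link_host = urlparse(link_url).netloc.lower()
    page_host = urlparse(page_url).netloc.lower()
    # Relative links (no host) and links to the same host are internal.
    if not link_host or link_host == page_host:
        return False
    return True


def sample_links(links, page_url, rng=random):
    """links: list of (anchor_text, url) pairs; returns at most ten, chosen uniformly."""
    candidates = [(a, u) for a, u in links if is_analyzable(a, u, page_url)]
    k = min(MAX_LINKS_PER_PAGE, len(candidates))
    return rng.sample(candidates, k)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient when the same set of test links has to be reused across studies.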

3. Sources of Information

In this section we analyze each of the sources of information considered, extracting statistics on their usefulness for link recovery when they are applied separately or in combination.

3.1. Anchor text of the links

In many cases the words that make up the anchor text of a link are the main source of information for identifying the page pointed to. To check this hypothesis we carried out a study, shown in table 1, of the number of cases in which the broken links were recovered by searching Google for the anchor text in quotation marks.

To decide whether a link has been recovered, we use a combination of mechanisms. First we check whether the URL of the candidate page matches that of the link under analysis (which, let us recall, is not really broken in this analysis phase). However, we have found cases in which the page retrieved has the same content as that of the supposedly broken link but a different URL. Therefore, if the URLs do not match, we check whether the content of the pages is the same. We have also found several cases in which the content of the pages is not exactly the same but is very similar: some advertisement changes, the date, etc. Therefore, if the content does not match, we apply the vector space model (Manning, Raghavan, and Schütze, 2008), representing each of the pages to be compared by a term vector, and we compute the cosine similarity between them. If this value is greater than 0.9, we consider the page recovered. For values below this threshold, such as 0.8, although in most cases it is the same page with small changes like those mentioned, we have found some cases in which the pages were different, although from the same Web site.
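The three checks described above (same URL, same content, cosine similarity above 0.9) can be sketched as follows. This is only an illustration: it assumes raw term frequencies as vector weights and a simple lowercase alphanumeric tokenizer, neither of which is specified in the text, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.9  # threshold chosen in the text


def term_vector(text):
    """Raw term-frequency vector (the weighting scheme is an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_recovered(candidate_url, candidate_text, target_url, target_text):
    """Cheapest checks first: URL match, exact content match, then similarity."""
    if candidate_url == target_url:
        return True
    if candidate_text == target_text:
        return True
    sim = cosine_similarity(term_vector(candidate_text), term_vector(target_text))
    return sim > SIMILARITY_THRESHOLD
```

With this representation, two copies of a page that differ only in a date or a short advertisement share almost all of their term mass, so their similarity stays well above the threshold.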

Table 1 shows the number of supposedly broken links that were recovered among the first ten positions of the documents returned by the search engine. With a similarity threshold of 0.9, 41% of the links were recovered among the first ten positions (Google). Moreover, 66% of the recovered links were found in the first position. These data show that the anchor text of a link is a very valuable source of information for recovering a broken link. The rows for similarity degrees below 0.9 show that the number of additional links that would be gained by lowering the threshold is very small. For this reason, and given the erroneous cases that other thresholds could let in, we have used a threshold of 0.9.

| Sim. degree | 1 pos. | 1-10 pos. | E.N.R. |
|---|---|---|---|
| 0.9 | 253 | 380 | 536 |
| 0.8 | 3 | 4 | 529 |
| 0.7 | 2 | 6 | 521 |
| 0.6 | 4 | 13 | 504 |
| 0.5 | 4 | 22 | 478 |

Table 1: Aggregated values (discounting those of the previous level) for the search of the anchor text in Google, by degree of similarity. The first column gives the degree of similarity required to compute the values, which are to be read as the increment obtained when going from 0.9 to 0.8, and so on. 1 pos. is the number of “supposedly broken” links recovered in the first position of the search engine results, and 1-10 pos. the number recovered among the first 10 positions. E.N.R. (enlaces no recuperados) is the number of links that could not be recovered.

However, there are times when the anchor terms are of little or no descriptive value. Consider a link whose anchor text is “click here”. In this case, finding the broken link could be considered impossible. For this reason it is also very important to analyze these terms, in order to decide which tasks to perform depending on their number and quality.

In this work we chose to perform named entity recognition (names of people, organizations or places) on the anchor text, in order to extract terms that are more important than the rest. There are several software solutions for this purpose, such as LingPipe, Gate, FreeLing, etc. There are also many resources in the form of gazetteers, but the broad domain we work in has prevented us from obtaining accurate results. We are in a setting in which we analyze random pages whose only common factor is the language (English). Moreover, the fact that anchor texts are very small sets of words and/or numbers means that the usual entity recognition systems give very poor results.

For these reasons we decided to use the opposite strategy. Instead of finding named entities, we chose to compile a set of dictionaries and discard common words and numbers, assuming that the remaining words are named entities; in addition, we work with single terms. Although we have found some false negatives, such as the company “Apple”, in the case of anchors we obtained better results with this technique.
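The dictionary-based strategy can be sketched in a few lines. The word list below is a tiny hypothetical stand-in for the dictionaries actually compiled, and the function names are illustrative; the point is only that every single term that is neither a dictionary word nor a number is assumed to be a named entity.

```python
import re

# Hypothetical, very small stand-in for the collected dictionaries of common words.
COMMON_WORDS = {"the", "of", "and", "home", "page", "click", "here",
                "apple", "news", "about"}


def named_entity_terms(anchor_text, dictionary=COMMON_WORDS):
    """Single terms assumed to be named entities: not in the dictionary, not numbers."""
    terms = re.findall(r"[^\W_]+", anchor_text.lower())
    return [t for t in terms if t not in dictionary and not t.isdigit()]


def has_named_entity(anchor_text, dictionary=COMMON_WORDS):
    return bool(named_entity_terms(anchor_text, dictionary))
```

For the anchor “Home page of Lourdes Araujo” the sketch keeps `lourdes` and `araujo`, while for “Apple” it keeps nothing, which reproduces the kind of false negative mentioned in the text.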

Table 2 shows the recovery results for the “supposedly broken” links according to whether the anchors contain named entities. When the anchor contains no named entity, the number of links for which the page is not recovered is much larger than the number for which it is recovered (296 against 148), whereas when there are named entities the two quantities are similar (240 against 232). This shows that the presence of named entities in the anchor favors the recovery of the link.

| Anchor type | E.N.R. | E.R. |
|---|---|---|
| Ent. Nomb. | 240 | 232 |
| No Ent. | 296 | 148 |

Table 2: Analysis of the anchor type of the links not recovered (E.N.R.) and recovered (E.R.). Ent. Nomb. stands for anchors with one or more named entities, and No Ent. for those that contain no named entity.

We have also analyzed the recovery results as a function of the number of terms in the anchor; table 3 shows this study. The clearest result is that when the anchor consists of a single term and that term is not a named entity, the number of cases in which the correct document is recovered is very small indeed. When there are named entities, the number of cases recovered is considerable even with a single term. We can also observe that, from two terms on, the number of anchor terms does not make a large difference in the results.

| Anchor type | Terms | E.N.R. | E.R. |
|---|---|---|---|
| Ent. Nomb. | 1 term. | 102 | 67 |
| Ent. Nomb. | 2 term. | 52 | 75 |
| Ent. Nomb. | 3 term. | 32 | 29 |
| Ent. Nomb. | 4+ term. | 57 | 61 |
| No Ent. | 1 term. | 145 | 7 |
| No Ent. | 2 term. | 91 | 49 |
| No Ent. | 3 term. | 27 | 45 |
| No Ent. | 4+ term. | 33 | 47 |

Table 3: Analysis of the links not recovered (E.N.R.) and recovered (E.R.) as a function of the anchor type, with (Ent. Nomb.) and without (No Ent.) named entities, and of the number of anchor terms. 4+ term. refers to anchors with four or more terms.

3.2. The text of the page

The most frequent terms found in a Web page are one way of characterizing the main topic of that page. This technique requires the content of the page to be sufficiently large. A clear example of the usefulness of this information is links to personal pages. Very often the anchor of a link to a personal page consists of the name of the person the page belongs to. In many cases, however, names, even including the surname, do not identify a person unambiguously. For example, if we search Google for the name “Juan Martínez”, the name of one of the authors of this paper, we get a large number of entries (approx. 99,900 at the time of writing). The first result of the search engine that corresponds to Juan Martínez Romo is in tenth position. However, if we add some term that appears in his Web page, such as “Web search”, his page becomes the first result. This example shows the usefulness of a suitable selection of terms from the page that contains the link.

We have applied classical information retrieval techniques to extract the most representative terms of the page. Once stop words have been removed, we build an index of terms sorted by frequency. The first ten terms of this index are used, one at a time, to expand the query formed by the anchor text. That is, the query is expanded with each of them in turn, and the first ten documents retrieved in each case are kept.
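The expansion step can be sketched as follows. The stop-word list is a short hypothetical one, the tokenizer is a simplification, and quoting the anchor text in the expanded query is an assumption carried over from the anchor-only search; the search engine call itself is not shown.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stop-word list; a real system would use a complete one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is",
              "for", "on", "with", "this", "that"}


def top_terms(page_text, n=10):
    """The n most frequent terms of the page once stop words are removed."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(n)]


def expanded_queries(anchor_text, page_text, n=10):
    """One query per frequent term: the quoted anchor plus that single term."""
    return ['"%s" %s' % (anchor_text, term) for term in top_terms(page_text, n)]
```

Each of the resulting queries would then be sent to the search engine and its first ten results kept, so that a page yields at most one hundred additional candidates.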

Table 4 shows that the expansion improves the results overall, increasing the number of links recovered in the first ten positions and therefore reducing the number of links not recovered. Despite this, the number of links recovered in the first position drops.

| Analysis | 1 pos. | 1-10 pos. | E.N.R. |
|---|---|---|---|
| No EXP | 253 | 380 | 536 |
| EXP | 213 | 418 | 498 |

Table 4: Analysis of the number of documents recovered in first position (1 pos.), among the first ten positions (1-10 pos.) or not recovered (E.N.R.), depending on whether the query expansion method is used (EXP) or not (No EXP).

Table 5 shows the number of cases in which the expansion has improved the results and the number in which it has made them worse. Although the number of cases in which it improves is considerably larger, almost double (90 against 52), the number of cases in which it gets worse is not negligible. We therefore consider that the best option is to apply both forms of recovery and then rank the results so as to present the most relevant ones to the user first.

| Result of the expansion | No. of cases |
|---|---|
| Improves | 90 |
| Worsens | 52 |

Table 5: Number of cases in which the expansion improves and worsens the results.

Analyzing the cases in which the correct page is recovered with and without named entities (table 6) and as a function of the number of anchor terms (table 7), we see that the proportions obtained without query expansion are maintained. That is, the best results are obtained when there are named entities and when there are two or more terms. However, in this case, that is, with expansion, the number of links recovered when the anchor consists of a single term that is not a named entity is 25, which may already be a significant amount. This suggests attempting recovery with expansion in this case too, provided that the validity of the results can be checked, as explained later in section 5.

| Anchor type | E.N.R. | E.R. |
|---|---|---|
| Ent. Nomb. | 248 | 224 |
| No Ent. | 250 | 194 |

Table 6: Analysis, when the query expansion method is applied, of the links not recovered (E.N.R.) and recovered (E.R.) as a function of the anchor type, with (Ent. Nomb.) and without (No Ent.) named entities.

| Anchor type | Terms | E.N.R. | E.R. |
|---|---|---|---|
| Ent. Nomb. | 1 term. | 104 | 65 |
| Ent. Nomb. | 2 term. | 55 | 72 |
| Ent. Nomb. | 3 term. | 30 | 28 |
| Ent. Nomb. | 4+ term. | 59 | 59 |
| No Ent. | 1 term. | 127 | 25 |
| No Ent. | 2 term. | 70 | 70 |
| No Ent. | 3 term. | 22 | 50 |
| No Ent. | 4+ term. | 31 | 49 |

Table 7: Analysis, when the query expansion method is applied, of the links not recovered (E.N.R.) and recovered (E.R.) as a function of the anchor type, with (Ent. Nomb.) and without (No Ent.) named entities, and of the number of anchor terms. 4+ term. refers to anchors with 4 or more terms.

4. Ranking the links to recommend

At this point we have retrieved a set of candidate links to replace the broken link, coming from the search with the anchor and with the anchor expanded with each of the first ten terms that represent the parent page. We now want to rank them by relevance in order to present them to the user. To compute this relevance we have considered two sources of information. The first, if it exists, is the page that the broken link pointed to as stored in the search engine's cache, in our case Google's. If this information does not exist, we use the parent page that contains the broken link. The idea is that the linked page will in general deal with a topic related to that of the page in which the link is found.

We have again applied the vector space model (Manning, Raghavan, and Schütze, 2008) to study the similarity between the page that contained the broken link and the pages retrieved. With this technique we compute the similarity either with the cache or with the parent page. Table 8 shows the results obtained when ranking by similarity with the cache, while table 9 shows the results when ranking by similarity with the parent page. In the first case, most of the correct documents retrieved appear among the first ten documents, so that if the cache is available we can make very reliable recommendations. In the case of similarity with the parent page, the order of the results is worse, so we only resort to this information if the cache is not available.
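A minimal sketch of this ranking step is given below. It assumes the candidate pages have already been downloaded as plain text, uses raw term frequencies as vector weights (an assumption, since the weighting is not specified), and picks the cached page as reference when it is available and the parent page otherwise. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def _vec(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_candidates(candidates, parent_text, cached_text=None):
    """candidates: dict mapping URL to page text.

    Returns (url, similarity) pairs, most similar first. The reference document
    is the cached copy of the missing page if there is one, else the parent page.
    """
    reference = _vec(cached_text if cached_text is not None else parent_text)
    scored = [(url, _cosine(_vec(text), reference)) for url, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Keeping the choice of reference inside one function makes it easy to compare both orderings on the same candidate set, as tables 8 and 9 do.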

| First N docs. selected | Appearances of the best candidate |
|---|---|
| 10 | 301 |
| 20 | 305 |
| 30 | 306 |
| 50 | 307 |
| 80 | 310 |
| 100 | 312 |
| 110 | 313 |

Table 8: Number of appearances of correct pages in the ranking produced, when selecting the N best candidates according to similarity with the cache.

| First N docs. selected | Appearances of the best candidate |
|---|---|
| 10 | 47 |
| 20 | 105 |
| 30 | 132 |
| 50 | 191 |
| 80 | 263 |
| 100 | 305 |
| 110 | 313 |

Table 9: Number of appearances of correct pages in the ranking produced, when selecting the N best candidates according to similarity with the parent page.

5. Algorithm for the Automatic Recovery of Links

    if long(ancla) = 1 and NoEN(ancla) then
        if EnCache(pagina) then
            docs = busqueda_Web(ancla + info_pagina)
            ordenar(docs, cache)
            if similitud(docs, cache(pagina)) > 0.9 then
                propuesta_usuario(docs)
            else
                no recovery
        else
            no recovery
    else
        docs = busqueda_Web(ancla)
        docs = docs + busqueda_Web(ancla + info_pagina)
        if EnCache(pagina) then
            ordenar(docs, cache)
        else
            ordenar(docs, pagina_padre)
        propuesta_usuario(docs)

Figure 2: Algorithm for recommending substitutes for a broken link.
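The procedure of figure 2 can also be written as a runnable sketch. The two search functions are passed in as callables because the actual search engine interface is not described here; their names, the raw term-frequency weighting and the reading of the threshold test as "the best ranked candidate exceeds 0.9" are assumptions of this example.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.9  # similarity required for single-term anchors without entities


def _vec(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def _rank(docs, reference_text):
    ref = _vec(reference_text)
    return sorted(((_cosine(_vec(t), ref), u) for u, t in docs.items()), reverse=True)


def recommend(anchor, anchor_has_entity, parent_text, cached_text,
              search_anchor, search_expanded):
    """Return ranked candidate URLs, or [] when no recommendation can be made.

    search_anchor / search_expanded: hypothetical callables returning a dict
    {url: page text} for the anchor-only and the expanded queries.
    cached_text is None when the missing page is not in the cache.
    """
    if len(anchor.split()) == 1 and not anchor_has_entity:
        if cached_text is None:
            return []  # nothing to validate the proposal against
        ranked = _rank(search_expanded(), cached_text)
        if ranked and ranked[0][0] > THRESHOLD:
            return [url for _, url in ranked]
        return []
    docs = dict(search_anchor())
    docs.update(search_expanded())  # merge both result sets
    reference = cached_text if cached_text is not None else parent_text
    return [url for _, url in _rank(docs, reference)]
```

An empty list plays the role of the "no recovery" branches, so the caller can tell the user that no reliable recommendation is possible.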

The results of the analysis described in the previous sections suggest criteria for deciding in which cases there is enough information to attempt the recovery of the link, and which sources of information to use. Following them, we propose the recovery procedure shown in figure 2. First we check whether the anchor has just one term (long(ancla) = 1) and whether it contains no named entities (NoEN(ancla)). In that case recovery is attempted only if the missing page is in the cache, so that we have information with which to check that the proposal made to the user is relevant. Otherwise, the user is informed that no recommendation can be made. If the page is in the cache, recovery proceeds by expanding the query of anchor terms with those extracted from the parent page, the results are ranked, and the recommendation is made to the user only if some result is sufficiently close to the content of the cache. In the remaining cases, that is, anchors with more than one term or containing some named entity, recovery is performed with the anchor terms and also with the expansion with terms from the parent page, and all the documents are merged and ranked. If the cache of the missing page is available it is used for the ranking; otherwise the parent page is used.

We have applied this algorithm to links that are really broken, but only to those for which a cache is available, so that the results can be evaluated. Table 10 shows the positions of the most relevant documents in a ranking by similarity with the parent page. Relevance is measured by similarity with the cache. We have verified that in some cases it is the original page, whose URL has changed, and in other cases pages with closely related content in a different location. We can see that, even if we do not rely on the cache and rank by similarity with the parent page, the system is able to present relevant substitute documents among the first 10 positions in 48% of the cases and among the first 20 in 76%.

| First N | E.R. |
|---|---|
| 1-10 | 12 |
| 10-20 | 7 |
| 20-50 | 6 |

Table 10: Number of appearances of substitute pages (according to their similarity with the content of the cache) among the first N documents ranked by similarity with the parent page.
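The percentages quoted for this experiment can be checked against table 10, assuming that its three rows cover all the links evaluated (25 in total, a figure that is not stated explicitly in the text):

```latex
\frac{12}{12+7+6} = \frac{12}{25} = 48\,\%,
\qquad
\frac{12+7}{12+7+6} = \frac{19}{25} = 76\,\%
```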

6. Conclusions and Future Work

In this work we have analyzed different sources of information that can be used to automatically recover Web links that are no longer valid. The results indicate that the anchor terms can be very useful, especially if there is more than one and if they contain some named entity. We have also studied the effect of adding terms from the page that contains the link, in order to reduce the ambiguity that the limited number of anchor terms may entail. This study has shown that the results improve on those obtained using the anchor terms alone. However, since there are cases in which the expansion worsens the recovery result, we have decided to combine both methods and then rank the documents obtained by relevance, so as to present the best candidate pages to the user first. The outcome of this analysis is an algorithm that has managed to recover a page very close to the missing one among the first ten positions of the candidate documents in 48% of the cases, and among the first 20 in 76%.

We are currently working on analyzing other sources of information that may be useful for the recovery, such as the URLs or the pages pointed to by other links of the page that contains the broken link.

References

Chakrabarti, S., B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. 1998. Automatic resource list compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th International World Wide Web Conference.

Davis, H.C. 2000. Hypertext link integrity. ACM Computing Surveys Electronic Symposium on Hypertext and Hypermedia, 31(4).

Grønbæk, Kaj, Lennert Sloth, and Peter Ørbæk. 1999. Webvise: Browser and proxy support for open hypermedia structuring mechanisms on the world wide web. Computer Networks, 31(11-16):1331–1345.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

McBryan, Oliver A. 1994. GENVL and WWWW: Tools for Taming the Web. In O. Nierstrasz, editor, Proceedings of the First International World Wide Web Conference, page 15, CERN, Geneva.

Nakamizo, A., T. Iida, A. Morishima, S. Sugimoto, and H. Kitagawa. 2005. A tool to compute reliable web links and its applications. In SWOD ’05: Proc. International Special Workshop on Databases for Next Generation Researchers, pages 146–149. IEEE Computer Society.

Shimada, Takehiro and Atsushi Futakata. 1998. Automatic link generation and repair mechanism for document management. In HICSS ’98: Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences-Volume 2, page 226, Washington, DC, USA. IEEE Computer Society.



Funciones de Ranking basadas en Lógica Borrosa para IR estructurada ∗

Ranking Functions based on Fuzzy Logic for Structured IR

Joaquín Pérez-Iglesias, NLP Group at UNED

[email protected]

Víctor Fresno, NLP Group at [email protected]

José R. Pérez-Agüera, UCM

[email protected]

Resumen: Con el auge de los lenguajes de marcado se ha desarrollado un nuevo escenario en el campo de la recuperación de información centrado en los documentos que presentan una estructura, y asumiendo que esta puede ayudar en el proceso de recuperación; es lo que se define como IR estructurada. Los modelos clásicos de IR se han aplicado a este problema adaptando sus funciones de ranking al considerar los campos en los que se estructura un documento, y estas adaptaciones se han realizado siempre asumiendo una independencia estadística entre estos campos. Este hecho fuerza a la elección o estimación de unos coeficientes de empuje que determinen los diferentes pesos que se quiere dar a cada uno de los campos considerados. En este trabajo se presenta una nueva función de ranking para IR estructurada, basada en lógica borrosa, que trata de modelar mediante conocimiento experto la relación existente entre campos.

Palabras clave: IR estructurada, Lógica borrosa, Funciones de Ranking.

Abstract: With the increasing use of mark-up languages, a new scenario has arisen in the IR field. This scenario focuses on structured documents and is known as structured IR. The classic IR models have been extended so that they can be applied to it. These adaptations have generally been carried out by weighting the fields that form the document structure, under the assumption of statistical independence between fields. This assumption forces an estimation of the different boosts applied to each field. In this paper a new ranking function for structured IR is proposed. The function is based on Fuzzy Logic, and its main aim is to model the relations between fields through heuristics and expert knowledge.

Keywords: Structured IR, Fuzzy Logic, Ranking Functions.

1. Introduction

Today the Web has become the largest source of information available in the world, and access to this information takes place mainly through web search engines: information retrieval (IR) systems that return a ranked list of documents in response to an information need (or query) expressed as a set of terms.

∗ This work was partially funded by the QEAVis-Catiex project (TIN2007-67581-C02-01) of the Spanish Ministry of Science and Innovation (Ministerio de Ciencia e Innovación).

Traditionally, IR tasks have been carried out on documents without any kind of structure, so that, from the point of view of document processing, it was not possible to distinguish between different parts of a document. This kind of retrieval over plain text is known as ad-hoc retrieval, and it has been the central axis of theoretical and practical IR research for the last 40 years. However, with the emergence of the Web and its HTML language on the one hand, and the rise of mark-up languages such as XML on the other, a new retrieval scenario has arisen in which documents do explicitly present a structure that can be useful for IR.

In the case of XML, research has focused on reimplementing the classic IR models so that structure can be taken into account in the ranking functions that produce the document ordering. In this regard the INEX competition, as TREC did in its day for ad-hoc IR, has designed a specific test framework for structured IR, which is proving very useful to the IR research community. As a result of the synergies created around INEX, different strategies based on the classic IR models have been designed for retrieval over structured documents.

From the point of view of IR theory, the two most emblematic milestones in this respect have been the adaptation of the vector space model to structured IR (Schlieder and Meuss, 2002) and the adaptation of the BM25 weighting scheme to document structure, known as BM25F (Robertson, Zaragoza, and Taylor, 2004).

In the case of HTML document retrieval, much of the work done on XML can easily be adapted to the specific structure of web pages. The main difference is that, while XML provides direct mark-up of the document structure, in HTML the structure has to be inferred from mark-up that is merely descriptive and focused on defining how the document should look when displayed in a browser. This peculiarity of HTML makes it necessary to define heuristics beforehand that relate the display information of the document to its underlying structure.

If we look at the ranking functions applied to ad-hoc IR, we can see that they always assume statistical independence, first between the terms of the query and then between the documents of the collection. These functions are therefore formalized as sums, over the query terms, of weighting functions that generally consider, on the one hand, the frequency of the term in the document and in the query and, on the other, its inverse document frequency, or IDF.

When document structure is taken into account, these ranking functions must include an additional variable: the fields that represent the different parts of the document. This structure is handled and integrated into the classic ranking functions by assuming statistical independence between fields, so that the general ranking function is now formulated as a sum over the query terms and over each of the fields being considered. This independence between fields inevitably forces the estimation or choice of a set of values representing the weight of each field within the ranking function. That is, for each collection a set of coefficients must be fixed in advance, known as boost factors, which represent the importance to be given to each field in the final ranking value. The function, which already has to be tuned to the collection, therefore gains new variables when document structure is considered, and the estimation problem becomes more complex the more fields are considered.

However, this independence between fields should not always be assumed. When assessing the relevance of a document with respect to a query, the fields to be combined should not always be treated independently, as happens in the ranking functions found in the literature. The reason is that relevance in one field often becomes truly important only in conjunction with another. For example, the title of a document might have a rhetorical component, so that the features it contains would not help to describe the document's content adequately. For this reason, features present in the title should carry more relevance if they also appear in other fields of the document. Considerations of this kind are not captured by ranking functions based on linear combinations of factors, such as those used in the vector space model applied to structured IR or in the BM25F scheme. In these cases, if a term appears in the title, the component for the title field takes a value that is always added to the final ranking value, regardless of the values taken by the other fields.

This work presents a ranking function applicable to structured IR problems that does take the possible dependencies between fields into account. The idea on which this function is built is that a fuzzy system can combine knowledge and experience in a set of linguistic expressions that handle words instead of numeric values (W.G.J. Howells, 2001); in our case, the aim is to combine the information of each of the fields into which the document is structured. A system of fuzzy rules could thus be a more appropriate mechanism for combining the information of different fields in structured IR problems.

The rest of this article is structured as follows. Section 2 describes the extension of the vector space model to structured documents. Section 3 presents the proposed model and its theoretical foundations. Section 4 then details the experiments carried out and analyzes the results obtained. Finally, Section 5 draws conclusions and suggests possible future work.

2. Vector space model for structured IR

Today all the classic models have an adaptation of their ranking function to structured documents. Within the vector space model, the ranking function can take different forms depending on the SMART scheme (Salton and Buckley, 1965) being used. In this work we start from the following equation, which computes the relevance of a document with respect to a query:

$$\mathrm{score}(d, q) = \sum_{t \in q} \mathrm{idf}_{t} \cdot \mathrm{tf}^{d}_{t} \qquad (1)$$

where $t$ ranges over the terms contained in the query $q$, $\mathrm{idf}_{t}$ is the inverse document frequency of the term in the collection, and $\mathrm{tf}^{d}_{t}$ is the frequency of the term in document $d$.

When a document is divided into fields, the frequency count tf must take this into account. Thus:

$$\mathrm{tf}^{d}_{t} = \sum_{c \in d} w_{c} \cdot \mathrm{tf}^{d}_{tc} \qquad (2)$$

where $c$ ranges over the fields contained in the document, $w_{c}$ is the relative weight assigned to each field (the boost factor used to raise or lower the importance of one field with respect to the others), and $\mathrm{tf}^{d}_{tc}$ is the frequency of the term in field $c$ of document $d$.

In this way the vector space model is applied taking into account the weight of each of the units that make up the document, as well as the frequency of the terms within each of them.
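Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the representation of a document as a dictionary of token lists per field, and the idf and boost values in the example are hypothetical and do not come from the paper.

```python
from collections import Counter


def field_weighted_tf(term, doc_fields, weights):
    """Equation (2): tf_t^d = sum over fields c of w_c * tf_tc^d."""
    return sum(
        weights.get(field, 1.0) * Counter(tokens)[term]
        for field, tokens in doc_fields.items()
    )


def score(query_terms, doc_fields, weights, idf):
    """Equation (1): score(d, q) = sum over t in q of idf_t * tf_t^d."""
    return sum(
        idf.get(term, 0.0) * field_weighted_tf(term, doc_fields, weights)
        for term in query_terms
    )


# Toy document split into two fields (illustrative data only).
doc = {
    "title": ["fuzzy", "ranking"],
    "body": ["fuzzy", "logic", "for", "ranking", "fuzzy"],
}
weights = {"title": 5.0, "body": 1.0}  # hypothetical boost factors
idf = {"fuzzy": 2.0, "logic": 1.0}     # made-up idf values

print(score(["fuzzy", "logic"], doc, weights, idf))
```

Note that the title contribution is simply added to the body contribution: a term that occurs only in the title still receives its full boosted weight, which is the behaviour the fuzzy model tries to avoid.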

3. Fuzzy model applied to structured IR

Fuzzy logic has proved to be a suitable framework for capturing human expert knowledge, applying heuristics to resolve the ambiguity inherent in qualitative reasoning processes. This is an important feature, given that the main goal of this work is to find a ranking function that combines information from the different fields into which documents are structured.

Fuzzy logic is built on the concept of a linguistic variable, a variable whose values can be words of natural language and which is defined in terms of fuzzy sets. Fuzzy Set Theory (Zadeh, 1965), in turn, is based on the recognition that certain sets have imprecise boundaries. These sets consist of collections of objects for which the transition from "belonging" to "not belonging" is gradual.

A fuzzy set describes the degree to which an object belongs to a given class. This degree of membership is given by a membership function $\mu_F : U \to [0, 1]$, where $U$ is the universe of discourse. For an object $u \in U$, $\mu_F(u)$ is its degree of membership in the fuzzy set $F$.
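As a small illustration of a membership function $\mu_F : U \to [0, 1]$, the sketch below defines two complementary fuzzy sets over the universe $U = [0, 1]$. The linear ramp shape, the breakpoints 0.2 and 0.8, and the function names are assumptions chosen for the example, not values taken from the paper.

```python
def ramp_up(u, start, end):
    """Membership that rises linearly from 0 (u <= start) to 1 (u >= end)."""
    if u <= start:
        return 0.0
    if u >= end:
        return 1.0
    return (u - start) / (end - start)


def mu_high(u):
    """Degree to which u belongs to a hypothetical fuzzy set 'High'."""
    return ramp_up(u, 0.2, 0.8)  # breakpoints are illustrative


def mu_low(u):
    """Complementary hypothetical fuzzy set 'Low'."""
    return 1.0 - mu_high(u)


for u in (0.0, 0.35, 0.5, 0.9):
    print(u, mu_low(u), mu_high(u))
```

A value such as 0.35 belongs partly to both sets, which is the gradual transition between "belonging" and "not belonging" described above.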

The basic architecture of a fuzzy inference system such as the one used in this work consists of three processing stages: fuzzification of the inputs, application of the inference rules that make up the system's knowledge base, and a defuzzification that yields the final ranking value. The membership functions of the fuzzy sets are also established from the expert knowledge supplied to the system.

To express the knowledge base, a set of IF-THEN rules is needed that describes the intended behaviour of the system as precisely as possible and that encodes the expert knowledge. In our case these rules reflect the heuristic knowledge available about the relation between the different fields considered within a document. During inference the IF-THEN rules are interpreted, associating one or more input fuzzy sets (antecedents) with an output fuzzy set (consequent). In a fuzzy control system such as the one proposed in this work, the rules have several inputs, one for each of the fields considered, and a single output that corresponds to the value assigned by the ranking function.

Once the knowledge base has been fixed, the antecedents of the rules are combined through Union (OR) or Intersection (AND) operators, which can be implemented in many different ways. The operations defined between fuzzy sets thus make it possible to combine linguistic values, expressing the knowledge base as fuzzy conditional statements.

After the consequents of each IF-THEN rule have been obtained, an aggregation stage produces a final aggregated set. This set is the input to the last stage of the controller, defuzzification, which maps the output fuzzy set to a single point (the crisp output) that represents the relevance value. A more detailed explanation of this process can be found in (Fresno, 2006).

It is important to make sure that using fuzzy sets yields a more realistic representation than using the same variables defined in a crisp way. As already noted, the use of heuristics to combine field information suggests that this may be the case, and that a fuzzy combination can capture the information associated with the fields better than a linear combination. Section 4 describes the proposed model in detail.

4. Experimentation

The idea of this work is to develop a fuzzy-logic-based system able to assign a relevance value to each term within a document, much as a classic weighting function such as TF-IDF would. To apply the proposed model, we must therefore build a system that can represent the type of document on which the model will be evaluated, and within it establish a set of rules that quantify the relevance of the document's terms according to their frequency in the different fields of the document. The system thus models the structure of the documents and then applies heuristics to combine the fields that have been defined.

The first step in developing the model is therefore to define the linguistic variables and the fuzzy sets that model the structure of the document type. For web pages, which are based on HTML, three fields have been defined:

'Title', which contains the text found inside the TITLE tag. In some web pages the terms that appear inside this tag can be considered highly relevant. To some extent the title of a document can be understood as a summary of its content in a few words. The linguistic variable representing the title field consists of two fuzzy sets, 'Low' and 'High', as shown in Figure 1.

'Emphasized': we call emphasized the set of terms that have been explicitly highlighted by the author of the document so that they stand out from the rest of its content. We therefore consider that terms rendered in a way that makes them stand out from the rest are more relevant. This subset includes the terms that appear inside the tags B, EM, U, STRONG, BIG, I, H1, H2, H3, H4, H5, H6, CITE, DFN and BLOCKQUOTE. The linguistic variable representing the emphasis field consists of two fuzzy sets, 'Low' and 'High', as shown in Figure 2.

Figure 1: Linguistic variable Title (fuzzy sets 'Low' and 'High'; both axes range from 0 to 1)

Figure 2: Linguistic variable Emphasized (fuzzy sets 'Low' and 'High'; both axes range from 0 to 1)

'Rest of content': this subset contains the remaining terms that appear in the document. The linguistic variable that represents it consists of three fuzzy sets, 'High', 'Medium' and 'Low', as shown in Figure 3.

Next, the input functions for each of the linguistic variables of the fuzzy system must be defined. These functions quantify the degree of membership of a term in a fuzzy set, and they are based on the normalized frequency of the term within each field. In addition, the resulting tf function is dampened with a square root, so that the relevance of a term does not grow linearly with its frequency.

Figure 3: Linguistic variable Rest (fuzzy sets 'Low', 'Medium' and 'High'; both axes range from 0 to 1)

Figure 4: Linguistic variable Result (fuzzy sets 'Null', 'Low', 'Medium', 'High' and 'Maximum'; both axes range from 0 to 1)

This function is therefore defined as:

$$f_{c}(t, d) = \sqrt{\frac{\mathrm{tf}^{d}_{tc}}{\mathrm{tf}^{d}_{\max_c}}} \qquad (3)$$

where $\mathrm{tf}^{d}_{\max_c}$ is the maximum frequency of a term in field $c$ of document $d$. The function is evaluated for each of the fuzzy sets.
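The input function of Equation (3) could be computed as in the following sketch. It assumes that the square root applies to the whole normalized frequency (which keeps the result in [0, 1]) and that a field is available as a list of tokens; the name `field_input` is hypothetical.

```python
import math
from collections import Counter


def field_input(term, field_tokens):
    """Equation (3): square root of the term frequency in a field,
    normalized by the highest term frequency found in that field."""
    counts = Counter(field_tokens)
    if not counts:
        return 0.0  # an empty field contributes no evidence
    return math.sqrt(counts[term] / max(counts.values()))


emphasized = ["web", "page", "web", "web", "web"]
print(field_input("web", emphasized))   # most frequent term in the field
print(field_input("page", emphasized))  # one occurrence out of a maximum of four
print(field_input("fuzzy", emphasized)) # term absent from the field
```

Because of the square root, a term with a quarter of the maximum frequency still receives an input of one half rather than one quarter.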

Next, the linguistic variable corresponding to the output of the system must be defined. This variable is called 'Relevance', represents the importance of a term in the content of a document, and consists of five fuzzy sets: 'Null', 'Low', 'Medium', 'High' and 'Maximum', as shown in Figure 4.

Once the linguistic variables and their fuzzy sets have been defined, we present the set of rules that make up the knowledge base and that are triggered by the inputs to our system, that is, by the inputs to the input linguistic variables and their fuzzy sets. Nine rules are defined in total; they are shown in Table 1.

The total relevance value of a term is computed by the fuzzy system, which combines the partial frequencies of the term in each of the fields by applying the knowledge base and returns the final relevance value of the term for a document.

     Title       Emphasized       Rest           Relevance
IF   High   AND  -           AND  High    THEN   Maximum
IF   High   AND  -           AND  Medium  THEN   High
IF   High   AND  High        AND  Low     THEN   High
IF   High   AND  Low         AND  Low     THEN   Medium
IF   Low    AND  High        AND  High    THEN   High
IF   Low    AND  Low         AND  High    THEN   Medium
IF   Low    AND  -           AND  Medium  THEN   Low
IF   Low    AND  High        AND  Low     THEN   Low
IF   Low    AND  Low         AND  Low     THEN   Null

Table 1: Rule set of the global fuzzy system

If we denote our fuzzy system as a function $\mathrm{Fuzzy}(t, d)$ that estimates the total relevance of term $t$ in document $d$, then

$$\mathrm{Fuzzy}(t, d) = \mathrm{Fuzzy}(f_{ti}(t, d), f_{en}(t, d), f_{re}(t, d))$$

with $\mathrm{Fuzzy}(t, d) \in [0, 1]$, where $f_{ti}$, $f_{en}$ and $f_{re}$ are the functions that measure the partial relevance of term $t$ in the title, emphasized and rest fields respectively.
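To make the three stages concrete, the following is a minimal Mamdani-style sketch of how the nine rules of the rule table could be evaluated. Everything beyond the rules themselves is an assumption made for illustration: the membership function shapes (linear ramps and triangles), min as the AND operator, max as the aggregation operator, and a sampled centroid as the defuzzifier. The text of the paper does not fix these choices, and the figures that define the actual fuzzy sets are not reproduced here. All names (`fuzzy`, `RULES`, `tri`, and so on) are hypothetical.

```python
def tri(x, center, half_width):
    """Triangular membership function peaking at `center`."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


# Assumed input fuzzy sets over [0, 1].
TITLE = {"low": lambda u: 1.0 - u, "high": lambda u: u}
EMPHASIZED = TITLE
REST = {
    "low": lambda u: tri(u, 0.0, 0.5),
    "medium": lambda u: tri(u, 0.5, 0.5),
    "high": lambda u: tri(u, 1.0, 0.5),
}
# Assumed output fuzzy sets: triangles centred on these points.
OUT_CENTERS = {"null": 0.0, "low": 0.25, "medium": 0.5, "high": 0.75, "maximum": 1.0}

# (title, emphasized, rest, relevance); None stands for "-" in the rule table.
RULES = [
    ("high", None, "high", "maximum"),
    ("high", None, "medium", "high"),
    ("high", "high", "low", "high"),
    ("high", "low", "low", "medium"),
    ("low", "high", "high", "high"),
    ("low", "low", "high", "medium"),
    ("low", None, "medium", "low"),
    ("low", "high", "low", "low"),
    ("low", "low", "low", "null"),
]


def fuzzy(f_ti, f_en, f_re, steps=100):
    """Fuzzify the three inputs, fire the rules, aggregate and defuzzify."""
    strengths = {}
    for title, emph, rest, out in RULES:
        degrees = [TITLE[title](f_ti), REST[rest](f_re)]
        if emph is not None:
            degrees.append(EMPHASIZED[emph](f_en))
        firing = min(degrees)                                # AND as minimum
        strengths[out] = max(strengths.get(out, 0.0), firing)
    num = den = 0.0
    for i in range(steps + 1):                               # sampled centroid
        y = i / steps
        mu = max(min(s, tri(y, OUT_CENTERS[o], 0.25)) for o, s in strengths.items())
        num += y * mu
        den += mu
    return num / den if den > 0 else 0.0


print(fuzzy(1.0, 0.0, 0.0))  # term only in the title
print(fuzzy(1.0, 0.0, 1.0))  # term in the title and frequent in the rest
```

Under these assumptions a term that appears only in the title is rated 'Medium', while the same title evidence combined with a high frequency in the rest of the content is rated 'Maximum'. This is the kind of dependency between fields that a linear combination cannot express.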

The total relevance is then defined as follows: given an information need $q$ consisting of the terms $t_1, t_2, \ldots, t_n$ and a collection $C$ formed by the documents $d_1, d_2, d_3, \ldots, d_m$, the relevance value $R$ between a document $d$ and a query $q$ is computed as

$$R(q, d) = \sum_{t \in q} \mathrm{Fuzzy}(f_{ti}(t, d), f_{en}(t, d), f_{re}(t, d)) \qquad (4)$$

Note that the fields are no longer combined linearly, as in the structured VSM or BM25F, but by a function, in this case a fuzzy one, that combines the information of the whole set of selected fields globally.

Since the usefulness of the idf factor for correcting the information provided by term frequency is well known in IR (Robertson, 2004), the idf factor is added to Equation 4 as a correction to the relevance of a term in a document.

The final relevance function applied by our system is therefore

$$R(q, d) = \sum_{t \in q} \mathrm{Fuzzy}(f_{ti}(t, d), f_{en}(t, d), f_{re}(t, d)) \cdot \mathrm{idf}_{t} \qquad (5)$$

which is the equation that represents the proposed system.
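Equation (5) amounts to a sum over the query terms in which the fuzzy system replaces the weighted term frequency. The sketch below shows only that outer sum; the fuzzy system is passed in as a function, and `toy_fuzzy` is a deliberately trivial stand-in (a plain average), not the rule-based system described in the paper. All names and values are hypothetical.

```python
def relevance(query_terms, field_inputs, idf, fuzzy):
    """Equation (5): sum over query terms of Fuzzy(f_ti, f_en, f_re) * idf_t.

    field_inputs maps each term of the document to its (f_ti, f_en, f_re) triple.
    """
    total = 0.0
    for term in query_terms:
        if term not in field_inputs:
            continue  # a term absent from the document contributes nothing
        f_ti, f_en, f_re = field_inputs[term]
        total += fuzzy(f_ti, f_en, f_re) * idf.get(term, 0.0)
    return total


def toy_fuzzy(f_ti, f_en, f_re):
    """Placeholder for the fuzzy system; NOT the paper's rule base."""
    return (f_ti + f_en + f_re) / 3.0


inputs = {"fuzzy": (1.0, 0.0, 0.5)}  # illustrative per-field inputs
print(relevance(["fuzzy", "logic"], inputs, {"fuzzy": 2.0, "logic": 1.5}, toy_fuzzy))
```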

4.1. Experiments

This section describes the experiments carried out to validate the proposed model. It first describes the test collection used, then details the vector space model and its extensions against which our proposal is evaluated, and finally details the experiments.

4.1.1. Test collection

The collection used for the evaluation is a subset of EuroGOV 2005, a collection of web pages in several languages that was built for the WebCLEF track within CLEF (Cross Language Evaluation Forum) held in 2005¹.

For our experiments we restricted ourselves to the UK top-level domain, in which more than 99% of the documents are in English. The total number of documents in HTML format within this domain, after removing duplicates and empty pages, is 58393. The number of queries, and of their relevance judgments, for this domain is 48.

¹ http://www.clef-campaign.org/2005.html

4.1.2. Baseline

The relevance function used as the baseline for evaluating our model is defined as follows: given an information need $q$ consisting of the terms $t_1, t_2, \ldots, t_n$ and a collection $C$ formed by the documents $d_1, d_2, d_3, \ldots, d_m$, the relevance value $R$ between a document $d$ and a query $q$ is computed as:

$$R(q, d) = \sum_{t \in q} \mathrm{tf}^{d}_{t} \cdot \mathrm{idf}_{t} \cdot \mathrm{norm}(d) \qquad (6)$$

Equation 6 is extended as follows for documents whose structure consists of the fields $c_1, c_2, c_3, \ldots, c_k$:

$$R(q, d) = \sum_{t \in q} \mathrm{idf}_{t} \cdot \sum_{c \in d} \mathrm{tf}^{d}_{tc} \cdot \mathrm{norm}(c) \qquad (7)$$

where

$$\mathrm{tf}^{d}_{tc} = \sqrt{\mathit{frequency}} \cdot w_{c}$$

$$\mathrm{idf}_{t} = 1 + \log\left(\frac{N}{\mathrm{df}(t) + 1}\right)$$

with $\mathit{frequency}$ being the number of occurrences of the term in the field, $N$ the total number of documents in the collection $C$, and $w_{c}$ the boost factor assigned to field $c$.
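The baseline of Equation (7) and its two auxiliary definitions can be sketched as follows. Several details are assumptions, since the text does not specify them: the logarithm is taken as natural, the square root is applied to the raw frequency before multiplying by the boost, and norm(c) is taken to be a length normalization of the form 1/sqrt(field length). The function names and the toy data are hypothetical.

```python
import math


def idf(n_docs, doc_freq):
    """idf_t = 1 + log(N / (df(t) + 1)); natural logarithm assumed."""
    return 1.0 + math.log(n_docs / (doc_freq + 1))


def field_tf(frequency, boost):
    """tf_tc^d = sqrt(frequency) * w_c."""
    return math.sqrt(frequency) * boost


def structured_score(query_terms, doc_fields, boosts, n_docs, doc_freqs):
    """Equation (7): sum over t of idf_t * sum over c of tf_tc^d * norm(c)."""
    total = 0.0
    for term in query_terms:
        per_field = 0.0
        for field, tokens in doc_fields.items():
            if not tokens:
                continue
            norm = 1.0 / math.sqrt(len(tokens))  # assumed form of norm(c)
            per_field += field_tf(tokens.count(term), boosts[field]) * norm
        total += idf(n_docs, doc_freqs.get(term, 0)) * per_field
    return total


doc = {
    "title": ["fuzzy", "ranking", "web", "ir"],
    "rest": ["fuzzy"] * 4 + ["text"] * 12,
}
boosts = {"title": 5.0, "rest": 1.0}  # the VSM-I setting restricted to two fields
print(structured_score(["fuzzy"], doc, boosts, n_docs=100, doc_freqs={"fuzzy": 99}))
```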

4.2. Evaluation measures

The evaluation measures used are briefly described below.

MAP (Mean Average Precision) averages the precision measured at different recall levels. More formally, let $Rel = [d_1, d_2, \ldots, d_n]$ be the set of $n$ documents relevant to an information need $q$, and let $R_j$ be the set of documents retrieved up to the point at which the $j$-th relevant document of $Rel$ is retrieved. MAP is defined as the arithmetic mean of the Average Precision over the set of queries $Q$:

$$\mathrm{MAP} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{n} \sum_{j=1}^{n} \mathrm{Precision}(R_j)$$

R-Prec (R-Precision) is obtained as the arithmetic mean of the precision at $n$, where $n$ is the number of documents relevant to query $q_i$. Let $Q = [q_1, q_2, \ldots, q_m]$ be a set of queries; then

$$\text{R-Prec} = \frac{1}{m} \sum \frac{\mathit{docret}}{n}$$

MRR (Mean Reciprocal Rank) computes the mean of the individual values obtained for each query of the collection, according to the following expression:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathit{far}_i}$$

where $\mathit{far}_i$ is the rank of the first relevant document retrieved for query $q_i \in Q$.
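The three measures can be computed from ranked result lists as in the sketch below. It follows the usual definitions of these measures; in particular it assumes that docret in the R-Prec formula is the number of relevant documents among the first n retrieved, which the text does not state explicitly. Function names and the toy rankings are hypothetical.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranking, relevant):
    """Precision at n, where n is the number of relevant documents."""
    n = len(relevant)
    if n == 0:
        return 0.0
    return sum(1 for doc in ranking[:n] if doc in relevant) / n


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs):
    """runs: list of (ranking, relevant_set) pairs, one per query."""
    m = len(runs)
    return {
        "MAP": sum(average_precision(r, rel) for r, rel in runs) / m,
        "R-Prec": sum(r_precision(r, rel) for r, rel in runs) / m,
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / m,
    }


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6", "d7"], {"d6"}),
]
print(evaluate(runs))
```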

4.3. Proposed models

To validate the proposed model, the following experiments were designed:

Baseline: the evaluation is carried out without taking document structure into account or, equivalently, assigning the same weight factor to every field.

VSM-I: this experiment starts from the system used in the previous one, but now assigns weighting factors to each field as follows: Title (5), Emphasis (2) and Rest (1). These values were verified to be a local maximum.

VSM-II: starting from the previous model, the weighting factors were changed as follows: Title (10), Emphasis (5) and Rest (2.5), which yields a new local maximum.

Fuzzy: finally, the system proposed in this article is evaluated, with the input linguistic variables, the fuzzy sets and the knowledge base described at the beginning of Section 4.

The results obtained are shown in the following table.

          Baseline   VSM-I   VSM-II   Fuzzy
MAP       0.455      0.462   0.474    0.564
R-Prec    0.368      0.368   0.388    0.467
MRR       0.466      0.479   0.490    0.580

Table 2: Results

4.4. Analysis of results

The values obtained for MAP and R-Prec show that the improvement obtained by using structure is consistent in terms of precision and recall, both in the analytic approaches and in the fuzzy approach.

Furthermore, the MRR measure shows that using structure improves results in all cases not only on classic evaluation measures such as MAP and R-Prec, but also on the order in which relevant documents are retrieved, a factor of great importance in collections made up of a large number of documents, as is the case of the Web.

In the case of VSM-I and VSM-II, assigning different values to the boost factors changes retrieval quality, although only to a fairly small degree. The values used to implement VSM-I and VSM-II were chosen after an exhaustive exploration of the space of possible boost factors for the fields 'Title', 'Emphasized' and 'Rest'. It is nevertheless worth noting that the improvement achieved by the methods that combine fields analytically stays at roughly 5% or below over the baseline for the evaluation measures used, whereas the improvement achieved by the fuzzy function proposed in this work exceeds 20% for all of them.
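The relative improvements discussed above can be recomputed directly from the values reported in the results table (Table 2). The snippet below is just that arithmetic; the variable and function names are hypothetical.

```python
results = {
    "MAP":    {"Baseline": 0.455, "VSM-I": 0.462, "VSM-II": 0.474, "Fuzzy": 0.564},
    "R-Prec": {"Baseline": 0.368, "VSM-I": 0.368, "VSM-II": 0.388, "Fuzzy": 0.467},
    "MRR":    {"Baseline": 0.466, "VSM-I": 0.479, "VSM-II": 0.490, "Fuzzy": 0.580},
}


def relative_gain(value, baseline):
    """Improvement over the baseline, as a percentage."""
    return 100.0 * (value - baseline) / baseline


gains = {
    measure: {
        system: relative_gain(value, row["Baseline"])
        for system, value in row.items()
        if system != "Baseline"
    }
    for measure, row in results.items()
}

for measure, row in gains.items():
    print(measure, {system: round(gain, 1) for system, gain in row.items()})
```

With these figures, the best analytic configuration (VSM-II) gains between about 4.2% and 5.4% over the baseline depending on the measure, while the fuzzy function gains between about 24% and 27%.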

5. Conclusions and Future Work

As was shown in (Robertson, Zaragoza, and Taylor, 2004), and in view of the results obtained with the different evaluation measures, the use of document structure improves results in IR problems. This can be seen in Table 2, where the results obtained by the VSM-I and VSM-II models improve on those of the baseline, the only evaluated model that does not use structure.

If we focus on the retrieval methods that take document structure into account (VSM-I, VSM-II and Fuzzy), we can see that the approach based on fuzzy logic brings a consistent improvement, across the evaluation measures considered, over the methods based on a simple analytic combination of the fields. This improvement shows both in precision and in recall. The ordering of the documents also benefits from the fuzzy approach, which shows that the interrelation between fields is a problem to be considered when dealing with a collection of structured documents.

As future work, the most important step is to compare the fuzzy function with other information retrieval models such as BM25F, Language Models and Divergence From Randomness. It is also necessary to apply the fuzzy function to other collections of structured documents that use forms of mark-up other than HTML, so that the viability of our approach can be measured over a wide range of mark-up schemes. In this regard, the INEX collection, built from an XML mark-up of Wikipedia, is a future test framework of great interest.

References

Fresno, Víctor. 2006. Representación Autocontenida de documentos HTML: una propuesta basada en combinaciones heurísticas de criterios. Ph.D. thesis, Departamento de Ingeniería Telemática y Tecnología Electrónica, Universidad Rey Juan Carlos.

Robertson, Stephen. 2004. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation, 60.

Robertson, Stephen, Hugo Zaragoza, and Michael Taylor. 2004. Simple BM25 extension to multiple weighted fields. In CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 42–49, New York, NY, USA. ACM.

Salton, G. and C. Buckley. 1965. The SMART automatic document retrieval system - an illustration. Communications of the ACM, 8(6):391–398.

Schlieder, Torsten and Holger Meuss. 2002. Querying and ranking XML documents. JASIST, 53(6):489–503.

W.G.J. Howells, H. Selim. 2001. The autonomous document object (ADO) model. In Proceedings of the International Conference on Document Analysis and Recognition.

Zadeh, L. A. 1965. Fuzzy sets. Information and Control, 8:338–353.



Text Summarization

Integración del reconocimiento de la implicación textual en tareas automáticas de resúmenes de textos∗

Incorporating Textual Entailment Recognition in Single- and Multi-Document Summarization Systems

Elena Lloret, Óscar Ferrández, Rafael Muñoz y Manuel Palomar
Dept. Lenguajes y Sistemas Informáticos (Universidad de Alicante)
Carretera San Vicente s/n, 03690 San Vicente del Raspeig, Alicante, España
{elloret, ofe, rafael, mpalomar}@dlsi.ua.es

Resumen: Este artículo presenta un estudio preliminar sobre la influencia de la implicación textual en tareas de generación automática de resúmenes de textos. Se proponen dos aproximaciones para producir resúmenes a partir de uno o varios documentos de entrada. La primera de ellas se basa únicamente en la frecuencia de las palabras para determinar las frases más relevantes del texto, mientras que la segunda aproximación combina esta técnica con el reconocimiento de la implicación textual para reducir el número de frases del documento de entrada. Los resultados iniciales obtenidos presentan una mejora del 5.85 % para monodocumento y del 6.80 % para multidocumento cuando se incorpora el reconocimiento de la implicación textual.

Palabras clave: resumen automático, implicación textual, frecuencia de palabras, multidocumento

Abstract: This paper presents a preliminary study of the usefulness of applying textual entailment recognition to text summarization tasks. Two different approaches are proposed. The first relies on word frequency to select the most relevant sentences to include in the final summary, whereas the second combines textual entailment recognition with the aforementioned approach so that the size of the original text or texts is reduced first. Results show that this combination is appropriate, yielding an increase of 5.85 % for single-document and of 6.80 % for multi-document summarization when textual entailment and word frequency are applied together.

Keywords: automatic summarization, textual entailment, word-frequency, multi-document

1. Introduction

Due to the large growth of digital information, especially since the exponential growth of the Internet, the need for tools able to manage, control and filter information is increasingly important both for the scientific community and for end users. Consequently, automatic text summarization tasks, which present the available information to the user more concisely, have recently become highly relevant. In addition, there are other tasks within Natural Language Processing (NLP) that also contribute to scientific progress in these respects. Tasks such as information retrieval, information extraction, question answering, text classification and textual entailment are being widely investigated, both individually and in combination. In this article we explore the possibility of using textual entailment to support the text summarization task, studying how the latter is affected.

∗ This research was funded by the TEXT-MESS project (TIN2006-15265-C06-01) of the Spanish Ministry of Education and Science (Ministerio de Educación y Ciencia) and partially funded by the QALL-ME project (FP6-IST-033860), within the Sixth Framework Programme for Research of the European Union.

A summary is defined as a text that is generated from one or more texts, contains the most significant information, and is no longer than half of the original texts (Hovy, 2005). Text summarization systems can be characterized by many factors, but according to (Sparck Jones, 2007) there are mainly three that influence summaries: input, purpose and output factors. For example, although this task has traditionally focused on summarizing texts, summaries can also be obtained from multimedia information such as images, videos, audio, or on-line information and hypertext. Moreover, we can summarize a single document (single-document summarization) or several (multi-document summarization). Regarding the output, the summary can be an extract¹ or an abstract. It is also possible to distinguish between generic summaries, which try to represent all the relevant information of the original text, and user-oriented summaries, which represent the information needed by a specific user. As for the style of the output, the main distinction is between two types: indicative summaries, which show the topics covered by the original text and give a brief idea of what it discusses, and informative summaries, which aim to cover the content of the text more fully.

Turning briefly to textual entailment, it has recently been proposed as a generic framework for modelling language variability in many NLP applications. Textual entailment is defined as a unidirectional relation between two texts such that the meaning of one can be inferred from the meaning of the other. Following the methodology proposed in (Glickman, 2006), the inferred text is called the hypothesis (H) and the text that licenses the inference is called the text (T). Several approaches have been proposed recently; most of them, and arguably the most relevant, were presented in the three editions of the Recognising Textual Entailment (RTE) challenge (Giampiccolo et al., 2007).

¹ We use the term extract for a summary built by extracting previously selected sentences from the original text, and the term abstract when new language is generated to compose the final summary.

The following figure shows two examples, one positive and one negative, of textual entailment relations taken from the development corpus of the third RTE challenge.

Pair id=109, entailment=YES
T: ASCAP is a membership association of more than 200,000 US composers, songwriters, lyricists and music publishers of every kind of music.
H: More than 200,000 US composers, songwriters, lyricists and music publishers are members of ASCAP.

Pair id=194, entailment=NO
T: US President George W. Bush has indicated he will invite Abbas to the United States for talks, something he never did with Abbas's predecessor, the late Yasser Arafat.
H: Yasser Arafat succeeded Abbas.

Figure 1: Examples of textual entailment.

The aim of this paper is to assess how using a textual entailment module in a text summarization system can improve the final results. The paper details the experiments carried out, for both single-document and multi-document summarization, and the results obtained.

The paper is structured as follows: Section 2 presents a brief overview of the state of the art. Section 3 describes the system and the modules it comprises. Section 4 then presents the evaluation of the system and the results obtained. Finally, Section 5 presents conclusions and future work.

2. State of the art

Automatic text summarization began some fifty years ago with the research of (Luhn, 1958) and (Edmundson, 1969), who applied techniques such as word frequency or the position of a sentence within a document to produce summaries without human intervention. Since that early work, many techniques have been developed and refined, some based on linguistic knowledge and resources, such as (Lin and Hovy, 2002; Gotti et al., 2007), and others based on statistical methods and machine learning to identify the sentences that will make up the final summary (Hirao et al., 2002; Svore, Vanderwende, and Burges, 2007). Moreover, in recent years multi-document summarization has attracted particular interest in this area of NLP (Goldstein et al., 2000; Mihalcea, 2004; Qiu et al., 2007; Kuo and Chen, 2008), owing to the large amount of redundant information available and the importance of condensing or fusing all that information into a brief summary.

Elena Lloret, Óscar Ferrández, Rafael Muñoz y Manuel Palomar

Several surveys of the state of the art can be found in the literature (Alonso et al., 2004; Sparck Jones, 2007). They describe the basic aspects of automatic summarization and the existing techniques for it (for both single-document and multi-document summaries). They also present and analyse examples of specific systems that have taken part in international evaluations, such as those organized from 2001 to 2007 by the National Institute of Standards and Technology (NIST) within the DUC conference series (http://duc.nist.gov/).

Moreover, automatic summarization can benefit from techniques developed for other NLP tasks, such as text categorization, information retrieval or textual entailment. In this paper we focus on the last of these, textual entailment recognition, and study how it can be applied together with automatic text summarization, obtaining preliminary results that are promising for future lines of research. This is a novel approach in which textual entailment recognition is used to reduce the size of the original document. Until now, textual entailment had been used only in the evaluation of automatic summaries (Harabagiu, Hickl, and Lacatusu, 2007), without directly influencing their generation. (Tatar et al., 2008) describe an approach in which textual entailment is applied as a segmentation technique, grouping sentences that entail one another and then extracting from those segments the sentences that form the final summary. In contrast, our work shows a different way of integrating textual entailment into a summarization system: sentences that stand in an entailment relation to one another are removed, thereby reducing the number of sentences in the original document. In a subsequent step, the highest-scoring of the remaining sentences, according to the criteria set out in Section 3, are selected for the final summary. The next section describes the details of the system architecture.

3. System architecture

The system we have developed automatically produces extracts of one or more documents using two components as sources of information: word frequency and textual entailment. Our working hypothesis is that textual entailment recognition can be very useful as a step prior to processing the text or texts when generating single- or multi-document summaries. By identifying these entailment relations, we can reduce the size of the text to be processed. The aim of this work is to study the positive influence of textual entailment on automatic summarization. Below we explain the knowledge sources used in the system (word frequency and textual entailment), how they relate to each other and, finally, a first approach to extending the system to multi-document summarization.

3.1. Word frequency

Word frequency is the base technique of our system². The underlying idea is that the more often a word appears in a text (ignoring stopwords), the more important a sentence containing that word is; the highest-scoring sentences are therefore the ones that make up the final summary. That is, as shown in Formula 1, the score of a sentence is the sum of the frequencies of the words it contains, normalized by dividing by the length of the sentence:

$$Sc_s = \frac{\sum_{i=1}^{n} tf_i}{n} \qquad (1)$$

² To compute word frequencies we used the WVTool tool developed by Michael Wurst (University of Dortmund), available at http://wvtool.sourceforge.net/

Integración del reconocimiento de la implicación textual en tareas automáticas de resúmenes de textos


where:

$tf_i$ = frequency of word i.

$n$ = length of the sentence (in number of words), excluding stopwords.

Consider, for example, the two sentences shown in Figure 2, taken from the DUC 2002 corpus (http://www-nlpir.nist.gov/projects/duc/data.html): the score for sentence a is 3.2 and the score for sentence b is 1. The frequency of each word over the whole document is given in parentheses. Note that stopwords are counted neither in the frequency computation nor in the sentence length.

S_a: Tropical (2) Storm (6) Gilbert (7) formed (1) in (0) the (0) eastern (1) Caribbean (1) and (0) strengthened (1) into (0) a (0) hurricane (7) Saturday (4) night (2).

S_b: There (0) were (0) no (0) reports (1) of (0) casualties (1).

Figure 2: Example sentences from document AP880911-0016 (DUC 2002 corpus).

If we had to choose one of them for the summary, we would choose the first sentence, since its score (3.2) is higher than that of the second (1).
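As a rough illustration of Formula 1, the following Python sketch recomputes the two scores of the Figure 2 example. The word frequencies are copied from the figure; the stopword list and the function name `sentence_score` are assumptions made for this example only, since the stopword list actually used is not given here.

```python
from typing import Dict, Iterable, Set

# Assumption: a tiny illustrative stopword list covering the Figure 2 example.
STOPWORDS: Set[str] = {"in", "the", "and", "into", "a", "there", "were", "no", "of"}

# Document-level word frequencies as printed in Figure 2.
TF: Dict[str, int] = {
    "tropical": 2, "storm": 6, "gilbert": 7, "formed": 1, "eastern": 1,
    "caribbean": 1, "strengthened": 1, "hurricane": 7, "saturday": 4,
    "night": 2, "reports": 1, "casualties": 1,
}


def sentence_score(tokens: Iterable[str], tf: Dict[str, int],
                   stopwords: Set[str] = STOPWORDS) -> float:
    """Sum of word frequencies divided by the number of non-stopword tokens."""
    content = [t.lower() for t in tokens if t.lower() not in stopwords]
    if not content:
        return 0.0  # a sentence made only of stopwords gets no score
    return sum(tf.get(t, 0) for t in content) / len(content)


sentence_a = ("Tropical Storm Gilbert formed in the eastern Caribbean and "
              "strengthened into a hurricane Saturday night").split()
sentence_b = "There were no reports of casualties".split()

print(sentence_score(sentence_a, TF))  # 3.2
print(sentence_score(sentence_b, TF))  # 1.0
```

Sentence a has ten non-stopword tokens whose frequencies sum to 32, and sentence b has two whose frequencies sum to 2, which gives the scores quoted in the text.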

3.2. Textual entailment

The main idea behind using textual entailment in automatic summarization is to obtain a preliminary summary made up of the sentences that have no entailment relation with any other sentence in the document. As an example, suppose that a document consists of the following set of sentences:

S1 S2 S3 S4 S5 S6

and the document reduced by computing the textual entailment relations is obtained as follows:

SUMMARY = {S1}

SUMMARY → entails → S2 ⇒ NO

SUMMARY = {S1, S2}

SUMMARY → entails → S3 ⇒ NO

SUMMARY = {S1, S2, S3}

SUMMARY → entails → S4 ⇒ YES

SUMMARY = {S1, S2, S3}

SUMMARY → entails → S5 ⇒ YES

SUMMARY = {S1, S2, S3}

SUMMARY → entails → S6 ⇒ NO

SUMMARY = {S1, S2, S3, S6}

Therefore, the reduced document obtained by processing the textual entailment inferences comprises those sentences that are not entailed by the set of sentences that have not previously produced an entailment relation (i.e. S1, S2, S3, S6 in the example above). To process the textual entailments we used the textual entailment system presented in (Ferrández et al., 2007), with some improvements, trained on the corpus provided in the third edition of RTE (Giampiccolo et al., 2007). This system computes a set of lexical measures (e.g. Levenshtein distance, Smith–Waterman, cosine similarity) and semantic measures based on WordNet 3.0, and applies an SVM classifier to make the final decision.
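The reduction procedure traced above can be sketched in Python as a single pass over the sentences. This is only a sketch under stated assumptions: `reduce_document` and `toy_entails` are hypothetical names, and `toy_entails` is a crude word-overlap stand-in, not the SVM-based entailment system of (Ferrández et al., 2007).

```python
from typing import Callable, List


def reduce_document(sentences: List[str],
                    entails: Callable[[List[str], str], bool]) -> List[str]:
    """Keep each sentence unless the sentences kept so far entail it."""
    kept: List[str] = []
    for sentence in sentences:
        # The first sentence always starts the summary (SUMMARY = {S1}).
        if kept and entails(kept, sentence):
            continue  # entailed, hence redundant: drop it
        kept.append(sentence)
    return kept


def toy_entails(text: List[str], hypothesis: str, threshold: float = 0.8) -> bool:
    """Hypothetical stand-in: 'entailed' if most hypothesis words occur in the text."""
    text_words = {w.lower().strip(".,") for s in text for w in s.split()}
    hyp_words = [w.lower().strip(".,") for w in hypothesis.split()]
    if not hyp_words:
        return True
    covered = sum(w in text_words for w in hyp_words)
    return covered / len(hyp_words) >= threshold


doc = [
    "Gilbert formed in the eastern Caribbean.",
    "The storm strengthened into a hurricane.",
    "Gilbert formed in the Caribbean.",
    "There were no reports of casualties.",
]
print(reduce_document(doc, toy_entails))
# The third sentence is dropped; the other three are kept.
```

Passing the entailment decision as a function keeps the loop independent of how entailment is recognised, so a real recogniser could be substituted for the toy predicate.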

3.3. Word frequency and textual entailment

To assess how detecting textual entailment relations improves the overall text summarization task, we built an approach that benefits from both of the techniques above (word frequency and textual entailment).

First we apply textual entailment to obtain the reduced document, which contains only those sentences that have not produced an entailment relation. In parallel, we apply the word-frequency technique to the sentences of the document to obtain the frequency of each word. Finally, we build the summary from the sentences that are contained in the reduced document and obtain the highest scores.
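A minimal sketch of this final selection step is given below. Several details are assumptions rather than the authors' procedure: the word budget is enforced greedily, the chosen sentences are emitted in document order, and the stopword list, tokenizer and function names are hypothetical. The reduced document is passed in explicitly, as if it had already been produced by the entailment step.

```python
from collections import Counter
from typing import List

# Assumption: a small illustrative stopword list.
STOPWORDS = {"in", "the", "and", "into", "a", "there", "were", "no", "of"}


def tokenize(sentence: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    words = (w.strip(".,;:").lower() for w in sentence.split())
    return [w for w in words if w]


def summarize(document: List[str], reduced: List[str], max_words: int = 100) -> List[str]:
    """Pick the highest-scoring sentences of `reduced` within a word budget."""
    # Word frequencies are computed over the whole document.
    tf = Counter(w for s in document for w in tokenize(s) if w not in STOPWORDS)

    def score(sentence: str) -> float:
        content = [w for w in tokenize(sentence) if w not in STOPWORDS]
        return sum(tf[w] for w in content) / len(content) if content else 0.0

    chosen: List[str] = []
    used = 0
    for sentence in sorted(reduced, key=score, reverse=True):
        length = len(sentence.split())
        if used + length <= max_words:  # assumption: greedy budget filling
            chosen.append(sentence)
            used += length
    # Assumption: present the selected sentences in their original order.
    return [s for s in document if s in chosen]


doc = [
    "Gilbert formed in the eastern Caribbean.",
    "The storm Gilbert strengthened into a hurricane.",
    "Gilbert formed in the Caribbean.",
    "There were no reports of casualties.",
]
reduced = [doc[0], doc[1], doc[3]]  # as if the third sentence had been found redundant
print(summarize(doc, reduced, max_words=13))
```

With a 13-word budget the first two sentences (scores 2.0 and 1.5, 6 and 7 words) are selected and the last one (score 1.0) no longer fits.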

3.4. Extension to multi-document summarization

Starting from the hypothesis that applying textual entailment as part of the automatic summarization process is advantageous because it removes redundant information from the source document, and given that information redundancy is one of the most important problems in multi-document summarization (Radev, Hovy, and McKeown, 2002), we decided to extend the system to multi-document summarization in order to study the behaviour of textual entailment in this new setting. As a very basic first approach, we chose to treat all the input documents as a single document, concatenating them one after another and applying textual entailment recognition first, as a mechanism for removing redundant sentences. In this way we reduce the size of the text considerably. Word frequencies are then computed on this new text, as described in Section 3.1, to determine the sentences that will be extracted for the final summary. The next section describes the experiments and the results in detail.

4. System evaluation

To evaluate our system intrinsically³ we used the documents from DUC 2002 as test data. The test corpus consists of 59 clusters of news articles, each containing approximately 10 documents on the same topic. In this edition of DUC, task 1 addressed single-document summarization and task 2 multi-document summarization. The first called for generic 100-word summaries of each document, and the second for generic summaries of various lengths (10, 50, 100 and 200 words) of each cluster. Based on the definition of these tasks, we decided to run the following tests (see Table 1 for the results):

- Generation of 100-word extracts for each document in the DUC 2002 corpus, applying only word frequency in the summarization process (single-document, TS).

- Generation of 100-word extracts for each document in the DUC 2002 corpus, applying textual entailment recognition as a step prior to word frequency in the summarization process (single-document, TE+TS).

- Generation of 100-word extracts for each cluster in the DUC 2002 corpus, considering only word frequency in the summarization process (multi-document, TS).

- Generation of 100-word extracts for each cluster in the DUC 2002 corpus, applying textual entailment recognition as a step prior to word frequency in the summarization process (multi-document, TE+TS).

³ This type of evaluation differs from extrinsic evaluation in that it assesses the content of the summary, that is, the quality of the generated summary according to various criteria.

To carry out these tests, we first preprocessed the texts provided by DUC, keeping only the body of the news story in each document and removing all tags carrying additional information. The reason for this preprocessing was that those tags introduced a great deal of noise into the final summaries. Although it might seem a priori that we are removing relevant information from the documents, such as the headline, this information is usually summarized again in the first sentence of the story, since these are newspaper texts.

The DUC 2002 corpus also contained summaries written manually by human experts, which we used for comparison with those produced by our system. In addition, two baselines were developed for DUC 2002: one for task 1, which extracts the first 100 words of the document, and another for task 2, which selects the first n words of the most recent document (in our case also 100, since the multi-document summaries produced are 100 words long). Both baselines served as our reference and are called Lead baseline, as can be seen in Table 1.

Single-document    ROUGE-1             ROUGE-2             ROUGE-L
Lead baseline      41.132              21.075              37.535
TS                 43.210              17.072              39.188
TE+TS              44.759 (+3.58 %)    18.840 (+10.36 %)   40.606 (+3.62 %)

Multi-document     ROUGE-1             ROUGE-2             ROUGE-W
Lead baseline      28.684              5.283               9.525
TS                 29.620              5.200               9.266
TE+TS              31.333 (+5.78 %)    5.780 (+11.15 %)    9.588 (+3.48 %)

Table 1: Results on the DUC 2002 corpus (single- and multi-document tasks).

Multi-document     ROUGE-1             ROUGE-2             ROUGE-W
Abstracts-HUM      37.762              8.004               10.436
Extracts-HUM       41.811 (+10.72 %)   13.466 (+68.24 %)   11.503 (+10.22 %)

Table 2: Evaluation results comparing abstracts vs. extracts.

The evaluation used the ROUGE⁴ package (Lin, 2004), with a 95 % confidence interval. This tool, developed in 2004, has been widely used for the automatic evaluation of summaries in recent DUC competitions, and provides precision, recall and F-measure values for different levels of overlap between texts. In this work we consider only recall, because the evaluations carried out a posteriori on the DUC 2002 data (Wan, Yang, and Xiao, 2006; Steinberger et al., 2007) used the previous version of ROUGE (version 1.4.2), which computed recall only. The ROUGE measures used in our evaluation are ROUGE-N (ROUGE-1 and ROUGE-2), which measures n-gram recall between the candidate summary and the reference summary (or summaries); ROUGE-L, which is based on the longest common subsequence, consecutive or not, between two texts; and ROUGE-W (w=1.2), which is similar to ROUGE-L except that it records the lengths of consecutive matches in order to keep the longest of them.

⁴ The version used in this work, ROUGE-1.5.5, is available at: http://haydn.isi.edu/ROUGE
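As a rough illustration of the recall-oriented ROUGE measures used in this evaluation, the following Python sketch computes ROUGE-N recall and ROUGE-L recall against a single reference. It is a simplified approximation, not the ROUGE-1.5.5 package: it assumes pre-tokenized input, one reference summary, and no stemming or stopword removal; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram matches divided by the number of n-grams in the reference."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (not necessarily consecutive)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate: List[str], reference: List[str]) -> float:
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


candidate = "the cat sat on the mat".split()
reference = "the cat was on the mat".split()
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 reference bigrams matched
print(rouge_l_recall(candidate, reference))     # LCS "the cat on the mat" has length 5
```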

Taking advantage of the human-written extracts included in the DUC 2002 corpus, we ran an additional test that consisted of generating multi-document summaries for each of the 59 clusters and analysing the differences in the ROUGE results, since most current systems produce extracts but are usually compared with abstracts when evaluated. In this test we produced 200-word summaries using the textual entailment and word-frequency components together. The results of this test are shown in Table 2.

Table 1 shows the results for the four tests described above. The relative improvement⁵ obtained for each ROUGE measure with respect to not incorporating textual entailment is given in parentheses. These results show that incorporating textual entailment recognition into the summarization process yields an average improvement of 5.85 % in the single-document case and 6.80 % in the multi-document case, compared with applying word frequency alone. The improvement obtained by textual entailment recognition in the multi-document setting is worth highlighting. We found experimentally that 71.57 % of the original sentences of the DUC 2002 corpus were removed, so the reduction of the texts is quite substantial. Moreover, the two approaches proposed in this work improve on the respective DUC 2002 baselines for both tasks (except for ROUGE-2 in the single-document case).
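The percentages in parentheses in Table 1 can be reproduced with a short calculation. The sketch below assumes that each figure is the relative gain of TE+TS over TS for the same measure, and that the reported averages are the means of the three gains per task; the function and variable names are hypothetical, and the scores are copied from Table 1.

```python
from typing import Dict, List, Tuple

# (ROUGE-1, ROUGE-2, ROUGE-L or ROUGE-W) recall scores from Table 1.
TABLE_1: Dict[str, Dict[str, Tuple[float, float, float]]] = {
    "single-document": {"TS": (43.210, 17.072, 39.188), "TE+TS": (44.759, 18.840, 40.606)},
    "multi-document": {"TS": (29.620, 5.200, 9.266), "TE+TS": (31.333, 5.780, 9.588)},
}


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


def gains(task: str) -> List[float]:
    rows = TABLE_1[task]
    return [relative_improvement(b, i) for b, i in zip(rows["TS"], rows["TE+TS"])]


for task in TABLE_1:
    per_measure = [round(g, 2) for g in gains(task)]
    mean_gain = round(sum(gains(task)) / 3, 2)
    print(task, per_measure, mean_gain)
# single-document [3.58, 10.36, 3.62] 5.85
# multi-document [5.78, 11.15, 3.48] 6.8
```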

⁵ Applying the sign test at a 5 % significance level, we verified that the improvement obtained with textual entailment recognition is statistically significant.

As for the output generated by our system when textual entailment and word frequency are applied together to each news cluster, evaluated against the manual abstracts and extracts provided in the DUC 2002 corpus (Table 2), we observe that comparing our system's output with extracts produced by human experts yields an average improvement of 29.73 % across the three measures over the results obtained by comparing with abstracts. This simply indicates that, since most current systems produce extracts (Radev, Hovy, and McKeown, 2002) rather than abstracts, the results would be more representative if extracts were available as reference summaries for evaluating our systems. Nevertheless, although there have been attempts to generate abstracts automatically (Kan, McKeown, and Klavans, 2001; Daumé III et al., 2002), this task remains a challenge.

5. Conclusions and future work

This paper presents two approaches to automatic summarization, for both single-document and multi-document summaries. The first approach considers only word frequency: from the frequencies we assign a score to each sentence of the text and select the highest-scoring sentences for the final summary. The second approach incorporates, as a preprocessing step, textual entailment recognition between sentences in order to reduce the size of the document or documents, and analyses whether combining textual entailment with word frequency produces results confirming that the joint use of these two sources of information is useful in the summarization process. The results presented in Section 4 support our working hypothesis and provide preliminary results that are the starting point for future research.

As future work, we plan to incorporate other techniques into the system to give it more knowledge, such as anaphora resolution and the resolution of temporal expressions, as well as semantic knowledge drawn from WordNet, which provides relations such as synonymy and hypernymy. In addition, one priority for future work is to study in greater depth the existing mechanisms for detecting and removing redundancy in one or more documents, and to compare them with the textual entailment technique proposed in this paper. Another aspect to consider is evaluating the generated summaries with complementary measures, using automatic evaluation tools that are not based solely on recall.

References

Alonso, Laura, Irene Castellón, Salvador Climent, María Fuentes, Lluís Padró, and Horacio Rodríguez. 2004. Approaches to Text Summarization: Questions and Answers. Revista Iberoamericana de Inteligencia Artificial, ISSN 1137-3601, (22):79–102.

Daumé III, Hal, Abdessamad Echihabi, Daniel Marcu, Dragos Stefan Munteanu, and Radu Soricut. 2002. GLEANS: A Generator of Logical Extracts and Abstracts for Nice Summaries. In Workshop on Text Summarization (in conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia.

Edmundson, H. P. 1969. New Methods in Automatic Extracting. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization, pages 23–42. MIT Press.

Ferrández, Óscar, Daniel Micol, Rafael Muñoz, and Manuel Palomar. 2007. A perspective-based approach for solving textual entailment recognition. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 66–71, Prague, June. Association for Computational Linguistics.

Giampiccolo, Danilo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The Third PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague, June. Association for Computational Linguistics.

Glickman, Oren. 2006. Applied Textual Entailment. Ph.D. thesis, Bar Ilan University.

Goldstein, Jade, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 Workshop on Automatic summarization, pages 40–48, Morristown, NJ, USA.

Gotti, Fabrizio, Guy Lapalme, Luka Nerima, and Eric Wehrli. 2007. GOFAISUM: A Symbolic Summarizer for DUC. In the Document Understanding Workshop (presented at the HLT/NAACL), Rochester, New York USA.

Harabagiu, Sanda, Andrew Hickl, and Finley Lacatusu. 2007. Satisfying information needs with multi-document summaries. Information Processing & Management, 43(6):1619–1642.

Hirao, Tsutomu, Yutaka Sasaki, Hideki Isozaki, and Eisaku Maeda. 2002. NTT's Text Summarization system for DUC-2002. In Workshop on Text Summarization (in conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization), Philadelphia.

Hovy, Eduard. 2005. The Oxford Handbook of Computational Linguistics, chapter Automated Text Summarization, pages 583–598. Oxford University Press.

Kan, Min-Yen, Kathleen R. McKeown, and Judith L. Klavans. 2001. Applying Natural Language Generation to Indicative Summarization. In 8th European Workshop on Natural Language Generation, Toulouse, France.

Kuo, June-Jei and Hsin-Hsi Chen. 2008. Multidocument Summary Generation: Using Informative and Event Words. ACM Transactions on Asian Language Information Processing (TALIP), 7(1):1–23.

Lin, Chin-Yew. 2004. ROUGE: a package for automatic evaluation of summaries. In Proceedings of ACL Text Summarization Workshop, pages 74–81.

Lin, Chin-Yew and Eduard Hovy. 2002. Automated Multi-document Summarization in NeATS. In Proceedings of the Human Language Technology (HLT) Conference. San Diego, CA, pages 50–53.

Luhn, H. P. 1958. The Automatic Creation of Literature Abstracts. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization, pages 15–22. MIT Press.

Mihalcea, Rada. 2004. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, page 20.

Qiu, Li-Qing, Bin Pang, Sai-Qun Lin, and Peng Chen. 2007. A Novel Approach to Multi-document Summarization. In Proceedings of the 18th International Workshop on Database and Expert Systems Applications (DEXA 2007), 3-7 September 2007, Regensburg, Germany, pages 187–191.

Radev, Dragomir R., Eduard Hovy, and Kathleen McKeown. 2002. Introduction to the special issue on summarization. Computational Linguistics, 28(4):399–408.

Sparck Jones, Karen. 2007. Automatic summarising: The state of the art. Information Processing & Management, 43(6):1449–1481.

Steinberger, Josef, Massimo Poesio, Mijail A. Kabadjov, and Karel Ježek. 2007. Two uses of anaphora resolution in summarization. Information Processing & Management, 43(6):1663–1680.

Svore, Krysta M., Lucy Vanderwende, and Christopher J.C. Burges. 2007. Enhancing Single-Document Summarization by Combining RankNet and Third-Party Sources. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 448–457.

Tatar, Doina, Emma Tamaianu-Morita, Andreea Mihis, and Dana Lupsa. 2008. Summarization by Logic Segmentation and Text Entailment. In Proceedings of Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), pages 15–26.

Wan, Xiaojun, Jianwu Yang, and Jianguo Xiao. 2006. The Great Importance of Cross-Document Relationships for Multi-document Summarization. In Proceedings of the 21st International Conference Computer Processing of Oriental Languages (ICCPOL 2006), Singapore, pages 131–138.

Uso de Grafos de Conceptos para la Generación Automática de Resúmenes en Biomedicina∗

Concept-graphs based Biomedical Automatic Summarization using UMLS

Laura Plaza Morales, Alberto Díaz Esteban, Pablo Gervás

Universidad Complutense de Madrid

C/ Profesor José García Santesmases, s/n, Madrid 28040, [email protected], [email protected], [email protected]

Resumen: Uno de los principales problemas en la investigación sobre generación automática de resúmenes (GAR) es la falta de utilización de conocimiento de dominio, que se refleja en la incorrecta interpretación semántica del documento y la baja calidad de los resúmenes obtenidos. En este trabajo se propone un método de extracción de oraciones para la GAR de artículos biomédicos, mediante el mapeo del documento a los conceptos de la ontología UMLS, y la representación del documento y de sus oraciones como grafos. La selección de las oraciones relevantes se realiza a partir de la conectividad de los conceptos que contienen en el grafo del documento. Se muestran los resultados empíricos preliminares de la aplicación de distintas heurísticas para la selección de las oraciones del resumen, y se identifican algunos problemas y líneas de trabajo futuras.

Palabras clave: Generación automática de resúmenes, Unified Medical Language System (UMLS), redes libres de escala, artículo biomédico, ontología.

Abstract: One of the main problems in research on automatic summarization is the inaccurate semantic interpretation of the source, which is reflected in the deficiencies of the resulting summaries. Using domain-specific knowledge, such as that supplied by ontologies, can considerably alleviate the problem. In this paper, we introduce an ontology-based extractive method for summarization. It is based on mapping the text to concepts in the ontology and representing the document as a scale-free graph. To assess the importance of the sentences, we compute the centrality of their concepts in the text. We have applied our approach to summarizing scientific biomedical literature, taking advantage of free resources such as UMLS. Preliminary empirical results are presented and pending problems are identified.

Keywords: automatic summarization, degree-based methods, Unified Medical Language System (UMLS), scale-free network, biomedical article, ontology.

1. Introduction

In today's society, the amount of electronic documentation accessible from any place and any device is growing exponentially. Thanks to the technological advances of recent decades, storing and accessing it is no longer a problem, but time remains a valuable and limited resource. This situation particularly affects the field of medicine, where digital resources are many and highly varied. In this context, automatic summarization (hereafter GAR, from the Spanish generación automática de resúmenes), whether informative or merely indicative, can clearly be very useful. In medical practice, having summaries of patients' records can help professionals act more quickly when treating medical emergencies, while in training and research, summaries can help determine whether a document is of interest and whether or not it merits a thorough reading.

∗ This research is funded by the Ministerio de Educación y Ciencia (TIN2006-14433-C02-01) and by the Universidad Complutense de Madrid and the Dirección General de Universidades e Investigación de la Comunidad Autónoma de Madrid (CCG07-UCM/TIC-2803).

In recent years, in response to this explosion of information, new linguistic resources have appeared for processing it. Dictionaries, thesauri, lexical databases and large biomedical knowledge bases, many of them publicly available, make it easier to build natural language processing systems and improve their chances of success.

On the other hand, building generic summaries that are completely independent of context is an ideal that is still far from being achieved. Restricting the problem to a specific domain, biomedicine, and a specific type of document, the scientific article, undoubtedly reduces the complexity of the process and leads to higher-quality summaries.

This work proposes a sentence extraction method for the GAR of biomedical articles, based on mapping the document to the concepts of the UMLS biomedical ontology and representing the document and its sentences as graphs. Relevant sentences are selected according to the connectivity, in the document graph, of the concepts they contain.

The rest of the paper is organized as follows. Section 2 gives an overview of the GAR problem and the state of the art. Section 3 describes some of the most popular biomedical knowledge bases and justifies the choice of UMLS for this work. Section 4 presents the GAR method developed. Section 5 shows the results and the evaluation of the system. Finally, Section 6 discusses the conclusions drawn and outlines some possible lines of future work.

2. Related Work

According to Sparck-Jones (Sparck-Jones, 1999), a summary is a transformation of a text through the reduction of its content, either by selection or by generalization of what is important. The information presented in the summary depends on the needs of the user. Whereas adaptive summaries select the content that is of interest to the reader, generic summaries try to preserve the author's point of view and the original organization of the text. In addition, depending on the number of documents involved in producing the summary, one can distinguish between single-document and multi-document summaries. Although the most recent work focuses on the latter, results in single-document summarization still show notable deficiencies in terms of content and grammatical coherence.

A high-level classification of GAR systems distinguishes between those that use extraction techniques, that is, that generate summaries composed entirely of material from the original document, and those that use abstraction techniques, that is, that generate summaries including content that is not present, at least explicitly, in the input. Although humans typically summarize by abstraction, most current research still focuses on extraction.

Systems based on sentence extraction perform a shallow analysis of the text and do not go beyond the syntactic level. Early work was limited to locating key segments in the original using statistical criteria, such as the frequency of words in the document (Luhn, 1958; Edmundson, 1969); positional criteria, which take into account the position of each sentence (Brandow, Mitze, and Rau, 1995); and linguistic criteria, which assess the presence of certain indicative expressions or words (Edmundson, 1969). Many works (Edmundson, 1969) combine some or all of these criteria, while more sophisticated approaches use machine learning techniques to determine the set of features that perform best in extraction (Kupiec, Pedersen, and Chen, 1995; Lin, 1999).

In recent years, approaches that, like this work, use graph-based algorithms to represent the structure of documents and produce the summary have gained relevance (Yoo, Hu, and Song, 2007). A highly representative example of this kind of approach is presented in (Erkan and Radev, 2004), which addresses multi-document GAR. The authors propose building a graph for the whole set of texts, with one vertex per sentence, each represented by its frequency vector (tf*idf), and compute the similarity between sentences using the

Laura Plaza Morales, Alberto Díaz Esteban y Pablo Gervás


cosine metric. However, this approach presents some important problems, which stem from ignoring the semantic structure of the document and the relations between its terms (synonymy, hypernymy, homonymy, co-occurrence or semantic associations). To illustrate one of these problems, consider the following sentences taken from (Yoo, Hu, and Song, 2007).

1. Cerebrovascular disorders during pregnancy results from any of three major mechanisms: arterial infarction, hemorrhage, or venous thrombosis

2. Central nervous system diseases during gestation results from any of three major mechanisms: arterial infarction, hemorrhage, or venous thrombosis

Since the two sentences use different terms for the same ideas (for example, pregnancy and gestation), an approach based purely on term frequencies cannot determine that they share a common meaning.

The proposed method tries to overcome this problem. To this end, we adopt an approach in which the document is represented as a graph built from the UMLS concepts associated with its terms, extended with their hypernyms and associative relations. Unlike the work of (Yoo, Hu, and Song, 2007; Erkan and Radev, 2004), which focuses on building clusters of sentences to determine the themes shared by multiple documents and on identifying the central sentences of each cluster, in this work the clustering algorithm is applied to identify groups of closely related concepts. These groups delimit the subtopics covered in a text, and their presence in a sentence determines its degree of relevance. The approach has the additional advantage of being easily extensible to multi-document GAR.

3. Ontologies and Linguistic Resources for Biomedicine: UMLS

Biomedical ontologies provide an organizing framework for the concepts involved in biological entities and processes, in a system of hierarchical and associative relations that makes it possible to reason about domain knowledge. The most widely used in information retrieval are undoubtedly SNOMED¹, UMLS² and MeSH³.

This work uses UMLS (Unified Medical Language System), a system developed by the United States National Library of Medicine that consists of three components: the Metathesaurus, a multilingual database with information about biomedical concepts and their relations; the Semantic Network, which provides a classification of the concepts represented in the Metathesaurus; and the Specialist Lexicon, which includes biomedical terms together with syntactic, morphological and orthographic information about them.

UMLS offers several advantages over the other ontologies mentioned. First, it provides greater semantic richness, since it gathers the vocabulary and organization of several ontologies, including MeSH and SNOMED. Second, it allows restricting the knowledge sources to be consulted, which makes it possible to compare different terminologies. Third, it covers vocabularies in several languages, which is very interesting for future work on multilingual information access. Finally, it is part of a very active project, backed by a considerable number of applications that use it (PubMed, the NLM Indexing Initiative or the NCI Enterprise Vocabulary Services).

4. Automatic Summarization

This section presents the proposed method through the successive stages that lead to the summary. The documents used in the experiments come from the corpus developed by the publisher BioMed Central⁴, specifically designed for text mining research. It contains more than 23,900 published full-text articles, including a tagged and

¹ SNOMED International. URL: http://www.snomed.org/snomedct

² NLM Unified Medical Language System (UMLS). URL: http://www.nlm.nih.gov/research/umls

³ NLM Medical Subject Headings (MeSH). URL: http://www.nlm.nih.gov/mesh/

⁴ BioMed Central: http://www.biomedcentral.com/

Uso de Grafos de Conceptos para la Generación Automática de Resúmenes en Biomedicina


structured XML version of them, which makes it possible to identify the different elements of the article (title, abstract, authors, sections, keywords, etc.).

As a preliminary step, the article undergoes a preprocessing stage in which the text is split into tokens, tagged with part-of-speech information and divided into sentences. This is done with the Tokenizer, Part of Speech Tagger and Sentence Splitter modules of the GATE library⁵. Finally, generic words are removed using a stop list⁶, as are terms with a high frequency in the document, since they are of no use for discriminating between important and irrelevant content.
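The preprocessing stage can be illustrated with a minimal Python sketch. It does not use GATE: the regular-expression splitter and tokenizer, the short `STOP_WORDS` set and the `max_doc_freq` threshold are all assumptions made for illustration, and part-of-speech tagging is left out.

```python
import re
from collections import Counter

# Hypothetical stand-in for the PubMed stop list cited in the text.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "was", "were", "for", "as", "with"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lower-case word tokens; hyphenated terms are kept whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())

def preprocess(text, max_doc_freq=0.5):
    """Return one list of content tokens per sentence.

    A token is dropped if it is a stop word or if it occurs in more than
    `max_doc_freq` of the sentences (an assumed notion of "high frequency").
    """
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentence_count = Counter(tok for sent in sentences for tok in set(sent))
    limit = max_doc_freq * len(sentences)
    return [
        [t for t in sent if t not in STOP_WORDS and sentence_count[t] <= limit]
        for sent in sentences
    ]
```

With three short sentences about the trial, words such as "trial" or "doxazosin" that occur in most sentences are filtered out together with the stop words, leaving only the terms that distinguish one sentence from another.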

To clarify how the algorithm works, the description refers throughout to a concrete GAR example developed for one of the documents in the corpus. An excerpt of that document is shown below. The full text has a total of 60 sentences and can be found on the BioMed Central website⁷.

In April 2000, the first results of the Antihypertensive and Lipid Lowering Treatment to Prevent Heart Attack Trial (ALLHAT) were published summarizing the comparison between two of the four drugs studied (chlorthalidone and doxazosin) as initial monotherapy for hypertension [<abbr bid="B1">1</abbr>]. This prospective, randomized trial was designed to compare a diuretic (chlorthalidone) with long-acting (once-a-day) drugs among different classes: angiotensin-converting enzyme inhibitor (lisinopril); calcium-channel blocker (amlodipine); and alpha blocker (doxazosin)...

4.1. Construction of the Document Graph

The goal of this stage is to build a graph representation of the document in which the vertices represent the UMLS concepts associated with each term and the edges represent the is-a relations between those concepts.

To this end, the terms in the sentences of the document are mapped to ontology concepts using MetaMap⁸ (Aronson, 2001). MetaMap is a tool developed by the National Library of Medicine (NLM), initially intended for use

⁵ GATE (General Architecture for Text Engineering): http://gate.ac.uk/

⁶ PubMed StopWords: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#Stopwords

⁷ BioMed Central: www.biomedcentral.com/content/download/xml/cvm-2-6-254.xml

⁸ MetaMap Transfer (MMTx): http://mmtx.nlm.nih.gov/

in indexing and retrieving MEDLINE articles, and now used in all kinds of biomedical NLP tasks. Using MetaMap in this project has two main attractions. First, MetaMap uses the UMLS Specialist Lexicon to normalize the terms of the document, considering all their possible morphological variants before retrieving the associated concept from the Metathesaurus. Second, since the UMLS Metathesaurus may return several concepts for the same term, it performs the disambiguation needed to determine which concept is correct given the context of the sentence. In addition, indexing is performed by selecting n-grams rather than individual terms. This not only yields a more precise semantic interpretation but also considerably reduces the size of the graph.

Next, the extracted concepts are expanded with their hypernyms through the is-a relations of the ontology, and the hierarchy that represents the sentence is built. Figure 1 shows the tree for the sentence "The goal of the trial was to assess cardiovascular mortality and morbidity for stroke, coronary heart disease and congestive heart failure, as an evidence-based guide for clinicians who treat hypertension."
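The hypernym expansion can be sketched as follows. The `IS_A` dictionary is a toy hierarchy invented for illustration; it is not taken from the UMLS Metathesaurus, and a real system would obtain both the concepts and their parents from MetaMap and UMLS.

```python
# Toy is-a hierarchy (child -> parent). The concept names are illustrative
# and are NOT taken from the UMLS Metathesaurus.
IS_A = {
    "congestive heart failure": "heart failure",
    "heart failure": "heart disease",
    "coronary heart disease": "heart disease",
    "heart disease": "cardiovascular disease",
    "cardiovascular disease": "disease",
}

def ancestors(concept, is_a=IS_A):
    """Return the chain [concept, parent, grandparent, ..., root]."""
    chain = [concept]
    while chain[-1] in is_a:
        chain.append(is_a[chain[-1]])
    return chain

def sentence_graph(concepts, is_a=IS_A):
    """Union of the is-a edges (child, parent) on each concept's path to the root."""
    edges = set()
    for concept in concepts:
        chain = ancestors(concept, is_a)
        edges.update(zip(chain, chain[1:]))
    return edges
```

Concepts that share hypernyms, such as the two heart conditions above, end up in the same subtree, which is what lets the graph capture relatedness that surface terms miss.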

Then, each edge that links a concept to its parent is assigned a weight that increases with the depth of the concept in the hierarchy; that is, the more specific the concepts it connects, the greater the weight. The weights are computed using a measure of similarity between sets (Rada et al., 1989), according to expression (1), where α is the set of all the ancestors of a given concept, including the concept itself, and β is the set of all the ancestors of the concept at the immediately higher level (Figure 2).

$$\frac{|\alpha \cap \beta|}{|\alpha \cup \beta|} = \frac{|\beta|}{|\alpha|} \qquad (1)$$
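Expression (1) is the Jaccard similarity between two ancestor sets, and because β is contained in α it reduces to |β|/|α|. A small Python sketch shows how the weight grows with depth. It assumes that β includes the parent concept itself, and the single-letter hierarchy in the usage note is hypothetical.

```python
def ancestor_set(concept, is_a):
    """All ancestors of `concept` in a child -> parent map, including itself."""
    result = {concept}
    while concept in is_a:
        concept = is_a[concept]
        result.add(concept)
    return result

def edge_weight(child, is_a):
    """Jaccard similarity between the ancestor sets of a concept and its parent."""
    alpha = ancestor_set(child, is_a)
    beta = ancestor_set(is_a[child], is_a)
    return len(alpha & beta) / len(alpha | beta)
```

For a chain a ← b ← c ← d, written as `{"d": "c", "c": "b", "b": "a"}`, the edge from b to the root gets 1/2, the edge below it 2/3 and the deepest edge 3/4, so edges between more specific concepts weigh more.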

Finally, the graphs of the individual sentences are merged into a single graph, which is completed with the associated-with relations between the UMLS semantic groups to which the concepts belong. The weight of these links is computed following the same criterion as for the is-a relations.



Figure 1: Graph of a sentence

For example, the concepts trial and hypertension are associated, since their respective semantic types (Research Activity and Disease or Syndrome) are linked by an associated-with relation in UMLS (Figure 2).

Figure 2: Weight assignment and associated-with relations

4.2. Concept Clustering. Identification of Subtopics

The purpose of this stage is to group the concepts of the document graph using a connectivity-based clustering algorithm (degree-based method) (Erkan and Radev, 2004), where each cluster can be seen as a network of concepts that are closely related semantically. In this context, each cluster represents a theme of the document and, within each cluster, the central concepts (centroids) provide the necessary and sufficient information about that theme.

We start from the hypothesis that the resulting graph is an instance of a scale-free network (Barabasi and Albert, 1999). A scale-free network is a specific kind of complex network in which a few nodes are highly connected (hub nodes), that is, they have a large number of links to other nodes, while the degree of almost all other nodes is quite low.

Following (Yoo, Hu, and Song, 2007), the prestige or salience of each vertex (v_i) is defined as the sum of the weights of all the edges (e_j) that have that vertex as source or target, according to expression (2).

$$\mathrm{salience}(v_i) = \sum_{e_j \,\mid\, \exists v_k \,\wedge\, e_j \text{ connects } (v_i, v_k)} \mathrm{weight}(e_j) \qquad (2)$$

Next, the n vertices with the highest salience are selected and grouped into Hub Vertex Sets (HVS), which constitute the centroids of the clusters to be built. Each remaining vertex is assigned to the cluster to whose vertices it is most strongly connected, and the HVS and the assigned vertices are readjusted in an iterative process. For the running example, 4 clusters were generated.
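The salience computation and the choice of hub vertices can be sketched in Python as below. The edge list format and the tie-breaking rule are assumptions, and the grouping of hubs into HVS and the iterative reassignment of the remaining vertices are not shown.

```python
from collections import defaultdict

def salience(weighted_edges):
    """Sum, for every vertex, the weights of the edges incident to it.

    `weighted_edges` is an iterable of (u, v, weight) triples; direction is ignored.
    """
    scores = defaultdict(float)
    for u, v, w in weighted_edges:
        scores[u] += w
        scores[v] += w
    return dict(scores)

def top_hubs(weighted_edges, n):
    """Return the n vertices with the highest salience (ties broken by name)."""
    scores = salience(weighted_edges)
    return sorted(scores, key=lambda v: (-scores[v], v))[:n]
```

In a scale-free graph only a few vertices accumulate a large salience, so a small n is usually enough to capture the candidates for cluster centroids.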

Finally, each sentence of the document is assigned to one of these clusters. This requires a measure of the similarity between the cluster and the sentence graph. Since the two representations differ greatly in size, classical graph similarity metrics (e.g. edit distance) are not suitable. Instead, a voting mechanism is used (Yoo, Hu, and Song, 2007). Each vertex (v_k) of a sentence (O_j) gives each cluster (C_i) in which it is present a score (w_{k,j}) that depends on whether or not it belongs to the HVS of that cluster, as in expression (3).

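$$\mathrm{similarity}(C_i, O_j) = \sum_{v_k \mid v_k \in O_j} w_{k,j} \qquad (3)$$

where

$$w_{k,j} = \begin{cases} 0 & \text{if } v_k \notin C_i \\ \gamma & \text{if } v_k \in HVS(C_i) \\ \delta & \text{if } v_k \in C_i \text{ and } v_k \notin HVS(C_i) \end{cases}$$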

The values of γ and δ have been set to 1.0 and 0.5 respectively, which means that concepts belonging to the HVS are given twice the importance of the remaining ones.
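The voting scheme can be written as a short Python function. The representation of clusters and hub sets as Python sets is an assumption, as is the requirement that the hub set be a subset of the cluster; the values of γ and δ are those given in the text.

```python
GAMMA = 1.0  # vote of a concept in the cluster's hub vertex set (value from the text)
DELTA = 0.5  # vote of any other concept of the cluster (value from the text)

def similarity(cluster, hub_set, sentence_concepts):
    """Sum of the votes that the concepts of a sentence give to one cluster.

    `cluster` and `hub_set` are sets of concepts, with hub_set assumed to be
    a subset of cluster; concepts outside the cluster contribute nothing.
    """
    total = 0.0
    for concept in sentence_concepts:
        if concept in hub_set:
            total += GAMMA
        elif concept in cluster:
            total += DELTA
    return total
```

A sentence with one hub concept, one ordinary cluster concept and one concept outside the cluster therefore scores 1.0 + 0.5 + 0 = 1.5 for that cluster.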

4.3. Selection of Relevant Sentences

The last step of the algorithm extracts complete sentences from the original text according to their semantic distance to the different clusters. The total number of sentences to select (N) depends on the compression rate used. Three heuristics have been investigated for this stage.

Heuristic 1: All clusters contribute to the summary with a number of sentences (n_i) proportional to their size. Thus, for each cluster, the n_i sentences with which it has the highest similarity are selected.

Heuristic 2: The largest cluster (that is, the one that represents the main theme of the document) is the only one that should be taken into account when generating the summary. Thus, the N sentences with which it has the highest similarity are selected.

Heuristic 3: For each sentence, the sum of its votes for all the clusters is computed, weighted by the size of the clusters, according to expression (4). The N sentences with the highest total score are selected.

$$\mathrm{score}(O_j) = \sum_{C_i} \frac{\mathrm{similarity}(C_i, O_j)}{|C_i|} \qquad (4)$$

Ordering the sentences is trivial for a single-document summary: they are simply taken in the order in which they appear in the original document.
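Heuristic 3 and the final ordering step can be sketched as follows. The sketch assumes that each vote total is divided by the size of its cluster, which is one reading of "weighted by size"; the matrix layout and the tie-breaking rule are also assumptions.

```python
def select_sentences(similarity, cluster_sizes, n):
    """Pick the n best sentences under heuristic 3 and return them in document order.

    similarity[i][j] is the vote total of sentence j for cluster i and
    cluster_sizes[i] is the number of concepts in cluster i. Dividing by the
    cluster size is an assumed reading of the size weighting.
    """
    n_sentences = len(similarity[0])
    scores = [
        sum(similarity[i][j] / cluster_sizes[i] for i in range(len(cluster_sizes)))
        for j in range(n_sentences)
    ]
    # Rank by descending score; ties go to the earlier sentence.
    ranked = sorted(range(n_sentences), key=lambda j: (-scores[j], j))[:n]
    return sorted(ranked)
```

Returning the indices sorted restores the original order of the sentences, which is all the ordering a single-document extract needs.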

5. Results and Evaluation

This section analyzes an example of the extract generated with each heuristic for the document presented at the beginning of Section 4, using a compression rate of 20%.

Table 1 lists the sentences selected by each heuristic, together with their scores. Although the results obtained are not statistically significant, their analysis reveals some aspects in which the algorithm behaves satisfactorily. It is notable that heuristics 1 and 3 rank sentence 0 as the most relevant, with a score far above that of the other sentences. This agrees with the positional criterion, adopted in many works, of selecting the first sentence of the document for the summary because it generally contains the most significant information. Heuristic 2 does not select this sentence because the largest cluster (that is, the one from which this heuristic extracts all the sentences) is number 2, whereas sentence 0 belongs to cluster 0. Sentence 58 is a clear example of a sentence that, placed at the end of the document, gathers conclusions about the exposition and therefore has a high information content. Sentence 19, in turn, exemplifies the overrating of very long sentences. For this sentence, the mapping to the UMLS Metathesaurus yields a total of 23 concepts, whereas the remaining sentences have around 10-12 concepts. As a result, and even though most of the concepts it contains are not central (that is, they do not belong to the HVS), it receives a high score (20.0). Another point worth noting is that heuristics 1 and 3 share a large number of sentences, specifically 9 of the 12 selected, while heuristic 2 departs considerably from the other two. In fact, a detailed analysis of the output of the second heuristic shows that this strategy ignores some important topics of the original text. Finally, sentence 28 illustrates the inconsistency problems typical of extraction methods: it is not self-contained, and it makes no sense to include it in the summary unless the preceding sentence is included as well.

Since the articles in the corpus come with the abstract written by their author, it is interesting to compare it with the output of the different heuristics. Although the lengths of the summaries differ significantly (from the 3 sentences of the abstract to the 12 sentences of the automatic summary), heuristics 1 and 3 cover all the topics addressed in the abstract. First, sentences 0 and 4 present the starting point of the study, while sentences 15, 17, 19, 20 and 25 present results of different investigations within ALLHAT and of treatments with doxazosin. Both groups of sentences cover the content of the first sentence of the abstract. Sentences 43 and 58, in turn, warn of the low effectiveness of doxazosin therapies against cardiovascular diseases (sentences 2 and 3 of the abstract). Heuristic 2, however, does not satisfactorily cover the content of the abstract.

6. Conclusions and Future Work

This paper has presented a method for the GAR of biomedical texts, based on representing the document as an extended graph of UMLS concepts and relations and on computing the relevance of the sentences to be extracted from the prestige or salience of their concepts in that graph. This yields a representation richer in knowledge than a vector space model would provide and makes it possible to address the problems identified in Section 2.

Section 5 examined several heuristics for extracting the sentences of the summary. As a result, heuristic 2 was found not to cover all the important content, while also selecting sentences of little relative relevance. We therefore conclude that it is not suitable for the problem. As for heuristics 1 and 3, pending a formal evaluation, they produce very similar results and cover all the important topics.

In addition, the work carried out so far has highlighted the great complexity of the task and pointed to some shortcomings and possible improvements. First, the method extracts complete sentences, which means that longer sentences, because they contain more concepts, are more likely to be selected. One solution to consider would be to divide the score of each sentence by the number of concepts it contains.
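The proposed length normalization amounts to a single division. The sketch below is only an illustration of that idea; the function name is hypothetical, and returning 0.0 for a sentence without concepts is an assumption.

```python
def normalized_score(score, n_concepts):
    """Score per concept, so that long sentences are not favoured by length alone.

    Sentences with no mapped concepts get 0.0 (an assumed convention).
    """
    return score / n_concepts if n_concepts else 0.0
```

With this correction, a 23-concept sentence scoring 20.0 would obtain about 0.87 per concept, so it would only outrank a shorter sentence whose concepts are individually less central.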

Second, concepts with a very general meaning contribute no information for identifying the topics or for discriminating between relevant and irrelevant sentences. They can therefore be removed from the graphs, which gives more compact representations and, consequently, improves the performance of the application. UMLS semantic types can be used to identify the terms associated with very general concepts. For example, terms of the types functional concept, temporal concept, entity, idea or concept and language could be ignored.

Another problem to be solved is the extraction of sentences whose content depends on other sentences that have not been selected. To resolve this kind of inconsistency, one could implement a strategy that detects words or groups of words acting as sentence connectors and does not select the sentences containing them unless the immediately preceding sentence is also selected.
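One possible form of this connector filter is sketched below. The `CONNECTORS` tuple is a hypothetical list, and the sketch only checks whether a sentence opens with a connector, which is a simplification of detecting connectors anywhere in the sentence.

```python
# Hypothetical list of sentence connectors; a real system would need a fuller one.
CONNECTORS = ("however", "therefore", "moreover", "in addition", "thus")

def drop_dangling(selected, sentences, connectors=CONNECTORS):
    """Remove a selected sentence that opens with a connector unless the
    sentence immediately before it has also been kept.

    `selected` holds sentence indices; `sentences` is the full document.
    """
    kept = []
    for idx in sorted(selected):
        opens_with_connector = sentences[idx].lower().startswith(connectors)
        if opens_with_connector and (idx - 1) not in kept:
            continue
        kept.append(idx)
    return kept
```

An alternative design would add the missing predecessor instead of dropping the dependent sentence, at the cost of exceeding the target summary length.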

A possible modification of the algorithm to generate user-adapted summaries is also being studied. This would require a user model, understood as a representation of the user's interests and preferences (Díaz and Gervás, 2004).

Another line of future work will be to extend the method to produce summaries from multiple documents on the same topic. As mentioned in Section 2, the modifications will not involve substantial changes, although some additional problems arising from the use of several sources will have to be solved (avoiding redundancy, ordering the sentences, etc.).

Heuristic 1
Sentence   0     4     19    58    7     28    25    20    21    8     43    15
Score      99.0  20.0  19.0  18.5  17.0  16.5  16.0  15.5  15.5  13.5  13.5  12.0

Table 1: Results

Finally, to verify the adequacy of the method, a large-scale evaluation is being carried out on the documents of the BioMed Central corpus, based on the ROUGE-1, ROUGE-L and ROUGE-W measures (Lin, 2004) used in the DUC (Document Understanding Conferences)⁹. This evaluation will serve to tune parameters and to assess objectively the quality of each of the heuristics defined.
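As an illustration of what ROUGE-1 measures, the sketch below computes unigram recall of a candidate summary against a single reference. It is not the official ROUGE package: it handles one reference only and assumes the texts are already tokenized, with no stemming or stop-word handling.

```python
from collections import Counter

def rouge_1_recall(candidate_tokens, reference_tokens):
    """Fraction of reference unigrams that also occur in the candidate.

    Matches are clipped, so a word counts at most as often as it appears
    in the candidate.
    """
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Here the author's abstract would play the role of the reference and the automatic extract that of the candidate.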

⁹ Document Understanding Conference: http://duc.nist.gov/

References

Aronson, A. R. 2001. Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program. In Proceedings of American Medical Informatics Association.

Barabasi, A.L. and R. Albert. 1999. Emergence of scaling in random networks. Science, 286:509–512.

Brandow, R., K. Mitze, and L. F. Rau. 1995. Automatic Condensation of Electronic Publications by Sentence Selection. Information Processing and Management, 5(31):675–685.

Díaz, A. and P. Gervás. 2004. User-Model Based Personalized Summarization. Information Processing and Management, 43(6):1715–1734.

Edmundson, H. P. 1969. New Methods in Automatic Extracting. Journal of the Association for Computing Machinery, 2(16):264–285.

Erkan, G. and D. R. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research (JAIR), 22:457–479.

Kupiec, J., J.O. Pedersen, and F. Chen. 1995. A Trainable Document Summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 68–73.

Lin, C-Y. 1999. Training a Selection Function for Extraction. In Proceedings of the Eighteenth Annual International ACM Conference on Information and Knowledge Management (CIKM), pages 55–62, Kansas City.

Lin, C-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain.

Luhn, H. P. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159–165.

Rada, R., H. Mili, E. Bicknell, and M. Blettner. 1989. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, pages 17–30.

Sparck-Jones, K. 1999. Automatic Summarizing: Factors and Directions. In I. Mani and M.T. Maybury, Advances in Automatic Text Summarization. The MIT Press.

Yoo, I., X. Hu, and I.Y. Song. 2007. A coherent graph-based semantic clustering and summarization approach for biomedical literature and a new summarization evaluation method. BMC Bioinformatics, 8(9).



Semantics and Pragmatics

Determining the Semantic Orientation of Opinions on Products – a Comparative Analysis

Análisis comparativo de métodos para determinar la polaridad de opiniones sobre productos

Alexandra Balahur

Universidad de Alicante, DLSI

Apartado de Correos 99

E-03080, Alicante

[email protected]

Andrés Montoyo

Universidad de Alicante, DLSI

Apartado de Correos 99

E-03080, Alicante

[email protected]

Abstract: The high volume of user feedback on products, in the form of reviews and forum or blog posts, is helpful both to prospective buyers and to the companies that produce them. However, automatically determining the semantic orientation of the opinions expressed on different products and their features is a complex problem that requires a series of steps: identifying the product features, extracting the opinion words present in a text and, finally, classifying them as positive or negative. This article concentrates on three approaches to solving the latter problem. One method determines the polarity of the opinions expressed on the product features using the sentiment-bearing words in WordNet Affect (Strapparava and Valitutti, 2004). The two other methods determine the polarity of opinion holders (feature attributes) using Support Vector Machines Sequential Minimal Optimization (Platt, 1998) machine learning with the Normalized Google Distance (Cilibrasi and Vitanyi, 2006) and with Latent Semantic Analysis (Deerwester et al., 1990), respectively, on a specialized versus a non-specialized corpus of user reviews. We compare the methods, show the advantages and disadvantages of each, and report the results of an evaluation on our opinion mining and summarization system.

Keywords: opinion mining, summarization, Support Vector Machines Sequential Minimal Optimization, Normalized Google Distance, Latent Semantic Analysis.

Resumen: The large number of opinions that users express about product features in blogs, forums and other documents on the Internet is of great help to prospective buyers and to the companies that make the products. However, automatically determining whether a user holds a positive or negative opinion about a product or its features is a complex problem whose solution requires several steps. First the product features must be identified, then the terms that express the user's opinion must be extracted, and finally the product must be classified as positive or negative. This article describes a method for summarizing the positive and negative comments on a product from the opinions that users express about its features. The problem is tackled with several approaches. First, the sentiment-bearing words that appear in WordNet Affect (Strapparava and Valitutti, 2004) are used. Then, a machine learning method (Support Vector Machines Sequential Minimal Optimization (Platt, 1998)) is applied to the similarity measures known as Normalized Google Distance (Cilibrasi and Vitanyi, 2006) and Latent Semantic Analysis (Deerwester et al., 1990). The results obtained with these similarity measures are compared and then analyzed to present the advantages and drawbacks of applying them to the opinion mining and summarization system.

Palabras clave: opinion mining, summarization, Support Vector Machines Sequential Minimal Optimization, Normalized Google Distance, Latent Semantic Analysis.



1 Introduction

The multitude of products of every category currently available on the market offers the prospective buyer the opportunity to choose according to personal needs, but also brings the difficulty of choice and the need for detailed information on product capabilities. At the same time, recent years have brought a large amount of public user feedback on products, in the form of reviews on e-commerce sites, forums or blogs. Nevertheless, the high volume of text containing this information makes it impossible for a potential customer to review all relevant data, while reading only part of the reviews can result in misinformation or bias.

The present paper concentrates on methods for determining the semantic orientation of opinions in the task of automatically mining user reviews on products and presenting potential buyers with summaries of the positive and negative opinions expressed on the product and its features. This task is known in the literature as “feature-based opinion mining and summarization” (Hu and Liu, 2004). For a given product, producing a feature-driven summary of the opinions expressed on its features is equivalent to producing an output of the form (feature, percentage of positive opinions, percentage of negative opinions). The process consists of three distinct steps. The first involves discovering the potential features that will be commented on in the product reviews; the second is identifying the opinions expressed in reviews on each of the features; the last consists in summarizing the polarity of the opinions expressed on each feature as percentages of positive and negative opinions.
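The last of the three steps can be illustrated with a short Python sketch that turns already classified opinions into (feature, percentage of positive opinions, percentage of negative opinions) triples. The input format, the polarity labels and the function name are assumptions made for illustration; feature discovery and polarity classification are taken as given.

```python
from collections import defaultdict

def summarize(opinions):
    """Aggregate (feature, polarity) pairs into percentage triples.

    `polarity` must be 'positive' or 'negative'. Returns a list of
    (feature, % positive, % negative) tuples sorted by feature name.
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for feature, polarity in opinions:
        counts[feature][polarity] += 1
    summary = []
    for feature in sorted(counts):
        total = counts[feature]["positive"] + counts[feature]["negative"]
        summary.append((
            feature,
            100.0 * counts[feature]["positive"] / total,
            100.0 * counts[feature]["negative"] / total,
        ))
    return summary
```

For instance, two positive and one negative opinion on a hypothetical "battery" feature yield roughly 66.7% positive and 33.3% negative for that feature.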

It is important to note the difference between this task and the classical definition of summarization (Ding, Liu and Yu, 2008), as this particular type of summarization refers only to summarizing the polarity of opinions expressed about features.

The present paper has the following

structure: in section 2 we describe related work

in feature-based summarization of customer

reviews. Section 3 delimits the problem we

intend to solve and our contribution, relating to

our feature-driven opinion summarization

system, whose extended description can be

found in (Balahur and Montoyo, 2008). The

contribution of the present work is described in

the next two sections: in section 4 we present a

comparative analysis of two methods for the

classification of the opinion polarity using

SVM SMO with the Normalized Google

Distance and Latent Semantic Analysis,

respectively. In section 5, we explore a method

to extract feature polarity using subjective

phrases constructed with the help of emotion

words found in WordNet Affect (Strapparava

and Valitutti, 2004). The next section shows the

results obtained when evaluating our system

employing the two methods in section 4 and the

approach described in section 5. Finally, we

conclude on our approach and sketch the

directions for future work.

2 Related work

Recent years have brought about a growing

interest in the field of opinion mining and

sentiment analysis. The high number of

applications, such as multi-perspective question

answering, automatic market research or

recommender systems, has motivated

extensive research on classifying the polarity

of documents (Riloff and Wiebe, 2003; Dave,

Lawrence and Pennock, 2003; Pang, Lee and

Vaithyanathan, 2002; Turney, 2002) and of

sentences (Wilson, Wiebe and Hwa, 2004;

Hatzivassiloglou and Wiebe, 2000).

The idea of feature-based summarization of

opinions expressed in customer reviews was

proposed in (Hu and Liu, 2004). The approach

the authors describe is lexicon-based and

consists in discovering frequent features using

association mining and determining the

semantic orientation of opinions as polarity of

adjectives (as opinion holders) that features are

described by. The classification of adjectives is

done using an initial list of seeds which is

completed using the WordNet synonymy and

antonymy relations. Infrequent features are

deduced using the opinion holders. However,

the lack of a well-organized structure

of features and sub-features of products means

that, for example, the summarization of

Alexandra Balahur y Andrés Montoyo


opinions is done for 720 features for an mp3

player (Ding, Liu and Yu, 2008). The question

that arises is: would a user in a real-life

situation be interested in whether the edges of a

camera are round or flat and what the previous

buyers think about that, or would a potential

buyer like to see if the design of the product is

fine or not, according to the many criteria

developed by buyers to assess this feature? The

work does not approach implicit features and

does not classify the orientation of adjectives

depending on the context. A solution to the

latter problem is presented in (Ding, Liu and

Yu, 2008), where the authors take a holistic

approach to classifying adjectives, that is,

they consider not only the local context in which

they appear next to the feature they determine,

but also other adjectives appearing with the

feature and their polarity in different contexts.

In (Popescu and Etzioni, 2005), a more

complex approach is used for feature-based

summarization of opinions, employing web

PMI (Pointwise Mutual Information) statistics

for the explicit feature extraction and a

technique called relaxation labeling for the

assignation of polarity to the opinions. In this

approach, dependency parsing is used together

with ten extraction rules that were developed

intuitively.

Our approach differs from and improves on

previous work in a series of aspects. Firstly, we

employ anaphora resolution and dependency

parsing to ensure that features extracted and the

opinions expressed are on the product we are

interested in, and the review does not comment

on a related one, nor does a near opinion word

express a positive or negative thought about

other topics than the product. Secondly, we

employ an offline method to determine product

features and sub-features, which allows the

system to gather for summarization the

opinions expressed on sub-features (as in the

case of “edge”) to the feature they correspond

(“design”). We also determine product specific

feature attributes with polarity, as well as

compute the set of learning examples of feature-

specific opinion words. This is accomplished

using a corpus of opinions over the same

product class, which is structured in two

sections: arguments in favor and against (“pros

and cons-style reviews”). Thirdly, the

classification of opinion words is feature-

dependent and does not rely on the local context

in which they appear, but on a larger semantic

context. In the case of employing the

Normalized Google Distance scores for the

machine learning algorithm, the context is the

World Wide Web; in the case of LSA scores,

the context is given by the corpus from which

the model is learnt. We show the manner in

which all these factors influence the system

performance and at what cost.

Last, but not least, many of the opinions on

products are expressed in an indirect manner,

that is, not relating the product or its features

with polarity words, but expressing an emotion

about them. We propose a set of patterns to

extract such indirectly expressed opinions using

the emotion lists from WordNet Affect.

3 Problem definition and contribution

In the task of feature-driven opinion mining

and summarization, the aim is to identify in

user reviews the opinions expressed on the

product and its features, determine if they are

positive or negative and summarize them as

percentages.

In order to fulfill the first step, the features

of the product that will be commented upon in

the reviews must be determined. Proposed

methods include association mining (Hu and

Liu, 2004), the use of WordNet (Fellbaum,

1999) relations (Popescu and Etzioni, 2005) or

WordNet and commonsense knowledge in

ConceptNet (Liu and Singh, 2004), as shown in

(Balahur and Montoyo, 2008). In the present

approach, we decided to add to the extracted

knowledge the structured information found on the

same sites that contain the product reviews, under

the “technical details” section.

In the second step, the extraction is done

using patterns, in the case of lexicon-based

approaches, or rules, in the case of dependency

parsing solutions.

The last step consists in classifying opinions

as positive or negative. In the method described

by (Hu and Liu, 2004), the words considered as

opinion holders are adjectives and they are

classified using a core of annotated adjectives

and the synonymy and antonymy relations in

WordNet. However, this approach has a serious

problem: the polarity

of feature attributes (or opinion holders, as they

are called by the authors) is feature dependent.

For example, “large” in the context of an

LCD screen is a positive attribute, whereas in

the context of a mobile phone, for instance, it is

a negative one. A remedy to this problem is

sought in (Ding, Liu and Yu, 2008), where the

Determining the Semantic Orientation of Opinions on Products – a Comparative Analysis


larger context is taken into consideration in the

classification of opinion holders – using

conjunction rules and the polarity of the opinion

holders the adjectives to be classified appear

with. The shortcoming of this approach is that

opinion expression in user reviews is mostly an

enumeration of qualities and faults of the

products. In (Popescu and Etzioni, 2005), a complex

function is employed to classify the polarity of

opinions, based on the polarity of the

surrounding context, but this approach too has

the shortcoming that user reviews tend to be

short and the negative and positive aspects are

mixed without any specific order. (Kim and

Hovy, 2006), on the other hand, use the

statistics given by the Pointwise Mutual

Information score together with the number of

search engine hits of target words and positive

and, respectively, negative words.

In our approach in (Balahur and Montoyo,

2008), we use the classification of feature

specific attributes on the one hand and of

opinion words extracted from “pros and cons”-

style reviews for each of the product categories

we are interested in. For English, sites that

contain such types of reviews are

“newegg.com” and “eopinions.com”, which are

American, or “shopping.com”, on which the

regional site can be chosen also for European

countries. For Spanish, sites containing reviews

in the form of pros and cons (“a favor” and “en

contra” or “ventaja” and “desventaja”) are

“quesabesde.com” or “ciao.es”. If the opinion

word is contained in a “pro” section, then the

extracted words are classified as positive

feature attributes. In the contrary case, the word

is classified as negative. For example:

Pros: Beautiful pictures, ease of use, high

quality, 52mm lens.

Cons: high price, a bit big and bulky.

Features encountered in text: picture, use,

quality, lens, price.

Feature attributes extracted: (positive):

beautiful (picture), easy (use), high (quality),

52mm (lens); (negative): high (price).

Feature attributes remaining: big, bulky,

which are both negative and correspond,

according to the feature categorization made in

section 4, to the “size” feature.
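The labeling step in the example above can be sketched as follows. The sketch assumes that the (attribute, feature) pairs have already been extracted from each section of the review; the function `label_attributes` is a hypothetical name used only for illustration.

```python
def label_attributes(pros, cons):
    """Label (attribute, feature) pairs by the review section they came from."""
    labeled = {}
    for attribute, feature in pros:
        labeled[(attribute, feature)] = "pos"
    for attribute, feature in cons:
        labeled[(attribute, feature)] = "neg"
    return labeled

# Pairs taken from the example review above; "big" and "bulky" are
# mapped to the "size" feature, as in the text.
pros = [("beautiful", "picture"), ("easy", "use"),
        ("high", "quality"), ("52mm", "lens")]
cons = [("high", "price"), ("big", "size"), ("bulky", "size")]
labels = label_attributes(pros, cons)
```

Keying the labels on the pair rather than on the attribute alone keeps “high” positive for “quality” and negative for “price”, which is the feature dependence discussed in this section.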

Our solution to the problem of feature

attributes classification is using machine

learning with two measures of similarity. On

the one hand, we employ the Normalized

Google Distance, which gives a measure of the

strength of relationship between two considered

words at the level of the entire WWW and on

the other hand, we use the LSA, which gives

the same measure of strength, but at a local

corpus level. Classifying the feature attributes

according to these scores and taking into

consideration 6 anchor words that relate each

word with the feature and known polarities, we

show how the classification of feature attributes

can be done in the feature context.

Last, but not least, in the reviews to be

mined and summarized, other opinion

words can be found and other manners of

expressing opinion can be encountered, such as

those describing emotional states related to the

product (for example, “I love this camera”)

or to using it. Methods to solve these problems

are discussed in section 5, where we show the

list of patterns we used to extract from the

reviews such phrases containing emotions to

express opinions of the different product

features using the words associated to different

emotions from WordNet Affect. In the

evaluation section, we show how the use of

such patterns raised the recall of the

system by 12%, while the precision of classification

rose by the same degree.

4 Comparative experiments

In our previous approach, in order to assign

polarity to each of the identified feature

attributes of a product, we employed SMO

SVM machine learning and the Normalized

Google Distance (NGD). In this approach, we

complete the solution with a classification

employing Latent Semantic Analysis with

Support Vector Machines classification.

For the NGD classification, we consider a

set of anchors containing the terms

{featureName, happy, unsatisfied, nice, small,

buy}, that relate to all possible classes of

products, as well as give an orientation to

product feature attributes.

Further on, we build the classes of positive

and negative examples for each of the feature

attributes considered. From the list of classified

feature attributes in the pros and cons reviews,

we consider all positive and negative terms

associated to the considered attribute features.

We then complete the lists of positive and

negative terms with their WordNet synonyms.

Since the number of positive and negative

examples must be equal, we will consider from

each of the categories a number of elements

equal to the size of the smallest set among the



two, with a size of at least 10 and at most 20. As an example, we give the classification of the feature attribute “tiny” for the “size” feature. The set of negative feature attributes considered contains 15 terms such as “big”, “broad”, “bulky”, “massive”, “voluminous”, “large-scale” etc., and the set of positive feature attributes is composed of the opposed examples, such as “small”, “petite”, “pocket-sized”, “little” etc. We use the anchor words to convert each of the 30 training words into a 6-dimensional training vector defined as v(j,i) = NGD(w_i, a_j), where a_j, with j ranging from 1 to 6, are the anchors and w_i, with i ranging from 1 to 30, are the words from the positive and negative categories. After obtaining the total of 180 values for the vectors, we use SMO SVM to learn to distinguish the product-specific nuances. For each new feature attribute we wish to classify, we calculate a new vector vNew(j, word) = NGD(word, a_j), with j ranging from 1 to 6, and classify it using the same anchors and the trained SVM model.
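The construction of the anchor vectors can be sketched as follows. The sketch uses the definition of the Normalized Google Distance given by Cilibrasi and Vitanyi, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f denotes page counts and N the size of the index. All page counts and the value of N below are invented for illustration, the helper names are hypothetical, and the SVM training step is left out.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy: hits for each term alone; fxy: hits for both terms together;
    n: assumed number of indexed pages.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def ngd_vector(word, anchors, hits, pair_hits, n):
    """Build the 6-dimensional vector v(j) = NGD(word, a_j)."""
    return [ngd(hits[word], hits[a], pair_hits[(word, a)], n) for a in anchors]

# Anchors for the "size" feature; featureName is replaced by "size".
anchors = ["size", "happy", "unsatisfied", "nice", "small", "buy"]

# Hypothetical page counts, not real search engine results.
N = 10**9
hits = {"tiny": 2_000_000, "size": 50_000_000, "happy": 40_000_000,
        "unsatisfied": 500_000, "nice": 30_000_000,
        "small": 60_000_000, "buy": 70_000_000}
pair_hits = {("tiny", "size"): 400_000, ("tiny", "happy"): 100_000,
             ("tiny", "unsatisfied"): 2_000, ("tiny", "nice"): 150_000,
             ("tiny", "small"): 900_000, ("tiny", "buy"): 200_000}

vector = ngd_vector("tiny", anchors, hits, pair_hits, N)
```

A vector of this kind, computed for each training word, would form one row of the SVM training set; the same function gives the vector of a new feature attribute at classification time.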

In the example considered, we obtained the following results (V1 to V6 denote the six scores relating each word to the six anchors, that is, the NGD scores in the NGD classification and the LSA scores in the LSA classification; “pol” refers to the polarity of the feature attribute):

FA      V1    V2    V3    V4    V5    V6    pol
small   1.52  1.87  0.82  1.75  1.92  1.93  pos
big     2.27  1.19  0.86  1.55  1.16  1.77  neg
bulky   1.33  1.17  0.92  1.13  1.12  1.16  neg
little  1.44  1.84  0.80  1.64  2.11  1.85  pos
tiny    1.51  1.41  0.82  1.32  1.60  1.36  pos

Table 1. Example NGD scores

The vector for the feature attribute “tiny”

was classified by SVM as positive, using the

training set specified above.

For the LSA classification, as in the case of

NGD, we consider a set of anchors containing

the terms {featureName, happy, unsatisfied,

nice, small, buy}.

The classes of positive and negative examples are built exactly as in the NGD case, from the feature attributes classified in the pros and cons reviews, completed with their WordNet synonyms and balanced to between 10 and 20 examples per class. For the same example, the feature attribute “tiny” of the “size” feature, the 30 training words are converted into 6-dimensional training vectors defined as v(j,i) = LSA(w_i, a_j), where a_j, with j ranging from 1 to 6, are the anchors and w_i, with i ranging from 1 to 30, are the training words. SMO SVM is trained on the resulting 180 values, and each new feature attribute is classified from its vector vNew(j, word) = LSA(word, a_j), with j ranging from 1 to 6, using the same anchors and the trained SVM model.

We first applied the classification using the corpus provided for training with the Infomap software package. The dashed entries mark the words that were not found in the corpus, for which an LSA score could therefore not be computed.

FA      V1    V2    V3   V4    V5    V6    pol
Small   0.76  0.74  ---  0.71  1     0.71  pos
Big     0.80  0.75  ---  0.74  0.73  0.68  neg
Bulky   ---   ---   ---  ---   ---   ---   pos
Little  ---   ---   ---  ---   ---   ---   neg
Tiny    0.81  0.71  ---  0.80  0.73  0.72  ---

Table 2. LSA scores on non-specialized corpus

On the other hand, we employed the

classification on a corpus made up of reviews

on different electronic products, gathered using

the Google API and a site restriction on

“amazon.com”.

In the table below, we show an example of

the scores obtained with LSA on the feature

attributes classified for the feature “size”. The

vector for the feature attribute “tiny” was

classified by SVM as positive, using the

training set specified above.

FA      V1    V2    V3    V4    V5    V6    pol
small   0.83  0.77  0.48  0.72  1     0.64  pos
big     0.79  0.68  0.74  0.73  0.77  0.71  neg
bulky   0.76  0.67  0.71  0.75  0.63  0.78  neg
little  0.82  0.76  0.52  0.71  0.83  0.63  pos
tiny    0.78  0.70  0.65  0.67  0.71  0.71  pos

Table 3. LSA scores on specialized corpus



Tables 4 and 5 below show the precision (P) and kappa statistic (k) values obtained in some example classifications made with NGD and LSA for different product features, for digital camera reviews and mobile phone reviews, respectively.

Feature           NGD          LSA (NonOpinion)   LSA (Opinion)
(digital camera)  P     k      P     k            P     k
price             0.75  0.5    0.83  0.7          0.83  0.7
quality           0.78  0.45   0.84  0.7          0.84  0.7
design            0.75  0.45   --    --           0.85  0.65
size              0.80  0.6    --    --           0.85  0.7
resolution        0.83  0.5    0.84  0.6          0.85  0.7
zoom              0.8   0.6    --    --           0.86  0.6
display           0.78  0.6    0.78  0.5          0.84  0.7
software          0.8   0.5    --    --           0.82  0.5

Table 4. Results NGD versus LSA digital camera

Feature           NGD          LSA (NonOpinion)   LSA (Opinion)
(mobile phone)    P     k      P     k            P     k
price             0.75  0.5    0.83  0.7          0.83  0.7
quality           0.75  0.5    0.84  0.7          0.84  0.5
design            0.78  0.45   --    --           0.85  0.65
size              0.8   0.5    --    --           0.85  0.7
display           0.8   0.45   0.7   0.4          0.83  0.5
memory            0.75  0.5    --    --           0.87  0.6
camera            0.75  0.45   --    --           0.84  0.7

Table 5. Results NGD versus LSA mobile phone
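For readers unfamiliar with the two measures reported in tables 4 and 5, the following sketch computes them for a toy classification. It assumes that precision is measured on the positive class and that k is Cohen's kappa, that is, agreement corrected for chance; the exact evaluation setup behind the tables is not specified here, and the function name and data are hypothetical.

```python
def precision_and_kappa(gold, predicted, positive="pos"):
    """Return (precision on the positive class, Cohen's kappa)."""
    n = len(gold)
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == positive and p == positive for g, p in pairs)
    pred_pos = sum(p == positive for p in predicted)
    precision = true_pos / pred_pos if pred_pos else 0.0
    # Observed agreement between gold and predicted labels.
    observed = sum(g == p for g, p in pairs) / n
    # Agreement expected by chance from the label frequencies.
    labels = set(gold) | set(predicted)
    expected = sum((gold.count(l) / n) * (predicted.count(l) / n)
                   for l in labels)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0
    return precision, kappa

# Toy example: four feature attributes, one positive attribute missed.
gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
p, k = precision_and_kappa(gold, predicted)
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.5, which gives a kappa of 0.5 with a precision of 1.0 on the positive class.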

The conclusion that can be drawn from the

results presented is that the main advantage in

using the first method of polarity assignment is

that NGD is language independent and offers a

measure of semantic similarity taking into

account the meaning given to words in all texts

indexed by Google from the World Wide Web.

On the other hand, using the whole Web corpus

can also add significant noise. Therefore, we

employ Latent Semantic Analysis at a local

level, both on a non-specialized corpus and

on a corpus containing customer reviews. As

we will show, the classification using LSA on a

specialized corpus brings an average of 8% of

improvement in the classification of polarity

and a rise of 0.20 in the kappa measure, leading

to an 8% overall improvement in the precision

of the summarization system. However, these

results were obtained using a specialized corpus

of opinions, which was previously gathered

from the WWW. In this respect, it is important

to determine sources (web sites, blogs or

forums) specific to each of the working

languages, from which to gather the corpus on

which the LSA model can be built. Using LSA

on a non-specialized corpus improved the

classification to the same degree as the

classification on a specialized corpus in the

cases where the specific pairs of words to be

classified were found in the corpus. However,

in 41% of the cases, the classification failed due

to the fact that the words we tried to classify

were not found in the corpus.

5 Feature polarity extraction using

subjective phrases

As observed before, some opinions on the

product or its features are expressed indirectly,

with subjective phrases containing positive or

negative emotions which are related to the

product name, product brand or its features. In

order to identify those phrases, we have

constructed a set of rules for extraction, using

the emotion lists from WordNet Affect. For the

words present in the “joy” emotion list, we

consider the phrases extracted as having a

positive opinion on the product or the feature

contained. For the words in the “anger”,

“sadness” and “disgust” emotion lists, we

consider the phrases extracted as having a

negative opinion on the product or the feature

contained. Apart from the emotion words, we

have considered a list of “positive words”

(pos_list), containing adverbs such as

“definitely”, “totally”, “very”, “absolutely” and

so on, as words positively stressing an

idea (Iftene and Balahur, 2007), that influence

the polarity of the emotion expressed and

that are often found in user reviews. We present

the extraction rules in table 6 (verb_emotion,

noun_emotion and adj_emotion correspond to

the verbs, nouns and adjectives, respectively,

found in the emotion lists from WordNet Affect

under the emotions “joy”, “sadness”, “anger”

and “disgust”). The emotion “surprise”, when

expressed about a product and its

features, can have both a positive and a

negative connotation. Therefore, we have

chosen not to include the terms expressing this

emotion in the extraction patterns.

Rule ID  Rule pattern
1        I [pos_list*] [verb_emotion] [this | the | my] [product_name | product_feature]
2        I ([am | ’m | was | feel | felt]) ([pos_list**]) [adj_emotion] [with | about | by] [product_name | product_feature]
3        I [feel | felt] [noun_emotion] [about | with] [product_name | product_brand]
4        I [pos_list*] [recommend] [this | the] [product_name | product_brand]
5        I ([don’t]) [think | believe] [sentence**]
6        [It] [’s | is] [adj_emotion] [how | what] [product_name | product_feature] [product_action***]
7        [You | Everybody | Everyone | All | He | She | They] [will | would] [verb_emotion] [this | the] [product_name | product_brand | feature]

Table 6. Extraction patterns for subjective opinion phrases
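As an illustration, rule 1 of table 6 could be approximated with a regular expression along the following lines. This is only a sketch under several assumptions: the asterisk on pos_list is read as “zero or more occurrences”, the emotion verbs are small hypothetical stand-ins for the WordNet Affect lists, the target words are invented, and negation is not handled.

```python
import re

POS_LIST = ["definitely", "totally", "very", "absolutely"]
VERB_JOY = ["love", "like", "enjoy"]       # hypothetical stand-ins for the "joy" list
VERB_NEG = ["hate", "dislike", "regret"]   # hypothetical stand-ins for the negative lists
TARGETS = ["camera", "lens", "zoom"]       # hypothetical product name and features

def alt(words):
    """Build a regex alternation from a list of literal words."""
    return "|".join(re.escape(w) for w in words)

# Rule 1: I [pos_list*] [verb_emotion] [this | the | my] [product_name | product_feature]
RULE_1 = re.compile(
    r"\bI\s+(?:(?:%s)\s+)*(?P<verb>%s)\s+(?:this|the|my)\s+(?P<target>%s)\b"
    % (alt(POS_LIST), alt(VERB_JOY + VERB_NEG), alt(TARGETS)),
    re.IGNORECASE,
)

def match_rule_1(sentence):
    """Return (target, polarity) if the sentence matches rule 1, else None."""
    m = RULE_1.search(sentence)
    if not m:
        return None
    polarity = "pos" if m.group("verb").lower() in VERB_JOY else "neg"
    return m.group("target").lower(), polarity
```

For instance, `match_rule_1("I love this camera")` yields a positive opinion on “camera”, while a sentence without an emotion verb yields no match. The other rules could be written in the same way, one compiled pattern per row of the table.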

6 Evaluation and discussion

We have performed a comparative analysis of the system employing the SMO SVM polarity classification using NGD, LSA on a specialized corpus, the rules for subjective phrases, and their combinations, on the corpus used in (Balahur and Montoyo, 2008) and on the 5-review corpus from (Hu and Liu, 2004). The results in table 7 were obtained on our own annotated corpus:

NGD          LSA          Rules        NGD+Rules    LSA+Rules
P     R      P     R      P     R      P     R      P     R
0.80  0.79   0.88  0.87   0.32  0.6    0.89  0.85   0.93  0.93

Table 7. System results on own corpus

In the case of the 5-review corpus of (Hu and Liu, 2004), it is important to note that, as opposed to the annotation made in the corpus, we first mapped the features identified to the general feature of the product (for example, “fit” refers to “size” and “edges” refers to “design”), as we believe that in real-life situations a user benefits more from a summary on coarser classes of product features. We also took into account a set of sentences that were not annotated in the corpus, such as “You’ll love this camera”, which expresses a positive opinion on the product. The results are shown in table 8:

NGD          LSA          Rules        NGD+Rules    LSA+Rules
P     R      P     R      P     R      P     R      P     R
0.81  0.80   0.85  0.88   0.28  0.5    0.89  0.85   0.93  0.93

Table 8. System results (Hu and Liu, 2004) corpus

The results shown are compared against the baseline of 0.20 precision and 0.41 recall, which was obtained using only the features determined as in (Balahur and Montoyo, 2008) and the feature attributes whose polarity was computed from the “pros and cons”-style reviews. As can be seen, the best results are obtained when using the combination of LSA with the rules for subjective phrase extraction. However, gathering the corpus for the LSA model can be a costly process, whereas NGD scores are straightforward to obtain and classifying with them is less costly in terms of time and resources.

An interesting direction to study is the impact of employing LSA for the gradual learning and correction of a system that uses NGD for classifying the polarity of feature attributes. In such a self-learning scheme, the “online” classification would be that of NGD. However, the classification of the new feature attributes can later be improved “offline” using the classification given by LSA, which can then be used as better training data for learning the polarity of new feature attributes by the “online” NGD classification.

7 Conclusions and future work

In this paper, we presented a method to assign polarity to feature attributes in a feature-dependent manner, using the scores showing the relational strength between two words, given by the Normalized Google Distance and Latent Semantic Analysis, and classification with SMO SVM machine learning. The main advantage of polarity assignment based on NGD scores is that it is language independent and offers a measure of semantic similarity that takes into account the meaning given to words in all texts indexed by Google from the World Wide Web. The main advantage of using LSA on a specialized corpus, on the other hand, is that it eliminates the noise given by the multiple senses of words. We completed the opinion extraction on different product features with rules using the words present in WordNet Affect, as indicative of indirectly expressed opinions on products.

We showed how all the employed methods led to significant growth in the precision and recall of our opinion mining and summarization system.

Future work includes a more thorough and systematic organization of product categories and features with their corresponding attributes and the fuzzy analysis of text for the detection of misspellings and grammar errors.

References

Balahur, A., Montoyo, A. 2008. Building a

Recommender System Using Community

Level Social Filtering. To appear in the

Proceedings of the 5th International

Workshop on Natural Language Processing

and Cognitive Science (NLPCS 2008),

ICEIS 2008.

Cilibrasi, R., Vitanyi, P. 2006. Automatic

Meaning Discovery Using Google. IEEE

Journal of Transactions on Knowledge and

Data Engineering .

Dave, K., Lawrence, S., Pennock, D. 2003.

Mining the Peanut Gallery: Opinion

Extraction and Semantic Classification of

Product Reviews. In: Proceedings of

WWW-03.

Deerwester, S., Dumais, S., Furnas, G. W.,

Landauer, T. K., Harshman, R. 1990.

Indexing by Latent Semantic Analysis.

Journal of the American Society for

Information Science 41 (6): 391-407.

Ding, X., Liu, B., Yu, P. 2008. A Holistic

Lexicon-Based Approach to Opinion

Mining. In Proceedings of ACM WSDM,

2008.

Fellbaum, C. (ed.). 1999. WordNet: An

Electronic Lexical Database, MIT Press,

Cambridge, Massachusetts.

Hu, M., Liu, B. 2004. Mining and Summarizing

Customer Reviews. In Proceedings of KDD

2004.

Iftene, A., Balahur-Dobrescu, A. 2007.

Hypothesis Transformation and Semantic

Variability Rules Used in Recognizing

Textual Entailment. In Proceedings of the

ACL-PASCAL Workshop on Textual

Entailment and Paraphrasing.

Kim, S.M., Hovy, E.H. 2006. Identifying and

Analyzing Judgement Opinions. In

Proceedings of HLT-NAACL 2006, ACL,

pp. 200-207

Liu, H., Singh, P. 2004. ConceptNet: A

Practical Commonsense Reasoning Toolkit.

BT Technology Journal, To Appear. Volume

22, forthcoming issue. Kluwer Academic

Publishers.

Pang, B., Lee, L. and Vaithyanathan, S. 2002.

Thumbs up? Sentiment Classification Using

Machine Learning Techniques. EMNLP

2002.

Platt, J. 1998. Sequential Minimal

Optimization: A Fast Algorithm for Training

Support Vector Machines. Microsoft

Research Technical Report MSR-TR-98-14.

Popescu, A.-M., Etzioni, O. 2005. Extracting

product features and opinions from reviews.

In Proceedings of the conference on Human

Language Technology and Empirical

Methods in Natural Language Processing

(HLT/EMNLP 2005).

Riloff, E. and Wiebe, J. 2003. Learning

extraction patterns for subjective

expressions. EMNLP 2003.

Strapparava, C., Valitutti, A. 2004. WordNet-Affect:

an affective extension of WordNet.

In Proceedings of the 4th International

Conference on Language Resources and

Evaluation (LREC 2004), Lisbon, Portugal,

pp. 1083-1086.

Turney, P. 2002. Thumbs Up or Thumbs

Down? Semantic Orientation Applied to

Unsupervised Classification of Reviews.

ACL’02.

Wilson, T., Wiebe, J. and Hwa, R. 2004. Just

how mad are you? Finding strong and weak

opinion clauses. AAAI’04, 2004.



Methodological approach for pragmatic annotation

Aproximación Metodológica para la Anotación Pragmática

Javier Calle Universidad Carlos III Av. Universidad n 30,

28911 Leganés [email protected]

David del Valle Universidad Carlos III Av. Universidad n 30,


Jesica Rivero Universidad Carlos III Av. Universidad n 30,


Dolores Cuadra Universidad Carlos III Av. Universidad n 30,


Resumen: En el desarrollo de sistemas basados en la interacción, es necesario transcribir y analizar una gran cantidad de corpus. Este análisis es desarrollado a través de distintos niveles lingüísticos, entre los que se encuentra el nivel pragmático. Este trabajo presenta cómo desarrollar esta tarea y describe las partes relevantes del conocimiento que deben tratarse en este análisis. Para ello se presenta un conjunto de tareas para realizar la anotación pragmática del corpus con el objetivo de conseguir una metodología que facilite este trabajo a los desarrolladores y asegure completitud y rigor en el modelado de este conocimiento. Con estas propiedades se conseguirá que los corpus anotados a través de esta metodología puedan ser reutilizados, de manera total o parcial, en otros dominios de interacción. Palabras clave: Anotación Pragmática, Análisis de corpus, Sistemas de Interacción Natural

Abstract: Developing corpus-based interaction systems requires acquiring, transcribing, and analyzing a significant amount of corpus data. Such analysis should be performed at several linguistic levels, one of which is pragmatics. This work proposes how to perform that task and describes part of the relevant knowledge to be captured through that analysis. The approach presents a set of steps to be followed during the pragmatic annotation of a corpus. The proposed steps aim to guide dialogue coders to attain completeness in the analysis and to maximize agreement in their joint work.

Keywords: Pragmatic annotation, corpus analysis, natural interaction systems

1 Introduction

Recent years have witnessed an increasing interest in systems that interact like humans, in order to reach those potential users who lack the technological skills to interact with computers but are able to interact with other humans. For a computer to behave like this, interactive knowledge and the reasoning mechanisms over it have to be modeled. Since that knowledge is complex and of a diverse nature, it has often been divided into several categories or types of interactive knowledge, which are analyzed and modeled separately and brought together through a Cognitive Architecture (Calle et al., 2006).

Linguistic knowledge usually gives rise to several components in such knowledge distributions. On the one hand, there are the expressive components, for understanding the semantic content of users’ messages (their literal meaning) and for expressing the system’s own. Thus, there should be components for voice recognition and synthesis, and others for Natural Language Processing. On the other hand, from the moment the literal semantics of the user’s message have been clarified until the semantic content of the system’s intervention is established, a group of reasoning processes operates over other sorts of knowledge that determine the real interpretation of the interaction state and the behavior of the system. Among these other types of knowledge (emotional, situational, etc.), a particular subset of linguistic knowledge stands out: pragmatics.

Pragmatic knowledge should cover that whole gap in human abilities: from the literal meaning representation of any participant’s intervention to the production of the next one, through resolving references, taking presuppositions into account, inferring implicatures, discovering the underlying individual and mutual intentions, structuring the interaction, settling the proper effects of each intervention, and others. Among them, because of their crucial importance for natural interaction, stand out the mechanisms for checking the health of the interaction and applying reinforcement techniques when needed. The so-called Dialogue Model usually covers much of this knowledge, sometimes helped by others intimately linked with it, such as the Task Model or the Session Model (Cuadra et al., 2008).

In any case, to formalize and implement such knowledge it is essential to acquire it first. Part of the knowledge can be considered general (domain independent), described by some pragmatic theory, and formalized in a knowledge model. The rest depends on the particular Interaction Domain and should be acquired through the analysis of a sample (an interaction corpus). The corpus should therefore be analysed and pragmatically annotated for later implementation, and this work proposes a methodology for doing so. The proposed steps aim to guide dialogue coders towards completeness in the analysis and to maximize their agreement in their joint work. Other interactive knowledge appearing in the corpus (emotional, circumstantial, etc.) should also be annotated and processed, but is left for further work.

The paper is structured as follows. Proposals related to pragmatic annotation are presented in section 2. The proposed methodology is outlined in section 3. Sections 4 and 5 explain in detail the individual and global pragmatic annotation of the corpus. Finally, some conclusions and future work are presented.

2 Premises and Related Work

Pragmatic annotation has classically been applied at three levels (Gibbon et al., 2000): micro-level, meso-level, and macro-level. First, the minimum meaningful functional units should be identified through micro-level annotation and marked with utterance tags (or dialogue acts). Second, meso-level annotation should give rise to sequences, differentiating the initiation from the development of a subdialogue, which is represented by a dialogue game (Levin & Moore, 1977). Some other authors consider these functional units as common ground units, adding features for attaining mutual knowledge of them by both participants (see Nakatani & Traum, 1998; Clark, 1996). Finally, the macro-level differentiates the major subdialogues (transactions), which are immediately below the whole interaction and develop its main tasks or intentions, from the minor ones (act exchanges). These three levels are similar to the four proposed by Sinclair & Coulthard (1975) (transaction, exchange, move and act), and can be found in several works, such as the structure annotation of the 128 dialogues within the HCRC Map Task corpus (Carletta et al., 1996).

To analyze a dialogue, it is essential to divide it into small structural and functional units. The first such unit is the intervention, the realization of a turn by a participant. Although this structural unit seems to occur sequentially (alternating between both participants), in real interaction two adjacent interventions might be performed by the same participant, or might even overlap (two participants can intervene simultaneously). Smaller structural units can also be distinguished, such as sentences, and these in turn may involve several communicative acts (as extensions of those from speech act theories (Austin, 1962; Searle, 1969)). In sum, pragmatic analysis requires the corpus to be preprocessed at the syntactic, semantic and prosodic levels, so that it is structurally segmented into turns, sentences and acts. This last step, transcribing the corpus into communicative acts, is in fact part of the pragmatic (micro-level) annotation, but since that transcription only seeks the literal acts (ignoring their functional value), it is classified as part of the preprocessing.

Functional annotation can start from small structural units or look for larger ones. DAMSL (Dialogue Act Markup in Several Layers) (Allen & Core, 1997) bases its functional annotation on tagging utterances over four dimensions:

(i) Communicative status: whether the utterance is intelligible and successfully completed, uninterpretable, abandoned, or self-talk.

(ii) Information level: semantic content and relation to the underlying task. It can be tagged as task, task management, communication management, or other.

(iii) Forward-communicative-function: constraints on the interlocutors’ future beliefs and actions. Feasible tags are: statement (assert, ...), influencing future actions (directive), committing to future actions (offer, ...), and other.

(iv) Backward-communicative-function: how the utterance refers to previous parts of the dialogue, in a similar way. Following this criterion, the utterance can be tagged for agreement, understanding, answer, information relation, and antecedents (when it relates to more than just the preceding unit).

DAMSL also gives some cues on tagging other phenomena, such as speech repairs. These guidelines have been used and evolved by many other projects, such as ADAM (Cattoni et al., 2002), which not only annotates some pragmatic features for 450 dialogues but also annotates other levels (prosodic, morpho-syntactic, and semantic). MATE (Dybkjaer et al., 1998) covers similar linguistic levels, adding annotation for co-reference and communication problems. The Cast3LB project (Navarro et al., 2003) pursues the linguistic annotation of a Spanish corpus (in parallel with two other co-official languages, Catalan and Basque), also at several levels (morphological, syntactic and semantic). At the pragmatic level, Cast3LB covers only the co-reference of nominal phrases and anaphora annotation (for which it relies on a semi-automatic tagger that detects possible anaphoric elements and proposes resolutions for the human tagger to choose from). Finally, the Monroe project (Tetreault et al., 2004) should also be mentioned: its simple pragmatic annotation covers co-reference, speech acts, and an interesting view of segmentation (as generalized functional units).

Several corpus annotation projects have developed their own toolkits. One example is Dexter (Garretson, 2006), a free open-source suite of software tools for analyzing language data, initially developed for the MICASE corpus but reusable for other purposes.

3 Describing Pragmatic annotation through a methodological approach

This section establishes which sort of pragmatic information is going to be annotated and formalized from the corpus using this methodological approach. The knowledge will be acquired through the analysis of both the individual dialogues and the complete corpus. The latter process should not be tackled until the former is finished. Figure 1 shows the methodology in terms of its inputs, its steps, and the products that result from applying each step.

[Figure 1 (diagram not reproduced). Inputs: interactions in literal CA; communicative acts set. Steps of the individual analysis, repeated while dialogues are left: Segmentation, Intention Labelling, Attention Study, Commitment Evolution, Operative Annotation. Steps of the global analysis: Operative Global Annotation (with a task coverage check that may lead to new corpus acquisition), Intention Normalization, Commitment Learning. Products: segments & sequences; intentions & goals; focus changes; commitment events & reinforcement techniques; internal & external task definitions; operative matrix; normalized intentions set; commitment parameters (variations & thresholds).]

Figure 1: Pragmatic annotation methodology

The proposed methodology starts from any well-defined task-oriented Interaction Domain for which a complete corpus has been obtained and annotated (at every level prior to pragmatic analysis). The approach focuses on dialogues, that is, two-party interactions. A dialogue can be defined as a run of interventions, each being the performance of a turn by a participant in the dialogue (interventions are often produced alternately by both participants, but they might overlap, and this eventuality should also be annotated). To ease this task, the input should be pre-processed at the micro-level of annotation, so that interventions are described in the form of Communicative Acts (CA). These acts are the literal semantic representation of the interventions, based on a CA set (suited to the Interaction Domain) and an Ontology. These semantic structures should capture not only the literal content of the message but also prosodic information (pauses, transition relevance places, intonation, etc.). Thus, the absence of utterance will be represented as an act itself. Indirections, anaphora, ellipsis, and other linguistic phenomena should not be resolved in this pre-processing, since they are to be solved at the interaction level. However, referential elements should be identified and marked for later resolution.

The proposal is presented in two phases: individual analysis of each dialogue, and global processing of the corpus. Before going into them, some definitions are provided to make the exposition clear, since the terms could have different interpretations in other proposals.

Definition 1: Let us define a segment as a fragment of interaction that can be interpreted intentionally on its own (apart from context), has a functional sense in the dialogue, and has the features of a common ground unit. A segment is usually composed of one or more interventions, but not necessarily: it could be part of an intervention or even less (no intervention at all).

Definition 2: Let us define an intervention as the realization of a turn by a participant in the interaction. Any intervention consists of one or more discourses, a discourse being an uninterrupted fragment of an intervention that develops the same specific goal (intention).

Definition 3: Let us define a piece of discourse as an atomic functional unit of illocutionary understanding, which can be represented by a communicative act (CA) or by several complementary ones.

Each discourse is composed of one or more pieces. These discourses could make up one intervention, part of one, several consecutive or non-consecutive interventions or parts of them, or any combination of these. It should be pointed out that a segment could even be developed with no discourse at all.
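The three definitions can be read as a small containment model. The following is a minimal sketch of one possible encoding, not a data model given in this paper: the class names, field names, and the example acts (`greet()`, `request(ticket)`) are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Piece:
    """Atomic functional unit: one CA or several complementary ones."""
    acts: List[str]


@dataclass
class Discourse:
    """Uninterrupted fragment of one intervention developing one intention."""
    speaker: str
    intention: str
    pieces: List[Piece]

    def __post_init__(self):
        if not self.pieces:
            raise ValueError("a discourse is composed of one or more pieces")


@dataclass
class Intervention:
    """Realization of a turn by one participant: one or more discourses."""
    speaker: str
    discourses: List[Discourse]


@dataclass
class Segment:
    """Common ground unit; its discourses may come from several interventions."""
    label: str
    discourses: List[Discourse] = field(default_factory=list)  # may be empty


# One user turn carrying two discourses that belong to different segments.
greet = Discourse("user", "greeting", [Piece(["greet()"])])
ask = Discourse("user", "book_ticket", [Piece(["request(ticket)", "inform(date)"])])
turn = Intervention("user", [greet, ask])
booking = Segment("book_ticket", [ask])
silent = Segment("acknowledge")  # a segment developed with no discourse at all
```

Keeping segments and interventions as two independent groupings of the same discourses reflects the point that a segment need not coincide with turn boundaries.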

4 Individual Dialog Analysis Phase

The individual dialog analysis phase is composed of five steps: segmentation, intention labelling, attentional study, commitment evolution, and operative annotation.

4.1 Segmentation

To discover the segments throughout a dialogue, what happens during its development should be examined. To simplify this task, dialogues are to be represented. When a segment is found, its discourses and boundaries should be marked. This reveals its sequences:

Openings: this sequence determines the instantiation of the segment, given an interaction state (specific or generalized) and a sequence of discourse pieces.

Closings: this sequence is performed to finish the segment with interactive success.

Cloakings/Disclosings: sometimes a segment appears fragmented because it is interrupted to develop another segment that has nothing to do with it. The cloaking sequence reveals when the current segment is to be (temporarily) abandoned, while the disclosing sequence resumes it.

Cancellations/Recoverings: a cancellation is a particular case of closing, which finishes the segment before its interactive success is achieved. A recovering consists in re-opening either a finished or an abandoned segment (successfully ended or not). A segment whose development is deferred should be distinguished from one that is abandoned and later recovered.

Developments: the sequences used to progress the segment towards its interactive success are labelled as developments.

During the analysis of a segment, minor segments can be detected. The relation between a prior segment and a minor one will be named a decompositional link. If the relation is hierarchical, requisite and optional decompositional links should be distinguished. In the first case, the interactive success of the minor segment is crucial for the interactive success of the prior one. In the second, the interactive success of the minor segment has an influence on that of the prior one, but its occurrence and interactive success (though desirable) are not necessary to achieve it. Since each entire dialogue in the corpus is a segment itself, the process goes on recursively until no more minor segments are identified. If the relation holds at the same level of decomposition, sequential and serial relations are distinguished.
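The recursive decomposition can be pictured as a tree whose edges carry the kind of decompositional link. The sketch below is only an illustration under assumed names (`SegmentNode`, `Link`, `blocking_children`, and the example labels are hypothetical); it covers hierarchical links only and leaves out the sequential and serial relations between segments at the same level.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator, List, Tuple


class Link(Enum):
    REQUISITE = "requisite"  # minor segment must succeed for the prior one to succeed
    OPTIONAL = "optional"    # minor segment helps but is not necessary


@dataclass
class SegmentNode:
    label: str
    succeeded: bool = False
    children: List[Tuple["SegmentNode", Link]] = field(default_factory=list)

    def add(self, child: "SegmentNode", link: Link) -> "SegmentNode":
        self.children.append((child, link))
        return child


def walk(node: SegmentNode, depth: int = 0) -> Iterator[Tuple[int, SegmentNode]]:
    """Visit the whole dialogue recursively until no minor segments remain."""
    yield depth, node
    for child, _ in node.children:
        yield from walk(child, depth + 1)


def blocking_children(node: SegmentNode) -> List[str]:
    """Requisite minor segments that have not reached interactive success."""
    return [c.label for c, link in node.children
            if link is Link.REQUISITE and not c.succeeded]


# The entire dialogue is itself the top segment.
root = SegmentNode("book_trip")
root.add(SegmentNode("get_destination", succeeded=True), Link.REQUISITE)
root.add(SegmentNode("offer_insurance"), Link.OPTIONAL)
date = root.add(SegmentNode("get_date"), Link.REQUISITE)
date.add(SegmentNode("clarify_month", succeeded=True), Link.OPTIONAL)

outline = [(depth, node.label) for depth, node in walk(root)]
pending = blocking_children(root)
```

In this example only `get_date` blocks the success of `book_trip`; the unfinished `offer_insurance` does not, because its link is optional.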

4.2 Intention labelling

From the intentional point of view, each discovered segment is developed cooperatively by both participants, who share (mutual) knowledge and information about it. Hence, the first thing to do is to choose a meaningful label for the mutual goal related to each segment, together with some features: the formalization of its sequences (for example, by means of grammars or automata), the identification of the participant role that initiates the instance of the intention, the relationships with other already identified intentions (if any), and the links with individual goals. Individual goals represent feasible interests of each participant within the interaction domain. Each segment represents the development of an instance of a mutual goal (an intention); thus, there exist certain links between an intention and some individual goals of both participants, and such goals and links should also be annotated. A contextual space is defined as the set of pieces of static information characterizing the instance of an intention. Each intention instance therefore has a contextual space and has visibility over the spaces of prior intention instances. However, an intention instance has no access to the contextual spaces of minor intention instances, so the parts of them regarded as useful for the prior intention instance should be labelled as contributions (relevant context pieces to be inherited by the prior instance).
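The visibility rule for contextual spaces resembles lexical scoping: an instance sees its own space and those of prior instances, while a prior instance only receives what a minor one explicitly contributes. A minimal sketch under that reading follows; `IntentionInstance`, `lookup`, and `contribute` are hypothetical names, not part of the proposal.

```python
class IntentionInstance:
    def __init__(self, label, prior=None):
        self.label = label
        self.prior = prior   # the instance this one decomposes, if any
        self.context = {}    # contextual space: static pieces of information

    def lookup(self, key):
        """Search the own space, then prior instances' spaces (never minor ones)."""
        instance = self
        while instance is not None:
            if key in instance.context:
                return instance.context[key]
            instance = instance.prior
        raise KeyError(key)

    def contribute(self, *keys):
        """Copy the pieces labelled as contributions into the prior instance's space."""
        if self.prior is None:
            raise ValueError("no prior instance to contribute to")
        for key in keys:
            self.prior.context[key] = self.context[key]


trip = IntentionInstance("book_trip")
trip.context["passenger"] = "Ana"

date = IntentionInstance("get_date", prior=trip)
date.context.update({"date": "2008-09-10", "attempts": 2})

seen_from_minor = date.lookup("passenger")    # visible through the prior instance
hidden_before = "date" in trip.context        # the prior instance cannot see it yet
date.contribute("date")                       # only "date" is labelled as a contribution
```

After the call to `contribute`, the prior instance holds the date, while the bookkeeping piece `attempts` stays private to the minor instance.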

4.3 Attentional study

Once the dialogues are segmented and the intentions properly identified, it is possible to find out which intention is being developed at any time within each dialogue. Since any intention needs to receive the attention of both participants in order to be developed, the intention being developed at a particular point receives the attention and is thus named the focused intention (or focus, although this term is usually applied to more complex information, regarding every focused intention across the dialogue). Several overlapping segments, with decompositional links between them, are often developed simultaneously. In such cases, the minor segment determines the focused intention (while the hierarchically ordered intentions define the focus). The attentional study of a dialogue involves annotating every change of focus, classifying it, and describing the observed causes (if any). The feasible types of attentional change are the following:

Initiation: when a new intention instance is initiated, it always gains the focus.

Termination: after the successful ending of an intention instance, the prior one should gain the focus.

Cancellation: when the development of an intention instance is aborted, the prior one usually gains the focus. However, cancellations might occur over non-focused intentions, and could involve several of them (the cancelled one and all its descendants). Because of this, it is necessary to identify both the cancelled intention and the new focus.

Disclosing: when the development of an abandoned intention instance (which has a decompositional link with the previously focused one) is resumed.

Skip: any other attentional change should be labelled as ‘focus skip’. The origin (previously focused intention), the final focused intention (the new one), and all the intermediate steps should be annotated carefully.
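As a rough illustration of how a coder's decision could be made explicit, the sketch below classifies a single focus change from the previous and new focused intentions. It is a simplification under stated assumptions: the function name, the status vocabulary (`new`, `open`, `terminated`, `cancelled`, `abandoned`), and the `parent` mapping are hypothetical, and cancellations of non-focused intentions are not handled.

```python
def classify_focus_change(prev, new, parent, status):
    """Label one change of focus.

    prev, new : labels of the previously and newly focused intention instances
    parent    : maps a minor instance to its prior instance
    status    : state of each instance right after the change
    """
    linked = parent.get(prev) == new or parent.get(new) == prev
    if status[new] == "new":
        return "initiation"
    if status[prev] == "terminated" and parent.get(prev) == new:
        return "termination"
    if status[prev] == "cancelled" and parent.get(prev) == new:
        return "cancellation"
    if status[new] == "abandoned" and linked:
        return "disclosing"
    return "skip"  # any other change; origin, target and steps are to be annotated


parent = {"get_date": "book_trip", "get_destination": "book_trip"}

changes = [
    classify_focus_change("book_trip", "get_date", parent,
                          {"book_trip": "open", "get_date": "new"}),
    classify_focus_change("get_date", "book_trip", parent,
                          {"get_date": "terminated", "book_trip": "open"}),
    classify_focus_change("get_date", "book_trip", parent,
                          {"get_date": "cancelled", "book_trip": "open"}),
    classify_focus_change("book_trip", "get_date", parent,
                          {"book_trip": "open", "get_date": "abandoned"}),
    classify_focus_change("get_date", "get_destination", parent,
                          {"get_date": "open", "get_destination": "open"}),
]
```

The last case moves between two sibling intentions without ending either, so it falls through to the skip label.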

4.4 Commitment Evolution

During the interaction, certain events may occur that alter, positively or negatively, the confidence in some part of the common ground (mutual knowledge). When that confidence weakens, human interlocutors usually apply some technique to ensure the correctness of beliefs (hence reinforcing confidence) and to point out deviations (if any). Eventually, they might change the dialogue strategy (game) or even cancel the common ground element.

The main elements to be found in the common ground are the intention instances (and their features: aim, instantiated development strategy, contextual space, etc.). The initiation and development of these elements requires commitment between both participants. Such commitments have three aspects, which should be analysed separately:

Mutual Knowledge: both participants should (a) possess enough information on the intention instance to develop it successfully to an end; (b) know that their interlocutor meets (a); (c) know that their interlocutor meets (b); and so on. Interactive events affecting this aspect include reaffirmations (either positive or negative), (contextual) incompatibilities, interruptions, etc. The reinforcement techniques applied include introducing redundancies, explanations, announcements, direct requests (for reaffirmation), etc.

Interest: both participants should have an individual interest in the element (by linking it to an individual goal), and have confidence in their interlocutor’s interest in developing it to an end. Events affecting this aspect include changes of focus, interruptions, delays, etc. Its feasible reinforcement techniques should help the interlocutor to link the mutual goal to some of his/her individual goals. Thus, explanations (for revealing the link), negotiations, etc. could be applied.

Attention: both participants should simultaneously focus on the same element for its development. In fact, the focus is part of the mutual knowledge, but it should be studied separately because it has its own reinforcement techniques (enumerations, explanations, etc.) and because focus confusion, when found, could also be due to a loss of interest. Events altering the attention aspect were described in the attentional study, and the techniques applied should be identified as the way a participant helps to fix the focus when a change occurs, or later if any focus confusion is detected.

Through this analysis step, the events altering commitment on some part of the common ground are identified and labelled (as commitment variations, either positive or negative), and so are the reinforcement techniques applied (each with a commitment threshold that triggers it and the commitment variation that follows). All the dialogues, rewritten in these terms, will finally be subject to a global analysis (through a learning algorithm) to obtain a measure of each sort of variation and the boundaries of the threshold for applying each technique. To gain naturalness in the later processing of both events and techniques, it is crucial to find out when two identified events (or techniques) are the same one.

4.5 Operative Knowledge

The term task is used to refer to any perlocutionary effect of the interaction state on any participant or on the interaction itself. With regard to the interaction, internal tasks are the effects on the interaction based on some condition or process over interactive knowledge and/or the interaction state. In contrast, external tasks involve a call to some application or external agent whose outcome could in turn have effects on the interaction.

The effects that performing a task could entail always apply to an intention instance (usually the currently focused one, but not necessarily), and include: changes in the intentional state (initiation, disclosing, or cancellation), in the attention (cloaking of an instance and skipping to another one), in the commitment (variations on any of its aspects), in the progress of the development (changes in its state of development), and in the context (assertion of new context, or context retraction).

When some of these effects are found in the corpus, the task that gave rise to them should be searched for. If the task has nothing to do with knowledge or performance external to the interaction itself, it will be labelled as internal and then analysed. Internal tasks are represented by a check (condition) on the interactive state and/or knowledge, and the consequences (effects) of its feasible results. The terms of that check could be based on:

Context: certain values for a given context piece within the currently focused intention instance (eventually, checks could be applied to the context spaces of intention instances prior to the currently focused one, through inheritance).

Currently focused intention instance: checks on its links to individual goals, the current value of its commitment aspects, its development progress state, and/or other features (initiator, age, etc.).

Intention instances structure: checks on other already instantiated intentions, their state (terminated, cloaked, cancelled), relationships, and/or features.

Attention: checks on the focus structure and/or history.

Constant tasks are a particular case of internal task: those performed obligatorily, regardless of any check result. External tasks require a deeper analysis because of their variability. Apart from identifying and sorting them, their inputs and consequences should be annotated, the latter being effects based upon checks on their outputs. Such a complete annotation often requires understanding the (non-interactive) capabilities involved. Therefore, external task annotation should include: a) the description of the task (or of the application or agent, if known); b) the enumeration of its inputs, with their name or alias, link to pieces of contextual information, and full description (or parameter label in the application, if known); c) the enumeration of its outputs, with their name or alias, and full description (or parameter label in the application, if known); and d) the description of its consequences, as a set of rules with an antecedent (an expression based on outputs, inputs, and the same information used for internal tasks) and an effect (of the kinds already described as common to any task). The rules of the consequences may or may not have an execution order.
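The four parts of an external task annotation (description, inputs, outputs, consequences) map naturally onto a record with a list of rules. The following sketch is one hypothetical encoding: `Rule`, `ExternalTask`, the seat-availability task, and the stand-in `fake_application` are all invented for illustration, and effects are kept as plain strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    antecedent: Callable[[dict], bool]  # check over the task's inputs and outputs
    effect: str                         # one of the effects common to any task


@dataclass
class ExternalTask:
    description: str
    inputs: Dict[str, str]    # input alias -> linked piece of contextual information
    outputs: List[str]        # output aliases
    consequences: List[Rule]  # evaluated here in list order

    def run(self, context: dict, call: Callable[..., dict]) -> List[str]:
        """Call the external application and return the effects whose rule fires."""
        args = {alias: context[piece] for alias, piece in self.inputs.items()}
        result = call(**args)
        facts = {**args, **{name: result[name] for name in self.outputs}}
        return [rule.effect for rule in self.consequences if rule.antecedent(facts)]


availability = ExternalTask(
    description="check seat availability (hypothetical application)",
    inputs={"date": "travel_date"},
    outputs=["seats"],
    consequences=[
        Rule(lambda f: f["seats"] > 0, "progress in development"),
        Rule(lambda f: f["seats"] == 0, "context retraction: travel_date"),
    ],
)


def fake_application(date):
    """Stand-in for the external agent; always reports a full train."""
    return {"seats": 0}


effects = availability.run({"travel_date": "2008-09-10"}, fake_application)
```

An internal task would fit the same `Rule` shape with the antecedent evaluated over the interaction state instead of application outputs, and a constant task corresponds to an antecedent that is always true.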

5 Global Dialog Analysis Phase

Once every individual dialogue analysis is finished, a set of actions should be performed to ensure the normalization and completeness of the corpus. Doing this carries the risk of losing trace information, since each piece of the implemented corpus will no longer refer to a single dialogue. On the other hand, later refinement or reuse of the corpus might require reviewing the original sources. Therefore, this global analysis and the final normalization and formalization should provide ways to keep any modelled knowledge properly linked to the pieces of corpus that gave rise to it. The following subsections describe only the information processing needed to carry out the global analysis and thus complete the process.

5.1 Operative global annotation

It is essential to have at least one scenario for each feasible external task the system should perform during interactions. Naturally, external tasks absent from the corpus annotations will not be accessible through the interaction. To verify this, the operative matrix is drawn up. The operative matrix relates each feasible task (columns) to each scenario (rows), placing a tick (or ‘1’) at every crossing where the development of the scenario involves performing that external task. Apart from its traceability benefits, this matrix eases checking the operative coverage of the corpus: it is only necessary to verify that every column has at least one tick. If not, the description of the Interaction Domain has to be reviewed, adding as many scenarios as required to complete it.

It is therefore recommended to perform a task analysis beforehand, in which every different external task is identified and documented (inputs/outputs, description, call, etc.). Drawing up the operative matrix makes it possible to detect gaps in the corpus (in which case new scenarios should be defined, new corpus acquired, and each new dialogue individually analyzed). Finally, a task unification process should be performed.
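The coverage check on the operative matrix is simple enough to sketch directly. In the example below the task and scenario names are invented for illustration, and the two helper functions are hypothetical; the only rule taken from the text is that every column (task) needs at least one tick.

```python
def operative_matrix(scenarios, tasks):
    """Rows are scenarios, columns are tasks; 1 marks that the scenario performs the task.

    scenarios: dict mapping a scenario name to the set of external tasks it involves
    tasks:     ordered list of every feasible external task
    """
    return {name: [1 if task in used else 0 for task in tasks]
            for name, used in scenarios.items()}


def uncovered_tasks(scenarios, tasks):
    """Tasks whose column has no tick, i.e. gaps in the operative coverage."""
    covered = set().union(*scenarios.values())
    return [task for task in tasks if task not in covered]


tasks = ["query_timetable", "book_seat", "cancel_booking"]
scenarios = {
    "ask_for_trains": {"query_timetable"},
    "buy_ticket": {"query_timetable", "book_seat"},
}

matrix = operative_matrix(scenarios, tasks)
missing = uncovered_tasks(scenarios, tasks)
```

Here `cancel_booking` has an empty column, which signals that a new scenario has to be defined and new corpus acquired before that task can be reached through interaction.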

5.2 Intentions global annotation

The individual annotation should have revealed several intention instances (at least one per dialogue). During the individual analysis phase they have already been generalized and formalized into abstract intentional entities by means of an intentional dialogue model. The resulting set of intentions should be examined to find equivalences between them. Finally, their sequence descriptions, if there are several for the same intention, should be analysed and simplified when possible (development sequences, for example, often have common parts). Besides, some extra description may sometimes be required (if several opening sequences are found with the same premises but different initiation states, circumstantial criteria for differentiating them will be needed, or random selection should be applied).

5.3 Commitment Learning

The commitment values, as defined above, are measures of the health of an intention instance (belonging, for example, to the domain of real numbers between 0 and 1). This proposal considers three aspects of commitment (mutual knowledge, interest, and focus), which correspond to three independent variables throughout the processing of an intention instance. These variables are affected by commitment events (either positive or negative) that increase or decrease (respectively) their value during the progress of the instance, always respecting the defined boundaries. Depending on the modeling, a variation could be a constant value or a percentage based on the current commitment value: a percentage of the current value to be deducted when negative, or a percentage of the difference between the absolute maximum (1) and the current value to be added when positive. To guarantee the successful progress of an instance, a minimum is required for each of these values. Because of this, human participants eventually apply reinforcement techniques to restore the values up to those minimums. Hence, each reinforcement technique has a threshold for the commitment values (when it should be applied) and an effect (a positive variation of the commitment values). In general, the initial value of each commitment aspect of an intention instance is set to the maximum (1), except for the interest aspect, which inherits its value from the prior intention instance. However, several ways of initiating an instance, with different initial values, could be distinguished. When this applies, it will be modeled as an event associated with the initiation of an instance of the intention, which applies the proper negative effect.
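The two ways of modeling a variation can be written down as a small update function. This is a sketch under assumptions: the function names and the numeric values are illustrative, the range is taken as [0, 1], and treating a technique as due when the value falls strictly below its threshold is one possible reading of "when it should be applied".

```python
def apply_variation(current, amount, positive, proportional=True, lo=0.0, hi=1.0):
    """Update one commitment aspect after a commitment event.

    proportional=True : negative events deduct `amount` of the current value;
                        positive events add `amount` of the gap up to the maximum.
    proportional=False: `amount` is a constant added or deducted.
    The result always stays within the defined boundaries [lo, hi].
    """
    if proportional:
        new = current + amount * (hi - current) if positive else current - amount * current
    else:
        new = current + amount if positive else current - amount
    return min(hi, max(lo, new))


def due_techniques(value, thresholds):
    """Reinforcement techniques whose threshold the commitment value has fallen below."""
    return [name for name, threshold in thresholds.items() if value < threshold]


after_interruption = apply_variation(0.8, 0.25, positive=False)   # about 0.6
after_explanation = apply_variation(0.6, 0.5, positive=True)      # about 0.8
clamped = apply_variation(0.1, 0.3, positive=False, proportional=False)
to_apply = due_techniques(0.35, {"explanation": 0.5, "direct request": 0.3})
```

With the proportional model a value can approach the boundaries but never cross them; with constant amounts the final clamp is what keeps the value within range.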

Before starting the commitment learning, equivalence checks should be performed on commitment events and reinforcement techniques. The latter are often developed as independent intentions, so checking the equivalence between techniques amounts to checking intention identity. The equivalence of commitment events is somewhat harder to define, because the definition of such events is also very open. In any case, it is recommended to summarize both (events and techniques) in a table and to check their descriptions for similarities. If there are unnoticed duplicates, apart from the drawback of their redundant definition, each of them will be less precise than their joint definition. Once every event and technique is described, every dialogue in the corpus containing any of them will be considered for running the learning algorithm. These dialogues feed a process of ‘progressive refinement’ through which both variations and thresholds are improved (from a general definition to a more precise one, by successive boundaries based on the occurrences of each element in the dialogues). When a cycle is completed (all the dialogues have been fed), another cycle is performed, and so on until a whole cycle does not change the learned values.
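The learning algorithm itself is not specified here, so the following is only a sketch of the loop structure (repeat whole cycles over the corpus until a cycle changes nothing), with an assumed update rule: each technique's threshold is bracketed by an interval that is narrowed by every observation, on the assumption that a technique is applied exactly when the commitment value is below its threshold. The function name and the observation format are hypothetical, and the learning of variations is left out.

```python
def refine_thresholds(dialogues, techniques):
    """Narrow a [low, high] interval around each technique's threshold.

    dialogues: list of dialogues, each a list of observations
               (technique, commitment_value, applied)
    Returns the learned intervals and the number of cycles run.
    """
    bounds = {t: [0.0, 1.0] for t in techniques}  # most general definition
    cycles = 0
    changed = True
    while changed:                 # stop when a whole cycle changes nothing
        changed = False
        cycles += 1
        for dialogue in dialogues:
            for technique, value, applied in dialogue:
                low, high = bounds[technique]
                if applied and value > low:
                    bounds[technique][0] = value   # threshold lies above this value
                    changed = True
                elif not applied and value < high:
                    bounds[technique][1] = value   # threshold lies at or below this value
                    changed = True
    return {t: tuple(b) for t, b in bounds.items()}, cycles


corpus = [
    [("explanation", 0.4, True), ("explanation", 0.7, False)],
    [("explanation", 0.5, True), ("explanation", 0.6, False)],
]
learned, cycles = refine_thresholds(corpus, ["explanation"])
```

With this simple rule the second cycle only confirms that nothing changes; a model in which variations and thresholds depend on each other would typically need more cycles before stabilizing.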

6 Conclusion

This paper presents a methodological approach to pragmatic annotation¹ for corpora of task-oriented dialogues within a defined interaction domain. The approach starts from preprocessed dialogues (in the form of literal communicative acts) and analyses them, first separately and then as a whole. The information sought involves the structure of dialogues, intentions and their features as common ground units, attention changes, reference resolution, and task invocation. Although it is intended to suit certain Dialogue Models (of the joint action type), both the methodology and the coding scheme are in fact general enough to be applied to many other models.

As future work, it would be interesting to integrate several analysis tools, not only to assist in the annotation phase (providing an XML output) but also to implement the formalized corpus automatically (or semi-automatically) as the content of a knowledge base, ready for use at any time. It would also be of interest to provide export/import functionality for other XML-based (or SGML-based) pragmatic annotations. With regard to the methodology, some other pragmatic knowledge could also be covered: even though current dialogue modeling does not make use of it, its annotation could be carried out in advance to get the corpus ready for future reuse.

¹ The presented work has been developed within the MAVIR project (endorsed by the Regional Government of Madrid), and is being extended through the SOPAT project (supported by the Spanish Ministry of Science and Education).

References

Allen, J. & Core, M. (1997). DAMSL: Dialog Act Markup in Several Layers. Draft. http://www.cs.rochester.edu/research/cisd/resources/damsl/RevisedManual/RevisedManual.html.

Austin, J.L. (1962). How to do things with words. Oxford Univ. Press, 1975.

Calle, J., García-Serrano, A., Martínez, P. (2006). Intentional Processing as a Key for Rational Behaviour through Natural Interaction. Interacting With Computers, Elsevier.

Carletta, J. C., Isard, A., Isard, S., Kowtko, J., Doherty-Sneddon, G., & Anderson, A. (1996). HCRC Dialogue Structure Coding Manual (HCRC/TR-82), Human Communication Research Centre, University of Edinburgh.

Cattoni, R., Danieli, M., Sandrini, V., Soria, C. (2002). ADAM: The SI-TAL Corpus of Annotated Dialogues. In Procs. of LREC 2002, Las Palmas, Spain.

Clark, H.H. (1996). Using Language. Cambridge University Press.

Cuadra, D., Rivero, J., Valle, D., Calle, J. (2008). Enhancing Natural Interaction with Circumstantial Knowledge. Int. Trans. on Systems Science and Applications vol. 4, Springer.

Dybkjaer, L., Bernsen, N.O., Dybkjaer, H., McKelvie, D., and Mengel, A. (1998). The MATE Markup Framework. MATE Deliv. D1.2. http://mate.nis.sdu.dk/information/d12/.

Garretson, G. (2006). Dexter: free tools for analyzing texts. Proceedings of the 5th International AELFE Conference, pp. 659-665.

Gibbon, D., Mertins, I. and Moore, R. (Eds.) (2000). Handbook of Multimodal and Spoken Dialogue Systems: Resources, Terminology and Product Evaluation. Kluwer Academic Publishers.

Levin, J.A. & Moore, J.A. (1977). Dialogue games: metacommunication strategies for natural language interaction. Cognitive Science 1 (4), 395–420.

Nakatani, C. & Traum, D. (1998). Draft: Discourse Structure Coding Manual. Technical Report of the 3rd Discourse Resource Initiative (DRI) Meeting, Chiba, Japan.

Navarro, B., Civit, M., Martí, M. A., Fernández, B., Marcos, R. (2003). Syntactic, Semantic and Pragmatic Annotation in Cast3LB. In Proceedings of Corpus Linguistics, Lancaster.

Searle, J. R. (1969). Speech Acts. Cambridge University Press.

Sinclair, J. McH., & Coulthard, M. (1975). Towards an analysis of discourse: the English used by teachers and pupils. Oxford: Oxford University Press.

Tetreault, J., Swift, M., Prithviraj, P., Dzikovska, M., and Allen, J. (2004). Discourse annotation in the Monroe corpus. In ACL workshop on Discourse Annotation.



Descripción de Entidades y Generación de Expresiones de Referencia en la Generación Automática de Discurso∗

Entity Description and Referring Expression Generation in Automatic Generation of Discourse

Raquel Hervás, Pablo Gervás

Universidad Complutense de Madrid

Instituto de Tecnologías del Conocimiento

28040 Madrid, Spain

[email protected], [email protected]

Abstract: Entity description is an essential task in the automatic generation of text for any kind of discourse. Every text needs to refer to elements of the world for different purposes, and these references build on the descriptions previously given of those elements. Two of the basic tasks of Natural Language Generation (NLG) are involved in this process: Content Determination and Referring Expression Generation. This work studies the relation between the two when describing entities and referring to them in any type of automatically generated discourse. Doing so requires revisiting the classic pipeline architecture and studying the interactions that may be needed between the modules involved.

Keywords: Entity description, Natural language generation, Referring expression generation, NLG architectures

1. Introduction

Entity description is an essential task in the automatic generation of text for any kind of discourse. From dialogue systems to narrative texts, every generated text needs to refer to elements of the world for different purposes, and these references build on the descriptions previously given of those elements. Sometimes the entities referred to need to be distinguished from others of the same type or with shared characteristics, and for this their characteristics must be presented in a way that serves that purpose. At other times the information with which certain entities are described is important for a complete understanding of the discourse.

∗ This research is funded by the Ministerio de Educación y Ciencia (TIN2006-14433-C02-01), the Universidad Complutense de Madrid, and the Dirección General de Universidades e Investigación de la Comunidad Autónoma de Madrid (CCG07-UCM/TIC-2803).

These descriptions matter most when the type of discourse means that the only information available to the reader or hearer is that given in the discourse itself. In a dialogue between two people, the hearer has access both to information from the surrounding environment and to the ideas the two have exchanged. A reference such as “the table” can therefore be understood without further explanation. However, in a discourse where communication is exclusively linguistic and there is no visual context, the only information available to the reader or hearer is what has been presented in the text, so the descriptions of the discourse entities are crucial for the text to convey the intended information.

Two of the basic tasks of Natural Language Generation (NLG) are involved in generating and subsequently using these entity descriptions: Content Determination (CD) and Referring Expression Generation (REG). In most NLG systems the relation between these two tasks is unidirectional.

The most widespread architecture for NLG systems is the sequential or pipeline architecture, in which the content of the discourse is chosen and organized into messages at the beginning of the process. This selected information is what the subsequent generation steps use, REG among them, with no possibility of it being revised by earlier stages of the pipeline. From the point of view of the relation between CD and REG in entity description, the information flow provided by this architecture may be inadequate on certain occasions. Suppose that during the REG stage the system finds that it does not have enough information about an entity to distinguish it from the others in the same context. In a sequential architecture the problem would have no remedy, and the generated referring expression would be ambiguous.

A problem like this can be solved in several ways. One would be for CD, besides deciding which information is important for the discourse as a whole, to check whether the selected information is sufficient to avoid ambiguities when the referring expressions are generated. Another would be for CD to select little or no information for the descriptions, and for REG to request the inclusion of new information whenever it is needed to generate the references.

This work studies the relation between Content Determination and Referring Expression Generation when describing entities and referring to them in automatically generated discourses where the only information available to the reader or hearer is that contained in the text itself. Doing so requires revisiting the classic pipeline architecture and studying the interactions that may be needed between the modules involved.

A domain well suited to studying this relation is the automatic generation of text for stories. Stories by definition contain many elements that must be referred to, from main or secondary characters to places or objects involved in the action. Two things play a very important role in them: entity description, which presents all the elements to the reader, and Content Determination in general, since stories are rich in information about the different elements and this information must be filtered appropriately when the story is rendered as text. For example, every story contains a great deal of information about the characters (age, hair colour, beauty, kindness, ...) and about their relations with the other elements of the story (who their father or mother is, where they live, whom they love or hate, ...). All this information must be organized and filtered appropriately for the final text to be readable.

2. Review of Previous Work

Natural language generation (NLG) is subdivided into several specific tasks (Reiter and Dale, 2000), each of which operates at a different level of linguistic representation (discourse, semantics, lexicon, syntax, ...). NLG can be applied in domains where the communicative goals and the characteristics of the texts to be generated vary widely, from the rendering of numerical content in natural language (Goldberg, Driedger, and Kittredge, 1994) to the generation of literary texts (Callaway and Lester, 2001).

2.1. NLG System Architectures

There are many ways of organizing a Natural Language Generation (NLG) system, and the advantages and disadvantages of each are still a matter of debate (DeSmedt, Horacek, and Zock, 1995; Reiter, 1994). With regard to the division into modules, options range from an integrated architecture (Kantrowitz and Bates, 1992), where the system consists of a single module, to an architecture with a separate module for each NLG task (Cahill et al., 2001). With regard to control flow, at one extreme there is the pipeline architecture (Reiter, 1994), where the modules are completely independent, and at the other the blackboard architecture (Calder et al., 1999), where the modules place information in a shared storage space without concern for which other modules will use it.

The most widely used architecture is the pipeline of Reiter and Dale (2000), which has inspired many others (Figure 1). The sequential or pipeline architecture has the advantage of simplicity. Each module is encapsulated as far as possible, because it receives its input from one source and sends its output to one destination. The intermediate representation at each stage is assumed to be a complete representation of what is known at that point about the utterance to be generated. Its main disadvantage, however, is that decisions taken in one stage must be kept along the whole chain, with no possibility of revision or improvement. It is therefore a rather inflexible model.

Figure 1: Basic scheme of a sequential architecture

Systems such as HYLITE+ (Bontcheva and Wilks, 2001) have explored alternatives to the rigid communication between modules imposed by the pipeline, from the point of view of Content Determination and Surface Realization. HYLITE+ is a dynamic hypertext generation system that produces encyclopaedia-style explanations in a given domain. In this system the content planner initially produces a text plan for a concise hypertext that contains only facts about the concept to be explained. During surface realization, once the final format has been decided, the most appropriate adaptive alternative is chosen. Sometimes this leads to the inclusion of a new communicative goal, which results in expanding the initial basic text with extra information.

STREAK (Robin and McKeown, 1996) is a system for the automatic generation of summaries of basketball games that uses a revision architecture. A first stage builds a draft that contains only the information essential to the summary, and a second pass revises this draft incrementally to add as much secondary information as is needed to improve readability and make the most of the available space. This model requires a new kind of linguistic knowledge, revision rules, which specify the different ways in which a draft can be modified to add new information concisely.

2.2. Referring Expression Generation

Using referring expressions well enough to compete with texts produced by human authors involves some difficulty. According to Reiter and Dale (2000), a referring expression must communicate enough information to identify a referent uniquely within the discourse context, while always avoiding unnecessary or redundant modifiers.

However, there are occasions on which the information used in the reference may be unknown to the hearer or unnecessary for distinguishing the entity. Using this kind of reference could be considered a violation of Grice's Conversational Maxim of Quantity, “Do not make your contribution more informative than is required” (Grice, 1975). This would be true if the goal were exclusively to provide the hearer with enough information to identify the entity to which the speaker is referring. If, however, the goal is not only to identify the referent but also to alert the hearer to some of its properties, then the Maxim of Quantity is satisfied.

2.3. Entity Description

According to Milosavljevic (2003), a description of an entity is the linguistic realization of a set of one or more propositions whose purpose is to make the hearer build a mental model of the entity described. The thesis work of Milosavljevic (1999) explored the use of comparisons in entity description, postulating that comparisons with similar and familiar entities are useful when producing a description. Other works have studied the generation of descriptions for entities of various types.

Lavoie, Rambow, and Reiter (1997) present the ModEx system, which generates natural language descriptions of object-oriented software models. ModEx lets the user customize the text plans at run time, so that each text reflects the user's individual preferences regarding the content and/or presentation of the generated output.

Ardissono and Goy (2000) present SETA, a system for the automatic generation of personalized web catalogues. It uses template-based NLG techniques, which allow the dynamic generation of product categories and their items. In these descriptions the system mixes different types of information about the features and properties of the products presented.

The Methodius text generation system (Isard, 2007) creates personalized descriptions of museum objects that can be presented in various ways, from text or speech on a hand-held device to dialogue with a robot museum guide.

3. Generation of Descriptions and its Relation to the Rest of the NLG Process

Two tasks within Natural Language Generation are involved in generating descriptions for the entities in a discourse and in relating them afterwards to the rest of the text: Content Determination and Referring Expression Generation. The relation between the two when referring to the entities in a discourse is important for conveying the intended information and avoiding ambiguities.

Content Determination (CD) selects the information to be communicated in the description of an entity. Depending on where the description occurs in the discourse, and on the importance of the entity within it, the information expressed in the description may be more or less extensive. Depending on the type of discourse, some entities stand out from the rest or are more important than the others. In a story, for example, the main characters or the place where the action unfolds are presented with much more detailed descriptions than secondary characters or less important objects. CD must determine which elements of the discourse stand out from the others and select information accordingly when describing them.

Referring Expression Generation (REG) generates references to entities that allow them to be distinguished from the other elements of the discourse, while also providing any extra information that the reader may find useful later on. To do so it depends not only on the information provided by CD, but also on the context of the text at the moment the reference is made. This context is determined, among other things, by how each entity has previously been described, that is, by the information that can be considered known to the reader.

Most of the references that appear in any type of discourse are used basically to refer to entities in the context that must be distinguished from others around them. References of this type can be considered distinctive references. An example would be “Put the box on the table next to the door”, where the speaker indicates that the box must be placed on the table that is next to the door and not on any other table. In these cases the properties mentioned are known to the hearer and necessary to distinguish a given entity from the rest of the context.

In line with the conversational implicatures described by Grice, if the reference contains information that is not necessary for identifying the referent, the hearer may assume that this information is important for something else and, depending on the context, will draw different inferences from it. References of this type are considered informative references.

In the case of distinctive references, when generating the reference REG must take into account not the complete information at its disposal but the information that has previously been conveyed in the text. In the case of informative references, the extra information to be provided may already have been presented earlier. For example, in the story of Cinderella, almost every mention of Cinderella's slippers includes the extra information that they are made of glass. In these cases the repetition of this information is meant to emphasize the fact to the reader. On the other hand, if the information to be presented has not been given previously, it is the task of REG to ensure that it is understood correctly and causes no confusion. Confusion could arise, for example, if the previously mentioned concept is not correctly linked to the new characteristic. Suppose a story has introduced a princess of whom it has only been said that she is pretty, and later a reference such as “the dark-haired princess” is used: the reader might think that the dark-haired princess is a new princess unrelated to the earlier one.

3.1. Generation of Descriptions Guided by Content Determination

Content Determination could be in charge of deciding which information to communicate in the description of entities, taking into account all the elements that will appear in the text and their characteristics. When deciding which information to include in the description of an entity, CD must be aware of all the elements that will appear in the discourse and include the information needed for all of them to be distinguishable.

One consequence of this solution is that CD has to anticipate some of the decisions that REG will take later. To do so it must make sure to add enough information to the discourse so that algorithms such as attribute selection, which chooses the attributes of an element that must be mentioned to distinguish it from other elements of the same type, do not find themselves with less information than they need to make the references.

3.2. Generation of Descriptions Guided by Referring Expression Generation

Another possible solution to the problem would be to provide bidirectional communication between the Content Determination and Referring Expression Generation modules. When information is lacking to create an adequate reference, as in the princess example above, the REG module would send a request to the CD module for more information about the two princesses involved. Suppose it is known that the second princess is dark-haired, although initially this fact did not seem relevant to the discourse and had not been included in her description. With this new information it becomes possible to generate references that distinguish the two princesses.

Of course, it could happen that the CD module is unable to provide additional information for the description of an entity. In these cases REG must resort to other techniques, such as the use of demonstratives (this princess or that princess) or of other identifiers (the first princess). These aspects of REG, however, fall outside the scope of this article.

4. Implementation Proposal in an NLG Architecture

This section studies the implications of the two proposals presented above from the point of view of a possible implementation within an existing generation system. TAP (Text Arranging Pipeline) (Gervás, 2007) is a software architecture for automatic text generation. TAP consists of a set of interfaces that define generic functionality for a pipeline of tasks oriented towards natural language generation, from an initial conceptual input to the final surface realization as text, with intermediate stages of content planning and sentence planning. Starting from the conceptual input to be processed, the TAP modules operate on the intermediate representations used to store partial results, so that the input is progressively filtered, grouped and enriched until it yields structures ever closer to natural language in both structure and content.

The philosophy behind the design of TAP is to identify the generic structural nature of the information that requires processing in the different stages of natural language generation, and to provide interfaces for accessing this structure. The TAP architecture has been specifically designed to allow the development of a reusable set of software components capable of solving basic NLG tasks, working with information encoded in standard formats, and producing texts suited to different tasks in different domains. The modules relevant to the tasks discussed here are the content determination module and the referring expression generation module.

4.1. Content Determination

The current implementation of the content determination module of TAP takes as input a set of referents, a set of properties, a set of relations and a set of events. Properties can apply either to referents (in which case they usually end up realized as adjectives) or to events (in which case they end up realized as adverbs). Relations can hold between any of the four possible kinds of element (referents, properties, relations or events), so they can represent modifiers of noun phrases, sentence complements, or causal relations.
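One way to picture this input is as four small record types. The sketch below is only an illustration under assumed names (`Referent`, `Event`, `Property`, `Relation`, `realization_hint`); it does not reproduce the actual TAP interfaces, which are not detailed here.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Referent:
    name: str


@dataclass(frozen=True)
class Event:
    name: str


@dataclass(frozen=True)
class Property:
    name: str
    # A property applies either to a referent or to an event.
    subject: Union[Referent, Event]


# Hypothetical alias: a relation may link any of the four kinds of element.
Element = Union[Referent, Property, Event, "Relation"]


@dataclass(frozen=True)
class Relation:
    name: str
    source: "Element"
    target: "Element"


def realization_hint(prop: Property) -> str:
    """Properties of referents tend to surface as adjectives, those of events as adverbs."""
    return "adjective" if isinstance(prop.subject, Referent) else "adverb"


princess = Referent("princess")
leave = Event("leave")
blonde = Property("blonde", princess)
quickly = Property("quickly", leave)
# A causal relation between two events.
cause = Relation("causes", Event("curse"), leave)
```

Because `Relation` accepts any `Element` at either end, the same record can stand for a noun-phrase modifier, a sentence complement or a causal link.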

Within the content determination module, the basic task is to group the events into sequences of events that are contiguous in time and occur in the same place. These blocks are taken as a first approximation to the concept of a scene, and are used to establish an initial discourse plan. The scenes are ordered relative to one another by the chronological order of their first events. This first approximation of the discourse to be generated must be enriched with additional information about the descriptions of the referents that take part in each event. The information introduced at this point will appear in the final text, mentioned explicitly as part of the discourse in independent sentences (“The princess is blonde”).
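The grouping of events into scenes can be sketched as follows. This is a minimal illustration, assuming that each event carries an integer time step and a place name; the names `TimedEvent` and `group_into_scenes` are hypothetical, and "contiguous" is taken to mean that consecutive events are at most one time step apart.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimedEvent:
    name: str
    time: int    # assumed discrete time step
    place: str


def group_into_scenes(events: List[TimedEvent]) -> List[List[TimedEvent]]:
    """Group events into runs that are contiguous in time and share a place."""
    scenes: List[List[TimedEvent]] = []
    for event in sorted(events, key=lambda e: e.time):
        last = scenes[-1][-1] if scenes else None
        same_scene = (
            last is not None
            and event.place == last.place
            and event.time - last.time <= 1
        )
        if same_scene:
            scenes[-1].append(event)
        else:
            scenes.append([event])
    # Order scenes by the time of their first event.
    scenes.sort(key=lambda scene: scene[0].time)
    return scenes


story = [
    TimedEvent("leave", 3, "forest"),
    TimedEvent("wake", 1, "castle"),
    TimedEvent("eat", 2, "castle"),
]
scenes = group_into_scenes(story)
scene_names = [[e.name for e in scene] for scene in scenes]
```

Here `scene_names` holds one scene at the castle followed by one in the forest.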

4.2. Referring Expression Generation

The current version (Gervás, Hervás, and León, 2008) of the referring expression generation module applies a simple implementation of the algorithm of Reiter and Dale (1992), which selects the set of attributes (properties and relations) of a referent that identify it with respect to possible previous mentions of it in the preceding discourse. This task relies on the set of properties and relations established as valid for that referent during content determination. The information selected at this point will appear in the final text as referring expressions that take part in sentences describing events (“The blonde princess left the castle”).

The algorithm of Reiter and Dale works by building a contrast set containing all the referents that could be confused with the referent to be mentioned, and progressively removing elements from that set each time a property of the intended referent that distinguishes it from them is added to the expression being generated. Generation of the expression succeeds if the contrast set is empty at the end of the process. If the contrast set is not empty, a case has been found in which the information provided during content determination is not sufficient to distinguish some referents from others.
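The contrast-set procedure just described can be sketched as follows. This is a simplified illustration in the spirit of that algorithm, not the TAP implementation: the function name `select_attributes`, the dictionary layout of `known` and the fixed preference order `preferred` are assumptions made for the example, and a distractor that lacks an attribute is assumed not to share the target's value.

```python
def select_attributes(target, known, preferred):
    """Pick attributes of `target` until no other referent can be confused with it.

    known: dict mapping each referent to the attribute/value pairs selected
           for it during content determination.
    preferred: attribute names in the order in which they are tried.
    Returns (chosen attributes, remaining contrast set).
    """
    contrast = {ref for ref in known if ref != target}
    chosen = {}
    for attr in preferred:
        if not contrast:
            break  # the target is already uniquely identified
        value = known[target].get(attr)
        if value is None:
            continue  # nothing is known about this attribute of the target
        ruled_out = {ref for ref in contrast if known[ref].get(attr) != value}
        if ruled_out:
            # The attribute removes at least one distractor, so it is kept.
            chosen[attr] = value
            contrast -= ruled_out
    return chosen, contrast


known = {
    "princess1": {"type": "princess", "hair": "blonde"},
    "princess2": {"type": "princess", "hair": "dark"},
    "knight": {"type": "knight"},
}
chosen, remaining = select_attributes("princess1", known, ["type", "hair"])
```

In this example the type removes the knight and the hair colour removes the second princess, so `remaining` ends up empty. A non-empty `remaining` is the failure case described above.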

4.3. Necessary Adaptations

To implement the solutions described in the abstract above, the referring expression generation module must be given the ability to identify anomalous situations of this kind and, in each case, to store the data that identify the referents among which the conflict arises. This is as simple as storing, whenever the contrast set is not empty at the end of the process, the remainder of the contrast set together with the original referent: these referents are the ones that remain indistinguishable with the information provided by the content determination module.
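A minimal sketch of this bookkeeping is shown below. The class name `ConflictLog` and its methods are hypothetical; the remaining contrast set is assumed to be handed over by whatever routine generates the reference.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass
class ConflictLog:
    """Collects groups of referents that could not be told apart."""
    conflicts: List[FrozenSet[str]] = field(default_factory=list)

    def record(self, referent: str, remaining_contrast: Set[str]) -> None:
        # Only a non-empty remainder signals a failed reference.
        if remaining_contrast:
            self.conflicts.append(frozenset(remaining_contrast) | {referent})

    def indistinguishable_sets(self) -> Set[FrozenSet[str]]:
        # The same group may be reported once per member, so duplicates are merged.
        return set(self.conflicts)


log = ConflictLog()
log.record("princess1", {"princess2"})
log.record("princess2", {"princess1"})
log.record("knight", set())
groups = log.indistinguishable_sets()
```

The resulting set of sets is the kind of payload that could accompany a revision request sent to content determination.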

As for the content determination module, it would need new functionality for requesting a revision of a previous content assignment, such that the request can be accompanied by a set of sets of referents that are indistinguishable under the previous assignment.

4.4. Basic Combination Options

All of the above gives rise to three options for use.

In the first, content determination would add no information about the referents in the first pass. During referring expression generation the sets of indistinguishable referents are detected, producing a list of those for which no correct reference could be generated for lack of information. A second pass of content determination would then be requested to add information that distinguishes them. This loop could continue until the referring expression generation module encounters no problems when generating the references, at which point the pipeline would continue on to the generation of the final text.
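The loop of this first option might look like the sketch below. All names (`find_conflicts`, `revise_content`, `generate_with_revision`) are hypothetical, and both modules are reduced to toy stand-ins: a referent counts as conflicting when every fact selected for it also holds of some other referent, and each revision pass adds at most one new fact per conflicting referent. A `max_passes` bound is added as a safeguard.

```python
def find_conflicts(selected):
    """Stand-in for REG: list referents whose selected facts fail to single them out."""
    conflicts = []
    for ref, facts in selected.items():
        clash = {other for other, other_facts in selected.items()
                 if other != ref and facts.items() <= other_facts.items()}
        if clash:
            conflicts.append((ref, clash))
    return conflicts


def revise_content(selected, knowledge, conflicts):
    """Stand-in for a further CD pass: add one separating fact per conflicting referent."""
    added = False
    for ref, clash in conflicts:
        for attr, value in knowledge[ref].items():
            if attr in selected[ref]:
                continue
            if any(knowledge[other].get(attr) != value for other in clash):
                selected[ref][attr] = value
                added = True
                break
    return added


def generate_with_revision(knowledge, max_passes=10):
    # First pass: content determination selects nothing about the referents.
    selected = {ref: {} for ref in knowledge}
    for _ in range(max_passes):
        conflicts = find_conflicts(selected)
        if not conflicts:
            break
        if not revise_content(selected, knowledge, conflicts):
            break  # nothing more can be added; conflicts stay unresolved
    return selected, find_conflicts(selected)


knowledge = {
    "princess1": {"type": "princess", "hair": "blonde"},
    "princess2": {"type": "princess", "hair": "dark"},
    "knight": {"type": "knight"},
}
selected, unresolved = generate_with_revision(knowledge)
```

In this run the type is added in the first revision and the hair colour of the two princesses in the second, after which no conflicts remain. When the knowledge base holds nothing that separates two referents, the loop stops with `unresolved` non-empty, which is the situation where other techniques such as demonstratives would be needed.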

In the second option, content determination may add descriptive information in the first pass, but it need not ensure that this information is distinctive; rather, it provides initial descriptions appropriate to each entity according to its role in the whole discourse. For example, the main characters and places are always described more thoroughly than the secondary ones. Referring expression generation is carried out as in the first option, and the content determination module is asked for a second pass to complete the descriptions with the information needed to distinguish. In this second option the two functions of descriptions can be combined: aiding identification, and adding information that does not distinguish the referent from others but is considered important.

The third option would include in the content determination module calls to those fragments of referring expression generation that are independent of the discourse. For each referent, a generic exploration of the largest possible contrast set would be carried out by applying the criteria mentioned above, and the feedback obtained would be used to add any information that might be needed already in the first pass of content determination.

5. Conclusions and Future Work

Generation of descriptions guided by Content Determination can be a very costly process, depending on how it is approached. When the information for the description of an entity is being selected, the other entities it is compared with may already have been “described”, that is, the information to be used in their descriptions may already have been selected. It must therefore be decided whether or not CD takes this selected information into account when describing the other entities. Taking into account all the information available about an entity when distinguishing it from another is not the same as taking into account only the information that will be mentioned about that entity in the discourse. In the second case the description takes into account the discourse context and not only the information initially available.

Generation of descriptions guided by Referring Expression Generation would mean that the descriptions in the text include only the information needed to distinguish the elements of the discourse. This could produce texts that are too terse, with few descriptions, if the number of entities that can be confused with others is low. Moreover, this approach does not handle additional information that is not needed for distinguishing but is considered of interest for a complete understanding of the text. A possible solution would be for CD to decide which non-essential information is to be expressed in the text, and for the later REG stage to request additional information from it when needed.



References

Ardissono, L. and A. Goy. 2000. Dynamic generation of adaptive web catalogs. In Proc. Conference on Adaptive Hypermedia and Adaptive Web-based Systems, Italy.

Bontcheva, K. and Y. Wilks. 2001. Dealing with dependencies between content planning and surface realisation in a pipeline generation architecture. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI’01).

Cahill, L., R. Evans, C. Mellish, D. Paiva, M. Reape, and D. Scott. 2001. The RAGS reference manual. Technical Report ITRI-01-07, Information Technology Research Institute, University of Brighton.

Calder, J., R. Evans, C. Mellish, and M. Reape. 1999. Free choice and templates: how to get both at the same time. In May I speak freely? Between templates and free choice in natural language generation, pages 19–24, Saarbrücken.

Callaway, C. and J. Lester. 2001. Narrative prose generation. In Proceedings of the 17th IJCAI, pages 1241–1248, Seattle.

DeSmedt, K., H. Horacek, and M. Zock. 1995. Architectures for natural language generation: Problems and perspectives. In G. Adorni and M. Zock, editors, Trends in natural language generation: an artificial intelligence perspective, LNAI 1036. Springer Verlag, pages 17–46.

Gervás, P., R. Hervás, and C. León. 2008. NIL-UCM: Most-frequent-value-first attribute selection and best-scoring-choice realization. In Referring Expression Generation Challenge 2008, Proc. of the 5th International Natural Language Generation Conference (INLG’08).

Gervás, P. 2007. TAP: a text arranging pipeline. Technical report, Natural Interaction based on Language Group, Universidad Complutense de Madrid, Spain.

Goldberg, E., N. Driedger, and R.I. Kittredge. 1994. Using natural-language processing to produce weather forecasts. IEEE Expert: Intelligent Systems and Their Applications, 9(2):45–53.

Grice, H.P. 1975. Logic and conversation. Syntax and Semantics, 3:43–58.

Isard, A. 2007. Choosing the best comparison under the circumstances. In Proceedings of the International Workshop on Personalization Enhanced Access to Cultural Heritage (PATCH’07), Corfu, Greece.

Kantrowitz, M. and J. Bates. 1992. Integrated natural language generation systems. In R. Dale, E. Hovy, D. Rösner, and O. Stock, editors, Aspects of Automated Natural Language Generation. Springer Verlag, Berlin, pages 13–28.

Lavoie, B., O. Rambow, and E. Reiter. 1997. Customizable descriptions of object-oriented models. In Proceedings of ANLP’97.

Milosavljevic, M. 1999. The Automatic Generation of Comparisons in Descriptions of Entities. Ph.D. thesis, Department of Computing, Macquarie University, Sydney, Australia.

Milosavljevic, M. 2003. Defining comparison. In P. Slezak, editor, Proceedings of the Joint International Conference on Cognitive Science with the Australasian Society for Cognitive Science, University of New South Wales.

Reiter, E. 1994. Has a consensus NL generation architecture appeared, and is it psychologically plausible? pages 163–170.

Reiter, E. and R. Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

Reiter, E. and R. Dale. 1992. A fast algorithm for the generation of referring expressions. In Proceedings of the 14th conference on Computational linguistics, pages 232–238, Morristown, NJ, USA. Association for Computational Linguistics.

Robin, J. and K. McKeown. 1996. Empirically designing and evaluating a new revision-based model for summary generation. Artif. Intell., 85(1-2):135–179.



Natural Language Processing meets User Modeling for automatic

and adaptive free-text scoring

Combining Natural Language Processing and User Modeling techniques for

automatic and adaptive free-text assessment

Diana Pérez Marín

EPS, UAM

Madrid, Spain

[email protected]

Ismael Pascual Nieto

EPS, UAM

Madrid, Spain

[email protected]

Pilar Rodríguez

EPS, UAM

Madrid, Spain

[email protected]

Summary: Traditionally, systems for the automatic assessment of free-text answers have relied

solely on Natural Language Processing techniques. This has steadily improved the

performance of these systems, but it has not been possible to offer students

formative assessment adapted to their needs and to their real level of knowledge,

as reflected in the free-text answers they give to the system.

This paper describes a procedure in which Natural Language Processing

and User Modeling techniques are combined to generate and maintain a student

model in systems for the automatic assessment of free-text answers. In this way, the

assessment of free-text answers is not only automatic but also adapted to the

specific characteristics of each student at each moment.

Keywords: Free-text answer assessment, User Modeling, Adaptive Hypermedia,

Blended Learning

Abstract: Free-text Computer Assisted Assessment (CAA) systems are able to automatically

score free-text students’ answers using Natural Language Processing techniques. Traditionally,

free-text CAA systems have not included any possibility of adaptation, or kept a student model.

In this paper, a procedure in which Natural Language Processing and User Modeling techniques

are used together to generate and keep a student model in free-text CAA systems is described.

That way, it is possible to offer the students not only an automatic assessment of their free-text

answers, but also adaptation to their specific formative needs and their real level of knowledge.

The student model is extracted from the students’ free-text answers to the questions asked by

the system, and the model is used by the system to choose the next question to ask the student.

Thus, not only is the model derived from the students’ answers, but the students’ answers

also keep the model updated.

Keywords: Free-text Computer Assisted Assessment, User Modeling, Adaptive Hypermedia, Blended

Learning

1 Motivation

The benefits of a convergence between Natural

Language Processing (NLP) and User Modeling

(UM) techniques have already been discussed

by several researchers.

For instance, Zukerman and Litman (2001)

claimed that obtaining the model from free text

entered by the user into the system would

increase its accuracy; Reiter et al. (2003)

described how it is possible to acquire and use

limited user models for Natural Language

Generation; and Johansson (2002) gathered

several benefits of UM in dialogue systems:

providing the user with tailored help and

advice, eliciting information, and helping to

resolve ambiguity (Kass and Finin, 1988);

avoiding redundancy and incomprehensibility

in answers and explanations, taking into

account the goals and plans for which the user

really needs the requested information, and

detecting the user’s misconceptions and

informing the user about them (Kobsa, 1990);

and enhancing effectiveness (i.e. reaching the

correct decision for a specific user), efficiency

(i.e. reaching the correct decision in an

economical way), and acceptability (i.e.

supporting the decision-making process in a

comprehensible, agreeable and enjoyable

way for a specific user) (Sparck Jones, 1989).

However, many of these authors have

focused only on fields such as Natural

Language Generation (NLG) or Natural

Language Understanding (NLU), mainly in

relation to pragmatics: for instance, NLG

systems that consult the user model to do

content planning, or NLU systems that build the

user model to represent the user’s plans and goals.

Little has been said about automatically

exploiting user models in other domains

such as, for instance, education. User models,

or student models for the particular case of the

academic context, contain static and dynamic

information that could be used to choose which

questions should be asked to the students.

For instance, if an estimate of how well

each concept is known by each student were kept

in each student model, it would be possible to

identify which concepts are less understood.

That way, it would be possible to ask

the students directly about those concepts,

instead of repeating questions about concepts

already known.

In our opinion, combining the

benefits of UM in Intelligent Tutoring Systems

(ITS) (Kay, 2001) with NLP opens many new

possibilities for e-learning (i.e.

using electronic media to teach), Blended

Learning (i.e. using traditional and electronic

media to teach), and Computer Assisted

Assessment (CAA, i.e. using computers to assess

students’ knowledge).

In particular, in this paper, we describe the

combination of Natural Language Processing

and User Modeling techniques to automatically

generate each student’s model from his or her

free-text answers to a free-text CAA system.

In this context, a student model can be

defined as the set of static and dynamic

information about each student. The static

component consists of the student’s personal

data (e.g. name, age, language), which do

not usually change during the course. The

dynamic component consists of the information

about how the student has used the concepts in

his or her answers, which changes during the course.

The dynamic component of the model is the

focus of this paper. We have called it the

student’s conceptual model. It can be defined as

a network of interrelated concepts in which

each concept is associated with a confidence-value

(CV). This CV indicates how confident an

automatic free-text scoring system is that the

student knows each concept according to a set

of metrics.
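The student model defined above can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the class names, the field names and the `least_known` helper are hypothetical and are not taken from the Will Tools; the 0 to 1 range of the CV follows the scale mentioned later in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Concept:
    """A node of the conceptual model with its confidence-value (CV)."""
    term: str
    cv: float = 0.0  # 0 = no evidence of knowledge, 1 = full confidence


@dataclass
class StudentModel:
    """Static personal data plus the dynamic conceptual model."""
    name: str        # static component (illustrative fields only)
    language: str
    concepts: Dict[str, Concept] = field(default_factory=dict)
    links: Set[Tuple[str, str]] = field(default_factory=set)

    def add_concept(self, term: str) -> None:
        # New concepts start with a CV of zero.
        self.concepts.setdefault(term, Concept(term))

    def relate(self, a: str, b: str) -> None:
        # Store an undirected link between two concepts of the model.
        if a not in self.concepts or b not in self.concepts:
            raise KeyError("both concepts must exist in the model")
        first, second = sorted((a, b))
        self.links.add((first, second))

    def least_known(self, n: int = 3) -> List[str]:
        # Concepts with the lowest CV are candidates for review.
        ranked = sorted(self.concepts.values(), key=lambda c: c.cv)
        return [c.term for c in ranked[:n]]
```

A question planner could, for example, call `least_known` to decide which concepts to ask about next.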

The paper is organized as follows: Section 2

briefly reviews some concepts about student

models and educational systems able to keep

student models; Section 3 gives an overview of

the procedure to generate the conceptual model

from the students’ free-text answers; Section 4

reviews the Natural Language Processing and

User Modeling techniques used in the

procedure; Section 5 discusses the results

achieved; Section 6 discusses the limitations of

the current approach and how it can be

extended to other domains; and Section 7 ends

with the main conclusions drawn about the

benefits of using UM+NLP for free-text

scoring.

2 Related work

Student models are one of the main

components of Intelligent Tutoring Systems

(Kay, 2001). There are no limits or restrictions

regarding the information that can be kept in

one student model, or how the information has

to be organized.

Therefore, there are many different types of

student models, and multiple criteria according

to which they can be classified. One of them

concerns the relationship between domain

and student knowledge. That is, depending on

how the student knowledge represents the

domain knowledge, student models can be

classified as (Labidi and Sergio, 2000;

Mitrovic, 2001):

• Overlay: The student model is a projection

of the domain model, i.e. the student

knowledge is considered as a subset of the

domain knowledge.

• Bug: The bug model is based on a library

of possible mistakes that could be made

by the student in his or her pedagogical activities.

• Perturbation: The perturbation model is a

hybrid model that combines the concepts of

the overlay and bug models.

• Constraint-based: Unlike the previous

models, it does not compare the student’s

knowledge to the domain knowledge. It

rather focuses on correct knowledge by

checking if all the constraints of a certain

domain are satisfied by the student.


The student model that we propose in this

paper can be classified as a perturbation model,

because the student model is considered neither

an exact projection of the domain

model nor a set of possible mistakes made by

the students. Instead, the model is considered

to reflect several possible

misconceptions held by the students regarding

a certain domain. Furthermore, a constructivist

view is followed in the paper, according to which

each student builds his or her knowledge as s/he

interacts with the world.

Several educational applications keep a

student model to support personalized learning.

For instance, ALE (Kravcik and Specht, 2004)

keeps a model of the students with information

about their learning style to adjust the

navigation possibilities in the course to them;

ConceptLab (Zapata-Rivera and Greer, 2001)

supports knowledge construction and

visualization using concept maps to represent

the student’s view of the domain; and, E-tester

(Guetl et al., 2005) diagnoses student’s

knowledge with a conceptual frequency

histogram student model.

However, none of these systems uses any

kind of NLP technique to automatically

generate and update the model. The system

most closely related to our approach is E-tester.

E-tester keeps a model of the frequency with

which each term is used by each student, to be

compared with the frequency with which it is used

by the teachers in a set of model answers.

However, the focus is only on individual

concepts, without taking into account the

relationships among the concepts.

3 Overview of the procedure

Figure 1 shows an overview of the

procedure as implemented in the Will Tools

(Pérez-Marín, 2007). The Will Tools consist of

the following systems: Willow, a free-text CAA

system; Willed, an authoring tool; Willoc, a

configuration tool; and, COMOV, a conceptual

model viewer.

First of all, the teacher is asked to use

Willed to introduce the questions and their correct

answers (references) in the database. The

references are automatically processed to

generate the domain model. Next, whenever a

student answers one of the questions proposed

by Willow, s/he not only gets instant feedback,

but his or her use of the concepts of the domain

model is also analyzed to automatically generate his

or her student model. The student model

consists of personal data gathered from the

student and the generated conceptual model.

Finally, the conceptual model can be shown

to teachers and students with COMOV to

identify which concepts of the lessons should

be reviewed and which ones have already been

assimilated. The model is also used by Willow

to choose the next question to ask, and the

content of the model is updated with the new

answers provided by the students.

Figure 1: Overview of the procedure to generate students’ conceptual models

This procedure can be applied to the

answers of one student to generate an individual

student’s conceptual model, or

to the answers of a group of

students to generate a group conceptual model.

In any case, by using this procedure, the

student becomes immediately aware, by

looking at his or her generated model, of which

concepts s/he should still review; and the

teacher gains access to a monitoring tool that

reports not only how many

questions each student and the whole group

have answered but also, through each student’s

generated model, which concepts each student

seems to have understood or misunderstood

and, through the group conceptual model,

the average results of the whole group.

The Will Tools were used for the first time

in 2005 by a Spanish group of the

Operating Systems subject of an Engineering

degree. A year later, they were used during a

whole semester in the same domain, and in the

first semester of the 2007-2008 academic year they

were also used in a non-technical domain.

The results gathered in all these experiments

support the feasibility of the procedure.

However, given that this paper is focused on the

combination of the Natural Language

Processing and User Modeling techniques for

the procedure to work, the experiments are

mentioned here just to let the reader know that

the procedure has been implemented and has

already been used with students for several

years.

4 NLP+UM techniques for automatic

and adaptive free-text scoring

As can be seen in Figure 1, one of the inputs

to Willow is a student’s answer in plain text.

Other inputs are the knowledge contained in the

domain model as introduced by the teacher in

Willed (Willow’s authoring tool) and two

external lexical resources: WordNet 1.7 for

English and the Spanish EuroWordNet for

Spanish (Vossen, 1998). Both WordNet and the

Spanish EuroWordNet have been used as

processed by Alfonseca (2003).

Regarding the NLP techniques, Willed

implements the automatic identification of

the concepts of the course in the Term

Identification module, which can be implemented

using the C4.5 algorithm (Quinlan, 1993),

considering that a term is the label of a concept.

The features considered as attributes for the

training should be at least:

• The relative frequency (freqRel.) of

appearance of the term in the domain-specific

corpus (i.e. a corpus of free-text students’

answers) with respect to its frequency in the

generic corpus (i.e. journal news on Computer

Science). This is because terms tend to be

specific to a certain knowledge field and thus

appear more frequently in the specific corpus,

which gives them a relative frequency

higher than one.

• The sequence of part-of-speech (POS) tags

of the words composing the sample (e.g.

determiner+noun+adjective). This is because

terms tend to contain certain POS tags such as

nouns, adjectives, etc. but not others such as

verbs or adverbs.
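The two attributes above can be sketched as follows. This is an illustration under assumptions, not the Term Identification module itself: the function names are hypothetical, candidate terms are single tokens, the POS tags are supplied by an external tagger, and the additive `smoothing` term is introduced here only to avoid division by zero for terms absent from the generic corpus.

```python
from collections import Counter


def relative_frequency(term, specific_tokens, generic_tokens, smoothing=1.0):
    """Ratio of a term's rate in the domain-specific corpus to its rate
    in the generic corpus (values above one suggest a domain term)."""
    spec = Counter(specific_tokens)
    gen = Counter(generic_tokens)
    # Simple additive smoothing (an assumption of this sketch).
    spec_rate = (spec[term] + smoothing) / (len(specific_tokens) + smoothing)
    gen_rate = (gen[term] + smoothing) / (len(generic_tokens) + smoothing)
    return spec_rate / gen_rate


def candidate_features(term, pos_tags, specific_tokens, generic_tokens):
    """Attribute record for one candidate term: relative frequency and
    the sequence of POS tags of the words composing it."""
    return {
        "freq_rel": relative_frequency(term, specific_tokens, generic_tokens),
        "pos_sequence": "+".join(pos_tags),
    }
```

Records of this kind, labelled as term or non-term, would then be the training samples for a decision-tree learner such as C4.5.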

The resulting list of terms can be shown to

the teachers, so that they can modify it as they

consider appropriate to produce the final

list of concepts of the course. The CV of all

these concepts is initially set to zero.

In Willow, the NLP techniques used are the

following:

• Stemming, removal of closed class words,

Word Sense Disambiguation, and/or

Multiword Identification for processing

the free-text answers provided by the

students and the correct answers

(references) provided by the teachers to

make their comparison easier (following the

core idea of Willow and many other free-text

scoring systems: the more similar

a student’s answer is to the teachers’

answers, the better it is).

Additionally, a simple pattern extraction

mechanism is implemented to find out

relationships “BC linking word BC” in the

students’ answers.

• Statistical n-gram comparison using the

Evaluating Responses with Bleu (ERB)

algorithm (Pérez-Marín, 2007), which is

more focused on the style of the answer,

and Latent Semantic Analysis (LSA),

which is more focused on the content of the

answer, to compare the processed student’s

answers and teachers’ answers. Each of

them gives as output a numerical score:

ERB between 0 (bad answer) and 1 (good

answer) and LSA between -1 (bad answer)

and 1 (good answer). They can be used

together or independently. When the

ERB and LSA scores are used together, they

can be merged as a linear combination of

their values normalized to a common scale

(e.g. from 0 up to 100).
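The linear combination just described can be sketched as below. The ranges of ERB and LSA and the 0 to 100 scale come from the text; the function names and the default weight of 0.5 are placeholders chosen for illustration, not values reported for Willow.

```python
def normalize(score, low, high, scale=100.0):
    """Map a score from [low, high] onto [0, scale], clamping outliers."""
    clamped = max(low, min(high, score))
    return (clamped - low) / (high - low) * scale


def combined_score(erb, lsa, erb_weight=0.5):
    """Linear combination of ERB (range 0..1) and LSA (range -1..1)
    after bringing both onto a common 0..100 scale.

    The weight is a hypothetical parameter of this sketch.
    """
    if not 0.0 <= erb_weight <= 1.0:
        raise ValueError("erb_weight must be between 0 and 1")
    erb_norm = normalize(erb, 0.0, 1.0)
    lsa_norm = normalize(lsa, -1.0, 1.0)
    return erb_weight * erb_norm + (1.0 - erb_weight) * lsa_norm
```

In practice the weight would be one of the parameters tuned against the teachers' scores.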

Regarding User Modeling, the question

planner chooses the next question to ask

according to the level of difficulty of the

questions that the student is able to pass; and,

the clarification questions technique generates

new questions (not typed by the teacher) in the

form “What is X?” each time a concept is

wrongly used by the student (i.e. the ERB and

LSA metrics indicate that the student’s answer

is very different from the teachers’ answers).
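A rough sketch of this behaviour is given below. It is an assumption-laden illustration: the paper does not specify the pass threshold, how difficulty levels are encoded, or which concept of a poor answer triggers the clarification question, so `pass_threshold`, the integer difficulty levels and the choice of the first concept are all hypothetical.

```python
def clarification_question(concept):
    """Build the "What is X?" question for a concept used wrongly."""
    return f"What is {concept}?"


def next_question(questions, student_level, score, concepts_in_answer,
                  pass_threshold=50.0):
    """Return the text of the next question to ask, or None.

    questions: list of (text, difficulty) pairs, difficulty as an integer.
    score: 0..100 score of the last answer (hypothetical threshold).
    """
    if score < pass_threshold and concepts_in_answer:
        # Low similarity to the references: ask about the first concept.
        return clarification_question(concepts_in_answer[0])
    # Otherwise pick a question at the difficulty the student can handle.
    for text, difficulty in questions:
        if difficulty == student_level:
            return text
    return None
```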

Finally, the student is given as feedback the

numerical score (result of the combination of

ERB and/or LSA techniques), the processed

answer (with the concepts marked) and the

correct answers (as provided by the teacher).

Furthermore, the student and the teacher can

look at the generated student’s model for each

particular student and the whole class in the

Conceptual Model Viewer.

The student conceptual model is represented

in five different formats so that each student can

choose the format s/he considers most

illustrative or look at all of them. These formats

are: concept map, conceptual diagram, bar

chart, table and textual summary. Figure 2

shows a sample conceptual model represented as

a textual summary.

Figure 2: Sample generated student’s

conceptual model presented as textual summary

Parameter            Value

Language             Spanish / English

Area-of-knowledge    Operating Systems

Metric of goodness   Pearson correlation between Willow’s and teachers’ scores

Table 1: Parameters used

5 Results

The results of combining these NLP techniques

differ according to the language and

area-of-knowledge in which the techniques are

applied. Furthermore, they depend on the metric

of goodness chosen for the procedure.

Nevertheless, in order to give a general idea

of the values that can be reached, some results

are presented using the parameters indicated in

Table 1 (see Pérez-Marín, 2007 for the

experimental details):

- For Spanish, the optimum combination is

Term Identification to extract the BCs

using the C4.5 algorithm (Quinlan, 1993),

achieving 74% F-score, plus stemming and ERB,

reaching up to 54% Pearson correlation (an

average value in the field, Valenti et al., 2003).

Furthermore, if a Genetic Algorithms Module

is included to automatically choose some of the

best students’ answers of one year as correct

answers of the following year, the Pearson

correlation is increased up to 63% (Pérez-

Marín, 2007).

- For English, the optimum combination is

Term Identification to extract the BCs

using the C4.5 algorithm (Quinlan, 1993), also

achieving 74% F-score, plus stemming, removal of

closed-class words, and ERB+LSA, reaching

up to 56% Pearson correlation. Given that the

procedure has not been applied to English

students yet, no results are available to decide

how this percentage can be increased by

applying the Genetic Algorithms Module.

Finally, when the metric used is

the Pearson correlation between the scores

achieved by the students in the final exam and

their scores in the

generated conceptual model (computed as the average of

the CVs of all the concepts of the model), a 50%

statistically significant positive correlation is

found (p=0.0063), measured over the results of

31 Spanish Engineering students using the

optimum combination of NLP techniques (for

more experimental details see Pérez-Marín,

2007).
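For readers unfamiliar with the metric, the computation can be sketched as follows. The function names are chosen here for illustration; the only elements taken from the text are that a model's score is the average of its CVs and that this score is correlated with exam scores.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def model_score(cvs):
    """Score of a conceptual model: the average of its concepts' CVs."""
    return sum(cvs) / len(cvs)
```

Applying `model_score` to each student's CVs and passing the resulting list together with the exam scores to `pearson` yields a coefficient between -1 and 1, which the paper reports as a percentage.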

Furthermore, it is possible to validate the

automatically generated students’ conceptual

models at the concept level of granularity. In

order to do that, first of all, a human teacher is

asked to estimate how well a group of

students know a set of concepts. Secondly,

Willow is used to estimate the same concepts

for the same group of students. Finally, the

mean quadratic error between the estimation

made by the human and the estimation

calculated by the system for each concept and

each student is calculated. In particular, a

mean quadratic error of 0.08 was attained, measured

over 65 concepts of nine students (for more

experimental details, see Pérez-Marín, 2007).
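The validation measure can be sketched as below, reading "mean quadratic error" as the mean of the squared differences between both estimations. The dictionary layout keyed by (student, concept) pairs and the function name are assumptions made for this example.

```python
def mean_quadratic_error(human, system):
    """Mean of squared differences over every (student, concept) pair.

    Both arguments map (student, concept) -> estimate on a 0..1 scale.
    """
    if human.keys() != system.keys():
        raise ValueError("both estimations must cover the same pairs")
    if not human:
        raise ValueError("no estimates to compare")
    return sum((human[k] - system[k]) ** 2 for k in human) / len(human)
```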

Qualitatively, it has also been observed that

the higher the scores achieved by the students

in the final exam, the more complex their

conceptual models are.

6 Applications to other domains and/or

languages

The results previously indicated have been

focused on an Engineering area-of-knowledge,

specifically Operating Systems because we

work as teachers in that area. However, the

proposed combination of techniques is not

limited to Engineering degrees.

On the contrary, it can be applied to other

areas-of-knowledge provided that it is not

necessary to assess creative thinking or do

mathematical calculations, which are

completely out of scope of this work.

Moreover, the procedure can also be applied

to languages other than Spanish or

English, by taking the requirements

summarized in Table 2 into account.

NLP technique                   Sp.   En.   Ot.

Term Identification             Yes   Yes   No

Stemming                        Yes   Yes   No

Removal of closed-class words   Yes   Yes   No

ERB                             Yes   Yes   Yes

LSA                             No    Yes   No

Table 2: NLP techniques already implemented

for the procedure to work in different languages

(Sp.: Spanish; En.: English; Ot.: Other)

The steps to apply the procedure to generate

the students’ conceptual models from free-text

answers typed into a free-text ACAA system

such as Willow to another domain and/or

language are the following:

1. Provided that the language in which the

procedure is to be applied is English or

Spanish, the teachers and students can use

Willow or a free-text CAA with an

English/Spanish interface without further

modification. For other languages, the

systems’ interfaces should be translated

into the target language.

2. Teachers should create a new area-of-

knowledge with the authoring tool and

afterwards, fill in the forms about its name,

description, features and topics that the

area-of-knowledge comprises.

3. Next, teachers should introduce the

questions. In particular, each question needs:

its statement, several

correct answers (e.g. three) to capture as

much lexical variability as possible, its

maximum score, its topic and its level of

difficulty.

4. For English and Spanish courses, no

Natural Language Processing techniques or

resources are needed other than those

already implemented in

Willow, so teachers can go directly to the next

step. On the other hand, for courses written

in a different language, it would be

necessary to have a stemmer, a Part-of-Speech

(POS) tagger, and a specific and a

generic corpus for the Term Identification

module, which are needed to classify the

candidate n-grams in the references as

terms. The specific corpus is given by the

correct answers provided by the teachers

and the generic corpus can be

automatically retrieved from the web.

5. The administrator of the authoring tool

should apply the Term Identification

module to the correct answers provided by

the teachers to generate the first list of

terms. This list can be reviewed by the

teachers, and the resulting list of terms is

stored as the concepts of the conceptual

model together with their frequency in the

teachers’ correct answers.

6. Teachers ask their students to register in

the free-text CAA system.

7. Students answer the questions asked by the

free-text CAA system, which keeps

track of each student’s use of the concepts

found in his or her free-text answers. The

system also looks for patterns

“BC+linking words+BC” and, in general,

processes the answer with the NLP

techniques chosen so that it can be compared with the

teachers’ answers.



8. During the course, teachers and students

can see the generated conceptual models

using a conceptual model viewer.

9. For the next course, teachers who are

willing to use the procedure again can ask

the administrator of the tools to tune some

internal parameters to improve the

accuracy of free-text scoring. In particular,

from the information gathered this year:

genetic algorithms can be run on the

teachers’ and students’ answers to choose

the best references among this set of texts;

and, teachers can be asked to manually

score a set of answers to calculate the

Pearson correlation between the automatic

and the teachers’ scores for these answers.

This is because, as seen in the

previous section, the optimum combination

of NLP techniques for a different area-of-knowledge

and/or language may change

and should be optimized for each particular

case.
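The pattern search mentioned in step 7 can be illustrated with a deliberately naive sketch. The paper gives no implementation details, so everything here is an assumption: concepts are single words, sentences are split on end punctuation, and the words found between two consecutive concepts are taken as the linking words.

```python
import re


def extract_links(answer, concepts):
    """Find (concept, linking words, concept) triples inside each sentence.

    Assumes single-word concepts and naive tokenization; both are
    simplifications made for this sketch.
    """
    known = {c.lower() for c in concepts}
    triples = []
    for sentence in re.split(r"[.!?]+", answer):
        tokens = re.findall(r"\w+", sentence.lower())
        last_concept, last_index = None, None
        for i, token in enumerate(tokens):
            if token not in known:
                continue
            # Keep the pair only if at least one word separates the concepts.
            if last_concept is not None and i - last_index > 1:
                linking = " ".join(tokens[last_index + 1:i])
                triples.append((last_concept, linking, token))
            last_concept, last_index = token, i
    return triples
```

Each triple could then be stored as a link of the student's conceptual model.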

7 Conclusions

The benefits of the convergence between

User Modeling (UM) and Natural Language

Processing (NLP) techniques have been

claimed by several researchers (Zukerman and

Litman, 2001; Johansson, 2002; Reiter et al.,

2003). However, these authors have usually

focused on applying these techniques to

the Natural Language Understanding and

Natural Language Generation fields.

In this paper, we have studied how UM and

NLP techniques can be combined to permit the

automatic and adaptive assessment of students’

free-text answers.

Traditionally, many of the existing free-text

CAA systems have had formative assessment

purposes: to serve as a double-checker of the

scores given by the teachers, or to provide more

training to the students before their final exams

(Valenti et al., 2003). However, to date,

none of them has kept a student model that

guides the student towards the correct answer,

or provides him or her with more detailed

feedback than a numerical score, a correct answer

or a link to the theoretical explanation.

Yet generating a student

model from his or her free-text answers has

several benefits, both for teachers and for students.

For teachers, there are two main benefits that

can be highlighted:

- To receive more feedback to know how

well the students have understood the concepts

taught in the lessons. The teachers can look at

the generated students’ conceptual models in

conceptual model viewers such as COMOV.

Moreover, teachers can identify several types of

misunderstandings that have been classified in

the taxonomy of detectable errors detailed

below:

* For concepts:

– Ignorance: Whenever a student does not

use a certain concept, the concept is associated

with a CV of zero, which may indicate that the

student does not know that concept.

– Misconceptions: Some concepts may seem

to be known by students as sometimes they use

those concepts. However, the students might

have wrongly used the concepts in their

answers. Thus, these concepts are associated with a

CV below 0.5 in a 0 (no knowledge) to 1

(perfect knowledge) scale of estimation.

* For links:

– Ignorance: Whenever a student does not

relate two concepts, the teacher can notice the

lack of links between these two concepts, which

may indicate that the student is unaware that the

concepts are related.

– Erroneous links: Sometimes students

erroneously relate two concepts in their

answers. This reveals an error in the

cognitive structure of the student as s/he

believes that the concepts are related in a wrong

way. It is fundamental to correct this situation

to allow the student to continue learning

meaningfully and correctly linking new

concepts to the existing ones (Ausubel, 1963).

- To keep track of the students’ learning

progress by looking at the representations of the

student model several times during the course,

e.g. from a concept map representation, the

teachers can easily see the conceptual evolution

of the students by observing how the new

concepts modify the previous ones, and the new

links that are being created.
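The concept-level part of the taxonomy above can be sketched as a simple mapping from CVs to categories. The thresholds (a CV of zero for ignorance, a CV below 0.5 for misconceptions) come from the text; the function names and the label used for concepts without a detected error are hypothetical.

```python
def classify_concept(cv):
    """Map a confidence-value (0..1) to the concept-level categories."""
    if not 0.0 <= cv <= 1.0:
        raise ValueError("CV must be between 0 and 1")
    if cv == 0.0:
        return "ignorance"
    if cv < 0.5:
        return "misconception"
    return "no error detected"


def review_list(cvs):
    """Concepts a teacher may want to review, grouped by category."""
    report = {"ignorance": [], "misconception": []}
    for term, cv in cvs.items():
        category = classify_concept(cv)
        if category in report:
            report[category].append(term)
    return report
```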

For students, there are the following benefits:

- To be able to get more personalized and

efficient training before their final exams, as the

system finds out which concepts are least

known. The system starts asking each particular

student about concepts with low CV, instead of

going over concepts already known.

Furthermore, the system chooses the questions

related to those concepts that are at the level of

difficulty that the student is able to handle. The

reason for that is to maintain the students’

motivation to keep answering new questions, so

that the new questions are neither too difficult (i.e.

too complicated to give any answer) nor too easy

(without any real interest for the student).

- To be guided towards the correct answer

(promoting reflective thinking instead of

memorizing the answer) with a set of

clarification questions automatically generated

by the system.

- To have access to always updated feedback,

as the model does not have to be created by the

teacher or the student. On the contrary, the

model is automatically generated from the

students’ answers in a process that is transparent for

the student, who only has to answer the

questions of the free-text CAA system.

References

Alfonseca, E. 2003. ‘An Approach for

Automatic Generation of on-line Information

Systems based on the Integration of Natural

Language Processing and Adaptive

Hypermedia techniques’, PhD dissertation,

Computer Science Department, Universidad

Autónoma de Madrid, available on-line at

http://alfonseca.org/eng/research/thesis.html

Guetl, C., H. Dreher, and R. Williams. 2005.

‘E-tester: A computer-based tool for auto-

generated question and answer assessment’,

E-Learn, Canada.

Johansson, P. 2002. ‘NLP Techniques for

Adaptive Dialogue Systems’, available on-line

at (link reviewed on 20th April 2008):

http://www.ida.liu.se/~ponjo/downloads/nlp1

paper/johansson_nlp1_paper.pdf

Kass, R. and T. Finin. 1988. ‘Modeling the User in Natural Language Systems’, Computational Linguistics, 14: 5–22.

Kay, J. 2001. ‘Learner Control’, User Modeling and User-Adapted Interaction, 11: 111-127.

Kobsa, A. 1990. ‘User Modeling in Dialog Systems: Potentials and Hazards’, AI & Society, 4(3): 214-231.

Kravcik, M. and M. Specht. 2004. ‘Flexible Navigation Support in the WINDS Learning Environment for Architecture and Design’. Proceedings of the Adaptive Hypermedia International Conference, The Netherlands.

Labidi, S. and N. Sergio. 2000. ‘Student modeling and semi-automatic domain ontology construction for shiecc’. In Proceedings of the 30th ASEE/IEEE Frontiers in Education Conference.

Mitrovic, A. 2001. ‘Cosc420 lecture notes: Cognitive modeling and intelligent tutoring systems’.

Pérez-Marín, D. 2007. ‘Adaptive Computer Assisted Assessment of free-text students' answers: an approach to automatically generate students' conceptual models’, Ph.D. dissertation, Escuela Politécnica Superior, Universidad Autónoma de Madrid, available on-line at (link reviewed on 20th April 2008): http://www.eps.uam.es/~dperez/index1.html

Quinlan, J.R. 1993. ‘C4.5: Programs for Machine Learning’, Morgan Kaufmann Publishers.

Reiter, E., S. Sripada, and S. Williams. 2003. ‘Acquiring and Using Limited User Models in NLG’, European Workshop on Natural Language Generation, Hungary, pp. 87-94.

Sparck Jones, K. 1989. ‘Realism about User Modeling’. In Kobsa, A., Wahlster, W., eds.: User Models in Dialog Systems. Springer-Verlag, Symbolic Computation Series, pp. 341–363.

Valenti, S., F. Neri and A. Cucchiarelli. 2003. ‘An Overview of Current Research on Automated Essay Grading’, Journal of Information Technology Education 2, pp. 319–330.

Zapata-Rivera, J.D. and J.E. Greer. 2001. ‘Externalising learner modelling representations’. Proceedings of Workshop on External Representations of AIED: Multiple Forms and Multiple Roles, pp. 71–76.

Zukerman, I. and D. Litman. 2001. ‘Natural Language Processing and User Modeling: Synergies and Limitations’, User Modeling and User-Adapted Interaction, 11: 129-158.



Some specific problems in the annotation of semantic roles. A brief comparative study based on data from AnCora, SenSem and ADESSE

Gael Vaamonde

Grupo de Investigación Gramática y Léxico (GIGRALEX)

Departamento de Tradución e Lingüística

Universidade de Vigo

E-36200 Vigo, España

[email protected]

Resumen: La etiquetación de papeles semánticos se ha convertido en un reto importante tanto en el campo de la lingüística de corpus como en el procesamiento del lenguaje natural. Sin embargo, se trata de una tarea compleja en la que debemos afrontar ciertos problemas de anotación y en la que diferentes grupos de trabajo a menudo adoptan soluciones dispares, independientemente del marco teórico que sustente el análisis. En este artículo se describen algunos de estos problemas a la vez que se comparan las distintas soluciones adoptadas por tres proyectos de investigación que han abordado el análisis sintáctico-semántico de un corpus en español.

Palabras clave: Papeles semánticos, anotación de corpus, clasificación de verbos, estructura argumental.

Abstract: Semantic role labelling has become an important challenge both in corpus linguistics and in natural language processing. It is, however, a hard task that raises a number of annotation problems, and different groups often adopt different solutions, regardless of the theoretical framework that supports the analysis. This paper outlines some of these problems and compares the solutions adopted by three research projects that have undertaken the syntactic-semantic analysis of a Spanish corpus.

Keywords: Semantic roles, corpus annotation, verb classification, argument structure.

1 Introduction

The annotation of a corpus is usually modular; that is, it usually follows distinct levels of linguistic analysis (morphology, syntax, semantics, pragmatics). In this respect, corpus work does not escape some of the problems that theoretical linguistics has had to face. The object of study at each level of analysis becomes increasingly “slippery” and less systematic, and each step up in level seems to bring greater resistance to linguistic description in terms of discrete, clearly defined and easily applied units.

To this growing complexity must be added, in parallel, a decreasing level of agreement for annotation purposes. In contrast to the relative consensus found in the morphosyntactic enrichment of different corpora (provided one abstracts away from particular syntactic theories), semantic tagging can vary significantly from one annotator to another, both methodologically and descriptively, and can lead to different analyses of the same concrete example.

This paper aims to bring to light some of the problems the annotator encounters when adding semantic information to a corpus, specifically when dealing with questions of semantic role labelling. To that end, three research projects that have undertaken this task for Spanish are taken as reference: AnCora (Annotated Corpora)1, SenSem (Sentence Semantics: Creación de una base de datos de Semántica Oracional)2 and ADESSE (Alternancias de Diátesis y Esquemas Sintáctico-Semánticos del Español)3.

The paper is organised as follows. Section 2 briefly describes the projects under study, and its final subsection (2.4) notes some preliminary considerations that must be borne in mind when comparing them. Section 3 focuses on three specific problems that illustrate some of the difficulties of annotating semantic roles. Section 4 closes the paper with some general conclusions on role labelling in corpora.

2 The linguistic resources used

2.1 AnCora

The AnCora project, carried out by the Centre de Llenguatge i Computació (CLiC) at the Universidad de Barcelona, comprises two corpora of 500,000 words each, one for Catalan (AnCora-CAT) and one for Spanish (AnCora-ESP), although only the AnCora-ESP data are considered here. This corpus consists of 400,000 words drawn from various newspaper sources and 100,000 words from the 3LB-ESP corpus (Civit and Martí, 2004).

The semantic annotation of AnCora starts from a verb classification based on the well-known typology of Vendler (1967), later developed in Dowty (1979), which distinguishes four event types according to Aktionsart: states, activities, achievements and accomplishments. In addition, AnCora adopts lexical decomposition as its method of analysis (Levin and Rappaport, 1995; Rappaport and Levin, 1998), so that each event type is associated with a Lexical-Semantic Structure, that is, a combination of variables, constants and primitive predicates that represent the logical structure of the event. These four general classes are in turn divided into subclasses according to argument structure, semantic roles and diathesis alternations, yielding a total of 13 semantic classes.

1 http://clic.ub.edu/ancora/. HUM2006-27378-E. TIN2006-15265-C06-06

2 http://grial.uab.es/fproj.php?id=1. BFF2003-06456

3 http://webs.uvigo.es/adesse. HUM2005-01573

The assignment of semantic roles to each argument of the verb depends on the semantic class associated with that verb (verb sense), and more specifically on the lexical-semantic structure and the diathesis alternations in which it appears (cf. Martí et al., 2007: 27 ff.).

2.2 SenSem

The SenSem project, developed by the Grup de Recerca Interuniversitari en Aplicacions Lingüístiques (GRIAL) in Catalonia, offers syntactic-semantic information on what it takes to be the 250 most frequent verbs of Spanish. Starting from a corpus of approximately 13 million words, built entirely from the online editions of “El Periódico de Catalunya”, SenSem selected 25,000 sentences, 100 per verb, which were then annotated with syntactic and semantic information.

The annotation process in SenSem operates at essentially three levels: the lexical unit, the constituents and the sentence itself. For each participant, its argument status (argument versus adjunct) is marked and relevant syntactic information (category and function) is added. In addition, each argument is associated with a particular semantic role.

At sentence level, each verb sense carries information about the type of event denoted (event, process or state), and each syntactic frame is associated with a label summarising its constructional meaning (anticausative, antiagentive, reflexive, habitual, …), something which, as noted in Castellón et al. (2006), distinguishes SenSem from similar projects.

2.3 ADESSE

ADESSE (Alternancias de Diátesis y Esquemas Sintáctico-Semánticos del Español) is a project under development at the Universidade de Vigo which, through the syntactic-semantic annotation of a Spanish corpus, aims to provide a database for the empirical study of the interaction between verbs and constructions.

All the syntactic information in ADESSE is inherited directly from the Base de Datos Sintácticos del Español Actual (BDS)4, which contains the syntactic analysis of, and information on the valency elements of, the almost 160,000 clauses that make up the contemporary part of the ARTHUS corpus5. This corpus of approximately 1.5 million words comprises texts of varied kinds (narrative, theatre, essay, press and oral) from Spain and Latin America.

The ADESSE project exists to enrich semantically the data supplied by the BDS, and this enrichment is directed at three clear goals: sense discrimination, semantic classification and role labelling.

In ADESSE, each verb sense is associated with a particular semantic class (or with several). For each semantic class, a set of prototypical roles of the cognitive domain described has been defined. In turn, each verb sense includes a set of semantic roles covering all the participants possible with that verb (its valency potential). In general, the verb inherits by default the roles of the class(es) to which it belongs, and any further roles considered necessary to account for all the constructional possibilities of that verb are added (cf. García-Miguel and Albertuz, 2005).
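The default inheritance of roles from class to verb sense can be pictured with a small data structure. The following Python fragment is only an illustrative sketch under stated assumptions: the names `SemanticClass`, `VerbSense` and `valency_potential`, the sample class "Transferencia" and the role names "Donante", "Poseído" and "Beneficiario" as used here are hypothetical, and nothing in it reproduces the actual ADESSE database schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class SemanticClass:
    """A verb class with its prototypical roles (hypothetical representation)."""
    name: str
    roles: Tuple[str, ...]


@dataclass
class VerbSense:
    """A verb sense that belongs to one or more classes and may add roles of its own."""
    lemma: str
    classes: List[SemanticClass]
    extra_roles: List[str] = field(default_factory=list)

    def valency_potential(self) -> List[str]:
        """Roles inherited from the class(es) first, then verb-specific additions."""
        roles: List[str] = []
        for semantic_class in self.classes:
            for role in semantic_class.roles:
                if role not in roles:  # avoid duplicates across classes
                    roles.append(role)
        for role in self.extra_roles:
            if role not in roles:
                roles.append(role)
        return roles


# Hypothetical example: a transfer class and a verb sense that adds one general role.
transfer = SemanticClass("Transferencia", ("Donante", "Poseído", "Poseedor-final"))
dar = VerbSense("dar", [transfer], extra_roles=["Beneficiario"])
print(dar.valency_potential())
```

Keeping the class roles and the verb-specific additions in separate fields makes it visible which labels come from the class and which are general roles added for a particular verb.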

2.4 Some preliminary considerations

Before making any comparison between the projects described, it is worth noting some of the features that set them apart and that must be borne in mind before undertaking the comparative study proposed here.

One of the most common problems, not only in corpus annotation but in the study of the syntax-semantics interface in general, is that of drawing the line between arguments and adjuncts.

In the task of labelling verbal participants this distinction plays an important part, since semantic role annotation is usually restricted to those elements considered to be required by the predicate. As the following examples show, only AnCora, (1), includes adjuncts among the participants that carry a semantic label. In SenSem, (2), elements not considered arguments receive no semantic description, while ADESSE, (3), builds on the earlier work of the BDS and annotates only those elements that were treated as valency elements in that database.

4 http://www.bds.usc.es/

5 http://www.bds.usc.es/corpus.html

(1) […] asistirá a la XII Cumbre de Jefes de Estados Andinos que (Arg1-PAC) se celebrará en Lima (ArgM-LOC) el 9 y 10 de junio (ArgM-TMP)

(2) En Juriba, ciudad del interior marroquí, cada verano se celebra el mercado de los italianos (Tema)

(3) […] el tema de estos cursos que (A2 Actividad) se celebrarán la semana próxima en el Área de Cultura de Caixa Galicia

It should also be noted that none of the three projects uses the role inventory as its only resource for semantic annotation. All three include a semantic classification of verbs, whether aspectual (AnCora and SenSem) or notional (ADESSE). Moreover, AnCora uses lexical-semantic structure as a step prior to delimiting and assigning roles, while in ADESSE the semantic class to which each verb sense belongs largely determines the set of labels used to describe its valency potential.

The two tables below summarise the main characteristics of each project, with regard both to corpus data and to the semantic annotation process:

Corpus    Words               Clauses    Lemmas
AnCora    500,000 (note 6)    6,009      1,895
SenSem    700,000             25,000     250
ADESSE    1,450,000           160,000    3,436

Table 1: Number of words, clauses and verb lemmas in each corpus

Role annotation    Method            Coverage    Labels
AnCora             semi-automatic    partial     20
SenSem             manual            total       24
ADESSE             manual            total       143 (note 7)

Table 2: Type of method, degree of coverage and number of labels

6 At the time of writing, the semantic annotation of AnCora-ESP has not yet been completed (188,513 words out of a total of 500,000).

7 This list is currently under revision.


Finally, the basic aims of each project must also be taken into account. SenSem and ADESSE are primarily descriptive linguistic resources that provide a query system over the data analysed in each corpus8; AnCora, by contrast, has a clear computational purpose as a source of applications and tools for natural language processing.

These aspects, which in many cases condition the analyses adopted, do not however rule out a study such as this one, which seeks to compare some specific problems in semantic role labelling across three research projects that share the use of such roles as a descriptive tool for annotating Spanish corpora.

3 Some annotation problems

3.1 The annotation of datives and CINDs

Many studies have taken an interest in the indirect object (CIND) in Spanish. For the present study we take as reference Gutiérrez Ordóñez (1999), which distinguishes between argumental CINDs (CIND1), exemplified in (4a-b) and typically occurring with verbs of transfer; non-argumental or incorporated CINDs (CIND2), exemplified in (4c-d) and usually occurring with verbs of creation, destruction or preparation; and superfluous datives, exemplified in (4e-f) and distinguished from CINDs chiefly by their exclusively pronominal form and by the possibility of co-occurring with any other syntactic function (cf. Gutiérrez Ordóñez, 1999: 1909 ff.).

(4) a. Le envió una postal a su hermano
    b. No nos corresponden esos lujos
    c. Te arreglé las tijeras
    d. Le arañó la cara
    e. Nos tememos lo peor
    f. No te me acalores

Taking this typology as a starting point, let us see how each project treats these functions.

In AnCora, both CIND1 and CIND2 generally carry the label Beneficiary (BEN), as seen in examples (5a-b) and (5c-d) respectively. As for the datives cited by Gutiérrez Ordóñez, they appear to fall outside the semantic annotation process in this project, judging by the examples in (5e-f):

8 For a comparison of the two projects, see Cuadros Muñoz, 2005: 126 ff.

(5) a. […] dar a los fabricantes de ordenadores (Arg2-BEN) mayor flexibilidad
    b. […] uno de los dos puestos que le corresponden a España (Arg2-BEN)
    c. […] que abrirá a este país (Arg2-BEN) los mercados chinos
    d. […] para arreglarle la jaima a la Caballé (Arg2-BEN)
    e. Un solo visón se (ø) comió 87 huevos
    f. Se (ø) llevó una bolsa de 200.000 dólares

Perhaps most striking is that AnCora treats as arguments both the Beneficiaries functioning as CIND1 and those functioning as CIND2, so that, for annotation purposes, nothing, whether syntactic or semantic, appears to distinguish the two cases.

As regards SenSem, CIND1 are labelled Destination (Dest), as seen in (6a-b), while CIND2 receive mixed treatment. Generally they are not labelled semantically (6d), in keeping with the idea of reserving this information for the argumental elements of the verb. In any case, determining argumenthood is a complex matter open to different interpretations, so we find cases classified by Gutiérrez Ordóñez as CIND2 that carry a semantic role in SenSem (6c). Superfluous datives, naturally, have no semantic label in this project (6e-f).

(6) a. Daremos una respuesta positiva a las personas que trabajan en las casas (Dest)
    b. […] de los que 4.133 [trabajadores] corresponden a España (Dest)
    c. Si nos (Dest) crean una nueva barrera, que nos quiten otra
    d. […] pidiendo que le (ø) arreglen la Casa dels Canonges
    e. […] como las ovejas no son suyas sale corriendo, y el lobo se (ø) las come
    f. Se (ø) llevó la mano derecha a la boca

It should be noted that in SenSem some clauses have been annotated at sentence level with the label “Dativo de interés” (dative of interest), which covers both possessive datives (7a), treated in Gutiérrez Ordóñez (1999) within the CIND2 group, and clearly superfluous datives (7b). However, cases such as (7c-d), which also appear to be clear possessive datives, are not treated as “Dativo de interés”, and only one of them carries a semantic label, so SenSem does not appear to have a systematic solution for annotating this type of dative:

(7) a. […] me (ø) he reducido el estómago
    b. Se nos (ø) va Julia de TV-3
    c. Se le (Dest) ve demasiado el truco
    d. […] como ciego, en Telecupón, llega a tocarle el culo a Belinda Washington (ø)

Finally, in ADESSE both CIND1 and CIND2 carry a semantic label. In the first case the label is determined by the semantic class associated with the verb in question (Poseedor-final with verbs of transfer, Entidad2 with verbs of attribution, …), and in the second it is usually a Beneficiary or a Possessor, general labels not tied to any particular class (AG). Most of the so-called superfluous datives, for their part, have no label and are interpreted as middle-voice markers.

(8) a. […] ya se venció el plazo que le dimos a la gerencia (A1 POS-FIN)
    b. […] las mayores subidas han correspondido a Madrid (A2 ENT2)
    c. Ya de paso que nos (AG POS-A1) arreglé la cocina
    d. A la mañana siguiente no quiso abrirme (AG BEN) la puerta
    e. ¿Y sabe mi señora qué haría después? ¡Me (ø) comería los cocodrilos!
    f. Se (ø) llevaron a mi padre, y mi madre lo veía en sueños

The following table summarises the annotation adopted in each project for the CINDs distinguished:

          CIND1                CIND2 (dat. pos.)    CIND2 (others)    Dat.
AnCora    BEN BEN              BEN                  BEN               ø
SenSem    Dest Dest            ø / dat. interés     Dest / ø          ø
ADESSE    Role of the class    POS                  BEN               ø

Table 3: Semantic roles frequently associated with CINDs and datives in AnCora, SenSem and ADESSE
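The contrasts described in the prose of this section can also be restated as a small lookup, which makes it easy to check which projects keep CIND1 and CIND2 apart and where a project has more than one observed outcome. This is a hedged sketch, not a resource published by any of the three projects: the dictionary `OBSERVED`, the placeholder `"CLASS_ROLE"` and both helper functions are assumptions made for illustration, and the entry for superfluous datives in ADESSE simplifies "most carry no label" to "no label".

```python
from typing import Dict, FrozenSet, Optional

Label = Optional[str]  # None stands for "no semantic label" (ø)

# Outcomes reported in the text for each project and dative type.
# "CLASS_ROLE" is a placeholder for "role inherited from the verb's semantic class".
OBSERVED: Dict[str, Dict[str, FrozenSet[Label]]] = {
    "AnCora": {
        "CIND1": frozenset({"BEN"}),
        "CIND2": frozenset({"BEN"}),
        "superfluous": frozenset({None}),
    },
    "SenSem": {
        "CIND1": frozenset({"Dest"}),
        "CIND2": frozenset({"Dest", None}),
        "superfluous": frozenset({None}),
    },
    "ADESSE": {
        "CIND1": frozenset({"CLASS_ROLE"}),
        "CIND2": frozenset({"BEN", "POS"}),
        "superfluous": frozenset({None}),  # simplification: "most" are unlabelled
    },
}


def separates_cind_types(project: str) -> bool:
    """True if CIND1 and CIND2 do not receive exactly the same set of outcomes."""
    return OBSERVED[project]["CIND1"] != OBSERVED[project]["CIND2"]


def has_single_outcome(project: str, dative_type: str) -> bool:
    """True if only one outcome is reported for this dative type in this project."""
    return len(OBSERVED[project][dative_type]) == 1


for name in OBSERVED:
    print(name, separates_cind_types(name), has_single_outcome(name, "CIND2"))
```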

The main difference between the projects lies in how incorporated CINDs (CIND2) are treated. AnCora has chosen to unify under the general label Beneficiary all participants that are coded as CIND and show some degree of affectedness.

SenSem seems to go a step further in its treatment of these constituents: although it broadly follows the same line of analysis as AnCora, in this case with Destination as the unifying role, it recognises a specific solution for the well-known possessive datives. This solution, however, is given at sentence level rather than through a different role, and is applied unsystematically, as the examples in (7) show.

Finally, ADESSE takes the view that CIND2, not being clearly argumental, do not inherit a role from the corresponding class, as CIND1 do, and must therefore be labelled with general roles. Among the latter, moreover, a distinction is drawn, at least at the most specific level of analysis, between Beneficiaries and Possessors. This gives ADESSE's analysis finer granularity, although in return apparent inconsistencies such as those in (9) may arise, the result of labelling ambiguous cases that pose an additional problem not faced by SenSem and AnCora:

(9) a. Le mira las manos (POS)
    b. Le inmoviliza los brazos (BEN)
    c. Se les ha detectado un virus (POS)
    d. Se le designó un abogado (BEN)
    e. Le golpeaba en el estómago (POS)
    f. Le soplaba en la boca (BEN)

3.2 Alternations with additional participants

Another problem that any process of semantic annotation of corpora must deal with concerns the well-known constructional alternations that a single verb may display. Among them, we focus on those that result from adding a further participant to the event described, as illustrated in the schemas in (10) and (11):

(10) a. Alguien imita algo
     b. Alguien imita a alguien
     c. Alguien le imita algo a alguien
     d. Alguien imita a alguien en algo



(11) a. Alguien sorprende a alguien
     b. Algo sorprende a alguien
     c. Alguien sorprende a alguien con algo
     d. Algo sorprende a alguien de alguien

In general, cases such as these require three semantic labels, one for each constituent of the corresponding three-participant clauses (10c-d) and (11c-d). The problem lies in how the relation between the different schemas making up the alternation is captured by the semantic roles chosen, and in how these roles are applied depending on whether the participant in question is animate or inanimate.

The two tables below illustrate, on the basis of various examples, the different annotation solutions adopted:

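Schema                                     AnCora         SenSem          ADESSE
(10a) Alguien imita algo                   Agt Pat        not recorded    Act Obj
(10b) Alguien imita a alguien              Agt Pat        not recorded    Act Obj
(10c) Alguien le imita algo a alguien      Agt Pat Ben    not recorded    Act Obj Ref
(10d) Alguien imita a alguien en algo      Agt Pat Adv    not recorded    Act Obj Ámb

Table 4: Annotation solutions for imitar and similar verbs9 (roles are listed in the order of the constituents in each schema)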

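Schema                                        AnCora          SenSem         ADESSE
(11a) Alguien sorprende a alguien             Cau Pat         Agt Exp        Est Exp
(11b) Algo sorprende a alguien                Cau Pat         Cau Exp        Est Exp
(11c) Alguien sorprende a alguien con algo    Cau Pat Adv     Agt Exp Cau    Est Exp Med
(11d) Algo sorprende a alguien de alguien     not recorded    Cau Exp ø      Est Exp Ref

Table 5: Annotation solutions for sorprender and similar verbs (roles are listed in the order of the constituents in each schema)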

9 Act (Actor), Adv (Adverbial), Agt (Agent), Ámb (Ámbito, domain), Ben (Beneficiary), Cau (Cause), Est (Estímulo, stimulus), Exp (Experiencer), Med (Medio, means), Obj (Object), Ref (Reference)

Broadly speaking, there are two ways of labelling these cases. The first is to use different labels for the same syntactic frame depending on whether its constituents are animate or inanimate. This is what SenSem does for verbs such as sorprender and the like, where Agents (animate entities) are distinguished from Causes (inanimate entities) for annotation purposes. A single syntactic frame (SUJ-CDIR) thus corresponds to two different semantic frames (Agent/Cause-Exp), and in constructions with three participants the Experiencer remains unchanged while the roles Agent and Cause cover the remaining possibilities, provided these are considered arguments.

The other way is to disregard the animacy of the participants in these cases and use the same role for the object (imitar and the like) or the subject (sorprender and the like), whether animate or not, in the transitive schemas. On this approach, constructions with three constituents require specific roles for annotating the third participant. As the two preceding tables show, this is the option adopted in AnCora and ADESSE.

Although both solutions are valid, they entail an important difference. Taking sorprender as an example, in the first case the semantic relation of the constituent functioning as SUJ varies with animacy, but the semantic relation that the inanimate SUJ and the CPREP(con) each bear to the verb is taken to be the same. In the second case, by contrast, the animate or inanimate character of the participant brings no change in semantic function, yet the CPREP(con) is assigned a specific semantic function, distinct from that of the inanimate SUJ.

In other words, SenSem's analysis reflects a direct association between referents and semantic relations, regardless of the syntactic function that encodes them, while the analysis adopted by ADESSE and AnCora holds that identity of referents does not imply identity of semantic roles; rather, it is the constructional alternation that brings a change in semantic relations with the verb.
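The contrast between the two strategies can be made concrete with a toy labeller for sorprender. The sketch below is illustrative only: the function names, the representation of a frame as a list of syntactic-function strings and the two lookup tables are assumptions introduced here, with role abbreviations taken from the table for sorprender (SenSem-style labels for the referent-based strategy, ADESSE-style labels for the construction-based one). It is not the annotation procedure of either project.

```python
from typing import Dict, List

# Referent-based strategy (SenSem-style): the subject's role depends on animacy;
# an instrument-like "con" phrase gets the same role as an inanimate subject.
REFERENT_BASED: Dict[str, str] = {"CDIR": "Exp", "CPREP(con)": "Cau"}

# Construction-based strategy (ADESSE-style): one role per syntactic slot,
# whatever the animacy; extra participants get their own specific roles.
CONSTRUCTION_BASED: Dict[str, str] = {
    "SUJ": "Est",
    "CDIR": "Exp",
    "CPREP(con)": "Med",
    "CPREP(de)": "Ref",
}


def label_by_referent(frame: List[str], subject_is_animate: bool) -> Dict[str, str]:
    """Assign roles so that animacy decides between Agt and Cau for the subject."""
    roles: Dict[str, str] = {}
    for function in frame:
        if function == "SUJ":
            roles[function] = "Agt" if subject_is_animate else "Cau"
        elif function in REFERENT_BASED:
            roles[function] = REFERENT_BASED[function]
        # any other constituent is left unlabelled (ø)
    return roles


def label_by_construction(frame: List[str], subject_is_animate: bool) -> Dict[str, str]:
    """Assign roles from the syntactic slot alone; animacy is deliberately ignored."""
    return {f: CONSTRUCTION_BASED[f] for f in frame if f in CONSTRUCTION_BASED}


# (11a) Alguien sorprende a alguien / (11b) Algo sorprende a alguien
print(label_by_referent(["SUJ", "CDIR"], True), label_by_referent(["SUJ", "CDIR"], False))
print(label_by_construction(["SUJ", "CDIR"], True), label_by_construction(["SUJ", "CDIR"], False))
```

Under the first function the transitive frame yields two different role sets depending on the subject's animacy, whereas under the second it always yields the same one and only the added constituent introduces a new role.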


3.3 Borderline and hard-to-assign cases

We noted earlier that the boundaries between roles such as Beneficiary and Possessor are not easy to draw. That, however, was a problem specific to ADESSE, which makes this distinction regularly. There are other borderline cases that pose a problem common to all three annotated corpora.

Such is the case of certain semantic labels used to annotate participants that are not directly involved in the event described but tend to express general and, to some extent, optional meanings. These are semantic roles such as Manera (Manner), Instrumento (Instrument), Finalidad (Purpose) or Estado final (Final state).

Clear evidence of the borderline, ambiguous character of these labels is that they often do not have the same extension in each project.

Thus the same construction with a verb such as cerrar receives different solutions in AnCora (12a), SenSem (12b) and ADESSE (12c). A similar case with the verb conducir is illustrated in (13):

(12) a. El IBEX cierra otro mal mes con una caída acumulada del 6,8 % (Manera)
     b. Con un 25 % de cuota de pantalla (Instrumento), Telecinco cierra su mejor mes
     c. Este año espera cerrar el ejercicio con una facturación de 15 millones (ø)

(13) a. […] transformaciones que conduzcan a disminuir las desigualdades (Estado final)
     b. […] diseñó una planificación que conducía a lograr un estado de forma óptimo (Finalidad)
     c. […] distraer recursos en cuestiones que no conducen de forma inminente a desterrar su endemia (Dirección)

Similar examples may even receive different annotations within the same corpus, as a result of wavering application of some of the roles mentioned. This happens, for example, with the role Manera in AnCora, which may alternate with Instrumento or Estado final, among others:

(14) a. […] limpiándose los dientes con un trozo de abeto (Instrumento)
     b. Un hombre que era capaz de decapitar una rata con los dientes (Manera)
     c. […] anunció que doblaría a cinco dólares (Estado final) el salario mínimo
     d. […] otro planteamiento que dividirá a la empresa en tres compañías (Manera)

To get round this problem, AnCora often opts for applying a single label. This is what happens with the role Beneficiary, used across the board for all the cases of CINDs mentioned above. The obvious drawback is a relative loss of descriptive power, since the analysis stops at a level that is sometimes too shallow. Revealing in this respect are examples such as those in (15), where all the underlined constituents have been treated as Manera in that project:

(15) a. En la reanudación, el marroquí Yunes el Aynaui se impuso finalmente a Ferrero por 6-7, 3-7, 6-4…
     b. […] forzó la tercera y última [manga] al imponerse en el segundo set
     c. No pudo desarrollar el tenis con el que se impuso al croata Goran Ivanisevic

The opposite case is perhaps found in ADESSE, which offers fine granularity in its annotation. The price it pays is having to deal with a larger number of borderline cases. This is so with pairs of semantic roles such as Beneficiario and Poseedor, Finalidad and Rol, Manera and Estado final, Asunto and Ámbito, or Causa and Referencia, among others.

4 Conclusions

Since the well-known work of Gruber (1965) and Fillmore (1968), quite a few authors have expressed scepticism about the very notion of semantic role, at least in the most traditional and reductionist sense of the term. A linguistic corpus, however, contains an enormous variety of examples, which are samples of language in use and need to be described in a practical and simple way. Hence the inventory of semantic roles is a widely accepted method in corpus linguistics.

It must equally be accepted, however, that meaning often resists description in discrete terms and that, as a result, the labelling process is not free of problems. This paper has sought to show some of those problems by comparing three research projects that label Spanish corpora.

As regards the treatment of CINDs, the intrinsic complexity of this function forces a choice between two lines of analysis. AnCora and SenSem apply a general label to most cases, whether (BEN) or (Dest), although SenSem additionally records certain cases of dative of interest at sentence level. ADESSE opts for a more specific analysis and, alongside the labels proper to each semantic class, proposes distinguishing Beneficiaries from Possessors, even though this means facing ambiguous cases that are hard to assign.

As for the diathesis alternations discussed, we have seen that two different annotation routes emerge here too. In the first, adopted in SenSem, the same schema may be annotated with different roles depending on whether the referents are animate or inanimate, which reflects a direct association between referents and semantic relations. In the second, adopted in AnCora and ADESSE, a given schema of the verb receives a single annotation, so that the animacy of the participants becomes secondary. It is the constructional alternation that brings a change in semantic relations with the verb, with a specific label added for the three-participant schemas of the alternation in question.

Finally, the problem of borderline cases again meets with two different strategies. Adopting general labels reduces the number of ambiguous cases but may lead to a shallow analysis. A more exhaustive analysis of the data, by contrast, multiplies the number of ambiguities, with the attendant risk of losing systematicity in the annotation.

The main challenge in the semantic labelling of corpora lies, in fact, in striking a balance between two conditions: ease of application, which translates into internal consistency of the data, and quality of annotation, which translates into finer granularity of analysis. Given the inversely proportional relation between the two factors, SenSem and above all AnCora (because of its computational purpose) seem to favour greater systematicity and internal coherence, while ADESSE, also because of the project's characteristics and aims, pursues greater descriptive power in its treatment of the data.

References

Castellón, I., A. Fernández, G. Vázquez, L. Alonso and J. Capilla. 2006. The SenSem Corpus: a Corpus Annotated at the Syntactic and Semantic Level. Fifth International Conference on Language Resources and Evaluation, pages 355-359.

Cuadros Muñoz, R. 2005. La complementación verbal. Viejos y nuevos enfoques. Language Design, 7:105-136.

Civit, M. and M. A. Martí. 2004. Building Cast3LB: a Spanish Treebank. In Research on Language & Computation 2(4):549-574.

Dowty, D. R. 1979. Word Meaning and Montague Grammar. Reidel, Dordrecht.

Fillmore, Ch. 1968. The Case for Case. In E. Bach and R. T. Harms (eds.). Universals in Linguistic Theory. Holt, Rinehart and Winston, New York, pages 1-88.

García-Miguel, J. M. and F. Albertuz. 2005. Verb, semantic classes and semantic roles in the ADESSE project. In K. Erk, A. Melinger and S. Walde (eds.). Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes. Saarbrücken, pages 50-55.

Gruber, J. S. 1965. Studies in Lexical Relations. Doctoral dissertation. The MIT Press, Cambridge, Massachusetts.

Gutiérrez Ordóñez, S. 1999. Los dativos. In I. Bosque and V. Demonte. Gramática descriptiva de la lengua española. RAE/Espasa Calpe, Madrid, (vol. 2), pages 1855-1930.

Levin, B. and M. Rappaport-Hovav. 1995. Unaccusativity. At the Syntax-Lexical Semantics Interface. The MIT Press, Cambridge, Massachusetts.

Martí, M. A., M. Taulé, M. Bertrán and L. Márquez. 2007. AnCora: Multilingual and Multilevel Annotated Corpora. Draft version. [http://clic.ub.edu/ancora/ancora-corpus.pdf]

Rappaport-Hovav, M. and B. Levin. 1998. Building Verb Meanings. In M. Butt and W. Geuder (eds.). The Projection of Arguments: Lexical and Compositional Factors. CSLI Publications, Stanford, pages 97-134.

Vendler, Z. 1967. Linguistics in Philosophy. Cornell University Press, New York.

Gael Vaamonde


Machine Translation

Reutilización de datos lingüísticos para la creación de un sistema de traducción automática para un nuevo par de lenguas

Re-use of linguistic data to create a machine translation system for a new language pair

Carme Armentano-Oller
DLSI, Universitat d’Alacant
E-03071 [email protected]

Mikel L. Forcada
DLSI, Universitat d’Alacant
E-03071 [email protected]

Resumen: Este trabajo estudia varias formas de reutilizar datos lingüísticos ya desarrollados para obtener rápidamente un sistema de traducción automática para un nuevo par de lenguas. En particular, se ha desarrollado un traductor entre el portugués y el catalán basado en la plataforma Apertium (www.apertium.org), a partir de los datos ya disponibles en esta plataforma para traducir entre portugués y español y entre español y catalán. Los resultados obtenidos indican que una simple composición de dos traductores completos es una buena primera opción, aunque también se muestran otros resultados muy interesantes obtenidos en poco tiempo usando las herramientas que proporciona esta plataforma.

Palabras clave: reutilización datos lingüísticos, traducción automática, Apertium

Abstract: This work examines various ways to re-use pre-existing linguistic data to quickly generate a machine translation system for a new language pair. In particular, a machine translation system between Portuguese and Catalan based on the Apertium platform (www.apertium.org) has been built from data existing in this platform for translating between Portuguese and Spanish and between Spanish and Catalan. The results obtained indicate that a simple composition of two complete translators is an adequate first option, but other very interesting results are shown which have been obtained in a short time using the tools provided in the Apertium platform.

Keywords: re-use of linguistic data, machine translation, Apertium

1. Introduction

Machine translation is becoming increasingly important in our society. In recent decades, systems have appeared for a growing number of language pairs, yet many pairs remain uncovered, especially when one of the languages is a minor language (Forcada, 2006). This work studies several ways of reusing linguistic data already developed for machine translation systems in order to quickly obtain another translator for a new language pair. In particular, the linguistic data already available for translating between two language pairs that share a language B (the pair A–B and the pair B–C) have been used to translate between language A and language C. Specifically, the linguistic data available on the Apertium platform (www.apertium.org) for the Spanish↔Catalan (apertium-es-ca, version 1.0) and Spanish↔Portuguese (apertium-es-pt, version 0.8) translators have been used to build a Portuguese↔Catalan translator (apertium-pt-ca).¹

The Apertium machine translation platform is presented next, together with a description of how it works and an explanation of why it was chosen for this work. Section 3 presents the linguistic data used and explains how they were combined in the different configurations used to build the Portuguese↔Catalan translator. Finally, section 4 presents the results obtained and the conclusions, and points out some of the lines that remain open.

¹ Apertium packages use ISO 639-1 codes to designate language pairs. Thus, ca is used for Catalan, es for Spanish and pt for Portuguese.



2. Apertium

Apertium is an open-source machine translation platform. It includes the translation engine, linguistic data for several language pairs, and tools for developing new pairs. The programs, tools and linguistic data are distributed under the GNU General Public License, version 2.²

2.1. How Apertium works

Apertium follows a classic rule-based transfer translation model (Hutchins and Somers, 1992, sect. 4.2), organized as a series of modules arranged in a cascade. The modules communicate with each other through text. The Apertium translation engine has two levels: in the first level, structural transfer is performed in a single module, whereas in the second level, intended for more distant languages, structural transfer is performed in three stages. The apertium-pt-ca translator was developed with the first-level engine, because the structural proximity of the two languages allowed it and because the two translators used as a starting point are also developed at the first level.

As described in detail in Armentano-Oller et al. (2006), the linguistic modules used by the first version of Apertium are those presented below (A is the source language, Spanish in the examples, and B is the target language, Catalan in the examples):

Morphological analyser: Analyses the text to be translated morphologically, using the information in the corresponding monolingual morphological dictionary (apertium-A-B.A.dix). For each surface form in the text it returns all possible analyses or forms. Thus, for the input “de la nube roja”, the output of this module would be:

^de/de<pr>$^la/el<det><def><f><sg>/lo<prn><pro><p3><f><sg>$^nube/nube<n><f><sg>$^roja/rojo<adj><f><sg>$

² http://www.gnu.org/licenses/old-licenses/gpl-2.0.html

Part-of-speech tagger (lexical category disambiguator): Using hidden Markov models, chooses the most likely analysis for each ambiguous word according to its context. Continuing with the example, the output of the tagger would be:

^de<pr>$^el<det><def><f><sg>$^nube<n><f><sg>$^rojo<adj><f><sg>$

Lexical transfer module: Using the information in the bilingual dictionary apertium-A-B.A-B.dix, reads each lexical form and delivers its equivalent in the target language. In the example, these equivalences would be:

de<pr> → de<pr>
el<det> → el<det>
nube<n><f> → núvol<n><m>
rojo<adj> → vermell<adj>

Structural transfer module: Handles the structural changes between the two languages (gender and number changes, reorderings, etc.). The file apertium-A-B.trules-A-B.xml contains the necessary rules. Each rule detects a pattern (a sequence of lexical forms), applies the changes to it and generates an output for it. Patterns are usually defined as sequences of grammatical categories, but specific lemmas can also be detected. Specific lemmas can likewise be generated in the output, independently of the bilingual dictionary. In the example, the gender of the noun has changed in the translation into Catalan, so the gender of the elements that agree with it must also be changed by applying a rule with the pattern det–n–adj. The result is:

^de<pr>$^el<det><def><m><sg>$^núvol<n><m><sg>$^vermell<adj><m><sg>$

Morphological generator: From the target-language lexical forms, generates the corresponding inflected surface forms, using the morphological dictionary apertium-A-B.B.dix. Continuing with the same example, the output of this module would be: ~de ~el núvol vermell.³

Post-generator: Applies some orthographic operations of the target language, such as apostrophation and contractions. For each language there is a file (apertium-A-B.post-A.dix) with the rules that this module must apply. In the example, the preposition de and the determiner el must be contracted. Thus, the output of this module would be: del núvol vermell
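The cascade of modules described above can be illustrated with a small Python sketch that reproduces the running example. This is not Apertium code: the dictionaries below are toy tables covering only the four words of the example, the tagger is replaced by a stand-in that keeps the first analysis instead of using a hidden Markov model, and all function names are hypothetical.

```python
import re

# Toy data covering only the running example (illustrative assumption,
# not real Apertium dictionary content).
ANALYSES = {
    "de": ["de<pr>"],
    "la": ["el<det><def><f><sg>", "lo<prn><pro><p3><f><sg>"],
    "nube": ["nube<n><f><sg>"],
    "roja": ["rojo<adj><f><sg>"],
}
# (source lemma, category) -> (target lemma, new gender or None)
BILINGUAL = {
    ("de", "pr"): ("de", None),
    ("el", "det"): ("el", None),
    ("nube", "n"): ("núvol", "m"),
    ("rojo", "adj"): ("vermell", None),
}
GENERATION = {
    "de<pr>": "~de",
    "el<det><def><m><sg>": "~el",
    "núvol<n><m><sg>": "núvol",
    "vermell<adj><m><sg>": "vermell",
}
GENDERS = ("m", "f")


def parse(form):
    """Split 'lemma<tag1><tag2>' into (lemma, [tag1, tag2])."""
    return form.split("<", 1)[0], re.findall(r"<([^>]+)>", form)


def render(lemma, tags):
    return lemma + "".join(f"<{t}>" for t in tags)


def analyse(text):
    return [ANALYSES[word] for word in text.split()]


def disambiguate(analyses):
    # Stand-in for the HMM tagger: keep the first analysis of each word.
    return [parse(options[0]) for options in analyses]


def lexical_transfer(units):
    out = []
    for lemma, tags in units:
        target, gender = BILINGUAL[(lemma, tags[0])]
        if gender:
            tags = [gender if t in GENDERS else t for t in tags]
        out.append((target, tags))
    return out


def structural_transfer(units):
    # One det-n-adj rule: copy the noun's gender to determiner and adjective.
    units = [(lemma, list(tags)) for lemma, tags in units]
    for i in range(len(units) - 2):
        if [units[i + k][1][0] for k in range(3)] == ["det", "n", "adj"]:
            gender = next(t for t in units[i + 1][1] if t in GENDERS)
            for k in (0, 2):
                lemma, tags = units[i + k]
                units[i + k] = (lemma, [gender if t in GENDERS else t for t in tags])
    return units


def generate(units):
    return " ".join(GENERATION[render(lemma, tags)] for lemma, tags in units)


def postgenerate(text):
    # Contract "de" + "el", then drop any remaining "~" marks.
    return text.replace("~de ~el", "del").replace("~", "")


def translate(text):
    units = structural_transfer(lexical_transfer(disambiguate(analyse(text))))
    return postgenerate(generate(units))


print(translate("de la nube roja"))  # del núvol vermell
```

Each function consumes the output of the previous one, mirroring the way the real modules are chained through text.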

2.2. Why Apertium?

Apertium was chosen to study data reuse because its licence legally permits reuse without having to request explicit permission, and because the engine and the linguistic data are independent, which makes linguistic development work easier.

3. Reusing data

The linguistic data of the two starting translators were combined in different ways in order to study the advantages and drawbacks of each possible configuration (see 3.3). Linguistic data also had to be modified, and some new data created, both automatically (see 3.1) and by hand.

Translation quality was evaluated on 10,000-word texts of news stories published on the Internet,⁴ using the program apertium-eval-translator.⁵

3.1. The apertium-dixtools program

The apertium-dixtools program⁶ performs various operations on Apertium dictionaries, one of which is dictionary crossing: from the dictionaries of two translators apertium-A-B and apertium-B-C it generates the following dictionaries:

apertium-A-C.A-C-crossed.dix: Bilingual A–C dictionary containing the lemmas common to the two starting translators.

apertium-A-C.A-crossed.dix and apertium-A-C.C-crossed.dix: Monolingual dictionaries for languages A and C that are consistent with the generated bilingual dictionary, that is, containing only the lemmas found in the A–C dictionary.

consistent-bil-A-B.dix and consistent-bil-B-C.dix: Bilingual A–B and B–C dictionaries containing only the lemmas common to the two starting translators.

³ The sign “~” marks words that may require changes in the next stage (post-generator).

⁴ Texts available at http://xixona.dlsi.ua.es/~carmentano/avaluacions.html

⁵ http://www.apertium.org/?id=apertium-eval-translator

⁶ Written in Java by Enrique Benimeli ([email protected]); see http://xixona.dlsi.ua.es/wiki/Apertium-dixtools

The apertium-dixtools program is a very useful tool for crossing dictionaries, but crossing still has some problems:

Loss of coverage: The generated dictionaries contain only the entries common to the two starting translators (same lemma with the same grammatical category), so they will inevitably be smaller. Moreover, the dictionaries of the starting translators were not always developed following the same linguistic criteria, so some lemmas are present in both translators but are assigned different grammatical categories (for example, noun in one dictionary and adjective in the other); in these cases the lemmas are not crossed.

Loss of multiword units: Apertium dictionaries also contain multiword lexical units that must not be translated word for word, such as echar de menos. Some of these expressions are in the dictionaries of one translator but not in those of the other, either because they have not been entered or because they are unnecessary for the other language pair. In both cases, the expression will not be found in the crossed dictionaries generated by apertium-dixtools.


Consistency errors: For the translator to work correctly the dictionaries must be consistent, that is, every form found in each of the morphological dictionaries must have a translation in the bilingual dictionary and an equivalent in the other morphological dictionary. The dictionaries generated by apertium-dixtools contain two kinds of consistency errors: those produced by the program and those caused by the different linguistic criteria followed when developing the morphological dictionaries of the starting translators (for example, the dictionaries of apertium-es-ca define the diminutives of adjectives, whereas those of apertium-es-pt do not).

Loss of dictionary sections: In the morphological dictionaries, lemmas are placed in different sections according to how their surface forms must be segmented in texts. Currently, apertium-dixtools does not cross the entries found in the section called postblank, which contains entries whose surface forms are not followed by a blank (in Catalan, apostrophized words such as d’ or s’) but which must be separated from the following word by a blank for later processing.
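The idea behind dictionary crossing, and the loss of coverage it causes, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: dictionaries are plain mappings between (lemma, category) pairs, the entries are invented for the example, and the function cross is hypothetical and does not reproduce how apertium-dixtools works internally.

```python
def cross(bil_ab, bil_bc):
    """Cross two bilingual dictionaries on the shared language B.

    Each dictionary maps a (lemma, category) pair to a (lemma, category)
    pair. An A entry survives only if its B equivalent, with the same
    category, is also an entry of the B-C dictionary.
    """
    crossed = {}
    for a_entry, b_entry in bil_ab.items():
        if b_entry in bil_bc:
            crossed[a_entry] = bil_bc[b_entry]
    return crossed


# Invented toy entries (Portuguese-Spanish and Spanish-Catalan).
pt_es = {
    ("nuvem", "n"): ("nube", "n"),
    ("vermelho", "adj"): ("rojo", "adj"),
    ("jovem", "n"): ("joven", "n"),
    ("casa", "n"): ("casa", "n"),
}
es_ca = {
    ("nube", "n"): ("núvol", "n"),
    ("rojo", "adj"): ("vermell", "adj"),
    ("joven", "adj"): ("jove", "adj"),  # same lemma, different category
}

pt_ca = cross(pt_es, es_ca)
lost = sorted(set(pt_es) - set(pt_ca))

print(pt_ca)
print(lost)  # entries of the first dictionary that could not be crossed
```

In this toy data, casa is lost because it is missing from the second dictionary, and jovem is lost because the shared Spanish lemma has a different category in each dictionary, which are the two causes of coverage loss described above.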

3.2. Data used

The linguistic data used are described below:

Dictionaries: Three types of dictionaries were used:

• dictionaries taken from the starting translators

• the dictionaries consistent-bil-pt-es.dix and consistent-bil-es-ca.dix, generated automatically with apertium-dixtools

• the dictionaries apertium-pt-ca.ca.dix, apertium-pt-ca.pt-ca.dix and apertium-pt-ca.pt.dix, corrected by hand from the dictionaries generated by apertium-dixtools

Disambiguation probabilities: Copied directly from the starting translators.

Structural transfer rules: Three types of structural transfer rule files were used:

• files copied from the starting translators

• the files apertium-es-ca.trules-es-ca-m.xml, apertium-es-ca.trules-ca-es-m.xml, apertium-es-pt.trules-es-pt-m.xml and apertium-es-pt.trules-pt-es-m.xml, modified by hand from those of one of the starting translators (see 3.3.3 and 3.3.4)

• the rule files apertium-pt-ca.trules-pt-ca.xml and apertium-pt-ca.trules-ca-pt.xml, written by hand (see 3.3.5)

Post-generation dictionaries: Copied directly from the starting translators.

3.3. Configurations studied

The linguistic data of the starting translators were combined in the ways described below.

3.3.1. Composition of two complete translators

A text is translated from Portuguese into Spanish with the apertium-pt-es translator, and the result is translated into Catalan with apertium-es-ca, or the other way round to translate from Catalan into Portuguese.

3.3.2. Composition of the transfer modules of two translators

Two translators were also composed by composing their transfer modules. To translate from Portuguese into Catalan, the text was:

1. analysed with the morphological dictionary apertium-pt-ca.pt.dix

2. disambiguated with the probabilities of the apertium-es-pt translator

3. transferred into Spanish with the transfer module apertium-es-pt.trules-pt-es.xml and the bilingual dictionary consistent-bil-pt-es.dix, and then transferred from Spanish into Catalan with the transfer module apertium-es-ca.trules-es-ca.xml and the bilingual dictionary consistent-bil-es-ca.dix

4. generated in Catalan with the morphological dictionary apertium-pt-ca.ca.dix

5. post-processed with the post-generation rules of the dictionary apertium-es-ca.post-ca.dix

To translate from Catalan into Portuguese, the equivalent steps were applied in the opposite direction.
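Both configurations are compositions: 3.3.1 chains two complete translators, while 3.3.2 chains only the two transfer stages inside a single pipeline. The following Python sketch illustrates the first case with toy word-for-word translators. The lookup tables, the "*" mark for unknown words and the helper names are assumptions made for the illustration, not a description of the Apertium engine.

```python
from functools import reduce


def compose(*stages):
    """Chain text-to-text stages from left to right."""
    return lambda text: reduce(lambda acc, stage: stage(acc), stages, text)


def word_translator(table):
    """Toy stand-in for a complete translator: word-for-word lookup.

    Unknown words are passed through marked with '*'; words that already
    carry the mark are left untouched.
    """
    def translate(text):
        out = []
        for word in text.split():
            if word in table:
                out.append(table[word])
            elif word.startswith("*"):
                out.append(word)
            else:
                out.append("*" + word)
        return " ".join(out)
    return translate


# Invented toy tables for Portuguese-Spanish and Spanish-Catalan.
pt_es = word_translator({"a": "la", "nuvem": "nube", "vermelha": "roja"})
es_ca = word_translator({"la": "el", "nube": "núvol", "roja": "vermell"})

pt_ca = compose(pt_es, es_ca)

print(pt_ca("a nuvem vermelha"))  # el núvol vermell
print(pt_ca("a nuvem cinzenta"))  # el núvol *cinzenta
```

The second call shows that a word missing from either translator remains untranslated in the final output of the composition.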

3.3.3. Using the transfer module of the first translator

The results of translating with the structural transfer rules of only one of the two starting translators were also evaluated. First, the rules used to translate from the source language into Spanish were tried. To do so, before applying the rules, the output lemmas of some of them had to be changed so that they are lemmas of the new target language. The *-m.xml rule files are the result of this modification.

Translation from Portuguese into Catalan was then carried out as in the previous configuration, except for step 3: here, transfer was performed directly with the rules in apertium-es-pt.trules-pt-es-m.xml and the bilingual dictionary apertium-pt-ca.pt-ca.dix.

To translate from Catalan into Portuguese, the equivalent operations were carried out in the opposite direction.

3.3.4. Using the transfer module of the second translator

Translation was also tried with the structural transfer rules from Spanish into the target language. In this case, before applying the rules, the lemmas of the patterns to be detected had to be changed in some cases so that they are lemmas of the new source language. These modified files are also named *-m.xml.

Thus, to translate from Portuguese into Catalan the same steps, with the same linguistic data, were applied as in the previous configuration, but replacing the structural transfer rule file with apertium-es-ca.trules-es-ca-m.xml. To translate from Catalan into Portuguese, the equivalent process was carried out in the opposite direction.

3.3.5. Using a hand-written transfer module

Finally, two structural transfer rule files were written by hand, one for each translation direction, combining the rules of the modules of the starting translators. This produced a complete apertium-pt-ca translator.⁷

When the structural transfer rules of two translators are combined, five different situations arise:

Rules that are identical in the starting files: Some rules are the same in both starting translators. In that case, the rule can be copied directly into the new transfer rule file.

Rules that are present in only one of the two files: In this situation the rule can also be copied, but if it is lexicalized the lemmas must be changed. This was the case of the rule that, when translating from Catalan into Spanish, changes the preposition a to en before a place name. By contrast, there is no rule dealing with this situation when translating from Spanish into Portuguese, since Portuguese uses em, equivalent to the Spanish preposition en.

Rules that detect the same pattern but apply different actions: In these cases a new rule must be created that combines the actions of the two previous ones. An example is the verb–enclitic pronoun pattern. On the one hand, the Catalan pronouns en and hi have no translation into Spanish (or Portuguese), so apertium-es-ca.trules-ca-es.xml checks, before sending the pronoun, whether it can be translated. On the other hand, Portuguese pronouns can occur in proclitic, mesoclitic or enclitic position, so the rule with this pattern in apertium-es-pt.trules-es-pt.xml checks in which position the pronoun must go. The new combined rule therefore has to check both whether the pronoun has a translation and in which position it must go.

Rules that detect different, overlapping sequences of lexical categories: The rules in the two starting files often detect sequences of lexical categories that overlap or include one another. When combining the rules to translate from Portuguese into Catalan, this is the case for all rules containing verbs, since when translating from Spanish into Catalan the preterite (pretérito perfecto) is generated in its periphrastic form (canté → vaig cantar).

⁷ See 2.1
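A first pass over two rule files could sort the rules into the situations listed above before any manual work. The Python sketch below shows one possible way to do this. It assumes a much simplified representation in which each rule is a pattern (a tuple of category names) mapped to a label describing its action; the patterns, labels and the classify function are all hypothetical and unrelated to the actual XML rule format.

```python
def contains(longer, shorter):
    """True if 'shorter' occurs as a contiguous subsequence of 'longer'."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def classify(rules_1, rules_2):
    """Sort the patterns of two rule sets into four situations."""
    report = {"identical": [], "only_one": [], "conflict": [], "overlap": []}
    for pattern in sorted(set(rules_1) | set(rules_2)):
        in_both = pattern in rules_1 and pattern in rules_2
        if in_both and rules_1[pattern] == rules_2[pattern]:
            report["identical"].append(pattern)   # copy as is
        elif in_both:
            report["conflict"].append(pattern)    # actions must be merged
        else:
            other = rules_2 if pattern in rules_1 else rules_1
            if any(contains(p, pattern) or contains(pattern, p) for p in other):
                report["overlap"].append(pattern)   # needs manual review
            else:
                report["only_one"].append(pattern)  # copy, adapting lemmas
    return report


# Invented rule sets: pattern -> action label.
rules_1 = {
    ("det", "n", "adj"): "agree-gender-number",
    ("vblex", "prn"): "choose-clitic-position",
    ("pr", "np"): "change-preposition",
}
rules_2 = {
    ("det", "n", "adj"): "agree-gender-number",
    ("vblex", "prn"): "drop-untranslatable-pronoun",
    ("vblex",): "periphrastic-past",
}

report = classify(rules_1, rules_2)
for situation, patterns in report.items():
    print(situation, patterns)
```

Such a classification only points out where the rules interact; writing the combined rule for the conflicting and overlapping patterns remains manual work.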

3.3.6. Manual improvements to the translator

Starting from the first complete apertium-pt-ca translator, the following improvements were made:

Changes to the dictionaries:

• Creation of the postblank sections of the dictionaries (see 3.1).

• Addition of the multiword expressions from the starting dictionaries.

• Revision of the equivalences in the bilingual dictionary based on observation of the translations produced by the translator: apertium-dixtools crosses dictionaries on lemmas and their grammatical category. This can cause errors for words with homograph lemmas or polysemous words, since the same lemma may appear in each translator with a different meaning.

• Addition to the dictionaries of the 370 most frequent words unknown to the translator in a 10,000-word corpus.

Changes to the structural transfer rule files:

• Change in the translation of the Portuguese structure “ir + infinitive”: this structure is translated into Spanish as “ir + a + infinitive” (for example, vou cantar → voy a cantar), whereas the Spanish↔Catalan translator does not handle it. The necessary changes were made so that it is translated as a verb in the future indicative (for example, vou cantar → cantaré).

• Use of the determiner before percentages: Catalan and Spanish use a determiner before percentage expressions (for example, el aumento fue del 3 %), whereas Portuguese does not (o aumento foi de 3 %). The number of rules dealing with this difference was increased.

Method                                      WER       % unknown
Composition of two translators (3.3.1)      10.99%    10.51%
Composition of transfer modules (3.3.2)     26.55%    13.05%
First transfer module (3.3.3)               26.24%    12.77%
Second transfer module (3.3.4)              28.33%    12.60%
New transfer module (3.3.5)                 25.66%    12.81%
Improvements (3.3.6)                        14.10%    10.19%

Table 1: Results obtained when translating news texts from Catalan into Portuguese (see section 3)

4. Results, conclusions and open lines

Tables 1 and 2 show the results obtained with each of the configurations described, expressed in terms of word error rate (WER) and of the percentage of words unknown to the translator.
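As a reminder of what these two measures capture, the sketch below computes them in Python using a common definition of WER: the word-level edit distance between the system output and a reference translation, divided by the length of the reference. The figures in the tables were obtained with apertium-eval-translator; this code is only an independent illustration, it assumes a non-empty reference, and the "*" mark used to spot unknown words is an assumption of the example.

```python
def word_error_rate(hypothesis, reference):
    """Word-level edit distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j]: edit distance between the hypothesis words read so far
    # and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # extra word in the hypothesis
                current[j - 1] + 1,          # word missing from the hypothesis
                previous[j - 1] + (h != r),  # substitution (or match)
            ))
        previous = current
    return previous[-1] / len(ref)


def unknown_rate(output, marker="*"):
    """Fraction of output words carrying the unknown-word marker."""
    words = output.split()
    return sum(word.startswith(marker) for word in words) / len(words)


print(word_error_rate("de el núvol vermell", "del núvol vermell"))
print(unknown_rate("el núvol *cinzenta"))
```

In the first call, one substitution and one extra word give two errors against a three-word reference.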

Method                                      WER       % unknown
Composition of two translators (3.3.1)      15.17%    8.74%
Composition of transfer modules (3.3.2)     19.28%    10.37%
First transfer module (3.3.3)               23.56%    10.49%
Second transfer module (3.3.4)              19.33%    10.28%
New transfer module (3.3.5)                 18.52%    10.24%
Improvements (3.3.6)                        16.67%    9.06%

Table 2: Results obtained when translating news texts from Portuguese into Catalan (see section 3)

4.1. Conclusions

From the results presented, the following conclusions can be drawn:

The results of applying two complete translators in succession are quite good, which indicates that this is a good configuration to use when no translator is available between the two languages concerned.

The dictionary-crossing function of apertium-dixtools is a very useful tool for crossing the dictionaries of two Apertium translators but, as the percentage of unknown words in tables 1 and 2 shows, lexical coverage is lost. Therefore, anyone using apertium-dixtools to develop a new translator should bear in mind that the resulting dictionaries are only a base that will require revision and improvement.

Crossing two Apertium translators makes it possible to develop a new translator quickly and with little effort, but translation quality is initially lower. It is therefore worthwhile only if specific improvements for the new language pair are to be added afterwards.

The improvements to be applied to the new translator should include increasing lexical coverage and adding multiword expressions.

The results of translating with one of the structural transfer modules of the starting translators show that, since Romance languages have few structural differences between them, these modules can easily be adapted to translate between other language pairs.

4.2. Open lines

Apertium provides tools and linguistic data that can be reused to develop, quickly and with little effort, the data needed to translate between a new language pair. However, new tools can still be created, and existing ones improved, to make data reuse easier, for example:

Improving the dictionary-crossing function of apertium-dixtools.

Creating a system for semi-automatic crossing of structural transfer rules.

Adapting the tools to make it easier to exchange entries between the different Apertium dictionaries.

In addition, work is under way on programs to ease the exchange of linguistic data between Apertium and other natural language processing systems: Freeling⁸ (Atserias et al., 2006), jspell⁹ (Almeida and Pinto, 1995) and OLIF (Open Lexical Interchange Format¹⁰).

⁸ http://www.lsi.upc.edu/~nlp/freeling

⁹ http://natura.di.uminho.pt/natura/natura?&topic=jspell

¹⁰ http://www.olif.net/

References

Almeida, J.J. and Ulisses Pinto. 1995. Jspell – um módulo para análise léxica genérica de linguagem natural. In Actas do X Encontro da Associação Portuguesa de Linguística, pages 1–15, Évora 1994.

Armentano-Oller, Carme, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.d.G.V. Nunes, N.J. Mamede, C. Oliveira and M.C. Dias, editors, Computational Processing of the Portuguese Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006, volume 3960 of Lecture Notes in Computer Science. Springer-Verlag, May, pages 50–59.

Atserias, Jordi, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró and Muntsa Padró. 2006. Freeling 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the fifth international conference on Language Resources and Evaluation (LREC 2006), pages 48–55. ELRA. Genoa, Italy, May 2006.

Forcada, Mikel L. 2006. Open-source machine translation: an opportunity for minor languages. In Strategies for developing machine translation for minority languages (5th SALTMIL workshop on minority languages). Organized in conjunction with LREC 2006.

Hutchins, W.J. and H.L. Somers. 1992. An Introduction to Machine Translation. Academic Press, London.



Aplicación de métodos estadísticos para la traducción de voz a Lengua de Signos

Using statistical methods for translating speech into Sign Language

B. Gallo, R. San-Segundo, J.M. Lucas, R. Barra, L.F. D’Haro, F. Fernández
Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.
ETSIT. Ciudad Universitaria SN 28040. Madrid. Spain. [email protected]

Resumen: Este artículo presenta un conjunto de experimentos para la realización de un sistema de traducción estadística de voz a lengua de signos para personas sordas. El sistema contiene un primer módulo de reconocimiento de voz, un segundo módulo de traducción estadística de palabras en castellano a signos en Lengua de Signos Española, y un tercer módulo que realiza el signado de los signos mediante un agente animado. La traducción se hace utilizando dos alternativas tecnológicas: la primera basada en modelos de subsecuencias de palabras y la segunda basada en transductores de estados finitos. De todos los experimentos, se obtienen los mejores resultados con el modelo que realiza la traducción mediante transductores de estados finitos con unas tasas de error de 26,06% para las frases de referencia, de 33,42% para la salida del reconocedor. Palabras clave: Traducción Automática Estadística, Lengua de Signos, subfrase, Transductor de Estados Finitos, Modelo de Lenguaje, Modelo de Traducción, alineamiento, tasa de errores de palabras.

Abstract: This paper presents a set of experiments used to develop a statistical system for translating speech into sign language for deaf people. This system is composed of an Automatic Speech Recognition (ASR) system, followed by a statistical translation module and an animated agent that represents the different signs. Two different approaches have been used to perform the translations: a phrase-based system and a finite state transducer. The best results were obtained with the finite state transducer, with a word error rate of 26.06% for the reference text, and 33.42% using the ASR output.

Keywords: Statistical Machine Translation, Sign Language, phrase, Finite State Transducer, Language Model, Translation Model, alignment, word error rate.

1 Introduction

This work aims to develop and evaluate a translation platform capable of transforming, on the basis of a set of probabilistic models, sentences from one language into another, specifically from Spanish into Spanish Sign Language (Lengua de Signos Española, LSE). The importance of this platform lies in the growing need for a tool that allows fast and reasonably accurate translation between languages. The cost of a signing interpreter (one who knows Sign Language) is very high. It should be borne in mind that the ability of prelingually deaf people (those who became deaf before being able to speak) to understand Spanish is much lower than that of hearing people. Their reading and writing ability in Spanish is therefore much lower than their ability in LSE, since they cannot extract the semantic information of every word or construction, or cannot form a mental image of what is being communicated to them. The aim is therefore to develop software that translates sets of Spanish sentences into a sequence of LSE signs, which an avatar (visual agent) will represent.

2 State of the art

Numerous research projects have focused on spoken language translation, for example C-Star, ATR, Verbmobil, Eutrans, LC-Star, PF-Star and TC-Star. With the exception of TC-Star, these projects deal with the translation of medium-sized vocabularies (about 10,000 words) in restricted application domains. The translation systems that give the best results are those based on statistical solutions (Och and Ney, 2002), such as the one studied in this paper, including example-based techniques (Sumita et al., 2003), finite-state transducers (FST) (Casacuberta and Vidal, 2002) and phrase-based techniques (Koehn et al., 2003). The important advances achieved in spoken language translation are due to the appearance of error measures (Papineni et al., 2002), more efficient training algorithms (Och and Ney, 2003), the development of context-dependent models (Sumita et al., 2003) and efficient generation algorithms (Koehn et al., 2003).

In recent years, several research groups have shown interest in systems for translating speech into Sign Language and have developed several prototypes: example-based (Morrissey, 2005), rule-based (San-Segundo, 2006; Lynette, 2003), based on complete sentences (Cox, 2002) or on statistical methods (Bungeroth, 2002), such as IBM's SiSi (Say It Sign It) system. This paper presents the evaluation of statistical methods for translating into LSE the explanations that a police officer gives to a person who wants to renew the DNI (Documento Nacional de Identidad, the Spanish national identity card).

3 Arquitectura del Sistema El sistema completo está formado por tres módulos: el reconocedor de voz, el traductor estadístico y finalmente, la representación por un agente animado de los signos obtenidos:

[Figure: block diagram. Speech → Speech Recognition (Acoustic Models, Language Model) → Words → Statistical Translation (Translation Model, Language Model) → Signs → Animation (Sign Descriptions)]

Figure 1: Complete system architecture

3.1 Speech recognition

This module converts natural (continuous) speech into a sequence of words, independently of the speaker. Using previously trained acoustic and language models, it analyses the speech signal and outputs the recognized word sequence.

3.2 Statistical translation

Statistical translation uses a dynamic search algorithm and a statistical model to find the best sign sequence for the word sequence produced by the speech recognizer. The model mainly combines two kinds of probabilities:

• Translation probability: captures which words are translated by which signs.

• Sign sequence probability: captures which sign sequences are more likely in LSE.

This step translates the words coming from the recognizer into the corresponding signs, here in Spanish Sign Language (Lengua de Signos Española, LSE). The statistical models are learned from a parallel corpus made up of Spanish text documents and their sign language equivalents. The text documents contain Spanish words, whereas the LSE documents contain glosses. Glosses are words (written in capital letters) that represent signs. For example, the gloss FOTO represents the sign meaning "photograph".

3.3 Sign representation

The last module is the 3D animated agent that represents the signs produced by the statistical translation. The agent used is "VGuido", from the eSIGN project (http://www.sign-lang.unihamburg.de/eSIGN). It is embedded in the system as an ActiveX control. Each gloss (the representation of a sign) is associated with an XML text file that describes in detail the movements the avatar must perform to represent that sign. To represent several consecutive glosses, the system reads the corresponding XML files in turn and sends the avatar the instructions needed to perform the movements.

4 Statistical translation based on phrase models

Statistical translation based on models of word subsequences (phrases) consists of obtaining a Translation Model by aligning a parallel corpus and extracting phrases from it, and of building a language model of the target language. The translation module (Moses) uses these models to obtain the sequence of signs/glosses for a given input sentence. The complete architecture of this translation system is the following:

B. Gallo, R. San-Segundo, J.M. Lucas, R. Barra, L.F. D’Haro y F. Fernández

[Figure: block diagram. Parallel Corpus → Word Alignment (GIZA++) → Phrase Model (Phrase-model) → Translation Model; Target Language Corpus → N-gram Training (SRI-LM) → Target Language Model; Source Language Corpus → Translation (MOSES) → Evaluation against the Target Language Corpus]

Figure 2: Architecture of phrase-based translation

4.1 Model generation

First, the Language Model (of the target language) and the Translation Model (from a parallel corpus in the source and target languages) must be created. The ideas behind statistical machine translation come from information theory. Essentially, the translation problem is to find the probability p(d|o) that a string o in the source language generates a string d in the target language. These probabilities are computed with parameter estimation techniques from the parallel corpus.

Applying Bayes' theorem to p(d|o), this probability can be expressed through the product p(o|d)·p(d), up to the factor 1/p(o), which does not depend on d. The Translation Model p(o|d) is the probability of the source string given the target string, and the Language Model p(d) is the probability of observing the target string. Mathematically, the best translation is found by choosing the sign sequence that maximizes this probability:

$$\tilde{d} = \arg\max_{d \in d^{*}} p(d \mid o) = \arg\max_{d \in d^{*}} p(o \mid d) \cdot p(d) \qquad (1)$$
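To make equation (1) concrete, the following is a minimal Python sketch of the decision rule only. The probability tables, the example sentence and the function name `best_translation` are hypothetical illustrations, not values or code from the system described here; a real decoder such as Moses searches a much larger space instead of enumerating candidates.

```python
import math

# Hypothetical toy tables (assumed values, for illustration only).
# translation_model[(o, d)] = log p(o | d); language_model[d] = log p(d)
translation_model = {
    ("necesita una foto", "NECESITAR FOTO"): math.log(0.6),
    ("necesita una foto", "FOTO NECESITAR"): math.log(0.4),
}
language_model = {
    "NECESITAR FOTO": math.log(0.1),
    "FOTO NECESITAR": math.log(0.5),
}


def best_translation(source, candidates):
    """Return the candidate d that maximizes p(o|d) * p(d), in log space."""
    def score(d):
        tm = translation_model.get((source, d), float("-inf"))
        lm = language_model.get(d, float("-inf"))
        return tm + lm

    return max(candidates, key=score)


result = best_translation("necesita una foto",
                          ["NECESITAR FOTO", "FOTO NECESITAR"])
print(result)
```

In this toy setting the translation model alone would prefer the first candidate, but the language model shifts the decision to the second one, which is the effect the product in equation (1) is meant to capture.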

The Language Model is built with SRILM (Stolcke, 2002), a toolkit that estimates N-gram language models from a training corpus and evaluates them by computing the probability of a test corpus. These models are widely used in many fields, such as speech recognition and OCR (Optical Character Recognition). The Translation Models are generated following a phrase-based approach. The tool used for this is GIZA++, an implementation of the IBM translation models (Och and Ney, 2000) that can train these models for any language pair (http://www.statmt.org/moses). Training requires a collection of translated texts, the parallel corpus. The steps to generate the models are:

1. Word alignment: given the two texts, in Spanish and in LSE, the words of one are matched to the corresponding glosses of the other. This is done with GIZA++. The alignment is computed in both directions, words to glosses and glosses to words. An example of an alignment is the following:

Figure 3: Example of an alignment between Spanish words and LSE signs represented by glosses.

2. Lexical translation table: from the alignment, the most likely lexical translation table is estimated, giving the values of w(d|o) and its inverse w(o|d) for every pair of words, that is, the translation probabilities for all word pairs. An example for the word "por" with the text used is:

por PRIMER 0.5000000
por POR 0.3333333
…

3. Phrase extraction: all phrase pairs that are consistent with the alignment are collected. The file generated in this step has the following form, where the phrase "a los siguientes países" is translated by the gloss sequence "ESTOS PLURAL PAÍS":

a los siguientes países ||| ESTOS PLURAL PAÍS ||| 0-0 2-0 1-1 3-2

a los siguientes ||| ESTOS PLURAL ||| 0-0 2-0 1-1

4. Phrase scoring: the translation probabilities are computed for all phrase pairs in both directions, Spanish phrase to LSE gloss sequence and LSE gloss sequence to Spanish phrase. An example of the resulting file is:

a los siguientes países ||| ESTOS PLURAL PAÍS ||| (0) (1) (0) (2) ||| (0,2) (1) (3) ||| 1 0.0283293

a los siguientes ||| ESTOS PLURAL ||| (0) (1) (0) ||| (0,2) (1) ||| 1 0.0661018
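Step 3 above can be sketched in a few lines of Python. This is a simplified illustration of the usual consistency criterion (no gloss inside the target span may be linked to a word outside the source span), not the Moses implementation: the function name `extract_phrase_pairs` and the `max_len` parameter are assumptions, and the extension of phrases over unaligned words is omitted.

```python
def extract_phrase_pairs(source, target, alignment, max_len=7):
    """Collect phrase pairs consistent with a word alignment.

    source, target: lists of tokens.
    alignment: set of (source_index, target_index) links, 0-based.
    """
    pairs = []
    for start in range(len(source)):
        for end in range(start, min(start + max_len, len(source))):
            # Target positions linked to the source span [start, end].
            linked = [t for (s, t) in alignment if start <= s <= end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistent if every link into the target span
            # comes from inside the source span.
            consistent = all(start <= s <= end
                             for (s, t) in alignment
                             if t_start <= t <= t_end)
            if consistent:
                pairs.append((" ".join(source[start:end + 1]),
                              " ".join(target[t_start:t_end + 1])))
    return pairs


source = "a los siguientes países".split()
target = "ESTOS PLURAL PAÍS".split()
alignment = {(0, 0), (2, 0), (1, 1), (3, 2)}  # the links 0-0 2-0 1-1 3-2

for src, tgt in extract_phrase_pairs(source, target, alignment):
    print(src, "|||", tgt)
```

With the alignment from the example, the sketch yields the two phrase pairs listed above plus the single-word pairs "los ||| PLURAL" and "países ||| PAÍS".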

4.2 Tuning

To translate, the models generated in the training phase must be combined. This is done through a linear combination of probabilities whose weights have to be tuned. Tuning consists of running the Moses translator on a set of sentences (the validation set) and, since the correct translation is known, evaluating the output of the automatic translator for different values of the weights. Candidate values are drawn at random, and after this random search the values that gave the best results are kept.

4.3 Translation

The system is evaluated on a new set of sentences (the test set). Both tuning and evaluation use the Moses translator with the models obtained earlier (translation model and target language model), combined according to the tuned weights. Moses (http://www.statmt.org/moses) is a phrase-based statistical machine translation system that implements a search algorithm to obtain, from an input sentence, the sign sequence most likely to be its translation. It can work with word confusion networks such as those produced by many speech recognition systems. It also allows several translation models to be integrated, trained on the different factors with which the words of the sentences can be tagged.
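The random weight search described in Section 4.2 can be illustrated with the following Python sketch. It is only a toy under stated assumptions: the validation data, the two-feature scoring (translation model and language model log-probabilities), the error counts and the function names are all hypothetical, and it does not reproduce the tuning scripts distributed with Moses.

```python
import random

# Hypothetical validation set: for each sentence, a list of candidate
# translations given as (tm_logprob, lm_logprob, errors_vs_reference).
validation = [
    [(-1.0, -4.0, 2), (-2.0, -1.0, 0)],
    [(-0.5, -3.0, 1), (-1.5, -0.5, 0)],
    [(-2.5, -1.0, 3), (-0.2, -4.0, 0)],
]


def total_errors(weights, data):
    """Errors made when each sentence takes its best-scoring candidate."""
    w_tm, w_lm = weights
    errors = 0
    for candidates in data:
        best = max(candidates, key=lambda c: w_tm * c[0] + w_lm * c[1])
        errors += best[2]
    return errors


def random_search(data, trials=200, seed=0):
    """Draw random weight pairs and keep the one with the fewest errors."""
    rng = random.Random(seed)
    best_weights = (0.5, 0.5)  # starting point: equal weights
    best_errors = total_errors(best_weights, data)
    for _ in range(trials):
        weights = (rng.random(), rng.random())
        errors = total_errors(weights, data)
        if errors < best_errors:
            best_weights, best_errors = weights, errors
    return best_weights, best_errors


weights, errors = random_search(validation)
print(weights, errors)
```

The search never returns a result worse than its starting point, because a new weight pair is kept only when it lowers the error count on the validation set.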

5 Statistical translation based on finite-state transducers

Finite-state transducers (FSTs) are used in several areas of pattern recognition and computational linguistics. An FST is built from a training corpus of source-target sentence pairs: alignment methods based on GIZA++ generate a set of strings from which a rational grammar can be inferred, and this grammar is finally converted into a finite-state transducer. One of the main reasons for the interest in this technique is that finite-state machines can be learned automatically from examples (Vidal et al, 2000).

[Figure: block diagram. Parallel Corpus → Word Alignment (GIZA++) → Finite-State Transducer (GIATI) → Translation Model; Source Language Corpus → Translation (REFX) → Evaluation against the Target Language Corpus]

Figure 4: Architecture of translation based on finite-state transducers

An FST is characterized by its topology and its probability distributions, two distinctive features that can be learned from a bilingual corpus with efficient algorithms such as GIATI ("Grammar Inference and Alignments for Transducers Inference"). Figure 4 shows the architecture of this solution. The steps of this translation strategy are explained below.

5.1 Alignment with GIZA++

This phase aligns the words of the input sentences (in Spanish) with the signs/glosses of their translations (in LSE). The alignment is computed in both directions, words to glosses and glosses to words, using GIZA++ as described in Section 4.1.

5.2 Transforming training pairs into sentences of extended words

Starting from an alignment such as the one explained in Section 4.1, a labelling process builds an extended corpus from each training pair and its alignment: source-language words are attached to the target-language words they are aligned with. The following is an example of Spanish / LSE (gloss) pairs and their alignment:

el denei es obligatorio desde los catorce años # DNI(2) SE-LLAMA(3) OBLIGATORIO(4) DESDE(5) CATORCE(7) PLURAL(6) AÑO(8) EDAD(8)

el denei es obligatorio # DNI(2) SE-LLAMA(3) OBLIGATORIO(4)

el denei es el documento oficial # DNI(2) SE-LLAMA(3) DOCUMENTO(5) OFICIAL(6)

el denei es oficial # DNI(2) SE-LLAMA(3) OFICIAL(4)

To avoid violating the sequential order of the words in the target language, the following labelling criterion is used: each target-language word is attached to its aligned source-language word if this does not alter the order of the target words. Otherwise, the target-language word is attached to the first source-language word that does not violate the target word order. The previous example therefore becomes the following sequences of extended words (pairs of aligned words and signs):

(el, λ) (denei, DNI) (es, SE-LLAMA) (obligatorio, OBLIGATORIO) (desde, DESDE) (los, PLURAL), (catorce, CATORCE) (años, AÑO EDAD)

(el, λ) (denei, DNI) (es, SE-LLAMA) (obligatorio, OBLIGATORIO)

(el, λ) (denei, DNI) (es, SE-LLAMA) (el, λ) (documento, DOCUMENTO) (oficial, OFICIAL)

(el, λ) (denei, DNI) (es, SE-LLAMA) (oficial, OFICIAL)
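The labelling criterion can be sketched as follows in Python. This is a simplified reading of the rule, not the GIATI implementation: the function name `extended_words` is an assumption, every gloss is assumed to be aligned to exactly one word, and "the first source word that does not violate the order" is interpreted as never attaching a gloss to a position earlier than the one used by the previous gloss.

```python
def extended_words(source, target, alignment):
    """Build extended words (source word, attached glosses).

    source: list of source words.
    target: list of glosses, in target order.
    alignment: alignment[j] is the 0-based index of the source word
               aligned with target[j].
    """
    attached = [[] for _ in source]
    last = 0  # source position used by the previous gloss
    for gloss, position in zip(target, alignment):
        # Never move backwards, so the target order is preserved.
        position = max(position, last)
        attached[position].append(gloss)
        last = position
    # Source words without glosses get the empty symbol λ.
    return [(word, " ".join(glosses) if glosses else "λ")
            for word, glosses in zip(source, attached)]


# Third training pair of the example; the alignment DNI(2) SE-LLAMA(3)
# DOCUMENTO(5) OFICIAL(6) is converted to 0-based indices.
source = "el denei es el documento oficial".split()
target = ["DNI", "SE-LLAMA", "DOCUMENTO", "OFICIAL"]
alignment = [1, 2, 4, 5]

print(extended_words(source, target, alignment))
```

For this pair the sketch gives (el, λ) (denei, DNI) (es, SE-LLAMA) (el, λ) (documento, DOCUMENTO) (oficial, OFICIAL), the same sequence as in the third line of the example.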

This labelling can be refined so that source words left on their own are joined to the first following extended word that has target word(s) attached. The previous example then becomes:

(el denei, DNI) (es, SE-LLAMA) (obligatorio, OBLIGATORIO) (desde, DESDE) (los, PLURAL), (catorce, CATORCE) (años, AÑO EDAD)

(el denei, DNI) (es, SE-LLAMA) (obligatorio, OBLIGATORIO)

(el denei, DNI) (es, SE-LLAMA) (el documento, DOCUMENTO) (oficial, OFICIAL)

(el denei, DNI) (es, SE-LLAMA) (oficial, OFICIAL)

5.3 Inference of a stochastic grammar and then of a finite-state transducer

This step obtains a finite-state transducer from the sentences of extended words. The transition probabilities between the nodes of an FST are computed from the corresponding counts in the training set of extended words. The probability of an extended word z_i = (s_i, t_i), made up of a source word s_i and a target word t_i, given the preceding extended words z_{i-n+1} ... z_{i-1} = (s_{i-n+1}, t_{i-n+1}) ... (s_{i-1}, t_{i-1}), is:

$$p_n(z_i \mid z_{i-n+1} \ldots z_{i-1}) = \frac{c(z_{i-n+1}, \ldots, z_{i-1}, z_i)}{c(z_{i-n+1}, \ldots, z_{i-1})} \qquad (2)$$

where c(·) is the number of times an event occurs in the training set. A bigram model is inferred from the result of the previous section. The following figure illustrates this process; grey nodes indicate that the phrase can end at that point:

[Figure: finite-state transducer with arcs labelled el denei/DNI, es/SE-LLAMA, obligatorio/OBLIGATORIO, el documento/DOCUMENTO, oficial/OFICIAL, desde/DESDE, los/LOS, catorce/CATORCE, años/AÑO EDAD and oficial/OFICIAL]

Figure 5: Finite-state transducer inferred from the bigram model of the previous example
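The count-based estimate of Section 5.3 can be illustrated for the bigram case (n = 2) with a short Python sketch over the four refined sentences of extended words from Section 5.2. The function name `train_bigram` and the sentence boundary symbols `<s>` and `</s>` are assumptions of this sketch; no smoothing is applied.

```python
from collections import Counter

# The four training sentences written as extended words "source/TARGET".
corpus = [
    ["el denei/DNI", "es/SE-LLAMA", "obligatorio/OBLIGATORIO", "desde/DESDE",
     "los/PLURAL", "catorce/CATORCE", "años/AÑO EDAD"],
    ["el denei/DNI", "es/SE-LLAMA", "obligatorio/OBLIGATORIO"],
    ["el denei/DNI", "es/SE-LLAMA", "el documento/DOCUMENTO",
     "oficial/OFICIAL"],
    ["el denei/DNI", "es/SE-LLAMA", "oficial/OFICIAL"],
]


def train_bigram(sentences):
    """Relative-frequency bigram: p(z_i | z_{i-1}) = c(z_{i-1}, z_i) / c(z_{i-1})."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        padded = ["<s>"] + sentence + ["</s>"]
        for previous, current in zip(padded, padded[1:]):
            history_counts[previous] += 1
            bigram_counts[(previous, current)] += 1
    return {bigram: count / history_counts[bigram[0]]
            for bigram, count in bigram_counts.items()}


probabilities = train_bigram(corpus)
print(probabilities[("es/SE-LLAMA", "obligatorio/OBLIGATORIO")])
```

Each bigram with non-zero probability corresponds to one arc of the transducer, and a transition to `</s>` plays the role of the grey nodes where a phrase can end.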

6 Evaluation

6.1 Evaluation measures

To assess translation quality, the output of the automatic system must be compared with a reference and some evaluation measures computed. WER (Word Error Rate) is commonly used to evaluate speech recognition and machine translation systems. It counts the word insertions, deletions and substitutions found when two sentences are compared, relative to the length of the reference, and is based on the edit (Levenshtein) distance. In both machine translation and speech recognition, the WER is computed between the sentence generated by the system and a correct reference sentence (here, sequences of signs or glosses).
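As an illustration, WER can be computed with the standard dynamic-programming edit distance. The following Python sketch assumes whitespace-separated glosses and a non-empty reference; the function name `wer` and the example sentences are illustrative and are not taken from the evaluation scripts used in the experiments.

```python
def wer(reference, hypothesis):
    """Word error rate (%) between a reference and a hypothesis sentence."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(ref)


print(wer("DNI SE-LLAMA OBLIGATORIO DESDE", "DNI OBLIGATORIO DESDE"))
```

In the usage line, one gloss out of four is deleted, so the sketch reports a WER of 25%.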

BLEU (Bilingual Evaluation Understudy) (Papineni et al, 2002) is a method for evaluating the quality of translations produced by machine translation systems. A translation is of higher quality the more similar it is to a reference translation that is assumed to be correct. BLEU can be computed with more than one reference translation, which makes the measure more robust to free translations produced by humans. BLEU matches n-grams sentence by sentence and computes the n-gram precision between the system translation and the reference. These measures were introduced in the search for automatic measures that correlate with the assessment an expert would make of the translation.
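The core of BLEU, the clipped (modified) n-gram precision, can be sketched in Python as follows. This is a simplified single-reference, single-sentence version with n-grams up to order 2 by default; standard BLEU accumulates counts over a whole test set and usually uses orders up to 4. The function names are assumptions of this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it
    # appears in the reference.
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / total


def bleu(reference, hypothesis, max_n=2):
    """Geometric mean of the n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(reference, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * geometric_mean


reference = "DNI SE-LLAMA DOCUMENTO OFICIAL".split()
hypothesis = "DNI SE-LLAMA OFICIAL".split()
print(modified_precision(reference, hypothesis, 2), bleu(reference, hypothesis))
```

The geometric mean in `bleu` is the reason a single n-gram order with no matches drives the score to zero.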

Another measure is NIST, which is based on BLEU with some modifications. First, BLEU uses the geometric mean of the N-gram precisions, whereas NIST uses an arithmetic mean to reduce the impact of low co-occurrence counts for high n-gram orders. In addition, BLEU computes the n-gram precision with equal weights for every n-gram, whereas NIST takes into account how informative a particular n-gram is (for example, the less frequent an n-gram is, the more weight it receives).

6.2 Database

The database used for the experiments is a parallel corpus of 414 sentences typical of a restricted domain: those a civil servant would say when assisting people who want to renew their passport and/or their national identity card (DNI), or related information. In this setting, a speech to LSE translation system is very useful, because most of these employees do not know this language and have difficulties interacting with deaf people.

The set of sentences was randomly divided into three groups: training (approximately 70% of the sentences), validation (15% of the sentences) and test (15% of the sentences). This split is arbitrary. A summary of the database is shown below:

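Set | Item | Spanish | LSE
Total | Sentence pairs | 414 |
Total | No. of words/glosses | 4847 | 4564
Training | Sentence pairs | 314 |
Training | No. of words/glosses | |
Validation | Sentence pairs | 50 |
Validation | No. of words/glosses | |
Test | Sentence pairs | 50 |
Test | No. of words/glosses | |

Table 1: Database statistics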

6.3 Results of the experiments

The 414 sentences were spoken by 14 people to evaluate the speech recognizer. Three different experiments were carried out:

• In the first one, the speech recognizer is evaluated with both the language model and the vocabulary generated from the training set. This is the most realistic situation.

• In the second one, the language model is generated from the training set, while the vocabulary includes all the words. This avoids the effect of out-of-vocabulary words.

• In the last experiment, all the sentences are used both for training and for the vocabulary. This reproduces the situation in which so many training sentences are available that the test sentences are contained in them.

The results of the three experiments are given in the table below. The measures included are WER (Word Error Rate), I (insertions), D (deletions) and S (substitutions):

Experiment | WER (%) | I (%) | D (%) | S (%)
Experiment 1 | 24.08 | 2.61 | 6.71 | 14.76
Experiment 2 | 15.84 | 1.19 | 5.93 | 8.72
Experiment 3 | 4.74 | 0.86 | 1.94 | 1.94

Table 2: Speech recognizer results for the three experiments


The translation results obtained with the statistical translation techniques described in Sections 4 and 5 are shown next. Table 3 gives the results of the translation experiments, both with the reference sentences of the Spanish-LSE parallel corpus (REF) and with the output of the speech recognizer for the three recognition experiments described above (Exp 1-3). Two main situations are distinguished: the first part of the table shows the results when the translation model is trained on the reference sentences, and the second part shows the same results when the translation model is trained on the recognizer output (for the training sentences). In all cases the table reports WER (sign error rate at the output of the translation) and the BLEU and NIST measures described above.

Translation model trained on the reference sentences of the training set

Method | Input | WER | BLEU | NIST
Phrase-based translation | Exp 1 | 39.17 | 0.4853 | 6.2806
Phrase-based translation | Exp 2 | 37.99 | 0.4900 | 6.4006
Phrase-based translation | Exp 3 | 33.72 | 0.5301 | 6.7238
Phrase-based translation | REF | 31.75 | 0.5469 | 6.8652
FST-based translation | Exp 1 | 35.85 | 0.5090 | 6.6473
FST-based translation | Exp 2 | 33.89 | 0.5238 | 6.8610
FST-based translation | Exp 3 | 29.32 | 0.5804 | 7.3100
FST-based translation | REF | 28.21 | 0.5905 | 7.3501

Translation model trained on the recognizer output for the training sentences

Method | Input | WER | BLEU | NIST
Phrase-based translation | Exp 1 | 40.04 | 0.4775 | 6.2076
Phrase-based translation | Exp 2 | 37.46 | 0.4939 | 6.4738
Phrase-based translation | Exp 3 | 32.44 | 0.5449 | 6.8606
Phrase-based translation | REF | 31.75 | 0.5469 | 6.8652
FST-based translation | Exp 1 | 36.33 | 0.5188 | 6.5273
FST-based translation | Exp 2 | 33.42 | 0.5235 | 6.8344
FST-based translation | Exp 3 | 29.27 | 0.5698 | 7.1953
FST-based translation | REF | 28.21 | 0.5905 | 7.3501

Table 3: Results of the translation experiments

The table shows that the Reference results are always the best (lowest WER and highest BLEU and NIST) compared with those obtained by translating the output of the speech recognizer, because the reference contains no recognition errors to hinder the subsequent translation. It also shows that the worse the recognition rate, the worse the translation of the recognizer output. In general, with this database (sentences from the DNI/passport domain), FST-based statistical translation gives better results than the phrase-based solution. Training the translation model on the recognizer output exposes the model to the recognizer's possible errors, so that it can learn from these errors and correct them during translation. Although the results improve, the differences are very small. Finally, it can be concluded that the best system is FST-based translation trained on the recognizer output.

7 Conclusions

This paper has presented a statistical speech to sign language translation system for deaf people. Specifically, it has studied translation from Spanish into Spanish Sign Language, using as application domain the sentences a police officer says when explaining how to renew or apply for the DNI. These situations require a signing interpreter able to translate any sentence for deaf people who want to carry out these procedures, so an automatic system can be of great help. The system developed contains a speech recognition module, a translation module, on which this paper has focused, and a final sign representation module. Two statistical translation solutions have been studied: the first uses a phrase-based translation model and the second a finite-state transducer (FST). Both use freely distributed programs discussed throughout the text (GIZA++, Moses and GIATI). These programs can translate into LSE both Spanish texts containing the original sentences and texts containing the sentences output by the speech recognizer, which include the errors typical of recognition.

The results reported correspond to tests with the original (reference) text and with the text obtained at the output of the recognizer. For this study, the sentence database was divided into three subsets: training, validation and test. In general, with this database (sentences from the DNI/passport domain), FST-based statistical translation gives better results than the phrase-based solution. Training the translation model on the recognizer output makes it possible to learn from the errors and correct them during translation. Although the results improve, the differences are very small. Finally, it can be concluded that the best system is FST-based translation trained on the recognizer output, with a WER of 29.27% and a BLEU of 0.5698.

8 Prototype developed

Figure 6: Prototype interface.

This work has produced a speech to LSE translation prototype (Figure 6) that has been evaluated with sentences spoken by students. The next step is to evaluate the system in real conditions, with real interactions between police officers and deaf people.

Acknowledgements

This work has been possible thanks to funding from the following projects: EDECAN (MEC Ref: TIN2005-08660-C04), ROBONAUTA (MEC Ref: DPI2007-66846-C02-02) and ANETO (UPM-DGUI-CAM. Ref: CCG07-UPM/TIC-1823).

References

Bungeroth J., Ney H. "Statistical Sign Language Translation". Workshop on Representation and Processing of Sign Languages, LREC’04.

Casacuberta F., Vidal E. "Machine Translation with Inferred Stochastic Finite-State Transducers". Computational Linguistics, Vol. 30, No. 2, pp. 205-225. 2002.

Cox S.J., Lincoln M., Tryggvason J., Nakisa M., Wells M., Tutt M., Abbott S. "TESSA, a system to aid communication with deaf people". ASSETS 2002, pp. 205-212, Edinburgh.

Koehn P., Och F.J., Marcu D. "Statistical Phrase-Based Translation". Human Language Technology Conference 2003 (HLT-NAACL 2003), Edmonton, Canada, pp. 127-133.

Lynette van Zijl, Stellenbosch. "South African Sign Language Machine Translation System". Proc. of the 2nd International Conference on Computer Graphics, Virtual Reality, Visualisation and Interaction in Africa, pp. 49-52. 2003.

Morrissey S., Way A. "An example-based approach to translating sign language". Workshop on Example-Based Machine Translation (MT X–05), pp. 109-116, Phuket, Thailand, September 2005.

Papineni K., Roukos S., Ward T., Zhu W.J. "BLEU: a method for automatic evaluation of machine translation". 40th Annual Meeting of the ACL, Philadelphia, PA, pp. 311-318. 2002.

Och F.J., Ney H. "Improved Statistical Alignment Models". Proc. of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 440-447, Hong Kong, China, October 2000.

Och F.J., Ney H. "Discriminative Training and Maximum Entropy Models for Statistical Machine Translation". Annual Meeting of the ACL, Philadelphia, PA, pp. 295-302. 2002.

Och F.J., Ney H. "A systematic comparison of various alignment models". Computational Linguistics, Vol. 29, No. 1, pp. 19-51. 2003.

San-Segundo R., Barra R., D’Haro L.F., Montero J.M., Córdoba R., Ferreiros J. "A Spanish Speech to Sign Language Translation System". Interspeech 2006.

Stolcke A. "SRILM – An Extensible Language Modelling Toolkit". ICSLP. 2002.

Sumita E., Akiba Y., Doi T. et al. "A Corpus-Centered Approach to Spoken Language Translation". Conf. of the Association for Computational Linguistics (ACL), Hungary, pp. 171-174. 2003.

Vidal E., Casacuberta F., García P. "Grammatical Inference and Automatic Speech Recognition". New Advances and Trends in Speech Recognition and Coding (volume 147 of NATO-ASI Series F). Casacuberta and Vidal. 2000.



Comparación y combinación de los sistemas de traducción automática basados en n-gramas y en sintaxis

Comparison and system combination of n-gram-based and syntax-based machine translation systems

Maxim Khalilov y José A. R. Fonollosa
Centre de Recerca TALP
Universitat Politècnica de Catalunya
Campus Nord, C. Jordi Girona, 1-3
Barcelona, Spain
(khalilov,adrian)@gps.tsc.upc.edu

Resumen: En este artículo se comparan dos sistemas basados en dos aproximaciones diferentes de traducción automática: el denominado sistema de Traducción Automática Aumentado con Sintaxis (SAMT / TAAS), basado en una sintaxis subyacente al modelo basado en frases, y el sistema de traducción automática estadística (TAE) basado en n-gramas, en el cual el proceso de traducción está basado en el modelado estocástico del contexto bilingüe. Se realiza una comparación de la arquitectura de los dos sistemas paso a paso y se comparan también los resultados en base a las medidas automáticas de evaluación de la calidad de traducción y los recursos computacionales para una pequeña tarea árabe-inglés que pertenece al dominio de noticias. Finalmente, se combinan las salidas de ambos sistemas para obtener una mejora significativa de la calidad de la traducción.
Palabras clave: Traducción automática estocástica, traducción basada en sintaxis, n-gramas, combinación de sistemas

Abstract: In this paper we compare two approaches to machine translation: the Syntax Augmented Machine Translation (SAMT) system, a syntax-driven translation system built on top of a phrase-based model, and n-gram-based Statistical Machine Translation (SMT), in which the translation process is based on statistical modeling of the bilingual context. We provide a step-by-step comparison of the systems, reporting results in terms of automatic evaluation metrics and required computational resources for a small Arabic-to-English translation task from the news domain. Finally, we combine the outputs of both systems, which yields a significant improvement in translation quality.
Keywords: Statistical machine translation, syntax-based translation, n-grams, system combination

1. Introduction

Adding syntactic information to statistical machine translation systems (hybrid syntactic-statistical systems) is a current research topic in Machine Translation (MT). The classic word-based IBM models that appeared in the early 1990s were improved by working at the level of phrases (understood as sequences of words), as described in (Koehn, Och, and Marcu, 2003) or in the more recent implementation MOSES MT (http://www.statmt.org/moses/).

In parallel with the phrase-based¹ approach, the n-gram-based approach has appeared (Mariño et al., 2006), derived from translation based on finite-state transducers (Casacuberta, Vidal, and Vilar, 2002). N-gram-based systems work with bilingual units, called tuples, made up of one or more words of the source language and one or more words of the target language.

¹ In this paper the Spanish word "frase" is used as the direct translation of the English word "phrase".

In contrast with, or as a complement to, traditional SMT systems, syntax-based systems and hierarchical phrase-based models have gained ground. A representative sample of syntax-based translation systems includes those based on synchronous bilingual grammars (Melamed, 2004), on parse tree-to-string models, and on non-isomorphic tree-to-tree mappings (Charniak, 2003).

Building on relative phrase probabilities, Chiang (2005) introduced a hierarchical phrase model that can be seen as a coherent generalization of the classic phrase-based model (it allows multiple generalizations within each phrase). The SAMT system (TAAS in Spanish) (Zollmann and Venugopal, 2006) is an implementation of an MT system that generalizes this approach further: syntactic categories, extracted from the target side of the parse tree, are assigned to the hierarchically structured phrases.

This paper compares the differences and similarities between statistical translation based on n-grams of translation units and the SAMT system, which operates with nonterminal categories and a Synchronous Context-Free Grammar. The comparison was carried out on a small Arabic-English translation task from the news domain; the training corpus contains approximately 1.5M tokens.

El resto de la presentacion esta organiza-do de la siguiente manera: En la Seccion 2 elsistema TAAS de la CMU-UKA2 se describebrevemente, en la siguiente Seccion se descri-be el sistema de TAE basado en n-gramas.En la Seccion 4 se presentan la metodologıa,la descripcion de los experimentos y los resul-tados alcanzados, y finalmente en la Seccion5 se discuten los resultados y se presentan lasconclusiones.

2. The SAMT system

² Carnegie Mellon University - University of Karlsruhe

One of the main criticisms of phrase-based models is data sparseness. The problem is even more serious when the source language, the target language, or both are highly inflected and morphologically rich. In addition, phrase-based models have difficulty handling long-distance reorderings, because the distortion model is based solely on the distance moved, and the computational resources required grow rapidly with the distance considered (Och and Ney, 2004).

A successful attempt to address this problem was the introduction of an MT system based on generalized, hierarchically structured phrases, as described in Chiang (2005). That system operates with only two labels (a substantive phrase category and a glue label³). More recent work (Zollmann and Venugopal, 2006) reports a significant improvement when complete or partial syntactic categories (obtained from target-language parse trees) are assigned to the phrases.

2.1. Modeling

One formalism for Syntax-Augmented Translation is the Probabilistic Synchronous Context-Free Grammar (PSCFG), which is defined in terms of the terminal sets of the source and target languages and a set of non-terminals. Its rules have the form:

X −→ 〈γ, α,∼, ω〉

where X is a non-terminal, γ is a sequence of source-side terminals and non-terminals, α is a sequence of target-side terminals and non-terminals, ∼ is the one-to-one mapping from the non-terminals in γ to the non-terminals in α, and ω is the non-negative weight assigned to the rule.

The set of non-terminals comprises the syntactic categories of the Penn Treebank tag set for the target side, a set of glue rules, and a special label representing the default category, named after the "Chiang-style" rules, which does not correspond to any other category in the parse tree. Consequently, all purely linguistic rules are included in the phrase mapping table.

³ "Glue rule"

Maxim Khalilov y José A. R. Fonollosa


2.2. Rule annotation, generalization, and pruning

The purely lexical phrase table underlying the SAMT system is obtained as described in Koehn et al. (2003) and is based on word alignments generated with the grow-diag-final method (Och and Ney, 2003).

The target side of the training corpus was parsed with Charniak's parser (Charniak, 2000), and each phrase was annotated with the constituent that spans the target side of the rule. The set of non-terminals was extended with conditional and additive categories in the style of Combinatory Categorial Grammar (Steedman, 1999). Such labels take forms like RB+VB, denoting a constituent formed by joining two adjacent categories (here an adverb and a verb), or DT\NP, denoting an incomplete noun phrase that lacks its initial determiner.

The recursive rule generalization procedure coincides with the one proposed in Chiang (2005), except that it drops the restrictions introduced there for a grammar with a single category (for example, rules containing adjacent generalized elements are allowed).

Thus, each existing rule

$$N \rightarrow f_1 \dots f_m \,/\, e_1 \dots e_n$$

can be extended by the existing rule

$$M \rightarrow f_i \dots f_u \,/\, e_j \dots e_v$$

where $1 \le i < u \le m$ and $1 \le j < v \le n$, to obtain a new rule

$$N \rightarrow f_1 \dots f_{i-1}\, M_k\, f_{u+1} \dots f_m \,/\, e_1 \dots e_{j-1}\, M_k\, e_{v+1} \dots e_n$$

where k is an index on the non-terminal M that indicates the one-to-one correspondence between the new M tokens on the two sides.
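The substitution step can be illustrated with a minimal Python sketch. The `Rule` class, the `generalize` function, the 0-based indices, and the `LABEL_k` naming of the co-indexed non-terminal are all assumptions made for illustration; they are not taken from the SAMT toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule container: left-hand side plus source/target token tuples."""
    lhs: str
    src: tuple
    tgt: tuple


def generalize(outer, inner, i, j, k=1):
    """Replace the span covered by `inner` inside `outer` with a co-indexed non-terminal.

    `i` and `j` are the 0-based start positions of the inner rule on the
    source and target sides of the outer rule (the text uses 1-based indices).
    """
    u = i + len(inner.src)
    v = j + len(inner.tgt)
    if outer.src[i:u] != inner.src or outer.tgt[j:v] != inner.tgt:
        raise ValueError("inner rule does not match the given spans")
    label = f"{inner.lhs}_{k}"  # same index k on both sides
    return Rule(
        outer.lhs,
        outer.src[:i] + (label,) + outer.src[u:],
        outer.tgt[:j] + (label,) + outer.tgt[v:],
    )


# Toy example with placeholder tokens.
outer = Rule("S", ("f1", "f2", "f3", "f4"), ("e1", "e2", "e3"))
inner = Rule("NP", ("f2", "f3"), ("e1", "e2"))
new_rule = generalize(outer, inner, i=1, j=0)
```

Applying the function repeatedly to its own output gives the recursive generalization described in the text.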

Figure 1 shows an example of the extraction of initial rules. These rules are subsequently extended thanks to the hierarchical structure of the model (Figure 2).

Rule pruning is necessary because the set of generalized rules can be enormous. It is performed on the basis of relative frequencies and the nature of the rules: non-lexical rules that occurred only once are discarded directly; source-conditioned rules whose frequency of occurrence falls below a threshold are also removed; rules that contain no non-terminals are never pruned.
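One possible reading of these three criteria is sketched below in Python. The rule representation (token tuples with non-terminals marked by a leading `@`), the function names, and the threshold value are assumptions for illustration only; the actual SAMT filtering scripts may differ.

```python
from collections import defaultdict


def is_nonterminal(token):
    # Assumption: non-terminals are written with a leading "@".
    return token.startswith("@")


def prune_rules(counts, min_rel_freq=0.05):
    """Filter rules given `counts`: {(src_tokens, tgt_tokens): occurrence count}."""
    # Total count per source side, used for the source-conditioned relative frequency.
    src_totals = defaultdict(int)
    for (src, _), c in counts.items():
        src_totals[src] += c

    kept = {}
    for (src, tgt), c in counts.items():
        tokens = src + tgt
        purely_lexical = not any(is_nonterminal(t) for t in tokens)
        non_lexical = all(is_nonterminal(t) for t in tokens)
        if purely_lexical:
            kept[(src, tgt)] = c      # never pruned
        elif non_lexical and c == 1:
            continue                  # singleton abstract rule: discarded
        elif c / src_totals[src] < min_rel_freq:
            continue                  # too rare given its source side
        else:
            kept[(src, tgt)] = c
    return kept


counts = {
    (("la", "casa"), ("the", "house")): 1,
    (("@NP", "@VP"), ("@NP", "@VP")): 1,
    (("la", "@NN"), ("the", "@NN")): 99,
    (("la", "@NN"), ("@NN", "the")): 1,
}
kept = prune_rules(counts, min_rel_freq=0.05)
```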

2.3. Decoding and feature functions

Decoding is carried out with a "top-down" log-linear model that decodes a source sentence enriched with the PSCFG, so that translation quality is represented by a set of feature functions for each rule, namely:

• conditional probabilities, given the source categories, the target categories, or the left-hand-side categories;

• lexical weight functions, as presented in Koehn et al. (2003);

• counters for the number of target-side words and for the number of rule applications;

• binary features reflecting the context of the rule (whether it is purely lexical or purely abstract, among others);

• penalties for rule rareness and imbalance.

Decoding can be viewed as a search (an argmax operation) over the probability space of target-language terminals, similar to monolingual parsing with a context-free grammar. The feature-function weights are optimized to maximize the BLEU score (Zollmann and Venugopal, 2006).

3. The n-gram-based system

A detailed description of the n-gram-based system can be found in Mariño et al. (2006). The SMT problem is formulated in terms of the source (f) and target (e) languages and is defined by equation (1). It can be reformulated as selecting, from the set of target sentences, the translation with the highest probability (2):

$$\hat{e}_1^I = \arg\max_{e_1^I} \left\{ p(e_1^I \mid f_1^J) \right\} = \qquad (1)$$

$$= \arg\max_{e_1^I} \left\{ p(f_1^J \mid e_1^I) \cdot p(e_1^I) \right\} \qquad (2)$$

Comparación y combinación de los sistemas de traducción automática basados en n-gramas y en sintaxis


where I and J are the number of words in the target and source sentences, respectively.

The most recent systems operate with bilingual units extracted from the parallel corpus on the basis of word alignments. The logarithms of the probabilities associated with the feature functions are combined linearly (log-linear approach) to define a function whose maximization yields the translation (Och and Ney, 2002), as shown in formula (3):

$$\hat{e}_1^I = \arg\max_{e_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right\} \qquad (3)$$

where h_m denotes the feature functions and λ_m the weights corresponding to each model.
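As a minimal illustration of formula (3), the following Python sketch scores a small list of candidate translations with a weighted sum of feature values and returns the best one. The feature names (`tm`, `lm`, `wb`), the weights, and the candidate list are invented for the example; a real decoder searches a far larger hypothesis space than an explicit list.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum over m of lambda_m * h_m."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Return the candidate text with the highest log-linear score (the argmax)."""
    text, _ = max(hypotheses, key=lambda h: loglinear_score(h[1], weights))
    return text


# Hypothetical weights and log-domain feature values.
weights = {"tm": 1.0, "lm": 0.5, "wb": 0.1}
hypotheses = [
    ("a b", {"tm": -2.0, "lm": -4.0, "wb": 2}),
    ("a b c", {"tm": -2.5, "lm": -3.0, "wb": 3}),
]
best = best_hypothesis(hypotheses, weights)
```

Here the second candidate wins (score -3.7 against -3.8) even though its translation-model score is lower, because the language model and word bonus outweigh the difference.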

3.1. The translation system

The n-gram-based approach is regarded as an alternative to phrase-based translation, in which the source word sequence is segmented into monolingual phrases that are translated individually to form the target sentence (Koehn, Och, and Marcu, 2003).

N-gram-based translation treats translation as a stochastic process that maximizes the joint probability p(f, e), based on a decomposition into bilingual n-grams. The core of the resulting system is a translation model (in effect a language model, LM) built over bilingual units called tuples. Tuples are extracted from the word alignment according to conditions that define a unique segmentation (Mariño et al., 2006).

Whereas phrase-based SMT uses context only to reorder phrases, not to translate them, n-gram-based systems condition each translation decision on the previous translation decisions.

3.2. Additional features

Like phrase-based systems, the most recent tuple-based systems implement a linear combination of the logarithm of the probability assigned to the translation by the translation model and the following additional features:

• an n-gram LM over target-language words;

• an n-gram LM over target-language tags (an n-gram model of Part-Of-Speech (POS) tags);

• a penalty model for shorter translations, which compensates for the tendency to generate translations with fewer words;

• lexical models in each direction (source-to-target and vice versa), as described in Och and Ney (2004).

3.3. Extended word reordering

The tuple-based approach is inherently monotonic, since the model relies on the sequential order of the tuples seen during training; reordering strategies must therefore be introduced to obtain good results in some translation tasks. The extended distortion model was implemented as presented in Crego and Mariño (2006). Based on the word alignment, tuples were extracted with the technique known as unfolding, in which tuples are split into smaller tuples and these pieces are sequenced in target word order. The reordering strategy is supported by a 4-gram LM over the POS tags of the reordered source text. An example of tuple extraction, in contrast with the chunk-based rule construction used in the SAMT system, is shown in Figure 1.

3.4. Decoding and optimization

The freely distributed MARIE decoder was used as the search engine of the translation system. Details of its operation are given in Crego et al. (2005). The decoder implements a beam search with histogram pruning. Given the development corpus and the reference translations, the weights of the log-linear combination are tuned with the simplex optimization method (with the aim of maximizing BLEU) and an n-best list re-ranking, as described at http://www.statmt.org/jhuws/.



4. Experiments

4.1. System evaluation

The experimental results were obtained using only the first 50K lines of the news-domain corpus provided for the NIST'08 machine translation evaluation. Corpus statistics are given in Table 1. The development and test sets had 4 reference translations each and contained 663 and 500 sentences, respectively.

Automatic evaluation scores were computed case-insensitively. We considered the classical BLEU and NIST metrics, together with mPER, mWER, and METEOR. Word alignments were obtained automatically by running GIZA++ (Och and Ney, 2004) in both directions and symmetrizing the two outputs by taking their union.
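The union symmetrization mentioned above amounts to merging the links found in the two alignment directions. A minimal Python sketch follows; the representation of alignments as lists of index pairs and the function name are assumptions, and this is not GIZA++ output parsing.

```python
def symmetrize_union(src2tgt, tgt2src):
    """Union of two directional word alignments.

    `src2tgt` holds (source_index, target_index) pairs;
    `tgt2src` holds (target_index, source_index) pairs and is flipped first.
    """
    forward = set(src2tgt)
    backward = {(s, t) for (t, s) in tgt2src}
    return sorted(forward | backward)


# Toy alignments for a two-word source and a three-word target sentence.
links = symmetrize_union(src2tgt=[(0, 0), (1, 2)], tgt2src=[(0, 0), (1, 1)])
```

The union keeps every link proposed by either direction, so it tends to produce many-to-many alignments.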

                  Arabic     English
Sentences         50 K       50 K
Words             1.41 M     1.57 M
Average length    28.15      31.22
Vocabulary        51.10 K    31.51 K

Table 1: Training material.

The experiments were run on a Pentium IV Dual Intel Xeon Quad Core X5355 2.66 GHz machine with 24 GB of RAM. The reported computation times and memory footprints are approximate.

4.2. Arabic preprocessing

Arabic is a VSO (verb-subject-object) language with a morphology based on vowel patterns, in which words are composed of roots and affixes, as well as clitics attached to the words. For preprocessing we followed an approach similar to that of Habash and Sadat (2006), based on the MADA+TOKAN system for disambiguation and tokenization. Diacritic disambiguation relied exclusively on unigram statistics. For tokenization we used the D3 scheme with the –TAGBIES option. This scheme splits off the following set of clitics: f+, b+, k+, l+, Al+, and the pronominal clitics. The –TAGBIES option produces Bies POS tags for all tokens.

4.3. Experiments with the SAMT system

To run the experiments we followed the guidelines available online at http://www.cs.cmu.edu/~zollmann/samt/. The script that is part of the MOSES MT system was used to create the grow-diag-final alignment and to extract the purely lexical phrases, which were then used to induce the SAMT grammar. The target side (English) of the training corpus was parsed with Charniak's Penn Treebank parser (Charniak, 2000).

Rule extraction and filtering were restricted to the concatenation of the development and test sets, allowing only rules with up to 12 elements on the target side and a minimum occurrence criterion of zero for all rules; purely abstract rules (with no lexical items) were removed.

The number of Moses-style phrases extracted with the phrase-based system was 4.8M, whereas the number of generalized rules representing the hierarchical model grew substantially, to 22.9M, of which 10.8M were pruned during filtering.

The vocabulary of basic Penn Treebank elements has 72 entries, whereas the number of generalized elements, which includes the additive and truncated categories, is 35.7K.

The FastTranslateChart beam-search decoder was used as the engine for MER training, in order to tune the feature-function weights and to generate the "N-best" and "1-best" translations, interleaving the intensive search with a standard 3-gram language model (Venugopal, Zollmann, and Vogel, 2007). During optimization the number of iterations was limited to 10, extraction was limited to the 1000-best list, and BLEU was used as the optimization criterion.

Table 2 summarizes the computational resources required at each translation step. The results for the SAMT system, together with those for the n-gram system and the system combination (explained in subsection 4.6), are presented in Table 5.



Step                 Time    Memory
Parsing              1.5h    80MB
Rule extraction      10h     3.5GB
Pruning & merging    3h      4GB
Weight tuning        40h     3GB
Test                 2h      3.0GB

Table 2: SAMT: computational resources.

4.4. Experiments with the n-gram-based system

As mentioned above, the core model of the n-gram-based system is a 4-gram LM over bilingual units, which contains 184,345 1-grams⁴, 552,838 2-grams, 179,466 3-grams, and 176,221 4-grams.

Alongside this model, the tuple system implements a log-linear combination of a 5-gram target-language LM estimated on the English side of the parallel corpus and 4-gram POS models for the source and target languages. Bies POS tags were used for Arabic, as described in subsection 4.2, and the TnT tool was used to extract the English POS tags (Brants, 2000).

The number of non-unique tuples initially extracted was 1.1M; these were pruned according to the maximum number of translation options per tuple on the source side (30). Tuples with NULL on the source side were attached to the previous or the following unit (Mariño et al., 2006).

The weights of the additional features were tuned according to the maximum BLEU criterion. Time and RAM requirements are presented in Table 3.

Step                Time    Memory
Model estimation    0.2h    1.9GB
Reordering          1h      -
Weight tuning       15h     120MB
Test                2h      120MB

Table 3: N-gram-based SMT: computational resources.

4.5. Statistical significance

⁴ This number also corresponds to the vocabulary size of the bilingual model.

We performed a statistical significance test based on the "bootstrap resampling" method presented in Koehn (2004). For a 98% confidence level and 1000 resamples, the translations generated by SAMT and by the n-gram-based system are statistically different according to BLEU (43.20 ± 1.69 for SAMT versus 46.42 ± 1.61 for the n-gram system).
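The resampling idea can be sketched in Python as follows. This is a simplified paired variant: it assumes a per-sentence quality score whose mean stands in for the corpus-level metric, which is not how BLEU is computed, and the function name, seed, and toy scores are illustrative assumptions.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system B beats system A.

    `scores_a` and `scores_b` are per-sentence scores for the same test
    sentences; the mean over a resample stands in for the corpus metric.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_b = 0
    for _ in range(n_resamples):
        # Draw n sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b > mean_a:
            wins_b += 1
    return wins_b / n_resamples


# Toy scores: system B is better on every sentence.
win_rate = paired_bootstrap([0.1, 0.2, 0.3], [0.2, 0.3, 0.4])
```

A win rate at or above the chosen confidence level (for instance 0.98) would be read as a statistically significant difference under this scheme.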

4.6. System combination

Motivated by the fact that many MT systems produce very different translations of similar quality, even when the models involved in the translation procedure are alike, we decided to combine the outputs of the syntactic system and the purely statistical machine translation system. We did so using the lists of the most probable translations generated by the two systems (1000-best).

For this we used the Minimum Bayes Risk algorithm as introduced in Kumar and Byrne (2004). Table 5 shows the results of the system combination on the test set, contrasted with the oracle translation, obtained by selecting the translations with the highest BLEU from the union of the two lists, the one generated by SAMT and the one generated by the n-gram system.
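A minimal Python sketch of Minimum Bayes Risk selection over a pooled n-best list is given below. It assumes that each hypothesis carries a log-domain model score, that scores from the two systems are comparable, and that a unigram F1 overlap can stand in for the sentence-level BLEU gain; none of these choices is taken from the experiments reported here.

```python
import math
from collections import Counter


def overlap_gain(hyp, ref):
    """Toy similarity (unigram F1); a stand-in for a sentence-level BLEU gain."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(nbest):
    """Pick the hypothesis with the highest expected gain under the posterior.

    `nbest` is a list of (sentence, log_score) pairs pooled from both systems.
    """
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]  # numerically stable
    total = sum(weights)
    posteriors = [w / total for w in weights]

    best, best_gain = None, -1.0
    for candidate, _ in nbest:
        expected = sum(
            p * overlap_gain(candidate, other)
            for (other, _), p in zip(nbest, posteriors)
        )
        if expected > best_gain:
            best, best_gain = candidate, expected
    return best


nbest = [("a dog ran", -1.0), ("the cat sat", -1.1), ("the cat sat down", -1.2)]
chosen = mbr_select(nbest)
```

In this toy list the top-scoring hypothesis is not selected: the two similar hypotheses support each other, so MBR prefers "the cat sat".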

In addition, we analyzed the percentage contribution of each system to the system combination: roughly 55-60% of the best translations come from the 1000-best list generated by the n-gram system in both cases (system combination and "oracle").

Experiments           SAMT    N-grams
System combination    39%     61%
Oracle                44%     56%

Table 4: Percentage of sentences generated by each system.

5. Discussion and conclusions

In this paper we have compared two machine translation systems: an n-gram-based statistical system (SMT) and the so-called syntax-augmented machine translation system (SAMT). The n-gram-based approach performs better on the task analyzed. The comparison was carried out using the same training material and the same tools for preprocessing and word-to-word alignment.

                      BLEU     NIST     mPER     mWER     METEOR
SAMT                  43.20    9.26     36.89    49.45    58.50
N-gram-based SMT      46.39    10.06    32.98    48.47    62.36
System combination    48.00    10.15    33.20    47.54    62.27
Oracle                61.90    11.41    28.84    41.52    66.19

Table 5: Evaluation of Arabic-English translation.

Figure 1: Example of the extraction of primitive units for SAMT and for the n-gram system.

Figure 2: Example of generalized rules (SAMT).

With regard to memory usage and computational cost, the n-gram system also obtained significantly better results than the SAMT system, mainly because of its clearly smaller search space.

The PER and WER scores yield an interesting comparison: according to PER, the SMT system outperforms its rival by 10% relative, whereas the WER improvement barely reaches 2%. This can be explained by the fact that the SMT system, compared with SAMT, translates context better but produces more reordering errors. Because Arabic and English differ greatly in word order, and because of its tendency to produce short units, the n-gram system handles long-distance reorderings worse. However, by introducing word context into the translation model, it captures short-distance bilingual dependencies more efficiently.
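The relative figures quoted in this paragraph can be checked against the mPER and mWER columns of Table 5. The short calculation below assumes that the percentages are relative reductions with respect to the SAMT error rates:

```latex
\[
\frac{36.89 - 32.98}{36.89} = \frac{3.91}{36.89} \approx 10.6\,\%,
\qquad
\frac{49.45 - 48.47}{49.45} = \frac{0.98}{49.45} \approx 2.0\,\%
\]
```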

Finally, a very significant improvement was obtained by combining the outputs of the two systems, which are based on different translation principles.

As future work, the same comparison and system combination methodology will be applied to other translation tasks.

References

Brants, T. 2000. TnT – a statistical part-of-speech tagger. In Proceedings of ANLP-2000.

Casacuberta, F., E. Vidal, and J. M. Vilar. 2002. Architectures for speech-to-speech translation using finite-state models. In Proceedings of the Workshop on Speech-to-Speech Translation: Algorithms and Systems, pages 39–44.

Charniak, E. 2000. A maximum entropy-inspired parser. In Proceedings of NAACL 2000, pages 132–139.

Charniak, J. 2003. Learning non-isomorphic tree mappings for machine translation. In Proceedings of ACL 2003 (companion volume), pages 205–208.

Chiang, D. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL 2005, pages 263–270.

Crego, J. M., J. Mariño, and A. de Gispert. 2005. An Ngram-based Statistical Machine Translation Decoder. In Proceedings of INTERSPEECH05, pages 3185–3188.

Crego, J. M. and J. B. Mariño. 2006. Improving statistical MT by coupling reordering and decoding. Machine Translation, 20(3):199–215.

Habash, N. and F. Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, pages 49–52.

Koehn, P. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388–395.

Koehn, P., F. J. Och, and D. Marcu. 2003. Statistical phrase-based machine translation. In Proceedings of HLT-NAACL 2003, pages 48–54.

Kumar, S. and W. Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of HLT/NAACL 2004.

Mariño, J. B., R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A. R. Fonollosa, and M. R. Costa-jussà. 2006. N-gram based machine translation. Computational Linguistics, 32(4):527–549, December.

Melamed, I. D. 2004. Statistical machine translation by parsing. In Proceedings of ACL 2004, pages 111–114.

Och, F. and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–52.

Och, F. and H. Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.

Och, F. J. and H. Ney. 2002. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. In Proceedings of ACL 2002, pages 295–302.

Steedman, M. 1999. Alternative quantifier scope in CCG. In Proceedings of ACL 1999, pages 301–308.

Venugopal, A., A. Zollmann, and S. Vogel. 2007. An Efficient Two-Pass Approach to Synchronous-CFG Driven Statistical MT. In Proceedings of HLT/NAACL 2007, pages 500–507.

Zollmann, A. and A. Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of NAACL 2006.



Generación de múltiples hipótesis ponderadas de reordenamiento para un sistema de traducción automática estadística∗

Generating multiple weighted reordering hypotheses for an SMT system

Marta R. Costa-jussà
Universitat Politècnica de Catalunya
Campus Nord 08034 [email protected]

José A. R. Fonollosa
Universitat Politècnica de Catalunya
Campus Nord 08034 [email protected]

Resumen: Los errores debidos al cambio de orden de las palabras son uno de los principales retos en los sistemas de traducción automática estadística (TAE). Esta comunicación propone la estrategia de reordenamiento automático estadístico (RAE) para afrontar el reordenamiento. El método propuesto aprovecha las poderosas técnicas de aprendizaje estadístico desarrolladas en traducción estadística para traducir la lengua fuente (S) a una lengua fuente reordenada (S'), que nos permita mejorar la traducción final a la lengua destino (T). Esta técnica permite extraer un grafo de hipótesis ponderadas de reordenamiento que se utiliza como entrada al sistema TAE. Además, el uso de clases de palabras en la estrategia RAE ayuda a generalizar reordenamientos. En este artículo se presentan resultados en la tarea EPPS en la dirección inglés a español y se muestra una mejora de 2.4 puntos BLEU en la calidad de la traducción.

Palabras clave: traducción automática estadística, grafo de reordenamiento, tuplas

Abstract: Reordering is one of the most important challenges in Statistical Machine Translation (SMT) systems. This paper describes a novel strategy to address it: Statistical Machine Reordering (SMR). It consists of using the powerful techniques developed for Statistical Machine Translation (SMT) to translate the source language (S) into a reordered source language (S'), which allows for an improved translation into the target language (T). This technique makes it possible to extract a weighted reordering graph, which is used as SMT input. In addition, the use of word classes in SMR helps to generalize word reorderings. Experiments are reported on the EPPS task in the English-to-Spanish direction, showing a 2.4 point BLEU improvement in translation quality.

Keywords: statistical machine translation, reordering graph, tuples

1. Introduction

Statistical machine translation (SMT) assumes that a sentence s in a source language can be translated into any sentence t in the target language with some probability. Translation consists precisely in finding the sentence most likely to be a translation of the source sentence. These probabilities are learned mainly from bilingual parallel texts.

∗ This work was partially funded by the Spanish government (FPU grant) and by the AVIVAVOZ and TECNOPARLA projects.

SMT systems tend to use word sequences, called phrases (Koehn, Och, and Marcu, 2003), as the basic units of the translation model, in order to bring context into the model. In parallel with the phrase-based model, the use of n-grams of bilingual tuples (Casacuberta, Vidal, and Vilar, 2002; Mariño et al., 2006) (or translation units) has also been proposed as an alternative that accounts for context with smaller bilingual units. Both systems translate by means of a search that maximizes a combination of the probability assigned to the translation by the translation model itself and other feature functions (Och and Ney, 2002). Equation 1 shows this combination, where h_m are the feature functions and λ_m the weights assigned to each of them.



$$\hat{t} = \arg\max_{t} \left\{ \sum_{m=1}^{M} \lambda_m h_m(t, s) \right\} \qquad (1)$$

In both phrase-based and n-gram-based SMT systems, introducing reordering is crucial. The most direct technique is to allow the translated words not to follow the source-language order during translation. The decoder then permits reorderings according to the criteria of various statistical models, but at the price of a considerable increase in computational cost (Knight, 1999).

Several word reordering strategies have appeared recently that attempt to modify the order of the source sentence so that it matches the order of the target sentence (Kanthak et al., 2005; Crego and Mariño, 2007; Zhang, Zens, and Ney, 2007). In the first case, the possible reorderings are limited using criteria such as the IBM technique. In the second case, reordering rules are extracted directly from the parallel corpus using morphological information, and the least sparse ones are selected by relative frequency. In the third case, the rules are learned using syntactic information. In all cases a graph of reordering hypotheses is generated and used as input to the SMT system. The reordering hypotheses have no probability assigned.

The approach proposed in this paper for word reordering, statistical machine reordering (SMR; RAE in Spanish), is based on the same principles as statistical machine translation (SMT) and shares the same type of decoder. Reordering is treated as a statistical translation from the source language (S) into a reordered source language (S'), and it is trained from the alignment information. Once the reordering probabilities have been estimated, the SMR system reorders the source sentence, which is then passed to the SMT system (Costa-jussà and Fonollosa, 2006). The main novelty of this paper is a new coupling between the SMR and SMT systems. The SMR system can compute either a single reordering hypothesis or a weighted graph of hypotheses. This graph is used as input to the SMT system, and the final reordering decision is made jointly with the translation decision. To improve the generalization capacity of the proposed system, word classes are used instead of words as input to the SMR system.

The paper is organized as follows. Section 2 briefly describes the baseline system. Section 3 describes the proposed reordering strategy in detail. Section 4 presents and discusses the results, and Section 5 concludes.

2. N-gram-based SMT baseline system

The translation model can be understood as a language model of bilingual units (called tuples). These tuples define a monotonic segmentation of the sentence pairs used to train the system, (s_1^J, t_1^I), into K units (u_1, ..., u_K).

When the bilingual units are extracted, each sentence pair yields a sequence of tuples that depends only on the internal word alignments of the sentence pair.

Figure 1 shows an example of tuple extraction.

I would like NULL to eat a huge ice-cream
NULL quisiera ir a comer un helado gigante
t1 t2 t3 t4 t5 t6

Figure 1: Tuple extraction from a word-aligned sentence pair.

To translate an input sentence, the decoder must find the sequence of tuples, associated with a segmentation of the input sentence, that yields the maximum probability. This probability is computed as a linear combination of the models used in the translation system.

The translation model was implemented as an n-gram-based bilingual language model (with N = 4).

The decoder combines the following four feature functions (defined as probabilities):

• a target-language n-gram language model (LM);

• a word bonus based on the number of words in the translation, used to compensate for the decoder's preference for short translations (WB);

• a lexicalized translation model computed using IBM Model 1 lexical probabilities, in both directions (source-to-target and vice versa), which counts as two functions.

3. Statistical machine reordering (SMR) system

3.1. Concept

The statistical machine reordering (SMR) system uses an SMT system to tackle the reordering challenge. An SMR system can therefore be seen as an SMT system that translates from a source language (S) into a modified source language (S'), given a target language (T). The reordering strategy is thus framed as an SS' (source to reordered source) translation task, and the translation task proper changes from ST (source to target) to S'T (reordered source to target). The SMR system uses statistical word classes instead of words in order to generalize the reorderings it learns.

3.2. Description

Figure 2 shows a block diagram of the SMR system. The input is a source sentence (S) and the output is a reordered source sentence (S'). The SMR system consists of three blocks: (1) the class extractor; (2) the decoder, which requires an SMR-LM, that is, a translation model; and (3) the block that reorders the original sentence using the indices output by the decoder (postprocessing).

Figure 2: Block diagram of the SMR system.

3.3. Training

To translate from S to S' we use an SMT system based on n-grams of tuples, considering only the basic translation model. Training this system involves the following steps:

1. Obtain word classes for the source and target languages.

2. Align at the word level. In general, this yields a many-to-many word alignment.

3. Extract reordering tuples.

a) Starting from the union alignment, extract bilingual ST tuples (that is, source and target fragments), keeping the alignment information. An example of a bilingual ST tuple is: only possible compromise # compromiso solo podría # 0-1 1-1 1-2 2-0, where the fields are separated by # and correspond to: (1) the target fragment; (2) the source fragment; and (3) the word alignment (here the fields are separated by - and correspond to the target and source word, respectively).

b) Convert the many-to-many word alignment within each tuple into a many-to-one alignment. If a source word is aligned with two or more target words, the most probable link according to IBM Model 1 is kept and the other links are dropped (that is, the number of source words stays the same before and after the reordering translation). In the previous example, the tuple becomes: only possible compromise # compromiso solo podría # 0-1 1-2 2-0, because P_IBM1(only, solo) is greater than P_IBM1(possible, solo).

c) From the bilingual ST tuples (with the many-to-one word alignment), extract bilingual SS' tuples (a source fragment and its reordering). Continuing the example: compromiso solo podría # 1 2 0, where the first field is the source fragment and the second is the reordering of these source words.

d) Remove the tuples whose source fragment is the word NULL.

e) Replace the words of each source fragment with the classes from step 1.

4. Compute the language model of the sequence of bilingual SS' tuples, each composed of the source fragment (in classes) and its reordering.
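Steps 3(a) to 3(c) can be illustrated with a minimal Python sketch on the tuple used in the example. The function names and the IBM Model 1 probability values are hypothetical and chosen only for illustration; the sketch also assumes that every source word has at least one link.

```python
def parse_st_tuple(line):
    """Split 'target # source # links' into word lists and (target, source) index pairs."""
    target, source, links = (field.strip() for field in line.split("#"))
    pairs = [tuple(int(i) for i in link.split("-")) for link in links.split()]
    return target.split(), source.split(), pairs


def to_many_to_one(target, source, pairs, ibm1_prob):
    """Step (b): for each source word keep only its most probable link."""
    best = {}  # source index -> (target index, probability)
    for t, s in pairs:
        p = ibm1_prob.get((target[t], source[s]), 0.0)
        if s not in best or p > best[s][1]:
            best[s] = (t, p)
    return sorted((t, s) for s, (t, _) in best.items())


def reordering(pairs):
    """Step (c): list the source indices in the order of their target positions."""
    return [s for _, s in sorted(pairs)]


line = "only possible compromise # compromiso solo podría # 0-1 1-1 1-2 2-0"
target, source, pairs = parse_st_tuple(line)
# Hypothetical IBM Model 1 probabilities, invented for this example.
ibm1_prob = {("only", "solo"): 0.6, ("possible", "solo"): 0.1}
links = to_many_to_one(target, source, pairs, ibm1_prob)
order = reordering(links)
print(" ".join(source), "#", " ".join(map(str, order)))
```

With these invented probabilities the link 1-1 is dropped and the printed SS' tuple matches the one in step (c).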

3.4. Using the RAE technique to improve TAE training

The original source corpus S is translated into the reordered source corpus S' with the RAE system. The TAE system is then built on the S'T task instead of the original ST task.

Generación de múltiples hipótesis ponderadas de reordenamiento para un sistema de traducción automática estadística


Figure 3: Tuple extraction.

Figure 3 (A) and (B) show an example of unit extraction built from the same alignment links but with the source sentence ordered differently (the reordered training sentence comes from the output of the RAE system). Although the alignment quality is the same, the extracted tuples differ (note that tuple extraction follows a monotonicity criterion). Shorter tuples can be extracted, which reduces the sparseness of the translation vocabulary.

3.5. Using the RAE technique to generate multiple weighted reordering hypotheses

The RAE system can produce either a single output (RAE1mejor, the 1-best configuration) or a graph of outputs (RAEgrafo), which we call the weighted reordering graph. The corresponding block diagrams are shown in Figure 4 (A) and (B), respectively. The weighted reordering graph extends the search of the TAE system: it contains multiple paths, each with its own weight, and this weight is incorporated as an additional feature function in the combination used by the TAE system.

Figure 4: Concatenation of the RAE and TAE systems: (A) through the 1-best output; (B) through a weighted graph.
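The difference between the two configurations can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph encoding, the log-weights, the `translation_score` function and its values are all hypothetical, and the real decoder searches the graph rather than enumerating its paths.

```python
def enumerate_paths(graph, node, end):
    """Yield (word sequence, accumulated weight) for every path from node to end."""
    if node == end:
        yield [], 0.0
        return
    for word, weight, nxt in graph.get(node, []):
        for words, rest in enumerate_paths(graph, nxt, end):
            yield [word] + words, weight + rest


def best_hypothesis(graph, start, end, translation_score, reorder_weight):
    """Combine the path weight with a translation score and keep the best path."""
    scored = [
        (translation_score(words) + reorder_weight * weight, words)
        for words, weight in enumerate_paths(graph, start, end)
    ]
    return max(scored)[1]


# Toy weighted reordering graph: node -> [(source word, log-weight, next node)].
graph = {
    0: [("fear", -0.4, 1), ("factor", -1.1, 2)],
    1: [("factor", -0.2, 3)],
    2: [("fear", -0.3, 3)],
}

# Hypothetical stand-in for the score the translation models give each ordering.
toy_scores = {("fear", "factor"): -2.5, ("factor", "fear"): -1.0}


def translation_score(words):
    return toy_scores[tuple(words)]


# 1-best configuration: the reordering system decides alone.
one_best = max(enumerate_paths(graph, 0, 3), key=lambda path: path[1])[0]
# Graph configuration: the translation score takes part in the decision.
combined = best_hypothesis(graph, 0, 3, translation_score, reorder_weight=1.0)
print(one_best, combined)
```

In this toy case the 1-best path keeps the original order, while the combined score prefers the reordered path, which is the kind of intervention the graph configuration makes possible.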

4. Experiments

4.1. Corpus

The experiments used the corpus provided for the second evaluation of the Tc-Star project¹ in the English-to-Spanish task. The goal of that project was to build a speech-to-speech translation system able to work in real time. The corpus consists of the official version of the speeches given in the European Parliament Plenary Sessions (EPPS). Table 1 shows its basic statistics: number of sentences, number of words and vocabulary size.

                     Spanish   English
Train  Sentences     1.2M      1.2M
       Words         32M       31M
       Vocabulary    159k      111k
Dev    Sentences     -         1122
       Words         -         26k
Test   Sentences     -         894
       Words         -         26k
       OOV words     -         150

Table 1: EPPS corpus (second Tc-Star evaluation). Train denotes the training set and OOV denotes out-of-vocabulary words.

4.2. System description

We built the n-gram-based TAE system briefly described in Section 2.

The corpora were segmented with standard tools; details can be found in (Costa-jussà and Fonollosa, 2007). GIZA++ (Och and Ney, 2003) was used to align in both the source-to-target and target-to-source directions, and the final alignment was computed as the union of the two. Word classes were trained with the mkcls tool². Language models were trained with SRILM (Stolcke, 2002). The RAE system uses a 5-gram model and 200 statistical classes trained on the training corpus. The TAE system uses a 4-gram translation model and a 3-gram target language model with Kneser-Ney smoothing. Finally, the MARIE³ tool was used as the decoder.

¹ www.tc-star.org
² http://www.fjoch.com/mkcls.html
³ http://gps-tsc.upc.es/veu/soft/soft/marie/



In all experiments, the Simplex algorithm (Nelder and Mead, 1965) was used to optimize the weights of the combination of translation feature functions, with BLEU (Papineni et al., 2002) as the objective function.

4.3. Results

This section presents the experiments carried out and the results obtained when evaluating the influence of the proposed reordering strategy (RAE) on translation quality.

We studied the influence of the proposed RAE reordering on the n-gram-based translation system, which additionally includes the four feature functions described in Section 2: the language model, the word bonus, and the IBM Model 1 lexicon models in both directions. When the RAE system is used in the RAEgrafo configuration, the weight of the reordering hypotheses is added as a further feature function. All feature functions are optimized jointly.

Table 2 shows the results on the test corpus. The RAE technique in its 1-best version yields an improvement of 0.7 BLEU points over the baseline. The graph version yields a further improvement of 1.7 BLEU points, for a total improvement of 2.4 BLEU points over the baseline configuration.

System            BLEU (test)
NB                49.12
RAE1mejor + NB    49.83
RAEgrafo + NB     51.53

Table 2: Translation results. NB is the baseline system.

4.4. Discussion

The improvement in translation quality is visible in the BLEU results, and Figure 5 additionally shows, through some translation examples, how the final translation changes. The improvement may stem from two main causes. First, the task is made more monotonic and, as is well known, more monotonic translation tasks yield more reliable translations, because word-level alignments are easier to learn and translation units tend to be shorter and less sparse (see Figure 3).

Second, the reordering hypotheses are learned with powerful translation techniques, using word classes that allow generalization. In the RAE1mejor configuration, the reordering step offers a single possible reordering and the TAE system takes no part in the final reordering decision. In the RAEgrafo configuration, by contrast, the reordering step offers several alternatives and the TAE system can take part in the final reordering decision. The weights of the TAE feature functions (especially the target language model), together with the weight of the reordering function carried by the graph, are optimized using BLEU.

5. Conclusions

In this paper we have proposed a solution to the word reordering challenge in a statistical machine translation (TAE) system. The proposed technique has been described and tested on a current TAE system based on n-grams of tuples, and it can be applied in a similar way to a phrase-based TAE system.

The statistical machine reordering system (RAE) is applied before the statistical translation system. Both systems, RAE and TAE, rely on the same principles and share the same type of decoder. When the bilingual translation units are extracted, the change of order made in the source sentence improves their modelling, since the translation units are now shorter.

Moreover, learning the reordering as a preprocessing step, independently of the translation system itself, yields an efficient final system and translation. The proposed strategy also allows word classes to be used in reordering in order to infer reorderings not seen during training. Modelling reordering as a translation from a source language into a monotonized source language makes it possible to extract either a single reordering (1-best configuration) or a weighted reordering graph (graph configuration). In the first case, a deterministic reordering is proposed in which the TAE system takes no part. In the second case, building a graph of weighted reordering hypotheses makes the RAE technique more robust, since the reordering is decided at the same time as the translation. The results show a total improvement of 2.4 BLEU points on the EPPS English-to-Spanish task.


SOURCE: to remove the fascist or military dictatorships
BASELINE (NB): a eliminar los fascistas o dictaduras militares
RAE1mejor: para eliminar las dictaduras fascistas o militares
RAEgrafo: para eliminar las dictaduras fascistas o militares
REFERENCE 1: con el fin de acabar con las dictaduras militares
REFERENCE 2: para derrumbar las dictaduras fascistas o militares

SOURCE: and the totalitarian dictatorships which then ruled much of eastern and central Europe
BASELINE (NB): y el totalitario dictaduras que luego dictamino mucho de oriental y del centro de Europa
RAE1mejor: y el totalitario dictaduras que entonces gobernaba mucho de la Europa central y oriental
RAEgrafo: y las dictaduras totalitarias que entonces gobernaba mucho de la Europa central y oriental
REFERENCE 1: así como con las dictaduras totalitarias que controlaban en aquel momento gran parte de la Europa Central y del Este
REFERENCE 2: y las dictaduras totalitarias que gobernaban en gran parte de Europa central y del este

SOURCE: exploit the fear factor in this matter
BASELINE (NB): explotar el miedo factor en este asunto
RAE1mejor: explotar el miedo factor en este asunto
RAEgrafo: explotar el factor miedo en este asunto
REFERENCE 1: explotar el factor miedo en este asunto
REFERENCE 2: abusar del factor miedo en este asunto

Figure 5: Translation examples.

References

Casacuberta, F., E. Vidal, and J.M. Vilar. 2002. Architectures for speech-to-speech translation using finite-state models. Proc. of the Workshop on Speech-to-Speech Translation: Algorithms and Systems, pages 39–44, July.

Costa-jussà, M.R. and J.A.R. Fonollosa. 2006. Statistical machine reordering. In EMNLP, pages 71–77, Sydney, July. ACL.

Costa-jussà, M.R. and J.A.R. Fonollosa. 2007. Analysis of statistical and morphological classes to generate weighted reordering hypotheses on a statistical machine translation system. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 171–176, Prague, June. ACL.

Crego, J.M. and J.B. Mariño. 2007. Improving SMT by coupling reordering and decoding. Machine Translation, 20(3):199–215.

Kanthak, S., D. Vilar, E. Matusov, R. Zens, and H. Ney. 2005. Novel reordering approaches in phrase-based statistical machine translation. In Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, pages 167–174, Ann Arbor, MI, June.

Knight, K. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4), December.

Koehn, P., F.J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. of the Human Language Technology Conference, HLT-NAACL'2003, pages 48–54, Edmonton, Canada, May.

Mariño, J.B., R.E. Banchs, J.M. Crego, A. de Gispert, P. Lambert, J.A.R. Fonollosa, and M.R. Costa-jussà. 2006. N-gram based machine translation. Computational Linguistics, 32(4):527–549, December.

Nelder, J.A. and R. Mead. 1965. A simplex method for function minimization. The Computer Journal, 7:308–313.

Och, F.J. and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics, pages 295–302, Philadelphia, USA, July.

Och, F.J. and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, March.

Papineni, K., S. Roukos, T. Ward, and W-J. Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA, July.

Stolcke, A. 2002. SRILM - an extensible language modeling toolkit. In Proc. of the 9th Int. Conf. on Spoken Language Processing, ICSLP'02, pages 901–904, Denver, USA, September.

Zhang, Y., R. Zens, and H. Ney. 2007. Chunk-level reordering of source language sentences with automatically learned rules for statistical machine translation. In Proc. of the Workshop on Syntax and Structure in Statistical Translation (SSST), pages 1–8, Rochester, April.



Mining Term Translations from Domain Restricted Comparable Corpora

Extracción de Traducciones de Términos a partir de Corpus Comparables pertenecientes a áreas específicas

Xabier Saralegi
Elhuyar R&D
Zelai Haundi kalea, 3
20170 Usurbil
[email protected]

Iñaki San Vicente
Elhuyar R&D
[email protected]

Maddalen López de Lacalle
Elhuyar R&D
[email protected]

Abstract: Several approaches have been proposed in the literature for extracting word translations from comparable corpora, almost all of them based on the idea of context similarity. This work addresses the aforementioned issue for the Basque-Spanish pair in a popular science domain. The main tasks our experiments focus on include: designing a method to combine some of the existing approaches; adapting this method to a popular science domain for the Basque-Spanish pair; and analyzing the performance of different approaches both for translating the contexts of the words and computing the similarity between contexts. We finally evaluate the different prototypes by calculating the precision for different cutoffs. The yielded results show the validity of the designed hybrid method, as well as the improvement obtained by using the probabilistic models we propose for computing the similarity between contexts.

Keywords: Bilingual Terminology Extraction, Comparable Corpora, Machine Translation.

Resumen: En la literatura se han propuesto diferentes estrategias para la tarea de extracción automática de traducciones a partir de corpus comparables, estando basadas la mayoría de ellas en la idea de similitud entre contextos. Este trabajo aborda la citada tarea para el par de lenguas Euskera-Castellano y el género científico-divulgativo. Los principales puntos en los que se centra este trabajo son los siguientes: diseñar un método que combine las existentes aproximaciones; adaptar este método al par de lenguas Euskera-Castellano y al género científico-divulgativo; y por último analizar el comportamiento de distintas técnicas tanto para el proceso de traducción de contextos como el cálculo de similitud entre ellos. Finalmente, evaluaremos los diferentes prototipos implementados de acuerdo a la precisión obtenida para distintos cutoffs. Los resultados obtenidos muestran que el método híbrido diseñado resulta adecuado y una mejora para el cálculo de similitudes entre contextos mediante los modelos probabilísticos propuestos.

Palabras clave: Extracción de Terminología Bilingüe, Corpus Comparables, Traducción Automática.

1 Introduction

In the literature, several strategies have been proposed for extracting lexical equivalences from corpora. Most of them are designed to be used with parallel corpora. Although these kinds of corpora give the best results, they are a scarce resource, especially when we want to deal with certain language pairs and certain domains and genres. To overcome this limitation the first algorithms (Rapp, 1995), (Fung, 1995) were developed for automatic extraction of translation pairs from comparable corpora. These kinds of corpora can be easily built from the Internet.



The techniques proposed for the extraction task are mainly based on the idea that translation equivalents tend to co-occur within similar contexts. An alternative is to detect translation equivalents by means of string similarity (cognates). Nevertheless, none of these techniques achieve the precision and recall obtained with the parallel corpora techniques.

This work focuses on the Basque-Spanish pair and popular-science domain. We channeled our efforts towards designing a hybrid approach by combining the methods proposed in the literature, adapting it to the scenario, and analyzing the performance of different strategies for the two main steps of the extraction approach based on context similarity: translation of the context of the source word to the target language, and calculation of the similarity between contexts. On the one hand we have compared a number of methods for resolving the two main problems in this first phase, which are translation selection and treatment of Out of Vocabulary (OOV) words. On the other, we have tested different models for representing contexts and different ranking algorithms to calculate the similarity between contexts.

Finally it must be said that this work is the continuation of the research started in (Saralegi, San Vicente and Gurrutxaga, 2008), focusing on the Basque-English pair.

2 Comparable Corpora

Comparable multilingual corpora are defined as collections of documents sharing certain characteristics and written in more than one language. In bilingual lexicon extraction some of these characteristics depend on the lexicon type we aim to extract. Thus, achieving a high degree of comparability with regard to these characteristics is very important, since context similarity techniques will be more effective. The more similar the corpora are, the higher the comparability between the collocated words of the equivalent translations (Morin et al., 2007). Therefore, it is essential to ensure that some characteristics are equal in both parts of the corpora built for terminology extraction purposes.

3 Identification of Equivalents

3.1 Context Similarity

The main method is based on the idea that the same concept tends to appear with the same context words in both languages, that is, it maintains many collocates. The methods based on context similarity consist of two steps: modeling of the contexts, and calculation of the degree of similarity using a seed bilingual lexicon (Rapp, 1999), (Fung, 1998).

The majority of the methods for modeling are based on the “bag-of-words” paradigm. Thus, the contexts are represented by weighted collections of words. In fact, the context similarity calculation tasks can be seen as a Cross Language Information Retrieval (CLIR) problem. Therefore, all the paradigms proposed in the CLIR literature can be useful in this context. There are several techniques for determining which words make up the context of a word: distance-based window, syntactic based-window, etc.

Different models have been proposed to represent the context of words. The most widely used combines the Vector Space Model and Association Measures (AM) for establishing the weight of the context words with regard to a word: Log-likelihood ratio (LLR), Mutual Information, Dice coefficient, Jaccard measure, frequency, tf-idf, etc. After representing word contexts in both languages, the proposed algorithms compute for each source word a ranking of translation candidates according to the similarity between its context vector and the context vectors of all the target words. The similarity score is computed by means of measures such as Cosine, Jaccard or Dice.

Nevertheless, few works exploit the recent advances made in the CLIR community, in particular translation selection techniques and probabilistic models. Shao et al. (2004) is one example of the use of probabilistic models: it represents the contexts by means of language models. Other probabilistic retrieval models proposed for IR tasks, which can also be of use in context similarity calculation, are Okapi (Robertson, Walker and Beaulieu, 1998) and Divergence From Randomness (DFR) (Amati and Van Rijsbergen, 2002). Okapi (BM25) represents the state of the art in IR and is often used as a baseline. The DFR paradigm is, like Okapi, a generalization of Harter's 2-Poisson indexing model (Harter, 1974), and it offers a range of different models. The Terrier¹ toolkit offers many of these DFR models as well as others, such as tf-idf, Okapi and language models.

¹ http://ir.dcs.gla.ac.uk/terrier/

3.2 Context words translation

To be able to compute the similarity, the context vectors are put in the same space by translating one of them. The methods proposed in the literature for translation in CLIR tasks can be divided into two main groups (Hull, 1997): corpus-based methods and dictionary-based methods. Corpus-based methods use parallel and sometimes comparable corpora for mining query translations. Unfortunately, parallel corpora constitute a scarce resource and the results obtained using comparable corpora are still poor. On the other hand, dictionary-based methods use a bilingual dictionary to look up the translations of the components of the query. However, the dictionary poses two main problems: it does not resolve ambiguous translations and it has a coverage problem (OOV).

3.3 Translation selection

Many algorithms have been proposed for dealing with the translation ambiguity that arises when queries are translated with bilingual dictionaries. The simplest method is to select the first translation given by the dictionary as the best, since translations are often sorted by frequency of use. However, this approach fails to take into account the domain of the query, so the disambiguation can be very rigid. Other, more flexible approaches (Pirkola, 1998), which perform better, take all the translations and group them as a single word when the TF and DF values of the document words are calculated by the ranking method. The syn operator offered by the Indri and Inquery query languages allows this type of grouping (Pirkola, 1998).

Other, more complex approaches (Ballesteros and Croft, 1998) (Liu, Jin, and Chai, 2005) (Chen, Bian, and Lin, 1999) (Gao and Nie, 2006), which also use statistical information on monolingual word co-occurrences, are based on the degree of cohesion or association between the translation candidates. They try to obtain the combination of translation candidates that maximizes the coherence between them. A corpus in the target language is used to compute association scores.

3.4 Cognates

Another technique proposed in the literature, also useful for the treatment of OOV words, is the identification of translations by means of cognates (Al-Onaizan and Knight, 2002). This method could be appropriate in a science domain where the presence of cognates is high. In fact, using a Basque-Spanish technical dictionary we were able to calculate automatically that around 26% of the translation pairs were cognates. Dice coefficient and LCSR (Longest Common Subsequence Ratio) measures are proposed for computing string similarity.
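As an illustration of the LCSR measure, the following minimal Python sketch divides the length of the longest common subsequence of two strings by the length of the longer string, which is the usual definition. The function names are ours, and the word pairs are only examples.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Longest Common Subsequence Ratio: LCS length over the longer string's length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


print(lcsr("biologia", "biología"))  # Basque and Spanish spellings
print(lcsr("akta", "acta"))
```

The ratio lies between 0 and 1, so a fixed threshold can be used to decide whether two words are treated as cognates.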

4 Experiments

4.1 Term Extraction from Comparable Corpora

4.1.1 Preprocess

We needed to identify the words we considered to be meaningful for our process, that is, content words. POS tags were used for this task. TreeTagger² is the tagger we chose to tag the Spanish corpus, and Eustagger³ in the case of the Basque corpus. Only nouns, adjectives and verbs are regarded as content words. In our experiments, adverbs were found to produce noise. Proper nouns also produced noise due to a cultural bias effect. Both were removed.

4.1.2 Contexts Construction

We established a distance-based window to delimit the contexts of the words. The window size was determined empirically: 10 words for Basque (plus and minus 5 around a given word) and 14 for Spanish (plus and minus 7). Furthermore, our experiments showed that using punctuation marks to delimit the window improved the results. So, we also included this technique in our system.
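A minimal Python sketch of such a window follows, assuming a pre-tokenized text and a small, hypothetical set of punctuation marks (the exact set used in the experiments is not stated). `half_window` would be 5 for Basque and 7 for Spanish.

```python
from collections import Counter, defaultdict

# Hypothetical set of delimiters; the actual set is an assumption.
PUNCTUATION = {".", ",", ";", ":", "!", "?"}


def build_contexts(tokens, half_window):
    """Count, for each word, the words found within half_window positions on each
    side, stopping early when a punctuation mark is reached."""
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if word in PUNCTUATION:
            continue
        for step in (-1, 1):  # look left, then right
            j, taken = i + step, 0
            while 0 <= j < len(tokens) and taken < half_window:
                if tokens[j] in PUNCTUATION:
                    break  # punctuation closes the window on this side
                contexts[word][tokens[j]] += 1
                taken += 1
                j += step
    return contexts


tokens = ["a", "b", "c", ",", "d", "e"]
contexts = build_contexts(tokens, half_window=2)
print(dict(contexts["c"]), dict(contexts["d"]))
```

In the toy sequence the comma prevents "c" and "d" from entering each other's context even though they are within the window distance.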

We calculated the weight of the words within the context by means of absolute frequency, LLR, Dice coefficient or Jaccard measure, and then the contexts were modeled in a vector space. The best results were achieved by using the LLR.
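For reference, the log-likelihood ratio of a word and a context word can be computed from a 2x2 contingency table of co-occurrence counts. The sketch below uses the standard formulation (twice the sum of observed counts times the log of observed over expected); whether the experiments used exactly this variant is an assumption, and the cell names are ours.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio of a 2x2 contingency table.

    k11: windows containing both words, k12: first word only,
    k21: second word only, k22: neither word.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing
                expected = rows[i] * cols[j] / total
                score += o * math.log(o / expected)
    return 2.0 * score


print(llr(10, 10, 10, 10))  # independent words
print(llr(20, 0, 0, 20))    # words that always occur together
```

The score is zero when the two words are independent and grows with the strength of the association, so it can be used directly as the weight of a context word.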

² http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger

³ A POS tagger for Basque developed by the IXA group of the University of the Basque Country.



We also modeled the contexts of words by using different probabilistic models offered by the Terrier toolkit. Specifically, we carried out tests with two models, Okapi and PL2, the latter being an instantiation of the DFR framework appropriate for tasks that require high precision. We indexed the context words of each word as a document: all the words that occur in the contexts of a given word throughout the collection are gathered into a single document, which is then indexed.

4.1.3 Context Translation

To compute the translation of a Basque word, we translated its contexts in order to make them comparable with Spanish contexts. A bilingual Machine Readable Dictionary (MRD) was used for this purpose.

4.1.4 OOV words

The recall of the MRD determines how representative the translated context vector is. In our experiments with a general dictionary, the average translation recall per vector was 55%. The higher the recall, the greater the chance of finding the right translation for a word, because the context vectors hold more detailed information about the word in question.

To increase the recall of our translated vectors, we tried to find equivalents not included in the dictionary by means of cognates. For all the OOV words, we looked for cognates among all the context words in the target language. These cognates were identified by calculating the LCSR between the Basque and Spanish context words. Before applying the LCSR, we applied some typographic rules to both candidates in order to normalize n-grams with the same pronunciation (e.g., c→k, acta=akta) or with regular transformations (e.g., -ción→-zio, acción=akzio). The candidates that exceeded the empirically determined threshold of 0.8 were taken as translations.

4.1.5 Translation selection methods

One of the problems of using bilingual dictionaries to translate contexts is that the dictionary returns many translation candidates. This causes problems for the subsequent calculation of the similarity between contexts: incorrect translations distort the modeling of the context, and hence the semantic representation of the word and the distribution of the context words. Therefore, techniques for choosing the correct translations can help in this task. In this work we propose two techniques used in CLIR systems that do not need parallel corpora:

First translation: The first translation of an entry is usually the most probable. Although this fact can vary depending on the domain, taking the first translation is a general translation selection method.

Co-occurrence-based translation: The best translations are selected by using a method based on co-occurrences. The basic idea is that the degree of association between the correct translations is higher than between other translations. The algorithm tries to obtain the combination of translation candidates that maximizes that degree of association. The algorithm we use to obtain that combination is a greedy one because of the NP-hard nature of the problem. Some independence assumptions between translation candidates are adopted. Specifically we have used the algorithm proposed by (Gao et al., 2001):

(1) Given a Basque (source language) query $e = \{e_1, e_2, \ldots, e_n\}$, for each query term $e_i$ we define a set of $m$ distinct Spanish translations according to a bilingual dictionary $D$:

$$D(e_i) = \{c_{i,1}, c_{i,2}, \ldots, c_{i,m}\}$$

(2) For each set $D(e_i)$:

(a) For each translation $c_{i,j} \in D(e_i)$, define the similarity score between the translation $c_{i,j}$ and a set $D(e_k)$ ($k \neq i$) as the sum of the similarities between $c_{i,j}$ and each translation in the set $D(e_k)$, according to Eq. (1):

$$\operatorname{am}(c_{i,j}, D(e_k)) = \sum_{c_{k,l} \in D(e_k)} \operatorname{am}(c_{i,j}, c_{k,l}) \tag{1}$$

(b) Compute the cohesion score for $c_{i,j}$ as

$$\operatorname{cohesion}(c_{i,j} \mid e, D) = \log \sum_{D(e_k)} \operatorname{am}(c_{i,j}, D(e_k)) \tag{2}$$

(c) Select the translation $c \in D(e_i)$ with the highest cohesion score:

$$c = \arg\max_{c_{i,j} \in D(e_i)} \operatorname{cohesion}(c_{i,j} \mid e, D) \tag{3}$$
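Steps (1) and (2) can be sketched in a few lines of Python. This is an illustrative reading of the equations, not the original implementation: the association measure `am` is passed in as a function, the candidate words and their scores are invented, and a candidate with no association at all is given a cohesion of minus infinity.

```python
import math


def cohesion(candidate, i, translation_sets, am):
    """Eq. (2): log of the summed association between a candidate of term i
    and every candidate of every other term (Eq. (1) is the inner sum)."""
    total = sum(
        am(candidate, other)
        for k, others in enumerate(translation_sets)
        if k != i
        for other in others
    )
    return math.log(total) if total > 0 else float("-inf")


def select_translations(translation_sets, am):
    """Eq. (3): for each term keep the candidate with the highest cohesion."""
    return [
        max(candidates, key=lambda c: cohesion(c, i, translation_sets, am))
        for i, candidates in enumerate(translation_sets)
    ]


# Invented association scores for a two-term query with two candidates each.
toy_scores = {
    frozenset(("banco", "dinero")): 2.0,
    frozenset(("banco", "plata")): 0.5,
    frozenset(("orilla", "dinero")): 0.1,
    frozenset(("orilla", "plata")): 0.1,
}


def toy_am(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)


translation_sets = [["orilla", "banco"], ["plata", "dinero"]]
selected = select_translations(translation_sets, toy_am)
print(selected)
```

Each term is resolved independently of the choices made for the other terms, which is what keeps the procedure greedy rather than a search over all candidate combinations.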

We used a Spanish corpus of 10M words obtained from Madri+d to calculate the co-occurrences for the target collection. We adopted the Mutual Information measure to calculate the degree of association at document level between translation candidate pairs.

4.1.6 Context Similarity Calculation

To obtain a ranked list of the translation candidates for a Basque word, we calculated the similarity between its translated context vector and the context vectors of the Spanish words by using two different ranking methods: cosine distance when the contexts are weighted by LLR, and the aforementioned ranking models (Okapi and PL2) when the contexts are represented with probabilistic models.
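For the vector space case, the ranking can be sketched as follows, with sparse vectors stored as dictionaries from context word to weight. The vectors and candidate names are toy values invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_vector, target_vectors):
    """Sort target words by decreasing similarity to the translated source vector."""
    return sorted(
        target_vectors,
        key=lambda word: cosine(source_vector, target_vectors[word]),
        reverse=True,
    )


# Toy translated context vector of a source word and three target candidates.
source_vector = {"x": 1.0, "y": 1.0}
target_vectors = {
    "t3": {"z": 1.0},
    "t2": {"x": 1.0},
    "t1": {"x": 1.0, "y": 1.0},
}
ranking = rank_candidates(source_vector, target_vectors)
print(ranking)
```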

Furthermore, to prevent noise candidates in both strategies, after obtaining the rankings, we pruned those that had a different grammatical category from that of the word to be translated.

4.1.7 Equivalent Similarity Calculation

In addition to context similarity, string similarity between source words and equivalent candidates is also used to rank candidates. LCSR is calculated between each source word and its first 100 translation candidates in the ranking obtained after context similarity calculation. LCSR is applied in the same way as in context vector translation.

When used in combination with context similarity, LCSR data is used as the last ranking criterion. The candidates that exceeded an empirically established threshold (0.8) are ranked first, while the position in the ranking of the remaining candidates remains unchanged. A drawback to this method is that cognate translations are promoted over translations based on context vector similarity.
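The combination can be sketched as a simple re-ranking step. In this hedged Python sketch the string similarity scores are assumed to be precomputed (for instance with LCSR), the function and parameter names are ours, and candidates promoted as cognates keep their relative context-similarity order, which is an assumption the text does not settle.

```python
def rerank_with_cognates(ranked, string_similarity, threshold=0.8, top_n=100):
    """Move candidates whose string similarity exceeds the threshold to the top,
    leaving the relative order of all other candidates unchanged."""
    head, tail = ranked[:top_n], ranked[top_n:]
    cognates = [c for c in head if string_similarity.get(c, 0.0) > threshold]
    others = [c for c in head if string_similarity.get(c, 0.0) <= threshold]
    return cognates + others + tail


# Toy ranking from context similarity and invented string similarity scores.
ranked = ["c1", "c2", "c3", "c4"]
string_similarity = {"c1": 0.5, "c3": 0.9}
reranked = rerank_with_cognates(ranked, string_similarity)
print(reranked)
```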

5 Evaluation

5.1 Building Test Corpora

We built one test corpus. The sources of the documents were the science news websites Zientzia.net⁴ (Basque) and Madri+d⁵ (Spanish).

Zientzia.net and Madri+d are quite similar with respect to the distribution of topics and register, so we chose them to build the test corpus. A correlation between topic and date was expected, and for that reason we downloaded only news items published between 2000 and 2008. Moreover, other types of documents, such as dossiers, were rejected in order to maintain the same register throughout the corpus. Finally, the HTML documents were cleaned and converted into text using Kimatu (Saralegi and Leturia, 2007). The size of this corpus was 1.092 million tokens for Basque and 1.107 million for Spanish. We mapped the domain categories of the two sites onto each other in order to compare the distribution of documents among domains (Table 1). The distribution of the documents among the domains was quite similar, so we expected an acceptable degree of comparability between the two corpora.

⁴ http://www.zientzia.net
⁵ http://www.madridmasd.org

Domain                                  Madri+d   Zientzia.net
Biology, food, Agriculture & fishing    36.59%    24.31%
Health                                   9.73%    16.26%
Earth sciences                           6.12%    10.44%
Physics, Chemistry & Math                6.65%     7.18%
Technology & Industry                   29.45%    24.15%
Energy & Environment                    11.45%     7.35%

Table 1: Domain distribution of documents for the test corpus.

               #word               #doc
corpus         eu       es         eu     es
Test corpus    1,092K   1,107K     2521   1242

Table 2: Characteristics of test corpora

5.2 Tests

For the automatic evaluation of our system, we needed a list of Basque-Spanish equivalent terms that occur in each part of the corpora and that are not included in the dictionary used to translate content words during the construction of context vectors. To build the list, we first took all the Basque content words obtained in the preprocessing step for the two corpora that had been built. Secondly, those words were looked up in the Basque-Spanish Elhuyar dictionary6, and from the Basque words not included in that dictionary we randomly selected 200 pairs of words that had a minimum frequency (10) and that appeared in one of two Basque-Spanish terminology dictionaries (Elhuyar Science and Technology Dictionary7 and Euskalterm terminology bank8).

6 An abridged version of the Elhuyar Spanish/Basque dictionary including 20,000 entries.

This enabled us to estimate the precision automatically. We computed, for each source word, the precision of the ranked translation candidates at different cutoff points. We took as correct translation only the one included in the test list as the Spanish translation of the source Basque word. In order to analyze the impact the frequency has on the results, we divided this set into two subsets. The first one includes words of high frequency (>50), and the other one, medium-low frequency words (within the 10-30 frequency range).
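As an illustration of this evaluation, the sketch below computes the mean precision at several cutoff points. It assumes that the precision at cutoff n for a source word is 1 when its single reference translation appears among the top n candidates and 0 otherwise; the function name and the toy rankings are hypothetical and are not taken from the experiments.

```python
def precision_at_cutoffs(rankings, gold, cutoffs=(1, 5, 10, 15, 20)):
    """Mean precision at each cutoff.

    rankings: dict mapping each source word to its ranked candidate list.
    gold:     dict mapping each source word to its single reference translation.
    """
    if not rankings:
        return {n: 0.0 for n in cutoffs}
    hits = {n: 0 for n in cutoffs}
    for word, candidates in rankings.items():
        for n in cutoffs:
            # A hit means the reference translation is within the top n candidates.
            if gold[word] in candidates[:n]:
                hits[n] += 1
    return {n: hits[n] / len(rankings) for n in cutoffs}


# Toy data, for illustration only.
toy_rankings = {
    "zelula": ["tejido", "célula", "gen"],
    "proteina": ["proteína", "gen"],
}
toy_gold = {"zelula": "célula", "proteina": "proteína"}
toy_precision = precision_at_cutoffs(toy_rankings, toy_gold, cutoffs=(1, 2))
```

Splitting the test list by source-word frequency before calling the function would give the separate figures for high and medium-low frequency words.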

We analyzed different variables: the modeling of the contexts, the translation methods, and the way to combine the different approaches:

• Modeling of contexts and similarity computation: LLR and cosine, and probabilistic models: Okapi (b=0.75) and PL2 (c=1).

• Translation methods: cognate detection for treatment of OOV words in the context translation step, first translation selection, and concurrences based selection methods.

• Ranking of translation candidates: context similarity, cognates detection.

5.3 Results

The following tables show the results for the test corpora.

Mean precision
           Top 1   Top 5   Top 10   Top 15   Top 20
LLR+cos    0.27    0.52    0.62     0.65     0.65
Okapi      0.34    0.47    0.60     0.65     0.69
PL2        0.37    0.50    0.61     0.68     0.73

Table 3: Precision results for high frequency test words. Context similarity (cosine+LLR, Okapi, PL2) combined with first translation selection.

7 Encyclopaedic dictionary of science and technology including 15,000 entries in Basque with equivalences in Spanish, French and English.

8 Terminological dictionary including 100,000 terms in Basque with equivalences in Spanish, French, English and Latin.

Mean precision
           Top 1   Top 5   Top 10   Top 15   Top 20
LLR+cos    0.07    0.15    0.17     0.18     0.23
Okapi      0.05    0.12    0.17     0.21     0.23
PL2        0.06    0.16    0.21     0.23     0.24

Table 4: Precision results for low frequency test words. Context similarity (cosine+LLR, Okapi, PL2) combined with first translation selection.

Mean precision
                            Top 1   Top 5   Top 10   Top 15   Top 20
PL2 + First                 0.37    0.50    0.61     0.68     0.73
PL2 + Coo                   0.37    0.50    0.64     0.68     0.72
PL2 + First + Cog           0.30    0.54    0.59     0.72     0.74
PL2 + Coo + Cog             0.32    0.55    0.67     0.71     0.74
PL2 + Coo + Cog + Cog-re    0.38    0.61    0.72     0.75     0.78

Table 5: Precision results for high frequency test words. Context similarity (PL2) combined with first translation (First), concurrences based selection (Coo), cognates detection for vector translation (Cog) and re-ranking (Cog-re).

Mean precision
                            Top 1   Top 5   Top 10   Top 15   Top 20
PL2 + First                 0.06    0.16    0.21     0.23     0.24
PL2 + Coo                   0.07    0.13    0.19     0.22     0.22
PL2 + First + Cog           0.05    0.16    0.23     0.25     0.26
PL2 + Coo + Cog             0.06    0.18    0.19     0.22     0.25
PL2 + Coo + Cog + Cog-re    0.28    0.40    0.39     0.46     0.45

Table 6: Precision results for low frequency test words. Context similarity (PL2) combined with first translation (First), concurrences based selection (Coo), cognates detection for vector translation (Cog) and re-ranking (Cog-re).



We have observed that combining the identification of cognates in the list of equivalents with context similarity (as proposed in section 4.1.7) improves the precision of the final ranking. The high presence of these kinds of translations explains this improvement.

On the other hand, the results obtained for the low frequency words are poorer than those obtained for the high frequency words, as we expected.

The detection of cognates in the translation of the context vectors slightly outperforms translation based exclusively on dictionaries.

The probabilistic models Okapi and PL2 perform much better than the LLR+cosine combination for calculating the context similarity. Between Okapi and PL2, the latter is more appropriate.

As for the translation selection methods, there is little difference, but the first translation selection yields better results. This may be due to the short length of the contexts, or to the nature of the contexts. The contexts used as queries contain fewer specific words than topic queries. This could make the translation selection process more difficult.

6 Conclusions

We have performed the first experiments aimed at terminology extraction from comparable corpora by integrating different existing techniques and adapting them for a new language pair.

The combination of the cognate detection in the final ranking as well as in the translation process of the context vectors seems suitable for corpora of the science domain, in which the presence of cognates is high, as we saw for the Basque-English pair.

On the other hand, the concurrences-based algorithm has not improved the quality of the translations achieved with the first translation selection method. This means that the selection method adapted to the context sentences is worse than the general selection method. Nonetheless, further experiments will be carried out in order to explore these results in greater depth and to fine-tune the concurrences-based algorithm.

Finally, the representation of contexts and calculation of similarity is improved by using more advanced probabilistic models like Okapi and PL2.

References

Amati, G. and C.J. Van Rijsbergen. 2002. “Probabilistic models of information retrieval based on measuring divergence from randomness.” In the Transactions on Information Systems journal, vol. 20, issue 4, pp.357-389.

Al-Onaizan, Y. and K. Knight. 2002. “Machine transliteration of names in Arabic text.” In Proceedings of the ACL-02 workshop on Computational approaches to Semitic languages, pp.1-13.

Ballesteros, L. and W. B. Croft. 1998. “Resolving ambiguity for cross-language retrieval.” In Proceedings of SIGIR, pp. 64-71.

Chen, Hsin-Hsi, Guo-Wei Bian, and Wen-Cheng Lin. 1999. “Resolving translation ambiguity and target polysemy in cross-language information retrieval.” In Proceedings of ACL, pp.215-222.

Déjean, H., E. Gaussier and F. Sadat. 2002. “An Approach Based on Multilingual Thesauri and Model Combination for Bilingual Lexicon Extraction.” In Proceedings of COLING.

Fung, P. 1995. “Compiling Bilingual Lexicon Entries from a Non-Parallel English-Chinese Corpus.” In Proceedings of the Third Workshop on Very Large Corpora, pp.173-183.

Fung, P. and L. Y. Yee. 1998. “An IR Approach for Translating New Words from Nonparallel Comparable Texts.” In Proceedings of COLING-ACL 1998, pp.414-420.

Gao, J. and J. Nie. 2006. “A study of statistical models for query translation: finding a good unit of translation.” In Proceedings of SIGIR, pp.194-201.

Gao, J., J. Nie, E. Xun, J. Zhang, M. Zhou and C. Huang. 2001. “Improving query translation for cross-language information retrieval using statistical models.” In Proceedings of SIGIR, pp.96-104.

Kilgarriff, A. and T. Rose. 1998. “Measures for corpus similarity and homogeneity.” In Proceedings of the 3rd EMNLP conference, pp.46-52.



Hull, D. A. 1997. “Using structured queries for disambiguation in cross-language information retrieval.” In Working Notes of AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pp.73-81.

Liu, Y., R. Jin, and J. Y. Chai. 2005. “A maximum coherence model for dictionary-based cross-language information retrieval.” In Proceedings of SIGIR, pp.536-543.

Morin, E., B. Daille, K. Takeuchi and K. Kageura. 2007. “Bilingual Terminology Mining - Using Brain, not brawn comparable corpora.” In Proceedings of ACL, pp.664-671.

Pirkola, A. 1998. “The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval.” In Proceedings of SIGIR, pp.55–63.

Rapp, R. 1995. “Identifying word translations in non-parallel texts.” In Proceedings of ACL, pp.320-322.

Rapp, R. 1999. “Automatic identification of word translations from unrelated English and German corpora.” In Proceedings of ACL, pp.519-526.

Robertson, S. E., S. Walker and M. Beaulieu. 1998. “Okapi at trec-7: Automatic ad hoc, filtering, vlc and interactive.” In Proceedings of TREC, pp.199-210.

Saralegi, X. and I. Leturia. 2007. “Kimatu, a tool for cleaning non-content text parts from html docs.” In Building and exploring web corpora, Proceedings of the 3rd Web as Corpus workshop, pp. 163-167.

Saralegi, X., I. San Vicente and A. Gurrutxaga. 2008. “Automatic extraction of bilingual terms from comparable corpora in a popular science domain.” In Proceedings of the Building and Using Comparable Corpora workshop (LREC08).

Shao, L. and H.T. Ng. 2004. “Mining New Word Translations from Comparable Corpora.” In Proceedings of COLING, pp. 618-624.

Rayson, P. and R. Garside. 2000. “Comparing corpora using frequency profiling.” In Proceedings of the workshop on Comparing Corpora (38th ACL), pp.1–6.



Bilingual Terminology Extraction based on Translation Patterns

Extracción de terminología bilingüe con base en reglas de traducción

Alberto Simões
Departamento de Informática
Universidade do Minho
Braga, Portugal
[email protected]

José João Almeida
Departamento de Informática
Universidade do Minho
Braga, Portugal
[email protected]

Resumen: Los corpora paralelos son fuentes ricas en recursos de traducción. Este documento presenta una metodología para la extracción de sintagmas nominales bilingües (candidatos terminológicos) a partir de corpora paralelos, utilizando reglas de traducción. Los modelos propuestos en este trabajo especifican las alteraciones en el orden de las palabras que se producen durante la traducción y que son intrínsecos a la sintaxis de las lenguas implicadas. Estas reglas se describen en un lenguaje de dominio específico llamado PDL (Pattern Description Language) y son sumamente eficientes para la detección de sintagmas nominales.

Palabras clave: corpora paralelos, extracción de información, recursos de traducción, traducción automática

Abstract: Parallel corpora are rich sources of translation resources. This document presents a methodology for the extraction of bilingual nominals (terminology candidates) from parallel corpora, using translation patterns. The patterns proposed in this work specify the order changes that occur during translation and that are intrinsic to the syntaxes of the languages involved. These patterns are described in a domain specific language named PDL (Pattern Description Language), and are extremely efficient for the detection of nominal phrases.

Keywords: parallel corpora, information extraction, translation resources, machine translation

1 Introduction

Machine Translation (MT) resources are expensive: translation dictionaries require a lot of manual work, and translation grammars are impossible to develop for real languages. Advances in computer processing power and in methods for statistical data extraction from texts have led to a burst in the development of MT systems (Tiedemann, 2003). These systems are data-driven, using just statistical information (like Statistical Machine Translation) or previously done translations (like Example Based Machine Translation). Currently, data-driven and rule-based methods are being coupled in hybrid translation systems. Thus, the automatic extraction of translation resources is relevant.

In this document we describe a simple methodology for the extraction of parallel terminology entries (candidates) from parallel corpora using translation patterns and probabilistic translation dictionaries.

Translation patterns describe how a sequence (pattern) of words changes order during translation. The patterns are described in a Domain Specific Language (DSL) named Pattern Description Language (PDL), which is formalized in section 3.

These patterns are matched against an alignment matrix where translation probabilities between words are defined using a probabilistic translation dictionary (Simões, 2004). Section 2 explains what these dictionaries are and how we can obtain them.

Each time one of the defined patterns matches on the alignment matrix, the pair of word sequences is extracted. To this pair we associate the rule identifier, so that we can infer further information from it. Section 4 shows some of the defined rules, some of the extracted pairs of sequences and an evaluation of these pairs as terminology candidates.

At the end, we present some remarks about the efficiency of the method, the results obtained, and future directions and uses for this bilingual terminology.



2 Probabilistic Translation Dictionaries

One of the most important resources for MT is translation dictionaries. They are indispensable, as they establish relationships between the language atoms: words. Unfortunately, freely available translation dictionaries have small coverage and, for minority languages, are quite rare. Thus, it is crucial to have an automated method for the extraction of word relationships.

Simões and Almeida (2003) explain how a probabilistic word alignment algorithm can be used for the automatic extraction of probabilistic translation dictionaries. This process relies on sentence-aligned parallel corpora.

The algorithm used is language independent and thus can be applied to any language pair. Experiments were done using diverse languages, including Portuguese, English, French, German, Greek, Hebrew and Latin (Simões, 2008). The algorithm is based on word co-occurrences and their analysis with statistical methods. The result is a probabilistic dictionary, associating words in two languages.

These dictionaries map words from a source language to a set of associated words (probable translations) in the target language. Given that the alignment matrix is not symmetric, the process extracts two dictionaries: from source to target language and vice-versa.

The formal specification for one probabilistic translation dictionary (PTD) can be defined as:

$w_A \mapsto \left(\mathrm{occs}(w_A) \times \left(w_B \mapsto P(T(w_A) = w_B)\right)\right)$

Figure 1 shows two entries from the English:Portuguese dictionary extracted from the EuroParl (Koehn, 2002) corpus. Note that these dictionaries include the number of occurrences of the word in the source corpus, and a probability measure for each possible translation.

Regarding these dictionaries, it should be noted that, although we use the term translation dictionaries, not all word relationships in the dictionary are real translations. This is mainly explained by translation freedom, multi-word terms and a variety of linguistic phenomena.
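The structure of a PTD entry can be pictured with a small sketch. The class and function names below are hypothetical and do not correspond to the NATools API; the two entries reuse the English:Portuguese values shown in figure 1, with percentages written as fractions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PTDEntry:
    """One dictionary entry: occurrence count plus probable translations."""
    occurrences: int
    translations: Dict[str, float] = field(default_factory=dict)

    def best(self, n: int = 1) -> List[Tuple[str, float]]:
        """Return the n most probable translations, most probable first."""
        ranked = sorted(self.translations.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


# Source word -> entry, mirroring w_A -> (occs(w_A) x (w_B -> P(T(w_A) = w_B))).
ptd_en_pt = {
    "europe": PTDEntry(42583, {"europa": 0.947, "europeus": 0.034,
                               "europeu": 0.008, "europeia": 0.001}),
    "stupid": PTDEntry(180, {"estúpido": 0.476, "estúpida": 0.110,
                             "estúpidos": 0.074, "avisada": 0.056,
                             "direita": 0.056}),
}


def translation_probability(ptd, source: str, target: str) -> float:
    """P(T(source) = target), or 0.0 when the pair is not in the dictionary."""
    entry = ptd.get(source)
    return entry.translations.get(target, 0.0) if entry else 0.0
```

The probabilities of an entry need not sum to one, since only the most probable translations are kept; the "stupid" entry also shows associations ("avisada", "direita") that are not real translations.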

europe ⇀ 42583 ×
    europa      94.7%
    europeus     3.4%
    europeu      0.8%
    europeia     0.1%

stupid ⇀ 180 ×
    estúpido    47.6%
    estúpida    11.0%
    estúpidos    7.4%
    avisada      5.6%
    direita      5.6%

Figure 1: Probabilistic Translation Dictionary examples.

Notwithstanding the probabilistic nature of these dictionaries, there is work on bootstrapping conventional translation dictionaries using probabilistic translation dictionaries (Guinovart and Fontenla, 2005). Also, Santos and Simões (2008) discuss the connection between dictionary quality and corpora genre and languages.

3 Pattern Description Language

This section presents the Pattern Description Language (PDL), a DSL used to describe translation patterns. It starts with a simple explanation of the PDL syntax and of how patterns are used to extract terminology candidates. It is followed by a section on pattern predicates, which are constraints on the applicability of the defined patterns.

3.1 PDL basics

The translation patterns defined with PDL are matched against a translation matrix. Each cell of this matrix contains the mutual translation probability between the words in that row and column. For instance, in the example of figure 2, the words “discussion” and “discussão” have a mutual translation probability of 44%. This mutual translation probability is computed using a probabilistic translation dictionary1.

Figure 2 includes some highlighted cells. These are anchor cells: cells whose translation probability is 20% higher than the remaining probabilities in the same row and/or column.
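The anchor criterion can be made concrete with a short sketch. The text leaves two points open, so the code fixes them by assumption: "20% higher" is read as a relative margin (at least 1.2 times every other value), and "and/or" is read as dominating the row or the column. The function name is hypothetical; the matrix values are those of figure 2.

```python
# Translation matrix of figure 2 (rows: Portuguese words, columns: English words).
MATRIX = [
    [44, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],    # discussão
    [0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],    # sobre
    [0, 0, 0, 74, 0, 0, 0, 0, 0, 0, 0, 0],    # fontes
    [0, 3, 0, 0, 27, 0, 6, 3, 0, 0, 0, 0],    # de
    [0, 0, 0, 0, 0, 56, 0, 0, 0, 0, 0, 0],    # financiamento
    [0, 0, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0],    # alternativas
    [0, 0, 0, 0, 0, 0, 28, 0, 0, 0, 0, 0],    # para
    [0, 1, 0, 0, 1, 0, 4, 33, 0, 0, 0, 0],    # a
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 65, 0],    # aliança
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 80, 0, 0],    # radical
    [0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0],    # europeia
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 80],    # .
]


def anchor_cells(matrix, margin=0.20):
    """Return the set of (row, col) positions treated as anchor cells.

    Assumption: a cell is an anchor when it is positive and at least
    (1 + margin) times every other value in its row, or in its column.
    """
    anchors = set()
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value <= 0:
                continue
            row_rest = [v for k, v in enumerate(row) if k != j]
            col_rest = [matrix[k][j] for k in range(len(matrix)) if k != i]
            beats_row = all(value >= (1 + margin) * v for v in row_rest)
            beats_col = all(value >= (1 + margin) * v for v in col_rest)
            if beats_row or beats_col:
                anchors.add((i, j))
    return anchors


ANCHORS = anchor_cells(MATRIX)
```

Under these assumptions each row of the example yields exactly one anchor, including the low-valued but unrivalled cell for "sobre"/"about" (11), while the secondary values in the "de" and "a" rows are not anchors.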

1 Note that there is no restriction on the corpus from which the PTD is created. It is possible and desirable to invest in a big and high quality PTD that is used to extract terminology candidates from diverse parallel corpora.

                 discussion  about  alternative  sources  of  financing  for  the  european  radical  alliance  .
discussão                44      0            0        0   0          0    0    0         0        0         0  0
sobre                     0     11            0        0   0          0    0    0         0        0         0  0
fontes                    0      0            0       74   0          0    0    0         0        0         0  0
de                        0      3            0        0  27          0    6    3         0        0         0  0
financiamento             0      0            0        0   0         56    0    0         0        0         0  0
alternativas              0      0           23        0   0          0    0    0         0        0         0  0
para                      0      0            0        0   0          0   28    0         0        0         0  0
a                         0      1            0        0   1          0    4   33         0        0         0  0
aliança                   0      0            0        0   0          0    0    0         0        0        65  0
radical                   0      0            0        0   0          0    0    0         0       80         0  0
europeia                  0      0            0        0   0          0    0    0        59        0         0  0
.                         0      0            0        0   0          0    0    0         0        0         0 80

Figure 2: Example of a Translation matrix.

As can be seen in the translation matrix shown, although it includes an optimal translation, the alignment includes word order changes. These changes are not related to the translator's will; they are imposed by the syntaxes of the languages involved. As an example, consider the change in the relative position of nouns and adjectives in a Portuguese to English translation. In Portuguese the noun appears first (“gato gordo”), while in English it comes at the end (“fat cat”).

Although language dependent, these changes can be predicted, and thus it is possible to describe them mathematically:

$T(N \cdot A) = T(A) \cdot T(N)$

PDL is a domain specific language designed for the formal description of these rules (and their applicability constraints). The pattern for the simple rule shown above is schematized in table 1.

             Olympic   Games
Jogos                    X
Olímpicos       X

[ABBA] A B = B A

Table 1: $T(A\ B) = T(B)\ T(A)$ Pattern.

The PDL syntax is interpreted as follows:

• the identifier of the rule is written between square brackets. It can be any valid identifier. We normally use identifiers that help us remember a specific case where the rule matches;

• it is followed by a sequence of variables (placeholders for words) or specific words (as in table 4). This sequence is matched against words in the source language;

• on the right-hand side of the equal sign there is another sequence, with the same variables, but in the order of the translation.

Figure 3: Translation matrix with detected patterns (the matrix of figure 2 with the matched blocks marked).

Tables 2 to 5 show some other typical Portuguese/English patterns. Each table includes the rule in PDL notation and a graphical representation of it. To understand this matrix representation, consider the following:

• an X in a cell means that it will match against an anchor cell in the translation matrix;

• empty cells will be matched against cells with low values (non-anchor cells);

• cells marked with a Delta symbol (like the one used in table 3) will match any probability at all (whether it is an anchor cell or not). This type of relation is quite important because it is difficult to predict probabilities between some types of word classes, like articles or prepositions.

These patterns are applied directly to the alignment matrix, sliding the pattern over it until it matches. Figure 3 shows the previous translation matrix from figure 2 with the detected patterns marked.

The word sequences that are related to the marked patterns are extracted, and result in the following nominals:

S: fontes de financiamento alternativas
T: alternative sources of financing

S: aliança radical europeia
T: european radical alliance
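The matching step can be illustrated with a simplified sketch. It is not the PDL engine: it tries every block position exhaustively, ignores the specific-word constraints (such as "de" and "of"), and takes as given a set of anchor cells, here assumed to be the dominant cell of each row of figure 2. The pattern grids follow the X / Delta / empty convention described above, with rows in source (Portuguese) order; all names are hypothetical.

```python
PT = ["discussão", "sobre", "fontes", "de", "financiamento", "alternativas",
      "para", "a", "aliança", "radical", "europeia", "."]
EN = ["discussion", "about", "alternative", "sources", "of", "financing",
      "for", "the", "european", "radical", "alliance", "."]

# Assumed anchor cells (row, col): the dominant cell of each row in figure 2.
ANCHOR_SET = {(0, 0), (1, 1), (2, 3), (3, 4), (4, 5), (5, 2), (6, 6), (7, 7),
              (8, 10), (9, 9), (10, 8), (11, 11)}

# "X" must be an anchor, "D" (Delta) matches anything, " " must not be an anchor.
PATTERN_GRIDS = {
    "CBA": ["  X",
            " X ",
            "X  "],       # A B C = C B A
    "POV": [" X  ",
            "  D ",
            "   X",
            "X   "],      # P "de" V N = N P "of" V
}


def cell_matches(mark, is_anchor):
    """Check one pattern cell against one matrix cell."""
    if mark == "X":
        return is_anchor
    if mark == "D":
        return True
    return not is_anchor


def find_nominals(grids, anchors, source, target):
    """Return (rule, source phrase, target phrase) for every matching block."""
    found = []
    for name, grid in grids.items():
        size = len(grid)
        for i in range(len(source) - size + 1):
            for j in range(len(target) - size + 1):
                if all(cell_matches(grid[di][dj], (i + di, j + dj) in anchors)
                       for di in range(size) for dj in range(size)):
                    found.append((name,
                                  " ".join(source[i:i + size]),
                                  " ".join(target[j:j + size])))
    return found


NOMINALS = find_nominals(PATTERN_GRIDS, ANCHOR_SET, PT, EN)
```

With these two grids the sketch recovers the two nominals listed above, each tagged with the identifier of the rule that produced it.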



            Human   Rights
Direitos              X
do
Homem         X

[HR] A "de" B = B A

Table 2: HR Pattern

            ponto   de   vista   neutro
neutral                            X
point         X
of                  ∆
view                       X

[POV] P "de" V N = N P "of" V

Table 3: POV Pattern.

            protocolo   de   transferência   de   ficheiros
file                                                  X
transfer                           X
protocol        X

[FTP] P "de" T "de" F = F T P

Table 4: FTP Pattern.

               índice   de   desenvolvimento   humano
human                                             X
development                         X
index            X

[HDI] I "de" D H = H D I

Table 5: HDI Pattern.

As described earlier, patterns are not built just from variables. The examples in tables 2 to 5 show patterns with specific words. The semantics of these words is exactly as expected: the pattern matches if the column or row includes that word.

Some languages, like Portuguese, have rich gender and number inflection. Rules with specific words should take care of all possible inflected forms. To make this task easier, without forcing the repetition of rules for different inflections, it is possible to specify a sequence of alternative words (“or'ed” words), as in the following example:

[HDI] I "do"|"da"|"dos"|"das" D H = H D I

This small language makes it possible to define almost any kind of translation rule quickly and in an easy-to-read syntax.
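To make the structure of a basic rule explicit, the following sketch parses the rule notation shown so far into its identifier, source sequence and target sequence. It is a hypothetical reader written for illustration, not the PDL parser, and it handles only variables and quoted (possibly or'ed) words, without predicates or morphological constraints.

```python
import re

# [IDENTIFIER] source sequence = target sequence
RULE_RE = re.compile(r'^\[(?P<name>\w+)\]\s*(?P<source>.+?)\s*=\s*(?P<target>.+)$')


def parse_side(text):
    """Split one side of a rule into variables and (alternative) specific words."""
    tokens = []
    for raw in text.split():
        if raw.startswith('"'):
            # Quoted words, optionally or'ed together with "|".
            tokens.append(("words", [w.strip('"') for w in raw.split("|")]))
        else:
            tokens.append(("var", raw))
    return tokens


def parse_rule(line):
    """Parse a basic rule line into a dictionary with name, source and target."""
    match = RULE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a basic pattern rule: %r" % line)
    return {
        "name": match.group("name"),
        "source": parse_side(match.group("source")),
        "target": parse_side(match.group("target")),
    }


HDI_RULE = parse_rule('[HDI] I "do"|"da"|"dos"|"das" D H = H D I')
```

The parsed rule keeps the four alternative prepositions as a single token, so a matcher can accept any of them at that position.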

3.2 Conditional Patterns

Patterns might be applied to word order changes that do not really involve noun phrases (and thus, are not terminology candidates). For instance,

• in Portuguese the conjunction is used after the comma (“, e”) while in English it is used before (“and ,”). Without conditional patterns, the ABBA pattern would be applied;

• another example for the Portuguese/English pair is verb negation: “não é” instead of “is not.”

To solve these problems, PDL supports pattern predicates that restrict the applicability of patterns. PDL supports two types of restrictions:

• generic predicates, which are written in a programming language (Perl) and can do almost anything;

• morphological conditions, which are written in PDL and use a morphological analyzer to test their applicability.

3.2.1 Generic predicates

The most powerful way to add restrictions to translation patterns is by the definition of a generic function, written in a specific programming language, that validates the applicability of the pattern for those specific words.



Given that PDL is implemented in Perl, and Perl is a reflective language, generic predicates are written in Perl. These predicates receive the word or words that should be validated, and return a boolean value indicating whether or not the pattern should be applied.

One of the main advantages of writing predicates in Perl is the ability to perform external actions. It is easy to apply a regular expression, to query a morphological analyzer, to run concordance or other queries on a relational database, or even to query an Internet search engine.

These predicates are defined as Perl functions in the same file as the translation patterns. There is a Perl code zone where users can write their own functions (which can be used as predicates or as auxiliary code for other functions). These functions are called each time a pattern might match against the translation matrix.

As a simple example, for illustration purposes only, consider the following code, which checks that we are not matching the A B = B A pattern with commas:

[ABBA] A B.notComma = B.notComma A

%%

# True unless the matched word is a comma.
sub notComma {
    my $word = shift;
    return $word ne ",";
}

Note that the variables in the rule have the name of the predicate attached (like a method in an OO language), and that the code section is separated from the main rules section by a sequence of special characters. At processing time, these functions are parsed and a symbol table is constructed, which is used later when evaluating the applicability of patterns.
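The role of the symbol table can be pictured with a small analogue. The sketch below is written in Python rather than Perl and does not reproduce the PDL internals: predicates are registered by name, and a candidate match is accepted only if every named predicate holds for the source and the target word bound to the corresponding variable. All names are hypothetical.

```python
# Symbol table: predicate name -> function, filled in as predicates are defined.
PREDICATES = {}


def predicate(func):
    """Register `func` in the symbol table under its own name."""
    PREDICATES[func.__name__] = func
    return func


@predicate
def not_comma(word):
    """True unless the matched word is a comma."""
    return word != ","


def pattern_applies(constraints, bindings):
    """Decide whether a candidate match should be accepted.

    constraints: variable name -> predicate name (as attached in the rule).
    bindings:    variable name -> (source word, target word) of the match.
    """
    return all(PREDICATES[name](word)
               for variable, name in constraints.items()
               for word in bindings[variable])


# The ABBA rule with the predicate attached to variable B.
ABBA_CONSTRAINTS = {"B": "not_comma"}
accepted = pattern_applies(ABBA_CONSTRAINTS, {"A": ("gato", "cat"), "B": ("gordo", "fat")})
rejected = pattern_applies(ABBA_CONSTRAINTS, {"A": ("ponto", "point"), "B": (",", ",")})
```

Looking predicates up by name at match time is what allows the rule text to refer to functions that are defined later in the same file.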

3.2.2 Morphologic restrictions

The most common predicate code is the morphological analysis of a word, checking for a specific morphological category, gender or number. To help with the definition of this kind of predicate, PDL supports morphological restrictions directly in its syntax:

[ABBA] A B[CAT<-adj] = B[CAT<-adj] A

This example means that B, in both languages, should be analyzed morphologically and that the rule will be applied only if the words can2 be analyzed as adjectives.

To perform this validation the algorithm uses the Jspell3 morphological analyzer (Almeida and Pinto, 1994).

3.2.3 Inference rules

These are not restriction rules but inference rules: if the pattern is applied, then we can infer something about the words that matched. The syntax is very close to that of morphological restrictions, changing only the direction of the arrow. Suppose we do not have a morphological analyzer for Portuguese, but we have one for English. We can write a rule to infer a rough morphological analyzer for Portuguese:

[ABBA] A[CAT->noun] B[CAT->adj] = B[CAT<-adj] A[CAT<-noun]

Although the result will include some false positives, this is a fast way to help infer properties of languages.

4 Bilingual TerminologyExtraction and Evaluation

The entries extracted using translation patterns are mostly nominal phrases. Although many of the extracted nominal phrases are not terminology in the common sense of the word, they can be easily filtered. At the moment we are not interested in this filtering task, as we are not using the entries to create glossaries, but to use them directly in translation systems.

These nominal phrases are counted, and multi-sets are created. The elements of these multi-sets include the identifier of the rule and the nominal phrase in both languages. Table 6 shows some examples (the most frequent and the least frequent) of the extracted nominal phrases, together with their occurrence counts. A quick look at the examples shows that the overall quality of these multi-sets is quite good.

2 In case of ambiguity we check whether at least one of the categories is in accordance with the restriction.

3 Jspell is currently released as a hybrid Perl module, and is available on CPAN as Lingua::Jspell.



occs    patt   Portuguese               English
39214   ABBA   comunidades europeias    european communities
32850   ABBA   jornal oficial           official journal
32832   ABBA   parlamento europeu       european parliament
32730   ABBA   união europeia           european union
15602   ABBA   países terceiros         third countries

1       ABBA   órgãos orçamentais       budgetary organs
1       ABBA   órgãos relevantes        relevant bodies
1       HR     óvulos de equino         equine ova
1       HR     óxido de cádmio          cadmium oxide
1       HR     óxido de estireno        styrene oxide

Table 6: Nominals multisets.
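The consolidation of pattern occurrences into multi-sets amounts to counting identical (rule, source phrase, target phrase) triples. A minimal sketch using Python's collections.Counter follows; the occurrence list is a made-up sample, and only the phrases themselves are borrowed from table 6.

```python
from collections import Counter

# Hypothetical stream of pattern occurrences: (rule, Portuguese, English).
occurrences = [
    ("ABBA", "parlamento europeu", "european parliament"),
    ("ABBA", "jornal oficial", "official journal"),
    ("ABBA", "parlamento europeu", "european parliament"),
    ("HR", "óxido de cádmio", "cadmium oxide"),
]

# Each distinct triple becomes one multi-set element with its occurrence count.
multiset = Counter(occurrences)

# Entries sorted from the most to the least frequent, as in table 6.
for (rule, portuguese, english), count in multiset.most_common():
    print(count, rule, portuguese, english)
```

Keeping the rule identifier in the key means that the same phrase pair extracted by two different rules is counted separately.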

We performed a quick evaluation of six different patterns, evaluating their fertility and translation quality. We used 700 000 translation units from EuroParl. These translation units were processed, and 578 103 pattern occurrences were extracted. After consolidation, 139 781 different multi-sets were created. These multi-sets were then filtered, removing entries with punctuation, stop words and random noise. This filtering resulted in 103 617 multi-sets.

Table 7 shows the distribution for the six different patterns, whose occurrence counts add up to 102 151: the number of occurrences and the precision of the obtained nominal phrases.

pattern                  occs.    prec. (%)
A B = B A                77497     86
A de B = B A             12694     95
A B C = C B A             7700     93
I de D H = H D I          3336    100
P de V N = N P of V        564     98
P de T de F = F T P        360     96

Table 7: Evaluation of nominal phrases.

For the evaluation we took 20 nominal phrases from the most frequent ones, 20 from the least frequent, and 20 from the middle of the list (median). As most of the entries have just one occurrence, the 20 nominal phrases from the middle of the list have a low occurrence count (normally 1 or 2 occurrences). Thus, this was a really unfavourable test (2/3 of the nominal phrases have a low occurrence count).

5 Terminology Generalization

Although terminology extraction is an important task, if we want to use these resources for machine translation, they need to be as generic as possible, so that they can be applied to different situations.

Generalization (Brown, 2001) is the common approach to give translation examples (of which bilingual nominal phrases are a specific case) a wider range of application in translation. This section shows two ways to generalize examples:

• a simple approach based on non-word classes;

• the creation of word classes based on alignment patterns.

5.1 Numeric classes

The easiest way to generalize translation examples is to detect non-words and associate a class with them. This class can be anything, from numbers and years to emails or URLs: any non-textual object that is easy to detect.

The nominal phrases are parsed and these non-textual objects are detected and replaced by the class name. For instance, if we define a class named year for numbers $x$ such that $1900 \le x \le 2200$, and another class named int for any other integer value, it is possible to extract the following generalized terminology entries:

187   orçamento de {year}    {year} budget
136   {int} euros            eur {int}
127   directiva de {year}    {year} directive
 51   orçamento {year}       {year} budget
 46   {int} de setembro      september {int}

At the moment we are using about ten non-word classes, including numbers, years, hours, time periods, cardinals and currency values.
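The year and int classes described above can be sketched with a regular expression substitution. This is an illustrative reading of the text, not the authors' code: only unsigned integers are handled, the year class covers 1900 to 2200 inclusive, and the function names are hypothetical.

```python
import re

# Runs of digits delimited by word boundaries.
NUMBER_RE = re.compile(r"\b\d+\b")


def number_class(token: str) -> str:
    """Class name for an integer token: year for 1900..2200, int otherwise."""
    value = int(token)
    return "{year}" if 1900 <= value <= 2200 else "{int}"


def generalize(phrase: str) -> str:
    """Replace every integer in the phrase by the name of its class."""
    return NUMBER_RE.sub(lambda match: number_class(match.group()), phrase)


def generalize_pair(source: str, target: str):
    """Generalize both sides of a bilingual nominal phrase."""
    return generalize(source), generalize(target)


example = generalize_pair("orçamento de 2004", "2004 budget")
```

Counting the generalized pairs instead of the original ones merges entries that differ only in the concrete number, which is how a single entry can accumulate the counts of many different years.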

5.2 Word Classes

Another type of generalization is the substitution of a word by a member of a specific set. For instance, we can define a set of demonyms (gentilics), like:

$G = \{\text{Nigerian}, \text{Mexican}, \text{Norwegian}, \ldots\}$

and then have a dictionary map each of the words in this set to its translation. After the creation of these classes, it is possible to use the class identifier in the nominal phrases, as in:

$X\ \text{People} \Rightarrow \text{Povo}\ T(X), \quad X \in G$

To construct these word classes we usedour translation patterns.

Consider the pattern “A B = B A” from Portuguese to English. We expect the words matching the variable B to be adjectives. If we choose a specific noun for A, we will get all the adjectives that are likely to be applied to it (as can be seen in figure 4). This kind of approach is a quick and efficient way to construct bilingual word classes and generalize terminology entries.

ácido => clorídrico (hydrochloric acid)
         sulfúrico (sulphuric acid)
         acético (acetic acid)
         fólico (folic acid)
         cítrico (citric acid)
         ...

livro => verde (green paper)
         branco (white paper)
         azul (blue paper)
         laranja (orange book)
         vermelho (red book)
         azul (blue book)
         ...

Figure 4: Automatic word class creation.
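Grouping the entries extracted by the “A B = B A” pattern by their noun gives the classes of figure 4. The sketch below shows one possible way to do this, and also flags Portuguese phrases that received more than one English translation; the function names are hypothetical and the sample entries are a subset of figure 4.

```python
from collections import defaultdict

# (Portuguese noun, Portuguese adjective, English phrase), taken from figure 4.
EXTRACTED = [
    ("ácido", "clorídrico", "hydrochloric acid"),
    ("ácido", "sulfúrico", "sulphuric acid"),
    ("livro", "verde", "green paper"),
    ("livro", "azul", "blue paper"),
    ("livro", "vermelho", "red book"),
    ("livro", "azul", "blue book"),
]


def build_word_classes(entries):
    """Group the adjectives (with their translations) under each fixed noun."""
    classes = defaultdict(list)
    for noun, adjective, english in entries:
        classes[noun].append((adjective, english))
    return dict(classes)


def ambiguous_phrases(classes):
    """Return (noun, adjective) pairs that have more than one translation."""
    flagged = []
    for noun, members in classes.items():
        translations = defaultdict(set)
        for adjective, english in members:
            translations[adjective].add(english)
        flagged.extend((noun, adjective)
                       for adjective, variants in translations.items()
                       if len(variants) > 1)
    return flagged


WORD_CLASSES = build_word_classes(EXTRACTED)
NEEDS_REVIEW = ambiguous_phrases(WORD_CLASSES)
```

A list like NEEDS_REVIEW gives a starting point for manual inspection of a class before it is used for generalization.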

Unfortunately, these classes should be analyzed manually before being used, especially when the translation varies not only in the word being cycled but also in the fixed word. See, for instance, the second example of figure 4, where the word “livro” can be translated as “paper” or “book.” This would be unproblematic if there were not two different translations for the same noun phrase: “livro azul.” Given this ambiguity, some care should be taken when extracting word classes using this approach.

6 Conclusions and Future Work

Using statistical methods to obtain bilingual resources is possible. If we attach scalable tools to the statistical methods, resource quality can rise.

Translation patterns have proven to be an interesting and efficient method for noun phrase extraction. Further evaluation should be done by counting how many of the extracted noun phrases can be considered real terminology. For the time being that evaluation is not critical, as the use of noun phrases for machine translation is equally effective for both terminology and non-terminology entries.

The PDL language makes it possible to define translation patterns concisely, without sacrificing readability. Also, the possibility of adding constraints to the patterns' applicability makes them even more effective, raising the quality of the extracted nominal phrases.

At the moment we are applying these resources to two different systems:

• Text::Translator, a Perl system for prototyping translation systems. This Perl module is quite versatile, as it supports all translation approaches, from statistical to example-based or even rule-based translation;

• Apertium (Armentano-Oller et al., 2006; Loinaz et al., 2006), the well-known machine translation system for closely related languages, for which we use the extracted noun phrases to prepare translation resources.

The extracted noun phrases are useful not just as translation examples (Way, 2001), but also as a source for extracting other resources, such as bilingual translation dictionaries.

While there is work on pattern extraction from corpora (Och and Ney, 2004), those patterns are not used for bilingual resource extraction. Instead, Och and Ney use parallel corpora to infer translation templates that are later used in machine translation. This approach can be used to bootstrap our translation patterns. In any case, they must be manually reviewed before being applied to the actual extraction of nominal phrases.

The tools used are available as open source and can be downloaded from http://natools.sf.net. They rely on the NatServer translation resources server (Simões and Almeida, 2006) for efficient querying.

Acknowledgments

Alberto Simões has a scholarship from Fundação para a Computação Científica Nacional, and the work reported here has been partially funded by Fundação para a Ciência e Tecnologia through project POSI/PLP/43931/2001, co-financed by POSI, and by POSC project POSC/339/1.3/C/NAC.

References

Almeida, José João and Ulisses Pinto. 1994. Jspell – um módulo para análise léxica genérica de linguagem natural. In Actas do X Encontro da Associação Portuguesa de Linguística, pages 1–15, Évora.

Armentano-Oller, Carme, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006, pages 50–59, Itatiaia, Rio de Janeiro, Brazil, May.

Brown, Ralf D. 2001. Transfer-rule induction for example-based translation. In Michael Carl and Andy Way, editors, Workshop on Example-Based Machine Translation, pages 1–11, September.

Guinovart, Xavier Gómez and Elena Sacau Fontenla. 2005. Técnicas para o desenvolvemento de dicionarios de tradución a partir de corpora aplicadas na xeración do Dicionario CLUVI Inglés-Galego. Viceversa: Revista Galega de Traducción, 11:159–171.

Koehn, Philipp. 2002. EuroParl: a multilingual corpus for evaluation of machine translation. Draft.

Loinaz, Iñaki Alegría, Iñaki Arantzabal, Mikel L. Forcada, Xavier Gómez Guinovart, Lluís Padró, José Ramom Pichel Campos, and Josu Waliño. 2006. OpenTrad: Traducción automática de código abierto para las lenguas del Estado español. Procesamiento del Lenguaje Natural, 37:357–358.

Och, Franz Josef and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30:417–449.

Santos, Diana and Alberto Simões. 2008. Portuguese-English word alignment: some experiments. In LREC 2008: The 6th edition of the Language Resources and Evaluation Conference, Marrakech, 28–30 May.

Simões, Alberto and J. João Almeida. 2006. NatServer: a client-server architecture for building parallel corpora applications. Procesamiento del Lenguaje Natural, 37:91–97, September.

Simões, Alberto M. and J. João Almeida. 2003. NATools – a statistical word aligner workbench. Procesamiento del Lenguaje Natural, 31:217–224, September.

Simões, Alberto Manuel Brandão. 2004. Parallel corpora word alignment and applications. Master’s thesis, Escola de Engenharia, Universidade do Minho.

Simões, Alberto Manuel Brandão. 2008. Extracção de Recursos de Tradução com base em Dicionários Probabilísticos de Tradução. Ph.D. thesis, Escola de Engenharia, Universidade do Minho, Braga, May.

Tiedemann, Jörg. 2003. Recycling Translations - Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing. Ph.D. thesis, Studia Linguistica Upsaliensia 1.

Way, Andy. 2001. Translating with examples. In Michael Carl and Andy Way, editors, Workshop on Example-Based Machine Translation, pages 66–80, September.



Demonstrations

AnCoraPipe: A tool for multilevel annotation

Manuel Bertran, Oriol Borrega, Marta Recasens, Bàrbara Soriano CLiC – Centre de Llenguatge i Computació

Universitat de Barcelona Gran Via Corts Catalanes,585

08007 Barcelona [email protected], [email protected], {mrecasens,bsoriano}@ub.edu

Resumen: AnCoraPipe es una herramienta de anotación de corpus que permite etiquetar diferentes niveles lingüísticos de manera simultánea y eficiente, ya que utiliza un formato único para todas las etapas. De esta forma, se reduce el tiempo de anotación y se facilita la integración del trabajo de todos los anotadores en el proceso. Palabras clave: Lingüística de corpus, herramienta de anotación, niveles de anotación.

Abstract: AnCoraPipe is a corpus annotation tool which allows different linguistic levels to be annotated simultaneously and efficiently, since it uses a single format for all stages. In this way, the required annotation time is reduced and the integration of the work of all annotators is made easier. Keywords: Corpus linguistics, annotation tool, annotation levels.

1 Introduction

Corpus annotation is a very time-consuming task, and developing AnCora to its current state has required a great deal of effort from our research group. Throughout this process, different tools and formats have been used, always at the risk of losing data when converting from one format to another or when merging data labeled with different tools. To solve these problems, we present AnCoraPipe, which is based on a single XML data format. This data format allows annotation at different levels and in different languages. An effort was made to make the tool scalable and extensible.

Several linguists experienced in corpus annotation have participated in building and testing AnCoraPipe. This interaction has made it possible to build a friendlier interface in which the most common operations are easy to perform. The new tool decreases annotation time by 40% in semantic role labeling, by 60% in named entities, and by 25% in coreference.

2 Data format

In order to help concurrent annotation of different levels, the interface can associate the corpora in the local machine with a server, so users can be aware of changes made in the server and synchronize them before making their own changes in local files. Changes made in local files can then be uploaded to the server for other users to add further annotations.

Items are stored in UTF-8 encoded XML format. XML allows portability and takes advantage of the several tools and libraries available in a variety of platforms and programming languages. Besides, UTF-8 allows the format to be cross-lingual. XML has a tree structure itself, so it maps easily to the syntactic constituent structure.

Our XML is based on the following principles:
• Easy to read: the structure is intuitive.
• Easy to maintain: internal coherence can be maintained with little effort.
• Robustness: small changes do not affect overall coherence. The whole structure is maintained even when an error occurs.

These objectives are reflected in a series of design principles:
• Small set of node names: only 15 node names are possible. Thus, nodes are only generic and specificity is reached through attributes.
• Attributes are atomic: each attribute labels one and only one feature of the node. This reduces the number of possible values and makes the annotation levels independent.
• Attributes describe only their node. This makes moving, deleting and creating nodes very simple tasks, and so coherence is guaranteed.
• No redundant data.
• Easy to add new annotation levels: only the design of a new attribute and its possible values is needed.
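The design principles above can be illustrated with a small Python sketch using the standard xml.etree.ElementTree module. The XML fragment, the node name, and the attribute names (cat, func, wd, lem, ne) are hypothetical illustrations and do not reproduce the actual AnCoraPipe schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: generic node names, with specificity carried by atomic attributes.
SAMPLE = """<sentence>
  <node cat="np" func="subj">
    <node cat="n" wd="Barcelona" lem="Barcelona"/>
  </node>
  <node cat="v" wd="crece" lem="crecer"/>
</sentence>"""

def add_level(root, attribute, values_by_word):
    """Add an annotation level by setting one new atomic attribute on matching nodes."""
    for node in root.iter("node"):
        word = node.get("wd")
        if word in values_by_word:
            node.set(attribute, values_by_word[word])
    return root

root = ET.fromstring(SAMPLE)
# A new (hypothetical) named-entity level only needs a new attribute and its values.
add_level(root, "ne", {"Barcelona": "location"})
```

Because each attribute describes only its own node, adding the new level leaves the existing attributes and the tree structure untouched.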

3 Interface

This section briefly describes the AnCoraPipe editor.1

3.1 Description

The interface is organized in different panels where data are shown. Buttons and menus are available to perform operations on the corpora.

The GUI (Graphical User Interface) highlights in yellow the items that require the coder's attention, thus suggesting the tree nodes that should be annotated or the sentences containing such nodes, depending on the annotation level. The available panels are:
• Corpora directory tree: shows the directory structure and allows the user to select a file.
• Sentence list: shows the sentences within each file.
• Sentence tree: contains the structure of the selected sentence. The user can also see the words and lemmas together with the data of the corresponding annotation level.
• Annotation panel: used to perform operations on the tree and annotate its nodes. The display of the tree changes according to the annotation level, which eases annotation.

Current annotation levels include morphology, syntax (changes in the tree structure, node grouping and splitting, etc.), functions, arguments and thematic roles, named entities, WordNet synsets, and coreference.

The interface also provides some external tools for specific levels, such as:
• Coreference annotator: coreference can be annotated in a user-friendly way, viewing the files as plain text.
• WordNet synsets: on a lemma-by-lemma basis, the external tool looks for all occurrences of the same lemma in the corpus, so they are all annotated in a row. This favors the consistency of the annotation.

1 For a more detailed explanation visit http://clic.ub.edu/mbertran/tbfeditor/help.

The interface is extensible by creating additional tools for further annotation levels. This can be done by writing two new Java classes after having specified the new attribute and possible values.

3.2 Functionality

Many linguists have participated in the development of AnCoraPipe. This has led to a very user-oriented tool that focuses on usability and operational simplicity. To this end, the number of mouse clicks required to perform operations has been minimized, and only the relevant nodes of each annotation level are highlighted, which reduces the risk of oversight. In this way, we have reduced annotation time by up to 60%.

3.3 Installation

The requirements for AnCoraPipe are Java 1.5 and the Java graphical library SWT. Our package includes the SWT library for Windows XP. On other platforms, this library comes with the Eclipse package or can be obtained from http://www.eclipse.org/swt/.

4 Future work

Plans to extend the current application include: making the application available via the Web, providing methods for querying the corpus from the interface, providing methods for making statistical descriptions of the corpus, providing tools for dealing with nominal and verbal lexicons, and adding semi-automatic methods and machine learning functionalities for semi-automatic labeling.

Acknowledgments

This paper has been supported by Lang2World (TIN2006-15265-C06-06), a sub-project of TEXTMESS, and by the FPU-2006-08 grant from the Spanish Ministry of Education and Science.


Plataforma de Interacción Natural para el Acompañamiento Virtual

Natural Interaction Platform for Virtual Attending

David del Valle

Jesica Rivero

Daniel Conde

Garazi Olaziregi

Javier Calle

Dolores Cuadra

Universidad Carlos III de Madrid, Av. Universidad n 30, 28911 Leganés

{dvalle, jrivero, dconde}@inf.uc3m.es, [email protected], {fcalle, dcuadra}@inf.uc3m.es

Resumen: Este trabajo persigue la realización de un acompañante virtual capaz de interaccionar con el usuario mientras este se mueve en un entorno, proporcionándole acceso a un conjunto de servicios preestablecidos. Entre estos servicios figuran aquellos que tienen en cuenta la posición y trayectoria del usuario (p.e., avisos), los orientados a dirigir esos parámetros (p.e. establecimiento de rutas y guiado hacia determinados puntos de interés) o explicarlos (descripción de situación y/o trayectoria). Palabras clave: Interacción Natural, Modelo de situación, Modelo de diálogo, plataforma de interacción multiagente.

Abstract: This work focuses on developing a virtual attendant that interacts with the user and takes into account his or her position in the environment. It provides access to services such as those relative to the user's position and trajectory (for example, notices), those oriented to managing these parameters (for example, tracking and guiding to fixed and moving objects), and the description of the situation and/or trajectory. Keywords: Natural Interaction, Situation Model, Dialog Model, Multi-agent interaction platform.

1 Introduction

There is a trend in the area of Human-Computer Interaction (HCI) towards building knowledge-based systems that enable interaction with users who have no technological training. For these potential users, the interaction techniques employed must not presuppose any prior knowledge or specific skill on the user's part, and in particular no technological skill. The only interactive skill of such a user is the one that lets him or her interact with other humans, and this is the kind of interaction the machine is expected to carry out. This interest has crystallized in Natural Interaction, which brings together several disciplines (Knowledge Engineering, Linguistics, Psychology, etc.) to achieve this interaction paradigm.

The interaction research component presented in this paper follows this direction, continuing the line of previous work (Cuadra et al., 2008). In particular, the interaction subsystem includes the following features (Figure 1):

- Operational autonomy of its components, supported by a multi-agent architecture.

- Interface components: direct input of semantic structures, replaceable by speech and Natural Language processing modules.

- Dialogue component: with intentional processing and joint action (Clark, 1996). The interfaces will be integrated with the interaction through communicative acts (Austin, 1962). It will include the implementation of several knowledge models: dialogue, task, and session (Calle, García-Serrano & Martínez, 2006).

- Situation component: it will emphasize the management of the circumstance through its material (spatio-temporal) aspect. It will be supported by spatio-temporal database technology (Bertino, Cuadra & Martínez, 2005), and will provide the information needed to guide users.

[Figure 1: diagram. Recoverable labels: Expressive Knowledge (Interface Components); Circumstantial Knowledge (Situation Model); Lexicon; Dialogue Knowledge (Dialogue Model); Operative Knowledge (Task Model); Intentional Processing (Common Ground); Interaction State and Structure; Contextual Knowledge (Session Model); Spatio-Temporal Database; physical localization layer.]

Figure 1: Cognitive architecture of the interaction subsystem

2 SOPAT Prototype

Each of the components in Figure 1 will be designed as one or more agents in a multi-agent system. These agents are simply processes of a certain type, with a defined set of skills, that behave autonomously. To communicate with each other, they will use a blackboard stored in a database. For this purpose, a server running the Oracle(TM) database management system will be used.
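The blackboard mechanism can be sketched in Python as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for the Oracle server, and the table layout, agent names and message format are hypothetical rather than those of the SOPAT prototype.

```python
import sqlite3

def open_blackboard():
    """Create an in-memory blackboard table (sqlite3 stands in for the Oracle server)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE blackboard ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " sender TEXT, recipient TEXT, content TEXT,"
        " consumed INTEGER DEFAULT 0)"
    )
    return db

def post(db, sender, recipient, content):
    """An agent writes a message addressed to another agent."""
    db.execute(
        "INSERT INTO blackboard (sender, recipient, content) VALUES (?, ?, ?)",
        (sender, recipient, content),
    )
    db.commit()

def take(db, recipient):
    """An agent reads its pending messages in posting order and marks them as consumed."""
    rows = db.execute(
        "SELECT id, sender, content FROM blackboard"
        " WHERE recipient = ? AND consumed = 0 ORDER BY id",
        (recipient,),
    ).fetchall()
    db.executemany(
        "UPDATE blackboard SET consumed = 1 WHERE id = ?", [(r[0],) for r in rows]
    )
    db.commit()
    return [(sender, content) for _, sender, content in rows]
```

Since the agents share only the table, they can run as separate processes on the same or on different servers, which is the property the scalable architecture relies on.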

Database technology also supports other needs of the system: the spatio-temporal database, the knowledge bases of all the models, the interaction state bases, traces, etc. Both the databases and the agents may reside on different servers or on the same one (depending on efficiency needs), which yields a fully scalable architecture. This prototype follows the architecture shown in Figure 2. The figure shows that the localization and interface components are hosted on a portable device that keeps a wireless data connection (WiFi) to a communications server, which gives it access to the network (Internet). Through it, the device reaches a server that distributes tasks over a local area network. That network contains the database server, which hosts the spatio-temporal database and the blackboard, and the servers holding the NLP, Dialogue and Situation components.

[Figure 2: diagram. Recoverable labels: Communications Server (WiFi); Distributor Server; NLP Server (LAN); Dialogue Agents Server (LAN); Situation Agents Server (LAN); Database Server (LAN); Multi-agent System Blackboard.]

Figure 2: Physical architecture

The prototype presented shows how, with a small corpus, flexible interactions influenced by knowledge of the circumstance can be developed. The services accessible through this interaction are guidance (through a spatio-temporal environment), setting and triggering alarms (on spatio-temporal conditions), and description of situations and routes. The prototype's interfaces include a graphical user interface for simulating movement in a real environment, NLP, speech recognition with ViaVoice®, and text input.

3 Acknowledgments

The prototype presented is supported by the work carried out in the projects SOPAT (CIT-410000-2007-12), funded by the Ministerio de Educación y Ciencia, and MAVIR, funded by the Comunidad de Madrid.

We thank all the participants in these projects for their work and support.

References

Austin, J.L. (1962). How to do things with words. Oxford Univ. Press, 1975.

Bertino, E., Cuadra, D., Martínez, P. (2005). An Object-Relational Approach to the Representation of Multi-Granular Spatio-Temporal Data. Proc. of 17th International Conference, CAiSE 2005, Porto, Portugal.

Calle, J., García-Serrano, A., Martínez, P. (2006). Intentional Processing as a Key for Rational Behaviour through Natural Interaction. Interacting With Computers, Elsevier Ltd.

Clark, H.H. (1996). Using Language. Cambridge University Press.

Cuadra, D., Rivero, J., Valle, D., Calle, J. (2008). Enhancing Natural Interaction with Circumstantial Knowledge. Int. Trans. on Systems Science and Applications, vol. 4, Springer.

David del Valle, Jesica Rivero, Daniel Conde, Garazi Olaziregi, Julián Moreno, Javier Calle y Dolores Cuadra


El programa de búsqueda con lenguaje natural de Q-go aplicado a un sitio web multilingüe

Q-go’s Natural Language Search Software Applied To a Multilingual Website

Carolina Fraile Maldonado Q-go España

Angel Guimerà, 22 08017 Barcelona [email protected]

Leonoor van der Beek Q-go Nederland

Eekholt 40 1112 XH Diemen [email protected]

Resumen: El programa de búsqueda con lenguaje natural de Q-go permite a las empresas que lo implementan en su sitio web ofrecer a los usuarios una herramienta de búsqueda de información potente, rápida y eficaz, con la cual encuentran exactamente lo que buscan. Por su parte, las empresas pueden beneficiarse de las ventajas añadidas del programa. En esta presentación se explicarán los fundamentos y el funcionamiento de este programa y se mostrará su aplicación a un sitio web multilingüe. Palabras clave: lenguaje natural, búsqueda basada en lenguaje natural, sitio web, preguntas de usuario, respuestas relevantes, análisis sintáctico, relaciones semánticas.

Abstract: Q-go’s natural language search software allows companies that implement it on their websites to offer their customers a powerful, fast and efficient information search tool, with which they find exactly what they are looking for. In turn, the companies can benefit from the software’s additional advantages. In this demo we will describe the basics of the software and its application to a multilingual website. Keywords: natural language, natural language search, website, user questions, relevant answers, match engine, syntactic analysis, semantic relations.

1. What Q-go's technology is for

Q-go's search software answers questions asked by website users who are looking for concrete, specific information and want to resolve their doubts online quickly and on their own. To do so, the software must understand the questions users ask and point them to the appropriate answers.

2. The sources of linguistic knowledge

These are the databases the software works with in order to perform its function: dictionaries, grammars, semantic relations, translation rules, meta-rules, verb subcategorization files, spelling correction files, etc. This section will explain the content of each of them.

3. How the software works, step by step

The steps are:

Entry of the question by the user in the corresponding interface of the website.

Assignment of grammatical categories to the elements that make up the question.

Application of the grammar rules: syntactic analysis, which yields one or more syntactic trees.

Production of the 'case frame'.

Application of possible translation rules to the 'case frame', which generates one or more equivalent 'case frames'.

Consultation of the semantic relation databases (synonyms and related words).

Comparison of the 'case frame' of the user question with the 'case frames' of the questions in the database (the so-called model questions).

Comparison by keywords.

Presentation of results.
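The last steps of the list above (semantic relations, case-frame comparison and keyword comparison) can be illustrated with a toy Python sketch. It assumes that a case frame can be represented as a small role-to-word dictionary; that representation, the model questions, the synonym table and the answer identifiers are all hypothetical and do not describe Q-go's actual data structures or matching engine. Parsing and case-frame production are not covered.

```python
# Hypothetical model questions, each with a hand-written case frame and an answer id.
MODEL_QUESTIONS = [
    ({"action": "change", "object": "flight"}, "answer-change-flight"),
    ({"action": "cancel", "object": "booking"}, "answer-cancel-booking"),
]

# Hypothetical semantic relations (synonyms mapped to a canonical word).
SYNONYMS = {"modify": "change", "reservation": "booking", "ticket": "booking"}

def normalize(frame):
    """Replace each value by its canonical synonym, if one is known."""
    return {role: SYNONYMS.get(value, value) for role, value in frame.items()}

def match(frame):
    """Return answers whose case frame equals the question's; else use keyword overlap."""
    frame = normalize(frame)
    exact = [answer for model, answer in MODEL_QUESTIONS if model == frame]
    if exact:
        return exact
    # Fallback: any model question that shares at least one keyword.
    keywords = set(frame.values())
    return [answer for model, answer in MODEL_QUESTIONS if keywords & set(model.values())]
```

The fallback mirrors the order of the steps: keyword comparison is only attempted when no model question matches at the case-frame level.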

4. Case study: applying Q-go's software to KLM's multilingual website

In this section we will explain the process of deploying the technology in a specific case, the multilingual website of the Dutch airline KLM, in Dutch, English, French, Spanish and German. Several examples of questions asked by users in various languages will be shown, the software will be run, and the results will be evaluated.

5. Conclusion

To close the presentation, we will list all the positive aspects of Q-go's software and the added advantages it brings to the customers who decide to deploy it on their websites.

6. Future work

Academic practice versus commercial applications: the acquisition of semantic relations. Ways of quantifying semantic relations.


CHIEDE

Corpus de Habla Infantil Espontánea del Español

CHIEDE

Spontaneous Child Language Corpus of Spanish

Marta Garrote Salazar

Laboratorio de Lingüística Informática

Universidad Autónoma de Madrid

Campus de Cantoblanco,

Ctra. de Colmenar Viejo, Km. 15

28049-Madrid

[email protected]

José María Guirao Miras ETSIIT; Dpto. de Lenguajes y Sistemas

Informáticos

Universidad de Granada

C/ Periodista Daniel Saucedo

Aranda s/n Granada

[email protected]

Resumen: El presente trabajo consiste en la demostración del funcionamiento de la

página web desarrollada para la presentación y difusión del corpus de habla infantil

CHIEDE.

Palabras clave: Corpus, lengua oral espontánea, lenguaje infantil, página web.

Abstract: This work demonstrates the web site developed for the presentation and dissemination of the child language corpus CHIEDE.

Keywords: Corpus, spontaneous oral language, child language, web site.

The Corpus de Habla Infantil Espontánea del Español, CHIEDE, comprises about 60,000 words. Approximately one third of the corpus consists of child speech and the remaining two thirds of adult speech. The main feature of CHIEDE is the spontaneity of the interactions it contains: the recordings were made in their natural context. The resource is available in several formats: an orthographic transcription, an automatic phonological transcription, an XML-tagged version, and the alignment of text and sound. In addition, we provide the results obtained by extracting information from the annotated texts with statistical methods. The design of the corpus is shown in Figure 1.

Figure 1: Corpus design

CHIEDE meets all the requirements of a present-day spoken language corpus. Its format is electronic, which allows the data to be stored, manipulated and exchanged with other interested researchers. Its balanced design and its diversity (variables of sex, age and communicative situation) make it representative of the linguistic variety in question. It has an internal data classification structure that enables optimal exploitation of the data. Its presentation on a web page (http://drusila.lllf.uam.es/chiede)1 makes it available to anyone interested in consulting it. In this way, users can access both the transcriptions (aligned with their sound) and any kind of information extracted from them (number of words, frequency lists). The CHIEDE web page consists of the following sections:

• Home: <http://drusila.lllf.uam.es/chiede>.
• Introduction: explains the reasons that led us to carry out this project and reviews the state of the art in corpus linguistics and the ontogenesis of language.
• Corpus design: describes the design of the corpus and its characteristics.
• Objectives: details the objectives of our project.
• Transcriptions: provides the text of the transcriptions (in both the orthographic and the phonological version) aligned with the corresponding sound. In case of doubt about the transcription conventions, the user can thus listen to the original recording and judge for themselves.
• Results: all the data obtained automatically with statistical methods can be consulted here, including frequency lists by lemma and category, mean length of utterance (MLU), etc.
• Guides: two basic guides are provided for understanding and using the corpus. The first contains all the transcription conventions needed to understand the transcribed texts; the second contains the tagset, or part-of-speech tagging system, with details of the criteria followed in establishing it.
• Query: the Concordancias application lets the user look up any word, retrieving all the examples of that word in the corpus together with their context and sound.

1 This research has been partially funded by the project Búsqueda de Respuestas Avanzada Multimodal y Multilingüe: Recursos Lingüísticos (CICYT-TIN2007-67407-C03-02).
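Two of the facilities listed above, word concordances and mean length of utterance, can be illustrated with a minimal Python sketch. It assumes the transcriptions are available as plain-text utterances, one string each; the function names and the word-based MLU simplification are assumptions, and the sketch does not reproduce the Concordancias application or its audio alignment.

```python
def concordance(utterances, word, width=2):
    """Return (left context, keyword, right context) for each occurrence of `word`."""
    hits = []
    for utterance in utterances:
        tokens = utterance.split()
        for i, token in enumerate(tokens):
            if token.lower() == word.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits

def mean_length_of_utterance(utterances):
    """Average number of words per utterance (a word-based simplification of MLU)."""
    lengths = [len(u.split()) for u in utterances]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

For example, concordance(["mira el perro grande", "el gato duerme"], "el", width=1) lists both occurrences of “el” with one word of context on each side.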

To date, the CHIEDE corpus has been used as a source for several studies on child language. In the future we intend to increase the size of the corpus, the number of participants and the variety of communicative situations.

References

González Ledesma, A. and Garrote Salazar, M. 2007. Los marcadores discursivos en CHIEDE, un corpus de habla infantil espontánea. Actas del XXII Congreso Internacional de la Asociación de Jóvenes Lingüistas.

Garrote, M. and Moreno Sandoval, A. 2008. CHIEDE, a spontaneous child language corpus of Spanish. Proceedings of the 3rd International LABLITA Workshop in Corpus Linguistics.

Garrote, M., Guirao, J.M. and Moreno Sandoval, A. 2008. Extracción de variantes léxicas en adultos y niños de un corpus de lengua oral espontánea. Actas del 8º Congreso de Lingüística General.


MOSTAS: Un Etiquetador Morfo-Semántico, Anonimizador y Corrector de Historiales Clínicos

MOSTAS: A Morpho-semantic Tagger, Anonymizer and Spellchecker for Clinical Reports

Ana Iglesias, Elena Castro, Rebeca Pérez,

Leonardo Castaño y Paloma Martínez Departamento de Informática

Universidad Carlos III de Madrid Avda. Universidad, 30

28911- Leganés (Madrid) [email protected]

José Manuel Gómez-Pérez, Sandra Kohler

y Ricardo Melero iSOCO S.A.

C/ Pedro de Valdivia, 10 28006 - Madrid

[email protected]

Resumen: El sistema MOSTAS pre-procesa historiales clínicos con el objetivo de facilitar el posterior tratamiento de los textos y recuperación de información de los mismos. El sistema añade información morfo-semántica a los historiales, busca el significado de las siglas, acrónimos y abreviaturas que existen en los mismos y detecta conceptos biomédicos, utilizando para ello recursos biomédicos especializados (bases de datos, tesauros, un servidor de terminologías multilingüe en OWL, etc.). Además, MOSTAS es capaz de anonimizar y corregir los historiales clínicos. Palabras clave: Etiquetador morpho-semántico; Anonimizador de textos; Corrector Ortográfico; Conversión de Siglas, Abreviaturas y Acrónimos biomédicos; Historiales Clínicos; Historiales Médicos.

Abstract: The MOSTAS (MOrpho-Semantic Tagger, Anonymizer and SpellChecker for biomedical texts) system preprocesses clinical reports in order to facilitate their subsequent processing and information retrieval. MOSTAS annotates clinical reports with morpho-semantic information, expands abbreviations and acronyms, and detects biomedical concepts using specialized biomedical resources (databases, thesauri, a multilingual terminology server, etc.). Moreover, MOSTAS is able to anonymize and correct the clinical reports. Keywords: Morpho-semantic tagger; Anonymizer; SpellChecker; Abbreviation and Acronym converter; Clinical Reports; Medical Reports.

1 Introduction and Motivation

There is currently great interest in analysing texts in the biomedical domain in order to support scientific literature search, decision making and patient safety (Leroy and Chen, 2005).

So far, most researchers in this field have worked with English texts and English medical terminology, and much remains to be done for biomedical texts in other languages, owing to the lack of complete standards that unify terminologies (Lu et al., 2006). Moreover, most researchers work with well-formed documents such as articles, books or medical abstracts similar to those found in knowledge bases such as MedLine1. There are still few studies of the notes written by hospital staff, which contain initialisms, abbreviations and acronyms, specialized biomedical words, and other symbols or words that are neither controlled nor covered by biomedical resources (Jang, Song and Myaeng, 2006).

2 MOSTAS: Tagger, Anonymizer and Spell Checker for Clinical Texts

The MOSTAS system automatically preprocesses a hospital's clinical records to facilitate later data processing and information retrieval. MOSTAS was created for the ISSE² project, in which more than 210,700 clinical notes from a hospital in Madrid have been processed.

1 http://www.nlm.nih.gov/medlineplus/

The MOSTAS architecture can be divided into four main blocks according to the kind of text processing performed: Morpho-semantic Analyzer, Medical Term Finder, Anonymizer and Spell Checker (see Figure 1).

[Figure 1 diagram. Input: clinical notes. Modules: morpho-semantic analyzer (STILUS), acronym finder, biomedical concepts finder, NE recognition (SPINDEL), domain-specific fuzzy spell-checker. Preprocessed biomedical resources: SNOMED, Spanish health acronyms, active principles, gazetteers. Output: XML preprocessed clinical notes.]

Figure 1: MOSTAS architecture.

MOSTAS takes a set of clinical notes as input and outputs an XML document with morpho-semantic information about the notes, in which abbreviations and acronyms are resolved and the text is anonymized and corrected. The morpho-semantic analyzer uses the STILUS³ tool, which detects words found in a general Spanish dictionary. Words not recognized by STILUS are looked up in dictionaries of biomedical initialisms, abbreviations and acronyms (Nadeau, Turney and Stan, 2006). If a word is found, its possible meanings are stored in the XML document. Otherwise, its meaning is sought in different biomedical resources through a terminology server⁴ (TS) that holds information from metathesauri such as SNOMED⁵, linked to one another by a semantic mapping system. To exploit the expressiveness of the terminologies and to facilitate reasoning, we have equipped the server with a process that transforms the different terminologies into OWL, the W3C standard language for knowledge representation. Next, taking into account the words still not found in any of the biomedical resources, the clinical notes are anonymized with SPINDEL (De Pablo et al., 2007), a named-entity finder (persons, locations and organizations). Finally, we assume that any words still unrecognized after the previous steps may be misspelled (which we have observed to be frequent in this kind of text), so a program looks for orthographically similar words in the specialized medical resources using fuzzy techniques.

2 ISSE: Interoperabilidad basada en Semántica para la Sanidad Electrónica (Semantics-based Interoperability for eHealth). PROFIT project (FIT-350300-2007-75)
3 http://stilus.daedalus.es/stilus.php
4 TS developed by iSOCO within the ISSE project
5 http://www.snomed.org

The XML document containing the clinical texts tagged, corrected and anonymized by MOSTAS will facilitate later text processing and information retrieval.
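The lookup cascade described in this section can be illustrated with a minimal Python sketch. It is not the MOSTAS implementation: the dictionaries below are toy stand-ins for STILUS, the acronym dictionaries, the terminology server and SPINDEL, and the fuzzy step uses `difflib.get_close_matches` from the standard library as an assumed substitute for the fuzzy techniques mentioned above.

```python
from difflib import get_close_matches

# Toy stand-ins (assumptions), not the real MOSTAS resources.
GENERAL_DICT = {"paciente", "con", "dolor", "en", "el", "ingresa"}
ACRONYMS = {"hta": ["hipertensión arterial"]}
BIOMEDICAL_TERMS = {"paracetamol", "disnea"}
NAMED_ENTITIES = {"madrid": "LOCATION"}


def tag_token(token):
    """Run one token through the cascade and return (token, label, info)."""
    word = token.lower()
    if word in GENERAL_DICT:        # step 1: general-language dictionary
        return (token, "general", None)
    if word in ACRONYMS:            # step 2: acronym and abbreviation dictionaries
        return (token, "acronym", ACRONYMS[word])
    if word in BIOMEDICAL_TERMS:    # step 3: biomedical terminology lookup
        return (token, "biomedical", None)
    if word in NAMED_ENTITIES:      # step 4: anonymize named entities
        return ("<" + NAMED_ENTITIES[word] + ">", "anonymized", None)
    # Step 5: assume a misspelling and look for a close biomedical term.
    candidates = get_close_matches(word, sorted(BIOMEDICAL_TERMS), n=1, cutoff=0.8)
    if candidates:
        return (token, "corrected", candidates[0])
    return (token, "unknown", None)


def preprocess(note):
    """Tag every whitespace-separated token of a clinical note."""
    return [tag_token(t) for t in note.split()]
```

Each step only sees the tokens that the previous steps failed to recognize, which is the ordering the text describes; a real system would emit XML rather than tuples.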

References

Jang, H., Song, S.K. and Myaeng, S.H. 2006. Semantic Tagging for Medical Knowledge Tracking. Proc. 28th IEEE EMBS Annual International Conference.

Leroy, G. and Chen, H. 2005. Genescene: An ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts. Journal of the American Society for Information Science and Technology, 56(1): 457-468.

Lu, W-H., Lin, R., Chan, Y-CH. and Chen, K-H. 2006. Overcoming Terminology Barrier Using Web Resources for Cross-Language Medical Information Retrieval. AMIA Annu Symp Proc., 519–523.

Nadeau, D., Turney, P. and Stan, M. 2006. Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity. 19th Canadian Conference on Artificial Intelligence. Québec City, Québec, Canada. June 7.

De Pablo, C., Martinez, J. L., Garcia-Ledesma, A., Samy, D., Martinez, P., Moreno-Sandoval, A. and Al-Jumaily, H. 2007. MIRACLE Question Answering System for Spanish at CLEF 2007.

Ana Iglesias, Elena Castro, Rebeca Pérez,Leonardo Castaño, Paloma Martínez, José Manuel Gómez-Pérez, Sandra Kohler y Ricardo Melero


Herramientas de anotación de corpus de habla espontánea del Laboratorio de Lingüística Informática de la UAM

Toolbox for annotating spontaneous speech corpora (Computational Linguistics Lab – UAM)

Antonio Moreno Sandoval
Laboratorio de Lingüística Informática, UAM
[email protected]

José Ma. Guirao Miras
Dept. de Lenguajes y Sistemas Informáticos, UGranada
[email protected]

Doroteo Torre Toledano
Dept. Ingeniería Informática, UAM
[email protected]

Abstract: We present a toolbox for phonological, syllabic and morphosyntactic annotation (including part of speech, lemma and morphological features) especially adapted to Spanish spoken corpora. All the tools have been developed and validated against several spontaneous speech corpora compiled by the Laboratorio de Lingüística Informática-UAM: C-ORAL-ROM, CHIEDE and CORLEC. Keywords: Corpus annotation, phonology, syllable, lemmatization, PoS tagging.

1 System features

1.1 Transcription of spoken corpora

Spoken corpora are far more complex to compile than written ones and, in particular, demand intensive work from the transcribers: we estimate that each hour of recording requires about 40 hours of specialized work. The tasks include:

1) preparing the recording environment,
2) recording the conversations or monologues,
3) digitally processing the recordings to obtain the source sound file,
4) manual transcription by specialized linguists,
5) manual prosodic annotation (pauses, disfluencies, overlaps, etc.),
6) revision of the transcription and prosodic annotation by a different linguist,
7) joint revision by both linguists of the points on which they disagree,
8) manual alignment of each segment or utterance, that is, of the sound segment with its transcription.

Once the transcription process (orthographic and prosodic) is finished, annotation of the properly linguistic information begins. This preliminary process can be regarded as the counterpart of compiling a written corpus and adding metadata to the header of each text.

1.2 Phonological annotation

Unlike written corpora, spoken corpora require this level of annotation. These corpora are typically used for two basic kinds of task:

1. training speech recognition systems
2. serving as a database for describing the characteristics of spoken language

In the first case, spontaneous speech corpora serve as an acoustic database and as a language model. For this purpose they must be in “phonological” format: each phoneme carries a symbol that identifies it unambiguously. The orthographic transcription is of no use for this purpose.

In principle the phonological transcription could be done by hand, but the time required would make it unfeasible for a corpus of more than 50,000 words¹.

We therefore use a phonological transcriber that automatically transcribes any text written in standard Spanish orthography. The transcriber has been described in several publications.

The estimated error rate is below 2 percent, and errors are concentrated exclusively in words with non-Spanish spelling (that is, foreign words such as “web”), non-Spanish proper names (“John”) and acronyms (“SEPLN”). These problems are handled by adding the words to a list of exceptions, but that list is obviously very incomplete.

Besides converting the text to “phonological” format, the transcriber performs syllabification and assigns phonological stress. No exhaustive evaluation has been carried out, but we know that it performs with a precision similar to that of the phonological transcription, provided each word is taken in isolation. External juncture between words is not handled for now.
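As an illustration of how a rule-based transcriber with an exception list can work, here is a minimal Python sketch. It is not the authors' transcriber: the rule set is a small simplified subset of Spanish orthography, the phoneme symbols (`T`, `x`, `tS`, `L`, `J`) are an assumed SAMPA-like notation, the exception entry is hypothetical, and syllabification and stress assignment are left out.

```python
# Hypothetical exception list for words with non-Spanish spelling.
EXCEPTIONS = {"john": "dZon"}

# Single-letter rules: silent h, v pronounced as b, and so on (assumed symbols).
SIMPLE = {"h": "", "v": "b", "z": "T", "j": "x", "ñ": "J",
          "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}


def transcribe(word, exceptions=EXCEPTIONS):
    """Map one orthographic word to a phoneme string, word in isolation."""
    w = word.lower()
    if w in exceptions:                      # exceptions bypass the rules
        return exceptions[w]
    out = []
    i = 0
    while i < len(w):
        ch = w[i]
        nxt = w[i + 1] if i + 1 < len(w) else ""
        if ch == "c" and nxt == "h":         # digraph ch
            out.append("tS"); i += 2
        elif ch == "l" and nxt == "l":       # digraph ll
            out.append("L"); i += 2
        elif ch == "r" and nxt == "r":       # digraph rr
            out.append("rr"); i += 2
        elif ch == "q" and nxt == "u":       # qu is /k/, the u is silent
            out.append("k"); i += 2
        elif ch == "g" and nxt == "u" and w[i + 2:i + 3] in ("e", "i"):
            out.append("g"); i += 2          # gue, gui: silent u
        elif ch in "cg" and nxt in ("e", "i"):
            out.append("T" if ch == "c" else "x"); i += 1
        elif ch == "c":                      # c elsewhere is /k/
            out.append("k"); i += 1
        elif ch in SIMPLE:
            out.append(SIMPLE[ch]); i += 1
        else:                                # letter maps to itself
            out.append(ch); i += 1
    return "".join(out)
```

For instance, `transcribe("queso")` gives `keso` and `transcribe("John")` returns the exception entry instead of applying the Spanish rules. Errors on foreign words are exactly the cases that need new exception entries.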

1.3 Morphosyntactic annotation

As is well known, this annotation is the essential first step in automatic processing, since it provides the input for the syntactic level (category and agreement features) and for the semantic level (the lemma).

Morphosyntactic annotation of spontaneous speech shares the problems and requirements of systems for written text:

• recognition of multi-word units
• disambiguation among several possible analyses
• handling of new or unknown words

1 Assuming an average of 5 phonemes per word, this gives a minimum of 250,000 phoneme tokens.

We start from a tagger designed and trained for written texts (Grampal, 1991), to which we have added new features to adapt it to spoken texts:

• a special tokenizer for transcribed texts, including recognition of disfluencies (vowel lengthening, word truncations, paralinguistic sounds...),
• training for disambiguating analyses in a spoken context,
• inclusion of the category Discourse Marker (“es decir”, “o sea”, “bueno”, “vale”), needed for spoken syntax,
• an added module for handling diminutives (“cafetito”) and neologisms (“megahortera”), which are far more frequent in spoken than in written language.
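A tokenizer of the kind listed above might look like the following Python sketch. It is only an illustration under stated assumptions: truncated words are assumed to be transcribed with a trailing hyphen, the tag names (`MD`, `TRUNC`, `PUNCT`, `W`) are invented for the example, and discourse markers are matched by a greedy longest-match lookup with no disambiguation (so “bueno” is always tagged as a marker, which a trained tagger would not do).

```python
import re

# Discourse markers taken from the examples in the text, as word tuples.
DISCOURSE_MARKERS = {("es", "decir"), ("o", "sea"), ("bueno",), ("vale",)}
MAX_LEN = max(len(m) for m in DISCOURSE_MARKERS)


def tokenize(utterance):
    """Split a transcribed utterance into (token, tag) pairs."""
    # A word may carry a trailing hyphen (assumed truncation mark).
    words = re.findall(r"\w+-?|[^\w\s]", utterance.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest multi-word marker first.
        for n in range(MAX_LEN, 0, -1):
            chunk = tuple(words[i:i + n])
            if chunk in DISCOURSE_MARKERS:
                tokens.append((" ".join(chunk), "MD"))
                i += len(chunk)
                break
        else:
            w = words[i]
            if w.endswith("-"):
                tag = "TRUNC"      # interrupted word
            elif re.fullmatch(r"[^\w\s]", w):
                tag = "PUNCT"
            else:
                tag = "W"          # ordinary word, left to the tagger
            tokens.append((w, tag))
            i += 1
    return tokens
```

Grouping “o sea” into one token before tagging is what lets the Discourse Marker category apply to the expression as a whole.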

2 The demonstration

The demonstration system will let attendees check how the tools work through a remote connection to our server.

There are two query modes:

a. entering isolated words
b. entering a text

In the first case, the demonstrator shows the result in a “graphical” presentation. When a text is entered, the demonstrator returns output tagged in XML. The different possibilities are shown in this way.

References

Cresti, E. and M. Moneglia (eds.). 2005. C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam, John Benjamins.

Moreno, A. and J.M. Guirao. 2006. Morpho-syntactic tagging of the Spanish C-ORAL-ROM Corpus. In Spoken Language Corpus and Linguistic Informatics. Amsterdam, John Benjamins.


TMILG (Tesouro Medieval Informatizado da Lingua Galega)

TMILG (Medieval Galician Computational Treasure)

António de Carlos Moura Barros, Angel López López, José Ramom Pichel Campos
imaxin|software
Rua Salgueiriños de abaixo N11 Local 6, Santiago de Compostela, Galiza

[email protected]@imaxin.com

[email protected]

Abstract: The “Tesouro Medieval Informatizado da Lingua Galega” (Medieval Galician Computational Treasure) is a research project carried out at the Instituto da Lingua Galega (ILG), coordinated by Xavier Varela under an agreement with the DXPL > SXPL (Linguistic Policy General Secretariat) of the Xunta de Galicia, and is available on the Internet through the TMILG corpus (http://ilg.usc.es/tmilg). More than 12,500 documents have been collected, dating from the 13th century to the beginning of the 16th. imaxin|software was in charge of developing both the corpus indexing engines and the search engine. The resource supports varied searches over medieval Galician documents: by date, genre, text typology, variants of the same word and concordances, as well as by patterns and regular expressions. It has no equivalent in any other Romance language. The works offered are very diverse, ranging from secular and religious lyric poetry (Galician-Portuguese troubadour lyric, Cantigas de Santa María) to technical prose (Arte de Trovar, Tratado de Albeitaría), and including literary prose (Crónica Troiana, Historia Troiana, Livro de Tristán), historical prose (Crónica General and Crónica de Castilla, General Historia), religious prose (Miragres de Santiago, Crónica de Santa María de Iria) and legal prose (Flores de Derecho, fragments of the Partidas, Ordenamiento de Alcalá de Henares...). Notarial prose holds a prominent place, with copious religious and civil collections, among which the monastic ones stand out. Keywords: TMILG, Galician-Portuguese, Portuguese, Galician, corpus, imaxin|software, ILG.







1 Text encoding

Encoding the texts demanded very special attention because of the peculiarities both of the medieval Galician-Portuguese scripta and of modern editing habits. The specificity of medieval spellings forced us to devise unconventional solutions for recording the texts and processing them by computer. Nasality marked on a vowel or consonant (c, g, q, m) is the most characteristic feature. Other specific elements are the types of [...] (tall, double-curved and curved, and sigmatic, the Visigothic [...] and the Tironian sign).

2 Indexers

Specific indexing engines have been developed to migrate the XML-tagged documents to formats that database systems such as MySQL can manage. The words in the corpus are not yet lemmatized, an issue that deserves in-depth research because of the spelling variation found in any corpus, and in medieval ones especially. Lemmatization and the subsequent grouping of lemmas would facilitate much later diachronic work on the Galician-Portuguese language in its Galician variety.

3 Queries

The corpus is freely accessible after registering as a user. The query system lets users search for one or several words, run Boolean searches, use wildcards, and consult a statistics module showing the frequency of words by document typology and by century; searches can also be refined with restrictions by date, genre, subgenre or work. Queries run on the free-software database management system MySQL, and all the programming has been done in PHP 5.0.
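The combination of wildcard search, chronological restriction and per-century frequency can be illustrated with a short Python sketch. It does not reproduce the TMILG query engine (which runs on MySQL and PHP): the data are invented, and wildcard matching is delegated to the standard-library `fnmatch` module as an assumption about wildcard semantics.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Invented (form, year) occurrences for illustration only.
OCCURRENCES = [("Galiza", 1290), ("galiza", 1420), ("galizia", 1455), ("Leon", 1290)]


def century(year):
    """Century of a year, e.g. 1290 -> 13 and 1300 -> 13."""
    return (year - 1) // 100 + 1


def search(pattern, occurrences, first_year=None, last_year=None):
    """Case-insensitive wildcard search with an optional date restriction."""
    hits = []
    for form, year in occurrences:
        if first_year is not None and year < first_year:
            continue
        if last_year is not None and year > last_year:
            continue
        if fnmatchcase(form.lower(), pattern.lower()):
            hits.append((form, year))
    return hits


def frequency_by_century(hits):
    """Count hits per century, as a statistics module might report them."""
    return Counter(century(year) for _, year in hits)
```

Here `search("gali*", OCCURRENCES)` matches three forms, and `frequency_by_century` reports one hit in the 13th century and two in the 15th.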

4 Displaying query results

Query results can be shown in four output formats: “Listaxe”, “Tipoloxía textual”, “Formas” and “Concordancias”. The first format shows the results of the query in their different forms; in our example, a search for the word “Galiza” returns two forms, “galiza” and “Galiza”. Choosing “Tipoloxía textual” shows the frequency of the word by text typology and by the century in which it appears. Clicking on either of these two frequencies displays the word in KWIC (Key Word in Context) format, with its left and right context as well as the year, the work, etc. Another display format, “Formas”, shows the frequency per century of each surface form, in our example “galiza” and “Galiza”. Finally, “Concordancias” directly shows the words, in their different forms, in context.
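The KWIC display mentioned above is easy to sketch. The following Python function is a generic illustration, not the TMILG code; the window size and the tuple output format are choices made for the example.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, matched form, right context) for every hit.

    Matching is case-insensitive, so all surface forms of the keyword
    are listed together, each one in its original spelling.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines
```

A search for `galiza` over a token list containing both “Galiza” and “galiza” returns one line per occurrence, which corresponds to the two surface forms in the example of this section.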

Figure 1: Example of result display.

5 Related and future work

imaxin|software has developed for the Instituto da Lingua Galega the TILG (Tesouro Informatizado da Lingua Galega), a contemporary Galician corpus that preceded the medieval one, as well as the Dicionario de dicionarios and the Dicionario de dicionarios medieval. Future work will focus on creating the great “Tesouro da lingua galega”, covering the period from medieval times to the present, which will facilitate the diachronic study of the Galician language and research in lexicography within the global scope of the Galician-Portuguese-Brazilian (galegolusobrasileira) language.


Subtitulado Cerrado para la Accesibilidad de Personas con Discapacidad Auditiva en Entornos Educativos

Closed Captioning for Accessibility of Hard of Hearing People in Educational Environments

Pablo Revuelta Sanz CESyA

[email protected]

Javier Jiménez Dorado CESyA

[email protected]

José Manuel Sánchez Pena Departamento Tecnología Electrónica

[email protected]

Belén Ruiz Mezcua Departamento Informática

[email protected]

Universidad Carlos III de Madrid,

Av. Universidad, 30, 28911 Leganés, Madrid, España

Abstract: The aim of this project is to facilitate the integration of hard of hearing people in education. The system is based on automatic speech recognition (ASR) and text-to-speech conversion. A client-server architecture with wireless communication was implemented, which can work with three different types of device. The system produces two educational resources as output: the audio of the speaker's voice and its transcription. The ASR process uses Dragon NaturallySpeaking and the text-to-speech process uses the Microsoft Speech API.

Keywords: live captioning, automatic speech recognition, closed captioning, educational accessibility, multidevice application, inclusive education.

1 Introduction

People with disabilities, and deaf people in particular, face many difficulties in their lives. Access to education also presents barriers that prevent them from fully enjoying and taking part in the educational process, so we are still far from what is called Inclusive Education [1]. Technological advances and audiological research, such as cochlear implants or induction loops, have greatly improved the integration of deaf students in education and, naturally, in their daily lives. However, they still face several problems, such as difficulties understanding language and speech as a consequence of deafness [2], but also because of noisy classrooms, the distance to the teacher, reverberation, etc. [3]. To address these problems we propose a new technical aid that facilitates the integration of students with hearing impairments in the new model of inclusive education. To do so we use the concepts of closed captioning and live captioning [4]. The system transcribes the teacher's speech and sends it by radio, individually and in the form of subtitles, to the deaf students. In addition, students with speech problems can type questions and comments on their portable devices so that a synthetic voice reads out what they have written.

2 System Structure

The implemented system has a client-server architecture. The server runs on the teacher's computer, which can still be used for other applications. Each deaf student has, or may share, a portable device on which a client application displays the subtitles.

2.1 Server

The server is divided into the following parts:

- ASR application
- Main program
- Communications
- Text-to-speech application

Figure 1 shows how these parts interact.

Automatic Speech Recognition. The ASR application is built on the commercial speech recognizer Dragon NaturallySpeaking (DNS). An internal TCP connection was created to send the transcribed text automatically to the main program. DNS was chosen initially for its lower error rate, even though it is known to take longer than ViaVoice.



Fig. 1. Interaction of the processes running on the server.

Main Program. It is implemented in LabVIEW 8.5 and gives the teacher a simple user interface. Its functions include handling the TCP connections and the communication interfaces. It synchronizes and manages the ASR and text-to-speech applications. It transforms the transcription into subtitles following the standard for teletext subtitling [5]. It stores the transcription of the class, together with the comments received from deaf students, in a text file (XML), and the microphone audio in an audio file (WAV).
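The step that turns a running transcription into subtitles can be sketched in Python as below. This is only an illustration of the idea: the original program is written in LabVIEW, and the limits used here (`max_chars` characters per line, `max_lines` lines per subtitle) are assumed parameters, not values taken from the teletext subtitling standard.

```python
import textwrap


def make_subtitles(transcript, max_chars=35, max_lines=2):
    """Split a transcript into subtitle blocks.

    Each block is a list of at most `max_lines` lines, and each line has
    at most `max_chars` characters; words are never broken.
    Both limits are illustrative assumptions.
    """
    lines = textwrap.wrap(transcript, width=max_chars, break_long_words=False)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
```

Each block would then be sent to the clients as one subtitle; a real implementation would also have to decide when to flush a partially filled block so that the delay stays low.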

Communications. There are two alternative communication modes: a serial radio link at 433 MHz and a Bluetooth link. Both are needed to cover the different client devices.

Text to Speech. The synthetic voice is controlled from the main program through an internal ActiveX communication that uses Microsoft Speech API 5.1. The teacher can control the voice type, speed, volume and physical output.

2.2 Clients

There are three different types of client:

- an electronic circuit with a small display in the form of glasses and a keyboard
- a PDA
- a laptop or desktop computer

The microcontroller-based device communicates over the serial radio link. It has a display mounted in a pair of glasses and a keyboard, so the user can read the subtitles and type questions.

The PDA application is available for Windows Mobile, Pocket PC and Palm OS. Like the computer application, it was developed with LabVIEW 8.5. The PDA communicates only over Bluetooth, whereas the computer can use both communication systems.

The interface for the laptop or desktop computer is an advanced version of the PDA one, developed for Windows, with extra functionality.

Fig. 2. PDA interface.

3 Conclusions

The system is compatible with FM systems and magnetic loops. Its range is 50 metres for the radio link and up to 25 metres with some Bluetooth devices. It works adequately when the recognizer is trained, reaching very low error rates (~5%), especially if the vocabulary is adapted to the subject being taught. The main drawback of the system is the delay (up to 6 seconds) caused by the time Dragon takes. This latency is resolved by switching to ViaVoice, although much more training is then needed to reach the same error rates. Another drawback is background noise, especially the students' voices; microphones that minimize these effects are required. The next step is to incorporate punctuation marks. Although the system was designed for deaf students, it can also be used by foreign students or by students with comprehension or speech problems. The system could also be applied to accessibility in guided museum visits, lectures, conferences, etc. [6].

References

1. Arnáiz Sánchez, P.: Educación Inclusiva: Una escuela para todos. Ediciones Aljibe, Málaga (2003)
2. Davis, J. M., Elfenbein, J., Schum, R. and Bentler, R. A.: Effects of Mild and Moderate Hearing Impairments on Language, Educational, and Psychosocial Behavior of Children. Journal of Speech and Hearing Disorders, 51. 53-62 (1986)
3. Pekkarinen, E. and Viljanen, V.: Acoustic Conditions for Speech Communication in Classrooms. International Journal of Audiology, 20(4). 257-263 (1991)
4. Revuelta P., Jiménez J., Sánchez J. M. and Ruiz B.: Online Captioning System for Educational Resources Accessibility of Hard-of-Hearing people. ICCHP (2008)
5. AENOR: UNE 153010: Subtitulado para personas sordas y personas con discapacidad auditiva – Subtitulado a través del Teletexto. (2003)
6. Revuelta P., Jiménez J., Sánchez J. M. and Ruiz B.: Multidevice System for educational accessibility of Hearing impaired students. Accepted at CATE (2008)


ESEDA: Tool for enhanced speech emotion detection and analysis

ESEDA: una herramienta para el reconocimiento de emociones en el habla

Julia Sidorova
Universitat Pompeu Fabra
Ocata, 01 [email protected]

Toni Badia Cardus
Universitat Pompeu Fabra
Ocata, 01 [email protected]

Abstract: This demo paper presents a speech emotion recognition tool based on standard supervised machine learning methods and enhanced with an additional block of classification error analysis and fixing. Experimental results demonstrate the validity of this enhancement. Keywords: emotion recognition, paralinguistics

1 Introduction

In a number of applications, such as man-machine interfaces, it is important to be able to recognise people’s emotional state. The aim of a speech emotion recognition (SER) engine is to estimate the emotional state of the speaker given a speech fragment as input. The standard way to do SER is through a supervised machine learning procedure. Recently a number of alternative classification strategies have been proposed that are preferable under certain conditions, e.g. unsupervised learning or numeric regression. The SER tool presented here allows for these alternative classification strategies. We propose the ESEDA classification strategy, based on standard supervised learning techniques and enhanced with an additional block of classification error analysis and fixing. The improvement achieved is 12.7% in recognition accuracy averaged over all classes, and 32.1% in accuracy for the anger class.

2 System Architecture

The standard part of the system comprises three modules: Feature Selection (FS), Feature Extraction (FE) and Classification. Their performance serves as a baseline against which to validate the proposed enhancement.

FE and FS. In the literature there is a consensus that global statistical features lead to higher accuracies than dynamic classification of multivariate time series. The FE module extracts 116 statistical features. The FS module implements the wrapper approach with forward selection. The resulting vector depends on the language; for example, for the French data set in this study it had 8 features: intensity mean, harmonicity mean, long-term average spectrum value at 1500 Hz as a function of frequency, maximum of the long-term average spectrum, frequency of the minimum of the power spectral density, minimum of pitch, standard deviation of pitch, and mean absolute slope of pitch.
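The wrapper approach with forward selection can be sketched as a greedy loop. The Python code below is a generic illustration, not the ESEDA feature selector: `score` is any caller-supplied function that evaluates a feature subset (in a real wrapper it would be the cross-validated accuracy of the classifier), and the feature names and gains in the toy scorer are invented.

```python
def forward_selection(features, score, max_features=None):
    """Greedy wrapper-style forward selection.

    Starting from the empty set, repeatedly add the feature that raises
    `score(subset)` the most, and stop when no candidate improves it.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:      # no candidate helps: stop
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy scorer with invented per-feature gains, for illustration only.
GAIN = {"intensity_mean": 0.30, "pitch_min": 0.20, "pitch_std": 0.05, "noise": -0.02}


def toy_score(subset):
    """Pretend accuracy: a base rate plus the gain of each chosen feature."""
    return 0.2 + sum(GAIN[f] for f in subset)
```

With the toy scorer, the loop picks the three useful features in order of gain and rejects `noise`, since adding it would lower the score.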

Classification. The classification module takes as input the feature vector created by the feature selector and applies a Multilayer Perceptron classifier to assign it a class label; the labels are the emotional states to be discriminated. A multilingual classifier is constructed by merging the data of several languages and then training and testing on the merged data set.

Error analysis and fixing. We propose the improved classification setting ESEDA, which is the standard classification step described above enhanced with the following procedure:

I. We identify the class of special interest (denoted class I), whose recognition rate is to be improved. It can be the worst recognised class or a class of special interest for some application. From the confusion matrix of the standard classification step we determine the class with which the class of interest is most frequently confused (denoted class J). Then we classify in two steps: first among the new classes (the new labels are the old labels, except that classes I and J share a joint label), and then between classes I and J.

II. If the minority class problem is present and hampers classification, we employ cost-sensitive training (more specifically, we duplicate every minority class sample in the database).
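Steps I and II can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ESEDA code: a tiny nearest-centroid classifier stands in for the Multilayer Perceptron so that the example has no dependencies, and all class and function names are invented.

```python
from collections import defaultdict


class NearestCentroid:
    """Tiny stand-in classifier (the paper uses a Multilayer Perceptron)."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for k, v in enumerate(x):
                acc[k] += v
            counts[label] += 1
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict_one(self, x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=dist)


def duplicate_minority(X, y, minority):
    """Step II: add one extra copy of every sample of the minority class."""
    extra = [(x, label) for x, label in zip(X, y) if label == minority]
    return X + [x for x, _ in extra], y + [label for _, label in extra]


class TwoStageClassifier:
    """Step I: classify with a joint label for I and J, then split I from J."""

    def __init__(self, class_i, class_j, make_classifier=NearestCentroid):
        self.pair = {class_i, class_j}
        self.joint = class_i + "+" + class_j
        self.make = make_classifier

    def fit(self, X, y):
        merged = [self.joint if label in self.pair else label for label in y]
        self.stage1 = self.make().fit(X, merged)
        pair_data = [(x, label) for x, label in zip(X, y) if label in self.pair]
        self.stage2 = self.make().fit([x for x, _ in pair_data],
                                      [label for _, label in pair_data])
        return self

    def predict_one(self, x):
        label = self.stage1.predict_one(x)
        return self.stage2.predict_one(x) if label == self.joint else label
```

Duplicating samples acts as a simple form of cost-sensitive training for learners whose loss sums over samples; with the nearest-centroid stand-in it changes the class balance but not the centroids, so it is shown here only to make the data flow concrete.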

3 Experimental work

Validation of baseline performance. We carried out the validation on acted emotional speech from the Interface databases. Although acted material has a number of drawbacks, it was used to establish a proof of concept for the proposed methodology; in future work we plan to test ESEDA on real emotions. There are six emotions (anger, disgust, fear, joy, surprise and sadness) plus a neutral state, from two male and female speakers. The database contains isolated words and sentences (both affirmative and interrogative) of various lengths: short (five to eight words), medium (13 w.) and long (14–18 w.). The recordings were made in a studio environment with a sampling frequency of 48 kHz and 16-bit quantisation. A randomly chosen subset of the Interface databases was used (3711, 3805 and 4030 utterances for English, Slovenian and French respectively). The proportion of classes in the validation subset is the same as in the whole databases.

For the testing protocol, 10-fold cross-validation was used. (We also considered disjoint sets for training (50%), validation (25%) and testing (25%). We found that the accuracies in the two modes differed by 1%, which is due to the homogeneity of the Interface databases, i.e. the distributions are the same in different chunks of the database. Therefore cross-validation can be used without loss of generality.) For monolingual validation, the accuracy obtained is 73%. Accuracies for individual classes are as follows: 76% for neutral, 70% for angry, 94% for disgusted, 53% for fear, 83% for joy, 63% for surprise, and 72% for sad. As these numbers show, the accuracies are good on average, with the exception of fear (often confused with surprise and sad) and surprise (often confused with fear). For multilingual validation, the accuracy is 69.5%.

                 An      Ne      Total
Baseline         70%     76%     73.3%
+ class. step    84%     95%     76.8%
c-s. train.      99.5%   93%     86%

Table 1: Consecutive improvements in accuracy (An = anger, Ne = neutral): baseline, adding the classification step between anger and neutral, adding cost-sensitive training.

Validation for the enhanced architecture. Due to the improved classification settings, the system performance improved by 12% (averaging over the three languages). For example, for the French database, anger was taken as the class of special interest, as required in a number of applications; in call centres, for instance, anger detection is needed for the off-line control of how well conflict dialogues are resolved. From the confusion matrix obtained with the baseline classification it was deduced that anger is mostly confused with neutral. Therefore the classification was done in two steps: first among the new classes (the new labels are the old labels, except that there is a joint label for anger and neutral), and then with an extra classification step between anger and neutral. The minority class problem was detected, so every angry sample was duplicated in the database.

4 Discussion and conclusions

Table 1 sums up the consecutive increases in classification rates. Adding an extra classification step improved the overall accuracy by 3.5% (the accuracies for anger and neutral improved by 14% and 19% respectively). Cost-sensitive training brought a further 15.5% for anger and 9.2% overall. As the recognition rates improve, the false alarm rate increases by only 2% (i.e. the accuracy for the neutral class drops from 95% to 93%).
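The step-by-step gains can be recomputed directly from the entries of Table 1. The short sketch below does only that arithmetic; the dictionary layout and the helper name `gains` are illustrative assumptions.

```python
# Accuracies (in %) copied from Table 1, in the order: anger, neutral, total.
table1 = {
    "baseline": (70.0, 76.0, 73.3),
    "+ class. step": (84.0, 95.0, 76.8),
    "c-s. train.": (99.5, 93.0, 86.0),
}


def gains(before, after):
    """Per-column difference in percentage points, rounded to one decimal."""
    return tuple(round(a - b, 1) for b, a in zip(table1[before], table1[after]))


step_gain = gains("baseline", "+ class. step")
cost_gain = gains("+ class. step", "c-s. train.")
print("extra classification step:", step_gain)
print("cost-sensitive training:  ", cost_gain)
```

According to the table, the extra classification step adds 14, 19 and 3.5 points for anger, neutral and the total, while cost-sensitive training adds 15.5 points for anger and 9.2 points to the total at the cost of 2 points for neutral.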

We presented a SER tool based on the ESEDA method, which consists of standard supervised machine learning methods enhanced with an additional block for classification error analysis and fixing. Although this enhancement is simple from a theoretical point of view, it is of practical use.

Julia Sidorova y Toni Badia Cardús


Projects

CLARIN: Common Language Resources and Technology Infrastructure

Núria Bel, Montserrat Marimon
Institut Universitari de Lingüística Aplicada
Universitat Pompeu Fabra
Pl. de la Mercè, 10-12
08002-Barcelona
[nuria.bel|montserrat.marimon]@upf.edu

Abstract: This article presents CLARIN, a project that aims to promote the use of technological tools in research in the fields of the Humanities and Social Sciences.
Keywords: Humanities, social sciences, grid technology, web services, language resources and technologies.

1 Introduction

We present the CLARIN project (Common Language Resources and Technology Infrastructure), a collaboration among 22 European countries that aims to promote the use of technological tools in research in the Humanities and Social Sciences. CLARIN will create the infrastructure needed to give generic access to large linguistic data banks (texts, dictionaries, ontologies, etc.), as well as to the tools for analysing and exploiting these data (tokenizers, taggers, syntactic parsers, etc.). To this end, a single interface for accessing the data and the analysis tools, as well as processors and other necessary services, will be implemented on a grid network structure using web service and semantic web technologies.

CLARIN is one of the 35 projects selected by the ESFRI Committee and listed in the "Roadmap" of infrastructures that are to be built, because of their importance for research, over the next ten years.

2 Background

CLARIN has its background in work on the standardisation of linguistic data and of the tools that analyse them, aimed at guaranteeing reusability and interoperability: EAGLES, OLIF (Lieske et al., 2001), ISLE (Atkins et al., 2002) and LIRICS-ISO (Francopoulo et al., 2006); as well as in direct implementations of these guidelines: MULTEXT (Ide and Véronis, 1994), PAROLE (Zampolli, 1997) and SIMPLE (Lenci et al., 2000).

In addition, the exploitation of linguistic data was advanced by research projects such as LAMUS (Broeder et al., 2007), which needed to archive and manage linguistic data in the area of linguistic typology.

More recently, projects have exploited the enormous potential of virtually integrating existing distributed and autonomous resources, and have demonstrated the feasibility of forming virtual digital collections. Examples include IMDI (Wittenburg et al., 2002) and DAM-LR (Broeder et al., 2006).



3 The CLARIN Infrastructure

The aim of CLARIN is to create a stable and persistent infrastructure that gives access to language resources and to the tools for analysing and exploiting them.

The CLARIN infrastructure applies grid technology, the concept of metadata, and web services in order, first, to guarantee the interoperability that turns a set of unrelated, heterogeneous and remote elements into a structured system of interconnected functional components and, second, to facilitate the identification, location, access and exploitation of language resources. By language resources we mean any collection of data in textual form (spoken or written), or containing information about languages, where the purpose of the technology is the processing of linguistic material.

On the one hand, grid technology makes it possible to use all kinds of resources (data, processes, services, etc.) in a coordinated way without subjecting them to centralised control. These resources may be heterogeneous and geographically distributed, that is, they may be owned and/or administered by different institutions. On the other hand, metadata are a standard definition, used by all components of the grid, for describing contents in a way that enables unified identification and search of resources and functionalities.
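The role of shared metadata can be illustrated with a toy example. The sketch below is not the CLARIN metadata schema: the record fields, the repository names and the `search` helper are all hypothetical, chosen only to show how one description format lets a single query run over resources held by different institutions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    identifier: str
    resource_type: str  # e.g. "corpus", "lexicon", "tagger"
    language: str       # e.g. "ca", "es"


def search(repositories, **criteria):
    """Return every record, from any repository, matching all given criteria."""
    return [
        record
        for records in repositories.values()
        for record in records
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


# Two independently administered (made-up) repositories using the same schema.
repositories = {
    "site-a": [ResourceRecord("a-1", "corpus", "ca"), ResourceRecord("a-2", "tagger", "es")],
    "site-b": [ResourceRecord("b-1", "corpus", "es"), ResourceRecord("b-2", "corpus", "ca")],
}

catalan_corpora = search(repositories, resource_type="corpus", language="ca")
print([record.identifier for record in catalan_corpora])
```

Because every repository describes its holdings with the same fields, the query does not need to know where a resource lives or who administers it.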

4 CLARIN Planning

CLARIN is currently in its first phase (2008-2010), a preparatory phase in which the construction of the infrastructure will be planned in detail, including an estimate of real costs, the definition of how the network will be used, and the definition of the centres, resources and technology that will ensure its stable maintenance. In a second phase (2011-2015), the infrastructure is to be built, integrating resources and technologies, and pilot applications that use it are to be developed. Finally, a phase of full exploitation is planned, with the development of more complex and innovative applications.

The project covering the preparatory phase has been approved by the European Commission and involves 32 members from 22 Member States of the Union, in addition to broad international support. CLARIN has also received support from the Ministerio de Educación, Subdirección General de Promoción e Infraestructuras Tecnológicas y Grandes Instalaciones (CAC-2007-23). In addition, the DIUE of the Generalitat de Catalunya and the UPF have signed an agreement to fund the development of a Catalan demonstrator for CLARIN.

References

Atkins, S. et al. 2002. From Resources to Applications. Designing the Multilingual ISLE Lexical Entry. In Proceedings of LREC. Las Palmas de Gran Canaria, Spain.

Broeder, D. et al. LAMUS – the Language Archive Management and Upload System. <http://www.lat-mpi.eu/papers/papers2006/lamus-paper-final2.pdf>.

Broeder, D. et al. 2006. A Grid of Language Resource Repositories. In Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing. Amsterdam, The Netherlands.

Francopoulo, G. et al. 2006. Lexical Markup Framework (LMF). In Proceedings of LREC. Genoa, Italy.

Ide, N. and Véronis, J. 1994. MULTEXT: Multilingual Text Tools and Corpora. In Proceedings of the 15th International Conference on Computational Linguistics. Kyoto, Japan.

Lenci, A. et al. 2000. SIMPLE: A General Framework for the Development of Multilingual Lexicons. International Journal of Lexicography, Vol. 13, no. 4, pp. 249-263.

Lieske, C. et al. 2001. The Open Lexicon Interchange Format (OLIF) Comes of Age. In Proceedings of the MT Summit VIII. Santiago de Compostela, Spain.

Wittenburg, P. et al. 2002. Metadata Proposals for Corpora and Lexica. In Proceedings of LREC. Las Palmas de Gran Canaria, Spain.

Zampolli, A. 1997. The PAROLE project in the general context of the European actions for Language Resources. In Proceedings of the Second European Seminar: Language Applications for a Multilingual Europe. IDS/VDU, Mannheim/Kaunas.


SOPAT - Servicio de orientación personalizada y accesible para turismo

SOPAT – Service of personalized and accessible orientation for tourism

Luigi Ceccaroni

TMT Factory Marina 16-18, 08005, Barcelona

+34 932 892 076 [email protected]

Victor Codina TMT Factory

Marina 16-18, 08005, Barcelona [email protected]

Abstract: The lack of automatic, personalized information services that are easily accessible to tourists and citizens when they are in public spaces negatively affects the tourism sector, since it reduces both the use of the services offered by the city and user satisfaction. The SOPAT project has developed a personalized information and city-guide service that gives all kinds of tourists and citizens easier access to information of interest, using an interactive community display as the experimental platform.
Keywords: Person-computer-environment interaction system, personalization.

1 Introduction

At present there is no automatic, personalized information and city-guide service that is easily accessible to tourists and citizens when they are in public spaces. In general, users have to go to the information and tourist offices located around the city. In most cases the information these offices provide is generic and poorly tailored to users' preferences. In some cities, for example Aberdeen (GB), Barcelona, Helsinki (Finland) and Oakland (US), there are multimedia information points with interactive services in public areas (interactive community displays, ICDs), which offer information services to citizens, although they tend to be hard to access and lack personalization.

2 Objectives

SOPAT is a research project of the PROFIT 2007 call, which developed the basis of an interaction system and of orientation services for tourists in two different scenarios:

(1) open environments, where the user can interact with the service through a system of screens;

(2) closed environments, where the user carries a mobile device that interacts through some wireless communication system.



The intelligent, personalized service that has been developed makes full use of the potential of the semantic web and also allows natural interaction with users.

The general objective of the project was to develop a personalized information and city-guide service that improves on existing services. The intention was to give all kinds of tourists and citizens, including those with some functional diversity, easier access to information of interest, using an ICD as the experimental platform. The specific objectives were: (1) to offer a multimodal interface that allows a more natural interaction adapted to the user, for example by touch or by voice; (2) to personalize the information shown to users according to the location of the ICD and the user's preferences.

3 Description

The information and city-guide meta-service that has been developed consists of a set of services selected on the basis of a study of user needs in the field of tourist information (carried out by the User Lab of the Fundació privada universitat i tecnologia FUNITEC). Some of the selected services are:

(1) cultural agenda: events taking place in the city;

(2) what to visit: places of interest in the city;

(3) interactive map: through which users find their way to the places of interest;

(4) routes: the best routes for visiting the city.

The ICD used as the experimental platform consists of the following components: a touch screen, audio devices and a computer built into the stand.

Each user can personalize the interaction interface explicitly, or the system does so implicitly on the basis of the user profile. The service adapts to the user's preferences according to the knowledge the system has extracted from the user profile, the geographical location and the usage statistics.
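As a rough illustration of how the three sources named above (user profile, geographical location, usage statistics) might be combined, the following sketch ranks places of interest with a weighted score. It is not the SOPAT implementation: the weights, the field names and the scoring formula are assumptions made only for this example.

```python
import math


def score(item, profile, icd_position, usage_counts, weights=(0.5, 0.3, 0.2)):
    """Combine profile interest, proximity to the ICD and popularity (all assumed)."""
    w_interest, w_near, w_popular = weights
    interest = profile.get(item["category"], 0.0)        # preference in [0, 1]
    distance = math.dist(item["position"], icd_position)  # planar approximation
    nearness = 1.0 / (1.0 + distance)
    total_uses = sum(usage_counts.values()) or 1
    popularity = usage_counts.get(item["name"], 0) / total_uses
    return w_interest * interest + w_near * nearness + w_popular * popularity


def recommend(items, profile, icd_position, usage_counts, top_n=3):
    """Return the names of the top_n items, best score first."""
    ranked = sorted(
        items,
        key=lambda item: score(item, profile, icd_position, usage_counts),
        reverse=True,
    )
    return [item["name"] for item in ranked[:top_n]]


# Made-up data: a museum next to the ICD and a stadium farther away.
items = [
    {"name": "stadium", "category": "sport", "position": (3.0, 4.0)},
    {"name": "museum", "category": "culture", "position": (0.0, 0.0)},
]
profile = {"culture": 0.9, "sport": 0.1}
usage_counts = {"stadium": 3, "museum": 1}
print(recommend(items, profile, (0.0, 0.0), usage_counts))
```

With these made-up values the museum ranks first even though the stadium is consulted more often, because the profile and the location outweigh popularity under the assumed weights.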

The natural language processing (NLP) components applied in the project are based on the corpus acquired by analysing the different scenarios that make up the interaction domain. Since the user must be able to interact naturally, a large corpus had to be obtained.

The development of an avatar has also been proposed, so that users' voice interaction with the ICD is as friendly as possible. To this end, as future work, we will study the possibility of adding an emotional component to the interaction subsystem, so that the avatar has knowledge of the user's emotional state at the moment of interaction and can react differently depending on the emotions detected.

4 Participating groups

The research groups participating in the project and involved in the work described are:

• BCN d’Infografia, S.L. (TMT Factory), Research Department.

• Universidad Carlos III de Madrid, Departamento de Informática.

• Universidad Politécnica de Madrid, Departamento de Inteligencia Artificial.

Acknowledgements

This work was carried out with financial support from the Ministerio de Educación y Ciencia, within the PROFIT programme (CIT-410000-2007-12). The contents of this document, however, are the sole responsibility of the authors and may not reflect the position of the Ministerio de Educación y Ciencia. More information about the project can be found at [http://www.tmtfactory.com/proyecto_SOPAT.asp].


GODO: Generación inteligente de Objetivos para el Descubrimiento de servicios web semánticos∗

GODO: Goal driven orchestration for Semantic Web Services

Juan Miguel Gómez
Universidad Carlos III de Madrid
Avda. Universidad 30, 28911 Leganés (Madrid)
[email protected]

Javier Chamizo
Universidad Carlos III de Madrid
Avda. Universidad 30, 28911 Leganés (Madrid)
[email protected]

Abstract: In this article we present GODO, a search engine which uses NLP and mapping techniques to orchestrate and achieve goals by means of web services.
Keywords: semantic web, ontology learning, orchestration

1. Introduction

The Web has changed from being a mere repository of information into a business platform where multiple organisations can deploy, share and explore business processes through web services. New and promising fields of application have thus emerged, such as those related to the Semantic Web. In particular, Semantic Web Services are helping to boost the deployment and composition of services, but the problem of how to search for and discover services, and how to use them simply and dynamically, has not yet been fully solved. Moreover, the fact that users cannot express their wishes in natural language, and instead have to know a complex logical language, limits the effectiveness of the approach followed by semantic web services.

This project seeks a solution that, by combining natural language analysis and semantic technologies, allows end users of applications based on semantic web services to express their wishes and goals to the system more simply.

∗ GODO project (FIT-340000-2007-134). Participants: Atos Origin S.A., Universidad de Murcia, Universidad Carlos III de Madrid

2. Project Objective

The main objective of the GODO project is to develop a semantic infrastructure based on applying natural language analysis techniques and ontologies, so that users can be offered a simple presentation layer for defining their goals for searching for and discovering semantic web services. Current platforms oriented towards semantic web services show certain limitations:

At the end-user level, it is difficult to define the goals needed to search for and discover the required services. Semantic web services currently offer no user-friendly interface for defining search goals, and extensive knowledge of mathematical logic languages is needed to define even one such goal.

At the technical level there are several problems:

• Several frameworks exist for defining semantic web services, such as WSMO (Cristina Feier, 2005), OWL-S (Martin, 2005) or METEOR-S (Abhijit A. Patil, 2004), each with its own language and definitions.

• Defining goals is currently a complex matter that demands extensive knowledge



Even so, semantic web service architectures provide a suitable framework for building platforms for offering and discovering services. Moreover, in recent years platforms for discovering and executing semantic web services have matured into a reality, at least as results of research projects. Thus, projects funded by the European Union, such as DIP¹ and INFRAWEBS², have created a set of APIs and environments suitable for building applications based on semantic web services. In practice, however, the interaction between end users and applications for expressing service discovery goals is still very complex and requires extensive knowledge of the platform used and of the semantic language underlying it.

The aim of the GODO project is therefore to bring end users closer to the definition of goals for searching for and discovering semantic web services. GODO intends to realise this paradigm using an existing semantic web services platform, in order to show that defining goals simply, without requiring special knowledge of complex ontology languages, can be a reality.

3. NLP and Semantic Technologies

Using semantic technologies together with Natural Language Processing (NLP) makes it possible to represent the conceptual structure of language, providing greater semantic richness than a computational lexicon or a thesaurus.

The relationship between ontologies and NLP techniques is bidirectional: on the one hand, ontologies are tools for representing semantic networks; on the other, NLP is an important technique in the automatic construction of ontologies (ontology learning).

Moreover, because ontologies represent concepts (not terms), they are very useful in machine translation systems, since each concept can be associated with the linguistic forms (one or several) that represent it in each language.
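A minimal sketch of this idea is a mapping from language-independent concepts to the terms that express them in each language. The concept identifiers, the example terms and the `translate` helper below are hypothetical and serve only to illustrate the data structure.

```python
# Each (made-up) concept lists, per language, one or several terms that express it.
ontology = {
    "C_HOTEL": {"es": ["hotel"], "en": ["hotel"], "ca": ["hotel"]},
    "C_BOOKING": {"es": ["reserva"], "en": ["booking", "reservation"]},
}


def translate(term, source, target, concepts=ontology):
    """Go from a source-language term to its concept(s), then to target-language terms."""
    results = []
    for forms in concepts.values():
        if term in forms.get(source, []):
            results.extend(forms.get(target, []))
    return results


print(translate("reserva", "es", "en"))
```

Because the lookup goes through the concept, a single Spanish term can yield several English forms, and a language with no form for a concept simply returns an empty list.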

According to the British Standard Recommendation for the selection, formation and definition of technical terms, concepts are mental constructs, abstractions that can be used to classify the various objects of the outer or inner world, whereas terms are the concrete lexical units used to refer to a concept. There is not always a single term for a concept, and for this reason there have been attempts to impose a one-to-one relationship between term and concept, assigning one term to each concept and representing each concept by a single term. The aim is to facilitate scientific and technical communication and to eliminate polysemy, synonymy and homonymy. This was the proposal made by ISO 704, which has, however, received several criticisms.

The GODO approach concerns the use of NLP and ontologies, focusing on the automatic construction of ontologies from text and on the detection of ontological elements (concepts, classes, relations, attributes) in natural language text.

The automatic construction of ontologies has become one of the main research focuses within the Semantic Web. Ontologies are used in the Semantic Web as a complex structure for representing knowledge, generally of a domain. This knowledge is often contained in texts written in natural language. Building ontologies is a slow and costly process that holds back the progress of the Semantic Web, which is why effective methods for automatically generating ontologies from natural language are needed.

¹ http://dip.semanticweb.org/
² http://www.infrawebs.org/

References

Abhijit A. Patil, Swapna A. Oundhakar, Amit P. Sheth, Kunal Verma. 2004. Meteor-s web service annotation framework. In Procs. of the 13th international conference on World Wide Web.

Cristina Feier, John Domingue. 2005. Wsmo primer. http://www.wsmo.org/TR/d3/d3.1/v0.1/, April.

Martin, David et al. 2005. Bringing semantics to web services: The owl-s approach. Lecture Notes in Computer Science, 3387:26–42.


TEXT-MESS: Minería de Textos Inteligente, Interactiva y Multilingüe basada en Tecnología del Lenguaje Humano∗

TEXT-MESS: Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies

Patricio Martínez-Barco, Manuel Palomar
Univ. de Alicante
GPLSI - [email protected]
[email protected]

Julio Gonzalo, Anselmo Peñas
UNED
GPLN - DLSI
[email protected]@lsi.uned.es

L. Alfonso Ureña-López, Mª Teresa Martín-Valdivia
Univ. de Jaén
SINAI - DI
[email protected]@ujaen.es

Ferran Pla, Paolo Rosso
Univ. Pol. de Valencia
GPLIS, GRFIA - DSIC
[email protected]@dsic.upv.es

Alicia Ageno, Jordi Turmo
Univ. Pol. de Catalunya
GPLN, TALP - [email protected]@lsi.upc.edu

M. Antònia Martí, Mariona Taulé
Univ. de Barcelona
CLIC
[email protected]@ub.edu.es

Abstract: The goal of this project is to analyze, experiment with, and develop intelligent, interactive and multilingual Text Mining technologies, as a key element of the next generation of search engines, systems with the capacity to find “the need behind the query”. These technologies will provide specialized services and interfaces according to the search domain and the type of information needed. Moreover, they will integrate searches on document collections (websites), multimedia (images, audio, video), semi-structured texts and restricted domains.
Keywords: Text Mining, Human Language Technologies (HLT), HLT resources, Information Retrieval, Question Answering, Information Extraction, HLT Evaluation, CICYT

1. General description

The large amount of information currently available in electronic form, together with the growing number of end users with direct access to it through personal computers, has driven research into textual information systems that facilitate the analysis, location, management, access and automatic processing of this vast quantity of data.

The Internet has profoundly changed the way people communicate, do business and carry out their daily work, by giving access to countless resources in different formats and languages. All these factors have contributed to the success of the Web and have, paradoxically, also given rise to one of its main problems: information overload.

∗ TIN2006-15265-C06

In this context of information overload, current search engines have become obsolete. For this reason, the six universities that make up the TEXT-MESS project are working, under the sponsorship of the current Ministerio de Ciencia e Innovación, to define a new generation of search engines able to find “the need behind each query” and offering specialized services and interfaces according to the domain and the type of information need. They will also integrate document search (web pages), multimedia search (images, audio, video) and database search (biomedicine, tourism, job listings, etc.). The new search engines will be able to discover and organise information, not merely produce ranked lists of web pages.

In these new search engines, language technologies will play a more important role than in current search engines, which, moreover, have prioritised English-language content and are largely US technology. TEXT-MESS thus fulfils a twofold strategic mission: defining the role of language technologies in these new systems, and positioning the official languages of the Spanish state in a technological “race” that is already under way.

2. Objectives

The purpose of the project is to develop intelligent, interactive and multilingual text mining technologies, based on HLT, that integrate document search over web pages, multimedia search over images, and search over semi-structured information.

To achieve this objective, three lines of action are proposed.

(1) Develop text mining systems (search, extraction, analysis, classification and retrieval of information), studying multilingual aspects (with special emphasis on Spanish and Catalan) and interactive aspects, as well as the effectiveness and efficiency of the systems on written documents, audio transcriptions and images, and working in both generic domains (the Web) and specific ones (such as biomedicine and tourism).

(2) Improve and adapt existing resources and tools (greater coverage, quality and treatment of specific domains) and create the new resources, techniques and tools needed to tackle new applications based on Human Language Technologies, combining linguistic knowledge and machine learning techniques.

(3) Link the project with the main international evaluation campaigns for search systems and Human Language Technologies: on the one hand, as participants in these campaigns, to compare the results of our research with those of the best research groups internationally; on the other hand, as promoters and coordinators of some tasks, in order to promote research in the project's lines of interest and to guarantee the presence, on equal terms, of the project's languages of interest (Spanish and Catalan) in competitive research in this field.

To achieve the overall objective and to develop the lines of action described above as effectively as possible, the project has been set up as a coordinated project consisting of the following subprojects:

01-KRUA Knowledge discovery and Representation in Human Language Technology (UA);

02-INES Intelligent exploration and synthesis of search results (UNED);

03-TIMOM Tratamiento de Información multiMOdal y Multilingüe (UJA);

04-MiDEs Métodos de Aprendizaje para Minería de Textos en Dominios Específicos (UPV);

05-SAMiT Sistemas Adaptativos de Minería de Textos (Text Mining Adaptative Systems) (UPC);

06-Lang2World Discovering world knowledge coded into language (UB).

3. Current status

TEXT-MESS runs for three years (October 2006 - September 2009) and already has results in terms of research (more than a hundred publications in prestigious international conferences and journals), of generated resources (corpora, tools, techniques), and of prototypes applying the research, which can be found on the project website¹.

¹ http://gplsi.dlsi.ua.es/text-mess


TECNOPARLA - Speech technologies for Catalan and its application to Speech-to-speech Translation ∗

TECNOPARLA - Tecnologías del habla en catalán y su aplicación a la traducción oral automática

Henrik [email protected]

Marta R. [email protected]

TALP Research Center, Universitat Politecnica de Catalunya, Tel. +34 934016439

Jose A. R. [email protected]

Abstract: The paper reports on the objectives and activities of the TECNOPARLA project, which aims to develop state-of-the-art speech technologies for Catalan.
Keywords: Speech-to-speech translation, automatic speech recognition, statistical machine translation

1 Overview

Speech-to-speech translation offers a range of applications: interpersonal communication between people who do not share a common language, and information exchange and access across languages. It becomes ever more important and desirable as societies across the globe move closer together, regional languages and cultures are strengthened, and multilingual and multicultural societies become more common. The TECNOPARLA project aims to improve Catalan language technology and its application to speech-to-speech translation between Catalan, English and Spanish. Members of the TALP group and RWTH Aachen University contribute to the project. Broadcast radio and television are among the most widely used sources of information and therefore suggest the task and domain addressed within the project. Translating broadcast debates involves several technologies and, in its simplest form, can be regarded as a sequential process of automatic speech recognition (ASR), statistical machine translation (SMT), and text-to-speech synthesis (TTS). In addition, acoustic events such as music or noise are detected beforehand, and speaker diarization keeps track of the speakers involved in a debate, which facilitates the use of speaker-adapted acoustic and language models in speech recognition and the selection of dedicated voices in speech synthesis. Language detection makes it possible to use monolingual models in multilingual debates. System components are integrated using the Unstructured Information Management Architecture (UIMA).

∗ Funded by the Generalitat de Catalunya.
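The sequential view of the system described in the overview (ASR, then SMT, then TTS, applied to segments that already carry speaker and language labels) can be sketched as follows. This is only an illustration: the `Segment` class, the `translate_debate` function and the toy stand-in components are hypothetical names, and the sketch does not use UIMA or any project code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    speaker: str    # label assumed to come from speaker diarization
    language: str   # label assumed to come from language detection
    audio: bytes    # placeholder for the audio of the segment


def translate_debate(
    segments: List[Segment],
    recognize: Callable[[Segment], str],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str], bytes],
    target_language: str,
) -> List[bytes]:
    """Run ASR -> SMT -> TTS on each segment, in order."""
    output = []
    for seg in segments:
        text = recognize(seg)
        # Translation is only needed when the segment is not already
        # in the target language.
        if seg.language != target_language:
            text = translate(text, seg.language, target_language)
        # The speaker label lets the synthesizer pick a dedicated voice.
        output.append(synthesize(text, seg.speaker))
    return output


# Toy stand-ins so the sketch runs end to end (not real components).
def toy_asr(seg: Segment) -> str:
    return seg.audio.decode("utf-8")


def toy_smt(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def toy_tts(text: str, speaker: str) -> bytes:
    return f"{speaker}: {text}".encode("utf-8")


demo = translate_debate(
    [Segment("spk1", "ca", b"bon dia"), Segment("spk2", "en", b"good morning")],
    toy_asr, toy_smt, toy_tts, target_language="en",
)
```

Passing the components as callables keeps the control flow independent of any particular recognizer, translator or synthesizer.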

2 Speech Recognition

The development of the speech recognition subsystem is carried out in collaboration with RWTH Aachen University (Germany) using the RWTH open-source speech recognition framework (see footnote 1).

An initial Catalan acoustic model was derived from a Spanish acoustic model developed during the TC-STAR project (Lööf et al., 2007). The feature space comprises Mel-frequency cepstral coefficients (MFCC) extended by a voicedness feature.
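The paper does not say how the voicedness feature is computed. One common way to quantify voicing is the peak of the normalized autocorrelation of a frame over a range of pitch lags, and the sketch below assumes that definition. The function names and the lag range are illustrative, not taken from the RWTH framework.

```python
import math
from typing import List, Sequence


def voicedness(frame: Sequence[float], min_lag: int, max_lag: int) -> float:
    """Peak of the autocorrelation, normalized by frame energy,
    over lags in [min_lag, max_lag]. Values near 1 suggest a
    periodic (voiced) frame, values near 0 an aperiodic one."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / energy)
    return best


def extend_features(mfcc: List[float], frame: Sequence[float],
                    min_lag: int, max_lag: int) -> List[float]:
    """Append the voicedness value to an existing MFCC vector."""
    return mfcc + [voicedness(frame, min_lag, max_lag)]


# A sine wave with a period of 20 samples stands in for a voiced frame.
voiced_frame = [math.sin(2 * math.pi * i / 20) for i in range(200)]
voiced_score = voicedness(voiced_frame, min_lag=10, max_lag=50)
```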

Training proceeds in several steps. Prior to acoustic model estimation, a linear discriminant analysis (LDA) estimates a feature-space projection matrix. Next, a new phonetic classification and regression tree (CART) is grown, tying the HMM states to generalized triphone states. Finally, model estimation iteratively splits and refines the Gaussian mixture models.
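The last training step, iterative splitting of the Gaussian mixtures, might look like the sketch below. It assumes one simple splitting rule (each density is replaced by two copies whose means are shifted by a fraction of the standard deviation, with the weight halved) and omits the re-estimation pass that would follow each split. The `Gaussian` class, `split_mixture` and the `epsilon` value are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    weight: float
    mean: List[float]


def split_mixture(components: List[Gaussian], variance: List[float],
                  epsilon: float = 0.2) -> List[Gaussian]:
    """Double the number of densities in a mixture.

    `variance` is a diagonal covariance shared by all densities.
    Each density becomes two, with means moved by +/- epsilon
    standard deviations and the original weight split evenly.
    """
    offset = [epsilon * math.sqrt(v) for v in variance]
    result = []
    for g in components:
        result.append(Gaussian(g.weight / 2,
                               [m + o for m, o in zip(g.mean, offset)]))
        result.append(Gaussian(g.weight / 2,
                               [m - o for m, o in zip(g.mean, offset)]))
    return result


# Start from a single density and split twice (re-estimation omitted).
mixture = [Gaussian(1.0, [0.0, 0.0])]
shared_variance = [1.0, 4.0]
once = split_mixture(mixture, shared_variance)
twice = split_mixture(once, shared_variance)
```

In a real trainer, each split would be followed by several re-estimation iterations before the next split.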

1 http://www-i6.informatik.rwth-aachen.de/rwth-asr/



The acoustic model consists of context-dependent, semi-tied continuous-density HMMs with a 6-state topology for each triphone. Their emission probabilities are modeled with Gaussian mixtures sharing a common diagonal covariance matrix.
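As an illustration of emission probabilities modeled by a Gaussian mixture with one diagonal covariance shared by all densities, the following sketch evaluates the log-likelihood of a feature vector. The function name and argument layout are assumptions made for this example.

```python
import math
from typing import List, Sequence


def log_emission(x: Sequence[float], weights: Sequence[float],
                 means: Sequence[Sequence[float]],
                 shared_var: Sequence[float]) -> float:
    """Return log sum_k w_k * N(x; mu_k, diag(shared_var))."""
    # The normalization term is identical for every density because
    # the covariance is shared, so it is computed once.
    log_norm = -0.5 * sum(math.log(2 * math.pi * v) for v in shared_var)
    log_terms: List[float] = []
    for w, mu in zip(weights, means):
        dist = sum((xi - mi) ** 2 / v
                   for xi, mi, v in zip(x, mu, shared_var))
        log_terms.append(math.log(w) + log_norm - 0.5 * dist)
    # Log-sum-exp for numerical stability.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))
```

Sharing the covariance means only the means and weights differ between densities, which keeps the number of parameters per state small.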

Both the lexicon, which encompasses the 50k most frequent words, and the 4-gram back-off language model, which comprises about 10.1 M multi-grams, have been derived from the 'El Periódico' corpus. The language model achieves minimal perplexity with linear discounting and modified Kneser-Ney smoothing.

Recognition follows a multi-pass approach: a first pass using a speaker-independent acoustic model, followed by segmentation and clustering of the segments, and a second pass using acoustic models adapted to each speaker cluster.

3 Statistical Machine Translation

Our machine translation system follows a statistical, corpus-based approach built on n-grams, which offers state-of-the-art results. A source sequence is translated by searching for the most probable target sequence under a bilingual N-gram-based model (Mariño et al., 2006).
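The search for the most probable target sequence can be written compactly. In the sketch below, which follows the general formulation of N-gram-based translation, the sentence pair is segmented into K bilingual units and the joint probability is approximated by an N-gram model over those units. The symbols u_k, K and N are notation introduced here for illustration.

```latex
\hat{t} \;=\; \arg\max_{t}\; p(s, t)
\;\approx\; \arg\max_{t}\; \prod_{k=1}^{K}
p\bigl(u_k \mid u_{k-N+1}, \dots, u_{k-1}\bigr),
\qquad u_k = (s, t)_k
```

Here s is the source sequence, t a candidate target sequence, and u_k the k-th bilingual unit pairing a source fragment with its translation.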

Spanish and Catalan are highly inflected languages, which leads to enormous variability in gender and number agreement, among other linguistic challenges. However, most of these challenges can be addressed by introducing morphological information and good-quality bilingual corpora. First, morphological information is introduced into the system by means of:

Monolingual expert rules. So far, three issues have been addressed: the Catalan apostrophe and clitics, and the Spanish conjunctions.

Categorization. Hours are context-independent and are categorized in order to produce the right forms. Furthermore, numbers written out in words do not appear in the training corpora, so they are categorized at an earlier stage and translated into their Arabic-numeral form.

Part-of-Speech (POS) information. Gender and number agreement is improved by using additional statistical information provided by the FreeLing tagger. A POS language model trained on the target language helps in the decoding decision.
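The categorization step in the list above can be illustrated with a small sketch: times and spelled-out numbers are replaced by category tags before translation, and their values are put back afterwards. The tag names, the time format and the tiny Spanish number table are assumptions for this example. The restore step also assumes that the tags come out of translation in the same order.

```python
import re
from typing import List, Tuple

# Tiny illustrative table; a real system would cover far more forms.
NUMBER_WORDS = {
    "uno": 1, "dos": 2, "tres": 3, "cuatro": 4, "cinco": 5,
    "seis": 6, "siete": 7, "ocho": 8, "nueve": 9, "diez": 10,
}
TIME_PATTERN = re.compile(r"^\d{1,2}[:.]\d{2}$")   # e.g. 10:30 or 9.15
TAGS = ("<HOUR>", "<NUM>")


def categorize(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Replace times and number words by tags; keep their values aside.

    Number words are stored directly in their Arabic-numeral form.
    """
    tagged, values = [], []
    for tok in tokens:
        if TIME_PATTERN.match(tok):
            tagged.append("<HOUR>")
            values.append(tok)
        elif tok.lower() in NUMBER_WORDS:
            tagged.append("<NUM>")
            values.append(str(NUMBER_WORDS[tok.lower()]))
        else:
            tagged.append(tok)
    return tagged, values


def restore(tokens: List[str], values: List[str]) -> List[str]:
    """Put stored values back in place of the tags, in order."""
    it = iter(values)
    return [next(it) if tok in TAGS else tok for tok in tokens]


tagged, values = categorize("llegan tres trenes a las 10:30".split())
restored = restore(tagged, values)
```

Between `categorize` and `restore`, the tagged sentence would be passed through the translation system, which then only has to learn how the tags behave rather than every individual number or time.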

4 Speech Synthesis

The quality of corpus-based speech synthesis systems has improved considerably in recent years. The new goal is expressive speech synthesis that imitates the human voice in several styles (e.g. reading and talking). A breakthrough in speech synthesis requires the development of new models for prosody, emotions, and expressive speech in general.

The main goal of the project concerning speech synthesis is the production of a state-of-the-art Catalan text-to-speech (TTS) system and its integration into a speech-to-speech translation application. This system must be capable of expressing the speaking styles, accents, and voice quality parameters specified in an input message or text.

In order to achieve this general goal, the following tasks are considered: adequate interpretation of the input message or text; production of expressive speech in several speaking styles, voices and languages; new speech generation and voice adaptation algorithms; new voices based on Hidden Markov Models (HMM); and the development of Catalan-Spanish bilingual voices.

5 Achievements

Throughout the project, resources that facilitate the development of models and lexica for the technologies involved have been acquired. Initial acoustic and language models for speech recognition, and translation models for statistical machine translation, have been developed and are to be refined. An overall system architecture has been established, integrating the described components with UIMA.

Bibliography

Lööf, J., C. Gollan, S. Hahn, G. Heigold, B. Hoffmeister, C. Plahl, D. Rybach, R. Schlüter, and H. Ney. 2007. The RWTH 2007 TC-STAR Evaluation System for European English and Spanish. In Interspeech, pages 2145–2148, Antwerp, Belgium.

Mariño, J.B., R.E. Banchs, J.M. Crego, A. de Gispert, P. Lambert, J.A.R. Fonollosa, and M.R. Costa-jussà. 2006. N-gram-based machine translation. Computational Linguistics, 32(4):527–549, December.

Henrik Schulz, Marta R. Costa-Jussà y José A. R. Fonollosa


Additional Information

Author Index

Ageno, A. 317 Aldezabal, I. 147 Alegría, I. 5 Almeida, J.J. 281 An Ha, L. 107 Aranzabe, M. 147 Araujo, L. 165 Arcas‐Túnez, F. 137 Armentano‐Oller, C. 243 Arrieta, B. 5 Arriola, J.M. 13

Badia i Cardús, T. 307 Balahur, A. 201 Barra Chicote, R. 251 Bel, N. 311 Bertrán, M. 291 Blancafort, H. 113 Borrega, O. 291

Calle, F.J. 209, 293 Carlos Moura, A. 303 Carrera, J. 21 Carreras Pérez, X. 5 Castaño, L. 299 Castellón, I. 21 Castro, E. 299 Ceccaroni, L. 313 Clergerie, E. 129 Codina, V. 313 Conde, D. 293 Corpas, G. 107 Crespo Miguel, M. 65 Cruz Mata, F.L. 73 Cuadra, D. 209, 293 Cuadros, M. 121

Chamizo, J. 315 Chiarcos, C. 155

D'Haro, L.F. 251 Díaz de Ilarraza, A. 5, 147 Díaz, A. 191

Enríquez, F. 73 Errecalde, M. 81

Farré, J. 129 Fernández, E. 147 Fernández, F. 251 Ferrández, O. 47, 183 Ferrández, S. 47 Fonollosa, J.A. 259, 267, 319 Forcada, M. 243 Fraile, C. 295 Fresno, V. 173 Frías Delgado, A. 65

Gallo, B. 251 Garrote Salazar, M. 297 Gervás, P. 29, 191, 217 Gómez, J.M. 315 Gómez‐Pérez, J.M. 299 Gonzalo, J. 317

Guiraó Miras, J.M. 297, 301

Herrera de la Cruz, J. 29 Hervás, R. 217

Iglesias, A. 299 Ingaramo, D. 81 Izquierdo‐Bevia, R. 47

Jiménez, J. 305

Khalilov, M. 259 Kohler, S. 299

Lehmberg, T. 155 Leoni de León, J.A. 37 López de Lacalle, M. 273 López, A. 303 Loupy, C. 113 Lucas, J.M. 251 Lloberes, M. 21 Llorens, H. 55 Lloret, E. 183

Marimon, M. 311 Martí, M.A. 317 Martín, M.T. 317 Martínez Romo, J. 165 Martínez, P. 299 Martínez‐Barco, P. 317 Melero, R. 299 Mitkov, R. 107 Molinero, M.A. 129 Montoyo, A. 201 Moreda, P. 55 Moreno, A. 301 Moreno, J. 293 Muñoz, R. 183

Nicolas, L. 129

Odriozola, J.C. 13 Olaziregi, G. 293 Ortega, F.J. 73

Padró, L. 21 Palomar, M. 55, 183, 317 Pascual‐Nieto, I. 225 Peñas, A. 89, 317 Pérez, R. 299 Pérez‐Agüera, J.R. 173 Pérez‐Iglesias, J. 173 Pérez‐Marín, D. 225 Periñán‐Pascual, C. 137 Pichel, J.R. 303 Pla, F. 317 Plaza Morales, L. 191

Rangel, F.M. 89 Recasens, M. 291 Rehm, G. 155 Revuelta, P. 305 Rigau, G. 121 Rivero, J. 209, 293

Rodríguez, P. 225 Rosso, P. 81, 317 Ruiz Costa‐Jussa, M. 267, 319 Ruiz, B. 305

Sagot, B. 129 San Vicente, I. 273 Sánchez, J.M. 305 San‐Segundo, R. 251 Saquete, E. 55 Saralegi, X. 273 Schonefeld, O. 155 Schulz, H. 319 Schwab, S. 37 Sidorova, J. 307 Simões, A. 281 Soriano, B. 291 Sotelsek‐Margalef, A. 97

Taulé, M. 317 Tincova, N. 21 Torre, D. 301 Troyano, J.A. 73 Turmo, J. 317

Ureña, L.A. 317 Uria, L. 5

Vaamonde, G. 233 Valle, D. 209, 293 Van der Beek, L. 295 Vicedo, J.L. 47 Villena‐Román, J. 97

Wehrli, E. 37 Witt, A. 155

Submission of Papers

Articles

Theoretical or experimental articles related to the field of Natural Language Processing are accepted. Proposals must be submitted before May 7, 2008. Authors are advised to use the LaTeX and Word templates available on the conference website (http://basesdatos.uc3m.es/sepln2008/web/).

Proposals will be submitted and reviewed in electronic format (PDF or PostScript) through EasyChair (http://www.easychair.org/conferences/?conf=sepln2008). In general, proposals must meet the following requirements:

- They will include the title of the article.
- They will include the authors' names, affiliations, and postal and e-mail addresses.
- They will include a bilingual abstract (English and Spanish) of at most 150 words each, including a list of related topics or keywords.
- The article may be written in Spanish or English, with a maximum total length of 3500 words.

Projects and Demonstrations

As in previous editions, the organizers encourage participants to present R&D projects and to give demonstrations of systems or software tools related to the field of NLP.

Proposals will be submitted and reviewed exclusively in electronic format (PDF or PostScript) through the EasyChair system (http://www.easychair.org/conferences/?conf=sepln2008). The LaTeX and Word templates for projects and demonstrations available on the conference website (http://basesdatos.uc3m.es/sepln2008/web/) may be used. Project presentations must include the following information: project title; sponsors; groups participating in the project; name, affiliation, e-mail and telephone number of the project director; and an abstract of at most two pages.

Demonstrations must include the following information: title of the demonstration; name, affiliation, address and e-mail of the authors; an abstract of at most two pages; and an estimate of how long the system demonstration lasts.

All of this must be received by the conference organizers before June 6, 2008.

Publication Format

The final version of the papers must be sent before June 27, 2008 through EasyChair (http://www.easychair.org/conferences/?conf=sepln2008), accessible from the conference website.

- Documents must not include headers or footers.
- The maximum length is 8 DIN A4 pages (210 x 297 mm), including references and figures.
- For projects and demonstrations, the maximum length is 2 pages.

Articles must be sent in PostScript or PDF format. The LaTeX and Word templates included on the conference website (http://basesdatos.uc3m.es/sepln2008/web/) may be used.



Submissions

Communications

Authors are encouraged to submit theoretical or system-related proposals to be presented at the demonstration sessions, all of them related to the field of Natural Language Processing. Proposals must be submitted before May 7, 2008, and must meet certain format and style requirements. We recommend using the LaTeX and Word templates that can be downloaded from the conference webpage (http://basesdatos.uc3m.es/sepln2008/web/). Both the delivery and the review of proposals will be done exclusively in electronic format (PostScript or PDF) via EasyChair (http://www.easychair.org/conferences/?conf=sepln2008). In general, proposals must meet the following requirements:

- Proposals will include the title of the communication.
- Proposals will include the complete names of the authors, their address, telephone, fax and e-mail.
- Proposals will include an abstract in English and Spanish (maximum 150 words). Keywords or related topics must be specified in the proposal.
- The proposal can be written and presented in Spanish or English, and its overall maximum length will be 3,500 words.

Projects and Demos

As in previous editions, the organizers encourage participants to give oral presentations of R&D projects and demos of systems or tools related to the NLP field. As with communication proposals, both the delivery and the review of project and demo proposals will be done exclusively in electronic format (PostScript or PDF) via EasyChair (http://www.easychair.org/conferences/?conf=sepln2008). We recommend using the LaTeX and Word templates for projects and demos that can be downloaded from the conference webpage (http://basesdatos.uc3m.es/sepln2008/web/). For an oral presentation of an R&D project to be accepted, the following information must be included: project title; funding institution; participant groups in the project; name, affiliation, e-mail and phone number of the project director; and an abstract (2 pages maximum).

For a demonstration to be accepted, the following information is mandatory: demo title; name, affiliation, e-mail and phone number of the authors; an abstract (2 pages maximum); and a time estimate for the whole presentation.

This information must be received by the conference organization before June 6, 2008.

Publication Format

The final version of the article must be sent before June 27, 2008 through the same web system (EasyChair, http://www.easychair.org/conferences/?conf=sepln2008).

- Documents must not include headers or footers.
- Maximum length will be 8 pages DIN A4 (210 x 297 mm), including references and figures.
- In the case of demonstrations and projects, maximum length will be 2 pages.

We recommend using the LaTeX and Word templates included on the conference webpage (http://basesdatos.uc3m.es/sepln2008/web/).
