
XVIII CONGRESO DE LA SOCIEDAD ESPAÑOLA PARA EL PROCESAMIENTO DEL LENGUAJE NATURAL 81

…achieved important results in the automatic detection of hyponyms and hypernyms through statistical methods (prioritizing, above all, those that support a machine-learning process), these proposals seem not to take into account the claims formulated by semantic theories (whether of a logical-formal or a functional-cognitive orientation) that would explain the linguistic nature underlying lexical relations.

Paraphrasing Manning and Schütze (1999), one could say that the experiments carried out to identify such relations place a high value on the quantity of hyponyms and hypernyms that can be generated, depending on the probabilistic method employed, rather than on considering whether the results show how semantic compositionality operates to generate hyponymy/hypernymy relations. Settling this question is not trivial: for Pustejovsky (1995), Partee (1995), Jackendoff (2010), and Pinker (2011), the analysis of the compositionality phenomenon underlying words and phrases[1] is essential to understanding how the semantics of any natural language works, so that a grasp of it can have a positive impact on improving the results obtained in the extraction of lexical relations.

In this proposal we analyze a semantic-compositionality phenomenon manifest in hyponymy/hypernymy relations: the selection of relational adjectives, which are conceptually complex and linked to the expression of more specific concepts, since they project a set of properties onto the hypernym, in contrast with qualifying adjectives, which contribute an evaluation of little or no relevance to the domain in question.

Our article is organized as follows: in Section 2 we describe the stages and methods commonly applied in the extraction of hyponyms and hypernyms. In Section 3 we detail the problems raised by the mere application of probabilistic methods to evaluate the Precision of the extraction, without taking into account the conceptual particularities projected in specialized knowledge domains. In Section 4 we present a perspective on the classification of inclusion relations relevant to the exploration carried out in this work. In Section 5 we establish our objects of study: the distinction between qualifying and relational adjectives, the principle of semantic compositionality, and the relation between hypernyms and their lexical fields. In Section 6 we present our heuristics for generating a set of qualifying adjectives and thereby filtering out non-relevant hyponyms. In Section 7 we describe our experiment and report our results, and finally in Section 8 we offer a discussion together with some preliminary remarks on the results obtained.

[1] In this work we use the term frase ("phrase") in a sense similar to that of sintagma, as regularly used in linguistic studies produced in Spain (e.g., Bosque and Demonte, 1999). By contrast, in Latin America, including Mexico, the concept frase has seen wider use, applied both in studies of formal syntax and in computational-linguistics work (e.g., Pineda and Meza, 2003; Galicia and Gelbukh, 2007).

2 Stages and methods considered in the extraction of hyponyms and hypernyms

As noted above, a good part of the progress achieved in hyponym and hypernym extraction tasks results from applying hybrid methods. In summary, and following the accounts given by Girju, Badulescu, and Moldovan (2006), Ritter, Soderland, and Etzioni (2009), and Ortega et al. (2011), all of them consider at least some of the following stages:

- Selection of a set of seed instances that characterize the relation of interest.
- From this set, a set of lexical-syntactic patterns that project hyponyms and hypernyms is inferred, e.g.: el <hipónimo> es un <hiperónimo> ("the <hyponym> is a <hypernym>"), etc.
- Using this set of learned patterns, new instances of the relation are obtained by exploring a text corpus (or, as the case may be, the Web). This process is repeated until no new instances can be generated.
- An assessment is made of the confidence level shown by the candidate hyponyms and hypernyms obtained, as well as by the inferred patterns. For this purpose, word-association measures have mainly been used,

Copyright © 2012. Universitat Jaume I. Servei de Comunicació i Publicacions. All rights reserved.

<i>XVIII Congreso de la Asociación Española para el Procesamiento del Lenguaje Natural</i>, edited by Llavorí, Rafael Berlanga, et al., Universitat Jaume I. Servei de
Comunicació i Publicacions, 2012. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliotecauptsp/detail.action?docID=4184256.
Created from bibliotecauptsp on 2019-09-28 09:35:08.

such as PMI (Church and Hanks, 1990; Hearst, 1992; Pantel and Pennacchiotti, 2006), entropy measures between word pairs (Ryu and Choi, 2005), and vector computations to measure the conceptual distance between words (Ritter, Soderland, and Etzioni, 2009).
- As a supporting mechanism to corroborate whether the candidate hyponyms and hypernyms hold a canonical relation, authors such as Hearst (1992) and Ritter, Soderland, and Etzioni (2009) use the lexical database WordNet (Fellbaum, 1998) as a reference source.
- Once the information content that a pair of words shares as hyponym and hypernym has been corroborated, an evaluation follows to determine the degree of Precision and Recall (Van Rijsbergen, 1979) achieved with the method employed, making adjustments with an F-measure where required (Ortega, Villaseñor, and Montes, 2007; Ortega et al., 2011).

3 Problems in selecting hyponyms and hypernyms relevant to a knowledge domain

Considering the extraction methods just mentioned, a characteristic feature of all of them is that they prioritize association measures to assess the degree of conceptual closeness or distance between a pair of words standing in a possible hyponymy/hypernymy relation, together with an evaluation that lets them determine how precise the detected hyponyms and hypernyms are, and how many could be recovered through adjustments that improve the performance of the system designed for this task.

Although this priority answers practical needs (concretely, offering a list of reliable hyponym and hypernym candidates, within a reasonable processing time, from which regular patterns of constitution can be deduced), there are problems that cannot be solved effectively by relying on probabilistic criteria alone; they call instead for a more detailed analysis of the role that concrete phenomena play when choosing hyponyms and hypernyms that are relevant to a knowledge domain.

On the level of linguistic analysis, most of these methods have focused on finding new instances of hyponyms and hypernyms from a set of seed instances recognizable in sentential contexts (Hearst, 1992; Pantel and Pennacchiotti, 2006; Ritter, Soderland, and Etzioni, 2009; Ortega, Villaseñor, and Montes, 2007; Ortega et al., 2011). However, the potential for hyponymy relations that a hypernym can generate in its role as the head of a noun phrase has not yet been considered. In line with Croft and Cruse (2004), we believe that a single-word hypernym plus a semantic feature can generate relevant hyponyms that account for the structure of a knowledge domain and likewise reflect classification perspectives on a hypernym.

Following this idea, in this work we focus on noun + adjective phrases, bearing in mind the semantic function that adjectives fulfill as units that express and prioritize conceptual features, whose selection can be conditioned by the knowledge domain in which they occur, as is the case of medical terminology.

We therefore consider that, if Croft and Cruse's observation is not taken into account, the mere use of information measures to recognize word pairs as hyponym candidates raises serious difficulties, above all when determining whether such hyponyms are conceptually relevant or not for the specialists of a given knowledge domain.

4 Inclusion relations between hyponyms and hypernyms

For Croft and Cruse (2004), hyponymy-hypernymy relations are inclusion relations, in particular of two types: simple hyponymy and taxonymy. Simple hyponymy can be represented linguistically as X is a Y. Taxonymy, in turn, can be exemplified by the linguistic construction X is a type/kind of Y. The latter relation discriminates more than simple hyponymy and generally yields a taxonomic relation. In addition, Croft and Cruse point out that in many cases


where a good hyponym is not a good taxonym of a hypernym, there is a direct definition of the hyponym in terms of the hypernym plus a simple semantic feature, e.g.:

Semental = Caballo macho ("stallion = male horse")

Despite this, it is not possible to explain why, in some cases, certain hyponyms with a simple semantic feature could indeed represent a good taxonomy, while in other cases they do not. One example:

Compositionality: C(cuchara, w) = (de té, de café, de sopa, …)

The above taxonomy emphasizes the function of the object cuchara ("spoon"), and so may be relevant for some purposes. In the following case, by contrast:

Compositionality: C(cuchara, w) = (redonda, profunda, grande, …)

we have features that are conceptually simple and of little or no use for building a classification of the hypernym. A question that arises here is: could conceptually simple features be indicative of non-relevant hyponyms? If the answer is affirmative, then it is necessary to discern whether a given relation exhibits conceptually simple or complex features, so as to help locate hyponyms that express general evaluations versus those that configure the hierarchical conceptual network underlying a specialized domain.

5 Relations between relational adjectives and nouns

According to Demonte (1999), the adjective, besides being a grammatical category that modifies the noun, is also a word class with very precise formal characteristics, as well as a semantic category, since there are meanings that are best expressed by means of adjectives.

From a terminological standpoint, Saurí (1997) notes that adjectives are important in the construction of terms, since they can conceptually insert semantic features that establish clear boundaries between entities or events of a knowledge domain (e.g., inflamación intestinal versus inflamación gastrointestinal), which helps particularize their meaning in contrast with concepts of the general language (e.g., enfermedades gástricas versus enfermedades del estómago).

Returning to Demonte (1999), there are two classes of adjectives that assign properties to nouns: qualifying and relational adjectives. The difference between the two lies in the number of properties each one carries, as well as in the way they link to the noun. On the one hand, qualifying adjectives refer to a constitutive feature of the modified noun, exhibiting or characterizing a single physical property (color, shape, character, etc.): libro azul ("blue book"), señora delgada ("slim lady"), hombre simpático ("friendly man"), and the like.

On the other hand, relational adjectives refer to a set of properties or characteristics that can be linked to a concrete entity or event, e.g.: puerto marítimo ("seaport"), vaca lechera ("dairy cow"), paseo campestre ("country walk"), etc.

Given the above, our proposal is to focus attention on relational adjectives and, to tell them apart from qualifying ones, we draw on Demonte's observations for differentiating them.

5.1 Semantic compositionality

The alternation between relational and qualifying adjectives can be explained in terms of semantic compositionality. For our purposes, we understand semantic compositionality as a principle that regulates the assignment of specific meanings to each of the lexical units composing a phrase structure, depending on the syntactic configuration that such a structure assumes (Partee, 1995). In this way, the combinations the lexical units enter into determine the global meaning of a phrase or sentence, generating not only isolated lexical units but also blocks that refer to specific concepts (Jackendoff, 2002).

Following Pustejovsky (1995), as well as Croft and Cruse (2004), we can consider that these blocks refer to specific concepts, since their selection of meaning features (or qualia structures) is directly influenced by the knowledge domain in which they are immersed.

Thus, a term such as inflamación gastrointestinal operates as a taxonym-type hyponym that is richer in specific information than a simple hypernym such as inflamación. The configuration of


this specific meaning is due to a process of semantic compositionality, introduced to establish differences between related concepts.

5.2 Hypernyms and their lexical fields

The hypernym, given its status as a generic category, can stand in direct relation to more than one modifier reflecting specific concepts or categories (e.g., enfermedad cardiovascular, "cardiovascular disease") or simply context-sensitive evaluations (e.g., enfermedad rara, "rare disease"). Thus, for the hypernym enfermedad we find a set of 132 relations, of which 76 can be considered relevant (58%). If we consider an association measure such as the normalized pointwise mutual information (PMI) proposed by Bouma (2009), traditionally used in collocation extraction, the 10 most relevant relations are those in Table 1:

Table 1. Adjectives with the highest PMI
C(enfermedad, wi)  PMI
Transmisible  0.59
Prevenible  0.52
Diarreica  0.45
Diverticular  0.44
Indicadora  0.41
Autoinmunitaria  0.39
Aterosclerótica  0.39
Meningocócica  0.39
Cardiovascular  0.38
Pulmonar  0.37

As can be seen in Table 1, there are two non-relevant adjectives among the first 10 relations. Looking at the data in Table 2, it becomes clearer how many non-relevant adjectives relate to enfermedad: 40% of these first 50 adjectives:

Table 2. First 50 adjectives with the highest PMI
C(enfermedad, wi): transmisible, prevenible, diarreica, diverticular, indicadora, autoinmunitaria, aterosclerótica, meningocócica, cardiovascular, pulmonar, afecto, febril, agravante, hepática, seudogripal, periodontal, sujeto, bacteriano, emergente, benigno, parasitaria, postrombótica, bacteriémica, coexistente, catastrófica, exclusiva, vectorial, supurativa, infecciosa, debilitante, digestiva, invasora, rara, inflamatoria, esporádica, antimembrana, predisponente, ulcerosa, contagiosa, cardiaca, sistémica, activa, grave, preexistente, miocárdica, somática, fulminante, atribuible, linfoproliferativa.

On the other hand, if we consider a relational adjective from Table 2, for example cardiovascular, it too modifies a set of nouns, as shown in Table 3:

Table 3. Nouns modified by the relational adjective cardiovascular and, for contrast, by the qualifying adjective rara
C(wi, cardiovascular): efecto, problema, función, evento, examen, inestabilidad, trastorno, enfermedad, bypass, causa, sistema, reparador, descompensación, cirugía, operación, mortalidad, aparato, educación, síntoma, eficiencia, episodio, riesgo, investigación, manifestación, afección, medicamento, director, muerte, salud
C(wi, rara): congreso, televisión, enfermedad, relación, complicación, infancia, niño, color, obesidad, mhc, nucleótido, sustancia, beneficio, mutación, trastorno, grupo, meconio, epistaxis, derecha, síndrome, cáncer, alelo, forma, caso, párpado

Hence, both the hypernym and the adjective, whether relational or qualifying, can be linked to other elements, a situation that shows how the compositionality principle operates here, detracting from the Precision of association measures for detecting useful relations.

6 Linguistic heuristics for filtering relevant hyponyms

In order to obtain a stop list of qualifying adjectives from the same input source, we apply Demonte's (1999) criteria for distinguishing qualifying from relational adjectives, together with a word-order criterion indicating the precedence of adjectives over nouns. Broadly, our heuristics consider:

1. The possibility of an adjective being used predicatively: El método es importante ("The method is important"). To obtain these constructions we use the regular expression <VSFIN><ADJ>.
2. An adjective taking part in comparisons, so that its meaning is modified by degree adverbs: relativamente rápido ("relatively fast"). To obtain these constructions we use the expression <ADV><ADJ>.
3. The adjective preceding the noun: Una grave enfermedad ("A serious disease"): <ART><ADJ><NC>.
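The three heuristics above can be sketched as a simple scan over a POS-tagged token stream. The following is a minimal illustration rather than the authors' implementation: the tag names (VSFIN, ADV, ADJ, ART, NC) are taken from the patterns quoted in the text, while the function name and the toy tagged sample are hypothetical.

```python
def qualifying_adjectives(tagged):
    """Minimal sketch of the three stop-list heuristics (Section 6),
    assuming TreeTagger-style (word, tag) pairs; the tag names VSFIN,
    ADV, ADJ, ART, NC follow the patterns quoted in the text."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    stop = set()
    for i in range(len(tagged) - 1):
        # Heuristic 1: predicative use, e.g. "El método es importante".
        if tags[i] == "VSFIN" and tags[i + 1] == "ADJ":
            stop.add(words[i + 1].lower())
        # Heuristic 2: degree modification, e.g. "relativamente rápido".
        if tags[i] == "ADV" and tags[i + 1] == "ADJ":
            stop.add(words[i + 1].lower())
        # Heuristic 3: prenominal position, e.g. "una grave enfermedad".
        if (i + 2 < len(tagged) and tags[i] == "ART"
                and tags[i + 1] == "ADJ" and tags[i + 2] == "NC"):
            stop.add(words[i + 1].lower())
    return stop

# Hypothetical tagged fragment for illustration only.
tagged = [("El", "ART"), ("método", "NC"), ("es", "VSFIN"),
          ("importante", "ADJ"), ("una", "ART"), ("grave", "ADJ"),
          ("enfermedad", "NC"), ("relativamente", "ADV"),
          ("rápido", "ADJ")]
print(sorted(qualifying_adjectives(tagged)))  # ['grave', 'importante', 'rápido']
```

Adjectives collected this way form the stop list; relational adjectives, which resist these contexts, survive the filter.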


7 Methodology

Below we detail the methodology followed in the development of our experiment.

7.1 Analysis corpus

Our corpus consists of a set of documents from the medical domain, basically diseases of the human body and related topics (surgeries, treatments, tests, etc.), taken from MedlinePlus in Spanish (www.ncbi.nlm.nih.gov/pubmed/). Two textbooks on related topics were also added.

In total, the document set amounts to 750,000 words. We selected a medical domain for reasons of availability of textual resources in digital format. Moreover, we assume that the choice of this domain does not impose strong restrictions on the generalization of results.

7.2 Extraction tools

The programming language used to automate all required tasks was Python, specifically the NLTK module (Bird, Klein, and Loper, 2009). We also rely on lexical-semantic patterns, which have a higher degree of generality, so we assume from the start a corpus with part-of-speech tagging. Part-of-speech tagging was performed with TreeTagger (Schmid, 1994).

7.3 Automatic extraction of DCs

Following the methodology proposed by Sierra et al. (2010), as well as Acosta, Sierra, and Aguilar (2011), we extract a set of the most frequent hypernyms detected in definitional contexts (DCs) and take them to a second stage of hyponym extraction, considering only adjectives as modifiers of the hypernym.

7.4 Building a stop list of non-relevant adjectives

At this point we assume that qualifying adjectives exhibit conceptually simple features of little use for generating relevant hyponyms. Accordingly, we automatically obtain a set of adjectives that will form the stop list, via the application of the heuristics described above.

7.5 Extraction of hyponyms derived from hypernyms

After filtering out qualifying adjectives, we obtain all the adjective modifiers, together with their PMI score. We then evaluate our method by comparing the Precision, Recall, and F-measure levels obtained by the two approaches: PMI and linguistic heuristics.

8 Results

In this section we report the results obtained with our method.

8.1 Initial production of candidate relations

We ran our experiment considering hypernyms with an occurrence frequency of 5 or higher, following the criteria proposed by Acosta, Sierra, and Aguilar (2011). Table 4 shows the first 10 hypernyms ordered by their level of relation productivity (RP) and the relations considered relevant (RRs) for the analysis domain, together with the initial Precision obtained (P). We take this initial production, which is guided only by the potential hypernym, as our baseline:

Table 4. Most frequent hypernyms
Hypernym  RP  RRs  P
Enfermedad  132  76  58
Infección  125  69  55
Tratamiento  112  39  35
Vacuna  79  41  52
Problema  67  40  60
Afección  64  38  59
Trastorno  61  45  74
Examen  60  33  55
Dolor  54  26  48
Célula  47  22  47

8.2 Ranking by PMI

We use the PMI measure in the normalized version proposed by Bouma (2009), whose normalization answers two fundamental concerns: using association measures whose values have a fixed interpretation, and reducing sensitivity to low occurrence frequencies in the data. The normalized PMI formula is the following:

npmi(x, y) = pmi(x, y) / (-log p(x, y)) = log[ p(x, y) / (p(x) p(y)) ] / (-log p(x, y))
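For concreteness, the normalized PMI of Bouma (2009) can be computed directly from corpus counts. The sketch below is illustrative rather than the authors' code, and the example counts are invented, not taken from the paper's data.

```python
import math

def npmi(pair_count, x_count, y_count, total):
    """Normalized PMI (Bouma, 2009): pmi(x, y) / -log p(x, y).
    Values range from -1 (never co-occur) through 0 (independence)
    to 1 (always co-occur), which gives thresholds a fixed
    interpretation and damps the effect of low frequencies."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)

# Invented counts: the pair (enfermedad, cardiovascular) seen 24 times
# in a 750,000-token corpus.
score = npmi(pair_count=24, x_count=132, y_count=40, total=750_000)
print(round(score, 2))
```

At independence the numerator vanishes, so the score is 0 regardless of frequency; this is what makes a fixed threshold (e.g., 0.10 or 0.20 below) meaningful across word pairs.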


From the results obtained, we observe that when PMI thresholds are set to filter relevant relations, although Recall remains fairly high as we raise the threshold, Precision improves only insignificantly, as Table 5 shows:

Table 5. Precision, Recall, and F-measure per PMI threshold
PMI threshold  R  P  F
Initial production  100  51  68
Threshold > 0  98  51  67
Threshold >= 0.10  93  53  68
Threshold >= 0.20  83  55  66

8.3 Filtering non-relevant adjectives

Following Croft and Cruse's (2004) remarks on hyponyms that can be described by a hypernym plus a feature, we consider that discarding qualifying adjectives, which frequently do not relate to important terms or concepts, makes it possible to improve the extraction of relevant hyponyms. The results obtained are summarized in Table 6:

Table 6. Precision, Recall, and F-measure per heuristic
Heuristic  R  P  F
Adjective gradability  84  67  75
Gradability and adjective precedence  82  69  75
Gradability, precedence, and adjective predication  77  76  76

Table 6 shows better performance in the Precision and Recall measures using the heuristics, compared with setting PMI thresholds. However, one phenomenon we observe is that the heuristics may also capture a set of relevant relational adjectives; the error rate we found in our corpus is shown in Table 7.

Table 7. Error rate obtained for the linguistic heuristics
Pattern  Error rate (%)
<ADV><ADJ>  18
<VSFIN><ADJ>  17
<ADJ><NC>  15

Demonte (1999) notes that the semantic features that best distinguish qualifying from relational adjectives are gradability and polarity. It should be pointed out that the latter feature was not considered in our analysis, given the complexity of handling it computationally. In its place, we took into account the heuristic of adjective precedence before a noun, which, for the corpus under analysis, yields a lower error rate. Based on an exploration of the Spanish corpus of the Sketch Engine (Kilgarriff et al., 2004), we observe that the proposed heuristics do yield more than 95% qualifying adjectives; however, when specialized domains are considered, these heuristics are not as precise.

9 Final remarks

In this work we presented a comparison between two approaches for obtaining relevant hyponymy relations that can arise from a hypernym within a medical knowledge domain. The results obtained empirically support the idea put forward by Croft and Cruse (2004): a good hyponym is not necessarily a good taxonym of a hypernym.

The key point in this discussion is that we can generate a large number of relevant hyponyms that have a hypernym as their head. Unfortunately, given the generic nature of single-word hypernyms, these can be directly linked to a large number of modifiers at the adjective and prepositional-phrase level.

In this work we considered only adjective modifiers, among which we observed a large number of qualifying and relational adjectives, the latter being, from our perspective, better hyponym candidates for a hypernym. The high degree of compositionality present in the relation between hypernyms and relational adjectives is notable, and it detracts from the Precision of association measures for selecting the relevant relations. It is precisely in these scenarios that the regularity of language, as Manning and Schütze (1999) note, is what makes methods for disambiguation, parsing, and, in our particular case, the extraction of


relevant hyponyms, acquire great importance.

Bibliography

Acosta, O., C. Aguilar, and G. Sierra. 2010. A Method for Extracting Hyponymy-Hypernymy Relations from Specialized Corpora Using Genus Terms. In: Proceedings of the Workshop in Natural Language Processing and Web-based Technologies 2010, pages 1-10, Universidad Nacional de Córdoba (Argentina).

Acosta, O., G. Sierra, and C. Aguilar. 2011. Extraction of Definitional Contexts using Lexical Relations. International Journal of Computer Applications, 34(6): 46-53.

Berland, M., and E. Charniak. 1999. Finding parts in very large corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 57-64, Orlando (USA).

Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python. O'Reilly, Sebastopol (Cal., USA).

Bosque, I., and V. Demonte. 1999. Gramática descriptiva de la lengua española, 3 volumes. Espasa-Calpe (Madrid).

Bouma, G. 2009. Normalized (Pointwise) Mutual Information in Collocation Extraction. In: From Form to Meaning: Processing Texts Automatically, Proceedings of the Biennial GSCL Conference, pages 31-40, Gunter Narr Verlag (Tübingen).

Church, K., and P. Hanks. 1990. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, 16(1): 22-29.

Croft, W., and D. Cruse. 2004. Cognitive Linguistics. Cambridge University Press (Cambridge, UK).

Cruse, D. 1986. Lexical Semantics. Cambridge University Press (Cambridge, UK).

Demonte, V. 1999. El adjetivo. Clases y usos. La posición del adjetivo en el sintagma nominal. In: Gramática descriptiva de la lengua española, Vol. 1, Ch. 3, pages 129-215. Espasa-Calpe (Madrid).

Galicia, S., and A. Gelbukh. 2007. Investigaciones en análisis sintáctico del español. Instituto Politécnico Nacional (México DF).

Girju, R., A. Badulescu, and D. Moldovan. 2006. Automatic Discovery of Part-Whole Relations. Computational Linguistics, 32(1): 83-135.

Hearst, M. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of COLING-92, pages 539-545, Nantes (France).

Jackendoff, R. 2002. Foundations of Language: Brain, Meaning, Grammar, Evolution. Oxford University Press (Oxford, UK).

Kilgarriff, A., P. Rychly, P. Smrz, and D. Tugwell. 2004. The Sketch Engine. In: Proceedings of the 11th EURALEX International Congress, pages 105-116, Lorient (France).

Manning, Ch., and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (Mass., USA).

Ortega, R., L. Villaseñor, and M. Montes. 2007. Using lexical patterns for extracting hyponyms from the Web. In: Proceedings of MICAI, LNCS, Springer (Berlin).

Ortega, R., C. Aguilar, L. Villaseñor, M. Montes, and G. Sierra. 2011. Hacia la identificación de relaciones de hiponimia/hiperonimia en Internet. Revista Signos. Estudios de Lingüística, 44(75): 68-84.

Pantel, P., and M. Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In: Proceedings of the Conference on Computational Linguistics, ACL, Sydney (Australia).

Partee, B. 1995. Lexical Semantics and Compositionality. In: Invitation to Cognitive Science, Part I: Language, pages 311-36. MIT Press, Cambridge (Mass., USA).

Pineda, L., and I. Meza. 2003. Un modelo para la perífrasis española y el sistema de pronombres clíticos en HPSG. Estudios de Lingüística Aplicada, 38: 45-67.

Pinker, S. 1997. How the Mind Works. Norton & Company (New York).

Pustejovsky, J. 1995. The Generative Lexicon. MIT Press, Cambridge (Mass., USA).

Ritter, A., S. Soderland, and O. Etzioni. 2009. What is This, Anyway: Automatic Hypernym Discovery. In: Papers from the AAAI Spring Symposium, pages 88-93.

Ryu, P., and K. Choi. 2005. An Information-Theoretic Approach to Taxonomy Extraction for Ontology Learning. In: Ontology Learning from Text: Methods, Evaluation and Applications, pages 15-28, IOS Press (Amsterdam).

Saurí, R. 1997. Tractament lexicogràfic dels adjectius: aspectes a considerar. Papers de l'IULA: Monografies, Universitat Pompeu Fabra (Barcelona).

Schmid, H. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of the International Conference on New Methods in Language Processing: www.ims.uni-stuttgart.de/~schmid/TreeTagger.

Sierra, G., R. Alarcón, C. Aguilar, and C. Bach. 2010. Definitional verbal patterns for semantic relation extraction. In: Probing Semantic Relations: Exploration and Identification in Specialized Texts, pages 73-96. John Benjamins Publishing (Amsterdam/Philadelphia).

Snow, R., D. Jurafsky, and A. Ng. 2006. Semantic Taxonomy Induction from Heterogeneous Evidence. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 801-808, Sydney (Australia).

Van Rijsbergen, K. 1979. Information Retrieval. Butterworths (Ontario, Canada).
2nd International Workshop on Exploiting Large Knowledge Repositories (E-LKR)
1st International Workshop on Automatic Text Summarization for the Future (ATSF)
Organizers:

 Ernesto Jiménez‐Ruiz (University of Oxford) 
 María José Aramburu (Universitat Jaume I) 
 Roxana Dánger (Imperial College London) 
 Antonio Jimeno‐Yepes (National Library of Medicine, USA) 
 Horacio Saggion (Universitat Pompeu Fabra) 
 Elena Lloret (Universidad de Alicante) 
 Manuel Palomar (Universidad de Alicante) 

2nd International Workshop on Exploiting Large Knowledge Repositories (E-LKR)
Very large knowledge repositories (LKR) are being created, published and exploited in a wide range of fields, including Bioinformatics, Biomedicine, Geography, e-Government, and many others. Some well-known examples of LKRs include Wikipedia, large-scale Bioinformatics databases and ontologies such as those published by the EBI or the NIH (e.g. UMLS, GO), and government data repositories such as data.gov. These repositories are publicly available and can be used openly. Their exploitation offers many possibilities for improving current information systems, and opens new challenges and research opportunities in the information processing, databases and semantic web areas.

The  main  goal  of  this  workshop  is  to  bring  together  researchers  that  are  working  on  the 
creation  of  new  LKRs  on  any  domain,  or  on  their  exploitation  for  specific  information 
processing  tasks  such  as  data  analysis,  text  mining,  natural  language  processing  and 
visualization,  as  well  as  for  knowledge  engineering  issues,  like  knowledge  acquisition, 
validation and personalization.  

Research, demo and position papers showing the benefits that exploiting LKRs can bring to the information processing area will be especially welcome at this workshop.

1st International Workshop on Automatic Text Summarization for the Future (ATSF)
Research on automatic text summarization started over 50 years ago and, although mature in some application domains (e.g., news), it faces new challenges in the current context of user-generated on-line content and social networks.

<i>XVIII Congreso de la Asociación Española para el Procesamiento del Lenguaje Natural</i>, edited by Llavorí, Rafael Berlanga, et al., Universitat Jaume I. Servei de
Comunicació i Publicacions, 2012. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliotecauptsp/detail.action?docID=4184256.
Created from bibliotecauptsp on 2019-09-28 09:35:08.

 Information on the Web is constantly updated, sometimes without any quality control; an important proportion of it is informal and ephemeral, a typical example being opinions and messages on the Internet.

 What techniques can be used to produce appropriate summaries in this context?
 How to measure relevance of ill-formed input?
 How to produce understandable summaries from noisy texts?
 How to identify the most relevant information in a set of opinions?

High-quality documentation, such as technical/scientific articles and patents, has not received all the attention that the field deserves. Given the explosion of technical documentation available on the Web and in intranets, scientists and research-and-development facilities face a true scientific information deluge: summarization should be a key instrument not only for reducing the information content but also for measuring information relevance in context, providing users with adequate answers.

 What  techniques  can  be  used  to  extract  knowledge  from  complex  technical 
documents? 
 How to compile back the information in a well formed summary? 
 How  to  measure  relevance  in  a  network  of  scientific  articles,  beyond  mere  citation 
counts? 

Another summarization research topic that lags behind is non-extractive summarization: the generation of a concise summary that is not a set of sentences from the input. This is a very difficult problem, since summarization systems must be able to adapt easily from one domain to another in order to recognize what is important, and to produce a coherent text from a textual or conceptual representation.

The workshop "Automatic Text Summarization for the Future" aims to bring together researchers and practitioners of natural language processing to address the aforementioned and related issues.

The full papers of this workshop have been published at: http://ceur-ws.org/Vol-882/

   

PROGRAM

A Challenge for Automatic Text Summarization 
Leo Wanner, ICREA and DTIC, UPF 

Towards  an  ontology  based  large  repository  for  managing  heterogeneous  knowledge 
resources 
Nizar Ghoula, Gilles Falquet 

Enhancing the expressiveness of linguistic structures 
José Mora, José A. Ramos, Guadalupe Aguado de Cea 

Integrating large knowledge repositories in multiagent ontologies 
Herlina Jayadianti, Carlos B. Sousa Pinto, Lukito Nugroho, Paulus Insap Santosa 

A proposal for a European large knowledge repository in advanced food composition tables 
for assessing dietary intake 
Oscar Coltell, Francisco Madueño, Zoe Falomir, Dolores Corella 

Disambiguating automatically-generated semantic annotations for Life Science open registries
Antonio Jimeno, Rafael Berlanga Llavori, María Pérez Catalán 

Redundancy reduction for multi‐document summaries using A* search and discriminative 
training 
Ahmet Aker, Trevor Cohn, Robert Gaizauskas 

A dependency relation‐based method to identify attributive relations and its application in 
text summarization 
Shamima Mithun, Leila Kosseim 

Short Papers 

Using biomedical databases as knowledge sources for large‐scale text mining 
Fabio Rinaldi 

Exploiting the UMLS metathesaurus in the ontology alignment evaluation initiative 
Ernesto Jiménez‐Ruiz, Bernardo Cuenca Grau, Ian Horrocks 

Statements of interest 

KB_Bio_101: a repository of graph‐structured knowledge 
Vinay K Chaudhri, Michael Wessel, Stijn Heymans 

If it's on web it's yours! 
Abdul Mateen Rajput 

   

TASS ‐ Taller de Análisis de Sentimientos  
en la SEPLN 
Organizers:

 Julio Villena (Daedalus, SA) 
 Sara Lana (Universidad Politécnica de Madrid) 
 Alfonso Ureña (Universidad de Jaén) 

According  to  Merriam‐Webster  dictionary,  reputation  is  the  overall  quality  or  character  of  a 
given  person  or  organization  as  seen  or  judged  by  people  in  general,  or,  in other  words,  the 
general  recognition  by  other  people  of  some  characteristics  or  abilities  for  a  given  entity. 
Specifically,  in  business,  reputation  comprises  the  actions  of  a  company  and  its  internal 
stakeholders  along  with  the  perception  of  consumers  about  the  business.  Reputation  affects 
attitudes  like  satisfaction,  commitment  and  trust,  and  drives  behaviour  like  loyalty  and 
support. In turn, reputation analysis is the process of tracking, investigating and reporting an 
entity's  actions  and  other  entities'  opinions  about  those  actions.  It  covers  many  factors  to 
calculate  the  market  value  of  reputation.  Reputation  analysis  has  come  into  wide  use  as  a 
major  factor  of  competitiveness  in  the  increasingly  complex  marketplace  of  personal  and 
business relationships among people and companies. 

Currently, market research is typically performed using user surveys. However, the rise of social media such as blogs and social networks, and the increasing amount of user-generated content in the form of reviews, recommendations, ratings and any other form of opinion, has led to the creation of an emerging trend towards online reputation analysis. The so-called
sentiment  analysis,  i.e.,  the  application  of  natural  language  processing  and  text  analytics  to 
identify  and  extract  subjective  information  from  texts,  which  is  the  first  step  towards  the 

online  reputation  analysis,  is  becoming  a  promising  topic  in  the  field  of  marketing  and 
customer  relationship  management,  as  the  social  media  and  its  associated  word‐of‐mouth 
effect is turning out to be the most important source of information for companies and their 
customers' sentiments towards their brands and products. 

Sentiment  analysis  is  a  major  technological  challenge.  The  task  is  so  hard  that  even  humans 
often disagree on the sentiment of a given text. The fact that issues that one individual finds 
acceptable or relevant may not be the same to others, along with multilingual aspects, cultural 
factors and different contexts make it very hard to classify a text written in a natural language 
into a positive or negative sentiment. And the shorter the text is, for example, when analyzing 
Twitter messages or short comments in Facebook, the harder the task becomes. 

Within this context, TASS is an experimental evaluation workshop, organized as a satellite event of the SEPLN 2012 Conference, that will be held on September 7th, 2012 at Jaume I University in Castellón de la Plana, Comunidad Valenciana, Spain, to foster research in the field of sentiment analysis in social media, specifically focused on the Spanish language. The main objective is to promote the application of existing state-of-the-art algorithms and techniques and the design of new ones for the implementation of complex systems able to perform a


sentiment  analysis  based  on  short  text  opinions  extracted  from  social  media  messages 
(specifically Twitter) published by a series of representative personalities. 

The  challenge  task  is  intended  to  provide  a  benchmark  forum  for  comparing  the  latest 
approaches in this field. In addition, with the creation and release of the fully tagged corpus, 
we aim to provide a benchmark dataset that enables researchers to compare their algorithms 
and systems. 

PROGRAM

Overview of TASS 2012 ‐ Workshop on Sentiment Analysis at SEPLN 
Julio Villena‐Román, Janine García‐Morera, Cristina Moreno‐García, Linda Ferrer‐Ureña, Sara 
Lana‐Serrano, José Carlos González‐Cristóbal, Adam Westerski, Eugenio Martínez‐Cámara, M. 
Ángel García‐Cumbreras, M. Teresa Martín‐Valdivia, L. Alfonso Ureña‐López .......................... 94 

TASS: Detecting Sentiments in Spanish Tweets 
Xabier Saralegi Urizar, Iñaki San Vicente Roncal ...................................................................... 103 

Techniques for Sentiment Analysis and Topic Detection of Spanish Tweets: Preliminary 
Report  
Antonio Fernández Anta, Philippe Morere; Luis Núñez Chiroque, Agustín Santos .................... 112 

The L2F Strategy for Sentiment Analysis and Topic Classification 
Fernando Batista, Ricardo Ribeiro ............................................................................................. 125 

Sentiment Analysis of Twitter messages based on Multinomial Naive Bayes 
Alexandre Trilla, Francesc Alías ................................................................................................. 129 

UNED at TASS 2012: Polarity Classification and Trending Topic System 
Tamara Martín‐Wanton, Jorge Carrillo de Albornoz ................................................................. 131 

UNED @ TASS: Using IR techniques for topic‐based sentiment analysis through divergence 
models 
Angel Castellano González, Juan Cigarrán Recuero, Ana García Serrano ................................. 140 

SINAI en TASS 2012 
Eugenio Martínez Cámara, M. Angel García Cumbreras, M. Teresa Martín Valdivia, L. Alfonso 
Ureña López ............................................................................................................................... 147 

Lexicon‐Based Sentiment Analysis of Twitter Messages in Spanish 
Antonio Moreno‐Ortiz, Chantal Pérez‐Hernández  .................................................................... 156 

   

TASS - Workshop on Sentiment Analysis at SEPLN

TASS - Taller de Análisis de Sentimientos en la SEPLN


Julio Villena-Román, Janine García-Morera, Cristina Moreno-García, Linda Ferrer-Ureña
DAEDALUS
{jvillena, jgarcia, cmoreno}@daedalus.es

Sara Lana-Serrano
DIATEL - Universidad Politécnica de Madrid
slana@diatel.upm.es

José Carlos González-Cristóbal, Adam Westerski
GSI - Universidad Politécnica de Madrid
{jgonzalez, westerski}@dit.upm.es

Eugenio Martínez-Cámara, M. Ángel García-Cumbreras, M. Teresa Martín-Valdivia, L. Alfonso Ureña-López
Universidad de Jaén
{emcamara, maite, magc, laurena}@ujaen.es

Resumen: This paper describes the development of TASS, an experimental evaluation workshop in the context of SEPLN to foster research in the field of sentiment analysis in social media, specifically focused on the Spanish language. The main objective is to promote the design of new techniques and algorithms, and the application of existing ones, for the implementation of complex systems able to perform sentiment analysis on short text opinions extracted from social media (specifically Twitter). This paper describes the proposed tasks; the content, format and most important statistics of the generated corpus; the participants and the different approaches proposed; as well as the overall results obtained.
Palabras clave: TASS, reputation analysis, sentiment analysis, social media

Abstract: This paper describes TASS, an experimental evaluation workshop within SEPLN to
foster the research in the field of sentiment analysis in social media, specifically focused on
Spanish language. The main objective is to promote the application of existing state-of-the-art
algorithms and techniques and the design of new ones for the implementation of complex
systems able to perform a sentiment analysis based on short text opinions extracted from social
media messages (specifically Twitter) published by representative personalities. The paper
presents the proposed tasks, the contents, format and main statistics of the generated corpus, the
participant groups and their different approaches, and, finally, the overall results achieved.
Keywords: TASS, reputation analysis, sentiment analysis, social media.

1 Introduction

According to the Merriam-Webster dictionary¹, reputation is the overall quality or character of a given person or organization as seen or judged by people in general, or, in other words, the general recognition by other people of some characteristics or abilities for a given entity.

Specifically, in business, reputation comprises the actions of a company and its internal stakeholders along with the perception of consumers about the business. Reputation affects attitudes like satisfaction, commitment and trust, and drives behavior like loyalty and support.

In turn, reputation analysis is the process of tracking, investigating and reporting an entity's actions and other entities' opinions

¹ http://www.merriam-webster.com/


about those actions. It covers many factors to calculate the market value of reputation.

Reputation analysis has come into wide use as a major factor of competitiveness in the increasingly complex marketplace of personal and business relationships among people and companies.

Currently, market research is typically performed using user surveys. However, the rise of social media such as blogs and social networks, and the increasing amount of user-generated content in the form of reviews, recommendations, ratings and any other form of opinion, has led to the creation of an emerging trend towards online reputation analysis.

The so-called sentiment analysis, i.e., the application of natural language processing and text analytics to identify and extract subjective information from texts, which is the first step towards online reputation analysis, is becoming a promising topic in the field of marketing and customer relationship management, as social media and its associated word-of-mouth effect is turning out to be the most important source of information about companies and their customers' sentiments towards their brands and products.

Sentiment analysis is a major technological challenge. The task is so hard that even humans often disagree on the sentiment of a given text. The fact that issues that one individual finds acceptable or relevant may not be the same to others, along with multilingual aspects, cultural factors and different contexts, makes it very hard to classify a text written in a natural language into a positive or negative sentiment. And the shorter the text is, for example when analyzing Twitter messages or short comments on Facebook, the harder the task becomes.

Within this context, TASS², which stands for Taller de Análisis de Sentimientos en la SEPLN (Workshop on Sentiment Analysis at SEPLN, in English), is an experimental evaluation workshop, organized as a satellite event of the SEPLN 2012 Conference, held on September 7th, 2012 at Jaume I University in Castellón de la Plana, Comunidad Valenciana, Spain, to promote research in the field of sentiment analysis in social media, initially focused on Spanish, although it could be extended to any language.

The main objective is to improve the existing techniques and algorithms and design new ones in order to perform sentiment analysis on short text opinions extracted from social media messages (specifically Twitter) published by a series of important personalities.

The challenge task is intended to provide a benchmark forum for comparing the latest approaches in this field. In addition, with the creation and release of the fully tagged corpus, we aim to provide a benchmark dataset that enables researchers to compare their algorithms and systems.

2 Description of tasks

Two tasks are proposed for the participants in this first edition: sentiment analysis and trending topic coverage. Groups may participate in both tasks or just in one of them.

Along with the submission of experiments, participants are encouraged to submit a paper to the workshop in order to describe their systems to the audience in a regular workshop session together with special invited speakers. Submitted papers are reviewed by the program committee.

2.1 Task 1: Sentiment Analysis

This task consists of performing an automatic sentiment analysis to determine the polarity of each message in the test corpus.

The evaluation metrics used to evaluate and compare the different systems are the usual measurements of precision (1), recall (2) and F-measure (3), calculated over the full test set, as shown in Figure 1.

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

F1 = 2 · Precision · Recall / (Precision + Recall)    (3)

Figure 1: Evaluation metrics (TP, FP and FN denote true positives, false positives and false negatives)

2.2 Task 2: Trending topic coverage

In this case, the technological challenge is to build a classifier to identify the topic of the text,

² http://www.daedalus.es/TASS
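Since the text only names precision, recall and F-measure as "the usual measurements", the standard per-label definitions can be sketched in a few lines. The gold/system tags below are made up for illustration; this is not the official TASS evaluation script:

```python
def precision_recall_f1(gold, predicted, label):
    """Standard per-label precision, recall and F1 over paired gold/system tags."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == label and g == label)  # true positives
    fp = sum(1 for g, p in pairs if p == label and g != label)  # false positives
    fn = sum(1 for g, p in pairs if p != label and g == label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative tags only (P+, P, NEU, N, N+, NONE are the TASS polarity labels).
gold = ["P+", "P", "NEU", "N", "P+", "NONE"]
pred = ["P+", "P", "N",   "N", "P",  "NONE"]
p, r, f = precision_recall_f1(gold, pred, "P+")  # 1.0, 0.5, 0.666...
```

Averaging these per-label figures over the full tag set then yields the single scores reported for each system.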


and then apply the polarity analysis to get the assessment for each topic.

The evaluation metrics are the same as in Task 1 (Figure 1).

3 Corpus

The corpus provided to participants contains over 70 000 tweets, written in Spanish by nearly 200 well-known personalities and celebrities of the world of politics, economy, communication, mass media and culture, between November 2011 and March 2012. Although the context of extraction has a Spain-focused bias, the diverse nationality of the authors, including people from Spain, Mexico, Colombia, Puerto Rico, USA and many other countries, makes the corpus reach a global coverage in the Spanish-speaking world.

Each Twitter message includes its ID (twitid), the creation date (date) and the user ID (user).

Due to restrictions in the Twitter API Terms of Service³, it is forbidden to redistribute a corpus that includes text contents or information about users. However, it is valid if those fields are removed and IDs (including tweet IDs and user IDs) are provided instead. The actual message content can be easily obtained by making queries to the Twitter API using the twitid. In addition, using the user ID it is possible to extract information about the user name, registration date, geographical information of his/her location, and many other fields, which may allow performing experiments, for instance, on the different varieties of Spanish.

Each message is tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. Five levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N), strong negative (N+), and one additional no-sentiment label (NONE).

Moreover, in those cases where applicable, this same polarity is tagged but related to the entities that are mentioned in the text.

There is also an indication of the level of agreement or disagreement of the expressed sentiment within the content. This is especially useful to make out whether a neutral sentiment comes from neutral keywords or else the text contains positive and negative sentiments at the same time. "Peter is a very good friend but I cannot stand John" is considered NEU with DISAGREEMENT, where Peter is regarded as P+ and John as N+.

On the other hand, a selection of a set of 10 topics has been made based on the thematic areas covered by the corpus, such as politics (política), soccer (fútbol), literature (literatura) or entertainment (entretenimiento).

Each message of the corpus has been semiautomatically assigned to one or several of these topics (most messages are associated to just one topic, due to the short length of the text).

This tagged corpus has been divided into two sets: training and test. The training corpus was released along with the corresponding tags so that participants could train and validate their models for classification and sentiment analysis. The test corpus was provided without any tag and was used to evaluate the results provided by the different systems.

Table 1 shows a summary of the training data provided to participants.

Attribute        Value
Twits            7 219
Topics           10
Twit languages   1
Users            154
User types       3
User languages   1
Date start       2011-12-02T00:47:55
Date end         2012-04-10T23:40:36

Table 1: Train corpus

Users were journalists (periodistas), politicians (políticos) or celebrities (famosos). The only language involved this year was Spanish (es).

Similarly, Table 2 shows a summary of the test data.

Attribute        Value
Twits            60 798
Topics           10
Twit languages   1
Users            158
User types       3
User languages   1
Date start       2011-12-02T00:03:32
Date end         2012-04-10T23:47:55

Table 2: Test corpus

³ https://dev.twitter.com/terms/api-terms
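The AGREEMENT/DISAGREEMENT tag described above can be pictured with a small helper that inspects the entity-level polarities of a message. This is only an illustrative heuristic mirroring the "Peter/John" example, not the tool actually used to annotate the corpus:

```python
def agreement_type(entity_polarities):
    """AGREEMENT if all entity tags lean the same way; DISAGREEMENT when
    positive (P, P+) and negative (N, N+) tags co-occur in one message."""
    positive, negative = {"P", "P+"}, {"N", "N+"}
    has_pos = any(tag in positive for tag in entity_polarities)
    has_neg = any(tag in negative for tag in entity_polarities)
    return "DISAGREEMENT" if has_pos and has_neg else "AGREEMENT"

# "Peter is a very good friend but I cannot stand John":
# Peter is P+ and John is N+, so the global NEU tag carries DISAGREEMENT.
label = agreement_type(["P+", "N+"])
```

A NEU message whose entity tags all agree, by contrast, is neutral because of its wording rather than because of mixed sentiment.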

The list of topics that have been selected is shown in Table 3, sorted by frequency in the test corpus.

Topic
Politics (política)
Other (otros)
Entertainment (entretenimiento)
Economy (economía)
Music (música)
Soccer (fútbol)
Films (cine)
Technology (tecnología)
Sports (deportes)
Literature (literatura)

Table 3: Topic list

The corpus is encoded in XML as defined by the following schema (Figure 2), in which the text of the content entity has been removed to follow the Twitter restrictions.

Figure 2: XML schema (twits → twit → twitid, user, content, date, lang; sentiments → polarity → [entity], value, type; topics → topic)

Two sample twits are shown in Figure 3. The second one is tagged with both the global polarity of the message and the polarity associated to each one of the entities that appear in the text (UPyD and Foro Asturias), while the first one is tagged only with the global polarity, as the text contains no mentions of any entity.

<twit>
  <twitid>0000000000</twitid>
  <user>usuario0</user>
  <content><![CDATA[Conozco a alguien q es adicto al drama… Ja ja ja te suena d algo!]]></content>
  <date>2011-12-02T02:59:03</date>
  <lang>es</lang>
  <sentiments>
    <polarity>
      <value>P+</value>
      <type>AGREEMENT</type>
    </polarity>
  </sentiments>
  <topics>
    <topic>entretenimiento</topic>
  </topics>
</twit>

<twit>
  <twitid>0000000001</twitid>
  <user>usuario1</user>
  <content><![CDATA[UPyD contará casi seguro con grupo gracias al Foro Asturias.]]></content>
  <date>2011-12-02T00:21:01</date>
  <lang>es</lang>
  <sentiments>
    <polarity>
      <value>P</value>
      <type>AGREEMENT</type>
    </polarity>
    <polarity>
      <entity>UPyD</entity>
      <value>P</value>
      <type>AGREEMENT</type>
    </polarity>
    <polarity>
      <entity>Foro_Asturias</entity>
      <value>P</value>
      <type>AGREEMENT</type>
    </polarity>
  </sentiments>
  <topics>
    <topic>política</topic>
  </topics>
</twit>

Figure 3: Sample twits

The full corpus will be made public after the workshop so that any group interested in the field of sentiment analysis in Spanish can use it.

4 Participants

Participants were required to register for the task(s) in order to obtain the corpus.

Results should be submitted in a plain text file with the following format:

twitid\tpolarity\ttopic

where twitid is the twit ID for every message in the test corpus, polarity contains one of the 6 valid tags (P+, P, NEU, N, N+ and NONE), and the same for topic.

Although the polarity must be classified into those levels, and the results were primarily evaluated for the 5 of them, the evaluation results also include metrics that consider just 3 levels (positive, neutral and negative).

Participants could submit results for one or both tasks. Several results for the same task were allowed too.

15 groups registered and finally 8 groups sent their submissions for one of the two tasks. The list of active groups is shown in Table 4.
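A corpus file following this schema can be read with the standard library and turned into lines in the required twitid\tpolarity\ttopic submission format. This is a minimal sketch assuming the element names shown in Figures 2 and 3; it emits the global polarity (the <polarity> element without an <entity> child) and the first listed topic:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<twits><twit>
  <twitid>0000000001</twitid>
  <user>usuario1</user>
  <date>2011-12-02T00:21:01</date>
  <lang>es</lang>
  <sentiments><polarity><value>P</value><type>AGREEMENT</type></polarity></sentiments>
  <topics><topic>política</topic></topics>
</twit></twits>"""

def submission_lines(xml_text):
    """Yield one 'twitid<TAB>polarity<TAB>topic' line per <twit> element."""
    root = ET.fromstring(xml_text)
    for twit in root.iter("twit"):
        twitid = twit.findtext("twitid")
        # Global polarity: the first <polarity> that has no <entity> child.
        polarity = next(p.findtext("value")
                        for p in twit.find("sentiments").iter("polarity")
                        if p.find("entity") is None)
        topic = twit.find("topics").findtext("topic")
        yield f"{twitid}\t{polarity}\t{topic}"

lines = list(submission_lines(SAMPLE))
```

A real submission would of course replace the gold tags read here with the system's own predictions.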

XVIII CONGRESO DE LA SOCIEDAD ESPAÑOLA PARA EL PROCESAMIENTO DEL LENGUAJE NATURAL 98

All of them submitted results for the sentiment analysis task and most of them (6 out of 8, 75%) participated in both tasks.

Group               Task 1   Task 2
Elhuyar Fundazioa   Yes      No
IMDEA               Yes      Yes
L2F-INESC           Yes      Yes
La Salle-URL        Yes      Yes
LSI UNED            Yes      Yes
LSI UNED 2          Yes      Yes
SINAI-UJAEN         Yes      Yes
UMA                 Yes      No

Table 4: Participant groups

There was another group at Delft University of Technology (TUDelft) that submitted experiments for both tasks but finally did not submit a report for the workshop, so their results are not included.
The next sections briefly describe the approaches of the different groups.

4.1 Elhuyar Fundazioa

In their paper (Saralegi and San Vicente, 2012), they describe their supervised approach, which includes some linguistic knowledge-based processing for preparing the features.
The processing comprises lemmatization, POS tagging, tagging of polarity words, treatment of emoticons, treatment of negation, and weighting of polarity words depending on the syntactic nesting level. A pre-processing step for the treatment of spelling errors is also performed.
Detection of polarity words is done according to a polarity lexicon built in two ways: projection to Spanish of an English lexicon, and extraction of divergent words from the positive and negative tweets of the training corpus.
As shown in the next sections, evaluation results show a good performance and also good robustness of the system, both for fine granularity (65% accuracy) and for coarse granularity polarity detection (71% accuracy).

4.2 IMDEA

In their paper (Fernández Anta et al., 2012), they state that sentiment analysis and topic detection are new problems at the intersection of natural language processing (NLP) and data mining.
They present an interesting comparative analysis of different approaches and classification techniques for these problems using the provided corpus of Spanish tweets. The data is preprocessed using well-known techniques and tools proposed in the literature, together with others specifically proposed here that take into account the characteristics of Twitter.
Then, popular classifiers have been used (in particular, all classifiers of WEKA have been evaluated). Their report describes some of the results obtained in their preliminary research.

4.3 L2F-INESC

The strategy used by the L2F team for performing automatic sentiment analysis and topic classification over Spanish Twitter data is described in (Batista and Ribeiro, 2012).
They decided to consider both tasks as classification tasks, thus sharing the same method. Their most successful and recent experiments cast the problem as a binary classification problem, which aims at discriminating between two possible classes. Binary classifiers are easier to develop, offer faster convergence ratios, and can be executed in parallel. The final results are then generated by combining all the different binary classifiers.
Specifically, they have adopted an approach based on logistic regression classification models, which corresponds to maximum entropy (ME) classification for independent events.
As described later, the L2F system achieved the best results in the topic classification contest, and second place in sentiment analysis.

4.4 La Salle - Universitat Ramon Llull

(Trilla and Alías, 2012) describe how they adapt a Text Classification scheme based on Multinomial Naive Bayes to deal with Twitter messages labeled with six classes of sentiment as well as with their topic.
The Multinomial Naive Bayes (MNB) is a probabilistic generative approach that builds a language model assuming conditional independence among the linguistic features. Therefore, no sense of history, sequence or order is introduced in this model.
The effectiveness of this scheme is evaluated using the TASS-SEPLN Twitter

dataset and it achieves a maximum macro-averaged F1 measure of 36.28%.

4.5 LSI – UNED

(Martín-Wanton and Carrillo de Albornoz, 2012) presents the participation of the UNED group in TASS.
For polarity classification, they propose an emotional concept-based method. The original method makes use of an affective lexicon to represent the text as the set of emotional meanings it expresses, along with advanced syntactic techniques to identify negations and intensifiers, their scope, and their effect on the emotions affected by them.
Besides, the method addresses the problem of word ambiguity, taking into account the contextual meaning of terms by using a word sense disambiguation algorithm.
On the other hand, for topic detection, their system is based on a probabilistic model (Twitter-LDA). They first build, for each topic of the task, a lexicon of words that best describe it, thus representing each topic as a ranking of discriminative words. Moreover, a set of events is retrieved based on a probabilistic approach that was adapted to the characteristics of Twitter.
To determine which of the topics corresponds to each event, the topic with the highest statistical correlation was obtained by comparing the ranking of words of each topic with the ranking of words most likely to belong to the event.
The experimental results achieved show the adequacy of their approach for the task, as shown later.

4.6 LSI – UNED 2

(Castellano, Cigarrán and García Serrano, 2012) describes the research done for the workshop by the second team of the LSI group at UNED.
Their proposal addresses sentiment and topic detection from an Information Retrieval (IR) perspective, based on language divergences. Kullback-Leibler Divergence (KLD) is used to generate both polarity and topic models, which are then used in the IR process.
In order to improve the accuracy of the results, they propose several approaches to building the language models, considering not only the textual content associated with each tweet but also, as an alternative, the named entities or adjectives detected.
Results show that modeling the tweet set using named entities and adjectives improves the final precision results and, as a consequence, their representativeness in the model compared with the use of common terms.
General results are promising (fifth and fourth position in each of the proposed tasks), indicating that an approach based on IR and language models may be an alternative to other classical proposals focused on the application of classification techniques.

4.7 SINAI – Universidad de Jaén

The participation of the SINAI research group of the University of Jaén is described in (Martínez Cámara et al., 2012).
For the first task, they have chosen a supervised machine learning approach, in which they have used SVM for classifying the polarity. Text features included are unigrams, emoticons, positive and negative words, and intensity markers.
In the second task, they have also used SVM for the topic classification, but several bags of words (BoW) have been used with the goal of improving the classification performance.
One BoW has been obtained using the Google AdWords Keyword Tool, which allows one to enter a term and directly returns the top N related concepts. The second BoW has been generated based on the hashtags of the training tweets, for each category.

4.8 Universidad de Málaga (UMA)

(Moreno-Ortiz and Pérez-Hernández, 2012) describes the participation of the group at the Facultad de Filosofía y Letras of the Universidad de Málaga.
They use a lexicon-based approach to Sentiment Analysis (SA). These approaches differ from the more common machine-learning based approaches in that the former rely solely on previously generated lexical resources that store polarity information for lexical items, which are then identified in the texts, assigned a polarity tag, and finally weighed to come up with an overall score for the text.
Such SA systems have been proved to perform on par with supervised statistical systems, with the added benefit of not requiring a training set. However, it remains to be seen

whether such lexically-motivated systems can cope equally well with extremely short texts, as generated on social networking sites such as Twitter.
In their paper they perform such an evaluation using Sentitext, a lexicon-based SA tool for Spanish. One conclusion is that performance is affected by the number of lexical units available in the text (or the lack of them, rather). On the other hand, they also observed a tendency to assign middle-of-the-scale ratings, or at least to avoid extreme values, which is reflected in its poor performance for the N+ and P+ classes, most of which were assigned to the more neutral N and P classes.
Another interesting conclusion, drawn from their analysis of the average number of polarity lexical segments and Affect Intensity, is that Twitter users employ highly emotional language.

5 Results

The gold standard (or qrels in the TREC context) has been generated by first pooling all submissions; then a voting schema was applied, and finally an extensive human review of the ambiguous decisions (thousands of them) was carried out. Due to the high volume of data, this process is unfortunately subject to errors and misclassifications.
Both tasks have been evaluated as single-label classification. This specifically affects the topic classification, where the most restrictive criterion has been applied: a success is achieved only when all the test labels have been returned. Participants were welcome to reevaluate their experiments with a less restrictive strategy in their papers.
Regarding Task 1 (Sentiment analysis), 17 different experiments were submitted (plus 3 extra experiments from TUDelft). Results are listed in the tables below. All tables show the precision value achieved in each experiment.
Table 5 considers 5 levels of sentiment (P+, P, NEU, N, N+) and no sentiment (NONE). Precision values range from 65.29% to 16.73%. Only 8 of the 20 submissions achieve figures higher than 50%, and specifically 5 of the 9 groups have at least one submission above this value.
Besides, results for different submissions from the same group are typically very similar, except in the case of the SINAI-UJAEN group.

Run Id            Group            Precision
pol-elhuyar-1-5   Elhuyar Fund.    65.29%
pol-l2f-1-5       L2F-INESC        63.37%
pol-l2f-3-5       L2F-INESC        63.27%
pol-l2f-2-5       L2F-INESC        62.16%
pol-atrilla-1-5   La Salle-URL     57.01%
pol-sinai-4-5     SINAI-UJAEN      54.68%
pol-uned1-2-5     LSI UNED         53.82%
pol-uned1-1-5     LSI UNED         52.54%
pol-uned2-2-5     LSI UNED 2       40.41%
pol-uned2-1-5     LSI UNED 2       39.98%
pol-uned2-3-5     LSI UNED 2       39.47%
pol-uned2-4-5     LSI UNED 2       38.59%
pol-imdea-1-5     IMDEA            36.04%
pol-sinai-2-5     SINAI-UJAEN      35.65%
pol-sinai-1-5     SINAI-UJAEN      35.28%
pol-sinai-3-5     SINAI-UJAEN      34.97%
pol-uma-1-5       UMA              16.73%

Table 5: Results for task 1 (Sentiment Analysis) with 5 levels + NONE

In order to perform a more in-depth evaluation, Table 6 gives results considering the classification in only 3 levels (POS, NEU, NEG) and no sentiment (NONE), merging P and P+ into a single category, as well as N and N+ into another one.

Run Id            Group            Precision
pol-elhuyar-1-3   Elhuyar Fund.    71.12%
pol-l2f-1-3       L2F-INESC        69.05%
pol-l2f-3-3       L2F-INESC        69.04%
pol-l2f-2-3       L2F-INESC        67.63%
pol-atrilla-1-3   La Salle-URL     61.95%
pol-sinai-4-3     SINAI-UJAEN      60.63%
pol-uned1-1-3     LSI UNED         59.03%
pol-uned1-2-3     LSI UNED         58.77%
pol-uned2-1-3     LSI UNED 2       50.08%
pol-imdea-1-3     IMDEA            45.95%
pol-uned2-2-3     LSI UNED 2       43.61%
pol-uned2-4-3     LSI UNED 2       41.20%
pol-uned2-3-3     LSI UNED 2       40.43%
pol-uma-1-3       UMA              37.61%
pol-sinai-2-3     SINAI-UJAEN      35.83%
pol-sinai-1-3     SINAI-UJAEN      35.58%
pol-sinai-3-3     SINAI-UJAEN      35.11%

Table 6: Results for task 1 (Sentiment Analysis) with 3 levels + NONE
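The 5-to-3 level merge behind Table 6, and the single-label precision reported in all the tables, are mechanical enough to sketch in a few lines of Python. This is an illustrative reconstruction under our own naming, not the organizers' evaluation script; only the tag names come from the task definition:

```python
# Merge used for the 3-level evaluation: P and P+ collapse to POS,
# N and N+ to NEG; NEU and NONE are kept as separate categories.
MERGE_3 = {"P+": "POS", "P": "POS", "NEU": "NEU",
           "N": "NEG", "N+": "NEG", "NONE": "NONE"}

def precision(gold, predicted, merge=None):
    """Single-label precision: fraction of tweets whose predicted
    tag matches the gold tag, optionally after merging levels."""
    if merge is not None:
        gold = [merge[t] for t in gold]
        predicted = [merge[t] for t in predicted]
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)

gold = ["P+", "P", "N", "NONE"]
pred = ["P", "P", "N+", "NONE"]
print(precision(gold, pred))            # 5 levels + NONE: 0.5
print(precision(gold, pred, MERGE_3))   # 3 levels + NONE: 1.0
```

The toy example shows why the 3-level figures are systematically higher: confusions between a level and its strong variant (P vs. P+, N vs. N+) stop counting as errors once the levels are merged.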

In this case, precision values improve, as expected. The precision obtained now ranges from 71.12% to 35.11%. Here, 9 submissions have a precision value over 50% and 6 groups have at least one result over this figure.
Table 7 shows the results for Task 2 (Trending topic coverage). 13 experiments were submitted in all (plus 2 experiments from TUDelft).

Run Id           Group           Precision
top-l2f-2        L2F-INESC       65.37%
top-l2f-1y3      L2F-INESC       64.92%
top-atrilla-1    La Salle-URL    60.16%
pol-uned2-5a8    LSI UNED 2      45.26%
top-imdea-1      IMDEA           45.24%
pol-uned2-9a12   LSI UNED 2      42.24%
pol-uned2-1a4    LSI UNED 2      40.51%
top-sinai-5      SINAI-UJAEN     39.37%
top-sinai-4      SINAI-UJAEN     37.79%
top-sinai-2      SINAI-UJAEN     34.76%
top-sinai-3      SINAI-UJAEN     34.06%
top-sinai-1      SINAI-UJAEN     32.34%
pol-uned1-1y2    LSI UNED        30.98%

Table 7: Results for task 2 (Trending topic coverage)

In this task, precision ranges from 65.37% to 30.98% and only 4 of the 15 submissions are above 50% (2 groups). As in task 1, different submissions from the same group usually have a similar precision.

6 Conclusions and Future Work

TASS has been the first workshop on sentiment analysis in the context of SEPLN. We expected to attract a certain interest in the proposed tasks, as many groups around the world are currently carrying out intense research on sentiment/opinion analysis in general and on short texts in particular. We think that the number of participants, the quality of their work and their reports, and the good results achieved in such hard tasks have met and gone beyond all our expectations.
The diversity of groups coming from different fields and areas of expertise, including Information Retrieval, Natural Language Processing / Computational Linguistics, Machine Learning / Data Mining / Text Analytics, and even the Semantic Web, has shown that sentiment analysis is becoming a trending topic within the information technology field.
Some participants expressed concerns about the quality of the annotation of both the training corpus and the gold standard (the test corpus). In case of future editions of TASS and reuse of the corpus, more effort must be invested in filtering errors and improving the annotation of the corpora.
Furthermore, as expressed by (Moreno-Ortiz and Pérez-Hernández), there is a need for further discussion about whether differentiating between neutral and no polarity is the best decision, since it is not always clear what the difference is and, moreover, whether this distinction is interesting from a practical perspective.
In future editions of the workshop, it would be interesting to extend the corpus to other languages (English in particular) to compare the performance of the different approaches across languages.

References

Saralegi Urizar, Xabier; San Vicente Roncal, Iñaki. 2012. TASS: Detecting Sentiments in Spanish Tweets. TASS 2012 Working Notes.

Fernández Anta, Antonio; Morere, Philippe; Núñez Chiroque, Luis; Santos, Agustín. 2012. Techniques for Sentiment Analysis and Topic Detection of Spanish Tweets: Preliminary Report. TASS 2012 Working Notes.

Batista, Fernando; Ribeiro, Ricardo. 2012. The L2F Strategy for Sentiment Analysis and Topic Classification. TASS 2012 Working Notes.

Trilla, Alexandre; Alías, Francesc. 2012. Sentiment Analysis of Twitter Messages Based on Multinomial Naive Bayes. TASS 2012 Working Notes.

Martín-Wanton, Tamara; Carrillo de Albornoz, Jorge. 2012. UNED en TASS 2012: Sistema para la Clasificación de la Polaridad y Seguimiento de Temas. TASS 2012 Working Notes.

Castellano González, Ángel; Cigarrán Recuero, Juan; García Serrano, Ana. 2012. UNED@TASS: Using IR Techniques for Topic-Based Sentiment Analysis Through Divergence Models. TASS 2012 Working Notes.

Martínez Cámara, Eugenio; García Cumbreras, M. Ángel; Martín Valdivia, M. Teresa;

Ureña López, L. Alfonso. 2012. SINAI at TASS 2012. TASS 2012 Working Notes.

Moreno-Ortiz, Antonio; Pérez-Hernández, Chantal. 2012. Lexicon-Based Sentiment Analysis of Twitter Messages in Spanish. TASS 2012 Working Notes.
TASS: Detecting Sentiments in Spanish Tweets
TASS: Detección de Sentimientos en Tuits en Español

Xabier Saralegi Urizar
Elhuyar Fundazioa
Zelai Haundi 3, 20170 Usurbil
x.saralegi@elhuyar.com

Iñaki San Vicente Roncal
Elhuyar Fundazioa
Zelai Haundi 3, 20170 Usurbil
i.sanvicente@elhuyar.com

Resumen: This article describes the system submitted by our group for the sentiment analysis task within the TASS 2012 evaluation campaign. We adopt a supervised approach that makes use of linguistic knowledge. This linguistic knowledge comprises lemmatization, POS tagging, tagging of polarity words, treatment of emoticons, treatment of negation, and weighting of polarity according to the level of syntactic nesting. Pre-processing for the treatment of spelling errors is also carried out. Polarity words are detected according to a polarity lexicon for Spanish created on the basis of two strategies: projection or translation of an English polarity lexicon into Spanish, and extraction of divergent words between the positive and negative tweets of the training corpus. The final evaluation results show a good performance of the system as well as notable robustness both for fine-grained polarity detection (65% accuracy) and for coarse-grained polarity detection (71% accuracy).
Palabras clave: TASS, Sentiment analysis, Opinion mining, Polarity detection
Abstract: This article describes the system presented for the task of sentiment
analysis in the TASS 2012 evaluation campaign. We adopted a supervised approach
that includes some linguistic knowledge-based processing for preparing the features.
The processing comprises lemmatisation, POS tagging, tagging of polarity words,
treatment of emoticons, treatment of negation, and weighting of polarity words
depending on syntactic nesting level. A pre-processing step for the treatment of spelling errors


is also performed. Detection of polarity words is done according to a polarity lexicon
built in two ways: projection to Spanish of an English lexicon, and extraction of
divergent words of positive and negative tweets of training corpus. Evaluation results
show a good performance and also good robustness of the system both for fine
granularity (65% of accuracy) as well as for coarse granularity polarity detection
(71% of accuracy).
Keywords: TASS, Sentiment Analysis, Opinion-mining, Polarity detection

1 Introduction

Knowledge management is an emerging research field that is very useful for improving productivity in different activities. Knowledge discovery, for example, is proving very useful for tasks such as decision making and market analysis. With the explosion of Web 2.0, the Internet has become a very rich source of user-generated information, and research areas such as opinion mining or sentiment analysis have attracted many researchers. Being able to identify and extract the opinions of users about topics or products would enable many organizations to obtain global feedback on their activities. Some studies (O'Connor et al., 2010) have pointed out that such systems could perform as well as traditional polling systems, but at a much lower cost. In this context, social media like Twitter constitute a very valuable source when seeking opinions and sentiments.
The TASS evaluation challenge consisted of two tasks: predicting the sentiment of Spanish tweets, and identifying the topic of
the tweets. The TASS evaluation workshop aims "to provide a benchmark forum for comparing the latest approaches in this field". Our team only took part in the first task, which involved predicting the polarity of a number of tweets with respect to a 6-category classification, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. It must be noted that most works in the literature only classify sentiments as positive or negative, and only in a few papers are neutral and/or objective categories included. We developed a supervised system based on a polarity lexicon and a series of additional linguistic features.
The rest of the paper is organized as follows. Section 2 reviews the state of the art in the polarity detection field, placing special interest on sentence-level detection and on Twitter messages in particular. The third section describes the system we developed, the features we included in our supervised system, and the experiments we carried out over the training data. The next section presents the results we obtained with our system, first on the training set and later on the test data-set. The last section draws some conclusions and future directions.

2 State of the Art

Much work has been done in the last decade in the field of sentiment labelling. Most of these works are limited to polarity detection. Determining the polarity of a text unit (e.g., a sentence or a document) usually involves using a lexicon composed of words and expressions annotated with prior polarities (Turney, 2002; Kim and Hovy, 2004; Riloff, Wiebe, and Phillips, 2005; Godbole, Srinivasaiah, and Skiena, 2007). Much research has been done on the automatic or semi-automatic construction of such polarity lexicons (Riloff and Wiebe, 2003; Esuli and Sebastiani, 2006; Rao and Ravichandran, 2009; Velikovich et al., 2010).
Regarding the algorithms used in sentiment classification, although there are approaches based on averaging the polarity of the words appearing in the text (Turney, 2002; Kim and Hovy, 2004; Hu and Liu, 2004; Choi and Cardie, 2009), machine learning methods have become the more widely used approach. Pang et al. (2002) proposed a unigram model using Support Vector Machines which does not need any prior lexicon to classify movie reviews. Read (2005) confirmed the necessity of adapting the models to the application domain, and (Choi and Cardie, 2009) address the same problem for polarity lexicons.
In the last few years many researchers have turned their efforts to microblogging sites such as Twitter. As an example, (Bollen, Mao, and Zeng, 2010) have studied the possibility of predicting stock market results by measuring the sentiments expressed in Twitter. The special characteristics of the language of Twitter require special treatment when analyzing the messages. A special syntax (RT, @user, #tag, ...), emoticons, ungrammatical sentences, vocabulary variations and other phenomena lead to a drop in the performance of traditional NLP tools (Foster et al., 2011; Liu et al., 2011). In order to solve this problem, many authors have proposed a normalization of the text, as a pre-process of any analysis, reporting an improvement in the results. Brody (2011) deals with the word lengthening phenomenon, which is especially important for sentiment analysis because it usually expresses emphasis in the message. (Han and Baldwin, 2011) use morphophonemic similarity to match variations with their standard vocabulary words, although only 1:1 equivalences are treated; e.g., 'imo = in my opinion' would not be identified. Instead, they use an Internet slang dictionary to translate some of those expressions and acronyms. Liu et al. (2012) propose combining three strategies, including letter transformation, the "priming" effect, and misspelling corrections.
Once the normalization has been performed, traditional NLP tools may be used to analyse the tweets and extract features such as lemmas or POS tags (Barbosa and Feng, 2010). Emoticons are also good indicators of polarity (O'Connor et al., 2010). Other features analyzed in sentiment analysis, such as discourse information (Somasundaran et al., 2009), can also be helpful. (Speriosu et al., 2011) explore the possibility of exploiting the Twitter follower graph to improve polarity classification, under the assumption that people influence one another or have shared affinities about topics. (Barbosa and Feng, 2010; Kouloumpis, Wilson, and Moore, 2011) combined polarity lexicons with machine learning for labelling the sentiment of tweets. Sindhwani and Melville (2008) adopt a semi-

supervised approach using a polarity lexicon combined with label propagation.
A common problem of supervised approaches is gathering labelled data for training. In the case of the TASS challenge, we would have to tackle this problem should we want to collect additional training data. In order to automatically build annotated corpora, (Go, Bhayani, and Huang, 2009) collect tweets containing the ":)" emoticon and regard them as positive, and likewise for the ":(" emoticon. Kouloumpis (2011) uses a similar approach based on the most common positive and negative hashtags. Barbosa (Barbosa and Feng, 2010) relies on existing web services such as Twend or Tweetfeel to collect annotated emoticons. One major problem of the aforementioned strategies is that only positive and negative tweets can be collected.

3 Experiments

3.1 Training Data

The training data Ct provided by the organization consists of 7,219 Twitter messages (see Table 1). Each tweet is tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. 6 levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N), strong negative (N+) and no sentiment (NONE). The numbers of tweets corresponding to P+ and NONE are higher than the rest; NEU is the class including the fewest tweets. In addition, each message includes its Twitter ID, the creation date and the Twitter user ID.

Polarity   #tweets   % of #tweets
P+         1,764     24.44%
P          1,019     14.12%
NEU          610      8.45%
N          1,221     16.91%
N+           903     12.51%
NONE       1,702     23.58%
Total      7,219     100%

Table 1: Polarity class distribution in corpus Ct.

3.2 Polarity Lexicon

We created a new polarity lexicon for Spanish, Pes, from two different sources:
a) An existing English polarity lexicon (Wilson et al., 2005), Pen, was automatically translated into Spanish by using an English-Spanish bilingual dictionary Den−es (see Table 2). Despite Pen including neutral words, only positive and negative ones were selected and translated. Ambiguous translations were resolved manually by two annotators. Altogether, 7,751 translations were checked. Polarity was also checked and corrected during this manual annotation. It must be noted that, as all translation candidates were checked, many variants of the same source word were selected in many cases. Finally, 2,164 negative words and 1,180 positive words were included in the polarity lexicon (see fifth column of Table 3). We detected a significant number of OOV words (35%) in this translation process (see second and third columns of Table 3). Most of these words were inflected forms: pasts (e.g., "terrified"), plurals (e.g., "winners"), adverbs (e.g., "vibrantly"), etc. So they were not dealt with.

         #headwords   #pairs   avg. #translations
Den−es   15,134       31,884   2.11

Table 2: Characteristics of the Den−es bilingual dictionary.

b) As a second source for our polarity lexicon, words were automatically extracted from the training corpus Ct. In order to extract the words most associated with a certain polarity, let us say positive, we divided the corpus into two parts: positive tweets and the rest of the corpus. Using the log-likelihood ratio (LLR) we obtained the ranking of the most salient words in the positive part with respect to the rest of the corpus. The same process was conducted to obtain negative candidates. The top 1,000 negative and top 1,000 positive words were manually checked. Among them, 338 negative and 271 positive words were selected for the polarity lexicon (see sixth column in Table 3). We found a higher concentration of good candidates among the best ranked candidates (see Figure 1).

3.3 Supervised System

Although some preliminary experiments were conducted using an unsupervised approach, we chose to build a supervised classifier, because it allowed us to combine the various features more effectively. We used the SMO

polarity   English words   words        translation candidates   manually selected   manually selected   final lexicon
           in Pen          translated   by Den−es                candidates          from Ct             Pes
negative   4,144           2,416        3,480                    2,164               271                 2,435
positive   2,304           2,057        2,271                    1,180               338                 1,518
Total      6,878           4,473        5,751                    3,344               609                 3,953

Table 3: Statistics of the polarity lexicons.

[Figure 1: Precision of candidates from Ct depending on LLR ranking intervals (100 candidates per interval {1-100, 101-200, ...}).]

implementation of the Support Vector Machine algorithm included in the Weka (Hall et al., 2009) data mining software. The default configuration was used. All the classifiers built over the training data were evaluated by means of the 10-fold cross-validation strategy, except for the one including additional training data (see section 3.3.6 for details).

As mentioned in section 2, microblogging in general, and Twitter in particular, suffers from a high presence of spelling errors. This hampers knowledge-based processing as well as supervised methods. We rejected the use of spell-correctors such as the Google spell-checker because they try to correct many valid words that they do not know. Therefore, we apply some heuristics in order to preprocess the tweets and solve the main problems we detected in the training corpus:

• Replication of characters (e.g., “Sueñooo”): sequences of the same character are replaced by a single character when the pre-edited word is not included in the dictionary of Freeling (http://nlp.lsi.upc.edu/freeling) and the post-edited word is.

• Abbreviations (e.g., “q”, “dl”, ...): a list of abbreviations is created from the training corpus. These abbreviations are expanded before the lemmatisation process.

• Overuse of upper case (e.g., “MIRA QUE BUENO”): upper case is used to give more intensity to the tweet. If we detect a sequence of two words, all of whose characters are upper case and which are included in Freeling’s dictionary as common words, we change them to lower case.

• Normalization of URLs: the complete URL is replaced by the string “URL”.

3.3.1 Baseline

As a baseline we implemented a unigram representation using all the lemmas in the training corpus as features (15,069 altogether). Lemmatisation was done using Freeling. Contrary to (Pang, Lee, and Vaithyanathan, 2002), we stored the frequency of the lemmas in a tweet. Although using presence performed slightly better in the baseline configuration (the improvement was not significant), as other features were included we achieved better results by using frequency. Thus, for the sake of simplicity, all the experiments shown make use of frequency.

3.3.2 Selection of Polarity Words (SP)

Only lemmas corresponding to words included in the polarity lexicon Pes (see section 3.2) were selected as features. This allows the system to focus on features that express polarity, without further noise. Another effect is that the number of features decreases significantly (from 15,069 to 3,730), thus reducing the computational cost of the model. In our experiments, relying on the polarity lexicon clearly outperforms the unigram-based baseline (see Table 4). The rest of the features were tested on top of this configuration.

3.3.3 Emoticons and Interjections (EM)

Emoticons and interjections are very strong expressions of sentiment. A list of emoticons was collected from a Wikipedia article about emoticons, and all of them were classified as positive (e.g., “:)”, “:D”, ...) or negative (e.g., “:(”, “u u”, ...); 23 emoticons were classified as positive and 35 as negative. A list of 54 negative (e.g., “mecachis”, “sniff”, ...) and 28 positive (e.g., “hurra”, “jeje”, ...) interjections including variants modelled by regular


expressions were also collected from different webs as well as from the training corpora. The frequency of each emoticon and interjection type (positive or negative) is included as a feature of the classifier.

The number of upper-case letters in the tweet was also used as an orthographic clue. On Twitter, where it is not possible to use letter styling, people often use upper case to emphasize their sentiments (e.g., “GRACIAS”), and hence a large number of upper-case letters would denote subjectivity. So the relative number of upper-case letters in a tweet is also included as a feature.

According to the results (see Table 4), these clues did not provide a significant improvement, although they did show a slight one. Moreover, other literature shows that such features do help to detect polarity (Kouloumpis et al., 2011). The low impact of these features could be explained by their low density in our data-set: only 622 out of the 7,219 tweets in the training data (8.6%) include emoticons or interjections. The emoticon, interjection and capitalization features were included in our final model.

3.3.4 POS Information (PO)

The results reported in the literature are not clear as to whether POS information helps to determine the polarity of texts (Kouloumpis et al., 2011), but POS tags are useful for distinguishing between subjective and objective texts. Our hypothesis is that certain POS tags, e.g., adjectives, are more frequent in opinion messages. In our experiments the POS tags provided by Freeling were used, taking the frequency of the POS tags in a message as a feature.

The results in Table 4 show that this feature provides a notable improvement, and that it is especially helpful for detecting objective messages (see the difference in F-score between SP and SP+PO for the NONE class).

3.3.5 Frequency of Polarity Words (FP)

The SP classifier does not interpret the polarity information included in the lexicon. We explicitly provide that information as a feature to the classifier. Furthermore, without the polarity information, the classifier will be built taking into account only those polarity words appearing in the training data. By including the polarity frequency information explicitly, the polarity words included in Pes but not in the training corpus will also be used by the classifier. By dealing with those OOV polarity words, our intention is to make our system more robust.

Two new features are created to encode the polarity information: a positivity score and a negativity score for the tweet. In principle, positive words in Pes add 1 to the positivity score and negative words add 1 to the negativity score. However, the score of a word can be altered by several phenomena, which are explained below.

Treatment of Negations and Adverbs

The polarity of a word changes if it is included in a negative clause. The polarity of a word also increases or decreases depending on the adverb which modifies it. We created a list of increasing (e.g., “mucho”, “absolutamente”, ...) and decreasing (e.g., “apenas”, “poco”, ...) adverbs. If an increasing adverb modifying a polarity word is detected, the polarity is increased (+1); if it is a decreasing adverb, the polarity of the word is decreased (−1). Syntactic information provided by Freeling is used for detecting both kinds of cases.

Syntactic Nesting Level

The importance of a word in the tweet determines the influence it can have on the polarity of the whole tweet. We measured the importance of each word w by calculating its relative syntactic nesting level: the deeper the word is nested, the less important it is. The relative syntactic nesting level is computed as the inverse of the syntactic nesting level, 1/ln(w).

Features/Metric   Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline          0.45            0.574   0.267   0.137   0.368   0.385   0.578
SP                0.484           0.594   0.254   0.098   0.397   0.422   0.598
SP+PO             0.496           0.596   0.245   0.093   0.414   0.438   0.634
SP+EM             0.49            0.612   0.253   0.097   0.402   0.428   0.6
SP+FP             0.514           0.633   0.261   0.115   0.455   0.438   0.613
ALL               0.523           0.648   0.246   0.111   0.463   0.452   0.657
ALL+AC1           0.523           0.647   0.248   0.116   0.46    0.451   0.655

Table 4: Accuracy results obtained in the evaluation on the training data. Columns 3 to 8 show F-scores for each of the class values.
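The scoring scheme of section 3.3.5 (lexicon hits, negation, adverb modifiers, and 1/ln(w) weighting) can be sketched as follows. This is a toy illustration, not the system described in the paper: the mini-lexicons stand in for Pes and the adverb lists, the negation flags and nesting levels stand in for the syntactic analysis that Freeling would provide, treating the preceding token as the modifying adverb is a simplification, and modelling negation as flipping a word's polarity is one reading of "the polarity of a word changes".

```python
# Toy sketch of the positivity/negativity features of section 3.3.5.
# Hypothetical stand-ins for the real resources (polarity lexicon P_es,
# increasing/decreasing adverb lists):
POSITIVE = {"bueno", "genial"}
NEGATIVE = {"malo", "horrible"}
INCREASING = {"mucho", "absolutamente"}
DECREASING = {"apenas", "poco"}

def polarity_scores(tokens, negated, nesting):
    """tokens: lemmas of the tweet; negated[i]: True if token i lies in a
    negative clause; nesting[i]: syntactic nesting level ln(w) >= 1."""
    pos = neg = 0.0
    for i, w in enumerate(tokens):
        if w in POSITIVE:
            score, target = 1.0, "pos"
        elif w in NEGATIVE:
            score, target = 1.0, "neg"
        else:
            continue
        # Adverb modification: +1 for an increasing adverb, -1 for a
        # decreasing one (simplified to the immediately preceding token).
        if i > 0 and tokens[i - 1] in INCREASING:
            score += 1.0
        elif i > 0 and tokens[i - 1] in DECREASING:
            score -= 1.0
        # Negation changes the polarity (modelled here as a flip).
        if negated[i]:
            target = "neg" if target == "pos" else "pos"
        # Weight by the relative nesting level 1 / ln(w).
        score *= 1.0 / nesting[i]
        if target == "pos":
            pos += score
        else:
            neg += score
    return pos, neg

# "no es bueno": the positive word counts as negative evidence,
# weighted by its nesting level of 2.
print(polarity_scores(["no", "es", "bueno"], [False, False, True], [1, 1, 2]))
# prints (0.0, 0.5)
```

With real input, the negation flags and nesting levels would come from Freeling's syntactic analysis rather than being supplied by hand.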


3.3.6 Using Additional Corpora (AC)

Additional training data were retrieved using the Perl Net::Twitter API. Different searches were conducted during June 2012 using the attitude feature of Twitter search, with which users can search for tweets expressing either a positive or a negative opinion. The search is based on emoticons, as in (Go et al., 2009). Retrieved tweets were classified according to their attitude.

Corpus   P        N       Total
Ctw      11,363   9,865   21,228

Table 5: Characteristics of the tweet corpus collected from Twitter.

The corpus Ctw of retrieved tweets (see Table 5) was used in two ways. On the one hand, we used it to find new words for our polarity lexicon Pes, using the automatic method described in section 3.2. The first 500 positive candidates and the first 500 negative candidates were manually checked. Altogether, 110 positive words and 95 negative ones (AC1) were included in the polarity lexicon Pes. According to the results (see ALL+AC1 in Table 4), these new polarity words do not provide any improvement. The reason is that the most relevant polarity words of the training corpus Ct are already included in Pes, as explained in section 3.2. In order to measure the contribution of these words better, evaluation was also carried out against the test corpus, where more OOV polarity words would be likely to appear (see section 4).

On the other hand (AC2), we added Ctw to the training data, on the hypothesis that more training data would lead to a better model, even though polarity strength is not distinguished there; thus, only P and N examples are obtained. In order to evaluate the effect of the new data, the original training data were divided into two parts: 85% (6,137 tweets) for training (Ct−train) and 15% (1,082 tweets) for testing (Ct−test). The test data were randomly selected, and the proportions of the polarity classes were kept equal in both parts. Our first classifier (ALL+AC2) was trained with all the retrieved tweets in Ctw as well as the tweets in Ct−train. The results (see Table 6) show that accuracy decreased when these data were used for training. A second experiment was carried out (ALL+AC2-OOV), adding to the training data Ct−train only those tweets of Ctw containing at least one word w from Pes not appearing in the training corpus (w ∈ Pes ∧ freq(w, Ct−train) = 0). Only 7.9% of the retrieved tweets were added. Results were still unsatisfactory, and so additional training data were left out of the final model.

It must be noted that the tweet retrieval effort was very simple, due to the limited time we had to develop the system. We conclude that these additional training data were unhelpful because of their differences from the original data provided: Ctw contained many more ungrammatical structures and nonstandard tokens than the original data; the dates of the tweets were different, which could even lead to topic and vocabulary differences; and, especially, the additional data collected included neither neutral or objective tweets nor different degrees of polarity for positive and negative tweets.

Features/Metric   #training examples   Accuracy
ALL               6,137                0.573
ALL+AC2           27,365               0.507
ALL+AC2-OOV       7,807                0.569

Table 6: Results obtained by including additional examples in the training data.

4 Evaluation and Results

The evaluation test-set Ce provided by the organization consists of 60,798 Twitter messages (see Table 7), annotated as explained in section 3.1. Only one run of results was allowed for submission. Although the results include classification into 6 categories (5 polarities + NONE), results were also given on a 4-category basis (3 polarities + NONE). For the 4-category results, all tweets regarded as positive are grouped into a single category, and the same is done for negative tweets. Table 8 presents the results for both evaluations using the best-scoring classifiers from the training process. In addition to the accuracy results, Table 8 shows F-scores for each class for the 6-category classification.

The first thing we notice is that the results obtained on the test data are better than those achieved on the training data for all configurations. The best system (ALL+AC1) achieves 0.653 of accuracy


Polarity   #tweets   % of #tweets
P+         20,745    34.12%
P          1,488     2.45%
NEU        1,305     2.15%
N          11,287    18.56%
N+         4,557     7.5%
NONE       21,416    35.22%
Total      60,798    100%

Table 7: Polarity class distribution in test corpus Ce.

while the same system scored 0.523 of accuracy in training. Even the baseline shows the same tendency. Regarding the differences between configurations, the tendencies observed in the cross-validation evaluation of the training data are confirmed in the evaluation on the test data. Moreover, the improvement of ALL+AC1 over the baseline is also higher in the test-data evaluation than in the training cross-validation: a 16.22% improvement in accuracy over the baseline was obtained in training cross-validation, while in the test-data evaluation the improvement rose to 23.91%. The P+ and NONE classes are those our classifier identifies best, NEU and P being the classes with the worst performance (Tables 4 and 8). If we look at the distribution of the polarity classes (Tables 1 and 7), we can see that the proportion of the P+ and NONE classes increases significantly in the test data with respect to the training data; by contrast, the NEU and P classes decrease dramatically. This distribution difference, together with the performance of the system on specific classes, could explain the difference in accuracy between the test and training evaluations. It remains unclear to us why the F-scores for all the classes improved with respect to the training phase. We should analyse the characteristics of the training and test corpora, looking for differences in the samples and their annotation.

As for the results of the individual classes, it is worth mentioning that neutral tweets are very difficult to classify because they do contain polarity words. We looked at the confusion matrix (both for the training and test evaluations), and it shows that wrongly classified NEU tweets are evenly distributed among the other classes, except for the NONE class: almost no NEU tweets are classified as NONE. Most of the NEU tweets contain both positive and negative sentences, which leads us to think that a discourse treatment could be useful in order to determine the importance of each sentence with respect to the whole tweet. In the case of positive tweets, many P tweets are classified as P+.

In the experiment (AC1) described in section 3.3.6, we did not obtain any improvement by adding the words extracted from an additional corpus of tweets to the polarity lexicon Pes. Taking into account that the most significant words of the training corpus Ct were already included in Pes, it could be expected that the words in AC1 would have little effect on the training data. In the evaluation against the test data, where the vocabulary is larger, the AC1 lexicon does provide a slight improvement (see the difference between ALL and ALL+AC1 in Table 8).

Metric/System   Acc. (4 cat.)   Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline        0.616           0.527           0.638   0.214   0.139   0.483   0.471   0.587
ALL             0.702           0.641           0.752   0.323   0.166   0.563   0.564   0.683
ALL+AC1         0.711           0.653           0.753   0.32    0.167   0.566   0.566   0.685

Table 8: Results obtained on the evaluation of the test data.

5 Conclusions

We have presented an SVM classifier for detecting the polarity of Spanish tweets. Our system effectively combines several features based on linguistic knowledge. In our case, using a semi-automatically built polarity lexicon improves the system performance significantly over a unigram model. Other features, such as POS tags and especially word polarity statistics, were also found to be helpful. In our experiments, including external training data was unsuccessful; however, our approach to collecting it was very simple, and so more exhaustive experimentation should be carried out in order to obtain conclusive results. In any case, the system shows robust performance when it is evaluated against test data different from the training data.

There is still much room for improvement. Tweet normalization was naïvely implemented. Some authors (Pang and Lee, 2004; Barbosa and Feng, 2010) have obtained positive results by including a subjectivity analysis phase before the polarity detection step; we would like to explore that line of work. Lastly, it would be worthwhile conducting


in-depth research into the creation of polarity lexicons, including domain adaptation and the treatment of word senses.

Acknowledgments

This work has been partially funded by the Industry Department of the Basque Government under grant IE11-305 (knowTOUR project).

References

Barbosa, Luciano and Junlan Feng. 2010. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 36–44, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bollen, Johan, Huina Mao, and Xiao-Jun Zeng. 2010. Twitter mood predicts the stock market. arXiv:1010.3003, October.

Brody, Samuel and Nicholas Diakopoulos. 2011. Cooooooooooooooollllllllllllll!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 562–570. Association for Computational Linguistics.

Choi, Yejin and Claire Cardie. 2009. Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, EMNLP '09, pages 590–598, Stroudsburg, PA, USA. Association for Computational Linguistics.

Esuli, Andrea and Fabrizio Sebastiani. 2006. SENTIWORDNET: a publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation (LREC'06), pages 417–422.

Foster, Jennifer, Ozlem Cetinoglu, Joachim Wagner, Joseph Le Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. #hardtoparse: POS tagging and parsing the twitterverse. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence, August.

Go, A., R. Bhayani, and L. Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pages 1–12.

Godbole, N., M. Srinivasaiah, and S. Skiena. 2007. Large-scale sentiment analysis for news and blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM), pages 219–222.

Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, November.

Han, Bo and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 368–378, Portland, Oregon, USA, June. Association for Computational Linguistics.

Hu, M. and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Kim, Soo-Min and Eduard Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kouloumpis, E., T. Wilson, and J. Moore. 2011. Twitter sentiment analysis: The good the bad and the OMG! In Fifth International AAAI Conference on Weblogs and Social Media.

Liu, Fei, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1035–1044, Jeju Island, Korea, July. Association for Computational Linguistics.


Liu, X., S. Zhang, F. Wei, and M. Zhou. 2011. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon.

O'Connor, Brendan, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media, May.

Pang, Bo and Lillian Lee. 2004. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, ACL '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 79–86, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rao, Delip and Deepak Ravichandran. 2009. Semi-supervised polarity lexicon induction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pages 675–682, Stroudsburg, PA, USA. Association for Computational Linguistics.

Read, Jonathon. 2005. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, ACLstudent '05, pages 43–48, Stroudsburg, PA, USA. Association for Computational Linguistics.

Riloff, E., J. Wiebe, and W. Phillips. 2005. Exploiting subjectivity classification to improve information extraction. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 1106.

Riloff, Ellen and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 105–112.

Somasundaran, Swapna, Galileo Namata, Janyce Wiebe, and Lise Getoor. 2009. Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 170–179, Stroudsburg, PA, USA. Association for Computational Linguistics.

Speriosu, Michael, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the First Workshop on Unsupervised Learning in NLP, EMNLP '11, pages 53–63, Stroudsburg, PA, USA. Association for Computational Linguistics.

Turney, Peter D. 2002. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, page 417, Philadelphia, Pennsylvania.

Velikovich, Leonid, Sasha Blair-Goldensohn, Kerry Hannan, and Ryan McDonald. 2010. The viability of web-derived polarity lexicons. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 777–785, Stroudsburg, PA, USA. Association for Computational Linguistics.

Wilson, Theresa, Paul Hoffmann, Swapna Somasundaran, Jason Kessler, Janyce Wiebe, Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. 2005. OpinionFinder. In Proceedings of HLT/EMNLP on Interactive Demonstrations, pages 34–35, Vancouver, British Columbia, Canada.

Techniques for Sentiment Analysis and Topic Detection of Spanish Tweets: Preliminary Report∗

Técnicas de análisis de sentimientos y detección de asunto de tweets en español: informe preliminar

Antonio Fernández Anta (Institute IMDEA Networks, Madrid, Spain)
Philippe Morere† (ENSEIRB-MATMECA, Bordeaux, France)
Luis Núñez Chiroque (Institute IMDEA Networks, Madrid, Spain)
Agustín Santos (Institute IMDEA Networks, Madrid, Spain)

∗ Partially funded by the Spanish Ministerio de Economía y Competitividad.
† Work partially done while visiting Institute IMDEA Networks.

Resumen: El análisis de sentimientos y la detección de asunto son nuevos problemas que están en la intersección del procesamiento de lenguaje natural y la minería de datos. El primero intenta determinar si un texto es positivo, negativo o neutro, mientras que el segundo intenta identificar la temática del texto. Un esfuerzo significativo está siendo invertido en la construcción de soluciones efectivas para estos dos problemas, principalmente para textos en inglés. Usando un corpus de tweets en español, presentamos aquí un análisis comparativo de diversas aproximaciones y técnicas de clasificación para estos problemas. Los datos de entrada son preprocesados usando técnicas y herramientas propuestas en la literatura, junto con otras específicamente propuestas aquí y que tienen en cuenta las peculiaridades de Twitter. Después, se han utilizado clasificadores populares (de hecho, se han usado todos los clasificadores de WEKA). Debido a su alto número, no todos los resultados obtenidos son presentados.
Palabras clave: Análisis de sentimientos, detección de asunto.

Abstract: Sentiment analysis and topic detection are new problems that lie at the intersection of natural language processing (NLP) and data mining. Sentiment analysis attempts to determine if a text is positive, negative, or neither, while topic detection attempts to identify the subject of the text. A significant amount of effort has been invested in constructing effective solutions for these problems, mostly for English texts. Using a corpus of Spanish tweets, we present a comparative analysis of different approaches and classification techniques for these problems. The data is preprocessed using techniques and tools proposed in the literature, together with others specifically proposed here that take into account the characteristics of Twitter. Then, popular classifiers have been used (in particular, all the classifiers of WEKA have been evaluated). Due to their large number, not all the results obtained are presented here.
Keywords: Sentiment analysis, topic detection.

1 Introduction

With the proliferation of online reviews, ratings, recommendations, and other forms of online opinion expression, there is a growing interest in techniques for automatically extracting the information they embody. Two of the problems that have been posed to achieve this are sentiment analysis and topic detection, which are at the intersection of natural language processing (NLP) and data mining. Sentiment analysis attempts to determine if a text is positive, negative, or neither, possibly providing degrees within each type. In turn, topic detection attempts to identify the main subject of a given text. Research in both problems is very active, and a num-


Research in both problems is very active, and a number of methods and techniques have been proposed in the literature to solve them. Most of these techniques focus on English texts and study large documents. In our work, we are interested in languages other than English and in micro-texts. In particular, we are interested in sentiment and topic classification applied to Spanish Twitter micro-blogs. Spanish is increasingly present on the Internet, and Twitter has become a popular medium to publish thoughts and information, with its own characteristics. For instance, publications in Twitter take the form of tweets (i.e., Twitter messages), which are micro-texts with a maximum of 140 characters. In Spanish tweets it is common to find specific Spanish elements (SMS abbreviations, hashtags, slang). The combination of these two aspects makes this a distinctive research topic, with potentially deep industrial applications.

The motivation of our research is twofold. On the one hand, we would like to know whether the usual approaches that have proved effective with English texts are also effective with Spanish tweets. On the other, we would like to identify the best (or at least a good) technique for Spanish tweets. For this second question, we would like to evaluate the techniques proposed in the literature, and possibly propose new ad hoc techniques for our specific context. In our study, we sketch a comparative study of several schemes for term weighting, linguistic preprocessing (stemming and lemmatization), term definition (e.g., based on uni-grams or n-grams), the combination of several dictionaries (sentiment, SMS abbreviations, emoticons, spelling, etc.), and the use of several classification methods. When possible, we have used freely available tools, like the Waikato Environment for Knowledge Analysis (WEKA, an open-source software package which consists of a collection of machine learning algorithms for data mining) (at University of Waikato, 2012).

1.1 Related Work

As mentioned above, sentiment analysis, also known as opinion mining, is a challenging Natural Language Processing (NLP) problem. Due to its tremendous value for practical applications, it has received a lot of attention, and it is perhaps one of the most widely studied topics in the NLP field. Pang and Lee (Pang and Lee, 2008) provide a comprehensive survey of sentiment analysis and opinion mining research. Liu (Liu, 2010), in turn, reviews and discusses a wide collection of related works. Although most of the research conducted focuses on English texts, the number of papers on the treatment of other languages is increasing every day. Examples of research papers on Spanish texts are (Brooke, Tofiloski, and Taboada, 2009; Martínez-Cámara, Martín-Valdivia, and Ureña-López, 2011; Martínez Cámara et al., 2011).

Most of the algorithms for sentiment analysis and topic detection use a collection of data to train a classifier that is later used to process the real data. The (training and real) data is processed before being used for (building or applying) the classifier, in order to correct errors and extract the main features (to reduce the required processing time or memory). Many different techniques have been proposed for these phases. For instance, different classification methods have been proposed, like Naive Bayes, Maximum Entropy, Support Vector Machines (SVM), BBR, KNN, or C4.5. In fact, there is no final agreement on which of these classifiers is the best. For instance, Go et al. (Go, Bhayani, and Huang, 2009) report similar accuracy with classifiers based on Naive Bayes, Maximum Entropy, and SVM.

Regarding preprocessing of the data (texts in our case), one of the first decisions to be made is which elements will be used as basic terms. Laboreiro et al. (Laboreiro et al., 2010) explore tweet tokenization (or symbol segmentation) as the first key task for text processing. Once single words or terms are available, typical choices are using uni-grams, bi-grams, n-grams, or parts of speech (POS). Again, there is no clear conclusion on which is the best option, since Pak and Paroubek (Pak and Paroubek, 2010) report the best performance with bi-grams, while Go (Go, Bhayani, and Huang, 2009) present better results with uni-grams. The preprocessing phase may also involve word processing of the input texts: stemming, spelling correction, and/or semantic analysis. Tweets are usually very short, may contain emoticons like :) or :-), or abbreviated (SMS) words like "Bss" for "Besos" ("kisses").


Agarwal et al. (Agarwal et al., 2011) propose the use of several dictionaries: an emoticon dictionary and an acronym dictionary. Other preprocessing tasks that have been proposed are contextual spell-checking and name normalization (Kukich, 1992).

One important question is whether the algorithms and techniques proposed for one type of data can be directly applied to tweets. This could be very convenient, since a corpus of Spanish movie reviews (from Muchocine¹) has already been collected and studied (Cruz et al., 2008; Martínez Cámara et al., 2011). Unfortunately, Twitter data poses new and different challenges, as discussed by Agarwal et al. (Agarwal et al., 2011) when reviewing some early and recent results on sentiment analysis of Twitter data (e.g., (Go, Bhayani, and Huang, 2009; Bermingham and Smeaton, 2010; Pak and Paroubek, 2010)). Engström (Engström, 2004) has also shown that the bag-of-features approach is topic-dependent, and Read (Read, 2005) demonstrated that models are also domain-dependent.

These papers, as expected, use a broad spectrum of tools for the extraction and classification processes. For feature extraction, FreeLing (Padró et al., 2010) has been proposed, which is a powerful open-source language processing software. We use it as analyzer and for lemmatization. For classification, Justin et al. (Justin et al., 2010) report very good results using WEKA (at University of Waikato, 2012; Hall et al.), which is one of the most widely used tools for the classification phase. Other authors proposed the use of additional libraries like LibSVM (Chang and Lin, 2011). In contrast, some authors (e.g., (Phuvipadawat and Murata, 2010)) propose the use of Lucene (Lucene, 2005) as index and text search engine.

Most of the references above deal with sentiment analysis, since this is a very popular problem. However, the problem of topic detection is also becoming popular (Sriram et al., 2010), among other reasons, to identify trending topics (Allan, 2002; Bermingham and Smeaton, 2010; Lee et al., 2011). Due to the real-time nature of Twitter data, most works (Mathioudakis and Koudas, 2010; Sankaranarayanan et al., 2009; Vakali, Giatsoglou, and Antaris, 2012; Phuvipadawat and Murata, 2010) are interested in breaking-news detection and tracking. They propose methods for the classification of tweets in an open (dynamic) set of topics. Instead, in this work we are interested in a closed (fixed) set of topics. However, we explore all the indexing and clustering techniques proposed, since most of them could be applied to the sentiment analysis process.

1.2 Contributions

In this paper we have explored the performance of several preprocessing, feature extraction, and classification methods on a corpus of Spanish tweets, both for sentiment analysis and for topic detection. The different methods considered can be classified into almost orthogonal families, so that a different method can be selected from each family to form a different configuration. In particular, we have explored the following families of methods.

Term definition and counting In this family it is decided what constitutes a basic term to be considered by the classification algorithm. The different alternatives are using single words (uni-grams) or groups of words (bi-grams, tri-grams, n-grams) as basic terms. Of course, the aggregation of all these alternatives is possible, but it is typically never used, because it results in a huge number of different terms, which makes the processing hard or even impossible. Each of the different terms that appear in the input data is called an attribute by classification algorithms. Once the term formation is defined, the list of attributes in the input data is found, and the occurrences of each attribute are counted.

Stemming and lemmatization One of the main differences between Spanish and English is that English is a weakly inflected language, in contrast to Spanish, a highly inflected one. A part of our work is the stemming and lemmatization process. In order to reduce the feature dimension (number of attributes), each word can be reduced to either its lemma (canonical form) (e.g., "cantábamos" is reduced to its infinitive "cantar") or its stem (e.g., "cantábamos" is reduced to "cant"). One interesting question is to compare how well the usual stemming and lemmatization processes perform with Spanish words.

¹ http://www.muchocine.net
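To make the lemma/stem contrast concrete, here is a minimal illustrative sketch. The lookup table and suffix list are our own toy stand-ins: the actual system uses FreeLing for lemmatization and the Snowball stemmer for stemming.

```python
# Toy stand-ins for a real lemmatizer (FreeLing) and stemmer (Snowball).
LEMMAS = {"cantábamos": "cantar", "buenas": "bueno", "buenos": "bueno"}

def toy_lemma(word):
    """Look up the canonical form; fall back to the word itself."""
    return LEMMAS.get(word, word)

def toy_stem(word):
    """Crudely strip a few common Spanish suffixes, longest first."""
    for suf in ("ábamos", "amiento", "ando", "ar", "as", "os", "a", "o"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(toy_lemma("cantábamos"))  # cantar
print(toy_stem("cantábamos"))   # cant
```

Note that the lemma stays a valid word while the stem need not be one, which is exactly the trade-off discussed in the text.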


Word processing and correction Several dictionaries are available to correct the words and thus reduce the noise caused by mistakes. A spell checker can be used to correct typos. Other grammar dictionaries can replace emoticons, SMS abbreviations, and slang terms by their meaning in correct Spanish. In addition, any special-term dictionary can be applied to get a context in a tweet (e.g., an affective word list can give us the tone of a text, which is relevant for sentiment analysis). Finally, it is possible to use a morphological analyzer to determine the type of each word. Thus, a word-type filter can be applied to the tweets.

Valence shifters By default, once the decision of what constitutes a basic term is made, each term has the same weight in a tweet. A clear improvement to this term-counting method is the processing of valence shifters and negative words. Examples of negative words are "no", "ni", or "sin" ("not", "neither", "without"), while examples of valence shifters are "muy" or "poco" ("very", "little"). These words are useful for sentiment classification, since they change and/or revert the strength of a neighboring term.

Tweet semantics The above approaches can be improved by processing specific tweet artifacts such as author tags, hashtags, and URLs (links) provided in the text. The author tags act like a history of the tweets of a specific person. Because this person will most likely post tweets about the same topic, this might be relevant for topic detection. Additionally, the hashtags are a great indicator of the topic of a tweet, whereas retrieving keywords from the web page linked within a tweet allows overcoming the 140-character limit and thus improves the efficiency of the estimation. Another way to overcome this limit is to look up the keywords of a tweet in a search engine to retrieve other words of the same context.

Classification methods In addition to these variants, we have explored the full spectrum of classification methods provided by WEKA.

We can construct a large set of (more than 100 thousand) different methods by combining features from all the described families. As this number of combinations is too high, we had to reduce it manually, choosing a subset of all the methods that is manageable and that we think is the most relevant. We hope the reader finds the subset we present satisfactory.

The rest of the paper is structured as follows. In Section 2 we describe in detail the different techniques that we have implemented or used. In Section 3 we describe our evaluation scenario and the results we have obtained. Finally, in Section 4 we present some conclusions and open problems.

2 Methodology

In this section we give the details of how the different methods considered have been implemented in our system. A summary of these parameters is presented in Table 1.

2.1 Term Definition and Processing

n-grams As we mentioned, a term is the basic element that will be considered by the classifiers. These terms will be sets of n words (n-grams), with terms consisting of single words (uni-grams) as a special case. The value of n is defined in our algorithm with the parameter n-gram (see Table 1). The reason for considering the use of n-grams with n > 1 (instead of always restricting the terms to individual words) is that they are particularly effective at recognizing common expressions of a language. Also, by keeping a word in its context, it is possible to differentiate its different meanings. For example, in the sentences "estoy cerca" ("I am close") and "cierro la cerca" ("I close the fence"), using 2-grams will allow detecting the two different meanings of the word "cerca". As the words stay in their context, an n-gram carries more information than the sum of the information of its n words: it also carries the context information. (Using uni-grams, every single word is a term, and all context information is lost.)

When using n-grams, n is a parameter that highly influences performance. A high value of n captures more context information, since the combinations of words are less probable. On the other hand, rare combinations mean fewer occurrences in the data set, which means that a bigger data set is needed to obtain good results. Also, the larger n is, the longer the attribute list is. In addition, since tweets are short, choosing a large n would result in n-grams of almost the size of a tweet, which would make little sense.
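The context-preserving effect of n-grams can be sketched as follows (an illustrative fragment, not the paper's implementation):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# "cerca" appears in two different contexts:
close_by = "estoy cerca".split()    # "I am close"
fence = "cierro la cerca".split()   # "I close the fence"

# With uni-grams both tweets share the ambiguous term ("cerca",).
# With bi-grams the contexts differ: ("estoy", "cerca") vs. ("la", "cerca").
print(ngrams(close_by, 2))  # [('estoy', 'cerca')]
print(ngrams(fence, 2))     # [('cierro', 'la'), ('la', 'cerca')]
```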


Parameter/flag Description Process


n-gram Number of words that form a term Both
Only n-gram Whether words are also terms Both
Use input data Whether the input data is used to define attributes Both
Lemma/Stem Which technique is used to extract the root of words Both
Correct words Whether a dictionary is used to correct misspellings Both
SMS Whether an emoticons and SMS dictionary is used Both
Word types Types of words to be processed Both
Affective dictionary Whether an affective dictionary is used to define attributes Sentiment
Negation Whether negations are considered Sentiment
Weight Whether valence shifters are considered Sentiment
Hashtags Whether hashtags are considered as attributes Topic
Author tags Whether author tags are considered as attributes Topic
Links Whether data from linked web pages is used Topic
Search engine Whether a search engine is used Topic

Table 1: Parameters and flags that define a configuration of our algorithm.

We found that, in practice, having n larger than 3 did not improve the results, so we limit n to be no larger than 3.

Of course, it is possible to combine the n-grams with several values of n. We only consider the possibility of combining two such values, and one has to be n = 1. This is controlled with the flag Only n-gram (see Table 1), which says whether only n-grams (with n > 1) are considered as terms, or individual words (uni-grams) are also considered. In the latter case, the lists of attributes of both cases are merged. The drawback of merging is the high number of entries in the final attribute list. Hence, when doing this, a threshold is used to remove all the attributes that appear too few times in the data set, as they are considered noise. We require that an attribute appears at least 5 times in the data set to be considered. Also, a second threshold is used to remove ambiguous attributes. For example, the entry "ha sido" ("has been") can be found in tweets independently of their topic or sentiment and can be safely removed. This threshold has been set to 85%, which means that more than 85% of the occurrences of an entry have to be for a specific topic or sentiment.

Processing Terms The processing of terms involves first building the list of attributes, which is the list of different terms that appear in the data set of interest. In principle, the data set used to identify attributes is formed at least by all the tweets that are provided as input to the algorithm, but there are cases in which we do not use them. For instance, when using an affective dictionary (see below), we may not use the input data. This is controlled with a parameter that we denote Use input data (see Table 1). Moreover, even if the input data is processed, we may filter it and only keep some of it; for instance, we may decide to use only nouns. This can be controlled with the parameter Word types (see Table 1), which is described below. In summary, the list of attributes is built from the input data (if so decided), preprocessed as determined by the rest of the parameters (e.g., filtered by Word types), and potentially from additional data (like the affective dictionary).

Once the list of attributes is constructed, a vector is created for each tweet in the input data. This vector has one position for each attribute, so that the value at that position is the number of occurrences of the attribute in the tweet. This value can be modified in some tweets if the occurrence of an attribute is near a valence shifter (see below). Once this process is completed, the list of attributes and the list of vectors obtained from the tweets are the data passed to the classifier.

2.2 Stemming and Lemmatization

When creating the list of attributes from a collection of terms, different forms of the same word will be found (e.g., singular/plural, masculine/feminine). Including each form as a different attribute would make the list unnecessarily long. Hence, typically only the root of the words is used in the attribute list. The root can take the form of the lemma or the stem of the word.
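The attribute-building and vector-creation steps just described (minimum of 5 occurrences, 85% purity threshold for ambiguous entries) can be sketched as follows; function and variable names are ours, and the sketch assumes tokenized tweets with one class label each:

```python
from collections import Counter, defaultdict

MIN_COUNT = 5  # attributes seen fewer times are treated as noise
PURITY = 0.85  # fraction of occurrences that must fall in one class

def build_attributes(tweets, labels):
    """tweets: list of token lists; labels: topic/sentiment per tweet."""
    total = Counter()
    per_class = defaultdict(Counter)
    for tokens, label in zip(tweets, labels):
        for tok in tokens:
            total[tok] += 1
            per_class[tok][label] += 1
    attrs = []
    for tok, n in total.items():
        if n < MIN_COUNT:
            continue  # frequency threshold: too rare
        top = per_class[tok].most_common(1)[0][1]
        if top / n < PURITY:
            continue  # ambiguous across topics/sentiments
        attrs.append(tok)
    return sorted(attrs)

def vectorize(tokens, attrs):
    """Occurrence-count vector of one tweet over the attribute list."""
    counts = Counter(tokens)
    return [counts[a] for a in attrs]
```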


The process of extracting it is called lemmatization or stemming, respectively. Lemmatization preserves the meaning and type of a word (e.g., the words "buenas" and "buenos" become "bueno"). We have used the FreeLing software to perform this processing, since it can provide the lemma of those words that are in its dictionary. After lemmatization, there are no plurals or other inflected forms, but still two words with the same root but different type may appear. Stemming, in turn, reduces the list of attributes even more. A stem is a word whose affixes have been removed. Stemming might lose the meaning and any morphological information that the original word had (e.g., the words "aparca", a verb, and "aparcamiento", a noun, both become "aparc"). The Snowball (Sno, 2012) software stemmer has been used in our experiments.

We have decided to always use one of the two processes. Which one is used in a particular configuration is controlled with the parameter Lemma/Stem (see Table 1).

2.3 Word Processing and Correction

As mentioned above, one of the possible preprocessing steps of the data before extracting attributes and vectors is to correct spelling errors. Whether or not this step is taken is controlled with the flag Correct words (see Table 1). If correction is done, the algorithm uses the Hunspell dictionary (Hun, 2012) (an open-source spell checker) to perform it.

Another optional preprocessing step (controlled with the flag SMS) expands the emoticons, shorthand notations, and slang commonly used in SMS messages, which are not understandable by the Hunspell dictionary. The use of these abbreviations is common in tweets, given the limitation to 140 characters. An SMS dictionary (dic, 2012) is used to do the preprocessing. It transforms the SMS notations into words understandable by the main dictionary. Also, the emoticons are replaced by words that describe their meaning. For example, :-) is replaced by feliz ("happy") and :-( by triste ("sad"). The emoticons tend to have a strong emotional semantics. Hence, this process helps estimating the sentiment of the tweets with emoticons.

We have observed that the information of a sentence is mainly located in a few keywords. These keywords have a different type according to the information we are interested in. For topic estimation, the keywords are mainly nouns and verbs, whereas for sentiment analysis, they are adjectives and verbs. For example, in the sentence La pelicula es buena ("The movie is good"), the only word that carries the topic information is the noun pelicula, which is very specific to the cinema topic. Besides, the word that best reflects the sentiment of the sentence is the adjective buena, which is positive. Also, in the sentence El equipo ganó el partido ("The team won the match"), the verb ganó carries information for both topic and sentiment analysis: the verb ganar is used very often in the soccer and sports topics and has a positive sentiment. We allow filtering the words of the input data by their type with the parameter Word types (see Table 1). The filtering is done using the FreeLing software, which is used to retrieve the type of each word.

When performing sentiment analysis, we have found it useful to have an affective dictionary, whose use is controlled with the flag Affective dictionary (see Table 1). We have used an affective dictionary developed by Martín García (García, 2009). This dictionary consists of a list of words that have a positive or negative meaning, expanded with their polarity "P" or "N" and their strength "+" or "-". For example, the words bueno ("good") and malo ("bad") are respectively positive and negative with no strength, whereas the words mejor ("best") and peor ("worse") are respectively positive and negative with a positive strength. As a first approach, we have not intensively used the polarity and the strength of the affective words in the dictionary. Its use only forces the words it contains to be added as attributes. This has the advantage of drastically reducing the size of the attribute list, specially if the input data is filtered. Observe that the use of this dictionary for sentiment analysis is very pertinent, since the affective words carry the tweet polarity information. In a more advanced future approach, the characteristics of the words could be used to compute weights. Since not all the words in our affective dictionary may appear in the corpus we have used, we have built artificial vectors for the learning machine.
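The emoticon and SMS expansion step can be sketched as follows; the mini-dictionaries here are hypothetical examples standing in for the Hunspell, SMS, and emoticon dictionaries actually used:

```python
# Hypothetical mini-dictionaries illustrating the SMS/emoticon expansion.
EMOTICONS = {":-)": "feliz", ":)": "feliz", ":-(": "triste", ":(": "triste"}
SMS = {"bss": "besos", "q": "que", "xq": "porque"}

def normalize(tweet):
    """Replace emoticons and SMS shorthand by standard Spanish words."""
    out = []
    for token in tweet.split():
        if token in EMOTICONS:
            out.append(EMOTICONS[token])
        else:
            out.append(SMS.get(token.lower(), token))
    return " ".join(out)

print(normalize("Bss q bien :-)"))  # -> besos que bien feliz
```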


There is one artificial vector per sentiment analysis category (positive+, positive, negative, negative+, none), which has been built by counting one occurrence of those words whose polarity and strength match the appropriate category.

2.4 Valence Shifters

There are two different aspects of valence shifting that are used in our methods. First, we may take into account negations, which can invert the sentiment of positive and negative terms in a tweet. Second, we may take weighted words, which are intensifiers or weakeners, into account. Whether these cases are processed is controlled by the flags Negation and Weight (see Table 1).

Negations are words that reverse the sentiment of other words. For example, in the sentence La pelicula no es buena ("The movie is not good"), the word buena is positive, whereas it should be negative because of the negation no. The way we process negations is as follows. Whenever a negative word is found, the sign of the 3 terms that follow it is reversed. This allows us to differentiate a positive buena from a negative buena. The area of effect of the negation is restricted in order to avoid false negative words in more sophisticated sentences.

Other valence shifters are words that change the degree of the expressed sentiment. Examples of these are, for instance, muy ("very"), which increases the degree, or poco ("little"), which decreases it. These words were included in the dictionary developed by Martín García (García, 2009) as words with positive or negative strength but no polarity. If the flag Weight is set, our algorithm finds these words in the tweets and changes the weight of the 3 terms following them. If the valence shifter has positive strength, the weight is multiplied by 3, while if it is negative, by 0.5.

2.5 Twitter Artifacts

It has been noticed that with the previous methods, not all the potential data contained in the tweets is used. There are several frequent elements in tweets that carry a significant amount of information. Among others, we have the following.

• Hashtags (any word which starts with "#"). They are used to identify messages about the same topic. Hashtags are very helpful for topic estimation, since some of them may carry more topic information than the rest of the tweet. For example, if a tweet contains #BAR, which is the hashtag of the Barcelona soccer team, it can almost certainly be classified as a soccer tweet.

• References (a "@" followed by the username of the referenced user). They are used to reference other Twitter users. Any user can be referenced. For example, @username means the tweet is answering a tweet of username, or referring to him/her. References are interesting because some users appear more frequently in certain topics and will more likely tweet about them. A similar behaviour can be found for sentiment.

• Links (a URL). Because of the character limitation of the tweets, users often include URLs of web pages where more details about the message can be found. This may help to obtain more context, specially for topic detection.

In our algorithms, we have the possibility of including hashtags and references as attributes. This is controlled by the flags Hashtags and Author tags (see Table 1), respectively. We believe that these options are just a complement to the previous methods and cannot be used alone, because we have found that the number of hashtags and references in the tweets is too small.

We also provide the possibility of adding to the terms of a tweet the terms obtained from the web pages linked from the tweet. This is controlled by the flag Links. A first approach could have been to retrieve the whole source code of the linked page, get all the terms it contains, and keep the ones that match the attribute list. Unfortunately, there are too many terms, and the menus of the pages induce unexpected noise which degrades the results. The approach we have chosen is to keep only the keywords of the pages. We chose to retrieve only the text within the HTML tags h1, h2, h3, and title. The results with this second method are much better, since the keywords are directly related to the topic.

Because of the short length of the tweets, our estimations often suffer from a lack of words. We found a solution to this problem in several papers (Banerjee, Ramanathan, and Gupta, 2007; Gabrilovich and Markovitch, 2005; Rahimtoroghi and Shakery, 2011) that use web sources (like Wikipedia or the Open Directory) to complete tweets.
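The negation and weighting rules of Section 2.4 (sign reversal and ×3/×0.5 multipliers over the 3 following terms) can be sketched as follows. This is our own simplified version: negations and shifters share a single scope counter, and shifter words themselves get weight 0.

```python
NEGATIONS = {"no", "ni", "sin"}
INTENSIFIERS = {"muy": 3.0}  # positive-strength shifter: weight x3
WEAKENERS = {"poco": 0.5}    # negative-strength shifter: weight x0.5
SCOPE = 3                    # number of following terms affected

def weight_terms(tokens):
    """Assign each token a signed weight: a negation reverses the sign
    of the next SCOPE terms; a valence shifter multiplies their weight."""
    weights = []
    sign, factor, left = 1.0, 1.0, 0
    for tok in tokens:
        if left == 0:
            sign, factor = 1.0, 1.0  # scope exhausted: reset
        if tok in NEGATIONS:
            sign, left = -1.0, SCOPE
            weights.append(0.0)  # the shifter itself carries no weight
            continue
        if tok in INTENSIFIERS or tok in WEAKENERS:
            factor = INTENSIFIERS.get(tok, WEAKENERS.get(tok))
            left = SCOPE
            weights.append(0.0)
            continue
        weights.append(sign * factor)
        if left > 0:
            left -= 1
    return weights

# "buena" gets weight -1 because it falls in the scope of "no".
print(weight_terms("la pelicula no es buena".split()))
# -> [1.0, 1.0, 0.0, -1.0, -1.0]
```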

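The Twitter artifacts of Section 2.5 (hashtags, references, links) can be pulled out with simple patterns; the regular expressions and the sample tweet below are our own illustrations:

```python
import re

HASHTAG = re.compile(r"#(\w+)")
REFERENCE = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")

tweet = "#BAR gana otra vez! gracias @usuario http://example.com/cronica"
print(HASHTAG.findall(tweet))    # ['BAR']
print(REFERENCE.findall(tweet))  # ['usuario']
print(URL.findall(tweet))        # ['http://example.com/cronica']
```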

The web is a mine of information, and search engines can be used to retrieve it. We have used this technique to obtain many keywords and a context from just a few words taken from the tweets. For implementation reasons, Bing (Bin, 2012) was chosen for the process. The title and description of the first 10 results of the search are kept and processed in the same way as the words of the tweet. We found out that we obtain better results by searching in Bing with only the nouns contained in the tweet; therefore, this is the option we chose. The activation of this option is controlled with the flag Search engine.

2.6 Classification Methods

The Waikato Environment for Knowledge Analysis (WEKA) (at University of Waikato, 2012) is a collection of machine learning algorithms that can be used for classification and clustering. The workbench includes algorithms for classification, regression, clustering, attribute selection, and association rule mining. Almost all popular classification algorithms are included. WEKA includes several Bayesian methods, decision tree learners, random trees and forests, etc. It also provides several separating-hyperplane approaches and lazy learning methods.

Since we use WEKA as learning machine, it is worth knowing that each element in the learning machine data set will be called an attribute, and each element of the data itself will be called a vector. (These correspond to the attributes and vectors we have been handling above.) WEKA uses a specific file format, ARFF (Attribute-Relation File Format), to reference the attributes and the vectors it uses to learn. This file is first composed of a list of all the attributes, whose order is directly related to the order of the vectors' values. The second part of the file is composed of a list of vectors, each one representing a tweet. Thus, each tweet adds a vector (line) to the file, whereas an attribute adds a line in the first part of the file and a value in each vector.

The different parameters described in Table 1 form a configuration that tells our algorithm which attributes to choose and how to create the vectors. The output of this algorithm is an ARFF file for the configuration and the input data. In general, some of the parameters intend to reduce the size of this file, mainly for two reasons. First, it has been noticed that WEKA is more efficient when there is a smaller number of attributes. Second, a smaller file avoids lack-of-memory issues: a great amount of memory, which is proportional to the file size, is needed while WEKA builds a model.

Once the ARFF file is available, we are able to run all the classification algorithms that WEKA provides. However, due to time limits, below we will concentrate on only a few.

3 Experimental Results

3.1 Data Sets

We have used a corpus of tweets provided for the TASS workshop at the SEPLN 2012 conference (TAS, 2012) as input data set. This set contains about 70,000 tweets provided as tuples ID, date, userID. Additionally, over 7,000 of the tweets were given as a small training set with both topic (chosen among politics, economy, technology, literature, music, cinema, entertainment, sports, soccer, or others) and sentiment (or polarity, chosen among strong positive, positive, neutral, negative, strong negative, or none) classification. The data set was shuffled for the topics and sentiments to be randomly distributed. Due to the large time taken by the experiments with the large data set, most of the experiments presented have used the small data set, using 5,000 tweets for training and 2,000 for evaluation.

3.2 Configurations for the Submitted Results

We tested multiple configurations with all the WEKA classifiers to choose the one with the highest accuracy to be submitted to the TASS challenge. Different configurations gave the best results for sentiment analysis and topic detection. For instance, for topic detection the submitted results were obtained with a Complement Naive Bayes classifier on attributes and vectors obtained from the input data by applying neither lemmatization nor stemming, filtering the words and keeping only nouns, and using hashtags and author tags. The accuracy reported by the challenge organizers on the large data set is 45.24%.

Regarding sentiment (polarity), the submitted results were obtained by first classifying the tweets in 5 subsets by using the topic detection algorithm, and then running the sentiment analysis algorithm within each
<i>XVIII Congreso de la Asociación Española para el Procesamiento del Lenguaje Natural</i>, edited by Llavorí, Rafael Berlanga, et al., Universitat Jaume I. Servei de
Comunicació i Publicacions, 2012. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliotecauptsp/detail.action?docID=4184256.
Created from bibliotecauptsp on 2019-09-28 09:35:08.
XVIII CONGRESO DE LA SOCIEDAD ESPAÑOLA PARA EL PROCESAMIENTO DEL LENGUAJE NATURAL 120

subset. The latter used Naive Bayes Multi- for each classification method a new configu-
nomial on data preprocessed by using the af- ration is created and tested with the param-
fective dictionary, filtering words and keep- eter settings that maximized the accuracy.
ing only adjectives and verbs (adjectives were The accuracy values computed in each of
stemmed, and verbs were lemmatized), using the configurations with the five methods with
the SMS dictionary, and processing negations the small data set are presented in Figures
at the sentence level. The accuracy reported 1 and 2. In both figures, Configuration 1
in the large data set was of 36.04%. is the basic configuration. The derived con-
Since the mentioned results were submit- figurations are numbered 2 to 9. (Observe
ted, we have worked on making the algorithm that each accuracy value that improves over
more flexible, so it is simpler to activate and the accuracy with the basic configuration is
deactivate certain processes. This has led to shown on boldface.) Finally, the last 5 con-
a slightly different behaviour from the sub- figurations of each figure correspond to the
mitted version, but we believe it has resulted parameters settings that gave highest accu-
in an improvement in accuracy. racy in the prior configurations for a method
(in the order Ibk, Complement Naive Bayes,
3.3 Process to Obtain the New Naive Bayes Multinomial, Random Commit-
Experimental Results tee, and SMO).
As mentioned, the algorithm used for ob-
taining the new experimental results, is more 3.4 Topic Estimation Results
flexible and can be configured with the pa- As mentioned, Figure 1 presents the accu-
rameters defined in Table 1. In addition, racy results for topic detection on the small
all classification methods of WEKA can be data set, under the basic configuration (Con-
used. Unfortunately, it is unfeasible to exe- figuration 1), configurations derived from this
cute all possible configurations with all pos- one by toggling one by one every parameter
sible classification methods. Hence, we have (Configurations 2 to 9), and the seemingly
made some decisions to limit the number of best parameter settings for each classification
experiments. method (Configurations 10 to 14). Observe
First, we have chosen only five clas- that there are no derived configuration with
sification algorithms from those provided the search engine flag set. This is because
by WEKA. In particular, we have chosen the ARFF file generated in that configuration
the methods Ibk, Complement Naive Bayes, after searching the web as described above
Copyright © 2012. Universitat Jaume I. Servei de Comunicació i Publicacions. All rights reserved.

Naive Bayes Multinomial, Random Commit- (even for the small data set) was extremely
tee, and SMO. This set tries to cover the large and the experiment could not be com-
most popular classification techniques. Sev- pleted
eral configurations of the parameters from The first fact to be observed in Figure 1
Table 1 will be evaluated with these 5 meth- is that Configuration 1, which is supposed
ods. to be similar to the one used for the sub-
Second, we have chosen for each of the mitted results, seems to have a better ac-
two problems (topic and sentiment) a basic curacy with some methods (more than 56%
configuration. In each case, the basic con- versus 45.24%). However, it must be noted
figuration is as close as possible to the con- that this accuracy has been computed with
figuration used to obtain the submitted re- the small data set (while the value of 45.24%
sults. (Since the algorithm has been mod- was obtained with the large one). A second
ified to add flexibility, the exact submitted observation is that in the derived configura-
configuration could not be used.) The rea- tions there is no parameter that by changing
son for choosing these as basic configurations its setting drastically improves the accuracy.
is that they were found to be the most ac- This also applies to the rightmost configu-
curate among those explored before submis- rations, that combine the best collection of
sion. Then, starting from this basic config- parameter settings.
uration a sequence of derived configurations Finally, it can be observed that the largest
are tested. In each derived configuration, one accuracy is obtained by Configuration 2 with
of the parameters of the basic configuration Complement Naive Bayes. This configura-
was changed, in order to explore the effect of tion is obtained from the basic one by sim-
that parameter in the performance. Finally, ply removing the word filter that allow only
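To illustrate the ARFF layout described in Section 2.6 (a header listing the attributes in the same order as the vector values, followed by one data line per tweet), the following minimal Python sketch renders such a file. This is not the actual generator used in the experiments; the relation name, attribute names, and values are invented examples.

```python
# Minimal sketch of the ARFF layout: a header of @ATTRIBUTE lines whose
# order matches the order of values in each vector, then one data line
# (vector) per tweet. Names and values below are invented examples.

def to_arff(relation, attributes, vectors):
    """Render attributes and vectors as an ARFF-formatted string."""
    lines = ["@RELATION " + relation, ""]
    # First part: one @ATTRIBUTE line per attribute, in vector-value order.
    for name, arff_type in attributes:
        lines.append("@ATTRIBUTE {} {}".format(name, arff_type))
    # Second part: one comma-separated line per vector (one per tweet).
    lines.append("")
    lines.append("@DATA")
    for vec in vectors:
        lines.append(",".join(str(v) for v in vec))
    return "\n".join(lines)

# Hypothetical attributes: two word-count attributes plus a class label.
attributes = [
    ("word_economia", "NUMERIC"),
    ("word_futbol", "NUMERIC"),
    ("topic", "{politics,economy,sports,others}"),
]
vectors = [
    [2, 0, "economy"],
    [0, 3, "sports"],
]

print(to_arff("tweets", attributes, vectors))
```

Note how adding one more tweet only appends a data line, while adding one more attribute adds a header line and lengthens every vector, which is why the size-reducing parameters of Table 1 matter.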

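The experiment-generation scheme of Section 3.3 (a basic configuration, derived configurations that each toggle one parameter, and a final per-method configuration combining the best-scoring settings) can be sketched as follows. The parameter names are invented stand-ins for those of Table 1, and evaluate() is a stub standing in for the real ARFF generation and WEKA training/evaluation run.

```python
# Sketch of the derived-configuration sweep: toggle one flag at a time
# from the basic configuration, then keep, per method, each setting that
# beat the basic accuracy. Parameter names and evaluate() are placeholders.

BASIC = {"lemmatize": False, "stem": False, "keep_only_nouns": True,
         "use_hashtags": True, "use_author_tags": True}

def evaluate(config, method):
    # Stand-in for building the ARFF file and running the classifier;
    # a real run would return the measured accuracy for this method.
    return sum(1 for v in config.values() if v) / len(config)

def derived_configs(basic):
    # One derived configuration per parameter, with only that flag toggled.
    for param in basic:
        cfg = dict(basic)
        cfg[param] = not cfg[param]
        yield param, cfg

def best_combined(basic, method):
    # Combine, for each parameter, the setting that scored best on its own.
    base_acc = evaluate(basic, method)
    combined = dict(basic)
    for param, cfg in derived_configs(basic):
        if evaluate(cfg, method) > base_acc:
            combined[param] = cfg[param]
    return combined

for method in ["Ibk", "ComplementNaiveBayes", "NaiveBayesMultinomial",
               "RandomCommittee", "SMO"]:
    print(method, best_combined(BASIC, method))
```

As the text notes, combining per-parameter winners this way is a heuristic: it assumes the parameters affect accuracy independently, which the rightmost configurations in Figures 1 and 2 suggest does not always hold.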