The same measure of STS is used for both Task 1 and Task 2; however, the algorithm for Task 1 is simpler and consists of only the first two steps: Preprocessing and Semantic Textual Similarity. The steps are outlined below and are expanded on in subsequent subsections.

1. Preprocessing - text is standardized.

2. Semantic Textual Similarity - the STS between two chunks or two sentences is computed. This is the final step for Task 1.

3. Chunk Alignment - align each chunk of one sentence to a chunk in the other sentence. If no chunks are similar then no alignment (NOALI) is assigned.

4. Alignment Reasoning - assign a label to each aligned chunk pair.

5. Alignment Scoring - assign an alignment score on a 0-5 scale.

2.2 Semantic Textual Similarity (STS)

STS is computed in the same way for both tasks; however, it is computed between two sentences for Task 1 and between two chunks for Task 2. While describing the computation of STS we refer to chunks; for Task 1 a sentence can be conceptualized as a chunk. VRep's STS computation is shown in Equation (1) and is similar to the methods described by NeRoSim (Banjade et al., 2015) and Ştefănescu et al. (2014). chunkSim takes two chunks (c1, c2) as input and computes the weighted sum of maximum word-to-word similarities, sim(wi, wj): for each word of c1, sim(wi, wj) is computed against every word of c2, and the maximum is added to a running sum.

\[
\mathrm{chunkSim}(c_1, c_2) = \frac{\sum_{i=1}^{n} \max_{j=1}^{m} \mathrm{sim}(w_i, w_j)}{\min(n, m)} \qquad (1)
\]

where c1 and c2 are two chunks, n and m are the number of words in c1 and c2, wi is word i of c1, and wj is word j of c2.
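To make Equation (1) concrete, the following Python fragment is a minimal sketch of chunkSim. It assumes chunks are given as lists of word strings and that a word-to-word similarity function sim is supplied by the caller (in VRep this is a WordNet-based relatedness measure); the names are illustrative, not taken from the VRep implementation.

    def chunk_sim(c1, c2, sim):
        """Equation (1): for each word of c1, take the best similarity
        against any word of c2, then normalize by the length of the
        shorter chunk.

        c1, c2 -- chunks as lists of word strings
        sim    -- callable mapping a word pair to a similarity in [0, 1]
        """
        if not c1 or not c2:
            return 0.0
        total = sum(max(sim(wi, wj) for wj in c2) for wi in c1)
        return total / min(len(c1), len(c2))

For Task 1 the two sentences are treated as single chunks and, as noted in section 4.1, the resulting value is scaled by 5 to fit the 0-5 STS scale.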
It is interesting to note that there is no classifier for the REL class. The data set was heavily skewed towards the EQUI class, which made up 60% of the total data, leaving a small percentage to be divided among the remaining 5 classes, with just around 5% being REL. With a larger training set we would expect rules for the REL class to be learned.

For Task 2, we report results for the Gold Chunks scenario (the data is pre-chunked). Each data set is evaluated using the F1 score in four categories:

(Ali) Alignment - F1 score of the chunk alignment
(Type) Alignment Type - F1 score of the alignment reasoning
(Score) Alignment Scoring - F1 score of the alignment scoring
(Typ+Scor) Alignment Type and Score - a combined F1 score of alignment reasoning and scoring

F1 scores range from 0.0 to 1.0, with 1.0 being the best score. Data sets are available online[7] and the evaluation metrics are described in more detail in the competition summary (Agirre et al., 2015).

[5] http://alt.qcri.org/semeval2016/task1/data/uploads/sts2016-english-with-gs-v1.0.zip
[6] http://alt.qcri.org/semeval2016/task1/
[7] http://alt.qcri.org/semeval2016/task2/data/uploads/test_goldstandard.tar.gz
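As a rough illustration of the Ali category, chunk alignment can be scored as a standard F1 over predicted versus gold chunk-pair alignments. This is a simplification, not the official evaluation script, and the set-of-index-pairs representation is our own assumption.

    def alignment_f1(gold_pairs, pred_pairs):
        """Illustrative F1 over sets of aligned chunk pairs, e.g.
        {(1, 2), (3, 3)} meaning chunk 1 of sentence 1 aligns to
        chunk 2 of sentence 2, and so on. The official scorer also
        handles types, scores, and partial credit."""
        if not gold_pairs or not pred_pairs:
            return 0.0
        correct = len(gold_pairs & pred_pairs)
        precision = correct / len(pred_pairs)
        recall = correct / len(gold_pairs)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)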
Data set / metric     Baseline   VRep
Answers-Students
  F1 Ali              0.8203     0.7723
  F1 Type             0.5566     0.5249
  F1 Score            0.7464     0.7014
  F1 Typ+Scor         0.5566     0.5226
Headlines
  F1 Ali              0.8462     0.8908
  F1 Type             0.5462     0.6015
  F1 Score            0.7610     0.8027
  F1 Typ+Scor         0.5461     0.5964
Images
  F1 Ali              0.8556     0.8539
  F1 Type             0.4799     0.5516
  F1 Score            0.7456     0.7651
  F1 Typ+Scor         0.4799     0.5478

Table 2: Results of VRep on SemEval 2016 Task 2 test data.
4 Component Analysis

In this section the contributions of system components and possible additions are evaluated. VRep can be split into two major components, STS and Alignment Reasoning, each of which has its own evaluation criteria. Data used for this section comes from the SemEval 2015 Task 1[8] training data and the SemEval 2016 Task 2[9] training data. Two-tailed p-values are shown in Tables 3 and 4.

[8] http://ixa2.si.ehu.es/stswiki/images/2/21/STS2015-en-rawdata-scripts.zip
[9] http://alt.qcri.org/semeval2016/task2/data/uploads/train_2015_10_22.utf-8.tar.gz

4.1 STS Component Analysis

Pearson correlation coefficients of the STS scores are used as the evaluation metric for Task 1, and the F1 Ali score for Task 2; together these evaluate the STS portion of VRep.
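For the Task 1 metric, a minimal sketch using SciPy (the score lists below are illustrative placeholders; note that pearsonr's second return value is a two-tailed p-value against the null hypothesis of no correlation, not the component-comparison p-values reported in Tables 3 and 4):

    from scipy.stats import pearsonr

    # system_scores: STS values produced for each sentence pair
    # gold_scores:   the 0-5 gold standard annotations
    system_scores = [4.2, 1.0, 3.7, 2.5]
    gold_scores = [4.0, 0.8, 3.5, 3.0]

    r, p = pearsonr(system_scores, gold_scores)
    print(f"Pearson r = {r:.4f}")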
Tables 3 and 4 show the effects of adding a component to the Basic system. Each component and the Basic system are described below:

1. As a baseline, a Basic system which only applies Equation (1) is used. For Task 1 the result is scaled by 5. For Task 2 each chunk is aligned with the chunk with the highest chunkSim. No thresholding or preprocessing is performed.

2. Threshold adds a threshold to sim(wi, wj) in Equation (1). A modest threshold of 0.4 was used here; the optimum threshold of 0.9 used in the final system was found with the system as a whole, and we did not perform a grid search to optimize the threshold for these component tests.

3. Stop Removal adds stop word removal as described in subsection 2.1.

4. Levenshtein modifies sim(wi, wj) for words not in WordNet. Rather than using a binary value for exact string matching, the Levenshtein measure shown in Equation (2) is used. This allows for slight differences in spelling, plurality, tenses, etc. The measure requires a threshold parameter β, which limits the maximum Levenshtein distance δ and scales the measure between 0.0 and 1.0; β = 2.0 was found via a grid search to perform best.

\[
\mathrm{LevenshteinMeasure} = \begin{cases} \frac{\beta - \delta}{\beta} & \delta < \beta \\ 0 & \delta \ge \beta \end{cases} \qquad (2)
\]

where δ is the Levenshtein distance between the two words and β is the threshold used.

The Levenshtein measure is most likely unnecessary for these tasks because, as stated in section 1, words not in WordNet tend to be proper nouns or abbreviations, for which the spelling is identical, while for short words such as "he" or "she", "is" or "in", even a small edit distance can transform the word into a completely unrelated one. (A sketch of this measure, together with the thresholding of item 2, follows the component list.)
5. Word Sense Disambiguation (WSD) should help to reduce noisy alignments by using the correct synset when computing the Vector relatedness measure. We used the entire sentence (all chunks) as input to SenseRelate::AllWords (Patwardhan et al., 2003). WSD improves results when used as a single component, but when used in combination with a threshold (Threshold + WSD), results are worse than a threshold alone. This is likely because both WSD and thresholding aim to reduce noisy STS scores and chunk alignments; used singly they both achieve this, but in combination WSD errors reduce performance.
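To make the Threshold and Levenshtein components concrete, the sketch below is our own illustration: it implements Equation (2) with a plain dynamic-programming edit distance to stay self-contained, and it reads "adds a threshold to sim(wi, wj)" as zeroing out similarities below the cutoff, which is one plausible interpretation.

    def levenshtein(a, b):
        """Standard edit distance: insertions, deletions, and
        substitutions each cost 1."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def levenshtein_measure(w1, w2, beta=2.0):
        """Equation (2) for words outside WordNet; beta = 2.0
        performed best in the grid search."""
        delta = levenshtein(w1, w2)
        return (beta - delta) / beta if delta < beta else 0.0

    def thresholded_sim(w1, w2, sim, threshold=0.9):
        """Threshold component: weak word-to-word similarities are
        treated as 0 before entering Equation (1). 0.9 is the
        final-system value; 0.4 was used in the component test."""
        s = sim(w1, w2)
        return s if s >= threshold else 0.0

For example, levenshtein_measure("he", "she") returns 0.5 even though the words are unrelated, which illustrates why the measure can hurt on short words.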
               Pearson r   p-value   significant
Basic          0.6540      -         -
Threshold      0.7263      <0.0001   yes
Stop Removal   0.7349      <0.0001   yes
Levenshtein    0.6586      0.1445    no

Table 3: The effects of additional components to the core VRep system on Task 1.

                  F1 Ali    p-value   significant
Basic             0.7508    -         -
Threshold         0.8812    <0.0001   yes
Stop Removal      0.7511    0.9920    no
Levenshtein       0.7524    0.7795    no
WSD               0.7799    <0.0001   yes
Threshold + WSD   0.8216    <0.0001   yes

Table 4: The effects of additional components to the core VRep system on Task 2.
Analysis of the test data indicated that the addition of these extra components was unnecessary; however, to further analyze their contributions, three runs were submitted for both Tasks 1 and 2. Run 1 used the basic system, run 2 eliminated the stop removal preprocessing step, and run 3 used the basic system with the Levenshtein measure described above. Test results were mixed and data set dependent; see the respective competition summaries (Agirre et al., 2016) for complete results.

4.2 Alignment Reasoning Component Analysis

For alignment reasoning, only the assignment of a label (Type) to a chunk pair is evaluated. We used the gold standard alignments provided for each data set, converted each gold standard chunk pair to the entire set of 72 features, and tested multiple classification algorithms. All classifiers are WEKA (Hall et al., 2009) implementations; results are shown in Table 5. The baseline score is calculated by simply assigning the most common class, EQUI. (A rough sketch of this protocol follows Table 5.)

                 Answers   Headlines   Images
Baseline         67.9      61.4        54.2
Naive Bayes      23.9      41.5        35.1
Bayes Net        49.3      58.9        51.3
SMO              69.6      65.7        54.9
Decision Table   67.5      65.7        55.8
J48              66.1      64.9        52.7
Random Forest    68.4      65.8        53.6
JRip             68.9      65.4        56.2

Table 5: The performance of different classifiers on alignment reasoning.
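The classifier comparison itself was run in WEKA; the sketch below is a rough Python analogue of the protocol, with scikit-learn standing in for WEKA and a random placeholder matrix standing in for the real 72 chunk-pair features. The label set shown is the iSTS tag set, and the class proportions loosely mirror the skew described earlier (60% EQUI, roughly 5% REL); the remaining proportions are made up for illustration.

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((500, 72))  # placeholder for the 72 chunk-pair features
    y = rng.choice(["EQUI", "SPE1", "SPE2", "OPPO", "SIMI", "REL"],
                   size=500, p=[0.60, 0.08, 0.08, 0.08, 0.11, 0.05])

    classifiers = {
        "Baseline": DummyClassifier(strategy="most_frequent"),  # always EQUI
        "Naive Bayes": GaussianNB(),
        "Random Forest": RandomForestClassifier(),
    }
    for name, clf in classifiers.items():
        f1 = cross_val_score(clf, X, y, scoring="f1_weighted", cv=10).mean()
        print(f"{name}: {f1:.3f}")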
5 Conclusions and Future Work

In future iterations, more analysis should be done to refine the features used in classification. Using JRip and other analysis criteria we can see why certain features are discriminative, and develop more informative features.

Rather than relying solely on the Levenshtein measure for words outside of WordNet, additional metrics, such as word2vec (Mikolov et al., 2013), could be incorporated.

Additional data should be added for training classifiers. The top performing classifier was generated from all data combined, indicating that additional samples are necessary. It is likely that, given more data, topic-specific classifiers will outperform the general classifier we evaluated. Additional data will also help to reduce the class imbalance and will likely result in a set of rules for the REL class.

Since VRep already makes use of WordNet, it could easily be expanded to compete in the polarity subtask by implementing a polarity classifier using SentiWordNet (Baccianella et al., 2010).

6 Acknowledgments

We would like to thank Dr. Bridget McInnes for her advice during the development of VRep, and Uzair Abbasi for his contributions to the initial development of the system.
References

Eneko Agirre, Carmen Banea, et al. 2015. SemEval-2015 Task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), June.

Eneko Agirre, Aitor Gonzalez-Agirre, Inigo Lopez-Gazpio, Montse Maritxalar, German Rigau, and Larraitz Uria. 2016. SemEval-2016 Task 2: Interpretable semantic textual similarity. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, California, June.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.

Rajendra Banjade, Nobal B Niraula, Nabin Maharjan, Vasile Rus, Dan Stefanescu, Mihai Lintean, and Dipesh Gautam. 2015. NeRoSim: A system for measuring and interpreting semantic textual similarity.

William W. Cohen. 1995. Fast effective rule induction. In Twelfth International Conference on Machine Learning, pages 115–123. Morgan Kaufmann.

Christiane Fellbaum. 2005. WordNet and wordnets.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Sakethram Karumuri, Viswanadh Kumar Reddy Vuggumudi, and Sai Charan Raj Chitirala. 2015. UMDuluth-BlueTeam: SVCSTS - a multilingual and chunk level semantic similarity system. SemEval-2015, page 107.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen. 2003. Using measures of semantic relatedness for word sense disambiguation. In Computational Linguistics and Intelligent Text Processing, pages 241–257. Springer.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity: Measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, pages 38–41. Association for Computational Linguistics.

Dan Ştefănescu, Rajendra Banjade, and Vasile Rus. 2014. A sentence similarity method based on chunking and information content. In Computational Linguistics and Intelligent Text Processing, pages 442–453. Springer.