TY - GEN
T1 - A tale of four metrics
AU - Connor, Richard
PY - 2016/1/1
Y1 - 2016/1/1
N2 - In many contexts, similarity in multivariate space must be defined by the correlation of the variables rather than their absolute values. Examples include classic IR measures such as TF/IDF and BM25, client similarity measures based on collaborative filtering, feature analysis of chemical molecules, and biodiversity contexts. In such cases, Cosine similarity is the almost standard choice. More recently, Jensen-Shannon divergence has appeared in a proper metric form, and a related metric, Structural Entropic Distance (SED), has been investigated. A fourth metric, based on a little-known divergence function named Triangular Divergence, is also assessed here. We study the properties of these four metrics in the context of similarity and metric search, and compare and contrast their semantics and performance. Our conclusion is that, although Cosine Distance is an almost automatic choice in this context, Triangular Distance is most likely to be the best choice as a compromise between semantics and performance.
AB - In many contexts, similarity in multivariate space must be defined by the correlation of the variables rather than their absolute values. Examples include classic IR measures such as TF/IDF and BM25, client similarity measures based on collaborative filtering, feature analysis of chemical molecules, and biodiversity contexts. In such cases, Cosine similarity is the almost standard choice. More recently, Jensen-Shannon divergence has appeared in a proper metric form, and a related metric, Structural Entropic Distance (SED), has been investigated. A fourth metric, based on a little-known divergence function named Triangular Divergence, is also assessed here. We study the properties of these four metrics in the context of similarity and metric search, and compare and contrast their semantics and performance. Our conclusion is that, although Cosine Distance is an almost automatic choice in this context, Triangular Distance is most likely to be the best choice as a compromise between semantics and performance.
UR - http://www.scopus.com/inward/record.url?scp=84989834824&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-46759-7_16
DO - 10.1007/978-3-319-46759-7_16
M3 - Conference contribution
AN - SCOPUS:84989834824
SN - 9783319467580
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 210
EP - 217
BT - Similarity Search and Applications - 9th International Conference, SISAP 2016, Proceedings
A2 - Schubert, Erich
A2 - Houle, Michael E.
A2 - Amsaleg, Laurent
PB - Springer-Verlag
T2 - 9th International Conference on Similarity Search and Applications, SISAP 2016
Y2 - 24 October 2016 through 26 October 2016
ER -