A multivariate correlation distance for vector spaces

Richard Connor*, Robert Moss

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

We investigate a distance metric (SED), previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than the magnitude, of the input vectors, in a manner similar to Cosine Distance. In this paper the metric is defined and then assessed, in comparison with Cosine Distance, for its major properties: semantics, suitability for use within similarity search, and evaluation efficiency. We find that it is fairly well correlated with Cosine Distance in dense spaces, but its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is backed up by another experiment over MovieLens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite which allows it to be evaluated more efficiently. Overall, when a multivariate correlation metric is required over positive vectors, SED seems to be a better choice than Cosine Distance in many circumstances.
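As a rough illustration of what "based on the dimensional correlation, rather than magnitude" means, the sketch below compares Cosine Distance with Euclidean Distance on small positive vectors. It does not implement SED, whose definition is not reproduced in this record. It assumes the common convention that Cosine Distance is one minus the cosine similarity (other conventions, such as angular distance, exist), and the function names and sample vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity (assumed convention; not the paper's SED)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Ordinary straight-line distance, which is sensitive to magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical sample vectors: v is u scaled by 10, w reverses u's profile.
u = [1.0, 2.0, 3.0]
v = [10.0, 20.0, 30.0]
w = [3.0, 2.0, 1.0]

print("cosine(u, v)    =", cosine_distance(u, v))
print("euclidean(u, v) =", euclidean_distance(u, v))
print("cosine(u, w)    =", cosine_distance(u, w))
```

Here `u` and `v` point in the same direction, so their Cosine Distance is zero up to floating-point rounding even though their Euclidean Distance is large, while `u` and `w` have similar magnitudes but a different dimensional profile and therefore a non-zero Cosine Distance.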

Original language: English
Title of host publication: Similarity Search and Applications - 5th International Conference, SISAP 2012, Proceedings
Pages: 209-225
Number of pages: 17
Publication status: Published - 3 Sept 2012
Event: 5th International Conference on Similarity Search and Applications, SISAP 2012 - Toronto, ON, Canada
Duration: 9 Aug 2012 to 10 Aug 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7404 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 5th International Conference on Similarity Search and Applications, SISAP 2012
Country/Territory: Canada
City: Toronto, ON
Period: 9/08/12 to 10/08/12

Keywords

  • cosine distance
  • distance metric
  • multivariate correlation
  • similarity search
  • vector space
