Abstract
Searching for similar strings is an important and frequent database task, both in terms of human interactions and in absolute worldwide CPU utilisation. A wealth of metric functions for string comparison exists. However, compared with the wide range of classification and other techniques known for vector spaces, such metrics support only a very restricted range of techniques. To counter this restriction, various strategies have been used for mapping string spaces into vector spaces, approximating the string distances within the mapped space and therefore allowing vector space techniques to be used.
In previous work we have developed a novel technique for mapping metric spaces into vector spaces, which can therefore be applied for this purpose. In this paper we evaluate this technique in the context of string spaces, and compare it to other published techniques for mapping strings to vectors. We use a publicly available English lexicon as our experimental data set, and test two different string metrics over it for each vector mapping. We find that our novel technique considerably outperforms previously used techniques in preserving the actual distance.
Original language | English |
---|---|
Title of host publication | Proceedings of the 27th Italian Symposium on Advanced Database Systems |
Subtitle of host publication | Castiglione della Pescaia (Grosseto), Italy, June 16th to 19th, 2019 |
Editors | Massimo Mecella, Giuseppe Amato, Claudio Gennaro |
Publisher | Sun SITE Central Europe |
Number of pages | 12 |
Publication status | Published - 9 Jul 2019 |
Event | SEBD 2019 27th Italian Symposium on Advanced Database Systems, Castiglione della Pescaia, Italy. Duration: 17 Jun 2019 → 19 Jun 2019. Conference number: 27. http://sebd2019.isti.cnr.it/ |
Publication series
Name | CEUR Workshop Proceedings |
---|---|
Publisher | Sun SITE Central Europe |
Volume | 2400 |
ISSN (Print) | 1613-0073 |
Workshop
Workshop | SEBD 2019 27th Italian Symposium on Advanced Database Systems |
---|---|
Abbreviated title | SEBD 2019 |
Country/Territory | Italy |
City | Castiglione della Pescaia |
Period | 17/06/19 → 19/06/19 |
Keywords
- Metric mapping
- n-Simplex projection
- Pivoted embedding
- String
- Jensen-Shannon distance
- Levenshtein distance
Datasets
- Modelling string structure in vector spaces
Dearle, A. (Creator) & Connor, R. (Creator), Bitbucket, 2020
https://bitbucket.org/richardconnor/it_db_conference_2019/src/master/
Dataset