Abstract
Record linking often employs blocking to reduce the computational complexity of full pairwise comparison. A key is formed from a subset of record attributes. Those records with the same key values are blocked together for detailed comparison. Use of a single blocking key fails to detect many true matches if records contain missing values or errors, since only those records with the same key values are compared.
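As an illustration of the idea above, the following is a minimal sketch of single-key blocking, not the implementation used in the work described. The field names, the sample records, and the helper names `block_records` and `candidate_pairs` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_fields):
    """Group record ids by the tuple of values found in key_fields."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        key = tuple(rec.get(field) for field in key_fields)
        blocks[key].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Return every pair of record ids that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: 1, 2 and 3 are intended to be the same person.
records = {
    1: {"forename": "john", "surname": "smith", "birth_year": "1861"},
    2: {"forename": "john", "surname": "smith", "birth_year": "1861"},
    3: {"forename": "john", "surname": "smyth", "birth_year": "1861"},
    4: {"forename": "mary", "surname": "brown", "birth_year": "1870"},
}

pairs = candidate_pairs(block_records(records, ["surname", "birth_year"]))
print(pairs)
```

With the key (surname, birth year), only records 1 and 2 are compared. Record 3 falls into a different block because of the spelling variant, so the detailed comparison never sees it.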
To address missing values, it is common to repeat the matching process using multiple blocking keys, to match records that are identical in a subset of the fields. The presence of erroneous values may be addressed by blocking using key values mapped to a canonical form (e.g. Soundex). However, this does not address other problems such as single-digit transcription errors in dates.
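The multi-pass approach can be sketched as follows. This is an illustrative example only: the `soundex` function is a compact version of standard American Soundex, and the two blocking passes, the field names, and the sample records are assumptions chosen for the example rather than details taken from the work described.

```python
from collections import defaultdict
from itertools import combinations

# Standard Soundex digit groups.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Map a name to its four-character Soundex code."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def pairs_for_key(records, key_fn):
    """One blocking pass: pair up records whose keys are equal."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec)
        if any(part is None for part in key):
            continue  # do not block records together on a missing value
        blocks[key].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records, all intended to refer to the same person.
records = {
    1: {"forename": "john", "surname": "smith", "birth_year": "1861"},
    2: {"forename": "john", "surname": "smyth", "birth_year": "1861"},
    3: {"forename": "john", "surname": "smith", "birth_year": None},
    4: {"forename": "jon", "surname": "smith", "birth_year": "1867"},
}

# Two passes, each using a different (assumed) blocking key.
passes = [
    lambda r: (r["forename"], r["surname"]),
    lambda r: (soundex(r["surname"]), r["birth_year"]),
]

pairs = set()
for key_fn in passes:
    pairs |= pairs_for_key(records, key_fn)
print(sorted(pairs))
```

In this example the first pass recovers the record with the missing birth year and the second recovers the spelling variant. Record 4, which differs by one letter in the forename and one digit in the year, shares no key with the others and is never compared.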
Blocking is used to categorise records that are candidate matches, in preparation for a pairwise comparison phase which may use various distance metrics, depending on the domain of the values being compared. Each blocking process defines a partition of records. The comparison operations are only applied to pairs of records within the same category.
In some contexts, it may be useful to have flexible control over the precision/recall trade-off, depending on the intended use for the matched data, and the degree of conservatism required of the identified links. With blocking, this flexibility is limited by the number of sensible blocking keys that can be identified.
In this talk, we describe experiments with a technique based on similarity searching over metric spaces, which appears to offer greater flexibility, and describe some preliminary results using an historic Scottish dataset.
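To make the idea of similarity search over a metric space more concrete, here is a minimal sketch of a threshold (range) query. It is not the technique evaluated in the talk: the choice of Levenshtein distance summed over fields, the treatment of missing values as empty strings, the single-pivot index `PivotIndex`, and the sample records are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings, which is a proper metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


FIELDS = ("forename", "surname", "birth_year")


def record_distance(r, s):
    """Sum of per-field edit distances; a sum of metrics is also a metric.
    Missing values are treated as empty strings (an assumption)."""
    return sum(levenshtein(r.get(f) or "", s.get(f) or "") for f in FIELDS)


class PivotIndex:
    """Single-pivot index that prunes using the triangle inequality."""

    def __init__(self, records, pivot_id):
        self.records = records
        self.pivot = records[pivot_id]
        self.to_pivot = {rec_id: record_distance(self.pivot, rec)
                         for rec_id, rec in records.items()}

    def range_query(self, query, threshold):
        """Return ids of all records within threshold of the query."""
        dq = record_distance(query, self.pivot)
        hits = []
        for rec_id, rec in self.records.items():
            # |d(q,p) - d(p,r)| <= d(q,r), so a larger gap rules r out
            # without computing d(q,r).
            if abs(dq - self.to_pivot[rec_id]) > threshold:
                continue
            if record_distance(query, rec) <= threshold:
                hits.append(rec_id)
        return hits


# Hypothetical records; 1 to 4 are intended to be the same person.
records = {
    1: {"forename": "john", "surname": "smith", "birth_year": "1861"},
    2: {"forename": "john", "surname": "smyth", "birth_year": "1861"},
    3: {"forename": "john", "surname": "smith", "birth_year": None},
    4: {"forename": "jon", "surname": "smith", "birth_year": "1867"},
    5: {"forename": "mary", "surname": "brown", "birth_year": "1870"},
}

index = PivotIndex(records, pivot_id=5)
query = records[1]
for threshold in (1, 2, 4):
    print(threshold, index.range_query(query, threshold))
```

Raising the threshold admits progressively more distant candidates, which gives a single continuous control over the precision/recall trade-off instead of a fixed set of blocking keys. The pivot only avoids some distance computations; it does not change which records are returned.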
Original language | English |
---|---|
Publication status | Published - 2 Apr 2017 |
Event | UK Administrative Data Research Network Annual Research Conference: Social science using administrative data for public benefit - Royal College of Surgeons, Edinburgh, United Kingdom. Duration: 1 Jun 2017 → 2 Jun 2017. http://www.adrn2017.net |
Conference
Conference | UK Administrative Data Research Network Annual Research Conference |
---|---|
Abbreviated title | ADRN2017 |
Country/Territory | United Kingdom |
City | Edinburgh |
Period | 1/06/17 → 2/06/17 |
Internet address | http://www.adrn2017.net |
Keywords
- record linkage
Fingerprint
Dive into the research topics of 'Record linking using metric space similarity search'. Together they form a unique fingerprint.
Projects
- 2 Finished
- Administrative Data Research Centres: ESRC - Admin Data Service - Scottish Consortium
  Kirby, G. N. C. (PI)
  1/11/13 → 31/10/18
  Project: Standard
- Digitising Scotland: Digitising Scotland
  Kirby, G. N. C. (PI)
  Economic & Social Research Council
  1/09/12 → 31/10/14
  Project: Standard