Abstract
Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is a promising way to meet these requirements. Current approaches to such synthetic data generation, however, work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and the variability of the generated data is insufficiently configurable.
In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we show how duplicate semantics can be defined for the domain of road traffic management. A discussion of lessons learned concludes the paper.
Original language | English |
---|---|
Title of host publication | Proceedings of the 4th International Workshop on Data Quality in Integration Systems in conjunction with DASFAA 2011 |
Number of pages | 12 |
Publication status | Published - 2011 |
Fields of science
- 102 Computer Sciences
- 102015 Information systems