TY - GEN
T1 - Automating Data Quality Monitoring with Reference Data Profiles
AU - Ehrlinger, Lisa
AU - Werth, Bernhard
AU - Wöß, Wolfram
PY - 2023
Y1 - 2023
N2 - Data quality is of central importance for the qualitative evaluation of decisions taken by AI-based applications. In practice, data from several heterogeneous data sources is integrated, but complete, global domain knowledge is often not available. In such heterogeneous scenarios, it is particularly difficult to monitor data quality (e.g., completeness, accuracy, timeliness) over time. In this paper, we formally introduce a new data-centric method for automated data quality monitoring, which is based on reference data profiles. A reference data profile is a set of data profiling statistics that is learned automatically to model the target quality of the data. In contrast to most existing data quality approaches that require domain experts to define rules, our method can be fully automated from initialization to continuous monitoring. This data-centric method has been implemented in our data quality tool DQ-MeeRKat and evaluated with six real-world telematic device data streams.
UR - https://www.scopus.com/pages/publications/85172682110
U2 - 10.1007/978-3-031-37890-4_2
DO - 10.1007/978-3-031-37890-4_2
M3 - Conference proceedings
SN - 9783031378898
VL - 1860
T3 - Communications in Computer and Information Science
SP - 24
EP - 44
BT - Data Management Technologies and Applications - 10th International Conference, DATA 2021, and 11th International Conference, DATA 2022, Revised Selected Papers
A2 - Cuzzocrea, Alfredo
A2 - Gusikhin, Oleg
A2 - Hammoudi, Slimane
A2 - Quix, Christoph
PB - Springer Nature Switzerland
CY - Cham, Switzerland
ER -