Student: Alexander Gindlhumer (2021)
Supervisor: a.Univ.-Prof. DI Dr. Wolfram Wöß
Co-Supervisor: DI Lisa Ehrlinger, BSc
Motivation and Challenges
Data quality (DQ) is currently perceived as the greatest challenge in operative data management. To ensure high-quality query and data analytics results, it is necessary to measure and know the quality of the data in use. This can be achieved by continuously observing whether the quality of the data in an information system still conforms to defined standards. Defining such standards (i.e., the desired qualitative condition of the data) is usually a manual task performed by domain experts. An automatically generated reference data profile would be a good starting point for automating this task: it provides the domain expert with an initial profile that can be verified and adjusted where necessary. Such a DB reference profile should represent the “desired” or “normal” condition of the data, for example, statistics like the mean and standard deviation for each attribute.
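As a rough illustration, such a per-attribute reference profile could be generated with a few lines of Python. The table, attribute names, and profile fields (missing_rate, distinct_count, mean, std) below are assumptions for the sketch, not part of the thesis specification.

```python
import json
import statistics

# Hypothetical column data extracted from a relational table (names illustrative).
values = {"age": [23, 35, 41, 29, 35], "zip": ["4040", "4020", None, "4040", "1010"]}

profile = {}
for attr, col in values.items():
    non_null = [v for v in col if v is not None]
    entry = {
        "missing_rate": 1 - len(non_null) / len(col),
        "distinct_count": len(set(non_null)),
    }
    # Numeric statistics only make sense for numeric attributes.
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        entry["mean"] = statistics.mean(non_null)
        entry["std"] = statistics.stdev(non_null)
    profile[attr] = entry

print(json.dumps(profile, indent=2))
```

Storing the profile as JSON, as sketched here, is only one of the candidate formats the thesis concept would have to evaluate.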
Example data quality measures that should be calculated against the data profile:
- Missing value detection (e.g., % of NA or default values per column/record/table)
- Duplicate detection (e.g., number of distinct/equal entries in column or similar records)
- Outlier detection (e.g., outlying values in a column: if the usual values lie between 1 and 100, a single entry of “13,000,000” is an outlier)
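The three measures above can be sketched as checks of a column against a stored reference profile. The 3-sigma criterion and the profile values used here are assumptions for illustration, not the thesis' prescribed method.

```python
# Assumed stored profile values for one numeric column.
reference = {"mean": 50.0, "std": 25.0}
column = [12, 47, 88, 3, 13_000_000, 47]

# Missing value detection: count of NULL entries.
missing = sum(1 for v in column if v is None)

# Duplicate detection: entries beyond the first occurrence of each value.
duplicates = len(column) - len(set(column))

# Outlier detection: values more than 3 standard deviations from the mean.
outliers = [v for v in column
            if v is not None and abs(v - reference["mean"]) > 3 * reference["std"]]

print(missing, duplicates, outliers)  # prints: 0 1 [13000000]
```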
Objective
The aim of this thesis is to develop a concept of what such a reference profile could look like, which information it should contain, and in which format it should ideally be stored. In addition, a program should be implemented that automatically generates such a reference data profile from a relational DB. In a follow-up work, it should then be possible to measure the quality of a DB by calculating statistics on how far the current data deviates from this stored reference profile.
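The follow-up idea can be sketched as comparing a freshly computed profile against the stored one. The relative-deviation metric below is one possible choice and purely an assumption.

```python
# Assumed stored profile and a hypothetical profile of the current data.
stored = {"age": {"mean": 32.6, "std": 6.84}}
current = {"age": {"mean": 45.1, "std": 7.10}}

# Relative deviation per attribute and statistic; large values signal drift
# away from the "normal" condition captured in the reference profile.
deviation = {
    attr: {
        stat: abs(current[attr][stat] - stats[stat]) / abs(stats[stat])
        for stat in stats
    }
    for attr, stats in stored.items()
}

print(deviation)
```

A quality monitor could then flag attributes whose deviation exceeds a domain-expert-defined threshold.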