Institute for Application-oriented Knowledge Processing

Data Quality Measurement for NoSQL Databases: Document stores – MongoDB

Student: Christoph Pachner (Start: 2021)

Supervisor: a.Univ.-Prof. DI Dr. Wolfram Wöß
Co-Supervisor: DI Lisa Ehrlinger, BSc

Motivation and Challenges

Data quality measurement is crucial for estimating the significance of data analysis results and of the decisions based on them. The primary information source for personal decisions is the Internet, whereas large enterprises usually process data from several historically grown, heterogeneous information systems. In both cases, the available data must be integrated before decision-making so that the content of the individual sources can be compared and measured.

A Java-based system has been developed at our institute that analyzes different information sources and calculates metrics to estimate a system's data quality. Currently, it can assess and compare MySQL databases, CSV files, and ontological schema descriptions. Schema heterogeneity is resolved by transforming the schemas of the different information sources into a unified form using the DSD ("Data Source Description") vocabulary. However, an in-depth investigation of the most widely used NoSQL databases is still missing.

Objective

The main objective of this master's thesis is to evaluate data quality measurement for document stores on the basis of a practical implementation using MongoDB. MongoDB is a NoSQL database that stores data in JSON-like documents to allow maximal schema flexibility. The practical work includes (1) the transformation of a MongoDB schema into a DSD representation in order to achieve comparability, and (2) an evaluation of the implementation by comparing a MongoDB schema to relational data (e.g., a MySQL database).
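
Because a document store does not enforce a fixed schema, step (1) presumably requires deriving a schema from the stored documents first. The following is a minimal sketch of that idea in Python, assuming documents are available as plain dictionaries. The function name `infer_schema` and the output structure are hypothetical and do not reflect the institute's Java system or the actual DSD vocabulary.

```python
from collections import defaultdict


def infer_schema(documents):
    """Collect, per dotted field path, the observed value types and the
    number of documents in which the path occurs.

    Hypothetical helper for illustration only; arrays are recorded as
    type 'list' and their elements are not inspected.
    """
    schema = defaultdict(lambda: {"types": set(), "count": 0})

    def walk(value, path):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            entry = schema[child_path]
            entry["types"].add(type(child).__name__)
            entry["count"] += 1
            # Descend into embedded documents to capture nested fields.
            if isinstance(child, dict):
                walk(child, child_path)

    for doc in documents:
        walk(doc, "")
    return dict(schema)


# Made-up sample documents with varying structure.
docs = [
    {"_id": 1, "name": "Ada", "address": {"city": "Linz"}},
    {"_id": 2, "name": "Bob", "age": 42},
    {"_id": 3, "name": None, "age": "unknown"},
]
schema = infer_schema(docs)
for path, info in sorted(schema.items()):
    print(path, sorted(info["types"]), info["count"])
```

In this sketch, a field such as `age` ends up with two observed types and occurs in only two of three documents. Both situations cannot arise in a typed relational column and would need an explicit representation in any unified schema description.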

The following research questions should be answered in the course of this thesis:

  • Is it possible to achieve direct comparability in terms of data quality (DQ) measurement between document stores and relational data?
  • If not, what are the major obstacles and how could they be addressed?
  • Are there existing approaches for measuring data quality specifically in document stores, and if so, how do they differ from traditional approaches for relational data?
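
To illustrate why direct comparability is not self-evident, consider a simple completeness metric. The sketch below is an assumption made for illustration, not a result of the thesis: it treats a relational row and a document as Python dictionaries and uses the hypothetical helpers `completeness` and `missing_breakdown`. In a relational table a value can only be missing as NULL, whereas in a document the field itself may be absent.

```python
def completeness(records, field):
    """Fraction of records in which `field` is present and not None."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def missing_breakdown(records, field):
    """Distinguish absent fields from fields explicitly set to None."""
    absent = sum(1 for r in records if field not in r)
    null = sum(1 for r in records if field in r and r[field] is None)
    return {"absent": absent, "null": null}


# Made-up data: every relational row has the column, possibly NULL.
relational_rows = [
    {"id": 1, "email": "a@example.org"},
    {"id": 2, "email": None},
    {"id": 3, "email": None},
]
# Made-up data: a document may omit the field entirely.
documents = [
    {"_id": 1, "email": "a@example.org"},
    {"_id": 2, "email": None},
    {"_id": 3},
]

print(completeness(relational_rows, "email"), missing_breakdown(relational_rows, "email"))
print(completeness(documents, "email"), missing_breakdown(documents, "email"))
```

Both sources yield the same completeness score here, yet the breakdown differs. Whether an absent field should count as a missing value or as a structural difference is the kind of decision a comparable metric would have to make explicit.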