Spooq: A Software Library for ETL Processes in Data Lakes

  • David Hohensinn

Research output: Thesis (Master's / Diploma thesis)

Abstract

Implementing ETL processes in data lakes is complex because of heterogeneous open-source software environments, the use of unstructured data, and the schema-on-read principle. As a result, developing data pipelines takes more effort than in traditional data warehouses, which can rely on years of standards and best practices. The increased development effort affects the duration and quality of data integration projects and can even lead to missed business opportunities. This master's thesis presents the implementation of the software library Spooq, which supports data engineers in designing ETL data pipelines in data lakes. The package is based on Apache Spark, which is included in most data lake environments, such as a local Cloudera Hadoop distribution or the cloud-based Azure HDInsight service. It facilitates testing and documentation and thus enhances the quality of data pipelines. By abstracting Spark’s low-level functions, the library allows data engineers to focus on business logic rather than software code. The use of Spooq results in reduced development effort for data pipelines.
Original language: English
Supervisors/Reviewers
  • Schrefl, Michael, Supervisor
  • Neumayr, Bernd, Co-supervisor
Publication status: Published - Jan 2021

Fields of science

  • 102 Computer Sciences
  • 102010 Database systems
  • 102015 Information systems
  • 102016 IT security
  • 102025 Distributed systems
  • 102027 Web engineering
  • 102028 Knowledge engineering
  • 102030 Semantic technologies
  • 102033 Data mining
  • 102035 Data science
  • 502050 Business informatics
  • 503008 E-learning

JKU Focus areas

  • Digital Transformation
