Abstract
Data is everywhere, and the more we digitalize our lives and work, the more data our actions generate and reveal about us. Capturing, cleaning, storing and analysing this data has become a vital task for many companies. They do so not to be unique, but to gain an edge over their competitors, or simply to compete with them, in a market far more complex and globalized than that of the last century. The field of business intelligence aims to address these challenges. Without solid data sources, however, the resulting insights are incomplete at best and, in the worst case, devastating for a company.
Large amounts of data are typically stored in an organized way, most often in a data warehouse or its popular, less-structured counterpart, the data lake. A cornerstone of effective data storage is an efficient process for loading data from various source systems into it. The standard process for this, which comes in many variations, adaptations and commercial tools, is ETL, short for Extract, Transform, Load. It is a powerful set of techniques, built on decades of experience, to continuously collect and store data as efficiently as possible and to enable meaningful analytics on the captured and processed data.
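To make the three stages concrete, the following minimal sketch shows an ETL run in Python; the source file, column names and SQLite target are purely hypothetical and serve only to illustrate the Extract, Transform, Load pattern described above.

```python
# Minimal illustrative ETL sketch: extract rows from a hypothetical CSV export,
# apply a simple transformation, and load them into a SQLite table standing in
# for a warehouse staging area.
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    # Extract: read raw records from the source system's CSV export.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    # Transform: normalize field names and types before loading.
    return [(r["customer_id"], r["name"].strip().title(), float(r["revenue"]))
            for r in rows]

def load(rows: list[tuple], db_path: str) -> None:
    # Load: write the cleaned records into the target staging table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS stg_customers "
                "(customer_id TEXT, name TEXT, revenue REAL)")
    con.executemany("INSERT INTO stg_customers VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("customers.csv")), "warehouse.db")
```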
Effective ETL processing is laborious to set up, configure and maintain, even when supported by software frameworks and specialized tools. Especially when new data sources are added to an existing system, non-automated or hard-to-automate tasks such as credential management, data mappings, necessary transformations and code scripting consume a lot of time and can be tedious to perform regularly.
While this thesis does not aim to, nor can it, solve all of the above-mentioned tasks, it lays a solid foundation for automated ETL code generation. Performed manually, this step requires combining a great deal of information from multiple areas, such as the aforementioned mappings and transformations, which is complicated and error-prone. The thesis proposes a way to automate this step within the scope of a given domain and thereby reduce the potential for errors. Each aspect, such as a transformation or a mapping, can be considered independently by the ETL engineer, without having to worry about the complex task of combining them manually.
The system developed in this thesis is composed of multiple components, including a data store for information about the ETL processing and the code generator software. Once fed with the information on how data moves between the sources and the target destination, it automates the code generation step and provides the user with debuggable script files. These basic components may be embedded into a larger software system that handles the ingestion of data into the generator and the execution of the ETL scripts, ultimately aiding the engineer with their ETL processes.
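As a rough illustration of the idea, the sketch below shows how a declarative source-to-target mapping could drive the generation of a plain, debuggable SQL script; the mapping format, table names and generator logic are assumptions made for this example and are not taken from the thesis' implementation.

```python
# Illustrative sketch only: a declarative source-to-target mapping is fed to a
# small generator that emits a standalone SQL script an engineer can inspect
# and debug. All names and the mapping schema are hypothetical.
mapping = {
    "source_table": "crm_customers",       # hypothetical source
    "target_table": "dw.dim_customer",     # hypothetical target
    "columns": [
        {"source": "cust_id", "target": "customer_id", "transform": None},
        {"source": "full_name", "target": "name", "transform": "TRIM"},
    ],
}

def generate_etl_script(m: dict) -> str:
    # Build the SELECT list, applying the declared transformation per column.
    select_parts = []
    for col in m["columns"]:
        expr = col["source"]
        if col["transform"] == "TRIM":
            expr = f"TRIM({expr})"
        select_parts.append(f"{expr} AS {col['target']}")
    select_list = ",\n    ".join(select_parts)
    # Emit a plain SQL script as the debuggable artifact handed to the user.
    return (f"INSERT INTO {m['target_table']} (\n"
            f"    {', '.join(c['target'] for c in m['columns'])}\n"
            f")\nSELECT\n    {select_list}\nFROM {m['source_table']};\n")

if __name__ == "__main__":
    with open("load_dim_customer.sql", "w", encoding="utf-8") as f:
        f.write(generate_etl_script(mapping))
```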
| Original language | English |
|---|---|
| Supervisors/Reviewers | |
| Publication status | Published - 2025 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs):
- SDG 9 Industry, Innovation, and Infrastructure
- SDG 16 Peace, Justice and Strong Institutions
Fields of science
- 102 Computer Sciences
- 102022 Software development
- 102001 Artificial intelligence
- 502007 E-commerce
- 505002 Data protection
- 102010 Database systems
- 102035 Data science
- 102033 Data mining
- 506002 E-government
- 102019 Machine learning
- 102006 Computer supported cooperative work (CSCW)
- 102028 Knowledge engineering
- 102016 IT security
- 202007 Computer integrated manufacturing (CIM)
- 102015 Information systems
- 102025 Distributed systems
- 509018 Knowledge management
- 102014 Information design
JKU Focus areas
- Digital Transformation