Abstract
The schema.org vocabulary was developed by the leading search engine operators to facilitate the semantic annotation of websites in a machine-readable and consistent way. This thesis develops a data profiling approach for analyzing the usage of the schema.org vocabulary. The analysis is based on data from Web Data Commons, which extracts structured data from the Common Crawl corpora. The schema.org hierarchy is embedded into the resulting primary fact tables by means of semantic dimensions. On the basis of the primary fact tables, the variants “Cube” and “Star” (collectively referred to as RDF-Summarization-Cubes) are developed to enable proper analysis of schema.org usage. The proof-of-concept prototype is implemented in PySpark SQL and was executed and tested on a proof-of-concept Hadoop cluster. A first analysis of the usage of the structured data formats Microdata, JSON-LD and RDFa, based on a rather small fragment of the Web Data Commons corpus, shows that the schema.org vocabulary is used most frequently with the JSON-LD format. To demonstrate its scalability, the proof-of-concept prototype was also deployed in the Microsoft Azure cloud using Databricks and executed on a larger fragment of the Web Data Commons corpus.
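
The idea of a semantic dimension can be illustrated with a minimal sketch in plain Python rather than the PySpark SQL used in the thesis. It is not the thesis implementation: the hierarchy fragment is simplified to one parent per type (the real schema.org hierarchy allows several supertypes), the fact rows are invented toy values, and the names `PARENT`, `FACTS`, `ancestors` and `roll_up` are hypothetical. It only shows how counts recorded for a specific type could be rolled up to every supertype, per structured data format.

```python
from collections import defaultdict

# Assumption: a tiny, simplified fragment of the schema.org type hierarchy
# (child -> parent). The real vocabulary is far larger and permits
# multiple supertypes per type.
PARENT = {
    "Thing": None,
    "CreativeWork": "Thing",
    "Article": "CreativeWork",
    "NewsArticle": "Article",
    "Organization": "Thing",
    "LocalBusiness": "Organization",
}

# Assumption: invented toy fact rows (format, schema.org type, entity count).
# These are NOT results from the thesis or from Web Data Commons.
FACTS = [
    ("JSON-LD", "NewsArticle", 3),
    ("JSON-LD", "LocalBusiness", 2),
    ("Microdata", "Article", 4),
    ("RDFa", "Organization", 1),
]


def ancestors(type_name):
    """Return the type itself followed by its supertypes up to the root."""
    chain = []
    current = type_name
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def roll_up(facts):
    """Aggregate counts per (format, type), crediting every supertype too."""
    totals = defaultdict(int)
    for fmt, type_name, count in facts:
        for level in ancestors(type_name):
            totals[(fmt, level)] += count
    return dict(totals)


summary = roll_up(FACTS)
for (fmt, type_name), count in sorted(summary.items()):
    print(f"{fmt:10s} {type_name:15s} {count}")
```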
Original language | German (Austria) |
---|---|
Supervisors/Reviewers | |
Publication status | Published - Nov 2018 |
Fields of science
- 102 Computer Sciences
- 102010 Database systems
- 102015 Information systems
- 102016 IT security
- 102025 Distributed systems
- 102027 Web engineering
- 102028 Knowledge engineering
- 102030 Semantic technologies
- 102033 Data mining
- 502050 Business informatics
- 503008 E-learning
JKU Focus areas
- Computation in Informatics and Mathematics
- Management and Innovation