Modern Data Lake Workshop

Architecture, Analytics, and Real-World Use Cases

Course ID: 42791
Date: 01/12/2025
Time: Daily seminar, 9:00-16:30
Location: John Bryce ECO Tower, Homa Umigdal 29, Tel Aviv

Overview

In this hands-on workshop, you’ll learn the fundamentals of data lakes and why so many organizations are adopting them. We’ll cover how data lakes work, how they compare to traditional databases and big data tools, and what makes them powerful. You’ll build your own data lake from the ground up using an object store, a metastore, a query engine, and analytics tools. With the query engine, you’ll explore and manipulate your data to understand how it flows through the system. We’ll also introduce analytics tools through real-world big data use cases that you can try on your own datasets. As we go deeper, you’ll learn about advanced topics such as Apache Iceberg tables for handling updates and deletes, along with key aspects of managing a data lake: security, best practices, and cost control.

Who Should Attend

  • Data Engineers: Focused on data ingestion, transformation, and management.
  • Developers: Integrating applications with data lakes via APIs and SDKs.
  • Database Administrators (DBAs): Adding a new technology stack or migrating existing databases to a data lake.

Course Contents

Foundations of Data Lakes

  • What is a data lake?
  • Data lakes vs. traditional databases: key differences
  • Why data lakes? Benefits and common use cases
  • Data lake architecture overview
  • Data formats in data lakes: Intro to Apache Parquet

Hands-On Workshop Setup

  • Introduction to Docker
    • Key concepts: containers, images, and tags
    • Essential Docker commands
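
A taste of the commands this module covers, using the workshop’s MinIO image as the example (any image works the same way):

    # Download an image; the tag after ":" pins a version ("latest" is the default)
    docker pull minio/minio:latest

    # Start a detached container, mapping host port 9000 to the container's 9000
    docker run -d --name storage -p 9000:9000 minio/minio server /data

    # List running containers, then locally stored images
    docker ps
    docker images

    # Stop and remove the container when you are done
    docker stop storage && docker rm storage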

Object Store Integration

  • Set up your own object store with MinIO
  • Load data into the object store
  • Explore and browse stored data
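
A minimal sketch of this module’s steps; the credentials, bucket name (datalake), and file name (sales.csv) are placeholders you’ll replace with your own:

    # Run MinIO with its web console on port 9001 (placeholder credentials)
    docker run -d --name minio -p 9000:9000 -p 9001:9001 \
      -e MINIO_ROOT_USER=admin -e MINIO_ROOT_PASSWORD=admin12345 \
      minio/minio server /data --console-address ":9001"

    # Point the MinIO client (mc) at the server, create a bucket, upload a file
    mc alias set local http://localhost:9000 admin admin12345
    mc mb local/datalake
    mc cp sales.csv local/datalake/tier1/sales.csv

    # Browse what landed in the object store
    mc ls --recursive local/datalake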

Query Engine & Metastore

  • Overview: what are a metastore and a query engine?
  • Deploy Hive Metastore and Trino
  • Create and query tables using Trino CLI
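
A sketch of the deployment, assuming the apache/hive image’s standalone-metastore mode, a shared Docker network so the containers can reach each other by name, and a hive catalog file you write yourself (property names can vary between Trino versions):

    # Standalone Hive Metastore on its default port
    docker run -d --name metastore -p 9083:9083 \
      -e SERVICE_NAME=metastore apache/hive:4.0.0

    # catalog/hive.properties (assumed contents: point Trino at the metastore
    # and at MinIO, plus your MinIO access keys):
    #   connector.name=hive
    #   hive.metastore.uri=thrift://metastore:9083
    #   hive.s3.endpoint=http://minio:9000
    #   hive.s3.path-style-access=true

    # Trino with that catalog directory mounted in place
    docker run -d --name trino -p 8080:8080 \
      -v "$PWD/catalog:/etc/trino/catalog" trinodb/trino

    # Open the Trino CLI and run a first statement
    docker exec -it trino trino
    trino> CREATE SCHEMA hive.workshop WITH (location = 's3a://datalake/workshop/');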

Data Transformation & Optimization

  • Convert CSV to Parquet using Trino (Tier 1 → Tier 2)
  • Generate Tier 3 data for a use case
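
The shape of this transformation in Trino SQL, run from the CLI opened above; table, column, and bucket names are illustrative (the Hive connector reads CSV columns as varchar, hence the casts):

    -- Tier 1: expose the raw CSV files already sitting in the object store
    CREATE TABLE hive.workshop.sales_raw (order_id varchar, amount varchar, ts varchar)
    WITH (format = 'CSV', external_location = 's3a://datalake/tier1/');

    -- Tier 2: rewrite them as typed, columnar Parquet in one CTAS statement
    CREATE TABLE hive.workshop.sales WITH (format = 'PARQUET') AS
    SELECT CAST(order_id AS bigint)    AS order_id,
           CAST(amount AS double)      AS amount,
           from_iso8601_timestamp(ts)  AS ts
    FROM hive.workshop.sales_raw;

    -- Tier 3: a small aggregate shaped for one reporting use case
    CREATE TABLE hive.workshop.daily_revenue WITH (format = 'PARQUET') AS
    SELECT date(ts) AS day, sum(amount) AS revenue
    FROM hive.workshop.sales
    GROUP BY date(ts);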

Data Analytics

  • Connect your own analytics tool (e.g., Apache Zeppelin) to the data lake
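
One way to wire Zeppelin up, assuming a recent apache/zeppelin image and Docker Desktop’s host.docker.internal hostname; the properties shown are Zeppelin’s standard JDBC interpreter settings, and the Trino JDBC driver (io.trino:trino-jdbc) must be added as an interpreter dependency:

    # Zeppelin's UI also wants port 8080, so map it to 8082 on the host
    docker run -d --name zeppelin -p 8082:8080 apache/zeppelin:0.11.1

    # In the Zeppelin UI, configure the generic JDBC interpreter:
    #   default.driver = io.trino.jdbc.TrinoDriver
    #   default.url    = jdbc:trino://host.docker.internal:8080/hive/workshop
    #   default.user   = workshop

    # A notebook paragraph can then query the lake directly:
    #   %jdbc
    #   SELECT day, revenue FROM daily_revenue ORDER BY day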

Advanced Topics

  • Apache Iceberg: table format with full CRUD support
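
What “full CRUD” looks like in practice, sketched in Trino SQL against an assumed iceberg catalog (configured like the hive one, but with connector.name=iceberg):

    -- Create an Iceberg table; the data files are still Parquet underneath
    CREATE TABLE iceberg.workshop.customers (id bigint, email varchar, tier varchar)
    WITH (format = 'PARQUET');

    INSERT INTO iceberg.workshop.customers VALUES (1, 'a@example.com', 'free');

    -- Row-level updates and deletes, which plain Hive/Parquet tables lack:
    -- Iceberg records the changes in new files and metadata snapshots
    UPDATE iceberg.workshop.customers SET tier = 'pro' WHERE id = 1;
    DELETE FROM iceberg.workshop.customers WHERE tier = 'free';

    -- Every change is a snapshot, so you can inspect the table's history
    SELECT snapshot_id, committed_at
    FROM iceberg.workshop."customers$snapshots";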

Wrap-Up

  • Summary of key learnings
  • Resources for continuing your data lake journey
