Open-source Snowflake & Fivetran alternative, with Postgres compatibility. Powers AI Agent sandbox environments at https://gettelio.com
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
A curated list of open source tools used in analytics platforms and the data engineering ecosystem
End-to-end Data Lakehouse project built on Databricks, following the Medallion Architecture (Bronze, Silver, Gold). Covers real-world data engineering and analytics workflows using Spark, PySpark, SQL, Delta Lake, and Unity Catalog. Designed for learning, portfolio building, and job interviews.
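As a taste of what that Medallion flow looks like in practice, here is a minimal PySpark/Delta Lake sketch of a Bronze → Silver → Gold pipeline; the database and table names, columns, and source path are illustrative assumptions, not taken from the repository.

```python
# Minimal sketch of a Bronze -> Silver -> Gold flow with PySpark and Delta Lake.
# Table names, columns, and the source path are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("medallion-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

for db in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")

# Bronze: land raw files as-is, adding only ingestion metadata.
bronze = (spark.read.json("s3://landing/orders/")  # hypothetical source path
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: clean and de-duplicate the raw records.
silver = (spark.table("bronze.orders")
          .dropDuplicates(["order_id"])
          .filter(F.col("order_total") >= 0))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: business-level aggregate ready for BI and analytics.
gold = (spark.table("silver.orders")
        .groupBy("customer_id")
        .agg(F.sum("order_total").alias("lifetime_value")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.customer_ltv")
```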
Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
Open-source data framework for biology. Context and memory for datasets and models at scale. Query, trace & validate with a lineage-native lakehouse for bio-formats, registries & ontologies. 🍊YC S22
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
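For a sense of how such a local stack is used, below is a rough Python sketch that queries the Iceberg catalog through Trino's Python client (`pip install trino`); the host, port, catalog, and table names are assumptions about a typical docker-compose setup, not this repo's actual configuration.

```python
# Rough sketch: query a local Iceberg-on-MinIO lakehouse through Trino.
# Connection details and table names are assumptions about a typical compose file.
from trino.dbapi import connect

conn = connect(
    host="localhost",   # Trino coordinator exposed by the compose stack (assumed)
    port=8080,
    user="admin",
    catalog="iceberg",  # Iceberg connector backed by the Hive Metastore + MinIO
    schema="demo",
)
cur = conn.cursor()

# Create, populate, and query an Iceberg table stored in MinIO.
cur.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT, payload VARCHAR)")
cur.execute("INSERT INTO events VALUES (1, 'hello lakehouse')")
cur.execute("SELECT count(*) FROM events")
print(cur.fetchone())
```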
Data Engine for Manual/Algo Trading: Download/Stream -> Clean -> Store. Supports Data Lakehouse Architecture. Clean Once and Forget.
SwiftLake: Java SQL engine built on Apache Iceberg and DuckDB for efficient lakehouse reads and writes
DatAasee - A Metadata-Lake for Libraries
Floe: Policy-based table maintenance for Apache Iceberg
This repository hosts materials for the Data Warehousing course in the Information Systems & Analytics department at Santa Clara University.
My M.Sc. dissertation: a modern data platform using DataOps, Kubernetes, and the cloud-native ecosystem to build a resilient big data platform on a Data Lakehouse architecture, serving as the foundation for Machine Learning (MLOps) and Artificial Intelligence (AIOps).
This project processes Formula 1 racing data, builds an automated data pipeline, and makes the data available for presentation and analysis.
This repo provides a step-by-step approach to building a modern data warehouse using PostgreSQL. It covers the ETL (Extract, Transform, Load) process, data modeling, exploratory data analysis (EDA), and advanced data analysis techniques.
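To illustrate what a single Extract-Transform-Load step against PostgreSQL might look like, here is a minimal psycopg2 sketch; the staging and fact table names, the connection string, and the CSV source are illustrative assumptions, not the repository's actual schema.

```python
# Minimal ETL sketch against PostgreSQL with psycopg2.
# Schema, table names, and the CSV source are illustrative assumptions.
import csv
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl password=etl host=localhost")
cur = conn.cursor()

# Extract: read raw rows from a source file.
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: basic cleansing before rows touch the warehouse schema.
clean = [(r["order_id"], r["customer_id"], float(r["amount"]))
         for r in rows if r["amount"]]

# Load: insert into staging, then upsert into the modeled fact table.
cur.executemany(
    "INSERT INTO staging.sales (order_id, customer_id, amount) VALUES (%s, %s, %s)",
    clean,
)
cur.execute("""
    INSERT INTO dwh.fact_sales (order_id, customer_id, amount)
    SELECT order_id, customer_id, amount FROM staging.sales
    ON CONFLICT (order_id) DO UPDATE SET amount = EXCLUDED.amount
""")
conn.commit()
cur.close()
conn.close()
```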
Complete open-source data platform with Airbyte, Dremio, dbt, and Apache Superset - Documented in 18 languages
🌊 Git-like Version Control for Data with Nessie, Iceberg, and Spark
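The branch-then-merge workflow this project demonstrates can be sketched roughly as below, assuming Nessie's Spark SQL extensions and an Iceberg catalog backed by a local Nessie server; the catalog name, branch names, table, and URI are assumptions, and the repo's own notebooks and docker-compose remain the authoritative setup.

```python
# Rough sketch of "git for data": write to an isolated Nessie branch, then merge
# into main. Catalog name, URI, branch and table names are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("nessie-branching-sketch")
         # Nessie's Spark SQL extensions add branch/merge statements to Spark SQL.
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
                 "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
         .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.nessie.catalog-impl",
                 "org.apache.iceberg.nessie.NessieCatalog")
         .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
         .getOrCreate())

spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.db")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.db.trips (id BIGINT, note STRING) USING iceberg")

# Work on an isolated branch so main is untouched until the data is validated.
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
spark.sql("USE REFERENCE etl IN nessie")
spark.sql("INSERT INTO nessie.db.trips VALUES (1, 'validated later')")

# After checks pass, publish the changes atomically.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```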
A project for creating a local data lakehouse with open-source tools, using Apache Iceberg as the open table format
This project implements an end-to-end tech stack for a data platform, intended for local development.
🚀 Scalable near-real-time data pipeline using Apache Iceberg, Spark, Kafka, and Trino. ACID-compliant JSON ingestion, processing, and analytics. Dockerized for easy deployment. #DataEngineering #DataLake
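A rough sketch of that near-real-time path, assuming Spark Structured Streaming with an Iceberg sink: JSON events are read from a Kafka topic, parsed, and appended to an Iceberg table that Trino can then query. The topic, schema, catalog, and checkpoint paths below are illustrative assumptions, and the Kafka and Iceberg Spark runtime packages must be on the classpath.

```python
# Rough sketch: Kafka -> Spark Structured Streaming -> Iceberg table.
# Requires the spark-sql-kafka and iceberg-spark-runtime packages; names and
# paths below are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("kafka-iceberg-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", LongType()),
    StructField("payload", StringType()),
])

# Source: raw JSON messages from a Kafka topic.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", event_schema).alias("e"))
          .select("e.*"))

# Sink: append into an Iceberg table; each micro-batch is an ACID commit.
query = (parsed.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/events")
         .toTable("lakehouse.db.events"))
query.awaitTermination()
```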