Open Source Data Stack for Small Teams
Tools to build a data stack that will help your company become data-driven
In this blog we will discuss a flexible set of technologies to support the storage, management, and consumption of data. These tools are inspired by the modern data stack (MDS).
Characteristics of MDS for an org that is starting its Data Product journey:
- A low entry barrier to the data product journey.
- Low-code components/tools — Connector/Plugin based system.
- Cloud Native — Born in the cloud
- Powered by SQL — Use of cloud data warehouse or Lakehouse
- Modular & Customizable — Every tool solves one problem (and does it very well), so the tools must be modular enough to work well with each other.
- Should be metadata driven — All aspects of the stack should collect and emit metadata.
- Support for real-time scenarios as well.
- Focus on operations.
- The stack should be collaborative at its center.
- Each component or tool used should have active and large communities for support.
- Choose the right partner — MDS is not a one-size-fits-all solution. Every organization is different and hence needs specific solutions. Speak to people, read what’s out there and join the array of Slack communities.
A Modern data stack can be divided into four layers.
Acquire→ Organize→ Transform → Analyze & Operationalise
Acquire
Getting data into the stack from the source where it is generated is the first step in building a data platform. This data is collected from databases, file systems, SaaS tools, ad platforms, IoT devices, and web and mobile events.
Extraction and Loading
The source data needs to be moved in batch mode and organized in a data lake or lakehouse. This needs to be done from varied sources and in a timely manner. Airbyte is an open source tool to achieve this data integration.
Airbyte / Fivetran
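Airbyte exposes a REST API for managing and triggering syncs. A minimal sketch of kicking off a sync programmatically might look like the following — the host, port, and connection ID are placeholders, not values from a real deployment:

```python
import json
import urllib.request

AIRBYTE_URL = "http://localhost:8000/api/v1"  # placeholder host/port

def build_sync_request(connection_id: str) -> urllib.request.Request:
    """Build a POST request asking Airbyte to run a sync for one connection."""
    payload = json.dumps({"connectionId": connection_id}).encode("utf-8")
    return urllib.request.Request(
        f"{AIRBYTE_URL}/connections/sync",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# "my-postgres-to-lake" is a hypothetical connection ID for illustration.
req = build_sync_request("my-postgres-to-lake")
print(req.full_url)
```

In practice you would send the request with `urllib.request.urlopen(req)` (or let an orchestrator call the API on a schedule) rather than building it by hand.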
Event Tracking/Streaming
An event is a piece of business or technical information. Events happen very fast and need to be consumed in real time or near real time, so the data stack needs a way to process them. Kafka is the de facto streaming platform of choice.
Kafka / Confluent
Organize
After collecting the data, the next step is organizing it so that it is easier to manage and consume. In an MDS we generally have a cloud warehouse, but as things are moving fast towards lakehouse systems, in my opinion we should have a lakehouse platform. To keep the process fast, I would include both warehouse and lakehouse and gradually lean towards one.
Data Warehouse
Druid / Snowflake
Data Lakehouse
Iceberg + Dremio / Databricks Delta
Transform
Once we have organized the data, we need to standardize, validate, or even restructure it. These transformations often delve into data modeling, and they can be done with dbt.
Transformation and Modelling
dbt / Spark
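The core idea of warehouse-native transformation is expressing the cleanup as SQL that builds new tables from raw ones. The sketch below uses sqlite as a stand-in for the warehouse; dbt would express the same transformation as a SELECT model materialized as a table or view (the table and column names are made up for illustration):

```python
import sqlite3

# sqlite stands in for the warehouse here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, country TEXT);
    INSERT INTO raw_orders VALUES (1, 1999, 'us'), (2, 500, 'DE'), (3, NULL, 'us');
""")

# Standardize: uppercase country codes, convert cents to dollars, drop bad rows.
conn.execute("""
    CREATE TABLE orders AS
    SELECT id,
           amount_cents / 100.0 AS amount_usd,
           UPPER(country)       AS country
    FROM raw_orders
    WHERE amount_cents IS NOT NULL
""")
rows = conn.execute("SELECT id, amount_usd, country FROM orders ORDER BY id").fetchall()
print(rows)  # → [(1, 19.99, 'US'), (2, 5.0, 'DE')]
```

In dbt the `SELECT` above would live in its own model file, with `raw_orders` referenced via `{{ ref(...) }}` or `{{ source(...) }}` so lineage is tracked automatically.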
Data Validation
Great Expectations / SodaSQL
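Validation tools phrase checks as declarative "expectations" that report failures rather than crash the pipeline. A toy check in that spirit — the function name echoes Great Expectations' naming style but this is not its real API:

```python
def expect_column_values_not_null(rows, column):
    """A toy expectation: return a result dict instead of raising,
    so failures can be collected and reported."""
    bad = [r for r in rows if r.get(column) is None]
    return {"success": not bad, "unexpected_count": len(bad)}

orders = [{"id": 1, "amount": 19.99}, {"id": 2, "amount": None}]
result = expect_column_values_not_null(orders, "amount")
print(result)  # → {'success': False, 'unexpected_count': 1}
```

The real tools run such checks inside the warehouse via SQL and track results over time, which is what makes them useful for observability as well.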
Operationalize
Once data has been organized and cleaned in the ware[lake]house, we need to derive insights from it. Insights can take the form of questions answered with data (ad-hoc queries), stories told using data (visualization), and data points fed back into business tools to take action (reverse ETL).
Visualization
Superset / Looker
Analyze
Trino
Reverse ETL
Grouparoo / Hightouch
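Reverse ETL is essentially a field mapping: take a row from a warehouse table and reshape it into the payload a SaaS tool's API expects. A minimal sketch — the field names on both sides are illustrative, not a real Hightouch or Grouparoo schema:

```python
import json

def to_crm_payload(warehouse_row: dict) -> bytes:
    """Map a warehouse row to the JSON a hypothetical CRM contact API expects."""
    return json.dumps({
        "email": warehouse_row["email"],
        "properties": {"lifetime_value": warehouse_row["ltv_usd"]},
    }).encode("utf-8")

row = {"email": "ada@example.com", "ltv_usd": 120.5}
print(to_crm_payload(row).decode())
```

The hard parts the dedicated tools add on top of this mapping are incremental syncs (only changed rows), rate limiting, and retries against the destination API.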
Governance
All the data and its transformed versions need to be monitored and governed. Data governance has two pillars: observability and data cataloging.
Cataloging deals with the what and where of data, while observability deals with freshness, correctness, schema changes, and lineage.
Data Observability & Cataloging
Datahub / Amundsen / OpenMetadata
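A catalog entry for a table boils down to a small metadata record: what the table is, who owns it, what it is built from (lineage), and when it was last loaded (freshness). A minimal sketch of such a record — DataHub and OpenMetadata model these same ideas with much richer, standardized schemas:

```python
import datetime

def catalog_entry(table: str, owner: str, upstream: list) -> dict:
    """A minimal catalog/lineage record for one table."""
    return {
        "table": table,
        "owner": owner,
        "upstream": upstream,  # lineage: the tables this one is built from
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),  # freshness signal
    }

entry = catalog_entry("analytics.orders", "data-team", ["raw.orders"])
print(entry["table"], entry["upstream"])
```

The "metadata driven" characteristic listed earlier is exactly this: every tool in the stack emitting records like these so the catalog stays current without manual upkeep.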
Orchestration
Since all the steps discussed above talk to different tools at different times, we need an orchestrator to run the show. Airflow is the most widely used open source workflow manager.
Airflow
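What an orchestrator fundamentally does is run tasks in dependency order: transform only after extract, publish only after validation. The stdlib sketch below shows that core idea with plain functions standing in for tasks; Airflow expresses the same graph declaratively with operators and adds scheduling, retries, and monitoring on top:

```python
from graphlib import TopologicalSorter

ran = []
tasks = {
    "extract":   lambda: ran.append("extract"),
    "transform": lambda: ran.append("transform"),
    "validate":  lambda: ran.append("validate"),
    "publish":   lambda: ran.append("publish"),
}
deps = {  # task -> upstream tasks that must finish first
    "transform": {"extract"},
    "validate": {"transform"},
    "publish": {"validate"},
}
# Run every task once its upstream dependencies have completed.
for name in TopologicalSorter(deps).static_order():
    tasks[name]()
print(ran)  # → ['extract', 'transform', 'validate', 'publish']
```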
Enjoy building data products!!!