
Open Source Data Stack for a Small Team


Tools to build a data stack that will help your company become data-driven

In this blog we will discuss a flexible set of technologies to support the storage, management, and consumption of data. These tools are inspired by the modern data stack (MDS).

Characteristics of an MDS for an org that is starting its data product journey:

  • A low entry barrier to the data product journey.
  • Low-code components/tools — a connector/plugin-based system.
  • Cloud native — born in the cloud.
  • Powered by SQL — use of a cloud data warehouse or lakehouse.
  • Modular & customizable — every tool solves one problem (and does it very well), so the tools must be modular to work well with each other.
  • Metadata driven — all aspects of the stack should collect and emit metadata.
  • Support for real-time scenarios as well.
  • Focus on operations.
  • Collaborative at its center.
  • Each component or tool used should have an active, large community for support.
  • Choose the right partner — the MDS is not a one-size-fits-all solution. Every organization is different and hence needs specific solutions. Speak to people, read what’s out there, and join the array of Slack communities.

A modern data stack can be divided into four layers.

Acquire → Organize → Transform → Analyze & Operationalize

An open source (mostly) data stack for a small company

Acquire

Getting data into the stack from the source where it is generated is the first step in a data platform. This data is collected from databases, file systems, SaaS tools, ad platforms, IoT devices, and web and mobile events.

Extraction and Loading

Source data needs to be moved in batch mode and organized in a data lake or lakehouse. This needs to be done from varied sources and in a timely manner. Airbyte is an open source tool that achieves this data integration.

Airbyte / Fivetran
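
As a minimal sketch of what this looks like in practice, here is one way to trigger a sync of an existing Airbyte connection through its REST API. The host, port, and connection ID below are placeholders for your own deployment:

```python
import requests

# A minimal sketch: trigger a manual sync of an existing Airbyte connection
# via Airbyte's REST API. Host, port, and the connection ID are placeholders
# for your own deployment (copy the ID from the Airbyte UI).
AIRBYTE_URL = "http://localhost:8000/api/v1"
CONNECTION_ID = "your-connection-uuid"  # placeholder

resp = requests.post(
    f"{AIRBYTE_URL}/connections/sync",
    json={"connectionId": CONNECTION_ID},
)
resp.raise_for_status()
print(resp.json()["job"]["status"])  # e.g. "running"
```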

Event Tracking/Streaming

An event is a piece of business or technical information. Events happen very fast and need to be consumed in real time or near real time, so the data stack needs a way to process them. Kafka is the de facto choice of streaming platform.

Kafka / Confluent
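
To make the event side concrete, here is a minimal sketch of publishing a business event to Kafka using the kafka-python client. The broker address and topic name are placeholders:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# A minimal sketch of publishing a business event to Kafka.
# Broker address and topic name are placeholders.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"event_type": "order_placed", "order_id": 42, "amount": 19.99}
producer.send("orders", value=event)  # consumers read this topic in near real time
producer.flush()                      # block until the event is delivered
```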

Organize

After collecting the data, the next step is organizing it so that it is easier to manage and consume. In an MDS we generally have a cloud warehouse, but as things are moving fast towards lakehouse systems, in my opinion we should have a lakehouse platform. To keep the process fast, I would include both a warehouse and a lakehouse and gradually lean towards one.

Data Warehouse

Druid / Snowflake

Data Lakehouse

Iceberg + Dremio / Databricks Delta
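
As a rough illustration of the lakehouse option, here is a sketch of landing raw data as an Iceberg table with Spark. The catalog name, warehouse path, and table names are placeholders, and the Iceberg runtime jar must match your Spark version:

```python
from pyspark.sql import SparkSession

# A minimal sketch of organizing raw data as an Iceberg table with Spark.
# Catalog name, warehouse path, and table names are placeholders.
spark = (
    SparkSession.builder
    .appName("organize")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

raw = spark.read.json("s3a://my-bucket/raw/orders/")        # data landed by Airbyte/Kafka
raw.writeTo("lake.analytics.orders_raw").createOrReplace()  # Iceberg table, queryable by Dremio/Trino
```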

Transform

Once we have organized the data, we need to standardize, validate, or even restructure it. These transformations often delve into data modeling. They can be done with dbt.

Transformation and Modelling

dbt / Spark
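
Here is a minimal sketch of such a transformation/modeling step written in Spark (the same logic could equally live in a dbt SQL model). The table and column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A minimal sketch of a transformation/modeling step in Spark.
# Table and column names are placeholders.
spark = SparkSession.builder.appName("transform").getOrCreate()

orders = spark.table("lake.analytics.orders_raw")

daily_revenue = (
    orders
    .filter(F.col("status") == "completed")        # standardize: keep valid orders only
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))         # model: one row per day
)

daily_revenue.writeTo("lake.analytics.daily_revenue").createOrReplace()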

Data Validation

Great Expectations / Soda SQL
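
For the validation step, here is a sketch using Great Expectations' pandas shortcut API (newer GE versions favor a context/checkpoint workflow, so treat this as illustrative). The file path is a placeholder:

```python
import great_expectations as ge
import pandas as pd

# A minimal sketch of validating a batch with Great Expectations
# (classic pandas shortcut API; newer versions use a context/checkpoint flow).
df = ge.from_pandas(pd.read_parquet("daily_revenue.parquet"))  # placeholder path

df.expect_column_values_to_not_be_null("order_date")
df.expect_column_values_to_be_between("revenue", min_value=0)

result = df.validate()
print(result.success)  # False means the batch failed a data quality check
```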

Analyze & Operationalize

Once data has been organized and cleaned in the ware[lake]house, we need to derive insights from it. Insights can take the form of questions answered with data (ad-hoc queries), stories told using data (visualization), and data points fed back into business tools to take action (reverse ETL).

Visualization

Superset / Looker

Analyze

Trino
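
An ad-hoc query against the lakehouse can go through the Trino Python client; here is a minimal sketch where the host, catalog, and schema are placeholders for your deployment:

```python
import trino  # pip install trino

# A minimal sketch of an ad-hoc query through Trino.
# Host, catalog, and schema are placeholders.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="analytics",
)

cur = conn.cursor()
cur.execute("SELECT order_date, revenue FROM daily_revenue ORDER BY order_date DESC LIMIT 7")
for row in cur.fetchall():
    print(row)
```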

Reverse ETL

Grouparoo / Hightouch
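
To show the reverse-ETL idea itself, here is a hand-rolled sketch that reads a modeled data point from the warehouse and pushes it back into a business tool. The CRM endpoint and token are entirely hypothetical; tools like Grouparoo and Hightouch manage this mapping and syncing declaratively:

```python
import requests
import trino

# A minimal sketch of reverse ETL: read a modeled metric from the warehouse
# and push it back into a business tool. The CRM endpoint and auth token
# below are hypothetical placeholders.
conn = trino.dbapi.connect(host="localhost", port=8080, user="etl",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()
cur.execute("SELECT customer_id, lifetime_value FROM customer_ltv")

for customer_id, ltv in cur.fetchall():
    requests.patch(
        f"https://crm.example.com/api/contacts/{customer_id}",  # hypothetical API
        headers={"Authorization": "Bearer <token>"},
        json={"lifetime_value": ltv},
    )
```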

Governance

All the data and its transformed versions need to be monitored and governed. Data governance has two pillars: observability and data cataloging.

Cataloging deals with the what and where of data, while observability deals with freshness, correctness, schema changes, and lineage.

Data Observability & Cataloging

DataHub / Amundsen / OpenMetadata
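
As one example of the metadata-driven principle, here is a sketch of emitting dataset metadata to DataHub with its Python emitter (pip install acryl-datahub). The server URL and dataset name are placeholders:

```python
# A minimal sketch of pushing metadata to DataHub's catalog.
# Server URL and dataset name are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

urn = make_dataset_urn(platform="iceberg", name="analytics.daily_revenue", env="PROD")
event = MetadataChangeProposalWrapper(
    entityUrn=urn,
    aspect=DatasetPropertiesClass(description="Daily revenue, built by the transform job"),
)
emitter.emit(event)  # the table now shows up in the DataHub catalog
```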

Orchestration

Since all the steps that we have discussed above talk to different tools at different times, we need an orchestrator to run the show. Airflow is the most widely used open source workflow manager.

Airflow
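
To close the loop, here is a minimal sketch of an Airflow DAG tying the stack together. The bash commands are placeholder scripts; in practice you would reach for the provider operators for Airbyte, Spark, dbt, and so on:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal sketch of a DAG chaining the stages of the stack.
# The bash commands are placeholder scripts.
with DAG(
    dag_id="daily_data_product",
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="airbyte_sync", bash_command="python trigger_airbyte_sync.py")
    transform = BashOperator(task_id="spark_transform", bash_command="spark-submit transform.py")
    validate = BashOperator(task_id="validate", bash_command="python run_expectations.py")

    ingest >> transform >> validate  # extract, then model, then run data quality checks
```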

Enjoy building data products!!!
