Modern data stacks consist of various tools and frameworks for processing data. Typically, this means a large collection of cloud resources that transform data and bring it to a state where we can generate valuable insights. Managing this multitude of data processing resources is not a trivial task and can seem overwhelming. The good news is that engineers invented a solution: infrastructure as code (IaC). In essence, it lets us deploy, provision and manage all the resources we may need in our data pipelines with code. In this story, I would like to discuss popular techniques and existing frameworks that aim to simplify resource provisioning and data pipeline implementation. I remember how, early in my data career, I created data resources through the web UI: storage buckets, security roles, and so on. Those days are behind me, but I still remember the joy when I learned it didn't have to be that way: everything can be done programmatically, using templates and code.
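To make the idea concrete, here is a minimal sketch of the IaC approach in Python: instead of clicking through a web UI to create a storage bucket, we describe it as a declarative template that a provisioning tool (here, a CloudFormation-style one) can deploy repeatedly. The function name, bucket name, and resource settings are illustrative assumptions, not a prescription.

```python
import json


def make_bucket_template(bucket_name: str) -> dict:
    """Build a CloudFormation-style template declaring a single S3 bucket.

    The logical resource name "DataLakeBucket" and the properties below
    are hypothetical examples of what a data team might version-control.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "DataLakeBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "BucketName": bucket_name,
                    # Versioning protects raw data from accidental overwrites
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


if __name__ == "__main__":
    # The rendered JSON is what we would commit to source control and deploy
    template = make_bucket_template("my-data-platform-raw")
    print(json.dumps(template, indent=2))
```

Because the template is plain code, it can be reviewed in a pull request, parameterized per environment (dev, staging, prod), and re-applied whenever the infrastructure drifts from its declared state.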
Modern data stacks
So what is a modern data stack (MDS)? It is the set of technologies specifically used to organize, store, and manipulate data (1). This is what shapes a modern and successful data platform. I raised this discussion in one of my previous stories.
A simplified data platform model typically looks like this:
Such a platform typically contains dozens of different data sources and the cloud resources needed to process them.
Data platform architectures vary depending on functional and business requirements, the skills of our users, and so on, but in general the infrastructure design encompasses various data processing…