Recently, I was fortunate enough to speak with several data engineers and architects about the problems they face with data in their companies. The main pain points I heard over and over again were:
- Not knowing why something broke
- Getting burned by high cloud computing costs
- Taking too long to create data solutions or complete data projects
- Needing experience in too many tools and technologies
These problems are not new. I have experienced them, and you probably have too. Yet we don’t seem to be able to find a solution that solves all of them in the long term. You might think, “Well, point one can be solved with {insert data observability tool},” or “point two just needs a more stringent data governance plan.” The problem with these kinds of solutions is that they add extra layers of complexity, which makes the final two pain points more severe. The sum total of the pain points stays the same; there is just a different distribution across the four.
This article aims to present the opposite style of problem solving: radical simplicity.
Summary
- Software engineers have achieved enormous success by embracing simplicity.
- Over-engineering and the pursuit of perfection can result in bloated, slow-to-develop data systems, at sky-high costs to the business.
- Data teams should consider sacrificing some functionality for the sake of simplicity and speed.
A lesson from the software guys
In 1989, the computer scientist Richard P. Gabriel wrote a relatively famous essay on computer systems, paradoxically titled “Worse is Better”. I won’t go into the details (you can read the essay here if you like), but the underlying message was that software quality does not necessarily improve as functionality increases. In other words, you can sometimes sacrifice completeness for simplicity and end up with an inherently “better” product because of it.
This idea was foreign to the pioneers of computing during the 1950s and 1960s. The philosophy of the time was: a computer system must be pure, and it can only be pure if it accounts for all possible scenarios. This was probably because most of the leading computer scientists of the time were academics who wanted to treat computing as a hard science.
Academics at MIT, the leading institution in computer science at the time, began working on the operating system for the next generation of computers, called Multics. After nearly a decade of development and millions of dollars of investment, the folks at MIT released their new system. It was undoubtedly the most advanced operating system of its time, but it was complicated to install due to its computing requirements, and feature updates were slow because of the size of the codebase. As a result, it never caught on beyond a select few universities and industries.
While Multics was being built, a small group supporting its development grew frustrated with the ever-increasing requirements being placed on the system and eventually decided to abandon the project. Armed with this experience, they set out to create their own operating system, one with a fundamental change in philosophy:
The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.
—Richard P. Gabriel
Five years after the launch of Multics, the breakaway group launched its own operating system: Unix. Slowly but steadily it gained ground, and by the 1990s Unix had become the operating system of choice, with more than 90% of the world’s 500 fastest supercomputers using it. To this day, Unix is still widely used, most notably as the underlying system for macOS.
Obviously, there were other factors beyond its simplicity that led to Unix's success, but its lightweight design was, and remains, a very valuable asset of the system. That could only be achieved because the designers were willing to sacrifice functionality. The data industry should not be afraid to think the same way.
Back to data in the 21st century
Looking back on my own experiences, the philosophy behind most of the big data engineering projects I have worked on resembled that of Multics. For example, there was a project where we needed to automate the standardization of raw data coming in from all of our clients. The decision was made to do this in the data warehouse via dbt, since that way we would have a complete view of the data lineage, from the raw files to the standardized single-table version and beyond. The problem was that the first stage of transformation was very manual: each individual raw client file had to be loaded into the warehouse, and then a dbt model had to be created to clean each client file. This led to hundreds of dbt models, all using essentially the same logic. dbt became so bloated that it took minutes for the data lineage graph to load on the dbt documentation website, and our GitHub Actions for CI (continuous integration) took over an hour to complete for each pull request.
This could have been solved fairly easily if leadership had allowed us to do the first layer of transformations outside the data warehouse, using AWS Lambda and Python. But no, that would have meant the data lineage produced by dbt was not 100% complete. That was it. That was the main reason for not radically simplifying the project. Like the group that split off from the Multics project, I left this project halfway through its development; it was simply too frustrating to work on something that clearly could have been much simpler. At the time of writing, they are still working on the project.
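To make that concrete, here is a rough sketch of what that simpler first layer might have looked like: one generic Lambda function that standardizes every raw client file on its way into storage, instead of one dbt model per client. The bucket name, column mapping, and file layout are hypothetical stand-ins, not the actual project’s details.

```python
import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Hypothetical mapping from the various raw client column names
# to a single standard schema.
COLUMN_MAP = {
    "cust_id": "client_id",
    "CustomerID": "client_id",
    "amt": "amount",
    "Amount": "amount",
}


def handler(event, context):
    """Triggered by an S3 upload; writes a standardized copy of the raw file."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the raw client CSV straight from S3.
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(obj["Body"])

    # One shared standardization step replaces hundreds of
    # near-identical dbt models.
    df = df.rename(columns=COLUMN_MAP)
    df.columns = [c.strip().lower() for c in df.columns]

    # Write the cleaned file to a (hypothetical) standardized bucket.
    out = io.BytesIO()
    df.to_parquet(out, index=False)  # needs pyarrow in the Lambda package
    s3.put_object(
        Bucket="standardized-data",
        Key=key.rsplit(".", 1)[0] + ".parquet",
        Body=out.getvalue(),
    )
```

The trade-off is exactly the one leadership refused to make: this first hop would not appear in dbt’s lineage graph. In exchange, the per-client logic collapses into a single lookup table and one small deployment.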
So what the heck is radical simplicity?
Radical simplicity in data engineering is not a framework or a set of data stack tools; it is a mindset: a philosophy that prioritizes simple, straightforward solutions over complex, all-encompassing systems.
Key principles of this philosophy include:
- Minimalism: Focusing on the core functionality that provides the most value, rather than trying to accommodate every possible scenario or requirement.
- Accepting trade-offs: Willingly sacrificing some degree of integrity or perfection in favor of simplicity, speed, and maintainability.
- Pragmatism over idealism: Prioritizing practical, viable solutions that solve real business problems efficiently, rather than seeking theoretically perfect but overly complex systems.
- Reduced cognitive load: Designing systems and processes that are easier to understand, implement, and maintain, thereby reducing the expertise required across multiple tools and technologies.
- Cost-effectiveness: Adopting simpler solutions that often require fewer computational resources and human capital, resulting in lower overall costs.
- Agility and adaptability: Creating systems that are easier to modify and evolve as business needs change, rather than rigid, over-engineered solutions.
- Focus on outcomes: Emphasizing end results and business value rather than getting caught up in the complexities of the data processes themselves.
This mindset can directly contradict modern data engineering practice, which often means adding more tools, processes, and layers. So expect to have to defend your position. Before suggesting a simpler alternative, arm yourself with a deep understanding of the problem at hand. I am reminded of the quote:
It takes a lot of hard work to make something simple, to truly understand the underlying challenges and come up with elegant solutions. (…) It’s not just minimalism or the absence of clutter. It involves digging through the depth of complexity. To be truly simple, you have to go really deep. (…) You have to deeply understand the essence of a product in order to be able to get rid of the parts that are not essential.
—Steve Jobs
Side note: Keep in mind that embracing radical simplicity doesn’t mean ignoring new tools and advanced technologies. In fact, one of my favorite solutions for a data warehouse right now is a new open-source database called DuckDB. Check it out, it’s pretty cool.
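For a taste of that simplicity, here is a tiny example (the file name and column names are placeholders): DuckDB can query a Parquet file in place from Python, with no server or cluster to stand up.

```python
import duckdb  # pip install duckdb

# Aggregate a local Parquet file in place: no cluster, no load step.
# 'events.parquet', client_id, and amount are placeholder names.
result = duckdb.sql("""
    SELECT client_id, SUM(amount) AS total_amount
    FROM 'events.parquet'
    GROUP BY client_id
    ORDER BY total_amount DESC
""").df()

print(result.head())
```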
Conclusion
Lessons from the history of software engineering offer valuable insights for today’s data landscape. By embracing radical simplicity, data teams can address many of the problems plaguing modern data solutions.
Don’t be afraid to promote radical simplicity in your data team. Be the catalyst for change if you see opportunities to optimize and simplify. The path to simplicity is not easy, but the potential rewards can be substantial.