If you could see the email threads and Slack chats among our team, about our customers and about what we hear at the conferences we attend, on the differences and benefits (or lack thereof) between a data warehouse and a data lake, it would be enough to make your head spin. Probably very similar to this run-on sentence. There are simply different schools of thought surrounding this larger conversation.
Many of us who remember dimensional models, data hubs, and the other concepts surrounding data warehousing are familiar with only the basic concepts of building data lakes.
At Ai Consulting Group we actually build quite a few (like, a lot) of data lakes and data warehouses (yes, still) because they can both serve separate but equally valuable purposes. A short example: a data lake we recently built (or actually re-built) needed to meet the business case of housing 10 disparate data sources totaling around 300 GB of historical data, with daily batch loads for 8 of the sources and real-time streams for the remaining 2, so that a budding data science team (currently 2 people, with plans for 4) could begin its journey at a mid-size manufacturing organization. This data lake will be bare bones for the next several months, until it evolves based on iterative feedback and until more purposeful intent is derived through conversations, data wrangling (which we’ll support and which is now possible), and some central analytical and operational reporting. (There’s a machine learning component to the purpose of the data lake as well, but I’ll gloss over that for the sake of brevity.) This is a bit of a cart-before-the-horse approach, but it is necessary, both because of an executive directive and because there is a belief that there are gems in the data. Instead of waiting for a shiny gem to be delivered in a velvet box, wrapped with a bow, the small team and our extended staff augmentation will refine the rough gem and cut it to its brilliance.
On the other hand, a data warehouse can, when done correctly by experienced professionals, provide quick-win value in an agile fashion and deliver a targeted, known set of information through data integration, providing more immediate business value. It can also automate business logic, reducing the human error that comes with repetitive tasks or calculations, so that the known metrics, data, and KPIs are consistent, and that creates business confidence (which we think everyone likes). Recently we had a debate with a client hell-bent on a data lake when what they really needed was a data mart for marketing reporting. Though it took a decent amount of convincing, in the end the right call was made to create a phased approach: deliver the data mart first, then a data lake, and then combine the two for more competent analytics and predictive ML outputs.
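To make the point about automating business logic a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Order` record, the `gross_margin_pct` function, and the numbers are assumptions, not anything from a client project. The idea is simply that when a KPI is defined once, in one place, every report that uses it gets the same answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical order record; the field names are illustrative only."""
    revenue: float
    cost: float


def gross_margin_pct(orders):
    """One shared definition of the KPI: (revenue - cost) / revenue, as a percentage."""
    total_revenue = sum(o.revenue for o in orders)
    total_cost = sum(o.cost for o in orders)
    if total_revenue == 0:
        # Assumption: report 0.0 rather than failing when there is no revenue.
        return 0.0
    return round((total_revenue - total_cost) / total_revenue * 100, 2)


# Made-up sample data: two orders feeding two different reports.
orders = [Order(revenue=1000.0, cost=600.0), Order(revenue=500.0, cost=300.0)]

# Both "reports" call the same function instead of re-deriving the metric by hand.
marketing_report = gross_margin_pct(orders)
finance_report = gross_margin_pct(orders)
```

In a real warehouse this logic would more likely live in the data integration layer or a modeled view than in application code, but the principle is the same: the calculation is written once and reused, rather than repeated by hand in every spreadsheet.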
At the end of the day, we are all enamored with technology and its advancements. There’s always going to be a better mousetrap. But a solid foundation of architecture, modeling, and data logic and flow is all just conversation if the solution doesn’t meet expectations. And in analytics, AI, or any functional department within the business, if it doesn’t add value then it just might be vaporware.
So the business value of either approach needs to be brought into play. That’s why we’re keen on the idea of a DataLakeHouse. Yes, it’s a one-word combination of the historically perceived business value of a Data Lake (storage for all types of data, plus the ‘data sciencey’ benefits of fast distributed data mining, data discovery, and data wrangling) with the power and known business value of a Data Warehouse/Mart. The combination of these modern to kinda-legacy powerful concepts gives organizations an end-to-end perspective on their data to drive informed decision making, while both preparing for any type of data ingestion and still delivering a single source of truth for business analytics.