So your company is going cloud. Perhaps cloud native, perhaps cloud-hybrid. Perhaps mainly on-prem but finally getting one or two SaaS applications this budget year, and the excitement over the potential to start blending data is keeping you up at night. Any way you slice it, if you’re getting into a data science initiative or you are heavily engaged in AI/ML at any level, then you should understand that capturing data in a repeatable, meaningful way is not as easy as loading a sample CSV dataset into your Jupyter Notebook.
Enter your data lake. But so many people and groups have misconceptions, and even poorly written definitions, about what a data lake is, not to mention what it means for their company. Bringing in a new data source such as a modern SaaS application, however, is an excellent occasion to build a data lake or update an existing one, and to do it correctly with the proper business (or data science collaboration) support.
So with that in mind, here are our top 3 reasons why you want to take every new SaaS system’s data and store it in your data lake. Yes, we’ll save other principles for building a data lake correctly for another article.
#1 – Proactivity – New Systems Should Be Tested for Functionality
There’s no time like the present. Before go-live, the organization needs to confirm that it can actually get data out of the new SaaS system, whether by extract or by direct link:
- Will you need custom or specific drivers?
- Will an export for the exact data you need be available and can it be pulled on a frequency of your choosing in a format of your choosing?
- Do you have direct access to the database to submit SQL or GraphQL queries?
If you don’t take the system for a pre-production spin and kick the tires on an ingestion process, you are creating technical debt. You are also doing your team and your data lake a disservice (see #3) in the value, perceived or otherwise, that you do or should be creating in the organization. Don’t even get me started on getting a simple gut check on the need for security, role-based data-level security, and data governance.
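To make the tire-kicking concrete, here is a minimal sketch of a pre-production smoke test for a sample export. Everything in it is an assumption for illustration: the export is taken to be CSV text, the helper names (`missing_columns`, `landing_path`, `smoke_test_ingest`), the column names, the `crm_saas` source name, and the `raw/<source>/ingest_date=...` folder layout are hypothetical and not tied to any particular SaaS product or data lake platform.

```python
import csv
import io
import tempfile
from datetime import date
from pathlib import Path


def missing_columns(export_text, required):
    """Return the required column names absent from the CSV header, sorted."""
    reader = csv.DictReader(io.StringIO(export_text))
    header = set(reader.fieldnames or [])
    return sorted(set(required) - header)


def landing_path(lake_root, source, run_date):
    """Build a date-partitioned raw-zone path (hypothetical layout)."""
    partition = f"ingest_date={run_date.isoformat()}"
    return Path(lake_root) / "raw" / source / partition / "export.csv"


def smoke_test_ingest(export_text, required, lake_root, source, run_date):
    """Check the sample export's header, then land the file unchanged."""
    missing = missing_columns(export_text, required)
    if missing:
        raise ValueError(f"export is missing required columns: {missing}")
    target = landing_path(lake_root, source, run_date)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(export_text, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical sample export from the new system.
    sample = "account_id,created_at,plan\n1001,2024-01-05,pro\n"
    with tempfile.TemporaryDirectory() as lake_root:
        landed = smoke_test_ingest(
            sample, ["account_id", "created_at"], lake_root,
            "crm_saas", date(2024, 1, 6),
        )
        print(f"landed sample export at {landed}")
```

Running something like this against a vendor’s sample file before go-live answers the export-format question early; driver, scheduling, and security checks would still need their own tests.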
#2 – It’s your data lake – What else is it for?
To expound upon #1 and as a precursor to #3: if your team already has a data lake, then use it. It was built to inspire and to act as a repository for the betterment of the organization, or at least of a single department or team such as the data science squad. Since the historical and sometimes naive vision from IT is just to “build a data lake because a data warehouse architecture is outdated and the data lake will meet all of our organization’s data centralization needs”, you will need to actually use the data lake whenever the chance allows. Unfortunately, when a myopic view is taken in the design and build of a data lake, some organizations cannot scale the design or pivot to ingest more data without causing a train wreck of unorganized ingestion and hitting data governance issues, so they struggle to ingest future data sources with any ease.
So, obviously, these are words of caution for an organization yet to embark on a data lake journey. But at the end of the day, a data lake is not an “if you build it, they will come” type of infrastructure, especially if you don’t design and build to scale. And the big-picture thinking of “we built, or we will build, the data lake to ingest all future sources” will keep you from making the naive mistakes mentioned here.
#3 – Executive Sponsorship and Continued Justification
This is similar to #1 above, but only in its proactive nature. This reason is more about how systems such as analytics and reporting, ETL, data science, and even (sometimes, but rarely) collaboration systems such as SharePoint never fully get the credit they deserve unless they keep proving their worth, day in and day out. We used to see this consistently in Business Intelligence and Analytics, where providing an ROI on the BI teams was hard to do because all they produced were “Reports”, right? No one gets to see the massive 50 Terabyte Data Warehouse migrated from SQL Server to Hadoop to support the multi-departmental daily reporting that runs the business. So, this is really a “squeaky wheel gets the oil” situation. By insisting that data from a new business system needs to be placed into the data lake, you tie a massive business expense, and software with a perceived benefit, to the continued need to feed the data lake. The cycle continues, and ultimately the benefit is that the appended data is available for consumption. The rest is up to the data lake design and architecture. Hopefully that is sound and you didn’t screw it up to begin with (technical debt, anyone?).