Transcript from the Meetup event:
Hello everyone. Welcome to another great Carolina Snowflake Meetup group event. Today we will be going over bringing software engineering principles into data analytics. And we have a special guest speaker, Anna from CodeSignal.
Let me just quickly pull up our meetup rules. I’m not going to read through them all because you guys have heard them a thousand times, but these are our meetup rules. We’ve got our previous meetups available on our YouTube channel, and we’ll do a quick recap of Snowflake with Mike.
Yeah, for some folks that may be new to Snowflake or just getting into Snowflake, and even if you’re a seasoned veteran, we just want to get everyone on the same page on what Snowflake is. We talk about the data cloud.
It’s really a comprehensive, fast, end-to-end platform, from data ingestion to governance, all the way to data sharing as you look to monetize your data within your organization. And the scale is really infinite, all done through a consumption-based model.
At the end of the day, when we talk about Snowflake, we peel back some of the layers on what it solves and what it addresses. We talk about data integration. It doesn’t matter where your data sits, whether it’s in a transactional database, with some third party, in scraped web logs, or in IoT data: you are ingesting that and transforming it into the different structures that are needed by the different user groups within your organization.
That ranges from somebody who is building an application and just wants some lightly curated data, to folks within the organization who may be more executive-level and are just consuming enterprise dashboards through a tool in a read-only format.
Snowflake handles all of that transparently, in an easy-to-use way where you’re just paying for what you’re using. You don’t have to do those really large capacity planning exercises of the old days.
All right, we’ll go ahead and turn it over to Anna. Welcome, Anna. I’m Anna Yeomans. I’m a data analytics engineer at CodeSignal. Today we’ll be talking about bringing software engineering principles into data analytics.
Okay, but first, what is CodeSignal, the company I’m working for? We are a technical interview and assessment platform, which is honestly a really, really cool product to work on. We essentially help companies hire engineers and other technical roles by making industry-standardized assessments.
And we also have a live interview tool. We are trusted by some cool companies across different industries. Some of the big names are Meta, Uber, Visa, Capital One, and others. So here is our modern data stack at CodeSignal.
These are just some of the technologies; we don’t have space to list all of them. Our raw data is coming from very different sources. From the product database, we have some custom and not very custom event tracking.
We have some business apps, and data comes from a variety of places. But essentially, the point is that all of the raw data is very different. And then it’s all loaded into the home base for everyone, which is Fivetran, our ETL tool.
ETL stands for extract, transform, and load. After that, all of this data moves into Snowflake. And this is a Snowflake group, so hopefully you know the context. We also have some custom data manipulations that are connected to Snowflake.
And then, because the raw data can be very difficult to work with, we are using a data modeling tool, which is dbt, and that’s where we essentially clean, test, and document data so that it is ready for users to use.
And on the BI and data science side, we are also using Mode Analytics. Today we will mostly talk about dbt and what is around it. So here on the left side, this is what an organization might look like without dbt.
Analysts make data requests to engineers, who then put those requests in a queue. And analysts will wait and wait until those requests are done. But here on the right side, this is what an organization looks like with dbt.
And this is basically what my team looks like. Engineers and analysts collaborate very closely together, and they are able to complete requests much faster. dbt was inspired by software engineering workflows like continuous integration, version control, and automated testing.
So if you are a software engineer in the room, a lot of things in this presentation will sound obvious. But if you’re an analyst, basically what we are doing here is applying some typical workflows that software engineers use to make code changes as easy and as safe as possible.
In organizations, you will mostly see people who understand the business and people who understand the technology. The business people understand all of the marketing concepts and sales concepts, and they understand why certain things and processes are changing in Salesforce.
On the other side, the technology people understand how the platform operates, and they understand the importance of version control and how to make processes as automated as possible.
They understand the importance of continuous integration. I would say dbt is really where the two worlds meet, and that’s how the role of analytics engineer came into the market. So what is dbt? I keep saying dbt.
dbt is a data modeling tool that essentially helps to transform the raw data you see on the left side. Through development, we clean the raw data, test it, document what everything is, and then deploy it for everyone.
And that’s where analysts can use BI tools, machine learning tools, and others to actually use that data and make sense of everything. So here I just listed a few software engineering principles that, in my opinion, a data team should not leave out. This is just a summary, and we’ll dive deeper into some of them as well.
You should always test data on staging before moving anything to production or social board. Then we can use GitHub Actions to test models and also enforce a single styling for the whole team and test it on pull requests.
Then we can also invest in tools and in the analyst developer experience. Because analysts don’t normally work with software engineering principles, you can invest in those tools and in the experience of whoever is going to be working with them, to make their life easier and help them be comfortable
using VS Code, Git, and GitHub. Require at least one reviewer to merge any changes to production, and make builds fast and simple with continuous integration. So I would say the most core software engineering principle is using Git, and that’s what software engineers use every day.
For example, my terminal is always open. So what is Git? It is a version control system that essentially allows us to save and view changes at a whole new level. From this point on I might use the word code, and what I’m really referring to is data models and not necessarily a code base or application code.
But my point is that data models can be treated completely like code and thus need Git. Okay, so dbt does have a UI and it’s really nice; you can see it on the left side. But it really has much less functionality than software engineering tools like GitHub or Git.
If you’re using VS Code, you can have the dbt command line, and you can have VS Code plugins that help you be more efficient as you work. You can have linting, and you can use Git just like any software engineer would.
So another software engineering principle is continuous integration. That means pretty much automating what happens after any code change. The idea is to remove any manual steps from testing the new code, deploying the new code, and any other part of the code change, so that it all happens automatically without any effort from the analyst or engineer, depending on who is building the model.
And here is a small example I added of what we use. Even though dbt does provide great built-in documentation, we did face, I would not say an issue, but a problem that they could improve.
It gets very difficult to manage user access to dbt, and it gets very difficult to share models with someone internally, right? Because then you have to check whether they have dbt access or not.
So what we did was auto-deploy our documentation with Netlify. If you haven’t heard of Netlify, I recommend checking it out. It’s basically serverless backend services for applications. You can do it using GitHub Actions, which will make the documentation available for the whole company.
And it’s pretty easy to do. You can have SSO with Netlify, or you can also protect it with a global password. So the next one is using GitHub Actions to test models and enforce a single styling on pull requests.
So the next software engineering principle, I would say, let’s call it testing. Always test your code; never merge untested code to production. And the importance of testing goes, I would say,
without saying. It’s something that is obvious, and data models should not be an exception to that. When I say testing, I don’t just mean running some SQL to see if statements return what you expect.
What I really mean is that you need to unit test your code and have some automated process that will run those unit tests for you, so that you can make testing a part of your workflow. This is how my team is using testing.
We run tests locally while developing models. Then we also run tests on pull requests. In dbt, instead of running all of the data models, you can set it up so that on pull requests you only run and test the modified models.
And here’s a screenshot as an example. You would just say something like dbt run with state:modified, versus if you’re running some scheduled routine test on all models, you can just say dbt run.
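The difference between those two invocations can be sketched in Python as follows. This is a minimal illustration and not CodeSignal's actual setup: the function name `dbt_commands` and the `prod-artifacts` directory are hypothetical, and it assumes the `state:modified` selector is compared against artifacts from an earlier production run passed through dbt's `--state` flag.

```python
from typing import List


def dbt_commands(is_pull_request: bool, state_dir: str = "prod-artifacts") -> List[List[str]]:
    """Build the dbt invocations for a CI job as argument lists.

    `state_dir` is a hypothetical folder holding the manifest of the last
    production run, which dbt needs in order to work out what changed.
    """
    if is_pull_request:
        # On pull requests, only run and test the models that were modified.
        selector = ["--select", "state:modified", "--state", state_dir]
    else:
        # On a scheduled job, run and test every model.
        selector = []
    return [["dbt", "run", *selector], ["dbt", "test", *selector]]


if __name__ == "__main__":
    for command in dbt_commands(is_pull_request=True):
        # A CI job could hand each list to subprocess.run(command, check=True).
        print(" ".join(command))
```

Keeping the commands as argument lists means the same function can serve both the pull request workflow and the scheduled workflow, with only one flag deciding how much of the project gets built.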
But this is just a small example of what you can do. Then linting really means standardizing the code style for the whole team, so that there is no confusion for coworkers who go through your styling as they review your code.
And it can be a part of the developer’s workflow as well. With tools like pre-commit, you can also test it on pull requests, and you can use a VS Code plugin to have it set up locally as well. Our team is using SQLFluff as a SQL linter.
If you haven’t heard about it, I do recommend checking it out. It’s a really cool tool to use and it’s easy to set up. On the right side is just a screenshot of what your setup for SQLFluff might look like.
It’s just the .sqlfluff file. Then you can decide in your team what kind of styling you want to have standardized for everyone, and you can automate it in development using a tool like pre-commit, which will fire up and fix the code automatically every time you are trying to make a pull request.
And you can also set up a linting check on pull requests, and then your reviewer will see whether all of the linting tests are passing or not. And that brings us to the conclusion. Once again, I want to say dbt is a great tool and it’s very easy to use together with Snowflake.
And using software engineering principles with dbt can really improve your team’s workflow and help you be more efficient. Also, I would say from my experience working with analysts, they can come from very different backgrounds, and some of them might come with previous experience of software engineering tools like continuous integration and version control.
But some of them might not have that experience. And that’s okay, because analysts are very technical people and they can adapt pretty fast to any tools that you build for them. So do invest in those tools and do write good documentation for them to use, adding some screenshots as well,
so that the process for them is as easy as possible, and I promise they will enjoy using the tools you build for them. Most of the things that we use, and all the software engineering principles that we use, are based on the problems that we had to solve.
For example, making documentation available for everyone, or running tests on pull requests to save your reviewers’ time. And I strongly encourage you to do some research, or maybe even use the tools that we covered today or some similar ones. Hopefully some of the examples that I covered will inspire you to solve the problems that you have in your company.
And just the last thing I want to say: data modeling, and any data engineering or analytics engineering work, does sound like it’s moving away from software engineering and towards analytics, but it really shouldn’t be this way.
Do try to use software engineering principles as much as you can, because I do promise you your life will be much easier and the job will be more efficient. And thank you. Do you have any questions? Yes, a couple of questions came in.
Thank you for your presentation. It was fantastic. Some good questions here. Let me throw them at you one at a time and give you a chance to answer them. So one of the questions that came in: is SQLFluff free or not?
It is free. It’s open source, it’s very well maintained, and there is great documentation. They even have some basic rules that they recommend you use. It doesn’t mean you have to use them, but I found it very helpful for standardizing SQL code.
Okay, perfect. And then there was a question that says: please re-explain or dive deeper into how you use Netlify. Yeah, good question. So again, we had to deploy our documentation somewhere, right?
So we just use Netlify. How can you do it? There are a lot of good resources online. You would use GitHub Actions, and you would provide the Netlify key and the API, and you would be able to deploy it on Netlify. It’s pretty straightforward, but I also recommend signing up for the dbt group if you are not there yet.
It’s a dbt Slack group, and if you just Google deploying on Netlify, you will find a lot of different examples of how other companies wrote those actions, and that’s how I implemented ours. Okay, very good.
So another question is: how long did it take you to get all your data transformations into dbt? Good question. We actually did it pretty quickly, which is, I guess, because my team is very fast and we are very excited to make data available and test it.
As a whole company, it probably took us around three to four months to migrate everything, but it doesn’t mean we are done. Product changes happen with every deployment, once a week, and other teams keep adopting different business apps that also require data to be migrated.
So tickets always get collected. Okay, perfect. Another question: how would you recommend structuring a data project beyond dbt from a CI/CD pipeline perspective, as the source code typically spreads across multiple systems? You have data science in Mode, data engineering, and dbt.
It really depends on the project, right? I’m thinking of a custom integration that’s outside of dbt as an example. To be honest, even when we do a custom integration, we still test the data using dbt, but you could implement testing with some Python library and run those tests automatically as well.
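As a rough idea of what testing data in plain Python could look like, here is a small sketch that mirrors the spirit of dbt's `not_null` and `unique` tests. It is not the speaker's implementation: the helper names, the sample rows, and the column names are all hypothetical, and a real project would more likely read rows from the warehouse and run the checks under a test runner in CI.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def not_null_failures(rows: List[Row], column: str) -> List[Row]:
    """Return the rows where `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def duplicate_values(rows: List[Row], column: str) -> List[Any]:
    """Return the values of `column` that appear more than once."""
    counts = Counter(row.get(column) for row in rows)
    return [value for value, count in counts.items() if count > 1]


if __name__ == "__main__":
    # Hypothetical sample standing in for the output of a custom integration.
    sample = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": None},
        {"id": 2, "email": "b@example.com"},
    ]
    print("rows with null email:", not_null_failures(sample, "email"))
    print("duplicated ids:", duplicate_values(sample, "id"))
```

Each helper returns the offending records rather than a bare pass or fail, so an automated job can both fail the build and show what needs fixing.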
Okay, very good. One more question. How did you handle the change management of getting your team, and even others within the organization, bought into this new modern methodology for developing?
For using dbt specifically, or just the continuous integration and testing? It’s a pretty open-ended question. I’m not sure, so maybe just touch a little bit on both. Yeah, I would say I will answer.
I’ll start from the second part: how do you get people to believe in software engineering principles in the data analytics world, right? Really, just build the tools, provide trainings, give demos, write very in-depth documentation, and do explain why you are doing it.
And I do promise, if people have a comparison, like downloading CSVs and manually loading them somewhere versus the system doing it for you, and then you don’t have to test yourself because the technologies will test your code for you,
I would say it saves so much time that trying to follow those principles is something you can build on. Okay, perfect. Those are all the questions. Thank you so much for your presentation and for handling those questions.
Great job. Good questions. Heather, we’ll turn it back to you. So I just wanted to quickly talk about our upcoming event, which we will be having on Monday, October 24 at 6:00 p.m. Eastern. We will be going over Snowflake health checks and talking about some best practices, and it’s going to be really great.
So we hope you join us and invite a friend. All right, and then next up we’re going to be going to Coalesce. So if anybody here will be there, we’d love to meet up for coffee, and you can let us know in the chat window or connect with us through Mike; his email is here on the screen.
And that is all we’ve got for you guys today. Thank you guys so much for joining us!