Often the question arises: how can Snowflake customers secure their data at the row level, by organization, or by some other meaningful slice? This is a very common use case for most customers, especially in analytics and reporting. Join us for a conversation on best practices and hands-on coding for securing data with secure views and UDFs.
Anyone interested in the code we use during our events can check us out on GitHub.
Check out our Meetup video to watch an overview of this event.
Transcript of the Meetup video:
Welcome to another Carolina Snowflake meetup group event. Today we’re going to be going over Secure Views in Snowflake. Our previous meetups we record and we put them on our YouTube channel, which you can find in the discussion of the meetup.
And if anybody is interested in those, just go check on that discussion board and there’s a link in there. Today we’ll do a quick introduction, Snowflake refresher and talk a little bit about View Security in Snowflake, kind of overview of what that is, use cases and then we’ll have a walkthrough and then we’ll have a little Q and A session for anybody that has any questions, right?
Mike is going to take it from here. For those of you that are maybe new to Snowflake, we have a little bit of an overview. But for others that are veterans, you already know this or maybe you’ve even hopefully gotten deep into many of these within your organization.
Today’s chat, regardless of whether you’re a long-time Snowflake user or not, is for you. When you think of what Snowflake is, Snowflake, the Data Cloud, at the end of the day allows you to take data that would otherwise sit on prem and bring it into the cloud.
Third party data, log data, IoT data, sensor data and really bring all that data together into that single cloud location. And from there you’re able to determine which users should see which data and what format should they be seeing the data.
When I say format, that is already bringing multiple data sources together. Are you providing data warehouses, data marts? Or perhaps you’re even exposing some of the transactional data to some of your different data consumers.
At the end of the day, it’s the single location. It’s the same data, whether you have that data aggregated or broken out into its own transactional format. Everyone’s using the same data at the end of the day, and you get the power of the cloud and you’re only paying for what you actually use.
Gone are the days of trying to figure out, what are my next three to five years from capacity planning and buying all of that now and then, hopefully, if we’ve guessed right, grow into that over a period of time.
A unique thought process that allows you to save money and move even faster. You’re not installing any software in order to use Snowflake, and you can choose which public cloud, between Amazon, Azure and Google, you want to run Snowflake on.
For example, if your organization is a large AWS shop, great. You can use the complimentary services that AWS has to offer with Snowflake running in your desired region. Again, lots of power from that perspective.
And all of that comes out of the box with Snowflake with your annual subscription. All right, so switching gears to view security in Snowflake. When we think about Snowflake, we have data, like Mike was saying, and at the end of the day, some of that data can be open from a centralized point of view to a large number of users, or it can be restricted to certain select users. And so one of the ways Snowflake allows us to do that is through something called view security. In other tools this might fall under role-based access control (RBAC).
We’ll talk about that a little bit. So really it’s this ability to restrict access to underlying data by using a view object that overlays a table object and provides transparent filtering of data to allow for data privacy.
We’ll talk more about different versions of Snowflake and different capabilities. But when we look at the basic editions of Snowflake, row-level security, or view security, is available really at all levels for the most part.
And then there’s another, higher level called column-level view security. We won’t get into that too much today, but that’s mainly available on the Enterprise and above editions of Snowflake. When we think about implementing secure views or view-level security inside of Snowflake, from a development and level-of-effort perspective, it really is a technical piece of development that’s going to take place.
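As a quick sketch of what that looks like in Snowflake SQL (the object names here are hypothetical), the only syntactic difference between a regular view and a secure view is the SECURE keyword:

```sql
-- A regular view exposes its definition to anyone who can describe it,
-- and certain optimizer behaviors can leak underlying data.
CREATE OR REPLACE VIEW vw_sales AS
    SELECT * FROM sales_fact;

-- The same statement as a secure view: the definition is hidden from
-- non-owners and Snowflake restricts optimizations that could expose
-- rows the consumer should not see.
CREATE OR REPLACE SECURE VIEW vw_sales AS
    SELECT * FROM sales_fact;
```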
And then there’s a little bit of scale involved here, but it is flexible and there are some other components as well that give that ability. At the end of the day, it’s a simple method to deploy. It doesn’t involve a lot of extra work, but it will involve a little bit of maintenance on an ongoing basis, depending on whether you’re changing your roles, changing your different permission levels, and so forth.
So what we’re going to talk about next are just some general use cases. So let’s dive into some use cases really quickly. We can think about sales and marketing teams. Perhaps there are different sales teams in the organization that need to look at the data differently.
And perhaps those sales and marketing teams should not have access to other teams’ data within the organization. Maybe you have some sort of franchising concept, or you have different sales territories, anything where you can think of delineating.
Perhaps you’re implementing Snowflake in what we might call a multi-tenancy environment, or at least a type of multi-tenant architecture, where you might have different customers or clients, and those customers and clients should not see each other’s data or look into another customer’s or client’s data.
But you want to centralize that data into a single data mart or data warehouse. So this will be a great use case not just for row-level security, but for secure views. Then there’s marketplace and vendor data; there’s capability there too.
So if you’ve been on the Snowflake Marketplace, this is a concept that aligns almost one to one with the Marketplace and the companies behind it providing their data to other Snowflake customers. I mentioned before any type of departmental or hierarchical alignment of roles to permissions.
And then one that we see quite often is when we migrate systems that have been traditionally on premise but oftentimes on the cloud now. And we migrate those operational systems like Oracle EBS or PeopleSoft or Salesforce.com.
And when we migrate that into the cloud, the roles have already been established in those operational systems. So what we want to do is we want to find a way inside of Snowflake to respect those roles that still exist in those systems that are in use today.
But we want to provide it in a way that emulates it correctly, almost in harmony. And then still be able to provide an extra layer on top of that if necessary. And so that’s a typical use case that we see with a lot of different companies in the wild.
So when we think about a very general problem of securing data, let’s take a look at what we also see often, which is sales territories and sales groups within organization. So let’s walk through this example.
So if we had a sales team that’s divided into regions or territories, then the sales data is typically centralized in an operational system, and that sales data is then synchronized, or ETL’d or ELT’d, over into a data warehouse, where, again, it’s traditionally going to be in the same database, the same schema, the same set of tables, for efficiency.
But the sales team, each individual in the sales team, whether it be a manager or a field rep or something like that, or business development, they have different levels of responsibility. And typically those levels of responsibility align with levels of access.
So if you’re a manager across regions, you typically have responsibility for people that report up to you, so they’re underneath you and they might have different territories, right? So then sales team leaders should not see other sales team leaders’ regional or territorial sales data; that could be a conflict of interest.
You might have something like different incentive plans or salaries, and you don’t want one territory or region manager seeing another regional manager’s data. So as you can see in the diagram on the left, there might be a complete roll-up to the regional manager or the vice president of sales, something like that, so that person can see all the data underneath.
But if an individual is only at one of those lower hierarchies, then they should only see the hierarchy level they’re at and the data underneath it. Or if they’re at the lowest level, then they should only see that lowest level of data.
So if we look at the diagram here, this is pretty interesting, right? We’re going to walk you through this. This is an animated slide and we’re going to step through it animation by animation. So let’s pretend we’ve got two sales managers.
We can give them whatever names we want. And one is a manager of the western region, one is a manager of the eastern region. And so then what we might have as we build it up is we have maybe a sales lead or a territory lead of the Mountain and west region.
And so that individual can see the data at the Mountain and west region and in this case from Nevada, but they cannot see the western region total data. Maybe there’s other regions like south or something like that.
So they won’t be able to see that information. If we look at another, maybe a business development rep on the eastern region sales team that might just have the territory of Virginia, we can see here that this business development rep will only be able to see the data for Virginia.
He won’t be able to see anything above Virginia. So he wouldn’t be able to see, like, a mid-Atlantic roll-up or an eastern region roll-up. He can’t see data for Ohio or Washington, DC, as that’s probably going to fall under the mid-Atlantic or the eastern region.
So what’s going to happen next? If the eastern region manager tries to see the data from the western region, that should not happen, right? We don’t want that to happen. And if the business development rep in Virginia wants to try to look at data from Nevada, that shouldn’t work. And if they want to look at a higher level, like the mid-Atlantic, that shouldn’t work either. So this restriction of the data, as it would exist in the same table, in the same schema, the same Snowflake database, can be accomplished by using secure views.
Okay, so let me switch over to a little bit of a walkthrough on how we can do that inside of Snowflake. So we’re going to take a look at some code, and I’m going to ask the team to confirm what they’re seeing so that we are making sure that we’re looking at the right thing.
We see your Snowflake worksheet. Excellent. That’s what I was hoping for. All right, so we’re going to walk through some code here real quick. Let’s see what I left over here. So we have our meet up schema here.
We don’t have anything in it right now. And so what I’m going to do is just create a mock-up schema. The schema already exists, and we’ll see if we have any views in that schema. We do not. I’m going to create a table called W_SALES_FACT for our mock-up data warehouse, and actually, let me switch my role.
Sorry, that’s not where I want to be. Okay. So I’m going to go and create the sales fact. Actually, let me do one more thing; I switched to the wrong role. Let’s try that again. Switch to the right role, and we’re going to make sure we have the schema, we should, and that we don’t have any views.
And let’s see if we can create this fact table. Yes, fact table created. Now what we’re going to do is we’re just going to populate a little bit of sales data. Inside of the sales fact table we just created.
So it’s just, you know, hey, these are kind of SKUs and products. And what we’re doing here, it’s a flat table, so we’re just aligning each row to a region. So you can see here Nevada, Colorado, Virginia, just like the slide showed us a few minutes ago.
So let’s go ahead and insert all those records. So we have about eleven records coming through, and we can take a look at those records, kind of move back a little bit. So we can see we have the records showing we have product name, sales regions, maybe a roll up to what team that is, or denormalized table here, what was the sales price and then maybe the quantity for each one of those transactions.
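The actual demo code isn’t shown in the transcript, so here is a rough reconstruction of what that mock-up fact table and sample rows might look like; column names and values are illustrative:

```sql
-- Hypothetical mock-up of the denormalized sales fact table from the demo.
CREATE OR REPLACE TABLE w_sales_fact (
    product_name  VARCHAR,
    sales_region  VARCHAR,      -- e.g. Nevada, Colorado, Virginia
    sales_team    VARCHAR,      -- roll-up to a team, as described above
    sales_price   NUMBER(10,2),
    quantity      NUMBER
);

-- A few sample transactions mixed across regions, as in the walkthrough.
INSERT INTO w_sales_fact VALUES
    ('Widget A', 'Nevada',         'West', 19.99, 3),
    ('Widget B', 'Colorado',       'West', 24.50, 1),
    ('Widget A', 'Virginia',       'East', 19.99, 5),
    ('Widget C', 'North Carolina', 'East', 12.00, 2);
```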
So now you can see if I’m the sales region manager or business development person, I’ve got Colorado, Nevada, I’ve got Virginia, North Carolina. So the data is all mixed into the same table. And this is very traditional.
There’s absolutely nothing wrong with modeling your data in this way. I would say probably 99% of the organizations that use any form of analytics or data warehouse data modeling will model a transactional to denormalized fact table in this way.
And I don’t care if you’re doing Data Vault or Kimball Dimensional. It doesn’t matter. You’re going to model a table this way for efficiency. And so what we want to do is, how do we determine if a user who should be in the Virginia region should only see Virginia data?
Let’s take a look at that. What we’re going to do is create what we call a reference table. And this reference table could exist in the same schema, or it could exist in a schema outside of the main schema where the fact table exists.
So we’ll run this; we’ll create it, very simple. And typically we’re going to create a somewhat more complex reference table that holds the security. But just know that this is really common practice in order to create a secure view within Snowflake.
So we just have a table with two columns. Now what we’re going to do is just insert and populate that table with some basic data. Now, the important thing to note here is that we’re creating a reference table.
And what we want to do is align the level of granularity or what they should be able to see with a role. Now mind you, this access role is a Snowflake role and this is how Snowflake secure views work.
You’re aligning the role of the users that exist in that role to the level of privilege or the level of granularity for which they should have access. I have six rows inserted in there and you can see just a list of regions here.
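A sketch of what that two-column reference table might look like, again with illustrative names and role values:

```sql
-- Hypothetical reference (security) table mapping a Snowflake role to
-- the regions that role is allowed to see.
CREATE OR REPLACE TABLE sales_region_access (
    sales_region VARCHAR,
    access_role  VARCHAR
);

-- Align each region with the Snowflake role that should see it.
INSERT INTO sales_region_access VALUES
    ('Virginia',       'DEMO_ROLE'),
    ('North Carolina', 'EAST_ROLE'),
    ('Nevada',         'WEST_ROLE'),
    ('Colorado',       'WEST_ROLE'),
    ('California',     'WEST_ROLE'),
    ('Virginia',       'EAST_ROLE');
```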
Virginia, California, Colorado, so forth and so on. And now what I want to do is just because this is a demo and we’re not doing a full blown out implementation, we don’t have a lot of existing roles that align to territories and things of that nature.
But you can really make sense of what we’re doing here. What I’m going to do is just assign the ability for the default public role to use the schema and be able to select from the view. And you’ll see that work here in a minute.
So I’m going to run the permission to grant the role public access to this new schema and then I’m going to grant access to the role public to basically conduct a select on all the views that get created in this example.
And notice I’m not doing future views. So I just created, well, you’ll see in a second. And notice that I’m using the demo role. So I’m in the demo role and I’m granting access to the public role. So now we’re going to go and create the view.
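The grants described here might look roughly like the following sketch (schema and role names are illustrative). Note the absence of an ON FUTURE VIEWS grant, which matters later in the demo:

```sql
-- Allow the PUBLIC role to use the schema and select from views that
-- exist right now. Because this is ON ALL VIEWS rather than
-- ON FUTURE VIEWS, any view created afterward needs a re-grant.
GRANT USAGE ON SCHEMA meetup TO ROLE public;
GRANT SELECT ON ALL VIEWS IN SCHEMA meetup TO ROLE public;
```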
If I break this apart just a little bit so it’s maybe a little more legible, what are we doing with this view? So let’s break this apart: on line 68, we’re doing a CREATE OR REPLACE VIEW, very standard SQL.
But I’m adding a different qualifier here: I’m adding SECURE. So instead of it just being a regular view, which is pretty standard ANSI SQL, I’m changing it to a secure view, which becomes, let’s just call it, Snowflake SQL.
Now, the secure view uses pretty much all the same syntax as a normal view: create the view, and I’m calling it VW_SALES_FACT, as, and I’m selecting from that fact table. But what I’m doing differently here is I’m predicating on the sales region, which I showed above, so Virginia, Nevada, Colorado, et cetera.
And then I’m doing a subquery where I’m pulling the one identifier, which is the access level that associates to the region in the fact table. So I’m associating the region in the fact table to the region in the reference table.
And then in the reference table I’m doing a predicate, or filter, on the access role. So it should only show me the regions where the access role is in play for the current session. So let me go ahead and run this; view’s created, and with the demo role, let’s see what we have.
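Putting it together, the secure view described here might look something like this sketch. Object names follow the illustrative examples above, and CURRENT_ROLE() is the Snowflake context function that returns the session’s active role:

```sql
-- Secure view that only returns fact rows whose region is mapped to
-- the session's current role in the reference table.
CREATE OR REPLACE SECURE VIEW vw_sales_fact AS
SELECT *
FROM w_sales_fact f
WHERE f.sales_region IN (
    SELECT a.sales_region
    FROM sales_region_access a
    WHERE a.access_role = CURRENT_ROLE()
);
```

With the reference rows sketched earlier, a session running as DEMO_ROLE would only see Virginia records, exactly as shown in the demo.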
So I’m going to run this and you can see here I’ve done a select from my view, and I can only see the sales records and detail from Virginia. If I just scroll up slightly here and take a look at our DML, if I look at where the demo role is in this list, then I can see the demo role only has access to Virginia. I want to look at that inside the grid here, just to be a little more clarifying. We can see here that for the demo role, which is my current role, I can only see Virginia.
So let’s move down a little bit and let’s go ahead and change the role to public. So I’m switching the role to public. We can confirm that not only up here, but we can also confirm it by running the select.
And I can see that my role has changed to Public. And now let me select from the view. Look at that. So the view does not exist or is not authorized. And in this case, it’s actually not authorized.
So we did this on purpose, just as an elementary refresher. Once you grant a permission, particularly SELECT on a set of objects, in this case all views, or on an individual object, if you do it before the object is created, as we did here, so I granted it here, but I didn’t create the view until here, then the grant doesn’t actually take effect on the object, because the object wasn’t available when you ran the grant. So what I have to do is switch back to the demo role, which owns the objects, the owner of that object.
I know, this would be an extra security lesson, not just a secure view lesson. And now I can run the grant permission and you can see one object is impacted. So now I can grant that. And now if I switch back roles, let me actually go back here and switch back to the public role.
I can see that I’m in the public role. And now if I run this, I should be able to see, oh, I don’t have any warehouses. Nope, thought I had compute, so I don’t have a warehouse available to me for some reason.
But if I did run this, I would be able to see that I have access to all the ones that are in public. Let’s see if I can actually modify that here in a minute. So that is the gist of secure views. And I’m going to switch back to the slide deck and do a little bit of a recap here.
And in the time being, if I can get that Warehouse running, we can see it run all the way through with the public role access. Well, let me just switch over here. Got it running also. So we actually walked through this example.
So the user coming out of Virginia was only able to see data from Virginia. So that was the hands-on exercise. Very cool. Now, if we think about the steps that we took, really it’s not very complex.
We could make the reference table as complex as we want to make it. We can obviously model our tables the way that we want, hopefully following some level of best practice. But really what we want to do is determine which objects we want to apply security on, because not all objects will require row-level security. We might have some tables, like dimension tables, that don’t require that level of security. But if you do apply secure views on top of one of the access points, sometimes it makes sense to do it to almost all of them.
And there are some cases where there’s actually a need for that. But we’ve seen really a mixed set of secure views as it relates to data warehouses, data marts and the like. So your mileage may vary, but one of the first things you want to do upfront is really understand what you are filtering.
And sometimes this helps to do that at the very beginning of a project. And probably one of the near last things is just test, test, test, because you want to look at where the roles are coming from.
Do those roles that you’re applying the security to align with an operational system? If so, then it’s more of an end to end process. The users in the operational system should have very similar security, visibility experience that they have with your data and Snowflake or data warehouse and so forth.
Then you also want to document this somewhere. But ultimately you want to understand: is this almost like a matrix of users to roles? Because you could wind up in a situation where you have a very low level of granularity, and you start assigning roles, and the next thing you know you have a plethora of roles, and then maintenance becomes a bear.
So drawing it out, doing whiteboarding, almost taking us to the level of first drafting this into an Excel spreadsheet, those types of things might really help. And then probably one of the last ones in testing is just again, we’d like to go back to will this be something required by an API or another system?
Might you use dbt or dbt Cloud or a transformation tool where you’re going to be using a certain service account that’s going to need a higher level of permission to basically see all the rows? So this is something else to take into consideration as you’re using secure views.
When we think about view security, do we really need it? Do we have to have it? No, you don’t. But there are just a lot of cases where this is an excellent add-in. When we think about clean rooms, when we think about multi-tenancy, when we think about restricting access for policy reasons, it is a great addition to the Snowflake platform.
And then again, if you’re looking for even more control, like with PII types of data, then the higher editions allow you to do things at the column level, which is really important, so that you’re not doing double work; you’re not creating five different tables that just have different subsets of the data because you didn’t know how else to restrict or model this.
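For context, that column-level control on the higher editions is typically done with masking policies. A minimal sketch, with hypothetical policy, role, and column names, might look like:

```sql
-- Masking policy that reveals a PII column only to a privileged role;
-- everyone else sees a masked placeholder.
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
        ELSE '*** MASKED ***'
    END;

-- Attach the policy to a column on a (hypothetical) customers table.
ALTER TABLE customers MODIFY COLUMN email
    SET MASKING POLICY email_mask;
```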
So these types of tools in your toolbox really, really add to your flexibility and your capability to be fast developed quickly, get a great return on investment, and then again make the best out of the product in general.
All right, so I think at the end of the day, it’s a thumbs up for secure views and view security. All right, I’ll turn this back over. But just to throw out one thing. Looking at Snowflake, looking at the utilization of your Snowflake account and across your organization is super important.
And so we’ve actually been privy to develop this great platform where we can view what our daily, almost down to the minute, usage of Snowflake is, where we look across queries, queries that are causing problems, long-running queries, looking at the cost and spend of Snowflake and Snowflake accounts, through DataLakeHouse.io.
So if anybody is interested in taking a look at their Snowflake usage from an analytical perspective, definitely reach out to us, leave a comment, and again, that’s DataLakeHouse.io. If you’re synchronizing data, look at DataLakeHouse as an option.
We are growing the community on Data Lake House so if you’re interested in getting a super cool pair of socks potentially before winter is over, definitely take a look at joining us and all the cool things and all the awesome conversations we’re having on our Slack channel.
We’d love to have you guys join. We’ll put the link in the meetup comments and you can just click on that and join, and then pick a pair of socks that you might be interested in. We’ll send those over to you.
So I think we’ll open it up for some Q&A. Yes, so feel free to put your question into the chat here. We’ve got some questions rolling in, and one of the first questions is: are there performance considerations between a secure view and a non-secure view?
That is a really good question. Ah you know, a while back we did a kind of a very light benchmark on this and I haven’t seen one from Snowflake ever. There has been some discussion on some Snowflake community boards where they say it’s just immaterial.
I think it’s almost not quantifiable from the perspective of any performance drawback. I mean, obviously you’re doing a subquery, but your reference table is, and should be, very small when you’re looking at role-to-permission or level of granularity.
And even if you have millions, hundreds of millions of rows, that lookup is so small; in theory that table is ultimately cached as a reference table. So I don’t think there’s any material performance concern.
Okay, perfect. Let’s see when you’re using all right, let me see if I can get this one here. This one might be too late in the day where my eyes aren’t working quite right here. So bear with me. Stumbled to this one.
So when you’re using secure views and data sharing, are there particular, let me back up. How would you handle secure views and data sharing? One, should you do it? And two, if so, any pitfalls or gotchas when going down the data sharing route?
Yeah, I feel like this is a question on the Snowflake Administration certification exam. But yeah, at the end of the day. Right. From a data sharing perspective, we do this fairly often. The flexibility is there.
Right. So should you or should you not do it? I think it boils down to the scenario that you’re in: who’s the organization, or the individual, or the tenant that you’re sharing that data with, and then how are you breaking it out for your share?
Right? So that’s kind of where you start on your side, from your Snowflake account perspective, and then there are probably a couple of different ways you can go about it. I think it does empower sharing that data to use view security, because you’re in complete control and then you’re just really sharing out the view, if you will.
But we’ve seen some situations where it just makes sense to spawn off a subset. So you can very easily create a share that’s being populated every, for example, one hour, or whatever frequency, from your dbt or your Dataform model.
Right? Let’s just go with dbt Cloud; it’s a great tool. So maybe you have a frequency set up for every two hours, and as part of your data sharing policy you have an SLA or an SLO of every two hours, and your window to transform your data falls easily within that.
So then the expectation is every 2 hours you have this separate schema or a separate set of tables or even database that you’ve shared out, particularly for that tenant or that customer. And it holds then absolutely only the data they should see.
So I think your models might vary in this situation where secure views could be a great use case, but there could be other options when it comes to data sharing that might fit the bill a little bit better.
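For the secure-view route to data sharing, a share wrapping the view might be set up roughly like this sketch (account and object names are placeholders). Note that Snowflake only allows secure views, not regular views, to be added to a share:

```sql
-- Create a share and expose the secure view to a consumer account.
CREATE SHARE sales_share;

GRANT USAGE  ON DATABASE demo_db               TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   demo_db.meetup        TO SHARE sales_share;
GRANT SELECT ON VIEW demo_db.meetup.vw_sales_fact TO SHARE sales_share;

-- Add the consumer's account identifier (placeholder shown here).
ALTER SHARE sales_share ADD ACCOUNTS = partner_org.partner_account;
```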
Okay, perfect. And then one final question, and I’m sure we’ll get some chuckles on this one: can you help us implement secure views, as we are new to it and we have a lot of HIPAA data? Absolutely. We help Snowflake customers all the time.
We are one of the top partners for Snowflake on the East Coast. We host the meetups, we do a lot of whiteboarding lunch and learns and Data for Breakfast.
Just really helping people get the most out of their Snowflake account and their Snowflake investments. And because we’ve done implementations so frequently, of all sizes, we’re probably the best suited to help out Snowflake customers, whether they’re just starting off and need help with the basics, or they have much larger, let’s call it enterprise, opportunities and challenges, large migrations, things of that nature, where they might have very much a phased or milestone-based approach in their program. And they start off, perhaps, with the migration, but then they need to come back and apply a layer of security on top of it using secure views.
So absolutely. Just drop us a comment and we’ll set up some time to chat with you. All right, perfect. Thank you. Those are all the questions that I see coming through. Thank you, everyone, for asking questions.
Next month we’ll be going over Apache Iceberg on Snowflake and providing general information about Apache Iceberg. And then in February, we’ll be going over Snowflake micro partitions, so we’ve got some good stuff coming up and more to come.
Also, we’re now a part of the modern data professionals meet up network, so we’ve also got a couple of groups in there too for Midwest and East Coast, where we’ll be having some in person events coming up soon.
Thanks, everybody. Have a great night!
To fill in the gap between engineering and the business side, a new role of analytics engineer came into play.
So, why do Data Analysts need a Software Engineer in their team? From dbt version control, continuous integration and GitHub actions to code quality and proper testing, it all automates the workflow of a data analyst and provides data quality for the entire company.
Other Meetups and Events to check out:
00:01 Hello everyone. Welcome to another great Carolina snowflake meet up group event. Today we will be going over bringing software engineering principles into data analytics. And we have a special guest speaker, Anna from Code Signal.
00:20 Just quickly pulling up our meetup rules. I’m not going to read through them all because you guys have heard them all a thousand times, but these are our meetup rules, and we’ve got our previous meetups available on our YouTube channel, and we’ll do a quick recap of Snowflake with Mike.
00:41 Yeah, for some folks that are new to Snowflake or just getting into Snowflake, and even if you’re a seasoned veteran, we just want to get everyone on the same page of what is Snowflake. We talk about the Data Cloud.
00:56 It’s really a comprehensive, fast, entire platform, from data ingestion to governance and all the way to data sharing, as you look to monetize your data within your organization and scale; it’s really infinite, all done through a consumption-based model.
01:16 At the end of the day, when we talk about Snowflake, we’re peeling back some of the layers on what it solves, what it addresses. We talked about data integration: it doesn’t matter where your data sits, whether it’s in a transactional database, some third-party source, scraped web logs, or IoT data you’re ingesting; Snowflake transforms that into essentially the different structures that are needed by your different user groups within your organization.
01:45 Whether it’s somebody that’s building an application that just really wants some lightly curated data, to folks within the organization that may be more executive-based, that are just consuming enterprise dashboards through a tool in a read-only type of format.
02:03 Snowflake handles all of that transparently, in an easy-to-use method where you’re just paying for what you’re using. You don’t have to do those really large capacity planning exercises of the old on-prem days.
02:18 All right, we’ll go ahead and turn it over to Anna. Welcome, Anna. I’m Anna Yeomans. I’m a data analytics engineer at CodeSignal. Today we’ll be talking about bringing software engineering principles into data analytics.
02:36 Okay, so first, what is CodeSignal, the company I’m working for? We are a technical interview and assessment solution platform, which is honestly a really, really cool product to work on. We essentially help companies hire engineers and other technical roles by making industry-standardized assessments.
02:57 And we also have a live interview tool. Okay, so we are trusted by some cool companies, and there are different industries. Some of the big names are Meta, Uber, Visa, Capital One, and others. So here are modern data stack at Code Signals.
03:15 These are just some of the technologies; there isn’t space to list all of them. Our raw data comes from very different sources, like a product database and some custom and not-so-custom event tracking.

03:29 We have some business apps whose data comes from a variety of places. Essentially, the point is that all of the raw data is very different, and it’s all loaded through the home base for everything, which is Fivetran, our ETL tool.
03:48 ETL stands for extract, transform, and load. All of this data then moves into Snowflake, and that’s where Snowflake comes in, so hopefully you have the context. We also have some custom data manipulations that are connected to Snowflake.
04:06 And then, because the raw data can be very difficult to work with, we use a data modeling tool, which is dbt, and that’s where we essentially clean, test, and document the data so it’s ready for users to use.

04:24 On the BI and data science side, we’re also using Mode Analytics. Today we’ll mostly talk about dbt. Here on the left side, this is how an organization might look without dbt.
04:41 Analysts make data requests to engineers, who then put those requests in a queue, and the analysts will wait around, and wait, and wait until those requests are done. On the right side, this is how an organization looks

05:01 with dbt, and this is basically how my team looks: engineers and analysts collaborate very closely together, and they are able to complete requests much faster. dbt was inspired by software engineering workflows like continuous integration, version control, and automated testing.
05:23 If you are a software engineer in the room, a lot of things in this presentation will sound obvious. But if you’re an analyst, basically what we are doing here is applying the typical workflows software engineers use to make code changes as easy and as safe as possible.
05:42 In organizations, you will mostly see people who understand the business and people who understand the technology. The business people understand all of the marketing concepts and sales concepts; they understand why certain things and processes are changing in Salesforce.

06:01 On the other side, the technology people understand how the platform operates; they understand the importance of version control and how to make processes as automated as possible.
06:16 They understand the importance of continuous integration. I would say dbt is really where the two worlds meet, and that’s how the role of analytics engineer came into the market. So what is dbt?
06:33 dbt is a data modeling tool that essentially helps transform the raw data you see on the left side. Through development, we test and clean the raw data, document what everything is, and then deploy it for everyone.

06:55 And that’s where analysts can use BI tools, machine learning tools, and others to actually bring that data in and make sense of everything. Here I’ve listed a few software engineering principles that, in my opinion, a data team should not live without; this is just a sample, and we’ll dive deeper into some of them as well.
07:22 You should always test data on staging before moving anything to production. You can use GitHub Actions to test models, create a single style for the whole team, and check it on pull requests.
07:40 You can also invest in tools and in the analyst developer experience. Because analysts don’t normally work with software engineering tooling, you can design those tools and that experience around whoever is going to be working with them, to make their life easier and keep them comfortable.

07:59 That means using VS Code, Git, and GitHub, requiring at least one reviewer to merge any changes to production, and making builds fast and simple with continuous integration. I would say the most core software engineering principle is using Git, which is what software engineers use every day.
08:21 For example, my terminal is always open. So what is Git? It is a version control system that essentially allows us to save and view changes at a whole new level. From this point on I might use the word code, and what I’m really referring to is data models, not necessarily a code base or application code.
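To make that concrete, here is a minimal, self-contained sketch of treating a data model like code with Git. The model file name, the SQL inside it, and the commit message are all made up for illustration:

```shell
# Create a throwaway repository and version a SQL data model in it,
# exactly the way application code would be versioned.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "analyst@example.com"
git config user.name "Analyst"

mkdir -p models
# A hypothetical staging model, stored as a plain .sql file:
cat > models/stg_users.sql <<'SQL'
select id, created_at
from raw.users
SQL

git add models/stg_users.sql
git commit -q -m "add stg_users staging model"

# Every change to the model is now recorded and reviewable:
git log --oneline
```

From here, pushing the branch to GitHub and opening a pull request gives every model change a reviewer and a permanent history.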
08:54 But my point is that data models can be treated completely like code, and thus need Git. dbt does have a UI, and it’s really nice; you can see it on the left side. But it really has much less functionality than software engineering tools like GitHub or Git.

09:16 If you’re using VS Code, you can have the dbt command line and various VS Code plugins that help you be more efficient as you work. You can have linting, and you can use Git and manage your data models just like any software engineer would.
09:34 Another software engineering principle is continuous integration. That means pretty much automating what happens after any code change. The idea is to remove the manual steps from testing the new code, deploying the new code, and any other part of the code change, so it all happens automatically without any effort from the analyst or engineer, depending on who is building the model.
10:07 And here is just a small example of something we use. Even though dbt does provide great built-in documentation, we did face, not an issue exactly, but a problem that they could improve.

10:26 It gets very difficult to manage users’ access to dbt, and it gets very difficult to share models with someone internally, because then you have to check whether they have dbt access or not.
10:40 So what we did was auto-deploy our documentation with Netlify. If you haven’t heard of Netlify, I recommend checking it out; it’s basically a serverless backend service for applications. You can do this using GitHub Actions, which makes the documentation available to the whole company.

11:01 And it’s pretty easy to do. You can have SSO on Netlify, or you can also protect it with a global password or passwords. The next principles are using GitHub Actions to test models and enforcing a single style on pull requests.
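One way the auto-deploy described above might be wired up is a GitHub Actions workflow along these lines. This is a sketch, not the team’s exact setup: the secret names, the adapter package, and the docs output directory are assumptions to adapt to your own project:

```yaml
# Hypothetical workflow: build dbt docs and publish them to Netlify on every
# push to main. Secret names and paths are illustrative assumptions.
name: deploy-dbt-docs
on:
  push:
    branches: [main]
jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake
      - run: dbt docs generate        # writes the static docs site into target/
      - run: npx netlify-cli deploy --prod --dir=target
        env:
          NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
          NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
```

With something like this in place, the documentation URL can be shared with anyone in the company, with or without a dbt license.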
11:24 The next software engineering principle, I would say, let’s call it testing: always test your code, and never merge untested code to production. The importance of testing goes, I would say,

11:41 without saying; it’s something that is obvious, and data models should not be an exception to that. When I say testing, I don’t just mean running some SQL and seeing if the statements return what you expect.
11:57 What I really mean is that you need to unit test your code and have some automated process that will also run those unit tests for you, so you can make testing a part of your workflow. This is how my team uses testing.
12:17 We run all of these tests locally while developing models. We also run tests on pull requests. In dbt, instead of running all of the data models, you can set it up so that on pull requests you only run and test the modified models.

12:36 And here’s a screenshot as an example. You would say something like dbt run --select state:modified, versus, if you’re running a scheduled routine that builds and tests all models, you can just say dbt run.
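The selection syntax being described looks roughly like this. It is a CLI sketch only, shown for illustration rather than run here: it assumes a dbt project plus saved production artifacts to compare state against, and the ./prod-artifacts path is a made-up example:

```shell
# On a pull request: build and test only models changed relative to production.
# --state points at artifacts (manifest.json) saved from the last production run;
# the trailing + also includes everything downstream of the modified models.
dbt run  --select state:modified+ --state ./prod-artifacts
dbt test --select state:modified+ --state ./prod-artifacts

# On a scheduled run: build and test everything.
dbt run
dbt test
```

Limiting pull-request builds to modified models keeps CI fast even as the project grows to hundreds of models.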
12:50 That is just a small example of what you can do. Then linting: linting really means standardizing the code style for the whole team so that there is no confusion for coworkers going through your styling as they review code.
13:09 It can be a part of the developer workflow as well; with tools like pre-commit, you can also test it on pull requests, and you can use a VS Code plugin to have it set up locally too. Our team is using SQLFluff as a SQL linter.

13:29 If you haven’t heard about it, I do recommend checking it out. It’s a really cool tool to use, and it’s easy to set up. On the right side is just a screenshot of how your setup for SQLFluff might look.

13:44 It’s just that .sqlfluff file. You can then decide within your team what kind of styling you want standardized for everyone, and you can automate it in development using a tool like pre-commit that will fix the code automatically every time you try to make a commit.
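For reference, a .sqlfluff file is a small INI-style config. A minimal sketch might look like the following; the dialect and rule choices here are only example preferences, and exactly where each option lives varies a bit between SQLFluff versions:

```ini
[sqlfluff]
dialect = snowflake
templater = dbt
max_line_length = 100

[sqlfluff:rules:capitalisation.keywords]
# Example team preference: lowercase SQL keywords everywhere.
capitalisation_policy = lower
```

With the config in place, sqlfluff lint reports violations in your models and sqlfluff fix rewrites whatever it can correct automatically.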
14:05 You can also set up that linting check on pull requests, and then your reviewer will see whether all of the linting tests pass or not. And that brings us to the conclusion. Once again, I want to say dbt is a great tool, and it’s very easy to use together with Snowflake.

14:25 Using dbt and software engineering principles can really improve your team’s workflow and help you be more efficient. Also, I would say from my experience working with analysts, they can come from very different backgrounds; some of them might come with previous experience of software engineering tools like continuous integration and version control.
14:53 But some of them might not have that experience, and that’s okay, because analysts are very technical people and they can adapt to any tools you build for them pretty fast. So do invest in those tools, and do write good documentation for them to use, adding some screenshots as well,

15:12 so that the process is as easy as possible for them, and I promise they will enjoy using those tools we build for them. Most of the things we use, and all of the software engineering principles we use, are based on specific problems that we had to solve,
15:32 like, for example, making documentation available for everyone, or running tests on pull requests to save your reviewers’ time. I strongly encourage you to do some research, or maybe even use the tools that we covered today or some similar ones; hopefully some of the examples we covered inspire you to solve the problems you have at your company.
16:01 And just the last thing I want to say: even though data modeling, and any data engineering or analytics engineering work, does sound like it’s moving outside of software engineering and towards analytics, it really shouldn’t be that way.

16:20 Do try to use software engineering principles as much as you can, because I do promise your life will be much easier and the job will be more efficient. And thank you. Do you have any questions? Yes, a couple of questions came in.
16:36 Thank you for your presentation; it was fantastic. Some good questions here. Let me throw them at you one at a time and give you a chance to answer them. One of the questions that came in is: is SQLFluff free or not?

16:52 It is free. It’s open source, it’s very well maintained, and there is great documentation. They even have recommendations for some basic rules they suggest you use. That doesn’t mean you have to use them, but I found it very helpful for standardizing SQL code.
17:11 Okay, perfect. And then there was a question that says: please re-explain or dive deeper into how you use Netlify. Yeah, good question. So again, we had to deploy our documentation somewhere, right?

17:31 So we just used Netlify. How can you do it? There are a lot of good resources online. You would use GitHub Actions, you would put in the Netlify key and the API token, and you would be able to deploy to Netlify. It’s pretty straightforward, but I also recommend signing up for the dbt community if you are not there yet.

17:59 It’s the dbt Slack group. And if you just Google deploying on Netlify, you will find a lot of different examples of how other companies wrote those actions, and that’s how I implemented ours. Okay, very good.
18:17 So another question is: how long did it take you to get all your data transformations into dbt? Good question. We actually did it pretty quickly, I guess; my team is very fast, and we are very excited to make data available and tested

18:36 for the whole company. It probably took us around three to four months to migrate everything, but that doesn’t mean we are done. Product changes happen with every deployment, once a week, and other teams keep adopting different business apps whose data also needs to be migrated.
19:01 So tickets always keep collecting. Okay, perfect. Another question: how would you recommend structuring a data project beyond dbt from a CI/CD pipeline perspective, as the source code typically spreads across multiple systems: data science in Mode, data engineering, and dbt?

19:23 It really depends on the project, right? I’m thinking of a custom integration that’s outside of dbt as an example. To be honest, even when we do a custom integration, we still test the data using dbt, but you could also implement testing with some Python library and run those tests automatically as well.
19:54 Okay, very good. One more question: how did you handle the change management of getting your team, and even others within the organization, bought into this new modern methodology for developing?

20:09 For using dbt specifically, or just the continuous integration and testing? It’s a pretty open-ended question; I’m not sure, so maybe just touch a little bit on both. Yeah, I would say I will answer
20:27 what I understand from the second part: how do you get people to believe in software engineering principles in the data analytics world, right? Really, just build the tools, provide trainings, give demos, write very in-depth documentation, and explain why you’re doing it.

20:46 And I do promise, if people have a comparison, like downloading CSVs and manually loading them somewhere versus the system doing it for you, and not having to test things yourself because the tooling tests your code for you,

21:07 I would say it saves so much time that people will want to follow those principles with whatever you build. Okay, perfect. Those are all the questions. Thank you so much for your presentation and for handling those questions.
21:26 Great job. Heather, we’ll turn it back to you. So I just wanted to quickly talk about our upcoming event, which we will be having on Monday, October 24 at 6:00 p.m. Eastern. We will be going over Snowflake health checks and talking about some best practices, and it’s going to be really great.
21:47 So we hope you join us, and invite a friend. All right, and then next up we’re going to be going to Coalesce. So if anybody here will be there, we’d love to meet up for coffee; you can let us know in the chat window or connect with Mike, whose email is here on the screen.

22:08 And that is all we’ve got for you guys today. Thank you guys so much for joining us, and we look forward to seeing you guys at the next one.