Mike Jelen

November 30, 2021
9:00 am

Snowflake Data Backup Concepts

Even though Snowflake has Time-Travel, how do you feel confident that your data is actually Backed-Up and recoverable after you’ve brought it into Snowflake. One way is to move it to a physical location like another database within your Snowflake account. Or you could even move your data from your Snowflake account to another Snowflake account using AWS S3 or another cloud vendor.

Check out our Meetup video below to watch an overview of this event:

Other Meetups and Events to check out:

Transcript from the Meetup event:
Okay, everyone, welcome to another Snowflake meet up. Thanks for joining. Today is November 29. We’ve got a great topic here which is backing up Snowflake or how to do backups in Snowflake.

Same basic things that we go over every time, so please go mute. But if you want to ask a question, definitely please use the chat function. And remember, you don’t have to turn the video on, but definitely go out and invite a friend and then try to get social.

Follow us on DatLakeHouse. Check us out on LinkedIn and all the other social platforms are on. So what we’re going to do today is we’re going to talk about backing up our backup strategies in Snowflake.

We’re going to walk through some actual codes like we like to do, and we’re going to hit we might hit time travel. Probably won’t have time for it today, but we’ll definitely talk about it. We’ll talk about cloning as a backup strategy.

We’ll also talk about using cloud storage, AWS, GCP, Azure as a backup strategy approach. And then we’ll just kind of wrap up and have an open discussion and then talk about next events. So as usual, we do report these will be out on YouTube typically about one to three days after the event.

So check that out. Be sure to subscribe and like us on YouTube. So again, a quick Snowflake refresher. What is Snowflake? For those that just might be joining? It is a super awesome data cloud experience where it’s really focused on almost having an infinite scale for computing and doing data computations in the cloud as a really first class citizen.

Compute and all the other aspects of data warehousing are available for ingesting data, governing data, accessing your data, and really again at a near infinite scale. So Snowflake is really solving the problem of.

All the major components through the data integration pipeline, all different types of data sources can be consumed. And really through integration, transformation, and other processes that are unique or native to Snowflake, we give consumers of data really great experience through that tooling.

All right, so let’s get into the topic for today’s meet up and then definitely looking for your guys participation so we can have a good conversation around backing up Snowflake. Just looking at the definition of backing up or taking a backup right from the computing definition, which I would love to have a computing definition.

Make a spare copy of data or disk. And I think in this case, we are talking data. So just real quick, I’m just checking to make sure that we are recording this session. That looks good. All right, so we’re backing up data.

I love the other definition of this, provide support for someone or something. So many ways that’s happening. But so what is backing up or backing up into a Snowflake? Well, so everybody loves data retention, right?

Policies, it’s a great thing, especially in the cloud. So it doesn’t matter if you’re coming from old school on premise tools, if you’re familiar with the cloud, like managing storage buckets, those types of things, there’s some sort of retention policy is super important.

And when it comes to Snowflakes, there’s no major difference there from the perspective of there’s data, you want to make sure you have the data when you need it. And if something happens to your data, you want to make sure you could recover from any type of data loss, right?

So we’ll talk about that data retention here as we go through the meetup. But we also think that backing up stuff like is super important because we talk to a lot of people who are still in that legacy mindset that they want to get that warm and fuzzy feeling, seeing a physical copy of instance of an object, a database, a table or schema sitting there in their Snowflake instance.

Sorry, Christian, one question and maybe you’ll get to it here, but just one thing that you’re going through this folks, there’s backing up in Snowflake, there’s snapshots, there’s Time Machine or Time Travel.

Sorry, it may be just kind of touch upon each of those three just for folks at home to be able to separate those. Yeah, so you have to remind me of three. But time travel is definitely this awesome feature that’s seemingly native to two Snowflakes of time travel.

I like Time machine, though. Time Machine. Code name time Travel influx. Capacitor. That’ll be good. Yeah, exactly. Back in time. From a perspective like data retention, this is an awesome tool, right?

Like if you drop a table or you delete some records or something like that, you can go back and restore that by writing some simple SQL commands inside of Snowflake. So you can basically say, hey, recovery.

You can undrop a drop table or drop schema. For example. You can do like what are some of the other commands? Like simply leveraging the before or the at kind of syntax. You want to look at data before this date or at this date based on and all set.

So that’s kind of time travel is really cool. But one interesting thing, this makes this conversation really. A little bit more well, gives it more credence is because even with time travel, so there’s different editions of Snowflake.

Everybody knows that. So by default, I think the standard retention period that time travel can use is one day for like a standard edition, like 24 hours, right. When you get to Enterprise edition and higher, the default can be set up anywhere between, let’s say one day and 90 days.

So that still means that if you want a snapshot or a backup from 180 days ago and you want to persist, that you kind of need to do that some way, shape or form. So that’s where some of the scenarios we’re going to talk about, I think come into play.

And I’m sure somebody is going to in the comments on our YouTube or the meet up, somebody’s going to jump in and probably give us a completely different scenario, which is awesome. We love to be proved wrong whenever possible, but we’ve seen this time and time again and this seems to be an effective method.

What were the other two things we had? Time travel was a snapshot. Yes, snapshots. And then just we’ll say just the concept of backing up in general. Yeah, absolutely. So then you’ve got snapshots, right.

So from the perspective of Snapshots, this is like, at any point in time, snapshot sort of concepts, like you’re doing some development or you’ve got some smart machine learning and they’ve got a training data set, and they’re kind of like, well, I want to hold on to that data set, that data set, that table, that group of tables.

That schema was awesome. Right. Got me, like really within my target zone. So you could simply take a quick snapshot of that data set or that Schema instance and then use that as a reference. You could obviously use the time travel capability of.

Of Snowflake and say, hey, you know, I have that one data set. It was working fine for my, my ML training model training. And now you’re on the very next stage, like, oh, well, I don’t like this model as much.

So you could use time travel to go back one day and then revert in essence to your previous data set. But by doing that, you’ve kind of manipulated the time stone, if you will, for that particular data set.

So if you had an actual physical other table called like, training data set underscore good as a snapshot, then you just immediately revert to that. Right? You can see it there. You know what zech was.

You could even do like an immediate comparison, again, which you can do with time travel. But again, some people like that weren’t fuzzy, having the object and then the last yeah, I think backup is just more about any way, shape or form giving a copy of that data so you can recover.

We’re going to talk about one method, which is just taking the data out of Snowflake and putting in an S Three. So many times people talk about going from S Three or any cloud storage vendor, Azure GCP, and putting that into Snowflake.

So this is more about, hey, we want to take a quick backup of a table, for example, and move that into our S Three bucket, maybe into a special sort of bucket or folder and preserve that for whatever time.

Right. So these are just concepts and scenarios which we see quite often and just talking to them. So hopefully that spoke to those two items. Those three items, yes. Thank you. Then everyone says Snowflake can do anything or everything and well, I think there’s lots of options there.

But your mileage may vary, right, depending on your skill set, on who you have working on your team, depending on how much time you have to work on things. So, you know, really. Again, your mileage may vary depending on what you know and you don’t know what you don’t know.

So it’s good you’re listening to this meet up and attending. Then, of course, we keep talking about time travel. Probably spend a lot of time on that because that’s kind of like a de facto thing. But I think at the end of the day it is.

Check your manual, of course, and understand the limitations on time travel. Like I said, it’s up to 90 days for that object for retention. And then I’m pretty sure that elapses, and you’re kind of left holding the bag.

So if you’re looking for another approach, it doesn’t seem like there’s really any major best practices out there. Maybe we’re actually creating some right here and sharing them with everyone. But these are just some scenarios, like I said, that we see quite often.

So there’s no real best practices other than time travel. So, yeah, heed our advice, possibly. And so we really see this idea of backing up in these discussion points around the following topics. Right?

So application and development, which we see a lot of. Again, I gave you the machine learning example from a training data set, just grabbing any type of historical snapshot. Maybe you’re doing something like you’re migrating financial systems or human resource systems and you want to make sure that whatever data integration process you have like actively flowing through your pipeline, loading your current let’s say you’ve got a raw landing zone or a staging area or final fact table or data vault, whatever you might have, that you’ve got a nice cutoff.

Or during your testing, you can do some nice comparison. You want to make sure that as you’re migrating source systems, that you’re catching everything before it flows too deeply into your existing system.

So. Capturing storical snapshots, point in time snapshots will help enable that visibility. And then, of course, automating, right? Nobody wants to sit there and click a button or run a script every single time you want to do a point in time snapshot or backup.

So, like, how can we automate that inside of Snowflake? Right? There might obviously people want to jump to like a Python script or something like that, which is awesome, we love Python. But at the same time, like, could you automate it inside of Snowflake?

Again, time travel we mentioned. Then of course, we’ve got this other concept which is actually pretty cool. Obviously you can share data sets inside a Snowflake, but what if you just wanted to simply migrate data between your Snowflake accounts, right?

So these other concepts that come in into like, backing up data. And then one last one, which we won’t talk too much about, this could be its own meetup, which is Snowflake, the command line interface versus UI versus API versus other.

So when you have all these tools, it’s kind of a question mark of like, which one do you use for the job? And I think what you’ll see in what we cover today is that really a combination of them could be used.

But a lot of this you can just set up in the UI. You don’t even have to have really any like, API dev skills or anything like that. So hopefully that’s going to be interesting to you guys. All right, so let’s run through a quick scenario.

So we’re going to talk about backing up Snowflake. We’ll talk about cloning. And what we’re going to do is we’re going to use a combination of procedures, tasks for automation. And then we’ll also, maybe if we have time, hit on data migration between accounts.

So let’s jump over into a quick script and see if we can we can do something with that kind of data. So I’m scrolled to the very top. And we were just updating this a minute ago. So what you’re looking at here, let me make this a little bit more visible case we need to for the folks who wear glasses like myself.

And so what we’re going to do is just kind of walk through the scripts that we created and hopefully it works. Also time to kind of quickly modify these things as we go on and before we get to the meet up.

So over on the left hand side, you do not see a database called backup database. So let’s get in a character here. Let’s start running some of this. So we’re going to do is just basically set our contacts throughout the Bat standard mode of operation.

And then we’re going to use this admin. We’re going to create a database here. So we’re going to create a backup database in this backup database we’re going to use and again, we like to use comments when we create objects.

It’s helpful. If you don’t do it, just start doing it. It’s a great practice. And then we’re going to let me see, let’s use this database again, even though we’re using it, we’re going to go ahead and grant all the privileges on this public schema over to daily health roles.

All right. Then the next thing we’re doing, I’ll walk us through this sort of procedure and again all the code will be out on our GitHub site. So be sure to start our GitHub repository for the meet ups in our webinars.

This is a simple store procedure and we definitely have to get a meet up going about store procedures. There’s so much about store procedures that to go into regarding everything such as the permissions required to create it, all the funky stuff you can do.

We’re looking forward to having our Snow Park meet up here sometime in the next quarter or two and we’ll cover some of the stuff. But let’s. Oh, and last thing here is like this idea of the executed caller, which I have commented out because we want to execute it under the role that I’m creating this as, which as I notice I don’t have the type role which we want to use as admin when we create this.

Okay? So this sort procedure, what it’s going to do is it’s going to get the date, the current date, it’s going to split that date up in a way that it really is just going to remove any dashes or anything.

We’re just looking at the numeric values, right? So only there should be eight characters in this numeric value, right? 2020, 111 29 for today set our variable and this is basically TypeScript or theory JavaScript, right, for those who might be familiar with that.

And so what we’re going to do here is we’re going to take this, input this parameter, we’re going to pass an argument into this parameter and we’re looking for the fully qualified domain name of the schema.

So this is actually a schema backup process. We’ll show you here. And all we’ve done is we’ve used the DML here to create a schema. We’re going to call it by the name that’s passed in which we the schema name, and we are going to split that name, that flip off name so that it really just gets us a schema name.

And we are going to add that to the backup database, the one we just created above, and we’re going to append the underscore 2020, 111, 29 to it. And then another thing that we’re doing is we’re actually going to call a grant statement here.

So we’re going to grant selecting all the tables in that backup. A database schema. We’re going to give that to a specific role. So maybe this like a read only role for your developers. In this case is our Data Lake Calcium score or our generic one.

Then what we do here is we then turn that into a statement from basically the Snowflake, if you will, driver. And then we’re going to execute both of those. So do the create schema backup first, and then we’ll set the permissions on that.

And so that’s it for the statement. And so once I create this procedure, if it lets me, hopefully everything goes well here. All right, so I created it. Then what I do, I’ll go over here to the left.

I’ll go to Backup database. I don’t see any new schemas. And let’s see if this works. So I’m just going to call this particular store procedure, since I created it here, I’m just going to go and snag the context just to give it the best chance of actually working.

I think this should work. Let’s see. Okay, cool. So you can see that I’m calling a store procedure. So I’ve never used a stored procedure for it’s an individual line execution. So you have to run the call command and then the name of the store procedure.

We typically like to put store procedures and functions and tasks and things like that into the public schema of the database whenever in question, depending on what we’re doing. Oftentimes we’ll create another schema called Z integration.

And that scheme, so we know that any sort of integration, like pipes and stages, they’ll go into that one instead of public. So again, your models may vary, but those are our best practices. We call the store procedure by name and then we actually pass in.

Now this is just a string. So you might be noticing, like, why are we using double quotes like this? Well, we’re not looking for any super case sensitivity. In that case, there’s no special characters, so we don’t need to do that.

And then we’re just passing that. This is a string argument. And we’ve got a fully qualified schema name, which is a database and the schema. And now if I go over here, everything worked properly. Yes.

I love it when things work on the first go. So now we’ve actually had so we’ve copied from this demo sandbox AdventureWorks table, which was in this scheme here, in this database, and we’ve actually copied it here.

So we have in essence backed up this schema with all the tables in one command. All the tables, one schema, one command. Now that’s awesome. Right? So now you can have anybody on the team take your data scientists, machine learning Gurude as my example.

Earlier you find a great training data set and now you want to say, well, let me back that up real quick for today before I go home. Something like that. So that’s a great one liner way to back that up with the timestamp.

So, using stored procedures, the next thing we can do, and I’m kind of hesitant to actually do it right now, but we’ll have this in the GitHub repository. We’ve already had a Task meet up, so all these steps are really in our previous meet up, which we’ll provide a link to on our YouTube channel.

So you can link back to that and watch the Task meetup and super great information on that one. Lots of good instruction. That’s great learning coming out of that one. I think I might skip that for just now, but there are certain things you have to do in the Task.

Like you have to there’s a function here, you have to account admin to grant this execute Task one account to a role. And then grant that to the role that’s going to basically create the task as well as some of the other operations like start the task or resume the task.

But just to take a quick look at the logic here, all we’re really doing is, well, first we’re granting access to that role that has that account execute task to the stored procedure so it can call that from the task.

And then the task is actually really basic. You see a lot of comments here because we love to comment our code. Every good developer worth their salt will comment their code, hint, hint for anybody else that doesn’t do it.

So we use things like in this case for tasks, you can create a schedule using a Prawn syntax and if you guys are familiar with that and using the correct was that the locale or like UTC sort of format for Cron?

And so some of those instructions are here in the reference links. Definitely use those, they come in handy tremendously. And here’s kind of a reference to another type of like format. Instead of that you could use we’re going to use our standard warehouse and that and you can see what we’ve done is we’ve actually created a task with the name of the schema that we want to call.

But there’s different ways you can set this up. This is a simplified version. You can in theory create a store procedure that has multiple calls. You could pass in a number of fully qualified domain schema or table objects through store procedure and have them all back up all within the same call.

We don’t want to give away too much secret sauce in this one, but this is the basics of how you could get going with such a thing. And then once you create that task, it will be running on that schedule.

So for those who can basically read binary and understand this means this is like it’s going to run on the first day of every month. That’s what the syntax would do. So then you could schedule this and then your backup for whatever you want to back up.

This task would run on the first day of every month, and it would back up whatever you had inside this procedure, inside this task. So that is pretty slick. And just remember, task, if I call correctly, they’re limited to just one execution per task.

So that’s another reason we simplified this for the meetup, just to take a look. The tasks are awesome. They’re great to use. But again, redefine print on tasks and how they work. Okay, let me just jump back here real quick.

So one of the things we wanted to go over, nobody has any questions on that high level stuff. So you saw we did the backup. There’s one other concept that we’ve been using a little bit, and that is this concept of backing up from one Snowflake account to another.

And so this is actually pretty cool because you can do some interesting things, right? So, for example, changes to, say, log. So what you can do here is you can say, okay, I want to export data from one Snowflake instance.

You could do this to the CLI using the command line interface. Like the snow CLI. You can actually do things directly from your local machine, so your laptop or if you’re on a virtual machine, instead of having to use an external stage.

In this case, what we’re doing is we’re using an external stage, or you can do stage. So we’ll keep it simple and just use one stage on each Snowflake account. And again, there’s other ways to do this.

There’s data sharing in theory. There’s reader accounts. There’s multiple ways you can do this. But this is just a very simplified approach to kind of drive home this idea of backing up data instantly.

So I could go into our one dev or QA account, right, which is completely separately from another account we have, and I can create a stage, which we’re not going to do, because I think we have that in another meet up as well.

If we have that, we’ll create a link for it so you can watch that one as well. And so let’s say we we created a stage called I guess this would be the stage name, not the table name. Welcome to the stage.

All right, so what we would copy into the stage creating a CSV file, and we would do that from a table, right? So you can do this by the same name. This could be like backup, for example, and you can do that from a table, selecting all the records and then assign your file format.

Now, you can create a file format object, but in this case, we just did the explicit options definition of the file format. And then one thing here, which I think depending on your scenario, your models may vary, everyone is a Snowflake when it comes to Snowflake, I guess.

But we typically use header equals true. And we like doing that simply because when you land the data into your stage, you have the predefined let’s just call it like if we’re doing CSV, you have the CSV structure.

So like, all your commas that you have represent an actual header, and there should be any gaps because you’re doing a select all or you’re doing, you know, select column one, column two. Like whatever you’re doing, but you’re actually getting those explicit markers in your CSV by comma.

So, you know, if you had ten columns, you’re exporting, you’ve got what is that? Nine commas, which means you have ten columns. And so you’ve got a header marker there. Don’t do that with the CSV. When you’re importing the data, recovering the data and you could potentially bump into issues.

We’ve seen that before. So header equals true. Saves you a little bit of a headache, potentially. And so you would run this on your source. Your source Snowflake account would land that data into, you know, either your internal stage or external stage and then preferably your external stage, right?

Because it’s by account. So external stage, we can put this export to external stage and then you could go over to your other account, create another stage here on your target Snowflake account. Create another stage created by same name, different name, whatever you desire there.

And then you basically run the copy into statement again, but run it in reverse, in essence with a few different options and tax settings. So you would do copy into in this case, if you can look at your documentation from Snowflake again, anyone can correct us if we’re wrong.

Copy into requires, I think, an existing table. So you would have to already have done the DDL, right? So there’s a step in here on target account, create DDL or prior prior to loading external stage.

And then that would allow you to then run the copy into statement, create the table called error log. And yes, we create that. Table called airlock. And then you can load into it from the stage where this one drop it in there with of course the headers.

And then you could you would then put skip header because you don’t want to load the headers, you want to load just the data under the headers to be the header. And then I think by default Snowflake gzip compresses file.

To give you an idea of probably why they do that compressing, I think we had an example where there was like a table that had 100,000 records that was about 38 megabytes, maybe like 20 columns, 100,000 records, it’s like 38 megabytes.

When you gzip it, it becomes like five megabytes or maybe four. So there’s a good reason why they probably do that by default. Save space all the way around and so you’d have to ensure you match the compression and maybe another one or two optional settings.

And then this would load that into your target destination table after you’ve created DDL and in essence you move the data, you’ve created a backup elsewhere while you’ve created backup twice. Right?

You’ve created backup by moving it off of Snowflake and then in theory, you’ve created some sort of secondary backup on another account completely. Again, there’s other ways to do this, but this is kind of a quick simple way to do it just to get data across accounts.

You can automate this as well to some degree as well as far as if you had 1000 tables in a schema in the database. There’s some ways you can script this out through just sequel inside of Snowflake. And then same thing for the ingestion part here.

Alright, so let’s jump back over here to our slide deck as we’re wrapping up. Cool. So make sure we didn’t miss anything. Alright. So yeah, so we had a few scripts. We ran a few scripts on backing up Snowflake or backing up in Snowflake.

So we talked about a few scenarios and use cases for backing up Snowflake or objects in Snowflake. And we talked about time travel. We talked about migrating data between Snowflakes account and we talk about automating and cloning backups.

And actually we went through the cloning exercise using stroke procedure, but the basics of the task and automation are there. Again, check out our other meet up YouTube video on tasks and streams.

So be sure to check out our code totally open source on this repository. Be sure to give us a star. And as we wrap up here, see if there’s any questions or comments, be sure to check out DataLakeHouse.io and we’ve got some upcoming events here in the near future.

Oh yeah, we’ve got next month Java UDFs and then in January, Financial Forecasting using Python and Snowflake. And then in January again, we’re going to have a second one on DataLakeHouse on Snowflake.

So that’s pretty exciting. Very exciting. So some more great content. Be sure to join us, be sure to tell a friend and be sure to subscribe to our YouTube channel. And yeah, thanks everyone for attending.

We’ll see you on the next one.

More to explorer

Snowflake Loading Data with Special Characters

July 24, 2024 No Comments

Special characters in your column names can cause chaos for downstream users, tools and processes.

Building a Generative AI Competency (or the First Gen AI Project)

July 21, 2024 No Comments

When Building a Generative AI Competency one must identify the necessary infrastructure, architecture, platform, and other resources and partners that can help an AI initiative be successful. We have just like many data warehouse and digital transformation initiatives over the last 20 years fail because of poor leadership, or companies only going half in on the objective.

clock, time management, time-3222267.jpg

Snowflake Time Travel Not the First Time Traveller but Let’s Review

June 12, 2024 No Comments

IBM DB2 long before Snowflake had this concept as did a few other select databases using a Temporal Database concept. I think Snowflake was able to make it more popular and mainstream due to the association of Data Warehousing and analytics specifically