Mike Jelen

April 26, 2022
9:00 am

Using Python to Execute a Snowpipe

Snowflake’s method for moving data from files as soon as they’re available in a stage. This will make data available to users within minutes. This data is auto ingested into Snowflake and allows loading data continuously but is not to be considered streaming of data.

There are many ways to automate the data ingesting along with notifications.

Check out our Meetup video below to watch an overview of this event:

Other Meetups and Events to check out:

Transcript from the Meetup Event:

Welcome to another awesome Carolina Snowflake Meetup group event. Today we are going to be going over using python to execute a Snowpipe. Just some basic rules that we try to keep in mind to be respectful of everybody else and go on mute.

Put your questions in the chat if you have any and try to keep them on topic if you can. And video is optional. Always and always recommend to a friend or invite a friend. We have that option available.

So if anybody wants to invite a friend or a colleague, that’s always an option. And check us out on social media at daily cost. All of our previous meetups we post onto YouTube and they’re usually available a few days afterwards.

Sometimes we take a little longer getting them on there, but we’re trying to get better about that. Weeping Me today for the agenda, we’re going to do some quick introductions for just a quick refresher on Snowflake, what Snowpipe is and some use cases for it.

Then we’ll be getting into the python setup and Snowpipe execution and we’ll discuss our beta program just briefly and have an open discussion if anybody has any questions and talk about our upcoming events.

All right, so just a quick refresher. Get everyone on the same page of what is Snowflake? So think of Snowflake as the name Data cloud. Well, really, it’s an all in one secure place to move your data into.

Store your data. Transform your data and access your data and really just pay for what you use and how you use it. And of course, not only is the highest level of security from the different FedRAMP programs, but even from that you can allocate who is going to use what from an income increase processing.

So your executives or maybe your IoT users or your data scientist type of users, you can let them use more horsepower and then you can have your general audience consume the data in a more economical format where you really balancing and leveraging all those internal costs, extreme flexibility.

You can even share your data outside your organization as well without having to move and dump CSV files or text files and figure out what’s the latest and greatest version customers, your customers can go in and actually access the appropriate data and you control who has access to that.

And you can also monetize that if you choose to do so, as well as additional revenue stream within your organization. And so we look at what is Snowflake really solving the traditional paradigm of all these different transactional databases, many different applications.

I have third party the data it’s going to go is the best way to put it and it’s very complex in between there as to the types of data, especially when you get into streaming and the different technologies.

And really at the end of the day, the consumers just want to get access to their data, whether that’s analytical analysis, operational reporting, somewhere in between from an ad hoc perspective and then even those uber power users.

So Snowflake addresses all of these individual concerns and one of the things or one of the ways that they do that is by leveraging Snowpipe. So Snowflake actually has a tool called Snowpipe, just as Mike mentioned, and it’s their method of moving data from files as soon as they are available in Stage.

So it’s their, I guess, alternative to bulk data loading, which is accomplished via the Copy statements or the Copy Into statement, which will dive into a little bit later, but essentially the ability to have data within minutes of it appearing inside of that stage.

So whether it’s an S Three Bucket or Google Cloud Storage or Azure Blob storage file lands there and Snowpipe takes that data and moves it into your target data location inside Snowflake. So it’s basically enabling the whole process of continuous data loading.

So as you see on the slides there, it really eliminates the need to run manual copy statements to get that data in there. Snowpipe is just the automated version of continuous data loading and getting that data into your target location in Snowflake.

And it works well with streams and tasks. Maybe we’ll talk about this little later as well, just from the standpoint that you can use a Snow pipe to maybe move data into a data warehouse staging location and then have some sort of automated streams or tasks to move that into a more refined data warehouse location as well.

The Snow pipe just in general allows you to. Have a lot of programmatic flexibility in the way that your pipelines are run. You can have a number of pipelines running simultaneously as opposed to and kind of going back to that last point, having to run things manually with single copy statements, which of like refers to that as bulk data loading.

And it gives you just the ability to just keep an eye on other things as opposed to having to keep track of one copy statement at a time in that bulk data loading process. The whole process, as we alluded to is automated.

So it’s again, getting away from manually running a copy statement. And you can do this through the Snowflake rest API. And this is actually what we’re going to walk through today is having a lambda triggered by an S Three event or having that data file dropped into S Three.

And again, it’s very automated flow. There’s also the ability to use cloud messaging, like, for example, AWS, simple notification services and simple queue service to trigger that automated flow of Snowpipe.

And then again, being able to leverage Snowpipe with tasks and streams as well. So I guess what I’m going to do is I’m going to take what we’re going to do is we’re going to walk through and I’m going to take control of the screen just for a second, and we’re going to kind of walk through the process of setting up a Snowpipe inside of AWS, using that Snowflake rest API and just showing how a data file.

Can move from a staging location into a target Snowflake destination into your data warehouse. So in this process, we’ll be able to see also how python communicates with Snowflake and how it works in executing this whole Snowpipe type inside of the AWS lambda function.

And we’ll also show you just kind of how much flexibility you get with programming in an AWS lambda function. All right, so I’m going to go ahead and take control of the screen. Can you guys see my I guess it’s a hacking duck.

Can you see that? Yes. Great. Okay. So I would have put this in the presentation, but I think it’s going to be easier for us to kind of just follow this as kind of our general outline because I’m going to be bouncing back between a couple of different tabs.

But anyway, we’re just going to start with kind of a high level overview right here. And I mentioned before that we’re going to show the general process of moving a data file via Snowpipe into Snowflake or into the target destination.

But specifically, we’re going to utilize a couple key AWS services. We’re going to utilize Amazon S three, as well as AWS lambda. That’s kind of where the pythonic attributes of the meet up are going to come into play is a lot of this inside of this lambda function.

But in order for us to be able to do this, we’re going to have to create a few things within Snowflake and a few things within AWS as well. Just to start off, we will need a target table, obviously, for our data.

So inside of our target inside of our target Snowflake database and schema, we’ll want to actually have a place where we’re trying to land this data. So we’ll go ahead and create a table, or we’ll just have maybe a table DDL that will walk through as well as a and I’m going to show you guys how we can leverage Snowflake storage integrations as well.

They are very helpful in that they allow you to not have to use your own security credentials AWS security credentials when creating a stage later on. I guess this step right here. But really just a big thing from a security standpoint.

You don’t want to have scripts that have usernames passwords, AWS, key IDs, secret key IDs, anything like that when you’re creating your scripts. So nothing hard coded like that. This kind of removes that need for that hard coded value.

That Snowflake storage integration. So we’ll kind of walk through that as well. And then we’ll actually walk through the DDL for creating a Snowflake or a Snowflake Snowpipe. And then within AWS as well, we’re going to walk through having an S three bucket.

I won’t walk through a lot of the details of how you create a three buckets or anything like that, but just show you that you’ll kind of need this as your stage location, your external stage within Snowflake.

And then we’ll get a little bit more detailed walking into lambda function, because Snowflake hasn’t pretty detailed code that you can use to. Be able to leverage that Snowflake rest API to trigger your Snowpipe to work.

So we’ll kind of walk through that as well. And then what we’ll also need is something that triggers the AWS lambda function to run, and we’ll have that linked to our three buckets. And then we’ll also need kind of a series of IAM roles, identity, access and management roles and policies so that Snowflake can be able to interact with AWS and they can have that back and forth so that Snowflake knows when data is being staged in s three and vice versa.

Let’s go ahead and start with Snowflake. So just off the bat, like I mentioned, we’re going to go ahead and create a target table set up like this. Nothing too crazy. Before I get any further, I’ll mention that the data that we’re going to use is kind of a little bit of stock data.

We’ve got a pretty extensive listing of mutual funds and their stock prices at the end of day for a pretty considerable time period. So I think for a lot of the mutual funds that are out there, the data set that I found on Kaggle was pretty expensive, maybe going back 2025 years.

So pretty good set of data, but we’ve kind of portioned off such that we’re only using a little bit for today. But beyond the point, we’ve got just a few fields here. We’re just looking for our Identifier of the mutual fund, the price at the end.

The actual date with which the price was measured and then the price at the end of that day. So went ahead and created that table. And then if we wanted to see the kind of the attributes of the table, we would run just a quick describe statement here where we would see just a little bit of a little bit more information about the table.

But nothing too, nothing too in depth, but kind of as I mentioned earlier, the next part that we’re going to go into and really need is storage integration. And this, as I mentioned earlier, allows us to not have to use our credentials, secret key and AWS access ID when we get down here.

And using this creator replace stage statement. If we hadn’t created a storage integration right here, we would have to specify those. And just as a general rule of thumb, it’s not great to have identity values, usernames passwords, anything like that, hard coded in scripts.

So that’s why we went here and we created our storage integration. Just allows us a lot more flexibility from a security standpoint. So this is kind of the statement on which you would run in order to create that storage integration, like we’re specifying here, that we’re creating an external stage.

We want our data to be staged in an S Three bucket. And we are specifying that here. As I mentioned earlier, we also were creating an im rule that allows Snowflake or Snowpipe specifically to access the data inside of S Three.

So this is where we would specify the. Amazon resource name in order for the storage integration to be able to pick up that information. And what we’re just telling the storage integration that we want it to be for the security to apply to any single location inside of S Three.

So any folder, any subfolder, subdirector, anything like that. But then if you wanted to see more information about that, you could also run the described storage integration statement right here. And that would give you kind of a little bit more information about the im role that’s being acquired by Snowflake, as well as the external ID that Snowflake is going to use in order to in order to get into AWS.

In addition to the there’s another policy that Snowflake will assume in order to access that data within S Three. So for the sake of privacy, I’m not going to run that, but that’s what you would use in order to get more information about that.

So the next piece right here that we’ve got highlighted is our stage. So our stage and if I just go back to the diagram right here, our stage is where our data is landing or where Snowflake is recognizing that raw data is landing.

So in our case, we’ve got our CSV file, and it’s landing in AWS Three. It’s an S three bucket. So we want to basically tell Snowflake that this is where the data is going to live. And that’s we do right here is we specify the URL of our S Three buckets where our data is going to be.

And we specify our storage integration as well. Meet up bucket integration, which we just specified or just went over a second ago. And then we also are specifying a file format right here. Just the data that we’re getting is in CSV format.

So we want our stage to recognize CSV data. So that’s all set up. We’ve got our stage Snowflake now recognizes that we’re looking for staged data inside of this s three bucket. And now we can go ahead and we can create our Snowpipe.

So this is kind of, in essence, a Snowpipe is a Snowflake knows it to be just kind of a simple copy statement or a copy into statement. So I was talking a little bit more earlier about bulk loading, and this kind of highlighted statement right here is the exact code that you would use to if you had data inside of a stage.

You would just run that in order to move that data from the stage into your target location. So I’m specifying right here, copy data from this staging location into our target table, our mutual fund prices table that we specified right up here.

You could run this manually right now if you have data in your stage, and it would take the data from the stage, and it would put it into that target table. So really what you’re doing when you create a Snowpipe is you’re giving Snowflake the code that it needs to be be able to run over and over and over again.

So it’s going to keep basically this copy statement in memory. And whenever the Snowpipe gets triggered, then it’s just going to run this continuously or it’s going to run this whenever it needs to.

So that’s kind of a bit of the setup behind what we’re going to need from a stove lake perspective. But if we go back here, we can take a look at kind of what we need from an S Three perspective and go over that.

So as I mentioned, the S Three bucket is where our stage is, where Snowflake knows our stage data to be. So if we just go over to AWS, I’ve already created a bucket over here and I’ve called it Meet up Examples Buckets.

And I’ve created for the sake of this meetup a directory called Snowpipe Example. So this is where when we’re ready, I’m going to drop data and it’s going to kind of trigger this downflow process of Python Lambda being alerted that there’s data in this location.

And then Lambda, if we go back to this diagram, lambda is saying, hey Snowpipe, run that copy into statements and take the data that is in S Three and put it into our target Snowflake destination. So if you’re thinking about it, the S Three bucket and the stage is kind of the origin of where this whole pipeline process begins.

So we’ve got our bucket in place and what we also, as I mentioned, is our Lambda function. You won’t be able to see it here. I kind of went through the process of what Snowflake had specified in its instructions, and maybe I’ll just bounce over there real quick to show you.

They’ve got pretty good detailed instructions on how you can set up Snow pipe with AWS lamp and they’ve got a lot of sample skeleton code that you can use, albeit I had to make a few tweaks in order to get it to run the way that I wanted it to run.

But they kind of give you an outline it’s parameterized so you can specify your Snowflake credentials, not necessarily credentials, but you specify your Snowflake information, account information, user information, and then from there they actually have you create an EC two instance to do this whole kind of packaging of Python libraries and dependencies in there.

If you kind of go back up here, you see kind of a list of libraries packages that you need in order to be able to run the Python code. They recommend running it through or creating an EC Two instance and kind of packaging everything up in there.

I don’t know if you necessarily need to do it. That was the way that I went through it. But the process you can use is you can use it independently of the easy to instance as well, I think, and it shouldn’t give you too much in the way of problems.

But anyway, you would use the EC Two instance, create your deployment package, and then you would end up playing your. Function via the command line with your Snowpipe zip file that you just package right up here and then add any requisite permissions and then that’s how you would create your lambda function via that easy to image.

So I think probably when we upload these meetups to YouTube, we’ll go ahead and we’ll link to some of these some of these websites, the Snowflake documentation, because I think they are pretty good resources.

If you guys do want to just follow this step by step in order to be able to use it. But not to get too far down this rabbit hole. Just kind of going back and taking a look at the lambda function as well.

So the deployment package as you can see, is a little bit too big for us to view via AWS inline code editor. So I’m just going to bring that up via my Visual Studio code and it’s basically the same code that you saw from earlier, just with the few things edited out.

So I went ahead and inserted all of the parameterized information that we needed here in order to specify our Snowflake target or our Snowflake host, our user, our Snowpipe that we want to be able to trigger.

And we’ve got this private key text and that’s going to be a bit beyond the scope of the meet up here today. But there is article that we will link to in here about setting up key pair authentication within Snowflakes so that it just helps from a security standpoint with the communication between AWS and Snowflakes.

So we’ll link to that. But that’s also kind of a key piece of the lambda development and you guys could see that in this pemline variable right here is where we would have this encrypted key. But again, for the sake of not getting too far down that rabbit hole, I’ll just keep that for another day.

But that’s where we would specify all that information and basically our lambda function is boiled down to this set of code right here where we are taking in the event context that’s occurring. So we said earlier that our S three bucket or any time of the file lands in S three it’s going to trigger the Snowpipe lambda.

So any time it gets triggered then we’ve got information about automatically information inside of our lambda about what file just got dropped in there. And since we’ve got that information we go ahead and get down to this part right here where we are sending information back to the Snowflake rest API with our ingested file and the Snowpipe that we’re trying to execute because of that.

So that’s a bit about our lambda function and we’ve got kind of a set set of. Permissions that we have to run here. We had to set up a separate policy in order to basically be able to run our lambda function in the first place.

So we just have a little policy right here, here where we would just say that anything that lands in the S three buckets that we specify here, our meetup examples bucket, we want that to be able to trigger our lambda function to run.

So once we create that, we also create an S three trigger here where we are creating basically put policy to where any time a file lands in our S three buckets, it is going to be able to trigger our lambda function.

So if we just go back to our little diagram we have here, we’re saying anytime we want our CSV file to land in this location and our S three buckets, we want our whole Snowpipe to be triggered. And it does that via the lambda function and the call to Snowflake rest API.

So we’ve basically covered, I think, everything that we wanted to COVID from the standpoint of all of our AWS requirements. So why don’t we go ahead and just demo our Snowpipe. So as I mentioned earlier, we’ve got kind of a listing of stock data and I need to pull that up in front.

So I’ve got a few files here. We’re going to specifically take a look at these three files. The quadruple Ax. Mutual fund as well as the Aaax mutual fund, as well as Fae TX mutual fund. So we’re going to take a look at all that data and we want it, we’re going to basically drop it into our S Three buckets and be able to demo that Snowpipe in action.

So we go over to our bucket and if I can find that, there it is. So we’ll just try one file at a time for now. But our file goes in here. We upload it and now it does take a couple of seconds for Snowflake and Snowpipes to recognize that sometimes.

So we can run a couple of commands just to see the state of our Snowpipe and how everything is doing. So we’ll first be able to list the data that’s inside of the stage. So basically this is really a cool feature from Snowflake is we’ve got this stage configured with our S Three bucket that we mentioned earlier.

And when we say list and then at the stage is we’re basically saying list any file in that S Three location. So we’ve got in here, we’ve got the file we just dropped and if we go back to Snowflake you can see that same file available inside of the stage right now.

Now that the data is inside of the stage via the file we can go and we can check on the status of our Snowpipe. So it is saying that our Snowpipe is running right now and we can actually. Take a look at this query, whether the data has landed successfully in there and for some reason it’s a little slow today, I guess did not land for whatever reason.

So we’ve got our data available inside of the stage and we have our Snowpipe. Now it really should be saying that the pending file count is one, so I don’t know if that means it’s just a little slow right now, but you can see I ran it a little bit earlier with the Faetx mutual fund data and it did land successfully inside of our target location.

So I guess if we roll back 2 hours in this case in our query, you’ll be able to see that right here. So this is just metadata that we’re looking at from an earlier load of the file. But for looking at the Fat X, that Fae TX mutual fund file, the last one, that last file that was ingested, it was ingested with 7400 records.

And you can see that it was loaded successfully. So everything was good from that perspective. I guess it’s just got the demo jitters. For whatever reason today we can go ahead and try a different file.

Maybe for whatever reason, that file didn’t land as well as it would have liked, but so we are going to just pull that to the side real quick. We’re going to go back to S three and we’re going to try and land the QAAC Fund prices file.

So we’re going to upload that to our stage. I know why we had trouble with that first file, but we go back to listing files in our stage, we’ll be able to see yes, it recognizes our file that was just uploaded.

So all good from that perspective. And if we go to check out the status of our pipe right now, you can actually see that we just landed our new file and Snowpipe ingested that file into the table. So you can see our data from the Qaacx Fund is now inside of our target table.

And that was courtesy of Steel pipe. So if we just go back to that AAAA X fund prices file, that was just a bit of a brain fart on my end. So what happened? Since I ran that file earlier and since Snowpipe had already ingested that file, it.

Basically kind of knows not to not to run it again. It calculates it as a duplicate file. So if you get a duplicate file, then it won’t trigger the Snowpipe and it won’t recognize it. So kind of another cool feature of Snowflake is that if you’re getting duplicate files, you’re not going to get duplicate data inside your target table.

So if you just go look quickly at our metadata, we can see our new file that was loaded here and we loaded with 4700 records. And that lines up with what our record count was in our target table. So that’s kind of the gist of how you use Snowflake to run Snowpipe in order to be able to leverage Snowflakes continuous data loading capabilities and just the ability to use kind of a list or a multitude of different features in order to be able to do it.

So we used the Snowflake Rest API via AWS Lambda today to be able to do that. You could also do it with cloud messaging services like SQS or SNS in AWS or any of the other Snowflake compatible cloud messaging service as well.

So I think that was all that I wanted to share with you guys. So I’ll just hand back control the screen to Heather. Yeah, sure. I’ll put it back on our next step. Do you want to talk about the next steps?

Yeah, sure. Okay, so. If this was your first time learning about Snowpipe, the ability to make the next step is now you know how to automate a script to be able to run it. So this was a bit of learning experience for me.

I had used Snowpipe before, but not from the standpoint of the Snowflake Rest API and with AWS Lambda. So this whole portion of automation can be done in a few different ways. And like I mentioned, this is something that can be leveraged with cloud messaging like SQS or SNS.

So you’re not tied down to the configurations of the Snowflake Rest API. You are able to use some of those other cloud services. And as I kind of alluded to earlier, Snowpipe really allows you to allow flexibility in terms of using data to your desire to get continuous data loaded and have it kind of is a bit of an ELT or an ETL pipeline in and of itself inside of Snowflake.

So the ability to have that all kind of in one area is very valuable. For example, you had your data warehouse and you had kind of a separate staging location within Snowflake. You would be able to run your Snowpipe to get data from your stage into your Snowflake stage and then be able to use any sort of task or stream or storefront or whatever other Snowflake capability that you need to be able to.

Further get that data into your data warehouse. So, really a lot of flexibility in what Snowpipe is able to do and what it brings the table from the standpoint of ETL or ELT. Thanks, Drew. So just talk a little bit about us real quick.

We are AIC, the developers of the. Data Lake House platform, and Data Lake House IO is our website right now. We’ve currently got a ton of sources that you can use for ELT. There’s no coding required.

There’s an element of machine learning that will be implemented soon and analytics to go along with them. It’s an out of the box solution and totally self service, so it’s nice. And easy for everybody to use.

And there’s also a data catalog feature. To kind of keep your whole team on the same page when it comes. To when where that data is coming from. Right now we have our open beta. Program going on, and I’ll just send a link in the chat.

And in our open beta program, you’ll get access to Data Lake House IO. Including all the great features I just. Mentioned and some cool no Bad Data swag. And if you choose to go to our sign up page, this is what it looks like.

You just enter first name, last name, and email, and then you’ll get some next steps sent straight to your inbox. So it’s super simple, not a ton of work for you to sign up for, and you’ll get access to our awesome platform.

This is our no. Bad Data mascot. He’s just representing that we believe that there is no bad data, only bad processes. And we can use this time to discuss any questions anybody might have. And anything like that.

And I also put in the chat the link to the documentation that Snowflake has on getting your Snowpipe set up and that’s just kind of the intro. I think Drew was on a different page, so we’ll make sure to include that link in the video on YouTube also.

Our next event we have coming up on the 23 also at 06:00 PM ET. And we’ll be going over using Snowsite to visualize data.

Thank you all for joining and please send us any suggestions you have on future Meetup’s and we’ll try to incorporate them.

More to explorer

Snowflake Loading Data with Special Characters

July 24, 2024 No Comments

Special characters in your column names can cause chaos for downstream users, tools and processes.

Building a Generative AI Competency (or the First Gen AI Project)

July 21, 2024 No Comments

When Building a Generative AI Competency one must identify the necessary infrastructure, architecture, platform, and other resources and partners that can help an AI initiative be successful. We have just like many data warehouse and digital transformation initiatives over the last 20 years fail because of poor leadership, or companies only going half in on the objective.

clock, time management, time-3222267.jpg

Snowflake Time Travel Not the First Time Traveller but Let’s Review

June 12, 2024 No Comments

IBM DB2 long before Snowflake had this concept as did a few other select databases using a Temporal Database concept. I think Snowflake was able to make it more popular and mainstream due to the association of Data Warehousing and analytics specifically