Heather Pierson

May 26, 2023
5:47 pm

Snowpark for Snowflake

You can write code in the language you are most familiar with using Snowflake’s Snowpark. Use Python, Scala, or Java to create complicated logic rather than SQL for large-scale systems and pipelines. Maintain your data as a first-class citizen and dramatically speed up your workloads using Snowpark, which eliminates the requirement to export data in order to run in different contexts.

In the Meetup event below, we’ll show how Snowpark makes it possible for data scientists to work directly with Snowflake data using Python, integrating machine learning into development and deployment and, ultimately, vastly accelerating the time it takes to deliver business value.

We incorporated information into this event from Snowflake’s website.

Other Meetups and Events to check out:

Transcript from the Meetup event:

Hey, thanks everyone for joining us for another Charlotte Snowflake Meetup group.

Carolina. Sorry. Carolinas Snowflake meetup group event. Thinking Charlotte today, but all about the Carolinas and really all over the world. Everyone’s completely welcome to join our Snowflake Meetup group.

So tonight we’re going to be talking about Snowpark for Snowflake. It’s a great topic. This is one of the latest topics that come out of Snowflake. They’ve been making lots of advancements with their Snowpark integration.

We’ve even seen some new things in the new Snowflake UI. So let’s get going here. Next slide. Cool. Well, if you’ve attended one of our meetups before, then you know that we do record these. We put them out on our YouTube channel for everyone to take a look at, maybe rewind something, and definitely share some of the great content that we’ve been providing the world here tonight.

We’re just going to walk through Snowflake as a refresher. We’re going to talk about what Snowpark is and maybe what it isn’t. As usual, we’ll talk about some of the use cases and benefits of working with and implementing Snowpark.

Got a little bit of a demo for you all and then we’ll open it up for discussions and questions and talk about some of our upcoming events. And as always, everyone’s input is welcome, especially as we are looking for topics and guest speakers for some of our upcoming events.

So just diving into it a little bit, what is Snowflake? I think most folks at this point know what Snowflake is. It is really that end-to-end platform for ingesting and allowing data consumers to access data.

Through a myriad of paradigms handling things like data integration, streaming, handling some aspects of transforming the data, and then of course, providing that layer and that baseline for data analytics, lots of great use cases for ad hoc real-time operational reporting, KPI development and so many myriad of other things related to data ingestion to output.

So when we talk about Snowpark so Snowpark is that latest and greatest framework from Snowflake that adds that developer layer for Snowflake customers so that they can incorporate and develop in familiar data sciency tools incorporating legacy and very well-known SQL Syntax with Python, Java and Scala so that they can really ingest lots of data sources and then, of course, make sense of their data.

Again following data science best practices, but also incorporating other types of concepts such as DevOps and so forth. And so we’re really excited to talk about this topic tonight because there’s so much that can be done with this tool.

I think we’re only scratching the surface on what Snowpark is going to enable not only data scientists, but data enthusiasts and data modelers to accomplish all through the single platform, if you will, of Snowflake.

So if we focus on some of the use cases and some of the overall benefits of Snowpark really allows the ability to do custom data processing, be able to build and extend, if you will, custom applications, it’s.

Definitely ripe for machine learning applications. We talked about DevOps data ops and of course, if you’ve been reading some of the buzzwords lately, there’s also the concept of ML ops for machine learning operations.

And so intertwining these into the Snowflake product, really, I think that gives us this Snowpark concept that we can leverage so that we can, of course, leverage other things like data integration and handle very complex data transformations.

So again, data science developers are going to feel quite comfortable in this environment, or rather working with this framework, as you’ll see here in the demo in a few minutes. So benefits, they might seem to be very self explanatory, and I think that they are because the benefits that come out of using Snowpark are going to be vast because you’re combining the compute and the processing capability, the complete scalability of Snowflake with this ability to do data science work through Snowpark.

And so you’re going to be able to run your code directly inside of Snowflake, accelerate your workloads, handle very complex workloads, obviously. And that’s going to enable your data science team, or your data team in general, to develop in a much more software development, DevOps, centric type of paradigm, which is becoming more and more the standard for working with data.

And that includes everything such as having your code in a Git repository and so forth and so on. Right? So this is taking that traditional DevOps and data engineering standard. And really applying it to, in this case, into Snowflake, we think it’s going to enable better partner integrations.

If you’ve looked at some of the things that Snowflake is doing with their partnership with Anaconda, and then, of course, again, with open source libraries, the partnership with Anaconda being able to have some of those curated libraries, they are available for you.

So data teams that might be a little newer, you might have a more novice level team. This is going to allow those data team members to really find a nice, clean path to working with data in this capacity as they align that with Snowflake.

And of course, one of the other benefits is you don’t have to actually go to another tool. Right. So there’s other tools that are out there that talk about Spark and some other sort of fast workloads that you’re working, especially with Python.

And now you can do that all within Snowflake. So that’s actually really cool. Right. So at this point, yeah, unless there’s any questions, we’ll turn it over to Anapov, who’s going to walk us through a little bit of a demo.

You’re going to kind of set up the demo, what you’re going to go into, what you’re going to accomplish. Right. And then after that, if we have any questions or we want to open that up, we’ll definitely dive into it then.

Thank you, Christian. I’ll take it forward from here. Let me just share my screen. Do you see a notebook that’s open on my screen right now? We do. Okay. Anyway, you can drill into the view a little bit better.

Zoom in. Rather, I think is a term. Yeah. Great. Okay, so taking forward to what to the great introduction that Christian just said for us, basically, Snowpark kind of lets you extend your capabilities in Snowflake and lets you play around and give you more features that you can extend your existing data pipelines.

So just to give you a quick sense of the various flavors that you can have with Snowpark. So what you have right now in front is the demo that I’ll quickly start and talk about is using the Snowpark libraries in Python and then kind of explain some of the function and then walk you guys through a quick linear regression implementation using Snowpark.

That how would you go about setting up a linear regression training and then testing the model and running the model completely using Snowpark? But before I go into that depth, I just want to kind of emphasize that there are two ways of kind of working with Snowpark.

So number one is that what you see right now is that you import the Snowpark libraries into some form of a notebook, and then you work on the notebook using Snowpark. Or the other option is that you completely work out of your Snowflake instance using something called as Snow sheets.

So in Snowsite, you have the option to create Python sheets. So on the left, once you enter Snow site, you click on left and you can create something called as a Python worksheet. And when you click on the Python worksheet, it will take you to a page which is going to look exactly like this.

But fundamentally, from a library standpoint, whatever that you can do in a notebook, you do the exact same thing. Even in your Python worksheet. You’ve got to import the same. Snowflake Snowpark libraries and the same functions.

But the only difference is the medium. In this case, you can execute this as a worksheet in Snowflake. And for the other one that we’ll walk through is basically using a Jupyter notebook. And before I go to go to the notebook, just a quick touch on the worksheet here.

As you look at the top, this looks something very similar that you have in your classic console, where you can choose your database and your schema. That is fine, but what you see here are the list of packages that come pre-installed.

So these are the preinstalled Anaconda packages that come predefined that you need to work with them. And you’ve got your different settings. That what will be your main handler and then what’s the output back from your main handler and so on.

But before I think we’ll probably have a different session to kind of COVID Python worksheets for today, I’ll just go back to my notebook and talk about using Snowpark in from a Jupyter notebook. So first thing is that you’ve got to import your Snowflake Snowpark libraries.

And then there are some predefined functions such as which I’ll talk as we go further, but then as you look at the code, there are some predefined functions that you need to import. One of the most important method that you want to import is a session.

So if you don’t use a Python session using Python worksheet, you’ll have to start a session from your notebook. And that’s precisely what we’ll do first. So once we import the libraries, these are the connection parameters is that you would use to kind of connect with Snowflake for this demo.

We’ve already set up that predefined and we’ll use the connection settings. But for you to kind of set up your settings with Snowflake, you’ll have to pass your account, which is your account number, your username, your password.

And you’ve got other optional features like what role you want to log in with, what’s the default warehouse database and schema and so on. So people who have used the Snowflake connected library would feel at home with this concept because it’s very similar to that.

And once you connect, you’ve got to create your session. So you get an instance of your session and if everything goes well, you can do a quick test to say session SQL. And as you see, the SQL is a predefined method that comes with the Snowpark session module that you have here.

And just like we do in Snowflake, we set the database, we set the warehouse, and you would get a success or an error message back from this method. So in this case, it says that whatever statement you’ve run, it says that statement has executed successfully.

Similarly, you can run other DDL operations like so let’s assume that instead of running database of vera, we want to create a table. So we do session SQL and we run the SQL that we run in Snowflake.

And you would see a success message that comes here. Let’s go on to insert some values and then now let’s do a select using the dot SQL method. Now you see that one final difference is that this method here, SQL did not have a dot collect at the end.

And when you don’t have a dot collect at the end and if you just run that method, it just gives you the type of the object that gets returned from the SQL will. So to emphasize on this concept that what you get back is a Snowpark data frame.

And I’ll talk about what exactly the Snowpark data frame, but the idea that if you remove the collect, you would get the type of the object which comes back, but if you do a collect, you actually get back the response.

From your SQL. So make sure that if you’re trying to talk to Snowflake using SQL, have a collect at the end to get the results back from this method. Now, as I said, what is the difference between a Snowpark data frame and the other data frame that we’re used to?

So we usually work with a Pandas data frame and people who work with frame than Pandas. It’s a very similar concept, but you still have to convert your Pandas data frame into your Snowflake data frame if you want to use a standard Pandas data frame.

So I’ll explain what that means. So let’s assume that we’ve got a table called sample product data that’s already pre created in Snowflake. We can read the data by calling the table method from a session object.

So we say session table and then you see there’s a method called dot show and this is something which is applicable with the Snowpark data frame. And in your Pandas data frame this would be something like dot head.

So these are the minor differences and if you try to do a dot head here, it would error out saying that it does not have a method called dot head because dot head will work with a Pandas data frame. So you can actually convert your Pandas data frame into a Snowpark data frame.

That’s exactly what we’ve done here. So if you look at this method, the DF underscore Pandas is a Pandas code from data frame. And if you look at the type of DF underscore table that we got up here, it’s actually a Snowflake Snowpark table that we get back from this particular method.

So all you have do is say DF underscore table two Pandas and that will convert the Snowpark table into a Pandas data frame and then you can work around and do all your processing that you could do with a Pandas data frame.

So if you’re comfortable working with Panda JFrame, you can take the input. From Snowpark SQL into the table which comes back and then convert that into a Pandas data frame by doing it to Pandas and that opens up flexibility for you to work in Python.

There are multiple ways of creating data frames in Snowpark, and I’ve just shown some of them. But if you look at the Snow park documentation, they’ve probably given seven different ways of creating data frames.

Snow powers data frames in Python here. So one way is something that we touched upon earlier that we call Session Table, where we just specify the name of the table that will return the table. The other way that you can actually run the SQL.

So you can just select name from sample product data and do a show. You’ll have your data frame like this, or you can actually create a structure so 1234 the values that are passed. But for that you’d have to call Create Data Frame and you’ve got to pass another method, call it to two underscore DF to specify the name of the column.

That’s another way of creating a data frame. Then there are other ways of passing the values in Create Data Frame and you can specify the name of the columns in Schema another way. And then there’s something called as a row where you can import the row method from Snowpark library.

And then you can say Create Data Frame where you specify row and map each value to the column. So they basically achieve the same result, but it’s a matter of using it as per your convenience and your requirements.

So before I go on to the use case of applying machine learning example to Snowpark, I’ll take a quick five-second break to see if there are any questions at this stage before I continue. Heather, I can’t see the chat.

Do we have any questions in the chat? No, we’re good for now. Okay, so let’s continue with applying machine learning with Snowpark. So what we’ve done is there is this example and data provided by Snowflake Labs.

This is an exhaustive example that they’ve given. We’ve taken some the finer elements from this reference for this demonstration. But in case if you are curious about the entire demo that they have done.

So as part of this demo, what they’ve done is they’ve talked about setting up stages, they’ve talked about setting up Snow Park, and they’ve also used this logic to reflect the data back on Streamlit.

So it’s the end-to-end solution for Snowpark by Snowflake Lab. So we’ll post this in our GitHub and as part of the session notes, feel free to reference this link. And based on the link, we’ve derived this example for today’s session.

So again, using the same method session SQL, you can first look at what’s the existing setup in your Snowflake that you’re working on. So you see that the current user, the current rule and the database, the schema, and the current version of Snowflake, and the current version of your Snowpower for Python version you can do by calling the SQL and then reading it in form of a list.

In fact, this is a matrix, you can look at the matrix and see what each value has. So a list inside a list, basically. And then after that, we touched upon this method table where we’re trying to read some form of campaign spend into a data frame.

For further processing. So we are trying to load some aggregated campaign spend data and this data will be used as a feature to perform linear regression and do a prediction on the revenue based on the set of features.

So that’s the use case we’re trying to come up with that based on the advertising spend. Can we predict the ROI prediction based on our features? So we are going to do a linear session to perform the prediction.

For this example here, before I go into the setup and the code, I just want to quickly touch upon that. In our Snowflake instance where we are reading from, we’ve already done some setup. So if you see these are public data that’s given to us from Snowflake Labs that you can try QR into, they’ve given some from dummy campaign spend, dummy monthly revenue and so on.

We’ve used this data for our predictions. And if you look at the code, what we’ve done is that we’ve created a CSV custom file format and then we’ve created an external stage to read from the S three bucket.

These are public SA buckets, so you can access them. And then we’ve copied them from the stage into tables. So we’ve copied from copied into campaign spend from the campaign spend similar stage and we’ve copied monthly revenue into monthly revenue table from the monthly revenue data stage.

So, whatever you see here are the tables that we’ve already pre created for this demo. And one important point which I’ll touch upon is that there are some stages, internal stages that we’ve created.

So one is called dash underscore procedures where we will go and store the store procedure that we will create as part of this demo. And then dash underscore models are the stage where we will load the models that we will have created.

So we will train a model and then we will load the model into a stage and we will use the loaded model to make predictions. For our requirement and then we will use a UDF to actually perform the prediction using the trained model.

Just keep in mind these are three different internal stages and when I get there I will again emphasize on these stages. So we read the campaign spend here. These are just some more Snowpark methods that you can see that for example if you want see what query is run in Snowflake.

So you see that when tries to read the table from table campaign pen it’s just doing a static star from campaign pen table. If you also want to see the query history so you can create an instance history and then you can actually specify what queries you want to track.

So for example when it says snow DF spin show it shows that this is the query that runs in the pattern select star from campaign spend limit ten and then it also gives you a query record. So this is useful when you want to do some form of debugging to see that what query was executed send and it’s just a mechanism to do debugging.

Now for our machine learning example, as I explained that we are trying to do ROI prediction and for that our input independent variables are going to be certain features. So these are the features that we are trying to create and these are various different form of spends.

So we will have some form of spend which is going to be spent per channel per month and then they will be spent across different channels and so on. So for the first feature we are just trying to find the total cost across channels which is grouped by on year and month.

So that’s called the total spend per child per month. That’s the feature that we’ve created here. Then our second feature is nothing but it’s just going to be our spend which is going to be pivoted for different US channels that we have which is aggregated across year and month. So we’ve got that across channels. And then third we’ve created another feature which is nothing, but it’s the total revenue which is grouped by your year and month.

And then in the end, we’ve just created one single table so that we have one single feature table where we can actually map our year, our month, different form of channels that we have, what’s the spend for each channel and what’s the revenue that we have for that particular month and the year.

So we have all our features in one single table and we will use that in our training below. So as you go to the model training part so what we’ve done is that the first step that we’ve removed any missing values.

So this is people coming from Pandas would understand that drop NA is where you drop any null missing values. We’re not interested in using any form of year and month. Again, if you’re not doing any form of time-space modeling or any form of time forecasting time modeling, then you don’t need to really use year and month because having year and month for non time modeling might sometimes ruin your output and predictions.

So those two columns have been dropped and we’re just trying to save those tables into a temporary table for now. So we’ve created marketing budget features and this is what the final set of features looks like for our machine learning.

Again, for linear regression, we are going to use the famous ScikitLearn models. So if you look at this function that’s been created to actually perform the train without going too much into detail, but we’ve done the standard setup.

So you’ve got all your models and all your libraries, you’ve got your linear regression model coming from ScikitLearn library, you’ve got your grid search and your train trace split, so on. And so coming from ScikitLearn, you’ve got your feature engineering polynomial features and standard scaling normalization coming from ScikitLearn and so on.

So the first step is that we’ve taken our features table, we’ve made it into a data frame, step number one. And after that, since we’ve got two kinds of features here, we’ve got numeric features, we just want to make sure that we perform some form of feature engineering on that.

So we’ve just tried to see if we use polynomial feature of the degree two, that means you would have the square of each feature along with the product of each feature as different input features for your and then we scale the value using the Gaussian scaling formula, using the standard scalar.

That’s done. We’ve tried to combine the two into a pipeline and then we’ve dropped the revenue because we have to predict revenue. So we will drop it from a feature. All the features go into the list x and our target variable, which revenue will go into y.

We’ve done a train and test split for 80-20. So zero point 280 20 with 80% for train, 20% for testing. And then we’ve done the grid search optimization and these values are something that we will pass when we call the function, but we do it for ten folds, standard values for doing cross validation folds.

And then this is where we actually fit the model. And then we calculate the R square scores for train and test and so on. But I think standard you must have seen your machine learning code. But the idea is that after you’ve run the code, you can actually go and put your saved model, your safe train model into the stage.

I was talking about as at the rate dash model. So we will use this train model later down below. To go ahead and then do a prediction on this model after that is done. So we’re just trying to call the same model with our features given these different parameters I just went through and then standard you can analyze your threshold values to see what’s there.

Okay, now, one important feature is that you can add go and register this entire execution to run as a stored procedure. So what we’re doing is that using the Snowpark capability, we’re saying that this particular training model, that means this function that we defined here, which is the train revenue prediction model, can be called as a stored procedure and that’s what the register method would do for you.

It will register your function as a stored procedure and these package will be installed when the stored procedure runs and it will go and store the out and the stored procedure will be stored in your dash underscore Sprocks stage that you have.

So let me show you what happened in Snowflake. So if you go and see its show procedures, you see, you should see our procedure that is available, that runs with these arguments. So it behaves like any store procedure you have in Snowflake.

So the important thing that we’ve defined the procedure here, but we are registering it in Snowflake. So once you have registered that as a stored procedure, we are calling that store procedure from session call where we pass a feature.

So this is very similar to what we see up here, just that this is being called wrong Snowflake directly that is done, you get the same output. And now once you’ve done the training, you have a model that is stored, it’s time to perform a test and do predictions on your train model.

So for that, what we have done is we’ve created a UDF where you can actually pass your session instance and you can pass your train model, you can pass your features and it would perform a prediction of the ROI for you.

So what we have done is for training, we’ve created some form of some values that we want to predict against that’s being passed to our UDF up here. One important point is that if you look at the two methods, which is session add, import, that is just adding the model that we trained up there, that it knows that it has to access this particular model and it knows that it has to add these packages for this prediction to work.

So that is you’ll have to specify what models and what anaconda packages you need to perform the prediction. And then we actually call the method to do a prediction and the output of the prediction is stored as created underscore ROI.

This is what it looks like. So we pass the values, we pass the features, and then you have created ROI which comes here. Now, this we have done in Snowflake. You can actually take this UDF and call that into your some form of UI.

You can call that in Tableau, you can call that in Looker, you call that in Streamlit and you have an end to end working machine learning production model using Snowpark. Yeah, I think this is what I wanted to cover and think.

I see a question in the Q and A. No, I’m only using Python. I am not using R for now. Whatever code that I’ve done as part of this session is completely in Python. In fact, that’s a good question that I don’t think there should be any form of support, at least for Snowpack.

Christian, correct me using R and Snowbuck so you could read data from Snowflake, like using Snowpack, put that into Python and then you can use the R libraries in Python in to do all of the artwork that you have to do.

But that’ll again, be like a hybrid solution for that hope, if that answers the question. Yeah, that was a good question. I think another question that might come up on a pause is when you register that store procedure, can you call that store procedure then inside of Snowflake with standard SQL at that point?

Right. So I think I would want to show what the output looks like if I can show what it’s going to be here. Yeah, if you look at this, I have the output that is there. Okay, I will do one thing. Christian, let me I should be having the output which runs here, but let me try to get that what it looks in Snowflake and show what the output would look as the prediction that works in the background.

Snowflake. Yeah, that’s a good question. That when your store procedure runs in Snowflake. What would that look like in Snowflake history? Yeah, well, I think even without showing the question is can you run it in Snowflake?

Right? Can you call that stored procedure with plain SQL? I think you said you could earlier. It’s just a standard Snowflake stored procedure object, correct? Right? Yes, right. That’s why I showed the definition also which showed the definition of Snowflake that you have for the source procedure.

Great. I don’t know if you have the ability to spin up the slide deck once more. I don’t think there’s that many more slides there, but if you can you guys see my screen again? Yes, we see discussions in Q and A.

Are there any more questions out there? Well, even if there aren’t questions, what’s always interesting in these discussions is to kind of hear from other folks what type of workloads they’re working on, if they’re coming from a different product suite, perhaps, and if Snowpark is something that they’ve heard about recently or if they’re currently using it or they plan on using it.

I think it is one of the newer features coming out of Snowflake, especially with complete support in now in public preview. So you’ve got Python, Java and Scala. I think that’s going to open up tons of workloads especially.

For coders who, again, Scala, Spark, but Java in particular, where that language has been around for a very, very long time, not going away anytime soon. And just to kind of see the type of developers and the type of use cases potentially with existing code that will really come to the forefront here, it looks like somebody’s used it within DBT.

Very cool. Yeah. And I think as far as one of the upcoming sessions, we’re planning on doing DBT integration. I know you can speak to that more than I can, Anapov, but the whole DBT Python integration, but any other questions or comments from anyone, hit the next slide there, Heather.

Yeah, so we’ll have some more events coming up soon, and you’ll get those out in your inbox from our meetup group, and you should be seeing them come through sometime this week, I’d say. Fantastic. Awesome.

Well, thank you guys for joining us tonight, and we look forward to seeing you next month towards the end of the month. We try to keep these every Monday, the last Monday of the month, but we hope to see you again next month and.

Please feel free to reach out in the discussion board with any topics you’re interested in hearing or if you’re interested in becoming a guest speaker or that kind of thing. Like Christian said earlier, we’re always looking for other people to contribute.

Thanks, everyone!

More to explorer

Snowflake Loading Data with Special Characters

July 24, 2024 No Comments

Special characters in your column names can cause chaos for downstream users, tools and processes.

Building a Generative AI Competency (or the First Gen AI Project)

July 21, 2024 No Comments

When Building a Generative AI Competency one must identify the necessary infrastructure, architecture, platform, and other resources and partners that can help an AI initiative be successful. We have just like many data warehouse and digital transformation initiatives over the last 20 years fail because of poor leadership, or companies only going half in on the objective.

clock, time management, time-3222267.jpg

Snowflake Time Travel Not the First Time Traveller but Let’s Review

June 12, 2024 No Comments

IBM DB2 long before Snowflake had this concept as did a few other select databases using a Temporal Database concept. I think Snowflake was able to make it more popular and mainstream due to the association of Data Warehousing and analytics specifically