Mike Jelen

August 28, 2023
7:47 pm

Snowpark Architecture

Snowpark is a cutting-edge data platform that empowers data teams to work seamlessly with data, enabling faster analytics and insights. This meetup is designed for data enthusiasts, engineers, and professionals interested in exploring the architecture and functionality of Snowpark.

During this event, we had seasoned experts from the industry who provided an in-depth overview of Snowpark’s architecture, its key components, and how it revolutionizes the way data is processed and analyzed. Whether you are a seasoned data engineer or just starting your journey in the data world, this meetup promises valuable insights and networking opportunities with like-minded individuals. Don’t miss this chance to discover how Snowpark can unleash the true potential of your data projects.

Thank you to Andy Bouts, Senior Sales Engineer at Snowflake, for the Snowpark demo!

External Links

Snowflake Quickstart
Snowpark Quickstart (hands-on)

Transcript:

Hello everyone. Welcome to another Carolina Snowflake Meetup group event. Today we will be giving an overview of Snowpark architecture. Excuse me. We have a special guest today: Andy Bouts from Snowflake will be co-presenting with us.

Welcome, Andy! Thank you, I appreciate that.

All right, all of our Meetups are recorded and we put them on our YouTube channel. If you ever miss one or you want to go back to one to refer to anything, they are available on our YouTube channel, and you can find that in the Meetup group description.

First, we’re going to have a quick Snowflake refresher and we will talk about Snowpark, what it is, why you would use it and some of the benefits.

And we’ll be getting a cool demo from Andy and then we’ll talk about some upcoming events and answer any questions anyone might have.

We understand that on these calls we have folks that are new to the Snowflake journey and others that are seasoned veterans who have been using it for a while.

Because we have folks of varying levels, we just want to level set on understanding and lay a foundation around what Snowflake is at the end of the day. Snowflake is the world’s first data cloud, built from the ground up for the cloud. When we say for the cloud, we mean taking advantage of all the different services and only charging you for what you use.

Think of your data in various different data sources: on-prem, other third-party or cloud sources, and you may have log data. You integrate that data into the data cloud so that you can transform it and bring it together into either data marts or data warehouses, or at a level where you are able to do analytics.

And those analytics could be from BI tools or from your data science. There could be other applications that consume that data, but all that data sits within Snowflake at the end of the day. Next slide.

All right, pass it over to Andy to take us on a tour of Snowpark. Yeah. Thank you. So. Screen. Okay, perfect. So, as you mentioned, and on your last screen, Mike, it’s interesting to note that a number of those columns could be consolidated where instead of all the separate boxes and functions, those operations could be done directly on Snowflake.

So, as just a quick reminder to everyone, when Snowflake was founded in 2012, there were a couple of key design paradigms. One is, it just works. It needed to be optimized for performance at massive scale, at cloud service provider type of scale, and it needed to be globally connected to unlock additional features and new data sharing paradigms.

This is an overview of the architecture. Starting in the middle, where my mouse is, in the dark blue: Snowflake uses the cloud service provider’s backend services and has our own proprietary optimization and compression algorithms to handle any type of data, whether that’s unstructured, semi-structured or structured.

Secondly, in the middle shade of blue, we have our elastic multi-cluster compute warehouses, and these are logically separated from each other so that you can do highly concurrent operations without impacting anyone else using the Snowflake data cloud.

And then thirdly in the lighter blue circle is our Cloud services layer. These are the cloud native microservices that were created in 2012 and through the subsequent years as we went GA in 2015 to allow for unparalleled levels of performance optimization, cross cloud, cross account, cross database and schema management of data as well as unparalleled levels of high availability.

It is a secure-by-default, extremely governance-compliant platform where we keep track of all the metadata, and that allows consumers and users to unlock sharing and collaboration that just doesn’t exist in other platforms.

And in the background is a visualization of Snowflake accounts interacting with each other through direct data shares. As you can see, there are nodes and edges and links that have been made between them.

The architecture is unique because it allows you to scale up, out and across in unparalleled ways, creating compute warehouses instantly. It typically takes a second or two, sometimes less, oftentimes milliseconds, to create that unique compute warehouse and utilize it for you and your workload or your business operation.

I’m going kind of fast through here because I just wanted to level set with everyone. We’re getting into Snowpark momentarily. What I mean by unparalleled levels of concurrency is that we can work with one or more of these ELT/ETL providers, as well as first-party cloud services like Lambda and Glue, and listen to APIs.

And that can all happen with warehouses as small as an extra small or as large as a five extra large. This happens concurrently and regularly while users are consuming data, perhaps in a large or medium warehouse, using modern tools like Sigma or ThoughtSpot or Power BI or Tableau or Looker, to name a few.

We are provider and partner agnostic, so I don’t mean to diminish any specific partner. But while that data is loading and while that data is being operated on, data science or data transformation work could also be happening concurrently in its own isolated compute warehouse.

And this allows users of Snowflake to do the work that they need to do with little interruption, while they’re all working together on one globally consistent set of data, breaking down data silos. What we have also done, in addition to that, is release additional programming interfaces. When I first heard about Snowpark for Python, I thought that Snowflake was going to do what everyone else had done, which is bring the Spark engine, with either a Java runtime or a Python runtime, into the Snowflake boundary, and then we would manage Spark for you.

But that’s not what we did. What we did was something novel and unique. Instead of maintaining separate Spark engines with runtimes, data movement in and out of cloud storage, credential management and very loose governance, we created a single API that uses our highly optimized SQL processing engine, which we’ve been working on since 2012, and uses an API paradigm to unlock Java and Python alongside SQL.

There are no additional engines and no additional runtimes. You simply use the existing compute warehouses that you’ve been using to run SQL and issue a Python API call, very similar to PySpark, in order to work with data Pythonically in addition to SQL, or with Java or with Scala.

And so that same simple architecture remains, while unlocking additional languages to work with data. Traditionally, a SQL engineer would write a very straightforward query like this one: select star from products where ID equals one. If you wanted to work with that in a data frame, you could work with either a Scala, a Python or a Java data frame.

You issue a command like this in your data frame session: do a table operation on the table products with a filter where the ID column is equal to one. These are equivalent queries. What our API does is this: if you write the one on the left, and you prefer to write it in Python because you already have data frame operations on that table, and this Spark-style notation stays consistent and easy to follow as you continue to do aggregates and things, we run select star from products where ID equals one, and we do that in our distributed compute engine.
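As a rough illustration of the equivalence described here, the sketch below uses a toy class (`ToyFrame`, a hypothetical name, not part of Snowpark) that records data frame operations and only turns them into SQL text when asked. It is not how Snowpark is implemented; it only shows the idea that a data frame call chain and a SQL statement can describe the same query. In Snowpark for Python, the same filter would typically be written as `session.table("products").filter(col("id") == 1)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFrame:
    """Toy stand-in for a lazily evaluated data frame (not the Snowpark class)."""

    table: str
    predicates: tuple = ()

    def filter(self, predicate: str) -> "ToyFrame":
        # Only record the condition; nothing is executed at this point.
        return ToyFrame(self.table, self.predicates + (predicate,))

    def to_sql(self) -> str:
        # Build the SQL text that an engine would eventually run.
        sql = f"SELECT * FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql


products = ToyFrame("products").filter("id = 1")
print(products.to_sql())  # SELECT * FROM products WHERE id = 1
```

Because each call returns a new description of the query rather than data, further filters can be chained before any SQL text is produced.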

And so I’ll show you in my demo account what this looks like, and then we’ll jump into a hands-on lab. First of all, though, I do use single sign-on and multifactor authentication. So let me authenticate, and I’ll keep my phone out because I’ll need it again in a second.

Now, in Snowflake, what you’ll see is a Create Worksheet button. If you click on that, you’ll see an option for a Python Worksheet. This is new. It came out 9 to 12 months ago, about nine or ten months ago, and it lets you add a Python Worksheet.

And if you click this by default, what will happen is it will come up with a Python Worksheet for you out of the box that can be run and you can see it’s doing a data frame operation. So this data frame operation and everything is available for you with hover over text help, example help.

We even have examples at the bottom. In this one, I’m looking at all of the rows in the Information Schema packages view where the language equals Python. So I’m going to go ahead and kick this off real quick.

What we’re doing right now is spinning up a compute warehouse and adding the bits we need for a Python session. And then we’re passing that back and forth. Apologies, I needed to have a database.

Let me do that again. And then we’re processing that data. We’re looking through all of the information schema packages and we’re doing a data frame operation. So based on what I just showed you in slides, we would expect this to have an equivalent SQL query.

And in fact, it does. If I were to go here and look at the query ID and look at query history again, I’m moving kind of quickly. But what I did was in the result panel, I clicked on the query ID and that popped out a new tab for me.

You can see it’s got a Python worksheet here. And when I look at the query details, it actually executed this code for calling the Python worksheet as a stored procedure. And if I look at the rest of my query history, what you’ll see is the SQL query that got passed.

So this is the exact equivalent in terms of a SQL query to what we just ran in Python. You can see the results here are exactly what we saw on the other screen. So that’s cool. And it unlocks a lot of capability for us in terms of the ability to work with Python.

Now, I’m using my public role. But if I change to a role that has write access, like my window Retail Data engineer role, and change my database to one that I have write access to, what you’ll notice is that this Deploy button shows up.

I can actually create this as a stored procedure, including parameters, so I can parameterize that stored procedure and deploy it so that it can be called regularly, over and over again, and utilized in your workflow.

Okay, so this is cool. These are some of the data engineering parts of Snowpark and some ways that we can utilize it. What I’d like to go into next, briefly, and I realize this is going quickly, but the recording will be available.

This link is available. This is a public Quickstart. I would encourage anyone who would like to go and use it. You can do this in a trial account. You can do this in your company’s account. We have over 100 Quickstarts available at quickstarts.snowflake.com and we’re updating these regularly.

A good number of them were even updated in the last three to four weeks, throughout the month of August. This specific Quickstart was designed to use the Diamonds data set, and it comes preconfigured with three Jupyter notebooks to work with that data set. It uses Snowpark, including the data frame API that we just looked at along with UDFs and stored procedures, to do advanced ML training and deployment inside of Snowflake, with the data never leaving Snowflake.

Now, traditionally, ML architects have done this with libraries like Pandas, and you can absolutely use Pandas within Snowflake. But if you were to use Pandas within Snowflake, that breaks our design paradigm, because Pandas requires data to be local, to be operated on locally within the compute environment.

A key difference to the Snowpark data frame is that it’s designed for only server side execution. Snowflake side execution inside of Snowflake, with the data never leaving. Therefore, all of your governance and compliance protocols remain in effect.

So if you have a dynamic data masking or a row based access policy, two examples of governance policies, the Snowpark data frame will honor that. And all of that data will stay using those policies inside of Snowflake.

Now, Pandas will too, but only up to the point where you pull the data out of Snowflake. Once you pull it out of Snowflake and operate on it locally inside of a VM, or whatever compute you’re using with Pandas, the data governance policies are no longer enforced because they don’t exist on that VM or outside of Snowflake.

So let’s jump into it. The first thing we’re going to do is bring the data into Snowflake. And then after we bring this Diamonds data set into Snowflake, we’re going to do ML feature transformations and then finally training and deployment.

Now, with the time that we have left, I don’t think we’ll be able to get through all of those steps, but I can show them. Thanks, Heather, for putting the link to Quickstarts in the chat. And this is the specific link to this Quickstart in the chat.

Perfect. I might have only sent that to the panelists, so Heather, if you wouldn’t mind sending it to everyone. In order to do this effectively and efficiently, I’m going to use VS Code. I’ve used VS Code for years.

Perhaps it’s one of the better products that Microsoft has created. One of the reasons why I like it is the native Git integration, as well as the Snowflake extension. I’ve clicked on the Marketplace extensions tile here, and if I type Snowflake, I will see the extension that we’ve created, and it’s available for public use.

We maintain it and support it. And I’m going to utilize that today for this demo. So the first thing I’m going to do is continue to authenticate with my enterprise credentials, like I mentioned before.

So I’m going to open that up. It’s going to do single sign on and then a multifactor authentication. And now that I’ve authenticated, you can see that I’ll be able to explore my databases just like I was able to explore them in Snowflake itself.

In fact, I should be able to see examples of the stored... yep. So here is the Python stored procedure that I was just working with. This is an example of it that was put out there. Now, in order to connect and integrate with Snowflake, you can see that I have a connection file.

The connection file uses my demo account. I’m using external browser for single sign-on, and my email. So I’ve got that, and you can see here with Git that those are the two changes I made to the file. And so I’m going to open up the three Jupyter notebooks.
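A connection file of the kind described here might look roughly like the JSON in the sketch below. The account name and email are placeholders, and the helper `load_connection` is a hypothetical name introduced only for this illustration; the one value taken from the demo is the `externalbrowser` authenticator, which is the setting Snowflake's Python connector uses for browser-based single sign-on.

```python
import json

# Hypothetical connection file contents; account and user are placeholders.
CONNECTION_JSON = """
{
  "account": "my_demo_account",
  "user": "me@example.com",
  "authenticator": "externalbrowser"
}
"""

REQUIRED_KEYS = ("account", "user", "authenticator")


def load_connection(text: str) -> dict:
    """Parse connection parameters and check that the expected keys exist."""
    params = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise ValueError("connection file is missing: " + ", ".join(missing))
    return params


params = load_connection(CONNECTION_JSON)
print(sorted(params))

# With snowflake-snowpark-python installed, a dict like this is typically
# passed to Session.builder.configs(params).create() to open a session.
```

Keeping credentials out of the notebook and in a separate file like this also explains why Git shows the file as changed after the account and email are filled in.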

I’m sorry. The first thing I’m going to do is actually work through this. The purpose of the Snowflake extension is to be able to run SQL code inside of VS Code. And as you can see, I’m getting the query results right away.
And if I click back over, all of the query history is available for me. So I can click back through the various query history commands. I can execute these singly or all together. So I’ve now done that.

And I’m ready to open up the Python Jupyter notebook. I’ve got my environment already set up at Python 3.9, following the instructions. I’m going to go ahead and run this to initiate that Python kernel and import some libraries.

And then I’m going to go ahead and kick off the next one. As soon as that’s done, it’s going to do authentication. It’s going to pop me back over here to single sign-on. It’s going to do MFA. And I’ve got a session now with Snowflake using my credentials, using this database, with the latest version of Snowflake at 7.29 and the latest version of this connector at 1.5.1.

So now that that’s in place, I’ll be able... sorry, excuse me, I meant to hit Shift+Enter. I’m going to go through and work with this Diamonds data set. I’m going to use my session. I’m going to put the Diamonds data set into a data frame.

It’s coming from this S3 bucket. Here are some characteristics of it. And then I’m going to do a summarization of that. This is how easy it is to work in Python with data in Snowflake. Once this data got loaded, when I did this session read and read the data into the data frame, all I had to do was a show and a describe, and those are operations that happened in the back end on Snowflake.

The only thing that came back to VS Code is the summarization of the output. This remote compute paradigm is extremely powerful: I can do things like data correction in Snowflake with the data never leaving, and all I get back is the results.

I can even do array operations, rename columns and do other data frame operations in order to get the data ready to work with. And what we’ve also unlocked, and I know this will be the last thing I’m able to show with the time that we have left, is that we can do advanced ML modeling, preprocessing, one-hot encoding and other data transformation or feature engineering inside Snowflake with the data never leaving.

So again I’ll go ahead and do my authentication, I’ll load the libraries that I need, accept my MFA, and now I’ve got another session, so I can go ahead and set a parameter to my demo table.

Now, when I’ve done this, Snowpark is a lazily executed paradigm, like other big data processing engines, and so I actually haven’t done anything yet. All I’ve done is establish some variables and define them. In this last part, we’re going to use a min-max scaler to actually do feature engineering on this and do some fitting of that.
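For readers unfamiliar with min-max scaling, the sketch below shows only the arithmetic in plain Python: each value is mapped to the range 0 to 1 using the column's minimum and maximum. The function name `min_max_scale` and the sample carat values are assumptions made for illustration; in the demo this step runs inside Snowflake through the Quickstart's preprocessing library, not in local Python like this.

```python
def min_max_scale(values):
    """Rescale numbers to the range [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical carat values, chosen so the arithmetic is easy to follow.
carats = [0.25, 0.5, 1.0, 1.25]
scaled = min_max_scale(carats)
print(scaled)  # [0.0, 0.25, 0.75, 1.0]
```

"Fitting" the scaler corresponds to computing the minimum and maximum once, so the same transformation can be applied again later to new rows.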

And then the last thing I’ll have time to show you today is using our ordinal encoder to do actual encoding of the data and operating on that. That provides this encoded index here that we can see, as well as some other characteristics of the data in terms of feature engineering. These two cut and clarity ordinal encodings are what resulted from that operation.
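Ordinal encoding replaces each category with its position in an ordered list. The sketch below shows the idea in plain Python for the cut column. The category order in `CUT_ORDER` is an assumption based on the commonly published Diamonds data set rather than something stated in the talk, and `ordinal_encode` is a hypothetical helper, not the encoder used in the demo.

```python
# Assumed quality order for the cut column, from worst to best.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]


def ordinal_encode(values, categories):
    """Map each value to the index of its category in the ordered list."""
    index = {category: position for position, category in enumerate(categories)}
    # An unknown category raises KeyError instead of being silently encoded.
    return [index[value] for value in values]


cuts = ["Ideal", "Fair", "Premium", "Good"]
encoded = ordinal_encode(cuts, CUT_ORDER)
print(encoded)  # [4, 0, 3, 1]
```

Unlike one-hot encoding, this keeps a single numeric column and preserves the ranking between categories, which suits graded attributes such as cut and clarity.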

So, Heather, I understand that was a really fast overview. I did all that in about 25 minutes. It was, but you did a great job. Thank you. If there are any quick questions, I’m happy to cover those. Otherwise, I appreciate everyone’s time.

Andy, thank you. There are a couple of questions that have come in. I’ll read them one by one and give you a chance to respond to them, if you’re good with that. Yes. “We are a Hadoop shop considering Snowflake. We have to wait up to ten minutes for nodes to start before executing code today. Does Snowflake have a similar annoyance?”

No, absolutely not. When I used to work on Spark clusters, I’d have to wait 25 minutes for a cluster. In my demo today, I used two or three different compute warehouses for Snowflake. Those all started up within a second or two.

So I started two or three different compute engines during that demo today, and I got it all done in 25 minutes. Perfect.

Next question: “We leverage a lot of small files and some large file data sets. Do we have to manage and organize all these files, or is there a better way to access our 10,000-plus files?” Yeah, I think myself as well as the AICG team would love to talk to you more about that.

The short answer is, I can think of three or four different ways Snowflake could make your life significantly easier. Many of my customers are moving those file-based operations into Snowflake and letting Snowflake be their raw layer.

So, Mike and Heather, I suppose maybe you could follow up with them later, but I think we have a great opportunity to make your life better. 100%, yeah, absolutely. We’ll follow up. And then the final question we have so far: do we have to manage partitions and garbage collection?

Absolutely not. In one of the first slides where I showed our modern architecture, there is no partitioning, there’s no vacuuming, there’s no garbage collection. Snowflake manages all of that for you.

In fact, you don’t even set up indices on database tables. We do all of that for you automatically through our query optimization and other management services. All right. Very good. Thank you. Those are all the questions we have for now.

If we didn’t get to a question or somebody didn’t want to raise their hand, feel free to reach out to us directly. We’re always happy to have no-obligation conversations and go from there. So, Heather, I’ll turn it back over to you.

Have a great day. Thank you so much, Andy, for your time. That was an excellent presentation, and we hope to have you back on for some more in-depth Snowpark adventures. You’re welcome. Have a good day.

Thank you. You too. So, just to call out some upcoming events, you can find an event near you. You don’t have to be local to the Carolinas. There is one coming up in Raleigh on September 7. And I am going to share the link with everyone if you’re interested in finding an event near you.

And we appreciate everyone coming out today, and we hope to see you at our next event. Yes. Our next Meetup event will be a deeper dive into Snowpark. Yeah. And you should see an invite for that coming out within this week sometime, and we’ll have that out through Meetup.

Awesome. Thank you, everyone, and we hope you have a great day. Thank you. Bye.
