Heather Pierson

Snowflake Query Optimization


The goal of query optimization in Snowflake is to increase the speed and efficiency of the queries you run in Snowflake. Beyond saving time, it helps Snowflake users make better use of their budget within the Data Cloud for things like data enrichment through Snowflake Marketplace.

In the Meetup event below, we explore various techniques and strategies to optimize query execution and maximize the utilization of Snowflake’s powerful features.

Check out the following links for more information:

  • https://teej.ghost.io/understanding-the-snowflake-query-profile/
  • https://docs.snowflake.com/en/user-guide/query-acceleration-service
  • https://www.datalakehouse.io/resources/blog/snowflake-usage-analytics-the-key-to-efficient-data-warehousing/
YouTube video


Transcript from the Meetup event:

Hello, everyone. Welcome to another awesome Carolinas Snowflake Meetup Group Event. Tonight, we will be going over Snowflake query optimization. Our previous meetups are recorded and they’re usually available within a few days on our YouTube channel. 

And if you would like to check out our YouTube channel, you can find a link in our discussion board in our Meetup group. So we’re going to go over a quick Snowflake refresher tonight and talk about Snowflake query optimization, the use cases and benefits. 

And then Anubhav is going to give us a quick demo, and then we’ll go over some Q and A if anybody has any questions. All right, so Snowflake is a cloud-based data warehousing platform for businesses. It’s known for its scalability, flexibility, and high performance. 

It supports AWS, Azure, and GCP, and it enables you to pull all of your data from all of your sources into the cloud for a single source of truth. And you can pair it with your favorite BI tool for analytics. 

So Snowflake is designed to automatically optimize query execution, but there are some techniques and things that you can do to further enhance query performance. And that’s what we’re going to be talking about today. 

And I will let Anubhav take it from here. Thank you, Heather. Okay, I think we will continue with this slide now. So as Heather just explained, there is some form of built-in optimization that's done in Snowflake, and we'll probably touch upon those topics as we go ahead. 

But then there is also some form of control and leverage that you have as a user that you can use to optimize your queries. So we've outlined five major areas where you can look at optimization approaches to improve your SQL, and given the scope and breadth of the content that you could cover on this topic, we've decided to focus on the last bullet, that is, using query profiling for optimization. 

That means using a query profile to fine-tune your queries and then leverage the optimization in Snowflake. But having said that, there are other approaches you could look at, such as reviewing the design of your underlying tables. 

For some of your queries, you could do simple SQL optimization itself within your joins, your selects, your order bys, and so on, and in some scenarios you could use some form of materialized view to improve the response. 

That depends on how quickly, or at what cadence, a refresh is required. So we'll probably have different sessions to go through those different topics, but for today we'll keep it down to query profiling and optimization. 

Heather, can you move to the next one, please? These are just some of the general benefits you could see while looking at optimization. Pretty standard, nothing new: you would definitely see improvement in performance, and that would translate into some form of cost savings and better budget allocation on Snowflake. 

We can move on, Heather, to the next. Okay, so before we start talking about the Snowflake Query Profile, I wanted to talk about, in terms of the concept, what the Query Profile does. And then after we go through the slides, we'll have a quick demo to see how we access the Query Profile in Snowflake, and then how we understand and decode a query using the Query Profile. 

So there’s a great blog post out there. We’ve added that as a reference to this presentation also. And if you guys want to dig deeper into what’s being covered here, feel free to go to that link that’s added in the presentation to read the blog post in detail. 

But you can think of the Query Profile as a graph, along with some stats detail mapped to the graph nodes. Every query is divided into sub-nodes, and as we go further, we'll talk about what each node is. 

But if you had to look at one unit node, look at the one we have on the slide, which is a table scan node. Whenever your query runs, the first step would be to read the data from your table. 

That could basically be a table scan, an external scan, or a table generator, based on what kind of data source you're reading from. But the idea is that your first node would be some form of scanning of data. And if you look at this diagram right here, you have a block which gives you the name of the operator. 

We'll talk about the operator names in the next slide. The block also shows the configuration, meaning the source: the name of the database, the schema, the table, and so on. The percentage basically tells you the amount of utilization, the time spent, or the effort spent for this query to run on this particular node. 

And we will see that when we get to the demo, to show how these numbers add up. And then if you look at this arrow, it basically tells you, after this step was completed by this node, the total number of rows that got sent to the next step for processing. 

So this helps you debug how many rows were evaluated at each step in your query, and it helps you understand which node in your query execution is the most expensive. And you can go and debug that section to see if you can optimize that particular node. 

So with this idea in mind, let's talk about operator names. Heather, can we go to the next slide, please? This is a general breakdown, there are some more elements to it, but when we try to debug or optimize our query, we basically divide it into these four major sections. 

So, as I explained, the first step whenever you run a query would basically be a data access operation; that could be a table scan, an external scan, and so on. But the idea is that it's going to read data from a source, which could be a table, a view, whatever you have, and load data accordingly. 

When we see the demo, we'll also quickly touch upon the partition pruning that's done by Snowflake automatically. And then after you're done reading your data, that's where Snowflake evaluates all your joins and any aggregate functions that you might have. 

So if you have any sum, median, or average, any of your window functions like dense rank or rank, or if you've done any flattening on your data, then you would see all of that being processed as the next step. 

And then once you've evaluated the joins, Snowflake will go and evaluate any data manipulation language operators you've used, so stuff like insertions, deletions, your copy commands, merges, updates, and so on; that's evaluated next by Snowflake. 

And then at the end, if you have any form of metadata, such as a database name, a warehouse name, or a constant, that's done right at the end, because that's usually the least expensive step in your SQL hierarchy, and it's usually the topmost step in your hierarchy as well. 

And we will see how this breaks down in the next slide. Can we move on to the next slide? Yeah. So if you look at an example, again, this link here has further details about the same topic, but we will go through a high-level explanation. 

So for example, if you have a query which selects some columns and then does some form of aggregation, a sum and so on, you see that the first step is usually some form of scanning. 

So that's the table scan, because we're reading from a table; but if you'd been reading from a stage instead of a table, it would be an external scan, or if you used a table generator, it would be a table generator scan. 

So that could vary, but the idea is the same. It looks at what the source is, what constraints you have, what fields you have, and evaluates that as the first step. 

So after the first step, say 1 million rows are sent to the next node, where it performs the aggregation on those 1 million rows. And that's where you see the next steps, count star, sum, and so on, evaluated. 

And that further gives you 10,000 rows. After you're done with that, you've got your sorting, and that's where your predicate comes into the picture, if you have any form of predicate that you can use to further reduce the result. 

So in this case, you've used a limit of 500, so you want to limit that 10,000 down to 500. And then at the end, from those 500, if you have any form of metadata, such as a database or warehouse reference at the top level, that gets evaluated. 

So that is a high-level picture of what really happens in the query profile. There are a lot more computations and permutations that can happen based on what constraints you've used and what your logic is, but they all fall under the same categories that we've discussed as part of this slide. 
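To make that concrete, here is a rough sketch of the kind of query that produces that shape of profile; the table and column names are purely illustrative, but the operator breakdown follows the pattern just described, a scan at the bottom, then the aggregation, then the sort and limit near the top:

    -- Hypothetical example: the orders table and its columns are made up for illustration.
    SELECT customer_id,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM   orders                      -- data access operator: TableScan
    GROUP BY customer_id               -- aggregate operator: COUNT, SUM
    ORDER BY total_amount DESC         -- sort operator
    LIMIT 500;                         -- limit applied near the top of the profile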

So, Heather, I think I'll share my screen now to take this explanation further and do a quick demo on the query profile. So I'll just go ahead and share my screen. Meanwhile, if you have any questions, we can take those. 

Now, do you see my Snowflake screen? Yeah, we see it. Okay, so I'll just wait a few seconds to see if anybody has any questions on this before I go on to explain or do a quick demo. Okay, so I think we're good on this. 

So how do we access the Query profile in Snowflake? So this is your classic console access to Snowflake. You would come to your history tab where you have all your history and then you can just go find the query that you want to analyze. 

Just click on the query ID. We know there's also a function, called get query results or something along those lines, but I think that's only in private preview right now. It's not in production for all scenarios, but it does the same thing: you can call that function, pass in the query ID, and it should give you the same stats that you see on the query profile. 
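For reference, the programmatic route being described looks roughly like the sketch below. This assumes the GET_QUERY_OPERATOR_STATS table function, whose exact name and availability may vary depending on your Snowflake release:

    -- Run a query, then ask for per-operator statistics on it.
    -- LAST_QUERY_ID() returns the ID of the previous query in this session;
    -- you can also pass in any query ID from your history.
    SELECT *
    FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));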

So your query profiler basically has two major tabs. One is the Details tab, which just tells you the status of the query and the bytes scanned, gives you a quick snapshot of the output, and gives you the Query ID and Session ID. 

But where the magic lies is actually in the Profile tab. As we discussed, it breaks the query down into smaller operations. So the first step is the table scan that's been done here. And now, if you look, these are the graph nodes that you have, which tell you which are the most expensive nodes you're looking at. 

It's good to know that 70% of the processing for this entire query execution was spent on scanning tables. That means we don't have very expensive functions further up, so that's one good thing. But you can also dig further to see, for example, the pruning, and I would avoid going too deep into pruning for now because I think we'll probably have a separate session on pruning in Snowflake. 

But for a quick explanation: what Snowflake does is create micro-partitions, and then it internally tries to optimize the process based on your data, deciding which partitions to access and which to prune out. So the idea is that if any form of pruning has been done, your partitions scanned should be less than the total partitions that are out there. 

Because if Snowflake has scanned a smaller number of partitions and has not read all of them, that means it was successful in cutting down the work. But for this particular scenario, it actually scanned through all the partitions. 

So we really could not leverage any advantage from pruning. But if you want to make sure you get the maximum advantage from pruning, there's a set of rules and guidelines you've got to follow, which we'll talk about in a separate session: how you can use those to make sure that pruning is done right for your data. 
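As a rough sketch of the idea, though (table and column names are hypothetical), a selective filter on a column the data is well organized by is what gives Snowflake the chance to skip micro-partitions, which then shows up in the profile as partitions scanned being lower than partitions total:

    -- Hypothetical example: if the table is naturally ordered or clustered by event_date,
    -- this filter lets Snowflake prune micro-partitions that fall outside the range.
    SELECT *
    FROM   events
    WHERE  event_date BETWEEN '2023-01-01' AND '2023-01-07';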

But having said that, you would have some form of pruning that gets done. The profile also tells you about the node execution time that you see here. 

So right now, you only see the remote disk I/O. But people familiar with Snowflake architecture understand that, given the three tiers, where you have the remote disk I/O, the local disk I/O, and then your compute nodes with their memory, there are more parameters that you would usually see. 

So, for example, if you have a large data set that cannot be evaluated within memory, it usually spills down to your local disk, and then it spills onto your remote disk. And the further you go from memory, the more time it takes to process the data.

So the profile also tells you the spillage across your local disk and remote disk and so on. That is also an indication of whether you need to expand your warehouse size.
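One way to spot that spillage outside of the profile page is the account usage query history; here's a minimal sketch, assuming you have access to the SNOWFLAKE.ACCOUNT_USAGE share (the warehouse name is hypothetical):

    -- Queries that spilled to local or remote storage are candidates for a larger
    -- warehouse or for being rewritten into smaller batches.
    SELECT query_id,
           warehouse_name,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM   snowflake.account_usage.query_history
    WHERE  warehouse_name = 'ANALYTICS_WH'
      AND  bytes_spilled_to_remote_storage > 0
    ORDER BY bytes_spilled_to_remote_storage DESC
    LIMIT 20;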

So, for example, with warehouses you have two concepts: scale in and scale out, and scale up and scale down. If you have a multi-cluster warehouse, you can specify multiple clusters to be used in the warehouse, or you can say, I want to expand the size of my warehouse from a small to a medium to a large. 
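To make the two levers concrete, here's a hedged sketch; the warehouse name is made up, and the multi-cluster settings assume an edition that supports multi-cluster warehouses:

    -- Scale up: give every query a bigger warehouse.
    ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

    -- Scale out: let the warehouse add clusters when queries start queuing.
    ALTER WAREHOUSE analytics_wh SET
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 3
      SCALING_POLICY = 'STANDARD';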

So this information, along with your data set information, can help you evaluate whether you need to expand your warehouse to make it faster and more optimized. But there's a caveat: increasing the size of the warehouse could also mean higher costs. 

And you've got to be careful not to run a warehouse size that's larger than what you need. But again, I think we'll probably cover cost separately. 

Then we've got the aggregate functions; the profile just tells you what attributes have been aggregated after the table scan. And then there's the sort you see; sorts are usually very fast because they don't really take that much time. 

And then window functions are the expensive ones. So that basically tells you what the steps are and where you spent most of your time. So I think, Heather, we can go back to the slides. Let me just share my screen again. 

Can you guys see my screen? Yeah. So we just have some common query problems here. Do you want to go over those, Anubhav? Absolutely, absolutely. Thank you for that, Heather. These are some issues that we've also experienced, and this also takes some from Snowflake's official documentation, as to what some of the common query problems are that they've encountered, and they've explained how to go about them. 

So the number one issue that people usually face is joins which look to be harmless, but if you look into them, they become cartesian products. So you might have joined two tables with 50 records and 50 records, but you joined them on the wrong column. 

You've forgotten to add a join condition on the right column, so you end up multiplying 50 by 50, which just blows it up and makes it more complex. So exploding joins are something you can look at as one issue, and the query profiler will actually help you see if you have an exploding join, because each from clause that you have would essentially be a table scan. 

So usually every from clause in your query would have a corresponding table scan node. 

And then when you join, you can see which nodes are getting joined, and if you see that a join leads to more rows than what was taken in as source, that could be an exploding join scenario. So that's something you could look at. 
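Here's a small hedged sketch of what that looks like; the tables and columns are hypothetical, and the only difference between the two versions is the missing half of the join key:

    -- Exploding join: joining only on region when the real key is (region, customer_id)
    -- turns a row-for-row match into something close to a cartesian product.
    SELECT o.order_id, c.customer_name
    FROM   orders o
    JOIN   customers c
      ON   o.region = c.region;             -- too coarse: many-to-many blow-up

    -- Fixed: join on the full key so each order matches a single customer.
    SELECT o.order_id, c.customer_name
    FROM   orders o
    JOIN   customers c
      ON   o.region      = c.region
     AND   o.customer_id = c.customer_id;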

The second scenario is usually UNION ALL versus UNION. If you just do a UNION ALL, it just adds up all the data that you have. You might have duplicate data, but it just appends the values and gives you whatever you have without really caring about duplicates. 

But if you just put a UNION, it's going to do duplicate elimination for you. So people might end up using UNION in a scenario where they don't need any duplicate elimination, where they know there won't be any duplicate data. 

But in that case, if you use a UNION instead of a UNION ALL, there will be the overhead of Snowflake performing that duplicate elimination for you. So if you know in your pipeline that you won't have any duplicate data, then you might as well use UNION ALL; that could be a faster approach than using UNION because it skips the duplicate elimination logic. 
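As a small sketch (the tables are hypothetical), the two statements below differ only in the keyword, but the UNION version adds a duplicate-elimination step to the query profile:

    -- UNION ALL: appends both inputs as-is, with no duplicate elimination.
    SELECT customer_id FROM orders_2022
    UNION ALL
    SELECT customer_id FROM orders_2023;

    -- UNION: same inputs, but Snowflake also has to deduplicate the result.
    SELECT customer_id FROM orders_2022
    UNION
    SELECT customer_id FROM orders_2023;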

Then the third issue, which we recently encountered for one of our clients, is queries that were too large to fit in memory. 

Even by expanding the warehouse, we had actually gone to a 5XL and even a 6XL to try to get it done, but it was still taking a long time for the data to process. So the other approach was to divide the work into smaller batches in your pipeline, instead of running it all together at one time. 

If there's a way that you can divide the work into smaller batches, that could be one way to solve your query problems around optimization. And then, as I said, we'll have another session on pruning, because it's a vast, complicated topic in itself that will take a while to explain. 

But the idea is that one rule of thumb is that if you see that your partitions scanned are less than your total partitions, then that means Snowflake is able to utilize some of the pruning logic that it has. 

But if not, if your partitions scanned are equal to your total partitions, then you're not really leveraging Snowflake's pruning advantage. So these are some of the common query problems you could encounter and work around. 

Recently, Snowflake has come up with the concept of query acceleration, which, again, just like the other topics, is a new service and is a huge topic in itself. But we wanted to talk about it because this is another way of boosting your queries. 

I use the word boosting because what it really does is help in scenarios where your queries might change and might have outliers. So let's assume that you run some form of warehouse where you know roughly 10,000 queries usually run, but in one month, or for some form of seasonal data, you get 500,000 records, which you know is going to be an outlier. 

So you don't want to expand your warehouse or change the settings you have just because you might have outliers in some of these scenarios. And what usually happens in Snowflake is that if you have a long-running query, that means if you have one large query and then multiple small queries, until that large query is completed by your warehouse, your small queries might be sitting in the queue and not getting executed. 

So that is not ideal, because you know that these small queries could probably get processed before going back to the larger query, but that is sometimes out of your control. So that's where query acceleration is one of the approaches you can look at to get a boost. 

So what it basically does is that you can specify your scale factor and then you enable it for your warehouse. So this is a warehouse-specific configuration where you tell Snowflake that I want my warehouse to implement query acceleration and then you can specify the factor. 

So, for example, you can understand the scale factor by imagining that you specify a scale factor of three and you have, let's say, a medium warehouse. There are certain considerations around when query acceleration gets turned on.

It does not get turned on for every query; there are certain queries for which query acceleration kicks in, and that's nicely defined in the Snowflake documentation, and we'll also talk about it.

We have a session coming up on query acceleration. So when the conditions are satisfied for query acceleration to kick in, it will multiply the warehouse by three. So you’ll have three medium warehouses running to kind of process that outlier query that you have and then once it processes, it automatically goes back to the original size. 

So it's like a boost that you can implement to automatically process your outlier queries. And one thing you might consider is: what's the difference between query acceleration and a multi-cluster warehouse? 

With a multi-cluster warehouse, you basically scale in and scale out; it does not really boost, and it requires you to change your policies and manage that manually. Query acceleration, by contrast, will automatically add a boost and automatically go back to your previous configuration without any manual intervention; it does the job for you. 

So if you go to the next slide, Heather, please. Just to explain: you can create a new warehouse with query acceleration, or if you have an existing warehouse, you can enable it by specifying these parameters, and you can provide the maximum scale factor. 
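Roughly, the parameters being described look like the sketch below; the warehouse name is hypothetical, and query acceleration availability depends on your Snowflake edition:

    -- Create a new warehouse with query acceleration enabled.
    CREATE WAREHOUSE accel_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      ENABLE_QUERY_ACCELERATION = TRUE
      QUERY_ACCELERATION_MAX_SCALE_FACTOR = 3;

    -- Or enable it on an existing warehouse.
    ALTER WAREHOUSE accel_wh SET
      ENABLE_QUERY_ACCELERATION = TRUE
      QUERY_ACCELERATION_MAX_SCALE_FACTOR = 3;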

So you might keep it at twice or thrice, but just be cognizant of the fact that bumping up the scale factor is also going to have a corresponding increase in your execution costs. For whatever time or minutes that query acceleration has been used by Snowflake, it's going to charge you based on that. 

And you would notice that if your warehouse has query acceleration enabled and you go to the query profile, it will tell you whether query acceleration was used for that particular query or not. You will see some data right at the bottom, where query acceleration shows how many partitions were scanned and how many partitions were selected for acceleration. 
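If you want to check a specific query outside the profile page, there is also a system function that estimates how much of a query was, or could have been, eligible for acceleration; a minimal sketch, assuming SYSTEM$ESTIMATE_QUERY_ACCELERATION is available on your account (the query ID is a placeholder):

    -- Returns a JSON estimate of the query's eligibility for query acceleration.
    SELECT SYSTEM$ESTIMATE_QUERY_ACCELERATION('<query_id>');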

So it'll give you an idea of whether you had query acceleration or not. That wraps up our demo for today, Heather; we've gone through the demo part, and I think that sums up the topics we had. 

Yeah, I went ahead and put the blog post we referenced and the Snowflake documentation in the chat for you guys if you’d like to check that out. And we’ll also have that in the description of the video when it’s posted. 

I do want to mention though, one thing we encourage for our clients is when they’re optimizing their Snowflake queries, it kind of helps them allocate their budget better so they can put that money towards enriching their data with third-party data from Snowflake Marketplace and things like that. 

You can do a lot more if you’re saving on your queries and I think that’s one of the key things here. Does anybody have any questions? You can just throw them in the chat. All right, well, if there are no questions, if you think of some later, feel free to go ahead and add those on a comment on the YouTube video or in our discussion board, and we would be more than happy to answer them. 

Thank you all for coming. We appreciate it. We look forward to seeing you at the next Snowflake event or Carolina Snowflake user group event. Excuse me. And you should be expecting an announcement for that sometime this week. 

Go check your inbox and stay tuned for that. Thanks, everyone. Mahindra had a question. Oh, sorry, I missed that; the hand was raised. Go ahead and type in your question, unless it was an accident, or feel free to unmute and talk. 

Did you have a question? All right. I see. Here we go. Am I audible? Yes, we can hear you. Yeah, I have a question. In the first slide, you showed the table scan, right? In a traditional database system, there are scans like table scan, index scan, something like that. 

There are plenty of scans out there, but in Snowflake, are there any other scans besides the table scan? What I mean to say is: will a table scan read all the data from the entire table, or all of the micro-partitions, something like that?

Or is my query going to qualify a few records from a few micro-partitions? Instead of scanning every micro-partition, will it scan just a few micro-partitions?

That is what's called an index seek in some other traditional database systems; is there something like that in Snowflake? So Mohan, great question. I'll divide the question into, I think, three major parts. 

Okay. Number one is in terms of indexes: unlike traditional databases like Postgres and Oracle, where you can specify indexes on your tables, in Snowflake we don't really specify an index per se on the data that we have. 

It doesn't really give you indexes or enforced primary keys per se in Snowflake. So what Snowflake does is that, based on the data you've set up, it tries to understand it and then automatically access the data in the most reasonable manner. 

And that's where your micro-partition and partitioning logic comes into the picture. Snowflake internally has logic where it divides the data that you've set up into micro-partitions. 

So when it's trying to read the data using a table scan or any other form, which I'll talk about now, it reads those micro-partitions and tries to get the best out of them. 

And I think if you look at the Snowflake documentation link that we shared on partition pruning and micro-partitions, it explains how you should be specifying your data, and how it should be loaded into Snowflake, if you want to make the most out of partition pruning and micro-partition reading. 

For the second and third parts: yes, instead of you specifying the scan, what happens is that based on the source of the data you're reading, Snowflake will choose whether it's a table scan or an external scan. 

So for example, you can also have external scan if you’re trying to read data using a copy command from an external stage. So you could have a stage which could read data from, let’s say, AWS or Azure or GCP. 

So in that case, it'll use an external scan. Then if you use a table generator, you have these built-in functions in Snowflake, the table generator, where you can specify what needs to be generated. 

If you’re using a table generator as a source, it will use a table generator scan to read that data. And I think there are a few more options that it uses instead of table scan depending on the source. 

So you can also use that, but it is primarily driven by the source, and Snowflake does the scanning for you based on the source that you're reading from. So it's a little more hands-off compared to a more traditional database, where you would have more control. 

Okay, I'll probably get more insight from reading the Snowflake documentation on pruning and micro-partitions. Absolutely. And if you have any specific questions, Mohan, I think we would be happy to answer them. 

You can post them on our channel. And I think we plan to have a session on pruning also, because that's one of the sessions we'll have in the future. Okay, sure. Thank you so much. 

Thank you. Thank you, everyone. Thank you. I'm just going to go ahead and share a series of blog posts out there about DataLakeHouse.io Snowflake Usage Analytics. I'll put that in the chat for everyone. 

It’ll also be in the video description. There’s our announcement for our launch of Usage Analytics for Snowflake in Datalakehouse.io, and we look forward to seeing you guys at our next event. Thank you all for joining. 

I could not copy the link. Can you please give me some time? Let me copy them one by one. Yeah, sure, go ahead. And I'll make sure to put them in our discussion board on the Meetup group, too, so you can get to them later. 

Okay, thank you so much. No problem. Mohan, I just added one more link for clustering, which is another approach where you can define a clustering key and clustering columns that basically help Snowflake understand how to micro-partition the data; a quick sketch of the syntax is below. 
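Here's a quick sketch of what defining that looks like; the table and columns are hypothetical:

    -- Define a clustering key so Snowflake co-locates rows with similar values,
    -- which improves micro-partition pruning for filters on these columns.
    ALTER TABLE events CLUSTER BY (event_date, region);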

So I've also added the link for that. And we'll also be covering clustering keys and clustering as part of a future micro-partition session. Okay. Thanks, Anubhav. Thank you. Thank you so much. Thank you, guys. 
