Heather Pierson

April 4, 2023
8:23 pm

Apache Iceberg in Snowflake

Apache Iceberg is a well-known open table format for large-scale analytics tables that can be created and queried using SQL. It was first created at Netflix and released as open-source software in 2018. Because of its adaptability, Apache Iceberg is a strong fit for analytical workloads.

We talk about Apache Iceberg in this session, both inside and outside of Snowflake, and how you can start using it right away to reduce the cost of object storage, deliver powerful analytics, and scale your cloud data warehouse.

Here are a couple of great blog posts where you can find more information:

https://www.snowflake.com/blog/iceberg-tables-powering-open-standards-with-snowflake-innovations/
https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/

Check out our Meetup video below to watch an overview of this event:

Transcript from the Meetup event:

Hello everyone. Welcome to another fabulous Carolina Snowflake Meetup Group event. Tonight we will be going over Apache Iceberg and Snowflake. All of our previous meetups are recorded and they are usually available within a couple of days on our YouTube channel.

First, we’ll start with a Snowflake refresher, then discuss Apache Iceberg, go over some use cases, give a demo, talk about our upcoming events, and answer any questions you guys might have.

All right, I think Mike will take it from here. Yes. So welcome everybody. Some of you are new to Snowflake and others are seasoned veterans. Welcome everybody. So just real quick, in case you’re new to Snowflake, think of Snowflake as a data cloud.

You have data all over the place: on-premise databases; applications, which could be ERP or third-party types of applications; and data sitting in SaaS-based solutions. You’re going to have device data, think of IoT, and you really want to bring all that data together.

That’s done through ETL/ELT or streaming. And then you’re trying to figure out what I’m going to do with this data. I need to transform it to make it useful to my different consumers out there. Some of the data you might aggregate; for other data you may just say, this is the raw data, data scientists, have at it.

And then naturally you’re going to build for some folks dashboards on top of that because you have your different data consumers. Those folks can be varying from technical to non-technical types of users, but you’re really having all the data in one spot. That data could be in various formats, but it’s all within Snowflake at the end of the day.

So, yeah, I’ll jump in here and talk about Apache Iceberg. So really, Apache Iceberg came out of a Netflix project. They’ve been creating great things over the last decade or so, with a great team over there open-sourcing some of their learnings and some of their productized development efforts.

And so they donated their concept, their framework of Apache Iceberg, over to the Apache Software Foundation, which is amazing. And they’ve open-sourced it, like I said. And it’s really an open-managed project, which means pretty much anybody can contribute to it and benefit from it.

And there’s actually a decent number of companies who are taking advantage of that legwork that Netflix did, including Snowflake. Other groups, like Tabular.io and Google, have been doing some cool things.

So it’s really an exciting platform, and it’s one of those frameworks, platforms, bits of technology that can be used ubiquitously across vendors. That’s great for customers of Snowflake, in that it’s a good way to move and transfer data across the interwebs.

We’re going to talk a little bit about this tonight, Apache Iceberg and how it works within Snowflake, at least currently. And a lot of this is under private preview, but interesting nonetheless.

And so really, this is kind of one of those safe harbor statements where your mileage may vary and anything that we’re showing here could potentially change. So when we talk about Apache Iceberg, it really does provide some great capabilities that we’ve come to really lean on, such as ACID compliance, as with databases and other types of storage mechanisms.

And then also it’s really great for being able to query data that lies within your data lake storage. And for those who’ve been spending some time with us here at the meetup, when we talk amongst friends, we don’t like to use the general term data lake by itself.

Usually it comes with something. So it’s either data lake analytics or data lake storage, which is where you’re storing your data, such as S3 or Google Cloud Storage. You might have a data lakehouse, and this might be part of your entire data lake analytics infrastructure.

This is about Apache Iceberg and we’re going to dive into some really cool fundamentals here. And so let’s look at a few use cases of where Apache Iceberg can be our friend. To start off, you could almost think of it as a compartmentalized mini database.

So these files for Apache Iceberg can be very large and that’s great because there is a bit of compression, but there’s also a metadata layer there. So when you think about that traditional database concept, you’re accessing files through your database.

That’s the mechanism, that’s the process. But really underlying that is some type of storage, whether that’s in memory or not. And so what Apache Iceberg is in essence doing is it’s giving us the concept of the database storage, if you will, and then you just simply need the means to pull the information out of the Apache Iceberg format.

But everything that you technically need there is there. So you’ve got a metadata file, you’ve got obviously the data, so forth and so on, which makes it really fast because you can go directly against the data and you have the information that you need to parse it however you wish.

So it’s really fast. You don’t have all these additional layers that you have to go through. And so when we think about the Iceberg tables as a use case in Snowflake, it’s really a great way to provide this external storage capability without having to necessarily be beholden to any particular type of vendor for storage.

And so that’s pretty interesting. And for Snowflake it gives them a lot of flexibility because it doesn’t matter technically, eventually, where that Iceberg file is sitting, you can access it through Snowflake.

And so obviously, if you attended Summit in 2022, there was lots of talk about using this external storage capability, where Snowflake is the conduit, the engine, and the data cloud that you want to go to in order to access any of your data, no matter where it is.

So that’s great. And I already mentioned external tables. Some of you guys are familiar with that already. But from the perspective of just having a data set elsewhere, that’s the concept of external tables.

We won’t go into too much detail about that, but we’ll talk a little bit more about the fundamentals of Iceberg and Snowflake. So when we think about Iceberg, we’re going to thank James Malone over at Snowflake for putting together a fantastic set of blogs, and we’re giving him some attribution down below, because the drawing of the Iceberg configuration really, again, comes from the Apache Foundation deployment of Iceberg.

And so you’ve got versioning, you’ve got metadata files, and then underlying that conceptually are multiple Parquet files, right? So we won’t go into great detail about Parquet in this meetup. I think that might require a deep dive of its own sometime.
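To picture the versioning and metadata idea described above, here is a minimal sketch in Python. It is a toy model only: the class names (`Snapshot`, `TableMetadata`) and the file names are made up for illustration, and the real Iceberg layout has more layers than this. The point it tries to show is that each new version of the table is a new list of data files, and older versions stay readable.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Snapshot:
    """One version of the table: the data files visible at that point."""
    snapshot_id: int
    data_files: List[str]


@dataclass
class TableMetadata:
    """Toy stand-in for a metadata file that tracks every snapshot."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, added: Sequence[str], removed: Sequence[str] = ()) -> Snapshot:
        # A new version never edits old files in place; it records a new file list.
        current = self.snapshots[-1].data_files if self.snapshots else []
        files = [f for f in current if f not in removed] + list(added)
        snap = Snapshot(snapshot_id=len(self.snapshots) + 1, data_files=files)
        self.snapshots.append(snap)
        return snap

    def files_as_of(self, snapshot_id: int) -> List[str]:
        # Reading an older version is just looking up an older file list.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)


# Hypothetical file names, purely for illustration.
iceberg_like = TableMetadata()
iceberg_like.commit(added=["part-0001.parquet"])
iceberg_like.commit(added=["part-0002.parquet"])
iceberg_like.commit(added=["part-0003.parquet"], removed=["part-0001.parquet"])
```

In this sketch, `iceberg_like.files_as_of(1)` still returns the original file even after later commits, which is the versioning behavior in miniature.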

But you can see here from the diagram, really, Snowflake is in lockstep with the Iceberg metadata concept. And I think we all already know that Parquet is a natively ingestible file format within Snowflake, in addition to JSON and Avro and so forth.

And so the Parquet format lends itself very well not only to Apache Iceberg, but also to what Snowflake has historically been capable of working with. And so having this Iceberg-formatted table in Snowflake really makes it a first-class citizen table along with everything else.

Why do we really draw attention to that first-class table concept? Because, if you think about it, you can then join with other tables. So you could have your Iceberg external table, and it can join cleanly with what you might think of as a standard table, a transient table, a temporary table, these other classes of tables inside of Snowflake.

And so that fundamentally allows a great amount of flexibility. It’s part of the full platform. So any of the core concepts that you can apply to tables mainly should work on an Iceberg table and then of course, just the interoperability of it.

All right, so Iceberg tables use the Iceberg metadata concept, which we mentioned, and the Parquet file concept. So when you’re using that data lake storage as your external system, you’re getting the benefits of not only Iceberg, Parquet, and the metadata, but also, when it’s stitched together inside of Snowflake, everything that Snowflake can do with tables to begin with.

So you get all these benefits from this interoperability. So it’s an amazing integration that’s now come along inside of Snowflake. And we’re really excited about it and we think you should be too. This slide might look a little crazy, and I think it is just a general overview.

I might just kind of gloss over this slide, but it just echoes what I just covered. Snowflake can handle all types of workloads with data. We know that already. And again, Apache Iceberg inside of Snowflake just continues, I think, Snowflake’s mission of combining private and on-premise data with public cloud data, really meshing those together so that there don’t seem to be any borders. I think that’s what I really get out of the slide without having to follow the decision tree.

But this is a nice slide. So you’ll have this video recording available, and then of course check out James Malone’s blog post on the topic. It’s really good. So when you think about it, how does Apache Iceberg benefit me or my company or your company or companies in general, all customers of Snowflake?

Well, again, it’s really allowing customers of Snowflake flexibility and control. A lot of companies want to go towards open standards. There’s nothing wrong with that. This gives them the ability to remove lock-in from their current vendor, potentially.

It allows their development team to have a little bit of flexibility. It allows machine learning engineers, data scientists, and developers to move information across those teams really quickly without being bogged down by too many limitations.

And so it’s very exciting, the type of control and access Iceberg is providing today and will continue to provide, especially in the Snowflake platform. Next, low cost. You can just think about it, right? You’re using data lake storage.

So any type of compliant bucket or object storage capability that’s interoperable, that can be accessed publicly or privately is going to allow that transfer of data to happen even more rapidly. And in particular, because you can access the data where it lies in its storage system.

So that’s leveraging all of what the cloud has to offer today by setting up different API keys, or IAM roles if you’re an AWS customer. And then for any customer that wants to use this concept, there’s obviously a whole slew of tools out there, open source and closed source, that allow access into Iceberg and Parquet.

So that’s only going to grow. So we’re really excited about what the future is going to offer data workers and data consumers, especially around Iceberg. So super exciting stuff. So if we think about some of the challenges that have been imposed on our data, a lot of times there are just general restrictions on formatting and what type of information goes where.

And if you structure a table potentially in a certain way, it might be difficult to query that data. So you bump into performance issues. And so when you think about some of the things Snowflake is doing, like auto schema detection, providing the ability to kind of move formats of data in and out, it really unlocks the capability that you have as you’re working with your data.

And so having different types of tools that can work with that data very quickly, where you can get notified of changes on data that is external to the Snowflake system, really, again, removes these barriers to working with all the different types of data that might come your way, for example when you’re buying external data from a third-party vendor and it comes to you in a Parquet or an Iceberg format.

I think the overall gist of it is that ultimately Snowflake will be able to handle that format or even transfer it into a different format that might be a little bit more compliant with something you’re working with all within the platform.

So, again, not to overuse the word, but it’s all very exciting stuff. Yeah. So there’s more and more talk about this concept of data mesh. We won’t go too much into a deep dive on data mesh; that’s really just opening a can of worms, and it could steer the conversation many, many different ways.

But I think at the end of the day, there are certain principles that the modern data stack and cloud data initiatives are leaning towards. And the open Iceberg format really allows, again, a lot of this flexibility for companies to have domain-driven ownership of the data.

You can think of that as your marketing department versus your finance department. Also, I mentioned a few minutes ago about data as a product, right? So potentially the Snowflake marketplace or some other marketplace will deliver their data to you as a consumer of that data in the format of an Apache Iceberg file, potentially password encrypted.

Right. So having this flexibility of the formatting will allow lots of new concepts like that to surface. And we think that the world, in general, is not taking advantage of data as a product or data as an asset, as much as it could be.

But we think that slope is going up and to the right as far as uptake, usage, and adoption of that concept. So we’re looking forward to seeing more of that self-service. Again, as far as Apache Iceberg, we can take that file and begin working on it very quickly.

I mentioned before that it kind of abstracts away this kind of database necessity layer, right? You’re able to take a file that has great data. It’s got the metadata there and there’s a slew of tools, think Python notebook, where you could query that Apache Iceberg file pretty much directly, quickly.

So you’re working inside of the infrastructure of the consumer, where they’re comfortable, right? And there are a lot of other things. Federated governance is probably one more, but there’s probably another good handful of principles out there for what you might think of as data mesh, right?

So your data is spread everywhere, but you have access to it in a very flexible way. All right, so that was our background, setting up a really short hands-on walkthrough, or let’s not call it hands-on, let’s just call it a quick demo.

So Iceberg is under private preview, so we’re just going to do a quick walk-through of that. So I’m going to try to narrate what’s happening here before I click play. So this is a Google Cloud storage bucket.

This could also be known as your data lake storage, because for the most part, when people talk about a data lake, they use that term incorrectly. We’re really talking about data lake storage. And so we’re using Google Cloud Storage for that.

And we have a folder here, park test, and then we’ve got a parquet file inside of it. So I’m going to go ahead and just kind of narrate this. So what I want you to notice is there’s just one file here right now, and everything else we’re going to do is going to be in Snowflake.

So just quickly stepping through this: we’re cleaning up some stuff left over from making sure it worked, and then stepping into the demo. We’re just setting our privileges. Right now, basically, all of Iceberg is using the ACCOUNTADMIN role.

We’re going to create a simple external volume that points to our bucket that I just showed. And then we’re going to create an integration as well that points also to that Google Cloud storage bucket.

And then here what we’re going to do is create a stage. So this is just the standard external stage in Snowflake that points again to our data lake storage, the Google Cloud bucket, and focuses on the Parquet format.

And then the next thing that we’re going to do is now we’re going to create that external table. So we’re creating an external table that is pointing to our parquet file. Now, the cool thing here, if you never created an external table, is that you can point to basically a prefix.

And so if you know a lot about cloud storage, the way object storage and data lake storage works is that a prefix could be a folder, or it could be any subset of files that start with a certain prefix.
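The prefix idea can be sketched in a few lines of Python. This is only an illustration with made-up object keys and a hypothetical helper name (`keys_under_prefix`); it does not call any cloud storage API. It shows that a prefix ending in a slash behaves like a folder, while a looser prefix can match any keys that happen to start with the same characters.

```python
def keys_under_prefix(keys, prefix):
    """Return the object keys that start with the given prefix, sorted."""
    return sorted(k for k in keys if k.startswith(prefix))


# Made-up object keys standing in for the contents of a bucket.
bucket_keys = [
    "sales/part-0001.parquet",
    "sales/part-0002.parquet",
    "sales_archive/part-0001.parquet",
    "other/readme.txt",
]

# A prefix ending in "/" acts like a folder.
folder_like = keys_under_prefix(bucket_keys, "sales/")

# A looser prefix also picks up "sales_archive/...".
loose = keys_under_prefix(bucket_keys, "sales")
```

Being deliberate about the trailing slash is one way to avoid pulling in files from a neighboring folder by accident.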

And so what we’re doing is pointing not to an individual file, but to a folder that contains a lot of the same kind of files. And then, in this line 61, we’re just pulling one of the columns out of that file or set of files, and we’ve recognized the format as Parquet.

And what we’re not doing here is auto-refreshing the data, so we don’t need a notification integration in this case. We’ll run that, and that’s going to create our external table right there. So now we can test to see if that external table is returning data.

And you can see that it is. So we’ve got data coming through from our external Parquet file, which is now a table. And now what we’re going to do is actually create our Iceberg table. We’re doing something a little differently here.

What we’re doing is we’re going to create our Iceberg table using our external table, which is pulling from our data lake storage, which is our Google Cloud Storage bucket. And so this will allow us to create our Iceberg table, again based on our set of Parquet files.

And you can see that happen pretty cleanly there too. And when we did that, you can see that what it’s actually done is it’s actually created one parquet file here. If the data was larger, it would create multiple parquet files for us.

And that is how Snowflake is referencing the Iceberg table’s underlying data set. And I think that’s reflected in that one slide we were showing earlier. And there you can see that we now have our data coming from our Iceberg table in Snowflake.

So really cool stuff. Well, I think what’s more is that Snowflake is still in the early stages of what’s possible with Apache Iceberg. And I think the gist is that Apache Iceberg is going to be one of those formats that really is game-changing for the modern data stack, for the Snowflake Data Cloud, and really for how consumers of data can use that format going forward.

So you can kind of already see that interoperability between external tables getting notified when an external table gets refreshed, and then, of course, how we’re using an external table to create an Iceberg table, which has the ability to go in and write back to our data lake storage, Google Cloud Storage in this case. So a lot going on. So if we were to take a poll, I think a lot of folks would say it’s really early to understand the benefits of Apache Iceberg.

I think one of these is going to really resonate. Does every organization need it? I think the quick answer is no, right? You don’t have to have Apache Iceberg. I think what’s going to happen is that a lot of the use cases, especially per industry or per volume of data that an organization might have, will start shining a light on Apache Iceberg for customers.

And I think Snowflake is really trying to shine that light for customers sooner rather than later and trying to be a leader in enabling that for customers. So it’s exciting. One thing we would throw out at AICG is that one cool thing it can do for you is provide deep cold storage capability.

And so if you’re familiar with AWS S3, they already have this cold storage concept, and it’s a less expensive option for storing your data. This is another one of those methods that could be used for cold storage, right?

Like you’ve got some native compression in that format. And then of course, as you’ve seen, you could go from your data that’s in Snowflake, which could be a regular table or an external table, and then you could create a table off of that, which would then just load into your storage.

You could almost do backups with your data if you think of it that way. And that could potentially reduce your storage cost depending on what types of volumes of data you’re working with and so forth.

And then, of course, easy transfer. If you think about your data science teams, a lot of the machine learning training data sets are tremendous in size. I like to think of the Kaggle competitions. That’s probably one super quick win for the world in data and data science right there.

And then, of course, there’s huge potential for migrating between not just cloud vendors, but potentially cloud vendor accounts. So you could think of moving loads of data between one account to another versus replicating data natively through that particular vendor’s application or platform.

So those are just a few things we think will start driving the use cases for people getting on the Apache Iceberg train. Very cool and exciting times. Yeah. So I just want to quickly talk about DataLakeHouse Snowflake usage analytics.

If you’re ever wondering exactly how your organization is using Snowflake, you can check out DataLakeHouse Snowflake usage analytics and get all kinds of cool details there and optimize your Snowflake usage in the future.

For those of you who don’t know, DataLakeHouse is an ELT end-to-end solution. No code whatsoever. We’ve got built-in industry-specific analytics and data models, and you can create a data catalog to keep everybody on the same page.

And so, yeah, it’s a very cool platform. Lots of different connectors out there, and the list is constantly growing by the day. And if you guys would like to join us on Slack, you can pick out the socks of your choice and we will send them to the first hundred people to join our Slack community.

And I will make sure I add the link for that into the chat, along with our YouTube channel, for you guys. And I will turn it over to Mike and Christian for the Q and A if anybody has any questions. Yeah, a couple have come through.

The first one is, let’s see, can data-level, role-level security be applied to Iceberg tables from within Snowflake? That is a great question, Mike. I do believe that is the case. Again, Iceberg is in private preview, so your mileage may vary based on the use case.

But rumor has it that all of the core functionality and features available to native tables will be available on the Iceberg table format and concept. All right, thank you. And I guess I’ll help answer the question as well, just from the perspective of the metadata layer that comes with Iceberg.

So think of it this way: the tables are still in their native cloud location, but you’re really bringing metadata into Snowflake that’s keeping track of all of these different pieces of information related to the files.

And then one of those is applying data-level security. So if you have concerns, say with PII data or HIPAA, you’re able to apply data masking so that, regardless of who the user is within Snowflake, what they’re looking at is masked.

And of course, if you have a use case that says some users should be able to see data and some shouldn’t, you can apply it much like you would on a traditional table within Snowflake. That’s a great point.
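As a rough picture of that masking idea, here is a small Python sketch. It is not how Snowflake implements masking; the function names, role names, column names, and the sample row are all made up for illustration. It only shows the general pattern: sensitive columns come back masked unless the reader’s role is on an allowed list.

```python
def mask(value):
    """Replace every character of the value with an asterisk."""
    return "*" * len(str(value))


def read_row(row, role, masked_columns, unmasked_roles):
    """Return the row as a given role would see it."""
    if role in unmasked_roles:
        # Roles on the allowed list see the real values.
        return dict(row)
    # Everyone else sees masked values in the sensitive columns only.
    return {col: (mask(val) if col in masked_columns else val)
            for col, val in row.items()}


# Made-up sample row and hypothetical role names.
patient = {"patient_id": 17, "name": "Jane Doe", "ssn": "123-45-6789"}
MASKED_COLUMNS = {"name", "ssn"}
UNMASKED_ROLES = {"COMPLIANCE_OFFICER"}

as_analyst = read_row(patient, "ANALYST", MASKED_COLUMNS, UNMASKED_ROLES)
as_officer = read_row(patient, "COMPLIANCE_OFFICER", MASKED_COLUMNS, UNMASKED_ROLES)
```

The same row yields two different views depending on the role, which is the "some users should see it and some shouldn’t" case from the answer above.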

One other piece here is you mentioned updating data. Is updating data also deleting data? Well, if you think about the interaction with an external table, which is really just an underlying file format concept on your object storage, it should be in parity.

And so whether you’re deleting... well, I’ll take that back. There is a read-only concept here, right? So when you look at the external table concept, there is a read-only concept there, right? I think the notification concept is really driven by change on the underlying external table, to then notify Snowflake that a change has been made to the underlying table format itself.

So that’s kind of a loaded question, right? Because external tables in and of themselves, I believe, are read-only. But the capability to write back to the bucket is apparent, as we’ve seen in this demo.

So I think it really depends on the use case. So don’t hold me to that one. I think this is one of those that probably needs to get tested out. Have you looked at this yourself, Mike? Do you know any further details on that question?

I do not, other than to say I’m not sure. If you were to go ahead and delete, is it really deleting only in the metadata, because it’s read-only in the bucket? Then Snowflake is just keeping track of deletes, but it’s not hard deleting anything in the source. Yeah. And I just confirmed, right? So external tables are, in fact, read-only. Right?
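To make the read-only point a bit more concrete, here is a toy Python model. It is an illustration under stated assumptions, not Snowflake’s behavior or API: the class name `ReadOnlyExternalTable`, its methods, and the file names are hypothetical. The idea is that the table only reflects whatever files are in storage, a refresh reports what changed there, and any attempt to modify rows through the table is rejected.

```python
class ReadOnlyExternalTable:
    """Toy model: a table that mirrors the files currently in storage."""

    def __init__(self, list_files):
        # list_files is any callable returning the current file names in storage.
        self._list_files = list_files
        self.files = []

    def refresh(self):
        """Re-read the file listing and report what was added or removed."""
        latest = sorted(self._list_files())
        added = [f for f in latest if f not in self.files]
        removed = [f for f in self.files if f not in latest]
        self.files = latest
        return added, removed

    def delete_rows(self, predicate):
        # Changes have to happen in storage; the table itself cannot write.
        raise PermissionError("external table is read-only")


# A plain list stands in for the bucket; file names are made up.
bucket = ["data/part-0001.parquet"]
ext_table = ReadOnlyExternalTable(lambda: bucket)

first_refresh = ext_table.refresh()
bucket.append("data/part-0002.parquet")  # a new file lands in storage
second_refresh = ext_table.refresh()
```

In this sketch the change originates in storage and the table merely learns about it on refresh, which matches the direction of the notification discussed here.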

So when we talk about that notification integration, that’s really going to be, hey, there was a file connected to your external table, and if you set up that notification, then it’s just notifying you like, hey, there’s a new whatever, right?

Something happened, something got modified. But it wouldn’t necessarily be coming directly from Snowflake in that case, because it’s read-only. Perfect. And then one final question.

What is a real-world use case, from a business perspective, for using Apache Iceberg? And I’ll go with that. Definitely. We’re working with a customer. They have several federal clients, or actually their clients have federal contractors. And so when it comes to data, they have to be able to control and have an audit log of what are all the applications and individuals in those applications that are interacting with these files.

You can’t just put a file in your application and say, well, whatever service wants to consume this can pull data into Snowflake or really anywhere. Snowflake is one of those potential targets. They have to be able to control and definitively say, here’s where the file is.

If I remove this file, for whatever reason, it is now not available to any application that would be leveraging the data that’s in that file. So really, it’s that command and control; from our perspective, that’s where we’re using Apache Iceberg.

Obviously there are others out there. Christian, if you have another one to add to that... or that’s a good one. No, not off the top of my head. No, that was a great one. All right. Those are good questions. Thank you, everybody, for dropping those.

So I’ve got one more. I know the answer to it, but I just want to put it out there for anybody who’s curious. If somebody were to be interested in trying Apache Iceberg tables in Snowflake, how would they go about that?

Mike, answer that one. Yes, absolutely. So reach out to your Snowflake sales rep. That individual will be able to navigate the appropriate channels within Snowflake itself. You do have to be a Snowflake customer in order to get access to this private preview here.

Awesome. Thanks, guys. But yeah, I would say just to piggyback on that, you do get access to it. And now you’re wondering, well, great, now I have access. How do I use it or how do I get working with it?

Definitely reach out. We’re happy to walk you through how it works and get it set up for you. Yeah, good point. So we’ve got some more events coming up. End of February, we’ll be going over Snowflake micro partitions.

And then in March, we will be talking about sending emails in Snowflake. And for anybody interested in where we got a lot of this information, if you want to reference it later, you can turn to the chat and see that those two articles from James Malone are in there, along with our YouTube channel and the link to join our DataLakeHouse Slack community.
