Mike Jelen

Loading Data into Snowflake with Python

Python is a popular programming language, but not everyone is well-versed in how to work with Python and data. Python is easy to use, has familiar programming structures, works well with notebooks (Jupyter, Zeppelin, etc.), and has lots of libraries so you are not starting from scratch.

Check out our Meetup video to watch a deeper dive into this topic.

Transcript from the Meetup event:
We're loading data into Snowflake with Python. Really, this is more like interactivity with Snowflake using Python, just taking a programmatic view of things, and we came up with something interesting just for the Meetups.

They're virtual, so we want to do a little bit of virtual rule setting here just for everybody to stay sane. Really, we're just asking everyone to be respectful of all the other participants.

If you jump in here, try to keep yourself on mute. We’re not regulating that hardcore right now. We try to keep it a really collaborative, social sort of session. So just make sure you mute and respect anybody else. 

And then of course, if you have a question, definitely go ahead and throw it in the chat. We’ve got some folks from the Carolina Snowflake team here who are monitoring the chat. So anything comes up, we’ll make sure that conversations get started and questions get answered, all those types of things.

And during the session we’re kind of like, well, let’s ask questions that are relevant to what we’re discussing. And then we’ve got really some open discussion at the end where we can jump into anything, any topic anybody wants to talk about.

And turning your video on is completely optional. And we highly recommend you invite a colleague or a friend to join us at one of the next meetups. We think we're having fun, and everything's going to eventually coalesce into having these in real life at a restaurant or a pub or some other outing.

So we’re really trying to create a really good community here in the Carolinas, maybe even in the whole country, but focusing in the Carolinas for now around Snowflake and working with Data in general. 

And then lastly, try to be social. You can ping us at DataLakeHouse on Twitter as well as LinkedIn, and my Twitter and LinkedIn are somewhere in the slide deck, so you can always ping me. We've got Heather Pearson, and we've got Sudeep Apisani from the team as well on the line.

So really, any questions about Snowflake or working with it, I’m sure we can answer for you. So just a super quick refresher. I know everybody knows this, but every now and then we get people who are new to the meet up. 

So just doing a quick review: what is Snowflake? Well, they've positioned themselves as the data cloud. They started off as the cloud data warehouse.

And really, they’re a great solution in the cloud, using that elastic capability of cloud computing for ingesting data at all different levels of almost infinite scale, accessing data at almost infinite scale, providing that governance layer. 

And really, that's just that full data layer, especially when it comes to data warehousing or building data vaults or doing ELT, and that's structured and unstructured data. And so it's a great platform.

We’re big fans, and ultimately a lot of organizations have problems that they need to solve around, hey, we have a lot of data from many disparate data sources. We’re looking to get efficiency across the board from ingress and egress.

We need to do some transformations, we need to be fast, and we have consumers of all different walks of life, and we need that all to be handled in a single platform. So for the most part, Snowflake is that platform.

And there’s a lot of things on the ancillary to that, or on the periphery rather. And we tend to highlight those as much as possible in these meetups, as well as dive into a lot of the inner workings of Snowflake where we can.

So with that being said, just talking about Python. Again, Python is one of the more popular programming languages these days, but as we find, and maybe you find too, not everybody who works with data is a programmer.

And they might be an Excel master, but they might not be a programmer, for example. Just stepping back a bit: what is Python? Well, without going into super lengthy technical detail, Python is a programming language, right?

Honestly, it's similar to some of the other ones that are out there. Maybe you've heard of them, or maybe you took computer science back at university, or just by being in or around technology you're aware of other languages like PHP, C#, VB.NET, and Java.

In theory, JavaScript also could fall in that bucket these days for a lot of the front-end stuff as well. But also, Snowflake uses JavaScript for developing their stored procedures, if you didn't know that. A lot of people like Python because it's easy to use and it's got some good, similar programming structures; if you're coming from VBScript or any other programming language like JavaScript or Java, it flows in a very similar fashion.

Obviously constructs and things like that are a little different, which makes Python unique. It's an interpreted language. And then one of the things that people like about Python, especially these days with the data scientists, or people who like to call themselves data scientists, or this term of data engineers, is that something called notebooks was introduced: Jupyter notebooks.

There are a few other flavors out there. Jupyter seems to be the most popular, and it's really accelerated the use of Python over the years, as you can kind of see in this chart. I should have done a nice map point there for when Jupyter notebooks and Anaconda and all these things came out.

Did that really accelerate Python, or not? Something to look into. But really, it's a great platform that eases the development of writing Python code, connecting to data sources, doing machine learning, saving that whole block of waterfall code off, and then reusing and repurposing and automating it.

And Python has lots of different libraries, just like many of the other popular languages. So you can kind of bring in all the libraries you need to use. And a lot of these are really amazing. Most of them, if not all of them, are open source.

So it’s really easy language to get in there and learn stuff. And then, of course, there’s all this cool stuff, like you can use Python to do, like, web scraping and all the way to deep fake detection and a lot of great stuff.

And so, as you can see in the chart, it's really accelerated. This chart comes from Stack Overflow. If you're familiar with Stack Overflow, I think I'm in there once or twice a day checking on errors and issues and bugs.

It's a great platform, and they have a lot of the data to measure how many searches are done for Python per day and things like that. So it's a fairly credible resource for something like that.

And then, of course, if you’re sitting around a campfire and you’re talking about, hey, how are people using Python today? Well, I think it’s being used. I won’t say ubiquitously, but it’s pretty close. 

We see it in universities; we saw a lot of university hires over the years. Website development: you can build an entire web app, and there are several out there doing well. I don't know if they're unicorns or not, but you can build an entire front end and back end of your website basically with Python and JavaScript, for the most part, plus HTML and CSS.

And then, of course, you can even build pure command-line interfaces off of Python. I mentioned automation and scheduling tools earlier. So really, when you think about where Python can be used: all over the place.

Just like some of the other, perhaps even more well-known languages. And then, like I said earlier, it's used by all the data nerds that are out there: data scientists, machine learning practitioners, analysts, engineers, and so forth.

Data analysts, too. We even have a client where the CFO knows Python and is doing data wrangling and data analysis with Python. So it's pretty straightforward, pretty easy to use once you grasp it, and you can do some really cool things.

And then just lastly, highlighting Python by the numbers again: it's had a lot of growth in comparison to some of the other languages that might be either stagnant or decreasing in popularity.

As you see, PHP there is on the downtrend. And then on the left, if you compare it to some of the other smaller technologies that are out there, like Rust or Scala or Swift, which I think is the Apple programming language, or even R, which is kind of a data science statistician language.

Python is just really in a situation where it's taken off. So let me just bring this up into presenter mode here. So let's get into Python and Snowflake. Before we start, any questions? I know we've got some folks out there that know about Python.

Maybe it's a good point to break, because once we get into the code, we'll start getting pretty technical. If you have any questions: that might have been a pretty fast intro to Python and Snowflake.

But I have a question. What would you say is the easiest programming language to learn?

That's a head-scratcher right there, honestly. If I think about the programming language that would be the most useful across the board: from building a website, front end and back end, which Python could technically do, to being very useful for running simple stuff like people do on their own machines. Like, I want to run a script that changes the extension of all file names from XLS to XLSX, these very functional things that might be on my machine, or even pulling in a spreadsheet and doing some math on it.
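As an aside, that file-extension example really is only a few lines of Python. Here's a minimal sketch using only the standard library; the folder path and function name are hypothetical:

```python
from pathlib import Path

def rename_xls_to_xlsx(folder: str) -> int:
    """Rename every .xls file in `folder` to .xlsx; returns how many were renamed."""
    count = 0
    for path in Path(folder).glob("*.xls"):
        path.rename(path.with_suffix(".xlsx"))  # same name, new extension
        count += 1
    return count
```

Try it against a scratch folder first; `rename` changes files in place.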

I'm a big fan of Java myself, which also means, almost by nature, I'm a big fan of Scala and Spark and everything. But I'm going to have to say that I've been using Python, and actually there's a version of Python called Jython.

It's a Java-based version with kind of the same standards and declarations and things. So I've been using a combination of those two for, I think, about ten years, if I'm calculating that correctly.

And I've been using Java for 15 years, and all the .NET languages since they came out. So with that long-winded explanation, I'm going to say Python. At this point, I think it's the easiest one to use, if that was the question.

And I think it’s the one that can provide the most benefit across so many different areas of use cases. Yeah, like I said, if you want to scrape a website and pull something, if you want to log into a website, remotely connect to it, go to a page, pull the data out, do something with that metadata from a page like Python, you can do it in like less than maybe 200 lines of code. 

So I think I’m going to go Python. Very apropos for today’s session, I think. Alright, so if everyone can see the slide here and the screen okay, then I will carry on. As a matter of fact, let me switch something really quickly. 

Let me just pause this here. So go into kind of stop mode here. And then I’m going to present just that slide because I have to jump off of it. Let’s do just that slide. I think this is what I’m missing. 

All right. Hopefully it’s better, a little more full screen. And I’m happy to jump right out of it to do the kind of the real time analysis here. So what do we want to talk about here? So in this meet up and again, we’ll keep it as interactive as we can.

So what do we want to talk about when we're talking about Python? Well, first of all, Python doesn't equal Snowflake. I actually had that question earlier, before we started doing this session. Someone said, well, is Snowflake built on Python? What's the difference?

Why would you even use it? So I think I gave a little bit of a reason why you might use Python to connect with Snowflake. But Snowflake is its own standalone platform-as-a-service, or software-as-a-service, entity.

You sign into it on your desktop or mobile and you log in with username and password or another authorization mechanism, and you're able to, let's just call it, work with your data. There's a user interface, and then Snowflake provides SDKs and APIs, to keep it simple, so that you can connect through a client tool or client programming language like Python or Java.

And then you can do something with the data that you have, or metadata that you're working with or have stored in Snowflake, in your Snowflake account. So you would get your account credentials, and you would take those account credentials and use your programming language.

In this case we'll use Python. And then you'll connect to Snowflake as, let's just call it simply, a repository. And then from that repository you can use that data as you're connected with your programming language, and then use all the accoutrements of your programming language to do things.

So if you're a pandas data frames guru, you can bring some of that data down from Snowflake into your local Python environment using data frames and do all the things that data frames do with that data. Obviously you can iterate through the data.

You can bring it down in a batch. You can do some things right. So what we’re going to go through today is just going to look at saying, hey, Python is a nice approach to do that, and you can even do it on your laptop. 

Obviously there’s levels of scale. So you could use like elastic computing and like Google Cloud Platform or Azure or AWS, and you could get a virtual machine or you could use lambda functions to connect to Snowflake and work your magic there.

We’ll look at a couple of ways you can connect. We’ll probably just stick with one in the session today, but if you have a question, we can probably answer that and then we’ll just execute commands. So I’ll be working remotely from my laptop and kind of like live demo mode connecting to our group Snowflake instance and work with some data there.

And then we’ll try to push some data and test some data and stuff like that. So break it down simply. We’ll get our account, we’ll connect to it, we’ll create some stuff, we’ll check some stuff and we’ll load some stuff.

Keep it simple. I did mention Jupyter, having a Jupyter notebook. A lot of people don't know this, and I found this the other day working with a really cool customer of ours: the team of data scientists had actually been working all on their laptops, and everybody has a Jupyter notebook on their laptop.

And actually there's something called JupyterLab now, which is kind of the successor of Jupyter Notebook, but same concept. And then there's something called JupyterHub, and JupyterHub is like a centralized Jupyter notebook platform.

It's really for collaboration. So you can deploy JupyterHub onto a server, whether that be in the cloud or on premise, and then it basically acts as a web app where you can set up multiple users, and each one of those users gets their own kind of notebook server.

So you could, in theory, do all the stuff you're doing on your laptop if you're using Jupyter Notebook or JupyterLab, but you can do it centralized as a company. We use it at Aseg and for DataLakeHouse.

And like I said, it's pretty cool and pretty straightforward to set up, and it just allows everything to be centralized. You can still work from your local laptop on your own instance, and then you can save off your .ipynb files, push those up to the server, and then share access to them.

That way you can always use GitHub or whatever your source control management is for that. And then, before I jump into it, the other question we had was around: if we do connect from our laptop, or from our server on Amazon Web Services or Google Cloud Platform, how can we protect our Snowflake instance?

So you might not be the administrator of your Snowflake instance for your account, for your company, or for your team or whatnot, but the administrator should know this: you can set up a network policy.

And the network policy will allow you to restrict, or what they call whitelist, IP addresses, so that only certain machines, computers, or people calling from certain IP addresses can actually connect to your Snowflake instance.

And that’s super important, obviously, especially with all the hacking that’s going on in the world. And people just pinging servers to see if they can get access and use dictionary attacks on passwords and such.

It's only a matter of time before somebody can break into someone else's systems. Snowflake does a great job, you know, it's a platform as a service, but just in case you want extra layers of protection and security, network policy is awesome.

So again, most people aren't admins of their systems. So if you're not sure your admin has done that, now you can bring something to them and let them know you know stuff about Snowflake.
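For reference, the network policy described here comes down to a couple of SQL statements, which an appropriately privileged admin could run from Python like any other command. A sketch with a placeholder policy name and a documentation-only IP range; replace both with your own values:

```python
# Hypothetical policy name; 203.0.113.0/24 is a reserved documentation range.
CREATE_POLICY_SQL = """
CREATE NETWORK POLICY office_only
  ALLOWED_IP_LIST = ('203.0.113.0/24')
"""

# Applying it account-wide requires an appropriately privileged role.
APPLY_POLICY_SQL = "ALTER ACCOUNT SET NETWORK_POLICY = office_only"

# e.g. cur.execute(CREATE_POLICY_SQL) and then cur.execute(APPLY_POLICY_SQL)
```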

All right, so let's go ahead and get started on actually connecting Python to Snowflake. So I'm going to switch screens here real quick, and I'm going to go over to our Python, or rather Jupyter, notebook.

So let me jump into that. Which one is it? It's going to be our JupyterLab. Let me know when you can see it. All right, I'll close out all my tabs at the top, and right now you should just be able to see the JupyterLab browser window.

All right, so if we get there and nobody sees anything different, then I'll just keep continuing on. So one of the first things we will do with Python, let me set the stage real quick. What we're looking at is a deployment of JupyterLab.

Again, you can call it Jupyter Notebook if you want, and I've already explained what that is and a bit of the background of it. And it's running in my local environment, so this is literally running off my laptop.

And I've got a MacBook Pro. It's a pretty decent one; it's a couple of years old, but we've got some good RAM here. So what happens in JupyterLab or Jupyter Notebook is that you have these cells, and each cell can operate independently, which is really great. Because if you think about a waterfall programming language, typically when you press compile and run, it compiles everything and runs everything all at the same time, and it's done.

But in JupyterLab you can actually run sections at a time. So I can write one block of code and run that, and then any variables that were set in the upper cells will be available to the cells below, as long as I've run the cells above.

And I have the option to skip around the cells if I want, running the first one and the third one and the 7th one, but typically people run them one after another.

Okay, I might be telling a lot of people what they already know, but again, just giving a little background here, because for some people it might be the first time. So if I run this first cell, you can see it's doing something very similar to what a lot of different languages do.

It just imports a bunch of libraries. I'm going to use those libraries. You can see here we've got the Snowflake connector; we've got Pandas, which we're going to use for our Pandas data frames; and calendar.

We've got some import tools here from Snowflake. So this is a very basic ten-to-twelve-line configuration just to import the basics. Okay? So what I'm going to do is go over here to the Edit menu and choose Clear All Outputs.

So that means if I run this before, at any point in time, you’ll see here in a minute, it just cleans it up. It just cleans up my notebook here. So the next thing I’m going to do is I’m going to click on the Play button, and you can see here it says run the selected cell and then advance to the next cell.

So I want to run this because I want to set up my imports correctly. I need my imports because otherwise, if I call another library, or rather another cell, nothing's going to happen. So I ran the first one, and you can see that there's a number that's been placed in here.

That's the sequence of the run. I was doing some testing before this meetup to make sure everything worked properly, so right now I'm on a later cell run of the day, I think. So nothing spectacular.

So now the next step, the next cell, and you can almost think of it as a step, is I’m going to set my connection. Now I’m a big fan of documenting code, which if you guys write any code or if you do anything where you have to write logic, I highly recommend, of course, everybody document their code.

So please document your code when you can. This first step here is basically creating a connection to our instance using the Snowflake library, and I'm going to have to type in my password here in a minute.

This is our account, and this is the warehouse we're going to use for DataLakeHouse. And this is a database from our last meetup, where we did a bunch of COVID data analysis using Snowflake; that's just kind of left over from that. And then we're calling the schema.

So we’re basically setting in Snowflake what they call the context. And that’s super important, especially if you’ve used Snowflake before, you know how important that is. So if I press the Play button or Run button here, it’s actually going to try to go out to Snowflake and connect and you see how fast that returns.

It’s always great when something is very responsive, right? Especially in technology. And you can see down here, it basically gives me a bunch of errors and ultimately it says it fails to connect password and everything was not correct.

So what I’m going to do real quick is I’m going to pause my screen so that I get to see my password and then I’m going to jump right back into it. So let’s see here. Pause and then no one should be able to see me type in this password.

Alright. Now chances are at some point in this meeting I'll happen to scroll up and the password will still be there, which I think has happened to me before. But we delete these users right afterwards anyway, so it won't be around for long.
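A connection cell along those lines might look like the sketch below. The account, user, and context names are placeholders, and the import is deferred inside the function so the sketch loads even without the connector installed (`pip install snowflake-connector-python`):

```python
def connect_to_snowflake(password: str):
    """Open a Snowflake connection with the session context preset."""
    import snowflake.connector  # deferred; needs snowflake-connector-python

    return snowflake.connector.connect(
        account="xy12345.us-east-1",  # placeholder account locator
        user="MEETUP_USER",           # placeholder user name
        password=password,
        warehouse="COMPUTE_WH",       # setting the session context up front
        database="COVID19_DB",        # placeholder database and schema
        schema="PUBLIC",
    )

# cxn = connect_to_snowflake(getpass.getpass())  # prompt, don't hard-code
```

Prompting with `getpass` keeps the password out of the notebook, which avoids exactly the scroll-back problem mentioned above.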

Then the next thing that we'll do is we'll go to the next cell. So what we want to do is super simple: we're going to go ahead and run the cursor and then connect. So let me pause here for a second on the cell above.

When I create the connection, we basically create a variable for the connection, and I think we call that connection variable cxn. And then there is something called a cursor. A cursor is a pretty common thing.

I pulled up something on Wikipedia; I'll show you in a minute what a cursor actually is, what it actually means. Because sometimes you'll see examples, like in the Snowflake documentation or on Stack Overflow, where people just kind of throw out the example.

And then all of a sudden you’re copying and pasting and then somebody asks you, well what’s the cursor? What does that mean? And you’re kind of like, I don’t know, it just works. So I always find it helpful to actually find out definitions of what you’re actually using so that way you can talk intelligently about it in case somebody pings you about it.

But of course it's a programmatic connection term, and I've got the link to Wikipedia for you in case you're curious, right here. All right, so I made the connection, and each connection in pretty much all languages has a cursor, especially in Python; I'm almost 100% on that.

And that cursor basically says that once you’re in that connection, you can loop through a record set, you can loop through the metadata. So it’s just like a point of reference in that data for whatever is being returned by your client library.

So in this one we’re going to take that cursor, which is basically a pointer to the connection, and then we’re going to execute something. Using that cursor, which is a pointer to the connection, so that way I don’t have to reconnect every single time I want to run something.

So here I've got a simple CURRENT_VERSION query. For those who use Snowflake, I don't know if you know that Snowflake has versions just like every other software. So I'm going to click play on this and, all right, boom, I've got a little bit of data right here. Because we're not building a web app here, right?

We’re just trying to use our programming language of Python to retrieve data points, metadata, and then use that data, those data points of metadata in our code to do whatever we want to, to make sense of whatever’s being returned.

So again, we're not seeing big fancy graphs and charts yet. So that shows we're on version 5.19.2, and I would assume that anybody who ran this in our environment today would get the same version number.
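That version cell, wrapped in a small hypothetical helper so the cursor usage is explicit (assuming `cxn` is the connection variable from the earlier cell):

```python
def snowflake_version(cxn) -> str:
    """Ask Snowflake what version it is running."""
    cur = cxn.cursor()                       # a cursor: a pointer into the connection
    cur.execute("SELECT CURRENT_VERSION()")  # one row, one column comes back
    return cur.fetchone()[0]

# print(snowflake_version(cxn))  # e.g. "5.19.2" in the session above
```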

Okay, so then our next one that we're going to run is just a retrieval. So what have we done? We've connected, we've got the cursor, and we did just a general select against the version, which is like a metadata call.

And so on this next one we’re just going to retrieve some data from the COVID data set. So let’s go ahead and run that and then talk about what we’re getting. So look at this data set. And that was fast, right?

That's probably cached since I ran it a couple of times, but not bad. So we're pulling this one out of the COVID-19 data set. And what's the query? Right: select all these columns from the World Health Organization's daily report.

So on a daily basis the World Health Organization reports, and we captured just those daily stats in one table. So there's only one record for every country, 200-plus records. And so what we're going to do here is loop through that cursor.

Using the cursor, we're going to loop through the data that's returned. So for every record in that cursor, we're going to basically use some print-formatting syntax, like if you're from Java or, I think,

C, it's printf or sprintf, if I recall correctly. It's been a while since I worked in C or C++. You get kind of a formatted output of whatever argument you pass into it. So here we've got three columns. We're returning one, two, and three.

And then we’re just printing those out and it’s looping through the print. So we’re going to get one row each, three columns printed out for every record that’s returned. And that’s exactly what you see.
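That loop is plain cursor iteration plus printf-style formatting. A sketch with made-up placeholder rows standing in for the WHO result set (these are not real figures):

```python
def format_rows(rows):
    """printf-style output: country, cases, deaths, one line per record."""
    return ["%-20s %12s %10s" % (country, cases, deaths)
            for (country, cases, deaths) in rows]

# Placeholder rows shaped like cursor results; with a live connection you
# would iterate the cursor itself instead of a list.
sample = [("Country A", 1000, 10), ("Country B", 2000, 20)]
for line in format_rows(sample):
    print(line)
```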

Nothing fancy at all. And I think the cases for the US are actually coming down since our last meetup. I don't know; I'd have to fact-check that. So we could clean this up; this is a huge list.

It's going to detract from what I'm looking to do. So I can simply right-click on this and say Clear Outputs, and it clears the output for the cell, as you see. Now it looks like, technically, I didn't run anything.

So it's a way to clean up as you're going. And again, each one of these is super basic, right? But on each one of these, the concept would be the same if I were writing a significant amount more code, joining data, putting some sort of conditions or rules on it. We're just walking through the basics, right?

So now this one's a little bit different. For those who know Python and have done some data science: what I'm going to do is take that same query and run it into a data frame.

Okay? So here, if I run this, I'm actually setting the variable before the try, and I'm setting an option here for max columns to see what happens. And then I'm using this fetch_pandas_all, which I think actually comes from Snowflake.

And it's kind of just a quicker retrieve, pulling everything in without having to identify any of the columns and map them into the data frame; it does that kind of automatically. So if I run this one...

You'll see that I get a nicer, formatted, printed table, which is really cool. So this is basically a data frame output. And I ran an extra command down here at the bottom called df.head, and if anybody's familiar with Linux, it's got a very similar thing.

So if you're doing a head or a tail on a file, you're just getting the top X rows based on your numerical argument, right? And that's coming through. Let me see.
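The DataFrame cell boils down to an execute plus `fetch_pandas_all()`, the connector method that maps the result set into a pandas DataFrame automatically (it requires the connector's pandas extra). As a hypothetical helper, with the display option and table name as assumptions:

```python
def query_to_df(cursor, sql):
    """Run `sql` and pull the whole result set into a pandas DataFrame."""
    cursor.execute(sql)
    return cursor.fetch_pandas_all()  # column names and types come for free

# pd.set_option("display.max_columns", None)   # show every column when printing
# df = query_to_df(cur, "SELECT * FROM ...")   # the WHO daily report table
# df.head(10)                                  # top 10 rows, like Linux `head`
```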

I think I heard someone had a question. Yeah, Andres, let me see here. I don't know if I can unmute you. Let me see. Can you hear me? Yeah, there you go. How are you? I'm good, and you? Good. Quick question here: when you use this method, are you loading data into your RAM?

Yeah, as far as I know, and I've never contributed to the Pandas data frame open-source library, but pretty much everything is pointing to memory at this point, as far as I know. Do you know of any other option?

Let's say that I have a ten-gigabyte data frame that I want to pull out of Snowflake into Pandas; that's going to kill my computer. Do you know if there's anything more advanced coming from Snowflake to avoid, first, downloading the data into your RAM, and second, using your RAM to process with Pandas?

Yeah, so that brings up a couple of points; maybe something for another meetup, for sure. I can probably think of three ways to respond to that question. I don't know if you caught the beginning of the meetup, but we were talking about JupyterHub versus JupyterLab on the laptop.

And so what we do as part of DataLakeHouse, and for some other folks as well, is stand up a JupyterHub instance, on either Google Cloud Platform or AWS, as a virtual machine.

And so we've got clients and customers that are pulling in five, ten, twelve gigs of data into data frames, parsing and manipulating, because we can scale up that server and that memory for those particular loads.

I'll give an example. One of our customers runs a really cool Python process that actually steps through the movements of a transportation process every day. And on the guy's laptop it was taking like 30 hours.

He would just set it there on Friday night, check it Saturday and Sunday, and on Saturday it would still be running. And then we stood up JupyterHub on a machine that I think had like 64 gigs of RAM and twelve virtual CPUs, and the same thing finished in like 2 hours, right?

You take a sample of the data, run it on the laptop, make sure the process works, and then take that same Python notebook and stick it up on JupyterHub. So that's one way you can do it.

Another way, even though we're not talking Python per se: you could use PySpark, or Spark in general. And there's a lot of discussion about using Spark, because a lot of times when people use Python and are pulling lots of data from Snowflake, one of the things that slows them down is not so much the memory, it's the operations they're doing in memory.

And so using Spark as that load and transformation engine could reduce the load you're taking on in the way you've structured your Python transformation, if you will. So that's another way, and then you get a combination of things: better in-memory movement of data and transformation, and just better use of memory in general.

Okay, I think I said one more way. Airflow, maybe. Well, yes, in theory, also, if you have ten gigs free on your machine, you could always download that data onto your machine, right? You can just download the 10GB of data onto your machine, and then in theory you could chunk it through whatever process.

Like, take the first 1GB and run that through whatever you do in Python, whether it be a data frame or not, and then push that up to a stage or back into Snowflake. Almost like chunking it along.
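A minimal sketch of that chunking idea, assuming a cursor that has already executed a big query: pull the result set in fixed-size batches with `fetchmany` and process each batch on its own. The chunk size and the per-batch "work" here are placeholders, not anything from the demo.

```python
# Process a large query result in fixed-size batches instead of one
# giant fetch. fetchmany() is standard DB-API, so this loop works
# against the Snowflake connector's cursor too.
import pandas as pd

def process_in_chunks(cursor, chunk_size=1_000_000):
    """Fetch rows from an already-executed cursor in batches; return
    the number of rows handled per batch (a stand-in for real work)."""
    handled = []
    while True:
        rows = cursor.fetchmany(chunk_size)   # next batch, [] when exhausted
        if not rows:
            break
        df = pd.DataFrame(rows, columns=[col[0] for col in cursor.description])
        # ...transform df here, then push it to a stage or back into Snowflake...
        handled.append(len(df))
    return handled
```

Because each batch becomes its own small data frame, peak memory stays near one chunk rather than the whole 10GB result.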

That’d be another way to do it if you’re completely resource constrained. I’ve seen that before, too. You shouldn’t have to get that creative, right? That’s a great question. I hope somebody corrects me on the memory thing again.

I’m not saying everything’s about memory, but I’m pretty close to it. Alright. And data frames are amazing, right? They’re quick, they’re fast, and they can do a lot of great things. So the next one here, just stepping forward a little bit, is calling standard,

I guess you can call these DDL, DML, admin-type functions, right? Because if you’re really trying to use Python to interact with Snowflake, you might want to do more than just reading data, transforming it, and turning it into a CSV or pushing it somewhere.

So as an admin, you might want to automate a process, or automate provisioning roles to new users, or write a script. And you can absolutely do that in Python. So just to give a little taste of that: if you remember, before I was using the database that has the public World Health Organization data.

So I now want to switch, because I can’t write to that data set. If you remember from our previous meetup, that’s a shared data set, so I can’t write into it. So what I have to do is switch to another database by running USE DATABASE against DEMODB, which is a database we already have created.

We’re not worried about that. But in that database, I’m going to create a new schema called MEETUP_DB with today’s date, and then I’m going to use that schema for any further operations. And then what I like to do sometimes, just like in programming, is have an indicator that tells me, if something failed in this process, whether I got to this point, something that’s not really a big computation or data call. So if I run this, my only output should be the version of Snowflake, if it works.

Otherwise it’ll fail, because everything is kind of waterfall, like most languages. It’s going to run this one, then this one, then this one, and if they all succeed, then it should reach this one with no problem.
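That context-setting cell might look roughly like this, a sketch assuming the connection is already open; the database and schema names are just examples, and the final `SELECT CURRENT_VERSION()` is the cheap "did I get here?" indicator described above.

```python
# Run the USE / CREATE statements in order, waterfall-style, then finish
# with a cheap query whose output confirms everything upstream succeeded.
from datetime import date

def set_context(cur, database="DEMODB"):
    schema = f"MEETUP_DB_{date.today():%Y%m%d}"   # schema named with today's date
    for stmt in (
        f"USE DATABASE {database}",
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"USE SCHEMA {schema}",
    ):
        cur.execute(stmt)   # any failure raises before the next statement runs
    cur.execute("SELECT CURRENT_VERSION()")   # the "did I get here?" indicator
    return cur.fetchone()[0]
```

If any USE or CREATE fails, the exception fires before the version query, so seeing a version number printed means the whole chain succeeded.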

And it did, because it output the version number. So we’re good there. This is basically context setting. There’s nothing major here; it’s just setting some things and creating an object in the back end that’s structural to my database.

All right. And then we just have a couple more here to go. So the next one: now that I’ve got the schema, I’m using the schema, because remember, with Snowflake you have to tell it, for just about most of the interaction: what’s your role, what’s your warehouse that you’re going to use for the compute, what database are you using, what’s your schema? Those four things. So now what I want to do, now that I’m using that schema, is create a new table, and it’s going to be called MEETUP_GUEST.

And if it’s already created, it’s going to replace it. And then once it’s created, it’s going to go ahead and insert a couple of records. I’ve got my first record here, which is a guest, Elon Musk, with 88 Sky High as his phone number.

And he’s interested in rockets, cars, space, crypto, and DataLakeHouse.io. So I’m going to run this, and it’s going to give me the current version like before, and then it’s going to go ahead and give me a list of the guests.
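The create-and-insert cell just described might be sketched like this; the table layout and the guest record are illustrative, not the exact DDL from the demo. The `%s` placeholders match the Snowflake connector's default `pyformat` parameter style, so the values travel as bind parameters rather than being spliced into the SQL string.

```python
# Create (or replace) a guest table, insert one record with bind
# parameters, then select the guests back out.
def create_guest_table(cur):
    cur.execute(
        "CREATE OR REPLACE TABLE MEETUP_GUEST "
        "(NAME STRING, PHONE STRING, INTERESTS STRING)"
    )
    cur.execute(
        "INSERT INTO MEETUP_GUEST VALUES (%s, %s, %s)",
        ("Elon Musk", "88-SKY-HIGH", "rockets, cars, space, crypto"),
    )
    cur.execute("SELECT * FROM MEETUP_GUEST")   # list the guests
    return cur.fetchall()
```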

So let’s go and run that. All right, here we go. The one thing I forgot I was doing: I’m actually printing out the cursor’s query identifier here. So one really cool thing about Snowflake, which I’m not sure if you all know already, is that every query you pass into Snowflake generates an identifier for that query and pretty much captures that query’s information.

And so it’s a great way to track anything about that query: its performance, its load time, its run time, whatever it might be. We’ll probably have another meetup on looking at some of the query profiling information, but for the most part, you can take this identifier and run it through the query profile, or find the query and its specifics.

And so that’s a really good one to know. We can just print that out with the cursor; it’s part of the metadata. As soon as you run any command against Snowflake with that cursor, because every command is basically SQL, you can get that query identifier by reading cursor.sfqid.
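A tiny helper sketching that pattern; the function name is ours, but `cursor.sfqid` is the connector's real attribute for the last statement's query ID.

```python
# Run a statement and capture its Snowflake query ID, which you can
# later paste into the query profile or look up in query history.
def execute_logged(cur, sql):
    cur.execute(sql)
    print(f"{sql[:40]} -> query id {cur.sfqid}")
    return cur.sfqid
```

Calling it as `execute_logged(cur, "SELECT CURRENT_VERSION()")` prints the ID and hands it back, so you can stash the IDs of every step in a longer script for later profiling.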

And then, of course, I’m using Pandas again here. But you’ll notice I’m doing something different with the Pandas: I’m not actually printing out the data frame, I’m printing out a string representation of the data frame, which basically turns it all into a string while keeping part of the data frame logic.

I wanted to show that because it’s a little different take on using the actual data frame for what it is in a notebook. And here we can see the printed-out user. All right, so I think the last one, yes,

is going to be inserting records. So what’s interesting about the Snowflake client for Python is, in my opinion, there are a few bugs in some of it. So testing, plus the community forums and the Snowflake tag on Stack Overflow, are going to be your friends.

I mean, you can do quite a bit, but not all of the documentation works as intended, I think. And that might be a bad call; maybe I don’t know what I’m doing. Doubtful. It could be something with the version of Python you’re using, the things you’re importing, how you’re using it in your code, whether you’re using a try block, that type of thing.

So your mileage may vary. And again, setting up your environment is very important. That’s why I’m a big fan of standardized, known environments. So definitely read any documentation you can from Snowflake.

And then, of course, you have to take that documentation and match it to whatever Python environment or version you might be using, depending on the flavor and setup, and whether you’re using virtual environments, all those sorts of things you have to consider.

With that being said, let’s look at this last one. We’re still setting our cursor. We’re going to use a try block and just insert a couple more records: we’re going to insert Albert E and Nikolai.

And we’re going to try to add them to our new table, MEETUP_GUEST. So I’m just going to run this, okay? And then I went ahead and printed out the standard-format data frame again here, and I also pulled out the identifier, which of course is different from the one above.
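The try-block insert might be sketched like this, reusing the same hypothetical MEETUP_GUEST shape; `executemany` binds one parameter tuple per guest, and the fresh `sfqid` identifies this statement.

```python
# Insert several guests inside a try block; on failure, report and
# re-raise so the caller can decide what to do.
def insert_guests(cur, guests):
    try:
        cur.executemany(
            "INSERT INTO MEETUP_GUEST VALUES (%s, %s, %s)", guests
        )
        return cur.sfqid               # a fresh query ID for this statement
    except Exception as exc:           # the connector raises ProgrammingError etc.
        print(f"insert failed: {exc}")
        raise
```

Usage, with made-up phone numbers: `insert_guests(cur, [("Albert E", "555-0001", "physics"), ("Nikolai", "555-0002", "electricity")])`.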

So now I’ve got all three guests in my system. So we just walked through how to connect, how to read cursors, how to loop through cursors, how to select general data, how to insert data a little bit, and how to use a data frame.

That’s like the absolute 1% bare minimum of using data frames with Pandas. This is not an ML or data science meetup, but maybe we’ll have one soon. And anyone here on the call is super welcome to contribute to the meetups, whether by presenting or by providing ideas.

We always look for ideas, right? So if you have any input, definitely drop it in our forum. So we’re going to do a couple of other ones. And then I think I’ll wrap up with what I did mention: at the very bottom of your code, we typically close out the connection.

Because if you think about your connection to Snowflake, it’s almost like you’re logging into the user interface, right? You go to the browser, you type in your account in the URL, and then when you log in, you’ve created a session, right?

And that session is open as long as your browser is open. And even if you close your browser, you log back in or you open your browser up again. Typically you’re still signed in because your session is there.

Same thing with connecting from Python or another client. So it’s best practice, at the end of your code, or when you finalize all your logic, to go ahead and close out that connection. You can even put some simple logic or a try/catch block around it: if the connection exists, close it, and it will work.

And then if I run this, right, it gets me to the next one. I run it again, it still works; closing again is basically a no-op. But if I go back up here and try to run this, in theory, it shouldn’t run.
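The close-out guard might look like this small helper; the function name is ours, but `is_closed()` is a real method on the connector's connection object, which is what makes the repeated call harmless.

```python
# Close the Snowflake session if it's still open; do nothing otherwise.
# Safe to call more than once, or with a connection that never opened.
def close_quietly(conn):
    try:
        if conn is not None and not conn.is_closed():
            conn.close()
    except Exception:
        pass                      # already gone, nothing to clean up
    return conn is None or conn.is_closed()
```

Putting this at the bottom of the notebook (or in a `finally:` block) means you can walk away knowing no session is left hanging open.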

Yeah, so I guess you guys saw that working: connection is closed. So my session is out, my connection is out, and I can feel comfortable walking away. I don’t have any open sessions from my code or anything like that.

All right, so let me jump back over to a couple of other slides, and then we’ll just wrap up, see if there’s any questions as well. So let me make sure I’m presenting this other slide and I think that is presenting.

So I’ll share it out, and then you guys tell me if it seems to be working. It’s up. All right. For some reason, I can’t see it, so that’s not good. All right, let’s try that again. Present again, and then share that out.

Okay, so last couple of slides here. I mentioned that we had a few things in the demo I wanted to cover, and I think we had them all: credentials, and we talked about the different authorization options. There’s SSO, there’s OAuth, there’s SSH keys, that type of thing, or key-pair login.

So we did all this, and lots of lessons learned, I think, out of that, especially if you’re new to Snowflake or new to Python. We talked about cursors and data frames, Jupyter Notebook, JupyterHub, a little bit about restricting access.

Definitely Python. And the one we didn’t cover today, because it’s probably a lot to take in if you’ve never done the Python integration, is one which is pretty fun. It’s called asynchronous queries. Just to talk about that for a couple of seconds, think about it, right?

So I mentioned having a ten-gigabyte data set. Let’s imagine you had that in chunks, right? And you want to do an operation on it. With an asynchronous query, you could kick one off.

You could kick the query off, and it’s going to run asynchronously, right? So you could come out and wait, or run the query ID to see what’s happened with the execution you kicked off.

So it won’t be linear, if that makes sense. It’s a non-blocking sort of query. It lets you kick it off; you can do things in between. You can wait for it to return, or you can go do something else and then ping it to see if it has finished or completed its execution, and then continue on to do something.

So it’s a little different from what we went over today, which was pretty much just calling something and waiting for it. Nothing else was happening; we couldn’t do anything else until that query completed, and only once it completed could we do something.
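A hedged sketch of that asynchronous pattern, using the connector's real async API (`execute_async`, `get_query_status`, `is_still_running`, `get_results_from_sfqid`); the polling interval is arbitrary, and in real code you would do other work between polls instead of sleeping.

```python
# Kick off a query without blocking, poll its status by query ID, then
# attach the cursor to the finished query and fetch the results.
import time

def run_async(conn, sql, poll_seconds=1):
    cur = conn.cursor()
    cur.execute_async(sql)            # returns immediately; the query keeps running
    qid = cur.sfqid
    while conn.is_still_running(conn.get_query_status(qid)):
        time.sleep(poll_seconds)      # free to do other work between polls
    cur.get_results_from_sfqid(qid)   # attach the cursor to the finished query
    return cur.fetchall()
```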

So one’s synchronous, one’s asynchronous. Then we’ll leave you with this definition of what a cursor is in a database system. So I think, yeah, guys, let’s open it up for some Q&A, and definitely make sure to tell a colleague about joining one of the upcoming events.

We’ve got a few lined up, and we’re adding more to the schedule this week. So this is fun for us, and we hope it’s fun for you guys too. Okay, let’s see if there are any questions. Okay, a few people jumping in as we go.

Let’s see here. I’m sure somebody’s going to say they knew all of that already. If that’s the case for anybody, we’d love to hear your take, or how you’ve been using Python with Snowflake.

Anybody want to share? I’d like to share. I work with Snowflake a lot. So far, we have not used Python yet. We are still working through challenges around the memory and the cost of downloading the data.

But we have used different stuff: mainly Informatica for data ingestion, Airflow for ETL, and even, I’m not sure if you have heard about it, Databricks as well, to connect and process the data, and then we move the data to Domino for processing.

That’s the cloud platform we are currently using for analytics and data science and stuff, and, well, mainly SQL. So I have a question for you. It could be a little bit outside this discussion, but have you built applications yet,

from or with front ends in a web browser, where you call the data and then expose, like, a dashboard? Yeah, absolutely. We do it with our platform, DataLakeHouse, and we’ve done it for some other folks as well.

Yes, it works pretty well if you’re talking about going directly against Snowflake. React or Angular? Yeah, we’re more of an Angular shop on my team. We do also do some React, but not as much as Angular.

We’re bigger fans of Angular for some reason. We’re gluttons for punishment over here, I think. Can you share your information, so I can check your web page if you have one? Yeah, sure. We can share.

Yeah, it’d be interesting; what have you got? You guys have a lot of tool sets there. Domino and whatnot. Yeah, well, I’ll tell you. We talked with one of my old colleagues, an IT director at a large retailer, and they actually switched a while back. They had Informatica and SSIS and some of these other tools, and they switched to a pure custom-built Python process, both for loading all their data into Snowflake and for any operations to pull data down.

Not to say there’s anything wrong with Informatica or SSIS. I’m a big fan of both; they’re great tools. But sometimes, if you’re going into a new initiative, maybe with Snowflake, it’s easier just to say: how fast can we go?

And if something works in Python, you can scale it, you can keep it stable. That makes sense. We are using IICS because, to move really big chunks of data, that’s the only tool that is reliable, if you want to move,

say, 20GB every day. That’s the only really reliable one, in my opinion, or in my experience; even Databricks is not that reliable. And the other thing that IICS has that is really cool is listeners.

So if you want to have real-time ingestion, you use those. All the work is done with that, very simply, with Informatica Cloud Services. Yeah, I just dropped you my information here so we can connect, if you want.

Cool. A question about IICS and Snowflake: for streaming data, can you guys use IICS? Or is it only for batch, if you load some real-time data and all those things? I am using it for real time. I have many different streams, different business cases, where I’m using Informatica with listeners, as I mentioned.

So, for example, I just set up a project where I have some SFTP. We are receiving data from outside the corporation, and of course the data goes through all the security. But once I receive the data, I just set up a listener on the SFTP location, and then the data just flows through automatically to Informatica, like real time.

So that’s one of the business cases. Another one could be sensors. I tried to use Databricks for that before, and once again, that’s my opinion, but Databricks cannot be really reliable.

And Informatica, for me, has been the only tool, and it’s not the most friendly tool, not at all, but it’s the only one that has given us reliability in ingesting really big data continuously. So we are kind of exploring Azure Data Factory or some other tools.

We are also exploring Databricks. We have to migrate some on-prem data to Snowflake, and we don’t have IICS and all those. So we are exploring either Python or, second, something like Azure Data Factory.

That’s still something we are trying to assess. And the third one we are thinking about is Databricks. ADF is a really good tool. It pretty much depends on your needs. And of course, it’s not my company; I work for a company,

so we have the budget to have all the tools. Because we are ingesting source systems, probably more than 100GB every day just from source systems, we really need a reliable tool to complete these types of jobs.

But if you don’t have the need to ingest at a specific point in time, like you just have 1 hour to do the full ingestion and that stuff, you can just explore ADF. I think ADF is a really good tool.

Okay, so there are no data size limitations, right? No, there are no limitations, but there is not that kind of flexibility if you are going with IICS and all those. I’ll tell you, I don’t like IICS,

but it’s the only tool that has really solved all our issues, including the sensor data. Ingesting sensor data is really complex, and it gets complex really easily, especially when you don’t have a timestamp.

Informatica has some tricks to add a timestamp in real time, and that really helps you a lot to solve your problems. Interesting. So, another topic: suppose we’re doing Python, and we have to convert. We have BTEQ scripts and all those on-prem, like 1,500 scripts. So is there something in Snowflake?

I heard about SnowConvert and all those; we can do the conversion of those Teradata BTEQ scripts into kind of a Python environment in Snowflake. Have you guys ever used it? I don’t know if we’ve used SnowConvert directly, but if that’s what you’re asking, we have done some stuff with stored procedures.

So there are a lot of folks who are trying to completely migrate into Snowflake for their data warehouse. And a lot of data warehouse developers and architects eventually just wound up writing a bunch of SQL code, so they might have been using SSIS or Informatica, and then they wound up just writing a bunch of SQL and pulling that into a stored procedure because they thought it was easier.

And so now there are a lot of Snowflake customers trying to migrate, and they’re trying to take all those stored procedures and put them into Snowflake, either still using them as stored procedures, because there are arguments and they’re parameterized and they’re doing variable setting and things like that.

So we actually created a set of scripts that helps accelerate that process. But I haven’t really seen anything for converting BTEQ into Python for Snowflake in particular. Is that what you guys have?

Probably. Or you’re saying you have, no. So we were talking to the Snowflake guys. They recommended that SnowConvert. It basically takes the on-prem code and converts it into the Snowflake kind of code,

and that can be done in Python; that’s what we have been told. So we have around 2,000 scripts, and we don’t want to rewrite everything; it’s going to take a very long time and a lot of resources and all those.

So we are looking for something that can do things quickly, like, in two weeks we can kind of set that up in the Python environment. Yeah, that’s one to take a look at. We’ve built a lot of tools to help accelerate, because even if it gets you 60-70% of the way there, the automation saves you so much time and so much headache, like you were saying.

But that’s an interesting one. I’ll look that one up. Why are you guys moving away from BTEQ scripts? BTEQ scripts are awesome. So we are moving from on-prem to cloud. On-prem we have all the BTEQ scripts, but now we are going with Azure and Snowflake.

So we have to migrate all the on-prem data to Snowflake, right? And we have a big data warehouse, so we have to convert everything into the Snowflake thing. Got it. Yeah. And so the idea being that the cloud operates more universally with maybe Python versus BTEQ scripts, or shell scripting rather.

Yeah. Okay, that makes sense. I’ll dig into that a little bit. Yeah, but thanks for all the info about IICS and all those. So IICS, we haven’t planned to spend money on that yet; we’re still currently in the assessment phase.

Awesome. Well, I think that’s about time. Before we wrap up, maybe we could mention that we also have, on the Meetup page, as you guys have probably seen, a little bit of a forum for communication, to ask questions or submit topics for the next event.

Our next event, I think, is on June 21, which ties into a little bit of this discussion, especially when we start talking about IICS and ADF and things. Because if you have stages, especially for IoT or sensor data, the stages allow you to do this great kind of real-time object-storage dumping of the data, and then Snowflake’s massive scale-up capability can read that information in near real time and ingest X number of gigabytes at a time.

So it’s definitely one worth attending, and we can talk about some of those advanced topics during that one, too. That one’s in June. If you guys have any topics you want covered during that meetup, around anything we’re talking about here, we can probably get some stuff thrown in there.

Everybody, I appreciate your time, and thank you so much for participating and for the good conversation. Thank you. All right, thanks, everyone. We’ll see you on the next one.

Have a good night. 
