Mike Jelen

June 27, 2022
8:08 pm

Using Streamlit to Visualize Data

Streamlit is an open-source application framework that enables Machine Learning (ML) and Data Science teams to use Python without any front-end experience. Streamlit has a lot of programmatic flexibility to provide interactive visualizations, filters, graphs/charts well beyond traditional Business Intelligence (BI) tools.

Check out our Meetup video below to watch an overview of this event:

Other Meetups and Events to check out:

Transcript from the Meetup video:

Hello everyone. Welcome to another Carolina Snowflake Meetup group event where we’re going over using Streamlit to visualize data. These are our virtual Meetup rules that everybody is disrespectful to one another and goes on mute and uses the chat bot and tries to stay on topic.

Videos optional. And of course invite a friend always and check us out on Twitter at daily cast. All of our previous meet ups are on our YouTube channel. We usually get them up within a few days. The last ones are a little slower, but they’re up.

That’s all in the discussion board, I believe in the Meetup group if you want to check out our YouTube channel. All right, so today we’ll do some introductions with a quick Snowflake refresher and talk a little bit about Streamlit and some use cases for it.

And then we will visualize data in Streamlit and talk a little bit about our beta program and our next events and answer any questions you guys may have. Quick refresher on what is Snowflake. Some of you are experienced veterans with Snowflake and others might be new to Snowflake and everyone is always welcome.

And so, just in case you’re new to Snowflake, just a quick reminder. We say Snowflake or Snowflake data cloud. So whether it’s data ingestion, data access, data distribution, data sharing, data governance, you’re trying to scale up and scale down workloads and everything in between, that is really think of Snowflake in the cloud.

It’s really very simple description. Obviously it’s very, very. Varying degrees of complexity that you’re able to do within Snowflakes, really meet the business cases or your use cases that you have with your data.

Naturally, some of these cases are very straightforward data that is to be accessed by all. Then you have other data from various sources, could be transactional. Even IoT going to have varying degrees of sensitivity that you’re going to want to be able to control.

Not only the transformation on that data, how that data is aggregated and disseminated throughout your organization, where some of your user groups are going to be able to see all of the individual data elements and other groups.

You’re going to really control that access. Think of HR data, for example. You don’t want everyone to know everything about each individual as to what they make, but it’s okay to maybe see across an entire subsidiary within your organization and lots of different ways to leverage Snowflake to your advantage.

And really, at the end of the day, technology is cool. It’s all about lining that up with your business use cases. Next slide, please. All right. So, Jude, do you want to take it from here? Yeah, sure.

Hey, everybody. So, Streamlit, I guess the topic of our meeting today is an open source framework that is going to enable users to be able to create their own machine learning and data science teams to be able to create applications without needing really to have any front end experience.

So no HTML or CSS or anything like that. It really makes the builds the ease of flexibility in creating a lot of these front end. Applications to be able to present to your business users without having to, like I mentioned, having to have that front end experience.

So if you don’t mind going next slide, Heather. Like I mentioned, we’ve got this streamlined, got this ability to have a lot of programmatic flexibility. So I like to think of if you guys are familiar with our Shiny, it’s a tool that’s very similar to that to where you can create these very interactive visualizations that business users are able to use.

And I’ll be able to show you in a little bit about the kind of extensibility that’s enabled by these dashboards and the ability to streamline gives you to create interactive filters or interactive graphs and charts, but certainly a lot of flexibility that you may not get with maybe a tool like Power Bi or Tableau in that you are able to program your own applications.

And the big part, one of the big things is the ability to build machine learning apps and data science applications. And I guess we’ll get into that right now. Our target data set that we’re going to be looking at today is DoorDash for Work Data.

And the data set that I’ll be using is actually a prebuilt model that is part of DataLakeHouse.io and the pre built models and applications that we have within there. So. If you don’t mind giving me screen share access, Heather, I’m going to go ahead and kind of walk through what I had done in creating my own application and how I went ahead and utilized Streamlit to be able to visualize that data.

So it is still showing screenshot. Screen sharing is disabled for me, if you don’t mind getting that. It should be enabled. Okay, now it’s test. Okay, cool. All right, so can everybody see my digital studio code?

Yes, we can. Alright, so just off the bat, what I also was able to use was Snow Park, this integration. So Snow Park is Snowflakes kind of big data tool that they’ve got. It’s very similar to Apache Spark if any of you guys have ever used.

That really brings home the idea of data frames and using data frames programmatically. And that’s what I did. So we started out by creating a connection to our Snowflake account and creating a session right here.

And obviously for the purpose of public address, I’ve got my information masked right here. But this is kind of the way that you would want to structure your connection and then build your session object.

And then from there, you can see right here, I’ve actually created three different data frames from three different tables in Snowflake that are a member of that session. So. We’ve got an orders fact table, an employees dimension table and a merchant’s dimension table.

So all that we want to do really is we’re building off of this orders fact table, which has the vast majority of our door order information. A lot of the order amounts, the order dates, the amount that the company paid, the amount that the employee paid, the amount that was above or below budget, and if it was above budget, anything like that is really going to be found in our orders fact table.

And what I did was I also pulled from the employees dimension and the merchants dimension in order to get the name of the employees that are making these orders and then the merchants who are also the places that we’re ordering food from.

Just another heads up, all this data is mock data. So we’re also not sharing employee information or anything like that. So just beware that there’s also some weird quirks with some of this data because it is mocked data.

So if something looks a little off, that’s why So went ahead and what we did right here with this combined DF part was we leveraged Snow Park to be able to join these two data sets together, the two dimension data sets, the employees dimension and the merchants dimension with the orders Fact table.

And then, so just two easy joins and then we are selecting the fields that we want right here. And then we’re going ahead and collecting Snowflake or Snow Park actually works off of what’s called lazy execution.

So if. If I didn’t put collect here or didn’t put show here, it’s actually not going to collect any data. It will wait until this data set is actually being executed upon or we had a show or collect statement right here.

So I went ahead and just had that collect right here just to have that data pulled. But lazy execution is basically a way to prevent large amounts of data to be pulled in locally. So kind of a cool perk there in working with Snow Park, but then moving down.

I actually am going to go ahead and run this whole Streamlit application just to give you guys an idea of what we’re going to be trying to build. And I’ll kind of be bouncing around back and forth between that application so that I’m able to the code that I’m referencing can line up with what is on that application.

So I’m going to bring up my command line right here and I’m going to run just an easy run Streamlit meetup PY command. You’ll notice I’m working off of a Python notebook, but the Python file that I am going to be executing right here, I’m not going to bring up because it’s got that personal connection information for Snowflake.

But anything that is on that file is in this Python notebook. So you’re not going to be missing anything. So let me go ahead and run this and you’ll get a bunch of command line jargon output right here that will come up.

But in my browser, I can actually bring this over. I got a Chrome browser going. And we’ll see that this is our application. So. What we’ve got essentially is this date range slider at the top, which is controlling the amount of data that we want to be paying attention for a specific point in time.

So I’ve got the last one year, I’ve got this application set up so that we’re looking at the one year of data and one year, I guess, of delivery dates. So anything that’s been any DoorDash deliveries that have been made in the last year.

You can see that if I adjust the slider at all, it’s going to reupdate this application such that we have a new order count, a new percentage of orders above budgets, but also that this data is basically reflecting what’s in this slider time period.

Now. So that’s the first piece. And then we’ve got a couple of different visualizations here, a couple of different tiles up at the top and then a line chart that just shows the amount of money that was spent on DoorDash by different spend categories.

So the service fee that was part of the order, the tip that was part of the order, the actual order total and so on and so on. And then we’ve got a couple of graphs here just with the kind of occurrences in which by weather condition of the order or the delivery and then our top ten merchants so where we were ordering food from the most.

Ultimately, like I mentioned, I’ll keep bouncing back and forth between this so that you guys can see it. But I’ll go back to the code for now and kind of go from there. But like I mentioned, we have that slider at the top that I was showing you that basically can isolate the amount of data that we want to look at in a specific time period.

So we have the way that we do that is Streamlit has this slider object right here. And the cool thing is when you move one of the values on the slider back or forth, it outputs a value here. And since we’ve got kind of a double ended slider between the start date and the end date, we get an array that is output.

So like I mentioned right here, our start date would be 2021, 927, and our end date would be today, today’s date. So what would be output then would be for value one or value the first element of the array would be 927, 2021.

And then the second value or end date would be the max date or the date of today. So we then want to leverage that and kind of the way we manipulate our data sets though. And we’ve got this PD combined DF, which is the one that we’re really going to be pulling everything from.

So with it being our source data, we want our filter to occur kind of at the root of that source data. So that’s where we’ve got our filter from, our slider right here. And we do more filtering. And we’ve got in this next section, we do a little bit of.

Finagling to get a month and a year. Kind of like I was mentioning you on that line chart where we can go month by month and show the amount of amount of money that was spent on DoorDash by month. So that’s how we got that.

And then we also run a couple of metrics right here that we’ll use later, but we’re not going to worry too much about them right now. So the big part of that list section was that we created that slider and we’ve now done kind of our filtering and anything we’ve needed on our source data to get the data isolated for that time period within that slider that I was showing you.

And now the next part we want to do is we actually are creating some aggregated data sets based on the source data set, this filter data set that we were looking at just a second ago. So in these data sets are going to help us to create the charts and those little tiles that I was referencing before and I’ll bring up one more time just to show you.

But ultimately what we’re doing with those datasets is we’re creating, we’re wanting to get it to more aggregated format so that we’re able to better create our line chart here, our bar charts here, and these metrics right here for our monthly orders.

Just to mention this is a Pandas data frame as well. Reasoning. We kind of go from Snow Park to Pandas and you can kind of go. Back and forth using whichever one you want. Just for the ease of this one.

I use Pandas, but I could have reasonably used Snow Park all the way down to aggregate these data sets as well. But right here I’m creating I’m wanting to create that line chart that shows our order totals over time by month.

And what I’m doing is I’m taking that month year column that I created just a second ago with just a second ago in the above code right here. And I’m grouping by that. And I want to sum the order total metrics together so that we’re showing that amount spent over time.

And then the next data set that I actually want to create is one that we’re filtering on the amount of orders that went above the door dash for work allotted budget. So just doing a quick filter here to say is my order above budget?

And if so, it will be returned in the data frame, and if not, it’ll be filtered out. And here I create another data set, another data frame where I’m just wanting to do a quick group by on the weather condition and then get the count that was on that data set as well.

And then the last data set is just grabbing the grouping by our restaurant name and then doing basically a sum on the order total to find the spent on amount spent at that restaurant in that given amount of time.

So. So that’s we’ve got our data sets kind of all put together and in a format now where we want them to be able to be leveraged by Streamlit. So we now get down to formatting and actually creating the outputs for our Streamlit app.

So up to now, this was just raw snowpark, raw Pandas, just data manipulation. And at this point we’re trying to actually be able to put our aggregated and our cleaned data into formats that streamlike can recognize and be able to turn into graphs and charts.

This is probably where I’ll go back and forth between the application a little bit more. Hang with me actually might try to see if I can go split screen here. Might not be as ugly or it might not be as pretty because it’s kind of mushing things together a little bit.

Doesn’t the graphs don’t fit necessarily well enough for us to be able to have a full screen. But this is at least going to be able to help me to show you how the code is working with what the output is creating.

So the mention the slider, we’ve got that slider code up there but if we go down here, we’ve got what’s called a container where we are building out of five different columns with five different metrics.

We’ve got our first metric being the order count and we are specifying right that. Right there. And this is coming inside of a container. And we have order count. That was a metric that I created up here, where we are just taking the raw row count of that very first data sets.

So since it’s an order fact table, we’ve got one record for one order inside of that table. So we’re just taking a raw row count of that table. And when we do that, we’re taking the sum of it or the number of that row count, and we’re just putting it right there so that’s the total amount of orders during that span of time.

Now, the next metric that we’ve got is the percentage of orders that actually came in over budget, so over that allotted DoorDash for work budget. And what you can see is we actually have, I guess 55% of our orders came in over budget in that specified amount of time.

So what we ended up doing was we took a metric that we created for over budget count. The data set that we had created where we filtered on is above budget, and we just took the row count of that or over budget count.

You can see right here. And we took that amount, divided it by the total number of orders, multiplied it by 100 to get it into a percentage, and then round it off and displayed that metric right here.

And then we kind of did the same thing with these three other metrics. I won’t go through all of them, but you can kind of seem what we’re getting to now is with our aggregated data, we’re now able to.

To create these metrics of interest and be able to plot them right on the page for us to be able to use. So the next thing we actually want to go to is we want to be able to put this monthly spend by spend category onto a line chart.

And what I found and actually using this was extremely got a little it didn’t handle the data set that I had created well. And I actually had to use one of Streamlist integrations with Altar. You can see it up here.

It’s actually another data visualization package. But I had to use Altar, which actually Streamline integrates with in order to be able to get a little bit more in the way of my own my own specifications in terms of the axes, titles, some of the labels and then the actual lines themselves.

For whatever reason, if I just taken that aggregated data set that I created previously and put it here, this monthly orders data set and said, hey, Streamlit, put this into a line chart. It gave me some weird amounts and it didn’t order my axes correctly.

So I went ahead and just created a custom integration with Altair and created Chart that way. But the nice part of Streamlit is that it’s recognizes that there might be times where you need more hands on specifications involved.

And then you’re able to kind of use your own altar code or plotly code and then put that onto the page inside of Streamlets still. So it’s not like Streamlets saying it’s my way or the highway. The ability to kind of integrate with other visualization packages.

Like I mentioned, plotly use Altair here, but a lot more flexibility. And that kind of hammers home to that point I made earlier about programmatic flexibility. There’s a lot of it when you’re using Streamlit.

But point being, this is the route that I took to create this line chart and get it the way that I wanted to. But we also had the ability to create these box plots or these bar charts at the bottom. And the nice part about this first one, the weather conditions bar chart, was that the data set that I curated, the PB weather data frame actually came into good enough format for Streamline just to be able to put on screen.

So you can see right here that I just had to write St bar chart Pdweather. And that was able to give me the data set, the visualization that I wanted instead of having to go through and create the altar chart.

So for ease of quick data sets and quick visualizations, streamline does have strong ability to get those. To get those created quickly and seamlessly. And then for this last Bar chart, went ahead and actually switched up the axes to make it a little easier to read for the restaurant, for the merchant that we ordered food from.

And the only thing that I really changed here was just the axes and the chart property right here, as opposed to reuse mark bar here. And previously, for that Line chart, we used Mark line. So back to using Altair.

And again, another chart where I had a lot of flexibility to be able to rename my axes and rename and recreate, I guess, the width of the chart as well. So making sure that it fits to this specification instead of whatever STREAMLET decided that its size was.

That’s about it with regards to with regards to some of these visualizations. But as you can see, there’s again, a lot of flexibility. And this was something that wasn’t a huge level of wasn’t a huge pain to put together.

So if I just go back to the notebook, it’s a matter of creating my aggregated data, creating some tools or. Some tools like the slider and be able to integrate that with these data frames and then ultimately just creating these charts.

And it was just a matter of maybe 50 lines of Streamlit code that allowed me to create that entire, that entire application. So extremely, very seamless, very low maintenance to be able to get up and going to create a data science app or a machine learning application.

So I guess if you were to take your next steps, it would be to take a look at the Streamlit documentation and we can link to that, that’s very detailed. And again, a lot of these, the packages that I was showing earlier are very low maintenance, low input necessary.

So if you have a curated data set or data frame, it’s really just a matter of saying, hey STREAMLET, take this data frame and create a bar chart or create a line chart for me. But the very cool thing is that Snowflake, which bought Streamlit, actually last March, just announced that Snowflake Summit, that they are creating custom integration for STREAMLET within the Snow site UI.

So, you know, in the same place you would be able to see tables and tables, schemas databases, you’ll also see what they’ll be calling Streamlets. So almost your own data application within Snow site.

And I think that’s incredibly cool. But. Yeah, that’s I guess where you would head next. And I’m sure that Snowflake will be putting out more documentation on what that integration will be looking like very soon.

So feel free to reach out if any questions or trying to get started. We’re always happy to answer questions. We have our own slack workspace you can get to from our website and ask questions there or ping one of us directly on social media.

Absolutely. And I’ve got our GitHub link in the chat as well as our YouTube channel if anybody wants to refer back to the code or the video. So, in case you guys aren’t aware, we are AIC G, developers of the DataLakeHouse platform and you can go to our website, DataLakeHouse.io to get some more information about that.

Basically, we are able to sync data into your Snowflake with our ELT tool. There’s no code required and there’s some machine learning involved. We’ve got analytics, a data catalog, all kinds of cool stuff, and this is our actual DoorDash for work analytics dashboard, which from a business perspective, can be super helpful for any company that wants to kind of look into how much their employees are taking advantage of the benefits and whether or not they need to increase budget and that kind of thing.

And I believe we are the only one, to our knowledge, that offers analytics for DoorDash for Work. Currently right now we’ve got an open beta program going on which you can go to Daily Lakehouse Iobeda if you’d like to sign up for it, and you get some cool swag and access to the DataLakeHouse platform.

And this is the formula you will get to on that screen to fill out if you’re interested. All right, so if anybody has any additional questions, we can take those. And Peter asks if anyone is hiring. Christian or Mike, you want to take that one?

Yeah, we’re definitely open for hiring data engineers. A great intellectual curiosity to always ask questions about data, learn new, exciting things. If that fits your personality and type of organization you look for, definitely reach out to us.

We love to chat more. All right. In the chat, Anna also mentioned that Code Signal is hiring. Good stuff. Yeah. Always happy to help others out as well. All right, so our upcoming events, a little sneak peek at what we’re doing is using DataLakeHouse to TerraForm Snowflake.

We’ll be having that event next month towards the end, and I will get that updated soon in the Meetup group, and that should be sent out as soon as it’s updated. All right. And we also went to Snowflake Summit, and we did a great recap of that, which will be uploaded onto our YouTube channel also soon.

So if you guys are interested in what you might have missed, check that out as well. Thank you guys all for joining, and we will see you next month when we go over using DataLakeHouse to Terraform Snowflake.

Thank you so much.

More to explorer

Snowflake Loading Data with Special Characters

July 24, 2024 No Comments

Special characters in your column names can cause chaos for downstream users, tools and processes.

Building a Generative AI Competency (or the First Gen AI Project)

July 21, 2024 No Comments

When Building a Generative AI Competency one must identify the necessary infrastructure, architecture, platform, and other resources and partners that can help an AI initiative be successful. We have just like many data warehouse and digital transformation initiatives over the last 20 years fail because of poor leadership, or companies only going half in on the objective.

clock, time management, time-3222267.jpg

Snowflake Time Travel Not the First Time Traveller but Let’s Review

June 12, 2024 No Comments

IBM DB2 long before Snowflake had this concept as did a few other select databases using a Temporal Database concept. I think Snowflake was able to make it more popular and mainstream due to the association of Data Warehousing and analytics specifically