Mike Jelen

Customer Segmentation Using Python & Snowflake

Customer segmentation is the process of dividing your customers into segments based on common characteristics – such as demographics or behaviors – so you can market to those customers more effectively.

Python is a popular programming language, but not everyone is well-versed in working with Python and data. Python is easy to use, with familiar programming structures; it works well in notebooks (Jupyter, Zeppelin, etc.); and it has lots of libraries, so you are not starting from scratch.

Check out our Meetup video to watch a deeper dive of this topic.

Other Meetups and Events to check out:

Transcript from the Meetup event:
Welcome to another Carolina Snowflake Meetup group event. Tonight we’ll be talking about customer segmentation using Python and Snowflake. We have a few meetup rules to go over real quick: just that everybody’s respectful and stays on mute unless they have something important to say. 

And use the chat to ask any questions so that everybody can read them. And try to keep everything on topic. And you can keep your video on or off, whatever you’re comfortable with. And always invite a friend and of course ping us on Twitter. 

Tonight we’re going to be going over an introduction to customer segmentation, talking about a few use cases, having a data set walkthrough and a little bit of data modeling with Anubhav, and we’ll talk about our beta program briefly. Then we’ll have an open discussion for any questions and talk about our upcoming events. 

All of our previous meetups are on YouTube, and you can find them in the discussions in our Meetup group. A quick refresher on Snowflake, just because we know we have some people who are new to Snowflake and some people who are probably SnowPro Core certified at this point, but really high level. 

What is Snowflake? Snowflake is a data cloud. It is a SaaS-based offering that’s cloud centric. It’s a modern data repository that handles data ingestion and good data governance, with a unique take on access to data. 

And then with their compute and disk separation, they basically allow almost infinite scale on data operations. So really unique stance they took to solve some major problems in the market with data. 

As you guys probably know, it’s a great way to take any of these data points, events, transactions, so forth, take them all the way through that data pipeline into something that’s actually consumable by different types of data users and data workers. 

So we like Snowflake. We think it solves very real problems in the real world, and we’ve been using it for a long time. And that’s one reason we love having these meetups so we can talk about Snowflake and share some of our thoughts and also hear from everybody else as well and their experiences. 

Mike is going to give our introduction to customer segmentation. So just a quick story here. A dad went into a Target up in Minneapolis and demanded to talk to the store manager. He was absolutely furious. 

The store manager came over and said, hey, how can I help you? And clearly the dad was a bit disheveled and upset. And so the manager knew, hey, I’ve got to approach this with kid gloves and have empathy. 

And the dad started shaking a bunch of coupons in the store manager’s face and said, how dare you send my underage daughter pregnancy-related coupons? And the manager fiercely apologized: oh, my goodness, sir, that is not the intent. 

I don’t know what happened; I’ll look into it. And sure enough, a week later, the store manager called the dad back and said, sir, I apologize. We did send those. The company was using data to try to better understand what was happening. 

And the dad had fired back and said, well, actually, I owe you an apology. There have been some things going on in my household that I wasn’t aware of, and my daughter is due later this summer. And so what that really showcased is definitely one way to segment your customers. 

And really it came down to a data scientist at Target was doing market basket analysis of what were people buying and what were those products in the basket, what did they potentially tell about the future or future purchases? 

And could Target use that to help send coupons and engage with that customer? Especially when you look at expecting families or individuals, in those early days when you find out that you’re pregnant. 

Those are very, very key moments from a buying perspective: what items and things will that mother-to-be purchase, whether for the baby or for herself, from vitamins to other personal care items? 

And so of the traditional 25 items within the basket that were identified as having some sort of correlation, this data scientist applied a score to every one of those items in the basket. 

And that score, when combined with the other items in the basket, was used as a predictor: here are the types of coupons that we should be sending our customers so that they purchase additional things from us at Target. 

So that sounds awesome on paper. However, the execution, well, that needed some refinement, as the dad-and-manager story showed. So now Target has actually taken a little bit different approach to those common items being purchased, those 25 items. 

A lot of times Target will stack and position other things on the end caps of those aisles that might be unrelated: some different candies and some other foods. Also, when the coupons are sent out, Target will not only send the ones from, in this case, an expectant perspective, but it will also add in other things that are not related: furniture, clothing for adults, and just other things unrelated to pregnancy. 

Some families might not want to get the word out that, hey, we’re expecting, and others just don’t want it thrown in their face: hey, we know you’re expecting, here are some coupons. 

Oh my goodness. So now, if you are lucky enough to get some of those Target coupon books, there’s other non related things that you’ll see in there. 

And it really stems back to: we segmented our customers. Target identified through market basket analysis the predictive behaviors of future purchases, and using that data, they’re now able to market to those individual customers. 

So, while on the surface it’s a bit of a lighthearted story, what we’re really talking about here is how we use data to segment our customers, or your customers, at the end of the day.

So, some use cases, lots of different things. We talked about market basket analysis; naturally, from a loyalty perspective, it’s very straightforward: all right, you consistently buy A, B, and C. Or think of airlines, they’re really good about the miles game: hey, you get a credit card, or if you buy X dollars from restaurants within the next three months, we’ll triple those points.

And so really that gamification by using and targeting from a customer segmentation perspective, naturally there’s other efficiencies and segmentation that you can do relative to real estate. For example, all right, you’re not just creating buildings or leasing out space, you’re actually targeting.

All right, if you’re life sciences, there are particular things that the life sciences industry needs. If you’re manufacturing, there are particular things that they need and want and look for. Even from a transportation perspective, whether it’s route optimization or, more importantly, fleet maintenance: what are the different parts and services that can be offered to either the truckers or the transportation organizations?

And likewise, when we talk about data warehouse as a service: that’s where all your data is coming from, and you’re trying to make sense of it, doing the segmentation and naturally turning that into sales and marketing activities.

So, on to the next slide. Today we want to accomplish a number of different things. We talked about the story relative to Target, about understanding your consumers’ spending and buying behaviors. And it’s one thing once you understand them: what are you trying to do, and how much are you trying to increase the bottom line in your own organization? Or perhaps you’re trying to increase the total spend with the customer, and you’re not as worried about the bottom-line impact because it’s really all about that customer and brand attraction.

And then you’re using that to do prediction and classification, and to continually refine these models. Next slide. Alright, so now I think we want to look through, just to give a little bit of background on the data set that we used and how we actually were able to perform some of those analytics.

But just briefly about the data set: we were in search of some data that gave us the ability, like Mike was talking about, to look into a lot of different facets of how we could segment the customer base we were looking at.

And one of the struggles we found was that companies do not want to give out sales records just all willy-nilly. So what we ran into a lot was randomly generated data sets, and this was where we ended up just taking one data set that we liked.

Probably very randomized data, not very clean, or maybe a little too clean, but anyway, we found a data set on Kaggle that had about a thousand records of supermarket sales, grouped at an invoice level based on a customer’s visit to the supermarket.

So it had a couple of fields for unit price and quantity of items purchased, and then the actual grouping of products as they pertain to, like, health and beauty versus food versus other cosmetic items.

And then there is also another field for customer type: whether this person was a member, had a membership to that specific supermarket, or was just a person walking in off the street.

So that was a little bit about the data set, and then our goal was to get that into Snowflake so that we could do a lot of analytics on it and move into the whole clustering of the data as well.

Can you go to the next slide, Heather? Thank you. We just wanted to give a little background on actually loading data into Snowflake. There are three primary methods that Snowflake gives out in their documentation online.

They give the option to use the Snowflake UI to load data, which is very no-code centric: you just specify the data set you’re hoping to integrate, using a specified file format, and it performs the PUT and COPY commands behind the scenes.

PUT is the process where data is loaded into a stage, internal or external. Internal is kind of like the local machine side, whereas external is going to be something more like S3. COPY then actually takes that data from the stage and puts it into a target data set.

So this is going to be a solution for small files, because Snowflake does limit the size of data coming in through the UI; I think they say 50 megabytes there. This is the solution we used, because we’re really using a data set with a size of 1,000 records.

If you’re looking for more of a bulk solution, this is where the SnowSQL client comes into play. It’s a CLI solution where you have a lot more freedom to specify any sort of formats in your PUT and COPY commands and to load into specific stages.

You get a lot more freedom over the configurations you make, and a lot more in the way of data availability to be put into Snowflake. And then there’s Snowpipe, which is the option you can use to load continuous data.

So this is going to be a great solution if you are using AWS S3, where you’ve got some sort of automated process that’s continually dropping files into an S3 bucket. What you end up doing is specifying that S3 bucket as your external stage.

And what Snowflake will do, via Snowpipe, is pick up these new files and kind of stream them into your target data set, and the data will appear pretty shortly after it becomes available inside the stage.
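As a rough sketch, the PUT-then-COPY flow described above can be scripted through the connection object the Snowflake Python connector returns; the file path, table name, and file-format options here are placeholder assumptions, not details from the talk:

```python
# Two-step bulk load: PUT uploads a local file to a stage, COPY INTO then
# loads the staged file into the target table. File and table names below
# are hypothetical placeholders.
PUT_CMD = "PUT file:///tmp/supermarket_sales.csv @%SUPERMARKET_SALES"
COPY_CMD = (
    "COPY INTO SUPERMARKET_SALES "
    "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
)

def bulk_load(conn):
    """Run the PUT/COPY pair on an open Snowflake connection."""
    cur = conn.cursor()
    try:
        cur.execute(PUT_CMD)   # stage the local file (internal table stage)
        cur.execute(COPY_CMD)  # copy from the stage into the table
    finally:
        cur.close()
```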

So I think now we’re going to move on to the actual data modeling part, and Anubhav did a lot of this, so thank you, and I’ll hand it over to you. I think you’ll have to stop sharing so Anubhav can start sharing. 

Thank you. So I’ll just share my screen. Just let me know if you guys can see my screen, the notebook. (We see it.) Okay, so what we’ve done is we’ve connected a Python Jupyter notebook and pulled data from Snowflake. 

And we basically imported libraries in Python, including what’s called the Snowflake connector library. Before I jump to the part where we talk about Python and all the libraries being used here, and the machine learning aspect of the entire segmentation problem, let me just continue and share some more information on the data set that was just discussed. 

So it’s increasingly difficult to find data around customers and customer information, because companies usually like to keep it private; they keep it for their own analysis and so on. So we happened to pick up a very good mocked data set from Kaggle. 

For people new to Kaggle, it’s a great place to find good data sets and data sources, and also to play around with machine learning competitions and so on. We’ve given the sources in the notebook, and this data source basically has around 17 features, 17 columns.

And if you look at that, you have information such as the invoice ID, the branch of the supermarket, the location of the supermarket, the type of customer (was it a member or a normal customer), the gender of the customer, the price of the item they bought, the quantity, the tax, the cost of goods, the gross income, and so on.

We’ll talk more about this data set as I go ahead and talk about the various features. Moving on to the libraries in Python: we’ve called the Snowflake connector to connect with Snowflake, and we’ve used Pandas, NumPy, and so on to perform our machine learning processing. 

And then we’ve used the standard scikit-learn library to perform clustering, so you see everything being pulled from scikit-learn and so on. So this is the part where we connect with Snowflake to fetch data. 

And as you can see, you call the connector’s connect method and pass in your username and password, account name, and so on. With Snowflake you’ve also got to give a role that will be used to execute the statements, plus the name of your database and the warehouse. 

And then, as we see down here, it lets you execute SQL commands just like you would in any Snowflake or SQL IDE. And then what we do is call fetch_pandas_all and fetch the output from this query as a data frame. 

For people new to the world of Pandas, a data frame is like a table, and all the processing in Pandas is done within data frames. So we execute fetch_pandas_all, get the data in a data frame called cs_master, and then we go ahead to process the data and see what it looks like. 
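A minimal sketch of that connection-and-fetch step; every credential below is a placeholder, and the table name is an assumption:

```python
# Hypothetical query against the supermarket sales table.
SALES_QUERY = "SELECT * FROM SUPERMARKET_SALES"

def load_sales_frame():
    """Connect to Snowflake and return the query result as a pandas DataFrame."""
    # Requires: pip install "snowflake-connector-python[pandas]"
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="<user>",
        password="<password>",
        account="<account_identifier>",
        role="<role>",
        database="<database>",
        warehouse="<warehouse>",
    )
    try:
        cur = conn.cursor()
        cur.execute(SALES_QUERY)
        return cur.fetch_pandas_all()  # whole result set as one DataFrame
    finally:
        conn.close()
```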

So you have a method called shape, again from the Pandas data frame. It tells you that we loaded a thousand records and 17 columns. Then you do a quick head, as you can see. When you do head, it gives you by default the top five rows in your data frame.

If you pass a number to the function, you can pull more rows. So if you pass head(25), you can see the top 25 rows in your data frame, and so on. Your invoice ID looks like a complex number with an alphanumeric format to it.

And then you’ve got your branch details, the city of the supermarket, the type of customer, the gender of the customer, the product line of the product we’re talking about, the price, the quantity, the tax, the total cost, the date of purchase, the time of purchase, the payment method, the cost of goods sold, the gross income of the customer currently purchasing this particular product, and the rating given by the customer.
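The shape and head calls just described can be tried on any data frame; here is a tiny stand-in frame with made-up rows rather than the real Kaggle data:

```python
import pandas as pd

# Miniature stand-in for the supermarket sales frame (values invented).
df = pd.DataFrame({
    "INVOICE_ID": ["750-67-8428", "226-31-3081", "631-41-3108"],
    "BRANCH": ["A", "C", "A"],
    "UNIT_PRICE": [74.69, 15.28, 46.33],
})

print(df.shape)    # (rows, columns) -> (3, 3)
print(df.head(2))  # head(n) returns the first n rows; the default n is 5
```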

And then for the analysis, we see that we have data ranging from January 1 to March 30, 2019. So it’s three months of data. And then, just to understand the data type of each column being held, there’s the info method.

You see that you have floats, integers, and objects; an object is like a variant which can hold strings and mixed types in Pandas. Then, checking for any duplicate entries: just doing a duplicated sum indicates we have no duplicate entries.

We could run that, and then just look at the various values that we have, the unique values in each column. Calling the unique method on your column name tells you the unique values that you have.

So we have three branches, A, B, and C; again, it’s mock data. And then for city, we have three cities: Yangon, Naypyitaw, and Mandalay, which I believe are cities based out of Burma. And then similarly male and female for gender, and member or normal for the customer type.

Then, as I showed previously, the invoice ID in this data set is a very complicated alphanumeric value. So here we use something called a label encoder, which essentially converts this complex invoice ID into a list of numbers. That’s done to make it easy for us to process when we do clustering in machine learning, and that’s what we’ve done.

We’ve converted those complex values to simple numbers that can be processed when we pass the data to a machine learning model. So we used a label encoder, but there are more encoding techniques that you can use.

You have one-hot encoding, categorical encoders, and so on. And then, in addition to the data analysis that you can do in SQL, Python helps you go deeper in terms of the plots you make and the various columns you look at.
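As a small illustration of the label-encoding step (the invoice IDs here are just sample values):

```python
from sklearn.preprocessing import LabelEncoder

# Alphanumeric invoice IDs, as in the data set; duplicates share a code.
invoice_ids = ["750-67-8428", "226-31-3081", "631-41-3108", "226-31-3081"]

le = LabelEncoder()
codes = le.fit_transform(invoice_ids)  # maps each distinct label to an integer

print(codes)        # [2 0 1 0] -- labels are numbered in sorted order
print(le.classes_)  # the original labels the integers map back to
```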

So for each column we’ve done a quick plot analysis here, and then, as you see, this being mocked data, the distribution of data for each column is kind of similar. So you have a perfect breakdown of data across branches, a similar breakdown across cities, customer type 50% across male and female, 50% across member and non-member.

And then approximately 17% of the data breaks down across each of the product categories that you have. And then we’ve got a box plot to see if there’s any difference in terms of rating. So on the y-axis you have the rating that is given by the customer for each branch.

So your branch A and branch C have similar ratings, but your branch B for some reason has been given a lower rating compared to A and C. That’s good to know. Similarly, we plotted the rating given by customers across each product line, and it’s pretty much similar, not much change to it. And we can also break down the gross income.

So basically we’ve seen where high-spending customers spend the most, and that’s also not varying a lot across the product lines. So you have high-paying customers spending across all of these branches.

Yeah, this is an important one. When you do feature engineering, it always helps to control the number of columns or features that you pass to your machine learning model, to ensure that you don’t have any features which are duplicated or which give you the same information again and again. So you have something called a correlation matrix, which indicates the correlation between any two columns or features at a given time, and the value of the correlation will range between plus one and minus one.

Plus one indicates the strongest positive correlation, and minus one means negatively correlated, in the opposite direction. So anything which is red or amber in color indicates strong correlation.

And what is interesting to see is gross income. Customers with high income are positively correlated with unit price, which might indicate that customers with high gross income are probably willing to buy items with a higher unit price. And similarly, customers with higher gross income have a strong correlation with quantity, so they’re more likely to buy a higher number of items.

But again, this is something we’ve got to be careful about: we should not mistake correlation for causation. So it’s always good to keep in the back of your mind that these are just indicators, not final answers to each of the questions we ask.

And I think it’s pretty much logical that tax, total, and cost of goods sold have to be correlated with unit price and quantity, because if you sell more items, you are likely to pay more cost of goods sold, higher tax, a higher total amount, and so on.
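A toy version of that correlation check, with invented numbers: the derived TOTAL column is strongly correlated with both of its inputs, which is exactly the sort of redundant feature a correlation matrix flags.

```python
import pandas as pd

# Invented figures: TOTAL is derived from the other two columns.
df = pd.DataFrame({
    "UNIT_PRICE": [10.0, 20.0, 30.0, 40.0],
    "QUANTITY":   [1, 3, 2, 4],
})
df["TOTAL"] = df["UNIT_PRICE"] * df["QUANTITY"]

corr = df.corr()  # pairwise Pearson correlations, each in [-1, +1]
print(corr.round(2))
```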

I’ll just take a quick break here for a few seconds if there are any questions before I go ahead. Okay. And then one good thing we also wanted to see was: is there a difference in the number of items being sold on a daily basis?

For example, you might expect that during the start of the month people buy more, and toward the end of the month people buy less. But we could not see any specific trend; at least in this view, it pretty much seems to follow a random pattern. So even the day-wise breakdown would not give us a lot of information.

In continuation of the previous analysis, if you plot the unit price on the y-axis against the gross income, it does indicate that customers with higher gross income are more populated toward higher unit prices.

And then, if you look at the lower quarter, customers with lower gross income are probably more densely concentrated toward the lower end of the plot, which again links to: higher-income customers are more likely to buy products with a higher unit price.

This is another way to do that. And then this is just to show, if people are interested, the mean, the max, the standard deviation, what’s called your descriptive statistics, and what they look like.

So you can see, for example, the average rating of an experience is around seven, and the average unit price is around $56, and so on. You could also look at the mean, mode, and so on.
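Pandas’ describe method produces exactly this kind of summary; the numbers below are invented sample values, not the figures from the Kaggle set:

```python
import pandas as pd

# Invented sample values standing in for the real columns.
df = pd.DataFrame({
    "UNIT_PRICE": [74.69, 15.28, 46.33, 58.22],
    "RATING":     [9.1, 7.0, 7.4, 4.1],
})

# describe() reports count, mean, standard deviation, min/max and quartiles.
stats = df.describe()
print(stats.loc["mean"])  # per-column means; RATING averages 6.9 here
```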

So that was more from an EDA, data exploration, standpoint, to see what our data tells us. And now what we’ve done is we’ve gone ahead to see, if we just take this data set, whether we can apply a machine learning technique called clustering; and clustering is unsupervised learning.

When I say unsupervised learning, I mean we’ve not told the algorithm anything and we’ve not created any training and test scenarios; we’ve just given the data as it is to the machine learning algorithm to find any patterns.

And using those patterns, it creates categories for us which can be treated as segments in this particular scenario. And this is the tip of the iceberg; there are many ways of doing customer segmentation. We went with one commonly used approach, called K-means clustering.

What it basically does: K stands for the number of clusters that you want your algorithm to create. And then, based on those clusters, you can correlate and see what your data tells you.

So, for example, in this scenario, we’ve tried to create clusters to see if our algorithm can divide the data set that we have into something like low-paying customers, high-paying customers, and so on.

So that’s the kind of analysis, the segmentation, that we’ve tried to do as part of this particular meetup. Before I go into showing the analysis, here’s just a quick discussion of what K-means clustering is.

So, as I said, it’s an unsupervised learning algorithm; there’ll be no training done on the algorithm as such. And what we’ll basically do is talk about how you run a K-means algorithm and what parameters are used to customize it.

And then we’ll talk about something called an inertia curve, which is a good way to come up with a value of K: the most optimized value you can use given your data set. By principle, what K-means does (let’s spend some time on the algorithm here) is that, first, you choose a number of clusters; let’s say I want three clusters for my analysis.

In step two, it starts by choosing random points from your data as centroids. A centroid can just be taken as the center of your cluster. So, for example, if I have three clusters, it takes three random values as the centers of my clusters. And then it takes each data point in my data set, finds the shortest distance between the point and these random values, and clusters the points together accordingly.

It does this in iterations, very similar to Monte Carlo simulations: it runs them and keeps running them in multiple loops, and it finds the best possible set of clusters which are closely linked to each other.

So its main two properties are that all the data points which belong to one cluster should have the smallest distance between them and their centroid, and that between the centroids themselves, the inter-cluster distance should be as far apart as possible.

So ideally, you should have minimum intra-cluster distance and maximum inter-cluster distance. That’s what the K-means algorithm works to achieve, and it basically stops when it sees that no better centroids can be found: the search has been optimized and your points are starting to remain constant in their clusters. Or, if you specify that your loop should only run, let’s say, 10,000 times, then it exits once 10,000 iterations have been done as part of the algorithm execution.

So, as I said, choosing the right number of clusters is very important. How do you know whether to pick three clusters, four clusters, five clusters, and so on? One possible approach is that if you have a business need, your business might say, hey, you know what, just divide it into four clusters and tell us what’s possible; then you have the K value given to you.

But if not, if you want to find the optimal value, then we have something called inertia analysis. Inertia is nothing but the sum of the intra-cluster distances. So you have your centroid here, and then you have the data points in your cluster here.

It tries to find the sum of the intra-cluster distances. When you come to plotting it, it’s called the elbow method, by the way, and it’s called the elbow because we plot the inertia, the intra-cluster distance, for each number of clusters.

And then, ideally, we want to choose the one with the lowest inertia for the given number of clusters. When you plot this chart, you will often see an elbow shape being created, which indicates a point of deflection where your inertia value has gone down.

And it starts to stabilize after that. So in this particular plot, you see that at values two and three, cluster numbers two and three, your inertia reduces, but going forward after that, it’s pretty much stabilized.

So whether you choose four clusters, six clusters, or eight, it’s not going to make a difference. So you might want to choose two, three, or four, one of those, for your analysis to be optimized.

And in scikit-learn, we already have something called KMeans. Scikit-learn is a very well-known, very commonly used Python library for machine learning; not just K-means, but most of the other algorithms are also covered in scikit-learn. So we’ve got the KMeans method, and to plot the graph, we plotted from one to eleven clusters to see what the inertia value looks like. It internally calculates the inertia for you given the number of clusters you pass.

And we used something called k-means++. K-means++ is nothing but K-means with a more optimized approach to selecting the first random points: instead of choosing purely random points, there’s an algorithm in k-means++ that chooses the initial points that should be used as your starting centroids.
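A hedged sketch of that inertia loop, using the k-means++ initialization just mentioned; the synthetic points here are my own stand-in for the encoded supermarket features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic 2-D data with three well-separated groups.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Elbow method: fit K-means for k = 1..10 and record the inertia
# (sum of squared intra-cluster distances) each time.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

# Inertia keeps falling as k grows, but the sharp drop ends at the
# "elbow" -- here around k = 3, matching the three groups we generated.
print([round(i, 1) for i in inertias])
```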

So we’ve taken that; this is one way to initialize it. And the thing is, it’s easy to analyze the entire picture when you have two or three features, when you can plot them in one, two, or three dimensions.

But once you go beyond three dimensions to multiple dimensions, for example the 17 features here, it forms 17 different dimensions, which the human mind cannot really visualize. So in that case, it’s good to use this cluster analysis to see, for 17 dimensions, what the right number of clusters is.

So for this particular scenario, it says that anything between two, three, and four might be a good value to use for your K-means clustering. I’ve taken three in this piece of code: I’ve initialized the KMeans method and passed in my value X, on which, when I say fit and predict, the algorithm runs based on my features.

It maps a cluster value to every data point internally and tells you, for example, if you have five rows, it will do the analysis and say that row number one goes to cluster number one, row number two goes to cluster number two or cluster number three, and so on.
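A self-contained sketch of that fit-and-predict step; the three synthetic bands below play the role of low-, mid-, and high-spending customers (all values invented):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for (gross income, total expenditure): three bands of customers.
X = np.vstack([
    rng.normal(loc=(20, 100), scale=5, size=(40, 2)),   # low spenders
    rng.normal(loc=(50, 400), scale=5, size=(40, 2)),   # mid spenders
    rng.normal(loc=(90, 900), scale=5, size=(40, 2)),   # high spenders
])

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)  # one cluster label (0, 1, or 2) per row

# Naming the clusters is the analyst's job: inspect each cluster's mean.
for c in range(3):
    print(c, X[labels == c].mean(axis=0).round(1))
```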

It just gives you clusters. So, for example, if I come down, and just because this data set is such a beautiful one where things are perfectly correlated, it’s kind of simple that way, that’s where you see a very nice linear plot of clusters being created.

But more often than not, you will see clusters like one cluster being created here, one here, and one here. What it’s done is, since I’ve chosen three, it’s told me that you have one set of values which can be taken as one cluster here, and then another set of values which can be another cluster here.

So I’ve just called them class one, class two, class three, but essentially they are cluster one, cluster two, cluster three. And then you have one set of clusters which are mapped here. Now, it’s just told me three classes.

It’s the task of the analyst, the data scientist, or the engineer to look at what cluster output has come out and what it means for the business. So in this particular scenario, I look at the plot of my data: total expenditure against gross income.

Looking at that particular plot, what we can analyze is that this particular cluster is talking about scenarios where items which are higher in expenditure are being bought by customers with increasing gross income, the higher-paying customers.

And similarly, you can tell that these are the low-paying customers. And then, if you look at this particular cluster, these are probably customers who sit a little higher than the previous group.

And then this cluster of groups is the highest-paying customers that you have in your data set. And this is, again, dependent on what your x-axis, your point of analysis, is. So yeah, it creates the clusters for you.

And then we've listed some more sources that you can read. There are many different pieces of analysis done by other people on the same data set. It's a good way to look at different approaches and get more ideas on this particular concept.

So I think I've covered the clustering standpoint. If there are any more questions, happy to take them now. Thanks, Anubhav. I can share now if you let me. So as far as next steps go, if you are trying to do any real analysis, you're going to want to find a better data source than we found, because it was, as you can see,

pretty flat and not very robust. If this was a more robust data set, you could segment based on gender, the product line, the location, or their loyalty status (I think that was also included in there) to find all kinds of great insights, and then you would adjust your marketing strategy accordingly.

We are AICG, the developers of the DataLakeHouse platform, and we have a beta program going on right now. Our platform, DataLakeHouse, is capable of ELT. There's no code required, and it uses machine learning and produces analytics.

And we also have a data catalog and all kinds of cool features. If anybody is interested in joining our beta program, we've got a link that I'll put in the chat right now, and we've got all kinds of cool stuff we're offering in exchange for a little bit of feedback on our platform.

And the link will take you to our site, where you'll just fill out a short little form, and then we'll send you some next steps in your inbox. Sorry, I didn't mean to click on it. It's not wanting to switch the slide on me.

So this is our No Bad Data mascot, just to show that we don't believe there is such a thing as bad data, only bad processes. And yeah, that's basically the end of the presentation. Does anybody have any questions?

So, a great walkthrough, guys. Awesome material there. A quick question for you: if you're going to take some of that clustering and try to identify some of the customers, maybe if you're looking for the top three clusters of customers based on buying behavior, there's the modeling step there.

But what would you give as insight for a next step there, for maybe a marketing team or something like that? What would you say a good next step would be? Yes, in fact, that's a great question, because once you have the clusters created, right, so in my scenario, for example, we spoke about clusters one, two, and three.

So what you can actually do is take the data set, and within the data frame itself, you can specify that, hey, just pick up those customers which have been marked, let's say, as part of cluster one and two.

For example, if I say my highest paying cluster of customers, based on the output I see, is cluster number three, then I just segment and pick up the rows where the cluster number is three. I know that this set of 50 or 60 people in my data records are the highest paying customers.

And then you can use that further to see if you have any other trends in that data set. For example, within those 60 customers, you might see males spending more than females, or females spending more than males, or look at what product lines they're buying, the average cost, the average taxes they paid, and so on.
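The segment-then-drill-down step described here is a one-line filter plus a groupby in pandas. A sketch, where the column names (`cluster`, `gender`, `total`) and the values are illustrative assumptions rather than the demo's exact schema:

```python
# Sketch of the follow-up analysis: pick out the customers assigned to the
# highest-paying cluster, then look for trends inside that subset.
# Column names and values are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "gender":      ["M", "F", "F", "M", "F", "M"],
    "total":       [20, 25, 210, 240, 230, 30],
    "cluster":     [1, 1, 3, 3, 3, 1],
})

# Keep only the rows assigned to cluster 3 (the highest paying segment).
top = df[df["cluster"] == 3]

# Within that segment, compare average spend by gender.
by_gender = top.groupby("gender")["total"].mean()
print(by_gender)
```

The same pattern extends to product line, location, taxes, or any other column: filter to the cluster of interest, then aggregate.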

So you can further mine deeper by creating a subset of your target cluster. And in data science, right, you often do some sort of ensemble, and clustering is usually one part of an ensemble, where you take the output of a clustering step and feed it into some other machine learning algorithm to get further data analysis, predictions, segmentation, and so on.

Awesome. I'm sure that's a topic for another day. I think there are a couple of questions in the chat. Let's see. I think Andy was asking, when you do K-means clustering, will the result be different each time you run it?

That's actually a great question, Andy. And as is typical with any of the machine learning algorithms that you use, you would like to execute them multiple times to take away any form of bias that you have.

So it won't vary a lot in terms of the clusters being created, but you might have a few records, a few data points, moving across clusters here and there, because it's a very small partition that you have across clusters.

Maybe this scenario was actually a great data set in that you could have such clear demarcation, but more often than not, in real life data sets, you will see clusters which overlap each other.

And then when you run the K-means algorithm multiple times, you will see that a few of the points which were in cluster A have moved to cluster B or cluster C and so on. So it's a good idea to run those algorithms multiple times and average it out to see what's the best possible representation.
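One way to check that run-to-run stability is to fit K-means with several different random seeds and measure how well the labelings agree, ignoring label permutations. A sketch, assuming scikit-learn, on synthetic well-separated data (where agreement should be near perfect):

```python
# Re-run K-means with different seeds and compare the resulting labelings.
# The adjusted Rand index is 1.0 for identical partitions regardless of
# how the cluster numbers are permuted. Data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
# Three well-separated blobs of 30 points each.
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0.0, 10.0, 20.0)
])

runs = [
    KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)
    for seed in (0, 1, 2)
]

# Pairwise agreement of later runs with the first run.
scores = [adjusted_rand_score(runs[0], runs[i]) for i in (1, 2)]
print(scores)
```

On overlapping real-world clusters these scores drop below 1.0, which is exactly the "points moving across clusters" effect mentioned above. Note that scikit-learn's `n_init` parameter already restarts the algorithm internally and keeps the best result, which handles much of this for you.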

Thanks, Anubhav. We've got our upcoming events here. Sorry, do I have my thing on there? Using Python and MindsDB for machine learning with Snowflake is coming up in March, and then in April, we'll be talking about using Python to execute Snowpipe.

And we're trying to be the best Snowflake meetup out there across the globe. So if you know anyone who wants to join or see any of our meetups or take part, definitely reach out and invite them.

Of course, all the code that we put together here and all the sessions are out there; they're open source. We record the sessions and put them on YouTube, and the code we put out on GitHub, so everyone has access to that information.

And the cool work we're doing over here at AICG and DataLakeHouse. So definitely recommend us. And if you have a topic that you'd love to hear more about, maybe something you're working on for your organization, or maybe there's a client you're working with if you're in professional services, definitely send us a note.

We just love this stuff. We love working with Snowflake. So just drop us a note either on meetup or through another channel here, and we’ll be happy to kind of dig into that topic and then have a meet up on that subject. 

Thanks, Everyone. We hope to see you next month for our next one. 
