Machine Learning (ML) is an aspect of Artificial Intelligence (AI) to provide a means for creating predicts and classifying data without having to program every possible outcome. Using historical data is the starting point so having large sums of data helps to increase the accuracy of Predictive Analytics. MindsDB is a SQL-based construct database that helps to address ML problems where multiple technologies and tools are integrated within a database.
We’ll be making use of #Python and the Snowflake rest API to trigger Snowpipe continuous data loading. Check out our Meetup video to watch the deep dive on this topic.
Other Meetups and Events to check out:
Transcript from the Meetup event:
Welcome everyone to another Carolina Snowflake Meetup. And so this one is all about using Python and MindsDB for machine learning with Snowflake. And I think there’s going to be a general focus here on on MindsDB itself, so we can jump into some logic here.
Alright, so we just kind of go over our standard meet up rules that we have. I think folks who have been here before understand we’re trying to get everyone’s participation, but if you don’t want to participate, that’s okay.
Use the chat to throw in questions and video is optional, but definitely try to be social. Check us out on all the awesome social platforms. Twitter, Instagram, Facebook, all that good stuff at stuff at DataLakeHouse.
And real brief agenda for us today. We’re going to talk about machine learning a little bit, how MindsDB intersects with that and attempts to democratize machine learning. Couple use cases walk through.
We’ll talk about the data set that we’re going to use for this particular Zoom or this particular zoom meeting for the meetup and really kind of dive into some technical demo walkthrough with MindsDB, but also with Python on the way that potentially the same prediction can be made.
And Drew talk to you guys about that. And then of course, we’ll always highlight things about data and how we’re getting data into Snowflake and things like that. As I mentioned before, all Meetups are out there on YouTube and we also post all of our code over on our GitHub.
For these meetups. So anybody who wants to try to recreate what we went over today, they can jump over there and clone that repository and do that. I’m going to Zoom right through the Snowflake. What is it?
What does it solve? Just again, Snowflake is an awesome cloud based platform that really problem of quickly ingesting data using cloud compute so that multiple type of data workers and data consumers can leverage and share data at scale.
So let’s talk about machine learning. All right, so a lot of us are familiar with SQL and some of us, not all of us are familiar with machine learning at a high level. And so just a brief description of what machine learning is.
It’s really an aspect of artificial intelligence. So there are several different levels. We’ll just focus on machine learning. And it’s really about providing a means for creating predictions and classifying data without having to program every single iteration or condition for how to use that data, how to classify or predict based on the data you have for the particular outcome you might be seeking.
And so really it uses historical data to do that. The more data a machine learning model or algorithm has, the better more accurate that prediction can be. And so it’s very important to know that as we’re talking about MindsDB tonight, because they’re all about trying to democratize machine learning.
I mentioned before, some of us are familiar with using SQL. Hopefully you’re here because of the Snowflake slant on the meetup. And so MindsDB provides this capability for leveraging some of your existing SQL skills.
In order to do machine learning, which typically you’re using some other type of coding language to do that, machine learning such as Python. And there’s some other ones I won’t go into right now. And the machine learning can be really broken out into these three different aspects of supervised, unsupervised or reinforced learning.
And again, we won’t go into great detail. We’ve got some other material out there already on machine learning. Check out our other Meetups. But tonight we want to focus on again MindsDB, which, as I mentioned, allows people who might not be able to write in Python or might not be able to write in some of these other coding languages to do.
Machine learning, enabling them to use SQL, something they’re already familiar with and the language that’s been out there for decades. At this point, in order to do things like predictive analytics, create predictions of machine learning and so forth and so on.
And one of the ways it does that is by latching on to some of these existing leading databases that are out there. I’ll read a list analogy, but obviously Snowflake is one of those on the list that we have highlighted there with a nice smiley face.
So that’s what we’re going to be talking about tonight, how we can leverage Snowflake and a data set from Snowflake to integrate with MindsDB and to allow us to produce a prediction using, in essence, SQL.
All right, so again, we don’t work for MindsDB, we don’t work for Snowflake, but we do work with those teams. And at a high level, the problem that MindsDB is solving is this kind of really long process so far traditionally and historically, of preparing machine learning models, doing the modeling, then of course, attaining all those different data sets outside of into the database, maybe from a S3 bucket or Google Cloud Storage, getting all that prepared.
And then with that model, then trying to do that training, maybe doing a bunch of ETL to get the data in appropriately. And a lot of times in machine learning we use database data anyway. So again, we might use something from S3, you know, bring that into Snowflake using streams and tasks, what have you, leverage some other databases, integrate that create some sort of subset data set.
And of course, we have to keep history and whatever we did the training on and so forth and so on. So what they’re trying to do is really trying to solve this problem of having this huge machine learning ops.
So just kind of like people in DevOps for writing code machine learning ops and trying to get that into a place where it can be more democratized, I guess, in essence, giving more people access to the ability to create machine learning models and capabilities without having to, you know, spread their skills so wide, right?
Kind of bring it all into the to the database tier or as much as possible into the database tier, I should say. So I’ve already kind of highlighted some use cases for machine learning and how we might apply Mine’s, DB and the capability there.
So we’ve talked about this in previous Meetups, in case you’re not too familiar with machine learning. Again, it really solves and adheres to these use cases of forecasting and finance, right? Anomaly detection, credit fraud, predicting sales, filtering applications and applicants from a human source, human resources perspective.
And there’s a lot more if you tie in manufacturing and, you know, logistics and some of these other use cases. They are plenty. And so one of the use cases we want to talk about tonight was around Shopify data and.
And some of, you know, orders, sales products, so forth and so on. And Drew is going to get into more of that, but, you know, just kind of highlight what we’re doing. So we’re basically taking Shopify data from a Shopify store.
And so if you guys are familiar with Shopify, it’s an ecommerce platform that really you could think of it as democratizing ecommerce merchants, providing them with a complete platform for adding their products, creating a store, shop selling their products, everything behind an ecommerce store that you might want to put online.
And so obviously, that data is valuable from, like a retailer ecommerce perspective. So we pull that data into Snowflake using a tool called DataLakeHouse.io. If you haven’t been to DataLakeHouse.io, definitely check them out.
And we’re going to pull sales order data so we can forecast that using MindsDB. But we’re also going to try to do a little bit of a comparative analysis, maybe at a very high level, comparing, hey, what are the capabilities of MindsDB, you know, prediction algorithm versus maybe a Python prediction algorithm using linear regression and so forth.
So we’re going to take a look at that as well, just to kind of take a look at some data. And Drew, I think I’m still doing this piece here. So shopify data. Using Data lake house to get that into Snowflake.
So I’ll just kind of click this button here and just kind of go over here and just kind of view some data. So we typically use a database called Data lake House Raw. Kind of brings in the raw data. Here you can see we’ve got Shopify data coming in.
So there’s all these nice tables, these entities coming from a Shopify data source or the database behind Shopify. So we can look at some of those data sets. So we’re going to be looking at order header or items and checkouts.
And so if we kind of take a look at our orders. Just kind of scan that real quick, but kind of do a preview on the data. You know, we can kind of see. We’ve got some variant columns here for address, customer ID, kind of a decent size table.
We’re getting that order number. We can get over here to currency. Obviously, we’re going to have things like other information about the particular sale, total dollar amount, price, discount amount, so forth and so on.
Look at line item, obviously in detail, what’s been purchased, the price, the quantity, the actual title of that item. And then we’re going to look at some of the other data for checkout and so forth.
So that’s what we’re going to use in our data set today. If I just jump back here to the slideshow. Getting the data in a Snowflake pretty easy using DataLakeHouse.io. So a decent amount of connectors there where we can pick shopify, search for it, find it, enter the detail.
And you can see here we use basically this target scheme, that prefixes ecom co. And that’s what you saw there, ecom co underscore Shopify. All right, so that’s data lake house, getting data. And so, in essence, we just did these steps, right?
Use data, lake house, IO, connect to shopify. Then we create a connection to our target. So this is all online, no code. Takes like five minutes to make this whole process work. Then we synchronize the data down to Snowflake.
You can set an interval if you want, like every 2 hours, every 6 hours, depending on how your data is changing. I think you can get as low as like, whatever, 15 minutes or something like that, if your data is changing that fast.
And then we deploy into mines. DB. So we’ll talk about that here in a minute. And we took a quick look at the Shopify data already. Kind of did that, just kind of real time. And now we’re going to jump into the actual, like, actually using MindsDB to do some of the data predictions.
So I’m going to hand this over to Drew real quick. Don’t want to steal his thunder unless this is my part here, Drew. And looking at the slides, I think this is you. So I’m going to stop sharing real quick and then hand this back over to you or hands over to you to begin with.
Yeah. Thanks, Christian. It was kind of a great overview of what we’re looking to do, really just kind of going through lines DB and understanding how you can take specifically with Shopify data that we’re looking at and being able to create predictive insights based on based on a dataset that you’re getting from Shopify.
So can everybody see my screen? Yes, it should be the Minecraft editor. So that’s where we’re going to start. And I guess one important note is Chris talked about getting data from Snowflake, and that’s basically what we did for the most part, except we took one step right here where we exported that data from Snowflake into a CSV file in order for Minev to consume.
We had a little bit of trouble when running through our, I guess, quick start and configurations with connecting to Snowflake. We weren’t able to find some good documentation on that, but we were able to just use the CSV file editor, the Upload file capabilities of MindsDB to get our data into MindsDB for that actual consumption process.
So the process is as simple as uploading here and inputting the table name that you wanted to. You want it to be referred to within MindsDB and you upload it and it’s very simple process. So ultimately we’re just going to display that process right here by running this show tables from Files command.
And we see that we have created a table in my CB called Vendor Order Data. And that’s the data set that we that came exactly from Shopify that Christian was showing you before, that was able to be extracted from Shopify via Data lake house and put into Snowflake.
So the only process that we changed up here is that we took it in a CSV file. So just a quick demo about that actual data set. And what we’re showing here is very basic data set where we are looking at the number of orders by, I guess on a specific date.
So for Supply Dark, their vendor, they had three orders on February 9. So pretty straightforward. And ultimately what we want to do based on this input data set is to be able to predict how many orders we get in a specific day based on the vendor that we input.
So if we wanted to see for Digimon or Barbie however many orders that they have or however many orders that they might be able to have on April 1, this is what we’re going to be able to use Minds to be to be able to do so.
MindsDB is able to do the whole machine learning, predictive analytics part of. What Christian was discussing earlier. So right here we’re going to run a create predictor command which is essentially just creating a training model.
So we are creating we’ve got this data set in here that we’ve inputted via our file and we’re going to use this data to train a model and we’re going to call this model Shopify Orders Five. And basically we’re specifying to MindsDB.
We want to create it from the files. It’s either a database or a schema but we want to and from that files database schema. We’re specifying that we want all of that vendor order data just from our select star command here.
And we’re also telling the model that we are trying to predict orders which is this field right here using two parameters using our order day and our vendor. So as our we’re trying to predict orders are dependent variable based on our independent variables order day and vendor.
So we have done that. We’ve created our predictor just run that command real quick and I’ve already done it so it already exists. So if we are just we can run a quick select star from the this is a MindsDB internal table called Predictors.
We are just running it where our model name is Shopify Orders Five and it’s saying that it’s complete so it’s already trained the model. We found an interesting little piece here where accuracy is 0.0 and we have maybe an idea of that why that’s true, because maybe it haven’t run a bunch of test data on it yet.
But we would love to do a little bit more investigation and to see why that is the case, but just it’s a little bit more metadata. The rest of this, the rest of the fields in this little table here. So that’s kind of what you get when you query from this MindsDB predictors table.
And we can also run a described command which will give us an our squared score as well as a little bit more metadata about the model itself. And again, we think probably our squared value here is similar to the, similar to that accuracy score that we were seeing in the previous table.
But anyway, the beauty of MindsDB is we’ve created this training model based on the data set that we input. But what MindsDB does really well is they have the ability to create what are called AI tables.
And AI tables, they’re similar to regular SQL tables such that they include data, but they also infer based on inputs that you give it. So if you were to query the data that we had input from the file earlier, such that we are just looking for Barbie and the order date of 2022 418, well, that data doesn’t exist yet because 418 is in the future.
But the beauty of the MindsDBAI table is that we are that based on the model that we trained, it will input both of these. These independent variables as such into the model and it will output the actual results of the model.
So this is where we actually run our predictions. The cool thing is it’s very, it’s very similar to what you would expect when you ran actual SQL. So, for example, I’m just going to try and figure out what the predictive model would say.
How many orders would occur on 418 22. So I run this and it’s specifying that I would probably have five orders on that day and it does it according to a confidence level of 99.9%, which is also maybe something else we would try to investigate a little bit more.
That seems pretty accurate, but obviously we would need to do a little bit more investigation as to why that number would occur like that. This is how you get your predicted analytics and it’s almost by running just straight SQL.
So just very cool features and just to be ran. One more prediction here for six days later, it predicts that we would have two orders. So you can replicate this SQL statement in order to get however many predictions.
You could also switch up the vendor name or you could switch up the order date. But I think the big picture is that here with my CV, you’re able to run a predictive model based on set of inputs from a file or.
Data set. They do have numerous data connections, as Christian was mentioning before, but once you create that model based on those inputs, from that connection, you are left with this very cool configuration in an AI table of where you are able to run basically your own freedom of asking for predictions from the model.
So, very cool stuff all around from MindsDB is what we thought. But ultimately when we were doing this, we wanted to see how it kind of would stack up against what you could do just using machine learning with Python or with R or some other tool as you normally do.
So what we ended up doing was we worked against we ran a linear regression model in a Python Jupyter notebook here. Ultimately, for the sake of simplicity, we went with linear regression. So we could have gone with something a little bit more complex, random forest or any other machine learning algorithm, but we really just try to keep it straightforward so as to keep kind of the focus on understanding how Mines BB works and being able to compare it against something concrete.
So we went with linear regression and ultimately very similar process in terms of bringing in data, training data and then testing it and creating predictions off of those off of that trained model. So ultimately we start with.
Bringing in that same CSV file that we referenced in the MindsDB walkthrough, that vendor order data CSV. And just a quick look at it, we’ve got it’s the same data set. So there’s nothing too crazy. It’s order day, the number of orders on that day for a specific vendor.
And then ultimately our next step is creating a linear regression model using the linear model function here. And I think the big key with the data set that we are looking at and what we want to accomplish within linear regression is that the two that are kind of predictive fields are independent variables.
What we’re talking about earlier, our order day and our vendor are categorical variables in this case. So categorical being they are discrete, they’re not numeric basically. So what we end up needing to do is we need to go through a process of converting these variables into numeric variables so that we can use that y equals MX plus b format of linear regression.
So we do kind of go through this cleaning process here where we first of all just want to drop orders from our data set and ultimately get down to a point where we are doing a little bit more massaging of the data to give us categorical numeric variables.
So we go through the process of turning order day into a numeric. Value. So I don’t know. Anapod was the kind of curator of what we got in this code. But I don’t know about the off the top of my head what value it’s converting to right here, but I believe if it’s anything like other values, if you convert from daytime to numeric values, it’s going to give you a value that is a certain amount of time from a given start point.
So this 738 195 would be the specific amount of days from a specific start time, whether it’s January 1 of 1900 or something like that. So if you’re looking right here between 738 195 and 738 215, the difference there is 20.
So I think we’re talking about a difference of 20 days which would line up between February 9 and March 1. So that’s kind of what we’ve got in that numeric variable. And so we’ve basically converted our date to a numeric value.
And what we need to do now is convert our vendor to a numeric value. But easier said than done when you’re thinking about kind of the Y equals an X plus B format. What we need to do is basically create a multivariate linear regression model in this case.
So our process is now making kind of almost a matrix looking data set where we are specifying via one hot encoding. The dependent variables that occur or independent variables that occur for a specific record.
So if we’re thinking about kind of going from this data set to this data set right here, we are saying that supply dark, for example, on our first record right here, isn’t present in any of these variables except for the supply dart column.
So this is just a way of representing that, yes, this record is specific to the specific vendor that we’ve got specified here. So via one hot encoding, we get this more matrix looking data set. But in essence, this is going to allow us to run a multivariate linear regression model.
So we’re able to run predictive analytics on the based on the order, date and the actual vendor that we’ve got. So we run through kind of a process of actually fitting our linear regression model based on this new matrix via one hot encoding.
And we have now created our model, which has, you know, if you’re thinking about y equals MX plus b, but also in the format of a multivariate linear regression model, we’ve got a number of coefficients for each specific independent variable.
So instead of looking at. Two variables. We’re actually looking at 20, however many, 46 columns. So we’re looking at 46 different independent variables here between order day and all of our vendors. And this is the coefficients that were given for each of those independent variables.
And a little bit more massaging here. But anyway, we want to end up getting to a point where we can run predictions based on our regression model. And this is kind of what we do here is we create a bit of a test data set where we are feeding it a specific date, 2022 418.
Kind of like we mentioned the mines DB model, and we’re saying that we want to specify the number of orders that we get for the specific vendor, barbie, in this case. And when we do that and we run our test, we end up with eleven.
So it’s actually predicting that on that date, we would have eleven point 36 orders for Barbie. So that’s how we would go about running a linear regression and being able to predict based on that vendor date or that order date and the specific vendor.
So all in all, you can kind of see that it’s pretty lot more extensive being able to run this code and getting everything all massage to the point where we could do a linear regression on it. Versus what?
We saw with MindsDB, where it was a matter of eight lines of code or however many SQL commands, eight SQL commands that we ran. Ultimately very simple. The simplicity is what minds to be. Gets at and it really makes running predictive analytics quite a bit easier than what you would do in just a Jupiter notebook, just any sort of more manually created machine and learning model.
Christian or Heather, do you mind sharing back? Yes, got it over here. And I think there’s a question that’s going to bring up one of the docs as well from these guys. So MindsDB is also open source for the most part.
There’s a docker image that can be used and leveraged for deploying it on your own local machine and playing with it. I think we had a couple of questions on a couple of questions if it’s using an arena model or LSTM.
And I think at the end of the day, you can kind of bring your models for the most. Part, keras really can bring her models, right? And think at the end of the day, what mindset Be is trying to do is just democratize machine learning, getting more people who might have more SQL skills than Python skills.
For example, more SQL skills than the machine learning skills and bring those people and the organization into the mix for working with the prediction capability that they might otherwise have to go to, let’s say more senior data scientists on the team go for that learning curve and so forth and so on.
So just jumping back over. Here. I think we hit this one. Let me get to the slideshow over here. So we kind of covered those two aspects. Did you hit this one true? Yeah. So this is what I was kind of going to get into the next.
So kind of what we showed were two specific predictions for the MindsDB model and the linear regression model, where we ran a test point for April, I think it was 18th April 18, 2022, using the vendor Barbie.
And what we found were that mine predicted that it would be five orders that day, whereas our linear regression model predicted three. And what we ultimately if you’re thinking about linear regression and just the whole Y equals MX plus B model and the fact that our model as a whole was ultimately lending toward a graph that was linear heading up or heading northeast.
We ultimately had a model that, if you think about it, for each day that passes, we’re going to have greater orders than we did in the previous day. It kind of exposes the flaws of why you might not want to use linear regression in this case and might want to do something more like some sort of time series analysis like Arima, as mentioned earlier.
But again, linear regression was something we chose just for the sake of its simplicity and force, the ability to just get something off the ground and be able to. Show a direct comparison with mines DB.
So ultimately, no, that’s not with the model we would choose if we had a lot more time. But for the sake of demonstration today, I think it worked all right. So but that is kind of one comparison that we would keep in mind.
But can you go to the next slide, Christian? So I see. I guess maybe a question in the chat about minimum sample size. And that’s a great question as well. We used actually a very small data set. In this case, we might have been pushing on three figures for our training data set, which, again, for the sake of running larger scale machine learning models, isn’t going to do much good.
But for the sake of demonstration, I think it worked out all right. So to answer the question, no, MindsDB doesn’t actually need a minimum sample size to be able to run, but for the sake of accuracy, it always helps to have a bigger sample size.
So what we found was it was probably overfitting the model in a lot of cases based on that sample size and something you want to keep in mind as you build out any specific or any certain machine learning model.
But we kind of talked about. Maybe the next steps we would do with respect to that, the linear regression model is do a bit more in terms of getting that bigger sample size so that we could direct more resources or more that sample size to be training and testing as opposed to just training data.
And we definitely want to do something with respect to a moving average or an auto regressive model, something where we could take into account time series capability because I think that’s especially critical when you’re using a date as an independent variable to be able to predict anything.
So ultimately this was for demonstration purposes, but the whole idea was with respect to what we did was kind of just give a nice piece of background on what could be done in terms of inventory management.
If you’re able to predict the number of orders that you make in a specific day based on a specific vendor or based on any other certain given descriptive attributes, you’re able to do a lot to know how much of what thing you would need to have in supply at a certain given time.
And ultimately that’s going to help avoid major issues like a lot of what we’ve seen in the last two years, supply chain issues, anything like that. And granted, now Mindcb and our linear regression model aren’t going to be able to take into account a global pandemic or anything like that, but.
Just on a day to day basis based on a period of historical data, you can do a pretty good job of making sure that you keep supply up and know when your biggest sales spikes are likely to occur. Ultimately, these two, and specifically MindsDB, is a great tool to be able to leverage machine learning with kind of a certain level of simplicity and being able to relate it to something that we all really haven’t.
I think everybody in this meetup probably has a pretty good idea of how to use SQL. And to be able to almost do machine learning with SQL in mind or with SQL as your tool is a pretty cool capability.
You might go into the next step. Christian or next slide. Sure. Let me see here earlier, the slide or I think that might have been it. Yeah. Heather, were you going to do wrapup? Hey. Sorry. Yeah, I can if you want to.
It doesn’t matter. All right. So we’re AIC G. We’re the creators of Data Lakehouse. More than just an ELT tool, it requires no code. And as you saw, we were able to get that shopify data into Snowflake relatively quickly.
Right now, we have no bad data program going on, a beta program going on. Sorry. And. You’ll get some cool swag and access to DataLakeHouse.io, where we have a data catalog and a bunch of other cool features.
And if you’re interested in that, you can go to DataLakeHouse.io Beta, and I can put that in the chat for you guys if anybody is interested. Also, that’s what your form will look like when you get on the site and you just fill out that quick form.
We’ll send you some next steps to your inbox. And then on the next slide, I think we got our Nobad Data mascot guy, which is just our little guy, to show that we don’t believe there’s such thing as bad data, just bad processes.
And if you can polish up your processes, you can do a lot with your data. So that is the end of the meet up, and we’re just going to take some questions. If anybody has any additional questions, go ahead and ask those.
I have a question about Data lake house dot IO. I’m just curious as to what are the main use cases? I’m not sure if I totally understood what are the main use cases for when we might want to use that tool.
Yeah, so it’s really a data synchronization tool. So if you’re trying to do, like, change data capture, you’re just trying to get a data set ready for your business users or any of your data pipeline or data flows, particularly in a Snowflake.
It’s a very powerful tool for doing that, setting up on any type of frequency solve in a lot of those use cases. And it tries to be more business value driven. Versus just trying to have 10,000 different types of connectors that really aren’t paid attention to over time.
Then there’s other features in there, like Heather mentioned, there’s a data catalog for collaborating and adding descriptions on the data attributes, columns, tables, what they mean, tagging them. It actually integrates with Snowflake object tagging feature.
And then there’s some other Snowflake Administration features there like creating new user monitoring, certain workloads and things like that. So hopefully that answers that question. Yes, thank you.
So we’ve got a decent number of customers in there right now from our private beta, and so we’ve been expanding our private beta into more of an open beta. And so looking for any companies that are particularly on Snowflake where they have workloads, where they might be doing something for some other potential use case or development tools such as with like Airflow or maybe they’re using a different tool like SSIs or Oracle Data integrator where there might be some operational applications.
And then like Heather said, we’ve got some swag and then we’ll be able to distribute out for those beta companies. Any other questions, guys? Well, we’ll be posting this code out to our GitHub repository.
When that’s done and the videos posted on YouTube in case somebody wants to revisit it. But yeah, definitely recommend you guys check out both Daylightcast IO, but also minds DB. We think they’re I think the company is only like three years old.
And so I think it’s a very, very promising company. As a matter of fact, mine’s, DB is actually backed by the company, one of the companies that also backed MySQL and like, MariaDB, if you guys are familiar with those databases.
And so, yeah, we’re looking forward to what they’re going to really do with that platform here in the next few months to years. It’s definitely one to watch. Well, thanks, everyone. I think we can probably wrap up there and we’ll see you guys on the next one.
If there’s any topics you guys want to see, we’ve got an awesome team over here where we do a lot of ML, we do a lot of Snowflake, we do a lot around data.
So anything you guys are curious about, drop us a line over there on the Meetup channel or YouTube channel. And we’d love to put something together that really fits a request from the group. Thanks everyone for for hanging out with us tonight!