
Data Lakes with a Purpose: What are Delta Lake and Kylo?


Kylo and Delta Lake are two open source platforms that provide management capabilities for building out your own data lake solution. While both are open source, they differ in ways that can make one a better fit for an existing architecture or for a planned larger architecture.

Delta Lake – delta.io

The Delta Lake (delta.io) project is open source. Its main contributor appears to be Databricks, though it is actually maintained under the Linux Foundation, making it a serious OSS player. It is an extensible system built to handle big data workloads at scale, with ACID transaction guarantees. As currently built, it ingests data through its API, stores data in object storage, and allows query output over JDBC. Storage options for reads and writes include HDFS, AWS S3, and Azure Storage (Blob, etc.). Some might argue that Delta Lake previously seemed more like a decoupled database architecture, but that is a near-sighted view of the larger feature set Delta Lake provides, such as Time Travel, which lets you query earlier versions of a table and is useful for those undertaking AI/ML initiatives.
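To make the Time Travel idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Delta Lake's implementation: the class `ToyVersionedTable` and its `commit` and `read` methods are hypothetical names invented for illustration. It assumes only the general idea that a table is an append-only log of commits and that an older version can be rebuilt by replaying the log up to that commit.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionedTable:
    """Toy in-memory table with an append-only commit log (not Delta Lake itself)."""

    # Each commit stores (rows added, keys removed), kept in commit order.
    _log: list = field(default_factory=list)

    def commit(self, add=None, remove=None):
        """Append one commit and return its 0-based version number."""
        self._log.append((dict(add or {}), set(remove or ())))
        return len(self._log) - 1

    def read(self, version=None):
        """Rebuild the table as of `version` by replaying the log up to it."""
        if not self._log:
            return {}
        latest = len(self._log) - 1
        if version is None:
            version = latest
        if not 0 <= version <= latest:
            raise ValueError(f"version {version} not in 0..{latest}")
        rows = {}
        for added, removed in self._log[: version + 1]:
            for key in removed:
                rows.pop(key, None)  # apply removals first
            rows.update(added)       # then apply additions and overwrites
        return rows


table = ToyVersionedTable()
v0 = table.commit(add={"u1": 0.91, "u2": 0.15})      # initial labels
v1 = table.commit(add={"u3": 0.47}, remove=["u2"])   # later correction

training_snapshot = table.read(version=v0)  # data as it was at version 0
current = table.read()                      # latest state of the table
```

For an ML initiative, the point is reproducibility: a model trained against version 0 can be re-evaluated against exactly the same rows even after the table has changed. With Delta Lake on Apache Spark, the comparable read is typically written as `spark.read.format("delta").option("versionAsOf", 0).load(path)`, where `path` is wherever your table lives.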

Our take on Delta Lake is that we are definitely fans. It has some shortcomings compared with larger platform architectures, such as a data lake built in Google Cloud Platform or AWS, which offer direct connectivity and extensibility within that architecture. But for a small team getting off the ground that needs an immediate solution for large data volumes with snapshotting and similar needs, such as a small ML initiative, it may be a good self-contained option.

Kylo – Kylo.io

Kylo seeks to be a much broader end-to-end data lake management software solution. It was originally developed and released by Teradata, the well-known data warehousing and analytics company. It has many great features that we’ve implemented with success, though not every organization uses all of them. The ones put to immediate use are ingest, streaming, wrangling, searching, and output. Kylo seems to be going strong, with most of its core code written in Java, while Delta Lake is mainly written in Scala, which is arguably faster for development lifecycles.

Kylo offers pre-built sandbox environments for download. Getting started has a bit of a learning curve when installing in a standalone environment, but once it is up and running you can immediately see the value in the myriad of options, such as search and monitoring, that you can incorporate and implement.

So, What’s the Verdict?

Both are compelling options and give an open source alternative to building it yourself, using a cloud platform, or adopting Zaloni, Upsolver, and some others. A key focus of these systems is fast, reliable self-service. As we continue to teach, a key to any data lake system is a defined purpose, set in the context of the bigger picture of how the data lake will provide enterprise value and give data access to the users who need it to create value. Both are good options, but if you are looking to build something for the enterprise, we recommend bringing in a professional to design the solution and hash out the purpose before selecting and/or going forward with either platform. A green light for either is possible, but be sure it’s the right fit for what you’re doing.
