Data Lakes with a Purpose: What are Delta Lake and Kylo?

Kylo and Delta Lake are two open source platforms that provide management capabilities for building out your own data lake solution. Although both are open source, they differ in ways that may make one a better fit for an existing architecture or for a larger planned architecture.

Delta Lake – delta.io

The Delta Lake (delta.io) project is open source. Its main contributor appears to be Databricks, though it is actually maintained under the Linux Foundation, which makes it a serious OSS player. It is an extensible system that runs at scale, provides ACID transactions, and is built to handle big data workloads. As built, it currently ingests data through its API, stores data in object storage, and allows query output through the Delta JDBC driver. Supported storage for reads and writes includes HDFS, AWS S3, and Azure Storage (Blob, etc.). Some might argue that Delta Lake previously looked more like a decoupled database architecture, but that may be a near-sighted view of the larger feature set Delta Lake provides, such as Time Travel, which is a nice feature for those undertaking AI/ML initiatives.
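To make the Time Travel idea concrete, here is a minimal sketch in plain Python. It is a toy model and not the Delta Lake API: the class `ToyVersionedTable` and its `commit` and `read` methods are hypothetical names invented for illustration. The assumption it illustrates is that every change is recorded as an append-only commit, so an older snapshot can be rebuilt by replaying the commits up to a chosen version. In Delta Lake with Spark, the equivalent read is requested through the `versionAsOf` or `timestampAsOf` reader options.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionedTable:
    """Toy append-only commit log (hypothetical; NOT the Delta Lake API)."""

    _commits: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Record one atomic change and return its version number."""
        self._commits.append((tuple(added), frozenset(removed)))
        return len(self._commits) - 1

    def read(self, version_as_of=None):
        """Rebuild a snapshot by replaying commits up to the given version."""
        if not self._commits:
            return []
        latest = len(self._commits) - 1
        version = latest if version_as_of is None else version_as_of
        if not 0 <= version <= latest:
            raise ValueError(f"version {version} does not exist")
        rows = []
        for added, removed in self._commits[: version + 1]:
            # Apply removals first, then additions, for each commit in order.
            rows = [r for r in rows if r not in removed]
            rows.extend(added)
        return rows


table = ToyVersionedTable()
v0 = table.commit(added=["alice", "bob"])
v1 = table.commit(added=["carol"], removed=["bob"])

print(table.read())                  # latest snapshot: ['alice', 'carol']
print(table.read(version_as_of=v0))  # first commit only: ['alice', 'bob']
```

For an ML team, the practical appeal is reproducibility: a training run can record the version it read, and the same snapshot can be read again later even after the table has changed.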

Our take on Delta Lake is that we are definitely fans. It has some shortcomings compared with larger platform architectures, such as a data lake built in Google Cloud Platform or AWS, which offer direct connectivity and extensibility across the rest of the platform. But for a small team getting off the ground that needs an immediate solution for large data volumes with snapshotting and similar features, such as a small ML initiative, this may be a good self-contained option.

Kylo – Kylo.io

Kylo seeks to be a much broader end-to-end data lake management solution. It was originally developed and released by Teradata, the well-known data warehousing company. It has many great features that we have implemented with success, though not every organization uses all of them. The most immediately useful are ingest, streaming, wrangling, searching, and output. Kylo seems to be going strong, with most of its core code written in Java, while Delta Lake is mainly written in Scala, which is arguably faster for development lifecycles.

Kylo offers some pre-built sandbox environments for download. Getting started has a bit of a learning curve when installing in a standalone environment, but once it is up and running you can immediately see the value in the myriad of options. The remaining work is incorporating and implementing those options, such as search and monitoring.

So, What’s the Verdict?

Both are compelling options and give an open source alternative to building it yourself, using a cloud platform, or adopting Zaloni, Upsolver, and some others. A key focus of these systems is fast, reliable self-service. As we continue to stress, the key to any data lake system is a defined purpose, set in the context of the bigger picture: how the data lake will provide enterprise value and give data access to the users who need it to create value. Both are good options, but if you are building something for the enterprise, we recommend bringing in a professional to design the solution and hash out the purpose before selecting or going forward with either platform. A green light for either is possible, but be sure it's the right fit for what you're doing.
