Jennifer Pip

CDAP.io: A Beloved Data Transformation Tool

Ever since we had a chance to work with CDAP, a toolset developed by Cask, we’ve been impressed. Many of our team members have deep expertise in legacy Extract, Transform, and Load (ETL) tools such as SQL Server Integration Services (SSIS), Informatica, Oracle Data Integrator, and Talend. That background shortened the learning curve considerably: we were never starting from zero, because we already knew how to build good data integration pipelines. We’ll be using CDAP to build most, if not all, of our pre-built analytics and ML solution workflow products going forward. Let me explain why.

First, CDAP is open source

The open source community has taken off over the last 20 years, probably beyond everyone’s expectations. Open source products with great contributors can feel almost as if they come with enterprise vendor support, especially developer productivity tools. What’s great about CDAP is that it is built with the end-user developer in mind. It has great documentation and a responsive community, and if a feature or connector is missing, you can go build it yourself. That’s powerful stuff when you are coming from locked-in products that are difficult to extend because you cannot see or understand the server code working behind the scenes.

Second, CDAP is well designed

The design and GUI are nice, although the GUI has some catching up to do with tools such as SSIS or Informatica. That really doesn’t matter, though, because just about all of the data flow logic and pipelines can be built in code. For example, in Python, a simple transform can be extended like so.

def transform(record, emitter, context):
    import sys

    # Debug write of record components to cdap.log file
    sys.stdout.write("XXXoffset: %i\n" % record['offset'])
    sys.stdout.write("XXXbody: %s\n" % record['body'])

    # Now write the unmodified record object to the next pipeline stage
    emitter.emit(record)
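
To try this kind of transform outside a pipeline, here is a minimal sketch that runs on a plain Python interpreter. It assumes the record behaves like a dictionary, as the indexing above suggests. ListEmitter is a hypothetical stand-in written for this example, not a CDAP class, and the body_length field is an invented illustration of "extending" the transform.

```python
import sys


class ListEmitter:
    """Hypothetical stand-in for the pipeline's emitter; it only collects records."""

    def __init__(self):
        self.records = []

    def emit(self, record):
        self.records.append(record)


def transform(record, emitter, context):
    # Same debug writes as the transform above
    sys.stdout.write("XXXoffset: %i\n" % record['offset'])
    sys.stdout.write("XXXbody: %s\n" % record['body'])

    # Illustrative extension: add a derived field (assumed name) before emitting
    record['body_length'] = len(record['body'])
    emitter.emit(record)


# Drive the transform by hand with one sample record; context is unused here
emitter = ListEmitter()
transform({'offset': 0, 'body': 'hello'}, emitter, context=None)
```

After the call, emitter.records holds the single enriched record. In a real pipeline, the stage's output schema would presumably also need to declare any field added this way.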

Third, CDAP is set to include and improve its Wrangling tool and connector ecosystem

The plugin extension system is currently Java-based, which fits nicely with our core skillsets here at Ai Consulting Group. But what is really going to drive adoption of CDAP is the inclusion of its Wrangling tool and better documentation of the plugin connector ecosystem. As more data source connectors for SaaS vendors and cloud platforms become a default part of CDAP, we will see this solution used more widely. That’s great for us. I’m hoping to see training and a certification process come along next year, as all good open source tools should have, so that practitioners can demonstrate and differentiate their skill with the product.

Conclusion

CDAP is a real contender. I see big things happening for this product going forward. Keep an eye out for products we are releasing soon, which will be packaged with CDAP and extensible through it as part of our backbone.
