
CDAP.io: A beloved Data Transformation Tool


Since we first had a chance to work with CDAP, a toolset developed by CASK, we’ve been impressed. Many of our team members have deep expertise in legacy Extract, Transform, and Load (ETL) tools such as SQL Server Integration Services, Informatica, Oracle Data Integrator, and Talend. That experience flattened the learning curve, because we were never starting from zero when it came to building good data integration pipelines. Going forward, we’ll be using CDAP to build most, if not all, of our pre-built analytics and ML solution workflow products. Let me explain why.

First, CDAP is open source

The open source community has taken off over the last 20 years, probably beyond everyone’s expectations. Open source products with great contributors can give developers something close to enterprise-grade support, especially for developer productivity tools. What’s great about CDAP is that it is built with the end-user developer in mind. It has great documentation and a responsive community, and if a feature or connector is missing, you can go build it yourself. That’s powerful stuff when you are coming from locked-in products whose features are difficult to extend because you cannot see or understand the server code working behind the scenes.

Second, CDAP is well designed

The design and GUI are nice. The GUI has some catching up to do compared with tools such as SSIS or Informatica, but that really doesn’t matter, because just about all of the data flow logic and pipelines can be built in code. For example, in Python, a simple transform can be extended with debug logging like so.

def transform(record, emitter, context):
    import sys

    # Debug write of record components to cdap.log file
    sys.stdout.write("XXXoffset: %i\n" % record['offset'])
    sys.stdout.write("XXXbody: %s\n" % record['body'])

    # now write the unmodified record object to the next pipeline stage
    emitter.emit(record)
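
The same hook can also change a record before passing it on. Below is a minimal sketch, assuming the record behaves like a Python dict with the 'offset' and 'body' fields used above. The ListEmitter class is a hypothetical stand-in we made up so the function can be tried outside of CDAP; it is not part of the product. The 'body_length' field is likewise only an illustration.

```python
class ListEmitter(object):
    """Hypothetical stand-in for the emitter that CDAP passes to transform().

    It only collects emitted records in a list so the function can be
    exercised locally. It is not a CDAP class.
    """

    def __init__(self):
        self.records = []

    def emit(self, record):
        self.records.append(record)


def transform(record, emitter, context):
    # Work on a copy so the incoming record is left untouched.
    out = dict(record)

    # Clean up the text field: trim whitespace and upper-case it.
    out['body'] = record['body'].strip().upper()

    # Illustrative extra field (an assumption, not something CDAP requires).
    out['body_length'] = len(out['body'])

    # Hand the modified record to the next pipeline stage.
    emitter.emit(out)


# Local try-out with a made-up record; context is unused here, so None is fine.
emitter = ListEmitter()
transform({'offset': 0, 'body': '  hello cdap \n'}, emitter, None)
print(emitter.records)
```

Inside a real pipeline only the transform function would be pasted into the stage, and the stage’s output schema would presumably need to declare any new field such as 'body_length'.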

Third, CDAP is set to include and improve on Wrangling and its Connector ecosystem

The plugin extension system is currently Java based, which fits nicely into our core skillsets here at Ai Consulting Group. But what is really going to drive adoption of CDAP is the inclusion of its Wrangling tool and better documentation of the plugin connector ecosystem. As more data source connectors for SaaS vendors and cloud platforms become a default part of CDAP, we will see the solution used more widely. That’s great for us. I’m hoping to see training and a certification process come along next year, as all good open source tools should have, to really differentiate practitioners by their skill with the product.


CDAP is really a contender, and I see big things happening for this product going forward. Keep an eye out for products we are releasing soon, which will be packaged with, and extensible through, CDAP as part of our backbone.
