Tuesday, July 21, 2015

Versioning Pentaho kettle content in Source Control Repositories

Most IT projects use a source control repository to manage and maintain versions of the source code or artifacts that get deployed on the server.  Data Integration projects with PDI (Pentaho Data Integration) are no different.

What's the best way to develop KTRs and KJBs with PDI considering that PDI requires you to save in the Enterprise Repository every time you modify the file?
 
The Enterprise Data Integration (DI) server comes with a repository based on Apache Jackrabbit and a scheduler (Quartz).  Through the DI scheduler, organizations can schedule jobs and transformations.  Another benefit of using the DI server and repository is that the DI server can act as a data access service provider.  This is a very powerful feature of the DI server.  The Jackrabbit-based repository provides versioning, but it is not meant to be used as a source control repository.

Let's go through a typical development life cycle.

  1. A developer starts Spoon on their local laptop.
  2. Since an integration project consists of multiple KTRs and KJBs that are all inter-connected, they need to know each other's paths in the repository.  Therefore, the developer connects to the repository, and all KTRs/KJBs are saved there, with their paths relative to the repository.
  3. Developers modify KTRs and test them frequently.  Each time a developer wants to test a KTR, they must save it in the repository, even though the KTR is being developed and tested in Spoon.
  4. Every time a KTR is saved, it is saved as a new version (a full copy of the file) in the Jackrabbit repository.  This makes the repository much larger than it needs to be, since most versions in the repository will never be executed on the DI server.
  5. DI administrators must periodically use the Pentaho-provided purge-utility.sh/bat to clean up the repository by removing old versions of the KTRs and KJBs.
  6. Once a KTR is finally tested and approved, it is scheduled on the DI server and gets executed periodically.

The DI repository is not the best place to version KTRs or KJBs.  It is not designed to be a source control system (SCS).  Therefore, the development team should think about using a traditional SCS outside of Jackrabbit, such as CVS or SVN.

Traditional SCSs work on your file system.  All files live in a directory structure (workspace) on the developer's laptop and are checked in using the SCS's utilities.  But if a developer chooses to use Spoon directly on the local file system, the references between the KTRs and KJBs are relative to the file system, and the files can't be checked in to the DI repository as they are.
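To make the problem a bit more concrete, here is a minimal Python sketch that looks for file-system-bound references inside KTR and KJB files.  It is only a plain text scan based on the assumption that such references show up as Windows drive-letter paths or file: URLs; the function name find_file_system_references is made up for this example and is not part of any Pentaho tool.

```python
import re
from pathlib import Path

# Assumption: a file-system-bound reference looks like a drive-letter path
# (C:\... or C:/...) or a file: URL. The lookbehind keeps "http://" from matching.
ABSOLUTE_PATH = re.compile(r"(?<![A-Za-z])[A-Za-z]:[\\/]|file:/")

KETTLE_SUFFIXES = {".ktr", ".kjb"}


def find_file_system_references(root):
    """Return (file name, line number, line text) for every suspicious line."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in KETTLE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if ABSOLUTE_PATH.search(line):
                hits.append((path.name, number, line.strip()))
    return hits
```

Any line this reports is a candidate for a reference that will not resolve once the file lives somewhere other than the developer's laptop.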

Now that I've provided a very, VERY LONG explanation on the problem.......let's discuss the solution!

SOLUTION:

Most people don't know that Spoon supports 3 different types of repositories.
  1. DI Repository: the traditional repository, which is accessed through the DI server and stores files in Apache Jackrabbit.
  2. Kettle DB repository: a relational database is used to store the KTRs and KJBs.
  3. Kettle file repository: a local folder is used to store the KTRs and KJBs.
For simplicity, developers should create a directory on their laptop and define it as the repository for Spoon.  In the example below, C:\DIWorkspace is used as the repository root folder.


How to define a File Repository
The same folder can be the root folder of the SCS's workspace, where its metadata files are created.  It is important for the developer to create the same directory tree structure that exists on the DI repository.  By default, /home and /public exist on the DI repository.  Other folders can be created as well, and developers can store their content in those folders.

Mimic the directory structure on the DI server
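If you prefer to script this instead of creating the folders by hand, a minimal Python sketch could look like the one below.  The function names create_workspace and to_repository_path are made up for this example; the only facts taken from this post are the C:\DIWorkspace root and the default /home and /public folders.

```python
from pathlib import Path

# /home and /public exist on the DI repository by default (see above).
DEFAULT_FOLDERS = ["home", "public"]


def create_workspace(root, extra_folders=()):
    """Create the workspace root plus the folders that mirror the DI repository."""
    root = Path(root)
    for name in [*DEFAULT_FOLDERS, *extra_folders]:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def to_repository_path(root, local_file):
    """Translate a file inside the workspace to its path on the DI repository."""
    relative = Path(local_file).relative_to(Path(root))
    return "/" + relative.as_posix()
```

For example, create_workspace(r"C:\DIWorkspace", ["public/sales"]) would create the root with home, public and a hypothetical public/sales folder, and to_repository_path would then map a file under public/sales to /public/sales/<file name>.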
The development cycle now becomes:
  1. A developer starts Spoon on their local laptop.
  2. The developer connects to the Kettle File Repository, whose directory structure mimics what is on the DI repository.
  3. Developers create and modify KTRs and test them frequently.  As they save KTRs, the files only get saved to the local file system, so there is no versioning.
  4. When the developer is satisfied with the KTRs and KJBs, they use the SCS's utilities to check in the files.
  5. The DI administrator checks out the KTRs and KJBs with full path information and imports them into the DI Repository.
  6. Using Spoon, the DI administrator can now connect to the DI server repository and schedule jobs, knowing that the relationships between the KTRs and KJBs are preserved.
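As a rough illustration of the check-in in step 4, here is a minimal Python sketch.  It assumes Git as the SCS and assumes the workspace folder is already a Git working copy; any other SCS would use its own commands.  The function names are made up for this example.

```python
import subprocess
from pathlib import Path

KETTLE_SUFFIXES = {".ktr", ".kjb"}


def find_kettle_files(root):
    """List all KTR/KJB files under the workspace, relative to its root."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in KETTLE_SUFFIXES
    )


def build_checkin_commands(root, message):
    """Build the Git commands for a check-in without running them."""
    files = find_kettle_files(root)
    if not files:
        return []
    return [
        ["git", "add", "--", *files],
        ["git", "commit", "-m", message],
    ]


def check_in(root, message):
    """Run the commands inside the workspace (assumed to be a Git working copy)."""
    for command in build_checkin_commands(root, message):
        subprocess.run(command, cwd=root, check=True)
```

Because the paths are kept relative to the workspace root, the checked-in tree keeps the same folder layout that the DI administrator later imports.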
I hope you find this posting helpful.
FYI:  THE FINE PRINT!!!! FILE REPOSITORY IS NOT SUPPORTED BY PENTAHO's SUPPORT team.

-maz
many thanks to Kevin Hanrahan

8 comments:

  1. I am not getting step 4. Where are the SCS' utilities to check in the file?
    Please update

    Replies
    1. Every source control system has its own set of tools and commands to check in/out or update files in the repo. These are not tied to any Pentaho product. For example, you can use GitHub. I hope this answers your question.

  3. Useful information on Pentaho kettle. Get some more details on Pentaho Consulting at Pentaho Consulting

    Replies
    1. .... better yet, Get "the real" Pentaho consulting from https://www.hitachivantara.com/en-us/services/training-certification.html

  4. Nice article on Pentaho Kettle and Useful Information for the people who wants to know more about Pentaho. Get some more details on Pentaho Consulting and reach us, if you are interested for a free demo.

    Replies
    1. .... better yet, Get "the real" Pentaho consulting from https://www.hitachivantara.com/en-us/services/training-certification.html
