Wednesday, November 11, 2015

Managing parallel jobs in Pentaho Kettle

Executing a business process can take "some time" to complete; more than just waiting for your browser to respond with data stating that the job has finished.  If the process is human-centric, it may take several days or months.  If it is data-centric, it may take seconds to hours.  In either case, it is the job of business analysts or software architects to look at the process as a whole and determine whether parts of it can run in parallel to reduce the run time, or be given more resources to resolve bottlenecks.

Examples:
  • When creating data marts, which consist of dimension and fact tables, you can in many cases create the dimension tables in parallel since they are not related to each other.  Once they are populated, the fact table can be populated with foreign keys pointing to records in the dimension tables.
  • Fact tables are generally heavily indexed, so for large record counts each insert may take a long time.  In Pentaho Kettle, we can run multiple copies of the Insert/Update step at the same time.  The benefit is more visible when you have moving fact tables (i.e., you need to update fact records instead of truncating the table and starting from scratch).
In Pentaho Data Integration, you can run multiple jobs in parallel using the Job Executor step in a transformation.  KTRs allow you to run multiple copies of a step.  You only need to handle process synchronization outside of Pentaho; it is best to use a database table to keep track of the execution of each of the jobs that run in parallel.  Figure 1 shows the pieces that you need to invoke multiple copies of the SingleJob KJB in parallel.
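
For illustration, here is a minimal sketch of such a tracking table, maintained over plain JDBC from Java.  The table name job_execution, its columns, and the status values are assumptions for this example, not anything Kettle provides:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    // Hypothetical tracking table for parallel job runs; all names are assumptions.
    public class JobTracker {
        private final Connection conn;

        public JobTracker(String jdbcUrl, String user, String pass) throws Exception {
            conn = DriverManager.getConnection(jdbcUrl, user, pass);
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS job_execution ("
                         + "  batch_id   BIGINT,"
                         + "  job_name   VARCHAR(255),"
                         + "  status     VARCHAR(20),"   // PENDING / RUNNING / DONE / FAILED
                         + "  updated_at TIMESTAMP)");
            }
        }

        // Record a status change for one parallel job instance.
        public void setStatus(long batchId, String jobName, String status) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE job_execution SET status = ?, updated_at = CURRENT_TIMESTAMP "
                  + "WHERE batch_id = ? AND job_name = ?")) {
                ps.setString(1, status);
                ps.setLong(2, batchId);
                ps.setString(3, jobName);
                ps.executeUpdate();
            }
        }
    }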
The Generic Parallel KJB is a driver job that initializes the data; all pieces of data should be persisted in the database.  It also calculates how many jobs need to run in parallel.  Make sure your system can handle all of those jobs running at once; otherwise, enforce an upper limit.
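
As a sketch of that upper limit, the driver's calculated job count can be capped against a configured maximum (MAX_PARALLEL here is an assumed configuration value, not a Kettle setting):

    // Cap the requested degree of parallelism at what the system is sized for.
    static final int MAX_PARALLEL = 8;  // assumed limit; tune to your environment

    static int parallelCopies(int unitsOfWork) {
        // Never spawn more copies than there are units of work,
        // and never more than the configured ceiling.
        return Math.min(unitsOfWork, MAX_PARALLEL);
    }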
Parallel By Thread is the KTR that spawns all the jobs and waits for them to finish.  Once all copies of the Job Executor step finish, the Generic Parallel KJB can check the status of all the jobs and take the necessary action.  A minimal sketch of this spawn-and-wait pattern appears after Figure 1.

Figure 1
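
To make the spawn-and-wait idea concrete, here is a minimal sketch that drives the same pattern from the Kettle Java API instead of a KTR.  The file name SingleJob.kjb, the copy count, and the COPY_NR variable are assumptions for the example:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.core.Result;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;
    import java.util.ArrayList;
    import java.util.List;

    public class ParallelRunner {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();  // boot the Kettle engine once per JVM

            List<Job> jobs = new ArrayList<>();
            for (int copy = 0; copy < 4; copy++) {  // copy count is an assumption
                JobMeta jobMeta = new JobMeta("SingleJob.kjb", null);  // path is an assumption
                Job job = new Job(null, jobMeta);
                job.setVariable("COPY_NR", String.valueOf(copy));  // lets each copy pick its slice of work
                job.start();  // each Job runs on its own thread
                jobs.add(job);
            }

            // Block until every copy has finished, then inspect each Result.
            for (Job job : jobs) {
                job.waitUntilFinished();
                Result result = job.getResult();
                System.out.println("errors: " + result.getNrErrors());
            }
        }
    }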


Running long-running jobs, such as huge data movements, on the Kettle engine raises other questions, such as: what happens if one of the jobs fails?  Do we need to restart the entire process, or can we restart just the failed SingleJob?  The good thing is that process synchronization is left to the developer, and as developers we can build jobs that rerun the failed pieces and merge the results back into the parent Generic Parallel job.
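
As a sketch of that rerun step, the parent job can query the assumed job_execution table from earlier for failed entries and relaunch only those (the table and column names are the same assumptions as above):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class RestartHelper {
        // Return the names of the SingleJob copies that failed in a batch,
        // so only those are rerun instead of restarting the whole process.
        public static List<String> failedJobs(Connection conn, long batchId) throws Exception {
            List<String> failed = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT job_name FROM job_execution WHERE batch_id = ? AND status = 'FAILED'")) {
                ps.setLong(1, batchId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        failed.add(rs.getString("job_name"));
                    }
                }
            }
            return failed;
        }
    }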

I hope you find this post helpful!

-maz