Examples:
- When building a data mart made up of dimension and fact tables, you can usually load the dimension tables in parallel since they are independent of each other. Once they are loaded, the fact table can be populated with foreign keys pointing to records in the dimension tables (see the sketch after this list).
- Fact tables are generally heavily indexed, so with large record counts each insert can take a long time. In Pentaho Kettle, we can run multiple copies of the Insert/Update step so that several copies process rows at the same time. The benefit is most visible with moving fact tables (i.e. you need to update fact records instead of truncating the table and starting from scratch).
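To make the first example concrete, here is a minimal sketch of that ordering in plain Java/JDBC (in Kettle this would be parallel job entries rather than threads). The connection URL, table names, and SQL statements are all hypothetical stand-ins.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DimensionsThenFactSketch {
    static final String URL = "jdbc:postgresql://localhost/mart";  // assumed connection

    public static void main(String[] args) throws Exception {
        // Dimension loads are independent of each other, so they can run in parallel.
        List<String> dimensionLoads = List.of(
                "INSERT INTO dim_customer SELECT customer_id, name FROM stg_customer",
                "INSERT INTO dim_product  SELECT product_id, title FROM stg_product");

        ExecutorService pool = Executors.newFixedThreadPool(dimensionLoads.size());
        for (String sql : dimensionLoads) {
            pool.submit(() -> runSql(sql));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);  // wait for every dimension load

        // Only now load the fact table, so its foreign keys resolve to dimension rows.
        runSql("INSERT INTO fact_sales "
                + "SELECT s.sale_id, c.customer_key, p.product_key, s.amount "
                + "FROM stg_sales s "
                + "JOIN dim_customer c ON s.customer_id = c.customer_id "
                + "JOIN dim_product p ON s.product_id = p.product_id");
    }

    static void runSql(String sql) {
        try (Connection c = DriverManager.getConnection(URL);
             Statement s = c.createStatement()) {
            s.executeUpdate(sql);
        } catch (Exception e) {
            throw new RuntimeException(sql + " failed", e);
        }
    }
}
```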
The Generic Parallel kjb is a driver job that initializes the data; every piece of data it produces should be persisted in the database. It also calculates how many jobs need to run in parallel. Make sure your system can handle that many concurrent jobs; otherwise, enforce an upper limit.
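The upper limit can be as simple as capping the computed chunk count. The row count, chunk size, and limit below are hypothetical; in the real driver job they would come from the initialization step.

```java
public class ChunkSizing {
    public static void main(String[] args) {
        long totalRows = 12_000_000L;        // assumed: measured during initialization
        long rowsPerChunk = 1_000_000L;      // assumed: tuning parameter
        int maxParallelJobs = 8;             // assumed: what this server can handle

        int chunks = (int) Math.ceil(totalRows / (double) rowsPerChunk);  // 12 chunks
        int parallelJobs = Math.min(chunks, maxParallelJobs);             // capped at 8
        System.out.println(chunks + " chunks, " + parallelJobs + " run at a time");
    }
}
```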
Parallel By Thread is the ktr that spawns all the jobs and waits for them to finish. Once all copies of the Job Executor step finish, the Generic Parallel kjb can check the status of all the jobs and take the necessary action.
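As a rough approximation of what those Job Executor copies do, here is a sketch using the Kettle Java API: launch one copy of the child job per chunk, wait for all of them, then inspect the results. The single_job.kjb path and the CHUNK_ID variable are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class ParallelByThreadSketch {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        int parallelJobs = 8;  // assumed: calculated by the Generic Parallel driver job

        // Spawn one copy of the child job per chunk, much like the Job Executor copies.
        List<Job> jobs = new ArrayList<>();
        for (int chunk = 0; chunk < parallelJobs; chunk++) {
            JobMeta jobMeta = new JobMeta("single_job.kjb", null);  // hypothetical child job
            Job job = new Job(null, jobMeta);
            job.setVariable("CHUNK_ID", String.valueOf(chunk));     // hypothetical variable
            job.start();
            jobs.add(job);
        }

        // Wait for every copy, then let the driver inspect the results.
        for (Job job : jobs) {
            job.waitUntilFinished();
            Result result = job.getResult();
            System.out.println("chunk " + job.getVariable("CHUNK_ID")
                    + " errors=" + result.getNrErrors());
        }
    }
}
```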
Figure 1
Running long-running jobs such as huge data movements on the Kettle engine raises other questions, such as what happens if one of the jobs fails: do we need to restart the entire process, or can we restart just the failed SingleJob? The good thing is that process synchronization is left to the developer, so we can build jobs that rerun only the failed pieces and merge the results back into the parent Generic Parallel job.
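A rough sketch of that idea, assuming the driver persisted which chunks failed (hard-coded here for illustration): only the failed pieces are resubmitted, using the same hypothetical child job and CHUNK_ID variable as above.

```java
import java.util.List;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RerunFailedChunksSketch {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        // Assumed: failed chunk IDs were recorded (e.g. in a status table) by the driver.
        List<Integer> failedChunks = List.of(3, 7);

        for (int chunk : failedChunks) {
            JobMeta jobMeta = new JobMeta("single_job.kjb", null);  // hypothetical child job
            Job job = new Job(null, jobMeta);
            job.setVariable("CHUNK_ID", String.valueOf(chunk));     // hypothetical variable
            job.start();
            job.waitUntilFinished();                                // rerun just this piece
            System.out.println("chunk " + chunk + " rerun errors="
                    + job.getResult().getNrErrors());
        }
    }
}
```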
I hope you find this post helpful!
-maz