Examples:
- When building a data mart that consists of dimension and fact tables, you can often load the dimension tables in parallel since they are not related to each other. Once they are populated, the fact table can be loaded with foreign keys pointing to records in the dimension tables (a shell sketch of this pattern follows the list).
- Fact tables are generally heavily indexed, so for large record counts each insert can take a long time. In Pentaho Kettle, we can run multiple copies of the Insert/Update step so that several copies work at the same time. The benefit is most visible when you have moving fact tables (i.e. you need to update fact records instead of truncating the table and starting from scratch).
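To make the first example concrete, here is a rough shell sketch of the pattern using pan.sh, in the spirit of the command-line example later in this post. The transformation file names (load_dim_*.ktr, load_fact_sales.ktr) are placeholders for your own dimension and fact loads, not actual files from this project.

# Hypothetical example: load independent dimension tables in parallel,
# then load the fact table once all of them have finished.
./pan.sh -file=../jobs/load_dim_customer.ktr -logfile=./dim_customer.log &
./pan.sh -file=../jobs/load_dim_product.ktr -logfile=./dim_product.log &
./pan.sh -file=../jobs/load_dim_date.ktr -logfile=./dim_date.log &

# wait for every background pan.sh process to exit before starting the fact load
wait

# the fact table load depends on all dimension keys being present
./pan.sh -file=../jobs/load_fact_sales.ktr -logfile=./fact_sales.log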
The Generic Parallel kjb is a driver job that initializes the data; all pieces of data should be persisted in the database. It also calculates how many jobs need to run in parallel. Make sure that your system can handle all of the jobs running at once; otherwise, set an upper limit.
Parallel By Thread is the ktr that spawns all of the jobs and waits for them to finish. Once all copies of the Job Executor finish, the Generic Parallel job can check the status of all the jobs and take the necessary action.
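As a rough sketch of how this can be kicked off from the command line: if the driver job exposes the upper limit as a named parameter, a kitchen.sh call along these lines would start it. The file path and the MAX_PARALLEL_JOBS parameter name are assumptions of this sketch, not part of the actual jobs.

# Hypothetical launch of the driver job; the path and parameter name are placeholders.
# -param passes the upper limit into Generic_Parallel.kjb, which can use it to cap
# the job count it calculates from the persisted data.
./kitchen.sh -file=../jobs/Generic_Parallel.kjb \
             -param:MAX_PARALLEL_JOBS=8 \
             -logfile=./generic_parallel.log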
Figure 1
Running long-running jobs such as huge data movements on the Kettle engine raises other questions, such as what happens if one of the jobs fails: do we need to restart the entire process, or can we restart just the failed SingleJob? The good thing is that process synchronization is left to the developer, so we can design the jobs to rerun only the failed pieces and merge the results back into the parent Generic Parallel job.
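For instance, a single failed piece could be re-run on its own with a kitchen.sh call like the one below. The SingleJob.kjb path and the CHUNK_ID parameter are hypothetical and only illustrate the idea of rerunning one slice of the work and letting the driver job merge it back.

# Hypothetical restart of a single failed piece instead of the whole run.
# Each sub-job only processes the slice of data identified by its parameter,
# so a failed slice can be re-run on its own.
./kitchen.sh -file=../jobs/SingleJob.kjb \
             -param:CHUNK_ID=3 \
             -logfile=./single_job_chunk3.log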
I hope you find this post helpful!
-maz
Hi Maz,
The article is very helpful, but I am stuck on an issue.
Sorry, I am new to PDI development. I have an issue running parallel transformations in a job.
I created a PDI job (kjb) with 5 transformations that run in parallel and insert data into target table B from source table A. For small data volumes the job runs fine.
However, when I need to insert millions of rows, I run into out-of-memory issues.
But when I call the same transformations through a shell script as below, I am able to insert up to 300 million rows in parallel with no issues at all.
# main: launch the five transformation instances as background pan.sh processes
./pan.sh -file=../jobs/instance1.ktr -logfile=./instance1.log &
./pan.sh -file=../jobs/instance2.ktr -logfile=./instance2.log &
./pan.sh -file=../jobs/instance3.ktr -logfile=./instance3.log &
./pan.sh -file=../jobs/instance4.ktr -logfile=./instance4.log &
./pan.sh -file=../jobs/instance5.ktr -logfile=./instance5.log &
I am not sure why the PDI job that runs the parallel transformations fails with an out-of-memory error, but the same transformations run fine through the .sh script.
I really want to run this through a PDI job, not through a .sh script. Please let me know where I am going wrong. If memory were the issue, shouldn't both the PDI job and the .sh script
fail? I am puzzled; sorry, I am new to PDI. Please suggest what the reason could be and what I should do to make this work through PDI.