Wednesday, November 11, 2015

Managing parallel jobs in Pentaho Kettle

Executing a business process can take "some time" to complete.  Let's just say it's more than waiting for your browser to respond with data stating that the job has finished.  If the process is human centric, it may take days or months.  If it is data centric, it may take seconds to hours.  In either case, it is the job of business analysts or software architects to look at the process as a whole and determine whether parts of it can run in parallel to reduce the run time, or be given more resources to resolve bottlenecks.

Examples:
  • When creating data marts, which consist of dimension and fact tables, you can often load the dimension tables in parallel since they are not related to each other.  Once they are loaded, the fact table can be populated with foreign keys pointing to records in the dimension tables.
  • Fact tables are generally heavily indexed.  For large record counts, each insert may take a long time.  In Pentaho Kettle, we can run multiple copies of the Insert/Update step so that several copies work at the same time.  The benefits are more visible when you have moving fact tables (e.g. when you need to update fact records instead of truncating the table and starting from scratch).
In Pentaho Data Integrator, you can run multiple jobs in parallel using the Job Executor step in a transformation.  KTRs allow you to run multiple copies of a step, so you only need to handle process synchronization outside of Pentaho.  It is best to use a database table to keep track of the execution of each job that runs in parallel.  Figure 1 shows the pieces you need to invoke multiple copies of the SingleJob KJB in parallel.
The Generic Parallel KJB is a driver job that initializes the data; all pieces of data should be persisted in the database.  It also calculates how many jobs need to run in parallel.  Make sure that your system can handle that many concurrent jobs; otherwise, enforce an upper limit.
Parallel By Thread is the KTR that spawns all the jobs and waits for them to finish.  Once all copies of the Job Executor finish, the Generic Parallel KJB can check the status of all the jobs and take the necessary action.

Figure 1
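Here is a minimal sketch of such a tracking table (all table and column names are purely illustrative, not part of any Pentaho schema).  Each copy of SingleJob updates its own row as it starts, finishes, or fails:

  -- Hypothetical status table; adjust data types to your database.
  CREATE TABLE job_execution_status (
      batch_id      BIGINT        NOT NULL,  -- one batch per Generic Parallel run
      job_name      VARCHAR(255)  NOT NULL,  -- which copy of SingleJob this row tracks
      status        VARCHAR(20)   NOT NULL,  -- QUEUED / RUNNING / SUCCESS / FAILED
      started_at    TIMESTAMP     NULL,
      finished_at   TIMESTAMP     NULL,
      error_message VARCHAR(4000) NULL,
      PRIMARY KEY (batch_id, job_name)
  );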


Running long-running jobs, such as huge data movements, on the Kettle engine raises other questions, such as: what happens if one of the jobs fails?  Do we need to restart the entire process, or can we restart just the failed SingleJob?  The good thing is that process synchronization is left to the developer, so we can build jobs that rerun the failed pieces and merge the results back into the parent Generic Parallel job.
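Assuming a tracking table like the one sketched above, the driver job can decide what to rerun with a simple query (again, the names are purely illustrative):

  -- Find the pieces of a batch that failed and need to be rerun.
  SELECT job_name, error_message
  FROM   job_execution_status
  WHERE  batch_id = 42           -- the batch created by the Generic Parallel run
    AND  status = 'FAILED';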

I hope you find this post helpful!

-maz

Wednesday, July 22, 2015

To Columnar or Not to Columnar with Pentaho's Mondrian.....What is your decision?

When developing a report or a Mondrian schema in Pentaho, should you use a columnar database or not?  Let's discuss it!!

Lots of college grads venture off to New York City and dream of making it big in a city so expensive that even millionaires have a hard time balancing their budget.  It's really easy to "recommend" that they eat Filet Mignon at Peter Luger's or visit Morimoto's for an exceptional Japanese meal, EVERY NIGHT!!!  Hey, if they share a studio with 3 other people, one can assume they are saving enough money.  If sustenance is important, there are lots of delis and good supermarkets with fresh products that they feel very comfortable with and that don't require an expensive cab ride across rivers and oceans to get to their destinations.... Peter Luger's is in Brooklyn ;)

In the world of Business Intelligence (BI), columnar databases are the Filet Mignon at Peter Luger's or the Omakase at Morimoto's.  The food is perfectly done and set in front of you; you just have to eat it and reap its benefits.  But let's be honest, how many people can eat like that every night?

The question is: is it worth it to use a columnar database like Infobright, Redshift, or Vertica with Pentaho's Mondrian engine, or should you stick with open source products like MySQL or Postgres?

Pentaho's Mondrian engine requires a SQL-based database.  Yes, version 4 supports MongoDB as well, but let's not open that can of warm soup.  Pentaho Analyzer is the GUI front end to the open source OLAP engine.  As users bring levels and measures into the pivot table, Mondrian generates SQL queries in the backend and executes them on the database.  These are generally aggregate queries (SUM, COUNT, AVG, etc.) that work on a specific column (e.g. total_cost) of the result set based on certain criteria and group the results by the levels chosen.
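For a sense of what these queries look like, here is a hand-written approximation (the table and column names are invented for illustration; the SQL Mondrian actually emits is more verbose):

  SELECT d.region,
         t.the_year,
         SUM(f.total_cost) AS total_cost
  FROM   fact_sales f
         JOIN dim_customer d ON f.customer_key = d.customer_key
         JOIN dim_time     t ON f.time_key     = t.time_key
  WHERE  t.the_year IN (2014, 2015)
  GROUP  BY d.region, t.the_year;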

What's the best way to handle these queries?  Considering:
  1. Generally in BI, data sets are very large.  Fact tables can range from 10M records and up (500M+).
  2. The data is updated a limited number of times per day (once, twice, maybe 5 times) in bulk.
  3. The data is read many, Many, MANY times through interactive reports.
  4. SQL queries are generally of an aggregate nature.
I'm not going to discuss the differences between columnar and traditional row-storage based databases, but I highly recommend that you read these posts to get a better understanding of both technologies.

https://en.wikipedia.org/wiki/Column-oriented_DBMS
http://www.dbbest.com/blog/column-oriented-database-technologies/
Possibilities:
If you have the money, and performance IS AN ABSOLUTE MUST (<10-15 sec response), then columnar databases are the way to go.  Here are some tips and things you should be aware of:
  1. Columnar databases can return aggregate query results extremely fast, an order of magnitude faster than traditional row-storage based databases.  However, if your data warehouse needs to be updated often, they are slow.  For example, data warehouses that provide information about the latest status of a business process generally require updates to specific records.  Make sure you won't have to do those operations frequently.
  2. With Mondrian, you do not need a star schema on a columnar database.  Data can live in one BIG flat (wide and long) table (see the sketch after this list), since columnar databases are very efficient at storage and compression.
  3. If your OLAP cube has a lot of single-attribute dimensions (no hierarchy or relationships with other tables), or degenerate dimensions, then a star schema may not be a good choice and one BIG flat table may be better.  So again, a columnar database may be a better option for you.
  4. If your fact table data set is reaching 100M records or above, you may need to bite the bullet and plan for a columnar database.
  5. Columnar databases are not your average databases.  They are very specialized for data warehousing and BI.  They sit alongside transactional databases such as Oracle, Sybase, MySQL, or Postgres; they do not replace those technologies.  With a new technology in your stack, you may need new skill sets to manage it.  This is cost of ownership!
  6. Columnar databases are free to download, but can be far more costly to support with a production license.  This can be a big catch for startup companies with a limited budget.
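To make tips 2 and 3 concrete, here is a rough sketch of "one BIG flat table" on a columnar database (all names hypothetical).  Dimension attributes, degenerate dimensions, and measures sit side by side in a single wide table, and the engine's column compression keeps the repeated values cheap:

  CREATE TABLE sales_flat (
      order_date      DATE,
      customer_name   VARCHAR(255),   -- single-attribute dimensions stored inline
      customer_region VARCHAR(100),
      product_name    VARCHAR(255),
      product_family  VARCHAR(100),
      order_number    VARCHAR(50),    -- degenerate dimension
      quantity        INTEGER,        -- measures
      total_cost      DECIMAL(18,2)
  );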
What if you don't have enough money, or your data set is not big enough to justify a columnar DB?  What now?  Here are some tips for working with your traditional database and Mondrian:
  1. Traditional databases are much cheaper.  Cloud hosting providers already have pre-configured installations that are typically tuned for production.
  2. You MUST have a proper star schema.  Please read these references to get a better understanding of star schemas.
  3. The fact table in the star schema should be narrow (a few columns of measures) and very long (10-100M records).  It should contain only measures and lots of foreign key columns pointing to dimension tables (see the sketch after this list).  If your fact table has >50 columns, it is time to review your requirements and rethink your design.
  4. Index the fact table columns that the generated queries join and filter on (typically the foreign key columns).  These indexes will greatly improve performance.  Considering that the fact table is written only a few times each day, the cost of slower writes will be paid back many times over with far better reads.
  5. If your fact table data set is reaching 100M records or above, you may need to think about biting the bullet and planning for a columnar database.
  6. You can have degenerate dimensions in your fact table, but if the cardinality is very high (>500), it's better to move the attribute to a separate dimension table; otherwise it will cause very expensive full table scans of the fact table.
  7. If some queries are taking too long, you can find them in the mondrian_sql.log file, execute them independently, and review the execution plan on the database.  Your DBA should be able to make good sense of it.
  8. If your company is a "Leading Provider of so and so", then act like one!!!!  Hire a good DBA to keep close tabs on the database and detect the slow queries.
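As a sketch of tips 2 through 4 (all names hypothetical): a narrow fact table holding only foreign keys, a low-cardinality degenerate dimension, and measures, with indexes on the columns the generated queries join and filter on:

  CREATE TABLE fact_sales (
      time_key     INTEGER        NOT NULL,  -- foreign keys to dimension tables
      customer_key INTEGER        NOT NULL,
      product_key  INTEGER        NOT NULL,
      order_number VARCHAR(50),              -- degenerate dimension (keep cardinality low)
      quantity     INTEGER        NOT NULL,  -- measures
      total_cost   DECIMAL(18,2)  NOT NULL
  );

  CREATE INDEX idx_fact_sales_time     ON fact_sales (time_key);
  CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_key);
  CREATE INDEX idx_fact_sales_product  ON fact_sales (product_key);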
All in all, columnar databases are becoming more common and cost effective.  If the $$ cost is bearable, you may want to use one.  That way, as your data grows, you can be sure you can handle the reports as well.  But if you have ~10M records and it takes several minutes to render your reports, don't blame it on Mondrian or Pentaho.  Review your requirements and tune your schema!

Many Thanks to Oliver Nielsen and Babji Nallamotu
-Maz

Tuesday, July 21, 2015

Versioning Pentaho Kettle content in Source Control Repositories

Everybody knows that all IT projects use a source control repository to manage and maintain versions of the source code or artifacts that get deployed on the server.  This is no different for Data Integration projects with PDI (Pentaho Data Integrator).

What's the best way to develop KTRs and KJBs with PDI, considering that PDI requires you to save to the Enterprise Repository every time you modify a file?
 
The Enterprise Data Integration (DI) server comes with a repository based on Apache Jackrabbit and a scheduler (Quartz).  Through the DI scheduler, organizations can schedule jobs and transformations.  Another benefit of using the DI server and repository is that the DI server can act as a data access service provider, which is a very powerful feature.  The Jackrabbit-based repository provides versioning, but it is not meant to be used as a source control repository.

Let's go through a typical development life cycle.

  1. A developer starts Spoon on their local laptop.
  2. Since an integration project consists of multiple KTRs and KJBs that are all interconnected, they need to know each other's paths in the repository.  Therefore, the developer connects to the repository, and all KTRs/KJBs are saved in the repository with paths relative to the repository.
  3. Developers modify KTRs and test them frequently.  Each time the developer wants to test one, they must save the KTR in the repository, even though the KTR is being developed and tested in Spoon.
  4. Every time a KTR is saved, it is stored as a new version (a full copy of the file) in the Jackrabbit repository.  This makes the repository much larger than it needs to be, since most versions in the repository will never be executed on the DI server.
  5. DI administrators must periodically use the Pentaho-provided purge-utility.sh/bat to clean up the repository by removing old versions of the KTRs and KJBs.
  6. Once a KTR is finally tested and approved, it is scheduled on the DI server and gets executed periodically.

The DI repository is not the best place to version KTRs or KJBs.  It is not designed to be a source control system (SCS).  Therefore, the development team should think about using a traditional SCS outside of Jackrabbit, such as CVS, SVN, etc.

A traditional SCS works on your file system.  That means all files live in a directory structure (workspace) on the developer's laptop and are checked in using the SCS's utilities.  But if a developer simply uses Spoon against the local file system, the references between the KTRs and KJBs are relative to the file system and will not line up with the paths in the DI repository.

Now that I've provided a very, VERY LONG explanation of the problem.......let's discuss the solution!

SOLUTION:

Most people don't know that Spoon supports 3 different types of repositories.
  1. DI repository: the traditional repository that is accessed through the DI server, with files stored in Apache Jackrabbit.
  2. Kettle DB repository: a relational database is used to store the KTRs and KJBs.
  3. Kettle file repository: a local folder is used to store the KTRs and KJBs.
For simplicity, developers should create a directory on their laptop and define it as the repository for Spoon.  In the example below, C:\DIWorkspace is used as the repository root folder.


How to define a File Repository
The same folder can be the root folder of the SCS workspace where its metadata files are created.  It is important for the developer to create the same directory tree structure that exists on the DI repository.  By default, /home and /public exist on the DI repository.  Other folders can be created as well, and developers can store their content in those folders.

Mimic the directory structure on the DI server
Now the development cycle becomes:
  1. A developer starts Spoon on their local laptop.
  2. The developer connects to the Kettle file repository, whose directory structure mimics what is on the DI repository.
  3. Developers create and modify KTRs and test them frequently.  As they save KTRs, the files only get saved to the local file system, so there is no versioning.
  4. When the developer is satisfied with the KTRs and KJBs, they use the SCS utilities to check in the files.
  5. The DI administrator checks out the KTRs and KJBs with full path information and imports them into the DI repository.
  6. Now, using Spoon, the DI administrator can connect to the DI server repository and schedule jobs, knowing that the relationships between the KTRs and KJBs are preserved.
I hope you find this posting helpful.
FYI:  THE FINE PRINT!!!! FILE REPOSITORY IS NOT SUPPORTED BY PENTAHO's SUPPORT team.

-maz
Many thanks to Kevin Hanrahan

Sunday, August 10, 2014

Exporting PRPT reports to multi-worksheet Excel files



It is very common for business users to get access to a Pentaho PRPT report in Excel spreadsheet format.  They may choose to print it and bundle it with other printed materials and distribute it as a packet.  They may want to share it with coworkers who do not have access to the online version of the report.  They also may want to look at the raw data in Excel and perform their own data analysis.

By default, PRPT reports can be downloaded as Excel spreadsheets.  However, real-world spreadsheets commonly have multiple worksheets populated with data from multiple data sources and contain charts or plots in different formats.  The challenge is how to design a PRPT report so that it generates such a spreadsheet when downloaded.

The first thing to do is to design the report and determine what data needs to be displayed on each worksheet.  Since each worksheet can contain data from a different data source, you need to use subreports; each subreport serves as the container for one worksheet.  Keep in mind that the subreports need to be banded subreports.  To create a subreport, drag and drop a subreport element from the palette on the left side of PRD.  When the Insert Subreport prompt pops up, select "Banded".

Adding subreports


Insert Subreport prompt: YOU MUST specify Banded






This will open the subreport tab in PRD.  You can design the internals of a subreport just like any other report.

The final step is to specify a page break and the sheet name.  From the master report, select the Style tab.  Specify a name for sheetname in the Excel section, and set pagebreak-before to True.



Simply preview your report in Excel format and you will see the different worksheets.