Wednesday, July 22, 2015

To Columnar or Not to Columnar with Pentaho's Mondrian.....What is your decision?

When developing a report or a Mondrian schema in Pentaho, should you use a columnar database or not?  Let's discuss it!

Lots of college grads venture off to New York City and dream of making it big in a city so expensive that even millionaires have a hard time balancing their budgets. It's really easy to "recommend" that they eat Filet Mignon at Peter Luger's or visit Morimoto's for an exceptional Japanese meal, EVERY NIGHT!!!  Hey, if they share a studio with three other people, one can assume they're saving enough money.  If sustenance is what matters, there are lots of delis and good supermarkets with fresh products that they feel comfortable with, and that don't require an expensive cab ride across rivers and oceans to get to... Peter Luger's is in Brooklyn ;)

In the world of Business Intelligence (BI), columnar databases are the Filet Mignon at Peter Luger's or the Omakase at Morimoto's.  The food is perfectly done and set in front of you.  You just have to eat it and reap the benefits.  But let's be honest, how many people can eat like that every night?

The question is... Is it worth it to use a columnar database like Infobright, Redshift, or Vertica with Pentaho's Mondrian engine, or should you stick with open source products such as MySQL or Postgres?

Pentaho's Mondrian engine requires an SQL-based database.  Yes, version 4 supports MongoDB as well, but let's not open that can of warm soup.  Pentaho Analyzer is the GUI front end to the open source OLAP engine.  As users drag levels and measures into the pivot table, Mondrian generates an SQL query in the backend and executes it on the database.  These queries are generally aggregate queries (SUM, COUNT, AVG, etc.), which operate on a specific column (e.g. total_cost) based on certain criteria and group the results by the levels chosen.
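For example, dragging a Year level and a Total Cost measure into Analyzer could make Mondrian emit something along these lines (the table and column names here are hypothetical, purely for illustration):

    SELECT t.the_year, SUM(f.total_cost) AS total_cost
    FROM sales_fact f
    JOIN time_by_day t ON f.time_id = t.time_id
    WHERE t.the_year IN (2014, 2015)
    GROUP BY t.the_year;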

What's the best way to handle these queries?  Considering:
  1. Generally in BI, data sets are very large.  Fact tables can range from 10M records up to 500M+.
  2. The data is updated a limited number of times per day (once, twice, maybe five times), in bulk.
  3. The data is read many, Many, MANY times through interactive reports.
  4. SQL queries are generally aggregate in nature.
I'm not going to discuss the differences between columnar and traditional row-oriented databases, but I highly recommend reading these posts to get a better understanding of both technologies.

https://en.wikipedia.org/wiki/Column-oriented_DBMS
http://www.dbbest.com/blog/column-oriented-database-technologies/
Possibilities:
If you have the money, and performance IS AN ABSOLUTE MUST (<10-15 sec response), then columnar databases are the way to go.  Here are some tips and things you should be aware of:
  1. Columnar databases can return aggregate query results extremely fast, an order of magnitude faster than traditional row-oriented databases, but if your data warehouse needs to be updated often, they are slow.  For example, data warehouses that provide information about the latest status of a business process generally require updates to specific records.  Make sure you won't have to do those operations frequently.
  2. With Mondrian, you do not need a star schema with columnar databases.  Data can live in 1 BIG flat (wide and long) table, as sketched after this list.  Columnar databases are very efficient in storage and compression.
  3. If your OLAP cube has a lot of single-attribute dimensions (no hierarchy or relationships with other tables), or degenerate dimensions, then a star schema may not be a good choice, and 1 BIG flat table may be better.  So again, a columnar database may be a better option for you.
  4. If your fact table is reaching 100M records or above, that alone may justify biting the bullet and planning for a columnar database.
  5. Columnar databases are not your average databases.  They are specialized for data warehousing and BI.  They sit alongside your transactional databases such as Oracle, Sybase, MySQL, or Postgres; they do not replace those technologies.  With a new technology in your stack, you may need new skill sets to manage it.  This is cost of ownership!
  6. Columnar databases may be free to download, but a production license with support can be far more costly.  This can be a big catch for startup companies with limited budgets.
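To make the flat-table idea in tips 2 and 3 concrete, here is a minimal sketch; every table and column name below is hypothetical:

    -- One wide, denormalized table: dimension attributes and measures side by side.
    -- A columnar engine compresses the repetitive dimension columns very well.
    CREATE TABLE sales_flat (
        order_date       DATE,
        customer_name    VARCHAR(100),
        customer_region  VARCHAR(50),
        product_name     VARCHAR(100),
        product_category VARCHAR(50),
        order_number     VARCHAR(20),    -- degenerate dimension
        quantity         INTEGER,
        total_cost       DECIMAL(12,2)
    );

    -- Mondrian levels and measures can point directly at these columns:
    SELECT customer_region, product_category, SUM(total_cost) AS total_cost
    FROM sales_flat
    GROUP BY customer_region, product_category;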
What if you don't have enough money, or your data set is not big enough to justify a columnar DB?  What now?  Here are some tips for working with your traditional database and Mondrian:
  1. Traditional databases are much cheaper.  Cloud hosting providers already have pre-configured installations that are typically tuned for production.
  2. You MUST have a proper star schema.  If the concept is new to you, read up on star schema design before going further.
  3. The fact table in the star schema should be narrow (a few measure columns) and very long (10-100M records).  It should be designed with measures only, plus foreign key columns pointing to the dimension tables (see the sketch after this list).  If your fact table has more than 50 columns, it is time to review your requirements and rethink your design.
  4. Index the foreign key columns on the fact table.  The indexes will greatly improve performance.  Considering that the fact table is written only a few times each day, the write penalty of the indexes is paid back many times over in faster reads.
  5. If your fact table is reaching 100M records or above, you may need to think about biting the bullet and planning for a columnar database.
  6. You can have degenerate dimensions in your fact table, but if the cardinality is very high (>500 distinct values), it's better to move them to a separate dimension table; otherwise, enumerating the members will force a very expensive full table scan of the fact table.
  7. If some queries are taking too long, you can find them in the mondrian_sql.log file, execute them independently, and review the execution plan on the database.  Your DBA should be able to make good sense of it.
  8. If your company is a "Leading Provider of so and so", then act like one!!!!  Hire a good DBA to keep close tabs on the database and catch the slow queries.
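To make tips 3, 4, and 7 concrete, here is a minimal sketch; all table, column, and index names are hypothetical:

    -- A small dimension table and a narrow fact table:
    -- foreign keys to the dimensions plus a handful of measures.
    CREATE TABLE time_dim (
        time_id  INTEGER PRIMARY KEY,
        the_date DATE,
        the_year INTEGER
    );

    CREATE TABLE sales_fact (
        time_id     INTEGER NOT NULL,   -- FK to time_dim
        customer_id INTEGER NOT NULL,   -- FK to customer_dim
        product_id  INTEGER NOT NULL,   -- FK to product_dim
        quantity    INTEGER,
        total_cost  DECIMAL(12,2)
    );

    -- Index the join/filter columns so aggregate queries avoid full table scans.
    CREATE INDEX idx_sales_fact_time     ON sales_fact (time_id);
    CREATE INDEX idx_sales_fact_customer ON sales_fact (customer_id);
    CREATE INDEX idx_sales_fact_product  ON sales_fact (product_id);

    -- Tip 7: take a slow query out of mondrian_sql.log and inspect its plan.
    EXPLAIN
    SELECT t.the_year, SUM(f.total_cost) AS total_cost
    FROM sales_fact f
    JOIN time_dim t ON f.time_id = t.time_id
    GROUP BY t.the_year;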
All in all, columnar databases are becoming more common and cost effective.  If the $$ cost is bearable, then you may want to use one.  This way, as your data grows, you can be confident your reports will keep up.  But if you have ~10M records and it takes several minutes to render your reports, don't blame it on Mondrian or Pentaho.  Review your requirements and tune your schema!

Many Thanks to Oliver Nielsen and Babji Nallamotu
-Maz

Tuesday, July 21, 2015

Versioning Pentaho Kettle content in Source Control Repositories

Everybody knows that IT projects use a source control repository to manage and maintain versions of the source code and artifacts that get deployed on the server.  This is no different for Data Integration projects with PDI (Pentaho Data Integration).

What's the best way to develop KTRs (transformations) and KJBs (jobs) with PDI, considering that PDI requires you to save to the Enterprise Repository every time you modify a file?
 
The Enterprise Data Integration (DI) server comes with a repository based on Apache Jackrabbit and a scheduler (Quartz).  Through the DI scheduler, organizations can schedule jobs and transformations.  Another benefit of the DI server and repository is that the DI server can act as a data access service provider.  This is a very powerful feature of the DI server.  The Jackrabbit-based repository provides versioning, but it is not meant to be used as a source control repository.

Let's go through a typical development life cycle.

  1. A developer starts Spoon on their local laptop.
  2. Since an integration project consists of multiple KTRs and KJBs, and they are all interconnected, they need to know each other's paths on the repository.  Therefore, the developer connects to the repository, and all KTRs/KJBs are saved in the repository with their paths relative to the repository.
  3. Developers modify KTRs and test them frequently.  Each time the developer wants to test a KTR, they must save it to the repository, even though it is still being developed and tested in Spoon.
  4. Every time a KTR is saved, it is stored as a new version (a full copy of the file) in the Jackrabbit repository.  This makes the repository much larger than it actually needs to be, since most versions in the repository will never be executed on the DI server.
  5. DI administrators must periodically run the Pentaho-provided purge-utility.sh/bat to clean up the repository by removing old versions of the KTRs and KJBs.
  6. Once a KTR is finally tested and approved, it is scheduled on the DI server and executed periodically.

The DI repository is not the best place to version KTRs or KJBs; it is not designed to be a source control system (SCS).  Therefore, the development team should think about using a traditional SCS outside of Jackrabbit, such as CVS, SVN, etc.

Traditional SCSs work on your file system.  That means all files live in a directory structure (a workspace) on the developer's laptop and are checked in using the SCS's utilities.  But if a developer simply uses Spoon against the local file system, the references between the KTRs and KJBs are file system paths, and they won't line up with the paths in the DI repository when the content is eventually imported there.

Now that I've provided a very, VERY LONG explanation of the problem... let's discuss the solution!

SOLUTION:

Most people don't know that Spoon supports three different types of repositories.
  1. DI Repository: the traditional repository that is accessed through the DI server, with files stored in Apache Jackrabbit.
  2. Kettle DB repository: a relational database is used to store the KTRs and KJBs.
  3. Kettle file repository: a local folder is used to store the KTRs and KJBs.
For simplicity, developers should create a directory on their laptop and define it as the repository for Spoon.  In the example below, C:\DIWorkspace is used as the repository root folder.


[Screenshot: How to define a File Repository]
The same folder can be the root folder of the SCS workspace, where its metadata files are created.  It is important for the developer to create the same directory tree structure that exists on the DI repository (see the example layout below).  By default, /home and /public exist on the DI repository.  Other folders can be created as well, and developers can store their content in those folders.

[Screenshot: Mimic the directory structure on the DI server]
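As a simple illustration (assuming C:\DIWorkspace is the repository root, as above; the subfolder contents are hypothetical), the local tree mirrors the DI repository:

    C:\DIWorkspace\
        home\
            <per-user folders, mirroring /home on the DI repository>
        public\
            <project folders containing the KTRs and KJBs>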
Now the development cycle becomes:
  1. A developer starts Spoon on their local laptop.
  2. The developer connects to the Kettle file repository, whose directory structure mimics what is on the DI repository.
  3. The developer creates and modifies KTRs and tests them frequently.  As they save KTRs, the files only get saved to the local file system, so there is no versioning.
  4. When the developer is satisfied with the KTRs and KJBs, they use the SCS's utilities to check in the files.
  5. The DI administrator checks out the KTRs and KJBs, with their full path information, and imports them into the DI Repository.
  6. Now, using Spoon, the DI administrator can connect to the DI server repository and schedule jobs, knowing that the relationships between the KTRs and KJBs are preserved.
I hope you find this posting helpful.
FYI: THE FINE PRINT!!!! THE FILE REPOSITORY IS NOT SUPPORTED BY PENTAHO'S SUPPORT TEAM.

-Maz
Many thanks to Kevin Hanrahan