Wednesday, July 22, 2015

To Columnar or Not to Columnar with Pentaho's Mondrian.....What is your decision?

When developing a report or  a Mondrian schema in Pentaho, should you use a Columnar database or not?  Let's discuss it!!

Lots of college grads venture off to New York City and dream of making it big in a city so expensive that even millionaires have hard time balancing their budget. It's really easy to "recommend" them to eat Fillet Mignon at Peter Luger's or visit Morimoto's for an exceptional Japanese meal, EVERY NIGHT!!!  Hey if they share a studio with 3 other people, one can assume they are saving enough money.  If sustenance is important, there are lots of delis and good super markets with fresh products which they feel very comfortable with and doesn't require an expensive cab ride to cross rivers and oceans to get to their destinations.... Peter Luger's is in Brooklyn ;)

In the world of Business Intelligence (BI) Columnar databases are the Fillet Mignon at Peter Luger's or the Omakase at Morimoto's.  The food is perfectly done and set in front of you.  you just have to eat it and reap its benefits.  But let's be honest, how many people can eat like that every night? 

The questions is.... Is it worth it to use a columnar database, like Info-bright, Red Shift, or Vertica with Pentaho's Mondrian engine or should stick with MySQL or Postgres which are open source products? 

Pentaho's Mondrian engine requires an SQL based database.  Yes, version 4 support MongoDB as well, but let's not open that can of warm soup.  Pentaho's Analyzer is the GUI front end to the open source OLAP engine.  As users bring in levels and measures into the pivot table, Mondrian generate an SQL query in the backend and executes it on the database.  These queries are generally aggregate queries (SUM, COUNT, AVG, etc.), which work on a specific column (e.g. total_cost) of the result set based a certain criteria, and group the results based on the levels chosen. 

What's the best way to handle these queries?  Considering:
  1. Generally in BI, data sets are very large.  Fact tables can range from 10M and up (500M+) records.
  2. The data is updated limited number of times per day (once or twice, or 5 times) in bulk.
  3. The data is read many, Many, MANY times through interactive reports.
  4. SQL queries are generally of aggregate nature. 
I'm not going to discuss the differences between Columnar and traditional row-storaged based database, but I highly recommend you to read these posts to get a better understanding of both technologies.

https://en.wikipedia.org/wiki/Column-oriented_DBMS
http://www.dbbest.com/blog/column-oriented-database-technologies/
Possibilities:
If you have the money, and performance IS AN ABSOLUTE MUST (<10-15 sec response), then columnar databases are the way to go.  here are some tips and things you should be aware of:
  1. Columnar database can return aggregate query results extremely fast.  An order of magnitude faster than traditional row-storaged based databased, but if your data warehouse needs to be updated often, then they are slow.  For example, data warehouses that provide information about the latest status of a business process, generally require updates to specific records.  Make sure you wouldn't have to do those operations frequently.  
  2. With Mondrian, you do not need a star-schema with Columnar databases.  Data can be in 1 BIG flat (wide and long) table.  Columnar databases are very efficient in storage, and compression.
  3. If your OLAP cube has a lot of single attribute dimensions (no hierarchy, or relationships with other tables), or degenerate dimensions, then star schemas may not be a good choice, and 1 BIG flat table may be a better choice.  So again, Columnar databases may be a better option for you.  
  4. If your fact table data set is reaching 100M records or above, you may need to think about biting the bullet and plan for a columnar database.
  5.  Columnar databases are not your average databases.  They are very specialized for data warehousing or BI.  This is in addition to your transactional databases such as Oracle, Sybase, MySQL, or Postgres, etc.  It does not replace those technologies.  With a new technology in your stack, you may need to have new skill sets to manage them.  This is Cost of Ownership!
  6. Columnar databases are free to download, but can be far more costly to support with a production license.  This can be a big catch for most startup companies with limited budget.
What if you don't have enough money or your data set is not big enough to justify a columnar DB.  What now?  here are some tips to work with your traditional database and Mondrian:
  1. Traditional databases are much cheaper.  Cloud based host providers, already have pre-configured installations which are typically tuned for production.
  2. You MUST have a proper star schema.  Please read these references to get a better understanding of star schema.
  3. The fact table in the star schema should be narrow (few data columns of measures), and very long (10-100M) records.  It should be designed with measures only and lots of columns of foreign keys to dimension tables. If your fact table has >50 columns, it is time to review your requirements and rethink your design.
  4. Index the measure columns on the fact table.  They will greatly improve the performance. Considering that the fact table will be written only a few times each day, the cost of poor performance of writes will be paid many times over with far better reads.
  5. If your fact table data set is reaching 100M records or above, you may need to think about biting the bullet and plan for a columnar database.
  6. You can have degenerate dimensions in your fact table, but if the cardinality is very high (>500), then it's better to move it to a different dimension table, otherwise, it will cause a very expensive full table scan of the fact table.
  7. If some queries are taking too long, you can detect them in mondrian_sql.log file and execute them independently and review the execution plan on the database.  your DBA should make a good sense of it.
  8. If your company is a "Leading Provider of so and so", then act like one!!!!  Hire a good DBA to keep a close tab on the database to detect the slow queries.
All in all, Columnar database are becoming more common and cost effective.  If the $$ cost is bearable, then you may want to use it.  This way, as your data grows, you can be sure that you can handle the reports as well. But if you have ~10M records and it takes several minutes to render your reports,  don't blame it on Mondrian or Pentaho.  Review your requirements and tune your schema!

Many Thanks to Oliver Nielsen and Babji Nallamotu
-Maz

2 comments:

  1. Very good article. I would completely agree that if you have billions or more of rows to query, you're best bet to get remarkably good performance is one of the good (not all are) columnar analytics RDBMS's. I would argue though that even the commercial columnar RDBMS are NOT nearly as expensive as the enterprise editions of the Big 3...price Vertica and Oracle or SQLServer, and you'll see what I mean. ;-) Also, good columnar RDBMS's can ingest data at mind-blowing rates as well (as in millions or more rows per second), you just have to do it the right way, which is NOT via single-row insert/update/delete operations.

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete