Monday, June 25, 2012

SQL Optimization: Warehouse versus Realtime Reports versus Listing Pages

Sometimes a Real Time Report is assumed to be the equivalent of a listing page. However, while they both consume instant data, Real Time Reporting can take longer than a listing page. This is because a user will happily wait to receive the report run at certain time but waiting a minute for a web page to be rendered will not be a nice experience at all.

On the other hand Warehouse Reports are fast and in many cases even faster than listing pages because they run out of data sources that have been optimized for data retrieval (reads) instead of the regular data sources which are used for reads and writes. However Warehouse reports are valid only if a snapshot of the past is acceptable.

Of course what is fast, or what is old is relative and it is easy to mix Warehouse Reports, Real Time Reports and Web Listing Pages concepts.

From time to time I need to explain the difference to Business people and I thought like writing it down to avoid repeating myself.

There is a reason why we cannot have a transmitter and a receiver that will work in all electromagnetic frequencies and it is the metric called Gain–bandwidth product.

In short if you try to design for higher gain (power to amplify) you will need to compromise the bandwidth (the spectrum of frequencies you will be able to amplify). If you want to build a device that will be an AM/FM radio, a TV receiver, a cellular phone, a WIFI transceiver, a satellite multi frequency transceiver and more you will need multiple circuits (and multiple antennas, more power disipation etc) which will mean a bigger equipment. At least that rule is still there after I learned about the "Gain–bandwidth product" 25 years ago. Technology advances and circuit integration gets better, the space you need gets smaller but the amount of spectrum in use also grows and the demands for different services grows as well. No, you cannot have it all.

Knowing the limitations are important so you can make wise decisions in your architecture. It does not matter if it is hardware or software based.

Architecture is the art of designing a future product. For the product to be real the Architect must design under constraints making sure the client understand the cost involved in the construction. Usability, expected life, maintainability, and other metrics will decide the final cost of the product. As an architect you are the Chief of the Building Process.

How this applies to databases? Let us pick MySQL and think about a typical request: Allow me to search for X in [table A].[field1] where [table B].[field 1] is Y but order by [table C].[field 1]. Of course the three tables are related by keys.

Pretty quick you will find yourself inspecting the query plan (EXPLAIN) just to realize that you can cut in half or more query execution times just applying some indexes in the case you are lucky enough to get composite indexes that relate the filter (WHERE) and the sorting (ORDER BY) however even if you are lucky your query might still not satisfy in some cases the demands for speed.

Here is where the Gain-bandwidth product concept helps. Is it really that we want this listing page in our web interface? Is this long taking query based report to be run real time or actually we are OK with 1 hour old data? Can we at least cache it? Or is it OK to use a typical data-warehouse? Is it so crucial to have this response time that we should forget about normalization for some fields and impact then persistance? Can the query ask for more on the WHERE clause? Can the query be divided in the one to be used by regular employees which commonly are supposed to query for more specific data and a 1 day delay executive report? Too many questions and depending on answers to these and more the Architect should come up with a proposal that makes sense budget and user experience wise.

Business just needs to understand the cost and the Architect for sure will deploy the solution that makes sense for the Enterprise.

As with the "Gain–bandwidth product" there is a "Speed-Normalization product" and depending on the allowed compromise you could even end up deciding for a NoSQL Database approach. So in specifics let us analyze what we are trying to do with the attached query:

No comments: