Friday, September 27, 2013

Apache log files statistics - Hits per resource - Finding most consumed resources

In a typical modern web application, users hit resources both directly and indirectly. Often, especially in REST approaches, the path contains a number identifying the specific collection member being accessed. Unix power tools can quickly answer questions like "List hits by page" or, better said in Web 2.0 terms, "List hits per resource".

Given Apache logs look like: - - [22/Sep/2013:06:25:09 -0400] "POST /my/resource HTTP/1.1" 200 3664
When the below command is run:
cut -d'?' -f1 access.log | sed 's/[0-9]//g' | awk '{url[$7]++} END{for (i in url) {print url[i], i}}' | sort -nr
Then an output like the below will be returned:
10000 /my/top/hit/resource
50 /my/number//including/hit/resource
1 /my/bottom/hit/resource
The command first gets rid of the query string, then removes all digits (so that resources differing only by id are not counted as different resources), builds an associative array (or map) whose key is the resource (the seventh field) and whose value is the number of hits for that resource, prints each entry as "counter resource", and finally sorts the result in descending order. The -n switch is needed because every output line starts with the counter, which has to be compared as a number rather than as text.
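
The same steps can also be folded into a single awk program, so that only the request path is touched and the rest of the log line is left alone. The following is a minimal sketch, not a definitive implementation: the function name hits_per_resource is made up for illustration, and it assumes the request path is the seventh whitespace-separated field, as in Apache's common and combined log formats, where the client host is the first field.

```shell
#!/bin/sh
# Hypothetical helper: print "counter resource" pairs, most hit first.
# Assumption: the request path is field 7 (Apache common/combined log format).
hits_per_resource() {
    awk '{
        path = $7
        sub(/\?.*$/, "", path)      # drop the query string, if any
        gsub(/[0-9]+/, "", path)    # drop digits so ids collapse into one resource
        hits[path]++                # associative array: resource -> hit count
    }
    END {
        for (p in hits) print hits[p], p
    }' "$@" | LC_ALL=C sort -k1,1nr -k2   # count descending, then path ascending
}
```

It can be called as hits_per_resource access.log, or fed from a pipe when no file name is given. Sorting on the first key numerically keeps the ordering correct, and the second key only makes ties come out in a stable order.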

1 comment:

Matthew Holt said...

This was seriously the most helpful trick I've seen in a long time. Thanks! Saved me hours of work, and I learned something about Unix in the process.