Sunday, October 16, 2011

Document Management System with CouchDB - First Part

I will start documenting about my experience using CouchDB to build a Document Management System (DMS), an important component of any Content Management System (CMS).

The first part concentrates on installing and using CouchDB in OSX and Ubuntu.

OSX

Alternatively you could install from sources which I prefer to get later and greatest.

  1. Download any pending updates for OSX. Then latest version of XCode
  2. Install homebrew if you still not have it. It is the best package manager for OSX.
  3. /usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)"
    
  4. Install couchDB. Following instructions from http://wiki.apache.org/couchdb/Installation with just one command. It could take a while, if it hangs then restart again, it will continue from where it broke.
  5. brew install couchdb
    
  6. Start the server.
    $ couchdb
    $ curl -X GET http://127.0.0.1:5984/
    
  7. If you need to change the default port:
    sudo vi  /usr/local/etc/couchdb/default.ini
    

Ubuntu

In Ubuntu I decided to build from sources to get latest available version for my 10.10 Maverick. For 11.4 I found this also works but you have to issue 'sudo apt-get remove libmozjs185-dev' in order to build.

Using CouchDB

Let us start interacting with CouchDB to create a database, a document, attach a file to it, update it etc. We use curl to be sure we can issue different HTTP request method (GET, POST, PUT, DELETE).
  1. We create our Document Management System database and we confirm it was created:
    $ curl -X PUT http://127.0.0.1:5984/dms
    {"ok":true}
    $ curl -X GET http://127.0.0.1:5984/_all_dbs
    ["_replicator","_users","dms"]
    
  2. Let us create a first document. In this example we use POST instead of PUT so we get a UUID generated by CouchDB:
    $ curl -d '{
        "name":"Investor Document 11",
        "clientId": "1001",
        "createdByEmployeeId": "2",
        "reviewedByEmployeeId": "1",
        "approvedByManagerId": "21",
        "created": "2/2/2011",
        "reviewed": "2/3/2011",
        "approved": "2/4/2011",
        "investorId": "32",
        "categoryId": "2",
        "statusId": "2"
    }' -H "Content-Type: application/json" -X POST http://127.0.0.1:5984/dms
    
    Result:
    {"ok":true,"id":"296ef7cde8fe533efe0c7dded873505b","rev":"1-d5dd0fa82df07553f3a2b82947864fc6"}
    
  3. Let us attach a file to the above document. Note we need the document is and rev:
    $ curl -X PUT -H "Content-Type: application/pdf"  --data-binary @DailyReport.pdf  $DMS/296ef7cde8fe533efe0c7dded873505b/DailyReport.pdf?rev=1-d5dd0fa82df07553f3a2b82947864fc6
    
  4. You can visually manage your CouchDB server via Futon user interface. Just hit http://localhost:5984/_utils/ and start playing with it.
  5. Create some documents for different combinations of categoryId, clientId and investorId either from curl or from Futon.
  6. Let us start querying our DB.

    You query a View in CouchDB. Views are a combination of two functions that are applied to the original data: Map and Reduce (MapReduce style: Map functions generate indexes and Reduce queries are requests against them). Map function as its name suggests specifies the mapping between the document structure and the structure of the View. Reduce function as its name suggests specifies how to group the resulting data to reduce the results. The View is consequently just a transformation of the document where an index is usually defined. If you need to group a Reduce step will be applied as well.

    You must become familiar with how to write the map and reduce functions for Views. This is done using the javascript language. From Futon select "Temporary View ..." option from the View dropdown. You have two panes now, the left is for your Map function and the right is for the Reduce. By default you see CouchDB proposes the below code which is equivalent to "Do not use any custom key and show all values from the document". There is no transformation nor custom index at all in this case, however if no key is specified couchDB uses the document id as unique identifier). Remember the Map function generates rows containing the id, an optional key and an optional value.
    function(doc) {
      emit(null, doc);
    }
    
  7. Let us edit the function to "Use the name as key and show only clientId and investorId". When you run the view using both functions you will realize the difference. By now you should be aware that emit() just accepts two parameters, the key for an index and the value that will be returned. Of course the results come ordered by the Key if provided. Both key and value are json expressions as well.
    function(doc) {
      //emit(doc.name, doc);
      if(doc.name && doc.clientId) {
        var key = doc.name;
        var value = {name: doc.name, clientId: doc.clientId, investorId: doc.investorId}
        emit(key, value);
      }
    }
    
  8. Here we use a composite key out of the clientId and the investorId so we can find the documents for that combination. Again the results are ordered first by clientId and later by investorId:
    function(doc) {
      if(doc.clientId && doc.investorId) {
        var key = [doc.clientId, doc.investorId];
        var value = {name: doc.name, clientId: doc.clientId, investorId: doc.investorId}
        emit(key, value);
      }
    }
    
  9. Save the view. The options you pick will be used in the URL to retrieve the View results. I have decided to use "common" for the design document name and "by_client_investor" for the name of the view:
    Design Document: _design/common
    View Name: by_client_investor
    

  10. Now the View is saved so we can query it at any time. The View is now "Permanent" and not longer "Temporary". Let us query it for just one key. Note that as we decided to use an array as key we will need to look for something like: ["1000","30"]. As you might have notice the key contains characters that must be URL encoded, in this case %5B%221000%22%2C%2230%22%5D. Here is how the command will look like:
    $ curl -X GET http://127.0.0.1:5984/dms/_design/common/_view/by_client_investor?key=%5B%221000%22%2C%2230%22%5D
    
    Alternatively you can use a more clear approach using some other curl flags:
    curl -X GET http://127.0.0.1:5984/dms/_design/common/_view/by_client_investor -G --data-urlencode key='["1000","30"]'
    
  11. Here is how you use curl to create and execute a temporary View from the command line. Here we are using categoryId as a key and getting the whole document as a result of the "non existent" transformation.
    curl -X POST http://127.0.0.1:5984/dms/_temp_view -H "Content-Type: application/json" -d \
    '{
      "map": "function(doc) {
                if (doc.categoryId) {
                  emit(doc.categoryId, doc);
                }
              }"
    }'
    
  12. Let us explore the results of the below temporary View. Here we are insterested in the total documents by category. As we are grouping we need to use a Reduce function where we take advantage of the provided _count. Note the key is null because it counts all of the existing documents.
    curl -X POST http://127.0.0.1:5984/dms/_temp_view -H "Content-Type: application/json" -d \
    '{
      "map": "function(doc) {
                if (doc.categoryId) {
                  emit(doc.categoryId, doc);
                }
              }",
      "reduce": "_count"
    }'
    
  13. Here is how we generate the counting by key which translates to use "Grouping=exact" from Futon or as shown below "group=true" from the HTTP request:
    curl -X POST http://127.0.0.1:5984/dms/_temp_view?group=true -H "Content-Type: application/json" -d \
    '{
      "map": "function(doc) {
                if (doc.categoryId) {
                  emit(doc.categoryId, doc);
                }
              }",
      "reduce": "_count"
    }'
    
  14. We already saw how to make a View permanent while saving it from Futon. Here is how from an HTTP request you do the same. This time we are adding a View called category_count to a Design Document called category:
    curl -X PUT http://127.0.0.1:5984/dms/_design/category -d \
    '{
       "_id": "_design/category",
       "language": "javascript",
       "views": {
         "count": {
           "map":
             "function(doc) {
               if (doc.categoryId) {
                 emit(doc.categoryId, doc);
               }
             }",
           "reduce": "_count"
          }
       }
    }'
    
  15. As we already saw we can query this view like this:
    curl -X GET http://127.0.0.1:5984/dms/_design/category/_view/count
    {"rows":[
    {"key":"1","value":17},
    {"key":"2","value":1}
    ]}
    

Some other examples

Here is how you would pull document information, delete its attachment (named the same as the document) using the revision number, try to delete it again and get an error, try to pull the attachment from the document and get an error and pull information again to confirm the document is still in the DB but simply it does not have any attachments. Note that I am accessing here now a production system where we use SSL with user and password:
$ curl -k -X GET "https://user:password@example.com:6984/dms/sample.pdf"
{"_id":"sample.pdf","_rev":"1-adb7b6f2e32d73758dfa16966c1caef9","approvedOn":"2012-03-15T16:34:08.000-0400","createdByEmployeeEmail":"nestor@example.com","title":"sample.pdf","_attachments":{"sample.pdf":{"content_type":"application/pdf","revpos":1,"digest":"md5-ZRJB3hYW9LuwL2p9wjJr0g==","length":167546,"stub":true}}}
$ curl -k -X DELETE "https://user:password@example.com:6984/dms/sample.pdf/sample.pdf/?rev=1-adb7b6f2e32d73758dfa16966c1caef9"
{"ok":true,"id":"sample.pdf","rev":"2-87988b99af60e2f7cb9022b65b7565d5"}
$ curl -k -X DELETE "https://user:password@example.com:6984/dms/sample.pdf/sample.pdf/?rev=1-adb7b6f2e32d73758dfa16966c1caef9"
{"error":"conflict","reason":"Document update conflict."}
$ curl -k -X GET "https://user:password@example.com:6984/dms/sample.pdf/sample.pdf"
{"error":"not_found","reason":"Document is missing attachment"}
$ curl -k -X GET "https://user:password@example.com:6984/dms/sample.pdf"
{"_id":"sample.pdf","_rev":"1-adb7b6f2e32d73758dfa16966c1caef9","approvedOn":"2012-03-15T16:34:08.000-0400","createdByEmployeeEmail":"nestor@example.com","title":"sample.pdf"}

Logging

If you are unsure where couchdb is logging just issue the below command:
$ curl -X GET http://localhost:5984/_config/log {"file":"/usr/local/var/log/couchdb/couch.log","include_sasl":"true","level":"info"}
You can also get the latest log lines directly from the below request:
$ curl -X GET http://localhost:5984/_log

Review

At this point you can interact with CouchDB from any language using plain REST commands. You might want to use some abstractions with an API that allows you to go through CRUD operations with CouchDB without being concern about the details of sending and parsing JSON.

In the next part we start the design of the DMS for which we will not use any specific language other than plain HTTP with the help of curl.

No comments:

Followers