Thursday, January 8, 2015

Documentum Disaster Recovery

When developing a disaster recovery (DR) architecture and plan for any enterprise application, two factors come into play: recovery time (RT) and recovery point (RP).  Recovery time is the amount of time it takes to restore the system to working order.  Recovery point is the point in time to which the data can be restored.  These requirements must be gathered before developing a disaster recovery plan (and ideally, before designing the system infrastructure).  For a single-service application (e.g. a relational database), once the RT and RP values have been determined, implementing them is a straightforward affair.  However, for an application composed of several interdependent services (e.g. Documentum Content Server), each service must have its own DR strategy, and the combination of those strategies must meet the application’s RT and RP values.  Let us take a high-level look at the services that make up a typical Documentum application and their inter-service dependencies.
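As a rough illustration of how per-service strategies roll up into application-level values, here is a minimal Python sketch.  It rests on assumptions that are not from any Documentum tooling: the names (ServicePlan, meets_targets) are hypothetical, the recovery point is expressed as a worst-case data-loss window, services are restored one after another by default, and the application's recovery point is taken to be that of its least current service.  The numbers are purely illustrative.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServicePlan:
    """Hypothetical per-service DR figures (not a Documentum structure)."""
    name: str
    recovery_time: timedelta  # time needed to restore this service
    data_loss: timedelta      # age of the newest data this service can restore


def application_recovery_time(plans, sequential=True):
    """Total restore time: summed if services are restored one by one,
    otherwise the slowest service when they are restored in parallel."""
    times = [p.recovery_time for p in plans]
    # sum() needs a timedelta start value, otherwise it starts from int 0.
    return sum(times, timedelta()) if sequential else max(times)


def application_data_loss(plans):
    """Assumption: the application is only as current as its oldest service."""
    return max(p.data_loss for p in plans)


def meets_targets(plans, rt_target, rp_target, sequential=True):
    """True if the combined strategies satisfy both application targets."""
    return (application_recovery_time(plans, sequential) <= rt_target
            and application_data_loss(plans) <= rp_target)


# Illustrative values only.
plans = [
    ServicePlan("database", timedelta(hours=2), timedelta(minutes=15)),
    ServicePlan("content storage", timedelta(hours=1), timedelta(minutes=5)),
    ServicePlan("index server", timedelta(hours=4), timedelta(hours=24)),
]
print(meets_targets(plans, timedelta(hours=8), timedelta(hours=24)))
```

Whether the sum or the maximum is the right model depends on how much of the restore can actually run in parallel in a given environment.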
At the heart of any Documentum application is a relational database.  The DR strategy may be to use a vendor-provided clustered database, to have a second database that syncs with the first, or to simply recover from a cold backup.  The recovery point of the database should be older (less current) than that of the content storage area, to minimize the likelihood of content objects missing their associated content files.  This is a soft rule.  If, for the application, the metadata is high value and the content data is low value, this rule can be broken.
Next, let us look at the content storage area.  Again, the most common DR strategies are clustering and replication.  As mentioned earlier, the content storage area of a recovered system should be more current than the RDBMS.  This way, all content-bearing objects will have their content available.  Note that when files are updated, they are versioned, and the old files are not overwritten.  When objects are deleted, their associated content files remain in the storage area until a file cleanup job is run.  As a result, a storage area that is newer than the database will normally still hold every file the restored metadata refers to.
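The ordering rule between the two recovery points can be checked before a restore is attempted.  The following is a minimal sketch, assuming the recovery points are known as timestamps; the function names are hypothetical and nothing here calls Documentum itself.

```python
from datetime import datetime, timedelta


def content_gap(db_rp: datetime, content_rp: datetime) -> timedelta:
    """How far the database is ahead of the content storage area.

    Zero means the rule holds: the database is no more current than
    the content store. A positive gap is the window in which objects
    may exist in the database without their content files.
    """
    return max(db_rp - content_rp, timedelta(0))


def restore_order_ok(db_rp: datetime, content_rp: datetime) -> bool:
    """True when the database recovery point is not newer than the content's."""
    return content_gap(db_rp, content_rp) == timedelta(0)


# Illustrative timestamps only.
db_rp = datetime(2015, 1, 8, 1, 0)
content_rp = datetime(2015, 1, 8, 2, 0)
print(restore_order_ok(db_rp, content_rp))
```

Because this is a soft rule, a positive gap is better treated as a warning to review than as a hard failure.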
In order for an Index Server to perform its duty, it must have a valid, up-to-date index.  In DR situations, this is not easy to guarantee.  One method for dealing with DR for an Index Server is to simply restore the Index Server but not the associated index.  This will require the Index Server to re-index the content (and full-text functionality will be limited during this time), but will alleviate concerns over the validity of the index.  Other strategies include restoring both the Index Server and the index.  Realize that if the index is more current than the Content Server’s recovery point, it may contain invalid entries (documents that do not exist in the Content Server).  If the index is less current than the restored Content Server, documents may have been marked as indexed without actually being in the index.  This can be mitigated by re-indexing the documents that may have been added or modified during the delta.
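The three cases described above (no index restored, index ahead of the Content Server, index behind it) can be written down as a small decision helper.  This is only a sketch: the function name and the returned labels are hypothetical, and the actual re-indexing would be done with whatever tooling the Index Server provides.

```python
from datetime import datetime
from typing import Optional, Tuple

Window = Tuple[datetime, datetime]


def plan_index_recovery(index_rp: Optional[datetime],
                        content_server_rp: datetime) -> Tuple[str, Optional[Window]]:
    """Classify the restored index relative to the Content Server.

    Returns a label and, where relevant, the delta window to examine.
    """
    if index_rp is None:
        # Index was not restored: everything has to be re-indexed.
        return ("full_reindex", None)
    if index_rp > content_server_rp:
        # Index may reference documents the Content Server no longer has.
        return ("index_ahead", (content_server_rp, index_rp))
    if index_rp < content_server_rp:
        # Documents changed in this window may be marked indexed but are not.
        return ("index_behind", (index_rp, content_server_rp))
    return ("consistent", None)


# Illustrative timestamps only.
action, window = plan_index_recovery(datetime(2015, 1, 7, 22, 0),
                                     datetime(2015, 1, 8, 2, 0))
print(action, window)
```

In the "index_behind" case the returned window is the delta during which added or modified documents should be queued for re-indexing.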
Thankfully, Content Server and the application servers are relatively stateless, and can therefore be restored from any backup that contains the custom code and configuration in effect at the recovery point of the Content Server.  Simply starting the applications should suffice to restore the system to working order.
This post highlights the complexity of Documentum DR; the devil always lies in the details.  Do not try to implement DR without first verifying the infrastructure architecture, the installation and configuration procedures, and the redundant hardware in other environments.