When developing a disaster recovery (DR) architecture and plan
for any enterprise application, two factors come into play: recovery time (RT) and
recovery point (RP). Recovery time is the
amount of time it takes to restore the system to working order. Recovery point is the point in time to which
the data can be restored. These
requirements must be gathered prior to developing a disaster recovery plan (and
ideally, before developing the system infrastructure). For a single-service application (e.g.,
a relational database), once the RT and RP values have been determined,
implementing them is straightforward. However, for an application that is
composed of several interdependent services (e.g., Documentum Content Server),
each service must have its own DR strategy, and the combination of those
strategies must meet the application’s RT and RP values. Let us take a high-level look at the services
that comprise a typical Documentum application and what their inter-service
dependencies are.
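As a rough illustration of how per-service strategies combine into one application-level recovery time, the sketch below walks a dependency graph and computes when each service would be back. The service names, the restore times in minutes, and the dependency map are all hypothetical, and it assumes that independent services are restored in parallel while a dependent service is restored only after everything it relies on is up.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical restore-time estimates per service, in minutes.
RESTORE_MINUTES = {
    "database": 60,
    "content_storage": 90,
    "content_server": 15,
    "index_server": 30,
    "app_server": 10,
}

# Hypothetical dependencies: service -> services that must be up first.
DEPENDS_ON = {
    "database": set(),
    "content_storage": set(),
    "content_server": {"database", "content_storage"},
    "index_server": {"content_server"},
    "app_server": {"content_server"},
}


def ready_times(restore_minutes, depends_on):
    """Return the minute at which each service is back in working order."""
    ready = {}
    # static_order() yields every service after all of its dependencies.
    for service in TopologicalSorter(depends_on).static_order():
        start = max((ready[dep] for dep in depends_on[service]), default=0)
        ready[service] = start + restore_minutes[service]
    return ready


def meets_rt(restore_minutes, depends_on, rt_minutes):
    """True if the slowest service is back within the application's RT."""
    return max(ready_times(restore_minutes, depends_on).values()) <= rt_minutes


if __name__ == "__main__":
    for name, minute in ready_times(RESTORE_MINUTES, DEPENDS_ON).items():
        print(f"{name}: ready after {minute} min")
    print("Meets a 120-minute RT:", meets_rt(RESTORE_MINUTES, DEPENDS_ON, 120))
```

With these made-up numbers the Index Server is the last service to come back, so it is the one that decides whether the application's RT is met.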
At the heart of any Documentum application is a relational
database. The DR strategy may be to use
a vendor-provided clustered database, to keep a second database in sync
with the first, or simply to recover from a cold backup. The recovery point of the database should be
older (less current) than that of the content storage area, to minimize the
likelihood that content objects are restored without their associated content files. This is a soft rule: if, for the application, the metadata is high
value and the content data is low value, the rule can be broken.
Next, let us look at the content storage area. Again, the most common DR strategies are
clustering and replication. As
mentioned earlier, the content storage area of a recovered system should be
more current than the RDBMS. This way,
all content-bearing objects will have their content available. Note that when files are updated, they are
versioned, and the old files are not overwritten. When objects are deleted, their associated
content files are not removed until a file clean job is run. A content store that is newer than the database therefore still holds every file that the older metadata references, provided no clean job has run in the interval.
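The ordering rule between the database and the content storage area can be checked mechanically after a restore. The following is a minimal sketch, not a Documentum API: the function names, the object IDs, and the file paths are hypothetical, and it assumes you can export a mapping from object ID to content file path from the restored database and a listing of the files present in the restored storage area.

```python
from datetime import datetime


def rp_order_ok(database_rp: datetime, content_rp: datetime) -> bool:
    """The database should be restored to a point no newer than the content store."""
    return database_rp <= content_rp


def objects_missing_content(object_to_file: dict[str, str],
                            restored_files: set[str]) -> list[str]:
    """Return IDs of objects whose content file is absent from the restored store."""
    return sorted(
        object_id
        for object_id, path in object_to_file.items()
        if path not in restored_files
    )


if __name__ == "__main__":
    # Hypothetical recovery points for the two services.
    database_rp = datetime(2024, 1, 10, 2, 0)
    content_rp = datetime(2024, 1, 10, 6, 0)
    print("RP order respected:", rp_order_ok(database_rp, content_rp))

    # Hypothetical metadata export and storage listing.
    metadata = {"doc-001": "/store/00/01.pdf", "doc-002": "/store/00/02.pdf"}
    files = {"/store/00/01.pdf"}
    print("Objects missing content:", objects_missing_content(metadata, files))
```

An empty result from the second check suggests that every restored object still has its content, which is what the ordering rule is meant to achieve.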
In order for an Index Server to perform its duty, it must
have a valid, up-to-date index. In DR
situations, this is not easy to guarantee.
One method for dealing with DR for an Index Server is to restore
the Index Server itself but not the associated index.
This requires the Index Server to re-index the content (and full-text
search functionality will be limited during this time), but it removes any concern
over the validity of the index. Another
strategy is to restore both the Index Server and the index. Realize that if the index is more current
than the Content Server’s recovery point, it may contain invalid entries (for documents
that do not exist in the Content Server).
If the index is less current than the restored Content Server, documents
may be marked as indexed without actually being in the index. This can be mitigated by
re-indexing the documents that may have been added or modified during the delta.
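The two mismatch cases above can be turned into a simple repair plan. This sketch is illustrative only and does not call any Index Server API: the function name and document IDs are hypothetical, and it assumes you can obtain the last-modified time of every document in the restored repository and the set of document IDs present in the restored index.

```python
from datetime import datetime


def plan_index_repair(index_rp: datetime,
                      repository_docs: dict[str, datetime],
                      indexed_ids: set[str]) -> tuple[list[str], list[str]]:
    """Return (documents to re-index, index entries to purge).

    repository_docs maps document ID -> last-modified time in the restored
    repository; indexed_ids holds the IDs found in the restored index.
    """
    # Index older than the repository: anything changed after the index's
    # recovery point, or absent from the index, must be indexed again.
    reindex = sorted(
        doc_id
        for doc_id, modified in repository_docs.items()
        if modified > index_rp or doc_id not in indexed_ids
    )
    # Index newer than the repository: entries for documents that no longer
    # exist in the restored repository are invalid.
    purge = sorted(indexed_ids - set(repository_docs))
    return reindex, purge


if __name__ == "__main__":
    index_rp = datetime(2024, 1, 10, 12, 0)
    docs = {
        "doc-a": datetime(2024, 1, 9, 9, 0),    # unchanged since the index RP
        "doc-b": datetime(2024, 1, 10, 15, 0),  # modified during the delta
        "doc-c": datetime(2024, 1, 10, 16, 0),  # added during the delta
    }
    reindex, purge = plan_index_repair(index_rp, docs, {"doc-a", "doc-b", "doc-z"})
    print("Re-index:", reindex)
    print("Purge:", purge)
```

Filtering on the modification time is what keeps the re-indexing limited to the delta rather than the whole repository.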
Thankfully, the Content Server and application servers are
relatively stateless, and can therefore be restored to any point that includes all
of the custom code and configuration that coincide with the RP of the Content Server. Simply starting the applications should then suffice to
restore the system to working order.
This post highlights the complexity of
Documentum DR; the devil always lies in the details. Do not try to implement DR without verifying
the infrastructure architecture, installation and configuration procedures, and
redundant hardware in other environments.