Worldox Profile Database Technology
Worldox® Dual Database Architecture
By Steven Feldberg
The Role of the Database in DMS
Central Database: Single Point of Access, Single Point of Failure
Distributed Database: The Whole is Greater than the Sum of the Parts
Worldox: The Best of Both Worlds
"Rather than accept the intrinsic limitations of one particular database model over all others, the designers of Worldox have chosen to combine the robustness and reliability of a distributed data model with the speed and efficiency inherent in a centralized database."
Worldox® Enterprise Document Manager® employs a two-tiered architecture in the implementation of its back-end document profile database technology. This approach combines the best aspects of a failure-resistant distributed database with the speed of access inherent in a centralized data repository. Worldox thus satisfies customer requirements for quick, global access to document repositories, and also provides the added protection of a redundant, fault-tolerant database structure.
The Role of the Database in Document Management Systems
Document management software is designed to coordinate and control the documents created, maintained, and used within a firm or organization. Virtually all electronic document management systems also offer additional functionality, including version control, document archiving, full-text indexing, content-based retrieval, network mirroring, workflow, and so forth. But the heart and soul of document management consists of cataloging and tracking documents.
This means that the document manager must somehow extract or derive "information about the documents," often referred to as document profiles or metadata, and keep it separate from the information in the documents themselves. It is critical to distinguish document profile information from document content. The role that documents serve within an organization cannot be overstated. Documents are a firm's intellectual assets, in much the same way that staff are its human assets. In order to manage employees effectively, firms maintain human resource records, frequently handled by an entire department dedicated to that function. A document management system performs an analogous function by maintaining document resource records.
The document manager must have the means to store, edit and retrieve the document resource records, or profile information, for the documents under its control. The obvious solution is to use a database system of some kind. Storing profile information in a database affords all the advantages inherent in database technology to the document manager. By means of a database the document manager can handle a vast amount of information that can be stored, organized, and searched quickly and efficiently by large numbers of users. The database contributes an element of structure in parallel with the document repository, which consists of largely unstructured data.
In determining an approach to implementing database technology for a document management system (DMS), the essential questions must revolve around how best to leverage:
- the informational requirements inherent in managing a large document set
- available system resources at a DMS client site
- the skill level of information system staff (to support a particular database implementation)
- the day-to-day realities of document flow and usage experienced by users of the DMS solution.
Keep in mind that document management system databases are not generally end-user facing: most users have no direct experience of the database in typical usage scenarios, any more than most automobile drivers have direct experience of their car's engine while driving. When it works as designed, the driver can disregard it and concentrate on getting from one place to another. As with automobile engines, the database engine of a DMS should be unobtrusive, efficient, powerful, and reliable. It should enable customers to achieve desired outcomes at an acceptable performance level without draining resources excessively or breaking down when needed most.
Central Database: Single Point of Access, Single Point of Failure
A document management system that employs a central database stores all document profile information in a single, monolithic database. Typically this is a relational database management system (RDBMS) that resides on a dedicated server attached to the network. As users of the document management system work with documents across the network, the DMS routes all document profile updates and information requests to the server housing the central database.
Typically the database is an adjunct to the documents, which tend to remain distributed throughout the network. Database records therefore contain a pointer of some sort that "attaches" the record to the corresponding document's physical location. An extreme application of a centralized database approach to DMS, however, forgoes this level of abstraction and actually stores the documents within the database structure, often as a binary large object, or BLOB. While highly secure, this approach can also increase the risk of lost documents or impeded productivity during a hardware or software breakdown.
A centralized database offers several key benefits:
- controlled access to the document profile repository
- quick, efficient searching
Many DMS vendors have constructed systems that use SQL database technology. Ostensibly, this is to enhance interoperability and to take advantage of the client organization's existing expertise with the technology, if present. SQL databases offer these benefits:
- the ability to leverage existing SQL proficiency (if present)
- scalable support for wide area networks
- an opportunity to consolidate data stores.
On the down side, a single, centralized database containing the entire body of information about a firm's document repository presents a single point of failure. Without a properly defined and strictly enforced backup regime, the failure or corruption, for any reason, of a centrally stored document profile repository can be catastrophic. If the sole document profile database is corrupted, restoring any lost information becomes a drop-everything, mission-critical operation. Even with current backups, at a minimum the DMS (and probably the network as well) will have to be "brought down," bringing productivity to a standstill.
While the potential for a "doomsday" DMS scenario is not necessarily high in well implemented and well maintained systems, it does happen from time to time. Stories abound of firms having to stop all work and search for the latest backup tapes or, worse, having to recreate document profiles from scratch.
The more likely scenario on the downside is that the server hosting the centralized DBMS may go down. Without a redundant system in place, such as mirroring, the loss of the server will generally force a network-wide work stoppage.
In large organizations with the staff and the know-how, an SQL back-end to a DMS can prove beneficial on several fronts. However, in smaller organizations, or those lacking in-house SQL expertise, a dedicated SQL database can prove to be a serious resource drain, both on the network and on budgets that are bound to expand in order to accommodate outsourcing SQL database expertise. SQL databases are powerful, but that power comes at a cost.
Be assured that installing an SQL database is in no way a turnkey operation. It takes studied expertise to set up the database, and ongoing tuning, backups, and continual upkeep in order to keep it running within acceptable performance limits.
It is the presence of these all too real concerns that has led to another approach…
Distributed Database: The Whole is Greater than the Sum of the Parts
A distributed database offers an antidote to "putting all your eggs in one basket." The key difference between a centralized and a distributed database is where the information is stored. A document management system using a distributed database stores all the necessary document profiling information dispersed throughout the network. The information is stored at various points, or nodes. The storage nodes may be based on the network architecture or on disk structure. Though the data is stored in multiple physical locations, the distributed database is centrally managed. Distributing the database compartmentalizes the information, greatly reducing the chance of losing the entire database.
A typical implementation in a DMS might allocate the profile database along the lines of the logical structure shared by the documents within the network. For example, each system folder (directory) containing profiled documents will have a corresponding DMS data set.
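The folder-per-data-set layout described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the file name `profiles.json`, the JSON encoding, and the helper names `load_profiles` and `save_profile` are hypothetical and do not reflect any vendor's actual file format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical data-set file name; real DMS formats are not described here.
DATA_SET_NAME = "profiles.json"


def load_profiles(folder: Path) -> dict:
    """Return the profile records stored alongside the documents in `folder`."""
    data_set = folder / DATA_SET_NAME
    if not data_set.exists():
        return {}
    return json.loads(data_set.read_text(encoding="utf-8"))


def save_profile(folder: Path, doc_name: str, profile: dict) -> None:
    """Add or update one document's profile in that folder's own data set."""
    profiles = load_profiles(folder)
    profiles[doc_name] = profile
    (folder / DATA_SET_NAME).write_text(
        json.dumps(profiles, indent=2), encoding="utf-8"
    )


def demo() -> tuple:
    """Profile one document in each of two folders and read both sets back."""
    with tempfile.TemporaryDirectory() as root:
        contracts = Path(root) / "contracts"
        memos = Path(root) / "memos"
        contracts.mkdir()
        memos.mkdir()
        save_profile(contracts, "00012345.doc", {"author": "SF", "type": "lease"})
        save_profile(memos, "00012346.doc", {"author": "JD", "type": "memo"})
        return load_profiles(contracts), load_profiles(memos)
```

Because each folder carries its own data set, a write touches only the folder where the work occurs, and damage to one data set leaves the others readable.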
The distributed data approach offers several advantages:
- There is no single point of failure (with respect to data loss)
- Updates happen close to where the work occurs and may therefore happen faster (less latency in the system)
- Certain operations, such as "cloning" profiles, can be faster
- No special procedures are required to back up and restore document profile information; it gets backed up during the standard backup routine
- The distributed approach is inherently scalable
- A distributed database should optimize processor use as individual data sets required by processing functions will be smaller.
As with the centralized database, the nature of the distributed database is transparent to the end user. A user simply works with the document management software to save documents, retrieve documents, etc. The DMS handles whatever mediation is necessary with respect to the profile database.
There are, to be sure, some disadvantages to using a distributed database within a document management system:
- Where databases share media with the profiled documents, database resources may be unprotected (e.g. users may delete database files)
- Data synchronization must be managed by the software which requires processing overhead
- Searching across data sets can be slow.
Worldox: The Best of Both Worlds
Rather than accept the intrinsic limitations of one particular database model over all others, the designers of Worldox have chosen to combine the robustness and reliability of a distributed data model with the speed and efficiency inherent in a centralized database. This unique approach enables Worldox to exploit the benefits of each model, while circumventing the various drawbacks.
The implementation of a two-tiered database architecture sets Worldox apart from other document managers on several fronts:
- Access to files when the central database or network is unavailable: The primary benefit is to productivity. Worldox users maintain access to their documents, along with the profile information, even when the central database is unavailable for any reason. With Worldox's mirroring technology, users maintain full access even in the event of a network shutdown. Mirroring, coupled with the Worldox distributed database, maintains local copies of work files that users can work with while off-line. When network connections are restored, Worldox automatically synchronizes local documents with their network counterparts.
- The central database can be rebuilt from distributed databases: If the central Worldox database becomes corrupted for any reason, it can be rebuilt directly from the information stored in the local databases. This ensures that the database contains the latest information available about the documents under DMS control. Restoring a database from an overnight backup, by contrast, omits any information updated after the backup was created.
- Distributed databases are included in routine backup procedures: Each distributed Worldox database is backed up as part of routine network backup procedures. The databases, in essence, "live with the documents," and therefore remain tightly integrated with them.
- The distributed databases compartmentalize document profile information: If a database becomes corrupted somehow, the damage is localized, allowing quick and easy recovery without impacting the entire network. There is no need to take down the entire network, or to limit access to the document management system while restoring the local database affected.
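The rebuild scenario in the list above can be sketched in a few lines. This is an illustrative model only, assuming that each local data set can be read as a mapping from document name to profile fields; the type aliases and the functions `rebuild_central` and `search` are hypothetical names, not part of Worldox.

```python
from typing import Dict, List, Tuple

Profile = Dict[str, str]
LocalSets = Dict[str, Dict[str, Profile]]      # folder -> document -> profile
CentralIndex = Dict[Tuple[str, str], Profile]  # (folder, document) -> profile


def rebuild_central(local_sets: LocalSets) -> CentralIndex:
    """Recreate a central index purely from the per-folder data sets."""
    central: CentralIndex = {}
    for folder, profiles in local_sets.items():
        for doc, profile in profiles.items():
            # Copy each record so the central index stays independent
            # of the local data it was built from.
            central[(folder, doc)] = dict(profile)
    return central


def search(central: CentralIndex, field: str, value: str) -> List[Tuple[str, str]]:
    """Find documents by profile field with one pass over the central index."""
    return sorted(
        key for key, profile in central.items() if profile.get(field) == value
    )
```

Since the local data sets remain the record of every change, the rebuilt index reflects their current state rather than the state captured by the last backup.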
Worldox stores profile data in a distributed database consisting of linked pairs of data files residing in each directory containing profiled documents. Each distributed data set consists of two files, XNAME.LIB and XNAME.CRS, which are described briefly in the following table.
||Contains document numbers (DOS names), extended names, and file security information.
||Contains custom profile field and version control information.
In many customer sites, smaller installations for the most part, the Worldox distributed database fully satisfies the database requirements of the DMS and delivers acceptable search performance. In such sites Worldox works exclusively with its distributed database.
This highlights another advantage of the Worldox dual-database architecture: it adapts to the requirements of an organization's computing and business environment. Worldox does not present a "one size fits all" architecture forcing organizations to accede to its needs.
In larger sites, however, which comprise the majority of Worldox installations, documents typically number in the hundreds of thousands. Such sites are able to reap the full benefit of the Worldox dual database organization by implementing a central profile data structure to achieve optimal search performance.
The central database duplicates the information contained in the distributed database, but in a form that is optimized for search speed and efficiency. The central database is created and maintained by a dedicated computer called an Index Server (or Indexer, for short). An installation may have one or more Index Servers, depending upon the number of documents that are profiled, whether full-text searching is implemented, and other network configuration factors.
From the users' perspective, Worldox databases essentially fade into the background. XNAME files are generally configured to be hidden system files so that users can neither see them nor accidentally delete them. The central database resides on a file server on the network, insulated from direct user access.
As users work with documents within the Worldox desktop, their actions affect the local database. Worldox client software does not directly contact the central database except when conducting searches. Updates to the central database occur by means of queued change files that each Worldox desktop posts to the Indexer.
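The update flow just described (write locally first, then queue the change for the Indexer) might be modeled as follows. This is a sketch under stated assumptions: an in-memory `deque` stands in for the queued change files, and the `Desktop` and `Indexer` classes and their methods are hypothetical names chosen for illustration.

```python
from collections import deque


class Desktop:
    """Client side: saves to the local data set first, then queues the change."""

    def __init__(self, local_sets: dict, change_queue: deque) -> None:
        self.local_sets = local_sets
        self.change_queue = change_queue

    def update_profile(self, folder: str, doc: str, profile: dict) -> None:
        # The local write happens immediately and does not involve the indexer.
        self.local_sets.setdefault(folder, {})[doc] = dict(profile)
        # The same change is queued for later posting to the central database.
        self.change_queue.append((folder, doc, dict(profile)))


class Indexer:
    """Server side: applies queued changes to the central index in order."""

    def __init__(self, change_queue: deque) -> None:
        self.change_queue = change_queue
        self.central: dict = {}

    def drain(self) -> int:
        """Apply every pending change and return how many were applied."""
        applied = 0
        while self.change_queue:
            folder, doc, profile = self.change_queue.popleft()
            self.central[(folder, doc)] = profile
            applied += 1
        return applied
```

In this model the central index lags the local data sets until `drain` runs, which is why a client save never has to wait on the central database.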
The Worldox Central Profile Database
As a general rule, a Worldox site will implement a central profile database for each volume or repository containing documents. All central database files are grouped in a shared directory which is located beneath the Worldox program directory by default. This directory must be visible to all Worldox users on the network. The default directory naming convention places each database in a separate subdirectory as shown in the following example:
- The profile database for a volume labeled 'X' is located in F:\Worldox\ISYS\PROFX.
- The profile database for a volume labeled 'Q' is located in F:\Worldox\ISYS\PROFQ.
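The default naming convention in the examples above can be expressed as a small helper. This is a sketch: the function name `central_profile_dir` is hypothetical, and the base directory is simply the one used in the examples, which a given site may well override.

```python
from pathlib import PureWindowsPath

# Base directory taken from the examples above; an assumption for other sites.
DEFAULT_ISYS_DIR = PureWindowsPath(r"F:\Worldox\ISYS")


def central_profile_dir(
    volume_label: str, base: PureWindowsPath = DEFAULT_ISYS_DIR
) -> str:
    """Return the default central profile directory for a volume label."""
    if not volume_label:
        raise ValueError("volume label must not be empty")
    # The convention shown above appends the label to the prefix 'PROF'.
    return str(base / f"PROF{volume_label}")
```

For instance, `central_profile_dir("X")` yields the first path listed above.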
The Worldox central profile database uses the Isys search and retrieval engine from Odyssey Development Corporation. The following table provides brief descriptions of the files that make up the Worldox central profile database.
||List of directories under Worldox control.
||The primary central database file, containing file records from the XNAME files.
||Index to FTA
||Index to FTB
||Isys Index A
||Isys Index B
||Isys Index C
||Isys database configuration file
All changes to document profiles made by the client software (that is, the Worldox Desktop) are written directly to the local distributed database (the XNAME files in the document's directory). This reinforces overall system reliability in that all changes are saved immediately, meaning that profile "transactions" are subject to neither network outages nor latency. As changes are made throughout the network, the Worldox Index Servers identify the changes and post them to the central database.
In addition to continuously polling the network, Worldox Index Servers can follow a scheduled update process programmed by the site administrator. This allows each unique installation to implement a central database update strategy that is optimized for that site.
© World Software Corp. 1998
June 19, 1998
If you have any questions about the information presented in this paper, or would like more information, please contact World Software Corporation via electronic mail at Worldox@worldox.com. This paper is available at the official Worldox web site at www.worldox.com.
World Software Corporation
124 Prospect St.
Ridgewood, NJ USA 07450