Provides classes and methods to record information about changes in DSpace. The main class is {@link org.dspace.history.HistoryManager}.

Overview

The purpose of the history subsystem is two-fold:

Note that the history data is not expected to provide current information about the archive; it simply records what has happened in the past.

Harmony Model

The Harmony project describes a simple and powerful approach for modeling temporal data. The DSpace history framework adopts this model.

The Harmony model is used by the serialization mechanism (and ultimately by agents who interpret the serializations); users of the History API need not be aware of it.

High-Level Approach

When anything of archival interest occurs in DSpace, the saveHistory method of the HistoryManager is invoked. The parameters to the call are references to the objects involved in the event.

The history data component receives the objects of interest via method calls on the HistoryManager. (Note that this does not preclude other interested parties from acting on the object as well.) Upon receiving an object, it serializes the state of all archive objects referred to by it, and creates Harmony-style objects and associations to describe the relationships between the objects. (A simple example is given below.) Note that each archive object must have a unique identifier to allow linkage between discrete events; this is discussed under Unique Ids below.

The serializations (including the Harmony objects and associations) are persisted to the filesystem, and marked as history data in the database.

Archival Events

Creating, modifying or deleting Community, Collection, Item, EPerson, WorkflowItem, or WorkspaceItem objects (including adding subobjects) are generally of archival interest.

Serializations

The serialization of an archival object consists of:

The implementation of serialization simply calls methods in the Content API.

Version information for the serializer itself is included in the serialization.

Unique Ids

To be able to trace the history of an object, it is essential that the object have a unique identifier.

After discussion, the unique identifiers are only weakly tied to the Handle system. Instead, the identifier consists of:

Why Synchronization Is Not a Problem

A classic problem with having data in two places is synchronization; it is no longer always clear which data source is authoritative.

This is not a problem for the history data because:

Storage

The History system stores each serialization together with an MD5 checksum of that serialization. When another object is serialized, the checksum of the new serialization is matched against the existing checksums for that object. If the checksum already exists, the serialization is not stored again; a reference to the existing serialization is used instead.

Note that since none of the serializations are ever deleted, reference counting is unnecessary.
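
The checksum comparison described above can be sketched as follows. This is a minimal in-memory illustration, not the DSpace implementation: the class ChecksumStoreSketch and its methods are hypothetical names, the serialization is assumed to be UTF-8 text, and the checksum itself is assumed to serve as the reference to a stored serialization.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the history store. */
class ChecksumStoreSketch {
    // object id -> (MD5 checksum -> stored serialization)
    private final Map<String, Map<String, String>> byObject = new HashMap<>();

    /** Returns the hex-encoded MD5 checksum of the serialization text. */
    static String md5Hex(String serialization) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(serialization.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Stores the serialization unless an identical one is already held
     * for this object. Returns the checksum, used here as the reference.
     */
    String store(String objectId, String serialization) {
        String checksum = md5Hex(serialization);
        byObject.computeIfAbsent(objectId, k -> new HashMap<>())
                .putIfAbsent(checksum, serialization);
        return checksum;
    }

    /** Number of distinct serializations held for the object. */
    int storedCount(String objectId) {
        Map<String, String> stored = byObject.get(objectId);
        return stored == null ? 0 : stored.size();
    }
}
```

Because entries are only ever added, the sketch needs no bookkeeping about how many history records point at a given serialization, which mirrors the remark about reference counting.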

History Maps

The history data is not initially stored in a queryable form. Nonetheless, it is a good idea to provide at least basic indications of what is stored, and where it is stored.

Therefore the following simple RDBMS tables are used:

History table:
  history_id INTEGER PRIMARY KEY,
  -- When the history data was created (this data is also in the history!)
  timestamp  TIMESTAMP

HistoryReference table:
  history_reference_id INTEGER PRIMARY KEY,
  -- Reference to the history
  history_id           INTEGER REFERENCES History(history_id),
  -- Object Id
  object_id            VARCHAR(64)

One way to trace the history of an object would be to find all history serializations which refer to it (in the HistoryReference table), and unwind and interpret these. When the history data refers to a serialization of an object, use the History table to find the serialization.
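
The lookup described above might look roughly like this. The sketch models the two tables as in-memory lists rather than querying a database; the class HistoryMapSketch, its row classes, and the trace method are hypothetical names introduced only for illustration.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory model of the History and HistoryReference tables. */
class HistoryMapSketch {
    /** One row of the History table. */
    static final class HistoryRow {
        final int historyId;
        final Instant timestamp;
        HistoryRow(int historyId, Instant timestamp) {
            this.historyId = historyId;
            this.timestamp = timestamp;
        }
    }

    /** One row of the HistoryReference table. */
    static final class HistoryReferenceRow {
        final int historyReferenceId;
        final int historyId;
        final String objectId;
        HistoryReferenceRow(int historyReferenceId, int historyId, String objectId) {
            this.historyReferenceId = historyReferenceId;
            this.historyId = historyId;
            this.objectId = objectId;
        }
    }

    final List<HistoryRow> history = new ArrayList<>();
    final List<HistoryReferenceRow> references = new ArrayList<>();

    /** Returns the History rows that refer to the object, oldest first. */
    List<HistoryRow> trace(String objectId) {
        // Step 1: collect the history ids that reference the object.
        Set<Integer> ids = new HashSet<>();
        for (HistoryReferenceRow ref : references) {
            if (ref.objectId.equals(objectId)) {
                ids.add(ref.historyId);
            }
        }
        // Step 2: resolve those ids against the History table.
        List<HistoryRow> result = new ArrayList<>();
        for (HistoryRow row : history) {
            if (ids.contains(row.historyId)) {
                result.add(row);
            }
        }
        // Step 3: order by creation time so the events can be unwound in sequence.
        result.sort(Comparator.comparing((HistoryRow row) -> row.timestamp));
        return result;
    }
}
```

Interpreting each returned serialization is left out; the sketch only covers finding which history records mention the object.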

Example

An item is submitted to a collection via bulk upload. When (and if) the Item is eventually added to the collection, the saveHistory method is called, with references to the Item, its Collection, the User who performed the bulk upload, and some indication of the fact that it was submitted via a bulk upload.

When called, the HistoryManager does the following:

It creates the following new resources (all with unique ids):

  • An event
  • A state
  • An action

It also generates the following relationships:

      event  --atTime-->     time
      event  --hasOutput-->  state
      Item   --inState-->    state
      state  --contains-->   Item
      action --creates-->    Item
      event  --hasAction-->  action
      action --usesTool-->   DSpace Upload
      action --hasAgent-->   User
    

The HistoryManager serializes the state of all archival objects involved (in this case, the Item, the User, and the DSpace Upload). It creates entries in the history map which associate the archival objects with the generated serializations.
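
As a rough illustration of the example, the eight relationships can be generated as subject/predicate/object triples. Everything named here is an assumption made for the sketch: the class HarmonyEventSketch, the Relationship type, the describeCreation method, and the use of random UUIDs as the unique ids for the event, state, and action. It does not show how the HistoryManager itself represents or serializes these associations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical generator for the relationships listed in the example. */
class HarmonyEventSketch {
    /** A single subject --predicate--> object association. */
    static final class Relationship {
        final String subject;
        final String predicate;
        final String object;
        Relationship(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
        @Override
        public String toString() {
            return subject + " --" + predicate + "--> " + object;
        }
    }

    /** Builds the relationships for an Item created by a user with a tool. */
    static List<Relationship> describeCreation(String itemId, String userId,
                                               String toolId, String time) {
        // New resources, each with its own unique id (UUIDs are an assumption).
        String event = "event:" + UUID.randomUUID();
        String state = "state:" + UUID.randomUUID();
        String action = "action:" + UUID.randomUUID();

        List<Relationship> relationships = new ArrayList<>();
        relationships.add(new Relationship(event, "atTime", time));
        relationships.add(new Relationship(event, "hasOutput", state));
        relationships.add(new Relationship(itemId, "inState", state));
        relationships.add(new Relationship(state, "contains", itemId));
        relationships.add(new Relationship(action, "creates", itemId));
        relationships.add(new Relationship(event, "hasAction", action));
        relationships.add(new Relationship(action, "usesTool", toolId));
        relationships.add(new Relationship(action, "hasAgent", userId));
        return relationships;
    }
}
```

Calling describeCreation with the ids of the Item, the User, and the upload tool yields one list per event; printing each element reproduces the arrow notation used in the example.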

What History Data Is Not

History Data is not version control information. No effort has been made to provide diffs, merges, or highly efficient storage; instead, effort is focused on simple remembrance. Note that this does not preclude more sophisticated approaches later.

History Data does not attempt to reconcile any contradictions in the data it serializes.

History Data does not keep track of any kind of current state.