The org.dspace.core
package provides some basic classes that are used throughout the DSpace code.
ConfigurationManager
The configuration manager is responsible for reading the main dspace.cfg
properties file, managing the 'template' configuration files for other applications such as Apache, and for obtaining the text for e-mail messages.
The system is configured by editing the relevant files in /dspace/config
, as described in the configuration section.
When editing configuration files for applications that DSpace uses, such as Apache, remember to edit the file in /dspace/config/templates
and then run /dspace/bin/install-configs
rather than editing the 'live' version directly!
The ConfigurationManager
class can also be invoked as a command line tool, with two possible uses:
/dspace/bin/install-configs
This processes and installs configuration files for other applications, as described in the configuration section.
/dspace/bin/dsrun org.dspace.core.ConfigurationManager -property property.name
This writes the value of property.name
from dspace.cfg
to the standard output, so that shell scripts can access the DSpace configuration. For an example, see /dspace/bin/start-handle-server
. If the property has no value, nothing is written.
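The behaviour of the -property option can be approximated with the standard java.util.Properties class. This is only a sketch, not the ConfigurationManager code: the class name PropertyLookupSketch is made up for this example, and any property names passed to it are up to the caller.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of DSpace.
class PropertyLookupSketch {
    // Returns the text that would be written to standard output: the value
    // of the named property, or an empty string if it has no value.
    static String lookup(String cfgText, String name) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(name);
        return value == null ? "" : value.trim();
    }
}
```

A shell script would capture the output of the real command in the same way it would capture the return value here; an empty result means the property is not set.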
The Constants class contains constants that are used to represent types of object and actions in the database. For example, authorization policies can relate to objects of different types, so the resourcepolicy
table has columns resource_id
, which is the internal ID of the object, and resource_type_id
, which indicates whether the object is an item, collection, bitstream etc. The value of resource_type_id
is taken from the Constants
class, for example Constants.ITEM
.
The Context
class is central to the DSpace operation. Any code that wishes to use any API in the business logic layer must first create a Context
object. This is akin to opening a connection to a database (which is in fact one of the things that happens.)
A context object is involved in most method calls and object constructors, so that the method or object has access to information about the current operation. When the context object is constructed, the following information is automatically initialized:
A connection to the database. This is a transaction-safe connection; that is, the 'auto-commit' flag is set to false.
A cache of content management API objects. Each time a content object is created (for example Item
or Bitstream
) it is stored in the Context
object. If the object is then requested again, the cached copy is used. Apart from reducing database use, this addresses the problem of having two copies of the same object in memory in different states.
The following information is also held in a context object, though it is the responsibility of the application creating the context object to fill it out correctly:
The current authenticated user, if any
Any 'special groups' the user is a member of. For example, a user might automatically be part of a particular group based on the IP address they are accessing DSpace from, even though they don't have an e-person record. Such a group is called a 'special group'.
Any extra information from the application layer that should be added to log messages that are written within this context. For example, the Web UI adds a session ID, so that when the logs are analysed the actions of a particular user in a particular session can be tracked.
A flag indicating whether authorization should be circumvented. This should only be used in rare, specific circumstances. For example, when first installing the system, there are no authorized administrators who would be able to create an administrator account!
As noted above, the public API is trusted, so it is up to applications in the application layer to use this flag responsibly.
Typical use of the context object will involve constructing one, and setting the current user if one is authenticated. Several operations may be performed using the context object. If all goes well, complete
is called to commit the changes and free up any resources used by the context. If anything has gone wrong, abort
is called to roll back any changes and free up the resources.
You should always abort
a context if any error happens during its lifespan; otherwise the data in the system may be left in an inconsistent state. You can also commit
a context, which means that any changes are written to the database, and the context is kept active for further use.
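The complete/abort pattern can be illustrated with a small self-contained model. SketchContext is a stand-in written for this example only (it is not org.dspace.core.Context), and a Map plays the role of the database; the point is the shape of the try/catch block, which is what calling code would typically look like.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for illustration only; NOT org.dspace.core.Context.
class SketchContext {
    private final Map<String, String> storage;              // plays the database
    private final Map<String, String> pending = new HashMap<>();
    private boolean valid = true;

    SketchContext(Map<String, String> storage) { this.storage = storage; }

    void write(String key, String value) {
        if (!valid) throw new IllegalStateException("context is no longer valid");
        pending.put(key, value);
    }

    // Writes pending changes and keeps the context usable.
    void commit() { storage.putAll(pending); pending.clear(); }

    // Writes pending changes and frees the context.
    void complete() { commit(); valid = false; }

    // Discards pending changes and frees the context.
    void abort() { pending.clear(); valid = false; }
}

class ContextLifecycleSketch {
    // Returns true if the change reached storage, false if it was rolled back.
    static boolean rename(Map<String, String> storage, String id, String newName) {
        SketchContext context = new SketchContext(storage);
        try {
            if (newName == null || newName.isEmpty()) {
                throw new IllegalArgumentException("empty name");
            }
            context.write(id, newName);
            context.complete();   // all went well: commit and free resources
            return true;
        } catch (RuntimeException e) {
            context.abort();      // something went wrong: roll back
            return false;
        }
    }
}
```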
Sending e-mails is pretty easy. Just use the configuration manager's getEmail
method, set the arguments and recipients, and send.
The e-mail texts are stored in /dspace/config/emails
. They are processed by the standard java.text.MessageFormat
. At the top of each e-mail are listed the appropriate arguments that should be filled out by the sender. Example usage is shown in the org.dspace.core.Email
Javadoc API documentation.
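To show how java.text.MessageFormat fills in the numbered arguments, here is a minimal sketch. The template text and the class name EmailTemplateSketch are invented for this example; the real templates are the files in /dspace/config/emails, and the real sending code is in org.dspace.core.Email.

```java
import java.text.MessageFormat;

// Hypothetical example, not the org.dspace.core.Email class.
class EmailTemplateSketch {
    // Invented template; {0}, {1} and {2} are the arguments the sender fills in.
    static final String TEMPLATE =
        "Dear {0},\n\nYour submission \"{1}\" has been archived with Handle {2}.\n";

    static String fill(Object... arguments) {
        return MessageFormat.format(TEMPLATE, arguments);
    }
}
```

One detail worth remembering when writing templates: MessageFormat treats a single quote as the start of quoted text, so a literal apostrophe in a template has to be written as two single quotes.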
The log manager consists of a method that creates a standard log header, and returns it as a string suitable for logging. Note that this class does not actually write anything to the logs; the log header returned should be logged directly by the sender using an appropriate Log4J call, so that information about where the logging is taking place is also stored.
The level of logging can be configured on a per-package or per-class basis by editing /dspace/config/templates/log4j.properties
and then executing /dspace/bin/install-configs
. You will need to stop and restart Tomcat for the changes to take effect.
A typical log entry looks like this:
2002-11-11 08:11:32,903 INFO org.dspace.app.webui.servlet.DSpaceServlet @ anonymous:session_id=BD84E7C194C2CF4BD0EC3A6CAD0142BB:view_item:handle=1721.1/1686
This breaks down as follows:
Date and time, milliseconds | 2002-11-11 08:11:32,903 |
Level (FATAL, WARN, INFO or DEBUG) | INFO |
Java class | org.dspace.app.webui.servlet.DSpaceServlet |
@ | |
User email or anonymous | anonymous |
: | |
Extra log info from context | session_id=BD84E7C194C2CF4BD0EC3A6CAD0142BB |
: | |
Action | view_item |
: | |
Extra info | handle=1721.1/1686 |
The above format allows the logs to be easily parsed and analysed. The /dspace/bin/log-reporter
script is a simple tool for analysing logs. Try:
/dspace/bin/log-reporter --help
It's a good idea to 'nice' this log reporter to avoid an impact on server performance.
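Because the fields are separated by ' @ ' and colons, a log line can be split up with a few lines of code. The following is a sketch of such a parser, not the log-reporter script; the class name LogEntrySketch is made up, and it assumes that the user and the extra log info from the context contain no colons themselves.

```java
// Hypothetical parser for the log line layout described above.
class LogEntrySketch {
    final String timestamp;    // date and time, with milliseconds
    final String level;
    final String javaClass;
    final String user;         // e-mail address or "anonymous"
    final String extraLogInfo; // from the context, e.g. the session ID
    final String action;
    final String extraInfo;

    private LogEntrySketch(String[] left, String[] header) {
        timestamp = left[0] + " " + left[1];
        level = left[2];
        javaClass = left[3];
        user = header[0];
        extraLogInfo = header[1];
        action = header[2];
        extraInfo = header[3];
    }

    static LogEntrySketch parse(String line) {
        int at = line.indexOf(" @ ");
        if (at < 0) {
            throw new IllegalArgumentException("no ' @ ' separator: " + line);
        }
        // Left of the '@': date, time, level and class, separated by spaces.
        String[] left = line.substring(0, at).trim().split("\\s+");
        // Right of the '@': user:extra log info:action:extra info.
        // The limit of 4 keeps any colons inside the last field intact.
        String[] header = line.substring(at + 3).split(":", 4);
        if (left.length != 4 || header.length != 4) {
            throw new IllegalArgumentException("unexpected layout: " + line);
        }
        return new LogEntrySketch(left, header);
    }
}
```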
Utils
contains miscellaneous utility methods that are required in a variety of places throughout the code, and thus have no particular 'home' in a subsystem.
The content management API package org.dspace.content
contains Java classes for reading and manipulating content stored in the DSpace system. This is the API that components in the application layer will probably use most.
Classes corresponding to the main elements in the DSpace data model (Community
, Collection
, Item
, Bundle
and Bitstream
) are sub-classes of the abstract class DSpaceObject
. The Item
object handles the Dublin Core metadata record.
Each class generally has one or more static find
methods, which are used to instantiate content objects. Constructors do not have public access and are just used internally. The reasons for this are:
"Constructing" an object may be misconstrued as the action of creating an object in the DSpace system, for example one might expect something like:
Context context = new Context();
Item myItem = new Item(context, id);
to construct a brand new item in the system, rather than simply instantiating an in-memory instance of an object in the system.
find
methods may often be called with invalid IDs, and return null
in such a case. A constructor would have to throw an exception in this case. A null
return value from a static method can in general be dealt with more simply in code.
If an instantiation representing the same underlying archival entity already exists, the find
method can simply return that same instantiation to avoid multiple copies and any inconsistencies which might result.
Collection
, Bundle
and Bitstream
do not have create
methods; rather, one has to create an object using the relevant method on the container. For example, to create a collection, one must invoke createCollection
on the community that the collection is to appear in:
Context context = new Context();
Community existingCommunity = Community.find(context, 123);
Collection myNewCollection = existingCommunity.createCollection();
The primary reason for this is for determining authorization. In order to know whether an e-person may create an object, the system must know which container the object is to be added to. It makes no sense to create a collection outside of a community, and the authorization system does not have a policy for that.
Item
s are first created in the form of an implementation of InProgressSubmission
. An InProgressSubmission
represents an item under construction; once it is complete, it is installed into the main archive and added to the relevant collection by the InstallItem
class. The org.dspace.content
package provides an implementation of InProgressSubmission
called WorkspaceItem
; this is a simple implementation that contains some fields used by the Web submission UI. The org.dspace.workflow
package also contains an implementation called WorkflowItem
which represents a submission undergoing a workflow process.
In the previous chapter there is an overview of the item ingest process which should clarify the previous paragraph. Also see the section on the workflow system.
Community
and BitstreamFormat
do have static create
methods; one must be a site administrator to have authorization to invoke these.
Classes whose name begins DC
are for manipulating Dublin Core metadata, as explained below.
The FormatIdentifier
class attempts to guess the bitstream format of a particular bitstream. Presently, it does this simply by looking at any file extension in the bitstream name and matching it up with the file extensions associated with bitstream formats. Hopefully this can be greatly improved in the future!
The ItemIterator
class allows items to be retrieved from storage one at a time, and is returned by methods that may return a large number of items, more than would be desirable to have in memory at once.
The ItemComparator
class is an implementation of the standard java.util.Comparator
that can be used to compare and order items based on a particular Dublin Core metadata field.
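The idea of ordering items by one metadata field with a java.util.Comparator can be sketched without any DSpace classes. Here each item is modelled as a plain Map from field name to value, which is an assumption of this example and not how Item stores its metadata; the class name is likewise made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the org.dspace.content.ItemComparator class.
class MetadataComparatorSketch {
    // Orders items by one metadata field, ignoring case; items that lack
    // the field are placed last.
    static Comparator<Map<String, String>> byField(String field) {
        return Comparator.comparing(
            (Map<String, String> item) -> item.get(field),
            Comparator.nullsLast(String.CASE_INSENSITIVE_ORDER));
    }

    // Returns a sorted copy, leaving the caller's list untouched.
    static List<Map<String, String>> sorted(List<Map<String, String>> items, String field) {
        List<Map<String, String>> copy = new ArrayList<>(items);
        copy.sort(byField(field));
        return copy;
    }
}
```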
When creating, modifying or for whatever reason removing data with the content management API, it is important to know when changes happen in-memory, and when they occur in the physical DSpace storage.
Primarily, one should note that no change made using a particular org.dspace.core.Context
object will actually be made in the underlying storage unless complete
or commit
is invoked on that Context
. If anything should go wrong during an operation, the context should always be aborted by invoking abort
, to ensure that no inconsistent state is written to the storage.
Additionally, some changes made to objects only happen in-memory. In these cases, invoking the update
method lines up the in-memory changes to occur in storage when the Context
is committed or completed. In general, methods that change any [meta]data field only make the change in-memory; methods that involve relationships with other objects in the system line up the changes to be committed with the context. See individual methods in the API Javadoc.
Some examples to illustrate this are shown below:
Context context = new Context(); Bitstream b = Bitstream.find(context, 1234); b.setName("newfile.txt"); b.update(); context.complete(); | Will change storage |
Context context = new Context(); Bitstream b = Bitstream.find(context, 1234); b.setName("newfile.txt"); b.update(); context.abort(); | Will not change storage (context aborted) |
Context context = new Context(); Bitstream b = Bitstream.find(context, 1234); b.setName("newfile.txt"); context.complete(); | The new name will not be stored since update was not invoked |
Context context = new Context(); Bitstream bs = Bitstream.find(context, 1234); Bundle bnd = Bundle.find(context, 5678); bnd.add(bs); context.complete(); | The bitstream will be included in the bundle, since update doesn't need to be called |
Instantiating some content objects also causes other content objects to be loaded into memory.
Instantiating a Bitstream
object causes the appropriate BitstreamFormat
object to be instantiated. Of course the Bitstream
object does not load the underlying bits from the bitstream store into memory!
Instantiating a Bundle
object causes the appropriate Bitstream
objects (and hence BitstreamFormat
s) to be instantiated.
Instantiating an Item
object causes the appropriate Bundle
objects (etc.) and hence BitstreamFormat
s to be instantiated. All the Dublin Core metadata associated with that item are also loaded into memory.
The reasoning behind this is that for the vast majority of cases, anyone instantiating an item object is going to need information about the bundles and bitstreams within it, and this methodology allows that to be done in the most efficient way and is simple for the caller. For example, in the Web UI, the servlet (controller) needs to pass information about an item to the viewer (JSP), which needs to have all the information in-memory to display the item without further accesses to the database which may cause errors mid-display.
You do not need to worry about multiple in-memory instantiations of the same object, or any inconsistencies that may result; the Context
object keeps a cache of the instantiated objects. The find
methods of classes in org.dspace.content
will use a cached object if one exists.
It may be that in enough cases this automatic instantiation of contained objects reduces performance in situations where it is important; if this proves to be true the API may be changed in the future to include a loadContents
method or somesuch, or perhaps a Boolean parameter indicating what to do will be added to the find
methods.
When a Context
object is completed, aborted or garbage-collected, any objects instantiated using that context are invalidated and should not be used (in much the same way an AWT button is invalid if the window containing it is destroyed).
The DCValue
class is a simple container that represents a single Dublin Core element, optional qualifier, value and language. The other classes starting with DC
are utility classes for handling types of data in Dublin Core, such as people's names and dates. As supplied, the DSpace registry of elements and qualifiers corresponds to the Library Application Profile for Dublin Core. It should be noted that these utility classes assume that the values will be in a certain syntax, which will be true for all data generated within the DSpace system, but since Dublin Core does not always define strict syntax, this may not be true for Dublin Core originating outside DSpace.
Below is the specific syntax that DSpace expects various fields to adhere to:
Element | Qualifier | Syntax | Helper Class |
---|---|---|---|
date | Any or unqualified | ISO 8601 in the UTC time zone, with either year, month, day, or second precision. Examples: "2000", "2002-10", "2002-08-14", "1999-01-01T14:35:23Z" | DCDate |
contributor | Any or unqualified | In general last name, then a comma, then first names, then any additional information like "Jr.". If the contributor is an organization, then simply the name. Examples: "Doe, John", "Smith, John Jr.", "van Dyke, Dick", "Massachusetts Institute of Technology" | DCPersonName |
language | iso | A two letter code taken from ISO 639, followed optionally by a two letter country code taken from ISO 3166. Examples: "en", "fr", "en_US" | DCLanguage |
relation | ispartofseries | The series name, followed by a semicolon, followed by the number in that series. Alternatively, just free text. Examples: "MIT-TR; 1234", "My Report Series; ABC-1234", "NS1234" | DCSeriesNumber |
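To make the contributor and ispartofseries syntaxes concrete, the sketch below splits values on the first comma or semicolon. It is not the DCPersonName or DCSeriesNumber code; the class and method names are invented, and the choice to return an empty second part when no separator is present is an assumption of this example.

```java
// Hypothetical helpers illustrating the syntaxes in the table above.
class DublinCoreSyntaxSketch {
    // Splits on the first occurrence of the separator and trims both parts.
    // A value without the separator is returned whole, with an empty second part.
    private static String[] splitOnce(String value, char separator) {
        int pos = value.indexOf(separator);
        if (pos < 0) {
            return new String[] { value.trim(), "" };
        }
        return new String[] {
            value.substring(0, pos).trim(),
            value.substring(pos + 1).trim()
        };
    }

    // "Smith, John Jr." gives { "Smith", "John Jr." };
    // an organization name gives { name, "" }.
    static String[] splitPersonName(String value) {
        return splitOnce(value, ',');
    }

    // "MIT-TR; 1234" gives { "MIT-TR", "1234" };
    // free text gives { text, "" }.
    static String[] splitSeriesNumber(String value) {
        return splitOnce(value, ';');
    }
}
```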
The primary classes are:
org.dspace.content.WorkspaceItem | contains an Item before it enters a workflow |
org.dspace.workflow.WorkflowItem | contains an Item while in a workflow |
org.dspace.workflow.WorkflowManager | responds to events, manages the WorkflowItem states |
org.dspace.content.Collection | contains List of defined workflow steps |
org.dspace.eperson.Group | people who can perform workflow tasks are defined in EPerson Groups |
org.dspace.core.Email | used to email messages to Group members and submitters |
The workflow system models the states of an Item in a state machine with 5 states (SUBMIT, STEP_1, STEP_2, STEP_3, ARCHIVE). STEP_1, STEP_2 and STEP_3 are the three optional steps where the item can be viewed and corrected by different groups of people. In practice there are 8 states, because each step also has a pooled state (STEP_1_POOL, STEP_2_POOL and STEP_3_POOL) in which an item waits to enter the corresponding primary state.
The WorkflowManager is invoked by events. While an Item is being submitted, it is held by a WorkspaceItem. Calling the start() method in the WorkflowManager converts a WorkspaceItem to a WorkflowItem, and begins processing the WorkflowItem's state. Since all three steps of the workflow are optional, if no steps are defined, then the Item is simply archived.
Workflows are set per Collection, and steps are defined by creating corresponding entries in the List named workflowGroup. If you wish the workflow to have a step 1, use the administration tools for Collections to create a workflow Group with members who you want to be able to view and approve the Item, and the workflowGroup[0] becomes set with the ID of that Group.
If a step is defined in a Collection's workflow, then the WorkflowItem's state is set to that step_POOL. This pooled state is the WorkflowItem waiting for an EPerson in that group to claim the step's task for that WorkflowItem. The WorkflowManager emails the members of that Group notifying them that there is a task to be performed (the text is defined in config/emails,) and when an EPerson goes to their 'My DSpace' page to claim the task, the WorkflowManager is invoked with a claim event, and the WorkflowItem's state advances from STEP_x_POOL to STEP_x (where x is the corresponding step.) The EPerson can also generate an 'unclaim' event, returning the WorkflowItem to the STEP_x_POOL.
The WorkflowManager also handles the advance() event, which advances the WorkflowItem to the next state. If there are no further states, the WorkflowItem is removed and the Item is archived. An EPerson performing one of the tasks can reject the Item, which stops the workflow, rebuilds the WorkspaceItem for it and sends a rejection note to the submitter. More drastically, an abort() event is generated by the admin tools to cancel a workflow outright.
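The state transitions described in this section can be modelled as a small state machine. This is a sketch of the transitions only, not the WorkflowManager implementation: the class and method signatures are invented, a boolean array stands in for the Collection's workflow groups, and rejection, e-mail notification and abort are left out.

```java
// Hypothetical model of the workflow states; not org.dspace.workflow.WorkflowManager.
class WorkflowSketch {
    enum State { SUBMIT, STEP_1_POOL, STEP_1, STEP_2_POOL, STEP_2, STEP_3_POOL, STEP_3, ARCHIVE }

    private static final State[] POOLS = { State.STEP_1_POOL, State.STEP_2_POOL, State.STEP_3_POOL };
    private static final State[] STEPS = { State.STEP_1, State.STEP_2, State.STEP_3 };

    // Pool of the first defined step at or after index 'from', else ARCHIVE.
    private static State nextPool(boolean[] stepDefined, int from) {
        for (int i = from; i < STEPS.length; i++) {
            if (stepDefined[i]) return POOLS[i];
        }
        return State.ARCHIVE;
    }

    // start(): a submitted item enters the first defined step's pool,
    // or is archived straight away if no steps are defined.
    static State start(boolean[] stepDefined) {
        return nextPool(stepDefined, 0);
    }

    // claim: STEP_x_POOL becomes STEP_x.
    static State claim(State s) {
        for (int i = 0; i < STEPS.length; i++) {
            if (s == POOLS[i]) return STEPS[i];
        }
        throw new IllegalStateException("cannot claim from " + s);
    }

    // unclaim: STEP_x returns to STEP_x_POOL.
    static State unclaim(State s) {
        for (int i = 0; i < STEPS.length; i++) {
            if (s == STEPS[i]) return POOLS[i];
        }
        throw new IllegalStateException("cannot unclaim from " + s);
    }

    // advance(): move to the next defined step's pool, or archive.
    static State advance(State s, boolean[] stepDefined) {
        for (int i = 0; i < STEPS.length; i++) {
            if (s == STEPS[i]) return nextPool(stepDefined, i + 1);
        }
        throw new IllegalStateException("cannot advance from " + s);
    }
}
```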
The org.dspace.administer
package contains some classes for administering a DSpace system that are not generally needed by most applications.
The CreateAdministrator
class is a simple command-line tool, executed via /dspace/bin/create-administrator
, that creates an administrator e-person with information entered from standard input. This is generally used only once when a DSpace system is initially installed, to create an initial administrator who can then use the Web administration UI to further set up the system. This script does not check for authorization, since it is typically run before there are any e-people to authorize! Since it must be run as a command-line tool on the server machine, generally this shouldn't cause a problem. A possibility is to have the script only operate when there are no e-people in the system already, though in general, someone with access to command-line scripts on your server is probably in a position to do what they want anyway!
The DCType
class is similar to the org.dspace.content.BitstreamFormat
class. It represents an entry in the Dublin Core type registry, that is, a particular element and qualifier, or unqualified element. It is in the administer
package because it is only generally required when manipulating the registry itself. Elements and qualifiers are specified as literals in org.dspace.content.Item
methods and the org.dspace.content.DCValue
class. Only administrators may modify the Dublin Core type registry.
The org.dspace.administer.RegistryLoader
class contains methods for initialising the Dublin Core type registry and bitstream format registry with entries in an XML file. Typically this is executed via the command line during the build process (see build.xml
in the source.) To see examples of the XML formats, see the files in config/registries
in the source directory. There is no XML schema; the files aren't validated strictly when loaded in.
DSpace keeps track of registered users with the org.dspace.eperson.EPerson
class. The class has methods to create and manipulate an EPerson
such as get and set methods for first and last names, email, and password. (Actually, there is no getPassword()
method--an MD5 hash of the password is stored, and can only be verified with the checkPassword()
method.) There are find methods to find an EPerson by email (which is assumed to be unique,) or to find all EPeople in the system.
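Storing only a hash means a password can be checked but never read back. The sketch below shows the hash-and-compare idea with the standard java.security.MessageDigest class. It is not the EPerson code: the class name, the hexadecimal encoding of the hash and the use of UTF-8 are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not the org.dspace.eperson.EPerson class.
class PasswordHashSketch {
    // Returns the MD5 hash of the password as lower-case hexadecimal.
    static String md5Hex(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Verifies an attempt by hashing it and comparing with the stored hash;
    // the original password is never recovered.
    static boolean checkPassword(String storedHash, String attempt) {
        return storedHash.equals(md5Hex(attempt));
    }
}
```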
The EPerson
object should probably be reworked to allow for easy expansion; the current EPerson object tracks pretty much only what MIT was interested in tracking - first and last names, email, phone. The access methods are hardcoded and should probably be replaced with methods to access arbitrary name/value pairs for institutions that wish to customize what EPerson information is stored.
Groups are simply lists of EPerson
objects. Other than membership, Group
objects have only one other attribute: a name. Group names must be unique, so we have adopted naming conventions where the role of the group is its name, such as COLLECTION_100_ADD
. Groups add and remove EPerson objects with addMember()
and removeMember()
methods. One important thing to know about groups is that they store their membership in memory until the update()
method is called - so when modifying a group's membership don't forget to invoke update()
or your changes will be lost! Since group membership is used heavily by the authorization system a fast isMember()
method is also provided.
Another kind of Group is also implemented in DSpace--special Groups. The Context
object for each session carries around a List of Group IDs that the user is also a member of--currently the MITUser Group ID is added to the list of a user's special groups if certain IP address or certificate criteria are met.
The primary classes are:
org.dspace.authorize.AuthorizeManager |
does all authorization, checking policies against Groups |
org.dspace.authorize.ResourcePolicy |
defines all allowable actions for an object |
org.dspace.eperson.Group |
all policies are defined in terms of EPerson Groups |
The authorization system is based on the classic 'police state' model of security; no action is allowed unless it is expressed in a policy. The policies are attached to resources (hence the name ResourcePolicy
,) and detail who can perform that action. The resource can be any of the DSpace object types, listed in org.dspace.core.Constants
(BITSTREAM
, ITEM
, COLLECTION
, etc.) The 'who' is made up of EPerson groups. The actions are also in Constants.java
(READ
, WRITE
, ADD
, etc.) The only non-obvious actions are ADD
and REMOVE
, which are authorizations for container objects. To be able to create an Item, you must have ADD
permission in a Collection, which contains Items. (Communities, Collections, Items, and Bundles are all container objects.)
Currently most of the read policy checking is done with items--communities and collections are assumed to be openly readable, but items and their bitstreams are checked. Separate policy checks for items and their bitstreams enables policies that allow publicly readable items, but parts of their content may be restricted to certain groups.
The AuthorizeManager
class' authorizeAction(Context, object, action)
is the primary source of all authorization in the system. It gets a list of all of the ResourcePolicies in the system that match the object and action. It then iterates through the policies, extracting the EPerson Group from each policy, and checks to see if the EPersonID from the Context is a member of any of those groups. If all of the policies are queried and no permission is found, then an AuthorizeException
is thrown. An authorizeActionBoolean()
method is also supplied that returns a boolean for applications that require higher performance.
ResourcePolicies are very simple, and there are quite a lot of them. Each can only list a single group, a single action, and a single object. So each object will likely have several policies, and if multiple groups share permissions for actions on an object, each group will get its own policy. (It's a good thing they're small.)
All users are assumed to be part of the public group (ID=0.) DSpace admins (ID=1) are automatically part of all groups, much like super-users in the Unix OS. The Context object also carries around a List of special groups, which are also first checked for membership. These special groups are used at MIT to indicate membership in the MIT community, something that is very difficult to enumerate in the database! When a user logs in with an MIT certificate or with an MIT IP address, the login code adds this MIT user group to the user's Context.
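The policy check described above (match object and action, then test group membership, with the public and administrator groups treated specially) can be sketched as a simple loop. This is a simplified model, not the AuthorizeManager code: the Policy class, the method signature and the integer codes used for resource types and actions are all assumptions of the example. The set of user groups is expected to include any special groups from the Context.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of the policy check; not org.dspace.authorize.AuthorizeManager.
class AuthorizeSketch {
    static final int PUBLIC_GROUP = 0; // every user is a member
    static final int ADMIN_GROUP = 1;  // members are treated as members of all groups

    // One policy: a single object, a single action, a single group.
    static final class Policy {
        final int resourceType, resourceId, action, groupId;

        Policy(int resourceType, int resourceId, int action, int groupId) {
            this.resourceType = resourceType;
            this.resourceId = resourceId;
            this.action = action;
            this.groupId = groupId;
        }
    }

    // Returns true if some policy for this object and action names a group
    // the user belongs to. No matching policy means no permission.
    static boolean isAuthorized(List<Policy> policies, Set<Integer> userGroups,
                                int resourceType, int resourceId, int action) {
        if (userGroups.contains(ADMIN_GROUP)) {
            return true;
        }
        for (Policy p : policies) {
            boolean matches = p.resourceType == resourceType
                && p.resourceId == resourceId
                && p.action == action;
            if (matches && (p.groupId == PUBLIC_GROUP || userGroups.contains(p.groupId))) {
                return true;
            }
        }
        return false;
    }
}
```

A caller that wants the exception-throwing behaviour would wrap this check and throw when it returns false.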
Where do items get their read policies? From their collection's read policy. There once was a separate item read default policy in each collection, and perhaps there will be again since it appears that administrators are notoriously bad at defining collections' read policies. There is also code in place to enable policies that are timed, that is, have a start and end date. However, the admin tools to enable these sorts of policies have not been written.
The org.dspace.handle
package contains two classes: HandleManager
is used to create and look up Handles, and HandlePlugin
is used to expose and resolve DSpace Handles for the outside world via the CNRI Handle Server code.
Handles are stored internally in the handle
database table in the form:
1721.123/4567
Typically when they are used outside of the system they are displayed in either URI or "URL proxy" forms:
hdl:1721.123/4567
http://hdl.handle.net/1721.123/4567
It is the responsibility of the caller to extract the basic form from whichever displayed form is used.
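Extracting the basic form from a displayed form is a matter of stripping a known prefix. A minimal sketch follows, assuming only the two displayed forms shown above; the class and method names are invented and are not part of HandleManager.

```java
// Hypothetical helper; not part of org.dspace.handle.HandleManager.
class HandleFormSketch {
    private static final String URI_PREFIX = "hdl:";
    private static final String PROXY_PREFIX = "http://hdl.handle.net/";

    // Returns the basic form (for example 1721.123/4567) from the URI form,
    // the URL proxy form, or a value that is already in the basic form.
    static String canonicalForm(String displayed) {
        String s = displayed.trim();
        if (s.startsWith(URI_PREFIX)) {
            return s.substring(URI_PREFIX.length());
        }
        if (s.startsWith(PROXY_PREFIX)) {
            return s.substring(PROXY_PREFIX.length());
        }
        return s;
    }
}
```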
The handle
table maps these Handles to resource type/resource ID pairs, where resource type is a value from org.dspace.core.Constants
and resource ID is the internal identifier (database primary key) of the object. This allows Handles to be assigned to any type of object in the system, though as explained in the functional overview, only communities, collections and items are presently assigned Handles.
HandleManager
contains static methods for:
Finding the Handle for a DSpaceObject, though this is usually only invoked by the object itself, since DSpaceObject has a getHandle method
Retrieving the DSpaceObject identified by a particular Handle
HandlePlugin
is a simple implementation of the Handle Server's net.handle.hdllib.HandleStorage
interface. It only implements the basic Handle retrieval methods, which get information from the handle
database table. The CNRI Handle Server is configured to use this plug-in via its config.dct
file.
Note that since the Handle server runs as a separate JVM to the DSpace Web applications, it uses a separate 'Log4J' configuration, since Log4J does not support multiple JVMs using the same daily rolling logs. This alternative configuration is held as a template in /dspace/config/templates/log4j-handle-plugin.properties
, written to /dspace/config/log4j-handle-plugin.properties
by the install-configs
script. The /dspace/bin/start-handle-server
script passes in the appropriate command line parameters so that the Handle server uses this configuration.
DSpace's search code is a simple API which currently wraps the Lucene search engine. The first half of the search task is indexing, and org.dspace.search.DSIndexer
is the indexing class, which contains indexContent()
which if passed an Item
, Community
, or Collection
, will add that content's fields to the index. The methods unIndexContent()
and reIndexContent()
remove and update content's index information. The DSIndexer
class also has a main()
method which will rebuild the index completely. This is invoked by the dspace/bin/index-all
script. The intent was for the main()
method to be invoked on a regular basis to avoid index corruption, but we have had no problem with that so far. Which fields are indexed by DSIndexer
? These fields are currently hardcoded in the indexItemContent(), indexCollectionContent() and indexCommunityContent() methods.
The query class DSQuery
contains the three flavors of doQuery()
methods--one searches the DSpace site, and the other two restrict searches to Collections and Communities. The results from a query are returned as three lists of handles; each list represents a type of result. One list is a list of Items with matches, and the other two are Collections and Communities that match. This separation allows the UI to handle the types of results gracefully without resolving all of the handles first to see what kind of content the handle points to. The DSQuery
class also has a main()
method for debugging via command-line searches.
Currently we have our own Analyzer and Tokenizer classes (DSAnalyzer
and DSTokenizer
) to customize our indexing. They invoke the stemming and stop word features within Lucene. We create an IndexReader
for each query, which we now realize isn't the most efficient use of resources - we seem to run out of filehandles on really heavy loads. (A wildcard query can open many filehandles!) Since Lucene is thread-safe, a better future implementation would be to have a single Lucene IndexReader shared by all queries, which is invalidated and re-opened when the index changes. Future API growth could include relevance scores (Lucene generates them, but we ignore them) and abstractions for more advanced search concepts such as booleans.
The DSIndexer
class shipped with DSpace indexes the Dublin Core metadata in the following way:
Search Field | Taken from Dublin Core Fields |
---|---|
Authors | contributor.* creator.* description.statementofresponsibility |
Titles | title.* |
Keywords | subject.* |
Abstracts | description.abstract description.tableofcontents |
Series | relation.ispartofseries |
MIME types | format.mimetype |
Sponsors | description.sponsorship |
Identifiers | identifier.* |
The org.dspace.search
package also provides a 'harvesting' API. This allows callers to extract information about items modified within a particular timeframe, and within a particular scope (all of DSpace, or a community or collection.) Currently this is used by the Open Archives Initiative metadata harvesting protocol application, and the e-mail subscription code.
The Harvest.harvest
method is invoked with the required scope and start and end dates. Either date can be omitted. The dates should be in the ISO8601, UTC time zone format used elsewhere in the DSpace system.
HarvestedItemInfo
objects are returned. These objects are simple containers with basic information about the items falling within the given scope and date range. Depending on parameters passed to the harvest
method, the containers
and item
fields may have been filled out with the IDs of communities and collections containing an item, and the corresponding Item
object respectively. Electing not to have these fields filled out means the harvest operation executes considerably faster.
In case it is required, Harvest
also offers a method for creating a single HarvestedItemInfo
object, which might make things easier for the caller.
The browse API maintains indices of dates, authors and titles, and allows callers to extract parts of these:
Values of the Dublin Core element title
(unqualified) are indexed. These are sorted in a case-insensitive fashion, with any leading article removed. For example:
The DSpace System
Appears under 'D' rather than 'T'.
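The title normalization can be sketched as below. The list of leading articles is an assumption made for illustration (the text does not say which articles are removed), and TitleSortKey is a hypothetical name rather than a DSpace class.

```java
import java.util.Locale;

// Hypothetical sketch of a case-insensitive title sort key.
class TitleSortKey {
    // Assumed article list; the actual set of articles may differ.
    private static final String[] ARTICLES = {"the ", "a ", "an "};

    // Lower-cases the title and strips one leading article, if present.
    static String sortKey(String title) {
        String key = title.trim().toLowerCase(Locale.ROOT);
        for (String article : ARTICLES) {
            if (key.startsWith(article)) {
                return key.substring(article.length()).trim();
            }
        }
        return key;
    }

    public static void main(String[] args) {
        // The resulting key starts with 'd', so the title files under 'D'.
        System.out.println(sortKey("The DSpace System"));
    }
}
```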
Values of the contributor
(any qualifier or unqualified) element are indexed. Since contributor
values typically are in the form 'last name, first name', a simple case-insensitive alphanumeric sort is used which orders authors in last name order.
Note that this is an index of authors, and not items by author. If four items have the same author, that author will appear in the index only once. Hence, the index of authors may be greater or smaller than the index of titles; items often have more than one author, though the same author may have authored several items.
The author indexing in the browse API does have limitations:
Ideally, a name that appears as an author for more than one item would appear in the author index only once. For example, 'Doe, John' may be the author of tens of items. However, in practice, authors' names often appear in slightly different forms, for example:
Doe, John
Doe, John Stewart
Doe, John S.
Currently, the above three names would all appear as separate entries in the author index even though they may refer to the same author. For an author of several papers to appear correctly just once in the index, each item must specify exactly the same form of the name, which doesn't always happen in practice.
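This behaviour can be reproduced with a sorted set, as in the sketch below. It is an in-memory illustration only (the real index lives in database tables), the names are hypothetical, 'Smith, Alice' is invented sample data, and treating values that differ only in case as one entry is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory model of the author index.
class AuthorIndexSketch {
    // Each distinct author string appears once, sorted case-insensitively.
    static List<String> buildIndex(List<String> authorValues) {
        TreeSet<String> index = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        index.addAll(authorValues);
        return new ArrayList<>(index);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "Doe, John S.", "Doe, John", "Smith, Alice",
                "Doe, John Stewart", "Doe, John");
        // The exact duplicate collapses; the three variants stay separate.
        for (String author : buildIndex(values)) {
            System.out.println(author);
        }
    }
}
```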
Another issue is that two authors may have the same name, even within a single institution. If this is the case they may appear as one author in the index.
These issues are typically resolved in libraries with authority control records, which keep a 'preferred' form of the author's name together with extra information (such as date of birth/death) in order to distinguish between authors of the same name. Maintaining such records is a huge task with many issues, particularly when metadata is received directly from faculty rather than from trained library cataloguers. For these reasons, DSpace does not yet feature 'authority control' functionality.
Items are indexed by date of issue. This may be different from the date that an item appeared in DSpace; many items may have been originally published elsewhere beforehand. The Dublin Core field used is date.issued
. The ordering of this index may be reversed so 'earliest first' and 'most recent first' orderings are possible.
Note that the index is of items by date, as opposed to an index of dates. If 30 items have the same issue date (say 2002), then those 30 items all appear in the index adjacent to each other, as opposed to a single 2002 entry.
Since dates in DSpace Dublin Core are in ISO8601, all in the UTC time zone, a simple alphanumeric sort is sufficient to sort by date, including dealing with varying granularities of date reasonably. For example:
2001-12-10
2002
2002-04
2002-04-05
2002-04-09T15:34:12Z
2002-04-09T19:21:12Z
2002-04-10
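The ordering above can be checked with a plain string sort. The sketch below is illustrative only; IssueDateSort is a hypothetical name, and the reversal flag models the 'most recent first' ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demonstration that ISO8601 UTC dates sort as plain strings.
class IssueDateSort {
    static List<String> sortDates(List<String> dates, boolean mostRecentFirst) {
        List<String> sorted = new ArrayList<>(dates);
        Collections.sort(sorted); // natural (alphanumeric) string order
        if (mostRecentFirst) {
            Collections.reverse(sorted);
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> dates = Arrays.asList(
                "2002-04-10", "2002", "2002-04-09T19:21:12Z", "2001-12-10",
                "2002-04-05", "2002-04", "2002-04-09T15:34:12Z");
        for (String date : sortDates(dates, false)) {
            System.out.println(date);
        }
    }
}
```

A shorter date such as 2002 sorts before any longer date that begins with it, which is how the varying granularities end up in a reasonable order.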
In order to determine which items most recently appeared, rather than using the date of issue, an item's accession date is used. This is the Dublin Core field date.accessioned
. In other aspects this index is identical to the date of issue index.
One last operation the browse API can perform is to extract items by a particular author. The author does not have to be the primary author of an item for that item to be extracted. You can specify a scope, too; that is, you can ask for items by author X in collection Y, for example.
This particular flavour of browse is slightly simpler than the others. You cannot presently specify a particular subset of results to be returned. The API call will simply return all of the items by a particular author within a certain scope.
Note that the author of the item must exactly match the author passed in to the API; see the explanation about the caveats of the author index browsing to see why this is the case.
The API is generally invoked by creating a BrowseScope
object, and setting the parameters for which particular part of an index you want to extract. This is then passed to the relevant Browse
method call, which returns a BrowseInfo
object which contains the results of the operation. The parameters set in the BrowseScope
object are:
To illustrate, here is an example:
The results of invoking Browse.getItemsByTitle
with the above parameters might look like this:
Rabble-Rousing Rabbis From Sardinia
Reality TV: Love It or Hate It?
FOCUS> The Really Exciting Research Video
Recreational Housework Addicts: Please Visit My House
Regional Television Variation Studies
Revenue Streams
Ridiculous Example Titles: I'm Out of Ideas
Note that in the case of title and date browses, Item
objects are returned as opposed to actual titles. In these cases, you can specify the 'focus' to be a specific item, or a partial or full literal value. In the case of a literal value, if no entry in the index matches exactly, the closest match is used as the focus. It's quite reasonable to specify a focus of a single letter, for example.
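Locating a literal focus in a sorted index can be sketched with a binary search. This is an illustration under stated assumptions, not the Browse implementation: BrowseFocus is a hypothetical name, the keys are assumed to be normalized sort keys in ascending order, and 'closest match' is interpreted here as the first entry at or after the focus value.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of resolving a literal focus against a sorted index.
class BrowseFocus {
    // Returns the position of the exact match, or of the first entry that
    // sorts after the focus. Clamps to the last entry; -1 for an empty index.
    static int focusIndex(List<String> sortedKeys, String focus) {
        int pos = Collections.binarySearch(sortedKeys, focus);
        if (pos < 0) {
            pos = -pos - 1; // convert "not found" result to insertion point
        }
        return Math.min(pos, sortedKeys.size() - 1);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "rabble-rousing rabbis from sardinia",
                "reality tv: love it or hate it?",
                "really exciting research video",
                "recreational housework addicts: please visit my house");
        // A partial literal such as "rea" resolves to the nearest entry.
        System.out.println(keys.get(focusIndex(keys, "rea")));
    }
}
```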
Being able to specify a specific item to start at is particularly important with dates, since many items may have the same issue date. Say 30 items in a collection have the issue date 2002. To be able to page through the index 20 items at a time, you need to be able to specify exactly which item's 2002 is the focus of the browse; otherwise, each time you invoked the browse code, the results would start at the first item with the issue date 2002.
Author browses return String
objects with the actual author names. You can only specify the focus as a full or partial literal String
.
Another important point to note is that presently, the browse indices contain metadata for all items in the main archive, regardless of authorization policies. This means that all items in the archive will appear to all users when browsing. Of course, should the user attempt to access a non-public item, the usual authorization mechanism will apply. Whether this approach is ideal is under review; implementing the browse API such that the results retrieved reflect a user's level of authorization may be possible, but rather tricky.
The browse API contains calls to add and remove items from the index, and to regenerate the indices from scratch. In general the content management API invokes the necessary browse API calls to keep the browse indices in sync with what is in the archive, so most applications will not need to invoke those methods.
If the browse index becomes inconsistent for some reason, the InitializeBrowse
class is a command line tool (generally invoked using the /dspace/bin/index-all
shell script) that causes the indices to be regenerated from scratch.
Presently, the browse API is not tremendously efficient. 'Indexing' takes the form of simply extracting the relevant Dublin Core value, normalizing it (lower-casing and removing any leading article in the case of titles), and inserting that normalized value with the corresponding item ID in the appropriate browse database table. Database views of this table include collection and community IDs for browse operations with a limited scope. When a browse operation is performed, a simple SELECT
query is performed, along the lines of:
SELECT item_id FROM ItemsByTitle ORDER BY sort_title OFFSET 40 LIMIT 20
There are two main drawbacks to this: Firstly, LIMIT
and OFFSET
are PostgreSQL-specific keywords. Secondly, the database is still actually performing dynamic sorting of the titles, so the browse code as it stands will not scale particularly well. The code does cache BrowseInfo
objects, so that common browse operations are performed quickly, but this is not an ideal solution.
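How a page of results maps onto OFFSET and LIMIT can be sketched as follows. The table and column names are taken from the query shown above; the BrowseQuery class and the zero-based page numbering are assumptions made for illustration.

```java
// Hypothetical builder for the paged browse query shown above.
class BrowseQuery {
    // Builds the SELECT for a window of the title index.
    static String itemsByTitle(int offset, int limit) {
        if (offset < 0 || limit <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and limit > 0");
        }
        return "SELECT item_id FROM ItemsByTitle ORDER BY sort_title"
                + " OFFSET " + offset + " LIMIT " + limit;
    }

    // Zero-based page number: page 2 with 20 items per page starts at row 40.
    static String itemsByTitlePage(int pageNumber, int pageSize) {
        return itemsByTitle(pageNumber * pageSize, pageSize);
    }

    public static void main(String[] args) {
        System.out.println(itemsByTitlePage(2, 20));
    }
}
```

Because the values are integers concatenated into a fixed statement, no user-supplied text reaches the SQL; any scope or literal value would need to be bound as a query parameter instead.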
The purpose of the history subsystem is to capture a time-based record of significant changes in DSpace, in a manner suitable for later refactoring or repurposing. Note that the history data is not expected to provide current information about the archive; it simply records what has happened in the past.
The Harmony project describes a simple and powerful approach for modeling temporal data. The DSpace history framework adopts this model. The Harmony model is used by the serialization mechanism (and ultimately by agents who interpret the serializations); users of the History API need not be aware of it. The content management API handles invocations of the history system. Users of the DSpace public API do not generally need to use the history API.
When anything of archival interest occurs in DSpace, the saveHistory
method of the HistoryManager
is invoked. The parameters contain references to anything of archival interest. On receiving them, the HistoryManager serializes the state of all archive objects referred to, and creates Harmony-style objects and associations to describe the relationships between the objects. (A simple example is given below.) Note that each archive object must have a unique identifier to allow linkage between discrete events; this is discussed under "Unique IDs" below.
The serializations (including the Harmony objects and associations) are persisted as files in the /dspace/history
(or other configured) directory. The history
and historystate
tables contain simple indices into the serializations in the file system.
The following events are significant enough to warrant history records:
The serialization of an archival object consists of:
To be able to trace the history of an object, it is essential that the object have a unique identifier. Since not all objects in the system have Handles, the unique identifiers are only weakly tied to the Handle system. Instead, the identifier consists of:
When an archive object is serialized, an object ID and MD5 checksum are recorded. When another object is serialized, the checksum for the serialization is matched against existing checksums for that object. If the checksum already exists, the object is not stored; a reference to the object is used instead. Note that since none of the serializations are deleted, reference counting is unnecessary.
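The checksum comparison can be sketched as below. This is an in-memory illustration, not the HistoryManager code: SerializationStore is a hypothetical name, serializations are modelled as strings, and the checksums are kept in a map rather than in the history tables.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of checksum-based de-duplication of serializations.
class SerializationStore {
    // Object ID -> MD5 checksums of serializations already stored for it.
    private final Map<String, Set<String>> checksumsByObject = new HashMap<>();

    static String md5Hex(String serialization) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(serialization.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if the serialization is new for this object and would be
    // written; false if an identical one exists and a reference suffices.
    boolean store(String objectId, String serialization) {
        String checksum = md5Hex(serialization);
        return checksumsByObject
                .computeIfAbsent(objectId, id -> new HashSet<>())
                .add(checksum);
    }

    public static void main(String[] args) {
        SerializationStore store = new SerializationStore();
        System.out.println(store.store("object-1", "state A")); // stored
        System.out.println(store.store("object-1", "state A")); // already known
    }
}
```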
The history data is not initially stored in a queryable form. Two simple RDBMS tables give basic indications of what is stored, and where. The history
table is an index of serializations with checksums and dates. The history_id
column corresponds to the file in which a serialization is stored. For example, if the history ID is 123456, it will be stored in the file:
/dspace/history/00/12/34/123456
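One layout consistent with this example is sketched below. The rule is inferred from the single path shown, so it is an assumption: the ID is zero-padded to eight digits, the first three digit pairs become directories, and the file is named with the unpadded ID. HistoryPath is a hypothetical name.

```java
// Hypothetical mapping from a history ID to its serialization file.
class HistoryPath {
    // Assumed rule: pad to 8 digits, use the first three pairs as
    // directories, and name the file with the unpadded ID.
    static String pathFor(String historyDir, int historyId) {
        if (historyId < 0) {
            throw new IllegalArgumentException("history ID must not be negative");
        }
        String padded = String.format("%08d", historyId);
        return historyDir
                + "/" + padded.substring(0, 2)
                + "/" + padded.substring(2, 4)
                + "/" + padded.substring(4, 6)
                + "/" + historyId;
    }

    public static void main(String[] args) {
        System.out.println(pathFor("/dspace/history", 123456));
    }
}
```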
The table also contains the date the serialization was written and the MD5 checksum of the serialization.
The historystate
table is supposed to indicate the most recent serialization of any given object.
An item is submitted to a collection via bulk upload. When (and if) the item is eventually added to the collection, the history method is called, with references to the item, its collection, the e-person who performed the bulk upload, and some indication of the fact that it was submitted via a bulk upload.
When called, the HistoryManager does the following: It creates the following new resources (all with unique ids):
It also generates the following relationships:
event --atTime--> time
event --hasOutput--> state
Item --inState--> state
state --contains--> Item
action --creates--> Item
event --hasAction--> action
action --usesTool--> DSpace Upload
action --hasAgent--> User
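These relationships are subject, predicate, object triples, and can be modelled as in the sketch below. HarmonyTriple is a hypothetical class written for illustration; it is not part of the history API and says nothing about how the serializations are actually encoded.

```java
// Hypothetical value class for one Harmony-style relationship.
class HarmonyTriple {
    final String subject;
    final String predicate;
    final String object;

    HarmonyTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    // Parses the arrow notation used above, e.g. "event --atTime--> time".
    static HarmonyTriple parse(String line) {
        int start = line.indexOf(" --");
        int end = line.indexOf("--> ", start + 3);
        if (start < 0 || end < 0) {
            throw new IllegalArgumentException("Not a relationship: " + line);
        }
        return new HarmonyTriple(
                line.substring(0, start).trim(),
                line.substring(start + 3, end),
                line.substring(end + 4).trim());
    }

    @Override
    public String toString() {
        return subject + " --" + predicate + "--> " + object;
    }

    public static void main(String[] args) {
        HarmonyTriple t = parse("action --usesTool--> DSpace Upload");
        System.out.println(t.subject + " | " + t.predicate + " | " + t.object);
    }
}
```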
The history component serializes the state of all archival objects involved (in this case, the item, the e-person, and the collection). It creates entries in the history database tables which associate the archival objects with the generated serializations.
This history system is a largely untested experiment. It also needs further documentation. There have been no serious efforts to determine whether the information written by the history system, either to files or the database tables, is accurate. In particular, the historystate
table does not seem to be correctly written.