UNBC data migration


= Overview =



UNBC Archives is the first participant to be migrated as part of the AABC Data Migration Toolkit project. The original project specification included migrating data from MS SQL Server, MS Access and ICA-AtoM (Release 1.0.9) MySQL databases as well as linking over 5,000 digital asset files (image and video).

Migration of the MS SQL Server data was performed with the Pentaho Data Integration (PDI) Spoon (pictured) extract-transform-load tool - see Technical Tool Options for the other tools considered, and the reasons for selecting PDI.

Due to the pilot nature of the project, there were a number of unforeseen technical challenges and the project ran over schedule.

Original sources for migration
One (1) MS SQL Server database backup file
 * 1) NBCA_backup_201004080030.bak

Four (4) MS Access databases:
 * 1) Fawcett.mdb
 * 2) Ray Williston Collection.mdb
 * 3) Cassiar_Box1_on.mdb
 * 4) Cassiar-Box751onwards.mdb

One (1) archive (ZIP) file containing digital objects (image and video files):
 * 1) Uploads.zip

Data from one (1) ICA-AtoM Release 1.0.9 instance (set up in late 2010?), including descriptions and digital assets
 * 1) http://www.ica-atom.org/unbc

Data migrated
The final migrated data includes:
 * The NBCA MS SQL Server database data as detailed in the section below
 * The digital files from the Uploads.zip archive
 * The ICA-AtoM Release 1.0.9 UNBC website (hosted by Artefactual) including digital assets

MS Access databases
None of the four MS Access databases provided were migrated, for the following reasons:
 * the Cassiar fonds files (Cassiar_Box1_on.mdb, Cassiar-Box751onwards.mdb) were never scheduled for migration
 * in the course of the project it was determined that the "Ray Williston Collection" data was present in the NBCA (MS SQL Server) database, so migration of the Ray Williston Collection.mdb database was not necessary
 * the Fawcett.mdb database will be migrated after development of an AABC_Data_Migration_Toolkit methodology

Accession data
ICA-AtoM Release 1.1 has no provision for storing accession data. The future ICA-AtoM Release 1.2 will include an accession module, and this data will be migrated at a future date.

Accession data is stored in several tables in the NBCA database, and these were not migrated:
 * 1) Accession - accession and appraisal data
 * 2) Individual - most of the data in this table is specific to appraisers and donors
 * 3) AppraiserSpecialty - links appraisers (Individual) to their area of specialty
 * 4) Receipt - accession receipts?

Other MS SQL Server tables not migrated

 * 1) SpecifiedLocation - locations are specified at the Fonds level, and are very general (e.g. "Photograph room", "Mobile shelving", "On display in library").  UNBC agreed that migrating this data was not necessary.
 * 2) Lookup - the controlled vocabulary maintained in the Lookup table was not applicable in the ICA-AtoM application.

Project deliverables

 * 1) Migration of all source data to a single ICA-AtoM Release 1.1 website (hosted by Artefactual)
 * 2) Integration of linked digital assets
 * 3) This process documentation

= Methodology =


 * 1) Migrate ICA-AtoM Release 1.0.9 data (with attached digital assets) to ICA-AtoM Release 1.1 data schema. See: Upgrading
 * 2) Install MS SQL Server 2005 Express and load the NBCA database from backup file provided
 * 3) Use the Pentaho Data Integration (PDI) Spoon tool to create a data migration script to migrate the NBCA database to the ICA-AtoM data model
 * 4) Attach digital assets from Uploads.zip archive to NBCA descriptions
 * 5) Refine the data migration script through several iterations
 * 6) Set up an install of ICA-AtoM for UNBC to audit the migrated data - including Release 1.0.9 data, NBCA database data, and all digital assets from both
 * 7) UNBC audits migrated data on the auditing site
 * 8) Correct any problems with the migration script found in the data audit
 * 9) Do the final migration of the data to the live UNBC site (hosted by Artefactual)

= Extract =

Restore UNBC MSSQL Backup



 * Download and install SQL Server 2005 Express Edition with Advanced Services SP3
 * Installation options:
 * Components: Management Studio Express, Full-text search, Connectivity Components, Replication?
 * Mixed Authentication Mode (Windows and SQL Server)
 * Add user to the SQL admin role
 * Copy NBCA_backup_201004080030.bak to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\NBCA_backup_201004080030.bak
 * Run SQL Server Management Studio express
 * Right-click "Databases" > Restore
 * Restore to database "NBCA" (or another name of your choice).
 * From device (browse to NBCA_backup_201004080030.bak)
 * Click checkbox next to backup filename
 * OK
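
For reference, the same restore can be run from a query window instead of the GUI. A minimal T-SQL sketch (run RESTORE FILELISTONLY first if you need to check the logical file names, and add WITH MOVE clauses if the backup's original data paths do not exist on your machine):

 -- List the logical files contained in the backup (optional check)
 RESTORE FILELISTONLY
 FROM DISK = 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\NBCA_backup_201004080030.bak';
 
 -- Restore the backup into a database named NBCA
 RESTORE DATABASE NBCA
 FROM DISK = 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\NBCA_backup_201004080030.bak';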

Create new SQL Server login



 * In left pane of "Microsoft SQL Server Management Studio Express" click to expand "Security > Logins"
 * Right-click either the "Logins" folder or in the Logins pane, and select "New Login..."
 * Enter username
 * Select "SQL Server authentication"
 * Enter password, and confirm
 * Uncheck "Enforce password expiration"
 * Default database "NBCA"
 * OK
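
The same login can be created in T-SQL; the login name "pdi_user" and the password below are placeholders, not values from the project:

 -- SQL Server authenticated login for the migration scripts;
 -- password expiration is disabled, matching the checkbox above.
 CREATE LOGIN pdi_user
 WITH PASSWORD = 'ChangeMe!123',
      CHECK_EXPIRATION = OFF,
      DEFAULT_DATABASE = NBCA;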

Give read access to new login



 * Right-click newly created login, and select "Properties"
 * Click "User-Mappings" in left pane
 * Click checkbox next to "NBCA" database in top-right pane
 * Click "db_datareader" permission in lower-right pane
 * Click OK
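
Equivalently in T-SQL (same placeholder login name as above):

 USE NBCA;
 -- Map the login to a database user and grant read-only access,
 -- matching the db_datareader checkbox above.
 CREATE USER pdi_user FOR LOGIN pdi_user;
 EXEC sp_addrolemember 'db_datareader', 'pdi_user';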

Turn on TCP/IP access



 * "Start > Microsoft SQL Server 2005 > Configuration Tools > SQL Server Configuration Manager"
 * In left pane, click "SQL Server 2005 Network Configuration > Protocols for SQLEXPRESS"
 * Right-click "TCP/IP" in right pane, and click "Properties"
 * click "IP Addresses" tab
 * For the desired TCP/IP connection, set "Enabled" to "Yes", clear "TCP Dynamic Ports" (leave it blank), and set "TCP Port" to 1433. Click OK (you will get a warning that the server must be restarted for the changes to take effect) (See screenshot)
 * Right-click "TCP/IP" and click "Enable" (the same restart warning appears)
 * Click "SQL Server 2005 Services" in left pane
 * In right pane, right-click "SQL Server (SQLEXPRESS)" and click "Restart"

Troubleshooting TCP/IP config
In several instances the static port settings above did not work properly with SQL Server 2005. Fixing this issue required modifying the following Windows registry keys to set the port value to 1433.
 * HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\\MSSQLServer\SuperSocketNetLib\Tcp
 * HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\\MSSQLServer\SuperSocketNetLib\Tcp\IPAll

Connecting to MSSQL with PDI Spoon via JDBC


PDI Spoon is a Java GUI for creating PDI Transformation scripts. We will use it to connect to MSSQL with the JDBC connector. Please read the Spoon documentation for general instructions on using the tool.


 * Add new "Input > Table input" step
 * Double-click the Table input step
 * Click the "New" connection button,
 * Select the MS SQL Server connector, and fill in the form like the screenshot below
 * Click "Test"
 * Click OK

NOTE: The "MS SQL Server (Native)" connector did not work in our tests, so the vanilla "MS SQL Server" connector was used.
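
Once the connection tests OK, the Table input step just needs an ordinary SQL query. A trivial smoke test against the NBCA schema described elsewhere on this page:

 -- Pull a handful of fonds-level rows to confirm that the JDBC
 -- connection and the db_datareader permission both work.
 SELECT TOP 10 * FROM Fond;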



= Transform =

ICA-AtoM Release 1.0.9 data
Data from the existing http://ica-atom.org/unbc database was migrated to the ICA-AtoM Release 1.1 data schema via the standard ICA-AtoM upgrade process.

Other transform sources

Creator name
The NBCA.Fond table had no standard field for recording the creator of a Fonds, so the CreatorHistory and Keywords columns were used in an ad hoc manner. Because neither the name format nor the column used was consistent, there was no automated way to extract the name of a Fonds' creator. The solution was to export a CSV file from the NBCA Fond table and have the UNBC archivists manually enter a creator name where the creator was known.

The creator.csv file was created by exporting the NBCA Fond table (columns: Id, Title, CreatorHistory and BioAdminHistory) and was then manually edited by the UNBC Archives staff. The edited creator.csv file was loaded into ICA-AtoM by the PDI Spoon Transformation "Creators.ktr".
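
The export behind creator.csv amounts to a simple projection of the Fond table. A sketch of the query a Table input step would feed into a CSV output step (the ORDER BY is an assumption, added for readability):

 SELECT Id,
        Title,
        CreatorHistory,
        BioAdminHistory
 FROM   Fond
 ORDER  BY Id;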

Keywords
In the NBCA Fond and FondSeries source tables, the Keywords column contains a semicolon-separated list of strings. In the ICA-AtoM model this data is represented as "Access Points". Because ICA-AtoM distinguishes between name, subject and place access points, and there was no automated way to categorize all of the original strings, a multi-part process was required:
 * 1) Export a comma-separated file containing all Keywords column data from the NBCA database (only present in the Fond and FondSeries tables); see the query sketch after this list
 * 2) Run automated normalization and data-cleaning using Google Refine (Artefactual)
 * 3) Manual data cleaning via Google Refine (UNBC)
 * 4) Import of cleaned and normalized CSV file via PDI Spoon
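
The export in step 1 pulls the Keywords column from both tables into a single result set. A sketch of the query (the added Source label, used to keep track of each row's origin table, is an assumption, not part of the original script):

 SELECT Id, 'Fond' AS Source, Keywords
 FROM   Fond
 WHERE  Keywords IS NOT NULL
 UNION ALL
 SELECT Id, 'FondSeries' AS Source, Keywords
 FROM   FondSeries
 WHERE  Keywords IS NOT NULL;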

The complex data structure of ICA-AtoM further complicated the import process (step #4), as can be seen in the AccessPointEntity.ktr and AccessPointRelation.ktr screenshots.

Digital asset import
Digital asset locations were exported from the NBCA MS SQL Server database to a CSV file via a PDI Spoon transformation script; the actual digital assets were then imported and linked to their descriptions via the digitalobject:load task. The transformation script also mapped the source id (FondSeriesItemAttachement.FondSeriesItemId) to the target key (information_object.id) via the keymap table, and performed a string search-and-replace on the Attachment.SourcePath column to substitute the local path of each digital asset.
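
A sketch of the MS SQL side of that export; the join column (AttachmentId) and both path prefixes are hypothetical (the real values live in the NBCA_Transform.kjb job), shown only to illustrate the search-and-replace on Attachment.SourcePath:

 -- Export one row per attached digital asset, rewriting the
 -- Windows source path to the local path where Uploads.zip was
 -- unpacked. Join column and both paths are placeholders.
 SELECT fsa.FondSeriesItemId,
        REPLACE(a.SourcePath,
                'C:\Uploads\',
                '/tmp/uploads/') AS LocalPath
 FROM   FondSeriesItemAttachement fsa
        JOIN Attachment a
          ON a.Id = fsa.AttachmentId;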



= Load =

Load existing ICA-AtoM install data
Data from the http://ica-atom.org/unbc site can be loaded via a mysqldump backup file. This step must be performed before any other migrations because it destroys any existing data in the target database.

Load NBCA data
The PDI Spoon NBCA_Transform.kjb job writes data directly to the target ICA-AtoM MySQL database.
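
Among other things, the job populates ICA-AtoM's keymap table, which is what lets later steps resolve an NBCA source id to its new information_object.id. A sketch of that lookup against the target MySQL database (the 'NBCA' source_name value and the example id are assumptions):

 -- Resolve a migrated record: which information_object.id did
 -- NBCA source row 12345 become?
 SELECT target_id
 FROM   keymap
 WHERE  source_name = 'NBCA'
   AND  source_id   = '12345';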

Load digital objects
Digital objects are loaded with the digitalobject:load task, using the nbca_digitalobjects.csv file to link assets to archival descriptions.

= Post-processing =

Several final tasks must be performed on the migrated ICA-AtoM instance data; these require direct access to the ICA-AtoM/symfony framework:


 * 1) Build nested set - See: Build nested set task
 * 2) Generate permalink slugs - See: Generate slugs task
 * 3) Build search index - See: Rebuild search index