City of Vancouver data migration

Content Server data

 * approx. 125,000 Private Records (from private donors)
 * approx. 87,000 Public Records (City created documents)
 * approx. 9,700 Maps and Pamphlets
 * approx. 3,350 Public documents (published materials issued by the City)
 * 812 Accession records
 * 224 Authority records (for Public Records only)
 * 195 Authority records relationships
 * 88 Ephemera records (private sources)

Other sources

 * ??? ICA-AtoM 1.1 records
 * approx. 70,000 digital objects (TIFF)

Process
For Content Server data:
 * Exported data to CSV from Content Server,
 * Transformed data to match the ICA-AtoM RAD CSV template using the Pentaho Data Integration (PDI) tool (a sketch follows this list),
 * Imported into ICA-AtoM via the CSV import command-line (CLI) script.
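A minimal sketch of the transformation step, assuming the PDI transformation was saved as a .ktr file and run with PDI's pan.sh runner; the file name and parameter names are illustrative only:

    # Run a saved PDI transformation from the command line.
    # cs_to_rad.ktr, INPUT_CSV and OUTPUT_CSV are hypothetical names.
    /opt/pdi/pan.sh -file=/root/transforms/cs_to_rad.ktr \
      -param:INPUT_CSV=/root/export/private_records.csv \
      -param:OUTPUT_CSV=/root/import/private_records_rad.csv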

For ICA-AtoM 1.1 data:
 * Migrated and imported via the standard ICA-AtoM upgrade procedure (a sketch follows this list).
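The upgrade procedure itself is documented with each ICA-AtoM release; as a hedged sketch, carrying the 1.1 database over to the new instance before running the upgrade task might look like this (database names are placeholders):

    # Dump the old 1.1 database and load it into the new instance.
    # icaatom_11 and qubit are placeholder database names.
    mysqldump -u root -p icaatom_11 > icaatom_11.sql
    mysql -u root -p qubit < icaatom_11.sql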

For digital objects:
 * Processed raw TIFFs via Archivematica to produce DIPs (Dissemination Information Packages),
 * Then imported the DIPs using the CLI DIP import script (a sketch follows this list).
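A hypothetical wrapper for the import step, iterating over the Archivematica DIP directories and feeding each one to the import script; "import:dip" is a placeholder, so substitute the actual task name shipped with your ICA-AtoM build:

    # Import each Archivematica DIP in turn, logging failures.
    # "./symfony import:dip" is a placeholder, not a confirmed task name.
    for dip in /root/dips/*; do
      ./symfony import:dip "$dip" || echo "$dip failed" >> /root/dip-errors.log
    done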

Importing

Import of the private records initially took about 1 second per record on a conventional workstation; on a ~15 GB RAM Rackspace Cloud instance (~$0.90/hour?) running an Ubuntu 11.04 LAMP stack, the initial speed improved to 0.12 seconds per record. Because the import slows down as it progresses, the CSV dump was split into 10,000-row chunks (a sketch follows below). With a patch to disable nested set (hierarchy) updating during the import, we were able to import 125K rows in ~12 hours. Post-import search indexing is still very slow.
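A sketch of the chunking step: split the transformed CSV into 10,000-row pieces, repeating the header row in each piece so that every chunk can be imported on its own (file names are illustrative):

    # Split a large CSV into 10,000-row chunks, each with the header row.
    head -n 1 private_records_rad.csv > header.csv
    tail -n +2 private_records_rad.csv | split -l 10000 - chunk_
    for f in chunk_*; do
      cat header.csv "$f" > "import_${f}.csv"
      rm "$f"
    done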

Following are the steps to perform the import (using GNU Screen to keep the session alive).


 * 1) SSH to import server
 * 2) Download CSV data to /root
 * 3) Enter apt-get update
 * 4) Enter apt-get install -y tasksel subversion screen
 * 5) Enter tasksel install lamp-server
 * 6) Enter /usr/sbin/apache2ctl restart
 * 7) Enter cd /var/www
 * 8) Check out the Qubit Subversion trunk into a qubit directory
 * 9) Enter chown -R www-data:www-data qubit
 * 10) Create a qubit MySQL database
 * 11) Visit Qubit with browser to complete installation
 * 12) Enter cd qubit
 * 13) Enter screen
 * 14) Enter ./symfony csv:import path/to/first/import/file --rows-until-update=5 --error-log=/root/errors.log --source-name=source_name
 * 15) In the import command, "source_name" is used when recording the mapping of legacy IDs to new Qubit IDs (a sketch of looping the import over the chunked files follows this list)
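Continuing step 14, a sketch of running the import over each 10,000-row chunk in sequence inside the screen session; the chunk paths and source name are illustrative:

    # Import each chunk in turn, reusing the flags from step 14.
    # "cov_private_records" is a hypothetical source name.
    for f in /root/import_chunk_*.csv; do
      ./symfony csv:import "$f" --rows-until-update=5 \
        --error-log=/root/errors.log --source-name=cov_private_records
    done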

Setup after initial import


 * 1) Update nested set data: ./symfony propel:build-nested-set (took 3.33 minutes for 125K imported objects; a sketch follows this list)
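A sketch of this step, run from the qubit checkout, with time wrapped around the task to reproduce the timing reported above:

    # Rebuild the nested set (hierarchy) data that was disabled during import.
    cd /var/www/qubit
    time ./symfony propel:build-nested-set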

Challenges

 * The Private Records CSV export included 100+ columns and over 1 million data points.
 * Google Refine could not handle this much data, so it could not be used for data transformation or normalization.