One-Two Migration Punch with Kapow and PowerShell - Part 2
Jun 19 2012
After numerous meetings, brainstorming sessions, and detailed analysis we had formulated a strategy for our migration, and were ready to get our hands dirty staging the content. If you missed Part I of this migration blog series, you can read it here. To quickly recap, at this point in the process we knew where the data resided, how it was structured, and (for the most part) how it needed to be transformed and structured on the target system. However, before we were able to start extracting and transforming the data, we had to first get it onto our local development environments.
Content Staging
The subject matter experts that provided several of the migration requirements were the content authors that managed the legacy system on a daily basis. Per their requirements, an entirely new taxonomy structure was going to have to be constructed on SharePoint, and their legacy content had to be both preserved and enhanced during the migration effort. A few examples of the content enhancements included tagging content with new managed metadata terms, modifying the data types of various metadata fields, and applying SharePoint audiences to specific pieces of content.
After conducting content analysis, we were able to create an extensive content inventory that included both the legacy content and the proposed future state content structures. This inventory was critical, because it also served as a reference guide that the authors were able to use throughout the migration effort. The inventory was created via analysis of the current state data, and through working sessions to define the future state structure. Information architects/business systems analysts, the content authors, and SharePoint architects all contributed to the content inventory and analysis processes. However, even after the inventory was completed, the authors also needed to be able to see instances of their existing content with enhanced future-state metadata before we could begin the migration process.
Vignette uses SQL Server (in our case 2000) for storing content and its instances. Additionally, Vignette's content is database driven, and even the page breadcrumbs are generated dynamically upon request. Due to a lack of documentation, we had to reverse engineer Vignette’s content database structure to determine our strategy for content extraction. Navigating tables lacking referential integrity was an arduous task, but after trial and error we were finally able to find the link between a content type and its instances.
Armed with the knowledge of how Vignette dynamically created its data for a given request, I created a local inventory database that used SQL Server Integration Services (SSIS), and copied the necessary Vignette tables from the production environment. It’s important to mention that my local database was SQL Server 2008 R2, which allowed me to utilize more effective queries unavailable in 2000. Specifically, the Common Table Expression (CTE) query allowed me to aggregate the content hierarchically per Vignette channel, and export the data for the authors to see.
A Vignette channel is essentially a unique key that can be used to tag and organize content. Unique instances of content can be associated with multiple channels, and each channel has a primary piece of content associated with it. In many cases, the primary piece of content associated with a channel is displayed as a primary navigation element on a Vignette website.
A CTE query is a recursive query that can iterate upon itself until the appropriate (final) subset is returned. In our case, the most important parameters were the ID of the root Vignette channel, the ParentChannelId, and whether the channel was active. Thereby, we were able to identify all of the children of the root channel, iterate through them, and create an output that included the taxonomy level relative to the root and the breadcrumb for each piece of content.
Content Staging CTE Query
USE Inventory
GO
WITH ChannelHierarchy (id, name, isActive, orderAsSibling, parentChannelId, ChannelLevel, Breadcrumb)
AS
(
SELECT
id,
name,
isActive,
orderAsSibling,
parentChannelId,
0 AS ChannelLevel,
CAST('Home' AS VARCHAR(1000)) as Breadcrumb
FROM Inventory.dbo.vgnAsChannel
WHERE id = '59944321'
UNION ALL
SELECT
DT.id,
DT.name,
DT.isActive,
DT.orderAsSibling,
DT.parentChannelId,
ChannelLevel + 1,
CAST((DTH.Breadcrumb + ' > ' + DT.name) AS VARCHAR(1000)) AS Breadcrumb
FROM Inventory.dbo.vgnAsChannel AS DT
INNER JOIN ChannelHierarchy AS DTH ON DTH.id = DT.parentChannelId
)
SELECT DTH.ChannelLevel,
DTH.Breadcrumb + ' > ' + d.title_name AS 'Breadcrumb',
DTH.name AS 'ChannelName',
d.title_name AS 'ArticlePageTitle',
d.content_keyword_text AS 'Keywords',
DTH.id AS 'ChannelId',
chanFile.vcmObjectId AS 'ContentId',
d.dept_id AS 'DepartmentId',
d.start_dt AS 'StartDate',
d.exp_dt AS 'ExpiryDate',
d.pub_dt AS 'PubDate',
DTH.parentChannelId,
'https://portal/content?contentId=' + chanFile.vcmObjectId + '&channelId=' + DTH.id AS 'SourceUrl',
'https://portal/channel?channelId=' + DTH.id AS 'SourceChannelUrl',
'' AS TargetUrl,
'' AS TargetContentType,
'' AS Flag,
'' AS Comments
FROM ChannelHierarchy AS DTH
LEFT JOIN Inventory.dbo.vgnAsChannelFileAssociation chanFile ON chanFile.channelId = DTH.id
LEFT JOIN Inventory.dbo.vgnAsMoMetaData meta ON meta.contentMgmtId = chanFile.vcmObjectId
LEFT JOIN Inventory.dbo.vgnAsMoMap map ON meta.contentMgmtId = map.recordId
LEFT JOIN Inventory.dbo.article d ON map.keyInt1 = d.article_id AND meta.name = d.title_name
LEFT JOIN Inventory.dbo.vgnAsObjectType objt ON objt.id = meta.objectTypeId AND objt.name = 'article'
LEFT JOIN Inventory.dbo.vgnCMSObject obj ON meta.contentMgmtId = obj.id
WHERE DTH.id != '59944321' AND DTH.isActive = 1
AND Breadcrumb NOT LIKE '%UNPUBLISHED%'
AND Breadcrumb NOT LIKE '%Not In Use%'
AND Breadcrumb NOT LIKE '%Unused%'
AND (d.exp_dt >= (GETDATE()-365))
ORDER BY Breadcrumb, ChannelLevel ASC
GOUsing the CTE, we were then able to output the hierarchical content type data, in this case article pages, into an Excel spreadsheet. From there, the authors had the ability to work tangibly with their current structure and were able to put Target URL, target system content types, and additional metadata on the content. This critical part of the process allowed the authors to define where the content would live in the new portal, and how it should be transformed. Additionally, it also allowed them to draft visual representations of the new taxonomy, modify the content type of an instance of content, and denote if a specific piece of content should not be migrated at all. The authors, analysts, and architects iterated over this file during our monthly sprints. Of course, we put this file into a SharePoint library so that they could version and preserve integrity of the file!
Also, due to the fact that the Vignette content freeze did not occur until right before our final migration, we continually had to merge new content from production with the staged data. To accomplish this feat, we extensively used VLOOKUP and SUMPRODUCT formulas in excel to merge the new data from production with the previous author updates of the newly staged target taxonomy. As described in the sections above, a particular piece of content (unique “ContentID”) could be displayed in multiple channels (unique “ChannelIDs”) on Vignette. With this in mind, we also used excel formulas to identify and prevent the same piece of content from being migrated multiple times. Additionally, excel was used to validate that each SharePoint site had a single landing page, and to perform tallies that validated all necessary items were mapped appropriately to the proper target locations. Finally, before every migration, we output a report from excel that summarized the amount of data to be migrated, the amount of data per target location, the amount of data per content type, and any data that may be potentially skipped during the migration due to erroneous/missing mappings.
For example, after generating a unique list of "ContentIDs", a variation of the following excel formula was used to generate the # of times a specific piece of content was going to be migrated based on the mappings in the content mapping spreadsheet. This was used to help identify duplicative mappings prior to migration.
=SUMPRODUCT( (ContentMappingSpreadsheet_SourceContentId_Range = UniqueContentID) *
(NOT (ContentMappingSpreadsheet_TargetUrl_Range = "Do Not Migrate")))
After we had successfully merged all production content into the content mapping spreadsheet and performed validations on the human components of this spreadsheet, we then merged the data back into our Inventory database. Specifically, we merged this data into a single table that served as the "master" of all content types. I called this the MasterInventory table. This table was the starting point for all Kapow robots (more on that later). An example of the data structure for this kind of table is as follows:
- SourceUrl (Primary Key; the source URL for the content type; contains "ContentID" & "ChannelID")
- SourceContentId
- SourceChannelId
- SourceContentTypeId
- SourceChannelName
- TargetContentTypeId
- TargetUrl (New URL to be used with Upload process when creating the new taxonomy in SharePoint)
- StatusId (Lookup for Initial, Extracted, Transformed, Upload, Error)
- Keywords (Source system metadata)
- ExpiryDate
- TaxonomyLevel
- Breadcrumb
- ParentChannelId
- IsDeleted
- SourceChannelUrl
- PublicationDate
- StartDate
- DepartmentId
During our merge process, we could not support the "scorched earth" method of wiping all previous content and adding new. Instead, we developed a two-step process where we first merged the staged data using SQL Server's MERGE statement available in 2008 R2. This statement requires two tables where one served as the source (MasterInventorySource) and the other (MasterInventory) the target.
INSERT The Excel Spreadsheet to Inventory Database
The code below took the author's updates from the content mapping spreadsheet and added it to the MasterInventoryMerge (source) table. Please note that this process required an OLEDB adapter install.
Add-PSSnapin Microsoft.SharePoint.PowerShell -EA 0 | Out-Null Add-PSSnapin sqlserverprovidersnapin100 -EA 0 | Out-Null Add-PSSnapin sqlservercmdletsnapin100 -EA 0 | Out-Null $OleDbConn = New-Object "System.Data.OleDb.OleDbConnection" $OleDbCmd = New-Object "System.Data.OleDb.OleDbCommand" $OleDbAdapter = New-Object "System.Data.OleDb.OleDbDataAdapter" $DataTable = New-Object "System.Data.DataTable" # Establish the OLEDB connection & Pull from Local Computer $ExcelFile="C:\Content Migration\CONTENT_HIERARCHY.xlsx" $OleDbConn.ConnectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$ExcelFile;Extended Properties=""Excel 12.0 Xml;HDR=YES"";" $OleDbConn.Open() # Establish the Excel spreadsheet connection $OleDbCmd.Connection = $OleDbConn $OleDbCmd.commandtext = "Select * from [CONTENT_HIERARCHY$]" $OleDbAdapter.SelectCommand = $OleDbCmd
# Fill the adapter
$RowsReturned = $OleDbAdapter.Fill($DataTable)
# Loop through the recordset
ForEach ($DataRec in $DataTable) {
$sourceUrl = $DataRec.SourceUrl
$breadCrumb = $DataRec.Breadcrumb
$channelId = $DataRec.ChannelID
$contentId = $DataRec.ContentID
$targetContentTypeId = $DataRec.TargetContentType
$targetUrl = $DataRec.TargetURL
$level = $DataRec.ChannelLevel
$statusId = 0
$sourceContentTypeId = 1
$articlePageTitle = $DataRec.ChannelName
$parentChannelId = $DataRec.ParentChannelID
$contentKeywords = $DataRec.ContentKeywords
$expiryDate = $DataRec.ExpiryDate
$startDate = $DataRec.StartDate
$pubDate = $DataRec.PublishDate
$deptId = $DataRec.DepartmentId
$sourceChannelUrl = $DataRec.SourceChannelUrl
$needsReRun = $DataRec.ReRunMigration
Write-Host "Inserting: " $articlePageTitle
# Use local SQL Server and insert data
Set-Location "sqlserver:\sql\(local)\default\databases\Inventory"
$data = Invoke-Sqlcmd -Query `
@"
SET ANSI_WARNINGS OFF
GO
INSERT INTO MasterInventoryMerge
(SourceURL,
Breadcrumb,
SourceContentId,
SourceChannelId,
SourceContentTypeId,
SourceChannelName,
TargetContentTypeId,
TargetURL,
StatusId,
TaxonomyLevel,
ParentChannelId,
KeyWords,
ExpiryDate,
PublicationDate,
StartDate,
DepartmentId,
RequiresReRun,
SourceChannelUrl)
VALUES
('$($sourceUrl -Replace "'", "''")',
'$($breadCrumb -Replace "'", "''")',
'$($contentId -Replace "'", "''")',
'$($channelId -Replace "'", "''")',
'$($sourceContentTypeId -Replace "'", "''")',
'$($articlePageTitle -Replace "'", "''")',
'$($targetContentTypeId -Replace "'", "''")',
'$($targetUrl -Replace "'", "''")',
$statusId,
$level,
'$($parentChannelId -Replace "'", "''")',
'$($contentKeywords -Replace "'", "''")',
'$($expiryDate -Replace "'", "''")',
'$($pubDate -Replace "'", "''")',
'$($startDate -Replace "'", "''")',
'$($deptId -Replace "'", "''")',
'$($needsReRun -Replace "'", "''")',
'$($sourceChannelUrl -Replace "'", "''")')
SET ANSI_WARNINGS ON
"@}
$OleDbConn.Close()
Merging MasterInventoryMerge & MasterInventory
Now that we had the source table filled, we then merged it with the target using our T-SQL MERGE statement:
Use Inventory
GO
MERGE dbo.MasterInventory AS Target
USING dbo.MasterInventorySource AS Source
ON (Target.SourceURL = Source.SourceURL
AND Source.SourceURL LIKE '%portal%'
AND Source.SourceContentTypeId = 1)
WHEN MATCHED THEN
UPDATE SET
Target.TargetContentTypeId = Source.TargetContentTypeId,
Target.TargetURL = Source.TargetURL,
Target.Keywords = Source.Keywords,
Target.ExpiryDate = Source.ExpiryDate
WHEN NOT MATCHED BY TARGET THEN
INSERT (SourceURL, SourceContentId, SourceChannelId,
SourceContentTypeId, SourceChannelName, TargetContentTypeId,
TargetURL, StatusId, Keywords, ExpiryDate,
TaxonomyLevel, Breadcrumb, ParentChannelId, SourceChannelUrl, IsDeleted)
VALUES (Source.SourceURL, Source.SourceContentId, Source.SourceChannelId,
Source.SourceContentTypeId, Source.SourceChannelName, Source.TargetContentTypeId,
Source.TargetURL, Source.StatusId, Source.Keywords, Source.ExpiryDate,
Source.TaxonomyLevel, Source.Breadcrumb, Source.ParentChannelId,
Source.SourceChannelUrl, Source.IsDeleted)
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET Target.IsDeleted = 1
OUTPUT $action,
Inserted.*,
Deleted.*;
GOFinishing Thoughts
To recap, to account for the authors continued creation of production content we created a repeatable process that merged author updates in a content mapping spreadsheet with freshly pulled production updates. We had to pull that data using SSIS into our local Inventory database. From there, we created a new content mapping spreadsheet that was then merged with the old spreadsheet using Excel formulas such as VLOOKUP and SUMPRODUCT. Once this was completed, we consumed the updated content mapping spreadsheet by utilizing PowerShell to INSERT into our local MasterInventoryMerge table. Lastly, we merged our MasterInventory table using the MERGE statement.
Now that we have used our good friends PowerShell, Excel, and SQL Server to handle the brittle processes of human workflow components, Post III will focus on the extraction and transformation of the content using Kapow.