Data Profiling and Data Remediation Priority for Migrating Legacy Systems Part II
Mar 07 2011
In Part I of this blog, I stated that before embarking on a large scale data migration from a legacy system into a new system, it is important to establish a game plan. In Part II, once you have established the game plan, it is important to establish priorities for the Data Remediation Plan.
To build a Data Remediation Plan, you need to establish metrics by which to measure and prioritize the effort. This will be very helpful when you have conflicting resource constraints and need to know where to begin.
In Part I, I demonstrated the importance of setting priorities for data by Mandatory field, Data Retention Policy, Data Load (null records), Data Loss (truncation and duplication of records), Data Quality. Once all of these parameters have been determined, you are now left with the records that need to be remediated. I will use a fictitious metrics table to provide examples of how to set up the priority for tackling each one.
The Risk Priority Number
In Six Sigma, there is a process called Failure Mode and Effects Analysis, in which a step-by-step approach is used to identify all possible failures in a design, a manufacturing or assembly process, or a product or service. By using this approach but applying it to Data Quality, we are able to use one of the tools to establish the priority. The tool is called the Risk Priority Number (RPN). The RPN is the measuring process for establishing the data remediation priority. By using this metric, you can create a hierarchical priority list for Data Remediation and I will demonstrate how to do so.
The RPN is comprised of three components with numeric values. These are:
SEVERITY - This measures the severity of the potential effect of the data load failure if the data is not remediated. For example, a high severity would mean that the data will not load because there are too many nulls.
OCCURRENCE – This measures the likelihood that the data load failure will occur. For example, date fields that should be numeric fields.
DETECTION – How difficult is it to detect the data load failure caused by this data element? For example, symbols embedded into a record that should be a letter.
The RPN is the product of the Severity, Occurrence and Detection:
RPN = S x O x D
As you can see from this equation, the higher the number the higher the priority to resolve. I have used a scale of 1 to 5 using the following definitions, (although you could use Fibonacci numbers if you are obsessed by the DaVinci Code):
|
Rating |
Description |
Criteria |
|
1 |
Very low or none |
Minor nuisance, but no real harm, data will load and data is very good |
|
2 |
Low or minor |
Data quality at reduced performance, data will load |
|
3 |
Moderate or significant |
Gradual data performance degradation, data will load but data is bad |
|
4 |
High |
Loss of function such as table lock-up |
|
5 |
Very high or catastrophic |
System-related catastrophic failure |
So the range of RPN, in this instance would be from 1 to 125 (RPN = 1 x 1 x 1 =1; RPN = 5 x 5 x5 = 125). By establishing the numbering system described, then, in conjunction with addressing the data quality parameters described in Part I, (Validity, Data Definitions, Existence, and Consistency), the Data Remediation Plan can be prioritized by the RPN.
A Fictitious Example
For the example below, the Customer/Name, because of the highest RPN, would be remediated first, followed by Customer/Address. Customer/Type would be good to load and would not require remediation.
|
Table/Column |
RPN |
Severity |
Occurrence |
Detection |
Comments |
|
Customer/Address |
32 |
4 |
2 |
4 |
City or State records are null but are embedded in Address and difficult to detect |
|
Customer/Name |
60 |
5 |
3 |
4 |
1,545 records of 20,000 records with character length greater than 40. Truncation would cause the table to fail for 40% of the data |
|
Customer/Type |
1 |
1 |
1 |
1 |
Type of Customer maps to a list of values and has a default value |
To determine the level of effort required to remediate the data will be left for another day.
Summary
To quote Lord Kelvin, “If you can’t measure it, you can’t improve it.” By applying this same principle and using metrics for data quality, you will have an objective way to assign priorities and resources to the data remediation effort. By re-using tools applied to other disciplines, data management and data integration may become less daunting as well. More importantly, this tool will provide the Data Analyst a manner in which to communicate to the business the priority and importance of the Data Remediation Plan and the benefit to the business that everyone can understand.