Delivering Data Warehousing and BI Projects using Agile - Follow up
Mar 22 2011
A follow up to Delivering Data Warehousing and BI Projects using Agile:
While working as a data architect on an Agile data warehouse project for a financial services client, I had the opportunity to support multiple warehouse delivery “bays”. Of all the lessons learned, the one I would most strongly recommend to any organization embarking on an Agile data warehouse delivery project is to architect the solution around a tiered data architecture that clearly separates a persistent staging environment from the integrated subject areas of the analytical data warehouse. Decoupling the “base layer” from the integrated subject areas can provide significant advantages to the project teams supporting the warehouse build.
In a rapid delivery environment, significant gaps in the data can surface long after a sprint has been completed. When the missing data is no longer available from the source systems (or is prohibitive to obtain), options are limited when historical data is needed to rebuild history or troubleshoot issues. Often this leads to imperfect fixes, such as looking to non-governed environments for historical data.
Once the effort has been put forth to source the data, it is prudent to persist the raw source data. Data retention options and strategies need to be evaluated and selected to meet the organization’s needs as part of the design process. Retaining the historical raw source data, and governing access to it, will better support business initiatives and the ability to create and maintain an effective analytic environment.
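To make the idea of a persistent staging layer more concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. It is only an illustration, not the design used on the project: the table name stg_raw_records, its columns, and the functions persist_batch and replay_history are all hypothetical. The sketch assumes raw records arrive as plain text rows and are only ever appended, never updated, so history can be replayed later when an integrated subject area needs to be rebuilt.

```python
import sqlite3
from datetime import datetime, timezone


def create_staging_table(conn):
    # Hypothetical append-only table holding raw source rows exactly as received.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS stg_raw_records (
            batch_id      TEXT NOT NULL,  -- identifies one delivery from the source
            source_system TEXT NOT NULL,  -- where the data came from
            loaded_at     TEXT NOT NULL,  -- UTC load timestamp (ISO 8601)
            raw_payload   TEXT NOT NULL   -- the untouched source record
        )
        """
    )


def persist_batch(conn, batch_id, source_system, raw_rows):
    # Append every raw row; nothing is transformed, updated, or deleted here.
    rows = list(raw_rows)
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO stg_raw_records "
        "(batch_id, source_system, loaded_at, raw_payload) VALUES (?, ?, ?, ?)",
        [(batch_id, source_system, loaded_at, row) for row in rows],
    )
    conn.commit()
    return len(rows)


def replay_history(conn, source_system):
    # Return all raw rows for one source in arrival order, e.g. to rebuild
    # an integrated subject area or to investigate a defect.
    cur = conn.execute(
        "SELECT batch_id, raw_payload FROM stg_raw_records "
        "WHERE source_system = ? ORDER BY rowid",
        (source_system,),
    )
    return cur.fetchall()
```

With a connection from sqlite3.connect(":memory:"), calling create_staging_table once and then persist_batch for each delivery keeps every batch available to replay_history. Retention rules and access controls, which the design process still has to settle, are deliberately left out of this sketch.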
On this particular project, source data was persisted in an integrated environment on a case-by-case basis, and the decisions were driven mostly by business requirements. I experienced multiple situations where issue resolution would have been trivial had access to the raw data existed.
As with all data decisions, there are advantages and disadvantages. Here are a few to consider:
Advantages
- Raw source data and history are available
- Reduced risk from source data providers not persisting data in their application
- Raw data better supports data quality or defect investigation
- Quicker to re-engineer the integrated analytical subject areas without historical conversion concerns or limitations
- Tactically, some teams can focus on data sourcing and staging while others focus on integrating data into subject areas, enhancing delivery velocity
- Enables better, more representative data for profiling during data integration modeling
- Analytic users can access raw data while teams design integrated subject areas (decreased time to market in situations where raw data is usable)
- NOTE: I’m not advocating that the staged data be available for general user access as the end state solution. I’m suggesting that certain power users, under controlled circumstances, be given access during design and delivery of the integrated subject areas. There must be clear expectations that this access will be revoked once the integrated data is available, as the base staging layer must remain responsive to changes in data sources and not susceptible to delays caused by user dependencies.
Disadvantages
- Impacts to platform capacity (not all data has historical requirements or value)
- Increased platform costs associated with persisting and maintaining raw data
- Potentially slower delivery, impacting time-to-market metrics in situations where raw data is not usable by the business
- Potential legal implications with increased data retention
- Potential storage design complexity because of large data volumes (a combination of hot / cold, online / offline storage, etc.)
When designing the architecture for your Agile data warehouse, keep in mind the whirlwind of delivery that will take place once the pace heats up. Mistakes will be made, gaps will exist, and dependencies upon dependencies will require interim solutions, workarounds, concessions, redesigns, and so on. Readily available, persistent raw source data will help to insulate your organization from the inevitable challenges that come with applying this rapid delivery methodology to delivering useful analytic information to your business users.