This is a repost of the attached article first published in the PMI Virtual Library.
In the consulting world, project estimation is a critical
component required for the delivery of a successful project. If you estimate correctly, you will deliver a
project on time and within budget; get it wrong and you could end up over
budget, with an unhappy client and a burned out team. Project estimation for business intelligence and
data integration projects is especially difficult, given the number of
stakeholders involved across the organization as well as the unknowns of data
complexity and quality. Add to this mix a
firm fixed price RFP (request for proposal) response for a client your
organization has not done work for and you have the perfect climate for a poor
estimate. In this article, I share my
thoughts about the best way to approach a project estimate for an extract, transform
load (ETL) project.For those of you not familiar with ETL, it is a common
technique used in data warehousing to move data from one database (the source)
to another (the target). In order to
accomplish this data movement, the data first must be extracted out of the
source system—the “E.” Once the data
extract is complete, data transformation may need to occur. For example, it may be necessary to transform
a state name to a two-digit state code (Virginia
to VA)—the “T.” After the data have been
extracted from the source and transformed to meet the target system
requirements, they can then be loaded into the target database—the “L.”
Before starting your ETL estimation, you need to understand
what type of estimate you are trying to produce. How precise does the estimate need to be? Will
you be estimating effort, schedule, or both?
Will you build your estimate top down or bottom up? Is the result being used for inclusion in an
RFP response or will it be used in an unofficial capacity? By answering these questions, you can assess
risk and produce an estimate that best mitigates that risk.
In many cases, the information you have to base your
estimate on is high level with only a few key data points do go on, and you do
not have either the time or ability to ask for more details. In these situations, the response I hear most
often is that an estimate cannot be produced.
I disagree! As long as the precision of the estimate produced is
understood by the customer, there is value in the estimate and it should be
done. The alternative to a high-level
estimate is none at all, and as someone who has to deliver on the estimate, I
would rather have a bad estimate with clear assumptions than no baseline at
all. The key is being clear about how
the estimate should be used and what the limitations are. I have found that one of the best ways to
frame the accuracy of the estimate with the customer and project team is through
the use of assumptions. Every estimate
is built with many assumptions in mind and having them clearly laid out almost
always generates good discussion and eventually a more refined and accurate estimate.
A common question that comes up during the estimation
process is effort versus schedule; in other words, how many hours will the work
take versus the duration it will take to complete the effort. To simplify the estimating process, I start
with a model that delivers the effort and completely ignore the schedule. Once the effort has been refined, it can be
taken to the delivery team for a secondary discussion on overlaying the estimated
effort across time.
Once you know what type of estimate you are trying to deliver
and who your audience is, you can begin the process of effectively estimating
the work. All too often, this up-front
thinking is ignored and the resulting estimate does not meet expectations.
I’ve reviewed a number of the different ETL estimating
techniques available and have found some to be extremely complex and others
more straightforward. Then there are the
theory of estimating and the tried and true models of Wide Band Delphi and COCOMO. All of these theories are interesting and
have value but they don’t easily produce the data to support the questions I am
always asked in the consulting world: How much will this project cost? How many
people will you need to deliver it? What
does the delivery schedule look like? I
have discovered that most models focus
on one part of the effort (generally development) but neglect to include
requirements, design, testing, data stewardship, production deployment,
warranty support, and so forth. When
estimating a project in the consulting world, we care about the total cost, not
just how long it will take to develop the ETL code.
Estimating an ETL
Project
In the ETL space I use two models (top down and bottom up)
for my estimation, if I have been provided enough data to support both; this
helps better ground the estimate and confirms that there are no major gaps in
the model.
Estimating an ETL
Project Using a Top Down Technique
To start a top down estimate I break down the project by
phase and then add in key oversight roles that don’t pertain specifically to
any single phase (i.e., project manager, technical lead, subject matter expert,
operations, etc.). Once I have the
phases that relate to the project I am estimating for, I estimate each phase
vertically as a percentage of the development effort, as shown in the chart
below. Everyone has a different idea
about what percentage to use in the estimate and there is no one right answer. I start with the numbers below and tweak them
accordingly, based on the project environment and resource experience.
|
Phase
|
Percentage of Development
|
|
Requirements
|
50%
of Development
|
|
Design
|
25%
of Development
|
|
Development
|
|
|
System
Test
|
25%
of Development
|
|
Integration
Test
|
25%
of Development |
Once I have my verticals established, I break my estimate
horizontally into low, medium, and high, using the percentages below:
|
Complexity
|
Percent of Medium
|
|
Low
|
50%
of Medium
|
|
Medium
|
N/A
|
|
High
|
150%
of Medium |
Generally, when doing a high-level ETL estimate, I know the
number of sources I am dealing with and, if I’m
lucky, I also have some broad stroke level of complexity
information. Once I have my model built
out, as described above, I work with my development team to understand the
effort involved for a single source. I
then take the numbers of sources and plug them into my model, as shown below (Figure
1, in yellow). If I don’t have
complexity information, I simply record the same numbers of sources in the low,
medium, and high columns to give me an estimate range of +/−50%
I now have a framework I can share with my team to shape my
estimate. After my initial cut, I meet
with key team members to review the estimate, and I inevitably end up with a revised estimate and, more
importantly, a comprehensive set of assumptions. There is no substitute for socializing your
estimate with your team or with a group of subject matter experts; they are
closest to the work and have input and ideas that help refine the estimate into
something that is accurate and defendable when cost or hours are challenged by
the client.
Estimating an ETL
Project Using a Bottom Up Estimate
When enough data are available to construct a bottom up
estimate, this estimate can provide a powerful model that is highly defendable.
To start a bottom up ETL, estimate a minimum of two key data elements are
required: the number of data attributes
required and the number of target structures that exist. Understanding the target data structure is a
critical input to ETL estimation, because data modeling is a time-consuming and
specialized skill that can have a significant impact on the cost and schedule.
When starting a bottom up ETL estimate, it is important to
break up the attributes into logical blocks of information. If a data warehouse is the target, subject
areas work best as starting points for segmenting the estimation. A subject area is a logical grouping of data
within the warehouse and is a great way to break down the project into smaller
chunks that align with how you will deliver the work. Once you have a logical
grouping of how the data will be stored, break down the number of attributes
into the various groups, noting the percentages of attributes that do not have
a target data structure.
Once you have defined the target data subject areas, attributes,
and percentages of data modeled, the time spent per task, per attribute can be
estimated. It is important to define all
tasks that will be completed during the life cycle of the project. Clearly defining the assumptions around each
task is also critical, because consumers of the model will interpret the tasks
differently.
In the example shown, there is a calculation that adjusts
the modeling hours based on the percentage of attributes that are not modeled,
giving more modeling time as the percentage increases. This technique can be used for any task that
has a large variance in effort based on an external factor.
To complete the effort, estimate the hours per task that can
be multiplied by the total number of attributes to get effort by task. In addition, the tasks can be broken out
across the expected project resource role, providing a jump start on how the
effort should be scheduled. As with any
estimate, I always add a contingency factor at the bottom to account for
unforeseen risk.
Comparing a top down estimate with a bottom up estimate will
provide two good data sets that can drive discussion about the quality of the
estimate as well us uncover additional assumptions.
Scheduling the Work
Once the effort estimate is complete (regardless of the
type), I can start thinking about how much time and how many resources are
needed to complete the project.
Generally, the requestor of the estimate has an expected delivery date
in mind and I know the earliest time we can start the work. With those two data points, I can calculate
the number of business days I have to deliver the project and get a rough order
of magnitude estimate of the resources required.
The first thing I do is map the phases established in the
effort estimate to the various project team roles (BSA, developer, tester, etc.). Once I break down the effort into roles, I
can then divide the effort by the number of days available in the project to get
the expected number of resources required.
In the example below (Figure 2), I shorten the time that the BSA, developer,
and tester will work, taking into
account that each life cycle phase does not run for the duration of the
project. At this stage, I also take into
consideration the cost of each resource and add in a contingency factor. This method allows for the ability to adjust
the duration of the project without impacting the level of effort needed to
complete the work.
Using the techniques described above provides you with the
flexibility to easily answer the “what if” questions that always come up when
estimating work. By keeping the effort
and the schedule separate, you have total control over the model.
Delivering on the Estimate
Once the effort and duration of the project are stabilized,
a project planning tool (e.g., Microsoft Project) can be used to dive into the
details of the work breakdown structure and further map out the details of the
project.
It is important to continue to validate your estimate
throughout the project. As you finish
each project phase, revisiting the estimate to evaluate assumptions and
estimating factors will help make future estimates better, which is especially
important if you expect to do additional projects in the same department.
Conclusion
In my experience, bottom up estimates produce the most
accurate results, but often the information required to produce such an
estimate is not available. The bottom up
technique allows the work to be broken down to a very detailed level. To effectively estimate bottom up ETL
projects, the granularity needed is typically the number of reports, data
elements, data sources, or the metrics required for the project.
When a low level of detail is not available, using a top
down technique is the best option. Top
down estimates are derived using a qualitative model and are more likely to be
skewed based on the experience factor of the person doing the estimate. I find that these estimates are also much
more difficult to defend because of their qualitative nature. When doing a top down estimate for a proposal,
I like to include additional money in thebudget for contingency to cover the
unknowns that certainly lie in the unknown details.
There is an argument that a bottom up estimate is no more
precise than a top down estimate. The
thinking here is that with a lower level of detail, you make smaller estimating
errors more often, netting the same
result as the large errors made in a top down approach. Although this is a compelling argument (and
why I do both estimates when I can), the more granular the estimate you have,
the quicker you can identify flaws and make corrections. With a top down estimate, errors take longer
to be revealed and are harder to correct.
An estimate is only as good as the data used to start the
estimate and the assumptions captured.
Providing clear and consistent estimates helps build credibility with business
customers and clients and provides a concrete defensible position on how you
plan to deliver against scopeand it also provides a constant reminder of the
impact of additional scope. No matter
how easy or small a project appears to be, always start with an estimate and be
prepared for that estimate to need fine tuning as new information becomes
available.
Glossary of Terms
ETL – Extract, Transform, and Load. A technique used to move data from one
database (the source) to another database (the target)
Business Intelligence – A technique used to analyze data to
support better business decision making
Data Integration – The process of combining data from
multiple sources to provide end users with a unified view of the data
Data Steward – The person responsible for maintaining the
metadata repository that describes the data within the data warehouse.
Data Warehouse – A repository of data designed to facilitate
reporting and business intelligence analysis
RFP – A request for proposal (RFP) is an early stage in the
procurement process, issuing an invitation for suppliers, often through a
bidding process, to submit a proposal on a specific commodity or service.
Source – An ETL term used to describe the source system that
provides data to the ETL process
Subject Area – A term used in data warehousing that
describes a set of data with a common theme or set of related measurements
(e.g., customer, account, or claim)
Target – An ETL term used to describe the database that
receives the transformed data
About the Author
Ben Harden, PMP, is a manager in the Data Management and
Business Intelligence practice of the Richmond,
Virginia based consulting firm CapTech
. He specializes in the project management and delivery of data integration and
business intelligence projects for Fortune 500 organizations. Mr. Harden has successfully managed
data-related projects in the health care, financial services, telecommunications,
and governmental sectors and can be reached via e-mail at bharden@captechconsulting.com.