Data Integration
Oct 03 2011
2011 Teradata Partners Conference Day 1
Day 1 of the 2011 Teradata Partners User Group Conference and Expo at San Diego was very educational and I had the opportunity to listen to some very good sessions. During the morning session, I attended a workshop on “Teradata Live”, an End to End demo by Lance Miller from Teradata. This demo was done on a 8GB Memory,2TB, TD 13.10,4AMP configuration. Lance showed a logical data model for retail customers and the data distribution for 150 million row sales table. He pulled the Teradata Administrator view and showed how the data was evenly distributed on 4 amps. Lance also demonstrated the Teradata Miner 5.0 which is used for data profiling. During this session I asked a question on whether they have any data to support benchmarks against competitors either IBM's Netezza or Oracle Exadata.
Apr 07 2011
Teradata Indexing
One of the challenges design teams face when implementing Teradata design solutions is choosing indexes which help improve performance. What happens if the primary index is not the same as the primary key, and how does one validate that the design decision is correct?
The primary key of a relational table uniquely identifies each record in that table. A logical data model should specify the primary key (PK). If this specification is absent or not clear, the data modeler must provide this information. Indexes are physical design concepts and are related to data base performance. PKs are used for referential integrity, for logical correctness within the data model, and may be NULL. Primary indexes (PI), on the other hand, a physical mechanism for storage, are defined by the SQL create table, may be unique or non-unique, and can allow for changes in the data.
Mar 25 2011
Data quality and data governance lessons from national health care
Who would want to be a national health care administrator? Who would want the responsibility for managing health care and formulating health policy for tens or hundreds of millions of people? It seems obvious that such decisions would rely on quality data. A recent interview impressed upon me how much data managers can learn from a field where data recording millions of separate life and death decisions aggregates to support decisions on the future allocation of health care resources.
Heather Richards of the Canadian Institute for Health Information (CIHI) was recently interviewed by the Australian magazine Image and Data Manager on CIHI efforts to provide neutral, objective and unbiased information to those making health care allocation policy decisions. Ms. Richards also happens to be Director of Publicity for the International Association for Information and Data Quality (IAIDQ).
In a detailed, concise, and refreshingly buzzword-free conversation, Ms. Richards described CIHI’s approach to improving data quality. To me, that approach boils down to these three themes:
Jan 27 2011
Slowly Changing Dimensions – Special Attention Needed
Margaret, who was an average sales person, moved from Washington, DC to Richmond, VA, whose market is one fifth the size, during the month of June. When the annual evaluations of sales performance were done in the month of December, she was listed as the top performer in the Richmond market resulting in the company promoting her to Sales Director. The next two highest ranked Richmond salespeople had been the consistent leaders for the last several years and outperformed Margaret since she arrived in Richmond. Her very high sales numbers during the first six months of the year skewed her average, placing her above the rest of the Richmond area. In this example, if the decision makers had correct information handy, and used it appropriately, would they have promoted Margaret over her new Richmond peers?
Here is another example.
Jan 27 2011
Consider the source in health care data integration
The Atlantic, not typically a technical rag, recently presented an article by business and economics editor Megan McArdle on health care data integration entitled “Paging Dr. Luddite”. The article brings to a mass audience an understanding of both the importance and difficulty of data integration, but the title and general anti-healthcare-professional tone seem counterproductive.
The article opens with a compelling example of data silos, detailing the challenges of integrating the many data feeds from sensors attached to a premature baby “as small as the hand that cradles” it. Each sensor produces a data stream, and, in a field trial at Toronto’s Hospital for Sick Children, IBM is integrating that data in near real time to identify early signs of infection.
Jan 19 2011
Informatica Cloud Express - the Data Integration Software as a Sevice (SaaS) on Cloud
At first glance, hosting applications in the cloud seems to require less setup and operational cost, which accounts for its great appeal. That said, the world is still trying to make sense of how cloud computing can really fit into the enterprise computing puzzle in the midst of so many of the complex challenges we face, like the ever increasing needs for security, continuing dependency on age old legacy systems, ever increasing uncertainty in the current troubling global economic situation, and more. Nevertheless, one thing that is quite clear to me is that cloud computing during this economic uncertainty does offer a great new hope for organizations if only for its “pay as you go” approach.
Jan 12 2011
DMBI Tech Tip: Include Audit Columns on all Tables
When I design a database, and local standards permit, I include both a surrogate primary key and audit columns on every table.
A surrogate primary key is a system-generated integer that increments by one with each new row inserted. Most DBMSs make it easy to add surrogate keys. Oracle uses a construct called a Sequence, SQL Server calls its variant an Identity.
The audit columns I like to include are these, shown with their SQL Server datatypes: CreationDatetime (datetime), UpdateDatetime (datetime), CreationUserId (varchar (30)), and UpdateUserid (varchar (30)). Most DBMS’s offer the ability to update columns like these with either default values or triggers.
Aug 03 2010
Three Tips for Better Data Definitions
If business or IT users insist that their definition is good and everyone knows what they mean when in fact that is not the case, the strategies below may help.
1. Provide examples of unclear
vs. clear definitions
Users who are intimately familiar with their business process and
supporting systems may not understand the point of specifying exactly what they need. To them "the ID of the customer" is a
perfectly acceptable
definition of "Customer ID." Or, the IT representative may give a
definition that works for them but no one else, such as "the primary key
of the customer table". It will help both to see examples of what is
needed in order to have a workable definition to support data warehouse
population and use of the data.
Apr 12 2010
Defaulting data integration to customers = risky business
Here’s a little-recognized fact about data integration: if you run a business or any sizable chunk of one, someone is integrating your data.
In my professional life I have, on occasion, suggested data integration efforts. Sometimes my suggestions have been accepted and sometimes not. As an IT professional I understand that different managers have different priorities, and in a given business situation sometimes other things may be more important than, for example, having a single, consistent source for all customer records, or making sure production data matches financial data.
But as a customer? That’s different.
Feb 15 2010
Using the SQL RANK function to solve an unusual sorting problem in obtaining data for performance tests
Problem:
Our team needed to create a data set listing retail stores belonging to different territories for performance tests that involved store selection. The data set was to be used by virtual users signing-in as different territory owners. The challenge was that the territory and store numbers in the data set should be as random as possible. For example, if there are 20 territories and each territory has 10 stores, the data set should contain the 20 territories with their first stores, followed by the same 20 territories with their second stores and so on.