Data Integration

Oct 03 2011

2011 Teradata Partners Conference Day 1

Day 1 of the 2011 Teradata Partners User Group Conference and Expo at San Diego was very educational and I had the opportunity to listen to some very good sessions. During the morning session, I attended a  workshop on “Teradata Live”, an End to End demo by Lance Miller from Teradata. This demo was done on a 8GB Memory,2TB, TD 13.10,4AMP configuration. Lance showed a logical data model for retail customers and the data distribution for 150 million row sales table. He pulled the Teradata Administrator view and showed how the data was evenly distributed on 4 amps. Lance also demonstrated the Teradata Miner 5.0  which is used for data profiling. During this session I asked a question on whether they have any data to support benchmarks against competitors either IBM's Netezza or Oracle Exadata.

Read More

Apr 07 2011

Teradata Indexing

One of the challenges design teams face when implementing Teradata design solutions is choosing indexes which help improve performance. What happens if the primary index is not the same as the primary key, and how does one validate that the design decision is correct?

The primary key of a relational table uniquely identifies each record in that table. A logical data model should specify the primary key (PK). If this specification is absent or not clear, the data modeler must provide this information. Indexes are physical design concepts and are related to data base performance. PKs are used for referential integrity, for logical correctness within the data model, and may be NULL. Primary indexes (PI), on the other hand, a physical mechanism for storage, are defined by the SQL create table, may be unique or non-unique, and can allow for changes in the data.

Read More

Mar 25 2011

Data quality and data governance lessons from national health care

Who would want to be a national health care administrator?  Who would want the responsibility for managing health care and formulating health policy for tens or hundreds of millions of people?  It seems obvious that such decisions would rely on quality data.  A recent interview impressed upon me how much data managers can learn from a field where data recording millions of separate life and death decisions aggregates to support decisions on the future allocation of health care resources.

Heather Richards of the Canadian Institute for Health Information (CIHI) was recently interviewed by the Australian magazine Image and Data Manager on CIHI efforts to provide neutral, objective and unbiased information to those making health care allocation policy decisions. Ms. Richards also happens to be Director of Publicity for the International Association for Information and Data Quality (IAIDQ).

In a detailed, concise, and refreshingly buzzword-free conversation, Ms. Richards described CIHI’s approach to improving data quality.  To me, that approach boils down to these three themes:

Read More

Jan 27 2011

Slowly Changing Dimensions – Special Attention Needed

Margaret, who was an average sales person, moved from Washington, DC to Richmond, VA, whose market is one fifth the size, during the month of June.  When the annual evaluations of sales performance were done in the month of December, she was listed as the top performer in the Richmond market resulting in the company promoting her to Sales Director.  The next two highest ranked Richmond salespeople had been the consistent leaders for the last several years and outperformed Margaret since she arrived in Richmond.  Her very high sales numbers during the first six months of the year skewed her average, placing her above the rest of the Richmond area.  In this example, if the decision makers had correct information handy, and used it appropriately, would they have promoted Margaret over her new Richmond peers?

Here is another example.

Read More

Jan 27 2011

Consider the source in health care data integration

The Atlantic, not typically a technical rag, recently presented an article by business and economics editor Megan McArdle on health care data integration entitled “Paging Dr. Luddite”. The article brings to a mass audience an understanding of both the importance and difficulty of data integration, but the title and general anti-healthcare-professional tone seem counterproductive.

The article opens with a compelling example of data silos, detailing the challenges of integrating the many data feeds from sensors attached to a premature baby “as small as the hand that cradles” it. Each sensor produces a data stream, and, in a field trial at Toronto’s Hospital for Sick Children, IBM is integrating that data in near real time to identify early signs of infection.

Read More

Jan 19 2011

Informatica Cloud Express - the Data Integration Software as a Sevice (SaaS) on Cloud

At first glance, hosting applications in the cloud seems to require less setup and operational cost, which accounts for its great appeal. That said, the world is still trying to make sense of how cloud computing can really fit into the enterprise computing puzzle in the midst of so many of the complex challenges we face, like the ever increasing needs for security, continuing dependency on age old legacy systems, ever increasing uncertainty in the current troubling global economic situation, and more. Nevertheless, one thing that is quite clear to me is that cloud computing during this economic uncertainty does offer a great new hope for organizations if only for its “pay as you go” approach. 

Read More

Jan 12 2011

DMBI Tech Tip: Include Audit Columns on all Tables

When I design a database, and local standards permit, I include both a surrogate primary key and audit columns on every table. 

A surrogate primary key is a system-generated integer that increments by one with each new row inserted.  Most DBMSs make it easy to add surrogate keys.  Oracle uses a construct called a Sequence, SQL Server calls its variant an Identity. 

The audit columns I like to include are these, shown with their SQL Server datatypes:  CreationDatetime (datetime), UpdateDatetime (datetime), CreationUserId (varchar (30)), and UpdateUserid (varchar (30)).  Most DBMS’s offer the ability to update columns like these with either default values or triggers.

Read More

Aug 03 2010

Three Tips for Better Data Definitions

If business or IT users insist that their definition is good and everyone knows what they mean when in fact that is not the case, the strategies below may help.

1. Provide examples of unclear vs. clear definitions

Users who are intimately familiar with their business process and supporting systems may not understand the point of specifying exactly what they need. To them "the ID of the customer" is a perfectly acceptable definition of "Customer ID." Or, the IT representative may give a definition that works for them but no one else, such as "the primary key of the customer table". It will help both to see examples of what is needed in order to have a workable definition to support data warehouse population and use of the data.

Read More

Apr 12 2010

Defaulting data integration to customers = risky business

Here’s a little-recognized fact about data integration: if you run a business or any sizable chunk of one, someone is integrating your data.

In my professional life I have, on occasion, suggested data integration efforts.  Sometimes my suggestions have been accepted and sometimes not.  As an IT professional I understand that different managers have different priorities, and in a given business situation sometimes other things may be more important than, for example, having a single, consistent source for all customer records, or making sure production data matches financial data.

But as a customer?  That’s different.

Read More

Feb 15 2010

Using the SQL RANK function to solve an unusual sorting problem in obtaining data for performance tests

Problem:

Our team needed to create a data set listing retail stores belonging to different territories for performance tests that involved store selection.  The data set was to be used by virtual users signing-in as different territory owners.  The challenge was that the territory and store numbers in the data set should be as random as possible.  For example, if there are 20 territories and each territory has 10 stores, the data set should contain the 20 territories with their first stores, followed by the same 20 territories with their second stores and so on.  

Read More

 

Disclaimer

The words and opinions expressed here are those of each article's respective author, and do not necessarily represent the views of CapTech Ventures.