Data Profiling – Cross-Database Validation

With a collection of quick and simple checks, Data Profiling gives you a much better understanding of your data. You can find issues before embarking on any data project; issues which will cost you much more to put right later in the project life-cycle.

In this article we’re going to focus on one of the more advanced aspects of Data Profiling: cross-database checks and validation. Unfortunately, many tools do not support cross-database analysis, and you will often need to load all the relevant sources into the same database or repository to perform such checks.

But even given this extra step, cross-database validation is a very worthwhile exercise, and will pay back handsomely on any data initiative:

* Data integration projects will by their very nature require the analysis and comparison of multiple data sources.
* On any data migration project you will want to validate both the source and loaded datasets.
* Even with a “single” database project you will find that there are usually various authoritative data sources strewn across the business (often in the shape of Excel spreadsheets and personal datasets) which need to be cross-checked with the target database.

To cope with all this you will want to perform a number of cross-database checks. In effect you’ll be data profiling several sources and comparing their resulting profiles. Specifically, you should consider:

* Comparison of codes used in the various systems. If not identical, is there an appropriate mapping between the codes?
* If a field has too many distinct values to compare one by one, Social Security Numbers for example, then compare their patterns/formats instead.
* If entities are expected in more than one system, then you can compare keys in both systems to check for duplicate or missing entries. And of course, if you’re expecting the data in the systems to be unique, you should still check for, and investigate, any duplicates.

Cross-database validation is not trivial, but it’s not that hard either. The checks are easy to understand and communicate, and any issues found are generally significant. It is therefore something which you should always undertake as part of any Data Profiling exercise.
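
To make the three checks described above concrete, here is a minimal sketch in Python. It assumes the values from each system have already been extracted into plain lists (for example, one query per database). The function names, the `9`/`A` pattern notation and the sample data are illustrative assumptions, not part of any particular profiling tool.

```python
from collections import Counter
import re


def compare_codes(source_codes, target_codes, mapping=None):
    """Find codes with no counterpart in the other system.

    `mapping` (hypothetical) translates source codes to target codes;
    codes without a mapping entry are compared as-is.
    """
    mapping = mapping or {}
    target = set(target_codes)
    mapped_source = {mapping.get(c, c) for c in source_codes}
    return {
        # Source codes whose (mapped) value does not exist in the target.
        "source_only": {c for c in set(source_codes)
                        if mapping.get(c, c) not in target},
        # Target codes that no source code maps to.
        "target_only": target - mapped_source,
    }


def pattern_of(value):
    """Reduce a value to its format: digits become 9, letters become A."""
    digits_masked = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", digits_masked)


def pattern_profile(values):
    """Count how often each format occurs in a column."""
    return Counter(pattern_of(v) for v in values)


def compare_keys(source_keys, target_keys):
    """Find keys missing from either system, and duplicates within each."""
    src, tgt = Counter(source_keys), Counter(target_keys)
    return {
        "missing_from_target": set(src) - set(tgt),
        "missing_from_source": set(tgt) - set(src),
        "duplicates_in_source": {k for k, n in src.items() if n > 1},
        "duplicates_in_target": {k for k, n in tgt.items() if n > 1},
    }


# Illustrative data only.
codes = compare_codes(["M", "F", "U"], ["Male", "Female", "X"],
                      mapping={"M": "Male", "F": "Female"})
# codes["source_only"] == {"U"}; codes["target_only"] == {"X"}

profile = pattern_profile(["123-45-6789", "123456789"])
# Two different formats, one occurrence each: "999-99-9999" and "999999999"

keys = compare_keys([1, 2, 2, 3], [2, 3, 4])
# Key 1 is missing from the target, key 4 from the source,
# and key 2 is duplicated in the source.
```

Comparing the pattern profiles of two sources side by side quickly shows whether one system stores a value with separators and the other without. On large tables the same comparisons would normally be pushed into the database as set-based queries; the in-memory version here only illustrates the logic.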