Data Quality and Systems Theory
A Paper by Ken Orr
© Copyright 1996 by The Ken Orr Institute; revised edition 1997
Published in Communications of the ACM, May 1998
In the movie "War Games," set in the Strategic Air Command Center under a mountain near Colorado Springs, Colorado, there is a critical scene: on a huge map of the globe, there are arrows indicating large numbers of ICBMs coming in from the Soviet Union. The general in charge is trying to decide if he should ask the President for permission to launch a retaliatory attack, because the technicians are telling him that the threat is real. At this point, the scientist who originally designed the system and who knows that something is wrong with the system itself says, "General, what you see on that board is not reality, it is a computer generated hallucination!"
Recently, a former Director of the CIA revealed that a real-life version of this fictional scenario was actually played out when a test tape was inadvertently installed and the screen at a similar center warned of a similar nuclear attack. As computers play a more and more important role in the real-world--a world in which the computer-generated outputs often present a picture of the real-world for critical activities--it is increasingly vital that that picture be correct!
2. A Systems Model for Data
In the mid-1970s, I, and a number of my colleagues, developed a model for information systems that predicted: (1) major computer systems problems involved with making the transition to the Year 2000; (2) data quality difficulties in many operational systems being developed at the time; and (3) fundamental issues involved in the accuracy of confidential/secret data.
The theory that allowed us to formulate our predictions involved viewing information systems as subsystems imbedded in a larger framework of a real-world feedback-control system (FCS) (Figure 1). Two observations caused us to look at information systems this way: (1) all of the information systems we developed operated in a larger, goal-seeking, organizational environment, and (2) those systems that failed to take into account that larger FCS context were difficult to operate and their outputs difficult to reconcile with the real-world. We began to see that the data and data quality in our information systems did not exist in a vacuum. As a result, we began to explore the implications of a true systems (cybernetic) model for information systems.
The principal role of most information systems is to present views of the real-world so that the people in the organization can create products or make decisions. If those views do not substantially agree with the real-world for any extended period of time, then the system is a poor one, and, ultimately, like a delusional psychotic, the organization will begin to act irrationally.
3. Defining Data Quality
From the FCS standpoint, data quality is actually easy to define. Data quality is the measure of the agreement between the data views presented by an information system and that same data in the real-world. A system's data quality of 100% would indicate, for example, that our data views are in perfect agreement with the real-world, whereas a data quality of 0% would indicate no agreement at all.
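The definition above can be made concrete with a small sketch. The Python below is only an illustration and not part of the original model: the function name `agreement_rate`, the record layout, and the sample values are all assumptions, and in practice the "real-world" side would come from some independent verification rather than a second table.

```python
def agreement_rate(system_view, real_world):
    """Return the fraction of fields in system_view that match real_world.

    Both arguments are dicts mapping a record key to a dict of field values.
    Every field present in the system view is checked; a record missing
    from the real world counts as a disagreement for all of its fields.
    An empty system view is treated as trivially in agreement (1.0).
    """
    checked = 0
    agreed = 0
    for key, fields in system_view.items():
        actual = real_world.get(key, {})
        for name, value in fields.items():
            checked += 1
            if name in actual and actual[name] == value:
                agreed += 1
    return agreed / checked if checked else 1.0


# Hypothetical data: what the system says vs. what is actually true.
system_view = {
    "cust-1": {"city": "Topeka", "phone": "555-0100"},
    "cust-2": {"city": "Denver", "phone": "555-0101"},
}
real_world = {
    "cust-1": {"city": "Topeka", "phone": "555-0100"},
    "cust-2": {"city": "Boulder", "phone": "555-0101"},
}

quality = agreement_rate(system_view, real_world)
print(f"data quality: {quality:.0%}")  # data quality: 75%
```

Here three of the four stored fields agree with the real-world, so the measure is 75%.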
Now, no serious information system has a data quality of 100%. The real concern with data quality is to insure not that the data quality is perfect, but that the quality of the data in our information systems is accurate enough, timely enough, and consistent enough for the organization to survive and make reasonable decisions.
Ultimately, the real difficulty with data quality is change. Data on our databases is static, but the real-world keeps changing. Even if our system has a database that is 100% in agreement with the real-world at time t0, at time t1 it will be slightly off, and at time t2 it will be even further off. FCS theory states that if you want a system to track the real-world, then you must have some mechanism to keep the data in the system in synch with changes in the real-world--you must have feedback!
But where does this feedback come from? The classic answer from information systems developers is that feedback is solely the responsibility of the users of the system. "Our job is not to understand what our systems are being used for or even their context!" the systems developers maintain. "We simply build systems that meet the requirements of our users--it is the job of the users to insure that the data on our databases is maintained in an accurate and timely manner. The best we can do is to insure that the database is internally consistent and that the user's business rules are enforced."
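The drift from t0 to t1 to t2 can be illustrated with a deliberately simple model. This is a sketch under stated assumptions, not a result from FCS theory: it assumes a fixed per-period chance (`change_rate`) that a correct record goes stale and a fixed per-period chance (`correction_rate`) that a stale record is noticed and corrected. Both parameter names and the numbers used are hypothetical.

```python
def agreement_over_time(change_rate, periods, correction_rate=0.0):
    """Expected fraction of records agreeing with the real world per period.

    change_rate: assumed probability that a correct record goes stale in
        one period.
    correction_rate: assumed probability that a stale record is noticed
        and corrected in one period (the feedback). Zero means no feedback.
    Returns a list whose first entry is the agreement at t0.
    """
    agreement = 1.0  # start fully in agreement at t0
    history = [agreement]
    for _ in range(periods):
        stale = 1.0 - agreement
        # Some correct records go stale; some stale ones are corrected.
        agreement = agreement * (1.0 - change_rate) + stale * correction_rate
        history.append(agreement)
    return history


no_feedback = agreement_over_time(0.10, 12)
with_feedback = agreement_over_time(0.10, 12, correction_rate=0.50)

print("without feedback:", [round(a, 2) for a in no_feedback[:3]])
print("after 12 periods:", round(no_feedback[-1], 2), "vs", round(with_feedback[-1], 2))
```

With a 10% change rate and no feedback, agreement falls to 0.9 after one period and 0.81 after two, and it keeps falling. With feedback, this model levels off near correction_rate / (change_rate + correction_rate) instead of decaying toward zero.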
Users, on the other hand, have historically felt that they were held responsible for data quality in information systems that they often did not understand, where it was often difficult to make appropriate corrections, and where the results of certain kinds of changes were unpredictable.
Unfortunately, as it turns out, the problem of data quality is fundamentally tied up in how our system fits into the real-world; in other words, with how users actually use the data in the system. In fact, two things have to happen for data on any database to track the real-world: (1) someone or something, e.g., an automatic sensor, has to compare the data views from the system with data from the real-world, and (2) any deviations from the real-world have to be corrected and re-entered.
Too often, systems developers have an overly simplistic view of how systems are organized; they think of systems in a simplistic Input-Process-Output (IPO) "transform" model (Figure 4).
But, this IPO model fails to account for the role that the database plays in a broader context (Figure 5).
In real information systems, the database acts to mediate between the input and the output, where the input and output: (1) occur at different times, and/or (2) represent different views of the real-world. This broader view of a system then makes it possible to understand fully the FCS model, in which the information system fits within actions taken in the real-world (Figure 6).
In this model, data is entered in the system based on external inputs, it then undergoes processing and gets stored in a database, which in turn is processed to produce outputs that are used in (compared with) the real-world. Finally, new inputs are produced (and fed back) so that the database can be kept correct. Without this final loop, the system will fail to maintain its database, and, therefore, its outputs correctly. This final FCS model (Figure 6) allows us to understand more fully the true problem of data quality--the better our information system fits within the real-world, the better the quality of our data will be; the worse the fit, the worse the data.
4. Data Quality Rules
There are a number of general Data Quality Rules that one can deduce from an FCS view of information systems:
4.1. Data Quality: Use It or Lose It
Unfortunately, many, if not most, of these data quality rules fly in the face of traditional systems practice. It is common practice, for example, to collect large numbers of unused data elements on the premise that someday someone might want to use them, and it is cheaper to put them in the system now than to do so when the need actually arises. However, applying the FCS model, it is clear that if an organization is not using data, then, over time, real-world changes will be ignored and the quality of the data in the system will decline.
In biological systems, scientists refer to this phenomenon as atrophy--if you don't use a part of the body, for example, then it atrophies. In a practical sense, something similar to atrophy happens with unused data--if no one uses the data, then the system becomes insensitive to that data. As with an individual who is blind or deaf, certain changes in the real-world are not perceived.
In most large systems, however, it is difficult to tell if particular data elements are actually being used. For example, some poor quality data elements appear on reports or screens but no one actually uses those reports or screens. In other cases, the data elements are used, but not very seriously. Here DQ3 comes into play, namely, that the quality of a specific piece of data will be no better than its most stringent use. In general, the data quality of data that is not stringently used will be better than data that isn't used at all, but not much better. For example, names and addresses that are used only for mailing lists and are not corrected based on returned mail tend not to be very accurate.
The nature of data quality, then, hinges upon the connections of that system to the outside world. The stronger those connections, the better the system and the better the data quality.
4.2. Data Warehousing and Data Quality
Recently, as large organizations have begun to create integrated Data Warehouses for decision support, the resulting data quality problems have become painfully clear. They have discovered, for example, that the quality of the data in their legacy databases is their single biggest problem. One data manager for a large company reported that fully 60% of the data that was transferred to their Data Warehouse failed to pass the business rules that the systems operators had said were in force, something that could have perhaps been predicted based on poor data usage.
On the plus side, just developing Data Warehouses represents a quantum leap forward in terms of end-user usage of data. As more advanced Data Warehouses and Data Marts are created, for example, more people will be using data in more stringent ways. The need for quality data has already begun to focus more management attention on just how poor their data quality is.
In the 1970s, when I and my colleagues first began to understand the implications of the FCS model, it became clear why so many of the systems we had worked on failed to meet their data quality objectives. We had developed systems that created data that no one used. We recognized that such systems were difficult to define, difficult to program, and difficult to operate--what we didn't understand was why.
Because of the FCS model, we created a new development approach that helped insure that all data collected and stored would actually be used. That approach involved designing systems by working backward from uses to outputs to database to inputs in a controlled fashion. In one case, we found that the legacy system being replaced had three times more data elements than it actually needed. Imagine the problems with data quality. Attempting to build quality systems without understanding FCS theory is much like attempting to build an airplane without understanding aerodynamics.
4.3. Data Quality and The Year 2000
It was also clear early on that the Year 2000 would create a serious problem because there would be very little use of the millennium and century fields until the Year 2000 actually arrived. Unfortunately, we failed to see just how massive the problem would actually turn out to be. Estimates now range in the hundreds of billions of dollars. The real problem involved with finding, fixing and testing the changes to the Year 2000 dilemma is not its difficulty but its ubiquity; it represents a simple problem repeated a billion times. Could the Year 2000 problem have been avoided? Possibly, but only if some form of "use-based" data quality programs had been in place. Millennium and century fields have not been tested on a large scale because the "time-horizons" of these systems do not yet use those fields. As the Year 2000 approaches, more and more systems will fail because their systems practice will be forced to actually use dates that occur in the 21st Century.
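A minimal sketch of why unused century digits go untested until the time horizon reaches them. The functions below are hypothetical and assume, for illustration only, interval logic that reads just the two-digit year; they are not drawn from any particular system.

```python
def years_elapsed_two_digit(start_yy, end_yy):
    """Interval logic that reads only the two-digit year and ignores century."""
    return end_yy - start_yy


def years_elapsed_full(start_year, end_year):
    """The same calculation using the full four-digit year."""
    return end_year - start_year


# While every date falls in the 1900s, both versions agree, so nothing
# ever exercises (or tests) the century digits.
assert years_elapsed_two_digit(65, 99) == years_elapsed_full(1965, 1999)

# Once an end date falls in 2000, the two-digit version breaks.
print(years_elapsed_two_digit(65, 0))   # -65
print(years_elapsed_full(1965, 2000))   # 35
```

The two-digit version passes every check made with 20th-century dates, which is exactly why the defect stays hidden: no use of the data ever compares the century with the real-world until the calendar forces it.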
4.4. Data Quality, System's Age and Meta-data
As systems get older, their data quality problems get worse. In the 60s and 70s, it was widely thought that the lifespan of the average information system would only be a few years; therefore, it didn't make sense to try to put in place costly data quality programs, since any problems or shortcomings in the current system would be corrected in subsequent versions. In fact, major information systems have turned out to be much longer lived than anyone would have anticipated. There are large numbers of legacy systems in operation today that date back 20 or 25 years. Consequently, it is necessary to assess the impact of time on data quality.
We have found that not only does data quality suffer as a system ages, so does the quality of its meta-data. Clearly, what happens is that people who are responsible for entering the data discover which data fields are not used; they then either make little effort to enter the correct data, or they begin to use the data for other purposes. The consequence is that both the data and the definitions of the data (the meta-data) no longer agree with the real-world.
Another predictable problem occurs where the data model used in our systems differs significantly from the real-world. In such cases, the structure of the data in the system no longer agrees with the current structure of the business. Typically, systems designers do not actually look at the structure (patterns) of data that occurs in various fields, but rather arbitrarily assign data to fixed fields based on technology limits or constraints. Because the developers are not noticing changes in the data structure, the structure is not changed, and as a result, round data is forced into square holes.
4.5. Data Quality and Secrecy
One of the most troubling implications of the FCS model for data quality has to do with confidentiality and secrecy. If the quality of data is truly wrapped up in its use, then there seem to be serious limitations to the quality of confidential/secret data. One implication seems to be that confidential/secret data will always have limited quality. This may account for the fact that while dictatorships seem to be an efficient way to run a society, democracies, for all their inherent inefficiencies, work better. A free press and an open political process, though bothersome, provide feedback and therefore keep data quality high.
4.6. Data Quality and Information Overload
It is not clear just yet what impact information overload will have on data quality. In our modern technologically based society, there may be such a thing as too much data. Unfortunately, one of the results of such enormous amounts of data being available is the difficulty in finding important data and being able to compare similar data from different sources.
5. Use-based Data Quality Programs
If data quality is a function of its use, there is only one sure way to improve data quality--improve its use! We call this use-based data quality. Use-based data quality programs are built around finding innovative, systematic ways to insure that critical data is used. Such programs involve:
5.1. Use-based Data Quality Audits
To improve our data quality, it is necessary to get a good handle on just how good the data in our databases is today. Use-based Data Quality Audits involve answering a number of key questions:
For the most part, data quality audits are best done using statistical sampling. It is rarely a good idea to try to verify all of the data on a real database; it is necessary to create a sufficient sample that will enable us to draw meaningful conclusions.
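A sampling-based audit of this kind might be sketched as follows. This is an illustration under assumptions, not a prescribed audit procedure: `audit_sample` and the `is_correct` callback are hypothetical names, the records are synthetic, and the interval uses the simple normal approximation, which is rough for small samples or for rates near 0% or 100%.

```python
import math
import random


def audit_sample(records, is_correct, sample_size, seed=None):
    """Estimate the share of correct records from a simple random sample.

    records must be a sequence (for example, a list).
    is_correct is a caller-supplied check against the real world
    (for example, a phone call or a physical inspection).
    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(records, sample_size)
    correct = sum(1 for record in sample if is_correct(record))
    estimate = correct / sample_size
    margin = 1.96 * math.sqrt(estimate * (1.0 - estimate) / sample_size)
    return estimate, margin


# Synthetic database: every fifth record is wrong, so 80% are correct.
records = [{"id": i, "ok": i % 5 != 0} for i in range(10_000)]

estimate, margin = audit_sample(records, lambda r: r["ok"], 400, seed=1)
print(f"estimated correct: {estimate:.1%} +/- {margin:.1%}")
```

Checking 400 records instead of 10,000 gives an estimate with a margin of a few percentage points, which is usually enough to decide whether a data area needs attention.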
5.2. Use-based Data Quality Redesign
To improve data quality, it is mandatory to improve the linkage among the uses of data throughout the system. One of the problems is where to begin. In point of fact, while most legacy environments contain hundreds of records (tables) and thousands of data elements, not all data is equal. In most systems, there are a few critical sets of data that make all the difference. Often, the "customer," "product," "order," and "organizational structure" data are most important. The first step in a serious data quality redesign program then is to identify the critical data areas.
The first element of redesign involves a careful reexamination of how the critical pieces of data are used. Normally this is most manifest in two areas--the basic business processes (order entry through fulfillment, etc.), and decision support. Use-based design means focusing on exactly how the data will be used, and in trying to identify inventive ways to insure that the data is used more strenuously. In a great many cases, this means becoming more creative in getting the people most knowledgeable about the data to take responsibility for it.
A good example is the "frequent flyer" programs offered by the airlines. In addition to creating customer loyalty, such programs also go a long way to improving the quality of the data. In the normal case where the same flyer may have more than one Frequent Flyer Number assigned, it is in the best interest of the customer to make sure that the records are consolidated and vital information such as name, address, family relationships and preferences are kept up to date. This is the best kind of data quality program--the data subject keeps the data correct!
Developing use-based data quality means expending much more effort on the actual process of completing the feedback loop in the use of data. In general, that means reducing the number of data elements collected. If data cannot be maintained correctly, then it is questionable whether that data provides any value to the enterprise.
Another major component of use-based design is to understand the content of the existing critical databases. A number of tools have emerged in recent years that aim at analyzing and combining data from multiple databases to create a common view of "customers", "products", "vendors", etc. The normal result of these programs is to dramatically reduce the size of (consolidate) major databases. Consolidations of 5:1 or even 10:1 are not uncommon. A second byproduct that is derived from this process is the development of a much more sophisticated set of meta-data based on data content.
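The idea of combining records from several databases into a common view can be sketched with a crude exact-key match. This is a toy illustration only: the function names, field names, and sample records are hypothetical, and it does not represent how any particular consolidation tool works.

```python
import re
from collections import defaultdict


def match_key(record):
    """Build a crude matching key from name and postal code.

    Lowercases, then drops everything except letters and digits, so that
    'Acme Corp.' and 'ACME CORP' compare equal.
    """
    name = re.sub(r"[^a-z0-9]", "", record["name"].lower())
    postal = re.sub(r"[^a-z0-9]", "", record["postal"].lower())
    return (name, postal)


def consolidate(*sources):
    """Group records from several sources that share a matching key."""
    groups = defaultdict(list)
    for source in sources:
        for record in source:
            groups[match_key(record)].append(record)
    return dict(groups)


# Hypothetical customer records held in two separate systems.
billing = [{"name": "Acme Corp.", "postal": "66614"},
           {"name": "Baker & Sons", "postal": "80903"}]
shipping = [{"name": "ACME CORP", "postal": "66614"},
            {"name": "Baker and Sons", "postal": "80903"}]

groups = consolidate(billing, shipping)
total = len(billing) + len(shipping)
print(f"{total} records consolidated into {len(groups)} groups")
# 4 records consolidated into 3 groups
```

The two Acme records merge, but "Baker & Sons" and "Baker and Sons" do not, which shows why exact keys alone fall short and why the resulting match patterns become useful content-based meta-data in their own right.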
Another technique for improving data quality is to promote (demand) "sharing" data through the use of "common databases." With the advent of the Internet, more and more people are able to access data more easily. Providing easy data access to a broad audience has the long-term effect of dramatically improving our data quality.
5.3. Use-based Data Quality Training
One of the major problems with data quality is getting both users and managers to understand the fundamentals of data quality. For any data quality program to work long term, a significant amount of time must be devoted to education and training in the nature of use-based data quality. It is hard to convince users and managers who have been used to requiring all sorts of data arbitrarily that data that isn't used won't be any good. Fortunately, people become converted after a relatively short time, when they begin to see the effects.
5.4. Use-based Data Quality Continuous Measurement
Data quality requires constant measurement to insure that use-based practices are followed through. As Deming noted, most quality problems are systems problems, not worker problems. However, individual errors contribute to poor quality data as well. Measurement and quality programs must go hand-in-hand. Periodically, all of the same questions that were raised in the Data Quality Audit need to be asked again for the redesigned system as well.
A final note on measurement: do not be persuaded by internal measures without external verification. All that internal measurement can ultimately insure is that our data is internally consistent. No large organization can rely on its inventory records without periodic "physical inventories." History has shown that having records that show that we should have 23 computers in Warehouse X does not mean that there are actually 23 computers on the shelves. If we want the data on our databases to agree with the real-world, we must periodically verify that those computers actually exist where our system says they are, and we must take actions to reconcile any differences. Data that is truly vital must be physically audited.
Too often, the primary focus of data quality projects is to increase the internal controls involved in entering and editing data. As laudable as these efforts are, they are ultimately doomed to failure, as are one-shot attempts to clean up data. The only way to truly improve data quality is to increase the use of that data. If an organization wants to improve data quality, it needs to insure that there is stringent use of each data element.
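The reconciliation step of a physical inventory can be sketched in a few lines. The function name `reconcile` and the quantities are hypothetical; only the 23 recorded computers in Warehouse X come from the example above.

```python
def reconcile(recorded, counted):
    """Compare recorded quantities with a physical count.

    Returns a dict of item -> (recorded_qty, counted_qty) for every item
    where the two differ, including items that appear on only one side.
    """
    differences = {}
    for item in sorted(set(recorded) | set(counted)):
        on_record = recorded.get(item, 0)
        on_shelf = counted.get(item, 0)
        if on_record != on_shelf:
            differences[item] = (on_record, on_shelf)
    return differences


# Hypothetical figures for Warehouse X.
recorded = {"computers": 23, "monitors": 40}
counted = {"computers": 21, "monitors": 40, "printers": 2}

print(reconcile(recorded, counted))
# {'computers': (23, 21), 'printers': (0, 2)}
```

No amount of internal consistency checking on `recorded` alone would reveal either difference; both appear only when the records are compared with an external count.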
Because of the problems created by the Year 2000, every organization in the world that uses computers will have to step up to the problems of data. This, coupled with the increased need for quality data for decision making, will make data quality a high priority item in every enterprise. Use-based Data Quality provides a theoretically sound and practically achievable means to improve our data quality dramatically.
The Ken Orr Institute
5883 S.W. 29th St., Suite 101, Topeka, KS 66614
785.228.1200 Fax: 785.228.1201