Friday, October 14, 2005

TMI!!!! Anyone who is remotely involved with the on-line world, with kids between the ages of 5 and 30, or both, understands these initials to stand for Too Much Information. The TMI moniker is normally attached to pieces of information that, though most likely truth-based, bring the recipient close to and then through the personal barriers that stand between polite acquaintances and deposit the conversation/data sharing squarely into the realm of the close friend and confidant.

For the next few minutes, though, I'd like to borrow the label for another use. Today our systems are overloaded with information of all kinds, and we have to adjust those systems so that all of it can be handled in a manageable way.

One of the functions I'm involved in is the optical mark scanning of examinations, which is a good, data-intensive place to be. It's also a good place to illustrate the TMI concept. For the solution to be comprehensive, all the data on each sheet is collected. So even the fields on each sheet that no one ever uses are collected and stored. In our testing system, there is no call to use the Special Codes section of each sheet, yet we must allow for it, just to be safe. The same goes for other demographic information that we don't (and shouldn't, for FERPA reasons) use.

Thus, for each of the nearly one million sheets scanned each year, there are two digits (actually sixteen bytes of information in binary format) for the special code, as well as more than 100 other empty pieces of information. The underlying system treats a blank as a blank, but it must store placeholders for blanks instead of ignoring them, because a question occasionally needs to be omitted. It's not a lot of wasted space, from an overall standpoint, but it is too much. TMI applies to data collection.
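To see why the placeholders can't simply be dropped, here is a minimal sketch in Python. It assumes, purely for illustration, that a sheet's answers arrive as one character per question with a space standing in for a blank; the record layout and the function names are hypothetical, not the scanner's actual format.

```python
from typing import Dict, Optional

BLANK = " "  # assumed placeholder for an unmarked question (hypothetical format)


def parse_answers(record: str) -> Dict[int, Optional[str]]:
    """Map each 1-based question number to its mark, or None if left blank."""
    return {
        number: (mark if mark != BLANK else None)
        for number, mark in enumerate(record, start=1)
    }


def drop_blanks(record: str) -> str:
    """Strip the placeholders, which saves space but loses position."""
    return record.replace(BLANK, "")


sheet = "AB DC"  # five questions, with question 3 omitted

with_placeholders = parse_answers(sheet)
without_placeholders = parse_answers(drop_blanks(sheet))
```

With the placeholder kept, question 4 is still recorded as D. Without it, D slides into question 3's slot and every later answer is scored against the wrong key, which is why the blank has to be stored even though it carries no answer.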

On the processing end, there is also too much information to allow for proper processing. One example is the truthful answer to a question meant for an entirely different purpose. When supplying the information needed to process the exams, one question asked on the informational card is for the section number of the course. If someone brings in exams for two different sections and fills in both section numbers, but doesn't separate the sheets by section (most commonly done when both use the same answer key), then the answer is meaningless, because the entire intent of the question is to provide accurate labelling of results. TMI applies to data accuracy.

On the user's end, there is the problem of TMI as it applies to reporting. It is a basic statistical principle that you cannot draw a reliable conclusion if you don't have enough samples. While the 'magic number' varies slightly depending on the expert, opinion is unanimous that a distribution drawn from a sample size of 15 is useless. Despite that, several users request and are given statistical distributions on classes of that size or smaller. TMI applies to data relevance.
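As a rough sketch of what refusing an undersized report could look like, the Python below declines to summarize a class below a minimum size. The cutoff of 30 is only an assumption for illustration (the 'magic number' really does vary by expert), and the function name and the shape of the report are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Optional, Sequence

MIN_SAMPLE_SIZE = 30  # assumed cutoff for illustration; pick your own expert's number


def score_distribution(
    scores: Sequence[float], min_n: int = MIN_SAMPLE_SIZE
) -> Optional[Dict[str, float]]:
    """Return summary statistics, or None when the class is too small to report on."""
    n = len(scores)
    if n < min_n:
        return None  # too few samples: no distribution is better than a misleading one
    return {"n": n, "mean": mean(scores), "stdev": stdev(scores)}


small_class = list(range(15))  # 15 scores: below the cutoff
large_class = list(range(30))  # 30 scores: meets the cutoff

small_report = score_distribution(small_class)
large_report = score_distribution(large_class)
```

Here the class of 15 gets no report at all, while the class of 30 gets its count, mean and standard deviation. Returning nothing, rather than numbers with a warning attached, is a deliberate choice in this sketch: a warning is easy to ignore once the numbers are on the page.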

I like to offer solutions along with the issues I point out, but on this occasion I'm struck by the size of the task, and can suggest only surprisingly non-technical ways to improve things. Here, then, is the naive approach to the TMdataI problem:

NO - say this when there is marginal or no residual value to the data being collected. If it was needed in the past, then you need to consider whether it will be needed in the future, but in most cases, when data stops being relevant (the number of 8-track players sold in a year, for example), it needs to stop being part of a data schema. This is, after all, how the cent symbol came to be left out of ASCII when the standard first came out. They were trying to economize on which letters and symbols made the cut, and they realized that so many places were marking prices with a dollar sign and decimal notation that the symbol wasn't needed. It did come back as character sets expanded, but it is an anachronism that is fading still further from the consciousness.

SHOW - use this when someone wishes to provide data but not the specificity you need to know accurately when, where and how to handle it. If they can show the need and the level of accuracy that will support the data manipulation and storage, then it is a viable effort. If they know that they want something but can't really specify what it is, then there is justification to refuse the data. This is what gets so many systems into trouble as they proceed through the design process, dumbing themselves down into a bloated and inefficient form.

JUST GO! - use this when a user or other individual is requesting data manipulation that is either worthless or dangerous. The availability of computing power to handle mathematics does not do away with the fundamental requirement that those attempting to use it understand what they are trying to do. I would be interested to know how many terrible decisions have been made as the result of faulty assumptions based on faulty information based on faulty use of data, and I can only speculate about the number of zeroes that you'd have to write in an accurate estimate of the overall cost.

In closing, I have to say that we can only understand and properly deal with data when we stop thinking of it as ones and zeroes and begin to treat it as a commodity, as we do other things that are helpful and powerful at the same time. We need to think of ourselves as virtual bartenders. In the real world, that position is charged with properly monitoring the alcohol consumed, and what is done with it. Too much, and there is a bar full of drunken customers who may then be dangers to themselves or others, and the bartender may be liable if they are, so s/he is courteous, comradely, and cautious at the same time. Some people require more scrutiny than others when faced with large supplies of an intoxicant (and what, after all, is more intoxicating than data?), and that is good advice for those of us monitoring, collecting and processing data.