Data versus Information

Assignment 2, by Zach Tomaszewski

for ICS 691-3, Fall 2001, taught by Dr. Joan Nordbotten

I consider data to be, basically, symbols. Information is the thoughts or ideas conveyed through these symbols.

Baeza-Yates and Ribeiro-Neto seem to offer a very similar definition in their distinction between data retrieval and information retrieval. Computers excel at symbol/data manipulation. In data retrieval, a document either contains a string that matches the query, or it does not. The data is usually well-defined and structured so the computer can process it quickly and accurately. If any semantics can be said to exist for a computer, it is because of this structure. For example, a computer could differentiate between the name of a city and the name of a person only if the data is in the correct fields.
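
To make the city/person example concrete, here is a minimal Python sketch of data retrieval over structured records. The records, the field names, and the function name are all made up for illustration; they do not come from any of the readings.

```python
# Hypothetical structured records; field names are assumptions for this sketch.
records = [
    {"person": "Florence Lee", "city": "Honolulu"},
    {"person": "Sam Jones", "city": "Florence"},
]

def data_retrieval(records, field, query):
    """Return the records whose given field contains the query string.

    A record either matches or it does not; there is no ranking
    and no notion of meaning, only string matching within a field.
    """
    return [r for r in records if query in r[field]]

# The same string is "a city" or "a person" only because of the field it sits in.
city_hits = data_retrieval(records, "city", "Florence")
person_hits = data_retrieval(records, "person", "Florence")
```

Here the only "semantics" available to the program is the field structure: searching the city field finds Sam Jones's record, while searching the person field finds Florence Lee's.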

Information, as ideas and meanings, exists in minds (human minds in most discussions). Yet, excepting possible direct mind-to-mind transfers, any information must be converted to symbols (in a broad sense of the word) in order to be communicated. In information retrieval systems, even though the content is in the form of these symbols and characters, it is usually in loose, unstructured, natural language documents. Natural language is rife with ambiguity, synonyms, and other "imprecisions." Information retrieval attempts to connect a human user's need with a human author's answer through the symbols and data in the system. The inaccuracy of most IR systems is no surprise when one realizes that the system is trying to retrieve meaning through simple symbol pattern-matching.
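
The effect of synonyms and ambiguity on symbol matching can be shown with a small Python sketch. This is a deliberately naive keyword search over three invented documents, not a description of how any real IR system works; the document texts and the function name are assumptions.

```python
import re

# Hypothetical unstructured documents, invented for this sketch.
documents = {
    "doc1": "The automobile would not start this morning.",
    "doc2": "She sat on the bank of the river.",
    "doc3": "The bank approved the car loan.",
}

def keyword_search(documents, query):
    """Return the ids of documents that contain the query as a whole word."""
    q = query.lower()
    hits = []
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if q in words:
            hits.append(doc_id)
    return sorted(hits)

# Synonym problem: doc1 is about a car but never uses the symbol "car".
car_hits = keyword_search(documents, "car")

# Ambiguity problem: "bank" matches both the river bank and the financial bank.
bank_hits = keyword_search(documents, "bank")
```

A user who wants documents about cars misses doc1, and a user who wants documents about finance also gets doc2, because the search matches symbols rather than meanings.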

The IFIP definitions are similar as well: data is a formalized representation manipulated by specific processes, while information is the meaning a human extracts from that data.

Stonebraker draws a distinction between simple and complex data. Though both are types of data, complex data requires some form of special processing or internal methods to be handled by the system. For example, text and integers are usually simple data because they can be handled by standard operations--two numbers, regardless of what field they are in, can be compared for equality. Yet two images cannot be compared without using certain methods that depend on the particular image file formats or the type of comparison needed. Most multimedia objects tend to fall into the complex data category.
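
A small Python sketch may help show the difference. The two image "formats" below are invented for this example (they are not real file formats), and the helper names are my own; the point is only that comparing complex data needs a format-aware method, while simple data does not.

```python
# Simple data: a standard operator is enough, whatever field the numbers came from.
numbers_equal = (42 == 42)

# Complex data: the same 2x2 grayscale image stored in two made-up encodings.
image_a = {"format": "rows", "pixels": [[0, 255], [255, 0]]}
image_b = {"format": "flat", "width": 2, "pixels": [0, 255, 255, 0]}

def decode(image):
    """Convert either hypothetical format into a list of pixel rows."""
    if image["format"] == "rows":
        return [list(row) for row in image["pixels"]]
    if image["format"] == "flat":
        width = image["width"]
        pixels = image["pixels"]
        return [pixels[i:i + width] for i in range(0, len(pixels), width)]
    raise ValueError("unknown image format: " + image["format"])

def same_image(a, b):
    """Compare two images by their decoded pixels, not by their stored form."""
    return decode(a) == decode(b)

# Comparing the stored representations directly says the images differ...
naive_equal = (image_a == image_b)

# ...while a format-aware method finds that they hold the same pixels.
decoded_equal = same_image(image_a, image_b)
```

Even this comparison is only one kind of "equality" (identical pixels); deciding whether two images look alike, or show the same subject, would need yet another method.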

As more information is stored in multimedia formats, retrieval and manipulation of those media are going to become more important. Hopefully, future systems can at least match the retrieval success currently achieved with simple data.

Comment on a posting by Russell Kackley:

>Stonebraker's [3, pgs. 8-19] discussion of complex data
>is related to this discussion of data vs. information.
>In his example of complex data without queries, the
>application using the data must extract meaning from
>the data in order to utilize it. In his example of
>complex data with queries, the data must be analyzed
>in a complex way to determine if, for example, a
>particular photo contains an image of a sunset. This
>is an example of information retrieval.

The class seems quite split on whether Stonebraker's complex data is an example of data or of information. I still believe it is data, but some of the points raised, such as Russell's here, did make me think.

Complex data is still just bits and symbols. It is true that complex data needs to be manipulated or analyzed more than simple data. But I think this is still at the level of symbol-matching. For example, if you are looking for a sunset, the computer will have to look for the string "sunset" in the picture's description. If the application can handle some sort of image scanning or recognition, it will likely look for orange pixels near the top of the picture. It doesn't know what a sunset is; it is merely pattern matching. It may miss some sunset pictures in which the sun has just set, and it may falsely retrieve a picture of a storefront with an orange awning or a picture of an egg yolk. This is because it is just matching pixels or symbols and doesn't understand the meaning or semantics of a picture.
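
The sunset example can be sketched in Python. Everything here is an assumption made for illustration: the tiny images are lists of (red, green, blue) tuples, and the color thresholds and the 50% cutoff are arbitrary choices of mine, not a real image recognition technique.

```python
def is_orange(pixel):
    """Crude, arbitrary test for an 'orange' RGB pixel."""
    r, g, b = pixel
    return r > 200 and 80 <= g <= 180 and b < 100

def looks_like_sunset(image, description=""):
    """Guess 'sunset' from the description string or from orange pixels on top."""
    if "sunset" in description.lower():
        return True
    top_rows = image[: max(1, len(image) // 2)]
    pixels = [p for row in top_rows for p in row]
    orange_count = sum(1 for p in pixels if is_orange(p))
    return orange_count / len(pixels) >= 0.5

ORANGE = (255, 140, 0)
GRAY = (120, 120, 120)
DARK = (20, 20, 40)

# Hypothetical 4x4 test images.
sunset = [[ORANGE] * 4, [ORANGE] * 4, [DARK] * 4, [DARK] * 4]
awning = [[ORANGE] * 4, [ORANGE] * 4, [GRAY] * 4, [GRAY] * 4]
just_after_sunset = [[DARK] * 4 for _ in range(4)]

found_sunset = looks_like_sunset(sunset)
false_hit = looks_like_sunset(awning)             # storefront with an orange awning
missed = looks_like_sunset(just_after_sunset)     # the sun has already set
rescued = looks_like_sunset(just_after_sunset, "Sunset over the bay")
```

The function accepts the awning and rejects the picture taken just after the sun went down unless the description happens to contain the right string. In both cases it is only matching pixels or characters.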

I do realize that my distinction between data and information retrieval is rather technology-dependent. I don't think we currently know enough about how the mind works (or even agree on what a "mind" is) to know how much we humans are simply "pattern-matching." As the algorithms improve (and if the "semantic web" turns out to be all it's dreamed), it may be possible that a computer could perform as well as an average human in accurately interpreting the meaning of data, simple or complex. (For those with backgrounds in cog sci or AI, I'm basically referring to being able to pass the Turing test.) In that case, it would be functionally difficult to claim that the computer is processing only the data and not the information in the system.

So the lines may be a little fuzzy at the borders, but, currently, I still think Stonebraker's complex data is just data, and any processing of it is simply data retrieval trying to approximate true information retrieval.