Assignment 1

Prototypes and Basic Level Categories, by Zach Tomaszewski

for LING 640G, Fall 2002, taught by Dr. Ben Bergen


A: We're not allowed to swim the crawl.
B: That's too bad. So I guess you're not allowed to do the butterfly either, then?

In this conversation, B is making use of a typical prototype. Given a very common swimming stroke, he thinks of another, closely related swimming stroke. He is generalizing. Based on the premise that the most prototypical stroke--the crawl--is not allowed, he assumes that other less common strokes must also be forbidden.

A: How's your new computer working out?
B: It's a total Testarossa.
A: Wow. Nice.

A is making use of a paragon exemplar prototype--the Ferrari Testarossa. This is (supposedly) the embodiment of the ideal sports car. Using this as a reasoning point, A probably infers from the analogy that B's new computer is very fast, luxurious, and expensive. What is interesting is that a paragon exemplar for cars can still be used to talk about computers.

A: We're picking up our dog from quarantine today.
B: I hope he's okay. When we got our Fluffy, his tail was missing.

In this case, B is making use of a salient experience. Fluffy losing his tail is an experience important to B, but shared by few others. He is now generalizing from this emotionally scarring episode that quarantines frequently result in dogs losing their tails.

Basic level categories

In information retrieval, especially on the WWW, users frequently enter search terms that are much too general. I believe the terms are usually at the user's basic level of categorization. Some examples are "dogs", "travel", or "smoking", rather than "boston terriers", "cheap flights to Cancun", or "middle east hookahs".

In essence, search engines only match the natural language keywords that make up the documents of the collection (indexed webpages, in the case of the Web). There exist a number of ranking algorithms to try to improve this simple word matching. Some involve neural networks, or mathematical vector-matching between the query and the document, or including other information besides the query itself. Google, the current king of search engines, uses the links to a document to boost its ranking. Simply put, if more people are linking to the webpage, it is probably useful or authoritative, and so it ranks higher in the search results. But, even with this myriad of query modifications and document weightings, improving the original query would probably be the most helpful. People frequently don't write pages at the basic level; they write about specific occurrences. If people searched using the same level of terms as exist in the documents, their searches should be much more relevant.
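To make the two ranking ideas above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how Google or any real search engine works: the function names, the whitespace tokenization, and the logarithmic link boost are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine similarity between the term-count vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    norm = q_norm * d_norm
    return dot / norm if norm else 0.0


def rank(query, pages, inbound_links):
    """Order page names by text similarity, boosted by inbound link count.

    pages maps a page name to its text; inbound_links maps a page name
    to the number of other pages linking to it (both hypothetical inputs).
    """
    def score(name):
        # Assumption: log1p damps very large link counts; the real
        # weighting used by any given search engine is not known here.
        boost = 1 + math.log1p(inbound_links.get(name, 0))
        return cosine_similarity(query, pages[name]) * boost

    return sorted(pages, key=score, reverse=True)
```

Note that a general query such as "dogs" matches every page that mentions dogs at all, so the ordering is decided largely by the link boost rather than by what the user actually wanted. That is the weakness a more specific query would avoid.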

An experiment could be devised to test that people really do search with basic level categories. Using a search engine's log files, we could look at all the "one topic" searches. (That is, those searches that aren't looking for an intersection of two or more subjects, such as dogs AND cats.) Most of the search terms should be at the basic levels revealed by previous categorization experiments. This would show that people do default to the basic level, even when not directly faced with a controlled laboratory experiment.
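The log analysis could be sketched roughly as follows. This assumes the log has already been reduced to a list of query strings and that a set of basic level terms is available from earlier categorization experiments; both inputs, and the function names, are hypothetical. The test for a "one topic" search is a crude heuristic that only looks for boolean operator words.

```python
BOOLEAN_OPERATORS = {"and", "or", "not"}


def is_single_topic(query):
    """Heuristic: a non-empty query containing no boolean operator words."""
    words = query.lower().split()
    return bool(words) and not any(w in BOOLEAN_OPERATORS for w in words)


def basic_level_share(queries, basic_level_terms):
    """Fraction of single-topic queries that are exactly a basic level term."""
    single = [q.strip().lower() for q in queries if is_single_topic(q)]
    if not single:
        return 0.0
    hits = sum(1 for q in single if q in basic_level_terms)
    return hits / len(single)
```

A share close to 1.0 would support the claim that users default to the basic level. The heuristic would wrongly discard a query like "rock and roll", so a real study would need a better way to identify single-topic searches.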

In a follow-up experiment, a search engine could be developed to help users refine their searches. When they enter a single topic or single word search, the page of results would include an input box asking about "what kind or type of [search term]." So, if a person searched for "dogs", the results page would include:

What sorts of dogs?

When faced with this new form, I expect that most users will enter terms at a lower level of categorization than their first query. This would give further evidence that users are capable of switching categorization levels when given clues that it is appropriate to do so.

Either in a laboratory setting or through a feedback form of some type, we could then look at the user's opinion of the relevancy of the two levels of searches. I expect that the more specific searches will be more relevant. Relevancy ranking could even be used as an indicator of category level differences--searches for "vertebrates", for "dogs", and for "boston terriers" should be progressively more relevant as they become more specific.
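One way to check this prediction, assuming each user rating is recorded as a pair of category level and numeric relevance score, is sketched below. The level labels, the rating scale, and the function names are assumptions made for illustration; no real ratings exist yet.

```python
from statistics import mean


def mean_relevance_by_level(ratings):
    """Group (level, score) pairs and return the mean score per level."""
    by_level = {}
    for level, score in ratings:
        by_level.setdefault(level, []).append(score)
    return {level: mean(scores) for level, scores in by_level.items()}


def increases_with_specificity(means, levels_general_to_specific):
    """True if mean relevance rises strictly from each level to the next."""
    values = [means[level] for level in levels_general_to_specific]
    return all(a < b for a, b in zip(values, values[1:]))
```

If the means for "vertebrates", "dogs", and "boston terriers" searches rise in that order, the relevancy ratings would be tracking category level as predicted.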