GopherCluster: Document Visualization in the Third Dimension Paul Lindner lindner@boombox.micro.umn.edu GopherCon `95 ABSTRACT In it's present form Gopher is a one-dimensional model; a simple linear list. By using three dimensions to display the same data we can visualize the innate relationships between documents. One way of visualizing a database of text documents is by putting similar documents close together in what we call a Cluster. This paper describes a new system, called "GopherCluster". This system attempts to build document clusters automatically by using an iterative force function. 1. Introduction Until recently information systems such as Gopher and the World-Wide-Web have been confined to linear lists or graphical pages. These systems use real-life metaphors to visualize and organize information. Overall the book metaphor is the most popular. Infosystems routinely support the equivalent of: · Table of contents · Indices · Footnotes · Quoted references to other material Recently we've started to see non-book metaphors for organizing information. The Blue-Skies Gopher client allows interactive spatial organizations of information based on links overlaid on a graphic image. Similar software, called clickable image maps, exists in World Wide Web software as well. Still, for most infosystems we're still stuck in the book frame of mind. Most lists of documents are simple linear listings of references; most infosystem indices mimic their real-world counterpart. The advent of three dimensional software that runs on inexpensive hardware should change this. With a third dimension it's possible to visualize information in ways never dreamed possible. Three dimensional Systems offer a three-fold increase in the number of dimensions that can be simultaneously represented. Consider the Gopher information system. The following are the traits of a Gopher reference: · Name of item · Location in list · Type of item · Abstract for item · Reference to a remote item. · Miscellaneous metainformation The World-Wide-Web adds the following traits to an information system · Font Size · Font Style (Bold, Italic, etc.) · Associated graphic icon When we move to the third dimension we gain many new traits to describe information. · Colors · Size in three dimensions · Abstract shape representation · Vertical position and size · Horizontal position and size · Depth position and size · Orientation of object · Associated text · Associated sounds and scores · Location of object relative to other objects We can map most of these traits to data within the information system. The height of an object can be mapped from the size of the document. The abstract shape can represent the type of the document, or a preview of what the document contains. A harder problem is mapping the relative location onto data within the information system. The software described in this paper seeks to do just that. Some information systems allow you to find the similarity between two documents. With this data in hand we set out to group the information into distinct separate areas called clusters. A cluster should be representative of a group of documents that are self similar. For instance a recipe information system could group all the cake recipes into one area and all the meat recipes in another and so on. The problem with clustering is how to do it efficiently for large sets of documents. We could do it manually. This organization would have the highest quality since human thought would go into the organization of the information. For large document sets we need to look beyond manual organization. A software solution is the most efficient answer. We hope that GopherCluster can fill this need. 2. Our Clustering Algorithms 2.1 Prior Work We did some literature searches for information on clustering in information systems but came up very short. It is a difficult problem to represent documents in clusters. Still, some work has been done in this area · Jussi Karlgren at the Sweden Institute of Computer Science (SICS) has done a lot of work in the area of clustering documents based on user behavior patterns. · Scatter-Gather is a system that organizes information into different `classes' that the administrator defines. It was developed at Xerox PARC. · There was at one time an implementation of clustering for the SMART system. This was one of the reasons we initially looked at it. However that code does not work on the current SMART code base. In most cases the focus of these studies was more on grouping items together than representing a scene in three dimensions. We're still looking for more prior work in this area. The social sciences have long done multivariate cluster analysis. This has allowed scientists in these fields to come up with generalizations of their data samples: · Smokers have a high risk of cancer. · Children in high income families score higher on standardized tests. · People who eat bran muffins and sleep on their side are more likely to watch comedies on television. We found these algorithms and methods interesting, but we found it difficult to apply them to our problem. A number of factors kept us from using standardized clustering algorithms, including: · A lack of independent variables. Most cluster analysis systems assume you have a large set of data that you want to find clusters for. We only had the similarity between items. · We have no method of automatically generating the cluster attributes. This generally requires a semantic understanding of the content of the documents. This would require human understanding or a very good expert system. Most of the algorithms seem focused on grouping the items into clusters, not organizing and mapping them to a three dimensional representation. We want to use the human spatial senses to visualize the clusters, not a computer telling us where the clusters are. 2.2. Our Algorithms Instead we set out on our own with a gut feeling of how clustering could be done. We had the following information: Pd[x,y] initial location of document d S(Pd1,Pd2) similarity of documents d1 and d2 We can represent the similarities between documents as a matrix like this: TABLE 1. Matrix of similarity values d1 d2 d3 d4 d1 1 0.34 0.04 0.55 d2 0.34 1 0.22 0.33 d3 0.04 0.22 1 0.08 d4 0.55 0.33 0.08 1 We'll call this matrix Sim[d1,d2] We tried a number of approaches to clustering documents, most were unsatisfactory: · Our first try represented the x-axis as the similarity to the highest ranked document. The y-axis was represented as the similarity to the second highest ranked document. This didn't work well. We ended up with one big clump in most cases. · Next, we tried representing things in polar coordinates. Less relevant documents were farther away, more relevant ones are closer. Documents similar to the best document were swept out right to left. This really didn't generate clusters either. It was an interesting representation though. Neither approach generated what we would consider an actual cluster. Finally, we hit upon a method that seemed to work. We stopped thinking of ways to generate variables to plot onto the different axis. Instead we transformed our system into a physical model. We considered each document to be a mass on a two dimensional grid. This allowed us to map the similarity factors to a `force' between the two entities. The following steps led us to what seemed to be actual clustering. 1. Calculate the initial location of the documents on the grid. We find the square root of the number of documents and spread the documents over a square that size. 2. Calculate the accumulated forces on each document. Each document pulls or pushes other documents based on their closeness. 3. Move the documents to their new locations. 4. Check to see if a significant portion of the documents are close together. Go back to step 2 if this hasn't happened. We used the following force functions to move the documents around given D= the number of documents: (EQ 1) Force functions used in GopherCluster The final location of the point is the sum of these two force equations. The first sums the attractive force between documents. The second function repels documents that get very, very close together so they're exactly one unit away. We stop the system when we reach a point where 50% of the documents are within two units of each other. If we didn't do this, the system would congeal into one large clump. 3. The Implementation We used the following tools to make the clustering searches happen · The SMART Text Retrieval system · The GopherVR software · The Unix Gopher Server · Two perl scripts 3.1 SMART The backbone of the whole clustering system is the SMART text retrieval system. SMART was developed at Cornell University to provide a solid framework for work in information retrieval. You can get more information about SMART by selecting the following URL: The reason we went with the SMART system instead of WAIS or other text indexing software was the capability of the SMART system to find the similarity between two documents. This provides the basis for our clustering algorithms. 3.2 The GopherVR Software The GopherVR software is the method we used to actually view the clusters. Early development used gnuplot as a method of viewing clusters. The GopherVR software has the advantage of actually feeling like you're inside the scene, within the clusters. The GopherVR software also allows you to get overviews of the scene and look at it in many different orientations. You can get GopherVR binaries by selecting this URL: or or 3.3 The Unix Gopher Server The Unix Gopher Server provided us a way to pass the location, orientation and scale of items to the GopherVR client. It's mostly a go-between for the actual shell script and the client. The Unix Gopher Server can be gotten by selecting this URL: 3.4 Two Perl Scripts We wrote scripts in the perl language. This allowed us to rapidly develop and prototype many different clustering algorithms. The first script takes the search request and reformulates it into a Gopher Directory. This directory contains references to the second backend script that does the actual clustering. 4. A Sample of Clustering The best way to try out the clustering software is to try it. Start your gophervr software and open the following URL: Select Clustering Search Tool and enter a few words that might be found in a recipe or movie index. You'll get a scene that looks like this: FIGURE 1.\x13Starting a Cluster Search Select which database you'd like to search by clicking on it. You'll then see a scene that organizes documents into clusters based on their similarity of words. A sample search for winona in a movie review database returns the following list of items: · BRAM STOKER'S DRACULA · THE HOUSE OF THE SPIRITS · BRAM STOKER'S DRACULA · LITTLE WOMEN · REALITY BITES · THE AGE OF INNOCENCE · REALITY BITES · LITTLE WOMEN · BRAM STOKER'S DRACULA · THE AGE OF INNOCENCE · THE GODFATHER: PART III · Winter Film Reviews · MISC: Academy Award Nominations (1994) · MISC: Academy Award Nominations (1995) · VAN GOGH and THE BAD LIEUTENANT · MISC: The Most Memorable Films of 1994 · MISC: 1994 Academy Awards Comments · MISC: Top Ten of 1994 The following figures show a clustered organization of the above set of documents: FIGURE 2.\x13Inside a Cluster Scene FIGURE 3.\x13An Overview of a Cluster Scene An interesting thing happens with this search. Both Little Women reviews end up next to each other. The miscellaneous non-movie review items group together in their own location. The Age of Innocence reviews are also grouped together. Documents that you would expect to be clustered together are indeed clustered together. The other clusters seem to be distributed haphazardly. Still, they are grouped according to the actual text of the items. These clusters are harder to define. They could be caused by a similar writing style, or usage of similar words. They are still valid clusters. Classifying them is just more difficult. Who knows why one of the Dracula movie reviews is by the Reality Bites movie review? A final set of documents are non-clustering. They probably weren't similar enough to the other documents to cluster together. 5. Problems The clustering algorithms do strange things with small sets of documents. Consider a search for `superman'. This results in two clusters. One that contains: · An April Fools post called Gump Fiction · A review of the Fantastic Four · A review of Free Willy The other cluster had the following items: · Surf Ninjas · The Crow · Robocop 3 Obviously these don't have much in common. The other problem is caused by documents getting too close together. We don't want the items to overlap in the scene, so we scale the scene upwards depending on what the smallest distance is. This can lead to a vast plain of scattered documents. Try searching for `arnold' and you'll see this effect. The algorithms could be improved a bit. The determination of an ending state is pretty arbitrary. We could test to see if we're in a steady state, but that may take many, many iterations. Plus there's a good chance that the documents will end up in one big clump, or scatter to the far ends of the system. Yet another problem is caused by the initial positioning of the documents. We haven't done much initial testing, but it would seem that the clusters that result have a lot to do with the initial values of the documents. This problem may be related to not letting the physical model run it's full course to a steady state. 6. Future Directions The cluster software should definitely use a better algorithm. We've had good results with the current one, but don't have that gut level feeling that the math is entirely theoretically correct. We'd also like to cluster the whole set of documents instead of the limited set of documents that result from a search. This clsutering could be done ahead of time. It would also allow the administrator an opportunity to tune the clusters that the software might find, and give them appropriate names. We could add these cluster names to the scene above the location the documents are in. We'd also like to try this software on other databases. We think that this software will have broad appeal for helping people find the information they need, and perhaps stumble across something else that is similar to the document they were looking for. 7. Conclusion We hope that by exposing the innate relationships between documents people will have a much easier time finding the information they want. The GopherCluster software is a small step towards this goal.