WASP - About


 

The WebApp Search Page, alias WASP, lets you do a free-text search of the entire contents of the public newsgroups of Data Access Worldwide.

 

WASP is written in VDF11 and uses the DataFlex database.

 

The amount of data in the WASP search engine is as follows (September 2005):

 

Threads: 30.200

Messages: 137.000

Phrases: 16.600.000 (number of phrases, which equals the total number of words written)

Words: 9.800.000 (number of different words in each article, summed over all articles)

Vocabulary: 147.000 (different words in all)

 

Login

 

To use WASP you need to log in to the system with a user name and a password. To become a user, click Register new user on the login screen and fill in your e-mail address; WASP will then send you an e-mail with a password to use for login.

 

You will be logged off automatically if you have not used WASP for 10 minutes. If you try to navigate within WASP after 10 minutes of no activity, you will be sent back to the login dialog.

 

How to search

 

Enter one word into the search field to find all articles containing that word. If you enter several words, only articles containing all of the words entered will be found. You may enter as many words as you like.

 

If you put a minus sign in front of a word, only articles not containing that word will be found, e.g.:

 

Sture -Andersen

 

will find all articles containing the word Sture but not the word Andersen (please note that the search engine does not distinguish between uppercase and lowercase letters).

 

You may also enter "Sture Andersen" (including the quotation marks) to find all articles in which these two words appear in immediate succession (as a phrase).

 

When you enter a search value the system will search the article text (excluding paragraphs quoted from other articles) and the subject line.
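
As an illustration only, the search rules described above could be sketched in Python like this. WASP itself is written in Visual DataFlex and its source is not shown here, so the function name parse_query and the three-list return value are assumptions made for this example.

```python
import re

def parse_query(query):
    """Split a search string into required words, excluded words and phrases.

    Hypothetical helper: plain words are required, words with a leading
    minus are excluded, and text inside quotation marks is a phrase.
    Everything is lowercased because the search ignores case.
    """
    required, excluded, phrases = [], [], []
    # Each token is either a quoted phrase or a run of non-space characters.
    for match in re.finditer(r'"([^"]*)"|(\S+)', query):
        phrase, word = match.group(1), match.group(2)
        if phrase is not None:
            words = phrase.lower().split()
            if words:
                phrases.append(words)
        elif word.startswith("-") and len(word) > 1:
            excluded.append(word[1:].lower())
        else:
            required.append(word.lower())
    return required, excluded, phrases
```

With this sketch, parse_query('Sture -Andersen') yields one required word and one excluded word, while parse_query('"Sture Andersen"') yields a single two-word phrase.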

 

The server environment

 

The WebApp Search Page is running on a single processor Pentium IV 2.4 GHz machine with 1 GB of RAM. The WebApp server license is the Internet Server edition with process pooling enabled.

 

The newsgroup data is automatically fetched every hour by a Visual DataFlex 11 application using the Mabry NewsX component.

 

Performance

 

A search is allowed a maximum time of 5 seconds to complete. If a search takes longer, the result set will be empty and an error message saying "Maximum search time exceeded" will be displayed. However, since the server caches its disks you may try a second time (or even a third) to see if it will complete.

You may cheat the 5 second maximum by searching for only one very frequent word. I know you have to try now and that's OK, but do it only once.
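
How WASP enforces the limit internally is not described here. One common way to get this behaviour is to check a deadline while walking the candidate records, and to throw away the partial result if the deadline passes. The Python sketch below shows that idea; the names collect_with_deadline and SearchTimeout are made up for the example.

```python
import time

class SearchTimeout(Exception):
    """Raised when a search runs past its time budget (hypothetical name)."""

def collect_with_deadline(candidates, matches, max_seconds=5.0, clock=time.monotonic):
    """Return the candidates accepted by matches(), or raise SearchTimeout.

    The partial result is discarded on timeout, which mirrors the empty
    result set and the error message described above.
    """
    deadline = clock() + max_seconds
    hits = []
    for candidate in candidates:
        if clock() > deadline:
            raise SearchTimeout("Maximum search time exceeded")
        if matches(candidate):
            hits.append(candidate)
    return hits
```

The clock parameter is only there so the time source can be replaced in a test.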

 

Limitations

 

The system does not decode attached files. Users have to go to news.dataaccess.com to get the files.

 

The system is not able to interpret HTML-coded messages. Such messages (I suspect that there are about 1400 of them) are not indexed and will appear empty if displayed as part of a thread.

 

Using the Previous and Next buttons it is possible to scroll off the range of the search result. I know this, and I do not consider it very important (for now at least).

 

For reasons unknown, the site sometimes (seldom) reports an error on submit, complaining that User-Id has not been specified. This is untrue, and one click on the New Search button cures the problem.

 

Indexing the messages

 

Using the Mabry NewsX component a VDF 11 program downloads new messages from the DAW server every hour. The technical details of this are pretty trivial. Let it suffice to know that the text of every message is fetched together with the ID of the poster, the time it was posted and information about which other message (if any) it is answering. The message text downloaded is stored in a TEXT field (length 16384 characters).

 

A message consists of a number of lines. All lines that have a > character as the first non-blank character on the line are removed. These are considered to be quotations from the thread of messages it is answering. After this the whole message text is treated as one large string, with cr and lf characters replaced by spaces.
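
A minimal Python sketch of this clean-up step, assuming the message arrives as one string with ordinary line breaks (the function name strip_quoted_lines is hypothetical, and the real VDF code may differ in detail):

```python
def strip_quoted_lines(message_text):
    """Drop quoted lines, then flatten the message into one long string."""
    # Keep the line endings so cr and lf can be replaced one by one below.
    lines = message_text.splitlines(keepends=True)
    # A line is a quotation if its first non-blank character is '>'.
    kept = [line for line in lines if not line.lstrip().startswith(">")]
    return "".join(kept).replace("\r", " ").replace("\n", " ")
```

Note that a cr/lf pair becomes two spaces in this sketch, which does no harm because spaces only separate words.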

 

The algorithm now splits the text into a sequence of words. A word is considered to be any sequence of characters in this string: abcdefghijklmnopqrstuvwxyz01234567890 (capitalized or not). However, a word may also consist of any sequence of characters not in that string (except space). This means that the phrase "VDF11.1 is cool" is divided into the following 5 words: VDF11, ., 1, is, cool. The wisdom of this strategy may not be apparent, but it has worked fine so far.
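
The splitting rule can be written down compactly as a regular expression. The Python sketch below is an illustration of the rule as stated, not the actual VDF code, and the names WORD_PATTERN and split_words are made up for the example.

```python
import re

# A word is a run of letters and digits, or a run of any other
# characters except the space character.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9 ]+")

def split_words(text):
    """Split flattened message text into words using the two-class rule."""
    return WORD_PATTERN.findall(text)
```

Calling split_words("VDF11.1 is cool") gives the five words VDF11, ., 1, is and cool.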

 

As you will read in a minute, WASP has a maximum phrase length of six. WASP therefore only considers the first 6 words of any searched phrase. Note that in some cases this covers fewer words than you might expect: the phrase "You can't do that" really equals 6 words (You, can, ', t, do, that) because of the word splitting strategy just mentioned.

 

These words are stored in the dictionary table of the system, which also keeps track of the frequency and the unique number of each word. The frequency is needed to devise the optimum search sequence. This strategy has so far generated a dictionary of approximately 150.000 different words in WASP out of 16.000.000 used (their added frequencies).
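
To make the idea concrete, here is a small in-memory stand-in for the dictionary table in Python. It is a sketch under assumptions: the class name Dictionary is hypothetical, ids are simply handed out in order of first appearance, words are lowercased before lookup (the search is not case sensitive), and the frequency counts every occurrence of a word.

```python
class Dictionary:
    """Maps each distinct word to a unique id and counts its occurrences."""

    def __init__(self):
        self.ids = {}        # word -> unique id
        self.frequency = {}  # word -> number of occurrences seen so far

    def add(self, word):
        """Register one occurrence of word and return its id."""
        key = word.lower()
        if key not in self.ids:
            self.ids[key] = len(self.ids) + 1
            self.frequency[key] = 0
        self.frequency[key] += 1
        return self.ids[key]
```

Feeding the words of an article through add() one by one turns the article into a sequence of word ids.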

 

When you read the word article from here on, what I really mean is newsgroup message text.

 

Using the dictionary table we can now express the article text as a sequence of numbers (word ids). In another table (called article words) the algorithm stores every word together with the unique id of the article. This table only has these two columns, by which it is also uniquely indexed. The article words table therefore maintains information about which words are in which articles but not the sequence of words in the article.

 

Using the article words and dictionary tables it is now possible to search for articles in which words occur (or do not occur). It is however not possible to search for articles containing a particular sequence of words, for instance the sentence "To be or not to be" (which used to be the horror of search engines because every word in that sentence is extremely frequent).
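
The following Python sketch shows how such a word search can work. It is an illustration, not the WASP code: the article words table is modelled as a dict from word to a set of article ids, splitting is done on spaces only to keep the example short, and both function names are hypothetical. Starting with the rarest word is one plausible reading of the "optimum search sequence" that the frequencies are used for.

```python
def build_article_words(articles):
    """articles maps article id -> text. Returns word -> set of article ids."""
    table = {}
    for article_id, text in articles.items():
        # Each (word, article) pair is stored once; word order is not kept.
        for word in set(text.lower().split()):
            table.setdefault(word, set()).add(article_id)
    return table

def search(article_words, required, excluded=()):
    """Return ids of articles with all required words and no excluded word."""
    # Begin with the rarest required word so the candidate set stays small.
    ordered = sorted(required, key=lambda w: len(article_words.get(w, ())))
    if not ordered:
        return set()
    result = set(article_words.get(ordered[0], ()))
    for word in ordered[1:]:
        result &= article_words.get(word, set())
    for word in excluded:
        result -= article_words.get(word, set())
    return result
```

Because the table holds no word positions, this search cannot tell whether two words stand next to each other in an article.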

 

To that end, a third table is introduced called article phrases. This table stores the article-id, the word-id and the word-ids of the following 5 words. The table is indexed like this: word-id, word-id2, word-id3, word-id4, word-id5, word-id6, recnum. We are therefore able to translate "to be or not to be" into the sequence 25, 71, 379, 141, 25, 71. Because of the index we can dive directly into the articles containing this sentence.
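
Here is a rough Python model of the article phrases idea, offered as a sketch only. A sorted list stands in for the database index and bisect stands in for the index find. The function names, the use of 0 as a filler id at the end of an article, and the sample data are all assumptions made for the example.

```python
import bisect

PHRASE_LENGTH = 6  # a word id plus the ids of the following 5 words

def build_phrase_index(articles):
    """articles maps article id -> list of word ids. Returns sorted rows."""
    rows = []
    for article_id, word_ids in articles.items():
        for start in range(len(word_ids)):
            window = list(word_ids[start:start + PHRASE_LENGTH])
            # Pad short windows near the end of the article with 0.
            window += [0] * (PHRASE_LENGTH - len(window))
            rows.append((tuple(window), article_id))
    rows.sort()
    return rows

def find_phrase(rows, phrase_ids):
    """Return ids of articles containing the phrase (first 6 words only)."""
    prefix = tuple(phrase_ids[:PHRASE_LENGTH])
    # Jump straight to the first row whose window is >= the prefix.
    start = bisect.bisect_left(rows, (prefix,))
    found = set()
    for window, article_id in rows[start:]:
        if window[:len(prefix)] != prefix:
            break
        found.add(article_id)
    return found
```

All rows that begin with the searched ids sit next to each other in the sorted order, so the lookup touches only the matching rows plus one.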

 

Without this last table WASP would have had to go through 278836 + 49853 + 34994 + 57061 = 420744 records, nearly half a million, to determine which articles the sentence *might* be in (that is the added frequency of the four different words in the sentence). Using the article phrases table it only has to do 2 finds to determine that the sentence occurs only once. This is one cool table.

 

 

I know you will find WASP useful. If you have questions or complaints please write a message on the DAW newsgroups and put WASP in the subject line.

 

Sture ApS, Denmark

Mail: sture.aps@mail.tele.dk

Tel: +45 40 59 70 20



WASP was programmed by Sture ApS