Sunday, July 24, 2005

Search Engine Page Depth

In my post regarding Google’s Desktop Search Engine, I mentioned several objections I had to the technology. One of them was that it did not search past 5,000 words in documents.

Subsequent research has brought to light that this partial search is a common trait of search engines (although most of them go much farther than 5,000 words). How far into a document the indexer will go is a metric known as page depth. And, according to searchenginewatch.com, page depth varies considerably from one engine to the next. According to the site’s 2003 figures (the latest they have), here is how the major search engines measure up in terms of page depth:

  • Google: 101 KB
  • MSN: 150 KB
  • Yahoo: 500 KB

So, as you can see, a search engine gives you only partial information about what is known to be available. The absence of a result therefore does not imply that the item does not exist. In fact, the item might even be sitting in the search engine’s document cache; it is simply not in the engine’s index.
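To make the effect concrete, here is a minimal sketch of an indexer that stops reading at a fixed page depth. It is only an illustration under my own assumptions, not how any of these engines actually work: the names index_document and is_indexed are hypothetical, words are split on whitespace, and the depth is counted in bytes.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: collect the whitespace-separated words found in the
// first `page_depth` bytes of `document`, mimicking an indexer that stops
// reading at a fixed depth. A word that straddles the cutoff is indexed
// only as its truncated prefix.
std::set<std::string> index_document(const std::string& document,
                                     std::size_t page_depth) {
    // substr clamps the count to the document length, so documents shorter
    // than the page depth are indexed in full.
    std::istringstream words(document.substr(0, page_depth));
    std::set<std::string> index;
    std::string word;
    while (words >> word) {
        index.insert(word);
    }
    return index;
}

// True only if the word appeared within the indexed portion of the document.
bool is_indexed(const std::set<std::string>& index, const std::string& word) {
    return index.count(word) > 0;
}
```

With a page depth of 20 bytes, a document that begins "alpha beta" and ends with "omega" a hundred bytes later will match a search for "alpha" but not for "omega", even though the whole document is sitting in storage.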

Wednesday, July 06, 2005

Rogue Wave C++ Library now Open Source

Rogue Wave (now a division of Quovadx) has open-sourced its implementation of the standard C++ library. The company gave the library to the Apache Software Foundation, which stuck it immediately in its incubator system, from which it presumably will emerge months from now. In its heyday, Rogue Wave was considered a vendor of some of the best-written libraries in C++, so I suspect quality on this will be pretty good as well. The link for more info is here.

Saturday, July 02, 2005

Portable Libraries: Slim choices

As I move forward with my book on Intel EM64T 64-bit extensions to x86, I want to present some complete programs. I want the code to be portable, and I don’t want to write every routine and data structure by hand. What I need is a portable library for C that can be used for free by readers. C/C++ developers with the same problem have a few choices:

For GUIs:

For everything else:

Most of what I need falls in the domain of one of the last three libraries. There are problems with each choice, however. The APR is the most actively supported of these, but it has one horrible failing: zero documentation. And because it uses a very specific model, if you don’t grok how it comes together, you will find it impossible to use.

NSPR is well documented, but is currently supported by only one person. The limits on his time and resources make it difficult to commit to the library heart and soul. Still, of the three general-purpose libraries, this is probably the one that appeals to me most.

The ACE library is another low-documentation product. In fact, other than Doxygen lists of API calls, the only documentation extant is a book. It seems to me that if the authors want the lib to be used a lot, they should make parts of the book freely available, enough to get potential users up to speed. If the library and the book are good, most every user will buy a copy. But I won’t commit to either one without some certainty going in that I will arrive satisfactorily where I want to go.

So, what to do? I will use the Qt library, which is actually much more than a GUI library. It provides support for threads, networking, sockets, database access, and XML, which are the features I need. The library is dual-licensed: free for GPL’d projects and licensed for a fee on all other projects. Since my code will be open source, the free Qt solution works. Plus, it’s a beautifully crafted library with great docs, not to mention coverage in several books.

The oddest thing in this whole search is the APR’s lack of documentation. Despite having an active list of contributors, nobody has done docs. As a result, the only developers who use it are those who wrote or worked on the library. Seems like a stupid obstacle. I’d help out on docs if I had any way at all to figure out how the library works, but there is literally no info.