[gopher] Gopher++ scrapped & Internet Archive-style thingy

Kim Holviala kim at holviala.com
Tue Apr 20 09:25:54 UTC 2010


As part of my project to write a neat search engine covering the 
whole Gopherspace, I've (partially) crawled sites and snooped around 
and researched a lot of stuff.

Let's just say that the Gopherspace is small, but interesting. I'm glad 
I started crawling :-).

Anyway.

Whatever I've written about the gopher++ extra headers can now be 
considered "obsolete". I found a few live sites which simply cannot 
accept anything other than a selector<CRLF>, so there's no way I can 
insert extra headers without breaking stuff. Those sites even break 
with type 7 queries (and gopher+), so I'm kind of giving up now.
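
For the record, this is all those servers understand: the client 
sends the selector followed by CRLF and nothing else, and the server 
sends back the document and closes the connection. A minimal sketch 
in Python (host and selector are just examples):

  import socket

  def gopher_fetch(host, selector, port=70):
      # Send a bare gopher request: selector<CRLF>, no extra headers
      with socket.create_connection((host, port), timeout=10) as sock:
          sock.sendall(selector.encode("ascii") + b"\r\n")
          chunks = []
          while True:
              data = sock.recv(4096)
              if not data:
                  break
              chunks.append(data)
      return b"".join(chunks)

  # menu = gopher_fetch("gopher.floodgap.com", "")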

All code regarding the header extensions has been scrapped and 
deleted; it's all gone for good. The good thing is that my code is 
now 100% compatible with ALL early-90s servers, but the bad thing is 
that the neat charset conversion thingy is gone too and we're back 
to 7-bit US-ASCII (or non-working Latin/UTF). Oh, well.
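
Without a header to negotiate charsets there's no safe option left 
except forcing everything down to 7-bit. Purely as an illustration 
of that fallback (this is not the deleted code):

  def to_7bit_ascii(text):
      # Anything outside 7-bit US-ASCII becomes '?', because there's
      # no way left to tell the client which charset to expect
      return text.encode("ascii", errors="replace").decode("ascii")

  # to_7bit_ascii("Häkkinen") == "H?kkinen"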

As my search engine's indexer is an offline one, my spider basically 
crawls around and saves all type 0 and 1 files to a local cache 
hierarchy. This was mostly accidental, but I managed to create 
something very much like the Internet Archive, only for gopher. 
Basically, you give the cache manager a URL and it gives you back 
the cached page (if it has it) AND it mangles menus so that as long 
as the pages are in the cache you'll stay in the cache.
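
The menu mangling is what keeps you inside the cache: every item 
line in a cached menu gets its selector, host and port rewritten to 
point back at the cache manager, using the same cache.q?<url> 
selector you can see in the links below. Roughly like this (the 
function and constants are made up for the example, this is not the 
actual Gophernicus code):

  CACHE_HOST = "gophernicus.org"   # where the cache manager runs
  CACHE_PORT = "70"

  def mangle_menu(menu_text):
      # Rewrite every gopher menu item to route through the cache
      out = []
      for line in menu_text.splitlines():
          parts = line.split("\t")
          # Leave info lines, errors and the "." terminator alone
          if len(parts) < 4 or line[:1] in ("i", "3", "."):
              out.append(line)
              continue
          type_and_display, selector, host, port = parts[:4]
          itemtype = type_and_display[:1]
          original = "gopher://%s:%s/%s%s" % (host, port, itemtype, selector)
          out.append("\t".join([type_and_display,
                                "/cache.q?" + original,
                                CACHE_HOST, CACHE_PORT]))
      return "\r\n".join(out) + "\r\n"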

It's kind of like a combination of Google's cache and archive.org, only 
it works better than either of those...

Here's a cached copy of (partial) Floodgap:
gopher://gophernicus.org/1/cache.q?gopher://gopher.floodgap.com

It even cached itself:
gopher://gophernicus.org/1/cache.q?gopher://gophernicus.org

Notice how the cached Floodgap is much faster than the original one ;D. 
I wish there was something like this for teh web....

<turtleneck shirt mode on>
One more thing,
</turtleneck>

I'll be crawling everything in about a month or so, so now is the time 
to fix your robots.txt if you don't want your files to end up in the cache.
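
If you want to opt out completely, a minimal robots.txt (fetched 
from your server root, same convention as on the web) would look 
like this -- "*" matches any crawler that honors robots.txt:

  User-agent: *
  Disallow: /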


- Kim




