Tuesday, October 06, 2009

CPANHQ

There is a project to build a Catalyst-based CPAN search engine, with the idea of making it more hackable and ultimately incorporating some more 'social' features into it. The original effort was started by Brian Cassidy, who created a cpanhq github repository and wrote some initial code. Later Shlomi Fish added some features (in his repo), and now I am working on it. In my repo you'll find a version that finally has a search page, and it even works. It uses the SQLite FTS3 full text search engine.
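To give an idea of what searching with FTS3 looks like, here is a minimal sketch in Python. It is not the CPANHQ code (which is Perl), and the table names, column names and sample rows are made up for illustration. It assumes the SQLite library bundled with your Python was built with FTS3 enabled, which is usually the case.

```python
import sqlite3

def build_db():
    """Create an in-memory database with a plain table and an FTS3 index.

    The schema (package, package_fts) is hypothetical, not CPANHQ's.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT)")
    # FTS3 virtual table holding the searchable text; docid links it
    # back to package.id.
    conn.execute("CREATE VIRTUAL TABLE package_fts USING fts3(name, body)")
    rows = [
        (1, "Catalyst::Runtime", "Catalyst web framework runtime"),
        (2, "DBIx::Class", "Object relational mapper"),
        (3, "Catalyst::View::TT", "Template Toolkit view for Catalyst"),
    ]
    for pkg_id, name, body in rows:
        conn.execute("INSERT INTO package (id, name) VALUES (?, ?)",
                     (pkg_id, name))
        conn.execute(
            "INSERT INTO package_fts (docid, name, body) VALUES (?, ?, ?)",
            (pkg_id, name, body),
        )
    return conn

def search(conn, query):
    """Full text search, joined to the ordinary table and sorted by name."""
    sql = """
        SELECT me.name
        FROM package_fts
        JOIN package AS me ON me.id = package_fts.docid
        WHERE package_fts MATCH ?
        ORDER BY me.name DESC
    """
    return [row[0] for row in conn.execute(sql, (query,))]

conn = build_db()
results = search(conn, "catalyst")
print(results)
```

Because the full text index lives in the same database as the rest of the data, the MATCH condition can be combined with ordinary joins and ORDER BY clauses in a single query.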

The biggest problem for me is that I had to modify the database quite extensively, and now the loading scripts no longer work. To have something to show, I decided to load the database into the repo - so now it can take an awfully long time to download the software, but in return you should be able to try out the search right after the download. It is not at end-user quality - you need to know some internals to use it effectively (like using 'me.name desc' in the order field to sort the packages by name), but it should be reasonably fast. Now I am thinking about which search strategies would be the most useful.

6 comments:

cow boy said...

You might want to glance at https://github.com/minty/PSNIC/

I've got it mostly working locally, but it is rather buggy in places

http://psnic.sysmonblog.co.uk/

But perhaps it might have some ideas for you.

Michael said...

I can understand wanting to make search.cpan.org more "social", but why use SQLite as the search tool? search.cpan.org gets a lot of hits, and it currently uses something that was designed for very fast full text searches over millions of documents (swish-e). I can understand wanting to pick a different full-text search engine that might have more features (xapian, sphinx, etc.), but why go with something that will be slower?

zby said...

@Michael - that was just the simplest way to get the search working. But it also has some additional advantages - like the ability to sort and search by any other data tables in the db. For now the goal is to get something more useful than the current searches - I imagine this will require input from many developers, and with the SQLite features it is really easy to tweak things and add new features. Finally, we don't have millions of pages - so it might require a different optimisation approach than tools built for big document sets.

@cow boy: I see you have similar but slightly different goals. Let's see if we can collaborate.
Commenting on your FAQ - I managed to extract the POD from the packages with Pod::Xhtml (mostly in only one version - the latest one as installed by mini-cpan, but with some exceptions), and the database is only about 200MB.

phil jones said...

None of my business really, but uploading the database to github? Doesn't sound right.

Why is this necessary, exactly?

zby said...

Hi Phil, I think this is the fastest way to let other people play with the code. I still have not fixed the loading scripts and, what is perhaps more important, loading the data takes a lot of time (on my computer it took about a day and a night - plus there are some memory leaks, so the scripts needed restarting from time to time).

Stefan Petrea said...

Hi

Here's a Perl search engine library