[PD] Using Xapian for the Pd Search Plugin

Hans-Christoph Steiner hans at at.or.at
Fri Feb 1 20:57:01 CET 2013

On 01/25/2013 09:55 PM, Jonathan Wilkes wrote:
> ----- Original Message -----
>> From: Hans-Christoph Steiner <hans at at.or.at>
>> To: pd-list at iem.at
>> Cc: 
>> Sent: Friday, January 25, 2013 9:39 PM
>> Subject: Re: [PD] Using Xapian for the Pd Search Plugin
>> On 01/24/2013 04:59 PM, Jonathan Wilkes wrote:
>>>  I've looked a bit at the Xapian API.  Here's my preliminary route to
>>>  changing the search plugin to use Xapian.
>>>  ***Build the Index***
>>>  * Read the file for each doc.
>>>  regsub out all "#X foo number number" stuff since it won't help the
>>>  search
>>>  * Optional: prefix all object names with a XAPIAN prefix so that the
>>>  user can search for instances of objects if they want.  Additionally
>>>  include the object names unprefixed so they count toward a score when
>>>  the user isn't searching just for objects
>> This sounds quite interesting.  What do you mean by searching for
>> instances of objects?
> Well, we can put "clip~" in the search terms, but we can additionally add it to the
> db with a prefix (something like XOclip~) when it originated from the
> document as "#X obj 20 10 clip~".  (Basically you normalize all the document
> search terms to lower case, so then upper case denotes certain fields.)
> I suppose we could also make use of the numbers in "#X obj 20 10", as terms with
> lower coordinates are closer to the top left corner and are more
> prominent.

That makes a lot of sense to me.  It would be great to include as much of the
metadata as possible, like whether this term is an object name, an object
argument, or maybe also whether it's in a comment or message box.

This gives us the possibility to make the searching much more aware of Pd.
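
A minimal sketch of the tagging idea discussed above, written in Python for
illustration only (the plugin itself is Tcl).  It assumes each "#X" record
sits on a single line.  "XO" is the prefix proposed in this thread; "XA",
"XC" and "XM" for arguments, comments and message boxes are hypothetical,
as is the function name index_terms.

```python
import re

# "XO" is the object-name prefix suggested in the thread.  "XA", "XC" and
# "XM" are made-up prefixes for arguments, comments and message boxes.
FIELD_PREFIX = {"text": "XC", "msg": "XM"}

# Matches e.g. "#X obj 20 10 clip~ -1 1;" and captures the kind and the body.
RECORD = re.compile(r"^#X (obj|text|msg) -?\d+ -?\d+ (.*?);?$")


def index_terms(line):
    """Return the terms to index for one single-line record of a Pd patch."""
    match = RECORD.match(line.strip())
    if match is None:
        return []  # connect, canvas, etc. carry no searchable text here
    kind, body = match.groups()
    words = body.lower().split()  # lower case, so upper case marks a field
    terms = list(words)  # unprefixed copies count toward the general score
    if kind == "obj":
        if words:
            terms.append("XO" + words[0])              # object name
            terms.extend("XA" + w for w in words[1:])  # object arguments
    else:
        terms.extend(FIELD_PREFIX[kind] + w for w in words)
    return terms


if __name__ == "__main__":
    for sample in ("#X obj 20 10 clip~ -1 1;", "#X text 30 40 clips a signal;"):
        print(sample, "->", index_terms(sample))
```

The returned strings would then be handed to whatever call adds terms to a
Xapian document.  The coordinates are discarded here, although they could
feed a weighting scheme like the one Jonathan describes.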


>>>  * Prefix all the pd META stuff so that users can search by category,
>>>  author, etc., and also include it unprefixed so that again it counts
>>>  toward a general score when not searching for a particular field
>>>  * Include the following as the document data: base directory, filename,
>>>  pd META KEY/value pairs.  I include the pd META stuff in the doc data
>>>  since we want to display some of it (keywords, maybe other stuff in the
>>>  future) in the search results.
>>>  Then it's trivial to check for database existence, and only build it if
>>>  it's not there.  (Maybe just have the last link on the homepage be
>>>  "Rebuild Index".)
>> Sounds all good.
>>>  Now we have an index so
>>>  *** Search ***
>>>  Search.  Depending on speed, I might just keep it the way it is, showing
>>>  ALL results instead of the Google way of 10 per page or whatever.
>>>  *** Search by Category ***
>>>  This will be nicer than it is currently-- instead of cryptic regexp text
>>>  showing up in the search bar, it will just be the prefixed keyword, like
>>>  "Kbandlimited" or "Ksignal".  That's easy enough to grasp that I don't
>>>  think we'll need some special syntax for category searches-- newbies can
>>>  just depend on the home page links.  Plus, if they want to search for
>>>  several categories at once they can quickly figure out it's just a
>>>  matter of prefixing a "K" in front of the category and are way less
>>>  likely to generate a tcl error as they would be screwing around inside a
>>>  regexp.  (I could even make a mousebinding, like <ctrl-click> will add a
>>>  category to the search bar without triggering a search, so they can use
>>>  that to gang several together.)
>> What about "category:bandlimited"?  The K seems arbitrary and hard to
>> remember.
> Yeah, I'm just being lazy because the "K" prefix is how it's actually stored in the
> database, and the main user interface is clicking a link.  It'd basically just
> be a regsub there so not too hard to use your syntax.
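
The substitution mentioned here could look roughly like the following,
sketched in Python rather than Tcl.  Only the "K" prefix for categories
comes from this thread.  The "author" entry, the FIELDS table and the
function name rewrite_query are assumptions made for illustration.
Xapian's QueryParser also has an add_prefix method that maps a field name
to a term prefix, which may make a manual rewrite like this unnecessary.

```python
import re

# User-facing field name -> stored term prefix.  "K" is the category prefix
# mentioned in the thread; the "author" entry is a hypothetical example.
FIELDS = {"category": "K", "author": "A"}

# A field term looks like name:value, with no whitespace in the value.
FIELD_TERM = re.compile(r"\b(\w+):(\S+)")


def rewrite_query(query):
    """Turn 'category:bandlimited' into 'Kbandlimited'; leave the rest alone."""
    def swap(match):
        name, value = match.groups()
        prefix = FIELDS.get(name.lower())
        if prefix is None:
            return match.group(0)  # unknown field name: keep the text as typed
        return prefix + value.lower()
    return FIELD_TERM.sub(swap, query)


if __name__ == "__main__":
    print(rewrite_query("category:bandlimited category:signal osc~"))
```

Several categories can be combined simply by typing more than one
category:... term, which matches the "gang several together" use case.
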
>>>  Also, if I understand the tclxapian interface correctly, I can just hand
>>>  off a tcl string to Xapian so the search-plugin can get out of tcl
>>>  "quoting-hell".  (Thus, much less chance of generating errors because of
>>>  malformed lists.)
>> That sounds very nice too.  Sounds to me like this would be a large
>> improvement.  Once you start committing some code, I'll try to find the time
>> to add xapian to the Mac and Windows builds so people can start using/testing
>> early.
> Well, this is all pre-testing stage.  Hopefully there are no weird snags in all this.  But the
> documentation seems pretty straightforward so far.
> -Jonathan
>> .hc
