[PD] search plugin update (was: Re: reverse kickstarter update)

Jonathan Wilkes jancsika at yahoo.com
Sun Sep 15 21:29:59 CEST 2013


Hi list,
      Attached is a first pass at using the Xapian backend to
search Pure Data docs.

What the revision does:
* simplifies building a search index.  It builds once, on the first
search, and all subsequent searches happen very fast.  Previously
it searched the docs themselves every single time and depended
on the OS caching the data, resulting in sluggish performance
especially on Windows.
* natural language, probabilistic searches.  The search terms
in the index were automatically chosen by the engine with no
customization, and already the results are decent.
* almost no input errors.  Xapian has its own simple syntax, but
for most cases users can ignore it and type in natural language
searches (like Google).  And the few errors the user
can generate produce meaningful feedback on the console.  Also,
since I'm passing the input as a string you don't have to worry
about malformed tcl lists or weird characters that previously caused
errors.
* everything, including pd files, PDFs and HTML, is indexed properly
and so will get included in the results in the proper place.
* gives the ability to add results from a remote database with a
couple lines of code.
* allows the removal of "Match all terms" and "Match whole words"
checkbuttons, simplifying the interface.
* performs "stemming" out of the box: when you search for
"edit", the engine also takes into account "editing", "edits",
"edited", etc.
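
To make the "builds once, on the first search" behaviour concrete, here
is a toy sketch.  It is in Python for brevity, not the plugin's Tcl, and
it uses a plain inverted index stored as JSON instead of Xapian, so it
only matches exact words (no stemming, no probabilistic ranking).  All
function names and the index file format are made up for illustration.

```python
import json
import os
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(doc_dir):
    """Walk doc_dir and map each word to the files that contain it."""
    index = {}
    for root, _dirs, files in os.walk(doc_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                for term in set(tokenize(f.read())):
                    index.setdefault(term, []).append(path)
    return index


def load_or_build(doc_dir, index_path):
    """Reuse the saved index if there is one; otherwise build and save it.

    index_path is assumed to live outside doc_dir, so the index
    file does not end up indexing itself.
    """
    if os.path.exists(index_path):
        with open(index_path, encoding="utf-8") as f:
            return json.load(f)
    index = build_index(doc_dir)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index


def search(index, query):
    """Return the files that contain every word of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], []))
    for term in terms[1:]:
        hits &= set(index.get(term, []))
    return sorted(hits)
```

Only the first call to load_or_build touches the docs; every later call
just reads the saved index, which is the point of the change.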

Installation for Linux (Debian):
1) Make sure you have libxapian and tclxapian packages
installed.  Other distros probably have corresponding packages.
2) put search-plugin.tcl in the /startup directory, or if you're
using Pd vanilla just make sure it's in a directory that's specified
in the "Path" dialog.
3) Run Pd and press <ctrl-h> or choose "Search" from the "Help"
menu.

Further work that needs to be done:
* need to figure out where to create the database directory on
Linux, OSX, and Windows.  The directory needs to be readable and
writable.  Is there an easy way to do this?
* need a "Cancel" button next to the progressbar when indexing,
so the user can cancel a long index.
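
On the database directory question, one possible approach is to follow
each platform's usual per-user data location.  The sketch below is in
Python, not the plugin's Tcl, and is only meant to show the logic.  The
directory name "pd-search" and both function names are hypothetical,
and whether these locations suit Pd is an open question.

```python
import os
import sys


def search_db_dir(app="pd-search", platform=None, env=None, home=None):
    """Return a per-user directory for the search database.

    platform, env and home default to the running system's values;
    they are parameters only so the logic can be tried for any OS.
    """
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home

    if platform.startswith("win"):
        # Windows: per-user roaming application data.
        base = env.get("APPDATA") or os.path.join(home, "AppData", "Roaming")
    elif platform == "darwin":
        # OSX: per-user Application Support folder.
        base = os.path.join(home, "Library", "Application Support")
    else:
        # Linux and others: XDG data home, with its documented default.
        base = env.get("XDG_DATA_HOME") or os.path.join(home, ".local", "share")
    return os.path.join(base, app)


def ensure_writable(path):
    """Create the directory if needed and report whether it is usable."""
    os.makedirs(path, exist_ok=True)
    return os.access(path, os.R_OK | os.W_OK)
```

If ensure_writable returns false, the plugin could fall back to a
temporary directory and rebuild the index in each session.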

Further work that could be done:
* add pd meta tag/values to the index terms for each document.
This would make it possible to type keyword:foo or author:bar
to search based solely on that pd meta tag/value.
* add filenames to terms
* add "object" terms so the user can search pd patches for
a particular object instance, i.e., object:clip
* limit the document data in the database to pd meta tags/values
and other metadata.  Right now I'm storing the _entire_ doc text
in the database which obviously wastes space.
* xapian has all kinds of features, like suggesting related searches,
and realtime results.  The latter could be very handy for autocompletion
in object boxes, for example.
* could use the title of HTML files as the description, for better
result descriptions
* could plug in to puredata.info to search for externals, plugins, etc.
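
To show what the keyword:foo / author:bar / object:clip idea would mean
for the user's input, here is a small sketch of splitting a query into
field-restricted terms and ordinary terms.  It is in Python and is
purely illustrative: in the real plugin I would expect Xapian's own
query parser to handle prefixes, and the function name and the set of
field names are assumptions based on the list above.

```python
# Field names taken from the ideas above; the set itself is an assumption.
FIELDS = {"keyword", "author", "object"}


def split_query(query):
    """Split a query into ({field: [values]}, [free-text terms]).

    A token counts as a field term only if it looks like name:value
    with a known name and a non-empty value; anything else (including
    things like URLs) is left as a free-text term.
    """
    fields = {}
    free = []
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and value and name.lower() in FIELDS:
            fields.setdefault(name.lower(), []).append(value)
        else:
            free.append(token)
    return fields, free
```

For example, "object:clip low pass" would restrict the search to
patches containing a [clip] object and rank them by "low pass".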

As always, feedback welcome.  And feel free to donate some rice
and beans if you can!
https://jwilkes.nfshost.com/donations.php

Best,
Jonathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: search-plugin.tcl
Type: text/x-tcl
Size: 51473 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20130915/96faa9c9/attachment-0001.tcl>