[PD] search plugin time optimisation

Fri Jan 18 00:33:22 CET 2013

----- Original Message -----

> From: Hans-Christoph Steiner <hans at at.or.at>
> To: Jonathan Wilkes <jancsika at yahoo.com>
> Cc: "pd-list at iem.at" <pd-list at iem.at>
> Sent: Thursday, January 17, 2013 5:06 PM
> Subject: Re: [PD] search plugin time optimisation
> 

[...]

>>  * Note that my measurements are just for read times.  Xapian will be
>>  writing to an index so you'll have to factor in write times for _each_
>>  string of file contents.  Those probably aren't insignificant digits.
> 
> The Debian package could generate the cache right after installation.  On Mac
> and Windows, that would have to happen on first launch.  The benefits of this
> would be real, but it would also take a real amount of work to implement ;)

For the record I have zero interest in investigating how to integrate
Xapian into Pd-extended, making sure the tcl bindings actually work and aren't
just abandonware, making sure it actually works with comparable results on
each OS (as opposed to, say, opening and closing files with tcl), and integrating
Xapian into the Pd-extended build process for OSX and Windows.  Those
caveats aside, I'd be happy to revise the search plugin to use the Xapian API.

-Jonathan

> 
> .hc
> 
>> 
>>> 
>>>  .hc
>>> 
>>>  On 01/17/2013 03:35 PM, Jonathan Wilkes wrote:
>>>>   Another update on this:
>>>>   If I pare down the number of files to something that approaches
>>>>   Pd-extended's extra folder + doc folder (around 6,500 items):
>>>>   Results for GNU/Linux:
>>>> 
>>>>   Getting dirs takes 524 microseconds per iteration
>>>>   Getting files takes 324646 microseconds per iteration
>>>>   Reading files takes 5483581 microseconds per iteration
>>>>   Done.
>>>> 
>>>> 
>>>>   This is on the Debian Wheezy machine with the 1gig ram
>>>>   and fast processor-- it might take slightly longer on the other
>>>>   machine but there's probably not much difference.
>>>> 
>>>>   Unfortunately WinXP still takes an eternity to do the same-- I didn't
>>>>   copy the results but getting files was something like 8 seconds, and
>>>>   reading them was about 30 seconds. (And that wasn't including the
>>>>   "doc" folder so about 700 fewer files!)
>>>> 
>>>>   Unfortunately this discrepancy is so large that I can think of few decent
>>>>   changes to the plugin and its interface that would improve the situation
>>>>   for Windows users without bothering unix-derived OS users.  Five
>>>>   seconds on Debian for the first search (and less than a second for
>>>>   subsequent ones) is completely reasonable IMO, and I don't see the
>>>>   Pd-extended documentation growing significantly any time soon.
>>>> 
>>>>   So to end a _long_ answer to your question, I think you have to remove
>>>>   the old libs from your search path and simply put up with the minimum
>>>>   35 second initial search time.  If subsequent searches for anything other
>>>>   than the empty symbol (i.e., "") are taking 2 minutes to complete, let me
>>>>   know and I'll try to troubleshoot it.
>>>> 
>>>>   -Jonathan
>>>> 
>>>> 
>>>> 
>>>>   ----- Original Message -----
>>>>>   From: Jonathan Wilkes <jancsika at yahoo.com>
>>>>>   To: João Pais <jmmmpais at googlemail.com>; PD-List <pd-list at iem.at>
>>>>>   Cc: 
>>>>>   Sent: Thursday, January 17, 2013 3:21 AM
>>>>>   Subject: Re: [PD] search plugin time optimisation
>>>>> 
>>>>>   If you're describing the time it takes for an _initial_ search, see below.
>>>>>   However, subsequent file access is an order of magnitude faster-- on my
>>>>>   pd-extended install with ca. 5,000 docs I barely even see the progressbar
>>>>>   at all after the first search.
>>>>> 
>>>>> 
>>>>>   I tested the attached tcl script in a folder that had 307 subdirs and about
>>>>>   200megs of files; roughly 18,000 files, 13,000 of which were docs readable
>>>>>   by the script.
>>>>> 
>>>>>   Debian Wheezy amd_64
>>>>>   AMD Athlon(tm) II P360 Dual-Core Processor
>>>>>   4gigs ram
>>>>>   Results:
>>>>>   Getting dirs takes 184543 microseconds per iteration
>>>>>   Getting files takes 1387819 microseconds per iteration
>>>>>   Reading files takes 14766208 microseconds per iteration
>>>>>   Done.
>>>>> 
>>>>>   In other words gathering up all the directories into a list
>>>>>   takes less than 200 milliseconds, getting a list of all the
>>>>>   files takes about a second and a half, and actually opening
>>>>>   the file and feeding the contents to a variable takes about
>>>>>   15 seconds.
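For anyone who wants to reproduce this kind of three-phase measurement without
the attached tcl script, here is a minimal Python sketch.  It is not the script
that produced the numbers in this thread; the function name time_phases and the
choice of Python are assumptions made purely for illustration.

```python
import os
import time


def time_phases(root):
    """Time the three phases separately.

    Returns (timings in microseconds, number of dirs, number of files).
    Hypothetical helper, not the tcl script from this thread.
    """
    timings = {}

    # Phase 1: gather every directory under root (root itself included).
    start = time.perf_counter()
    dirs = [dirpath for dirpath, _, _ in os.walk(root)]
    timings["dirs"] = (time.perf_counter() - start) * 1e6

    # Phase 2: list the regular files inside each directory.
    start = time.perf_counter()
    files = []
    for d in dirs:
        with os.scandir(d) as entries:
            files.extend(e.path for e in entries if e.is_file())
    timings["files"] = (time.perf_counter() - start) * 1e6

    # Phase 3: open each file and read its contents into a variable.
    start = time.perf_counter()
    for path in files:
        try:
            with open(path, "r", encoding="utf-8", errors="replace") as fh:
                contents = fh.read()  # discarded; only the read time matters
        except OSError:
            continue  # skip unreadable files rather than abort the run
    timings["read"] = (time.perf_counter() - start) * 1e6

    return timings, len(dirs), len(files)
```

Calling time_phases on a docs folder twice in a row should show the effect
described above: the second run is usually much faster because the OS has
cached the directory entries and file contents.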
>>>>> 
>>>>>   Debian Wheezy (32bit)
>>>>>   Intel(R) Pentium(R) 4 CPU 3.60GHz
>>>>>   1gig of ram
>>>>>   Results:
>>>>>   Getting dirs takes 46418 microseconds per iteration
>>>>>   Getting files takes 1365663 microseconds per iteration
>>>>>   Reading files takes 18203551 microseconds per iteration
>>>>>   Done.
>>>>> 
>>>>> 
>>>>>   Similar results on a machine with less ram and 32bit.
>>>>> 
>>>>> 
>>>>>   WinXP
>>>>>   Intel Core2 6600 @ 2.4GHz
>>>>>   1gig of ram
>>>>>   (NTFS filesystem)
>>>>>   Results:
>>>>>   Getting dirs takes 0 microseconds per iteration
>>>>>   Getting files takes 13109000 microseconds per iteration
>>>>>   Reading files takes 41250000 microseconds per iteration
>>>>>   Done.
>>>>> 
>>>>> 
>>>>>   Not sure why it doesn't register anything for getting dirs. Also,
>>>>>   no idea why Windows takes so much longer to return the
>>>>>   list of files.  I haven't found any clues on the tcl wiki, tcl docs,
>>>>>   or tcl irc.  Finally, notice how much longer Windows takes to
>>>>>   read the files: it's nearly 3x slower than the Debian 64 machine.
>>>>> 
>>>>> 
>>>>>   Mac OS X 10.7.5
>>>>>   2.33 GHz Intel Core 2 Duo
>>>>>   2gig of ram
>>>>>   Results:
>>>>>   Getting dirs takes 5158 microseconds per iteration
>>>>>   Getting files takes 979583 microseconds per iteration
>>>>>   Reading files takes 30045212 microseconds per iteration
>>>>>   Done.
>>>>> 
>>>>> 
>>>>>   Still much faster than Windows for a comparable CPU, but
>>>>>   reading still takes some time
>>>>> 
>>>>> 
>>>>>   ***
>>>>> 
>>>>>   So while I can make some optimizations here and there in
>>>>>   the search plugin, the measurements above are best-case
>>>>>   scenarios.  You can try the script on Windows 7 if
>>>>>   you want-- unfortunately the script only looks inside directories in
>>>>>   its own parent directory so you might have to make a
>>>>>   test folder with lots of docs in order to make use of it.  However, I
>>>>>   suspect you'll see numbers more like my WinXP report above,
>>>>>   and that would mean you simply cannot get an initial search below
>>>>>   one minute with a comparable amount of docs.
>>>>> 
>>>>>   Alternatives are:
>>>>>   * build an index from [pd META] data. I did this with the original
>>>>>   search tool built in pd, but you lose the ability to do a full text
>>>>>   search and effectively can no longer search text files and html.
>>>>>   * build a full text index from the docs.  Faster probably but it would
>>>>>   be a large file.
>>>>>   * use a search engine library like Xapian.  But it requires someone
>>>>>   who wants to do the work of using a search engine library like
>>>>>   Xapian.
>>>>> 
>>>>>   All those alternatives still have the requirement that you
>>>>>   spend time building the index at least once, instead of each time you
>>>>>   restart your computer or flush/overwrite wherever your OS caches the
>>>>>   dirs/files for the current search plugin.  And even with Xapian you're
>>>>>   opening files in tcl and sending them to an index through the Xapian
>>>>>   interface, so you'd still see the long wait time building the initial
>>>>>   index.
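To make the "build a full text index once" alternative concrete, here is a
rough Python sketch of a word-to-files index that is saved to disk and reused
on later runs.  This is not how the search plugin works and it does not use
Xapian; the names build_index, load_or_build and search, and the JSON file
format, are all assumptions for illustration only.

```python
import json
import os
import re

WORD = re.compile(r"\w+")


def build_index(root):
    """Map each lowercase word to the sorted list of files containing it."""
    index = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="replace") as fh:
                    words = set(WORD.findall(fh.read().lower()))
            except OSError:
                continue  # unreadable file: leave it out of the index
            for word in words:
                index.setdefault(word, []).append(path)
    return {word: sorted(paths) for word, paths in index.items()}


def load_or_build(root, index_path):
    """Pay the slow file-reading cost only when no saved index exists."""
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    index = build_index(root)
    with open(index_path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)
    return index


def search(index, word):
    """Return the files containing the exact word (case-insensitive)."""
    return index.get(word.lower(), [])
```

Note the trade-offs this sketch makes visible: the first call to load_or_build
still reads every file, the saved index can get large, and it goes stale if
docs are added or removed until it is rebuilt.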
>>>>> 
>>>>> 
>>>>>   I'll try to test a Pd-extended nightly later to see how the smaller
>>>>>   number of docs performs.
>>>>> 
>>>>>   -Jonathan
>>>>> 
>>>>> 
>>>>>   ----- Original Message -----
>>>>>>     From: João Pais <jmmmpais at googlemail.com>
>>>>>>     To: PD-List <pd-list at iem.at>
>>>>>>     Cc: 
>>>>>>     Sent: Wednesday, January 9, 2013 7:05 AM
>>>>>>     Subject: [PD] search plugin time optimisation
>>>>>> 
>>>>>>     Hi,
>>>>>> 
>>>>>>     I was trying the search plugin, and one search takes around 60s. In
>>>>>>     the end, the plugin reports that it had to look through "16337 docs".
>>>>>>     I have several work directories in my path, which I don't use that
>>>>>>     often. Is there a way of optimising the search plugin?
>>>>>> 
>>>>>>     The system is W7.
>>>>>> 
>>>>>>     Best,
>>>>>> 
>>>>>>     João
>>>>>> 
>>>>>>     _______________________________________________
>>>>>>   Pd-list at iem.at mailing list
>>>>>>     UNSUBSCRIBE and account-management -> 
>>>>>>   http://lists.puredata.info/listinfo/pd-list
>>>>>>       
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>