[PD] search plugin time optimisation

Thu Jan 17 23:06:41 CET 2013

On 01/17/2013 04:37 PM, Jonathan Wilkes wrote:
> 
> ----- Original Message -----
>> From: Hans-Christoph Steiner <hans at at.or.at>
>> To: pd-list at iem.at
>> Cc: 
>> Sent: Thursday, January 17, 2013 4:01 PM
>> Subject: Re: [PD] search plugin time optimisation
>>
>>
>> This troubleshooting is great.  I was also hoping that the OS's disk caching
>> would make things acceptably fast; it works that way on GNU/Linux and Mac OS X.
>>
>> I really think that xapian is the way forward on this.  I don't think it'll be
>> too hard to use in the search-plugin, and it will provide much faster searching.
> 
> The initial xapian-based search will be equal to (and probably greater than*) the
> _initial_ load times listed below for each system.  That means the windows user
> downloads pd-extended with this new xapian-based search plugin, opens the
> search dialog, types "foo", and _still_ waits somewhere between 45 seconds and
> one minute to return the first search results.  This is because tcl must still search
> all the files and send the content to xapian to build the index in the first place, unless
> you want to ship an index with Pd-extended which would at least double the size of
> the archive.
> 
> The improvement with Xapian comes when the Windows user a) reboots the machine
> or b) does something that causes the OS to forget about the search plugin's file name,
> file content, and regexp cache.  In those cases the current search plugin must read all
> the files again, whereas a future Xapian plugin would just access the existing index file.
> Basically Xapian just gives you control and persistence wrt the stuff the OS is
> already caching (plus probably other fancy stuff like rating search results).
> 
> But in practice I think such improvements only affect Windows users or people who want
> to search through large numbers of libs (greater than everything in subversion currently).
> Five seconds for a search plugin which typically shows you the first few results in less than a
> second doesn't really register as latency most of the time.  (Unless of course the only results
> happen to be the last ones in the list, but remember that's only the _initial_ search-- subsequent
> ones are extremely fast.)
> 
> -Jonathan
> 
> * Note that my measurements are just for read times.  Xapian will be writing to an index so
> you'll have to factor in write times for _each_ string of file contents.  Those probably aren't
> insignificant digits.

The Debian package could generate the cache right after installation.  On Mac and
Windows, it would have to happen on first launch.  The benefits of this would be
real, but it would also take a real amount of work to implement ;)
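
To make the "build once, reuse afterwards" idea concrete, here is a minimal
sketch in Python.  It is not Xapian and not the search plugin's code: it uses a
plain JSON word index as a stand-in, and the function names, the index
location, and the list of file extensions are all assumptions for illustration.
The first call pays the full cost of reading every file; later calls only load
the saved index.

```python
import json
import os
import re

# Assumption: these are the kinds of docs worth indexing.
DOC_EXTENSIONS = (".pd", ".txt", ".html")


def build_index(doc_root):
    """Read every doc under doc_root once and map each word to its files."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            if not name.lower().endswith(DOC_EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                words = set(re.findall(r"\w+", f.read().lower()))
            for word in words:
                index.setdefault(word, []).append(path)
    return index


def load_or_build_index(doc_root, index_path):
    """Reuse a saved index if one exists; otherwise build it and save it."""
    if os.path.exists(index_path):
        with open(index_path, encoding="utf-8") as f:
            return json.load(f)
    index = build_index(doc_root)  # the slow part: happens only once
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index


def search(index, word):
    """Return the sorted list of files that contain the given word."""
    return sorted(index.get(word.lower(), []))
```

A package post-install step or a first-launch hook would call
load_or_build_index() once.  Note that this sketch never notices when docs are
added or removed, so a real implementation would also need some way to refresh
a stale index.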

.hc

> 
>>
>> .hc
>>
>> On 01/17/2013 03:35 PM, Jonathan Wilkes wrote:
>>>  Another update on this:
>>>  If I pare down the number of files to something that approaches Pd-extended's
>>>  extra folder + doc folder (around 6,500 items):
>>>  Results for GNU/Linux:
>>>
>>>  Getting dirs takes 524 microseconds per iteration
>>>  Getting files takes 324646 microseconds per iteration
>>>  Reading files takes 5483581 microseconds per iteration
>>>  Done.
>>>
>>>
>>>  This is on the Debian Wheezy machine with the 1gig ram
>>>  and fast processor-- it might take slightly longer on the other
>>>  machine but there's probably not much difference.
>>>
>>>  Unfortunately WinXP still takes an eternity to do the same-- I didn't
>>>  copy the results but getting files was something like 8 seconds, and
>>>  reading them was about 30 seconds. (And that wasn't including the
>>>  "doc" folder so about 700 fewer files!)
>>>
>>>  Unfortunately this discrepancy is so large that I can think of few decent
>>>  changes to the plugin and its interface that would improve the situation
>>>  for Windows users without bothering unix-derived OS users.  Five
>>>  seconds on Debian for the first search (and less than a second for
>>>  subsequent ones) is completely reasonable IMO, and I don't see the
>>>  Pd-extended documentation growing significantly any time soon.
>>>
>>>  So to end a _long_ answer to your question, I think you have to remove
>>>  the old libs from your search path and simply put up with the minimum
>>>  35 second initial search time.  If subsequent searches for anything other
>>>  than the empty symbol (i.e., "") are taking 2 minutes to complete, let me know
>>>  and I'll try to troubleshoot it.
>>>
>>>  -Jonathan
>>>
>>>
>>>
>>>  ----- Original Message -----
>>>>  From: Jonathan Wilkes <jancsika at yahoo.com>
>>>>  To: João Pais <jmmmpais at googlemail.com>; PD-List <pd-list at iem.at>
>>>>  Cc: 
>>>>  Sent: Thursday, January 17, 2013 3:21 AM
>>>>  Subject: Re: [PD] search plugin time optimisation
>>>>
>>>>  If you're describing the time it takes for an _initial_ search, see below.
>>>>  However, subsequent file access is an order of magnitude faster-- on my
>>>>  pd-extended install with ca. 5,000 docs I barely even see the progressbar
>>>>  at all after the first search.
>>>>
>>>>
>>>>  I tested the attached tcl script in a folder that had 307 subdirs and about
>>>>  200megs of files; roughly 18,000 files, 13,000 of which were docs readable
>>>>  by the script.
>>>>
>>>>  Debian Wheezy amd_64
>>>>  AMD Athlon(tm) II P360 Dual-Core Processor
>>>>  4gigs ram
>>>>  Results:
>>>>  Getting dirs takes 184543 microseconds per iteration
>>>>  Getting files takes 1387819 microseconds per iteration
>>>>  Reading files takes 14766208 microseconds per iteration
>>>>  Done.
>>>>
>>>>  In other words gathering up all the directories into a list
>>>>  takes less than 200 milliseconds, getting a list of all the
>>>>  files takes about a second and a half, and actually opening
>>>>  the file and feeding the contents to a variable takes about
>>>>  15 seconds.
>>>>
>>>>  Debian Wheezy (32bit)
>>>>  Intel(R) Pentium(R) 4 CPU 3.60GHz
>>>>  1gig of ram
>>>>  Results:
>>>>  Getting dirs takes 46418 microseconds per iteration
>>>>  Getting files takes 1365663 microseconds per iteration
>>>>  Reading files takes 18203551 microseconds per iteration
>>>>  Done.
>>>>
>>>>
>>>>  Similar results on a machine with less ram and 32bit.
>>>>
>>>>
>>>>  WinXP
>>>>  Intel Core2 6600 @ 2.4GHz
>>>>  1gig of ram
>>>>  (NTFS filesystem)
>>>>  Results:
>>>>  Getting dirs takes 0 microseconds per iteration
>>>>  Getting files takes 13109000 microseconds per iteration
>>>>  Reading files takes 41250000 microseconds per iteration
>>>>  Done.
>>>>
>>>>
>>>>  Not sure why it doesn't register anything for getting dirs.  Also,
>>>>  no idea why Windows takes so much longer to return the
>>>>  list of files.  I haven't found any clues on the tcl wiki, tcl docs,
>>>>  or tcl irc.  Finally, notice how much longer Windows takes to
>>>>  read the files: it's nearly 3x slower than the Debian 64 machine.
>>>>
>>>>
>>>>  Mac OS X 10.7.5
>>>>  2.33 GHz Intel Core 2 Duo
>>>>  2gig of ram
>>>>  Results:
>>>>  Getting dirs takes 5158 microseconds per iteration
>>>>  Getting files takes 979583 microseconds per iteration
>>>>  Reading files takes 30045212 microseconds per iteration
>>>>  Done.
>>>>
>>>>
>>>>  Still much faster than Windows for a comparable CPU, but
>>>>  reading still takes some time.
>>>>
>>>>
>>>>  ***
>>>>
>>>>  So while I can make some optimizations here and there in
>>>>  the search plugin, the measurements above are best
>>>>  case scenarios.  You can try the script on Windows 7 if
>>>>  you want-- unfortunately the script only looks inside directories in
>>>>  its own parent directory so you might have to make a
>>>>  test folder with lots of docs in order to make use of it.  However, I
>>>>  suspect you'll see numbers more like my WinXP report above,
>>>>  and that would mean you simply cannot get an initial search below
>>>>  one minute with a comparable amount of docs.
>>>>
>>>>  Alternatives are:
>>>>  * build an index from [pd META] data. I did this with the original
>>>>  search tool built in pd, but you lose the ability to do a full text search
>>>>  and effectively can no longer search text files and html.
>>>>  * build a full text index from the docs.  Faster probably but it would
>>>>  be a large file.
>>>>  * use a search engine library like Xapian.  But it requires someone
>>>>  who wants to do the work of using a searching engine library like
>>>>  Xapian.
>>>>
>>>>  All those alternatives still have the requirement that you
>>>>  spend time building the index at least once, instead of each time you restart
>>>>  your computer or flush/overwrite wherever your OS caches the
>>>>  dirs/files for the current search plugin.  And even with Xapian you're
>>>>  opening files in tcl and sending them to an index through the Xapian interface,
>>>>  so you'd still see the long wait time building the initial index.
>>>>
>>>>
>>>>  I'll try to test a Pd-extended nightly to see how the smaller number
>>>>  of docs performs later.
>>>>
>>>>  -Jonathan
>>>>
>>>>
>>>>  ----- Original Message -----
>>>>>    From: João Pais <jmmmpais at googlemail.com>
>>>>>    To: PD-List <pd-list at iem.at>
>>>>>    Cc: 
>>>>>    Sent: Wednesday, January 9, 2013 7:05 AM
>>>>>    Subject: [PD] search plugin time optimisation
>>>>>
>>>>>    Hi,
>>>>>
>>>>>    I was trying the search plugin, and one search takes around 60s. In the end, the
>>>>>    plugin reports that it had to look through "16337 docs".
>>>>>    I have several work directories in my path, which I don't use that often. Is
>>>>>    there a way of optimising the search plugin?
>>>>>
>>>>>    The system is W7.
>>>>>
>>>>>    Best,
>>>>>
>>>>>    João
>>>>>
>>>>>    _______________________________________________
>>>>>   Pd-list at iem.at mailing list
>>>>>    UNSUBSCRIBE and account-management -> 
>>>>>   http://lists.puredata.info/listinfo/pd-list
>>>>>       
>>>>
>>>>
>>>
>>>
>>>
>>
>>