[PD] search plugin time optimisation

Jonathan Wilkes jancsika at yahoo.com
Thu Jan 17 21:35:27 CET 2013


Another update on this:
If I pare down the number of files to something approaching Pd-extended's
extra folder + doc folder (around 6,500 items), the results for GNU/Linux are:

Getting dirs takes 524 microseconds per iteration
Getting files takes 324646 microseconds per iteration
Reading files takes 5483581 microseconds per iteration
Done.


This is on the Debian Wheezy machine with 1gig of ram and the
fast processor. It might take slightly longer on the other
machine, but there's probably not much difference.
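
For reference, the three phases reported above can be sketched roughly as follows. This is a Python analogue for illustration only, not the attached tcl script: the function names and the extension filter (.pd, .txt, .html) are assumptions, and each phase is timed once with a wall clock.

```python
import os
import time


def timed(label, func):
    """Run func once and report the elapsed wall-clock time in microseconds."""
    start = time.perf_counter()
    result = func()
    elapsed_us = int((time.perf_counter() - start) * 1_000_000)
    print(f"{label} takes {elapsed_us} microseconds per iteration")
    return result


def get_dirs(root):
    # Phase 1: collect every directory below root (root itself included).
    return [dirpath for dirpath, _dirnames, _filenames in os.walk(root)]


def get_files(dirs, extensions=(".pd", ".txt", ".html")):
    # Phase 2: list the files in each directory that look like docs.
    # The extension filter is an assumption made for this sketch.
    files = []
    for d in dirs:
        for name in os.listdir(d):
            path = os.path.join(d, name)
            if os.path.isfile(path) and name.lower().endswith(extensions):
                files.append(path)
    return files


def read_files(files):
    # Phase 3: open each file and pull its whole contents into a variable.
    total_bytes = 0
    for path in files:
        with open(path, "rb") as handle:
            total_bytes += len(handle.read())
    return total_bytes


def benchmark(root):
    dirs = timed("Getting dirs", lambda: get_dirs(root))
    files = timed("Getting files", lambda: get_files(dirs))
    total_bytes = timed("Reading files", lambda: read_files(files))
    print("Done.")
    return len(dirs), len(files), total_bytes
```

Calling benchmark() twice on the same folder should show the effect of the OS file cache: the second run typically reports much smaller numbers than the first.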

Unfortunately WinXP still takes an eternity to do the same. I didn't
copy the results, but getting files was something like 8 seconds and
reading them was about 30 seconds. (And that didn't include the
"doc" folder, so about 700 fewer files!)

Unfortunately this discrepancy is so large that I can think of few decent
changes to the plugin and its interface that would improve the situation
for Windows users without bothering unix-derived OS users.  Five
seconds on Debian for the first search (and less than a second for
subsequent ones) is completely reasonable IMO, and I don't see the
Pd-extended documentation growing significantly any time soon.
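
To put the discrepancy in per-file terms, here is a small worked calculation in Python. It assumes the roughly 6,500 items are all files, that the WinXP run covered about 6,500 - 700 = 5,800 of them, and it uses the rough "about 30 seconds" figure recalled above, so the result is only a ballpark.

```python
def per_file_ms(total_us, file_count):
    """Average cost per file in milliseconds, given a total in microseconds."""
    return total_us / file_count / 1000.0


# Debian figure is the measured "Reading files" time from this message.
debian_read_ms = per_file_ms(5_483_581, 6_500)
# WinXP figure is the approximate, unrecorded 30 seconds over ~5,800 files.
winxp_read_ms = per_file_ms(30_000_000, 6_500 - 700)

print(f"Debian reads: {debian_read_ms:.2f} ms per file")
print(f"WinXP reads:  {winxp_read_ms:.2f} ms per file")
print(f"Ratio: {winxp_read_ms / debian_read_ms:.1f}x")
```

Under those assumptions reading costs a bit under 1 ms per file on Debian and around 5 ms per file on WinXP, i.e. roughly six times slower.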

So to end a _long_ answer to your question, I think you have to remove
the old libs from your search path and simply put up with the minimum
35-second initial search time.  If subsequent searches for anything other
than the empty symbol (i.e., "") are taking 2 minutes to complete, let me
know and I'll try to troubleshoot it.

-Jonathan



----- Original Message -----
> From: Jonathan Wilkes <jancsika at yahoo.com>
> To: João Pais <jmmmpais at googlemail.com>; PD-List <pd-list at iem.at>
> Cc: 
> Sent: Thursday, January 17, 2013 3:21 AM
> Subject: Re: [PD] search plugin time optimisation
> 
> If you're describing the time it takes for an _initial_ search, see below.
> However, subsequent file access is an order of magnitude faster. On my
> pd-extended install with ca. 5,000 docs I barely even see the progressbar
> at all after the first search.
> 
> 
> I tested the attached tcl script in a folder that had 307 subdirs and about
> 200megs of files; roughly 18,000 files, 13,000 of which were docs readable
> by the script.
> 
> Debian Wheezy amd_64
> AMD Athlon(tm) II P360 Dual-Core Processor
> 4gigs ram
> Results:
> Getting dirs takes 184543 microseconds per iteration
> Getting files takes 1387819 microseconds per iteration
> Reading files takes 14766208 microseconds per iteration
> Done.
> 
> In other words gathering up all the directories into a list
> takes less than 200 milliseconds, getting a list of all the
> files takes about a second and a half, and actually opening
> the file and feeding the contents to a variable takes about
> 15 seconds.
> 
> Debian Wheezy (32bit)
> Intel(R) Pentium(R) 4 CPU 3.60GHz
> 1gig of ram
> Results:
> Getting dirs takes 46418 microseconds per iteration
> Getting files takes 1365663 microseconds per iteration
> Reading files takes 18203551 microseconds per iteration
> Done.
> 
> 
> Similar results on a machine with less ram and 32bit.
> 
> 
> WinXP
> Intel Core2 6600 @ 2.4GHz
> 1gig of ram
> (NTFS filesystem)
> Results:
> Getting dirs takes 0 microseconds per iteration
> Getting files takes 13109000 microseconds per iteration
> Reading files takes 41250000 microseconds per iteration
> Done.
> 
> 
> Not sure why it doesn't register anything for getting dirs. Also,
> no idea why Windows takes so much longer to return the
> list of files.  I haven't found any clues on the tcl wiki, tcl docs,
> or tcl irc.  Finally, notice how much longer Windows takes to
> read the files: it's nearly 3x slower than the Debian 64 machine.
> 
> 
> Mac OS X 10.7.5
> 2.33 GHz Intel Core 2 Duo
> 2gig of ram
> Results:
> Getting dirs takes 5158 microseconds per iteration
> Getting files takes 979583 microseconds per iteration
> Reading files takes 30045212 microseconds per iteration
> Done.
> 
> 
> Still much faster than Windows for a comparable CPU, but
> reading still takes some time.
> 
> 
> ***
> 
> So while I can make some optimizations here and there in
> the search plugin, the measurements above are best
> case scenarios.  You can try the script on Windows 7 if
> you want. Unfortunately the script only looks inside directories in
> its own parent directory, so you might have to make a
> test folder with lots of docs in order to make use of it.  However, I
> suspect you'll see numbers more like my WinXP report above,
> and that would mean you simply cannot get an initial search below
> one minute with a comparable amount of docs.
> 
> Alternatives are:
> * build an index from [pd META] data. I did this with the original
> search tool built in pd, but you lose the ability to do a full text search
> and effectively can no longer search text files and html.
> * build a full text index from the docs.  Faster probably but it would
> be a large file.
> * use a search engine library like Xapian.  But it requires someone
> who wants to do the work of using a search engine library like
> Xapian.
> 
> All those alternatives still have the requirement that you
> spend time building the index at least once, instead of each time you restart
> your computer or flush/overwrite wherever your OS caches the
> dirs/files for the current search plugin.  And even with Xapian you're
> opening files in tcl and sending them to an index through the Xapian interface,
> so you'd still see the long wait time building the initial index.
> 
> 
> I'll try to test a Pd-extended nightly to see how the smaller number
> of docs performs later.
> 
> -Jonathan
> 
> 
> ----- Original Message -----
>>  From: João Pais <jmmmpais at googlemail.com>
>>  To: PD-List <pd-list at iem.at>
>>  Cc: 
>>  Sent: Wednesday, January 9, 2013 7:05 AM
>>  Subject: [PD] search plugin time optimisation
>> 
>>  Hi,
>> 
>>  I was trying the search plugin, and one search takes around 60s. In the end,
>>  the plugin reports that it had to look through "16337 docs".
>>  I have several work directories in my path, which I don't use that often. Is
>>  there a way of optimising the search plugin?
>> 
>>  The system is W7.
>> 
>>  Best,
>> 
>>  João
>> 
>>  _______________________________________________
>>  Pd-list at iem.at mailing list
>>  UNSUBSCRIBE and account-management -> 
>>  http://lists.puredata.info/listinfo/pd-list
>>     
> 
>
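
As an illustration of the "build a full text index from the docs" alternative listed in the quoted message, here is a minimal sketch in Python. It is not part of the search plugin and not tcl; the function names, the tokenizer, and the JSON index file are all assumptions. It reads every file once, stores a word-to-paths map on disk, and answers later searches from that map without reopening the docs.

```python
import json
import re

# Assumed tokenizer: lowercase letters, digits, "_" and "~" (so "osc~" stays whole).
TOKEN = re.compile(r"[a-z0-9_~]+")


def build_index(paths):
    """Read each file once and map every word to the files containing it."""
    index = {}
    for path in paths:
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            words = set(TOKEN.findall(handle.read().lower()))
        for word in words:
            index.setdefault(word, []).append(path)
    return index


def save_index(index, index_path):
    # Persist the index so the slow build step only has to happen once.
    with open(index_path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)


def load_index(index_path):
    with open(index_path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def search(index, query):
    """Return the files that contain every word of the query."""
    words = TOKEN.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)
```

Building the index still pays the full cost of reading every file, as noted in the quoted message; only the later searches get cheaper. This sketch matches whole words only, so it is weaker than a true full-text search.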



