[PD] search plugin time optimisation

Jonathan Wilkes jancsika at yahoo.com
Thu Jan 17 09:21:46 CET 2013


If you're describing the time it takes for an _initial_ search, see below.  However,
subsequent file access is an order of magnitude faster: on my pd-extended install
with ca. 5,000 docs I barely even see the progress bar after the first search.


I tested the attached Tcl script in a folder that had 307 subdirs and about
200 MB of files: roughly 18,000 files, 13,000 of which were docs readable by the
script.

Debian Wheezy (amd64)
AMD Athlon(tm) II P360 Dual-Core Processor
4 GB of RAM
Results:
Getting dirs takes 184543 microseconds per iteration
Getting files takes 1387819 microseconds per iteration
Reading files takes 14766208 microseconds per iteration
Done.

In other words, gathering up all the directories into a list
takes less than 200 milliseconds, getting a list of all the
files takes about 1.4 seconds, and actually opening each
file and reading its contents into a variable takes about
15 seconds.
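
For readers who cannot fetch the attachment, the three phases measured above can be sketched roughly as follows. This is a Python analogue written purely for illustration, not the attached findread.tcl: the function names are made up, every regular file is counted as readable, and absolute timings from Python will not match the Tcl numbers reported here.

```python
import os
import time


def list_dirs(root):
    """Phase 1: collect every directory under root (root included)."""
    return [dirpath for dirpath, _dirnames, _filenames in os.walk(root)]


def list_files(dirs):
    """Phase 2: collect the regular files sitting directly in each directory."""
    files = []
    for d in dirs:
        with os.scandir(d) as entries:
            files.extend(e.path for e in entries if e.is_file())
    return files


def read_files(files):
    """Phase 3: open each file and read its whole contents into a variable."""
    total_bytes = 0
    for path in files:
        try:
            with open(path, "rb") as fh:
                data = fh.read()
        except OSError:
            continue  # unreadable file: skip it
        total_bytes += len(data)
    return total_bytes


def timed(label, func, *args):
    """Run func once and print the elapsed wall-clock time in microseconds."""
    start = time.perf_counter()
    result = func(*args)
    elapsed_us = int((time.perf_counter() - start) * 1_000_000)
    print(f"{label} takes {elapsed_us} microseconds")
    return result


def benchmark(root):
    """Time the three phases in order and return (dirs, files, bytes read)."""
    dirs = timed("Getting dirs", list_dirs, root)
    files = timed("Getting files", list_files, dirs)
    total = timed("Reading files", read_files, files)
    print("Done.")
    return len(dirs), len(files), total
```

Calling `benchmark()` twice on the same folder should also show the caching effect mentioned at the top: the second run is served largely from the OS file cache.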

Debian Wheezy (32bit)
Intel(R) Pentium(R) 4 CPU 3.60GHz
1 GB of RAM
Results:
Getting dirs takes 46418 microseconds per iteration
Getting files takes 1365663 microseconds per iteration
Reading files takes 18203551 microseconds per iteration
Done.


Similar results on a 32-bit machine with less RAM.


WinXP
Intel Core2 6600 @ 2.4GHz
1 GB of RAM
(NTFS filesystem)
Results:
Getting dirs takes 0 microseconds per iteration
Getting files takes 13109000 microseconds per iteration
Reading files takes 41250000 microseconds per iteration
Done.


Not sure why it doesn't register anything for getting dirs. Also,
no idea why Windows takes so much longer to return the
list of files.  I haven't found any clues on the Tcl wiki, in the Tcl docs,
or on Tcl IRC.  Finally, notice how much longer Windows takes to
read the files: it's nearly 3x slower than the Debian amd64 machine.


Mac OS X 10.7.5
2.33 GHz Intel Core 2 Duo
2 GB of RAM
Results:
Getting dirs takes 5158 microseconds per iteration
Getting files takes 979583 microseconds per iteration
Reading files takes 30045212 microseconds per iteration
Done.


Still much faster than Windows for a comparable CPU, especially at
listing files, but reading still takes about 30 seconds.
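
To put the four reports side by side, the short Python snippet below converts the reported microsecond figures to seconds and compares each machine's read time with the Debian amd64 run. The numbers are copied from the results above; the machine labels and helper names are made up for the example.

```python
# Reported timings in microseconds: (getting dirs, getting files, reading files).
RESULTS_US = {
    "Debian amd64": (184543, 1387819, 14766208),
    "Debian 32bit": (46418, 1365663, 18203551),
    "WinXP": (0, 13109000, 41250000),
    "Mac OS X 10.7.5": (5158, 979583, 30045212),
}


def to_seconds(us):
    """Convert microseconds to seconds."""
    return us / 1_000_000


def read_ratio(machine, baseline="Debian amd64"):
    """How many times longer 'machine' took to read the files than the baseline."""
    return RESULTS_US[machine][2] / RESULTS_US[baseline][2]


def summary():
    """Return (machine, total seconds, read-time ratio) for every report."""
    rows = []
    for name, (dirs, files, read) in RESULTS_US.items():
        total_s = to_seconds(dirs + files + read)
        rows.append((name, round(total_s, 1), round(read_ratio(name), 2)))
    return rows


for name, total_s, ratio in summary():
    print(f"{name:16s} total {total_s:5.1f} s, read time {ratio:.2f}x Debian amd64")
```

Reading dominates the total on every machine, which is why trimming the directory or file listing alone would not change the picture much.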


***

So while I can make some optimizations here and there in
the search plugin, the measurements above are best-case
scenarios.  You can try the script on Windows 7 if
you want. Unfortunately the script only looks inside directories in
its own parent directory, so you might have to make a
test folder with lots of docs in order to make use of it.  However, I
suspect you'll see numbers more like my WinXP report above,
and that would mean you simply cannot get an initial search below
one minute with a comparable number of docs.
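
As a rough sanity check on the one-minute figure: the WinXP run spent about 54 seconds listing and reading roughly 13,000 docs, while the original report mentions 16337 docs. The Python sketch below scales the measured time linearly with the number of docs. Linear scaling is an assumption made for the example, not something measured here, and the function name is hypothetical.

```python
def estimate_initial_search_seconds(measured_us, measured_docs, target_docs):
    """Scale a measured time linearly to a different doc count (an assumption)."""
    return measured_us / 1_000_000 * target_docs / measured_docs


# WinXP report: getting files + reading files (getting dirs registered as 0).
winxp_us = 13109000 + 41250000

# About 13,000 readable docs in the test folder vs. 16337 docs in the report.
estimate = estimate_initial_search_seconds(winxp_us, 13000, 16337)
print(f"Estimated initial search: {estimate:.0f} s")
```

Under that assumption the estimate lands a little above one minute, in the same range as the roughly 60 seconds reported on Windows 7.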

Alternatives are:
* build an index from [pd META] data. I did this with the original
search tool built into Pd, but you lose the ability to do a full text search
and effectively can no longer search text files and html.
* build a full text index from the docs.  Probably faster, but it would
be a large file.
* use a search engine library like Xapian.  But that requires someone
who wants to do the work of integrating it.

All those alternatives still require that you spend time building the
index at least once. The difference is that you pay that cost once, instead
of each time you restart your computer or flush/overwrite wherever your OS
caches the dirs/files, as happens with the current search plugin.  And even
with Xapian you're opening files in Tcl and sending them to an index through
the Xapian interface, so you'd still see the long wait time building the
initial index.
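
The "build a full text index once, then reuse it" idea can be sketched as follows. This is a minimal Python illustration, not Tcl, not Xapian, and not anything in the search plugin: the tokenizer, the JSON file format, and all function names are hypothetical. The first call still has to read every file, which is exactly the slow step measured above.

```python
import json
import os
import re

# Hypothetical tokenizer: lowercase runs of letters, digits, "_" and "~".
TOKEN_RE = re.compile(r"[a-z0-9_~]+")


def build_index(root):
    """Slow path: read every file once and map each word to the files containing it."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read().lower()
            except OSError:
                continue  # unreadable file: leave it out of the index
            for token in set(TOKEN_RE.findall(text)):
                index.setdefault(token, []).append(path)
    return index


def load_or_build_index(root, index_path):
    """Fast path: reuse a saved index if present; otherwise build and save it."""
    if os.path.exists(index_path):
        with open(index_path, encoding="utf-8") as fh:
            return json.load(fh)
    index = build_index(root)
    with open(index_path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)
    return index


def search(index, word):
    """Return the sorted list of files whose text contains the given word."""
    return sorted(index.get(word.lower(), []))
```

Note that this sketch never refreshes a saved index, so docs added later are not found until the index file is deleted and rebuilt. A real implementation would need some way to detect changed or new files.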


Later I'll try to test a Pd-extended nightly to see how it performs
with its smaller number of docs.

-Jonathan


----- Original Message -----
> From: João Pais <jmmmpais at googlemail.com>
> To: PD-List <pd-list at iem.at>
> Cc: 
> Sent: Wednesday, January 9, 2013 7:05 AM
> Subject: [PD] search plugin time optimisation
> 
> Hi,
> 
> I was trying the search plugin, and one search takes around 60s. In the end, the 
> plugin reports that it had to look through "16337 docs".
> I have several work directories in my path, which I don't use that often. Is 
> there a way of optimising the search plugin?
> 
> The system is W7.
> 
> Best,
> 
> João
-------------- next part --------------
A non-text attachment was scrubbed...
Name: findread.tcl
Type: text/x-tcl
Size: 854 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20130117/520f19d6/attachment.tcl>