[PD] search plugin time optimisation

Thu Jan 17 22:01:45 CET 2013

This troubleshooting is great.  I was also hoping that the OS's disk caching
would make things acceptably fast; it works that way on GNU/Linux and Mac OS X.
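
For anyone who wants to reproduce the kind of numbers quoted below without
the attached tcl script, here is a minimal sketch in Python (not the tcl of
the actual plugin).  It times the same three phases: collecting directories,
listing files, and reading files.  The function names and the tiny test tree
are made up for illustration; point benchmark() at a real doc folder to get
meaningful timings.

```python
import os
import tempfile
import time


def get_dirs(root):
    """Return root and every directory below it."""
    return [dirpath for dirpath, _dirnames, _filenames in os.walk(root)]


def get_files(dirs):
    """Return the full path of every regular file directly inside dirs."""
    files = []
    for d in dirs:
        for name in sorted(os.listdir(d)):
            path = os.path.join(d, name)
            if os.path.isfile(path):
                files.append(path)
    return files


def read_files(files):
    """Read every file into memory and return the total byte count."""
    total = 0
    for path in files:
        with open(path, "rb") as fh:
            total += len(fh.read())
    return total


def timed(func, *args):
    """Run func(*args) and return (result, elapsed microseconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, (time.perf_counter() - start) * 1e6


def benchmark(root):
    """Time the three phases once; return timings in microseconds."""
    dirs, t_dirs = timed(get_dirs, root)
    files, t_files = timed(get_files, dirs)
    nbytes, t_read = timed(read_files, files)
    return {"dirs": t_dirs, "files": t_files, "read": t_read,
            "n_files": len(files), "n_bytes": nbytes}


if __name__ == "__main__":
    # Hypothetical test tree: two small files in a temporary directory.
    with tempfile.TemporaryDirectory() as root:
        for sub in ("a", "b"):
            os.makedirs(os.path.join(root, sub))
            with open(os.path.join(root, sub, "x-help.pd"), "w") as fh:
                fh.write("#N canvas 0 0 450 300 10;\n")
        # Run twice so a first pass can be compared with a repeat pass.
        for label in ("first pass", "second pass"):
            print(label, benchmark(root))
```

On a freshly written toy tree like this one both passes will be quick,
since the files are probably already in the OS cache.  The gap between a
first and a repeat pass should only show up on a large folder that has not
been read since boot.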

I really think that Xapian is the way forward on this.  I don't think it'll be
too hard to use in the search-plugin, and it will provide much faster searching.
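
To illustrate why an index helps, here is a rough sketch of the "read
everything once, then search in memory" idea.  It is plain Python with a
hand-rolled inverted index, not Xapian and not the plugin's tcl code; the
names build_index and search, the word pattern, and the sample files are
all assumptions made for the example.

```python
import os
import re
import tempfile
from collections import defaultdict

# Assumed tokenizer: runs of lowercase letters, digits, "_" and "~".
WORD = re.compile(r"[a-z0-9_~]+")


def build_index(root):
    """Read every file under root once; map each word to the files containing it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8", errors="replace") as fh:
                for word in WORD.findall(fh.read().lower()):
                    index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths of files that contain every word in query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], ()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical docs: one help patch and one text file.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "osc-help.pd"), "w") as fh:
            fh.write("#X text 10 10 cosine oscillator;\n")
        with open(os.path.join(root, "notes.txt"), "w") as fh:
            fh.write("Oscillator bank notes\n")
        index = build_index(root)  # the slow part, done once
        for query in ("oscillator", "cosine oscillator"):
            print(query, "->", search(index, query))
```

The slow file reading happens only in build_index(); each search() after
that is a handful of set lookups.  A real search library would add
persistence, ranking and stemming on top, which this sketch leaves out.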

.hc

On 01/17/2013 03:35 PM, Jonathan Wilkes wrote:
> Another update on this:
> If I pare down the number of files to something that approaches Pd-extended's
> extra folder + doc folder (around 6,500 items):
> Results for GNU/Linux:
> 
> Getting dirs takes 524 microseconds per iteration
> Getting files takes 324646 microseconds per iteration
> Reading files takes 5483581 microseconds per iteration
> Done.
> 
> 
> This is on the Debian Wheezy machine with the 1gig ram
> and fast processor-- it might take slightly longer on the other
> machine but there's probably not much difference.
> 
> Unfortunately WinXP still takes an eternity to do the same-- I didn't
> copy the results but getting files was something like 8 seconds, and
> reading them was about 30 seconds. (And that wasn't including the
> "doc" folder so about 700 fewer files!)
> 
> Unfortunately this discrepancy is so large that I can think of few decent
> changes to the plugin and its interface that would improve the situation
> for Windows users without bothering unix-derived OS users.  Five
> seconds on Debian for the first search (and less than a second for
> subsequent ones) is completely reasonable IMO, and I don't see the
> Pd-extended documentation growing significantly any time soon.
> 
> So to end a _long_ answer to your question, I think you have to remove
> the old libs from your search path and simply put up with the minimum
> 35 second initial search time.  If subsequent searches for anything other
> than the empty symbol (i.e., "") are taking 2 minutes to complete, let me know
> and I'll try to troubleshoot it.
> 
> -Jonathan
> 
> 
> 
> ----- Original Message -----
>> From: Jonathan Wilkes <jancsika at yahoo.com>
>> To: João Pais <jmmmpais at googlemail.com>; PD-List <pd-list at iem.at>
>> Cc: 
>> Sent: Thursday, January 17, 2013 3:21 AM
>> Subject: Re: [PD] search plugin time optimisation
>>
>> If you're describing the time it takes for an _initial_ search, see below.
>> However, subsequent file access is an order of magnitude faster-- on my
>> pd-extended install with ca. 5,000 docs I barely even see the progressbar
>> at all after the first search.
>>
>>
>> I tested the attached tcl script in a folder that had 307 subdirs and about
>> 200megs of files; roughly 18,000 files, 13,000 of which were docs readable
>> by the script.
>>
>> Debian Wheezy amd_64
>> AMD Athlon(tm) II P360 Dual-Core Processor
>> 4gigs ram
>> Results:
>> Getting dirs takes 184543 microseconds per iteration
>> Getting files takes 1387819 microseconds per iteration
>> Reading files takes 14766208 microseconds per iteration
>> Done.
>>
>> In other words gathering up all the directories into a list
>> takes less than 200 milliseconds, getting a list of all the
>> files takes about a second and a half, and actually opening
>> the file and feeding the contents to a variable takes about
>> 15 seconds.
>>
>> Debian Wheezy (32bit)
>> Intel(R) Pentium(R) 4 CPU 3.60GHz
>> 1gig of ram
>> Results:
>> Getting dirs takes 46418 microseconds per iteration
>> Getting files takes 1365663 microseconds per iteration
>> Reading files takes 18203551 microseconds per iteration
>> Done.
>>
>>
>> Similar results on a machine with less ram and 32bit.
>>
>>
>> WinXP
>> Intel Core2 6600 @ 2.4GHz
>> 1gig of ram
>> (NTFS filesystem)
>> Results:
>> Getting dirs takes 0 microseconds per iteration
>> Getting files takes 13109000 microseconds per iteration
>> Reading files takes 41250000 microseconds per iteration
>> Done.
>>
>>
>> Not sure why it doesn't register anything for getting dirs. Also,
>> no idea why Windows takes so much longer to return the
>> list of files.  I haven't found any clues on the tcl wiki, tcl docs,
>> or tcl irc.  Finally, notice how much longer Windows takes to
>> read the files: it's nearly 3x slower than the Debian 64 machine.
>>
>>
>> Mac OS X 10.7.5
>> 2.33 GHz Intel Core 2 Duo
>> 2gig of ram
>> Results:
>> Getting dirs takes 5158 microseconds per iteration
>> Getting files takes 979583 microseconds per iteration
>> Reading files takes 30045212 microseconds per iteration
>> Done.
>>
>>
>> Still much faster than Windows for a comparable CPU, but
>> reading still takes some time.
>>
>>
>> ***
>>
>> So while I can make some optimizations here and there in
>> the search plugin, the measurements above are best
>> case scenarios.  You can try the script on Windows 7 if
>> you want-- unfortunately the script only looks inside directories in
>> its own parent directory so you might have to make a
>> test folder with lots of docs in order to make use of it.  However, I
>> suspect you'll see numbers more like my WinXP report above,
>> and that would mean you simply cannot get an initial search below
>> one minute with a comparable number of docs.
>>
>> Alternatives are:
>> * build an index from [pd META] data. I did this with the original
>> search tool built in pd, but you lose the ability to do a full text search
>> and effectively can no longer search text files and html.
>> * build a full text index from the docs.  Faster probably but it would
>> be a large file.
>> * use a search engine library like Xapian.  But it requires someone
>> who wants to do the work of using a search engine library like
>> Xapian.
>>
>> All those alternatives still have the requirement that you
>> spend time building the index at least once, instead of each time you restart
>> your computer or flush/overwrite wherever your OS caches the
>> dirs/files for the current search plugin.  And even with Xapian you're
>> opening files in tcl and sending them to an index through the Xapian interface,
>> so you'd still see the long wait time building the initial index.
>>
>>
>> I'll try to test a Pd-extended nightly to see how the smaller number
>> of docs performs later.
>>
>> -Jonathan
>>
>>
>> ----- Original Message -----
>>>   From: João Pais <jmmmpais at googlemail.com>
>>>   To: PD-List <pd-list at iem.at>
>>>   Cc: 
>>>   Sent: Wednesday, January 9, 2013 7:05 AM
>>>   Subject: [PD] search plugin time optimisation
>>>
>>>   Hi,
>>>
>>>   I was trying the search plugin, and one search takes around 60s. In the
>>>   end, the plugin reports that it had to look through "16337 docs".
>>>   I have several work directories in my path, which I don't use that
>>>   often. Is there a way of optimising the search plugin?
>>>
>>>   The system is W7.
>>>
>>>   Best,
>>>
>>>   João
>>>
>>>   _______________________________________________
>>>   Pd-list at iem.at mailing list
>>>   UNSUBSCRIBE and account-management -> 
>>>   http://lists.puredata.info/listinfo/pd-list
>>>      
>>
>>
> 
> 