[PD] search plugin time optimisation
Hans-Christoph Steiner
hans at at.or.at
Thu Jan 17 22:01:45 CET 2013
This troubleshooting is great. I was also hoping that the OS's disk caching
would make things acceptably fast; it works that way on GNU/Linux and Mac OS X.
I really think that Xapian is the way forward on this. I don't think it'll be
too hard to use in the search-plugin, and it will provide much faster searching.
.hc
On 01/17/2013 03:35 PM, Jonathan Wilkes wrote:
> Another update on this:
> If I pare down the number of files to something that approaches Pd-extended's
> extra folder + doc folder (around 6,500 items):
> Results for GNU/Linux:
>
> Getting dirs takes 524 microseconds per iteration
> Getting files takes 324646 microseconds per iteration
> Reading files takes 5483581 microseconds per iteration
> Done.
>
>
> This is on the Debian Wheezy machine with the 1gig ram
> and fast processor-- it might take slightly longer on the other
> machine but there's probably not much difference.
>
> Unfortunately WinXP still takes an eternity to do the same-- I didn't
> copy the results but getting files was something like 8 seconds, and
> reading them was about 30 seconds. (And that wasn't including the
> "doc" folder so about 700 fewer files!)
>
> Unfortunately this discrepancy is so large that I can think of few decent
> changes to the plugin and its interface that would improve the situation
> for Windows users without bothering unix-derived OS users. Five
> seconds on Debian for the first search (and less than a second for
> subsequent ones) is completely reasonable IMO, and I don't see the
> Pd-extended documentation growing significantly any time soon.
>
> So to end a _long_ answer to your question, I think you have to remove
> the old libs from your search path and simply put up with the minimum
> 35 second initial search time. If subsequent searches for anything other
> than the empty symbol (i.e., "") are taking 2 minutes to complete let me know
> and I'll try to troubleshoot it.
>
> -Jonathan
>
>
>
> ----- Original Message -----
>> From: Jonathan Wilkes <jancsika at yahoo.com>
>> To: João Pais <jmmmpais at googlemail.com>; PD-List <pd-list at iem.at>
>> Cc:
>> Sent: Thursday, January 17, 2013 3:21 AM
>> Subject: Re: [PD] search plugin time optimisation
>>
>> If you're describing the time it takes for an _initial_ search, see below.
>> However, subsequent file access is an order of magnitude faster-- on my
>> pd-extended install with ca. 5,000 docs I barely even see the progressbar
>> at all after the first search.
>>
>>
>> I tested the attached tcl script in a folder that had 307 subdirs and about
>> 200megs of files; roughly 18,000 files, 13,000 of which were docs readable
>> by the script.
>>
>> Debian Wheezy amd_64
>> AMD Athlon(tm) II P360 Dual-Core Processor
>> 4gigs ram
>> Results:
>> Getting dirs takes 184543 microseconds per iteration
>> Getting files takes 1387819 microseconds per iteration
>> Reading files takes 14766208 microseconds per iteration
>> Done.
>>
>> In other words gathering up all the directories into a list
>> takes less than 200 milliseconds, getting a list of all the
>> files takes about a second and a half, and actually opening
>> the file and feeding the contents to a variable takes about
>> 15 seconds.
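
The three phases measured above can be reproduced with a short script. This is only a rough Python sketch, not the attached tcl script: the function name `time_phases` and the `DOC_EXTENSIONS` list are assumptions made for illustration, and the numbers it reports depend heavily on whether the OS has already cached the files.

```python
import os
import time

# Assumption: which extensions count as readable docs; adjust as needed.
DOC_EXTENSIONS = (".pd", ".txt", ".html")


def time_phases(root):
    """Time directory listing, file listing and file reading separately."""
    timings = {}

    # Phase 1: collect every directory under root (root itself included).
    start = time.perf_counter()
    dirs = [dirpath for dirpath, _subdirs, _names in os.walk(root)]
    timings["dirs"] = time.perf_counter() - start

    # Phase 2: list the regular files in each directory.
    start = time.perf_counter()
    files = []
    for directory in dirs:
        with os.scandir(directory) as entries:
            files.extend(entry.path for entry in entries if entry.is_file())
    timings["files"] = time.perf_counter() - start

    # Phase 3: read the full contents of each doc into a variable.
    start = time.perf_counter()
    docs_read = 0
    bytes_read = 0
    for path in files:
        if not path.lower().endswith(DOC_EXTENSIONS):
            continue
        try:
            with open(path, "rb") as handle:
                contents = handle.read()
        except OSError:
            continue  # unreadable file: skip it
        docs_read += 1
        bytes_read += len(contents)
    timings["read"] = time.perf_counter() - start

    counts = {"dirs": len(dirs), "files": len(files),
              "docs": docs_read, "bytes": bytes_read}
    return timings, counts
```

Calling `time_phases(".")` twice in a row from a test folder should show the gap between a cold first pass and a cached second pass, with the timings returned in seconds rather than microseconds.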
>>
>> Debian Wheezy (32bit)
>> Intel(R) Pentium(R) 4 CPU 3.60GHz
>> 1gig of ram
>> Results:
>> Getting dirs takes 46418 microseconds per iteration
>> Getting files takes 1365663 microseconds per iteration
>> Reading files takes 18203551 microseconds per iteration
>> Done.
>>
>>
>> Similar results on a machine with less ram and 32bit.
>>
>>
>> WinXP
>> Intel Core2 6600 @ 2.4GHz
>> 1gig of ram
>> (NTFS filesystem)
>> Results:
>> Getting dirs takes 0 microseconds per iteration
>> Getting files takes 13109000 microseconds per iteration
>> Reading files takes 41250000 microseconds per iteration
>> Done.
>>
>>
>> Not sure why it doesn't register anything for getting dirs. Also,
>> no idea why Windows takes so much longer to return the
>> list of files. I haven't found any clues on the tcl wiki, tcl docs,
>> or tcl irc. Finally, notice how much longer Windows takes to
>> read the files: it's nearly 3x slower than the Debian 64 machine.
>>
>>
>> Mac OS X 10.7.5
>> 2.33 GHz Intel Core 2 Duo
>> 2gig of ram
>> Results:
>> Getting dirs takes 5158 microseconds per iteration
>> Getting files takes 979583 microseconds per iteration
>> Reading files takes 30045212 microseconds per iteration
>> Done.
>>
>>
>> Still much faster than Windows for a comparable CPU, but
>> reading still takes some time.
>>
>>
>> ***
>>
>> So while I can make some optimizations here and there in
>> the search plugin, the measurements above are best
>> case scenarios. You can try the script on Windows 7 if
>> you want-- unfortunately the script only looks inside directories in
>> its own parent directory so you might have to make a
>> test folder with lots of docs in order to make use of it. However, I
>> suspect you'll see numbers more like my WinXP report above,
>> and that would mean you simply cannot get an initial search below
>> one minute with a comparable number of docs.
>>
>> Alternatives are:
>> * build an index from [pd META] data. I did this with the original
>> search tool built in pd, but you lose the ability to do a full text search
>> and effectively can no longer search text files and html.
>> * build a full text index from the docs. Faster probably but it would
>> be a large file.
>> * use a search engine library like Xapian. But that requires someone
>> willing to do the work of integrating it.
>>
>> All those alternatives still require that you spend time building the
>> index at least once, though only once rather than each time you restart
>> your computer or flush/overwrite wherever your OS caches the
>> dirs/files for the current search plugin. And even with Xapian you're
>> opening files in tcl and sending them to an index through the Xapian interface,
>> so you'd still see the long wait time building the initial index.
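
To make the "build once, reuse afterwards" trade-off concrete, here is a minimal Python sketch of the full-text index alternative. It is not Xapian and not the search plugin's code: the names `build_index`, `load_or_build` and `search`, the JSON cache file and the `DOC_EXTENSIONS` list are all assumptions made for illustration. The first call still pays the full cost of reading every doc; later calls only load the saved index.

```python
import json
import os
import re

# Assumption: which extensions count as searchable docs.
DOC_EXTENSIONS = (".pd", ".txt", ".html")


def build_index(root):
    """Read every doc under root once; map each word to the files containing it."""
    index = {}
    for dirpath, _subdirs, names in os.walk(root):
        for name in names:
            if not name.lower().endswith(DOC_EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    text = handle.read()
            except OSError:
                continue  # unreadable file: skip it
            for word in set(re.findall(r"\w+", text.lower())):
                index.setdefault(word, []).append(path)
    return index


def load_or_build(root, cache_path):
    """Load a saved index if one exists; otherwise build it and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    index = build_index(root)
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)
    return index


def search(index, query):
    """Return the sorted paths of docs that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)
```

This sketch never checks whether docs changed after the cache was written, so a real tool would need some way to rebuild or refresh the index. As noted above, the saved index can also grow large.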
>>
>>
>> I'll try to test a Pd-extended nightly to see how the smaller number
>> of docs performs later.
>>
>> -Jonathan
>>
>>
>> ----- Original Message -----
>>> From: João Pais <jmmmpais at googlemail.com>
>>> To: PD-List <pd-list at iem.at>
>>> Cc:
>>> Sent: Wednesday, January 9, 2013 7:05 AM
>>> Subject: [PD] search plugin time optimisation
>>>
>>> Hi,
>>>
>>> I was trying the search plugin, and one search takes around 60s. In the end,
>>> the plugin reports that it had to look through "16337 docs".
>>> I have several work directories in my path, which I don't use that often.
>>> Is there a way of optimising the search plugin?
>>>
>>> The system is W7.
>>>
>>> Best,
>>>
>>> João
>>>
>>> _______________________________________________
>>> Pd-list at iem.at mailing list
>>> UNSUBSCRIBE and account-management ->
>>> http://lists.puredata.info/listinfo/pd-list
>>>