[GEM-dev] Fwd: Cg runtime and fragment programs

Fri Sep 17 15:35:45 CEST 2004

hey ronan,

...here's the specifics of a discussion I had with some guys about the 
Cg/ATI/ARB problems...I know it's a little long, but I think there's 
some good info here for all of us to consider regarding the issues...

l8r,
jamie

Begin forwarded message:

> From: Chris Bentley <chrisb at ati.com>
> Date: June 28, 2004 10:32:01 AM EDT
> To: mac-opengl at lists.apple.com
> Cc: mgix at apple.com
> Subject: Re: Cg runtime and fragment programs
>
> Hi,
>
> Here are Dan Gessel's comments:
>
> Our initial ARB FP implementation was designed with the assumption 
> that programs would be written to the parse, or non-native, limits. 
> These were set to be close to, or, in some cases, less than the native 
> limits (as we would hold back exporting resources that might be 
> required to implement some ARB FP instructions which are not native to 
> our HW).
>
> Because of this, instruction re-ordering and merging of dependent 
> phases (or indirections) was not a primary concern.
>
> Specifically, we understand that Cg focuses on minimizing 
> temporaries, as it is apparently an important issue for Nvidia 
> hardware; in contrast, "indirections" are limited on ATI HW.
>
> Another aspect of our hardware is that fetches from texture rectangles 
> require ALU ops to scale texture coordinates before the texture 
> instruction can be issued, which, if the first instruction is a 
> texture rect fetch, will introduce another "indirection".
>
> Because Cg trades "indirections" for fewer temporaries, Cg-generated 
> code with fetches from texture rectangles will often fail on our early 
> ARB FP compilers.
>
> The 10.3.4 compiler is much improved from 10.3.3 and earlier, but is 
> still designed for ARB FP programmers writing in "assembler", with an 
> understanding of the HW limitations.
>
> However, additional work on the compilers for upcoming hardware (with 
> instruction space for significantly longer programs) has also come 
> with much more aggressive optimization (including indirection 
> aggregation).
>
> Because the optimizers are more capable, we have enlarged the parse 
> limits so that we have the best opportunity to execute any given 
> program.
>
> In particular, the program below can be executed on ATI HW using the 
> latest beta drivers.
>
>
> chrisb
>
>
>
>
>
> On Jun 24, 2004, at 7:11 PM, Chris Bentley wrote:
>
>>
>>
>> -----Original Message-----
>> From: mgix at apple.com [mailto:mgix at apple.com]
>> Sent: Thursday, June 24, 2004 6:47 PM
>> To: macopengl List
>> Subject: Re: Cg runtime and fragment programs
>>
>>
>> On 24 June 04, at 14:44, Pete Warden wrote:
>>
>>> One other thing to note is that whilst the limits are tighter than
>>> NVidia's, the fragment programs we use run two to three times faster
>>> on ATI R3xx systems than the NV34.
>>>
>>>> Yes, on the R300 chips the limit is 64 ALU instructions, and then
>>>> there are also 32 texture instructions.
>>>
>>> By the way Chris, when you mention '32 texture instructions', is there
>>> any way to actually use that many TEX's without hitting the other
>>> limitations? We've managed to use 8 texture reads in a single program
>>> by using glMultiTexCoords() or other ways of generating the texture
>>> coordinates that don't trigger the indirection count, but doing any
>>> more than that fails.
>>>
>>> Pete
>>>
>>> On Thursday, June 24, 2004, at 01:53 PM, Chris Bentley wrote:
>>>
>>>> Hi,
>>>>
>>>>> If you're running on an ATI board, there are severe limitations to
>>>>> what you can actually execute on it:
>>>>
>>>> Whether "severe" or not, yes there are indeed limitations...  To
>>>> inject a little actual data into this discussion, here are the
>>>> limits:
>>>>
>>>>
>>
>> I guess 'severe' was probably a bad choice of words :) There
>> is indeed plenty of cool stuff you can do within the existing limits.
>>
>> However, I'd like to add my 2 cents to Pete's comment : I am kind of 
>> surprised by the 32 texture instruction limit you mention: every time 
>> I write a pixel shader that uses more than about 8 taps, it either 
>> does not load or fails silently (black images).
>>
>> I confess I did not explore the issue in depth, and I'm not sure what 
>> actual limit the shader hit (# of TEX instructions, ordering or 
>> other), but I do hit these limits real, real fast, especially with 
>> raw Cg output.
>>
>> To inject a little more data in the discussion, below is a specific 
>> example of an ARB shader that fails for me (I run on an R350, on OSX
>> 10.3.4):
>>
>> It has 7 texture taps, 14 ARB instructions, and seems to be within 
>> constraints, but fails to load :
>> gl->getProgramivARB(GL_PROGRAM_UNDER_NATIVE_LIMITS_ARB) returns 0
>>
>> One frustrating thing is that it is very hard to figure out why the 
>> shader fails, and you end up having to spend lots of time just trying 
>> to move things about in the ARB instruction list just to get the 
>> driver to accept it.
>>
>> It looks like it does fail simply because the instructions do not 
>> come in the order the driver expects.
>>
>> If so, it would indeed be very nice if the driver could reorder them 
>> on the fly to fit the requirement of the chip rather than having the 
>> programmer do it by hand.
>>
>> Especially if the HW constraints change with each chip generation.
>>
>>
>> !!ARBfp1.0
>> # ARB_fragment_program generated by NVIDIA Cg compiler
>> # cgc version 1.2.0001, build date Feb 19 2004  10:51:06
>> # command line args: -I. -profile arbfp1
>> #vendor NVIDIA Corporation
>> #version 1.0.02
>> #profile arbfp1
>> #program main
>> #semantic main.tex0 : TEXUNIT0
>> #semantic main.w0 : C0
>> #semantic main.w1 : C1
>> #var samplerRECT tex0 : TEXUNIT0 : texunit 0 : 1 : 1
>> #var float4 w0 : C0 : c[0] : 6 : 1
>> #var float3 w1 : C1 : c[1] : 7 : 1
>> #var float4 color : $vout.COLOR : COL : 0 : 1
>> #var float4 tc0 : $vin.TEXCOORD0 : TEX0 : 2 : 1
>> #var float4 tc1 : $vin.TEXCOORD1 : TEX1 : 3 : 1
>> #var float4 tc2 : $vin.TEXCOORD2 : TEX2 : 4 : 1
>> #var float2 tc3 : $vin.TEXCOORD3 : TEX3 : 5 : 1
>> PARAM u0 = program.local[0];
>> PARAM u1 = program.local[1];
>> TEMP R0;
>> TEMP R1;
>> TEMP R2;
>> TEX R0, fragment.texcoord[0].zwzz, texture[0], RECT;
>> TEX R1, fragment.texcoord[0], texture[0], RECT;
>> MUL R0, R0, u0.y;
>> MAD R0, R1, u0.x, R0;
>> TEX R1, fragment.texcoord[1], texture[0], RECT;
>> TEX R2, fragment.texcoord[1].zwzz, texture[0], RECT;
>> MAD R0, R1, u0.z, R0;
>> MAD R0, R2, u0.w, R0;
>> TEX R1, fragment.texcoord[2], texture[0], RECT;
>> TEX R2, fragment.texcoord[2].zwzz, texture[0], RECT;
>> MAD R0, R1, u1.x, R0;
>> MAD R0, R2, u1.y, R0;
>> TEX R1, fragment.texcoord[3], texture[0], RECT;
>> MAD result.color, R1, u1.z, R0;
>> END
>> _______________________________________________
>> mac-opengl mailing list | mac-opengl at lists.apple.com
>> Help/Unsubscribe/Archives: 
>> http://www.lists.apple.com/mailman/listinfo/mac-opengl
>> Do not post admin requests to the list. They will be ignored.
>
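
To make the "indirections versus temporaries" trade-off above concrete, here is a minimal sketch in Python that estimates the texture indirection count of an ARB fragment program. The counting rule is a paraphrase of the one in the ARB_fragment_program specification and should be read as an assumption: a new indirection starts at a texture instruction whose coordinate is a temporary already written in the current group, or whose result is a temporary already read or written by an ALU instruction in the current group. Real drivers may count differently, and the extra ALU scaling step for RECT fetches described above is not modelled. The function name `count_indirections` and the hand-reordered listing are hypothetical illustrations, not something posted in the thread, and the reordered program has not been tried on hardware.

```python
DECLARATIONS = {"PARAM", "ATTRIB", "OUTPUT", "ALIAS", "OPTION", "ADDRESS"}
TEXTURE_OPS = {"TEX", "TXP", "TXB"}

def base_name(operand):
    """Strip negation, swizzle and array index: '-R1.xyz' -> 'R1'."""
    return operand.strip().lstrip("-").split(".")[0].split("[")[0]

def count_indirections(source):
    """Estimate texture indirections (assumed rule, see the note above)."""
    temps = set()        # names declared with TEMP
    written = set()      # temps written so far in the current group
    alu_touched = set()  # temps read or written by ALU ops in the current group
    count = 1            # a program starts with one indirection
    for raw in source.splitlines():
        line = raw.split("#", 1)[0].strip().rstrip(";").strip()
        if not line or line.startswith("!!") or line == "END":
            continue
        op, _, rest = line.partition(" ")
        operands = [base_name(x) for x in rest.split(",")]
        if op == "TEMP":
            temps.update(operands)
            continue
        if op in DECLARATIONS:
            continue
        dst, srcs = operands[0], operands[1:]
        if op.split("_")[0] in TEXTURE_OPS:
            # A dependent fetch, or reuse of a temp the ALU already touched,
            # closes the current group and starts a new indirection.
            if srcs[0] in written or dst in alu_touched:
                count += 1
                written.clear()
                alu_touched.clear()
        else:
            alu_touched.update(t for t in [dst] + srcs if t in temps)
        if dst in temps:
            written.add(dst)
    return count

# The instruction stream from the Cg-generated listing in the thread.
CG_OUTPUT = """
TEMP R0; TEMP R1; TEMP R2;
TEX R0, fragment.texcoord[0].zwzz, texture[0], RECT;
TEX R1, fragment.texcoord[0], texture[0], RECT;
MUL R0, R0, u0.y;
MAD R0, R1, u0.x, R0;
TEX R1, fragment.texcoord[1], texture[0], RECT;
TEX R2, fragment.texcoord[1].zwzz, texture[0], RECT;
MAD R0, R1, u0.z, R0;
MAD R0, R2, u0.w, R0;
TEX R1, fragment.texcoord[2], texture[0], RECT;
TEX R2, fragment.texcoord[2].zwzz, texture[0], RECT;
MAD R0, R1, u1.x, R0;
MAD R0, R2, u1.y, R0;
TEX R1, fragment.texcoord[3], texture[0], RECT;
MAD result.color, R1, u1.z, R0;
""".replace("; TEMP", ";\nTEMP")

# Hypothetical hand reordering: all fetches first, one temporary per tap.
REORDERED = """
TEMP R0, R1, R2, R3, R4, R5, R6;
TEX R0, fragment.texcoord[0].zwzz, texture[0], RECT;
TEX R1, fragment.texcoord[0], texture[0], RECT;
TEX R2, fragment.texcoord[1], texture[0], RECT;
TEX R3, fragment.texcoord[1].zwzz, texture[0], RECT;
TEX R4, fragment.texcoord[2], texture[0], RECT;
TEX R5, fragment.texcoord[2].zwzz, texture[0], RECT;
TEX R6, fragment.texcoord[3], texture[0], RECT;
MUL R0, R0, u0.y;
MAD R0, R1, u0.x, R0;
MAD R0, R2, u0.z, R0;
MAD R0, R3, u0.w, R0;
MAD R0, R4, u1.x, R0;
MAD R0, R5, u1.y, R0;
MAD result.color, R6, u1.z, R0;
"""

print(count_indirections(CG_OUTPUT), count_indirections(REORDERED))
```

Under this assumed rule the Cg listing, which reuses R1 and R2 after the ALU has read them, counts four indirections with three temporaries, while the reordered variant counts one indirection at the cost of seven temporaries. That is the trade described above: a compiler tuned to save temporaries produces code that spends indirections.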