radeonsi LLVM Performance Improvements

I just pushed a new branch to my LLVM repo that enables two important LLVM codegen features (machine scheduling and subregister liveness) for SI+ targets, which should improve the performance of the radeonsi driver.

The biggest improvement I’m seeing with this branch is in the luxmark luxball OpenCL demo, which is about 60% faster on my Bonaire. Other tests I’ve done show 10%-25% performance improvements. I haven’t done much OpenGL benchmarking, but I expect these changes to have a much bigger impact on OpenCL than on OpenGL, so OpenGL improvements may be at the lower end of that range. I still need more benchmark results to know for sure.

Enabling LLVM Machine Scheduler for radeonsi

I got back to working on enabling LLVM’s machine scheduler for radeonsi targets in the R600 backend after seeing a really good tutorial on how it works at this year’s LLVM Developers’ Conference.

Since I last worked on this, I’ve figured out how to enable register pressure tracking in the scheduler, so now the scheduler will switch to a register pressure reduction strategy once register usage approaches the threshold where using more registers reduces the number of threads that can run in parallel.
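To illustrate the idea, here is a toy Python sketch of a list scheduler that switches strategies once register pressure gets high. The instruction encoding, the dependency map, and the `pressure_limit` threshold are all invented for the example; this is not how the actual LLVM MachineScheduler is structured.

```python
def schedule(instrs, deps, pressure_limit):
    """instrs: name -> (defs, kills), where defs are registers whose live
    ranges start at that instruction and kills are ranges that end there.
    deps: name -> set of prerequisite instruction names."""
    scheduled, live, order = set(), set(), []
    while len(order) < len(instrs):
        # Only instructions whose dependencies are satisfied are ready.
        ready = [i for i in instrs
                 if i not in scheduled and deps[i] <= scheduled]
        if len(live) < pressure_limit:
            # Plenty of registers: issue in default priority order.
            nxt = ready[0]
        else:
            # Near the occupancy threshold: switch to a pressure-reducing
            # strategy that prefers instructions which end the most live
            # ranges and start the fewest new ones.
            nxt = max(ready, key=lambda i: len(instrs[i][1] & live)
                                           - len(instrs[i][0]))
        scheduled.add(nxt)
        order.append(nxt)
        defs, kills = instrs[nxt]
        live -= kills
        live |= defs
    return order
```

With a low limit, the scheduler pulls a range-ending instruction forward as soon as pressure hits the threshold; with a high limit, it keeps the default order.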

So far the results look pretty good: several of the Phoronix benchmarks are faster with the scheduler enabled. However, I am still trying to track down a bug that causes the xonotic benchmark to lock up when using the ‘ultra’ settings.

If anyone wants to test it out, I’ve pushed the code to my personal repo.

Bitcoin Mining with Open Source Drivers

* Update: all the code necessary to run bfgminer has been pushed upstream

On Monday, I started to work on getting one of the many bitcoin mining applications, bfgminer, to work with clover and r600g.  Since then, I have uncovered and fixed a number of bugs in the compute stack, and now, with some final bug fixes from R600 backend developer Vincent Lejeune, I have bfgminer working with the Open Source drivers! (Also, thanks to Aaron Watry for implementing the rotate builtin for libclc!)

If anyone is interested in testing, you can follow the installation instructions for gallium compute here; instead of the standard branches listed in the instructions, check out r179204 from the clang tree, and use the bfgminer branch from the following repos:


Additionally, you will need to invoke bfgminer using the command-line options in this script (you will need to modify the script to add your preferred bitcoin address). Currently, bfgminer will only work on Evergreen and non-Cayman Northern Islands GPUs.

There is still one bug that will occasionally cause a hash value to be miscalculated.  When this happens, bfgminer will report a hardware error, but this is not serious, and the program will continue to run.

The compiled kernel code for bfgminer is not very well optimized yet, and there are still a lot of things that can be done to improve performance.  For now, I’m going to work on getting these patches upstream and writing some piglit tests for the bugs, but maybe in the future I’ll have some time to take a closer look at the kernel performance.  I have a few ideas for improving performance, so if anyone is interested in playing around with the code, just ping me on irc.  My nick is tstellar, and I’m usually in #radeon and #dri-devel on irc.freenode.net.  There may even be a few things that would make good Google Summer of Code projects for any students out there interested in GPU performance and/or bitcoin.

* Update: bfgminer now autodetects the Mesa platform, so the bfgminer patch is no longer required.

r600g Todo List

I’ve just updated the R600Todo page with some LLVM and Compute tasks.  I’ll try to keep this page updated and add more tasks as I think of them.

These tasks are a great way to get started on GPU driver development, so if you are interested, pick a task and start hacking!  Also, let me know what you are working on so I can update the page.  If you have any questions about any of these, stop by #radeon or #dri-devel on irc.freenode.net and ask.

Testing clover with r600g

Compute support with clover and r600g has been progressing very nicely over the last few months. This is due to some great work by EVoC student Francisco Jerez on the clover state tracker and the gallium compute interface, and also by Ádám Rák, who wrote an r600g compute implementation. With these pieces in place it is possible to run simple OpenCL programs using clover and r600g! Here is what you can do to try it out:

* Update: Installation instructions can now be found here.

After all that you should be ready to go. I have posted some simple OpenCL programs here. Most of these should work with clover and r600g.

There is still a lot of work left to do, so don’t be surprised if your favorite OpenCL program doesn’t work, but I am excited about the work that has been done so far, and I’m optimistic the open source compute stack will continue to improve.

Texture Semaphore for r500

I just finished some improvements to the r300g instruction scheduler to make better use of the texture semaphore. The texture semaphore is used by instructions that need to read texture data: it tells the ALU to delay execution until the desired texture data has been fetched from the texture unit. Previously in the r300g compiler, all instructions used this semaphore, so even instructions that didn’t need texture data were waiting for it to be fetched. With these improvements, we are able to prefetch texture data by placing instructions that don’t depend on texture data directly after texture lookups, so they execute while the data is being fetched. This should lead to some performance improvements for certain kinds of shaders.

In Lightsmark, there is one shader in particular that really benefits from this optimization, and I’m getting about a 33% speedup in overall FPS with these new changes on my RV515. I’m curious to see what kind of performance improvement this brings for Lightsmark on other cards, and whether there are other applications that benefit. Unfortunately, though, this optimization is only available on r500 cards, so r300/r400 users are out of luck.
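The prefetch idea can be sketched in a few lines of Python. This is only an illustration with an invented instruction encoding, not the r300g scheduler itself: instructions that don’t read the texture result are kept in front, so they issue between the TEX and the first instruction that waits on the semaphore.

```python
def hoist_independent(instrs):
    """instrs: list of (op, reads) pairs; a 'TEX' op is assumed to
    produce a value named 'tex_result'."""
    out = []      # instructions that can run while the fetch is pending
    pending = []  # instructions that must wait on the texture semaphore
    tex_seen = False
    for op, reads in instrs:
        if op == "TEX":
            tex_seen = True
            out.append((op, reads))
        elif tex_seen and "tex_result" in reads:
            # Depends on the texture data: delay it past the fetch window.
            pending.append((op, reads))
        else:
            # Independent ALU work: executes while the data is fetched.
            out.append((op, reads))
    return out + pending
```

A real scheduler also has to respect dependencies between the delayed instructions and everything that follows them; the sketch ignores that entirely.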

If anyone is interested, I’ve pushed the code to the tex-sem branch of my fdo git repo (http://cgit.freedesktop.org/~tstellar/mesa/). When testing this out, you can make use of a new environment variable called RADEON_TEX_GROUP, which defines the maximum number of texture lookups to submit at the same time. The default is 8, because it gave me the best Lightsmark performance on my card, but different values might be better for other application/GPU combinations. To set the maximum number of texture lookups to 12, just do this:

RADEON_TEX_GROUP=12 ./your_app

The values I used for testing were 4, 8, and 12. It probably won’t help to go any lower than 4, and I doubt anything higher than 16 will have much of an effect.

There are also a few other optimizations in this branch, namely a smarter instruction scheduler and the re-enabling of the register rename pass, which enhances the effect of all the compiler optimizations. If you are interested, give this branch a try and let me know how it works for you.

Updates to the New R300 Register Allocator

I just pushed an updated version of the new r300 register allocator to http://cgit.freedesktop.org/~tstellar/mesa/. The branch is called new-register-allocator-v2. This new version contains support for loops and a few bug fixes. It has been rebased to include the floating-point texture additions, so it can now be tested on those apps that need floating-point textures.

New Register Allocator in the R300 Compiler

I’m mostly finished with a new and improved register allocator for fragment shaders in the R300 compiler. I still need to clean up the code and add comments, but otherwise it is ready for testing. The new allocator takes advantage of a register allocation algorithm designed for irregular architectures, from a paper by Johan Runeson and Sven-Olof Nyström. Eric Anholt implemented this algorithm and added it to Mesa so that all drivers could make use of it.

The new register allocator can pack one- and two-component register writes together into the same register to make full use of the four-component temporary registers that the programs have access to. For example, a program like this:

ADD TEMP[0].x, CONST[0].x, CONST[0].x
MUL TEMP[1].x, TEMP[0].x, TEMP[0].x
MUL TEMP[2].x, TEMP[1].x, TEMP[1].x

MAD OUT[0].x, TEMP[0].x, TEMP[1].x, TEMP[2].x

will now be transformed to this:

ADD TEMP[0].x, CONST[0].x, CONST[0].x
MUL TEMP[0].y, TEMP[0].x, TEMP[0].x
MUL TEMP[0].z, TEMP[0].y, TEMP[0].y

MAD OUT[0].x, TEMP[0].x, TEMP[0].y, TEMP[0].z

This will have a big impact on shaders that use a lot of scalar values. Some of the bigger shaders in Lightsmark use 30-50% fewer registers with the new register allocator on my RV515. I also get an improvement in fps from ~4.75 to ~5.30, which is about 10%, but with fps that low I’m not sure the difference is really significant. I’d be interested to see the results on other cards with different games and benchmarks. If anyone wants to test it out, the code is in the new-register-allocator branch here.
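For the curious, the packing itself can be sketched in Python. This greedy live-range version is only an illustration and is much simpler than the Runeson/Nyström graph-coloring algorithm the compiler actually uses: two values may share a register as long as they land in different channels or their live ranges don’t overlap.

```python
def pack(values):
    """values: list of (name, start, end) live ranges in program order.
    Returns name -> (register index, channel 0-3)."""
    assignment = {}
    regs = []  # per register: list of (channel, start, end) bookings
    for name, start, end in values:
        placed = False
        for r, bookings in enumerate(regs):
            for ch in range(4):
                # The channel is usable if every existing booking is on
                # a different channel or has a disjoint live range.
                if all(c != ch or end < s or e < start
                       for c, s, e in bookings):
                    bookings.append((ch, start, end))
                    assignment[name] = (r, ch)
                    placed = True
                    break
            if placed:
                break
        if not placed:
            # No room anywhere: allocate a fresh register.
            regs.append([(0, start, end)])
            assignment[name] = (len(regs) - 1, 0)
    return assignment
```

Run on the three overlapping scalar temporaries from the example above, this packs them into the x, y, and z channels of a single register.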

If you run programs with the environment variable RADEON_DEBUG=pstat they will print out statistics from the compiled shaders that are useful for evaluating the effectiveness of the new register allocator.

Bug Fixes for the sched-perf Branch

I just pushed a rebased version of the sched-perf branch to a new branch called sched-perf-rebase at git://anongit.freedesktop.org/~tstellar/mesa

This new branch contains bug fixes for the old branch and has no piglit regressions vs. master on my RC410 and RV515 cards. In fact, this branch has one additional passing test on both cards.

This new branch should reduce fragment shader program size by about 10-20%. Shaders with branches should see the most improvement. There are three major changes to the compiler that are driving these improvements.

The first change is that the dataflow analysis for the optimization passes has been unified into a single function, rc_get_readers(), which saves us from having to redo dataflow analysis for every pass and made it really easy to add the new optimization passes in this branch.
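As a rough illustration of what a unified readers query buys you, here is a hypothetical Python version. The real rc_get_readers works on the compiler’s IR and handles swizzles, branches, and loops, none of which this toy does: it just walks forward from a write and collects every instruction that reads the value before it is overwritten.

```python
def get_readers(instrs, write_index):
    """instrs: list of (dst, srcs) register-name tuples."""
    reg = instrs[write_index][0]
    readers = []
    for i in range(write_index + 1, len(instrs)):
        dst, srcs = instrs[i]
        if reg in srcs:
            readers.append(i)
        if dst == reg:
            break  # value is overwritten; later reads see the new write
    return readers
```

Every optimization pass can then ask "who reads this value?" through one query instead of rolling its own analysis.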

The second change makes better use of the scalar unit. Fragment shader instructions for R300-R500 cards are actually composed of two sub-instructions: one vector and one scalar. The vector instruction writes to the xyz components of a register, and the scalar instruction writes to the w component. Currently, in the master branch, an instruction like MOV Temp[0].x, Temp[1].x is treated as a vector instruction, since it writes to the x component. This wastes the vector unit on what is actually a scalar operation. One of the optimizations I added converts MOV Temp[0].x, Temp[1].x to MOV Temp[0].w, Temp[1].x, which allows us to make use of the scalar unit and leaves the vector unit free for actual vector instructions. Since there are usually more vector instructions than scalar ones, we can usually fill the empty vector slot with another instruction, which reduces the overall program size by one.
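The rewrite itself is mechanical. Here is a toy Python version with an invented instruction representation (this is not the actual compiler pass, and a real pass would also have to check that the w channel is free): single-channel MOV writes to .x are moved to .w, and later reads of those values are renamed to match.

```python
def move_to_scalar_unit(instrs):
    """instrs: list of (op, (dst_reg, dst_chan), [(src_reg, src_chan)]).
    Moves x-channel MOV writes to the w channel (the scalar unit) and
    renames subsequent reads of those values accordingly."""
    out, moved = [], set()  # moved: registers whose .x now lives in .w
    for op, (dreg, dchan), srcs in instrs:
        # Fix up reads of values that were relocated to .w.
        srcs = [(r, "w") if r in moved and c == "x" else (r, c)
                for r, c in srcs]
        if op == "MOV" and dchan == "x":
            dchan = "w"  # run this on the scalar unit instead
            moved.add(dreg)
        out.append((op, (dreg, dchan), srcs))
    return out
```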

The third big change is converting the code to a quasi static single assignment (SSA) form prior to instruction scheduling. SSA basically means that each register is written only once. The main advantage of SSA is that it makes dataflow analysis much easier; however, in the r300 compiler we aren’t really using it for dataflow analysis. We are using it because it helps our scheduler do a better job of pairing instructions and making use of the vector and scalar units on every cycle. I say quasi-SSA because you can’t really turn vector instructions into SSA unless you break them apart into individual scalar instructions. For example, with vector instructions you might run into cases like this:

MOV Temp[4].x, Temp[5].x
MOV Temp[4].y, Temp[6].x
MOV Temp[7].xy, Temp[4].xy

In true SSA, each register is written only one time, so we would need to rewrite the second instruction like this:

MOV Temp[4].x, Temp[5].x
MOV Temp[8].y, Temp[6].x
MOV Temp[7].xy, Temp[4].xy

Oops, now we broke the program. Instruction 3 reads from Temp[4].y, but that component is never written. We could change instruction 3 to MOV Temp[7].xy, Temp[8].xy, but then it would read from Temp[8].x, which isn’t written either. So, in the r300 compiler we convert everything to SSA unless we see code like the example above. In that case we just ignore it and don’t bother trying to rewrite it.
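In Python, the bail-out logic might look something like this toy version. The register names and the fresh-name scheme are invented for the example; the real pass works on the compiler’s IR. Repeated writes get renamed to fresh registers, except when a write only covers part of a register that a later instruction reads as a whole, since renaming there would split a value the reader needs intact.

```python
def quasi_ssa(instrs):
    """instrs: list of (dst, write_mask, [(src, read_mask), ...])."""
    out, written, alias = [], set(), {}
    fresh = 100  # hypothetical pool of unused register numbers

    def read_wider_later(i, reg, mask):
        # Does a later instruction read channels of reg beyond this mask?
        return any(s == reg and not set(sm) <= set(mask)
                   for _, _, srcs in instrs[i + 1:] for s, sm in srcs)

    for i, (dst, mask, srcs) in enumerate(instrs):
        # Reads always see the most recent rename of their source.
        srcs = [(alias.get(s, s), sm) for s, sm in srcs]
        if dst in written and not read_wider_later(i, dst, mask):
            # Safe to rename this repeated write to a fresh register.
            alias[dst] = "t%d" % fresh
            dst = alias[dst]
            fresh += 1
        else:
            # First write, or a partial write merged into a wider read:
            # leave it alone, exactly like the example above.
            alias.pop(dst, None)
            written.add(dst)
        out.append((dst, mask, srcs))
    return out
```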

As I mentioned earlier, these compiler optimizations reduce program size by about 10-20%. Here is an example from the piglit test glsl-fs-atan3:

Category                    master   sched-perf-rebase   fglrx
Total Instructions             111                  93      60
Vector Instructions             81                  65      47
Scalar Instructions             27                  37      47
Flow Control Instructions      20                  20       7
Presubtract Operations          3                   4       4
Temporary Registers            10                   9       6

The fglrx results come from the AMD Shader Analyzer v1.42.

So that’s about a 15% decrease in shader size for this test, but we are still quite far from fglrx. The good news, however, is that I can see lots of room for improvement. The big gap between the r300 compiler and fglrx is mostly due to our very inefficient use of flow control instructions, which in this shader costs us about 16 instructions. There are a few other optimizations we could be doing better, too.

I’m really not a GPU performance expert, so I don’t know how smaller shader programs will translate into better performance, at least in terms of frames per second. Smaller shaders mean less data needs to be submitted to the graphics processor, so that should help, but I think most of the performance bottlenecks are elsewhere in the driver.

I’m going to do more testing of the sched-perf-rebase branch before I merge it with master, but I feel pretty good about it now. Also, as a bonus while working on these performance improvements I found and fixed 5 non-performance related bugs, which I hope will resolve some of the outstanding r300g fdo bugs.

r300 Compiler Optimization Improvements

I just pushed a branch called sched-perf to git://anongit.freedesktop.org/~tstellar/mesa. It contains various optimization improvements:

  • Handling of flow control instructions in dataflow analysis.
  • More aggressive use of presubtract operations.
  • Some scheduler improvements.

I’m seeing about a 10% decrease in shader program size in most piglit tests with this branch, but I haven’t done much testing with real applications. I added a debug option a few weeks ago for dumping shader stats (RADEON_DEBUG=pstat), which I’ve been using with piglit; it is helpful for comparing compiler performance between branches.