Mixing OpenGL with Ray Tracing via NVIDIA CUDA for indirect lighting effects

This post is about a demo called Firefly for a course in Vienna University of Technology. The goal was the create a graphics demo, meaning a program with 3d animation which runs in real time, without user interaction, for few minutes and shows interesting graphic effects.

This is what I got (there is HD): Watch on YouTube
Notice that there are only two traditional lights (lamps). The arcade is mostly lit by indirect light, the cubes are shining from every point on the surface.

And I’m quite proud that I won the first place in the course’s contest :D

Now that you’ve seen the video, I’d like to talk a bit about the technology. In the end there is a conclusion, which you shouldn’t skip. Finally there is information on how to retrieve the code and executables.

Technology

I implemented a deferred rendering system using OpenGL and CUDA. The bouncing cubes are computed with Bullet by simply applying random forces. There are two main spot lights computed via shadow maps, indirect light is computed via ray tracing.

All geometry and texture information is rendered into a buffer on the GPU using OpenGL. The shadow maps are also rendered by OpenGL. Until here it is a standard deferred rendering system and therefore I’ll skip the details. Those buffers are then passed to CUDA, where the shading happens. This shading stage is divided into several CUDA kernels.

  • The first is doing the ray tracing.
    It is path tracing with only two bounces:
    camera —–> surface —–> surface —–> light.
    The first ray from the camera is not computed and taken from the geometry buffer instead. The last ray isn’t computed as well and taken from the shadow map instead. So only the middle ray is traced. Since indirect light usually contains only low frequencies, it is possible to compute it only in half of the resolution with only very little quality impact and big performance gain. On the half resolution image, 2-16 rays are computed per pixel (in the download you’ll find several executables, the quality refers to the number of rays per pixel. If I remember correctly, 4 rays per pixel were used for the video, this runs in real time (>30fps) on a modern GPU). For ray tracing itself I adapted a highly optimised kernel developed by Finnish NVIDIA engineers, along with a good BVH builder from the same source.
  • The next kernel filters the result from the ray tracing kernel. This filter is blurring, but it takes normals and distance into account, so that indirect light doesn’t get blurred over edges. The filter is not separable into a horizontal and vertical pass, because of the additional information taken into account. There were bandwidth and latency issues with the kernel and so I had to put quite some effort into optimising it.
  • Finally there is the main shading kernel, which computes the traditional shading and adds the indirect light. This is quite strait forward, there is just a little catch. Since the indirect light is computed in a lower resolution, in some cases there where ugly aliasing effects. Imagine a dark polygon in the foreground and a bright, indirectly lit one in the background. This situation would result in 2×2 steps across the edge. This is alleviated by another blurring stage, which takes normals and distance from the higher resolution buffer into account. So on pixels on polygon edges, which would take the colour of the underlying 2×2 block from a different polygon before, now the filtering would only take the colour from the correct polygon.

After the CUDA part finished, the buffers are handed back to OpenGL for post-processing: basic tone mapping, bloom, lens flares. Since I based firefly on a previous project’s OpenGL engine, those effects were already implemented. I won’t go into details because the effects are pretty standard and there are already a lot of sources.

Conclusion

Ok, so usually one should write how cool it was and how good the method worked. I’m an honest person: It was cool to program, I learned really a lot, I won the first place, I don’t regret, but the approach is bad, don’t try it out :), here is why:

My university tutors feared that the switch between OpenGL and CUDA would be too expensive, but this was not the case. Naturally there are costs, but those are way under 1 ms (unpacking geometry data and writing the output in CUDA costs about 1 ms, which includes already some computation work, while ray tracing takes around 30 ms). So this was the positive side: you can switch between OpenGL and CUDA every frame, if you do it right, the performance will be OK.

But to understand why the approach is only mediocre, one needs to understand how ray tracing performs on the GPU. Almost parallel rays are fast, random rays are slow. The reason is that “similar” rays will need mostly the same elements from the Bounding Volume Hierarchy, execution and data divergence will be low. Random rays are the opposite. That’s why it’s possible to get more than 60 fps when shooting only primary rays, while shooting just one random ray per pixel from the surface originally brought the performance to under 10 fps. Similarly it should be possible to compute rays from the lamps to the surface in a fast way, but it might include sorting and I didn’t test that.

So the approach from above accelerates the fast part of a 2 bounce path tracing algorithm, but the slow part – computing a random ray from one surface to another – stays slow.

It would be much better – from a software engineering perspective – not to mix ray tracing and the traditional pipeline in this case, because costs are to high compared to the benefits. A lot of data is duplicated on the graphics card, it is cumbersome to program, tradeoffs make performance okeyish, but quality is not great (look at the flickering and noise). It’s just not production ready and it will never be. It’s better to wait another couple of years until GPUs will be fast enough to do real path tracing in real-time.

While starting to program, I was also thinking of implementing caustics. Those could produce really nice effects and be over 60 fps – depending on the quality of the caustics, which could justify ray tracing. If somebody tries that, please let me know about it in the comments. In my case I couldn’t do it due to time constraints.

On a side note, I wasn’t thinking about NVIDIA OptiX due to past experiences with it. I was quite satisfied with CUDA, a ray tracing library would be cool, but that’s a pretty big wish : )

Code and Executable

You can use my old repository directly. The last version of the demo code is tagged, there are a few more revisions for another lecture’s submission.

I have packed everything together into a zip file. There are Windows and on Linux versions. You’ll need an NVIDIA graphics card and recent drivers. You might also need CUDA and you might need to delete the CUDA library files, it’s hard to deploy to unknown systems, do whatever works for you :) . I used CUDA 6.5..

Cross platform development for OpenGL, CUDA and friends

This is a small practical article about my development setup for Windows and Linux. I’m usually developing cross platform, because it doesn’t add much work and I’m switching from time to time between those systems. Both of them have advantages and disadvantages for me, but I don’t want to go into details here.

For version control I use Mercurial along with bitbucket. I prefer Mercurial over git for a simple reason, there is a very good open source GUI client for both operating systems, namely TortoiseHG. For git there are a few Windows clients, but I didn’t find any Linux one that I liked. Bitbucket is cool, because they provide the usual code hosting plans, mercurial and git hosting, optional bug tracker, wiki, code browsing and more.

The next thing is a cross platform build system: CMake. Most Linux and cross platform IDEs support it and it’s easy to generate a Visual Studio project out of it. Here is the code for the build file of my latest project. It includes OpenGL, CUDA and some other libraries. You can use it as a template.

project(firefly)

#this is my project structure
# ~/firefly/src/CMakeLists.txt   (this file)
# ~/firefly/src/main.cpp         (and other source files, there are also subdirectories)
# ~/firefly/data/..              (shaders, music, model data..)
# ~firefly/linlib/lib/..         (linux libs not installed on the system, usually .so and .a files)
# ~firefly/winlib/lib/..         (windows libs not installed on the system, usually .lib and .a files)
# ~/firefly/*inlib/include/..    (header files libs not installed on the system)
# ~/firefly/build/..             (not committed to the version control system, names vary, created by cmake.
                                  .dll files need to by copied here, there must be a cmake command which could do that but I did it manually)
# ~/firefly/documentation/..     (optional documentation, reports for uni etc.)

cmake_minimum_required(VERSION 2.6)
cmake_policy(SET CMP0015 OLD)

#find_package(Qt4 REQUIRED)
find_package(OpenGL REQUIRED)
find_package(CUDA REQUIRED)

#include_directories(${CMAKE_SYSTEM_INCLUDE_PATH} ${QT_INCLUDES} ${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_CURRENT_SOURCE_DIR})
include_directories(${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_CURRENT_SOURCE_DIR} ../include)

if(CMAKE_SYSTEM_NAME STREQUAL "Linux")
  link_directories(/usr/lib /usr/local/lib ../linlib/lib)

  # otherwise some bullet internal headers don't find friends..
  include_directories(/usr/local/include/bullet /usr/include/bullet ${CMAKE_CURRENT_SOURCE_DIR}/../linlib/include /usr/local/cuda/include)
else()
  #windows
  include_directories(${CMAKE_CURRENT_SOURCE_DIR}/../winlib/include ${CMAKE_CURRENT_SOURCE_DIR}/../winlib/include/bullet)
  link_directories(${CMAKE_CURRENT_SOURCE_DIR}/../winlib/lib)
endif()

set(project_SRCS
#list all source and header files here. separate files either by spaces or newlines
main.cpp
Class.cpp  Class.h
..
)

#shaders, optional, will be shown in the IDE, but not compiled
file(GLOB RES_FILES
../data/shader/Filename.frag
../data/shader/Filename.vert
)

#set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -O3 --use_fast_math -gencode arch=compute_20,code=sm_21 --maxrregcount 32)
#set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} "--use_fast_math -gencode arch=compute_20,code=sm_20 -lineinfo -G")   #-G for cuda debugger
#set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} "--use_fast_math -gencode arch=compute_20,code=sm_21 -lineinfo")
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} "--use_fast_math -gencode arch=compute_20,code=sm_21 -lineinfo --maxrregcount 32")

if(CMAKE_SYSTEM_NAME STREQUAL "Linux")
   #"-D VIENNA_DEBUG" defines the preprocessor variable VIENNA_DEBUG
#   set(CMAKE_CXX_FLAGS "-D VIENNA_DEBUG -D VIENNA_LINUX -std=c++11")
   set(CMAKE_CXX_FLAGS "-D VIENNA_LINUX -std=c++11")
   set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} " -std=c++11")
else()
   #"/DVIENNA_DEBUG" defines the preprocessor variable VIENNA_DEBUG
    #add_definitions(/DVIENNA_DEBUG)
    add_definitions(/DVIENNA_WINDOWS)
    SET( CMAKE_EXE_LINKER_FLAGS  "${CMAKE_EXE_LINKER_FLAGS}" )
endif()

#qt4_automoc(${project_SRCS})
#add_executable(firefly ${project_SRCS})
cuda_add_executable(firefly  ${RES_FILES} ${project_SRCS})

if(CMAKE_SYSTEM_NAME STREQUAL "Linux")
    set(LIBS ${LIBS} X11 Xxf86vm Xi GL glfw3 GLEW Xrandr pthread assimp BulletDynamics BulletCollision LinearMath fmodex64 freeimage gsl gslcblas ${CUDA_curand_LIBRARY})
	target_link_libraries(firefly ${LIBS})
else()
        set(LIBS ${LIBS} OpenGL32 glfw3 GLEW32 assimp fmodex64_vc FreeImage gsl cblas ${CUDA_curand_LIBRARY})
	target_link_libraries(firefly ${LIBS} debug BulletDynamics_Debug debug BulletCollision_Debug debug LinearMath_Debug)
	target_link_libraries(firefly ${LIBS} optimized BulletDynamics optimized BulletCollision optimized LinearMath)
	target_link_libraries(firefly ${LIBS} general BulletDynamics general BulletCollision general LinearMath)
endif()

And lastly you’ll need cross platform libraries. Fortunately there are quite good ones.

  • For GUI I use Qt, it can also open an OpenGL window, but if you don’t need GUI, it would be overkill.
  • So for Chawah and Firefly I used GLFW, which simply opens a window and provides mouse and keyboard input.
  • Then you need an OpenGL extension loader, I used GLEW, but there are also other alternatives.
  • For math (everything connected to vectors and matrices) I can recommend GLM.
  • Assimp loads meshes, objects, animations and whole scene graphs, but it’s a bit buggy in certain areas and lacks certain features (only linear animations afaik, only simple materials and only basic lights (no area lights)). I didn’t find anything better though.
  • Then there is FreeImage which is a very easy to use image loader and exporter.
  • You can also check out the other libs from the CMake file..

In case you want a working example, just download the firefly project. The project is quite big, I made a post about it. If you have any questions, then I’ll answer them in the comments..

How is GLM’s performance in CUDA?

Edit3:
Short answer: Slower than CUDA’s native types (float3, float4 etc). In the course of my real time graphics project I’ll extend and replace NVIDIA’s proprietary HelperMath header to include matrix multiplications and other helper functions I need. After the project is finished I’ll probably publish this library as open source (naturally not including any of NVIDIA’s code).

I’m working on a little ray tracing CUDA project right now and found out, that GLM also works in that environment. But soon I ran into performance issues and was looking for the culprit. While it certainly isn’t only GLM, it still slowed things down.

I decided to make some quick performance tests and here are the results:

  • GLM’s matrix multiplications are 4 times slower compared to my custom one liner and using CUDA’s native vector types.
  • Dot and cross products of vectors are roughly 30% faster than the implementation found in the cuda samples (helper_math.h)

I used a GForce GTX 550 Ti, CUDA 6.5 and GLM 0.9.5.4 on linux for the test.

Hopefully the people behind GLM can fix it quickly as it is really a neat library. I submitted a bug report.

In case you want to see or even edit the source code of the tests, it lies on bitbucket, feel free to commit any changes : )

Edit:
We were able to fix the performance problem in glm’s bug tracker. After aligning glm::vec4 properly, the matrix multiplication is almost equal and other functions (dot and cross) got even faster.
time for cuda glm (matrix): 233 milliseconds
time for cuda helper math (matrix): 226 milliseconds
time for cuda glm (dot): 185 milliseconds
time for cuda helper math (dot): 302 milliseconds
time for cuda glm (cross): 46 milliseconds
time for cuda helper math (cross): 162 milliseconds

Edit2:
The matrix multiplication indeed improved performance, but in the real world (well, my code :P ) this was only part of the issue. Additionally the tests I made triggered some kind of optimisation that was only possible in glm’s code. I discovered it when seeing that the results of those testing computations were either infinity or 0. I changed the values of the matrix/vectors so that the results stay in a range of [10*e-2, 1000] and vualà, the performance is almost the same.

Because of GLM running slower in my code, I conducted some additional tests and attempted to improve GLM’s performance. The performance didn’t improve enough, so I finally submitted another bug report.

Edit3:
The bug report was closed without any fixes to the lower performance. Additional aligned data types where added (aligned_vec4 etc), but the default ones won’t be aligned on CUDA.


## test results of my fastest glm version
#test 1 (synthetic)
time for cuda glm (matrix):.........546 milliseconds
time for cuda helper math (matrix):.660 milliseconds
time for cuda glm (dot):............471 milliseconds
time for cuda helper math (dot):....491 milliseconds
time for cuda glm (cross):..........246 milliseconds
time for cuda helper math (cross):..246 milliseconds

#test 2a (real life)
time for glm:.......................468 milliseconds
time for cuda:......................417 milliseconds

#test 2b (2a, but removed early exit from a loop)
time for glm:.......................373 milliseconds
time for cuda:......................370 milliseconds

Global illumination rendering using path tracing and bidirectional path tracing

This year I started a course on Advanced Computer Graphics in Aalto University by Jaakko Lehtinen. The focus was on methods for global illumination (GI). In a nutshell, global illumination means more realistic lighting compared to “traditional”, direct lighting, where light is reflected by all surfaces (including diffuse ones). Therefore the shading includes indirect light, colour bleeding, caustics etc. Think of a room with only one window and no additional light, the wall with the window will be also illuminated, but only by light that was reflected from the floor and other walls.

There are many methods of implementing GI, for instance photon mapping, instant radiosity, path tracing and bidirectional path tracing or the quite involved method of Metropolis light transport (MLT). It was obligatory to implement path tracing, but I implemented bidirectional path tracing in addition. As a bonus, I implemented both of the methods on the graphics card using the OptiX Framework from NVIDIA, but please don’t make the same mistake of using OptiX.

Path tracing

Path tracing means tracing a path from the camera, bouncing off of surfaces in a random direction, and doing so until the ray hits the light. On every bounce the reflectance function is used to calculate the amount of reflected light. For instance a red surface would reflect only red light, a white surface all of it and a black one nothing. Naturally one sample is not enough, several samples are taken per pixel and the average gives the actual colour. This results in an unbiased estimation of the colour at the pixel. The technical term for this is Monte Carlo integration.

There are many ways to speed up the convergence, I implemented the following (some of them reused for the bidirectional path tracer):

  • At every bounce one additional ray is shot into the direction of a light. This is especially important if the light source is small and hitting it randomly has a small probability. In order not to add the energy from the light twice, a random hit should be ignored.
  • The randomly reflected ray should not be sampled according to an uniform distribution, but to a distribution matching the reflectance function and a geometrical term (importance sampling). This gets even more important with a micro-facet model, which was used to simulate partly reflective surfaces (the same model was used for the normal path tracer and the bidirectional one).
  • Numbers from a random number generator tend to cluster, which is not optimal for Monte Carlo. A Low-discrepancy sequence is better, because the area of integration is covered more evenly. The Wikipedia Article about the Halton sequence shows the difference.

Bidirectional path tracing

Difference between bidirectional and normal path tracing (from Veach's PhD thesis).
Difference between bidirectional and normal path tracing (from Veach’s PhD thesis).

The idea of this method is quite simple, but the implementation is involved. Normal path tracing sometimes has difficulties to sample the light, especially when there are caustics or the light is a bit hidden as it is in the scene above. The reason in both cases is that it can’t be sampled directly (shooting a ray into the direction of the light) and hitting it randomly is improbable. More generally, normal path tracing has difficulties to sample important paths in certain situations, resulting in a noisy image.

Different bidirectional sampling techniques, taken from the Mitsuba documentation.
Different bidirectional sampling techniques, taken from the Mitsuba documentation.

Bidirectional path tracing shoots rays from both directions, from the light and from the camera. This results in two paths consisting of vertices, the light and the eye/camera path. Then each camera vertex is connected to each of the light vertices, forming different paths for transporting light. Eric Veach calls them bidirectional sampling techniques. These techniques have different “fields of expertise”, meaning that certain techniques are only efficient for certain effects. The figure on the right shows a collection of such sampling techniques.

In order to compile them into an image, they are weighted and added up. Computing the weight is a bit complex, but basically it’s computing the probability of generating the specific path divided by the sum of generating any path with vertices at the same positions. This results in switching off the techniques for effects that they can’t sample efficiently and therefore results in an image with less noise. This can be observed in the bottom part of the figure on the top right. The technical term is multiple importance sampling and is also covered in depth in Veach’s thesis.

After implementing the base of the algorithm with perfectly diffuse surfaces, I also added perfectly refractive and reflective surfaces as well as micro facet type surfaces (math taken from here).

Sub Surface Scattering

Finally I also implemented sub surface scattering. This book contains an excellent explanation of it (the papers (1, 2) by Jensen et al. aren’t so good in my opinion). I chose the accelerated version of the algorithm. This means that I have an octree containing surface points of cached irradiance. It seemed like this is not so cool as expected, the cache needs a lot of space in memory and the objects are lacking the montecarlo noise, which makes them look a bit odd. Blender’s renderer, Cycles, also doesn’t use the acceleration structure, instead they sample the surface directly. If I ever have to implement SSS another time, I’ll do it the Blender way. But for this project I chose to use a hack: In order to fix the unusual look, I mixed in the normal diffuse reflection. So a SSS surface is now 50% SSS and 50% perfectly diffuse.

Results

Because I implemented so many features, I didn’t have enough time to choose nice scenes.

There is one Pangolin, based on a model made by my sister:

Rendering of a Pangolin
The body contains a term for SSS, it filled up the video memory almost completely: 96%.

then there is the famous Sponza:

Sponza, featuring a simple path tracer and diffuse surfaces
Sponza, featuring a simple path tracer and diffuse surfaces

and finally a demo scene for the bidirectional tracer:

Bidirectional Test Scene
The glass was modeled by me, the rest are primitives from Blender. The scene features caustics, reflective and refractive surfaces, micro-facet model surfaces, some lights and diffuse surfaces.

Code

I was asked to provide the code. Since the project is based on the OptiX Samples framework and the framework has a proprietary license, I can’t give you the full project. Instead I made a diff to the sample directory provided with OptiX 3.0 (as this was the current version when I started coding).

To make it work you need:
1. download OptiX 3.0
2. apply the patch, under Linux it should be possible with the patch tool, something similar should exist on windows
3. if you want to compile and execute my code, download and compile Assimp and FreeImage (the dll’s and header files are assumed to be in folders next to the project folder)
4. follow the instructions in the readme file.

If you have any questions or problems with compiling, feel free to contact me. Preferable via the answer function, so others can also benefit..

Here is the patch.

Why you should never use NVIDIA OptiX: bugs

In my most recent project, I implemented a GPU path and bidirectional path tracer. I already previously used OptiX for my bachelor thesis, and I thought it is a good idea to use it also for this. I mean, it was created for ray tracing, so it should be predestined for such kind of GPU application.

One of the biggest advantages for me were the reliable and fast acceleration structures, which are crucial for ray tracing. But it proved to cause more pain than benefit due to several bugs. And the overall speed was not that great in the end, probably due to lacking optimisation in my code, which in turn is at least partly caused by lacking profilers (which would be available for CUDA).

But the big issue were really the bugs, which prevented me actively from implementing stuff. Some of the bugs are totally ridiculous and don’t get fixed..

I already encountered the first bug when coding for the bachelor thesis 2 years ago. Unfortunately I ignored it and decided again on OptiX. There is a special printf function, which takes into account the massively parallel architecture of the graphics card. For example, it is possible to limit the printing just to one pixel. But, ehm

// the following line works:
rtPrintf("asdf\n");

// on the contrary the following two don't
rtPrintf("asdf\n");
rtPrintf("asdf\n");

The latter crashes with

OptiX Error: Invalid value (Details: Function “RTresult _rtContextLaunch2D(RTcontext, unsigned int, RTsize, RTsize)” caught exception: Error in rtPrintf format string: “”, [7995632])

The difference is really only in calling the rtPrintf once or twice with the same text. Sometimes it could be even a different text and sometimes it worked with the same text. I reduced the code to a minimal example in order to eliminated possible stack corruption, but to no avail. This problem was
reported on the NVIDIA forum in June 2013. It’s possible to use stdio’s printf in conjunction with some ‘if’ guards as a workaround, as proposed in the forum.

The second one is also connected to rtPrintf:

RT_PROGRAM void pathtrace_camera() {
   BiDirSubPathVertex lightVertices[2];
   lightVertices[0].existing = false;
   lightVertices[1].existing = false;

   for(unsigned int i=0; i<2; i++) {
      sampleLightPath();
      if(!(lightVertices[i].existing)) break;
   }
   // rtPrintf("something\n");
   sampleEye();

   output_buffer[launch_index] = make_float4(1.f, 1.f, 1.f, 1.f);
}

This code would crash on two tested operating systems (Windows 7 and Linux) and on two different computers (my own workstation and one from the uni).

OptiX Error: Unknown error (Details: Function “RTresult _rtContextLaunch2D(RTcontext, unsigned int, RTsize, RTsize)” caught exception: Encountered a CUDA error: Kernel launch returned (700): Launch failed, [6619200])

.
It runs fine though, when the rtPrintf is not commented. I reported it here. Again I made a minimal example, in the end the source was only two files, containing ~30 and ~60 lines, but it constantly and reliably crashed depending on weather the rtPrintf was commented or not.

So, later, if the program crashed, the first attempt to resolve the issue was spreading randomly rtPrintfs. This was also one of the solutions to another problem presented in the next paragraphs.

But before I come to it, I quickly have to explain the compilation process. There are two steps, first, during “compile time”, the C++ code is turned into binaries and the OptiX source into intermediate .ptx files. Those contain sort of a GPU assembler, which is then compiled during runtime by the NVIDIA GPU driver into actual binaries executed on the device. This is triggered by a C++ function call, usually context->compile().

Now, the problem is that these context->compile()s don’t return always. Sometimes they run until the host memory is full. Once it helped to spread calls to the mentioned rtPrintf in the code, another time the resolution was an extra RT_CALLABLE_PROGRAM construct, report with additional information here.

Generally those problems are painful and demotivating. Especially taking into account, that it seems like NVIDIA doesn’t care, at least they didn’t answer any of my reports. But the reason for this could also be, that they are phasing out OptiX already due to little success. Any way, it was certainly the last time I used OptiX and I don’t recommend anybody to start with it.

Real Time 3d Mandelbulb

Render of a 3d Mandelbulb
Render of a 3d Mandelbulb

It’s actually already almost one year since I finished the work on an Erasmus project in Universitat Politècnica de València, Spain. The goal was to speed up the computation of distance estimated 3d fractals in a way so that they could be computed in real time and therefore make it possible to explore them in an interactive fly through program.

Fractals are graphics produced by recursive mathematical formulas. Sine the graphics are purely based on math, one can zoom in infinitely into them without loosing quality, well – as long as the float precision is enough. One classical 2d fractal is the Mandelbrot. Some years ago a formula for a 3d Fractal was found that resembles a Mandelbrot and the structure was called Mandelbulb.

Although there already exist programs that do the fractal computation on the graphics card, none of the ones I found was able to do it in acceptable quality and real time. Some of them produced nicer images, but at the cost of having long rendering times, others where sort of interactive, but the quality was bad.

I first implemented a quick proof of concept using NVIDIA OptiX, based on their Julia example, but OptiX quickly showed its limitations, especially when I started to work on a method to speed the thing up. So I switched to NVIDIA CUDA and was satisfied with that decision (and anyway, as I will blog soon, I recommend to stay away from OptiX as far as possible : ).

 

I’ll explain in a few sentences how the rendering works, details for the traditional method can be found on the page of Mikael Hvidtfeldt Christensen and for both, the traditional and improved one  in my report linked below.

Figure explaining ray marching
ray marching

There is a function, which returns the maximum lower bound for the distance to the fractal, called the distance estimator (DE). When walking along a ray, for instance shot by the camera, it is safe to walk this distance, because we have the guarantee that the fractal won’t intersect the ray in this interval. After one step, the distance is estimated again and we march as long, as the estimated distance is above a certain threshold.

My idea to speed up the rendering was to decrease the amount of computation by letting neighbouring rays “ride” on the previous rays.

Principle of the new algorithm
Principle of the new algorithm

This works in two passes, first, “primary” rays are shot with a distance of several pixels to each other, recording the DE values. Secondly, the neighbours, called secondary rays in the figure, are shot. Those can use the information calculated in the first pass to jump straight to the end of the stepping. So, that’s it, in short. Don’t confuse the terms primary and secondary rays with bouncing light of ray tracing here.

Benchmarks of the new algorithm compared to the old one from the report
Benchmarks of the new algorithm compared to the old one from the report

Comparing the traditional ray marching algorithm with the new, faster one, shows a speed up of up to 100% and interactive exploration in almost all views without shadows. It would be probably possible to also implement this speed up for shadow rays, but I had no time to do it. Unfortunately there are some artefacts due to the ray “riding”, details can be found in the report. If you are interested into the source code, please write me a message.

Edit: The  reason I don’t publish the repository is that there is still some (c) NVIDIA code inside (setting up CUDA, the render window etc). If there is somebody willing to replace / redo that code with something GPLv3, check the build and maybe update to the newest sdk, that would be most welcome :)

Edit2: Ok, the GPL parts of the source code are published now. I’m aware that it does not compile, however I haven’t got the time to replace the NVidia parts taken from the examples. Maybe it’ll help some of you anyways :)

Merging Ray Tracing and Rasterization in Mixed Reality

Reflection Scene rendered by our combined renderer
Reflection Scene rendered by our combined renderer

I’ve been studying computer sciences for three years and recently I finished writing my bachelor thesis. The topic was implementing reflections and refractions in the context of mixed reality. Basically, mixed reality means adding virtual objects into a video stream (of real environment). One of the goals is to make the resulting images look as realistic as possible. Naturally this includes reflection and refraction effects.

In my work I utilise the framework implemented in the RESHADE project. There was already a mixed reality renderer based on a Direct3D rasterizer. My task was to work on a ray tracer using OptiX and merge the resulting image with the rasterizer’s image. OptiX provides a ray tracing framework for the graphics card.

Glass Bunny Scene
Glass Bunny Scene

Generally a ray tracer is slower than a rasterizer, but it has better quality with reflections and refractions. Because of that, we used a combined method. The ray tracer is working only on transparent and reflecting surfaces. This is achieved by a mask, which is created by the rasterizer in a separate render pass. This mask indicates, where reflecting or refracting objects are.

Especially with virtual flat mirrors there are very strong artefacts (see reflection scene, right mirror). This is due to lack of information. We are using the video frame, the environment map and the textured model of the real scene to gather information for the reflected ray.

Other problems we encountered were artefacts due to different shading algorithms in the rasterizer and ray tracer, and various issues with implementation due to the tone mapper.

When I was more or less finished with my work, I discovered that somebody else in the project was working on the implementation of reflections and refractions using the rasterizer. This was cool, because we were able to compare the results. Usually rasterizing is much faster than ray tracing. Surprisingly this was not the case with all scenarios. The reflection scene (first image) was faster with the combined method, the transparent bunny was slower. Especially with the bunny scene, I believe that the main drawback in speed was accuracy. The ray tracer also computes inner reflections, which causes a lot of additional work. I didn’t manage to benchmark this (rendering without inner reflections), because I was running out of time, but maybe I will do it yet..

Thanks to Michael Wimmer for making this work possible and special thanks to Martin Knecht for being an excellent supervisor and helping with all the questions :).

Finally, if you are interested into more details, here you can find the thesis..

Lindenmayer brush for Krita

This term i had a course about fractals on the university. We also had to write code for a submission and the lecturer made the technique and topic free to choose. So I chose to make a lindenmayer brush for krita, because I planed to do so since the first university year, where I heard of lindenmayer systems the first time.

I quickly coded something together, so that I was able to prove, that my idea would work.

Lindenmayer systems are some kind of a grammar, they work like this: You have an alphabet of letters, in this case these letters are simple short lines, then you have productions and a rule: Always produce all letters. These productions eat one letter and produce ether no, one or several new letters. There are several classes of such productions, one of the simplest would be “A -> AA”. More sophisticated productions, like in the brush, can contain parameters, if clauses, calculations, random values etc. Basically it would be possible to have several different productions, but I chose to have only one and in turn make it more powerful.

Here is an example of a production, it is used in one of the brushes:

if{letter[branched] == false} {
newLetter = newLetter();
newLetter[position] = variant(letter[endPosition]);
newLetter[angle] = variant(newLetter[angleToSun]);
newLetter[length] = 5;
letter[branched] = bool(true);
}

if{letter[age] > 1} {
letter.delete();
}

It actually creates just a line.

At first the productions were plain C++ code, but that is very cumbersome, as you have to build, install and start krita for testing. The next thing I tried, was to use QtScript and write the productions in ECMAScript (also called JavaScript). It took about 5 to 7 hours until I proved, that it was way to slow.

So I decided to implement my own little scripting language. The interpreter is about 700 lines of code, including some tests. It’s certainly not as powerful, as ECMAScript, but it’s fast and okayish for this use case.

There is an if clause, which supports &&, there is a rand() function, a function for mixing angles, some very basic math, bool and float data types, you can create new letters and delete old ones. It’s possible to write new parameters (“branched” in the code above), edit some default ones (position, angle, length..), which will be used for drawing and access some computed ones (endPosition, angleToSun, distanceToSun, age..).

I have to write some documentation, expand the possibilities of the language and work on the configuration interface. But for now you can already test the current code, it’s in the krita-lbrush-adamc branch in the calligra repository.

Here is a final picture..

The presets for this brushes are included in the repository :)

Chawah, Final update a little latish..

I had lots of things to do, no time etc etc. You know the excuses. But now I want to write another blog about a new Lindenmayer brush for Krita, and though have the motivation to catch up with old posts..

Well, we added loads of new features, mainly graphical ones:

  • lens  flares and sun dazzle effect
  • moving space dust (white circles), which are moving randomly, this is an simple particle system.
  • explosion and fire stream effects (more complicated particle systems)
  • environment mapping (on the ships and station)
  • bloom
  • music and sound effects
  • some power ups (armor, rockets, boost)

Here is the video that we made for the presentation.

Chawah Progress

More than one month since my last post and we (Felix and me) are quite satisfied with our progress. There is some basic game logic (you can destroy the other ship and then the game restarts), health display and physics. We got all points at our first submission and currently we are starting implementing the effects. Environment mapping is already in place.

I’ve made a short video, where you can see bullet physics in action (at the end of the video you can see bullet debug mode enabled).