It’s been a long time since my last post. So here, in early 2023, I am publishing two of my recent reports, on Graphics and on Parallel Computing. With the tight deadlines of final exam week, it was difficult to make these reports perfect. Even so, I have made every effort to keep them clear and easy to read, and I think they can serve as reasonable tutorials for both subjects. They are also good snapshots of my engineering and research abilities at the moment.

Envoy: A High-Performance Bounding Volume Hierarchy System for Building and Intersecting

You can get a PDF version of this report here.

For this project and report, I need to reflect on my implementation ability and my time management. I do like this report, but I’m not satisfied with the progress of the project. I gained a lot from building the framework from scratch, but doing so kept me from implementing the complete version I proposed: I ran out of time, and I made mistakes along the way. I spent a lot of effort thinking about how to simplify Nanite, but it turned out that the original paper is not suitable for a final project on its own. Nor did I have the ability or the time to implement and optimize a BVH system as efficient as bvh. So I compromised on this level of implementation. As you can see from the report, although it is long and somewhat redundant, there isn’t much information in it beyond Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees.
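To make concrete what that paper contributes, here is a minimal C++ sketch of its first stage: assigning 30-bit Morton codes to primitive centroids so that a sort brings spatially nearby primitives together, after which the node hierarchy is derived from common prefixes of the sorted codes. The function names and the 10-bits-per-axis quantization follow the common formulation of this trick; this is an illustration, not code from Envoy.

```cpp
#include <algorithm>
#include <cstdint>

// Spread the lower 10 bits of v so that two zero bits separate each original bit.
// This bit interleaving is the standard building block for 3D Morton codes.
static std::uint32_t expand_bits(std::uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Map a centroid with coordinates normalized to [0,1]^3 onto a 30-bit Morton code.
// Sorting primitives by this key is the first stage of the parallel LBVH build in
// [Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees].
static std::uint32_t morton3d(float x, float y, float z) {
    x = std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    std::uint32_t xx = expand_bits(static_cast<std::uint32_t>(x));
    std::uint32_t yy = expand_bits(static_cast<std::uint32_t>(y));
    std::uint32_t zz = expand_bits(static_cast<std::uint32_t>(z));
    return xx * 4 + yy * 2 + zz;
}
```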

My self-assessment: great progress in my engineering (coding) ability, but the project lacks challenging implementations. Still, it reminded me that any innovation must be grounded in prior papers, and that alone is real progress for me, maybe the most important progress I made in 2022.

I’d like to present one part of it here, since it captures almost my full understanding of the rendering field right now.

Although the course project ends here, there is a lot of room to keep going. For example, following this work [Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees], there are [Fast Parallel Construction of High-Quality Bounding Volume Hierarchies] and [Efficient Incoherent Ray Traversal on GPUs Through Compressed Wide BVHs], which enable building much more efficient BVHs with triangle splitting, and [Efficient Stackless Hierarchy Traversal on GPUs with Backtracking in Constant Time], which boosts the performance of BVH traversal. There are also efficient BVH implementations on GitHub, for example brandonpelfrey/Fast-BVH and madmann91/bvh, which are much faster than mine. Before I move on to more advanced architectures, these optimizations need to be done first.

I mentioned a prospect in my previous proposal. Three parts are crucial and general in a complete renderer, whether online or offline. Consider rendering scenes that cannot fit in a PC’s RAM [The Design and Evolution of Disney’s Hyperion Renderer]. They can have so many lights that the standard sampling techniques (UniformSampleOneLight or UniformSampleAllLights) cannot handle them. They can have more than 100 GB of scene geometry. And we would still like to render such a scene in real time. Scenes like this are actually common in games and movies, as mentioned in [The Design and Evolution of Disney’s Hyperion Renderer]. The Three Treasures of Modern Renderers are: an efficient multi-platform, heterogeneous math library (or compiler) [Mitsuba 2: A Retargetable Forward and Inverse Renderer; Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering]; a general importance sampler [Generalized Resampled Importance Sampling: Foundations of ReSTIR]; and streaming BVHs with LoD [The Design and Evolution of Disney’s Hyperion Renderer]. With these techniques, the importance sampler can handle many lights, effectively selecting the lights that actually contribute to the shading point (a minimal sketch of that idea follows below). With an appropriate math library (or compiler) design, in the extreme case a renderer written in Taichi gets heterogeneous computing for free. Streaming BVHs require no modification to the rest of the renderer, and LoD is necessary for large scenes: consider a scene where the whole city is visible, where we cannot afford to load all of the geometry into memory and burn memory bandwidth on detail we will never resolve. These are what I plan to learn, research, and implement in the future.
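As a hint of how the “general importance sampler” treasure copes with many lights, here is a minimal C++ sketch of the resampling idea behind [Generalized Resampled Importance Sampling: Foundations of ReSTIR]: stream over candidate lights and keep one survivor with probability proportional to an unnormalized contribution weight. The `Light` and `Reservoir` names and the single-reservoir structure are my own illustrative simplification, not ReSTIR’s actual interface.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical light record: `weight` stands in for an unnormalized estimate of the
// light's contribution to the current shading point (e.g. intensity over squared distance).
struct Light {
    float weight;
};

// One survivor plus the running weight sum needed for an unbiased estimator.
struct Reservoir {
    std::size_t selected = 0;
    float weight_sum = 0.f;
};

// Weighted reservoir sampling over the candidate lights: each light ends up
// selected with probability proportional to its weight, in a single pass.
Reservoir select_light(const std::vector<Light>& lights, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.f, 1.f);
    Reservoir r;
    for (std::size_t i = 0; i < lights.size(); ++i) {
        float w = lights[i].weight;
        r.weight_sum += w;
        // Replace the current survivor with probability w / weight_sum.
        if (r.weight_sum > 0.f && uniform(rng) < w / r.weight_sum) {
            r.selected = i;
        }
    }
    return r;
}
```

A real renderer would resample only a small candidate subset per shading point and then reuse reservoirs across neighboring pixels and frames, which is where ReSTIR gets most of its win.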

Reading Lock-Free Locks Revisited

You can get a PDF version of this report here.

This is the reading project for the Parallel Computing course at our university. Although I heard from my seniors that this class is not hard, the content of the later lectures is still very interesting. The interesting things start with cuckoo hashing, a hashing scheme with two exciting properties: an efficient, fully parallel table build, and constant-time queries. Later, designing work-efficient PRAM algorithms is just as important; I claimed to my friends that if you are not familiar with PRAM algorithm design, you should learn parallel computing again. Starting from the PRAM model, we have the so-called super-fast maximum finding with $O(n)$ work and $O(\log \log n)$ time on a doubly logarithmic tree together with the algorithm-cascading technique, and efficient list ranking in $O(n)$ work based on cycle coloring and algorithm cascading. Built on the PRAM Euler tour, we then get PRAM tree algorithms.
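To illustrate the constant-time query property of cuckoo hashing, here is a minimal single-threaded C++ sketch: a lookup probes at most the two slots chosen by the two hash functions. The class name, the toy hash functions, and the eviction bound are illustrative choices of mine, not the course’s parallel implementation.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Minimal cuckoo hash set. Each key may live in exactly one of two candidate
// slots, one per hash function, which makes the query worst-case constant time.
class CuckooSet {
public:
    explicit CuckooSet(std::size_t capacity) : table_(capacity) {}

    // At most two probes, regardless of table occupancy.
    bool contains(std::uint64_t key) const {
        const auto& s0 = table_[h0(key)];
        const auto& s1 = table_[h1(key)];
        return (s0 && *s0 == key) || (s1 && *s1 == key);
    }

    // Insert by evicting whatever occupies the preferred slot and pushing the
    // evicted key into its alternate slot; a real table would rehash when the
    // eviction chain gets too long.
    bool insert(std::uint64_t key) {
        if (contains(key)) return true;
        std::size_t slot = h0(key);
        for (int attempt = 0; attempt < 32; ++attempt) {
            if (!table_[slot]) { table_[slot] = key; return true; }
            std::swap(key, *table_[slot]);        // evict the current occupant
            std::size_t alt0 = h0(key), alt1 = h1(key);
            slot = (slot == alt0) ? alt1 : alt0;  // send it to its other slot
        }
        return false;  // eviction chain too long: rehash in a real implementation
    }

private:
    std::size_t h0(std::uint64_t k) const { return (k * 0x9E3779B97F4A7C15ull) % table_.size(); }
    std::size_t h1(std::uint64_t k) const { return ((k ^ 0xC2B2AE3D27D4EB4Full) * 0xFF51AFD7ED558CCDull) % table_.size(); }

    std::vector<std::optional<std::uint64_t>> table_;
};
```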

Finally, there is this report. I chose the paper from the PPoPP ’22 best paper awards, which turned out to be a good choice. The paper, Lock-Free Locks Revisited, introduces a way to write code with fine-grained locks while getting efficient lock-free behavior. I turned to CMU’s parallel computing course for a brief introduction to lock-free programming. The paper relies on the property of the so-called idempotent thunk: a function without parameters which, even if executed more than once, appears to have executed only once. It achieves this through a stronger condition, synchronization: thunks running in different threads are synchronized through a log sequence, which records the transitions of the underlying finite state machines. In short, it states that, using the primitives of its library flock, a lambda can be made idempotent, and with idempotent lambdas your code can be written in a fine-grained-lock style while performing like lock-free code. See my report and the original paper for more details.
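As a toy model of the “executed many times, appears executed once” property, here is a small C++ sketch in which several helping threads may all run the same parameterless thunk but agree on a single published result through a compare-and-swap. This only captures the idempotence of a thunk whose sole side effect is its return value; flock’s real mechanism logs the shared reads and writes inside the thunk so that arbitrary side effects replay consistently, so treat this as an illustration, not the paper’s algorithm.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

// Toy idempotent thunk: a parameterless computation that several threads may
// execute concurrently (e.g. when helping a stalled lock holder), yet which
// appears to have executed exactly once, because every executor adopts
// whichever result was published first.
struct IdempotentThunk {
    static constexpr std::uint64_t kEmpty = ~0ull;  // sentinel meaning "no result yet"

    // Assumed to have no side effects besides its return value,
    // and to never legitimately return kEmpty.
    std::function<std::uint64_t()> body;
    std::atomic<std::uint64_t> result{kEmpty};

    std::uint64_t run() {
        std::uint64_t seen = result.load(std::memory_order_acquire);
        if (seen != kEmpty) return seen;            // already ran: reuse the outcome
        std::uint64_t computed = body();            // re-execution is harmless here...
        std::uint64_t expected = kEmpty;
        if (result.compare_exchange_strong(expected, computed,
                                           std::memory_order_acq_rel)) {
            return computed;                        // ...because only one result is kept
        }
        return expected;                            // another thread published first
    }
};
```

In flock, the analogous guarantee covers whole critical sections, because the log sequence makes every shared read and write inside the thunk replay to the same values no matter which thread re-executes it.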

Thank you for reading!