15618 Project: Fibonacci Heap in Parallel

Proposal

Authors: Lemin Zhang, Zihao Du

Summary

In this project, we will explore the parallelization of the Fibonacci heap, a highly efficient priority queue data structure, on multi-core CPUs. We plan to explore semantic relaxation and develop parallel versions of core operations such as insert, delete-min, and decrease-key. As a final extension, we will investigate parallel strategies on heterogeneous CPUs with both P-cores and E-cores.

Webpage

https://leminzhang.github.io/projects/parallel_fib_heap/

Background

Fibonacci Heap

In a standard sequential Fibonacci heap, the data structure is represented as a forest of heap-ordered trees together with a pointer to the minimum root. It has very low amortized time complexity:

  • insert: O(1)
  • merge: O(1)
  • decrease-key: O(1)
  • delete-min: O(log n)

Its efficiency relies on lazy maintenance: insert and decrease-key are cheap in the amortized sense, while much of the structural cleanup is postponed until delete-min. This deferred cleanup, especially the consolidation step, is where a large amount of work is concentrated.

The most compute- and coordination-intensive part of the data structure is delete-min/consolidation. After removing the minimum root, all of its children must be promoted to the root list, and then trees with the same degree are repeatedly merged until each degree appears at most once. In the sequential implementation, this consolidation operation is typically performed by scanning the root list and linking trees one at a time. This process touches a large portion of the forest and performs many pointer updates, making it both expensive and difficult to scale with straightforward locking.

delete_min():
    remove minimum root z
    move all children of z to the root list

    // possible parallel region
    while there exist two roots with the same degree:
        link the two trees
        update the resulting root's degree

    update global minimum
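The consolidation loop above can be sketched sequentially with a degree-indexed table. This is a simplified illustration, not our final implementation: the hypothetical `Tree` struct summarizes each root tree by its root key and degree only, omitting child lists.

```cpp
#include <vector>

// Hypothetical minimal view of a root tree: its root key (the minimum
// of the tree) and its degree (number of children of the root).
struct Tree {
    int key;
    int degree;
};

// Link two trees of equal degree: the larger-key root would become a
// child of the smaller-key root, so the merged tree's degree grows by one.
Tree link(const Tree& a, const Tree& b) {
    Tree winner = (a.key <= b.key) ? a : b;
    winner.degree += 1;
    return winner;
}

// One pass of consolidation: repeatedly merge trees of equal degree
// until each degree appears at most once, using a degree-indexed table.
std::vector<Tree> consolidate(std::vector<Tree> roots, int max_degree) {
    std::vector<Tree> slot(max_degree + 1, Tree{-1, -1}); // degree -1 marks empty
    for (Tree t : roots) {
        while (slot[t.degree].degree != -1) {   // collision: same degree
            Tree other = slot[t.degree];
            slot[t.degree] = Tree{-1, -1};
            t = link(t, other);                 // merged tree has degree + 1
        }
        slot[t.degree] = t;
    }
    std::vector<Tree> out;
    for (const Tree& t : slot)
        if (t.degree != -1) out.push_back(t);
    return out;
}
```

The `while` loop inside the `for` loop is the serial chain of links that the "possible parallel region" marker refers to: each link can create a new collision at the next degree, which is exactly what makes naive parallelization hard.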

Similarly, decrease-key may trigger cascading cuts, in which a node and potentially a chain of its ancestors are cut from their parents and moved back to the root list. These operations cause nonlocal structural updates and are therefore a source of contention in a parallel setting.
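A minimal sketch of the cascading-cut rule, using a hypothetical flat node table with parent indices and the standard `marked` flag (keys and child pointers are omitted for brevity):

```cpp
#include <vector>

// Hypothetical flat node table: parent index (-1 for a root) and the
// "marked" flag the Fibonacci heap uses to bound cascading cuts.
struct Node {
    int parent = -1;
    bool marked = false;
};

// Cut node v from its parent and move it to the root list; if the parent
// was already marked, cut it too, and so on up the tree (cascading cut).
// Returns the indices moved to the root list, in cut order.
std::vector<int> cascading_cut(std::vector<Node>& nodes, int v) {
    std::vector<int> moved;
    while (v != -1 && nodes[v].parent != -1) {
        int p = nodes[v].parent;
        nodes[v].parent = -1;       // v becomes a root
        nodes[v].marked = false;    // roots are never marked
        moved.push_back(v);
        if (!nodes[p].marked) {     // parent loses its first child: mark, stop
            nodes[p].marked = true;
            break;
        }
        v = p;                      // parent already marked: cut it as well
    }
    return moved;
}
```

The chain of ancestor updates visible in the loop is why a single decrease-key can touch nodes far from where it started, which is the contention problem described above.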

Existing Solutions

Prior work suggests that parallelizing the Fibonacci heap is promising, but that a simple coarse-lock implementation is far from the best option: lock granularity must be reduced for better scalability. Bhattarai's experiments showed that a fine-grained parallel Fibonacci heap performs much better than one using coarse-grained locks. The paper also indicates that relaxing semantics, such as the guarantee of returning the exact minimum of the heap, could improve concurrency. Another piece of work on parallel heap operations is ParButterfly, a framework containing new parallel algorithms for processing butterflies (a subgraph pattern in bipartite graphs). It demonstrates that batch-parallel formulations can efficiently support several heap operations, including batched insertions and decrease-key. These works either focus on only a subset of operations or treat the Fibonacci heap as a component within a larger framework, leaving some of its most structurally challenging operations, such as delete-min, underexplored. Motivated by this gap, our basic idea is to reproduce some of the prior work and continue developing parallel versions of the core operations.

The Challenge

Although the Fibonacci heap has very efficient amortized time complexity in theory, in practice its performance faces significant challenges. This project aims to improve the real-world performance of the Fibonacci heap through parallelism on both homogeneous and heterogeneous platforms. These improvements should be achieved by solving the challenges described below.

Workload Characteristics

As mentioned above, delete-min is the most time-consuming operation in a Fibonacci heap. All other operations, such as insert and decrease-key, are designed to take constant amortized time, while delete-min is logarithmic and involves a large amount of work. This project will therefore focus on delete-min and its time-consuming consolidation process as the main target for speedup.

In a serial Fibonacci heap, the trees in the heap are visited one by one during consolidation. In a parallel heap, performance can clearly be improved by assigning those trees to different processors and processing multiple trees simultaneously. Such an assignment maps each tree to a specific processor, and the tree's data will likely be cached on that processor. However, insert and decrease-key each create a new tree in the heap, and the heap must support these operations at very high frequency. As a result, the number and sizes of trees in the heap change dramatically. To balance the workload in consolidation, a semi-dynamic or dynamic algorithm should be used to assign trees to processors. The trade-off between assignment overhead and workload balance is the challenging part that this project will investigate.
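One candidate for the semi-dynamic assignment described above is a greedy longest-processing-time-first heuristic. This is only an illustrative sketch: the per-tree work estimates (`tree_work`, e.g. node counts) are assumed to be available, and the real implementation would have to amortize the cost of recomputing the assignment as trees split and merge.

```cpp
#include <vector>
#include <algorithm>

// Greedily assign each tree to the currently least-loaded processor,
// considering trees in decreasing order of estimated work (LPT heuristic).
// Returns owner[i] = processor assigned to tree i.
std::vector<int> assign_trees(std::vector<long> tree_work, int num_procs) {
    std::vector<int> order(tree_work.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return tree_work[a] > tree_work[b]; });

    std::vector<long> load(num_procs, 0);
    std::vector<int> owner(tree_work.size(), -1);
    for (int t : order) {
        // Pick the least-loaded processor (ties go to the lowest index).
        int p = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        owner[t] = p;
        load[p] += tree_work[t];
    }
    return owner;
}
```

The tension the project investigates is visible here: the heuristic itself is O(n log n) per invocation, so running it on every consolidation may cost more than the imbalance it removes.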

Moreover, the bottleneck of the Fibonacci heap lies in memory access rather than computation. Another key performance issue is that the Fibonacci heap frequently splits and merges trees. To accommodate this access pattern, trees should be represented with pointers rather than arrays. Traversing tree nodes through pointers, however, results in a large number of random memory accesses, so improving the locality of nodes within the same tree remains a challenge. Since the final performance of the heap is likely to be limited by memory access, much effort should go into improving locality when implementing the high-performance heap.
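One locality idea we may try is to store nodes in a contiguous pool and link them by index instead of raw pointers, so traversals walk a compact array rather than chasing allocations scattered across the general-purpose heap. The `NodePool` below is a hypothetical sketch of this layout, not a committed design:

```cpp
#include <vector>

// Heap nodes live in one contiguous vector and refer to each other by
// index; -1 means "none". This keeps a tree's nodes close in memory.
struct PoolNode {
    int key;
    int first_child = -1;
    int next_sibling = -1;
};

struct NodePool {
    std::vector<PoolNode> nodes;

    int alloc(int key) {               // nodes are appended contiguously
        nodes.push_back(PoolNode{key});
        return (int)nodes.size() - 1;
    }
    void add_child(int parent, int child) {
        nodes[child].next_sibling = nodes[parent].first_child;
        nodes[parent].first_child = child;
    }
    int count_subtree(int root) const { // iterative index-based traversal
        int n = 0;
        std::vector<int> stack{root};
        while (!stack.empty()) {
            int v = stack.back(); stack.pop_back();
            ++n;
            for (int c = nodes[v].first_child; c != -1; c = nodes[c].next_sibling)
                stack.push_back(c);
        }
        return n;
    }
};
```

Indices also happen to be more convenient than pointers for per-processor partitioning, since a tree's nodes can be identified by index ranges.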

In terms of latency versus throughput, the intrinsic seriality of some heap operations, mainly delete-min, makes latency the more important metric. Due to the nature of delete-min, the system needs to handle, and should handle, only one request of that kind at a time. The situation for insert and decrease-key is quite different, since the benefit of running multiple such operations concurrently is obvious. Nevertheless, these operations modify the heap, and running several of them at the same time can cause data-integrity issues. Supporting multiple concurrent requests while preserving the integrity of the heap and keeping access latency low can be difficult.
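One way to support many concurrent inserts without sacrificing integrity is to relax when inserts become globally visible: each thread buffers its inserts locally, lock-free, and the buffers are only drained at a serial combine point such as delete-min. The sketch below illustrates the idea with a hypothetical `BufferedInserts` helper; the actual merge into the root list is elided.

```cpp
#include <vector>
#include <thread>

// Each thread owns one buffer, so insert() needs no synchronization as
// long as each thread only touches its own slot. drain_min() is meant to
// run at a serial point (e.g. while one thread performs delete-min).
struct BufferedInserts {
    std::vector<std::vector<int>> local; // one buffer per thread
    explicit BufferedInserts(int threads) : local(threads) {}

    void insert(int tid, int key) { local[tid].push_back(key); } // no lock

    // Drain all buffers and return the smallest buffered key, or -1 if
    // every buffer is empty. In a full heap this would splice the buffered
    // keys into the root list instead of discarding them.
    int drain_min() {
        int best = -1;
        for (auto& buf : local) {
            for (int k : buf)
                if (best == -1 || k < best) best = k;
            buf.clear();
        }
        return best;
    }
};
```

The cost of this relaxation is that a buffered key is invisible to delete-min until the next drain, which is exactly the semantics trade-off we plan to study.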

System Mapping Challenges

With a significant lead in energy efficiency, heterogeneous computing architectures have been widely deployed on a variety of devices. This means, however, that the microarchitectures of different CPU cores on the same chip can now differ, as can the latency and communication bandwidth among individual cores. Workload assignment should take this heterogeneity into consideration. Moreover, the cache hierarchy also becomes asymmetric on these heterogeneous systems, which changes how locality should be optimized. Introducing heterogeneous architectures into the project creates a more realistic testing environment and requires our implementation to be more flexible and adaptive.

Resources

References and Code Base

Several papers have already discussed leveraging parallelism to implement a high-performance Fibonacci heap. Bhattarai [1] designed and implemented a Fibonacci heap on a parallel system. Although this implementation lacks support for the decrease-key operation, which is actually the one that benefits the most from the Fibonacci heap's low theoretical complexity, it still provides an example of how to introduce parallelism into a Fibonacci heap. Driscoll et al. [2] developed another kind of relaxed heap that shares similar theoretical complexity with the Fibonacci heap but is adapted to parallel computation, which provides insights into how to build an efficient parallel Fibonacci heap.

Hardware

For the first part of our work, we plan to use the multi-core platform provided by the GHC cluster. The GHC machines are equipped with multi-core processors, which are sufficient for our initial parallel implementation and evaluation. For the implementation and verification of heterogeneous strategies, since the GHC cluster only provides homogeneous CPU architectures, we will use our own machines with Apple M-series chips and Intel 13th-generation Core processors.

Goals and Deliverables

The goals we plan to achieve are:

  • Implement correct baseline versions for comparison, including a sequential Fibonacci heap and a coarse-grained parallel Fibonacci heap, and possibly a simpler priority queue such as a binary heap. This will give us a solid practical baseline for evaluating the speedup of our more advanced parallel Fibonacci heap implementations, rather than relying only on asymptotic analysis.
  • Implement a fine-grained parallel Fibonacci heap supporting the standard operations needed by our workload, especially extract-min and decrease-key. Prior work has already shown that simpler operations such as insert and union benefit from finer-grained locking (for 100K operations, the fine-grained version was about 2x faster and achieved 5x CPU utilization), so we hope to achieve at least comparable speedup for the version with the additional methods we implement. We plan to evaluate this design on benchmark workloads with different mixes of operations.
  • Parallelize Fibonacci heap operations by allowing concurrent work across different trees. Because these operations still require synchronization and communication between trees, we aim for meaningful but likely sublinear speedup rather than expecting near-linear scaling. We also plan to allow some relaxation of semantics in parallel execution, and study the tradeoff between parallel performance and strict data-structure semantics using the same benchmark workloads.
  • Extend the implementation to heterogeneous CPUs, focusing on dynamic load-balancing strategies such as work stealing, and evaluate performance on a machine with both performance cores and efficiency cores. Our goal is to understand whether hardware-aware scheduling can improve throughput beyond what we achieve in a homogeneous multicore setting.

If this goes well, we hope to explore:

  • More advanced load-balancing strategies for heterogeneous CPUs. We may investigate whether operations with higher computational cost should be assigned to performance cores while lighter-weight tasks go to efficiency cores, and then evaluate the impact of these strategies experimentally.
  • Workload-specific variations of Fibonacci heap behavior. Batches with many consecutive delete-min operations may behave quite differently from batches dominated by insert and decrease-key. If time permits, we would like to characterize how different operation mixes affect scalability and whether different implementation choices are better suited to different workloads.

On the other hand, if progress is slower than expected, we will focus on the sequential baseline and the parallel implementation in a homogeneous CPU environment, while leaving heterogeneous CPU support as future work.

Platform Choice

Computing platform:

  • GHC: Intel Core i7-9700 with 8 homogeneous cores.
  • Heterogeneous x86: Intel Core i9-13900H with 6 performance cores and 8 efficiency cores.
  • Heterogeneous ARM: Apple M4 with 4 performance cores and 6 efficiency cores.

Since NUMA systems are not used in our project, communication among processors is relatively cheap. However, because multiple processors cooperate to maintain a single Fibonacci heap, a large portion of their memory accesses target shared data, which could become a bottleneck. As a result, we choose C++ with OpenMP to implement the heap, which we expect to perform well on such shared-memory machines.
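As a small illustration of the OpenMP shared-memory style we intend to use, the sketch below scans a root-key list for the global minimum with a `reduction` clause. If the compiler is invoked without OpenMP support, the pragma is simply ignored and the loop runs sequentially, which keeps the sketch portable; the real heap would of course reduce over root nodes rather than a plain key array.

```cpp
#include <vector>
#include <algorithm>
#include <climits>

// Parallel scan of the root list for the global minimum. The reduction
// clause gives each thread a private `best` and combines them at the end.
int parallel_min_root(const std::vector<int>& root_keys) {
    int best = INT_MAX;
    #pragma omp parallel for reduction(min : best)
    for (long i = 0; i < (long)root_keys.size(); ++i)
        best = std::min(best, root_keys[i]);
    return best;
}
```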

Proposed Schedule

Week Plan

Week 1 (March 30 - April 5)
  • Implement baseline Fibonacci heaps, including sequential and coarse-lock versions.
  • Design benchmark workloads, testing methodology, and evaluation metrics.
  • Run initial baseline experiments.

Week 2 (April 6 - April 12)
  • Explore different methods to relax semantics for parallel operations.
  • Implement candidate parallel strategies for core operations.
  • Validate correctness against the sequential baseline and analyze design tradeoffs.

Week 3 (April 13 - April 19)
  • Run experiments across different workloads and thread counts, and compare the performance of the different versions.
  • Identify bottlenecks and prepare the milestone report.

Week 4 (April 20 - April 26)
  • Extend the implementation to heterogeneous CPUs, starting with dynamic load balancing (e.g., task stealing) as a baseline, then explore strategies that are more efficient for our implementation.
  • Evaluate performance, utilization, and scalability.

Week 5 (April 27 - April 30)
  • Finalize implementation and experiments. Summarize findings and prepare the final report and poster.
1. Bhattarai D. Towards Scalable Parallel Fibonacci Heap Implementation. 2018.

2. Driscoll JR, Gabow HN, Shrairman R, Tarjan RE. Relaxed heaps: An alternative to Fibonacci heaps with applications to parallel computation. Communications of the ACM. 1988;31(11):1343-54.