In heterogeneous computing, efficient parallelism can be obtained if every device runs the same task on a different portion of the data set. This requires a scheduler that assigns data chunks to compute units in proportion to their throughputs. For FPGA-CPU heterogeneous devices, the scheduler must accurately evaluate the differing performance behaviour of the compute devices in order to deliver the best possible overall throughput. In this article, we propose a scheduler that first detects, with negligible overhead, the highest throughput each device can obtain for a specific application, and then partitions the data set accordingly for improved performance. To demonstrate the efficiency of this method, we choose a Zynq UltraScale+ ZCU102 device as the hardware target and parallelise four applications, showing that the developed scheduler can provide up to 94.06% of the throughput achievable under ideal conditions, with comparable power and energy consumption.
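The proportional partitioning idea described in the abstract can be illustrated with a minimal sketch. This is not the scheduler from the paper: the function name `partition_by_throughput`, the device labels, and the throughput figures are hypothetical, and the sketch assumes each device's throughput has already been measured and stays constant for the whole run.

```python
def partition_by_throughput(total_items, throughputs):
    """Split total_items into integer chunks proportional to device throughput.

    throughputs maps a (hypothetical) device name to its measured throughput,
    for example in items per second. Returns a dict of device name -> chunk size.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if any(tp < 0 for tp in throughputs.values()):
        raise ValueError("throughputs must be non-negative")
    total_tp = sum(throughputs.values())
    if total_tp <= 0:
        raise ValueError("at least one device needs a positive throughput")

    # Ideal (fractional) share of the data set for each device.
    ideal = {dev: total_items * tp / total_tp for dev, tp in throughputs.items()}

    # Round down first, then hand the leftover items to the devices
    # with the largest fractional remainders so the chunks sum exactly.
    chunks = {dev: int(share) for dev, share in ideal.items()}
    leftover = total_items - sum(chunks.values())
    by_remainder = sorted(ideal, key=lambda dev: ideal[dev] - chunks[dev], reverse=True)
    for dev in by_remainder[:leftover]:
        chunks[dev] += 1
    return chunks


# Illustrative numbers only: an FPGA assumed to be three times faster than the CPU.
print(partition_by_throughput(1000, {"cpu": 1.0, "fpga": 3.0}))
```

Under the constant-throughput assumption, a proportional split lets all devices finish at roughly the same time, which is what allows the combined throughput to approach the sum of the individual device throughputs.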
|Title of host publication||28th International Conference on Field Programmable Logic and Applications (FPL)|
|Number of pages||5|
|Publication status||Published - Aug 2018|
|Event||28th International Conference on Field Programmable Logic and Applications - Dublin, Ireland|
|Duration||27 Aug 2018 → 31 Aug 2018|
|Conference number||28|
|Name||Conference on Field Programmable Logic and Applications|
Bibliographical note
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.