6.6.8.6.1 FPU Pipeline Operation

The CPU decodes all coprocessor instructions during the F-stage. The source and destination coprocessor registers are extracted from the opcode and supplied to the coprocessor, together with the corresponding instruction-select and control signals, so that no instruction decode is necessary within the coprocessor.

The FPU pipeline consists of Read (RD), Execute (X[n]) and Write-Back (WB) stages. The Read and Write-Back stages each consist of a single register and are common to all instructions. The Execute phase consists of as many stages as are required to execute the specific instruction (i.e., X[0], X[1], ..., X[n]), with a minimum of one stage, X[0].

One instruction may be issued into the RD-stage, where it will remain for one cycle (hazards aside) until dispatched into the X[0] stage. The number of cycles each instruction remains within the execute phase varies depending upon the operation. In order to avoid stalling the pipeline for the duration of any long instruction, up to four instructions may be dispatched into X[0] and executed concurrently (structural hazards aside).

Instructions retire in the same order in which they are issued. Because multiple instructions with varying execution times may be executing concurrently, the pipeline Instruction/Hazard Tracker logic is responsible for ensuring that in-order retirement is maintained.
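The effect of in-order retirement can be illustrated with a small timing model. The following is a minimal sketch only, not a description of the actual tracker logic. It assumes that one instruction is dispatched per cycle, that an instruction completes `latency` cycles after dispatch, and that at most one instruction retires per cycle. The function name and the latency values are hypothetical and chosen purely for illustration.

```python
from typing import List


def retire_cycles(latencies: List[int]) -> List[int]:
    """Return the retire cycle of each instruction under in-order retirement.

    Assumptions (not taken from the hardware specification):
      * instruction i is dispatched in cycle i (one dispatch per cycle);
      * it finishes executing `latencies[i]` cycles after dispatch;
      * at most one instruction retires per cycle, in issue order.
    """
    retire: List[int] = []
    for dispatch_cycle, latency in enumerate(latencies):
        done = dispatch_cycle + latency
        if retire:
            # A younger instruction may not retire before, or in the same
            # cycle as, the instruction issued ahead of it.
            done = max(done, retire[-1] + 1)
        retire.append(done)
    return retire


if __name__ == "__main__":
    # A 4-cycle operation followed by two shorter operations (illustrative).
    print(retire_cycles([4, 1, 2]))
```

In this model the two short operations finish executing before the long one, but their retirement is held back until the long operation has retired, so the results still appear in issue order.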

All instructions with an execution latency of four cycles or less are implemented such that the execution stages are fully pipelined. Consequently, assuming no data dependencies (hazards) arise, these instructions can be repeatedly issued at a rate of one per cycle (and receive their results at a rate of one per cycle after an initial execution latency), without incurring a structural hazard stall.
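As a rough illustration of the fully pipelined case, the total time for a run of independent instructions can be estimated as the initial latency plus one cycle for each further instruction. This is a minimal sketch under the assumption that only execute-phase cycles are counted (RD and WB are ignored) and that no hazards occur. The function name is hypothetical.

```python
def pipelined_cycles(n_instructions: int, latency: int) -> int:
    """Cycles from first dispatch to last result for a fully pipelined unit.

    Assumes one instruction is dispatched per cycle with no data or
    structural hazards, so results emerge one per cycle after the
    initial latency.
    """
    if n_instructions <= 0:
        return 0
    return latency + (n_instructions - 1)


if __name__ == "__main__":
    # Eight independent operations, each with a 4-cycle execution latency.
    print(pipelined_cycles(8, 4))
```

For eight independent four-cycle operations this gives 4 + 7 = 11 cycles, compared with 32 cycles if each operation had to complete before the next could start.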

For instructions where the execution latency exceeds four cycles (FDIV and FSQRT), the FPU pipeline will accept the instruction and then stall subsequent instructions (due to a structural hazard) until the required execution resource becomes available.

FDIV: Floating-point divide is implemented as an iterative operation, so further input data cannot be pipelined until all iterations have completed and the result has been passed on to the adjustment stage within the Functional Block. For example, should the CPU issue two sequential FDIV instructions, the second FDIV instruction will stall in the RD-stage until the first FDIV enters its final execution cycle, at which point the second FDIV may be dispatched into the execute stage to commence execution.
FSQRT: Floating-point square root requires 10 (Single Precision) or 13 (Double Precision) cycles to execute. The hazard tracker can handle up to four issued instructions, so an FSQRT followed by up to three sequential FPU instructions (including further FSQRT instructions) may be executing at any one time. The CPU may issue one more instruction, but it will remain in the RD-stage until the oldest FSQRT instruction underway enters the WB-stage, six (Single Precision) or nine (Double Precision) cycles later, and subsequently retires. At this point, one slot within the hazard tracker becomes available, and the pending FPU instruction is committed for execution. The next FSQRT instruction retires in the following cycle, opening another hazard tracker slot for another issued FSQRT instruction, and so forth, until the hazard tracker is full again and the pipeline must again wait six or nine cycles for the oldest FSQRT in the new block to retire. For FSQRT alone, the best-case repeat rate is therefore one per cycle for the first four FSQRT instructions issued, with the next block of four FSQRT instructions issued after a further six (Single Precision) or nine (Double Precision) cycles have passed. This gives an average execution time of (4+6)/4 = 2.5 cycles/instruction (Single Precision) or (4+9)/4 = 3.25 cycles/instruction (Double Precision).
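The FSQRT averages quoted above can be reproduced with a simple model of the four-slot hazard tracker. This is a minimal sketch, not the actual tracker implementation. It assumes that at most one instruction is issued per cycle, that each FSQRT holds a tracker slot for exactly its execution latency, and that a slot can be reused in the cycle it is released. The function names are hypothetical.

```python
from typing import List


def fsqrt_issue_cycles(n: int, latency: int, slots: int = 4) -> List[int]:
    """Cycle in which each of n back-to-back FSQRT instructions is issued.

    Assumptions (illustrative, not from the hardware specification):
      * at most one instruction is issued per cycle;
      * an instruction occupies a tracker slot for `latency` cycles;
      * the slot is reusable in the cycle it is released.
    """
    issue: List[int] = []
    for i in range(n):
        earliest = issue[-1] + 1 if issue else 0
        if i >= slots:
            # Wait for the instruction issued `slots` places earlier to
            # release its tracker slot.
            earliest = max(earliest, issue[i - slots] + latency)
        issue.append(earliest)
    return issue


def average_cycles_per_fsqrt(latency: int, slots: int = 4, blocks: int = 100) -> float:
    """Steady-state average issue interval over `blocks` full tracker blocks."""
    n = slots * blocks
    issue = fsqrt_issue_cycles(n + 1, latency, slots)
    return (issue[-1] - issue[0]) / n


if __name__ == "__main__":
    print(fsqrt_issue_cycles(8, 10))      # Single Precision issue pattern
    print(average_cycles_per_fsqrt(10))   # Single Precision average
    print(average_cycles_per_fsqrt(13))   # Double Precision average
```

Under these assumptions the model issues four FSQRT instructions in consecutive cycles, then waits until the oldest releases its slot, which matches the (4+6)/4 and (4+9)/4 figures given above.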