Paper Review: The Load-Slice Core Microarchitecture
Carlson, T. E., Heirman, W., Allam, O., Kaxiras, S., & Eeckhout, L. (2015). The Load Slice Core Microarchitecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 272-284. ACM/IEEE.
Images in this article are taken from the paper - all credits to the authors.
An Oversimplified Abstract
Adding a second in-order pipeline (the B-IQ in the image below) that handles memory-related instructions to an in-order processor makes the new architecture faster than a naive in-order processor (of course), but also more power- and area-efficient than a traditional out-of-order processor.
(Image taken from the research paper)
- Current architectures are out-of-order and superscalar, with a heavy focus on ILP extraction
- Useful side effects besides ILP extraction: parallel cache and memory operations are possible
- Parallel memory operations (Memory Hierarchy Parallelism [MHP]) are now more important because of memory wall
- However, these architectures are complex and power-hungry
- For multicore processors in power + energy constrained environments, energy efficiency is replacing single-thread performance as the main concern
Suggests a new processor architecture to:
- extract MHP (memory hierarchy parallelism)
- while maximizing energy efficiency
- through extending an in-order + stall-on-use pipeline with
- a second in-order pipeline that bypasses stalled instructions in the first pipeline
- this second pipeline handles memory accesses and address-generating instructions
- this works using a novel iterative algorithm that identifies these instructions via two extra hardware components
This is called the Load-Slice Core.
(Image taken from the research paper)
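To make the bypass idea concrete, here is a minimal Python sketch (my own simplification, not the authors' hardware) of two in-order queues issuing side by side. The key behavior: a stalled main-queue head does not stop the bypass queue from issuing address-generating and memory instructions.

```python
from collections import deque

def issue_cycle(main_q: deque, bypass_q: deque, pending: set) -> list:
    """Issue at most one instruction from the head of each in-order queue.

    `pending` holds registers whose values are still in flight; an
    instruction may issue only if none of its sources are pending
    (stall-on-use). A stalled main queue does not block the bypass queue.
    """
    issued = []
    for q in (main_q, bypass_q):
        if q and not (q[0]["sources"] & pending):
            issued.append(q.popleft())
        # else: this queue stalls this cycle; the other may still issue
    return issued

# Example: the main-queue head needs r1, which a missing load is still
# producing, so only the bypass queue makes progress this cycle.
main_q = deque([{"dest": "r5", "sources": {"r1"}}])
bypass_q = deque([{"dest": "rax", "sources": {"rdx"}}])
print(issue_cycle(main_q, bypass_q, pending={"r1"}))
# -> [{'dest': 'rax', 'sources': {'rdx'}}]
```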
Results from experiments
- Increased performance over an in-order processor by 53%. Downside: 15% area increase and 22% power increase. Overall, energy efficiency (MIPS/Watt) improves by 43%.
- Energy efficiency (MIPS/Watt) is more than 4.7x that of out-of-order processors.
- [My note] Actual raw performance compared to out-of-order will still be lower, since ILP is not fully extracted here.
- If power and area are restricted (45 W and 350 mm^2, per the evaluation below): 53% performance improvement over in-order, 95% over out-of-order.
This is a possible direction for future processors to take.
Background concepts
- Memory wall: the widening gap between CPU and memory access latencies. Main memory is much slower than the CPU.
- Superscalar computers
- Out-of-order processors
- Address-Generating Instructions
- Tomasulo’s Algorithm
- Register renaming
New hardware structures
- IST: Instruction Slice Table (a small cache): holds the addresses (PCs) of instructions belonging to a backward slice (instructions we intend to put into the bypass queue)
- RDT: Register Dependency Table: which instruction (pointer) last wrote to this register? (contains mappings for all registers)
- Register Renaming hardware: supports the speculative execution of instructions in the bypass queue
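To keep these straight, here is a rough dict-backed sketch of the two tables (my own simplification; the paper's IST is a small cache indexed by instruction address):

```python
class RegisterDependencyTable:
    """RDT: maps each architectural register to the PC of the
    instruction that last wrote it."""
    def __init__(self):
        self.last_writer = {}              # register name -> producer PC

    def record_write(self, reg, pc):
        self.last_writer[reg] = pc

    def producer_of(self, reg):
        return self.last_writer.get(reg)   # None if never written


class InstructionSliceTable:
    """IST: the set of PCs known to belong to an address-generating
    backward slice; on later executions these PCs are steered to the
    bypass queue."""
    def __init__(self):
        self.pcs = set()

    def add(self, pc):
        self.pcs.add(pc)

    def __contains__(self, pc):
        return pc in self.pcs
```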
Iterative Backwards Dependency Analysis
Method discussed in the paper for iteratively discovering which instructions (usually a short sequence) generate the addresses for load/store instructions. We want to issue those instructions to the bypass queue so that they can execute independently of a long load/store stalling the main queue.
If a section of code executes multiple times, the hardware iteratively optimizes it, with the end goal of letting slices of code that end in a load/store execute from the bypass queue while the main queue is blocked.
- When we have a load/store, e.g. `mul (r9 + rax*8), xmm1`: the first time this instruction runs (it will be in the bypass queue), we notice that an earlier instruction, e.g. `add rdx, rax`, provides part of its address (specifically `rax`). We insert that `add` instruction into the IST.
- The 2nd time this section of code runs: we put `add rdx, rax` and `mul (r9 + rax*8), xmm1` into the bypass queue. Say one more instruction above them, `mul r8, rax`, feeds the `add` - so we also note it in the IST.
- The 3rd time this section of code runs, `mul r8, rax`, `add rdx, rax`, and `mul (r9 + rax*8), xmm1` all go into the bypass queue, so the whole slice can run in parallel with some long independent load/store that we assume is blocking the main queue in this example.
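Putting it together, here is a toy Python version of this training loop, reusing the IST/RDT classes sketched above. The instruction encoding is my own invention (register read/write sets as I read the example, where both `mul r8, rax` and `add rdx, rax` write `rax`); each pass learns one more level of the backward slice:

```python
trace = [
    {"pc": 0, "reads": {"r8", "rax"},  "writes": {"rax"}, "mem": False},  # mul r8, rax
    {"pc": 1, "reads": {"rdx", "rax"}, "writes": {"rax"}, "mem": False},  # add rdx, rax
    {"pc": 2, "reads": {"r9", "rax"},  "writes": set(),   "mem": True},   # mul (r9+rax*8), xmm1
]

ist = InstructionSliceTable()

def run_once(trace, ist):
    rdt = RegisterDependencyTable()
    bypass = []
    for insn in trace:
        if insn["mem"] or insn["pc"] in ist:
            bypass.append(insn["pc"])
            # learn one more level of the backward slice: the producers
            # of this instruction's source registers join the IST
            for reg in insn["reads"]:
                producer = rdt.producer_of(reg)
                if producer is not None:
                    ist.add(producer)
        for reg in insn["writes"]:
            rdt.record_write(reg, insn["pc"])
    return bypass

for i in (1, 2, 3):
    print(f"pass {i}: PCs sent to the bypass queue = {run_once(trace, ist)}")
# pass 1: [2]        (and the add at PC 1 is learned into the IST)
# pass 2: [1, 2]     (and the mul at PC 0 is learned)
# pass 3: [0, 1, 2]  (the full address-generating slice bypasses)
```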
Store instruction handling
If a load instruction overlaps with an earlier store instruction, there is a possible memory dependency.
Resolution: a store instruction is split into two parts: (1) an address-calculation part and (2) a collect-data-and-update-memory part. Part (1) goes into the bypass queue, part (2) stays in the main queue. Since the bypass queue is in-order, a store with an unresolved address automatically blocks later loads (see the sketch below).
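A small sketch of what the split might look like; the field names and `crack_store` helper are hypothetical, for illustration only:

```python
def crack_store(store):
    address_uop = {
        "pc": store["pc"],
        "reads": store["addr_regs"],  # registers that form the address
        "queue": "bypass",            # in-order bypass queue: a store with an
    }                                 # unresolved address blocks later loads
    data_uop = {
        "pc": store["pc"],
        "reads": store["data_regs"],  # the value being stored
        "queue": "main",              # memory update commits from the main queue
    }
    return address_uop, data_uop

print(crack_store({"pc": 7, "addr_regs": {"r9", "rax"}, "data_regs": {"xmm1"}}))
```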
- Sniper multi-core simulator
- Cycle-level simulation of in-order and out-of-order core models as baselines
- They extended the simulator to add the Load-Slice Core microarchitecture
- All cores are 2-wide superscalar with the same execution units and caches, and have hardware prefetchers (which fetch data into the cache ahead of demand)
- They increase the branch misprediction penalty for the Load-Slice and out-of-order cores since their pipelines are longer (extra rename/dispatch stages)
- SPEC CPU2006 benchmark, SimPoint methodology
- NAS Parallel Benchmarks
- SPEC OMP2001 suites
- Area and power estimates from CACTI
In terms of IPC, in-order < Load-Slice < OoO
Load-Slice outperforms both in area-normalized performance (MIPS/mm^2) and energy efficiency (MIPS/W)
They also compared the three under a fixed power budget of 45 W and a maximum area of 350 mm^2: if we fit as many cores as possible within those limits, which design is best? A many-core Load-Slice configuration outperforms both the in-order and out-of-order clusters.
- Run-ahead execution to discover + prefetch independent memory accesses
- Slice processors - extract independent program slices that can be run separately from current blocked instruction flow (these slices may not be contiguous!)
- Software-only techniques that use SMT hardware contexts to execute important slices early
- Helper threads
- Speculative pre-computation
- Braid microarchitecture
- Speculative slice execution
- Flea-flicker multi-pass pipelining
Further reading needed
- stall-on-use [contextually from paper: the pipeline stalls only when an instruction tries to use the result of an outstanding load, not at the load itself]. This contrasts with a stall-on-miss processor, which stalls the pipeline the moment a long-latency load misses in the cache (see the toy comparison after this list).
- CAM (vs RAM, FIFO)
- Backwards slices
- front-end of pipeline vs back-end of processor?
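Since stall-on-miss vs. stall-on-use took me a moment, here is a runnable toy (my own construction, not from the paper) that counts cycles under both policies:

```python
MISS_LATENCY = 100  # cycles for a load that misses to return

# (dest, sources, is_missing_load): one independent add sits between
# the missing load and the first use of its result
trace = [
    ("r1", set(),        True),   # load r1 <- [addr], misses in cache
    ("r2", {"r3", "r4"}, False),  # add r2 <- r3 + r4 (independent)
    ("r5", {"r1", "r2"}, False),  # add r5 <- r1 + r2 (uses the load)
]

def run(trace, stall_on_miss):
    cycle, ready_at = 0, {}
    for dest, sources, is_miss in trace:
        # stall-on-use: wait only if a source operand is still in flight
        cycle = max([cycle] + [ready_at.get(r, 0) for r in sources])
        ready_at[dest] = cycle + (MISS_LATENCY if is_miss else 1)
        if stall_on_miss and is_miss:
            cycle = ready_at[dest]  # freeze the pipeline at the miss itself
        else:
            cycle += 1              # keep issuing under the miss shadow
    return cycle

print("stall-on-miss total cycles:", run(trace, True))   # 102
print("stall-on-use  total cycles:", run(trace, False))  # 101
# The gap widens with more independent work: stall-on-use hides it
# entirely under the miss latency.
```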