
SC3 RISC CPU core
introduction
The project name SC3 is derived from the following properties:
- SystemC - the core is designed and tested using SystemC
- super compact - the core implements efficiently in small FPGAs
- scalable - the core can be extended or scaled to fit the user application
The goals of the project are:
- prove the SystemC design flow
- evaluate RISC computer concepts
- create a lightweight processor
Regarding the SystemC design flow, I feel that tool support is lagging behind. I have not seen any reports of people using SystemC. So the question is: Is it really possible to create FPGA designs in SystemC today? How will design entry, simulation, synthesis and implementation work? This project should be a good vehicle to try and find out.
Regarding RISC computer concepts, I am not fully satisfied with the CPUs that are out there. OpenRisc and MicroBlaze are wonderful but large. PicoBlaze is a little too limited. I want an instruction set that is human-readable, and I would like to try a few concepts and circuit ideas. I am not sure whether my results will really be better. But at least I'd like to fully understand the implications of my concepts.
Finally, I have good use for a vendor-independent and lightweight processor. I prefer a design scheme that separates asynchronous and isochronous data flows. A small CPU can usually handle the asynchronous commands, e.g. the USB protocol. A dedicated data path will handle the isochronous data without CPU intervention, e.g. audio data being streamed. So I am targeting a size of about 150 4-LUTs, 100 FFs, and maybe 3500 gates.
Regarding available CPUs, I have the following thoughts:
- MicroBlaze or Nios are simply overkill for the targeted set of control applications. While I do like the features and tools, I just don't need that much. I would like to see how several instances of a lightweight CPU scale to a larger system, and how this compares to a single CPU with more computing power.
- PicoBlaze is nice, but it simply does not allow larger constant arrays, such as the space required to store USB device descriptors. Neither the scratch-pad RAM nor the external ports are suitable to fix this. In addition, there is no indirect addressing mode to walk through data arrays, which is a major flaw.
The following tools are used:
- SystemC 2.1 - the design environment
- sc2v script - SystemC to Verilog conversion for synthesis
- Verilator simulator - Verilog to SystemC conversion for Verilog validation
- GTKWave - waveform viewer
architectural thoughts
overall structure
- Von Neumann architecture in order to simplify pipeline stall logic: The memory arbiter is also used to control pipeline operation flow.
- Single address instructions with implicit accumulator to reduce instruction size
- 12-bit address bus for 4k address space. This should be enough for simple control applications, e.g. handling the USB protocol.
- 8-bit data bus for byte-oriented memory and compact data paths
- the width of internal registers and ALU may be scaled to more than 8 bits, without implications for the instruction set
- Wishbone bus interface for easy access to peripherals. 32-bit wide?
- several CPUs may build a distributed system and communicate using synchronized channels
- these channels may be used instead of interrupts
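The bus widths above can be illustrated with a small sketch in plain C++ (not SystemC, so that it compiles without the library). Only the 12-bit address width, the 8-bit data width and the idea of a scalable register width come from the list; the names `Memory` and `Accumulator`, the address wrap-around by masking, and the 8 to 32 bit limit are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// 12 address lines give 2^12 = 4096 byte locations.
constexpr unsigned kAddrBits = 12;
constexpr std::size_t kMemSize = std::size_t{1} << kAddrBits;
constexpr std::uint16_t kAddrMask = static_cast<std::uint16_t>(kMemSize - 1); // 0x0FFF

// Byte-oriented memory (hypothetical model). Masking the address mimics a
// bus that has only 12 address lines; this wrap-around is an assumption.
struct Memory {
    std::array<std::uint8_t, kMemSize> bytes{};

    std::uint8_t read(std::uint16_t addr) const { return bytes[addr & kAddrMask]; }
    void write(std::uint16_t addr, std::uint8_t value) { bytes[addr & kAddrMask] = value; }
};

// Accumulator whose width is a template parameter (hypothetical). The
// register and ALU width changes, the instruction encoding does not.
template <unsigned Width>
struct Accumulator {
    static_assert(Width >= 8 && Width <= 32, "this sketch supports 8 to 32 bits");
    static constexpr std::uint32_t kMask =
        (Width == 32) ? 0xFFFFFFFFu : ((1u << Width) - 1u);

    std::uint32_t value = 0;

    // Add with wrap-around at the configured width.
    void add(std::uint32_t operand) { value = (value + operand) & kMask; }
};
```

With `Accumulator<8>` an addition wraps at 256, while `Accumulator<16>` keeps the full sum; the code that issues the `add` is the same in both cases.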
pipeline
- Fetch: Main task of this unit is to fetch instruction words from program memory.
- read instruction words from synchronous program memory
- implement program counter and stack
- deliver instruction words to decode
- Decode: Main task of this stage is to collect all required input data.
- receive instruction words from fetch
- interface to data memory and register file
- deliver decoded instruction to execute stage
- Execute: This stage computes and retires the results
- receive decoded instruction from decode stage
- interface to ALU
- retire results to data memory or register file
- it might turn out that a separate retire stage is needed
The pipeline stalls in the following cases:
- program memory demands wait state
- data memory read/write collision
- data memory demands wait state
- ALU demands wait state
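The four wait and collision conditions listed above can be read as stall sources for the pipeline. The following plain C++ sketch shows one possible way to combine them; it is not the SC3 implementation. All names are hypothetical, and freezing the whole pipeline on any stall is a simplifying assumption (a real design might hold only the affected stages).

```cpp
#include <cstdint>
#include <optional>

// One flag per stall source from the list (field names are hypothetical).
struct StallInputs {
    bool prog_mem_wait = false;      // program memory demands wait state
    bool data_mem_collision = false; // data memory read/write collision
    bool data_mem_wait = false;      // data memory demands wait state
    bool alu_wait = false;           // ALU demands wait state
};

// The pipeline must hold if any source requests it.
inline bool stall(const StallInputs& s) {
    return s.prog_mem_wait || s.data_mem_collision || s.data_mem_wait || s.alu_wait;
}

// Minimal model of the registers between fetch, decode and execute.
struct Pipeline {
    std::optional<std::uint8_t> decode_reg;  // instruction word held by decode
    std::optional<std::uint8_t> execute_reg; // instruction held by execute
    unsigned retired = 0;                    // number of completed instructions

    // One clock cycle. Assumption: a stall freezes every stage at once.
    void clock(std::optional<std::uint8_t> fetched, const StallInputs& s) {
        if (stall(s)) {
            return;                      // hold all pipeline registers
        }
        if (execute_reg) {
            ++retired;                   // execute stage finishes its instruction
        }
        execute_reg = decode_reg;        // decode hands over to execute
        decode_reg = fetched;            // fetch hands over to decode
    }
};
```

In this model a stalled cycle leaves `decode_reg`, `execute_reg` and `retired` unchanged, so the fetch unit has to present the same instruction word again on the next cycle.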
instruction set
| Encoding | Mnemonic | Description |
|---|---|---|
| arithmetic group | | |
| 00 000 r3 | ADD | Add register value to accumulator |
| 00 001 r3 | SUB | Subtract register value from accumulator |
| 00 010 r3 | SHL | Shift accumulator left by register value |
| 00 011 r3 | SHR | Shift accumulator right by register value |
| 00 100 r3 | AND | AND register value with accumulator |
| 00 101 r3 | OR | OR register value with accumulator |
| 00 110 r3 | XOR | XOR register value with accumulator |
| 00 111 x3 | --- | Idea: user-defined instruction |
| load/store group | | |
| 01 0 00 r3 | LD | Load accumulator with register contents |
| 01 0 01 r3 | ST | Store accumulator to register |
| 01 0 10 r3 | LDI | Load accumulator with register indirect |
| 01 0 11 r3 | STI | Store accumulator to register indirect |
| jump group | | |
| 01 1 00 r3 | CALL | Call subroutine at register contents |
| 01 1 01 x3 | RET | Return from subroutine |
| 01 1 10 r3 | JMP | Jump to register contents |
| 01 1 11 c3 | IF | If condition not met, skip next instruction |
| reserved group | | |
| 10 x3 x3 | --- | Idea: synchronized exchange with external device or CPU |
| immediate group | | |
| 11 i6 | IMM | Load 6-bit immediate to accumulator; repeat to load further bits |
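As a cross-check of the encoding table, here is a minimal decoder sketch in plain C++. It assumes that the bit fields are written most significant bit first, so that the leftmost two bits of each row are bits 7..6 of an 8-bit instruction word. The `Decoded` struct and the `decode` function are hypothetical names, and the sketch only splits the word into mnemonic and operand field; it does not model what the instructions do.

```cpp
#include <cstdint>
#include <string>

// Result of decoding one 8-bit instruction word (hypothetical type).
struct Decoded {
    std::string mnemonic;  // e.g. "ADD", or "---" for slots without a defined instruction
    std::uint8_t operand;  // 3-bit r3/c3/x3 field, or 6-bit field for groups 10 and 11
};

// Assumption: fields in the table are listed MSB first (bits 7..0).
inline Decoded decode(std::uint8_t op) {
    static const char* const kArith[8] =
        {"ADD", "SUB", "SHL", "SHR", "AND", "OR", "XOR", "---"};
    static const char* const kLoadStore[4] = {"LD", "ST", "LDI", "STI"};
    static const char* const kJump[4] = {"CALL", "RET", "JMP", "IF"};

    const std::uint8_t group = static_cast<std::uint8_t>(op >> 6);  // bits 7..6
    const std::uint8_t low3 = static_cast<std::uint8_t>(op & 0x07); // bits 2..0
    const std::uint8_t low6 = static_cast<std::uint8_t>(op & 0x3F); // bits 5..0

    switch (group) {
    case 0:  // arithmetic group: bits 5..3 select the operation
        return {kArith[(op >> 3) & 0x07], low3};
    case 1: {  // bit 5 selects load/store (0) or jump (1), bits 4..3 the operation
        const bool is_jump = ((op >> 5) & 0x01) != 0;
        const std::uint8_t sub = static_cast<std::uint8_t>((op >> 3) & 0x03);
        return {is_jump ? kJump[sub] : kLoadStore[sub], low3};
    }
    case 2:  // reserved group, no instruction defined yet
        return {"---", low6};
    default:  // immediate group: 6-bit immediate
        return {"IMM", low6};
    }
}
```

For example, the word `0x55` (binary `01 0 10 101`) decodes to `LDI` with register field 5, and `0xEA` (binary `11 101010`) decodes to `IMM` with immediate 42.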
... to be continued...