Oscilloscope add-on for STM32F7 Discovery - questions

DrMario · Post by **DrMario** » Wed Jul 20, 2016 9:32 am

I am designing my own decent oscilloscope add-on for the STMicroelectronics STM32F7 Discovery. I got a bit frustrated by the excessive markups even on used oscilloscopes, and by the lack of detail in open source oscilloscopes (which usually turn out to be effectively closed-source - schematic diagrams for most of them are scarce). Some open source oscilloscopes also suck terribly, so I have to roll my own, and that brings me back here after a long hiatus.

I am wondering whether I would gain or lose anything by using superscalar execution mode on the xCore-200 processor, as oscilloscope captures tend to be computationally intensive. The technical document confused me a bit: in single issue mode, 32-bit operands are allowed, while in superscalar mode, only two 16-bit operands are allowed per clock cycle. I thought two 32-bit operands would be allowed, as is the case with the Cortex M7 processor (which is also internally an in-order superscalar processor, i.e. a watered-down Cortex A8). Finally, should I watch out for instruction pairing dependency hazards if I go down this route?

Also, once the real work starts, I will update the title to reflect the progress.

peter · Post by **peter** » Wed Jul 20, 2016 2:04 pm

Hi DrMario,

The xCORE-200 supports two execution modes:

Single issue: effectively exactly the same as XS1
Dual issue: always executes 32 bits of instruction, whether that is two 16-bit instructions or a single 32-bit instruction. The only possible downside is that the code size will be larger, because if there is only one useful instruction to execute (and it could be encoded in 16 bits) it will still require 32 bits of memory. In practice, that means the 16-bit instruction gets paired with a no-op.
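
To make the code size trade-off above concrete, here is a rough Python sketch (a toy model, not the real encoder). It assumes each instruction is either 16 or 32 bits, and it uses a hypothetical `can_pair` flag in place of the actual pairing rules, which it does not model.

```python
def single_issue_bytes(sizes_bits):
    """Code size when each instruction occupies only its own encoding."""
    return sum(sizes_bits) // 8


def dual_issue_bytes(sizes_bits, can_pair):
    """Code size when every issue slot is a full 32-bit bundle.

    sizes_bits: encoding size (16 or 32) of each instruction, in program order.
    can_pair:   can_pair[i] is True if instruction i may share a bundle with
                instruction i + 1 (hypothetical flag, an assumption of this
                sketch standing in for the real pairing rules).
    """
    total = 0
    i = 0
    while i < len(sizes_bits):
        if (sizes_bits[i] == 16 and i + 1 < len(sizes_bits)
                and sizes_bits[i + 1] == 16 and can_pair[i]):
            i += 2   # two 16-bit instructions fill one bundle
        else:
            i += 1   # a 32-bit instruction, or a 16-bit one padded with a no-op
        total += 4   # every bundle is 32 bits = 4 bytes
    return total


sizes = [16, 16, 16, 32, 16]
print(single_issue_bytes(sizes))                                     # 12
print(dual_issue_bytes(sizes, [True, False, False, False, False]))   # 16
print(dual_issue_bytes(sizes, [False] * 5))                          # 20
```

In this model the dual-issue size only matches the single-issue size when every 16-bit instruction finds a partner; each unpaired 16-bit instruction costs two extra bytes.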

The decision as to whether the code executes in single or dual-issue mode is made on a function-by-function basis. Each function entry determines the issue mode by whether it executes an ENTSP or a DUALENTSP instruction.

The compiler compiles whole files in either single or dual-issue mode. Within a hand-written assembler file you can make individual functions use different issue modes.

In terms of instruction hazards, the compiler takes care of that. So you only have to think about this if you are writing hand-coded assembler. In the case of assembler, the only hazards you have to worry about are within an instruction bundle (two 16-bit instructions). From one instruction to the next, all destinations will be written before the next instruction will be able to pick them up.

Within an instruction bundle, all sources are read simultaneously and all destinations written simultaneously.
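
As a rough illustration of the "read all sources, then write all destinations" rule, here is a small Python model. It is only a sketch: the register names, the `move` operation and the tuple layout are made up for the example and are not real instruction encodings, and it does not model two instructions in one bundle writing the same destination.

```python
def run_bundle(regs, bundle):
    """Execute one bundle: read every source first, then write every destination.

    regs:   dict mapping register name to value (not modified).
    bundle: list of (dest, op, source_names) tuples; op is a Python callable.
    """
    # Phase 1: every instruction reads its sources from the pre-bundle state.
    results = [(dest, op(*(regs[s] for s in srcs))) for dest, op, srcs in bundle]
    # Phase 2: all destinations are written together.
    new_regs = dict(regs)
    for dest, value in results:
        new_regs[dest] = value
    return new_regs


def run_sequential(regs, instructions):
    """Execute the same instructions one at a time (each in its own bundle)."""
    for instr in instructions:
        regs = run_bundle(regs, [instr])
    return regs


def move(x):
    return x


swap = [("r0", move, ("r1",)), ("r1", move, ("r0",))]
start = {"r0": 1, "r1": 2}

print(run_bundle(start, swap))      # {'r0': 2, 'r1': 1}
print(run_sequential(start, swap))  # {'r0': 2, 'r1': 2}
```

Issued as one bundle, the two moves swap the registers because both read the old values. Issued one after the other, the second move sees the result of the first, which is the only kind of dependency you would have to reason about in hand-written assembler.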

Let me know if anything isn't clear,

Peter

DrMario · Post by **DrMario** » Wed Jul 20, 2016 4:05 pm

Sounds like a pair of compressed 16-bit operands (instructions). It's now clear to me (superscalar architecture in xCore-200 is rather weird, different from what I am used to, at least I will find a way to exploit it).

As for assembly, I'd have to figure out the hazards carefully, as some low-level functions are occasionally required, such as high-speed superpipelined ADC capture, and SDRAM / DDR memory chip read / write.

I will go ahead and do the KiCad design next, as soon as I find some suitable parts (a 64-kilobyte SPI magnetic RAM is preferable here, as it makes it easier to update the oscilloscope / DMM capture board firmware, not to mention much faster boot speed compared to flash - same with erase / write speed).

robertxmos · Post by **robertxmos** » Thu Jul 21, 2016 10:27 am

Hi DrMario

> "superscalar architecture in xCore-200 is rather weird, different from what I am used to, at least I will find a way to exploit it"

Yes, the architecture is significantly different from a superscalar architecture.
Dual-issue is an implementation of VLIW (of length two) - see https://en.wikipedia.org/wiki/Very_long ... ction_word.

The recommendation is to use dual-issue in hand-crafted assembler, where you can even reduce the code size through cunning parallelism (e.g. swapping two register values in one 32-bit instruction!).

Have fun.

Robert

DrMario · Post by **DrMario** » Thu Jul 21, 2016 11:13 am

VLIW architecture itself is actually a special-purpose superscalar architecture, because VLIW processors issue and execute more than one instruction at a time, although that's a discussion for another time. (I view VLIW more as a software-defined superscalar architecture, meaning the compilers do the messy work of branching and instruction reordering where necessary.)

However, if it is a hybrid VLIW architecture, it could be possible to perform out-of-order multithreaded execution for some complex vector computation kernels (FFT, Nyquist, ADC anti-aliasing - you name it). I will probably have to get an xCore-200 USB transputer board that is cheap enough for me to experiment with, exploit the physical CPU cores (AKA tiles, as in the datasheet) contained therein, and see if it's possible. In-order execution could be strictly enforced, although from reading the datasheet I doubt it, except for certain high-priority tasks. (And, yes, out-of-order VLIW processors do exist, although they are harder to find as general-purpose CPUs - the Qualcomm Adreno GPU is one such processor that takes advantage of such an advanced VLIW architecture.) If it can be done, the Cortex M7 processor won't have to do it the hard way. (Occasionally out-of-order execution can be done in software - VLIW processors can be exploited easily if you know how - and it would save a lot of time analyzing complex ADC data and amplifier results before they are displayed, compared to doing it in-order.) Determinism is not really necessary here, except for memory writes and ADC setup, which I can run as separate tasks on two separate physical CPU cores.

Picked a Maxim Integrated MAX19506ETM+ 8-bit, extremely high-speed ADC (clocked at about 100 MHz, for 100 megasamples per second), so I will have to do some fancy processing in software to clean the data up a bit before it is saved to a 64MB DDR-SDRAM chip (200 MHz may be OK, unless I need to bump it way up) as a linear measurement buffer. That way I can get really deep into high-speed design and troubleshooting.
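
A quick back-of-the-envelope check of those numbers, as a Python sketch. It assumes a single channel, one byte stored per 8-bit sample with no packing or headers, and reads "64MB" as 64 MiB. The function names are just for illustration.

```python
SAMPLE_RATE_HZ = 100_000_000      # 100 megasamples per second, from the post
BYTES_PER_SAMPLE = 1              # 8-bit samples, assumed stored one per byte
BUFFER_BYTES = 64 * 1024 * 1024   # "64MB" read as 64 MiB (assumption)


def capture_rate_bytes_per_s(sample_rate_hz, bytes_per_sample, channels=1):
    """Sustained write rate the capture buffer has to absorb."""
    return sample_rate_hz * bytes_per_sample * channels


def buffer_duration_s(buffer_bytes, rate_bytes_per_s):
    """How long a linear buffer lasts before it is full."""
    return buffer_bytes / rate_bytes_per_s


rate = capture_rate_bytes_per_s(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
print(rate / 1e6)                                        # 100.0 (MB/s)
print(round(buffer_duration_s(BUFFER_BYTES, rate), 3))   # 0.671 (seconds)
```

Under these assumptions the memory has to sustain about 100 MB/s of writes, and the linear buffer holds roughly two thirds of a second of capture. Capturing two channels at once would double the rate and halve the duration.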

EDITED: I am wondering whether the xCore-200 already supports floating point, and if so, can it be dual-issued? Lastly, will I have to write a software branch predictor if it lacks a hardware one?
