Crossing clock domains - The need for FIFOs

Technical questions regarding the XTC tools and programming with XMOS.

Post by TjBordelon »

I'm not sure how XCORE intends for us to implement FIFOs, given that we only really have channels and can't "legally" have multiple threads touching a ring buffer. My trouble seeing how to implement this may just be that I'm new to the X-WAY of thinking, but hear me out.

Imagine you are feeding a DAC a nice waveform at 10 MHz and cannot pause to go read channel data, fill up your buffer, and then continue with your fast output loop. Such an act would glitch the output data.

Also imagine that there is a pipe into the XCORE serving a bunch of other things going on (interleaved data). If the thread servicing this incoming pipe had to spend its days filling a channel, it couldn't do much else, and it too would have a gap whenever it had to go get more data.

This is the old FPGA problem of clock domains. Usually it gets solved with a FIFO. If we had XCORE FIFO resources, we'd just do:

Code:

thread 1:
   loop
      fifo <: 8-bit port data, bursty at 50 MHz

thread 2:
   loop
      fifo :> DAC at 20 MHz

Obviously you have to take care not to overflow the FIFO, so there is some feedback I omit.
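
To be clear about why a bare channel alone doesn't cut it: both ends synchronize on every transfer, so the bursty producer just gets throttled to the DAC rate. A minimal XC sketch of that lockstep (names and counts are illustrative):

Code:

#include <xs1.h>

// With a plain channel every output waits for the matching input,
// so the producer is locked to the consumer's pace; nothing buffers.
void producer(chanend c) {
    for (int i = 0; i < 256; i++)
        c <: i;              // blocks until the consumer takes the word
}

void consumer(chanend c) {
    int v;
    for (int i = 0; i < 256; i++)
        c :> v;              // blocks until the producer sends one
}

int main(void) {
    chan c;
    par {
        producer(c);
        consumer(c);
    }
    return 0;
}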

In any case I did wind up solving this issue by writing double buffered "endpoint" objects. The basic use is:

Code:

thread 1:
   write_dbuff(data);

thread 2:
   read_dbuff(data);

Thread 2 can just keep reading and reading with no gaps. Thread 1 can sporadically write.
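
For the single-core case, a rough XC-only sketch of the same idea is to give the double buffer its own task that owns both blocks (sizes and names made up; the overrun feedback is omitted, as above):

Code:

#include <xs1.h>

#define BLK 32   // block size, made up for the sketch

// One block fills from the producer while the other is handed to the
// consumer. The consumer sends any word as a "ready" request and gets
// a whole block back.
void dbuff_task(chanend from_prod, chanend to_cons) {
    int buf[2][BLK];
    int fill = 0;       // index of the block being filled
    int ready = -1;     // index of a completed block, -1 if none yet
    unsigned n = 0;     // words in the filling block

    while (1) {
        select {
        case from_prod :> int v:                    // producer streams in
            buf[fill][n++] = v;
            if (n == BLK) {
                ready = fill;                       // publish the full block
                fill = 1 - fill;                    // fill the other one
                n = 0;
            }
            break;
        case ready >= 0 => to_cons :> int request:  // consumer is ready
            for (int i = 0; i < BLK; i++)
                to_cons <: buf[ready][i];
            ready = -1;
            break;
        }
    }
}

The reading side gets whole blocks with no per-word gaps; whether it ever stalls depends on the producer staying at most one block ahead.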

Anyone solve this with similar constructs, or even better-- with XCORE/XC primitives built in? I just couldn't figure out how to do it without rolling my own shizzle.


As a side note, I extended this concept with "endpoints". It's kinda like XC channels, but with double buffers built in. The above pseudocode takes a channel and streams the data to another core, where it is double buffered. One thread is consumed on the sending core and one on the receiving core; the consuming thread (on the same core as the rx) keeps calling read_dbuff() for uninterrupted data.

I had to write so much stuff to do this, I'm starting to wonder if someone was witty and did this another way. Maybe I overlooked something easier.

In any case, this exercise has yielded a few cool constructs that are useful to me beyond this problem: traditional thread signals using assembly 'lock' instructions, a double buffer "class" that doesn't use any form of formal synchronization (one reading thread, one writing thread, same object), and my own channels that can be passed around without the usual restrictions (going in structs, etc). I will probably clean them up and share once I determine A) they work and B) they don't demonstrate that I did things the hard way. :)

Note that all this had to be done in C (thank you XMOS for allowing C dev!!!) as I really find XC too restrictive for me. Again, maybe it's because I don't X-THINK properly, but the whole purist no-shared-memory, no-passing-resources-in-structs thing hurts my brain.

Post by segher »

TjBordelon wrote:[...]and if the thread servicing this incoming pipe of data had to spend its days filling a channel, it couldn't do much else.
Reading/writing from/to a channel is cheaper than reading/writing from/to memory (accessing memory prevents insn fetch, so you're more limited scheduling those instructions; and more importantly, when you're accessing memory you have to update addresses or array indices. No such burden with channels).

Every channel end has a buffer; you can view that as a very small FIFO. If you go via a switch, you get a somewhat larger buffer.

You can also implement FIFOs yourself, of course. The trick is to have the receiver tell the FIFO (via a channel) when it is ready for the next data. The FIFO runs an event loop waiting on the channels from both the sender and the receiver. You can also merge the FIFO thread with either the receiver or the sender thread in the usual way, if you like pain (or you really have too many threads already).
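
A minimal sketch of such a FIFO task in XC (depth and names made up; the case guards are the flow control):

Code:

#include <xs1.h>

#define DEPTH 64   // FIFO depth, made up for the sketch

// Event loop over both channels. A full FIFO stops accepting, which
// stalls the sender; an empty one leaves the receiver's "ready" token
// parked in the channel until there is data to answer it with.
void fifo_task(chanend from_sender, chanend to_receiver) {
    int buf[DEPTH];
    unsigned head = 0, tail = 0, count = 0;

    while (1) {
        select {
        case count < DEPTH => from_sender :> int v:   // room: take a word
            buf[head] = v;
            head = (head + 1) % DEPTH;
            count++;
            break;
        case count > 0 => to_receiver :> int ready:   // data: serve one
            to_receiver <: buf[tail];
            tail = (tail + 1) % DEPTH;
            count--;
            break;
        }
    }
}

The receiver's side is then just a token out and a word in per item.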

If what you're after is some low-speed transfer, things are simple. If what you want is very high speed, you really should try to generate the data smoothly (not bursty) for best results. Halfway working around problems you caused yourself isn't the best plan ;-)

Post by TjBordelon »

Good stuff! Thanks for this.
segher wrote: Reading/writing from/to a channel is cheaper than reading/writing from/to memory (accessing memory prevents insn fetch, so you're more limited scheduling those instructions; and more importantly, when you're accessing memory you have to update addresses or array indices. No such burden with channels).
I gathered this from many posts you've made to others. The one issue I'm confused about is how memory accesses prevent insn fetches. If I have 4 threads running, I thought no matter what they did the speed was the same.

It is a fine line -- "make things simpler, but no more". My exercise in writing my own channels has shown me that although I have fewer restrictions, I now use 1k of RAM and spend processor time on something that was free before. So it's all trade-offs.

I think you've got the mindset right, and the key: Channels are cheaper. This is probably what the entire design of the XCORE knows and I don't :) I'll have to squeeze my brain into a new mindset.

Post by TjBordelon »

Let's compare two scenarios, both where thread 1 gives data to thread 2:

Code:

Thread 1:
   reg = port
   memory = reg       // taking the place of a ring/double buffer

Thread 2:
   reg = memory       // consumer reads from the buffer

I count 3 cycles between these threads. Now if I use channels:

Code:

Thread 1:  (probably done with select)
   reg = chanend_in
   memory = reg     // write into buffer structure

   reg = memory     // read at different rate
   chanend_out = reg

Thread 2:
   reg = chanend_out

I count 5 cycles per byte in the second, channel-based implementation.

In any case, I think getting rid of the buffer entirely would be nice.
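
Something like this minimal port-to-channel shape, if the rates allowed it (port name illustrative):

Code:

#include <xs1.h>

in port p = XS1_PORT_8A;   // illustrative input port

// No buffer at all: each word goes straight from the port into the
// channel, so the consumer has to keep up.
void port_to_chan(chanend c) {
    int v;
    while (1) {
        p :> v;
        c <: v;
    }
}

void sink(chanend c) {
    int v;
    while (1)
        c :> v;
}

int main(void) {
    chan c;
    par {
        port_to_chan(c);
        sink(c);
    }
    return 0;
}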

Post by Bianco »

Instructions are fetched while executing instructions that do not themselves require a memory access. So if you execute a sequence of instructions that all access memory, the core gets no chance to fetch new instructions. When the instruction buffer is exhausted, an fnop "instruction" is executed just to fetch new instructions. Usually many of these fnops can be avoided by rearranging the instruction sequence.

Also, you probably like to know that the receive FIFO of a chanend is 8 tokens, i.e. bytes. To actually use it as a FIFO to decouple the system, you will need to use streaming channels.
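
A minimal sketch of that decoupling (counts illustrative): with a streaming channel there is no per-word handshake, so the chanend's receive FIFO becomes usable slack between the two threads.

Code:

#include <xs1.h>

// The producer can run ahead by whatever fits in the chanend (and
// switch) buffers: a tiny hardware FIFO between the two threads.
void producer(streaming chanend c) {
    for (int i = 0; i < 256; i++)
        c <: (char) i;       // returns as soon as the byte is buffered
}

void consumer(streaming chanend c) {
    char v;
    for (int i = 0; i < 256; i++)
        c :> v;
}

int main(void) {
    streaming chan c;
    par {
        producer(c);
        consumer(c);
    }
    return 0;
}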

- wrote this quickly in the train

Post by TjBordelon »

Wow -- this is key!! So it sounds like avoiding two adjacent memory ops is the trick. And I bet this is why you guys are saying channels are faster: they don't count as a memory operation.

VERY good info. I didn't see this in the docs. I'm assuming the instruction buffer is 1 deep, so the rule is simple: don't put two memory ops next to one another.

Post by segher »

TjBordelon wrote:
segher wrote: Reading/writing from/to a channel is cheaper than reading/writing
from/to memory (accessing memory prevents insn fetch, so you're
more limited scheduling those instructions; [...]
I gathered this from many posts you've made to others. The one issue I'm confused about is how memory accesses prevent insn fetches. If I have 4 threads running, I thought no matter what they did the speed was the same.
Instructions are issued from the instruction buffer of the thread, which is 64 bits. Normal insns are 16 bits, long insns are 32 bits. Every instruction takes the same time to execute (except when it stalls, of course); every insn can do exactly one memory access. Memory accesses are 32 bits (or smaller).

If an instruction does not need to do a memory access (for a load or store), it does one to replenish the instruction buffer. Most of the time this all works out just fine, but when you have many load/store insns and "long" insns in sequence, the instruction buffer will drain and there will be bubbles in the pipeline (if the buffer is empty, an fnop is issued, which does a fetch, and then you have insns again). Another point of attention is jump targets (the first insn of a loop or subroutine): the jump instruction that gets you there will do a fetch, but a fetch is only 32 bits, so the buffer is only half full. It helps to let the buffer fill up a bit, so you have more space to play with.

The toolchain knows about these things, but if you're writing in assembler (or tuning tight inner loops) you have to pay attention, and likewise when you're doing performance analysis.
TjBordelon wrote: I think you've got the mindset right, and the key: Channels are cheaper.
It's more than that. _Threads_ are also cheap. And threads are most effective if you treat each of them as its own little program doing its own thing, communicating with other processes. Channels are of course an excellently cheap (and fast!) way of communicating.

It is very very very very nice if you don't have to worry about async issues: just block if the receiver of your communication isn't ready. Most of the time you can do that just fine: being able to keep running when you go async sounds great, but there is no work to do *anyway*! Think of the processes as assembly line workers.

Post by segher »

TjBordelon wrote: memory = reg     // Write into buffer structure

reg = memory     // Read at different rate

If it is a different rate, how can it be the same loop in the same code? (I get different insn counts than yours, btw; how do you do a ring buffer access in fewer than three cycles?)

Btw, if you absolutely _have_ to communicate (bigger) packets of data (as opposed to streams), I found the most effective way is to have the producer write the packet to memory and then pass the address and length around (via a channel, of course); when the receiver is done with it, it passes the address and length back.
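
Safe XC won't let two threads share the packet memory, so here is a sketch of just the ownership protocol, with a word-sized handle standing in for the address (in C you would send the real pointer and length the same way; buffer count and length are made up):

Code:

#include <xs1.h>

#define NBUF 2     // buffers in flight, made up for the sketch
#define LEN  16

// The producer lends out buffer handles and reuses a buffer only once
// its handle has come back. The return path is a streaming channel so
// giving a buffer back never blocks the consumer. Round-robin reuse
// matches the in-order returns in this simple case.
void producer(chanend data, streaming chanend back) {
    unsigned nfree = NBUF;
    unsigned next = 0;
    while (1) {
        if (nfree == 0) {        // all lent out: wait for one back
            unsigned h;
            back :> h;
            nfree++;
        }
        // ... fill buffer 'next' here ...
        data <: next;            // hand over the handle...
        data <: (unsigned) LEN;  // ...and the length
        next = (next + 1) % NBUF;
        nfree--;
    }
}

void consumer(chanend data, streaming chanend back) {
    while (1) {
        unsigned h, len;
        data :> h;
        data :> len;
        // ... consume buffer 'h' here ...
        back <: h;               // return the handle when done
    }
}

int main(void) {
    chan data;
    streaming chan back;
    par {
        producer(data, back);
        consumer(data, back);
    }
    return 0;
}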

Post by segher »

TjBordelon wrote:Wow-- this is key!! So it sounds like avoiding 2 memory ops adjacent to each other is key. And I bet this is why you guys are saying channels are faster. They don't count as a memory operation.
As I said, updating the memory address or array index is more expensive
(it always costs instructions).
TjBordelon wrote: VERY good info. I didn't see this in the docs.
Chapter five of "The XMOS XS1 Architecture". Read that book, you know you want to!
TjBordelon wrote: I'm assuming the instruction buffer is 1 deep, so the rule is simple: don't put two memory ops next to one another.
It is 64 bits, and it is not so simple. That is however a reasonable rule of thumb for normal code: spread the memory accesses around. But also be wary of 32-bit instructions (LMUL etc.).