Most efficient way to pass variable-length data thru Interface (solved)

Posts: 23
Joined: Thu Feb 09, 2017 4:53 pm

Post by kevpatt » Thu Feb 09, 2017 5:11 pm

This is kind of a general question. What is the most efficient way to pass variable-length data through an interface in XC? Say, a null-terminated string or a variable-length packet of data like ethernet?

My understanding is that if we want to be able to pass from one tile to another, we can't use pointers or shared memory. So these are the other options I am aware of, (along with some implications according to my understanding):

1) Pass a fixed-size array, large enough to hold the largest possible packet (arrays are apparently always passed "by reference"?)
2) Break the packet into smaller fixed-size chunks and make multiple calls
3) Pass individual bytes or words using a raw channel. (handshaking for every word?!!)
4) Use a streaming channel (ties up resources for the life of the thread?)
5) Do some low-level I/O, using tokens to create your own protocol (seems non-standard, and possibly a lot of work)

My thoughts on each of the above:

1. This seems like the "preferred" approach, but does AFAIK involve using a fixed-size array. The client passes an array to the server (by reference) during an interface call, and the server writes data into it. XC and the hardware take care of marshaling this between tiles if necessary. This approach is probably ok for me, but I have a couple of questions about efficiency: Let's say that the largest packet is 4 kBytes, but most packets are less than 256 bytes, and packets are also transmitted infrequently (making a streaming channel overkill). So, to handle this, the client keeps a 4k array in memory, and passes it to the server on each interface call. The server writes data into the array and returns the size written. QUESTION: Assuming the server always writes to, and never reads from, the array, what happens when this interface call goes across tiles? Does the entire contents of the array get copied over the link from client to server? I read somewhere that only "changes" are copied back from server to client; that's great. But if the server never needs to read from the array, only write to it, it seems like copying all the data to the server is a waste of time and bandwidth.

2. I am guessing that each "call" involves some non-trivial amount of token "handshaking". So the inefficiency would be a trade off between moving garbage data (as in 1. above), vs. more handshaking.

3. This would avoid copying _any_ garbage data (see 1. above), but adds the overhead of "handshaking" for every byte or word transmitted. BTW: Is data type preserved? What happens if I output a word to a regular chanend, and input a byte on the other side?

4. Seems like streaming channels will work very efficiently, but you lose the nice structure of "interfaces", and tie up some streaming channel resources permanently. (assume the streaming chanend is passed to two threads that each run in an infinite loop).

5. Apparently I can do whatever I want with the low-level I/O function calls, but I have to invent my own protocol entirely. Plus it won't be pretty, safe, XC.

Any ideas here, or am I just missing something really obvious?
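
Option 2 above (fixed-size chunks, multiple calls) can be sketched in plain C to make the trade-off concrete. This is only a stand-in, not XC: the function-pointer call plays the role of one interface call, and `CHUNK_SIZE`, `send_in_chunks`, `collect_chunk` and `struct collector` are hypothetical names chosen for illustration. The point it shows is that the number of calls grows with the packet length, while at most `CHUNK_SIZE - 1` padding bytes travel in the last chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 64 /* assumed chunk size, not taken from the thread */

/* Stand-in for one interface call carrying a fixed-size chunk. */
typedef void (*chunk_sink)(const uint8_t chunk[CHUNK_SIZE], size_t valid,
                           void *ctx);

/* Split a variable-length packet into fixed-size chunks and hand each one
   to the sink. Returns how many calls were made. */
static size_t send_in_chunks(const uint8_t *data, size_t len,
                             chunk_sink sink, void *ctx)
{
    size_t calls = 0;
    for (size_t off = 0; off < len; off += CHUNK_SIZE) {
        uint8_t chunk[CHUNK_SIZE] = {0};
        size_t remaining = len - off;
        size_t valid = remaining < CHUNK_SIZE ? remaining : CHUNK_SIZE;
        memcpy(chunk, data + off, valid); /* only the last chunk is padded */
        sink(chunk, valid, ctx);
        calls++;
    }
    return calls;
}

/* Hypothetical receiver that reassembles the packet from its chunks. */
struct collector {
    uint8_t bytes[4096];
    size_t used;
};

static void collect_chunk(const uint8_t chunk[CHUNK_SIZE], size_t valid,
                          void *ctx)
{
    struct collector *c = ctx;
    assert(c->used + valid <= sizeof c->bytes);
    memcpy(c->bytes + c->used, chunk, valid);
    c->used += valid;
}
```

With these assumed sizes, a 150-byte packet takes three calls (64 + 64 + 22 valid bytes), where option 1 would take a single call with a 4k array.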

XCore Addict
Posts: 169
Joined: Fri Oct 23, 2015 10:23 am

Post by robertxmos » Fri Feb 10, 2017 9:59 am

Hi kevpatt

I assume this is a run-time variability (otherwise you could pass around a 'static const int' parameter, viz. a compile-time constant).

Interface calls vary depending upon where the other task resides.
If tasks are on the same logical core (distributable), it will become a function call.
If tasks are on different logical cores on the same tile, they will use shared memory.
If tasks are on different tiles, they will use channels.

You also need to know that memcpy has been overloaded to handle such situations.

Thus, large data should be passed by reference (a remote reference).
The client can then read as much of the referenced object as it needs - possibly using memcpy (the compiler will do the right/efficient thing).
You can't pass the remote-reference out of the case-block - you must make a local copy and pass that out.

Thus, I would go with #1 - a fixed sized array, but also send the data length.
The client can then do a memcpy of the actual length, into a local fixed sized array.
If both tasks end up on the same tile or same logical core, the compiler will optimise the memcpy appropriately.

N.B. compilers understand memcpy very well and will optimise it away when they can - use it and allow the 'optimising compiler' to do its job!
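
The "fixed-size array plus length, then memcpy of the actual length" pattern can be sketched in plain C as follows. It is a stand-in only, not XC: ordinary function calls take the place of the interface call, and `MAX_PACKET`, `produce_packet` and `take_local_copy` are hypothetical names. How the XC compiler actually marshals the data between tiles is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PACKET 4096 /* largest packet size mentioned in the thread */

/* Hypothetical producer: writes a packet into the caller's fixed-size
   buffer and returns the number of valid bytes. */
static size_t produce_packet(uint8_t buf[MAX_PACKET], const char *payload)
{
    size_t len = strlen(payload);
    assert(len <= MAX_PACKET);
    memcpy(buf, payload, len);
    return len;
}

/* Hypothetical consumer: copies only the valid bytes into its own
   fixed-size local array, leaving the rest of the array untouched. */
static size_t take_local_copy(uint8_t local[MAX_PACKET],
                              const uint8_t shared[MAX_PACKET], size_t len)
{
    assert(len <= MAX_PACKET);
    memcpy(local, shared, len);
    return len;
}
```

Both arrays have the full 4k capacity, but only `len` bytes are ever copied, so a typical 256-byte packet costs a 256-byte copy.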

Post by kevpatt » Fri Feb 10, 2017 5:55 pm

Yes, I know, sometimes it's hard to "trust the compiler". :) Sometimes you just want to know what's really going on...

So, let's assume a worst-case scenario: We have a client and server on different tiles. The length of the data in each "packet" is variable at run time, but will always be 4k or less. So:

The client keeps a 4k array (buffer) in memory, then makes an interface call to the server, passing the array "by reference". We assume that the compiler and hardware are going to do some magic here, and not actually pass any data across the link *yet*. Since the code on the server side never _reads_ from the array, the compiler generates no server-side instructions to pull array data across the link from the client side? Then, the server writes into the "array", which is translated by the compiler into transmissions across the link from the server to the client side, where the data is written into the array in the appropriate places? (I assume some control codes are going to be exchanged both ways to negotiate all this.)

Is this a rough approximation of how it works? If so, my hat is off to XMOS, that is some excellent work with the compiler, and I will be more than happy to "just trust it".

Does the interface mechanism for passing large chunks of data have link efficiency similar to a streaming channel? (i.e. no "handshaking" tokens for every single byte/word)

BTW, one thing I still wonder about: What happens if I write an int into a regular chanend and read out a char?

Post by robertxmos » Fri Feb 10, 2017 6:47 pm

Yes, please take your hat off.

Initially only an 'event' packet is sent from the client to the server.
The client sends the server the parameters (e.g. a 32bit remote-reference + 32bit length)
The server will respond (in priority order) to client events.
The client and server engage in an efficient but dynamic protocol - with a defined end where the client and server are released to go their separate ways.

During the handshaking, the server may transfer (remote-memcpy) bytes, words, blocks efficiently between it and the client's end (the client is now serving!)
I do not know for sure if the compiler will re-transfer data that has previously been sent - feel free to check.
I also don't know if the compiler will reorder transfers and concatenate them - e.g. a for-loop over the data.
Help the compiler by using const and memcpy() as much as possible!
If you cache transfers in a server local buffer (using memcpy to and fro) the compiler will understand what you are doing and work with you - removing any that you did not need.
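
The "cache in a server-local buffer" advice could look roughly like this in plain C. This is a sketch under stated assumptions, not XC and not the compiler's actual behaviour: `dest` stands in for the array the client passed by reference, and the 3-byte header (type plus big-endian length), `build_and_publish` and `MAX_PACKET` are made up for illustration. The packet is assembled locally with several small writes, then moved with a single memcpy of exactly the bytes used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PACKET 4096 /* largest packet size mentioned in the thread */
#define HEADER_LEN 3    /* hypothetical header: type + 16-bit length */

/* Build the packet in a local buffer, then publish it into the caller's
   array with one memcpy. Returns the total number of bytes written. */
static size_t build_and_publish(uint8_t dest[MAX_PACKET], uint8_t type,
                                const uint8_t *payload, size_t payload_len)
{
    uint8_t local[MAX_PACKET];
    size_t len = 0;

    assert(payload_len <= MAX_PACKET - HEADER_LEN);

    local[len++] = type;
    local[len++] = (uint8_t)(payload_len >> 8);   /* length, high byte */
    local[len++] = (uint8_t)(payload_len & 0xFF); /* length, low byte */
    memcpy(local + len, payload, payload_len);
    len += payload_len;

    memcpy(dest, local, len); /* the only write to the caller's array */
    return len;
}
```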

I would tell you to look at the compiler output - but it is mostly gobbledygook.

Post by robertxmos » Fri Feb 10, 2017 6:54 pm

p.s. as for int and char handling across chanends I have not looked.
I would expect the behaviour is either defined (an exception occurs), or undefined (an exception may be thrown by future architectures).
Be warned, compilers can totally ignore any code following undefined behaviour; they reason 'why bother compiling it'.

Post by kevpatt » Fri Feb 10, 2017 8:29 pm

WOW I am really impressed! I'll dig into the disassembly as I have time. :)

Right now I'm using a "plain" (non-streaming) channel, and although I suspect it's not very efficient when for-looping over an array of chars, it's working for now. Based on your feedback, I'll convert the communication to an XC "interface" and compare the performance in the near future.

Thanks for your responses, they are greatly appreciated.
