Fastest way to bit bang 8 bytes on an 8 bit port?

Technical questions regarding the XTC tools and programming with XMOS.
User avatar
Interactive_Matter
XCore Addict
Posts: 216
Joined: Wed Feb 10, 2010 10:26 am

Fastest way to bit bang 8 bytes on an 8 bit port?

Post by Interactive_Matter »

Hi,

in my current project I have to output 8 bytes parallel on an 8 bit wide port. I do it in a very straight forward (or dumb) way, like:

Code: Select all

for (int bit=0; bit<8;bit++) {
  unsigned char bit_position = (1<<bit);
  unsigned char output = 0;
  for (int value=0; value<8;value++) {
    if (values[value]&bit_position) {
      output|=(1 << value);
    }
  output_channel <: output;
}
This is not actual code but you get the principle: I check for each bit in each value if it set and set the corresponding bit in the output value. So if you imagine the bits as an 8x8 matrix I simply turn the matrix by 90°).

This method is easy to understand but uses up a lot of processing cycles. Is there an better way? Perhaps even using some XMOS assembler magic ;)

Thanks

Marcus


ale500
Respected Member
Posts: 259
Joined: Thu Sep 16, 2010 9:15 am

Post by ale500 »

One approach could be to change the way the 8 bytes are generated. Instead of arranging the bits of each byte into one byte you may store them in the successive bytes of the output array. Depending on how you get those bytes.

A table-look-up method could also be envisioned... but I think an unrolled and well crafted assembler program would be better... how are the starting bytes arranged in memory ? are they an array or are they 8 variables somewhere in non-consecutive memory locations ?...
User avatar
Interactive_Matter
XCore Addict
Posts: 216
Joined: Wed Feb 10, 2010 10:26 am

Post by Interactive_Matter »

ale500 wrote:One approach could be to change the way the 8 bytes are generated. Instead of arranging the bits of each byte into one byte you may store them in the successive bytes of the output array. Depending on how you get those bytes.

A table-look-up method could also be envisioned... but I think an unrolled and well crafted assembler program would be better... how are the starting bytes arranged in memory ? are they an array or are they 8 variables somewhere in non-consecutive memory locations ?...
Unfortunately I cannot change how the bytes are read. They come in as 8 single byte arrays like:

Code: Select all

unsigned char[NUMBER_OF_ELEMENTS] data;
This structure is read 8 times from a channel. This is 8 times decoded into a structure like

Code: Select all

unsigned char[NUMBER_OF_ELEMENTS][8] data;
While reading the data I can change it a bit - but I cannot change the fact that I have to read this array 8 times to get 8 arrays.

I was hoping for a simple solution - but somehow this did not work out - again ;)

Thanks

Marcus
User avatar
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Are you reading this from the IO pins? I know some ADCs use that format.
MaxFlashrom
Experienced Member
Posts: 82
Joined: Fri Nov 05, 2010 2:59 pm

Post by MaxFlashrom »

I imagine that the reason you want to do this is that you're driving 8 serial DACs in parallel. I'm not sure a lookup table would work. It would have to be huge to do a full lookup(and it would only buy a little doing a partial lookup). Of course using 8 1-bit buffered XMOS ports would work, but this is wasteful even if you have them. I think a shift-all bytes right, OR each with 0x1, Or result with each in turn, shifting result left after each bit would be my approach.

I'm thinking something like:
EDIT: Fixed dodgy example

Code: Select all

ldc r9,1

Then repeat 8 times:
ldc r8,0

and  r10, r7, r9
or r8,r8,r10
shl r8,r8,1

and  r10, r6, r9
or r8,r8,r10
shl r8,r8,1

and  r10, r5, r9
or r8,r8,r10
shl r8,r8,1

and  r10, r4, r9
or r8,r8,r10
shl r8,r8,1

and  r10, r3, r9
or r8,r8,r10
shl r8,r8,1

and  r10, r2, r9
or r8,r8,r10
shl r8,r8,1

and  r8, r1, r9
or r8,r8,r10
shl r8,r8,1

and  r8, r0, r9
or r8,r8,r10

shr r0, r0,1
shr r1, r1,1
shr r2, r2,1
shr r3, r3,1
shr r4, r4,1
shr r5, r5,1
shr r6, r6,1
shr r7, r7,1

stw [whatever], r8
lsu ..loopctr, 8 blah blah
bt loop

Max.
Last edited by MaxFlashrom on Tue Aug 16, 2011 4:50 pm, edited 1 time in total.
User avatar
Interactive_Matter
XCore Addict
Posts: 216
Joined: Wed Feb 10, 2010 10:26 am

Post by Interactive_Matter »

Thanks for the helpfull answers - still have to decode.

My use case is outputting a data buffer to 8 parallel lines of SPI (only MOSI), The 8 bit ports area a convenient way to drive 8 SPI clients using only one clock and CS line.

The data from the clients is irrelevant to me so I just simply drop them.

Thanks so far - have to understand the advantages of the shift operations a bit better (You know I am more familiar with C then assembler).

Marcus
richard
Respected Member
Posts: 318
Joined: Tue Dec 15, 2009 12:46 am

Post by richard »

You can compute 4 results at a time if you are careful. The approach is roughly:

1) For each value build a word that is that 8-bit value repeated 4 times.
2) Build a mask which extracts the 1st bit of the 1st byte, the 2nd bit of the 2nd byte, the 3rd bit of the 3rd byte, etc.
3) Loop over the first 4 values and use shifts / ors / ands to build a result where bits 0 - 3 are the first bits of the first 4 values, bits 9-12 are the second bits of the first 4 values, bits 18-21 are the third bits of the first 4 values and bits 27-30 are the fourth bits of the first 4 values.
4) Repeat step 3 to extract the same bits from the last 4 values.
5) Combine the result of step 3 and 4 to get the final result.

Code is as follows:

Code: Select all

void g(unsigned char values[8]) {
  unsigned mask = 1 | ((1 << 1) << 8) | ((1 << 2) << 16) | ((1 << 3) << 24);
  for (int bit = 0; bit != 8; bit+=4) {
    unsigned output;
    unsigned output1 = 0;
    unsigned output2 = 0;
    for (int value = 0; value < 4; value++) {
      unsigned data;
      unsigned masked;
      data = values[value] >> bit;
      data |= data << 8;
      data |= data << 16;
      masked = data & mask;
      output1 |= masked << value;
    }
    for (int value = 0; value < 4; value++) {
      unsigned data;
      unsigned masked;
      data = values[value + 4] >> bit;
      data |= data << 8;
      data |= data << 16;
      masked = data & mask;
      output2 |= masked << value;
    }
    output = output1 | (output2 << 4);
    printhexln((unsigned char)output);
    output >>= 9;
    printhexln((unsigned char)(output));
    output >>= 9;
    printhexln((unsigned char)(output));
    printhexln((output1 >> 27) | (output2 >> 23));
  }
}
User avatar
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

You can shift a bit out of one word and into another in one insn; that would
result in 70-80 insns total for this.

But of course you can go a bit faster with more work:

C pseudo code:

Code: Select all

u8 in[8]; // your input
u8 out[8] // your output
const u32 magic[4][16] = {
  0x00000000, 0x00000001, 0x00000100, 0x00000101,
  0x00010000, 0x00010001, 0x00010100, 0x00010101,
  0x01000000, 0x01000001, 0x01000100, 0x01000101,
  0x01010000, 0x01010001, 0x01010100, 0x01010101
};
u32 j;
u32 accl = 0, acch = 0;
for (j = 7; j < 8; j--) {
  accl = 2*accl + magic[in[j] & 0x0f];
  acch = 2*acch + magic[in[j] >> 4];
}
out[0] = accl;
out[1] = accl >> 8;
out[2] = accl >> 16;
out[3] = accl >> 24;
out[4] = acch;
out[5] = acch >> 8;
out[6] = acch >> 16;
out[7] = acch >> 24;
64 bytes for the table, small code, about 50-60 cycles.

If you splurge on a 1024-byte table, you can do it in 30-40 cycles.

I suppose you can do it even faster by keeping everything packed in two
registers, and doing the standard matrix transpose thing on it; the bitrev
and byterev insns should help. Dunno.