New instructions

Post by segher »

No manuals yet, so let's do some light reverse engineering. I had
a look at the new instructions. This is just spelling, not any
actual semantics, but we can guess...


MOVED:

r2   0000   byterev $ra,$rb   from r2_0 100000
r2   0800   clz $ra,$rb       from r2_0 000800
r2   1000   bitrev $ra,$rb     from r2_0 000000

r2_0 000000   init t[$rB]:sp,$rA   from r2 1000
r2_0 000001   init t[$rB]:cp,$rA   from r2 1800
r2_0 100000   init t[$rB]:pc,$rA   from r2 0000
r2_0 000800   init t[$rB]:dp,$rA   from r2 0800
r2_0 000801   tsetmr $rA,$rB       from r2 1810
r3_0 b000     set t[$rC]:$rA,$rB   from r3 b800

Those useful bit-fiddling instructions are nice to have as short
insns (so they can be paired); the cross-thread register access
instructions do not need to be short.


REMOVED:

u6   78c0   krestsp $ui
r0   0011   kret
r0   0012   dret
r0   1000   dentsp
r0   1001   drestsp

I wonder how these things work now. Are the registers simply
removed? And then? There are no clues here :-)


NEW:

u6   7c80   dualentsp $ui
r0   1013   nop
r1   8800   gettime $ra
r1   8810   elate $ra
r3_0 9801   unzip $rA,$rB,$bC
r3_0 9802   zip $rA,$rB,$bC
r3_0 c801   outpw res[$rB],$rA,$rC
r3_1 0810   crcn $ra,$rA,$rB,$rC
r3_1 1000   std $ra,$rA,$rB[$rC]
r3_1 1010   std $ra,$rA,$rB[$uC]
r3_1 2000   ldd $ra,$rA,$rB[$rC]
r3_1 2010   ldd $ra,$rA,$rB[$uC]
r3_2 0810   xor4 $rA,$ra,$rB,$rC,$rb
r3_2 1800   lextract $rA,$ra,$rB,$rC,$bb
r3_2 1810   linsert $rA,$ra,$rB,$rC,$bb
r3_2 2810   crc32_inc $rA,$rB,$rC,$ra,$bb

"nop" used to be "add r0,r0,0", but that cannot pair with anything
else setting r0. "gettime" is nice (reference clock always?);
"elate" ("exception if late"?) is fun too. Zip and unzip, are
those some sheep-and-goats instructions, interleaving bits?

Then we have an "outpw" taking a register as count, and a "crcn";
is that shifting a variable number of bits? Together they should
be quite useful for USB, MII, etc. "crc32_inc", I have no clue.
"lextract", "linsert" -- not sure about the exact semantics, but
that sounds like double-word shift-and-mask things. What "xor4"
is useful for, I have no idea. But ldd/std are more obvious ;-)
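
If zip/unzip really are bit-interleaving operations as guessed above, a software model might look roughly like the sketch below. Everything here is an assumption for illustration: the names zip_model and unzip_model are hypothetical, the third operand is taken to be the log2 of the chunk width, and which input lands in the even slots is a guess rather than documented behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not the documented instruction semantics.
 * Interleaves chunks of (1 << log_g) bits: chunks of `lo` go to the even
 * slots of the result, chunks of `hi` to the odd slots (an assumption). */
static uint64_t zip_model(uint32_t hi, uint32_t lo, unsigned log_g)
{
    assert(log_g <= 4);                 /* bits, pairs, nibbles, bytes, byte pairs */
    unsigned g = 1u << log_g;           /* chunk width in bits */
    uint64_t mask = (1ull << g) - 1;
    uint64_t out = 0;
    for (unsigned i = 0; i < 32 / g; i++) {
        uint64_t l = (lo >> (i * g)) & mask;
        uint64_t h = (hi >> (i * g)) & mask;
        out |= l << (2 * i * g);
        out |= h << ((2 * i + 1) * g);
    }
    return out;
}

/* Inverse of zip_model under the same assumptions. */
static void unzip_model(uint64_t v, unsigned log_g, uint32_t *hi, uint32_t *lo)
{
    assert(log_g <= 4);
    unsigned g = 1u << log_g;
    uint64_t mask = (1ull << g) - 1;
    *hi = 0;
    *lo = 0;
    for (unsigned i = 0; i < 32 / g; i++) {
        *lo |= (uint32_t)((v >> (2 * i * g)) & mask) << (i * g);
        *hi |= (uint32_t)((v >> ((2 * i + 1) * g)) & mask) << (i * g);
    }
}
```

With a chunk width of one bit this is the classic bit interleave; with a chunk width of eight bits it shuffles bytes from the two words alternately.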



Post by ers35 »

The 14.0.0 xs1.h provides some clues:


/**
 * Configures a clock to use a 1-bit port as its source with a divide. If divide
 * is set to zero the 1-bit port provides the clock signal for the clock block
 * directly. If divide is non zero the clock signal provided by the 1-bit port
 * is divided by 2 * \a divide. This function is only available on XS2 devices.
 *  If the port is not a 1-bit port, an exception is raised.
 *  \param clk The clock to configure.
 *  \param p The 1-bit port to use as the clock source.
 *  \sa configure_clock_ref
 *  \sa configure_clock_xcore
 *  \sa configure_clock_rate
 *  \sa configure_clock_rate_at_least
 *  \sa configure_clock_rate_at_most
 */
void configure_clock_src_divide(clock clk, void port p, unsigned char d);

/**
 * Incorporate n-bits of a 32-bit word into a Cyclic Redundancy Checksum (CRC).
 * Executing 32/N crcn calls sequentially has the same effect as executing a
 * single crc call.
 * \param[in,out] checksum The inital value of the checksum, which is updated
 *                         with the new checksum.
 * \param data The data to compute the CRC over.
 * \param poly The polynomial to use when computing the CRC.
 * \param n The number of lower bits of the data to incorporate.
 */
void crcn(unsigned int &checksum, unsigned int data,
          unsigned int poly, unsigned int n);
#if defined(__XS2A__)
#define crcn(c, d, p, n) __builtin_crcn(c, d, p, n)
#endif

/**
 * Extract a bitfield from a 64-bit value.
 * \param value The value to extract the bitfield from.
 * \param position The bit position of the field, which must be a value between
 *                 0 and bpw - 1, inclusive.
 * \param length The length of the field, one of bpw, 1, 2, 3, 4, 5,
 *               6, 7, 8, 16, 24, 32.
 * \return The value of the bitfield.
 */
unsigned int lextract(unsigned long long value, unsigned int position,
                      unsigned int length);
#if defined(__XS2A__)
#define lextract(v, p, l) __builtin_lextract(v, p, l)
#endif

/**
 * Insert a bitfield into a 64-bit value.
 * \param value The 64-bit value to insert the bitfield in.
 * \param bitfield The value of the bitfield.
 * \param position The bit position of the field, which must be a value between
 *                 0 and bpw - 1, inclusive.
 * \param length The length of the field, one of bpw, 1, 2, 3, 4, 5,
 *               6, 7, 8, 16, 24, 32.
 * \return The 64-bit value with the inserted bitfield.
 */
unsigned long long linsert(unsigned long long value, unsigned int bitfield,
                           unsigned int position, unsigned int length);
#if defined(__XS2A__)
#define linsert(v, b, p, l) __builtin_linsert(v, b, p, l)
#endif

/**
 * Perform saturation on a 64-bit value. If any arithmetic has overflowed
 * beyond a given bit index, then the value is set to MININT or MAXINT,
 * right shifted by the bit index.
 * \param value The 64-bit value to perform saturation on.
 * \param index The bit index at which overflow is checked for.
 * \result The saturated 64-bit value.
 */
signed long long lsats(signed long long value, unsigned int index);
#if defined(__XS2A__)
#define lsats(v, i) __builtin_lsats(v, i)
#endif

/**
 * Unzip a 64-bit value into two 32-bit values, with a granularity of
 * bits, bit pairs, nibbles, bytes or byte pairs.
 * \param value The 64-bit zipped value.
 * \param log_granularity The logarithm of the granularity.
 * \return Two 32-bit unzipped values.
 */
{unsigned int, unsigned int} unzip(unsigned long long value,
                                   unsigned int log_granularity);
#if defined(__XS2A__)
#define unzip(v, g) __builtin_unzip(v, g)
#endif

/**
 * Zip two 32-bit values into a single 64-bit value, with a granularity of
 * bits, bit pairs, nibbles, bytes or byte pairs.
 * \param value1 The first 32-bit value.
 * \param value2 The second 32-bit value.
 * \param log_granularity The logarithm of the granularity.
 * \return The 64-bit zipped value.
 */
unsigned long long zip(unsigned int value1, unsigned int value2,
                       unsigned int log_granularity);
#if defined(__XS2A__)
#define zip(v1, v2, g) __builtin_zip(v1, v2, g)
#endif

crc32_inc likely performs a crc32 and increments $ra by $bb. Very convenient for RGMII.
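
To make the crcn comment in the header concrete ("executing 32/N crcn calls sequentially has the same effect as executing a single crc call"), here is a rough bit-serial model in plain C. It is only a sketch: the function names are hypothetical, the LSB-first bit ordering is an assumption, and crc32_inc_model simply encodes the guess above that the instruction does a crc32 step and then bumps a pointer register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit-serial model. Shifts the n low bits of `data` into the
 * checksum, LSB first, XORing in `poly` whenever a 1 falls out of bit 0.
 * The real hardware bit ordering is an assumption here. */
static uint32_t crcn_model(uint32_t crc, uint32_t data, uint32_t poly, unsigned n)
{
    assert(n <= 32);
    for (unsigned i = 0; i < n; i++) {
        uint32_t feedback = crc & 1u;
        crc = (crc >> 1) | ((data & 1u) << 31);
        data >>= 1;
        if (feedback)
            crc ^= poly;
    }
    return crc;
}

/* A full-word crc step is the n = 32 case of the model above. */
static uint32_t crc32_model(uint32_t crc, uint32_t data, uint32_t poly)
{
    return crcn_model(crc, data, poly, 32);
}

/* Guessed behaviour of crc32_inc: one crc32 step, then advance `ptr` by
 * `step` bytes. Purely illustrative. */
static uint32_t crc32_inc_model(uint32_t crc, uint32_t data, uint32_t poly,
                                uint32_t *ptr, unsigned step)
{
    *ptr += step;
    return crc32_model(crc, data, poly);
}
```

In this model, feeding a word in four 8-bit pieces (low byte first) gives the same checksum as one 32-bit step, which is the property the header describes.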

Post by segher »

ers35 wrote: The 14.0.0 xs1.h provides some clues:

Ah yes, hadn't looked at headers yet. This was just
"xas -march=xs2a" together with xobjdump. And some
Perl (of course :-) )

ers35 wrote: crc32_inc likely performs a crc32 and increments $ra by $bb. Very convenient for RGMII.

Why is that?

Post by lilltroll »

Check out the new lib_ethernet


tx_data:
        { out res[p_txd], tmp2            ; ldw tmp2, ptr[0] }
        crc32_inc crc, tmp2, poly, ptr, 4
        { out res[p_txd], tmp1            ; ldw tmp1, ptr[0] }
        crc32_inc crc, tmp1, poly, ptr, 4
        { out res[p_txd], tmp3            ; ldw tmp3, ptr[0] }
        crc32_inc crc, tmp3, poly, ptr, 4
        { out res[p_txd], tmp2            ; ldw tmp2, ptr[0] }
 
The gigabit_ethernet_demo is also interesting


ldd r5, r4, sp[0]

std r5, r4, sp[0]
 

Post by Ross »

lilltroll wrote: Check out the new lib_ethernet

The gigabit_ethernet_demo is also interesting


ldd r5, r4, sp[0]

std r5, r4, sp[0]
 
64-bit (or "double") load/store.

Post by ers35 »

segher wrote: Why is that?

An L-series part running at 500 MHz can barely interface with a GMII with the help of a CPLD. Doing a CRC after each input makes the timing even stricter. The performance requirements limit the number of cores that can be active on a tile.

crc32_inc combined with dual issue reduces the performance requirements to a comfortable level.
Last edited by ers35 on Tue Mar 31, 2015 11:54 pm, edited 1 time in total.

Post by segher »

lilltroll wrote: Check out the new lib_ethernet


tx_data:
        { out res[p_txd], tmp2            ; ldw tmp2, ptr[0] }
        crc32_inc crc, tmp2, poly, ptr, 4
        { out res[p_txd], tmp1            ; ldw tmp1, ptr[0] }
        crc32_inc crc, tmp1, poly, ptr, 4
        { out res[p_txd], tmp3            ; ldw tmp3, ptr[0] }
        crc32_inc crc, tmp3, poly, ptr, 4
        { out res[p_txd], tmp2            ; ldw tmp2, ptr[0] }
 
Ah, with 8 threads active you have only
two thread cycles (and nothing to spare) to output gigabit.
Very handy indeed then :-)

lilltroll wrote: The gigabit_ethernet_demo is also interesting


ldd r5, r4, sp[0]

std r5, r4, sp[0]
 
And I completely missed those instructions.
Huh, not thorough enough.
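
For anyone wanting to check the "two thread cycles" figure, the arithmetic can be sketched as below. The inputs are assumptions for illustration only: a 500 MHz core clock (the figure quoted for the L-series earlier in the thread), issue slots shared evenly between the active threads, and one 32-bit word per `out`. The function name is hypothetical.

```c
#include <assert.h>

/* Back-of-envelope budget: how many issue slots one thread gets per
 * output word. All inputs are assumptions supplied by the caller. */
static double thread_cycles_per_word(double core_hz, unsigned threads,
                                     double line_bps, unsigned bits_per_out)
{
    assert(threads > 0 && bits_per_out > 0);
    double slots_per_thread = core_hz / threads;       /* issue slots per second */
    double words_per_second = line_bps / bits_per_out; /* words leaving the port */
    return slots_per_thread / words_per_second;
}
```

Calling thread_cycles_per_word(500e6, 8, 1e9, 32) gives 2: 62.5 million slots per second per thread against 31.25 million words per second. That lines up with the tx_data loop above, which spends one dual-issue bundle plus one crc32_inc per word.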

Post by lilltroll »

The Ethernet and USB implementations have some asm code targeting the new xCORE-200.

Search for the pattern

#if defined(__XS2A__)

in all .S files in xTIMEcomposer 14 (after downloading all the USB and Ethernet code/libs).

You will get examples of writing asm for the new instructions this way.

Post by infiniteimprobability »

Ahead of the official docs, here are some pointers for the categories of new instructions:

Load/store double - obviously means that memory width has doubled too, which is handy for dual issue :)
CRC n-bit and 32b version with increment option
Port out, register version
Extract/insert (from 2 regs into a single)
Saturate
Zip/Unzip for combining bits in turn - very handy for using wide ports for serial bit streams
XOR four words in a single go
Actual NOP
Reading ref clock without allocating a timer
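
As a rough illustration of the extract/insert item in the list above, the xs1.h prototypes for lextract and linsert can be modelled in plain C like this. It is a sketch, not the real builtins: the _model names are hypothetical, and counting `position` from the least significant bit of the 64-bit value is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of lextract: pull `length` bits starting at bit
 * `position` out of a 64-bit value (a register pair on the device). */
static uint32_t lextract_model(uint64_t value, unsigned position, unsigned length)
{
    assert(position < 32 && length >= 1 && length <= 32);
    uint64_t mask = (1ull << length) - 1;
    return (uint32_t)((value >> position) & mask);
}

/* Hypothetical model of linsert: overwrite the same field with the low
 * `length` bits of `bitfield`, leaving the other bits untouched. */
static uint64_t linsert_model(uint64_t value, uint32_t bitfield,
                              unsigned position, unsigned length)
{
    assert(position < 32 && length >= 1 && length <= 32);
    uint64_t mask = ((1ull << length) - 1) << position;
    return (value & ~mask) | (((uint64_t)bitfield << position) & mask);
}
```

The interesting case is a field that straddles the two 32-bit halves, for example position 28 with length 8, which would otherwise take a couple of shifts, a mask and an OR.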

Post by segher »

Saturate, I missed that one as well. Grmbl.

The disassembler (in xobjdump, haven't looked at xgdb yet)
still decodes many invalid ops as nonsensical other ops. That
doesn't help with finding the new insns :-)