Porting Qaoed to Solaris 10 - The Adventure

Requirements

  • Lots of experience with porting software
  • Solaris 10 running on either Sparc or x86
  • Either the GNU GCC toolchain or the SUNWpro toolchain (although you can use both)
  • Subversion (and a whole bunch of libraries)
  • Qaoed sources (checked out from http://svn.fubra.com/storage/qaoed/trunk)

Background

We at Fubra decided to evaluate various platforms to support our next-generation network infrastructure. One of the platforms we are evaluating is the Sun T5420, a high-end dual-processor UltraSPARC T2 machine with support for 64 threads per processor. We decided to port our Qaoed daemon to the Solaris platform, on both x86 and UltraSPARC.

Porting

Under Linux, it is very easy to send or receive raw packets through sockets. First, open a packet socket for the ethernet type you want:

if ((fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_AOE))) < 0)
{
    printf("Couldn't open socket: %s\n", strerror(errno));
    return -1;
}

Then you must bind to a specific device:

struct ifreq ifr;
struct sockaddr_ll sall;

/* Look up the interface index for the named device */
memset(&ifr, 0, sizeof(ifr));
strncpy(ifr.ifr_name, device, sizeof(ifr.ifr_name) - 1);
if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0)
{
    printf("Couldn't execute ioctl() on %s: %s\n", device, strerror(errno));
    return -1;
}

/* Bind the packet socket to that interface and ethernet type */
memset(&sall, 0, sizeof(sall));
sall.sll_family = AF_PACKET;
sall.sll_protocol = htons(ETH_P_AOE);
sall.sll_ifindex = ifr.ifr_ifindex; /* from ioctl() */

if (bind(fd, (struct sockaddr *)&sall, sizeof(sall)) < 0)
{
    printf("Couldn't bind to device %s: %s\n", device, strerror(errno));
    return -1;
}

The ETH_P_AOE macro specifies the type of ethernet packets we'll be using. There are many types of ethernet packets in use; the most common is type 0x0800, which carries IPv4 (and with it TCP/IP). AOE is an ethernet protocol in its own right, with a type of 0x88a2. It is most definitely NOT a TCP/IP protocol, and thus does not use the familiar addressing scheme that we all have come to know and love (i.e. IP addresses and port numbers); instead it relies on MAC addresses. A MAC (Media Access Control) address is the hardware address of a network interface card. It is 48 bits long, giving about 281 trillion possible addresses, so for our purposes it is essentially unique to the machine, although with today's hardware it is possible to change the MAC address.

All ethernet packets (or frames, as they're often called) have a header that helps identify them and route them to the appropriate stack (the piece of software that processes the frames in and out). This header is 14 bytes long, and consists of the following:

Destination MAC address (6 bytes)
Source MAC (6 bytes)
Ethernet Type (2 bytes)

Because the destination MAC address always comes first, it's a trivial task for switches to forward the packet to the appropriate machine, and the ethernet type makes it simple to forward it to the appropriate processing stack on the machine.

This is called the 'ethernet header', and is universal to almost all protocols used with ethernet.
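The layout above can be sketched in C as follows. This is a minimal illustration only, not code from the qaoed sources: the names eth_hdr_fields and parse_eth_header are made up for this example. It copies the two MAC addresses out of a raw frame and reads the ethernet type, which is big-endian on the wire.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14

/* Hypothetical helper type: the three fields of the ethernet header. */
struct eth_hdr_fields
{
    uint8_t  dst[6];  /* destination MAC address */
    uint8_t  src[6];  /* source MAC address */
    uint16_t type;    /* ethernet type, converted to host byte order */
};

/* Returns 0 on success, -1 if the frame is too short to hold a header. */
int parse_eth_header(const uint8_t *frame, size_t len, struct eth_hdr_fields *out)
{
    if (frame == NULL || out == NULL || len < ETH_HDR_LEN)
        return -1;

    memcpy(out->dst, frame, 6);
    memcpy(out->src, frame + 6, 6);

    /* Bytes 12 and 13 hold the type, most significant byte first. */
    out->type = (uint16_t)((frame[12] << 8) | frame[13]);

    return 0;
}
```

For an AOE frame, out->type comes back as 0x88A2, which is what a receiver would check before handing the frame to the AOE code.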

The AOE protocol begins with the standard ethernet header, containing destination, source and a defined ethernet type of 0x88a2. Built on this is the AOE header. Depending on what's set in there, it is followed by either the AOE ATA header or the AOE CONFIG header. The reference document that defines all this can be found at http://www.coraid.com/site/co-pdfs/AoEr10.pdf. This is essential reading if you wish to work with the AOE protocol, or with software using this protocol.
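As a rough guide to what follows the ethernet header, here is a sketch of decoding the common AOE header. The field order (version and flags sharing one byte, then error, major/shelf, minor/slot, command and a 4-byte tag) follows the AoE specification referenced above; the struct and function names are invented for this example and do not come from the qaoed sources.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN 14
#define AOE_HDR_LEN 10  /* ver/flags, error, major, minor, command, tag */

/* Hypothetical helper type holding the decoded common AOE header. */
struct aoe_hdr_fields
{
    uint8_t  version;  /* upper 4 bits of the first byte */
    uint8_t  flags;    /* lower 4 bits: 0x8 = response, 0x4 = error */
    uint8_t  error;
    uint16_t shelf;    /* "major" address, big-endian on the wire */
    uint8_t  slot;     /* "minor" address */
    uint8_t  command;  /* 0 = ATA, 1 = query config information */
    uint32_t tag;
};

/* Returns 0 on success, -1 if the frame is too short or not an AOE frame. */
int parse_aoe_header(const uint8_t *frame, size_t len, struct aoe_hdr_fields *out)
{
    const uint8_t *p;

    if (frame == NULL || out == NULL || len < ETH_HDR_LEN + AOE_HDR_LEN)
        return -1;
    if (frame[12] != 0x88 || frame[13] != 0xA2)
        return -1;

    p = frame + ETH_HDR_LEN;
    out->version = (uint8_t)(p[0] >> 4);
    out->flags   = (uint8_t)(p[0] & 0x0F);
    out->error   = p[1];
    out->shelf   = (uint16_t)((p[2] << 8) | p[3]);
    out->slot    = p[4];
    out->command = p[5];
    out->tag     = ((uint32_t)p[6] << 24) | ((uint32_t)p[7] << 16) |
                   ((uint32_t)p[8] << 8)  |  (uint32_t)p[9];

    return 0;
}
```

Which header comes next (ATA or CONFIG) is decided by the command field, so a receiver would typically switch on out->command after this call.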

Under Solaris, things are done differently. You can send or receive ordinary packets through sockets, but to send or receive raw ethernet packets you must work with DLPI (Data Link Provider Interface). This API can be found on most 'big iron' Unix operating systems, including Solaris, HP-UX, and AIX.

There are a number of steps we have to go through before we get the ability to do this:

  • Open the interface
  • Attach to the interface (only needed if there is more than one instance of the interface, i.e. nxge0, nxge1, nxge2, nxge3)
        dl_attach_req_t dl_attach_req;
        union
        {
                dl_ok_ack_t dl_ok_ack;
                dl_error_ack_t dl_err_ack;
        } dl_status_ack;

        struct strbuf ctl;
        int flags;

        dl_attach_req.dl_primitive = DL_ATTACH_REQ;
        dl_attach_req.dl_ppa = unit;

        ctl.maxlen = 0;
        ctl.len = sizeof(dl_attach_req);
        ctl.buf = (char *)&dl_attach_req;
        flags = 0;

        if (putmsg(sock, &ctl, NULL, flags) < 0)
        {
                printf("dlattachreq putmsg failed: %s\n", strerror(errno));
                close(sock);

                return -1;
        }

        ctl.maxlen = sizeof(dl_status_ack);
        ctl.len = 0;
        ctl.buf = (char *)&dl_status_ack;
        flags = 0;

        if (getmsg(sock, &ctl, NULL, &flags) < 0)
        {
                printf("dlattachreq getmsg failed: %s\n", strerror(errno));
                close(sock);

                return -1;
        }
        if (ctl.len < (int)sizeof(dl_status_ack.dl_ok_ack))
        {
                printf("dlattachreq could not attach to device, %d/%d bytes\n", ctl.len, (int)sizeof(dl_status_ack.dl_ok_ack));
                close(sock);

                return -1;
        }

        if (dl_status_ack.dl_ok_ack.dl_primitive == DL_ERROR_ACK)
        {
                printf("DL_ATTACH error: %lu%s%s\n",
                        dl_status_ack.dl_err_ack.dl_errno,
                        dl_status_ack.dl_err_ack.dl_errno == DL_SYSERR ? ": " : "",
                        dl_status_ack.dl_err_ack.dl_errno == DL_SYSERR ? strerror(dl_status_ack.dl_err_ack.dl_unix_errno) : "");
                close(sock);

                return -1;
        }

        if (dl_status_ack.dl_ok_ack.dl_primitive != DL_OK_ACK)
        {
                printf("DL_ATTACH returned 0x%X\n", (unsigned int)dl_status_ack.dl_ok_ack.dl_primitive);
                close(sock);

                return -1;
        }

        if (dl_status_ack.dl_ok_ack.dl_correct_primitive != DL_ATTACH_REQ)
        {
                printf("Unexpected ACK for command 0x%X\n", (unsigned int)dl_status_ack.dl_ok_ack.dl_correct_primitive);
                close(sock);

                return -1;
        }
  • Bind to the interface (for raw packets, you need to bind to all types of ethernet packets, i.e. ether type 0x0000)
        dl_bind_req_t dl_bind_req;
        dl_bind_ack_t dl_bind_ack;

        struct strbuf ctl;

        int flags;

        memset(&dl_bind_req, 0, sizeof(dl_bind_req_t));
        dl_bind_req.dl_primitive = DL_BIND_REQ;
        dl_bind_req.dl_sap = proto;
        dl_bind_req.dl_service_mode = DL_CLDLS;

        ctl.maxlen = 0;
        ctl.len = sizeof(dl_bind_req_t);
        ctl.buf = (char *)&dl_bind_req;
        flags = 0;

        if (putmsg(sock, &ctl, NULL, flags) < 0)
        {
                printf("dlbindreq putmsg failed: %s\n", strerror(errno));
                close(sock);

                return -1;
        }

        ctl.maxlen = sizeof(dl_bind_ack_t);
        ctl.len = 0;
        ctl.buf = (char *)&dl_bind_ack;
        flags = 0;
        if (getmsg(sock, &ctl, NULL, &flags) < 0)
        {
                printf("dlbindreq getmsg failed: %s\n", strerror(errno));
                close(sock);

                return -1;
        }

        if (ctl.len < (int)sizeof(dl_bind_ack_t))
        {
                printf("dlbindreq could not bind to device: %d/%d bytes\n", ctl.len, (int)sizeof(dl_bind_ack_t));
                close(sock);

                return -1;
        }

        if (dl_bind_ack.dl_primitive != DL_BIND_ACK)
        {
                printf("DL_BIND_ACK returned 0x%X\n", (unsigned int)dl_bind_ack.dl_primitive);
                close(sock);

                return -1;
        }
  • Set the mode on the interface (for raw packets, PHYSICAL, SAP and MULTICAST are needed, look in arch/solaris_net.c)
        struct strbuf ctl;
        dl_promiscon_req_t dl_promiscon_req;
        union DL_primitives dlp;
        int flags;

        memset(&dl_promiscon_req, 0, sizeof(dl_promiscon_req_t));
        dl_promiscon_req.dl_primitive = DL_PROMISCON_REQ;

        switch (level)
        {
                case SAP:
                        dl_promiscon_req.dl_level = DL_PROMISC_SAP;
                        break;

                case MULTICAST:
                        dl_promiscon_req.dl_level = DL_PROMISC_MULTI;
                        break;

                case PHYSICAL:
                default: /* just in case... */
                        dl_promiscon_req.dl_level = DL_PROMISC_PHYS;
                        break;
        }

        ctl.maxlen = 0;
        ctl.len = sizeof(dl_promiscon_req_t);
        ctl.buf = (char *)&dl_promiscon_req;
        flags = 0;

        if (putmsg(sock, &ctl, NULL, flags) < 0)
        {
                printf("dlpromisconreq putmsg failed: %s\n", strerror(errno));
                close(sock);

                return -1;
        }

        memset(&dlp, 0, sizeof(union DL_primitives));
        ctl.maxlen = sizeof(union DL_primitives);
        ctl.len = 0;
        ctl.buf = (char *)&dlp;

        if (getmsg(sock, &ctl, NULL, &flags) < 0)
        {
                printf("dlpromisconreq getmsg failed: %s\n", strerror(errno));
                close(sock);

                return -1;
        }

        if (ctl.len > sizeof(union DL_primitives))
        {
                printf("dlpromisconreq could not set device mode: %d/%d bytes\n", ctl.len, (int)sizeof(union DL_primitives));
                close(sock);

                return -1;
        }
  • Put interface into raw mode (for sending and receiving raw packets)
        struct strioctl sioc;

        sioc.ic_cmd = DLIOCRAW;
        sioc.ic_timout = -1;
        sioc.ic_len = 0;
        sioc.ic_dp = 0;

        if (ioctl(sock, I_STR, &sioc) < 0)
        {
                printf("Could not set device into raw mode: %s\n", strerror(errno));
                return -1;
        }

It seems that this particular combination of steps is required. Any other combination we tried just doesn't work.
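The first step in the list, opening the interface, has no code above. On Solaris, a device name such as nxge3 is normally split into a device node (/dev/nxge) and a unit number (3); the node is opened with open(path, O_RDWR) and the unit number is what goes into dl_ppa in the attach request. The sketch below covers only the name splitting, which is plain C and can be tried on any platform. The function name split_dlpi_name is hypothetical, and arch/solaris_net.c may well do this differently.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split a device name such as "nxge3" into "/dev/nxge" and unit 3.
 * Only the trailing digits count as the unit, so "e1000g0" becomes
 * "/dev/e1000g" and unit 0.
 * Returns 0 on success, -1 on a malformed name or too small a buffer.
 */
int split_dlpi_name(const char *device, char *path, size_t pathlen, int *unit)
{
    size_t len, digits = 0, namelen;
    int n;

    if (device == NULL || path == NULL || unit == NULL)
        return -1;

    /* Count the digits at the end of the name. */
    len = strlen(device);
    while (digits < len && isdigit((unsigned char)device[len - 1 - digits]))
        digits++;

    /* Need at least one digit and at least one non-digit character. */
    if (digits == 0 || digits == len)
        return -1;

    namelen = len - digits;
    n = snprintf(path, pathlen, "/dev/%.*s", (int)namelen, device);
    if (n < 0 || (size_t)n >= pathlen)
        return -1;

    *unit = (int)strtol(device + namelen, NULL, 10);

    return 0;
}
```

With the path and unit in hand, the remaining steps (attach, bind, promiscuous modes, raw mode) follow in the order shown in the list.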

The AOE Protocol

TBD.

Current problems

Qaoed on Solaris has a problem reading packets off the interfaces, which needs tracking down. To facilitate this, I made a copy of the qaoed sources, renamed it sqaoed (single process qaoed), commented out all the pthreads code, and changed it to accept two parameters on the command line: the interface and the device to work with. This was successful, in that it can now read and write raw packets, which points to a problem with pthreads on the Solaris platform. Unfortunately, during testing it was found that packets sqaoed receives on Solaris from aoe kernel modules running on Linux hosts somehow seem to have their shelf / slot changed. Here is a sample dump of the aoe packets sent and received between sqaoed and the aoe kernel module:

Sending 32 bytes
FF FF FF FF FF FF 00 0C 29 A5 2B C9 88 A2 18 00 ........).+.....
00 2A 2A 01 00 00 00 00 00 14 40 0A 02 10 00 00 .**.......@.....
20 00 00 00 00 00 00 00 88 AA 06 08 00 00 00 00  ...............

The above packet is a broadcast sent from sqaoed on Solaris, and it appears to be correct.

This is the reply sqaoed gets in return

Received 60 bytes, Dumping 64 bytes from packet
00 0C 29 A5 2B C9 00 18 8B 85 55 C1 88 A2 10 00 ..).+.....U.....
00 2C 0A 00 01 E9 2F 87 00 00 01 EC 00 00 00 A0 .,..../.........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 EB 8F 00 32 ...............2

Uh-oh! This packet should have contained 0x2a and 0x2a as seen in the broadcast packet, yet they are different. Why?
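To make the mismatch concrete, the shelf and slot can be read straight out of the two dumps. The sketch below assumes the header layout from the AoE specification (14-byte ethernet header, one byte of version/flags, one byte of error, then a 2-byte shelf and a 1-byte slot); the function and array names are invented for this example, and the arrays hold only the first 20 bytes of each dump.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets into the raw frame, counted from the first byte of the
 * ethernet header: 14 (ethernet) + 1 (ver/flags) + 1 (error) = 16. */
#define AOE_SHELF_OFFSET 16
#define AOE_SLOT_OFFSET  18

/* First 20 bytes of the broadcast sent by sqaoed (from the dump above). */
static const uint8_t sent_frame[20] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x0C, 0x29, 0xA5,
    0x2B, 0xC9, 0x88, 0xA2, 0x18, 0x00, 0x00, 0x2A, 0x2A, 0x01
};

/* First 20 bytes of the reply received by sqaoed (from the dump above). */
static const uint8_t reply_frame[20] = {
    0x00, 0x0C, 0x29, 0xA5, 0x2B, 0xC9, 0x00, 0x18, 0x8B, 0x85,
    0x55, 0xC1, 0x88, 0xA2, 0x10, 0x00, 0x00, 0x2C, 0x0A, 0x00
};

/* Returns 0 on success, -1 if the frame is too short. */
int aoe_shelf_slot(const uint8_t *frame, size_t len,
                   unsigned int *shelf, unsigned int *slot)
{
    if (frame == NULL || shelf == NULL || slot == NULL ||
        len <= AOE_SLOT_OFFSET)
        return -1;

    /* The shelf is a 16-bit big-endian value, the slot a single byte. */
    *shelf = ((unsigned int)frame[AOE_SHELF_OFFSET] << 8) |
              (unsigned int)frame[AOE_SHELF_OFFSET + 1];
    *slot  = frame[AOE_SLOT_OFFSET];

    return 0;
}
```

Run against sent_frame this gives shelf 0x2A and slot 0x2A; against reply_frame it gives shelf 0x2C and slot 0x0A, which is the discrepancy described above.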

Apparently this is what happens when you use an obsolete version of the aoe kernel module. Upgrading to the latest version of the aoe module for Linux solves this problem (currently at version 62).

Recommendations

Always use aoe-version to determine the version of the aoe-tools and the aoe kernel module to ensure that this doesn't happen again.

Sqaoed

Sqaoed was the result of stripping out all the pthreads code and reordering parts of the qaoed sources. You can check out the sources from http://svn.fubra.com/storage/sqaoed/trunk, and experiment with it. It has been tested on both Solaris and Linux, on both x86 and UltraSPARC hardware.