wiki:EquallogicPS5000ETesting

Diary of an Equallogic PS5000E Tester

This page documents the process of testing Equallogic iSCSI storage.

Model Details

We have a PS5000E unit with 24TB disks on test. No documentation was provided with the unit though, grr...

Unboxing/Installation

The Equallogic unit came with quite a few components forming the racking kit. There were 2 different rail types, and bags of bolts, washers and cage nuts for seemingly any type of rack, all in neat little zippy bags with instruction leaflets. It was quite overwhelming at first, and just reading through the installation sheets took a good 30 minutes until I had worked it out. Installing the unit into the rack took another 10 minutes, for a total of 40 minutes.

We connected two power cables, and three network cables into one of the controllers (which turned out to be the backup controller once we powered the unit on).

Turning the unit on

The unit was powered up, came up ok with no warnings or beeps, quite smoothly all in all. Since we had no documentation I hoped it would come up and request a dhcp address but that didn't happen so we dug out a serial cable. Connected up to a trusty old Thinkpad, the console came up at 9600 8n1, sw/hw flow control off.

We were prompted for a username and password, which is a hard guess with no documentation. The default u/p was grpadmin/grpadmin, but it has been known for units to ship with grpadmin/eql.

Setting up the PS5000E

Factory Reset

First thing to do was factory reset the unit so we start from a known good position. There is a reset command in the CLI, but it requires that the serial cable is connected to the primary active controller on the back of the unit. This is where I discovered that we weren't and re-cabled to the other controller.

The reset command requires you to type "DeleteAllMyDataNow" at the prompt, which I thought was a nice touch in a way. It could save an accident I suppose, since doing a factory reset really does forget everything! It might be possible to recreate the exact same setup afterwards, but having played with it for a while I can see how you might need a bit of luck or internal development tools from Eql to recover.

group1> reset

Warning: This command resets an array to the factory
defaults (original condition).  The result is the
elimination of all group and volume configuration
information and any volume data residing on the array.

Before resetting an array that is a member of a group,
it is recommended that you delete the member from the
group.

Reset this array to factory defaults? [n/DeleteAllMyDataNow] DeleteAllMyDataNow
Resetting system, this will take a few minutes.
Deleting backup password files
Deleting agent.cnf
Zeroing drives, nvram and rebooting the system

Factory reset complete, and the unit reboots. Log back into the console and the next step is to run the setup command which asks you to set all sorts of IP addresses and passwords for the unit. I left passwords as default as possible so as to avoid confusion. In any case the unit would only be accessible via console cable or our ultra-private testing vlan.

  • Group Name: group1
  • Group IP Address: 10.100.0.100 (used for portal discovery and group operations)
  • Member Name: member1
  • Member IP Addresses
    • eth0:10.100.0.101/255.255.255.0 swport44/vlan102 untagged
    • eth1:10.100.0.102/255.255.255.0 swport43/vlan102 untagged
    • eth2:10.100.0.103/255.255.255.0 swport41/vlan102 untagged
    • No default gateway

Note that the Equallogic device automatically detects the mtu size on the network... This is quite cool - I'll sniff it later to see what it does!

Switch setup

The switch has two vlans, one for storage testing with jumbo frames enabled. Flow control is enabled for all ports participating in iSCSI traffic. The other vlan is for public communication with the servers, jumbos and flow control disabled.

HP ProCurve Switch 2848# sh ru

Running configuration:

; J4904A Configuration Editor; Created on release #I.10.43

hostname "HP ProCurve Switch 2848" 
max-vlans 256 
interface 2 
   flow-control 
exit
interface 4
   flow-control
exit
interface 41 
   flow-control 
exit
interface 43 
   flow-control 
exit
interface 44 
   flow-control 
exit
ip default-gateway 11.22.33.44 
ip timep manual 11.22.33.55
snmp-server community "public" Unrestricted 
vlan 1 
   name "DEFAULT_VLAN" 
   untagged 5-42,46-48 
   ip address 10.10.0.5 255.255.255.0 
   tagged 45 
   no untagged 1-4,43-44 
   exit 
vlan 110 
   name "public-core" 
   untagged 1,3 
   ip address 11.22.33.66 255.255.255.0 
   tagged 45 
   exit 
vlan 102 
   name "storagetest" 
   untagged 2,4,43-44 
   jumbo 
   exit 
password manager
password operator

Cabling the Equallogic unit

The unit has three ethernet ports per controller. eth0 on the unit is connected to port 43 of a Procurve 2848 switch untagged on the storage testing vlan with Jumbo frames enabled.

Client Setup

OK now we need some clients.

There are two HP DL360 G4 servers with 4GB ram, dual GigE nics.

  • eth0 is connected to the public testing vlan swport1/vlan110 untagged, 1500MTU
  • eth1 is connected to the storage testing vlan swport2/vlan102 untagged, 9000MTU and flow control enabled.

Enabling Flow Control on Linux with ethtool

Our NIC had flow control enabled by default. This can be verified using ethtool:

[root@dl360-1 ~]# ethtool eth1
Settings for eth1:
	Supported ports: [ MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: g
	Wake-on: d
	Current message level: 0x000000ff (255)
	Link detected: yes
[root@dl360-1 ~]# ethtool -a eth1
Pause parameters for eth1:
Autonegotiate:	on
RX:		on
TX:		on
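
If either of those had come back off, ethtool can switch pause frames on as well. A minimal sketch, assuming the interface is eth1 as above; pause_enabled is just a name made up here to parse the ethtool -a output, and the ethtool -A line needs root:

```shell
#!/bin/sh
# Read "ethtool -a" output on stdin; succeed only if both RX and TX pause are on.
pause_enabled() {
    awk '$1 == "RX:" && $2 == "on" { rx = 1 }
         $1 == "TX:" && $2 == "on" { tx = 1 }
         END { exit !(rx && tx) }'
}

# Example use on the client (assumed interface name, run as root):
#   ethtool -a eth1 | pause_enabled || ethtool -A eth1 rx on tx on
```

The switch port has to have flow control enabled too, as in the ProCurve configuration above, otherwise the pause frames go nowhere.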

Checking everything is up properly

Next check the interfaces have come up with the correct IP address, masks and mtu etc...

[root@dl360-1 ~]# ifconfig 
eth0      Link encap:Ethernet  HWaddr 00:17:08:2B:92:C9  
          inet addr:11.22.33.10  Bcast:11.22.33.255  Mask:255.255.255.0
          inet6 addr: fe80::217:8ff:fe2b:92c9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14661832 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8473831 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3164367574 (2.9 GiB)  TX bytes:2668286688 (2.4 GiB)
          Interrupt:21 

eth1      Link encap:Ethernet  HWaddr 00:16:35:3C:51:56  
          inet addr:10.100.0.10  Bcast:10.100.0.255  Mask:255.255.255.0
          inet6 addr: fe80::216:35ff:fe3c:5156/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:6203162 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6276677 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:526683357 (502.2 MiB)  TX bytes:2381418126 (2.2 GiB)
          Interrupt:22 

And ping the controller to make sure we have communications...

[root@dl360-1 ~]# ping -c 4 10.100.0.100
PING 10.100.0.100 (10.100.0.100) 56(84) bytes of data.
64 bytes from 10.100.0.100: icmp_seq=1 ttl=255 time=0.082 ms
64 bytes from 10.100.0.100: icmp_seq=2 ttl=255 time=0.069 ms
64 bytes from 10.100.0.100: icmp_seq=3 ttl=255 time=0.069 ms
64 bytes from 10.100.0.100: icmp_seq=4 ttl=255 time=0.072 ms

--- 10.100.0.100 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.069/0.073/0.082/0.005 ms
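
The default ping only sends 56 byte payloads, so it says nothing about jumbo frames. A quick extra check, sketched here rather than taken from the original test run, is to ping with fragmentation forbidden and a payload that exactly fills a 9000 byte MTU. This assumes IPv4 and the iputils ping that ships with CentOS; icmp_payload is a made-up helper name:

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Print the command to run against the group IP; "-M do" forbids fragmentation.
echo "ping -M do -c 4 -s $(icmp_payload 9000) 10.100.0.100"
```

If any hop on the storage vlan is still at 1500 MTU, that ping should fail instead of quietly fragmenting.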

Install the iSCSI Tools

Easy enough to do on CentOS 5, just install iscsi-initiator-utils using yum:

yum -y install iscsi-initiator-utils

Connecting to the PS5000E Admin Console

The Equallogic device has two main admin consoles. SSH provides access to the CLI, which is the easiest way to set it up. There is also a Java GUI. The Equallogic device is on a private vlan with jumbo frames, so you can either SSH to it from any machine on the storage vlan or, for the GUI, use firefox/jre on a host connected to the storage vlan, forwarded over an ssh session.

Warning though, running the GUI app like this makes it really slow!

Connecting to the Java GUI

Fire up firefox and go to the group ip address:

http://10.100.0.100

Follow the prompts, install jre as necessary, blah blah blah! Eventually you get into the interface using u/p grpadmin/grpadmin.

Initial configuration of Equallogic device

In the console you can view some member information:

group1> member show member1
_____________________________ Member Information ______________________________
Name: member1                          Status: online                         
TotalSpace: 6248.72GB                  UsedSpace: 0MB                         
SnapSpace: 0MB                         Description:                           
Def-Gateway: 10.100.0.1                Serial-Number: SHM94552C010788         
Disks: 16                              Spares: 2                              
Controllers: 2                         CacheMode: write-back                  
Connections: 0                         RaidStatus: ok                         
RaidPercentage: 0%                     LostBlocks: false                      
HealthStatus: normal                   LocateMember: disable                  
Controller-Safe: disabled              Low-Battery-Safe: enabled              
Version: V3.3.1 (R68703)               Delay-Data-Move: disable               
Pool: default                          Raid-policy: raid10                    
_______________________________________________________________________________

A few steps are necessary to finish setting up the Equallogic device. This first step has to be done via the Java GUI until I figure out how to do it via the CLI.

Configure the member storage

In the gui, under the Members tree, this member should be there with a cross against it. You need to go in here and configure the type of raid you want (5,10,50 options) and make it available.

Now we go back to CLI for everything!

Add a volume

SSH to the CLI and add a volume to test with:

group1> volume create volume1 100GB unrestricted thin-provision 
Volume creation succeeded.
iSCSI target name is iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1
group1> 
1010:154:member1:MgmtExec:19-Sep-2008 14:06:45.530155:VolInfo.cc:72:INFO:8.2.2:Volume 'volume1' successfully created.

View details for the new volume:

group1> volume show volume1
_____________________________ Volume Information ______________________________
Name: volume1                          Size: 100GB                            
VolReserve: 10GB                       VolReserveInUse: 0MB                   
ReplReserveInUse: 0MB                  iSCSI Alias: volume1                   
iSCSI Name:                            ActualMembers: 1                       
  iqn.2001-05.com.equallogic:0-8a0906- Snap-Warn: 10%                         
  3e57d6d03-7870000042e48d3a-volume1   Snap-Depletion: delete-oldest          
Description:                           Snap-Reserve: 100%                     
Snap-Reserve-Avail: 100% (10GB)        Permission: read-write                 
DesiredStatus: online                  Status: online                         
Connections: 0                         Snapshots: 0                           
Bind:                                  Type: not-replicated                   
ReplicationReserveSpace: 0MB           Replicas: 0                            
ReplicationPartner:                    Pool: default                          
Transmitted-Data: 0MB                  Received-Data: 0MB                     
Pref-Raid-Policy: none                 Pref-Raid-Policy-Status: none          
Thin-Provision: enabled                Thin-Min-Reserve: 10% (10GB)           
Thin-Growth-Warn: 60% (60GB)           Thin-Growth-Max: 100% (100GB)          
ReplicationTxData: 0MB                                                         
_______________________________________________________________________________

The volume was created. The default allocation policy pre-allocates 10GB (the 10% Thin-Min-Reserve), but none of it is used as yet.

We're good to go!

Connecting a Linux Client to the iSCSI volume

Run a sendtargets discovery against the group IP address:

[root@dl360-1 ~]# iscsiadm --mode discovery --type sendtargets --portal 10.100.0.100
10.100.0.100:3260,1 iqn.2001-05.com.equallogic:0-8a0906-ee37d6d03-a520000042948d13-volume1
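
Discovery on its own only records the target; the disk shows up in dmesg once the initiator has logged in to it. The login step is not shown in these notes, so the following is a sketch of one way to do it with iscsiadm, not necessarily what was run here. login_cmds is a hypothetical helper that turns discovery records into login commands:

```shell
#!/bin/sh
# Turn sendtargets discovery records ("ip:port,tpgt iqn") read on stdin
# into the matching iscsiadm login commands, one per target.
login_cmds() {
    awk 'NF == 2 {
        split($1, portal, ",")   # drop the ",1" target portal group tag
        printf "iscsiadm --mode node --targetname %s --portal %s --login\n", $2, portal[1]
    }'
}

# Discovery record copied from the output above.
echo '10.100.0.100:3260,1 iqn.2001-05.com.equallogic:0-8a0906-ee37d6d03-a520000042948d13-volume1' | login_cmds
```

Restarting the iscsi service, as done further down this page, also logs in to every discovered target.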

Then verify it was detected:

[root@dl360-1 ~]# dmesg
scsi0 : iSCSI Initiator over TCP/IP
scsi 0:0:0:0: Direct-Access     EQLOGIC  100E-00          3.3  PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 209725440 512-byte hardware sectors (107379 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 81 00 00 00
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 209725440 512-byte hardware sectors (107379 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 81 00 00 00
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sda: unknown partition table
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0

If you have multipath enabled and the device is not blacklisted, you might see it here:

[root@dl360-1 ~]# multipath -l
mpath0 (36090a038d0d637ee138d9442000020a5) dm-0 EQLOGIC,100E-00
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:0 sda 8:0   [active][undef]

Let's do some initial testing to make sure everything is ok

Put a filesystem on the device, time it:

[root@dl360-1 ~]# time mkfs.ext3 /dev/mpath/mpath0 
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
13123584 inodes, 26215680 blocks
1310784 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
801 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

real	0m16.301s
user	0m0.124s
sys	0m3.904s

In the eql console, see how much space has been allocated for an empty ext3 filesystem:

group1> show volume volume1 
_____________________________ Volume Information ______________________________
Name: volume1                          Size: 100GB                            
VolReserve: 23.39GB                    VolReserveInUse: 13.49GB               
ReplReserveInUse: 0MB                  iSCSI Alias: volume1                   
iSCSI Name:                            ActualMembers: 1                       
  iqn.2001-05.com.equallogic:0-8a0906- Snap-Warn: 10%                         
  3e57d6d03-7870000042e48d3a-volume1   Snap-Depletion: delete-oldest          
Description:                           Snap-Reserve: 100%                     
Snap-Reserve-Avail: 100% (23.39GB)     Permission: read-write                 
DesiredStatus: online                  Status: online                         
Connections: 1                         Snapshots: 0                           
Bind:                                  Type: not-replicated                   
ReplicationReserveSpace: 0MB           Replicas: 0                            
ReplicationPartner:                    Pool: default                          
Transmitted-Data: 0MB                  Received-Data: 1.7GB                   
Pref-Raid-Policy: none                 Pref-Raid-Policy-Status: none          
Thin-Provision: enabled                Thin-Min-Reserve: 10% (10GB)           
Thin-Growth-Warn: 60% (60GB)           Thin-Growth-Max: 100% (100GB)          
ReplicationTxData: 13.49GB                                                     
_______________________________________________________________________________

Looks like the format caused 13.5GB of space to be used! The allocation policy has grown the allocation to 23.4GB. I believe this is something to do with superblock backups. I see from the mkfs.ext3 manpage that there are a couple of related options (-Osparse_super and -S) that might make a difference. We'll test that next and see what side effects this might have.
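
As a rough cross-check on where the data went, the numbers in the mkfs.ext3 output can be added up. This is back-of-envelope arithmetic, not a measurement, and it assumes 128 byte inodes, which mkfs does not print (I believe it is the default for this version of mke2fs, but treat it as an assumption):

```shell
#!/bin/sh
# Numbers taken from the mkfs.ext3 output above.
INODES=13123584          # "13123584 inodes"
JOURNAL_BLOCKS=32768     # "Creating journal (32768 blocks)"
BLOCK_SIZE=4096          # "Block size=4096"
INODE_SIZE=128           # ASSUMPTION: not printed by mkfs

# Convert a byte count to whole MiB.
mib() { echo $(( $1 / 1048576 )); }

INODE_TABLE_MIB=$(mib $(( INODES * INODE_SIZE )))
JOURNAL_MIB=$(mib $(( JOURNAL_BLOCKS * BLOCK_SIZE )))

echo "inode tables: ${INODE_TABLE_MIB} MiB"
echo "journal:      ${JOURNAL_MIB} MiB"
echo "total:        $(( INODE_TABLE_MIB + JOURNAL_MIB )) MiB"
```

That comes to roughly 1.7GB, in line with the Received-Data figure, and the inode tables are spread over all 801 block groups rather than sitting in one place. My guess, and it is only a guess, is that the array allocates space in chunks much bigger than the writes themselves, which would explain 13.5GB being used for 1.7GB of data.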

Mount the filesystem:

[root@dl360-1 ~]# mkdir /mnt/volume1
[root@dl360-1 ~]# mount /dev/mpath/mpath0 /mnt/volume1
[root@dl360-1 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p3      31G  7.1G   23G  25% /
/dev/cciss/c0d0p1      99M   21M   73M  23% /boot
tmpfs                 3.0G     0  3.0G   0% /dev/shm
/dev/dm-0              99G  188M   94G   1% /mnt/volume1

Here we use the ddt tool used in our ataoe testing on this wiki, just to get a baseline measurement and make sure there are no major problems with the setup. The tool is a modification of dd written by Coraid.

We use a test size of 8GB to rule out the possibility of any pagecache effect on the host (which has 4GB ram).

[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.5290 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.5290 ... done.
8192 MiB  KiB/s    CPU% 
Write   108604     21
Read    113886      9

Also, interestingly, even more space has now been allocated, even though ddt deletes the data it creates. The volume now shows 20.57GB in use with 27.36GB reserved in total, for what is essentially still an empty disk. When you think about it, this makes sense as the storage array is a simple block device and has no way of knowing when a certain i/o operation caused a block of data to become unused. So there is no obvious way that the space can be automatically recovered - I'll discuss a possible strategy later and see how well it works.

group1> show volume volume1                                        
_____________________________ Volume Information ______________________________
Name: volume1                          Size: 100GB                            
VolReserve: 27.36GB                    VolReserveInUse: 20.57GB               
ReplReserveInUse: 0MB                  iSCSI Alias: volume1                   
iSCSI Name:                            ActualMembers: 1                       
  iqn.2001-05.com.equallogic:0-8a0906- Snap-Warn: 10%                         
  3e57d6d03-7870000042e48d3a-volume1   Snap-Depletion: delete-oldest          
Description:                           Snap-Reserve: 100%                     
Snap-Reserve-Avail: 100% (27.36GB)     Permission: read-write                 
DesiredStatus: online                  Status: online                         
Connections: 1                         Snapshots: 0                           
Bind:                                  Type: not-replicated                   
ReplicationReserveSpace: 0MB           Replicas: 0                            
ReplicationPartner:                    Pool: default                          
Transmitted-Data: 8.01GB               Received-Data: 9.72GB                  
Pref-Raid-Policy: none                 Pref-Raid-Policy-Status: none          
Thin-Provision: enabled                Thin-Min-Reserve: 10% (10GB)           
Thin-Growth-Warn: 60% (60GB)           Thin-Growth-Max: 100% (100GB)          
ReplicationTxData: 20.57GB                                                     
_______________________________________________________________________________

We want to see if the initial allocation can be reduced by supplying additional switches to mkfs.ext3. The man page says there is an option:

       -O feature[,...]
                   sparse_super
                          Create a filesystem with fewer superblock backup copies (saves space on large filesystems).

Create a couple more volumes

In the Equallogic console:

group1> volume create volume-th-1 100GB unrestricted thin-provision
Volume creation succeeded.
iSCSI target name is iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1
group1> 
1034:174:member1:MgmtExec:19-Sep-2008 14:44:29.520175:VolInfo.cc:72:INFO:8.2.2:Volume 'volume-th-1' successfully created.
                                                           
group1> volume create volume-th-2 100GB unrestricted thin-provision
Volume creation succeeded.
iSCSI target name is iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448d3a-volume-th-2
group1> 
1036:176:member1:MgmtExec:19-Sep-2008 14:44:42.520177:VolInfo.cc:72:INFO:8.2.2:Volume 'volume-th-2' successfully created.

Discover the new volumes on the initiator

[root@dl360-1 ~]# service iscsi restart
Logout session [sid: 5, target: iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1, portal: 10.100.0.100,3260]
Stopping iSCSI daemon: /etc/init.d/iscsi: line 33:  7097 Killed                  /etc/init.d/iscsid stop
iscsid dead but pid file exists                            [  OK  ]
Turning off network shutdown. Starting iSCSI daemon:       [  OK  ]
                                                           [  OK  ]
Setting up iSCSI targets: Login session [iface: default, target: iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1, portal: 10.100.0.10
Login session [iface: default, target: iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448d3a-volume-th-2, portal: 10.100.0.100,3260]
Login session [iface: default, target: iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1, portal: 10.100.0.100,3260]
                                                           [  OK  ]
[root@dl360-1 ~]# dmesg
scsi5 : iSCSI Initiator over TCP/IP
scsi 5:0:0:0: Direct-Access     EQLOGIC  100E-00          3.3  PQ: 0 ANSI: 5
sd 5:0:0:0: [sdb] 209725440 512-byte hardware sectors (107379 MB)
sd 5:0:0:0: [sdb] Write Protect is off
sd 5:0:0:0: [sdb] Mode Sense: 81 00 00 00
sd 5:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 5:0:0:0: [sdb] 209725440 512-byte hardware sectors (107379 MB)
sd 5:0:0:0: [sdb] Write Protect is off
sd 5:0:0:0: [sdb] Mode Sense: 81 00 00 00
sd 5:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sdb: unknown partition table
sd 5:0:0:0: [sdb] Attached SCSI disk
sd 5:0:0:0: Attached scsi generic sg0 type 0
scsi6 : iSCSI Initiator over TCP/IP
scsi 6:0:0:0: Direct-Access     EQLOGIC  100E-00          3.3  PQ: 0 ANSI: 5
sd 6:0:0:0: [sda] 209725440 512-byte hardware sectors (107379 MB)
sd 6:0:0:0: [sda] Write Protect is off
sd 6:0:0:0: [sda] Mode Sense: 81 00 00 00
sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 6:0:0:0: [sda] 209725440 512-byte hardware sectors (107379 MB)
sd 6:0:0:0: [sda] Write Protect is off
sd 6:0:0:0: [sda] Mode Sense: 81 00 00 00
sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sda: unknown partition table
sd 6:0:0:0: [sda] Attached SCSI disk
sd 6:0:0:0: Attached scsi generic sg1 type 0
scsi7 : iSCSI Initiator over TCP/IP
scsi 7:0:0:0: Direct-Access     EQLOGIC  100E-00          3.3  PQ: 0 ANSI: 5
sd 7:0:0:0: [sdc] 209725440 512-byte hardware sectors (107379 MB)
sd 7:0:0:0: [sdc] Write Protect is off
sd 7:0:0:0: [sdc] Mode Sense: 81 00 00 00
sd 7:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 7:0:0:0: [sdc] 209725440 512-byte hardware sectors (107379 MB)
sd 7:0:0:0: [sdc] Write Protect is off
sd 7:0:0:0: [sdc] Mode Sense: 81 00 00 00
sd 7:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sdc: unknown partition table
sd 7:0:0:0: [sdc] Attached SCSI disk
sd 7:0:0:0: Attached scsi generic sg2 type 0
[root@dl360-1 ~]# multipath -l
mpath2 (36090a038d0d697cc3a8d44430000903a) dm-1 EQLOGIC,100E-00
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 6:0:0:0 sda 8:0   [active][undef]
mpath1 (36090a038d0d6573e3a8de44200007078) dm-0 EQLOGIC,100E-00
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
 \_ 5:0:0:0 sdb 8:16  [active][undef]
mpath3 (36090a038d0d6c7cb3a8d1443000090ae) dm-2 EQLOGIC,100E-00
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 7:0:0:0 sdc 8:32  [active][undef]

This means that:

  • /dev/mpath/mpath1 = volume1
  • /dev/mpath/mpath2 = volume-th-2
  • /dev/mpath/mpath3 = volume-th-1
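
The mapping can be cross-checked from the output above without guessing. In the listings on this page, the last hex field of each iSCSI target name appears byte-reversed at the tail of the corresponding multipath WWID. That is only a pattern observed in this data, not documented Equallogic behaviour as far as I know, so treat the following as a sanity check and nothing more. Both function names are made up for the sketch:

```shell
#!/bin/sh
# Reverse a hex string byte by byte (two hex digits at a time).
reverse_bytes() {
    echo "$1" | awk '{ out = ""
        for (i = 1; i < length($0); i += 2) out = substr($0, i, 2) out
        print out }'
}

# Succeed if the WWID ($1) ends with the byte-reversed IQN field ($2).
wwid_matches_iqn_field() {
    case "$1" in
        *"$(reverse_bytes "$2")") return 0 ;;
        *) return 1 ;;
    esac
}

# mpath2's WWID against the last hex field of the volume-th-2 target name.
wwid_matches_iqn_field 36090a038d0d697cc3a8d44430000903a 3a90000043448d3a \
    && echo "mpath2 looks like volume-th-2"
```

For anything that matters, iscsiadm -m session -P 3 is the authoritative way to see which disk belongs to which target.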

-Osparse_super Option

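As you can see from the output of show volume volume-th-2, this option makes no difference: VolReserveInUse is 13.43GB, against 13.49GB for volume1 formatted with the defaults.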

_____________________________ Volume Information ______________________________
Name: volume-th-2                      Size: 100GB                            
VolReserve: 19.57GB                    VolReserveInUse: 13.43GB               
ReplReserveInUse: 0MB                  iSCSI Alias: volume-th-2               
iSCSI Name:                            ActualMembers: 1                       
  iqn.2001-05.com.equallogic:0-8a0906- Snap-Warn: 10%                         
  cc97d6d03-3a90000043448d3a-volume-th Snap-Depletion: delete-oldest          
  -2                                   Description:                           
Snap-Reserve: 100%                     Snap-Reserve-Avail: 100% (19.57GB)     
Permission: read-write                 DesiredStatus: online                  
Status: online                         Connections: 1                         
Snapshots: 0                           Bind:                                  
Type: not-replicated                   ReplicationReserveSpace: 0MB           
Replicas: 0                            ReplicationPartner:                    
Pool: default                          Transmitted-Data: 0MB                  
Received-Data: 1.7GB                   Pref-Raid-Policy: none                 
Pref-Raid-Policy-Status: none          Thin-Provision: enabled                
Thin-Min-Reserve: 10% (10GB)           Thin-Growth-Warn: 60% (60GB)           
Thin-Growth-Max: 100% (100GB)          ReplicationTxData: 13.43GB             
_______________________________________________________________________________
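
With hindsight the lack of any difference is not surprising. The superblock backup list printed by the first mkfs.ext3 run already looks like the sparse layout, where backups live only in block group 1 and in groups that are powers of 3, 5 and 7. A small check of that list, using the blocks-per-group figure from the same output (is_sparse_group is a made-up helper name):

```shell
#!/bin/sh
# Succeed if $1 is 1 or an exact power of 3, 5 or 7, which are the only
# block groups that carry a superblock backup when sparse_super is on.
is_sparse_group() {
    n=$1
    [ "$n" -eq 1 ] && return 0
    for base in 3 5 7; do
        p=$base
        while [ "$p" -lt "$n" ]; do p=$(( p * base )); done
        [ "$p" -eq "$n" ] && return 0
    done
    return 1
}

BLOCKS_PER_GROUP=32768   # "32768 blocks per group" in the mkfs.ext3 output
for blk in 32768 98304 163840 229376 294912 819200 884736 1605632 2654208 \
           4096000 7962624 11239424 20480000 23887872; do
    grp=$(( blk / BLOCKS_PER_GROUP ))
    if is_sparse_group "$grp"; then verdict=sparse; else verdict=dense; fi
    echo "block $blk -> group $grp ($verdict)"
done
```

Every backup lands in a sparse group, so sparse_super was evidently on by default and passing it explicitly changes nothing.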

Poking around the iSCSI layer

The first thing that struck me was that with all these arcane target names it would be easy to get devices mixed up and potentially do some serious damage.

In a system with a lot of volumes it is important to know what targets have been discovered and which system block devices they are installed to. This is handled nicely by the iscsiadm command, which has 4 levels of verbosity, 0-3. With a verbosity of 3 it gives all the information you could possibly want about a volume:

[root@dl360-1 ~]# /sbin/iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-724
iscsiadm version 2.0-865
Target: iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1
	Current Portal: 10.100.0.102:3260,1
	Persistent Portal: 10.100.0.100:3260,1
		**********
		Interface:
		**********
		Iface Name: default
		Iface Transport: tcp
		Iface IPaddress: default
		Iface HWaddress: default
		Iface Netdev: default
		SID: 12
		iSCSI Connection State: LOGGED IN
		Internal iscsid Session State: NO CHANGE
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 131072
		MaxXmitDataSegmentLength: 65536
		FirstBurstLength: 65536
		MaxBurstLength: 262144
		ImmediateData: Yes
		InitialR2T: No
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 11	State: running
		scsi11 Channel 00 Id 0 Lun: 0
			Attached scsi disk sdc		State: running
--SNIP--

Despite the huge amount of very useful information here, it can be a little unwieldy, so I knocked up a little Perl script to display iSCSI device information in a more minimal way:

#!/usr/bin/perl -w

use strict;

# Get our map out
my $iscsiadm_cmd = '/sbin/iscsiadm -m session -P 3';
open(MAP, "$iscsiadm_cmd 2>/dev/null |") or die "Cannot run $iscsiadm_cmd: $!\n";

# Start parsing it
my @t;
my ($targetname, $targetid, $targetportal, $targetdevice);

# We look for the following lines
# Target: iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1
#       Current Portal: 10.100.0.101:3260,1
#                       Attached scsi disk sdd

while (<MAP>) {
        if ( /^Target: ([a-z0-9\.\-]+:[0-9a-z]+-[0-9a-z]+-[0-9a-z]+-[0-9a-z]+-(.*)$)/ ) {
                $targetid = $1;
                $targetname = $2;
        }

        if ( /Current Portal: (.*):/) {
                $targetportal = $1;
        }

        if ( /Attached scsi disk (sd[a-z]+)/ ) {
                 $targetdevice = $1;

                # Now we have everything for this target!
                my %target = (
                                'name' => $targetname,
                                'id'   => $targetid,
                                'portal' => $targetportal,
                                'device' => '/dev/' . $targetdevice
                                );
                push @t, \%target;
        }
}
close MAP;

# Unravel and print the map
print 'Discovered ' . scalar(@t) . " targets:\n";

for (my $i=0; $i < scalar(@t); $i++) {
        print "$t[$i]{'name'} -> $t[$i]{'device'} ($t[$i]{'portal'}:$t[$i]{'id'})\n";
}

When you run it on both hosts, it shows this:

[root@dl360-1 ~]# ./idit.pl 
Discovered 4 targets:
volume1 -> /dev/sda (10.100.0.102:iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1)
volume-th-2 -> /dev/sdb (10.100.0.103:iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448d3a-volume-th-2)
volume-th-1 -> /dev/sdc (10.100.0.102:iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1)
volume-th-3 -> /dev/sdd (10.100.0.103:iqn.2001-05.com.equallogic:0-8a0906-8447d6d03-5630000043748d3b-volume-th-3)

[root@dl360-2 ~]# ./idit.pl 
Discovered 4 targets:
volume-th-3 -> /dev/sda (10.100.0.101:iqn.2001-05.com.equallogic:0-8a0906-8447d6d03-5630000043748d3b-volume-th-3)
volume-th-2 -> /dev/sdb (10.100.0.102:iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448d3a-volume-th-2)
volume1 -> /dev/sdc (10.100.0.103:iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1)
volume-th-1 -> /dev/sdd (10.100.0.101:iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1)

This is the vital data you need to see at first glance.

There is a friendly name which was assigned at volume creation time on the PS5000E. Within the iSCSI system it is identified by its iSCSI Name, such as:

  • iqn.2001-05.com.equallogic:0-8a0906-8447d6d03-5630000043748d3b-volume-th-3

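Linux maps this to a "real scsi device" at /dev/sdX, where X is the next available letter in the alphabet. So the first detected disk is installed to /dev/sda. As the two listings above show, the same volume can land on a different letter on each host (volume1 is /dev/sda on dl360-1 but /dev/sdc on dl360-2), so the device name alone is not a safe way to identify a volume.

Each of these volumes is discovered from the "group" portal ip address of 10.100.0.100.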

Load Balancing

This is where things get clever. The PS5000E does its best to balance the iSCSI load across as many interfaces as possible. From an initiator point of view this means that the discovered volumes are equally distributed across all three available ip addresses (which correspond to the ethernet ports on the PS5000E). In addition to this, when multiple initiators discover the same device, the PS5000E keeps track and rotates the targets to make sure that corresponding volumes are advertised on different interfaces. Dumb in one way in that I'm not sure if it is possible to manually modify the affinity, but still very clever and no doubt provides decent load balancing.
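
To see how even the spread really is, the idit.pl output can be tallied per portal address. A small sketch, assuming the output format shown above; sessions_per_portal is a name invented for this example:

```shell
#!/bin/sh
# Count targets per portal IP from idit.pl style lines read on stdin:
#   name -> /dev/sdX (portal_ip:iqn...)
sessions_per_portal() {
    sed -n 's/^.* (\([0-9.]*\):iqn.*$/\1/p' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Output of idit.pl on dl360-1, copied from above.
sessions_per_portal <<'EOF'
Discovered 4 targets:
volume1 -> /dev/sda (10.100.0.102:iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1)
volume-th-2 -> /dev/sdb (10.100.0.103:iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448d3a-volume-th-2)
volume-th-1 -> /dev/sdc (10.100.0.102:iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1)
volume-th-3 -> /dev/sdd (10.100.0.103:iqn.2001-05.com.equallogic:0-8a0906-8447d6d03-5630000043748d3b-volume-th-3)
EOF
```

On dl360-1 that shows two targets each on 10.100.0.102 and 10.100.0.103 and none on 10.100.0.101, so the balance is across the group as a whole rather than perfectly even per host.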

Kernel Tweaks

Added the following kernel tweaks to improve tcp throughput:

# increase TCP maximum buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# increase Linux autotuning TCP buffer limits
# min, default, and maximum number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216
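
These notes don't record where the settings were put. The usual home on CentOS is /etc/sysctl.conf, loaded with sysctl -p. To try values at runtime before making them permanent, something like the following could be used; sysctl_cmds is a made-up helper that just prints the equivalent sysctl -w commands and changes nothing by itself:

```shell
#!/bin/sh
# Convert sysctl.conf style "key = value" lines on stdin into "sysctl -w"
# commands, skipping comments and blank lines.
sysctl_cmds() {
    sed -e 's/[[:space:]]*$//' -e '/^#/d' -e '/^$/d' \
        -e 's/^\([^ =]*\)[[:space:]]*=[[:space:]]*\(.*\)$/sysctl -w \1="\2"/'
}

sysctl_cmds <<'EOF'
# increase TCP maximum buffer size
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
EOF
```

Run the printed commands as root, then re-run the benchmark to see whether the change is worth keeping.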

Multipathing

The following link was very helpful:  http://linfrastructure.blogspot.com/2008/02/multipath-and-equallogic-iscsi.html

It is possible to use multipathing, but it seems not all Linux initiators can do it. Fortunately we are running on CentOS 5.1, so our tools have full support. Let's try it...

OK, without multipath enabled:

[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.30322 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.30322 ... done.
8192 MiB  KiB/s    CPU% 
Write   109963     20
Read    114103      9

1st run with multipath enabled:

[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.32146 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.32146 ... done.
8192 MiB  KiB/s    CPU% 
Write   129769     25
Read    114180      8

It is faster, at least for writes, but I suspect there is a problem here. Time to investigate...

Flow control was not enabled on the switch for the second port on the client. Fixed, but it made no difference:

[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.428 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.428 ... done.
8192 MiB  KiB/s    CPU% 
Write   128226     25
Read    114336      8

Next I took the machine down for a reboot... When it booted back up, we had the following paths (we use mpath1):

[root@dl360-1 ~]# multipath -ll
mpath2 (36090a038d0d697cc3a8d44430000903a) dm-1 EQLOGIC,100E-00
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][enabled]
 \_ 3:0:0:0 sdd 8:48  [active][ready]
 \_ 2:0:0:0 sdc 8:32  [active][ready]
mpath1 (36090a038d0d6573e3a8de44200007078) dm-0 EQLOGIC,100E-00
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][active]
 \_ 1:0:0:0 sdb 8:16  [active][ready]
 \_ 0:0:0:0 sda 8:0   [active][ready]
mpath4 (36090a038d0d647843b8d744300003056) dm-3 EQLOGIC,100E-00
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][enabled]
 \_ 7:0:0:0 sdh 8:112 [active][ready]
 \_ 6:0:0:0 sdg 8:96  [active][ready]
mpath3 (36090a038d0d6c7cb3a8d1443000090ae) dm-2 EQLOGIC,100E-00
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][enabled]
 \_ 5:0:0:0 sdf 8:80  [active][ready]
 \_ 4:0:0:0 sde 8:64  [active][ready]

And the following performance under ddt:

[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.4016 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.4016 ... done.
8192 MiB  KiB/s    CPU% 
Write   160393     39
Read    112530      8

So now that's more like it. 160MB/s writes... Nice. But the reads still seem broken. iostat reports that I/O is taking place across both paths, but each path is transferring slowly and the total is nowhere near the >200MB/s I would expect.
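
To put those numbers next to what the wire can carry, here is the arithmetic as a small sketch. It converts the ddt KiB/s figures to MB/s and compares them with the raw ceiling of one and two gigabit links, ignoring all protocol overhead. Both helper names are invented for the example:

```shell
#!/bin/sh
# Convert a ddt figure in KiB/s to MB/s (10^6 bytes per second).
kib_to_mb() {
    awk -v k="$1" 'BEGIN { printf "%.1f\n", k * 1024 / 1000000 }'
}

# Raw ceiling of N gigabit links in MB/s, ignoring protocol overhead.
gige_ceiling_mb() {
    echo $(( $1 * 1000 / 8 ))
}

echo "write: $(kib_to_mb 160393) MB/s"
echo "read:  $(kib_to_mb 112530) MB/s"
echo "one link: $(gige_ceiling_mb 1) MB/s, two links: $(gige_ceiling_mb 2) MB/s"
```

The writes are clearly beyond what one link can carry, while the reads sit just under the single link ceiling. That is what you would expect to see if something in the read path is still capped at one gigabit, though it does not say where.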

Performance Testing