Diary of an Equallogic PS5000E Tester
This page documents the process of testing Equallogic iSCSI storage.
Model Details
We have a PS5000E unit with 24TB disks on test. No documentation was provided with the unit though, grr...
Unboxing/Installation
The Equallogic unit came with quite a few components forming the racking kit. There were 2 different rail types, and bags of bolts, washers and cage nuts for seemingly any type of rack. All in neat little zippy bags with instruction leaflets. It was quite overwhelming at first and just reading through the installation sheets took a good 30 mins until I had worked it out. Installation of the unit into the rack took another 10m for a total of 40m installation time.
We connected two power cables and three network cables, the latter all into one of the controllers (which turned out to be the backup controller once we powered the unit on).
Turning the unit on
The unit was powered up, came up ok with no warnings or beeps, quite smoothly all in all. Since we had no documentation I hoped it would come up and request a dhcp address but that didn't happen so we dug out a serial cable. Connected up to a trusty old Thinkpad, the console came up at 9600 8n1, sw/hw flow control off.
Prompted for username and password, with no documentation it was a hard guess. The default u/p was grpadmin/grpadmin, but it has been known for units to ship with grpadmin/eql.
Setting up the PS5000E
Factory Reset
First thing to do was factory reset the unit so we start from a known good position. There is a reset command in the CLI, but it requires that the serial cable is connected to the primary active controller on the back of the unit. This is where I discovered that we weren't connected to it, so I re-cabled to the other controller.
The reset command requires you to type "DeleteAllMyDataNow", which I thought was a nice touch; it could save an accident, since a factory reset really does forget everything! It might be possible to recreate the exact same setup afterwards, but having played with it for a while I can see how you might need a bit of luck or internal development tools from Eql to recover.
group1> reset
Warning: This command resets an array to the factory defaults (original condition).
The result is the elimination of all group and volume configuration information and any volume data residing on the array.
Before resetting an array that is a member of a group, it is recommended that you delete the member from the group.
Reset this array to factory defaults? [n/DeleteAllMyDataNow] DeleteAllMyDataNow
Resetting system, this will take a few minutes.
Deleting backup password files
Deleting agent.cnf
Zeroing drives, nvram and rebooting the system
Factory reset complete, and the unit reboots. Log back into the console and the next step is to run the setup command which asks you to set all sorts of IP addresses and passwords for the unit. I left passwords as default as possible so as to avoid confusion. In any case the unit would only be accessible via console cable or our ultra-private testing vlan.
- Group Name: group1
- Group IP Address: 10.100.0.100 (used for portal discovery and group operations)
- Member Name: member1
- Member IP Addresses
- eth0:10.100.0.101/255.255.255.0 swport44/vlan102 untagged
- eth1:10.100.0.102/255.255.255.0 swport43/vlan102 untagged
- eth2:10.100.0.103/255.255.255.0 swport41/vlan102 untagged
- No default gateway
Note that the Equallogic device automatically detects the mtu size on the network... This is quite cool - I'll sniff it later to see what it does!
Switch setup
The switch has two vlans, one for storage testing with jumbo frames enabled. Flow control is enabled for all ports participating in iSCSI traffic. The other vlan is for public communication with the servers, jumbos and flow control disabled.
HP ProCurve Switch 2848# sh ru
Running configuration:
; J4904A Configuration Editor; Created on release #I.10.43
hostname "HP ProCurve Switch 2848"
max-vlans 256
interface 2
   flow-control
exit
interface 4
   flow-control
exit
interface 41
   flow-control
exit
interface 43
   flow-control
exit
interface 44
   flow-control
exit
ip default-gateway 11.22.33.44
ip timep manual 11.22.33.55
snmp-server community "public" Unrestricted
vlan 1
   name "DEFAULT_VLAN"
   untagged 5-42,46-48
   ip address 10.10.0.5 255.255.255.0
   tagged 45
   no untagged 1-4,43-44
   exit
vlan 110
   name "public-core"
   untagged 1,3
   ip address 11.22.33.66 255.255.255.0
   tagged 45
   exit
vlan 102
   name "storagetest"
   untagged 2,4,43-44
   jumbo
   exit
password manager
password operator
Cabling the Equallogic unit
The unit has three ethernet ports per controller. eth0 on the unit is connected to port 43 of a Procurve 2848 switch untagged on the storage testing vlan with Jumbo frames enabled.
Client Setup
OK now we need some clients.
There are two HP DL360 G4 servers with 4GB ram, dual GigE nics.
- eth0 is connected to the public testing vlan swport1/vlan110 untagged, 1500MTU
- eth1 is connected to the storage testing vlan swport2/vlan102 untagged, 9000MTU and flow control enabled.
Enabling Flow Control on Linux with ethtool
Our NIC had flow control enabled by default. This can be verified using ethtool:
[root@dl360-1 ~]# ethtool eth1
Settings for eth1:
Supported ports: [ MII ]
Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Current message level: 0x000000ff (255)
Link detected: yes
[root@dl360-1 ~]# ethtool -a eth1
Pause parameters for eth1:
Autonegotiate: on
RX: on
TX: on
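The check above can be scripted so it is easy to repeat on every client. The following is only a sketch: the function name pause_enabled is made up for illustration, and it assumes the "RX:" / "TX:" line format shown in the ethtool -a output above.

```shell
# Report whether both RX and TX pause frames are on,
# given `ethtool -a <iface>` output on stdin.
pause_enabled() {
    awk '$1 == "RX:" { rx = $2 }
         $1 == "TX:" { tx = $2 }
         END { if (rx == "on" && tx == "on") print "yes"; else print "no" }'
}

# Typical use on a client (needs root):
#   ethtool -a eth1 | pause_enabled
#   ethtool -A eth1 rx on tx on    # turn pause frames on if the answer is "no"
```

Whether ethtool -A succeeds depends on the NIC driver, so it is worth re-running the check afterwards.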
Checking everything is up properly
Next check the interfaces have come up with the correct IP address, masks and mtu etc...
[root@dl360-1 ~]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:17:08:2B:92:C9
inet addr:11.22.33.10 Bcast:11.22.33.255 Mask:255.255.255.0
inet6 addr: fe80::217:8ff:fe2b:92c9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:14661832 errors:0 dropped:0 overruns:0 frame:0
TX packets:8473831 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3164367574 (2.9 GiB) TX bytes:2668286688 (2.4 GiB)
Interrupt:21
eth1 Link encap:Ethernet HWaddr 00:16:35:3C:51:56
inet addr:10.100.0.10 Bcast:10.100.0.255 Mask:255.255.255.0
inet6 addr: fe80::216:35ff:fe3c:5156/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:6203162 errors:0 dropped:0 overruns:0 frame:0
TX packets:6276677 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:526683357 (502.2 MiB) TX bytes:2381418126 (2.2 GiB)
Interrupt:22
And ping the controller to make sure we have communications...
[root@dl360-1 ~]# ping -c 4 10.100.0.100
PING 10.100.0.100 (10.100.0.100) 56(84) bytes of data.
64 bytes from 10.100.0.100: icmp_seq=1 ttl=255 time=0.082 ms
64 bytes from 10.100.0.100: icmp_seq=2 ttl=255 time=0.069 ms
64 bytes from 10.100.0.100: icmp_seq=3 ttl=255 time=0.069 ms
64 bytes from 10.100.0.100: icmp_seq=4 ttl=255 time=0.072 ms
--- 10.100.0.100 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.069/0.073/0.082/0.005 ms
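A default ping only sends 56 bytes of payload, so it does not prove that jumbo frames pass end to end. One way to check, sketched here with made-up helper names, is to ping with fragmentation forbidden and a payload sized to fill the MTU exactly (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header). The sketch only prints the command rather than running it.

```shell
# Largest ICMP payload that fits in one frame for a given MTU.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print a ping command that sets "do not fragment" (-M do), so it only
# succeeds if every hop really passes frames of the given MTU.
jumbo_ping_cmd() {
    echo "ping -M do -c 4 -s $(icmp_payload_for_mtu "$2") $1"
}

jumbo_ping_cmd 10.100.0.100 9000
```

If the printed command fails while a normal ping works, something on the path (NIC, switch port or vlan) is probably not set up for jumbo frames.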
Install the iSCSI Tools
Easy enough to do on CentOS 5, just install iscsi-initiator-utils using yum:
yum -y install iscsi-initiator-utils
Connecting to the PS5000E Admin Console
The Equallogic device has two main admin consoles. SSH provides access to the CLI, which is the easiest way to set it up. There is also a Java GUI. The Equallogic device is on a private vlan with jumbo frames, therefore you can either SSH to it from any machine on the storage vlan, or for the GUI use firefox/jre on a host connected to the storage vlan, forwarded over an ssh session.
Warning though, running the GUI app like this makes it really slow!
Connecting to the Java GUI
Fire up firefox and go to the group ip address:
http://10.100.0.100
Follow the prompts, install jre as necessary, blah blah blah! Eventually you get into the interface using u/p grpadmin/grpadmin.
Initial configuration of Equallogic device
In the console you can view some member information:
group1> member show member1
_____________________________ Member Information ______________________________
Name: member1                           Status: online
TotalSpace: 6248.72GB                   UsedSpace: 0MB
SnapSpace: 0MB                          Description:
Def-Gateway: 10.100.0.1                 Serial-Number: SHM94552C010788
Disks: 16                               Spares: 2
Controllers: 2                          CacheMode: write-back
Connections: 0                          RaidStatus: ok
RaidPercentage: 0%                      LostBlocks: false
HealthStatus: normal                    LocateMember: disable
Controller-Safe: disabled               Low-Battery-Safe: enabled
Version: V3.3.1 (R68703)                Delay-Data-Move: disable
Pool: default                           Raid-policy: raid10
_______________________________________________________________________________
A few steps are necessary to finish setting up the Equallogic device. The first step has to be done via the Java GUI until I figure out how to do it via the CLI.
Configure the member storage
In the gui, under the Members tree, this member should be there with a cross against it. You need to go in here and configure the type of raid you want (5,10,50 options) and make it available.
Now we go back to CLI for everything!
Add a volume
SSH to the CLI and add a volume to test with:
group1> volume create volume1 100GB unrestricted thin-provision
Volume creation succeeded.
iSCSI target name is iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48
d3a-volume1
group1>
1010:154:member1:MgmtExec:19-Sep-2008 14:06:45.530155:VolInfo.cc:72:INFO:8.2.2:Volume 'volume1' successfully created.
View details for the new volume:
group1> volume show volume1
_____________________________ Volume Information ______________________________
Name: volume1                           Size: 100GB
VolReserve: 10GB                        VolReserveInUse: 0MB
ReplReserveInUse: 0MB                   iSCSI Alias: volume1
iSCSI Name:                             ActualMembers: 1
  iqn.2001-05.com.equallogic:0-8a0906-  Snap-Warn: 10%
  3e57d6d03-7870000042e48d3a-volume1    Snap-Depletion: delete-oldest
Description:                            Snap-Reserve: 100%
Snap-Reserve-Avail: 100% (10GB)         Permission: read-write
DesiredStatus: online                   Status: online
Connections: 0                          Snapshots: 0
Bind:                                   Type: not-replicated
ReplicationReserveSpace: 0MB            Replicas: 0
ReplicationPartner:                     Pool: default
Transmitted-Data: 0MB                   Received-Data: 0MB
Pref-Raid-Policy: none                  Pref-Raid-Policy-Status: none
Thin-Provision: enabled                 Thin-Min-Reserve: 10% (10GB)
Thin-Growth-Warn: 60% (60GB)            Thin-Growth-Max: 100% (100GB)
ReplicationTxData: 0MB
_______________________________________________________________________________
The volume was created. The default allocation policy causes 10GB to be pre-allocated, but none is used as yet.
We're good to go!
Connecting a Linux Client to the iSCSI volume
Run a sendtargets discovery against the group IP address:
[root@dl360-1 ~]# iscsiadm --mode discovery --type sendtargets --portal 10.100.0.100
10.100.0.100:3260,1 iqn.2001-05.com.equallogic:0-8a0906-ee37d6d03-a520000042948d13-volume1
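With open-iscsi, discovery only records the target; the disk normally appears once the initiator has logged in (restarting the iscsi service, as done later on this page, also logs in to all discovered targets). A minimal sketch of building the explicit login command from one discovery record follows. The helper name login_cmd_for_record is hypothetical, and the sketch only prints the command.

```shell
# Print the iscsiadm login command for one discovery record of the form
# "<ip>:<port>,<tpgt> <iqn>", as printed by a sendtargets discovery.
login_cmd_for_record() {
    echo "$1" | awk '{
        split($1, p, ",")    # drop the ",<tpgt>" suffix from the portal
        print "iscsiadm --mode node --targetname " $2 " --portal " p[1] " --login"
    }'
}

login_cmd_for_record "10.100.0.100:3260,1 iqn.2001-05.com.equallogic:0-8a0906-ee37d6d03-a520000042948d13-volume1"
```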
Then verify it was detected:
[root@dl360-1 ~]# dmesg
scsi0 : iSCSI Initiator over TCP/IP
scsi 0:0:0:0: Direct-Access EQLOGIC 100E-00 3.3 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 209725440 512-byte hardware sectors (107379 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 81 00 00 00
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 209725440 512-byte hardware sectors (107379 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 81 00 00 00
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sda: unknown partition table
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
If you have multipath enabled and the device is not blacklisted, you might see it here:
[root@dl360-1 ~]# multipath -l
mpath0 (36090a038d0d637ee138d9442000020a5) dm-0 EQLOGIC,100E-00
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
\_ 0:0:0:0 sda 8:0 [active][undef]
Let's do some initial testing to make sure everything is ok
Put a filesystem on the device, time it:
[root@dl360-1 ~]# time mkfs.ext3 /dev/mpath/mpath0
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
13123584 inodes, 26215680 blocks
1310784 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
801 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
real 0m16.301s
user 0m0.124s
sys 0m3.904s
In the eql console, see how much space has been allocated for an empty ext3 filesystem:
group1> show volume volume1
_____________________________ Volume Information ______________________________
Name: volume1                           Size: 100GB
VolReserve: 23.39GB                     VolReserveInUse: 13.49GB
ReplReserveInUse: 0MB                   iSCSI Alias: volume1
iSCSI Name:                             ActualMembers: 1
  iqn.2001-05.com.equallogic:0-8a0906-  Snap-Warn: 10%
  3e57d6d03-7870000042e48d3a-volume1    Snap-Depletion: delete-oldest
Description:                            Snap-Reserve: 100%
Snap-Reserve-Avail: 100% (23.39GB)      Permission: read-write
DesiredStatus: online                   Status: online
Connections: 1                          Snapshots: 0
Bind:                                   Type: not-replicated
ReplicationReserveSpace: 0MB            Replicas: 0
ReplicationPartner:                     Pool: default
Transmitted-Data: 0MB                   Received-Data: 1.7GB
Pref-Raid-Policy: none                  Pref-Raid-Policy-Status: none
Thin-Provision: enabled                 Thin-Min-Reserve: 10% (10GB)
Thin-Growth-Warn: 60% (60GB)            Thin-Growth-Max: 100% (100GB)
ReplicationTxData: 13.49GB
_______________________________________________________________________________
Looks like the format caused 13.5GB of space to be used! The allocation policy has grown the allocation to 23.4GB. I believe this is something to do with superblock backups. I see from the mkfs.ext3 manpage that there are a couple of related options (-Osparse_super and -S) that might make a difference, we'll test that next and see what side effects this might have.
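A quick piece of arithmetic on the mkfs.ext3 output helps put these numbers in context. This is a sketch under one loud assumption: 128-byte inodes, which I believe is the default for mke2fs 1.39 but which is not shown in the output above. The helper names are made up.

```shell
# Convert "count x unit size in bytes" to whole MiB.
to_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

# Space the array allocated per ext3 block group, in MiB (one decimal),
# treating the array's "GB" as GiB.
alloc_per_group_mib() {
    awk -v gb="$1" -v groups="$2" 'BEGIN { printf "%.1f\n", gb * 1024 / groups }'
}

to_mib 13123584 128             # inode tables in MiB (ASSUMES 128-byte inodes)
to_mib 32768 4096               # journal in MiB: 32768 blocks of 4096 bytes
alloc_per_group_mib 13.49 801   # MiB allocated by the array per block group
```

Under that assumption the inode tables (1602 MiB) plus the journal (128 MiB) come to about 1.7GB, which is close to the Received-Data figure. The array, however, allocated roughly 17 MiB for each of the 801 block groups while only about 2 MiB of inode table was written per group. That pattern looks more like a coarse allocation granularity on the array than anything to do with superblock backups, though this is only an inference from the numbers.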
Mount the filesystem:
[root@dl360-1 ~]# mkdir /mnt/volume1
[root@dl360-1 ~]# mount /dev/mpath/mpath0 /mnt/volume1
[root@dl360-1 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p3      31G  7.1G   23G  25% /
/dev/cciss/c0d0p1      99M   21M   73M  23% /boot
tmpfs                 3.0G     0  3.0G   0% /dev/shm
/dev/dm-0              99G  188M   94G   1% /mnt/volume1
Here we use the ddt tool used in our ataoe testing on this wiki, just to get a baseline measurement and make sure there are no major problems with the setup. The tool is a modification of dd written by Coraid.
We use a test size of 8GB to rule out the possibility of any pagecache effect on the host (which has 4GB ram).
[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.5290 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.5290 ... done.
8192 MiB KiB/s CPU%
Write 108604 21
Read 113886 9
Also, interestingly, even more space has now been allocated, even though ddt deletes the data it creates. The volume now shows 20.57GB in use with 27.36GB reserved in total, for what is essentially still an empty disk. When you think about it, this makes sense as the storage array is a simple block device and has no way of knowing when a certain i/o operation caused a block of data to become unused. So there is no obvious way that the space can be automatically recovered - I'll discuss a possible strategy later and see how well it works.
group1> show volume volume1
_____________________________ Volume Information ______________________________
Name: volume1                           Size: 100GB
VolReserve: 27.36GB                     VolReserveInUse: 20.57GB
ReplReserveInUse: 0MB                   iSCSI Alias: volume1
iSCSI Name:                             ActualMembers: 1
  iqn.2001-05.com.equallogic:0-8a0906-  Snap-Warn: 10%
  3e57d6d03-7870000042e48d3a-volume1    Snap-Depletion: delete-oldest
Description:                            Snap-Reserve: 100%
Snap-Reserve-Avail: 100% (27.36GB)      Permission: read-write
DesiredStatus: online                   Status: online
Connections: 1                          Snapshots: 0
Bind:                                   Type: not-replicated
ReplicationReserveSpace: 0MB            Replicas: 0
ReplicationPartner:                     Pool: default
Transmitted-Data: 8.01GB                Received-Data: 9.72GB
Pref-Raid-Policy: none                  Pref-Raid-Policy-Status: none
Thin-Provision: enabled                 Thin-Min-Reserve: 10% (10GB)
Thin-Growth-Warn: 60% (60GB)            Thin-Growth-Max: 100% (100GB)
ReplicationTxData: 20.57GB
_______________________________________________________________________________
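The behaviour can be illustrated with a toy model of a thin-provisioned volume that allocates space in fixed-size pages on first write and never gives a page back, because a file deletion never reaches the array as anything other than more writes. This is purely a sketch: the 16 MiB page size is a hypothetical value picked for illustration, not a documented Equallogic figure, and the function names are made up.

```shell
# Count the distinct allocation pages touched by the writes on stdin,
# one "offset_mib length_mib" pair per line. Nothing ever removes a page
# from the set, mirroring an array that only ever sees writes.
pages_touched() {
    awk -v page="$1" '
        { first = int($1 / page); last = int(($1 + $2 - 1) / page)
          for (p = first; p <= last; p++) seen[p] = 1 }
        END { n = 0; for (p in seen) n++; print n }'
}

# mkfs-like pattern: 2 MiB written at the start of each of 801 groups of 128 MiB.
mkfs_like_writes() {
    awk 'BEGIN { for (g = 0; g < 801; g++) print g * 128, 2 }'
}

# Pages allocated with a HYPOTHETICAL 16 MiB page size.
mkfs_like_writes | pages_touched 16
```

In this model about 1.6GB of scattered writes pins 801 pages, or roughly 12.5GB with 16 MiB pages, which is the same order of magnitude as what the array reported after the format.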
We want to see if the initial allocation can be reduced by supplying additional switches to mkfs.ext3. The man page says there is an option:
-O feature[,...]
sparse_super
Create a filesystem with fewer superblock backup copies (saves space on large filesystems).
Create a couple more volumes
In the Equallogic console:
group1> volume create volume-th-1 100GB unrestricted thin-provision
Volume creation succeeded.
iSCSI target name is iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148
d3a-volume-th-1
group1>
1034:174:member1:MgmtExec:19-Sep-2008 14:44:29.520175:VolInfo.cc:72:INFO:8.2.2:Volume 'volume-th-1' successfully created.
group1> volume create volume-th-2 100GB unrestricted thin-provision
Volume creation succeeded.
iSCSI target name is iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448
d3a-volume-th-2
group1>
1036:176:member1:MgmtExec:19-Sep-2008 14:44:42.520177:VolInfo.cc:72:INFO:8.2.2:Volume 'volume-th-2' successfully created.
Discover the new volumes on the initiator
[root@dl360-1 ~]# service iscsi restart
Logout session [sid: 5, target: iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1, portal: 10.100.0.100,3260]
Stopping iSCSI daemon: /etc/init.d/iscsi: line 33: 7097 Killed /etc/init.d/iscsid stop
iscsid dead but pid file exists [ OK ]
Turning off network shutdown. Starting iSCSI daemon: [ OK ]
[ OK ]
Setting up iSCSI targets: Login session [iface: default, target: iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1, portal: 10.100.0.10
Login session [iface: default, target: iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448d3a-volume-th-2, portal: 10.100.0.100,3260]
Login session [iface: default, target: iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1, portal: 10.100.0.100,3260]
[ OK ]
[root@dl360-1 ~]# dmesg
scsi5 : iSCSI Initiator over TCP/IP
scsi 5:0:0:0: Direct-Access EQLOGIC 100E-00 3.3 PQ: 0 ANSI: 5
sd 5:0:0:0: [sdb] 209725440 512-byte hardware sectors (107379 MB)
sd 5:0:0:0: [sdb] Write Protect is off
sd 5:0:0:0: [sdb] Mode Sense: 81 00 00 00
sd 5:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 5:0:0:0: [sdb] 209725440 512-byte hardware sectors (107379 MB)
sd 5:0:0:0: [sdb] Write Protect is off
sd 5:0:0:0: [sdb] Mode Sense: 81 00 00 00
sd 5:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sdb: unknown partition table
sd 5:0:0:0: [sdb] Attached SCSI disk
sd 5:0:0:0: Attached scsi generic sg0 type 0
scsi6 : iSCSI Initiator over TCP/IP
scsi 6:0:0:0: Direct-Access EQLOGIC 100E-00 3.3 PQ: 0 ANSI: 5
sd 6:0:0:0: [sda] 209725440 512-byte hardware sectors (107379 MB)
sd 6:0:0:0: [sda] Write Protect is off
sd 6:0:0:0: [sda] Mode Sense: 81 00 00 00
sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 6:0:0:0: [sda] 209725440 512-byte hardware sectors (107379 MB)
sd 6:0:0:0: [sda] Write Protect is off
sd 6:0:0:0: [sda] Mode Sense: 81 00 00 00
sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sda: unknown partition table
sd 6:0:0:0: [sda] Attached SCSI disk
sd 6:0:0:0: Attached scsi generic sg1 type 0
scsi7 : iSCSI Initiator over TCP/IP
scsi 7:0:0:0: Direct-Access EQLOGIC 100E-00 3.3 PQ: 0 ANSI: 5
sd 7:0:0:0: [sdc] 209725440 512-byte hardware sectors (107379 MB)
sd 7:0:0:0: [sdc] Write Protect is off
sd 7:0:0:0: [sdc] Mode Sense: 81 00 00 00
sd 7:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 7:0:0:0: [sdc] 209725440 512-byte hardware sectors (107379 MB)
sd 7:0:0:0: [sdc] Write Protect is off
sd 7:0:0:0: [sdc] Mode Sense: 81 00 00 00
sd 7:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sdc: unknown partition table
sd 7:0:0:0: [sdc] Attached SCSI disk
sd 7:0:0:0: Attached scsi generic sg2 type 0
[root@dl360-1 ~]# multipath -l
mpath2 (36090a038d0d697cc3a8d44430000903a) dm-1 EQLOGIC,100E-00
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
\_ 6:0:0:0 sda 8:0 [active][undef]
mpath1 (36090a038d0d6573e3a8de44200007078) dm-0 EQLOGIC,100E-00
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
\_ 5:0:0:0 sdb 8:16 [active][undef]
mpath3 (36090a038d0d6c7cb3a8d1443000090ae) dm-2 EQLOGIC,100E-00
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
\_ 7:0:0:0 sdc 8:32 [active][undef]
This means that:
- /dev/mpath/mpath1 = volume1
- /dev/mpath/mpath2 = volume-th-2
- /dev/mpath/mpath3 = volume-th-1
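One way to cross-check this mapping uses a pattern visible in the output above: the last 16 hex digits of each multipath WWID are the byte-reversed form of the last hex field of the matching IQN. This is an observation from the data on this page, not documented behaviour, and the function names below are made up for illustration.

```shell
# Reverse a hex string two characters (one byte) at a time.
reverse_bytes() {
    echo "$1" | awk '{ out = ""
        for (i = length($0) - 1; i >= 1; i -= 2) out = out substr($0, i, 2)
        print out }'
}

# Extract the last hex field of an Equallogic IQN, e.g. 7870000042e48d3a.
iqn_serial() {
    echo "$1" | cut -d: -f2 | cut -d- -f4
}

# Succeed if the tail of the multipath WWID matches the IQN's last hex field.
wwid_matches_iqn() {
    tail16=$(echo "$1" | sed 's/.*\(.\{16\}\)$/\1/')
    [ "$tail16" = "$(reverse_bytes "$(iqn_serial "$2")")" ]
}

wwid_matches_iqn 36090a038d0d6573e3a8de44200007078 \
    iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1 \
    && echo "mpath1 is volume1"
```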
-Osparse_super Option
After formatting volume-th-2 with -O sparse_super, the allocation is practically unchanged (13.43GB in use against 13.49GB before), so this option makes no difference. That is less surprising in hindsight: the superblock backup list printed by the first mkfs.ext3 run (blocks 32768, 98304, 163840, ...) already follows the sparse pattern, which suggests sparse_super was enabled by default.
group1> show volume volume-th-2
_____________________________ Volume Information ______________________________
Name: volume-th-2                       Size: 100GB
VolReserve: 19.57GB                     VolReserveInUse: 13.43GB
ReplReserveInUse: 0MB                   iSCSI Alias: volume-th-2
iSCSI Name:                             ActualMembers: 1
  iqn.2001-05.com.equallogic:0-8a0906-  Snap-Warn: 10%
  cc97d6d03-3a90000043448d3a-volume-th  Snap-Depletion: delete-oldest
  -2                                    Description:
Snap-Reserve: 100%                      Snap-Reserve-Avail: 100% (19.57GB)
Permission: read-write                  DesiredStatus: online
Status: online                          Connections: 1
Snapshots: 0                            Bind:
Type: not-replicated                    ReplicationReserveSpace: 0MB
Replicas: 0                             ReplicationPartner:
Pool: default                           Transmitted-Data: 0MB
Received-Data: 1.7GB                    Pref-Raid-Policy: none
Pref-Raid-Policy-Status: none           Thin-Provision: enabled
Thin-Min-Reserve: 10% (10GB)            Thin-Growth-Warn: 60% (60GB)
Thin-Growth-Max: 100% (100GB)           ReplicationTxData: 13.43GB
_______________________________________________________________________________
Poking around the iSCSI layer
The first thing that struck me was that with all these arcane target names it would be easy to get devices mixed up and potentially do some serious damage.
In a system with a lot of volumes it is important to know what targets have been discovered and which system block devices they are installed to. This is handled nicely by the iscsiadm command, which has 4 levels of verbosity, 0-3. With a verbosity of 3 it gives all the information you could possibly want about a volume:
[root@dl360-1 ~]# /sbin/iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-724
iscsiadm version 2.0-865
Target: iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1
Current Portal: 10.100.0.102:3260,1
Persistent Portal: 10.100.0.100:3260,1
**********
Interface:
**********
Iface Name: default
Iface Transport: tcp
Iface IPaddress: default
Iface HWaddress: default
Iface Netdev: default
SID: 12
iSCSI Connection State: LOGGED IN
Internal iscsid Session State: NO CHANGE
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 65536
FirstBurstLength: 65536
MaxBurstLength: 262144
ImmediateData: Yes
InitialR2T: No
MaxOutstandingR2T: 1
************************
Attached SCSI devices:
************************
Host Number: 11 State: running
scsi11 Channel 00 Id 0 Lun: 0
Attached scsi disk sdc State: running
--SNIP--
Despite the huge amount of very useful information here, it can be a little unwieldy so I knocked up a little perl script to display iscsi device information in a more minimal way:
#!/usr/bin/perl -w
use strict;

# Get our map out (fail loudly if iscsiadm cannot be started)
my $iscsiadm_cmd = '/sbin/iscsiadm -m session -P 3';
open(my $map, '-|', "$iscsiadm_cmd 2>/dev/null")
    or die "Cannot run $iscsiadm_cmd: $!\n";

# Start parsing it
my @t;
my ($targetname, $targetid, $targetportal, $targetdevice);

# We look for the following lines
# Target: iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1
# Current Portal: 10.100.0.101:3260,1
# Attached scsi disk sdd
while (<$map>) {
    # $1 is the full IQN, $2 the friendly volume name at the end of it
    if ( /^Target: ([a-z0-9\.\-]+:[0-9a-z]+-[0-9a-z]+-[0-9a-z]+-[0-9a-z]+-(.*)$)/ ) {
        $targetid   = $1;
        $targetname = $2;
    }
    # Greedy match up to the last colon leaves just the IP address
    if ( /Current Portal: (.*):/ ) {
        $targetportal = $1;
    }
    if ( /Attached scsi disk (sd[a-z]+)/ ) {
        $targetdevice = $1;
        # Now we have everything for this target!
        my %target = (
            'name'   => $targetname,
            'id'     => $targetid,
            'portal' => $targetportal,
            'device' => '/dev/' . $targetdevice
        );
        push @t, \%target;
    }
}
close $map;

# Unravel and print the map
print 'Discovered ' . scalar(@t) . " targets:\n";
foreach my $target (@t) {
    print "$target->{'name'} -> $target->{'device'} ($target->{'portal'}:$target->{'id'})\n";
}
When you run it on both hosts, it shows this:
[root@dl360-1 ~]# ./idit.pl
Discovered 4 targets:
volume1 -> /dev/sda (10.100.0.102:iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1)
volume-th-2 -> /dev/sdb (10.100.0.103:iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448d3a-volume-th-2)
volume-th-1 -> /dev/sdc (10.100.0.102:iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1)
volume-th-3 -> /dev/sdd (10.100.0.103:iqn.2001-05.com.equallogic:0-8a0906-8447d6d03-5630000043748d3b-volume-th-3)
[root@dl360-2 ~]# ./idit.pl
Discovered 4 targets:
volume-th-3 -> /dev/sda (10.100.0.101:iqn.2001-05.com.equallogic:0-8a0906-8447d6d03-5630000043748d3b-volume-th-3)
volume-th-2 -> /dev/sdb (10.100.0.102:iqn.2001-05.com.equallogic:0-8a0906-cc97d6d03-3a90000043448d3a-volume-th-2)
volume1 -> /dev/sdc (10.100.0.103:iqn.2001-05.com.equallogic:0-8a0906-3e57d6d03-7870000042e48d3a-volume1)
volume-th-1 -> /dev/sdd (10.100.0.101:iqn.2001-05.com.equallogic:0-8a0906-cbc7d6d03-ae90000043148d3a-volume-th-1)
This is the vital data you need to see at first glance.
There is a friendly name which was assigned at volume creation time on the PS5000E. Within the iSCSI system it is identified by its iSCSI Name, such as:
- iqn.2001-05.com.equallogic:0-8a0906-8447d6d03-5630000043748d3b-volume-th-3
Linux maps this to a "real scsi device" at /dev/sdX, where X is the next available letter in the alphabet. So the first detected disk is installed to /dev/sda.
Each of these volumes is discovered from the "group" portal ip address of 10.100.0.100.
Load Balancing
This is where things get clever. The PS5000E does its best to balance the iSCSI load across as many interfaces as possible. From an initiator point of view this means that the discovered volumes are equally distributed across all three available ip addresses (which correspond to the ethernet ports on the PS5000E). In addition to this, when multiple initiators discover the same device, the PS5000E keeps track and rotates the targets to make sure that corresponding volumes are advertised on different interfaces. Dumb in one way in that I'm not sure if it is possible to manually modify the affinity, but still very clever and no doubt provides decent load balancing.
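To see how evenly the sessions are actually spread, the "Current Portal" lines from iscsiadm can be tallied per member IP address. This is a small sketch with a made-up function name; it relies only on the "Current Portal: <ip>:<port>,<tpgt>" line format shown earlier on this page.

```shell
# Count iSCSI sessions per member interface, given the output of
# `iscsiadm -m session -P 3` on stdin. Prints "<ip> <count>" lines.
sessions_per_portal() {
    awk '/Current Portal:/ { split($3, a, ":"); count[a[1]]++ }
         END { for (ip in count) print ip, count[ip] }' | sort
}

# Typical use on an initiator:
#   /sbin/iscsiadm -m session -P 3 | sessions_per_portal
```

Running it on both hosts and comparing the tallies would show whether the array really does rotate the same volume onto different interfaces for different initiators.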
Kernel Tweaks
Added the following kernel tweaks to improve tcp throughput:
# increase TCP maximum buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and maximum number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
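Settings like these normally go in /etc/sysctl.conf and are loaded with sysctl -p. To confirm that the running kernel really has them, the values can be read back from /proc/sys. The helpers below are a sketch with made-up names, not part of the original setup.

```shell
# Map a sysctl key such as net.core.rmem_max to its file under /proc/sys.
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr . /)"
}

# Compare the running value of one key with the wanted value(s).
# awk's field re-join normalises whitespace, since multi-value keys
# are tab-separated in /proc.
check_sysctl() {
    key=$1
    shift
    want=$(echo "$*" | awk '{ $1 = $1; print }')
    have=$(awk '{ $1 = $1; print }' "$(sysctl_path "$key")")
    if [ "$have" = "$want" ]; then
        echo "$key ok"
    else
        echo "$key differs: $have"
    fi
}

# Example, on the test host:
#   check_sysctl net.ipv4.tcp_rmem 4096 87380 16777216
```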
Multipathing
The following link was very helpful: http://linfrastructure.blogspot.com/2008/02/multipath-and-equallogic-iscsi.html
It is possible to use multipathing, but it seems not all Linux initiators can do it. Fortunately we are running on CentOS 5.1, so our tools have full support. Let's try it...
OK Without multipath enabled:
[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.30322 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.30322 ... done.
8192 MiB KiB/s CPU%
Write 109963 20
Read 114103 9
1st run with multipath enabled:
[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.32146 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.32146 ... done.
8192 MiB KiB/s CPU%
Write 129769 25
Read 114180 8
It is faster, at least for writes, but I suspect there is a problem here. Time to investigate...
Flow control was not enabled on the switch for the second interface on the client. Fixed, but it made no difference:
[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.428 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.428 ... done.
8192 MiB KiB/s CPU%
Write 128226 25
Read 114336 8
Next I took the machine down for a reboot... When it booted back up, we had the following paths (we use mpath1):
[root@dl360-1 ~]# multipath -ll
mpath2 (36090a038d0d697cc3a8d44430000903a) dm-1 EQLOGIC,100E-00
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][enabled]
\_ 3:0:0:0 sdd 8:48 [active][ready]
\_ 2:0:0:0 sdc 8:32 [active][ready]
mpath1 (36090a038d0d6573e3a8de44200007078) dm-0 EQLOGIC,100E-00
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 1:0:0:0 sdb 8:16 [active][ready]
\_ 0:0:0:0 sda 8:0 [active][ready]
mpath4 (36090a038d0d647843b8d744300003056) dm-3 EQLOGIC,100E-00
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][enabled]
\_ 7:0:0:0 sdh 8:112 [active][ready]
\_ 6:0:0:0 sdg 8:96 [active][ready]
mpath3 (36090a038d0d6c7cb3a8d1443000090ae) dm-2 EQLOGIC,100E-00
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][enabled]
\_ 5:0:0:0 sdf 8:80 [active][ready]
\_ 4:0:0:0 sde 8:64 [active][ready]
And the following performance under ddt:
[root@dl360-1 ~]# ddt -t 8G /mnt/volume1/
Writing to /mnt/volume1/ddt.4016 ... syncing ... done.
sleeping 10 seconds ... done.
Reading from /mnt/volume1/ddt.4016 ... done.
8192 MiB KiB/s CPU%
Write 160393 39
Read 112530 8
So now that's more like it. 160MB/s writes... Nice. But the reads still seem broken. iostat reports that I/O is taking place across both paths, but each path is transferring slowly, and together they fall well short of the >200MB/s I would expect.
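To put the ddt figures in context, it helps to convert them to MB/s and compare them with what a single GigE link can carry. The sketch below uses made-up helper names and a simplified ceiling: 125 MB/s raw line rate, scaled by TCP payload per frame (MTU minus 40 bytes of IP and TCP headers, no TCP options) over bytes on the wire (MTU plus 38 bytes of Ethernet header, FCS, preamble and inter-frame gap). iSCSI header overhead is ignored.

```shell
# Convert a ddt figure in KiB/s to MB/s (one decimal place).
kib_to_mb() {
    awk -v k="$1" 'BEGIN { printf "%.1f\n", k * 1024 / 1000000 }'
}

# Rough TCP payload ceiling of one GigE link for a given MTU, in MB/s.
gige_ceiling_mb() {
    awk -v mtu="$1" 'BEGIN { printf "%.1f\n", 125 * (mtu - 40) / (mtu + 38) }'
}

kib_to_mb 160393        # multipath write result
kib_to_mb 112530        # multipath read result
gige_ceiling_mb 9000    # single-link ceiling with jumbo frames
```

On these numbers the write result is above what one link can carry, so both paths must be contributing, while the read result sits a little under the single-link ceiling, which is what you would see if reads were effectively limited to one link's worth of bandwidth. That is only a reading of the arithmetic, not a diagnosis.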