Adding swap with zfs

To add swap from the rpool:

# zfs get volsize rpool/dump
NAME PROPERTY VALUE SOURCE
rpool/dump volsize 1.50G local

To increase the volume size, boot into single-user mode and run:

# zfs set volsize=40G rpool/dump

If you can't bring the system down to single-user mode, create a second swap volume instead.

For SPARC:
# zfs create -V 2G -b 8k rpool/swap2

For x86:
# zfs create -V 2G -b 4k rpool/swap2

Then just add it to the running system:

# swap -a /dev/zvol/dsk/rpool/swap2

You can also add swap from a pool other than the rpool:

# zfs create -V 40G -b 4k db1/swap
# swap -a /dev/zvol/dsk/db1/swap
# swap -l
swapfile dev swaplo blocks free
/dev/zvol/dsk/rpool/swap 181,1 8 4194296 4194296
/dev/zvol/dsk/db1/swap 181,3 8 83886072 83886072
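
To make the new swap volume survive a reboot, give it a line in /etc/vfstab as well. This is a minimal sketch using the standard Solaris swap entry format; adjust the zvol path to match the pool and volume you actually created:

/dev/zvol/dsk/rpool/swap2   -   -   swap   -   no   -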

Resize (grow) a mirror by swapping to larger disks with ZFS

I have a non-root mirror made up of two 72G drives. Can I replace them with two 146G drives without a backup and without losing data?

I knew you could grow a mirror under ZFS the long way around: break the mirror, swap a larger disk in for the broken-out smaller disk, create a new pool on the larger disk, clone the old pool to the new pool, destroy the old pool, swap a larger disk in for the old (now single-drive) pool, attach the second larger drive to the new pool, and finally mount the new pool where the old pool lived. As the walkthrough below shows, ZFS can do it much more simply, in place.

Luckily, with ZFS you can treat files like drives for test purposes:

=========================
Let’s make 4 test devices
=========================

bash-3.00# cd /export/home/
bash-3.00# ls
lost+found
bash-3.00# mkfile 72m 0 1
bash-3.00# mkfile 156m 2 3

===========================================================
0 and 1 become our 72M drives, 2 and 3 become our 156M drives
===========================================================

bash-3.00# ls -la
total 934452
drwxr-xr-x 3 root root 512 Mar 13 00:31 .
drwxr-xr-x 3 root sys 512 Mar 12 21:55 ..
-rw------T 1 root root 75497472 Mar 13 00:31 0
-rw------T 1 root root 75497472 Mar 13 00:31 1
-rw------T 1 root root 163577856 Mar 13 00:31 2
-rw------T 1 root root 163577856 Mar 13 00:31 3
drwx------ 2 root root 8192 Mar 12 21:55 lost+found

===================================
let’s create a mirror with the 72’s
===================================

bash-3.00# zpool create test mirror /export/home/0 /export/home/1
bash-3.00# zpool status
pool: test
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
mirror ONLINE 0 0 0
/export/home/0 ONLINE 0 0 0
/export/home/1 ONLINE 0 0 0

errors: No known data errors
bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 67.5M 92.5K 67.4M 0% ONLINE -

================================
let's put some data on the mirror
================================

bash-3.00# cd /test
bash-3.00# ls
bash-3.00# pwd
/test
bash-3.00# mkfile 1m q w e r t y
bash-3.00# ls -la
total 11043
drwxr-xr-x 2 root root 8 Mar 13 00:33 .
drwxr-xr-x 27 root root 512 Mar 13 00:32 ..
-rw------T 1 root root 1048576 Mar 13 00:33 e
-rw------T 1 root root 1048576 Mar 13 00:33 q
-rw------T 1 root root 1048576 Mar 13 00:33 r
-rw------T 1 root root 1048576 Mar 13 00:33 t
-rw------T 1 root root 1048576 Mar 13 00:33 w
-rw------T 1 root root 1048576 Mar 13 00:33 y
bash-3.00# cd /

=======================================
let’s remove the backside of the mirror
=======================================

bash-3.00# zpool detach test /export/home/1
bash-3.00# zpool status
pool: test
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/export/home/0 ONLINE 0 0 0

errors: No known data errors
bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 67.5M 6.16M 61.3M 9% ONLINE -

========================================
Now let's replace it with a larger device
========================================

bash-3.00# zpool attach test /export/home/0 /export/home/2
bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 67.5M 6.15M 61.4M 9% ONLINE -
bash-3.00# zpool status
pool: test
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Fri Mar 13 00:36:05 2009
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
mirror ONLINE 0 0 0
/export/home/0 ONLINE 0 0 0
/export/home/2 ONLINE 0 0 0

errors: No known data errors

===========================================
Now let's detach the frontside of the mirror
===========================================

bash-3.00# zpool detach test /export/home/0
bash-3.00# zpool status
pool: test
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Fri Mar 13 00:36:05 2009
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/export/home/2 ONLINE 0 0 0

errors: No known data errors

=========================================
Now let’s replace it with a larger device
=========================================

bash-3.00# zpool attach test /export/home/2 /export/home/3

============
Is it there?
============

bash-3.00# zpool status
pool: test
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Fri Mar 13 00:37:38 2009
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
mirror ONLINE 0 0 0
/export/home/2 ONLINE 0 0 0
/export/home/3 ONLINE 0 0 0

errors: No known data errors

=======================================================
Note: the resilver will take much longer with more data
=======================================================

=============
Is it bigger?
=============

bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 152M 6.22M 145M 4% ONLINE -
bash-3.00#

=========
Yes it is
=========

========================
Is our data still there?
========================

bash-3.00# ls /test/
e q r t w y
bash-3.00# ls -la /test/
total 12323
drwxr-xr-x 2 root root 8 Mar 13 00:33 .
drwxr-xr-x 27 root root 512 Mar 13 00:32 ..
-rw------T 1 root root 1048576 Mar 13 00:33 e
-rw------T 1 root root 1048576 Mar 13 00:33 q
-rw------T 1 root root 1048576 Mar 13 00:33 r
-rw------T 1 root root 1048576 Mar 13 00:33 t
-rw------T 1 root root 1048576 Mar 13 00:33 w
-rw------T 1 root root 1048576 Mar 13 00:33 y

=========
Yes it is
=========
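
If you would rather not break the mirror at all, the same upgrade can be sketched with zpool replace, one side at a time, starting over from the original mirror of /export/home/0 and /export/home/1. This keeps both copies of the data during each swap. The last command is only needed on newer ZFS releases that have the autoexpand pool property (it defaults to off there; on those releases zpool online -e also works):

# zpool replace test /export/home/0 /export/home/2
(wait for the resilver to finish, then)
# zpool replace test /export/home/1 /export/home/3
(wait for the resilver to finish again, then, if your release has the property)
# zpool set autoexpand=on test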

zero to 4 terabytes in 60 seconds with ZFS

Note: a 5300 storage array with 14 x 400G drives, using raidz2 (RAID 6, aka double parity)

root@foobar # iostat -En |grep c7|awk '{print $1}'
c7t220D000A330680AFd0
c7t220D000A330675DBd0
c7t220D000A33068F42d0
c7t220D000A33068D45d0
c7t220D000A33068D49d0
c7t220D000A330688F6d0
c7t220D000A33068D71d0
c7t220D000A3306956Ad0
c7t220D000A33066E31d0
c7t220D000A33068E53d0
c7t220D000A33068D75d0
c7t220D000A330680A4d0
c7t220D000A33069574d0
c7t220D000A33066C81d0

zpool create -f mypool raidz2 c7t220D000A330680AFd0 c7t220D000A330675DBd0 \
c7t220D000A33068F42d0 c7t220D000A33068D45d0 c7t220D000A33068D49d0 \
c7t220D000A330688F6d0 c7t220D000A33068D71d0 c7t220D000A3306956Ad0 \
c7t220D000A33066E31d0 c7t220D000A33068E53d0 c7t220D000A33068D75d0 \
c7t220D000A330680A4d0 spare c7t220D000A33069574d0 c7t220D000A33066C81d0

root@foobar # zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
mypool 4.34T 192K 4.34T 0% ONLINE -

root@foobar # zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 152K 3.54T 60.9K /mypool

root@foobar # zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c7t220D000A330680AFd0 ONLINE 0 0 0
c7t220D000A330675DBd0 ONLINE 0 0 0
c7t220D000A33068F42d0 ONLINE 0 0 0
c7t220D000A33068D45d0 ONLINE 0 0 0
c7t220D000A33068D49d0 ONLINE 0 0 0
c7t220D000A330688F6d0 ONLINE 0 0 0
c7t220D000A33068D71d0 ONLINE 0 0 0
c7t220D000A3306956Ad0 ONLINE 0 0 0
c7t220D000A33066E31d0 ONLINE 0 0 0
c7t220D000A33068E53d0 ONLINE 0 0 0
c7t220D000A33068D75d0 ONLINE 0 0 0
c7t220D000A330680A4d0 ONLINE 0 0 0
spares
c7t220D000A33069574d0 AVAIL
c7t220D000A33066C81d0 AVAIL

errors: No known data errors

root@foobar # df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c3t0d0s0 15G 7.7G 7.0G 53% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 17G 716K 17G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
/usr/lib/libc/libc_hwcap2.so.1
15G 7.7G 7.0G 53% /lib/libc.so.1
fd 0K 0K 0K 0% /dev/fd
swap 17G 36K 17G 1% /tmp
swap 17G 24K 17G 1% /var/run
mypool 3.5T 61K 3.5T 1% /mypool

root@foobar # zfs set sharenfs=on mypool

root@foobar # cat /etc/dfs/sharetab
/mypool - n
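
From an NFS client the newly shared pool can then be mounted; a quick sketch, assuming a Solaris client and that the server answers to the hostname foobar shown in the prompts above:

client# mkdir -p /mnt/mypool
client# mount -F nfs foobar:/mypool /mnt/mypool
client# df -h /mnt/mypool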

Recursive rolling snapshots

With build 63 (Solaris 10 u6), ZFS now supports the recursive snapshot option (-r).

my storage pool looks like this

mypool
mypool/home
mypool/home/john
mypool/home/joe

Here is a script to run nightly from cron (note that we only name the top-level dataset; the -r flag takes care of the children):
#!/bin/sh
#: jcore
# Rotate the snapshot names, oldest first, then take tonight's recursive snapshot.
zfs destroy -r mypool@7daysago > /dev/null 2>&1
zfs rename -r mypool@6daysago 7daysago > /dev/null 2>&1
zfs rename -r mypool@5daysago 6daysago > /dev/null 2>&1
zfs rename -r mypool@4daysago 5daysago > /dev/null 2>&1
zfs rename -r mypool@3daysago 4daysago > /dev/null 2>&1
zfs rename -r mypool@2daysago 3daysago > /dev/null 2>&1
zfs rename -r mypool@yesterday 2daysago > /dev/null 2>&1
zfs snapshot -r mypool@yesterday
exit 0
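
To run it nightly, a root crontab entry along these lines does the job (added with crontab -e); the script path below is just a placeholder for wherever you save it:

0 1 * * * /usr/local/bin/rollsnap.sh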

After running this 7 times, this is what I get:

# uname -a
SunOS serv10g-dxc1 5.10 Generic_137138-09 i86pc i386 i86pc

# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 9.00G 3.53T 9.00G /mypool
mypool@7daysago 0 - 9.00G -
mypool@6daysago 0 - 9.00G -
mypool@5daysago 0 - 9.00G -
mypool@4daysago 0 - 9.00G -
mypool@3daysago 0 - 9.00G -
mypool@2daysago 0 - 9.00G -
mypool@yesterday 0 - 9.00G -
mypool/home 139K 3.53T 49.7K /mypool/home
mypool/home@7daysago 0 - 49.7K -
mypool/home@6daysago 0 - 49.7K -
mypool/home@5daysago 0 - 49.7K -
mypool/home@4daysago 0 - 49.7K -
mypool/home@3daysago 0 - 49.7K -
mypool/home@2daysago 0 - 49.7K -
mypool/home@yesterday 0 - 49.7K -
mypool/home/joe 44.7K 3.53T 44.7K /mypool/home/joe
mypool/home/joe@7daysago 0 - 44.7K -
mypool/home/joe@6daysago 0 - 44.7K -
mypool/home/joe@5daysago 0 - 44.7K -
mypool/home/joe@4daysago 0 - 44.7K -
mypool/home/joe@3daysago 0 - 44.7K -
mypool/home/joe@2daysago 0 - 44.7K -
mypool/home/joe@yesterday 0 - 44.7K -
mypool/home/john 44.7K 3.53T 44.7K /mypool/home/john
mypool/home/john@7daysago 0 - 44.7K -
mypool/home/john@6daysago 0 - 44.7K -
mypool/home/john@5daysago 0 - 44.7K -
mypool/home/john@4daysago 0 - 44.7K -
mypool/home/john@3daysago 0 - 44.7K -
mypool/home/john@2daysago 0 - 44.7K -
mypool/home/john@yesterday 0 - 44.7K -
#
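
To pull a file back out of one of these snapshots, you can go through the hidden .zfs directory; a sketch, where "somefile" is just an example name. Setting snapdir=visible is optional (the directory is reachable by name even when hidden), it simply makes .zfs show up in ls:

# zfs set snapdir=visible mypool/home/john
# ls /mypool/home/john/.zfs/snapshot/2daysago/
# cp /mypool/home/john/.zfs/snapshot/2daysago/somefile /mypool/home/john/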

Creating a three way mirror in ZFS

You actually can create a three-way mirror in ZFS if that is what you REALLY want to do:

zpool create -f poolname mirror ctag ctag ctag

Putting all three disks under a single mirror vdev means every block is replicated to all three of them. (The form "mirror ctag ctag mirror ctag ctag mirror ctag ctag" is something different: a stripe across three two-way mirrors.)

Note: not to be confused with the three disk mirror which you CAN do on a T3 storage array. Dracko article #542 T3 - Three Disk Mirror (WTF?)

Attaching and Detaching Devices in a ZFS Storage Pool

In addition to the zpool add command, you can use the zpool attach command to add a new device to an existing mirrored or non-mirrored device. For example:

# zpool attach zeepool c1t1d0 c2t1d0

If the existing device is part of a two-way mirror, attaching the new device creates a three-way mirror, and so on. In either case, the new device begins to resilver immediately.

In this example, zeepool is an existing two-way mirror that is transformed into a three-way mirror by attaching c2t1d0, the new device, to the existing device, c1t1d0.

You can use the zpool detach command to detach a device from a pool. For example:

# zpool detach zeepool c2t1d0

However, this operation is refused if there are no other valid replicas of the data. For example:

# zpool detach newpool c1t2d0
cannot detach c1t2d0: only applicable to mirror and replacing vdevs
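
The refused detach above is what you get on a pool with only one copy of the data. As a quick sketch of the round trip (c2t2d0 is just a stand-in for a second disk): create a single-disk pool, attach a second device to turn it into a two-way mirror, and then a detach succeeds:

# zpool create newpool c1t2d0
# zpool attach newpool c1t2d0 c2t2d0
# zpool detach newpool c2t2d0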

Onlining and Offlining Devices in a ZFS Storage Pool

ZFS allows individual devices to be taken offline or brought online. When hardware is unreliable or not functioning properly, ZFS continues to read or write data to the device, assuming the condition is only temporary. If the condition is not temporary, it is possible to instruct ZFS to ignore the device by bringing it offline. ZFS does not send any requests to an offlined device.
Note

Devices do not need to be taken offline in order to replace them.

You can use the offline command when you need to temporarily disconnect storage. For example, if you need to physically disconnect an array from one set of Fibre Channel switches and connect the array to a different set, you could take the LUNs offline from the array that was used in ZFS storage pools. After the array was reconnected and operational on the new set of switches, you could then bring the same LUNs online. Data that had been added to the storage pools while the LUNs were offline would resilver to the LUNs after they were brought back online.

This scenario is possible assuming that the systems in question see the storage once it is attached to the new switches, possibly through different controllers than before, and your pools are set up as RAID-Z or mirrored configurations.
Taking a Device Offline

You can take a device offline by using the zpool offline command. The device can be specified by path or by short name, if the device is a disk. For example:

# zpool offline tank c1t0d0
bringing device c1t0d0 offline

You cannot take a pool offline to the point where it becomes faulted. For example, you cannot take offline two devices out of a RAID-Z configuration, nor can you take offline a top-level virtual device.

# zpool offline tank c1t0d0
cannot offline c1t0d0: no valid replicas

Note

Currently, you cannot replace a device that has been taken offline.

Offlined devices show up in the OFFLINE state when you query pool status. For information about querying pool status, see Querying ZFS Storage Pool Status.

By default, the offline state is persistent. The device remains offline when the system is rebooted.

To temporarily take a device offline, use the zpool offline -t option. For example:

# zpool offline -t tank c1t0d0
bringing device 'c1t0d0' offline

When the system is rebooted, this device is automatically returned to the ONLINE state.

For more information on device health, see Health Status of ZFS Storage Pools.
Bringing a Device Online

Once a device is taken offline, it can be restored by using the zpool online command:

# zpool online tank c1t0d0
bringing device c1t0d0 online

When a device is brought online, any data that has been written to the pool is resynchronized to the newly available device. Note that you cannot use device onlining to replace a disk. If you offline a device, replace the drive, and try to bring it online, it remains in the faulted state.

If you attempt to online a faulted device, a message similar to the following is displayed from fmd:

# zpool online tank c1t0d0
Bringing device c1t0d0 online
#
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Mar 17 14:38:47 MST 2006
PLATFORM: SUNW,Ultra-60, CSN: -, HOSTNAME: neo
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 043bb0dd-f0a5-4b8f-a52d-8809e2ce2e0a
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
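
Following that REC-ACTION, the usual recovery is a zpool replace rather than another online attempt; a sketch, where c2t0d0 stands in for the replacement disk (omit the second device name if the new disk went into the same slot as the old one):

# zpool status -x
# zpool replace tank c1t0d0 c2t0d0
# zpool status tank
(watch the resilver complete)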

Remote Replication of ZFS Data

You can use the zfs send and zfs recv commands to remotely copy a snapshot stream representation from one system to another system. For example:

# zfs send tank/cindy@today | ssh newsys zfs recv sandbox/restfs@today

This command saves the tank/cindy@today snapshot data and restores it into the sandbox/restfs file system and also creates a restfs@today snapshot on the newsys system. In this example, the user has been configured to use ssh on the remote system.
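
Once the initial full stream has been received on newsys, later snapshots can be sent incrementally with zfs send -i, which transfers only the blocks that changed between the two snapshots. A sketch, where tank/cindy@tomorrow is just an example of a newer snapshot taken after @today:

# zfs snapshot tank/cindy@tomorrow
# zfs send -i tank/cindy@today tank/cindy@tomorrow | ssh newsys zfs recv sandbox/restfs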

ZFS mount: cd .. or ls .. permission denied

Problem: when moving up from one ZFS mount point to a parent directory (also a ZFS mount point) using the ".." notation via cd or ls, permission is denied; however, cd and ls can access the same directories when the directory is named explicitly.

ZFS mount points keep separate . and .. entries for their mounted and unmounted states. While the filesystem is mounted, . and .. work properly; while it is unmounted, every non-root user gets an access-denied response, because the underlying . and .. entries are owned by root by default. Their permissions have to be set separately, with the filesystem unmounted, before other users can traverse those entries.

This particular issue can cause problems in situations such as applying patches under those mount directories for programs such as Oracle.
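
One way to clear this up is to fix the permissions on the underlying mountpoint directory while the filesystem is unmounted; a sketch, with pool/home/user standing in for an affected filesystem and 755 as an example mode:

# zfs umount pool/home/user
# chmod 755 /pool/home/user
# zfs mount pool/home/user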

ZFS Limiting the ARC Cache

The ARC is where ZFS caches data from all active storage pools. The ARC grows and consumes memory on the principle that no need exists to return data to the system while there is still plenty of free memory. When the ARC has grown and outside memory pressure exists, for example, when a new application starts up, then the ARC releases its hold on memory. ZFS is not designed to steal memory from applications. A few bumps appeared along the way, but the established mechanism works reasonably well for many situations and does not commonly warrant tuning.

However, a few situations stand out.

* If a future memory requirement is significantly large and well defined, then it can be advantageous to prevent ZFS from growing the ARC into it. So, if we know that a future application requires 20% of memory, it makes sense to cap the ARC such that it does not consume more than the remaining 80% of memory.

* If the application is a known consumer of large memory pages, then again limiting the ARC prevents ZFS from breaking up the pages and fragmenting the memory. Limiting the ARC preserves the availability of large pages.

* If dynamic reconfiguration of a memory board is needed (supported on certain platforms), then it is a requirement to prevent the ARC (and thus the kernel cage) from growing onto all boards.

For these cases, it can be desirable to limit the ARC. This will, of course, also limit the amount of cached data, which can have adverse effects on performance. There is no easy way to foretell whether limiting the ARC will degrade performance.

If you tune this parameter, please reference this URL in your shell script or in an /etc/system comment:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
Solaris 10 8/07 and Solaris Nevada (snv_51) Releases

For example, if an application needs 5 Gbytes of memory on a system with 36 Gbytes of memory, you could set the ARC maximum to 30 Gbytes (0x780000000).

Set the following parameter in the /etc/system file:

set zfs:zfs_arc_max = 0x780000000
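
After a reboot you can sanity-check the cap against the ARC kstats; a quick sketch (values are reported in bytes):

# kstat -p zfs:0:arcstats:c_max
# kstat -p zfs:0:arcstats:size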

Earlier Solaris Releases

You can only change the ARC maximum size by using the mdb command. Because the system is already booted, the ARC init routine has already executed and other ARC size parameters have already been set based on the default c_max size. Therefore, you should tune the arc.c and arc.p values, along with arc.c_max, using the formula:

arc.c = arc.c_max

arc.p = arc.c / 2

For example, to set the ARC parameters to small values, such as an arc_c_max of 512 MB, while complying with the formula above (arc.c_max to 512 MB and arc.p to 256 MB), use the following syntax:

# mdb -kw
> arc::print -a p c c_max
ffffffffc00b3260 p = 0xb75e46ff
ffffffffc00b3268 c = 0x11f51f570
ffffffffc00b3278 c_max = 0x3bb708000

> ffffffffc00b3260/Z 0x10000000
ffffffffc00b3260: 0xb75e46ff = 0x10000000
> ffffffffc00b3268/Z 0x20000000
ffffffffc00b3268: 0x11f51f570 = 0x20000000
> ffffffffc00b3278/Z 0x20000000
ffffffffc00b3278: 0x11f51f570 = 0x20000000

You should verify that the values have been set correctly by examining them again in mdb (using the same print command as in the example). You can also monitor the actual size of the ARC to ensure it has not exceeded the cap:

# echo "arc::print -d size" | mdb -k

The above command displays the current ARC size in decimal.

Here is a perl script that you can call from an init script to configure your ARC on boot with the above guidelines:

#!/bin/perl
# Tune the live ARC via mdb: set arc.c and arc.c_max to the requested
# maximum (in bytes) and arc.p to half of it.

use strict;
use IPC::Open2;

my $arc_max = shift @ARGV;
if ( !defined($arc_max) ) {
    print STDERR "usage: arc_tune <arc_max_in_bytes>\n";
    exit 1;
}
$arc_max = hex($arc_max) if $arc_max =~ /^0x/i;    # accept a hex value too
$| = 1;

my %syms;
my $mdb = "/usr/bin/mdb";
open2( *READ, *WRITE, "$mdb -kw" ) || die "cannot execute mdb";

# Dump the ARC structure with addresses and remember where each field lives.
print WRITE "arc::print -a\n";
while (<READ>) {
    my $line = $_;

    if ( $line =~ /^ +([a-f0-9]+) (.*) =/ ) {
        $syms{$2} = $1;
    } elsif ( $line =~ /^\}/ ) {
        last;
    }
}

# set c & c_max to our max; set p to max/2
printf WRITE "%s/Z 0x%x\n", $syms{p}, ( $arc_max / 2 );
print scalar <READ>;
printf WRITE "%s/Z 0x%x\n", $syms{c}, $arc_max;
print scalar <READ>;
printf WRITE "%s/Z 0x%x\n", $syms{c_max}, $arc_max;
print scalar <READ>;
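
Called from an init script, the invocation would look something like this, capping the ARC at 512 MB (the install path is just a placeholder):

# /usr/local/bin/arc_tune 536870912
or, equivalently, using the hex form the script also accepts:
# /usr/local/bin/arc_tune 0x20000000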

RFEs

* ZFS should avoid growing the ARC into trouble

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6488341

* The ARC allocates memory inside the kernel cage, preventing DR

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6522017

* ZFS/ARC should cleanup more after itself

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6424665

* Each zpool needs to monitor its throughput and throttle heavy writers

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205

Further Reading

http://blogs.sun.com/roch/entry/does_zfs_really_use_more