Adding swap with zfs

Adding swap with zfs

To add swap from the rpool, first check the size of the existing volume:

# zfs get volsize rpool/dump
NAME PROPERTY VALUE SOURCE
rpool/dump volsize 1.50G local

To increase it, boot into single-user mode and run:

# zfs set volsize=40G rpool/dump

If you can't bring the system to single-user mode, then create a second swap volume instead.

For SPARC:
# zfs create -V 2G -b 8k rpool/swap2

For x86:
# zfs create -V 2G -b 4k rpool/swap2

Then just add it as swap:

# swap -a /dev/zvol/dsk/rpool/swap2

You can also add swap from a pool other than the rpool:

# zfs create -V 40G -b 4k db1/swap
# swap -a /dev/zvol/dsk/db1/swap
# swap -l
swapfile dev swaplo blocks free
/dev/zvol/dsk/rpool/swap 181,1 8 4194296 4194296
/dev/zvol/dsk/db1/swap 181,3 8 83886072 83886072
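
If you do this often, the create-and-add steps are easy to wrap in a small script. The following is a minimal sketch, not a polished tool: the pool name, volume name, and size are illustrative assumptions, and it picks the block size from the architecture as shown above.

#!/bin/sh
# Minimal sketch: create a swap zvol and add it as swap.
# POOL, VOL and SIZE are illustrative; change them to suit.
POOL=rpool
VOL=swap2
SIZE=2G

# 8k volblocksize on SPARC, 4k on x86 (matches the commands above)
if [ "`uname -p`" = "sparc" ]; then
    BS=8k
else
    BS=4k
fi

zfs create -V $SIZE -b $BS $POOL/$VOL || exit 1
swap -a /dev/zvol/dsk/$POOL/$VOL
swap -l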

Resize (grow) a mirror by swapping to larger disks with ZFS

Resize (grow) a mirror with ZFS

I have a non-root mirror made up of two 72G drives.
Can I replace them with two 146G drives without a backup
and without loss of data?

I knew you could grow a mirror under ZFS by breaking the mirror,
inserting a larger disk in place of the broken-out smaller disk,
creating a new pool on the larger disk,
cloning the old pool to the new pool,
destroying the old pool,
inserting a larger disk in place of the old (now single-drive) pool,
attaching the second larger drive to the new pool,
and finally mounting the new pool where the old pool lived.
But this is wild:

Luckily, with ZFS you can treat files like drives for test purposes:

=========================
Let’s make 4 test devices
=========================

bash-3.00# cd /export/home/
bash-3.00# ls
lost+found
bash-3.00# mkfile 72m 0 1
bash-3.00# mkfile 156m 2 3

===========================================================
0 and 1 become our 72M drives, 2 and 3 become our 156M drives
===========================================================

bash-3.00# ls -la
total 934452
drwxr-xr-x 3 root root 512 Mar 13 00:31 .
drwxr-xr-x 3 root sys 512 Mar 12 21:55 ..
-rw------T 1 root root 75497472 Mar 13 00:31 0
-rw------T 1 root root 75497472 Mar 13 00:31 1
-rw------T 1 root root 163577856 Mar 13 00:31 2
-rw------T 1 root root 163577856 Mar 13 00:31 3
drwx------ 2 root root 8192 Mar 12 21:55 lost+found

===================================
let’s create a mirror with the 72’s
===================================

bash-3.00# zpool create test mirror /export/home/0 /export/home/1
bash-3.00# zpool status
pool: test
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
mirror ONLINE 0 0 0
/export/home/0 ONLINE 0 0 0
/export/home/1 ONLINE 0 0 0

errors: No known data errors
bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 67.5M 92.5K 67.4M 0% ONLINE -

================================
let's put some data on the mirror
================================

bash-3.00# cd /test
bash-3.00# ls
bash-3.00# pwd
/test
bash-3.00# mkfile 1m q w e r t y
bash-3.00# ls -la
total 11043
drwxr-xr-x 2 root root 8 Mar 13 00:33 .
drwxr-xr-x 27 root root 512 Mar 13 00:32 ..
-rw------T 1 root root 1048576 Mar 13 00:33 e
-rw------T 1 root root 1048576 Mar 13 00:33 q
-rw------T 1 root root 1048576 Mar 13 00:33 r
-rw------T 1 root root 1048576 Mar 13 00:33 t
-rw------T 1 root root 1048576 Mar 13 00:33 w
-rw------T 1 root root 1048576 Mar 13 00:33 y
bash-3.00# cd /

=======================================
let’s remove the backside of the mirror
=======================================

bash-3.00# zpool detach test /export/home/1
bash-3.00# zpool status
pool: test
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/export/home/0 ONLINE 0 0 0

errors: No known data errors
bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 67.5M 6.16M 61.3M 9% ONLINE -

========================================
Now let's replace it with a larger device
========================================

bash-3.00# zpool attach test /export/home/0 /export/home/2
bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 67.5M 6.15M 61.4M 9% ONLINE -
bash-3.00# zpool status
pool: test
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Fri Mar 13 00:36:05 2009
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
mirror ONLINE 0 0 0
/export/home/0 ONLINE 0 0 0
/export/home/2 ONLINE 0 0 0

errors: No known data errors

===========================================
Now let's detach the frontside of the mirror
===========================================

bash-3.00# zpool detach test /export/home/0
bash-3.00# zpool status
pool: test
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Fri Mar 13 00:36:05 2009
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/export/home/2 ONLINE 0 0 0

errors: No known data errors

=========================================
Now let’s replace it with a larger device
=========================================

bash-3.00# zpool attach test /export/home/2 /export/home/3

============
Is it there?
============

bash-3.00# zpool status
pool: test
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Fri Mar 13 00:37:38 2009
config:

NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
mirror ONLINE 0 0 0
/export/home/2 ONLINE 0 0 0
/export/home/3 ONLINE 0 0 0

errors: No known data errors

=======================================================
Note: the resilver will take much longer with more data
=======================================================

=============
Is it bigger?
=============

bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 152M 6.22M 145M 4% ONLINE -
bash-3.00#

=========
Yes it is
=========

========================
Is our data still there?
========================

bash-3.00# ls /test/
e q r t w y
bash-3.00# ls -la /test/
total 12323
drwxr-xr-x 2 root root 8 Mar 13 00:33 .
drwxr-xr-x 27 root root 512 Mar 13 00:32 ..
-rw------T 1 root root 1048576 Mar 13 00:33 e
-rw------T 1 root root 1048576 Mar 13 00:33 q
-rw------T 1 root root 1048576 Mar 13 00:33 r
-rw------T 1 root root 1048576 Mar 13 00:33 t
-rw------T 1 root root 1048576 Mar 13 00:33 w
-rw------T 1 root root 1048576 Mar 13 00:33 y

=========
Yes it is
=========
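
With real disks the whole grow is the same four commands. Here is a minimal sketch with hypothetical device names, where c1t2d0/c1t3d0 are the old 72G disks and c1t4d0/c1t5d0 are the new 146G disks; let each resilver finish (watch zpool status) before the next step.

# detach one half of the mirror and attach a bigger disk in its place
zpool detach test c1t3d0
zpool attach test c1t2d0 c1t4d0
# after the resilver completes, swap out the remaining old disk
zpool detach test c1t2d0
zpool attach test c1t4d0 c1t5d0
# after the second resilver completes, the pool reports the larger size
zpool list test

On recent ZFS releases the extra space may not show up until the pool's autoexpand property is turned on (or after an export/import); the file-based demo above grew automatically.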

zero to 4 terabytes in 60 seconds with ZFS

zero to 4 terabytes in 60 seconds with ZFS

Note: a 5300 storage array with fourteen 400G drives, using raidz2 (RAID 6, aka double parity).

root@foobar # iostat -En | grep c7 | awk '{print $1}'
c7t220D000A330680AFd0
c7t220D000A330675DBd0
c7t220D000A33068F42d0
c7t220D000A33068D45d0
c7t220D000A33068D49d0
c7t220D000A330688F6d0
c7t220D000A33068D71d0
c7t220D000A3306956Ad0
c7t220D000A33066E31d0
c7t220D000A33068E53d0
c7t220D000A33068D75d0
c7t220D000A330680A4d0
c7t220D000A33069574d0
c7t220D000A33066C81d0

zpool create -f mypool raidz2 c7t220D000A330680AFd0 c7t220D000A330675DBd0 \
c7t220D000A33068F42d0 c7t220D000A33068D45d0 c7t220D000A33068D49d0 \
c7t220D000A330688F6d0 c7t220D000A33068D71d0 c7t220D000A3306956Ad0 \
c7t220D000A33066E31d0 c7t220D000A33068E53d0 c7t220D000A33068D75d0 \
c7t220D000A330680A4d0 spare c7t220D000A33069574d0 c7t220D000A33066C81d0

root@foobar # zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
mypool 4.34T 192K 4.34T 0% ONLINE -

root@foobar # zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 152K 3.54T 60.9K /mypool

root@foobar # zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c7t220D000A330680AFd0 ONLINE 0 0 0
c7t220D000A330675DBd0 ONLINE 0 0 0
c7t220D000A33068F42d0 ONLINE 0 0 0
c7t220D000A33068D45d0 ONLINE 0 0 0
c7t220D000A33068D49d0 ONLINE 0 0 0
c7t220D000A330688F6d0 ONLINE 0 0 0
c7t220D000A33068D71d0 ONLINE 0 0 0
c7t220D000A3306956Ad0 ONLINE 0 0 0
c7t220D000A33066E31d0 ONLINE 0 0 0
c7t220D000A33068E53d0 ONLINE 0 0 0
c7t220D000A33068D75d0 ONLINE 0 0 0
c7t220D000A330680A4d0 ONLINE 0 0 0
spares
c7t220D000A33069574d0 AVAIL
c7t220D000A33066C81d0 AVAIL

errors: No known data errors

root@foobar # df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c3t0d0s0 15G 7.7G 7.0G 53% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 17G 716K 17G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
/usr/lib/libc/libc_hwcap2.so.1
15G 7.7G 7.0G 53% /lib/libc.so.1
fd 0K 0K 0K 0% /dev/fd
swap 17G 36K 17G 1% /tmp
swap 17G 24K 17G 1% /var/run
mypool 3.5T 61K 3.5T 1% /mypool

root@foobar # zfs set sharenfs=on mypool

root@foobar # cat /etc/dfs/sharetab
/mypool - n
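
From a client, the newly shared pool can then be mounted over NFS. This is a minimal sketch assuming a Solaris client and that the server is reachable by the name foobar:

# on the NFS client
mkdir -p /mnt/mypool
mount -F nfs foobar:/mypool /mnt/mypool
df -h /mnt/mypool

(A Linux client would use mount -t nfs instead of mount -F nfs.)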

Recursive rolling snapshots

Recursive rolling snapshots

With build 63 (Solaris 10 update 6), ZFS now supports recursive snapshot commands.

My storage pool looks like this:

mypool
mypool/home
mypool/home/john
mypool/home/joe

Here is a script to run nightly from cron (note that we only snapshot at the top of the pool; the -r flag takes care of the children). A sample crontab entry follows the script.
#!/bin/sh
#: jcore
zfs destroy -r mypool@7daysago > /dev/null 2>&1
zfs rename -r mypool@6daysago 7daysago > /dev/null 2>&1
zfs rename -r mypool@5daysago 6daysago > /dev/null 2>&1
zfs rename -r mypool@4daysago 5daysago > /dev/null 2>&1
zfs rename -r mypool@3daysago 4daysago > /dev/null 2>&1
zfs rename -r mypool@2daysago 3daysago > /dev/null 2>&1
zfs rename -r mypool@yesterday 2daysago > /dev/null 2>&1
zfs snapshot -r mypool@yesterday
exit 0
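
To run it nightly from cron, an entry along these lines works; the script path and the 01:00 run time are just illustrative assumptions:

# root's crontab (crontab -e): take the rolling snapshot every night at 01:00
0 1 * * * /usr/local/bin/rolling-snap.sh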

After running this seven times, this is what I get:

# uname -a
SunOS serv10g-dxc1 5.10 Generic_137138-09 i86pc i386 i86pc

# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 9.00G 3.53T 9.00G /mypool
mypool@7daysago 0 - 9.00G -
mypool@6daysago 0 - 9.00G -
mypool@5daysago 0 - 9.00G -
mypool@4daysago 0 - 9.00G -
mypool@3daysago 0 - 9.00G -
mypool@2daysago 0 - 9.00G -
mypool@yesterday 0 - 9.00G -
mypool/home 139K 3.53T 49.7K /mypool/home
mypool/home@7daysago 0 - 49.7K -
mypool/home@6daysago 0 - 49.7K -
mypool/home@5daysago 0 - 49.7K -
mypool/home@4daysago 0 - 49.7K -
mypool/home@3daysago 0 - 49.7K -
mypool/home@2daysago 0 - 49.7K -
mypool/home@yesterday 0 - 49.7K -
mypool/home/joe 44.7K 3.53T 44.7K /mypool/home/joe
mypool/home/joe@7daysago 0 - 44.7K -
mypool/home/joe@6daysago 0 - 44.7K -
mypool/home/joe@5daysago 0 - 44.7K -
mypool/home/joe@4daysago 0 - 44.7K -
mypool/home/joe@3daysago 0 - 44.7K -
mypool/home/joe@2daysago 0 - 44.7K -
mypool/home/joe@yesterday 0 - 44.7K -
mypool/home/john 44.7K 3.53T 44.7K /mypool/home/john
mypool/home/john@7daysago 0 - 44.7K -
mypool/home/john@6daysago 0 - 44.7K -
mypool/home/john@5daysago 0 - 44.7K -
mypool/home/john@4daysago 0 - 44.7K -
mypool/home/john@3daysago 0 - 44.7K -
mypool/home/john@2daysago 0 - 44.7K -
mypool/home/john@yesterday 0 - 44.7K -
#

Creating a three way mirror in ZFS

You actually can create a three-way mirror in ZFS, if that is what you REALLY want to do. Put all three disks into a single mirror vdev so that every block is kept in three copies:

zpool create -f poolname mirror ctag ctag ctag

A command of the form "zpool create poolname mirror ctag ctag mirror ctag ctag mirror ctag ctag" does something different: it builds three two-disk mirrors and dynamically stripes data across them (RAID 1+0); the mirrors do not replicate to one another.

Note: not to be confused with the three-disk mirror which you CAN do on a T3 storage array. See Dracko article #542, "T3 - Three Disk Mirror (WTF?)".
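
As in the mirror-grow walkthrough above, files make cheap stand-in "drives" if you want to see the layout before committing real disks. A minimal sketch using three 72m files (the names a, b, c and the pool name test3 are arbitrary):

bash-3.00# cd /export/home
bash-3.00# mkfile 72m a b c
bash-3.00# zpool create test3 mirror /export/home/a /export/home/b /export/home/c
bash-3.00# zpool status test3

zpool status should show a single mirror vdev containing all three files.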

ZFS Device I/O Queue Size (I/O Concurrency)

Device I/O Queue Size (I/O Concurrency)

ZFS controls the I/O queue depth for a given LUN. The default is 35, which allows common SCSI and SATA disks to reach their maximum throughput under ZFS. However, having 35 concurrent I/Os means that the service times can be inflated. For NVRAM-based storage, it is not expected that this 35-deep queue is reached nor plays a significant role. Tuning this parameter for NVRAM-based storage is expected to be ineffective. For JBOD-type storage, tuning this parameter is expected to help response times at the expense of raw streaming throughput.

The Solaris Nevada release now has the option of storing the ZIL on separate devices from the main pool. Using separate intent log devices can alleviate the need to tune this parameter for loads that are synchronously write intensive.

If you tune this parameter, please reference this URL in your shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#MAXPEND

Tuning is not expected to be effective for NVRAM-based storage arrays.
Solaris 10 8/07 and Solaris Nevada (snv_53 to snv_69) Releases

Set dynamically:

echo zfs_vdev_max_pending/W0t10 | mdb -kw

Revert to default:

echo zfs_vdev_max_pending/W0t35 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_vdev_max_pending = 10
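
Following the advice above about documenting the change, the /etc/system entry might look like this (the comment wording is just an illustration):

* Lower the ZFS per-LUN I/O queue depth for better JBOD response times.
* See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#MAXPEND
set zfs:zfs_vdev_max_pending = 10

After a reboot, the live value can be read back (without writing) with: echo zfs_vdev_max_pending/D | mdb -k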

For earlier Solaris releases, see:

http://blogs.sun.com/roch/entry/tuning_the_knobs
RFEs

* 6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6471212
Further Reading

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

ZFS Device-Level Prefetching

Device-Level Prefetching

ZFS does device-level prefetching in addition to file-level prefetching. When ZFS reads a block from a disk, it inflates the I/O size, hoping to pull interesting data or metadata from the disk. Prior to the Solaris Nevada (snv_70) release, this code caused problems for systems with lots of disks because the extra prefetched data can cause congestion on the channel between the storage and the host. Tuning down the prefetching has been effective for OLTP-type loads in the past. However, in the Solaris Nevada release, the code now prefetches only metadata, and this is not expected to require any tuning.

No tuning is required for snv_70 and after.
Solaris 10 8/07 and Nevada (snv_53 to snv_69) Releases

Set the following parameter in the /etc/system file:

set zfs:zfs_vdev_cache_bshift = 13

Comments:

* Setting zfs_vdev_cache_bshift with mdb crashes a system.
* zfs_vdev_cache_bshift is the base 2 logarithm of the size used to read disks.
* The default value of 16 means reads are issued in size of 1 << 16 = 64K.
* A value of 13 means disk reads are padded to 8K.

For earlier releases, see:

http://blogs.sun.com/roch/entry/tuning_the_knobs

RFEs

* vdev_cache wises up: increase DB performance by 16% (integrated in snv_70)

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6437054

Further Reading

http://blogs.sun.com/erickustarz/entry/vdev_cache_improvements_to_help

ZFS Cache Flushes

ZFS Cache Flushes

ZFS is designed to work with storage devices that manage a disk-level cache. ZFS commonly asks the storage device to ensure that data is safely placed on stable storage by requesting a cache flush. For JBOD storage, this works as designed and without problems. For many NVRAM-based storage arrays, a problem might come up if the array takes the cache flush request and actually does something rather than ignoring it. Some storage will flush their caches despite the fact that the NVRAM protection makes those caches as good as stable storage.

ZFS issues infrequent flushes (every 5 seconds or so) after the uberblock updates. The problem here is fairly inconsequential. No tuning is warranted here.

ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, and so on). The completion of this type of flush is waited upon by the application and impacts performance. Greatly so, in fact. From a performance standpoint, this neutralizes the benefits of having an NVRAM-based storage.

The upcoming fix is that the flush request semantic will be qualified to instruct storage devices to ignore the requests if they have the proper protection. This change requires a fix to our disk drivers and for the storage to support the updated semantics.

Since ZFS is not aware of the nature of the storage, or whether NVRAM is present, the best way to fix this issue is to tell the storage to ignore the requests. For more information, see:

http://blogs.digitar.com/jjww/?itemid=44.

Please check with your storage vendor for ways to achieve the same thing.

As a last resort, when all LUNs exposed to ZFS come from NVRAM-protected storage arrays and procedures ensure that no unprotected LUNs will be added in the future, ZFS can be tuned to not issue the flush requests. If some LUNs exposed to ZFS are not protected by NVRAM, this tuning can lead to data loss, application-level corruption, or even pool corruption.

NOTE: Cache flushing is commonly done as part of the ZIL operations. While disabling cache flushing can, at times, make sense, disabling the ZIL does not.

Solaris 10 11/06 and Solaris Nevada (snv_52) Releases

Set dynamically:

echo zfs_nocacheflush/W0t1 | mdb -kw

Revert to default:

echo zfs_nocacheflush/W0t0 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_nocacheflush = 1
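
If you do take this last-resort route, it is worth recording the assumption next to the tunable itself; the comment wording below is only an illustration:

* All LUNs in every pool are NVRAM-protected; cache flush requests are
* therefore suppressed. Do NOT add unprotected LUNs while this is set.
set zfs:zfs_nocacheflush = 1

A quick grep zfs_nocacheflush /etc/system before rebooting confirms the entry is in place.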

Risk: Some storage might revert to working like a JBOD disk when its battery is low, for instance. Disabling cache flushing can have adverse effects here. Check with your storage vendor.

Earlier Solaris Releases

Set the following parameter in the /etc/system file:

set zfs:zil_noflush = 1

Set dynamically:

echo zil_noflush/W0t1 | mdb -kw

Revert to default:

echo zil_noflush/W0t0 | mdb -kw

Risk: Some storage might revert to working like a JBOD disk when its battery is low, for instance. Disabling cache flushing can have adverse effects here. Check with your storage vendor.

RFEs

* sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to SBC-2 devices

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6462690

* zil shouldn’t send write-cache-flush command …

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6460889

ZFS Disabling the ZIL

Disabling the ZIL (Don’t)

ZIL stands for ZFS Intent Log. It is used during synchronous write operations. The ZIL is an essential part of ZFS and should never be disabled. Significant performance gains can be achieved by not having the ZIL, but that would be at the expense of data integrity. One can be infinitely fast if correctness is not required.

One reason to disable the ZIL is to check if a given workload is significantly impacted by it. A little while ago, a workload that was a heavy consumer of ZIL operations was shown to not be impacted by disabling the ZIL. It convinced us to look elsewhere for improvements. If the ZIL is shown to be a factor in the performance of a workload, more investigation is necessary to see if the ZIL can be improved.

The Solaris Nevada release now has the option of storing the ZIL on separate devices from the main pool. Using separate possibly low latency devices for the Intent Log is a great way to improve ZIL sensitive loads.

Caution: Disabling the ZIL on an NFS server will lead to client side corruption. The ZFS pool integrity itself is not compromised by this tuning.

Current Solaris Releases

If you must, then:

echo zil_disable/W0t1 | mdb -kw

Revert to default:

echo zil_disable/W0t0 | mdb -kw

RFEs

* zil synchronicity

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6280630

Further Reading

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
http://blogs.sun.com/erickustarz/entry/zil_disable
http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

ZFS Disabling Metadata Compression

Disabling Metadata Compression

Caution: This tuning needs to be researched, as it is now apparent that the tunable applies only to indirect blocks, leaving a lot of metadata compressed anyway.

With ZFS, compression of data blocks is under the control of the file system administrator and can be turned on or off by using the command "zfs set compression …".
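
For example (the dataset name is illustrative), data compression is toggled and inspected per dataset like so:

# zfs set compression=on mypool/home
# zfs get compression mypool/home
# zfs set compression=off mypool/home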

On the other hand, ZFS internal metadata is always compressed on disk by default. For metadata-intensive loads, this default is expected to gain some amount of space (a few percent) at the expense of a little extra CPU computation. However, a bigger motivation exists to have metadata compression on. For directories that grow to millions of objects and then shrink to just a few, metadata compression saves large amounts of space (>>10X).

In general, metadata compression can be left as is. If your workload is CPU intensive (say > 80% load), kernel profiling shows metadata compression is a significant contributor, and you do not expect to create and shrink huge directories, then disabling metadata compression can be attempted with the goal of freeing up CPU to handle the workload.

Solaris 10 11/06 and Solaris Nevada (snv_52) Releases

Set dynamically:

echo zfs_mdcomp_disable/W0t1 | mdb -kw

Revert to default:

echo zfs_mdcomp_disable/W0t0 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_mdcomp_disable = 1

Earlier Solaris Releases

Not tunable.

RFEs

* 6391873 metadata compression should be turned back on (Integrated in NEVADA snv_36)

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6391873