
Sun StorageTek[TM] SAM-QFS: Filesystem Tuning and Best Practices [ID 1010902.1]
Modified: Nov 9, 2012 Type: HOWTO Migrated ID: 215042 Status: PUBLISHED Priority: 3

Applies to:
Sun QFS and Storage Archive Manager (SAM) - Version 3.5 and later. All platforms.

Goal
Provide some guidelines for tuning a QFS filesystem for optimal read and write performance. Reviewing the man pages for each command and configuration file is highly recommended when choosing tuning options.

Fix
CONFIGURATION AND TUNING POINTERS
=================================

- KNOW YOUR SYSTEM
  - know your application
  - know your disk hardware/configuration
  - understand Solaris and QFS tuning options
- CRITICAL TUNING POINTS
  - Solaris disk drivers
  - Solaris OS parameters
  - QFS striping and disk allocation
  - DAU size
  - file system-level mount options
- ITERATIVE TUNING SEQUENCE
  - determine app I/O sizes and patterns
  - understand/configure disk hardware geometry
  - enable large I/O requests through Solaris
  - decide disk striping strategy
  - mcf configuration details
  - decide DAU size
  - decide file striping strategy
  - file I/O buffering options
  - code path shortening options
  - validate and repeat

KNOW YOUR SYSTEM
----------------
Below are some guidelines for QFS settings. Reviewing the man pages for each command and configuration file is highly recommended when choosing tuning options. To tune application read and write performance you must know your system:

A) Know your application's read and write I/O sizes and patterns (random, quasi-random, quasi-sequential, sequential). Knowledgeable use of the sar, iostat, and truss commands can help you derive your application's I/O sizes and patterns, as sketched below. Note that Solaris and file system tuning parameters can mask the true I/O sizes attempted by an application, and that application behavior may change over time. Note also that read and write behaviors often differ, and there are different tuning variables (that sometimes affect each other) designed for each. Nothing is better than talking with the person who wrote the application.
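
For example, a minimal sketch using standard Solaris tools (the process ID 1234 and the 5-second interval are placeholders for your own values):

# truss -t read,write -p 1234
(traces only the read/write system calls of process 1234, showing the byte count of each request)

# iostat -xn 5
(the average I/O size per device is roughly kr/s divided by r/s for reads, and kw/s divided by w/s for writes)
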
B) Know your disk hardware. Understand what hardware striping options exist and what caching options exist. Understand which hardware RAID controller services each disk and channel to the host, and know your channel configuration between your disks and your host bus adapters. Nothing is better than having your RAID hardware expert handy.

C) Know what Solaris[TM] and QFS tuning options are available and know their impacts on one another. This quick install and tuning guide will help you get started. Read the man pages in detail. Nothing is better than having a Solaris and QFS tuning expert handy.

CRITICAL TUNING POINTS
----------------------
You can specify tuning parameters for the OS and file system at five points. Note that the scope of these guidelines does not cover virtual memory management of files, only the efficiency of file I/O between the application (or the Solaris paging subsystem) and the disk hardware:

A) You can change Solaris disk driver parameters affecting the maximum I/O size the driver can read or write when performing physical I/O.

B) You can change Solaris OS parameters affecting the maximum I/O size Solaris can read or write to drivers when performing I/O.

C) You can add to, rearrange, and group your disk slices in the master configuration file /etc/opt/SUNWsamfs/mcf depending on where you want logical disk blocks (disk allocation units, or DAUs) and index node (inode) metadata to be spread in parallel across your disk slices. Changing these options requires re-creating the file system with the sammkfs command, losing all data. Adding slices to a file system via the samgrowfs command preserves all data.

D) You can select the size of the DAU when you create the file system by running the sammkfs command. Running sammkfs loses all data on the disk slices configured for that file system in /etc/opt/SUNWsamfs/mcf.

E) You can change tuning parameters in the /etc/opt/SUNWsamfs/samfs.cmd file (recommended), in the /etc/vfstab file (overrides samfs.cmd), or on the mount command line (overrides all other options). These options take effect when the file system is mounted.

ITERATIVE TUNING SEQUENCE
-------------------------
Tuning is an iterative process best done on a clean file system with test data - changing some parameters will require file system rebuilds and loss of data. There is, however, a rough sequence of tasks:

A) Do your best to determine the I/O request size your application wants. Do not initially trust the numbers from performance tools, because they will be affected by parameters that are as yet untuned. You may need some specific knowledge of the application before tuning. Or you may need to bypass this step initially: start with step C but tune the whole system to enable huge I/Os, put the file system into direct I/O mode via the forcedirectio mount parameter, and observe actual I/O sizes while running the application (see the sketch after step B below). Validate the configuration to assure that observed I/O sizes are not constrained by hardware or software parameters outside the application. Do not expect good performance during this analysis. Run through all read and write operating modes of the application and observe/document changes in I/O size and pattern.

B) Determine disk hardware geometry and hardware RAID options. Set your RAID LUNs up to have a stripe size equal to, or an even multiple of, your application's I/O size. For example, T3 disks use a stripe width (block size written to each disk in the RAID set) of 16k, 32k, or 64k.
The largest full stripe size for a T3 RAID set would be an 8+1 stripe at 64k, yielding 512k bytes (8 * 64k = 512k). The smallest typical would be a 7+1 (9th disk as hot spare) with a stripe width of 16k, yielding 112k bytes (7 * 16k = 112k). Performing write and read requests exactly at stripe boundaries makes for very efficient I/O, avoiding internal RAID hardware stripe reads-before-writes for the portion of I/O requests that does not fill a stripe completely or spills over into another stripe. Selecting stripe widths and sizes closely matching the characteristics of the application will smooth app-to-hardware I/O flow.
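
As referenced in step A above, a minimal sketch of observing true application I/O sizes under forced direct I/O (the file system name qfs1 and mount point /qfs are placeholders):

# mount -F samfs -o forcedirectio qfs1 /qfs
# iostat -xn 5

With the page cache bypassed, the sizes reported by iostat while the application runs approximate the application's own request sizes. Remove forcedirectio afterward if paged I/O will be used in production.
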
C) Allow Solaris and the Solaris disk drivers to perform the maximum size I/Os you anticipate. Setting these numbers lower than what the application and file system prefer results in fragmentation of I/Os. Setting them higher consumes more memory and may cause some very old disk or channel hardware to become unstable, but for recent large-memory machines and disks this is typically not an issue. The system must be rebooted for these changes to take effect.

Set the maximum I/O size for Solaris by editing the /etc/system file:

# /etc/system
set maxphys = 0x800000 #(NOTE: this sets maximum physical I/O size to 8MB)
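
To confirm the value on the running system after reboot, one common approach (assuming the Solaris kernel debugger mdb is available, as on Solaris 8 and later) is:

# echo "maxphys/D" | mdb -k
(prints the kernel's current maxphys value in decimal)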

Set the maximum I/O size in the sd driver for each SCSI target and LUN that will comprise the QFS file system. Target and LUN numbers map to the t and d integers in the disk device naming convention /dev/rdsk/c#t#d#s#. Edit the file /kernel/drv/sd.conf:

# /kernel/drv/sd.conf
# this sets maximum physical I/O size for target 3, lun 0 to 8MB
name="sd" class="scsi" sd_max_xfer_size=0x800000 target=3 lun=0;
# this sets maximum physical I/O size for target 4, lun 0 to 8MB
name="sd" class="scsi" sd_max_xfer_size=0x800000 target=4 lun=0;

Set the maximum I/O size for fibre channel disks by editing /kernel/drv/ssd.conf:

ssd_max_xfer_size = 0x800000;

There may be other configuration files and option settings for non-Sun host bus adapters and drivers. See the third-party documents for more.

D) Your disk striping strategy determines the way DAUs will be striped across disks in the file system. Define all the file system devices for each QFS file system in the file /etc/opt/SUNWsamfs/mcf. Some tips:

- You do not need a volume manager unless you need to implement a software mirror/RAID (not recommended; volume managers affect performance).
- Most times it is wise to use a whole RAID LUN for each QFS device (one big disk slice) for high performance file systems. Rarely is it useful to place two devices from the same QFS file system on different slices of the same LUN (disk or RAID group), or on more than one LUN presented from a single hardware RAID set.
- Your striping options allow DAUs to be laid down individually or ganged in stripe groups as hard stripes across multiple devices.
- Your QFS device type declarations in the mcf determine the treatment of a given disk device by the file system.

The following four behaviors are affected by your striping choices:

- 16k-64k or 16k-65536k DAU size choice; pick one fixed size at sammkfs (see the DAU discussion above).
- Uniform or dual DAU; dual makes each file's first 8 DAUs small (4k). (Dual DAU avoids wasted disk space when many small files will be used.)
- Mixed or separated metadata; inodes and directories are not put on data disks. (Separating metadata avoids interrupting data I/O to access metadata.)
- Round robin or stripe groups; DAUs on grouped disks are hard striped. (Stripe groups allow very large disk allocations in parallel for fast sequential I/O. Writing a first byte to a file sent to a round-robin device in a file system will allocate a DAU on a single device. Writing one byte to a file sent to a stripe group will allocate one DAU at the same address on EACH device in the group, in parallel. Stripe group devices must be identical in size.)
Not all options are available together. Pick a device type combination:

ms & md  (16k-64k, dual DAU size) & (mixed metadata and data, round robin)
ma & mr  (16k-65536k, uniform DAU size) & (separate metadata, round robin)
ma & g#  (16k-65536k, uniform DAU size) & (separate metadata, stripe groups)
ma & md  (16k-64k, dual DAU size) & (separate metadata, round robin)

E) Your mcf file should be configured to reflect your chosen disk striping strategy. This is an example, with explanations below:

#/etc/opt/SUNWsamfs/mcf
#equipment         equipment device family device additional
#identifier        ordinal   type   set    state  parameter
qfs1               10        ma     qfs1
/dev/dsk/c0t1d0s7  11        mm     qfs1   -      /dev/rdsk/c0t1d0s7
/dev/dsk/c3t0d0s6  12        mr     qfs1   -      /dev/rdsk/c3t0d0s6
/dev/dsk/c3t0d1s6  13        mr     qfs1   -      /dev/rdsk/c3t0d1s6
# "ma" parent line defines file system name "qfs1"; the name also binds lines.
# The "mm" slice, equipment 11, holds your metadata (inodes, directories).
# Metadata is separated from "mr" equipment 12 and 13, which hold your data.
# The metadata device should be a low latency device.
# The "mr" devices are round-robin striped, independently accepting DAUs.
# DAUs are striped across "mr" devices according to the "stripe" mount option.

myfs2              20        ma     myfs2  on
/dev/dsk/c4t1d0s0  21        mm     myfs2  on     /dev/rdsk/c4t1d0s0
/dev/dsk/c4t2d0s0  22        g0     myfs2  on     /dev/rdsk/c4t2d0s0
/dev/dsk/c4t3d0s0  23        g0     myfs2  on     /dev/rdsk/c4t3d0s0
/dev/dsk/c4t4d0s0  24        g1     myfs2  on     /dev/rdsk/c4t4d0s0
/dev/dsk/c4t5d0s0  25        g1     myfs2  on     /dev/rdsk/c4t5d0s0
# A DAU is allocated at once to each disk in a stripe group (g0 etc.)
# DAU groups are striped across "g#" groups with the "stripe" mount option.
# Note all disks in a stripe group (g0 etc.) must be the same size.
# "ma" layout supports/requires at least one "mm" and one "mr" or "g#".

myfs3              30        ms     myfs3  on
/dev/dsk/c5t3d0s0  31        md     myfs3  on     /dev/rdsk/c5t3d0s0
/dev/dsk/c5t4d0s0  32        md     myfs3  on     /dev/rdsk/c5t4d0s0
/dev/dsk/c5t5d0s0  33        md     myfs3  on     /dev/rdsk/c5t5d0s0
# "ms" mixes metadata and data on "md" device types; round-robin striping.

F) Decide on a DAU size based on your disk hardware geometry and setup determined in step B above, constrained by the device types chosen in step D above. Possible values are 16k/32k/64k for the "ms" type, or 16k-65536k in 8k byte increments for the "ma" type. Favor the ideal size for the application I/O size determined in step A above. Even multiples of the ideal size may also be experimented with, and chosen if proven best. Make the file system using the sammkfs command, specifying the DAU size.

If you have 8 data disks with a stripe width of 64k (8 * 64 = 512):
# sammkfs -a 512k samfs1

For 10 data disks with a stripe width of 64k (10 * 64 = 640):
# sammkfs -a 640k samfs1

For JBOD, a 128k DAU often works well, especially for databases:
# sammkfs -a 128k samfs1
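
To verify the resulting layout before loading data, the samfsinfo command shipped with SAM-QFS reports the DAU size and device geometry of a built file system:

# samfsinfo samfs1
(confirm the reported DAU matches the size you specified to sammkfs)
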
G) Decide how many DAUs to put on each device in sequence for each file, before allocating on the next device in the file system. Or, if the file system uses stripe groups, your decision determines how many DAU SETS will be allocated to each stripe group in parallel for a given file before allocating onto the next stripe group. You can choose to direct all the contents of each file onto a single round-robin device or stripe group within the file system. Setting stripe to 0 round-robins at file granularity. Setting stripe to 1 or more round-robins that number of DAUs (or DAU sets, for stripe groups) onto each device or group. Striping is managed independently for each file. Free disk space is allocated to any file on a first-come, first-served basis.

The stripe mount option defaults to 0 for file systems having stripe groups, round-robining whole files. The default stripe for "mr" devices causes at least 128k to be written on each device. This means stripe is set by default as: stripe = 128k / DAU size for DAUs less than 128k; stripe = 1 for DAUs greater than or equal to 128k. Stripe can be changed at mount time (without re-initializing the file system), from 0 to 255. Stripe groups with stripe set to 0 (round-robin file allocation) work very well for streaming applications where the number of concurrent streams equals the number of stripe groups.

#/etc/vfstab
#device   device  mount  fs     fsck  mount   mount
#tomount  tofsck  point  type   pass  atboot  options
qfs1      -       /qfs   samfs  -     yes     stripe=16

#/etc/opt/SUNWsamfs/samfs.cmd
fs = qfs1
stripe = 16

H) Decide file-by-file buffering strategies. There are numerous fine-tuning points that depend on previously set tuning parameters and on the I/O sizes and patterns of your application. In general, if you do not understand how to use these settings, it is best not to set them, leaving them at their defaults. An example samfs.cmd file is at the bottom of this section.

QFS supports two types of I/O: paged (cached in the server's memory via the Solaris page cache) or direct (bypassing the page cache). The default is paged. Applications that write and read consistently sized I/Os at multiples of 512 bytes, or more ideally at the RAID stripe and DAU size, are candidates for direct I/O, which is more efficient because it uses fewer Solaris buffering and processing steps. Applications that cannot be tuned to perform well-formed I/O are good candidates for paged I/O, which consumes more buffering and processing power on the host but decouples the file system somewhat from the application, allowing the file system to read and write the hardware optimally using Solaris page cache buffers.

You can set a directio attribute on any pre-existing file with the command setfa -D filename. If filename is a directory, all files created in that directory will inherit the directio attribute. You can also force directio on the entire file system with a mount option, and applications can enable directio on the fly using standard Solaris file system calls.
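
For example (the directory name /qfs/oradata is a placeholder):

# setfa -D /qfs/oradata
(new files created under this directory inherit the directio attribute)

# sls -D /qfs/oradata
(the SAM-QFS sls command with -D displays detailed file attributes, letting you verify the setting)
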
The mount parameters readahead and writebehind cause the file system to read extra data from the disk in advance of the application's use, and to allow buffering of the application's writes. They are specified in units of kilobytes, rounded to an 8k multiple, and are disabled when directio is in use for a given file. The writebehind size should ideally be set to a multiple of the RAID stripe size, for both hardware and software RAID; see the treatment of RAID stripe sizes above. The readahead size should be set to a size that increases I/O performance for paged I/O. Note that too large a readahead size can hurt performance, causing excessive disk reads and the loading of unnecessary data into memory. It is advisable to test various readahead sizes for your environment. It is important to figure the amount of memory consumed by the total number of expected concurrent application read streams when you set readahead.
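
For instance, an illustrative calculation (not a recommendation): with readahead = 1024 (1MB) and 100 concurrent application read streams, readahead buffering alone can consume on the order of 100 * 1MB = 100MB of page cache.
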
You can also have the file system actively flush written cache pages to disk, to help Solaris keep the page cache clean and available for new data; this is done with the flush_behind mount option. And you can assure that the file system does not overrun the disk hardware with the wr_throttle mount option.

For applications that sometimes perform large-block sequential I/O, you can set mount parameters to control when the file system automatically switches between paged and direct I/O for a given file. I/Os that are at least 512 bytes in length and end on a 512-byte offset into a given file are considered well aligned, or well formed. I/Os that do not end on an even 512-byte offset (even very large ones spanning many 512-byte boundaries) are considered misaligned, or ill formed. Well formed I/Os often realize performance benefits from direct I/O, especially if their sizes closely match the file system's DAU size and hardware setup. Ill formed I/Os, if large enough, may still benefit from direct I/O, since only their beginnings and ends cannot be performed efficiently. And generally, direct I/O consumes less host CPU and memory than paged I/O.

You can configure the smallest read or write I/O size (in kilobytes) that is eligible for direct I/O, independently for well formed and ill formed I/Os. And you can set the number of consecutive operations meeting the size minimum that must be performed before direct I/O is enabled. This avoids inefficiencies arising from constantly switching back and forth between direct and paged I/O when I/O sizes and patterns fluctuate. You can set read and write options independently, which is useful when an application reads in different patterns than it writes. Direct I/O is automatically disabled for a file when one of its application's I/Os does not meet the minimum size criteria. By default, automatic switching is disabled.

As noted above, there are several ways to set the various options. Here are the options as set at mount time via the samfs.cmd file. Again, if you do not understand how to use these settings, it is best not to set them, leaving them at their defaults. See the samfs.cmd man page for more:

#/etc/opt/SUNWsamfs/samfs.cmd
fs = qfs1
#read ahead in kbytes, multiples of 8k, default 128
readahead = n
#write behind in kbytes, multiples of 8k, default 128
writebehind = n
#flush behind in kbytes, 16 to 8192, default 0 (disabled)
flush_behind = n
#write throttle in kbytes, default the smaller of 512M or 1/4 of physical memory
wr_throttle = n
#force direct I/O, disabling the paging subsystem for highly tuned app I/O
forcedirectio
#directio read well-formed minimum in kB, default 256; 0 disables aligned autoswitch
dio_rd_form_min = n
#directio read ill-formed minimum in kB, default 0 disables misaligned autoswitch
dio_rd_ill_min = n
#directio consecutive reads in number of operations, default 0 disables autoswitch
dio_rd_consec = n
#directio write well-formed minimum in kB, default 0 disables aligned autoswitch
dio_wr_form_min = n
#directio write ill-formed minimum in kB, default 0 disables misaligned autoswitch
dio_wr_ill_min = n
#directio consecutive writes in number of operations, default 0 disables autoswitch
dio_wr_consec = n
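
As one illustrative combination - the values below are assumptions for a hypothetical application issuing well formed reads of 1MB or more, not recommendations:

#/etc/opt/SUNWsamfs/samfs.cmd
fs = qfs1
#switch a file to direct I/O after 3 consecutive well formed reads of at least 1MB
dio_rd_form_min = 1024
dio_rd_consec = 3
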
I) Decide code-path options, which can reduce processing steps and contention between processes/threads when attempting to perform I/O. In general, if you do not understand how to use these settings, it is best not to set them, leaving them at their defaults. An example samfs.cmd file is at the bottom of this section.

The qwrite mount parameter disables the Solaris mechanism that serializes parallel writes from separate processes/threads to a given file. This constraint is usually enforced to prevent file corruption resulting from two uncoordinated threads independently altering a single file. Databases such as Oracle are I/O-thread-aware and manage concurrent file writes themselves. Enabling qwrite for such applications reduces Solaris locking overhead and allows parallel file I/O throughput from the application(s).
The notrace mount parameter adds about 2-3% to your performance by jumping around tracing code used for diagnosing problems in the file system. Tracing is turned on by default. If you encounter any problem, support personnel may require you to remove notrace and re-create the problem before it can be diagnosed.

The shared_writer mount option flushes paged file data and cached metadata to disk at file closure, to coordinate file coherency with one or more shared_reader systems reading the same file system. When a given file is flushed, a consistent point-in-time image of the file on disk is assured, and reading systems have a coherent view of the file at that moment. File data remains cached in memory after the flush, until Solaris finds a better use for the memory. Enabling shared_writer also has the beneficial side effect of assuring more frequent commitment of paged file data and metadata to disk than may be achieved with periodic Solaris fsflush scanning. Note that mounting a file system as shared_writer from more than one host will not be prevented and will corrupt the file system - assure proper configuration before mounting.

The shared_reader mount option invalidates a file's memory pages and cached metadata when the metadata in memory is known to be stale. A file open operation on a reading host, occurring after the inode has been memory resident for invalidate_interval seconds, causes the metadata on disk to be checked for updates. If the disk-based metadata has not been updated, any memory-resident metadata and paged file data in the reader's memory is trusted for file I/O. Otherwise, memory pages and metadata are invalidated and read from disk. This feature helps coordinate file coherency with a shared_writer host. Also, reading past the file EOF on a reader host can optionally check and refresh the file's metadata if necessary, invalidating cached file pages as needed. This allows files that are being changed or sequentially written to be updated and read simultaneously from a shared_reader host. File systems mounted as shared readers will not allow writes of any kind to the file system.

You must configure reading hosts with the same disks in the same order as the writing host, within /etc/opt/SUNWsamfs/mcf on each reading host (device names may differ). You may mount as many systems as you like as shared readers of a shared_writer file system.

Here is a sample of the configuration parameters. Again, if you do not understand how to use these settings, it is best not to set them, leaving them at their defaults:

#/etc/opt/SUNWsamfs/samfs.cmd
fs = qfs1
#parameter enables overlapped (concurrent) I/O to any file from multiple threads
qwrite
#parameter disables tracing code in QFS for added performance
notrace
#shared writer causes diligent flushing of cached file pages and metadata.
#only one host should mount a file system as the writer (buyer beware).
shared_writer

fs = qfs2
#shared reader causes diligent checking of inodes and invalidation of file
#memory pages; assures a coherent file view on the reading host at file open.
shared_reader
#invalidate interval in seconds; cached inode trusted on file open, default 0
invalidate_interval = n

#/etc/system
# Note this option can reduce shared_reader performance; use only when needed
set samfs:refresh_at_eof = 1

J) Validate and repeat. The first tuning pass is likely to be educational and to get relatively close to the best configuration, but further analysis and tuning will often improve performance even more. Keep refining until the speed is optimized.
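
A simple, illustrative measurement habit for each pass (the output file names are placeholders) keeps runs comparable:

# iostat -xn 5 > run1.iostat &
# sar -d 5 60 > run1.sar
(exercise the application's read and write modes, then compare the captured output against the previous run)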
