
 dm-log-writes
 =============

 This target takes two devices: one that all IO is passed to normally, and one
 that all of the write operations are logged to.  It is intended for file system
 developers who want to verify the integrity of metadata or data as the file
 system is written to.  A log_write_entry is written for every WRITE request,
 and the target can also take arbitrary data from userspace and insert it into
 the log.  The data in each WRITE request is copied into the log so that the
 replay happens exactly as the writes originally happened.

 Log Ordering
 ============

 We log things in order of completion once we are sure the write is no longer in
 cache.  This means that normal WRITE requests are not actually logged until the
 next REQ_PREFLUSH request.  This lets userspace replay the log in a way that
 correlates to what is on disk rather than what is in cache, which makes it
 easier to detect improper waiting/flushing.

 This works by attaching all WRITE requests to a list once the write completes.
 Once we see a REQ_PREFLUSH request we splice this list onto the request and once
 the FLUSH request completes we log all of the WRITEs and then the FLUSH.  Only
 completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
 simulate the worst case scenario with regard to power failures.  Consider the
 following example (W means write, C means complete):

 W1,W2,W3,C3,C2,Wflush,C1,Cflush

 The log would show the following:

 W3,W2,flush,W1....

 Again, this simulates what is actually on disk, which allows us to detect
 cases where a power failure at a particular point in time would create an
 inconsistent file system.
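
 The ordering rule above can be modelled in a few lines of shell.  This is only
 a toy sketch, not the kernel implementation: the function name simulate_log
 and the event names (W<n>, C<n>, Wflush, Cflush) are made up for illustration
 and simply mirror the notation of the example.

```shell
# Toy model of the log ordering rule; NOT the kernel code.
# Events: W<n> = write n issued, C<n> = write n completed,
#         Wflush = flush issued, Cflush = flush completed.
simulate_log() {
    completed=""   # completed writes not yet covered by a flush
    spliced=""     # writes spliced onto the in-flight flush
    log=""         # resulting log order
    for ev in "$@"; do
        case "$ev" in
            Wflush) spliced=$completed; completed="" ;;
            Cflush) log="$log$spliced flush"; spliced="" ;;
            C*)     completed="$completed W${ev#C}" ;;
            W*)     : ;;   # issuing a write logs nothing by itself
        esac
    done
    echo "${log# }"
}

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush   # prints "W3 W2 flush"
```

 W1 completed after the flush was issued, so in this model it stays pending and
 would only be logged by a later flush, matching the "W1...." that trails the
 log in the example.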

 Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
 they complete, because those requests bypass the device cache.

 Any REQ_DISCARD requests are treated like WRITE requests.  Otherwise the log
 would contain all the DISCARD requests first, then the WRITE requests, and then
 the FLUSH request.  Consider the following example:

 WRITE block 1, DISCARD block 1, FLUSH

 If we logged DISCARD when it completed, the replay would look like this:

 DISCARD 1, WRITE 1, FLUSH

 which isn't quite what happened and wouldn't be caught during the log replay.
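
 The difference between the two policies can be shown with another toy model.
 This is a sketch under simplifying assumptions, not the kernel code: requests
 are assumed to complete in the order given, and the function name log_order
 and its "immediate"/"deferred" modes are hypothetical names for illustration.

```shell
# Toy comparison of two DISCARD logging policies; NOT the kernel code.
# $1 = "immediate" (log DISCARD as soon as it completes) or
#      "deferred"  (treat DISCARD like a WRITE, as the target does).
log_order() {
    mode=$1; shift
    pending=""   # requests waiting for the next FLUSH
    log=""       # resulting log order
    for op in "$@"; do
        case "$op" in
            FLUSH)
                log="$log$pending FLUSH"; pending="" ;;
            DISCARD*)
                if [ "$mode" = immediate ]; then
                    log="$log $op"
                else
                    pending="$pending $op"
                fi ;;
            *)
                pending="$pending $op" ;;
        esac
    done
    echo "${log# }"
}

log_order immediate WRITE1 DISCARD1 FLUSH   # prints "DISCARD1 WRITE1 FLUSH"
log_order deferred  WRITE1 DISCARD1 FLUSH   # prints "WRITE1 DISCARD1 FLUSH"
```

 Only the "deferred" order leaves block 1 discarded after replay, which is what
 actually happened on the original device.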

 Target interface
 ================

 i) Constructor

    log-writes <dev_path> <log_dev_path>

    dev_path     : Device that all of the IO will go to normally.
    log_dev_path : Device where the log entries are written to.

 ii) Status

     <#logged entries> <highest allocated sector>

     #logged entries          : Number of entries logged so far
     highest allocated sector : Highest sector allocated on the log device

 iii) Messages

     mark <description>

 	You can use a dmsetup message to set an arbitrary mark in a log.
 	For example, say you want to fsck a file system after every
 	write, but first you need to replay up to the mkfs to make sure
 	you are fsck'ing something reasonable.  You would do something
 	like this:

 	  mkfs.btrfs -f /dev/mapper/log
 	  dmsetup message log 0 mark mkfs
 	  <run test>

 	This would allow you to replay the log up to the mkfs mark and
 	then replay from that point on, doing the fsck check at the
 	interval that you want.

 	Every log has a mark at the end labeled "dm-log-writes-end".
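
 The Status line from ii) above can be picked apart in a script, for instance
 to watch how many entries have been logged during a test.  The sketch below
 assumes the usual "dmsetup status <name>" layout, in which the target fields
 follow "<start> <length> <target type>".  The function name
 parse_log_writes_status and the sample numbers are made up for illustration.

```shell
# Sketch: extract the log-writes status fields from one status line on stdin.
# Assumed layout: <start> <length> log-writes <#logged entries> <highest sector>
parse_log_writes_status() {
    read -r _start _length target entries highest || return 1
    # Refuse lines that belong to some other target type.
    [ "$target" = "log-writes" ] || return 1
    echo "entries=$entries highest_sector=$highest"
}

# Illustrative input only; on a live system you would instead run:
#   dmsetup status log | parse_log_writes_status
echo "0 2097152 log-writes 42 1024" | parse_log_writes_status
# prints "entries=42 highest_sector=1024"
```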

 Userspace component
 ===================

 There is a userspace tool that will replay the log for you in various ways.
 It can be found here: https://github.com/josefbacik/log-writes

 Example usage
 =============

 Say you want to test fsync on your file system.  You would do something like
 this:

 TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
 dmsetup create log --table "$TABLE"
 mkfs.btrfs -f /dev/mapper/log
 dmsetup message log 0 mark mkfs

 mount /dev/mapper/log /mnt/btrfs-test
 <some test that does fsync at the end>
 dmsetup message log 0 mark fsync
 md5sum /mnt/btrfs-test/foo
 umount /mnt/btrfs-test

 dmsetup remove log
 replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
 mount /dev/sdb /mnt/btrfs-test
 md5sum /mnt/btrfs-test/foo
 <verify that the two md5sums match>

 Another option is to do a complicated file system operation and verify the file
 system is consistent during the entire operation.  You could do this with:

 TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
 dmsetup create log --table "$TABLE"
 mkfs.btrfs -f /dev/mapper/log
 dmsetup message log 0 mark mkfs

 mount /dev/mapper/log /mnt/btrfs-test
 <fsstress to dirty the fs>
 btrfs filesystem balance /mnt/btrfs-test
 umount /mnt/btrfs-test
 dmsetup remove log

 replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
 btrfsck /dev/sdb
 replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
 	--fsck "btrfsck /dev/sdb" --check fua

 That will replay the log until it sees a FUA request, run the fsck command,
 and, if the fsck passes, replay to the next FUA, and so on until the log is
 completed or the fsck command exits abnormally.