A Linux bug you can now understand

Hi all students from INFO0940, I came across a bug which may ring multiple bells for you (sorry, the article is in French):

http://www.developpez.com/actu/85653/Linux-decouverte-d-un-bug-sur-le-systeme-de-fichiers-EXT4-qui-pourrait-causer-une-importante-perte-de-donnees/

The bug description should remind you of an old ghost (translated from the article):
“The variable ‘sector’ in ‘raid0_make_request()’ was not correctly updated by the call to ‘sector_div()’, which modifies its first argument in place. The [previous] commit restored this variable after the call for later use. Unfortunately, the restoration was done after the variable ‘bio’ had been advanced.”

I know multiple people had problems with dividing sectors and using sector_div(). Fortunately, most of you used it correctly! Maybe you should help the others :p I wonder whether the bug could happen with any file system: raid0_make_request() belongs to the MD subsystem and has nothing to do with ext4 itself, so the article is probably just poorly written and the bug, although observed with ext4, affects any file system sitting on an MD RAID0 device.
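As a reminder, and this is exactly what bit the raid0 code: sector_div(a, b) is not a plain division, it divides its 64-bit first argument in place (a becomes the quotient) and returns the remainder. A minimal sketch of the correct usage (the helper name is made up):

[code lang="c"]#include <linux/kernel.h>
#include <linux/genhd.h> /* sector_div() */

/* "print_chunk_position" is a made-up helper, not the real raid0 code.
 * sector_div(a, b) divides its first argument IN PLACE (a becomes the
 * quotient) and returns the remainder. */
static void print_chunk_position(sector_t sector, unsigned int chunk_sects)
{
	sector_t chunk = sector; /* work on a copy, or 'sector' is lost */
	sector_t offset = sector_div(chunk, chunk_sects);

	/* chunk  == sector / chunk_sects  (computed in place) */
	/* offset == sector % chunk_sects  (the return value)  */
	pr_info("sector %llu -> chunk %llu, offset %llu\n",
		(unsigned long long)sector,
		(unsigned long long)chunk,
		(unsigned long long)offset);
}[/code]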

The MD personality

The structure “struct md_personality” is the one which contains the information about a RAID level. It is defined in md.h. You’ll find below some explanations about each member of the structure.

[code lang="c"]

char *name; //Name of the level
int level; //Number related to the level; negative levels are for "special" ones, for example the linear level is -1
struct list_head list; //Linked list of all MD personalities
struct module *owner; //Pointer to the Linux module managing this level

/** Do a block I/O operation
 * @param mddev structure representing the array
 * @param bio block I/O representing the operation to do
 */
void (*make_request)(struct mddev *mddev, struct bio *bio);

/** Build and assemble a RAID array
 * @param mddev structure representing the array
 */
int (*run)(struct mddev *mddev);

/** Disassemble an array
 * @param mddev structure representing the array
 */
int (*stop)(struct mddev *mddev);

/** Write some status information to a file
 * @param seq virtual file
 * @param mddev structure representing the array
 */
void (*status)(struct seq_file *seq, struct mddev *mddev);

/** Function called when an error occurs in the MD subsystem.
 * The usual way to run it is to call md_error() inside an MD module, which will set some flags, start the recovery thread, and call this error_handler
 * @param mddev structure representing the array
 * @param rdev the physical drive responsible for the failure, if applicable
 */
void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);

/** Add a disk while the array is still running
 * @param mddev structure representing the array
 * @param rdev the physical disk to add
 */
int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);

/** Remove a disk while the array is still running
 * @param mddev structure representing the array
 * @param number index of the disk to remove
 */
int (*hot_remove_disk) (struct mddev *mddev, int number);

/** Return the number of active spare drives.
 * @param mddev structure representing the array
 */
int (*spare_active) (struct mddev *mddev);

/** Handle a re-synchronization request for some sector
 * @param mddev structure representing the array
 * @param sector_nr Sector to synchronize
 * @param skipped Will be set to 1 if the sync was skipped
 * @param go_faster If false, will sleep interruptible to throttle resync
 */ 
sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster);

/** Resize an array
 * @param mddev structure representing the array
 * @param sectors new size
 */
int (*resize) (struct mddev *mddev, sector_t sectors);

/** Return the size of the array. If sectors and raid_disks are not zero, size is computed using those numbers instead of the real ones.
 */
sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks);

/** Reshape an array after adding or removing some disks
 */
int (*check_reshape) (struct mddev *mddev);

// unused
int (*start_reshape) (struct mddev *mddev);
void (*finish_reshape) (struct mddev *mddev);

/** quiesce moves between quiescence states
 * 0 - fully active
 * 1 - no new requests allowed
 * others - reserved
 */
void (*quiesce) (struct mddev *mddev, int state);

/** takeover is used to transition an array from one
 * personality to another.  The new personality must be able
 * to handle the data in the current layout.
 * e.g. 2drive raid1 -> 2drive raid5
 *      ndrive raid5 -> degraded n+1drive raid6 with special layout
 * If the takeover succeeds, a new 'private' structure is returned.
 * This needs to be installed and then ->run used to activate the
 * array.
 */
void *(*takeover) (struct mddev *mddev);

[/code]
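To give an idea of how this structure is used, here is a minimal sketch of a module registering its own personality. The “myraid” names and the level number are made up for the example; register_md_personality() and unregister_md_personality() are the real entry points exported by md.c:

[code lang="c"]#include <linux/module.h>
#include "md.h"

/* forward declarations of the callbacks (bodies omitted in this sketch) */
static void myraid_make_request(struct mddev *mddev, struct bio *bio);
static int myraid_run(struct mddev *mddev);
static int myraid_stop(struct mddev *mddev);
static void myraid_status(struct seq_file *seq, struct mddev *mddev);
static sector_t myraid_size(struct mddev *mddev, sector_t sectors, int raid_disks);

static struct md_personality myraid_personality = {
	.name         = "myraid",
	.level        = -8, /* made-up level number for the example */
	.owner        = THIS_MODULE,
	.make_request = myraid_make_request,
	.run          = myraid_run,
	.stop         = myraid_stop,
	.status       = myraid_status,
	.size         = myraid_size,
};

static int __init myraid_init(void)
{
	return register_md_personality(&myraid_personality);
}

static void __exit myraid_exit(void)
{
	unregister_md_personality(&myraid_personality);
}

module_init(myraid_init);
module_exit(myraid_exit);
MODULE_LICENSE("GPL");[/code]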

Limiting the incoming Block I/O requests to a device driver/md device

When implementing a device driver or an MD device which can receive block I/O (struct bio in the kernel), you can receive BIOs of nearly any size, with any number of segments (segments are discontiguous parts of a common buffer, defined in a bio request). You may want to limit:

– The maximal number of segments you can receive, with
[code lang="c"]blk_queue_max_segments(queue, X);[/code]
where X is the maximal number of segments per struct bio

– The maximal size of the request, with
[code lang="c"]blk_queue_max_hw_sectors(queue, Y);[/code]
where Y is the maximal size in sectors

For an MD device, the queue can be retrieved with mddev->queue.

The combination of the two ensures that all bio requests always have at most X segments and a maximal size of Y sectors.

It is used in raid0 with Y = mddev->chunk_sectors to ensure that no request is bigger than one chunk, so any request crosses at most one chunk boundary. And with X = 1, it allows using the bio_split function to split a request which would span the two sides of a chunk boundary.
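For instance, here is a sketch of what a personality can do in its run() callback, similar to what raid0 does (“myraid_run” is a placeholder name):

[code lang="c"]#include <linux/blkdev.h>
#include "md.h"

/* sketch: set the queue limits when assembling the array */
static int myraid_run(struct mddev *mddev)
{
	/* X = 1: at most one segment per bio, so bio_split() is always usable */
	blk_queue_max_segments(mddev->queue, 1);
	/* Y = chunk_sectors: no request bigger than one chunk */
	blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
	return 0;
}[/code]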

Creating a dynamic and redundant array with LVM and MDADM

RAID5 allows creating an array of N+1 drives where N is the number of drives which will contain real data. The last drive is used to store parity about the other drives (in practice, the parity information is stored by chunks across all drives and not only on one drive). RAID5 allows losing any one of the drives without losing data, thanks to the parity, and has a cheaper cost than RAID1, where the usable space would be N/2 drives instead of N-1.

MDADM is the tool of choice to build a RAID5 array. Given 3 disks, the command to build a RAID5 array is:

[code lang="bash"]mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1[/code]

The problem is that RAID5 arrays are not easily splittable/shrinkable/resizable: the operation is complex and must be done offline. The solution is to use LVM on top of MDADM to build a big volume group which will be “protected” by RAID5, allowing dynamic partitions to be created on it:

[code lang="bash"]pvcreate /dev/md0
vgcreate group0 /dev/md0[/code]

And then create multiple, online-resizable partitions with:

[code lang="bash"]lvcreate /dev/group0 -n system -L 10G
mkfs.ext4 /dev/mapper/group0-system[/code]

[code lang="bash"]lvcreate /dev/group0 -n home -L 50G
mkfs.ext4 /dev/mapper/group0-home[/code]

To resize a partition, one can do:

[code lang="bash"]lvresize /dev/mapper/group0-home -L +10G
resize2fs /dev/mapper/group0-home[/code]

This will add 10G to the logical volume and then grow the filesystem to match. It works even on the system partition, without needing any reboot.


Block I/O caching

In Linux, blocks resulting from BIO requests are cached, so further reads are served from memory instead of re-reading the data from the disk. To clean this cache, one can use echo 3 > /proc/sys/vm/drop_caches.
In the same way, writes to the filesystem are not done immediately. Imagine a software writing byte per byte to the filesystem: if one “write” BIO request were created per byte, it would be very slow.
Instead, filesystems wait before creating and sending the BIO write requests, and this explains why, after writing a file to the disk, some write BIO requests are still passing through the block layer even after the writing software says it has finished, or has even exited.
To force the “real” write of all pending writes in the FS, one can use the sync command, already installed on all Linux systems.
This is why you have to unmount USB drives, by the way: even if the copy seems finished, it may not really be, because the writes are still pending and not yet written “for real”.
Using the combination of the two (sync, then flush the cache) will ensure that all data is written and that further reads will be done from the disk, not from memory. This is very important to test that a disk driver or an MD RAID array is working well.
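In practice, the recipe before re-reading data to check what your driver really wrote to the disk is:

[code lang="bash"]sync                               # force all pending writes to hit the disk
echo 3 > /proc/sys/vm/drop_caches  # then drop the page cache (run as root)[/code]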

Using a thread to do the swapping in assignment #2 is now a bonus

As a lot of the INFO-0940 students seem to have trouble with this part and struggle with time, you can do the swapping of chunks in the indirection table directly inside make_request. Doing it in another thread will be considered a bonus.

More than a bonus, it will be needed for assignment #3, so doing it in a separate thread is not losing time, on the contrary… But focus on the sysfs entries and on having a working indirection table first! (and a working RAID module, of course…)
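For those going for the bonus anyway, here is a minimal sketch of the thread approach using kthread_run(); all the “myraid” names and do_one_swap() are made up here, plug in your own swapping logic:

[code lang="c"]#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>
#include "md.h"

static struct task_struct *swap_task;

static bool do_one_swap(struct mddev *mddev); /* your swapping logic */

static int myraid_swapd(void *data)
{
	struct mddev *mddev = data;

	while (!kthread_should_stop()) {
		/* perform one pending swap; back off when there is none */
		if (!do_one_swap(mddev))
			msleep_interruptible(100);
	}
	return 0;
}

/* called from run(): start the worker thread */
static int myraid_start_swapd(struct mddev *mddev)
{
	swap_task = kthread_run(myraid_swapd, mddev, "myraid_swapd");
	return IS_ERR(swap_task) ? PTR_ERR(swap_task) : 0;
}

/* called from stop(): wait for the thread to exit */
static void myraid_stop_swapd(void)
{
	if (!IS_ERR_OR_NULL(swap_task))
		kthread_stop(swap_task);
}[/code]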

About sysfs entries, I added this post : https://www.tombarbette.be/sysfs-entries/

There will be no deadline extension for assignment #2. Remember that this time you have to use the RUN submission platform (http://submit.run.montefiore.ulg.ac.be) to send your project, and that you can submit any number of times until the deadline. A script will run automatically to tell you if your archive was good (patch, …). If it wasn’t, your project will not be considered for correction… Better to be late than to get a zero!

MDADM git repository

If by any chance *someone* would need to edit mdadm for assignment #2, here is the git repository:

git://neil.brown.name/mdadm

The function Create() in Create.c takes care of creating a new array; the mapping_t pers[] structure in maps.c takes care of mapping “RAID level names” to level numbers. Those numbers are defined in mdadm.h. Of course they should match what’s defined in the kernel in linux/raid/md_u.h…
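The table looks roughly like this (abbreviated sketch, not the exact file contents); supporting a new level boils down to adding an entry here, with the matching number defined in mdadm.h:

[code lang="c"]/* abbreviated sketch of the pers[] table in maps.c */
mapping_t pers[] = {
	{ "linear", LEVEL_LINEAR },
	{ "raid0", 0 },
	{ "raid1", 1 },
	{ "raid5", 5 },
	/* ... */
	{ NULL, 0 }
};[/code]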

To compile mdadm, just type “make”.

Introduce delay between printk (kernel messages) at boot

If you’re developing in the Linux kernel but your system crashes at boot, the boot_delay parameter may be useful. If you added printk messages to see what happens just before the crash, you may use the boot_delay=XXX option to add XXX milliseconds between each printk, allowing you to read each message even when they normally scroll by too quickly.

To add this parameter, press “e” while the right entry is highlighted in the GRUB boot menu, and add boot_delay=XXX at the end of the line starting with “linux”.
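For example (the kernel image and options will differ on your machine, only the boot_delay at the end matters):

[code lang="bash"]linux /boot/vmlinuz-3.13.0-24-generic root=/dev/sda1 ro quiet boot_delay=100[/code]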

This is the “quick” solution. Other posts on this blog describe other solutions, such as making the console go through a serial link in a VM for instance.