Asynchronous Block I/O requests

In the Linux kernel, block I/O requests are asynchronous. When you call submit_bio(READ/WRITE, bio); or generic_make_request(…), the function will (most probably) return immediately, well before the read has completed. So right after calling submit_bio(READ…); you absolutely cannot read the content of a page added with bio_add_page().

So how do you know when it is finished? Use the bio->bi_end_io function pointer: set it to a function which will be called once the read is done.

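[code lang=”c”]/* Completion callback: called by the block layer once the read is done. */
void myReadIsFinished(struct bio *bio, int error) {
    /* Read is finished (error is 0 on success): do something with the bio content */
}

bio->bi_end_io = myReadIsFinished;
submit_bio(READ, bio);[/code]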

bio->bi_private lets you store a pointer of your choice with the bio. Use it in the callback to find out what you were trying to read.
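The callback-plus-private-pointer pattern can be tried outside the kernel. The sketch below is only a user-space analogy, not kernel code: struct fake_bio, fake_submit() and fake_complete() are hypothetical stand-ins for struct bio, submit_bio() and the block layer calling bi_end_io later on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for struct bio. */
struct fake_bio {
    void (*end_io)(struct fake_bio *bio, int error); /* plays the role of bi_end_io */
    void *private_data;                              /* plays the role of bi_private */
    char data[16];                                   /* plays the role of the page */
};

/* Context stored with the request: what we asked for, and the result. */
struct read_ctx {
    int sector;
    int done;
    char first_byte;
};

static struct fake_bio *pending; /* the single request "in flight" */

/* "Submitting" only queues the request: nothing has been read on return. */
static void fake_submit(struct fake_bio *bio)
{
    assert(bio != NULL && bio->end_io != NULL);
    pending = bio;
}

/* Later, the "device" fills the buffer and calls the completion callback. */
static void fake_complete(const char *payload)
{
    struct fake_bio *bio = pending;
    size_t n = strlen(payload);

    assert(bio != NULL);
    pending = NULL;
    if (n >= sizeof(bio->data))
        n = sizeof(bio->data) - 1;
    memcpy(bio->data, payload, n);
    bio->data[n] = '\0';
    bio->end_io(bio, 0);
}

/* The callback recovers its context from the private pointer. */
static void my_read_is_finished(struct fake_bio *bio, int error)
{
    struct read_ctx *ctx = bio->private_data;

    ctx->done = (error == 0);
    ctx->first_byte = bio->data[0];
}
```

The point to notice is that the data is only looked at inside my_read_is_finished(), never right after fake_submit().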

A reminder about variable size structures

Variable-size structures are helpful, because you can allocate a structure ending with an array whose size is unknown at compile time. For example, a “nuda_conf” structure containing information about an MD array, and one “dev_info” structure per disk. The number of disks is unknown at compile time, so the only solution is to use kmalloc/vmalloc. The classical solution is:

[code lang=”c”]

struct dev_info {
    int whatever;
    struct rdev *rdev;
};

struct nuda_conf {
    char *name;
    int whotever;
    struct dev_info *disks; /* allocated separately */
};

...

struct nuda_conf *conf;
conf = kzalloc(sizeof(struct nuda_conf), GFP_KERNEL);
conf->disks = kzalloc(sizeof(struct dev_info) * NUMBER_OF_DISKS, GFP_KERNEL);

[/code]

But variable-size structures need only one allocation:

[code lang=”c”]

struct nuda_conf {
    char *name;
    int whotever;
    struct dev_info disks[0]; /* must be the last member */
};

struct nuda_conf *conf;

conf = kzalloc(sizeof(struct nuda_conf) + sizeof(struct dev_info) * NUMBER_OF_DISKS, GFP_KERNEL);

[/code]

This allocates the memory for the char* and the int of nuda_conf, followed by the memory for NUMBER_OF_DISKS “dev_info” entries. To access them, use conf->disks[N] where N is the disk index.
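The same single-allocation trick can be tested in user space. This is a minimal sketch under a few assumptions: calloc() replaces kzalloc(), the C99 flexible array member disks[] replaces the GNU-style disks[0], and the nr_disks field and nuda_conf_alloc() helper are hypothetical additions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

struct dev_info {
    int whatever;
};

struct nuda_conf {
    char *name;
    int nr_disks;            /* hypothetical field: remembers the array length */
    struct dev_info disks[]; /* C99 flexible array member, same idea as disks[0] */
};

/* One zeroed allocation holds the header and all nr_disks entries. */
static struct nuda_conf *nuda_conf_alloc(int nr_disks)
{
    struct nuda_conf *conf;

    assert(nr_disks >= 0);
    conf = calloc(1, sizeof(*conf) + sizeof(conf->disks[0]) * (size_t)nr_disks);
    if (conf)
        conf->nr_disks = nr_disks;
    return conf;
}
```

A single free(conf) then releases the header and the array together.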

But this is limited to one variable-size array. If you write:

[code lang=”c”]

struct nuda_conf {
    char *name;
    int whotever;
    struct dev_info disks[0];
    struct indirection_line indirection_table[0];
};

[/code]

“indirection_table” and “disks” will refer to the same location, even if you allocate space for both! The solution is either to use a classical pointer as in the first example, or to handle the memory layout yourself:

[code lang=”c”]

struct nuda_conf {
    char *name;
    int whotever;
};
...
conf = kzalloc(sizeof(struct nuda_conf) + sizeof(struct dev_info) * NUMBER_OF_DISKS + sizeof(struct indirection_line) * NUMBER_OF_CHUNKS, GFP_KERNEL);
/* disks starts right after the structure, indirection_table right after the disks */
struct dev_info *disks = (struct dev_info *)(conf + 1);
struct indirection_line *indirection_table = (struct indirection_line *)(disks + NUMBER_OF_DISKS);

[/code]

But there is little benefit in doing so, and both the disks and indirection_table pointers have to be computed by hand each time you need them… The use cases for this are really, really specific (dynamic metadata in packets, for example).
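If you do go that way, one way to limit the pain is to hide the pointer arithmetic behind small accessors. The following is a user-space sketch under assumptions: calloc() stands in for kzalloc(), and the nr_disks/nr_chunks fields, the target_chunk member and the three helper functions are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>

struct dev_info { int whatever; };
struct indirection_line { int target_chunk; };

struct nuda_conf {
    char *name;
    int nr_disks;  /* needed to find where the second array starts */
    int nr_chunks;
};

/*
 * Layout: [nuda_conf][nr_disks * dev_info][nr_chunks * indirection_line].
 * Both element types here only need int alignment; with stricter types,
 * order the arrays from most to least aligned or add padding.
 */
static struct nuda_conf *nuda_conf_alloc(int nr_disks, int nr_chunks)
{
    struct nuda_conf *conf;

    assert(nr_disks >= 0 && nr_chunks >= 0);
    conf = calloc(1, sizeof(*conf)
                     + sizeof(struct dev_info) * (size_t)nr_disks
                     + sizeof(struct indirection_line) * (size_t)nr_chunks);
    if (conf) {
        conf->nr_disks = nr_disks;
        conf->nr_chunks = nr_chunks;
    }
    return conf;
}

/* First array: starts right after the structure itself. */
static struct dev_info *conf_disks(struct nuda_conf *conf)
{
    return (struct dev_info *)(conf + 1);
}

/* Second array: starts right after the last dev_info. */
static struct indirection_line *conf_table(struct nuda_conf *conf)
{
    return (struct indirection_line *)(conf_disks(conf) + conf->nr_disks);
}
```

The pointers are still recomputed on every access, but at least the computation lives in one place.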

The MD personality

The structure “struct md_personality” contains the information about a RAID level. It is defined in md.h. Below are some explanations of each member of the structure.

[code lang=”c”]

char *name; //Name of the level
int level; //Number related to the level; negative levels are for "special" ones, for example the linear level is -1
struct list_head list; //Linked list of all MD personalities.
struct module *owner; //Pointer to the Linux module managing this level

/** Do a block I/O operation
 * @param mddev structure representing the array
 * @param bio Block I/O representing the operation to do
 */
void (*make_request)(struct mddev *mddev, struct bio* bio);

/** Build and assemble a RAID array
 * @param mddev structure representing the array
 */
int (*run)(struct mddev *mddev);

/** Disassemble an array
 * @param mddev structure representing the array
 */
int (*stop)(struct mddev *mddev);

/** Write some status information to a file
 * @param seq virtual file
 * @param mddev structure representing the array
 */
void (*status)(struct seq_file *seq, struct mddev *mddev);

/** Function called when an error occurs in the MD subsystem.
 * The usual way to run it is to call md_error() inside an md module, which will set some flags and start the recovery thread, and call this error_handler
 * @param mddev structure representing the array
 * @param rdev The physical drive responsible for the failure if applicable
 */
void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);

/** Add a disk while the array is still running
 * @param mddev structure representing the array
 * @param rdev the physical disk to add
 */
int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);

/** Remove a disk while the array is still running
 * @param mddev structure representing the array
 * @param number index of the disk to remove
 */
int (*hot_remove_disk) (struct mddev *mddev, int number);

/** Return the number of active spare drives.
 * @param mddev structure representing the array
 */
int (*spare_active) (struct mddev *mddev);

/** Handle a re-synchronization request for some sector
 * @param mddev structure representing the array
 * @param sector_nr Sector to synchronize
 * @param skipped Will be set to 1 if the sync was skipped
 * @param go_faster If false, will sleep interruptible to throttle resync
 */ 
sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster);

/** Resize an array
 * @param mddev structure representing the array
 * @param sectors new size
 */
int (*resize) (struct mddev *mddev, sector_t sectors);

/** Return the size of the array. If sectors and raid_disks are not zero, size is computed using those numbers instead of the real ones.
 */
sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks);

/** check_reshape: reshape an array after adding or removing some disks
 */
int (*check_reshape) (struct mddev *mddev);

// unused
int (*start_reshape) (struct mddev *mddev);
void (*finish_reshape) (struct mddev *mddev);

/** quiesce moves between quiescence states
 * 0 - fully active
 * 1 - no new requests allowed
 * others - reserved
 */
void (*quiesce) (struct mddev *mddev, int state);

/** takeover is used to transition an array from one
 * personality to another.  The new personality must be able
 * to handle the data in the current layout.
 * e.g. 2drive raid1 -> 2drive raid5
 *      ndrive raid5 -> degraded n+1drive raid6 with special layout
 * If the takeover succeeds, a new 'private' structure is returned.
 * This needs to be installed and then ->run used to activate the
 * array.
 */
void *(*takeover) (struct mddev *mddev);

[/code]
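To see how such a table of function pointers is used, here is a deliberately simplified user-space sketch. Everything in it is hypothetical (struct array, struct personality, find_personality(), start_array()); it only mimics the idea of picking a personality by its level number and calling its run hook, and it is not how the md core itself is written.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mddev. */
struct array {
    int level;
    int running;
};

/* Cut-down personality: a name, a level and one hook. */
struct personality {
    const char *name;
    int level;
    int (*run)(struct array *a);
};

static int linear_run(struct array *a) { a->running = 1; return 0; }
static int raid0_run(struct array *a)  { a->running = 1; return 0; }

/* The "registered" personalities; linear uses the special level -1. */
static const struct personality personalities[] = {
    { "linear", -1, linear_run },
    { "raid0",   0, raid0_run  },
};

/* Look a personality up by level, or return NULL if none is registered. */
static const struct personality *find_personality(int level)
{
    size_t i;

    for (i = 0; i < sizeof(personalities) / sizeof(personalities[0]); i++)
        if (personalities[i].level == level)
            return &personalities[i];
    return NULL;
}

/* Assemble an array by delegating to the personality of its level. */
static int start_array(struct array *a)
{
    const struct personality *p = find_personality(a->level);

    return p ? p->run(a) : -1;
}
```

The caller never needs to know which RAID level it is dealing with: it only goes through the hooks.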

Limiting the incoming Block I/O requests to a device driver/md device

When implementing a device driver or an MD device which can receive block I/O (struct bio in the kernel), you can receive bios of nearly any size, with any number of segments (segments are discontiguous parts of a common buffer, defined in a bio request). You may want to limit:

– The number of segments you can receive with
[code lang=”c”]blk_queue_max_segments(queue, X);[/code]
Where X is the number of segments per struct bio

– The maximal size of the request :
[code lang=”c”]blk_queue_max_hw_sectors(queue, Y);[/code]
Where Y is the maximal size in sectors

For an md device, the queue can be retrieved with mddev->queue.

Combining the two ensures that every bio request has at most X segments and a maximal size of Y sectors.

It is used in raid0 with Y=mddev->chunk_sectors to ensure that no request is bigger than one chunk, so any request crosses at most one chunk boundary. And with X=1, it allows the use of the bio_split function to split a request which spans both sides of a chunk boundary.
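The arithmetic behind "crosses at most one chunk boundary" is easy to check in user space. This is a sketch, not the raid0 code: sector_t is redefined here as a plain integer type, and sectors_to_chunk_end() and split_point() are hypothetical helper names.

```c
#include <assert.h>

typedef unsigned long long sector_t; /* user-space stand-in for the kernel type */

/* Number of sectors from 'start' up to the end of the chunk containing it. */
static sector_t sectors_to_chunk_end(sector_t start, sector_t chunk_sectors)
{
    assert(chunk_sectors > 0);
    return chunk_sectors - (start % chunk_sectors);
}

/*
 * Returns 0 if the request [start, start + len) stays inside one chunk,
 * otherwise the number of sectors that belong to the first chunk, which
 * is where the request has to be split.
 */
static sector_t split_point(sector_t start, sector_t len, sector_t chunk_sectors)
{
    sector_t room = sectors_to_chunk_end(start, chunk_sectors);

    return len > room ? room : 0;
}
```

With 128-sector chunks, a 64-sector request starting at sector 100 has to be split after 28 sectors. Because the request is never longer than one chunk, the remaining 36 sectors always fit in the next chunk, so one split is enough.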

Creating a dynamic and redundant array with LVM and MDADM

RAID 5 lets you create an array of N+1 drives where N is the number of drives' worth of real data. The equivalent of one drive is used to store parity about the others (in practice, the parity information is stored by chunks across all drives and not only on one drive). Thanks to this parity, RAID 5 can lose any one of the drives without losing data, and it is cheaper than RAID 1: with N+1 drives, RAID 5 leaves N drives of usable space, whereas mirroring leaves at most half of the raw capacity.
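The parity itself is nothing more than a byte-wise XOR of the data chunks, which is why any single missing chunk can be recomputed from the others. Here is a small user-space sketch of that idea; xor_chunks() is a hypothetical helper, not something taken from the md code.

```c
#include <assert.h>
#include <stddef.h>

/* XOR the first 'len' bytes of every chunk in 'chunks' into 'out'. */
static void xor_chunks(const unsigned char *const *chunks, size_t nr_chunks,
                       unsigned char *out, size_t len)
{
    size_t i, c;

    assert(chunks != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        unsigned char acc = 0;

        for (c = 0; c < nr_chunks; c++)
            acc ^= chunks[c][i];
        out[i] = acc;
    }
}
```

Calling xor_chunks() on the data chunks gives the parity chunk. Calling it on the surviving data chunks plus the parity chunk gives back the chunk that was lost.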

MDADM is the tool of choice to build a RAID 5 array. Given 3 disks, the command to build a RAID 5 array is:

[code lang=”bash”]mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1[/code]

The problem is that RAID 5 arrays are not easily splittable/shrinkable/resizable: the operation is complex and must be done offline. The solution is to use LVM on top of MDADM to build a big volume group which is “protected” by RAID 5 and on which you can create dynamic partitions:

[code lang=”bash”]pvcreate /dev/md0
vgcreate group0 /dev/md0[/code]

And then create multiple, online-resizable partitions with:

[code lang=”bash”]lvcreate /dev/group0 -n system -L 10G
mkfs.ext4 /dev/mapper/group0-system[/code]

[code lang=”bash”]lvcreate /dev/group0 -n home -L 50G
mkfs.ext4 /dev/mapper/group0-home[/code]

To resize a partition, one can do :

[code lang=”bash”]lvresize /dev/mapper/group0-home -L +10G
resize2fs /dev/mapper/group0-home[/code]

This adds 10G to the logical volume, then grows the filesystem to match. It works even on the system partition, without any reboot.

 

Block I/O caching

In Linux, blocks resulting from BIO requests are cached, so further reads are served from memory instead of re-reading data from the disk. To clear this cache, one can use echo 3 > /proc/sys/vm/drop_caches .
In the same way, writes to the filesystem are not performed immediately. The filesystem waits before turning written data into a “write” BIO request. Imagine a program writing to the filesystem byte by byte: one write request would be created per byte, which would be very slow.
Instead, filesystems wait before creating and sending the BIO write requests. This explains why, after writing a file to the disk, some write BIO requests still pass through the block layer even after the writing program says it has finished, or has even exited.
To force the “real” write of all pending writes in the FS, one can use the “sync” program, already installed on all Linux systems.
This is why you have to unmount USB drives, by the way: even if the copy seems finished, it may not be, because some writes are still pending and not yet written “for real”.
Combining the two (sync, then drop the cache) ensures that all data is written and that further reads are served from the disk and not from memory. This is very important when testing that a disk driver or an MD RAID array works well.
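If your test program is written in C, the same sequence can be done without spawning a shell. This is a minimal sketch: drop_caches() and flush_and_drop() are hypothetical helper names, the path is passed as a parameter so the code can be tried on an ordinary file, and writing to the real /proc/sys/vm/drop_caches requires root.

```c
#include <stdio.h>
#include <unistd.h>

/* Write "3" to the given drop_caches file. Returns 0 on success, -1 on error. */
static int drop_caches(const char *path)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    if (fputs("3\n", f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Flush pending writes first, then ask the kernel to drop its caches. */
static int flush_and_drop(const char *path)
{
    sync(); /* same effect as running the "sync" program */
    return drop_caches(path);
}
```

A test would typically call flush_and_drop("/proc/sys/vm/drop_caches") between writing the data and reading it back.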

Using a thread to do the swapping in assignment #2 is now a bonus

As a lot of the INFO-0940 students seem to have trouble with this part and are struggling with time, you can do the swapping of chunks in the indirection table directly inside make_request. Doing it in another thread will be considered as a bonus.

Beyond the bonus, it will be needed for assignment #3, so doing it in a separate thread is not a waste of time, on the contrary… But focus on sysfs entries and on having a working indirection table first! (and a working raid module, of course…)

About sysfs entries, I added this post : https://www.tombarbette.be/sysfs-entries/

The deadline for assignment #2 will not be postponed. Remember that this time you have to use the RUN submission platform (http://submit.run.montefiore.ulg.ac.be) to send your project, and that you can submit any number of times until the deadline. A script will automatically run to tell you if your archive was good (patch, …). If it wasn’t, your project will not be considered for correction… Better to be late than to get a zero!

sysfs entries

Entries in the /sys folder are represented by “struct kobject”.

A kobject is … a kernel object, so it can be anything. As far as /sys is concerned, a kobject can more or less be thought of as a “folder” of /sys. To obtain the kobject of an md device, one can use &disk_to_dev(mddev->gendisk)->kobj. This object handles the “folder” /sys/block/mdX/ where X is the number of the md device.

To initialize your own kobject and add it as a child of the kobject of the mddev, one can use kobject_init_and_add(yourobject, &yourobject_ktype, &disk_to_dev(mddev->gendisk)->kobj, "%s", "foldername"); In practice this will create a folder named “foldername” in /sys/block/mdX/.

“yourobject” should be a pointer to your kobject. It must be persistent: allocated with kmalloc or something similar, not on your function's stack, even if you don’t plan to modify it. As kobjects can be more or less anything, you have to describe yours by passing a struct kobj_type to kobject_init_and_add:

[code lang=”c”]
static struct kobj_type myobject_ktype = {
    .release = release_function,
    .sysfs_ops = &myobject_sysfs_ops,
};
[/code]

release is the function which will be called when it’s time to destroy your kobject, while sysfs_ops is a struct sysfs_ops. We’ll come back to them later.

The “files” in your sys folder are represented by struct attribute. You have two choices here. Either the files in your folder do not change, and you add all the attributes to the “default list” of attributes of your kobject before the kobject initialization, or you start with an empty kobject (or with some default files in the list but not all) and you add files after kobject initialization using sysfs_create_file(yourobject, attr); where attr is a pointer to the attribute that you want to add. We’ll consider that all files are known in advance and we’ll put everything in the “default” list.

The “default” list is really a default array, referenced via yourobject_ktype.default_attrs. It is a pointer to an array of pointers to attributes. Yes, re-read that sentence twice 😉 It means you give an array of pointers to attributes and not an array of attributes.

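[code lang=”c”]
struct attribute **all_attrs;
struct attribute *attrs;
int i;

/* One extra zeroed slot: the array of pointers must end with NULL */
all_attrs = kzalloc(sizeof(struct attribute *) * (number_of_files + 1), GFP_KERNEL);
attrs = kzalloc(sizeof(struct attribute) * number_of_files, GFP_KERNEL);
/* [fill attrs] */
for (i = 0; i < number_of_files; i++)
    all_attrs[i] = &attrs[i];
yourobject_ktype.default_attrs = all_attrs;
[/code]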

If we look at struct attribute, it contains:

[code lang=”c”]
struct attribute {
    const char *name;
    mode_t mode;
};
[/code]

You should find out by yourself how to initialize the attrs. Note that name is a char pointer, and in no way can it store the characters themselves. So you need to allocate all the names of your attributes somewhere, and obviously not on your function's stack…

Now, let’s go back to myobject_sysfs_ops, which describes how to read and write these attributes.

The two main functions in the sysfs_ops are show and store:

[code lang=”c”]
static const struct sysfs_ops myobject_sysfs_ops = {
    .show = myobject_attr_show,
    .store = myobject_attr_store,
};
[/code]

These two functions will be called when any attribute of your kobject is read or written. Let’s look at the show function:

[code lang=”c”]
static ssize_t
myobject_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
{}
[/code]

kobj is the pointer to your kobject, attr is the attribute the user is trying to read, and page is a pointer to a buffer where you should print the content being read.

The function container_of() is very useful and often used in this case. Let’s say your kobject is stored in another structure, which we will call “struct conf” in our example. To recover the conf storing the kobject, one can do:

[code lang=”c”]
struct conf *conf = container_of(kobj, struct conf, kobj);
[/code]

And you can use the attribute name to find what to write in the page.

This should be sufficient for most uses, with a few tricks to respond according to the attribute name.
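Both ideas (recovering the enclosing structure, then answering according to the attribute name) can be tried in user space. The sketch below makes several assumptions: container_of() is a simplified re-implementation without the kernel's type checking, struct kobject and struct attribute are reduced stand-ins, the fields of struct conf are invented for the example, and the buffer size is passed explicitly where the kernel hands you a whole page.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Simplified user-space version of the kernel macro (no type checking). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct kobject { int unused; };         /* reduced stand-in */
struct attribute { const char *name; }; /* reduced stand-in */

/* Hypothetical structure embedding the kobject. */
struct conf {
    int nr_disks;
    int chunk_sectors;
    struct kobject kobj;
};

/* Returns the number of characters written to 'page', or -1 if unknown. */
static long conf_attr_show(struct kobject *kobj, struct attribute *attr,
                           char *page, size_t size)
{
    /* Step back from the embedded kobject to the structure around it. */
    struct conf *conf = container_of(kobj, struct conf, kobj);

    if (strcmp(attr->name, "nr_disks") == 0)
        return snprintf(page, size, "%d\n", conf->nr_disks);
    if (strcmp(attr->name, "chunk_sectors") == 0)
        return snprintf(page, size, "%d\n", conf->chunk_sectors);
    return -1;
}
```

Comparing names works fine for a handful of files; with many attributes the chain of strcmp() calls quickly becomes unwieldy.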

If the processing needs to be specific to each attribute/file, or if you absolutely need to store some data with each attribute, you have to allocate a bigger structure which contains the struct attribute. This explains, by the way, why you give the ktype an array of pointers to attributes and not an array of attributes: the attributes may be scattered inside an array of a bigger structure. As an example, here is the per-file structure of the md driver:

[code lang=”c”]
struct md_sysfs_entry {
    struct attribute attr;
    ssize_t (*show)(struct mddev *, char *);
    ssize_t (*store)(struct mddev *, const char *, size_t);
};
[/code]

The default attributes contain pointers to all the md_sysfs_entry->attr. The show function of the kobject uses container_of() to find the md_sysfs_entry containing the attr. It then calls the specific show or store function of that md_sysfs_entry instead of running the same code for every attribute.
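The dispatch described above can be reproduced in user space. This is a sketch of the pattern rather than the md code: struct my_entry, the two show functions, default_attrs and generic_show() are hypothetical names, struct attribute is a reduced stand-in, and container_of() is a simplified re-implementation.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified user-space version of the kernel macro (no type checking). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct attribute { const char *name; }; /* reduced stand-in */

/* Bigger per-file structure embedding the attribute. */
struct my_entry {
    struct attribute attr;
    int (*show)(char *page, size_t size);
};

static int show_level(char *page, size_t size)
{
    return snprintf(page, size, "raid5\n");
}

static int show_disks(char *page, size_t size)
{
    return snprintf(page, size, "%d\n", 3);
}

static struct my_entry level_entry = { { "level" }, show_level };
static struct my_entry disks_entry = { { "disks" }, show_disks };

/* NULL-terminated array of pointers to the embedded attributes. */
static struct attribute *default_attrs[] = {
    &level_entry.attr,
    &disks_entry.attr,
    NULL,
};

/* Single entry point: find the enclosing entry and call its own show. */
static int generic_show(struct attribute *attr, char *page, size_t size)
{
    struct my_entry *entry = container_of(attr, struct my_entry, attr);

    if (!entry->show)
        return -1;
    return entry->show(page, size);
}
```

Adding a new file then only means declaring one more entry and listing it in the array; generic_show() does not change.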