Step #5 update and Step #4 correction

Please note that the deadline for the Step 5 correction has been moved to Tuesday. I also added a note about the ioctl number: you should define it correctly, but do not check it inside your ioctl, as I cannot automatically find the number you’ll use. I will personally use 1 as the command argument to call your ioctl, so just ignore the command argument in your ioctl. Remember that the mode/dev is given as a structure pointer in the third argument. The command number is normally used to provide multiple functionalities through one function, with a kind of big switch/case inside the ioctl. We only have one: “set fastnet mode”. So it’s not a big deal.

I added a “fake” project on the platform, a clone of Step 4, to allow you to correct your Step 4 code if you want and test it on the platform by re-submitting as many times as you want. I also pushed my code to Gitlab; if you find bugs or problems, tell me! As the project is slowly getting bigger, you may catch things I forgot… I remember saying something in class that I forgot in my code, but I don’t remember what it was…

The Step 5 will come on the platform ASAP…

Remember that you have to completely remove the system call and any definition made for it. But keep the messages from step 2, the credits from step 1, …

A reminder about packets in networks

An Ethernet packet is made of layers, each one nested inside another.

The content of the skbuff created in e1000_clean_rx_irq is the whole ethernet frame. The Ethernet frame starts with an ethernet header, and then contains the ethernet payload.

You do not receive the preamble (it’s just data to mark the beginning of the frame, always the same so useless to copy) and usually not the CRC either as all NICs can check it is correct for you (it will be removed in e1000_main.c:4451 if it’s there.)

Then how do you know what’s in the data? Well, it is given by the two-byte type field at offset 12 of the frame, right after the 6-byte destination and 6-byte source MAC addresses.

The known types are defined in if_ether.h, for example you can find the type of the IP packets there :

#define ETH_P_IP 0x0800 /* Internet Protocol packet */

So you know that somewhere, the “thing” handling the packets of the IP protocol will check that the type equals ETH_P_IP. It will in fact check whether the type is cpu_to_be16(ETH_P_IP), because on the network bytes are big-endian, while our (x86) CPU is little-endian. It means that the bytes 0x08 0x00 found in the frame, read as a native 16-bit integer on a little-endian CPU, give 0x0008 and not 0x0800, because the most significant byte comes first on the wire. There are a lot of packet types, not just IP. Do not expect to find an “if (type == cpu_to_be16(ETH_P_IP))”… The kernel uses a list of structures of known packet types and checks the whole list against the actual packet type.
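To make the byte order story concrete, here is a small userspace sketch (not kernel code). It assumes a raw frame in a plain byte buffer; the helper names frame_ethertype and frame_is_ip are made up for the illustration, and ntohs()/htons() play the role that be16_to_cpu()/cpu_to_be16() play in the kernel.

```c
#include <assert.h>
#include <arpa/inet.h>  /* ntohs(), htons() */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6       /* bytes in a MAC address */
#define ETH_HLEN 14      /* destination + source + type */
#define ETH_P_IP 0x0800  /* same value as in if_ether.h */

/* Return the EtherType in host byte order, or 0 if the frame is too short. */
static uint16_t frame_ethertype(const uint8_t *frame, size_t len)
{
    uint16_t type_be;

    if (len < ETH_HLEN)
        return 0;
    /* The type sits right after the two MAC addresses, at offset 12. */
    memcpy(&type_be, frame + 2 * ETH_ALEN, sizeof(type_be));
    return ntohs(type_be);
}

/* Kernel-style check: convert the constant once, compare in network order. */
static int frame_is_ip(const uint8_t *frame, size_t len)
{
    uint16_t type_be;

    if (len < ETH_HLEN)
        return 0;
    memcpy(&type_be, frame + 2 * ETH_ALEN, sizeof(type_be));
    return type_be == htons(ETH_P_IP);
}
```

Both helpers give the same answer on little-endian and big-endian machines, which is the whole point of going through the conversion functions instead of comparing raw integers.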

The kernel will call the handler defined for the specific protocol matching the IP packet.

The same scheme applies more or less for every layer, as the IP packet itself is also composed of a header with a type, and some data. And inside it there is again a UDP, TCP, ICMP, … packet with again a header and some data.

For example, the type of the payload of the IP packet is given as a single byte (so no byte order problem here), the protocol field, at offset 9 of the IP header (its 10th byte).

As there are only 256 possible types, the Linux kernel uses a table and not a list, as it allows jumping directly to the right “sub” IP layer handler, such as the one for UDP, TCP, …

IP protocols are defined in in.h. For example, we have IPPROTO_UDP = 17, /* User Datagram Protocol */ at line 40, which tells us that 17 is the UDP protocol. Again, a quick search for who uses IPPROTO_UDP in the /net/ipv4 folder will tell us who defines some kind of handler for packets with the IP protocol set to 17. A hint: there is a function which sets the right index of the “protocol table” to the structure containing information about the protocol. So it’s not like the Ethernet layer, where the list contains the type and the function; here the protocol number is not in the handler structure 😉
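The table idea can be sketched in a few lines of userspace C. Everything here (proto_handler, proto_table, proto_register, proto_deliver, toy_udp_rcv) is a hypothetical name for the illustration, not what you will find in /net/ipv4; only the value 17 comes from in.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROTOS 256   /* the IP protocol field is one byte */
#define PROTO_UDP  17    /* same value as IPPROTO_UDP in in.h */

/* Hypothetical handler descriptor: note that it holds no protocol number. */
struct proto_handler {
    int (*handler)(const uint8_t *payload, size_t len);
};

static const struct proto_handler *proto_table[MAX_PROTOS];

/* Registration stores the descriptor at the index given by the protocol. */
static int proto_register(const struct proto_handler *p, uint8_t protocol)
{
    if (proto_table[protocol] != NULL)
        return -1;              /* slot already taken */
    proto_table[protocol] = p;
    return 0;
}

/* Delivery is a direct array lookup, no list walk. */
static int proto_deliver(uint8_t protocol, const uint8_t *payload, size_t len)
{
    const struct proto_handler *p = proto_table[protocol];

    if (p == NULL)
        return -1;              /* nobody handles this protocol */
    return p->handler(payload, len);
}

/* Toy UDP handler: just returns the payload length it was given. */
static int toy_udp_rcv(const uint8_t *payload, size_t len)
{
    (void)payload;
    return (int)len;
}

static const struct proto_handler toy_udp = { .handler = toy_udp_rcv };
```

Calling proto_register(&toy_udp, PROTO_UDP) once and then proto_deliver(PROTO_UDP, data, len) reaches toy_udp_rcv with a single array access, which is why the number only needs to be known at registration time.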

How to find the mkdir syscall

The first question to ask yourself is probably what the mkdir syscall does?

Obviously, it will create a new dir (thanks captain!) but is it only one piece of code for the whole kernel?

If I ask, the answer is no. First, there is one syscall entry per architecture, so searching for mkdir in arch will already show a lot of entries.

But what’s a dir? Something which contains one or multiple files? Yes but no. Conceptually, the main purpose of a directory is to give names to files and allow addressing them. If the name of the file was in the file structure, how would you access it? Without folders you would only have an array of files without names… But I’m getting away from the question (just giving you an exam answer by the way…)

In practice, the dir is implemented differently in every file system. So there is a different mkdir function per file system, and there are a lot of file systems in the Linux kernel… How does the kernel know which one to use? It uses something much like net_device_ops for a network device (OS 2015-2016), md_personality for an md array (OS 2014-2015) or sched_class for a scheduler (OS 2013-2014): a structure of function pointers that the kernel can follow to accomplish some actions.
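If the “structure of function pointers” pattern is still blurry, here is a minimal userspace sketch of it. The toy_fs_ops structure and the three toy file systems are invented for the example and are far simpler than the real kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified stand-in for a per-file-system ops table. */
struct toy_fs_ops {
    const char *name;
    int (*mkdir)(const char *path);
};

/* Each "file system" brings its own implementation. */
static int toy_ext_mkdir(const char *path)  { return (int)strlen(path); }
static int toy_proc_mkdir(const char *path) { (void)path; return -1; }

static const struct toy_fs_ops toy_ext_ops  = { .name = "toyext",  .mkdir = toy_ext_mkdir };
static const struct toy_fs_ops toy_proc_ops = { .name = "toyproc", .mkdir = toy_proc_mkdir };
/* A file system that does not implement mkdir leaves the pointer NULL. */
static const struct toy_fs_ops toy_ro_ops   = { .name = "toyro" };

/* Generic entry point: follows the pointer, whatever the file system is. */
static int generic_mkdir(const struct toy_fs_ops *ops, const char *path)
{
    if (ops->mkdir == NULL)
        return -2;  /* operation not supported by this file system */
    return ops->mkdir(path);
}
```

The generic code never names a specific file system: generic_mkdir(&toy_ext_ops, "/tmp/a") and generic_mkdir(&toy_proc_ops, "/x") run different functions through the same call site.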

So a “grep -Ri mkdir” at the top level of the kernel will give you far too many results…

If you look at the syscall table, you’ll find that we often talk about sys_something, so searching for sys_mkdir could be a good idea…

Could be… Because you won’t find it. Why? You should remember half of my slides, or what I said in class, or … Well, that smells like a macro… Cscope does not support them well, Eclipse does. But Eclipse can only resolve a macro usage, not search in reverse from the macro declarations (as far as I know).

Time to show your regular expression skills! (You could also search for the macro defining the syscall directly…) We should have something containing sys and mkdir on the same line… So let’s search for sys [something] mkdir, or the reverse:

grep --exclude "*.o" -RiE "sys.*mkdir|mkdir.*sys"

--exclude "*.o" avoids searching object files, R makes the search recursive, i makes it case-insensitive, E enables extended regular expressions.

That still gives too many results. Looking quickly through them, you could add --exclude-dir Documentation and --exclude-dir arch, as those two won’t contain the actual implementation. Another way is to search only in the fs folder, as we can guess that the syscall implementation has something to do with file systems… Even if it will call a per-fs function.

Let’s do the latter:

cd fs
grep --exclude "*.o" -RiE "sys.*mkdir|mkdir.*sys"
sysv/namei.c:static int sysv_mkdir(struct inode * dir, struct dentry *dentry, umode_t mode)
sysv/namei.c: .mkdir = sysv_mkdir,
Binary file sysv/sysv.ko matches
tracefs/inode.c:static int tracefs_syscall_mkdir(struct inode *inode, struct dentry *dentry, umode_t mode)
tracefs/inode.c: .mkdir = tracefs_syscall_mkdir,
proc/root.c: proc_mkdir("sysvipc", NULL);
proc/proc_sysctl.c: proc_sys_root = proc_mkdir("sys", NULL);
namei.c:SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
namei.c:SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
namei.c: return sys_mkdirat(AT_FDCWD, pathname, mode);
btrfs/ioctl.c: * sys_mkdirat and vfs_mkdir, but we only do a single component lookup

And it’s right in front of you 😉 The number to use at the end of the macro is quite obvious… But google can help you if you don’t find what it means.
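Why a plain search for sys_mkdir finds nothing can be seen with a toy macro. This is a deliberately simplified sketch: TOY_SYSCALL_DEFINE2 is a made-up name and the real SYSCALL_DEFINEx macros do much more, but the token pasting trick is the same.

```c
#include <assert.h>

/*
 * Toy version of a syscall-defining macro: it pastes "sys_" and the name
 * together, so the text "sys_mkdir" never appears in the source file.
 */
#define TOY_SYSCALL_DEFINE2(name, t1, a1, t2, a2) \
    long sys_##name(t1 a1, t2 a2)

/* Hypothetical body: returns the mode it was given, ignores the path. */
TOY_SYSCALL_DEFINE2(mkdir, const char *, pathname, unsigned int, mode)
{
    (void)pathname;
    return (long)mode;
}
```

After preprocessing, the definition above becomes long sys_mkdir(const char *pathname, unsigned int mode), so other code can call sys_mkdir() even though grep will never see that string next to the function body.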

A Linux bug you can now understand

Hi all students from INFO0940, I came across a bug which may ring multiple bells for you (sorry, the article is in French):

http://www.developpez.com/actu/85653/Linux-decouverte-d-un-bug-sur-le-systeme-de-fichiers-EXT4-qui-pourrait-causer-une-importante-perte-de-donnees/

The bug description should remind you of some old ghosts (translated from the French):
“The variable ‘sector’ in ‘raid0_make_request()’ was not correctly modified by the call to ‘sector_div()’, which modifies its first argument in place. The [previous] commit restored this variable after the call for later use. Unfortunately, the restoration was done after the variable ‘bio’ had been advanced.”

I know multiple people had problems with dividing sectors and using sector_div(). Fortunately, most of you used it correctly! Maybe you should help them :p I wonder if the bug could happen with any file system, as raid0_make_request seems to be related to MD and not to ext4 at all. Maybe the article is poorly written and the bug was just observed with ext4 but affects any md RAID 0 device?
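As a reminder of the trap, here is a userspace model of sector_div(). The kernel version is a macro that takes the variable itself; this sketch, with the made-up name toy_sector_div, takes a pointer instead, but the semantics are the ones that matter: the quotient overwrites the first argument and the remainder is returned.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Model of sector_div(): *n becomes the quotient, the remainder is returned. */
static uint32_t toy_sector_div(sector_t *n, uint32_t base)
{
    uint32_t rem;

    assert(base != 0);
    rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}

/* Safe pattern: divide a copy, so the caller's sector is left untouched. */
static sector_t chunk_index(sector_t sector, uint32_t chunk_sectors)
{
    sector_t tmp = sector;

    (void)toy_sector_div(&tmp, chunk_sectors);
    return tmp;
}

/* Same idea for the offset inside the chunk: only the remainder is kept. */
static uint32_t chunk_offset(sector_t sector, uint32_t chunk_sectors)
{
    sector_t tmp = sector;

    return toy_sector_div(&tmp, chunk_sectors);
}
```

With 128-sector chunks, sector 1000 is in chunk 7 at offset 104. If you call the division directly on the variable you still need afterwards, it silently becomes 7.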

A reminder about variable size structures

Variable-size structures are helpful, because you can allocate a structure ending with an array whose size is unknown at compile time. For example, a “nuda_conf” structure containing information about an MD array, and one “dev_info” structure per disk. The number of disks is unknown at compile time, so the only solution is to use kmalloc/vmalloc. The classical solution is to do:

[code lang=”c”]
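struct dev_info {
	int whatever;
	struct rdev *rdev;
};

struct nuda_conf {
	char *name;
	int whotever;
	struct dev_info *disks; /* separately allocated array */
};

...

struct nuda_conf *conf;
/* Two allocations: one for the structure, one for the per-disk array */
conf = kzalloc(sizeof(struct nuda_conf), GFP_KERNEL);
conf->disks = kzalloc(sizeof(struct dev_info) * NUMBER_OF_DISKS, GFP_KERNEL);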
[/code]

But variable-size structures allow doing it with only one allocation:

[code lang=”c”]
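struct nuda_conf {
	char *name;
	int whotever;
	struct dev_info disks[0]; /* variable-size array, must be the last member */
};

struct nuda_conf *conf;

/* One allocation for the structure and all the per-disk entries */
conf = kzalloc(sizeof(struct nuda_conf) + sizeof(struct dev_info) * NUMBER_OF_DISKS, GFP_KERNEL);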
[/code]

This will allocate the memory for the char* and the int of nuda_conf, and the memory for NUMBER_OF_DISKS “dev_info” structures. To access them, one can do conf->disks[N] where N is the disk index.
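If you want to play with the single-allocation trick outside the kernel, here is a userspace sketch of it. It is an adaptation, not the kernel code: calloc() stands in for kzalloc(), the rdev member is replaced by a void pointer, the array is written with the standard C99 disks[] spelling instead of disks[0], and nuda_conf_alloc is a helper name made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct dev_info {
    int whatever;
    void *rdev;     /* stand-in for the per-disk pointer */
};

struct nuda_conf {
    char *name;
    int whotever;
    struct dev_info disks[];    /* C99 flexible array member, must be last */
};

/* One zeroed allocation covering the header and nr_disks entries. */
static struct nuda_conf *nuda_conf_alloc(size_t nr_disks)
{
    /* Sketch only: a real allocator should also guard against overflow. */
    return calloc(1, sizeof(struct nuda_conf) + nr_disks * sizeof(struct dev_info));
}
```

After conf = nuda_conf_alloc(4), the entries conf->disks[0] to conf->disks[3] are usable and a single free(conf) releases everything.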

But, this is limited to one variable-size array. If you do :

[code lang=”c”]
struct nuda_conf {
	char *name;
	int whotever;
	struct dev_info disks[0];
	struct indirection_line indirection_table[0]; /* same offset as disks! */
};
[/code]

The “indirection_table” and “disks” members will then point to the same location, even if you allocate room for both! The solution is either to use a classical pointer as in the first example, or to handle the memory access yourself:

[code lang=”c”]
struct nuda_conf {
	char *name;
	int whotever;
};
...
conf = kzalloc(sizeof(struct nuda_conf) + sizeof(struct dev_info) * NUMBER_OF_DISKS + sizeof(struct indirection_line) * NUMBER_OF_CHUNKS, GFP_KERNEL);
/* disks start right after the structure, the table right after the disks */
struct dev_info *disks = (struct dev_info *)(conf + 1);
struct indirection_line *indirection_table = (struct indirection_line *)(disks + NUMBER_OF_DISKS);
[/code]

But there is little interest in doing so, as both the disks and indirection_table pointers have to be computed manually each time you need them… The use for that is really, really, really specific (dynamic metadata in packets for example).

The MD personality

The structure “struct md_personality” is the one which contains information about a RAID level. It is defined in md.h . You’ll find below some explanations about each member of the structure.

[code lang=”c”]
char *name; //Name of the level
int level; //Number related to the level; negative levels are for "special" ones, for example the linear level is -1
struct list_head list; //Linked list of all MD personalities.
struct module *owner; //Pointer to the Linux module managing this level

/** Do a block I/O operation
 * @param mddev structure representing the array
 * @param bio Block I/O representing the operation to do
 */
void (*make_request)(struct mddev *mddev, struct bio* bio);

/** Build and assemble a RAID array
 * @param mddev structure representing the array
 */
int (*run)(struct mddev *mddev);

/** Disassemble an array
 * @param mddev structure representing the array
 */
int (*stop)(struct mddev *mddev);

/** Write some status informations to a file
 * @param seq virtual file
 * @param mddev structure representing the array
 */
void (*status)(struct seq_file *seq, struct mddev *mddev);

/** Function called when an error occurs in the MD subsystem.
 * The usual way to run it is to call md_error() inside an md module, which will set some flags, start the recovery thread, and call this error_handler
 * @param mddev structure representing the array
 * @param rdev The physical drive responsible for the failure if applicable
 */
void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);

/** Add a disk while the array is still running
 * @param mddev structure representing the array
 * @param rdev the physical disk to add
 */
int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);

/** Remove a disk while the array is still running
 * @param mddev structure representing the array
 * @param number index of the disk to remove
 */
int (*hot_remove_disk) (struct mddev *mddev, int number);

/** Return the number of active spare drive.
 * @param mddev structure representing the array
 */
int (*spare_active) (struct mddev *mddev);

/** Handle a re-synchronization request for some sector
 * @param mddev structure representing the array
 * @param sector_nr Sector to synchronize
 * @param skipped Will be set to 1 if the sync was skipped
 * @param go_faster If false, will sleep interruptible to throttle resync
 */ 
sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster);

/** Resize an array
 * @param mddev structure representing the array
 * @param sectors new size
 */
int (*resize) (struct mddev *mddev, sector_t sectors);

/** Return the size of the array. If sectors and raid_disks are not zero, size is computed using those numbers instead of the real ones.
 */
sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks);

/** Check_reshape Reshape an array after adding or removing some disks
 */
int (*check_reshape) (struct mddev *mddev);

// unused
int (*start_reshape) (struct mddev *mddev);
void (*finish_reshape) (struct mddev *mddev);

/** quiesce moves between quiescence states
 * 0 - fully active
 * 1 - no new requests allowed
 * others - reserved
 */
void (*quiesce) (struct mddev *mddev, int state);

/** takeover is used to transition an array from one
 * personality to another.  The new personality must be able
 * to handle the data in the current layout.
 * e.g. 2drive raid1 -> 2drive raid5
 *      ndrive raid5 -> degraded n+1drive raid6 with special layout
 * If the takeover succeeds, a new 'private' structure is returned.
 * This needs to be installed and then ->run used to activate the
 * array.
 */
void *(*takeover) (struct mddev *mddev);

[/code]

Limiting the incoming Block I/O requests to a device driver/md device

When implementing a device driver or an MD device which can receive block I/O (struct bio in the kernel), you can receive bios of nearly any size, with any number of segments (segments are the discontiguous parts of memory that together make up the buffer of a bio request). You may want to limit:

– The number of segments you can receive with
[code lang=”c”]blk_queue_max_segments(queue, X);[/code] Where X is the number of segments per struct bio

– The maximal size of the request :
[code lang=”c”]blk_queue_max_hw_sectors(queue, Y);[/code] Where Y is the maximal size in sectors

For an md device, the queue can be retrieved with mddev->queue.

The combination of the two ensures that every bio request has at most X segments, for a maximal size of Y sectors.

It is used in raid0 with Y=mddev->chunk_sectors to ensure that no request is bigger than one chunk, so any request crosses at most one chunk boundary. And with X=1, it allows using the bio_split function to split a request which would span both sides of a chunk boundary.
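The arithmetic behind “at most one chunk boundary” can be checked with a small userspace sketch. The two helpers below are made-up names for the illustration and only do the sector math; the actual splitting of the bio is left to the kernel functions mentioned above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Number of chunk boundaries crossed by the request [start, start + len). */
static unsigned int boundaries_crossed(sector_t start, uint32_t len,
                                       uint32_t chunk_sectors)
{
    assert(len > 0 && chunk_sectors > 0);
    return (unsigned int)((start + len - 1) / chunk_sectors
                          - start / chunk_sectors);
}

/* Sectors that fit before the next boundary: where a split would happen. */
static uint32_t sectors_before_boundary(sector_t start, uint32_t chunk_sectors)
{
    assert(chunk_sectors > 0);
    return chunk_sectors - (uint32_t)(start % chunk_sectors);
}
```

With 128-sector chunks, a 128-sector request starting at sector 100 crosses exactly one boundary, and the first piece after a split would be 28 sectors long. As long as len never exceeds chunk_sectors, boundaries_crossed() cannot return more than 1.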

Creating a dynamic and redundant array with LVM and MDADM

RAID 5 allows creating an array of N+1 drives, where N is the number of drives’ worth of real data. The extra drive’s worth of space is used to store parity computed over the other drives (in practice, the parity information is spread by chunks across all drives and not kept on only one drive). RAID 5 allows losing any one of the drives without losing data thanks to the parity, and is cheaper than RAID 1: for an array of N drives in total, the usable capacity is N/2 drives with RAID 1 mirroring, instead of N-1 with RAID 5.
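How parity lets you lose a drive can be shown with plain XOR on small buffers. This is a userspace sketch with made-up helper names, working on a few bytes instead of disk chunks, and it ignores the rotation of the parity across drives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR src into dst, byte by byte. */
static void xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/* out = XOR of all the given blocks (each block is len bytes long). */
static void xor_blocks(uint8_t *out, const uint8_t *const *blocks,
                       size_t nr_blocks, size_t len)
{
    memset(out, 0, len);
    for (size_t b = 0; b < nr_blocks; b++)
        xor_into(out, blocks[b], len);
}
```

Computing the parity is xor_blocks() over the data blocks. Rebuilding a lost block is the very same call over the surviving data blocks plus the parity block, because XOR-ing a value twice cancels it out.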

MDADM is the tool of choice to build a RAID 5 array. Given 3 disks, the command to build a RAID 5 array is:

[code lang=”bash”]mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1[/code]

The problem is that RAID 5 arrays are not easily splittable/shrinkable/resizable: the operation is complex and must be done offline. The solution is to use LVM on top of MDADM to build a big volume group which will be “protected” by RAID 5, allowing you to make dynamic partitions on it:

[code lang=”bash”]pvcreate /dev/md0
vgcreate group0 /dev/md0[/code]

And then create multiple, online-resizeable partitions with :

[code lang=”bash”]lvcreate /dev/group0 -n system -L 10G
mkfs.ext4 /dev/mapper/group0-system[/code]

[code lang=”bash”]lvcreate /dev/group0 -n home -L 50G
mkfs.ext4 /dev/mapper/group0-home[/code]

To resize a partition, one can do :

[code lang=”bash”]lvresize /dev/mapper/group0-home -L +10G
resize2fs /dev/mapper/group0-home[/code]

Which will add 10G to the partition, and resize it. It will work even with the system partition, without needing any reboot.

 

Block I/O caching

In Linux, blocks resulting from BIO requests are cached, so further reads are served from memory instead of re-reading data from the disk. To drop this cache, one can use echo 3 > /proc/sys/vm/drop_caches .
In the same way, writes to the filesystem are not done immediately. The filesystem waits before turning written data into a “write” BIO request. Imagine a program writing byte per byte to the filesystem: one write request would be created per byte, which would be very slow.
Instead, filesystems wait before creating and sending the BIO write request, and this explains why, after writing a file to the disk, there are still some write BIO requests passing through the block layer even after the writing program says it has finished or has even exited.
To force the “real” write of all pending writes in the FS, one can use the “sync” program already installed on all Linux systems.
This is why you’ve got to unmount USB drives, by the way: even if the copy seems finished, it may not really be, because the writes are pending but not yet written “for real”.
Using the combination of the two (sync, then drop the cache) will ensure that all data is written and that further reads will be done from the disk, and not from memory. This is very important to test that a disk driver or an MD RAID array is working well.
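From inside a program, the per-file counterpart of the “sync” command is the fsync() call. Here is a hedged userspace sketch: write_durably is a made-up helper name, it treats a short write as an error to stay small, and it only covers one file, unlike “sync” which covers everything pending.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write buf to path and return only once the kernel reports that the data
 * was pushed down to the device. Returns 0 on success, -1 on error.
 */
static int write_durably(const char *path, const char *buf)
{
    size_t len = strlen(buf);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    /* write() alone may leave the data sitting in memory for a while. */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

If you trace the block layer while running this, the write BIO requests for the file should show up before write_durably() returns, instead of some time later.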

Using a thread to do the swapping in assignment #2 is now a bonus

As a lot of the INFO-0940 students seem to have trouble with this part and are struggling with time, you can do the swapping of chunks in the indirection table directly inside make_request. Doing it in another thread will be considered as a bonus.

More than a bonus, it will be needed for assignment #3, so doing it in a separate thread is not a waste of time, on the contrary… But focus on sysfs entries and having a working indirection table first! (and a working raid module, of course…)

About sysfs entries, I added this post : https://www.tombarbette.be/sysfs-entries/

There will be no deadline extension for assignment #2. Remember that this time you have to use the RUN submission platform (http://submit.run.montefiore.ulg.ac.be) to send your project and that you can submit any number of times until the deadline. A script will automatically run to tell you if your archive was good (patch, …). If it wasn’t, your project will not be considered for correction… Better to be late than to get a zero!