A reminder about packets in networks

An ethernet packet is made in layers. Each layer is inside another layer.

The content of the skbuff created in e1000_clean_rx_irq is the whole ethernet frame. The Ethernet frame starts with an ethernet header, and then contains the ethernet payload.

You do not receive the preamble (it’s just data to mark the beginning of the frame, always the same so useless to copy) and usually not the CRC either as all NICs can check it is correct for you (it will be removed in e1000_main.c:4451 if it’s there.)

Then how to know what’s in the data? Well, it will be given by two bytes starting from the 12th byte.

The known types are defined in if_ether.h, for example you can find the type of the IP packets there :

#define ETH_P_IP 0x0800 /* Internet Protocol packet */

So you know that somewhere, the “thing” handling the packets of the IP protocol will check that the type equals ETH_P_IP. It will in fact check if the type is cpu_to_be16(ETH_P_IP), because in the network, bytes are big-endian, while the CPU use little-endian. It means that in the network, 0x0800 will be 0x0080 as the most significant byte will be on the right. There is a lot of packets types, not just IP. Do not expect to find a “if (type == cpu_to_be16(ETH_P_IP))”… The kernel use a list of structure of known packet type and check the whole list against the actual packet type.

The kernel will call the handler defined for the specific protocol matching the IP packet.

The same scheme applies more or less for every layer, as the IP packet itself is also composed of a header with a type, and some data. And inside it there is again a UDP, TCP, ICMP, … packet with again a header and some data.

For example the type of the payload of the IP packet is given as a unique byte (so no byte order problem here) in the 9th byte of the IP header :

As there is only 256 possible types, the linux kernel use a table, and not a list as it allows to directly jump to good “sub” ip layer handler such as the one for UDP, TCP, …

IP protocols are defined in in.h for example we have IPPROTO_UDP = 17, /* User Datagram Protocol */ at line 40, which tells us that 17 is the UDP protocol. Again, a quick search on who use IPPROTO_UDP in the /net/ipv4 folder will tell us who is defining some kind of handler to handle that kind of packets with the IP protocol set to 17. A hint : there is a function which will set the good index of the “protocol table” to the structure containing informations about the protocol. So it’s not like the ethernet layer where the list contains the type and the function, here the protocol number is not in the handler structure 😉

How to find the mkdir syscall

The first question to ask yourself is probably what the mkdir syscall does?

Obviously, it will create a new dir (thanks captain!) but is it only one piece of code for the whole kernel?

If I ask, the answer is no. First, there is one syscall entry per achitecture, so searching about mkdir in the arch will already show a lot entries.

But what’s a dir? Something which contains one or multiple files? Yes but no. Conceptually, the main purpose of the directory is to give names to files and allow to address them. If the name of the file was in the file structure, how to access it? Without folder you would only have an array of files without names… But I’m getting away from the question (just giving you an exam answer by the way…)

The dir, in practice is implemented differently in all file systems. So there is one different mkdir function per file system, and there is a lot of FS in the linux kernel… How the kernel knows which one to use? It uses something much like net_device_ops for a network device (OS 2015-2016), md_personality for an md array (OS 2014-2015) or sched_class for a scheduler (OS 2013-2014) : a structure of function pointers that the kernel can follow to accomplish some actions.

So a “grep -Ri mkdir” on the top level of the kernel will give you way too much results…

If you look at the syscall table, you’ll find that we often talk about sys_something, so searching for sys_mkdir could be a good idea…

Could be… Because you won’t find it. Why? You should remember half of my slides, or what I said in class, or … Well, that smells the macro… Cscope does not support them well, Eclipse does. But Eclipse can only resolve a macro usage, not search in reverse from the macro declarations (as far as I know).

Time to show your regular expression skills ! (You could also search for the macro defining the syscall directly…). We should have something called syscall and mkdir in the same line… So let’s search for sys [something] mkdir, or the reverse :

grep –exclude “*.o” -RiE “sys.*mkdir|mkdir.*sys”

–exclude *.o allows to avoid searching object files, R do the search recursively, i case insensitive, E use regular expressions.

That stills give too much results. Looking quickly through, you could add –exclude Documentation and arch as those two won’t contain the actual implementation. Another way is to search only in the fs folder, as we can think that the syscall implementation is something about file systems… Even if it will call a per-fs function.

Let’s do the later :

cd fs
grep --exclude "*.o" -RiE "sys.*mkdir|mkdir.*sys"
sysv/namei.c:static int sysv_mkdir(struct inode * dir, struct dentry *dentry, umode_t mode)
sysv/namei.c: .mkdir = sysv_mkdir,
Fichier binaire sysv/sysv.ko correspondant
tracefs/inode.c:static int tracefs_syscall_mkdir(struct inode *inode, struct dentry *dentry, umode_t mode)
tracefs/inode.c: .mkdir = tracefs_syscall_mkdir,
proc/root.c: proc_mkdir("sysvipc", NULL);
proc/proc_sysctl.c: proc_sys_root = proc_mkdir("sys", NULL);
namei.c:SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
namei.c:SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
namei.c: return sys_mkdirat(AT_FDCWD, pathname, mode);
btrfs/ioctl.c: * sys_mkdirat and vfs_mkdir, but we only do a single component lookup

And it’s right in front of you 😉 The number to use at the end of the macro is quite obvious… But google can help you if you don’t find what it means.

Use tile-eclipse to launch a software on a remote Tilera TileEncoreGX36 through SSH

Here is the set-up to make it work :

First find the listening monitor port. On this picture it is 34531.
pict1

We’ll have to set up a reverse ssh forwarding for the tile-monitor to connect to our tile-eclipse instead of trying to connect to some local listener on the remote host.

Run :
ssh -R 34531:localhost:34531 sauron.run.montefiore.ulg.ac.be
Where in our case, the port 34531 is the one you found in tile-eclipse, and sauron.run.montefiore.ulg.ac.be is our host where the tile is connected.

Each time you re-run tile-eclipse you’ll have to redo that part as the port will change.

Then only once you have to set up your run configuration.

In the hardware part, correctly set up the hardware :
pict2

And in the monitor part, set remote as given in this picture and click on manage hosts :
pict3

And add your own :
pict5

I never found a better way to set it up (not using myself a ssh -R reverse forwarding), it should be possible to set it up automatically.

PROXIMUS_AUTO_FON automatic connexion on linux using wpa_supplicant

If you understand this title, you don’t need more explanation :

/etc/network/interfaces
auto wlan1
iface wlan1 inet dhcp
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

/etc/wpa_supplicant/wpa_supplicant.conf
ctrl_interface=/var/run/wpa_supplicant

network={
ssid="PROXIMUS_AUTO_FON"
scan_ssid=1
key_mgmt=WPA-EAP
eap=TTLS
identity="LOGIN@proximusfon.be"
password="PASS1234"
phase2="auth=MSCHAPV2"
}

Some may ask why some people would want to do that… I’m now using Voo, but I use my parent’s FON login when voo crash. My current project is towards aggregating the two links by load balancing, or at least have some kind of automatic failover. The more interesting part would be to switch to “FON only” when I reach my 100Gb limit…

Install and share the Canon Pixma MX395 Scanner with Sane

Found a Pixma MX395 at 27€ yesterday… It’s quite easy to find the Canon debian package to install the printer (use these one and not the included) and “scangearmp” which is the specific tool from Canon to scan, but it is not standard, and do not allow to share your scanner on the network through SANE.

The current version of sane do not support that printer, so you’ll need to use an updated one. Do :

sudo add-apt-repository ppa:rolfbensch/sane-git
sudo apt-get install sane sane-utils libsane

And it’s up !

scangearmp -L should show your scanner :
scanimage -L < ~ [14:04:02] device `v4l:/dev/video0' is a Noname USB2.0 Camera virtual device device `pixma:04A91766_21F9AD' is a CANON Canon PIXMA MX390 Series multi-function peripheral

Also edit /etc/sane.d/saned.cong to add the network subnet which can access the scanner :
10.0.0.0/24
[2a02:578:3fe:8139::]/64

For me. Do not forget the IPv6 address, of course 😉

Then on your client, install sane and edit /etc/sane.d/net.conf to add the server address :
10.0.0.1

And if you run scanimage -L on your client you should now see the remote scanner :
scanimage -L
device `v4l:/dev/video0' is a Noname USB2.0 UVC HD Webcam virtual device
device `net:10.0.0.1:v4l:/dev/video0' is a Noname USB2.0 Camera virtual device
device `net:10.0.0.1:pixma:04A91766_21F9AD' is a CANON Canon PIXMA MX390 Series multi-function peripheral

Proximus BBOX 3 in bridge mode with prefix delegation on Linux

Using bridge mode allows you to get a public IP address on one computer (which can serve as a router) behind your modem. This allows you to know your public IP address without using a third-party service, and control more finely all your routing parameters inside your own Linux-based router (this tutorial) or a better router than the BBOX’s one.

We’ll call “the router” the device you want to use behind the modem for clarity.

The bridge mode of the Proximus BBOX 3 is quite interesting. You connect normally to your BBOX using DHCP and will get a locally routable address (i.e. 192.168.0.0/24), but you can use PPP over Ethernet (PPPoE) to get a virtual interface inside your router. This virtual “ppp” interface will have a public IP address, and packets will flow IN and OUT the internet through that interface.

Proximus allows you to therefore maintain 2 PPP connections, one established by the BBOX (also used for the TV), and the other inside your router. It also means your home gets 2 IPv4 addresses.

I prefer that mode to the VOO one, where the external IP address is given by DHCP to only one host in the LAN, the first device to connect to the router using DHCP (dangerous and prone to configuration errors...). Same and independently for IPv6 using DHCPv6. While Proximus not only gives you an IPv6 address but also a /64 prefix via PPPoE to get a direct connection without using a crappy NAT to all your PCs. For IPv6, Proximus is much simpler than setting up an independent DHCPv6 client which gives back the v6 prefix to your LAN side. The second downside is that VOO must use ugly hacks to allow connection to the box as there is no "modem internal network" anymore. You can access your modem at the normally-illegal 192.168.100.1 address as this is on the "public web" space from the router perspective. Moreover, it seems that the modem stops responding to DHCP requests from time to time, losing connectivity... VOO bridge mode is definitively not good... But this may be a temporary bug. I did not observe this anymore...

The bridge/WAN part

Edit /etc/network/interfaces to add the following lines , assuming that eth0 is the interface used to connect to your BBOX.

auto dsl-provider
 iface dsl-provider inet ppp
 pre-up /bin/ip link set eth0 up
 provider dsl-provider

Install pppoe with sudo apt-get install pppoe on ubuntu/debian or sudo yum install pppoe centos/fedora

Then create a file named /etc/ppp/peers/dsl-provider and add the following lines :

noipdefault
defaultroute
replacedefaultroute
hide-password
noauth
persist
mtu 1492
plugin rp-pppoe.so eth0
user "fc0123456@skynet"
usepeerdns

Then edit the file /etc/ppp/chap-secrets and add the line :
"fc012345@skynet" * "password"

If you lost your skynet credentials (personally, I just never received them), you can change them online on MyProximus. You’ll have to reboot your modem so it receives automatically the new credentials.

And that’s all, you can reboot or do a sudo pon dsl-provider and you’ll have a new interface with a public IPv4 and a /64 IPv6.

The router/LAN part

To give connectivity in IPv4 for your hosts and use your Linux host as a router, you’ll have to do a NAT. But you can delegate your IPv6 range and give public IPv6 addresses to all your PCs using SLAAC! Remember to also install a firewall…

To do so, install radvd and add in /etc/radvd.conf (if br0 is the interface connected to your internal network) :

interface br0

{
 AdvSendAdvert on;
 prefix ::/64
 {
   AdvOnLink on;
   AdvAutonomous on;
   AdvRouterAddr on;
 };
 RDNSS 2001:4860:4860::8888 2001:4860:4860::8844
 {
   # AdvRDNSSLifetime 3600;
 };
};

Then do a sudo radvd restart and that’s it.

The RDNSS line gives the address of Google’s public DNS to your host. We could use Proximus’ one, but I don’t have the address on hand.

Do not hesitate to contact me!

A Linux bug you can now understand

Hi all students from INFO0940, I came accross a bug which may ring multiple bells for you (sorry, it’s in french) :

http://www.developpez.com/actu/85653/Linux-decouverte-d-un-bug-sur-le-systeme-de-fichiers-EXT4-qui-pourrait-causer-une-importante-perte-de-donnees/

The bug description should remember you some all ghost :
« La variable “sector” dans “raid0_make_request()” n’a pas été correctement modifiée par l’appel à “sector_div()” qui modifie son premier argument à la place. Le commit [précèdent] restaurait cette variable après l’appel pour une utilisation ultérieure. Malheureusement la restauration a été effectuée après que la variable “bio” a été avancée »

I know multiple people had problems with dividing sectors and using sector_div(). Fortunately, most of you used it correctly ! Maybe you should help them :p I wonder if the bug could happen with any file system as raid0_make_request seems to be related to MD and not to ext4 at all, maybe the article is poorely written and it was just seen with ext4 but is related to all md0 device?

Asynchronous Block I/O request

In the Linux kernel, Block I/O request are asynchronous. It means that when you call submit_bio(READ/WRITE, bio); or generic_make_request(…), the function will (most probably) return directly, and of course, the read is not done. So after calling bio_submit(READ…); you absolutely cannot read the content of a page added by bio_add_page().

So, how to know when it is finished? You have to use bio->bi_end_io pointer function. You have to set this pointer to a function which will be called when the read has been done.

[code lang=”c”]void myReadIsFinished(struct bio* bio, int error) {
//Read is finished, do something with the bio content
}

bio->bi_end_io = &myReadIsFinished;
bio_submit(READ, bio);[/code]

bio->bi_private allows you to store some pointer with the bio. Use it to know what you tried to read.

A reminder about variable size structures

Variable-size structures are helpful, because you can allocate a structure containing an array of an unkown (at compile time) size at the end. For example, a “nuda_conf” structure containing informations about an MD Array and a “dev_info” structure per disks. The number of disks is unknown at compile time, so the only solution is to use a kalloc/vmalloc. The classical solution is to do :

[code lang=”c”]

struct dev_info {
int whatever;
struct rdev* rdev;
}

struct nuda_conf {
char* name;
int whotever;
struct dev_info disks*
}

...

struct nuda_conf* conf;
conf = kzalloc(sizeof(struct nuda_conf));
conf->disks = kzalloc(sizeof(struct dev_info) * NUMER_OF_DISKS);

[/code]

But variable-size structures allows to do only one allocation :

[code lang=”c”]

struct nuda_conf {
char* name;
int whotever;
struct dev_info disks[0];
}

struct nuda_conf* conf;

conf = kzalloc(sizeof(struct nuda_conf) + sizeof(struct dev_info) * NUMBER_OF_DISKS);

[/code]

This will allocate the memory for the char* and the int of nuda_conf, and the memory for a number of “dev_info”. To access them, one can do conf->disks[N] where N is the disk index.

But, this is limited to one variable-size array. If you do :

[code lang=”c”]

struct nuda_conf {
char* name;
int whotever;
struct dev_info disks[0];
struct indirection_line indirection_table[0];
}

[/code]

The “indirection_table” and the “disks” pointer will point to the same location ! Even if you allocate size for both ! The solution is either to use a classical pointer as in the first example, or handle the memory access yourself :

[code lang=”c”]

struct nuda_conf {
char* name;
int whotever;
}
...
conf = kzalloc (sizeof(struct nuda_conf) + sizeof(struct dev_info) * NUMBER_OF_DISKS + sizeof(struct indirection_line)*NUMBER_OF_CHUNKS);
struct dev_info* disks = (struct dev_info*)(conf + 1);
struct indirection_line* indirection_table = (struct dev_info*)(disks + NUMBER_OF_DISKS);

[/code]

But there is no interest in doing so, and both disks and indirection_table pointers have to be manually computed each time you need them… The use for that is really really really specific (dynamic metadata in packets for example).