
#+TITLE: Guix Infrastructure Handbook

This handbook is intended for sysadmin volunteers taking care of the infrastructure powering the Guix website, substitutes and other services offered via https://guix.gnu.org/.

The different machines involved are registered in the file:../hydra/machines.rec file.

Berlin

Berlin is the main machine, which hosts the website (https://guix.gnu.org/), the MUMI issue tracker (https://issues.guix.gnu.org/), runs the build farm (https://ci.guix.gnu.org/) and serves the cached substitutes. It is graciously provided by the Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC) and hosted at their datacenter in Berlin, hence its name.

Specifications

Dell PowerEdge R7425 server with the following specifications:

  • 2x AMD EPYC 7451 24-Core processors
  • Storage Area Network (SAN) of 100 TiB
  • SAN connected to two QLogic QLE2692 16G Fibre Channel adapters (qla2xxx)
  • PERC 730p RAID/HBA disk controller with 8 slots
  • 2x 1 TB hard drives in a RAID 1 configuration (attached to the PERC)
  • 188 GiB of memory

The machine can be remotely administered via iDRAC, the Dell server management platform.

Its configuration is defined in file:../hydra/berlin.scm. A second machine, known as node 129, is intended to become a fallback for Berlin; it is deployed from Berlin via the file:../hydra/deploy-node-129.scm deployment file.

Boot device

The PowerEdge R7425 firmware works best in UEFI mode. The boot device is made of two 931 GB rotational disks attached to the PERC controller card and configured in RAID 1. It holds the UEFI partition as well as another partition for /boot. It is made necessary because the SAN is not visible to GRUB.

SSH access to Berlin and node 129

The following ~/.ssh/config snippet can be defined to access the Berlin machine:

Host berlin
     HostName berlin.guix.gnu.org
     DynamicForward 8022
     ForwardAgent yes

The DynamicForward on port 8022 will be explained in the iDRAC web access section below, while ForwardAgent makes your SSH agent credentials available on Berlin, which is useful when deploying to node 129 from there.

For node 129, you can use:

Host hydra-guix-129
     HostName 141.80.181.41
     DynamicForward 8022

iDRAC web page access

The Dell iDRAC management suite offers a web interface to easily perform actions such as rebooting a machine, changing parameters or simply checking its current status. The iDRAC page of Berlin can be accessed at https://141.80.167.225, while node 129's page can be accessed at https://141.80.167.229. Because the iDRAC web interface can only be accessed locally from the MDC, it is necessary to configure an HTTP proxy. This can be accomplished via OpenSSH's SOCKS proxy support. For it to work, two things are needed:

  1. A DynamicForward directive on your SSH host, as shown in the snippets from the SSH access to Berlin and node 129 section above.
  2. A proxy auto-configuration (PAC) file to configure your browser to relay requests for specific domains through the SOCKS proxy.

For GNU IceCat, the PAC file can be defined as below and placed, for example, at ~/.mozilla/proxy.pac. Then navigate to the IceCat Settings -> General -> Network Settings (at the very bottom), tick the "Automatic proxy configuration URL" checkbox, and input the PAC file URI in the associated text box, e.g. /home/maxim/.mozilla/proxy.pac. Click the "Reload" button to make it take effect.

function FindProxyForURL(url, host) {
    if (isInNet(dnsResolve(host), "141.80.167.0", "255.255.255.0")) {
        return "SOCKS localhost:8022; DIRECT";
    } else {
        return "DIRECT";
    }
}

After that, navigating to https://141.80.167.229 should display the iDRAC login page, as long as you have an active connection to either berlin or hydra-guix-129.
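
If the login page does not load, a quick way to check that the SOCKS proxy itself works is to fetch the page over the proxy from the command line. This is only a sketch: it assumes curl is installed and that the ssh session providing the DynamicForward is still open (the --insecure flag is there because the iDRAC certificate is typically self-signed).

curl --silent --insecure --socks5-hostname localhost:8022 https://141.80.167.229/ | head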

iDRAC serial console access to Berlin

iDRAC also provides access to a server's serial console, which can be very handy for debugging boot problems (before an SSH server is available). The iDRAC consoles are only reachable at IP addresses private to the MDC network, so it is necessary to proxy jump through Berlin or node 129 to reach them, as shown in the ~/.ssh/config snippets below:

Host hydra-guix-129-idrac
     ProxyJump berlin
     HostName 141.80.167.229
     User guix

Host berlin-idrac
     ProxyJump hydra-guix-129
     HostName 141.80.167.225
     User guix

You may notice that we don't proxy jump through berlin itself to access its iDRAC interface, because this wouldn't work if berlin is not currently running. For the same reason, the iDRAC interface of node 129 is reached by proxy jumping through berlin.

After having connected to the iDRAC interface, the serial console can be entered by typing the console com2 command at the racadm>> prompt. To exit, press C-\.
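
For illustration, reaching Berlin's serial console from your own machine then looks roughly like the following transcript (prompts and exact output may differ):

$ ssh berlin-idrac
racadm>> console com2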

Repairing a non-bootable Guix System via a PXE booted image

One way to fix a non-bootable Guix System is to boot a different GNU/Linux system, mount the partitions and make changes to them. This is made possible for Berlin and node 129 by having their boot mode fall back to a network (PXE) boot, and by using the serial console to navigate the boot menus.

Pressing F12 as suggested during the boot to reach PXE doesn't seem to work. The most reliable way I've found is to change the Boot Settings in a persistent fashion by entering the System Setup (F2) at boot:

System Setup

  • System BIOS

    • Boot Settings

      • UEFI Boot Settings

Leave only the PXE Device checkbox enabled, then press ESC, ESC, ESC, Yes, OK, ESC and YES to save and exit. The PXE boot typically succeeds on the second reboot, which it attempts automatically after failing once.

The images are made available by the MDC infrastructure team via Cobbler, and only a few of the available images are bootable (sadly, Guix System is not one of them). One image which works and has Btrfs support is "Ubuntu-22.04-server-amd64". Upon selecting that entry and pressing RET, a sub-menu should appear, containing "Ubuntu-22.04-server-amd64-GuixFarm". Before booting it, you need to adjust the kernel arguments of its 'linux' command at the GRUB boot menu to add console=ttyS0,115200 in order to see the serial output. There is a convenient way to turn on SSH at the installer screen, which you can connect to from the hydra-guix-129 or berlin machines.
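
Assuming the boot menu is standard GRUB, the edit is done by pressing 'e' on the selected entry and appending the console argument to the line that starts with linux, roughly as sketched below (the kernel path and existing arguments are placeholders); pressing C-x or F10 then boots the modified entry:

# Append to the end of the existing 'linux' line:
linux <existing kernel path and arguments> console=ttyS0,115200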

You can then mount the file systems and modify /boot/grub/grub.cfg or anything else needed. If you need to reconfigure the machine, you can refer to info:guix#Chrooting to chroot into an existing system, except you'll need to use the --substitute-urls=https://bordeaux.guix.gnu.org option to avoid blocking while vainly attempting to fetch substitutes from https://ci.guix.gnu.org. If the reconfiguration hangs, you may also need to use --no-grafts.
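
For example, once chrooted into the broken system, the reconfiguration command would look roughly as follows; the configuration file path is a placeholder and depends on which machine is being repaired:

guix system reconfigure /path/to/machine-config.scm \
     --substitute-urls=https://bordeaux.guix.gnu.org --no-grafts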

To allow connecting to a root shell from a remote machine (e.g. berlin), set the PermitRootLogin option to yes in /etc/ssh/sshd_config, set a password for the root user via the passwd command, then run systemctl restart sshd.
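
Concretely, from a root shell on the rescue system, the steps look roughly like this (the sed invocation is only one way to edit the file; adjust to taste):

sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
passwd root              # set a (temporary) root password
systemctl restart sshd   # then, from berlin: ssh root@<rescue system IP>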

Scribbled Notes

To replicate node-129's file system under /mnt, use:

# Mount the Btrfs subvolumes from the multipath SAN device...
mount -o subvol=@root /dev/mapper/mpathb /mnt
mount -o subvol=@cache /dev/mapper/mpathb /mnt/var/cache
mount -o subvol=@home /dev/mapper/mpathb /mnt/home
# ...then the /boot partition and the two EFI system partitions.
mount /dev/sda3 /mnt/boot/
mount /dev/sda2 /mnt/boot/efi
mount /dev/sdb2 /mnt/boot/efi2/

Btrfs file system

Because it is not susceptible to the ext4 inode exhaustion problem and offers zstd compression, which can almost double the effective capacity of a storage device at little computational cost, Btrfs is the current file system of choice for GNU/Linux-based Guix System build machines.

Btrfs compression and mount options

To get the most out of Btrfs, enabling zstd compression is recommended. When using RAID arrays, it can also be useful to use the degraded mount option; otherwise the RAID could fail to assemble at boot if any drive that is part of the array has a problem. Here's an alist of recommended mount options, taken from file:../hydra/deploy-node-129.scm, for a build machine where high availability is preferred over data safety (hence degraded):

(define %common-btrfs-options '(("compress" . "zstd")
                                ("space_cache" . "v2")
                                "degraded"))

Btrfs balance mcron job

To keep the file system operating without manual intervention, a balance job should run periodically so that the unallocated space (a Btrfs-specific concept) remains in check with the actual free space. Otherwise, the system could report ENOSPC even when common utilities such as df -h report plenty of free space. To view the amount of available unallocated space, the btrfs filesystem usage / command can be used.

The following mcron job example is taken from the file:../hydra/deploy-node-129.scm machine configuration:

(define btrfs-balance-job
  ;; Re-allocate chunks which are using less than 5% of their chunk
  ;; space, to regain Btrfs 'unallocated' space.  The usage is kept
  ;; low (5%) to minimize wear on the SSD.  Runs at 5 AM every 3 days.
  #~(job '(next-hour-from (next-day (range 1 31 3)) '(5))
         (lambda ()
           (system* #$(file-append btrfs-progs "/bin/btrfs")
                    "balance" "start" "-dusage=5" "/"))
         "btrfs-balance"))

Problems/solutions knowledge base

The boot fails with a kernel panic on qla2xxx-related errors

Here's an example:

[   51.266790] Call Trace:
[   51.266792]  <TASK>
[   51.266794]  _raw_spin_lock_irqsave+0x46/0x60
[   51.266799]  qla2xxx_dif_start_scsi_mq+0x2b7/0xe60 [qla2xxx 124f4fec4ef588623af420625c6af8b5bcce53fd]
[   51.266823]  qla2xxx_mqueuecommand+0x222/0x2d0 [qla2xxx 124f4fec4ef588623af420625c6af8b5bcce53fd]
[   51.266838]  qla2xxx_queuecommand+0x1a1/0x3d0 [qla2xxx 124f4fec4ef588623af420625c6af8b5bcce53fd]
[   51.266852]  scsi_queue_rq+0x390/0xc00
[   51.266857]  __blk_mq_try_issue_directly+0x176/0x1e0
[   51.266861]  blk_mq_plug_issue_direct.constprop.0+0x93/0x180
[   51.266865]  blk_mq_flush_plug_list+0x23d/0x2a0
[   51.266868]  __blk_flush_plug+0xed/0x130
[   51.266872]  blk_finish_plug+0x31/0x50
[   51.266874]  read_pages+0x1f5/0x300
[   51.266879]  page_cache_ra_unbounded+0x131/0x180
[   51.266882]  force_page_cache_ra+0xc7/0x100
[   51.266885]  page_cache_sync_ra+0x34/0x90
[   51.266887]  filemap_get_pages+0x127/0x700
[   51.266893]  filemap_read+0xde/0x420
[   51.266898]  blkdev_read_iter+0xbd/0x1e0
[   51.266901]  new_sync_read+0x13e/0x1c0
[   51.266905]  vfs_read+0x151/0x1a0
[   51.266908]  ksys_read+0x73/0xf0
[   51.266911]  __x64_sys_read+0x1e/0x30
[   51.266913]  do_syscall_64+0x60/0xc0
[   51.266919]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   51.266922] RIP: 0033:0x4e73de
[   51.266924] Code: 0f 1f 40 00 48 c7 c2 bc ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[   51.266926] RSP: 002b:00007ffc403f39e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   51.266928] RAX: ffffffffffffffda RBX: 0000000001a98738 RCX: 00000000004e73de
[   51.266929] RDX: 0000000000000100 RSI: 0000000001a98748 RDI: 0000000000000006
[   51.266930] RBP: 0000000001a51bc0 R08: 0000000001a98720 R09: 0000000001a3ef10
[   51.266932] R10: 0000000000000007 R11: 0000000000000246 R12: 000009ffffffe000
[   51.266933] R13: 0000000000000100 R14: 0000000001a98720 R15: 0000000001a51c10
[   51.266936]  </TASK>
[   54.246148] NMI watchdog: Watchdog detected hard LOCKUP on cpu 64

Solution: This indicates the failure of a device that is part of the backing devices of the SAN (Storage Area Network) array. Ensure multipath is in use to mount the SAN (TBD), which adds resiliency against this problem, and report the issue to Ricardo Wurmus/the SIMB infrastructure department.
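
A quick way to verify that multipath is in use is to list the multipath topology, assuming the multipath-tools package is installed:

multipath -ll   # list multipath devices and the state of each underlying path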