Previously, a non-local exit (such as an uncaught exception) in the
child process would cause it to execute the same code as its parent.
* src/cuirass/scripts/remote-worker.scm (start-worker): Wrap child body
in 'dynamic-wind'.
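
The failure mode and the fix can be illustrated outside of Guile. The following is a minimal Python sketch, not the Cuirass code: `run_in_child`, `failing_body` and `child_exit_code` are hypothetical names, and `try`/`finally` with `os._exit` stands in for the exit guard of 'dynamic-wind'.

```python
import os


def run_in_child(body):
    """Fork, run BODY in the child, and return the child's PID to the parent.

    Hypothetical helper: the 'finally' clause plays the role of the
    'dynamic-wind' exit guard, so the child can never fall through into
    the parent's code, even when BODY raises.
    """
    pid = os.fork()
    if pid != 0:
        return pid  # parent: carry on with its own work
    status = 1  # assume failure until BODY completes
    try:
        body()
        status = 0
    finally:
        # Runs on normal return and on any exception alike.
        os._exit(status)


def failing_body():
    raise RuntimeError("uncaught exception in the child")


def child_exit_code(body):
    """Run BODY in a child process and return the child's exit code."""
    pid = run_in_child(body)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)
```

Without the `finally` clause, the exception raised by `failing_body` would propagate up the child's stack and the child would resume whatever the caller of `run_in_child` does next, which is the parent's code.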
This is a follow-up to 1fb4b0ac12, which tried to
work around the remote-worker hangs by introducing a non-blocking read.
That solution was problematic: while the server is unresponsive, the
request-work requests queue up on the worker, and when the server comes back
online, they are all sent to the server.
Instead, use the ZMQ_PROBE_ROUTER option, which causes the server to send an
empty bootstrap message to the worker when a connection is established. This
empty message unblocks the workers that were hanging on the request-work
response.
* src/cuirass/scripts/remote-server.scm (zmq-start-proxy): Set the
ZMQ_PROBE_ROUTER option on the build socket.
* src/cuirass/scripts/remote-worker.scm (start-worker): Ignore the bootstrap
message when reading server info.  However, when receiving a bootstrap message
while waiting for a request-work response, keep going.
* src/cuirass/scripts/remote-worker.scm (show-help, %options): Adapt them.
(%minimum-disk-space): Define it before %default-options in order to use it to
set the default value. Use a 5GiB threshold because image builds, which
frequently fail for lack of space, require a lot more than 100MiB.
When using 'par-for-each', we'd spawn the whole thread pool of (ice-9
futures), with one thread per core. Using 'n-par-for-each' allows us to
spawn just as many threads as needed.
* src/cuirass/scripts/evaluate.scm (cuirass-evaluate): Use
'n-par-for-each' instead of 'par-for-each'.
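
As a rough analogy only, here is a Python sketch of bounding the number of worker threads. It does not reproduce the semantics of Guile's (ice-9 threads); `n_par_for_each` and `threads_used` are hypothetical Python helpers built on `ThreadPoolExecutor`, which starts threads lazily up to `max_workers`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def n_par_for_each(n, proc, items):
    """Apply PROC to each of ITEMS using at most N worker threads."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Consume the iterator so exceptions raised by PROC propagate.
        list(pool.map(proc, items))


def threads_used(n, items):
    """Return the set of thread identifiers that processed ITEMS."""
    seen = set()
    lock = threading.Lock()

    def record(_item):
        with lock:
            seen.add(threading.get_ident())

    n_par_for_each(n, record, items)
    return seen
```

Calling `threads_used(2, range(20))` processes twenty items but never involves more than two threads, whatever the number of cores.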
This helps ensure workers don't pick up builds that are likely to fail
due to ENOSPC.
* src/cuirass/scripts/remote-worker.scm (show-help, %options): Add
'--minimum-disk-space' option.
(%default-options): Add 'minimum-disk-space'.
(%minimum-disk-space): New variable.
(low-disk-space?): New procedure.
(start-worker): Call 'request-work' only when 'low-disk-space?' returns #f.
(cuirass-remote-worker): Parameterize %MINIMUM-DISK-SPACE.
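
The gating logic can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the worker's code: the names `low_disk_space` and `maybe_request_work`, the path that is checked, and the use of bytes as the unit are all hypothetical.

```python
import shutil


def low_disk_space(path, minimum_bytes):
    """Return True when free space on PATH's file system is below MINIMUM_BYTES."""
    return shutil.disk_usage(path).free < minimum_bytes


def maybe_request_work(path, minimum_bytes, request_work):
    """Call REQUEST_WORK only when enough disk space is left.

    Return True if REQUEST_WORK was called, False otherwise.
    """
    if low_disk_space(path, minimum_bytes):
        return False
    request_work()
    return True
```

For instance, `maybe_request_work(".", 5 * 2**30, ask_server)` would call the hypothetical `ask_server` callback only when at least 5GiB are free on the current file system.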
When the worker sends a request-work message to the server, it then waits
indefinitely for a response. If the server receives the request but dies
before answering, the client can be blocked forever.
* src/cuirass/remote.scm (EAGAIN-safe): New macro.
(zmq-get-msg-parts-bytevector/no-wait): New procedure.
* src/cuirass/scripts/remote-worker.scm (start-worker): Use the above
procedure to avoid waiting indefinitely for the server response.
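
The idea of an EAGAIN-tolerant, non-blocking read can be sketched in Python. This is an assumption-laden illustration, not the ZeroMQ code: a non-blocking pipe stands in for the socket, the helper names `eagain_safe`, `read_no_wait` and `demo` are hypothetical, and returning `None` on EAGAIN is only a guess at what the macro's name suggests.

```python
import os


def eagain_safe(thunk):
    """Call THUNK; return None instead of raising when it fails with EAGAIN."""
    try:
        return thunk()
    except BlockingIOError:  # raised for EAGAIN/EWOULDBLOCK
        return None


def read_no_wait(fd, size=4096):
    """Non-blocking read from FD: the data, or None if nothing is ready."""
    return eagain_safe(lambda: os.read(fd, size))


def demo():
    """Return what a non-blocking pipe read yields before and after a write."""
    reader, writer = os.pipe()
    os.set_blocking(reader, False)
    before = read_no_wait(reader)  # nothing has been written yet
    os.write(writer, b"work")
    after = read_no_wait(reader)   # data is now available
    os.close(reader)
    os.close(writer)
    return before, after
```

A caller can then poll `read_no_wait` in a loop and decide for itself when to give up or retry, instead of being stuck inside a blocking read.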
Fixes: <https://issues.guix.gnu.org/55024>.
* src/cuirass/http.scm (body->specification): Decode all arguments and not
only channel URL and build params.
Signed-off-by: Mathieu Othacehe <othacehe@gnu.org>
Proposed by Leo Famulari here: <https://issues.guix.gnu.org/52487>.
* src/cuirass/database.scm (db-update-failed-builds!): Record the starttime
and stoptime.
Also distinguish a build that is completed from a build that is
completed with some available logs.
* src/cuirass/templates.scm (completed?, completed-with-logs?): New
procedures.
* src/cuirass/scripts/remote-worker.scm (run-build): If the worker was not
able to send the build logs, report it, dump the build logs and keep
things going.
The specification used to build our master branch was renamed from
"guix-master" to "master".
* doc/cuirass.texi (API description): Change "guix-master" to "master".
* src/cuirass/templates.scm: Change "guix-master" to "master".
* src/cuirass/scripts/remote-server.scm (add-to-store): Take care of
registering the GC roots and triggering the baking if the ensure-path call is
successful.
(trigger-substitutes-baking): Take a single output argument.
(need-fetching?): Add logging.
(run-fetch): Adapt it.