This class extends the basic ZMQ addresses to handle parsing and generating addresses with embedded curve pubkeys in various forms, along with a QR-friendly address generator.
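As a rough illustration of the intended use (the parsing constructor and `qr_address()` method shown here are assumptions based on the description above, so check address.h for the real API):

```cpp
#include <lokimq/address.h>
#include <string>

int main() {
    // Parse a full remote address with an embedded curve pubkey (hex form assumed):
    lokimq::address remote{
        "curve://example.com:12345/"
        "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"};

    // A QR-friendly (uppercase, alphanumeric-mode) rendering of the same address:
    std::string qr = remote.qr_address();
    (void) qr;
}
```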
Currently, if you give bt_value a high-bit uint64_t value it encodes it on the wire as a 2's complement int64_t value (and then converts it back during deserialization). This is not a problem for bt_value itself, but it forces other code to deal with a bt_value-specific convention and makes it impossible to actually put a high-bit uint64_t value on the wire. This adds an explicit bt_u64 wrapper to allow doing just that. It is only really needed when the type is a 2^64 modulo arithmetic type (ideally *all* uint64_t values should be that, but in practice there is a ton of code that misuses unsigned types to mean "shouldn't be negative", which isn't what unsigned means at all; cf. C++ Core Guidelines ES.100-102).
When doing an optional send that gets declined (because we aren't connected), the "sending would block" warning was still being printed even though it shouldn't be.
ConnectionIDs weren't comparing their routes, which meant that if
external code stored one in a map or set *all* incoming connections on
the same listener would be considered the same connection.
This fixes it by considering the route for equality/hashing, and by stripping the route off internally where we need to map the ConnectionID to a socket.
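A hypothetical, stripped-down illustration of what "considering route for equality/hashing" means (the real ConnectionID has more to it than this):

```cpp
#include <cstdint>
#include <functional>
#include <string>

struct ConnectionID {
    int64_t id;         // listener / outgoing socket identifier
    std::string pk;     // remote pubkey, if authenticated
    std::string route;  // ZMQ routing id of the specific incoming connection

    bool operator==(const ConnectionID& o) const {
        // Including `route` here is the fix: two incoming connections on the same
        // listener now compare (and hash) as distinct connections.
        return id == o.id && pk == o.pk && route == o.route;
    }
};

namespace std {
template <> struct hash<ConnectionID> {
    size_t operator()(const ConnectionID& c) const {
        return hash<int64_t>{}(c.id)
             ^ (hash<string>{}(c.pk) << 1)
             ^ (hash<string>{}(c.route) << 2);
    }
};
}
```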
When we hit the limit on the number of workers, the proxy thread would stop processing incoming messages, sending it into an infinite loop of death. The check was supposed to use `active_workers()` rather than `workers.size()`, but even that isn't quite right: we want to *always* pull all incoming messages off the socket and queue them internally, since different categories have their own queue sizes (and so we have to pull a message off to know whether we want to keep it, if the category has spare queue room, or drop it).
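A hypothetical, stripped-down sketch of the intended behaviour, with made-up names standing in for the proxy's real socket and category bookkeeping:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <optional>
#include <string>

struct category_queue {
    std::deque<std::string> pending;
    std::size_t max_queue = 200;  // each category has its own limit
};

void drain_incoming(const std::function<std::optional<std::string>()>& recv_nowait,
                    const std::function<category_queue&(const std::string&)>& category_of) {
    // Keep pulling even when every worker is busy -- we can't know whether to keep
    // or drop a message until we've read it and looked up its category.
    while (auto msg = recv_nowait()) {
        auto& cat = category_of(*msg);
        if (cat.pending.size() < cat.max_queue)
            cat.pending.push_back(std::move(*msg));  // spare room: queue internally
        // otherwise: drop it, but never stop reading the socket
    }
}
```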
The "only if we have some idle workers" check here fails catastrophically with a single worker: because of how the loop works, that worker is always occupied when this code gets called, and so connections never get expired at all.
ZMQ's default reconnection time is 100ms, indefinitely, which seems far
too aggressive, particularly where we have some potential for hundreds
or thousands of connections.
This changes the default first attempt to be slightly slower (250ms instead of 100ms), and adds exponential backoff that doubles the interval after each failed connection attempt, up to a maximum of 5s between reconnection attempts, to calm things down.
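For reference, the equivalent raw libzmq socket options via cppzmq look roughly like this; whether LokiMQ applies them in exactly this form is an assumption:

```cpp
#include <zmq.hpp>

void configure_reconnect(zmq::socket_t& sock) {
    sock.set(zmq::sockopt::reconnect_ivl, 250);       // first retry after 250ms
    sock.set(zmq::sockopt::reconnect_ivl_max, 5000);  // then double each time, capped at 5s
}
```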
Idle time was being calculated as the negative of what it should have been, so a connection idle for 30s was considered idle for "-30s", and since -30 is never greater than the idle timeout, the connection would never expire and get closed. This was resulting in SNs keeping connections open forever, which was very likely not helping with connectivity (and was probably also responsible for some of the connection rushes triggering ISP DDoS warnings).
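The substance of the sign fix, with hypothetical names and std::chrono for illustration:

```cpp
#include <chrono>

bool should_expire(std::chrono::steady_clock::time_point last_activity,
                   std::chrono::milliseconds max_idle) {
    auto idle = std::chrono::steady_clock::now() - last_activity;  // previously negated
    return idle > max_idle;  // a "-30s" idle value here could never exceed the timeout
}
```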
This changes SN status recognition to be checked per command invocation rather than at connection time. As this breaks the API quite substantially, though it doesn't really change the functionality, it seems suitable to bump the minor version.
This requires a fundamental shift in how the calling application tells
LokiMQ about service nodes: rather than using a callback invoked on
connection, the application now has to call set_active_sns() (or the
more efficient update_active_sns(), if changes are readily available) to
update the list whenever it changes. LokiMQ then keeps this list internally and uses it when determining whether a command may be invoked.
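A sketch of the new flow; the exact container type the two calls accept is an assumption here, so check lokimq.h for the real signatures:

```cpp
#include <lokimq/lokimq.h>
#include <string>
#include <unordered_set>

void on_sn_list_replaced(lokimq::LokiMQ& lmq, std::unordered_set<std::string> current_sns) {
    // Hand LokiMQ the complete current list of SN pubkeys...
    lmq.set_active_sns(std::move(current_sns));
}

void on_sn_list_delta(lokimq::LokiMQ& lmq,
                      std::unordered_set<std::string> added,
                      std::unordered_set<std::string> removed) {
    // ...or, when the delta is readily available, apply just the changes:
    lmq.update_active_sns(std::move(added), std::move(removed));
}
```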
This release also brings better request responses on errors: when a
request fails, the data argument will now be set to the failure reason,
one of:
- TIMEOUT
- UNKNOWNCOMMAND
- NOT_A_SERVICE_NODE (the remote isn't running in SN mode)
- FORBIDDEN (auth level denies the request)
- FORBIDDEN_SN (SN required and the remote doesn't see us as a SN)
Some of these (UNKNOWNCOMMAND, NOT_A_SERVICE_NODE, FORBIDDEN) were already sent by remotes, but they were not tied back to the originating request, so they would only log a warning while the request still had to time out.
These errors (minus TIMEOUT, plus NO_REPLY_TAG, which signals that a command is a request but didn't include a reply tag) are also sent in response to regular commands, but when received they simply result in a log warning showing the error type and the command that caused the failure.
SS wants this, in particular, to be able to do reachability tests.
(Using connect_remote for this was problematic with pubkey-based routing IDs because the second connection could replace an existing connection.)
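A sketch of such a reachability check using the new failure reasons; the "example.ping" endpoint name is made up:

```cpp
#include <lokimq/lokimq.h>
#include <string>
#include <string_view>
#include <vector>

void check_reachable(lokimq::LokiMQ& lmq, lokimq::ConnectionID conn) {
    lmq.request(conn, "example.ping",
        [](bool success, std::vector<std::string> data) {
            if (success) return;  // got a real reply; the remote is reachable
            std::string_view reason = data.empty() ? std::string_view{} : std::string_view{data[0]};
            if (reason == "TIMEOUT") {
                // remote never answered
            } else if (reason == "NOT_A_SERVICE_NODE" || reason == "FORBIDDEN_SN") {
                // rejected based on service node status
            } else {
                // UNKNOWNCOMMAND, FORBIDDEN, etc.
            }
        });
}
```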
This was never being reset to false, which could really hurt performance (because it being false causes the proxy socket reading loop to short-circuit before reading all available messages, basically requiring one full proxy loop per incoming message).
It can still be set using `lmq.log_level(...)`, but this can be slightly more convenient -- and without it, log messages emitted during the constructor are completely useless.
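Hypothetical usage; the constructor parameters shown here (logger callback followed by a starting LogLevel) are assumptions based on the description above:

```cpp
#include <lokimq/lokimq.h>
#include <iostream>
#include <string>

int main() {
    // Assumed constructor form: logger first, then the initial log level, so that
    // messages emitted inside the constructor itself are visible.
    lokimq::LokiMQ lmq{
        [](lokimq::LogLevel, const char* file, int line, std::string msg) {
            std::cerr << file << ":" << line << ": " << msg << "\n";
        },
        lokimq::LogLevel::info};

    lmq.log_level(lokimq::LogLevel::debug);  // can still be adjusted later
}
```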
Downgrades a bunch of not-useful-at-info-level messages from info to debug. This makes `info` a more useful level for a client that wants messages about startup/shutdown but not random, non-serious connection-related messages.
We really don't *ever* want send to block, no matter how it is called,
since the send is always in the proxy thread. This makes the actual
send call always non-blocking, and adds callbacks that we can invoke on
send failures: either on queue-full errors (which might be recoverable), or on both queue-full and hard failures (which are generally not recoverable). These callbacks are both optional: they are passed in using `send_option::queue_full` (if you just want queue-full notifications) or `send_option::queue_failure` (if you want queue-full notifications *and* other send exceptions).
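A hypothetical sketch of a best-effort send; the callback signature that `send_option::queue_full` expects is an assumption, and the "notify.block" endpoint is made up:

```cpp
#include <lokimq/lokimq.h>
#include <iostream>
#include <string>

void best_effort_notify(lokimq::LokiMQ& lmq, lokimq::ConnectionID conn, std::string height) {
    lmq.send(conn, "notify.block", height,
        lokimq::send_option::queue_full{[] {
            // the outgoing queue for this connection was full; the message was dropped
            std::cerr << "dropped block notification (queue full)\n";
        }});
}
```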
When we try to route an internal message ("BYE", "NOT_A_SERVICE_NODE",
etc.) back to the remote from the proxy thread we can end up trying to
send to a disconnected remote, which raises an exception that isn't caught in the proxy code; fix this by catching and ignoring it.
This also changes the code to send these messages in "dontwait" mode so
that if we can't queue the message we get (and ignore) an exception
rather than blocking.
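A rough sketch of the pattern with plain cppzmq (not LokiMQ's actual internals):

```cpp
#include <string>
#include <zmq.hpp>

// Queue a control message for a routed peer without ever blocking the proxy
// thread, and swallow failures.
void send_control_nowait(zmq::socket_t& sock, const std::string& route, const std::string& data) {
    try {
        // With dontwait, a full queue shows up as an unsent (empty) result instead of
        // blocking; a peer that has already disconnected raises zmq::error_t.
        if (sock.send(zmq::buffer(route), zmq::send_flags::sndmore | zmq::send_flags::dontwait))
            (void) sock.send(zmq::buffer(data), zmq::send_flags::dontwait);
    } catch (const zmq::error_t&) {
        // remote already gone (or otherwise unsendable): nothing useful to do
    }
}
```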
When we fail to send to an SN but can retry (e.g. because we had an incoming connection that no longer works but can retry over an outgoing connection), we were recursing, which resulted in a double free of the request callback (since we'd try to take ownership of the incoming serialized pointer twice).
Rewrite the code to use a loop with single ownership instead.
This also changes request behaviour to fire the failure callback immediately if we can't send a request; previously you'd have to wait for a timeout, which is pointless if we couldn't get the request out.
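A generic sketch of the single-ownership loop (not the actual LokiMQ code):

```cpp
#include <functional>
#include <memory>
#include <string>

// One unique_ptr owns the pending request for the whole retry loop, so nothing can
// take ownership twice, and failure is reported as soon as no further retry helps.
struct pending_request {
    std::string payload;
    std::function<void(bool success)> on_done;
};

void send_with_retry(std::unique_ptr<pending_request> req,
                     const std::function<bool(pending_request&)>& try_send,
                     const std::function<bool(pending_request&)>& switch_to_outgoing) {
    for (;;) {
        if (try_send(*req)) {
            req->on_done(true);
            return;                      // the unique_ptr frees the request exactly once
        }
        if (!switch_to_outgoing(*req)) {
            req->on_done(false);         // fire the failure callback now, not on timeout
            return;
        }
        // otherwise loop and retry over the newly established outgoing connection
    }
}
```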