There's a very rare race condition where a tagged thread doesn't seem to
exist when the proxy tries syncing startup with it, and so the proxy
thread hangs during startup.
This addresses it by not reading the `proxy_thread` variable (which
probably isn't thread-safe) during the worker's startup, and instead
signalling the you-need-to-shut-down condition via a third state of the
(formerly boolean) `tagged_go`.
MAX_SOCKETS wasn't working properly because ZMQ uses it when the context
is initialized, which happens when the first socket is constructed on
that context.
For OxenMQ, we had several sockets constructed on the context during
OxenMQ construction, which meant the context_t was being initialized
during OxenMQ construction, rather than during start(), and so setting
MAX_SOCKETS would have no effect and you'd always get the default.
This fixes it by making all of the member zmq::socket_t's
default-constructed, then replacing them with properly constructed
sockets during start() so that zmq::context_t initialization is also
deferred to the right place.
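A minimal sketch of the approach, assuming a reasonably recent cppzmq
(the `context.set(zmq::ctxopt::max_sockets, ...)` call needs >= 4.7) and
with illustrative member names rather than OxenMQ's actual ones:

    #include <zmq.hpp>

    class Example {
        zmq::context_t context;
        zmq::socket_t command;  // default-constructed: no underlying zmq socket yet,
        zmq::socket_t workers;  // so the context's socket table is not yet allocated

    public:
        void start(int max_sockets) {
            // libzmq sizes its socket table when the first socket is created on
            // the context, so MAX_SOCKETS must be set before any socket exists:
            context.set(zmq::ctxopt::max_sockets, max_sockets);
            command = zmq::socket_t{context, zmq::socket_type::router};
            workers = zmq::socket_t{context, zmq::socket_type::router};
        }
    };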
A second issue found during testing (also fixed here) is that creating
the socket a worker thread uses to communicate with the proxy could fail
if doing so would violate the zmq max sockets limit, which wound up
throwing an uncaught exception and aborting. This pre-initializes (but
doesn't connect) all potential worker thread sockets during start() so
that a lazily-initialized worker thread will already have a socket set
up rather than having to create a new one (which could fail).
Internal messages (control messages, worker messages) are always 3 parts
or fewer, so we can optimize by using a stack-allocated std::array for
those cases rather than continually clearing and expanding a
heap-allocated vector.
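For illustration, a hedged sketch of receiving such a message into a
fixed-size array (not the actual OxenMQ code; the helper name is made up):

    #include <array>
    #include <zmq.hpp>

    // Receive up to 3 parts of an internal message into a stack-allocated
    // array; returns the number of parts actually read.
    size_t recv_internal(zmq::socket_t& sock, std::array<zmq::message_t, 3>& parts) {
        size_t n = 0;
        bool more = true;
        while (more && n < parts.size()) {
            if (!sock.recv(parts[n], zmq::recv_flags::none))
                break;
            more = parts[n].more();  // true if another message part follows
            ++n;
        }
        return n;
    }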
Change the internal worker routing id to be "w" followed by the raw
integer bytes, so that we can just memcpy them into a uint32_t rather
than needing to do str -> integer conversion on each received worker
message.
(This also eliminates a vestigial call into oxenc internals).
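A sketch of the routing id scheme described above (illustrative helper
names, not the exact OxenMQ code):

    #include <cstdint>
    #include <cstring>
    #include <string>

    // Routing id is "w" followed by the 4 raw bytes of the worker index.
    std::string make_worker_id(uint32_t index) {
        std::string id = "w";
        id.append(reinterpret_cast<const char*>(&index), sizeof(index));
        return id;  // 5 bytes total
    }

    uint32_t parse_worker_id(const std::string& id) {
        uint32_t index;
        std::memcpy(&index, id.data() + 1, sizeof(index));  // skip the leading "w"
        return index;
    }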
I don't know what this set was originally meant to be doing, but it
currently does nothing (except adding overhead).
The comment says it "owns" the instances, but that isn't really true; the
instances effectively manage themselves as their pointers are passed back
and forth in the communications between proxy and workers.
This adds a much simpler `Job` implementation of `Batch` that is used
for simple no-return, no-completion jobs (as are initiated via
`omq.job(...)`).
This reduces the overhead involved in constructing/destroying the Batch
instance for these common jobs.
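For example (assuming `omq` is a running OxenMQ instance):

    // Fire-and-forget: no return value and no completion callback.
    omq.job([] { /* do some work on the thread pool */ });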
bt_*, hex, base32z, base64 all moved to oxen-encoding a while ago; this
finishes the move by removing them from oxenmq and instead making oxenmq
depend on oxen-encoding.
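After this change callers include the oxen-encoding headers directly,
e.g. (header paths as shipped by oxen-encoding):

    #include <oxenc/base64.h>
    #include <oxenc/hex.h>
    #include <string>

    std::string h = oxenc::to_hex("hello");     // "68656c6c6f"
    std::string b = oxenc::to_base64("hello");  // "aGVsbG8="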
libzmq's IPv6 support is buggy when combined with DNS hostnames: in
particular, if you try to connect to a DNS name that has an IPv6
address, then zmq will *only* try an IPv6 connection, even if the local
client has no IPv6 connectivity, and even if the remote is only
listening on its IPv4 address.
This is much too unreliable to enable by default.
Currently if the proxy thread fails to start (typically because a bind
fails), the exception happens in the proxy thread, where it cannot be
caught by the caller (and so aborts the program).
This makes it nicer by transporting startup exceptions back to the
start() call.
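For example, a failed bind can now be handled by the caller (assuming
`omq` is a constructed OxenMQ instance; the address is illustrative):

    omq.listen_plain("tcp://127.0.0.1:12345");
    try {
        omq.start();
    } catch (const std::exception& e) {
        std::cerr << "OxenMQ failed to start: " << e.what() << "\n";
    }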
Currently if you pass a nullptr for the Logger you get a
std::bad_function_call thrown from some random thread the first time a
log message goes out.
This fixes it by allowing a nullptr Logger that simply logs nothing.
Makes some send/connection options more robust to "do nothing" runtime
values, which the Python wrapper needs.
Also fixes a bunch of doc typos.
Bump version to 1.2.8 so that new pyoxenmq can build-depend on it.
inproc support is special in zmq: in particular it completely bypasses
the auth layer, which causes problems in OxenMQ because we assume that a
message will always have auth information (set during initial connection
handshake).
This adds an "always-on" inproc listener and adds a new `connect_inproc`
method for a caller to establish a connection to it.
It also throws exceptions if you try to `listen_plain` or `listen_curve`
on an inproc address, because that won't work for the reasons detailed
above.
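Rough usage sketch (the callback parameter types here are assumptions
based on the other connect_* methods):

    auto self = omq.connect_inproc(
            [](oxenmq::ConnectionID) { /* connected to our own inproc listener */ },
            [](oxenmq::ConnectionID, std::string_view reason) { /* connection failed */ });
    omq.request(self, "category.endpoint",
            [](bool success, std::vector<std::string> data) { /* handle the reply */ });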
The recent PR that revamped the connection IDs missed a case when
connecting to service nodes: we store the SN pubkey in peers, but then
fail to find the peer when we look it up by connection id.
This adds the required tracking to fix that case (and adds a test that
fails without the fix here).
The existing code was overly complicated by trying to track indices into
the `connections` vector, a complication that arises because things get
removed from `connections`, requiring all the internal index values to
be updated. So we ended up with a connection ID inside the ConnectionID
object, plus a map of those connection IDs to `connections` indices, and
needed a map back from indices to ConnectionIDs.
Though this usually seems to work, I recently noticed an
oxen-storage-server sending oxend requests on the wrong connection, and
so I suspect there are some rare edge cases here where a failed
connection's index might not be updated properly.
This PR simplifies the whole thing by getting rid of connection indices
entirely and keeping the connections in a map (with connection ids that
never change). This might end up being a little less efficient than the
vector, but it's unlikely to matter and the added complexity isn't worth
it.
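A minimal sketch of the map-based approach, with illustrative types
rather than OxenMQ's actual members:

    #include <cstdint>
    #include <map>
    #include <zmq.hpp>

    std::map<int64_t, zmq::socket_t> connections;
    int64_t next_conn_id = 1;

    int64_t add_connection(zmq::socket_t sock) {
        int64_t id = next_conn_id++;
        connections.emplace(id, std::move(sock));
        return id;
    }

    void close_connection(int64_t id) {
        // Erasing one entry leaves every other id valid: no index shuffling needed.
        connections.erase(id);
    }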
This commit adds support for listening on new ports after startup. This
will make things easier in storage server, in particular, where we want
to delay listening on public ports until we have an established
connection and initial block status update from oxend.
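For example (the bind address and the condition for binding are
illustrative):

    omq.start();
    // ... later, e.g. once we have an oxend connection and an initial block update:
    omq.listen_curve("tcp://0.0.0.0:22025");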
I realized after merging the previous PR that it is difficult to
correctly pass ownership into a timer, because something like:
    TimerID x = omq.add_timer([&] { omq.cancel_timer(x); }, 5ms);
doesn't work when the timer job needs to outlive the caller. My next
approach was:
    auto x = std::make_shared<TimerID>();
    *x = omq.add_timer([&omq, x] { omq.cancel_timer(*x); }, 5ms);
but this has two problems: first, TimerID wasn't default constructible,
and second, there is no guarantee that the assignment to *x happens
before (and is visible to) the access for the cancellation.
This commit fixes both issues: TimerID is now default constructible, and
an overload is added that takes an lvalue reference to the TimerID to
set rather than returning it (and guarantees that it will be set before
the timer is created).
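Usage of the new overload (assuming `omq` is an OxenMQ instance, chrono
literals are in scope, and the overload takes the TimerID to fill as its
first argument):

    auto timer = std::make_shared<TimerID>();
    omq.add_timer(*timer, [&omq, timer] { omq.cancel_timer(*timer); }, 5ms);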
Updates `add_timer` to return a new opaque TimerID object that can later
be passed to `cancel_timer` to cancel an existing timer.
Also adds timer tests, which were previously omitted (except for one in
the tagged threads section), along with a new test for timer deletion.
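Basic usage (assuming `omq` is an OxenMQ instance and chrono literals
are in scope):

    TimerID stats_timer = omq.add_timer([] { /* periodic work */ }, 30s);
    // ... later, when the timer is no longer wanted:
    omq.cancel_timer(stats_timer);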
Storage server, in particular, needs to disable pubkey-based routing on
its connection to oxend (because it is sharing oxend's own keys), but
wants it by default for SS-to-SS connections. This allows the oxend
connection to turn it off so that we don't have oxend omq connections
replacing each other.
This provides an interface for sending a reply to a message later (i.e.
after the Message& itself is no longer valid) by using a new
`send_later()` method of the Message instance that returns an object
that can properly route replies (and can outlive the Message it was
called on).
Intended use is:
    run_this_lambda_later([send=msg.send_later()] {
        send.reply("content");
    });
which is equivalent to:
    run_this_lambda_later([&msg] {
        msg.send_reply("content");
    });
except that it works properly even if the lambda is invoked beyond the
lifetime of `msg`.