Commit Graph

14 Commits

Author SHA1 Message Date
Jason Rhinelander 8f97add30f
Add epoll support for Linux
Each call to zmq::poll is painfully slow when we have many open zmq
sockets, such as when we have 1800 outbound connections (i.e. connected
to every other service node, as service nodes sometimes have and the
Session push notification server *always* has).

In testing on my local Ryzen 5950 system each time we go back to
zmq::poll incurs about 1.5ms of (mostly system) CPU time with 2000 open
outbound sockets, and so if we're being pelted with a nearly constant
stream of requests (such as happens with the Session push notification
server) we incur massive CPU costs every time we finish processing
messages and go back to wait (via zmq::poll) for more.

In a test of a simple ZMQ (no OxenMQ) client/server that establishes
2000 connections to a server, and then has the server send a message
back on a random connection every 1ms, we get atrocious CPU usage: the
proxy thread sits at a constant 100% of a core.  Virtually all of this
is spent in the poll call itself, though, so we aren't really
bottlenecked by how much can go through the proxy thread: the poll call
burns its CPU and then returns right away, we process the queue of
messages, and return to another poll call.  If many messages arrive in
that time (because messages are coming fast and the poll was slow),
then we process them all at once before going back to the poll, so the
main consequences here are that:

1) We use a huge amount of CPU
2) We introduce latency in busy situations because we have to complete
   the poll call (e.g. 1.5ms) before the next message can be processed.
3) If traffic is very bursty then the latency can manifest another
   problem: in the time it takes to poll we could accumulate enough
   incoming messages to overfill our internal per-category job queue,
   which was happening in the SPNS.

(I also tested with 20k connections, and the poll time scaled linearly:
we still processed everything, but in larger chunks, because every poll
call took about 15ms, and so we'd have about 15 messages at a time to
process, with added latency of up to 15ms).

Switching to epoll *drastically* reduces the CPU usage in two ways:

1) It's massively faster by design: there's a single setup and
   communication of all the polling details to the kernel which we only
   have to do when our set of zmq sockets changes (which is relatively
   rare).
2) We can further reduce CPU time because epoll tells us *which* sockets
   need attention, and so if only 1 connection out of the 2000 sent us
   something we can only bother checking that single socket for
   messages.  (In theory we can do the same with zmq::poll by querying
   for events available on the socket, but in practice it doesn't
   improve anything over just trying to read from them all).

In my straight zmq test script, using epoll instead reduced CPU usage in
the sends-every-1ms scenario from a constant pegged 100% of a core to an
average of 2-3% of a single core.  (Moreover this CPU usage level didn't
noticeably change when using 20k connections instead of 2k).
2023-09-14 15:03:15 -03:00
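
For reference, a minimal hypothetical sketch of the epoll approach
described in the commit above (not the actual OxenMQ proxy code): each
zmq socket's ZMQ_FD is registered with epoll once, and after each
wakeup only the sockets epoll flags are checked and drained via
ZMQ_EVENTS.  Assumes Linux and cppzmq's zmq::sockopt accessors.

    #include <sys/epoll.h>
    #include <cstddef>
    #include <vector>
    #include <zmq.hpp>

    int main() {
        zmq::context_t ctx;
        std::vector<zmq::socket_t> socks;
        for (int i = 0; i < 2000; i++) {
            socks.emplace_back(ctx, zmq::socket_type::dealer);
            socks.back().connect("tcp://127.0.0.1:5555");
        }

        // One-time registration: only needs to be redone when the set of
        // sockets changes, not on every wait.
        int efd = epoll_create1(0);
        for (std::size_t i = 0; i < socks.size(); i++) {
            epoll_event ev{};
            ev.events = EPOLLIN;
            ev.data.u64 = i;  // remember which socket this fd belongs to
            epoll_ctl(efd, EPOLL_CTL_ADD, socks[i].get(zmq::sockopt::fd), &ev);
        }

        std::vector<epoll_event> evs(socks.size());
        for (;;) {
            int n = epoll_wait(efd, evs.data(), static_cast<int>(evs.size()), -1);
            // Only the flagged sockets need attention:
            for (int i = 0; i < n; i++) {
                auto& s = socks[evs[i].data.u64];
                // ZMQ_FD is edge-triggered-ish: drain until ZMQ_EVENTS clears.
                while (s.get(zmq::sockopt::events) & ZMQ_POLLIN) {
                    zmq::message_t msg;
                    if (!s.recv(msg, zmq::recv_flags::dontwait))
                        break;
                    // ... handle msg ...
                }
            }
        }
    }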
Jason Rhinelander edcde9246a
Fix zmq socket limit setting
MAX_SOCKETS wasn't working properly because ZMQ uses it when the context
is initialized, which happens when the first socket is constructed on
that context.

For OxenMQ, we had several sockets constructed on the context during
OxenMQ construction, which meant the context_t was being initialized
during OxenMQ construction, rather than during start(), and so setting
MAX_SOCKETS would have no effect and you'd always get the default.

This fixes it by default-constructing all of the member zmq::socket_t
variables, then replacing them with proper zmq::socket_t's during
startup() so that zmq::context_t initialization is also deferred to the
right place.

A second issue found during testing (also fixed here) is that creating
the socket a worker thread uses to communicate with the proxy could
fail if it would violate the zmq max sockets limit, which wound up
throwing an uncaught exception and aborting.  This commit
pre-initializes (but doesn't connect) all potential worker thread
sockets during start() so that a lazily-initialized worker thread will
have one already set up rather than having to create a new one (which
could fail).
2022-08-05 10:40:01 -03:00
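
A hedged illustration of the ordering constraint described above, using
the raw libzmq C API rather than the OxenMQ code: the ZMQ_MAX_SOCKETS
value has to be in place before the first socket is constructed on the
context.

    #include <zmq.h>

    int main() {
        void* ctx = zmq_ctx_new();

        // Correct order: raise the limit while the context has no sockets.
        zmq_ctx_set(ctx, ZMQ_MAX_SOCKETS, 4096);

        // The context finishes initializing when the first socket is
        // created, consuming whatever ZMQ_MAX_SOCKETS value is set then.
        void* sock = zmq_socket(ctx, ZMQ_DEALER);

        // Setting the option here, after the first socket exists, would be
        // too late to change the effective limit.

        zmq_close(sock);
        zmq_ctx_term(ctx);
        return 0;
    }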
Sean Darcy c91e56cf2d adds custom formatter for OMQ structs that have to_string member 2022-08-04 10:50:02 +10:00
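
A hedged sketch of the kind of formatter that commit describes: an
fmt::formatter specialization enabled for any type exposing a
to_string() member.  The detection helper and the DemoID struct are
illustrative, not the actual OxenMQ code.

    #include <fmt/format.h>
    #include <string>
    #include <string_view>
    #include <type_traits>
    #include <utility>

    // Detect a to_string() member.
    template <typename T, typename = void>
    constexpr bool has_to_string = false;
    template <typename T>
    constexpr bool has_to_string<
            T, std::void_t<decltype(std::declval<const T&>().to_string())>> = true;

    struct DemoID {  // hypothetical stand-in for an OMQ struct
        int id;
        std::string to_string() const { return "DemoID[" + std::to_string(id) + "]"; }
    };

    namespace fmt {
    // Formatter for anything with a to_string() member: format the string.
    template <typename T>
    struct formatter<T, char, std::enable_if_t<has_to_string<T>>>
            : formatter<std::string_view> {
        auto format(const T& val, format_context& ctx) const {
            return formatter<std::string_view>::format(val.to_string(), ctx);
        }
    };
    }  // namespace fmt

    int main() {
        fmt::print("{}\n", DemoID{42});  // prints: DemoID[42]
    }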
Jason Rhinelander 045df9cb9b
Use oxen-encoding and add compatibility shim headers
bt_*, hex, base32z, base64 all moved to oxen-encoding a while ago; this
finishes the move by removing them from oxenmq and instead making oxenmq
depend on oxen-encoding.
2022-01-18 10:30:23 -04:00
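
A hypothetical example of what such a compatibility shim header looks
like: the encoding functions now live in oxen-encoding's oxenc::
namespace, and a shim re-exports them under the old oxenmq:: names so
existing code keeps compiling.  The header name and the exact set of
re-exported names here are assumptions based on the commit message.

    // oxenmq/encoding_compat.h -- hypothetical shim header name
    #pragma once
    #include <oxenc/base32z.h>
    #include <oxenc/base64.h>
    #include <oxenc/hex.h>

    namespace oxenmq {
    // Pull the oxenc implementations in under the old namespace:
    using oxenc::from_base32z;
    using oxenc::from_base64;
    using oxenc::from_hex;
    using oxenc::to_base32z;
    using oxenc::to_base64;
    using oxenc::to_hex;
    }  // namespace oxenmq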
Jason Rhinelander fe8a1f4306
Disable IPv6 by default
libzmq's IPv6 support is buggy when combined with DNS hostnames: in
particular, if you try to connect to a DNS name that has an IPv6
address, zmq will *only* attempt an IPv6 connection, even if the local
client has no IPv6 connectivity, and even if the remote is only
listening on its IPv4 address.

This is much too unreliable to enable by default.
2021-12-02 19:01:21 -04:00
Jason Rhinelander 03749c87f0
Merge pull request #67 from jagerman/ipv6
Enable ipv6 support on sockets
2021-11-30 14:20:43 -04:00
Jason Rhinelander 375cfab4ce Rebrand variables LMQ -> OMQ
Various things were still using the lmq (or LMQ) names; change them all
to omq/OMQ.
2021-11-30 14:10:47 -04:00
Jason Rhinelander f04bd72a4c Enable ipv6 support on sockets
Without this you cannot bind or connect to IPv6 addresses because,
oddly, libzmq defaults ipv6 to disabled.
2021-11-30 13:50:24 -04:00
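
A minimal sketch of the underlying libzmq option involved in these two
IPv6 commits: IPv6 is off by default on zmq sockets and must be enabled
per socket before bind/connect.  This uses cppzmq's sockopt API and is
not the OxenMQ code itself.

    #include <zmq.hpp>

    int main() {
        zmq::context_t ctx;
        zmq::socket_t sock{ctx, zmq::socket_type::router};

        sock.set(zmq::sockopt::ipv6, true);  // without this, IPv6 endpoints fail
        sock.bind("tcp://[::]:5555");        // bind on an IPv6 wildcard address
    }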
Jason Rhinelander 4a6bb3f702 Fix messages coming back on an outgoing connection
The recent PR that revamped the connection IDs missed a case when
connecting to service nodes: we store the SN pubkey in `peers`, but
then fail to find the peer when we look it up by connection id.

This adds the required tracking to fix that case (and adds a test that
fails without the fix here).
2021-07-01 00:37:55 -03:00
Jason Rhinelander ad04c53c0e
Simplify conn index handling (#41)
The existing code was overly complicated by trying to track indices in
the `connections` vector, a complication arising because things get
removed from `connections`, requiring all the internal index values to
be updated.  So we ended up with a connection ID inside the
ConnectionID object, plus a map of those connection IDs to the
`connections` index, plus a map back from indices to ConnectionIDs.

Though this usually seems to work, I recently noticed an
oxen-storage-server sending oxend requests on the wrong connection, and
so I suspect there are some rare edge cases here where a failed
connection's index might not be updated properly.

This PR simplifies the whole thing by getting rid of connection indices
entirely and keeping the connections in a map (keyed on connection ids
that never change).  This might end up being a little less efficient
than the vector, but it's unlikely to matter and the added complexity
isn't worth it.
2021-06-23 17:51:25 -03:00
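
An illustrative sketch (with hypothetical names, not the actual OxenMQ
types) of the data-structure change described above: connections keyed
by an id that is never reused, so removing one connection never
invalidates the handles to any other.

    #include <cstdint>
    #include <map>
    #include <string>

    struct Connection {
        std::string remote_addr;
        // ... zmq socket, pubkey, etc. ...
    };

    class ConnectionTable {
        std::map<int64_t, Connection> conns;
        int64_t next_id = 1;  // monotonically increasing, never reused

    public:
        int64_t add(Connection c) {
            int64_t id = next_id++;
            conns.emplace(id, std::move(c));
            return id;  // stable handle until this connection is removed
        }
        Connection* get(int64_t id) {
            auto it = conns.find(id);
            return it == conns.end() ? nullptr : &it->second;
        }
        void remove(int64_t id) {
            conns.erase(id);  // no other ids shift, unlike erasing from a vector
        }
    };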
Jason Rhinelander 5dd7c12219 Add support for listening after startup
This commit adds support for listening on new ports after startup.  This
will make things easier in storage server, in particular, where we want
to delay listening on public ports until we have an established
connection and initial block status update from oxend.
2021-06-23 10:51:08 -03:00
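
A hedged usage sketch of the capability added here: binding a new
listening port only after start().  The default construction and the
listen_plain() arguments are assumptions about the public API, not code
taken from this commit.

    #include <oxenmq/oxenmq.h>

    int main() {
        oxenmq::OxenMQ omq;
        omq.start();  // start the proxy thread with no public listeners yet

        // ... later, e.g. once the oxend connection is up and synced ...
        omq.listen_plain("tcp://0.0.0.0:22020");  // now accept connections
    }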
Jason Rhinelander ac58e5b574 Rename PUBKEY_BASED_ROUTING_ID to EPHEMERAL_ROUTING_ID
And similarly for the connect_option
2021-04-15 15:42:04 -03:00
Jason Rhinelander dc40ebd428 Add connect_option::*; allow per-connection pubkey-based routing setting
Storage server, in particular, needs to disable pubkey-based routing on
its connection to oxend (because it is sharing oxend's own keys), but
wants it by default for SS-to-SS connections.  This allows the oxend
connection to turn it off so that we don't have oxend omq connections
replacing each other.
2021-04-15 15:11:54 -03:00
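
A hedged sketch of using the per-connection option this commit
describes, written against the current name (ephemeral_routing_id, per
the rename above); the connect_remote callbacks and option details are
assumptions rather than code from this commit.

    #include <oxenmq/oxenmq.h>
    #include <string_view>

    int main() {
        oxenmq::OxenMQ omq;
        omq.start();

        // This connection shares oxend's own keys, so turn off ephemeral
        // routing ids just for it, leaving the global default untouched:
        omq.connect_remote(
                oxenmq::address{"tcp://127.0.0.1:22025"},
                [](oxenmq::ConnectionID) { /* connected */ },
                [](oxenmq::ConnectionID, std::string_view) { /* failed */ },
                oxenmq::connect_option::ephemeral_routing_id{false});
    }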
Jason Rhinelander 2ae6b96016 Rename LokiMQ to OxenMQ 2021-01-14 15:32:38 -04:00
Renamed from lokimq/connections.cpp