Commit graph

138 commits

Author SHA1 Message Date
Jason Rhinelander
b905a8a4ff Silence spurious warning on optional send failure
When doing an optional send that gets declined (because we aren't
connected) the "sending would block" warning would still be printed, but
shouldn't be.
2020-04-29 14:54:54 -03:00
Jason Rhinelander
3a0508fdce Fix incoming ConnectionIDs not being storable
ConnectionIDs weren't comparing their routes, which meant that if
external code stored one in a map or set *all* incoming connections on
the same listener would be considered the same connection.

This fixes it by considering route for equality/hashing, and strips
route off internally where we need to map it to a socket.
2020-04-26 12:12:04 -03:00
Jason Rhinelander
f4f1506df0 Add remote address into Message object
Can be useful for end point logging.
2020-04-24 18:59:33 -03:00
Jason Rhinelander
a812abd422 Fix ""_sv literal being non-constexpr 2020-04-23 21:59:24 -03:00
Jason Rhinelander
730633bbae Provide caller Access in Message
This lets a callback set up something at, say, basic level, but provide
different values for an admin auth remote than a basic auth remote.
2020-04-23 21:52:39 -03:00
Jason Rhinelander
e7cd2dedc2 Change worker thread names: "w2" -> "lmq-w2" 2020-04-21 16:58:10 -03:00
Jason Rhinelander
6ddf033674 Fix proxy thread stall when workers fill up
When we hit the limit on the number of workers the proxy thread would
stop processing incoming messages, sending it into an infinite loop of
death.  The check was supposed to use `active_workers()` rather than
`workers.size()`, but even that isn't quite right: we want to *always*
pull all incoming messages off and queue them internally since different
categories have their own queue sizes (and so we have to pull it off to
know whether we want to keep it -- if spare category queue room -- or
drop it).
2020-04-21 16:55:40 -03:00
Jason Rhinelander
0ebfef2164 Set thread names on proxy/workers
Makes debugging which threads are using CPU easier.
2020-04-21 12:02:44 -03:00
Jason Rhinelander
fc1ea66599 Reduce heartbeat frequency to 15s
3s was excessive especially considering that the default heartbeat
timeout is set to 30s.
2020-04-18 02:58:22 -03:00
Jason Rhinelander
238dfa7f78 Drop idle connections regularly
The check here on "only if we have some idle workers" fails
catastrophically with one worker because that worker is always occupied
when this code gets called because of how the loop works and so
connections don't get expired at all.
2020-04-18 02:55:12 -03:00
Jason Rhinelander
2966427cc0 Increase ZMQ socket limit
ZMQ's default is 1024, which we are close to hitting; this changes the
default for LokiMQ to 10000.
2020-04-17 16:13:04 -03:00
Jason Rhinelander
34bbaaf612 Use slower and exponential backoff in reconnection
ZMQ's default reconnection time is 100ms, indefinitely, which seems far
too aggressive, particularly where we have some potential for hundreds
or thousands of connections.

This changes the default to be slightly slower (250ms instead of 100ms)
on the first attempt, and to use exponential backoff doubling the time
between each failed connection attempt up to a max of 5s between
reconnection attempts to calm things down.
2020-04-17 16:09:53 -03:00
Jason Rhinelander
b2518b8eb3 Fix broken idle expiry timeout
Idle time was being calculated as the negative of what it should have
been, so a connection idle for 30s was idle for "-30s", and since -30 is
not greater than whatever the idle time is, it would never expire and
get closed.

This was resulting in SNs keeping connections open forever, which was
very likely not helping with connectivity (and probably also responsible
for some of the connection rushes triggering ISP DDOS warnings).
2020-04-17 16:06:54 -03:00
Jason Rhinelander
712662f144 Fix storing reference to temporary
consume_string returns a temporary string; we wnat consume_string_view
which returns a view into the data being consumed.
2020-04-17 16:05:41 -03:00
Jason Rhinelander
131bc95f65 Fix pre-1.1.0 UNKNOWNCOMMAND detection
1.0.5 sends just ["UNKNOWNCOMMAND"], so the detection here was broken,
which resulted in a warning rather than just a debug log message.
2020-04-14 23:53:19 -03:00
Jason Rhinelander
7de36da483 Add ZMTP heartbeating (enabled by default)
ZMTP heartbeating should help keep the connection alive, and should
result in earlier detection of connection failures.
2020-04-14 16:08:54 -03:00
Jason Rhinelander
b081cf9331 Add missing SET_SNS proxy handler 2020-04-13 16:11:30 -03:00
Jason Rhinelander
84bd5544cc Move pubkey_set into auth.h header
This allows it to be brought in without the full lokimq.h header.
2020-04-13 13:03:19 -03:00
Jason Rhinelander
3b86eb1341 1.1.0: invocation-time SN auth; failure responses
This replaces the recognition of SN status to be checked per-command
invocation rather than on connection.  As this breaks the API quite
substantially, though doesn't really affect the functionality, it seems
suitable to bump the minor version.

This requires a fundamental shift in how the calling application tells
LokiMQ about service nodes: rather than using a callback invoked on
connection, the application now has to call set_active_sns() (or the
more efficient update_active_sns(), if changes are readily available) to
update the list whenever it changes.  LokiMQ then keeps this list
internally and uses it when determining whether to invoke.

This release also brings better request responses on errors: when a
request fails, the data argument will now be set to the failure reason,
one of:

- TIMEOUT
- UNKNOWNCOMMAND
- NOT_A_SERVICE_NODE (the remote isn't running in SN mode)
- FORBIDDEN (auth level denies the request)
- FORBIDDEN_SN (SN required and the remote doesn't see us as a SN)

Some of these (UNKNOWNCOMMAND, NOT_A_SERVICE_NODE, FORBIDDEN) were
already sent by remotes, but there was no connection to a request and so
they would log a warning, but the request would have to time out.

These errors (minus TIMEOUT, plus NO_REPLY_TAG signalling that a command
is a request but didn't include a reply tag) are also sent in response
to regular commands, but they simply result in a log warning showing the
error type and the command that caused the failure when received.
2020-04-12 19:57:19 -03:00
Jason Rhinelander
95540ec7d5 Fix pollitems_stale not being set in some cases
This could cause stalls of up to 250ms before we detect an incoming
message.
2020-04-06 13:16:55 -03:00
Jason Rhinelander
af42875e97 Made simple_string_view take a char type
This allows (most usefully) a `ustring_view` for viewing unsigned char
strings.
2020-04-03 12:28:50 -03:00
Jason Rhinelander
bc49b5e9a0 Expose advanced zmq context setting ability 2020-04-03 12:28:50 -03:00
Jason Rhinelander
e3a86aaf71 Add send_option::outgoing to force a send on an outgoing connection
SS wants this, in particular, to be able to do reachability tests.
(Using connect_remote for this was bad with pubkey-based routing ids
because the second connection could replace an existing connection).
2020-04-03 01:34:21 -03:00
Jason Rhinelander
b9e9f10f29 Reset stale pollitems
This was never being reset to false which could really hurt performance
(because it being false would cause the proxy socket reading loop to
short circuit before reading all available msgs, basically needing one
full proxy loop per incoming message).
2020-04-03 01:34:21 -03:00
Jason Rhinelander
d4ffebebbd Change thread count logs to debug from trace 2020-04-03 01:34:21 -03:00
Jason Rhinelander
6ba70923b9 Add job queue check on total workers size
Without this there could be a race condition where a job could create a
new worker during shutdown, and end up causing an assert failure.
2020-03-29 15:43:17 -03:00
Jason Rhinelander
bd196d08b8 Allow log level to be specified in constructor
It can still be set using `lmq.log_level(...)`, but this can be slightly
more convenient -- and without this log messages in the constructor are
completely useless.
2020-03-29 15:21:20 -03:00
Jason Rhinelander
b66f653708 Less verbose logging at info level
Downgrades a bunch of not-useful-at-info-level debug messages from info
-> debug.  This makes `info` a more useful value for a client that wants
messages about startup/shutdown but not random non-serious connection
related messages.
2020-03-29 15:21:20 -03:00
Jason Rhinelander
716d73d196 All sends use dontwait; add send failure callbacks
We really don't *ever* want send to block, no matter how it is called,
since the send is always in the proxy thread.  This makes the actual
send call always non-blocking, and adds callbacks that we can invoke on
send failures: either on queue full errors (which might be recoverable),
or both full queue and hard failures (which are generally not
recoverable).  These callbacks are both optional: they have to be passed
in using `send_option::queue_full` (if you just want queue full
notifies) or `send_option::queue_failure` (if you want queue full
notifies *and* other send exceptions).
2020-03-29 15:21:20 -03:00
Jason Rhinelander
8e1b2dffa5 Catch connect failures
socket.connect() can throw, e.g. if given an invalid connection address;
catch this, log the error, and return a failure condition.
2020-03-29 14:40:21 -03:00
Jason Rhinelander
2493e2abd4 Remove empty file
All the batch implementation code is in jobs.cpp, this file wasn't meant
to be committed originally.
2020-03-29 12:29:38 -03:00
Jason Rhinelander
bcca8dd34e Catch errors on internal msgs; support non-blocking sends
When we try to route an internal message ("BYE", "NOT_A_SERVICE_NODE",
etc.) back to the remote from the proxy thread we can end up trying to
send to a disconnected remote, which raises an exception, but this isn't
caught in proxy code: fix this by catching and ignoring it.

This also changes the code to send these messages in "dontwait" mode so
that if we can't queue the message we get (and ignore) an exception
rather than blocking.
2020-03-29 11:34:55 -03:00
Jason Rhinelander
fd19f7b183 Trim logged filenames to lokimq/*
Otherwise this includes the full build path which is gross.
2020-03-27 15:17:34 -03:00
Jason Rhinelander
0639bfa629 Avoid segfault on retried SN connection request
When we fail to send to a SN but can retry (e.g. because we had an
incoming connection which no longer works, but can retry an outgoing
connection) we were recursing, but this was resulting in a double-free
of the request callback (since we'd try to take ownership of the
incoming serialized pointer twice).

Rewrite the code to use a loop with single ownership instead.

This also changes the request callback behaviour to fire a failure
callback immediately if we can't send a request; previously you'd have
to wait for a timeout, but that is pointless if we couldn't get the
request out.
2020-03-27 14:59:11 -03:00
Jason Rhinelander
a7c669775f Avoid masking ReplyCallback type with template param 2020-03-27 14:48:35 -03:00
Jason Rhinelander
8b6f6f498c Make request timeout configurable
For example:

    lmq.request(conn, "some.method", callback, lokimq::request_timeout{5s});

will result in the callback being called with a failure if the response
doesn't arrive within 5s.  (If it still arrives, but after the failure
callback, it gets dropped).
2020-03-23 22:30:53 -03:00
Jason Rhinelander
75750001ce Reduce connection check interval and make configurable
The previous 1s default seems on the long side; this reduces it to
250ms.  It also makes it a public member so that it can be configured
(which is mainly needed for the test suite, but might be useful for
lokimq-calling code that needs faster or slower connection cleanups).
2020-03-23 22:29:14 -03:00
Jason Rhinelander
b97f3442e7 Rename keep-alive -> keep_alive in internal serialization
This makes it consistent with other internal parameter names.
2020-03-23 22:28:23 -03:00
Jason Rhinelander
04e2bf7cf7 Change pending_connects from vector to list
Having this as a vector seems to cause armhf/gcc-6 to segfault.  On
closer inspection there's no good reason this should be a vector in the
first place: it only gets used during new connection handshaking and
isn't in any hot loop, plus the elements are fairly large tuples where
shifting elements is going to be relatively expensive.  Thus switching
it to a list everywhere (rather than just on old gcc arm) seems fine.
2020-03-21 12:56:46 -03:00
Jason Rhinelander
036e871cdb 32-bit warning fix 2020-03-13 19:05:12 -03:00
Jason Rhinelander
49f8ef21f1 Install mapbox-variant and cppzmq headers 2020-03-13 19:05:12 -03:00
Jason Rhinelander
e17ca30411 Split up into logical headers and compilation units
lokimq.cpp and lokimq.h were getting monolithic; this splits lokimq.cpp
into multiple smaller cpp files by logical purpose for better parallel
compilation ability.  It also splits up the lokimq.h header slightly by
moving the ConnectionID and Message types into their own headers.
2020-03-13 14:28:21 -03:00
Jason Rhinelander
1c80b61335 Add version to cmake, generate version header 2020-03-13 14:28:05 -03:00
Jason Rhinelander
f4fad9c194 Fix problems on outgoing disconnect
This removes two superfluous erases that occur during connection closing
(the proxy_close_connection just above them already removes the element
from `peers`), and also short-circuits the incoming message loop if our
pollitems becomes stale so that we don't try to use a closed connection.

It also fixes a bug in the outgoing connection index that was
decrementing the wrong connection indices, leading to failures when
trying to send on an existing connection after a disconnect.

Also adds a test case (which fails before the changes in this commit) to
test this.
2020-03-10 13:54:41 -03:00
Jason Rhinelander
0ce614ef8b Workarounds for old cmake/gcc 2020-03-05 01:33:02 -04:00
Jason Rhinelander
fa7d4a8a42 Silence -Wmismatched-tags warning 2020-03-05 01:19:29 -04:00
Jason Rhinelander
882750b700 Don't use consumed data when recursing proxy_send
Fixes "Internal error: Invalid proxy send command; conn_pubkey or
conn_id missing"
2020-03-04 00:13:01 -04:00
Jason Rhinelander
3cb52df837 Fix conn index error
The conn_index entry wasn't being added for outgoing SN connections.
2020-03-03 23:26:14 -04:00
Jason Rhinelander
88f0a10bd8 Fix infinite loop on idle peer expiry 2020-03-03 18:12:13 -04:00
Jason Rhinelander
ea5ff7790d Fix destructor when start() hasn't been called 2020-03-03 17:28:53 -04:00
Jason Rhinelander
428ef12506 Add missing inline to de-templatized hex funcs 2020-03-03 15:06:39 -04:00
Jason Rhinelander
465b398b10 Use string_view instead of taking string type by template
Fixes the case of using a char *
2020-03-02 18:50:43 -04:00
Jason Rhinelander
dcb7e4df0b Add a category command helper class
This allows simplifying:

    lmq.add_category("foo", ...);
    lmq.add_command("foo", "a", ...);
    lmq.add_command("foo", "b", ...);
    lmq.add_request_command("foo", "c", ...);

to:

    lmq.add_category("foo", ...)
        .add_command("a", ...)
        .add_command("b", ...)
        .add_request_command("b", ...)
        ;
2020-03-02 15:11:54 -04:00
Jason Rhinelander
a43ee15b58 _sv wasn't being defined inline 2020-03-02 15:11:29 -04:00
Jason Rhinelander
b3abcfc9ae Add ""_sv literal that works just like C++17 ""sv 2020-03-02 14:24:07 -04:00
Jason Rhinelander
f18f86cf96 Allow optional and incoming to take a bool
This makes it much more convenience to use them with a run-time
condition; this simplifies:

    if (should_be_optional)
        lmq.send(..., send_option::optional{});
    else
        lmq.send(...);

to:

    lmq.send(..., send_option::optional{should_be_optional});
2020-03-02 14:24:07 -04:00
Jason Rhinelander
0493082040 Add default allow to listen_plain()
The default on listen_curve() was supposed to go on both.
2020-03-01 23:54:06 -04:00
Jason Rhinelander
1cf02d0c66 Fix invalid access in peer address debug message 2020-03-01 15:21:09 -04:00
Jason Rhinelander
e4f93afafa Fix trace IP address 2020-02-29 16:34:37 -04:00
Jason Rhinelander
1ad68b2605 Wrap HI response in try/catch
It's possible that the client has gone away, in which case we can get an
exception raised from an unroutable request.
2020-02-29 16:06:08 -04:00
Jason Rhinelander
3be632e73e Fix auth_level for remote connections
The wrong key was being set here (deserialization expects auth_level).
2020-02-29 16:04:43 -04:00
Jason Rhinelander
09c487f327 Add ability to use random routing ids for outgoing 2020-02-29 16:03:25 -04:00
Jason Rhinelander
ece8870896 Move routing prefix into ConnectionID
This allows storing a ConnectionID received in a message callback and
using it later to send another message along the connection without
worrying about a routing id: the ConnectionID will have it if it is
required.  Previously you would have had to store the ConnectionID *and*
the routing prefix, and then specified the route as a
send_option::route{}, which was annoying and cumbersome.
2020-02-29 16:01:47 -04:00
Jason Rhinelander
28e36a3eaf Make AuthLevel stream printable 2020-02-29 15:16:58 -04:00
Jason Rhinelander
2743e576b2 Distinguish between batch jobs and reply jobs
This adds a separate category (and reserve count) for "reply jobs",
which are jobs triggered by receiving a reply to a request, or after a
successful connect or unsuccessful timeout.  Previously these were
scheduled as regular batch jobs; this schedules them as a new "reply
jobs" category with its own reserved threads count.

This also changes the defaults for batch jobs and reply jobs to be based
on the specified general workers count rather than directly on hardware
concurrency, so that if you are on a 16-thread CPU but override general
workers from its default of 16 to 4 and don't change batch workers you
now get reserved batch workers set to 2 rather than 8 which constrains
the typical parallel batch jobs to 4 (i.e. the general worker limit)
rather than exceeding it with the batch job limit.

Similarly for reply jobs, which is now ceil(general/8) by default.
2020-02-28 17:54:00 -04:00
Jason Rhinelander
57f0ca74da Added support for general (non-SN) communications
The existing code was largely set up for SN-to-SN or client-to-SN
communications, where messages can always get to the right place because
we can always send by pubkey.

This doesn't work when we want general communications with a random
remote address.

This commit overhauls the way loki-mq handles communication in a few
important ways:

- Listening instances no longer pass bind addresses into the
constructor; instead they call `listen_curve()` or `listen_plain()`
before invoking `start()`.

- `listen_curve()` is equivalent to the existing bind support: it
listens on a socket and accepts encrypted handshaked connections from
anyone who already knows the server's public key.

- `listen_plain()` is all new: it sets up a plain text listening socket
over which random clients can connect and talk.  End-points aren't
verified, and it isn't encrypted, but if you don't know who you are
talking to then encryption isn't doing anything anyway.

- Connecting to a remote now connections in CURVE encryption or NULL
(plain-text) encryption based on whether you provide a remote_pubkey.
For CURVE, the connection will fail if the pubkey does not match.

- `ConnectionID` objects are now returned when connecting to a remote
address; this object is then passed in to send/request/etc. to direct
the message.  For SN communication, ConnectionID's can be created
implicitly from SN pubkey strings, so the existing interface of
`lmq.send(pubkey, ...)` will still work in most cases.

- A ConnectionID is now passed to the ConnectSuccess and ConnectFailure
callbacks.  This can be used to uniquely identify which connection
succeeded or failed, and can determine whether the remote is a service
node (`.sn()`) and/or the pubkey (`.pubkey()`).  (Obviously the service
node status is only available when the client can do service node
lookups, and the pubkey() is only non-empty for encrypted connections).
2020-02-28 00:16:43 -04:00
Jason Rhinelander
e4d371b026 Fixed string_view c++17 compatibility
string_view isn't supposed to be implicitly convertible to std::string
and code would break compiling under c++17 (when our local string_view
is simply a std::string_view typedef).
2020-02-24 22:20:56 -04:00
Jason Rhinelander
c589599892 Fix off-by-one remotes access 2020-02-23 23:50:47 -04:00
Jason Rhinelander
99f4333b18 Add gcc 5 constexpr workaround 2020-02-23 16:28:22 -04:00
Jason Rhinelander
f2ee3d9b41 constexpr string_view fixes
Pre-C++17 char_traits::compare isn't constexpr so we can't constexpr the
find/rfind methods that use it.

begin() etc, however, can be constexpr (and need to be for some of the
other constexpr methods here that use them).
2020-02-22 23:27:21 -04:00
Jason Rhinelander
da96c1ec79 Make string_view a full std::string_view implementation
Adds the missing pieces plus adds a test script.
2020-02-22 21:59:07 -04:00
Jason Rhinelander
03827ac1f7 Properly remove pending requests after they are invoked 2020-02-20 20:42:44 -04:00
Jason Rhinelander
ed9af92411 Fix reply_tag off-by-one 2020-02-20 20:23:16 -04:00
Jason Rhinelander
de0a5842af Added debugging around pending request creation/removal 2020-02-20 20:13:29 -04:00
Jason Rhinelander
5d30846cee Disable non-working self-connection inproc socket
This was meant to be an optimization but doesn't actually work because
we don't do a ZAP request at all when connecting inproc sockets, so the
metadata we need never gets set.

Remove it so a SN can still connect to itself.
2020-02-17 17:59:01 -04:00
Jason Rhinelander
58e45db996 Fix truncated pubkey
Fix bug from quorumnet transition; quorumnet set the user id to
something like "S:abc..." or "C:abc..." to indicate SN or non-SN, but
now that gets carried in a separate message property but the +2 on the
copying position was still erroneously being used.
2020-02-16 22:42:03 -04:00
Jason Rhinelander
b11d2870bd Fix verify pubkey initialization 2020-02-13 00:53:43 -04:00
Jason Rhinelander
98a4aed68f Fix bad string access for pubkey verification 2020-02-13 00:39:00 -04:00
Jason Rhinelander
1b6f38fc07 Add LMQ_TRACE macro instead of LMQ_LOG(trace
LMQ_TRACE becomes nothing under a release build, which is good because
many traces are in the proxy hot path.

Also fixes some confusing log level comparison logic by flipping the
order of log levels.  Previous trace < debug < info < warn, which was
confusing: now that order is reversed which makes way more sense (i.e.
larger LogLevel means more logging).
2020-02-12 22:19:18 -04:00
Jason Rhinelander
13e7953bd7 Updated headers and remove dead code 2020-02-12 22:05:15 -04:00
Jason Rhinelander
63e70f9912 Fix delayed proxy-scheduled batch jobs
Batch jobs scheduled by the proxy thread itself were delayed to the next
poll timeout (because nothing ever gets sent on a socket).  Add a
variable to bypass the next poll to handle this case.
2020-02-11 19:08:19 -04:00
Jason Rhinelander
ccfb6d080b Add request/reply abstraction
This allows making RPC requests with a callback that gets called when
the response comes back.  The is essentially a wrapper around doing it
yourself (i.e. by setting up a server-side "request" and client-side
"reply" command where "request" responds with a "reply" command), but
abstracted into lokimq itself as it is likely to be very useful when
integrating client/server connections rather than peer-to-peer
connections.
2020-02-11 02:30:07 -04:00
Jason Rhinelander
061bdee0a8 Add zmq timer support 2020-02-11 02:29:00 -04:00
Jason Rhinelander
7393e8422c Improve on-the-fly bt deserialization interface
This separates the dict traversal into `next_...` functions that return
a key-value pair, and `consume_...` that return just the value.  The
latter is particularly useful when using `skip_until` to position
yourself at a known key.
2020-02-11 02:18:14 -04:00
Jason Rhinelander
3ff66490ad Add ability to run a batch completion job in the proxy thread
For very small jobs (i.e. just setting a flag or something) this is
going to be faster than dispatching to a thread.
2020-02-11 02:17:12 -04:00
Jason Rhinelander
03ea49167c Various small optimizations 2020-02-06 00:50:31 -04:00
Jason Rhinelander
f75b6cf221 Added batch job implementation + various improvements
This overhauls the proposed batch implementation (described in the
README but previously not implemented) and implements it.

Various other minor improvements and code restructuring.

Added a proposed "request" type to the README to be implemented; this is
like a command, but always expects a ["REPLY", TAG, ...] response.  The
request invoker provides a callback to invoke when such a REPLY arrives.
2020-02-05 20:21:27 -04:00
Jason Rhinelander
f3d583c520 Initial LokiMQ release
This library is adapted from lokid's existing quorumnet code (added in
6.x) used for SN-to-SN communication for quorum voting but generalized
to be usable both there and as a basis for other communication channels
with loki projects (for example: wallet-to-lokid communication; loki-ss
and lokinet internal communication with lokid; loki-ss to loki-ss
communication and message passing; perhaps eventually loki p2p traffic).

This initial release compiles but likely has a few warts and bugs that
need ironing out in the implementation before it is production ready.
Some tests will follow.
2020-02-02 22:39:26 -04:00