- Show [filename:line] rather than __func__ because __func__ is useless
when called from a lambda (it just shows `operator()`).
- Add time since startup (like lokinet does)
- Fix up log formatter so that it doesn't have to double-format and
doesn't break when given a value that isn't a string literal for the
format string.
- Calling via ->log instead of ->debug, etc. requires changing
`OXEN_LOG(error, ...)` to `OXEN_LOG(err, ...)` to match the actual
spdlog log level (which is different from the logging method for some
reason).
- Enforce hex rather than accepting any random 66- or 64-character
string as a pubkey
- Clean up pubkey -> integer code
- The cleanup fixes a bug where pubkey -> integer conversion was
skipping the first two bytes on testnet (and ended up in UB by reading
the null + one byte beyond the end of the string for testnet
addresses). THIS WILL BREAK EXISTING TESTNET PUBKEY->SWARM VALUES!
(but it's only testnet, so that's okay).
Typically storage server output goes to systemd, which passes it off to
the journal; journalctl knows how to show (-a) ANSI colors, and also
knows how to strip them when you don't provide -a.
No need to bifurcate rvalue and const-lvalue versions here: a plain
value will work exactly the same (copying or moving based on what the
caller provides).
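
A minimal sketch of the by-value approach, using a hypothetical `Message` class that is not part of the codebase: a single overload taking the parameter by value covers both the copy and the move case.

```cpp
#include <string>
#include <utility>

// Hypothetical class for illustration only.
class Message {
    std::string body_;

  public:
    // One by-value overload replaces separate `const std::string&` and
    // `std::string&&` overloads: an lvalue argument is copied into `body`,
    // an rvalue argument is moved into it, and either way it is then moved
    // into the member.
    void set_body(std::string body) { body_ = std::move(body); }

    const std::string& body() const { return body_; }
};
```

Calling `set_body(s)` leaves `s` untouched, while `set_body(std::move(s))` or `set_body(std::string{"..."})` avoids the copy.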
Refactoring:
- Overhaul how pubkeys are stored: rather than storing in strings in multiple forms (hex, binary,
base32) we now store each pubkey type (legacy, ed25519, x25519) in a type-safe container which
extends a std::array, thus giving us allocation-free storage for them.
- Do conversion into these types early on so that *most* of the code takes type-safe pubkeys
rather than strings in random formats; thus making the public API (e.g. HTTP request parser)
deal with encoding/decoding, rather than making it happen deep down. (For example,
ChannelEncryption shouldn't need to worry about decoding a hex string). This was pretty messy:
some code wanted hex, some base32z, some base32z with ".snode" appended, some base64.
- When printing pubkeys in logs, print them as hex, not base32z. Base32z was really hard to
reconcile with pubkeys because everywhere *else* we show them as hex.
- Overhaul sn_record_t with a much lighter one: it is now a very simple struct with just the members
it needs (ip, ports, pubkeys), and is no longer hashable (instead we can hash on the
.legacy_pubkey member).
- Simplify some interfaces taking multiple values from sn_record_t by just passing the sn_record_t
- Moved a bunch of things that were in the global namespace into the oxen namespace.
- Simplify swarm storage and lookup: we previously had various methods that did a linear scan on all
active nodes, and could do string-based searching for different representations of the pubkey
(hex, base32z, etc.). Replace it all with a simple structure of:
    unordered_map<legacy_pubkey, sn_record_t>
    unordered_map<ed25519_pubkey, legacy_pubkey>
    unordered_map<x25519_pubkey, legacy_pubkey>
where the first map holds the entries and the latter two point us to the key in the first one.
- De-templatize ChannelEncryption, and make it take pubkey types rather than strings. (The template
was doing nothing as it was only ever used with T=std::string).
- Fix a leak in ChannelEncryption CBC decryption if it throws (the context would leak) by storing
the context in a unique_ptr with a deleter that frees the context.
- Optimized ChannelEncryption encryption code somewhat by reducing allocations via more use of
string_views and tweaking how we build the encrypted strings.
- Fix legacy (i.e. Monero) signature generation: the random byte value being generated was only
setting the first 11 bytes of 32.
- Miscellaneous code cleanups (many of which are C++14/17 syntax).
- Moved std::literals namespace import to a small number of top-level headers so they are available
everywhere. (std::literals is guaranteed to never conflict with user-space literals precisely so
that doing this everywhere is perfectly safe without polluting -- i.e. "foo"sv, 44s can *never* be
anything other than string_view and seconds literals).
- Made pubkey parsing (e.g. in HTTP headers) accept any of hex/base32z/base64 so that we can, at
some point in the future, just stop using base32z representation (since it conflicts badly with
lokinet .snode address which is based on the ed25519 pubkey, not the legacy pubkey).
- RateLimiter - simplify it to take the IP as a uint32_t (and thus avoid allocations). (Similarly it
avoids allocations in the SN case via using the new pubkey type).
- Move some repeated tasks into the simpler oxenmq add_timer().
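
The type-safe pubkey containers and the three-map lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the tag types, the `pubkey_hash` functor, the `swarm_index` class and the `sn_record_t` member names are all made up for the example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// Hypothetical tag types: they make each key kind a distinct C++ type.
struct legacy_tag;
struct ed25519_tag;
struct x25519_tag;

// A pubkey is a fixed 32-byte array, so storing one never allocates.
template <typename Tag>
struct pubkey : std::array<unsigned char, 32> {};

using legacy_pubkey = pubkey<legacy_tag>;
using ed25519_pubkey = pubkey<ed25519_tag>;
using x25519_pubkey = pubkey<x25519_tag>;

// Pubkeys are already uniformly distributed, so (as an assumption for this
// sketch) the leading bytes are used directly as the hash value.
struct pubkey_hash {
    template <typename Tag>
    std::size_t operator()(const pubkey<Tag>& pk) const noexcept {
        std::size_t h;
        std::memcpy(&h, pk.data(), sizeof(h));
        return h;
    }
};

// Simplified record; member names are illustrative.
struct sn_record_t {
    std::uint32_t ip;
    std::uint16_t port;
    legacy_pubkey pubkey_legacy;
    ed25519_pubkey pubkey_ed25519;
    x25519_pubkey pubkey_x25519;
};

class swarm_index {
    // The first map owns the records; the other two map to its key.
    std::unordered_map<legacy_pubkey, sn_record_t, pubkey_hash> nodes_;
    std::unordered_map<ed25519_pubkey, legacy_pubkey, pubkey_hash> by_ed_;
    std::unordered_map<x25519_pubkey, legacy_pubkey, pubkey_hash> by_x_;

  public:
    void insert(const sn_record_t& sn) {
        nodes_.insert_or_assign(sn.pubkey_legacy, sn);
        by_ed_.insert_or_assign(sn.pubkey_ed25519, sn.pubkey_legacy);
        by_x_.insert_or_assign(sn.pubkey_x25519, sn.pubkey_legacy);
    }

    // Each lookup returns nullptr when the key is unknown.
    const sn_record_t* find(const legacy_pubkey& pk) const {
        auto it = nodes_.find(pk);
        return it == nodes_.end() ? nullptr : &it->second;
    }
    const sn_record_t* find(const ed25519_pubkey& pk) const {
        auto it = by_ed_.find(pk);
        return it == by_ed_.end() ? nullptr : find(it->second);
    }
    const sn_record_t* find(const x25519_pubkey& pk) const {
        auto it = by_x_.find(pk);
        return it == by_x_.end() ? nullptr : find(it->second);
    }
};
```

Because the three key kinds are distinct types, passing an ed25519 key where a legacy key is expected simply selects a different overload (or fails to compile) rather than silently searching with the wrong representation.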
Ping reporting:
This completely rewrites how ping results are handled.
Current situation: SS does its own testing of other SS. Every 10s it picks some random node and
sends an HTTP ping and OMQ ping to it. If they fail either it tracks the failure and tries them
again every 10s. If they are still failing after 2h, then it finally tells oxend about the failure
one time, remembers that it told oxend (in memory), and then never tells it again until the remote
SS starts responding to pings again; once that happens it tells oxend that it is responding again
and then never tells it anything else unless the remote starts failing again for 2h.
This has some major shortcomings:
- The 10s repeat will hammer a node pretty hard with pings, because if it is down for a while, most
of the network will notice and be pinging it every 10s. That means 1600x2 incoming requests every
10s, which is pretty excessive.
- Oxend reporting edge case 1: If a node is bad and then storage server restarts, SS won't have it
in its "bad list" anymore, so it isn't testing it at all (until it gets selected randomly, but
with 1600 nodes that is going to be an average of more than 2 hours and could be way longer).
Thus oxend never gets the "all good" signal and will continue to think the node is bad for much,
much longer than it actually is. (In fact, it may *never* get a good signal again if it's working
the next time SS randomly pings it).
- Restarts the other way are also a problem: when oxend restarts it doesn't know of any of the bad
nodes anymore, but since SS only tells it once, it never learns about it and thinks it's good.
- `oxend print_sn <PUBKEY>` is much, much less useful than it could be: usually storage servers are
in "awaiting first result" status because SS won't tell oxend anything until it has decided there
is some failure. When it tells you "last ping was .... ago" that's also completely useless
because SS never reports any pings except for the first >2h bad result, and the first good result.
I suspect the reporting above was out of a concern that talking to oxend rpc too much would overload
it; that isn't the case anymore (and since SS is now using a persistent oxenmq connection the
requests are *extremely* fast since it doesn't even have to establish a connection).
So this PR overhauls it completely as follows:
- SS gets much dumber (i.e. simpler) w.r.t. pings: all it does is randomly pick probably-good nodes
to test every 10s, and then pings known-failing nodes to re-test them.
- Retested nodes don't get pounded every 10s, instead they get the first retry after 10s, the second
retry 20s after that, then 30s after that, and so on up to 5 minute intervals between re-tests.
- SS tells oxend *every* ping result that it gets (and doesn't track them, except to keep them or
remove them from the "bad" list)
- Oxend then becomes responsible for deciding when a SS is bad enough to fail proofs. On the oxend
side the rule is:
  - if we have been receiving bad ping reports about a node from SS for more than 1h5min without
    any good ping in that time *and* we received a bad ping in the past 10 minutes then we
    consider it bad. (the first condition is so that it has to have been bad for more than an
    hour, and the second condition is to ensure that SS is still sending us bad test results).
  - otherwise we consider it good (i.e. because either we aren't getting test results or because
    we're getting good test results).
- Thus oxend can usefully and accurately report the last time some storage server was tested,
which allows much better diagnostics of remote SN status.
- Thus if oxend restarts it'll start getting the bad results right away, and if SS restarts oxend
will stop getting them (and then fall back to "no info means good").
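
The retest schedule and the oxend-side rule described above can be sketched as follows. This is only an illustration of the stated rules: `retest_delay` and `ping_state` are hypothetical names, and the real code on either side may track this differently.

```cpp
#include <algorithm>
#include <chrono>
#include <optional>

using namespace std::chrono_literals;
using time_point = std::chrono::steady_clock::time_point;

// SS side: delay before retry number `n` (1-based) of a known-failing node:
// 10s, 20s, 30s, ... capped at 5 minutes.
constexpr std::chrono::seconds retest_delay(int n) {
    return std::min(std::chrono::seconds{10 * n}, std::chrono::seconds{300});
}

// oxend side: per-node state built from the ping reports SS sends.
struct ping_state {
    std::optional<time_point> first_bad;  // first bad report since the last good one
    std::optional<time_point> last_bad;   // most recent bad report

    void report(bool good, time_point now) {
        if (good) {
            first_bad.reset();
            last_bad.reset();
        } else {
            if (!first_bad)
                first_bad = now;
            last_bad = now;
        }
    }

    // Bad only if failures span more than 1h5min with no good ping in
    // between *and* a bad report arrived within the last 10 minutes.
    // No information at all counts as good.
    bool is_bad(time_point now) const {
        return first_bad && last_bad
            && now - *first_bad > 65min
            && now - *last_bad <= 10min;
    }
};
```

With this shape, a restarted oxend starts with empty state (treated as good) and converges again as soon as SS resumes sending reports.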
- loki_add_subdirectory was an unnecessary wrapper around
add_subdirectory: it was an attempt to make an idempotent version of
add_subdirectory, but that isn't needed at all and just adds cruft: the
top-level CMakeLists.txt already includes all the subdirectories so we
can just trust it.
- removed incorrect subdirectory "project()" definitions.
- crypto/CMakeLists.txt pointlessly listed all headers in the source
list.
- set c++ standard in the top-level makefile instead of on each target
since we intentionally want it everywhere.
- Linking directly to pthread/dl with conditional OS checks was wrong;
fix it to be proper cmake (linking to Threads::Threads and
${CMAKE_DL_LIBS}).
- Various cmake files erroneously listed their src directories in their
include paths.
- Made various library linkages PRIVATE instead of PUBLIC where a
transitive dependency to dependent targets does not make sense.
- updates lokimq to dev branch
- changes compilation mode to C++17 (which is now required by lokimq,
and already widely applied in lokinet and lokid dev branches)
- replace lokimq::string_view with std::string_view
- replace boost::optional with std::optional, except for:
- boost::optional<std::function<...>> doesn't need optional at all
because a std::function<...> is already nullable.
- boost::optional<T&> isn't supported by std::optional because it
makes little sense (it is just a `T*`) so just switch to T* instead.
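
The three replacement patterns above look roughly like this. The function names are hypothetical and exist only to show each case; none of them come from the codebase.

```cpp
#include <functional>
#include <optional>
#include <string>

// 1. boost::optional<T> -> std::optional<T>
inline std::optional<int> parse_port(const std::string& s) {
    if (s.empty() || s.size() > 5)
        return std::nullopt;
    int value = 0;
    for (char c : s) {
        if (c < '0' || c > '9')
            return std::nullopt;
        value = value * 10 + (c - '0');
    }
    if (value > 65535)
        return std::nullopt;
    return value;
}

// 2. boost::optional<std::function<...>> -> std::function<...>:
//    an empty std::function already means "no callback".
inline int call_or(const std::function<int()>& cb, int fallback) {
    return cb ? cb() : fallback;
}

// 3. boost::optional<T&> -> T*: nullptr means "no value".
inline void bump(int* counter) {
    if (counter)
        ++*counter;
}
```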
Headers aren't supposed to be listed in `add_library` calls and are an
unfortunately common cmake anti-pattern. (cmake does *not* need headers
listed to know how to check that things need rebuilding when a header
changes, which seems to be the reason people think they have to include
them).
Apparently this antipattern emerged partly because of buggy behaviour in
MSVC pre-2017 that didn't understand how to find headers when loading a
CMake project.
Upstream spdlog moves the instance and the absolute namespace qualifier
here breaks the code; this fixes it by going through the base class
which will work for both the bundled and newer spdlog versions.
Also changes the `fmt::memory_buffer` to `spdlog::memory_buf_t` because
the former doesn't work with newer libfmt versions (which upstream spdlog and
the devendored debian sid build now use).
The semicolon made the macro not a single statement (unlike the 3+
argument version), so code such as
    if (asdf)
        LOKI_LOG(critical, "whatever");
    else
        LOKI_LOG(critical, "something else");
was a syntax error because of the expanded double-semicolon which made
the "else" not follow a single-statement if and thus invalid.
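
A small sketch of the problem and one way to avoid it. `EXAMPLE_LOG` and `log_lines` are hypothetical stand-ins, not the real LOKI_LOG definition; the point is only that the macro body must not end in its own semicolon.

```cpp
#include <string>
#include <vector>

// Hypothetical sink that records messages so the example can be checked.
inline std::vector<std::string>& log_lines() {
    static std::vector<std::string> lines;
    return lines;
}

// Broken form: with the trailing semicolon, `BAD_LOG(x);` expands to two
// statements (the call plus an empty statement), which detaches a
// following `else` from its `if`.
// #define BAD_LOG(msg) log_lines().push_back(msg);

// Fixed form: no trailing semicolon, so the caller's own `;` terminates
// exactly one statement.
#define EXAMPLE_LOG(msg) log_lines().push_back(msg)

inline void report(bool ok) {
    if (ok)
        EXAMPLE_LOG("whatever");
    else
        EXAMPLE_LOG("something else");
}
```

For macros whose body needs several statements, wrapping the body in `do { ... } while (0)` gives the same single-statement behaviour.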
Use different data dir and seed nodes for testnet
Use testnet flag to change the size of valid pubkey length
Move to a global flag for testnet instead of a service node property. Fix compile issues and some other stuff; there will still be issues to fix
* Initial swarm bootstrapping from seed nodes
* Bootstrap the IPs when we have finished syncing, plus don't overwrite valid IPs with defaults
* Review fixes plus lint