Redesigns the database to be a more appropriate, less duplicative design
using "owners" and "messages" with a foreign key between them.
Rewrites all the database code using SQLiteCpp, which substantially
reduces the boilerplate and duplicated query-handling code.
Makes the statement handlers thread_local for better thread safety; this
also allows each query to be written at the point where it is executed,
rather than having all the prepared queries collected in one place far
from where they are actually used.
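The per-thread statement-cache pattern described above can be sketched as follows; `PreparedStmt` and `get_statement` are hypothetical stand-ins (the real code uses SQLiteCpp's `SQLite::Statement`), shown only to illustrate the thread_local lookup:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a prepared statement handle; in the real code
// this would wrap a SQLite::Statement from SQLiteCpp.
struct PreparedStmt {
    std::string query;
    int prepare_count = 0;
};

// Each thread lazily prepares and caches its own statements, keyed by the
// query text, so a statement can be declared right where it is executed.
inline PreparedStmt& get_statement(const std::string& query) {
    thread_local std::unordered_map<std::string, PreparedStmt> cache;
    auto [it, inserted] = cache.try_emplace(query, PreparedStmt{query});
    if (inserted)
        it->second.prepare_count++;  // the "prepare" happens only on first use
    return it->second;
}
```

Because the cache is thread_local, no locking is needed and each thread gets its own statement handles, which is what makes declaring the query at its point of use safe.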
- Eliminated storage::Item because it just duplicates message_t for no
useful reason.
- Always pass user pubkeys as user_pubkey_t, never as std::strings.
Pubkeys are not strings, and the same pubkey has multiple string
representations.
- Properly split up the pubkey as [type][pubkey] where type = byte (5
for Session) and pubkey is a type-dependent key (currently only
supporting 32-byte Ed25519 keys).
- Allow pubkeys to be loaded from either hex (66 character) or binary (33
  byte) values.
- Allow pubkey prefixes on testnet: a 32-byte key is interpreted as if it
  were prefixed with a 0 byte. Thus there is now just one "proper" pubkey
  size (33 bytes binary, 66 hex), with the missing prefix silently
  accepted on testnet (so we can do away with mainnet and testnet having
  different proper sizes).
- Expose pubkey stringification in different ways (prefixed/unprefixed,
hex/raw).
- Simplify time_point-to-integer conversion with new time.hpp functions
from_epoch_ms and to_epoch_ms.
- Restore old message hash format so that we can keep using it until the
mandatory transition point.
- Added a much more network-efficient serialization format, using
  standard bencoding rather than a NIH custom encoding. After the
  mandatory upgrade we can remove the old format.
- Refactor swarm bootstrap code to be more efficient
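The pubkey layout described above ([type byte][32-byte key], accepted as 66-char hex or 33-byte binary, with a missing prefix tolerated on testnet) could be parsed roughly like this; `parse_pubkey` and `user_pubkey` are illustrative names, not the actual SS API:

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

using user_pubkey = std::array<unsigned char, 33>;  // [type byte][32-byte key]

inline std::optional<user_pubkey> parse_pubkey(std::string_view in, bool testnet) {
    auto hex_val = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    user_pubkey pk{};  // zero-initialized, so an omitted prefix byte is 0
    if (in.size() == 66 || (testnet && in.size() == 64)) {  // hex input
        size_t offset = 33 - in.size() / 2;  // 1 when the prefix is omitted
        for (size_t i = 0; i < in.size() / 2; i++) {
            int hi = hex_val(in[2 * i]), lo = hex_val(in[2 * i + 1]);
            if (hi < 0 || lo < 0) return std::nullopt;
            pk[offset + i] = static_cast<unsigned char>(hi << 4 | lo);
        }
        return pk;
    }
    if (in.size() == 33 || (testnet && in.size() == 32)) {  // binary input
        size_t offset = 33 - in.size();
        for (size_t i = 0; i < in.size(); i++)
            pk[offset + i] = static_cast<unsigned char>(in[i]);
        return pk;
    }
    return std::nullopt;
}
```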
This is a nice C++ wrapper that cleans up the data interface
considerably over using the C sqlite3 API.
I also evaluated (and started implementing) sqlite_orm for this, but ran
into considerable obstacles: the ORM components get in the way without
being good enough to really solve anything (essentially they just make
you write queries in C++ code that is much less elegant than straight
queries), plus it fundamentally doesn't support threaded operation,
which sucks.
- Further documents each endpoint
- Changes the expiry update endpoints to return just the hashes of
  affected messages rather than hashes + timestamps, as it's quite a bit
  simpler (especially for signing).
- Implements delete_msgs/delete_before/expire_all/expire_msgs endpoint
  processing logic.
Exposes the rpc functions and adds a preliminary distribution between
SNs.
- Adds bt-to-json conversion and exposes the json-to-bt converter for use
  in converting incoming json requests to bt-encoding (for inter-swarm
  relaying), and back (for responding).
- Wire up database functions <-> service node calls <-> rpc endpoints.
- Add swarm command distribution.
(This is not yet working).
Refactors storage rpc requests to abstract parsing and then wires them
up so that all of direct, onion-requested, and omq can now take the same
codepaths for requests.
The new OMQ requests, in particular, are publicly accessible at
`storage.whatever` endpoints (e.g. storage.store).
This also relaxes/changes some of the argument parsing:
- allow the pubkey to be passed as `pubkey` (previously `pubKey` was
  required; it continues to work)
- `last_hash` (AKA `lastHash`) is no longer required; omitting it is now
  the same as providing an empty one.
Data is now stored as binary in the database; previously we were storing
the base64-encoded value received from the client. (This will, however,
break any client that expected to be able to send random data).
Also adds an `info` storage endpoint that returns the current version &
timestamp.
TODO: need a database migration to convert existing (base64) data. (We
need a db migration for other reasons, as well).
Currently we have an awkward storage of timestamp + ttl + expiry in the
database, and timestamp + ttl passed in from the client (as strings!).
All this is awkward because we want to be able to shorten the expiry, but
that would mess up the TTL. Additionally everything is stored as
`uint64_t`s, which is messy and not type-safe.
This commit makes these changes:
- ttl and timestamp can now be sent by the client as integers (in
addition to the current string value). NB: this is not reliable until
the entire SN network is on the next SS release.
- ttl, timestamp, and expiry are now type-safe std::chrono types instead
of raw integers (milliseconds for ttl and system_clock::time_points for
the other two).
- ttl is no longer stored: instead we just store timestamp + expiry.
(This will let us update expiry later without worrying about TTL).
- ttl/timestamp value validation is moved out of `common/oxen_common.h`
(which was a very odd place for it) and into request_handler.
- serialization no longer supports message_t; rather message_t is *only*
for holding the value the client gives us. SS now uses storage::Item
everywhere other than incoming client data.
- Removed TTL/Nonce storage and retrieval
- Fully specify the queries columns instead of using `SELECT *`
- Don't use "`" for identifier quoting inside sqlite. It's non-standard
MySQL garbage that sqlite3 supported only for MySQL compatibility.
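The from_epoch_ms/to_epoch_ms helpers and the timestamp + expiry storage described above amount to something like the following sketch (`expiry_for` is an illustrative name; the exact signatures in time.hpp may differ):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using system_time = std::chrono::system_clock::time_point;

// Convert an integer unix timestamp in milliseconds to a time_point.
inline system_time from_epoch_ms(int64_t ms) {
    return system_time{std::chrono::milliseconds{ms}};
}

// Convert a time_point back to integer unix milliseconds.
inline int64_t to_epoch_ms(system_time t) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
            t.time_since_epoch()).count();
}

// We store timestamp + expiry rather than timestamp + ttl, so expiry can
// later be shortened without worrying about keeping the TTL consistent.
inline system_time expiry_for(system_time timestamp, std::chrono::milliseconds ttl) {
    return timestamp + ttl;
}
```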
The approach being used here with an offset is painfully inefficient,
and has a race condition; this switches it to something better.
This also allows elimination of the ridiculous
"util::uniform_distribution_portable" call which didn't produce anything
portable at all.
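For context on why that call couldn't be portable: the standard fully specifies the mt19937_64 output sequence but leaves std::uniform_int_distribution implementation-defined. A genuinely portable, unbiased bounded draw has to do its own rejection sampling, roughly like this (an illustrative sketch, not the SS code):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// A portable uniform draw in [0, n): std::mt19937_64's output sequence is
// fully specified by the standard, but the standard *distributions* are
// not, so we do our own unbiased rejection sampling.
inline uint64_t uniform_portable(std::mt19937_64& rng, uint64_t n) {
    // Largest multiple of n that fits; values at or above it are rejected
    // so that every residue mod n is equally likely.
    const uint64_t limit = UINT64_MAX - UINT64_MAX % n;
    uint64_t x;
    do { x = rng(); } while (x >= limit);
    return x % n;
}
```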
- Show [filename:line] rather than __func__ because __func__ is useless
when called from a lambda (it just shows `operator()`).
- Add time since startup (like lokinet does)
- Fix up log formatter so that it doesn't have to double-format and
doesn't break when given a value that isn't a string literal for the
format string.
- Calling via ->log instead of ->debug, etc. requires changing
`OXEN_LOG(error, ...)` to `OXEN_LOG(err, ...)` to match the actual
spdlog log level (which is different from the logging method for some
reason).
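A [filename:line] prefix of this sort can be built from __FILE__/__LINE__ along these lines; `log_prefix` and `LOG_PREFIX` are hypothetical names for illustration, not the actual macro:

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Reduce a full __FILE__ path to just the filename, so log lines show
// [file.cpp:123] instead of __func__ (which is useless inside lambdas,
// where it only shows "operator()").
inline std::string_view log_basename(std::string_view path) {
    auto pos = path.find_last_of("/\\");
    return pos == std::string_view::npos ? path : path.substr(pos + 1);
}

// Hypothetical prefix builder a LOG macro could call:
inline std::string log_prefix(std::string_view file, int line) {
    return "[" + std::string{log_basename(file)} + ":" + std::to_string(line) + "]";
}

#define LOG_PREFIX() log_prefix(__FILE__, __LINE__)
```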
Database has a dependency on boost::asio so that it can set up a timer,
but this is awkward as it couples the Database class with an
implementation detail of the Database user.
Fix this by removing it, making the cleanup timer callback the
responsibility of the caller.
This also fixes some spurious failures due to race conditions between
the threads in the storage test code.
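A caller-side cleanup timer of the kind this implies might look like the following minimal sketch (`CleanupTimer` is a hypothetical name; the real caller presumably reuses an existing event loop rather than spawning a dedicated thread):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// The Database itself just exposes a cleanup function; running it
// periodically is the caller's responsibility.  A minimal caller-side timer:
class CleanupTimer {
    std::function<void()> job_;
    std::chrono::milliseconds interval_;
    std::mutex mtx_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread thread_;  // must be declared (and thus started) last
public:
    CleanupTimer(std::function<void()> job, std::chrono::milliseconds interval)
        : job_{std::move(job)}, interval_{interval},
          thread_{[this] {
              std::unique_lock lock{mtx_};
              // wait_for returns false on timeout (stop_ still false): run the job
              while (!cv_.wait_for(lock, interval_, [this] { return stop_; }))
                  job_();
          }} {}
    ~CleanupTimer() {
        { std::lock_guard lock{mtx_}; stop_ = true; }
        cv_.notify_all();
        thread_.join();
    }
};
```

Using a condition variable (rather than a plain sleep) lets the destructor stop the timer promptly, which avoids exactly the kind of shutdown races the test code was hitting.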
Refactoring:
- Overhaul how pubkeys are stored: rather than storing them as strings in multiple forms (hex,
  binary, base32z) we now store each pubkey type (legacy, ed25519, x25519) in a type-safe container
  which extends a std::array, thus giving us allocation-free storage for them.
- Do conversion into these types early on so that *most* of the code takes type-safe pubkeys
rather than strings in random formats; thus making the public API (e.g. HTTP request parser)
deal with encoding/decoding, rather than making it happen deep down. (For example,
ChannelEncryption shouldn't need to worry about decoding a hex string). This was pretty messy:
some code wanted hex, some base32z, some base32z with ".snode" appended, some base64.
- When printing pubkeys in logs, print them as hex, not base32z. Base32z was really hard to
reconcile with pubkeys because everywhere *else* we show them as hex.
- Overhaul sn_record_t with a much lighter one: it is now a very simple struct with just the members
it needs (ip, ports, pubkeys), and is no longer hashable (instead we can hash on the
.legacy_pubkey member).
- Simplify some interfaces taking multiple values from sn_record_t by just passing the sn_record_t
- Moved a bunch of things that were in the global namespace into the oxen namespace.
- Simplify swarm storage and lookup: we previously had various methods that did a linear scan on all
  active nodes, and could do string-based searching for different representations of the pubkey
(hex, base32z, etc.). Replace it all with a simple structure of:
unordered_map<legacy_pubkey, sn_record_t>
unordered_map<ed25519_pubkey, legacy_pubkey>
unordered_map<x25519_pubkey, legacy_pubkey>
where the first map holds the entries and the latter two point us to the key in the first one.
- De-templatize ChannelEncryption, and make it take pubkey types rather than strings. (The template
was doing nothing as it was only ever used with T=std::string).
- Fix a leak in ChannelEncryption CBC decryption if it throws (the context would leak) by storing
the context in a unique_ptr with a deleter that frees the context.
- Optimized ChannelEncryption encryption code somewhat by reducing allocations via more use of
string_views and tweaking how we build the encrypted strings.
- Fix legacy (i.e. Monero) signature generation: the random byte value being generated was only
  setting the first 11 of 32 bytes.
- Miscellaneous code cleanups (much of which are C++14/17 syntax).
- Moved std::literals namespace import to a small number of top-level headers so they are available
  everywhere. (std::literals is guaranteed to never conflict with user-space literals precisely so
  that doing this everywhere is perfectly safe without polluting anything -- i.e. "foo"sv and 44s
  can *never* be anything other than string_view and seconds literals).
- Made pubkey parsing (e.g. in HTTP headers) accept any of hex/base32z/base64 so that we can, at
some point in the future, just stop using base32z representation (since it conflicts badly with
lokinet .snode address which is based on the ed25519 pubkey, not the legacy pubkey).
- RateLimiter: simplify it to take the IP as a uint32_t (and thus avoid allocations). (Similarly it
  avoids allocations in the SN case via using the new pubkey type).
- Move some repeated tasks into the simpler oxenmq add_timer().
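The type-safe pubkey containers and the three-map swarm layout described above can be sketched like this; `pubkey_hash` and the simplified `sn_record` here are illustrative, not the actual definitions:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>

// Allocation-free pubkey types extending std::array (ed25519 and x25519
// types would look the same).
struct legacy_pubkey : std::array<unsigned char, 32> {};
struct ed25519_pubkey : std::array<unsigned char, 32> {};

// Pubkeys are already uniformly random bytes, so the first sizeof(size_t)
// bytes make a perfectly good hash.
template <typename PK>
struct pubkey_hash {
    size_t operator()(const PK& pk) const {
        size_t h;
        std::memcpy(&h, pk.data(), sizeof(h));
        return h;
    }
};

// Much simplified stand-in for the lighter sn_record_t:
struct sn_record {
    std::string ip;
    uint16_t port = 0;
    legacy_pubkey pubkey_legacy{};
    ed25519_pubkey pubkey_ed25519{};
};

// The primary map holds the records; secondary maps point back at the
// primary key (the x25519 index would mirror the ed25519 one).
using sn_map = std::unordered_map<legacy_pubkey, sn_record, pubkey_hash<legacy_pubkey>>;
using ed_index = std::unordered_map<ed25519_pubkey, legacy_pubkey, pubkey_hash<ed25519_pubkey>>;
```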
Ping reporting:
This completely rewrites how ping results are handled.
Current situation: SS does its own testing of other SS. Every 10s it picks some random node and
sends an HTTP ping and an OMQ ping to it. If either fails it tracks the failure and tries them
again every 10s. If they are still failing after 2h, then it finally tells oxend about the failure
one time, remembers that it told oxend (in memory), and then never tells it again until the remote
SS starts responding to pings again; once that happens it tells oxend that it is responding again
and then never tells it anything else unless the remote starts failing again for 2h.
This has some major shortcomings:
- The 10s repeat will hammer a node pretty hard with pings, because if it is down for a while, most
of the network will notice and be pinging it every 10s. That means 1600x2 incoming requests every
10s, which is pretty excessive.
- Oxend reporting edge case 1: If a node is bad and then storage server restarts, SS won't have it
in its "bad list" anymore, so it isn't testing it at all (until it gets selected randomly, but
with 1600 nodes that is going to be an average of more than 2 hours and could be way longer).
Thus oxend never gets the "all good" signal and will continue to think the node is bad for much,
much longer than it actually is. (In fact, it may *never* get a good signal again if it's working
the next time SS randomly pings it).
- Restarts the other way are also a problem: when oxend restarts it doesn't know of any of the bad
  nodes anymore, but since SS only told it once, oxend never learns about them and thinks they're good.
- `oxend print_sn <PUBKEY>` is much, much less useful than it could be: usually storage servers are
in "awaiting first result" status because SS won't tell oxend anything until it has decided there
is some failure. When it tells you "last ping was .... ago" that's also completely useless
because SS never reports any pings except for the first >2h bad result, and the first good result.
I suspect the reporting above was designed out of a concern that talking to oxend rpc too much would
overload it; that isn't the case anymore (and since SS now uses a persistent oxenmq connection the
requests are *extremely* fast since it doesn't even have to establish a connection).
So this PR overhauls it completely as follows:
- SS gets much dumber (i.e. simpler) w.r.t. pings: all it does is randomly pick probably-good nodes
to test every 10s, and then pings known-failing nodes to re-test them.
- Retested nodes don't get pounded every 10s, instead they get the first retry after 10s, the second
retry 20s after that, then 30s after that, and so on up to 5 minute intervals between re-tests.
- SS tells oxend *every* ping result that it gets (and doesn't track them, except to keep them or
remove them from the "bad" list)
- Oxend then becomes responsible for deciding when a SS is bad enough to fail proofs. On the oxend
side the rule is:
- if we have been receiving bad ping reports about a node from SS for more than 1h5min without
any good ping in that time *and* we received a bad ping in the past 10 minutes then we
consider it bad. (the first condition is so that it has to have been bad for more than an
hour, and the second condition is to ensure that SS is still sending us bad test results).
- otherwise we consider it good (i.e. because either we aren't getting test results or because
we're getting good test results).
- Thus oxend can usefully and accurately report the last time some storage server was tested,
  which allows much better diagnostics of remote SN status.
- Thus if oxend restarts it'll start getting the bad results right away, and if SS restarts oxend
will stop getting them (and then fall back to "no info means good").
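The retry backoff and the oxend-side badness rule described above reduce to two small functions; `retest_interval` and `is_bad` are illustrative names, and `first_bad` is assumed to be the start of the current unbroken run of bad reports (reset whenever a good ping arrives):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;

// Retest interval for a known-failing node: 10s, then 20s, 30s, ... capped
// at 5 minutes, instead of hammering it every 10s.
inline std::chrono::seconds retest_interval(int failures) {
    return std::min(failures * 10s, std::chrono::seconds{5min});
}

// Oxend-side rule sketch: a node is "bad" if bad reports have been coming
// in for over 1h5min with no good ping in that window, *and* a bad ping
// arrived within the last 10 minutes (i.e. SS is still actively reporting
// failures; otherwise "no info means good").
using time_point = std::chrono::steady_clock::time_point;
inline bool is_bad(time_point now, time_point first_bad, time_point last_bad) {
    return now - first_bad > 1h + 5min && now - last_bad < 10min;
}
```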
- loki_add_subdirectory was an unnecessary wrapper around
add_subdirectory: it was an attempt to make an idempotent version of
add_subdirectory, but that isn't needed at all and just adds cruft: the
top-level CMakeLists.txt already includes all the subdirectories so we
can just trust it.
- removed incorrect subdirectory "project()" definitions.
- crypto/CMakeLists.txt pointlessly listed all headers in the source
list.
- set the C++ standard in the top-level cmake file instead of on each
target since we intentionally want it everywhere.
- Linking directly to pthread/dl with conditional OS checks was wrong;
fix it to be proper cmake (linking to Threads::Threads and
${CMAKE_DL_LIBS}).
- Various cmake files erroneously listed their src directories in their
include paths.
- Made various library linkages PRIVATE instead of PUBLIC where a
transitive dependency for dependent targets does not make sense.
- updates lokimq to dev branch
- changes compilation mode to C++17 (which is now required by lokimq,
and already widely applied in lokinet and lokid dev branches)
- replace lokimq::string_view with std::string_view
- replace boost::optional with std::optional, except for:
- boost::optional<std::function<...>> doesn't need optional at all
because a std::function<...> is already nullable.
- boost::optional<T&> isn't supported by std::optional because it
makes little sense (it is just a `T*`) so just switch to T* instead.
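A tiny illustration of both replacement patterns (an empty std::function standing in for optional<function>, and a raw pointer standing in for optional<T&>):

```cpp
#include <cassert>
#include <functional>

// std::function is already nullable, so optional<function<...>> is redundant:
std::function<int(int)> callback;  // empty; no optional wrapper needed

// std::optional<T&> isn't supported; it would just be a T*, so use T* directly:
int value = 42;
int* maybe_ref = &value;  // instead of boost::optional<int&>
```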
Headers aren't supposed to be listed in `add_library` calls and are an
unfortunately common cmake anti-pattern. (cmake does *not* need headers
listed to know how to check that things need rebuilding when a header
changes, which seems to be the reason people think they have to include
them).
Apparently this antipattern emerged partly because of buggy behaviour in
MSVC pre-2017 that didn't understand how to find headers when loading a
CMake project.
* Add LOG macro and function name to logging
* Move common.h to common folder
* Rename all BOOST_LOG_TRIVIAL to LOG
* LOG -> LOKI_LOG and use boost::filesystem
* Don't log from worker thread
* Do filename in line