Storage server for Oxen Service Nodes
Jason Rhinelander 8d34f76002 Storage server refactoring & ping reporting redesign
Refactoring:

- Overhaul how pubkeys are stored: rather than storing them as strings in multiple forms (hex,
  binary, base32) we now store each pubkey type (legacy, ed25519, x25519) in a type-safe container
  which extends a std::array, thus giving us allocation-free storage for them.
  - Do conversion into these types early on so that *most* of the code takes type-safe pubkeys
    rather than strings in random formats; thus making the public API (e.g. HTTP request parser)
    deal with encoding/decoding, rather than making it happen deep down.  (For example,
    ChannelEncryption shouldn't need to worry about decoding a hex string).  This was pretty messy:
    some code wanted hex, some base32z, some base32z with ".snode" appended, some base64.
- When printing pubkeys in logs, print them as hex, not base32z.  Base32z was really hard to
  reconcile with pubkeys because everywhere *else* we show them as hex.
- Replace sn_record_t with a much lighter one: it is now a very simple struct with just the members
  it needs (ip, ports, pubkeys), and is no longer hashable (instead we can hash on the
  .legacy_pubkey member).
  - Simplify some interfaces taking multiple values from sn_record_t by just passing the sn_record_t
- Moved a bunch of things that were in the global namespace into the oxen namespace.
- Simplify swarm storage and lookup: we previously had various methods that did a linear scan on all
  active nodes, and could do string-based searching for different representations of the pubkey
  (hex, base32z, etc.).  Replace it all with a simple structure of:
    unordered_map<legacy_pubkey, sn_record_t>
    unordered_map<ed25519_pubkey, legacy_pubkey>
    unordered_map<x25519_pubkey, legacy_pubkey>
  where the first map holds the entries and the latter two point us to the key in the first one.
- De-templatize ChannelEncryption, and make it take pubkey types rather than strings.  (The template
  was doing nothing as it was only ever used with T=std::string).
- Fix a leak in ChannelEncryption CBC decryption if it throws (the context would leak) by storing
  the context in a unique_ptr with a deleter that frees the context.
- Optimized ChannelEncryption encryption code somewhat by reducing allocations via more use of
  string_views and tweaking how we build the encrypted strings.
- Fix legacy (i.e. Monero) signature generation: the random byte value being generated was only
  setting the first 11 bytes of 32.
- Miscellaneous code cleanups (many of which use C++14/17 syntax).
- Moved std::literals namespace import to a small number of top-level headers so they are available
  everywhere.  (std::literals is guaranteed to never conflict with user-space literals precisely so
  that doing this everywhere is perfectly safe without polluting -- i.e. "foo"sv, 44s can *never* be
  anything other than string_view and seconds literals).
- Made pubkey parsing (e.g. in HTTP headers) accept any of hex/base32z/base64 so that we can, at
  some point in the future, just stop using base32z representation (since it conflicts badly with
  lokinet .snode address which is based on the ed25519 pubkey, not the legacy pubkey).
- RateLimiter - simplify it to take the IP as a uint32_t (and thus avoid allocations).  (Similarly it
  avoids allocations in the SN case via using the new pubkey type).
- Move some repeated tasks into the simpler oxenmq add_timer().
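
The type-safe pubkeys and the three-map swarm lookup described above can be sketched roughly as
follows.  This is only an illustration, not the code from this commit: the names pubkey<Tag>,
pubkey_hash, node_record, node_index and filled() are hypothetical, and node_record is a simplified
stand-in for sn_record_t whose field names are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>

// Hypothetical tagged key type: 32 raw bytes held in a std::array (no heap allocation).
// The tag makes legacy/ed25519/x25519 keys distinct types that cannot be mixed up.
template <typename Tag>
struct pubkey : std::array<unsigned char, 32> {};

struct legacy_tag {};
struct ed25519_tag {};
struct x25519_tag {};
using legacy_pubkey = pubkey<legacy_tag>;
using ed25519_pubkey = pubkey<ed25519_tag>;
using x25519_pubkey = pubkey<x25519_tag>;

// Assumption: key bytes are already uniformly distributed, so the first
// sizeof(size_t) bytes are used directly as the hash value.
struct pubkey_hash {
    template <typename Tag>
    std::size_t operator()(const pubkey<Tag>& pk) const noexcept {
        std::size_t h;
        std::memcpy(&h, pk.data(), sizeof(h));
        return h;
    }
};

// Simplified stand-in for sn_record_t (field names are assumptions).
struct node_record {
    std::string ip;
    std::uint16_t port;
    legacy_pubkey legacy;
    ed25519_pubkey ed25519;
    x25519_pubkey x25519;
};

// The first map owns the records; the other two map the alternative keys
// back to the legacy key, so every lookup is a hash lookup, not a scan.
class node_index {
    std::unordered_map<legacy_pubkey, node_record, pubkey_hash> records_;
    std::unordered_map<ed25519_pubkey, legacy_pubkey, pubkey_hash> by_ed25519_;
    std::unordered_map<x25519_pubkey, legacy_pubkey, pubkey_hash> by_x25519_;

  public:
    void insert(const node_record& r) {
        records_.insert_or_assign(r.legacy, r);
        by_ed25519_.insert_or_assign(r.ed25519, r.legacy);
        by_x25519_.insert_or_assign(r.x25519, r.legacy);
    }

    // Each overload returns nullptr when the key is unknown.
    const node_record* find(const legacy_pubkey& pk) const {
        auto it = records_.find(pk);
        return it == records_.end() ? nullptr : &it->second;
    }
    const node_record* find(const ed25519_pubkey& pk) const {
        auto it = by_ed25519_.find(pk);
        return it == by_ed25519_.end() ? nullptr : find(it->second);
    }
    const node_record* find(const x25519_pubkey& pk) const {
        auto it = by_x25519_.find(pk);
        return it == by_x25519_.end() ? nullptr : find(it->second);
    }
};

// Test helper: a key with every byte set to b.
template <typename PK>
PK filled(unsigned char b) {
    PK pk;
    pk.fill(b);
    return pk;
}
```

Because the three key types are distinct, passing an x25519 key where a legacy key is expected
fails to compile instead of silently missing at runtime.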

Ping reporting:

This completely rewrites how ping results are handled.

Current situation: SS does its own testing of other SS.  Every 10s it picks some random node and
sends an HTTP ping and OMQ ping to it.  If either fails, it tracks the failure and tries them
again every 10s.  If they are still failing after 2h, then it finally tells oxend about the failure
one time, remembers that it told oxend (in memory), and then never tells it again until the remote
SS starts responding to pings again; once that happens it tells oxend that it is responding again
and then never tells it anything else unless the remote starts failing again for 2h.

This has some major shortcomings:
- The 10s repeat will hammer a node pretty hard with pings, because if it is down for a while, most
  of the network will notice and be pinging it every 10s.  That means 1600x2 incoming requests every
  10s, which is pretty excessive.
- Oxend reporting edge case 1: If a node is bad and then storage server restarts, SS won't have it
  in its "bad list" anymore, so it isn't testing it at all (until it gets selected randomly, but
  with 1600 nodes that is going to be an average of more than 2 hours and could be way longer).
  Thus oxend never gets the "all good" signal and will continue to think the node is bad for much,
  much longer than it actually is.  (In fact, it may *never* get a good signal again if it's working
  the next time SS randomly pings it).
- Restarts the other way are also a problem: when oxend restarts it doesn't know of any of the bad
  nodes anymore, but since SS only tells it once, it never learns about it and thinks it's good.
- `oxend print_sn <PUBKEY>` is much, much less useful than it could be: usually storage servers are
  in "awaiting first result" status because SS won't tell oxend anything until it has decided there
  is some failure.  When it tells you "last ping was .... ago" that's also completely useless
  because SS never reports any pings except for the first >2h bad result, and the first good result.
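
As a rough check of the numbers in the list above, here is the arithmetic written out.  It assumes
1600 nodes, two pings (HTTP and OMQ) per test, and that each SS picks one node uniformly at random
every 10s; the constant names are made up for this sketch.

```cpp
#include <cassert>

// Assumed network parameters, taken from the figures quoted above.
constexpr double nodes = 1600;
constexpr double pings_per_test = 2;      // one HTTP ping + one OMQ ping
constexpr double test_interval_sec = 10;

// If the whole network retests one down node every interval, that node
// receives nodes * pings_per_test requests per interval.
constexpr double incoming_requests_per_second =
        nodes * pings_per_test / test_interval_sec;

// With uniform random selection, a given node is picked on average once
// every `nodes` intervals by any single tester.
constexpr double mean_wait_hours = nodes * test_interval_sec / 3600.0;
```

Under these assumptions a down node sees about 320 requests per second, and a single SS takes
roughly 4.4 hours on average to randomly select one particular node.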

I suspect the reporting above was out of a concern that talking to oxend rpc too much would overload
it; that isn't the case anymore (and since SS is now using a persistent oxenmq connection the
requests are *extremely* fast since it doesn't even have to establish a connection).

So this PR overhauls it completely as follows:
- SS gets much dumber (i.e. simpler) w.r.t. pings: all it does is randomly pick probably-good nodes
  to test every 10s, and then pings known-failing nodes to re-test them.
- Retested nodes don't get pounded every 10s, instead they get the first retry after 10s, the second
  retry 20s after that, then 30s after that, and so on up to 5 minute intervals between re-tests.
- SS tells oxend *every* ping result that it gets (and doesn't track them, except to keep them or
  remove them from the "bad" list)
- Oxend then becomes responsible for deciding when a SS is bad enough to fail proofs.  On the oxend
  side the rule is:
    - if we have been receiving bad ping reports about a node from SS for more than 1h5min without
      any good ping in that time *and* we received a bad ping in the past 10 minutes then we
      consider it bad.  (the first condition is so that it has to have been bad for more than an
      hour, and the second condition is to ensure that SS is still sending us bad test results).
    - otherwise we consider it good (i.e. because either we aren't getting test results or because
      we're getting good test results).
    - Thus oxend can usefully and accurately report the last time some storage server was tested,
      which allows much better diagnostics of remote SN status.
- Thus if oxend restarts it'll start getting the bad results right away, and if SS restarts oxend
  will stop getting them (and then fall back to "no info means good").
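
The retry schedule and the oxend-side rule described above can be sketched like this.  It is a
minimal illustration under stated assumptions, not the code from either project: retry_delay() and
ping_history are hypothetical names, and a good ping is assumed to clear the whole run of bad
reports.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <optional>

using namespace std::chrono_literals;
using std::chrono::seconds;
using time_point = std::chrono::steady_clock::time_point;

// Delay before the n-th retest of a failing node (n = 1, 2, ...):
// 10s, then 20s, then 30s, ... capped at 5 minutes.
constexpr seconds retry_delay(int n) {
    return std::min<seconds>(10s * n, 5min);
}

// Hypothetical oxend-side bookkeeping for one remote storage server.
struct ping_history {
    std::optional<time_point> first_bad;  // start of the current unbroken run of bad reports
    std::optional<time_point> last_bad;   // most recent bad report

    void report(bool ok, time_point now) {
        if (ok) {
            // Assumption: any good ping ends the current run of failures.
            first_bad.reset();
            last_bad.reset();
        } else {
            if (!first_bad)
                first_bad = now;
            last_bad = now;
        }
    }

    // Bad only if failures have lasted more than 1h5min with no good ping
    // *and* a bad report arrived within the past 10 minutes; otherwise good
    // (including the "no info" case).
    bool considered_bad(time_point now) const {
        return first_bad && last_bad
            && now - *first_bad > 65min
            && now - *last_bad <= 10min;
    }
};
```

Note that a freshly constructed ping_history reports "good", which matches the "no info means good"
fallback after a restart.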
2021-04-18 14:50:40 -03:00
.github/ISSUE_TEMPLATE Update issue templates 2019-04-17 13:46:28 +10:00
.vscode check the difficulty every 10 mins 2019-06-06 17:13:17 +10:00
cmake Fix archive naming for tagged commits 2021-01-18 13:28:57 -04:00
common Storage server refactoring & ping reporting redesign 2021-04-18 14:50:40 -03:00
contrib Add drone CI 2021-01-13 15:20:23 -04:00
crypto Storage server refactoring & ping reporting redesign 2021-04-18 14:50:40 -03:00
httpserver Storage server refactoring & ping reporting redesign 2021-04-18 14:50:40 -03:00
storage Storage server refactoring & ping reporting redesign 2021-04-18 14:50:40 -03:00
unit_test Speed up unit tests for storage 2021-04-09 16:32:13 +10:00
utils Storage server refactoring & ping reporting redesign 2021-04-18 14:50:40 -03:00
vendors Use oxenmq 1.2.4's send_later() mechanism 2021-04-13 01:19:50 -03:00
.clang-format Run clang-format 2019-03-19 14:47:55 +11:00
.dockerignore docker support 2019-04-26 10:35:05 +00:00
.drone.jsonnet Add drone CI 2021-01-13 15:20:23 -04:00
.gitignore Ignore swapfiles and reject localhost binding 2019-07-25 13:12:44 +10:00
.gitmodules Replace embedded nlohmann with updated submodule 2021-01-05 17:27:32 -04:00
CMakeLists.txt No longer require POW for message storage 2021-03-26 12:01:47 +11:00
Dockerfile Allow http in onion requests to an external server 2021-03-29 17:28:34 +11:00
LICENSE Add MIT license 2018-11-16 02:04:02 +11:00
Makefile No longer require POW for message storage 2021-03-26 12:01:47 +11:00
mock_lokid.py Always strip the 05 from start of client request pubkeys 2019-03-28 15:33:41 +11:00
README.md Add static build capability to cmake 2021-01-13 15:19:22 -04:00

loki-storage-server

Storage server for Loki Service Nodes

Requirements:

  • Boost >= 1.66 (for boost.beast)
  • OpenSSL >= 1.1.1a (for X25519 curves)
  • sodium >= 1.0.17 (for ed25519 to curve25519 conversion)

You can, however, download and build static versions of these dependencies as part of the build by adding the -DBUILD_STATIC_DEPS=ON option to cmake.

You can use RelWithDebInfo instead of Release if you want to include debug symbols to provide developers with valuable core dumps from crashes. Also make sure you don't have a libzmq header older than 4.3.0 in /usr/local/include; if you do, please install a newer version.

git submodule update --init --recursive
mkdir build && cd build
cmake -DDISABLE_SNODE_SIGNATURE=OFF -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./loki-storage 0.0.0.0 8080

The paths for Boost and OpenSSL can be specified by exporting the variables in the terminal before running cmake:

export OPENSSL_ROOT_DIR=...
export BOOST_ROOT=...

Then using something like Postman (https://www.getpostman.com/) you can hit the API:

post data

HTTP POST http://127.0.0.1/store
body: "hello world"
headers:
- X-Loki-recipient: "mypubkey"
- X-Loki-ttl: "86400"
- X-Loki-timestamp: "1540860811000"
- X-Loki-pow-nonce: "xxxx..."

get data

HTTP GET http://127.0.0.1/retrieve
headers:
- X-Loki-recipient: "mypubkey"
- X-Loki-last-hash: "" (optional)

unit tests

mkdir build_test
cd build_test
cmake ../unit_test -DBOOST_ROOT="path to boost" -DOPENSSL_ROOT_DIR="path to openssl"
cmake --build .
./Test --log_level=all