oxen-storage-server/httpserver/oxend_rpc.h

#pragma once
#include "oxend_key.h"
#include <string_view>
#include <tuple>
#include <oxenmq/oxenmq.h>
#include <oxenmq/hex.h>
namespace oxen {
using oxend_seckeys = std::tuple<legacy_seckey, ed25519_seckey, x25519_seckey>;
// Synchronously retrieves the SN private keys from oxend via the given oxenmq address. This
// constructs a temporary OxenMQ instance to do the request (because the storage server will
// generally have to construct a new one once it has the private keys).
//
// Returns the legacy, ed25519, and x25519 private keys (in that order).
//
// This retries indefinitely until the connection & request succeed.
oxend_seckeys get_sn_privkeys(std::string_view oxend_rpc_address);
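
// Example (illustrative sketch only; the ipc address below is a hypothetical placeholder, pass
// whatever oxenmq address the local oxend is actually listening on):
//
//     auto [legacy_sk, ed25519_sk, x25519_sk] =
//         get_sn_privkeys("ipc:///var/lib/oxen/oxend.sock");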
}