I am posting some notes that give an overview of decentralized storage networks (DSNs). Before I go into the details of specific networks, let's explore some general features of DSNs.

Traditional blockchain networks could act as storage networks. A public ledger usually imposes additional structure on how bytes are organized, in the form of blocks and transactions. Transactions are the basic units that have to satisfy the blockchain's state transition function from one state to the next. Most blockchains allow users to put arbitrary bytes in a special data field of a transaction. However, these blockchains should not be considered storage networks because of their limited data throughput. I had previously written about blockchain throughput. Without compromising decentralization, a reasonable throughput for a blockchain is about 10 KB/s per shard. This space is very limited and expensive. It would require millions, if not billions, of dollars to store a movie on one of these blockchains, even if the chain is sharded.

The fundamental reason that a public ledger should not evolve into a storage network is that the security level required by a public ledger is far more expensive than what is necessary for storage. The data within a public ledger has to be validated by a securing resource. That securing resource could be proof of work, proof of stake, proof of authority, or any other scarce resource. In some way, the validity of the data has to be voted on by a limited resource; otherwise, if an arbitrary amount of voting power could be manufactured out of thin air, the public ledger would not be secure. (I will write a post on the theoretical fundamentals of blockchain security.) Furthermore, the ledger's data also has to be massively available. Data availability is inherently expensive. Replicating data requires physical resources in the form of hardware, bandwidth, real estate, and electricity. If a piece of data is replicated 5 times, it costs roughly 5x in physical resources. Public ledgers require data to be extremely resilient to failures. Bitcoin and Ethereum chain states are replicated tens of thousands of times, if not hundreds of thousands or even millions of times. Even a sharded chain will have hundreds of replicas at a minimum. This replication level would be impossibly expensive for petabytes of data, let alone exabytes or more. It is worth noting that AWS S3 charges $500,000 - $2,000,000 a year for 10 petabytes of data. It is safe to assume that data there has 3-10 replicas. AWS could be operating with a great margin, but it is hard to see how AWS could turn a profit if it replicated data hundreds of times over. A decentralized storage network should allow users to choose their own replication level.
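To make the replication cost concrete, here is a back-of-the-envelope sketch. The per-GB raw cost is an illustrative assumption (not a quoted AWS figure); the point is how linearly the bill scales with the replica count.

```python
# Back-of-the-envelope cost of replicated storage. The per-GB price is an
# illustrative assumption for raw per-replica cost, not a quoted figure.
PRICE_PER_GB_MONTH = 0.005  # assumed raw $/GB-month per replica

def annual_cost(petabytes: float, replicas: int) -> float:
    """Annual cost in dollars for storing `petabytes` with `replicas` copies."""
    gb = petabytes * 1_000_000
    return gb * replicas * PRICE_PER_GB_MONTH * 12

# 10 PB at 3 replicas (S3-like) vs. a blockchain-like 1,000 replicas.
print(f"3 replicas:    ${annual_cost(10, 3):,.0f}/year")
print(f"1000 replicas: ${annual_cost(10, 1000):,.0f}/year")
```

Under these assumptions, 3 replicas of 10 PB land in the low millions per year, while 1,000 replicas cost hundreds of millions, which is why a ledger-level replication factor is untenable for bulk storage.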

Database networks could be loosely categorized into three topologies. In roundabout ways, I have already discussed two of them. The first topology is a single-machine DB, where the DB is replicated on every full node of the network. The current versions of Bitcoin and Ethereum are of this variety. The second is a sharded DB, with each shard replicated by a group of nodes. Polkadot and Eth2 are examples of this type. The third expects heterogeneous storage nodes, where each node stores a unique set of data segments. One finds this topology in almost all large-scale database deployments: S3, Hadoop File System, Cassandra, HBase, Druid, Elasticsearch, etc. A decentralized storage network will likely take this form, with a large number of storage nodes each storing a unique set of data segments.

A decentralized storage network needs to have a mechanism to coordinate incentives and storage proofs. Storage providers have to be incentivized to participate in the network. Storage clients or the network have to be able to ask storage providers for storage proofs. This boils down to the need to coordinate storage proofs and payments. A blockchain, either native or external, is a natural choice to coordinate these interactions if decentralization is required.

In the next sections, I will review some popular DSNs. I am not aiming for a full description of these networks. My goal is to provide a succinct overview of the key features so I can discuss what I like and don't like about the different networks.

IPFS

IPFS is often mentioned as the leading solution for decentralized storage. IPFS can be used for storing and accessing data. However, it does not have an incentive mechanism. There is no guarantee that IPFS nodes will continue to keep the data or provide access. IPFS is a network of nodes joined together to form a distributed hash table (DHT). IPFS has additional structures, such as InterPlanetary Linked Data (IPLD) and Merkle directed acyclic graphs (DAGs), that enable a variety of use cases to be built on top of its network. IPFS alone cannot act as a data storage solution.
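The core idea behind the DHT is content addressing: data is stored under a key derived from the data itself. A toy sketch (real IPFS uses multihash-based CIDs, not a bare SHA-256 hex digest, and the table is distributed across peers):

```python
import hashlib

# Toy illustration of content addressing: the lookup key is derived from the
# data, so any node can verify a retrieved block without trusting the peer
# that served it.
def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()  # real IPFS uses multihash CIDs

dht = {}  # local stand-in for the distributed hash table

block = b"hello decentralized storage"
cid = content_id(block)
dht[cid] = block

retrieved = dht[cid]
assert content_id(retrieved) == cid  # self-verifying lookup
```

This self-verification is why IPFS needs no trusted servers for integrity, yet it says nothing about whether anyone keeps the block around, which is exactly the missing incentive piece.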

Filecoin

The Filecoin network uses a native blockchain to enforce storage deals. The blockchain is similar to the Bitcoin or Ethereum chain. Each block is composed of messages, which are analogous to transactions. A block is produced by a miner selected by the storage power consensus mechanism. Storage power consensus is similar to proof of stake, but instead of using stake to gain a higher probability of becoming a block producer, the mechanism favors storage providers that have more proven storage capacity and store more validated data. The contents of messages record storage deals, which are negotiated between storage clients and providers. The negotiation can happen over any communication channel. The completed deal is published on-chain.

Storage miners are incentivized to submit storage proofs because doing so increases their probability of earning block rewards. The submitted proofs are included in the blocks. Miners encode the data in a process known as sealing. The slow encoding ensures that miners cannot re-compute the encoding on demand when an audit is requested. A SNARK over the encoded data is used as the replication proof to reduce the proof size. This is known as proof of replication. The miner is expected to submit additional proofs of possession of this encoded data at regular intervals. It is incentive-compatible for miners to keep the encoded data on local drives, because otherwise they would not be able to regenerate it or fetch it from remote storage in time. This is known as proof of spacetime.
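The sequentiality argument behind slow sealing can be sketched in a few lines. This is not Filecoin's actual sealing (which uses stacked DRG encodings and SNARKs); it only illustrates why a long sequential computation deters on-demand recomputation:

```python
import hashlib

# Sketch of "slow" sealing: encoding is a long sequential hash chain, so a
# miner cannot parallelize its way to a fresh encoding when an audit arrives.
# Filecoin's real sealing is far more involved; this shows only the
# sequentiality property.
def seal(data: bytes, iterations: int) -> bytes:
    h = data
    for _ in range(iterations):          # inherently sequential:
        h = hashlib.sha256(h).digest()   # each step depends on the last
    return h

sealed = seal(b"raw sector data", 100_000)
# A miner who discarded the sealed copy must redo all 100,000 sequential
# steps before answering an audit -- too slow to go unnoticed.
```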

It is worth pointing out that the Filecoin blockchain is essentially a single-shard blockchain. The stored data is not on-chain, but the administrative messages are. A blockchain has limited throughput; see blockchain throughput for more details. The chain snapshot as of 2021/09/05 was already at 600 GB. It is not hard to see that the chain's daily throughput will saturate quickly if usage increases. Regardless of how much storage the total miner pool could support, the bottleneck would be how many storage deals the blockchain's throughput could support. This is not all that different from the total transaction throughput of a general-purpose blockchain such as Ethereum. I will discuss this more toward the end of the post.

Arweave

Arweave is designed to offer permanent storage. One of the key features of Arweave is that it does not chain the blocks together as a singly linked list. Instead, it introduces the concept of recall blocks to form what it calls a blockweave. Each block contains a reference to a random previous block. Having access to those recall blocks is what the mining nodes compete on. A mining node can only produce a new block if it stores the specific recall block, determined by the randomness denoted by the new block's hash. The more recall blocks the mining nodes store, the more probable it is that they can produce new blocks. The network also provides access to metadata in the form of a block hash list and a wallet list. These summarizing data allow transactions to be verified without reprocessing historical blocks.
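The recall-block mechanism can be sketched as a hash-to-index mapping. Arweave's actual derivation differs in its details; this only shows how the candidate block's randomness picks which historical block a miner must hold:

```python
import hashlib

# Sketch of recall-block selection: the candidate block's hash, taken modulo
# the current weave height, picks which historical block the miner must store
# in order to mine. (Arweave's real derivation differs in detail.)
def recall_index(new_block_hash: bytes, weave_height: int) -> int:
    return int.from_bytes(new_block_hash, "big") % weave_height

h = hashlib.sha256(b"candidate block").digest()
idx = recall_index(h, weave_height=1_000_000)
# Only miners storing block `idx` can complete this block -- storing more of
# the weave raises the chance of holding whichever block gets recalled.
```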

The blockweave acts as both the incentive layer and the data storage layer. The blocks in the weave record information about account state and rewards, and, crucially, the blocks also store arbitrary data. The permanence of the blockweave equates to data permanence.

Arweave's alternative approach to block linkage offers some unique advantages. First, it allows for heterogeneous storage nodes, which is a requirement for a storage network. This allows the network to scale linearly because nodes are not required to replicate the exact same data. Second, it combines the storage and incentive layers into a single data structure. Third, the use of the block hash list and wallet list allows a new node to join the network quickly. This eliminates the need for a light client implementation.

Arweave incentivizes nodes to provide access to historical blocks through a reputation system known as Wildfire. It ranks nodes by how frequently and how quickly they respond to data requests. Nodes that do not regularly participate could get blacklisted.

The blockweave's claim of permanent storage might not hold once the weave grows large, even if the network is sufficiently decentralized. There is no formal guarantee that all blocks will remain available. Nodes are incentivized to store as many blocks as possible, but equally, they want the blocks they hold to be as rare as possible. This could lead to suboptimal social behavior. First, some rare blocks could be lost if there are not sufficient replicas in the network. This could be mitigated by adjusting the token economics so that a certain number of copies is probabilistically guaranteed under optimal conditions. Even then, if the replica count is in the single digits, it is not hard to see some data segments getting lost from time to time. Second, miners are incentivized to hold on to rare private blocks so that they become the only ones able to mine certain blocks. While this is mitigated by the reputation system, some miners with a short horizon would risk their reputation to maximize short-term profit and exit the market. The centralization of these blocks would increase the probability that they are lost permanently.
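A quick probability sketch shows why single-digit replica counts are worrying at scale. The per-replica loss probability below is an illustrative assumption, and replicas are treated as failing independently:

```python
# If each of r replicas independently disappears with probability p over some
# period (node churn, disk failure), the chance the block is lost entirely is
# p**r. The value of p below is an illustrative assumption.
def loss_probability(p: float, replicas: int) -> float:
    return p ** replicas

for r in (2, 5, 10):
    print(f"{r} replicas: {loss_probability(0.1, r):.1e} chance of total loss")
# With billions of blocks in the weave, even a tiny per-block loss probability
# implies that some rare blocks vanish over time.
```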

The availability of the metadata could be a weak point in the system. The network makes an implicit assumption that the block hash list and wallet list are valid and accessible. These metadata have to be kept up-to-date by all the nodes in the network, and they act much like a blockchain's global state. This leads to two problems. First, if the wallet list grows large, say to 10 billion entries, the metadata would be hard to propagate within the network. It would become increasingly hard to gossip the updated state across the network, and it could take days for a newly joined node to download a copy. Second, it is not fully specified how the network guarantees the integrity of these metadata. In theory, the metadata could be recalculated from all the previous blocks. But as the weave's total size grows to the exabyte level, it would be prohibitively expensive for any one party to recompute all the blocks. An attacker could concentrate its resources on gossiping malicious block hashes to corrupt the weave, or on gossiping falsified wallet lists for direct monetary gain.

Sia

Sia uses a blockchain to manage file storage contracts, storage proofs, and payments. The contract negotiation is off-chain, but the completed contracts are recorded as on-chain transactions. Transactions are similar to Bitcoin transactions but with some extensions allowing contract creation, storage proof, and contract update messages. As with Filecoin's use of the blockchain as the coordinating device for a contract's storage proofs and payments, the blockchain's throughput is limited.

Once a storage contract is published, the network requests storage proofs at regular intervals. A storage proof is a randomly requested segment of the original file along with the corresponding Merkle branch. Because the proof requires the provider to submit an actual data segment that walks up the Merkle tree, the data segment has to be at least as large as the smallest Merkle leaf. This leads to a tradeoff: if the leaves of the Merkle tree are small data segments, the tree is large; otherwise, the data segments are large. Either way, it is easy to see that for 10 GB of storage, the proof could easily take more than 10 KB. The chain has to put these bytes into blocks, further reducing the space available for contracts.
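The tradeoff above can be quantified with a simple model: a proof is one leaf segment plus one 32-byte hash per tree level. The leaf sizes below are illustrative, not Sia's actual parameters:

```python
import math

# Model of the proof-size tradeoff: proof = leaf segment + one 32-byte hash
# per Merkle tree level. Smaller leaves mean a taller tree (more hashes);
# larger leaves mean shipping more raw data. Leaf sizes are assumptions.
HASH_SIZE = 32

def proof_size(file_bytes: int, leaf_bytes: int) -> int:
    leaves = math.ceil(file_bytes / leaf_bytes)
    depth = math.ceil(math.log2(leaves))
    return leaf_bytes + depth * HASH_SIZE

ten_gb = 10 * 2**30
for leaf in (64, 4096, 65536):
    print(f"leaf {leaf:>6} B -> proof {proof_size(ten_gb, leaf):>6} B")
```

With tiny 64-byte leaves the proof stays under 1 KB but the tree has hundreds of millions of leaves; with 64 KB leaves the proof itself exceeds 64 KB. Either the tree or the proof ends up paying.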

Payments go through payment channels and are therefore primarily off-chain. Payment channels eliminate a lot of on-chain transactions, but they also create a dependency: storage clients have to stay connected to the network. This discourages individual users from storing data directly on the network. Instead, users would most likely go through a service provider, because they do not want to perform the work of regular audits and payments. This flaw is not critical, because even a one-person application development shop could maintain a long-running service to keep payment channels up-to-date. The frequency of interacting with the network is small, the same frequency at which data possession proofs are accepted on-chain. As long as the software is open source and it is easy for third-party vendors to set up this service according to an open protocol, it is an acceptable requirement for a DSN.

Safe Network

Instead of using a blockchain to maintain global state, the Safe Network segments the network into local groups by XOR distance, the same metric used in Kademlia, and each group manages its own state. The key managing nodes within each group assign and re-assign data based on the group's membership. The technical overview describes the network as built on top of a Kademlia-like DHT. The key feature that the Safe Network innovates on seems to be its ability to guarantee data availability as nodes join and exit the network. But it is not exactly clear how the network could self-heal.
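For readers unfamiliar with Kademlia, the XOR metric is simple: the distance between two IDs is their bitwise XOR interpreted as an integer. The IDs below are made-up values for illustration:

```python
# The Kademlia XOR metric: distance between two node/data IDs is their
# bitwise XOR read as an integer. Nodes whose IDs are XOR-close to a piece
# of data (and to each other) form the group responsible for it.
def xor_distance(a: bytes, b: bytes) -> int:
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

node = bytes.fromhex("a3f1")
peers = [bytes.fromhex(h) for h in ("a3f0", "ffff", "a311", "0000")]
closest = min(peers, key=lambda p: xor_distance(node, p))
assert closest == bytes.fromhex("a3f0")  # differs only in the lowest bit
```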

The network explicitly excludes the use of a blockchain to maintain global state. This begs the question of how the network provides incentives to its participants. As best I can understand from the available documentation, the Safe Network uses MaidSafeCoin, which still exists as a token on Bitcoin's Omni Layer. I could not understand how the coin would be part of the incentive mechanism of the storage network, other than being used as direct payment to a node or a group of nodes. Such payments would happen outside of the Safe Network.

The Safe Network's proposal looks a lot like an improved version of BitTorrent or IPFS. While it offers a level of data persistence guarantee that IPFS does not, the Safe Network does not fully incentivize storage providers to participate. The idea of not requiring a global state is appealing, but the use of local state must extend beyond data management and incorporate incentives. A BitTorrent-like storage network can only be autonomous and self-healing if each participating node is paid to stay online. Data storage requires physical resources, and someone has to pay for those physical resources.

Storj

Storj does not use a blockchain to maintain state. Published storage contracts are maintained in the private databases of satellites, which are centrally operated services. A satellite manages accounts, API credentials, billing, payments, audits, repairs, and various administrative tasks. Both storage clients and providers make API requests to a satellite to publish and maintain storage contracts. Accounts from different satellites do not interact with each other.

Clients and providers can negotiate storage contracts through any communication channel. The negotiations are not recorded, but the parties publish the contracts to a satellite. The satellite accepts the contract and subsequently services it by requesting storage proofs and handling payments. The storage proof takes the form of a challenge mechanism. The data owner builds a Merkle tree with secret salts at the leaves. The storage provider has to calculate a Merkle branch when a secret salt is revealed to it. The provider can only successfully answer the challenge if it has the data to combine with the salt to produce the correct hash corresponding to the Merkle branch.
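The salted-challenge idea can be sketched with a flat list of commitments. The real scheme commits the salted leaves into a Merkle tree so the satellite only keeps the root; a flat list keeps this sketch short:

```python
import hashlib
import os

# Sketch of a salted audit: the client pre-commits hashes of (salt + segment).
# To audit, it reveals one salt; only a provider still holding the segment can
# reproduce the committed hash. (The real scheme uses a Merkle tree over the
# salted leaves; a flat commitment list is a simplification.)
segment = b"stored data segment"

# Client, at upload time: generate salts and commit their salted hashes.
salts = [os.urandom(16) for _ in range(4)]
commitments = [hashlib.sha256(s + segment).digest() for s in salts]

# Audit: reveal salt #2; the provider answers from its stored copy.
revealed = salts[2]
provider_answer = hashlib.sha256(revealed + segment).digest()
assert provider_answer == commitments[2]  # proof of continued possession
```

Because each salt is used once, the provider cannot precompute answers and discard the data.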

The client library has built-in tooling to encrypt data and segment it by way of erasure encoding. Both encryption and erasure encoding are enforced by the network.
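As a minimal illustration of the erasure-coding idea, here is a k-plus-one-parity scheme that survives the loss of any single shard. Storj's client actually uses Reed-Solomon codes, which tolerate multiple losses; XOR parity is just the simplest instance:

```python
from functools import reduce

# Minimal erasure-coding illustration: k data shards plus one XOR parity
# shard survive the loss of any single shard. Real clients use Reed-Solomon,
# which tolerates multiple simultaneous losses.
def make_parity(shards: list[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shards)

data_shards = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(data_shards)

# Lose shard 1; rebuild it by XOR-ing the parity with the survivors.
recovered = make_parity([data_shards[0], data_shards[2], parity])
assert recovered == b"efgh"
```

Erasure coding gets durability at a fraction of full replication's cost: here, tolerating one loss costs 33% overhead instead of 100%.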

Storj only accepts the Storj coin as a payment mechanism. It is a token on the Ethereum network. Ethereum is only used to facilitate payments; the EVM is not used to store state or coordinate contracts.

Storj uses satellites to maintain state instead of a public ledger. A satellite has full control over the accounts it manages. If a satellite experiences a technical failure, the associated accounts cease to function. In most cases, the storage clients would not be able to recover their data because they could not find the corresponding storage providers. A satellite is a trusted entity; for example, there is a list of trusted satellites in operation today. While storage nodes are decentralized, the network relies on centrally operated satellites to coordinate clients and providers. Beyond centralization risk, the scalability of the network also hinges on the satellites' ability to scale their services. The Storj network's overall capacity largely depends on the satellites' capacity to handle accounts and storage contracts.

Discussions

The promise of decentralization can easily be eroded in storage networks. While some storage solutions use decentralized networks as storage backends, their services depend on centrally operated services. The first example is Storj's use of satellites to manage accounts, storage proofs, and payments. The second example is IPFS and pinning services, e.g. Pinata. The service provider stores data on IPFS and guarantees data persistence by running IPFS nodes and pinning user data on those private servers. The availability of users' data is only guaranteed if Pinata can successfully maintain the uptime of its IPFS nodes. The third example is Filecoin and easy-to-use storage gateways, e.g. Textile Hub and Web3.storage. The stored data is ultimately routed to the Filecoin network, but these services manage accounts, package the data, and negotiate with storage miners. The fourth example is Sia's Skynet, a portal built on top of the Sia network. Skynet gives users and application developers a superior user experience, but it also means that data goes through an intermediary before landing on the decentralized network.

The promise of permanent storage might not be desirable. Arweave and the Safe Network proclaim that data will be kept forever on their networks. Despite the many claims that storage has been getting cheaper, storage is expensive. An average American household cannot afford to store petabytes of data. An average company cannot afford petabytes without extracting at least tens of thousands of dollars annually from those data. One might point to the Bitcoin and Ethereum networks as examples of permanent storage, but the amount of storage there is extremely limited: 1 GB of storage on those networks costs tens to hundreds of millions of dollars, depending on market conditions. A network can keep its permanence guarantee only if the value of its token economics always outpaces the actual cost of physical resources. That is not a stable equilibrium, nor is it desirable. In a sufficiently large, stable storage network, the marginal benefit a node gains should be the marginal value it provides in storage. The overall market design should account for the costs of continuing to pay the storage providers to keep exabytes of data around month after month, and year after year. When a large amount of data becomes useless and no one is willing to pay for it, the world should not waste resources keeping it around.
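The "tens to hundreds of millions per GB" figure for Ethereum follows from the gas schedule. The gas price and ETH price below are illustrative assumptions; storing a new 32-byte word via SSTORE costs 20,000 gas:

```python
# Rough cost of storing 1 GB directly in Ethereum contract storage.
# Gas price and ETH price are illustrative assumptions; SSTORE of a new
# 32-byte word costs 20,000 gas under the current gas schedule.
GAS_PER_WORD = 20_000
WORD_BYTES = 32

def eth_storage_cost(gigabytes: float, gwei_per_gas: float,
                     usd_per_eth: float) -> float:
    words = gigabytes * 2**30 / WORD_BYTES
    gas = words * GAS_PER_WORD
    eth = gas * gwei_per_gas * 1e-9  # gwei -> ETH
    return eth * usd_per_eth

# Roughly $100M at an assumed 50 gwei and $3,000/ETH -- consistent with
# "tens to hundreds of millions" depending on market conditions.
print(f"${eth_storage_cost(1, 50, 3000):,.0f}")
```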

Storage contracts require a consensus mechanism, and a blockchain is the most obvious solution. However, none of these networks has thought through how to scale the blockchain part of the network. It is fortunate that blockchain scalability has gotten a lot of attention due to the rapid development of general-purpose blockchains, such as Ethereum, Polkadot, Near, and Solana. There have been major advances in recent years. The most mature solution is to shard a blockchain. One of the key limits of sharded blockchains is the amount of cross-shard communication allowed. Storage contracts do not need to communicate across shards, which makes a sharded solution especially well suited as the coordinating layer of a DSN. Furthermore, securing additional shards does not require recruiting additional validators if we adopt storage capacity as the securing instrument.

A storage network requires a mechanism for validating storage proofs to maintain data availability. Storage providers submit proofs at regular intervals, but the requests could come from either storage clients or the network. For example, Sia uses off-chain storage proofs and payment channels to maintain records of data possession. There are obvious advantages to not requiring storage clients to perform any ongoing maintenance work. The network could require clients to lock up sufficient funds to pay for the full storage duration, and pay providers in proportion to the number of storage proofs accepted.
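The lock-up idea can be sketched as a simple escrow settlement. The function name and flow are my own illustration, not any network's actual protocol: the client escrows the full contract value up front, the provider is paid pro rata per accepted proof, and the remainder is refunded.

```python
# Illustrative escrow settlement (not any specific network's protocol):
# the client locks up the full contract value, the provider earns a pro-rata
# share per accepted storage proof, and the client gets the rest back.
def settle(escrow: float, proofs_expected: int,
           proofs_accepted: int) -> tuple[float, float]:
    """Return (payout_to_provider, refund_to_client)."""
    proofs_accepted = min(proofs_accepted, proofs_expected)
    payout = escrow * proofs_accepted / proofs_expected
    return payout, escrow - payout

# 12 monthly proofs expected over a 1-year deal; provider missed two audits.
payout, refund = settle(escrow=120.0, proofs_expected=12, proofs_accepted=10)
assert (payout, refund) == (100.0, 20.0)
```

This removes any liveness requirement on the client: once the funds are locked, the network alone audits and disburses.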

There are some DSN features that could be implemented in client libraries. For example, Sia and Storj enforce data replication and encryption, and Filecoin enforces replication. I tend to believe that the storage network should leave those features as client options. A decentralized network is already complex. Storage clients could choose how their data is encrypted, erasure coded, or replicated. There are applications that want to store unencrypted data, and there are applications that do not want to pay the extra cost of replication. Furthermore, these features could be built as client libraries compatible with multiple networks. Keeping these features out of the core network simplifies its design.


