This document collects a few thoughts and considerations on selecting a suitable hash function for Torfs. This hash function will be used to identify unique files and to verify downloaded data, possibly incrementally.
The main advantage of a Merkle tree is interoperability: when using a Merkle tree with a predefined leaf block size, the root hash of a file is the same regardless of the block size that a client chooses for the intermediate hash nodes it keeps around for verification.
Hash lists, as used by BitTorrent, do not have this property: the same file can end up with different torrent hashes if clients choose different chunk sizes.
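To make the difference concrete, here is a minimal sketch in Python. It is not how Torfs or any existing client does it: SHA-256 is only a stand-in hash, the 1024-byte leaf size is an arbitrary choice, and all function names are hypothetical. The `\x00` and `\x01` prefixes for leaves and internal nodes follow the TTH approach.

```python
import hashlib

LEAF_SIZE = 1024  # assumed fixed leaf block size, in bytes


def h(data: bytes) -> bytes:
    # SHA-256 is a stand-in; any cryptographic hash would do for this sketch.
    return hashlib.sha256(data).digest()


def combine(level: list[bytes]) -> bytes:
    """Reduce a list of node hashes (left to right) to a single root hash."""
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(h(b"\x01" + level[i] + level[i + 1]))
            else:
                nxt.append(level[i])  # odd node out is promoted unchanged
        level = nxt
    return level[0]


def merkle_root(data: bytes) -> bytes:
    """Root hash over fixed-size leaves (an empty input is one empty leaf)."""
    leaves = [h(b"\x00" + data[i:i + LEAF_SIZE])
              for i in range(0, max(len(data), 1), LEAF_SIZE)]
    return combine(leaves)


def stored_nodes(data: bytes, block_size: int) -> list[bytes]:
    """Intermediate nodes a client might keep, one per block_size bytes.

    block_size must be LEAF_SIZE times a power of two.
    """
    return [merkle_root(data[i:i + block_size])
            for i in range(0, max(len(data), 1), block_size)]


def hash_list_id(data: bytes, chunk_size: int) -> bytes:
    """Hash-list style identifier: hash of the concatenated chunk hashes."""
    chunks = [h(data[i:i + chunk_size])
              for i in range(0, len(data), chunk_size)]
    return h(b"".join(chunks))


data = bytes(range(256)) * 37  # 9472 bytes, so the last leaf is partial
root = merkle_root(data)

# Clients keeping 2 KiB or 8 KiB nodes still arrive at the same root.
print(combine(stored_nodes(data, 2048)) == root)
print(combine(stored_nodes(data, 8192)) == root)

# The hash-list identifier changes with the chunk size.
print(hash_list_id(data, 2048) == hash_list_id(data, 4096))
```

The first two comparisons hold because every stored node is itself a node of the full tree, so the choice of `block_size` only decides how much of the tree is kept, not what the root is.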
I think TTH (Tiger Tree Hash) is the most widely used hash tree for identifying files, but it has its problems.
Security: The Tiger hash function is not used much outside of its application in TTH and P2P networks. Tiger does not have as much published cryptanalysis as some of its alternatives. Some weaknesses have been found over the years.
Performance: There are faster hash functions with similar or better security guarantees. The tree hashing mode used by TTH also takes a small performance hit from the 1-byte \x00 prefix on data blocks, which prevents aligned mmap-based file hashing.
The ADC project has shown interest in switching to a different hash. They’ve been hoping that the SHA-3 standardization would include a hash tree construction, but that hasn’t happened so far.
The BLAKE2 paper describes a tree hashing mode. BLAKE2 itself is slowly gaining widespread use, but this tree hashing mode has not been standardized as part of RFC 7693. I’m worried that not all BLAKE2 implementations will support it (or, if they do, that it will not have been tested as thoroughly). Apart from that, it seems like a perfect match for Torfs.
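As an illustration of what the tree mode looks like in an implementation that does support it, here is a small sketch using the tree parameters of Python's `hashlib.blake2b`. It only handles a fixed two-leaf, depth-2 tree. The parameter values (fanout 2, 4096-byte leaves, 64-byte inner hashes, 32-byte root) and the function name are assumptions for the example, not a proposal for Torfs.

```python
from hashlib import blake2b

# Illustrative tree parameters, chosen for this sketch only.
FANOUT = 2
DEPTH = 2
LEAF_SIZE = 4096
INNER_SIZE = 64


def blake2_two_leaf_root(data: bytes) -> bytes:
    """Root hash of a two-leaf BLAKE2b tree (hypothetical helper)."""
    if not LEAF_SIZE < len(data) <= 2 * LEAF_SIZE:
        raise ValueError("this sketch only handles exactly two leaves")

    # Parameters shared by every node in the tree.
    common = dict(fanout=FANOUT, depth=DEPTH,
                  leaf_size=LEAF_SIZE, inner_size=INNER_SIZE)

    # Leaves sit at node_depth 0; the rightmost one is flagged as last.
    left = blake2b(data[:LEAF_SIZE], digest_size=INNER_SIZE,
                   node_offset=0, node_depth=0, last_node=False, **common)
    right = blake2b(data[LEAF_SIZE:], digest_size=INNER_SIZE,
                    node_offset=1, node_depth=0, last_node=True, **common)

    # The root sits at node_depth 1 and hashes the two leaf digests.
    root = blake2b(digest_size=32,
                   node_offset=0, node_depth=1, last_node=True, **common)
    root.update(left.digest())
    root.update(right.digest())
    return root.digest()


example = bytes(6000)
print(blake2_two_leaf_root(example).hex())
```

Because the position of each node is part of its parameter block, no prefix byte has to be prepended to the data, which is what makes this mode attractive compared to the TTH construction.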
KangarooTwelve is a later development of Keccak (SHA-3) and includes a tree hashing mode. I’ve not looked at it long enough to see if this is suitable for Torfs. There don’t seem to be many implementations.
The Dat project makes use of Merkle trees. The overall ideas are described in DEP-0002, but I’m missing a description of how file contents are mapped to leaf nodes. Judging from the whitepaper, it seems that file contents are split into chunks and stored in a log structure. A file is described by its position and length in the log, rather than by a root hash that is unique to that file, so this is not something that we can use.
It is interesting to note that, although Dat uses the BLAKE2 hash function, they do not make use of its tree hashing feature, but instead use a slight variation on the 1-byte prefix approach used in TTH.
Multihash seems useful, but I’m not sure we need anything that fancy. If our file hashes aren’t going to be compatible with existing systems anyway, I have a slight preference for base32 or base58.
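For a sense of what a plain base32 file hash would look like, here is a short sketch using only the Python standard library. The 32-byte BLAKE2b digest, the lowercase form and the stripped padding are all assumptions for illustration; base58 is left out because the standard library does not provide it.

```python
import base64
import hashlib


def encode_hash(digest: bytes) -> str:
    """Base32-encode a digest, lowercase and without '=' padding."""
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


def decode_hash(text: str) -> bytes:
    """Inverse of encode_hash: restore the padding, then decode."""
    padded = text.upper() + "=" * (-len(text) % 8)
    return base64.b32decode(padded)


digest = hashlib.blake2b(b"example file contents", digest_size=32).digest()
encoded = encode_hash(digest)
print(encoded, len(encoded))
```

A 32-byte digest comes out as 52 characters this way, with no additional prefix describing the hash function or digest length.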