Deduplicating a 10.4 TiB game preservation archive (WIP)
Hi folks,
I am working on a game preservation project, where the data set holds 10.4 TiB. It contains 1044 earlier versions of a single game in a multitude of different languages, architectures and stages of development. As you can guess, that means extreme redundancy.
The goals are: - bring the size down - retain good read speed (for further processing/reversing) - easy sharable format - lower end machines can use it
My choice fell on the BTRFS filesystem, since it provides advanced features for deduplication, which is not as resource hungry as ZFS. Once the data is processed, it no longer requires a lot of system resources.
In the first round of deduplication, I used "jdupes -rQL" (yes, I know what -Q does) to replace exact copies of files in different directories via hardlinks to minimize data and metadata. This got it down to roughly 874 GiB already, out of which 866 GiB are MPQ files. That's 99,08%... everything besides is a drop in the bucket.
For those uninitiated: this is an archive format. Representing it as a pseudo-code struct it looks something like this { header, files[], hash_table[], block_table[] } Compression exists, but it is applied to each file individually. This means the same file is compressed the same way in different MPQ archives, no matter the offset it happens to be in.
What is throwing a wrench into my plans of further data deduplication are the following points: - the order of files seems not to be deterministic when MPQ files were created (at least I picked that up somewhere) - altered order of elements (files added or removed at the start) causes shifts in file offsets
I thought for quite some time about this, and I think the smartest way forward is, that I manually hack apart the file into multiple extents at specific offsets. Thus the file would contain of an extent for: - the header - each file individually - the hash table - the block table It will increase the size for each file of course, because of wasted space at the end of the last block in each extent. But it allows for sharing whole extents between different archives (and extracted files of it), as long as the file within is content-wise the same, no matter the exact offset. The second round of deduplication will then be whole extents via duperemove, which should cut down the size dramatically once more.
This is where I am hanging right now: I don't know how to pull it off on a technical level. I already was crawling through documentation, googling, asking ChatGPT and fighting it's hallucinations, but so far I wasn't very successful in finding leads (probably need to perform some ioctl calls).
From what I imagine, there are probably two ways to do this: - rewrite the file with a new name in the intended extent layout, delete the original and rename the new one to take it's place - rewrite the extent layout of an already existing file, without bending over backwards like described above
I need is a reliable way to, without chances of the filesystem optimizing away my intended layout, while I write it. The best case scenario for a solution would be a call, which takes a file/inode and a list of offsets, and then reorganizes it into that extents. If something like this does not exist, neither through btrfs-progs, nor other third party applications, I would be up for writing a generic utility like described above. It would enable me to solve my problem, and others to write their own custom dedicated deduplicaton software for their specific scenario.
If YOU - can guide me into the right direction - give me hints how to solve this - tell me about the right btrfs communities where I can talk about it - brainstorm ideas I would be eternally grateful :) This is not a call for YOU to solve my problem, but for some guidance, so I can do it on my own.
I think that BTRFS is superb for deduplicated archives, and it can really shine, if you can give it a helping hand.
It looks like your solutions so far expose the result as a plain, accessible filesystem. If you can relax this constraint you might have an easier time -
"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.
Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest,
Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.
I was also thinking about extracting the MPQ on-disk e.g. "foo.mpq" -> ".foo.mpq.extracted/..." (which - to my understanding - would lead to a lot of deduplication even just with jdupes). N.B. depending on the requirements with regards to read performance, you could design a FUSE filesystem that would transparently reconstruct the MPQ on-the-fly.
Of course, I guess all of this would be much slower than being able to use inherent btrfs features as the author of the post wishes, but at least it would be independent from btrfs :)
prying apart the MPQ file into it's parts and writing them down to disk is at the moment probably the strongest contender for my solution. it will cause the parts to be block aligned, be able to be hardlinked and cut down on metadata and improve performance (same inode when hardlinked). only thing i need in this case is to have a script to reassamble those extracted ones into the original MPQ archives, which have to match byte-by-byte to the original content ofc. extracting them into it's distinct parts also allows to access the contents directly, if so desired, without needing to extract them on demand (some people wanna look up assets in specific versions). these distinct parts can then additionally be deduplicated on block level as well.
I was curious about MPQ format and found this tool: https://github.com/eagleflo/mpyq
Maybe you can use it to decompress some files and assess how much disk space you can save using deduplication at filesystem level.
If it's worth the effort (1 mean, going from 1TiB to 100-200 Gib) I would consider coding the reassembly part. It can be done by a "script" fist, then can be "promoted" to a FUSE filesystem if needed.
you are absolutely on point - i would prefer having a real filesystem with deduplication (not compression), which offers data in a compact form, with good read speed for further processing.
i was already brainstorming of writing a custom purpose-built archive format, which would allow me to have more fine grained control over how i can lay out data and reference it. the thing is that this archive is most likely not absolutely final (additional versions being added) - having a plain filesystem allows for easier adding of new entries. an archive file might have to be rewritten.
if i go the route of custom archive, i can in theory write a virtual filesystem for it to access it read only like it would be a real filesystem... and if i design it properly, maybe even write it.
still would prefer to use a btrfs filesystem tbh ^^ will brainstorm a bit more over the next days - thanks for your input!
This is good thinking, but I think you are basically describing a Restic/Borg respository :)
- Deduplication? Check.
- Compact format? Check.
- Good read speed? Yep. (Proportional to backing store.)
- Custom purpose-built? Yeah, that's what backup programs are for.
- Custom data layout? Check. (Rabin/BuzHash content-defined chunking, SHA256 dedupe.)
- Adding additional versions? Yes. ("Incremental backups"— See above.)
- Virtual filesystem access? `restic mount`/`borg mount`.
The same storage system is used by some non-backup software too, Perkeep and Seafile
thanks for mentioning those 2 projects, will check them out over the holidays and do some experimenting ^^
If the MPQ files are not too big and they match across languages, architectures and stages of development (e.g. same file names), the simplest solution is just to compress related files together. Let the compression algorithms figure out the symbols of your alphabet (eventually they will match your compressed files within MPQ archives, or their components). Restoring a specific version of the game then becomes "extract this list of files from archives and place them at these locations."
Random idea (I haven't tested it): If you need to figure out how to cluster together related MPQ files, you could compute merkle trees (or multiple merkle trees shifted by prime number offsets to identify common subsequences at arbitrary offsets), and use selected hashes for similarity search.
even though I do not like it too much, i think i will have to pry apart the MPQ files into it's distinct parts, and write them down to the filesystem individually (and then deleting the original file) - basically what i wanted to do with the extents, but instead have them as distinct files. this then can be reversed via script to assemble the original archive file on demand to get a byte-by-byte equal file again. writing the parts down to the filesystem will cause the parts to be properly block aligned, and be able to be hardlinked, if they exist multiple times on the filesystem - this cuts down on metadata even more and also boosts performance when doing block/extent deduplication, since a single inode is only processed once in most proper deduplication programs.
the MPQ files range from a few MiB to around 2.5 GiB. since the access should be rather fast, pairing them as an archive file is not an option for me.
thanks about the hint of the merkle trees, i will read up on what that is... always good to know about different approaches to a problem :)
I'm not sure about how it works in WoW but I can give you a warning about what happens in StarCraft. To prevent other people from editing StarCraft maps (which are MPQs), users would intentionally mangle the MPQ format in just the right way so that the game could still play it but other tools could not open it for editing. So, if there is anything like that going on in WoW world then it might be very hard to reassemble the original MPQs and get a byte for byte match.
shouldn't be the case, and i would implement a proper verify when exploding the MPQ file, by running a reassamble and hash comparison right afterwards.
Instead of dupremove, bees may be more appropriate for your goals.
https://github.com/Zygo/bees
Bees does things at a lower level than just reflinking full files like dupremove (it dedups at the level of extents). Be sure to read the links in the "Recommended Reading" section, of the above link, to hopefully avoid issues.
duperemove has "--dedupe-options=partial" which also enables this, not just full extents. the issue still is, that the data within the archive is not block aligned, thus preventing me from deduplicating them properly
It seems that the filesystem-level deduplication approach with BTRFS conflicts with your goal of having an easily-shareable format. A filesystem is not an ideal archive format, especially one that is (~)only compatible with Linux systems. Also, over a long enough time horizon, filesystems become obsolete and difficult to mount.
Secondly, it seems like it is going to take contortions to utilize BTRFS deduplication with MPQ archives, hacking archives apart into separate extents, fighting with wasted space, etc. You're talking about low level filesystem manipulations to optimize it for this peculiar use case. And then you would need custom tooling anyway to be able to re-constitute the MPQ files from this filesystem of exploded fragments.
As a simpler, filesystem-agnostic approach, I would move all of the file data into a large content-addressable storage blob, which is by construction deduplicated. Then, strip the file data out of the MPQ archives and leave pointers into this blob. The process to reconstitute an archive is straightforward.
This approach does not rely on filesystem-specific deduplication, is portable to other platforms, and largely avoids the problem of wasted space due to filesystem structures. However, it would be a non-standard archive format, and you would have to make sure your scripts are well-documented enough that they will be usable in the future.
In more detail, to prepare your collection, you would write a script to do these steps:
1. Enumerate each file in each MPQ archive and compute the SHA-256 hash of its data.
2. If this hash has never been seen before, concatenate the file to your large storage blob. Record the hash => (offset, length) mapping to an index file.
3. Strip the file data out of the MPQ archive, leaving behind the SHA-256 hash as pointer you will use later to look up the (offset, length) in the index and recover the file.
4. Record the SHA-256 hash of the entire original MPQ archive, so you can later verify you've recreated it without loss of data integrity.
To reverse the process, you just need to pull pieces out of the storage blob and reconstitute the original MPQ archive. Check the hash of the recreated file against the original.
The stripped archives should be of negligible size and no further attempt to compress or deduplicate them is necessary. That said, you should try to compress your storage blob with something like the Zstandard Seekable Format. This will work best if you decompress the file data before hashing/storing it, and then re-compress it later when reconstituting the MPQ archive.
You just described git-annex.
a BTRFS subvolume can be exported as a snapshot, and this can be passed around rather easily. i know the concern mounting BTRFS, but there are windows drivers for that as well, and you can also mount it via WSL nowadays, to have proper linux tooling.
the approach of having a custom storage blob with pointer references which is something i consider as well, will play around with that during the holidays and do some experimenting. thanks for your input.
There are open source libraries for interacting with MPQ, e.g., https://github.com/mbroemme/mpq-tools, if you’re not already using one.
thanks for the link - i will write my extraction code though, since the format is very simplistic, and it gives me fine grained control over how things are done. appreciate your help though!
I'm surprised to see MPQ come up. What game is it?
World of Warcraft - from the earliest (publicly leaked) Alpha up and including the very last Wrath of the Lich King version. Other people are already working on getting backups of the next 2 expansions working again to add those client versions to the archive as well.
Lookup FICLONERANGE ioctl
this is either a very big coincidence, or you are in the datamining discord as well. the original archive i base my project on uses RMAN to store everything :D --- thanks for the hint about the FICLONERANGE ioctl... it seems to be fine grained enough to allow me deduplicate on arbitrary offsets, not just whole blocks. will give it a go.
Btrfs is something i originally wanted to use but other people were not fans of linux so custom ad-hoc tooling (RMAN) it was.
i tried FICLONERANGE via a python wrapper btw - it turns out, that i can only clone ranges aligned to block boundaries :(
BTRFS is very neat per se, but documentation and help (most of all in very niche cases like this one here lol) is not that easy to come by. my plan would be to properly process the data set, and then make it available as a BTRFS snapshot... you can export btrfs send as a file as well for storage etc.
if all my tries to use BTRFS fail, i might to write my own tooling and virtual filesytem as well, but optimized for my use case (MPQ files and such). thanks for your input so far.
Options:
- `zstd --train` a dictionary over all the files, then compress it all using the dictionary. Decompress on demand. Make sure you don't lose the dictionary, or you'll lose all of it.
- OR stick it into a Restic/Borg/Kopia/Deduplicacy repository. These use a rolling hash to deduplicate any subsections of files that are the same, before applying compression. You can try it with `--dry-run` first to see how much space you would save. Use their `mount` command to access the files as a directory.
I would not be surprised if it gets the total size down close-ish to the size of only a single copy of the game.
See the other comment about `precomp` if compression gives you issues.
Personally I would just go with the `restic` route, as that's what it's meant for.
will probably write the MPQ blobs down to disk and deduplicate via hardlinks and additionally on block level. i don't know about restic (or borg, which was also recommended), but i will read up on it and doe some tests with it, regardless, since this seems to be a very nice tool for a lot of problem scenarios. thanks for the input!
Deduplication is a poor archival strategy. Storage is cheap. 10TiB is a couple of thousand dollars with reasonable redundancy.
Write Once is how to manage an archive.
Displaying a curated subset is a good interface. Good luck.
this was not helpful at all, and i think you also did not read the goals of this project
I assumed the goal was archiving games for preservation.
If the goal is algorithmic erasure of data rather than preservation of games, then “archive” might again create confusion like the type I probably have.
If there is a strong business case for deduplication, I recommend hiring a consultant with expertise and experience in the problem.
To be clear search is the only way to identify redundancy. If you have search, then redundancy is not a problem.
Often these archive projects have the goal of propagating the archive across many different systems, for which reducing size is very valuable. This is basically a compression exercise in the case where there are many large duplicated blocks.
This is correct - the main goal is to have this rather compact, but still having good read times. This will allow me to store it on my new 4 TiB NVMe drive. A lot of iterative scanning will happen, because I search for interesting information, which helps reverse engineering. Also it allows me to share this with other people over the internet before I kick the bucket in a few decades... transferring 10.4 TiB would be rather boring :D
Thanks, that makes sense.
The question began, I am working on a game preservation project.
I think the confusion here (for me too) is what precisely is the purpose/meaning of "bring the size down?"
aka, is the is a "make it easier to search" question or "we need to take up fewer bytes" question (which, perhaps, used to be the same question, but aren't necessarily now?)
it's a question of both "i can store this on my 4 TiB NVMe disk" and "i can share this with other people over the internet in a timely manner"
I think it's "I want people to make copies of this"