Hyperspace (hypercritical.co) - submitted by tobr 8 hours ago
  • Take8435 2 hours ago

    Downloaded. Ran it. Tells me "900" files can be cleaned. No summary, no list. But I was at least asked to buy the app. Why would I buy the app if I have no idea if it'll help?

    • crb 2 hours ago

      From the FAQ:

      > If some eligible files were found, the amount of disk space that can be reclaimed is shown next to the “Potential Savings” label. To proceed any further, you will have to make a purchase. Once the app’s full functionality is unlocked, a “Review Files” button will become available after a successful scan. This will open the Review Window.

      I half remember this being discussed on ATP; the logic being that if you have the list of files, you will just go and de-dupe them yourself.

      • AyyEye an hour ago

        > the logic being that if you have the list of files, you will just go and de-dupe them yourself.

        If you can do that, you can check for duplicates yourself anyway. It's not like there aren't already dozens of great apps that dedupe.

      • elicash 8 minutes ago

        It didn't tell you how much disk space? It's supposed to. It only told you the number of files?

        • eps 2 hours ago

          This reminds me -

          Back in the MS-DOS days, when RAM was scarce, there was a class of so-called "memory optimization" programs. They all inevitably found at least a few KB to reclaim through their magic, even when the same optimizer was run back to back with itself and allowed to "optimize" things. That is, on each run they would always find extra memory to free. They ultimately did nothing but claim they did the work. Must've sold pretty well nonetheless.

          • tomnipotent 34 minutes ago

            I remember using MemTurbo in the Windows 2000 era, though now I know it was mostly smoke and mirrors. My biggest gripe these days is too many "hardware accelerated" apps eating away VRAM, which is less of a problem with Windows (better over-commit) but which causes me a few crashes a month on KDE.

        • bob1029 7 hours ago

          > There is no way for Hyperspace to cooperate with all other applications and macOS itself to coordinate a “safe” time for those files to be replaced, nor is there a way for Hyperspace to forcibly take exclusive control of those files.

          This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background. Presumably, it is at a level of abstraction where it could safely manage these concerns. What could be the downsides of having this happen automatically within APFS?

          • a-dub a minute ago

            a content addressed block store with pointers and skiplists for file continuity would be kinda neat.

            • taneliv 7 hours ago

              On ZFS it consumes a lot of RAM. In part I think this is because ZFS does it on the block level, and has to keep track of a lot of blocks to compare against when a new one is written out. It might be easier on resources if implemented on the file level. Not sure if the implementation would be simpler or more complex.

              It might also be a little unintuitive that modifying one byte of a large file would result in a lot of disk activity, as the file system would need to duplicate the file again.

              • abrookewood 3 hours ago

                In regards to the second point, this isn't correct for ZFS: "If several files contain the same pieces (blocks) of data or any other pool data occurs more than once in the pool, ZFS stores just one copy of it. Instead of storing many copies of a book it stores one copy and an arbitrary number of pointers to that one copy." [0]. So changing one byte of a large file will not suddenly result in writing the whole file to disk again.

                [0] https://www.truenas.com/docs/references/zfsdeduplication/

                • btilly 2 hours ago

                  This applies to modifying a byte. But inserting a byte will change every block from then on, and will force a rewrite.

                  Of course, that is true of most filesystems.
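
                  A toy illustration of the fixed-block effect, with a deliberately tiny block size:

                    # Fixed-size blocks: a one-byte insert shifts everything after it,
                    # so every subsequent block's content (and hence its hash) changes.
                    BLOCK = 4  # unrealistically small, for illustration

                    def blocks(data: bytes) -> list[bytes]:
                        return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

                    a = b"abcdefghijkl"
                    b = a[:1] + b"X" + a[1:]  # insert a single byte near the start

                    print(blocks(a))  # [b'abcd', b'efgh', b'ijkl']
                    print(blocks(b))  # [b'aXbc', b'defg', b'hijk', b'l'] -- every block differs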

                  • karparov 2 hours ago

                    Not the whole file but it would duplicate the block. GP didn't claim that the whole file is copied.

                  • gmueckl 7 hours ago

                    Files are always represented as lists of blocks or block spans within a file system. Individual blocks could in theory be partially shared between files, at the complexity cost of a reference counter for each block. So changing a single byte in a copy-on-write file could take the same time regardless of file size, because only the affected block would have to be duplicated. I don't know at all how macOS implements this copy-on-write scheme, though.

                    • MBCook 6 hours ago

                      APFS is a copy on write filesystem if you use the right APIs, so it does what you describe but only for entire files.

                      I believe as soon as you change a single byte you get a complete copy that’s your own.

                      And that’s how this program works. It finds perfect duplicates and then effectively deletes and replaces them with a copy of the existing file so in the background there’s only one copy of the bits on the disk.
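
                      For illustration, a minimal sketch of that replace-with-a-clone step; this is an assumption about the mechanism, not Hyperspace's actual code (macOS's cp -c copies via clonefile(2)), and it skips the metadata preservation and safety checks the app is described as doing:

                        import os
                        import subprocess

                        def replace_with_clone(original: str, duplicate: str) -> None:
                            # Make an APFS clone of `original` next to `duplicate`, then
                            # atomically swap it into place. Same bytes, no new data blocks.
                            tmp = duplicate + ".clone-tmp"  # hypothetical temp name
                            subprocess.run(["cp", "-c", original, tmp], check=True)  # clonefile(2)
                            os.rename(tmp, duplicate)  # atomic replace on the same volume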

                      • mintplant 5 hours ago

                        I suppose this means that you could find yourself unexpectedly out of disk space in unintuitive ways, if you're only trying to change one byte in a cloned file but there isn't enough space to copy its entire contents?

                        • MBCook 3 hours ago

                          I’m not sure if it works on a file or block level for CoW, but yes.

                          However APFS gives you a number of space related foot-guns if you want. You can overcommit partitions, for example.

                          It also means if you have 30 GB of files on disk that could take up anywhere from a few hundred K to 30 GB of actual data depending on how many dupes you have.

                          It’s a crazy world, but it provides some nice features.

                          • pansa777 4 hours ago

                            It doesn't work like you think. If you change one byte of a duplicated file, only that "byte" is changed on disk (a "byte" in quotes because technically it's not a byte but a block).

                            As far as I understand, it works like the reflink feature in modern Linux filesystems. If so, that's really cool, and also a bit better than ZFS's snapshots. I'm a newbie on macOS, but it looks amazing.

                          • alwillis 2 hours ago

                            That’s not how this works. Nothing is deleted. It creates zero-space clones of existing files.

                            https://en.wikipedia.org/wiki/Apple_File_System?wprov=sfti1#...

                            • tonyedgecombe 3 hours ago

                              > I believe as soon as you change a single byte you get a complete copy that’s your own.

                              I think it stores a delta:

                              https://en.m.wikipedia.org/wiki/Apple_File_System#Clones

                          • amzin 6 hours ago

                            Is there a FS that keeps only diffs in clone files? It would be neat

                            • rappatic 6 hours ago

                              I wondered that too.

                              If we only have two files, A and its duplicate B with some changes as a diff, this works pretty well. Even if the user deletes A, the OS could just apply the diff to the file on disk, unlink A, and assign B to that file.

                              But if we have A and two different diffs B1 and B2, then try to delete A, it gets a little murkier. Either you do the above process and recalculate the diff for B2 to make it a diff of B1; or you keep the original A floating around on disk, not linked to any file.

                              Similarly, if you try to modify A, you'd need to recalculate the diffs for all the duplicates. Alternatively, you could do version tracking and have the duplicate's diffs be on a specific version of A. Then every file would have a chain of diffs stretching back to the original content of the file. Complex but could be useful.

                              It's certainly an interesting concept but might be more trouble than it's worth.

                              • abrookewood 3 hours ago

                                ZFS does this by de-duplicating at the block level, not the file level. It means you can do what you want without needing to keep track of a chain of differences between files. Note that de-duplication on ZFS has had issues in the past, so there is definitely a trade-off. A newer version of de-duplication sounds interesting, but I don't have any experience with it: https://www.truenas.com/docs/references/zfsdeduplication/

                              • alwillis 2 hours ago

                                That’s how APFS works; it uses delta extents for tracking differences in clones: https://en.wikipedia.org/wiki/Delta_encoding?wprov=sfti1#Var...

                                • abrookewood 3 hours ago

                                  ZFS: "The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space." (emphasis added)

                                  https://www.truenas.com/docs/references/zfsdeduplication/

                                  • UltraSane 5 hours ago

                                    VAST storage does something like this. Unlike most storage arrays, which identify identical blocks by hash and store each block only once, VAST uses a content-aware hash, so hashes of similar blocks are also similar. They store a reference block for each unique hash, and when new data comes in and is hashed, the most similar reference block is used as the base for byte-level deltas. In practice this works extremely well.

                                    https://www.vastdata.com/blog/breaking-data-reduction-trade-...

                                • ted_dunning 7 hours ago

                                  This is commonly done with compression on block storage devices. That fails, of course, if the file system is encrypting the blocks it sends down to the device.

                                  Doing deduplication at this level is nice because you can dedupe across file systems. If you have, say, a thousand systems that all have the same OS files you can save vats of storage. Many times, the only differences will be system specific configurations like host keys and hostnames. No single filesystem could recognize this commonality.

                                  This fails when the deduplication causes you to have fewer replicas of files with intense usage. To take the previous example, if you boot all thousand machines at the same time, you will have a prodigious I/O load on the kernel images.

                                  • albertzeyer 7 hours ago

                                    > This got me wondering why the filesystem itself doesn't run a similar kind of deduplication process in the background.

                                    I think that ZFS actually does this. https://www.truenas.com/docs/references/zfsdeduplication/

                                    • pmarreck 5 hours ago

                                      It's considered an "expensive" configuration that is only good for certain use-cases, though, due to its memory requirements.

                                      • abrookewood 3 hours ago

                                        Yes true, but that page also covers some recent improvements to de-duplication that might assist.

                                        • pmarreck 10 minutes ago

                                          Really? I haven't looked at this ZFS feature in a few years so I will take a look

                                          EDIT: Is this referring to the "fast" dedup feature?

                                    • p_ing 7 hours ago

                                      Windows Server does this for NTFS and ReFS volumes. I used it quite a bit on ReFS w/ Hyper-V VMs and it worked wonders. Cut my storage usage down by ~45% with a majority of Windows Server VMs running a mix of 2016/2019 at the time.

                                      • borland 6 hours ago

                                        Yep. At a previous job we had a file server that we published Windows build output to.

                                        There were about 1000 copies of the same pre-requisite .NET and VC++ runtimes (each build had one) and we only paid for the cost of storing it once. It was great.

                                        It is worth pointing out though, that on Windows Server this deduplication is a background process; When new duplicate files are created, they genuinely are duplicates and take up extra space, but once in a while the background process comes along and "reclaims" them, much like the Hyperspace app here does.

                                        Because of this (the background sweep process is expensive), it doesn't run all the time and you have to tell it which directories to scan.

                                        If you want "real" de-duplication, where a duplicate file will never get written in the first place, then you need something like ZFS

                                        • p_ing 2 hours ago

                                          Both ZFS and WinSvr offer "real" dedupe. One is on-write, which requires a significant amount of available memory, the other is on a defined schedule, which uses considerably less memory (300MB + 10MB/TB).

                                          ZFS is great if you believe you'll exceed some threshold of space while writing. I don't personally plan my volumes with that in mind but rather make sure I have some amount of excess free space.

                                          WinSvr allows you to disable dedupe if you want (don't know why you would), whereas ZFS is a one-way street without exporting the data.

                                          Both have pros and cons. I can live with the WinSvr cons while ZFS cons (memory) would be outside of my budget, or would have been at the particular time with the particular system.

                                          • sterlind 3 hours ago

                                            hey, it's defrag all over again!

                                            (not really, since it's not fragmentation, but conceptually similar)

                                        • mentalgear an hour ago

                                          What's the source of that quote? Does it mean it's not safe to use Hyperspace?

                                          • nielsbot 3 hours ago

                                            Disk Utility.app manages to keep the OS running while making the disk exclusive-access... I wonder how it does that.

                                            • pizzafeelsright 7 hours ago

                                              data loss is the largest concern

                                              I still do not trust de-duplication software.

                                              • dylan604 7 hours ago

                                                Even using sha-256 or greater type of hashing, I'd still have concerns about letting a system make deletion decisions without my involvement. I've even been part of de-dupe efforts, so maybe my hesitation is just because I wrote some of the code and I know I'm not perfect in my coding or even my algo decision trees. I know that any mistake I made would not be of malice but just ignorance or other stupid mistake.

                                                I've done the whole compare-every-file-via-hashing thing, logging each of the matches for humans to review, but never has any of that ever been allowed to mv/rm/ln -s anything. I feel my imposter syndrome in this regard is not a bad thing.

                                                • borland 4 hours ago

                                                  Now you understand why this app costs more than 2x the price of alternatives such as diskDedupe.

                                                  Any halfway-competent developer can write some code that does a SHA256 hash of all your files and uses the Apple filesystem APIs to replace duplicates with shared clones. I know Swift; I could probably do it in an hour or two. Should you trust my bodgy quick script? Heck no.
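
                                                  Roughly, such a quick script might look like the sketch below (assuming SHA256 grouping plus macOS's cp -c, which clones via clonefile(2); this is emphatically not Hyperspace's actual code, and has none of its guardrails):

                                                    import hashlib, os, subprocess
                                                    from collections import defaultdict

                                                    def sha256(path):
                                                        h = hashlib.sha256()
                                                        with open(path, "rb") as f:
                                                            for chunk in iter(lambda: f.read(1 << 20), b""):
                                                                h.update(chunk)
                                                        return h.hexdigest()

                                                    groups = defaultdict(list)
                                                    for root, _, names in os.walk("/some/folder"):  # hypothetical root
                                                        for name in names:
                                                            path = os.path.join(root, name)
                                                            if os.path.isfile(path) and not os.path.islink(path):
                                                                groups[sha256(path)].append(path)

                                                    for paths in groups.values():
                                                        original, *dupes = paths
                                                        for dupe in dupes:  # recreate each duplicate as a clone of the first
                                                            tmp = dupe + ".tmp"
                                                            subprocess.run(["cp", "-c", original, tmp], check=True)
                                                            os.rename(tmp, dupe)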

                                                  The author - John Siracusa - has been a professional programmer for decades and is an exceedingly meticulous kind of person. I've been listening to the ATP podcast where they've talked about it, and the app has undergone an absolute ton of testing. Look at the guardrails on the FAQ page https://hypercritical.co/hyperspace/ for an example of some of the extra steps the app takes to keep things safe. Plus you can review all the proposed file changes before you touch anything.

                                                  You're not paying for the functionality, but rather the care and safety that goes around it. Personally, I would trust this app over just about any other on the mac.

                                                  • btilly 2 hours ago

                                                    More than TeX or SQLite?

                                                  • criddell 3 hours ago

                                                    > I'd still have concerns about letting a system make deletion decisions without my involvement

                                                    You are involved. You see the list of duplicates and can review them as carefully as you'd like before hitting the button to write the changes.

                                                    • dylan604 3 hours ago

                                                      Yeah, the lack of involvement was more in response to ZFS doing this, not this app. I may have crossed the streams with other threads about ZFS if it's not directly in this thread.

                                                  • axus 6 hours ago

                                                    Question for the developer: what's your liability if user files are corrupted?

                                                    • codazoda 5 hours ago

                                                      Most EULAs would disclaim liability for data loss and suggest users keep good backups. I haven’t read a EULA in a long time, but I think most of them do so.

                                                      • borland 4 hours ago

                                                        I can't find a specific EULA or disclaimer for the Hyperspace app, but given that the EULAs for major things like Microsoft Office basically say "we offer you no warranty or recourse no matter what this software does", I would hardly expect an indie app to offer anything like that.

                                                  • UltraSane 5 hours ago

                                                    NTFS supports deduplication but it is only available on Server versions which is very annoying.

                                                    • asdfman123 7 hours ago

                                                      If Apple is anything like where I work, there's probably a three-year-old bug ticket in their system about it and no real mandate from upper management to allocate resources for it.

                                                    • petercooper 8 hours ago

                                                      I love the model of it being free to scan and see if you'd get any benefit, then paying for the actual results. I, too, am a packrat, ran it, and got 7GB to reclaim. Not quite worth the squeeze for me, but I appreciate it existing!

                                                      • mentalgear an hour ago

                                                            Am I really that old, that I remember this being the default for most software about 10 years ago? Are people already so used to the subscription trap that they think this is a new model?

                                                        • MBCook 6 hours ago

                                                          He’s talked about it on the podcast he was on. So many users would buy this, run it once, then save a few gigs and be done. So a subscription didn’t make a ton of sense.

                                                          After all how many perfect duplicate files do you probably create a month accidentally?

                                                          There’s a subscription or buy forever option for people who think that would actually be quite useful to them. But for a ton of people a one time IAP that gives them a limited amount of time to use the program really does make a lot of sense.

                                                          And you can always rerun it for free to see if you have enough stuff worth paying for again.

                                                          • sejje 6 hours ago

                                                            I also really like this pricing model.

                                                            I wish it were more obvious how to do it with other software. Often there's a learning curve in the way before you can see the value.

                                                            • jedbrooke 3 hours ago

                                                                  It’s very refreshing compared to those “free trials” you have to remember to cancel (pro tip: use virtual credit cards, which you can lock, so if you forget to cancel the charges are blocked).

                                                                  However, has anyone been able to find out from the website how much the license actually costs?

                                                            • astennumero 6 hours ago

                                                                    What algorithm does the application use to figure out if two files are identical? There are a lot of interesting algorithms out there: hashes, bit-by-bit comparison, etc. But these techniques have their own disadvantages. What is the best way to do this for a large number of files?

                                                              • borland 6 hours ago

                                                                I don't know exactly what Siracusa is doing here, but I can take an educated guess:

                                                                For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.

                                                                The obvious answer today is a SHA256 hash of the file's contents; It's very fast, not too large (32 bytes) and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA256 is the de-facto standard for this kind of thing and I'd be very surprised if he'd done anything else.

                                                                • MBCook 6 hours ago

                                                                  You can start with the size, which is probably really unique. That would likely cut down the search space fast.

                                                                  At that point maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash and if you just compare the bytes there is no chance of hash collision no matter how small.

                                                                        Plus if you find a difference at byte 1290 you can just stop there instead of reading the whole thing to finish the hash.

                                                                  I don’t think John has said exactly how on ATP (his podcast with Marco and Casey), but knowing him as a longtime listener/reader he’s being very careful. And I think he’s said that on the podcast too.
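
                                                                        A sketch of that size-first strategy in Python (filecmp.cmp with shallow=False reads both files in chunks and stops at the first difference):

                                                                          import filecmp, os
                                                                          from collections import defaultdict

                                                                          by_size = defaultdict(list)
                                                                          for root, _, names in os.walk("/some/folder"):  # hypothetical root
                                                                              for name in names:
                                                                                  path = os.path.join(root, name)
                                                                                  if os.path.isfile(path):
                                                                                      by_size[os.path.getsize(path)].append(path)

                                                                          for paths in by_size.values():
                                                                              if len(paths) < 2:
                                                                                  continue  # unique size: cannot have a duplicate
                                                                              for i, a in enumerate(paths):
                                                                                  for b in paths[i + 1:]:
                                                                                      if filecmp.cmp(a, b, shallow=False):  # byte compare, early exit
                                                                                          print("duplicate:", a, b)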

                                                                  • unclebucknasty 4 hours ago

                                                                    >which is probably really unique

                                                                    Wonder what the distribution is here, on average? I know certain file types tend to cluster in specific ranges.

                                                                    >maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash

                                                                          Definitely, for comparing any two files. But if you're searching for duplicates across the entire disk, then each file is theoretically checked multiple times, and compared against multiple times. So, hashing them on the first pass could conceivably be more efficient.

                                                                    >if you just compare the bytes there is no chance of hash collision

                                                                    You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives.

                                                                    But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all.

                                                                    You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal.

                                                                  • karparov 2 hours ago

                                                                    This can be done much faster and safer.

                                                                          You can group all files into buckets, and as soon as a bucket contains only a single file, discard it. If in the end there are still multiple files in the same bucket, they are duplicates. (A sketch follows the list below.)

                                                                    Initially all files are in the same bucket.

                                                                          You now iterate over differentiators which, given two files, tell you whether they are maybe equal or definitely not equal. They become more and more costly, but also more and more exact. You run the differentiator on all files in a bucket to split the bucket into finer equivalence classes.

                                                                    For example:

                                                                    * Differentiator 1 is the file size. It's really cheap, you only look at metadata, not the file contents.

                                                                    * Differentiator 2 can be a hash over the first file block. Slower since you need to open every file, but still blazingly fast and O(1) in file size.

                                                                    * Differentiator 3 can be a hash over the whole file. O(N) in file size but so precise that if you use a cryptographic hash then you're very unlikely to have false positives still.

                                                                    * Differentiator 4 can compare files bit for bit. Whether that is really needed depends on how much you trust collision resistance of your chosen hash function. Don't discard this though. Git got bitten by this.
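
                                                                          A sketch of that refinement loop in Python, using the file size, a first-block hash, and a whole-file hash as successively more expensive differentiators (list_of_all_files is assumed gathered elsewhere):

                                                                            import hashlib, os

                                                                            def head_hash(path, size=4096):  # hash of the first block only
                                                                                with open(path, "rb") as f:
                                                                                    return hashlib.sha256(f.read(size)).digest()

                                                                            def full_hash(path):  # hash of the whole file
                                                                                h = hashlib.sha256()
                                                                                with open(path, "rb") as f:
                                                                                    for chunk in iter(lambda: f.read(1 << 20), b""):
                                                                                        h.update(chunk)
                                                                                return h.digest()

                                                                            def refine(buckets, key):
                                                                                out = []
                                                                                for bucket in buckets:
                                                                                    classes = {}
                                                                                    for path in bucket:
                                                                                        classes.setdefault(key(path), []).append(path)
                                                                                    # singleton classes have no duplicates: discard them
                                                                                    out.extend(c for c in classes.values() if len(c) > 1)
                                                                                return out

                                                                            buckets = [list_of_all_files]  # assumed gathered elsewhere
                                                                            for key in (os.path.getsize, head_hash, full_hash):
                                                                                buckets = refine(buckets, key)
                                                                            # optionally finish with a bit-for-bit comparison inside each bucket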

                                                                    • rzzzt 5 hours ago

                                                                      I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:

                                                                      - compute SHA256 hashes for each file on the source side

                                                                      - copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check if I had a copy from the same file earlier)

                                                                      - mirror the source directory structure to the destination

                                                                      - create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.

                                                                      Then I got too scared to actually use it :)
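
                                                                            For concreteness, the same idea sketched in Python (hypothetical paths, and just as untested as the original):

                                                                              import hashlib, os, shutil

                                                                              SRC, DST = "/path/to/source", "/path/to/snapshot"  # hypothetical
                                                                              CANON = os.path.join(DST, ".canonical")
                                                                              os.makedirs(CANON, exist_ok=True)

                                                                              for root, _, names in os.walk(SRC):
                                                                                  rel = os.path.relpath(root, SRC)
                                                                                  os.makedirs(os.path.join(DST, rel), exist_ok=True)  # mirror tree
                                                                                  for name in names:
                                                                                      src = os.path.join(root, name)
                                                                                      with open(src, "rb") as f:
                                                                                          digest = hashlib.sha256(f.read()).hexdigest()
                                                                                      canon = os.path.join(CANON, digest)  # hash as the file name
                                                                                      if not os.path.exists(canon):  # copy only unseen content
                                                                                          shutil.copy2(src, canon)
                                                                                      os.link(canon, os.path.join(DST, rel, name))  # hardlink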

                                                                      • pmarreck 5 hours ago

                                                                        xxHash (or xxh3 which I believe is even faster) is massively faster than SHA256 at the cost of security, which is unnecessary here.

                                                                        Of course, engineering being what it is, it's possible that only one of these has hardware support and thus might end up actually being faster in realtime.

                                                                        • PhilipRoman 3 hours ago

                                                                          Blake3 is my favorite for this kind of thing. It's a cryptographic hash (maybe not the world's strongest, but considered secure), and also fast enough that in real world scenarios it performs just as well as non-crypto hashes like xx.

                                                                        • f1shy 6 hours ago

                                                                                I think the probability is not so low. I remember reading here about a person getting a photo from another chat in a chat application that was using SHA in the background. I don't recall all the details; it is improbable, but possible.

                                                                        • amelius 6 hours ago

                                                                          Or just use whatever algorithm rsync uses.

                                                                        • diegs 6 hours ago

                                                                                This reminds me of https://en.wikipedia.org/wiki/Venti_(software) which was a content-addressable filesystem that used hashes for de-duplication. Since the hashes were computed at write time, the performance penalty is amortized.

                                                                          • w4yai 6 hours ago

                                                                                  I'd hash the first 1024 bytes of all files and start from there if there are any collisions. That way you don't need to hash whole (large) files, only those with matching prefix hashes.

                                                                            • amelius 6 hours ago

                                                                              I suspect that bytes near the end are more likely to be different (even if there may be some padding). For example, imagine you have several versions of the same document.

                                                                              Also, use the length of the file for a fast check.

                                                                              • kstrauser 6 hours ago

                                                                                At that point, why hash them instead of just using the first 1024 bytes as-is?

                                                                                • borland 6 hours ago

                                                                                  In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".

                                                                                        If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes -- e.g. perhaps all the photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.

                                                                                        Far more efficient - and less work - if you just use a SHA256 of the file's contents. That gets you a much smaller 32-byte key, and you don't need to bother with 2-stage comparisons.

                                                                                  • kstrauser 5 hours ago

                                                                                    I understand the concept. My main point is that it's probably not a huge advantage to store hashes of the first 1KB, which requires CPU to calculate, over just the raw bytes, which requires storage. There's a tradeoff either way.

                                                                                          I don't think it would be far more efficient to hash the entire contents, though. If you have a million files storing a terabyte of data, the 2-stage comparison would read at most 1GB (1 million * 1KB) of data, and less for smaller files. If you hash the whole contents, you have to read the entire 1TB. There are a hundred confounding variables, for sure. I don't think you could confidently estimate which would be more efficient without a lot of experimenting.

                                                                                    • philsnow 4 hours ago

                                                                                      If you're going to keep partial hashes in memory, may as well align it on whatever boundary is the minimal block/sector size that your drives give back to you. Hashing (say) 8kB takes less time than it takes to fetch it from SSD (much less disk), so if you only used the first 1kB, you'd (eventually) need to re-fetch the same block to calculate the hash for the rest of the bytes in that block.

                                                                                      ... okay, so as long as you always feed chunks of data into your hash in the same deterministic order, it doesn't matter for the sake of correctness what that order is or even if you process some bytes multiple times. You could hash the first 1kB, then the second-through-last disk blocks, then the entire first disk block again (double-hashing the first 1kB) and it would still tell you whether two files are identical.

                                                                                      If you're reading from an SSD and seek times don't matter, it's in fact probable that on average a lot of files are going to differ near the start and end (file formats with a header and/or footer) more than in the middle, so maybe a good strategy is to use the first 32k and the last 32k, and then if they're still identical, continue with the middle blocks.

                                                                                      In memory, per-file, you can keep something like

                                                                                        - the length
                                                                                        - h(block[0:4])
                                                                                        - h(block[0:4] | block[-5:])
                                                                                        - h(block[0:4] | block[-5:] | block[4:32])
                                                                                        - h(block[0:4] | block[-5:] | block[4:128])
                                                                                        - ...
                                                                                        - h(block[0:4] | block[-5:] | block[4:])
                                                                                      
                                                                                      etc, and only calculate the latter partial hashes when there is a collision between earlier ones. If you have 10M files and none of them have the same length, you don't need to hash anything. If you have 10M files and 9M of them are copies of each other except for a metadata tweak that resides in the last handful of bytes, you don't need to read the entirety of all 10M files, just a few blocks from each.

                                                                                      A further refinement would be to have per-file-format hashing strategies... but then hashes wouldn't be comparable between different formats, so if you had 1M pngs, 1M zips, and 1M png-but-also-zip quine files, it gets weird. Probably not worth it to go down this road.

                                                                                  • sedatk 6 hours ago

                                                                                    Probably because you need to keep a lot of those in memory.

                                                                                    • kstrauser 5 hours ago

                                                                                      I suspect that a computer with so many files that this would be useful probably has a lot of RAM in it, at least in the common case.

                                                                                      • sedatk 5 hours ago

                                                                                        But you need to constantly process them too, not just store them.

                                                                                    • smusamashah 6 hours ago

                                                                                            And why the first 1024? You could pick from predefined points instead.

                                                                                      • f1shy 6 hours ago

                                                                                        Depending on the medium, the penalty of reading single bytes in sparse locations could be comparable with reading the whole file. Maybe not a big win.

                                                                                  • williamsmj 5 hours ago

                                                                                    Deleted comment based on a misunderstanding.

                                                                                    • Sohcahtoa82 5 hours ago

                                                                                      > This tool simply identifies files that point at literally the same data on disk because they were duplicated in a copy-on-write setting.

                                                                                      You misunderstood the article, as it's basically doing the opposite of what you said.

                                                                                      This tool finds duplicate data that is specifically not duplicated via copy-on-write, and then turns it into a copy-on-write copy.

                                                                                      • williamsmj 5 hours ago

                                                                                        Fair. Deleted.

                                                                                  • mattgreenrocks an hour ago

                                                                                    What jumped out to me:

                                                                                    > Finally, at WWDC 2017, Apple announced Apple File System (APFS) for macOS (after secretly test-converting everyone’s iPhones to APFS and then reverting them back to HFS+ as part of an earlier iOS 10.x update in one of the most audacious technological gambits in history).

                                                                                    How can you revert a FS change like that if it goes south? You'd certainly exercise the code well but also it seems like you wouldn't be able to back out of it if something was wrong.

                                                                                    • quux an hour ago

                                                                                              IIRC migrating from HFS+ to APFS can be done without touching any of the data blocks; a parallel set of APFS metadata blocks and superblocks is written to disk. In the test migrations, Apple did the entire migration, including generating APFS superblocks, but stopped short of committing the change that would permanently replace the HFS+ superblocks with APFS ones. To roll back, they “just” needed to clean up all the generated APFS superblocks and metadata blocks.

                                                                                      • k1t an hour ago

                                                                                        Yes, that's how it's described in this talk transcript:

                                                                                        https://asciiwwdc.com/2017/sessions/715

                                                                                                > Let’s say for simplification we have three metadata regions that report all the entirety of what the file system might be tracking, things like file names, time stamps, where the blocks actually live on disk, and that we also have two regions labeled file data, and if you recall during the conversion process the goal is to only replace the metadata and not touch the file data.

                                                                                                > We want that to stay exactly where it is as if nothing had happened to it.

                                                                                                > So the first thing that we’re going to do is identify exactly where the metadata is, and as we’re walking through it we’ll start writing it into the free space of the HFS+ volume.

                                                                                                > And what this gives us is crash protection and the ability to recover in the event that conversion doesn’t actually succeed.

                                                                                                > Now the metadata is identified.

                                                                                                > We’ll then start to write it out to disk, and at this point, if we were doing a dry-run conversion, we’d end here.

                                                                                                > If we’re completing the process, we will write the new superblock on top of the old one, and now we have an APFS volume.

                                                                                        • MBCook an hour ago

                                                                                          I think that’s what they did too. And it was a genius way of testing. They did it more than once too I think.

                                                                                          Run the real thing, throw away the results, report all problems back to the mothership so you have a high chance of catching them all even on their multi-hundred million device fleet.

                                                                                      • formerphotoj 33 minutes ago

                                                                                        OmniDiskSweeper (I know, not exactly the same thing, but still...)

                                                                                        • BWStearns 6 hours ago

                                                                                          I have file A that's in two places and I run this.

                                                                                          I modify A_0. Does this modify A_1 as well or just kind of reify the new state of A_0 while leaving A_1 untouched?

                                                                                          • madeofpalk 6 hours ago

                                                                                                      It's called copy-on-write: when you modify A_0, the filesystem makes a copy for the file being written, while A_1 is left untouched.

                                                                                            https://en.wikipedia.org/wiki/Copy-on-write#In_computer_stor...
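
                                                                                                      A quick way to convince yourself, assuming an APFS volume (macOS's cp -c clones via clonefile(2)):

                                                                                                        import pathlib, subprocess

                                                                                                        a0, a1 = pathlib.Path("A_0"), pathlib.Path("A_1")
                                                                                                        a0.write_bytes(b"original contents")
                                                                                                        subprocess.run(["cp", "-c", str(a0), str(a1)], check=True)  # clone: no new data blocks

                                                                                                        a0.write_bytes(b"modified!")  # the write is what triggers the copy
                                                                                                        print(a1.read_bytes())        # b'original contents' -- untouched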

                                                                                            • kdmtctl 2 hours ago

                                                                                                        What happens when the original file is deleted? Often this is handled by block reference counters, which simply get decremented. How does APFS handle this? Is there any master/copy concept, or just block references?

                                                                                              • BWStearns an hour ago

                                                                                                Thanks for the clarification. I expected it worked like that but couldn't find it spelled out after a brief perusal of the docs.

                                                                                                • bsimpson 6 hours ago

                                                                                                  Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.

                                                                                                  But if you have the same 500MB of node_modules in each of your dozen projects, this might actually durably save some space.

                                                                                                  • _rend 4 hours ago

                                                                                                    > Which means if you actually edited those files, you might fill up your HD much more quickly than you expected.

                                                                                                    I'm not sure if this is what you intended, but just to be sure: writing changes to a cloned file doesn't immediately duplicate the entire file again in order to write those changes — they're actually written out-of-line, and the identical blocks are only stored once. From [the docs](^1) posted in a sibling comment:

                                                                                                    > Modifications to the data are written elsewhere, and both files continue to share the unmodified blocks. You can use this behavior, for example, to reduce storage space required for document revisions and copies. The figure below shows a file named “My file” and its copy “My file copy” that have two blocks in common and one block that varies between them. On file systems like HFS Plus, they’d each need three on-disk blocks, but on an Apple File System volume, the two common blocks are shared.

                                                                                                    [^1]: https://developer.apple.com/documentation/foundation/file_sy...

                                                                                                    • bsimpson 42 minutes ago

                                                                                                      Thanks for the clarification!

                                                                                                • lgdskhglsa 6 hours ago

                                                                                                  He's using the "copy on write" feature of the file system. So it should leave A_1 untouched, creating a new copy for A_0's modifications. More info: https://developer.apple.com/documentation/foundation/file_sy...

                                                                                                • bhouston 7 hours ago

                                                                                                            I gave it a try on my massive folder of NodeJS projects but it only found 1GB of savings on an 8.1GB folder.

                                                                                                  I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)

                                                                                                  I tried to scan System and Library but it refused to do so because of permission issues.

                                                                                                  I think the fact that I use pnpm for my package manager has made my disk space usage already pretty near optimal.

                                                                                                  Oh well. Neat idea. But the current price is too high to justify this. Also I would want it as a background process that runs once a month or something.

                                                                                                  • kdmtctl 2 hours ago

                                                                                                              Didn't have time to try it myself, but there is an option for the minimum file size to consider, clearly visible on the App Store screenshot. I suppose it was introduced to minimize comparison buffers. It is possible that node modules slide under this size and weren't considered.

                                                                                                    • p_ing 5 hours ago

                                                                                                      > I tried to scan System and Library but it refused to do so because of permission issues.

                                                                                                      macOS has a sealed volume which is why you're seeing permission errors.

                                                                                                      https://support.apple.com/guide/security/signed-system-volum...

                                                                                                      • bhouston 5 hours ago

                                                                                                                  For some reason "disk-inventory-x" will scan those folders. I used that amazing tool to prune leftover Unreal Engine files and Docker caches when they were put outside my home folder. The tool asks for a ton of permissions when you run it in order to do the scan, though, which is a bit annoying.

                                                                                                        • alwillis an hour ago

                                                                                                          It’s not obvious but the system folder is on a separate, secure volume; the Finder does some trickery to make the system volume and the data volume appear as one.

                                                                                                          In general, you don’t want to mess with that.

                                                                                                      • zamalek 7 hours ago

                                                                                                        pnpm tries to be a drop-in replacement for npm, and dedupes automatically.

                                                                                                        • MrJohz 6 hours ago

                                                                                                          More importantly, pnpm installs packages as symlinks, so the deduping is rather more effective. I believe it also tries to mirror the NPM folder structure and style of deduping as well, but if you have two of the same package installed anywhere on your system, pnpm will only need to download and save one copy of that package.

                                                                                                          • spankalee 6 hours ago

                                                                                                            npm's --install-strategy=linked flag is supposed to do this too, but it has been broken in several ways for years.

                                                                                                          • diggan 7 hours ago

                                                                                                            > pnpm tries to be a drop-in replacement for npm

                                                                                                            True

                                                                                                            > and dedupes automatically

                                                                                                            Also true.

                                                                                                                      But the way you put them one after the other makes it sound like npm does de-duplication, and since pnpm tries to be a drop-in replacement for npm, so does pnpm.

                                                                                                                      So for clarification: npm doesn't do de-duplication across all your projects, and that in particular was one of the more useful features that pnpm brought to the ecosystem when it first arrived.

                                                                                                          • lou1306 7 hours ago

                                                                                                            > it only found 1GB of savings on a 8.1GB folder.

                                                                                                            You "only" found that 12% of the space you are using is wasted? Am I reading this right?

                                                                                                            • bhouston 5 hours ago

                                                                                                              I have a 512GB drive in my MacBook Air M3 with 225GB free. Saving 1GB is 0.5% of my total free space, and it is definitely "below my line." It's still a neat concept for a tool, though.

                                                                                                              When I ran it on my home folder with 165GB of data, it only found 1.3GB of savings. That isn't significant to me, and it isn't really worth paying for.

                                                                                                              BTW, I highly recommend the free "disk-inventory-x" utility for macOS space management.

                                                                                                              • timerol 5 hours ago

                                                                                                                Your original comment did not mention that your home folder was 165 GB, which is extremely relevant here

                                                                                                              • warkdarrior 7 hours ago

                                                                                                                The relevant number (missing from above) is the total amount of space on that storage device. If it saves 1GB on an 8TB drive, it's not a big win.

                                                                                                                • oneeyedpigeon 7 hours ago

                                                                                                                  It should be proportional to the total used space, not the space available. The previous commenter said it was a 1 GB savings from ~8 GB of used space; that's equally significant whether it happens on a 10 GB drive or a 10 TB one.

                                                                                                                  • horsawlarway 7 hours ago

                                                                                                                    He picked node_modules because it's highly likely to encounter redundant files there.

                                                                                                                    If you read the rest of the comment he only saved another 30% running his entire user home directory through it.

                                                                                                                    So this is not a linear trend based on space used.

                                                                                                                    • borland 6 hours ago

                                                                                                                      He "only" saved 30%? That's amazing. I really doubt most people are going to get anywhere near that.

                                                                                                                      When I run it on my home folder (Roughly 500GB of data) I find 124 MB of duplicated files.

                                                                                                                      At this stage I'd like it to tell me what those files are - The dupes are probably dumb ones that I can simply go delete by hand, but I can understand why he'd want people to pay up first, as by simply telling me what the dupes are he's proved the app's value :-)

                                                                                                                      • bhouston 5 hours ago

                                                                                                                        > He "only" saved 30%? That's amazing. I really doubt most people are going to get anywhere near that.

                                                                                                                        You misunderstood my comment. I ran it on my home folder, which contains 165GB of data, and it found 1.3GB in savings. That isn't significant enough for me to care about, because I currently have 225GB free of my 512GB drive.

                                                                                                                        BTW, I highly recommend the free "disk-inventory-x" utility for macOS space management.

                                                                                                                        • jeromegv 24 minutes ago

                                                                                                                          Everyone misunderstood your comment for a reason.

                                                                                                                          You wrote: "but it only found 1GB of savings on a 8.1GB folder."

                                                                                                                          That's quite a saving, and it's what everyone understood from your comment.

                                                                                                                        • wlesieutre 5 hours ago

                                                                                                                          Another 30% more than the 1GB saved in node modules, for 1.3GB total. Not 30% of total disk space.

                                                                                                                          For reference, from the comment they’re talking about:

                                                                                                                          > I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)

                                                                                                                    • jy14898 7 hours ago

                                                                                                                      If it saved 8.1GB, by your measure it'd also not be a big win?

                                                                                                                      • horsawlarway 7 hours ago

                                                                                                                        This is basically only a win on macOS, and only because Apple charges through the nose for disk space.

                                                                                                                              Ex - On my non-Apple machines, 8GB is trivial. I load them up with astoundingly cheap NVMe drives in the multiple-terabyte range (2TB for ~$100, 4TB for ~$250), and I have a cheap NAS.

                                                                                                                        So that "big win" is roughly 40 cents of hardware costs on the direct laptop hardware. Hardly worth the time and effort involved, even if the risk is zero (and I don't trust it to be zero).

                                                                                                                        If it's just "storage" and I don't need it fast (the perfect case for this type of optimization) I throw it on my NAS where it's cheaper still... Ex - it's not 40 cents saved, it's ~10.

                                                                                                                        ---

                                                                                                                        At least for me, 8GB is no longer much of a win. It's a rounding error on the last LLM model I downloaded.

                                                                                                                        And I'd suggest that basically anyone who has the ability to not buy extortionately priced drives soldered onto a mainboard is not really winning much here either.

                                                                                                                        I picked up a quarter off the ground on my walk last night. That's a bigger win.

                                                                                                                        • borland 6 hours ago

                                                                                                                          > This is basically only a win on macOS, and only because Apple charges through the nose for disk space

                                                                                                                          You do realize that this software is only available on macOS, and only works because of Apple's APFS filesystem? You're essentially complaining that medicine is only a win for people who are sick.

                                                                                                                    • rconti 7 hours ago

                                                                                                                                Absolutely, 100% backwards. The tool cannot save space on disk space it hasn't scanned. Your "not a big win" comment assumes there is no space left to be reclaimed on the rest of the disk, or that the rest of the disk can't be reclaimed at an even higher rate.

                                                                                                                  • modzu 7 hours ago

                                                                                                                                  What's the price? It doesn't seem to be published anywhere.

                                                                                                                    • scblock 7 hours ago

                                                                                                                      It's on the Mac App Store so you'll find the pricing there. Looks like $10 for one month (one time use maybe?), $20 for a year, $50 lifetime.

                                                                                                                      • diggan 7 hours ago

                                                                                                                                    Even though I have both a Mac and an iPhone, since I happen to be on my Linux computer right now, the store page (https://apps.apple.com/us/app/hyperspace-reclaim-disk-space/...) is not showing the price, probably because I'm not actively on an Apple device? Seems like poor UX even for us Mac users.

                                                                                                                        • pimlottc 7 hours ago

                                                                                                                          It's buried under a drop-down in the "Information" section, under "In-App Purchases". I agree, it's not the greatest.

                                                                                                                          • MBCook 6 hours ago

                                                                                                                            It’s a side effect of the terrible store design.

                                                                                                                                      It’s a free app because you don’t have to buy it to run it. It will tell you how much space it can save you for free, so you don’t have to waste $20 to find out it would’ve only been 2KB.

                                                                                                                                      But that means the parts you actually have to buy are in-app purchases, which are always hidden on store pages.

                                                                                                                            • diggan 7 hours ago

                                                                                                                                        Ah, you're absolutely right, I missed that completely. Buried at the bottom of the page :) Thanks for pointing it out.

                                                                                                                            • oneeyedpigeon 7 hours ago

                                                                                                                                        I see it on my Android phone. It's a free app but the subs are an in-app purchase, so you need to hunt that section down.

                                                                                                                          • piqufoh 7 hours ago

                                                                                                                            £9.99 a month, £19.99 for one year, £49.99 for life (app store purchase prices visible once you've scanned a directory).

                                                                                                                        • galaxyLogic 2 hours ago

                                                                                                                                      On Windows there is "Dev Drive", which I believe does a similar copy-on-write thing.

                                                                                                                                      If it works, it's a no-brainer, so why isn't it the default?

                                                                                                                          https://learn.microsoft.com/en-us/windows/dev-drive/#dev-dri...

                                                                                                                          • siranachronist 2 hours ago

                                                                                                                                        Requires ReFS, which still isn't supported on the system drive on Windows, IIRC.

                                                                                                                          • diggan 7 hours ago

                                                                                                                            > Like all my apps, Hyperspace is a bit difficult to explain. I’ve attempted to do so, at length, in the Hyperspace documentation. I hope it makes enough sense to enough people that it will be a useful addition to the Mac ecosystem.

                                                                                                                            Am I missing something, or isn't it a "file de-duplicator" with a nice UI/UX? Sounds pretty simple to describe, and tells you why it's useful with just two words.

                                                                                                                            • dewey 5 hours ago

                                                                                                                                          The author of the software is a filesystem enthusiast (so much so that on the podcast he's a part of, they have a dedicated sound effect for every time "filesystem" comes up), a long-time blogger, and a macOS reviewer. So you have to see it in that context: documenting every bit of it, and the technical details behind it, is important to him...even if that's longer than a tag line on a landing page.

                                                                                                                                          In times when documentation is often an afterthought, and technical details get hidden away from users all the time ("Oops, some error occurred"), this should be celebrated.

                                                                                                                              • protonbob 7 hours ago

                                                                                                                                            No, because it isn't getting rid of the duplicate; it's using a feature of APFS that allows duplicates to exist separately but share the same internal data.

                                                                                                                                • yayoohooyahoo 7 hours ago

                                                                                                                                              Is it not the same as a hard link (which I believe is supported on Mac too)?

                                                                                                                                  • andrewla 7 hours ago

                                                                                                                                    My understanding is that it is a copy-on-write clone, not a hard link. [1]

                                                                                                                                    > Q: Are clone files the same thing as symbolic links or hard links?

                                                                                                                                    > A: No. Symbolic links ("symlinks") and hard links are ways to make two entries in the file system that share the same data. This might sound like the same thing as the space-saving clones used by Hyperspace, but there’s one important difference. With symlinks and hard links, a change to one of the files affects all the files.

                                                                                                                                    > The space-saving clones made by Hyperspace are different. Changes to one clone file do not affect other files. Cloned files should look and behave exactly the same as they did before they were converted into clones.

                                                                                                                                    [1] https://hypercritical.co/hyperspace/

                                                                                                                                    • dylan604 6 hours ago

                                                                                                                                                  What kind of changes could you make to one clone that would still qualify it as a clone? If there are changes, it's no longer the same file. Even after reading the How It Works[0] link, I'm not grokking how it works. Is it making some sort of delta/diff that is applied to the original file? That's not possible for every file format, like large media files. I could see that being interesting for text-based files, but it gets complicated for complex files.

                                                                                                                                      [0] https://hypercritical.co/hyperspace/#how-it-works

                                                                                                                                      • aeontech 6 hours ago

                                                                                                                                        If I understand correctly, a COW clone references the same contents (just like a hardlink) as long as all the filesystem references are pointing to identical file contents.

                                                                                                                                        Once you open one of the reference handles and modify the contents, the copy-on-write process is invoked by the filesystem, and the underlying data is copied into a new, separate file with your new changes, breaking the link.

                                                                                                                                                    With a hardlink, by comparison, there is no copy-on-write, so changes made to the contents through one reference also show up when you open any of the other hardlinks to the same file.
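
                                                                                                                                                    You can watch this happen on any APFS volume using `cp -c`, which requests a clonefile(2) copy; a minimal sketch:

                                                                                                                                                        cp -c big-file.bin clone.bin   # instant "copy": the clone shares the original's blocks
                                                                                                                                                        echo edit >> clone.bin         # copy-on-write: only the changed blocks get new storage
                                                                                                                                                                                       # big-file.bin is untouched; unchanged blocks stay shared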

                                                                                                                                        • dylan604 6 hours ago

                                                                                                                                                      Ah, that's where the copy-on-write takes place. Sometimes just reading it in someone else's words is the knock upside the head I need.

                                                                                                                                          • MBCook 6 hours ago

                                                                                                                                            That’s correct.

                                                                                                                                      • actionfromafar 7 hours ago

                                                                                                                                                      Almost, but the difference is that if you change one of the hardlinked files, you change "all of them". (It's really the same file, just with different paths.)

                                                                                                                                                      https://hypercritical.co/hyperspace/#how-it-works

                                                                                                                                                      APFS apparently allows for creating "clone files" which, when changed, start to diverge.

                                                                                                                                        • alwillis an hour ago

                                                                                                                                                        It’s not the same, because clones can have separate metadata; in addition, if a cloned file changes, only the changed blocks are stored, not a second full copy.

                                                                                                                                          • zippergz 7 hours ago

                                                                                                                                            A copy-on-write clone is not the same thing as a hard link.

                                                                                                                                            • rahimnathwani 7 hours ago

                                                                                                                                                            With a hard link, the contents of the two 'files' are identical in perpetuity.

                                                                                                                                                            With APFS clones, the contents start off identical but can be changed independently. If you change a small part of a file, new copies of those blocks are written, but the remaining blocks continue to be shared with the clone.

                                                                                                                                            • diggan 7 hours ago

                                                                                                                                                            Right, but the concept is the same: "remove duplicates" in order to save storage space. Whether it uses reflinks, softlinks, APFS clones, or whatever is more or less an implementation detail.

                                                                                                                                              I know that internally it isn't actually "removing" anything, and that it uses fancy new technology from Apple. But in order to explain the project to strangers, I think my tagline gets the point across pretty well.

                                                                                                                                              • CharlesW 7 hours ago

                                                                                                                                                > Right, but the concept is the same, "remove duplicates" in order to save storage space.

                                                                                                                                                The duplicates aren't removed, though. Nothing changes from the POV of users or software that use those files, and you can continue to make changes to them independently.

                                                                                                                                                • vultour 6 hours ago

                                                                                                                                                                De-duplication does not mean the duplicates completely disappear. If I download a deduplication utility, I expect it to create some sort of soft/hard link. I definitely don’t want it to completely remove random files on the filesystem; that’s just going to wreak havoc.

                                                                                                                                                  • sgerenser 5 hours ago

                                                                                                                                                                  But it can still wreak havoc if you use hardlinks or softlinks, because maybe there was a good reason for having a duplicate file! Imagine you have a photo, “foo.jpg”. You make a copy of it, “foo2.jpg”. You’re planning on editing that copy, but right now it’s a duplicate. At this point you run your “deduper”, which turns the second file into a hardlink. A few days later you go to edit the file, but wait: the original “backup” file is now modified too! You lost your original.

                                                                                                                                                                  That’s why copy-on-write clones are completely different from hardlinks.

                                                                                                                                              • dingnuts 7 hours ago

                                                                                                                                                It does get rid of the duplicate. The duplicate data is deleted and a hard link is created in its place.

                                                                                                                                                • zippergz 7 hours ago

                                                                                                                                                  It does not make hard links. It makes copy-on-write clones.

                                                                                                                                                  • kemayo 7 hours ago

                                                                                                                                                    No, because it's not actually a hard link -- if you modify one of the files they'll diverge.

                                                                                                                                                    • 8n4vidtmkvmk 7 hours ago

                                                                                                                                                      Sounds like jdupes with -B

                                                                                                                                                      • kemayo 6 hours ago

                                                                                                                                                        Cursory googling suggests that it's using the same filesystem feature, yeah.
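
                                                                                                                                                                    For reference, the jdupes invocation would be something along these lines (a sketch; -B needs a filesystem and build with clone/dedupe support):

                                                                                                                                                                        jdupes -rB ~/Projects   # -r: recurse into subdirectories, -B: dedupe matches via filesystem clones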

                                                                                                                                                • zerd 5 hours ago

                                                                                                                                                  I've been using `fclones` [1] to do this, with `dedupe`, which uses reflink/clonefile.

                                                                                                                                                  https://github.com/pkolaczk/fclones
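
                                                                                                                                                                  Typical usage is two steps, roughly like this (a sketch; check `fclones --help` for your version's flags):

                                                                                                                                                                      fclones group ~/Documents > dupes.txt   # find groups of identical files and write a report
                                                                                                                                                                      fclones dedupe < dupes.txt              # replace the duplicates with reflinks/clones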

                                                                                                                                                • jamesfmilne 7 hours ago

                                                                                                                                                  Would be nice if git could make use of this on macOS.

                                                                                                                                                  Each worktree I usually work on is several gigs of (mostly) identical files.

                                                                                                                                                  Unfortunately the source files are often deep in a compressed git pack file, so you can't de-duplicate that.

                                                                                                                                                  (Of course, the bigger problem is the build artefacts on each branch, which are like 12G per debug/release per product, but they often diverge for boring reasons.)

                                                                                                                                                  • theamk 6 hours ago

                                                                                                                                                    "git worktree" shares a .git folder between multiple checkouts. You'll still have multiple files in working copy, but at least the .pack files would be shared. It is great feature, very robust, I use it all the time.

                                                                                                                                                    There is also ".git/objects/info/alternates", accessed via "--shared"/"--reference" option of "git clone", that allows only sharing of object storage and not branches etc... but it is has caveats, and I've only used it in some special circumstances.

                                                                                                                                                    • diggan 7 hours ago

                                                                                                                                                                      Git is a really poor fit for a project like that, since it's snapshot-based instead of diff-based... Luckily, `git lfs` exists to work around that; I'm assuming you've already investigated it for the large artifacts?

                                                                                                                                                      • globular-toast 3 hours ago

                                                                                                                                                        Git de-duplicates everything in its store (in the .git directory) already. That's how it can store thousands of commits which are snapshots of the entire repository without eating up tons of disk space. Why do you have duplicated files in the working directory, though?

                                                                                                                                                      • exitb 8 hours ago

                                                                                                                                                        What are examples of files that make up the "dozens of gigabytes" of duplicated data?

                                                                                                                                                        • xnx 7 hours ago

                                                                                                                                                                          There are some CUDA files that every local AI app installs, and they take multiple GB each.

                                                                                                                                                          • wruza 7 hours ago

                                                                                                                                                            Also models that various AI libraries and plugins love to autodownload into custom locations. Python folks definitely need to learn caching, symlinks, asking a user where to store data, or at least logging where they actually do it.

                                                                                                                                                          • zerd 5 hours ago

                                                                                                                                                            .terraform, rust target directory, node_modules.

                                                                                                                                                            • password4321 6 hours ago

                                                                                                                                                              iMovie used to copy video files etc. into its "library".

                                                                                                                                                              • butlike 6 hours ago

                                                                                                                                                                                Audio files, renders, etc.

                                                                                                                                                              • albertzeyer 7 hours ago

                                                                                                                                                                                I wrote a similar (but simpler) script which replaces a file with a hardlink if it has the same content.

                                                                                                                                                                My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match. Some of the packages are quite huge, e.g. Numpy, PyTorch, TensorFlow, etc. I got quite some disk space savings from this.

                                                                                                                                                                https://github.com/albertz/system-tools/blob/master/bin/merg...
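
                                                                                                                                                                                The core operation of such a script boils down to something like this (a sketch, ignoring all the scanning and safety checks a real tool needs):

                                                                                                                                                                                    cmp -s a.bin b.bin && ln -f a.bin b.bin   # if contents are byte-identical, replace b.bin with a hard link to a.bin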

                                                                                                                                                                • andrewla 7 hours ago

                                                                                                                                                                  This does not use hard links or symlinks; this uses a feature of the filesystem that allows the creation of copy-on-write clones. [1]

                                                                                                                                                                  [1] https://en.wikipedia.org/wiki/Apple_File_System#Clones

                                                                                                                                                                  • gurjeet an hour ago

                                                                                                                                                                                    So albertzeyer's script could be adapted to use the `cp -c` command to achieve the same effect as Hyperspace.

                                                                                                                                                                • o10449366 2 hours ago

                                                                                                                                                                                  What would an equivalent tool be on Linux? I guess it depends on the filesystem?

                                                                                                                                                                  • re 3 hours ago

                                                                                                                                                                    On a related note: are there any utilities that can measure disk usage of a folder taking (APFS) cloned files into account?

                                                                                                                                                                    • bsimpson 6 hours ago

                                                                                                                                                                                      Interesting concept, and I like the idea of people getting paid for making useful things.

                                                                                                                                                                                      That said, I get a data security itch having a random piece of software from the internet scan every file on an HD, particularly on a work machine where some lawyers might care about what's reading your hard drive. It would be nice if it was open source, so you could see what it's doing.

                                                                                                                                                                      • Nevermark 6 hours ago

                                                                                                                                                                        > I like the idea of people getting paid for making useful things

                                                                                                                                                                        > It would be nice if it was open source

                                                                                                                                                                        > I get a data security itch having a random piece of software from the internet scan every file on an HD

                                                                                                                                                                        With the source it would be easy for others to create freebie versions, with or without respecting license restrictions or security.

                                                                                                                                                                                        I am not arguing anything, just pondering how software economics and security are full of unresolved holes, and the world isn't getting fairer or safer by default.

                                                                                                                                                                        --

                                                                                                                                                                        The app was a great idea, indeed. I am now surprised Apple doesn't automatically reclaim storage like this. Kudos to the author.

                                                                                                                                                                        • benced 6 hours ago

                                                                                                                                                                          You could download the app, disconnect Wifi and Ethernet, run the app and the reclamation process, remove the app (remember, you have the guarantees of the macOS App Store so no kernel extensions etc), and then reconnect.

                                                                                                                                                                          Edit: this might not work with the payment option actually. I don't think you can IAP without the internet.

                                                                                                                                                                        • JackYoustra 2 hours ago

                                                                                                                                                                          What's the difference with jdupes?

                                                                                                                                                                          • radicality 7 hours ago

                                                                                                                                                                            Hopefully it doesn’t have a bug like the one jdupes had:

                                                                                                                                                                            https://web.archive.org/web/20210506130542/https://github.co...

                                                                                                                                                                            • sgt 3 hours ago

                                                                                                                                                                              Any way it can be built for 14? It requires macOS 15.

                                                                                                                                                                              • karparov 3 hours ago

                                                                                                                                                                                TL;DR: He wrote a macOS dedup app which finds files with the same contents and tells the filesystem that their contents are identical, so it can save space (using copy-on-write features).

                                                                                                                                                                                He points out it's dangerous, but could be worth it because of the space savings.

                                                                                                                                                                                I wonder if the implementation uses only a hash, or does an additional pass to actually compare the contents and avoid hash-collision issues.

                                                                                                                                                                                It's not open source, so we'll never know. He chose a pay model instead.

                                                                                                                                                                                Also, some files might not be identical yet still have identical blocks; that could be explored too. Other filesystems have that, either in their tooling, or done online, or both.

                                                                                                                                                                                • pca006132 8 hours ago

                                                                                                                                                                                  Is this the dedup function provided by other filesystems?

                                                                                                                                                                                  • coder543 8 hours ago

                                                                                                                                                                                    I think the term to search for is reflink. Btrfs is one example: https://btrfs.readthedocs.io/en/latest/Reflink.html

                                                                                                                                                                                    Like with Hyperspace, you would need to use a tool that can identify which files are duplicates, and then convert them into reflinks.
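
                                                                                                                                                                                    On Btrfs you can make such a copy by hand with coreutils (a sketch):

                                                                                                                                                                                        cp --reflink=always big.img big-copy.img   # shares extents with the original; blocks are only copied on write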

                                                                                                                                                                                    • pca006132 7 hours ago

                                                                                                                                                                                      I thought reflink is provided by the underlying FS, and Hyperspace is a dedup tool that finds the duplicates.

                                                                                                                                                                                      • MBCook 5 hours ago

                                                                                                                                                                                        Hyperspace uses built-in APFS features; it just applies them to existing files.

                                                                                                                                                                                        You only get CoW on APFS if you copy a file with certain APIs or tools.

                                                                                                                                                                                        If a program makes its copies manually, if you copied a duplicate onto your disk from some other source, or if your files already existed on the filesystem when you converted to APFS (because you’ve been carrying them around for a long time), then you’d have duplicates.

                                                                                                                                                                                        APFS doesn’t look for duplicates at any point. It just keeps track of those that it knows are duplicates because of copy operations.

                                                                                                                                                                                        • coder543 7 hours ago

                                                                                                                                                                                          Yes. Hyperspace is finding the identical files and then replacing all but one copy with a reflink copy using the filesystem's reflink functionality.

                                                                                                                                                                                          When you asked about the filesystem, I assumed you were asking about which filesystem feature was being used, since hyperspace itself is not provided by the filesystem.

                                                                                                                                                                                          Someone else mentioned[0] fclones, which can do this task of finding and replacing duplicates with reflinks on more than just macOS, if you were looking for a userspace tool.

                                                                                                                                                                                          [0]: https://news.ycombinator.com/item?id=43173713

                                                                                                                                                                                          • zerd 5 hours ago

                                                                                                                                                                                            You can do the same with `cp -c` on macOS, or `cp --reflink=always` on Linux, if your filesystem supports it.

                                                                                                                                                                                        • kevincox 2 hours ago

                                                                                                                                                                                          Yes, Linux has a system call to do this for any filesystem with reflink support (and it is safe and atomic). You need a "driver" program to identify duplicates, but there are a handful out there. I've used https://github.com/markfasheh/duperemove and was very pleased with how it worked.
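
                                                                                                                                                                                          Typical usage looks something like this (a sketch; see the duperemove man page for details):

                                                                                                                                                                                              sudo duperemove -dhr /data   # -d: submit dedupe requests, -h: human-readable sizes, -r: recurse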

                                                                                                                                                                                        • jbverschoor 7 hours ago

                                                                                                                                                                                          Does it preserve all metadata, extended attributes, and alternate streams/named forks?

                                                                                                                                                                                          • criddell 3 hours ago

                                                                                                                                                                                            The FAQ talks about this a little:

                                                                                                                                                                                            Q: Does Hyperspace preserve file metadata during reclamation?

                                                                                                                                                                                            A: When Hyperspace replaces a file with a space-saving clone, it attempts to preserve all metadata associated with that file. This includes the creation date, modification date, permissions, ownership, Finder labels, Finder comments, whether or not the file name extension is visible, and even resource forks. If the attempt to preserve any of these pieces of metadata fails, then the file is not replaced.

                                                                                                                                                                                            If you find some piece of file metadata that is not preserved, please let us know.

                                                                                                                                                                                            Q: How does Hyperspace handle resource forks?

                                                                                                                                                                                            A: Hyperspace considers the contents of a file’s resource fork to be part of the file’s data. Two files are considered identical only if their data and resource forks are identical to each other.

                                                                                                                                                                                            When a file is replaced by a space-saving clone during reclamation, its resource fork is preserved.

                                                                                                                                                                                            • atommclain 6 hours ago

                                                                                                                                                                                              He spoke to this on "No Longer Very Good", episode 626 of The Accidental Tech Podcast, at timestamp ~1:32:30.

                                                                                                                                                                                              It tries, but there are some things it can't perfectly preserve, like the last access time. In instances where it can't duplicate certain types of extended attributes or ownership permissions, it will not perform the operation.

                                                                                                                                                                                              https://podcasts.apple.com/podcast/id617416468?i=10006919599...

                                                                                                                                                                                              • jbverschoor 5 hours ago

                                                                                                                                                                                                Well, the FAQ also states that people should notify them if some attribute isn't preserved, so it really sounds like it's a predefined list rather than an enumeration of everything.

                                                                                                                                                                                                No word about alternate data streams. I'll pass for now... although it's nice to see how many duplicates you have.

                                                                                                                                                                                            • divan 8 hours ago

                                                                                                                                                                                              What are the potential risks or problems of converting duplicates into APFS clones like this?

                                                                                                                                                                                              • captn3m0 8 hours ago

                                                                                                                                                                                                The linked docs cover this in detail.

                                                                                                                                                                                              • jarbus 8 hours ago

In my experience, Macs use up a ridiculous amount of "System" storage, for no discernible reason, that users can't delete. I've grown tired of family members asking me to help them free up storage that I can't even find. That's the major issue from what I've seen; unless this app prevents Apple from deliberately eating up 50%+ of a machine's storage, it doesn't do much for the people I know.

                                                                                                                                                                                                • p_ing 7 hours ago

                                                                                                                                                                                                  These are often Time Machine snapshots. Nuking those can free up quite a bit of space.

                                                                                                                                                                                                      sudo tmutil listlocalsnapshots /
                                                                                                                                                                                                      sudo tmutil deletelocalsnapshots <date_value_of_snapshot>
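If I remember the output format right, the first command prints snapshot names like `com.apple.TimeMachine.2024-05-01-123456.local`, and you pass just the date portion to the delete command, e.g.:

    sudo tmutil deletelocalsnapshots 2024-05-01-123456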
                                                                                                                                                                                                  • Jaxan 4 hours ago

Even without Time Machine, loads of storage is spent on “System”. Especially now with Apple Intelligence (even when it's turned off).

                                                                                                                                                                                                    • p_ing 2 hours ago

                                                                                                                                                                                                      Apple "Intelligence" gets its own category in 15.3.1.

                                                                                                                                                                                                  • ezfe 7 hours ago

There's no magic around it; macOS just doesn't do a good job explaining it with the built-in tools. Just use DaisyDisk or something. It's all there and can be examined.

                                                                                                                                                                                                  • rusinov 3 hours ago

John is a legend.

                                                                                                                                                                                                    • Analemma_ 8 hours ago

                                                                                                                                                                                                      In earlier episodes of ATP when they were musing on possible names, one listener suggested the frankly amazing "Dupe Nukem". I get that this is a potential IP problem, which is why John didn't use it, but surely Duke Nukem is not a zealously-defended brand in 2025. I think interest in that particular name has been stone dead for a while now.

                                                                                                                                                                                                      • InsideOutSanta 8 hours ago

                                                                                                                                                                                                        It's a genius name, but Gearbox owns Duke Nukem. They're not exactly dormant. Duke Nukem as a franchise made over a billion in revenue. In 2023, Zen released a licensed Duke Nukem pinball table, so there is at least some ongoing interest in the franchise.

                                                                                                                                                                                                        I probably wouldn't have risked it, either.

                                                                                                                                                                                                        • mzajc 7 hours ago

                                                                                                                                                                                                          Reminds me of Avira's Luke Filewalker - I wonder if they needed any special agreement with Lucasfilm/Disney. I couldn't find any info on it, and their website doesn't mention Star Wars at all.

                                                                                                                                                                                                        • andrewla 7 hours ago

                                                                                                                                                                                                          Many comments here offering similar solutions based on hardlinks or symlinks.

This uses a specific feature of APFS that allows the creation of copy-on-write clones. [1] If a clone is written to, it is copied on demand and the original file is unmodified. This is distinct from the behavior of hardlinks or symlinks; see the quick Terminal sketch after the link below.

                                                                                                                                                                                                          [1] https://en.wikipedia.org/wiki/Apple_File_System#Clones
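A minimal way to see the difference, using macOS's `cp -c` (the same call quoted from the fclones source further down; `file1.bin` is just a placeholder):

    cp file1.bin copy.bin      # regular copy: duplicates the data on disk
    cp -c file1.bin clone.bin  # APFS clone: new file entry, shares the original's blocks

    # Appending to the clone copies only the blocks being modified;
    # file1.bin is untouched (a hardlink would expose the change in both names)
    printf 'x' >> clone.bin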

                                                                                                                                                                                                        • DontBreakAlex 6 hours ago

Nice, but I'm not getting a subscription for a filesystem utility. Had it been a one-time $5 license, I would have bought it. At the current price, it's literally cheaper to put the files in an S3 bucket or outright buy an SSD.

                                                                                                                                                                                                          • dewey 5 hours ago

They had long discussions about the pricing on the podcast the author is a part of (atp.fm). It went through a few iterations: a one-time purchase, a fee for each time you free up space, and a subscription. There will always be people unhappy with any of these choices.

                                                                                                                                                                                                            Edit: Apparently both is possible in the end: https://hypercritical.co/hyperspace/#purchase

                                                                                                                                                                                                            • mrguyorama 3 hours ago

Who would be unhappy with $5, owned forever? Other than the author, of course, for making less money.

                                                                                                                                                                                                              • criddell 3 hours ago

                                                                                                                                                                                                                People who want the app to stick around and continue to be developed.

                                                                                                                                                                                                                I worry about that with Procreate. It feels like it's priced too low to be sustainable.

                                                                                                                                                                                                            • dewey 5 hours ago

                                                                                                                                                                                                              > Two kinds of purchases are possible: one-time purchases and subscriptions.

                                                                                                                                                                                                              https://hypercritical.co/hyperspace/#purchase

                                                                                                                                                                                                              • pmarreck 5 hours ago

                                                                                                                                                                                                                Claude 3.7 just rewrote the whole thing (just based on reading the webpage description) as a commandline app for me, so there's that.

                                                                                                                                                                                                                And because it has no Internet access yet (and because I prompted it to use a workaround like this in that circumstance), the first thing it asked me to do (after hallucinating the functionality first, and then catching itself) was run `curl https://hypercritical.co/hyperspace/ | sed 's/<[^>]*>//g' | grep -v "^$" | clip`

                                                                                                                                                                                                                ("clip" is a bash function I wrote to pipe things onto the clipboard or spit them back out in a cross-platform linux/mac way)

    clip() {
      # Pipe in to copy, call with no stdin to paste; works on macOS and Linux.
      if command -v pbcopy > /dev/null; then
        # macOS: stdin is a terminal -> paste; stdin is a pipe -> copy
        [ -t 0 ] && pbpaste || pbcopy
      elif command -v xclip > /dev/null; then
        # Linux/X11 equivalent, using the CLIPBOARD selection
        [ -t 0 ] && xclip -o -selection clipboard || xclip -selection clipboard
      else
        echo "clip function error: Neither pbcopy/pbpaste nor xclip are available." >&2
        return 1
      fi
    }
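Usage, for reference:

    echo "hello" | clip   # copy: anything piped in goes to the clipboard
    clip > pasted.txt     # paste: no stdin, so the clipboard is written out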
                                                                                                                                                                                                                • criddell 3 hours ago

                                                                                                                                                                                                                  I think it's priced reasonably. A one-time $5 license wouldn't be sustainable.

                                                                                                                                                                                                                  Since it's the kind of thing you will likely only need every couple of years, $10 each time feels fair.

                                                                                                                                                                                                                  If putting all your data online or into an SSD makes more sense, then this app isn't for you and that's okay too.

                                                                                                                                                                                                                  • jacobp100 3 hours ago

                                                                                                                                                                                                                    The price does seem very high. It’s probably a niche product and I’d imagine developers are the ones who would see the biggest savings. Hopefully it works out for them

                                                                                                                                                                                                                    • botanical76 5 hours ago

                                                                                                                                                                                                                      I can't even find the price anywhere. Do you have to install the software to see it?

                                                                                                                                                                                                                      • sbarre 5 hours ago

The Mac App Store page has the pricing at the bottom, in the In-App Purchases section.

                                                                                                                                                                                                                        TL;DR - $49 for a lifetime subscription, or $19/year or $9/month.

                                                                                                                                                                                                                        It could definitely be easier to find.

                                                                                                                                                                                                                      • benced 6 hours ago

                                                                                                                                                                                                                        "I don't value software but that's not a respectable opinion so I'll launder that opinion via subscriptions"

                                                                                                                                                                                                                        • DontBreakAlex 2 hours ago

Well, I do value software; I'm paid $86/h to write some! I just find that for $20/year or $50 one-time, you can get way more than 12 GB of drive space. I also don't think this piece of software requires so much maintenance that it wouldn't be worth making at a lower price. I'm not saying it's bad software, it's really great, just too expensive... Personally, my gut feeling is that the dev would have had more sales at a one-time $5, and made more money overall.

                                                                                                                                                                                                                        • amelius 6 hours ago

                                                                                                                                                                                                                          There are several such tools for Linux, and they are free, so maybe just change operating systems.

                                                                                                                                                                                                                          • augusto-moura 6 hours ago

I'm pretty sure some of them also work on macOS. rmlint[1], for example, can output a script that reflinks duplicates (or runs any command you like on each pair of files):

                                                                                                                                                                                                                              rmlint -c sh:handler=reflink .
                                                                                                                                                                                                                            
I'm not sure if reflink works out of the box on macOS, but you can supply your own command that just clones both files; see the sketch after the link below.

                                                                                                                                                                                                                            [1]: https://github.com/sahib/rmlint
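An untested sketch of that idea, assuming I'm remembering rmlint's `sh:cmd` handler correctly (it substitutes your own command for each duplicate/original pair; check `man rmlint` for the exact argument order before trusting this):

    # Have rmlint emit a script that replaces each duplicate with an APFS
    # clone of its original via macOS's cp -c, then review and run it
    rmlint -c sh:cmd='cp -c "$2" "$1"' .
    sh ./rmlint.sh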

                                                                                                                                                                                                                          • dewey 5 hours ago

                                                                                                                                                                                                                            I don't think either of them supports APFS deduplication though?

                                                                                                                                                                                                                        • siranachronist 8 hours ago

https://github.com/pkolaczk/fclones can do the same thing, and it's perfectly free and open source. Terminal-based, though.

                                                                                                                                                                                                                          • rahimnathwani 6 hours ago

                                                                                                                                                                                                                            Hyperspace said I can save 10GB.

                                                                                                                                                                                                                            But then I ran this command and saved over 20GB:

                                                                                                                                                                                                                              brew install fclones
                                                                                                                                                                                                                              cd ~
                                                                                                                                                                                                                              fclones group . | fclones dedupe
                                                                                                                                                                                                                            
                                                                                                                                                                                                                            I've used fclones before in the default mode (create hard links) but this is the first time I've run it at the top level of my home folder, in dedupe mode (i.e. using APFS clones). Fingers crossed it didn't wreck anything.
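(If memory serves, fclones also has a dry-run mode that prints what would happen without touching anything; worth checking `fclones dedupe --help` before a run like this:)

    fclones group . | fclones dedupe --dry-run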
                                                                                                                                                                                                                            • CharlesW 7 hours ago

                                                                                                                                                                                                                              [I was wrong, see below.—cw] It doesn't do the same thing. An APFS clone/copy-on-write clone is not the same as a hard or soft link. https://eclecticlight.co/2019/01/05/aliases-hard-links-symli...

                                                                                                                                                                                                                              • PenguinRevolver 7 hours ago

                                                                                                                                                                                                                                Your source points out that:

> You can also create [APFS (copy on write) clones] in Terminal using the command `cp -c oldfilename newfilename` where the c option requires cloning rather than a regular copy.

                                                                                                                                                                                                                                `fclones dedupe` uses the same command[1]:

    if cfg!(target_os = "macos") {
        result.push(format!("cp -c {target} {link}"));
    }
                                                                                                                                                                                                                                
                                                                                                                                                                                                                                [1] https://github.com/pkolaczk/fclones/blob/555cde08fde4e700b25...
                                                                                                                                                                                                                                • CharlesW 7 hours ago

                                                                                                                                                                                                                                  I stand corrected, thank you!

                                                                                                                                                                                                                              • PenguinRevolver 8 hours ago

                                                                                                                                                                                                                                  brew install fclones
                                                                                                                                                                                                                                
                                                                                                                                                                                                                                Thanks for the recommendation! Just installed it via homebrew.
                                                                                                                                                                                                                                • diimdeep 6 hours ago

Nice. Compression at the file system level can also save a lot of space, and with current CPU speeds it's completely transparent. It's a feature from HFS+ that still works in APFS but is no longer officially supported. What is wrong with you, Apple?

This tool to enable compression is free and open source:

                                                                                                                                                                                                                                  https://github.com/RJVB/afsctool
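From memory, usage is something like this (check `afsctool -h` first; the path is just an example):

    # Compress a folder in place with transparent filesystem compression
    # (-c = compress, -9 = strongest zlib level)
    afsctool -c -9 ~/Projects/big-folder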

Also note, about APFS vs HFS+: if you use an HDD, e.g. as backup media for Time Machine, HFS+ is a must-have over APFS, since APFS is optimised only for SSDs (random access).

                                                                                                                                                                                                                                  https://bombich.com/blog/2019/09/12/analysis-apfs-enumeratio...

                                                                                                                                                                                                                                  https://larryjordan.com/blog/apfs-is-not-yet-ready-for-tradi...

The not-so-smart Time Machine setup utility forcibly re-creates APFS on HDD media, so you have to manually create an HFS+ volume (e.g. with Disk Utility) and then use a terminal command to add that volume as the TM destination:

                                                                                                                                                                                                                                  `sudo tmutil setdestination /Volumes/TM07T`

                                                                                                                                                                                                                                • eikenberry 6 hours ago

I don't understand why a simple, closed-source de-dup app is at the top of the front page with 160+ comments. What is so interesting about it? I read the blog and the comments here and I still don't get it.

                                                                                                                                                                                                                                  • therockhead 5 hours ago

                                                                                                                                                                                                                                    I assume it’s because it’s from John Siracusa, a long-time Mac enthusiast, blogger, and podcaster. If you listen to him on ATP, it’s hard not to like him, and anything he does is bound to get more than the usual upvotes on HN.

                                                                                                                                                                                                                                    • benced 6 hours ago

                                                                                                                                                                                                                                      The developer is popular and APFS cloning is genuinely technically interesting.

                                                                                                                                                                                                                                      (no, it's not a symlink)

                                                                                                                                                                                                                                      • augusto-moura 5 hours ago

COW filesystems are older than macOS, so no surprises for me. Maybe people just aren't that aware of them?

                                                                                                                                                                                                                                        • ForOldHack 5 hours ago

CoW - Copy on Write. Most probably first on older mainframes (actually, newer mainframes).

                                                                                                                                                                                                                                          "CoW is used as the underlying mechanism in file systems like ZFS, Btrfs, ReFS, and Bcachefs"

                                                                                                                                                                                                                                          Obligatory: https://en.wikipedia.org/wiki/Copy-on-write

                                                                                                                                                                                                                                    • david_allison 6 hours ago

                                                                                                                                                                                                                                      > Hyperspace can’t be installed on “Macintosh HD” because macOS version 15 or later is required.

macOS 15 was released in September 2024; this feels far too soon to drop support for older versions.

                                                                                                                                                                                                                                      • tobr 6 hours ago

                                                                                                                                                                                                                                        Can it really be seen as deprecating an old version when it’s a brand new app?

                                                                                                                                                                                                                                        • borland 6 hours ago

                                                                                                                                                                                                                                          +1. He's not taking anything away because you never had it.

                                                                                                                                                                                                                                          • johnmaguire 6 hours ago

                                                                                                                                                                                                                                            I'm a bit confused as the Mac App Store says it's over 4 years old.

                                                                                                                                                                                                                                            • furyofantares 6 hours ago

                                                                                                                                                                                                                                              The 4+ Age rating is like, who can use the app. Not for 3 year olds, apparently.

                                                                                                                                                                                                                                              • throwanem 5 hours ago

                                                                                                                                                                                                                                                I feel like that's true for most of the relatively low-level disk and partition management tooling. As unpopular an opinion as it may lately be around here, I'm enough of a pedagogical traditionalist to remain convinced that introductory logical volume management is best left at least till kindergarten.

                                                                                                                                                                                                                                                • heywoods 6 hours ago

                                                                                                                                                                                                                                                  Despite knowing this is the correct interpretation, I still consistently make the same incorrect interpretation as the parent comment. It would be nice if they made this more intuitive. Glad I’m not the only one that’s made that mistake.

                                                                                                                                                                                                                                                  • pmarreck 5 hours ago

The way they specify this has always confused me, because I actually care more about how old the app is than what age range it's aimed at.

                                                                                                                                                                                                                                              • kstrauser 6 hours ago

                                                                                                                                                                                                                                                He wanted to write it in Swift 6. Does it support older OS versions?

                                                                                                                                                                                                                                                • jjcob 6 hours ago

                                                                                                                                                                                                                                                  Swift 6 is not the problem. It's backward compatible.

                                                                                                                                                                                                                                                  The problem is SwiftUI. It's very new, still barely usable on the Mac, but they are adding lots of new features every macOS release.

If you want to support older versions of macOS, you can't use the nice stuff they just released. E.g. pointerStyle() is a brand-new macOS 15 API that is very useful.

                                                                                                                                                                                                                                                  • MBCook 6 hours ago

                                                                                                                                                                                                                                                    I can’t remember for sure but there may also have been a recent file system API he said he needed. Or a bug that he had to wait for a fix on.

                                                                                                                                                                                                                                                    • therockhead 5 hours ago

It's been a while since I last looked at SwiftUI on Mac. Is it really still that bad?

                                                                                                                                                                                                                                                      • jjcob 4 hours ago

                                                                                                                                                                                                                                                        It's not bad, just limited. I think it's getting usable, but just barely so.

                                                                                                                                                                                                                                                        They are working on it, and making it better every year. I've started using it for small projects and it's pretty neat how fast you can work with it -- but not everything can be done yet.

                                                                                                                                                                                                                                                        Since they are still adding pretty basic stuff every year, it really hurts if you target older versions. AppKit is so mature that for most people it doesn't matter if you can't use new features introduced in the last 3 years. For SwiftUI it still makes a big difference.

                                                                                                                                                                                                                                                        • therockhead 4 hours ago

I wonder why they haven't tried to back-port SwiftUI improvements/versions to older OS releases. It seems like that should have been possible.

                                                                                                                                                                                                                                                  • ryandrake 6 hours ago

                                                                                                                                                                                                                                                    Came here to post the same thing. Would love to try the application, but I guess not if the developer is deliberately excluding my device (which cannot run the bleeding edge OS).

                                                                                                                                                                                                                                                    • kstrauser an hour ago

In fairness, I don't think you can describe it as bleeding edge when we're 5 months into the annual 12-month upgrade cycle. It's recent, but not exactly an early-adopter version at this point.

                                                                                                                                                                                                                                                      • wpm 6 hours ago

                                                                                                                                                                                                                                                        The developer deliberately chose to write it in Swift 6. Apple is the one who deliberately excluded Swift 6 from your device.

                                                                                                                                                                                                                                                        • ryandrake 5 hours ago

                                                                                                                                                                                                                                                          Yea, too bad :( Everyone involved with macOS and iOS development seems to be (intentionally or unintentionally) keeping us on the hardware treadmill.

                                                                                                                                                                                                                                                          • ForOldHack 5 hours ago

Expensive. Keeping us on the expensive hardware treadmill. My guess is that it cannot be listed in the Apple store unless it's only for Macs released in the last 11 months.

                                                                                                                                                                                                                                                    • herrkanin 8 hours ago

                                                                                                                                                                                                                                                      As a web dev, it’s been fun listening to Accidental Tech Podcast where Siracusa has been talking (or ranting) about the ins and outs of developing modern mac apps in Swift and SwiftUI.

                                                                                                                                                                                                                                                      • Analemma_ 8 hours ago

                                                                                                                                                                                                                                                        The part where he said making a large table in HTML and rendering it with a web view was orders of magnitude faster than using the SwiftUI native platform controls made me bash my head against my desk a couple times. What are we doing here, Apple.

                                                                                                                                                                                                                                                        • BobAliceInATree 7 hours ago

                                                                                                                                                                                                                                                          SwiftUI is a joke when it comes to performance. Even Marco's Overcast stutters when displaying a table of a dozen rows (of equal height).

That being said, it's not quite an apples-to-apples comparison, because SwiftUI or UIKit can work with basically an infinite number of rows, whereas HTML will eventually get to a point where it won't load.

                                                                                                                                                                                                                                                          • wpm 6 hours ago

                                                                                                                                                                                                                                                            I love the new Overcast's habit of mistaking my scroll gestures for taps when browsing the sections of a podcast.

                                                                                                                                                                                                                                                          • airstrike 7 hours ago

                                                                                                                                                                                                                                                            Shoutout to iced, my favorite GUI toolkit, which isn't even in 1.0 yet but can do that with ease and faster than anything I've ever seen: https://github.com/iced-rs/iced

https://github.com/tarkah/iced_table is a third-party widget for tables, but you can roll your own or use other alternatives too.

It's in Rust, not Swift, but I think switching from the latter to the former is easier than moving away from many other popular languages.

                                                                                                                                                                                                                                                            • megaman821 7 hours ago

I wish there were modern benchmarks against browser engines. A long time ago, native apps were much faster at rendering UI than the browser, but that was many performance rewrites ago, so I wonder how browsers perform now.

                                                                                                                                                                                                                                                              • mohsen1 8 hours ago

Hacker News loves to hate Electron apps. In my experience ChatGPT on Mac (which I assume is fully native) is nearly impossible to use because I have a lot of large chats in my history, but the website works much better and faster. The ChatGPT website packaged in Electron would've been much better. In fact, I am using a Chrome "PWA app" for ChatGPT now instead of the native app.

                                                                                                                                                                                                                                                                • wat10000 7 hours ago

                                                                                                                                                                                                                                                                  It's possible to make bad apps with anything. The difference is that, as far as I can tell, it's not possible to make good apps with Electron.

                                                                                                                                                                                                                                                                  • avtar 5 hours ago

                                                                                                                                                                                                                                                                    > In my experience ChatGPT on Mac (which I assume is fully native)

                                                                                                                                                                                                                                                                    If we are to believe ChatGPT itself: "The ChatGPT macOS desktop app is built using Electron, which means it is primarily written in JavaScript, HTML, and CSS"

                                                                                                                                                                                                                                                                    • RandomDistort 7 hours ago

Someone more experienced than me could probably comment on this more, but theoretically, is it possible for Electron production builds to become more efficient by having a much longer build process and stripping out all the unnecessary parts of Chromium?

                                                                                                                                                                                                                                                                    • spiderfarmer 7 hours ago

As a web dev I must say that this thread made me happy and thankful for the browser teams that really know how to optimize.

                                                                                                                                                                                                                                                                  • dewey 5 hours ago

For those mentioning that there's no price listed: it's not that easy, as in the App Store the price varies by country. You can open the App Store link and then look at "In App Purchases" though.

                                                                                                                                                                                                                                                                    For me on the German store it looks like this:

                                                                                                                                                                                                                                                                        Unlock for One Year 22,99 €
                                                                                                                                                                                                                                                                        Unlock for One Month 9,99 €
                                                                                                                                                                                                                                                                        Lifetime Unlock 59,99 €
                                                                                                                                                                                                                                                                    
So it supports both one-time purchases and subscriptions, depending on what you prefer. More about that here: https://hypercritical.co/hyperspace/#purchase

                                                                                                                                                                                                                                                                    • the_clarence 6 hours ago

It's interesting how Linux tools are all free while even trivial Mac tools are being sold. Nothing against someone trying to monetize, but the Linux culture sure is nice!

                                                                                                                                                                                                                                                                      • dewey 5 hours ago

                                                                                                                                                                                                                                                                        It's not that nice to call someone's work they spent months on "trivial" without knowing anything about the internals and what they ran into.

                                                                                                                                                                                                                                                                        • MadnessASAP 5 hours ago

I don't think they meant it in a disparaging way, except maybe against Apple. More so that filesystems that support deduplication typically include a deduplication tool in their standard suite of FS tools. I too find it odd that Apple does not do this.

                                                                                                                                                                                                                                                                      • undefined 7 hours ago
[deleted]

                                                                                                                                                                                                                                                                        • twp 5 hours ago

                                                                                                                                                                                                                                                                          CLI tool to find duplicate files unbelievably quickly:

                                                                                                                                                                                                                                                                          https://github.com/twpayne/find-duplicates

                                                                                                                                                                                                                                                                          • 999900000999 7 hours ago

A $20 one-year license for something that probably has a FOSS equivalent on Linux...

However, considering Apple will never ever ever allow user-replaceable storage on a laptop, this might be worth it.

                                                                                                                                                                                                                                                                            • p_ing 7 hours ago

                                                                                                                                                                                                                                                                              The developer does need to make up for the $100 yearly privilege of publishing the app to the App Store.

                                                                                                                                                                                                                                                                              • jeroenhd 7 hours ago

I have yet to see a GUI variant of deduplication software for Linux. There are plenty of command-line tools, which probably could be ported to macOS, but as far as I know there's no user-friendly tool to just click through.

There's value in convenience. I wouldn't pay for a yearly license (though that price seems more than fair as a "version lifetime" price to me), but seeing as this tool will probably need constant maintenance as Apple tweaks and changes APFS over time, combined with the mandatory Apple taxes for publishing software like this, it's not too awful.

                                                                                                                                                                                                                                                                                • 999900000999 7 hours ago

$50 for a lifetime license.

Which really means until the dev gets bored, which can be as short as 18 months.

I wouldn't mind something like this versioned to the OS: $20 for the current OS, and $10 for every significant update.

                                                                                                                                                                                                                                                                                  • artimaeis 2 hours ago

                                                                                                                                                                                                                                                                                    The Mac App Store (and all of Apple's App Stores) doesn't enable this sort of licensing. It's exactly the sort of thing that drives a lot of developers to independent distribution.

That's why we see so many more subscription-based apps these days: application development is an ongoing process with ongoing costs, so it needs ongoing income. The traditional buy-it-once app pricing doesn't enable that long-term development and support, but the App Store does support subscriptions, so that's what we get.

I really think Siracusa came up with a clever pricing scheme here, given his desire to use the App Store for distribution.

                                                                                                                                                                                                                                                                                    • 999900000999 2 hours ago

                                                                                                                                                                                                                                                                                      Okay I stand corrected.

                                                                                                                                                                                                                                                                                • ezfe 7 hours ago

The cost reflects the fact that people won't use it regularly. The developer is offering lifetime unlocks, lower-cost tiers for shorter timeframes, etc.

                                                                                                                                                                                                                                                                                • svilen_dobrev 7 hours ago

                                                                                                                                                                                                                                                                                  $ rmlint -c sh:link -L -y s -p -T duplicates

will produce a script which, if run, will hardlink duplicates.

                                                                                                                                                                                                                                                                                  • Analemma_ 7 hours ago

                                                                                                                                                                                                                                                                                    That's not what this app is doing though. APFS clones are copy-on-write pointers to the same data, not hardlinks.

                                                                                                                                                                                                                                                                                    • phiresky 7 hours ago

                                                                                                                                                                                                                                                                                      If you replace `sh:link` with `sh:clone` instead, it will.

                                                                                                                                                                                                                                                                                      > clone: reflink-capable filesystems only. Try to clone both files with the FIDEDUPERANGE ioctl(3p) (or BTRFS_IOC_FILE_EXTENT_SAME on older kernels). This will free up duplicate extents while preserving the metadata of both. Needs at least kernel 4.2.
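
For the curious, a minimal sketch of what that clone step boils down to on Linux (not rmlint's actual code; assumes a reflink-capable filesystem such as Btrfs or XFS, takes hypothetical file arguments, and skips most error handling):

    /* dedupe.c - share extents between two identical files via the
     * FIDEDUPERANGE ioctl (Linux, reflink-capable filesystems). */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s SRC DEST\n", argv[0]);
            return 1;
        }
        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_RDWR);
        struct stat st;
        if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
            perror("open/fstat");
            return 1;
        }

        /* One request with a single destination range covering the whole
         * file. Real tools loop in chunks, since filesystems cap the
         * length handled per call. */
        struct file_dedupe_range *req =
            calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
        req->src_offset = 0;
        req->src_length = st.st_size;
        req->dest_count = 1;
        req->info[0].dest_fd = dst;
        req->info[0].dest_offset = 0;

        /* The kernel compares the ranges byte-for-byte before sharing
         * extents, so a racing modification can't corrupt anything. */
        if (ioctl(src, FIDEDUPERANGE, req) < 0) {
            perror("FIDEDUPERANGE");
            return 1;
        }
        printf("deduped %llu bytes (status %d)\n",
               (unsigned long long)req->info[0].bytes_deduped,
               req->info[0].status);
        return 0;
    }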

                                                                                                                                                                                                                                                                                    • wpm 5 hours ago

                                                                                                                                                                                                                                                                                      On Linux

                                                                                                                                                                                                                                                                                    • sir_eliah 8 hours ago

                                                                                                                                                                                                                                                                                      There's a cross-platform open-source version of this program: https://github.com/qarmin/czkawka

                                                                                                                                                                                                                                                                                      • nulld3v 7 hours ago

I don't think czkawka supports deduplication via reflink, so it's not exactly the same thing. fclones, as linked by another user, is more similar: https://news.ycombinator.com/item?id=43173713

                                                                                                                                                                                                                                                                                        • spiderfarmer 7 hours ago

                                                                                                                                                                                                                                                                                          That’s not remotely comparable.

                                                                                                                                                                                                                                                                                        • NoToP 8 hours ago

The fact that copying doesn't copy seems dangerous. What if I wanted to copy for the purpose of modifying the file while retaining the original? A trivial example: I have a meme template and I want to write text on it while still keeping a blank copy of the template.

There's a place for alias file pointers, but lying to the user and pretending an alias is a copy is bound to lead to unintended and confusing results.

                                                                                                                                                                                                                                                                                          • IsTom 8 hours ago

Copy-on-write means the copy is only performed when you make the first change (and only the part that changes is copied; the rest is shared with the original file). Until then, copying is free.
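
For concreteness, a minimal sketch of what that looks like at the syscall level on macOS/APFS (file names are hypothetical, error handling abbreviated):

    /* clone.c - "copy" a file instantly on APFS with clonefile(2).
     * No data blocks are duplicated until one side is modified. */
    #include <stdio.h>
    #include <sys/clonefile.h>

    int main(void)
    {
        /* Completes in O(1) regardless of file size. */
        if (clonefile("template.png", "meme.png", 0) != 0) {
            perror("clonefile");
            return 1;
        }
        /* Editing meme.png from here on triggers copy-on-write: only the
         * changed blocks get new storage, and template.png is untouched.
         * Finder's Duplicate and `cp -c` use the same mechanism on APFS. */
        return 0;
    }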

                                                                                                                                                                                                                                                                                            • mlhpdx 7 hours ago

                                                                                                                                                                                                                                                                                              Is it file level or block level copy? The latter, I hope.

                                                                                                                                                                                                                                                                                              Update: whoops, missed it in your comment. Block (changed bytes) level.

                                                                                                                                                                                                                                                                                            • herrkanin 8 hours ago

                                                                                                                                                                                                                                                                                              It’s not a symbolic link - it copies on modification. No need to worry!

                                                                                                                                                                                                                                                                                              • pca006132 8 hours ago

                                                                                                                                                                                                                                                                                                CoW is not aliasing. It will perform the actual copying when you modify the file content.

                                                                                                                                                                                                                                                                                                • hutattedonmyarm 8 hours ago

It's copy-on-write. When you modify either one, it does get turned into an actual copy.

                                                                                                                                                                                                                                                                                                  • timabdulla 8 hours ago

                                                                                                                                                                                                                                                                                                    It's copy on write.

                                                                                                                                                                                                                                                                                                    • diimdeep 6 hours ago

"Requires macOS 15.0 or later." – Oh god, this is so stupid, and the most irritating thing about macOS "application development".

It is really unfair to call it "software"; it is more like "glued to the most recent OS version"-ware. Meanwhile, I can still run a .exe compiled in 2006, and with Wine even on Mac or Linux.

                                                                                                                                                                                                                                                                                                      • kstrauser an hour ago

                                                                                                                                                                                                                                                                                                        However, you can't run an app targeted for Windows 11 on Windows XP. How unfair is that? Curse you, Microsoft.

                                                                                                                                                                                                                                                                                                      • archagon 5 hours ago

                                                                                                                                                                                                                                                                                                        I have to confess: it miffs me that a utility that would normally fly completely under the radar is likely to make the creator thousands of dollars just because he runs a popular podcast. (Am I jealous? Oh yes. But only because I tried to sell similar apps in the past and could barely get any downloads no matter how much I marketed them. Selling software without an existing network seems nigh-on impossible these days.)

                                                                                                                                                                                                                                                                                                        Anyway, congrats to Siracusa on the release, great idea, etc. etc.

                                                                                                                                                                                                                                                                                                        • dewey 4 hours ago

I can understand your criticism, as it's easy to arrive at that conclusion (also a common occurrence when levelsio launches a new product, as his Twitter following is large), but it's not fair to discount it as "just because he runs a popular podcast".

The author has been a "household" name in the macOS / Apple scene for a long time, since even before the podcast. Someone who has spent their life blogging about all things Apple on outlets like Ars Technica and consistently putting out new content on podcasts for decades will naturally have better distribution.

                                                                                                                                                                                                                                                                                                          How many years did you spend on building up your marketing and distribution reach?

                                                                                                                                                                                                                                                                                                          • archagon 4 hours ago

                                                                                                                                                                                                                                                                                                            I know! I actually like him and wish him the best. I just get a bit annoyed when one of the ATP folks releases some small utility with an unclear niche and then later talks about how they've "merely" earned thousands of dollars from it. When I was an app developer, I would have counted myself lucky to have made just a hundred bucks from a similar release. The gang's popularity gives them a distorted view of the market sometimes, IMHO.

                                                                                                                                                                                                                                                                                                        • gnomesteel 8 hours ago

I don't need this, storage is cheap, but I'm glad it exists.

                                                                                                                                                                                                                                                                                                          • ttoinou 8 hours ago

Storage isn't cheap on Macs though. One has to pay $2k to get an 8 TB SSD.

                                                                                                                                                                                                                                                                                                            • bob1029 7 hours ago

                                                                                                                                                                                                                                                                                                              Storage comes in many forms. It doesn't need to be soldered to the mainboard to satisfy most use cases.

                                                                                                                                                                                                                                                                                                              • ttoinou 7 hours ago

But cleaning up / making space on your main soldered drive, where the OS lives, is quite important.

                                                                                                                                                                                                                                                                                                          • ziofill 6 hours ago

                                                                                                                                                                                                                                                                                                            Lovely idea, but way too expensive for me.

                                                                                                                                                                                                                                                                                                            • ZedZark 7 hours ago

I did this with two scripts: one that produces and caches SHA-1 sums of files, and another that consumes the output of the first (or of any of the *sum programs) and produces stats about duplicate files, with options to delete or hard-link them.
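
For the curious, a rough sketch of what the core of that second script could look like (not the actual script; assumes SHA-1 "HASH  PATH" lines on stdin, e.g. from `find . -type f -exec shasum {} +`):

    /* dupes.c - read "HASH  PATH" lines on stdin, group by hash,
     * and print files whose hash matches an earlier one. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp(const void *a, const void *b)
    {
        return strcmp(*(char *const *)a, *(char *const *)b);
    }

    int main(void)
    {
        char **lines = NULL;
        size_t n = 0, cap = 0;
        char buf[4096];

        while (fgets(buf, sizeof buf, stdin)) {
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                lines = realloc(lines, cap * sizeof *lines);
            }
            lines[n++] = strdup(buf);
        }

        /* Sorting groups identical hashes together, since the hash is
         * the line prefix. */
        qsort(lines, n, sizeof *lines, cmp);

        for (size_t i = 1; i < n; i++) {
            /* Two files are duplicates if the 40-hex-char SHA-1 fields match. */
            if (strncmp(lines[i], lines[i - 1], 40) == 0)
                printf("dup: %s", lines[i]);
        }
        return 0;
    }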

                                                                                                                                                                                                                                                                                                              • strunz 7 hours ago

I wonder how many comments about hard links there will be in this thread from people misunderstanding what this app does.

                                                                                                                                                                                                                                                                                                                • theamk 6 hours ago

If a file is not going to be modified (in the low-level sense, open("w") on the filename, as opposed to rename-and-create-new), then reflinks (what this app does) and hardlinks act somewhat identically.

For example, if you have multiple node_modules directories, app installs, source photos/videos (ones you don't edit), or music archives, then hardlinks work just fine.
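
A small sketch of where the two do diverge (macOS/APFS assumed, hypothetical file names, error handling omitted): an in-place write shows through a hardlink but not through a clone:

    /* linkvsclone.c - contrast hardlink and APFS clone semantics. */
    #include <stdio.h>
    #include <sys/clonefile.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f = fopen("original.txt", "w");
        fputs("version 1\n", f);
        fclose(f);

        link("original.txt", "hard.txt");          /* same inode              */
        clonefile("original.txt", "clone.txt", 0); /* new inode, shared blocks */

        /* Overwrite the original in place (open("w") on the same name). */
        f = fopen("original.txt", "w");
        fputs("version 2\n", f);
        fclose(f);

        /* hard.txt now reads "version 2": it is the same inode.
         * clone.txt still reads "version 1": the write went to fresh
         * blocks, and the clone kept its own view of the old ones. */
        return 0;
    }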