Teaching advanced storage right now.
As I re-read the content to refamiliarize myself I came across this description:
"Deduplication eliminates redundant data blocks. When you create multiple copies of the
same data, VDO detects the duplicate data blocks and updates the metadata to use those
duplicate blocks as references to the original data block without creating redundant data."
This sounds a little like creating a hard link to an inode. When two files have different names but refer to the same data, you just use the same data block for both.
I suspect the mechanics and specific technology used to achieve this are different, but conceptually... am I on the right track?
Trying to ease the cognitive load on my students by making references to things they've already learned.
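For what it's worth, the "detect the duplicate, keep a reference" idea can be sketched in a few lines. This is a toy illustration only, not how VDO is actually implemented (VDO fingerprints 4 KiB blocks beneath the filesystem using its UDS index); the class and names here are made up for the example.

```python
import hashlib

class BlockStore:
    """Toy dedup store: identical blocks are kept once and reference-counted."""
    def __init__(self):
        self.blocks = {}    # fingerprint -> actual block data
        self.refcount = {}  # fingerprint -> number of references

    def write(self, data: bytes) -> str:
        fp = hashlib.sha256(data).hexdigest()
        if fp not in self.blocks:
            self.blocks[fp] = data              # first copy: store the data
        self.refcount[fp] = self.refcount.get(fp, 0) + 1
        return fp                               # callers keep only the reference

store = BlockStore()
a = store.write(b"the same 4 KiB of data")
b = store.write(b"the same 4 KiB of data")      # duplicate detected
assert a == b                                   # both "copies" share one block
assert len(store.blocks) == 1
assert store.refcount[a] == 2
```

So yes, conceptually it rhymes with hard links: multiple references, one copy of the data.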
That's a very nice visual. It might very well accidentally-on-purpose show up in my discussions with students at some time in the near future...
It really helped me crystallize my thinking. That is, it raised more questions for me.
As I'm thinking about this, a block of data is initially common to two files (for ease of discussion). When each file is opened, that same block is used. However, with hard links (or soft links, in theory), changes made through one file are reflected when the other file is next opened.
I can easily see a scenario where that is not the intention. A file may initially be copied, and thus be a duplicate, but then changes are made that should only apply to one file.
What happens now? Does the original data block remain, with a new difference file, unique to the changed file, getting created (similar to a differential backup file)?
There would still be considerable overlap in the content. Does the block get broken into smaller pieces to represent the common information, with separate unique blocks created for file 1 and file 2?
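My understanding, sketched below with hedging: a shared block is never edited in place. A write through one file stores the changed block as new data and repoints only that file's block list; unchanged blocks stay shared. Real VDO does this per 4 KiB block beneath the filesystem, so only the blocks that actually changed stop being deduplicated. The code is a conceptual toy, not VDO's mechanism.

```python
import hashlib

blocks = {}                     # fingerprint -> block data (the dedup store)

def put(data: bytes) -> str:
    """Store a block if new; either way, return a reference to it."""
    fp = hashlib.sha256(data).hexdigest()
    blocks.setdefault(fp, data)
    return fp

# Two "files" as lists of block references; file2 starts as a copy of file1.
file1 = [put(b"block A"), put(b"block B")]
file2 = list(file1)             # the copy shares both blocks

# Modify block B through file2 only: a new block is stored and only
# file2's reference is repointed. file1 still sees the original.
file2[1] = put(b"block B, edited")

assert file1[0] == file2[0]     # block A is still shared (deduplicated)
assert file1[1] != file2[1]     # only the changed block diverged
```

So the answer to "does the block get broken into pieces" is, loosely: the data is already in fixed-size blocks, and divergence happens block by block.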
Again, just to be clear, I'm just trying to think through the process both conceptually and mechanically. This is probably way beyond the scope of the class, but I find this kind of interesting. Storage is my jam.
"As I'm thinking about this, a block of data is initially common to two files (for ease of discussion). When each file is opened, that same block is used. However, with hard links (or soft links, in theory), changes made through one file are reflected when the other file is next opened."
A hard link is a label for a block of data (identified by an inode number), not a "file" in itself. In this context there is no "file"; there are only hard links to the data. A "file" with a link count of 1 simply has one label. When you create another hard link to that data, you aren't creating another "file," you are simply creating another label, and the link count increases by one.
"The concept of a hard link is the most basic we will discuss today. Every file on the Linux filesystem starts with a single hard link. The link is between the filename and the actual data stored on the filesystem. Creating an additional hard link to a file means a few different things. Let's discuss these.
First, you create a new filename pointing to the exact same data as the old filename. This means that the two filenames, though different, point to identical data."
"When changes are made to one filename, the other reflects those changes. The permissions, link count, ownership, timestamps, and file content are the exact same. If the original file is deleted, the data still exists under the secondary hard link. The data is only removed from your drive when all links to the data have been removed."
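The behavior described in that quote is easy to demonstrate. A quick sketch (standard library only; the file names are arbitrary):

```python
import os
import tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "orig.txt")
link = os.path.join(d, "link.txt")

with open(orig, "w") as f:
    f.write("shared data\n")

os.link(orig, link)                 # create a second label (hard link)

s1, s2 = os.stat(orig), os.stat(link)
assert s1.st_ino == s2.st_ino       # same inode: one set of data, two names
assert s1.st_nlink == 2             # link count went up by one

os.remove(orig)                     # delete one label...
with open(link) as f:
    assert f.read() == "shared data\n"   # ...the data survives under the other
```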
Taken with the VDO concept, the block of data doesn't change because of hard links. As a result, VDO doesn't care about the hard links; it only cares about the blocks of data. Other mechanisms care about the links.
That's my 2¢
Found this AMAZING one-hour presentation that covers almost every question I had about how deduplication works at a global level and how it works for data blocks within files and copies of files (where one is initially a copy of the original but gets modified later).
It also does a really nice job of explaining how much space you save over time when you use it for backups (a 1 TB drive with a full initial backup and weekly incrementals for 4 months can require a cumulative storage size of over 23 TB but only actually use a little over 1 TB of space).
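To show students the shape of that math (my numbers below are hypothetical placeholders, not the presentation's workload), the logical-versus-physical gap comes from counting every backup at full size while dedup stores only the unique blocks:

```python
# Back-of-envelope sketch of cumulative (logical) vs. actual (physical)
# size under dedup. All figures here are assumptions for illustration:
full_backup_tb = 1.0     # initial full backup of a 1 TB drive
weekly_runs = 16         # roughly 4 months of weekly backups
change_rate = 0.02       # assume ~2% of blocks change each week

# Without dedup, each weekly run stored as a synthetic full adds 1 TB:
logical = full_backup_tb * (1 + weekly_runs)
# With dedup, only the changed (unique) blocks consume new space:
physical = full_backup_tb * (1 + weekly_runs * change_rate)

print(f"logical size:  {logical:.1f} TB")
print(f"physical size: {physical:.2f} TB")
```

The presentation's 23 TB / 1 TB figures come from its own workload; the point is just how far the two curves diverge over time.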
Thanks Tracy for prompting my curiosity about this.
Here is the link if anyone is interested. I'm assigning it as an extra credit assignment in my class.