From write() down to the flash chips
Tuesday, August 4, 2009 at 3:46PM
Like a lot of people, I have made the move to an SSD, and I love it. While there is quite a bit of variability between the different vendors of high end drives, all of them far exceed the performance of any conventional HDs. Moving to an SSD has easily been the best improvement in terms of performance and experience of any HW upgrade I have purchased this decade.
Anyway, the catch with these new SSDs is that they are really new. They have all had some bumpy firmware issues, so unless you are an enthusiast who pays attention to those sorts of things and keeps your firmware up to date, I can't quite recommend them without reservations. While discussing an upcoming firmware upgrade for my drive (OCZ Technology Vertex 250GB) it became clear that a number of people did not understand how the entire storage stack above the SSD worked, and that was causing some serious misunderstandings about what a firmware update could and could not do.
Because of that, I decided it might be worthwhile to write up an explanation of how a modern storage stack (FS through flash chips) works. For the sake of this discussion we will assume an overwrite filesystem (such as NTFS, HFS+, or ext4), an ATA bus, and what is generally referred to as a 2nd Generation FTL (Flash Translation Layer). Obviously the exact details may change quite a bit if any of these parameters are changed, but this is by far the most common setup people use today.
The filesystem's view of the world
The filesystem is responsible for a number of things, like permissions and name lookup, but none of those has an impact on where it puts blocks on a drive. Once we remove all of that, we can consider the file system to be an "object to block" mapping layer. In general a storage object just means a separate distinct entity, so each file is an object.
It is the filesystem's job to take these various object requests (create, delete, read, write) and service them using the chunk of block storage provided to it by some underlying device.
ATA's view of the world
ATA is a command-based transport that allows a lot of things, but for our purposes we don't care about ATAPI and all that stuff; we just care about the block storage services it provides. ATA doesn't have any object commands, it just has block commands. Basically, your entire drive is divided into a series of LBAs (logical block addresses). The drive has commands to read and write data at a particular LBA. Note that it does not have a command to delete an LBA; that will become an enormous issue later.
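To make the two views concrete, here is a minimal sketch in Python. It is a toy in-memory model, not real driver code: the names `BlockDevice` and `ToyFS` are made up for illustration, and LBA 0 standing in for the filesystem's tables is an assumption. The device only knows how to read and write LBAs, and the filesystem maps file names onto lists of LBAs.

```python
class BlockDevice:
    """Toy ATA-like device: only read and write by LBA, no delete."""

    def __init__(self, num_lbas, lba_size=512):
        self.lba_size = lba_size
        self.blocks = [bytes(lba_size)] * num_lbas
        self.writes = []  # log of LBAs written, kept only for illustration

    def read(self, lba):
        return self.blocks[lba]

    def write(self, lba, data):
        assert len(data) == self.lba_size
        self.blocks[lba] = data
        self.writes.append(lba)


class ToyFS:
    """Toy object-to-block mapper. LBA 0 stands in for the FS tables."""

    def __init__(self, dev):
        self.dev = dev
        self.files = {}  # file name -> list of LBAs
        self.free = list(range(1, len(dev.blocks)))

    def _flush_metadata(self):
        # A real filesystem serializes its tables; here we only issue the write.
        self.dev.write(0, bytes(self.dev.lba_size))

    def create(self, name, chunks):
        lbas = [self.free.pop(0) for _ in chunks]
        for lba, chunk in zip(lbas, chunks):
            self.dev.write(lba, chunk)
        self.files[name] = lbas
        self._flush_metadata()

    def delete(self, name):
        # Only the tables change; the file's data LBAs are never touched.
        self.free.extend(self.files.pop(name))
        self._flush_metadata()


dev = BlockDevice(num_lbas=8)
fs = ToyFS(dev)
fs.create("a.txt", [b"x" * 512, b"y" * 512])
writes_before_delete = len(dev.writes)
fs.delete("a.txt")
print(dev.writes[writes_before_delete:])  # LBAs written by the delete
```

In this toy model, the only thing the device sees during the delete is a write to the metadata LBA. The old file contents are still sitting in their LBAs, and nothing has told the device they are garbage.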
Flash's view of the world
Flash is a relatively complicated storage medium, and has its own view of the world. It works in terms of pages and blocks. Usually a page is the smallest amount of space you can reasonably read from or write to a flash chip (for our discussion, 4K), and a block is the smallest chunk of space you can erase at a time (for our discussion, 128 pages). In a fresh (unwritten) block all the bits are set to "1", and during a write they can only transition to "0". That means in order to rewrite a page you must erase it first. This is a very important point: you can't just go and erase a single page of the flash, you need to erase all 128 contiguous pages contained in a whole block at the same time.
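The rules above can be sketched as a toy Python model. This is an illustration under the sizes used in this post (4K pages, 128 pages per block), not a description of any particular chip; the class name `FlashBlock` and its methods are hypothetical.

```python
PAGE_SIZE = 4096        # bytes per page (the figure used in this post)
PAGES_PER_BLOCK = 128   # pages per erase block (the figure used in this post)


class FlashBlock:
    """Toy erase block: program one page at a time, erase only as a whole."""

    def __init__(self):
        self.erase_count = 0
        self.pages = [b"\xff" * PAGE_SIZE for _ in range(PAGES_PER_BLOCK)]

    def program(self, page_no, data):
        assert len(data) == PAGE_SIZE
        old = self.pages[page_no]
        # Programming can only clear bits (1 -> 0); it can never set them.
        if any(new & ~cur & 0xFF for cur, new in zip(old, data)):
            raise ValueError("page must be erased before it can be rewritten")
        self.pages[page_no] = bytes(cur & new for cur, new in zip(old, data))

    def erase(self):
        # There is no per-page erase: all pages return to all-ones together.
        self.pages = [b"\xff" * PAGE_SIZE for _ in range(PAGES_PER_BLOCK)]
        self.erase_count += 1


block = FlashBlock()
block.program(0, b"\x0f" * PAGE_SIZE)      # fresh page, so this works
try:
    block.program(0, b"\xf0" * PAGE_SIZE)  # would need 0 -> 1 transitions
except ValueError as err:
    print(err)
block.erase()                              # wipes all 128 pages at once
print(PAGE_SIZE * PAGES_PER_BLOCK)         # bytes in one erase block
```

With these sizes one erase block works out to 512KB, which is the block size used in the walkthrough that follows.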
The SSD controller's view of the world
Okay, the SSD controller has to deal with ATA's view of the world on one side (LBAs), but it also has to deal with the flash chips' view of the world on the other. This is a hugely complicated task, and it is the reason that the quality of drives varies widely. Because the SSD only deals with those points of view, and not the higher level parts of the OS view (objects), we can characterize its behaviors in those terms, though later we will look at how the stack operates as a whole.
Okay, so let's assume for a second we have a 1MB flash device with two 512KB blocks. This would be sold to the consumer as a 512KB flash drive, because some amount of the internal storage needs to be used for bookkeeping, as we shall see. When the controller starts up, what the flash looks like (from the point of view of the controller) is this:

In the above picture the blue page represents the internal tracking data the drive controller is using for remembering things like wear counts and page indirections. The OS cannot see that page, it is completely internal to the drive.
The controller can see all of that flash, but it tells the ATA chip on your motherboard that there is a single 512KB ATA device there. Now, let's imagine you go to write a page to the drive. Here is what happens:

Okay, the SSD was handed a chunk of data from the ATA bus (the green page). It found somewhere to put it. It also needs to update its tables so it can find it in the future. But if you recall, in order to erase the original table it would have to erase the whole block. That would be super wasteful, since most of the block is empty and flash has a limited number of erase cycles, so instead it just writes a new copy to a free spot in the block, and the old copy of the drive's control data is marked as invalid. Let's imagine we write another page to the drive; we will see something similar occur:

Now, let's write over the first page. This is where things get interesting. We do basically the same thing as before, except that when the write occurs the SSD looks in its tables and sees that the address we are writing to is in use, so it goes and marks the old page as invalid in its tables.
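The write and overwrite steps described so far can be sketched in a few lines of Python. This is a deliberately simplified toy: the name `ToyFTL` is hypothetical, and unlike a real drive it keeps its mapping table in ordinary memory rather than rewriting a copy of it to flash on every update.

```python
FREE, VALID, INVALID = "free", "valid", "invalid"


class ToyFTL:
    """Toy page-mapping FTL: every write lands on a fresh physical page."""

    def __init__(self, num_pages):
        self.state = [FREE] * num_pages
        self.data = [None] * num_pages
        self.lba_to_page = {}  # the indirection table

    def write(self, lba, payload):
        new_page = self.state.index(FREE)  # raises ValueError when full
        old_page = self.lba_to_page.get(lba)
        if old_page is not None:
            # Overwrite: flash cannot be rewritten in place, so the old
            # copy is only marked invalid and left for GC to reclaim.
            self.state[old_page] = INVALID
        self.state[new_page] = VALID
        self.data[new_page] = payload
        self.lba_to_page[lba] = new_page

    def read(self, lba):
        return self.data[self.lba_to_page[lba]]


ftl = ToyFTL(num_pages=8)
ftl.write(0, "first")
ftl.write(1, "second")
ftl.write(0, "first, rewritten")  # same LBA again
print(ftl.state[:4])
print(ftl.read(0))
```

After the third write, the first physical page is still occupied by the stale copy; it has only been flagged invalid, and the LBA now points at a different physical page.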

Going on in this manner, eventually a flash block will have more invalid pages than valid ones. At that time the drive will sweep through and gather all the valid data into a new block:

Then erase the old block:

At the moment the only way a user generated piece of data can be marked as invalid is if it is overwritten (though that is changing). This has some serious repercussions for the drive's garbage collection (GC), as we will see.
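The sweep described above (copy the valid pages somewhere clean, then erase the old block) can be sketched like this. It is a toy with six-page blocks for readability, where the post uses 128, and the function name `collect` and the tuple layout are made up for illustration.

```python
FREE, VALID, INVALID = "free", "valid", "invalid"
PAGES_PER_BLOCK = 6  # tiny blocks for readability; the post uses 128


def collect(blocks, victim, spare):
    """Move the valid pages of blocks[victim] into blocks[spare], erase the victim.

    Each block is a list of (state, lba) tuples.
    Returns the number of pages that had to be copied.
    """
    assert all(state == FREE for state, _ in blocks[spare]), "spare must be erased"
    survivors = [(state, lba) for state, lba in blocks[victim] if state == VALID]
    padding = [(FREE, None)] * (PAGES_PER_BLOCK - len(survivors))
    blocks[spare] = survivors + padding
    blocks[victim] = [(FREE, None)] * PAGES_PER_BLOCK  # whole-block erase
    return len(survivors)


# Block 0 after the writes LBA 0, 1, 0, 0, 1, 0: only the last copy of
# each LBA is still valid, so the block is mostly stale data.
blocks = [
    [(INVALID, 0), (INVALID, 1), (INVALID, 0),
     (INVALID, 0), (VALID, 1), (VALID, 0)],
    [(FREE, None)] * PAGES_PER_BLOCK,  # erased spare block
]
copied = collect(blocks, victim=0, spare=1)
print(copied)
print(blocks[1])
```

The cost of a pass is the number of pages copied, so anything that lets the drive treat more pages as invalid makes the pass cheaper.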
Throwing in the filesystem
Okay, so we have a basic understanding of how a modern SSD works. Now let's throw a filesystem into the mix. Let's imagine we have a filesystem with a single file in it:

In the above picture the blue page is the SSD's internal tables (which the OS cannot see), the orange is the FS's internal tables (which the SSD can see, but cannot understand), and the green is the actual file data. Since the FS cannot see the SSD's tables and the SSD cannot understand the FS's tables, we are now in a situation where no part of the stack has a complete understanding of what is going on. The repercussions of this are most apparent when you delete a file.

So in the above case the filesystem updated its internal tables and wrote them out to the SSD; the SSD then found somewhere to put the new page and modified its own tables. But notice how nothing happened to the actual file data (now colored dark green), since a deletion is just a matter of marking a few bits in the FS's tables. The SSD does not know those pages are no longer needed by the FS, since it doesn't understand the FS's internal structures, so when the SSD runs its GC algorithms it must preserve them:

Note that the drive preserved all those pages even though they will never be read again, since the OS considers them free and will use them to write out new files. This is where the new ATA TRIM command makes a difference. What TRIM does is let the OS tell the drive "I have not yet overwritten this data, but I never need it again, so you can throw it out." Let's redo the last scenario with a TRIM aware drive:

Now we delete the file and the OS sends TRIMs for the file pages to the drive:

Notice how the file pages are now marked invalid. At this point when GC runs it can throw them out:

Which results in the drive needing to preserve fewer pages during its GC process. That has several impacts, including:
- Reducing the time GC takes
- Increasing the amount of free space available after a GC (which increases the time it takes for performance to degrade after a GC)
- Giving the FTL a wider selection of pages to choose from when it needs a new page to write to, which means it has a better chance of finding low write count pages, increasing the lifespan of the drive
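The delete scenario with and without TRIM can be compared with a small Python sketch. It is a toy that only counts how many pages the drive believes are live; the names `ToyDrive` and `delete_file_scenario`, and the choice of one metadata LBA plus four file LBAs, are assumptions made for illustration.

```python
class ToyDrive:
    """Tracks which LBAs hold data the drive believes is still live."""

    def __init__(self, supports_trim):
        self.supports_trim = supports_trim
        self.live_lbas = set()

    def write(self, lba):
        self.live_lbas.add(lba)  # an overwrite leaves the LBA live

    def trim(self, lbas):
        if self.supports_trim:
            self.live_lbas -= set(lbas)  # drive may now discard these pages

    def pages_gc_must_copy(self):
        return len(self.live_lbas)


def delete_file_scenario(supports_trim):
    drive = ToyDrive(supports_trim)
    metadata_lba, file_lbas = 0, [1, 2, 3, 4]
    drive.write(metadata_lba)      # filesystem tables
    for lba in file_lbas:
        drive.write(lba)           # file contents
    drive.write(metadata_lba)      # delete: only the FS tables are rewritten
    drive.trim(file_lbas)          # OS hint; ignored by a non-TRIM drive
    return drive.pages_gc_must_copy()


print(delete_file_scenario(supports_trim=False))  # pages GC must preserve
print(delete_file_scenario(supports_trim=True))
```

In this toy case the drive without TRIM has to carry all five pages through GC, while the TRIM aware drive only has to carry the filesystem's metadata page.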
Now, I want to be clear: a sufficiently clever GC on a drive that has enough reserved space might be able to do very well on its own, but ultimately what TRIM does is give a drive's GC algorithm better information to work with, which of course makes the GC more effective. What I showed above was a super simple GC; real drive GCs take a lot more information into account. First off, they have to deal with more than two blocks, and their data takes up more than a single page. They track data locality, and they only run against blocks that have hit a certain threshold of invalid pages or have really bad data locality. There are a ton of research papers and patents on the various techniques they use. But they all have to follow certain rules based on the environment they work in, and hopefully this post makes some of those clear.
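As a rough illustration of the threshold idea only, here is a sketch of picking GC candidates by their fraction of invalid pages. The function name `pick_gc_victims` and the 50% threshold are assumptions; real controllers weigh wear counts, locality, and other factors that this toy ignores.

```python
def pick_gc_victims(invalid_counts, pages_per_block=128, threshold=0.5):
    """Return indices of blocks whose invalid fraction exceeds the threshold.

    invalid_counts[i] is the number of invalid pages in block i.
    The result is ordered with the most invalid blocks first.
    """
    candidates = [
        (count, idx)
        for idx, count in enumerate(invalid_counts)
        if count / pages_per_block > threshold
    ]
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [idx for _, idx in candidates]


print(pick_gc_victims([10, 100, 65, 64, 128]))
```

Blocks that are mostly stale come first because they free the most space for the fewest copied pages.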
Louis Gerbarg
SSD 
Reader Comments (5)
I've been reading your posts in the topic on the OCZ forum and you were spot on in every single one of them. Hats off to you, sir. It's also quite frustrating that almost no-one realizes what you are talking about, yet think they do.
Good post.
Louis -
My comment has nothing to do with this specific post. I have a question for you. I found your website off of your Stack Overflow account.
My question relates to the iPhone PNS and how to best implement a server. If you don't mind sending me an email with your email address so I can ask you more specifically, that would be fabulous :)
It really shouldn't be a long or difficult question.
Thanks so much,
Oddbjørn
JegElskerPingviner@gmail.com
My OS has option to "securely delete" files by zeroing their contents. Could that be recognized as poor man's TRIM? (it should be easy to recognize and optimize)
Rob: Thanks
colterhaycock: Thanks, though I actually have never implemented an APNS server, just debugged the client side. While I am sure I will at some point in the future, I actually have no advice to give about implementing a server because I only have the most cursory view of how it works.
pornel: Generally, no. For technical reasons it is often very problematic for devices to actually look at the data as it is incoming; usually the data is directly DMAed around with no analysis. That means even if you send a block of zeros it counts as a write.
Now, if the drive returns zeros on reads to unmapped pages (something that most drives do now, and all drives will need to do after TRIM comes out, since TRIM specifies that all reads after TRIM are supposed to return 0), then it would be safe for the GC to throw those pages out. But again, it would need to read every page to determine that, and chances are it does not actually read the contents of any pages except for its internal structures and those it intends to move. If it is actually going to move a page and the whole process is not internally DMA driven, it might catch it there and be able to avoid preserving it.
So it might result in more cleaned space after a GC (though I seriously doubt it on any current drives), but at the cost of a page write upfront.
Easy to understand for newbs if no acronyms :D