An M.2 SSD cache on my server: my observations.

At some point, as the number of deployed Docker containers grew, I began to feel that the performance of the HDDs alone was no longer enough. The server management web interface took longer and longer to open, all my services became less and less responsive, and the server took longer to start up and shut down for maintenance. So it was time to buy M.2 SSDs and set up a cache.

Initially I only wanted to experiment with a cache on my server, so the first thing I read up on was sizing. The classic recommendation everyone gives is a cache of 5-10% of the volume size, and using the server's built-in calculation tools I determined the capacity I needed at the time. Since I use the server mostly alone and did not expect a sharp increase in load, I settled on a 256 gigabyte read-only cache for an 8 terabyte volume (in reality the numbers are lower: the cache came out to 238 GB for a 7.3 TB volume, because drives are marketed in decimal units while the system reports binary ones). The speed increase was really noticeable, and responsiveness returned to its original level.
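As a back-of-the-envelope sketch, the sizing rule and the "missing" gigabytes can both be expressed in a few lines. The function names here are my own, not from any vendor tool; the point is just the arithmetic: marketed decimal GB/TB shrink when reported in binary GiB/TiB, and the 5-10% rule is applied to the volume size.

```python
# Sketch: the 5-10% sizing rule of thumb, and why a "256 GB for 8 TB"
# purchase shows up in the UI as "238 GB for 7.3 TB". Drives are sold
# in decimal units (1 GB = 10**9 bytes), but most system tools report
# binary units (1 GiB = 2**30 bytes).

def decimal_gb_to_binary(size_gb: float) -> float:
    """Convert a marketed decimal-GB size to binary GiB."""
    return size_gb * 10**9 / 2**30

def recommended_cache_gb(volume_tb: float, fraction: float = 0.05) -> float:
    """Classic rule of thumb: cache = 5-10% of the volume size (decimal GB)."""
    return volume_tb * 1000 * fraction

print(round(decimal_gb_to_binary(256), 1))   # a 256 GB SSD -> ~238.4 GiB
print(round(8 * 10**12 / 2**40, 2))          # an 8 TB volume -> ~7.28 TiB
print(recommended_cache_gb(8, 0.05))         # 5% of 8 TB -> 400.0 GB
```

So 5% of the 8 TB volume lands around 400 GB, for which 512 GB is the nearest common drive size.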

Time passed and I deployed more and more services, having found that they really were convenient for me and that I would actually use them. The cache filled up more and more with requested data, and at some point I began to notice rare slowdowns of the entire system during writes to the HDDs. With a read-only cache, writes hit the overall performance of the server quite noticeably. After another power surge, when my server decided to turn itself off and on, I started monitoring the cache. Within 24 hours it filled to 100%, and the fill level almost never dropped below that. (Yes, I understand that a cache is supposed to fill up, but if you have ever monitored one in operation, you know that pinning at 100% is one of the clear signs it is undersized.) Comparing this with the values right after the cache was first installed, I concluded that the cache was clearly too small, so I decided to buy larger 512 GB M.2 SSDs (this time about 5% of the volume) and to set up a read/write cache instead.
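The check I ended up doing by eye can be sketched as a small function. The data source is hypothetical (on a real NAS you would read the fill percentage from the vendor's monitoring UI, by hand if necessary); the logic just captures the heuristic from above: a cache that saturates early in its first day and then stays pinned near 100% probably does not fit the working set.

```python
# Sketch of an "undersized cache" heuristic: sample the cache fill
# percentage at regular intervals over roughly the first day, then
# check whether it saturates and stays saturated. The sample values
# below are illustrative, not exact readings from my server.

def cache_looks_undersized(samples, threshold=99.0, sustained_fraction=0.9):
    """samples: fill percentages taken at regular intervals over ~24h.

    Returns True if the cache reaches the saturation threshold and
    then stays there for most of the remaining samples. Filling up
    eventually is normal; pinning at ~100% almost immediately is not.
    """
    if not samples:
        return False
    saturated = [s >= threshold for s in samples]
    if True not in saturated:
        return False
    first = saturated.index(True)
    tail = saturated[first:]
    return sum(tail) / len(tail) >= sustained_fraction

# Old 238 GB cache: hit 100% within a day and never really dropped.
print(cache_looks_undersized([40, 75, 90, 100, 100, 99, 100, 100]))  # True
# New 512 GB cache: stabilized well below saturation after two days.
print(cache_looks_undersized([30, 55, 67, 70, 83, 80, 78, 81]))      # False
```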

While installing the new M.2 SSDs for the cache, I disconnected the old read-only cache from the volume, and almost instantly the server lost nearly all responsiveness: the entire load fell on the HDDs, and the server thought for a very long time about my every action in the web interface. After shutting down and swapping the cache drives, the server and its services also started up noticeably slower, with the same depressing responsiveness. When everything was finally up about an hour later, I set up a read/write cache for the volume and moved all the Btrfs metadata onto the cache as well. Almost immediately the server's performance reached new heights, and the HDD speeds returned to the values stated in their specifications. I began monitoring the new cache. After one night (8-12 hours), it had filled to 250 GB, which confirmed I had been right about the old cache being too small. Then, after two days, the cache fill stabilized at 67% for re-requested data and 83% including written data.

Based on my observations, I think the following conclusions can be drawn:

  1. Track how much the cache fills up in its first day. If it literally reaches 100% within a day, that is a clear sign it is undersized.
  2. A read/write cache is much faster than a read-only one, and even a read-only cache offloads reads so the disks are freer for writes. If the read-only cache is too small, you will start to notice occasional performance drops.
  3. Do not enable compression at the file system level if you use a read-only cache. It may save some space, but it slows down file handling (I have tested this myself).
  4. Move the file system metadata to the read/write cache; it speeds up file searches considerably.