Let's explore the different ways applications store data permanently on disk. The different methods are in some sense merely conveniences for the programmer writing an application.
At its heart, all data storage consists of a stream of zeros and ones on some medium. Paper and magnetic tape have given way to disk, but the principles remain the same. When we talk about storage methods, we mean the structure imposed on these zeros and ones to provide the semantics necessary for storing them—that is, the ability to express ideas such as data ownership, security, access method, and even clues as to what type of data it is that we are storing.
I will begin the discussion with the most familiar method (file storage) and then discuss block storage (the way databases store data for us), and finally I will briefly discuss object-based storage, which is a method modern applications use to move data structures from working memory to disk.
The traditional method for storing data on a disk is as a file. It consists of two things.
First, there is the data itself. For example, this might be the words and formatting of a word-processor document or the text of a saved email. Disks are divided up into millions of blocks of storage (often 512 bytes each), and the saved data will be written to some number of these blocks on the surface of the disk.
Second, there is information about the data, which is called metadata. This includes such items as the size of the file, who owns it, who has permission to read from and write to the file, where the disk blocks that make up the file are located on disk, when the file was created and last modified, and so on. Generally this information is stored separately from the file, in a directory. The directory contains both the metadata and pointers to the file blocks on the disk.
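To make the split between data and metadata concrete, here is a minimal Python sketch that writes a scratch file and then asks the file system what it knows about it. The 512-byte block size is an assumption carried over from the figure above (real disks and file systems may use other sizes), and `describe_file` and `demo` are names invented for this illustration.

```python
import math
import os
import stat
import tempfile
import time

BLOCK_SIZE = 512  # assumed block size, matching the figure quoted above


def describe_file(path):
    """Return a small dict of metadata the file system keeps about `path`."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        "owner_uid": info.st_uid,
        "permissions": stat.filemode(info.st_mode),
        "modified": time.ctime(info.st_mtime),
        # How many whole blocks the data would occupy at the assumed size.
        "blocks_needed": math.ceil(info.st_size / BLOCK_SIZE),
    }


def demo():
    # Write 1,300 bytes of data to a scratch file, then read its metadata back.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"x" * 1300)
        path = handle.name
    try:
        return describe_file(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Note that none of the values returned come from the file's contents: they all come from the bookkeeping held alongside it. A 1,300-byte file needs three 512-byte blocks, so the last block is only partly used.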
Block storage is provided by a special type of file that, rather than containing data, is really a gateway that points to a physical disk (or part of one) and gives it a name in the file system. This gateway may pass data into the memory buffers of the operating system, to be transferred to permanent storage as time and workload permit (this is basic block storage). Alternatively, the data can be passed directly, without any such buffering, to the code that controls the disk. This second form is known as a character or raw interface.
Block storage is used by applications that are not primarily interested in how their data is stored; the whole data storage problem is handed off to the operating system, which "knows best" how to do the job of permanent data storage.
Character or raw storage methods are provided for those applications that, for reasons of performance and data integrity, want intimate control of how their data reaches permanent storage. The largest category of applications here is databases. Databases undertake many of the tasks usually left to the operating system (such as storage, synchronization, and communication) for themselves and can to some extent be viewed as operating systems in miniature.
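The difference between leaving the timing to the operating system and insisting that data reach the disk can be sketched in Python. This is only an analogy on an ordinary file: true raw access goes through a device node and usually needs administrator privileges, which I have deliberately avoided here. The function names are invented for this illustration; `os.fsync` is the standard call that asks the operating system to flush its buffers for a file.

```python
import os
import tempfile


def write_buffered(path, payload):
    """Hand the data to the operating system and return.

    The bytes may sit in OS memory buffers for a while before reaching disk.
    """
    with open(path, "wb") as handle:
        handle.write(payload)


def write_durable(path, payload):
    """Write the data, then wait until the OS reports it has been flushed."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        while written < len(payload):
            # os.write may write fewer bytes than asked, so loop until done.
            written += os.write(fd, payload[written:])
        os.fsync(fd)  # block until buffered data is pushed to storage
    finally:
        os.close(fd)


def demo():
    payload = b"commit record"
    with tempfile.TemporaryDirectory() as folder:
        buffered = os.path.join(folder, "buffered.dat")
        durable = os.path.join(folder, "durable.dat")
        write_buffered(buffered, payload)
        write_durable(durable, payload)
        with open(buffered, "rb") as a, open(durable, "rb") as b:
            return a.read(), b.read()


if __name__ == "__main__":
    print(demo())
```

Both files end up with the same contents. What differs is the guarantee at the moment the function returns, which is exactly the kind of control a database wants over its commit records.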
Object storage is optimized for storing larger data objects (several hundred kB or, even better, multiple GB). Object storage is typically accessed via REST-based HTTP interfaces supporting basic read, write, and delete operations. The de facto standards are Amazon S3-compatible and OpenStack Swift-compatible APIs. By relaxing requirements on the API compared to file-based access, a single object storage system can scale to several hundred PB or even into the multiple-EB range. An object store can also hold far more metadata than a file system, for instance, a richer set of information about the user or owner of the data. Additionally, object-based storage allows the data to be accessed in different ways; web services that use object interfaces are a good example. It can also provide greater safety for your data through redundancy, using replication or erasure coding across multiple storage nodes and multiple sites.
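The shape of the object interface is easier to see in a toy model. The Python sketch below is an in-memory stand-in, not the S3 or Swift API: `ToyObjectStore`, its method names, and the sample key and metadata are all invented for this illustration. It shows only the three basic operations and the free-form metadata that travels with each object.

```python
import hashlib


class ToyObjectStore:
    """In-memory stand-in for an object store: flat keys, whole-object operations."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Objects are written whole; there is no seek or partial update.
        checksum = hashlib.sha256(data).hexdigest()
        self._objects[key] = (bytes(data), dict(metadata or {}), checksum)
        return checksum

    def get(self, key):
        data, metadata, checksum = self._objects[key]
        return data, dict(metadata), checksum

    def delete(self, key):
        self._objects.pop(key, None)

    def __contains__(self, key):
        return key in self._objects


def demo():
    store = ToyObjectStore()
    key = "reports/2024/q1.pdf"  # a flat key; the slashes are only a naming habit
    store.put(key, b"%PDF-...", {"owner": "alice", "department": "finance"})
    data, metadata, _ = store.get(key)
    store.delete(key)
    return data, metadata, key in store


if __name__ == "__main__":
    print(demo())
```

There is no open, seek, or append here, and no directory tree. That reduced interface is the relaxation that lets real object stores spread objects across many nodes.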
Why should you care about data storage methods?
Put simply, the reason is money, along with being able to do our jobs properly as system administrators or purchasers of hardware. Disks and system administrators are both expensive. Creating a file system carries a greater overhead in disk space, whereas making and verifying backups of block-based data carries a greater overhead in effort. Object storage hands off a measure of control over storage to the application but, for file and raw data, we have to ensure data safety ourselves by creating mirrored disk arrays. Different approaches to implementing storage hardware yield different results in terms of cost, data security, and system efficiency. As the data admin's phrase puts it, "Fast, cheap, safe: pick any two."
Background photo by Deirdre Straughan.