So, what is storage? In computer engineering, we take this word for granted. In Big Data, storage means an object store, but we continue to call it “storage” because most people are familiar with the word. In this blog post, let’s look at the engineering rationale and the human experience related to the word “storage”.
Technical Big Data Storage
Meenakshi Kaul-Basu, the Leader of Data Storage in PDU Cloud System at Ericsson, suggested this reference: How an object store differs from file and block storage. You can read it in detail, but here are some essential extracts:
What is file storage?
We know a file is typically structured in a file system, which is nothing more than a hierarchical way of organizing files so that an individual file can be located by describing the path to that file. We know that certain attributes — information that might describe a file and its contents, such as its owner, who can access the file and its size — are conveniently stored as metadata in a file system. We also know that network-attached storage (NAS) is the best way to share files securely among users on a network. It works great locally on a LAN but not so well if the users are across a WAN. And managing a single (or a small number) of NAS boxes is trivial, but managing hundreds of them is a nightmare. The file system is responsible for the placement of data on the NAS box, as well as implementing file sharing by locking and unlocking files as needed. And lastly, file systems work well with hundreds of thousands, and perhaps millions, of files but are not designed to handle billions of files. These limitations were not well understood because many IT shops had not tested those high levels — until recently.
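As a minimal illustration of the two ideas in the extract above (locating a file by its path, and reading the attributes the file system keeps as metadata), here is a hedged Python sketch. The directory names and the `describe_file` helper are hypothetical and chosen only for this example; it uses a temporary local directory rather than a real NAS share.

```python
import stat
import tempfile
from pathlib import Path


def describe_file(path: Path) -> dict:
    """Return a few attributes the file system stores as metadata for a file."""
    info = path.stat()
    return {
        "path": str(path),                     # hierarchical location of the file
        "size_bytes": info.st_size,            # size, kept by the file system
        "owner_uid": info.st_uid,              # numeric owner id (0 on Windows)
        "mode": stat.filemode(info.st_mode),   # access permissions, e.g. -rw-r--r--
    }


def demo() -> dict:
    """Create a file inside a small hierarchy and look it up by its path."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical hierarchy: <root>/projects/reports/q1.txt
        target = Path(root) / "projects" / "reports" / "q1.txt"
        target.parent.mkdir(parents=True)
        target.write_text("hello")
        return describe_file(target)


if __name__ == "__main__":
    print(demo())
```

Note that the application only supplies the path; where the bytes actually land on disk is decided by the file system.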
What is block storage?
We know a block is a chunk of data, and when appropriate blocks are combined, it creates a file. A block has an address, and the application retrieves a block by making a SCSI call to that address. It is a very microscopic way of controlling storage. Unlike in the case of NAS, the application decides where to place the data and how to organize the storage. How the blocks are combined or accessed is left up to the application. There is no storage-side metadata associated with the block, except for the address, and even that, arguably, is not metadata about the block. In other words, the block is simply a chunk of data that has no description, no association and no owner. It only takes on meaning when the application controlling it combines it with other blocks. Under the right circumstances, granting this level of granular control to the application allows it to extract the best performance from a given storage array. This is the reason why block storage has been king of the hill for performance-centric applications, mostly transactional and database-oriented. Adding distance between the application and storage kills this performance advantage due to latency, so most block storage is used locally.
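The block model described above can be sketched in a few lines of Python. This is only an in-memory stand-in: a real application would reach blocks through SCSI or a similar protocol, and the `BlockDevice` class, the `store` and `load` helpers, and the 512-byte block size are assumptions made for the illustration. The point it shows is that the device knows only addresses, while the application decides which blocks make up a "file".

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


class BlockDevice:
    """In-memory stand-in for a block device: addressed chunks, no metadata."""

    def __init__(self, num_blocks: int) -> None:
        self.num_blocks = num_blocks
        self._data = bytearray(num_blocks * BLOCK_SIZE)

    def _offset(self, address: int) -> int:
        if not 0 <= address < self.num_blocks:
            raise IndexError(f"block address {address} out of range")
        return address * BLOCK_SIZE

    def write_block(self, address: int, payload: bytes) -> None:
        if len(payload) > BLOCK_SIZE:
            raise ValueError("payload larger than one block")
        start = self._offset(address)
        # Pad short payloads with zero bytes so every block is full size.
        self._data[start:start + BLOCK_SIZE] = payload.ljust(BLOCK_SIZE, b"\0")

    def read_block(self, address: int) -> bytes:
        start = self._offset(address)
        return bytes(self._data[start:start + BLOCK_SIZE])


def store(device: BlockDevice, addresses: list[int], content: bytes) -> int:
    """Split content into blocks and place them at addresses the application chose."""
    chunks = [content[i:i + BLOCK_SIZE] for i in range(0, len(content), BLOCK_SIZE)]
    if len(chunks) > len(addresses):
        raise ValueError("not enough block addresses for this content")
    for address, chunk in zip(addresses, chunks):
        device.write_block(address, chunk)
    return len(content)  # the application, not the device, remembers the length


def load(device: BlockDevice, addresses: list[int], length: int) -> bytes:
    """Reassemble the content by reading the blocks back in the application's order."""
    raw = b"".join(device.read_block(address) for address in addresses)
    return raw[:length]
```

For example, an application might place a 700-byte payload in blocks 5 and 2, in that order; the device itself has no record that those two blocks belong together.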
Armed with this knowledge of file and block storage, here is what object storage is:
… An object is defined as data (typically a file) along with all its metadata, all bundled up as an object. This object is given an ID that is typically calculated from the content of that object (both file and metadata) itself. An object is always retrieved by an application by presenting the object ID to object storage. Unlike files and file systems, objects are stored in a flat structure. You have a pool of objects, and you simply ask for a given object by presenting its object ID. Objects may be local or geographically separated, but because they are in a flat address space, they are retrieved exactly the same way. An object is not limited to any type or amount of metadata. If you choose to, you can assign metadata such as the type of application the object is associated with; the importance of an application; the level of data protection you want to assign to an object; if you want this object replicated to another site or sites; when to move this object to a different tier of storage or to a different geography; and when to delete this object. This type of metadata goes way beyond the access control lists used in file systems. The fact that object storage allows users flexibility to define metadata as they wish is unique to object storage. You can start to see how this opens up vast opportunities for analytics that one could never dream of performing before. Given the nature of objects, as described above, performance is not necessarily a hallmark of object storage. But if you want a simple way to manage storage and a service that spans geographies and provides rich (and user-definable) metadata, object storage is the way to go.
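To make the description above concrete, here is a hedged Python sketch of a toy object store: a flat pool of objects, each retrieved by an ID calculated from its content and metadata, with metadata keys the user is free to define. The `ObjectStore` class, the use of SHA-256, and the metadata keys in the usage note are assumptions for illustration, not the API of any real product.

```python
import hashlib
import json


class ObjectStore:
    """Toy flat object store: data plus user-defined metadata, keyed by content ID."""

    def __init__(self) -> None:
        self._objects: dict[str, tuple[bytes, dict]] = {}  # flat pool, no hierarchy

    def put(self, data: bytes, metadata: dict) -> str:
        """Store an object and return an ID derived from its data and metadata."""
        meta_bytes = json.dumps(metadata, sort_keys=True).encode("utf-8")
        hasher = hashlib.sha256()
        hasher.update(len(data).to_bytes(8, "big"))  # length prefix separates the parts
        hasher.update(data)
        hasher.update(meta_bytes)
        object_id = hasher.hexdigest()
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> tuple[bytes, dict]:
        """Retrieve an object by presenting its ID, wherever it is stored."""
        return self._objects[object_id]

    def find(self, **criteria) -> list[str]:
        """Return the IDs of objects whose metadata matches every given key and value."""
        return [
            object_id
            for object_id, (_, meta) in self._objects.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        ]
```

A usage sketch: `store.put(b"report", {"replicate_to": "site-b", "retention_days": 30})` returns an ID, `store.get(that_id)` returns the data and metadata, and `store.find(replicate_to="site-b")` lists every object tagged for that site. Queries of this last kind over user-defined metadata are what make the analytics mentioned above possible.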
What kind of big storage do we need?
While being somewhat informed about big data technologies is a big plus, the explanation above does not help the average person in a company make a decision. We need a different angle.
We asked Ericsson’s Cloud CTO, Jason Hoffman, for his view. Here are excerpts from our conversation.
People care about data, not storage
“We sell cloud storage. Yet no one cares about storage. People care about their data. They want a data centric view of the space. An example: when you look at metadata about things – associations, connections, graphs of things, startup files, data generated from events or activities, provisional files and so on – just a whole list of things. We look at a list of things that are not meant for easy usage in a corporate environment. One cannot see (easily) how important data is and what people currently want to do with it.”
“If you look at the whole data space, now we have more data diversity. For example the Hadoop space… we get a new type of data…. But in addition to the great data diversity, we have a big diversity of “whys”. Why do we keep this specific data? Why don’t we keep that other data?”
What is the final goal?
“A much more data centric view of the world means we can actually get to the point where “normal” people (meaning you and me and everybody else) are able to locate and read the data from somewhere. A lot of very interesting things can happen as a result”.
“For now, storage solutions are intended for storage experts who run around trying to figure out how to make a suitable storage solution for a given application, but perhaps we should approach it by looking at what people are doing with this data and how much they care about it. These decisions can be made by mainstream people; there is no need for them to understand the differences between block storage and an object, or an “append”.”
“What data do you generate for the company?” “What data do you get sent from other people?” “What data do you keep?” “Why do you keep it?” “For the data that you keep, what do you do with it?” “What value does it provide to you?” “What applications generated all this data?” “How do you consume this data?”
Companies where data matters versus companies where it doesn’t
“Consider the companies in the world for whom data matters, versus the companies for whom it doesn’t. Historically the companies that did care became the Googles or Amazons of the world.
Our hypothesis is that in an average company, people don’t pay enough attention to extracting value from their data.
People in manufacturing, and of course we have plenty of examples in Telecom, they care. We know that Oil and Gas companies care about their data sets, mining companies care too.
But what about the diversity of things that people don’t consider important? What data do they need? What data is sensitive? How is that sensitive data generated? Why keep some of it? Why not keep all of it? If they do keep it, how important is the data location to them? What value are they deriving from their current data?
Why can’t we make it easier for them to keep their data? Maybe we can give them a way to more confidently secure their data, and why not make it simpler to question that data?
I believe that there are many more companies like the Googles of the world, where data should matter. There are at least half a dozen industries where data also should matter, though they currently don’t derive any value from it. There are reasons why they don’t. We have to address these reasons why.
I call this storage solution “Empathic Storage” – storage with empathy.”
Johan Carlsson, Hans Haenlein, Miha Ahronovitz Accessibility Group – Cloud Product Team
Ideas from Jason Hoffman, Ericsson Cloud CTO, inspired this blog.
Geoff Hollingworth, Head of Cloud Marketing, who has supported our team from day 1, seeded key ideas. We received insights from Seamus Keane. Many thanks to Deirdre Straughan and Stacie Pham, the Evolution blog editors. And last, but not least, thanks to Noam Zomerfeld, the capable student intern in the Accessibility team, for the meaningful conversations.