What Is A Distributed File System?

In general, distributed file systems are IT solutions that allow multiple users to access and share data in what appears to be a single, seamless storage pool. The back-end systems follow one of a few architectural patterns. In order of popularity, these are client-server, which is the most common; cluster-based, which is most useful in large data centers; and decentralized.

These architectures comprise multiple back-end systems connected via a network, with middleware that orchestrates file storage and employs a range of techniques to ensure the distributed system’s performance meets the needs of users. Viewed this way, the distributed system has a service capacity, and the load on that service is the total demand from all active users. When load approaches or exceeds that capacity, performance degrades, showing up as lag or service outages.

The chief benefit rests on the fact that sharing data is fundamental to distributed systems, and therefore forms the basis for many distributed applications. Specifically, distributed file systems are a proven way to share data securely and reliably between multiple processes over long periods. This makes them ideal as a foundational layer for distributed systems and applications.

Distributed systems underpin the modern concept of “the cloud” and support the idea that the cloud is essentially limitless in storage capacity. These systems can expand behind the scenes to match growth in demand. They can manage massive volumes of information, safeguard its integrity, and keep it available to users 99.9995% of the time, which works out to under three minutes of downtime per year. Even for that small sliver of downtime, there are contingencies upon contingencies in place. This is the core business of cloud data centers, so they benefit from economies of scale more readily than enterprises or smaller businesses that deploy their own distributed systems.

Enterprises and small businesses may deploy their own distributed file systems to facilitate business operations regionally or even globally. For instance, distributed systems may support private clouds, parallel computing, and even real-time control systems. Municipalities deploy real-time traffic control and monitoring systems to better manage commuter times, all made possible by DFS-supported applications. Sophisticated parallel computing models are deployed across many participating computing systems in collaborations that process large data sets, as in astronomical calculations where a single computer cannot do the work.

Features of a distributed file system

While they are popularly understood to be file sharing technologies, distributed file systems are characterized by several features beyond sharing data. The most desirable DFS characteristics are outlined below.

  • Data Integrity — Data integrity issues arise in replication and redundancy, and more specifically whenever two or more requests attempt to access the same information simultaneously. Systems must properly synchronize data across the system while enforcing concurrent access controls. Imagine POS systems interacting with inventory and banking systems: because each sale is a financial transaction, the data must remain in sync or accounting errors may result.
  • Heterogeneity — Another hallmark of distributed file systems is platform-agnostic compatibility: they should be able to access data on multiple diverse systems, such as Unix and Windows.
  • High Availability — High availability (HA) is a method of designing systems so that they are available to users as much of the time as possible. While it may first suggest adding more servers, it usually comes down to better data management that employs replication, redundancy, fault tolerance, multipathing, and load balancing.
  • High Reliability — Much like High Availability, High Reliability focuses on ensuring that data loss is reduced to a minimum. This employs automated replication and redundancy and can be considered a subset of HA.
  • Performance — Distributed file systems can perform worse than centralized file systems. Performance is measured as the average time to fulfill client requests, which equals network request time + CPU time + secondary storage access time + network response time. Local storage avoids the two network terms and so is generally faster, but distributed file systems use techniques such as caching and bulk transfers to keep that overhead small.
  • Scalability — Scalability is a hallmark feature of distributed file systems. Distributed file systems are designed to increase their compute and storage systems without disruption. In practice, IT admins can simply add new storage to the system, sometimes as easy as plug-and-play.
  • Security — Security is a major concern in distributed file systems, as many now use the public internet as a transmission channel. Many systems encrypt data both in transit and at rest, and this practice has become the de facto standard.
  • Simplicity — Distributed file systems are designed to be seamless and high performance, which follows a simplicity mindset. Complexity should be reduced in areas of user interface, including the breadth of commands, which should be reduced to the most powerful and useful tools.
  • Transparency — Distributed file systems must provide transparency on four levels. Structure transparency says clients should not need to know where file servers or storage devices are located, or how many there are. Access transparency demands that all files are accessed in the same way, whether local or remote. Likewise, naming transparency says that file names should not reveal file location. And, building on the naming rules, replication transparency hides the locations of files replicated across the system.
  • User Mobility — When users log on to a distributed system, the system should bring the user’s environment to whichever node they log on from, so that the experience is seamless.
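
The response-time sum in the Performance item can be made concrete with a small calculation. The sketch below is illustrative only: the class name, function name, and millisecond values are assumptions chosen for the example, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Time components of one client request, in milliseconds (made-up values)."""
    network_request_ms: float
    cpu_ms: float
    storage_access_ms: float
    network_response_ms: float

    def total_ms(self) -> float:
        # Sum of the four components named in the Performance item.
        return (self.network_request_ms + self.cpu_ms
                + self.storage_access_ms + self.network_response_ms)


def average_response_ms(timings: list[RequestTiming]) -> float:
    """Average time to fulfill a batch of client requests."""
    return sum(t.total_ms() for t in timings) / len(timings)


# Hypothetical measurements: a remote read, and a local read with no network hops.
remote = RequestTiming(2.0, 0.5, 8.0, 2.0)
local = RequestTiming(0.0, 0.5, 8.0, 0.0)

print(remote.total_ms())                     # 12.5
print(local.total_ms())                      # 8.5
print(average_response_ms([remote, local]))  # 10.5
```

With these assumed numbers, the only difference between the remote and local request is the two network terms, which is why caching and bulk transfers target exactly those terms.
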

How the distributed file system works

Technically, distributed file systems need to achieve several goals to produce the effect of active file sharing for multiple users. The basic distributed model connects multiple local file systems by mounting them, then separates a storage management layer from the user interface layer. This hides how files are stored on the underlying infrastructure, so that clients see the multiple systems as one. Users typically expect simultaneous access to data while always seeing the freshest version. These demands raise many considerations for system designers.

To answer many of these considerations, system designers use different DFS mechanisms, including:

  • Mounting — Connecting the file namespaces of multiple systems is a key feature of distributed file systems. The mounting mechanism allows different file namespaces to be joined into a single hierarchical namespace.
  • Client Caching — Because files are stored across a network, file access time inherently includes network transmission time, so network performance directly affects overall system performance, and a slow network degrades the client experience. To mitigate lag, caching stores copies of files, or parts of files, locally at the client. For example, opening a folder may cache the metadata of all files inside it; the content of a file is then transferred only once that file is requested.
  • Bulk Data Transfers — Closely linked to system performance is the transfer of data in bulk. Because network performance is key to system performance, techniques like burst transfers are utilized to optimize network transfers.
  • Encryption — Distributed systems rely on networks, so data is encrypted in transit to enforce security.
  • Stateful and Stateless File Servers — Sometimes users will open a file and expect to edit it while other users are doing the same, which complicates distributed file system management. Stateful servers, which maintain session information, can handle this. In general, however, stateless servers are preferred because they are more fault tolerant: a client crash does not affect a stateless server the way it would a stateful server that is actively managing sessions.
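
The client caching behavior described in the list above (folder metadata cached when a folder is opened, file content fetched only on first request) might be sketched as follows. This is a minimal illustration under stated assumptions: `FakeFileServer` and `CachingClient` are hypothetical names, the server is an in-memory stand-in for a remote machine, and cache invalidation, which a real system would need to keep clients on the freshest version, is deliberately left out.

```python
import posixpath


class FakeFileServer:
    """Hypothetical in-memory stand-in for a remote file server."""

    def __init__(self, files):
        self._files = dict(files)  # path -> content bytes
        self.round_trips = 0       # counts simulated network requests

    def list_metadata(self, folder):
        """One round trip: return {path: size} for files directly inside folder."""
        self.round_trips += 1
        return {path: len(data) for path, data in self._files.items()
                if posixpath.dirname(path) == folder}

    def fetch(self, path):
        """One round trip: return the full content of one file."""
        self.round_trips += 1
        return self._files[path]


class CachingClient:
    """Caches folder metadata eagerly and file content lazily."""

    def __init__(self, server):
        self._server = server
        self._sizes = {}    # path -> size, filled when a folder is opened
        self._content = {}  # path -> bytes, filled when a file is first read

    def open_folder(self, folder):
        sizes = self._server.list_metadata(folder)
        self._sizes.update(sizes)
        return sorted(sizes)

    def size(self, path):
        # Served from the metadata cache when the folder was already opened.
        if path not in self._sizes:
            self.open_folder(posixpath.dirname(path))
        return self._sizes[path]

    def read(self, path):
        # Content crosses the (simulated) network only on the first read.
        if path not in self._content:
            self._content[path] = self._server.fetch(path)
        return self._content[path]


server = FakeFileServer({"/docs/a.txt": b"alpha", "/docs/b.txt": b"bravo!"})
client = CachingClient(server)
names = client.open_folder("/docs")   # round trip 1: metadata for the folder
b_size = client.size("/docs/b.txt")   # answered from the cache
first = client.read("/docs/a.txt")    # round trip 2: content transferred
second = client.read("/docs/a.txt")   # answered from the cache
print(names, b_size, first, server.round_trips)
```

Four client operations cost two simulated round trips here; the trade-off is that the cached copies can go stale if another client changes the file.
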

Network file system vs. distributed file system

A distributed file system is a group of systems that work together and act as a single shared file system. In general terms, a distributed file system relies in part on a network file system to reach remote storage. There are several varieties of network file systems in use, but they all allow a remote host to mount and interact with file systems over a network. The terms distributed file system and network file system are often used interchangeably.

More specifically, Network File System (NFS) is a protocol developed by Sun Microsystems that has become a de facto standard for file sharing, particularly among Unix-like systems. NFS version 2 (NFSv2) is the oldest of the commonly cited versions, though NFSv3 and NFSv4 are the versions in general use today. For example, Windows Server supports the NFS protocol to allow file transfers between systems running Windows and non-Windows systems, like Unix or Linux. NFS version 4 (NFSv4), the latest major version, works through firewalls and on the Internet.

Benefits of a distributed file system

The main benefit of distributed file systems is that they connect multiple systems through a network, expanding storage capacity while maximizing user access. Further benefits include:

  • Unified view and control over file storage assets
  • File access availability to users anytime, anywhere
  • File access controls prioritizing access to data
  • Storage system interoperability
  • Storage scaling capabilities
  • Load-balancing management
  • Data access transparency and location transparency

Block storage vs. object storage vs. file storage

Block storage can be compared with two other common storage formats: file storage and object storage. Each format stores, organizes, and provides access to data in ways that benefit certain applications. For instance, file storage, commonly seen on desktop computers as a file and folder hierarchy, presents information intuitively to users. This intuitive format, though, can hamper operations when data becomes voluminous. Block storage and object storage each help overcome that scaling problem in their own way. Block storage does this by “chunking” data into fixed-size blocks that can be easily managed by software, but it provides little information about file contents, leaving that to the application to determine. Object storage decouples the data from the application and uses metadata to organize it, which allows object stores to span multiple systems while objects remain easy to locate and access.
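
The contrast between the block and object approaches can be sketched in a few lines. The example assumes fixed-size blocks, which is the common case, and a toy in-memory object store; `to_blocks`, `ObjectStore`, the 4-byte block size, and the metadata keys are all hypothetical choices made for readability.

```python
import uuid

BLOCK_SIZE = 4  # bytes; tiny for readability, real block sizes are far larger


def to_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Block view: raw numbered chunks with no knowledge of what they contain."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ObjectStore:
    """Object view: each blob is stored whole, under an ID, with its metadata."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata dict)

    def put(self, data: bytes, **metadata) -> str:
        object_id = uuid.uuid4().hex
        self._objects[object_id] = (data, metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id][0]

    def find(self, **criteria) -> list[str]:
        """Locate objects by metadata alone, without reading their data."""
        return [oid for oid, (_, meta) in self._objects.items()
                if all(meta.get(key) == value for key, value in criteria.items())]


payload = b"hello world"

blocks = to_blocks(payload)
print(blocks)  # [b'hell', b'o wo', b'rld']

store = ObjectStore()
oid = store.put(payload, owner="alice", content_type="text/plain")
print(store.find(owner="alice") == [oid])  # True
```

The blocks carry no description of their contents, so the application has to know how to reassemble and interpret them, whereas the object can be found from its metadata alone.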

Types of distributed file systems

Three distributed file system architectures are in use today. Client/server file systems are the most common and readily available. Decentralized file systems are often found in peer-to-peer, community-based networks. And cluster-based file systems are useful in large data centers.

  • Client/Server Architecture — The most common DFS today is client/server, in which a file server manages the files, and each file is stored on one of several servers within the distributed system. This model relies on remote access: a client sends requests to work on a file that physically resides on a server, rather than downloading the file and uploading an updated version after any edits.
  • Decentralized Architecture — In a decentralized architecture, files are dispersed throughout the participating network, and transferred peer-to-peer (P2P). This style is popular with torrent networks.
  • Cluster-based Architecture — Typical solutions for client/server architectures become burdensome when scaling to very large applications. Cluster-based architecture is deployed in large data centers, and uses more advanced methods of data management, like file striping, chunking, and parallelism, to make data access, transfer, and processing more efficient. The leading solutions in this space are the Google File System (GFS) and the Hadoop Distributed File System (HDFS).
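
File striping, chunking, and parallel reads, as mentioned for cluster-based architectures, can be illustrated with a toy example. This is a sketch only and does not reflect how any particular cluster file system is implemented: the "servers" are plain dictionaries, chunks are placed round-robin, and `stripe` and `read_striped` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def stripe(data: bytes, num_servers: int, chunk_size: int) -> list[dict[int, bytes]]:
    """Split data into chunks and place chunk i on server i % num_servers."""
    servers = [{} for _ in range(num_servers)]
    for offset in range(0, len(data), chunk_size):
        index = offset // chunk_size
        servers[index % num_servers][index] = data[offset:offset + chunk_size]
    return servers


def read_striped(servers: list[dict[int, bytes]]) -> bytes:
    """Read every server in parallel, then reassemble chunks by index."""

    def read_server(store):
        # Stand-in for a network read from one storage server.
        return list(store.items())

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        results = list(pool.map(read_server, servers))

    chunks = {}
    for items in results:
        chunks.update(items)
    return b"".join(chunks[i] for i in sorted(chunks))


data = b"ABCDEFGHIJ"
servers = stripe(data, num_servers=3, chunk_size=2)
print(servers[0])  # {0: b'AB', 3: b'GH'}
print(servers[1])  # {1: b'CD', 4: b'IJ'}
print(servers[2])  # {2: b'EF'}
print(read_striped(servers) == data)  # True
```

Because consecutive chunks land on different servers, a large sequential read can draw on all of them at once instead of queuing behind a single machine.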