pCFS vs. PVFS: comparing a highly-available symmetrical parallel cluster file system with an asymmetrical parallel file system

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

In this paper, we argue that symmetrical cluster file systems using shared disks over a SAN can, in medium-sized clusters (with, say, a few dozen nodes) where all nodes can simultaneously perform both I/O and computational tasks, achieve levels of performance comparable with those of asymmetrical parallel file systems such as PVFS [1], where some nodes must be set aside as I/O servers while others are compute nodes. Cluster file systems (CFSs) such as GFS [2, 3] and OCFS [4, 5] have been strong in IT, data centre, and other non-HPC environments that require high-availability (HA) file systems that can tolerate node failures. However, these CFSs are known not to exhibit high performance, a situation that can be traced to their objectives, their design, and their attempt to retain the so-called POSIX single-node equivalent semantics [6]. We hold that today, when even on a single-node computer running Linux's ext2/3 the write() is not atomic with respect to the read() for processes that read/write share a file, the application programmer should not rely on the POSIX single-node semantics but should use appropriate mechanisms such as file locking. We have proposed and implemented pCFS, a highly available Parallel Cluster File System [7, 8], in such a way as to maintain cluster-wide coherency when sharing a file across nodes with very low overhead, provided that standard POSIX locks (placed with fcntl calls) are used [9]; as a result, we are able to provide the same performance as GFS (within 1%) for files that are read-shared across nodes, but we achieve large performance gains when files are read/write (or write/write) shared across cluster nodes for non-overlapping requests (a quite common pattern in scientific codes).
These gains are noticeably higher for smaller I/Os: when write/write sharing files across nodes, GFS starts at 0.1 MB/s for 4 KB records and slowly rises to 34 MB/s for 4 MB records, while pCFS starts at 55 MB/s and quickly rises to 63 MB/s; CPU consumption is about the same, with pCFS at 17% and GFS at 14% (Note: the comparison of GFS and pCFS has been submitted and is pending review). Currently, pCFS is implemented through modifications to the GFS kernel module, with the addition of two new kernel modules per node and a single cluster-wide user-level daemon. No modifications whatsoever have been made to the Linux VFS layer. Using exactly the same hardware and operating system version across all nodes, we compare pCFS with two distinct configurations of PVFS: one using internal disks, and therefore unable to provide any tolerance against disk and/or I/O node failures, and another where the PVFS I/O servers access LUNs in a disk array and can therefore, in the event of failures, have services restarted on another node in a matter of minutes (named HA-PVFS in the following).
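The access pattern the abstract relies on, where writers touch non-overlapping records and guard each request with a standard POSIX lock placed through fcntl, can be illustrated from the application side. The following is a minimal sketch and not code from pCFS: the helper names `write_record` and `read_record`, the fixed 4 KB record size, and the file layout are assumptions made for illustration. Run against a local file system it only coordinates processes on one node; whether such locks are enforced across nodes depends on the file system underneath.

```python
import fcntl
import os

RECORD_SIZE = 4096  # assumed record size (4 KB), chosen only for illustration


def write_record(path, index, payload):
    """Hypothetical helper: write one fixed-size record under an exclusive byte-range lock."""
    if len(payload) != RECORD_SIZE:
        raise ValueError("payload must be exactly RECORD_SIZE bytes")
    offset = index * RECORD_SIZE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Exclusive POSIX lock on bytes [offset, offset + RECORD_SIZE).
        # The call blocks until the lock is granted; writers working on
        # other records lock disjoint ranges and do not wait on each other.
        fcntl.lockf(fd, fcntl.LOCK_EX, RECORD_SIZE, offset, os.SEEK_SET)
        try:
            os.pwrite(fd, payload, offset)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_SIZE, offset, os.SEEK_SET)
    finally:
        os.close(fd)


def read_record(path, index):
    """Hypothetical helper: read one record under a shared byte-range lock."""
    offset = index * RECORD_SIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        # A shared lock lets concurrent readers proceed while excluding
        # a writer that holds an exclusive lock on the same range.
        fcntl.lockf(fd, fcntl.LOCK_SH, RECORD_SIZE, offset, os.SEEK_SET)
        try:
            return os.pread(fd, RECORD_SIZE, offset)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_SIZE, offset, os.SEEK_SET)
    finally:
        os.close(fd)
```

In a scientific code following this pattern, each process would call `write_record` only for the record indices it owns. Note that POSIX locks are advisory and owned per process, so they protect only against other processes that also take the lock, and closing any descriptor for the file releases that process's locks on it.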
Original language: Unknown
Title of host publication: LNCS
Pages: 131 to 142
Publication status: Published - 1 Jan 2010
Event: 16th International Euro-Par Conference on Parallel Processing
Duration: 1 Jan 2010 → …

