[Linux-aus] Ceph FS vs Gluster FS

Miles Goodhew linux at m0les.com
Wed Jun 7 13:03:00 AEST 2023


Hi Andrew,
  I'm not really able to compare with ZFS, as I'm not all that familiar with the intricacies of its capabilities. However, I can talk about a couple of the data-integrity weirdnesses we had with regard to "hardware problems" on the cluster I dealt with. So herein lie some taradiddles.

  I'll address the last question first: Ceph works *fine* with nodes that have a single storage device (you just need at least 3 such nodes). You can force Ceph to function with fewer than 3 storage nodes (i.e. 3 separate networked computers), but it'll fight you and whine incessantly. The default failure domain is per-node and the default redundancy is 3-copy replication, so each copy will live on a different storage node.
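
  As a rough illustration of checking those defaults on a running cluster (the pool name "rbd" below is just an example, not anything from our setup), here's a minimal Python sketch over the standard CLI:

    import json
    import subprocess

    def ceph_json(*args):
        """Run a ceph CLI command and parse its JSON output."""
        out = subprocess.run(
            ["ceph", *args, "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        return json.loads(out)

    # Replica count for a pool (3 by default).
    print("replica count:", ceph_json("osd", "pool", "get", "rbd", "size")["size"])

    # CRUSH rules show the failure domain: the default replicated rule
    # does "chooseleaf ... type host", i.e. one copy per storage node.
    for rule in ceph_json("osd", "crush", "rule", "dump"):
        domains = [s.get("type") for s in rule["steps"] if s.get("op", "").startswith("choose")]
        print(rule["rule_name"], "failure domain(s):", domains)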

  Ceph *intends* to have good data integrity, but if the stars align just the right way, it can get into a feedback loop that works against overall system integrity. Two big incidents stick out to me:

  The first involved disk read errors in our "spinning disk" pools. This was on Ceph 12 ("Luminous"), which is pretty old, and I suspect the issue might be mitigated a bit in later versions. The hardware vendor probably preferred/intended its customers to use the nodes' inbuilt hardware RAID to make read errors disappear from the OS's point of view. However, we wanted Ceph to handle redundancy, so we didn't use the hardware RAID. The Ceph convention is that if a disk gets a read error, it's "bad": it's assumed to be degrading/remapping sectors already and you should replace it.

  About 5% of the disks exhibited a behaviour where they'd get a single block read error a long time after commencing service, then another one about 15 minutes later. A small number of disks (<1%) got multiple read errors on an ongoing basis ("really bad" disks). Ceph just "gave up" on fixing a redundancy set (called a "placement group", or PG, in Ceph parlance) if it hit a block read error, so the PG remained inconsistent and a warning was flagged. Telling Ceph to repair the PG is a pretty simple fix, but it's a manual process.

  The departmental change policy was pretty inflexible and demanded that any non-automated repair process involved filling in a form that took about 5 minutes, getting 2nd-level management approval, and exemptions from the SIT and PIT teams (which involved filling in another form and then *finding* and cajoling the team leads to approve it), then fronting a two-hour Friday meeting to defend your change; if approved, the change then needed to be implemented between 8 pm and 4 am the next Tuesday night (I am not joking). So instead I just set up a cron job to repair any inconsistent PGs every hour.

  This worked "fine" (i.e. a hack to fit the rules) except for those "really bad" disks, which could often get a read error *while* a PG on them was being repaired. This was a pretty rare occurrence, but it meant the PG needed some fairly drastic surgery to get it unstuck. So I also implemented another automated process to hunt for disks "going really bad" and remove them from the cluster before this could happen. These bad disks had a pretty big impact on any VMs using them, resulting in I/O delays of multiple tens of seconds.
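
  For the curious, that cron job was roughly the following shape (a simplified sketch rather than the actual script, leaning on the standard "rados list-inconsistent-pg" and "ceph pg repair" commands):

    import json
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    # Walk every pool and ask for any PGs that scrubbing has flagged
    # as inconsistent.
    for pool in run(["ceph", "osd", "pool", "ls"]).split():
        for pgid in json.loads(run(["rados", "list-inconsistent-pg", pool])):
            # Kick off a repair; Ceph rebuilds the bad object(s) from a
            # surviving good copy within the PG.
            subprocess.run(["ceph", "pg", "repair", pgid], check=False)

  Run it hourly from cron, e.g. "0 * * * * /usr/local/bin/repair-pgs.py" (path illustrative).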

  Another big problem we had was an "imperfect storm" of networking problems. The department had a single networking vendor (managed externally to my team) with a single flat network across the whole department (not how we'd prefer it done). The networking setup meant that all data had to traverse leaf/spine switches to get between any two nodes. In the Ceph/OpenStack pod we had 2 leaf and 2 spine switches with redundant, cross-connected links between them. Each of these 4 links "thought" it was configured correctly, but an incomplete/failed config update in the aforementioned 8 pm-4 am change window resulted in one link dropping all traffic. That gave us roughly 1/4 packet loss and, because of the 3-way handshake, only about 1/4 of TCP connections could be established. So far, so bad.

  Ceph's OSDs were going down and coming back up quickly as a result. The Ceph management software saw this as failing nodes and started marking all these "flapping" nodes as "out" (i.e. "I know this exists, but I'll just ignore it for now and not try to talk to it"). As a result it started trying to rebuild redundant copies of data on other nodes (...which were also "flapping"). Fun times! It took a couple of days to convince the right guy in the networking team to diagnose and fix the networking issues. There's a special place in hell for the vendor's support engineer who determined that, because the networking hardware said all links were fine, it couldn't be a networking issue.
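
  (As an aside, and not something we did at the time: the usual way to stop Ceph's failure handling from making a known network outage worse is to set the cluster-wide "noout"/"nodown" flags until the network is fixed, then clear them. A minimal sketch:)

    import subprocess

    # Tell Ceph not to mark flapping OSDs "down"/"out", so it stops
    # trying to rebuild data onto equally-affected nodes.
    for flag in ("noout", "nodown"):
        subprocess.run(["ceph", "osd", "set", flag], check=True)

    # ...once the network is healthy again, resume normal recovery:
    # for flag in ("noout", "nodown"):
    #     subprocess.run(["ceph", "osd", "unset", flag], check=True)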

OK, that's more than enough waffle. Hope that was of some help.

M0les.

On Wed, 7 Jun 2023, at 07:53, Andrew Radke wrote:
> Hi Miles,
> 
> I’ve been a ZFS guy from long before Linux had heard of it. The data integrity in ZFS, performance with the ARC and L2ARC on modern machines with excess RAM and fast storage, and snapshots that can be taken and shoved around the network or to external disks are just part of why I rarely think about other filesystems.
> 
> But I’ve looked at Ceph a few times over the last few years since Proxmox has it largely built in. I’m loath to reduce the level of data integrity I have now with ZFS. How do you find Ceph compares?
> 
> Also how well does Ceph work with nodes with only one storage device in them? For instance we have some that only support one NVMe and one SATA, so we could put the boot/OS on a SATA SSD and then use the NVMe for Ceph.
> 
> Cheers,
> Andrew
> 
>> On 6 Jun 2023, at 10:45 am, Miles Goodhew via linux-aus <linux-aus at lists.linux.org.au> wrote:
>> 
>> Hi Anestis,
>>   I used to be "The Ceph guy" at a large and annoying government department. I think the nutshell differences I see are:
>>  
>> Gluster:
>>  • Smaller scale (5-ish nodes max, I think)
>>  • Network filesystem only
>>  • Integrated services (storage and control/mgmt on the same boxes)
>>  • Limited redundancy and failure-domain options
>>  • A little simpler to set up on its own
>> Ceph:
>>  • Scales up to gigantic, multi-region clusters
>>  • Block storage (RBD), File storage (CephFS) and Object storage (RGW) options available
>>  • Control/mgmt can be on separate nodes (And should be unless you have a really small cluster)
>>  • Any speed, redundancy (replication or erasure coding) or failure-domain setup you can think of. You can have multiple setups for different storage pools within the cluster.
>>  • Takes a bit more planning and implementation to deploy
>> Like Neill said: OpenStack uses the RBD application to present "disk-like" virtual storage devices to the compute nodes for the VMs to use. The old Red Hat Enterprise Virtualisation product (based on oVirt) *used* to use Gluster as its network storage system (putting disk images as files on top of it). However, I'm not sure this is still the case.
>> 
>> CephFS works really well as an NFS replacement (it's just a lot more fiddly to set up). RGW can present itself as either S3 or Swift protocol (Or a "weird" NFS version too - but don't go there).
>> 
>> Hope that's enough, but not too much info,
>> 
>> M0les.
>> 
>> On Tue, 6 Jun 2023, at 04:55, Anestis Kozakis via linux-aus wrote:
>>> I was wondering if people could summarize for me the differences, as well as the pros and cons, of GlusterFS vs CephFS in regards to the following uses:
>>> 
>>> File Server/System and creating Virtual Machines and Containers.
>>> 
>>> I will, of course, do my own research, but I am looking to get other people's experiences and opinions.
>>> 
>>> Anestis.
>>> --
>>> Anestis Kozakis | kenosti at gmail.com
>>> 
>>> _______________________________________________
>>> linux-aus mailing list
>>> linux-aus at lists.linux.org.au
>>> http://lists.linux.org.au/mailman/listinfo/linux-aus
>>> 
>>> To unsubscribe from this list, send a blank email to
>>> linux-aus-unsubscribe at lists.linux.org.au
>> 
>> _______________________________________________
>> linux-aus mailing list
>> linux-aus at lists.linux.org.au
>> http://lists.linux.org.au/mailman/listinfo/linux-aus
>> 
>> To unsubscribe from this list, send a blank email to
>> linux-aus-unsubscribe at lists.linux.org.au