
I've been testing the Cluster Suite on CentOS 6.4 and had it working fine, but I noticed today [8th August, when this question was originally asked] that it no longer accepts the config that previously worked. I tried to recreate a configuration from scratch using CCS, but that gave validation errors.


Edited 21st August:

I've now reinstalled the box completely from CentOS 6.4 x86_64 minimal install, adding the following packages and their dependencies:

yum install bind-utils dhcp dos2unix man man-pages man-pages-overrides nano nmap ntp rsync tcpdump unix2dos vim-enhanced wget

and

yum install rgmanager ccs
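
(Note that ccs -h <host> talks to the ricci agent on the target node, so ricci has to be installed, running, and given a password before the commands below will work; on CentOS 6 that is roughly:

yum install ricci
passwd ricci            # ccs authenticates against this password
service ricci start
chkconfig ricci on      # start ricci on boot
)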

The following commands all worked:

ccs -h ha-01 --createcluster test-ha
ccs -h ha-01 --addnode ha-01
ccs -h ha-01 --addnode ha-02
ccs -h ha-01 --addresource ip address=10.1.1.3 monitor_link=1
ccs -h ha-01 --addresource ip address=10.1.1.4 monitor_link=1
ccs -h ha-01 --addresource ip address=10.110.0.3 monitor_link=1
ccs -h ha-01 --addresource ip address=10.110.8.3 monitor_link=1
ccs -h ha-01 --addservice routing-a autostart=1 recovery=restart
ccs -h ha-01 --addservice routing-b autostart=1 recovery=restart
ccs -h ha-01 --addsubservice routing-a ip ref=10.1.1.3
ccs -h ha-01 --addsubservice routing-a ip ref=10.110.0.3
ccs -h ha-01 --addsubservice routing-b ip ref=10.1.1.4
ccs -h ha-01 --addsubservice routing-b ip ref=10.110.8.3

and resulted in the following config:

<?xml version="1.0"?>
<cluster config_version="13" name="test-ha">
    <fence_daemon/>
    <clusternodes>
        <clusternode name="ha-01" nodeid="1"/>
        <clusternode name="ha-02" nodeid="2"/>
    </clusternodes>
    <cman/>
    <fencedevices/>
    <rm>
        <failoverdomains/>
        <resources>
            <ip address="10.1.1.3" monitor_link="1"/>
            <ip address="10.1.1.4" monitor_link="1"/>
            <ip address="10.110.0.3" monitor_link="1"/>
            <ip address="10.110.8.3" monitor_link="1"/>
        </resources>
        <service autostart="1" name="routing-a" recovery="restart">
            <ip ref="10.1.1.3"/>
            <ip ref="10.110.0.3"/>
        </service>
        <service autostart="1" name="routing-b" recovery="restart">
            <ip ref="10.1.1.4"/>
            <ip ref="10.110.8.3"/>
        </service>
    </rm>
</cluster>
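
Not shown above: to distribute the resulting config to both nodes, ccs can sync and activate it in one step (assuming ricci is running on each node):

ccs -h ha-01 --sync --activate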

However, if I use ccs_config_validate or try to start the cman service, it fails with:

Relax-NG validity error : Extra element rm in interleave
tempfile:10: element rm: Relax-NG validity error : Element cluster failed to validate content
Configuration fails to validate
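
The same check can be reproduced outside of cman by running the Relax-NG validation by hand; a sketch, assuming the default CentOS 6 schema path:

# validate the live config against the schema that cman/ccs use
xmllint --noout --relaxng /var/lib/cluster/cluster.rng /etc/cluster/cluster.conf

xmllint reports the failing elements with line numbers, which helps pin down which part of <rm> the schema is rejecting.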

What's going on? This used to work!


2 Answers


I think you are missing the failover domains. If you want to define a service on a Red Hat cluster, you first need to define a failover domain; you can use one failover domain for many services, or one per service.

If you need more information about failover domains, see "man clurgmgrd":

A failover domain is an ordered subset of members to which a service may be bound.
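
A minimal sketch of that with ccs, using a made-up domain name (routing-fd) and priorities for illustration:

# create an ordered failover domain and put both nodes in it
ccs -h ha-01 --addfailoverdomain routing-fd ordered
ccs -h ha-01 --addfailoverdomainnode routing-fd ha-01 1
ccs -h ha-01 --addfailoverdomainnode routing-fd ha-02 2

# services then bind to it via domain=, e.g. when (re)creating them
ccs -h ha-01 --addservice routing-a domain=routing-fd autostart=1 recovery=restart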

c4f4t0r

It's just started working again, after more yum update dancing. I have compared the old and new /var/lib/cluster/cluster.rng and, surprise, surprise, there's a difference: the schema on the systems that didn't work had no definitions at all for the <ip> element.

The current incarnation of the system was installed from the same minimal CD, and I have a step-by-step procedure of commands to cut and paste. That procedure worked several times while I was developing it, then failed for nearly two months, and now works again. I've built the box about half a dozen times, so I doubt the procedure itself is at fault.

A slip-up on Red Hat's part, perhaps, but I'm not sure how to find out what changes were checked into this file over the last two months.
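
One way to narrow that down is to check whether the schema file is package-owned or generated, then look at the owning package's changelog and the local update history. A sketch, assuming the schema tooling lives in the cman package (as on stock CentOS 6); adjust the package name to whatever rpm -qf reports:

# is the schema owned by a package, or generated locally?
rpm -qf /var/lib/cluster/cluster.rng

# recent upstream changes to the suspected owner
rpm -q --changelog cman | head -40

# when the package was installed/updated on this box
yum history list cman

If rpm -qf reports that the file is not owned by any package, it is generated locally (recent RHEL 6 builds assemble it with ccs_update_schema from the installed packages' metadata), which would explain how a missing <ip> definition could appear without an obvious package change.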