Please read the following documents to become familiar with the Gluster architecture before addressing the issues.
Quickstart guide
Architecture
CLI reference
When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the outlined procedures below.
Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.
Severity-Based Actions:
Severity-Specific Notifications:
Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.
This process ensures effective response and resolution for all alerts based on severity and priority.
Dashboard & Row | Panel | Panel Description | Query | Query Description | Query Operating Range | Metrics | Metric Description | Metric Operating Range | SEVERITY: CRITICAL | SEVERITY: WARNING | SEVERITY: OK |
---|---|---|---|---|---|---|---|---|---|---|---|
1.2 | Client mount status psorbit-node01 | displays success when client mount is successful on a gluster node | gluster_mount_successful{job="$job",instance="$node"} | Checks if mountpoint exists, returns a bool value 0 or 1 | 0,1 | gluster_mount_successful{job="$job",instance="$node"} | Checks if mountpoint exists, returns a bool value 0 or 1 | 0,1 | !1 | 1 | |
1.1 | Volume status | This panel shows volume status | gluster_volume_status{instance="$node",volume="$volume",job="$job"} | it returns the requested volume’s status, 1 if ok, 0 error | 0,1 | gluster_volume_status{instance="$node",volume="$volume",job="$job"} | it returns the requested volume’s status, 1 if ok, 0 error | 0,1 | !1 | 0 | |
1.2 | Peers online | This panel shows how many peers are online currently | count(gluster_up{job="$job"}==1) | this query counts the number of gluster_up metrics returning the value `1` | 0,1,2,3 | gluster_up | Was the last query of Gluster successful. | 0,1 | <2 | !3 | 3 |
1.3 | Brick status(writeable) | This panel shows if a given brick is in writeable state | gluster_volume_writeable{job="$job",instance="$node"} | The metric `gluster_volume_writeable{job="$job",instance="$node"}` indicates whether a Gluster volume is writable (`1` if writable, `0` otherwise) for the specified job and node. | 0,1 | gluster_volume_writeable | Writes and deletes file in Volume and checks if it is writeable | 0,1 | !1 | 1 | |
6.2 | Inodes status | This panel shows how many inodes are available and how many are used | gluster_node_inodes_total{volume=~"$volume",instance="$node",job="$job"} | these queries collectively show the total, used, and free inodes for the GlusterFS in a pie chart | For free: 44152-11104212 | gluster_node_inodes_total | Total inodes reported for each node on each instance. Labels are to distinguish origins | | <10% | >= 10% | |
| | | gluster_node_inodes_free{volume=~"$volume",instance="$node",job="$job"} | | | gluster_node_inodes_free | Free inodes reported for each node on each instance. Labels are to distinguish origins | | | | |
| | | gluster_node_inodes_total{volume=~"$volume",instance="$node",job="$job"}-gluster_node_inodes_free{volume=~"$volume",instance="$node",job="$job"} | | | gluster_node_inodes_total | Total inodes reported for each node on each instance. Labels are to distinguish origins | 107352-11108352 | | | |
1.4 | Brick disk space status | This panel shows disk space available by MB and usage in a pie chart | gluster_node_size_bytes_total{volume=~"$volume",instance="$node",job="$job"} | Total size (in bytes) of the specified Gluster brick on the given node and job. | For free: 0MB to 294.66MB | gluster_node_size_bytes_total | Total bytes reported for each node on each instance. Labels are to distinguish origins | 0 to 294.66MB | < 20% | > 20% | |
| | | gluster_node_size_free_bytes{volume=~"$volume",instance="$node",hostname!="pssb1abm003",job="$job"} | Free space (in bytes) available on the specified Gluster brick | | gluster_node_size_free_bytes | Free bytes reported for each node on each instance. Labels are to distinguish origins | | | | |
| | | gluster_node_size_bytes_total{volume=~"$volume",hostname!="pssb1abm003",instance="$node",job="$job"}-gluster_node_size_free_bytes{volume=~"$volume",hostname!="pssb1abm003",instance="$node",job="$job"} | Used space (in bytes) on the specified Gluster brick | | | | | | | |
4.1 | Max FOP Latency(Node wise) | This panel shows max FOP latency for a range of file operations | gluster_brick_fop_latency_max{instance="$node",volume="$volume",job="$job"} | The metric `gluster_brick_fop_latency_max` represents the maximum file operation latency (in seconds) for a specific Gluster brick on the given node, volume, and job. | 0-infinite | gluster_brick_fop_latency_max | Maximum fileoperations latency over total uptime | 0-infinite | > 6seconds | < 6 seconds |
The gluster_mount_successful metric in Gluster monitoring indicates whether a Gluster volume is successfully mounted and accessible. When gluster_mount_successful is 1, it means the volume is mounted and operational. If the metric is 0, it signifies that the volume mount has failed due to issues such as network problems, configuration errors, or service unavailability. The C3 team must provide all relevant data to the DevOps team if the C3 remedy is unsuccessful.
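A quick way to read this metric directly on the node is to query the gluster exporter's metrics endpoint; a minimal sketch, assuming the exporter listens on port 9106 (the port shown for gluster-exp in the netstat output later in this runbook):
```
# Print the current gluster_mount_successful samples straight from the exporter (port is an assumption; adjust if needed).
curl -s http://localhost:9106/metrics | grep '^gluster_mount_successful'
```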
Instance name, IP address, age of alert in firing state: Collect the instance name, the IP address of the instance, and the total time the alert has been in the firing state.
For a no-data alert: Check the gluster_exporter status first.
systemctl status gluster_exporter
to check the status of the gluster_exporter service. Service Status: Verify the glusterd service status on the instance where the metric reported 0 or no data. Check if the service is in an active, activating, or failed state.
systemctl status glusterd
to see what state the service is in. Process status: Verify that the glusterd process is running, along with its child processes, as shown below:
ps aux | grep gluster
and check whether the gluster processes are running by verifying that the output contains entries like: root 775 0.1 0.4 616272 33996 ? SLsl Nov11 34:21 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root 1133 0.2 0.1 1809192 8700 ? SLsl Nov11 57:14 /usr/sbin/glusterfsd -s pssb1abm003 --volfile-id pssb_dfs.pssb1abm003.data-export-vdb-brick -p /var/run/gluster/vols/pssb_dfs/pssb1abm003-data-export-vdb-brick.pid -S /var/run/gluster/54a9732b9e4c6146.socket --brick-name /data/export/vdb/brick -l /var/log/glusterfs/bricks/data-export-vdb-brick.log --xlator-option *-posix.glusterd-uuid=4c72a811-a357-42e5-9b8f-8343e9c35fe4 --process-name brick --brick-port 49164 --xlator-option pssb_dfs-server.listen-port=49164
root 1166 0.0 0.0 810940 3420 ? SLsl Nov11 0:52 /usr/sbin/glusterfs -s localhost --volfile-id shd/pssb_dfs -p /var/run/gluster/shd/pssb_dfs/pssb_dfs-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/26302251ede1f174.socket --xlator-option *replicate*.node-uuid=4c72a811-a357-42e5-9b8f-8343e9c35fe4 --process-name glustershd --client-pid=-6
root 2537 0.0 0.0 730560 6344 ? SLsl Nov11 10:25 /usr/sbin/glusterfs --process-name fuse --volfile-server=pssb1abm003 --volfile-id=pssb_dfs /data/pssb
Port status: Check that the expected ports are listening and owned by the correct gluster processes.
netstat -tlnp | grep gluster
to see if the port number 24007
and 49164
are mentioned in the output. Disk space & mount status: Check that storage space is available for Gluster to function.
For the pssb cluster, check in /data/pssb:
df -h / && df -h /data/pssb
to check for disk space available for given paths.
Sample output:Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p4 25G 17G 7.2G 70% /
Filesystem Size Used Avail Use% Mounted on
pssb1abm003:pssb_dfs 295M 56M 240M 19% /data/pssb
For the psorbit cluster, check in /data/ps/orbit:
df -h / && df -h /data/ps/orbit/
to check for disk space available for given paths.
Sample output:Filesystem Size Used Avail Use% Mounted on
/dev/vda5 25G 14G 9.7G 59% /
Filesystem Size Used Avail Use% Mounted on
psorbit-node01:/ps_orbit_dfs 295M 178M 118M 61% /data/ps/orbit
Gluster volume information: Check if volume list contains the volume that should be on the given cluster.
To list available volumes: gluster volume list
Sample output:
For pssb_cluster:
pssb_dfs
For psorbit_cluster:
ps_orbit_dfs
To view the information of the volume available on the node, use gluster volume info. For pssb_cluster:
Volume Name: pssb_dfs
Type: Replicate
Volume ID: 35ea87a9-f105-4591-a6ff-04d407b8e457
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 5 = 5
Transport-type: tcp
Bricks:
Brick1: pssb1avm001:/export/vdb/brick
Brick2: pssb1avm002:/export/vdb/brick
Brick3: pssb1abm003:/data/export/vdb/brick
Brick4: pssb1avm004:/export/vdb/brick
Brick5: pssb1avm005:/export/vdb/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.client-io-threads: off
For psorbit cluster:
Volume Name: ps_orbit_dfs
Type: Replicate
Volume ID: 0a88ae36-d097-4747-afca-9587b3f9d114
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: psorbit-node01:/export/vdb/brick
Brick2: psorbit-node02:/export/vdb/brick
Brick3: psorbit-node03:/export/vdb/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.client-io-threads: off
Log messages: Submit the log messages from the processes given below.
tail -100f /var/log/glusterfs/glusterd.log
to get the glusterd service logs (maintained separately by the gluster daemon itself).
For volume specific logs:
For ps_orbit_dfs
volume: Use tail -100f /var/log/glusterfs/data-ps-orbit.log
For pssb_dfs
volume: Use tail -100f /var/log/glusterfs/data-pssb.log
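As a convenience, the logs above can be bundled in one step before handover; a minimal sketch using the paths listed in this section (a missing volume log is simply skipped):
```
# Collect the last 200 lines of the daemon and volume logs into a single archive for the DevOps handover.
mkdir -p /tmp/gluster-diag
tail -n 200 /var/log/glusterfs/glusterd.log > /tmp/gluster-diag/glusterd.log
tail -n 200 /var/log/glusterfs/data-pssb.log > /tmp/gluster-diag/data-pssb.log 2>/dev/null
tail -n 200 /var/log/glusterfs/data-ps-orbit.log > /tmp/gluster-diag/data-ps-orbit.log 2>/dev/null
tar -czf /tmp/gluster-diag.tar.gz -C /tmp gluster-diag
```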
Check ufw status: Check whether the firewall is blocking the functional ports:
ufw status | grep -E '^(491|2400)'
and submit the listed ports. When the gluster_mount_successful metric returns 0, please check the following metrics/data:
gluster_volume_status{instance=""}
(with the firing instance filled in) to check the status of the volume on that node. Follow the remedies given below to recover from the alert firing state.
Ensure the daemon and the other glusterfs processes/ports are present on the instance:
netstat -tlnp | grep gluster
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 857/glusterd
tcp 0 0 0.0.0.0:49229 0.0.0.0:* LISTEN 2763/glusterfsd
tcp6 0 0 :::9106 :::* LISTEN 4087700/gluster-exp
24007
and 49229
are required ports for gluster to function.
systemctl status glusterd
to check the status of the gluster daemon and activate it if it is in a failed state.
Use: systemctl start glusterd
to start the service, then use: systemctl status glusterd
to check the glusterd status and relevant processes.
Sample output:● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2024-11-02 00:33:20 IST; 3 weeks 4 days ago
Docs: man:glusterd(8)
Main PID: 857 (glusterd)
Tasks: 89 (limit: 9388)
Memory: 62.8M
CPU: 4h 42min 31.033s
CGroup: /system.slice/glusterd.service
├─ 857 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
├─2763 /usr/sbin/glusterfsd -s psorbit-node01 --volfile-id ps_orbit_dfs.psorbit-node01.export-vdb-brick -p /var/run/gluster/vols/ps_orbit_dfs/psorbit-node01-export>
└─2835 /usr/sbin/glusterfs -s localhost --volfile-id shd/ps_orbit_dfs -p /var/run/gluster/shd/ps_orbit_dfs/ps_orbit_dfs-shd.pid -l /var/log/glusterfs/glustershd.lo>
Notice: journal has been rotated since unit was started, output may be incomplete.
Ensure that all three processes in the CGroup section exist on the instance; if not, try restarting the service again. If none of the remedies below work, hand the issue over to the DevOps team.
Check and resolve mount issues:
The client mount depends on a DNS name: psorbit-node01 for the psorbit cluster and pssb1avm01 for the pssb cluster (the name differs for each node).
Check that the /etc/hosts entries match the actual addresses: use ip a or ifconfig to find the IP address of the current instance, and cat /etc/hostname for the hostname that Gluster uses (you can also see this in cat /etc/fstab).
Use dmesg | grep -i mount for kernel-level mount logs (or review the full dmesg).
Check for mount logs in kern.log:
Use: grep -i "mount" /var/log/kern.log
Sample output:
Nov 11 18:04:36 pssb1abm003 kernel: [ 0.152602] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
Nov 11 18:04:36 pssb1abm003 kernel: [ 0.152613] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
Nov 11 18:04:36 pssb1abm003 kernel: [ 4.066694] EXT4-fs (nvme0n1p4): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 4.707828] EXT4-fs (nvme0n1p4): re-mounted. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 5.569647] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 5.583832] FAT-fs (nvme0n1p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
Nov 11 18:04:36 pssb1abm003 kernel: [ 6.140376] EXT4-fs (nvme0n1p5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 7.043893] EXT4-fs (nvme0n1p6): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 7.093467] audit: type=1400 audit(1731328468.240:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=614 comm="apparmor_parser"
Nov 11 18:04:36 pssb1abm003 kernel: [ 15.633758] audit: type=1400 audit(1731328476.780:13): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=1062 comm="apparmor_parser"
Check connectivity to the instance hostname: ping <instance_hostname>
Sample output:
PING pssb1abm003 (172.21.0.63) 56(84) bytes of data.
64 bytes from pssb1abm003 (172.21.0.63): icmp_seq=1 ttl=64 time=0.017 ms
64 bytes from pssb1abm003 (172.21.0.63): icmp_seq=2 ttl=64 time=0.018 ms
^C
--- pssb1abm003 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1003ms
Refer to cat /etc/fstab
for the hostname the mount uses; check for a line similar to:
pssb1abm003:pssb_dfs /data/pssb glusterfs defaults,_netdev 1 0
Restart the gluster daemon after the above remedy and ensure it is in a running state and that all the processes exist as described in the earlier remedy.
Check for mounting directory and volume disk status:
Check whether the volume disk is mounted correctly, and mount it if it is not.
Use df -h | grep /dev/vdb
You should see output similar to:
/dev/vdb 295M 175M 121M 60% /export/vdb
If the given disk/partition isn't mounted on the instance, try mounting it:
Make sure of existence of the line: /dev/vdb /export/vdb xfs defaults 0 0
in the /etc/fstab
on the instance (common to both clusters, except for pssb1abm003).
If the above line doesn't exist in /etc/fstab
:
add and mount using:
echo "/dev/vdb /export/vdb xfs defaults 0 0" >> /etc/fstab
mkdir -p /export/vdb && mount -a
And verify using df -h | grep /dev/vdb
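A minimal sketch of the same remedy in idempotent form (it only appends the fstab entry when it is missing, using the same paths as above):
```
# Append the /dev/vdb entry only if it is not already present, then create the mountpoint, mount, and verify.
grep -q '^/dev/vdb /export/vdb ' /etc/fstab || echo "/dev/vdb /export/vdb xfs defaults 0 0" >> /etc/fstab
mkdir -p /export/vdb && mount -a
df -h | grep /dev/vdb
```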
If the issue still exists, forward the issue to devops team.
Check if the target mount directory exists by checking with ls <mount directory>
for the particular instance.
For pssb cluster: Use ll -d /data/pssb && ll /data/pssb
(refer to /etc/fstab file for correct mount directory)
Sample output:
drwxr-xr-x 7 tomcat tomcat 4096 Nov 26 17:18 /data/pssb/
root@pssb1abm003:~# ll -d /data/pssb && ll /data/pssb
drwxr-xr-x 7 tomcat tomcat 4096 Nov 26 17:18 /data/pssb/
total 20
drwxr-xr-x 7 tomcat tomcat 4096 Nov 26 17:18 ./
drwxr-xr-x 8 root root 4096 Oct 22 16:10 ../
drwx------ 2 tomcat tomcat 4096 Oct 21 17:43 archive/
drwx------ 2 tomcat tomcat 4096 Nov 25 15:53 health_monitor/
drwx------ 2 tomcat tomcat 4096 Oct 21 17:43 reports/
For psorbit cluster: Use ll -d /data/ps/orbit && ll /data/ps/orbit
Sample output:
drwxr-xr-x 6 tomcat tomcat 4096 Nov 27 12:10 /data/ps/orbit/
total 12
drwxr-xr-x 6 tomcat tomcat 4096 Nov 27 12:10 ./
drwxr-xr-x 3 root root 4096 Oct 9 12:50 ../
drwx------ 2 tomcat tomcat 6 Nov 27 12:10 health_monitor/
drwx------ 2 tomcat tomcat 4096 Nov 26 23:08 playstore/
The output above is from a healthy server. If you see the mount directory but not the directories/files inside it, the directory was created successfully but there must be an issue with the mount itself.
If the directory exists and files inside them are not existent:
Verify that the following line exists and is correct in /etc/fstab
(the hostnames mentioned here may differ from that of the instance):
For psorbit:
psorbit-node01:/ps_orbit_dfs /data/ps/orbit glusterfs defaults,_netdev 1 0
For pssb:
pssb1abm003:pssb_dfs /data/pssb glusterfs defaults,_netdev 1 0
And use: mount -a
and check using df -h
to view if the mount was successful.
If the directory doesn’t exist at all:
Use mkdir -p /data/pssb && chown tomcat:tomcat /data/pssb
for pssb cluster and mkdir -p /data/ps/orbit/ && chown tomcat:tomcat /data/ps/orbit/
to create the mount directory for the gluster volume.
Restart the gluster daemon after the above remedy and ensure it is in a running state and that all the processes exist as described in the earlier remedy.
Configure firewall rules:
- Use ufw status | grep -E '^(491|2400)'
to check if ports have been allowed for the ranges mentioned in output below.
Sample output:
49152:49252/tcp ALLOW Anywhere 24007:24008/tcp ALLOW Anywhere 49152:49252/tcp (v6) ALLOW Anywhere (v6) 24007:24008/tcp (v6) ALLOW Anywhere (v6)
Allow the required ports if you did not see output like the above.
```
ufw allow 49152:49252/tcp && ufw allow 24007:24008/tcp
```
Ensure that all the collected data described in the sections above is passed along to the DevOps team when the C3 remedies do not work.
When the C3 remedies do not work, the DevOps team is expected to follow these remedies to recover from the alerting state.
Check for disk mount of gluster volumes:
lsblk
to check for the disk attachment.
Sample output:NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sr0 11:0 1 1024M 0 rom
vda 252:0 0 128G 0 disk
├─vda1 252:1 0 1M 0 part
├─vda2 252:2 0 8G 0 part [SWAP]
├─vda3 252:3 0 1G 0 part /boot
├─vda4 252:4 0 60G 0 part /data
├─vda5 252:5 0 25G 0 part /
└─vda6 252:6 0 34G 0 part /opt
vdb 252:16 0 300M 0 disk /export/vdb
You should see a disk named vdb (vdb is the name we use when attaching the disk to the VM) or a similar disk name; check that a 300M disk other than the usual vda disk exists in the output.
If the output does not contain another disk (vdb), debug the VM's XML file to make sure the following configuration and the actual disk exist on the KVM host.
Check the node’s xml file to include(get the vm name using virsh list
and find the node corresponding to the instance of the firing alert):
Use virsh dumpxml ps-orbit-in-demo1a-node01 | grep vdb
to check for the following content exists in the output from the above command:
<disk type='file' device='disk'>
<source file='/data1/d_disks/psorbit-in-demo1a-brick1.img' index='2'/>
<alias name='virtio-disk1'/>
</disk>
If content similar to the above (the disk .img name may differ) doesn't exist, add it to the VM's XML file in the disks section. Make sure to adjust the content to include the correct path for the disk image (.img) file.
After editing the vm’s xml configuration, you need to restart the instance and check for the disk in lsblk
output.
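As an alternative to hand-editing the XML, virsh can attach the disk image directly; a sketch using the example domain and image path from above (paths and target name are illustrative):
```
# Attach the brick image as vdb and persist it in the domain definition.
virsh attach-disk ps-orbit-in-demo1a-node01 \
  /data1/d_disks/psorbit-in-demo1a-brick1.img vdb \
  --driver qemu --subdriver raw --persistent
```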
Check the .img disk file path on the KVM host. Typically, secondary disk files exist on the KVM host in the /data1/d_disks/ directory; check that the relevant file exists (usually the disk file name is the same as the instance name from the virsh list command output). Also check whether the IP address is correctly assigned, matching the /etc/hosts file.
Check the IP assignment in /etc/hosts (cat /etc/hosts) and make sure the instance has the same IP assigned. If the IP assignment has changed, reconfigure it through WinBox following the usual procedures.
After resolving IP assignment issues, restart the glusterd service.
The gluster_up metric in Gluster monitoring indicates whether the exporter's last query of Gluster on a node was successful. When gluster_up is 1, the node is reachable and Gluster responded to the last scrape. If the metric is 0, the query failed due to issues such as network problems, configuration errors, or service unavailability. The C3 team must provide all relevant data to the DevOps team if the C3 remedy is unsuccessful.
Instance name, IP address, age of alert in firing state: Collect the instance name, the IP address of the instance, and the total time the alert has been in the firing state.
For a no-data alert: Check the gluster_exporter status first on the node that returns 0 or no data for the gluster_up metric.
systemctl status gluster_exporter
to check the status of the gluster_exporter service. Peer Status: Check the Gluster peer status to identify disconnected peers on the alert-firing instance.
gluster peer status
Sample Output:Number of Peers: 2
Hostname: psorbit-node03
Uuid: c972e8e7-3471-401a-972a-c4dc2d65727c
State: Peer in Cluster (Disconnected)
Hostname: psorbit-node02
Uuid: 4fd82040-c9fa-4cfb-a706-fd62074d0d28
State: Peer in Cluster (Connected)
The peer state that shows Disconnected
in the output above identifies the node that has been disconnected from the Gluster cluster. Performing the following checks on that node is crucial to bring the Gluster status back to normal.
Collect network status information: Check connectivity:
ping <instance's hostname>
Check port connectivity: telnet <ip> 24007
telnet <ip> 49192
List all connections on ports 49192 and 24007:
lsof -i :49192
lsof -i :24007
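If telnet is not installed on the instance, nc can perform the same reachability check; a minimal sketch with the ports used above:
```
# Zero-I/O connect test against the gluster management and brick ports.
nc -zv <ip> 24007
nc -zv <ip> 49192
```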
Run these on the faulty node(s). Service Status: Verify the glusterd service status on the instance where the metric reported 0 or no data. Check if the service is in an active, activating, or failed state.
systemctl status glusterd
to see what state the service is in. Process status: Verify that the glusterd process is running, along with its child processes, as shown below:
ps aux | grep gluster
and check whether the gluster processes are running by verifying that the output contains entries like: root 775 0.1 0.4 616272 33996 ? SLsl Nov11 34:21 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root 1133 0.2 0.1 1809192 8700 ? SLsl Nov11 57:14 /usr/sbin/glusterfsd -s pssb1abm003 --volfile-id pssb_dfs.pssb1abm003.data-export-vdb-brick -p /var/run/gluster/vols/pssb_dfs/pssb1abm003-data-export-vdb-brick.pid -S /var/run/gluster/54a9732b9e4c6146.socket --brick-name /data/export/vdb/brick -l /var/log/glusterfs/bricks/data-export-vdb-brick.log --xlator-option *-posix.glusterd-uuid=4c72a811-a357-42e5-9b8f-8343e9c35fe4 --process-name brick --brick-port 49164 --xlator-option pssb_dfs-server.listen-port=49164
root 1166 0.0 0.0 810940 3420 ? SLsl Nov11 0:52 /usr/sbin/glusterfs -s localhost --volfile-id shd/pssb_dfs -p /var/run/gluster/shd/pssb_dfs/pssb_dfs-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/26302251ede1f174.socket --xlator-option *replicate*.node-uuid=4c72a811-a357-42e5-9b8f-8343e9c35fe4 --process-name glustershd --client-pid=-6
root 2537 0.0 0.0 730560 6344 ? SLsl Nov11 10:25 /usr/sbin/glusterfs --process-name fuse --volfile-server=pssb1abm003 --volfile-id=pssb_dfs /data/pssb
Port status: Check that the expected ports are listening and owned by the correct gluster processes.
netstat -tlnp | grep gluster
to see if the port number 24007
and 49164
are mentioned in the output. Disk space & mount status: Check that storage space is available for Gluster to function.
For the pssb cluster, check in /data/pssb:
df -h / && df -h /data/pssb
to check for disk space available for given paths.
Sample output:Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p4 25G 17G 7.2G 70% /
Filesystem Size Used Avail Use% Mounted on
pssb1abm003:pssb_dfs 295M 56M 240M 19% /data/pssb
For the psorbit cluster, check in /data/ps/orbit:
df -h / && df -h /data/ps/orbit/
to check for disk space available for given paths.
Sample output:Filesystem Size Used Avail Use% Mounted on
/dev/vda5 25G 14G 9.7G 59% /
Filesystem Size Used Avail Use% Mounted on
psorbit-node01:/ps_orbit_dfs 295M 178M 118M 61% /data/ps/orbit
Gluster volume information: Check if volume list contains the volume that should be on the given cluster.
To list available volumes: gluster volume list
Sample output:
For pssb_cluster:
pssb_dfs
For psorbit_cluster:
ps_orbit_dfs
To view the information of the volume available on the node, use gluster volume info. For pssb_cluster:
Volume Name: pssb_dfs
Type: Replicate
Volume ID: 35ea87a9-f105-4591-a6ff-04d407b8e457
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 5 = 5
Transport-type: tcp
Bricks:
Brick1: pssb1avm001:/export/vdb/brick
Brick2: pssb1avm002:/export/vdb/brick
Brick3: pssb1abm003:/data/export/vdb/brick
Brick4: pssb1avm004:/export/vdb/brick
Brick5: pssb1avm005:/export/vdb/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.client-io-threads: off
For psorbit cluster:
Volume Name: ps_orbit_dfs
Type: Replicate
Volume ID: 0a88ae36-d097-4747-afca-9587b3f9d114
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: psorbit-node01:/export/vdb/brick
Brick2: psorbit-node02:/export/vdb/brick
Brick3: psorbit-node03:/export/vdb/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.client-io-threads: off
Log messages: Submit the log messages from the processes given below.
tail -100f /var/log/glusterfs/glusterd.log
to get the glusterd service logs (maintained separately by the gluster daemon itself).
For volume specific logs:
For ps_orbit_dfs
volume: Use tail -100f /var/log/glusterfs/data-ps-orbit.log
For pssb_dfs
volume: Use tail -100f /var/log/glusterfs/data-pssb.log
When gluster_peers_connected shows a value of less than 2 for the psorbit cluster or less than 4 for the pssb cluster, check the other relevant metrics:
gluster_volume_status: typically shows whether the volume is functional; if this query returns 0, an immediate Gluster recovery is necessary.
gluster_brick_available: check this query for brick availability.
gluster_volume_writeable: check whether the volume is still writable.
gluster_up: shows which peers are online.
Resolve network issues (if any): Check whether network connectivity is functional by using the following process.
gluster peer status
to check what nodes are in disconnected state.
Sample output:Number of Peers: 4
Hostname: pssb1avm002
Uuid: d9c765b5-2c67-426d-a47e-f8fe2ffcdc0e
State: Peer in Cluster (Disconnected)
Hostname: pssb1avm004
Uuid: 5b96ff7e-cd9a-4536-b89c-c72558debef1
State: Peer in Cluster (Connected)
Hostname: pssb1abm003
Uuid: 4c72a811-a357-42e5-9b8f-8343e9c35fe4
State: Peer in Cluster (Connected)
Hostname: pssb1avm005
Uuid: 5fe8fb3b-83b0-42bd-a77d-6e5bc4f4abbb
State: Peer in Cluster (Connected)
gluster peer probe <disconnected hostname>
to try probing the peer.
Sample output:peer probe: Host pssb1abm00x port 24007 already in peer list
Output similar to the above indicates that the peer is already part of the Gluster cluster but is in a disconnected state due to some issue.
Detect network issues: Try probing the server:
ping <hostname>
Sample output:PING pssb1abm003 (172.21.0.63) 56(84) bytes of data.
64 bytes from pssb1abm003 (172.21.0.63): icmp_seq=1 ttl=64 time=28.9 ms
64 bytes from pssb1abm003 (172.21.0.63): icmp_seq=2 ttl=64 time=0.314 ms
^C
--- pssb1abm003 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.314/14.621/28.929/14.307 ms
Try connecting to the port.
telnet <hostname> 49192
, telnet <hostname> 24007
Sample output:Trying 172.21.0.61...
Connected to pssb1avm001.
Escape character is '^]'.
^CConnection closed by foreign host.
If node and port is accessible, troubleshoot for gluster process on the faulty node.
Check for all existent gluster connections:
Use: lsof -i :49192
and lsof -i :24007
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
glusterd 775 root 8u IPv4 25856 0t0 TCP pssb1abm003:24007->pssb1abm003:49149 (ESTABLISHED)
glusterd 775 root 11u IPv4 24085 0t0 TCP *:24007 (LISTEN)
glusterd 775 root 13u IPv4 23243 0t0 TCP pssb1abm003:24007->pssb1avm002:49131 (ESTABLISHED)
glusterd 775 root 14u IPv4 25047 0t0 TCP pssb1abm003:49151->pssb1avm002:24007 (ESTABLISHED)
glusterd 775 root 15u IPv4 25048 0t0 TCP pssb1abm003:49150->pssb1avm005:24007 (ESTABLISHED)
glusterd 775 root 17u IPv4 25050 0t0 TCP pssb1abm003:49148->pssb1avm004:24007 (ESTABLISHED)
glusterd 775 root 18u IPv4 23245 0t0 TCP pssb1abm003:24007->pssb1avm004:49147 (ESTABLISHED)
glusterd 775 root 19u IPv4 25930 0t0 TCP localhost:24007->localhost:49147 (ESTABLISHED)
glusterd 775 root 20u IPv4 33292 0t0 TCP pssb1abm003:24007->pssb1abm003:49119 (ESTABLISHED)
glusterd 775 root 22u IPv4 23249 0t0 TCP pssb1abm003:24007->pssb1avm005:49127 (ESTABLISHED)
glusterd 775 root 25u IPv4 23252 0t0 TCP pssb1abm003:24007->pssb1avm001:49131 (ESTABLISHED)
glusterfs 1133 root 10u IPv4 25067 0t0 TCP pssb1abm003:49149->pssb1abm003:24007 (ESTABLISHED)
glusterfs 1166 root 10u IPv4 24319 0t0 TCP localhost:49147->localhost:24007 (ESTABLISHED)
glusterfs 2537 root 11u IPv4 34251 0t0 TCP pssb1abm003:49119->pssb1abm003:24007 (ESTABLISHED)
glusterfs 1166 root 15u IPv4 24432 0t0 TCP pssb1abm003:49143->pssb1avm001:49192 (ESTABLISHED)
glusterfs 2537 root 12u IPv4 32454 0t0 TCP pssb1abm003:49112->pssb1avm001:49192 (ESTABLISHED)
If the given node name is not in the gluster peer list at all, it indicates that the node has not yet been added to the Gluster file system; the DevOps team has to take care of this in that case.
If there are no network issues, and apart from the disconnected nodes the other peers have connections with their peers, you should troubleshoot on the faulty nodes.
Ensure the daemon and the other glusterfs processes/ports are present on the instance:
netstat -tlnp | grep gluster
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 857/glusterd
tcp 0 0 0.0.0.0:49229 0.0.0.0:* LISTEN 2763/glusterfsd
tcp6 0 0 :::9106 :::* LISTEN 4087700/gluster-exp
24007
and 49229
are required ports for gluster to function.
systemctl status glusterd
to check the status of the gluster daemon and activate it if it is in a failed state.
Use: systemctl start glusterd
to start the service, then use: systemctl status glusterd
to check the glusterd status and relevant processes.
Sample output:● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2024-11-02 00:33:20 IST; 3 weeks 4 days ago
Docs: man:glusterd(8)
Main PID: 857 (glusterd)
Tasks: 89 (limit: 9388)
Memory: 62.8M
CPU: 4h 42min 31.033s
CGroup: /system.slice/glusterd.service
├─ 857 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
├─2763 /usr/sbin/glusterfsd -s psorbit-node01 --volfile-id ps_orbit_dfs.psorbit-node01.export-vdb-brick -p /var/run/gluster/vols/ps_orbit_dfs/psorbit-node01-export>
└─2835 /usr/sbin/glusterfs -s localhost --volfile-id shd/ps_orbit_dfs -p /var/run/gluster/shd/ps_orbit_dfs/ps_orbit_dfs-shd.pid -l /var/log/glusterfs/glustershd.lo>
Notice: journal has been rotated since unit was started, output may be incomplete.
Ensure that all three processes in the CGroup section exist on the instance; if not, try restarting the service again. If none of the remedies below work, hand the issue over to the DevOps team.
Check and resolve mount issues:
The client mount depends on a DNS name: psorbit-node01 for the psorbit cluster and pssb1avm01 for the pssb cluster (the name differs for each node).
Check that the /etc/hosts entries match the actual addresses: use ip a or ifconfig to find the IP address of the current instance, and cat /etc/hostname for the hostname that Gluster uses (you can also see this in cat /etc/fstab).
Use dmesg | grep -i mount for kernel-level mount logs (or review the full dmesg).
Check for mount logs in kern.log:
Use: grep -i "mount" /var/log/kern.log
Sample output:
Nov 11 18:04:36 pssb1abm003 kernel: [ 0.152602] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
Nov 11 18:04:36 pssb1abm003 kernel: [ 0.152613] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
Nov 11 18:04:36 pssb1abm003 kernel: [ 4.066694] EXT4-fs (nvme0n1p4): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 4.707828] EXT4-fs (nvme0n1p4): re-mounted. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 5.569647] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 5.583832] FAT-fs (nvme0n1p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
Nov 11 18:04:36 pssb1abm003 kernel: [ 6.140376] EXT4-fs (nvme0n1p5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 7.043893] EXT4-fs (nvme0n1p6): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Nov 11 18:04:36 pssb1abm003 kernel: [ 7.093467] audit: type=1400 audit(1731328468.240:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=614 comm="apparmor_parser"
Nov 11 18:04:36 pssb1abm003 kernel: [ 15.633758] audit: type=1400 audit(1731328476.780:13): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=1062 comm="apparmor_parser"
Check connectivity to the instance hostname: ping <instance_hostname>
Sample output:
PING pssb1abm003 (172.21.0.63) 56(84) bytes of data.
64 bytes from pssb1abm003 (172.21.0.63): icmp_seq=1 ttl=64 time=0.017 ms
64 bytes from pssb1abm003 (172.21.0.63): icmp_seq=2 ttl=64 time=0.018 ms
^C
--- pssb1abm003 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1003ms
Refer to cat /etc/fstab
for the hostname the mount uses; check for a line similar to:
pssb1abm003:pssb_dfs /data/pssb glusterfs defaults,_netdev 1 0
Restart the gluster daemon after the above remedy and ensure it is in a running state and that all the processes exist as described in the earlier remedy.
Check for mounting directory and volume disk status:
Check whether the volume disk is mounted correctly, and mount it if it is not.
Use df -h | grep /dev/vdb
You should see output similar to:
/dev/vdb 295M 175M 121M 60% /export/vdb
If the given disk/partition isn't mounted on the instance, try mounting it:
Make sure of existence of the line: /dev/vdb /export/vdb xfs defaults 0 0
in the /etc/fstab
on the instance (common to both clusters, except for pssb1abm003).
If the above line doesn't exist in /etc/fstab
:
add and mount using:
echo "/dev/vdb /export/vdb xfs defaults 0 0" >> /etc/fstab
mkdir -p /export/vdb && mount -a
And verify using df -h | grep /dev/vdb
If the issue still exists, forward the issue to devops team.
Check if the target mount directory exists by checking with ls <mount directory>
for the particular instance.
For pssb cluster: Use ll -d /data/pssb && ll /data/pssb
(refer to /etc/fstab file for correct mount directory)
Sample output:
drwxr-xr-x 7 tomcat tomcat 4096 Nov 26 17:18 /data/pssb/
root@pssb1abm003:~# ll -d /data/pssb && ll /data/pssb
drwxr-xr-x 7 tomcat tomcat 4096 Nov 26 17:18 /data/pssb/
total 20
drwxr-xr-x 7 tomcat tomcat 4096 Nov 26 17:18 ./
drwxr-xr-x 8 root root 4096 Oct 22 16:10 ../
drwx------ 2 tomcat tomcat 4096 Oct 21 17:43 archive/
drwx------ 2 tomcat tomcat 4096 Nov 25 15:53 health_monitor/
drwx------ 2 tomcat tomcat 4096 Oct 21 17:43 reports/
For psorbit cluster: Use ll -d /data/ps/orbit && ll /data/ps/orbit
Sample output:
drwxr-xr-x 6 tomcat tomcat 4096 Nov 27 12:10 /data/ps/orbit/
total 12
drwxr-xr-x 6 tomcat tomcat 4096 Nov 27 12:10 ./
drwxr-xr-x 3 root root 4096 Oct 9 12:50 ../
drwx------ 2 tomcat tomcat 6 Nov 27 12:10 health_monitor/
drwx------ 2 tomcat tomcat 4096 Nov 26 23:08 playstore/
The output above is from a healthy server. If you see the mount directory but not the directories/files inside it, the directory was created successfully but there must be an issue with the mount itself.
If the directory exists and files inside them are not existent:
Verify that the following line exists and is correct in /etc/fstab
(the hostnames mentioned here may differ from that of the instance):
For psorbit:
psorbit-node01:/ps_orbit_dfs /data/ps/orbit glusterfs defaults,_netdev 1 0
For pssb:
pssb1abm003:pssb_dfs /data/pssb glusterfs defaults,_netdev 1 0
And use: mount -a
and check using df -h
to view if the mount was successful.
If the directory doesn’t exist at all:
Use mkdir -p /data/pssb && chown tomcat:tomcat /data/pssb
for pssb cluster and mkdir -p /data/ps/orbit/ && chown tomcat:tomcat /data/ps/orbit/
to create the mount directory for the gluster volume.
Restart the gluster daemon after the above remedy and ensure it is in a running state and that all the processes exist as described in the earlier remedy.
Ensure that all the collected data described in the sections above is passed along to the DevOps team when the C3 remedies do not work.
In case the C3 remedies fail to restore the status of the Gluster nodes, try performing the following remedies.
Remove data on the brick and restart glusterd:
Kill all gluster-related processes on the faulty node, using: pkill gluster
Remove all the data from the current brick except directories that start with .
: find /export/vdb/brick -type f ! -name ".*" -exec rm -f {} +
You should be left with the following directories: ls -la /export/vdb/brick
Sample output:
drw------- 262 root root 8192 Oct 21 17:18 .glusterfs
drwxr-xr-x 2 root root 6 Oct 21 17:39 .glusterfs-anonymous-inode-35ea87a9-f105-4591-a6ff-04d407b8e457
Restart the glusterd: systemctl restart glusterd && systemctl status glusterd
Sample output:
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2024-11-11 18:04:39 IST; 2 weeks 1 day ago
Docs: man:glusterd(8)
Main PID: 775 (glusterd)
Tasks: 93 (limit: 9250)
Memory: 66.1M
CPU: 1h 32min 53.992s
CGroup: /system.slice/glusterd.service
├─ 775 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
├─1133 /usr/sbin/glusterfsd -s pssb1abm003 --volfile-id pssb_dfs.pssb1abm003.data-export-vdb-brick -p /var/run/gluster/vols/pssb_dfs/pssb1abm003-data-export-vdb-br>
└─1166 /usr/sbin/glusterfs -s localhost --volfile-id shd/pssb_dfs -p /var/run/gluster/shd/pssb_dfs/pssb_dfs-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/ru
For gluster to function correctly, you should see all three processes on the output.
Start the gluster exporter: systemctl start gluster-exporter
Check port status: netstat -tlnp | grep gluster
Sample output:
tcp 0 0 0.0.0.0:49164 0.0.0.0:* LISTEN 1133/glusterfsd
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 775/glusterd
Remove the brick and re-initialize it as a new brick: Follow the steps below to re-initialize the faulty brick in case all other remedies fail.
Step-1: Remove the brick from the volume:
Use gluster volume remove-brick <volume name> replica <new replica count (original - 1)> <brick's hostname to be removed>:<brick path> force
Replica count: decrease the replica count by one for each brick removed; for example, in the case of the pssb cluster (5 bricks), the new replica count is 4 when removing the first brick, 3 for the second removal, and so on.
Brick path: Use gluster volume info
to view the brick path for the volume and for the particular hostname.
Step-2: Detach the faulty node from the volume by executing the following commands on a healthy node:
Use gluster peer detach <hostname>
Step-3: Rebalance the volume; this process might involve self-heals and configuration updates to account for the changes in the Gluster cluster.
Use gluster volume rebalance <volume_name> start
Step-4: Log in to the faulty node and clear everything residing on the brick:
Use rm -rf /export/vdb/brick
Step-5: Now add the new reset brick to the trusted pool from another healthy server
Use gluster peer probe <hostname of the new brick>
Step-6: Now login to a healthy gluster node and add the now reset brick to the volume:
Use gluster volume add-brick <volume_name> replica <new replica count(original)> <hostname>:<new_brick_path, typically /export/vdb/brick>
Step-7: Rebalance the volume
Use gluster volume rebalance <volume_name> start
Step-8: Check volume info
Use gluster volume info
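For reference, a condensed sketch of steps 1-8 for the psorbit cluster (replica 3), assuming psorbit-node02 holds the faulty brick; run it from a healthy node, except the rm -rf, which runs on the faulty node:
```
gluster volume remove-brick ps_orbit_dfs replica 2 psorbit-node02:/export/vdb/brick force
gluster peer detach psorbit-node02
gluster volume rebalance ps_orbit_dfs start   # may be refused on a pure replicate volume; safe to skip then
# on psorbit-node02:  rm -rf /export/vdb/brick
gluster peer probe psorbit-node02
gluster volume add-brick ps_orbit_dfs replica 3 psorbit-node02:/export/vdb/brick
gluster volume rebalance ps_orbit_dfs start
gluster volume info ps_orbit_dfs
```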
The metric (gluster_node_size_free_bytes{job="job_name"} / gluster_node_size_bytes_total{job="job_name"}) * 100
represents the percentage of free storage space on a Gluster node. A higher value indicates adequate free space, while a lower value suggests the node is nearing full capacity, potentially leading to performance issues or storage failures.
If storage space becomes critically low and the C3 remedies, such as removing unnecessary files, do not resolve the issue, the C3 team must escalate to the DevOps team with detailed information.
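The same percentage can be spot-checked on the node itself for the brick filesystem; a minimal sketch using the brick mount path from this runbook:
```
# Compute the free-space percentage for /export/vdb from df output.
df --output=avail,size -B1 /export/vdb | awk 'NR==2 {printf "%.1f%% free\n", 100*$1/$2}'
```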
Instance name, IP address, age of alert in firing state: Collect the instance name, the IP address of the instance, and the total time the alert has been in the firing state.
For a no-data alert: Check the gluster_exporter status first on the node that returns 0 or no data for the gluster_up metric.
systemctl status gluster_exporter
to check the status of the gluster_exporter service. Storage on the volume disk: Check the storage available on the disk where the GlusterFS brick resides by using: df -h /export/vdb
Files list: Use ls -lSha /export/vdb/brick/
to list the files on the brick, sorted by size.
Use find /var/log/ -type f -exec du -m {} + | sort -rn | head -n 20
to list out the files occupying most storage on the directory in MB.
df -h /export/vdb
is relevant in the context of low disk space on the GlusterFS. Look for .swp files under the Gluster file system and remove them after clarifying their importance with the DevOps team members and developers (if there are any).
Increase brick storage capacity: To increase disk storage capacity, the entire volume must be removed (with a backup of the existing data), then the bricks must be cleared and new disks created.
To create new disks: Step-1: Login to KVM host and execute the following commands:
qemu-img create -f raw /data1/d_disks/<brick_name>.img <size in megabytes>M
Create as many disk images as there are nodes, with relevant names.
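For example, the following sketch creates a 300 MB raw image per node, matching the vdb size seen in lsblk earlier (the file names are illustrative):
```
# One image per node; name them after the node/brick they will back.
qemu-img create -f raw /data1/d_disks/psorbit-in-demo1a-brick1.img 300M
qemu-img create -f raw /data1/d_disks/psorbit-in-demo1a-brick2.img 300M
qemu-img create -f raw /data1/d_disks/psorbit-in-demo1a-brick3.img 300M
```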
Replace the disk path in the following content in the VM's XML file. Use virsh edit <vm_name>:
<disk type='file' device='disk'>
<source file='/data1/d_disks/psorbit-in-demo1a-brick1.img' index='2'/>
<alias name='virtio-disk1'/>
</disk>
Step-2: After attaching the disks, follow the procedure to remove the existing volume and add the new one.
Step-2.1: Remove bricks from the volume: Repeat this step for every existing hostname to completely remove the volume (removing one brick at a time).
Use gluster volume remove-brick <volume name> replica <new replica count (original - 1)> <brick's hostname to be removed>:<brick path> force
Replica count: decrease the replica count by one for each brick removed; for example, in the case of the pssb cluster (5 bricks), the new replica count is 4 when removing the first brick, 3 for the second removal, and so on.
Brick path: Use gluster volume info
to view the brick path for the volume and for the particular hostname.
Step-2.2: After removing the bricks, delete the volume by using the following command:
gluster volume stop <volume_name>
gluster volume delete <volume_name>
Step-3: Restart the vm to apply the new disk configuration and to mount it by itself.
Step-4: Format the attached disk with the XFS file system and mount it. Use lsblk to check that the disk is correctly attached, then proceed:
mkfs.xfs -i size=512 /dev/vdb
Step-5: Add the persistent mount configuration to /etc/fstab
in case it doesn’t exist:
echo "/dev/vdb /export/vdb xfs defaults 0 0" >> /etc/fstab
Step-6: use mount -a
to mount the disk to /export/vdb
directory.
Step-7: Create a brick directory on the newly attached disk on all the nodes, to provide storage capacity for the volume.
Use mkdir -p /export/vdb/brick
Step-8: Create the volume and add the nodes with their respective bricks to the volume.
gluster volume create <volume_name> replica <replication number> <node-1>:/export/vdb/brick <node-2>:/export/vdb/brick <node-3>:/export/vdb/brick
Add the other nodes details as they exist.
Step-9: Start the volume
gluster volume start <volume_name>
Step-10: Check the volume info
gluster volume info
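A concrete sketch of steps 8-10 for the psorbit cluster (replica 3), using the node and volume names from this runbook:
```
gluster volume create ps_orbit_dfs replica 3 \
  psorbit-node01:/export/vdb/brick \
  psorbit-node02:/export/vdb/brick \
  psorbit-node03:/export/vdb/brick
gluster volume start ps_orbit_dfs
gluster volume info ps_orbit_dfs
```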
(gluster_node_inodes_free{job="gluster_psorbit"} / gluster_node_inodes_total{job="gluster_psorbit"}) * 100
query returns the percentage of free inodes on the GlusterFS. A low percentage is a warning sign: once it hits 0%, no new files can be created, because every new file requires an inode.
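The equivalent on-node check of the free-inode percentage for the brick filesystem; a minimal sketch using the brick mount path from this runbook:
```
# Compute the free-inode percentage for /export/vdb from df's inode columns.
df --output=iavail,itotal /export/vdb | awk 'NR==2 {printf "%.1f%% inodes free\n", 100*$1/$2}'
```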
Instance name, IP address, age of alert in firing state: Collect the instance name, the IP address of the instance, and the total time the alert has been in the firing state.
For a no-data alert: Check the gluster_exporter status first on the node that returns 0 or no data for the gluster_up metric.
systemctl status gluster_exporter
to check the status of the gluster_exporter service. Volume status: Collect volume status metric values from the past 5 minutes.
Collect Inode usage:
df -ih
to list inode usage per mount point. Count the number of files on the disk: Use ls -SRlac /export/vdb | wc -l
to count all the files on the partition.
Collect statedump of the volume: State dump of the volume helps in further analyzing the issues in the volume.
Use: gluster volume statedump <vol_name>
Output will be saved in the /var/run/gluster directory with a name of the form .*.<pid/number>.
Disk Usage: Looking at the disk usage metrics such as gluster_node_size_bytes_total{volume=~"$volume",instance="$node",job="$job",hostname!="pssb1abm003"}
helps in determining the exact reason for high inode usage, as more files use more inodes.
FOP hitrate: rate(gluster_brick_fop_hits_total{job="$job",instance="$node",volume="$volume"}[5m])
this metric helps in detecting sudden spikes in inode usage on the brick.
This issue must be addressed by the DevOps team.
Clear unnecessary files:
Use ls -lSha /export/vdb/brick/
to list the files on the brick, sorted by size.
Use find /var/log/ -type f -exec du -m {} + | sort -rn
to list out the files occupying most storage on the directory in MB.
Increase inodes after resetting the disk:
Follow the remedy for the GlusterLowDiskSpace alert to delete and recreate the volume, and while formatting the file system, decrease the inode size as shown below: mkfs.xfs -i size=<value lower than 512> /dev/vdb
When mostly small files (< 512 bytes) are being created, each file still consumes one inode. With a smaller inode size (for example 256 bytes), each inode takes up less space on disk, so the same disk holds more inodes and the extra 256 bytes per inode are not wasted as they would be with the 512-byte size.
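A minimal sketch of that reformat step with a smaller inode size (256 bytes is the XFS minimum), followed by a check of the new inode count; this assumes the brick disk has already been removed from the volume and its data backed up:
```
# Reformat, remount, and confirm that more inodes are now available.
mkfs.xfs -f -i size=256 /dev/vdb
mount /dev/vdb /export/vdb
df -ih /export/vdb
```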
The gluster_volume_writeable metric in Gluster monitoring indicates whether a Gluster volume is capable of handling write operations. When gluster_volume_writeable is 1, it means the volume is fully writable and accepting data without issues. If the metric is 0, it signifies that write operations are failing, which could be caused by problems such as insufficient disk space, split-brain scenarios, brick failures, or permission misconfigurations. If the C3 team is unable to resolve the issue, they must collect and share all relevant diagnostic data, such as logs, volume status, and brick health, with the DevOps team for further investigation.
Collect the following data:
/var/log/glusterfs/<volume_name>.log for volume logs.
/var/log/glusterfs/glusterd.log on all Gluster nodes.
gluster volume status <volume_name> detail
gluster volume heal <volume_name> info
mount | grep glusterfs
ping <node>
gluster peer status
Dependent metrics:
gluster_up: Indicates if Gluster services are running on all nodes.
gluster_brick_available: Checks the status of individual bricks.
gluster_brick_fop_latency_avg: Average file-operation latency per brick.
gluster_heal_info_files_count: Ensures there are no pending heal entries.
Test a write from a client mount: echo "test" > <client_mount_path>/test
Try triggering a heal
gluster volume heal <volume_name> full
Restart Gluster services on all nodes to resolve transient issues:
systemctl restart glusterd
Verify that all bricks are online and accessible:
gluster volume status <volume_name>
Check for split-brain scenarios and resolve them:
gluster volume heal <volume_name> split-brain
Perform a network check to ensure all nodes in the Gluster cluster can communicate.
If the issue persists, reset the faulty bricks and re-add them to the volume using the process specified in the previous alert remedies.
Monitor GlusterFS logs on all nodes for specific error messages to narrow down the issue.
The gluster_volume_status metric in Gluster monitoring provides an overall indication of the health and operational state of a Gluster volume. When gluster_volume_status is 1, it means the volume is healthy, running, and accessible for both read and write operations. If the metric is 0, it signifies that the volume has encountered critical issues such as brick failures, unresponsive nodes, or service outages that are preventing normal operations. If the C3 team cannot resolve the issue, they must gather and share all relevant information, including peer status, volume logs, heal status, and network diagnostics, with the DevOps team for detailed analysis and resolution.
Node Status:
Check the availability of all nodes in the cluster:
node_up
gluster peer status
ping <node_IP>
Service Status:
Verify if the Gluster services are running on all nodes:
systemctl status glusterd
Output should return active.
Volume Status:
Confirm the state of volumes:
gluster volume status
Sample output:
Status of volume: ps_orbit_dfs
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick psorbit-node01:/export/vdb/brick 49229 0 Y 2763
Brick psorbit-node02:/export/vdb/brick 49225 0 Y 1577
Brick psorbit-node03:/export/vdb/brick 49220 0 Y 1595
Self-heal Daemon on localhost N/A N/A Y 2073
Self-heal Daemon on psorbit-node03 N/A N/A Y 2042
Self-heal Daemon on psorbit-node01 N/A N/A Y 2835
Task Status of Volume ps_orbit_dfs
------------------------------------------------------------------------------
There are no active volume tasks
Log Files:
Gather logs from all nodes for analysis:
tail -f /var/log/glusterfs/glusterd.log
tail -f /var/log/glusterfs/bricks/<brick_name>.log
Mount Points:
Confirm the mount points on the client systems and ensure they are accessible:
mount | grep glusterfs
ls -la /data/<pssb| psorbit>
Sample output:
pssb1abm003:pssb_dfs on /data/pssb type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)
total 12
drwx------ 2 tomcat tomcat 4096 Oct 21 17:43 archive
drwx------ 2 tomcat tomcat 4096 Nov 25 15:53 health_monitor
drwx------ 2 tomcat tomcat 4096 Oct 21 17:43 reports
If the volume is not mounted, remount it: mount -t glusterfs hostname:<volume_name> /data/<mount_point>
mount -a
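Concrete examples for the two clusters, using the server:volume pairs from the fstab lines earlier in this runbook:
```
mount -t glusterfs pssb1abm003:pssb_dfs /data/pssb              # pssb cluster
mount -t glusterfs psorbit-node01:/ps_orbit_dfs /data/ps/orbit  # psorbit cluster
```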
Dependent metrics:
gluster_up: Indicates if Gluster services are running on all nodes.
gluster_peers_connected: Verifies the connectivity between cluster peers.
gluster_volume_status: Checks the operational status of the volumes.
gluster_heal_info_files_count: Ensures no pending or excessive heal entries are causing cluster disruptions.
Check Node Status:
gluster peer status
to confirm that all nodes are connected. Sample output:
Number of Peers: 2
Hostname: psorbit-node03
Uuid: c972e8e7-3471-401a-972a-c4dc2d65727c
State: Peer in Cluster (Connected)
Hostname: psorbit-node01
Uuid: c9d5dda0-3359-401a-a9b8-2b4cf2eb7ece
State: Peer in Cluster (Connected)
ping <hostname>
Verify Service Status:
glusterd
is running on all nodes:
systemctl status glusterd
systemctl restart glusterd
Sample output:
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2024-11-11 18:04:39 IST; 2 weeks 2 days ago
Docs: man:glusterd(8)
Main PID: 775 (glusterd)
Tasks: 93 (limit: 9250)
Memory: 69.5M
CPU: 1h 33min 19.578s
CGroup: /system.slice/glusterd.service
├─ 775 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
├─1133 /usr/sbin/glusterfsd -s pssb1abm003 --volfile-id pssb_dfs.pssb1abm003.data-export-vdb-brick -p /var/run/gluster/vols/pssb_dfs/pssb1abm003-data-export-vdb-br>
└─1166 /usr/sbin/glusterfs -s localhost --volfile-id shd/pssb_dfs -p /var/run/gluster/shd/pssb_dfs/pssb_dfs-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/ru>
Notice: journal has been rotated since unit was started, output may be incomplete.
Inspect Mount Points:
mount | grep glusterfs
ls -la /data/<pssb| psorbit>
Sample output:
pssb1abm003:pssb_dfs on /data/pssb type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)
total 12
drwx------ 2 tomcat tomcat 4096 Oct 21 17:43 archive
drwx------ 2 tomcat tomcat 4096 Nov 25 15:53 health_monitor
drwx------ 2 tomcat tomcat 4096 Oct 21 17:43 reports
If the volume is not mounted, remount it: mount -t glusterfs hostname:<volume_name> /data/<mount_point>
mount -a
Review Logs for Errors:
tail -f /var/log/glusterfs/glusterd.log
tail -f /var/log/glusterfs/bricks/<brick_name>.log
Escalate to DevOps:
Restart Gluster Services on All Nodes:
systemctl restart glusterd
Verify Cluster Health:
gluster peer status
Sample output:
Status of volume: pssb_dfs
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick pssb1avm002:/export/vdb/brick 49172 0 Y 2945
Brick pssb1abm003:/data/export/vdb/brick 49164 0 Y 1133
Brick pssb1avm004:/export/vdb/brick 49204 0 Y 1594
Brick pssb1avm005:/export/vdb/brick 49249 0 Y 2780
Self-heal Daemon on localhost N/A N/A Y 1166
Self-heal Daemon on pssb1avm005 N/A N/A Y 3101
Self-heal Daemon on pssb1avm002 N/A N/A Y 3645
Self-heal Daemon on pssb1avm004 N/A N/A Y 1736
Task Status of Volume pssb_dfs
------------------------------------------------------------------------------
There are no active volume tasks
volume start: pssb_dfs: failed: Volume pssb_dfs already started
root@pssb1abm003:/tmp# gluster peer status
Number of Peers: 4
Hostname: pssb1avm002
Uuid: d9c765b5-2c67-426d-a47e-f8fe2ffcdc0e
State: Peer in Cluster (Connected)
Hostname: pssb1avm005
Uuid: 5fe8fb3b-83b0-42bd-a77d-6e5bc4f4abbb
State: Peer in Cluster (Connected)
Hostname: 172.21.0.67
Uuid: 326be036-bebb-4d10-9638-ffef159961ad
State: Peer in Cluster (Disconnected)
Other names:
pssb-pxc01
pssb1avm001
Hostname: pssb1avm004
Uuid: 5b96ff7e-cd9a-4536-b89c-c72558debef1
State: Peer in Cluster (Connected)
gluster peer probe <node_IP>
Check Volume Status:
gluster volume status <volume_name>
gluster volume start <volume_name>
Sample output:Status of volume: pssb_dfs
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick pssb1avm002:/export/vdb/brick 49172 0 Y 2945
Brick pssb1abm003:/data/export/vdb/brick 49164 0 Y 1133
Brick pssb1avm004:/export/vdb/brick 49204 0 Y 1594
Brick pssb1avm005:/export/vdb/brick 49249 0 Y 2780
Self-heal Daemon on localhost N/A N/A Y 1166
Self-heal Daemon on pssb1avm005 N/A N/A Y 3101
Self-heal Daemon on pssb1avm002 N/A N/A Y 3645
Self-heal Daemon on pssb1avm004 N/A N/A Y 1736
Task Status of Volume pssb_dfs
------------------------------------------------------------------------------
There are no active volume tasks
volume start: pssb_dfs: failed: Volume pssb_dfs already started
Resolve Network Issues:
ping <node_IP>
Sample output:
PING pssb1avm001 (172.21.0.61) 56(84) bytes of data.
64 bytes from pssb1avm001 (172.21.0.61): icmp_seq=1 ttl=64 time=0.632 ms
64 bytes from pssb1avm001 (172.21.0.61): icmp_seq=2 ttl=64 time=0.306 ms
^C
--- pssb1avm001 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1007ms
rtt min/avg/max/mdev = 0.306/0.469/0.632/0.163 ms
Examine Logs for Errors:
tail -f /var/log/glusterfs/glusterd.log
Trigger Cluster Heal (if applicable):
Start a heal operation if inconsistencies are detected from the logs:
gluster volume heal <volume_name> full
Sample output:
Launching heal operation to perform full self heal on volume pssb_dfs has been successful
Use heal info commands to check status.
gluster volume heal <volume_name> info
Sample output:
Brick pssb1avm001:/export/vdb/brick
Status: Connected
Number of entries: 0
Brick pssb1avm002:/export/vdb/brick
Status: Connected
Number of entries: 0
Brick pssb1abm003:/data/export/vdb/brick
Status: Connected
Number of entries: 0
Brick pssb1avm004:/export/vdb/brick
Status: Connected
Number of entries: 0
Brick pssb1avm005:/export/vdb/brick
Status: Connected
Number of entries: 0
Replace or Reset Faulty Nodes (if necessary):
For GlusterClusterDown alerts, please follow the remedies given for the GlusterVolumeDown alert (they are very similar; the only difference is the number of nodes that are inoperable) and for gluster_peers disconnected, to restart services and re-assign bricks in case of a corrupt brick.