Easily solve an Avamar Data Integrity issue with these simple steps

There will be times when you open the Avamar graphical user interface (GUI) and see red everywhere. The first time I saw this I was freaking out inside, but stay calm. Let's solve the issue by looking past the red and into the errors themselves.

Data integrity issue has been detected.

Avamar Failed to get avamar mtree

A Checkpoint validation has failed and must be resolved.

Here, the Data Integrity issue only means that Avamar is not able to see or validate the data on the storage. In my case the data is stored on a Data Domain. When I log in to the Data Domain, I see no issues, no alerts, and no red. Everything looks normal.

The "Failed to get avamar mtree" alert says the same thing: Avamar can't see the MTree, the Data Domain file system location where its data is stored.

The checkpoint validation alert follows from this. Avamar can't see its checkpoints on the storage, and it isn't producing new valid ones.

So the storage itself looks fine, and we can breathe easy knowing the data is safe. The problem is that Avamar can't see the storage. Let's open an SSH session (for example with PuTTY) and run some commands to find out why Avamar is not talking to the Data Domain.

Once you are logged in to Avamar, elevate to root with the "su -" command and enter the root password. Then run "ssh-agent bash" followed by "ssh-add ~/.ssh/rootid" to load root's key.

admin@avamar:~/>: su -
Password:  
root@avamar:~/#: ssh-agent bash 
root@avamar:~/#: ssh-add ~/.ssh/rootid
Identity added: /root/.ssh/rootid (/root/.ssh/rootid)

Now let's check the status of Avamar with the "status.dpn" command.

root@avamar:~/#: status.dpn 
Tue Jul 26 14:31:31 CDT 2022  [AVAMAR.DOMAIN.LOCAL] Tue Jul 26 19:31:31 2022 UTC (Initialized Fri May  6 15:52:01 2022 UTC)
Node   IP Address     Version   State   Runlevel  Srvr+Root+User Dis Suspend Load UsedMB Errlen  %Full   Percent Full and Stripe Status by Disk
0.0  10.20.30.40 19.4.0-124  ONLINE fullaccess mhpu+0hpu+0hpu   1 false   0.12 15310  8243209   0.9%   0%(onl:71 )  0%(onl:71 )
Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable

System ID: 1987654321@00:AB:06:0E:0F:0A

All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0hpu)
System-Status: ok
Access-Status: full

Checkpoint failed with result MSG_ERR_DDR_ERROR : cp.20220726165840 started Tue Jul 26 11:58:40 2022 ended Tue Jul 26 11:59:00 2022, completed 142 of 142 stripes
Last GC: finished Tue Jul 26 11:58:17 2022 after 00m 41s >> recovered 119.82 KB (MSG_ERR_DDR_ERROR)
Last hfscheck failed with result MSG_ERR_DDR_ERROR : started Tue Jul 26 08:01:27 2022 ended Tue Jul 26 08:03:58 2022, nodes queried/replied/total(1/1/1)

Maintenance windows scheduler capacity profile is active.
  The maintenance window is currently running.
  Next backup window start time: Tue Jul 26 20:00:00 2022 CDT
  Next maintenance window start time: Wed Jul 27 08:00:00 2022 CDT
root@avamar:~/#: 
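If you run this check often, a small filter can pull the failed results out of the report. This is a minimal sketch and not an Avamar tool. The function name dpn_failed_results is hypothetical. It also assumes, based only on the output above, that the result lines start with "Checkpoint", "Last checkpoint", "Last GC", or "Last hfscheck" and that a failure mentions "failed" or an MSG_ERR_ code.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Avamar): read a status.dpn report on stdin
# and print only the checkpoint, GC, and hfscheck result lines that failed.
dpn_failed_results() {
  # Keep the maintenance result lines, then keep only the ones that report
  # a failure or an MSG_ERR_* result code.
  grep -E '^(Checkpoint|Last checkpoint|Last GC|Last hfscheck)' |
    grep -E 'failed|MSG_ERR_'
}

# Usage on the Avamar server:
#   status.dpn | dpn_failed_results
```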

The System-Status is ok, but the checkpoint, GC, and hfscheck results all failed with MSG_ERR_DDR_ERROR. Let's list the checkpoints with the "cplist" command.

root@avamar:~/#: cplist
cplist: ERROR: ddrmaint: <4750>Datadomain get checkpoint list operation failed.

2022/07/26-19:32:02.94852 [cplist]  ERROR: <0001> ddrmaint: <4750>Datadomain get checkpoint list operation failed.
cp.20220724130110 Sun Jul 24 08:01:10 2022 invalid --- ---  nodes   1/1 stripes    141
cp.20220724130449 Sun Jul 24 08:04:49 2022 invalid --- ---  nodes   1/1 stripes    141
cp.20220725130125 Mon Jul 25 08:01:25 2022 invalid --- ---  nodes   1/1 stripes    141
cp.20220725130501 Mon Jul 25 08:05:01 2022 invalid --- ---  nodes   1/1 stripes    141
cp.20220726130104 Tue Jul 26 08:01:04 2022 invalid --- ---  nodes   1/1 stripes    142
cp.20220726130440 Tue Jul 26 08:04:40 2022 invalid --- ---  nodes   1/1 stripes    142
cp.20220726165818 Tue Jul 26 11:58:18 2022 invalid --- ---  nodes   1/1 stripes    142
cp.20220726165840 Tue Jul 26 11:58:40 2022 invalid --- ---  nodes   1/1 stripes    142
root@avamar:~/#: 
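With many checkpoints, a quick count of valid versus invalid ones saves scrolling. The sketch below is hypothetical (cp_summary is not an Avamar command). It assumes the column layout shown above, where the checkpoint tag is the first field and the word "invalid" is the seventh. Any checkpoint row not marked "invalid" is counted as valid.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Avamar): read cplist output on stdin and
# count checkpoints by validity. Assumes the layout shown above: the tag
# (cp.*) is field 1 and the validity word is field 7.
cp_summary() {
  awk '
    $1 ~ /^cp\./ {               # checkpoint rows only; skips ERROR lines
      if ($7 == "invalid") invalid++
      else valid++
    }
    END { printf "valid=%d invalid=%d\n", valid, invalid }'
}

# Usage on the Avamar server:
#   cplist | cp_summary
```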

Avamar keeps attempting to create checkpoints, but every attempt fails, so they are all marked invalid. Next, let's search the ddrmaint log for hfscheck errors with the "grep -i "err.*hfscheck" /usr/local/avamar/var/ddrmaintlogs/ddrmaint.log | grep -v gc" command.

root@avamar:~/#: grep -i "err.*hfscheck" /usr/local/avamar/var/ddrmaintlogs/ddrmaint.log | grep -v gc
2022-07-26T08:02:38.960979-05:00 avamar ddrmaint.bin[8731]: Error: hfscheck-start::open_ddr: DDR_Open failed: 10.20.30.41(1) lsu: avamar-1987654321, DDR result code: 5075, desc: the user has insufficient access rights
2022-07-26T08:02:38.961244-05:00 avamar ddrmaint.bin[8731]: Error: <4700>Datadomain hfscheck operation failed.
root@avamar:~/#:
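When the log is long, it helps to reduce these lines to just the DDR result code and its description. This is a minimal sed sketch with a hypothetical function name, ddr_errors. It assumes the "DDR result code: N, desc: TEXT" wording seen in the log line above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Avamar): pull the DDR result code and
# description out of ddrmaint log lines read on stdin, one unique pair per line.
ddr_errors() {
  # Matches the "DDR result code: N, desc: TEXT" tail seen in ddrmaint.log.
  sed -n 's/.*DDR result code: \([0-9][0-9]*\), desc: \(.*\)$/\1|\2/p' |
    sort -u
}

# Usage on the Avamar server:
#   grep -i "err.*hfscheck" /usr/local/avamar/var/ddrmaintlogs/ddrmaint.log | grep -v gc | ddr_errors
```

Here it would print a single pair, 5075 and "the user has insufficient access rights", even if the same error repeated every day.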

Now we can see why we have an issue: "the user has insufficient access rights" (DDR result code 5075). The account Avamar uses to talk to the Data Domain is the DDBoost account, so this is the account being refused access. If it was working before and suddenly stopped, its password has probably expired. Let's look into this account to see if that is the issue.

If you don't know the name of the DDBoost account (for example, because it was renamed), run the "ddrmaint read-ddr-info --format=full" command.

root@avamar:~/#: ddrmaint read-ddr-info --format=full
====================== Read-DDR-Info ======================

 System name        : 10.20.30.41 
 System ID          : XXXXXXXXXXXXXXX 
 DDBoost user       : ddboost 
 System index       : 1 
 Replication        : True 
 CP Backup          : True 
 Model number       : DD6900 
 Serialno           : APM0XXXXXXXXXXXX 
 DDOS version       : 7.7.1.0-1007743 
 System attached    : 2022-05-06 18:49:22 (XXXXXXXXXX) 
 System max streams : 50 

root@avamar:~/#
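The SSH target in the next step combines two fields of this output. The sketch below builds it automatically. It is only an illustration: ddr_ssh_target is a hypothetical name, and it assumes the "Label : value" layout shown above with space padding. It also assumes the System name contains no colon, so an IPv6 address would not parse correctly.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Avamar): read "ddrmaint read-ddr-info
# --format=full" output on stdin and print "<DDBoost user>@<System name>".
ddr_ssh_target() {
  awk -F' *: *' '
    /^ *System name *:/  { host = $2 }
    /^ *DDBoost user *:/ { user = $2 }
    END {
      sub(/ +$/, "", host)   # drop the trailing padding ddrmaint prints
      sub(/ +$/, "", user)
      if (user != "" && host != "") print user "@" host
    }'
}

# Usage on the Avamar server:
#   ssh "$(ddrmaint read-ddr-info --format=full | ddr_ssh_target)"
```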

The DDBoost user is "ddboost", and we can test the account by logging in to the Data Domain over SSH with it. The command is "ssh ddboost@10.20.30.41": the DDBoost user name, an @, and the storage IP address shown in the System name field above.

root@avamar:~/#: ssh ddboost@10.20.30.41
Warning: Permanently added '10.20.30.41' (ECDSA) to the list of known hosts.
Data Domain OS
Password: 
You are required to change your password immediately (password aged)
Changing password for ddboost.
Enter current password: 

You can now choose the new password.

A valid password should be a mix of upper and lower case letters,
digits, and other characters.  You can use a 9 character long
password with characters from all of these classes.  An upper
case letter that begins the password and a digit that ends it do
not count towards the number of character classes used.

Enter new password: 
Welcome to Data Domain OS 7.7.1.0-1007743
-----------------------------------------
ddboost@DD6900# 

Great, the password had aged out, and after changing it we are in and the issue is solved. But wait, there's more! Now we have to clean up the mess it made.

First, type "exit" to leave the Data Domain session and return to the Avamar shell. Now we need a valid checkpoint, and to get one we first stop the maintenance scheduler with the "dpnctl stop maint" command.

root@avamar:~/#: dpnctl stop maint
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)
dpnctl: INFO: Suspending maintenance windows scheduler...
root@avamar:~/#:

Then run the "avmaint checkpoint --ava" command to create a new checkpoint.

root@avamar:~/#: avmaint checkpoint --ava
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<checkpoint
  tag="cp.20220726194317"
  isvalid="false"/>
root@avamar:~/#:
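If you script this step, you can capture the new checkpoint's tag instead of copying it by hand. The sketch below uses sed and a hypothetical helper, xml_attr. It is not a general XML parser and assumes one attribute per line, as in the avmaint output above. Note that this checkpoint reports isvalid="false" right after creation. In this walkthrough it only shows as OK in status.dpn after the hfscheck has run.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Avamar): print the value of one XML
# attribute from avmaint output read on stdin (first match only).
# Plain sed sketch: assumes one attribute per line, as avmaint prints above.
xml_attr() {
  local name=$1
  sed -n "s/.*[[:space:]]${name}=\"\([^\"]*\)\".*/\1/p" | head -n 1
}

# Usage on the Avamar server:
#   tag=$(avmaint checkpoint --ava | xml_attr tag)
#   echo "New checkpoint: $tag"
```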

You can run "watch avmaint cpstatus" to follow the checkpoint as it is created, then press Ctrl-C to return to the shell. Next, run "avmaint hfscheck --ava --rolling" to start the checkpoint validation (hfscheck).

root@avamar:~/#: avmaint hfscheck --ava --rolling

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<hfscheck
  checkpoint="cp.20220726194317"
  status="waitcgsan"
  type="rolling"
  checks="rolling+metadata:0:2"
  elapsed-time="72"
  start-time="1658864635"
  end-time="0"
  check-start-time="0"
  check-end-time="0"
  generation-time="1658864707"
  percent-complete="0.00">
  <hfscheckerrors/>
</hfscheck>
root@avamar:~/#:
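The hfscheck command returns right away with status="waitcgsan" and 0% complete, so you have to check back for the result. As an alternative to rerunning status.dpn by hand, here is a generic polling sketch. The name wait_for_match and its argument order are hypothetical. The status.dpn pattern in the usage line assumes the "Last hfscheck: finished ... (OK)" wording shown in the report that follows. That pattern only works here because the previous hfscheck had failed. If an older hfscheck had already succeeded, it would match immediately.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Avamar): rerun a command until its output
# matches an extended regex, or give up after a number of tries.
# Arguments: PATTERN TRIES DELAY_SECONDS COMMAND [ARGS...]
wait_for_match() {
  local pattern=$1 tries=$2 delay=$3 i
  shift 3
  for ((i = 1; i <= tries; i++)); do
    # Succeed as soon as one run of the command prints a matching line.
    if "$@" 2>/dev/null | grep -Eq -- "$pattern"; then
      return 0
    fi
    # Don't sleep after the last attempt.
    if ((i < tries)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage on the Avamar server: poll every 60 s, up to 30 times.
#   wait_for_match '^Last hfscheck: finished.*\(OK\)' 30 60 status.dpn
```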

Once the hfscheck finishes, rerun the "status.dpn" command to recheck the Avamar status.

root@avamar:~/#: status.dpn 
Tue Jul 26 14:46:55 CDT 2022  [AVAMAR.DOMAIN.LOCAL] Tue Jul 26 19:46:55 2022 UTC (Initialized Fri May  6 15:52:01 2022 UTC)
Node   IP Address     Version   State   Runlevel  Srvr+Root+User Dis Suspend Load UsedMB Errlen  %Full   Percent Full and Stripe Status by Disk
0.0  10.20.30.40 19.4.0-124  ONLINE fullaccess mhpu+0hpu+0hpu   1 false   2.57 14979  8248686   0.9%   0%(onl:71 )  0%(onl:71 )
Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable

System ID: 1987654321@00:AB:06:0E:0F:0A

All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0hpu)
System-Status: ok
Access-Status: full

Last checkpoint: cp.20220726194317 finished Tue Jul 26 14:43:37 2022 after 00m 20s (OK)
Last GC: finished Tue Jul 26 11:58:17 2022 after 00m 41s >> recovered 119.82 KB (MSG_ERR_DDR_ERROR)
Last hfscheck: finished Tue Jul 26 14:46:45 2022 after 02m 50s >> checked 76 of 76 stripes (OK)

Maintenance windows scheduler capacity profile is active.
  WARNING: Scheduler is STOPPED.
  Next backup window start time: Tue Jul 26 20:00:00 2022 CDT
  Next maintenance window start time: Wed Jul 27 08:00:00 2022 CDT
root@avamar:~/#:
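Before resuming the scheduler, you can make that check explicit. This is a minimal sketch with a hypothetical function name, maint_ready. It succeeds only when both the "Last checkpoint:" and "Last hfscheck:" lines end in (OK), matching the wording in the report above. It deliberately ignores the Last GC line, which still carries the old error until maintenance runs again.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Avamar): succeed only when a status.dpn
# report read on stdin shows both the last checkpoint and the last hfscheck
# finishing with (OK).
maint_ready() {
  local report
  report=$(cat)
  grep -Eq '^Last checkpoint: .*\(OK\)[[:space:]]*$' <<<"$report" &&
    grep -Eq '^Last hfscheck: .*\(OK\)[[:space:]]*$' <<<"$report"
}

# Usage on the Avamar server: only resume the scheduler once both pass.
#   if status.dpn | maint_ready; then dpnctl start maint; fi
```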

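The last checkpoint and the last hfscheck now both report (OK), so we have a valid checkpoint again. The Last GC line still shows the earlier MSG_ERR_DDR_ERROR because GC hasn't run since the fix. The scheduler is also still stopped, as the warning shows, so restart it with the "dpnctl start maint" command.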

root@avamar:~/#: dpnctl start maint
dpnctl: INFO: No /usr/local/avamar/var/dpn_service_status exist.
dpnctl: INFO: Resuming maintenance windows scheduler...
dpnctl: INFO: maintenance windows scheduler resumed.
root@avamar:~/#: 

From here, go back into the Avamar GUI and open the Administrator window. Then navigate to Actions/Event Management/Clear Data Integrity Alert…

Enter the reset code, in all caps: AVAMARDATAOK

You can now select all of the error messages on the page and clear them. Maintenance will run at the next scheduled time and clean up the rest. You're done!

Leave me a message if you thought this was helpful.
