How to roll back to a checkpoint after an Avamar crash

We recently had a major failure in the environment, outside of the systems that I am responsible for as the engineer. The resulting outage did something to our Avamar: when I checked on it the next morning, after the recovery from the outage, I was not able to log in to the Avamar console. I began to troubleshoot, and here I will take you through the steps I took to get Avamar back up and running.

After getting an error when attempting to log in to the console, we need to open a PuTTY session to check the system status. After a successful login we run the “dpnctl status” command.

admin@avamar:~/>: dpnctl status
/bin/chgrp: changing group of '/usr/local/avamar/var/log': Read-only file system
dpnctl: ERROR: running as user "admin" - problem opening log file "/usr/local/avamar/var/log/dpnctl.log" (-rw-rw-r--) - 
dpnctl: ERROR: traceback on exit:
dpnctl_util::open_log_file (/usr/local/avamar/bin/dpnctl line 683)
main::init (/usr/local/avamar/bin/dpnctl line 6252)

dpnctl: ERROR: [user "admin"] program exit status = 1 (error)
admin@avamar:~/>:
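The “Read-only file system” message suggests that the volume holding /usr/local/avamar/var/log was mounted read-only at that point. As a minimal sketch, not part of the original troubleshooting, the helper below prints every mount point whose options include ro. It reads lines in /proc/mounts format, and the function name list_readonly_mounts is made up for this example.

```shell
# Print the mount point of every filesystem mounted read-only.
# Input: lines in /proc/mounts format:
#   device mountpoint fstype options dump pass
list_readonly_mounts() {
    # Field 4 is the comma-separated option list; match "ro" only as a whole option
    # so that values such as "errors=remount-ro" are not counted.
    awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }'
}
```

On a Linux node this could be run as `list_readonly_mounts < /proc/mounts`. Some pseudo-filesystems are normally read-only, so only an Avamar data or log volume in the list would be a concern.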

That’s a problem, as we can’t even check the status of the Avamar. We believe that someone else may have hard reset the Avamar instead of shutting it down properly, so maybe a clean restart will allow it to come up properly. We issue the “mcserver.sh --restart && dpnctl start sched” command.

admin@avamar:~/>: mcserver.sh --restart && dpnctl start sched
The Administrator Server is not running.
=== BEGIN === check.mcs (prestart)
ERROR: check.mcs: [avamar] Executing: /usr/local/avamar/bin/avmaint nodelist --hfsport=27000 CHECK FAILED: ERROR: avmaint: nodelist: cannot connect to server avamar at 10.20.30.40:27000

ERROR: check.mcs: [avamar] Executing: /usr/local/avamar/bin/avmaint nodelist --hfsport=27000 CHECK FAILED: ERROR: avmaint: nodelist: cannot connect to server avamar at 10.20.30.40:27000

+++ WARN +++ check.mcs: [avamar] checking that the dispatcher is not suspended CHECK FAILED: ('' matching /false/)
C--restart will restart the Administrator Server.
Do you want to proceed with the restart Y/N? [Y]: ERROR: Failed to update database server. Script failed to execute successfully. Check for execute permission of script OR script errors.
admin@avamar:~/>:

Wow, we can’t even do a clean restart, so instead we will have to force it. As this is a virtual Avamar, we will reboot it through vCenter by telling the VM to reboot, and monitor the progress through the VMware console. Now let’s go back into PuTTY with a new session and see if we can get a status.

admin@avamar:~/>: dpnctl status
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)
dpnctl: INFO: gsan status: not running
dpnctl: INFO: MCS status: down.
dpnctl: INFO: emt status: down.
dpnctl: INFO: Backup scheduler status: down.
dpnctl: INFO: Maintenance windows scheduler status: unknown.
dpnctl: INFO: Unattended startup status: disabled.
dpnctl: INFO: avinstaller status: up.
dpnctl: INFO: ConnectEMC status: up.
dpnctl: INFO: ddrmaint-service status: down.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]
admin@avamar:~/>:
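With this many status lines it is easy to miss one. As a rough sketch, the filter below reads “dpnctl status” output on standard input and prints only the names of the services reported as “down” or “not running”. The function name list_down_services is hypothetical, and the pattern is based only on the line format shown in this transcript.

```shell
# Read "dpnctl status" output on stdin and print the names of services
# whose state is "down" or "not running" (line format taken from the transcript).
list_down_services() {
    grep -E '^dpnctl: INFO: .* status: (down|not running)\.?$' |
        sed -e 's/^dpnctl: INFO: //' -e 's/ status: .*$//'
}
```

Used as `dpnctl status 2>&1 | list_down_services`, it would print gsan, MCS, emt, Backup scheduler and ddrmaint-service for the output above.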

Great, we can now get a dpnctl status, so we can assume that the Avamar had not booted up correctly and that this caused the earlier problem with the status command. Now let’s start the services, beginning with gsan, by running the “dpnctl start gsan” command.

admin@avamar:~/>: dpnctl start gsan
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)


  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Action: starting gsan
Have you contacted Avamar Technical Support to ensure that this
  is the right thing to do?

Answering y(es) proceeds with starting gsan;
          n(o) or q(uit) exits


y(es), n(o), q(uit/exit): y
dpnctl: INFO: Checking that gsan was shut down cleanly...
  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Here are the most recent validated and non-validated checkpoints:
  Mon Aug 15 14:04:50 2022 UTC Validated(type=rolling)
  Mon Aug 15 14:14:56 2022 UTC Not Validated

A rollback is recommended: the gsan was not shut down cleanly.

The choices are as follows:
  1   roll back to the most recent checkpoint, whether or not validated
  2   roll back to the most recent validated checkpoint
  3   select a specific checkpoint to which to roll back
  4   do not restart
  q   quit/exit


(Entering an empty (blank) line twice quits/exits.)
>

Here we can see that “the gsan was not shut down cleanly”. I don’t believe this is because I rebooted the Avamar, as I doubt gsan was running at that point either. We will choose option 2 to roll back to the most recent validated checkpoint.

(Entering an empty (blank) line twice quits/exits.)
> 2


  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
You have selected this checkpoint:

  name:       cp.20220815140450
  date:       Mon Aug 15 14:04:50 2022 UTC
  validated:  yes
  age:        1 day, 2 hours

Roll back to this checkpoint?

Answering y(es)  accepts this checkpoint and initiates rollback
          n(o)   rejects this checkpoint and returns to the main menu
          q(uit) exits


y(es), n(o), q(uit/exit): y
dpnctl: INFO: Initiating rollback to "cp.20220815140450" with gsan restart (this may take some time)...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-gsan-restart-output-15684
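Judging from the name and date pair in the transcript above (cp.20220815140450 and Mon Aug 15 14:04:50 2022 UTC), the checkpoint name appears to encode its UTC creation time as cp.YYYYMMDDhhmmss. That is an inference from this one example, not documented behaviour I can vouch for. Under that assumption, a small helper (the name checkpoint_time is made up) can turn a checkpoint name into a readable timestamp:

```shell
# Convert a checkpoint name of the assumed form cp.YYYYMMDDhhmmss
# into "YYYY-MM-DD hh:mm:ss UTC". Prints nothing if the name does not fit.
checkpoint_time() {
    printf '%s\n' "$1" |
        sed -n 's/^cp\.\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)$/\1-\2-\3 \4:\5:\6 UTC/p'
}
```

For example, `checkpoint_time cp.20220815140450` prints 2022-08-15 14:04:50 UTC, which matches the date dpnctl displayed for that checkpoint.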

As the dpnctl output says, we can monitor the progress by running “tail -f /tmp/dpnctl-gsan-restart-output-15684” in a new shell (a second PuTTY session to the same Avamar). Note that the number at the end of that file name differs from run to run, so your command will look similar but not identical if you are following along.

Since the tail command only monitors the progress of the checkpoint rollback, I will omit its full output from this tutorial, except to say that once the rollback has finished it starts showing this:
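Because the file name changes with every run, it can be handy to pick up the newest progress file instead of copying the name by hand. The following is only a sketch: the helper name latest_progress_file is hypothetical, and it assumes the files live in one directory and share a fixed prefix, as the /tmp/dpnctl-gsan-restart-output-15684 example suggests.

```shell
# Print the most recently modified file in directory $1 whose name starts with $2.
# Assumes file names without spaces or newlines, as in the transcript.
latest_progress_file() {
    ls -t "$1"/"$2"* 2>/dev/null | head -n 1
}
```

It could then be combined with tail, for example `tail -f "$(latest_progress_file /tmp dpnctl-gsan-restart-output-)"`.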

Checking for server ready. Please wait.
0.0 state=ONLINE runlevel=startup
sleep 10
Checking for server ready. Please wait.
0.0 state=ONLINE runlevel=running
sleep 10
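Rather than watching the tail output by eye, the wait can be scripted. This is a minimal sketch that polls the progress file until a line containing “state=ONLINE runlevel=running” shows up, as in the output above. The function names and the attempt and interval parameters are my own choices, not anything dpnctl provides.

```shell
# Succeeds once the progress file $1 contains a "runlevel=running" line.
server_is_running() {
    grep -q 'state=ONLINE runlevel=running' "$1" 2>/dev/null
}

# Poll the progress file.
# $1 = progress file, $2 = maximum number of attempts, $3 = seconds between attempts
# Returns 0 as soon as the server is running, 1 if all attempts are used up.
wait_for_running() {
    tries=0
    while [ "$tries" -lt "$2" ]; do
        server_is_running "$1" && return 0
        tries=$((tries + 1))
        sleep "$3"
    done
    return 1
}
```

For example, `wait_for_running /tmp/dpnctl-gsan-restart-output-15684 180 10` would check every 10 seconds for up to 30 minutes; the file name must be the one from your own run.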

Now let’s go back to our original shell (PuTTY session) and run the “dpnctl status” command.

admin@avamar:~/>: dpnctl status   
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)
dpnctl: INFO: gsan status: up
dpnctl: INFO: MCS status: down.
dpnctl: INFO: emt status: down.
dpnctl: INFO: Backup scheduler status: down.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Unattended startup status: disabled.
dpnctl: INFO: avinstaller status: up.
dpnctl: INFO: ConnectEMC status: up.
dpnctl: INFO: ddrmaint-service status: down.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]
admin@avamar:~/>: timed out waiting for input: auto-logout

Things are looking better. Now let’s start MCS, the Management Console Server behind the Avamar console, by issuing the “dpnctl start mcs” command.

admin@avamar:~/>: dpnctl start mcs
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-20800
dpnctl: ERROR: error return from "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start" - exit status 1
dpnctl: ERROR: 1 error seen in output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start"
dpnctl: INFO: No /usr/local/avamar/var/dpn_service_status exist.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]
admin@avamar:~/>:

We get an error, and when we run “tail -f /tmp/dpnctl-mcs-start-output-20800” in another shell we see the following:

admin@avamar:~/>: tail -f /tmp/dpnctl-mcs-start-output-20800
Database server is running...
Start MCDB: processing time = 1 s.
Check MCS: processing time = 4 s.
INFO: Starting messaging service.
INFO: Started messaging service.
Start Message Broker: processing time = 29 s.
=== BEGIN === check.mcs (prestart)
check.mcs                        passed
=== PASS === check.mcs PASSED OVERALL (prestart)
Starting Administrator Server at: Tue Aug 16 12:54:34 CDT 2022
Starting Administrator Server...
Upgrade MCS Preference: processing time = 181ms
Upgrade MCDB: processing time = 153ms
ERROR: gsan rollbacktime: 1660667310 does not match stored rollbacktime: 1631233841
^C
admin@avamar:~/>:
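The two numbers in that ERROR line look like Unix epoch timestamps. That is my assumption, but it fits: the first one decodes to the day of the rollback. A small sketch (the function name rollback_times is made up, and it relies on GNU date’s “-d @SECONDS” syntax) that pulls both values out of the log text and prints them as UTC times:

```shell
# Read log text on stdin. For each rollbacktime mismatch line, print both
# values as UTC timestamps, assuming they are Unix epoch seconds.
# Requires GNU date (for the "-d @SECONDS" form).
rollback_times() {
    sed -n 's/^ERROR: gsan rollbacktime: \([0-9]*\) does not match stored rollbacktime: \([0-9]*\)$/\1 \2/p' |
        while read -r gsan stored; do
            printf 'gsan:   %s\n' "$(date -u -d "@$gsan" '+%Y-%m-%d %H:%M:%S UTC')"
            printf 'stored: %s\n' "$(date -u -d "@$stored" '+%Y-%m-%d %H:%M:%S UTC')"
        done
}
```

Fed the line above, it reports 2022-08-16 16:28:30 UTC for gsan and 2021-09-10 00:30:41 UTC for the stored value, so if the epoch assumption holds, MCS was still holding a rollback time from almost a year earlier.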

Because we did a rollback, the gsan “rollbacktime” no longer matches the one MCS has stored. To fix this we issue the “dpnctl start mcs --force_mcs_restore” command (a KB article describes this fix).

admin@avamar:~/>:  dpnctl start mcs --force_mcs_restore
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)
dpnctl: INFO: Restoring MCS data...
dpnctl: INFO: MCS data restored.
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-22529
dpnctl: WARNING: 2 warnings seen in output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start"
dpnctl: INFO: MCS started.
dpnctl: INFO: No /usr/local/avamar/var/dpn_service_status exist.
admin@avamar:~/>: 

Looks like it worked so let’s confirm:

admin@avamar:~/>: dpnctl status
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)
dpnctl: INFO: gsan status: up
dpnctl: INFO: MCS status: up.
dpnctl: INFO: emt status: down.
dpnctl: INFO: Backup scheduler status: down.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Unattended startup status: disabled.
dpnctl: INFO: avinstaller status: up.
dpnctl: INFO: ConnectEMC status: up.
dpnctl: INFO: ddrmaint-service status: down.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]
admin@avamar:~/>:

MCS is up and we are now able to log in to the Avamar console! But before we do, we need to run the “dpnctl start ddrmaint-service” command and, once it completes, “dpnctl status” to verify that it is up. Without the ddrmaint-service in an up state, backups will fail when communicating with the Data Domain.

admin@avamar:~/>: dpnctl start ddrmaint-service
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-subsystem-control-action-output-24135
dpnctl: INFO: No /usr/local/avamar/var/dpn_service_status exist.
admin@avamar:~/>: dpnctl status
Identity added: /home/admin/.ssh/admin_key (/home/admin/.ssh/admin_key)
dpnctl: INFO: gsan status: up
dpnctl: INFO: MCS status: up.
dpnctl: INFO: emt status: up.
dpnctl: INFO: Backup scheduler status: up.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Unattended startup status: disabled.
dpnctl: INFO: avinstaller status: up.
dpnctl: INFO: ConnectEMC status: up.
dpnctl: INFO: ddrmaint-service status: up.
admin@avamar:~/>:
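Before resuming the scheduler, it can be worth turning that visual check into a pass/fail gate. The sketch below reads “dpnctl status” output on standard input and fails if any of the services named as arguments is not reported as “up”. The function name require_up is hypothetical, and which services count as required is your call; gsan, MCS and ddrmaint-service are simply the ones this walkthrough had to start.

```shell
# Read "dpnctl status" output on stdin and check that every service named
# in the arguments is reported as "up". Prints each one that is not and
# returns 1 in that case, 0 otherwise.
require_up() {
    status_text="$(cat)"
    rc=0
    for svc in "$@"; do
        if ! printf '%s\n' "$status_text" |
            grep -Eq "^dpnctl: INFO: $svc status: up\.?\$"; then
            printf 'not up: %s\n' "$svc"
            rc=1
        fi
    done
    return "$rc"
}
```

A possible use is `dpnctl status 2>&1 | require_up gsan MCS ddrmaint-service && echo "core services are up"`.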

The graphic below was taken the next morning, as I needed to attend to other emergencies, but you can see that Garbage Collection had just run and we are getting plenty of green lights.

We will click to resume the “Scheduler State” to get all backups running again. A few hours later we are all green lights!

That wraps up this tutorial. We can kick off our backups or wait for their scheduled time. Let me know what you think in the comments below.
