Possibility to increase network timeout

The technical support forum for Firestreamer (the virtual tape library).
namezero111111
Posts: 7
Joined: 08 Aug 2012, 21:17

Post by namezero111111 »

Dear Cristalink Support:

We've been using FireStreamer for the past year with great success!
We do have a little issue though and were wondering if it was possible to tweak FireStreamer to mitigate the problem.
We run it over a 1 km wireless link, and once or twice a week the link drops out for about 5 minutes.
Since the network share that FireStreamer utilizes is on the other end of the link, that makes all backups in progress fail with the error "Tape library not functioning".

Is there a way to make FireStreamer retry the connection (or increase the timeout) before returning that error? Obviously FireStreamer receives that exception from the Windows Subsystem, but maybe you know of a way to increase the "failed" timeout, or there is a setting to make FireStreamer retry for a few minutes before complaining to DPM...

Thank you for your help!

-namezero111111
jsf
Cristalink Support
Posts: 300
Joined: 29 Aug 2010, 09:03

Post by jsf »

There is no way for Firestreamer to control the network timeout. When the network connection is broken, it is up to the underlying protocol (SMB) to recover. If it doesn't recover, Windows returns an error to Firestreamer.

You can try to change the registry values related to SMB in Windows. You need to contact Microsoft for the info, or check this post.

The best workaround would probably be a script that monitors the connection state, and if a network error occurs, reloads media in Firestreamer, and possibly initiates a new backup if the current backup failed.
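A minimal sketch of the monitoring half of such a script, in Python. It only detects the share disappearing and coming back. `on_recovered` is a hypothetical hook where the media reload and any backup retry would go, since the actual Firestreamer and DPM commands are not covered in this thread. The poll interval is a placeholder.

```python
import os
import time
from typing import Callable


def check_and_recover(share: str, was_up: bool,
                      on_recovered: Callable[[], None],
                      exists: Callable[[str], bool] = os.path.exists) -> bool:
    """One polling step.

    Returns the current up/down state of the share and calls
    on_recovered() only on a down -> up transition.
    """
    is_up = exists(share)
    if is_up and not was_up:
        on_recovered()  # hypothetical hook: reload media, re-run the failed job
    return is_up


def watch(share: str, on_recovered: Callable[[], None],
          poll_s: float = 30.0) -> None:
    """Poll the share forever. Meant to run as a scheduled task or service."""
    up = True  # assume the share is reachable when monitoring starts
    while True:
        up = check_and_recover(share, up, on_recovered)
        time.sleep(poll_s)
```

It could be started with something like `watch(r"\\Server\Tapes", reload_media)`, where both the share name and `reload_media` are made up for illustration. Note that checking an unreachable UNC path can itself block for a while before returning.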
Best regards,
John Smith
Cristalink Support
namezero111111
Posts: 7
Joined: 08 Aug 2012, 21:17

Post by namezero111111 »

Yes, I was fearing that.

The problem for us is not restarting, but missing the allotted slot if a 700 GB backup fails 80% of the way through.

I was hoping there would be some sort of undocumented retry count for that in FS.

The other option would be this; bear with me and see whether it could be practical:

A local (smaller) drive, maybe only 2 TB, for Firestreamer to write to.
The remote site has 32 TB, so sync software (maybe DFSR, or something else) would copy the files over. The files are 24 GB each, so a dropped connection is a manageable loss: we'd waste at most 24 GB of bandwidth.
Some service would then create new, empty *.fsrm files for the ones that have been moved.
Of course we'd then have to add a separate library in case we'd need to restore, one that would point to the "actual" storage.

So in essence:
1. DFSR sync between the DPM server's 2 TB drive and the offsite storage "Incoming" folder
2. Service on remote server creates fake/empty *.fsrm files according to the media layout
3. FS now opens an fsrm file on primary site, writes it, is done.
4. DFSR now dutifully replicates that fsrm file
5. Remote site service sees file in "Incoming" folder has been touched, and moves it to the real backup location (overwrites file if existent)
6. Remote site service replaces that fsrm file in "Incoming" folder with new empty file
7. Now FS can write that file again if need be
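Steps 5 and 6 could be sketched roughly as below, in Python. This is only an illustration under loud assumptions: that a zero-byte file is an acceptable stand-in for an "empty" *.fsrm tape (not confirmed anywhere in this thread), that a file which is non-empty and has not changed for `quiet_s` seconds has finished replicating, and that the folder names are whatever the remote service is pointed at.

```python
import os
import shutil
import time
from pathlib import Path
from typing import List, Optional


def rotate_incoming(incoming: Path, archive: Path,
                    quiet_s: float = 300.0,
                    now: Optional[float] = None) -> List[str]:
    """Move finished tape files from `incoming` to `archive`.

    Each moved file is replaced in `incoming` by an empty placeholder.
    Returns the names of the files that were moved.
    """
    now = time.time() if now is None else now
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(incoming.glob("*.fsrm")):
        st = src.stat()
        if st.st_size == 0:
            continue  # still an untouched placeholder
        if now - st.st_mtime < quiet_s:
            continue  # changed recently; may still be replicating
        tmp = archive / (src.name + ".part")
        shutil.copy2(src, tmp)               # copy first, so a crash loses nothing
        os.replace(tmp, archive / src.name)  # overwrites an existing archive copy
        src.write_bytes(b"")  # ASSUMPTION: zero bytes == empty tape
        moved.append(src.name)
    return moved
```

Copying to a temporary name and renaming afterwards means the archive never holds a half-written tape under its final name.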

Of course, like I said, this wouldn't work for restore, we'd need to restore from a different library that points to the actual archive files.

Is there anything in how Firestreamer works that would prevent a solution like this from being implemented successfully?
jsf
Cristalink Support
Posts: 300
Joined: 29 Aug 2010, 09:03

Post by jsf »

Are you sure that your DFSR sync won't fail copying a 700GB file over an unstable network?

Do you know the reason for your network connection going down during a 700GB backup? Have you tried to limit the size of your tapes, and use, say, ten 80GB tapes instead of a single huge one?

Yes, you can use a script, but the following looks simpler to me:
  1. The initial media layout contains empty C:\NewTape.fsrm and possibly \\Server\OldTape1.fsrm, \\Server\OldTape2.fsrm etc.
  2. The backup is performed to C:\NewTape.fsrm.
  3. Your script copies C:\NewTape.fsrm to \\Server\OldTapeN.fsrm, and updates the media layout so that it contains new empty C:\NewTape.fsrm and \\Server\OldTape1...OldTapeN.fsrm
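The file-handling part of step 3 might look like the Python sketch below. It deliberately stops short of applying the layout: the layout is represented here as a plain list of paths, which is a hypothetical stand-in, and both creating a fresh empty C:\NewTape.fsrm and loading the new layout into Firestreamer are left to whatever mechanism the product actually provides.

```python
import re
import shutil
from pathlib import Path
from typing import List, Tuple

_TAPE_RE = re.compile(r"OldTape(\d+)\.fsrm", re.IGNORECASE)


def old_tapes(server_dir: Path) -> List[Tuple[int, Path]]:
    """Return (N, path) for every OldTapeN.fsrm in server_dir, sorted by N."""
    found = []
    for p in server_dir.iterdir():
        m = _TAPE_RE.fullmatch(p.name)
        if m:
            found.append((int(m.group(1)), p))
    return sorted(found, key=lambda t: t[0])


def archive_new_tape(new_tape: Path, server_dir: Path) -> Path:
    """Copy new_tape to the next free OldTapeN.fsrm and return that path."""
    tapes = old_tapes(server_dir)
    n = tapes[-1][0] + 1 if tapes else 1
    dst = server_dir / f"OldTape{n}.fsrm"
    tmp = server_dir / f"OldTape{n}.part"
    shutil.copyfile(new_tape, tmp)
    if tmp.stat().st_size != new_tape.stat().st_size:
        tmp.unlink()
        raise IOError(f"size mismatch copying {new_tape}")
    tmp.replace(dst)  # final name appears only after a complete copy
    return dst


def build_layout(new_tape: Path, server_dir: Path) -> List[str]:
    """Hypothetical layout: the local tape first, then all archived tapes."""
    return [str(new_tape)] + [str(p) for _, p in old_tapes(server_dir)]
```

If the link drops during the copy, only the `.part` file is affected and the script can simply be run again.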
Best regards,
John Smith
Cristalink Support
namezero111111
Posts: 7
Joined: 08 Aug 2012, 21:17

Post by namezero111111 »

No, you're right. DFSR will also fail on the current file(s) and retry.

Our tape size is already 24 GB, so DFSR will fail on the current file and start transferring those 24 GB again.
But if Firestreamer gets the SMB error, then DPM will fail the whole 700 GB backup, so I lose about 30 hours' worth of bandwidth, compared to one hour in the DFSR case.

The connection is fairly stable, and I haven't been able to track the intermittent losses which have plagued us for about 6 months now. I'm just about to say that it is related to the nature of the link, which is wireless, and may experience interference from time to time.
Of course, if we ever find out why, that would be great. However, until then we have to be able to push successful DR backups off site in a reliable manner.
Given the tall order of pushing about 3.2 TB across that link weekly (out of a total capacity just short of 4 TB), missing a backup window for the 700 GB backup is a non-retryable error.
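These figures can be sanity-checked with a little arithmetic, sketched here in Python. The assumptions are mine: the link carries 4 TB per week spread evenly over all 168 hours (the real capacity is "just short of" that), and 1 TB = 1000 GB.

```python
HOURS_PER_WEEK = 7 * 24
GB_PER_TB = 1000  # assumption: decimal units


def link_rate_gb_per_hour(weekly_capacity_tb: float = 4.0) -> float:
    """Average throughput if the weekly capacity is spread evenly."""
    return weekly_capacity_tb * GB_PER_TB / HOURS_PER_WEEK


def hours_to_send(size_gb: float, weekly_capacity_tb: float = 4.0) -> float:
    """Link time needed to push size_gb across the link."""
    return size_gb / link_rate_gb_per_hour(weekly_capacity_tb)


def weekly_slack_hours(weekly_load_tb: float = 3.2,
                       weekly_capacity_tb: float = 4.0) -> float:
    """Spare link time per week after the regular load is sent."""
    spare_gb = (weekly_capacity_tb - weekly_load_tb) * GB_PER_TB
    return spare_gb / link_rate_gb_per_hour(weekly_capacity_tb)
```

With these inputs a single 24 GB tape costs about one hour of link time and a full 700 GB backup about 29 hours, which lines up with the "one hour" versus "30 hours" estimate. The weekly slack comes out at roughly 34 hours, so one failed 700 GB job uses up most of it.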

Your media layout idea sounds good. Do you know if Firestreamer will load the new media on the fly when the file is changed?
jsf
Cristalink Support
Posts: 300
Joined: 29 Aug 2010, 09:03

Post by jsf »

You seem to need to perform backups locally as you planned and then transfer the tapes as I described in my previous reply.

> Do you know if Firestreamer will load the new media on the fly when the file is changed?

Firestreamer won't do anything on its own. You need to schedule a script that will do what you need.
Best regards,
John Smith
Cristalink Support
namezero111111
Posts: 7
Joined: 08 Aug 2012, 21:17

Post by namezero111111 »

Thank you, this sounds like a plan! Let me just ask a few more details about your idea.

1. The initial media layout contains empty C:\NewTape.fsrm and possibly \\Server\OldTape1.fsrm, \\Server\OldTape2.fsrm etc.
- Are the old (UNC) paths in there for restore/inventory purposes?
2. The backup is performed to C:\NewTape.fsrm.
- How can I make sure that DPM chooses one of the C:\NewTapeXXX.fsrm tapes, and not a \\Server\OldTapeXXX.fsrm in case it expired?
3. Your script copies C:\NewTape.fsrm to \\Server\OldTapeN.fsrm, and updates the media layout so that it contains new empty C:\NewTape.fsrm and \\Server\OldTape1...OldTapeN.fsrm
- The scheduled script that does that would have to run after every backup I assume?
- Can FS reload the media while other backups are still ongoing?
- For example, backup A overran its time, is done now, but backup B has already begun to another tape. Now the script runs to copy the backup A files around and update the media map, and FS reloads the media map. What happens to backup B?

I apologize for all the questions, but before making an invasive change into our DR Backup infrastructure, I'd like to eliminate potential, unforeseen blocking issues.
jsf
Cristalink Support
Posts: 300
Joined: 29 Aug 2010, 09:03

Post by jsf »

Re: 1: Yes
Re: 2: You may need to configure a second library, or make those tapes read only in the media layout.

> The scheduled script that does that would have to run after every backup I assume?

It's entirely up to you how you choose to implement it.

> Can FS reload the media while other backups are still ongoing?

It can, but you may need to specify the "force" parameter because the tapes may be locked by DPM. The backups will of course fail if they are using the tapes which get removed (for more details, see http://www.cristalink.com/fs/hh.aspx?id=basic).

> For example, backup A overran its time, is done now, but backup B has already begun to another tape. Now the script runs to copy the backup A files around and update the media map, and FS reloads the media map. What happens to backup B?

Apply the new media layout that affects tape A only and leaves tape B as is. In that case backup B won't be affected, see the link above.
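Before applying a new layout, a script could check that no tape currently in use would be removed. A rough Python sketch, assuming (hypothetically) that a layout can be treated as a list of tape file paths and that the script already knows which tapes are busy; how to obtain either from Firestreamer or DPM is not shown here.

```python
import ntpath
from typing import Iterable, List


def _norm(path: str) -> str:
    """Compare Windows paths case-insensitively."""
    return ntpath.normcase(path)


def removed_tapes(current: Iterable[str], proposed: Iterable[str]) -> List[str]:
    """Tapes present in the current layout but missing from the proposed one."""
    keep = {_norm(p) for p in proposed}
    return [p for p in current if _norm(p) not in keep]


def tapes_at_risk(current: Iterable[str], proposed: Iterable[str],
                  in_use: Iterable[str]) -> List[str]:
    """In-use tapes that the proposed layout would remove."""
    busy = {_norm(p) for p in in_use}
    return [p for p in removed_tapes(current, proposed) if _norm(p) in busy]
```

If `tapes_at_risk` returns an empty list, the change touches only idle tapes, which matches the "affects tape A only and leaves tape B as is" case.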
Best regards,
John Smith
Cristalink Support
namezero111111
Posts: 7
Joined: 08 Aug 2012, 21:17

Post by namezero111111 »

Thanks for clearing that up!

We'll try to implement this or some variation thereof and then report back here in case someone has a similar issue.

However, this will take up to two months maybe.

Thanks again!
namezero111111
Posts: 7
Joined: 08 Aug 2012, 21:17

Post by namezero111111 »

Dear folks,

I just wanted to update you on the topic as promised.

We have decided to get a second, local storage for FS to back up to, and then we use DFS-R to replicate the data over the "unreliable" link to the remote site.
We have configured a maximum of 2 simultaneous syncs; this way we lose at most 48 gigabytes (two 24 GB tapes) if the link breaks. Additionally, DFSR is forgiving and will only start over if the link has been down for more than a few minutes (it'll log the 1726 event in that case).
This works very well, because in addition to non-failed FS backups, DFS-R's own delta sync doesn't transmit all the data all the time.

The only downside of course is money, since we need two identical 32TB storages.

Just wanted to post this here for anyone else here who might run into something like this.