DPM 2010 Firestreamer backup hangs - clfs3chr / STOP errors

Minkus
Posts: 4
Joined: 30 Nov 2010, 16:02
Location: Colchester, Essex, UK

Post by Minkus »

Hi,

We had been running Firestreamer 3.95.9 (4.0 RC) with our Dell RD1000 removable hard drive media and Microsoft DPM 2007 to back up our Hyper-V servers for a few months, and the system worked without issues.

However, we recently upgraded to DPM 2010 and Firestreamer 4.0 (drivers 4.0.1), and we have since been having serious problems. Specifically, the overnight short-term backup to tape hangs after a few hours, and the following warning then appears in the System Event Log every 30 seconds from the time of the hang:

Log Name: System
Source: clfs3chr
Date: 30/11/2010 16:05:59
Event ID: 129
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: hyper1server.crgs.local
Description:
Reset to device, \Device\RaidPort4, was issued.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="clfs3chr" />
<EventID Qualifiers="32772">129</EventID>
<Level>3</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2010-11-30T16:05:59.027294300Z" />
<EventRecordID>11666</EventRecordID>
<Channel>System</Channel>
<Computer>hyper1server.crgs.local</Computer>
<Security />
</System>
<EventData>
<Data>\Device\RaidPort4</Data>
<Binary>0F001800010000000000000081000480040000000000000000000000000000000000000000000000000000000000000000010000810004800000000000000000</Binary>
</EventData>
</Event>
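
For anyone who wants to count or filter these warnings from an exported log rather than read them one by one, here is a minimal Python sketch. It assumes the events have been saved in the same Event Xml form as pasted above; the function names and the trimmed SAMPLE string are only illustrative and are not part of Firestreamer or DPM.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Event Xml shown in the post above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event from the post (Binary payload omitted).
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="clfs3chr" />
<EventID Qualifiers="32772">129</EventID>
<Level>3</Level>
<TimeCreated SystemTime="2010-11-30T16:05:59.027294300Z" />
<Channel>System</Channel>
<Computer>hyper1server.crgs.local</Computer>
</System>
<EventData>
<Data>\\Device\\RaidPort4</Data>
</EventData>
</Event>"""


def summarize_event(xml_text):
    """Pull the provider, event ID, timestamp and device out of one event."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)
    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "time": system.find("e:TimeCreated", NS).get("SystemTime"),
        "device": root.find("e:EventData/e:Data", NS).text,
    }


def is_reset_warning(summary, provider="clfs3chr"):
    """True for the 'Reset to device' warning (Event ID 129) from the given source."""
    return summary["provider"] == provider and summary["event_id"] == 129


if __name__ == "__main__":
    info = summarize_event(SAMPLE)
    print(info, is_reset_warning(info))
```

Running the same function over every exported event and keeping only those where is_reset_warning is true would show when the 30-second pattern started.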

When I try to restart the server to resolve the problem, the server hangs for about 20 minutes at the 'Shutting down...' phase of the reboot process, and then generates the following STOP error:

DRIVER_POWER_STATE_FAILURE
STOP : 0x0000009F (0x0000000000000003, 0xFFFFFA800CDE1700, 0xFFFFF800015DA518, 0xFFFFFA801A5A7760)
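
As a rough aid for reading that line, the sketch below splits a STOP line into its bug check code and four parameters. The helper name and the small lookup table are my own illustration. The only parameter meaning included is the one I am confident of from Microsoft's bug check reference (first parameter 0x3 for 0x9F), and a proper diagnosis still needs the dump file.

```python
import re

# Meaning of the first parameter for bug check 0x9F (DRIVER_POWER_STATE_FAILURE).
# Deliberately incomplete: only the value seen in this thread is listed.
PARAM1_0X9F = {
    0x3: "a device object has been blocking an IRP for too long",
}

STOP_PATTERN = re.compile(r"STOP\s*:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (bug_check_code, [parameters]) parsed from a STOP line."""
    match = STOP_PATTERN.search(line)
    if match is None:
        raise ValueError("not a STOP line: %r" % line)
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


if __name__ == "__main__":
    line = ("STOP : 0x0000009F (0x0000000000000003, 0xFFFFFA800CDE1700, "
            "0xFFFFF800015DA518, 0xFFFFFA801A5A7760)")
    code, params = parse_stop_line(line)
    print(hex(code), [hex(p) for p in params])
    if code == 0x9F:
        print(PARAM1_0X9F.get(params[0], "see the bug check reference"))
```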

At first when I was experiencing these issues I thought it was some sort of hardware or DPM issue, but since there are no other errors in the event log to indicate any problem with DPM or the RD1000 drive, I was at a loss to explain it. In the end I investigated 'clfs3chr' further and found that this is a Firestreamer driver, which seemed to indicate this might be the source of the problem.

The first time this error occurred I tried uninstalling Firestreamer 4.0 and reinstalling it, as we had originally done an in-place upgrade from 3.95.9. This seemed to resolve the issue for a while, which suggested that it was indeed a problem with Firestreamer. However, I recently had to reformat the server again, and after I reinstalled the same software as before, including a 'clean' install of Firestreamer 4.0, the problem reoccurred after a few days of apparently working fine.

Please could you tell me what I can do to troubleshoot this problem, as it seems to be an issue with the latest version of Firestreamer, and I cannot use it to protect my servers until this is resolved.
jsf
Cristalink Support
Posts: 300
Joined: 29 Aug 2010, 09:03

Post by jsf »

It's highly unlikely that the upgrade from 3.95.9 to 4.0 caused the issue. Most likely, you have a problem with the storage stack, and Firestreamer is involved only because it uses that stack as the destination storage.
  • What's the operating system version?
  • Do you have any third-party drivers installed for the RD1000?
  • In Device Manager, check the version information of the drivers loaded for RD1000 and for the storage controller RD1000 is connected to. To check the drivers that are loaded for a particular device, right-click the device node, click Properties, click the Driver tab, and then click Driver Details. In the driver list, click a driver to view its version information. Are there any non-Microsoft drivers listed? If yes, what are they, and how old are they?
  • You may want to update the firmware of your RD1000.
  • You mentioned that the problem happens when a backup job runs for a few hours. If so, then most likely you have a memory leak. Open Task Manager and check the memory counters (in particular, kernel memory ones) on the Performance tab every 30-60 mins. The available memory should not gradually decrease over time. If it does, there's a program or a device driver leaking memory. It can even be Windows itself - see, for example, KB968675. You can troubleshoot the issue with the Poolmon utility. For more information, see KB177415.
  • To troubleshoot STOP ("blue screen") errors, you need to generate and analyze a kernel memory dump with Microsoft Debugging Tools for Windows. The name of the offending driver is usually displayed in the output of the !analyze -v command next to Probably caused by.
See also http://www.cristalink.com/fs/hh.aspx?id ... ration#bug
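
If you note the available memory by hand at each check, a small script can tell you whether the readings are really trending down. This is only a sketch: the function names, the sample readings and the 50 MB per hour threshold are made-up illustrations, not values from Firestreamer or Windows documentation.

```python
def slope_mb_per_hour(samples):
    """Least-squares slope of available memory over time.

    samples: list of (minutes_since_start, available_mb) readings.
    Returns the trend in MB per hour (negative means memory is shrinking).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("readings must be taken at different times")
    return num / den * 60


def looks_like_leak(samples, threshold_mb_per_hour=50):
    """Flag a steady decline steeper than the (hypothetical) threshold."""
    return slope_mb_per_hour(samples) <= -threshold_mb_per_hour


if __name__ == "__main__":
    # Illustrative readings taken every 30 minutes during a backup job.
    readings = [(0, 8000), (30, 7700), (60, 7400), (90, 7100)]
    print(slope_mb_per_hour(readings), looks_like_leak(readings))
```

A steady negative slope across several hours is the pattern to look for; a single dip while a job starts is not enough to suspect a leak.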
Best regards,
John Smith
Cristalink Support
Minkus

Post by Minkus »

Hi,

Am still collating data relating to the crash, but here are the answers to the questions you asked:

  • Operating system is Windows Server 2008 R2 Enterprise (x64)
  • Only driver installed for the RD1000 is 'Dell PowerVault RD1000 Utility 1.44' from the Dell website, which enables the eject button to work properly.
  • The RD1000 is detected as a 'DELL RD1000 ATA Device' (Microsoft driver), and is connected to a 'Standard Dual Channel PCI IDE Controller' (Microsoft driver).
  • That firmware is already installed on the RD1000
  • (This is what I am still collating data for, as the crash is a little irregular)
  • Strangely, the STOP error didn't seem to have created a kernel memory dump in the usual location (%SystemRoot%\MEMORY.DMP). However next time it crashes I will check again.
jsf

Post by jsf »

>>Only driver installed for the RD1000 is 'Dell PowerVault RD1000 Utility 1.44' from the Dell website, which enables the eject button to work properly.

Is this driver displayed as signed by Microsoft Winqual? You may want to try Windows Driver Verifier (Verifier.exe) against that driver. See http://www.cristalink.com/fs/hh.aspx?id ... ration#bug for more info. Note that when Verifier detects any problem, it displays a blue screen and generates a memory dump.

>>the STOP error didn't seem to have created a kernel memory dump

See http://support.microsoft.com/?id=969028 and http://support.microsoft.com/kb/254649.
Best regards,
John Smith
Cristalink Support
Minkus

Post by Minkus »

Hi,

Just to say after a bit more troubleshooting I have managed to track down the issue a bit further, and I don't think it's a problem with Firestreamer after all.

Looks like it was a lucky fix last time - seems to be an issue with our DPM backups.

Sorry for the false alarm, and thanks for your time!

Kind regards,
Chris Hill
jsf

Post by jsf »

Hi Chris,

Thank you for the update. Do you mind letting us know the actual cause of the STOP error?

Thank you.
Best regards,
John Smith
Cristalink Support
Minkus

Post by Minkus »

Hi,

I know this is somewhat resurrecting the dead, but I have been all round the houses on this issue. At first it looked like a Microsoft clustering service issue, then a Microsoft DPM issue, then a Microsoft storage driver issue, but having raised a support case with Microsoft, we have ruled all of these out. After a lot of troubleshooting, we found that if I leave the server on long enough, it eventually generates a *different* STOP error (0x0000009e). While the initial cause of the crash points to the Microsoft clustering service, this STOP error apparently indicates a different underlying issue (http://blogs.technet.com/b/askcore/arch ... 0009e.aspx), so I had to send the dump off to Microsoft for full analysis to find the cause.

Having done a full analysis of the crash dump, Microsoft tell me that they have come to the conclusion that it is the fault of the Firestreamer driver, possibly the one mentioned in the event log; therefore they have asked me to get in touch with you again so that we can troubleshoot it together.

Microsoft are requesting a three-way conference call so that we can resolve this issue together. They have said that they cannot raise a support call with you themselves as they did not purchase the product, and I am obviously not qualified to act as a go-between when it comes to analysing kernel-mode crash dumps; it's not exactly my field of expertise!

Please could you get in touch with me ASAP to let me know how we can arrange this, so that we can get the issue resolved.

Kind regards,
Chris
jsf

Post by jsf »

Please contact us privately.

>>they have said that they cannot raise a support call with you themselves as they did not purchase the product

They can contact us directly using the above link. We are more than willing to cooperate if it's really a fault in Firestreamer.
Best regards,
John Smith
Cristalink Support
jsf

Post by jsf »

I had a discussion with a Microsoft Support representative, and the outcome is that Microsoft Support will handle the issue.
Best regards,
John Smith
Cristalink Support