FSM Failures Client Failures | Quantum 3.5 instruction

FSM Failures

Client Failures

Appendix D Quality of Service Guide

Callbacks

If the FSM crashes or is stopped, there is no immediate effect on real-time (ungated) I/O. As long as the I/O does not need to contact the FSM for some reason (attribute update, extent request, etc.), the I/O will continue. From the standpoint of QOS, the FSM being unavailable has no effect.

Non-real-time I/O is held pending until the FSM is reconnected. Because the stripe group is in real-time mode, there is no way to know whether the parameters have changed while the FSM is disconnected, so the conservative design approach was taken: all non-real-time I/O is held off until the FSM is reconnected.
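The gating behavior described above can be pictured with a small Python model. This is only an illustrative sketch, not StorNext code: the class `ClientIoGate` and its methods are hypothetical names, and the sketch releases held requests immediately on reconnect, leaving out the token request that a real client would also have to make.

```python
from collections import deque


class ClientIoGate:
    """Toy model of client I/O while the stripe group is in real-time mode.

    Hypothetical names throughout; this is not the StorNext client API.
    """

    def __init__(self):
        self.fsm_connected = True
        self.pending = deque()  # non-real-time requests held during an outage
        self.issued = []        # requests that were allowed to proceed, in order

    def submit(self, request, realtime):
        # Real-time (ungated) I/O proceeds whether or not the FSM is reachable.
        # Non-real-time I/O proceeds only while the FSM is connected.
        if realtime or self.fsm_connected:
            self.issued.append(request)
        else:
            self.pending.append(request)

    def on_fsm_disconnect(self):
        self.fsm_connected = False

    def on_fsm_reconnect(self):
        # Simplification: held requests are released as soon as the FSM is back.
        self.fsm_connected = True
        while self.pending:
            self.issued.append(self.pending.popleft())


gate = ClientIoGate()
gate.on_fsm_disconnect()
gate.submit("rt-write-1", realtime=True)    # continues during the outage
gate.submit("nrt-write-1", realtime=False)  # held until the FSM returns
gate.on_fsm_reconnect()
```

After the reconnect, the held non-real-time request appears in `gate.issued` behind the real-time request that was never blocked.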

Once the client reconnects to the FSM, the client must re-request any real-time I/O it had previously requested. The FSM does not keep track of QOS parameters across crashes; that is, the information is not logged and is not persistent. Therefore, it is up to the clients to inform the FSM of the amount of required RTIO and to put the FSM back into the same state as it was before the failure.

In most cases, this results in the amount of real-time and non-real-time I/O being exactly the same as it was before the crash. The only time this would be different is if the stripe group is oversubscribed. In this case, since more RTIO had been requested than was actually available, and the FSM had adjusted the request amounts, it is not deterministically possible to re-create the picture exactly as it was before. Therefore, if a deterministic picture is required across reboots, it is advisable to not oversubscribe the amount of real-time I/O.
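A short worked example may help show why an oversubscribed stripe group cannot be restored deterministically. The guide does not say how the FSM adjusts requests, so the sketch below assumes a hypothetical policy: requests are granted in arrival order and trimmed to whatever capacity remains. The function name, the capacity figure, and the request sizes are all illustrative.

```python
def grant_in_arrival_order(capacity, requests):
    """Grant RTIO requests in arrival order, trimming any that no longer fit.

    Hypothetical adjustment policy, assumed only for illustration.
    """
    remaining = capacity
    grants = {}
    for client, wanted in requests:
        granted = min(wanted, remaining)
        grants[client] = granted
        remaining -= granted
    return grants


capacity = 100  # total RTIO the stripe group can deliver (illustrative units)

# Oversubscribed: 60 + 60 > 100, so the outcome depends on who reconnects first.
before_crash = grant_in_arrival_order(capacity, [("A", 60), ("B", 60)])
after_crash = grant_in_arrival_order(capacity, [("B", 60), ("A", 60)])
print(before_crash)  # {'A': 60, 'B': 40}
print(after_crash)   # {'B': 60, 'A': 40}

# Not oversubscribed: 60 + 30 <= 100, so arrival order does not matter.
fits_1 = grant_in_arrival_order(capacity, [("A", 60), ("B", 30)])
fits_2 = grant_in_arrival_order(capacity, [("B", 30), ("A", 60)])
print(fits_1 == fits_2)  # True
```

Under this assumed policy, the same two clients end up with different grants depending only on reconnection order, which is the kind of non-determinism the paragraph above warns about.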

The process of each client re-requesting RTIO is exactly the same as it was initially; once each client has reestablished its RTIO parameters, the non-real-time I/O is allowed to proceed to request a non-real-time token. It may take several seconds for the SAN to settle back to its previous state. It may be necessary to adjust the RtTokenTimeout parameter on the FSM to account for clients that are slow in reconnecting to the FSM.
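The recovery sequence can be sketched as follows. This is a minimal model under stated assumptions: `ToyFsm` and `ToyClient` are hypothetical classes, the FSM's QOS table lives only in memory (mirroring the statement that it is not logged), and each client keeps its own record of what it asked for so that it can replay the request after reconnecting. Timeouts and token handling are not modeled.

```python
class ToyFsm:
    """Hypothetical FSM whose QOS state is held in memory only."""

    def __init__(self):
        self.rtio = {}  # client name -> requested RTIO; lost on a crash

    def request_rtio(self, client_name, amount):
        self.rtio[client_name] = amount


class ToyClient:
    """Hypothetical client that remembers its own RTIO request."""

    def __init__(self, name):
        self.name = name
        self.wanted_rtio = None  # the client's own record, kept across FSM outages

    def request_rtio(self, fsm, amount):
        self.wanted_rtio = amount
        fsm.request_rtio(self.name, amount)

    def on_reconnect(self, fsm):
        # Re-request exactly as was done initially; the FSM remembers nothing.
        if self.wanted_rtio is not None:
            fsm.request_rtio(self.name, self.wanted_rtio)


fsm = ToyFsm()
clients = [ToyClient("A"), ToyClient("B")]
clients[0].request_rtio(fsm, 40)
clients[1].request_rtio(fsm, 25)
state_before = dict(fsm.rtio)

fsm = ToyFsm()  # simulated crash and restart: the QOS table starts out empty
for client in clients:
    client.on_reconnect(fsm)

print(fsm.rtio == state_before)  # True
```

Because the stripe group in this example is not oversubscribed, the replayed requests put the toy FSM back into the same state it held before the restart.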

Client Failures

When a client disconnects, either abruptly (through a crash or a network partition) or in a controlled manner (through an unmount), the FSM releases the client's resources back to the SAN. If the client had real-time I/O on the stripe group, that amount of real-time I/O is released back to the system. This causes a series of callbacks to the clients (to all clients, if the stripe group is transitioning from real-time to non-real-time mode) informing them of the new amount of non-real-time I/O available.
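The release-and-callback flow can be modeled roughly as shown here. Several things are assumptions made for illustration rather than documented behavior: the class `ToyStripeGroup` is hypothetical, the non-real-time I/O available is taken to be total capacity minus reserved RTIO, and every remaining client is notified, whereas the text above only guarantees that for the transition out of real-time mode.

```python
class ToyStripeGroup:
    """Hypothetical model of RTIO bookkeeping for one stripe group."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rtio = {}        # client -> reserved real-time I/O
        self.clients = set()  # currently connected clients
        self.callbacks = []   # (client, non_rt_available) notifications sent

    def non_rt_available(self):
        # Assumption: whatever is not reserved for RTIO is left for non-real-time I/O.
        return self.capacity - sum(self.rtio.values())

    def connect(self, client, rtio=0):
        self.clients.add(client)
        if rtio:
            self.rtio[client] = rtio

    def disconnect(self, client):
        # The same path serves a crash, a network partition, or an unmount.
        self.clients.discard(client)
        released = self.rtio.pop(client, 0)
        if released:
            available = self.non_rt_available()
            for other in sorted(self.clients):
                self.callbacks.append((other, available))
        return released


sg = ToyStripeGroup(capacity=100)
sg.connect("A", rtio=40)
sg.connect("B")
sg.connect("C")
print(sg.non_rt_available())  # 60
print(sg.disconnect("A"))     # 40
print(sg.callbacks)           # [('B', 100), ('C', 100)]
```

In this run, client A held the only real-time reservation, so its departure takes the toy stripe group out of real-time mode and both remaining clients are told that the full capacity is available again.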

If the client had a non-real-time I/O token, the token is released and the amount of non-real-time I/O available is recalculated. Callbacks are sent

StorNext 3.5 Installation Guide
