[LDM #HEQ-649192]: LDM fault tolerance

This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.

Subject: [LDM #HEQ-649192]: LDM fault tolerance
Date: Tue, 21 Apr 2015 12:15:41 -0600

Geoffry,

> Your clarifications help a lot. Though I still have some questions about
> machine failures and where a replacement node would 'start' at in the data.

When a downstream LDM starts, it requests products starting from some time in
the past (typically one hour ago) or from the last successfully-received
product (which it tracks) -- whichever is more recent. Thus, if the downstream
site is offline for less than one hour or the minimum residency time of the
upstream site's product-queue (whichever is less), then no products will be
lost.
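
As a rough illustration of that rule (a sketch only, not the actual LDM code;
the function names, parameter names, and the one-hour default are assumptions
made for the example), in Python:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical default for "some time in the past (typically one hour)".
DEFAULT_BACKLOG = timedelta(hours=1)


def request_start_time(now: datetime,
                       last_received: Optional[datetime],
                       backlog: timedelta = DEFAULT_BACKLOG) -> datetime:
    """Return the time from which a restarting downstream asks for products.

    The start is the more recent of (now - backlog) and the time of the
    last successfully-received product, if one is known.
    """
    earliest = now - backlog
    if last_received is None:
        # No record of a previous product: fall back to the backlog window.
        return earliest
    return max(earliest, last_received)


def outage_is_lossless(outage: timedelta,
                       upstream_min_residency: timedelta,
                       backlog: timedelta = DEFAULT_BACKLOG) -> bool:
    """True if an outage is shorter than both the backlog window and the
    minimum residency time of the upstream product-queue."""
    return outage < min(backlog, upstream_min_residency)


if __name__ == "__main__":
    now = datetime(2015, 4, 21, 18, 0, tzinfo=timezone.utc)
    print(request_start_time(now, now - timedelta(minutes=10)))
    print(outage_is_lossless(timedelta(minutes=30), timedelta(minutes=45)))
```

In this sketch, a replacement node that has lost all of its state would have
no last-received product to refer to (`last_received` is `None`), so it would
presumably start from the backlog window alone.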

> Allow me to elaborate on my use case: I'm planning to download large amounts
> of weather data, spreading the load across many 'nodes.' Nodes in this case
> are AWS instances. At the scale we're looking at, instance ('machine'
> failure) with loss of any LDM state is to be expected relatively
> frequently. If there's some piece of LDM non-memory state that needs to be
> persisted between machine failures to guarantee delivery, I need to be
> aware.
> 
> I am evaluating, if we run LDM naively what are our failure conditions when
> 
> - we lose a machine, all its local state, and disk
> - an LDM process dies
> - network partitions or failures between or during transfers
> 
> And how we might structure our LDM cluster to avoid any related problems.
> I'm also looking to better understand the implementation of LDM so I can be
> aware of our upstream providers' (paid universities) potential failure cases
> and how they impact our cluster's ability to always successfully receive
> and process files. (ignoring network partitions > ~45mins, failure of more
> than some set number of redundant nodes, and the data being unavailable to
> the LDM network)
> 
> So, basically, trying to figure out how strong the processing guarantees
> that LDM provides are so I know where I need to add extra
> monitoring/coordination between redundant nodes.

Sounds like you might be interested in the section on LDM clusters in the 
reference manual 
<https://www.unidata.ucar.edu/software/ldm/ldm-current/basics/cluster.html>.

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: HEQ-649192
Department: Support LDM
Priority: Normal
Status: Closed
