
[LDM #RJS-786355]: Regular expressions

This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.

Subject: [LDM #RJS-786355]: Regular expressions
Date: Sat, 04 Apr 2009 10:52:21 -0600

Dave,

> I've just borrowed from a sample pqact.conf file for a GEMPAK
> installation (provided by Tom Yoksas) a pattern for action in my
> pqact.conf file. Rather than trigger a decoder (a la virtually every
> action in the sample file), though, I'm just trying to file the data.
> 
> A regular expression issue comes up. Here's a simplified example that
> I hope illustrates my conceptual problem:
> 
> WMO   (^a)|(^b|(c|d)) .... ([0-3][0-9])([0-2][0-9])..
> FILE (\?:yy)(\?:mm)\?\(?+1)_type.wmo
> 
> where "?" represents an integer that matches the parenthetical
> expression ([0-3][0-9]) (the day of the month) and "?+1" matches the
> next parenthetical expression, ([0-2][0-9]). The letters a, b, c, and
> d represent strings of one or more regular expressions without
> parentheses.
> 
> The question is, what should "?" be?
> 
> I have two conceptual uncertainties here. First, when two
> parenthetical expressions are separated by "|", are the two referred
> to by separate (sequential) values of \n (where n is an integer), or
> are they both referred to by the same value of \n (since they
> represent possibly mutually exclusive alternatives)?

Two parenthetical subexpressions separated by "|" would have two different
\n backreferences.
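
As a quick way to see this, here is a minimal sketch. It assumes Python's re module as a stand-in for the POSIX regular-expression library that pqact(1) uses; the numbering rule is the same, but the pattern and input string are made up for illustration.

```python
import re

# Two alternatives, each in its own parentheses: they are groups 1 and 2.
pattern = re.compile(r"(^ab)|(^cd)")

m = pattern.search("cdxx")
# Only the second alternative took part in the match, so group 1 is unset.
print(m.group(1), m.group(2))
```

The two groups keep separate numbers even though at most one of them can take part in any given match.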

> Second, when parentheses are nested, how should the expressions they
> enclose be counted when determining an appropriate value of \n?

Backreference \n always refers to the subexpression enclosed by the n-th
unescaped left parenthesis.
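
Applied to your simplified example, that rule can be checked with a short sketch. Again this assumes Python's re module rather than pqact(1) itself, and the literals AA, BB, CC, DD and the sample header are hypothetical stand-ins for a, b, c, d.

```python
import re

# Dave's simplified pattern with made-up literals standing in for a, b, c, d.
pattern = re.compile(r"(^AA)|(^BB|(CC|DD)) .... ([0-3][0-9])([0-2][0-9])..")

# Left parentheses in order: 1=(^AA  2=(^BB|...  3=(CC|DD  4=day  5=hour
print(pattern.groups)

# Hypothetical input that exercises the second top-level alternative.
m = pattern.search("BB KWBC 041200")
print(m.group(2), m.group(3), m.group(4), m.group(5))
```

Counting left parentheses this way, the day is the 4th subexpression and the hour is the 5th, regardless of nesting or of which alternative matched.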

> In the example above, "?" could be anywhere from 2 to 4, depending on
> the answers to these questions, and in one instance the number could
> vary depending on which option of the highest-level "|" ("or")
> structure in the example above is realized.
> 
> The actual pattern that I'm working with is supposed to capture ship,
> buoy, and CMAN data and looks like this:
> 
> WMO  (^S[IMN]V[^GINS])|^S[IMN]W[^KZ]|(^S(HV|HXX|S[^X]))|(^SX(VD|V.50|US(2[0-3]|08|40|82|86)))|(^Y[HO]XX84) .... ([0-3][0-9])([0-2][0-9])..
> FILE data/surface/(\n:yy)(\n:mm)\n\m_boy.wmo
> 
> where \n and \m are to be determined to get the date and time when the
> data were recorded.

There are some things wrong with the above extended regular expression.
As I recall, the first field in a WMO header has six characters: four 
letters followed by two digits.  The above ERE, however, would match,
for example, "SIVA ", "SIWA ", "SHV ", "SHXX ", and "SSA " -- which
don't fit the pattern of the first field of a WMO header.
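
The reason is that "|" has the lowest precedence in an ERE, so the trailing " .... ([0-3][0-9])([0-2][0-9]).." belongs only to the last alternative. The sketch below illustrates this with the pattern's wrapped lines joined. It assumes Python's re module in place of pqact(1), and the full header string is a made-up example.

```python
import re

# Dave's pattern, with the wrapped lines joined into one expression.
ERE = (r"(^S[IMN]V[^GINS])|^S[IMN]W[^KZ]|(^S(HV|HXX|S[^X]))|"
       r"(^SX(VD|V.50|US(2[0-3]|08|40|82|86)))|"
       r"(^Y[HO]XX84) .... ([0-3][0-9])([0-2][0-9])..")
pattern = re.compile(ERE)

# Strings from the reply above; none is a six-character first field.
short_ids = ["SIVA ", "SIWA ", "SHV ", "SHXX ", "SSA "]
matched = [s for s in short_ids if pattern.search(s)]
print(matched)

# Hypothetical header: it matches via the first alternative alone, so the
# day and hour subexpressions (the 8th and 9th) never take part.
m = pattern.search("SIVA01 KWBC 041200")
print(pattern.groups, m.group(0), m.group(8), m.group(9))
```

By the left-parenthesis rule the day and hour are subexpressions 8 and 9, but they are only set when the last alternative, (^Y[HO]XX84), is the one that matches.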

To simplify things, you can always break up a complicated ERE into
multiple pqact(1) entries, each one handling a subset of the complicated
ERE.
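
A rough sketch of that idea follows, again using Python's re module as a stand-in. Each string stands for the pattern field of one pqact(1) entry. The ".." padding after each four-letter prefix is my assumption about the two trailing digits, and the names TAIL, ENTRY_PATTERNS, and day_hour are hypothetical.

```python
import re

# Shared tail: space, four-character field, space, day, hour, two more chars.
TAIL = r" .... ([0-3][0-9])([0-2][0-9]).."

# One simple pattern per entry (only two of the alternatives are shown).
# The ".." after each prefix is an assumed placeholder for the two digits.
ENTRY_PATTERNS = [
    r"^S[IMN]V[^GINS].." + TAIL,
    r"^S[IMN]W[^KZ].." + TAIL,
]

def day_hour(header):
    """Return (day, hour) from the first entry pattern that matches."""
    for ere in ENTRY_PATTERNS:
        m = re.search(ere, header)
        if m:
            # With no other parentheses, day and hour are always 1 and 2.
            return m.group(1), m.group(2)
    return None

print(day_hour("SIVA01 KWBC 041200"))
```

With one alternative per entry there are no other parentheses ahead of the date fields, so the day and hour are always \1 and \2.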

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: RJS-786355
Department: Support LDM
Priority: Normal
Status: Closed
