Month: June 2012

How to use Perl Expect

I am trying to use Expect to log into an appliance, fetch the config, and write it to a file (I can’t use SSH keys because the appliance doesn’t support them, and the thing actually has two logins).

The problem is that when I use this, the data is truncated (I only get the last ~100 lines of the config file).

Use a regular expression as an argument to Perl Expect

Check whether this helps.

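use Expect;

my $Obj = Expect->new();
$Obj->spawn("/some/tst.bash") or die "Cannot spawn /some/tst.bash: $!\n";

# The expect() call itself was cut off in the original post; the
# 10-second timeout used here is an assumption.
$Obj->expect(10,
    [ qr/(?:.*?Hello){2}/i, sub {
            my $Self = shift;
            print "Matched qr/.*?Hello.*?Hello/i..\n";
            exp_continue; }
    ],
);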

And /some/tst.bash looks like this.

echo "Hello! This is for testing. !Hello"

Basically, the regex (?:.*?Hello){2} looks for anything (or nothing) followed by Hello, twice. So the string HelloHello would have matched too.
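
The pattern can be tried on its own, without Expect. This is only a small sketch; the second and third sample strings are made up for illustration, and the first is the output of the test script above.

```perl
use strict;
use warnings;

# Same pattern as above: "anything (or nothing), then Hello", twice.
my $twice = qr/(?:.*?Hello){2}/i;

# Only the first string comes from tst.bash; the others are illustrative.
for my $text ('Hello! This is for testing. !Hello',
              'HelloHello',
              'Hello only once') {
    my $verdict = $text =~ $twice ? 'matches' : 'does not match';
    print "'$text' $verdict\n";
}
```

Compiling the pattern once with qr// and reusing it keeps the test loop short, and the same object could be handed to expect().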

q, qq, qr, qx

q// is generally the same thing as using single quotes – meaning it doesn’t interpolate values inside the delimiters.
qq// is the same as double quoting a string. It interpolates.
qw// returns a list of whitespace-delimited words. @q = qw/this is a test/ is functionally the same as @q = ('this', 'is', 'a', 'test')
qx// is the same thing as using the backtick operators.
qr// compiles a regular expression and returns it as a value that can be stored in a variable and matched against, or interpolated into other patterns, later.
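
A short sketch of the operators side by side may help. The variable names are arbitrary, and the qx// line assumes an echo command is available in the shell.

```perl
use strict;
use warnings;

my $name = 'world';

my $single = q{Hello, $name};       # like single quotes: no interpolation
my $double = qq{Hello, $name};      # like double quotes: interpolates $name
my @words  = qw/this is a test/;    # list of four words
my $pair   = qr/(\w+)\s+(\w+)/;     # compiled regex, reusable later
my $echoed = qx{echo hello};        # like backticks: runs a shell command

my ($first, $second) = 'alpha beta' =~ $pair;

print "$single\n$double\n";
print scalar(@words), " words\n";
print "captured: $first, $second\n";
print "shell said: $echoed";
```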

Maintaining Regular Expressions

By Aaron Mackey on January 16, 2004 12:00 AM

For some, regular expressions provide the chainsaw functionality of the much-touted Perl “Swiss Army knife” metaphor. They are powerful, fast, and very sharp, but like real chainsaws, can be dangerous when used without appropriate safety measures.

In this article I’ll discuss the issues associated with using heavy-duty, contractor-grade regular expressions, and demonstrate a few maintenance techniques to keep these chainsaws in proper condition for safe and effective long-term use.

Readability: Whitespace and Comments

Before getting into any deep issues, I want to cover the number one rule of shop safety: use whitespace to format your regular expressions. Most of us already honor this wisdom in our various coding styles (though perhaps not with the zeal of Python developers). But more of us could make better, judicious use of whitespace in our regular expressions, via the /x modifier. Not only does it improve readability, but it also allows us to add meaningful, explanatory comments. For example, this simple regular expression:

# matching "foobar" is critical here ...
$_ =~ m/foobar/;

Could be rewritten, using a trailing /x modifier, as:

$_ =~ m/ foobar    # matching "foobar" is critical here ...
       /x;

Now, in this example you might argue that readability wasn’t improved at all; I guess that’s the problem with triviality. Here’s another, slightly less trivial example that also illustrates the need to escape literal whitespace and comment characters when using the /x modifier:

$_ =~ m/^                         # anchor at beginning of line
        The\ quick\ (\w+)\ fox    # fox adjective
        \ (\w+)\ over             # fox action verb
        \ the\ (\w+)\ dog         # dog adjective
        (?:                       # whitespace-trimmed comment:
          \s* \# \s*              #   whitespace and comment token
          (.*?)                   #   captured comment text; non-greedy!
          \s*                     #   any trailing whitespace
        )?                        # this is all optional
        $                         # end of line anchor
       /x;                        # allow whitespace

This regular expression successfully matches the following lines of input:

The quick brown fox jumped over the lazy dog
The quick red fox bounded over the sleeping dog
The quick black fox slavered over the dead dog # a bit macabre, no?

While embedding meaningful explanatory comments in your regular expressions can only help readability and maintenance, many of us don’t like the plethora of backslashed spaces made necessary by the “global” /x modifier. Enter the “locally” acting (?#) and (?x:) embedded modifiers:

$_ =~ m/^(?#                      # anchor at beginning of line
        )The quick (\w+) fox (?#  # fox adjective
        )(\w+) over (?#           # fox action verb
        )the (\w+) dog(?x:        # dog adjective
                                  # optional, trimmed comment:
          \s*                     #   leading whitespace
          \# \s* (.*?)            #   comment text
          \s*                     #   trailing whitespace
        )?$(?#                    # end of line anchor
        )/;

In this case, the (?#) embedded modifier was used to introduce our commentary between each set of whitespace-sensitive textual components; the non-capturing parentheses construct (?:) used for the optional comment text was also altered to include a locally-acting x modifier. No backslashing was necessary, but it’s a bit harder to quickly distinguish relevant whitespace. To each their own, YMMV, TIMTOWTDI, etc.; the fact is, both commented examples are probably easier to maintain than:

# match the fox adjective and action verb, then the dog adjective,
# and any optional, whitespace-trimmed commentary:
$_ =~ m/^The quick (\w+) fox (\w+) over the (\w+) dog(?:\s*#\s*(.*?)\s*)?$/;

This example, while well-commented and clear at first, quickly deteriorates into the nearly unreadable “line noise” that gives Perl programmers a bad name and makes later maintenance difficult.

So, as in other programming languages, use whitespace formatting and commenting as appropriate, or maybe even when it seems like overkill; it can’t hurt. And like the choice between alternative code indentation and bracing styles, Perl regular expressions allow a few different options (global /x modifier, local (?#) and (?x:) embedded modifiers) to suit your particular aesthetics.

Capturing Parentheses: Taming the Jungle

Most of us use regular expressions to actually do something with the parsed text (although the condition that the input matches the expressions is also important). Assigning the captured text from the previous example is relatively easy: the first three capturing parentheses are visually distinct and can be clearly numbered $1, $2 and $3; however, the extra set of non-capturing parentheses, which provide optional commentary, themselves have another set of embedded, capturing parentheses; here’s another rewriting of the example, with slightly less whitespace formatting:

my ($fox, $verb, $dog, $comment);
if ( $_ =~ m/^                         # anchor at beginning of line
             The\ quick\ (\w+)\ fox    # fox adjective
             \ (\w+)\ over             # fox action verb
             \ the\ (\w+)\ dog         # dog adjective
             (?:\s* \# \s* (.*?) \s*)? # an optional, trimmed comment
             $                         # end of line anchor
            /x
   ) {
    ($fox, $verb, $dog, $comment) = ($1, $2, $3, $4);
}

From a quick glance at this code, can you immediately tell whether the $comment variable will come from $4 or $5? Will it include the leading # comment character? If you are a practiced regular expression programmer, you probably can answer these questions without difficulty, at least for this fairly trivial example. But if we could make this example even clearer, you will hopefully agree that similarly clarifying some of your more gnarly regular expressions would be beneficial in the long run.

When regular expressions grow very large, or include more than three pairs of parentheses (capturing or otherwise), a useful clarifying technique is to embed the capturing assignments directly within the regular expression, via the code-executing pattern (?{}). In the embedded code, the special $^N variable, which holds the contents of the last parenthetical capture, is used to “inline” any variable assignments; our previous example turns into this:

my ($fox, $verb, $dog, $comment);
$_ =~ m/^                               # anchor at beginning of line
        The\ quick\  (\w+)              # fox adjective
                     (?{ $fox  = $^N })
        \ fox\       (\w+)              # fox action verb
                     (?{ $verb = $^N })
        \ over\ the\ (\w+)              # dog adjective
                     (?{ $dog  = $^N })
        \ dog
                                        # optional trimmed comment
          (?:\s* \# \s*                 #   leading whitespace
          (.*?)                         #   comment text
          (?{ $comment = $^N })
          \s*)?                         #   trailing whitespace
        $                               # end of line anchor
       /x;                              # allow whitespace

Now it should be explicitly clear that the $comment variable will only contain the whitespace-trimmed commentary following (but not including) the # character. We also don’t have to worry about numbered variables $1, $2, $3, etc. anymore, since we don’t make use of them. This regular expression can be easily extended to capture other text without rearranging variable assignments.

Repeated Execution

There are a few caveats to using this technique, however; note that code within (?{}) constructs is executed immediately as the regular expression engine incorporates it into a match. That is, if the engine backtracks off a parenthetical capture to generate a successful match that does not include that capture, the associated (?{}) code will have already been executed. To illustrate, let’s again look at just the capturing pattern for the comment text (.*?) and let’s also add a debugging warn "$comment\n" statement:

          # optional trimmed comment
          (?:\s* \# \s*                 #   leading whitespace
          (.*?) (?{ $comment = $^N;     #   comment text
                    warn ">>$comment<<\n"
                      if $debug;
                   })
          \s*)?                         #   trailing whitespace
        $                               # end of line anchor

The capturing (.*?) pattern is a non-greedy extension that will cause the regular expression matching engine to constantly try to finish the match (looking for any trailing whitespace and the end of string, $) without extending the .*? pattern any further. The upshot of all this is that with debugging turned on, this input text:

The quick black fox slavered over the dead dog # a bit macabre, no?

Will lead to these debugging statements:

>><<
>>a<<
>>a <<
>>a b<<
>>a bi<<
>>a bit<<
>>a bit <<
>>a bit m<<
[ ... ]
>>a bit macabre, n<<
>>a bit macabre, no<<
>>a bit macabre, no?<<

In other words, the adjacent embedded (?{}) code gets executed every time the matching engine “uses” it while trying to complete the match; because the matching engine may “backtrack” to try many alternatives, the embedded code will also be executed as many times.
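
The repeated execution is easy to reproduce in isolation. The following is a minimal sketch rather than the article's code: the pattern is cut down to the non-greedy capture and the end-of-line anchor, and @seen is a name chosen here to record each execution.

```perl
use strict;
use warnings;

our @seen;    # package variable, so the code block can see it on any Perl

'a bit' =~ m/^ (.*?)                     # non-greedy capture
               (?{ push @seen, $^N })    # runs every time the engine gets here
             $/x;

# One entry per attempt, from the empty string up to the full text.
print ">>$_<<\n" for @seen;
```

Every time the engine extends the capture by one character and retries, the code block runs again, so the list grows with the length of the text.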

This multiple execution behavior does raise a few concerns. If the embedded code is only performing assignments, via $^N, there doesn’t seem at first to be much of a problem, because each successive execution overrides any previous assignments, and only the final, successful execution matters, right? However, what if the input text had instead been:

The quick black fox slavered over the dead doggie # a bit macabre, no?

This text should fail to match the regular expression overall (since “doggie” won’t match “dog”), and it does. But, because the embedded (?{}) code chunks are executed as the match is evaluated, the $fox, $verb and $dog variables are successfully assigned; the match doesn’t fail until “doggie” is seen. Our program might now be more readable and maintainable, but we’ve also subtly altered the behavior of the program.

The second problem is one of performance; what if our assignment code hadn’t simply copied $^N into a variable, but had instead executed a remote database update? Repeatedly hitting the database with meaningless updates may be crippling and inefficient. However, the behavioral aspects of the database example are even more frightening: what if the match failed overall, but our updates had already been executed? Imagine that instead of an update operation, our code triggered a new row insert for the comment, inserting multiple, incorrect comment rows!

Deferred Execution

Luckily, Perl’s ability to introduce “locally scoped” variables provides a mechanism to “defer” code execution until an overall successful match is accomplished. As the regular expression matching engine tries alternative matches, it introduces a new, nested scope for each (?{}) block, and, more importantly, it exits a local scope if a particular match is abandoned for another. If we were to write out the code executed by the matching engine as it moved (and backtracked) through our input, it might look like this:

{ # introduce new scope
  $fox = $^N;
  { # introduce new scope
    $verb = $^N;
    { # introduce new scope
      $dog = $^N;
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match

      # ...

      { # introduce new scope
        $comment = $^N;
      } # close scope: successful overall match
    } # close scope: successful overall match
  } # close scope: successful overall match
} # close scope: successful overall match
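
The scope unwinding can be observed with plain localized scalars, which is the idiom the perlre documentation itself describes. This is a reduced sketch, not the article's technique in full: it handles only the fox adjective, and the names $pending and fox_adjective are invented for the example.

```perl
use strict;
use warnings;

our ($pending, $fox);    # package variables, so local() can be applied

sub fox_adjective {
    my ($line) = @_;
    undef $fox;
    $line =~ m/^ The\ quick\ (\w+)
                 (?{ local $pending = $^N })    # undone whenever the engine backtracks
                 \ fox\ .* \ dog
               $                                # end of line
                 (?{ $fox = $pending })         # reached only when everything matched
              /x;
    return $fox;
}

for my $line ('The quick brown fox jumped over the lazy dog',
              'The quick black fox slavered over the dead doggie') {
    my $adjective = fox_adjective($line);
    print defined $adjective ? "fox is $adjective\n" : "no match, nothing assigned\n";
}
```

Because the final block sits after the end-of-line anchor, the copy into $fox happens at most once, and only on a path that has already matched everything before it.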

We can use this block-scoping behavior to solve both our altered behavior and performance issues. Instead of executing code immediately within each block, we’ll cleverly “bundle” the code up, save it away on a locally scoped “stack,” and only process the code if and when we get to the end of a successful match:

my ($fox, $verb, $dog, $comment);
$_ =~ m/(?{
            local @c = ();              # provide storage "stack"
        })
        ^                               # anchor at beginning of line
        The\ quick\  (\w+)              # fox adjective
                     (?{
                         local @c;
                         push @c, sub {
                             $fox = $^N;
                         };
                     })
        \ fox\       (\w+)              # fox action verb
                     (?{
                         local @c = @c;
                         push @c, sub {
                             $verb = $^N;
                         };
                     })
        \ over\ the\ (\w+)              # dog adjective
                     (?{
                         local @c = @c;
                         push @c, sub {
                             $dog = $^N;
                         };
                     })
        \ dog
                                        # optional trimmed comment
          (?:\s* \# \s*                 #   leading whitespace
          (.*?)                         #   comment text
          (?{
              local @c = @c;
              push @c, sub {
                  $comment = $^N;
                  warn ">>$comment<<\n"
                    if $debug;
              };
          })
          \s*)?                         #   trailing whitespace
        $                               # end of line anchor
        (?{
            for (@c) { &$_; }           # execute the deferred code
        })
       /x;                              # allow whitespace

Using subroutine “closures” to package up our code and save them on a locally defined stack, @c, allows us to defer any processing until the very end of a successful match. Here’s the matching engine code execution “path”:

{ # introduce new scope
  local @c = (); # provide storage "stack"
  { # introduce new scope
    local @c;
    push @c, sub { $fox = $^N; };
    { # introduce new scope
      local @c = @c;
      push @c, sub { $verb = $^N; };
      { # introduce new scope
        local @c = @c;
        push @c, sub { $dog = $^N; };
        { # introduce new scope
          local @c = @c;
          push @c, sub { $comment = $^N; };
        } # close scope; lose changes to @c
        { # introduce new scope
          local @c = @c;
          push @c, sub { $comment = $^N; };
        } # close scope; lose changes to @c

        # ...

        { # introduce new scope
          local @c = @c;
          push @c, sub { $comment = $^N; };
          { # introduce new scope
            for (@c) { &$_; }
          } # close scope
        } # close scope; lose changes to @c
      } # close scope; lose changes to @c
    } # close scope; lose changes to @c
  } # close scope; lose changes to @c
} # close scope; no more @c at all

This last technique is especially wordy; however, given judicious use of whitespace and well-aligned formatting, this idiom could ease the maintenance of long, complicated regular expressions.

But, more importantly, it doesn’t work as written. What!?! Why? Well, it turns out that Perl’s support for code blocks inside (?{}) constructs doesn’t support subroutine closures (even attempting to compile one causes a core dump). But don’t worry, all is not lost! Since this is Perl, we can always take things a step further, and make the hard things easy …

Making it Actually Work: use Regexp::DeferredExecution

Though we cannot (yet) compile subroutines within (?{}) constructs, we can manipulate all the other types of Perl variables: scalars, arrays, and hashes. So instead of using closures:

m/
  (?{ local @c = (); })
  # ...
  (?{ local @c; push @c, sub { $comment = $^N; } })
  # ...
  (?{ for (@c) { &$_; } })
 /x

We can instead just package up our $comment = $^N code into a string, to be executed by an eval statement later:

m/
  (?{ local @c = (); })
  # ...
  (?{ local @c; push @c, [ $^N, q{ $comment = $^N; } ] })
  # ...
  (?{ for (@c) { $^N = $$_[0]; eval $$_[1]; } })
 /x

Note that we also had to store away the version of $^N that was active at the time of the (?{}) pattern, because it very likely will have changed by the end of the match. We didn’t need to do this previously, as we were storing closures that efficiently captured all the local context of the code to be executed.

Well, now this is getting really wordy, and downright ugly to be honest. However, through the magic of Perl’s overloading mechanism, we can avoid having to see any of that ugliness, by simply using the Regexp::DeferredExecution module from CPAN:

use Regexp::DeferredExecution;

my ($fox, $verb, $dog, $comment);
$_ =~ m/^                               # anchor at beginning of line
        The\ quick\  (\w+)              # fox adjective
                     (?{ $fox  = $^N })
        \ fox\       (\w+)              # fox action verb
                     (?{ $verb = $^N })
        \ over\ the\ (\w+)              # dog adjective
                     (?{ $dog  = $^N })
        \ dog
                                        # optional trimmed comment
          (?:\s* \# \s*                 #   leading whitespace
          (.*?) (?{ $comment = $^N })   #   comment text
          \s*)?                         #   trailing whitespace
        $                               # end of line anchor
       /x;                              # allow whitespace

How does the Regexp::DeferredExecution module perform its magic? Carefully, of course, but also simply; it just makes the same alterations to regular expressions that we made manually. 1) An initiating embedded code pattern is prepended to declare local “stack” storage. 2) Another embedded code pattern is added at the end of the expression to execute any code found in the stack (the stack itself is stored in @Regexp::DeferredExecution::c, so you shouldn’t need to worry about variable name collisions with your own code). 3) Finally, any (?{}) constructs seen in your regular expressions are saved away onto a local copy of the stack for later execution. It looks a little like this:

package Regexp::DeferredExecution;

use Text::Balanced qw(extract_multiple extract_codeblock);

use overload;

sub import { overload::constant 'qr' => \&convert; }
sub unimport { overload::remove_constant 'qr'; }

sub convert {

    my $re = shift;

    # no need to convert regexp's without (?{ <code> }):
    return $re unless $re =~ m/\(\?\{/;

    my @chunks = extract_multiple($re,
                                  [ qr/\(\?  # '(?' (escaped)
                                       (?={) # followed by '{' (lookahead)
                                      /x,
                                    \&extract_codeblock
                                  ]
                                 );

    for (my $i = 1 ; $i < @chunks ; $i++) {
        if ($chunks[$i-1] eq "(?") {
            # wrap all code into a closure and push onto the stack:
            $chunks[$i] =~
              s/\A{ (.*) }\Z/{
  local \@Regexp::DeferredExecution::c;
  push \@Regexp::DeferredExecution::c, [\$^N, q{$1}];
}/msx;
        }
    }

    $re = join("", @chunks);

    # install the stack storage and execution code:
    $re = "(?{
  local \@Regexp::DeferredExecution::c = (); # the stack
})$re(?{
  for (\@Regexp::DeferredExecution::c) {
    \$^N = \$\$_[0];  # reinstate \$^N
    eval \$\$_[1];    # execute the code
  }
})";

    return $re;
}

1;

One caveat of using Regexp::DeferredExecution is that while execution will occur only once per compiled regular expression, the ability to embed regular expressions inside other regular expressions will circumvent this behavior:

use Regexp::DeferredExecution;

# the quintessential foobar/foobaz parser:
$re = qr/foo
         (?: bar (?{ warn "saw bar!\n"; })
           | baz (?{ warn "saw baz!\n"; })
         )?/x;

# someone's getting silly now:
$re2 = qr/ $re
           baroo!
           (?{ warn "saw foobarbaroo! (or, foobazbaroo!)\n"; })
         /x;

"foobar" =~ /$re2/;

__END__

"saw bar!"

Even though the input text to $re2 failed to match, the deferred code from $re was executed because its pattern did match successfully. Therefore, Regexp::DeferredExecution should only be used with “constant” regular expressions; there is currently no way to overload dynamic, “interpolated” regular expressions.

See Also

The Regexp::Fields module provides a much more compact shorthand for embedded named variable assignments, (?<varname> pattern), such that our example becomes:

use Regexp::Fields qw(my);

my $rx =
  qr/^                                   # anchor at beginning of line
     The\ quick\ (?<fox> \w+)\ fox       # fox adjective
     \ (?<verb> \w+)\ over               # fox action verb
     \ the\ (?<dog> \w+)\ dog            # dog adjective
     (?:\s* \# \s*
        (?<comment> .*?)
     \s*)?                               # an optional, trimmed comment
     $                                   # end of line anchor
    /x;

Note that in this particular example, the my $rx compilation stanza actually implicitly declared $fox, $verb etc. If variable assignment is all you’re ever doing, Regexp::Fields is all you’ll need. If you want to embed more generic code fragments in your regular expressions, Regexp::DeferredExecution may be your ticket.
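
Since Perl 5.10 a similar (?<name> pattern) syntax is built into the language, with the captures delivered in the %+ hash instead of in declared variables. The sketch below uses only that core feature, not Regexp::Fields; the %field hash is a name chosen for the example.

```perl
use strict;
use warnings;

my $rx = qr/^
    The\ quick\ (?<fox> \w+)\ fox              # fox adjective
    \ (?<verb> \w+)\ over                      # fox action verb
    \ the\ (?<dog> \w+)\ dog                   # dog adjective
    (?: \s* \# \s* (?<comment> .*?) \s* )?     # optional, trimmed comment
    $
/x;

my $line = 'The quick black fox slavered over the dead dog # a bit macabre, no?';

my %field;
%field = %+ if $line =~ $rx;    # copy the named captures before they change

print "$_: $field{$_}\n" for sort keys %field;
```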

And finally, because in Perl there is always One More Way To Do It, I’ll also demonstrate Regexp::English, a module that allows you to use regular expressions without actually writing any regular expressions:

use Regexp::English;

my ($fox, $verb, $dog, $comment);

my $rx = Regexp::English->new
           -> start_of_line
           -> literal('The quick ')

           -> remember(\$fox)
             -> word_chars
           -> end

           -> literal(' fox ')

           -> remember(\$verb)
             -> word_chars
           -> end

           -> literal(' over the ')

           -> remember(\$dog)
             -> word_chars
           -> end

           -> literal(' dog')

           -> optional
             -> zero_or_more -> whitespace_char -> end
             -> literal('#')
             -> zero_or_more -> whitespace_char -> end

             -> remember(\$comment)
               -> minimal
                 -> multiple
                   -> word_char
                   -> or
                   -> whitespace_char
                 -> end
               -> end
             -> end
             -> zero_or_more -> whitespace_char -> end
           -> end

           -> end_of_line;

$rx->match($_);

I must admit that this last example appeals to my inner-Lispish self.

Hopefully you’ve gleaned a few tips and tricks from this little workshop of mine that you can take back to your own shop.

command line Switch name idioms

Switch name idioms

Over the years, a number of conventions have arisen over the best letters to assign to common operations that crop up again and again in program design. This list attempts to codify existing practices (updates welcomed). Use these conventions and people will find your programs easy to learn.

-a Process everything (all).
-d Debug mode. Print out lots of stuff.
-h Help. Print out a brief summary of what the script does and what it expects.
-i Input file, or include file
-l Name of logfile
-o Name of output file
-q Quiet. Print out nothing.
-v Verbose. Print out lots of stuff.


Local variables

The @_ variable is local to the current subroutine, and so of course are $_[0], $_[1], $_[2], and so on. Other variables can be made local too, and this is useful if we want to start altering the input parameters. The following subroutine tests to see if one string is inside another, spaces notwithstanding. An example follows.

sub inside
{
	local($a, $b);			# Make local variables
	($a, $b) = ($_[0], $_[1]);	# Assign values
	$a =~ s/ //g;			# Strip spaces from
	$b =~ s/ //g;			#   local variables
	($a =~ /$b/ || $b =~ /$a/);	# Is $b inside $a
					#   or $a inside $b?
}

&inside("lemon", "dole money");		# true

In fact, it can even be tidied up by replacing the first two lines with

local($a, $b) = ($_[0], $_[1]);
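
In current Perl the same subroutine would more usually be written with my, which creates lexical variables instead of temporarily overriding globals. The version below is a sketch along those lines, not the tutorial's code: it renames the variables to $x and $y (to stay clear of the $a and $b used by sort) and adds \Q...\E so the strings are compared literally rather than as patterns.

```perl
use strict;
use warnings;

sub inside {
    my ($x, $y) = @_;       # lexical copies; the caller's strings are untouched
    s/ //g for $x, $y;      # strip spaces from the copies
    # \Q...\E quotes regex metacharacters, so this is a plain substring test.
    return $x =~ /\Q$y\E/ || $y =~ /\Q$x\E/;
}

print inside("lemon", "dole money") ? "inside\n" : "not inside\n";
print inside("lemon", "lime")       ? "inside\n" : "not inside\n";
```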

Handling Command Line Options in Perl programs


Controlling a computer by typing commands to a so-called command line interpreter is still most people’s favorite way of working, despite the capabilities of modern window systems. When you know the names of the commands and their options, working from the command line is much less complicated and usually faster than complex series of mouse movements and button clicks.

The way commands and options are specified depends on how the commands are interpreted, and who is handling the options. Sometimes this is the command line interpreter, but quite often the program that is run by the command has to handle the options itself.

Under most modern command shells, including the popular Unix and Windows shells, a command line consists of the name of the program to be executed, followed by zero or more options and arguments. There are two conventions on what options look like and how they should be interpreted: option letters and option words.

In the case of option letters, options consist of a single dash followed by one or more characters, usually letters, each being interpreted individually. For example, ‘-abc’ means the same as ‘-a -b -c’. When options take values it is usually possible to bundle the values as well. For example, ‘-aw80L24x’ means the same as ‘-a -w 80 -L 24 -x’.

In the case of option words, options consist of a double dash followed by a single option word. When an option takes a value, the value follows the option word or can be appended to the option word using an equals sign. Using this convention, the previous example could read ‘--all --width=80 --length 24 --extend’. With option words, it is much easier to remember options and their meanings.

In either case, options precede other program arguments and the recognition of options stops as soon as a non-option argument is encountered. A double dash by itself explicitly stops option recognition.

Often combinations are allowed, for example, a program can accept ‘-a’ being the same as ‘--all’. Some programs accept option words with just a single dash (and will not use option letters). Sometimes options and non-option arguments may be mixed.

You’ve probably written programs that handle command line options like ‘-h’ for height, ‘-w’ for width, ‘-v’ for verbose, and so on. Some might be optional, some might be case-insensitive, some might not expect an argument afterward. With Perl, parsing options is not very hard to do, but after writing eight subroutines for eight programs, you might wonder whether there’s a better way. There is — in fact, there are several ways.

The simple way — ‘perl -s’

The Perl interpreter itself supports the single-character style of options. The Perl script is free to interpret the command line arguments the way it likes. Perl uses a special command line option ‘-s’ to facilitate the option handling for scripts. Assuming you start Perl as follows (script.pl stands in for the name of your script):

    perl -s script.pl -foo -bar myfile.dat

Perl will remove anything that looks like an option (‘-foo’ and ‘-bar’) from the command line and set the corresponding variables ($foo and $bar) to a true value. Note that the options are words but preceded with a single dash. When a command line argument is encountered that is not an option, Perl will not look any further.

Although this method is very limited it is quite useful to get started.

The easy way — Getopt::Std

Perl comes standard with two modules that assist programs in handling command line options: Getopt::Std and Getopt::Long.

Module Getopt::Std provides two subroutines, getopt and getopts. These routines have in common that they use a single dash to identify option letters and they stop processing options when the first non-option is detected.

Subroutine getopt takes one mandatory argument, a string containing the option letters that take values. For example, when you call

    getopt ('lw');

your program will accept ‘-l24 -w 80’ and set the variable $opt_l to 24 and $opt_w to 80. Note that the value can be bundled with the option letter but it need not be. Other option letters are also accepted (and can be bundled with other letters), for example ‘-ab’ will set each of the variables $opt_a and $opt_b to the value 1. When it is not desired to have (global) variables defined, getopt can be passed a reference to a hash as an optional second argument. Hash keys will be x (where x is the option letter) and the key value will be set to the option value or 1 if the option did not take a value.

Subroutine getopts allows a little bit more control over the options. Its argument is a string containing the option letters of all options that are recognized. If an option takes a value, the option letter in the string is followed by a colon. For example, using

    getopts ('abl:w:');

will make your program take options ‘a’ and ‘b’ without a value, and ‘l’ and ‘w’ with a value. Bundling is allowed. Other command line arguments that start with a dash but are not one of these will cause an error message to be printed. As with getopt, a hash reference can be passed as an optional second argument.
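
As a small sketch of the hash form, the script below parses a made-up command line. Assigning to @ARGV by hand is only there so the example runs on its own; a real program would use whatever the shell passed in.

```perl
use strict;
use warnings;
use Getopt::Std;

# Illustrative command line, assumed for the sake of the example.
@ARGV = qw(-ab -l24 -w 80 notes.txt);

my %opt;
getopts('abl:w:', \%opt)
    or die "Usage: $0 [-ab] [-l length] [-w width] [file ...]\n";

print "$_ => $opt{$_}\n" for sort keys %opt;
print "remaining arguments: @ARGV\n";
```

Passing the hash reference keeps the option values out of global $opt_x variables, which matters once the script runs under use strict.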

The functionality provided by Getopt::Std is much better than ‘perl -s’, but still limited.

The advanced way — Getopt::Long

Module Getopt::Long defines subroutine GetOptions that takes care of advanced handling of command line options.

GetOptions makes it possible to have ultimate control over the handling of command line options. It provides support for:

  • single-letter options, with bundling;
  • option words, using a single dash, double dash or plus (using a plus sign was an intermediate standard used by the GNU project);
  • a mix of the above, in which case the long options must start with a double dash.

Other important features include:

  • options can take (mandatory or optional) values;
  • option values can be strings or numbers;
  • full control over where the option value will be delivered;
  • full checking of options and values.

Standard operation: option words

In its standard configuration, GetOptions will handle option words, matching them in a case-insensitive way. Options may be abbreviated to uniqueness. Options and other command line arguments may be mixed, in which case all the options will be processed first and the other arguments will remain in @ARGV.

The following call to GetOptions will allow a single option, ‘foo’. When this option is specified on the command line the variable $doit will be set to value 1:

    GetOptions ('foo' => \$doit);

In this call, 'foo' is the option control string, and \$doit the option destination. Multiple pairs of control strings and destinations may be passed. GetOptions will return a true result if processing was successful and a false result when errors were detected. Besides a false result, GetOptions will issue a descriptive error message using warn.

The option word may optionally be followed by aliases, alternative option words that refer to the same option, for example:

    GetOptions ('foo|bar' => \$doit);

If you want to specify that an option takes a value, for example a string, append ‘=s’ to the option control string:

    GetOptions ('foo=s' => \$thevalue);

When you use a colon instead of the equals, the option takes a value only when one is present:

    GetOptions ('foo:s' => \$thevalue, 'bar' => \$doit);

Calling this program with arguments ‘-foo bar blech’ will deliver value 'bar' in $thevalue, but when called with ‘-foo -bar blech’, $thevalue will be set to an empty string (and $doit will be set to 1).

Besides strings, options can take numeric values; you can use ‘=i’ or ‘:i’ for integer values and ‘=f’ or ‘:f’ for floating point values.
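
The difference between ‘=’ and ‘:’ can be checked with GetOptionsFromArray, which Getopt::Long exports on request and which parses a given array instead of @ARGV. This is a sketch under a few assumptions: the parse helper and the ‘count’ option are invented for the example.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse one argument list and report what was found.
sub parse {
    my @args = @_;
    my ($thevalue, $doit, $count);
    GetOptionsFromArray(\@args,
        'foo:s'   => \$thevalue,    # string value, taken only when present
        'bar'     => \$doit,        # plain flag
        'count=i' => \$count,       # integer value, mandatory if the option is used
    ) or die "Error in command line arguments\n";
    return { foo => $thevalue, bar => $doit, count => $count, rest => [@args] };
}

my $first  = parse(qw(-foo bar blech));
my $second = parse(qw(-foo -bar -count 3 blech));

print "first:  foo='$first->{foo}', rest=@{ $first->{rest} }\n";
print "second: foo='$second->{foo}', bar=$second->{bar}, count=$second->{count}\n";
```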

Using single-letter options and bundling

To use single-letter options is trivial, but to allow them to be bundled GetOptions needs to be configured first. Module Getopt::Long has a subroutine Configure that can be called with a list of strings, each describing a configuration characteristic. For the bundling of single-letter options, you should use:

    Getopt::Long::Configure ('bundling');

Now GetOptions will happily accept single-letter options and bundle them:

    GetOptions ('a' => \$all, 'l=i' => \$length, 'w=i' => \$width);

This will allow command line arguments of the form ‘-a -l 24 -w 80’ but also ‘-al24w80’. You can mix these with option words:

    GetOptions ('a|all' => \$all, 'l|length=i' => \$length, 'w|width=i' => \$width);

However, for the option words, a double dash is required: ‘--length 24’ is acceptable, but ‘-length 24’ is not. The latter will cause the leading ‘l’ to be interpreted as option letter ‘l’, and then complain that ‘ength’ is not a valid integer value.
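
A short sketch of bundling mixed with option words follows. The argument list is invented, and GetOptionsFromArray is used so the example does not depend on the real command line.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

Getopt::Long::Configure('bundling');

# Illustrative arguments: two bundled letters, one option word, one file.
my @args = qw(-al24 --width=80 data.txt);

my ($all, $length, $width);
GetOptionsFromArray(\@args,
    'a|all'      => \$all,
    'l|length=i' => \$length,
    'w|width=i'  => \$width,
) or die "Error in command line arguments\n";

print "all=$all length=$length width=$width\n";
print "remaining arguments: @args\n";
```

Note that Configure changes the module's settings for the rest of the program, so it is normally called once, before the first GetOptions call.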

For maximum confusion,

    Getopt::Long::Configure ('bundling_override');

will allow option words with a single dash, where the words take precedence over bundled single-letter options. For example:

    GetOptions ('a' => \$a, 'v' => \$v, 'x' => \$x, 'vax' => \$vax);

will treat ‘-axv’ as ‘-a -x -v’ but ‘-vax’ as a single option word.

Advanced destinations

You do not need to specify the option destination. If no destination is specified, GetOptions will define variables $opt_xxx where xxx is the name of the option, just like getopt and getopts. GetOptions will also accept a reference to a hash as its first argument and deliver the option values there, again just like getopt and getopts.

If you do specify the option destination, it does not necessarily need to be a scalar. If you specify a reference to an array, option values are pushed into this array:

    GetOptions ('foo=i' => \@values);

Calling this program with arguments ‘-foo 1 -foo 2 -foo 3’ will result in @values having the value (1,2,3) provided it was initially empty.

Also, the option destination can be a reference to a hash. In this case, option values can have the form ‘key=value’. The value will be stored in the hash with the given key.

Finally, the destination can be a reference to a subroutine. This subroutine will be called when the option is handled. It gets two arguments passed: the name of the option and the value.

A special option control string ‘<>’ can be used in this case to connect a subroutine to handle non-option arguments. This subroutine will be called with the name of the non-option argument. For example:

    GetOptions ('x=i' => \$x, '<>' => \&doit);

When you execute this program with command line arguments ‘-x 1 foo -x 2 bar’ this will call subroutine ‘doit’ with argument 'foo' (and $x equal to 1), and then call ‘doit’ with argument 'bar' (and $x equal to 2).
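The array, hash, and subroutine destinations described in this section can be combined in one call. The following is a sketch under assumptions: the option names 'define' and the argument list are made up for illustration, and GetOptionsFromArray replaces the real command line. The argument handed to the ‘<>’ handler is stringified explicitly, because newer versions of Getopt::Long pass an object that stringifies to the argument.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

my (@values, %defines, @seen, $x);

# Hypothetical argument list standing in for @ARGV.
my @args = qw(--foo 1 --foo 2 --define os=linux -x 1 alpha -x 2 beta);

GetOptionsFromArray(\@args,
    'foo=i'    => \@values,     # array: each value is pushed
    'define=s' => \%defines,    # hash: key=value pairs are stored
    'x=i'      => \$x,
    # Non-option arguments: record the argument and the current $x.
    '<>'       => sub { push @seen, "$_[0] (x=$x)" },
) or die "bad options\n";

print "values:  @values\n";
print "defines: os=$defines{os}\n";
print "seen:    $_\n" for @seen;
```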

Other configuration characteristics

GetOptions supports several other configuration characteristics. You can switch off the default behavior to match option words in a case-insensitive way with:

    Getopt::Long::Configure ('no_ignore_case');

To inhibit automatic abbreviations for option words, use 'no_auto_abbrev'. To stop detecting options after the first non-option command line argument, use 'require_order'. For a complete list see the Getopt::Long documentation.
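A small sketch of two of these settings together, using a made-up argument list: with 'no_ignore_case' the letters ‘v’ and ‘V’ are distinct options, and with 'require_order' parsing stops at the first non-option argument, leaving the rest untouched.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

Getopt::Long::Configure('no_ignore_case', 'require_order');

my ($verbose, $version);

# Hypothetical argument list standing in for @ARGV.
my @args = qw(-V input.txt -v);

GetOptionsFromArray(\@args,
    'v' => \$verbose,
    'V' => \$version,
) or die "bad options\n";

# Only -V was processed; everything from 'input.txt' on is left in @args.
print "version set\n"     if $version;
print "verbose not set\n" unless $verbose;
print "remaining: @args\n";
```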

Help texts

People often ask me why GetOptions does not provide facilities for help messages regarding command line options. There are two reasons why I have not implemented these.

The first reason is that although command line options have a fairly uniform appearance, help messages have not. Whatever format of messages would be supported it would please some and displease lots of others. It would also clobber the calls to GetOptions, requiring long lists of parameters to get all the information passed through.

The second reason is that Perl allows a program to contain its own documentation, in so-called Plain Old Documentation (POD) format, and modules exist that extract this information to supply help messages. The following subroutine uses module Pod::Usage for this purpose; it also shows how Pod::Usage can be loaded on demand:

    sub options () {
        my $help = 0;       # handled locally
        my $ident = 0;      # handled locally
        my $man = 0;        # handled locally

        # Process options.
        if ( @ARGV > 0 ) {
            GetOptions('verbose' => \$verbose,
                       'trace'   => \$trace,
                       'help|?'  => \$help,
                       'manual'  => \$man,
                       'debug'   => \$debug)
                or pod2usage(2);
        }
        if ( $man or $help ) {
            # Load Pod::Usage only if needed.
            require "Pod/";
            import Pod::Usage;
            pod2usage(1) if $help;
            pod2usage(VERBOSE => 2) if $man;
        }
    }

The latest version of Getopt::Long (this article describes version 2.17) can be found on CPAN in directory authors/Johan_Vromans. This kit also contains a script template that uses Getopt::Long with Pod::Usage.

Other option handling modules

A few other option handling modules can be found on CPAN. From directory modules/by-category/12_Option_Argument_Parameter_Processing the following modules can be downloaded:

Getopt::Mixed (file Getopt-Mixed-1.008.tar.gz)

This module handles both option words and option letters. It was developed a couple of years ago, when Getopt::Std only handled option letters and Getopt::Long only handled option words. It is very much obsolete now.

Getopt::Regex (file Getopt-Regex-0.02.tar.gz)

An option handler that uses regular expressions to identify the options, and closures to deliver the option values.

Getopt::EvaP (file Getopt-EvaP-2.3.1.tar.gz)

This module uses a table-driven option handler that provides most of the features of Getopt::Long but also includes first level help messages.

Getopt::Tabular (file Getopt-Tabular-0.2.tar.gz)

Another table-driven option handler loosely inspired by Tcl/Tk. Powerful, but very complex to set up.

Using a Variable as a Match Expression

Kewl Splitpath One Liner Regex

Check out this splitpath command:

my($text) = "/etc/sysconfig/network-scripts/ifcfg-eth0";
my($directory, $filename) = $text =~ m/(.*\/)(.*)$/;

print "D=$directory, F=$filename\n";

Is that cool or what?
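One caveat: the one-liner only matches when the path contains at least one slash. The sketch below shows a slightly more defensive variant next to fileparse from the core File::Basename module, which does the same job. The name split_path is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Split a path into (directory, filename). The directory group is optional,
# so a bare filename yields an empty directory instead of a failed match.
sub split_path {
    my ($path) = @_;
    my ($dir, $file) = $path =~ m{\A(.*/)?([^/]*)\z}s;
    $dir = '' unless defined $dir;
    return ($dir, $file);
}

my $text = "/etc/sysconfig/network-scripts/ifcfg-eth0";

my ($directory, $filename) = split_path($text);
print "D=$directory, F=$filename\n";

# fileparse returns the filename first, then the directory part.
my ($name, $dirs) = fileparse($text);
print "D=$dirs, F=$name\n";
```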

Using a Variable as a Match Expression

You can use a variable inside the match expression. This yields tremendous power. Simply place the variable name between the forward slashes, and the expression will be sought in the string. Here’s an example:

#!/usr/bin/perl -w
# use strict;

# Report whether the pattern in $lookfor occurs in $string,
# and show the matched text when it does.
sub test($$)
	{
	my $lookfor = shift;
	my $string = shift;
	print "\n$lookfor ";
	if($string =~ m/($lookfor)/)
		{
		print " is in ";
		}
	else
		{
		print " is NOT in ";
		}
	print "$string.";
	if(defined($1))
		{
		print "      <$1>";
		}
	print "\n";
	}

test("st.v.", "steve was here");
test("st.v.", "kitchen stove");
test("st.v.", "kitchen store");

The preceding code produces the following output.

[slitt@mydesk slitt]$ ./

st.v.  is in steve was here.      <steve>

st.v.  is in kitchen stove.      <stove>

st.v.  is NOT in kitchen store.
[slitt@mydesk slitt]$

As you can see, you can search for a regular expression stored in a variable, and you can retrieve the matched text in $1.
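When the same pattern is used repeatedly, it can also be compiled once with qr// and stored in a variable. This is a minimal sketch using the same three test strings; the %hit hash is a hypothetical name used only to collect the results.

```perl
use strict;
use warnings;

# Compile the pattern once; qr// returns a reusable regex object.
my $lookfor = qr/st.v./;

my %hit;   # string => matched text, or undef when there is no match
for my $string ('steve was here', 'kitchen stove', 'kitchen store') {
    # In list context the match returns its captures (empty list on failure).
    my ($found) = $string =~ m/($lookfor)/;
    $hit{$string} = $found;
    print defined $found ? "$string: <$found>\n" : "$string: no match\n";
}
```

Capturing into a lexical variable in list context avoids relying on $1, which keeps its old value when a match fails.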

A common mistake people make when using regular expressions is to match against a variable without escaping its contents first:


$data =~ s!$url!!;

This is going to work properly most of the time. But sometimes it won’t behave as expected, or you will experience occasional run-time errors. For example, if your $url contains “++”, the substitution can fail and exit with an error message such as:

 "/ nested *?+ in regex..."

The reason for the failure is that “+” is a quantifier, so you can’t use a literal “++” inside your regular expression without escaping it. The variable might include several such special characters, all of which have to be escaped properly. The correct way to implement this substitution is:

$temp = quotemeta($url);
$data =~ s!$temp!!;

quotemeta() is a standard Perl function; it puts a backslash in front of every character other than letters, digits, and underscore.
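A short sketch of the escaping in action. The URL and text are hypothetical values chosen only because they contain “++”. The \Q...\E escape inside the pattern applies the same quoting as quotemeta() without a temporary variable.

```perl
use strict;
use warnings;

# Hypothetical values, chosen only because the URL contains "++".
my $url  = 'http://example.com/c++/faq.html';
my $data = 'See http://example.com/c++/faq.html for details.';

# Variant 1: escape explicitly with quotemeta().
my $escaped = quotemeta($url);
(my $via_quotemeta = $data) =~ s!$escaped!!;

# Variant 2: let \Q...\E do the same escaping inside the pattern.
(my $via_q = $data) =~ s!\Q$url\E!!;

print quotemeta('c++'), "\n";
print "$via_quotemeta\n";
print "$via_q\n";
```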

Perl: Greedy and Ungreedy Matching

Greedy and Ungreedy Matching

Perl regular expressions normally match the longest string possible. For instance:

my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";

Run the preceding code, and here’s what you get:

ississ

It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:

my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";

Now look what the code produces:

is

Clearly, the use of the question mark makes the match ungreedy. But there’s another problem: regular expressions always try to match as early as possible.
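To see the "as early as possible" rule at work, consider looking for the shortest stretch from an i to a p in the same string. This sketch is an illustration added here, not part of the original example; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $text = "mississippi";

# The lazy quantifier still starts at the leftmost 'i' that can succeed,
# so the match runs from the first 'i' to the first 'p'.
my ($lazy) = $text =~ m/(i.*?p)/;

# Forbidding further 'i' characters inside the match forces the engine
# to give up on the earlier starting points.
my ($tight) = $text =~ m/(i[^i]*?p)/;

print "$lazy\n";    # ississip
print "$tight\n";   # ip
```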



One thing to watch out for in non-greedy matching:


One thing to watch out for: given a pattern such as /^(.*?)%(.*?)%/ one
could match and extract the first two fields of a line of %-separated
data:

#!/usr/bin/perl -w
use strict;
$_ = 'Johnson%Andrew%AX321%37';
m/^(.*?)%(.*?)%/;
print "$2 $1\n";

And one can easily begin to think of each subexpression as
meaning ‘match up to the next % symbol’, but that isn’t exactly what it
means. Let’s say that the third field represents an ID tag and we want
to extract only those names of people with ID tags starting with ‘A’.
We might be tempted to do this:

#!/usr/bin/perl -w
use strict;
# Read %-separated records from STDIN or from files named on the command line.
while (<>) {
    print "$2 $1\n" if m/^(.*?)%(.*?)%A/;
}

This would print out:

Andrew Johnson
John%BC142 Smith

But that isn’t what we wanted at all — what happened? Well, the second
half of the regex does not say match up to the next % symbol and then
match an ‘A’, it says, match up to the next % symbol that is followed
by an ‘A’. The ‘(.*?)’ part of the pattern is not prevented from matching and
proceeding past a % character if that is what is necessary to cause the
whole regex to succeed. What we really wanted in this case was a
negated character class:

#!/usr/bin/perl -w
use strict;
# Read %-separated records from STDIN or from files named on the command line.
while (<>) {
    print "$2 $1\n" if m/^([^%]*)%([^%]*)%A/;
}

Now we are saying exactly what we want: the first subexpression grabs
zero or more of anything except a % character, then we match a %
character, then the second subexpression also grabs zero or more of
anything but a % character, and finally we match ‘%A’ or we fail.

To summarize, a greedy quantifier takes as much as it can get, and a
non-greedy quantifier takes as little as possible (in both cases only
while still allowing the entire regex to succeed). Take care in how you
use non-greedy quantifiers — it is easy to get fooled into using one
where a negated character class is more appropriate.
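The two patterns can be compared side by side. In this sketch the second record is hypothetical data, invented so that its fourth field starts with ‘A’ while its ID tag does not; the array names are likewise made up for the illustration.

```perl
use strict;
use warnings;

# The second record is hypothetical: its ID tag (BC142) does not start
# with 'A', but its fourth field does.
my @records = (
    'Johnson%Andrew%AX321%37',
    'Smith%John%BC142%A5',
);

my (@lazy, @negated);
for (@records) {
    # Non-greedy version: the second group may run past a % sign.
    push @lazy, "$2 $1" if m/^(.*?)%(.*?)%A/;
    # Negated character class: each group stops at the next % sign.
    push @negated, "$2 $1" if m/^([^%]*)%([^%]*)%A/;
}

print "lazy:    $_\n" for @lazy;
print "negated: $_\n" for @negated;
```

With this data the non-greedy pattern reports both records, the second one with a mangled name, while the negated character class reports only the first.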