1. Problem
You want to fetch a page from your application, read it, and then
modify part of it to send back in your response. For our example, we
will modify a page on Wikipedia.
2. Solution
See Example 1.
Example 1. Editing a Wikipedia page with Perl
#!/usr/bin/perl
use LWP::UserAgent;
use HTTP::Request::Common qw(GET POST);
use HTML::Parser;
use URI;
use HTML::Entities;

use constant MAINPAGE =>
    'http://en.wikipedia.org/wiki/Wikipedia:Tutorial_%28Keep_in_mind%29/sandbox';
use constant EDITPAGE => 'http://en.wikipedia.org/w/index.php'
    . '?title=Wikipedia:Tutorial_%28Keep_in_mind%29/sandbox';

# These are form inputs we care about on the edit page
my @wpTags = qw(wpEditToken wpAutoSummary wpStarttime wpEdittime wpSave);

# This is called on opening tags like <textarea> and <input>
sub findPageData {
    my ( $self, $tag, $attr ) = @_;

    # ignore elements that have no name attribute
    my $name = $attr->{name};
    return unless defined $name;

    # signal to the endHandler handler if we find the text
    if ( $name eq "wpTextbox1" ) {
        $main::wpTextboxFound = 1;
        return;
    }
    elsif ( grep { $_ eq $name } @wpTags ) {
        # if it's one of the form parameters we care about,
        # record the parameter's value for use in our submission later.
        $main::parms{$name} = $attr->{value};
        return;
    }
}

# This is called on closing tags like </textarea>
sub endHandler {
    return unless $main::wpTextboxFound;
    my ( $self, $tag, $attr, $skipped ) = @_;
    if ( $tag eq "textarea" ) {
        $main::parms{"wpTextbox1"} = $skipped;
        undef $main::wpTextboxFound;
    }
}

sub checkError {
    my $resp = shift;
    if ( ( $resp->code() < 200 ) || ( $resp->code() >= 400 ) ) {
        print "Error: " . $resp->status_line . "\n";
        exit 1;
    }
}

###
### MAIN
###

# First, fetch the main wikipedia sandbox page. This just confirms
# our connectivity and makes sure it really works.
my $UA   = LWP::UserAgent->new();
my $req  = HTTP::Request->new( GET => MAINPAGE );
my $resp = $UA->request($req);

checkError($resp);

# Now fetch the edit version of that page
$req->uri( EDITPAGE . '&action=edit' );
$resp = $UA->request($req);

checkError($resp);

# Build a parser to parse the edit page and find the text on it.
my $p = HTML::Parser->new(
    api_version   => 3,
    start_h       => [ \&findPageData, "self,tagname,attr" ],
    end_h         => [ \&endHandler, "self,tagname,attr,skipped_text" ],
    unbroken_text => 1,
    attr_encoded  => 0,
    report_tags   => [qw(textarea input)]
);
$p->parse( $resp->content );
$p->eof;

# The text will have entities encoded (e.g., &lt; instead of <)
# We have to decode them and submit raw characters.
$main::parms{wpTextbox1} = decode_entities( $main::parms{wpTextbox1} );

# make our trivial edit. append text to whatever was already there.
$main::parms{wpTextbox1} .= "\r\n\r\n===Test 1===\r\n\r\n"
    . "ISBN: 9780596514839\r\n\r\nThis is a test.\r\n\r\n";

# POST our edit
$req = HTTP::Request::Common::POST(
    EDITPAGE,
    Content_Type => 'form-data',
    Content      => \%main::parms
);
$req->uri( EDITPAGE . '&action=submit' );

$resp = $UA->request($req);
checkError($resp);

# We expect a 302 redirection if it is successful.
3. Discussion
This kind of test is most applicable in web applications that
change a lot between requests. Perhaps it is a blog, forum, or document
management system where multiple users may simultaneously be
introducing changes to the application’s state. If you have to find
parameters before you can modify them and send them back, this is the
recipe for you.
The script in Example 1 is pretty
complex. The main reason for that complexity is the way <textarea>
elements are handled in HTML::Parser.
Many form elements are self-contained (i.e., the value is inside the
element itself) like <input type="hidden"
name="date" value="20080101">. In an element like that, you
just find the one named “date” and look at its value. In a text area, we
have a start tag, an end tag, and the text we care about in between. Our
parser, therefore, has a “start” handler and an “end” handler. If the
start handler sees the start of the textarea, we check to see if it’s the one we
want (the one named wpTextbox1). If
we found the textarea we want, the start handler
sets a signal variable to tell the end handler that we just passed the
text we want. The end handler scoops up the “skipped” text from the
parser and we’re done. The skipped text has HTML entities encoded
(for example, < appears as &lt;). We have to decode those because
Wikipedia expects raw input (i.e., it wants the real, raw <
character). Once we know what we
originally received, we will simply append our demonstration text to
it.
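To see the start-handler and end-handler interplay in isolation, the following is a minimal sketch that parses a small, made-up form from a string instead of fetching a live page. The HTML snippet, the field name body, and the names start_handler, end_handler, and $in_body are assumptions for illustration only; they do not come from Wikipedia's edit page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
use HTML::Entities;

# Hypothetical stand-in for an edit page; not real Wikipedia markup.
my $html = <<'HTML';
<form>
<input type="hidden" name="date" value="20080101">
<textarea name="body">1 &lt; 2 &amp; so on</textarea>
</form>
HTML

my %parms;          # form values collected while parsing
my $in_body = 0;    # set by the start handler, cleared by the end handler

# Called on <input> and <textarea> start tags.
sub start_handler {
    my ( $tag, $attr ) = @_;
    my $name = $attr->{name};
    return unless defined $name;
    if ( $tag eq 'input' ) {
        # Self-contained element: the value is an attribute.
        $parms{$name} = $attr->{value};
    }
    elsif ( $tag eq 'textarea' && $name eq 'body' ) {
        # The value arrives later, as text skipped before the end tag.
        $in_body = 1;
    }
}

# Called on </textarea>; $skipped holds the text since the last reported event.
sub end_handler {
    my ( $tag, $skipped ) = @_;
    return unless $in_body && $tag eq 'textarea';
    $parms{body} = decode_entities($skipped);    # &lt; becomes <
    $in_body = 0;
}

my $p = HTML::Parser->new(
    api_version   => 3,
    start_h       => [ \&start_handler, "tagname,attr" ],
    end_h         => [ \&end_handler, "tagname,skipped_text" ],
    unbroken_text => 1,
    report_tags   => [qw(textarea input)],
);
$p->parse($html);
$p->eof;

print "$_ = $parms{$_}\n" for sort keys %parms;
```

Because no text handler is registered, the parser treats the textarea's contents as skipped text, which is why the end handler is the place to collect it.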
There’s another bit of special handling we’re doing that relates
to the URLs we are GETting and POSTing. We append the action to the URL
using concatenation instead of just embedding it in the EDITPAGE constant. That is, we set the URL
using $req->uri(EDITPAGE .
'&action=edit'). If the ampersand is in the original URL
that is passed to HTTP::Request::Common::POST, then the
ampersand will be encoded as %26,
which won’t be parsed by Wikipedia correctly.
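The two-step URL handling can be tried without touching the network. This is a rough sketch only: the host wiki.example.com, the page title, and the form values are placeholders invented for the illustration, and the request is built and inspected but never sent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical base URL; the action parameter is deliberately left off.
use constant EDITPAGE => 'http://wiki.example.com/w/index.php?title=Sandbox';

# Placeholder form values standing in for the parsed edit-page parameters.
my %form = ( wpTextbox1 => "This is a test.", wpSave => "Save page" );

# Step 1: build the multipart POST against the base URL.
my $req = POST( EDITPAGE, Content_Type => 'form-data', Content => \%form );

# Step 2: set the full URL, including the action, after the request exists.
$req->uri( EDITPAGE . '&action=submit' );

# Inspect the request instead of sending it.
print $req->method, " ", $req->uri, "\n";
print $req->header('Content-Type'), "\n";
```

Printing the request before sending it is a cheap way to confirm that the ampersand in the final URL is still a literal & and not %26.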