Page 593 - Beginning PHP 5.3
Chapter 18: String Matching with Regular Expressions
displayForm() outputs an HTML form that sends its data back to the find_links.php script. This
form contains just two controls: a url field for the user to enter a URL to scan and a Find Links button to
submit the form.
processForm() first performs some simple validation on the submitted URL: if it doesn’t begin with
http:// or https://, then http:// is assumed, and prepended to the URL. Notice the use of a regular
expression to determine if the URL begins with http:// or https://. This expression is delimited by vertical
bars (|) rather than the usual slashes; this saves having to escape the double slashes within the expression:
if ( !preg_match( '|^http(s)?\://|', $url ) ) $url = "http://$url";
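As a minimal sketch, the same check can be wrapped in a small helper so it is easy to try on its own. The function name normalizeUrl() is hypothetical (it is not part of the script), and the pattern is a slightly shortened equivalent that drops the unneeded backslash before the colon:

```php
<?php
// Hypothetical helper: prepend http:// when the URL has no http(s) scheme.
function normalizeUrl( $url ) {
  // Vertical-bar delimiters mean the slashes in "://" need no escaping.
  if ( !preg_match( '|^https?://|', $url ) ) {
    $url = "http://$url";
  }
  return $url;
}

echo normalizeUrl( "www.example.com" ) . "\n";         // http://www.example.com
echo normalizeUrl( "https://www.example.com" ) . "\n"; // https://www.example.com
```

Because the pattern has no i modifier, a scheme typed in uppercase (HTTP://) would not be recognized and would have http:// prepended.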
Once the URL has been validated, it’s passed to the built-in file_get_contents() function. You may
remember from Chapter 11 that, when passed a URL, file_get_contents() requests that URL and
returns the contents of the page at that URL, just as if it were reading a file. This is a quick and easy way
to read the HTML of a Web page.
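file_get_contents() returns false if the read fails, so a script may want to check the result before scanning it. The following is a rough sketch only: fetchPage() is a hypothetical helper, and the demo reads a temporary local file so that it runs without network access. A URL would be passed in the same way, provided the allow_url_fopen setting is enabled:

```php
<?php
// Hypothetical helper: return the page contents, or null if the read fails.
function fetchPage( $source ) {
  // @ suppresses the warning raised on failure; the return value is checked instead.
  $html = @file_get_contents( $source );
  return ( $html === false ) ? null : $html;
}

// Demo: a local temporary file stands in for a Web page.
$tmp = tempnam( sys_get_temp_dir(), "page" );
file_put_contents( $tmp, "<a href='http://www.example.com/'>Example</a>" );

$html = fetchPage( $tmp );
echo ( $html === null ) ? "Could not read the page\n" : "Page read successfully\n";

// Clean up the temporary file.
unlink( $tmp );
```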
The meat of the function is in the call to the preg_match_all() function, which uses a regular
expression to extract all the linked URLs in the page:
preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );
This regular expression reads as follows:
1. Match an opening angle bracket (<) and the letter “a” followed by zero or more whitespace characters.
2. Match the characters “href=”, followed by either a single or double quote character (either can
be used in HTML).
3. Match at least one character followed by another single or double quote. The question mark
ensures that the matching is non-greedy (otherwise all text up to the last single or double quote in
the page would be matched). The pattern is enclosed in parentheses to capture the resulting URL.
4. Match zero or more characters, followed by a closing angle bracket. This ensures that the whole
of the <a> tag is matched. Again, non-greedy matching is used, otherwise all text would be
matched up to the last closing angle bracket in the page.
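To see these four steps in action, the pattern can be run against a short piece of markup. This is a sketch only; the sample HTML is made up for illustration:

```php
<?php
// Sample markup, made up for illustration.
$html = '<p><a href="http://www.example.com/">One</a> and ' .
        "<A HREF='/about.html' class='nav'>Two</A></p>";

// Same pattern as in the script: capture the quoted href value non-greedily.
preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );

// $matches[0] holds the full <a ...> tags; $matches[1] holds the captured URLs.
print_r( $matches[1] );
```

Here $matches[1] contains the two URLs, http://www.example.com/ and /about.html. Note that, as written, the pattern only finds links whose href attribute comes immediately after the tag name; a tag such as <a class="nav" href="..."> would be skipped.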
Notice the letter ‘i’ after the closing delimiter. This is known as a pattern modifier, and it causes the
matching to be case-insensitive (because HTML can be written in upper- or lowercase). For more details,
see the “Altering Matching Behavior with Pattern Modifiers” section toward the end of the chapter.
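The effect of the modifier can be checked by running the pattern with and without it. A small sketch, using a made-up sample tag written in uppercase:

```php
<?php
// Made-up sample: an anchor tag written in uppercase.
$html = '<A HREF="http://www.example.com/">Example</A>';

// Without the i modifier, the uppercase tag is not matched.
$withoutI = preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/", $html, $m1 );

// With the i modifier, it is matched.
$withI = preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $m2 );

// preg_match_all() returns the number of full pattern matches.
echo "Without i: $withoutI, with i: $withI\n"; // Without i: 0, with i: 1
```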
Now that all the linked URLs have been extracted, it’s simply a case of displaying them as an unordered
list. Notice that, for both security and XHTML compliance reasons, htmlspecialchars() is called to
escape any markup characters in the output:
echo '<div style="clear: both;"></div>';
echo "<h2>Linked URLs found at " . htmlspecialchars( $url ) . ":</h2>";
echo "<ul>";
for ( $i = 0; $i < count( $matches[1] ); $i++ ) {
  echo "<li>" . htmlspecialchars( $matches[1][$i] ) . "</li>";
}
echo "</ul>";
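The same list could also be built with a foreach loop, which avoids calling count() on every pass. The following is a sketch rather than the script's own code: buildLinkList() is a hypothetical helper and the sample URLs are made up:

```php
<?php
// Hypothetical helper: build the <ul> markup from an array of URLs.
function buildLinkList( $urls ) {
  $out = "<ul>";
  foreach ( $urls as $linkUrl ) {
    // Escape markup characters such as &, < and > so a URL cannot inject markup.
    $out .= "<li>" . htmlspecialchars( $linkUrl ) . "</li>";
  }
  return $out . "</ul>";
}

echo buildLinkList( array( "/search?q=a&b=c", "/about.html" ) ) . "\n";
// <ul><li>/search?q=a&amp;b=c</li><li>/about.html</li></ul>
```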