Page 593 - Beginning PHP 5.3
Chapter 18: String Matching with Regular Expressions
displayForm() outputs an HTML form that sends its data back to the find_links.php script. This form contains just two controls: a url field for the user to enter a URL to scan and a Find Links button to submit the form.

processForm() first performs some simple validation on the submitted URL: if it doesn't begin with http:// or https://, then http:// is assumed and prepended to the URL. Notice the use of a regular expression to determine whether the URL begins with http:// or https://. This expression is delimited by vertical bars (|) rather than the usual slashes; this saves having to escape the double slashes within the expression:


    if ( !preg_match( '|^http(s)?\://|', $url ) ) $url = "http://$url";

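The validation step can be tried on its own. The following is a minimal sketch: the helper name normalizeUrl() and the example.com addresses are assumptions made for illustration and are not part of find_links.php.

```php
<?php
// Hypothetical helper (not part of find_links.php): prepends "http://"
// when the URL does not already begin with http:// or https://.
function normalizeUrl( $url ) {
  // Vertical-bar delimiters mean the slashes in "://" need no escaping
  if ( !preg_match( '|^http(s)?\://|', $url ) ) $url = "http://$url";
  return $url;
}

echo normalizeUrl( "www.example.com" ) . "\n";          // http://www.example.com
echo normalizeUrl( "https://www.example.com" ) . "\n";  // https://www.example.com
```

Wrapping the check in a function is only a convenience for experimenting; the script itself performs the test inline.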
Once the URL has been validated, it's passed to the built-in file_get_contents() function. You may remember from Chapter 11 that, when passed a URL, file_get_contents() requests that URL and returns the contents of the page at that URL, just as if it were reading a file. This is a quick and easy way to read the HTML of a Web page.
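As a rough sketch of how the call might be guarded, the example below checks the return value, since file_get_contents() returns false on failure. The wrapper name fetchPage() is hypothetical, and a local temporary file stands in for a URL so that the sketch runs without a network connection. Reading real URLs this way also depends on the allow_url_fopen setting being enabled.

```php
<?php
// Hypothetical wrapper (the name fetchPage() is illustrative only).
// Returns the contents as a string, or null if the read fails.
function fetchPage( $location ) {
  // @ suppresses the warning; failure is detected via the false return value
  $html = @file_get_contents( $location );
  if ( $html === false ) return null;
  return $html;
}

// A temporary file stands in for a URL here (an assumption for this sketch)
$tmp = tempnam( sys_get_temp_dir(), "page" );
file_put_contents( $tmp, "<html><body>Hello</body></html>" );

echo fetchPage( $tmp ) . "\n";               // the file's contents
var_dump( fetchPage( $tmp . ".missing" ) );  // NULL, because the read failed

unlink( $tmp );
```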

The meat of the function is in the call to the preg_match_all() function, which uses a regular expression to extract all the linked URLs in the page:

    preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );

This regular expression reads as follows:

1. Match an opening angle bracket (<) and the letter "a", followed by zero or more whitespace characters.
2. Match the characters "href=", followed by either a single or double quote character (either can be used in HTML).
3. Match at least one character followed by another single or double quote. The question mark ensures that the matching is non-greedy (otherwise all text up to the last single or double quote on the line would be matched, because by default the dot does not match newline characters). The pattern is enclosed in parentheses to capture the resulting URL.
4. Match zero or more characters, followed by a closing angle bracket. This ensures that the whole of the <a> tag is matched. Again, non-greedy matching is used; otherwise all text would be matched up to the last closing angle bracket on the line.

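The four steps above can be seen at work on a small piece of markup. This is only a sketch: the sample HTML string is invented for illustration, and the second, greedy pattern is not used by the script; it is included purely to show what the question marks prevent.

```php
<?php
// Sample markup invented for this illustration
$html = '<p><a href="one.html">One</a> and <a href="two.html" class="x">Two</a></p>';

// Non-greedy, as in the script: each href value is captured separately
preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $matches );
print_r( $matches[1] );  // two elements: one.html and two.html

// Greedy variant for comparison: (.+) runs on to the last quote on the line,
// so a single match swallows both links
preg_match_all( "/<a\s*href=['\"](.+)['\"].*>/i", $html, $greedy );
print_r( $greedy[1] );   // one element containing text from both tags
```

$matches[0] holds the full text of each matched tag, while $matches[1] holds the contents of the first parenthesized subpattern, which is why the script reads the URLs from $matches[1].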
Notice the letter i after the closing delimiter. This is known as a pattern modifier, and it causes the matching to be case-insensitive (because HTML can be written in upper- or lowercase). For more details, see the "Altering Matching Behavior with Pattern Modifiers" section toward the end of the chapter.
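To see the effect of the i modifier, the same pattern can be run with and without it against a tag written in uppercase. The sample markup here is an assumption made for the sake of the example.

```php
<?php
// Sample tag written in uppercase, invented for this illustration
$html = '<A HREF="page.html">Page</A>';

// Without the modifier, the lowercase pattern finds nothing
$without = preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/", $html, $m1 );

// With the i modifier, the uppercase tag matches
$with = preg_match_all( "/<a\s*href=['\"](.+?)['\"].*?>/i", $html, $m2 );

echo "Without i: $without match(es)\n";  // Without i: 0 match(es)
echo "With i: $with match(es)\n";        // With i: 1 match(es)
```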

Now that all the linked URLs have been extracted, it's simply a case of displaying them as an unordered list. Notice that, for both security and XHTML compliance reasons, htmlspecialchars() is called to escape any markup characters in the output:


    echo '<div style="clear: both;"> </div>';
    echo "<h2>Linked URLs found at " . htmlspecialchars( $url ) . ":</h2>";
    echo "<ul>";

    for ( $i = 0; $i < count( $matches[1] ); $i++ ) {
      echo "<li>" . htmlspecialchars( $matches[1][$i] ) . "</li>";
    }

    echo "</ul>";
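
The effect of the escaping is easiest to see with URLs that contain markup characters. The sketch below is a hypothetical variation, not the script's own code: the helper name buildLinkList() and the sample URLs are assumptions, and the list is built as a string so that the escaped result can be inspected before it is output.

```php
<?php
// Hypothetical helper (not in the original script): builds the list markup
// as a string so that it can be inspected before being echoed.
function buildLinkList( $urls ) {
  $out = "<ul>";
  foreach ( $urls as $linkUrl ) {
    // Escape characters such as &, < and > so the URL cannot inject markup
    $out .= "<li>" . htmlspecialchars( $linkUrl ) . "</li>";
  }
  return $out . "</ul>";
}

// Prints: <ul><li>page.php?a=1&amp;b=2</li><li>&lt;script&gt;</li></ul>
echo buildLinkList( array( "page.php?a=1&b=2", "<script>" ) ) . "\n";
```

Without the call to htmlspecialchars(), the second sample value would be sent to the browser as a real tag rather than displayed as text.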
