for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)
# Code: http://www.py4e.com/code3/urllink2.py
python urllink2.py
Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Contents: ['\nSecond Page']
Attrs: {'href': 'http://www.dr-chuck.com/page2.htm'}
html.parser is the HTML parser included in the standard Python 3 library. Information on other HTML parsers is available at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
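For example, once a third-party parser such as lxml has been installed (e.g., with pip install lxml), you can select it by changing the second argument to the BeautifulSoup constructor. The following sketch is not from the book's sample code; the small HTML string is made up for illustration:

# A minimal sketch: parse the same document with two different parsers.
# Assumes the third-party lxml package is installed (pip install lxml).
from bs4 import BeautifulSoup

html = '<p>Hello <a href="http://www.py4e.com">there</a></p>'

# The second argument names the parser to use
soup = BeautifulSoup(html, 'html.parser')  # parser from the standard library
soup2 = BeautifulSoup(html, 'lxml')        # third-party parser, usually faster

# Both parsers give the same result for well-formed HTML
print(soup.a.get('href', None))
print(soup2.a.get('href', None))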
These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML.
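As one more illustration, BeautifulSoup can search for any kind of tag and pull out its text, not just anchor tags. The sketch below is not one of the book's samples; the document and tag names are made up for illustration:

# A small sketch of a few more BeautifulSoup features:
# find_all(), get_text(), and searching within a tag.
from bs4 import BeautifulSoup

html = '''<html><body>
<h1>First Page</h1>
<p class="intro">Welcome to the
<a href="http://www.dr-chuck.com/page2.htm">Second Page</a></p>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')

# Grab the text of the first <h1> tag
print('Title:', soup.h1.get_text())

# Find every paragraph, then search inside each one for anchors
for p in soup.find_all('p'):
    print('Paragraph:', p.get_text().strip())
    for a in p.find_all('a'):
        print('  Link:', a.get('href', None))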
12.9 Bonus section for Unix / Linux users
If you have a Linux, Unix, or Macintosh computer, you probably have commands built into your operating system that retrieve both plain text and binary files using the HTTP or File Transfer Protocol (FTP). One of these commands is curl:
$ curl -O http://www.py4e.com/cover.jpg
The command curl is short for “copy URL” and so the two examples listed earlier to retrieve binary files with urllib are cleverly named curl1.py and curl2.py on www.py4e.com/code3 as they implement similar functionality to the curl command. There is also a curl3.py sample program that does this task a little more effectively, in case you actually want to use this pattern in a program you are writing.
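The improvement in curl3.py is to copy the file a block at a time rather than reading the whole thing into memory at once. The following is a minimal sketch of that buffered pattern; the block size and variable names are illustrative and may differ from the actual curl3.py:

# A sketch of retrieving a binary file a block at a time,
# in the spirit of curl3.py (details may differ from that sample).
import urllib.request

url = 'http://www.py4e.com/cover.jpg'
response = urllib.request.urlopen(url)
fhand = open('cover.jpg', 'wb')
size = 0
while True:
    data = response.read(100000)  # read up to 100,000 bytes per block
    if len(data) < 1:
        break
    size = size + len(data)
    fhand.write(data)
fhand.close()
print(size, 'characters copied to cover.jpg')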
A second command that functions very similarly is wget:
$ wget http://www.py4e.com/cover.jpg
Both of these commands make retrieving webpages and remote files a simple task.
12.10 Glossary
BeautifulSoup A Python library for parsing HTML documents and extracting data from them that compensates for most of the imperfections in the HTML that browsers generally ignore. You can download the BeautifulSoup code from www.crummy.com.