| Credits | p. ix |
| Preface | p. xv |
| Walking Softly | p. 1 |
| A Crash Course in Spidering and Scraping | p. 1 |
| Best Practices for You and Your Spider | p. 3 |
| Anatomy of an HTML Page | p. 7 |
| Registering Your Spider | p. 10 |
| Preempting Discovery | p. 12 |
| Keeping Your Spider Out of Sticky Situations | p. 15 |
| Finding the Patterns of Identifiers | p. 18 |
| Assembling a Toolbox | p. 21 |
| Perl Modules | p. 22 |
| Resources You May Find Helpful | p. 23 |
| Installing Perl Modules | p. 24 |
| Simply Fetching with LWP::Simple | p. 27 |
| More Involved Requests with LWP::UserAgent | p. 29 |
| Adding HTTP Headers to Your Request | p. 30 |
| Posting Form Data with LWP | p. 32 |
| Authentication, Cookies, and Proxies | p. 34 |
| Handling Relative and Absolute URLs | p. 38 |
| Secured Access and Browser Attributes | p. 40 |
| Respecting Your Scrapee's Bandwidth | p. 42 |
| Respecting robots.txt | p. 46 |
| Adding Progress Bars to Your Scripts | p. 47 |
| Scraping with HTML::TreeBuilder | p. 53 |
| Parsing with HTML::TokeParser | p. 56 |
| WWW::Mechanize 101 | p. 59 |
| Scraping with WWW::Mechanize | p. 62 |
| In Praise of Regular Expressions | p. 67 |
| Painless RSS with Template::Extract | p. 70 |
| A Quick Introduction to XPath | p. 74 |
| Downloading with curl and wget | p. 78 |
| More Advanced wget Techniques | p. 80 |
| Using Pipes to Chain Commands | p. 82 |
| Running Multiple Utilities at Once | p. 86 |
| Utilizing the Web Scraping Proxy | p. 89 |
| Being Warned When Things Go Wrong | p. 93 |
| Being Adaptive to Site Redesigns | p. 96 |
| Collecting Media Files | p. 99 |
| Detective Case Study: Newgrounds | p. 99 |
| Detective Case Study: iFilm | p. 105 |
| Downloading Movies from the Library of Congress | p. 108 |
| Downloading Images from Webshots | p. 111 |
| Downloading Comics with dailystrips | p. 115 |
| Archiving Your Favorite Webcams | p. 118 |
| News Wallpaper for Your Site | p. 122 |
| Saving Only POP3 Email Attachments | p. 125 |
| Downloading MP3s from a Playlist | p. 132 |
| Downloading from Usenet with nget | p. 137 |
| Gleaning Data from Databases | p. 141 |
| Archiving Yahoo! Groups Messages with yahoo2mbox | p. 141 |
| Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups | p. 143 |
| Gleaning Buzz from Yahoo! | p. 147 |
| Spidering the Yahoo! Catalog | p. 150 |
| Tracking Additions to Yahoo! | p. 157 |
| Scattersearch with Yahoo! and Google | p. 160 |
| Yahoo! Directory Mindshare in Google | p. 164 |
| Weblog-Free Google Results | p. 168 |
| Spidering, Google, and Multiple Domains | p. 171 |
| Scraping Amazon.com Product Reviews | p. 176 |
| Receive an Email Alert for Newly Added Amazon.com Reviews | p. 178 |
| Scraping Amazon.com Customer Advice | p. 180 |
| Publishing Amazon.com Associates Statistics | p. 182 |
| Sorting Amazon.com Recommendations by Rating | p. 185 |
| Related Amazon.com Products with Alexa | p. 188 |
| Scraping Alexa's Competitive Data with Java | p. 193 |
| Finding Album Information with FreeDB and Amazon.com | p. 194 |
| Expanding Your Musical Tastes | p. 203 |
| Saving Daily Horoscopes to Your iPod | p. 207 |
| Graphing Data with RRDTOOL | p. 209 |
| Stocking Up on Financial Quotes | p. 213 |
| Super Author Searching | p. 217 |
| Mapping O'Reilly Best Sellers to Library Popularity | p. 232 |
| Using All Consuming to Get Book Lists | p. 235 |
| Tracking Packages with FedEx | p. 241 |
| Checking Blogs for New Comments | p. 243 |
| Aggregating RSS and Posting Changes | p. 248 |
| Using the Link Cosmos of Technorati | p. 255 |
| Finding Related RSS Feeds | p. 259 |
| Automatically Finding Blogs of Interest | p. 270 |
| Scraping TV Listings | p. 273 |
| What's Your Visitor's Weather Like? | p. 277 |
| Trendspotting with Geotargeting | p. 281 |
| Getting the Best Travel Route by Train | p. 287 |
| Geographic Distance and Back Again | p. 290 |
| Super Word Lookup | p. 296 |
| Word Associations with Lexical Freenet | p. 300 |
| Reformatting Bugtraq Reports | p. 303 |
| Keeping Tabs on the Web via Email | p. 308 |
| Publish IE's Favorites to Your Web Site | p. 314 |
| Spidering GameStop.com Game Prices | p. 322 |
| Bargain Hunting with PHP | p. 325 |
| Aggregating Multiple Search Engine Results | p. 331 |
| Robot Karaoke | p. 335 |
| Searching the Better Business Bureau | p. 339 |
| Searching for Health Inspections | p. 342 |
| Filtering for the Naughties | p. 345 |
| Maintaining Your Collections | p. 349 |
| Using cron to Automate Tasks | p. 349 |
| Scheduling Tasks Without cron | p. 351 |
| Mirroring Web Sites with wget and rsync | p. 355 |
| Accumulating Search Results Over Time | p. 359 |
| Giving Back to the World | p. 363 |
| Using XML::RSS to Repurpose Data | p. 364 |
| Placing RSS Headlines on Your Site | p. 368 |
| Making Your Resources Scrapable with Regular Expressions | p. 371 |
| Making Your Resources Scrapable with a REST Interface | p. 378 |
| Making Your Resources Scrapable with XML-RPC | p. 381 |
| Creating an IM Interface | p. 385 |
| Going Beyond the Book | p. 389 |
| Index | p. 391 |