Web Hosting Forum | Lunarpages

Author Topic: Using the Yahoo! Pipes Fetch Page module to make a web scraper  (Read 11187 times)

Offline Mitch

Using the Yahoo! Pipes Fetch Page module to make a web scraper
« on: December 14, 2007, 09:03:17 AM »
From Lunarforums member, Day:

I've written on these forums before about how to write a web scraper in PHP that can be hosted on a Lunarpages account.  But I recently discovered Yahoo! Pipes and its new Fetch Page module, which lets you build web scrapers visually using drag and drop and a few rules.  The resulting scraper is hosted by Yahoo!, so you don't need to write any PHP, figure out how to do cached HTTP requests from your Lunarpages account, or schedule any cron jobs.

So I abandoned my previous PHP scripts and created a Pipe to make the RSS feeds for these forums.  I've written up how I did it in a tutorial.  The general techniques should be usable by anyone who needs to extract data from web pages without RSS feeds or other structured data sets.
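
For readers who want to see the general technique outside of Pipes, here is a rough Python sketch of the same idea: cut out the part of the page you need, split it into items, then pull a link and title out of each item. It is not the Pipe from the tutorial. The sample HTML, the markers, and the helper names (cut_between, split_items) are all made up for illustration, and a real page would be fetched over HTTP rather than hard-coded.

```python
import re


def cut_between(html, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker."""
    start = html.find(start_marker)
    if start == -1:
        return ""  # start marker not found: nothing to cut
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return html[start:]  # no end marker: keep everything after the start
    return html[start:end]


def split_items(section, delimiter):
    """Split the section on a delimiter and drop empty pieces."""
    return [piece.strip() for piece in section.split(delimiter) if piece.strip()]


# Hypothetical page content, standing in for a fetched forum index.
SAMPLE_PAGE = """<html><body>
<div id="header">Site menu</div>
<ul id="topics">
<li><a href="/topic/1">First topic</a></li>
<li><a href="/topic/2">Second topic</a></li>
</ul>
<div id="footer">Copyright</div>
</body></html>"""

# Step 1: keep only the part of the page that holds the topic list.
section = cut_between(SAMPLE_PAGE, '<ul id="topics">', "</ul>")

# Step 2: split that part into one chunk per topic.
items = split_items(section, "</li>")

# Step 3: extract (link, title) from each chunk with a regular expression.
LINK = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
entries = [m.groups() for m in (LINK.search(item) for item in items) if m]

print(entries)
```

Each (link, title) pair is what would become one item in the feed. The markers do not have to be table tags; any text that reliably appears just before and just after the part you want will do.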

Yahoo! Pipes Tutorial - An example using the Fetch Page module to make a web scraper

Offline Sudija

Re: Using the Yahoo! Pipes Fetch Page module to make a web scraper
« Reply #1 on: April 17, 2008, 02:16:11 PM »
 :clap: Awesome tutorial on Pipes!

I have a little problem applying it to the forum I am trying to turn into a feed - http://www.pigeon-chat.com/search.php?search_id=newposts
Please keep in mind I'm a total newbie in this area.
My first problem kicks in when I try to cut out the part that I need. No tables!
The second thing is that I get a lot of junk in my output, and Regex doesn't work for me: I can't replace the crap with nothing to keep it out.

I just want these posts to appear as an RSS feed, that's all  :rofl:

Any help is appreciated!
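
For reference, "replacing junk with nothing" amounts to substituting an empty string for whatever the pattern matches. A small Python sketch of that idea follows. The sample markup is invented, not taken from pigeon-chat.com, and clean_item is a hypothetical helper name; the patterns would need adjusting to the real output.

```python
import re


def clean_item(raw):
    """Strip tags and stray entities from one scraped chunk of HTML."""
    text = re.sub(r"<[^>]+>", "", raw)   # replace every HTML tag with nothing
    text = text.replace("&nbsp;", " ")   # turn non-breaking-space entities into spaces
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()


# Made-up example of a messy item, for illustration only.
raw = '<span class="icon"><img src="new.gif"></span>&nbsp;<a href="viewtopic.php?t=42">Racing   pigeons</a>\n'

print(clean_item(raw))
```

The same kind of pattern with an empty replacement is the usual way to drop unwanted markup from a field, assuming the pattern really matches the junk as it appears in the output.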

 
