How to Scrape Archive.org Like a Pro


This is an article on advanced technical SEO, and since you’re here, I assume you know why one would want to scrape Archive.org, so I won’t go into details on that. Let’s get down to business and look at some interesting SEO tools.

How to Scrape Archive.org with ArchiveScraper.net


Have a look at https://archivescraper.net/. It’s a startup and currently in beta. The pricing structure is very simple: you pay $5 per scrape. Credits can be replenished via PayPal, and if you find that a payment hasn’t gone through, either log out and log back in or wait 12 hours for PayPal to finalise the payment. After all, it’s a beta, so don’t expect it to be perfect.

On the other hand, this SEO tool is extremely simple and useful, and it saves a lot of time. The first step is to browse the Archive.org calendar and find the date that holds the fullest version of the site. Click through a few snapshots and check for broken links or missing images. Write down the date, enter the domain name in ArchiveScraper.net, press Enter, select the chosen date from the calendar, and off you go.
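If clicking through the calendar gets tedious, you can shortlist candidate dates with a script first. Below is a minimal sketch that queries the Wayback Machine's CDX API and counts successful captures per day. The helper names (`cdx_query_url`, `captures_per_day`, `busiest_days`) are made up for this example, and treating "most captures" as "fullest version" is only a rough heuristic, so you still need to eyeball the snapshot.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain):
    """Build a CDX query for every successful capture under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains such as www.
        "output": "json",
        "fl": "timestamp,original",  # only the fields we need
        "filter": "statuscode:200",  # skip redirects and errors
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def captures_per_day(rows):
    """Count captures per YYYYMMDD day; rows[0] is the CDX header row."""
    if not rows:
        return Counter()
    header, *records = rows
    ts = header.index("timestamp")
    return Counter(record[ts][:8] for record in records)


def busiest_days(domain, top=5):
    """Fetch the capture list (network call) and return the top days."""
    with urlopen(cdx_query_url(domain)) as response:
        rows = json.load(response)
    return captures_per_day(rows).most_common(top)
```

Calling `busiest_days("example.com")` should give you a short list of `(day, count)` pairs to check in the calendar. Large sites can return a lot of rows, so expect the request to take a while.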

Using Wayback Machine Downloader


This is a more established SEO tool that lets you scrape the Wayback Machine like a pro. It’s more expensive and more complicated than the previous one: it will cost you $15 per site, or $45 per site if you want the content ported over to WordPress format.

Using HTTrack Software


If you’ve got too much time on your hands, or you have a big team struggling to find ways of keeping themselves busy, you can scrape Archive.org the old-school way using HTTrack. There are two ways to go about this. The first is to choose “Download Web site(s)” from the dropdown on project setup and rely on HTTrack’s own crawler.

The second: since Archive.org’s file structure is not always consistent, you may find that you get more complete results by giving HTTrack a list of URLs instead of relying on its crawler.

You can try to obtain a full list of an archived site’s URLs by crawling it with Xenu. Then save the list of URLs to a text file and choose “Get Separated Files” from HTTrack’s project menu.
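A crawler is not the only way to build that list. If you already have capture records as `(timestamp, original URL)` pairs, for instance from the Wayback Machine's CDX API, a short script can turn them into a text file for HTTrack. This is a sketch under assumptions: `snapshot_urls` and `write_url_list` are hypothetical helper names, and the script simply picks, for each page, the newest capture on or before the date you chose.

```python
def snapshot_urls(rows, cutoff):
    """Return one Wayback URL per page, using the newest capture <= cutoff.

    rows   -- iterable of (timestamp, original_url), timestamps as 14 digits
    cutoff -- chosen date as 'YYYYMMDD'
    """
    limit = cutoff.ljust(14, "9")  # treat the cutoff as the end of that day
    best = {}
    for timestamp, original in rows:
        # Fixed-width digit strings compare correctly as plain strings.
        if timestamp <= limit and timestamp > best.get(original, ""):
            best[original] = timestamp
    return [
        f"https://web.archive.org/web/{timestamp}/{original}"
        for original, timestamp in sorted(best.items())
    ]


def write_url_list(urls, path):
    """Write one URL per line, ready to be loaded as a URL list."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
```

For example, `write_url_list(snapshot_urls(rows, "20150301"), "urls.txt")` would leave you with a `urls.txt` to feed into the project. Pages first captured after the cutoff are skipped on purpose, so the mirror stays close to one point in time.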

Either way, you will be left with a bunch of HTML files that need post-processing. Every internal and external URL in them now contains a reference to archive.org that you need to remove.

The best way of doing this is to use Dreamweaver’s Find and Replace functionality. If you value your time, you’ll probably choose one of the first two methods.
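If you don't have Dreamweaver to hand, the same find and replace can be scripted. The sketch below strips the Wayback prefix (with or without the host, and with two-letter modifiers such as `im_`) from any URL in the downloaded files. The function names are made up for this example, the pattern is my assumption about what the rewritten links look like, and it only touches URL prefixes: any toolbar markup or scripts that Archive.org injected would still need removing separately. Run it on a copy of the files first.

```python
import re
from pathlib import Path

# Matches e.g. "https://web.archive.org/web/20150301120000/" or
# "/web/20150301120000im_/", but only when an original URL follows.
WAYBACK_PREFIX = re.compile(
    r"(?:(?:https?:)?//web\.archive\.org)?/web/\d{4,14}(?:[a-z]{2}_)?/"
    r"(?=https?://)",
    re.IGNORECASE,
)


def strip_wayback(html):
    """Remove Wayback Machine prefixes, leaving the original URLs."""
    return WAYBACK_PREFIX.sub("", html)


def clean_directory(root):
    """Rewrite every .html file under root in place; return files changed.

    Assumes the files are UTF-8; undecodable bytes are replaced.
    """
    changed = 0
    for path in Path(root).rglob("*.html"):
        original = path.read_text(encoding="utf-8", errors="replace")
        cleaned = strip_wayback(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Something like `clean_directory("mirror/")` would process a whole HTTrack project folder in one go.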

By Gerald Curtis
