When you need to archive sites for research, there's often no better tool than Wget. Wget is available for almost all platforms. If you don't already have access to Wget or have never used it before, check out my Wget OS tutorials: Using Wget on Linux, Using Wget on OSX, Using Wget on Windows, Using Wget on Android, and Using Wget on Chrome OS. If you have an operating system that I missed, please tell me.
Wget is amazing at backing up and cloning websites. There is no better alternative to Wget, just better ways to use it. Some sites, especially WordPress sites, can be a little tricky to save well. There are many recipes online for using Wget to grab websites, and they always need tweaking. So that you can avoid excessive tweaking, here's my one-line Wget "script."
wget --random-wait --wait 1 --mirror -p --html-extension -k --no-clobber -e robots=off --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -P . URL-TO-SITE
Here’s what that one line script does:
--random-wait: Wget doesn’t throttle its requests on its own, so it could cause problems if the server is under high load. It's for this reason that we start with --random-wait. It varies the time between requests between 0.5 and 1.5 times the wait seconds, where the wait is specified using the --wait option.
--wait 1: sets that base wait to one second, so requests end up between 0.5 and 1.5 seconds apart. Together, the two options both mask Wget's presence from analysis and play better with servers.
--mirror: turns on recursion (along with timestamping and unlimited depth) so that we download the entire site rather than just the one URL.
-p: downloads all page requisites (supporting media, photos, and CSS) rather than just the HTML.
--html-extension: adds an .html suffix to the downloaded HTML files. HTML extensions make it easy to look at the contents of the site without having to configure a new server. Newer versions of Wget call this option --adjust-extension (-E).
-k (--convert-links): rewrites URLs in the downloaded HTML files so that they point at the downloaded files rather than the online originals. Without this, saved sites need online access to the original server.
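--no-clobber: speeds things up by not downloading files that already exist locally. Be aware that --mirror turns on timestamping (-N), and Wget will not accept timestamping together with --no-clobber; if it complains, drop --no-clobber and let timestamping skip the files that have not changed.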
-e robots=off: executes the “robots off” command, telling Wget to ignore the site's robots.txt. Ignoring robots.txt is generally frowned on. If it's your site, this is OK. If you don’t own the site, it is polite to follow robots files. Robots.txt files are essentially notices. They are like "no dogs allowed" signs. There can be real reasons for these rules, or they might be there out of convention. Whether they are essential depends on the site and the specifics of the robots.txt.
--user-agent=agent-string: identifies Wget as agent-string to the HTTP server.
The HTTP protocol allows web clients to identify themselves using a "User-Agent" field. This identification is usually used for statistical purposes or for tracing protocol violations. Wget normally identifies as Wget/version, the version being the current version number of Wget. Masquerading as Google lets us see websites the way Google does. This is handy, as many sites limit general visitors' access but not Google's.
However, some sites are known to impose policies that tailor output according to the supplied "User-Agent." While customizing content based on browsers is not a horrible idea in theory, some sites abuse the feature, denying information to clients other than Firefox, Chrome, or, most often, Microsoft Internet Explorer. By changing the "User-Agent" line issued by Wget, we're able to sidestep these issues.
-P: sets the download directory (the directory prefix). In my one-liner, I left it at the default “.” (which means “here”).
URL-TO-SITE: this is the URL of the site you want to download.
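If you'd rather keep the recipe in a file than retype it, here is a minimal sketch of the same options laid out one per line. The SITE_URL and DEST_DIR values and the RUN switch are my own placeholders, not anything Wget defines. I've left --no-clobber out of this version because --mirror already turns on timestamping and Wget won't take both. By default the script only prints the command so you can look it over.

```shell
#!/bin/sh
# Hypothetical values: swap in the site you want and where to save it.
SITE_URL="https://example.com/"
DEST_DIR="."
UA="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# One option per line, in the same order as the walkthrough above.
set -- \
  --random-wait \
  --wait 1 \
  --mirror \
  -p \
  --html-extension \
  -k \
  -e robots=off \
  --user-agent="$UA" \
  -P "$DEST_DIR" \
  "$SITE_URL"

# Show the command for review (the printed line is not re-quoted).
echo "wget $*"

# RUN is a made-up switch for this sketch: nothing downloads unless RUN=1.
if [ "${RUN:-0}" = "1" ]; then
  wget "$@"
fi
```

Saved as, say, mirror.sh, running it with RUN=1 in the environment starts the actual download; without it you just get the preview line.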
If you need to adjust how Wget handles content hosted on more than one domain, use the -D domain-list option. We'll go over that in another tutorial.
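Until then, here is a rough sketch of the shape such a command might take. The two host names are made up for illustration. One thing worth knowing is that -D on its own does not let Wget leave the starting host; it has to be paired with -H (--span-hosts), and -D then limits which domains can be followed. As before, the RUN switch is my own placeholder and the script only prints the command unless you set it.

```shell
#!/bin/sh
# Hypothetical hosts: the main site plus a CDN that serves its images.
SITE_URL="https://example.com/"
DOMAINS="example.com,cdn.example.com"

# -H lets recursion follow links to other hosts;
# -D restricts those hosts to the comma-separated list.
set -- --mirror -p -k -H -D "$DOMAINS" "$SITE_URL"

# Show the command for review.
echo "wget $*"

# RUN is a made-up switch for this sketch: nothing downloads unless RUN=1.
if [ "${RUN:-0}" = "1" ]; then
  wget "$@"
fi
```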