WinHTTrack : Very Handy, Somewhat Evil
I wanted to mention a very cool, very handy and somewhat evil program. It is called HTTrack, or WinHTTrack on Windows. It is a free, open source program for Windows and Linux that is designed to download an entire website.
It starts with a root index HTML file and automatically downloads all the pages and images linked from it. By its default settings it modifies the saved HTML so all the links are relative and you can browse the entire site offline. I tried it out with my website and a few of my favorite web comics, and saved a copy of 8Bitjoystick.com and Mental Ground Zero.
All the web pages, images and links were there. The comment forms were there too, but pushing the post button led you to the real site. It does not do its work instantly, and I would not use it on a slow connection, but with a bit of tweaking you can download copies of your favorite websites to browse offline.
This is going to come in very handy on my notebook. It is a bit like a super deluxe version of AvantGo for normal PCs.
However, this does raise some ethical questions. Since you can download entire sites with ease and format them for offline enjoyment, you can capture huge amounts of content without looking at or clicking on the ads. This can be slightly sketchy, and if you crank up the limits to increase the number of HTTP connections you can flood and crash some weaker servers if you are not careful.
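The polite alternative to cranking up the limits is to space requests out. Below is a rough Python sketch of a throttle that allows at most one request every few seconds. The `Throttle` class and its parameters are hypothetical and only illustrate the idea; they are not HTTrack's own settings.

```python
import time


class Throttle:
    """Allow at most one request per `min_interval` seconds."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None  # time the previous request was allowed to start

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

A crawler would call `throttle.wait()` right before each download, so a small server never sees more than one request per interval from you.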
It cannot get any authenticated or secret data from a website. Sure, you do get a hell of a lot of content, but it is all stuff that the authors put on the web in the first place.
One thing that sort of sucks is that all linked site forums are downloaded, so you get a hell of a lot of pages that you were not looking for. Also, this only gets the output of a database; it does not get the actual DB or the source files for dynamic PHP or ASP pages.
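The usual cure for unwanted pages is a URL filter that is checked before each download. Here is a small Python sketch of one, assuming simple wildcard patterns. The function name `should_download` and the `*/forum/*` pattern are made up for illustration and are not HTTrack's own filter syntax.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit


def should_download(url, root_host, exclude_patterns):
    """Return True if url is on root_host and matches no exclude pattern."""
    if urlsplit(url).netloc.lower() != root_host.lower():
        return False  # stay on the site we started from
    # Compare in lowercase so the patterns are case-insensitive
    return not any(fnmatch(url.lower(), p.lower()) for p in exclude_patterns)
```

For example, `should_download(url, "example.com", ["*/forum/*"])` keeps the comic archive pages on `example.com` but skips everything under `/forum/` and anything hosted elsewhere.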
I was able to download all the archive strips of Diesel Sweeties, PVP, Penny-Arcade and Coffee Brain for my own reading enjoyment when I am not connected to the internet, like when I am on a ferry across the sound.
This is a cool and slightly freaky program.
Jake at January 15, 2004
WebDev
Comments
I've used it for lots of sites that I've worried might not be online forever.
It is great for sites like these http://www.geocities.com/SoHo/Atrium/1031/trans/1trans.html
Posted by: PromoGuy at January 15, 2004 8:04 AM
This is a great piece of software, but the problem I am running into is that I am trying to cache something that uses JavaScript and Flash to link to files, and this program will not work with that.
Posted by: shawn at January 15, 2004 8:25 PM
I've used WinHTTrack to download and convert HUGE amounts of my own PHP web sites (> 2000) to HTML during a server migration from Apache to IIS6. I didn't want to run PHP on our new web server, so I fed WinHTTrack a list of the PHP sites on our Linux Apache server and sat back for a few days while it converted them to static HTML for me. The PHP was unnecessary in this case, since the code generated the same HTML file every time and didn't accept querystring variables. If server-side scripting can be avoided, the sites have a lighter impact on the server as static HTML files.
I do agree that it is a powerful tool. Just for fun, I ran 8 copies simultaneously, all pointed at my Apache server, and did an accidental HTTP DoS attack on my own web server :) It's like a giant website vacuum cleaner that can use up a lot of bandwidth if you don't set the throttles right. There is an option in the most recent version where you can choose not to download external pages, which keeps the downloaded sites more or less on target. Also, I've found it helpful to exclude all *google* URLs, since those can get the crawler stuck in a Google AdWords near-infinite loop.
Posted by: Rob Hall at August 19, 2006 12:43 PM

