crawler.pl
wget apparently doesn't distinguish URLs that differ only in their CGI parameters.
I wanted a more aggressive crawler, so I wrote one.
A crawler that follows links two levels deep and fetches everything unconditionally (well, it does at least stay on the same host):
#! /usr/bin/env perl
use warnings;
use strict;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use URI::URL;
use URI::Escape;
use Getopt::Long;
use Pod::Usage;

my $max_follow_links = 2;
my $wait_seconds     = 3;
my $verbose          = 0;
my $dest_dir         = '';
GetOptions(
    'level=i'    => \$max_follow_links,
    'wait=i'     => \$wait_seconds,
    'dest_dir=s' => \$dest_dir,    # a directory name is a string, not an integer
    'verbose'    => \$verbose,
    'help'       => sub { pod2usage(-exitstatus => 0) },
);
my $url = shift @ARGV or pod2usage(-exitstatus => 1);
$dest_dir = uri_escape($url) if $dest_dir eq '';

my $root_host = URI::URL->new($url)->host;    # only links on this host are followed
my $m = WWW::Mechanize->new(autocheck => 0);  # don't die on broken links
my %links;    # every distinct absolute link seen so far
my %visited;  # URLs already fetched, so nothing is fetched twice

my $collect_links;
$collect_links = sub {
    my ($url, $dep) = @_;
    return if $dep <= 0;
    return if $visited{$url}++;

    print STDERR "waiting $wait_seconds seconds ... \n" if $verbose;
    sleep $wait_seconds;

    # GET the page and save the raw response under its URI-escaped name
    my $res = $m->get($url);
    unless ($res->is_success) {
        warn 'failed to GET ', $url, ': ', $res->status_line, "\n";
        return;
    }
    my $local_path = $dest_dir . '/' . uri_escape($url);
    open my $fh, '>', $local_path or die "$!: $local_path";
    print $fh $res->as_string;
    close $fh;

    # collect absolute links, keeping only http(s) URLs on the original host
    my @links = grep { $_->scheme =~ /^https?$/ and $_->host eq $root_host }
                map  { $_->url_abs } $m->links;
    foreach (@links) {
        print STDERR $_, "\n" if $verbose;
        $links{$_} = $_;
    }
    print STDERR 'collected ', scalar(keys %links), " links in total\n" if $verbose;

    $collect_links->($_, $dep - 1) foreach @links;
};

mkdir $dest_dir unless -d $dest_dir;    # was: mkdir uri_escape $url, which ignored --dest_dir
$collect_links->(URI::URL->new($url), $max_follow_links);

__END__

=head1 NAME

crawler.pl - Simple Web Crawler

=head1 SYNOPSIS

crawler.pl [options] URL

 Options:
   --level      max follow depth [2]
   --wait       seconds to wait between requests [3]
   --dest_dir   destination directory [URI-escaped URL]
   --verbose    print progress to STDERR
   --help       shows this help

=head1 DESCRIPTION

B<crawler.pl> is a simple crawler which follows the links found at the
specified URL, up to the specified link depth, staying on the original
host. Each fetched page is saved under the URI-escaped form of its URL.

=head1 SEE ALSO

L<WWW::Mechanize>

=cut
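For example, to crawl two levels starting from a hypothetical site (example.com stands in for a real URL), assuming the script is saved as crawler.pl:

$ perl crawler.pl --verbose --level 2 --wait 3 http://example.com/

With no --dest_dir, pages end up in a directory named http%3A%2F%2Fexample.com%2F, i.e. the URI-escaped start URL.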
Fetched files are saved under URI-escaped names.
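A minimal sketch of that naming scheme with URI::Escape (the URL here is just an illustration):

use URI::Escape;    # uri_escape / uri_unescape

my $url  = 'http://example.com/search?q=perl&page=2';
my $name = uri_escape($url);
print "$name\n";                  # http%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dperl%26page%3D2
print uri_unescape($name), "\n";  # round-trips back to the original URL

Since the query string survives in the file name, two URLs that differ only in their CGI parameters get two distinct files, which is exactly what wget wasn't doing for me.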