crawler.pl
wget apparently doesn't distinguish URLs that differ only in their CGI parameters.
I wanted a more aggressive crawler, so I wrote one.
A crawler that follows links two levels deep and fetches everything unconditionally (well, it does at least stay on the same host):
#! /usr/bin/env perl
use warnings;
use strict;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use URI::URL;
use URI::Escape;
use Getopt::Long;
use Pod::Usage;

my $max_follow_links = 2;
my $wait_seconds     = 3;
my $verbose          = 0;
my $dest_dir         = '';
GetOptions(
    'level=i'    => \$max_follow_links,
    'wait=i'     => \$wait_seconds,
    'dest_dir=s' => \$dest_dir,    # a directory name is a string, not an integer
    'verbose'    => \$verbose,
    'help'       => sub { pod2usage(-exitstatus => 0) },
);
my $url = shift @ARGV or pod2usage(-exitstatus => 1);
$dest_dir = uri_escape($url) if $dest_dir eq '';

my $root_host = URI::URL->new($url)->host;    # only links on this host are followed
my $m = WWW::Mechanize->new(autocheck => 0);  # don't die on broken links
my %links;    # every distinct absolute link seen so far
my %visited;  # URLs already fetched, so nothing is fetched twice

my $collect_links;
$collect_links = sub {
    my ($url, $dep) = @_;
    return if $dep <= 0;
    return if $visited{$url}++;

    print STDERR "waiting $wait_seconds seconds ... \n" if $verbose;
    sleep $wait_seconds;

    # GET the page and save the raw response under its URI-escaped name
    my $res = $m->get($url);
    unless ($res->is_success) {
        warn 'failed to GET ', $url, ': ', $res->status_line, "\n";
        return;
    }
    my $local_path = $dest_dir . '/' . uri_escape($url);
    open my $fh, '>', $local_path or die "$!: $local_path";
    print $fh $res->as_string;
    close $fh;

    # collect absolute links, keeping only http(s) URLs on the original host
    my @links = grep { $_->scheme =~ /^https?$/ and $_->host eq $root_host }
                map  { $_->url_abs } $m->links;
    foreach (@links) {
        print STDERR $_, "\n" if $verbose;
        $links{$_} = $_;
    }
    print STDERR 'collected ', scalar(keys %links), " links in total\n" if $verbose;

    $collect_links->($_, $dep - 1) foreach @links;
};

mkdir $dest_dir unless -d $dest_dir;    # was: mkdir uri_escape $url, which ignored --dest_dir
$collect_links->(URI::URL->new($url), $max_follow_links);

__END__

=head1 NAME

crawler.pl - Simple Web Crawler

=head1 SYNOPSIS

crawler.pl [options] URL

 Options:
   --level      max follow depth [2]
   --wait       seconds to wait between requests [3]
   --dest_dir   destination directory [URI-escaped URL]
   --verbose    print progress to STDERR
   --help       shows this help

=head1 DESCRIPTION

B<crawler.pl> is a simple crawler which follows the links found at the
specified URL, up to the specified link depth, staying on the original
host. Each fetched page is saved under the URI-escaped form of its URL.

=head1 SEE ALSO

L<WWW::Mechanize>

=cut
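For example, to crawl two levels starting from a hypothetical site (example.com stands in for a real URL), assuming the script is saved as crawler.pl:

$ perl crawler.pl --verbose --level 2 --wait 3 http://example.com/

With no --dest_dir, pages end up in a directory named http%3A%2F%2Fexample.com%2F, i.e. the URI-escaped start URL.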
Fetched files are saved under URI-escaped names.
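A minimal sketch of that naming scheme with URI::Escape (the URL here is just an illustration):

use URI::Escape;    # uri_escape / uri_unescape

my $url  = 'http://example.com/search?q=perl&page=2';
my $name = uri_escape($url);
print "$name\n";                  # http%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dperl%26page%3D2
print uri_unescape($name), "\n";  # round-trips back to the original URL

Since the query string survives in the file name, two URLs that differ only in their CGI parameters get two distinct files, which is exactly what wget wasn't doing for me.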