9

I'm trying to create a static mirror of a php application (an old php Gallery installation, specifically). The app produces URLs such as:

view_album.php?set_albumName=MyAlbum

wget downloads these directly to files named the same, complete with question marks. In order to not break inbound links, I'd like to keep those names. But how do I serve them? I'm running into two problems:

  1. Webservers (correctly) attempt to find "view_album.php" and pass it the query args, rather than finding a file with a question mark in its name. How do I tell a webserver to look for files with a question mark in them? Renaming the files isn't desirable, as it would break inbound links. I can't tell the inbound linkers to %-encode their URLs.

  2. The file names don't end in .html, so most webservers won't send an HTML content-type header. What configuration parameters should I look for to force a 'text/html' content-type for all files in a directory or matching a certain pattern?

I'm using lighttpd ultimately, but if you know what sort of configuration might get the desired results with apache/nginx I'd love to hear that too.

masegaloeh
user67641

3 Answers

6

wget downloads these directly to files named the same, complete with question marks.

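You can disable that behavior with --restrict-file-names=ascii,windows. This resolves the issue in wget itself, without needing fancy server configs. Note that in windows mode wget writes @ instead of ? to separate the query portion of the file name, so the saved names will differ from the original URLs.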
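As a rough sketch of that invocation: the wrapper name mirror_restricted and the WGET override are hypothetical conveniences for dry runs, not wget features, and --recursive is an assumption about how the mirror is being made.

```shell
# Hypothetical wrapper around wget; not part of wget itself.
# Set WGET=echo to print the arguments instead of downloading anything.
mirror_restricted() {
    # --recursive is an assumption; the option from the answer is
    # --restrict-file-names=ascii,windows.
    "${WGET:-wget}" --recursive --restrict-file-names=ascii,windows "$1"
}

# Usage (quote the URL so the shell does not treat '?' as a glob):
# mirror_restricted 'http://example.com/view_album.php?set_albumName=MyAlbum'
```

Quoting the URL matters here, since an unquoted ? is a wildcard to the shell.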

3

I think you can also fix this by changing the way wget downloads the php files:

wget -r --adjust-extension --convert-links 'http://example.com/index.php?foo=bar'

Option --adjust-extension makes wget save the PHP files with a .html extension, e.g. index.php?foo=bar.html

Option --convert-links makes wget convert the links in the downloaded files to the newly created .html files. Note that this conversion takes place after all files have been downloaded.
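To see which names wget actually produced, it may help to list the mirrored files that still contain a question mark. This is only a sketch; the function name list_query_named_files is made up for illustration.

```shell
# Hypothetical helper: print every regular file under the given directory
# whose name contains a literal '?', sorted for stable output.
list_query_named_files() {
    # In a find -name pattern, '\?' matches a literal question mark
    # (an unescaped '?' would match any single character).
    find "$1" -type f -name '*\?*' | sort
}

# Usage, assuming the mirror was written to ./example.com:
# list_query_named_files example.com
```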

See also: http://fvue.nl/wiki/Wget_storing_files_with_question_marks

fvue
0

I think you can use mod_rewrite in Apache to do this. Ideally, if you tell mod_rewrite to do what looks like a useless rewrite, you can trick it into thinking it should serve a file whose name includes the query-string. Put something like this in your server config (not, unfortunately, in a .htaccess or a <Directory> block):

RewriteEngine on
RewriteCond %{QUERY_STRING} (.*)
RewriteRule ^(.*) /path/to/webdir/$1?%1

I don't know what this will do to URLs with multiple question-marks. I think it'll also append a question-mark to URLs with no query-string. You could change the first regexp to (.+), but then it'd strip the question-mark from URLs with an empty query-string.

If that doesn't work, you could rename the files so their names contain no question-marks (e.g. replace each ? with a % or something) and use:

RewriteEngine on
RewriteCond %{QUERY_STRING} (.*)
RewriteRule ^(.*) /path/to/webdir/$1\%%1
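The renaming step itself could be sketched as a small shell function. This is untested against a real Gallery mirror; the function name rename_question_marks is hypothetical, and it assumes no renamed file collides with an existing name.

```shell
# Hypothetical helper: under the given directory, rename every regular file
# so that each literal '?' in its name becomes '%'.
rename_question_marks() {
    # '\?' in the -name pattern matches a literal question mark.
    find "$1" -depth -type f -name '*\?*' -exec sh -c '
        for path do
            dir=${path%/*}      # everything before the last slash
            base=${path##*/}    # the file name itself
            mv -- "$path" "$dir/$(printf "%s" "$base" | tr "?" "%")"
        done
    ' sh {} +
}

# Usage, assuming the mirror lives in /path/to/webdir:
# rename_question_marks /path/to/webdir
```

Only the file name is rewritten, not the directory part, so a ? in a directory name would be left alone.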

I don't know how this deals with PATH_INFO. If Gallery uses it, you may need to add something like:

RewriteCond %{PATH_INFO} (.*)
RewriteRule ^(.*) /path/to/webdir/$1/%1

(But then you'd have a conflict if Gallery used both "http://.../index.php" and "http://.../index.php/foobar", since you couldn't have index.php on the filesystem be both a file and a directory. You could get around that by doing some more name munging.)

While we're throwing in a bunch of mod_rewrite, might as well use it to set MIME types:

RewriteRule \.php - [T=text/html]

or

RewriteCond %{REQUEST_FILENAME} \.jpg$
RewriteRule ^ - [T=image/jpeg]

or similar stuff. (Note how the first one would break if an album or photo name contained ".php", etc.)

Let us know how it turns out!

jade