Let’s start with a look at the structure of URIs. This is probably familiar to most web developers, but it’s worth reviewing anyway.
[//[user:pass@][hostname.][:port]] is the netloc, but the password portion (IIRC) is deprecated for security reasons.
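To make the pieces of the netloc concrete, here is a minimal sketch that splits one apart. It assumes Python 3, where the urlparse module lives on as urllib.parse; the URI itself is a made-up example chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URI, chosen only to exercise every netloc component.
parts = urlsplit("http://user:pass@www.example.com:8080/python/urlparse.html")

print(parts.netloc)    # the whole [user:pass@]host[:port] section
print(parts.username)  # userinfo before the colon
print(parts.password)  # userinfo after the colon (the deprecated part)
print(parts.hostname)  # host name, lower-cased
print(parts.port)      # port as an int, or None when absent

# A relative reference has no netloc at all.
relative = urlsplit("urlparsedemo.html")
print(repr(relative.netloc))
```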
Relative URIs do not include a netloc, and are meant to be resolved against a base URI, which is the base URI of the document they are referenced from. The recipe for getting the base URI is to take the document URI and chop off anything after the last path delimiter (ie, /).
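As a rough sketch of that recipe, the function below keeps everything up to and including the last / of the path and drops the rest. It assumes Python 3's urllib.parse, and base_uri is a hypothetical helper name, not something the standard library provides.

```python
from urllib.parse import urlsplit, urlunsplit


def base_uri(document_uri):
    """Return document_uri with everything after the last '/' of its path removed."""
    parts = urlsplit(document_uri)
    # rpartition keeps the text before the last slash; an empty path yields ''.
    directory = parts.path.rpartition("/")[0] + "/"
    # Query and fragment belong to the document itself, not to the base.
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


print(base_uri("http://www.example.com/python/urlparse.html"))
print(base_uri("http://www.example.com/python/"))
print(base_uri("http://www.example.com/python"))
```

Note that the last call treats python as a file name and chops it off, because nothing in the string alone says it is a directory.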
Python offers the urlparse module for interacting with URLs. In this case, the urlparse.urljoin() function serves us well. You can simply pass it the document URI and the relative URI of a reference, and it will give you the full URI for the reference.
>>> import urlparse
>>> urlparse.urljoin('http://www.example.com/python/urlparse.html', 'urlparsedemo.html')
'http://www.example.com/python/urlparsedemo.html'
But there is a catch:
>>> urlparse.urljoin('http://www.example.com/python', 'urlparsedemo.html')
'http://www.example.com/urlparsedemo.html'
You have to get the right base URI, which means that a path that does not name a file needs that trailing /. It may not be present in a reference to the page, but it should be present in the URL you end up with if you actually load the page.
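The effect of the trailing slash is easiest to see side by side. This is a small sketch assuming Python 3's urllib.parse.urljoin (the successor to urlparse.urljoin); style.css is a hypothetical reference added for contrast.

```python
from urllib.parse import urljoin

without_slash = "http://www.example.com/python"
with_slash = "http://www.example.com/python/"

results = {}
for base in (without_slash, with_slash):
    for reference in ("urlparsedemo.html", "/style.css"):
        # Document-relative references depend on the trailing slash;
        # root-relative ones (starting with '/') do not.
        results[(base, reference)] = urljoin(base, reference)
        print(base, "+", reference, "->", results[(base, reference)])
```

Only the document-relative reference changes between the two bases, which is why a missing slash tends to show up as broken stylesheets and images rather than broken site-wide links.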
That brings us to the urllib2 module (though you could use a module that wraps curl's library, or Twisted, etc.). You can easily load a document over HTTP using urllib2.urlopen(). It returns a response object, which you can read with response.read() or similar reading methods, but you can also get the URL actually used for the response.
Since example.com happens to be unresponsive for me at the moment, I will be using google.com for this example:
>>> import urllib2
>>> response = urllib2.urlopen('http://google.com')
>>> response.geturl()
'http://www.google.com/'
We see that Google redirected the request to the www subdomain. Here’s another example, one that happens to be linked from their current homepage:
>>> urllib2.urlopen('http://google.com/nexus').geturl()
'http://www.google.com/nexus/'
Note that the actual URL has the ending slash, which means that relative references on that page will be resolved against a base URI that includes it. Without it, you would get a series of 404 responses for the CSS and images that are relatively referenced on that page.
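Putting the two halves together, one possible sketch is to ask the response for its final URL and resolve references against that rather than against the URL you requested. This assumes Python 3, where urllib2.urlopen() became urllib.request.urlopen(); final_url and resolve_references are hypothetical helper names, and the reference paths in the usage comment are made up.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def final_url(requested_url):
    """Fetch requested_url and return the URL that actually served the response."""
    with urlopen(requested_url) as response:
        # geturl() reflects any redirects that were followed.
        return response.geturl()


def resolve_references(document_url, references):
    """Resolve each reference against the URL the document was really loaded from."""
    return [urljoin(document_url, reference) for reference in references]


# Usage (needs network access, so it is left commented out):
# base = final_url("http://google.com/nexus")
# print(resolve_references(base, ["css/site.css", "images/logo.png"]))
```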
The main downside to using urllib2 is that you don’t get any kind of built-in caching. You can build your own caching, but before long you’re building all sorts of infrastructure beyond your small project. This is why I still believe the long-term future of the web on the desktop is having dedicated services for things like HTTP, with some capacity to bypass them through the browser. Having a service to handle the HTTP for a small Python application would save a lot of trouble, and would also let you run multiple browsers without the redundant caching and requests.
Furthermore, such a service could still have a permissive API that would allow direct, one-off loading of resources for situations such as a stale cache or cache-control headers that call for different behavior.
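To give a feel for where the do-it-yourself route starts, here is a deliberately naive sketch of an in-memory cache with a one-off bypass. It assumes Python 3's urllib.request; CachingLoader, fetch_over_http and the bypass_cache flag are all hypothetical names, and the sketch ignores expiry and cache-control headers entirely, which is exactly the infrastructure that piles up afterwards.

```python
from urllib.request import urlopen


def fetch_over_http(url):
    """Default loader: one plain HTTP request, no caching."""
    with urlopen(url) as response:
        return response.geturl(), response.read()


class CachingLoader:
    """Hypothetical in-memory cache keyed by the requested URL."""

    def __init__(self, fetch=fetch_over_http):
        self._fetch = fetch  # injectable so the loader can be tried offline
        self._cache = {}

    def load(self, url, bypass_cache=False):
        # bypass_cache forces a fresh request, e.g. when the cached copy is stale.
        if bypass_cache or url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

A caller would create one CachingLoader and call load(url) for ordinary requests, passing bypass_cache=True only for the one-off cases.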