Tuesday, February 9, 2010

Web Interaction Using Python

Introduction


In a number of the HTS programming missions you are asked to interact with the site from a program that you have written, as opposed to using a web browser. There are plenty of other applications for web interaction, however. I have written a few Python scripts to download various data from websites (e.g. http://python.pastebin.com/f268e6319 ).

I will cover two ways of getting data from a website (and in fact, sending data too). If there are any problems with the article, leave a comment.

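All examples have been written in Python 2.6. There are quite a few differences between 2.6 and 3.0. The most visible one in these snippets is the print function, but be aware that in 3.0 urllib2 was also merged into urllib.request (with urlencode moving to urllib.parse), and that sockets send and receive bytes rather than plain strings.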

In Python 2.6 a simple hello world is this:
CODE :
__________________________________________________________________________
print "Hello World"
__________________________________________________________________________

In Python 3.0 it looks like this:
CODE :
__________________________________________________________________________
print("Hello World")
__________________________________________________________________________

It's a good idea, and I will switch to 3.0 when it is finally worn in, but for the moment I'm sticking with 2.6.
If any of the code fails to run under 3.0, try the 2to3 script (it came preinstalled with Xubuntu for me; I'm not sure about Windows and other systems).
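Because urllib2 became urllib.request in Python 3, one way to keep a script running on both versions is a guarded import. This is only a minimal sketch, and the alias "urlrequest" is a name I made up for the illustration:

```python
# Try the Python 3 location first, then fall back to the Python 2 module.
try:
    import urllib.request as urlrequest   # Python 3
except ImportError:
    import urllib2 as urlrequest          # Python 2

# Either way, urlrequest.urlopen(url) is available from here on.
# print with parentheses and a single argument works on both versions.
print(urlrequest.urlopen.__name__)
```

With that in place, the urllib2.urlopen(...) calls in this article would be written as urlrequest.urlopen(...).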

Anyway, now that's all covered, on with the article.

The URL Libraries

First of all we will start with a tutorial on the URL libraries. These are urllib and urllib2.

Let's immediately get started with some code.
CODE :
__________________________________________________________________________
import urllib2
url = "http://example.com"
website = urllib2.urlopen(url)
print website.read()
__________________________________________________________________________

Pretty simple code really, and for a lot of things it's all you need to know. It fetches "http://example.com" and returns a file-like response object, on which we call read() to get the data retrieved from the site. Here are its main functions:
instance.read() Returns the data retrieved from the site.
instance.info() Returns the HTTP headers sent by the server. There is a lot of useful information in there, including cookie info and server type.
instance.geturl() Returns the URL of the page that was actually retrieved, which differs from the one you asked for if the server redirected you. It seems pointless, but we'll cover it in a second and you'll see why there is a point.
instance.getcode() Returns the HTTP status code (e.g. 200, 404).

It's worth messing around with those a bit, rather than just taking my word for what they do.
I'll now just show a use of the geturl() function:
CODE :
__________________________________________________________________________
import urllib2
url = "http://google.com" # After google, try 'http://example.com'
website = urllib2.urlopen(url)
if url == website.geturl():
    print "Website not redirected."
else:
    print "Website redirected you."
__________________________________________________________________________

Why you'd want to do that, I don't know, but there's bound to be a use for it sometime. Either way, that is one application of the geturl() function.
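If you want to poke at those four functions without hammering a real site, here is a rough sketch that starts a throwaway server on your own machine and fetches a page from it. The DemoHandler class, the "/final" path and inspect_response() are all made up for this illustration, and the guarded imports are only there so that it runs on Python 3 as well as 2.6:

```python
import threading

try:  # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import ProxyHandler, build_opener
except ImportError:  # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
    from urllib2 import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: every path redirects to "/final"."""

    def do_GET(self):
        if self.path == "/final":
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_response():
    """Fetch "/" from the local server and collect what the response reports."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        opener = build_opener(ProxyHandler({}))  # ignore any system proxy
        website = opener.open(url)
        result = {
            "requested": url,
            "final": website.geturl(),
            "code": website.getcode(),
            "type": website.info().get("Content-Type"),
            "body": website.read(),
        }
        website.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


result = inspect_response()
print(result["requested"] != result["final"])  # True: we were redirected
print(result["code"])
```

Here geturl() reports the "/final" address rather than the one that was asked for, which is exactly the comparison the redirect check relies on.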

Let's do an HTTP POST request now. They're pretty easy really, but can look a little complicated, so don't worry.
Before you look at the code, you might want to set up a server (or get some webspace) so you can test this out. A little PHP script like the one below will do the trick:
CODE :
__________________________________________________________________________
<?php
echo $_POST['test'];
?>
__________________________________________________________________________

And before anyone says anything about XSS - get lost - it's a test page that will be up for 10 minutes on a server that no one cares about. But if you really are that bothered, you can wrap the value in strip_tags(). (I say this because I can tell there'll be someone who will try and pipe up with a clever comment.)

Now then, we'll be introducing a new module for this (though it isn't strictly necessary, it's the best way I reckon). I will import the single function as we don't need any other functions from the module.

Okay, let's go:
CODE :
__________________________________________________________________________
import urllib2
from urllib import urlencode # new module and function

url = "http://localhost/test.php"
data = {'test':'lolwut'}
# you can add as much info as you want to this dictionary
# "test" is the label for the data, so that PHP script above
# should display "lolwut".

encoded_data = urlencode(data)
# remember that this is from that imported module, normally you'd
# use this: urllib.urlencode(data) if you used a normal import.

website = urllib2.urlopen(url, encoded_data)
print website.read() # That was pretty easy, right?
__________________________________________________________________________

Pretty straightforward, right?
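To see what urlencode() actually hands to urlopen(), here is a small sketch. The second field and its value are invented for the example, and the guarded import is only there so it runs on Python 3 as well as 2.6:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# A list of pairs keeps the field order predictable.
fields = [("test", "lolwut"), ("msg", "hello world & more")]
encoded = urlencode(fields)
print(encoded)  # test=lolwut&msg=hello+world+%26+more
```

Spaces become "+" and the ampersand inside the value is escaped, so it cannot be mistaken for a separator between fields.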
Let's go on to HTTP Basic Authentication. This is trickier. Here's the skeleton code for opening more advanced things, including HTTP authentication.
CODE :
__________________________________________________________________________
import urllib2

url = "http://example.com"

openerDirective1 = ...
openerDirective2 = ...

opener = urllib2.build_opener(openerDirective1, openerDirective2)

urllib2.install_opener(opener)

website = urllib2.urlopen(url)
__________________________________________________________________________

Okay, that's a lot more complicated. Note the "openerDirective"s. They are handler objects (urllib2 calls them handlers) that change how urlopen deals with a request, for example by answering an authentication challenge or looking after cookies.
You can have numerous handlers, or just the one. You build them into an opener using the build_opener() function, then install it using install_opener(). After that, every urlopen() call goes through the handlers that you have specified.

Let's look at creating an HTTP Basic Authentication handler.

CODE :
__________________________________________________________________________
authDirective = urllib2.HTTPBasicAuthHandler()
realm = "Webmail"
url = "http://example.com/webmail/"
username = "leethaxxer"
password = "letmein"
authDirective.add_password(realm, url, username, password)
__________________________________________________________________________

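Then, we just build the opener and install it like we did in the skeleton code, after which a plain urllib2.urlopen(url) will answer the server's login challenge for that realm with the username and password above. Here: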
CODE :
__________________________________________________________________________
opener = urllib2.build_opener(authDirective)
urllib2.install_opener(opener)
__________________________________________________________________________
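For the curious, the header that Basic Authentication ends up sending is nothing more than "username:password" run through base64. The sketch below builds it by hand with the same made-up credentials, purely to show what goes over the wire; it is not how the handler is implemented internally:

```python
import base64

username = "leethaxxer"
password = "letmein"

# Basic auth joins the two with a colon and base64-encodes the result.
credentials = ("%s:%s" % (username, password)).encode("ascii")
token = base64.b64encode(credentials).decode("ascii")
header = "Authorization: Basic " + token
print(header)

# base64 is an encoding, not encryption: it reverses trivially.
print(base64.b64decode(token))
```

That second print is the reason Basic Authentication over plain HTTP should be treated as sending the password in the clear.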

I plan to write another article soon about cookies in Python, both as part of CGI and as part of requests with urllib2.
Now I will move on to sockets and raw HTTP requests, and include cookies in that.

Socket Programming in Python

Socket programming is a really useful thing to learn - it's a must really, especially if you want to learn about security.

Again, we'll get some code out there straight away:
CODE :
__________________________________________________________________________
import socket
s = socket.socket()

host = "www.example.com"
port = 80
addr = (host, port)

s.connect(addr)
s.send("Something to send..")
print s.recv(1024)
# 1024 is the buffer size, you don't need to worry about it
# much right now.

s.close()
__________________________________________________________________________

There we are. We've created a socket, connected to "www.example.com" on port 80 then sent "Something to send.." and received something back, which has been printed out. Then we closed the socket, which isn't strictly necessary - but good practice.
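A single recv(1024) only returns up to 1024 bytes, so a longer reply needs a loop. Here is a minimal sketch of one. The recv_all() helper is a name I made up, and socket.socketpair() (available on Unix, and on Windows from Python 3.5) stands in for a real server so that the example runs without a network:

```python
import socket


def recv_all(sock, bufsize=1024):
    """Read from sock until the other side closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:          # an empty result means the peer has closed
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Two already-connected sockets: "left" plays the server, "right" the client.
left, right = socket.socketpair()
left.sendall(b"x" * 3000)      # more than one 1024-byte buffer
left.close()

data = recv_all(right)
right.close()
print(len(data))  # 3000
```

Against a real HTTP/1.1 server this loop only finishes once the server closes the connection, so it is worth sending a "Connection: close" header with the request.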

Here's some better stuff to send, however:
CODE :
__________________________________________________________________________
GET /index.html HTTP/1.1\r\n
Host: www.example.com\r\n
\r\n
__________________________________________________________________________

That's a simple HTTP GET request, asking for "index.html". The empty line at the end tells the server that the headers are finished.
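What comes back from a request like this has the same shape: a status line, some headers, an empty line, then the body. Below is a rough sketch of pulling those apart. The response text is written out by hand for illustration and is not what any particular server returns:

```python
# A hypothetical response, typed in by hand for the example.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<p>Hello!</p>"
)

# The first empty line separates the headers from the body.
head, body = raw.split("\r\n\r\n", 1)
lines = head.split("\r\n")

# The status line looks like: HTTP-version, status code, reason phrase.
version, code, reason = lines[0].split(" ", 2)

headers = {}
for line in lines[1:]:
    name, value = line.split(":", 1)
    headers[name.strip().lower()] = value.strip()

print(code)                     # 200
print(headers["content-type"])  # text/html
print(body)                     # <p>Hello!</p>
```

Header names are lower-cased here because HTTP treats them as case-insensitive.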
Here's a POST request:
CODE :
__________________________________________________________________________
POST /index.php HTTP/1.1\r\n
Host: www.example.com\r\n
Content-Type: application/x-www-form-urlencoded\r\n
Content-Length: 11\r\n
\r\n
hello=world
__________________________________________________________________________
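Counting the Content-Length by hand gets old quickly, so here is a sketch of a helper that works it out from the encoded body. build_post_request() is a made-up name rather than a library function, and the Content-Type and Connection headers are my additions (PHP needs the former to fill in $_POST):

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2


def build_post_request(host, path, fields):
    """Return a raw HTTP/1.1 POST request as a string (hypothetical helper)."""
    body = urlencode(fields)
    lines = [
        "POST %s HTTP/1.1" % path,
        "Host: %s" % host,
        "Content-Type: application/x-www-form-urlencoded",
        "Content-Length: %d" % len(body),  # length of the encoded body
        "Connection: close",               # ask the server to hang up afterwards
        "",                                # the empty line that ends the headers
        body,
    ]
    return "\r\n".join(lines)


request = build_post_request("www.example.com", "/index.php", [("hello", "world")])
print(request)
```

On Python 2 the result can go straight into s.send(request); on Python 3 it would need request.encode("ascii") first, since sockets there deal in bytes.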

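Now let's add a cookie to an HTTP GET. Note that the client sends its cookies in a "Cookie" header; "Set-Cookie" is the header the server uses to hand them out:
CODE :
__________________________________________________________________________
GET /index.html HTTP/1.1\r\n
Host: www.example.com\r\n
Cookie: hello=world\r\n
\r\n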
__________________________________________________________________________
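Cookies normally arrive from the server in a Set-Cookie response header and go back to it in a Cookie request header. Here is a small sketch of that round trip using the standard library's SimpleCookie class. The header value is invented for the example:

```python
try:
    from http.cookies import SimpleCookie  # Python 3
except ImportError:
    from Cookie import SimpleCookie        # Python 2

# Hypothetical value, as a server might send it after "Set-Cookie:".
set_cookie_value = "hello=world; Path=/"

jar = SimpleCookie()
jar.load(set_cookie_value)

# Only name=value pairs go back to the server; attributes such as Path do not.
pairs = ["%s=%s" % (name, morsel.value) for name, morsel in jar.items()]
cookie_header = "Cookie: " + "; ".join(pairs)
print(cookie_header)  # Cookie: hello=world
```

The resulting line can be dropped into a raw request along with the other headers, followed by the usual "\r\n".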

There are other socket modes that can be set; this article is only a very basic introduction. I would recommend reading this article if you want to learn more: http://www.amk.ca/python/howto/sockets/

Conclusion

Hopefully this article will help you begin to interact with the Internet using Python. It's just the beginning and I will work on follow-up articles. Good luck and thanks for reading.
dotty.
