Web-scraping with Python

Cochin University of Science and Technology released the results for the CAT examination a few days ago, but when I visited the site I noticed that the results were posted without any secondary authentication for security.

Entry

After seeing this I was tempted to write a script in Python that could scrape student details from the site. So I installed the Live HTTP Headers plugin for Firefox to see the HTTP headers being sent, so that I could send a similar request from Python to extract the data. For demonstration purposes I am going to use my own roll number.

Headers

We can see that the form sends an HTTP POST request to cusatresult13.asp with the parameters rollno and B1.

Now let's write Python code to send that request and see if it works.

import requests

payload = {'rollno': 38323, 'B1': 'Submit'}  # our POST data
url = "http://www.cusatresults.nic.in/cusatresult13.asp"
r = requests.post(url, data=payload)  # send the POST request
print r.content  # print the HTML of the page

I used the requests library to send the POST request to the server. Our script should work as expected, since we're sending the same POST request the browser does, but it doesn't. r.content returns the HTML the server sent back, and that HTML renders as this

render

Uhhh? What the hell? This is probably because we haven't fully "mimicked" the browser. There is also a possibility that the server is checking where the request came from using the "Referer" HTTP header. We need to send headers like the ones our browser sends; thankfully, requests makes that easy.

We can put all the headers in a dictionary and pass it along to requests.post.

headerdata = {
    'Host': 'cusatresults.nic.in',
    'Connection': 'keep-alive',
    'Content-Length': '22',  # header values must be strings; requests normally sets this itself
    'Cache-Control': 'max-age=0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://cusatresults.nic.in',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8'
}

What a long, messy header -_- huh!

r = requests.post(url, data=payload, headers=headerdata)

Now when I do

print r.content

The HTML given out by that statement renders this
render2

Hah, now that we can "log in" to the website, half our work is done. Moving on to the actual scraping part, I am going to use BeautifulSoup for this task. You could use lxml too, but here I am going to stick with BeautifulSoup.

On inspecting the elements we see that our target data (Course/Category and Rank) is stored within a "font" tag with a specific set of attributes used only for that text.

Elements

This detail is going to be useful when we pull out the target text using BeautifulSoup.

from bs4 import BeautifulSoup
import requests

headerdata = {
    'Host': 'cusatresults.nic.in',
    'Connection': 'keep-alive',
    'Content-Length': '22',  # header values must be strings; requests normally sets this itself
    'Cache-Control': 'max-age=0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://cusatresults.nic.in',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8'
}
payload = {'rollno': 38323, 'B1': 'Submit'}  # our POST data
url = "http://www.cusatresults.nic.in/cusatresult13.asp"
r = requests.post(url, data=payload, headers=headerdata)
soup = BeautifulSoup(r.content)  # hand the HTML over to BeautifulSoup
rank = soup.findAll('font', {'face': "Arial, Helvetica, sans-serif", 'size': "2"})
counter = 0
for link in rank:  # traversing the matched tags
    if counter < 2:
        print link.contents[0].strip()
        counter = counter + 1

soup.findAll() searches for a specific tag; we can also filter by the tag's attributes using a dictionary (which I did).

Note that I have used a variable named counter because if the loop runs more than two times it starts printing newlines, probably because of the way the site stores its data.
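A tidier way to take just the first two matches, by the way, is to slice the list that findAll returns, which does away with the counter entirely. Here's a minimal sketch on a made-up HTML fragment (the markup below is invented to resemble the results page, not copied from it):

```python
from bs4 import BeautifulSoup

# invented HTML fragment resembling the structure of the results page
html = '''
<font face="Arial, Helvetica, sans-serif" size="2"> B.Tech </font>
<font face="Arial, Helvetica, sans-serif" size="2"> 1234 </font>
<font face="Arial, Helvetica, sans-serif" size="2">
</font>
'''

soup = BeautifulSoup(html, 'html.parser')
rank = soup.findAll('font', {'face': "Arial, Helvetica, sans-serif", 'size': "2"})
for link in rank[:2]:  # slicing replaces the counter variable
    print(link.contents[0].strip())
```

This prints the first two matches and never reaches the trailing newline-only tag.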

This code prints out the category and rank of the candidate; all that's left is to also print out the candidate's name and roll number.

Again inspecting the elements, we see that the candidate's name is stored in a "font" tag with color "#000000".

tag
Let's extract it with findAll.

names = soup.findAll('font', {'color': "#000000"})
counter = 0
for name in names:
    if counter < 2:
        print name.contents[0]
        counter = counter + 1
    else:
        break

clean up

But that also prints out the HTML inside the "font" tag; let's strip that away using the re library.

import re

names = soup.findAll('font', {'color': "#000000"})
counter = 0
for name in names:
    store = str(name.contents[0])
    if counter < 2:
        result = re.sub("<.*?>", "", store)  # strip any leftover HTML tags
        print result.strip().strip(':')
        counter = counter + 1
    else:
        break
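To see what that re.sub pattern does on its own: the `?` makes `.*` non-greedy, so each `<...>` tag is removed individually instead of everything between the first `<` and the last `>`. A quick standalone check (the sample string is made up to resemble what name.contents[0] can hold):

```python
import re

# made-up fragment resembling a scraped name cell
store = '<b>Candidate Name</b> :'
result = re.sub("<.*?>", "", store)  # removes each <...> tag, keeps the text
clean = result.strip().strip(':').strip()
print(clean)  # -> Candidate Name
```

Without the `?`, the greedy pattern `<.*>` would have swallowed the text between the tags as well.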

Now we are able to extract the roll number, name, rank and category from the website. The full code for our objective is

from bs4 import BeautifulSoup
import requests
import re

headerdata = {
    'Host': 'cusatresults.nic.in',
    'Connection': 'keep-alive',
    'Content-Length': '22',  # header values must be strings; requests normally sets this itself
    'Cache-Control': 'max-age=0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://cusatresults.nic.in',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8'
}
payload = {'rollno': 38323, 'B1': 'Submit'}  # our POST data
url = "http://www.cusatresults.nic.in/cusatresult13.asp"
r = requests.post(url, data=payload, headers=headerdata)
soup = BeautifulSoup(r.content)  # hand the HTML over to BeautifulSoup
rank = soup.findAll('font', {'face': "Arial, Helvetica, sans-serif", 'size': "2"})
counter = 0
for link in rank:  # traversing the matched tags
    if counter < 2:
        print link.contents[0].strip()
        counter = counter + 1
names = soup.findAll('font', {'color': "#000000"})
counter = 0
for name in names:
    store = str(name.contents[0])
    if counter < 2:
        result = re.sub("<.*?>", "", store)  # strip any leftover HTML tags
        print result.strip().strip(':')
        counter = counter + 1
    else:
        break

Now that we've pulled the required data, we can store it in a CSV file if we like. The worst part is that the roll numbers are sequential, so anyone could loop over them and extract the data of every student who took the examination. Unfortunately, the site didn't display any other details, such as scores in specific subjects, which could have been useful for statistical study.