Cochin University of Science and Technology released the results of its CAT examination a few days ago, but when I visited the site I noticed that the results were posted without any secondary authentication for security.
After seeing this I was tempted to write a Python script that could scrape student details from the site. So I installed the Live HTTP Headers plugin for Firefox to watch the HTTP headers going through, so that I could send a similar request from Python and extract the data. For demonstration purposes I am going to use my own roll number.
We can see that it’s sending an HTTP POST request to cusatresult13.asp with the parameters rollno and B1.
Now let’s write Python code to send that request and see if it works.
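Here is a minimal sketch of that request using the requests library. The host name is a hypothetical placeholder (only the page name cusatresult13.asp is known here), and both the roll number and the "Submit" value for B1 are assumptions made for illustration.

```python
import requests

# Hypothetical placeholder host: only the page name (cusatresult13.asp)
# is known, so this is NOT the real address.
RESULT_URL = "http://results.example.invalid/cusatresult13.asp"


def build_payload(roll_no):
    """Build the two form fields seen in Live HTTP Headers: rollno and B1."""
    # "Submit" as the value of B1 (the form button) is an assumption.
    return {"rollno": str(roll_no), "B1": "Submit"}


def fetch_result(roll_no):
    """Send the same kind of POST request the browser sends (network call)."""
    return requests.post(RESULT_URL, data=build_payload(roll_no), timeout=10)


# Usage (not executed here because it needs the real URL):
#   r = fetch_result("12345")   # placeholder roll number
#   print(r.content)            # raw HTML sent back by the server
```

Passing a dictionary as `data` makes requests form-encode the fields, which is how a browser submits an HTML form.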
I used the requests library to send the POST request to the server. Our script should work as expected, because we’re sending the same POST request as the one sent by the browser, but it doesn’t. r.content holds the HTML the server sent back to us, and when rendered it is not the result page we wanted.
Uhhh? What the hell? This is possibly because we haven’t fully “mimicked” a browser. There is also a possibility that the server is checking whether the request came from its own page by looking at the “Referer” field in the HTTP headers. We need to send the server headers like the ones sent by our browser, and thankfully with requests we can easily do that.
We can write all the headers in a dictionary and send it along with requests.post
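A sketch of what that could look like is below. The header values are illustrative stand-ins, not the ones actually captured; the real ones should be copied from Live HTTP Headers. The host and the Referer value are hypothetical placeholders as well.

```python
import requests

# Hypothetical placeholder host, not the real address.
RESULT_URL = "http://results.example.invalid/cusatresult13.asp"

# Illustrative values only: replace them with the headers your own
# browser sends, as shown by Live HTTP Headers.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    # Assumed: the page that hosts the result form.
    "Referer": "http://results.example.invalid/",
    "Content-Type": "application/x-www-form-urlencoded",
}


def fetch_result(roll_no):
    """POST the form fields together with browser-like headers (network call)."""
    payload = {"rollno": str(roll_no), "B1": "Submit"}  # B1 value is assumed
    return requests.post(RESULT_URL, data=payload, headers=HEADERS, timeout=10)
```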
What a long messy header -_- huh !
Now when I send the same POST request with these headers attached, the HTML that comes back in r.content renders the actual result page.
Hah, now that we can “login” to the website, half our work is done. Moving on to the actual scraping part, I am going to use BeautifulSoup for this task. You can use lxml for this too, but here I am going to stick with BeautifulSoup.
On inspecting the elements we see that our target data (Course/Category and Rank) is stored within a “font” tag with a specific attribute that is used for that text only.
This detail is going to be useful when we pull out the target text using BeautifulSoup.
soup.findAll() searches for a specific tag. We can also specify attributes for that tag using a dictionary (which I did).
Note that I used a variable named counter: if I let the loop run more than 2 times it tends to print out newlines, probably because the site has stored the data that way.
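The loop described above can be sketched like this. The sample HTML and the `color="#0000FF"` attribute are made up for illustration, so swap in whatever attribute the real page uses. `find_all` is the newer name for `findAll` in BeautifulSoup 4.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for the server response. The third <font>
# tag holds only a newline, imitating the stray entries on the page.
SAMPLE_HTML = """
<table>
  <tr><td><font color="#0000FF">B.Tech</font></td></tr>
  <tr><td><font color="#0000FF">1234</font></td></tr>
  <tr><td><font color="#0000FF">
</font></td></tr>
</table>
"""


def category_and_rank(html):
    """Return the text of the first two matching <font> tags."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    counter = 0
    # The attribute dictionary is an assumption about the page markup.
    for tag in soup.find_all("font", {"color": "#0000FF"}):
        if counter == 2:  # stop before the blank entries
            break
        values.append(tag.get_text().strip())
        counter += 1
    return values


print(category_and_rank(SAMPLE_HTML))
```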
This code prints out the category and rank of the candidate. All that’s left is to also print out the name of the candidate and the roll number.
Inspecting the elements again, we see that the candidate name is stored in a “font” tag with color “#000000”.
Let’s extract it with findAll,
but that also prints out the HTML inside the “font” tag, so let’s strip that away using the re library.
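Here is a small sketch of both steps on a made-up snippet: first grabbing the “font” tag by its color, then removing the leftover markup with a regular expression. The sample HTML and the `<b>` tag inside it are assumptions; the regex is a simple one that is good enough for tidy markup like this, not a general HTML parser.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample: the name sits inside extra markup within the <font> tag.
SAMPLE_HTML = '<p><font color="#000000"><b>JOHN DOE</b></font></p>'

# Matches anything that looks like an HTML tag, e.g. <b> or </font>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Remove HTML tags from a string and trim surrounding whitespace."""
    return TAG_RE.sub("", markup).strip()


def candidate_names(html):
    """Find <font color="#000000"> tags and return their text without markup."""
    soup = BeautifulSoup(html, "html.parser")
    return [strip_tags(str(tag)) for tag in soup.find_all("font", {"color": "#000000"})]


print(candidate_names(SAMPLE_HTML))
```

BeautifulSoup’s own `get_text()` would also return the text without tags, so the regex step is a matter of taste.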
Now we are able to extract the roll number, name, rank and category from the website. The full code for our objective is
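What follows is a reconstruction sketch that ties the steps together, not the original script. The host, the header values, the B1 value, the `#0000FF` attribute and the sample HTML are all assumptions; only the page name, the form field names and the `#000000` color come from the steps above.

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical host; only the page name cusatresult13.asp is known.
RESULT_URL = "http://results.example.invalid/cusatresult13.asp"
HEADERS = {"User-Agent": "Mozilla/5.0", "Referer": "http://results.example.invalid/"}
DETAIL_ATTRS = {"color": "#0000FF"}  # assumed attribute for category and rank
NAME_ATTRS = {"color": "#000000"}    # color of the candidate name


def fetch_html(roll_no):
    """POST the roll number the way the browser does (network call)."""
    payload = {"rollno": str(roll_no), "B1": "Submit"}  # B1 value is assumed
    r = requests.post(RESULT_URL, data=payload, headers=HEADERS, timeout=10)
    return r.text


def strip_tags(markup):
    """Remove HTML tags from a string."""
    return re.sub(r"<[^>]+>", "", markup).strip()


def parse_result(html, roll_no):
    """Pull name, category and rank out of a result page."""
    soup = BeautifulSoup(html, "html.parser")
    names = [strip_tags(str(t)) for t in soup.find_all("font", NAME_ATTRS)]
    # Keep only the first two matches; later ones are blank on the page.
    details = [t.get_text().strip() for t in soup.find_all("font", DETAIL_ATTRS)][:2]
    return {
        "roll_no": str(roll_no),
        "name": names[0] if names else "",
        "category": details[0] if len(details) > 0 else "",
        "rank": details[1] if len(details) > 1 else "",
    }


# Made-up page used instead of a live request.
SAMPLE_HTML = """
<font color="#000000"><b>JANE DOE</b></font>
<font color="#0000FF">B.Tech</font>
<font color="#0000FF">42</font>
<font color="#0000FF">
</font>
"""

print(parse_result(SAMPLE_HTML, "12345"))
```

With the real URL and headers filled in, `parse_result(fetch_html(roll_no), roll_no)` would do the same thing against the live page.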
Now that we’ve pulled the required data, we can store it in a CSV file if we like. The worst part is that the roll numbers are sequential, so anyone can use a loop and extract the data of every student who took part in the examination. Unfortunately the site didn’t display any other details, like scores in specific subjects, which could have been useful for a statistical study.
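For the CSV part, a minimal sketch with Python’s built-in csv module could look like this. The record below is made-up placeholder data, and the column names are just the ones I picked for illustration.

```python
import csv
import io

FIELDS = ["roll_no", "name", "category", "rank"]


def write_results(records, fileobj):
    """Write a list of result dictionaries to an open file as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Made-up placeholder record, not real data.
records = [
    {"roll_no": "12345", "name": "JANE DOE", "category": "B.Tech", "rank": "42"},
]

# An in-memory buffer keeps the example self-contained; for a real file use
#   with open("results.csv", "w", newline="") as f: write_results(records, f)
buffer = io.StringIO()
write_results(records, buffer)
print(buffer.getvalue())
```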