Conference

Web scraping

Many websites contain information that can be of great value to you as a journalist. But this information may change every day, which makes it impossible to keep track of by hand. That’s why we let the computer do the work for us. This is called web scraping.

At NRC Handelsblad, I wrote a script that scrapes a website where real estate brokers publish information about houses that are for sale. I scrape this site every week, which gives me a unique insight into the development of this market in turmoil. I put all the information in a database that now contains information about almost 250,000 houses.

And last year, during the takeover of ABN Amro bank by Fortis, we wanted to make a map of all the branches of these two banks in the Netherlands. Because neither company wanted to give us all the addresses, I used a script to scrape the Dutch Yellow Pages and ended up with a long list that I could easily put on a map.

In this course, you will learn to do this yourself. You will learn to program your own web scraping machine. A computer program? This probably isn’t something you have ever done before, but it isn’t very difficult. Perhaps only slightly difficult.

There are several programming languages you can use for this job, but in this course we use Perl, a language that some people call the duct tape of the internet. In Perl, you are going to write a script that can go to a web address, fill out a search form and save the results you are looking for on your hard disk.
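
To give an idea of what such a script can look like, here is a minimal sketch using WWW::Mechanize. It is not the script from the session. The sub name scrape_search, the field name 'q' and the choice of the first form on the page are assumptions made for illustration, and the example address in the comment is a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Fetch a page, fill out a search form on it, and save the results to disk.
# The sub name and its three arguments are hypothetical, chosen for this sketch.
sub scrape_search {
    my ($url, $query, $outfile) = @_;

    # autocheck => 1 makes Mechanize die as soon as a request fails
    my $mech = WWW::Mechanize->new(autocheck => 1);

    # Step 1: go to the web address
    $mech->get($url);

    # Step 2: fill out and submit the search form.
    # Assumption: the search form is the first form on the page and its
    # text field is named 'q'. Look up the real names for your own site.
    $mech->submit_form(
        form_number => 1,
        fields      => { q => $query },
    );

    # Step 3: save the results page on the hard disk
    $mech->save_content($outfile);
    return $outfile;
}

# Only runs when the script is called with three arguments, for example:
#   perl scrape.pl http://www.example.com/search houses results.html
scrape_search(@ARGV) if @ARGV == 3;
```

A real scraper would usually go one step further and pull the addresses or prices out of the saved page, but the three steps above are the skeleton of most scripts of this kind.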

At the end of the session, you might have written your first ever computer program. But the story doesn’t end there. I never said programming wasn’t difficult at all. In the future, you will have to write a new program for every new task you come up with. But after this course, you won’t be afraid to do so. And after writing several scripts, you’ll notice that it takes you only about an hour to write a new one.

During this hands-on session, we use the Firefox 3 web browser with the add-on ‘Links and forms’. The script is written in Perl, using the ActiveState Komodo debugger. Besides Perl, we also need the WWW::Mechanize module.

Trainer: Arlen Poort

When? Friday, November 21st, 3:45 PM
Where? Erasmushogeschool, Room 3.01
