Webscraping

Many websites contain information that can be of great value to you as a journalist. But this information may change every day, which makes it impractical to keep track of by hand. That is why we let the computer do the work for us. This is called webscraping.

At NRC Handelsblad, I have written a script that scrapes a website where real estate brokers publish information about houses that are for sale. I scrape this site every week, which gives me a unique insight into the development of this market in turmoil. I put all the information in a database that now contains information about almost 250,000 houses.

And last year, during the takeover of ABN Amro bank by Fortis, we wanted to make a map of all the branches of these two banks in the Netherlands. Because neither company wanted to give us all the addresses, I used a script to scrape the Dutch Yellow Pages and ended up with a long list that I could easily put on a map.

In this course, you will learn to do this yourself. You will learn to program your own webscraping machine. A computer program? You have probably never written one before, but it isn’t very difficult. Perhaps only slightly difficult.

There are several programming languages you can use for this job, but in this course we use Perl, a language that some people call the duct tape of the internet. In Perl, you are going to write a script that can open a web address, fill out a search form and save the results you are looking for on your hard disk.

At the end of the session, you might have written your first ever computer program. But the story doesn’t end there. I never said programming wasn’t difficult at all. For every new task you come up with, you will have to write a new program. But after this course, you won’t be afraid to do so. And after writing several scripts, you’ll notice that it takes you only about an hour to write a new one.

During this hands-on session, we use the Firefox 3 web browser with the add-on ‘Links and forms’. The script is written in Perl, using the ActiveState Komodo debugger. Besides Perl, we also need the package WWW::Mechanize.
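
To give a rough idea of what such a script can look like, here is a minimal sketch that uses WWW::Mechanize to open a page, fill out the first form on it and save the result to disk. It is not the script used in the session. The web address, the form field name and the search term are passed on the command line and are placeholders. The helper output_filename and the choice of form number 1 are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a search term into a safe file name,
# so that every query gets its own file on disk.
sub output_filename {
    my ($query) = @_;
    ( my $safe = lc $query ) =~ s/[^a-z0-9]+/_/g;   # anything odd becomes "_"
    $safe =~ s/^_+|_+$//g;                          # trim leading/trailing "_"
    return "results_$safe.html";
}

# Open the page, fill out the search form, save the results page.
sub scrape {
    my ( $url, $field, $query ) = @_;

    # Loaded at run time, only when we actually go online.
    require WWW::Mechanize;

    # autocheck => 1 makes the script die with a message on HTTP errors.
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    # Step 1: open the web address.
    $mech->get($url);

    # Step 2: fill out and submit the form.
    # Assumption: the search form is the first form on the page.
    $mech->submit_form(
        form_number => 1,
        fields      => { $field => $query },
    );

    # Step 3: save the page that comes back on the hard disk.
    my $file = output_filename($query);
    $mech->save_content($file);
    return $file;
}

# Only go online when all three arguments are given.
if ( @ARGV == 3 ) {
    my $file = scrape(@ARGV);
    print "Saved results to $file\n";
}
```

You could run it as, for example, perl scrape.pl http://www.example.com/search q Amsterdam, where both the address and the field name q are hypothetical. For a real site, look up the name of the search field in the page's form before you run the script.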

Trainer: Arlen Poort

When? Friday November 21st, 3.45 PM
Where? Erasmushogeschool, Room 3.01
