
ISOJ 2012 - Ben Welsh LA Times Data Desk

My name is Ben Welsh. I work on a team called the Data Desk at the Los Angeles Times. And my talk is called "Human-Assisted Reporting: How to create robot reporters in your own image."
I was trained in a tradition of journalism called "computer-assisted reporting." It's decades old. The whole idea is that we can be more efficient and do cooler and better investigative reporting by using computers. This is an idea that's not new. It's been out for a long time, and the name kind of tells you that. My joke always is: everyone now is "computer-assisted." Photographers, architects... Pretty much every job you need to do, you use a computer, but you don't call it a "computer-assisted architect" or a "computer-assisted photographer." It's only in journalism that we continue to distinguish ourselves with the use of Microsoft Excel.
It's a spin on that term that I want to talk about today. This is the whole idea of computer-assisted reporting, and thinking about it differently. All the URLs and everything I talk about will be available at [...]
In my opinion, this is how CAR works today, in an image. Your editor or you, the reporter, have an idea: something out in the world that you want to investigate and get to the bottom of. So you pick up your weapon, your computer, and you go out hunting for it. That's the way most CAR gets done. People already have an idea or a field or a dataset or something that they want to 'hunt', and they go out and look for it.
And the idea that I want to put out there is an alternative metaphor, or way of thinking about what we do, and it's this, which is from the movie 'Minority Report', which I love. It says 'how it ought to be' at the bottom there. In this scene there are these robotic spiders that are able to crawl all over and do an operation on Tom Cruise. And I think if we can up our game in what we do in CAR, we don't have to go out hunting for the story. The computer can go hunt for it for us and bring it back to us.
And where are we while all that's happening? We're back at the bar with Cary Grant, our editor, talking about the next story. But also enjoying a drink. This is a [...] from the movie 'His Girl Friday' which, if you haven't seen it, do yourself a favour.
This idea first came to me when I saw a website created by Matt Waite, who is a leader in our field now, a professor in Nebraska. He's famous for the website PolitiFact, but he made a lot of other sites in St. Pete, and this is one that's gone now, which he did about real estate. It had a page for every neighborhood in the Tampa Bay area. There was a map, which is now dead (this is from the Internet Archive, I had to pry this out), and it had the latest home listings in a list, and then it had this paragraph right there.
I read this paragraph, and it's what Matt calls a madlib. It was an automated paragraph written by an algorithm that said, 'Based on this week's data and this neighborhood, here's the story.' For every neighborhood and every page there was this same paragraph, but it had different information. And it was up or down depending on the trend in that area. I saw that and said, 'My god, that's news!' It's like an automated news story. That inspired me to think about what I was doing in a different way.
At the LA Times I've spent some time over the last few years experimenting with that, and I want to show you how that process works and how you can create algorithms that write the news for you. And also find it. What I'm going to do is teach you how to Dougie.
Here's how you Dougie. One: you find a simple, repetitive, moving data stream that updates every day or with some frequency, like home sales. In this case, this is an email that I receive, along with a list of other people (blacked out here), every morning from the LA Police Department at about 2:30 in the morning. It includes a CSV spreadsheet file that has everyone arrested and booked by the LAPD in the previous day.
So, I have my structured, simple, repetitive, moving data stream that lands in my inbox every day. I then do a pull and a parse, and I put it on loop. So I write a script that looks for that email, looks for that attachment, pulls it in, parses it and loads it into a database. Then I set up a system so that it just runs every day. It's just an automated data pull.
Then I can write code that will ask and answer the common questions that a reporter would ask when they were looking at that same dataset. There's so much of what we do, these questions we ask: what was the biggest, what was the most recent. These sorts of basic journalistic questions we ask of the data really can be turned into algorithms or code when you think about it. These are just some examples that I thought of looking at this dataset. And I can turn it into code.
Is this the first code of the conference? If it is, I'm pretty proud. This is an example of how every day you would want to know what the most severe things were that people got arrested for yesterday. What were the biggest deals? A proxy for that in the data is what their bail amount was set at. The worse the thing you did, the higher your bail, most likely. This is just a little bit of code that every day goes through that spreadsheet, sorts it by bail from highest to lowest, slices it off, and gets you the list of the biggest bails.
What do I do with that? I send it out in an email to all the reporters who cover the police and crime for the LA Times. And we don't just get the list of the biggest bails and send them that. We also keep a watch list. We want to know any time anyone who's a minister or a producer or a musician in their occupation field gets arrested, and that gets flagged. So, instead of a reporter having to comb through this book every day and spend all their time doing it, the computer can automate a lot of that process. It can do it for you and then send you a nice email.
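The sort-by-bail and watch-list steps described here could be sketched roughly as follows. This is a minimal illustration, not the Data Desk's actual code: the column names (`name`, `occupation`, `charge`, `bail`), the sample rows, and the helper names are all assumptions made up for the example, since the talk does not show the real spreadsheet layout.

```python
import csv
import io

# Made-up sample rows with hypothetical column names; the real LAPD
# spreadsheet layout is not given in the talk.
SAMPLE_CSV = """name,occupation,charge,bail
A. Example,Carpenter,Burglary,50000
B. Example,Minister,DUI,15000
C. Example,Unemployed,Murder,1000000
D. Example,Music Producer,Assault,30000
"""

# Occupations to flag, taken from the examples mentioned in the talk.
WATCH_WORDS = ("minister", "producer", "musician")


def load_arrests(text):
    """Parse CSV text into a list of dicts, converting bail to an integer."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["bail"] = int(row["bail"] or 0)
    return rows


def biggest_bails(rows, limit=3):
    """Sort by bail from highest to lowest and slice off the top of the list."""
    return sorted(rows, key=lambda r: r["bail"], reverse=True)[:limit]


def watch_list_hits(rows, words=WATCH_WORDS):
    """Return rows whose occupation field contains any watch-list word."""
    return [r for r in rows if any(w in r["occupation"].lower() for w in words)]


arrests = load_arrests(SAMPLE_CSV)
top = biggest_bails(arrests)
flagged = watch_list_hits(arrests)

for row in top:
    print(f"{row['name']}: {row['charge']}, bail ${row['bail']:,}")
for row in flagged:
    print(f"WATCH LIST: {row['name']} ({row['occupation']})")
```

In a daily job, the two lists would be formatted into the body of the alert email instead of being printed.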
You can see, that's an alert. You could also make a dashboard to drill down. This is an internal webpage we keep at the LA Times that just has everybody arrested yesterday, but it also has some search features, so when someone is arrested for a major crime we can go look for previous arrests, etc. It's a research tool for us to use internally.
This is an example of another piece of code that takes a structured dataset like that and turns it into a sentence. It diagnoses certain things about the data and then writes a madlib. That's a little sentence that tells you something about it. This actually isn't crime; this is a census stat that we did for a neighborhood site we keep. It wrote this sentence, so for every neighborhood in Los Angeles we can write a sentence that tells you, one, the data point that's interesting to you, and some contextual comparison along with it that links to other stuff. That way, I wrote 250 of those by writing it once, with a template.
What do you get out of doing this? One: you get breaking news. This is an example where 'Puck' from 'The Real World' was arrested. Not the biggest deal, but we scooped TMZ and had the news first in the world because the alert system caught it. We're watching closely.
Two: it's a way around press and information offices. One of the biggest crimes in Los Angeles last year was on opening day of the Dodgers season. A man, a San Francisco Giants fan, was brutally beaten, and he quickly became a symbol for the decay of the Dodgers organization and our former team owner, Frank McCourt. And police arrested the wrong guy the first time. They screwed it up. When they arrested the first guy there was a big press conference: 'Everybody's got to know, we got the guy.' Well, it turns out it wasn't the guy, and when they found the people who really did it they tried to hide who they were arresting.
They didn't want to tell the media right away who they had that was going to get busted, but I had the data. I didn't have to ask the PIO. It was in my system, it arrived at two in the morning, and we were the first reporters knocking on those neighbors' doors, figuring out who these guys were. Because the system got us that step ahead.
You also get instant analysis. Occupy LA was camped out across from the LA Times for three months. There was a three-day standoff with the police, where they came and cracked down and rolled everybody out. When they did the big arrests of a couple hundred people, we were able instantly to do a census of all the people who were arrested and tell you something about them using this data. We actually published a list of all the people who were arrested and some other things about them that were in there.
You get automated copy. This is from our blog 'The Homicide Report', where we try to track every homicide that happens in LA County. We have a post for every person. We don't have enough resources to do a lot of reporting on all of them, but certain amounts of information based on the coroner's data can be automated, and then we write that. The bare minimum for every post is the automated paragraph. As we gather more information, we then write through that and add more to it.
This is a similar thing we do. This is an automated blog post written by the computer that runs a couple of times a week when we get new LAPD crime data. It analyzes it for trends and tells you which neighborhoods in LA have most recently had an uptick in crime compared with their history. Here's another thing from our crime site where it's all automated.
Same thing with earthquakes; my colleague Ken Schwencke did this. When an earthquake happens, everybody is going to the USGS site, copying and pasting, where's the link, fuck, I can't find it, where is it, AAAAH. We don't have to do that. It's structured data!
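As a minimal sketch of the 'madlib' approach mentioned in the talk, a sentence template can be filled in from structured counts, with the wording switching between up and down depending on the trend. The function name, the template wording, and the sample numbers below are all hypothetical illustrations, not the LA Times' actual templates or data.

```python
def trend_sentence(neighborhood, this_week, last_week):
    """Fill a fixed sentence template from two weekly counts."""
    change = this_week - last_week
    if change > 0:
        trend = f"up {change} from"
    elif change < 0:
        trend = f"down {-change} from"
    else:
        trend = "unchanged from"
    return (
        f"{neighborhood} had {this_week} reported crimes this week, "
        f"{trend} {last_week} the week before."
    )


# Made-up counts for placeholder neighborhoods, for illustration only.
weekly_counts = {
    "Neighborhood A": (12, 9),
    "Neighborhood B": (4, 7),
    "Neighborhood C": (5, 5),
}

sentences = [
    trend_sentence(name, this_week, last_week)
    for name, (this_week, last_week) in weekly_counts.items()
]
for sentence in sentences:
    print(sentence)
```

Written once, the same template produces a sentence for every neighborhood in the data, which is the point of the technique.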
We have a computer system that automatically writes a blog post and sends it in as soon as it happens. There's a lot of stuff you can do, which should be fun.
There are companies that are trying to make money off it, like Narrative Science. They do some awesome stuff. I'm just making news, they're trying to make money. Who's smarter?
I'm out of juice. I had a big finish to try to make a more serious point and talk a little trash and say they won't make it, but the next slide was a quote that Narrative Science delivered to the NY Times, where he said, in full visionary startup 'I am the future' mode, 'Within five years a computer program will win the Pulitzer Prize, and I'll be damned if it's not my software.' I hate to break it to him, but computer programs have already won the Pulitzer Prize. They've won a half dozen of them, starting in 1989 with Bill Dedman's story in Atlanta. 'The Color of Money', that's right.
My point is, what we should really strive for is not to automate things for automation's sake or to save money. What would really be great is if we could automate and make it easier and lower the barrier to doing the kind of work that wins the Pulitzer Prize. That's already been done by people who came before us, and we need to be respectful and see what's in that tradition that's worth saving and worth automating and worth making more efficient, rather than just throwing it out and acting like it doesn't exist.
Because I work in an old-line media institution, I complain about it every day. It drives me nuts. But there's also a hubris in the startup community around this idea that they don't need to learn anything from the past. That there's nothing worth saving about journalism. There's a lot worth saving and there's a lot worth doing. And we're the people who are going to do it. I'm getting emotional. OK, some old newspaper guy didn't like your blog. Get over it. Write a story that's worth reading. Write a story that's worth Brian's mom reading. Let's fucking do it.
I'm done.

Video Details

Duration: 12 minutes and 20 seconds
Country: United States
Language: English
Views: 684
Posted by: lndata on May 29, 2012

Human Assisted Reporting
