Effective Data Visualization

[David Giard] Take a look at these 4 series of numbers. Who can tell me what the relationship is between each number in the series? Or between the series themselves? I did not expect an answer, actually; it is really difficult to look at numbers in this format and glean any useful information from them. However, if we plot these numbers as a set of scatter diagrams, then the answer jumps out at us; we see very clearly there is a correlation between X and Y in each series. And if we are given another X we could, with relative accuracy, predict the next Y, some with greater accuracy than others. We can even see that series 3 and 4 have a couple of outliers that we could either decide to throw away as bad data or maybe investigate further, to find out why we are getting different information for those points than for the rest of them. The point is that very often numbers shown as pictures provide a lot more information a lot more quickly than just the raw numbers themselves. And the more numbers you have, the more true that tends to be. I am a software developer by trade, and a lot of what I do is presenting data to users. Sometimes a picture is a great way to do that. And I have got a lot of great tools at my disposal. This talk is not about tools. This is about getting you to think about what good data visualization is when you are using those tools. But the tools themselves are pretty simple: SQL Server Reporting Services, Microsoft Excel, Crystal Reports. They are all good tools. You select a template to display your data. You select some data source. You click next, next, finish, and bang. You have a pie chart. But what if that pie chart is not the best way to display the data you want to show? Or, a more fundamental question: what do I mean when I say the best way to display data? This is something that a lot of developers do not think about. They click next, next, finish; they have something to show the user, and then they are on to the next problem.
And so my goal today over the next 70 minutes or so is to get you to think about that, to pause after you click next, next, finish and say, "Is this the best way to display that? Can I improve this data visualization?" So data visualization, the process of turning numbers into pictures, is something I want you to think about more. Now one man that has been thinking about this a lot is Dr. Edward Tufte. Dr. Tufte is a professor at Yale University, and for the last 2 decades he has been writing about data visualizations. Much of the material you will see today is ideas borrowed from his writings, a lot of it from this book, The Visual Display of Quantitative Information. And when Dr. Tufte first started studying data visualization he started by looking at pictures of data that spoke to him and at what really made them effective in communicating messages to him. And not all of those were computer generated pictures. This one, for example, is from the 19th century. It is a train schedule; if you look at the top you will see the times of the day. It starts at 6:00 am and goes to noon, midnight, and back to 6:00 am: 24 hours along the top. And along the left side are the train stations in France. And these diagonal line segments that you see represent the routes that a train takes from one station to the next. The left end point of each line segment is the location and time of a departure, and the right end point of a line segment is the location and time of a destination. So we can follow one of these routes, for example this one, and see that there is a train that leaves Paris at 12:20 in the afternoon, and it arrives in Tonnerre at 6:30 in the evening. It is pretty straightforward. It is very simple. It is not difficult to read; one could argue that the airline schedule makers could learn something from this chart. Here is another good chart, also from the 19th century.
The reason I like this one is because it shows several series of data along the same axis. Because they are along the same axis we can compare them to see if there is any correlation between these data series. And if there is correlation maybe we can come up with hypotheses as to why that correlation exists. The red line at the bottom of this is the average wage paid to a mechanic in England during each one of these decades between the mid-16th century and the mid-19th century. The bar chart in the middle is the average price of a bushel of wheat during that same time period. And you can see that the price of wheat and labor wages kind of follow the same path, although the price of wheat is much more volatile than wages. And an answer as to why that is true might lie in the third series at the very top of this picture, which shows who the reigning monarchs were during each of these periods or, more likely, what wars were being waged by those monarchs. But we can start to ask those questions because we see everything along the same axis here; in this case it is time. This is geographic data; this is a lot of geographic data. This is from a 1960 census, and of course maps are great for displaying geographic data. But census data is a lot of points, millions of data points. This is broken up by county; every one of these lines is either a county or a parish or some county-like entity in the U.S. So even in aggregation there are still over 3,000 of these county-like entities in the U.S. This shows the percentage of high income families and low income families in each county. The green graph at the bottom shows the high income families. The red graph at the top shows the low income families. And darker shaded counties represent a higher concentration. Lighter shaded counties represent a lower concentration.
So you can see that darker shades of red in the top graph tell me that in south Texas and the southeastern United States there were a lot of poor families in 1960. There were a lot of richer counties, with higher income families, in California at that same time. I can look at the state of Michigan (I am actually from southeastern Michigan, right there), and you can see that Michigan was doing pretty well, or at least southeastern Michigan was doing pretty well, in 1960. That map actually might look very different for that part of the country if it were drawn today. I can also see counties in Alaska that have a high percentage of high income families and a high percentage of low income families in the same county. It seems like there is no middle class in those counties, so it leads to more questions, and that is a lot of information that I can get very quickly. I really like the fact that this is using shades of color rather than distinct colors to indicate high concentration versus low concentration. Our mind does this very well; it can very easily translate a progression of shades, whereas we cannot really easily translate "this red is low and green is high" or something like that. We do not have to think about it. We know dark/light. It just makes sense to us. And then there is this graph; Edward Tufte called this the greatest data visualization ever created. This was created by a man named Charles Minard. It is also from the 19th century, and this is actually a map, a map that represents the Emperor Napoleon's advance on and retreat from Moscow near the end of the Napoleonic wars. On the far western edge of this map is the Russian/Polish border. And on the far eastern edge is the city of Moscow.
These broad, jagged lines that go across the map in the middle (one of them is tan and the one below it is black) represent the route that Napoleon's men took as they marched eastward toward Moscow, which is the tan line, and westward away from Moscow, which is the black line below it. The thickness of those 2 lines is proportional to the number of troops at each point. Knowing just that, you can look at this map and very quickly see that this campaign was a disaster for Napoleon. Napoleon entered Russia with almost half a million men, expecting to overcome his enemy with superior numbers and live off the land, which is what he had done when he conquered the rest of Europe. But the Russians did not engage them directly in battle. Instead they retreated, and as they retreated they destroyed their livestock and their farms and their homes and their towns. They burned everything. And the scorched earth that they left in their path stretched Napoleon's supply lines so thin that he lost far more men to disease, starvation, desertion, and suicide than he ever did to Russian bullets. That half million men, by the time they got to Moscow a few months later, had dwindled to 100,000. And that 80% attrition is very clearly seen in that tan line as it gets narrower and narrower as it heads eastward across the map. The way back was even worse. On the way back that 100,000 men had dwindled to barely 10,000 when they left Russia, and for a clue as to why that happened we can look at this line down here at the bottom (thank you), which shows us the temperature during his return trip. And this was a brutally cold winter. Temperatures never rose above freezing. And, in fact, at one point on December 8, right there, the temperature dropped to -30 degrees. Now that could be -30 degrees Fahrenheit; it could be -30 Celsius. It does not matter: at -30 degrees, it was brutally cold.
And it resulted in 90% attrition clearly visible as that black, jagged line moves across—westward across the map. We can even see where major points of attrition took place. So I think you can see why Edward Tufte liked this map so much. On a single axis Charles Minard has managed to show the number of troops, position of troops, direction of movement, time, and temperature. And because these are all on the same axis we can start to compare them and correlate them to come up with hypotheses as to why these things are correlated. So that is the good. Here are a few things to take away from that section of it. You notice that there is a simplicity to all the pictures that I showed you. They are not cluttered up with a lot of extra stuff. They are relatively easy to understand; in 60 seconds at most I could explain a graph, and you could understand how it worked. This idea of using common axis for comparisons is very powerful. Maps—of course—are good for geographic data; you probably knew that part already. But think about shades versus colors; shades are a lot better for showing gradations of data and numbers much better than using colors. So that is the good—what is the bad? What can you do that is wrong with data visualization? Well one thing you could do is you can lie to your users. And lying to users can take a lot of forms. Some people will lie deliberately. They have a goal, they have an agenda, and they want to deliberately mislead their users. This is from Fox News. This is a graph showing the unemployment rate each month during 2011—the year before the last presidential election in the United States. And it looks fine—you see it kind of goes up and down and so on until you look at that last number—that November number. And you realize that it is on the same level as this number here. 8.6 is represented the same level as 9.0—kind of misleading. It is actually above this number; 8.6 is drawn higher than 8.8. 
So, as I said before, it is a lot easier to look at pictures of data than it is to look at the data itself. Most people are only going to look at the pictures, and the picture is misleading; it is lying to us. Here is another example from an annual report of a mining company, a now defunct mining company. And it shows net income for 5 different years. Each year is represented by this tall bar. Everything looks fine until you look very closely at this first number here and realize that it is a negative number. And that is not at all obvious because whoever created this visualization decided to represent that negative number as a very tall bar, and they decided not to label zero above it. So I look at something like this and I cannot really think of another reason why they would choose to represent this data in this way other than to deliberately mislead the viewer. And the problem with that is that I look at it and I start to question every picture in this document, every number in this document. I start to become distrustful of what they are trying to tell me, which is probably the opposite of the effect that they were trying to achieve by representing the data this way. So I cannot prove it, but I think the last couple of examples were deliberately meant to mislead; there are people that have an agenda. They want to present something in the perfect way, or they want their side presented better than the other side. Not all lies are deliberate though. Sometimes people lie without knowing they are lying. Here is an example that I do not think is deliberate. It is from the New York Times, a relatively reputable news source. This is a visualization trying to show the mandated fuel economy for American automobiles during the '70s and '80s. A little history lesson: there was an energy crisis in this country in the mid-70s.
We suddenly realized we should probably stop making and buying these huge energy inefficient cars, and we should start making, selling, and buying much more fuel efficient cars. The U.S. government stepped in, and they said, "Well, car companies, add up the average miles per gallon of every car you sell, divide it by the number of cars, and that number should be at or above X, and if not there will be some penalties," which was pretty harsh. But the good news for the auto companies was they did not implement it all at once. They said, "All right, in 1978 that number has to be 18 miles per gallon, and by 1985 it will increase to 27-1/2." That is what this graph is trying to show. That number goes up 18, 19, 20, 22 and so on, and this picture of a road with its horizontal lines gets bigger and bigger and bigger. The real problem with this picture is that although this number goes up from 18 to 27-1/2, which is quite a bit, the line that represents the number goes up a lot more than that. It gets a lot bigger. It is not growing proportionally to the data. And one could argue that this picture does show the direction of change. The lines get bigger; the numbers get bigger. That is true. So what if the magnitude of change is misrepresented by a little bit? But, I mean, is it a little bit? Is it a lot? In fact, how much is it off? How do we even measure that? Well, this is where Edward Tufte stepped in. He invented something called the lie factor to measure how much a graphic is lying to us. And the lie factor is simply the size of the effect shown in the graphic divided by the size of the effect in the data. If it is telling the truth it should be 1. They should both grow proportionately. In this particular example the data, from 18 to 27-1/2, grew by 53%. But those lines representing the data grew by 783%. This has a lie factor of 14.8. So the magnitude is not off by a little bit; it is off by a lot. This is lying to us 15 times over.
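The lie factor arithmetic for the fuel economy graphic can be checked in a few lines. This is only a sketch: the function names are made up for illustration, and the 783% growth of the drawn lines is taken as quoted in the talk rather than measured from the graphic.

```python
def relative_change(first, last):
    """Fractional change from first to last (0.5 means +50%)."""
    return (last - first) / first


def lie_factor(graphic_change, data_change):
    """Size of the effect shown in the graphic divided by the effect in the data."""
    return graphic_change / data_change


# Mandated fuel economy: 18 mpg in 1978, 27.5 mpg in 1985 (figures from the talk).
data_change = relative_change(18.0, 27.5)

# Growth of the lines in the graphic, as quoted in the talk (783%).
graphic_change = 7.83

factor = lie_factor(graphic_change, data_change)
print(f"data change: {data_change:.0%}")
print(f"lie factor:  {factor:.1f}")
```

A truthful graphic would give a value close to 1, because the picture and the numbers would grow at the same rate.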
And I said before that my opinion is that it was not deliberate, that they did not have an agenda. I mean, it is okay if people have one; it does not bother me. Fox News has an agenda. They are biased. As long as I recognize that I am okay with that. Complaining about Fox News having a bias is like complaining The Daily Show has a bias. They do. Deal with it. Do not worry about it. Just be aware of it. But this is a little more subtle. Can anyone think of a reason, besides just deliberate misleading, why this might be so wrong? Throw out some ideas. [class answering] Level of difficulty. So it was just too hard to represent it as a...? [class answering] Entertaining what? [class answering] So I do not know that the level of difficulty of obtaining gas mileage is relevant here, because this is not the car companies that are displaying this; this is a newspaper that is doing it, so they do not even have to build cars. The perspective is true; they are trying to draw a picture of a road that is kind of fading off into the horizon, which makes it get bigger as it gets closer. And that is correct, so you are part way there with that, I think, at least in my opinion. My feeling is that what happened here is that the New York Times, like a lot of publications in this country, at one point stopped worrying as much about the accuracy of the visualization and started worrying more about the prettiness of it. They moved the responsibility of creating this graphic away from the journalists, the automotive engineers, and the columnists, the people that understood that data, and they moved it to the art department because they wanted to make this thing pretty, because they figured that you, the audience, would not really read it or look at it unless it was entertaining. I do not agree with that, but I think that happens a lot, at least in this country.
And there are a lot of people from other countries, and maybe it goes differently somewhere else. But this happens a lot here. So, in my opinion, it was not done deliberately, but it is still a lie. It is still telling an untruth because nobody stopped to look at this and say, "You know what? This implies a different story than reality." Here is the reality; this is an accurate representation of how much those values went up. They went up, but not nearly at the rate of this 15 times widening of the road. Here is another example, also probably not intentional. This is the percentage of doctors dedicated to family practice. There used to be a lot of doctors that were generalists. They would treat your whole family for just about anything that ails you. Doctors tend to be more specialized today than they used to be, and that is what this graph represents here: in 1984, 27% of doctors were dedicated to family practice, and in 1990 only 12% were dedicated to family practice. And it is shown by this picture of a doctor who is about twice as tall on the left as the doctor on the right, because 27% is about twice as much as 12, a little bit more than twice as much as 12. The problem with this visualization is that that doctor is a 2-dimensional picture. And, in fact, this picture varies by both its height and its width. And the area of a 2-dimensional object, as you all learned in ninth grade mathematics, is proportional to both its height and its width. So the size of this doctor on the left is not twice as big as the size of the doctor on the right. It is much, much bigger than that. So in this case the data change (I am going from right to left here because it is just easier to work with the numbers) is 125%, but those pictures change by 406%.
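The 125% and 406% figures for the doctor graphic can be reproduced with a short calculation. It is a sketch under one assumption that the talk implies but does not state outright: the picture is scaled by the data ratio in both height and width, so its area grows with the square of that ratio.

```python
def relative_change(first, last):
    """Fractional change from first to last (1.25 means +125%)."""
    return (last - first) / first


# Share of doctors in family practice, read right to left as in the talk.
small, large = 12.0, 27.0

data_change = relative_change(small, large)

# Assumption: height and width are both scaled by large / small,
# so the area of the picture scales by the square of that ratio.
scale = large / small
area_change = relative_change(1.0, scale ** 2)

lie_factor = area_change / data_change

print(f"data change: {data_change:.0%}")
print(f"area change: {area_change:.0%}")
print(f"lie factor:  {lie_factor:.2f}")
```

Under the same assumption, a picture scaled in three dimensions would grow with the cube of the ratio, which is why volume metaphors exaggerate even more.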
So—again—this is lying to us maybe unintentionally, but still— almost 4 times over it is lying to us because they have chosen to represent 1-dimensional data points as a 2-dimensional object. It is not necessarily wrong to use a 2-dimensional object. I mean—bar charts are 2-dimensional objects, but bar charts— the points—the widths of every bar does not vary. In this case they are variable—the width and the height. The problem is even worse if you choose a 3-dimensional object for your data points, so in this case here they have got actually a 3-dimensional barrel—a picture of a barrel—representing the price of gas as it goes up. Well—that price of gas did go up quite a bit— 454% between the beginning and the end of this graph, but that barrel—if you took this metaphor that they are presenting to us— the volume of gas that would fit in that barrel increased by 27,000%. As far as I know that is a record lie factor. Here is a more accurate representation of that data. This one—not only does it show the price going up at a more reasonable rate but also it shows this is both nominal dollars and real dollars. Real dollars are dollars that are adjusted for inflation. And whenever you are dealing with financial data over an extended period of time you should always adjust dollars for inflation. It is a more honest representation of it. Here is another one. This is the commissions paid to travel agents by 4 different airlines during 3 different periods. And you can see from Period 1-Period 2 the commissions went up, and from 2-3 they went down—that was true for every single one of these airlines. What is not at all obvious from this is that third period is actually only a 6-month period. The other 2 periods are 12-month periods—of course it went down. It would be crazy if it did not go down—it was only half as long. It is not easy to see whether or not—for some of these airlines whether it went down proportionally. 
A much better representation of this would be to either annualize that 6-month period or to split up those 12-month periods into 2 bars. As it stands now it is almost useless and misleading, probably worse than useless. Another important thing when you are talking about the integrity of your data is context. We can provide a couple of data points to tell a story, but those data points really are not meaningful unless we provide some context around them. Here is an example: here are 2 data points on the number of traffic deaths in Connecticut, one in 1955 and one in 1956. I did omit the zero here, but you can see that traffic deaths in Connecticut did go down during that period. And this period was selected because that was the year that Connecticut decided to start enforcing speed limits very strictly. In the past they kind of winked an eye at it, but in 1955 it became a serious problem. And they started to address it. So this implies that stricter enforcement did save some lives, but we really cannot conclude that just from these 2 points because we do not know what happened before 1955 or after 1956. The graph could have looked like this if we went over a period of years, or like this, or like this; we are only looking at a small window of time here. And we would draw different conclusions depending upon which one of these graphs was true. In fact, the graph looks like this. Here are the 2 points we were looking at, and you can see the traffic deaths were increasing in the years prior to 1955, and they went down in 1956, and they continued to go down after. This actually is compelling evidence that stricter enforcement of speed limits did save some lives in Connecticut. Even more context you can add to this would be to compare it to the states that are contiguous to Connecticut. And this is deaths per 100,000 because some of the states have a much higher population, and some much lower.
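Normalizing to deaths per 100,000 is what makes states of different sizes comparable. A minimal sketch follows; the state names and all the numbers are hypothetical placeholders, not figures from the talk or from any census.

```python
def rate_per_100k(deaths, population):
    """Deaths per 100,000 residents."""
    return deaths / population * 100_000


# Hypothetical (deaths, population) pairs, for illustration only.
states = {
    "State A": (300, 2_400_000),
    "State B": (900, 15_000_000),
}

rates = {name: rate_per_100k(d, p) for name, (d, p) in states.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 100,000")
```

In this made-up case State B has three times the raw deaths but less than half the rate, which is the kind of reversal that raw counts would hide.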
But here we can see that the other states around Connecticut— the number of traffic deaths after 1956 they either remain steady or they increased. Connecticut is the only New England state that has showed a decrease in traffic deaths. This is a much more compelling argument right here. So here are some takeaways from that last section. Do not lie with your data—think about this lie factor. And lie factors always talk about the proportional change of your data and the proportional change of your visualization. These should be the same. They should be 1; if they are off by a lot more than that then you have got a problem—you are misleading your customers. Use consistent dimensions; do not try to use 2-dimensional data points to represent 1-dimensional data or 3-dimensional—even worse. Use real dollars when you are talking about financial money over time, and provide accurate context—relevant context for your visualizations. Questions so far? Okay—I cannot see all the audience very well, so if you have a question just shout it out. The next thing I want to talk about is Data-Ink. Data-Ink is—you probably do not remember this but—there was a time when we used to print our reports on paper—you know? No? Okay—well anyway—when I was a kid we would print reports on paper. And we would use ink which would be the stuff that would actually make it show up on the paper. That is the dark part of it; that is not done much anymore. Typically the visualizations I do they are on screen. People just see them on the computer screen on a Web page or print preview—something like that. But we still think about the dark parts of the report as the ink. The Data-Ink is the part of that ink that directly represents the data. And that is the important part of any visualization is the part that directly represents the data; everything else is less important than that. And here is another ratio that Edward Tufte came up with. 
It is called the Data-Ink ratio; it is simply the Data-Ink divided by the total ink. A good goal of effective data visualization is to maximize this number within reason. Of course the highest it can get is 1.0, in which case all of your ink is Data-Ink. But we want to get that as high as we can within reason. And I will talk about what within reason means in a second, but first some of you are probably asking yourselves: if Data-Ink is only some of the ink, what is the rest of the ink? What else could there be? The rest of the ink could be redundant data. If you have got the same data point represented multiple times, then only part of that ink is actually Data-Ink. The rest of it is redundant. You could have metadata. You could have decorations, or what Tufte called chart junk, stuff that really does not help the graph at all. All of that is a part of a visualization that may or may not be part of the Data-Ink. Let us talk about redundant data first. Here is a visualization that just shows a single point of data. It is the number 35.9. Let us count how many times that number is represented in this visualization. There is a vertical line on the left side of that bar that is 35.9 units tall. This bar also is 35.9 units tall on the right side. There is a horizontal line across the top that is 35.9 units from the X-axis. This area in the middle that is shaded is 35.9 units tall. There is a number at the top that is 35.9 units from the X-axis. And, of course, the number itself is the number 35.9. So what was that? Six? Seven? A lot of redundant ways of representing that data here. This picture has a very low Data-Ink ratio. I won't go through that same exercise here, but by now you should see that for a 3-dimensional bar chart that number would be even lower. It is a lot of redundant information. There is also a lot of metadata that you can potentially get rid of in your visualization.
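The Data-Ink ratio is a simple division, and a small sketch shows how it moves when non-data ink is removed. The ink amounts below are hypothetical (think of them as counts of dark pixels on an imagined chart); in practice there is no standard way to measure them, so treat this as an illustration of the idea and not as a measurement method.

```python
def data_ink_ratio(data_ink, total_ink):
    """Data-Ink divided by total ink; 1.0 means every mark is data."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must be between 0 and total_ink")
    return data_ink / total_ink


# Hypothetical ink amounts for one chart, for illustration only.
ink = {
    "data points": 200,
    "grid lines": 1500,
    "borders": 400,
    "axis labels": 300,
}

before = data_ink_ratio(ink["data points"], sum(ink.values()))

# Erase the grid lines and borders, keep everything else.
kept = {k: v for k, v in ink.items() if k not in ("grid lines", "borders")}
after = data_ink_ratio(kept["data points"], sum(kept.values()))

print(f"before: {before:.2f}, after: {after:.2f}")
```

The data ink itself does not change between the two versions; the ratio rises only because the total shrinks.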
I imagine a lot of people have created a graph that looks similar to this. This is a scatter diagram with a trend line going through it. I created these in the second grade with graph paper and a ruler, doing some science experiments. The only Data-Ink in here is the points, the data points that are surrounding the trend line. That is the data, and that is the Data-Ink. Everything else is not. So this right now has a very low Data-Ink ratio: just a little bit of Data-Ink and a lot of ink. These grid lines, for example, are not data. They are metadata. And we can get rid of some of them. And notice when I get rid of them how the data itself jumps out. I can see those little data points a lot more clearly. I could even lighten the remaining grid lines, and the data jumps out even more. This is what spreadsheet manufacturers figured out by the time they got to Version 2. In Version 1 of most spreadsheets for Windows, those lines between the cells were black; by the time Version 2 came out they were light gray, because spreadsheet manufacturers recognized that the lines between the cells are metadata. It is the numbers in the cells that are the data, and those should jump out. Those should be more important. In this case I do not even know if we need those gridlines. We can get rid of them entirely; I do not think it has lost anything in terms of information. And it has got a lot less ink, and I can see the data even more clearly now. So now I have got a much higher Data-Ink ratio, and the data jumps out at me more. But I can go further than this: do I really need those borders on the top and right? Do I really need all the numbers on the X and Y-axis? The fewer distractions I have, the more the data jumps out at me. And each time I do this I stop, and I ask myself: is the graph better? Or is it worse? If it is not worse, leave it; all things being equal, let us get rid of that non-Data-Ink. If it is worse, then maybe we ought to put it back. Is it easier to process?
So, for example, this trend line is not data; it is metadata. If I get rid of it... I think it is better with it. I think it adds some value, this trend line. So I erase, and stop, and ask myself, "Does it make it better? Does it make it worse? Maybe I ought to put something back." Here is a real world example; this is from Linus Pauling. I am a great admirer of Linus Pauling because I actually have a degree in biochemistry, and Pauling is the only man to earn a Nobel Prize in... what? Anyone? He has a Nobel Prize in chemistry and peace. This one (oops), he was smart enough to not put grid lines in his visualization. But he did replace them with these little plus signs, and I would argue those are not necessary. The data is just as clear, and the data itself jumps out more without them. This is the atomic volume of each element versus the atomic number, and he has added these curved lines that group elements together. They are basically the rows in the periodic table, which, as I remember from my college days, correspond to the number of electron shells, I think. Anyway, there is some significant information in the fact that these are all on the same row, and if I remove those trend lines it becomes much less useful. So that metadata I think is very useful. I would put that back, and I would also label them to show which row we are talking about by putting the label and the atomic number of the first element in that row, the one with only 1 electron in it. Remember this graph? This was the train schedule. It has a couple of problems, but it could benefit from some redundant data. What happens if a train is in transit at 6:00 am? This train right here, for example: if I want to traverse its entire line my eyes have to go all the way over to the left, and look over here and continue. It is kind of hard. It is no longer intuitive to me.
I can resolve this though by taking the first half of the graph and repeating it over here on the right, and then my eyes do not have to jump back across the chart. So the one on the bottom has a much lower Data-Ink ratio, but it is still easier to read. So sometimes you need to reduce your Data-Ink ratio. Maximizing it is not a hard and fast rule; we do it within reason. We could probably offset that by removing some of those gridlines, as I am sure a few of you have thought. So here are some principles from this last section. Number 1: think about the data. What is the data? What are you trying to show? That is really what you ought to be focusing on, and get your users to look at that. And you can do so by thinking about this concept of a Data-Ink ratio. What is data? What is not data? What is non-Data-Ink? What is redundant Data-Ink? And just be aware it is an iterative process. This is probably the biggest benefit I got directly from reading Tufte's work. I have written a lot of applications where people were doing things in Excel and they wanted to replace it with a Web app or something like that. But they want it to look just like Excel: put all these lines in here. And I start to put the lines in, and I say, "You know what? Take a look at this. Maybe this looks a little better to you without all these lines. Maybe it is a little bit less cluttered." And I get push back; they say, "Yeah, but our users love Excel." "Well, what if it had really light lines? What if you could barely see them? Then they could see the data a lot more clearly." Vibrations are less of a problem today than they were before. Vibrations refer to this kind of pattern that you see. Before we had cheap color printers people would print out different series by picking different patterns, and some people look at this and they get a seizure. I am not making this up.
This is—there is actually something called the Moiré effect which if you stare at something with a tight pattern like that it looks like it is moving, and it becomes very distracting. So graphs like this become almost unreadable to people. Or even like this—I look at this thing here, and I see that down at the bottom here I have got this series, and it is diagonal lines. I have got to look up at the legend here. And I see—okay—diagonal lines that is what? Unfinished oils. Oh no—wait—that is the diagonal from bottom left to top right—I need this one—crude oil. It is a problem. First of all I do not like legends. I understand there are some times when they are necessary, but generally you are forcing the user to look up and then look down again. And that is distracting to the user. Oops. Here is another example here—these are percentages of articles during the 1972 presidential election shown by patterns. And—again—it is really difficult for me to look at this and see—okay—these diagonal lines refer to inflation here, and these are diagonal lines as well in the same direction—they just happen to be more narrow. Eventually you run out of patterns to use. A much better way is something like this. First you could use colors; colors do have a disadvantage in that you may have users that are colorblind, and you have to be sensitive to that. It depends upon who your audience is. But here I have just simply labeled these bars. No need for a legend at all, so I can print this in black and white and it would still work. All right, chart junk and ducks—these are essentially the same thing. These are things that a lot of people like to add to their visualizations, but they do not provide any value at all. They are just there to create some buzz around your visualization. 
They are called ducks because there is a building in Flanders, New York— which I have a picture of here—that is shaped like a duck for no other reason than that people will be driving down the street and say, "Hey look, there is a building that is shaped like a duck. Let us go check it out." It works in the short-term, but it does not get anybody to come back. It does not add to the functionality of that building at all. It is just, "Let us go check out the duck building so we can say we have seen it once." Well visualizations sometimes have cute little decorations that do not provide any value at all, and this one here I do not know the source of this one, but it is—they have curved these bars simply to represent the declining aspect of this graph, and it is confusing. First of all I do not know whether to look at the left edge or the right edge. It just makes it less usable for this chart. Something here that has a lot of clutter—they have added too many colors. They are using colors to represent numeric data—even more so here. Remember I showed you the chart earlier that was census data? It went from light red to dark red. It worked—it worked really well. Now if I want to go from different numeric ranges I have to look at green versus dark purple or green versus yellow to see these numbers. I am constantly having to look back and forth to these charts—it just does not work. And this—this is a real thing—those are Tufte's words. Perhaps the worst visualization ever created I think is his exact quote. This is from an educational journal. This—what it is showing is a—what is it? The number of—I need to refresh my memory here—it is the 5 different years—it is the percentage of students that are under 25 years old and the percentage of students that are over 25 years old. So there is really only like 5 pieces of data in here. It is the percentage for each year—right? 
I mean—because this number up here this is the percentage that are over 25—how long would you look at this graph before you realize that the number here at the top and the number at the bottom add up to 100% every time. I mean—I know that is the case, and I still look at it. I cannot see how that is true. So they have taken these 5 points of data—I guess you could argue there are 10. And they are using 5 different colors and 3-dimensional—it is just crazy stuff. It is not intuitive at all; they have also rounded it. I do not think there is any reason—I do not think they took measurements for each month of the year—they just added these decorations. I think for this particular example a better way of representing this data is just like that. Sometimes the numbers are there. Sometimes visualizations just get in the way. Here is something clever you can do to reduce your Data-Ink ratio—whoops—increase your Data-Ink ratio. It is you can use some of your data points as some of your metadata— what you typically would use as metadata. A clever desk sergeant during World War I was asked to create a graph of the number of troops that were deployed to France each month. And he did so, but rather than just using points he has the division number of each division that was deployed. So if you look for a given month you do not only see how tall the bar is, but you can actually see which individual divisions were deployed during that month. In this case we have no values along the X-axis. Usually we have values every X number of units, and we just have to interpolate. Here the points themselves are the X values, and we just project those from the X-axis up to our points. So we have increased the Data-Ink ratio, but we have not lost any information at all. In this example the Y-axis—instead of being in equally spaced increments— which is what we are used to—they simply put the Y value there. It is a little bit more ink here, but it is actually more Data-Ink. 
The only time they could not do that is right down near the bottom where the numbers were so close together that now you have to interpolate between 0.2 and 0.4. All right, data density is another topic. It used to be easier to talk about data density when we printed things out because when we measure data density technically it should be the number of pieces of data divided by the area of a data graphic. And although we can still—on-screen—think about the number of pieces of data we really do not know area because we do not know if it is going to be printed on a piece of paper that is this big or a paper this big. But the important thing about data density is that we can recognize what is very dense data versus what is very low density data. There is a—kind of a—a lot of people are afraid to pack too much information into a visualization. They are worried that they are going to overwhelm their users. And I do not think that is a valid concern; I think people can see a lot of data very quickly as long as it is presented well. It is okay to put a lot of things—certainly that census data that we saw there that had over 3,000 data points in it—even in the aggregation—people could see that because they could see patterns there. But you certainly see this is very low data density. This is the percentage of students that are—I am sorry—the percentage of adults who are either in college or university—that is this bar right here—or that are in adult education—this bar here. This bar over here actually is just the sum of the other 2, so it does not really add much value; there is really only a couple pieces of data in here—an X and a Y for those 2 other bars and a lot of wasted space here. So this has like—when it was originally printed out it was a government document. It had about 0.15 data entries per square inch. 
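The density measure described above is plain arithmetic: pieces of data divided by the printed area of the graphic. The Python below is only a rough sketch of that calculation. The function name `data_density` and the chart dimensions are assumptions made up for illustration; the talk does not give the actual counts or sizes behind its figures.

```python
def data_density(entries, width_in, height_in):
    """Return data entries per square inch of the printed graphic."""
    area = width_in * height_in
    if area <= 0:
        raise ValueError("graphic area must be positive")
    return entries / area


# Hypothetical chart: 3 data entries printed on a 5 x 4 inch graphic.
# These inputs are invented; they are not the real chart's measurements.
print(data_density(3, 5, 4))
```

With these made-up inputs the result comes out at 0.15 entries per square inch, the same order as the figure quoted for the printed chart.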
That number is not really important except that you can recognize that this is a much more dense document with quite a few lines in it. This is what? 181 numbers per square inch when it was printed out. And this is the weather every single day of the year in New York City. We have the high and low temperature. We have the average high and low temperature around that. We have the relative humidity every day. We have the high and low precipitation. And we have got some annual information up in the corner. There is a lot of information in this graphic, but it is not overwhelming—it is presented well. It is organized well. And we can see, "Oh look—it is hotter in the summertime than it is in the wintertime." Or you can see for this particular year how much volatility there was. It looks like there was a pretty rainy spring. Here is even more data—this is actually—each one of these little dots represents a measurement of sunlight. This is every hour of every day for an entire year—how much sunlight was there? And you can see the curved line is actually sunrise on top and sunset on the bottom. So there is no sunshine at night obviously, but during the daytime there is usually sunshine—there is a lot more of it in the summertime when the days are longer. You can see when the cloudy days were because there is not a lot of sun. You can see when it was cloudy in the morning and so on. There are tons of pieces of information there—there is about 1,000 numbers per square inch in this. But it is readable. We can see it. Small multiples are a way of representing an extra dimension in your graph. This is a big thing nowadays; a lot of the visualization tools you are seeing now—and as I said this is not really a tools talk—this is more of a guidance talk—but a lot of tools are using animations to represent that extra dimension. So you can see things change over time. But you do not need fancy tools to do this. 
You can just simply repeat a graph multiple times with just 1 variable changed, and that becomes your extra dimension. For example—right here—here is a graph that was created by hand. It shows the age of fish caught in northern European streams. And you can see in 1908 there was—it looks like the average age of the fish was around 5 or 6 years—certainly there is a big peak around 4 years. And as time went on from 1908 to 1913 that peak moved to the right indicating that fish were getting older. The fish that fishermen were catching were getting older even though they seemed to be catching more fish, since that graph seems to be a little bit higher. And that is an indication that these streams are not being overfished. So we have added that extra dimension simply by repeating it multiple times. Here is a company that really specializes in lots of data—presenting a lot of data to the user; this is Consumer Reports. And these are a lot of pieces of information, but with about 60 seconds of an explanation I think you will understand it. Every one of these rectangles represents a make and model of a car. And each rectangle is divided into 6 columns. Each column is a model year. And about 15 rows, and each row represents a potential problem area of an automobile like the brakes or the transmission or the interior rust or the exterior rust. And for each one of these intersections of a model year for a make/model of a car and a potential trouble area we have a circle. A completely filled in circle means this is a problem with that car—if the brakes fail a lot on that car or it rusts out a lot. A completely not filled in circle means that this is good—the owners of this car almost never had that problem. And then there are varying levels; it could be partially filled in—the more black the bigger a problem it is. So if you know that then you can just compare. 
You can look and say, "Boy, it sure looks like this Plymouth Volare has a lot more dark circles than this Mercedes Benz. It looks like that Mercedes Benz is a lot more reliable car and a lot less trouble." A Mercedes Benz tends to cost a lot more than a Plymouth Volare did, and they do have a line down here for cost; it is the only dark area for the Mercedes. But because—well they basically added another dimension to it—the cars. So you are looking at this along several dimensions—trouble spots, make and model of the car, model year, and the different models to compare side by side. The last section I want to talk about here is the graphs that Edward Tufte came up with himself. He came up with one that has really been widely adopted. That is the spark line, and one that I have seen but do not really see that often in the wild—that is the slope graph. The spark line is—we see this all over, but we see it a lot in financial newspapers and magazines. It is a very compact way of showing a data series over time or really any dimension. We see with financial information—a lot of times we will see like stock prices or currency prices as is in this case over time. And rather than showing an entire X-axis or Y-axis with numbers to interpolate between what this will show you typically is just a couple of numbers—maybe the beginning number or the ending number just to give it context. So you know where you are—maybe the high or the low—identifying just those. And the beauty of this is that you can actually take this graph and either stack it with other similar graphs, so that with a lot of them together you can do comparisons between the volatility of all these currencies. I am sorry; these are not currencies, this is gross domestic product. Or you could actually include it directly in text in a paragraph. Some people believe that you should not mix text and visualizations together. That is nonsense. 
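As a rough illustration of how a spark line can sit directly inside a sentence, here is a minimal Python sketch that draws one from Unicode block characters and keeps only the first and last numbers for context. The function names `sparkline` and `labeled_sparkline` and the price series are assumptions invented for this example; they are not taken from the talk or from any particular charting tool.

```python
# Eight block characters of increasing height, lowest to highest.
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Map each value to a block whose height follows its position in the range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no range to scale, so draw it at the lowest height.
        return BLOCKS[0] * len(values)
    scale = (len(BLOCKS) - 1) / (hi - lo)
    return "".join(BLOCKS[int((v - lo) * scale + 0.5)] for v in values)


def labeled_sparkline(values):
    """Show only the first and last numbers around the spark line for context."""
    return f"{values[0]} {sparkline(values)} {values[-1]}"


# Hypothetical daily prices, made up for illustration only.
prices = [12, 15, 11, 18, 9, 16, 14]
print(f"The price was volatile ({labeled_sparkline(prices)}) all week.")
```

Because the result is ordinary text, it can be dropped into a paragraph or a table cell without any axes or legend.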
In the middle of your paragraph if you want to talk about the fact that you have a spark line or that this stock price is very volatile then in parentheses you can actually include a spark line. And that can be a very effective way of doing things. In Excel now they have the ability to add a spark line to the inside of a cell. So you could have a row of financial data and then this row of just a spark line showing that financial data very quickly. Here is another example—there are some exchange rates. I like this one here because this was from—showing measurements—glucose measurements; the doctor might take glucose measurements over time. And the ending value is shown here; we can label that ending value to make it more clear. We can show what normal levels of glucose are to show that there are times when this patient's glucose was abnormally high or abnormally low. There is just a lot of information packed into a small area of this graph. And this would be really nice if a hospital chart contained something like this. A doctor could look at it very quickly without having to read a lot of information to see what measurements have been taken about a patient and which ones we should be concerned about because they are outside of the normal range. Very quickly you can get a lot of information here. Slope graphs: this is essentially a line graph with just 2 points, and it is a bunch of line graphs that are compared together. So you can see the comparison between the changes—how things have changed—relative to one another. I have not seen this a lot in the wild; I actually gave this talk a few months ago. And somebody emailed me an example I thought was a pretty good one here. This is from the Atlantic, and these are 2 surveys taken 9 years apart showing public confidence in a bunch of public institutions such as the military, the police, the church, medical system. 
And you can see in most of these areas public confidence has actually gone down, and some of it has gone significantly down. A few had gone up, so over that decade confidence in churches had gone up, confidence in the medical system had gone up slightly. But confidence in Congress—for example—had gone way down. Congress took a big hit. The presidency took a big hit during that decade. So there is some information that you can see very quickly in here. I have not seen a lot of use of slope charts in the past, but I thought this was a very good one. So I am getting near the end here, and here are some of the takeaways. Keep your data simple and maintain graphical integrity. Do not lie with your data; think about what you are trying to show, and keep that lie factor in mind. Keep the proportions the same for your data as they are for your visualization. Think about the Data-Ink ratio. Maximize that value within reason; it is an iterative process. When you hit next, next, finish, stop for a second and look back and say, "Do I have too much metadata? Do I have redundant data here? What is in here?" Avoid extraneous stuff—we call them chart junk—we call them ducks. Just avoid them; they really do not help your graphic at all. Sometimes you can use multi-functioning graphic elements; that is not always possible, but it is a clever way of doing things sometimes. Avoid legends; labeling data directly is much better. Also—some people say do not mix—do not put text inside of your visualization—do not put text in your graph—nonsense. Put text in your graph; it helps the graph. Put text in there. If it helps your text put a graph in there. As long as it is effective. Consider data density; do not be afraid to put a lot of data into your visualizations. People can handle it as long as it is presented well. And then a common axis for comparison is a great tool for providing business intelligence. Now let us come back to this graph here. 
This is Charles Minard's representation of Napoleon's advance on and retreat from Moscow. Minard could have presented this a different way. He could have done whatever the 19th century equivalent of next, next, finish is. He could have dressed it up using his obvious artistic skills. But if we look at it this way we do not really see a lot of information, certainly not as much as in the way he chose to represent it. And that is what I want you to do. I want you to take a step back, and I want you to think about how you are representing the data. So whatever tool you are using after you have created your visualization ask yourself, "Do I need all of that metadata? Can I get rid of some of it and not lose any information? Can I get rid of some redundant data? Can I clean this up and make it a little bit easier for my users to see?" If I clean it up not only does the data jump out at me but I have got room to add another series. And once I have added another series then my users can start to look at this in a new way. They can start to compare series or see correlations between them, coming up with reasonable explanations as to why those correlations exist. That is knowledge. That is information. That is power. Your users will thank you. My name is David Giard, and I thank you. [clapping]

Video Details

Duration: 55 minutes and 42 seconds
Country: United States
Language: English
License: All rights reserved
Genre: None
Views: 5
Posted by: asoboleva99 on Jun 28, 2013

We spend much of our time collecting and analyzing data. That data is only useful if it can be displayed in a meaningful, understandable way. Yale professor Edward Tufte presented many ideas on how to effectively present data to an audience or end user. In this session, Tufte's most important guidelines about data visualization are explained, as well as how you can apply those guidelines to your own data. Learn what to include, what to remove, and what to avoid in your charts, graphs, maps and other images that represent data.
