CS444/544: Data Visualization

Welcome to CS444/CS544, Data Visualization. In this course, you will learn how, and why, to create data visualizations. Pay attention to this document: it serves as the syllabus for the course.

A “visualization” is simply a visual representation of an object of interest. It’s visual: we consume it with our eyes, and so it is essential that we know how our eyes work and, more importantly, how the parts of our brains connected to our eyes work. It’s also a representation: we get to choose what this representation will be, and different choices lead to different pictures, some good and some bad. We will learn how to tell those apart, and how to make pictures that are more good than bad.

Good data visualization involves perceptual psychology, mathematics, and computer science. This makes our subject uniquely challenging: sometimes the way our eyes work stands in the way of applying some beautiful result from computer science. Sometimes it’s the other way around: something deep about the math in the data will help guide the design process and let us make a picture that is beautiful, informative, and truthful.

The content of the course is split roughly into three distinct aspects: mechanics, principles, and techniques.

Content

The why: principles

Data visualization itself has existed for at least 200 years; we’ll learn about Playfair, Nightingale, Minard, and others. We’ll also look at what statistics in the 1900s, computers in the 1950s, and exploratory analysis brought to the field. From the 1960s on, we started to realize that some things in visualization work better than others, and around 1980 scientists started seriously studying the effectiveness of data visualization as a medium itself. This program goes on to this day. To give a few examples, we know that using positions works better than using angles; we know that using length works better than using area. We know that, in some cases, using color intensity works better than color hue (and that in other cases, it’s the other way around).

We also know, since the 1960s, that interaction is a powerful idea. Back then people interacted with a data visualization by carefully rearranging bits of paper (no supercomputers in our pockets yet!), but many of the original thoughts are still valid. We will learn the basics of interactive visualizations.

Although much of what we know about visualization is finicky and specific, we have some general principles. We will spend about four weeks studying these principles.

List of Principles

The what: techniques

In comparison to the relative paucity of principles, data visualization has an enormous number of existing techniques. We will spend about six weeks in this course going over existing techniques, and what kinds of data they apply to.

Here, computer science has much to say about data visualization.

For example, not everything we want to do with data is efficient, and not everything that is efficient is worth doing with data. This means that the practice of data visualization needs to be informed by algorithmic constraints.

Data visualization also interacts with software engineering: not every visualization algorithm plays well with the rest of the code in your program and in your head.

List of Techniques

The how: mechanics

When we talk about mechanics, we mean the practical things you will need to learn in order to create data visualizations. In this course, we will use the web software stack. This means making visualizations through web pages, using HTML, CSS, and JavaScript. The main domain-specific tool we’ll learn is d3.

The modern web stack is good, bad, and ugly. We will spend about four weeks in this course learning how to use it to make visualizations.

In this course, we will handle data from many different sources. As we will see, data is dirty: standard formats are not really standard, sometimes information is missing from files, there are odd data points that don’t belong (outliers), etc.
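To make “dirty” concrete, here is a minimal sketch in Python (not the course’s web stack, just a compact way to show the idea). It reads one numeric column from a CSV file, counts the cells it cannot use, and flags outliers. Everything specific here is an assumption made for illustration: the names `load_column` and `flag_outliers`, the list of missing-value markers, the sample data, and the choice of the common 1.5 × IQR rule for deciding what counts as an outlier.

```python
import csv
import io
import statistics

# ASSUMPTION: these are the markers this hypothetical file uses for "no value".
MISSING_MARKERS = {"", "NA", "N/A", "null"}


def load_column(text, column):
    """Return (numeric values, number of unusable cells) for one CSV column."""
    values = []
    unusable = 0
    for row in csv.DictReader(io.StringIO(text)):
        cell = (row.get(column) or "").strip()
        if cell in MISSING_MARKERS:
            unusable += 1
            continue
        try:
            values.append(float(cell))
        except ValueError:
            # Not a missing marker, but not a number either.
            unusable += 1
    return values, unusable


def flag_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the quartiles (an assumed rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up sample data: two unusable cells and one suspicious reading.
SAMPLE = "\n".join([
    "station,reading",
    "s1,2.8", "s2,", "s3,2.9", "s4,3.0", "s5,NA",
    "s6,3.0", "s7,3.1", "s8,3.2", "s9,3.3", "s10,250.0",
])

values, unusable = load_column(SAMPLE, "reading")
outliers = flag_outliers(values)
print(len(values), "usable values;", unusable, "unusable cells; outliers:", outliers)
```

Whether a flagged point is an error or the most interesting thing in the dataset is a judgment call, which is one reason cleaning and visualization go hand in hand.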

We will learn to do basic exploratory analysis on data: specifically, we will become comfortable with digging into a dataset and playing around with it to see what’s there. As we will see, data cleaning makes for better visualization, and visualization also makes for better data cleaning.

We will do one week of data cleaning and exploratory data analysis.

List of Mechanics

Reading material

There is no required textbook for the course, but that’s mostly because the textbook I wanted to use for the course was published after the university bookstore deadline. I strongly recommend Tamara Munzner’s book:

I have a copy I intend to keep in my office at all times, in case you want to browse it before deciding to buy it. Other excellent books I recommend include (again, ask me to take a look at them if you’re curious):

If you want to dive deeper into visualization, you should have read, at least once, the following books:

We will also be reading some web pages and some research papers (especially CS544 students). When you’re expected to read material ahead of time, the material will be posted on the course web page, and will be discussed in class.

Discussion Fora

Offline student discussion is welcome, as long as it does not involve sharing of assignment source code (see the Academic Conduct section below). Discussion on the Piazza site is especially encouraged, since we can monitor it and it will count towards class participation.

Assessment

Assessment of your performance in this course will be done mostly through projects. There will be a large number of small assignments (about one per week, for a total of 10-12), each of which should take you less than two hours to complete. There will be one midterm and one final project. In other words, there is no final exam in this course.

The final project can be done in teams. There’s no minimum or maximum group size, as long as you tell me ahead of time and I am ok with it. I will adjust the grading expectations accordingly, and I don’t expect to allow groups larger than three unless there’s a good reason.

The differences in assessment for students taking the class at the 444 level vs. the 544 level will be:

Small Assignments

Small assignments will test whether you understood the concepts discussed in that particular week, and will be small and self-contained. You’ll submit a webpage, typically, with a demonstration of the concept we discussed and a short explanation (around a paragraph or two).

Assignments will be posted on Thursdays, and will be due the following Thursday at the time class starts. My office hours will be on Tuesdays, 9:00-11:30, and there will be no office hours on Wednesday or Thursday morning. This is so that I can use the Tuesday lecture to talk about issues with the previous assignment, and to nudge you to get started more than 24 minutes (ahem, hours) before the deadline.

Midterm

The midterm will be given, tentatively, in the eighth week of the course. By then we will have been through the Mechanics, and some of the Principles. The midterm will be a closed-book exam which I’m calibrating so that you can finish in one hour.

Final project

For the final project, you will build one larger visualization of your own choosing. You will use the mechanics, principles, and techniques you learn in this course, and will be graded on the extent to which your visualization respects these. Each student can expect to spend about as much time on the final project as they will on all assignments combined.

CS544 students

If you are a 544 student, you are encouraged to replace the final project with a research-oriented project of your own choosing, as long as it:

1) involves visualization

2) is a concrete and significant step towards a research paper, either about using data visualization in your area of research, or a visualization research paper itself.

A successful final project does not have to be a finished paper (although if you do finish a solid manuscript, you’re essentially guaranteed an A), but the clearer it is how to take what you have and turn it into a reasonable submission to a workshop, conference, or journal, the better.

Grading

I will grade your assignments, midterm, and final project on a scale from 0 to 100, with respective weights of 50%, 20%, and 30%. In addition, I will give class participation a 10% weight. This will give you a score from 0 to 110. Your final grade in the course will be the better of a per-class grading curve and overall performance:

Overall performance:

Curve grading:

In other words, for you to get an A, you need to either score 90% average or better in the overall performance, or be in the top 15% of the class. If you get an F, you will have been in the bottom 40% of the class, and have scored less than 40% of the possible total points.
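As a rough illustration of the arithmetic, the weighted score and the “either/or” rule for an A can be sketched in Python as below. The function names and the `rank_fraction` parameter are hypothetical, and the sketch assumes that “90% average” means 90 points on the overall score; only the weights and the top-15% rule come from the text above. It is not an official grade calculator.

```python
# Weights in percent, as stated in the Grading section above.
WEIGHTS = {
    "assignments": 50,
    "midterm": 20,
    "final_project": 30,
    "participation": 10,
}

# ASSUMPTION: "90% average" is read here as 90 points on the overall score.
A_SCORE_CUTOFF = 90
# From the text: being in the top 15% of the class also earns an A.
A_TOP_FRACTION = 0.15


def overall_score(scores):
    """Weighted total on the 0-110 scale; each component is graded 0-100."""
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def earns_a(score, rank_fraction):
    """True if either route to an A applies.

    rank_fraction is a hypothetical input: your position in the class
    as a fraction, where 0.0 is the very top and 1.0 is the very bottom.
    """
    return score >= A_SCORE_CUTOFF or rank_fraction <= A_TOP_FRACTION


example = {"assignments": 80, "midterm": 70, "final_project": 90, "participation": 100}
print(overall_score(example))  # 40 + 14 + 27 + 10 = 91.0
```

Because the final grade is the better of the two routes, a strong score earns an A regardless of how the rest of the class does, and a top-of-class rank earns one regardless of the raw score.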

By March 31st (your last day to withdraw), you will know more than 40% of your grade by weight.

Grades for assignments, midterm and final project will be posted on D2L as soon as we have them.

Academic conduct, plagiarism, and open-source software

We will use a lot of existing libraries in this course. This is good sense, and good practice: much programming nowadays is more about finding the right set of libraries and learning how to combine them usefully than it is about designing new libraries from scratch. As a general rule, you are allowed to use any open source library you want in your assignments and projects, provided that you give its authors credit. For some assignments, learning how a particular implementation works is the entire point. In that case, I will make it clear that the implementation needs to be yours, and I will explicitly tell you that you are not allowed to search online for answers. Ignore this warning at your peril.

If you take an existing piece of code, however small, from elsewhere, use it in your coursework, and do not give attribution, this is plagiarism. Plagiarizing from classmates is not allowed, plagiarizing from sources on the web is not allowed, and plagiarizing from yourself is not allowed, either. In other words, I want to know how much of the code you turn in was written for this assignment in particular. It’s ok if it only took 15 lines of code, and it’s ok if it takes 500.

Put simply, plagiarism is cheating, and I will treat it as such. The penalty for cheating and plagiarism always includes a referral to the college, and ranges from an automatic zero in the assignment, to a failing grade in the course, up to potential expulsion from the university.

The main point is I want you to get in the habit of giving sources the proper attribution, and this course will be a great opportunity for that.

Incomplete policy

I only give incomplete grades under extenuating circumstances, and only on a case-by-case basis. By the time I give you an incomplete grade, we will both have agreed on what exactly you need to finish, and by what time (I expect you to have completed everything necessary before the end of summer).

A tour of visualizations, good, bad, and ugly

By the end of this course, you will have the skills to create many of these visualizations yourself, to tell whether they are a good or a bad design, and why.

The good

- The Periodic Table. You might not think of the Periodic Table as a good visualization. It is nevertheless an especially good one, because of its brilliant spatial arrangement of elements in ways that make your eyes think for you: electron affinity up and to the right; metallic character down and to the left; etc.

The bad, and the ugly

Further reading, watching, etc.

What’s possible (if not easy) with today’s web technology