[title: line-fit] Machine Learning: for whom, by whom, to whom?
Carlos Scheidegger, HDC Lab
Who we are
https://hdc.cs.arizona.edu
Computing is cheap
Storage is cheap
Software is expensive!
- We spend US$312 billion per year on debugging alone
Machine Learning is a way out
- Instead of writing code, we come up with examples of the expected behavior.
- Then, we write one (pretty weird) program once, and have the computer adapt it so that it reproduces the behavior in the data.
- Then we need data!
[slide-data: backgroundImage /talks/2020-02-08/jason-pacheco.jpg] [title: bg-black] The Promise of Machine Learning
- Milstein, Pacheco, et al.
Intracortical Brain-Computer
Interfaces, NeurIPS 2017
[title: line-fit] The Peril of Machine Learning, 2009
[title: line-fit] The Peril of Machine Learning, 2018
But how does it all work?
[title: line-fit] Yet another AI CS admissions app
- YAICS, for short
- AI will determine which PhD applicants to accept
- … using features associated with good and bad applications
- … using historical data
- It is data-driven, it will be objective!
[slide-data: backgroundImage /talks/2020-02-08/fifa.jpg] [title: bg-black line-fit] The features are obvious and exact, right?
In reality…
- GPA
- GRE scores
- Relevant Major?
- Good School?
- Research Experience?
- …
- Our goal: evaluate application quality.
YAICS
- create a rule that assigns a score to each candidate
- select the high-scoring candidates
- What rule?
Here’s the data
| Ajit | Blake | Cedric | Daniela |
GPA | 3.75 | 4.0 | 3.5 | 3.8 |
GRE-V | 120 | 105 | 120 | 95 |
GRE-Q | 110 | 117 | 100 | 130 |
GRE-A | 5 | 4 | 3 | 6 |
major | CS | CS | Math | ECE |
school | MIT | ASU | NAU | UA |
research? | yes | no | no | yes |
PhD in 6? | yes | no | yes | yes |
How do we evaluate our rule?
| Ajit | Blake | Cedric | Daniela |
GPA | 3.75 | 4.0 | 3.5 | 3.8 |
GRE-V | 120 | 105 | 120 | 95 |
GRE-Q | 110 | 117 | 100 | 130 |
GRE-A | 5 | 4 | 3 | 6 |
major | CS | CS | Math | ECE |
school | MIT | ASU | NAU | UA |
research? | yes | no | no | yes |
PhD in 6? | yes | no | yes | yes |
- Don't assess your rule on the data you used to compute it!
- Overfitting: you don't want your model to simply memorize the training data
- Split the data into training and testing sets
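The split above can be sketched in a few lines of plain Python. The applicant records here are made up for illustration (they loosely mirror the table's columns), and the 75/25 split ratio is an arbitrary choice:

```python
import random

# Hypothetical applicant records: (features, completed PhD in 6 years?)
# These values are illustrative only.
data = [
    ({"gpa": 3.75, "research": 1}, True),
    ({"gpa": 4.00, "research": 0}, False),
    ({"gpa": 3.50, "research": 0}, True),
    ({"gpa": 3.80, "research": 1}, True),
]

random.seed(0)      # make the split reproducible
random.shuffle(data)

split = int(0.75 * len(data))            # e.g. 75% train / 25% test
train, test = data[:split], data[split:]

# Fit the rule on `train` only; measure its accuracy on `test` only,
# so that a model that merely memorized the training set gets no credit.
```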
[slide-data: backgroundImage /talks/2020-02-08/yaicsv1.png] How about a very simple rule?
[slide-data: backgroundImage /talks/2020-02-08/yaicsv2.png] … Maybe more complicated?
| Ajit | Blake | Cedric | Daniela |
GPA | 3.75 | 4.0 | 3.5 | 3.8 |
GRE-V | 120 | 105 | 120 | 95 |
GRE-Q | 110 | 117 | 100 | 130 |
GRE-A | 5 | 4 | 3 | 6 |
major | CS | CS | Math | ECE |
school | MIT | ASU | NAU | UA |
research? | yes | no | no | yes |
PhD in 6? | yes | no | yes | yes |
$p_1 \textrm{GPA} + p_2 \textrm{GRE-V} + p_3 \textrm{GRE-Q} + p_4 \textrm{GRE-A} + $
$p_5 \textrm{major} + p_6\textrm{school} + p_7\textrm{research}$
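The linear rule above is just a weighted sum. A minimal sketch, where the numeric encodings for major and school, and all the weight values, are made-up assumptions for illustration:

```python
def score(applicant, p):
    """Weighted sum: p1*GPA + p2*GRE-V + ... + p7*research."""
    features = [
        applicant["gpa"],
        applicant["gre_v"],
        applicant["gre_q"],
        applicant["gre_a"],
        applicant["major"],     # e.g. CS = 1, other = 0 (an assumption)
        applicant["school"],    # some numeric encoding (an assumption)
        applicant["research"],  # 1 = yes, 0 = no
    ]
    return sum(w * x for w, x in zip(p, features))

# Ajit's row from the table, with illustrative categorical encodings
ajit = {"gpa": 3.75, "gre_v": 120, "gre_q": 110, "gre_a": 5,
        "major": 1, "school": 1, "research": 1}
p = [1.0, 0.01, 0.01, 0.1, 0.5, 0.5, 1.0]   # arbitrary illustrative weights
print(score(ajit, p))
```

Finding good values for the parameters $p_1, \dots, p_7$ is exactly what training does.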
[slide-data: backgroundImage /talks/2020-02-08/yaicsv3.png]
[slide-data: backgroundImage /talks/2020-02-08/yaicsv4.png] [title: line-fit] And this is a “deep” neural network
How do we find the right values?
- Often, it’s just gradient descent
- Force the (bad) model to make a prediction at a random data point
- Measure the error; take the gradient of the error with respect to the parameters
- Nudge the parameters in the negative gradient direction, and repeat
- Eventually, stop
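The steps above can be sketched with a toy one-parameter model. Everything here, the data, the learning rate, the step count, is an arbitrary choice for illustration:

```python
import random

# Stochastic gradient descent on squared error, fitting a single
# weight w in the toy model  prediction = w * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # true relationship: y = 2x

w = 0.0       # start with a (bad) model
lr = 0.05     # learning rate: size of each nudge
random.seed(0)

for step in range(500):
    x, y = random.choice(data)     # pick a random data point
    pred = w * x                   # force the model to predict
    grad = 2 * (pred - y) * x      # d(error^2)/dw at this point
    w -= lr * grad                 # nudge in the negative gradient direction

print(round(w, 3))   # close to 2.0
```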
So now that you know how to do it… (often) don’t!
Where did you get that data?
- Who decided if the previous admission process was good?
- What if (and hear me out here) there were times when the admissions committee made a mistake?
- What if your recruiting efforts were imbalanced with respect to gender?
- What if you were giving out mortgages instead?
[slide-data: backgroundImage /talks/2020-02-08/redlining.jpg]
[slide-data: backgroundImage /talks/2020-02-08/bullshit-gaydar.png]
[slide-data: backgroundImage /talks/2020-02-08/criminality-paper.png]
[slide-data: backgroundImage /talks/2020-02-08/phrenology.png]
We don’t really understand this!
- Computer says:
- Left: "stop"
- Right: "45 mph"
- "Robust Physical-World Attacks on Deep Learning Visual Classification", CVPR 2018
[title: line-fit] So you’re going to use ML. Great!
- But if it’s about people, please ask yourself this:
- Who benefits from it? (For whom?)
- Who are the targets, and who suffers from mistakes? (To whom?)
[title: line-fit] So you’re going to use ML. Great!
- But if it’s about people, please ask yourself this:
- Is your data going to repeat our racist, sexist past?
- Did you ask them if they want it?
- Are you ready to stop if they say no?
- Do you truly know the domain? Why you? (By whom?)
ML is not an excuse to ignore ethics, history, and society!
Thank you!
- and thank you to
- my colleagues Stephen Kobourov and Jason Pacheco for their examples and materials
- McCallum and Blok for storage data
- Come to Stephen’s and my talk at Centennial Hall, Feb 25th!
- @scheidegger, https://cscheid.net