In this assignment, you will implement the simplest possible geometry-based classifier, namely k-nearest neighbors, and will investigate its behavior on the datasets of Assignment 1.
In addition to the kNN classifier, you will perform some experiments on the nature of high-dimensional data.
You will find useful helper code in the files `knn.py` and `dr.py` in the starter repo. Specifically, you will implement the class `KNNClassification` in `knn.py`.
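The starter repo defines the exact interface `KNNClassification` must expose. As a rough sketch only, assuming a `fit`/`predict`-style API with Euclidean distance and majority voting (which may not match the required interface), a brute-force implementation looks like this:

```python
import numpy as np
from collections import Counter


class KNNClassification:
    """Brute-force k-nearest-neighbors classifier (illustrative sketch only).

    The fit/predict method names, Euclidean distance, and majority voting are
    assumptions; follow the interface defined in the starter repo's knn.py.
    """

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        # kNN is "lazy": training just stores the data.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds = []
        for x in X:
            # Squared Euclidean distance to every training point.
            d = np.sum((self.X - x) ** 2, axis=1)
            nearest = np.argsort(d)[: self.k]
            # Majority vote among the labels of the k nearest points.
            preds.append(Counter(self.y[nearest]).most_common(1)[0][0])
        return np.array(preds)
```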
Answer the questions below in a plain-text file `answers.txt`, a Markdown file `answers.md`, or a PDF file `answers.pdf`. I will not accept Microsoft Word, OS X Pages, or OpenOffice documents.
In addition, submit whatever code you use to answer the questions below.
1. What's the best validation accuracy you obtain on `agaricus-lepiota` and on `primary-tumor` by experimenting with `k`? How does this compare to your decision tree classifier?
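One way to run the experiments for questions 1 and 2 is a simple sweep over `k` that records both training and validation accuracy. In the sketch below, `load_and_split` is a hypothetical helper standing in for your own Assignment 1 data loading and train/validation splitting code, and `KNNClassification` is the sketch above (or your real implementation):

```python
import numpy as np

# `load_and_split` is a hypothetical helper; replace it with your own
# Assignment 1 loading and train/validation splitting code.
X_train, y_train, X_val, y_val = load_and_split("agaricus-lepiota")

for k in (1, 3, 5, 9, 15, 25, 51):
    clf = KNNClassification(k=k).fit(X_train, y_train)
    train_acc = np.mean(clf.predict(X_train) == y_train)
    val_acc = np.mean(clf.predict(X_val) == y_val)
    print(f"k={k:3d}  train={train_acc:.3f}  val={val_acc:.3f}")
```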
2. As you increase the hyperparameter `k`, what happens to each of the training and validation accuracies, and why? Explain, specifically, the training accuracy you obtain with `k=1` for the `primary-tumor` dataset.
3. Describe the performance (in terms of runtime) of your kNN classifier.
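A minimal sketch for measuring runtime, assuming the variables from the sweep above (the `timeit` module would also work):

```python
import time

start = time.perf_counter()
clf = KNNClassification(k=5).fit(X_train, y_train)
fit_seconds = time.perf_counter() - start

start = time.perf_counter()
clf.predict(X_val)
predict_seconds = time.perf_counter() - start

# For brute-force kNN, fitting is nearly free, while prediction scales with
# (number of queries) x (number of training points) x (number of dimensions).
print(f"fit: {fit_seconds:.4f}s  predict: {predict_seconds:.4f}s")
```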
4. Generate 500 samples from a 2-dimensional multivariate normal with mean zero and total variance 1 (that is, each dimension should have variance 0.5), and 500 samples from a 100-dimensional multivariate normal with mean zero and total variance 1 (that is, each dimension should have variance 0.01). For each of the two datasets, plot a histogram of the lengths of the samples and a histogram of the distances between all pairs of samples. Given these observations:
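For the sample generation and plotting in question 4, here is a minimal sketch using NumPy and Matplotlib. A diagonal covariance with variance 1/d per dimension gives total variance 1; `scipy.spatial.distance.pdist` would be an alternative for the pairwise distances:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

for dim in (2, 100):
    # Diagonal covariance with total variance 1 spread evenly over dimensions.
    samples = rng.normal(scale=np.sqrt(1.0 / dim), size=(500, dim))

    lengths = np.linalg.norm(samples, axis=1)

    # Pairwise Euclidean distances (upper triangle only, to avoid duplicates).
    diffs = samples[:, None, :] - samples[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    pair_dists = dists[np.triu_indices(500, k=1)]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(lengths, bins=50)
    ax1.set_title(f"sample lengths, d={dim}")
    ax2.hist(pair_dists, bins=50)
    ax2.set_title(f"pairwise distances, d={dim}")
    plt.show()
```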
5. Use the script `dr.py` to construct lower-dimensional versions of `agaricus-lepiota` and `primary-tumor`, then run your kNN classifier on a variety of settings for `k`. Experiment with a variety of values for the target dimensionality `d` as well. What do you find? What are the tradeoffs here?
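The reduction itself and its interface are defined by `dr.py` in the starter repo; the sketch below only shows the shape of the experiment loop, using a shared random projection as a stand-in for whatever `dr.py` actually produces. It also assumes numerically encoded features and the `X_train`/`y_train`/`X_val`/`y_val` variables and `KNNClassification` sketch from earlier:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 5, 10, 20, 50):
    # Stand-in for dr.py: one shared random projection per target dimension d.
    # Replace these two lines with the lower-dimensional datasets dr.py builds.
    P = rng.normal(size=(X_train.shape[1], d)) / np.sqrt(d)
    Xtr_d, Xval_d = X_train @ P, X_val @ P

    for k in (1, 5, 15, 51):
        clf = KNNClassification(k=k).fit(Xtr_d, y_train)
        val_acc = np.mean(clf.predict(Xval_d) == y_val)
        print(f"d={d:3d}  k={k:3d}  val={val_acc:.3f}")
```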
6. Given the results you found in question 5, would it ever be useful to reduce the dimensionality of `primary-tumor`, if you had a well-principled way of doing it?