In this assignment, you will implement the simplest possible geometry-based classifier, namely k-nearest neighbors, and will investigate its behavior on the datasets of Assignment 1.
In addition to the kNN classifier, you will perform some experiments on the nature of high-dimensional data.
You will find useful helper code in the starter repo. Specifically, you will implement the kNN classifier class.
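The starter repo's actual class name and interface are not reproduced here; the following is a minimal sketch of a brute-force kNN classifier with a scikit-learn-style fit/predict interface, under the assumption that the features are numeric. Adapt it to whatever stub the starter code provides.

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    """Brute-force k-nearest-neighbors classifier.
    Hypothetical interface; the starter repo's stub may differ."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        # kNN is "lazy": training just memorizes the data.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            # Euclidean distance from x to every training point.
            dists = np.linalg.norm(self.X - x, axis=1)
            # Labels of the k closest training points; majority vote.
            nearest = self.y[np.argsort(dists)[:self.k]]
            preds.append(Counter(nearest).most_common(1)[0][0])
        return np.array(preds)
```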
Answer the questions below in an “answers.txt” plain-text file, an “answers.md” Markdown file, or an “answers.pdf” PDF. I will not accept Microsoft Word, OS X Pages, or OpenOffice documents.
In addition, submit whatever code you use to answer the questions below.
What’s the best validation accuracy you obtain on agaricus-lepiota and on primary-tumor by experimenting with k? How does this compare to your decision tree classifier?
As you increase the hyperparameter k, what happens to each of the training and validation accuracies, and why? Explain, specifically, the training accuracy you obtain with k=1 on these datasets.
Describe the runtime performance of your kNN classifier.
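One way to make the runtime answer concrete is to time predictions as the training set grows. The snippet below is a sketch that reuses the hypothetical KNNClassifier above on synthetic data; brute-force kNN does O(n·d) work per query, so prediction time should grow roughly linearly with the number of training points n.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X_test = rng.normal(size=(100, 20))

for n in (1000, 2000, 4000, 8000):
    X_train = rng.normal(size=(n, 20))
    y_train = rng.integers(0, 2, size=n)
    clf = KNNClassifier(k=5).fit(X_train, y_train)
    start = time.perf_counter()
    clf.predict(X_test)
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```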
Generate 500 samples from a 2-dimensional multivariate normal with mean zero and total variance 1 (that is, each dimension should have variance 0.5). Then generate 500 samples from a 100-dimensional multivariate normal with mean zero and total variance 1 (that is, each dimension should have variance 0.01). For each of the two datasets, plot a histogram of the lengths of the samples and a histogram of the distances between all pairs of samples.
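A minimal sketch of this experiment using NumPy, SciPy, and Matplotlib (the variable names and figure layout are my own, not part of the assignment):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for d in (2, 100):
    # Total variance 1, split evenly: each dimension has variance 1/d.
    X = rng.normal(scale=np.sqrt(1.0 / d), size=(500, d))
    lengths = np.linalg.norm(X, axis=1)  # length of each sample
    distances = pdist(X)                 # distances between all pairs

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(lengths, bins=50)
    ax1.set_title(f"d={d}: sample lengths")
    ax2.hist(distances, bins=50)
    ax2.set_title(f"d={d}: pairwise distances")
plt.show()
```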
Given these observations, what can you say about how lengths and pairwise distances behave as the dimension grows?
Use the script dr.py to construct lower-dimensional versions of primary-tumor, then run your kNN classifier with a variety of settings for k. Experiment with varying values of the reduced dimension d as well. What do you find? What are the trends as you vary k and d?
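Since dr.py's interface is not shown here, the sketch below substitutes scikit-learn's PCA for the dimensionality-reduction step; replace it with the script's actual invocation. load_primary_tumor is a hypothetical loader standing in for however you read the data in Assignment 1, and KNNClassifier is the sketch from earlier.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical loader; use whatever you used for primary-tumor in Assignment 1.
X_train, y_train, X_val, y_val = load_primary_tumor()

for d in (2, 5, 10, 20):
    pca = PCA(n_components=d).fit(X_train)
    Xtr, Xva = pca.transform(X_train), pca.transform(X_val)
    for k in (1, 3, 5, 15):
        clf = KNNClassifier(k=k).fit(Xtr, y_train)
        acc = np.mean(clf.predict(Xva) == y_val)
        print(f"d={d}, k={k}: validation accuracy {acc:.3f}")
```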
Given the results you found in question 5, would it ever be useful to reduce the dimensionality of primary-tumor, if you had a well-principled way of doing it?