Carlos Scheidegger
http://cscheid.net
GDPR

<p>The
<a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a>
starts being enforceable in a week. The general state of people who
run websites seems to be, essentially, “I am freaking out, man”. But
for many of these websites, there’s an easy solution.</p>
<p>Just don’t log any PII. Shocking, I know.</p>
<h2 id="on-gdpr-specifically">On GDPR specifically</h2>
<p>First, read this by <a href="https://jacquesmattheij.com/">Mattheij</a>:</p>
<ul>
<li><a href="https://jacquesmattheij.com/gdpr-hysteria">GDPR Hysteria</a></li>
</ul>
<p>If you’re interested in GDPR specifically, this “plain-english”
summary of the whole law is great: <a href="https://blog.varonis.com/gdpr-requirements-list-in-plain-english/">GDPR requirements list in plain
english</a>.</p>
<p>The hyperbole around “my tiny website can be fined two million euros”
is silly. The most charitable explanation I’ve read comes from this great <a href="https://news.ycombinator.com/item?id=17100541">HN
comment</a>.</p>
<p>The rest of the hyperbole comes from a number of folks that only now
realize that should you be on the internet and choose to not abide by
the laws of a particular country, you might get sued in that
country. It’s true! And yet, it’s not the GDPR’s doing: this has
always been the case. Welcome to being a grownup in the world.</p>
<p>Do you still think “but mah daaataaa”?</p>
<h2 id="data-really-is-a-liability">Data really is a liability</h2>
<p>Read these three pieces by <a href="http://idlewords.com/about.htm">Cegłowski</a>:</p>
<ul>
<li><a href="http://idlewords.com/talks/sase_panel.htm">The Moral Economy of Tech</a></li>
<li><a href="http://idlewords.com/talks/deep_fried_data.htm">Deep-Fried Data</a></li>
<li><a href="http://idlewords.com/talks/haunted_by_data.htm">Haunted by Data</a></li>
</ul>
<h2 id="heres-how-i-solve-the-problem">Here’s how I solve the problem</h2>
<p>In my specific case, the important question was: Do I really need
those logs? What do I use them for? (as opposed to “what do I
fantasize about eventually using them for, one day, when I finally
have all that spare time I’d like to have?”)</p>
<p>It turns out that I don’t actually need them: the last time I used
the logs (many years ago), what I needed them for didn’t require any
IP information. I don’t run any analytics on this site now, and I
simply didn’t have a defensible use case for keeping the data.</p>
<p>I run nginx, so I <a href="https://docs.nginx.com/nginx/admin-guide/monitoring/logging/">configured
it</a> to
not store IP or user agent information in my logs. It turns out that
nginx naturally stores IP information in its error logs, and that
<a href="https://stackoverflow.com/questions/4246756/is-it-possible-to-specify-custom-error-log-format-in-nginx">cannot be
changed</a>. So
I just don’t store error logs (they go straight to <code>/dev/null</code>).</p>
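<p>For concreteness, here’s roughly what that looks like. This is a
sketch, not my exact configuration: the directives and variables
(<code>log_format</code>, <code>access_log</code>, <code>error_log</code>,
<code>$remote_addr</code>, <code>$http_user_agent</code>) are standard
nginx, but the format name and paths are illustrative.</p>
<pre><code>http {
    # A log format that omits $remote_addr and $http_user_agent.
    log_format privacy '[$time_local] "$request" $status $body_bytes_sent';
    access_log /var/log/nginx/access.log privacy;

    # The error log format is fixed and includes client IPs,
    # so discard error logs entirely:
    error_log /dev/null crit;
}
</code></pre>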
<p>(I’m not a lawyer. This isn’t legal advice. Don’t be an idiot.)</p>
Fri, 18 May 2018 00:00:00 -0600
http://cscheid.net//2018/05/18/GDPR.html

The negative binomial is weird

<p>Over at <a href="http://datacolada.org/archives/2799">Data Colada</a>, Leif
Nelson has a nice discussion about the shape of the probability
distribution that governs the <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.888.5463&rep=rep1&type=pdf">file drawer
problem</a>.
Go read that first. It’s good and simple, I promise.</p>
<p>On Twitter,
<a href="https://twitter.com/johnmyleswhite/status/991007589268951041">John Myles White</a>
pointed out a problem with the argument. Yes, the mode, median, and
mean are quite different from what we think they are, and this is cute
in itself. But is the difference between the median and the mean
really that big when $\alpha$ is small?</p>
<p>I was curious about this, and looked around online for an expression
for the median of the negative binomial, but couldn’t find anything. So
here’s a quick derivation.</p>
<h2 id="the-median">The median</h2>
<p>The PMF of the negative binomial, in our case, is $f(x) = \alpha (1 -
\alpha)^{x-1}$ (with a single success, this is just the geometric
distribution). Its CDF is annoying to work with, so we’ll
get an approximation of it instead, by blatantly pretending $f$ is a
continuous density and working with that.</p>
<p>The mean of the distribution is $1/\alpha$, and that is easy to find.
Then, note that the CDF of the negative binomial at its mean
$1/\alpha$ is approximately $1 - 1/e$; that’s literally the limit of
$1 - (1 - 1/n)^n$ for large $n$. So we know $F(1/\alpha) = 1 -
1/e$. What we’ll do is take a local (linear) approximation of $F$ at
that value and solve it for $F(x) = 1/2$, the median. We need the
slope of $F$ at the mean, but that’s simply the PDF. So we need to know
$f(1/\alpha)$. The calculation is simple:</p>
<script type="math/tex; mode=display">f(1/\alpha) = \alpha(1-\alpha)^{1/\alpha - 1} = k</script>
<script type="math/tex; mode=display">\log k = (1/\alpha - 1) \log (1 - \alpha) + \log \alpha</script>
<script type="math/tex; mode=display">\log k \approx 1/\alpha \log (1 - \alpha) + \log \alpha</script>
<p>Then we just remember that $\log (1+x) \approx x$ when $x$ is small,
and everything cancels out neatly:</p>
<script type="math/tex; mode=display">\log k \approx \log \alpha - 1</script>
<script type="math/tex; mode=display">k \approx \alpha / e</script>
<p>So we know that $F(x) \approx (\alpha/e)x + b$. But we also know that
$F(1/\alpha) = 1-1/e$, so we just solve for $b$, and get $b =
1-(2/e)$. Putting it all together, we get that</p>
<script type="math/tex; mode=display">F(x) \approx (\alpha / e) x + (1 - 2/e)</script>
<p>Finally solving for $F(x) = 1/2$ gives $x = (1/\alpha) \times (2 - e/2) \approx (1/\alpha) \times 0.640$.</p>
<p>Numerically, I’ve found that the number tends to be closer to $0.69
(1/\alpha)$, but I’m happy with the first-order approximation you get
from this quick calculation.</p>
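<p>The exact median is easy to check directly from the CDF $F(x) = 1 -
(1-\alpha)^x$: it’s the smallest integer $x$ with $F(x) \geq 1/2$. A
quick sketch (the helper name is mine):</p>
<pre><code>import math

def geometric_median(alpha):
    # smallest integer x with 1 - (1 - alpha)**x >= 1/2
    return math.ceil(math.log(0.5) / math.log(1 - alpha))

alpha = 1e-6
print(geometric_median(alpha) * alpha)  # ~0.693, i.e. log(2)
</code></pre>
<p>For small $\alpha$ the exact answer settles at $\log 2 \approx
0.693$ times the mean, which is where the $0.69$ above comes from.</p>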
<p>More importantly, what this shows is that the median is not <em>that</em> far
from the mean. Yes, it’s consistently skewed. But no, it’s not absurdly
small. In fact, it’s within a constant factor of the mean, and a
factor that’s pretty close to one. So the combination of ideas in the
original post, namely “the median is a better descriptor of this
distribution”, “no one has good intuition for the skew of the
file-drawer distribution”, and the implied conclusion “this partially
explains why the file drawer problem happens”, now seems weak to me.</p>
<h2 id="the-mode">The mode</h2>
<p>On further discussion, John mentions that the weirdest part of the
argument is the reference to the mode: yes, the mode of the negative
binomial is $1$, so the most common outcome of the “count how many
times you lose a billion-to-one bet until you win” experiment <em>is</em>
“one”. But the difference between this outcome and all other possible
outcomes is at most $10^{-9}$. So it seems better to think of a
“$\epsilon$-smeared mode” for a distribution, namely the set of all
values whose probability mass is within a factor of $1-\epsilon$ of
the probability at the mode. It’ll be obvious in this case that the
mode is very smeared around one.</p>
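<p>A two-line check makes the smearing concrete, using the
billion-to-one $\alpha = 10^{-9}$ from the example:</p>
<pre><code>alpha = 1e-9

def pmf(x):
    # P(first win on attempt x) = alpha * (1 - alpha)**(x - 1)
    return alpha * (1 - alpha) ** (x - 1)

# Outcome 1000 is less likely than the mode (x = 1) by a
# factor of only (1 - alpha)**999:
print(pmf(1000) / pmf(1))  # ~0.999999
</code></pre>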
Mon, 30 Apr 2018 00:00:00 -0600
http://cscheid.net//2018/04/30/the-negative-binomial-is-weird.html

A minimal tracing decorator for Python 3

<p>Python 3 has per-statement tracing, but that’s a little too
all-or-nothing for me. Every other tracer I found on the web was
vastly overengineered, so I hacked this decorator together this
afternoon:</p>
<pre><code>import inspect

trace_indent = 0

def tracing(f):
    sig = inspect.signature(f)
    def do_it(*args, **kwargs):
        global trace_indent
        ws = ' ' * (trace_indent * 2)
        print("%sENTER %s: " % (ws, f.__name__))
        for ix, param in enumerate(sig.parameters.values()):
            print("%s %s: %s" % (ws, param.name, args[ix]))
        trace_indent += 1
        result = f(*args, **kwargs)
        trace_indent -= 1
        print("%sEXIT %s (returned %s)" % (ws, f.__name__, result))
        return result
    return do_it
</code></pre>
<p>Then, for this program,</p>
<pre><code>@tracing
def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)

if __name__ == '__main__':
    print(fib(5))
</code></pre>
<p>You get this:</p>
<pre><code>$ python3 tracing_test.py
ENTER fib:
 n: 5
  ENTER fib:
   n: 4
    ENTER fib:
     n: 3
      ENTER fib:
       n: 2
        ENTER fib:
         n: 1
        EXIT fib (returned 1)
        ENTER fib:
         n: 0
        EXIT fib (returned 0)
      EXIT fib (returned 1)
      ENTER fib:
       n: 1
      EXIT fib (returned 1)
    EXIT fib (returned 2)
    ENTER fib:
     n: 2
      ENTER fib:
       n: 1
      EXIT fib (returned 1)
      ENTER fib:
       n: 0
      EXIT fib (returned 0)
    EXIT fib (returned 1)
  EXIT fib (returned 3)
  ENTER fib:
   n: 3
    ENTER fib:
     n: 2
      ENTER fib:
       n: 1
      EXIT fib (returned 1)
      ENTER fib:
       n: 0
      EXIT fib (returned 0)
    EXIT fib (returned 1)
    ENTER fib:
     n: 1
    EXIT fib (returned 1)
  EXIT fib (returned 2)
EXIT fib (returned 5)
5
</code></pre>
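<p>One caveat, which is my addition rather than something the decorator
above handles: the wrapper hides the decorated function’s name and
docstring (<code>fib.__name__</code> becomes <code>do_it</code>). If
that matters to you, <code>functools.wraps</code> fixes it with one
extra line:</p>
<pre><code>import functools
import inspect

trace_indent = 0

def tracing(f):
    sig = inspect.signature(f)
    @functools.wraps(f)   # copy f's __name__, __doc__, etc. onto do_it
    def do_it(*args, **kwargs):
        global trace_indent
        ws = ' ' * (trace_indent * 2)
        print("%sENTER %s: " % (ws, f.__name__))
        for ix, param in enumerate(sig.parameters.values()):
            print("%s %s: %s" % (ws, param.name, args[ix]))
        trace_indent += 1
        result = f(*args, **kwargs)
        trace_indent -= 1
        print("%sEXIT %s (returned %s)" % (ws, f.__name__, result))
        return result
    return do_it

@tracing
def fib(n):
    return n if n < 2 else fib(n-1) + fib(n-2)

print(fib.__name__)  # "fib", not "do_it"
</code></pre>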
<p>(MIT license, because why not.)</p>
Mon, 11 Dec 2017 00:00:00 -0700
http://cscheid.net//2017/12/11/minimal-tracing-decorator-python-3.html

How to hide the noise for LaTeX

<p>(Thanks to
<a href="https://twitter.com/johnregehr/status/887713442013913088">John</a> for
this one.) If you’re perpetually annoyed by the amount of stdout noise
that LaTeX generates, then instead of running</p>
<pre><code>pdflatex foo
</code></pre>
<p>run</p>
<pre><code>texfot pdflatex foo
</code></pre>
<p>It makes a world of difference.</p>
Mon, 18 Sep 2017 00:00:00 -0600
http://cscheid.net//2017/09/18/how-to-hide-latex-noise.html