The Defence for MD5

A few days ago, I tried to reset my password on PR.com, the press releases website. I entered my email, and they sent me the username and password in plain text. That’s right, in plain text.

“For your security”

The problem with this method of password storage is that anyone who gets access to the database can simply read the passwords. This is why hashing is used: it converts the plain-text password into a fixed-length “hashed” version that is, in an ideal world, irreversible. (Strictly speaking, hashing is one-way, so it isn’t encryption and there is nothing to decrypt.) The catch lies in how hashing fundamentally works: because the output has a fixed length, collisions must exist, i.e., multiple strings can have the same hash.

For example, if the hash function converts all vowels to “X”, then the hash of “Hello” is “HXllX” and the hash of “Hille” is also “HXllX”, even though the original strings are distinct. Of course, real-world hash functions are mathematically far more complex, but for the older ones collisions can still be found in practice. This is why MD5 and, more recently, SHA-1 aren’t recommended for security purposes, while larger ones such as SHA-256, which have no known collisions so far, are.
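The vowel example is small enough to run. This is only a sketch of the toy function described above, not a real hash; the name `toy_hash` is made up for illustration.

```python
def toy_hash(text: str) -> str:
    """Toy 'hash' from the example: replace every vowel with 'X'."""
    return "".join("X" if ch.lower() in "aeiou" else ch for ch in text)


# Two different inputs end up with the same output: a collision.
print(toy_hash("Hello"))
print(toy_hash("Hille"))
print(toy_hash("Hello") == toy_hash("Hille"))
```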

These two byte sequences (written here in hexadecimal) have the same MD5 hash:

String 1: 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2
String 2: 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2
Hash:     008ee33a9d58b51cfeb425b0959121c9
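If you want to check this yourself, here is a short sketch using Python’s standard hashlib. It assumes the two values above are hex-encoded bytes (64 bytes each) rather than ASCII text; the variable names are mine.

```python
import hashlib

# The two inputs listed above, treated as hex-encoded bytes (an assumption).
HEX_1 = "4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2"
HEX_2 = "4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2"

block_1 = bytes.fromhex(HEX_1)
block_2 = bytes.fromhex(HEX_2)

md5_1 = hashlib.md5(block_1).hexdigest()
md5_2 = hashlib.md5(block_2).hexdigest()

print("inputs differ:", block_1 != block_2)
print("MD5 of input 1:", md5_1)
print("MD5 of input 2:", md5_2)
print("MD5 digests match:", md5_1 == md5_2)

# For comparison, check whether SHA-256 tells the same two inputs apart.
sha_1 = hashlib.sha256(block_1).hexdigest()
sha_2 = hashlib.sha256(block_2).hexdigest()
print("SHA-256 digests match:", sha_1 == sha_2)
```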

The next step to safe password storage is called salting. Salting essentially means mixing extra characters into the string before hashing it. “Hello” can become “H1e2l3lo” if you insert “1”, “2”, and “3” after the first three characters. In practice, the salt is usually a random value generated for each user and stored next to the hash. This makes the stored hashes much more secure, because precomputed tables of common-password hashes no longer match, and two users with the same password end up with different hashes.
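Here is a minimal sketch of salting in Python. It makes a few assumptions that are mine, not part of the scheme described above: the salt is 16 random bytes per user, it is prepended to the password rather than interleaved, and it is stored alongside the digest. The function names are hypothetical, and MD5 is used only to mirror the rest of this post.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, str]:
    """Return (salt, digest). A fresh random salt is generated if none is given."""
    if salt is None:
        salt = secrets.token_bytes(16)  # assumption: 16 random bytes per user
    # MD5 mirrors the post; any hashlib algorithm could be swapped in here.
    digest = hashlib.md5(salt + password.encode("utf-8")).hexdigest()
    return salt, digest


def check_password(password: str, salt: bytes, expected_digest: str) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, stored = hash_password("Hello")
print(check_password("Hello", salt, stored))
print(check_password("Hille", salt, stored))
```

Because every call without an explicit salt draws a new one, hashing the same password twice gives two different digests, which is the point of salting.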

Now, even though collisions can be found in MD5, it’s still much, much better at storing sensitive information than plain text. Since intruders usually just match the stolen hashes against hashes of common passwords, dictionary words, combinations, etc., the brute-force method becomes inefficient if you have a nice, long password.

This is why, as long as passwords are lengthy and therefore relatively secure, “outdated” hashing algorithms such as MD5 are actually not a bad choice if the decision is as simple as storing md5($string) versus $string. I have a nice long Facebook password, and I’ve decided to make its MD5 hash public to prove my point:

cf7dd0b01c061029778c72facdc14451

Even though it’s just MD5, I don’t think anyone can reverse it. Not for 573 quadrillion years, at least.

Footnote: I’m not saying that we should use MD5 to sign TLS certificates, that’s crazy talk. All I’m saying is that (a) MD5 is better than plain text, and (b) it works for practical purposes, as long as there’s no sensitive data to be accessed and the user has a long, non-dictionary password.

Adding a Site to Digital Ocean

I recently migrated my server to Digital Ocean. I ended up using a few commands over and over, so I thought I’d put them up here. On this droplet, I’m using LAMP on Ubuntu 16.04.

The first step is to add a virtual host to point the domain to a particular directory.

sudo cp /etc/apache2/sites-available/000-default.conf /etc/apache2/sites-available/oswald.co.in.conf

Set the ServerAdmin, ServerName, and DocumentRoot in the new file, enable the site (sudo a2ensite oswald.co.in.conf), and then restart Apache:

sudo service apache2 restart

Then, to add a Let’s Encrypt TLS/SSL certificate to the domain:

sudo letsencrypt --apache -d oswald.co.in

To execute this command, there are certain prerequisites, like installing Apache, Let’s Encrypt, and Python. You should also add a cron job for automated renewal. You can learn more about that in this Digital Ocean tutorial.

Tokens for Authentication

Something that I’ve recently started experimenting with is token-based authentication. Since I’ve been using more JavaScript and less PHP, I figured I could try using tokens in a RESTful API instead of sessions on the server. Instead of using a framework like OAuth (which I highly recommend using), I tried to recreate the token process. Here’s what I came up with.

This is usually how the process works: a user logs in, and a token is generated. The token is stored on the client (usually in a session, lately also as a local storage object). Then, to call an API, the view sends the token along with the request. The server checks the integrity of the token and returns the relevant response. Each token contains a “private key” of sorts that only the server could’ve created. JWT does this really well. This is how I did it while playing around:

The token is a hashed version of the user’s unique key (a primary key like ID or, as I used, the username) along with the date two days into the future and a secret key. In PHP, I wrote it like this:

$token = md5($input["username"] . date("Ymd", strtotime("+2 days")) . "secretkey123");

In this case, the non-hashed string looks like anand20160216secretkey123. I use this particular one because it’s going to be unique for every user (username) and it has a “secret” key. I chose +2 days because of how I’m checking for integrity:

When an API request is sent, the username is sent too, and the server knows the date. So we create two strings, username + date 1 day in the future + “secret” key, and another with the date 2 days into the future. So if a user logs in at 11:00 pm, the key works for 25 hours, and if a user logs in at 1:00 am, it works for 47 hours. Either way, a key works for at least one day and at most two days. This is good because we don’t want the key to work for more than a day or two.
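The generate-and-check logic described above can be sketched in Python roughly like this. It is a translation of the idea, not the PHP I actually ran: the function names are made up, the secret is the placeholder from the PHP line, and the current date is passed in explicitly so the behaviour is easy to try out.

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET_KEY = "secretkey123"  # placeholder secret from the PHP example


def _digest(username: str, day: date) -> str:
    """MD5 of username + YYYYMMDD date + secret, as in the PHP line."""
    raw = username + day.strftime("%Y%m%d") + SECRET_KEY
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def make_token(username: str, today: date) -> str:
    """Issue a token stamped with the date two days from now."""
    return _digest(username, today + timedelta(days=2))


def is_valid(username: str, token: str, today: date) -> bool:
    """Accept the token if it matches the date one or two days ahead."""
    for days_ahead in (1, 2):
        candidate = _digest(username, today + timedelta(days=days_ahead))
        if hmac.compare_digest(candidate, token):
            return True
    return False


login_day = date(2016, 2, 14)
token = make_token("anand", login_day)  # hashes "anand20160216secretkey123"

for offset in range(3):
    day = login_day + timedelta(days=offset)
    print(day, is_valid("anand", token, day))
```

The token is accepted on the login day and the day after, and rejected from the second day onward, which is where the “between one and two days” window comes from.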

Why this is bad

This is bad because we have no way of killing the token when a session ends. When a user logs out and logs in again on the same day, exactly the same token is generated, because it depends only on the username, the server date, and the secret key. It works as a science experiment, but real-life projects call for a fully developed and tested solution like JWT or OAuth.

Moving to Digital Ocean

For the past few months, I’ve been using a Hostgator cloud server which costs about $15 per month (I paid by the month) with four cores, 4 GB of RAM, unlimited domains, and “unmetered” storage and bandwidth. I chose to go for cloud hosting instead of shared because I assumed it was more scalable, but I was very wrong.

At about 10:40 pm last night, the Hostgator team added a mysterious .htaccess file to the root directory of my server that only allowed certain IP addresses to access the files. Instead of Allow from all, they listed some IPs that could access the server, and I started getting a 403 error on all my files. Hostgator didn’t ask my permission before stopping my files from loading, and I only found out when a client messaged me that the bootstrap.css file wasn’t loading and the website’s design was breaking. The least they could’ve done is send an email. The only reason all my resources didn’t stop loading immediately was Cloudflare’s caching. (I highly recommend Cloudflare, now that I’ve seen how useful it is in such situations.)

Anyway, I opened up Digital Ocean and created a droplet, installed LAMP and phpMyAdmin on Ubuntu 16.04, configured a domain and Let’s Encrypt, installed this WordPress blog and migrated the content, and am currently moving all my files, domains, and APIs to this droplet. The good part is that I taught myself how to use SSH and SFTP, something I never bothered learning earlier. Thanks, Hostgator!

Reboot is a Better Normalize

Time and again I’ve included normalize.css but have needed to add additional basic styling like box-sizing: border-box and unidirectional margins. If I had the time, I would’ve made my own normalize.css plus extra features, so I was very excited when I found exactly that in the alpha of Bootstrap 4.

Reboot builds upon Normalize, providing many HTML elements with somewhat opinionated styles using only element selectors.

It fixes the defaults, fonts, headings, forms, etc., without styling them too much, just the bare necessities. I’m in love with Reboot. I think Reboot is the only styling a webpage really needs.

URL Shortener Length

I made a small URL shortener for Oswald at osw.li in an hour using PHP and MySQL, but I want to learn the MEAN stack, so I thought that this could be a fun starter project. One interesting decision was how many characters the shortened URL’s slug should be.

There can be 64 possible characters: A to Z, a to z, 0 to 9, - and _. Even if we make a 3-character slug, there can be 64^3 = 262,144 possible URLs, which is a big number. The trouble happens with collisions, though. After how many URLs would a pseudorandom generator repeat itself? I wrote some JavaScript to find out.

It essentially creates slugs until one is repeated and returns the number at which the repetition happened. It does this 10,000 times and logs the average.

// Build a random slug of the given length from the 64-character alphabet.
function randomString(length) {
	const chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";
	let result = "";
	for (let i = 0; i < length; i++) result += chars[Math.floor(Math.random() * chars.length)];
	return result;
}
const trials = 10000;
// Slug lengths 1 to 5 (a 0-character slug is always the empty string).
for (let k = 1; k <= 5; k++) {
	let sum = 0;
	for (let j = 0; j < trials; j++) {
		// Generate slugs until one repeats; a Set keeps the lookup fast.
		const seen = new Set();
		let slug = randomString(k);
		while (!seen.has(slug)) {
			seen.add(slug);
			slug = randomString(k);
		}
		// Number of unique slugs generated before the first repetition.
		sum += seen.size;
	}
	const avg = sum / trials;
	console.log("For a " + k + "-character string, there will be repetition after " + avg + " strings");
}

For a 3-character slug, repetition happened at around 640. This means that after around 640 shortened URLs, we should expect to hit our first collision and have to re-generate a slug. For a 4-character one, it was around 5,000. And for 5, it was around 40,000.
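These numbers line up with the birthday problem: with N equally likely slugs, the first repetition is expected after roughly sqrt(pi * N / 2) draws. Here is a small sketch that computes this approximation for comparison with the simulation; the function name is made up, and the formula is the standard approximation rather than an exact value.

```python
import math


def expected_before_repeat(alphabet_size: int, length: int) -> float:
    """Birthday-problem approximation: about sqrt(pi * N / 2) draws for N possible slugs."""
    n = alphabet_size ** length
    return math.sqrt(math.pi * n / 2)


for length in range(1, 6):
    estimate = expected_before_repeat(64, length)
    print(f"{length}-character slug: {64 ** length} possible, repetition after about {estimate:.0f}")
```

Each extra character multiplies the number of possible slugs by 64, so the expected point of first repetition grows by a factor of 8.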

Of course, we’ll also check if the slug already exists, but to (mostly) avoid collisions, a 5-character slug makes the most sense: around 40,000 URLs, on average, before we ever have to re-generate. Interesting.

Footnote: If the odds are that repetition happens after 40,000 URLs, do we really have to send in a database query every time to check? And if we’re doing that, why not stick to 4-character ones? They’re shorter, and there can be over 16 million possible URLs. I pick 5-character because adding one character pushes the expected first repetition about 8 times further out (from around 5,000 to around 40,000), but 4 isn’t too bad if we’re checking anyway.

Machine Learning in 6 Lines

I’ve only very recently started experimenting with Machine Learning, but Python has made it super simple. First, set up a scikit-learn environment (I used Anaconda) and import the decision tree classifier.

from sklearn import tree

And that’s line 1. Run this Python script, and, if there are no errors, we have our environment set up. Now let’s get some data. In the following, we’re using two lists: features (one single-value row per call) and labels. Consider a phone app where we save the names of the contacts I called, along with the time when I called them (written as HH.MM).

features = [[10.00], [10.30], [12.10], [12.55], [14.00], [15.00], [18.00], [18.07], [20.00], [21.00]]
labels = ['Mom', 'Mom', 'Doctor', 'Doctor', 'Friend', 'Friend', 'Girlfriend', 'Girlfriend', 'Mom', 'Mom']

That brings us to line 3. These lists can be populated using the history of your phone app, where each label corresponds to a feature, and we use this information to predict who you might want to call. Let’s set up a classifier, in this case the Decision Tree Classifier, and start predicting after fitting the data.

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
print(clf.predict([[15.30]]))

And that’s it. When you execute this script, you’ll get the following output:

['Friend']

Which is precisely what we were aiming for. Even though we hadn’t explicitly told the computer who we might want to call at 3:30 pm, it recognized the calling pattern to generate this answer. That’s machine learning in six lines.

This is what a simple application of this could look like. We convert the current time to the same HH.MM decimal form, and print who you might want to call right now.

from datetime import datetime
from sklearn import tree
# Call history: time of each call as HH.MM, and who was called.
features = [[10.00], [10.30], [12.10], [12.55], [14.00], [15.00], [18.00], [18.07], [20.00], [21.00]]
labels = ['Mom', 'Mom', 'Doctor', 'Doctor', 'Friend', 'Friend', 'Girlfriend', 'Girlfriend', 'Mom', 'Mom']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
# Current time in the same HH.MM form, e.g. 14:05 becomes 14.05.
now = float(datetime.now().strftime("%H.%M"))
print(clf.predict([[now]]))