Levenshtein, Sørensen–Dice, and Practical Information Theory
This started with me seeing a tweet announcing an npm package on GitHub. “Finds degree of similarity between two strings, based on Dice’s Coefficient, which is mostly better than Levenshtein distance”, the repo’s description reads, and that got me curious.
What’s a Dice Coefficient? Or Sørensen–Dice Coefficient, to get formal? You can click through to read the formula, but the thumbnail is that it’s a float between 0 and 1 indicating how similar two words are.
It says it’s better than the Levenshtein Distance. The Levenshtein Distance is the count of edits required to get from one string to another.
I’m happy to know that npm has this library in it, but I’m a Perl guy, and both Text::Dice and Text::Levenshtein, so it was faster for me to go forward with those. I wrote get.pl
, where the words in comparison are the standard set of git commands, and the command I’m testing against is gerp
.
DICE
----
gerp merge 3 0.571428571428571
gerp grep 2 0
gerp rm 3 0
gerp log 4 0
gerp push 4 0
gerp tag 4 0
gerp pull 4 0
gerp init 4 0
gerp diff 4 0
gerp show 4 0
gerp mv 4 0
gerp fetch 4 0
gerp add 4 0
gerp bisect 5 0
gerp clone 5 0
gerp rebase 5 0
gerp branch 6 0
gerp commit 6 0
gerp status 6 0
LEVENSHTEIN
------------
gerp grep 2 0
gerp merge 3 0.571428571428571
gerp rm 3 0
gerp tag 4 0
gerp pull 4 0
gerp push 4 0
gerp log 4 0
gerp fetch 4 0
gerp add 4 0
gerp mv 4 0
gerp show 4 0
gerp diff 4 0
gerp init 4 0
gerp bisect 5 0
gerp rebase 5 0
gerp clone 5 0
gerp branch 6 0
gerp status 6 0
gerp commit 6 0
The two words of interest here are grep and merge.
gerp grep 2 0
gerp merge 3 0.571428571428571
Basically, if you’re using Levenshtein, you’d go with grep
and if you using the coefficient, you’d guess merge
. If someone typed git gerp mysql
, I would go with Levenshtein and git and suggest you meant grep
.
This is not me saying Dice is bad. I just don’t know when that would make more sense. Understanding sentences rather than words? Anyway, I’m glad to know about both, but practically, for my tool-making purposes, I’m still reaching for Levenshtein first.
If you have any questions or comments, I would be glad to hear it. Ask me on Twitter or make an issue on my blog repo.