amatch is a high-performance collection of classes used for approximate
matching, searching, and comparing strings. It provides an efficient Ruby
interface to several industry-standard algorithms for calculating edit distance
and string similarity.
The library implements a wide array of metrics to suit different matching needs:
- Levenshtein Distance: The classic "edit distance" (insertions, deletions, substitutions).
- Sellers Algorithm: A variation of Levenshtein optimized for searching a pattern within a longer text.
- Damerau-Levenshtein: Similar to Levenshtein but considers transpositions of two adjacent characters as a single edit.
- Hamming Distance: Measures the number of positions at which corresponding symbols are different (only for strings of equal length).
- Jaro-Winkler: A metric geared towards short strings like names, giving more weight to prefix matches.
- Pair Distance: A flexible distance metric based on character pairs
(also available as
Amatch::DiceCoefficient). Unlike set-based measures, this implementation uses multisets, meaning it is sensitive to the frequency of repeated character pairs. - Longest Common Subsequence/Substring: Finds the longest shared sequences between two strings.
You can install the extension as a gem:
gem install amatchAlternatively, if you prefer manual installation:
ruby install.rb
# or
rake installTo get started, simply require the library and include the Amatch module to
add similarity methods directly to the String class.
require 'amatch'
include AmatchThese algorithms return the "cost" to transform one string into another. Lower values indicate higher similarity.
# Standard Levenshtein
m = Levenshtein.new("pattern")
m.match("pattren") # => 2
"pattern language".levenshtein_similar("language of patterns") # => 0.2
# Damerau-Levenshtein (handles transpositions)
m = Amatch::DamerauLevenshtein.new("pattern")
m.match("pattren") # => 1
"pattern language".damerau_levenshtein_similar("language of patterns") # => 0.2Sellers is particularly useful for finding the best match of a pattern within a larger body of text.
m = Sellers.new("pattern")
m.match("pattren") # => 2.0
# You can customize weights for different edit types
m.substitution = m.insertion = 3
m.match("pattren") # => 4.0
m.reset_weights
m.search("abcpattrendef") # => 2.0Used primarily for strings of equal length to count substitutions.
m = Hamming.new("pattern")
m.match("pattren") # => 2
"pattern language".hamming_similar("language of patterns") # => 0.1These algorithms typically return a score between 0.0 and 1.0, where 1.0
is a perfect match.
Highly effective for record linkage and matching names.
m = JaroWinkler.new("pattern")
m.match("paTTren") # => 0.9714...
m.ignore_case = false
m.match("paTTren") # => 0.7942...
# Custom scaling factor for prefix bonus
m.scaling_factor = 0.05
m.match("pattren") # => 0.9619...
"pattern language".jarowinkler_similar("language of patterns") # => 0.6722...The base metric for the Winkler variation.
m = Jaro.new("pattern")
m.match("paTTren") # => 0.9523...
"pattern language".jaro_similar("language of patterns") # => 0.6722...# Pair Distance
# Note: This implementation uses multisets, meaning it considers character
# frequencies rather than just unique pairs.
m = PairDistance.new("pattern")
m.match("pattr en") # => 0.5454...
# Pro Tip: Pass a regex as the second argument to match based on tokens
# (e.g., words) rather than individual characters. This is particularly
# useful for natural language.
m.match("language of patterns", /\s+/)
"pattern language".pair_distance_similar("language of patterns", /\s+/) # => 0.9285...
# Longest Common Subsequence
m = LongestSubsequence.new("pattern")
m.match("pattren") # => 6
"pattern language".longest_subsequence_similar("language of patterns") # => 0.4
# Longest Common Substring
m = LongestSubstring.new("pattern")
m.match("pattren") # => 4
"pattern language".longest_substring_similar("language of patterns") # => 0.4amatch is implemented as a C extension to ensure maximum throughput when
processing large datasets or complex string comparisons.
The homepage of this library is located at:
Apache License, Version 2.0 β See the COPYING file in the source archive.
