Skip to content

flori/amatch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

102 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

amatch - Approximate Matching Extension for Ruby πŸ“

Description πŸ“

amatch is a high-performance collection of classes used for approximate matching, searching, and comparing strings. It provides an efficient Ruby interface to several industry-standard algorithms for calculating edit distance and string similarity.

Supported Algorithms 🧩

The library implements a wide array of metrics to suit different matching needs:

  • Levenshtein Distance: The classic "edit distance" (insertions, deletions, substitutions).
  • Sellers Algorithm: A variation of Levenshtein optimized for searching a pattern within a longer text.
  • Damerau-Levenshtein: Similar to Levenshtein but considers transpositions of two adjacent characters as a single edit.
  • Hamming Distance: Measures the number of positions at which corresponding symbols are different (only for strings of equal length).
  • Jaro-Winkler: A metric geared towards short strings like names, giving more weight to prefix matches.
  • Pair Distance: A flexible distance metric based on character pairs (also available as Amatch::DiceCoefficient). Unlike set-based measures, this implementation uses multisets, meaning it is sensitive to the frequency of repeated character pairs.
  • Longest Common Subsequence/Substring: Finds the longest shared sequences between two strings.

Installation πŸ“¦

You can install the extension as a gem:

gem install amatch

Alternatively, if you prefer manual installation:

ruby install.rb
# or
rake install

Usage πŸ› οΈ

Basic Setup

To get started, simply require the library and include the Amatch module to add similarity methods directly to the String class.

require 'amatch'
include Amatch

Edit Distance Algorithms πŸ“‰

These algorithms return the "cost" to transform one string into another. Lower values indicate higher similarity.

Levenshtein & Damerau-Levenshtein

# Standard Levenshtein
m = Levenshtein.new("pattern")
m.match("pattren") # => 2
"pattern language".levenshtein_similar("language of patterns") # => 0.2

# Damerau-Levenshtein (handles transpositions)
m = Amatch::DamerauLevenshtein.new("pattern")
m.match("pattren") # => 1
"pattern language".damerau_levenshtein_similar("language of patterns") # => 0.2

Sellers (Pattern Searching)

Sellers is particularly useful for finding the best match of a pattern within a larger body of text.

m = Sellers.new("pattern")
m.match("pattren") # => 2.0

# You can customize weights for different edit types
m.substitution = m.insertion = 3
m.match("pattren") # => 4.0

m.reset_weights
m.search("abcpattrendef") # => 2.0

Hamming Distance

Used primarily for strings of equal length to count substitutions.

m = Hamming.new("pattern")
m.match("pattren") # => 2
"pattern language".hamming_similar("language of patterns") # => 0.1

Similarity Metrics πŸ“ˆ

These algorithms typically return a score between 0.0 and 1.0, where 1.0 is a perfect match.

Jaro-Winkler

Highly effective for record linkage and matching names.

m = JaroWinkler.new("pattern")
m.match("paTTren") # => 0.9714...
m.ignore_case = false
m.match("paTTren") # => 0.7942...

# Custom scaling factor for prefix bonus
m.scaling_factor = 0.05
m.match("pattren") # => 0.9619...

"pattern language".jarowinkler_similar("language of patterns") # => 0.6722...

Jaro

The base metric for the Winkler variation.

m = Jaro.new("pattern")
m.match("paTTren") # => 0.9523...
"pattern language".jaro_similar("language of patterns") # => 0.6722...

Other Metrics (Pair Distance, LCS, Longest Substring)

# Pair Distance
# Note: This implementation uses multisets, meaning it considers character 
# frequencies rather than just unique pairs.
m = PairDistance.new("pattern")
m.match("pattr en") # => 0.5454...

# Pro Tip: Pass a regex as the second argument to match based on tokens
#  (e.g., words) rather than individual characters. This is particularly
# useful for natural language.
m.match("language of patterns", /\s+/)
"pattern language".pair_distance_similar("language of patterns", /\s+/) # => 0.9285...

# Longest Common Subsequence
m = LongestSubsequence.new("pattern")
m.match("pattren") # => 6
"pattern language".longest_subsequence_similar("language of patterns") # => 0.4

# Longest Common Substring
m = LongestSubstring.new("pattern")
m.match("pattren") # => 4
"pattern language".longest_substring_similar("language of patterns") # => 0.4

Performance ⚑

amatch is implemented as a C extension to ensure maximum throughput when processing large datasets or complex string comparisons.

performance

Download πŸ“₯

The homepage of this library is located at:

Author πŸ‘¨β€πŸ’»

Florian Frank

License πŸ“„

Apache License, Version 2.0 – See the COPYING file in the source archive.

About

Approximate String Matching library

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors