Add auto-language extra for code block language detection (#361) #706

Open
Darkness1521 wants to merge 1 commit into trentm:master from Darkness1521:master

Conversation

Darkness1521 commented May 18, 2026

Closes #361

Summary

  • Add auto-language extra that automatically detects the programming
    language of fenced code blocks without explicit language tags
  • Uses heuristic pattern matching (no new dependencies)
  • Supports 14 languages (counting C/C++ as one): Python, JavaScript,
    HTML, CSS, SQL, Bash, Java, Go, Rust, Ruby, PHP, JSON, YAML, C/C++

How it works

Runs before fenced-code-blocks in the processing pipeline. When a code
block has no language tag, it analyzes the content and inserts the
detected language name. The existing fenced-code-blocks and Pygments
highlighting then process it normally.
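A minimal sketch of the tagging step described above, not the PR's actual code. The names `detect_language_stub` and `tag_untagged_fences` are hypothetical, and the stub recognizes only two languages for illustration; the real `detect_language` scores many more patterns.

```python
import re

FENCE = '`' * 3  # a Markdown code fence marker

def detect_language_stub(code):
    """Hypothetical stand-in for the PR's detect_language()."""
    if re.search(r'^\s*(def\s+\w+\s*\(|import\s+\w+)', code, re.M):
        return 'python'
    if re.search(r'^\s*SELECT\b.+\bFROM\b', code, re.M | re.I):
        return 'sql'
    return None  # unknown: leave the fence untagged

def tag_untagged_fences(text):
    """Insert a detected language name after fences that have no tag."""
    lines = text.split('\n')
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if not line.startswith(FENCE):
            out.append(line)
            i += 1
            continue
        # Find the closing fence for this block.
        j = i + 1
        while j < len(lines) and lines[j].strip() != FENCE:
            j += 1
        body = lines[i + 1:j]
        # Only retag closed blocks whose opening fence is bare.
        if line.strip() == FENCE and j < len(lines):
            lang = detect_language_stub('\n'.join(body))
            if lang:
                line = FENCE + lang
        out.append(line)
        out.extend(body)
        if j < len(lines):
            out.append(lines[j])
        i = j + 1
    return '\n'.join(out)
```

Because the output is ordinary tagged Markdown, the later fenced-code-blocks stage needs no changes in this sketch; blocks that already carry a tag, or that the detector cannot classify, pass through untouched.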

Usage

import markdown2
html = markdown2.markdown(text, extras=["fenced-code-blocks", "auto-language"])

Tests

On 24 short code snippets (the typical case for markdown code blocks):

| Method | Accuracy |
|---|---|
| Our heuristic (detect_language) | 24/24 = 100% |
| Pygments guess_lexer() | 6/24 = 25% |

The test suite passes with no regressions. All changes comply with the
project's contribution guidelines (PEP8, test coverage, docs updated).

Result

with auto-language: [screenshot: with auto-language]

without auto-language: [screenshot: without auto-language]

Comment thread lib/markdown2.py

# -- Python --
s = 0
if re.search(r'^\s*(def\s+\w+\s*\(|class\s+\w+.*:)',
Contributor

Perhaps a lot of these consecutive if statements could be compressed?

Maybe something like

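# [compiled pattern, score] pairs for one language
python = [
  [re.compile(r'^\s*(def\s+\w+\s*\(|class\s+\w+.*:)'), 5],
  [re.compile(r'^\s*(import\s+\w+|from\s+\w+\s+import\b)', re.M), 4],
  # ...
]
# `python` is a list, so iterate it directly (lists have no .items())
scores['python'] = sum(score for regex, score in python if regex.search(code))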
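The table-driven idea suggested above could be fleshed out roughly as follows. This is a sketch under assumptions: the `RULES` table, its patterns and weights, the `threshold` parameter, and the function names are all illustrative and are not taken from the PR.

```python
import re

# Hypothetical rule table: language -> list of (compiled pattern, weight).
# Patterns and weights are illustrative only.
RULES = {
    'python': [
        (re.compile(r'^\s*(def\s+\w+\s*\(|class\s+\w+.*:)', re.M), 5),
        (re.compile(r'^\s*(import\s+\w+|from\s+\w+\s+import\b)', re.M), 4),
    ],
    'javascript': [
        (re.compile(r'^\s*(const|let|var)\s+\w+\s*=', re.M), 4),
        (re.compile(r'\bfunction\s+\w*\s*\('), 4),
    ],
}

def score_languages(code):
    """Sum the weights of every rule that matches, per language."""
    return {
        lang: sum(weight for regex, weight in rules if regex.search(code))
        for lang, rules in RULES.items()
    }

def best_language(code, threshold=4):
    """Return the top-scoring language, or None if nothing scores enough."""
    scores = score_languages(code)
    lang = max(scores, key=scores.get)
    return lang if scores[lang] >= threshold else None
```

Keeping the rules as data means adding a language is a matter of adding a table entry rather than another chain of if statements.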

Comment thread lib/markdown2.py
return '```' in text


def detect_language(code: str):
Contributor

This function is specific to the extra, so it should probably be part of the AutoLanguage class.

@@ -0,0 +1,276 @@
"""
Compare our heuristic detect_language() against Pygments guess_lexer()
Contributor

How many languages does guess_lexer support? If the Pygments implementation is broader and better tested, perhaps it could be offered as a fallback when the dependency is present?
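For illustration, such a fallback might look like the sketch below. The function name `detect_with_fallback` and the `heuristic` parameter are hypothetical. The heuristic is tried first, and Pygments is consulted only when the heuristic gives no answer and the package is importable.

```python
def detect_with_fallback(code, heuristic):
    """Try the heuristic detector, then Pygments if it is installed.

    `heuristic` is any callable returning a language name or None.
    """
    lang = heuristic(code)
    if lang is not None:
        return lang
    try:
        # Optional dependency: skip the fallback if Pygments is absent.
        from pygments.lexers import guess_lexer
        from pygments.util import ClassNotFound
    except ImportError:
        return None
    try:
        lexer = guess_lexer(code)
    except ClassNotFound:
        return None
    # A lexer's first alias is its short name, e.g. 'python'.
    return lexer.aliases[0] if lexer.aliases else None
```

The import happens inside the function so that the extra keeps working without Pygments, consistent with the PR's "no new dependencies" goal.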

Contributor

Oh, I just saw the test results you added in the PR comparing accuracy against Pygments. This is not needed then, I guess.

Development

Successfully merging this pull request may close these issues.

[Feature request] language guess for code blocks
