Why not use pandas.DataFrame.to_markdown() instead of converting XLS/XLSX to HTML and then to Markdown? #328

kirisame-wang · 2025-02-12T05:51:51Z

I've been working on converting XLS/XLSX files into Markdown format. Currently, the process involves using the markitdown library, which first converts the Excel files to HTML using pandas.DataFrame.to_html(), and then transforms the HTML into Markdown using BeautifulSoup.

Given that pandas offers a DataFrame.to_markdown() method, wouldn't it be more efficient to use this method directly for the conversion? This approach seems to provide a direct way to convert DataFrames into Markdown tables, potentially eliminating the need to first convert DataFrames to HTML and then parse the HTML into Markdown.

Are there specific reasons or advantages for the current method that involves HTML conversion? Would using DataFrame.to_markdown() be a more streamlined solution?

I appreciate any insights or explanations regarding this approach.

Answered by ltianyi992

ltianyi992 · 2026-05-31T05:03:58Z

Good question. There are a few concrete reasons the current HTML-intermediate path was chosen over calling DataFrame.to_markdown() directly:

1. tabulate is not a mandatory dependency
DataFrame.to_markdown() requires tabulate to be installed. The HTML path only needs pandas (already required) plus BeautifulSoup (also already present for HTML conversion). Adding tabulate would either be a new hard dependency or require conditional logic.

2. The HTML path handles more Excel features
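Excel sheets can contain merged cells, multi-level headers, mixed types, and formatted numbers. DataFrame.to_html() keeps part of that structure: multi-level (MultiIndex) headers and indexes are emitted with colspan/rowspan attributes. to_markdown() has no equivalent and renders everything in a single flat header row, so that structure is lost. Note that merged cells and number formats are already dropped when pandas reads the sheet, so neither path recovers those.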

3. Consistent processing pipeline
By going through HTML first, the XLSX converter reuses the same HtmlConverter pipeline that handles all other HTML sources. This keeps a single code path for HTML→Markdown translation rather than two separate rendering paths.

4. Formatting fidelity
to_markdown() defaults to GFM pipe-table syntax, which can break on cells containing | or newlines: an unescaped pipe is read as a column separator, and a table row has to fit on one line. The HTML→markdownify pipeline can handle these edge cases more robustly.
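To make point 4 concrete, here is a minimal stdlib-only sketch of what a pipe-table writer has to do with awkward cells. The helper names escape_cell and to_pipe_table are hypothetical and are not part of pandas, tabulate, or markitdown. The sketch assumes GFM rules, where a literal | inside a cell is written as \| and a cell cannot span several lines.

```python
def escape_cell(value):
    """Make one value safe to place inside a GFM pipe-table cell."""
    text = "" if value is None else str(value)
    # A bare pipe would be read as a column separator, so escape it.
    text = text.replace("|", "\\|")
    # A table row must stay on one line, so join embedded lines with <br>.
    return "<br>".join(text.splitlines())


def to_pipe_table(headers, rows):
    """Render headers and rows as a GFM pipe table (illustrative only)."""
    lines = [
        "| " + " | ".join(escape_cell(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(escape_cell(c) for c in row) + " |")
    return "\n".join(lines)


table = to_pipe_table(["name", "note"], [["a|b", "line one\nline two"]])
print(table)
```

Without the two replacements in escape_cell, the data row above would gain an extra column and split across two lines, which is the failure mode described in point 4.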

That said, to_markdown() is a valid, simpler alternative for plain numeric/string spreadsheets. If you want to experiment with it, you can write a custom converter:

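from markitdown import MarkItDown, PRIORITY_SPECIFIC_FILE_FORMAT
from markitdown._base_converter import DocumentConverter, DocumentConverterResult
import pandas as pd

class SimpleXlsxConverter(DocumentConverter):
    """Render each sheet with DataFrame.to_markdown() (requires tabulate)."""

    def accepts(self, file_stream, stream_info, **kwargs):
        # Match on the file extension only; it may be None for raw streams.
        return (stream_info.extension or "").lower() in (".xlsx", ".xls")

    def convert(self, file_stream, stream_info, **kwargs):
        # sheet_name=None returns a dict of {sheet name: DataFrame}.
        # pandas needs openpyxl to read .xlsx and xlrd to read .xls.
        sheets = pd.read_excel(file_stream, sheet_name=None)
        parts = []
        for name, df in sheets.items():
            parts.append(f"## {name}\n\n{df.to_markdown(index=False)}")
        return DocumentConverterResult(markdown="\n\n".join(parts))

md = MarkItDown()
# Register as a format-specific converter.
md.register_converter(SimpleXlsxConverter(), priority=PRIORITY_SPECIFIC_FILE_FORMAT)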
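One practical caveat for the converter above: DataFrame.to_markdown() raises an ImportError when tabulate is not installed, which is the kind of conditional logic mentioned in point 1. A minimal sketch of such a check, using only the standard library, might look like this. The helper name has_module and the printed messages are illustrative assumptions, not markitdown code.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be imported."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(name) is not None


if has_module("tabulate"):
    print("tabulate found: DataFrame.to_markdown() can be used")
else:
    print("tabulate missing: install it or keep using the HTML route")
```

A converter could run this check in accepts() and return False when tabulate is absent, so that the built-in HTML-based converter handles the file instead.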

1 reply

kirisame-wang Jun 6, 2026
Author

Appreciate the detailed breakdown!
