Why not use pandas.DataFrame.to_markdown() instead of converting XLS/XLSX to HTML and then to Markdown? #328
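I've been working on converting XLS/XLSX files into Markdown format. Currently, the process involves converting the file to HTML first and then converting that HTML to Markdown. Given that pandas offers a `DataFrame.to_markdown()` method, are there specific reasons or advantages for the current method that involves HTML conversion? Would using `DataFrame.to_markdown()` directly be a reasonable alternative? I appreciate any insights or explanations regarding this approach.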
Replies: 1 comment 1 reply
Good question. There are a few concrete reasons the current HTML-intermediate path was chosen over `DataFrame.to_markdown()`:

1. `tabulate` is not a mandatory dependency
2. The HTML path handles more Excel features
3. Consistent processing pipeline
4. Formatting fidelity

That said, if you want the `to_markdown()` route, a custom converter can be registered:

```python
from markitdown import MarkItDown, PRIORITY_SPECIFIC_FILE_FORMAT
from markitdown._base_converter import DocumentConverter, DocumentConverterResult
import pandas as pd


class SimpleXlsxConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        # Only claim Excel files; extension may be None for unnamed streams.
        return (stream_info.extension or "").lower() in (".xlsx", ".xls")

    def convert(self, file_stream, stream_info, **kwargs):
        # sheet_name=None returns a dict mapping sheet name -> DataFrame.
        sheets = pd.read_excel(file_stream, sheet_name=None)
        parts = []
        for name, df in sheets.items():
            # to_markdown() needs the optional `tabulate` package.
            parts.append(f"## {name}\n\n{df.to_markdown(index=False)}")
        return DocumentConverterResult(markdown="\n\n".join(parts))


md = MarkItDown()
md.register_converter(SimpleXlsxConverter())
```
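The point about `tabulate` being optional can be illustrated with a small sketch of the "conditional logic" the reply mentions. This is not code from markitdown; the helper names `fallback_markdown_table` and `to_markdown_table` are hypothetical, and the fallback is a deliberately naive renderer that does not escape `|` inside cells.

```python
import importlib.util


def fallback_markdown_table(headers, rows):
    """Render a pipe table by hand, with no third-party dependency."""
    def fmt(cells):
        return "| " + " | ".join(str(c) for c in cells) + " |"

    lines = [fmt(headers), fmt("---" for _ in headers)]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


def to_markdown_table(headers, rows):
    """Use tabulate when it is installed, otherwise fall back (hypothetical helper)."""
    if importlib.util.find_spec("tabulate") is not None:
        from tabulate import tabulate

        # "pipe" is the table format pandas uses by default in to_markdown().
        return tabulate(rows, headers=headers, tablefmt="pipe")
    return fallback_markdown_table(headers, rows)


print(to_markdown_table(["name", "qty"], [["apple", 3], ["pear", 5]]))
```

Either branch yields a three-line pipe table for one header row and one data row plus the separator, but the exact padding and alignment markers differ, which is part of why a second code path adds maintenance cost.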
Good question. There are a few concrete reasons the current HTML-intermediate path was chosen over `DataFrame.to_markdown()` directly:

1. `tabulate` is not a mandatory dependency

`DataFrame.to_markdown()` requires `tabulate` to be installed. The HTML path only needs `pandas` (already required) plus `BeautifulSoup` (also already present for HTML conversion). Adding `tabulate` would either be a new hard dependency or require conditional logic.

2. The HTML path handles more Excel features

Excel sheets can contain merged cells, multi-level headers, mixed types, and formatted numbers. `DataFrame.to_html()` preserves some of these via `colspan`/`rowspan` attributes. `to_markdown()` flattens the DataFrame and loses …
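As a rough illustration of the `colspan` point, the sketch below builds a DataFrame with a two-level column header (an assumed stand-in for a spreadsheet with a merged header cell; the data is made up) and inspects what `DataFrame.to_html()` emits. It also shows one possible way to flatten the header, which is what a single-row Markdown header would require.

```python
import pandas as pd

# Two-level column header, similar to a sheet where "Sales" spans two columns.
columns = pd.MultiIndex.from_tuples(
    [("Sales", "Q1"), ("Sales", "Q2"), ("Costs", "Q1")]
)
df = pd.DataFrame([[10, 12, 7], [11, 15, 8]], columns=columns)

# The HTML output keeps the grouping: "Sales" is written once with a colspan.
html = df.to_html()
print(html)

# A Markdown pipe table has a single header row, so the levels have to be
# joined into flat names (the " / " separator is an arbitrary choice).
flat_names = [" / ".join(col) for col in df.columns]
print(flat_names)
```

The HTML keeps the header grouping as structure that a later HTML-to-Markdown step can still see, whereas the flattened names encode it only as text.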