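Installing markitdown and basic usage

Installation

microsoft/markitdown: Python tool for converting files and office documents to Markdown.

The official repository describes two installation methods: installing with pip, or building from source.

As of 2025-02-24, running pip install markitdown installs markitdown-0.0.1a4, while the latest version is markitdown-0.0.2a1, so I recommend the second method, building from source: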

git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown
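
To confirm which version actually ended up in the environment, the package metadata can be queried from Python. This is a minimal sketch using only the standard library; the helper name installed_version is my own and is not part of markitdown.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "markitdown" is the distribution name used with pip in the text above.
    print(installed_version("markitdown"))
```

If this prints 0.0.1a4 after a source install, the editable install probably landed in a different environment than the one being queried.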

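Main options

-h, --help: show the help message

-v, --version: show the version number

-o OUTPUT, --output OUTPUT: set the output file name (if omitted, the result is written to the console)

-d, --use-docintel: use Document Intelligence to extract text (requires a valid Document Intelligence endpoint)

-p, --use-plugins: use third-party plugins to convert files

--list-plugins: list the installed third-party plugins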
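
When the CLI is called from a script, the same options can be assembled programmatically. The sketch below only builds an argument list from the flags listed above; build_markitdown_command is a hypothetical helper of mine, the file names are placeholders, and it assumes the markitdown executable is on PATH.

```python
from typing import List, Optional


def build_markitdown_command(
    source: str,
    output: Optional[str] = None,
    use_plugins: bool = False,
) -> List[str]:
    """Build a markitdown command line using only the flags described above."""
    command = ["markitdown", source]
    if output is not None:
        # -o writes the result to a file instead of the console.
        command += ["-o", output]
    if use_plugins:
        # -p enables third-party plugins.
        command.append("-p")
    return command


if __name__ == "__main__":
    print(build_markitdown_command("paper.docx", output="paper.md"))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.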

Usage

Basic information

Running markitdown -v on the command line prints the version:

C:\Users\Vanilla> markitdown -v
markitdown 0.0.2a1

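To print the help message: markitdown -h

To list third-party plugins: markitdown --list-plugins

Testing a docx file

I used a paper I wrote for a past MCM (Mathematical Contest in Modeling) as the test document.

It has every component a complete mathematical modeling paper should have: equations, images, tables, captions, multi-level headings, bold, italics, links, lists, and a page header. Display equations were written with MathType, and inline equations with Word's built-in equation editor.

Command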

Measure-Command {
markitdown .\MCM-finish.docx -o docx.md
}
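
Measure-Command is specific to PowerShell. On other shells, a rough equivalent can be written in Python. This is only a sketch: time_command is my own helper, and the command in the demo is a placeholder so that the snippet runs without markitdown installed.

```python
import subprocess
import sys
import time
from typing import List


def time_command(command: List[str]) -> float:
    """Run `command` and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder command; replace it with
    # ["markitdown", "MCM-finish.docx", "-o", "docx.md"] to time the conversion.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"{elapsed:.3f} s")
```

Like Measure-Command, this measures the whole process, including interpreter start-up, not just the conversion itself.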

Selected results

Abstract

Extracted output:

Saving Juneau: Sustainable Development in Tourism

**Summary**

Excessive tourism in Juneau City has caused environmental and social challenges. To address these issues and promote sustainable development, we developed a multi-objective optimization model for sustainable tourism and applied it to Juneau City.

We constructed a general multi-objective optimization model with **tourist numbers** as the decision variable. The **objective function** integrates economic, environmental, and social factors, resulting in six goals. **Constraints** include carbon emissions, water resource utilization, and waste management. Further research will refine this model for application in other cities.

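Observations:

  • The page header was not extracted at all
  • The title "Saving Juneau: Sustainable Development in Tourism" was styled as a title in the original but came out as plain text
  • Bold text was converted correctly

Table of contents

Extracted output: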

**Contents**

[1 Introduction 3](#_Toc188935048)

[1.1 Background 3](#_Toc188935049)

[1.2 Restatement of the problem 3](#_Toc188935050)

[1.3 Our works 4](#_Toc188935051)

[2 Model Preparation 4](#_Toc188935052)

[2.1 Assumptions and Justifications 4](#_Toc188935053)

[2.2 Notations 5](#_Toc188935054)

[3 Juneau: A Sustainable Tourism Model 6](#_Toc188935055)
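
Observations:

  • In the original document the table of contents entries are clickable. The conversion keeps the jump targets (the #_Toc... anchors), but the resulting links do not work at all.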
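
Broken in-document links like these can be detected mechanically. The sketch below is a simplification of my own: it collects every link target of the form (#name) and reports those with no explicit HTML anchor (an a tag with id or name) in the same text. It ignores the heading slugs that some renderers generate automatically, which would not match Word's _Toc identifiers anyway.

```python
import re
from typing import List

LINK_TO_ANCHOR = re.compile(r"\]\(#([^)\s]+)\)")
HTML_ANCHOR = re.compile(r"""<a\s+(?:id|name)=["']([^"']+)["']""")


def dangling_anchors(markdown: str) -> List[str]:
    """Return in-document link targets that no explicit HTML anchor defines."""
    defined = set(HTML_ANCHOR.findall(markdown))
    targets = LINK_TO_ANCHOR.findall(markdown)
    return [target for target in targets if target not in defined]


if __name__ == "__main__":
    sample = "[1 Introduction 3](#_Toc188935048)\n\n# Introduction\n"
    print(dangling_anchors(sample))
```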

Body

Converted output:

# Introduction

## Background

![图示

描述已自动生成](data:image/png;base64...)

Figure :Current situation map of Juneau City[1]

In 2023, Juneau, Alaska, hosted 1.6 million cruise passengers, with a daily peak of up to 20,000 visitors. While this influx brought significant economic benefits, it also caused overcrowding and accelerated glacial retreat, impacting natural attractions and potentially deterring future tourists. Additionally, excessive tourism has increased hidden costs related to infrastructure strain, environmental damage, and social challenges.

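Observations:

  • Level-1 and level-2 headings were converted correctly

  • The image appears to be intended as an embedded base64 data URI, but:

    • the image data itself was not converted; the output contains only the placeholder data:image/png;base64...

    • the alt text is the notice Word inserts for auto-generated alt text, "图示 描述已自动生成" ("Diagram. Description automatically generated"), so where did the auto-generated description itself go?

    • the alt text also contains two line breaks in the middle, for no apparent reason

  • The figure caption became plain text, and the figure number (stored as a field) disappeared

  • Citations became plain text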
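
The stray line breaks inside the alt text are easy to clean up afterwards. The following is a small post-processing sketch of my own, not something markitdown provides: it collapses any run of whitespace inside an image's alt text into a single space. The regular expression is deliberately simple and assumes the alt text contains no closing bracket.

```python
import re

# DOTALL lets the alt text span several lines, as in the output above.
IMAGE = re.compile(r"!\[(.*?)\]\((.*?)\)", re.DOTALL)


def normalize_image_alt(markdown: str) -> str:
    """Collapse runs of whitespace (including newlines) inside image alt text."""

    def fix(match):
        alt = " ".join(match.group(1).split())
        return f"![{alt}]({match.group(2)})"

    return IMAGE.sub(fix, markdown)


if __name__ == "__main__":
    print(normalize_image_alt("![图示\n\n描述已自动生成](data:image/png;base64...)"))
```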

Notations

Converted output:

* **Assumption 2:** Ignoring the carbon footprint caused by tourists' use of transportation within the city of Juneau.
* Justification: Juneau has no direct roads. Most tourists choose cruise ships or planes to reach there. In contrast, the carbon footprint generated by tourists' sightseeing within the city can be negligible.

## Notations

| Notation | Description | Unit |
| --- | --- | --- |
| | Direct income from tourism | USD |
| | The i-th source of direct income from tourism | USD |
| | Tax revenue | USD |
| | Daily water consumption per tourist | L/person/day |
| | Carbon footprint | t |

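Observations:

  • List items were converted successfully, using * as the list marker
  • The table was converted correctly
  • The leftmost column of the table contains Word equations, all of which disappeared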
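
Cells that silently lost their content, like the Notation column here, can be flagged automatically. This is a rough sketch under my own assumptions: the input is a single pipe table with a header row and a separator row, and no cell contains an escaped pipe. The function name empty_cells is hypothetical.

```python
from typing import List, Tuple


def empty_cells(table: str) -> List[Tuple[int, int]]:
    """Return (row, column) indices of empty body cells in a Markdown pipe table.

    Rows are counted from 0, starting after the header and separator lines.
    """
    lines = [line.strip() for line in table.strip().splitlines()]
    body = lines[2:]  # skip the header row and the | --- | separator
    result = []
    for r, line in enumerate(body):
        # Drop exactly one leading and one trailing pipe before splitting.
        inner = line[1:-1] if line.startswith("|") and line.endswith("|") else line
        cells = [cell.strip() for cell in inner.split("|")]
        result.extend((r, c) for c, cell in enumerate(cells) if cell == "")
    return result


if __name__ == "__main__":
    sample = """
| Notation | Description | Unit |
| --- | --- | --- |
| | Direct income from tourism | USD |
| | Tax revenue | USD |
"""
    print(empty_cells(sample))
```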

Appendix

# References

1. Background image source: Travel Juneau. (n.d.). *Home*<https://www.traveljuneau.com/>
2. LSC Transportation Consultants, Inc. (2024). *Juneau visitor circulator study final report (Prepared for City and Borough of Juneau)*. <https://juneau.org/wp-content/uploads/2024/02/Juneau-Visitor-Circulator-Study-Final-Report-2024-1.pdf>

Observations:

  • Italics were converted correctly
  • Links work, but they use the <link> autolink form rather than the [name](link) form more common in Markdown
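
If the [name](link) form is preferred, the autolinks can be rewritten in a post-processing step. A minimal sketch, assuming only http and https autolinks need handling; since the autolink carries no separate link text, the URL itself is reused as the name.

```python
import re

AUTOLINK = re.compile(r"<(https?://[^>\s]+)>")


def autolinks_to_inline(markdown: str) -> str:
    """Rewrite <https://...> autolinks as [https://...](https://...)."""
    return AUTOLINK.sub(r"[\1](\1)", markdown)


if __name__ == "__main__":
    print(autolinks_to_inline("*Home*<https://www.traveljuneau.com/>"))
```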

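Test summary

| Document part | Result | Notes |
| --- | --- | --- |
| File type | docx | Latest version of Word |
| File size | 16.3MB | Many high-resolution images; 25 pages, 39578 characters including spaces |
| Conversion time (s) | 2.4986917 | Quite fast |
| Equations | × | All equations disappeared |
| Images | × | Completely unusable |
| Tables | | |
| Captions | | Became plain text |
| Multi-level headings | | Multi-level headings are fine; the document title became plain text |
| Bold | | |
| Italics | | |
| Links | | |
| Lists | | |
| Page header | × | Disappeared |
| Table of contents | | Plain text; the jump links do not work |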

Testing a PDF file

Command

Measure-Command {
markitdown .\MCM-finish.pdf -o pdf.md
}

Selected results

Abstract

Extracted output:

Problem Chosen
X

2025
MCM/ICM
Summary Sheet

Team Control Number

XXXXXXX

Saving Juneau: Sustainable Development in Tourism

Summary

Excessive tourism in Juneau City has caused environmental and social challenges.
To address these issues and promote sustainable development, we developed a multi-
objective optimization model for sustainable tourism and applied it to Juneau City.

We constructed a general multi-objective optimization model with tourist
numbers as the decision variable. The objective function integrates economic,
environmental, and social factors, resulting in six goals. Constraints include carbon
emissions, water resource utilization, and waste management. Further research will
refine this model for application in other cities.

Task 1: We extended the model by adding sales tax and hotel tax as decision
variables and maximizing tax revenue with related constraints. Using literature review
and linear regression, we determine the values, estimate the parameters and applied the
NSGA-II algorithm to find Pareto optimal solutions. The entropy weight method

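Observations:

  • The page header text is now converted as well; the layout is somewhat messy, but at least the text is there
  • Every soft line wrap in the abstract became a hard line break, presumably because a PDF stores a paragraph as separately positioned lines
  • All formatting is gone (bold is lost)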
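
The hard line breaks can be undone with a simple reflow pass. The sketch below is a heuristic of my own: blank lines are treated as paragraph boundaries, lines inside a paragraph are joined with a space, and a line ending in a hyphen is glued to the next one with the hyphen kept (which is right for "multi-" / "objective" above but wrong for words that were hyphenated only for line breaking).

```python
import re


def reflow(text: str) -> str:
    """Join hard-wrapped lines inside each paragraph; blank lines separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        out = ""
        for line in lines:
            if out.endswith("-"):
                # Word split across lines: glue the halves and keep the hyphen.
                out += line
            elif out:
                out += " " + line
            else:
                out = line
        joined.append(out)
    return "\n\n".join(joined)


if __name__ == "__main__":
    sample = "we developed a multi-\nobjective optimization model\nfor tourism.\n\nNext paragraph\nhere."
    print(reflow(sample))
```

Short header lines such as "Problem Chosen" and "X" would be merged too, so a pass like this is best applied to body text only.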

Table of contents

Extracted output:

Team#XXXXXXX

Page 2 of 25

Contents

1

Introduction ..................................................................................................... 3
1.1 Background ......................................................................................... 3
1.2 Restatement of the problem ................................................................. 3
1.3 Our works ............................................................................................ 4
2 Model Preparation ........................................................................................... 4
2.1 Assumptions and Justifications ........................................................... 4
2.2 Notations ............................................................................................. 5
3 Juneau: A Sustainable Tourism Model ............................................................ 6
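
Observations:

  • The page header was extracted correctly
  • There is no jump-link information
  • What you see is what you get: all the text visible in the PDF was converted, preserving as much of the text content as possible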
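
Since the dot leaders and page numbers survive as text, the entries can still be parsed back into structured data. A minimal sketch, assuming every entry sits on one line and uses at least three consecutive dots as the leader; parse_toc is my own helper name.

```python
import re
from typing import List, Tuple

TOC_LINE = re.compile(r"^\s*(?P<title>.+?)\s*\.{3,}\s*(?P<page>\d+)\s*$")


def parse_toc(text: str) -> List[Tuple[str, int]]:
    """Extract (title, page) pairs from lines that use dot leaders."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append((match.group("title"), int(match.group("page"))))
    return entries


if __name__ == "__main__":
    sample = "Contents\n1.1 Background ......... 3\n2 Model Preparation ..... 4"
    print(parse_toc(sample))
```

Note that in the output above the section number of the first entry ended up on its own line, so this approach would recover "Introduction" without its leading "1".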

Body

Converted output:

1  Introduction

1.1 Background

Figure 1:Current situation map of Juneau City[1]

In 2023, Juneau, Alaska, hosted 1.6 million cruise passengers, with a daily peak
of up to 20,000 visitors. While this influx brought significant economic benefits, it also
caused overcrowding and accelerated glacial retreat, impacting natural attractions and
potentially deterring future tourists. Additionally, excessive tourism has increased
hidden costs related to infrastructure strain, environmental damage, and social
challenges.

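Observations:

  • Level-1 and level-2 headings lost their heading formatting but kept their section numbers

  • The image disappeared entirely

  • The figure caption lost its caption formatting too; it survives only as plain text

  • Citations became plain text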
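
Because the section numbers survive, heading levels can be partly recovered from them. The sketch below is a heuristic of my own and will misfire on some documents: a line counts as a heading only if it stands alone between blank lines, starts with a dotted section number, and continues with a capitalized title that contains no period. The number is dropped, which matches how the docx output renders headings.

```python
import re

NUMBERED_HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z][^.\n]*)$")


def promote_headings(text: str) -> str:
    """Turn standalone lines such as '1.1 Background' into Markdown headings."""
    lines = text.split("\n")
    out = []
    for i, line in enumerate(lines):
        match = NUMBERED_HEADING.match(line)
        prev_blank = i == 0 or lines[i - 1].strip() == ""
        next_blank = i == len(lines) - 1 or lines[i + 1].strip() == ""
        if match and prev_blank and next_blank:
            # Heading depth = number of components in the section number.
            depth = match.group(1).count(".") + 1
            out.append("#" * depth + " " + match.group(2).strip())
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "1  Introduction\n\n1.1 Background\n\nIn 2023, Juneau hosted\n1.6 million passengers."
    print(promote_headings(sample))
```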

Notations

Converted output:

  Justification: Juneau has no direct roads. Most tourists choose cruise ships or
planes to reach there. In contrast, the carbon footprint generated by tourists'
sightseeing within the city can be negligible.

2.2 Notations

Notation

Description
Direct income from tourism
The i-th source of direct income from tourism
Tax revenue
Daily water consumption per tourist

……

Unit
USD
USD
USD
L/person/day

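Observations:

  • List markers turned into an unidentifiable glyph (the character at the start of the Justification line above)
  • The table structure was lost; only the text remains
  • The leftmost column of the table contains Word equations, all of which disappeared
  • Cells are emitted column by column (every Description, then every Unit) rather than row by row, so the link between a description and its unit is lost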
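
If the column-wise text is split back into one list per column by hand, rebuilding the rows is a matter of zipping the lists together. A minimal sketch under that assumption; rows_from_columns is my own helper, and it refuses columns of unequal length because the row alignment would then be a guess.

```python
from typing import Dict, List


def rows_from_columns(columns: Dict[str, List[str]]) -> str:
    """Build a Markdown pipe table from column-wise lists of equal length."""
    headers = list(columns)
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of cells")
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    # zip(*...) pairs the i-th cell of every column, i.e. one table row.
    for row in zip(*columns.values()):
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_from_columns({
        "Description": ["Direct income from tourism", "Tax revenue"],
        "Unit": ["USD", "USD"],
    }))
```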

Appendix

References

[1] Background image source: Travel Juneau. (n.d.). Home. https://www.traveljuneau.com/

[2] LSC Transportation Consultants, Inc. (2024). Juneau visitor circulator study final repo

rt (Prepared for City and Borough of Juneau). https://juneau.org/wp-content/uploads/20

24/02/Juneau-Visitor-Circulator-Study-Final-Report-2024-1.pdf

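Observations:

  • Italic formatting is gone
  • Some links survive and some do not, because line wrapping cuts a link in two
  • Links are bare URLs rather than Markdown link syntax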
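
For a reference list like this one, the cut links can be stitched back together by gluing each fragment onto the entry it belongs to. This sketch rests on two assumptions of mine that hold for the output above but not in general: every entry starts with a bracketed number such as [2], and every wrap happened in the middle of a token, so fragments are joined without a space.

```python
import re
from typing import List

ENTRY_START = re.compile(r"^\[\d+\]")


def merge_reference_fragments(text: str) -> List[str]:
    """Glue wrapped fragments back onto the '[n] ...' entry they belong to."""
    entries: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            # Assumption: the wrap happened mid-token, so no space is inserted.
            entries[-1] += line
    return entries


if __name__ == "__main__":
    sample = "[1] A. https://a.example/\n\n[2] Report final repo\n\nrt. https://b.example/20\n\n24/x.pdf"
    for entry in merge_reference_fragments(sample):
        print(entry)
```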

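Test summary

| Document part | Result | Notes |
| --- | --- | --- |
| File type | pdf | |
| File size | 5.42MB | Many high-resolution images; 25 pages, 39578 characters including spaces |
| Conversion time (s) | 12.411024 | Slower than the docx conversion, taking about 5 times as long |
| Equations | × | All equations disappeared |
| Images | × | Images disappeared |
| Tables | × | Table structure lost |
| Captions | | Became plain text |
| Multi-level headings | × | Became plain text (with section numbers) |
| Bold | × | |
| Italics | × | |
| Links | × | Plain text, and cut in two by line wrapping |
| Lists | × | Plain text |
| Page header | | Plain text |
| Table of contents | | Entirely plain text |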
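
As a quick check of the "about 5 times" figure, the two conversion times reported in this post can be divided directly. The variable names are mine; the values are the two measurements from the summaries.

```python
docx_time = 2.4986917   # docx conversion time reported above
pdf_time = 12.411024    # PDF conversion time reported above

ratio = pdf_time / docx_time
print(f"PDF conversion took {ratio:.2f}x as long as the docx conversion")
```

The ratio comes out at roughly 4.97, even though the PDF file is about a third of the size of the docx file.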