2.2 Translating LaTeX to HTML
LaTeX [2] is a typesetting system that is widely used for document formatting (at least in academia); HTML is a markup language used for specifying the appearance of web pages. While superficially similar in that both describe the appearance of documents, the two languages have very different syntax (fragments of context-free grammars for each are shown in Figure 2) and differ considerably in their features and strengths.
Nevertheless, authors who prepare documents using LaTeX may then want to create web pages from them by translating them to HTML. To this end, several tools are available for translating LaTeX documents to HTML (e.g., see [3, 5]).
Such translators typically proceed as follows:
1. Read in the LaTeX document using context-free parsing techniques.
2. Construct internal representations of portions of the document, as necessary.
3. Process the LaTeX constructs and output the corresponding HTML.
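The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the structure of any of the tools cited: it assumes a tiny LaTeX subset (commands with at most one braced argument), the `TAGS` mapping and all function names are hypothetical, and a hand-written recursive-descent parser stands in for a generated one.

```python
import html
import re

# Hypothetical mapping from LaTeX commands to HTML tags (illustrative subset).
TAGS = {"section": "h2", "textbf": "b", "emph": "em"}

# A command name, a single brace, or a run of ordinary text.
# Anything else (e.g. a backslash before a non-letter) is silently skipped.
TOKEN = re.compile(r"\\([A-Za-z]+)|([{}])|([^\\{}]+)")


def tokenize(src):
    """Step 1 (lexical analysis): split the source into tokens."""
    tokens = []
    for m in TOKEN.finditer(src):
        if m.group(1):
            tokens.append(("cmd", m.group(1)))
        elif m.group(2):
            tokens.append(("brace", m.group(2)))
        else:
            tokens.append(("text", m.group(3)))
    return tokens


def parse(tokens, pos=0, in_group=False):
    """Steps 1-2 (parsing): build ("text", s) and ("cmd", name, children) nodes."""
    nodes = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == "text":
            nodes.append(("text", value))
            pos += 1
        elif kind == "cmd":
            pos += 1
            children = []
            # A command may be followed by one braced argument.
            if pos < len(tokens) and tokens[pos] == ("brace", "{"):
                children, pos = parse(tokens, pos + 1, in_group=True)
            nodes.append(("cmd", value, children))
        elif value == "{":
            # A bare group: splice its contents into the current level.
            children, pos = parse(tokens, pos + 1, in_group=True)
            nodes.extend(children)
        else:
            # A closing brace ends the group being parsed.
            if not in_group:
                raise SyntaxError("unmatched '}'")
            return nodes, pos + 1
    if in_group:
        raise SyntaxError("missing '}'")
    return nodes, pos


def emit(nodes):
    """Step 3: recursive tree walk that outputs the corresponding HTML."""
    out = []
    for node in nodes:
        if node[0] == "text":
            out.append(html.escape(node[1]))
        else:
            _, name, children = node
            inner = emit(children)
            tag = TAGS.get(name)
            # Unknown commands fall back to emitting only their argument.
            out.append(f"<{tag}>{inner}</{tag}>" if tag else inner)
    return "".join(out)


def latex_to_html(src):
    tree, _ = parse(tokenize(src))
    return emit(tree)
```

For example, `latex_to_html(r"\section{Intro} Some \textbf{bold \emph{text}}")` walks a tree with a `section` node and a nested `textbf`/`emph` pair and wraps each in its mapped tag.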
The sequence of steps here is somewhat different from that of a graph-drawing tool or a compiler, primarily because the source and target languages are semantically much closer in this case, simplifying the translation process considerably.
Nevertheless, there are a number of similarities, primarily in the initial lexical analysis and parsing phase (step (1) above) and the final HTML generation (step (3) above), which is carried out by what is in effect a recursive tree walk. However, the translation is not entirely trivial, since we have to handle LaTeX features, such as mathematical symbols and pictures, that are not supported by HTML. This is typically done by resorting to GIF or JPEG images of the corresponding constructs, which requires constructing an appropriate internal representation for the LaTeX construct and then transforming it into an image (step (2) above). The corresponding compiler analog is code generation for language features that the target architecture does not directly support, such as inheritance and virtual function calls in an object-oriented language.
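The image fallback described above might look roughly like the following sketch. It makes several assumptions that are not from the source: math is delimited by `$...$`, image file names are derived from a hash of the LaTeX source, and the names `image_name`, `replace_math`, and `pending` are hypothetical. The rendering of each recorded construct to a GIF or JPEG is left out entirely.

```python
import hashlib
import html
import re

# Assumption: inline math is written as $...$ with no nested dollar signs.
MATH = re.compile(r"\$([^$]+)\$")


def image_name(latex_src):
    """Derive a stable file name for a construct from its source text."""
    digest = hashlib.sha1(latex_src.encode("utf-8")).hexdigest()[:10]
    return f"math-{digest}.gif"


def replace_math(text, pending):
    """Replace each $...$ span with an <img> tag.

    The LaTeX source of every construct is recorded in `pending`
    (file name -> source) so that a separate step can render it.
    """
    out = []
    last = 0
    for m in MATH.finditer(text):
        out.append(html.escape(text[last:m.start()]))
        src = m.group(1)
        name = image_name(src)
        pending[name] = src  # kept for the (omitted) image renderer
        out.append(f'<img src="{name}" alt="{html.escape(src)}">')
        last = m.end()
    out.append(html.escape(text[last:]))
    return "".join(out)
```

Because the file name depends only on the source text, a formula that appears several times in a document is recorded, and would be rendered, only once.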
In the undergraduate compiler design course at the University of Arizona, the 0th programming assignment has the students use lex and yacc to implement, in roughly 1.5 weeks, a translator from (a subset of) LaTeX to (a subset of) HTML.
At that point, most students know very little about LaTeX, many don’t know a lot about HTML, and none of them know anything about lex and yacc. The goals of the assignment are twofold: first, to get the students acquainted with lex and yacc, in preparation for a more traditional project implementing a compiler for a subset of C; and second, to illustrate the applicability of these tools to other translation problems.
We use discussion sessions and on-line tutorials to give the students just enough acquaintance with LaTeX and HTML to know what they are doing. We revisit the problem in classroom discussions at the end of the term, when they are much better versed in these tools (lex and yacc); students often seem quite surprised and pleased to realize that they are now equipped to implement a nontrivial and practically useful piece of software, for a significant fragment of LaTeX, reasonably quickly and without a great deal of effort.