找回密码
 快速注册
搜索
查看: 53|回复: 2

A brief [f]lex tutorial

[复制链接]

3149

主题

8386

回帖

6万

积分

$\style{scale:11;fill:#eff}꩜$

积分
65391
QQ

显示全部楼层

hbghlyj 发表于 2022-8-31 07:03 |阅读模式


CSc 453: Supporting Documents
A [f]lex tutorial (Powerpoint slides)
flex (and lex): Overview
Scanner generators:
Helps write programs whose control flow is directed by instances of regular expressions in the input stream.
\definecolor{blue}{HTML}{CCFFFF} \begin{tikzpicture}[scale=4] \node[fill=blue!50,draw,minimum width=2.5cm,minimum height=1.5cm]at(0,0)(v){flex (or lex)}; \draw[-stealth](v.east)--+(.5cm,0)node[align=left,xshift=50]{Output: C code\\implementing a\\scanner:\\ \underline{function}: \tt yylex()\\ \underline{file}: \tt lex.yy.c}; \draw[stealth-](v.west)--+(-.5cm,0)node[align=left,xshift=-50]{Input: a set of\\regular expressions\\+ actions}; \end{tikzpicture}
Using flex
\definecolor{blue}{HTML}{CCFFFF} \usetikzlibrary {decorations.pathreplacing} \begin{tikzpicture}[scale=4] \fontsize{10}{12}\selectfont \node[fill=blue!50,draw,minimum width=1.42cm,minimum height=1cm]at(2.75,2.17)(v){flex}; \draw[-stealth](v.east)--+(.1cm,0)node[align=left,xshift=35]{\underline{file}: \tt lex.yy.c\\ \\[-1em] \quad yylex()\\ \quad \{\\ \quad \dots\\ \quad \}}; \draw[stealth-](v.west)--+(-.1cm,0)node[align=left,xshift=-35](l){lex input spec\\(regexps +\\actions)}; \draw[stealth-](l.south)--+(0,-.3cm); \begin{scope}[shift={(0,-0.12)}] \draw[line width=.4] (2.143,1.684) circle (.016); \draw[line width=.4] (2.192,1.684) circle (.016); \draw (2.1257,1.6377) arc (-140:-40:0.057); \node (v1) at (2.1689,1.668) {}; \draw (v1) circle (0.0856); \draw (2.0351,1.5521) -- (2.1701,1.5137) -- (2.2941,1.5556) ; \draw (2.1707,1.5843) --(2.1689,1.4287)edge(2.0684,1.2857)--(2.2471,1.2851); \end{scope} \draw[-stealth] (2.3386,1.5475)--+(.67,0)node[align=left,xshift=30](sd){driver\\code}; \draw[dashed,yshift=-15.5] (3.12,2.32) rectangle (3.46,1.85); \draw[dashed] (3.12,2.32) rectangle (3.46,1.85); \draw (3.3349,1.4) -- (3.46,1.1389)node[yshift=-20,xshift=24]{$\left\{\!\!\begin{array}l\text{main() \{…\}}\\ \quad\text{or}\\ \text{parser() \{…\}} \end{array}\right.$}; \draw[decorate,decoration={brace,raise=5,amplitude=10}] (3.46,2.32) --+(0,-1); \draw[-stealth](3.59,1.82)--+(.2,0)[xshift=6]node{compiler}; \draw(3.8088,2.1068) rectangle (4.19,1.544); \draw[-stealth,xshift=17](3.59,1.82)--+(.2,0); \node at (2.36,1.7351)[align=left] {user\\supplies}; \end{tikzpicture}
flex: input format
An input file has the following structure:
\begin{tikzpicture}[scale=4] \node[minimum height=1.92cm,minimum width=2.2cm,align=left,draw]{definitions\\ \%\%\\ rules\\ \%\%\\ user code }; \draw[-stealth] (-0.5,0.1)node[left]{required} -- (-0.22,0.1); \draw[stealth-] (0.0158,-0.0181)--(0.4,0) node [right] {optional}; \draw[-stealth] (0.3977,0) edge (-0.0224,-0.1111)edge(0.1808,-0.2285)--(0.2237,0.1915); \node at (-1,-0.4137) {Shortest possible legal flex input:}; \node[minimum width=2.2cm,draw]at(0,-.6){\%\%\hskip2.7em\phantom.}; \end{tikzpicture}
Definitions
A series of:
name definitions, each of the form

name             definition

e.g.:
DIGIT            [0-9]

CommentStart     "/*"

ID               [a-zA-Z][a-zA-Z0-9]*
start conditions
stuff to be copied verbatim into the flex output (e.g., declarations, #includes):
enclosed in %{ … }%, or
indented

Rules
The rules portion of the input contains a sequence of rules.
Each rule has the form

  pattern     action

where:
pattern describes a pattern to be matched on the input
pattern must be un-indented
action must begin on the same line.

Patterns
Essentially, extended regular expressions.
Syntax: similar to grep (see man page)
<<EOF>> to match “end of file”
Character classes:
[:alpha:], [:digit:], [:alnum:], [:space:], etc. (see man page)
{name} where name was defined earlier.
“start conditions” can be used to specify that a pattern match only in specific situations.

Example
A flex program to read a file of (positive) integers and compute the average:
%{
#include <stdio.h>
#include <stdlib.h>
%}
dgt    [0-9]
%%
{dgt}+   return atoi(yytext);
%%
void main()
{
   int val, total = 0, n = 0;
   while ( (val = yylex()) > 0 ) {
      total += val;
      n++;
   }
   if (n > 0) printf(“ave = %d\n”, total/n);
}
注: The C library function int atoi(const char *str) converts the string argument str to type int. \usetikzlibrary {decorations.pathreplacing} \definecolor{blue}{HTML}{669999} \begin{tikzpicture}[scale=4] \fontsize{10}{12}\selectfont \node[align=left]{\tt\%\{\\ \tt\#include <stdio.h>\\ \tt\#include <stdlib.h>\\ \tt\%\}\\ \tt dgt [0-9]\\ \tt\color{cyan}\%\%\\ \tt\{dgt\}+ return atoi(yytext);\\ \tt\color{cyan}\%\%\\ \tt void main()\\ \tt\{\\ \tt\ \ \ int val, total = 0, n = 0;\\ \tt\ \ \ while ((val = yylex()) > 0 ) \{\\ \tt\ \ \ \ \ \ total += val;\\ \tt\ \ \ \ \ \ n++;\\ \tt\ \ \ \}\\ \tt\ \ \ if (n > 0) printf("ave = \%d\textbackslash n", total/n);\\ \tt\} }; \draw (-0.81,0.4225) ellipse (0.2597 and 0.06); \draw (-0.5711,0.4393) -- (0.3668,0.6585)node[align=left,right]{Definition for a digit\\(could have used builtin definition [:digit:] instead)}; \draw (-0.4116,0.2093) ellipse (0.674 and 0.0963); \draw (0.2131,0.2448) -- (0.5722,0.302)node[right,align=left]{Rule to match a number and return its value to\\the calling routine}; \draw[decorate,decoration={brace,amplitude=10}] (1.045,0.0471) -- (1.045,-0.8743); \draw (1.1338,-0.4095) -- (1.3751,-0.3595)node[align=left,right]{Driver code\\(could instead have been in a separate file)}; \draw[decorate,decoration={brace,amplitude=5}] (-1.1,0.3719)--(-1.1,0.8795); \draw[decorate,decoration={brace,amplitude=5}] (-1.1,0.139)--(-1.1,0.2485); \draw[decorate,decoration={brace,amplitude=5}] (-1.1,-0.8971)--(-1.1,0); \node[rotate=90,color=blue] at (-1.22,0.6187) {definitions}; \node[rotate=90,color=blue] at (-1.22,0.1896) {rules}; \node[rotate=90,color=blue] at (-1.22,-0.4496) {user code}; \end{tikzpicture}

\usetikzlibrary {decorations.pathreplacing} \definecolor{blue}{HTML}{669999} \begin{tikzpicture}[scale=4] \fontsize{10}{12}\selectfont \node[align=left]{\tt\%\{\\ \tt\#include <stdio.h>\\ \tt\#include <stdlib.h>\\ \tt\%\}\\ \tt{\color{magenta} dgt} [0-9]\\ \tt\color{cyan}\%\%\\ \tt\{{\color{magenta} dgt}\}+ return atoi({\color{magenta}yytext});\\ \tt\color{cyan}\%\%\\ \tt void main()\\ \tt\{\\ \tt\ \ \ int val, total = 0, n = 0;\\ \tt\ \ \ while ((val = {\color{magenta}yylex}()) > 0 ) \{\\ \tt\ \ \ \ \ \ total += val;\\ \tt\ \ \ \ \ \ n++;\\ \tt\ \ \ \}\\ \tt\ \ \ if (n > 0) printf("ave = \%d\textbackslash n", total/n);\\ \tt\} };\draw[decorate,decoration={brace,amplitude=5}] (-1.1,0.3719)--(-1.1,0.8795); \draw[decorate,decoration={brace,amplitude=5}] (-1.1,0.139)--(-1.1,0.2485); \draw[decorate,decoration={brace,amplitude=5}] (-1.1,-0.8971)--(-1.1,0); \node[rotate=90,color=blue] at (-1.22,0.6187) {definitions}; \node[rotate=90,color=blue] at (-1.22,0.1896) {rules}; \node[rotate=90,color=blue] at (-1.22,-0.4496) {user code}; \draw (-0.949,0.4181) ellipse (0.083 and 0.048); \draw (-0.9,0.2105) ellipse (0.083 and 0.048); \draw (-0.8714,0.4412) -- (0.32,0.5785)node[right]{defining and using a name} -- (-0.8556,0.2518); \draw (-0.0091,0.2091) ellipse (0.1461 and 0.0624); \draw (0.0746,0.2573) -- (0.32,0.3821)node[right,color=magenta](n){\tt char * yytext;}; \node at(n)[below right,xshift=-2em,yshift=-.5em,align=left]{a buffer that holds the input\\characters that actually match\\the pattern}; \draw (-0.1233,-0.3232) ellipse (0.13 and 0.0605); \draw (-0.0314,-0.2777) -- (0.6414,-0.0808)node[right](i){Invoking the scanner: \color{magenta}\tt yylex()}; \node at(i)[below right,xshift=-5.5em,yshift=-.5em,align=left]{Each time {\tt yylex()} is called, the\\scanner continues processing\\the input from where it last left\\off.\\Returns 0 on end-of-file.}; \end{tikzpicture}
Avoiding compiler warnings
If compiled using “gcc –Wall” the previous flex file will generate compiler warnings:
lex.yy.c: … : warning: `yyunput’ defined but not used
lex.yy.c: … : warning: `input’ defined but not used

These can be removed using ‘%option’ declarations in the first part of the flex input file:
%option nounput
%option noinput

Matching the Input
When more than one pattern can match the input, the scanner behaves as follows:
the longest match is chosen;
if multiple rules match, the rule listed first in the flex input file is chosen;
if no rule matches, the default is to copy the next character to stdout.
The text that matched (the “token”) is copied to a buffer yytext.

\definecolor{r}{HTML}{FF6600} \definecolor{b}{HTML}{0000FF} \begin{tikzpicture}[scale=4] \fontsize{10}{12}\selectfont \draw[fill=r!50,draw=red] (0.9892,0.052) ellipse (0.06 and 0.08); \draw[fill=b!20,draw=blue] (-0.1141,0.3859) ellipse (0.06 and 0.08); \draw[fill=r!50,draw=red] (-1.9594,0.9496) ellipse (0.06 and 0.08); \draw[fill=b!20,draw=blue] (-2.6673,0.9463) ellipse (0.06 and 0.08); \node[align=left,draw](c){ \tt\#include <stdio.h> \color{blue}/* definitions */\\ \tt \color{blue}int main(int argc, char * argv[ ]) \{\\ \tt\color{blue}\ \ \ if (argc <= 1) \{\\ \tt\color{blue}\ \ \ \ \ \ \ printf(“Error!\textbackslash n”); /* no arguments */\\ \tt\ \ \ \}\\ \tt\ \ \ printf(“\%d args given\textbackslash n”, argc);\\ \tt\ \ \ return 0;\\ \} }; \node at (-0.9404,0.5646) {Input:}; \node[align=left] at (-2.0828,1.0287) {Pattern to match C-style comments: /* … */\\[1em] \qquad\Large\tt"/*"(.|\textbackslash n)*"*/"}; \node[align=left] at (-2.4375,0.5897) {longest match:\\Matched text shown in \color{blue}blue }; \draw (-1.9175,0.4409) -- (c.west); \end{tikzpicture}
Using Start Conditions
Start conditions let us explicitly simulate finite state machines.
This lets us get around the “longest match” problem for C-style comments.
FSA for C comments:
\usetikzlibrary{automata, positioning} \begin{tikzpicture}[->,>=stealth,node distance=2cm] \fontsize{10}{12}\selectfont \node[state, initial](i){}; \node[state, right of=i] (s1) {S1}; \node[state, right of=s1] (s2) {S2}; \node[state, right of=s2] (s3) {S3}; \node[state, right of=s3, accepting] (f) {}; \draw(i)edge[above]node{/}(s1) (s1)edge[above]node{*}(s2) (s2)edge[loop above]node{non-*}(s2) (s2)edge[above]node{*}(s3) (s3)edge[loop above]node{*}(s3) (s3)edge[below, bend left]node{non-\{/,*\}}(s2) (s3)edge[above]node{/}(f); \end{tikzpicture}
flex input:
%x S1, S2, S3
%%
"/"              BEGIN(S1);
<S1>"*"          BEGIN(S2);
<S2>[^*]         ;     /* stay in S2 */
<S2>"*"          BEGIN(S3);
<S3>"*"          ;     /* stay in S3 */
<S3>[^*/]        BEGIN(S2);
<S3>"/"          BEGIN(INITIAL);

Putting it all together
Scanner implemented as a function
  int yylex();
return value indicates type of token found (encoded as a positive integer);
the actual string matched is available in yytext.
Scanner and parser need to agree on token type encodings
let yacc generate the token type encodings
yacc places these in a file y.tab.h
use “#include y.tab.h” in the definitions section of the flex input file.
When compiling, link in the flex library using “-lfl

3149

主题

8386

回帖

6万

积分

$\style{scale:11;fill:#eff}꩜$

积分
65391
QQ

显示全部楼层

 楼主| hbghlyj 发表于 2022-8-31 08:55
Lex与YACC详解
基于Lex 和 Yacc 的 C 语言编译器
lex与yacc(二)计算器的实现
构建一个C语言的编译器并不是一件容易的事,我想每个人在学习编译原理的时候并不会常见得它非常简单.
下面将会学习编译器的两个重要组成部分:词法分析器flex和语法分析器yacc
flex的实现有GNU的fast lex(lexical anslysis)
yacc的实现有GNU的Bison和Berkeley的byacc
flex与yacc程序是由下面三部分组成
  1. %{
  2. /*注释用空白缩进
  3. */
  4. %}
  5. %%标记这一部分结束
  6. %%第二部分与第三部分分割
复制代码

第一部分
定义段,拷贝到最终程序中的原始的c程序代码,如头文件.
第二部分
规则段,由两部分组成:模式和动作,空白分开,记法分析程序识别出某个模式时,将执行相应的动作
模式为unix样式的正则表达式
第三部分
用户子程序段,合法c代码组成.在lex生成代码结束之后复制到c文件.
  1. calc.l
  2. %{
  3. #include "y.tab.h"
  4. #include "ch3hdr2.h"
  5. #include <math.h>
  6. %}
  7.      
  8.      
  9. %%
  10. ([0-9]+|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?) {
  11. yylval.dval = atof(yytext);
  12. return NUMBER;
  13. }
  14.      
  15.      
  16. [ \t] ; /* ignore white space */
  17.      
  18.      
  19. [A-Za-z][A-Za-z0-9]* { /* return symbol pointer */
  20. struct symtab *sp = symlook(yytext);
  21.      
  22.      
  23. yylval.symp = sp;
  24. return NAME;
  25. }
  26.      
  27.      
  28. "{1}quot; { return 0; }
  29.      
  30.      
  31. \n |
  32. . return yytext[0];
  33. %%
复制代码

在编写好规则后,调用yylex来根据进行规则词法分析.
而这些规则是由正则表达式写成的.
  1. calc.y
  2. %{
  3. #include "ch3hdr2.h"
  4. #include <string.h>
  5. #include <math.h>
  6. %}
  7.      
  8.      
  9. %union {
  10. double dval;
  11. struct symtab *symp;
  12. }
  13. %token <symp> NAME
  14. %token <dval> NUMBER
  15. %left '-' '+'
  16. %left '*' '/'
  17. %nonassoc UMINUS
  18.      
  19.      
  20. %type <dval> expression
  21. %%
  22. statement_list: statement '\n'
  23. | statement_list statement '\n'
  24. ;
  25.      
  26.      
  27. statement: NAME '=' expression { $1->value = $3; }
  28. | expression { printf("= %g\n", $1); }
  29. ;
  30.      
  31.      
  32. expression: expression '+' expression { $ = $1 + $3; }
  33. | expression '-' expression { $ = $1 - $3; }
  34. | expression '*' expression { $ = $1 * $3; }
  35. | expression '/' expression
  36. { if($3 == 0.0)
  37. yyerror("divide by zero");
  38. else
  39. $ = $1 / $3;
  40. }
  41. | '-' expression %prec UMINUS { $ = -$2; }
  42. | '(' expression ')' { $ = $2; }
  43. | NUMBER
  44. | NAME { $ = $1->value; }
  45. | NAME '(' expression ')' {
  46. if($1->funcptr)
  47. $ = ($1->funcptr)($3);
  48. else {
  49. printf("%s not a function\n", $1->name);
  50. $ = 0.0;
  51. }
  52. }
  53. ;
  54. %%
  55. /* look up a symbol table entry, add if not present */
  56. struct symtab *
  57. symlook(s)
  58. char *s;
  59. {
  60. char *p;
  61. struct symtab *sp;
  62.      
  63.      
  64. for(sp = symtab; sp < &symtab[NSYMS]; sp++) {
  65. /* is it already here? */
  66. if(sp->name && !strcmp(sp->name, s))
  67. return sp;
  68.      
  69.      
  70. /* is it free */
  71. if(!sp->name) {
  72. sp->name = strdup(s);
  73. return sp;
  74. }
  75. /* otherwise continue to next */
  76. }
  77. yyerror("Too many symbols");
  78. exit(1); /* cannot continue */
  79. } /* symlook */
  80.      
  81.      
  82. addfunc(name, func)
  83. char *name;
  84. double (*func)();
  85. {
  86. struct symtab *sp = symlook(name);
  87. sp->funcptr = func;
  88. }
  89.      
  90.      
  91. main()
  92. {
  93. extern double sqrt(), exp(), log();
  94.      
  95.      
  96. addfunc("sqrt", sqrt);
  97. addfunc("exp", exp);
  98. addfunc("log", log);
  99. yyparse();
  100. }
复制代码

lex将输入流分成块,然后yacc取得这些块并将它们逻辑性地归组到一起.
yacc语法分析器yyparse()重复尝试分析句子直到输入结束.

需要输入时调用词法分析程序yylex(),把匹配的标记返回给程序.要注意的是这里的递归规则.

编译:
  1. lex calc.l
  2. yacc -d calc.y
  3. gcc y.tab.c lex.yy.c -ly -ll
复制代码
语法分析程序分析过程移动与归约

yacc语法分析程序是通过寻找可以匹配目前为止所看到的标记的规则来工作.yacc处理语法分析程序时创建了一组状态,每个状态反映一个或多个部分地被分析的规则中的一个可能位置.当语法分析程序读取标记时,每次它读取一个没完成规则的标记,就把它压入内部堆栈中并切换到一种反映它刚刚读取的标记的新状态,这个动作称为移进(shift).当它发现组成某条规则右侧的全部符号时,它就把右侧符号弹出堆栈,而将左侧符号压入堆栈中,并切换到反映堆栈上新符号的新状态,这个动作被称为归约(reduction).因为它通常减少堆栈上项目的数目.无论yacc什么时候归约规则,它都执行与这条规则有关的用户代码.

3149

主题

8386

回帖

6万

积分

$\style{scale:11;fill:#eff}꩜$

积分
65391
QQ

显示全部楼层

 楼主| hbghlyj 发表于 2022-8-31 12:11
1#的FSA是用TikZ library automa画的,参见Drawing Finite State Machines in LATEX using tikz - A Tutorial

手机版|悠闲数学娱乐论坛(第3版)

GMT+8, 2025-3-4 15:31

Powered by Discuz!

× 快速回复 返回顶部 返回列表