1 #-----------------------------------------------------------------------------
4 # Author: David M. Beazley (beazley@cs.uchicago.edu)
5 # Department of Computer Science
6 # University of Chicago
9 # Copyright (C) 2001, David M. Beazley
11 # $Header: /home/davidb/src/cvs/python-rwhoisd/rwhoisd/lex.py,v 1.1.1.1 2003/04/22 02:19:07 davidb Exp $
13 # This library is free software; you can redistribute it and/or
14 # modify it under the terms of the GNU Lesser General Public
15 # License as published by the Free Software Foundation; either
16 # version 2.1 of the License, or (at your option) any later version.
18 # This library is distributed in the hope that it will be useful,
19 # but WITHOUT ANY WARRANTY; without even the implied warranty of
20 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
21 # Lesser General Public License for more details.
23 # You should have received a copy of the GNU Lesser General Public
24 # License along with this library; if not, write to the Free Software
25 # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
27 # See the file COPYING for a complete copy of the LGPL.
30 # This module automatically constructs a lexical analysis module from regular
31 # expression rules defined in a user-defined module. The idea is essentially the same
32 # as that used in John Aycock's Spark framework, but the implementation works
33 # at the module level rather than requiring the use of classes.
35 # This module tries to provide an interface that is closely modeled after
# the traditional lex interface in Unix. It also differs from Spark
# in the following ways:
#
# - It provides more extensive error checking and reporting if
#   the user supplies a set of regular expressions that can't
#   be compiled, or if there is any other kind of problem in
#   the specification.
#
# - The interface is geared towards LALR(1) and LR(1) parser
#   generators. That is, tokens are generated one at a time
#   rather than being generated all at once in advance.
# There are a few limitations of this module:
50 # - The module interface makes it somewhat awkward to support more
51 # than one lexer at a time. Although somewhat inelegant from a
52 # design perspective, this is rarely a practical concern for
53 # most compiler projects.
55 # - The lexer requires that the entire input text be read into
56 # a string before scanning. I suppose that most machines have
# enough memory to make this a minor issue, but it makes
58 # the lexer somewhat difficult to use in interactive sessions
59 # or with streaming data.
61 #-----------------------------------------------------------------------------
66 This module builds lex-like scanners based on regular expression rules.
67 To use the module, simply write a collection of regular expression rules
68 and actions like this:
    import lex

    # Define a list of valid tokens
    tokens = (
        'IDENTIFIER', 'NUMBER', 'PLUS', 'MINUS'
        )

    # Define tokens as functions
    def t_IDENTIFIER(t):
        r' ([a-zA-Z_](\w|_)*) '
        return t

    def t_NUMBER(t):
        r' \d+ '
        return t

    # Some simple tokens with no actions
    t_PLUS  = r'\+'
    t_MINUS = r'-'

    # Initialize the lexer
    lex.lex()
94 The tokens list is required and contains a complete list of all valid
95 token types that the lexer is allowed to produce. Token types are
96 restricted to be valid identifiers. This means that 'MINUS' is a valid
97 token type whereas '-' is not.
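For instance (an illustrative comparison, not part of the example above):

    tokens = ('PLUS', 'MINUS')    # OK: both names are valid identifiers
    tokens = ('+', '-')           # Bad: lex will complain about these names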
99 Rules are defined by writing a function with a name of the form
100 t_rulename. Each rule must accept a single argument which is
a token object generated by the lexer. This token has the following
attributes:

    t.type   = The type of the token (a string). This is initially set
               to the name of the rule with the leading 't_' removed.
    t.value  = The text of the matched lexeme.
    t.lineno = The line number at which the token was encountered.
109 For example, the t_NUMBER() rule above might be called with the following:
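(illustrative values; the actual value and line number depend on the
input being scanned)

    t.type   = 'NUMBER'
    t.value  = '42'
    t.lineno = 1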
115 Each rule returns the token object it would like to supply to the
116 parser. In most cases, the token t is returned with few, if any
117 modifications. To discard a token for things like whitespace or
118 comments, simply return nothing. For instance:
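(a minimal sketch; the rule name and comment syntax shown are only examples)

    def t_COMMENT(t):
        r' \#.* '
        pass       # No return value. Token is discarded.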
124 For faster lexing, you can also define this in terms of the ignore set like this:
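(a typical declaration; here spaces and tabs are ignored)

    t_ignore = ' \t'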
128 The characters in this string are ignored by the lexer. Use of this feature can speed
129 up parsing significantly since scanning will immediately proceed to the next token.
131 lex requires that the token returned by each rule has an attribute
132 t.type. Other than this, rules are free to return any kind of token
133 object that they wish and may construct a new type of token object
from the attributes of t (provided the new object has the required
attributes).
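For example, a rule could package the match in an object of its own
(the MyToken class here is purely illustrative):

    class MyToken:
        pass

    def t_NUMBER(t):
        r' \d+ '
        new = MyToken()
        new.type   = 'NUMBER'        # Required attribute
        new.value  = int(t.value)    # Convert the lexeme to an integer
        new.lineno = t.lineno
        return new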
137 If illegal characters are encountered, the scanner executes the
138 function t_error(t) where t is a token representing the rest of the
139 string that hasn't been matched. If this function isn't defined, a
140 LexError exception is raised. The .text attribute of this exception
141 object contains the part of the string that wasn't matched.
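For example, a driver could catch the exception and report the unmatched
text (a sketch; 'plex' is an initialized lexer and 'data' is the string
being scanned):

    try:
        plex.input(data)
        while 1:
            tok = plex.token()
            if not tok: break
    except lex.LexError, e:
        print "Scanning stopped at: %s" % e.text[:20]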
143 The t.skip(n) method can be used to skip ahead n characters in the
144 input stream. This is usually only used in the error handling rule.
For instance, the following rule would print an error message and
skip ahead one character:

    def t_error(t):
        print "Illegal character in input %s" % t.value[0]
        t.skip(1)
152 Of course, a nice scanner might wish to skip more than one character
153 if the input looks very corrupted.
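For example, an error rule might discard everything up to the next
whitespace character (one possible recovery strategy, shown only as a
sketch):

    def t_error(t):
        bad = t.value.split(None, 1)[0]   # Unmatched text up to the next whitespace
        print "Skipping malformed input: %s" % bad
        t.skip(len(bad))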
155 The lex module defines a t.lineno attribute on each token that can be used
156 to track the current line number in the input. The value of this
attribute is not modified by lex, so it is up to your lexer module
158 to correctly update its value depending on the lexical properties
of the input language. To do this, you might write a rule such as
the following:

    def t_newline(t):
        r' \n+ '
        t.lineno += t.value.count("\n")
166 To initialize your lexer so that it can be used, simply call the lex.lex()
167 function in your rule file. If there are any errors in your
168 specification, warning messages or an exception will be generated to
169 alert you to the problem.
To use the newly constructed lexer from another module, have the module
that defines the rules (call it 'lexer') build the scanner and bind it
to a name such as 'plex'. Other modules can then drive the scanner
directly:

    plex.input("position = initial + rate*60")
    while 1:
        token = plex.token()       # Get a token
        if not token: break        # No more tokens

Assuming that the module 'lexer' has initialized plex in this way,
parsing modules can safely import 'plex' without having to import the
rule file or any additional information about the scanner you have
defined.
190 # -----------------------------------------------------------------------------
import re, types, sys, copy, os
# Exception thrown when an invalid token is encountered and no default
# error handler (t_error rule) is defined
198 class LexError(Exception):
    def __init__(self,message,s):
        self.args = (message,)
        self.text = s              # Unmatched portion of the input

# Token class used to represent scanned tokens
class LexToken:
    def __repr__(self):
        return "LexToken(%s,%r,%d)" % (self.type,self.value,self.lineno)
    def skip(self,n):
        # Record a request to skip ahead n characters (used by error rules)
        try:
            self._skipn += n
        except AttributeError:
            self._skipn = n
215 # -----------------------------------------------------------------------------
218 # input() - Store a new string in the lexer
219 # token() - Get the next token
220 # -----------------------------------------------------------------------------
224 self.lexre = None # Master regular expression
225 self.lexdata = None # Actual input data (as a string)
226 self.lexpos = 0 # Current position in input text
227 self.lexlen = 0 # Length of the input text
228 self.lexindexfunc = [ ] # Reverse mapping of groups to functions and types
229 self.lexerrorf = None # Error rule (if any)
230 self.lextokens = None # List of valid tokens
231 self.lexignore = None # Ignored characters
232 self.lineno = 1 # Current line number
233 self.debug = 0 # Debugging mode
234 self.optimize = 0 # Optimized mode
235 self.token = self.errtoken
240 c.lexdata = self.lexdata
241 c.lexpos = self.lexpos
242 c.lexlen = self.lexlen
c.lexindexfunc = self.lexindexfunc
244 c.lexerrorf = self.lexerrorf
245 c.lextokens = self.lextokens
246 c.lexignore = self.lexignore
247 c.lineno = self.lineno
248 c.optimize = self.optimize
249 c.token = c.realtoken
251 # ------------------------------------------------------------
252 # input() - Push a new string into the lexer
253 # ------------------------------------------------------------
255 if not isinstance(s,types.StringType):
256 raise ValueError, "Expected a string"
260 self.token = self.realtoken
262 # Change the token routine to point to realtoken()
264 if token == self.errtoken:
267 # ------------------------------------------------------------
268 # errtoken() - Return error if token is called with no data
269 # ------------------------------------------------------------
271 raise RuntimeError, "No input string given with input()"
273 # ------------------------------------------------------------
274 # token() - Return the next token from the Lexer
276 # Note: This function has been carefully implemented to be as fast
# as possible. Don't make changes unless you really know what
# you are doing.
279 # ------------------------------------------------------------
281 # Make local copies of frequently referenced attributes
284 lexignore = self.lexignore
285 lexdata = self.lexdata
287 while lexpos < lexlen:
# Short-circuit scanning for whitespace, tabs, and other ignored characters
289 if lexdata[lexpos] in lexignore:
293 # Look for a regular expression match
294 m = self.lexre.match(lexdata,lexpos)
299 tok.value = m.group()
300 tok.lineno = self.lineno
302 func,tok.type = self.lexindexfunc[i]
307 # If token is processed by a function, call it
310 self.lineno = tok.lineno # Update line number
# Every function must return a token; if nothing is returned, we just move on to the next token
313 if not newtok: continue
315 # Verify type of the token. If not in the token map, raise an error
316 if not self.optimize:
317 if not self.lextokens.has_key(newtok.type):
318 raise LexError, ("%s:%d: Rule '%s' returned an unknown token type '%s'" % (
319 func.func_code.co_filename, func.func_code.co_firstlineno,
320 func.__name__, newtok.type),lexdata[lexpos:])
324 # No match. Call t_error() if defined.
327 tok.value = self.lexdata[lexpos:]
328 tok.lineno = self.lineno
332 newtok = self.lexerrorf(tok)
333 lexpos += getattr(tok,"_skipn",0)
335 # Error method didn't change text position at all. This is an error.
337 raise LexError, ("Scanning error. Illegal character '%s'" % (lexdata[lexpos]), lexdata[lexpos:])
338 if not newtok: continue
343 raise LexError, ("No match found", lexdata[lexpos:])
346 self.lexpos = lexpos + 1
350 # -----------------------------------------------------------------------------
353 # This checks to see if there are duplicated t_rulename() functions or strings
354 # in the parser input file. This is done using a simple regular expression
# match on each line of the given file.
356 # -----------------------------------------------------------------------------
358 def validate_file(filename):
360 base,ext = os.path.splitext(filename)
361 if ext != '.py': return 1 # No idea what the file is. Return OK
365 lines = f.readlines()
370 fre = re.compile(r'\s*def\s+(t_[a-zA-Z_0-9]*)\(')
371 sre = re.compile(r'\s*(t_[a-zA-Z_0-9]*)\s*=')
381 prev = counthash.get(name)
383 counthash[name] = linen
385 print "%s:%d: Rule %s redefined. Previously defined on line %d" % (filename,linen,name,prev)
390 # -----------------------------------------------------------------------------
391 # _read_lextab(module)
393 # Reads lexer table from a lextab file instead of using introspection.
394 # -----------------------------------------------------------------------------
396 def _read_lextab(lexer, fdict, module):
397 exec "import %s as lextab" % module
398 lexer.lexre = re.compile(lextab._lexre, re.VERBOSE)
399 lexer.lexindexfunc = lextab._lextab
400 for i in range(len(lextab._lextab)):
401 t = lexer.lexindexfunc[i]
404 lexer.lexindexfunc[i] = (fdict[t[0]],t[1])
405 lexer.lextokens = lextab._lextokens
406 lexer.lexignore = lextab._lexignore
407 if lextab._lexerrorf:
408 lexer.lexerrorf = fdict[lextab._lexerrorf]
410 # -----------------------------------------------------------------------------
413 # Build all of the regular expression rules from definitions in the supplied module
414 # -----------------------------------------------------------------------------
415 def lex(module=None,debug=0,optimize=0,lextab="lextab"):
422 lexer.optimize = optimize
426 if not isinstance(module, types.ModuleType):
427 raise ValueError,"Expected a module"
429 ldict = module.__dict__
432 # No module given. We might be able to get information from the caller.
436 e,b,t = sys.exc_info()
438 f = f.f_back # Walk out to our calling function
439 ldict = f.f_globals # Grab its globals dictionary
441 if optimize and lextab:
443 _read_lextab(lexer,ldict, lextab)
444 if not lexer.lexignore: lexer.lexignore = ""
453 tokens = ldict.get("tokens",None)
455 raise SyntaxError,"lex: module does not define 'tokens'"
456 if not (isinstance(tokens,types.ListType) or isinstance(tokens,types.TupleType)):
457 raise SyntaxError,"lex: tokens must be a list or tuple."
459 # Build a dictionary of valid token names
460 lexer.lextokens = { }
463 # Utility function for verifying tokens
def is_identifier(s):
    for c in s:
        if not (c.isalnum() or c == '_'): return 0
    return 1
470 if not is_identifier(n):
471 print "lex: Bad token name '%s'" % n
473 if lexer.lextokens.has_key(n):
474 print "lex: Warning. Token '%s' multiply defined." % n
475 lexer.lextokens[n] = None
477 for n in tokens: lexer.lextokens[n] = None
481 print "lex: tokens = '%s'" % lexer.lextokens.keys()
483 # Get a list of symbols with the t_ prefix
484 tsymbols = [f for f in ldict.keys() if f[:2] == 't_']
486 # Now build up a list of functions and a list of strings
490 if isinstance(ldict[f],types.FunctionType):
491 fsymbols.append(ldict[f])
492 elif isinstance(ldict[f],types.StringType):
493 ssymbols.append((f,ldict[f]))
495 print "lex: %s not defined as a function or string" % f
498 # Sort the functions by line number
499 fsymbols.sort(lambda x,y: cmp(x.func_code.co_firstlineno,y.func_code.co_firstlineno))
# Sort the strings by decreasing regular expression length (longest patterns are tried first)
502 ssymbols.sort(lambda x,y: (len(x[1]) < len(y[1])) - (len(x[1]) > len(y[1])))
504 # Check for non-empty symbols
505 if len(fsymbols) == 0 and len(ssymbols) == 0:
506 raise SyntaxError,"lex: no rules of the form t_rulename are defined."
508 # Add all of the rules defined with actions first
511 line = f.func_code.co_firstlineno
512 file = f.func_code.co_filename
516 if f.func_code.co_argcount > 1:
517 print "%s:%d: Rule '%s' has too many arguments." % (file,line,f.__name__)
521 if f.func_code.co_argcount < 1:
522 print "%s:%d: Rule '%s' requires an argument." % (file,line,f.__name__)
526 if f.__name__ == 't_ignore':
527 print "%s:%d: Rule '%s' must be defined as a string." % (file,line,f.__name__)
531 if f.__name__ == 't_error':
538 c = re.compile(f.__doc__, re.VERBOSE)
540 print "%s:%d: Invalid regular expression for rule '%s'. %s" % (file,line,f.__name__,e)
545 print "lex: Adding rule %s -> '%s'" % (f.__name__,f.__doc__)
547 # Okay. The regular expression seemed okay. Let's append it to the master regular
548 # expression we're building
550 if (regex): regex += "|"
551 regex += "(?P<%s>%s)" % (f.__name__,f.__doc__)
553 print "%s:%d: No regular expression defined for rule '%s'" % (file,line,f.__name__)
555 # Now add all of the simple rules
556 for name,r in ssymbols:
558 if name == 't_ignore':
563 if name == 't_error':
564 raise SyntaxError,"lex: Rule 't_error' must be defined as a function"
568 if not lexer.lextokens.has_key(name[2:]):
569 print "lex: Rule '%s' defined for an unspecified token %s." % (name,name[2:])
573 c = re.compile(r,re.VERBOSE)
575 print "lex: Invalid regular expression for rule '%s'. %s" % (name,e)
579 print "lex: Adding rule %s -> '%s'" % (name,r)
581 if regex: regex += "|"
582 regex += "(?P<%s>%s)" % (name,r)
585 for f in files.keys():
586 if not validate_file(f):
590 print "lex: regex = '%s'" % regex
591 lexer.lexre = re.compile(regex, re.VERBOSE)
593 # Build the index to function map for the matching engine
594 lexer.lexindexfunc = [ None ] * (max(lexer.lexre.groupindex.values())+1)
595 for f,i in lexer.lexre.groupindex.items():
597 if isinstance(handle,types.FunctionType):
598 lexer.lexindexfunc[i] = (handle,handle.__name__[2:])
# If the rule was specified as a string, there is no action function;
# just record the token type
602 lexer.lexindexfunc[i] = (None,f[2:])
604 # If a lextab was specified, we create a file containing the precomputed
605 # regular expression and index table
607 if lextab and optimize:
608 lt = open(lextab+".py","w")
609 lt.write("# %s.py. This file automatically created by PLY. Don't edit.\n" % lextab)
610 lt.write("_lexre = %s\n" % repr(regex))
lt.write("_lextab = [\n")
612 for i in range(0,len(lexer.lexindexfunc)):
613 t = lexer.lexindexfunc[i]
616 lt.write(" ('%s',%s),\n"% (t[0].__name__, repr(t[1])))
618 lt.write(" (None,%s),\n" % repr(t[1]))
623 lt.write("_lextokens = %s\n" % repr(lexer.lextokens))
624 lt.write("_lexignore = %s\n" % repr(lexer.lexignore))
625 if (lexer.lexerrorf):
626 lt.write("_lexerrorf = %s\n" % repr(lexer.lexerrorf.__name__))
628 lt.write("_lexerrorf = None\n")
632 print "lex: Fatal error. Unable to compile regular expression rules. %s" % e
635 raise SyntaxError,"lex: Unable to build lexer."
636 if not lexer.lexerrorf:
637 print "lex: Warning. no t_error rule is defined."
639 if not lexer.lexignore: lexer.lexignore = ""
641 # Create global versions of the token() and input() functions
647 # -----------------------------------------------------------------------------
650 # This runs the lexer as a main program
651 # -----------------------------------------------------------------------------
653 def runmain(lexer=None,data=None):
656 filename = sys.argv[1]
661 print "Reading from standard input (type EOF to end):"
662 data = sys.stdin.read()
677 print "(%s,'%s',%d)" % (tok.type, tok.value, tok.lineno)