package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// stateFunc is a single lexer state; it returns the next state, or nil to stop.
type stateFunc func(*lexer) stateFunc

type lexer struct {
	input       string     // the string being scanned
	start       int        // start position of the current token
	pos         int        // current position in the input
	width       int        // width of the last rune read
	tokenStream chan Token // channel on which tokens are delivered
}

// run drives the state machine until a state returns nil, then closes the token stream.
func (l *lexer) run() {
	for state := lexCommand; state != nil; {
		state = state(l)
	}
	close(l.tokenStream)
}

// emit sends the input between start and pos as a token of type t.
func (l *lexer) emit(t TokenType) {
	l.tokenStream <- Token{
		typ: t,
		val: l.input[l.start:l.pos],
	}
	l.start = l.pos
}

// errorf emits an error token and returns nil to terminate the state machine.
func (l *lexer) errorf(format string, args ...interface{}) stateFunc {
	l.tokenStream <- Token{
		typ: TokenErr,
		val: fmt.Sprintf(format, args...),
	}
	return nil
}

const eof rune = -1

// next consumes and returns the next rune, or eof at the end of the input.
func (l *lexer) next() rune {
	if l.pos >= len(l.input) {
		l.width = 0
		return eof
	}
	var r rune
	r, l.width = utf8.DecodeRuneInString(l.input[l.pos:])
	l.pos += l.width
	return r
}

// backup steps back over the last rune returned by next.
func (l *lexer) backup() {
	l.pos -= l.width
}

// ignore discards the input consumed since the last emit.
func (l *lexer) ignore() {
	l.start = l.pos
}

// reset rewinds pos to the start of the pending token.
func (l *lexer) reset() {
	l.pos = l.start
}

// expect consumes the runes of valid in order; on the first mismatch it backs
// up over the mismatching rune only and returns false.
func (l *lexer) expect(valid string) bool {
	for _, r := range valid {
		if l.next() != r {
			l.backup()
			return false
		}
	}
	return true
}

// peek returns the next rune without consuming it.
func (l *lexer) peek() rune {
	w := l.width
	r := l.next()
	l.backup()
	l.width = w
	return r
}

// accept consumes the next rune if it appears in valid.
func (l *lexer) accept(valid string) bool {
	if strings.IndexRune(valid, l.next()) >= 0 {
		return true
	}
	l.backup()
	return false
}

// acceptAll consumes a run of runes from valid.
func (l *lexer) acceptAll(valid string) {
	for strings.IndexRune(valid, l.next()) >= 0 {
	}
	l.backup()
}

// acceptPassing consumes the next rune if it satisfies valid.
func (l *lexer) acceptPassing(valid func(rune) bool) bool {
	if valid(l.next()) {
		return true
	}
	l.backup()
	return false
}

// acceptAllPassing consumes a run of runes satisfying valid.
func (l *lexer) acceptAllPassing(valid func(rune) bool) {
	for valid(l.next()) {
	}
	l.backup()
}

type TokenType int

const (
	TokenErr TokenType = iota // Lexing error
	TokenEOF // end of file
	TokenSemicolon // ;
	TokenLParen // (
	TokenRParen // )
	TokenLBrace // {
	TokenRBrace // }
	TokenLBrack // [
	TokenRBrack // ]
	TokenCommand // A command character
	TokenHash // #
	TokenAt // @
	TokenDot // .
	TokenAst // *
	TokenBar // |
	TokenOr // ||
	TokenAnd // &&
	TokenHat // ^
	TokenDollar // $
	TokenQuestion // ?
	TokenHatDollar // ^$
	TokenExclamation // !
	TokenTilde // ~
	TokenDoubleQuote // "
	TokenStringLiteral // A string literal, not including the " either side
	TokenNullLiteral // null
	TokenTrueLiteral // true
	TokenFalseLiteral // false
	TokenColon // :
	TokenComma // ,
	TokenSubstituteDelimiter // usually / but could be something else
	TokenSubstitutePlaceholder // \1, \2 etc.
	TokenTerminalLiteral // One of {, }, [, ]
	TokenNumberLiteral // A number literal
	TokenPatternStringIndex // A string index in a pattern
	TokenPatternIntegerIndex // An integer index in a pattern
	TokenSubex // A subex
)

type Token struct {
	typ TokenType
	val string
}

func (t Token) String() string {
	switch t.typ {
	case TokenEOF:
		return "EOF"
	case TokenErr:
		return t.val
	}
	if len(t.val) > 10 {
		return fmt.Sprintf("%.10q...", t.val)
	}
	return fmt.Sprintf("%q", t.val)
}

// Lex starts lexing input in a new goroutine and returns the channel on which
// tokens are delivered. The channel is closed after TokenEOF or TokenErr.
func Lex(input string) chan Token {
	l := &lexer{
		input:       input,
		tokenStream: make(chan Token),
	}
	go l.run()
	return l.tokenStream
}

const (
	whitespace         string = " \t"
	whitespaceNewlines string = " \t\r\n"
)

func isAlpha(r rune) bool {
	return ('a' <= r && r <= 'z') || ('A' <= r && r <= 'Z')
}

func isDigit(r rune) bool {
	return '0' <= r && r <= '9'
}

func isAlphaNumeric(r rune) bool {
	return isAlpha(r) || isDigit(r)
}

func isStringIndexChar(r rune) bool {
	return isAlphaNumeric(r) || r == '_' || r == '-'
}

// lexCommand lexes a single command character or operator.
func lexCommand(l *lexer) stateFunc {
	l.acceptAll(whitespace)
	l.ignore()
	if l.peek() == eof {
		l.emit(TokenEOF)
		return nil
	}
	r := l.next()
	switch r {
	case '#':
		l.emit(TokenHash)
		lexPatternStringIndex(l)
		return lexCommand
	case '@':
		l.emit(TokenAt)
		lexPatternIntegerIndex(l)
		return lexCommand
	case '.':
		l.emit(TokenDot)
		return lexCommand
	case '*':
		l.emit(TokenAst)
		return lexCommand
	case '|':
		if l.accept("|") {
			l.emit(TokenOr)
		} else {
			l.emit(TokenBar)
		}
		return lexCommand
	case '[':
		l.emit(TokenLBrack)
		return lexCommand
	case ']':
		l.emit(TokenRBrack)
		return lexCommand
	case '(':
		l.emit(TokenLParen)
		return lexCommand
	case ')':
		l.emit(TokenRParen)
		return lexCommand
	case '?':
		l.emit(TokenQuestion)
		return lexCommand
	case '{':
		l.emit(TokenLBrace)
		return lexCommand
	case '}':
		l.emit(TokenRBrace)
		return lexCommandEnd
	case '&':
		if l.accept("&") {
			l.emit(TokenAnd)
			return lexCommand
		}
	case '^':
		if l.accept("$") {
			l.emit(TokenHatDollar)
		} else {
			l.emit(TokenHat)
		}
		return lexCommand
	case '$':
		l.emit(TokenDollar)
		return lexCommand
	case '!':
		l.emit(TokenExclamation)
		return lexCommand
	case '~':
		l.emit(TokenTilde)
		return lexCommand
	case 'i':
		l.emit(TokenCommand)
		return lexMultipleLiterals
	case 's':
		l.emit(TokenCommand)
		return lexSubstitution
	case 'S':
		l.emit(TokenCommand)
		return lexBigSubstitution
	}
	if isAlpha(r) {
		l.emit(TokenCommand)
		return lexCommandEnd
	}
	return l.errorf("Expected command, found something else")
}

// lexSubstitution lexes the delimited subex of an s command.
func lexSubstitution(l *lexer) stateFunc {
	delimiter := l.next()
	if delimiter == eof {
		return l.errorf("Missing subex in substitution command")
	}
	l.emit(TokenSubstituteDelimiter)
loop:
	for {
		r := l.next()
		switch r {
		case delimiter:
			l.backup()
			l.emit(TokenSubex)
			l.next()
			l.emit(TokenSubstituteDelimiter)
			break loop
		case eof:
			return l.errorf("Missing closing substitution delimiter")
		default:
		}
	}
	return lexCommand
}

// lexBigSubstitution lexes the two delimited sections of an S command.
func lexBigSubstitution(l *lexer) stateFunc {
	delimiter := l.next()
	if delimiter == eof || isAlphaNumeric(delimiter) {
		return l.errorf("Invalid delimiter for big substitution")
	}
	l.emit(TokenSubstituteDelimiter)
loop:
	for {
		r := l.next()
		switch r {
		case delimiter:
			l.emit(TokenSubstituteDelimiter)
			break loop
		case eof:
			return l.errorf("Missing closing substitution delimiter")
		case '#':
			l.emit(TokenHash)
			lexPatternStringIndex(l)
		case '@':
			l.emit(TokenAt)
			lexPatternIntegerIndex(l)
		case '.':
			l.emit(TokenDot)
		case '*':
			l.emit(TokenAst)
		case '|':
			l.emit(TokenBar)
		case '[':
			l.emit(TokenLBrack)
		case ']':
			l.emit(TokenRBrack)
		case '?':
			l.emit(TokenQuestion)
		case ':':
			l.emit(TokenColon)
		case ',':
			l.emit(TokenComma)
		}
	}
loop2:
	for {
		r := l.next()
		switch r {
		case delimiter:
			l.emit(TokenSubstituteDelimiter)
			break loop2
		case eof:
			return l.errorf("Missing closing substitution delimiter")
		case '\\':
			if !l.acceptPassing(isDigit) {
				return l.errorf("Expected digit after \\")
			}
			l.emit(TokenSubstitutePlaceholder)
		}
	}
	// TODO: No clue where I was going with this
	return lexCommand
}

// lexMultipleLiterals lexes a comma/colon separated list of literals after an i command.
func lexMultipleLiterals(l *lexer) stateFunc {
	l.acceptAll(whitespaceNewlines)
	l.ignore()
	r := l.next()
	switch r {
	case ';', eof:
		l.backup()
		return lexCommandEnd
	case ':':
		l.emit(TokenColon)
		return lexMultipleLiterals
	case ',':
		l.emit(TokenComma)
		return lexMultipleLiterals
	}
	// Not a separator: back up so lexSingleLiteral sees the first rune of the literal.
	l.backup()
	err := lexSingleLiteral(l)
	if err != "" {
		return l.errorf(err)
	}
	return lexMultipleLiterals
}

// lexSingleLiteral lexes one literal, returning an error message or "" on success.
func lexSingleLiteral(l *lexer) string {
	l.acceptAll(whitespaceNewlines)
	l.ignore()
	r := l.next()
	switch r {
	case '"':
		l.emit(TokenDoubleQuote)
		if !lexStringLiteral(l) {
			return "Expected closing \""
		}
	case 'n':
		if !l.expect("ull") {
			return "Invalid literal, expected null"
		}
		l.emit(TokenNullLiteral)
	case 't':
		if !l.expect("rue") {
			return "Invalid literal, expected true"
		}
		l.emit(TokenTrueLiteral)
	case 'f':
		if !l.expect("alse") {
			return "Invalid literal, expected false"
		}
		l.emit(TokenFalseLiteral)
	case '{', '}', '[', ']':
		l.emit(TokenTerminalLiteral)
	default:
		if isDigit(r) {
			lexNumberLiteral(l)
			return ""
		}
		return "Invalid literal"
	}
	return ""
}

// lexNumberLiteral lexes the rest of a number literal; the first digit has
// already been consumed by the caller.
func lexNumberLiteral(l *lexer) {
	l.acceptAllPassing(isDigit)
	if l.accept(".") {
		l.acceptAllPassing(isDigit)
	}
	l.emit(TokenNumberLiteral)
}

// lexStringLiteral lexes a string literal up to and including the closing ",
// returning false if the input ends first.
// TODO: escape characters
func lexStringLiteral(l *lexer) bool {
	for {
		r := l.next()
		switch r {
		case '"':
			l.backup()
			l.emit(TokenStringLiteral)
			l.next()
			l.emit(TokenDoubleQuote)
			return true
		case eof:
			return false
		}
	}
}

func lexPatternStringIndex(l *lexer) {
	l.acceptAllPassing(isStringIndexChar)
	l.emit(TokenPatternStringIndex)
}

func lexPatternIntegerIndex(l *lexer) {
	l.acceptAllPassing(isDigit)
	l.emit(TokenPatternIntegerIndex)
}

// lexCommandEnd lexes closing braces after a command, or EOF.
func lexCommandEnd(l *lexer) stateFunc {
	if l.peek() == eof {
		l.emit(TokenEOF)
		return nil
	}
	if l.accept("}") {
		l.emit(TokenRBrace)
		return lexCommandEnd
	}
	return lexCommand
}
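
// The function below is not part of the original lexer: it is a minimal usage
// sketch showing how the channel-based API is consumed. The name collectTokens
// is hypothetical.
func collectTokens(input string) []Token {
	// Lex runs the state machine in its own goroutine; ranging over the
	// returned channel ends once run() closes it, which happens after the
	// final TokenEOF or TokenErr has been emitted.
	var tokens []Token
	for tok := range Lex(input) {
		tokens = append(tokens, tok)
	}
	return tokens
}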