A naïve attempt at parsing CSS in JavaScript – Part 3

2020-02-10

So far, the basics of tokenising CSS are working, based on two types of token: comments and plain text. Plain text is not a token from the standard; it is a catch-all for everything that cannot be categorised at this point. In other words, everything apart from comments (and, after this post, whitespace).

The next type of token to consume is the whitespace token. With the current implementation, the following occurs:

const result = parse(`/* comment */
p { font-weight: bold; }`);
// result = [
//   {t:"comment",v:"/* comment */"},
//   {t:"plain",v:"\np { font-weight: bold; }"}
// ]

The whitespace characters, \n and the spaces (ignoring those inside the comment), are included as part of the ‘plain’ token. Along with tab characters, they should be separated out into whitespace tokens. Here is an update to the parsing algorithm that attempts that:

const parse = cssStr => {
  const defaultTokenType = 'plain';
  const tokens = [];
  let tokenType = 'plain';
  let cache = '';
  for (let i = 0; i < cssStr.length; i++) {
    let newTokenType = tokenType;
    const current = cssStr[i];
    const next = cssStr[i + 1];
    const last = cssStr[i - 1];
    if (tokenType !== 'comment') {
      if (current === '/' && next === '*') {
        newTokenType = 'comment';
      } else if (current === ' ' || current === '\n' || current === '\t') {
        newTokenType = 'whitespace';
      } else {
        newTokenType = defaultTokenType;
      }
    }
    if (tokenType !== newTokenType) {
      if (cache) {
        tokens.push({ t: tokenType, v: cache });
        cache = '';
      }
      tokenType = newTokenType;
    }
    cache += current;
    if (tokenType === 'comment' && last === '*' && current === '/') {
      tokens.push({ t: tokenType, v: cache });
      cache = '';
      tokenType = defaultTokenType;
    }
  }
  // Flush whatever is still cached when the input ends (here, the closing brace).
  if (cache) {
    tokens.push({ t: tokenType, v: cache });
  }
  return tokens;
};

console.log(parse(`/* c */
p { color: blue; }`));

// Result:
// [
//   {t:"comment",v:"/* c */"},
//   {t:"whitespace",v:"\n"},
//   {t:"plain",v:"p"},
//   {t:"whitespace",v:" "},
//   {t:"plain",v:"{"},
//   {t:"whitespace",v:" "},
//   {t:"plain",v:"color:"},
//   {t:"whitespace",v:" "},
//   {t:"plain",v:"blue;"},
//   {t:"whitespace",v:" "},
//   {t:"plain",v:"}"}
// ]
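
The whitespace check could also be pulled out into a small helper. What follows is only a sketch: isWhitespace and splitWhitespace are hypothetical names that are not part of the parser in this post, and the input is assumed to contain no comments. It also accepts \r and \f, because the CSS Syntax specification converts those to \n in a preprocessing step that this series has not implemented.

```javascript
// Hypothetical helper, not part of the parser above.
// CSS whitespace is space, tab and newline. \r and \f are accepted as well,
// on the assumption that no preprocessing has turned them into \n yet.
const isWhitespace = ch =>
  ch === ' ' || ch === '\t' || ch === '\n' || ch === '\r' || ch === '\f';

// Split a comment-free string into alternating 'whitespace' and 'plain' runs.
const splitWhitespace = str => {
  const tokens = [];
  for (const ch of str) {
    const t = isWhitespace(ch) ? 'whitespace' : 'plain';
    const lastToken = tokens[tokens.length - 1];
    if (lastToken && lastToken.t === t) {
      // Same type as the previous character: extend the current run.
      lastToken.v += ch;
    } else {
      // Type changed (or first character): start a new token.
      tokens.push({ t, v: ch });
    }
  }
  return tokens;
};

const input = 'p {\tcolor:  blue; }';
const runs = splitWhitespace(input);
console.log(runs);
// Joining the values back together should reproduce the input exactly.
console.log(runs.map(token => token.v).join('') === input);
```

Because each token is extended in place, nothing is left behind in a cache when the input ends, and consecutive whitespace characters collapse into a single whitespace token.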

The next step, in the next post, is to consume strings.