A naïve attempt at parsing CSS in JavaScript – Part 1

2020-02-06

All I want to be able to do is colour the CSS on this website by understanding when something is a selector, when it is a property and when it is a rule. The CSS specifications have a section on tokenizing and parsing CSS.

For this project, I will jump right over the input byte stream and preprocessing sections and look at tokenization.

The first codeable instruction is to ‘repeatedly consume a token until’ the end of the file is reached. For my project, that means looping through the characters in a string, so I’ll start with that:

const highlight = cssStr => {
  for (let i = 0; i < cssStr.length; i++) {
    // consume characters
  }
  return '';
}

The tokens then need to be consumed. The specification describes how to consume comments, which I interpret like this:

const highlight = cssStr => {
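  const state = { isComment: false, commentStart: -1 };
  for (let i = 0; i < cssStr.length; i++) {
    // Consume comments: '/*' opens one, the next '*/' closes it
    if (!state.isComment && (cssStr[i] === '/') && (cssStr[i+1] === '*')) {
      state.isComment = true;
      state.commentStart = i;
    }
    // The closing '*' must come after the opening '/*', so '/*/' stays open
    else if (state.isComment && (i - state.commentStart >= 3) && (cssStr[i-1] === '*') && (cssStr[i] === '/')) {
      state.isComment = false;
    }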
  }
  return;
}
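
For comparison, here is a minimal sketch of the same idea that collects tokens instead of only tracking state. The name `tokenizeComments` and the token shapes (`comment` and `other`) are assumptions made for illustration. The specification itself consumes comments without emitting a token for them, so this is closer to what a highlighter needs than to what the specification prescribes.

```javascript
// Hypothetical helper: splits a CSS string into 'comment' and 'other' tokens.
const tokenizeComments = cssStr => {
  const tokens = [];
  let i = 0;
  while (i < cssStr.length) {
    if (cssStr.startsWith('/*', i)) {
      // Look for the closing '*/' strictly after the opening '/*'
      const close = cssStr.indexOf('*/', i + 2);
      // An unterminated comment runs to the end of the input
      const end = close === -1 ? cssStr.length : close + 2;
      tokens.push({ type: 'comment', value: cssStr.slice(i, end) });
      i = end;
    } else {
      // Everything up to the next comment is a single 'other' token for now
      const next = cssStr.indexOf('/*', i);
      const end = next === -1 ? cssStr.length : next;
      tokens.push({ type: 'other', value: cssStr.slice(i, end) });
      i = end;
    }
  }
  return tokens;
};

console.log(tokenizeComments('/* comment */\np { margin-bottom: 1em; }'));
```

Searching for the closing `*/` from two characters past the opening `/*` means the string `/*/` is treated as an unterminated comment rather than a complete one.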

Technically, I should be collecting tokens. However, I just want to know where I should insert tags into the CSS to colour-code the different types of syntax. For example, focusing just on comments, I would want this:

/* comment */
p { margin-bottom: 1em; }

To turn into this:

<span class="comment">/* comment */</span>
p { margin-bottom: 1em; }

Here’s an attempt at that in code:

const highlight = cssStr => {
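  const state = { isComment: false, commentStart: -1 };
  let output = '';
  for (let i = 0; i < cssStr.length; i++) {
    // Consume comments: open the span before the '/' of '/*'
    if (!state.isComment && (cssStr[i] === '/') && (cssStr[i+1] === '*')) {
      state.isComment = true;
      state.commentStart = i;
      output += '<span class="comment">';
    }
    output += cssStr[i];
    // Close the span after the '/' of '*/'; the '*' must follow the opening '/*'
    if (state.isComment && (i - state.commentStart >= 3) && (cssStr[i-1] === '*') && (cssStr[i] === '/')) {
      state.isComment = false;
      output += '</span>';
    }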
  }
  return output;
}
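
One edge case the loop above does not cover is a comment that never closes: the opening span would be left open at the end of the output. A minimal sketch of an alternative follows. The function name `highlightComments` is hypothetical. It uses `indexOf` to find comment boundaries instead of a character-by-character loop, and it closes the span at the end of the input.

```javascript
// Hypothetical alternative: wrap every comment in a span, even an unterminated one.
const highlightComments = cssStr => {
  let output = '';
  let i = 0;
  while (i < cssStr.length) {
    const open = cssStr.indexOf('/*', i);
    if (open === -1) {
      // No more comments: copy the rest of the input unchanged
      output += cssStr.slice(i);
      break;
    }
    // Search for '*/' only after the opening '/*'
    const close = cssStr.indexOf('*/', open + 2);
    // If the comment never closes, treat the rest of the input as comment
    const end = close === -1 ? cssStr.length : close + 2;
    output += cssStr.slice(i, open);
    output += '<span class="comment">' + cssStr.slice(open, end) + '</span>';
    i = end;
  }
  return output;
};

console.log(highlightComments('/* comment */\np { margin-bottom: 1em; }'));
// <span class="comment">/* comment */</span>
// p { margin-bottom: 1em; }
```

This version still copies the CSS into the output as-is, so it assumes the input contains no characters that need escaping for HTML.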

But when I am reading the HTML characters, I’m actually going to be encountering HTML-encoded syntax. That is, > will look like &gt;. That’s the challenge for tomorrow.