Parsing Roman Numerals with PEG.js

on Javascript,

As part of a side-project, I was occupied by the idea of parsing roman numerals accurately.

Parser Expression Grammars (PEGs)

PEGs are very similar to Context-Free Grammars (CFGs). One important difference is that a PEG is deterministic, while CFGs are non-deterministic.

For the purpose of parsing roman numerals, a grammar is useful because not all sequences of characters are valid. Furthermore, validating a roman numeral does not require knowing the exact values of the individual symbols, just the valid orders they can appear in. This makes it easy to separate the concerns between parsing and evaluating a roman numeral.

In our case, PEG.js is used because of two benefits:

  1. There’s no separate lexer and parser, which is convenient.

  2. We can define the parse results in Javascript, which lets us evaluate the integer value at each step when patterns are matched.

Parsing Integers

First, let’s look over how to parse and evaluate an integer. How would you do it if you can only look at one digit at a time?

An integer is composed of digits for units, tens, hundreds, thousands, and so on. Since we parse from left to right, the first time look at a digit, we don’t know yet which place value the digit is for.

Using a recursive definition, we can define a decimal number to be a digit and the rest of the number, or nothing if it’s the last digit. This recursive method is very similar to a right fold, where the base case is the right-most number, and at every higher place number, we accumulate the extra digit multiplied by it’s place value.

/**
 * Grammar for Numbers
 * ===================
 * Paste this into https://pegjs.org/online to give it a whirl!
 */


Number
  = d:Decimal { return d.value }

Decimal
  = digit:[0-9] acc:(Decimal?) {
    const dValue = parseInt(digit, 10)
    const result = {}
    if(acc !== null) {
      /**
       * Recurrence: this is from the tens onwards
       * result.order = 10 * previous (ones -> tens, tens -> hundreds, etc)
       * result.value accumulates the current digit * order of magnitude
       */
      result.order = acc.order * 10
      result.value = acc.value + (dValue * result.order)
    } else {
      /**
       * Base case: this is the last decimal digit
       * result.order = 1 (ones)
       * result.value is the parsed digit value
       */
      result.order = 1
      result.value = dValue
    }
    return result
  }

There is actually a much simpler definition for integers, but the above example illustrates how a recursive solution should work.

/**
 * Shortest definition for Decimal
 */

Decimal
  = [0-9]+ { return parseInt(text(), 10); }

The simple case

To start off, here’s all the symbols and their corresponding decimal value.

Symbol Value
I 1
V 5
X 10
L 50
C 100
D 500
M 1000

Here are some valid roman numerals and their values.

Roman Numeral Value
I 1
II 2
III 3
IV 4
V 5
VIII 8
XX 20
XL 40
XC 90
CM 900
MMM 3000

Mapping roman characters to values

We can start by creating named rules that match the relevant roman characters. A case-insensitive regular expression allows us to match X and x, then a return value allows us to assign the value 10 as the meaning of X when the rule is matched.

RomanM
  = "M"i { return 1000; }

RomanD
  = "D"i { return 500; }

RomanC
  = "C"i { return 100; }

RomanL
  = "L"i { return 50; }

RomanX
  = "X"i { return 10; }

RomanV
  = "V"i { return 5; }

RomanI
  = "I"i { return 1; }

For example, if we wanted to make a parser that reads individual roman characters and converts their values to integers, we can make it like so:

/**
 * Grammar for Roman Numeral Symbols
 * =================================
 */

Expression
  = RomanM
  / RomanD
  / RomanC
  / RomanL
  / RomanX
  / RomanV
  / RomanI

Place numbers in roman numerals

If we examine a roman numeral for a large number, such as 3999 which makes MMMCMXCIX, we can split the roman numeral to it’s individual place numbers, MMM for thousands, CM for hundreds, XC for tens, and IX for units.

Since the place numbers are independent of each other, we can define a roman numeral to simply be the sum of each place value matched by the parser.

RomanNumeral
  = i:Thousands j:Hundreds k:Tens l:Units { return i + j + k + l; }

Thousands

The thousands digit is the simplest, since it’s just M all the way. A valid roman numeral can have as many of them to represent large numbers.

Thousands
  = i:RomanM* { return i.length * 1000 }

Units

There are three general ways to represent the units in a roman numeral:

  1. Simple: for 0-3, it’s I - III.
  2. Addition: for 5-8, it’s V - VIII, 0-3 is added to 5 by placings the I’s after V.
  3. Deduction: for 4 and 9, I is deducted from V or X by placing it before the larger symbol, i.e. IV and IX.

Here’s how to define the grammar for units:

Units
  = UnitsDeduction
  / UnitsAddition
  / UnitsSimple

UnitsDeduction
  = i:RomanI j:(RomanX / RomanV) {
    return j - i;
  }

UnitsAddition
  = v:RomanV s:(UnitsSimple) {
    return v + s;
  }

UnitsSimple
  = i:(RomanI? RomanI? RomanI?) {
    var value = 0;
    for(var x=0; x < i.length; x++) {
      if(i[x]) {
        value += i[x];
      }
    }
    return value;
  }

Hundreds and Tens

Having defined the units, the grammar for parsing tens and hundreds are almost identical. Unfortunately, it’s not possible to dynamically generate rules similar to Units, so we’re going to have to rewrite some rules. Fortunately, when writing the grammar, repetitions can be beneficial, since it make rules explicitly clear about their meaning.

Final Result

/**
 * Grammar for Roman Numerals
 * ==========================
 * Paste this into https://pegjs.org/online to give it a whirl!
 */


RomanNumeral
  = a:Thousands b:Hundreds c:Tens d:Units { return a + b + c + d; }

Thousands
  = i:RomanM* { return i.length * 1000 }

Hundreds
  = HundredsDeduction
  / HundredsAddition
  / HundredsSimple

HundredsDeduction
  = i:RomanC j:(RomanM / RomanD) {
    return j - i;
  }

HundredsAddition
  = d:RomanD s:(HundredsSimple) {
    return d + s;
  }

HundredsSimple
  = i:(RomanC? RomanC? RomanC?) {
    var value = 0;
    for(var x=0; x < i.length; x++) {
      if(i[x]) {
        value += i[x];
      }
    }
    return value;
  }

Tens
  = TensDeduction
  / TensAddition
  / TensSimple

TensDeduction
  = i:RomanX j:(RomanC / RomanL) {
    return j - i;
  }

TensAddition
  = l:RomanL s:(TensSimple) {
    return l + s;
  }

TensSimple
  = i:(RomanX? RomanX? RomanX?) {
    var value = 0;
    for(var x=0; x < i.length; x++) {
      if(i[x]) {
        value += i[x];
      }
    }
    return value;
  }

Units
  = UnitsDeduction
  / UnitsAddition
  / UnitsSimple

UnitsDeduction
  = i:RomanI j:(RomanX / RomanV) {
    return j - i;
  }

UnitsAddition
  = v:RomanV s:(UnitsSimple) {
    return v + s;
  }

UnitsSimple
  = i:(RomanI? RomanI? RomanI?) {
    var value = 0;
    for(var x=0; x < i.length; x++) {
      if(i[x]) {
        value += i[x];
      }
    }
    return value;
  }

RomanM
  = "M"i { return 1000; }

RomanD
  = "D"i { return 500; }

RomanC
  = "C"i { return 100; }

RomanL
  = "L"i { return 50; }

RomanX
  = "X"i { return 10; }

RomanV
  = "V"i { return 5; }

RomanI
  = "I"i { return 1; }