charset=Shift_JIS will fail to parse URI.js with a syntax error caused by non-ASCII characters in RegExp #415

@codefactor

Steps to Reproduce:

  1. Serve an HTML page whose response header is content-type: text/html;charset=Shift_JIS
  2. Include URI.js with a script tag, using a compressed build of URI.js (I have not checked whether the uncompressed build has the same issue)

Unfortunately I can't find an easy way to provide a link for this, but if necessary I could produce one, perhaps with codesandbox.

Expected:

The JavaScript include should run, and there should be no errors in the console.

Actual:

The JavaScript fails to parse, with this error in the console:

Uncaught SyntaxError: Unexpected token ':'

Root Cause:

There are non-ASCII characters inside regular expression literals in a couple of places, for example:

URI.find_uri_expression = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;

The non-ASCII characters «»“”‘’ are decoded differently when the HTML page is served with charset=Shift_JIS in the response header. As a result the regular expression is not closed properly and runs into the following lines, producing a syntax error in the middle of the code that follows. Firefox and Chrome behave the same way; I have not checked Edge.
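To illustrate the effect described above, here is a minimal sketch that decodes the UTF-8 bytes of the end of that character class as Shift_JIS. It assumes the file is stored as UTF-8 and that the runtime's TextDecoder supports the "shift_jis" label (current browsers and Node.js builds with full ICU should). The helper name decodeUtf8BytesAs and the variable names are made up for this example and are not part of URI.js.

```javascript
// Sketch: re-decode UTF-8 bytes under another charset, which is roughly what
// happens when a script without its own charset inherits the page's Shift_JIS.
// decodeUtf8BytesAs is a hypothetical helper, not a URI.js function.
function decodeUtf8BytesAs(label, text) {
  var bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return new TextDecoder(label).decode(bytes);
}

// End of the character class, written with escapes so this snippet stays ASCII.
var classTail = "?\xab\xbb\u201c\u201d\u2018\u2019]";

var asUtf8 = decodeUtf8BytesAs("utf-8", classTail);
var asShiftJis = decodeUtf8BytesAs("shift_jis", classTail);

// Decoded as UTF-8, the closing bracket is still there.
console.log(asUtf8.indexOf("]") !== -1);
// Decoded as Shift_JIS, the last byte of the final curly quote is treated as a
// lead byte and pairs with "]", so the bracket no longer appears.
console.log(asShiftJis.indexOf("]") !== -1);
```

With the closing bracket gone, the character class (and so the RegExp literal) stays open, which is consistent with the parser reporting an unexpected token further down.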

Proposed solution:

Avoid literal non-ASCII characters, which are unsafe when the page charset changes. Instead, build the RegExp from a string in which those characters are escaped:

  URI.find_uri_expression = new RegExp("\\b((?:[a-z][\\w-]+:(?:\\/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}\\/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))", "ig");

One other place:

trim: /[`!()\[\]{};:'".,<>?«»“”„‘’]+$/,

Could update to this:

    trim: new RegExp("[`!()\\[\\]{};:'\".,<>?\xab\xbb\u201c\u201d\u201E\u2018\u2019]+$"),
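As a possible alternative to new RegExp with a string, the literal form can be kept and the same characters written as \x and \u escapes inside it, which also leaves the source ASCII-only. The sketch below compares the two forms on a made-up sample string; the variable names are hypothetical and this is not taken from the URI.js code base.

```javascript
// Form proposed above: pattern built from an ASCII-only string.
var trimFromString = new RegExp("[`!()\\[\\]{};:'\".,<>?\xab\xbb\u201c\u201d\u201E\u2018\u2019]+$");

// Alternative sketch: a regex literal with the same characters as escapes.
var trimLiteral = /[`!()\[\]{};:'".,<>?\xab\xbb\u201c\u201d\u201E\u2018\u2019]+$/;

// Hypothetical sample: a URL followed by a closing curly quote and a period.
var sample = "http://example.org/path\u201d.";

var viaString = sample.replace(trimFromString, "");
var viaLiteral = sample.replace(trimLiteral, "");

// Both forms should strip the trailing punctuation and agree with each other.
console.log(viaString === "http://example.org/path");
console.log(viaString === viaLiteral);
```

Either form avoids raw non-ASCII bytes in the file, so the result no longer depends on which charset the browser uses to decode the script.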

These are two places; there may be more.
