Count UTF-8 Bytes in a Form using JavaScript?

Sure! Why not?

Sometimes your form input has limited storage on the server--such as a varchar field in an older Oracle database. If you're using UTF-8 in both your forms and your database, you might want to check whether the user typed more bytes than the field can hold. Here's some JavaScript that does this using a trick.


Here's the JavaScript

  <script type="text/javascript">
     function checkLength() {
        var countMe = document.getElementById("someText").value;
        // Each non-ASCII byte becomes a three-character "%XX" escape;
        // everything left unescaped is a single ASCII byte.
        var escapedStr = encodeURI(countMe);
        var count;
        if (escapedStr.indexOf("%") != -1) {
            count = escapedStr.split("%").length - 1;   // one "%" per escaped byte
            count += escapedStr.length - (count * 3);   // plus the unescaped ASCII characters
        } else {
            count = escapedStr.length;                  // pure ASCII: one byte per character
        }
        alert(escapedStr + ": size is " + count);
     }
  </script>
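
To try it, you need a text field with the id the script looks for ("someText") and something to trigger the function. This markup is just a sketch to pair with the script above:

  <form action="#">
     <input type="text" id="someText" size="40" />
     <input type="button" value="Count Bytes" onclick="checkLength()" />
  </form>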
  

But WHY?!?

No sooner did I write this than everyone seemed to need it. At IUC 29 this week (2006-03-06), one of the sessions turned into a lively discussion of counting bytes vs. characters and why you'd need to do it. Basically the problem is that UTF-8 is a multibyte encoding, so a character can take anywhere from one to four bytes--they're not all the same size. Database fields, however, are typically allocated in terms of bytes. So measuring the number of UTF-8 bytes in the input tells you whether you've run over the buffer limit in the database.
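
If the escaping trick seems too clever, you can compute the same number directly from the character codes. Here's a sketch (the function name utf8ByteCount is mine, not anything standard). JavaScript strings are UTF-16, so a supplementary character shows up as a surrogate pair and has to be counted as four bytes:

  <script type="text/javascript">
     function utf8ByteCount(str) {
        var bytes = 0;
        for (var i = 0; i < str.length; i++) {
           var c = str.charCodeAt(i);
           if (c < 0x80) {
              bytes += 1;               // ASCII
           } else if (c < 0x800) {
              bytes += 2;               // U+0080..U+07FF
           } else if (c >= 0xD800 && c <= 0xDBFF) {
              bytes += 4;               // surrogate pair: a supplementary character
              i++;                      // skip the trailing low surrogate
           } else {
              bytes += 3;               // the rest of the BMP, e.g. most CJK
           }
        }
        return bytes;
     }
  </script>

For example, utf8ByteCount("a\u00E9\u4E2D") returns 6: one byte for "a", two for é, and three for 中.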

On the other hand, it isn't very user-friendly. If the buffer runs over by three bytes, what do you tell the user? Three bytes could be one, two, or three characters that the user needs to trim, and depending on which characters they trim, the result might still be too long. Recall, too, that the user's perception of "a character" is probably closer to a grapheme or grapheme cluster than to an encoded character, so they might delete too many characters without realizing it. Finally, if the buffer limit is small (say 10 or 20 bytes), users writing in languages like Chinese will be severely restricted in the number of characters they can enter.

Ultimately the answer is to address all aspects of the problem: on the back end, allocate a storage buffer large enough to hold a reasonable number of characters at the most perverse byte count (4 bytes per character for UTF-8 once supplementary characters are considered, although 3 is usually closer to the actual expansion); on the front end, enforce the limit with a character count. For example, a 40-byte back-end buffer can store 10 characters regardless of which characters the user enters on the front end.
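
For the front-end half of that arrangement, the check might look something like the sketch below. It counts code points rather than UTF-16 units, so a supplementary character counts as one (though, as noted above, even a code point isn't quite the user's idea of a character). The function name checkCharLimit and the field id are mine:

  <script type="text/javascript">
     // A sketch of front-end enforcement by character count.
     // JavaScript strings are UTF-16, so each supplementary character
     // appears as a surrogate pair; subtract one per high surrogate
     // so the pair counts as a single character.
     function checkCharLimit(maxChars) {
        var value = document.getElementById("someText").value;
        var chars = value.length;
        for (var i = 0; i < value.length; i++) {
           var c = value.charCodeAt(i);
           if (c >= 0xD800 && c <= 0xDBFF) chars--;
        }
        if (chars > maxChars) {
           alert("Please use no more than " + maxChars + " characters.");
           return false;
        }
        return true;
     }
  </script>

Paired with a 40-byte back-end buffer, checkCharLimit(10) gives you exactly the arrangement described above.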