Preamble

Recently at work our primary piece of web code was hit with a memory leak. We all panicked, of course, but then we took a hard look at what was going on. This blog post doesn’t cover that leak; it covers another memory hog I found while digging into things.

The Line

Now, we have a nice little web server built on the Django web framework, and in it we have a utilities module with a number of fairly general functions; nothing super uncommon. So imagine my surprise when I rigged up our web server with tracemalloc, booted the server, then immediately took it down and saw

...
utils/__init__.py:48: size=5412 KiB, count=65307, average=85 B
...

What is line 48? Well

_all_chars = [chr(i) for i in range(65535)]

Massive confusion here for a number of reasons: Why are we generating a list of all characters? Why is this list not actually all characters? How is this taking up 5 megabytes of memory? The last question stumped me for a bit until I read through the Python import reference. Anything in the module-level source of an __init__.py file gets executed the first time the module is imported. Well… _all_chars is at the module level and we do import utils all over the place, so every process that imports utils builds this 65,535-element list of single-character strings and keeps it alive forever. Clearly something needs to be done.
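For the curious, the rig itself was nothing fancy. Here’s a minimal sketch of the approach (where exactly you start tracing in a real Django server is up to you; the snapshot API is straight from the standard library):

import tracemalloc

tracemalloc.start()

import utils  # module-level code in utils/__init__.py runs here

snapshot = tracemalloc.take_snapshot()
# Group allocations by source line and print the biggest offenders.
for stat in snapshot.statistics('lineno')[:5]:
    print(stat)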

What’s actually going on

Let’s take a look at the full code sample and consider what’s actually happening here:

import re
import unicodedata

_all_chars = [chr(i) for i in range(65535)]
_space_chars = [c for c in _all_chars if unicodedata.category(c) == 'Zs']
_line_break_chars = [c for c in _all_chars
                     if unicodedata.category(c) in ('Zp', 'Zl')]

_space_char_regex = re.compile('[%s]' % re.escape(''.join(_space_chars)))
_line_break_char_regex = re.compile(
    '[%s]' % re.escape(''.join(_line_break_chars))
)
_null_char_regex = re.compile(re.escape(chr(0)))


def clean_string(value):
    return _line_break_char_regex.sub(
        '\n', _space_char_regex.sub(
            ' ', _null_char_regex.sub('', value)
        )
    ).strip()

If your eyes glaze over while reading this code, I don’t blame you; it certainly took me a while to wrap my head around it. Let’s break it down, though. We have two sections: the first is module-level and defines some constants, and the second is a function that uses those constants. Looking at the module-level code, we first see some lists being created.

_all_chars = [chr(i) for i in range(65535)]
_space_chars = [c for c in _all_chars if unicodedata.category(c) == 'Zs']
_line_break_chars = [c for c in _all_chars
                     if unicodedata.category(c) in ('Zp', 'Zl')]

It might seem a bit odd to define these lists, but let’s circle back to that later. On to the next section:

_space_char_regex = re.compile('[%s]' % re.escape(''.join(_space_chars)))
_line_break_char_regex = re.compile(
    '[%s]' % re.escape(''.join(_line_break_chars))
)
_null_char_regex = re.compile(re.escape(chr(0)))

This is a bit more straightforward: it just defines some regular expressions based on the character sets above, plus one for the null character. The lists above exist solely to build these regular expressions.
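If you’re wondering what those category codes actually select, a quick REPL session (my own exploration, not part of the original module) makes it concrete:

import unicodedata

# Zs is the "space separator" category: the ASCII space, the no-break
# space, and a handful of typographic spaces.
zs = [c for c in map(chr, range(65535)) if unicodedata.category(c) == 'Zs']
print(len(zs))  # 17 in recent Unicode versions

# Zl and Zp each contain exactly one character.
print([c for c in map(chr, range(65535))
       if unicodedata.category(c) in ('Zl', 'Zp')])  # ['\u2028', '\u2029']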

And on to the main act: the function.

def clean_string(value):
    return _line_break_char_regex.sub(
        '\n', _space_char_regex.sub(
            ' ', _null_char_regex.sub('', value)
        )
    ).strip()

Reading from the inside out, we:

  • Match on null characters and replace them with an empty string
  • Match on space separators (ASCII and Unicode) and replace them with an ASCII space
  • Match on line breaks and replace them with an ASCII line feed

Point 3 is a bit of a misnomer: despite the name, the code above only matches \u2028 and \u2029, the sole characters in the Zl and Zp categories respectively. As you might have expected given the run-up, this function uses the regular expressions above to manipulate a string. Curiously, though, the regular expressions are only ever used for substitution, and as it turns out the original author reinvented str.replace().
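To make that concrete, here’s a quick run with a made-up input of my own (assuming the module above is loaded):

# Null byte, Unicode line separator, no-break space, trailing whitespace.
s = '\x00hello\u2028world\xa0 '
print(repr(clean_string(s)))  # 'hello\nworld'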

Complaints

Beyond the apparent desire to reinvent the standard library, I had a number of complaints about this code:

  • _all_chars is misnamed: it actually holds a subset of all Unicode characters (or, if you prefer, a superset of the ASCII characters).
  • _line_break_chars consists of two characters, one of which doesn’t break a line.
  • The null character regex matches exactly one fixed character, so a regular expression is overkill.

More generally, we spend a lot of time here regenerating constants. The null character, for instance, is simply '\0'; why are we computing chr(0)? And all of this is for naught, given that a simple, easy-to-read alternative is faster.
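I won’t reproduce exact numbers since they depend on the input, but the comparison is easy to run with timeit. Here clean_string_regex and clean_string_replace are hypothetical renames of the regex version above and the str.replace() version below:

import timeit

sample = 'hello\u2028world\xa0  trailing\u2029 ' * 100
for name in ('clean_string_regex', 'clean_string_replace'):
    # number=10_000 keeps the run short; raise it for stabler numbers.
    t = timeit.timeit(f'{name}(sample)', globals=globals(), number=10_000)
    print(name, round(t, 3))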

What just happened now?

The original author of this code no longer works at my company, which is a shame: I’d really like to talk to them to understand what was going through their head. I’m fairly certain the developer was unaware of the import behavior and thought that module-level variables would be an optimization. Still, why reinvent the standard library?
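To spell out the behavior I suspect they missed: module-level code runs once per process, at first import, and whatever it builds stays alive for the life of that process. A toy illustration (hypothetical files, not our actual code):

# utils/__init__.py (toy version)
print('building character tables...')
_all_chars = [chr(i) for i in range(65535)]  # megabytes of strings, alive forever

# app.py
import utils  # prints the message and builds the list right here
import utils  # cached in sys.modules; nothing re-runs, but the memory stays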

Before

import re
import unicodedata

_all_chars = [chr(i) for i in range(65535)]
_space_chars = [c for c in _all_chars if unicodedata.category(c) == 'Zs']
_line_break_chars = [c for c in _all_chars
                     if unicodedata.category(c) in ('Zp', 'Zl')]

_space_char_regex = re.compile('[%s]' % re.escape(''.join(_space_chars)))
_line_break_char_regex = re.compile(
    '[%s]' % re.escape(''.join(_line_break_chars))
)
_null_char_regex = re.compile(re.escape(chr(0)))


def clean_string(value):
    return _line_break_char_regex.sub(
        '\n', _space_char_regex.sub(
            ' ', _null_char_regex.sub('', value)
        )
    ).strip()

After

import re


def clean_string(value: str) -> str:
    # Chain character replacements.
    # \u2028 and \u2029 are the Unicode line and paragraph separators;
    # \xa0 is a non-breaking space.
    value = value.replace(chr(0), '')\
        .replace('\u2028', '\n')\
        .replace('\u2029', '\n')\
        .replace('\xa0', ' ')
    # str.split() doesn't take a regex, so use re.sub to normalize
    # the remaining non-\n whitespace to plain spaces.
    value = re.sub(r'(?!\n)\s', ' ', value)
    return value.strip()
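A quick parity check on a sample input of my own shows the two versions agree:

s = '\x00hello\u2028world\xa0  trailing\u2029 '
print(repr(clean_string(s)))  # 'hello\nworld   trailing' with either version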
Thanks for reading