Baby CTO: Code Craft

Confront your greatest fear and parse a string with a Regular Expression

Rémy — Mon, 01 Apr 2024 19:04:38 GMT

Regular expressions are a scary thing and can take quite a while to digest, even for mid-level developers. Many useful tools exist, such as regex101, which will decode the syntax for you, or debuggex, which will let you visualize the expression as a finite state machine.

But there is nothing like putting your hands in the dirt to understand how something really works! Something that eluded me for years is how to parse a string, in particular one that contains an escaped quote.

Let’s start with the beginning. We want to match a basic string. Regular expressions are expressed in JS.

# To match
"hello"

# Regular expression
/".*"/

The structure is simple: first a quote, then any character any number of times, then another quote. But in real life you’re probably working on a parser, and the input will contain more than one string. For example, take "bar" bar="foo" /> as the input.

In this case the regular expression is going to get greedy and return "bar" bar="foo", which is not what we want.

The first trick is probably to tell the regular expression not to be greedy by using the ? symbol.

# To match
"bar" bar="foo" />

# Regular expression
/".*?"/
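The greedy versus non-greedy behaviour is easy to check for yourself. Here is a minimal sketch using Python’s re module, which shares this part of the syntax with JS. The variable names are mine and the input is the one from above.

```python
import re

text = '"bar" bar="foo" />'

# Greedy: .* runs all the way to the LAST quote it can find.
greedy = re.search(r'".*"', text).group(0)

# Non-greedy: .*? stops at the FIRST quote that can close the string.
lazy = re.findall(r'".*?"', text)

print(greedy)  # "bar" bar="foo"
print(lazy)    # ['"bar"', '"foo"']
```

The non-greedy version returns each attribute value as its own match, which is what a parser wants.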

That’s fine, but in most cases you want to allow your users to put quotes in the string by escaping them, and then you’ll be out of luck. This, for example, will not work:

const name = "Dwayne \"The Rock\" Johnson";
// Will match: "Dwayne \" and " Johnson"

This part had me perplexed for the longest time. There are different ways to solve it; my personal favorite is to consider what we want to allow within our string. Namely:

Any character that isn’t an end quote is fine: [^"] in regex language (^ is for not)
Any escape sequence — aka something that starts with a backslash: \\. in regex

Since we don’t want the first match to eat up the second match (after all, a “backslash” is “not a quote”), we’ll make sure to put them in the right order, escape sequence first, so that the matching can happen easily.

# To match
const name = "Dwayne \"The Rock\" Johnson";

# Regular Expression
/"(\\.|[^"])*"/

And that’s it! You are now matching an escaped string. Not that scary anymore?
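If you want to see why the order of the two alternatives matters, here is a small sketch with Python’s re module. The pattern is the one above, with a non-capturing group (?:...) swapped in so that the whole match is easy to read; the constant names are mine.

```python
import re

source = r'const name = "Dwayne \"The Rock\" Johnson";'

# Escape sequence first, then "anything that is not a quote".
ESCAPED_STRING = re.compile(r'"(?:\\.|[^"])*"')

# Same alternatives in the wrong order: [^"] happily eats the backslash,
# so the escaped quote that follows ends the string too early.
WRONG_ORDER = re.compile(r'"(?:[^"]|\\.)*"')

good = ESCAPED_STRING.search(source).group(0)
wrong = WRONG_ORDER.search(source).group(0)

print(good)   # "Dwayne \"The Rock\" Johnson"
print(wrong)  # "Dwayne \"
```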

Let’s study the second method, that I’ve found inside of Lark (amazing package by the way). It’s both simpler and more confusing and does not work with older JavaScript engines, but let’s go into it.

Essentially, if you say that “escaped quotes must not terminate the string” then it means that “the last quote of the string can’t be escaped”. That’s something we can easily check with a negative lookbehind assertion:

# To match
const name = "Dwayne \"The Rock\" Johnson";

# Regular Expression
/".*?(?<!\\)"/

The novelty here is that instead of just having a non-greedy match-all ( .*? ), we’re adding at the end a negative lookbehind assertion (?<!\\) to check that there is no backslash right before the closing quote. This has a drawback, however: you can’t escape a backslash right before the end of the string, because then the last quote would be preceded by a backslash (still with me?). In short, this doesn’t work:

const effect = "Domino \\";

But fortunately, we can allow the string to end with escaped backslashes!

# To match
const effect = "Domino \\";

# Regular Expression
/".*?(?<!\\)(\\\\)*?"/

And here we are! Matching strings another way.
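Both lookbehind variants can be compared side by side. This is a sketch in Python, whose re module supports the (?<!...) negative lookbehind; SIMPLE and FULL are names I made up for the two expressions discussed in this section.

```python
import re

# Closing quote must not be preceded by a backslash.
SIMPLE = re.compile(r'".*?(?<!\\)"')

# Same idea, but an even run of backslashes (escaped backslashes)
# is allowed right before the closing quote.
FULL = re.compile(r'".*?(?<!\\)(\\\\)*?"')

rock = r'const name = "Dwayne \"The Rock\" Johnson";'
domino = r'const effect = "Domino \\";'

rock_match = FULL.search(rock).group(0)
domino_simple = SIMPLE.search(domino)          # no match at all
domino_full = FULL.search(domino).group(0)

print(rock_match)     # "Dwayne \"The Rock\" Johnson"
print(domino_simple)  # None
print(domino_full)    # "Domino \\"
```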


Let’s hope that this problem-oriented walkthrough helped you understand relatively advanced thought patterns in regular expressions. Often you’ll run into problems that seem intractable without the proper knowledge but which can easily be unlocked if you master regular expressions, or even better: parsers! But that’s for another article.



Hands on! Parse your emails with Google's Gemma
Rémy — Sun, 25 Feb 2024 08:00:59 GMT
As I explained in my previous post, LLMs are not good at everything but they’re particularly good at parsing information and transforming it into another format. It’s a technique that we use everywhere in ChatFAQ for example.
Today we’re going to dive into the code that lets us do this. The goal is simple:
First we’ll classify the email to know what kind of email it is. Yeah, I said that it’s not a great idea because LLMs are not super performant for that. I’ve tried my best, and it seems to work well with GPT-4 and more or less decently with Gemma.
And then for each type of email we’re going to extract a JSON which tells us in a machine-readable format the content of that email.
I’ll walk through the main elements of the code. If you want to follow along with the completed project, it’s all on GitHub.
Also, of course there are many libraries and frameworks and whatnots to help you do this in different ways, but we’re here to learn, so we’ll do it all by hand today.
Buckle up and let’s go!
The Flask app
In order to do this, we’re going to make a Flask app which exposes:
A basic page that lets you upload an email in the .eml format (what you get when you “Download this email” from Gmail for example).
An API which for a given email gives you the semantics of it.
We’ll do that in a very small src/semmail/app.py file.
First, a super basic view which is just a form that will call the API when submitted.
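from flask import Flask, jsonify, render_template_string, request

app = Flask(__name__)


@app.route("/")
def home():
    """Just a dumb form where you can upload a file to the API"""

    return render_template_string(
        """
        <!DOCTYPE html>
        <html>
        <head>
            <title>Upload File</title>
        </head>
        <body>
        <h1>Upload File</h1>
        <form action="/upload" method="post" enctype="multipart/form-data">
            <input type="file" name="email_file">
            <input type="submit" value="Upload">
        </form>
        </body>
        </html>
    """
    )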
Then the API itself. We’re going to expect that there is a function that takes an email as input and returns the parsed output.
@app.route("/upload", methods=["POST"])
def upload_file():
    """After a very basic validation of the file, we put it through the LLM
    so that we can know what it's about."""

    if "email_file" not in request.files:
        return jsonify({"error": "No file part"}), 400

    file = request.files["email_file"]

    if file.filename == "":
        return jsonify({"error": "No selected file"}), 400
    try:
        file_content = file.read()
        result = interpret_email(file_content)
        return jsonify(result), 200
    except ValueError as e:
        return jsonify({"error": str(e)}), 400
In the middle of the boilerplate, you’ll notice the interpret_email() function. For now it can just return a static structure; we’re going to implement it in a moment.
Calling the AI
Because I want the project to have both an OpenAI and a Gemma implementation, there are two modules in the GitHub repository, but I’ll only cover Gemma because it’s the hot new kid that everyone wants to play with.
Make the instance
The first thing you need to do is to go to the model’s HuggingFace page and use your account to accept the license. Then go to your settings and fetch an API key that you’ll need to put in the project’s .env.
HUGGINGFACE_TOKEN=xxx
This will allow us to create the instance of the tokenizer and the model in our src/semmail/ai/gemma.py file. Put at the root of the module:
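from os import environ

import lazy_object_proxy
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"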

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"


tokenizer = lazy_object_proxy.Proxy(lambda: AutoTokenizer.from_pretrained(model_id))
model = lazy_object_proxy.Proxy(
    lambda: AutoModelForCausalLM.from_pretrained(model_id).to(device)
)
login_done = False


def ensure_login():
    """Makes sure that we're logged into HuggingFace Hub so that we can
    download the LLM (which requires to approve a license)."""

    global login_done

    if not login_done:
        login(token=environ["HUGGINGFACE_TOKEN"])
        login_done = True
A bunch of things to unpack:
Everything is wrapped in a lazy_object_proxy, which avoids blowing up the CPU and RAM at the moment the module is imported. The loading only happens the first time the object is actually used. You’ll thank me later.
We create an ensure_login() function which allows subsequent functions to make sure that we’re logged into huggingface_hub, but also does it only once to avoid having to do this every time we call the AI.
There is a conditional detection of CUDA to enable it or not depending on the availability. You guys tell me if it works, I’m an idiot who didn’t check his GPU’s compatibility before buying it.
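To see what the lazy proxy buys you, here is a tiny stand-in written with the standard library only. It is not how lazy_object_proxy is implemented, just a sketch of the idea; LazyProxy and load_model are made-up names, and the “model” is a plain string.

```python
from typing import Any, Callable


class LazyProxy:
    """Illustration only: calls the factory on first attribute access."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._target = None
        self._loaded = False

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes that LazyProxy itself does not have,
        # in other words the attributes of the wrapped object.
        if not self._loaded:
            self._target = self._factory()
            self._loaded = True
        return getattr(self._target, name)


calls = []


def load_model() -> str:
    calls.append("loaded")  # stands in for the slow, RAM-hungry download
    return "pretend this is a big model"


model = LazyProxy(load_model)
calls_after_import = list(calls)   # nothing loaded yet
shouted = model.upper()            # first real use triggers the load
model.lower()                      # second use re-uses the loaded object
```

Creating the proxy costs nothing; the expensive call happens once, on first use.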
Communicate with Gemma
You’ve probably noted the name of the model, google/gemma-2b-it.
The “2b” indicates the size of the model. I’m using this one and not the bigger one because it uses a lot less resources and can realistically be used on a CPU while the other one cannot.
The “it” (instruction-tuned) tells you that it has been trained for chat-like interactions.
So what does this chat training change for you? It means that your prompt has to follow this structure:
<start_of_turn>user
How does the brain work?<end_of_turn>
<start_of_turn>model
It indicates to the model the alternation between human and model speakers. Sadly, it does not have a system prompt to also guide the LLM outside of this, but we’ll work around that.
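To make the turn structure concrete, here is a sketch that builds such a prompt by hand using Gemma’s turn markers, <start_of_turn> and <end_of_turn>. In the real code the tokenizer’s chat template does this job (and also handles the special tokens), so build_gemma_prompt() is a hypothetical helper for illustration only.

```python
def build_gemma_prompt(user_message: str) -> str:
    """Wrap a single user message in Gemma's turn markers and leave the
    model turn open so that generation continues from there."""

    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_gemma_prompt("How does the brain work?")
print(prompt)
```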
The idea is to use it the following way:
def ask_gemma(instruction: str, this_input: str, max_tokens: int = 1000) -> str:
    ensure_login()

    chat = [
        {
            "role": "user",
            "content": f"# Instructions\n{instruction}\n# Input\n{this_input}",
        },
    ]

    prompt = tokenizer.apply_chat_template(
        chat,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer.encode(
        prompt,
        add_special_tokens=True,
        return_tensors="pt",
    )
    outputs = model.generate(
        input_ids=inputs.to(model.device),
        max_new_tokens=max_tokens,
    )

    convo_raw = tokenizer.decode(outputs[0])
    convo: Sequence[Dict] = parser.parse(convo_raw)  # noqa

    return convo[-1]["content"]
What you see here is that we’re using the tools from the transformers lib to generate the chat template prompt and give it to the LLM. Then when it runs, you receive a response that contains both the question and the answer from the bot, and… It becomes a bit confusing.
To be honest I’m not entirely sure how I’m supposed to parse this or if there are utilities in the transformers lib (I didn’t find them), so I’ve written my own parser (which you see used in the code above).
If you’ve outgrown your fear of regular expressions you may outgrow your fear of parsers as well. It’s relatively easy to write one using the Lark package. It’s outside the scope of this article, so just check the grammar if you’re interested. What matters is that we’ve got a parser!
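If you don’t want to read a grammar, the same job can be roughly approximated with a regular expression. This is not the Lark parser from the project, just a sketch that assumes the decoded text uses the <start_of_turn>, <end_of_turn> and <eos> markers; parse_turns() is a made-up name.

```python
import re
from typing import Dict, List

# One turn: a role line, then the content up to the next end marker
# (or the end of the text if the model was cut off).
TURN = re.compile(
    r"<start_of_turn>(user|model)\n(.*?)(?:<end_of_turn>|<eos>|\Z)",
    re.DOTALL,
)


def parse_turns(raw: str) -> List[Dict[str, str]]:
    """Split a decoded conversation into role/content dictionaries."""

    return [
        {"role": m.group(1), "content": m.group(2).strip()}
        for m in TURN.finditer(raw)
    ]


raw = (
    "<bos><start_of_turn>user\n"
    "How does the brain work?<end_of_turn>\n"
    "<start_of_turn>model\n"
    "With neurons.<eos>"
)
convo = parse_turns(raw)
print(convo[-1]["content"])  # With neurons.
```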
You’ve probably also noticed that there are two important parameters:
instruction — Corresponds to the system prompt, tells the bot what to do
this_input — The user input
Since there is no management of system prompt in this fine-tuning, I’m just building a prompt from those two and hoping that the LLM picks it up (it does).
At this point we have a function that runs the LLM locally on your CPU/GPU. Pretty neat!
Getting the email’s text
Emails might be the oldest and most inconsistent standard in the Internet world. Their encoding is super confusing and while Python has a built-in library that implements all the heavy lifting, it really comes as a kit that you’ve got to assemble yourself (without the instructions).
The strategy is as follows:
Go through the different “parts” of the email, looking either for a plain text or a HTML attachment. With a preference for the HTML attachment, because sometimes you will find a plain text attachment that turns out to be bullshit, so sadly only the HTML is reliable.
Because the HTML is pretty fat, if that’s what we’re going for we’ll make sure to convert it into Markdown. This will greatly reduce the amount of tokens and absolutely reduce the complexity of understanding the message for the LLM.
This is all the job of the parse_email() function, which I’m not going to detail because it would be off-topic. What you need to know is that it outputs the email in a simplified text format which looks like:
From: foo@bar.com
To: someone@example.com
Date: 2024-02-23 20:12:00 +0100
Subject: Some email

Blah blah blah this is the content of the email
It’s something we can easily give to our LLM.
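Since parse_email() is not shown here, this is a rough sketch of the first half of that strategy using only Python’s built-in email package. The function name email_to_text() is hypothetical, the headers are passed through as-is, and the HTML to Markdown conversion is left out entirely.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def email_to_text(raw: bytes) -> str:
    """Pick the HTML body if there is one, else the plain text one, and
    prepend a few headers in a simple 'Name: value' format."""

    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("html", "plain"))
    content = body.get_content() if body is not None else ""
    headers = [
        f"{name}: {msg[name]}"
        for name in ("From", "To", "Date", "Subject")
        if msg[name]
    ]
    return "\n".join(headers) + "\n\n" + content.strip()


# Build a small message with both a plain text and an HTML alternative.
demo = EmailMessage()
demo["From"] = "foo@bar.com"
demo["To"] = "someone@example.com"
demo["Subject"] = "Some email"
demo.set_content("Plain version")
demo.add_alternative("<p>HTML version</p>", subtype="html")

text = email_to_text(bytes(demo))
print(text)
```

With both alternatives present, the HTML part wins, which matches the preference described above.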
Plain text to JSON
Now the useful part. The core of this project is to convert plain text into JSON, isn’t it? Let’s do that!
def parse_to_json(
    prompt: str, text: str, schema: Any, attempts: int = 3
) -> Optional[Any]:
    for _ in range(attempts):
        parsed_raw = ask(prompt, text)
        parsed_raw = MD_START.sub("", parsed_raw)
        parsed_raw = MD_END.sub("", parsed_raw)
        try:
            parsed = yaml.safe_load(parsed_raw)
            jsonschema.validate(parsed, schema)
        except (yaml.YAMLError, jsonschema.ValidationError):
            pass
        else:
            return parsed
What do we see here:
First we clean up the model’s output to remove any Markdown wrapping. Sometimes chat-tuned models like to put YAML within ```yaml fences. We make sure to remove them if this happens.
Then we try to load the YAML data. Why YAML and not JSON? Easy: JSON is a subset of YAML so if the model decides to output JSON it will still work, but on the other hand YAML is more permissive and uses fewer tokens than JSON. So it’s both safer and more economical to use.
If all went well, we validate the parsed structure against the provided JSON schema. This ensures that the output corresponds to the constraints that we need to work with.
And if the parsing or the validation fails, we try again. Most LLMs will not strictly always have the same output for a given input so it doesn’t hurt to try another time to see if it’s still broken. If every attempt fails, the function returns None.
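The MD_START and MD_END patterns are not shown in this article. Here is one plausible way to write them, as an assumption on my part and not necessarily what the project uses. The fence marker is built with string multiplication only to keep the example easy to paste.

```python
import re

FENCE = "`" * 3  # a Markdown code fence marker

# Opening fence with an optional language name, at the very start.
MD_START = re.compile(r"^\s*" + FENCE + r"[a-zA-Z]*[ \t]*\n")
# Closing fence at the very end.
MD_END = re.compile(r"\n[ \t]*" + FENCE + r"\s*$")


def strip_markdown_fence(raw: str) -> str:
    """Remove a surrounding Markdown fence if the model added one."""

    raw = MD_START.sub("", raw)
    return MD_END.sub("", raw)


wrapped = FENCE + "yaml\nbill: 1\ncommercial: 0.2\n" + FENCE + "\n"
cleaned = strip_markdown_fence(wrapped)
untouched = strip_markdown_fence("bill: 1\ncommercial: 0.2")
print(cleaned)
```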
Prompting
In order to parse the different elements, we’ll use a prompt library. For each prompt, we associate a JSON schema which helps validate the output.
I’m not going to go through every single prompt because if you’re human you can probably understand them but I’m just going to explain the one I use to classify emails because it’s the hard one.
My goal here is to determine the probability of each email type by explicitly asking the LLM to give that probability according to different factors that I give it. The idea is that you can then easily pick the email type by checking which category has the highest probability. We rely on the LLM’s feelings but we’re using hard Python algorithms to make the decision.
The prompt goes as follows:
Take a deep breath.

You will analyze an email. For this email you need to determine the
likeliness that this email belongs to a specific category. This works
with a score system. For each category you MUST give a score of 0 if
you are sure that it's not from that category, a score of 1 if you are
sure or a number between 0 or 1 that reflects how much you want to
give that category to the email.

The categories are:
    - Commercial is a prospective email.
    - Bill is an invoice or a bill for a sold service or product.
    - Conversation is a regular conversation between humans.

Here are elements to look for in an email. For each element, if I tell 
you +X then consider that it's adding points to that aspect and -X is 
removing points.

Has few sentences +bill -commercial
Has a total price +bill
Has a list of items sold +bill
Has "bill", "invoice", "order confirmation" or any synonym in the Subject +bill -commercial -conversation
Presents several product benefits +commercial -bill
Is structured in a Hello/Message/Signature way +conversation -bill
Different signatures and quoted mails +conversation -bill -commercial

Now return the following YAML:

bill: x
commercial: x
conversation: x

You need to replace "x" by the score. If you want to give a score of
1 to two or more categories, you need to think harder to make the
difference.

Make sure that the output is pure YAML, not wrapped in Markdown, no sentences.
You can see the structure:
The LLM receives a general purpose
Then I explain the categories
Then I give the different factors for the different categories so that the LLM knows what to look for (it’s really not working if you don’t do this)
Then I give the YAML schema at the end. If the schema is too high in the prompt the LLM tends to forget about it
And then some banalities about the output to avoid getting stuff like “Of course, here is your YAML”, which would screw the parsing
This is matched up by a JSON schema:
{
    "type": "object",
    "properties": {
        "commercial": {"type": "number", "minimum": 0, "maximum": 1},
        "bill": {"type": "number", "minimum": 0, "maximum": 1},
        "conversation": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["commercial", "bill", "conversation"],
}
With this strategy, I’ve written 4 prompts:
The one you’ve just seen to determine the email’s type
If it’s a commercial email, extract the name of the product and the USP
If it’s a bill, extract the total price and the purchased items
If it’s a conversation, make a summary of the conversation
Deciding
And all this culminates in the interpret_email() function.
def interpret_email(email: bytes) -> Any:
    """Uses a first round of LLM in order to determine the type of message,
    then proceeds to using a specific prompt for that type in order to parse
    the email into a JSON output."""

    parsed_email = parse_email(email)

    email_type_proba = parse_to_json(
        DETERMINE_TYPE.prompt,
        parsed_email,
        DETERMINE_TYPE.schema,
    )

    email_type = max(email_type_proba.items(), key=lambda p: p[1])[0]
    extra = {}

    if email_type == "commercial":
        extra["commercial"] = parse_to_json(
            COMMERCIAL_INFO.prompt,
            parsed_email,
            COMMERCIAL_INFO.schema,
        )
    elif email_type == "bill":
        extra["bill"] = parse_to_json(
            BILL_INFO.prompt,
            parsed_email,
            BILL_INFO.schema,
        )
    elif email_type == "conversation":
        extra["conversation"] = parse_to_json(
            CONVERSATION_INFO.prompt,
            parsed_email,
            CONVERSATION_INFO.schema,
        )

    return dict(
        email_type=dict(
            chosen=email_type,
            proba=email_type_proba,
        ),
        **extra,
    )
Which is very simple:
First we determine the email type, which we get as a JSON object
And then we use one of the three parsers to get the extra information relative to this type
And finally we output a JSON with the extracted information and the decision-making values that we’ve used
So if I take my latest Amazon purchase, I’m getting the following output:
{
  "bill": {
    "bought": [
      {
        "label": "L'investisseur eclaire: Cultiver son...",
        "price": [
          37.46,
          "EUR"
        ]
      }
    ],
    "total": [
      43.72,
      "EUR"
    ]
  },
  "email_type": {
    "chosen": "bill",
    "proba": {
      "bill": 1,
      "commercial": 0.2,
      "conversation": 0
    }
  }
}
Hooray! It worked!
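One edge worth guarding: parse_to_json() gives back None when every attempt fails, and calling .items() on None would blow up. Here is a sketch of the type-picking step with that guard; pick_email_type() is a hypothetical helper, and raising ValueError is my choice because upload_file() already turns that exception into a JSON error response.

```python
from typing import Mapping, Optional


def pick_email_type(proba: Optional[Mapping[str, float]]) -> str:
    """Return the category with the highest score, or raise if the LLM
    never produced a usable answer."""

    if not proba:
        raise ValueError("Could not determine the type of this email")

    return max(proba.items(), key=lambda p: p[1])[0]


chosen = pick_email_type({"bill": 1, "commercial": 0.2, "conversation": 0})
print(chosen)  # bill
```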
Conclusion
I’ve showcased two things in this article:
Using extremely small boilerplate code and pretty conventional tools, I can easily leverage LLMs to parse generic content into usable JSON. It’s something that was completely unthinkable a few months ago!
And I can do so using a local LLM that even runs on “commodity” hardware (be sure to have 10 GiB of RAM before starting the project, or you’ll see how fast your computer can freeze)
This is an exciting time because long-standing problems are finally getting solved. In the near future you can expect to see every single tool out there getting a lot smarter when it comes to understanding human text.



Revisited: 10 rules to code like NASA (applied to interpreted languages)
Rémy — Thu, 17 Aug 2023 13:28:38 GMT
Foreword — Dear beginner, dear not-so-beginner, dear reader. This article is a lot to take in. You'll need perspective for it to make sense. Once in a while, take a step back and re-think about all the concepts explained here. They helped me a lot over the years, and I hope that they will help you too. This article is my interpretation of them for the work I do, which is mostly web-related development.
NASA's JPL, which is responsible for some of the most awesomest science out there, is quite famous for its Power of 10 rules (see original paper). Indeed, if you are going to send a robot to Mars with a 40-minute ping and no physical access to it, then you pretty damn well should make sure that your code doesn't have bugs.
These rules were made with embedded software in mind but why wouldn't everybody be able to benefit from this? Could we apply them to other languages like JavaScript and Python — and thus make web applications more stable?
That's a question I have been considering for years and here is my interpretation of the 10 rules applied to interpreted languages and web development, revisited some time after the initial post, with comments in mind.
1 — Avoid complex flow constructs
Original rule — Restrict all code to very simple control flow constructs – do not use goto statements, setjmp or longjmp constructs, and direct or indirect recursion.
When you use weird constructs then your code becomes difficult to analyze and to predict. The generations that came out after goto was considered harmful did indeed avoid using it. We're at the stage where we're debating if continue is goto and thus should be banned.
My take on this is that continue in a loop is exactly the same as return in a forEach() (especially now that JS has block scoping) so if you're saying that continue is goto then you're basically closing your eyes on the issue. But that's a JS-specific implementation detail.
As a general rule you should avoid everything that is mind-bending or hard to spot because if your brain power is spent understanding the quirks of jumping around then you're not spending it on the actual logic and then you might be hiding some bugs without your knowledge.
I'll let you be the judge of what you put in that category but I would definitely put:
goto itself of course
PHP's continue and break used in conjunction with numbers, which is just pure insanity
switch constructs, because they usually require a break to close the block and I guarantee you that there will be bugs. A series of if/else if will do the same job in a non-confusing manner, as well as match-like constructs in languages like Python or Crablang.
Besides this, of course, avoid recursion, for several reasons:
As they build on the call stack, whose size is very limited, you can't really control how deep your recursion can go. Even if your code is legit, it might fail because it recurses too much.
It’s easier to put safeguards when working in non-recursive mode — think explored paths or node IDs.
Do you get this feeling when doing recursions where you don't really know if your code is ever going to stop? It's very hard to imagine a recursion and to prove that it will stop correctly at the end.
It's also more compatible with the following rules to use an iterative algorithm instead of a recursive one, because you have more control (again) on the size of the problem you're dealing with.
As a bonus, recursion often comes as an intuitive implementation of an algorithm but is usually also far from optimal. For example, job interviews often ask you to implement the factorial function recursively, but that's far less efficient than an iterative implementation. Regular expressions too can be disastrous.
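The factorial case is easy to try in Python. This is a small sketch with function names of my own; the input is deliberately chosen just above the interpreter's recursion limit so that the recursive version runs out of stack while the iterative one does not care.

```python
import sys


def factorial_recursive(n: int) -> int:
    # One stack frame per step: the depth grows with n.
    return 1 if n <= 1 else n * factorial_recursive(n - 1)


def factorial_iterative(n: int) -> int:
    # A plain loop with an obvious bound: n - 1 iterations at most.
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


too_deep = sys.getrecursionlimit() + 100

try:
    factorial_recursive(too_deep)
    recursive_failed = False
except RecursionError:
    recursive_failed = True

iterative_ok = factorial_iterative(too_deep) > 0
print(recursive_failed, iterative_ok)
```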
2 — All loops must have fixed bounds. This prevents runaway code.
Original rule — All loops must have a fixed upper-bound. It must be trivially possible for a checking tool to prove statically that a preset upper-bound on the number of iterations of a loop cannot be exceeded. If the loop-bound cannot be proven statically, the rule is considered violated.
The idea with this rule is the same as with the ban on recursion: you want to prevent runaway code. The way you implement this is by making sure it's trivial to prove statically that the loop won't exceed a given number of iterations.
Let's give an example in Python. You could do this:
def iter_max(it, max_iter):
    cnt = 0

    for x in it:
        assert cnt < max_iter
        yield x
        cnt += 1


def main():
    for i in iter_max(range(100), 10):
        print(i)
A language like Python will however limit the number of iterations by itself in many cases. So if you prove that the input lists won't be too long there is a bunch of cases where you don't need to do this.
A good application of that is pagination: make sure that you always work with pages that are of a reasonable size and this way you won't need loops that could run forever. Always design your code so it only works on a finite amount of data and let tools that were made for that handle infinity (like your DB engine).
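Here is what bounded pagination could look like, as a sketch under assumptions: the "table" is a plain list standing in for your DB, and iter_pages(), fetch_from_fake_table(), MAX_PAGES and PAGE_SIZE are all names I made up.

```python
from typing import Callable, Iterator, List

MAX_PAGES = 1000  # hard upper bound on the number of iterations
PAGE_SIZE = 50    # hard upper bound on what is held in memory at once


def iter_pages(fetch_page: Callable[[int, int], List[int]]) -> Iterator[List[int]]:
    """Yield pages until an empty one shows up, never more than MAX_PAGES."""

    for page_number in range(MAX_PAGES):
        page = fetch_page(page_number, PAGE_SIZE)
        if not page:
            return
        yield page

    # Reaching this point means the bound was hit before an empty page.
    raise RuntimeError("More pages than expected, refusing to continue")


FAKE_TABLE = list(range(120))  # stands in for a real database table


def fetch_from_fake_table(page_number: int, page_size: int) -> List[int]:
    start = page_number * page_size
    return FAKE_TABLE[start : start + page_size]


page_lengths = [len(page) for page in iter_pages(fetch_from_fake_table)]
print(page_lengths)  # [50, 50, 20]
```

Both the loop count and the size of each page are bounded by constants that a reviewer can check at a glance.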
3 — Avoid heap memory allocation
Original rule — Do not use dynamic memory allocation after initialization.
That of course makes no sense in interpreted languages where literally everything is allocated dynamically. But this doesn't mean that the rule does not apply to them. The core idea of the rule is that, beyond the tedious memory management techniques that you have to use in C, it's also very important to be able to fix an upper bound on the memory consumption of your program.
So for interpreted languages it means that when you write your code, you should be able to know that given any accepted input the memory consumption won't go beyond a certain point.
While this can be hard to prove in an absolute manner, there are good clues and principles that you can follow. To be more specific and to repeat the previous sections, pagination is an essential technique. If you only work with pages and you know that the content of each page is limited (DB fields have limited length and so on) then it's quite easy to prove that at least the data coming from those pages can be contained within an upper bound.
This is a powerful idea: load a full page of data into memory, work on it then let garbage collection discard it. It can even — under specific conditions — be a way to parallelize the work. Indeed, if you’ve managed to make your problem workable in pages, it means they can be processed independently.
4 — Restrict functions to a single printed page
Original rule — No function should be longer than what can be printed on a single sheet of paper in a standard reference format with one line per statement and one line per declaration. Typically, this means no more than about 60 lines of code per function.
This is about two different things.
First, the human brain can only fully understand so much logic and the symbolic page looks about right. While this estimation is totally arbitrary you'll find that you can easily organize your code into functions of about that size or smaller and that you can easily understand those functions. Nobody likes to land on a 1000-lines function that seems to do a gazillion things at the same time. We've all been there and we know it should not happen.
Second, when the function is small — or rather as small as possible — then you can worry about giving this function the least possible power. Make it work on the smallest unit of data and let it be a super simple algorithm. It will de-couple your code and make it more maintainable.
And let me emphasize the arbitrary aspect of this rule. It works for the very reason that it is arbitrary. Someone decided that they don't want to see a function longer than a page because it's not nice to work with if it is any longer. And they've also noticed that it is doable. At first I rejected this rule but more than a decade later I must say that if you just follow either of the goals mentioned above then your code will always fit in a page of paper. So yes, it's a good rule.
The good news is that we can even push this idea further.
First of all, line length is important. You want your code to fit in a half-screen in order to be able to read two files side-by-side without having to scroll horizontally. This puts the limit at 80-ish (86 is becoming increasingly popular).
And secondly, you probably want to keep your cyclomatic complexity below 5 to 10 (for example a max-complexity = 5 in Ruff’s settings).
Although the notion predates the publication of the P10 paper, this complexity limit wasn’t included. My guess is that it would have complicated the wording of the rule, which in its current state is only a few lines long. Furthermore, you need specific tools to review the cyclomatic complexity while everything mentioned in the paper can be hand-checked. It however strongly echoes rules 1, 4 and 9, so my advice is definitely to add it to your coding guidelines.
5 — Use a minimum of two runtime assertions per function
Original rule — The assertion density of the code should average to a minimum of two assertions per function. Assertions are used to check for anomalous conditions that should never happen in real-life executions. Assertions must always be side-effect free and should be defined as Boolean tests. When an assertion fails, an explicit recovery action must be taken, e.g., by returning an error condition to the caller of the function that executes the failing assertion. Any assertion for which a static checking tool can prove that it can never fail or never hold violates this rule. (I.e., it is not possible to satisfy the rule by adding unhelpful "assert(true)" statements.)
That one is tricky because you need to understand what would count as an assertion.
In the original rules, assertions are considered to be boolean tests done to verify "pre- and post- conditions of functions, parameter values, return values of functions, and loop-invariants". If the test fails then the function must do something about it, typically returning an error code.
In the context of C or Go it is mostly as simple as this. In the context of almost every other language it means raising an exception. And depending on the language, a lot of those assertions are made automatically.
To give Python as an example, you could do this:
assert "foo" in bar
do_something(bar["foo"])
But why bother when the fact of doing this will also raise an exception?
do_something(bar["foo"])
For me it's always very tempting to act as if the input value was always right by falling back to defaults when the input is crap. But that's usually not helpful. Instead, you should let your code fail as much as possible and use an exception reporting tool (I personally love Sentry but there are plenty out there). This way you'll know what goes wrong and you'll be able to fix your code.
Of course, this means that your code will fail at runtime. But it's all right! Runtime is not production time. If you test your application extensively before sending it to production, this will allow you to see most of the bugs. Then your real users will also encounter some bugs, but you will also be informed of them, instead of things failing silently.
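The difference between the two attitudes fits in a few lines. This is a sketch with made-up function names and a made-up order structure:

```python
def total_silent(order: dict) -> float:
    # Tempting: fall back to defaults when the input is broken.
    # A missing price quietly becomes a free order.
    return order.get("price", 0) * order.get("quantity", 1)


def total_loud(order: dict) -> float:
    # The dictionary lookups act as implicit assertions: a missing key
    # raises KeyError, which your reporting tool can then show you.
    return order["price"] * order["quantity"]


broken_order = {"quantity": 3}  # the price is missing

silent_result = total_silent(broken_order)

try:
    total_loud(broken_order)
    loud_failed = False
except KeyError:
    loud_failed = True

print(silent_result, loud_failed)
```

The silent version returns a plausible-looking number, which is exactly the kind of bug that nobody notices until the accounting does not add up.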
As a side-note, if you don't have control over the input, for example if you're building an API, it's not always a good idea to fail. Raise an exception on incorrect input and you'll get an error 500, which is not really a good way to communicate bad input (since it would rather be something in the range of the 4xx status codes). In that case you need to properly validate the input beforehand. However, depending on who's using the code, you might or might not want to report the exceptions. A few examples:
An external tool calls your API. In that case you want to report exceptions because you want to know if the external tool is going sideways.
Another of your services calls your API. In that case you also want to report exceptions as it's yourself doing things wrong.
The general public calls your API. In that case you probably don't want to receive an email every time that someone does something wrong.
In short, it's all about hearing about the failures that will help you improve the stability of your code.
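The validation idea above could look like the following Python sketch. Everything in it is hypothetical: the BadRequest exception, the "quantity" field and the handle() function that returns a status code stand in for whatever your web framework provides.

```python
import json


class BadRequest(Exception):
    """Raised when the caller sent invalid input (hypothetical helper)."""


def parse_order(raw_body):
    """Validate the payload up front so caller mistakes never become a 500."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as error:
        raise BadRequest(f"body is not valid JSON: {error.msg}") from error

    if not isinstance(payload, dict):
        raise BadRequest("body must be a JSON object")

    quantity = payload.get("quantity")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 1:
        raise BadRequest("'quantity' must be a positive integer")

    return {"quantity": quantity}


def handle(raw_body):
    """Return a (status code, response body) pair."""
    try:
        order = parse_order(raw_body)
    except BadRequest as error:
        # The caller's mistake: answer with a 4xx, no need to alert anyone.
        return 400, {"error": str(error)}
    # Anything else is our own bug. It is left uncaught on purpose so that it
    # surfaces as a 500 and reaches the exception reporting tool.
    return 200, order
```

Whether BadRequest itself gets reported is then a one-line decision in handle(), which is where the "who is calling?" question from the list above gets answered.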
6 — Restrict the scope of data to the smallest possible.
Original rule — Data objects must be declared at the smallest possible level of scope.
In short, don't use global variables. Keep your data hidden within the app and make it so that different parts of the code can't interfere with each other.
You can hide your data in classes, modules, second-order functions, etc.
One thing though: when you're doing unit testing, you'll notice that this sometimes backfires on you because you want to set that data manually just for the test. This might mean that you need to hide your data away but keep a way to change it, which by convention you won't use in regular code. That's the famous _name in Python or private in other languages (which can still be accessed using reflection).
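Here is a small Python sketch of both ideas, hiding state in a second-order function and hiding it behind a _name attribute. The make_counter() and RateLimiter names are invented for the example.

```python
def make_counter():
    """The count lives inside the closure; no other code can touch it by accident."""
    count = 0

    def increment():
        nonlocal count
        count += 1
        return count

    return increment


class RateLimiter:
    """Keeps its state on the instance, behind conventionally private names."""

    def __init__(self, limit):
        self._limit = limit
        self._calls = 0  # private by convention, still reachable from a test

    def allow(self):
        self._calls += 1
        return self._calls <= self._limit


next_id = make_counter()
next_id()
next_id()

limiter = RateLimiter(limit=3)
limiter._calls = 3  # what a unit test might do to jump straight to the edge case
print(limiter.allow())
```

The closure gives the strongest isolation but no backdoor for tests, while the underscore attribute is the compromise described above: hidden by convention, adjustable when you really need it.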
7 — Check the return value of all non-void functions, or cast to void to indicate the return value is useless.
Original rule — The return value of non-void functions must be checked by each calling function, and the validity of parameters must be checked inside each function.
In C, the most common way of indicating an error is through the return value of the function (or through an error variable passed by reference). However, in most interpreted languages that's simply not the case, since errors are indicated by exceptions. Even PHP 7 improved on that (even if you still get warnings printed as HTML in the middle of your JSON if you do something non-fatal).
So in truth this rule is: let errors bubble up until you can handle them (by recovering and/or logging the error). In languages that have exceptions it's pretty simple to do, simply don't catch the exceptions until you can handle them properly.
See it another way: don't catch exceptions too early and don't silently discard them. Exceptions are meant to crash your code if need be, and the proper way to deal with them is to report them and fix the bug. This is especially true in web development, where an exception will just result in a 500 response code without dramatically crashing the whole front-end.
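As a rough Python sketch of "let it bubble up", the two inner functions below contain no error handling at all and only the outermost one catches, reports and answers. The function names and the users dictionary are assumptions made for the example, and the standard logging module stands in for a real reporting tool.

```python
import logging

logger = logging.getLogger("app")


def load_user(users, user_id):
    # No try/except here: this function cannot do anything useful about a
    # missing user, so the KeyError is left to travel up the call stack.
    return users[user_id]


def greeting(users, user_id):
    # Same here, the error just passes through.
    return f"Hello {load_user(users, user_id)['name']}"


def handle_request(users, user_id):
    """The one place that knows how to turn a failure into a response."""
    try:
        return 200, greeting(users, user_id)
    except Exception:
        # Report with the full traceback, then answer with a generic 500.
        logger.exception("unhandled error while greeting user %r", user_id)
        return 500, "Internal Server Error"
```

In most web frameworks the equivalent of handle_request() already exists, so in practice the rule often boils down to not writing the try/except at all.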
8 — Use the preprocessor sparingly.
Original rule — The use of the preprocessor must be limited to the inclusion of header files and simple macro definitions. Token pasting, variable argument lists (ellipses), and recursive macro calls are not allowed. All macros must expand into complete syntactic units. The use of conditional compilation directives is often also dubious, but cannot always be avoided. This means that there should rarely be justification for more than one or two conditional compilation directives even in large software development efforts, beyond the standard boilerplate that avoids multiple inclusion of the same header file. Each such use should be flagged by a tool-based checker and justified in the code.
In C code, macros are a particularly efficient way to hide the mess. They allow you to generate C code, much like you would write an HTML template. It's easy to see how this can go sideways, and you can have a look at the IOCCC contestants, who usually make very heavy use of C macros to generate totally unreadable code.
However, C and C++ are pretty much the only mainstream languages making use of this, so how would you translate this into other languages? Did we get rid of the problem? Does compiling code into other code that will then be executed sound familiar to anyone?
Yes, I'm talking about the huge pile of things we put in our Webpack configurations.
The initial rule recognizes the need for macros but asks that they are limited to "simple macro definitions". What is the "simple macro" of Webpack? What is the good transpiler and the bad transpiler?
My rationale is simple:
Keep the stack as small as possible. The fewer transpilers you have, the less complexity you need to handle.
Stay as mainstream as possible. For example, I always use Webpack to transpile my JS/CSS, even in Python or PHP projects. Then I use a simple wrapper around a manifest file to get the right file paths on the server side. This allows me to stay compatible with the rest of the JS world without having to write more than a simple wrapper. Another way to put it is: stay away from things like Django Pipeline.
Stay as close as possible to the real thing. Using ES6+ is nice because it's a superset of previous JS versions, so you can see transpiling as a simple layer of compatibility. However, I wouldn't recommend transpiling Dart or Python or anything like that into JS.
Only do it if it brings actual value to your daily work. For example, CoffeeScript is just an obfuscated version of JavaScript so it's probably not worth the pain, while something like Stylus/LESS/Sass brings variables and mixins to CSS, which will help you a lot in maintaining CSS code.
You're the judge of good transpilers for your projects. Just don't clutter yourself with useless tools that are not worth your time.
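To give an idea of how small the "simple wrapper around a manifest file" mentioned above can be, here is a Python sketch. It assumes the build writes a flat JSON object mapping logical names to hashed paths, for instance {"main.js": "/static/main.abc123.js"}. The exact format depends on your build setup, and the AssetManifest class name is invented here.

```python
import json
from pathlib import Path


class AssetManifest:
    """Maps logical asset names to the hashed file names produced by the build."""

    def __init__(self, manifest_path):
        self._path = Path(manifest_path)
        self._entries = None  # loaded lazily on first use

    def url(self, name):
        if self._entries is None:
            self._entries = json.loads(self._path.read_text(encoding="utf-8"))
        # A missing entry raises KeyError on purpose: a template asking for an
        # asset that was never built is a bug worth hearing about.
        return self._entries[name]
```

A template helper would then call something like AssetManifest("dist/manifest.json").url("main.js"), where the path is again only an example.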
9 — Limit pointer use to a single dereference, and do not use function pointers.
Original rule — The use of pointers should be restricted. Specifically, no more than one level of dereferencing is allowed. Pointer dereference operations may not be hidden in macro definitions or inside typedef declarations. Function pointers are not permitted.
Anybody who's done C beyond the basic examples will know the headache of pointers. It's like inception but with computer memory, you don't really know how deep you should follow the pointers.
The need for that arises with, for example, the qsort() function. You want to be able to sort any type of data without knowing anything about it at compile time. Have a look at the signature:
void qsort( void *ptr, size_t count, size_t size,
            int (*comp)(const void *, const void *) );
It's one of the most frighteningly unsafe things you'll ever see in standard library documentation. Yet it allows the standard library to sort any kind of data, something for which other, more modern languages still have slightly awkward solutions.
But of course, when you open the gate for this kind of thing, you open the gate to every kind of pointer madness. And as you know, when a gate is open, people will go through it. Hence this rule for C.
However what about our case of interpreted languages? We will first cover why references are bad and then we will explain how to accomplish the initial intent of writing generic code.
Don't use references
Pointers don't exist in most interpreted languages, but some ancient and obscure languages like PHP still thought that it would be a good idea to have references. Most other languages, however, only use a strategy named call-by-sharing. The idea, very quickly, is that instead of passing a reference you pass objects that can modify themselves.
The core point against references is that, beyond being memory unsafe and crazy in C, they also produce side effects. For example, in PHP:
function read($source, &$n) {
    $content = file_get_contents($source); // one way to get the content
    $n = strlen($content); // one way to get the read length

    return $content;
}

$n = 0;
$content = read("foo", $n);

print($n);
That's a common, C-inspired, use-case for references. However, what you really want to do in this case is
function read($source) {
    $content = file_get_contents($source); // one way to get the content
    $n = strlen($content); // one way to get the read length

    return [$content, $n];
}

list($content, $n) = read("foo");

print($n);
All you need is two return values instead of one. You can also return data objects which can fit any information you want them to fit and also evolve in the future without breaking existing code.
And all of this without affecting the scope of the calling function, which is rather nice.
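The "data object" variant could look like this in Python. It is only a sketch: ReadResult and its fields are invented for the example, and read() treats its argument as the content itself instead of doing any real I/O.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadResult:
    content: str
    length: int
    # A field added later with a default does not break existing callers.
    encoding: str = "utf-8"


def read(source):
    """Stand-in for a real read: `source` is treated as the content itself."""
    content = source  # placeholder for "some way to get the content"
    return ReadResult(content=content, length=len(content))


result = read("foo")
print(result.content, result.length)
```

Callers that only care about content and length keep working when a new field such as encoding shows up, which is the "evolve in the future" part.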
Another safety point, though: when you modify an object, you potentially affect the other users of that object. That's a common pitfall of Moment.js, for example. Let's see.
function add(obj, attr, value) {
    obj[attr] = (obj[attr] || 0) + value;
    return obj;
}

const a = {foo: 1};
const b = add(a, "foo", 1);

console.log(a.foo); // 2
console.log(b.foo); // 2
On the other hand you can do:
function add(obj, attr, value) {
    const patch = {};
    patch[attr] = (obj[attr] || 0) + value;
    return Object.assign({}, obj, patch);
}

const a = {foo: 1};
const b = add(a, "foo", 1);

console.log(a.foo); // 1
console.log(b.foo); // 2
Both a and b stay distinct objects with distinct values because the add() function builds a copy of a and returns that copy instead of modifying a itself.
Let's conclude this already-too-long section with the final form of the rule:
Don't mutate your arguments unless the explicit goal of your function is to mutate your arguments. If you do so, do it by sharing and not by reference.
That would be, for example, the no-param-reassign rule in ESLint as well as the Object.freeze() method. Or in Python you can use a NamedTuple in many cases.
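For the Python side, a minimal sketch with typing.NamedTuple might look like this. Point and move_right() are hypothetical names, while _replace() is the documented way to get a modified copy of a named tuple despite its leading underscore.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: int
    y: int


def move_right(point, distance):
    """Returns a new Point; the argument cannot be mutated, even by accident."""
    return point._replace(x=point.x + distance)


a = Point(x=1, y=1)
b = move_right(a, 1)
print(a, b)

try:
    a.x = 5
except AttributeError as error:
    print(f"refused: {error}")
```

Here the language enforces the rule for you: the only way to "change" a Point is to build another one.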
Note on performance: if you change the size of an object, the underlying process will often be to allocate a new contiguous region of memory for it and then copy it. For this reason, a mutation is often a copy anyway, so don't worry too much about copying your objects.
Leverage the weak-ish dynamic typing
Now that we closed the crazy door of references, we still need to write generic code if we want to stay DRY.
The good news is that while compiled languages are bound by the rules of physics and the way computers work, interpreted languages can have the luxury of putting a lot of additional support logic on top of that.
Specifically, they mostly rely on duck typing. Of course you can add some level of static type checking like TypeScript, Python's type hints or PHP's type declarations. Using the wisdom of other rules:
Rule 5 — Make many assertions. Expecting something from an object which doesn't actually have it will raise an exception, which you can catch and report.
Rule 10 — No warnings allowed (explained hereafter). Using the various type checking mechanisms you can rely on a static analyzer to help you spot errors that would arise at runtime.
Those two rules will protect you from writing dangerous generic code, which results in the following rule:
You can write generic code as long as you use as many tools as possible to catch mistakes, and especially you need to follow rules 5 and 10.
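As an illustration of generic code that leans on both duck typing and type hints, here is a Python sketch using typing.Protocol (available since Python 3.8). The HasArea, Square and Circle names are made up for the example.

```python
import math
from typing import Iterable, Protocol


class HasArea(Protocol):
    def area(self) -> float: ...


class Square:
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side ** 2


class Circle:
    def __init__(self, radius: float) -> None:
        self.radius = radius

    def area(self) -> float:
        return math.pi * self.radius ** 2


def total_area(shapes: Iterable[HasArea]) -> float:
    # Works with any object that has an area() method, no common base class needed.
    return sum(shape.area() for shape in shapes)
```

A static type checker can flag a call such as total_area(["oops"]) before the code runs (rule 10), and if it slips through anyway the missing method raises an AttributeError at runtime (rule 5).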
10 — Compile with all possible warnings active; all warnings should then be addressed before release of the software.
The initial full rule is:
All code must be compiled, from the first day of development, with all compiler warnings enabled at the compiler’s most pedantic setting. All code must compile with these settings without any warnings. All code must be checked daily with at least one, but preferably more than one, state-of-the-art static source code analyzer and should pass the analyses with zero warnings.
Of course, interpreted code is not necessarily compiled so it's not about the compiler warnings per se but rather about getting the warnings.
There is fortunately a great amount of warning sources out there:
All the JetBrains IDEs are pretty awesome at finding issues in your code. Recently, those IDEs taught me a lot of patterns in different languages. That's really the main reason why I prefer something like this to a simplistic code editor: the warnings are very smart and helpful.
Linters for all the languages
JavaScript — ESLint with a shared rule set, Airbnb's maybe?
Python — You can go full steam on Ruff and pick the rules that suit you
Automated code review tools like SonarQube
Spell checkers are also surprisingly important because they will allow you to sniff out typos regardless of type analysis or any complicated static code analysis. It's a really efficient way to not lose hours because you typed reuslts instead of results.
The main thing about warnings is that you must train your brain to see them. A single warning in the IDE will drive me mad while on the other hand I know people that just won't see them.
A final point on warnings: unlike in compiled languages, warnings here are not always 100% certain. They are more like 95% certain, and sometimes it's just an IDE bug. In that case, you should explicitly disable the warning and, if possible, give a small explanation of why you're sure that this warning doesn't apply. However, think carefully before doing so, because usually the IDE is right.
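Explicitly disabling a warning could look like this in Python, assuming a linter that understands noqa comments (Flake8 and Ruff both do) and that reports overly long lines as E501. The constant name is made up, and the string is the qsort() signature from earlier joined onto one line.

```python
# The signature is deliberately kept on a single line so that it can be found
# with a plain text search. Only the "line too long" warning is silenced, and
# only on that one line.
QSORT_SIGNATURE = "void qsort( void *ptr, size_t count, size_t size, int (*comp)(const void *, const void *) );"  # noqa: E501
```

Naming the exact rule keeps every other warning on that line alive, and the comment tells the next reader why the exception exists.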
Key takeaways
The long discussion above tells us that those 10 rules were made for C, and while you can use their philosophy in interpreted languages, you can't really translate them directly into 10 other rules. Let's make our new power of 10 + 2 rules for interpreted languages.
Rule 1 — Don't use goto, rationalize the use of continue and break, use match instead of switch.
Rule 2 — Prove that your problem can never create runaway code.
Rule 3 — To do so, limit the size of it. Usually using pagination, map/reduce, chunking, etc.
Rule 4 — Make code that fits in your head. If it fits in a page, it fits in your head.
Rule 5 — Check that things are right. Fail when wrong. Monitor failures. See rule 7.
Rule 6 — Don't use global-ish variables. Store data in the smallest possible scope.
Rule 7 — Let exceptions bubble up until you properly recover and/or report them.
Rule 8 — If you use transpilers, make sure that they solve more problems than they bring.
Rule 9.1 — Don't use references even if your language supports it.
Rule 9.2 — Copy arguments instead of mutating them, unless it's the explicit purpose of the function.
Rule 9.3 — Use as many type-safety features as you can.
Rule 10 — Use several linters and tools to analyze your code. No warning shall be ignored.
And if you take a step back, all of those rules could be summed up in one rule to rule them all.
Your computer, your RAM, your hard drive, even your brain are bound by limits. You need to cut your problems, code and data into small boxes that will fit your computer, RAM, hard drive and brain. And that will fit together.
— Morpheus Me
I consider that to be the core rule of programming and I apply it as a universal rationale to everything computer-related that I do.
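As a tiny illustration of cutting data into small boxes (the chunking from rule 3), here is a Python sketch. The chunks() and total() helpers are hypothetical and the default chunk size of 1000 is arbitrary.

```python
from itertools import islice


def chunks(items, size):
    """Yield lists of at most `size` items, reading the input lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # The loop ends as soon as the input is exhausted and an empty chunk comes back.
    while chunk := list(islice(iterator, size)):
        yield chunk


def total(numbers, chunk_size=1000):
    # Only one chunk is held in memory at a time, whatever the input size.
    return sum(sum(chunk) for chunk in chunks(numbers, chunk_size))
```

Each box has a known maximum size, so the memory needed stays bounded no matter how much data flows through.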