Functions and Procedures

Brian Keng

2015-09-06 03:04

I have something to admit: I've never done any serious programming in a functional language. Yes, yes, I've done some small school assignments in Scheme (a dialect of Lisp), even helping out my friend from another university with his Scheme assignment but nothing real 1. Does this make me a worse programmer? Probably not, I imagine most developers haven't done anything real with functional programming. Although that's probably not what you'd expect from reading Hacker News, where you don't know programming if you haven't tried Clojure or Haskell 2.

My position is much more pragmatic: I'm interested in tools and techniques that help me solve problems faster, cleaner and with less headache. Ideas and concepts from functional programming languages are supposed to help with at least some of that -- and they do; they're just not a panacea for all your programming woes. Like most things a solid grasp of the fundamentals goes a long way. So the topic of this post is about something pretty fundamental: a subroutine. In particular, two important kinds of subroutines: procedures and pure functions.

Programming is Organization

Let me digress for a moment because I want to discuss an incredibly important idea that James Hague discusses in his post Organizational Skills Beat Algorithmic Wizardry. He nails down one of the most important points about software development: organization. (emphasis mine)

When it comes to writing code, the number one most important skill is how to keep a tangle of features from collapsing under the weight of its own complexity... there's always lots of state to keep track of, rearranging of values, handling special cases, and carefully working out how all the pieces of a system interact. To a great extent the act of coding is one of organization. Refactoring. Simplifying. Figuring out how to remove extraneous manipulations here and there.

As much as it is fun debating the merits of one language/framework/technology versus another, it's much more practical to talk about ways we can organize programs in a more efficient manner (in whatever language/framework/technology we're currently using). Now back to the main show...

A Function By Any Other Name

It's funny that one of the first things we learn when programming is the concept of a subroutine: a set of instructions designed to perform a frequently used operation within a program, which is supposed to help organize your program. Great, but I don't recall learning much about the different kinds of subroutines or even really best practices for using them. For some reason that's just something you have to figure out yourself. Let's try to be a bit more explicit.

In my mind, a useful (but perhaps not universally accepted) classification of subroutines breaks them down into two general categories: procedures and pure functions.

A procedure is a sequence of commands to be executed. These are usually used for doing stuff. Typically, these will involve side-effects (such as changing the state of variables, outputting to the screen, or saving things to a file etc.). Procedures don't have return values.

A (pure) function computes a value (and returns it). These are for computing stuff. Just like a function in math, for the same set of inputs, it will always return the exact same output. Functions don't have side-effects.

Notice that these are not the only two ways to think about subroutines. There is a type of subroutine that returns something and has side-effects (among others). But I argue that these two are the most constructive ways to think about subroutines.

The most popular languages out there today don't really make a distinction between these two types of subroutines, but that doesn't mean you shouldn't! The reason to look at subroutines this way is because of a general rule of thumb that I came across by Greg Ward in his talk at Pycon 2015, How to Write Reusable Code (slides):

Every [subroutine] should either return a value or have a side effect: never both.

This is great rule of thumb that's hard to appreciate until you've made the mistake of violating it and have it come back to bite you in the arse. Greg goes on to give a couple of great examples (from actual code reviews he has done). Here's one of his examples where this rule of thumb is violated:

def get_foo(self, foo=None):
    '''Query the server for foo values.
    Return a dict mapping hostname to foo value.

    foo must be a dict or None; if supplied, the
    foo values will additionally be stored there
    by hostname, and foo will be returned instead
    of a new dict.
    '''

Gee, I'm already confused even after reading the documentation (don't even get me started on the mismatch with the function name). Remember, we want to build systems that don't "collapse under the weight of its own complexity" by "Simplifying. Figuring out how to remove extraneous manipulations here and there." Sure, giving it a second read, we can probably figure out what it does but the fact that we need to think twice about it sure isn't helping the complexity. Imagine if every subroutine you wrote had this issue -- I don't envy that code reviewer.

Greg goes on to give a better way to implement get_foo() as a pure function:

def get_foo(self):
    '''Query the server for foo values.
    Return a dict mapping hostname to foo value.
    '''

Much simpler and easy to understand: query the server, get back a dict. No extraneous mental overhead with the foo parameter. It may only be a small improvement but when building a large system, these small things add up quickly (especially since complexity is likely multiplicative).

There's also this example involving C:

/**
 * Replace all 'e' characters in str with 'E'. Return the number of
 * characters replaced.
 */
int strmunge_v1(string str) {
    ...
}

He points out that this type of subroutine is pervasive in C and notes that the only valid reason for violating this rule is for performance (which is probably why you're programming in C in the first place!). For the rest of us who aren't writing performance critical code (come guys, that's most of you), a much cleaner solution is not to have the side-effect and convert it to a pure function:

/**
 * Return newstr, a copy of str with all 'e' characters replaced
 * by 'E', and nreplaced, the number of characters replaced.
 * (Assume language with multiple return values)
 */
(string, int) strmunge_v2(string str) {
    ...
}

The pure function has many benefits over the side-effect-ridden one (functional programmers rejoice!) with the main one that it's easier to reason about: you can look at the function in isolation of the entire program. Write it separately, review it separately, unit test it separately. And once you're convinced it works properly, you don't need to look at it again! You can now "abstract" that function out when reading the parent functions. Awesome! I'm a huge fan of making things simpler.

Fundamentals

The reason that I decided to write this post is that lately, I've been using a "procedure of (pure) functions" type pattern in my code. My main logic typically is some kind of procedure that farms out much of the work to pure functions rather than mixing them (kicking my old performance-driven C++ mindset). I find that it's been a very useful way to structure my programs and generally just more pleasant to read.

After noticing this subtle shift in my code (and after watching Greg's talk), I rediscovered my appreciation for the fundamentals. I get the feeling that when people want to learn something they conflate the most advanced ideas with the most important. There's definitely something to be said of taking a step back and learning the fundamentals well. Programming is no different in this respect. If you want to become strong at programming, start with the fundamentals.

1: "Real" programming work is kind of a vague word. The way I'm using it here is any kind of sizable project, solving a non-trivial problem. Most of the time these types of projects aren't measured in days or weeks but rather months and years.
2: Of course, I'm not saying Clojure and Haskell are bad languages or incapable of solving "real" problems with -- I'm almost positive they are fine languages to use. I'm more of the opinion that, practically, it's harder to use them to solve many of the problems out there. It's not just the issue from learning the FP conceptual point of view but also the fact that it's not that easy to find libraries, examples or even jobs that use these languages (although obviously some do exist). Without a good "support structure" (including monetary compensation), it's hard to justify using a functional language.