Software architecture emerges

When the team has decided that a thing must be built, where do you start?

In the old times, we would draft specifications, run them by a committee of senior engineers, revise them, check them again for correctness, and scrutinize them endlessly before our engineers would meticulously render each decision.

More often than not, we would build monuments to engineering perfection and at the end, we would find that our specifications were inadequate in one dimension or another.  We built a castle, but what we really needed was a Burger King.  Maybe we can install a kitchen and even a drive-thru, but it’s still a castle, and it’s always going to have weird edges.

Definitely what you can not do, CAN NOT DO, is decide that the system is too complicated so what we really need to do is rebuild Castle Burger King in smaller bits but put a network between them and it’ll be easier to understand.  The small bits might be individually easier to understand, but the system (aka Burger King) is actually what you still have to understand.  And the system is still a Burger King defined in terms of what a castle has to offer.

So how do you build Burger King to begin with?

You don’t, you can’t, you don’t even know that’s what you need (and probably nobody does).  The customer came to you and said “I need a place to eat.”  So what you should do is take them very literally and rent them a table at a cheap restaurant and ask them if that solves their problem.

And then you keep going until you arrive at a Burger King, and then you keep on going, because even Burger King can’t stay the same forever.

The not-knowing bit can be uncomfortable, and we try to mitigate it with planning and thinking really hard.  In the end though, all that planning and thinking is going to die on a wiki page and reality is going to show us the truth (occasionally, with violence).  Given enough time and patience, even the greatest castle will eventually converge to what it must become.  So don’t build castles.  Build hovels.  They might not look as nice, but you’ll get to Burger King a lot faster.

Introduction to Results-Oriented Thinking and Post Mortems

A typical post-mortem meeting delves into Five Whys, not assigning blame, figuring out what mistakes were made that led to the outage or whatever, and then assigning and prioritizing the work to correct those mistakes and, most importantly, to learn from them.

The post-mortem itself presumes a mistake, that we must find that mistake and learn from it.  I wonder then: is it possible to have an incident where the cause is not a mistake, but perhaps even a correct decision?

Let us imagine a very simple game.  The house flips a coin.  You wager 50c – if the house wins, they keep your 50c.  If the house loses, they pay you $2.  You give the house your 50c and play.  So we figure, half the time we lose 50c, and half the time we win $2.  The expected value (or EV) of an iteration of the game is 0.5 × (−$0.50) + 0.5 × $2.00 = +$0.75, so the game has a positive EV.  EV (given a tolerable amount of catastrophic risk) gives us a simple mathematical basis for evaluating our decisions.
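The arithmetic can be sanity-checked with a quick simulation (a sketch only; the payouts are the ones described above, and the average profit per play should settle near 75 cents):

```ruby
# Simulate the coin game: half the time we lose our 50c wager,
# half the time we win $2.00.
def play
  rand < 0.5 ? -0.50 : 2.00
end

# Average profit per play over many trials converges to the EV:
# 0.5 * (-0.50) + 0.5 * (2.00) = +$0.75
def average_profit(trials)
  trials.times.sum { play } / trials
end

srand(42) # fixed seed so the run is repeatable
puts average_profit(100_000).round(2)
```

Run it a few times with different seeds: individual plays swing wildly, but the average barely moves.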

Anyway, so you decide that since the game is positive EV, it is correct to play (assuming you like money).  And then, you lose.  Later, you have a post mortem for the game, and you decide that the thing that caused the losing was to play to begin with, and the best way to avoid that in the future would be to not play anymore, or to play a different game, perhaps.

In games involving variance, this post-hoc analysis of decisions is referred to pejoratively as “results-oriented thinking.”  It boils down to overweighting our agency, and the belief that each time we realize risk, our decision was poorly conceived.

  1. a method of analyzing a poker play based on the outcome as opposed to the merits of the play. (source)

Trying to be good at games involving variance (like poker, or Magic: the Gathering) is one way to figure out how miserably bad humans really are at this.  There are times when you (correctly) estimate you are a 9:1 favorite against your opponent, and you convince them to go all-in, and then they suck out and you lose (this happens exactly 10% of the time, by my math).

Here is the horrible truth about games of variance: sometimes, making the correct decision will straight up cause you to lose.  Perhaps more terrifying: sometimes making the wrong decision will cause you to win.

It is not fun, and if no one has explained this to you, you might think that poker is just not your game.  But, if you make this play at every opportunity it is presented to you, in the long run, you will win a lot of money.
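As a sketch with made-up stakes: suppose every all-in risks $100 of yours against $100 of theirs, and you are a 90% favorite each time.  You get sucked out of roughly one pot in ten, but the EV per all-in is 0.9 × $100 − 0.1 × $100 = +$80, and the long run does the rest:

```ruby
# Each all-in: win $100 with probability 0.9, lose $100 otherwise.
def all_in
  rand < 0.9 ? 100 : -100
end

srand(7) # fixed seed so the run is repeatable
results = Array.new(10_000) { all_in }
losses  = results.count { |r| r < 0 }

puts "sucked out in #{losses} of 10000 all-ins" # roughly 1 in 10
puts "net result: $#{results.sum}"              # roughly +$80 per all-in
```

Each individual suckout still stings; the bankroll doesn't care.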

This relates to post-mortems in myriad ways, some of which I will have to address in future posts.  The most obvious is that answering the question “how could we have prevented this?” is not an adequate post-mortem.  It’s not even a reasonable way to start.  We have to evaluate the decisions we made, on their own merits, ignoring their outcome.  This is admittedly tough to do, since in most cases, we only have post-mortems when things go bad.

Here are a few suggestions for post-mortem questions that nobody’s asking:

Should we have actually prevented this?

Post-mortems assume this.  I think assuming this will make you risk-averse.  Not all outages are worth attempting to prevent.  If you think the decisions that led to an outage had a fair risk component, you basically agree that those decisions were fine.

Should we try to prevent this in the future?

This is a different question.  We know different things now.  The stakes have changed, and the odds are different, and the costs are lower.

Critically, every post-mortem involves a cohort of stakeholders, sometimes ones with actual stakes (and pitchforks and torches).  If your system experiences a failure mode, your customers will expect you to adjust your process so that you never experience this failure mode again (regardless of the costs).  Therefore, this failure mode will have a greater cost the next time it occurs.

In our favor though, we now understand the failure mode better than if we’d gone out hunting for dragons early in the process: we have validated a real risk.  The incident may have taught us that the odds of this failure mode are higher than we expected, and a solution for this actual failure mode is usually cheaper than a solution for many theoretical ones.

Did we understand this risk up front or was it emergent?

Did we just come out on the bad side of a calculated risk?  Was this an unknown risk that we could have known about if we’d done reasonable due diligence (that is, due diligence that was likely to be worth it)?

How should our decision making change in the future?

Is there a critical dependency that is too poorly understood?  Maybe we should develop some more expertise here.

Are we taking too little risk?  Too much risk?  It’s tough to answer this question honestly during a post-mortem, but it’s something worth keeping in mind.

Final Notes

There is a big world of thinking about decisions that I didn’t cover here.  This mode of thinking can improve your decision-making in post-mortems, software design, career decisions, relationships, finances, and pretty much any area where risk and reward are traded off.

Find ways to think harder about the decisions you make than the outcomes you experience.

Close Decisions (usually) Don’t Matter

Some humans (engineers in particular) have a painful cognitive issue that hinders decision making and leads to, for example, analysis paralysis.  The problem is that if they are given two options with similar expected value, they will get stuck in the mire of choosing the correct option, at pretty much any cost.

Here’s the thing: a lot of times, decisions are hard because there are no good choices, or only good choices, or because the choices are so similar that spending a bunch of time picking the best one is really a waste.

A lot of the real growth as an engineer and as a person comes from figuring out the difference between a 2 and a 5, but the close calls are a lot more interesting to us (and a lot less valuable, really), so we tend to dig into the difference between a 4.5 and a 4.8 a lot more.

Software holy wars definitely fall into this category.  vim or emacs?  spaces or tabs?  python or ruby?  At the end of the day, the difference between PHP and Node.js as a platform is something like a fraction of a point (both are proven tools); all of the real edge in their differences comes down to your team’s expertise and interest.

Do close decisions matter though?  Yeah, but only in hindsight.  You can’t take a close decision, run it against a sample size of 1, and then have a good idea of whether or not you made the right call.  Sometimes, good plays lead to bad outcomes, and bad plays lead to good outcomes.  The key to making better choices is to have a better understanding of what constitutes a good choice.  Using a new technology for no reason is probably not a good choice, unless you just want to learn something.

Here’s the takeaway: pay attention when you have a hard decision.  Why is the decision hard?  Is it because the choices are close?  Is it likely you’ll get enough information to make it “not close” in a reasonable amount of time?

If it’s really close, just flip a coin.

Why isn’t JSON parsing symmetric in Ruby?

We recently ran into a weird JSON-related issue in Ruby, namely that to_json and JSON.parse are not symmetric.

irb(main):004:0> JSON.parse("hello".to_json)
JSON::ParserError: 757: unexpected token at '"hello"'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/json/common.rb:155:in `parse'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/json/common.rb:155:in `parse'
	from (irb):4
	from /usr/bin/irb:12:in `<main>'

It turns out that JSON.parse expects a valid JSON document, which is defined on json.org as follows:

JSON is built on two structures:

  • A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
  • An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

So a single JSON value will never be a valid JSON document.  To that end, JSON.generate is actually the symmetric function you are looking for – it only accepts hashes and arrays.  to_json will happily serialize single values, but the result is a JSON fragment, not a document, which is often not what you want.
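A quick illustration (the handling of bare values has varied across json gem versions, so treat this as a sketch): round-tripping is symmetric once the top-level value is a hash or an array.

```ruby
require 'json'

# A full document (hash or array) round-trips symmetrically:
doc = { "greeting" => "hello" }
JSON.parse(JSON.generate(doc)) # => {"greeting"=>"hello"}

# A bare value serialized with to_json is a fragment, not a document,
# so wrap single values before round-tripping:
JSON.parse(["hello"].to_json).first # => "hello"
```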

Should you find yourself in a situation where you need to parse messy JSON like this, you can enable quirks_mode on JSON.parse, which will behave closer to your expectations:

JSON.parse("hello".to_json, :quirks_mode => true) # => "hello"

Tips for reviewing code

Code reviews are a critical part of the modern software development cycle.  Unfortunately, a lot of cycles are burned, egos flare, and morale is damaged because reviewers don’t know how to offer suggestions constructively.

Here are some things that I think are very important to improving your code review process.

Figure out why you do code reviews

Is it to find bugs?  Is it to spread knowledge of the codebase?  Is it to find architectural problems?  What do code reviews offer your team?  If you don’t know the answer to this question, you don’t know how well they’re working.

Automate your style guide

If the answer to “why you do code reviews” includes “making sure the code follows a consistent style,” cut it out. Style comments are a huge waste of time.  It is very difficult to read the code for both correctness and style at the same time, and it is often easier to make a bunch of style comments and feel like you’ve accomplished something than to dig deep and find real issues with the code.

A deluge of style comments is often overwhelming, and drowns out conversation about the code itself.

Tools exist in most languages (we use Rubocop and Scalariform) to handle code formatting automatically, or at the very least, to provide warnings to the author during build.  In no circumstance should paid engineers be scanning code for whitespace problems in 2016.  While automated tools will never provide a level of aesthetic that a human eye can offer, they can get plenty close enough at a much lower cost.

If you haven’t automated your style guide, you don’t have a style guide, you have a tool for your team to abuse each other.  Automate this problem away.
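As a sketch of what “automated” looks like in practice, a minimal .rubocop.yml might start like this (the cop names and values below are illustrative assumptions, and they vary across RuboCop versions):

```yaml
# Illustrative RuboCop config: let the tool own the formatting disputes.
AllCops:
  DisplayCopNames: true   # show which cop fired, so fixes are findable
Metrics/LineLength:
  Max: 100                # pick a number once, stop arguing about it
Style/StringLiterals:
  EnforcedStyle: double_quotes
```

Wire it into CI or a pre-commit hook and the argument is over before it starts.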

Offer suggestions, not orders

Don’t tell a committer how they’ve done a thing incorrectly in a code review remark.  Ask them what they think about your suggestion.  For example, a comment like this:

Don’t repeat this here, extract a function.

obviously feels very negative.  The author may feel defensive and start a pointless argument, even if there are few benefits to doing so; or worse, the author may feel like a less valuable contributor, and follow the suggestion blindly.  In either case, actual harm is done by comments like this.

The best thing you can do is to offer a suggestion, not as a gatekeeper, but as someone who is trying to learn about the changes to the code.

What do you think about taking this with the above and extracting a function?

Why should you make your comments sound so tentative?  Because you are reviewing your peer’s code.  And you have to invite them to politely explain what’s going on, or to notice that you’ve made a better suggestion.  Having no manners in this regard eliminates the possibility of discussion for a significant segment of the population.

Let me talk about my number one pet peeve in code reviews.

Why didn’t you…

Asking a “why didn’t you” question implies a great many things, none of which are constructive or positive.  It assumes that the author of the code considered your suggestion and explicitly chose a different approach, which may not be the case.  It implies that you feel your alternative is obviously superior (especially in the special case of “why didn’t you just…“), and that you require the author to justify their solution against it.  It implies that what may be intended as a mere suggestion is, in fact, the default.

Whenever I am reviewing code and my hubris needs to be checked, I try to offer a suggestion.  “How do you feel about…” or “what do you think about…” are good drop-in replacements for this horrible phrase.

Use overtly positive language

It is important to recognize that written communications have a profoundly negative bias.  We tend to interpret neutral communications as negative, and positive communications as neutral, which is a source of a lot of consternation in code reviews.  To combat this, strive for overt positivity.

Terseness can be interpreted as rude or inconsiderate.  In particular, avoid adding a two- or three-word comment.

End notes

My genuine unsupported-by-data opinion is that code quality and correctness are only marginally improved by having a code review by a gatekeeper.  It is rare that a reviewer shares the same level of insight as the person who authored the code.  A better model is to have code reviews treated as a learning exercise for the reviewer, i.e. to understand the changes to the code and to grow in understanding of the code base.  I think this framing gives their questions and remarks appropriate weight, and allows for a healthier discussion about the larger codebase and how the changes interact with it.

It is easy to imagine that brutal code reviews are a ritual that leads to the best outcomes.  “We’re attacking the code!”  In my experience, engineers are a lot more sensitive to criticism than they let on, and they are a lot more willing to be found wrong if they feel like collaborators rather than defendants.

I recommend taking a few minutes to check your hubris, and see if there are steps you can take to be more encouraging in your reviews.

How to attach to a screen after su’ing to another user (Cannot open your terminal)

If you’ve ever tried attaching to another user’s screen by su’ing to them, you might have gotten this message:

c@waimea:~ $ screen -r
Cannot open your terminal '/dev/pts/6' - please check.

Here is the TL;DR solution:

c@waimea:~ $ script /dev/null
Script started, file is /dev/null
c@waimea:~ $ screen -r

The problem is that screen needs to write to your virtual terminal, which is owned by the user you logged in as.

c@waimea:~ $ tty
/dev/pts/6
c@waimea:~ $ ls -alF /dev/pts/6
crw--w---- 1 not-c tty 136, 6 Jan  1 21:08 /dev/pts/6

script is an old-school utility for capturing a terminal session and writing it to a file (in this case, /dev/null). To do this, it creates a new TTY (owned by the current user) and changes the session TTY to use it.

c@waimea:~ $ script /dev/null
Script started, file is /dev/null
c@waimea:~ $ tty
/dev/pts/7
c@waimea:~ $ ls -alF /dev/pts/7
crw--w---- 1 c tty 136, 7 Jan  1  2016 /dev/pts/7