Monday, 22 August 2016

Deterministic JSON output from Python

How do we get deterministic JSON output from Python?

JSON objects are not inherently ordered: their properties can be written in any order without changing their meaning.

Unfortunately, people often use text tools to check the output, so sometimes we want to generate deterministic JSON:

Consider this bash/Python mess:

for ((i=0; i != 10; i++ )); do python3 -c 'import json; print(json.dumps({"a":420, "b":99, "c":100}))'; done

{"a": 420, "c": 100, "b": 99}
{"a": 420, "c": 100, "b": 99}
{"c": 100, "a": 420, "b": 99}
{"b": 99, "a": 420, "c": 100}
{"c": 100, "b": 99, "a": 420}
{"c": 100, "b": 99, "a": 420}
{"c": 100, "a": 420, "b": 99}
{"c": 100, "a": 420, "b": 99}
{"b": 99, "c": 100, "a": 420}
{"c": 100, "a": 420, "b": 99}

What's happening here?

Python's dict hashing uses a random seed ("hash randomization") to prevent hash-collision attacks. Each run of the program gets a different seed, so the dictionary keys come out in a different order. This is good because the order isn't predictable to an attacker, but it means that an otherwise deterministic program generates different output each time it's run.

Why would this be a problem?
  • You can't compare JSON files using "diff" any more, because these spurious differences always appear
  • Continuous integration - differing output triggers false alerts
  • Humans may be able to read the JSON more easily when the keys appear in a predictable order.

How do we fix it?

Use collections.OrderedDict. But be sure that you don't initialise it from a regular dict.


Correct (the key order is explicit):

import collections
od = collections.OrderedDict([("a", 420), ("b", 999), ("c", 888)])
od = collections.OrderedDict(); od["a"] = 420; od["b"] = 999; od["c"] = 888

Incorrect (creates a normal dict first):

od = collections.OrderedDict({"a":420, "b": 999, "c": 888})
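
For example, here's a minimal sketch that produces byte-identical output on every run:

import collections
import json

od = collections.OrderedDict([("a", 420), ("b", 99), ("c", 100)])
print(json.dumps(od))   # always prints {"a": 420, "b": 99, "c": 100}

# json.dumps(some_dict, sort_keys=True) is another way to get a stable
# (alphabetical) key order, if sorting the keys is acceptable.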


Tuesday, 26 April 2016

How to use "cron" to run periodic scheduled jobs

"cron" is the Unix (Linux, etc) scheduler which runs regularly scheduled jobs. This post is not meant to be the "man" page (it has one of those already), but ideas how to use "cron" in a robust way.

Setting up cron jobs

There are at least *three* ways of configuring cron jobs on a modern Linux system; technically these are extensions, but they're so quasi-standard that they're even (possibly) available on FreeBSD :)

  • Per-user "crontab" file. This can be edited using crontab -e, or replaced by crontab . If you are installing system-level software, you probably don't want to use this. Each user can have only one crontab file.
  • System-wide "crontab" file, usually /etc/crontab. This is usually managed by the distribution / package manager, and you probably don't want to change this; there is only one.
  • Per-package "crontab" files - usually kept in /etc/cron.d. There are multiple files, usually one per package (or per app). This is the best solution if you're distributing software as a package, on multiple machines, and want the installation to be robust and repeatable.

The system-wide crontab files give the option of running cron jobs as any OS user (who must exist, obviously!).

More tips on configuration:

  • A few environment variables can be set at the top of the file, typically things like PATH, SHELL and, importantly, MAILTO
  • If you don't set MAILTO, the stdout and stderr of each job will be emailed to the user who owns the cron job. This is seldom what you want nowadays (see the example below).
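
For illustration, a per-package file in /etc/cron.d might look something like this (the package name, user, schedule and paths here are made up):

# /etc/cron.d/mypackage
SHELL=/bin/sh
PATH=/usr/local/bin:/usr/bin:/bin
MAILTO=ops@example.com

# min  hour  day  month  weekday  user      command
17     *     *    *      *        someuser  /usr/local/bin/mypackage-hourly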

When to schedule jobs

Don't schedule daily jobs between midnight and 2am. Cron jobs are scheduled in local time, and the sysadmin might not have configured the machine for UTC, so there is a real possibility that cron jobs are repeated or missed during time-zone changes or daylight-saving transitions.

In general, I don't like to schedule daily jobs at all, except for non-critical work.

For important stuff, it's probably better to have an hourly job just check the (UTC) hour, so it can avoid time-zone dependency.
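
A minimal sketch of that idea in Python (the chosen hour and the actual work are placeholders):

import sys
from datetime import datetime, timezone

# Run this hourly from cron; only do the "daily" work at 03:00 UTC,
# independent of the machine's local time zone and DST changes.
if datetime.now(timezone.utc).hour != 3:
    sys.exit(0)

print("doing the daily work...")   # placeholder for the real job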

A feature of "cron" which is little-known, but available on most systems, is the "@reboot" jobs, which are run shortly after the system boots. Such jobs are useful to perform cleanup work that otherwise might not get done at all (e.g. a 6am job, when the user seldom has the machine powered on at 6am).


Some (i.e. most Linux) systems have a small script called "anacron", which runs jobs hourly, daily, weekly etc, with no particularly fixed schedule (because they run in sequence with other jobs which may take some time). 

This could be used as an alternative to "cron"; however, it has more limitations and, in particular, runs everything as root.

Multiple instances

"cron" mostly does not care about multiple instances, and will execute more than one copy of your job. This is almost never desirable, so if there is any probability whatsoever of this happening, you should prevent multiple instances.

The "flock" shell program might be handy (recipes are in the man page), or creating a file and exclusively locking it (e.g. in Python, C or your favourite language).

If you have a slow job (say, a backup, or something which relies on the network) and a second instance starts, there is a good chance that the second instance will slow down the first; then a third instance starts, and so on, until the whole system goes down under too many copies of one cron job.
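
A minimal sketch of the lock-file approach in Python (the lock path is just an example):

import fcntl
import sys

lock_file = open("/tmp/myjob.lock", "w")   # example path; pick somewhere sensible
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit(0)   # another instance already holds the lock; quietly give up

# ... do the real work; the lock is released when the process exits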

Stampeding herds

If your software will be installed on many machines sharing infrastructure (e.g. a network, a server, a VM host), then it may be useful to try to avoid a stampeding herd.

"cron" has the unfortunately property that it usually runs jobs at the exact same moment (usually to the same second), if identically configured on several machines. If your cron job depends on something, it can cause problems or failures.

The obvious solution is a "random sleep": sleeping a random amount of time (usually only seconds or minutes) before doing any work which needs infrastructure access. In some cases, though, a random sleep can increase resource usage; imagine this sequence:
  • Start up the program
  • Load lots of huge libraries
  • Connect to the database server
  • Random sleep (0...300 seconds)
  • Perform work which takes maybe 5 seconds
  • Exit
Here the random sleep does more harm than good. Be sure to place the "random sleep" before grabbing too many resources - see the sketch below.
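
A sketch of the better ordering (the 300-second upper bound is arbitrary):

import random
import time

# Jitter *before* loading heavy libraries or opening connections;
# do those imports and connects only after this point.
time.sleep(random.uniform(0, 300))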

Another option that I've seen occasionally is to generate the "crontab" at install time with a random value in the minute field.

Error handling

One of the nice things about running in a "cron" job is that "log & exit" is often sufficient error handling. This is particularly true if it's an hourly job, where the next hour would be an ideal time to retry whatever failed.

Some types of errors (e.g. network problems) might just go away if we try again 1 hour later. Other types (e.g. out of disc space) might need someone to fix them.

The stdout / stderr from a cron job is often lost, so it is usually important to log to a file or to the system log.
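
A minimal "log & exit" sketch (the log file name and the failing work are illustrative):

import logging
import sys

logging.basicConfig(filename="myjob.log",   # in real life, somewhere under /var/log
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

def do_the_work():
    raise RuntimeError("e.g. network unreachable")   # stand-in for the real job

try:
    do_the_work()
except Exception:
    logging.exception("job failed; the next scheduled run will retry")
    sys.exit(1)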


Maybe you don't need "cron" at all?

Historically, many popular web applications have just done periodic clean-up work in response to user requests - sometimes triggered on a random fraction of requests, sometimes by particular user activity.

Some third-party services exist simply to "hit" a script on your site (such apps are usually written in PHP, which I'm not necessarily advocating) on a regular basis.

Writing a permanently-running "daemon" process has some advantages over "cron", although it does mean more initial work and more "boiler plate" code. If work is required very frequently (say, several times per hour) a daemon may be more convenient and perform better.

Friday, 15 April 2016

How to correctly make "latest" symlinks

"Latest symlink"

A "latest" symlink, is a symbolic link (on Linux, Unix etc) which links to the "latest" version of a file.

Suppose we have a file which takes some effort to create, which is generated periodically or in response to some stimulus (e.g. user activity). Then we want to create a "latest version" symlink.

Ideally the properties should be
  • latest symlink always points at the latest version (duuh!)
  • latest symlink always exists
  • latest symlink never points at a partially completed, broken, missing or otherwise bad file
Sometimes people do this in a way which won't work.

How to create a symlink

Dead easy, right? Just call the "symlink" function. 

 int symlink(const char *oldpath, const char *newpath);

       symlink()  creates  a  symbolic  link  named newpath which contains the
       string oldpath.

But if we call "symlink" and the newpath parameter points to an already-existing file (including an existing symlink), the call will fail with EEXIST.

The wrong (obvious) way

try: remove(dest)        # delete any existing link first
except OSError: pass
symlink(src, dest)       # race: "dest" is briefly missing here

The correct (not so obvious) way

templink = dest + '.temp'
symlink(src, templink)
rename(templink, dest) 


We do it this way because we want to avoid a race window in which the destination symlink does not exist. rename() is atomic and instantly replaces the existing link with the new one; no other program can ever observe a missing file.
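
The same recipe in Python might look like this (a minimal sketch; the file names are illustrative):

import os

def update_latest(target, linkname):
    tmp = linkname + ".temp"
    # If a stale .temp link is left over from an earlier crash, remove it first.
    os.symlink(target, tmp)     # create the new link under a temporary name
    os.rename(tmp, linkname)    # atomically replace the old "latest" link

update_latest("report-2016-08-22.json", "latest.json")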

Other wrong ways

Some possibly common, but wrong (or even wronger), ways to do this:
  • Just create the "latest" file using our "do lots of work" process directly. This is really bad, as during file creation, another process can see a partially completed file. If your code looks like this: write file header; do lots of work to create file body; write file footer, then there is a really good chance that another process sees an incomplete file.
  • Create a different file, then copy the file (file copying isn't atomic)

Tuesday, 15 September 2015

Headless web browsers: PhantomJS and SlimerJS

What is headless web browsing?

It's using a web-browser-like application to do automated fetching and analysis of web pages, without a human user present. This is different from simply fetching HTML content over HTTP: headless web browsers typically also load images, run Javascript, apply CSS and lay out the page content (albeit invisibly).

The developer can then use scripting (usually Javascript) to examine the page as it is laid out in memory, as if in a "real" web browser, to look at the style of text, etc.

We could even use OCR to look for text within images shown in the page.


Why do this?

  • More effectively analyse the content of pages. Lots of pages nowadays contain a huge amount of "boiler plate" uninteresting text, often in HTML elements without semantic meaning (e.g. DIV). Only by using CSS (and sometimes Javascript) can a computer see the page as a human would.
  • Generation of screenshots
  • Getting metadata which are dynamically written by scripts, etc, such as Javascript-created links.
  • Automated testing of web applications

What tools are available?

Several. Traditionally some users have hacked their own solutions using either a web browser extension, or embedding a web browser in a C++ program (often Webkit).

Here I'm looking at PhantomJS and SlimerJS.

PhantomJS and SlimerJS essentially perform the same task - to run developer-specified Javascript code in the context of an automated web browser, without using a real web browser.

PhantomJS is based on Webkit; SlimerJS is based on Mozilla / Firefox.


Two versions of PhantomJS are available - the 1.9 series and the 2.0 series. The main difference is that the 2.0 series uses a more recent version of Webkit.

Unfortunately, last time I tested them, neither is very good for browsing lots of real web pages "in the wild". NB: This may be fixed when you read this, test it yourself!

  • Lots of memory usage
  • Slow
  • Prone to crashing; diagnosing crashes is very difficult
  • v1.9 has an out-of-date Webkit which has less feature support
  • v2.0 seems to leak memory very badly.

So probably PhantomJS is ok for some automated testing scenarios, particularly if you have a "single page application", or only a small number of pages tested.

But accessing large numbers of "real" web pages quickly breaks it, and it's not easy to fix.

Essentially the problem is that the Webkit snapshot PhantomJS uses is an abandoned fork (Apple and Google have both moved on from it), so bugs don't get fixed upstream, and PhantomJS does not usually apply bugfixes to Webkit itself.

PhantomJS is a C++ executable that includes most of Webkit inside its binary. This is OK, as it's almost completely standalone, but it means that compiling it is VERY time consuming, particularly on limited hardware. For example, on a Raspberry Pi I was able to run PhantomJS, but building it would take days (a more powerful system is really required). On a modern x86 system compiling is much quicker, but can still take an hour; the link step uses several GB of memory (not really a problem on a server, but be careful if building in a memory-limited VM).

Linux binaries are also available from the web site, which is handy :)


SlimerJS is a completely different beast from PhantomJS. It is not a C++ binary and doesn't attempt to embed the engine directly in its own application. Instead, it uses an obscure feature of Firefox to run an alternative "user interface application" which provides an environment which is almost identical to PhantomJS.

This has benefits and drawbacks

  • It is not completely headless. It doesn't require user input, but it won't work without an X server on Linux (this is easily fixed using Xvfb). Under Windows, visible windows may be shown unless it runs on an alternate desktop, or as a service.
  • The web browser used is really identical to the Firefox version you're using - all the same features are available.
  • If you update Firefox, SlimerJS updates too (pro: good for security; con: it might break)

SlimerJS is under moderately active development, but has a much smaller user community than PhantomJS.

  • Performance of SlimerJS (using Firefox 40) seems MUCH BETTER than PhantomJS in general
  • Stability seems much better too (although I have had a few crashes)
  • The same APIs are supported, but the documentation is mostly worse (example: the filesystem objects are barely documented)

Wrap up

So there you are - headless web browsing IS a niche application, but it is very useful in its place. I like SlimerJS because its overall design approach seems to work better in the general case.

It would be interesting to have a SlimerJS / PhantomJS type application which uses Google Chrome as its web browser. I imagine one may appear, if it does not already exist.

Tuesday, 8 July 2014

Why I use Python

There are a lot of holy wars between programming language advocates in the industry.

I use Python.


Because I have found that programs written in Python are significantly more likely to Work Correctly. They more often run correctly the first time, so I spend less time debugging.

There are lots of other reasons, but it's mainly the "It just works".

A few years ago, I had 10+ years of industry experience of Perl, but only 1 year of Python. I already found that my Python programs initially worked correctly FAR MORE OFTEN.

This is comparing a language that I'd been using commercially, most days, for 10 years, with something that I'd only just picked up.

That's why I use Python.

Other stuff?

DEBUGGING: Stack traces. In Perl, it's possible to get a stack trace from an exception, but you sometimes have to jump through hoops to do it. Likewise, in PHP, stack traces are available in principle, but don't always appear when you need them. Java? Stack traces usually appear, but are often huge and difficult to read (even better: in some versions of Java, the compiler "optimises" line numbers out of stack traces). In Python, stack traces Just Work. Like everything else.

Friday, 3 January 2014

The most common cause of unavailability

Hi, Happy new year.

I've done a lot of work on high-availability systems. There is a lot of writing on high-availability systems - how to implement failover, hot-spare systems, load-balancers etc.

However, most of these seem to make an assumption: humans are infallible.

In practice, this is not always the case.

In fact, I'd say that probably about 75% of downtime is caused by human errors, cock-ups, mistakes. I'm not an expert, but I suspect that it's about the same proportion as air crashes caused by pilot (or someone else's) error.

So, it's the human, stupid. PBKAC (problem between keyboard and chair).

Here are some possible fixes:

Give the human less work to do

We can avoid SOME human errors by having systems automatically configure themselves, setup, or perform sanity checks before accepting settings.

"Blindly accepting" instructions given by an idiot human, is what computers are generally very good at, and often results in chaos.
  • Automatic configuration is less likely to make a mistake (when was the last time you saw a DHCP server give out a duplicate IP address? Never?)
  • Testing a configuration provided by a human, BEFORE applying it, might prevent a mistake.
  • Automatic configuration is a "clever" piece of software which can be difficult to test. Sometimes autoconfiguration tests are mind-blowingly difficult to set up. This means it's not likely to be very well tested.
  • If autoconfig gets it wrong, it's likely to do it on a large scale.
  • Automatic configuration probably means that the humans have no idea how it works, which may be bad.
  • Autoconfig might not correctly handle "very unusual" circumstances.

Give the human more work to do

Perversely, we could give the humans more work. Something that someone does every day, they're unlikely to get wrong. The procedures which are done infrequently are more likely to be a problem.

  • Less code to test. Manual configuration.
  • Humans (ops engineer, support engineer) become used to getting manual configuration correct
  • Humans might be able to set up unusual configurations required for unusual circumstances. (Corollary: such unusual configurations probably won't have been tested!)

My view

This is a completely opinionated rant...

  • Humans cannot be trusted to do anything
  • Make everything either automatic, or hard-coded
  • Give users as few options as possible. Perhaps let them make some cosmetic changes (colours, window layout?). Don't let them change the TCP port number your application uses.
  • There is usually far less value in allowing a parameter to be changed than forcing it to a reasonable hard-coded value.
  • Sales engineers want to see screens full of knobs and dials? Give them placebo ones which do little or nothing (i.e. don't break it).
  • If you put a big red button in the reactor control room, marked "Do not press.", someone will press it sooner or later.
On testing...

  • If your program fails to correctly find the server when running on Chinese-language Windows Vista connected to two VPNs at once, it's probably not really important.
  • If your software can't find the server for anyone who typoes the address, it's more important.
Of course users sometimes want to configure "policy". This is understandable, but try not to enable them to create stupid policies (or at least, make sure it's strongly discouraged!). A policy of "delete all emails instantly and irretrievably" sounds more like a bug to me.

Thursday, 17 October 2013

On web crawling robots

Here are some observations from writing web-crawling robots.


At some point, many of us (in the IT security industry) will need to write a robot which scrapes lots of web sites. By "lots", I mean a very large number, run by arbitrary parties. Not just a few run by well-behaved, cooperative entities.

Most owners of web servers try to make them compatible - but this is not guaranteed. Even with the best of intentions, we'll probably find things which go wrong.

Behaviour observed

Faulty DNS
* Returns too large responses
* Returns private addresses in "A" responses

Server hangs / timeout
* Connection timeout
* Timeout waiting for response
* Connection hang during headers or response

Bad responses
* Connection closed after request
* Connection closed while transmitting headers
* Connection closed while transmitting bodies

* Garbled response
* Bad status code
* Too many headers
* A single very long header

HTTP Redirects
* Redirect to relative URI
* Redirect loop
* Redirect to private sites, e.g. not-qualified names, private IPs
* 301 / 302 status, no Location: header

Encoding and content-type problems
* High-bit bytes in HTTP headers which are not valid UTF-8
* No declared encoding, but content is not ASCII or Latin-1
* Wrong declared encoding
* Unknown declared encoding (e.g. Shift_JIS variations)
* Inconsistent encodings declared in the Content-Type header and in the HTML itself
* Bad byte sequences for the declared or detected encoding
* Non-HTML content served with an HTML content-type (e.g. image, PDF)

* Bad certificates. If we don't care, it might be better not to attempt to verify certificates.
* Things which break our SSL library

* 200 status even for pages which do not exist
* 301 / 302 status for pages which do not exist (expecting 404)
* robots.txt served as html
* Unexpectedly large content

Framers, ad-injectors
* Frame somebody else's content
* Use javascript to display someone else's content with other (advert) elements layered or obstructing

* Some web sites exist to spam search engines
* These often contain large numbers of host names, linking to each other - "Link farms"
* Spam will cause us to waste resource and "dilute" good content (for statistical analysis, etc)



Robustness

If the process crashes, we have a problem. A web crawler needs to be able to recover from unexpected errors.
  • Set timeouts to a reasonable value. Defaults are typically too high.
    • Check that timeouts work at every stage.
  • Expect large responses; limit the size if possible.
  • Don't assume that anything is valid UTF-8 (even if some spec requires it to be).
  • Take metadata with a pinch of salt, e.g. Content-Length does not imply anything about the size of the content!
  • Be aware of race conditions. If you look again, something might disappear, appear or change. (Example: a HEAD request reports one Content-Type, then the subsequent GET returns something different.)

Making sense

We all hope that everything makes sense. However, it's not that simple. What encoding should we interpret things as? What content-type is really present?

Some sites serve data with incorrect metadata, but missing metadata is far more common.

A large proportion of Russian web sites are encoded in Windows-1251 without any metadata. A significant proportion of Japanese sites use Shift_JIS (or its many variants) without metadata.

Sometimes we just have to try to guess. There are definitely cases where we're going to see garbage and need to be able to identify it so we can ignore it.
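
When a guess is needed, a character-detection library is one option; here is a sketch using the third-party chardet package (not mentioned above, just one common choice):

import chardet

# e.g. a Russian page served with no encoding metadata
raw = "Привет, мир! Это страница без метаданных.".encode("windows-1251")

guess = chardet.detect(raw)   # returns a dict with 'encoding' and 'confidence'
if guess["encoding"] and guess["confidence"] > 0.5:
    text = raw.decode(guess["encoding"], errors="replace")
else:
    text = None   # give up and treat the page as garbage

print(guess, text)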


Performance

If we've got a lot of work to do, we want to get through it as quickly as possible - or at least fast enough.


  • Parallel fetching. Any serious robot is going to need to do lots of this, so consider multiprocessing or asynchronous frameworks; for large scale it might need to be split amongst several hosts (see the sketch after this list).
  • HTTP HEAD method. If we only want the headers, use HEAD. This can save a lot of bandwidth, and virtually all servers support it.
  • HTTP/1.1 Range requests. We can ask for, say, the first 10k of a page using a "Range" request. Not all servers support it, but we can fail gracefully
  • gzip content - if your client supports it and there are no interoperability problems.
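
A minimal sketch of parallel fetching with a timeout and a Range request, using only the Python standard library (the URLs, sizes and worker count are arbitrary):

import concurrent.futures
import urllib.request

def fetch_start(url):
    # Ask for only the first 10 KB; servers that ignore Range send everything anyway.
    req = urllib.request.Request(url, headers={"Range": "bytes=0-10239"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return url, resp.status, resp.read(10240)
    except OSError as exc:   # covers timeouts, DNS failures and HTTP errors
        return url, None, exc

urls = ["http://example.com/", "http://example.org/"]   # placeholders
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for url, status, payload in pool.map(fetch_start, urls):
        print(url, status)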

Bad ideas:
  • Keep-alive or pipelining.  Can cause interoperability problems, usually unnecessary. These are latency optimisations for web-browsers. (Possibly desirable when getting lots of pages from the same site on SSL)
  • Caching, proxies. It would be better for the application to behave intelligently and avoid requesting the same data more than once.

Do remember to be polite - don't hammer the same site repeatedly. If you have lots of pages to get from the same site, interleave the requests with requests to different sites.

Depending on your use-case, it might be a good idea to "back off" a site which returns errors (particularly 5xx or network-layer) and try again later.


It's probably a good idea, to maximise throughput, to decouple different stages with queues of work between them. It might also make our code cleaner, easier to test and possibly more robust (we can possibly retry any stage which fails). For example, decouple
* Fetching robots.txt (if you use it)
* Fetching other entities
* Parsing and processing
* Scheduling / prioritisation

It might also be worthwhile to decouple DNS requests from actual fetching.

One of the reasons to decouple is that parsing takes lots of memory, but fetching requires a lot of waiting for the network. We don't want to wait for the network a lot while simultaneously using a lot of memory. Doing fetching and parsing in different processes means we can let the parser make a mess of our heaps (i.e. heap fragmentation, possibly leaks) and occasionally call _exit to clean it all up without impacting the fetch latency.
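
A minimal sketch of the fetch/parse split with multiprocessing queues (the fetch and parse bodies are placeholders):

import multiprocessing as mp

def fetcher(url_q, page_q):
    while True:
        url = url_q.get()
        if url is None:                 # sentinel: no more work
            page_q.put(None)
            break
        page_q.put((url, b"<html>...</html>"))   # placeholder for a real fetch

def parser(page_q):
    while True:
        item = page_q.get()
        if item is None:
            break
        url, body = item
        print("parsing", url)           # real parsing (and its messy heap) lives here

if __name__ == "__main__":
    url_q, page_q = mp.Queue(), mp.Queue()
    mp.Process(target=fetcher, args=(url_q, page_q)).start()
    mp.Process(target=parser, args=(page_q,)).start()
    for u in ["http://example.com/", "http://example.org/"]:
        url_q.put(u)
    url_q.put(None)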

Last resorts - nuclear options

Ask a human

We could add "alarm" conditions, to have the crawler ask a human when it encounters something unexpected. This may be useful, for example, to try to decipher a page in the wrong encoding.


Blacklist the site

If we persistently see bad sites which are spam, cause robustness problems or serve just plain nonsense, we can blacklist them.
  • Blacklisting host names (or domain names) is ok
  • Blacklisting IPv4 addresses is better (the supply is much more limited!)
 Ideally we don't even connect to a blacklisted site. Failing that, we could connect and then drop the connection.

If carrying out a very large-scale activity, automating blacklisting is desirable.