Tuesday, 8 July 2014

Why I use Python

There are a lot of holy wars between programming language advocates in the industry.

I use Python.

Why?

Because I have found that programs written in Python are significantly more likely to Work Correctly. They more often run correctly the first time, so I spend less time debugging.

There are lots of other reasons, but it mainly comes down to "it just works".

A few years ago, I had 10+ years of industry experience with Perl, but only one year of Python. Even then, I found that my Python programs initially worked correctly FAR MORE OFTEN.

This is comparing a language that I'd been using commercially, most days, for 10 years, with something that I'd only just picked up.

That's why I use Python.

---
Other stuff?

DEBUGGING: Stack traces. In Perl, it's possible to get a stack trace from an exception, but you sometimes have to jump through hoops to do it. Likewise, in PHP, stack traces are available in principle, but don't always appear when you need them. Java? Stack traces usually appear, but are often huge and difficult to read (even better: in some versions of Java, the compiler "optimises" line numbers out of stack traces). In Python, stack traces Just Work. Like everything else.
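
To make that concrete, here's a minimal sketch (nothing project-specific assumed): an uncaught exception in Python prints a full stack trace with file names and line numbers, and the standard traceback module gives you the same text when you catch the exception yourself, e.g. for logging.

import traceback

def parse(value):
    return int(value)            # raises ValueError for bad input

def handle(value):
    return parse(value)

try:
    handle("not a number")
except ValueError:
    # The same trace an uncaught exception would have printed, as a string.
    print(traceback.format_exc())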

Friday, 3 January 2014

The most common cause of unavailability

Hi, Happy new year.

I've done a lot of work on high-availability systems. There is a lot of writing on high-availability systems - how to implement failover, hot-spare systems, load-balancers etc.

However, most of these seem to make an assumption: humans are infallible.

In practice, this is not always the case.

In fact, I'd say that probably about 75% of downtime is caused by human errors, cock-ups and mistakes. I'm not an expert, but I suspect it's about the same proportion as for air crashes caused by pilot (or other human) error.

So, it's the human, stupid. PBKAC (problem between keyboard and chair).

Here are some possible fixes:

Give humans less work to do

We can avoid SOME human errors by having systems automatically configure themselves, set themselves up, or perform sanity checks before accepting settings.

"Blindly accepting" instructions given by an idiot human, is what computers are generally very good at, and often results in chaos.
  • Automatic configuration is less likely to make a mistake (when was the last time you saw a DHCP server give out a duplicate IP address? Never?)
  • Testing a configuration provided by a human, BEFORE applying it, might prevent a mistake (see the sketch after this list).
HOWEVER
  • Automatic configuration is a "clever" piece of software which can be difficult to test. Sometimes autoconfiguration tests are mind-blowingly difficult to set up. This means it's not likely to be very well tested.
  • If autoconfig gets it wrong, it's likely to do it on a large scale.
  • Automatic configuration probably means that the humans have no idea how it works, which may be bad.
  • Autoconfig might not correctly handle "very unusual" circumstances.
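
As a rough illustration of the "test it before applying it" point above, here's a minimal sketch; the settings dict and the apply_settings callback are hypothetical stand-ins for whatever your system actually does.

import ipaddress

def validate(settings):
    errors = []
    port = settings.get("port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append("port must be an integer between 1 and 65535")
    try:
        ipaddress.ip_address(settings.get("listen_address", ""))
    except ValueError:
        errors.append("listen_address is not a valid IP address")
    return errors

def apply_if_valid(settings, apply_settings):
    # apply_settings is a hypothetical callback which commits the change.
    errors = validate(settings)
    if errors:
        raise ValueError("refusing to apply settings: " + "; ".join(errors))
    apply_settings(settings)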

Give humans more work to do

Perversely, we could give the humans more work. Something someone does every day, they're unlikely to get wrong; it's the procedures done infrequently which are more likely to be a problem.

  • Manual configuration means less code to test.
  • Humans (ops engineers, support engineers) become used to getting manual configuration correct.
  • Humans might be able to set up unusual configurations required for unusual circumstances. (Corollary: such unusual configurations probably won't have been tested!)

My view

This is a completely opinionated rant...

  • Humans cannot be trusted to do anything
  • Make everything either automatic, or hard-coded
  • Give users as few options as possible. Perhaps let them make some cosmetic changes (colours, window layout?). Don't let them change the TCP port number your application uses.
  • There is usually far less value in allowing a parameter to be changed than in forcing it to a reasonable hard-coded value.
  • Sales engineers want to see screens full of knobs and dials? Give them placebo ones which do little or nothing (i.e. don't break it).
  • If you put a big red button in the reactor control room, marked "Do not press.", someone will press it sooner or later.
On testing...

  • If your program fails to correctly find the server when running on Chinese-language Windows Vista connected to two VPNs at once, it's probably not really important.
  • If your software can't find the server for anyone who mistypes the address, it's more important.
Of course users sometimes want to configure "policy". This is understandable, but try not to enable them to create stupid policies (or at least, make sure they're strongly discouraged!). A policy of "delete all emails instantly and irretrievably" sounds more like a bug to me.
 

Thursday, 17 October 2013

On web crawling robots

Here are some observations from writing web-crawling robots.

Intro



At some point, many of us (in the IT security industry) will need to write a robot which scrapes lots of web sites. By "lots", I mean a very large number, run by arbitrary parties. Not just a few run by well-behaved, cooperative entities.

Most owners of web servers try to make them compatible - but this is not guaranteed. Even with the best of intentions, we'll probably find things which go wrong.

Behaviour observed


Faulty DNS
* Returns responses which are too large
* Returns private addresses in "A" responses

Server hangs / timeout
* Connection timeout
* Timeout waiting for response
* Connection hang during headers or response

Bad responses
* Connection closed after request
* Connection closed while transmitting headers
* Connection closed while transmitting bodies

HTTP
* Garbled response
* Bad status code
* Too many headers
* A single very long header

HTTP Redirects
* Redirect to relative URI
* Redirect loop
* Redirect to private sites, e.g. unqualified names, private IPs
* 301 / 302 status, no Location: header

Content
* High-bit bytes in HTTP headers which are not valid UTF-8
* No declared encoding, but content is not ASCII or Latin-1
* Wrong declared encoding
* Unknown declared encoding (e.g. sjis variations)
* Inconsistent encodings declared in the Content-Type header and the HTML itself
* Bad byte sequences for the declared or detected encoding
* Non-html content served with html content-type (e.g. image, pdf)

SSL
* Bad certificates. If we don't care, it might be better not to attempt to verify certificates.
* Things which break our SSL library


Misc
* 200 status even for pages which do not exist
* 301 / 302 status for pages which do not exist (expecting 404)
* robots.txt served as html
* Unexpectedly large content

Framers, ad-injectors
* Frame somebody else's content
* Use JavaScript to display someone else's content with other (advert) elements layered over it or obstructing it

Spam
* Some web sites exist to spam search engines
* These often contain large numbers of host names, linking to each other - "Link farms"
* Spam will cause us to waste resources and "dilute" good content (for statistical analysis, etc.)

Advice

Robustness

If the process crashes, we have a problem. A web crawler needs to be able to recover from unexpected errors.
  • Set timeouts to a reasonable value; defaults are typically too high.
    • Check that timeouts work at every stage.
  • Expect large responses; limit the size if possible (see the sketch below).
  • Don't assume that anything is valid UTF-8 (even if some spec requires it to be).
  • Take metadata with a pinch of salt, e.g. Content-Length does not imply anything about the size of the content!
  • Be aware of race conditions: if you look again, something might disappear, appear or change. (Example: a HEAD request reports one Content-Type, then the subsequent GET returns a different one.)
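
As a sketch of the timeout and size-limit points, here's roughly what that looks like with the third-party requests library - an assumption on my part; any HTTP client with streaming and timeouts will do.

import requests

MAX_BYTES = 1000000        # our own cap - never trust Content-Length

def bounded_get(url):
    # (connect timeout, read timeout) in seconds; stream so we can stop early.
    with requests.get(url, timeout=(5, 10), stream=True) as resp:
        body = b""
        for chunk in resp.iter_content(chunk_size=65536):
            body += chunk
            if len(body) > MAX_BYTES:
                raise ValueError("response too large: %s" % url)
        return resp.status_code, resp.headers, body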

Making sense

We all hope that everything makes sense. However, it's not that simple. What encoding should we interpret things as? What content-type is really present?

Some sites serve data with incorrect metadata, but missing metadata is far more common.

A large proportion of Russian web sites are encoded in Windows-1251 without any metadata. A significant proportion of Japanese sites use Shift_JIS (or its many variants) without metadata.

Sometimes we just have to try to guess. There are definitely cases where we're going to see garbage and need to be able to identify it so we can ignore it.
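
One way to do the guessing - a sketch using the third-party chardet package (an assumption; other detectors exist), falling back to Latin-1, which never fails to decode:

import chardet

def decode_best_effort(raw, declared=None):
    candidates = []
    if declared:
        candidates.append(declared)           # trust the metadata first...
    guess = chardet.detect(raw)               # ...then a statistical guess
    if guess.get("encoding"):
        candidates.append(guess["encoding"])
    candidates.append("latin-1")              # last resort: always decodes
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (LookupError, UnicodeDecodeError):
            continue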

Performance

If we've got a lot of work to do, we want to get through it as quickly as possible. Or at least fast enough.

Ideas:

  • Parallel fetching. Any serious robot is going to need to do lots of this, so consider multiprocessing or asynchronous frameworks. At large scale, the work might need to be split amongst several hosts.
  • HTTP HEAD method. If we only want the headers, use HEAD. This can save a lot of bandwidth, and virtually all servers support it (sketched below).
  • HTTP/1.1 Range requests. We can ask for, say, the first 10k of a page using a "Range" request. Not all servers support it, but we can fail gracefully (also sketched below).
  • gzip content - if your client supports it and there are no interoperability problems.
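
Here are the HEAD and Range ideas sketched with the requests library (again an assumption). Note that if the server ignores the Range header we just get the whole body back and truncate it ourselves.

import requests

def headers_only(url):
    # Follow redirects so we see the final resource's metadata.
    return requests.head(url, timeout=(5, 10), allow_redirects=True).headers

def first_10k(url):
    resp = requests.get(url, timeout=(5, 10),
                        headers={"Range": "bytes=0-10239"})
    if resp.status_code == 206:        # server honoured the Range request
        return resp.content
    return resp.content[:10240]        # it didn't; truncate locally instead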

Bad ideas:
  • Keep-alive or pipelining. Can cause interoperability problems and are usually unnecessary; these are latency optimisations for web browsers. (Possibly desirable when getting lots of pages from the same site over SSL.)
  • Caching, proxies. It would be better for the application to behave intelligently and avoid requesting the same data more than once.
Do remember to be polite - don't hammer the same site repeatedly. If you have lots of pages to get from the same site, interleave the requests with requests to different sites.

Depending on your use-case, it might be a good idea to "back off" a site which returns errors (particularly 5xx or network-layer) and try again later.
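
A toy sketch of per-host politeness plus back-off; a real crawler would keep this state in its scheduler, and the delays here are only illustrative.

import time
from urllib.parse import urlsplit

MIN_GAP = 5.0          # seconds between requests to the same host
last_hit = {}          # host -> time of the last request
penalty = {}           # host -> extra back-off delay after errors

def ready(url):
    host = urlsplit(url).hostname
    last = last_hit.get(host)
    if last is None:
        return True                    # never fetched from this host yet
    return time.monotonic() - last >= MIN_GAP + penalty.get(host, 0.0)

def record(url, ok):
    host = urlsplit(url).hostname
    last_hit[host] = time.monotonic()
    if ok:
        penalty[host] = 0.0
    else:
        # double the back-off on 5xx / network-layer errors, capped at an hour
        penalty[host] = min(penalty.get(host, 30.0) * 2, 3600.0)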

Decoupling 

It's probably a good idea, to maximise throughput, to decouple the different stages with queues of work between them. It might also make our code cleaner, easier to test and possibly more robust (we can retry any stage which fails). For example, decouple:
* Fetching robots.txt (if you use it)
* Fetching other entities
* Parsing and processing
* Scheduling / prioritisation

It might also be worthwhile to decouple DNS requests from actual fetching.

One of the reasons to decouple is that parsing takes lots of memory, but fetching requires a lot of waiting for the network. We don't want to wait for the network a lot while simultaneously using a lot of memory. Doing fetching and parsing in different processes means we can let the parser make a mess of our heaps (i.e. heap fragmentation, possibly leaks) and occasionally call _exit to clean it all up without impacting the fetch latency.
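
Here's a minimal sketch of that split using multiprocessing queues; fetch_one() and parse_one() are just placeholders for the real work.

import multiprocessing as mp

def fetch_one(url):
    return b"<html>...</html>"               # placeholder for the real fetch

def parse_one(url, body):
    return (url, len(body))                  # placeholder for the real parse

def fetch_worker(url_queue, page_queue):
    while True:
        url = url_queue.get()
        if url is None:                      # sentinel: shut down
            page_queue.put(None)
            break
        page_queue.put((url, fetch_one(url)))

def parse_worker(page_queue, result_queue):
    while True:
        item = page_queue.get()
        if item is None:
            break
        url, body = item
        result_queue.put(parse_one(url, body))

if __name__ == "__main__":
    urls, pages, results = mp.Queue(), mp.Queue(maxsize=100), mp.Queue()
    mp.Process(target=fetch_worker, args=(urls, pages)).start()
    mp.Process(target=parse_worker, args=(pages, results)).start()
    for u in ["http://example.com/", "http://example.org/"]:
        urls.put(u)
    urls.put(None)                           # no more work
    for _ in range(2):
        print(results.get())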

Last resorts - nuclear options

Ask a human

We could add "alarm" conditions, to have the crawler ask a human when it encounters something unexpected. This may be useful, for example, to try to decipher a page in the wrong encoding.

Blacklisting

If we persistently see bad sites which are spam, causing robustness problems or just plain nonsense, we can blacklist them.
  • Blacklisting host names (or domain names) is ok
  • Blacklisting IPv4 addresses is better (the supply is much more limited!)
Ideally we don't even connect to a blacklisted site. Failing that, we could connect and then drop the connection.

If carrying out a very large-scale activity, automating blacklisting is desirable.
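
A small sketch of the "don't even connect" check; the blacklists here are hypothetical examples, and we refuse private addresses while we're at it.

import ipaddress
import socket

BLACKLISTED_HOSTS = {"spam.example", "linkfarm.example"}   # hypothetical
BLACKLISTED_IPS = {"192.0.2.1"}                            # hypothetical

def allowed(host):
    if host in BLACKLISTED_HOSTS:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False               # unresolvable: don't bother connecting
    return str(addr) not in BLACKLISTED_IPS and not addr.is_private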


Sunday, 22 April 2012

Move the viewport, not the world!

Two separate posts on gamedev.stackexchange.com drew my attention. It appears that some game programmers are getting it WRONG.

Superficially, it seems that in some games (think of scrolling games with enemies appearing at the edge of the screen), the enemies are all approaching at a constant rate.

But what REALLY happens is that the viewport scrolls at a fixed rate, and each enemy starts at a fixed world-space position and activates when the viewport reaches a particular place.

This is not a very important distinction for the player, but it's a HUGE distinction for the programmer.

If the developer creates one million enemies at the beginning of the game and then, on every single tick, moves each one of them to a slightly different position as the camera moves, they are probably doing it wrong.
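
Roughly what the right way looks like, as a minimal sketch (draw() stands in for whatever the real renderer does): enemies keep fixed world-space positions; only the camera offset changes each tick, and screen position is just world position minus camera position.

VIEW_W = 640                       # viewport width in world units

class Enemy:
    def __init__(self, world_x, world_y):
        self.world_x, self.world_y = world_x, world_y
        self.active = False

def draw(screen_x, screen_y):
    pass                           # placeholder for the real rendering call

def tick(enemies, camera_x, scroll_speed=2):
    camera_x += scroll_speed       # move the viewport, not the world
    for e in enemies:
        if not e.active and e.world_x < camera_x + VIEW_W:
            e.active = True        # wakes up as it scrolls into view
        if e.active:
            draw(e.world_x - camera_x, e.world_y)   # screen = world - camera
    return camera_x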



So games developers - remember - move the viewport, not the world!

Kevin Reid's answer to the "thousands of enemies" question says it nicely: "Don't move the enemies! Move the camera!"

Monday, 23 May 2011

Expecting the unexpected

Few developers consider, when trying to build robust platforms, all the possible modes of failure. Indeed, it is difficult to consider them all, let alone plan for them, or design tests which exercise particular symptoms.

In this post, I discuss some of the types of failure we can see in real systems.

Complete server failure



Most developers DO consider this. In a "Complete server failure", what generally happens is:

* The server stops processing new requests, completely.
* The server's OS no longer responds to any network request at all (e.g. "ping")
* Processing does not continue within the server
* The contents of memory are immediately and irretrievably lost.

Typically, the server eventually recovers: it is rebooted and restored to full health, and all writes which were acknowledged before the failure have been persisted.

This is very easy to simulate (just hit the "power off" button in your VM hypervisor) and fairly easy to plan for; most robust systems consider this kind of scenario.

Network failure



There are many different kinds of network failure, but consider the simplest, most severe network failure:

* One or more machines in the infrastructure lose network connectivity
* None of them can talk to anything at all, including each other
* Local processing on these servers continues as normal
* No machines need to be rebooted to fix the fault, when it is repaired everything is back to normal.

This is the symptom of, for example, a complete switch failure.

I won't discuss network failures in any more detail, but there are many different kinds. My experience suggests that the most common is partial or complete loss of internet connectivity from one location (datacentre).

IO subsystem failures


* One or more discs / volumes suddenly become unavailable
* The OS does not reboot; processes do not stop

These are the kinds of failures which developers typically don't consider, and they are a lot more difficult to simulate. What might happen is that the power fails for a disc enclosure unit but not its host server; in this case the OS and its boot discs remain available, but the data discs are not. In these cases, failover might not be triggered, or might behave incorrectly.

Heavy load or unexpected poor performance



* A single server unexpectedly starts performing very badly
* In the extreme, this means without sufficient capacity to do useful work
* But it's not failed; no subsystem is individually totally unavailable
* Sometimes the effect is severe enough to prevent operations engineers logging in to diagnose / fix the fault

These kinds of faults usually cause a larger problem, because failover systems aren't triggered, or cannot take over in a timely fashion. Common causes include:

* Rogue process consuming lots of resources
* Denial-of-service attack
* Bad application design causing legitimate requests to suddenly spike in resource usage.
* Operational error (well, anything can be caused by operational error :) )

"Zombie" systems or, back from the dead



* A system fails in a catastrophic way and can't be remotely recovered
* Operations engineers assume that it's going to be completely dead until physically replaced (they are some distance away and either don't raise a "remote hands" request or are unable to recover it by doing so)
* Another system is provisioned in its place, and takes over its IP address, role etc
* Then one day... the "Zombie" system unexpectedly comes back from the dead to haunt its successor... Brraaainss....

Of course this could be months later, after many software updates (possibly security updates). The "zombie" system is running an old build and will not carry out correct processing if it is given work to do.

Conclusion



These are just a few of the annoying types of failures which happen to real systems in production. Expect the unexpected (as if that's not a contradiction!).

Happy hacking!

Sunday, 20 February 2011

HTML 2d Canvas upscaling - really inefficient

I started writing some test programs with the HTML canvas element. This is great, as you can actually write games in JavaScript - efficiently - in principle.

My previous attempts have all used the DOM API, which is neither very convenient nor very efficient.

I had assumed the canvas 2d drawing context was basically a software-renderer - it's not extremely efficient, but provided the canvas doesn't have too many pixels, you can still do a lot of work per frame on a modern machine.

Which is fine.

Suppose you have a canvas which is 640x320 pixels; you can then have it upscaled to whatever resolution the browser window is, making the game appear the same size for everyone. Great.

Except the upscaling in web browsers kills performance. I tried Firefox 3.6 and Chrome 9; both of them use loads of CPU scaling the canvas onto the screen.

If we use a canvas element without any scaling (no CSS width etc) then all is fine.

Scale it up to a large window and boom! Now it's as slow as a pregnant snail. Bummer.

See Example here

Wednesday, 26 May 2010

MySQL, what are you smoking?

There are a lot of weird things which MySQL does to handle its mix of transactional and non-transactional behaviour, but this one was new to me :)

create table t1 (ID INT NOT NULL PRIMARY KEY, V INT NOT NULL);

Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 (ID,V) VALUES (2,NULL);
ERROR 1048 (23000): Column 'V' cannot be null

mysql> insert into t1 (ID,V) VALUES (3,1),(4,1);
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0

insert into t1 (ID,V) VALUES (5,1),(6,NULL);
Query OK, 2 rows affected, 1 warning (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 1

mysql> show warnings;
+---------+------+---------------------------+
| Level | Code | Message |
+---------+------+---------------------------+
| Warning | 1048 | Column 'V' cannot be null |
+---------+------+---------------------------+
1 row in set (0.00 sec)

select * from t1;

+----+---+
| ID | V |
+----+---+
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 0 |
+----+---+
4 rows in set (0.00 sec)
What's going on?
MySQL does not consider (some) errors to be errors, if they happen on the second or subsequent row of a multi-row insert.

On the other hand, if it happens on the first row, it's an error.

Why?

Because non-transactional engines can't roll back to a savepoint. This means that if it has already inserted one or more rows, generating an error would leave some of those rows in the database.

No, really why?

I don't know. This is not consistent with, for example, a unique index violation, which makes it stop halfway through a multi-row insert anyway, even though non-transactional engines can't roll back.

So if you insert a duplicate, THAT still generates an error on the second or subsequent row. It's not even consistent!

mysql> insert into t1 (ID,V) VALUES (10,1),(10,2);
ERROR 1062 (23000): Duplicate entry '10' for key 'PRIMARY'
mysql> select * from t1;
+----+---+
| ID | V |
+----+---+
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 0 |
| 10 | 1 |
+----+---+
5 rows in set (0.00 sec)

Of course if we use a transactional engine, it looks better:


mysql> ALTER TABLE t1 ENGINE=InnoDB;
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> insert into t1 (ID,V) VALUES (20,1),(20,2);
ERROR 1062 (23000): Duplicate entry '20' for key 'PRIMARY'
mysql> select * from t1 WHERE ID=20;
Empty set (0.00 sec)



But that won't change the behaviour of a multi-row insert with NULLs in invalid places.

This kind of stuff is nonsense and we need it to GO AWAY NOW.

How?

SET SQL_MODE='STRICT_ALL_TABLES'

And now every error is really an error. Yay! Why can't this be the default? (I know the answer; this is a rhetorical question)