Screen Shot 2015-09-15 at 21.38.05

BT Sport’s online player – which, being polite, is piss poor, with a UX seemingly designed by a six-year-old – is a fine example of how not to deal with errors in user interfaces. “User” being the key word here…

Rather than accepting that users are human beings in need of meaningful error messages – and perhaps in-situ advice about what to do – they insist on providing cryptic codes with no explanation, codes whose meaning you need to google (ok, I admit I ignored the FAQ!). This leads you to an appalling page, clearly knocked up by the six-year-old’s senior sibling on day one of code-club, littered with links you need to eyeball before you finally reach something useful telling you how to deal with the idiocy of BT’s design decisions.

In this particular case it’s the decision some halfwit architect made to require users to downgrade their security settings (e.g. VC002 or VC040)! So a national telecoms provider and ISP is insisting that users weaken their online security settings just so they can access a half-arsed service?…

Half-arsed because, when you do get it working, it performs abysmally, with the video quality of a 1996 RealAudio stream over a 28.8 kbps connection. This is likely because some mindless exec has decided they don’t want people watching on their laptops; they’d rather they use the “oh-god-please-don’t-get-me-started-on-how-bad-this-device-is” BT Vision box – I feel sick just thinking about it…

In non-functional terms:

  • Form – Fail
  • Performance – Fail
  • Security – Fail
  • Operability – Well, all I know is that it failed on day one of launch and I suspect it’s as solid as a house of cards behind the scenes. Let’s see what happens if the service falls over at Champions League final kick-off!

Success with non-functionals alone doesn’t guarantee success – you need a decent functional product and, let’s face it, Champions League football is pretty decent – but no matter how good the function, if it’s unusable, if it makes your eyes bleed, if it performs like a dog, if it’s insecure and if it’s not reliable then you’re going to fail! It’s actually pretty impressive that BT have managed to fail on (almost) every count! Right, off now to watch something completely different not supplied by BT… oh, except they’re still my ISP because – quelle surprise – that bit actually works!

Windows Update – Really? Still? In 2015?

I’d almost forgotten how ridiculous this is after a couple of years with OS X and Linux… It happened on the train this morning. Well, I’ll just look at the scenic view that is the south of England for the next half an hour then… Thanks Microsoft!


P.S. Some would claim this isn’t such a bad thing since the south of England can be beautiful; in this case, though, I had work to do.


The Irresponsible Architect

I was once told that we, as architects, should always do “the right thing”! What is “right” is of course debatable, and since you can take many different viewpoints – financial (the cheapest), time (the quickest), security (the most secure), etc. – you can often have the argument with yourself quite successfully without worrying about other people’s opinions.

But as architects it is often up to us to weigh the balance of concerns and seek a compromise which provides a reasonable solution from all viewpoints. Harder than it sounds, but that is what I choose to interpret as “the right thing”.

There may still be many competing solution options which appear viable but some of those are false. Demons placed in front of you to tempt you into taking that first – and fatal – bite of the apple… The reason? Incorrectly ascribed responsibilities.

It is often easy to say “I need to do a lookup here” or “I need to parse this bit of data there”, and it’s often quick (and cheap) to JFDI and stick some dirty piece of code in where it shouldn’t be. The code becomes more complicated than it needs to be as responsibilities become distributed throughout, and maintenance becomes harder and more costly as the days and years tick by. Eventually you’ll want to pour petrol over the machine and let the damn thing burn.

The same applies from an infrastructure perspective. Using a database server as a file server because, well, it’s accessible, or using your backup procedures as an archive because they’re kind of similar(!?), is wrong. Tomorrow someone will move the database and it’ll break, or your requirements for archiving will change and your backup solution will no longer be appropriate. Burn baby burn…

So before we sign off on the solution we have to ask “are the responsibilities of each component (logical and physical) well defined and reasonable?”, “are the dependencies and relationships to other components a natural (necessary) consequence of those responsibilities?” and “if I were to rip this component out and replace it with something else… would I rather immolate myself?”. If you answer “no” to either of the first two, or “yes” to the last, then it’s probably not “right”. Probably… not always. Sometimes you’ve just got to JFDI, sometimes you don’t care about tomorrow (throwaway code, temporary tin etc.) and sometimes, just sometimes, you’ll be wrong when you thought you were right (we’re all fallible… right?). Once you have a clear view of the components and their responsibilities, then you can worry about the precise implementation details…

And finally, if a higher authority overrules you then so long as you’ve explained the rationale, issues and implications clearly, it’s not your fault and you can sleep (or try to) with a clear conscience. Hey, you could be wrong!

So, as a big fan of keeping lists, for each component we need to define its (a filled-in example follows the list):

  • Responsibilities
  • Rationale
  • Issues and implications
  • Dependencies and relationships to other components
  • Implementation
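
By way of a purely made-up illustration – none of these names come from a real system, it’s just the checklist above applied to a hypothetical component – an entry might look something like:

# Hypothetical component write-up following the checklist above.
report_store = {
    "responsibilities": ["persist generated reports", "serve a report by id"],
    "rationale": "reports are write-once/read-many, so a blob store beats the transactional DB",
    "issues_implications": ["needs its own backup and retention policy",
                            "reads may be eventually consistent"],
    "dependencies": ["report-generator writes to it", "web-ui reads from it"],
    "implementation": "object storage behind a thin REST facade",
}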


Instrumentation Revolution

It’s long been good practice to include some sort of tracing in code to help with problems as and when they arise (and they will). And as maligned as simply dumping to stdout is, I would rather see that than no trace at all. However, numerous logging frameworks exist and there’s little excuse not to use one.

We have, though, got into the habit of disabling much of this valued output in order to preserve performance. This is understandable, of course, as heavily used components or loops can chew up an awful lot of time and I/O writing out “Validating record 1 of 64000000. Validating record 2 of 64000000…” and so on. Useful, huh?

And we have various levels of output – debug, info, warning, fatal – and the ability to turn the output level up for specific classes or libraries. Cool.
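
To make that concrete, here’s a minimal sketch using Python’s built-in logging module as a stand-in for whatever framework your stack favours (the logger names are invented for illustration):

import logging

# Warnings and above everywhere by default.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

# Turn the level up for just the component you care about,
# without drowning in output from everything else.
logging.getLogger("myapp.billing").setLevel(logging.DEBUG)

logging.getLogger("myapp.billing").debug("recalculating direct debit for account %s", "A-123")  # visible
logging.getLogger("myapp.web").debug("this stays silent")  # still filtered at WARNING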

But what we do in production is turn off anything below a warning, and when something goes wrong we scramble about – often under change control – to try and get some more data out of the system. And most of the time… you need to add more debug statements to the code to get the data out that you want. Emergency code releases, aren’t they just great fun?

Let’s face it, it’s the 1970s and we are, on the whole, British Leyland, knocking up rust buckets which break down every few hundred miles for no reason at all.

British Leyland Princess

My parents had several of these and they were, without exception, shit.

One of the most significant leaps forward in the automotive industry over the past couple of decades has been the instrumentation of many parts of the car, along with complex electronic management systems to monitor and fine-tune performance. Now when you open the bonnet (hood) all you see is a large plastic box screaming “Do not open!”.

Screen Shot 2015-07-11 at 11.12.07

And if you look really carefully you might find a naff-looking diagnostic socket where an engineer can plug his computer in to get more data out. The car can tell him which bit is broken and probably talk him through the procedure to fix it…

Meanwhile, back in the IT industry…

It’s high time we applied some of the lessons the automotive industry has learned since its failed ’70s to the computer systems we build (and I don’t mean the unionised industries). Instrument your code!

For every piece of code, for every component part of your system, you need to ask, “what should I monitor?”. It should go without saying that you need to log exceptions when they’re raised, but you should also consider logging (there’s a rough sketch after the list):

  • Time-spent (in milli or microseconds) for potentially slow operations (i.e. anything that goes over a network or has time-complexity risks).
  • Frequency of occurrence – Just log the event and let the monitoring tools do the work to calculate frequency.
  • Key events – Especially entry points into your application (web access logs are a good place to start), startup, shutdown etc. but also which code path requests went down.
  • Data – Recording specific parameters or configuration items etc. You do though need to be very careful here as to what you record to avoid having any personal or sensitive data in log files – no passwords or card numbers etc…
  • Environment utilisation – CPU, memory, disk, network – Necessary to know how badly you’re affecting the environment in which your code is homed.
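
To sketch the first few of those – the component and field names below are invented for illustration, not lifted from any particular system – timing a potentially slow call and logging the key events around it might look like:

import logging
import time

log = logging.getLogger("myapp.tariff")

def fetch_tariff(account_id):
    # Key event: entry point into this operation.
    log.info("tariff_lookup_started account=%s", account_id)
    start = time.perf_counter()
    tariff = {"standing_charge": 26.01, "unit_rate": 24.50}  # stand-in for the real remote call
    elapsed_us = (time.perf_counter() - start) * 1_000_000
    # Time-spent in microseconds plus the parameters that matter -
    # and nothing personal or sensitive.
    log.info("tariff_lookup_finished account=%s elapsed_us=%.0f", account_id, elapsed_us)
    return tariff

Let the monitoring tools work out frequency from the events themselves; the code just records that they happened.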

If you can scale the application horizontally you can probably afford the few microseconds it’s going to take to log the required data safely enough.

Then, once logged, you need to process and visualise this data. I would recommend decoupling your application from the monitoring infrastructure as much as possible by logging to local files or, if that’s not possible, streaming it out asynchronously to somewhere (a queue, Amazon Kinesis etc.). By decoupling you keep the responsibilities clear and can vary either without necessarily impacting the other.
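
One way to keep that decoupling – again just a sketch, using Python’s standard library rather than any particular product – is to hand records to a background thread which writes a local file for the shipping agent to pick up, so a slow disk or a wedged agent can’t stall the application:

import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue = queue.SimpleQueue()
file_handler = logging.FileHandler("/var/log/myapp/app.log")  # path is illustrative
file_handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

listener = QueueListener(log_queue, file_handler)
listener.start()  # background thread drains the queue to the local file

root = logging.getLogger()
root.addHandler(QueueHandler(log_queue))  # the app only ever touches the in-memory queue
root.setLevel(logging.INFO)
logging.getLogger("myapp").info("startup complete")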

You then need some agent to monitor the logged output and upload it to some repository, a repository to store the data and some means of analysing this data as and when required.

Screen Shot 2015-07-11 at 11.39.04

Using tools like Kibana, Elasticsearch and Logstash – all from Elastic – you can easily monitor files and visualise the data in pretty much real time. You can even do so in the development environment (I run Elasticsearch and Kibana on a Raspberry Pi 2, for example) to try to understand the behaviour of your code before you get anywhere near production.

So now when that production problem occurs you can see when the root event occurred and the impact it has across numerous components, without needing to go through change control to get more data out whilst the users suffer yet another IT failure. Once you know where to look, the problem is nine times out of ten as good as fixed. Dashboards can be set up to show the behaviour of the entire system at a glance, and you’ll soon find your eye gets used to the patterns and will pick up on changes quite easily if you’re watching the right things.

The final step is to automate the processing of this data, correlate it across components and act accordingly to optimise the solution and eventually self-heal. Feedback control.

Screen Shot 2015-07-11 at 12.18.38


With the costs of computing power falling and the costs of an outage rising you can’t afford not to know what’s going on. For now you may have to limit yourself to getting the data into your enterprise monitoring solution – something like Tivoli Monitoring – for operations to support. It’s a start…

Without the data we’re blind. It’s time we started to instrument our systems more thoroughly.

Password Trash

We know passwords aren’t great and that people choose crappy short ones that are easily remembered given half the chance. The usual solution seems to be to demand at least one number, one upper-case character, one symbol and a minimum of 8 characters…

However, the majority of sites aren’t trustworthy* so it’s foolish to use the same password on all of them. The result is an ever-mounting litter of passwords that you can’t remember, so you either end up writing them down (which likely violates terms of service and makes you liable in the event of abuse) or relying on “forgotten password” mechanisms to log in as and when needed (the main frustrations here being turnaround time and the need to come up with a new bloody password each time).

Yet using three or more words as a passphrase is more secure than a short forgettable password and would make a website a damn sight easier to use – you still can’t use the same password everywhere though. It’s about time we started making the minimum password length 16 characters and dismissing the crypto-garbage rules that don’t help anyone.
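
The back-of-an-envelope arithmetic backs this up. A quick sketch – with the caveat that it assumes characters and words are chosen uniformly at random, which flatters the short password far more than it does the passphrase:

import math

complex_8 = 72 ** 8     # 8 characters drawn from ~72 letters, digits and symbols
words_3   = 7776 ** 3   # 3 words from a 7,776-word Diceware-style list
words_4   = 7776 ** 4   # 4 words from the same list

for label, keyspace in [("8-char complex password", complex_8),
                        ("3 random words", words_3),
                        ("4 random words", words_4)]:
    print(f"{label}: ~{math.log2(keyspace):.0f} bits")

# 8-char complex password: ~49 bits
# 3 random words: ~39 bits
# 4 random words: ~52 bits

And that ~49 bits is the theoretical best case; user-chosen “P@ssw0rd1!” style passwords sit far below it, which is why a long, memorable phrase wins in practice.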

Facebook, Google and the like would have you use their OpenID Connect services – this way they effectively own your online identity – and, if you do use them, the multi-factor authentication (MFA) options are well worth adopting. Personally I don’t want these guys to be in charge of my online identity though (and most organisations won’t either), so whilst it’s fine to provide the option you can’t force it on people.

We need to continue to support passwords but we need to stop with these daft counterproductive restrictions.
* Ok, none of them are but some I trust more than others. I certainly won’t trust some hokey website just because the developers provide the illusion of security by making my life more complicated than is necessary.

Jenkins on Raspberry Pi 2

I had an old Raspberry Pi which was fun to play with but really too underpowered to do much with. However, I recently took part in the BCSWomen App-athon world record attempt, where I had a chat with a guy about Pis and running Minecraft, and he pointed out that the newer Pi 2 was quite capable of running MCServer (a C++ implementation of a Minecraft server).

But I wanted a Pi 2 for something more serious – something running Jenkins to manage the various bits of code I knock up. I’d tried this on the original Pi but it was unusable…

The install instructions from Random Code Solutions were OK, but the version installed via apt-get was old and slow and couldn’t easily be updated (managed package). I removed this and installed Tomcat 7, but that bizarrely used Java 1.6 and similarly wasn’t easily changed…

Managed packages: they make life easy, but you’re dependent on the maintainers keeping things up to date.

So the process I finally used was:

1. Install Java 8, the Oracle version. This was already there on the version of Raspbian I was using.

2. Ensure this JVM is set as the default:

update-alternatives --config java

This should give you something like the image below. Select the version corresponding to Java 8…

update-alternatives --config java

3. Download Tomcat and unpack it under /usr/local/bin. Version 7.0.62 was the one I used:

cd /usr/local/bin
wget https://archive.apache.org/dist/tomcat/tomcat-7/v7.0.62/bin/apache-tomcat-7.0.62.tar.gz
tar -zxf apache-tomcat-7.0.62.tar.gz

4. Jenkins will complain about the URI charset on startup, so before deploying it edit apache-tomcat-7.0.62/conf/server.xml and set URIEncoding to UTF-8 on the HTTP connector:

<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8"/>

5. Start Tomcat using the stock startup script:

/usr/local/bin/apache-tomcat-7.0.62/bin/startup.sh

Check it’s running by hitting the server at http://{server}:8080/ and you should see something like the below:

Tomcat Successful Install

6. Download Jenkins (just grab the latest Jenkins release – at the time of writing the version I have is 1.617) and put this into the webapps folder under Tomcat. Or…

wget -O /usr/local/bin/apache-tomcat-7.0.62/webapps/jenkins.war http://mirrors.jenkins-ci.org/war/latest/jenkins.war

7. Tomcat will now deploy the WAR. This can take a little time and the easiest way to see it running is to run “top” and you should see Java consuming 100% CPU. When this drops to 0% it’ll be done.

8. You should now be able to see Jenkins running at http://{server}:8080/jenkins/ (on my network the server is called raspberrypi2 – genius huh!).

All done! Kind of – it’s best to update all plugins and restart, configure security, set up access control and configure an email server for alerting.

A simple task runs ok and Jenkins is quite responsive when nothing else is running on the server…

I can run MCServer OK at the same time, but Jenkins gets slow when the world is in use. Otherwise no issues yet, though I suspect as I add more jobs it’ll probably run out of RAM. We shall see. Perhaps I’ll be buying another one of these Pi 2s; the wife will be pleased… 😉

Power to the People

Yesterday I received my usual gas and electricity bill from my supplier, with the not-so-usual increase to my monthly direct debit of a nice round 100%! 100% on top of what is already more than I care for… joy!

What followed was the all-too-familiar venting of spleen and spitting of feathers before the situation was resolved by a very nice customer services representative who had clearly seen this before… hmm…

So, as I do, I ponder darkly on how such a situation could have arisen. And as an IT guy, I ponder darkly about how said situation came about through IT (oh what a wicked web we weave)… Ok, so pure conjecture, but this lot have previous…

100%! What on earth convinced them to add 100%? Better still, what convinced them to add 100% when I was in fact in credit and they had just reimbursed me £20 as a result?…

Customer service rep: It’s the computers you see sir.

Me: The computers?

CSR: Well because they reimbursed you they altered your direct-debit amount.

Me: Yeah, ok, so they work out that I’m paying too much and reduce my monthly which would kind of make sense but it’s gone up! Up by 100%!

CSR: Well er yes. But I can fix that and change it back to what it was before…

Me: Yes please do!

CSR: Can I close the complaint now?

Me: Well, you can’t do much else can you? But really, you need to speak to your IT guys because this is just idiotic…

(more was said but you get the gist).

So, theories for how this came about:

  1. Some clever-dick specified a requirement that if they refund some money then claw it back ASAP by increasing the monthly DD by 100%!..
  2. A convoluted matrix exists – a bit like their pricing structure – which is virtually impossible to comprehend; it details how and when to apply various degrees of adjustment to DD amounts and has near-infinite paths that cannot be proven via any currently known mathematics on the face of this good earth.
  3. A defect exists somewhere in the code.

If 1 or 2 then sack the idiots who came up with such a complete mess – probably the same lot who dream up energy pricing models so a “win-win” as they say!

If 3 then, well, shit happens. Bugs happen. Defects exist; god knows I’ve dug down to the root cause of many…

It’s just that this isn’t the first time, or the second, or the third…

(start wavy dreamy lines)

The last time, they threatened to take me to court because I wouldn’t let their meter maid in when they’d already been and so hadn’t even tried again… and since they couldn’t reconcile two different “computer systems” properly it kept on bitching until it ratcheted up to that “sue the bastards” level. Nice.

(end wavy dreamy lines)

… and this is such a simple thing to test for. You’ve just got to pump in data for a bunch of test scenarios and look at the results – the same applies to that beastly matrix! Or you’ve got to do code reviews and spot the suspect logic – the “if the increase is less than 1.0 it must be wrong, so add 1.0 to make it right” kind of thinking, or a formula that quietly works out as current + (1.0 × current) and doubles the payment. Something like the sketch below would have caught it.
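
By way of illustration only – this is not their code, and both the function and the twelve-month spreading rule are invented – the sort of table-driven check that would catch a 100% hike on an account in credit is trivial to write:

def adjust_direct_debit(current_monthly, account_balance):
    # Hypothetical recalculation: spread any shortfall over the next twelve payments.
    shortfall = max(-account_balance, 0.0)  # negative balance = customer owes money
    return round(current_monthly + shortfall / 12, 2)

def test_account_in_credit_does_not_pay_more():
    # In credit (e.g. just refunded £20) - the monthly amount must not go up.
    assert adjust_direct_debit(100.00, 20.00) <= 100.00

def test_small_shortfall_is_spread_not_doubled():
    assert adjust_direct_debit(100.00, -60.00) == 105.00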

So: bad test practices and/or bad coding practices – or, I could also accept, really bad architecture which can’t be implemented, really bad project management which skips anything resembling best practice, or really bad leadership which doesn’t give a toss about the customer – and really bad operations and maintenance regardless, because they know they’ve got some pretty basic problems and yet they clearly can’t get them fixed.

It all comes down to money, and with practices like this they’ll be losing customers hand over fist (they do have the worst customer satisfaction ratings, apparently).

They could look at agile, continuous integration, test automation, DevOps, TDD and BDD practices, as is all the rage, but they need to brush away the fairy dust that often accompanies these concepts (concepts I generally support, incidentally) and realise this does not mean you can abandon all sanity and give up on the basic principles of testing and coding!

If anything these concepts weigh even more heavily on such fundamentals – more detailed tracking of delivery progress and performance, pair-programming, reviewing test coverage, using tests to drive development, automating build and testing as much as can be to improve consistency and quality, getting feedback from operational environments so you detect and resolve issues faster, continuous improvement and so on.

Computer systems are more complex and more depended on by society than ever before; they change at an ever-increasing rate, interact with other systems in a constantly changing environment and are managed by disparate teams spread across the globe who come and go with the prevailing technological wind. Your customers and your business rely on them 100% to, at best, get by, and at worst, to exist. You cannot afford not to do it right!

I’m sure the issues are more complex than this and there are probably some institutionalised problems preventing efficient resolution, more’s the pity. But hey, off to search for a new energy provider…


I’m experimenting – or trying to – with IBM Bluemix virtual machines. It’s clearly beta – half the time I get a UI in Italian (!?) – and the documentation is woeful.

Simple VM created… What’s the connection string?

Launching Horizon (oddly a separate site, but OK… beta…) says it’s:

ssh -i cloud.key <username>@<instance_ip>

Ok, I know the key, I know the IP, but what’s the username?

(the username might be different depending on the image you launched):

Yeah, ok, but what is it? It’s a standard IBM image (CentOS 7 in this case) so…

Nada (Spanish, so still not sure what’s with the Italian)! No documentation, no advice, a broken link in the VM docs… Stack Overflow has no questions about Bluemix usernames… but thankfully dW Answers does – though not specifically my question; rather, others with sudo issues! Not a great way to find out… Perhaps they hand out that bit of info on the training course… Anyway, two days of whatever free period I get lost, but I can log in. Now I need to remember what I wanted to try out…

For the record it’s: