Instrumentation Revolution

It’s long been good practice to include some sort of tracing in code to help with problems as and when they arise (and they will). And as maligned as simply dumping to stdout is, I’d rather see that than no trace at all. That said, numerous logging frameworks exist and there’s little excuse not to use one.

We have, though, got into the habit of disabling much of this valued output in order to preserve performance. This is understandable of course, as heavily used components or loops can chew up an awful lot of time and I/O writing out “Validating record 1 of 64000000. Validating record 2 of 64000000…” and so on. Useful, huh?

And we have various levels of output – debug, info, warning, fatal – and the ability to turn the output level up for specific classes or libraries. Cool.

But what we do in production is turn off anything below a warning, and when something goes wrong we scramble about, often under change control, to try and get some more data out of the system. And most of the time you need to add more debug statements to the code to get the data out that you want. Emergency code releases – aren’t they just great fun?

Let’s face it: it’s the 1970s and we are, on the whole, British Leyland, knocking up rust buckets which break down every few hundred miles for no reason at all.

British Leyland Princess

My parents had several of these and they were, without exception, shit.

One of the most significant leaps forward in the automotive industry over the past couple of decades has been the instrumentation of many parts of the car, along with complex electronic management systems to monitor and fine-tune performance. Now when you open the bonnet (hood) all you see is a large plastic box screaming “Do not open!”.


And if you look really carefully you might find a naff-looking SMART socket where an engineer can plug his computer in to get more data out. The car can tell him which bit is broken and probably talk him through the procedure to fix it…

Meanwhile, back in the IT industry…

It’s high time we applied some of the lessons from the failed ’70s automotive industry to the computer systems we build (and I don’t mean the unionised industries). Instrument your code!

For every piece of code, for every component part of your system, you need to ask, “what should I monitor?”. It should go without saying that you need to log exceptions when they’re raised but you should also consider logging:

  • Time-spent (in milli- or microseconds) for potentially slow operations (i.e. anything that goes over a network or has time-complexity risks).
  • Frequency of occurrence – Just log the event and let the monitoring tools do the work to calculate frequency.
  • Key events – Especially entry points into your application (web access logs are a good place to start), startup, shutdown etc. but also which code path requests went down.
  • Data – Recording specific parameters or configuration items etc. You do, though, need to be very careful about what you record, to avoid having any personal or sensitive data in log files – no passwords or card numbers etc…
  • Environment utilisation – CPU, memory, disk, network – necessary to know how badly you’re affecting the environment in which your code is hosted.
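As a minimal sketch of the time-spent and key-event points above (in Python; the component and function names here are hypothetical, not from any particular system):

```python
import logging
import time

logger = logging.getLogger("orders")  # hypothetical component name

def fetch_orders(customer_id):
    """Illustrative 'slow' operation: log the key event plus time spent."""
    start = time.perf_counter()
    orders = [{"id": 1}, {"id": 2}]  # stand-in for a call over the network
    elapsed_ms = (time.perf_counter() - start) * 1000
    # One line per event; let the monitoring tools derive frequency.
    logger.info("fetch_orders customer=%s count=%d elapsed_ms=%.3f",
                customer_id, len(orders), elapsed_ms)
    return orders
```

The point is that each log line carries the event, its parameters (sanitised) and its duration, so frequency and latency can be computed downstream rather than in the application.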

If you can scale the application horizontally you can probably afford the few microseconds it’s going to take to log the required data safely enough.

Then, once logged, you need to process and visualise this data. I would recommend decoupling your application from the monitoring infrastructure as much as possible by logging to local files or, if that’s not possible, by streaming it out asynchronously to somewhere (a queue, Amazon Kinesis etc.). By decoupling you keep the responsibilities clear and can vary either side without necessarily impacting the other.
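One way to get that asynchronous, decoupled behaviour – sketched here with Python’s standard-library QueueHandler/QueueListener; the file name is an arbitrary choice for illustration – is:

```python
import logging
import logging.handlers
import queue

# Decouple the application from slow log transport: the app writes to an
# in-memory queue and returns immediately; a background listener drains the
# queue to the real handler (a local file here, but it could be a stream).
log_queue = queue.Queue(-1)
queue_handler = logging.handlers.QueueHandler(log_queue)
file_handler = logging.FileHandler("app.log")  # illustrative local file
listener = logging.handlers.QueueListener(log_queue, file_handler)

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(queue_handler)

listener.start()
logger.info("startup complete")  # non-blocking; I/O happens off-thread
listener.stop()  # drains and flushes the queue on shutdown
```

Swapping the file handler for a network handler changes the transport without touching the application code, which is exactly the separation of responsibilities argued for above.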

You then need some agent to monitor the logged output and upload it to some repository, a repository to store the data and some means of analysing this data as and when required.


Using tools like Kibana, Elasticsearch and Logstash – all from Elastic – you can easily monitor files and visualise the data in pretty much real time. You can even do so in the development environment (I run Elasticsearch and Kibana on a Raspberry Pi 2, for example) to try to understand the behaviour of your code before you get anywhere near production.
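For illustration, a minimal Logstash pipeline along these lines might look like the following – the log path and Elasticsearch address are assumptions for the sketch, not a recommended production setup:

```
input {
  file {
    path => "/var/log/myapp/*.log"   # hypothetical application log files
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]      # assumes a local Elasticsearch node
  }
}
```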

So now when that production problem occurs you can see when the root event occurred and the impact it has across numerous components, without needing to go through change control to get more data out whilst the users suffer yet another IT failure. Once you know where to look, nine times out of ten the problem is as good as fixed. Dashboards can be set up to show the behaviour of the entire system at a glance, and you’ll soon find your eye gets used to the patterns and will pick up on changes quite easily if you’re watching the right things.

The final step is to automate the processing of this data, correlate it across components and act accordingly to optimise the solution and eventually self-heal. Feedback control.


With the costs of computing power falling and the costs of an outage rising you can’t afford not to know what’s going on. For now you may have to limit yourself to getting the data into your enterprise monitoring solution – something like Tivoli Monitoring – for operations to support. It’s a start…

Without the data we’re blind. It’s time we started to instrument our systems more thoroughly.

Password Trash

We know passwords aren’t great and that, given half the chance, people choose crappy short ones that are easily remembered. The solution to this seems to be to ask for at least one number, one upper-case character, one symbol and a minimum of 8 characters…

However, the majority of sites aren’t trustworthy*, so it’s foolish to use the same password on all of them. The result is an ever-mounting litter of passwords that you can’t remember, so you either end up writing them down (which likely violates terms of service and makes you liable in the event of abuse) or relying on “forgotten password” mechanisms to log in as and when needed (the main frustrations here being turnaround time and the need to come up with a new bloody password each time).

Yet using three or more words as a passphrase is more secure than a short forgettable password and would make a website a damn sight easier to use – though you still can’t use the same passphrase everywhere. It’s about time we made the minimum password length 16 characters and dismissed the cryptic garbage rules that don’t help anyone.
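A rough back-of-envelope comparison supports this. Assuming truly random choices, a 7776-word list (Diceware-style) and roughly 72 printable characters for the “complex” password alphabet – all assumptions for the sketch:

```python
import math

# Entropy of a secret drawn uniformly at random: length * log2(pool size).
def entropy_bits(pool_size, length):
    return length * math.log2(pool_size)

print(f"8 random complex chars: {entropy_bits(72, 8):.1f} bits")    # ~49.4
print(f"3 random words:         {entropy_bits(7776, 3):.1f} bits")  # ~38.8
print(f"4 random words:         {entropy_bits(7776, 4):.1f} bits")  # ~51.7
```

Four random words already beat eight fully random “complex” characters – and since humans never actually pick random characters, real-world 8-character passwords fall far below that ceiling, while words stay memorable. A 16-character minimum nudges people towards phrases naturally.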

Facebook, Google and the like would have you use their OpenID Connect services – this way they effectively own your online identity – and, if you do use them, the multi-factor authentication (MFA) options are well worth adopting. Personally I don’t want these guys in charge of my online identity (and most organisations won’t either), so whilst it’s OK to provide the option, you can’t force it on people.

We need to continue to support passwords but we need to stop with these daft counterproductive restrictions.
* Ok, none of them are but some I trust more than others. I certainly won’t trust some hokey website just because the developers provide the illusion of security by making my life more complicated than is necessary.

Jenkins on Raspberry Pi 2

I had an old Raspberry Pi which was fun to play with but really too underpowered to do much with. However, I recently took part in the BCS Women’s App-athon world record attempt, where I had a chat with a guy about Pis and running Minecraft, and he pointed out that the newer Pi 2 was quite capable of running MCServer (a C++ implementation of a Minecraft server).

But I wanted a Pi 2 for something more serious – something running Jenkins to manage the various bits of code I knock up. I’d tried this on the original Pi but it was unusable…

The install instructions from Random Code Solutions were OK, but the version installed via apt-get was old and slow and couldn’t easily be updated (a managed package). I removed this and installed Tomcat 7, but this bizarrely used Java 1.6 and similarly wasn’t easily changed…

Managed packages: they make life easy, but you’re dependent on them keeping things up to date.

So the process I finally used was:

1. Install Java 8, the Oracle version. This was already there on the version of Raspbian I was using.

2. Ensure this JVM is set as the default:

update-alternatives --config java

This should give you something like the image below. Select the version corresponding to Java 8…


3. Download Tomcat. Version 7.0.62 was the one I used.

cd /usr/local/bin
wget http://www.mirrorservice.org/sites/ftp.apache.org/tomcat/tomcat-7/v7.0.62/bin/apache-tomcat-7.0.62.tar.gz
tar -zxf apache-tomcat-7.0.62.tar.gz

4. Jenkins will complain about the URI charset on startup, so before installing it edit the connector in apache-tomcat-7.0.62/conf/server.xml to set URIEncoding to UTF-8:

<Connector port="8080" URIEncoding="UTF-8"/>

5. Start up Tomcat:

/usr/local/bin/apache-tomcat-7.0.62/bin/startup.sh

Check this is running by hitting http://{server}:8080/, where you should see something like the below:

Tomcat Successful Install

6. Download Jenkins (just grab the latest release, though at the time of writing the version I have is 1.617) and put it into the webapps folder under Tomcat. Or…

wget http://mirrors.jenkins-ci.org/war/latest/jenkins.war -O /usr/local/bin/apache-tomcat-7.0.62/webapps/jenkins.war

7. Tomcat will now deploy the WAR. This can take a little time; the easiest way to watch it is to run “top”, where you should see Java consuming 100% CPU. When this drops to 0% it’ll be done.

8. You should now be able to see Jenkins running at http://{server}:8080/jenkins/ (on my network it’s called raspberrypi2 – genius, huh!).

All done! Kind of: it’s best to update all plugins and restart, configure security, set access control and configure an email server for alerting.

A simple task runs ok and Jenkins is quite responsive when nothing else is running on the server…

I can run MCServer OK at the same time, but Jenkins gets slow when the world is in use. No other issues yet, though I suspect as I add more jobs it’ll run out of RAM. We shall see. Perhaps I’ll be buying another one of these Pi 2s; the wife will be pleased… 😉

Power to the People

Yesterday I received my usual gas and electricity bill from my supplier with the not so usual increase to my monthly direct debit of a nice round 100%! 100% on top of what is already more than I care for… joy!

What followed was the all too familiar vent-spleen / spit-feathers etc. before the situation was resolved by a very nice customer services representative who had clearly seen this before… humm..

So, as I do, I ponder darkly on how such a situation could have arisen. And as an IT guy, I ponder darkly about how said situation came about through IT (oh what a wicked web we weave)… Ok, so pure conjecture, but this lot have previous…

100%! What on earth convinced them to add 100%? Better still, what convinced them to add 100% when I was in fact in credit and they had just reimbursed me £20 as a result?…

Customer service rep: It’s the computers you see sir.

Me: The computers?

CSR: Well because they reimbursed you they altered your direct-debit amount.

Me: Yeah, ok, so they work out that I’m paying too much and reduce my monthly which would kind of make sense but it’s gone up! Up by 100%!

CSR: Well er yes. But I can fix that and change it back to what it was before…

Me: Yes please do!

CSR: Can I close the complaint now?

Me: Well, you can’t do much else can you? But really, you need to speak to your IT guys because this is just idiotic…

(more was said but you get the gist).

So, theories for how this came about:

  1. Some clever-dick specified a requirement that if they refund some money then they claw it back ASAP by increasing the monthly DD by 100%…
  2. A convoluted matrix exists which is virtually impossible to comprehend (a bit like their pricing structure), detailing how and when to apply various degrees of adjustment to DD amounts, with near-infinite paths that cannot be proven via any currently known mathematics on the face of this good earth.
  3. A defect exists somewhere in the code.

If 1 or 2 then sack the idiots who came up with such a complete mess – probably the same lot who dream up energy pricing models so a “win-win” as they say!

If 3 then, well, shit happens. Bugs happen. Defects exist; god knows I’ve been to root-cause of many…

It’s just that this isn’t the first time, or the second, or the third…

(start wavy dreamy lines)

The last time, they threatened to take me to court because I wouldn’t let their meter maid in – when they’d already been, and so hadn’t even tried again… And since they couldn’t reconcile two different “computer systems” properly it kept on bitching until it ratcheted up to that “sue the bastards” level. Nice.

(end wavy dreamy lines)

… and this is such a simple thing to test for. You’ve just got to pump in data for a bunch of test scenarios and look at the results – the same applies to that beastly matrix! Or you’ve got to do code reviews and look for the “if increase < 1.0 then surely it’s wrong, add 1.0 to make it right” or “increase by current + (1.0 * current)” sort of logic.
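To illustrate, a table-driven scenario test might look like this. The adjust_direct_debit function and its rule are entirely hypothetical, invented here just to show the shape of such a test:

```python
# Hypothetical sane rule: set the direct debit so that 12 payments cover
# the predicted annual cost. The scenarios then assert the obvious
# invariants a reviewer or tester should demand.
def adjust_direct_debit(current_dd, predicted_annual_cost):
    return round(predicted_annual_cost / 12, 2)

scenarios = [
    # (current DD, predicted annual cost, expected new DD)
    (100.00, 1200.00, 100.00),  # on track: no change
    (100.00,  960.00,  80.00),  # account in credit: DD falls
    (100.00, 1500.00, 125.00),  # under-paying: DD rises, but sanely
]

for current, annual, expected in scenarios:
    result = adjust_direct_debit(current, annual)
    assert result == expected, (current, annual, result)
    assert result <= current * 2  # the bug in question: never just double it
```

Pumping a table of inputs through the adjustment logic and checking the outputs is cheap; it is exactly the kind of test that would have caught a 100% hike on an account in credit.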

So: bad test practices and/or bad coding practices. Or I could also accept really bad architecture which can’t be implemented, really bad project management which just skips anything resembling best practice, or really bad leadership which doesn’t give a toss about the customer – and really bad operations and maintenance regardless, because they know they’ve got some pretty basic problems and yet clearly can’t help themselves to get them fixed.

It all comes down to money, and with practices like this they’ll be losing customers hand over fist (they do, apparently, have the worst customer satisfaction ratings).

They could look at agile, continuous integration, test automation, DevOps, TDD and BDD practices, as is all the rage, but they need to brush away the fairy dust that often accompanies these concepts (concepts I generally support, incidentally) and realise this does not mean you can abandon all sanity and give up on the basic principles of testing and coding!

If anything, these concepts weigh even more heavily on such fundamentals: more detailed tracking of delivery progress and performance, pair programming, reviewing test coverage, using tests to drive development, automating build and test as much as possible to improve consistency and quality, getting feedback from operational environments so you detect and resolve issues faster, continuous improvement and so on.

Computer systems are more complex and more depended on by society than ever before; they change at an ever-increasing rate, interact in a constantly changing environment with other systems, and are managed by disparate teams spread across the globe who come and go with the prevailing technological wind. Your customers and your business rely on them 100% to, at best, get by, and at worst, to exist. You cannot afford not to do it right!

I’m sure the issues are more complex than this and there are probably some institutionalised problems preventing efficient resolution, more’s the pity. But hey, off to search for a new energy provider…

Bluemuddle

I’m experimenting, or trying to, with IBM Bluemix virtual machines. It’s clearly beta (half the time I get the UI in Italian!?) and the documentation is woeful.

Simple VM created… What’s the connection string?

Launching Horizon (oddly, a separate site, but OK… beta…) says it’s:

ssh -i cloud.key <username>@<instance_ip>

Ok, I know the key, I know the IP, but what’s the username?

The docs helpfully note that “the username might be different depending on the image you launched”.

Yeah, ok, but what is it? It’s a standard IBM image (CentOS 7 in this case) so…

Nada (Spanish, so still not sure what’s with the Italian)! No documentation, no advice, a broken link in the VM docs… Stack Overflow has no questions under Bluemix usernames, but thankfully DW Answers does – though not specifically my question, rather others with sudo issues! Not a great way to find out… Perhaps they hand out that bit of info on the training course… Anyway, two days of whatever free period I get lost, and I can log in. Now I need to remember what I wanted to try out…

For the record it’s:

ibmcloud

Communication Breakdown

G+ Polls are really very useful for a quick, if not terribly scientific, survey, and I recently asked “How do you typically share solution designs?” on the IT Pros community, given the variety I see from day to day.


That presentations are top and models are bottom is sadly unsurprising. I’m in two minds over documents and wikis as effective forms of communication – the former quickly gather dust, though at least they provide a snapshot of what was intended at one point in time; the latter decay rapidly into a confusing contradiction of opinions in which the truth is a long-lost fairy tale.

But the really surprising thing for me was the number of votes for whiteboard + photos + email (and/or some online post). We all do it and it’s an excellent way to frame a discussion and share ideas. Does it really end there, though? Personally I need to take the output of these sessions and work it into something more cohesive and focused, which often yields insights that were not uncovered during the rabbit-hole exercise that whiteboarding can become.

Truth be said, I still hunger for a good model and the liberating constraints of UML. Unfortunately it seems I need to both improve my PowerPoint and drawing skills instead.

Electoral Load-balancing

I can’t help but think that I’d never get away with load-balancing resources while weighting requests as unequally as the British political system weights votes. No doubt the Tories got the most seats and should rightly be charged with forming a government. However, based on raw votes the distribution is unequal and the system is weighted in favour of the two main parties… and the emergent SNP! The table below is based on UK election results data from The Guardian, assuming you get a number of seats proportional to the number of votes cast.

Suppressing minorities or mitigating extremities? Or a bit of both… It should not, though, be for the incumbents (and the system is a product of them) to decide. Perhaps our electoral system could do with an updated load-balancing strategy.

Party                                     Actual Seats   Proportional Seats   Benefit of the System
Conservative                                       331                  240                     +91
Labour                                             232                  198                     +34
Scottish National Party                             56                   31                     +25
Liberal Democrat                                     8                   51                     -43
Democratic Unionist Party                            8                    4                      +4
Sinn Fein                                            4                    4                       0
Plaid Cymru                                          3                    4                      -1
Social Democratic and Labour Party                   3                    2                      +1
Ulster Unionist Party                                2                    2                       0
UK Independence Party                                1                   82                     -81
Green                                                1                   24                     -23
Independent                                          1                    2                      -1
Alliance                                             0                    1                      -1
Trade Unionist and Socialist Coalition               0                    1                      -1
Traditional Unionist Voice                           0                    0                       0
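The “Proportional Seats” column amounts to a largest-remainder allocation of seats by vote share. A sketch in Python – the party names and vote counts below are toy values for illustration, not the 2015 results:

```python
# Largest-remainder allocation: give each party the integer part of its
# proportional quota, then hand out the leftover seats to the parties with
# the largest fractional remainders.
def proportional_seats(votes, total_seats):
    total_votes = sum(votes.values())
    quotas = {p: v * total_seats / total_votes for p, v in votes.items()}
    seats = {p: int(q) for p, q in quotas.items()}
    leftover = total_seats - sum(seats.values())
    by_remainder = sorted(quotas, key=lambda p: quotas[p] - seats[p],
                          reverse=True)
    for p in by_remainder[:leftover]:
        seats[p] += 1
    return seats

toy = {"A": 500, "B": 300, "C": 200}
print(proportional_seats(toy, 10))  # -> {'A': 5, 'B': 3, 'C': 2}
```

Run against the real vote totals this reproduces the kind of redistribution the table shows: seats flow from the over-represented large parties to UKIP, the Greens and the Lib Dems.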

 

Woke up this morning…

… annoyed!

Been hearing some daft comments recently, like “cost is a secondary concern” and “we shouldn’t worry about cost”… wtf!?

You need a vision. You need a goal. You need to know what the hell it is you’re trying to achieve!

But! The most important thing once you have the vision is how the hell you’re going to get there. Time and money thus become primary concerns, and if you ain’t got the money or you ain’t got the time then you ain’t going there and you need to adjust your expectations!

Hell, I want to go to the moon for a vacation but that isn’t going to happen anytime soon.

And as Ursula K. Le Guin wrote (though it’s often attributed to Hemingway), “It is good to have an end to journey toward; but it is the journey that matters, in the end”.