Artificial Intelligence : OpenAI - Analytics, Open Data and A Few Simple Prompts

As I mentioned in my Artificial Intelligence : OpenAI : Data Use and Privacy post, a key consideration when feeding data and dialogue to a third-party GenAI provider like OpenAI’s ChatGPT is data privacy. While investigating what you can do with the toolset - or simply if the data you need is available under an open license - an excellent way to start is to use Open Data. It’s worth noting that just because data is found in a public location does not mean that it is Open Data. Before using the data in this way make sure to check the publisher’s license. A good source of open data is public government data - for example as published on sites like: ...

Artificial Intelligence - OpenAI - Data Use and Privacy

This begins a series of articles / code-snippets / thoughts regarding use of Generative Artificial Intelligence (GenAI) for data analytics - starting out primarily using OpenAI. This is not a broader discussion of uses of GenAI (though they’re kinda fun as well, and I’ll probably write about chat and API-style usage in the future). In particular I’m looking at OpenAI’s ChatGPT Plus, which (as of the date of writing) is the paid subscription option of ChatGPT and allows access to additional functionality: ...

Migrating git to svn (subversion)

I’ve found that most documentation / forum discussion around the web for this topic tends to be about migrating svn to git - so here’s a quick shell script (reconfigure the variables at the start) to help migrate from git to subversion. This script is also available here: migrate-git-to-svn.sh.txt

#!/bin/bash
# Run this script from an empty working directory - it will:
# 1. git clone from the ORIGIN_GIT URL
# 2. run through a number of stages to construct a local svn repository
# 3. check out the svn repo for you to check
#
# NO error checking is done - you need to look at the output and
# look for any issues. This script DOES delete its working directories
# on each run (so make sure to start in an empty directory to be safe!)

# Configuration
PROJECT_NAME=MyProjectName
ORIGIN_GIT="git@github.com:UserName/MyProjectName.git"

# Keep track of starting directory to make working in sub-directories easier
BASEDIR=$PWD

# Clone (bare) the main repo from the origin git repo
# Bare because later stages want to talk only to a bare repo
echo "### Cloning git origin into local bare repo: ${PROJECT_NAME}.git"
if [ -d "${PROJECT_NAME}.git" ] ; then rm -rf "${PROJECT_NAME}.git" ; fi
git clone --bare "${ORIGIN_GIT}"

# Protect the real origin by removing it as a remote in the bare clone
echo "### Protect real origin by removing it as a remote in our clone"
( cd "${PROJECT_NAME}.git"; \
  git remote remove origin; \
)

# Create an empty svn repository to migrate into
echo "### Create and initialise the target svn repository for the migration: ${PROJECT_NAME}.svnrepo"
if [ -d "${PROJECT_NAME}.svnrepo" ] ; then rm -rf "${PROJECT_NAME}.svnrepo" ; fi
mkdir "${PROJECT_NAME}.svnrepo"
svnadmin create "${PROJECT_NAME}.svnrepo"
svn mkdir --parents "file://${BASEDIR}/${PROJECT_NAME}.svnrepo/${PROJECT_NAME}/"{trunk,branches,tags} -m 'Initialise empty svn repo'

# git svn (NOTE svn mode - needs the git-svn package installed on debian)
# Clone the new local svn repository into a git repo
# The --stdlayout option tells "git svn" that we are using the "standard" {trunk,branches,tags} directories
echo "### git-svn clone the target svn repo as a git directory (used to import from git and then export to svn): ${PROJECT_NAME}-git2svn"
if [ -d "${PROJECT_NAME}-git2svn" ] ; then rm -rf "${PROJECT_NAME}-git2svn" ; fi
git svn clone "file://${BASEDIR}/${PROJECT_NAME}.svnrepo/${PROJECT_NAME}" --stdlayout "${PROJECT_NAME}-git2svn"

# Set up the bare git clone as the origin for the "${PROJECT_NAME}-git2svn" clone
echo "### Add our git clone as the remote origin for ${PROJECT_NAME}-git2svn"
( cd "${PROJECT_NAME}-git2svn"; \
  git remote add origin "file://${BASEDIR}/${PROJECT_NAME}.git"; \
)

# Import changes into an import branch in the "${PROJECT_NAME}-git2svn" clone and then export to svn
# Note:
# 1. git fetch first to get branch details
# 2. Then branch to an import branch tracking the remote origin/main
# 3. Rebase that onto master (rebase --root commits all reachable, allows to rebase the root commits)
#    This builds the information needed to sync to svn via dcommit.
# 4. Then use git svn dcommit - include author information (to help track who made changes)
echo "### Import full commit history into ${PROJECT_NAME}-git2svn and then send to subversion repo"
( cd "${PROJECT_NAME}-git2svn"; \
  git fetch origin; \
  git checkout -b import origin/main; \
  git rebase --onto master --root; \
  git svn dcommit --add-author-from ; \
)

# Checkout a svn working dir to check the export
echo "### Checking out a working svn directory to check the results: svn-check"
if [ -d svn-check ] ; then rm -rf svn-check ; fi
svn co "file://${BASEDIR}/${PROJECT_NAME}.svnrepo/${PROJECT_NAME}" svn-check

echo "Check the contents/log in svn-check/"

qemu - simplest command-line for a performant VM

TL;DR You need to tell qemu to: use the local host CPU (rather than emulating one) and enable hypervisor mode, include a couple of CPU cores for performance, give it a hard disk and some memory and finally (after setting up a bridge network device on your host system) use a bridged network with a MAC address unique to your network. The final bit about networking requires root access - hence the sudo: ...
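The full command is in the post; as a rough sketch of how those pieces fit together (assuming an x86_64 host with KVM available, a qcow2 disk image at disk.qcow2, and a host bridge named br0 already set up - all of those names are placeholders):

# -cpu host + -enable-kvm: use the host CPU with hypervisor (KVM) mode
# -smp 2: a couple of CPU cores; -m 4G: some memory
# -nic bridge: bridged networking via br0, with a MAC chosen to be unique on your network
sudo qemu-system-x86_64 \
    -cpu host -enable-kvm \
    -smp 2 -m 4G \
    -drive file=disk.qcow2,format=qcow2 \
    -nic bridge,br=br0,mac=52:54:00:12:34:56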

June 14, 2023 · 3 min · 545 words · Brad

* (asterisk intended)

This is a sentence*****. (I promise it’s safe to click up there - you’ll enjoy it.)

Increasing / decreasing number of xargs parallel processes (at run time!)

xargs makes it very easy to quickly run a set of similar processes in parallel - but did you know when you’re half-way through a long list of tasks it’s possible to change the number of parallel processes that are being used? It’s there in the man page under “-P max-procs, --max-procs=max-procs” but it’s an easy feature to miss if you don’t read all the way through:

-P max-procs, --max-procs=max-procs
    Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option or the -L option with -P; otherwise chances are that only one exec will be done. While xargs is running, you can send its process a SIGUSR1 signal to increase the number of commands to run simultaneously, or a SIGUSR2 to decrease the number. You cannot increase it above an implementation-defined limit (which is shown with --show-limits). You cannot decrease it below 1. xargs never terminates its commands; when asked to decrease, it merely waits for more than one existing command to terminate before starting another.

    Please note that it is up to the called processes to properly manage parallel access to shared resources. For example, if more than one of them tries to print to stdout, the output will be produced in an indeterminate order (and very likely mixed up) unless the processes collaborate in some way to prevent this. Using some kind of locking scheme is one way to prevent such problems. In general, using a locking scheme will help ensure correct output but reduce performance. If you don't want to tolerate the performance difference, simply arrange for each process to produce a separate output file (or otherwise use separate resources).

What does that look like? Spin up some slow processes and start with 3-way parallel execution: ...
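A minimal sketch of the idea (the sleep jobs and the job count here are just placeholders; $! works because xargs is the last command in the backgrounded pipeline):

# Ten slow jobs, initially three at a time
seq 10 | xargs -n 1 -P 3 sh -c 'echo "start $0"; sleep 30; echo "done $0"' &

# Part-way through: allow one more simultaneous job...
kill -USR1 $!
# ...or wind the parallelism back down by one
kill -USR2 $!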

Vim sub-replace-special - ampersands in substitute search/replace

Curious… While editing in vim you want to search and replace including a sub-string with an ampersand (&) - this doesn’t have any special regular expression meaning, but given the input:

&lt;foo&gt;

And the search/replace (changing “foo” to “bar”):

:s/&lt;foo&gt;/&lt;bar&gt;

The result is:

&lt;foo&gt;lt;bar&lt;foo&gt;gt;

That looks… unexpected? Well, at least, undesired! What’s going on? Reading up on vim’s substitute command, we find a section on sub-replace-special where we find:

magic   nomagic   action
&       \&        replaced with the whole matched pattern
\&      &         replaced with &

Where magic is enabled by default - so what’s happening is this: on the right-hand side of the substitute (ie the output side) a non-escaped ampersand will be replaced by the whole matched pattern, so the output suddenly makes sense (even though it’s still unwanted). Importantly, note the ampersand on the left-hand side is ok un-escaped, as you would usually expect for a regular expression. In this particular case it looks like a complete mess because of having multiple ampersands on the right-hand side, but it now makes sense (where I’ve shown the whole matched pattern in the output in bold for the two ampersands in the output): ...
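A quick sketch of the fix, following the table above: escaping each right-hand-side ampersand as \& makes it a literal ampersand again, so

:s/&lt;foo&gt;/\&lt;bar\&gt;

produces the replacement that was actually wanted:

&lt;bar&gt;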

stdbuf - Run COMMAND, with modified buffering operations for its standard streams

While piping together commands that only output intermittently we run into the pipe buffers created by the pipe() system call (also see the overview of pipes and FIFOs). This particularly comes into play when stringing together multiple pipes in a row (as there are multiple buffers to pass through). For example, in the command below “tail -f” flushes on activity and awk flushes on output, but the grep in the middle ends up writing to a buffered pipe - so a quiet access.log will result in long delays before updates are shown: ...
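A sketch of the symptom and the stdbuf fix (access.log and the POST filter are placeholders for illustration):

# Laggy: grep block-buffers its stdout when writing to a pipe
tail -f access.log | grep POST | awk '{print $1}'

# Better: stdbuf -oL forces grep's stdout to be line-buffered
tail -f access.log | stdbuf -oL grep POST | awk '{print $1}'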

Disk Usage

To review disk usage recursively - a few different options exist (for when scanning through manually with df and du is not enough). I have found ncdu to be fast and very easy to use. I’ve also used durep from time to time. For a desktop system (or a server with an X server handy) a few options exist. Some support remote scanning, though this can be slow and problematic as a network connection is required for the duration of the scan: ...
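For example, an ncdu sketch (the paths and host are placeholders - check your installed version’s man page for the export/import flags):

# Scan one filesystem interactively, without crossing mount points
ncdu -x /

# One approach to remote scanning: export on the remote host, browse locally
ssh user@remotehost 'ncdu -o- /var' | ncdu -f-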

md (software RAID) and lvm (logical volume management)

md

Building a RAID array using mdadm - two primary steps:

“mdadm --create” to build the array using available resources
“mdadm --detail --scan” to build the config string for /etc/mdadm/mdadm.conf

Simple examples:

RAID6 array

Set up the partitions to be used (in this case the whole disk):

# for x in /dev/sd{b,c,d,e,f}1 ; do fdisk $x ; done

Create the array (in this case, with one hot-spare):

# mdadm --create /dev/md0 --level=6 --raid-devices=4 --spare-devices=1 /dev/sd{b,c,d,e,f}1

Configure the array for reboot (append to the end of /etc/mdadm/mdadm.conf):

# mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 spares=1 name=debian6-vm:0 UUID=9b42abcd:309fabcd:6bfbabcd:298dabcd

One consideration when setting up the partitions is that any replacement disk will need to support that same partition size. Unconfirmed, but it sounds like a reasonable concern: “Enter a value smaller than the free space value minus 2% or the disk size to make sure that when you will later install a new disk in replacement of a failed one, you will have at least the same capacity even if the number of cylinders is different.” (http://www.jerryweb.org/settings/raid/) ...
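Not shown in the excerpt, but worth doing between the create and scan steps: watch the initial array build/resync before relying on it (a sketch, using the device names above):

# Watch the initial resync/build progress
cat /proc/mdstat

# Full status report for the new array
mdadm --detail /dev/md0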

October 13, 2020 · 8 min · 1638 words · Brad

Supporting old Debian distros

For old servers that need to stay that way (for whatever reason), updates are no longer available - but you can still access the packages that were available for that distro by pointing apt at the archive. For example, for lenny:

deb http://archive.debian.org/debian/ lenny contrib main non-free

And for ubuntu:

deb http://old-releases.ubuntu.com/ubuntu/ natty main restricted universe multiverse
deb http://old-releases.ubuntu.com/ubuntu/ natty-updates main restricted universe multiverse
deb http://old-releases.ubuntu.com/ubuntu/ natty-security main restricted universe multiverse
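One hedged footnote (not from the post): the Release files on these archives have long since expired, so more recent apt versions may refuse to update until you disable the validity check:

# Skip the Release file expiry check for archived distros
apt-get -o Acquire::Check-Valid-Until=false update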

October 13, 2020 · 1 min · 77 words · Brad

A quick start for Python decorators

Synopsis

#!/usr/bin/env python3

import shutil

# Decorator which pre-checks the space in /tmp
# and throws an exception if the space is more than
# 50% used
def check_disk_space(check_path, threshold_percent):
    def inner_dec(f):
        def wrapper(*args, **kwargs):
            du = shutil.disk_usage(check_path)
            used_pct = (du.used / du.total) * 100
            if used_pct >= threshold_percent:
                raise Exception(f"Aborting call - {check_path} is >{threshold_percent} (={used_pct}) full")
            return f(*args, **kwargs)
        return wrapper
    return inner_dec

# Build another pre-set decorator
def check_tmp_over_50(f):
    return check_disk_space("/tmp", 50)(f)

# Use the decorator on some function that
# might need /tmp space
@check_disk_space('/tmp', 50)
def foo(a, b, c):
    print("Able to run foo - must have been disk space")

@check_tmp_over_50
def bar(a, b, c):
    print("Able to run bar - must have been disk space")

if __name__ == '__main__':
    try:
        foo(1,2,3)
        bar(1,2,3)
    except Exception as e:
        print(f'foo aborted with: {e}')

Getting Started

Decorator syntax and usage isn’t all that complicated - but at the moment you won’t find any help from the Python Tutorial (decorators aren’t mentioned in Defining Functions, nor in More on Defining Functions) and the Python Language Reference only really touches on the existence of decorators without much in the way of a detailed description in the Function definitions and Class definitions sections. In simplest terms - a decorator is a function which takes a function and returns another function (usually one which will wrap the call to the initial function, though that is not guaranteed and is a developer choice!). The Synopsis above demonstrates the two main patterns: ...

PyCon(line) AU 2020 Rube Codeberg competition

This is for fun, for silliness and for do not use anywhere-ness - or to quote the instructions: If you’ve ever wondered just how egregious your use of Python could be, or how unnecessarily elaborately you could engineer a simple piece of code, we want your entries in our Rube Codeberg competition! Named for the famously complicated machines drawn by American cartoonist Rube Goldberg, the Codeberg competition is a chance for you to showcase your creativity. Everyone is welcome to participate in the competition - or just tune in as the results are announced. You may find you learn a thing or two about how Python ticks along the way. ...

A tale of two burst balances (AWS EC2 and EBS performance)

When using and monitoring AWS EC2 instances and their attached EBS volumes there are a couple of very important metrics to keep an eye on which can have enormous performance and availability implications. In particular I’m writing about General Purpose EC2 instances running in standard mode (eg. T2, T3 and T3a at the time of writing) and General Purpose EBS (gp2) volumes, because these are very common low-mid spec instances and the default disk storage type. If you’re already using one of the other EC2 or EBS types, chances are you’re already aware of some of the issues I’ll discuss below, and those other products are designed to manage CPU and disk resources in a different, load-targeted way. These metrics are important because they report on the way these AWS resources (otherwise designed to mimic real hardware) operate in a very different way. Note that while CPU credit reports are shown in the AWS web console for an EC2 instance under its Monitoring tab (and so people tend to see them), the EBS credit reports are not. To see these you need to find the EBS volume(s) attached to an EC2 instance (linked from the EC2 Description tab) and then look at the Monitoring tab for each EBS volume. ...
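The same numbers can also be pulled straight from CloudWatch - a sketch (the volume id and times are placeholders; BurstBalance is the gp2 metric, and CPUCreditBalance in the AWS/EC2 namespace is its instance-side counterpart):

aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --start-time 2020-09-01T00:00:00Z --end-time 2020-09-02T00:00:00Z \
    --period 300 --statistics Minimum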

September 4, 2020 · 7 min · 1303 words · Brad

Adding tasks to a background screen

A bunch of processes have failed - and you’d like to restart them in a screen session in case you need to rerun them in interactive shells (for instance to answer prompts from the processes) - lots of Ctrl-A-C … start command … Ctrl-A-S … name the window … and repeat. There has to be an easier way! Step 1: Create a background screen session to hold the runs. This will open a new screen session named “ScreenSessionName” into the background (so you don’t need to Ctrl-A-d): ...
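A sketch of the idea (the session name is from the post; "some-failed-command" is a placeholder - -d -m starts the session already detached, and -X injects a command into it):

# Step 1: start a detached (background) screen session to hold the runs
screen -d -m -S ScreenSessionName

# Later steps can then add new windows to it without attaching, along the lines of:
screen -S ScreenSessionName -X screen bash -c 'some-failed-command; exec bash'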

September 3, 2020 · 1 min · 184 words · Brad

perl oct('0b...') to interpret binary strings

This is really a quick reminder about a perl function which does a little more than you’d perhaps expect. Need to convert a binary (or hex or octal) string to an integer? The perl documentation for the oct(EXPR) function starts out with: Interprets EXPR as an octal string and returns the corresponding value. Then includes the comment: (If EXPR happens to start off with 0x, interprets it as a hex string. If EXPR starts off with 0b, it is interpreted as a binary string. Leading whitespace is ignored in all three cases.) ...
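A quick sketch of all the cases:

#!/usr/bin/perl
use strict;
use warnings;

print oct('0b1101'), "\n";   # 13 - binary
print oct('0x1F'),   "\n";   # 31 - hex
print oct('017'),    "\n";   # 15 - octal (explicit 0 prefix)
print oct('21'),     "\n";   # 17 - no prefix: still treated as octal!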

single quote characters in a single-quoted string in shells

A very quick and simple comment on building single-quoted strings in shell scripts which include single quotes. Note that it’s not possible to include a single quote in a single-quoted string - for example the bash man page: Enclosing characters in single quotes preserves the literal value of each character within the quotes. A single quote may not occur between single quotes, even when preceded by a backslash. And dash: Enclosing characters in single quotes preserves the literal meaning of all the characters (except single quotes, making it impossible to put single-quotes in a single-quoted string). ...
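The usual workaround (a quick sketch) is to close the single-quoted string, add a backslash-escaped quote, and reopen it - so '\'' reads as: end quote, literal quote, start quote:

$ echo 'It'\''s alive!'
It's alive!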

LINES and COLUMNS environment magic

Ever wondered why you can read the $LINES and $COLUMNS environment variables from your text shell and have them seemingly aware (or indeed, actually aware) of the size of the graphical terminal in which that shell is running? Enter SIGWINCH - a signal sent to processes when a window size changes. This signal causes a process to retrieve its current window size. For example in linux it is done through an ioctl call in termios.h: ...
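You can watch this happen from an interactive bash session (a sketch; bash catches SIGWINCH itself to refresh LINES and COLUMNS, and will also run a user trap):

# In an interactive bash: report each resize, then drag the terminal's corner
trap 'echo "now ${COLUMNS}x${LINES}"' WINCH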

December 3, 2019 · 2 min · 306 words · Brad

Arctic Death Spiral

The Arctic Death Spiral describes events unfolding such that we have a feedback loop whereby the increasing amount of ice melting over summer reduces the “heat mirroring” effect of the arctic and allows the planet to absorb more heat, which in turn melts more ice the next summer… For more information see: Arctic Death Spiral

Exception-al perl signals

Scenario

For a certain code-base it’s decided it would be useful to be able to trigger an exception on-demand (mid-way through a process working). Why? Perhaps we have a process which loops over a repeated task which takes a reasonable amount of time, and we know that from time to time the input data to this process might need to be quickly updated, and we’d like to abort whatever the current loop is doing to read in the latest data. Meanwhile for another process we’d simply like some diagnostic information (which might be slightly expensive to generate) to be available on demand. We might also have a standard application design whereby all otherwise unhandled exceptions will be captured at the last moment (before causing a process to fail) and signal sysadmins with emails or management systems with failure states (rather than simply aborting a process with a kill and having to communicate that this has been done). Signals are helpful here, and we even have SIGUSR1 and SIGUSR2 (see “Standard Signals” in your handy signal(7) man page) allocated for user-defined purposes. Rather than having to build some other signalling method into our application, we can use one of those to trigger an exception by setting up a signal handler. ...
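A minimal sketch of the first scenario (the loop body and messages are placeholders; the handler pattern mirrors the eval/die alarm example in perlipc):

#!/usr/bin/perl
use strict;
use warnings;

# die from within the handler raises an exception we can catch in an eval
$SIG{USR1} = sub { die "reload-requested\n" };

print "PID $$ - try: kill -USR1 $$\n";

while (1) {
    eval {
        sleep 10;                     # stand-in for the slow loop body
        print "iteration finished\n";
    };
    if ($@) {
        die $@ unless $@ eq "reload-requested\n";   # not ours - rethrow
        print "caught signal exception - reloading input data\n";
    }
}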

July 18, 2019 · 4 min · 788 words · Brad