
In this work we describe the use of the asimov Python library to create the workflows required to perform parameter estimation analyses for a gravitational wave event catalogue. To provide a concrete illustration of this process we will use specific gravitational wave triggers, but the same principles can be applied to any set of triggers. We will discuss the tools which asimov provides to simplify this process for triggers which have been identified by the LIGO-Virgo-KAGRA (LVK) collaboration, but will also describe performing analyses on arbitrary segments of gravitational wave data.

In this work we describe the process of creating a catalogue based on the additional gravitational wave events which were identified in the 4-OGC catalogue, but which had not been reported in GWTC-3.

Background

Asimov has grown from a collection of scripts which were used during the production of the parameter estimation analyses for the GWTC-2 catalogue in 2020, and over time it has evolved into a much more capable analysis management tool. This does mean that some quirks remain which are specific to gravitational wave analyses, and in particular to the kind of analysis which asimov was created to handle.

Why use asimov?

Asimov was originally created to handle complex data analysis workflows, but it is capable of helping to maintain organisation of much smaller and simpler projects as well. When it was first created the main motivation for asimov was ensuring consistency between hundreds of analyses; making sure that the various settings were the same for all of them except for the ones which we intended to change. This remains a core part of how asimov functions, but we have been able to build upon this to make asimov a tool for creating reproducible analyses, and to help analysts keep on top of hundreds of concurrent analyses. If you run analyses where you need to maintain some degree of consistency, and need to manage a large number of settings, variations, and inputs, then asimov can help. If you run multi-stage analyses which require the provenance of intermediary results to be assured, asimov can help.

What does asimov need?

Asimov was designed to work within the computing ecosystem of the LVK, and as a result it currently supports a specific set of tools from that environment. However, we’re working to expand this all of the time, and we’re also open to both feature requests and pull requests.

The principal tool which asimov works with to manage the execution of analysis code is htcondor, a high-throughput job scheduler, which is designed to spread jobs around a computing cluster, but is also capable of scheduling jobs on a single machine. In this guide we’ll assume that you already have access to a condor pool, but if you don’t, and you want to run your code locally (on a desktop workstation or a laptop), then we’ve provided some guidance about getting that set up in the minicondor appendix at the end of this paper.
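If you’re not sure whether you already have access to a scheduler, the standard HTCondor command line tools (these are part of HTCondor itself, not asimov) give a quick check:

```bash
# Ask the scheduler for your job queue; this reports an error if
# no scheduler daemon (schedd) can be reached.
condor_q

# List the machines available in the pool.
condor_status
```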

Asimov can work with open gravitational wave data, provided by the Gravitational Wave Open Science Center (GWOSC), without additional setup requirements on your system. However, if you wish to run analyses on non-open data you’ll need either to run at a computing facility which has access to the relevant data, or to ensure that you have set up authentication to access this data over the internet. We won’t cover the details of this in this guide (though we will note specific places where you might need to use a service which requires authentication), but comprehensive instructions on various authentication methods are provided in the LDG Computing Guide.

Asimov projects

An asimov project is a directory located on your machine’s file system which will contain all of the metadata required to run your analyses, the working directories for each analysis, and eventually the results of your analyses. When we created asimov we modelled projects on git repositories, which are designed to keep all of the code for a given project together in one place, and it’s best to think of an asimov project as the place on your machine where the entire project exists. When you run asimov commands (which we’ll introduce later in this guide) in the project directory, asimov will work with the contents of the project in isolation from other asimov projects on your machine. Most of the commands in this guide should be run in an asimov project directory, but we’ll note the cases where that’s not true as they arise.

Asimov blueprints

Asimov is designed to work with a large number of different analysis codes, and each of these can require dozens, and sometimes hundreds, of configuration settings. Asimov maintains an internal database of all of these settings so that it can set up new analyses without these needing to be respecified each time. This both simplifies the process of creating a new analysis and reduces the potential for inconsistent settings being introduced between runs. However, we still need a way to specify settings for analyses, to set default values for all of those settings, and to add new subjects for analyses (which, in the case of our gravitational wave catalogue, will be transient gravitational wave events).

Asimov simplifies this process by using text files called blueprints, which are written in YAML format. We’ll meet plenty of examples of blueprints in this guide, but we also maintain a curated set of blueprints which you can either use directly, or which can be adapted for your own needs.

Creating an asimov project

Asimov keeps all of the input and output files for all of the analyses it is being used to run in the project directory, and adds some additional files and directories to allow it to organise and track analyses.

The current directory can be turned into an asimov project by running

bash $ asimov init "my project" ● New project created successfully!

where “my project” can be replaced with your own choice of name for the project. Asimov will add several new files and directories to the current directory, so it is best to use an empty directory for your asimov project.
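For example, a minimal sequence for starting a project in a fresh, empty directory might look like this (the directory name here is just an illustration):

```bash
# Create an empty directory for the project and initialise asimov inside it.
mkdir my-project
cd my-project
asimov init "my project"
```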

For our 4-OGC project we created a project using the following command:

```bash
$ asimov init "4OGC"
● New project created successfully!
```

Running this command creates a variety of new files and directories within the current directory.

We can see these files and directories by running tree.

```bash
$ tree
.
├── asimov.log
├── checkouts
├── logs
├── results
└── working

4 directories, 1 file
```

We’ll revisit most of these later, but here’s a quick summary of them.

working This directory is where the individual run directories for each analysis will be created. Files which are generated by analysis pipelines can be found here, though asimov will handle results files for you automatically.

checkouts This directory is where asimov will save the configuration files it generates for each analysis. These are held under version control using the git version control system.

results This directory will be used to hold the results of each analysis, along with various pieces of metadata, such as the hashes of the results files. Asimov performs file operations in this directory automatically, and comes with additional command line tools for interacting with it which perform various tasks to provide assurance of data integrity. Files contained here are read-only.

.asimov This directory contains the database and various other operating files for asimov. By default it is hidden, and you’ll not need to work with any files in it for the purposes of this tutorial.

asimov.log This is a text file which contains the logging output from asimov.

Adding analysis defaults

By design, asimov ships with no default settings for its analyses (though individual pipelines’ integrations with asimov generally do). This means that we need to set up the defaults for our analyses. These defaults will be applied to each analysis which asimov creates; however, they can always be overridden, either on a per-event basis (so that the defaults are changed for every analysis run on that event) or on a per-analysis basis (so that the defaults are changed for a specific analysis). We’ll cover how to go about doing that later.


To set these defaults we need to create our first asimov blueprint.

Asimov blueprints are YAML-formatted text files. We want to set project-wide defaults; to tell asimov to do this we need to make sure that one line in the blueprint is

```yaml
kind: configuration
```

For clarity we recommend making it the first line. The configuration kind of blueprint sets configuration values at the project level.

Let’s take a concrete example of a set of values we want to configure: the priors. We’ll normally want to keep almost all of the priors for a set of analyses the same; we might want to change a couple, for example the chirp mass, on an event-by-event basis, but the others should stay the same as much as possible to keep all of our analyses comparable.

Example: Setting default priors

We want to keep consistency with the priors used in the GWTC-3 catalogue for all of our events.

Asimov makes this easy by allowing you to specify defaults right across your project, which will then be used in every analysis. This means that unless you specifically set an analysis up to do something different it will use these default, project-wide settings.

This can be especially helpful when you want to make sure that the correct data is being loaded, or, as in this example, that the same priors are used in all of the analyses to produce a consistent set of results.

For the sake of brevity in this guide we will only write a blueprint for a small subset of these priors, but you should consult the curated version for a comprehensive set.

To set the default prior on the two component compact object spins we can create the following blueprint:

```yaml
kind: configuration
priors:
  spin 1:
    maximum: 1
    minimum: 0
    type: Uniform
  spin 2:
    maximum: 1
    minimum: 0
    type: Uniform
```

You can see from this that asimov reads configuration data in a hierarchical structure; the priors are all set under the priors keyword, while other settings, including those for the likelihood, samplers, and schedulers, each have their own keyword.

Here each prior has a type, which is the type of prior distribution, and then configuration values for that prior (a maximum and a minimum in the case of the uniform prior).
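Other settings follow the same pattern under their own top-level keywords. As a rough sketch of that hierarchy (the likelihood and scheduler values below are illustrative assumptions rather than recommended settings):

```yaml
kind: configuration
likelihood:
  sample rate: 2048                             # example likelihood-level setting
scheduler:
  accounting group: ligo.dev.o4.cbc.pe.bilby    # example scheduler-level setting
```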

Save the spin-prior blueprint above as a file; let’s call it priors.yaml, and for simplicity save it in the root of the project directory for now. As you find yourself making more blueprints it will probably make sense to create a directory to keep them in, and it would be wise to keep them under version control.
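If you do keep them under version control, a minimal sketch using git might look like this (the blueprints directory name is just a suggestion):

```bash
# Keep copies of the blueprints in their own git repository.
mkdir blueprints
cp priors.yaml blueprints/
cd blueprints
git init
git add priors.yaml
git commit -m "Add default prior blueprint"
cd ..
```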

We add the configuration to asimov through the asimov apply command line tool (blueprints are “applied” to asimov projects). Let’s apply our prior defaults using the command line.

```bash
$ asimov apply -f priors.yaml
● Successfully applied a configuration update
```

Note that if you’re applying settings from a blueprint you need to pass the file with the -f or --file argument. This can either be a file on your local file-system, or one in a publicly accessible internet location.

This means that we can add the entire set of curated defaults by running

```bash
$ asimov apply -f https://raw.githubusercontent.com/transientlunatic/gw-event-data/main/defaults/production-pe-priors.yaml
● Successfully applied a configuration update
```

Setting other analysis defaults

In addition to setting priors, you’ll almost certainly want to configure the behaviour of things like the scheduler, the likelihood function, and other parts of the analysis. The configuration settings used for GWTC-3 are available in a curated blueprint (linked in the command below), which you can examine in more detail and update as required.

You can directly apply these settings to your project by running

```bash
$ asimov apply -f https://git.ligo.org/asimov/data/-/raw/main/defaults/production-pe.yaml
● Successfully applied a configuration update
```

Per-pipeline configuration

In addition to setting defaults across the whole project, it can be helpful to set them for all analyses which use a specific pipeline. We can do this using a configuration blueprint file, by placing all of the settings within the pipelines part of the hierarchy.

For example, in order to set all bilby pipeline jobs to use a specific accounting tag on htcondor we can apply this blueprint:

```yaml
kind: configuration
pipelines:
  bilby:
    scheduler:
      accounting group: ligo.dev.o4.cbc.pe.bilby
```

Recall that to apply the blueprint we save the YAML-formatted data to a text file, for example bilby.yaml, and then run

```bash
$ asimov apply -f bilby.yaml
```

All of the settings which can be applied across the project can be set on a per-pipeline basis this way.

Adding a new event to the project

Having configured the project with all of the settings and defaults which we’ll want for the analyses we can get started on the process of adding subjects for those analyses. In the case of a gravitational wave catalogue those will typically be gravitational wave events, such as GW150914_095045.

There are two approaches to adding a new gravitational wave event to your project, and the right one to choose will depend on the nature of the event, and the resources you have access to.

  1. Using GraceDB: If you want to analyse an event which is present in GraceDB, the gravitational wave trigger database curated by the LVK, then asimov can download basic information about the trigger and create an event directly from the database. This is only possible if you have login details for GraceDB and your trigger has been identified by a pre-existing search. This tutorial will cover this briefly, but you should consult the GraceDB Guide for further information.
  2. Using a blueprint: You can use an event blueprint to add an event to the project, which should include information such as the event time and details of the data source. This tutorial will focus on this approach, since our example replicates the 4-OGC catalogue, which is not based on a LIGO search.

Adding an event from GraceDB

To add an event from GraceDB you’ll need to know its superevent number, also known as its SID. These weren’t used prior to O3, so for older events you’ll need to use an alternative approach, which won’t be covered in this documentation, but which is discussed in the detailed GraceDB Guide.

An event can be added directly to the project from GraceDB by running

```bash
$ asimov event create --superevent <SID>
```

Adding an event from a Blueprint

You will have most control over the settings for a new event by adding it using a Blueprint, but you’ll also need to do a little more preparation work than is required when using GraceDB.

An event blueprint differs from the configuration blueprints encountered earlier in the tutorial in that it contains kind: event. We also need to add the GPS time of the event, and we will use this opportunity to set some additional settings which are common to all the analyses on this event, but not necessarily to all the events in the project.

With just the GPS time of the event our blueprint will look like this:

```yaml
kind: event
event time: 1261197166.15
name: OGC191224_043228
```

This isn’t ready to be applied to the project yet. So far we have copied the GPS time of the event from the 4-OGC publication and named the event (we’ve swapped the GW designation for OGC to make the source of the event clear; this isn’t a technical requirement of asimov, however, and the event can have any name as long as it’s unique within the project). We still need to provide a few more pieces of information before the blueprint can be applied.

The first of these is details about the data. We will use open data produced by LIGO, which is stored in gravitational wave frame files, or GWF files. Part of the workflow we’ll define later in the tutorial will be able to download these files for us, but we must provide some additional information first. GWF files store data in channels which can correspond to different data sources. For GWOSC data the channels are called <IFO>:GWOSC-16KHZ_R1_STRAIN where <IFO> is replaced with the name of the interferometer. We also need to tell asimov about the types of frame it needs to use; for GWOSC data these are <IFO>:<IFO>_GWOSC_O3b_16KHZ_R1. Additionally, we need to tell asimov how much data to analyse; for this event we’ll use an 8-second segment.

```yaml
kind: event
event time: 1261197166.15
name: OGC191224_043228
data:
  channels:
    H1: H1:GWOSC-16KHZ_R1_STRAIN
    L1: L1:GWOSC-16KHZ_R1_STRAIN
    V1: V1:GWOSC-16KHZ_R1_STRAIN
  frame types:
    H1: H1:H1_GWOSC_O3b_16KHZ_R1
    L1: L1:L1_GWOSC_O3b_16KHZ_R1
    V1: V1:V1_GWOSC_O3b_16KHZ_R1
  segment length: 8
```

We’ll generally also want to include in this blueprint as much information as we can about the likelihood function to be used in the analyses on this event. This is a good idea because, unless we explicitly override it for individual analyses, all of the analyses on the event will then have consistent settings. This includes information like the length of PSD to produce, the sample rate for the analysis, and a handful of other settings.

```yaml
kind: event
event time: 1261197166.15
name: OGC191224_043228
data:
  channels:
    H1: H1:GWOSC-16KHZ_R1_STRAIN
    L1: L1:GWOSC-16KHZ_R1_STRAIN
    V1: V1:GWOSC-16KHZ_R1_STRAIN
  frame types:
    H1: H1:H1_GWOSC_O3b_16KHZ_R1
    L1: L1:L1_GWOSC_O3b_16KHZ_R1
    V1: V1:V1_GWOSC_O3b_16KHZ_R1
  segment length: 8
likelihood:
  psd length: 8
  sample rate: 2048
  start frequency: 13.333333333333334
  window length: 8
```

Next we need to tell asimov which interferometers observed the event; for this specific event it is just H1 and L1.

```yaml
kind: event
event time: 1261197166.15
name: OGC191224_043228
data:
  channels:
    H1: H1:GWOSC-16KHZ_R1_STRAIN
    L1: L1:GWOSC-16KHZ_R1_STRAIN
    V1: V1:GWOSC-16KHZ_R1_STRAIN
  frame types:
    H1: H1:H1_GWOSC_O3b_16KHZ_R1
    L1: L1:L1_GWOSC_O3b_16KHZ_R1
    V1: V1:V1_GWOSC_O3b_16KHZ_R1
  segment length: 8
likelihood:
  psd length: 8
  sample rate: 2048
  start frequency: 13.333333333333334
  window length: 8
interferometers:
  - H1
  - L1
```

Finally, we can make any event-specific adjustments to the prior settings; for this event we’ll set a specific chirp mass prior.

```yaml
kind: event
event time: 1261197166.15
name: OGC191224_043228
data:
  channels:
    H1: H1:GWOSC-16KHZ_R1_STRAIN
    L1: L1:GWOSC-16KHZ_R1_STRAIN
    V1: V1:GWOSC-16KHZ_R1_STRAIN
  frame types:
    H1: H1:H1_GWOSC_O3b_16KHZ_R1
    L1: L1:L1_GWOSC_O3b_16KHZ_R1
    V1: V1:V1_GWOSC_O3b_16KHZ_R1
  segment length: 8
likelihood:
  psd length: 8
  sample rate: 2048
  start frequency: 13.333333333333334
  window length: 8
interferometers:
  - H1
  - L1
priors:
  amplitude order: 1
  chirp mass:
    maximum: 14
    minimum: 7
```

And this is everything we need to create the event in the project. Save this as OGC191224_043228.yaml, and add it to the project by running

```bash
$ asimov apply -f OGC191224_043228.yaml
● Successfully applied OGC191224_043228
```

Adding a new analysis to an event

So far we’ve encountered two kinds of asimov blueprint. Now we’ll use the third kind, an analysis blueprint, to set up an analysis on the event we created in the last step.

Any action which is performed on an event’s data is carried out using an asimov analysis, even if it might not immediately be something you’d consider “data analysis”. Indeed, the first asimov analysis which we’ll need to set up does no real analysis at all, and is just used to fetch the data which we will go on to analyse.

The software which asimov configures and runs in each analysis is called a pipeline, and a number of pipelines are supported in the default installation of asimov:

  • Bayeswave: Used to calculate on-source estimates of the power spectral density (PSD) of noise in a timeseries
  • Bilby: Used to perform generic parameter estimation on potential signals

In order to create a full parameter estimation analysis we’ll need to chain several different analyses together. Asimov will handle the extra work required to ensure that these are run in the right order, and that their results are passed along the chain, but we’ll need to provide it with some information about the workflow to allow it to do that.

Let’s have a look at the blueprint for the first job in the chain, and think about what’s needed to create any analysis. Remember, this isn’t going to perform any actual analysis, and is just downloading the data files (called “frame files” in LIGO terminology) which contain our data.

```yaml
kind: analysis
name: get-data
pipeline: gwdata
download:
  - frames
file length: 4096
```

This looks pretty similar to previous blueprint files we’ve encountered, but this time it has kind: analysis as its first line. Each analysis needs a unique name, and I’ve chosen name: get-data for this one, since that seems pretty short and descriptive. Then I specified the pipeline which asimov should use with this blueprint; in this instance it’s a pipeline called gwdata, which is used to… download gravitational wave data.

The next few lines have specific configuration for this pipeline; we tell the pipeline to download frames, and to download frames which are 4096 seconds long. Individual pipelines should provide documentation for all of these additional options, but this is all we need to download data.

If you’ve ever configured a pipeline for LIGO analysis you might be a bit surprised by this. Isn’t there a lot of information missing? Yes, but asimov already knows about it. We provided plenty of information common to all analyses on this event in the previous step when we created the event, and almost everything else comes from the configuration we applied just after we set up the project. Asimov first uses the event and then the project to fill in any gaps in the settings we provide. That makes it easier to set up an analysis, since we only need to set the things which we want to be different for that analysis. It also makes it easier to maintain consistency between multiple analyses on the same event, and across many events, since we consciously set the things which should differ, and can’t inadvertently set two parameters slightly differently.
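To make that layering concrete, here is the same get-data blueprint again, annotated (in YAML comments, which are ignored) with where the settings we didn’t write come from:

```yaml
kind: analysis
name: get-data
pipeline: gwdata
download:
  - frames
file length: 4096
# Not written here, but filled in automatically by asimov:
#   channels, frame types, segment length    -> from the event blueprint
#   priors, likelihood and scheduler defaults -> from the project configuration
```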

Before we add this analysis to our event, let’s have a quick look at a second analysis. Asimov excels at analysis workflow management, so let’s make a workflow.

```yaml
kind: analysis
name: generate-psds
pipeline: bayeswave
comment: Bayeswave on-source PSD estimation job
needs:
  - get-data
```

The blueprint for this analysis looks superficially similar to the one to fetch data. This one will run the bayeswave pipeline, which is used to estimate the amount of noise in a stretch of data, to produce the power spectral density for the data (the PSD). I’ve called it generate-psds since it will calculate a PSD for each detector. There are two new bits of information in this blueprint. First, the comment. This is a place you can put a human-readable string to describe the analysis; it’s not used for configuring the analysis, but it can be useful for keeping track of things if you have a lot of similar-looking analyses on an event!

Second is the needs section. This is how we tell asimov which analyses depend on each other. This job needs access to the data in order to analyse it, so we tell asimov that the generate-psds analysis needs the get-data “analysis” to have completed before it starts. This also allows asimov to pass the outputs of the get-data analysis (in this case, the data files) to the next job in the chain. While we won’t encounter anything more complicated than this in this tutorial, it is possible to give a list of several analyses which are all needed, and asimov will work out how, and in what order, to execute each stage of the workflow.
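For example, a hypothetical analysis which needed both the data and the PSDs before it could start would simply list them both (the name here is made up for illustration):

```yaml
kind: analysis
name: example-followup      # hypothetical analysis name
pipeline: bilby
needs:
  - get-data
  - generate-psds
```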

Note that you must always add jobs to an event after the analyses they require; a job’s dependencies need to already exist in the event before the job itself is added.

Let’s look at the final analysis in the workflow, which performs our parameter estimation using the bilby pipeline.

```yaml
kind: analysis
name: bilby-imrphenomxphm
pipeline: bilby
waveform:
  approximant: IMRPhenomXPHM
comment: Bilby parameter estimation job
needs:
  - generate-psds
```

Again, we have a needs section, which tells asimov that the generate-psds job needs to finish before this one can run, and the various other bits of information like the name and pipeline. What’s new here is the waveform section, which allows us to specify the waveform model which will be used for the analysis. This is the only analysis-specific information we’ll set right now; everything else will be gathered from the event (things like priors) or the project overall (things like the correct data types).

We can put all three analysis blueprints together in a single file by separating them with a line containing three hyphens, ---. So we can write this:

```yaml
# This file contains the standard set of analyses which were
# applied to the events for the GWTC-3 catalogue paper.
kind: analysis
name: get-data
pipeline: gwdata
download:
  - frames
file length: 4096
---
kind: analysis
name: generate-psds
pipeline: bayeswave
comment: Bayeswave on-source PSD estimation job
needs:
  - get-data
---
kind: analysis
name: bilby-imrphenomxphm
pipeline: bilby
waveform:
  approximant: IMRPhenomXPHM
comment: Bilby parameter estimation job
needs:
  - generate-psds
```

and save it into a file called analyses.yaml, for example.

All that’s left to do is to apply these analyses to the event we created earlier, OGC191224_043228. We can do this using the command

```bash
$ asimov apply -f analyses.yaml --event OGC191224_043228
● Successfully applied get-data to OGC191224_043228
● Successfully applied generate-psds to OGC191224_043228
● Successfully applied bilby-imrphenomxphm to OGC191224_043228
```

This registers the analyses with asimov, but it won’t start any of them just yet.

Starting analyses

In the last few sections of the tutorial we’ve gone through the process of setting-up an asimov project, detailing an event and adding that to the project, and then designing a workflow of analyses to run on that event.

We’re now ready to actually begin one of the analyses. For this step it’s important that the machine where you’re running asimov has access to an htcondor scheduler. If you’re not sure about what that means, there’s a section at the end of this tutorial which discusses condor in a little more detail, and also explains how you can set up a lightweight version on a single machine just to try things out.

Starting an analysis, perhaps counterintuitively, is a two-step process in asimov (though we’ll see a way to do this in a single step later). First we need to build the analysis.

```bash
$ asimov manage build
```

In this step asimov iterates over every event in the project and checks for analyses which are ready to run (which normally comes down to the analysis having the status ready, and all of the analyses listed in its needs section having finished). It then constructs the configuration files for each pipeline based on templates which are provided by the maintainers of those pieces of software. This step doesn’t actually start any analyses.

When you first run this you’ll get an output which looks something like this:

```bash
● Working on OGC191224_043228
Working on production get-data
Production config get-data created.
```

This indicates that the configuration for the get-data analysis has been created. Initially this is the only analysis which can run, since all of the others we’ve added to the event either require the get-data analysis to complete, or rely on analyses which themselves rely on it.

The second step is the submit step, which constructs the pipeline and sends it to the htcondor computing pool. This step can take a little while to run, especially if you’re setting up a lot of analyses.

```bash
$ asimov manage submit
```

If there are any problems while submitting an analysis, asimov should report these (and will normally mark the relevant analysis as stuck).

If everything went to plan asimov should output something like this:

```bash
● Submitted OGC191224_043228/get-data
```

Your analyses should now be running! Make a cup of your favoured warm beverage, and sit back. Asimov’s working.

If you want to combine the build and submit steps so that asimov does both at the same time you can chain them like this:

```bash
$ asimov manage build submit
```

Monitoring analyses

Now that you’ve got asimov to set some analyses running, you’ll probably want to check up on them and see whether they’ve finished. If you’ve used htcondor before you’re probably familiar with running the condor_q command every so often to do this. Asimov does this for you as part of its monitor process, and is able to add some additional functionality along the way.

When you run

```bash
$ asimov monitor
```

Asimov will check the status of all the analyses it’s running on the condor scheduler. If they’re still running it will collect any information about that analysis that each pipeline is designed to report back, and it will report to the console that the job is still running. If the job is no longer running asimov will investigate. Hopefully your job is no longer running because it has finished! Asimov will look to see if the expected outputs have been produced, and if they have it will collect these outputs, and mark the job as finished. If the pipeline needs to invoke any post-processing asimov will start this.

For the analysis we started earlier, you might see something like this:

```
OGC191224_043228
- get-data[gwdata]
  ● get-data is in the queue (condor id: 109)
- generate-psds[bayeswave]
  ● ready
- bilby-imrphenomxphm[bilby]
  ● ready

The event also has these analyses which are waiting on other analyses to complete:
    bilby-imrphenomxphm which needs generate-psds
    generate-psds which needs get-data
```

However, the precise output will depend on the state of all of the analyses in the project.
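Since asimov monitor only reports the state of the project at the moment you run it, you may want to re-run it periodically while your analyses progress. One simple (and entirely optional) way to do this from the shell is a loop such as:

```bash
# Re-check the state of all running analyses every 15 minutes.
while true; do
    asimov monitor
    sleep 900
done
```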

If the job is no longer running, but it hasn’t produced the desired outputs, asimov will attempt to restart it. This can happen because of a fault in the distributed computing setup, and for very complex analyses this might happen with reasonable frequency. Not all pipelines support this action, and asimov will eventually stop attempting to restart most analyses. If it’s not able to fix things itself it will mark the analysis as stuck, and you’ll need to look at the log files for the analysis to diagnose the problem.