This post is a continuation of the original post about building convert-ixbrl.co.uk, this time focused more on the technical implementation details, including database architecture, search optimisation and the complete tech stack. Convert-ixbrl is an advanced company finder and iXBRL-to-JSON/Excel converter for UK Companies House bulk data. It is now supported by a talented team, but I developed the technical core myself; it originated a few years ago as a spin-off of fastukcompanysearch.com.
Background: What is this post about
Several people have approached me over the past few months with specific questions about the database implementation, pipeline processing and the like for convert-ixbrl.co.uk since its launch.
I thought my fellow geeks might find the technical details interesting, so I'll share them here. If you are building something like this yourself, perhaps we could compare notes. Leave a comment if you think there are areas where I could have done better.
The foundations for Convert-IXBRL.co.uk
The idea is simple.
Companies House makes the accounts XBRL and iXBRL files available as a zip file. Download it, extract the XBRL and iXBRL contents, convert the files and save them in an appropriate database, set up the database for fast lookups and build an interface to query the data at convert-ixbrl.co.uk.
Allow me to walk you through my process. I'll skip over the most technically challenging part, building an iXBRL converter from scratch, as I've described it in the Medium post I linked to in the first paragraph.
Check it out after reading, or at least skimming, this post as it provides a more complete view of the whole system.
Overview of the Current Tech Stack:
Backend: SQLite, .Net batch processes that make up the data conversion pipelines, a Companies House monitor, .Net APIs, serverless functions, AWS S3, AWS SNS
Frontend: VueJS, HTML5, CSS
Some technologies tried and abandoned: MySQL, PostgreSQL, MeiliSearch, ElasticSearch, DuckDB, DynamoDB and a few more that I have managed to forget about
Building a Batch Processor to Automate the zip file download
Arguably, this was the most straightforward part. My initial implementation was similar to a batch job processor I had built for fastukcompanysearch.com: a job that runs once a day, checks whether any new files have arrived and, if they have, copies them to the server, extracts the zips and puts the contents in a staging area for processing.
I am a .Net developer, so for me this was a console app, set up to run on a schedule using cron.
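For the curious, a minimal sketch of what such a daily job can boil down to; the download URL, file paths and skip logic here are illustrative placeholders rather than the real implementation:

```csharp
// Minimal sketch of the daily bulk-file job (URL and paths are illustrative).
using System.IO.Compression;

var downloadDir = "/data/downloads";
var stagingDir  = "/data/staging";

// Placeholder: the real Companies House accounts bulk-file URL differs.
var bulkFileUrl = "https://example.com/Accounts_Bulk_Data-2024-01-01.zip";
var zipPath = Path.Combine(downloadDir, Path.GetFileName(bulkFileUrl));

using var http = new HttpClient();

// Only fetch and extract if today's file hasn't already been processed.
if (!File.Exists(zipPath))
{
    await using var remote = await http.GetStreamAsync(bulkFileUrl);
    await using var local  = File.Create(zipPath);
    await remote.CopyToAsync(local);

    // Drop the extracted iXBRL files into the staging area for the converter.
    ZipFile.ExtractToDirectory(zipPath, stagingDir, overwriteFiles: true);
}
```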
The staging area: 400 million rows in an SQLite table
For convert-ixbrl.co.uk, the conversion begins when the extracted iXBRL files are converted to an intermediary format in which each XBRL concept and its associated value is stored as a single row in a gigantic table. A row might look like this:
xxxx | SC123456 | core_netassets | 100000
This was the most natural format to store the data in initially as I was figuring out how best to optimise the database for search. This meant that the queries were quite simple:
Sample Query (representative - don't do this unless you want to risk SQL injection):
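The original snippet isn't reproduced here, so below is a hedged reconstruction of the kind of query this section describes; the table and column names are illustrative, and the company number is interpolated straight into the SQL string, which is exactly the injection risk mentioned above:

```csharp
// Illustrative only: table/column names are assumptions, and interpolating
// user input like this is the SQL-injection risk called out above.
// In real code you would use a parameterised query (@companyNumber) instead.
var companyNumber = "SC123456";
var sql = $@"
    SELECT period_end, concept, value
    FROM   xbrl_facts
    WHERE  company_number = '{companyNumber}'";
```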
This would return a few hundred rows, which I would then group by year and, for each year, return the values along with the concept mapped to a user-friendly string (e.g. core_netassets becomes NetAssets), all as JSON.
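That shaping step is roughly the sketch below, assuming the rows have already been read into a simple record type; the record shape and the friendly-name map are illustrative, not the real schema:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Illustrative row shape - not the real schema.
public record Fact(string CompanyNumber, int Year, string Concept, decimal Value);

public static class FactFormatter
{
    // A small sample of the concept-to-friendly-name mapping.
    private static readonly Dictionary<string, string> FriendlyNames = new()
    {
        ["core_netassets"]  = "NetAssets",
        ["core_profitloss"] = "ProfitLoss",
    };

    // Group rows by year and emit JSON like { "2023": { "NetAssets": 100000, ... }, ... },
    // assuming one value per concept per year.
    public static string ToJson(IEnumerable<Fact> facts) =>
        JsonSerializer.Serialize(
            facts.GroupBy(f => f.Year)
                 .OrderByDescending(g => g.Key)
                 .ToDictionary(
                     g => g.Key.ToString(),
                     g => g.ToDictionary(
                              f => FriendlyNames.GetValueOrDefault(f.Concept, f.Concept),
                              f => f.Value)));
}
```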
I initially picked MySQL to implement this. I started converting and importing the data using my converter, but my biggest friction point was database portability as I worked through the bugs. I work across Windows, Mac and Linux machines, and continually exporting and importing a database of that size significantly lengthens the build and debug cycle, making the application really painful to work on. A large part of what I do is about the development experience; it's not all about the commercial aspects.
I work across these machines because working on my development projects in a coffee shop is one of the greatest pleasures of my life, and while I prefer Windows (yes, I said it), I have to use a Mac for my iOS mobile apps. But I digress.
PostgreSQL would have had the same issues, so I thought "why not try SQLite?"
This initially seemed ridiculous, as the title suggests. SQLite is typically used for small, local installs of relatively tiny databases, but there was nothing in the documentation saying it couldn't accommodate a database of this size. Moving to SQLite would mean I could simply copy the file across machines, a far quicker operation on modern SSDs.
And it actually worked fine, using SQLite's WAL (write-ahead logging) mode; I encourage you to read up on it. Inserts, updates and reads were blazingly fast, even at the ridiculous (for SQLite) table sizes I mentioned.
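For reference, switching a database to WAL is a single pragma; a minimal sketch using Microsoft.Data.Sqlite (file path illustrative):

```csharp
using Microsoft.Data.Sqlite;

using var connection = new SqliteConnection("Data Source=/data/ixbrl-staging.db");
connection.Open();

using var cmd = connection.CreateCommand();
// Write-ahead logging lets readers and the writer proceed concurrently,
// which is a big part of why the bulk inserts and lookups stayed fast.
cmd.CommandText = "PRAGMA journal_mode=WAL;";
cmd.ExecuteNonQuery();
```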
Well, that kind of works until you need to run other types of queries
So this worked, but it was not enough for the variety of queries convert-ixbrl offers and that users actually run. Users also search by partial company name, and searching for partial terms across such a large table, even with indexed columns, was just unacceptably slow, squashing my hopes of getting away with a relatively simple implementation like this.
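The culprit, roughly speaking: matching a partial term needs a leading wildcard, and a query shaped like the sketch below (illustrative table and column names) cannot use an ordinary B-tree index on the name column, so SQLite falls back to scanning the table.

```csharp
// Illustrative: the leading '%' defeats the index on company_name,
// because a B-tree index can only be used for prefix matches.
const string partialNameSearch = @"
    SELECT company_number, company_name
    FROM   companies
    WHERE  company_name LIKE '%' || @term || '%'";
```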
Enter ElasticSearch, then MeiliSearch
Need to run text-based searches across a largish database? ElasticSearch would be one of the top candidates for many. Just dump everything in and start querying.
The problems are:
- You need a far beefier machine. Cloud machines can get expensive fast, particularly when you consider that you need geographic redundancy. You ALWAYS need that redundancy, but the costs multiply far quicker with beefy machines. At this stage, I wanted to keep my costs low to make the most of my cloud credits during the build phase
- Numeric searches are not straightforward. I'd likely need to run queries like "Find companies that have net assets > £2 million within area X", and such mixed queries mean I'd either need to tune ElasticSearch just right or combine its results with a traditional DB search
- Query building is not intuitive (a subjective opinion)
I then also tried MeiliSearch, a more modern alternative. The querying was nicer (again, a subjective opinion), but it didn't really fix the advanced querying issue for me. It still needed a beefy machine for optimum performance, and frankly I just did not feel it was battle-tested enough. And of course, using either ES or MeiliSearch would have meant my DB backup-and-restore issue making a comeback as I built across machines.
Enter FTS
SQLite supports extensions, and FTS (full-text search) enables text searches, using SQL-like syntax, within the same database that already held the company data.
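A minimal sketch of the idea, assuming an SQLite build with FTS5 and illustrative table names: an FTS virtual table over the company names is queried with MATCH and joined back to the ordinary tables so the numeric filters still run as plain SQL.

```csharp
// Illustrative schema and query - the real tables differ.
const string createFts = @"
    CREATE VIRTUAL TABLE IF NOT EXISTS company_fts
    USING fts5(company_name, company_number UNINDEXED);";

const string searchSql = @"
    SELECT c.company_number, c.company_name, c.net_assets
    FROM   company_fts
    JOIN   companies c ON c.company_number = company_fts.company_number
    WHERE  company_fts MATCH @term          -- e.g. 'acme hold*' for partial terms
    AND    c.net_assets > @minNetAssets";
```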
So that worked. It wasn't perfect, as it still required writing optimised queries and performing server-side grouping to structure the text-based results and merge them with the numeric results from the existing traditional tables, but at least now I just had one database, until…
Enter Advanced Queries
… I realised I would also need to run advanced queries like:
'Find all companies within 2 miles of the EC1 postcode with net assets of over a million, with officers who are aged 65 years or older, that are at least 10 years old and operate within the provided SIC code sectors'
This is a real, representative example of the kind of query run on convert-ixbrl.co.uk.
Back to the staging area
I mentioned the staging area earlier. I now realised that no matter how much I optimised the gigantic table shown a few paragraphs ago, I would struggle to let users search across ANY XBRL concept (examples include net assets, profit/loss, pension, accounting fee paid and tax, plus dozens of others), and flexible search was the whole reason for building this service. This was circa 2021, before ChatGPT went mainstream, and I was building this on the side.
I then had to embark on a trial-and-error journey: different table layouts, column layouts, indexing strategies and the like, all with the goal of making the search work efficiently regardless of the nature of the queries.
The process was essentially a repeat of "build a structure, try the queries, find several cases where the structure didn't work or was too slow, modify it, try again, find a few more cases where it didn't work or was too slow, rinse and repeat until done".
I ultimately settled on two databases: one storing the 'raw' converted data shown earlier, and another with really wide rows containing one XBRL concept value per column, something like this (an illustrative sketch; the real table has dozens more columns, and the names and values here are made up):

company_number | year | net_assets | profit_loss | audit_fee | tax | …
SC123456 | 2023 | 100000 | 12000 | 1500 | 2800 | …
The wide column layout was somewhat ugly, but what really matters is query speed, update/insert speed and, ultimately, the user experience. You have to remember that this is a product someone will use, built to deliver value; chasing technical perfection is a futile exercise.
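To make that concrete, here is a hedged sketch of how the earlier 'advanced' example might translate onto the wide table. Every table and column name is illustrative, the 2-mile postcode filter is reduced to a pre-computed bounding box, and the indexing work that actually makes this fast is omitted.

```csharp
// Illustrative only: names, units and the geo filter are assumptions.
const string advancedSearch = @"
    SELECT DISTINCT w.company_number, w.company_name, w.net_assets
    FROM   company_wide w
    JOIN   officers o ON o.company_number = w.company_number
    WHERE  w.net_assets      > @minNetAssets            -- e.g. 1,000,000
    AND    w.incorporated_on <= @maxIncorporationDate   -- at least 10 years old
    AND    w.sic_code IN (SELECT code FROM selected_sic_codes)  -- temp table of provided SIC codes
    AND    o.date_of_birth   <= @maxOfficerDob          -- officers aged 65 or older
    AND    w.latitude  BETWEEN @minLat AND @maxLat      -- bounding box around EC1
    AND    w.longitude BETWEEN @minLon AND @maxLon";
```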
It was fun but frustrating at the same time. When something worked, I'd find myself yelling "YES! I knew I was a genius!", and then ten minutes later, when I discovered it only worked for a subset of the queries I needed to support and other queries took minutes to run, I would throw my hands up in the air exclaiming "Argh, what is even the point? Will this ever be done?"
Slowly but surely though, over months (remember, I wasn't working on this full time), things started to come together. One by one I worked through the tables, column types, indexes, SQLite setup and querying logic, tweaking and optimising all the way until every query I wished to run worked and returned results within a few hundred milliseconds (with some exceptions depending on the query type).
There is additional processing that happens in the application layer (.Net) after the database results are retrieved. There are unit tests that run as part of the usual CI process, as well as integration tests and a Postman test suite used to test the API in a semi-automatic fashion.
Companies House Monitor
The setup above worked initially but required multiple batch jobs. The basic company data bulk file and the accounts bulk data files need to be merged, along with the PSC bulk file and the officers file, all of which are available from the Companies House servers.
Over the next few months, I repurposed a Companies House monitor: a live, always-on listener service that processes the Companies House stream to update the convert-ixbrl data in as close to real time as possible.
Running continuously, it receives the events sent by the Companies House streaming service, saving the data locally, backing it up to S3 (in case the hosting infrastructure fails, and also for future processing) and updating the database.
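A heavily simplified sketch of what such a listener can look like. The endpoint URL, the basic-auth scheme, the bucket name and the skipped update step are all assumptions here rather than the actual implementation.

```csharp
using System.Net.Http.Headers;
using System.Text;
using Amazon.S3;
using Amazon.S3.Model;

// Assumed endpoint: the Companies House streaming API is consumed as a
// long-lived HTTP response of newline-delimited JSON events.
var streamUrl = "https://stream.companieshouse.gov.uk/companies"; // assumption
var apiKey    = Environment.GetEnvironmentVariable("CH_STREAM_KEY") ?? "";

using var http = new HttpClient { Timeout = Timeout.InfiniteTimeSpan };
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
    "Basic", Convert.ToBase64String(Encoding.ASCII.GetBytes($"{apiKey}:")));

var s3 = new AmazonS3Client();

using var response = await http.GetAsync(streamUrl, HttpCompletionOption.ResponseHeadersRead);
using var reader   = new StreamReader(await response.Content.ReadAsStreamAsync());

string? line;
while ((line = await reader.ReadLineAsync()) != null)
{
    if (string.IsNullOrWhiteSpace(line)) continue; // heartbeat/keep-alive lines

    // 1. Back the raw event up to S3 (bucket and key naming are illustrative).
    await s3.PutObjectAsync(new PutObjectRequest
    {
        BucketName  = "ch-stream-backup",
        Key         = $"events/{DateTime.UtcNow:yyyy/MM/dd}/{Guid.NewGuid()}.json",
        ContentBody = line,
    });

    // 2. Apply the event to the local database (omitted here).
}
```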
Batch processes
The batch processes perform a number of functions beyond the data update: data backups, uploads to Snowflake (I make the data available on Snowflake too), certain Salesforce updates (I have a Salesforce LWC app launching Q1 2026), sending reminders and alerts, and performing health checks to ensure the system is operating as expected.
Serverless functions
I use serverless functions to carry out continuous health checks. If a part of the system fails, an alert is triggered so that action can be taken to fix the issue. There are also additional alerts that use AWS SNS to send warning emails if any thresholds are breached, for instance consistently high CPU usage.
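As a rough illustration, one of those checks boils down to something like the sketch below; the health endpoint, topic ARN and thresholds are placeholders, not the real setup.

```csharp
using Amazon.SimpleNotificationService;
using Amazon.SimpleNotificationService.Model;

// Illustrative health check: ping an API endpoint and raise an SNS alert on failure.
var sns      = new AmazonSimpleNotificationServiceClient();
var topicArn = "arn:aws:sns:eu-west-2:123456789012:ixbrl-alerts"; // placeholder

using var http = new HttpClient();
var healthy = false;
try
{
    var response = await http.GetAsync("https://convert-ixbrl.co.uk/api/health"); // assumed endpoint
    healthy = response.IsSuccessStatusCode;
}
catch (HttpRequestException)
{
    // Treat network failures as unhealthy.
}

if (!healthy)
{
    await sns.PublishAsync(new PublishRequest
    {
        TopicArn = topicArn,
        Subject  = "convert-ixbrl health check failed",
        Message  = $"API health check failed at {DateTime.UtcNow:O}",
    });
}
```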
Web Portal
The data is made available as a giant pipe-delimited text file, over a JSON API, and through a web search interface where dozens of search parameters can be combined to perform very specific searches.
The web portal is built using VueJS and HTML, and it talks to the backend through an API very similar to the one available publicly. Working with JavaScript is my least favourite part of this process (a subjective experience, you are welcome to disagree), but VueJS makes it somewhat less painful.
Not that JS is terrible to work with (or perhaps it can be, depending on who you ask); for me it just doesn't provide as enjoyable a development experience as pure .Net development does. Having a web portal meant I have spent many, many hours building, polishing, tweaking and re-tweaking the web interface to make it easy to use without overwhelming the user with 75+ search filters. The options are still there, but the user is initially presented with a much smaller subset of filters, which can be expanded to reveal the advanced options. (Check it out when you are done reading this.)
Authentication
The website uses passwordless authentication: just enter your email, get a login code in your inbox and start using the service. This means I don't have to worry about password hashes, salts, fancy auth mechanisms and the like.
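A minimal sketch of how a login-code flow like this can work, assuming a hashed code stored with a short expiry; the details below are illustrative, not the site's actual implementation:

```csharp
using System.Security.Cryptography;

// Illustrative passwordless flow: issue a short-lived one-time code per email.
public record LoginCode(string Email, string CodeHash, DateTime ExpiresUtc);

public static class PasswordlessAuth
{
    public static (string Code, LoginCode Record) Issue(string email)
    {
        // Six-digit code from a cryptographic RNG.
        var code = RandomNumberGenerator.GetInt32(0, 1_000_000).ToString("D6");
        var record = new LoginCode(email, Hash(code), DateTime.UtcNow.AddMinutes(10));
        // The record would be persisted and the code emailed to the user.
        return (code, record);
    }

    public static bool Verify(LoginCode record, string submittedCode) =>
        DateTime.UtcNow <= record.ExpiresUtc &&
        CryptographicOperations.FixedTimeEquals(
            Convert.FromHexString(record.CodeHash),
            Convert.FromHexString(Hash(submittedCode)));

    private static string Hash(string code) =>
        Convert.ToHexString(SHA256.HashData(System.Text.Encoding.UTF8.GetBytes(code)));
}
```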
I don't collect or store any card or bank information either; all of that is taken care of by Stripe. Nor do I ask for other personal information such as address, date of birth or contact number.
Putting it all together
The project was delivered across 3–4 years (working on it part time), although by the time it finally went live, after an 8-month beta period to iron out the remaining bugs, it felt like it had taken forever.
It has been a really fun project to work on, and it was equally thrilling to see the positive reception and the signups. While there is a commercial aspect to it, to pay for a sustainable service, it was a labour of passion.
It's not 'done'; no project is ever done. Next on the team's roadmap is an officer tracker, enabling tracking of officers across companies even when the name, DOB or address varies. The build is complete, and I expect it to be released in the next few weeks.