How to build a Companies House IXBRL converter for bulk data conversion

Convert-ixbrl is an advanced company finder and iXBRL-to-JSON/Excel converter for UK Companies House bulk data. It started a few years ago as a spin-off of fastukcompanysearch.com. It is now supported by a talented team, but I developed the technical core myself. My story below includes all the juicy technical bits! For more specifics, read my follow-up article on architecture and database design.

First, some background

Hi, I'm Hasham. I'm a software developer based in Leeds, UK. Convert-ixbrl.co.uk is my second project, a spin-off of sorts from https://fastukcompanysearch.com, an app I built to search Companies House and track accounting due dates (check it out if you are an accountant or a bookkeeper).

Companies House makes the IXBRL accounts filed by (most) UK companies available on its website. These IXBRL files have a ton of interesting info ('interesting', at least in my dictionary) and include data on net assets, liabilities, turnover, employees and much, much more. I came across this while building fastukcompanysearch.com, which is essentially a company finder. All this data is made available under the Open Government Licence, which is fantastic but…

The IXBRL format

…but it is provided in IXBRL format.

IXBRL files look like standard HTML files and often include Balance Sheet, P&L and cash flow statements, all viewable in a browser. The information displayed is also stored in XML elements and attributes within the file and can be extracted to run all sorts of data analysis.
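To make that concrete, here is a minimal sketch of pulling tagged numbers out of an inline XBRL document. It is written in Python with only the standard library for brevity and is an illustration, not the C# parser behind convert-ixbrl. The sample document and the two concept names are made up for the example, and the sketch assumes well-formed XHTML, the Inline XBRL 1.1 namespace and comma-separated thousands.

```python
import xml.etree.ElementTree as ET

# Inline XBRL 1.1 namespace; older filings may use the 2008 namespace instead.
IX_NS = "http://www.xbrl.org/2013/inlineXBRL"

# Made-up sample: the visible text is ordinary HTML, the ix: tags carry the data.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Net assets: <ix:nonFraction name="core:NetAssetsLiabilities"
        contextRef="cur" unitRef="GBP" decimals="0" scale="0">12,345</ix:nonFraction></p>
    <p>Loss: <ix:nonFraction name="core:ProfitLoss"
        contextRef="cur" unitRef="GBP" decimals="0" scale="3" sign="-">1,200</ix:nonFraction></p>
  </body>
</html>"""


def extract_facts(xhtml_text):
    """Return one dict per ix:nonFraction element found in the document."""
    root = ET.fromstring(xhtml_text)
    facts = []
    for el in root.iter("{%s}nonFraction" % IX_NS):
        raw = "".join(el.itertext()).strip()
        # Assumption: numbers use commas as thousands separators.
        value = float(raw.replace(",", ""))
        # 'scale' is a power of ten applied to the displayed number.
        value *= 10 ** int(el.get("scale", "0"))
        # 'sign="-"' marks a displayed positive number as negative.
        if el.get("sign") == "-":
            value = -value
        facts.append({
            "name": el.get("name"),
            "context": el.get("contextRef"),
            "unit": el.get("unitRef"),
            "value": value,
        })
    return facts


for fact in extract_facts(SAMPLE):
    print(fact["name"], fact["value"])
```

Real filings are messier than this: other number formats, nested tags, non-numeric facts and files that are not well-formed XML all need handling.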

The fun starts when you try to extract this!

Parsing IXBRL / XBRL Files

Initially, I didn't want to reinvent the wheel by writing a parser myself. Through research, I came across a couple of open-source parsers online, mostly Python ones.

The parsers kind of worked but there were a few key challenges. Firstly, I was getting average parse times of about 4 to 5 seconds per file. Doing some quick maths, I realised that it would take months upon months to get through the millions of files that I needed to process to build a database.
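The quick maths can be sketched as below. The file count of 5 million is an assumed round figure for illustration only (the text above says only 'millions'), and 4.5 seconds is the midpoint of the observed 4 to 5 second range.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_parse(file_count, seconds_per_file):
    """Days needed to parse every file one after another on a single worker."""
    return file_count * seconds_per_file / SECONDS_PER_DAY


# Assumed inputs: 5 million files at 4.5 seconds each.
print(round(days_to_parse(5_000_000, 4.5)))  # about 260 days
```

At that rate a single sequential pass takes the best part of a year, before allowing for any re-runs to fix bugs.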

Secondly, there were these random parsing errors that I kept coming across, which didn't help.

The code was open source. I could have tried to optimise the library to speed things up. I could have tried to break up the data into multiple sets and do some parallel processing, but my gut feeling was that I'd likely run into one or more show-stopper problems as I bulk processed the files. These would be very hard to troubleshoot as I wouldn't have built up a strong IXBRL foundation, having taken the 'easy' way out of using a library. So, I decided to bite the bullet and roll my own parser.

Thirdly, I'm a C#/mobile app developer and I've never had a need to use Python. Tailoring the Python libraries for my use case probably wouldn't have been insurmountable but would have slowed me down initially. Also, I just like C#, and part of the reason for building this was the actual enjoyment I derive from the development experience. I wouldn't get that with Python.

Enter Taxonomies

If you view the source for an IXBRL file, you'll see that it contains lots and lots of tags. The tags come from the financial taxonomy used for the file. FRS-102 is one popular example. The taxonomy defines what the tag for, say, 'Current Assets' might be. Taxonomies are updated regularly.

Before I wrote the IXBRL parser, I had to write a taxonomy parser too so that I could get actual meaning out of each IXBRL and XBRL file. While this step wasn't strictly necessary, I think it proved instrumental later on as I battled with one data quality issue after another, due to the variety of ways the data for the IXBRL files is prepared before it gets sent off.

It took a while to understand all the relationships between the different parts. A taxonomy is a collection of XSD and XML files and it was quite overwhelming to see the size and content of the different files that are included. It required learning and understanding several important concepts like presentation arcs, label arcs and more.
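As a taste of what a label arc looks like, here is a minimal sketch that reads a tiny label linkbase and resolves each concept to its human-readable label. It is Python for brevity rather than my C#, the sample linkbase and the `ex_CurrentAssets` concept id are invented for the example, and it assumes one label per concept, which real taxonomies do not guarantee.

```python
import xml.etree.ElementTree as ET

LINK = "{http://www.xbrl.org/2003/linkbase}"
XLINK = "{http://www.w3.org/1999/xlink}"

# Invented sample: a locator points at a concept in a schema file, a label
# resource holds the text, and the arc connects the two.
SAMPLE_LINKBASE = """<link:linkbase xmlns:link="http://www.xbrl.org/2003/linkbase"
               xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:labelLink xlink:type="extended" xlink:role="http://www.xbrl.org/2003/role/link">
    <link:loc xlink:type="locator" xlink:href="example.xsd#ex_CurrentAssets"
              xlink:label="loc_1"/>
    <link:label xlink:type="resource" xlink:label="lab_1"
                xlink:role="http://www.xbrl.org/2003/role/label"
                xml:lang="en">Current assets</link:label>
    <link:labelArc xlink:type="arc"
                   xlink:arcrole="http://www.xbrl.org/2003/arcrole/concept-label"
                   xlink:from="loc_1" xlink:to="lab_1"/>
  </link:labelLink>
</link:linkbase>"""


def read_labels(linkbase_text):
    """Map concept id (the fragment after '#') to its label text."""
    root = ET.fromstring(linkbase_text)
    labels = {}
    for label_link in root.iter(LINK + "labelLink"):
        # xlink:label values are only unique within one extended link.
        concept_by_loc = {
            loc.get(XLINK + "label"): loc.get(XLINK + "href").split("#", 1)[1]
            for loc in label_link.findall(LINK + "loc")
        }
        text_by_resource = {
            lab.get(XLINK + "label"): (lab.text or "").strip()
            for lab in label_link.findall(LINK + "label")
        }
        for arc in label_link.findall(LINK + "labelArc"):
            concept = concept_by_loc.get(arc.get(XLINK + "from"))
            text = text_by_resource.get(arc.get(XLINK + "to"))
            if concept is not None and text is not None:
                labels[concept] = text
    return labels


print(read_labels(SAMPLE_LINKBASE))
```

Presentation arcs follow the same locator-and-arc pattern, except that both ends of the arc are concepts and the arc carries an order attribute.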

I also went out and bought The XBRL Book by Ghislain Fourny (not an affiliate link) and read it almost cover to cover, which really helped me start to build an understanding of these concepts.

This wasn't enough though. At one point I had to find and hire an IXBRL finance specialist online to overcome certain data parsing challenges.

After much trial and error, I was able to convert one of the FRC taxonomies into an SQL model, ready for querying. From there, it wasn't that hard to make small modifications until the parser was generic enough to support other taxonomies. The solution currently supports about 53 taxonomies. Parsing these and importing them into the DB takes about 30 minutes.
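To give a feel for what 'an SQL model, ready for querying' can mean, here is a deliberately tiny sketch using Python's built-in sqlite3 module. The two tables, their columns and the sample rows are all hypothetical and far simpler than the schema actually used; the point is only that once concepts and presentation arcs are rows, walking a statement's hierarchy becomes a join.

```python
import sqlite3


def build_taxonomy_db(concepts, presentation_arcs):
    """Load (id, label) concepts and (parent, child, order) arcs into SQLite."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE concept (
            id    TEXT PRIMARY KEY,
            label TEXT NOT NULL
        );
        CREATE TABLE presentation_arc (
            parent_id  TEXT NOT NULL REFERENCES concept(id),
            child_id   TEXT NOT NULL REFERENCES concept(id),
            sort_order REAL NOT NULL
        );
    """)
    db.executemany("INSERT INTO concept VALUES (?, ?)", concepts)
    db.executemany("INSERT INTO presentation_arc VALUES (?, ?, ?)", presentation_arcs)
    db.commit()
    return db


def children_of(db, parent_id):
    """Labels of a concept's children in presentation order."""
    rows = db.execute(
        "SELECT c.label FROM presentation_arc a "
        "JOIN concept c ON c.id = a.child_id "
        "WHERE a.parent_id = ? ORDER BY a.sort_order",
        (parent_id,),
    )
    return [row[0] for row in rows]


# Hypothetical sample data, not taken from any real taxonomy.
db = build_taxonomy_db(
    concepts=[
        ("BalanceSheet", "Balance sheet"),
        ("FixedAssets", "Fixed assets"),
        ("CurrentAssets", "Current assets"),
    ],
    presentation_arcs=[
        ("BalanceSheet", "CurrentAssets", 2.0),
        ("BalanceSheet", "FixedAssets", 1.0),
    ],
)
print(children_of(db, "BalanceSheet"))
```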

Now, to IXBRL Files

By the time I had written the taxonomy parser, I had 'graduated' from being an IXBRL 'noob' to an IXBRL 'semi-noob'. Parsing the data out was itself relatively trivial; the complications typically centred around:

Understanding the overall structure
Making sense of hierarchies and how the concepts and facts link together
Mapping the extracted facts to the taxonomy and to the different financial statements, such as Balance sheets, cash flow statement and P&L
Coping with the amazing flexibility offered by IXBRL, which means that there is little consistency in how certain fields are tagged

(The above is not an exhaustive list)
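One example of how facts link to the rest of the file: every fact carries a contextRef that points at a context element, and the context says which entity and which period the number belongs to. Balance sheet figures normally sit in 'instant' contexts while P&L figures sit in 'duration' contexts. The sketch below (Python for brevity, with an invented company number and sample contexts) resolves context ids to periods so that a fact's contextRef can be looked up.

```python
import xml.etree.ElementTree as ET

XBRLI = "{http://www.xbrl.org/2003/instance}"

# Invented sample of the hidden header section of an inline XBRL file.
SAMPLE_RESOURCES = """<ix:resources xmlns:ix="http://www.xbrl.org/2013/inlineXBRL"
              xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:context id="year_end">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.companieshouse.gov.uk/">01234567</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2023-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:context id="full_year">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.companieshouse.gov.uk/">01234567</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2023-01-01</xbrli:startDate>
      <xbrli:endDate>2023-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
</ix:resources>"""


def read_contexts(resources_text):
    """Map each context id to its period (an instant or a start/end pair)."""
    root = ET.fromstring(resources_text)
    contexts = {}
    for ctx in root.iter(XBRLI + "context"):
        period = ctx.find(XBRLI + "period")
        instant = period.findtext(XBRLI + "instant")
        if instant is not None:
            contexts[ctx.get("id")] = {"instant": instant.strip()}
        else:
            contexts[ctx.get("id")] = {
                "start": period.findtext(XBRLI + "startDate").strip(),
                "end": period.findtext(XBRLI + "endDate").strip(),
            }
    return contexts


contexts = read_contexts(SAMPLE_RESOURCES)
print(contexts["year_end"])   # period for a fact with contextRef="year_end"
print(contexts["full_year"])  # period for a fact with contextRef="full_year"
```

Real contexts can also carry dimensions (segments and scenarios), which is a large part of why the hierarchy takes time to understand.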

Other, relatively easy, challenges were technical:

IXBRL and XBRL parsing had to be fast enough to get through the several million files
Building a user friendly interface to make this data searchable without making the search overly complicated
The data had to be stored in an efficient manner for it to be queryable

There were problems, lots of them. Lots of head scratchers, but it was great fun to work through them.

So how long does it take to convert the full Companies House IXBRL data to a format that is searchable?

The C# parser currently takes an hour and a half on average to read through the IXBRL and XBRL files for a given calendar month. Multiple IXBRL files can be imported at one time and, after some experimentation, I've found a sweet spot where a five-year import can be done in about 8–9 hours using 6 concurrent imports into the SQL DB. This is on a 6-core Intel 10th-gen CPU with 32GB RAM, importing from an external SSD. I did have to run many, MANY imports as I worked through the bugs. Newly added files will only take 1.5 hours once a month as a batch job, and that's fine.
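The general shape of running a fixed number of imports side by side can be sketched as follows. This is Python rather than the C# actually used, `import_month` is a hypothetical placeholder that does no real work, and a thread pool is used only to show the bounded-concurrency pattern (CPU-heavy parsing in Python would more likely use processes).

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_IMPORTS = 6  # the sweet spot reported above


def months_in_range(first_year, last_year):
    """Month keys such as '2019-01' for every month in the inclusive year range."""
    return [
        f"{year}-{month:02d}"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


def import_month(month):
    """Hypothetical placeholder: parse one month's files and write them to the DB."""
    return f"{month}: imported"


def import_all(months, worker=import_month, max_workers=MAX_CONCURRENT_IMPORTS):
    """Run at most max_workers imports at a time, keeping results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, months))


results = import_all(months_in_range(2019, 2023))  # five years, 60 monthly batches
print(len(results), results[0])
```

Capping the pool size matters because past a certain point extra workers just compete for the same CPU cores, disk and database, which is why the limit is worth finding by experiment.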

Querying

However, once the data has been extracted, the job isn't done. It then takes another 8 hours to build a second database that enables companies to be looked up based on different financial metrics.
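One common way to make this kind of lookup fast is to flatten the extracted facts into one row per company and metric, then index on the metric and its value. The sketch below shows that idea with Python's sqlite3 module. The table, the metric names and the company numbers are all hypothetical, and this is an illustration of the general technique rather than a description of the real second database.

```python
import sqlite3


def build_lookup_db(facts):
    """Load (company_number, metric, value) rows and index them for range queries."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE company_metric ("
        "company_number TEXT NOT NULL, metric TEXT NOT NULL, value REAL NOT NULL)"
    )
    db.executemany("INSERT INTO company_metric VALUES (?, ?, ?)", facts)
    # A composite index lets 'metric = ? AND value BETWEEN ? AND ?' avoid a full scan.
    db.execute("CREATE INDEX idx_metric_value ON company_metric (metric, value)")
    db.commit()
    return db


def find_companies(db, metric, minimum, maximum):
    """Company numbers whose metric falls in the inclusive range, lowest first."""
    rows = db.execute(
        "SELECT company_number FROM company_metric "
        "WHERE metric = ? AND value BETWEEN ? AND ? ORDER BY value",
        (metric, minimum, maximum),
    )
    return [row[0] for row in rows]


# Hypothetical sample rows.
db = build_lookup_db([
    ("00000001", "turnover", 50_000.0),
    ("00000002", "turnover", 250_000.0),
    ("00000003", "turnover", 900_000.0),
    ("00000002", "net_assets", 10_000.0),
])
print(find_companies(db, "turnover", 100_000, 1_000_000))
```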

Making the extracted IXBRL data available to the world

The data, sat in an SQL DB, isn't of much use on its own. I wrote a JSON API and a Vue.js/ASP.NET web portal to make it accessible to developers and non-developers, along with the documentation to go with it.

This itself involves significant effort, including building robust data pipelines, extensive testing, optimising data storage and retrieval and everything else around these.

Update

A few people have asked why I didn't use any of the AI tools. There were three reasons for that.

Firstly, I wrote the parser itself just before ChatGPT became mainstream, so I couldn't have used it even if I had wanted to. Once the parser was functional, I parked the project for a little while to focus on rebuilding the fastukcompanysearch mobile app using Swift (a topic for another post someday) and then resumed the project a year or so later.

Secondly, even though I do use AI in my workflows regularly for isolated pieces of work, even in 2025 I've not found the results of any of the AI models to be 100% hallucination-free, meaning they sometimes just make things up. This would mean that I wouldn't have enough confidence in the data's quality if I were to just feed the XBRL files to an LLM like ChatGPT or Gemini.

Thirdly, these LLMs are slow and expensive when it comes to processing such a large amount of data. It just did not seem practical for these reasons.

What can convert-ixbrl.co.uk do?

Here is what you can do using the web search panel or the API:

Find UK companies based on an additional 50+ financial filters (turnover, net assets, liabilities, audit fee and many more) and get Balance Sheet, P&L and cash flow statements in Excel/JSON format
Create company lists for sales lead generation or general tracking
Search companies based on location and distance
Search companies using director age ranges
Search companies using incorporation date, CS01 dates and accounts due dates
Set up email alerts for when a company's financials change (coming soon)
Search officers based on name, ID, postcode, nationality and more
Download bulk data if you wish to upload it to Salesforce, Snowflake or a similar system. Bulk data will be available in the live release.

I invite you to try convert-ixbrl.co.uk (and fastukcompanysearch.com). I am delighted to see that there is already an active user base. I love getting feedback from the users as, like I said, it's a product of love first and a commercial venture second.