Notes on Labor Data by Forest Gregg

How to build a corporate registry?

January 09, 2023

In 2023, I’m going to be working on organizing data about employers that will be useful to the labor movement. That means things like identifying the employers that repeatedly violate labor law, or employers that have many but not all units organized.

This is going to take a lot of record linkage within and across datasets. That’s a pain, but I have spent a long time building good record linkage technology, so I think we can do it.

Unfortunately, it’s also going to mean I’ll have to create unique identifiers for business firms and establishments.

I would really, really like to not be in the business of minting unique identifiers. Ideally, there would be an existing registry of corporate filings and establishments that I could match against, so I could use that registry’s identifiers. If that existed, then my work could be linked easily against datasets created by other folks who used the same set of identifiers.

Without that registry, the identifiers I will mint will be of very limited use for connecting my data to others outside the labordata project. Linking any other dataset to mine will be yet another record linkage problem.
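
To make the problem concrete, here is a minimal Python sketch of what minting identifiers after record linkage might look like. It is an illustration under assumptions, not the labordata project's actual method: the record keys, the `mint_identifiers` function, and the choice of random UUIDs are all hypothetical. Matched pairs are merged into clusters with a simple union-find, and each cluster gets a freshly minted identifier.

```python
import uuid


def mint_identifiers(records, matched_pairs):
    """Give every cluster of linked records one shared, newly minted ID.

    `records` is a list of hypothetical record keys, and `matched_pairs`
    is a list of (key, key) tuples that some record linkage step has
    already judged to be the same firm or establishment.
    """
    # Union-find: every record starts out as its own cluster.
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matched pair.
    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    # Mint one random identifier per cluster (an assumption; any scheme
    # that produces unique values would do).
    cluster_ids = {}
    assigned = {}
    for r in records:
        root = find(r)
        if root not in cluster_ids:
            cluster_ids[root] = str(uuid.uuid4())
        assigned[r] = cluster_ids[root]
    return assigned


# Hypothetical record keys from three made-up source datasets.
records = ["a:101", "b:7", "c:55", "a:202"]
matched_pairs = [("a:101", "b:7"), ("b:7", "c:55")]
ids = mint_identifiers(records, matched_pairs)
```

Because these identifiers are random and local to one run, they mean nothing to anyone holding a different dataset, which is why matching against a shared registry would be preferable.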

But, in 2023 we don’t have an open registry of firms or establishments that is anywhere near comprehensive, so I will proceed without one. Alack.

Of course, an open dataset of corporate identifiers could be useful for a ton of goals beyond my own. It’s one of those constant frustrations which, if solved to some satisfaction, could save a lot of people a lot of time. With that in mind, I reached out to Jeremy Singer-Vine of the Data Liberation Project, and we’ve been talking about how we would go about it, if we were foolish enough to build such a registry.

I’m writing this in the spirit of being wrong on the internet, in the hope that you, my dear reader, will let me know how wrong I am.

What this registry would contain

Datasets we can’t use.

So, that’s what’s walled off. If we were to try to build a registry from open, public data, these are the data sources we would use.

Partial datasets that we can use to knit something together.

Please let us know if we are missing any important datasets or initiatives! If you are interested in helping on something like this, let us know.

Addendum

I’ve gotten some great suggestions for other data sources.

