Using Algorithms to Normalize Company Names

BY admin

Data inconsistency is a frequent big data problem, especially when you need an effective way to normalize company names. We’ve run into a number of situations where we need to normalize company names in a database for consistency. It’s usually not impossible to precisely sanitize and validate company names on the input side, unless you have a solid dataset of company names like Facebook or LinkedIn that you can validate against. This leads to vast inconsistencies with company name values in most databases and datasets.

Data Normalization

To solve this problem we need a good normalization process. Unfortunately normalizing company names accurately is a difficult task to do well because of the free form nature of a company name. Technically the following are all valid and correct company names:

  • Wal Mart
  • Walmart
  • Wal*mart
  • Wal-Mart Stores, Inc.

Because company names can contain special characters, like Walmart’s * for example, it’s very difficult to accurate map and reduce any specific pattern. Common automated and manual matching practices like search and replace, whitelists/blacklists, and regex pattern matching will only get you so far and quickly become cumbersome to manage and maintain.

Phonetic Algorithms

At AdDaptive we needed a way to normalize an extremely inconsistent database of company names. To do this we first used a manual cleanup process to make sure each company name was legible. Once we had legible names that could be read out loud if necessary we experimented with a few powerful phonetic algorithms. A phonetic search algorithm, sometimes called a fuzzy matching algorithm, is a relatively complex algorithm that indexes a group of words based upon their pronunciation.

Soundex is the most commonly recognized, and simplistic, phonetic algorithm in part because it is part of the standard spec in common database software including DB2, PostgreSQL, MySQL, Ingres, MS SQL Server and Oracle. It was originally created for indexing names by sound, as pronounced in English where homophones are encoded to the same representation in order to be matched even with minor differences in spelling.

Soundex does a good job with simpler datasets so we suggest you start with it. The quality of your results will vary significantly based on the complexity of your company name variations. Depending on the patterns in your database Soundex may not be the best algorithm to normalize company names. Unfortunately these is no answer or simple solution for this, but we suggest that you start with Soundex and then gradually explore through other options if you need more granularity. Here’s a list of the phonetic algorithms that we explored to reduce and normalize company names in our database:

What Worked For Us

So how did we approach it? We started with a particularly dirty database of roughly 2.3 million company names in our database. The approach we took can be broken down into a the following steps:

  1. Manually cleanup company names to provide a phonetically legible string
  2. Determine the phonetic index of each cleaned company name
  3. Reduce the specificity of our phonetic index to provide a looser match (optional)
  4. Group the cleaned company names by phonetic index
  5. Use the shortest company name as the final company value

We wrote the following script to handle this for us our specific use case with Node.js, because JavaScript is fast and asynchronous and Lodash is a great utility for manipulating big data objects and arrays.

Node.js Cleanup Script

// copy original
var original = _.clone( company );

// Manual cleanup rules to provide a phonetically legible string
var c = company;
c = _.trimEnd( c, '"' );
c = _.trimStart( c, '"' );
c = _.deburr( c );
c = _.replace( c, new RegExp( '-[0-9]{12}$', 'i' ), '' );
c = _.replace( c, new RegExp( '^_(private|public)_', 'g' ), '' );
c = splitBy( c, '-- ' );
c = splitBy( c, '--' );
c = splitBy( c, ' - - ' );
c = splitBy( c, '__' );
c = _.trimStart( c, '_' );
c = _.trimStart( c, '-' );
c = _.trimStart( c, '.' );
c = _.trimStart( c, '1 - ' );
c = _.trimStart( c, '0' );
c = _.trimStart( c, '- - ' );
c = _.trimEnd( c, ' -' );
c = _.replace( c, new RegExp( ' - ', 'g' ), ' ' );
c = _.replace( c, new RegExp( '_', 'g' ), ' ' );
c = _.trim( c );

// Determine phonetic index using clean 'c' string
var strictIndex = algorithms.phonetics.caverphone( c );
//var strictIndex = algorithms.phonetics[ algorithm ].call( algorithms, line );

// Most phonetic index specificity increases with each 
// additional character so we can reduce the size of our
// index to provide a looser match
var looseIndex = strictIndex.slice( 0, 8 );

// Save to companies array, grouped by phonetic index
// this represents the output of our normalize company names
// routine
var group = _.get( companies, looseIndex, {
    'in': [],
    'out': []
} ); original );
group.out.push( c );

// Use the shortest company name as the final company value
// ...

The Results

The final script we use to do this streams in company names from a .txt file line by line with the Node.js readline method. The results are pretty efficient and did the job well. Below are the results of running the script against a subset of 20,000 company records to normalize company names:

  • Records: 20001
  • Reduced: 10398
  • Execution Time: 6310ms
  • Algorithm: caverphone

With this combination we were able to normalize company names and reduce our inconsistent company names by a factor of about 50%. 2.3 million company names took about 12-13 minutes to process. Not bad.