18 May 2017

Speeding up Python Pandas with Vectorized Operations

When working on very large datasets, vectorized operations can be orders of magnitude faster than naive row-by-row operations. In my example, the vectorized version was about 200x faster.

Graph showing a ~200x difference between the two modes of operation.

Getting the code into a form I could vectorize made it less clean and readable than one would normally strive for. However, the speed gains are worth it: in this case we can keep working on the data straight from the source rather than having to create an intermediate representation for continued experimentation.

The example below comes from a real problem of ours: the source data for flight itineraries is stored as a single comma-delimited string, but for analysis we'd like to see it leg by leg.
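
To make the shape of the problem concrete first, here is a toy sketch of the two styles. It is not the production code. It assumes a hypothetical DataFrame with a unique `trip_id` per row and an `itinerary` column holding comma-delimited airport codes, and it treats a leg as a pair of consecutive stops. The column names, the sample data and both function names are made up for illustration, and `explode` needs a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical sample data: one row per trip, stops packed into one string.
df = pd.DataFrame({
    "trip_id": [1, 2],
    "itinerary": ["LHR,JFK,SFO", "CDG,DXB"],
})


def legs_row_by_row(frame):
    """Naive style: loop over rows in Python and build one record per leg."""
    records = []
    for _, row in frame.iterrows():
        stops = row["itinerary"].split(",")
        # Pair each stop with the one that follows it.
        for origin, destination in zip(stops[:-1], stops[1:]):
            records.append({
                "trip_id": row["trip_id"],
                "origin": origin,
                "destination": destination,
            })
    return pd.DataFrame(records)


def legs_vectorized(frame):
    """Column-wise style: split every string at once, then reshape."""
    # One list of stops per trip, then one row per stop.
    stops = (
        frame.assign(origin=frame["itinerary"].str.split(","))
        .explode("origin")
        .reset_index(drop=True)
    )
    # The destination of a leg is the next stop within the same trip
    # (assumes trip_id is unique per itinerary).
    stops["destination"] = stops.groupby("trip_id")["origin"].shift(-1)
    # The last stop of each trip has no onward leg, so drop it.
    legs = stops.dropna(subset=["destination"])
    return legs[["trip_id", "origin", "destination"]].reset_index(drop=True)


print(legs_row_by_row(df))
print(legs_vectorized(df))
```

Both functions should produce the same leg-by-leg table. On two rows there is no difference worth measuring, so any speedup has to be timed on your own data.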