21 Efficiency

STATUS: Under construction.

Required reading

Required viewing

Recommended reading

Key concepts/skills/etc

Key libraries

Key functions


21.1 Introduction

21.2 Data efficiency

21.2.1 SQL

21.2.2 Feather

21.3 Code efficiency

By and large, worrying about performance is a waste of time. For the most part you are far better off, just pushing things into the cloud, letting them run for a reasonable time, and using that time to worry about other aspects of your pipeline. However, eventually this becomes unfeasible. For me, this is when something takes more than a day to run because it just becomes a pain. There is rarely a most common area for obvious performance gains. Instead you need to learn to measure and then cut.

Being fast is valuable but it’s mostly about you being able to iterate fast not your code running fast. If you find that the speed at which your code completes is a bottle neck then shard. Then throw more machines at it. Then shard again. Then throw more machines at it.

21.4 Code refactoring

Some baby examples, focused on data science, along the lines of this: https://indrajeetpatil.github.io/Refactoring-ggstatsplot/refactoring-ggstatsplot#1

Start with an example of bad code, and then how it gets fixed.

21.4.1 Measure

Using tic() and tic().


21.5 Experimental efficiency

Multi-armed bandit

21.6 Other languages

21.6.1 Python

21.6.2 Julia

21.7 Exercises and tutorial

21.7.1 Exercises

21.7.2 Tutorial