Python for Data Science Cheat Sheet

My little collection of Python recipes for data science featuring Pandas, Matplotlib, and friends.

Pandas: reading a CSV from a string

In Pandas we can load data from a CSV file with read_csv:

import pandas as pd

df = pd.read_csv('some_file.csv')

Now, it's not uncommon to have some tabular data as a string:

data = """year,income
2014,25699
2015,26888
2016,24699
2017,29541
2018,38122
2019,39611
2020,41598
"""

To load this string as if it were a file, we can wrap it in StringIO from Python's built-in io module:

import pandas as pd
from io import StringIO

data = StringIO("""year,income
2014,25699
2015,26888
2016,24699
2017,29541
2018,38122
2019,39611
2020,41598
""")

df = pd.read_csv(data)
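
One thing worth knowing about this trick: a StringIO object behaves like an open file, so once read_csv has consumed it the cursor sits at the end. The sketch below, which uses a shortened copy of the data and the hypothetical names csv_text and buffer, rewinds the buffer with seek(0) before reading it a second time:

```python
import pandas as pd
from io import StringIO

# Shortened sample of the data above, kept small for illustration.
csv_text = """year,income
2014,25699
2015,26888
"""

buffer = StringIO(csv_text)
first = pd.read_csv(buffer)

# The buffer's cursor is now at the end, so rewind it before reading again.
buffer.seek(0)
second = pd.read_csv(buffer)

# Both reads produce the same DataFrame.
print(first.equals(second))
```

Without the seek(0) call the second read would find nothing left to parse.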

Credits to my friend Ernesto for this tip.

How to plot a CSV with Pandas

Consider this file-like CSV:

data = StringIO("""
year,gross income
2013,21222
2014,12333
2015,15888
2016,18774
2017,19566
2018,36576
2019,34600
2020,41598
""")

To plot this CSV with Pandas we call the plot method on the DataFrame:

import pandas as pd
from io import StringIO
import matplotlib.pyplot as plt

data = StringIO("""
year,gross income
2013,21222
2014,12333
2015,15888
2016,18774
2017,19566
2018,36576
2019,34600
2020,41598
""")

df = pd.read_csv(data)

df.plot()

To actually display the plot, we call show on plt:

import pandas as pd
from io import StringIO
import matplotlib.pyplot as plt

data = StringIO("""
year,gross income
2013,21222
2014,12333
2015,15888
2016,18774
2017,19566
2018,36576
2019,34600
2020,41598
""")

df = pd.read_csv(data)

df.plot()
plt.show()

You can also save the plot with savefig:

# imports and DataFrame setup omitted (same as above)

plt.savefig("plot_csv.png")
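
As an alternative sketch, you can save through the object returned by df.plot() rather than through plt. This avoids depending on which figure is "current", which can matter if savefig ends up being called after show. The Agg backend, the shortened data, and the output path below are assumptions made so the example runs without a display:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd
from io import StringIO

# Shortened sample of the data above.
data = StringIO("""year,gross income
2013,21222
2014,12333
2015,15888
""")
df = pd.read_csv(data)

ax = df.plot()    # df.plot() returns a Matplotlib Axes
fig = ax.figure   # the Figure that owns those Axes

# Hypothetical output location: a file in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "plot_csv_sketch.png")
fig.savefig(out_path)
```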

You'll notice that the resulting picture has two legend labels, taken from the CSV column names. The x axis, however, is tied to the index of each DataFrame row:

[Figure: Plot CSV index]

A DataFrame does in fact have an index, shown in the leftmost column:

>>> df
   year  gross income
0  2013         21222
1  2014         12333
2  2015         15888
3  2016         18774
4  2017         19566
5  2018         36576
6  2019         34600
7  2020         41598

To use the year column for the x axis instead of the index, we can pass the x and y arguments to plot (in this example you can omit y):

# imports and DataFrame setup omitted (same as above)

df.plot(x="year", y="gross income")
plt.savefig("plot_csv.png")

Now the plot is consistent with the dataset:

[Figure: Plot CSV columns]
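
Another way to get the same effect, sketched here on a shortened copy of the data, is to make year the index when loading the CSV. read_csv accepts an index_col argument for this:

```python
import pandas as pd
from io import StringIO

# Shortened sample of the data above.
data = StringIO("""year,gross income
2013,21222
2014,12333
""")

# index_col makes "year" the row index instead of the default 0..n-1 range.
df = pd.read_csv(data, index_col="year")

print(df.index.tolist())
print(df.columns.tolist())
```

Since plot uses the index for the x axis by default, calling df.plot() on this DataFrame draws gross income against year without any extra arguments.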

How to groupby in Pandas

Suppose you've got a CSV with two columns, year and amount:

from io import StringIO
import pandas as pd

data = StringIO("""
Year,Amount
2017, 464.62
2017, 465.29
2017, 465.31
2018, 465.44
2018, 465.88
2020, 334.63
""")

df = pd.read_csv(data)

To compute the total amount per year, you can group by Year and then call sum:

from io import StringIO
import pandas as pd

data = StringIO("""
Year,Amount
2017, 464.62
2017, 465.29
2017, 465.31
2018, 465.44
2018, 465.88
2020, 334.63
""")

df = pd.read_csv(data)

sum_by_year = df.groupby('Year').sum()

This gives you a new DataFrame as expected:

       Amount
Year         
2017  1395.22
2018   931.32
2020   334.63
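
If you need more than the total, the same grouping can feed several aggregations at once. A minimal sketch, assuming you also want the mean and the number of rows per year (the name summary is just for illustration):

```python
import pandas as pd
from io import StringIO

data = StringIO("""Year,Amount
2017,464.62
2017,465.29
2017,465.31
2018,465.44
2018,465.88
2020,334.63
""")
df = pd.read_csv(data)

# Select the Amount column, then compute several statistics per Year.
summary = df.groupby("Year")["Amount"].agg(["sum", "mean", "count"])

print(summary.round(2))
```

The result has one row per year and one column per aggregation.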
Valentino Gagliardi
