Python for Data Science Cheat Sheet
My little collection of Python recipes for data science featuring Pandas, Matplotlib, and friends.
Pandas: reading a CSV from a string
In Pandas we can load data from a CSV file with read_csv:
import pandas as pd
df = pd.read_csv('some_file.csv')
Now, it's not uncommon to have some tabular data as a string:
data = """year,income
2014,25699
2015,26888
2016,24699
2017,29541
2018,38122
2019,39611
2020,41598
"""
To load this string as if it were a file, we can wrap it in StringIO from Python's built-in io module:
import pandas as pd
from io import StringIO
data = StringIO("""year,income
2014,25699
2015,26888
2016,24699
2017,29541
2018,38122
2019,39611
2020,41598
""")
df = pd.read_csv(data)
Credits to my friend Ernesto for this tip.
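If the CSV text already lives in a plain string variable, the same trick applies: wrap the variable in StringIO at the call site. The sketch below is only an illustration of that; the variable name csv_text and the shortened data are assumptions made for the example.

```python
import pandas as pd
from io import StringIO

# Hypothetical variable holding CSV text (for example, pasted into a script).
csv_text = """year,income
2014,25699
2015,26888
2016,24699
"""

# read_csv treats a plain string as a file path, so wrap the text in StringIO
# to hand it over as a file-like object instead.
df = pd.read_csv(StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names taken from the header line
```

Keeping the raw text in its own variable makes it easy to build a fresh StringIO each time, since a StringIO object is consumed once it has been read.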
How to plot a CSV with Pandas
Consider this file-like CSV:
data = StringIO("""
year,gross income
2013,21222
2014,12333
2015,15888
2016,18774
2017,19566
2018,36576
2019,34600
2020,41598
""")
To plot this CSV with Pandas we call the plot method on the DataFrame:
import pandas as pd
from io import StringIO
import matplotlib.pyplot as plt
data = StringIO("""
year,gross income
2013,21222
2014,12333
2015,15888
2016,18774
2017,19566
2018,36576
2019,34600
2020,41598
""")
df = pd.read_csv(data)
df.plot()
Calling plot alone does not necessarily open a window. To display the plot we call show on plt:
import pandas as pd
from io import StringIO
import matplotlib.pyplot as plt
data = StringIO("""
year,gross income
2013,21222
2014,12333
2015,15888
2016,18774
2017,19566
2018,36576
2019,34600
2020,41598
""")
df = pd.read_csv(data)
df.plot()
plt.show()
You can also save the plot with savefig:
# imports, data, and df.plot() as above (omitted)
plt.savefig("plot_csv.png")
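In a script or on a machine without a display, the picture can be saved without ever calling show. The following is a sketch under that assumption: the Agg backend, the temporary output directory, and the dpi and bbox_inches values are illustrative choices, not something the recipe above requires.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed

import os
import tempfile
from io import StringIO

import matplotlib.pyplot as plt
import pandas as pd

data = StringIO("""year,gross income
2013,21222
2014,12333
2015,15888
""")
df = pd.read_csv(data)

# DataFrame.plot returns the Matplotlib Axes it drew on.
ax = df.plot()

# Save through the figure that owns the Axes; dpi and bbox_inches are optional.
out_path = os.path.join(tempfile.gettempdir(), "plot_csv.png")
ax.figure.savefig(out_path, dpi=150, bbox_inches="tight")

plt.close(ax.figure)  # release the figure once it is written to disk
```

Saving through ax.figure rather than plt.savefig avoids depending on which figure happens to be current, which matters once a script draws more than one plot.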
Notice that the resulting picture contains two lines, one per CSV column (year and gross income), each labelled in the legend. The x axis, however, shows the index of each DataFrame row rather than the year.
A DataFrame in fact has an index, printed as the leftmost column:
>>> df
   year  gross income
0  2013         21222
1  2014         12333
2  2015         15888
3  2016         18774
4  2017         19566
5  2018         36576
6  2019         34600
7  2020         41598
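To see where those row numbers come from, it can help to inspect the index directly. The sketch below uses a shortened copy of the same data. It also shows set_index as one possible way to promote year to the index; that is an extra option mentioned here for illustration, and the name by_year is made up.

```python
import pandas as pd
from io import StringIO

data = StringIO("""year,gross income
2013,21222
2014,12333
2015,15888
""")
df = pd.read_csv(data)

# Without further instructions, read_csv numbers the rows 0, 1, 2, ...
print(df.index)        # a RangeIndex
print(list(df.index))  # [0, 1, 2]

# One option: build a new DataFrame whose index is the year column.
by_year = df.set_index("year")
print(list(by_year.index))    # [2013, 2014, 2015]
print(list(by_year.columns))  # ['gross income']
```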
To use the year column instead of the index for the x axis, we can pass the x and y arguments to plot (in this example you can omit y):
# imports and data as above (omitted)
df.plot(x="year", y="gross income")
plt.savefig("plot_csv.png")
Now the plot is consistent with the dataset: the years run along the x axis and gross income along the y axis.
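A quick way to confirm what ends up on each axis, without looking at the picture, is to inspect the line that plot drew. This is a small sketch on a shortened copy of the data; the Agg backend is an assumption so that it runs without a display, and the names x_values and y_values are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no window is opened

from io import StringIO

import matplotlib.pyplot as plt
import pandas as pd

data = StringIO("""year,gross income
2013,21222
2014,12333
2015,15888
""")
df = pd.read_csv(data)

ax = df.plot(x="year", y="gross income")

# With x given, a single line is drawn and its x values are the years.
lines = ax.get_lines()
x_values = [int(v) for v in lines[0].get_xdata()]
y_values = [int(v) for v in lines[0].get_ydata()]
print(len(lines))
print(x_values)
print(y_values)

plt.close(ax.figure)
```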
How to groupby in Pandas
Suppose you've got a CSV with two columns, Year and Amount:
from io import StringIO
import pandas as pd
data = StringIO("""
Year,Amount
2017, 464.62
2017, 465.29
2017, 465.31
2018, 465.44
2018, 465.88
2020, 334.63
""")
df = pd.read_csv(data)
To compute the total amount per year, you can group by Year and then call sum:
from io import StringIO
import pandas as pd
data = StringIO("""
Year,Amount
2017, 464.62
2017, 465.29
2017, 465.31
2018, 465.44
2018, 465.88
2020, 334.63
""")
df = pd.read_csv(data)
sum_by_year = df.groupby('Year').sum()
This gives you a new DataFrame as expected:
       Amount
Year
2017  1395.22
2018   931.32
2020   334.63
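Two variations tend to come up right after this. The sketch below, which assumes the same data as above, selects the Amount column explicitly and keeps Year as a regular column with as_index=False, then computes several aggregates at once with agg. The variable names flat and stats are made up for the example.

```python
from io import StringIO

import pandas as pd

data = StringIO("""Year,Amount
2017,464.62
2017,465.29
2017,465.31
2018,465.44
2018,465.88
2020,334.63
""")
df = pd.read_csv(data)

# Keep Year as an ordinary column instead of turning it into the index.
flat = df.groupby("Year", as_index=False)["Amount"].sum()
print(flat)

# Several aggregates per year in one pass.
stats = df.groupby("Year")["Amount"].agg(["sum", "mean", "count"])
print(stats)
```

The as_index=False form is handy when the result is going straight back into read_csv-style tabular output or into plot(x="Year"), since Year stays addressable by name.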