Data Visualization in Python

What is this course about

This course teaches an important aspect of data science: data visualization. A picture is worth a thousand words.

  • Learn how to parse a CSV dataset from the web
  • Learn the basics of the Pandas DataFrame
  • Learn to use Basemap
  • Create a project that visualizes global earthquake activity

Global Earthquake Activities

Earthquake map

What kinds of data do you think are needed to draw this map?

Finding data

Search Google for "USGS data feed"

CSV-format data feed

USGS data page

Taking a look at the data ...

Past 7-days M2.5+ Earthquakes

https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_week.csv

Looking at CSV data

And the description of the data fields

https://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php

Does it have the kind of data you need to draw the activity map?

Introducing Pandas and the Pandas DataFrame

import pandas as pd
df_7d  = pd.read_csv(
  'https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv')
df_30d = pd.read_csv(
  'https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv')
  • Pandas is an essential Python library for data scientists.
  • The DataFrame (often named df in code) is the centerpiece of Pandas.
  • The code above calls the read_csv function, which takes the URL of the CSV data and returns a DataFrame.
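
To try read_csv without a network connection, here is a minimal sketch that parses a small in-memory CSV instead of the USGS URL. The three rows and the name sample_df are made up for illustration; only a few of the feed's real column names are used.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like the USGS feed
# (only a few of the real columns; the values are made up).
sample_csv = """time,latitude,longitude,depth,mag
2020-01-01T00:00:00.000Z,41.0,-73.8,10.0,2.7
2020-01-01T01:00:00.000Z,39.9,116.4,35.5,4.9
2020-01-01T02:00:00.000Z,1.35,103.8,120.0,5.6
"""

# read_csv accepts a URL, a file path, or any file-like object
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(type(sample_df))   # a pandas DataFrame
print(sample_df.dtypes)  # the decimal columns are parsed as float64
```

The first line of the CSV becomes the column names, which is why the USGS columns can later be selected by name.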

Examining the data

You can peek at a dataframe's content by simply evaluating it in a notebook cell, or by calling its .head() method to see only the first few rows.

Pandas read_csv()

Shapes and column names

df_7d.shape

df_30d.columns

DataFrame shape column
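
A small sketch of what shape, columns and head() return, using a made-up four-row DataFrame (toy_df and its values are assumptions for illustration, not USGS data):

```python
import pandas as pd

# Made-up values, using four of the column names from the USGS feed
toy_df = pd.DataFrame({
    'time': ['t0', 't1', 't2', 't3'],
    'latitude': [41.0, 39.9, 1.35, -33.4],
    'longitude': [-73.8, 116.4, 103.8, -70.6],
    'mag': [2.7, 4.9, 5.6, 3.1],
})

n_rows, n_cols = toy_df.shape         # shape is a (rows, columns) tuple
column_names = list(toy_df.columns)   # columns is an Index; list() gives a plain list
first_two = toy_df.head(2)            # head(n) returns the first n rows (default is 5)

print(n_rows, n_cols)
print(column_names)
print(first_two)
```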

Selecting data using column headings

df_7d['time']   # <-- or df_7d.time

DF select column

Selecting multiple columns

map_df_7d = df_7d[['time', 'longitude', 'latitude', 'mag']]
map_df_7d.head()

DF select multiple columns
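
The difference between single and double brackets is easy to miss, so here is a short sketch on a made-up DataFrame (toy_df is hypothetical). A single column name gives a Series; a list of names gives a smaller DataFrame.

```python
import pandas as pd

# Made-up values for illustration
toy_df = pd.DataFrame({
    'longitude': [-73.8, 116.4, 103.8],
    'latitude': [41.0, 39.9, 1.35],
    'mag': [2.7, 4.9, 5.6],
})

one_col = toy_df['mag']                      # single name -> Series
same_col = toy_df.mag                        # attribute access -> the same Series
sub_df = toy_df[['longitude', 'latitude']]   # list of names -> DataFrame

print(type(one_col).__name__)
print(type(sub_df).__name__)
```

Attribute access (toy_df.mag) only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.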

DataFrame is iterable over rows

for index, row in map_df_7d.iterrows():
    print(index)
    print(row['time'])

DF iterable
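
For comparison, a sketch of the same kind of row loop written with itertuples(), which yields one namedtuple per row so fields are read as attributes. The DataFrame toy_df and the 4.5 threshold are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration
toy_df = pd.DataFrame({'time': ['t0', 't1', 't2'], 'mag': [2.7, 4.9, 5.6]})

# iterrows() yields (index, Series) pairs
for index, row in toy_df.iterrows():
    print(index, row['time'], row['mag'])

# itertuples() yields one namedtuple per row; fields are attributes
strong = []
for row in toy_df.itertuples(index=False):
    if row.mag >= 4.5:
        strong.append(row.time)

print(strong)
```

Row-by-row loops are fine for small data like this. For larger datasets, column-wise operations are usually faster.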

Plotting histograms of magnitudes

Histograms are an excellent way to visualize how often each range of values occurs in the data.

Histograms overlapping

Histogram code

from matplotlib import pyplot as plt
%matplotlib inline
# 'seaborn' is just a style name here; Matplotlib 3.6+ renamed it to 'seaborn-v0_8'
plt.style.use('seaborn')
plt.figure(figsize=(10, 5))
plt.hist(df_7d.mag, bins=50, alpha=0.3, label='7-day')
plt.hist(df_30d.mag, bins=50, alpha=0.3, label='30-day')
plt.ylabel('Occurrences')
plt.xlabel('Magnitude')
plt.title('7-day and 30-day earthquake magnitudes')
plt.legend()  # <-- draw the legend box. Must have `label` param in `hist()`
plt.show()
  • For available plot styles, see https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html
  • bins - the number of equal-width bins to divide the data range into
  • alpha - opacity, from 0 (fully transparent) to 1 (fully opaque)
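
To see what the bins parameter does without drawing anything, here is a sketch using NumPy's histogram function, which performs the same kind of equal-width binning. The magnitudes are made up for illustration.

```python
import numpy as np

# Made-up magnitudes, only for illustration
mags = np.array([0.5, 0.9, 1.1, 1.2, 1.8, 2.4, 2.6, 3.3, 4.1, 4.9])

# bins=5 splits the range [min, max] into 5 equal-width intervals
counts, edges = np.histogram(mags, bins=5)

print(edges)   # 6 edges delimit 5 bins
print(counts)  # number of values falling into each bin
```

The counts always add up to the number of data points, so using more bins gives finer detail but fewer points per bar.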

Filtering data based on certain criteria

For example, suppose you want to study earthquakes of at least M2.5 but weaker than M5.0, and compare their count with the minor ones below M2.5.

minor_df = df_30d[df_30d.mag < 2.5]
print('Minor earthquake count:', minor_df.shape[0])
medium_df = df_30d[ (df_30d.mag >= 2.5) & (df_30d.mag < 5) ]
print('Medium earthquake count:', medium_df.shape[0])

Dataframe filter data
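
The filter works because a comparison on a column produces a boolean mask. A minimal sketch on made-up magnitudes (toy_df is hypothetical) shows the mask itself:

```python
import pandas as pd

# Made-up magnitudes for illustration
toy_df = pd.DataFrame({'mag': [1.2, 2.5, 3.8, 5.0, 6.1]})

# A comparison on a column yields a boolean Series (one True/False per row).
# Each comparison needs its own parentheses because & binds tighter than >= and <.
is_medium = (toy_df.mag >= 2.5) & (toy_df.mag < 5)
print(is_medium.tolist())

# Indexing with the mask keeps only the rows where it is True
medium = toy_df[is_medium]

# ~ negates a mask: everything that is not medium
not_medium = toy_df[~is_medium]

print(len(medium), len(not_medium))
```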

Exercise: Plot a histogram of ...

  • 30-day earthquake depth

Histogram 30-day depth

Answer:

plt.style.use('seaborn')
plt.figure(figsize=(10, 5))

plt.hist(df_30d.depth, bins=50, alpha=0.3)
plt.ylabel('Occurrences')
plt.xlabel('Depth in km')
plt.title('30-day earthquake depth')
plt.show()

Exercise: Plot a histogram of ...

  • 30-day earthquake depth less than 100km

Histogram 30-day depth 2

Answer:

plt.style.use('seaborn')
plt.figure(figsize=(10, 5))

plt.hist(df_30d[df_30d.depth < 100].depth, bins=50, alpha=0.3)
plt.ylabel('Occurrences')
plt.xlabel('Depth in km')
plt.title('30-day earthquake depth < 100km')
plt.show()

Break

Introducing Basemap

Making maps using Basemap

%matplotlib inline
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

my_map = Basemap(projection='ortho', lat_0=50, lon_0=-100, resolution='l')
my_map.drawcoastlines()  
plt.show()

Basemap first map

Choosing a projection for World Map

my_map = Basemap(projection='eck4', lat_0=0, lon_0=-120, resolution='l')

Basemap eck4

Adding more details

%matplotlib inline
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
plt.figure(figsize=(18,9))  # <-- set a larger canvas size

my_map = Basemap(projection='eck4', lat_0=0, lon_0=-90, resolution='l')

my_map.drawmapboundary(fill_color='aqua')
my_map.drawcoastlines()
my_map.drawcountries()
my_map.fillcontinents(color='coral', lake_color='aqua')

plt.title('A blank map')
plt.show()

How it looks

Basemap more details

Putting Ardsley, NY on the map!

The coordinates of Ardsley, NY are (41.0107° N, 73.8437° W).

...
longitude = -73.8437
latitude  =  41.0107
# convert lon and lat to a (x,y) coordinate using the map's projection type
x, y = my_map(longitude, latitude)   
my_map.scatter(x, y, color='red', marker='o', s=64,
               zorder=2)    # <- zorder=2 makes the marker show up on top
plt.show()

You will sometimes see people pass longitude and latitude directly to a plotting method by setting latlon=True, which lets Basemap do the projection for them.

...
longitude = -73.8437
latitude  =  41.0107
my_map.scatter(longitude, latitude, latlon=True, color='red', marker='o', 
               s=64, zorder=2)
plt.show()
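
To build intuition for what "converting lon and lat to (x, y)" means, here is a standalone sketch of a map projection using the plain spherical Mercator formula. This is not Basemap's code and not the eck4 projection used in these slides; the function name mercator_xy and the Earth radius constant are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (spherical approximation)

def mercator_xy(longitude, latitude):
    """Project lon/lat in degrees to metres with the spherical Mercator formula."""
    lon_rad = math.radians(longitude)
    lat_rad = math.radians(latitude)
    x = EARTH_RADIUS_M * lon_rad
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + lat_rad / 2))
    return x, y

# Ardsley, NY: west of Greenwich (negative x), north of the equator (positive y)
x, y = mercator_xy(-73.8437, 41.0107)
print(x, y)
```

Every projection is a function of this kind: two angles in, two flat-map distances out. Basemap picks the formula from the projection argument and, as far as I know, shifts the result so the map's lower-left corner is the origin, so its numbers will differ from this sketch.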

How it looks

Basemap Ardsley

Marking multiple locations

Matplotlib deals with lists (or list-like containers) of data seamlessly.

Let's plot two more cities on the map, with names next to the marker:

Beijing: 39.9042° N, 116.4074° E

Singapore: 1.3521° N, 103.8198° E

longitudes = [-73.8437, 116.4074, 103.8198]
latitudes  = [ 41.0107,  39.9042,   1.3521]
names      = ['Ardsley', 'Beijing', 'Singapore']

xs, ys = my_map(longitudes, latitudes)
my_map.scatter(xs, ys, color='red', marker='o', s=64, zorder=2)

for x, y, name in zip(xs, ys, names):
    plt.text(x, y, name, fontsize=12, color='black')
  • Note that some functions, such as text(), only take projected coordinates.

How it looks

Basemap multiple locations

Putting earthquake data on the map

%matplotlib inline
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
plt.figure(figsize=(18,9))

my_map = Basemap(projection='eck4', lat_0=0, lon_0=-90, resolution='l')

my_map.drawmapboundary(fill_color='aqua')
my_map.drawcoastlines()
my_map.drawcountries()
my_map.fillcontinents(color='coral', lake_color='aqua')

#---- This is the important part
data = map_df_7d[map_df_7d.mag > 2.5]
xs, ys = my_map(data['longitude'].tolist(), data['latitude'].tolist())
my_map.scatter(xs, ys, color='yellow', marker='o', s=64, zorder=2)
#---- End of important part

plt.title('Map with 7-day earthquakes > M2.5 marked')
plt.show()

How it looks

Basemap add earthquake data

Python List comprehension

List comprehension is a very handy and intuitive way to create a list based on another list.

Suppose you, the teacher, have the list of points (out of 100) that everyone in the class scored on the last exam. What the class doesn't know is that instead of marking their papers, you randomly generated their scores. Ha!

import random
points = random.sample(range(40, 101), 25)
points

List random sample
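
A short sketch of two details worth knowing about the random module here: sample() never repeats a value, and seeding makes the "scores" reproducible. The names unique_points and repeatable_points and the seed value are arbitrary choices for illustration.

```python
import random

random.seed(42)  # fixed seed so the same "scores" come out on every run

# sample() draws without replacement: 25 distinct scores from 40..100
unique_points = random.sample(range(40, 101), 25)

# choices() draws with replacement: repeated scores are possible
repeatable_points = random.choices(range(40, 101), k=25)

print(unique_points)
print(repeatable_points)
```

Since real exam scores often repeat, choices() may be the more realistic way to fake them.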

List comprehension - 2

You want to assign a letter grade from A to F for each score. The grade ranges are as follows: F < 50, E < 60, D < 70, C < 80, B < 90, A otherwise. This rule can be expressed in Python as:

def assign_grade(point):
    if point < 50:
        return 'F'
    elif point < 60:
        return 'E'
    elif point < 70:
        return 'D'
    elif point < 80:
        return 'C'
    elif point < 90:
        return 'B'
    else:
        return 'A'

List comprehension - 3

Now, we use list comprehension to produce the list of letter grades -

grades = [assign_grade(p) for p in points]
grades

List comprehension
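
A list comprehension is shorthand for a loop that appends to a new list. The sketch below shows both forms side by side, plus the optional if clause for filtering. The scores and the 5-point curve are made up for illustration.

```python
sample_points = [45, 58, 67, 73, 88, 95]  # made-up scores for illustration

# A plain loop that builds a new list ...
curved_loop = []
for p in sample_points:
    curved_loop.append(min(p + 5, 100))  # add 5 points, capped at 100

# ... and the equivalent list comprehension
curved = [min(p + 5, 100) for p in sample_points]

# An optional `if` clause filters items: keep only scores of 70 or more
passed = [p for p in sample_points if p >= 70]

print(curved)
print(passed)
```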

Exercise: Generate marker colors

The rule: green for < M3.0, yellow for < M5.0, red otherwise

The data: all earthquakes > M2.5

Hint:

def get_marker_color(magnitude):
    ...  # EXPRESS THE RULE USING IF..ELIF..ELSE

data = map_df_7d[map_df_7d.mag > 2.5]

intensity_colors = ...  # USE LIST COMPREHENSION

Answer:

def get_marker_color(magnitude):
    if magnitude < 3: 
        return 'green'
    elif magnitude < 5.0:
        return 'yellow'
    else:
        return 'red'

data = map_df_7d[map_df_7d.mag > 2.5]
intensity_colors = [get_marker_color(m) for m in data['mag']]

List marker color
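
A quick way to sanity-check the color rule before plotting is to count how many markers get each color. This sketch repeats the rule from the answer above and applies it to made-up magnitudes; the use of collections.Counter is a suggestion, not part of the original workshop.

```python
from collections import Counter

def get_marker_color(magnitude):
    # Same rule as in the answer: green < M3.0, yellow < M5.0, red otherwise
    if magnitude < 3:
        return 'green'
    elif magnitude < 5.0:
        return 'yellow'
    else:
        return 'red'

sample_mags = [2.6, 2.9, 3.0, 4.4, 5.0, 6.3]  # made-up magnitudes
sample_colors = [get_marker_color(m) for m in sample_mags]
color_counts = Counter(sample_colors)

print(sample_colors)
print(color_counts)
```

Note that the boundary values 3.0 and 5.0 fall into the higher category because the comparisons use strict less-than.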

Using generated marker colors

...
data = map_df_7d[map_df_7d.mag > 2.5]
intensity_colors = [get_marker_color(m) for m in data['mag']]

xs, ys = my_map(data['longitude'].tolist(), data['latitude'].tolist())
my_map.scatter(xs, ys, color=intensity_colors, marker='o', s=64, zorder=2)
...

How it looks

Basemap marker color

Exercise: Generate marker size

Note: for an earthquake of magnitude m, use a marker size of m**4, or m to the 4th power.

Answer:

def get_marker_size(magnitude):
    return magnitude**4

...
data = map_df_7d[map_df_7d.mag > 2.5]
intensity_colors = [get_marker_color(m) for m in data['mag']]
intensity_sizes = [get_marker_size(m) for m in data['mag']]

xs, ys = my_map(data['longitude'].tolist(), data['latitude'].tolist())
my_map.scatter(xs, ys, color=intensity_colors, marker='o',
               s=intensity_sizes, zorder=2, alpha=0.6)
...

How it looks

Basemap marker color size
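
To get a feel for how quickly m**4 grows, here is a small sketch that prints the marker size for a few made-up magnitudes. In Matplotlib's scatter, the s parameter is the marker area in points squared, so these numbers are areas rather than diameters.

```python
def get_marker_size(magnitude):
    # Same rule as in the answer: size grows with the 4th power of magnitude
    return magnitude ** 4

for m in [2.5, 5.0, 7.0]:  # made-up magnitudes for illustration
    print(m, get_marker_size(m))

# Doubling the magnitude multiplies the marker area by 2**4 = 16
ratio = get_marker_size(5.0) / get_marker_size(2.5)
print(ratio)
```

The steep growth is what makes the few large earthquakes stand out among the many small ones, and the alpha=0.6 in the scatter call keeps the big markers from hiding their neighbours.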

Congratulations, you've completed this workshop!