NumPy Notes: Python data analysis

Data analysis
Author

Tom

Published

September 20, 2021

NumPy, short for Numerical Python, has the following advantages:

Pure Python leaves many details to the runtime environment:

NumPy is fast.

import numpy as np
# speed test for numpy
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))
%%time
my_arr2 = my_arr * 2
Wall time: 2.99 ms
%%time
my_list2 = [x * 2 for x in my_list]
Wall time: 114 ms
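
The %%time cell magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library timeit module can be used. The names numpy_time and list_time are illustrative, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

# Run each approach a few times and keep the fastest run.
numpy_time = min(timeit.repeat(lambda: my_arr * 2, number=1, repeat=5))
list_time = min(timeit.repeat(lambda: [x * 2 for x in my_list], number=1, repeat=5))

print(f"NumPy array: {numpy_time * 1e3:.2f} ms")
print(f"Python list: {list_time * 1e3:.2f} ms")
```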

1 ndarray

An ndarray (N-dimensional array) is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type, so operations need no per-element type checking. Every array has a shape, a tuple indicating the size of each dimension; a dtype, an object describing the data type of the array; and an ndim, an integer giving the number of dimensions.

data = np.random.randn(3, 4)
data
array([[ 0.59905431, -1.08465416, -0.95319914,  1.93616798],
       [ 0.92951762,  0.81376066,  0.87067676, -0.05844408],
       [ 0.61867604, -0.78530194, -0.93464763,  0.74309266]])
data.shape
(3, 4)
data.dtype
dtype('float64')
data.ndim
2

1.1 Creating ndarrays

1.1.1 array function

The array function accepts any sequence-like object (list, list of lists, other arrays, etc.) and produces a new NumPy array containing the passed data. Unless a type is explicitly specified, np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object.

# passing list
arr1 = np.array([1, 2, 3, 4])
print(arr1)
print(arr1.shape)
print(arr1.dtype)
print(arr1.ndim)
[1 2 3 4]
(4,)
int32
1
# passing list of lists
arr2 = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(arr2)
print(arr2.shape)
print(arr2.dtype)
print(arr2.ndim)
[[1 2 3]
 [4 5 6]]
(2, 3)
int32
2
# passing array
arr3 = np.array(arr2)
print(arr3)
print(arr3.shape)
print(arr3.dtype)
print(arr3.ndim)
[[1 2 3]
 [4 5 6]]
(2, 3)
int32
2
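
The examples above all rely on dtype inference. A minimal sketch of the other option, passing dtype explicitly, might look like this; the variable names are illustrative.

```python
import numpy as np

# Let NumPy infer the dtype: mixing ints and floats gives float64.
inferred = np.array([1, 2.5, 3])

# Request a dtype explicitly instead of relying on inference.
explicit = np.array([1, 2, 3], dtype=np.float64)

# Nested sequences of equal length become a two-dimensional array.
nested = np.array([(1, 2), (3, 4)], dtype=np.int64)

print(inferred.dtype)              # float64
print(explicit.dtype)              # float64
print(nested.shape, nested.dtype)  # (2, 2) int64
```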

1.1.2 zeros, ones, empty, arange, etc.

Function       Description
array          Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default
asarray        Convert input to ndarray, but do not copy if the input is already an ndarray
arange         Like the built-in range but returns an ndarray instead of a list
ones           Produce an array of all 1s with the given shape and dtype
ones_like      Takes another array and produces a ones array of the same shape and dtype
zeros          Like ones but producing arrays of 0s instead
zeros_like     Like ones_like but producing arrays of 0s instead
empty          Create new arrays by allocating new memory, but do not populate with any values
empty_like     Like ones_like but do not populate with any values
full           Produce an array of the given shape and dtype with all values set to the indicated "fill value"
full_like      Takes another array and produces a filled array of the same shape and dtype
eye, identity  Create a square NxN identity matrix
# np.zeros
np.zeros(4)
array([0., 0., 0., 0.])
# np.ones
np.ones((2, 3))
array([[1., 1., 1.],
       [1., 1., 1.]])
# np.empty returns uninitialized "garbage" values, which can later be populated with data
np.empty((2, 3, 2))
array([[[0.59905431, 1.08465416],
        [0.95319914, 1.93616798],
        [0.92951762, 0.81376066]],

       [[0.87067676, 0.05844408],
        [0.61867604, 0.78530194],
        [0.93464763, 0.74309266]]])
# arange is an array-valued version of the built-in Python range function
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
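
A few of the other functions from the table can be sketched in the same way. The array base and the other variable names are made up for illustration; the last lines show the copy behaviour that separates array from asarray.

```python
import numpy as np

base = np.array([[1, 2, 3], [4, 5, 6]])

zeros_like_base = np.zeros_like(base)  # same shape and dtype as base
filled = np.full((2, 2), 7.5)          # every element set to 7.5
identity = np.eye(3)                   # 3x3 identity matrix

# np.array copies by default; np.asarray reuses an existing ndarray.
copied = np.array(base)
same = np.asarray(base)
print(copied is base)  # False
print(same is base)    # True
```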

1.2 Data Types for ndarrays

The data type, or dtype, is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data. In most cases dtypes map directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and to connect to code written in a low-level language like C or Fortran.
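
As a small sketch of what "interpreting a chunk of memory" means, the snippet below converts an array with astype and then round-trips it through raw bytes. The variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int32)
print(arr.dtype, arr.itemsize)  # int32 4

# By default, astype creates a new array with the requested dtype.
as_float = arr.astype(np.float64)
print(as_float.dtype)           # float64

# The raw bytes carry no type information; the dtype tells NumPy how to read them.
raw = arr.tobytes()
restored = np.frombuffer(raw, dtype=np.int32)
print(len(raw))                 # 12
print(restored.tolist())        # [1, 2, 3]
```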

1.3 Arithmetic with NumPy Arrays

Vectorization - Any arithmetic operation between equal-size arrays is applied element-wise.

arr = np.random.randn(3, 4)
arr
array([[ 0.07338553,  0.8116425 , -2.17800941, -0.32029061],
       [ 0.41584223, -1.32481565, -1.3783163 ,  0.26723131],
       [-0.31385858,  1.30899248,  1.13523462, -0.68452327]])
arr * 2
array([[ 0.14677106,  1.62328501, -4.35601882, -0.64058122],
       [ 0.83168445, -2.64963129, -2.75663261,  0.53446262],
       [-0.62771717,  2.61798496,  2.27046924, -1.36904654]])
arr - arr
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
1 / arr
array([[13.62666475,  1.23206953, -0.45913484, -3.1221646 ],
       [ 2.4047582 , -0.754822  , -0.72552287,  3.74207649],
       [-3.18614831,  0.76394633,  0.88087518, -1.46087072]])

Broadcasting - Arithmetic between arrays of different but compatible shapes, where the smaller array is applied across the larger one.
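
The notes give no example for broadcasting, so here is a minimal sketch with small hand-picked arrays (the names row and col are illustrative).

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # shape (2, 3)
row = np.array([10, 20, 30])      # shape (3,)
col = np.array([[100], [200]])    # shape (2, 1)

# A scalar is broadcast to every element.
print((arr * 2).tolist())    # [[0, 2, 4], [6, 8, 10]]

# row is applied to each of the two rows of arr.
print((arr + row).tolist())  # [[10, 21, 32], [13, 24, 35]]

# col is applied to each of the three columns of arr.
print((arr + col).tolist())  # [[100, 101, 102], [203, 204, 205]]
```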

1.4 Basic Indexing and Slicing

One-dimensional array indexing and slicing act similarly to Python lists.

# indexing
arr = np.arange(10)
print(arr)
print(arr[0])
[0 1 2 3 4 5 6 7 8 9]
0
# slicing
print(arr[1:4])
[1 2 3]

Array slices are views on the original array: the data is not copied, and any modification to the view is reflected in the source array. This design avoids copying large arrays, which saves both memory and time.

arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[1:5] = 10
arr
array([ 0, 10, 10, 10, 10,  5,  6,  7,  8,  9])
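
When an independent copy is wanted instead of a view, the slice can be copied explicitly. A short sketch, with illustrative variable names:

```python
import numpy as np

arr = np.arange(10)

view = arr[1:5]             # a view: shares memory with arr
snapshot = arr[1:5].copy()  # an independent copy

view[:] = 10                # writes through to arr
print(arr.tolist())         # [0, 10, 10, 10, 10, 5, 6, 7, 8, 9]
print(snapshot.tolist())    # [1, 2, 3, 4]
print(np.shares_memory(arr, view))      # True
print(np.shares_memory(arr, snapshot))  # False
```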

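For higher-dimensional arrays, indexing with a single integer returns a lower-dimensional array rather than a scalar, so individual elements can be reached by indexing recursively. Indexing moves first along axis 0 (the “rows” of the array) and then along axis 1 (the “columns”).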

arr = np.arange(9).reshape(3,3)
arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
# indexing along axis 0
arr[1]
array([3, 4, 5])
# recursively indexing
arr[1][0]
3
# easy and equivalent way
arr[1, 0]
3
# indexing with slices, slice along axis 0
arr[:2]
array([[0, 1, 2],
       [3, 4, 5]])
# recursively indexing with slices
arr[:2, :2]
array([[0, 1],
       [3, 4]])
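
Integer indexes and slices can also be mixed. The sketch below (variable names are illustrative) shows how that affects the number of dimensions of the result, and that assigning to a slice expression writes to the whole selection.

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)

second_row_head = arr[1, :2]  # integer + slice drops one dimension
first_column = arr[:, :1]     # slices only: result stays two-dimensional

print(second_row_head.tolist())  # [3, 4]
print(first_column.shape)        # (3, 1)

# Assigning to a slice expression sets every selected element.
arr[:2, 1:] = 0
print(arr.tolist())              # [[0, 0, 0], [3, 0, 0], [6, 7, 8]]
```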

1.5 Boolean Indexing

Selecting data from an array by boolean indexing always creates a copy of the data.

arr = np.array([["Bob", 1, 2, 3],
                ["Luffy", 2, 3, 4],
                ["Joe", 6, 7, 8]])
arr
array([['Bob', '1', '2', '3'],
       ['Luffy', '2', '3', '4'],
       ['Joe', '6', '7', '8']], dtype='<U11')
names = arr[:, 0]
names
array(['Bob', 'Luffy', 'Joe'], dtype='<U11')
luffy_selected = (names == "Luffy")
luffy_selected
array([False,  True, False])
# boolean indexing slice along axis 0, select 'true' rows
arr[luffy_selected]
array([['Luffy', '2', '3', '4']], dtype='<U11')
# select everything except luffy, use != or negate the condition using ~
arr[names != "Luffy"]
array([['Bob', '1', '2', '3'],
       ['Joe', '6', '7', '8']], dtype='<U11')
arr[~luffy_selected]
array([['Bob', '1', '2', '3'],
       ['Joe', '6', '7', '8']], dtype='<U11')
# to combine multiple boolean conditions (here, selecting two of the three names), use boolean operators like & and |
mask = (names=="Bob")|(names=="Joe")
mask
array([ True, False,  True])
arr[mask]
array([['Bob', '1', '2', '3'],
       ['Joe', '6', '7', '8']], dtype='<U11')
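
The copy behaviour is easy to check on a small numeric array. In this sketch (the array data is made up), indexing with a mask returns a copy, while assigning through a mask changes the array in place.

```python
import numpy as np

data = np.array([[-1, 2], [3, -4], [5, 6]])

# Indexing with a boolean mask returns a copy...
positives = data[data > 0]
positives[0] = 99
print(data[0, 1])     # 2, the original is unchanged

# ...but assigning through a mask modifies the array in place.
data[data < 0] = 0
print(data.tolist())  # [[0, 2], [3, 0], [5, 6]]
```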

1.6 Fancy Indexing

  • Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
  • When one index array is passed per axis, the result of fancy indexing is always one-dimensional; a single index array selects whole rows instead.
  • Fancy indexing always copies the data into a new array.
# create a 8x4 array
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr
array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])
# fancy indexing by passing a list
arr[[4, 3, 5, 6]]
array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.]])
# passing multiple index arrays selects a one-dimensional array of elements corresponding to each tuple of indices
# select (1,0), (5,3), (7,1), (2,2)
arr = np.arange(32).reshape(8,4)
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
array([ 4, 23, 29, 10])
# trying to select a rectangular region
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])
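
Another way to select the rectangular region is np.ix_, which turns the two index lists into arrays that broadcast against each other. The sketch below compares it with the chained form and checks that the result is a copy; rows, cols and block are illustrative names.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

# np.ix_ builds index arrays that cover the full rows x cols grid.
block = arr[np.ix_(rows, cols)]
chained = arr[rows][:, cols]
print(np.array_equal(block, chained))  # True

# Fancy indexing copies, so editing the result leaves arr untouched.
block[0, 0] = -1
print(arr[1, 0])                       # 4
```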

1.7 Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the transpose method and also the special T attribute.

arr = np.arange(6).reshape(2,3)
arr
array([[0, 1, 2],
       [3, 4, 5]])
arr.T
array([[0, 3],
       [1, 4],
       [2, 5]])
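
The heading also mentions swapping axes. As a brief sketch, the swapaxes method exchanges two axes and, like T, returns a view; the three-dimensional array cube is made up for illustration.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# The transpose is a view, so writing to it changes arr as well.
transposed = arr.T
transposed[0, 1] = 99
print(arr[1, 0])                        # 99

# swapaxes exchanges two axes and also returns a view.
cube = np.arange(24).reshape(2, 3, 4)
swapped = cube.swapaxes(1, 2)
print(swapped.shape)                    # (2, 4, 3)
print(np.shares_memory(cube, swapped))  # True
```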

2 Universal Functions

A ufunc, short for universal function, is a function that performs element-wise operations on data in ndarrays.

# unary ufuncs
arr = np.arange(10)
np.sqrt(arr)
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])
np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])
# binary ufuncs
x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x, y)
array([ 0.7741251 , -0.13054798,  0.92974564, -0.83160733,  0.90103776,
        0.68551387, -0.34032336, -0.08283191])
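
Because the arrays above are random, the results differ on every run. A sketch with fixed inputs makes the element-wise behaviour easier to verify by hand; it also shows np.modf, a ufunc that returns two arrays.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.5])
y = np.array([0.5, 4.0, -1.0])

print(np.abs(x).tolist())         # [1.0, 2.0, 3.5]
print(np.add(x, y).tolist())      # [1.5, 2.0, 2.5]
print(np.maximum(x, y).tolist())  # [1.0, 4.0, 3.5]

# np.modf splits each value into its fractional and integer parts.
frac, whole = np.modf(np.array([1.25, -2.5]))
print(frac.tolist())              # [0.25, -0.5]
print(whole.tolist())             # [1.0, -2.0]
```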

3 Array-Oriented Programming with Arrays

points = np.arange(-5, 5, 0.01)
xs, ys = np.meshgrid(points, points)
# think of xs as points on the x axis, ys as points on the y axis 
xs
array([[-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       ...,
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99]])
ys
array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
       [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
       [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
       ...,
       [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
       [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
       [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])
z = np.sqrt(xs **2 + ys**2)
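
With a 1000x1000 grid it is hard to see what meshgrid produces. The same computation on a tiny grid, sketched below with made-up coordinates, shows the layout of xs and ys and one value of z that can be checked by hand.

```python
import numpy as np

x = np.array([0.0, 3.0])
y = np.array([0.0, 4.0, 8.0])
xs, ys = np.meshgrid(x, y)

print(xs.shape, ys.shape)  # (3, 2) (3, 2)
print(xs.tolist())         # [[0.0, 3.0], [0.0, 3.0], [0.0, 3.0]]
print(ys.tolist())         # [[0.0, 0.0], [4.0, 4.0], [8.0, 8.0]]

# One vectorized expression evaluates sqrt(x^2 + y^2) at every grid point.
z = np.sqrt(xs ** 2 + ys ** 2)
print(z[1, 1])             # 5.0
```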

3.1 Expressing Conditional Logic as Array Operations

# list comprehension edition for conditional logic
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
result = [(x if c else y) for x,y,c in zip(xarr, yarr, cond)]
result
[1.1, 2.2, 1.3, 1.4, 2.5]
# np.where edition for conditional logic
result = np.where(cond, xarr, yarr)
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
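
The second and third arguments to np.where do not have to be arrays; either can be a scalar. A small sketch with made-up data:

```python
import numpy as np

arr = np.array([[1.5, -0.5], [-2.0, 3.0]])

# Replace positive values with 2 and everything else with -2.
signs = np.where(arr > 0, 2, -2)
print(signs.tolist())     # [[2, -2], [-2, 2]]

# Mix a scalar and an array: set positives to 2.0, keep the other values.
replaced = np.where(arr > 0, 2.0, arr)
print(replaced.tolist())  # [[2.0, -0.5], [-2.0, 2.0]]
```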

3.2 Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible either as methods of the array class or as top-level NumPy functions.

arr = np.random.randn(5, 4)
arr
array([[-0.32270463, -2.47923282,  0.51142065,  1.64402202],
       [-1.11875424, -0.50816377, -0.24412379, -0.35906071],
       [-0.80848479, -1.5290442 ,  0.33861759, -1.84812779],
       [-0.38523178, -1.14234316,  1.07015372, -0.7025341 ],
       [-0.742273  ,  0.62327938, -0.24617117, -0.87927529]])
# find the mean of an array
arr.mean()
-0.45640159378569933
# using the top-level NumPy function to find the mean of an array
np.mean(arr)
-0.45640159378569933
# computing the mean along axis 0 (one value per column)
arr.mean(axis=0)
array([-0.67548969, -1.00710091,  0.2859794 , -0.42899517])
np.mean(arr, axis=0)
array([-0.67548969, -1.00710091,  0.2859794 , -0.42899517])
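
To see what the axis argument does, it helps to use a small array whose statistics can be computed by hand. The following sketch (data chosen for illustration) compares axis=0 with axis=1 and adds a cumulative sum.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.sum())                    # 21
print(arr.mean(axis=0).tolist())    # [2.5, 3.5, 4.5], one value per column
print(arr.mean(axis=1).tolist())    # [2.0, 5.0], one value per row
print(arr.cumsum(axis=1).tolist())  # [[1, 3, 6], [4, 9, 15]]
```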

3.3 Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array.

arr = np.random.randn(100)
(arr > 0).sum() # Number of positive values
51

There are two additional methods, any and all, that are especially useful for boolean arrays. any tests whether one or more values in an array are True, while all checks whether every value is True.

bools = np.array([False, False, True, False])
bools.any()
True
bools.all()
False
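
These methods also accept an axis argument. A short sketch with fixed data, so the counts can be verified by hand (the names positive and grid are illustrative):

```python
import numpy as np

arr = np.array([-1.5, 0.3, 2.0, -0.1, 4.2])
positive = arr > 0
print(positive.sum())  # 3
print(positive.any())  # True
print(positive.all())  # False

# With an axis, the test is applied to each row or column separately.
grid = np.array([[True, False], [True, True]])
print(grid.all(axis=1).tolist())  # [False, True]
print(grid.sum(axis=0).tolist())  # [2, 1]
```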

3.4 Sorting

NumPy arrays can be sorted in-place with the sort method.

arr = np.random.randn(6)
arr
array([-0.64050099,  0.37239892,  0.48466042,  0.0832035 , -0.24079602,
       -0.62832189])
arr.sort()
arr
array([-0.64050099, -0.62832189, -0.24079602,  0.0832035 ,  0.37239892,
        0.48466042])

We can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to sort.

arr = np.random.randn(5, 3)
arr
array([[ 0.22066993,  1.28272713, -2.80933259],
       [ 1.24150303,  0.6821006 ,  0.21857812],
       [ 0.38492004,  2.30910114,  0.354785  ],
       [-0.99229831, -0.81723761,  0.19111813],
       [ 0.46279363,  0.11871894,  0.7152068 ]])
arr.sort(1)
arr
array([[-2.80933259,  0.22066993,  1.28272713],
       [ 0.21857812,  0.6821006 ,  1.24150303],
       [ 0.354785  ,  0.38492004,  2.30910114],
       [-0.99229831, -0.81723761,  0.19111813],
       [ 0.11871894,  0.46279363,  0.7152068 ]])

The top-level function np.sort returns a sorted copy of an array instead of modifying the array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it and select the value at a particular rank.

large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile
-1.5942713719535697
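
The difference between np.sort and the sort method can be sketched on a tiny array. For comparison, the snippet also calls np.quantile, which interpolates between neighbouring values by default, so it can differ slightly from the rank-based shortcut. The data here is made up.

```python
import numpy as np

arr = np.array([3, 1, 2])
sorted_copy = np.sort(arr)   # returns a new array; arr is left alone
print(arr.tolist())          # [3, 1, 2]
print(sorted_copy.tolist())  # [1, 2, 3]

values = np.arange(100)
rank_based = np.sort(values)[int(0.05 * len(values))]
interpolated = np.quantile(values, 0.05)
print(rank_based)            # 5
print(interpolated)          # close to, but not exactly, the rank-based value
```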

3.5 Unique and Other Set Logic

# np.unique returns the sorted unique values in an array
numbers = np.array([1, 2, 3, 3, 4, 4, 6])
np.unique(numbers)
array([1, 2, 3, 4, 6])
# pure python edition
sorted(set(numbers))
[1, 2, 3, 4, 6]
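
The section title mentions other set logic, so here is a brief sketch of a few related NumPy functions on two small made-up arrays.

```python
import numpy as np

a = np.array([1, 2, 3, 3, 4])
b = np.array([3, 4, 5])

print(np.intersect1d(a, b).tolist())  # [3, 4]
print(np.union1d(a, b).tolist())      # [1, 2, 3, 4, 5]
print(np.setdiff1d(a, b).tolist())    # [1, 2]

# np.isin tests, element by element, whether each value of a occurs in b.
print(np.isin(a, b).tolist())         # [False, False, True, True, True]

# np.unique can also report how often each unique value occurs.
values, counts = np.unique(a, return_counts=True)
print(values.tolist(), counts.tolist())  # [1, 2, 3, 4] [1, 1, 2, 1]
```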