import numpy as np
NumPy, short for Numerical Python, offers the following advantages:
- Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, function application)
Pure Python leaves many details to the runtime environment:
- specifying variable types
- memory allocation/deallocation, etc.
NumPy is fast.
- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python for loops.
# speed test for numpy
my_arr = np.arange(1000_000)
my_list = list(range(1000_000))
%%time
my_arr2 = my_arr * 2
Wall time: 2.99 ms
%%time
my_list2 = [x * 2 for x in my_list]
Wall time: 114 ms
1 ndarray
An ndarray (N-dimensional array) is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type, which is what lets NumPy skip per-element type checking. Every array has a shape, a tuple indicating the size of each dimension; a dtype, an object describing the data type of the array; and an ndim, an integer giving the number of dimensions.
data = np.random.randn(3, 4)
data
array([[ 0.59905431, -1.08465416, -0.95319914, 1.93616798],
[ 0.92951762, 0.81376066, 0.87067676, -0.05844408],
[ 0.61867604, -0.78530194, -0.93464763, 0.74309266]])
data.shape
(3, 4)
data.dtype
dtype('float64')
data.ndim
2
1.1 Creating ndarrays
1.1.1 array function
The array function accepts any sequence-like object (list, list of lists, other arrays, etc.) and produces a new NumPy array containing the passed data. Unless explicitly specified, np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object.
# passing list
arr1 = np.array([1, 2, 3, 4])
print(arr1)
print(arr1.shape)
print(arr1.dtype)
print(arr1.ndim)
[1 2 3 4]
(4,)
int32
1
# passing list of lists
arr2 = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(arr2)
print(arr2.shape)
print(arr2.dtype)
print(arr2.ndim)
[[1 2 3]
[4 5 6]]
(2, 3)
int32
2
# passing array
arr3 = np.array(arr2)
print(arr3)
print(arr3.shape)
print(arr3.dtype)
print(arr3.ndim)
[[1 2 3]
[4 5 6]]
(2, 3)
int32
2
1.1.2 zeros, ones, empty, arange, etc.
Function | Description |
---|---|
array | Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default |
asarray | Convert input to ndarray, but do not copy if the input is already an ndarray |
arange | Like the built-in range but returns an ndarray instead of a list |
ones | Produce an array of all 1s with the given shape and dtype |
ones_like | Takes another array and produces a ones array of the same shape and dtype |
zeros | Like ones but producing arrays of 0s instead |
zeros_like | Like ones_like but producing arrays of 0s instead |
empty | Create new arrays by allocating new memory, but do not populate with any values |
empty_like | Like ones_like but do not populate with any values |
full | Produce an array of the given shape and dtype with all values set to the indicated “fill value” |
full_like | full_like takes another array and produces a filled array of the same shape and dtype |
eye, identity | Create a square NxN identity matrix |
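A few rows of the table can be sketched as follows. The variable names (base, copied, aliased, filled) are illustrative only and do not come from the original text.

```python
import numpy as np

base = np.array([[1, 2, 3], [4, 5, 6]])

# np.array copies its input by default; np.asarray reuses an existing ndarray
copied = np.array(base)
aliased = np.asarray(base)
print(copied is base)    # False
print(aliased is base)   # True

# The *_like functions take shape and dtype from another array
ones = np.ones_like(base)
print(ones.shape, ones.dtype == base.dtype)

# full fills an array of the given shape with a single value
filled = np.full((2, 2), 7.5)
print(filled)

# eye and identity both build a square identity matrix
print(np.array_equal(np.eye(3), np.identity(3)))   # True
```

The copy/no-copy difference matters when the result is modified afterwards: writing into aliased would also change base, while writing into copied would not.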
# np.zeros
np.zeros(4)
array([0., 0., 0., 0.])
# np.ones
np.ones((2, 3))
array([[1., 1., 1.],
[1., 1., 1.]])
# np.empty returns uninitialized "garbage" values, which can later be populated with data
np.empty((2, 3, 2))
array([[[0.59905431, 1.08465416],
[0.95319914, 1.93616798],
[0.92951762, 0.81376066]],
[[0.87067676, 0.05844408],
[0.61867604, 0.78530194],
[0.93464763, 0.74309266]]])
# arange is an array-valued version of the built-in Python range function
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
1.2 Data Types for ndarrays
The data type, or dtype, is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data. In most cases dtypes map directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran.
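As a minimal sketch of how a dtype controls the interpretation of memory, the snippet below requests dtypes explicitly and converts between them. The astype method and the itemsize attribute are not discussed in the text above; they are shown here only for illustration.

```python
import numpy as np

# Request a dtype explicitly instead of letting np.array infer it
floats = np.array([1, 2, 3], dtype=np.float64)
small_ints = np.array([1, 2, 3], dtype=np.int8)
print(floats.dtype, small_ints.dtype)        # float64 int8

# itemsize is the number of bytes used per element
print(floats.itemsize, small_ints.itemsize)  # 8 1

# astype returns a new array by default; float to int drops the fractional part
truncated = np.array([1.7, -2.2, 3.9]).astype(np.int32)
print(truncated.tolist())                    # [1, -2, 3]
```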
1.3 Arithmetic with NumPy Arrays
Vectorization - Any arithmetic operation between equal-size arrays is applied element-wise, with no explicit loop.
arr = np.random.randn(3, 4)
arr
array([[ 0.07338553, 0.8116425 , -2.17800941, -0.32029061],
[ 0.41584223, -1.32481565, -1.3783163 , 0.26723131],
[-0.31385858, 1.30899248, 1.13523462, -0.68452327]])
arr * 2
array([[ 0.14677106, 1.62328501, -4.35601882, -0.64058122],
[ 0.83168445, -2.64963129, -2.75663261, 0.53446262],
[-0.62771717, 2.61798496, 2.27046924, -1.36904654]])
arr - arr
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
1 / arr
array([[13.62666475, 1.23206953, -0.45913484, -3.1221646 ],
[ 2.4047582 , -0.754822 , -0.72552287, 3.74207649],
[-3.18614831, 0.76394633, 0.88087518, -1.46087072]])
Broadcasting - Operations between differently sized arrays, where the smaller array is stretched to match the shape of the larger one.
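Broadcasting has no example above, so here is a small sketch. The arrays row and col are hypothetical; they are chosen so that the stretched dimension is easy to see.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# A scalar is broadcast to every element
plus_one = arr + 1

# A length-4 row is broadcast across the 3 rows
row = np.array([10, 20, 30, 40])
by_row = arr + row

# A (3, 1) column is broadcast across the 4 columns
col = np.array([[100], [200], [300]])
by_col = arr + col

print(by_row)
print(by_col)
```

Shapes are compared from the trailing dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.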
1.4 Basic Indexing and Slicing
One-dimensional array indexing and slicing act similarly to Python lists.
# indexing
arr = np.arange(10)
print(arr)
print(arr[0])
[0 1 2 3 4 5 6 7 8 9]
0
# slicing
print(arr[1:4])
[1 2 3]
Array slices are views on the original array, which means any modification to the view will be reflected in the source array. This design avoids copying data, which keeps performance high and memory use low.
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[1:5] = 10
arr
array([ 0, 10, 10, 10, 10, 5, 6, 7, 8, 9])
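When the source array must stay untouched, the slice can be copied explicitly. The sketch below uses the copy method and np.shares_memory, neither of which appears in the text above, to contrast the two cases.

```python
import numpy as np

arr = np.arange(10)

# A slice is a view: it shares memory with arr
view = arr[1:5]
print(np.shares_memory(arr, view))    # True

# .copy() makes an independent array, so writing to it leaves arr alone
independent = arr[1:5].copy()
independent[:] = 99
print(arr.tolist())                   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```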
For higher-dimensional arrays, individual elements can be accessed recursively. Indexing first moves along axis 0 (the “rows” of the array) and then along axis 1 (the “columns”).
arr = np.arange(9).reshape(3, 3)
arr
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
# indexing along axis 0
arr[1]
array([3, 4, 5])
# recursively indexing
arr[1][0]
3
# easy and equivalent way
arr[1, 0]
3
# indexing with slices, slice along axis 0
arr[:2]
array([[0, 1, 2],
[3, 4, 5]])
# slicing along both axes at once
arr[:2, :2]
array([[0, 1],
[3, 4]])
1.5 Boolean Indexing
Selecting data from an array by boolean indexing always creates a copy of the data.
arr = np.array([["Bob", 1, 2, 3],
                ["Luffy", 2, 3, 4],
                ["Joe", 6, 7, 8]])
arr
array([['Bob', '1', '2', '3'],
['Luffy', '2', '3', '4'],
['Joe', '6', '7', '8']], dtype='<U11')
names = arr[:, 0]
names
array(['Bob', 'Luffy', 'Joe'], dtype='<U11')
luffy_selected = (names == "Luffy")
luffy_selected
array([False, True, False])
# boolean indexing slice along axis 0, select 'true' rows
arr[luffy_selected]
array([['Luffy', '2', '3', '4']], dtype='<U11')
# select everything except luffy, use != or negate the condition using ~
arr[names != "Luffy"]
array([['Bob', '1', '2', '3'],
['Joe', '6', '7', '8']], dtype='<U11')
arr[~luffy_selected]
array([['Bob', '1', '2', '3'],
['Joe', '6', '7', '8']], dtype='<U11')
# select two of the three names; to combine multiple boolean conditions, use boolean operators like & and |
mask = (names == "Bob") | (names == "Joe")
mask
array([ True, False, True])
arr[mask]
array([['Bob', '1', '2', '3'],
['Joe', '6', '7', '8']], dtype='<U11')
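The copy semantics of boolean indexing can be checked with a short sketch. The array values and the mask negative below are made-up numeric data, used instead of the string array above so that assignment is easy to follow.

```python
import numpy as np

values = np.array([3, -1, 4, -1, 5, -9])
negative = values < 0

# Indexing with a mask returns a copy; changing it leaves values alone
picked = values[negative]
picked[:] = 0
print(values.tolist())   # [3, -1, 4, -1, 5, -9]

# Assigning through a mask modifies values in place
values[negative] = 0
print(values.tolist())   # [3, 0, 4, 0, 5, 0]

# & combines conditions; the Python keywords and/or do not work on arrays
between = values[(values > 3) & (values < 5)]
print(between.tolist())  # [4]
```

Note the asymmetry: reading through a mask copies, while writing through a mask on the left-hand side of an assignment updates the original array.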
1.6 Fancy Indexing
- Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
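- When one integer list is passed per axis, the result is one-dimensional, however many dimensions the source array has; indexing a single axis, as in arr[[4, 3, 5, 6]], keeps the remaining axes.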
- Fancy indexing always copies the data into a new array.
# create a 8x4 array
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr
array([[0., 0., 0., 0.],
[1., 1., 1., 1.],
[2., 2., 2., 2.],
[3., 3., 3., 3.],
[4., 4., 4., 4.],
[5., 5., 5., 5.],
[6., 6., 6., 6.],
[7., 7., 7., 7.]])
# fancy indexing by passing a list
arr[[4, 3, 5, 6]]
array([[4., 4., 4., 4.],
[3., 3., 3., 3.],
[5., 5., 5., 5.],
[6., 6., 6., 6.]])
# passing multiple index arrays selects a one-dimensional array of elements corresponding to each tuple of indices
# select (1,0), (5,3), (7,1), (2,2)
arr = np.arange(32).reshape(8, 4)
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
array([ 4, 23, 29, 10])
# trying to select a rectangular region
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
array([[ 4, 7, 5, 6],
[20, 23, 21, 22],
[28, 31, 29, 30],
[ 8, 11, 9, 10]])
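An alternative way to select the same rectangular region is np.ix_, which is not used in the text above. The sketch below assumes the same 8x4 array and checks that both approaches agree and that the result is a copy.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

# Two chained fancy-indexing steps: pick rows, then reorder columns
chained = arr[rows][:, cols]

# np.ix_ builds an open mesh so a single indexing step selects the same block
meshed = arr[np.ix_(rows, cols)]
print(np.array_equal(chained, meshed))   # True

# Fancy indexing copies: editing the result leaves arr unchanged
meshed[0, 0] = -1
print(arr[1, 0])                         # 4
```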
1.7 Transposing Arrays and Swapping Axes
Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the transpose method and also the special T attribute.
arr = np.arange(6).reshape(2, 3)
arr
array([[0, 1, 2],
[3, 4, 5]])
arr.T
array([[0, 3],
[1, 4],
[2, 5]])
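The section title also mentions swapping axes. A minimal sketch follows, showing that T is a view and using the swapaxes method and np.dot; the array cube is a hypothetical three-dimensional example.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# arr.T is a view, so writing through it changes arr
transposed = arr.T
transposed[0, 1] = 30
print(arr[1, 0])                  # 30

# Matrix product of a (3, 2) array with a (2, 3) array
gram = np.dot(arr.T, arr)
print(gram.shape)                 # (3, 3)

# swapaxes exchanges two axes of a higher-dimensional array
cube = np.arange(24).reshape(2, 3, 4)
swapped = cube.swapaxes(1, 2)
print(swapped.shape)              # (2, 4, 3)
```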
2 Universal Functions
A ufunc, short for universal function, is a function that performs element-wise operations on data in ndarrays.
# unary ufuncs
arr = np.arange(10)
np.sqrt(arr)
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
# binary ufuncs
x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x, y)
array([ 0.7741251 , -0.13054798, 0.92974564, -0.83160733, 0.90103776,
0.68551387, -0.34032336, -0.08283191])
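Because the example above uses random input, its output changes on every run. The sketch below repeats the idea with fixed, made-up values and adds two features not covered in the text: ufuncs that return several arrays (np.modf) and the optional out argument.

```python
import numpy as np

x = np.array([1.5, -2.25, 3.0])
y = np.array([0.5, 4.0, -1.0])

# Binary ufuncs take two arrays and work element by element
total = np.add(x, y)
larger = np.maximum(x, y)
print(total.tolist(), larger.tolist())

# Some ufuncs return several arrays; modf splits fractional and integral parts
frac, whole = np.modf(x)
print(frac.tolist(), whole.tolist())

# The out argument writes the result into an existing array instead of a new one
np.abs(x, out=x)
print(x.tolist())    # [1.5, 2.25, 3.0]
```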
3 Array-Oriented Programming with Arrays
points = np.arange(-5, 5, 0.01)
xs, ys = np.meshgrid(points, points)
# think of xs as points on the x axis, ys as points on the y axis
xs
array([[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
...,
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99]])
ys
array([[-5. , -5. , -5. , ..., -5. , -5. , -5. ],
[-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
[-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
...,
[ 4.97, 4.97, 4.97, ..., 4.97, 4.97, 4.97],
[ 4.98, 4.98, 4.98, ..., 4.98, 4.98, 4.98],
[ 4.99, 4.99, 4.99, ..., 4.99, 4.99, 4.99]])
z = np.sqrt(xs ** 2 + ys ** 2)
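The 1000x1000 grid above is too large to inspect by eye. The same computation on a tiny, hypothetical three-point grid makes the layout of xs and ys easier to verify.

```python
import numpy as np

points = np.array([0.0, 3.0, 4.0])
xs, ys = np.meshgrid(points, points)

# xs repeats points along each row, ys repeats points down each column
print(xs)
print(ys)

# Distance of every grid point from the origin, with no Python loop
z = np.sqrt(xs ** 2 + ys ** 2)
print(z[1, 2])   # 5.0, the grid point with x = 4 and y = 3
```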
3.1 Expressing Conditional Logic as Array Operations
# list comprehension edition for conditional logic
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result
[1.1, 2.2, 1.3, 1.4, 2.5]
# np.where edition for conditional logic
result = np.where(cond, xarr, yarr)
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
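The second and third arguments of np.where do not have to be arrays; either can be a scalar. A small sketch with made-up data:

```python
import numpy as np

arr = np.array([[1.5, -0.5], [-2.0, 3.0]])

# Replace positives with 2 and everything else with -2
signs = np.where(arr > 0, 2, -2)
print(signs.tolist())     # [[2, -2], [-2, 2]]

# Mix a scalar and an array: set positives to 2, keep the other values
capped = np.where(arr > 0, 2, arr)
print(capped.tolist())    # [[2.0, -0.5], [-2.0, 2.0]]
```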
3.2 Mathematical and Statistical Methods
A set of mathematical functions that compute statistics about an entire array, or about the data along an axis, is accessible both as methods of the array class and as top-level NumPy functions.
arr = np.random.randn(5, 4)
arr
array([[-0.32270463, -2.47923282, 0.51142065, 1.64402202],
[-1.11875424, -0.50816377, -0.24412379, -0.35906071],
[-0.80848479, -1.5290442 , 0.33861759, -1.84812779],
[-0.38523178, -1.14234316, 1.07015372, -0.7025341 ],
[-0.742273 , 0.62327938, -0.24617117, -0.87927529]])
# find the mean of an array
arr.mean()
-0.45640159378569933
# using the top-level NumPy function to find the mean of an array
np.mean(arr)
-0.45640159378569933
# computing the mean over the axis 0
arr.mean(axis=0)
array([-0.67548969, -1.00710091, 0.2859794 , -0.42899517])
np.mean(arr, axis=0)
array([-0.67548969, -1.00710091, 0.2859794 , -0.42899517])
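The direction of the axis argument is easier to follow on fixed numbers. The sketch below uses a small made-up array and also shows cumsum, a method that does not aggregate and is not covered above.

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

# axis=0 collapses the rows, giving one value per column
col_sums = arr.sum(axis=0)
print(col_sums.tolist())     # [5.0, 7.0, 9.0]

# axis=1 collapses the columns, giving one value per row
row_means = arr.mean(axis=1)
print(row_means.tolist())    # [2.0, 5.0]

# cumsum keeps the shape and returns running totals along the axis
running = arr.cumsum(axis=1)
print(running.tolist())      # [[1.0, 3.0, 6.0], [4.0, 9.0, 15.0]]
```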
3.3 Methods for Boolean Arrays
Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array.
arr = np.random.randn(100)
(arr > 0).sum() # Number of positive values
51
There are two additional methods, any and all, useful especially for boolean arrays. any tests whether one or more values in an array is True, while all checks if every value is True.
bools = np.array([False, False, True, False])
bools.any()
True
bools.all()
False
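These methods also accept an axis argument, which the examples above do not show. A sketch on a hypothetical table of scores, with a pass mark of 50 assumed purely for illustration:

```python
import numpy as np

scores = np.array([[55, 80, 91],
                   [40, 62, 30]])
passed = scores >= 50

total_passed = passed.sum()          # counts True values across the array
per_row = passed.sum(axis=1)         # counts True values in each row
all_in_row = passed.all(axis=1)      # did every score in the row pass?
any_in_col = passed.any(axis=0)      # did any score in the column pass?

print(total_passed)                  # 4
print(per_row.tolist())              # [3, 1]
print(all_in_row.tolist())           # [True, False]
print(any_in_col.tolist())           # [True, True, True]
```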
3.4 Sorting
NumPy arrays can be sorted in-place with the sort method.
arr = np.random.randn(6)
arr
array([-0.64050099, 0.37239892, 0.48466042, 0.0832035 , -0.24079602,
-0.62832189])
arr.sort()
arr
array([-0.64050099, -0.62832189, -0.24079602, 0.0832035 , 0.37239892,
0.48466042])
We can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to sort.
arr = np.random.randn(5, 3)
arr
array([[ 0.22066993, 1.28272713, -2.80933259],
[ 1.24150303, 0.6821006 , 0.21857812],
[ 0.38492004, 2.30910114, 0.354785 ],
[-0.99229831, -0.81723761, 0.19111813],
[ 0.46279363, 0.11871894, 0.7152068 ]])
arr.sort(1)
arr
array([[-2.80933259, 0.22066993, 1.28272713],
[ 0.21857812, 0.6821006 , 1.24150303],
[ 0.354785 , 0.38492004, 2.30910114],
[-0.99229831, -0.81723761, 0.19111813],
[ 0.11871894, 0.46279363, 0.7152068 ]])
The top-level function np.sort returns a sorted copy of an array instead of modifying the array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it and select the value at a particular rank.
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile
-1.5942713719535697
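For comparison, the rank-based shortcut can be set beside np.quantile, which is not used in the text above. The sketch assumes a seeded generator (np.random.default_rng) only so that the run is repeatable, and it uses np.sort so the input stays unsorted.

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded only so the run is repeatable
large_arr = rng.standard_normal(1000)

# np.sort returns a sorted copy and leaves large_arr untouched
sorted_copy = np.sort(large_arr)
rank_based = sorted_copy[int(0.05 * len(sorted_copy))]

# np.quantile interpolates between neighbouring ranks by default,
# so it can differ slightly from the rank-based value
interpolated = np.quantile(large_arr, 0.05)
print(rank_based, interpolated)
```

With 1000 values the two numbers are close; the gap comes from the default linear interpolation between the 50th and 51st smallest values.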
3.5 Unique and Other Set Logic
# np.unique returns the sorted unique values in an array
numbers = np.array([1, 2, 3, 3, 4, 4, 6])
np.unique(numbers)
array([1, 2, 3, 4, 6])
# pure python edition
sorted(set(numbers))
[1, 2, 3, 4, 6]
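The heading mentions other set logic, which the example above does not reach. A short sketch of a few related NumPy functions, using two made-up arrays named left and right:

```python
import numpy as np

left = np.array([1, 2, 3, 3, 4])
right = np.array([3, 4, 4, 6])

common = np.intersect1d(left, right)    # sorted values present in both
combined = np.union1d(left, right)      # sorted values present in either
only_left = np.setdiff1d(left, right)   # values in left but not in right

# np.isin tests each element of left for membership in right
membership = np.isin(left, right)

print(common.tolist())        # [3, 4]
print(combined.tolist())      # [1, 2, 3, 4, 6]
print(only_left.tolist())     # [1, 2]
print(membership.tolist())    # [False, False, True, True, True]
```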