import numpy as np
NumPy, short for Numerical Python, offers the following advantages:
- Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, function application)
Pure Python leaves many details to the runtime environment:
- specifying variable types
- memory allocation/deallocation, etc.
NumPy is fast.
- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python for loops.
# speed test for numpy
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))
%%time
my_arr2 = my_arr * 2
Wall time: 2.99 ms
%%time
my_list2 = [x * 2 for x in my_list]
Wall time: 114 ms
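The memory claim in the list above can be checked roughly as well. The sketch below is only an approximation: it assumes CPython, uses an illustrative size of 1000 elements, and counts the list's own pointer storage plus each separate int object.

```python
import sys
import numpy as np

n = 1000  # illustrative size, smaller than the benchmark above
my_arr = np.arange(n, dtype=np.int64)
my_list = list(range(n))

# The array stores n fixed-size values in one contiguous buffer.
array_bytes = my_arr.nbytes

# The list stores pointers to separate Python int objects, so count both.
list_bytes = sys.getsizeof(my_list) + sum(sys.getsizeof(x) for x in my_list)

print(array_bytes)               # 8000 (1000 values * 8 bytes each)
print(list_bytes > array_bytes)  # True on CPython
```

The exact list figure depends on the interpreter and platform, so only the comparison is meaningful here.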
1 ndarray
An ndarray (N-dimensional array) is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type, which is why operations on it need no per-element type checking. Every array has a shape, a tuple indicating the size of each dimension; a dtype, an object describing the data type of the array; and ndim, an integer giving the number of dimensions.
data = np.random.randn(3, 4)
data
array([[ 0.59905431, -1.08465416, -0.95319914,  1.93616798],
       [ 0.92951762,  0.81376066,  0.87067676, -0.05844408],
       [ 0.61867604, -0.78530194, -0.93464763,  0.74309266]])
data.shape
(3, 4)
data.dtype
dtype('float64')
data.ndim
2
1.1 Creating ndarrays
1.1.1 array function
The array function accepts any sequence-like object (list, list of lists, other arrays, etc.) and produces a new NumPy array containing the passed data. Unless explicitly specified, np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object.
# passing list
arr1 = np.array([1, 2, 3, 4])
print(arr1)
print(arr1.shape)
print(arr1.dtype)
print(arr1.ndim)
[1 2 3 4]
(4,)
int32
1
# passing list of lists
arr2 = np.array([[1, 2, 3],
[4, 5, 6]])
print(arr2)
print(arr2.shape)
print(arr2.dtype)
print(arr2.ndim)
[[1 2 3]
 [4 5 6]]
(2, 3)
int32
2
# passing array
arr3 = np.array(arr2)
print(arr3)
print(arr3.shape)
print(arr3.dtype)
print(arr3.ndim)
[[1 2 3]
 [4 5 6]]
(2, 3)
int32
2
1.1.2 zeros, ones, empty, arange, etc.
| Function | Description |
|---|---|
| array | Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default |
| asarray | Convert input to ndarray, but do not copy if the input is already an ndarray |
| arange | Like the built-in range but returns an ndarray instead of a range object |
| ones | Produce an array of all 1s with the given shape and dtype |
| ones_like | Takes another array and produces a ones array of the same shape and dtype |
| zeros | Like ones but producing arrays of 0s instead |
| zeros_like | Like ones_like but producing arrays of 0s instead |
| empty | Create new arrays by allocating new memory, but do not populate with any values |
| empty_like | Like ones_like but do not populate with any values |
| full | Produce an array of the given shape and dtype with all values set to the indicated “fill value” |
| full_like | full_like takes another array and produces a filled array of the same shape and dtype |
| eye, identity | Create a square NxN identity matrix |
# np.zeros
np.zeros(4)
array([0., 0., 0., 0.])
# np.ones
np.ones((2, 3))
array([[1., 1., 1.],
       [1., 1., 1.]])
# np.empty returns uninitialized "garbage" values, which can later be populated with data
np.empty((2, 3, 2))
array([[[0.59905431, 1.08465416],
        [0.95319914, 1.93616798],
        [0.92951762, 0.81376066]],

       [[0.87067676, 0.05844408],
        [0.61867604, 0.78530194],
        [0.93464763, 0.74309266]]])
# arange is an array-valued version of the built-in Python range function
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
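Some of the other functions in the table can be sketched the same way. The example below is a minimal illustration with made-up variable names (base, same, copied, filled, like, ident); it contrasts the copying behavior of array and asarray and shows full, full_like, and eye.

```python
import numpy as np

# asarray returns the input itself when it is already an ndarray;
# array makes a copy by default.
base = np.arange(4)
same = np.asarray(base)
copied = np.array(base)

base[0] = 99
print(same[0])    # 99, because same and base are the same object
print(copied[0])  # 0, because copied owns its own data

# full and full_like fill an array with a chosen value.
filled = np.full((2, 3), 7)
like = np.full_like(filled, -1)  # same shape and dtype as filled

# eye builds a square identity matrix of floats.
ident = np.eye(3)
print(ident.shape)  # (3, 3)
```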
1.2 Data Types for ndarrays
The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data. In most cases dtypes map directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran.
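As a small sketch of how dtypes are set and converted, the example below passes dtype explicitly to np.array and uses the astype method to cast. The variable names are illustrative only.

```python
import numpy as np

# Request a dtype explicitly instead of relying on inference.
arr_f = np.array([1, 2, 3], dtype=np.float64)
arr_i = np.array([1, 2, 3], dtype=np.int32)
print(arr_f.dtype, arr_i.dtype)        # float64 int32
print(arr_f.itemsize, arr_i.itemsize)  # 8 4 (bytes per element)

# astype always creates a new array; casting float to integer
# truncates the decimal part.
floats = np.array([3.7, -1.2, 0.5])
ints = floats.astype(np.int32)
print(ints)  # [ 3 -1  0]

# Strings that represent numbers can be converted to a numeric dtype.
numeric_strings = np.array(["1.25", "-9.6", "42"])
parsed = numeric_strings.astype(np.float64)
```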
1.3 Arithmetic with NumPy Arrays
Vectorization - Any arithmetic operation between equal-size arrays applies the operation element-wise.
arr = np.random.randn(3, 4)
arr
array([[ 0.07338553,  0.8116425 , -2.17800941, -0.32029061],
       [ 0.41584223, -1.32481565, -1.3783163 ,  0.26723131],
       [-0.31385858,  1.30899248,  1.13523462, -0.68452327]])
arr * 2
array([[ 0.14677106,  1.62328501, -4.35601882, -0.64058122],
       [ 0.83168445, -2.64963129, -2.75663261,  0.53446262],
       [-0.62771717,  2.61798496,  2.27046924, -1.36904654]])
arr - arr
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
1 / arr
array([[13.62666475,  1.23206953, -0.45913484, -3.1221646 ],
       [ 2.4047582 , -0.754822  , -0.72552287,  3.74207649],
       [-3.18614831,  0.76394633,  0.88087518, -1.46087072]])
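Broadcasting - Operations between differently sized arrays are also possible when their shapes are compatible; the smaller array is stretched (broadcast) across the larger one.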
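A minimal sketch of broadcasting, using a small integer array so the results are easy to follow. The names row and col are illustrative; the shapes in the comments are what matter.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# A scalar is broadcast to every element.
scaled = arr * 10

# A shape (4,) array is broadcast across each of the 3 rows.
row = np.array([1, 0, 1, 0])
plus_row = arr + row
print(plus_row[1])  # [5 5 7 7]

# A shape (3, 1) array is broadcast across each of the 4 columns.
col = np.array([[100], [200], [300]])
plus_col = arr + col
print(plus_col[2, 0])  # 308

# A common use: subtract each column's mean from that column.
demeaned = arr - arr.mean(axis=0)
```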
1.4 Basic Indexing and Slicing
One-dimensional array indexing and slicing act similarly to Python lists.
# indexing
arr = np.arange(10)
print(arr)
print(arr[0])
[0 1 2 3 4 5 6 7 8 9]
0
# slicing
print(arr[1:4])
[1 2 3]
Array slices are views on the original array, which means any modification to the view will be reflected in the source array. This design avoids copying data, which improves performance and saves memory.
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[1:5] = 10
arr
array([ 0, 10, 10, 10, 10,  5,  6,  7,  8,  9])
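When an independent copy is wanted instead of a view, the slice can be copied explicitly. The sketch below contrasts the two; view and snapshot are illustrative names.

```python
import numpy as np

arr = np.arange(10)
view = arr[1:5]             # a view: shares memory with arr
snapshot = arr[1:5].copy()  # an explicit, independent copy

view[:] = 10                # writes through to arr
print(arr)                  # [ 0 10 10 10 10  5  6  7  8  9]
print(snapshot)             # [1 2 3 4]

print(np.shares_memory(arr, view))      # True
print(np.shares_memory(arr, snapshot))  # False
```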
For higher-dimensional arrays, indexing with a single integer returns a lower-dimensional array, so individual elements can be accessed recursively. In a two-dimensional array, it helps to think of axis 0 as the “rows” and axis 1 as the “columns”.
arr = np.arange(9).reshape(3, 3)
arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
# indexing along axis 0
arr[1]
array([3, 4, 5])
# recursively indexing
arr[1][0]
3
# easy and equivalent way
arr[1, 0]
3
# indexing with slices, slice along axis 0
arr[:2]
array([[0, 1, 2],
       [3, 4, 5]])
# passing multiple slices, one per axis
arr[:2, :2]
array([[0, 1],
       [3, 4]])
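Integer indexes and slices can also be mixed. As a rough illustration (variable names are made up), an integer index drops its axis while a slice keeps it:

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)

# Integer on axis 0, slice on axis 1: the result is one-dimensional.
row_part = arr[1, :2]
print(row_part)        # [3 4]
print(row_part.shape)  # (2,)

# Slice on both axes: the result stays two-dimensional.
col_keepdim = arr[:, :1]
print(col_keepdim.shape)  # (3, 1)

# Slice on axis 0, integer on axis 1: one-dimensional again.
col_flat = arr[:, 0]
print(col_flat)        # [0 3 6]
```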
1.5 Boolean Indexing
Selecting data from an array by boolean indexing always creates a copy of the data.
# mixing strings and numbers coerces every element to a string dtype
arr = np.array([["Bob", 1, 2, 3],
                ["Luffy", 2, 3, 4],
                ["Joe", 6, 7, 8]])
arr
array([['Bob', '1', '2', '3'],
       ['Luffy', '2', '3', '4'],
       ['Joe', '6', '7', '8']], dtype='<U11')
names = arr[:, 0]
names
array(['Bob', 'Luffy', 'Joe'], dtype='<U11')
luffy_selected = (names == "Luffy")
luffy_selected
array([False,  True, False])
# boolean indexing along axis 0 selects the rows where the mask is True
arr[luffy_selected]
array([['Luffy', '2', '3', '4']], dtype='<U11')
# to select everything except Luffy, use != or negate the condition with ~
arr[names != "Luffy"]
array([['Bob', '1', '2', '3'],
       ['Joe', '6', '7', '8']], dtype='<U11')
arr[~luffy_selected]
array([['Bob', '1', '2', '3'],
       ['Joe', '6', '7', '8']], dtype='<U11')
# to select two of the three names, combine multiple boolean conditions with operators such as & and |
mask = (names == "Bob") | (names == "Joe")
mask
array([ True, False,  True])
arr[mask]
array([['Bob', '1', '2', '3'],
       ['Joe', '6', '7', '8']], dtype='<U11')
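The copy behavior, and assignment through a mask, can be sketched as follows. To keep the numbers numeric, this example stores the names and the values in two separate arrays (names and scores are illustrative names, not part of the example above).

```python
import numpy as np

names = np.array(["Bob", "Luffy", "Joe"])
scores = np.array([[1, 2, 3],
                   [2, 3, 4],
                   [6, 7, 8]])

# Selecting with a boolean mask returns a copy,
# so changing the selection leaves scores untouched.
selected = scores[names == "Luffy"]
selected[:] = 0
print(scores[1])  # [2 3 4]

# Assigning through a mask, however, modifies the array in place.
scores[names != "Luffy"] = -1
scores[scores > 3] = 3
print(scores.tolist())  # [[-1, -1, -1], [2, 3, 3], [-1, -1, -1]]
```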
1.6 Fancy Indexing
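- Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
- When one one-dimensional integer array is passed per axis, the result of fancy indexing is always one-dimensional; passing a single integer array to a two-dimensional array selects whole rows and returns a two-dimensional result.
- Fancy indexing always copies the data into a new array.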
# create an 8x4 array
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr
array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])
# fancy indexing by passing a list
arr[[4, 3, 5, 6]]
array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.]])
# passing multiple index arrays selects a one-dimensional array of elements corresponding to each tuple of indices
# select (1,0), (5,3), (7,1), (2,2)
arr = np.arange(32).reshape(8, 4)
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
array([ 4, 23, 29, 10])
# to select a rectangular region instead, index the rows first and then reorder the columns
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])
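The rectangular selection can also be written in one step with np.ix_, which builds an open mesh from the row and column index lists. This is an alternative sketch, not the approach used above; it also checks that fancy indexing returns a copy.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

# Two-step selection, as in the example above.
chained = arr[rows][:, cols]

# One-step selection of the same rectangular region.
block = arr[np.ix_(rows, cols)]
print(np.array_equal(block, chained))  # True

# Fancy indexing copies: changing the result does not touch arr.
picked = arr[rows]
picked[:] = 0
print(arr[1, 0])  # 4
```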
1.7 Transposing Arrays and Swapping Axes
Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the transpose method and also the special T attribute.
arr = np.arange(6).reshape(2, 3)
arr
array([[0, 1, 2],
       [3, 4, 5]])
arr.T
array([[0, 3],
       [1, 4],
       [2, 5]])
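For arrays with more than two dimensions, the axes can be reordered with transpose or a pair of axes exchanged with swapaxes. The following is a small sketch with an illustrative three-dimensional array named cube.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# The transpose is a view: no data is copied.
print(np.shares_memory(arr, arr.T))  # True

# A typical use of T: the inner matrix product arr.T @ arr.
gram = np.dot(arr.T, arr)
print(gram.shape)  # (3, 3)

cube = np.arange(24).reshape(2, 3, 4)

# swapaxes exchanges two axes and also returns a view.
swapped = cube.swapaxes(1, 2)
print(swapped.shape)  # (2, 4, 3)

# transpose accepts a full permutation of the axis numbers.
reordered = cube.transpose(1, 0, 2)
print(reordered.shape)  # (3, 2, 4)
```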
2 Universal Functions
A ufunc, short for universal function, is a function that performs element-wise operations on data in ndarrays.
# unary ufuncs
arr = np.arange(10)
np.sqrt(arr)
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])
np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])
# binary ufuncs
x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x, y)
array([ 0.7741251 , -0.13054798,  0.92974564, -0.83160733,  0.90103776,
        0.68551387, -0.34032336, -0.08283191])
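Two further ufunc features may be worth a quick sketch: some ufuncs return more than one array, and ufuncs accept an optional out argument to write into an existing array. The input values are chosen so the results are exact in floating point.

```python
import numpy as np

values = np.array([3.5, -2.25, 0.75])

# np.modf returns the fractional and integral parts as two arrays.
frac, whole = np.modf(values)
print(frac)   # [ 0.5  -0.25  0.75]
print(whole)  # [ 3. -2.  0.]

# The out argument stores the result in a preallocated array.
result = np.zeros_like(values)
np.add(values, 1.0, out=result)
print(result)  # [ 4.5  -1.25  1.75]
```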
3 Array-Oriented Programming with Arrays
points = np.arange(-5, 5, 0.01)
xs, ys = np.meshgrid(points, points)
# think of xs as points on the x axis, ys as points on the y axis
xs
array([[-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       ...,
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99]])
ys
array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
       [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
       [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
       ...,
       [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
       [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
       [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])
z = np.sqrt(xs ** 2 + ys ** 2)
3.1 Expressing Conditional Logic as Array Operations
# list comprehension edition for conditional logic
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result
[1.1, 2.2, 1.3, 1.4, 2.5]
# np.where edition for conditional logic
result = np.where(cond, xarr, yarr)
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
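The second and third arguments to np.where can also be scalars, and calls can be nested to express an if-elif-else chain. A minimal sketch with illustrative values:

```python
import numpy as np

arr = np.array([[1.5, -0.3],
                [-2.0, 0.0]])

# Scalars on both branches: 2 where positive, -2 elsewhere.
signs = np.where(arr > 0, 2, -2)
print(signs.tolist())  # [[2, -2], [-2, -2]]

# Scalar on one branch, the original values on the other.
capped = np.where(arr > 0, 2, arr)
print(capped.tolist())  # [[2.0, -0.3], [-2.0, 0.0]]

# Nested calls act like if / elif / else.
labels = np.where(arr > 0, "pos", np.where(arr < 0, "neg", "zero"))
print(labels.tolist())  # [['pos', 'neg'], ['neg', 'zero']]
```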
3.2 Mathematical and Statistical Methods
A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible both as methods of the array class and as top-level NumPy functions.
arr = np.random.randn(5, 4)
arr
array([[-0.32270463, -2.47923282,  0.51142065,  1.64402202],
       [-1.11875424, -0.50816377, -0.24412379, -0.35906071],
       [-0.80848479, -1.5290442 ,  0.33861759, -1.84812779],
       [-0.38523178, -1.14234316,  1.07015372, -0.7025341 ],
       [-0.742273  ,  0.62327938, -0.24617117, -0.87927529]])
# find the mean of an array
arr.mean()
-0.45640159378569933
# using the top-level NumPy function to find the mean of an array
np.mean(arr)
-0.45640159378569933
# computing the mean over the axis 0
arr.mean(axis=0)
array([-0.67548969, -1.00710091,  0.2859794 , -0.42899517])
np.mean(arr, axis=0)
array([-0.67548969, -1.00710091,  0.2859794 , -0.42899517])
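The effect of the axis argument is easier to see on a small integer array. The sketch below uses a 2x3 array so each result can be checked by hand.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

print(arr.sum())         # 15: all elements
print(arr.sum(axis=0))   # [3 5 7]: collapses the rows, one value per column
print(arr.mean(axis=1))  # [1. 4.]: collapses the columns, one value per row

# cumsum does not aggregate; it returns an array of the same shape
# holding the running totals along the chosen axis.
running = arr.cumsum(axis=1)
print(running.tolist())  # [[0, 1, 3], [3, 7, 12]]
```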
3.3 Methods for Boolean Arrays
Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array.
arr = np.random.randn(100)
(arr > 0).sum()  # Number of positive values
51
There are two additional methods, any and all, useful especially for boolean arrays. any tests whether one or more values in an array is True, while all checks if every value is True.
bools = np.array([False, False, True, False])
bools.any()
True
bools.all()
False
3.4 Sorting
NumPy arrays can be sorted in-place with the sort method.
arr = np.random.randn(6)
arr
array([-0.64050099,  0.37239892,  0.48466042,  0.0832035 , -0.24079602,
       -0.62832189])
arr.sort()
arr
array([-0.64050099, -0.62832189, -0.24079602,  0.0832035 ,  0.37239892,
        0.48466042])
We can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to sort.
arr = np.random.randn(5, 3)
arr
array([[ 0.22066993,  1.28272713, -2.80933259],
       [ 1.24150303,  0.6821006 ,  0.21857812],
       [ 0.38492004,  2.30910114,  0.354785  ],
       [-0.99229831, -0.81723761,  0.19111813],
       [ 0.46279363,  0.11871894,  0.7152068 ]])
arr.sort(1)
arr
array([[-2.80933259,  0.22066993,  1.28272713],
       [ 0.21857812,  0.6821006 ,  1.24150303],
       [ 0.354785  ,  0.38492004,  2.30910114],
       [-0.99229831, -0.81723761,  0.19111813],
       [ 0.11871894,  0.46279363,  0.7152068 ]])
The top-level function np.sort returns a sorted copy of an array instead of modifying the array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it and select the value at a particular rank.
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))]  # 5% quantile
-1.5942713719535697
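As a rough comparison, the sketch below uses np.sort so the input is left untouched, and sets the rank-based value next to np.quantile, which by default interpolates linearly between neighboring ranks. The seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
values = rng.standard_normal(1000)

# np.sort returns a sorted copy; values keeps its original order.
sorted_copy = np.sort(values)

# Quick-and-dirty 5% quantile: the value at rank 50 of 1000.
rank_based = sorted_copy[int(0.05 * len(sorted_copy))]

# np.quantile interpolates, so it may differ slightly from rank_based.
interpolated = np.quantile(values, 0.05)

print(rank_based, interpolated)
```

With 1000 values the interpolated result lies between the values at ranks 49 and 50, so the two estimates are close but not necessarily identical.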
3.5 Unique and Other Set Logic
# np.unique returns the sorted unique values in an array
numbers = np.array([1, 2, 3, 3, 4, 4, 6])
np.unique(numbers)
array([1, 2, 3, 4, 6])
# pure python edition
sorted(set(numbers))
[1, 2, 3, 4, 6]
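The other set operations follow the same pattern. The sketch below uses two small illustrative arrays, a and b, to show membership testing with np.isin and the sorted results of np.intersect1d, np.union1d, and np.setdiff1d.

```python
import numpy as np

a = np.array([6, 0, 0, 3, 2, 5, 6])
b = np.array([2, 3, 6])

# Element-wise membership test: is each element of a contained in b?
member = np.isin(a, b)
print(member.tolist())  # [True, False, False, True, True, False, True]

# Sorted, unique results of the usual set operations.
print(np.intersect1d(a, b))  # [2 3 6]
print(np.union1d(a, b))      # [0 2 3 5 6]
print(np.setdiff1d(a, b))    # [0 5]
```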