{ } Raw JSON

bundles / numpy 2.4.3 / docs

Doc

user:how-to-io

docs/user:how-to-io

Reading and writing files

This page tackles common applications; for the full collection of I/O routines, see routines.io.

Reading text and `CSV_` files

With no missing values

Use numpy.loadtxt.

With missing values

Use numpy.genfromtxt.

numpy.genfromtxt will either

return a masked array<maskedarray.generic> masking out missing values (if usemask=True), or
fill in the missing value with the value specified in filling_values (default is np.nan for float, -1 for int).

With non-whitespace delimiters

>>> with open("csv.txt", "r") as f:
...     print(f.read())
1, 2, 3
4,, 6
7, 8, 9

Masked-array output

>>> np.genfromtxt("csv.txt", delimiter=",", usemask=True)
masked_array(
  data=[[1.0, 2.0, 3.0],
        [4.0, --, 6.0],
        [7.0, 8.0, 9.0]],
  mask=[[False, False, False],
        [False,  True, False],
        [False, False, False]],
  fill_value=1e+20)

Array output

>>> np.genfromtxt("csv.txt", delimiter=",")
array([[ 1.,  2.,  3.],
       [ 4., nan,  6.],
       [ 7.,  8.,  9.]])

Array output, specified fill-in value

>>> np.genfromtxt("csv.txt", delimiter=",", dtype=np.int8, filling_values=99)
array([[ 1,  2,  3],
       [ 4, 99,  6],
       [ 7,  8,  9]], dtype=int8)

Whitespace-delimited

numpy.genfromtxt can also parse whitespace-delimited data files that have missing values if

Each field has a fixed width: Use the width as the delimiter argument.

# File with width=4. The data does not have to be justified (for example,
# the 2 in row 1), the last column can be less than width (for example, the 6
# in row 2), and no delimiting character is required (for instance 8888 and 9
# in row 3)

>>> with open("fixedwidth.txt", "r") as f:
...    data = (f.read())
>>> print(data)
1   2      3
44      6
7   88889

# Showing spaces as ^
>>> print(data.replace(" ","^"))
1^^^2^^^^^^3
44^^^^^^6
7^^^88889

>>> np.genfromtxt("fixedwidth.txt", delimiter=4)
array([[1.000e+00, 2.000e+00, 3.000e+00],
       [4.400e+01,       nan, 6.000e+00],
       [7.000e+00, 8.888e+03, 9.000e+00]])

A special value (e.g. "x") indicates a missing field: Use it as the missing_values argument.

>>> with open("nan.txt", "r") as f:
...     print(f.read())
1 2 3
44 x 6
7  8888 9

>>> np.genfromtxt("nan.txt", missing_values="x")
array([[1.000e+00, 2.000e+00, 3.000e+00],
       [4.400e+01,       nan, 6.000e+00],
       [7.000e+00, 8.888e+03, 9.000e+00]])

You want to skip the rows with missing values: Set invalid_raise=False.

>>> with open("skip.txt", "r") as f:
...     print(f.read())
1 2   3
44    6
7 888 9

>>> np.genfromtxt("skip.txt", invalid_raise=False)  # doctest: +SKIP
__main__:1: ConversionWarning: Some errors were detected !
    Line #2 (got 2 columns instead of 3)
array([[  1.,   2.,   3.],
       [  7., 888.,   9.]])

The delimiter whitespace character is different from the whitespace that indicates missing data. For instance, if columns are delimited by \t, then missing data will be recognized if it consists of one or more spaces.

>>> with open("tabs.txt", "r") as f:
...    data = (f.read())
>>> print(data)
1       2       3
44              6
7       888     9

# Tabs vs. spaces
>>> print(data.replace("\t","^"))
1^2^3
44^ ^6
7^888^9

>>> np.genfromtxt("tabs.txt", delimiter="\t", missing_values=" +")
array([[  1.,   2.,   3.],
       [ 44.,  nan,   6.],
       [  7., 888.,   9.]])

Read a file in .npy or .npz format

Choices:

Use numpy.load. It can read files generated by any of numpy.save, numpy.savez, or numpy.savez_compressed.
Use memory mapping. See numpy.lib.format.open_memmap.

Write to a file to be read back by NumPy

Binary

Use numpy.save, or to store multiple arrays numpy.savez or numpy.savez_compressed.

For security and portability, set allow_pickle=False unless the dtype contains Python objects, which requires pickling.

Masked arrays can't currently be saved <MaskedArray.tofile>, nor can other arbitrary array subclasses.

Human-readable

numpy.save and numpy.savez create binary files. To write a human-readable file, use numpy.savetxt. The array can only be 1- or 2-dimensional, and there's no savetxtz for multiple files.

Large arrays

See how-to-io-large-arrays.

Read an arbitrarily formatted binary file ("binary blob")

Use a structured array.

Example:

The .wav file header is a 44-byte block preceding data_size bytes of the actual sound data

chunk_id         "RIFF"
chunk_size       4-byte unsigned little-endian integer
format           "WAVE"
fmt_id           "fmt "
fmt_size         4-byte unsigned little-endian integer
audio_fmt        2-byte unsigned little-endian integer
num_channels     2-byte unsigned little-endian integer
sample_rate      4-byte unsigned little-endian integer
byte_rate        4-byte unsigned little-endian integer
block_align      2-byte unsigned little-endian integer
bits_per_sample  2-byte unsigned little-endian integer
data_id          "data"
data_size        4-byte unsigned little-endian integer

The .wav file header as a NumPy structured dtype

wav_header_dtype = np.dtype([
    ("chunk_id", (bytes, 4)), # flexible-sized scalar type, item size 4
    ("chunk_size", "<u4"),    # little-endian unsigned 32-bit integer
    ("format", "S4"),         # 4-byte string, alternate spelling of (bytes, 4)
    ("fmt_id", "S4"),
    ("fmt_size", "<u4"),
    ("audio_fmt", "<u2"),     #
    ("num_channels", "<u2"),  # .. more of the same ...
    ("sample_rate", "<u4"),   #
    ("byte_rate", "<u4"),
    ("block_align", "<u2"),
    ("bits_per_sample", "<u2"),
    ("data_id", "S4"),
    ("data_size", "<u4"),
    #
    # the sound data itself cannot be represented here:
    # it does not have a fixed size
])

header = np.fromfile(f, dtype=wave_header_dtype, count=1)[0]

This .wav example is for illustration; to read a .wav file in real life, use Python's built-in module wave.

(Adapted from Pauli Virtanen, advanced_numpy, licensed under CC BY 4.0.)

Write or read large arrays

Arrays too large to fit in memory can be treated like ordinary in-memory arrays using memory mapping.

Raw array data written with numpy.ndarray.tofile or numpy.ndarray.tobytes can be read with numpy.memmap:
```
array = numpy.memmap("mydata/myarray.arr", mode="r", dtype=np.int16, shape=(1024, 1024))
```
Files output by numpy.save (that is, using the numpy format) can be read using numpy.load with the mmap_mode keyword argument
```
large_array[some_slice] = np.load("path/to/small_array", mmap_mode="r")
```

Memory mapping lacks features like data chunking and compression; more full-featured formats and libraries usable with NumPy include:

HDF5: h5py or PyTables.
Zarr: here.
NetCDF: scipy.io.netcdf_file.

For tradeoffs among memmap, Zarr, and HDF5, see pythonspeed.com.

Write files for reading by other (non-NumPy) tools

Formats for exchanging data with other tools include HDF5, Zarr, and NetCDF (see how-to-io-large-arrays).

Write or read a JSON file

NumPy arrays and most NumPy scalars are not directly JSON serializable. Instead, use a custom json.JSONEncoder for NumPy types, which can be found using your favorite search engine.

Save/restore using a pickle file

Avoid when possible; pickles are not secure against erroneous or maliciously constructed data.

Use numpy.save and numpy.load. Set allow_pickle=False, unless the array dtype includes Python objects, in which case pickling is required.

numpy.load and pickle submodule also support unpickling files created with NumPy 1.26.

Convert from a pandas DataFrame to a NumPy array

See pandas.Series.to_numpy.

Save/restore using `~numpy.ndarray.tofile` and `~numpy.fromfile`

In general, prefer numpy.save and numpy.load.

numpy.ndarray.tofile and numpy.fromfile lose information on endianness and precision and so are unsuitable for anything but scratch storage.