.. _NEP51:

=====================================================
NEP 51 — Changing the representation of NumPy scalars
=====================================================

:Author: Sebastian Berg
:Status: Accepted
:Type: Standards Track
:Created: 2022-09-13
:Resolution: https://mail.python.org/archives/list/numpy-discussion@python.org/message/U2A4RCJSXMK7GG23MA5QMRG4KQYFMO2S/

Abstract
========

NumPy has scalar objects ("NumPy scalar") representing a single value
corresponding to a NumPy DType. The representation of these currently
matches that of the Python builtins, giving::

    >>> np.float32(3.0)
    3.0

In this NEP we propose to change the representation to include the
NumPy scalar type information. The above example would change to::

    >>> np.float32(3.0)
    np.float32(3.0)

We expect that this change will help users distinguish the NumPy scalars
from the Python builtin types and clarify their behavior.

The distinction between NumPy scalars and Python builtins will further become
more important for users once :ref:`NEP 50 ` is adopted.

These changes do lead to minor incompatibilities and infrastructure changes
related to array printing.

Motivation and scope
====================

This NEP proposes to change the representation of the following
NumPy scalar types to distinguish them from the Python scalars:

* ``np.bool_``
* ``np.uint8``, ``np.int8``, and all other integer scalars
* ``np.float16``, ``np.float32``, ``np.float64``, ``np.longdouble``
* ``np.complex64``, ``np.complex128``, ``np.clongdouble``
* ``np.str_``, ``np.bytes_``
* ``np.void`` (structured dtypes)

Additionally, the representation of the remaining NumPy scalars will be
adapted to print as ``np.`` rather than ``numpy.``:

* ``np.datetime64`` and ``np.timedelta64``
* ``np.void`` (unstructured version)

The NEP does not propose to change how these scalars print – only
their representation (``__repr__``) will be changed.
Further, array representation will not be affected since it already
includes the ``dtype=`` when necessary.

The main motivation behind the change is that the Python numerical types
behave differently from the NumPy scalars.
For example, numbers with lower precision (e.g. ``uint8`` or ``float16``)
should be used with care and users should be aware when they are working
with them. All NumPy integers can experience overflow, which Python integers
do not.
These differences will be exacerbated when adopting :ref:`NEP 50 `
because the lower precision NumPy scalar will be preserved more often.
Even ``np.float64``, which is very similar to Python's ``float`` and inherits
from it, does behave differently, for example when dividing by zero.

Another common source of confusion is the NumPy booleans. Python programmers
sometimes write ``obj is True`` and will be surprised when an object that
shows as ``True`` fails to pass the test.
It is much easier to understand this behavior when the value is
shown as ``np.True_``.

Not only do we expect the change to help users better understand and be
reminded of the differences between NumPy and Python scalars, but we also
believe that the awareness will greatly help debugging.

Usage and impact
================

Most user code should not be impacted by the change, but users will now
often see NumPy values shown as::

    np.True_
    np.float64(3.0)
    np.int64(34)

and so on. This will also mean that documentation and output in
Jupyter notebook cells will often show the type information intact.

``np.longdouble`` and ``np.clongdouble`` will print with single quotes::

    np.longdouble('3.0')

to allow round-tripping. In addition to this change, ``float128`` will
now always be printed as ``longdouble`` since the old name gives a false
impression of precision.

Backward compatibility
======================

We expect that most workflows will not be affected as only printing
changes.
In general we believe that informing users about the type they are working
with outweighs the need for adapting printing in some instances.

The NumPy test suite includes code such as
``decimal.Decimal(repr(scalar))``.
This code needs to be modified to use ``str()``.

An exception to this is downstream libraries with documentation and
especially documentation testing.
Since the representation of many values will change, in many cases
the documentation will have to be updated.
This is expected to require larger documentation fixups in the mid-term.

It may be necessary to adopt tools for doctest testing to
allow approximate value checking for the new representation.

Changes to ``arr.tofile()``
---------------------------

``arr.tofile()`` currently stores values as ``repr(arr.item())`` when in
text mode. This is not always ideal since that may include a conversion to
Python.
One issue is that this would start saving longdouble as
``np.longdouble('3.1')``, which is clearly not desired.
We expect that this method is rarely used for object arrays.
For string arrays, using the ``repr`` also leads to storing ``"string"``
or ``b"string"``, which seems rarely desired.

The proposal is to change the default (back) to use ``str`` rather than
``repr``. If ``repr`` is desired, users will have to pass ``format="%r"``.

Detailed description
====================

This NEP proposes to change the representation for NumPy scalars to:

* ``np.True_`` and ``np.False_`` for booleans (their singleton instances)
* ``np.scalar()``, i.e. ``np.float64(3.0)`` for all numerical dtypes.
* The value for ``np.longdouble`` and ``np.clongdouble`` will be given in
  quotes: ``np.longdouble('3.0')``. This ensures that it can always roundtrip
  correctly and matches the way that ``decimal.Decimal`` behaves.
  For these two the size-based name such as ``float128`` will not be used
  as the actual size is platform-dependent and therefore misleading.
* ``np.str_("string")`` and ``np.bytes_(b"byte_string")`` for string dtypes.
* ``np.void((3, 5), dtype=[('a', '`` ...

::

    >>> np.longlong
    numpy.longlong
    >>> np.longlong(3)
    np.int64(3)

The proposal will lead to the ``longlong`` name for the type while
using the ``int64`` form for the scalar.
This choice is made since ``int64`` is generally the more useful
information for users, but the type name itself must be precise.

Related work
============

A PR to only change the representation of booleans was previously
made `here `_.

The implementation is (at the time of writing) largely finished and can be
found `here `_

Implementation
==============

.. note::
    This part has *not* been implemented in the `initial PR `_.
    A similar change will be required to fix certain cases in printing and
    allow fully correct printing e.g. of structured scalars which include
    longdoubles.
    A similar solution is also expected to be necessary in the future
    to allow custom DTypes to correctly print.

The new representations can be mostly implemented on the scalar types with
the largest changes needed in the test suite.

The proposed changes for void scalars and masked ``fill_value`` make it
necessary to expose the scalar representation without the type.

We propose introducing the semi-public API::

    np.core.arrayprint.get_formatter(*,
            data=None, dtype=None, fmt=None, options=None)

to replace the current internal ``_get_formatting_func``. This will allow
two things compared to the old function:

* ``data`` may be ``None`` (if ``dtype`` is passed), which makes it
  unnecessary to pass the values that will be printed/formatted later.
* ``fmt=`` will allow passing on format strings to a DType-specific element
  formatter in the future. For now, ``get_formatter()`` will accept
  ``repr`` or ``str`` (the singletons, not strings) to format the elements
  without type information (``'3.1'`` rather than ``np.longdouble('3.1')``).
  The implementation ensures that formatting matches except for the type
  information.
  The empty format string will print identically to ``str()`` (with possibly
  extra padding when data is passed).

``get_formatter()`` is expected to query a user DType's method in the future,
allowing customized formatting for all DTypes.

Making ``get_formatter`` public allows it to be used for ``np.record`` and
masked arrays.
Currently, the formatters themselves seem semi-public; using a single
entry-point will hopefully provide a clear API for formatting NumPy values.

A large part of the scalar representation changes had previously been done
by Ganesh Kathiresan in [2]_.

Alternatives
============

Different representations can be considered: alternatives include spelling
``np.`` as ``numpy.`` or dropping the ``np.`` part from the numerical scalars.
We believe that using ``np.`` is sufficiently clear, concise, and does allow
copy pasting the representation.
Using only ``float64(3.0)`` without the ``np.`` prefix is more concise, but
contexts may exist where the NumPy dependency is not fully clear and the name
could clash with other libraries.

For booleans an alternative would be to use ``np.bool_(True)`` or
``bool_(True)``.
However, NumPy boolean scalars are singletons and the proposed formatting is
more concise. Alternatives for booleans were also discussed previously in [1]_.

For the string scalars, the confusion is generally less pronounced. It may be
reasonable to defer changing these.

Non-finite values
-----------------

The proposal does not allow copy pasting ``nan`` and ``inf`` values.
They could be represented by ``np.float64('nan')`` or ``np.float64(np.nan)``
instead.
The proposed form is more concise, and Python also uses ``nan`` and ``inf``
rather than allowing copy-pasting by showing it as ``float('nan')``.
Arguably, it would be a smaller addition in NumPy, where the type will
already always be printed.
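
The Python behavior referred to above can be checked with the builtin
``float`` alone. The following is a minimal sketch that does not use NumPy;
the variable names are illustrative only. It shows that the bare ``nan``
representation cannot be pasted back as source, while the quoted,
constructor-style spelling (the analogue of ``np.float64('nan')``) can:

```python
import math

value = float("nan")

# Python's builtin repr is the bare word, which is not a defined name.
bare = repr(value)

try:
    eval(bare, {})  # a fresh namespace has no name `nan`
    bare_round_trips = True
except NameError:
    bare_round_trips = False

# The quoted, constructor-style spelling can be evaluated again.
quoted = f"float({bare!r})"
restored = eval(quoted, {})

print(bare, bare_round_trips, quoted, math.isnan(restored))
```

The same reasoning applies to ``inf``; whether the extra verbosity is worth
it is the trade-off weighed in this section.
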
Alternatives for the new ``get_formatter()``
--------------------------------------------

When ``fmt=`` is passed, and specifically for the main use (in this NEP) of
formatting to a ``repr`` or ``str``, it would also be possible to use a ufunc
or a direct formatting function rather than wrapping it into a
``get_formatter()`` which relies on instantiating a formatter class for the
DType.

This NEP does not preclude creating a ufunc or making a special path.
However, NumPy array formatting commonly looks at all values to be formatted
in order to add padding for alignment or give uniform exponential output.
In this case ``data=`` is passed and used in preparation. This form of
formatting (unlike the scalar case where ``data=None`` would be desired) is
unfortunately fundamentally incompatible with UFuncs.

The use of the singleton ``str`` and ``repr`` ensures that future formatting
strings like ``f"{arr:r}"`` are not in any way limited by using ``"r"`` or
``"s"`` instead.

Discussion
==========

* A discussion of this change took place in the mailing list thread:
  https://mail.python.org/archives/list/numpy-discussion@python.org/thread/7GLGFHTZHJ6KQPOLMVY64OM6IC6KVMYI/
* There was a previous issue [1]_ and PR [2]_ to change only the
  representation of the NumPy booleans. The PR was later updated to change
  the representation of all (or at least most) NumPy scalars.

References and footnotes
========================

.. [1] https://github.com/numpy/numpy/issues/12950
.. [2] https://github.com/numpy/numpy/pull/17592

Copyright
=========

This document has been placed in the public domain.