@jbrockmendel (Member) commented Oct 22, 2025

As discussed in last week's dev call, this PR changes the behavior of to_datetime, DatetimeIndex, and the Timestamp constructor when they see strings: they now infer a "us" unit in cases where they would previously (in main, not in any released version) infer either "s" or "ms". Cases with nanosecond precision stay nano.

Non-string cases are unchanged (again, relative to main): np.datetime64 objects retain their resolution (or the nearest supported one), pydatetime objects stay "us", and ints and floats still end up as "ns".

For all-string cases this is similar to the OP suggestion from #58989. Most users will end up with a "us" unit when calling e.g. read_csv.
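To illustrate the intended string behavior (a sketch; the dtype comments restate this description, not output from a released version):

```python
import pandas as pd

# date-only and second/millisecond-precision strings now infer "us"
print(pd.to_datetime(["2016-01-01"]).dtype)
# main would infer "s" here; with this PR: datetime64[us]

# nanosecond-precision strings keep "ns"
print(pd.to_datetime(["2016-01-01 00:00:00.123456789"]).dtype)
# datetime64[ns]
```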

@jbrockmendel (Member Author):

gentle ping @jorisvandenbossche @rhshadrach @mroeschke since this is a blocker for several 3.0-milestone issues

@mroeschke (Member) left a comment

Do we have tests with multiple strings of different precisions that correctly return nano if there is a nano-precision string, and micro otherwise?

@jbrockmendel (Member Author) commented Oct 27, 2025

> Do we have tests with multiple strings of different precisions that correctly return nano if there is a nano-precision string, and micro otherwise?

Didn't see one in a quick pass. Will add where appropriate.
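For reference, a rough sketch of what such a test could look like (the test name and expected dtypes are my assumptions based on the behavior described in the OP):

```python
import pandas as pd

def test_mixed_precision_strings():
    # one nanosecond-precision string pulls the whole result to "ns"
    result = pd.to_datetime(["2016-01-01", "2016-01-01 00:00:00.000000001"])
    assert result.dtype == "datetime64[ns]"

    # with no nanosecond-precision strings, "us" is inferred
    result = pd.to_datetime(["2016-01-01", "2016-01-01 00:00:00.001"])
    assert result.dtype == "datetime64[us]"
```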

@rhshadrach (Member) left a comment

+1

Comment on lines +83 to +85
if unit in ["s", "ms"]:
# TODO: should _cast_pointwise_result attempt to preserve unit?
xp = xp.dt.as_unit("us")
Member:

What is meant here by "preserve unit"? It seems to me the current rule is to surface the lowest resolution of the inputs, which I believe is sensible.

@jbrockmendel (Member Author):

Consider:

series_with_dt64_ms_unit.map(lambda x: pd.Timestamp("2016-01-01"))

What dtype would you expect the result to have? In main it will be "s". After this PR it would be "us". This comment is suggesting that since the original has "ms" unit, we might try to retain "ms" in the mapped result.
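A runnable version of that example (the dtype comments restate the behavior described in this thread):

```python
import pandas as pd

ser = pd.Series(pd.to_datetime(["2016-01-02"])).dt.as_unit("ms")
result = ser.map(lambda x: pd.Timestamp("2016-01-01"))
print(result.dtype)  # main: datetime64[s]; with this PR: datetime64[us]
```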

Member:

I don't think we should let the input influence the result here. If we did, then the result of

pd.Series([pd.Timestamp("2016-01-01").as_unit("ms")]).map(lambda x: x.as_unit("us"))

would also be ms.

@jbrockmendel (Member Author):

OK. Will remove this comment next time I push.

Member:

This is a bit of a different case, though?

In your example with lambda x: pd.Timestamp("2016-01-01"), the return value is an actual timestamp with a unit, which we might ignore in order to coerce to the original unit (but agreed, we shouldn't do that in the context of map).

But the test case here is about combining a timestamp and a string. At that point, I think it could make sense to be flexible in how we parse the string to timestamps so as to preserve the unit. Although I have no strong opinion about it.

That said, what I was actually wondering in this case: if we are combining timestamps and strings, shouldn't that give object dtype as a result instead?
Like concat of timestamps and strings also gives object. Or do we consider combine_first more like a setitem operation? But then it would actually make sense to preserve the unit of the left side, as we would for left[mask] = right[mask].
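For reference, the concat behavior mentioned above:

```python
import pandas as pd

left = pd.Series([pd.Timestamp("2016-01-01")])
right = pd.Series(["2011"])
print(pd.concat([left, right]).dtype)  # object
```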

@jbrockmendel (Member Author) commented Oct 31, 2025

> That said, what I was actually wondering in this case: if we are combining timestamps and strings, shouldn't that give object dtype as a result instead?

Yeah, this is driven by a line in Series.combine_first that I think is weird/bad:

        if this.dtype.kind == "M" and other.dtype.kind != "M":
            # TODO: try to match resos?
            other = to_datetime(other)

When we get here in this test, we have other = Series({1: "2011"}), i.e. with dtype "str".
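To make the situation concrete, a minimal reproduction sketch (my assumed shape of the test scenario, not the exact test):

```python
import pandas as pd

left = pd.Series([pd.Timestamp("2016-01-01"), pd.NaT])
right = pd.Series({1: "2011"})
# the quoted branch coerces the string side with to_datetime, so
# combine_first returns datetime64 here rather than object
result = left.combine_first(right)
print(result.dtype)
```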

@jbrockmendel (Member Author):

Opened #62931 to deprecate this to_datetime call.

@jorisvandenbossche (Member) commented Oct 28, 2025

@jbrockmendel can you clarify a bit more what the PR does exactly? The title mentions "to_datetime strings", so does that mean other input stays as it was?
Although in practice, other input like datetime stdlib objects already gives microseconds as well (edit: not for datetime.date, though), and integers continue to be interpreted as nanos by default. Are there other cases remaining for to_datetime?

@jbrockmendel (Member Author):

> Can you clarify a bit more what the PR does exactly? The title mentions "to_datetime strings", so does that mean other input stays as it was?

Updated OP, LMK if it is still unclear.

@jorisvandenbossche (Member):

> Updated OP, LMK if it is still unclear.

Thanks!
I would still use a "general" rule of defaulting to microseconds, and not just for strings. In practice (comparing with my older branch), I think that only means updating the datetime.date case (PyDate_Check(ts)), because all other input either has a resolution (numpy, our own timestamps) or is numeric.

For numeric input we can't just change the default resolution with which it is interpreted (since that would change the result, not just the resolution), but we could still cast the resulting nanoseconds to microseconds (if not out-of-bounds). But fine leaving that for a separate PR / discussion.
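A sketch of that suggestion (the as_unit cast illustrates the proposed follow-up, not current default behavior):

```python
import pandas as pd

# integers keep their "nanoseconds since the epoch" interpretation
res = pd.to_datetime([1_700_000_000_000_000_000])
print(res[0])  # Timestamp('2023-11-14 22:13:20')

# the suggestion: cast the resulting "ns" values to "us" afterwards,
# when they are in bounds for microseconds
print(res.as_unit("us").dtype)  # datetime64[us]
```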

@jbrockmendel (Member Author):

> For numeric input we can't just change the default resolution with which it is interpreted (since that would change the result, not just the resolution), but we could still cast the resulting nanoseconds to microseconds (if not out-of-bounds). But fine leaving that for a separate PR / discussion.

Very happy to leave that for separate discussion.

> I think that only means updating the datetime.date case (PyDate_Check(ts))

I think the "natural" resolution of a date object is "D", so it makes sense to treat this like an np.datetime64[D]. But if changing that is what it takes to get to "+1", I'll make my peace with it.
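The day-level resolution in question, with numpy for comparison (the "nearest supported unit" comment restates the PR description applied to "D"):

```python
import datetime
import numpy as np

d = datetime.date(2016, 1, 1)
print(np.datetime64(d).dtype)  # datetime64[D] -- numpy's natural unit for a date
# "D" is not a supported pandas unit, so under the datetime64-like treatment
# described above, the nearest supported unit ("s") would apply
```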

@Dr-Irv (Contributor) commented Oct 31, 2025

So my only concern here is if someone has code that converts datetime-like values to integers and assumes that those integers represent nanoseconds.

e.g. pd.Series([pd.Timestamp("2025-10-31 9:57")]).astype(int) will now return a different result.

If we go forward with this, then we need to say something about it in the docs and demonstrate via an example the best practice for converting such code.
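One possible pattern for such an example (a hedged sketch; the approach and naming are mine, not from this PR):

```python
import pandas as pd

ser = pd.Series([pd.Timestamp("2025-10-31 9:57")])

# unit-dependent: the integer depends on the Series' resolution
ambiguous = ser.astype("int64")

# unit-explicit: pin the resolution first, then convert
nanos = ser.astype("datetime64[ns]").astype("int64")
print(nanos[0])  # nanoseconds regardless of how ser was parsed
```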

@Dr-Irv (Contributor) left a comment

I'm wondering if in user_guide/timeseries.rst we should have some text saying that the integer representation of the underlying Timestamp should NOT be used in computation, and provide a recommendation of what to do instead. What's not clear to me: if you have a Timestamp and you use Timestamp.value, how do you know what the unit of the value is, in case you do want to do arithmetic on the integer representation?

@jbrockmendel (Member Author):

> What's not clear to me: if you have a Timestamp and you use Timestamp.value, how do you know what the unit of the value is, in case you do want to do arithmetic on the integer representation?

Timestamp.value already converts to nano
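For illustration (an in-bounds example; .value is defined in nanoseconds):

```python
import pandas as pd

ts = pd.Timestamp("2016-01-01").as_unit("ms")
print(ts.unit)   # ms
print(ts.value)  # 1451606400000000000 -- nanoseconds, regardless of unit
```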

@Dr-Irv (Contributor) commented Nov 3, 2025

> What's not clear to me: if you have a Timestamp and you use Timestamp.value, how do you know what the unit of the value is, in case you do want to do arithmetic on the integer representation?

> Timestamp.value already converts to nano

Unless it's out of bounds, in which case it converts to seconds.

So is the real issue here that if a user has .astype(int) in their current code, they might get different units than before? So maybe the docs should say "do not use .astype(int), only use .value"? Although to do that, I think we'd need to add a dt.value method to Series and DatetimeIndex.

Or should we warn on .astype(int) when applied to a Series that has a datetime dtype?

I guess I don't fully understand what kind of user code will break with this change.

@jbrockmendel (Member Author):

> Unless it's out of bounds, in which case it converts to seconds.

Where are you seeing this? I don't think it's accurate.

> should say "do not use .astype(int)

I argued for that long ago and lost that argument.

@Dr-Irv (Contributor) commented Nov 3, 2025

> Unless it's out of bounds, in which case it converts to seconds.

> Where are you seeing this? I don't think it's accurate.

>>> pd.Series([pd.Timestamp("3000-10-31 8:30")]).astype(int)
0    32529889800
dtype: int64
>>> pd.Series([pd.Timestamp("3000-10-31 8:30:02.04")]).astype(int)
0    32529889802040
dtype: int64

This is without this PR (using main). So it seems the integer value you get depends on the underlying unit as determined when the string is converted.

> should say "do not use .astype(int)

> I argued for that long ago and lost that argument.

Maybe we need to revisit? Especially considering my example above.

@jbrockmendel (Member Author):

> So it seems the integer value you get depends on the underlying unit

OK, that is about Series.astype(np.int64). The previous statement was about Timestamp.value.

> Maybe we need to revisit? Especially considering my example above.

I'm not opposed to that being discussed separately, but am averse to expanding the scope here.

I'm getting pretty burnt out over here. Going to tag out on this.

@Dr-Irv (Contributor) commented Nov 3, 2025

> I'm not opposed to that being discussed separately, but am averse to expanding the scope here.

> I'm getting pretty burnt out over here. Going to tag out on this.

I don't think you have to expand the scope. I just think we need to document that doing .astype(int) is dangerous with a Series containing datetime-like data. If you do that, I'm fine with moving this forward.

Would like to see what @rhshadrach thinks.

@rhshadrach (Member) commented Nov 3, 2025

> So is the real issue here that if a user has .astype(int) in their current code, they might get different units than before?

Yes, this is a breaking change. It seems to me it is worthwhile to do, and that there is no way to deprecate it that is worth the noise. As you've already demonstrated, the unit that one gets with strings is fragile on 2.3.x; it depends on the number of decimal digits in the input. If users are relying on .astype(int), they should be ensuring they have the right unit when doing so. In addition, I do not expect usage of .astype(int) to be very common. From my experience, the entire benefit of using datetime dtypes is to avoid having to do arithmetic with integers.

Also, pd.Series([pd.Timestamp("3000-10-31 8:30")]).astype(int) raises a TypeError on 2.3.x.

@rhshadrach (Member):

Posted my comment prior to seeing @Dr-Irv's request; no objection to adding a line in the documentation.

@jbrockmendel (Member Author):

I’ve already added documentation to that effect. If you want more, someone else needs to tag in.

@jorisvandenbossche (Member):

> Also, pd.Series([pd.Timestamp("3000-10-31 8:30")]).astype(int) raises a TypeError on 2.3.x.

Note that this is only because the series is object dtype. If you have an actual datetime64 series (e.g. pd.Series(pd.to_datetime(["2000-10-31 8:30"])).astype(int)), the conversion to int works, and it will give a different result in 3.0 compared to 2.3.
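Concretely (the dtype comments restate the version behavior described in this thread):

```python
import pandas as pd

ser = pd.Series(pd.to_datetime(["2000-10-31 8:30"]))
print(ser.dtype)            # 2.3.x: datetime64[ns]; with this PR: datetime64[us]
print(ser.astype("int64"))  # nanoseconds on 2.3.x, microseconds afterwards
```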

Now, we are repeating the discussion from #58989, though. We know that the choice to default to microseconds is a breaking change. Previously we therefore decided on a slower change cycle: only doing the breaking change in 4.0 and adding an opt-in for it in 3.x. Here we are keeping the breaking change directly in 3.0. We just need to make a decision on that.

If it is a matter of better documentation of the breaking change / best practices about what to do instead, I don't think that needs to be included in / block this PR. We can do follow-up PRs to improve the docs (current main already has breaking behaviour as well anyway).
