API: to_datetime strings default to microsecond #62801
Conversation
gentle ping @jorisvandenbossche @rhshadrach @mroeschke since this is a blocker for several 3.0-milestone issues
Do we have tests with multiple strings with different precision correctly returning nano if there's a nano string otherwise micro?
Didn't see one in a quick pass. Will add where appropriate.
+1
```python
if unit in ["s", "ms"]:
    # TODO: should _cast_pointwise_result attempt to preserve unit?
    xp = xp.dt.as_unit("us")
```
What is meant here by "preserve unit"? It seems to me the current rule is to surface the lowest resolution of the inputs, which I believe is sensible.
Consider:

```python
series_with_dt64_ms_unit.map(lambda x: pd.Timestamp("2016-01-01"))
```
What dtype would you expect the result to have? In main it will be "s". After this PR it would be "us". This comment is suggesting that since the original has "ms" unit, we might try to retain "ms" in the mapped result.
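To make the scenario above concrete, here is a minimal sketch (the variable names are illustrative; the resulting unit depends on the pandas version, as described above):

```python
import pandas as pd

# Build a Series with an explicit "ms" unit.
ser = pd.Series([pd.Timestamp("2016-01-01")]).astype("datetime64[ms]")

# Map every element to a fresh Timestamp; the result's unit comes from
# how pandas infers the callable's return values, not from ser's "ms" unit.
result = ser.map(lambda x: pd.Timestamp("2016-01-01"))
print(result.dtype)  # datetime64[...] -- exact unit varies by pandas version
```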
I don't think we should let the input influence the result here. If we did, then the result of
```python
pd.Series([pd.Timestamp("2016-01-01").as_unit("ms")]).map(lambda x: x.as_unit("us"))
```

would also be "ms".
OK. Will remove this comment next time I push.
This is a bit of a different case, though?
In your example with lambda x: pd.Timestamp("2016-01-01"), the return value is an actual timestamp with a unit, that we might ignore to coerce to the original unit (but agreed we shouldn't do that in a context of map).
But the test case here is about combining a timestamp and a string. At that point, I think it could make sense to be flexible in how the string is parsed to a timestamp, so as to preserve the unit. Although no strong opinion about it.
That said, what I was actually wondering in this case: if we are combining timestamps and strings, shouldn't that give object dtype as a result instead?
Like concat of timestamps and strings also gives object. Or do we consider combine_first more like a setitem operation? But then it would actually make sense to preserve the unit of the left side, as we would do for left[mask] = right[mask]
> That said, what I was actually wondering in this case: if we are combining timestamps and strings, shouldn't that give object dtype as a result instead?
Yah, this is driven by a line that I think is weird/bad in Series.combine_first:
```python
if this.dtype.kind == "M" and other.dtype.kind != "M":
    # TODO: try to match resos?
    other = to_datetime(other)
```
when we get here in this test we have `other = Series({1: "2011"})`, i.e. `dtype='str'`
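A minimal reproduction of the path being discussed (the values are illustrative; note the `to_datetime` call here is slated for deprecation in #62931):

```python
import pandas as pd

# left has datetime64 dtype; right is a string Series, as in the test case above
left = pd.Series([pd.Timestamp("2016-01-01"), pd.NaT])
right = pd.Series({1: "2011"})

# combine_first currently runs `right` through to_datetime before filling,
# so the result stays datetime64 rather than becoming object dtype
out = left.combine_first(right)
print(out.dtype)
```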
opened #62931 to deprecate this to_datetime call
@jbrockmendel can you clarify a bit more what the PR exactly does? The title mentions "to_datetime strings", so does that mean other input stays as it was?
Updated OP, LMK if it is still unclear.
Thanks! For numeric input we can't just change the default resolution with which it is interpreted (since that would change the result, not just the resolution), but we could still cast the resulting nanoseconds to microseconds (if not out-of-bounds). But fine leaving that for a separate PR / discussion.
Very happy to leave that for separate discussion.
I think the "natural" resolution on a date object is "D", so it makes sense to treat this like a
So my only concern here is if someone has code that converts time objects to integers and assumes that those integers represent nanoseconds. e.g. If we go forward with this, then we need to say something about this in the docs, and demonstrate via an example what the best practice is for converting such code.
I'm wondering if in user_guide/timeseries.rst, we should have some text that says that the integer representation of the underlying Timestamp should NOT be used in computation, and provide a recommendation of what to do instead. What's not clear to me is: if you have a Timestamp and you use Timestamp.value, how do you know what the unit of the value is, in case you do want to do arithmetic on the integer representation?
Timestamp.value already converts to nano
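A quick sketch of that point (assuming pandas >= 2.0, where non-nanosecond units and `Timestamp.as_unit` exist):

```python
import pandas as pd

ts = pd.Timestamp("2016-01-01 00:00:01").as_unit("s")
print(ts.unit)   # "s"
print(ts.value)  # nanoseconds since the epoch, regardless of ts.unit
```

So `.value` has a fixed meaning (nanoseconds), as long as the timestamp fits within nanosecond bounds.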
Unless it's out of bounds, it converts to seconds. So is the real issue here that if a user has … Or should we warn on … I guess I don't fully understand what kind of user code will break with this change.
Where are you seeing this? I don't think it's accurate.
I argued for that long ago and lost that argument. |
```python
>>> pd.Series([pd.Timestamp("3000-10-31 8:30")]).astype(int)
0    32529889800
dtype: int64
>>> pd.Series([pd.Timestamp("3000-10-31 8:30:02.04")]).astype(int)
0    32529889802040
dtype: int64
```

This is without this PR (using main). So it seems the integer value you get depends on the underlying unit as determined when the string is converted.
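The same unit dependence can be shown with in-bounds values by setting the unit explicitly (a sketch assuming pandas >= 2.0; `.dt.as_unit` is used to normalize before the integer conversion):

```python
import pandas as pd

# astype(int) returns the raw integer value in the underlying unit:
ser = pd.Series([pd.Timestamp("2016-01-01")]).astype("datetime64[s]")
print(ser.astype("int64").iloc[0])                   # 1451606400 (seconds)
print(ser.dt.as_unit("ns").astype("int64").iloc[0])  # 1451606400000000000 (nanoseconds)
```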
Maybe we need to revisit? Especially considering my example above.
OK, that is about Series.astype(np.int64). The previous statement was about Timestamp.value.
I'm not opposed to that being discussed separately, but am averse to expanding the scope here. I'm getting pretty burnt out over here. Going to tag out on this.
I don't think you have to expand scope. I just think we need to document that doing … Would like to see what @rhshadrach thinks.
Yes - this is a breaking change. It seems to me this is worthwhile to do, and that there is no way to deprecate it that is worth the noise. As you've already demonstrated, the unit that one gets with strings is fragile on 2.3.x - it depends on the number of decimal digits in the input. If users are relying on … Also, …
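One way to make user code robust to the unit-inference change (a suggested pattern, not part of this PR: pin the unit explicitly before integer arithmetic):

```python
import pandas as pd

ser = pd.to_datetime(pd.Series(["2016-01-01"]))
# Whatever resolution string parsing inferred, normalize it before
# converting to integers, so the values always mean nanoseconds:
ns = ser.dt.as_unit("ns").astype("int64")
print(ns.iloc[0])  # 1451606400000000000
```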
Posted my comment prior to seeing @Dr-Irv's request - no objection to adding a line in the documentation.
I’ve already added documentation to that effect. If you want more, someone else needs to tag in.
Note that this is only because the series is object dtype. If you have an actual datetime64 series (eg …)

Now, we are repeating the discussion from #58989 though. We know that the choice to default to microseconds is a breaking change. Previously we decided to use a slower change cycle, by only doing the breaking change in 4.0 and adding an opt-in for it in 3.x. Here we are keeping the breaking change directly in 3.0. We just need to make a decision on that.

If it is a matter of better documentation of the breaking change / best practices about what to do instead, I don't think that needs to be included in / block this PR. We can do follow-up PRs to improve the docs (current main already has breaking behaviour as well anyway).
As discussed in last week's dev call, this PR changes the to_datetime, DatetimeIndex, and Timestamp constructor behavior when they see strings: a "us" unit is inferred in cases where they would previously (in main, not a released version) infer either "s" or "ms". Cases with nano precision stay nano.
Non-string cases are unchanged (again, unchanged from main): np.datetime64 objects retain their resolution (or the nearest supported one), pydatetime objects stay "us", and ints and floats still end up with "ns".
For all-string cases this is similar to the OP suggestion from #58989. Most users will end up with a "us" unit when calling e.g. read_csv.