How to Mathematically Choose the Best Bins for a Histogram

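Optimal resolution in histograms: a rigorous Bayesian approach to density fitting. The post How to Mathematically Choose the Best Bins for a Histogram first appeared on Towards Data Science.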

Source: Towards Data Science

Have you ever wondered how to choose your bins in a histogram? Have you ever asked yourself whether there is a deeper reason for the choice beyond it simply looking good? Histograms are the most fundamental tool for data visualization, and their resolution matters, especially when the histogram itself feeds into further analyses. Histograms are often computed to visualize the density of the data. In this post, we explore the mathematics of density fitting, specifically looking at how bins should shrink as our dataset grows. Inspired by adjacent fields such as perturbation theory in physics and Taylor expansions in mathematics, we will find a rigorous method for constructing densities.

All images are by the author

Background

Approximations

The intuition is simple: the more data you have, the more detail you should be able to see. If you are looking at a sample of ten observations, two or three wide bins are likely all you can afford before your visualization becomes a sparse collection of empty gaps. But if you have ten million observations, those wide bins start to feel like a low-resolution, pixelated photo. You want to “zoom in” by increasing the number of bins. The question, however, is: How exactly should we scale this resolution?
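As a point of reference, classical rules of thumb already encode some version of this scaling. The sketch below is not the Bayesian method this post develops. It only asks NumPy's built-in `sturges` and `fd` (Freedman-Diaconis) estimators how many bins they would pick for samples of increasing size. The helper name `bin_counts`, the standard-normal test data, and the seed are assumptions made for illustration.

```python
import numpy as np

def bin_counts(n, seed=0):
    """Bins chosen by two classical rules for n standard-normal draws.

    Hypothetical helper for illustration; returns (sturges, freedman_diaconis).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # histogram_bin_edges returns the edges, so the bin count is len(edges) - 1.
    sturges = len(np.histogram_bin_edges(x, bins="sturges")) - 1
    fd = len(np.histogram_bin_edges(x, bins="fd")) - 1
    return sturges, fd

for n in (10, 1_000, 100_000):
    print(n, bin_counts(n))
```

Both rules add bins as the sample grows, but at different rates: Sturges grows logarithmically in the sample size, while the Freedman-Diaconis bin width shrinks like the inverse cube root of the sample size. That disagreement is exactly why a principled scaling argument is worth having.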

In physics, when we face a system that is too complex to solve exactly, we often turn to Perturbation Theory. In Quantum Electrodynamics (QED), for example, we approximate complex interactions by expanding them in terms of a small coupling constant, like the electron charge e. This “interaction strength” provides a natural hierarchy for our approximations. But for a histogram, what is the analogous “charge”? Is there a fundamental parameter that governs the interaction between our discrete data points and the underlying distribution we are trying to estimate?

Information Theory

Priors & Posteriors

𝑃(𝜃|𝑋) = 𝑃(𝑋|𝜃)𝑃(𝜃|ℳ)/𝑃(𝑋)
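One way to read the formula above, assuming every term is implicitly conditioned on the model ℳ, is to write the conditioning out explicitly. The denominator then becomes the model evidence, obtained by integrating the parameters out. This expanded notation is an interpretation and may not match the author's exact conventions.

```latex
P(\theta \mid X, \mathcal{M})
  = \frac{P(X \mid \theta, \mathcal{M})\, P(\theta \mid \mathcal{M})}{P(X \mid \mathcal{M})},
\qquad
P(X \mid \mathcal{M})
  = \int P(X \mid \theta, \mathcal{M})\, P(\theta \mid \mathcal{M})\, d\theta
```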

The most-likely model given the data versus model weighting

𝑃(ℳᵢ ∣ 𝑋) ∝ 𝑃(𝑋 ∣ ℳᵢ) 𝑃(ℳᵢ)
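To make the model-weighting relation concrete, here is a minimal sketch that treats "a histogram with m equal-width bins" as model ℳ_m and computes the posterior over m. Several choices are assumptions made for illustration and are not taken from the article: a flat Dirichlet(1, ..., 1) prior on the bin probabilities, a uniform prior over the candidate bin counts, and a histogram range fixed to the observed minimum and maximum. The function names `log_evidence` and `model_posterior` are hypothetical.

```python
import numpy as np
from math import lgamma, log

def log_evidence(x, m, lo, hi):
    """log P(X | M_m) for an m-bin equal-width histogram density on [lo, hi].

    Assumes a flat Dirichlet(1, ..., 1) prior on the bin probabilities
    (an illustrative choice, not necessarily the article's prior).
    """
    n = len(x)
    counts, _ = np.histogram(x, bins=m, range=(lo, hi))
    # The density in bin k is pi_k * m / (hi - lo). Integrating the pi_k out
    # against the flat Dirichlet prior gives
    #   Gamma(m) * prod_k Gamma(n_k + 1) / Gamma(n + m).
    return (n * log(m / (hi - lo))
            + lgamma(m) - lgamma(n + m)
            + sum(lgamma(int(c) + 1) for c in counts))

def model_posterior(x, candidates):
    """P(M_m | X) over candidate bin counts, with a uniform prior P(M_m)."""
    lo, hi = float(np.min(x)), float(np.max(x))
    log_ev = np.array([log_evidence(x, m, lo, hi) for m in candidates])
    w = np.exp(log_ev - log_ev.max())  # subtract the max for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
candidates = list(range(1, 101))
for n in (50, 5_000):
    post = model_posterior(rng.normal(size=n), candidates)
    print(n, "most probable bin count:", candidates[int(np.argmax(post))])
```

With a uniform model prior, the posterior is just the normalized evidence, so the most probable bin count is the one that best balances fit against the cost of extra parameters. Taking the range from the data itself is a shortcut; a stricter treatment would fix the support in advance.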

Densities