Class TDigest

java.lang.Object
com.tdunning.math.stats.TDigest
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
AbstractTDigest

public abstract class TDigest extends Object implements Serializable
Adaptive histogram based on something like streaming k-means crossed with Q-digest. The special characteristics of this algorithm are: - smaller summaries than Q-digest - works on doubles as well as integers. - provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles - fast - simple - test coverage roughly at 90% - easy to adapt for use with map-reduce
See Also:
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    (package private) double
     
    (package private) double
     
  • Constructor Summary

    Constructors
    Constructor
    Description
     
  • Method Summary

    Modifier and Type
    Method
    Description
    abstract void
    add(double x)
    Add a sample to this TDigest.
    abstract void
    add(double x, int w)
    Adds a sample to a histogram.
    abstract void
    add(TDigest other)
    Add all of the centroids of another TDigest to this one.
    abstract void
    add(List<? extends TDigest> others)
     
    abstract void
    Serialize this TDigest into a byte buffer.
    abstract void
    Serialize this TDigest into a byte buffer.
    abstract int
    Returns the number of bytes required to encode this TDigest using #asBytes().
    abstract double
    cdf(double x)
    Returns the fraction of all points added which are <= x.
    abstract int
     
    A Collection that lets you go through the centroids in ascending order by mean.
    (package private) final void
    checkValue(double x)
     
    abstract void
    Re-examines a t-digest to determine whether some centroids are redundant.
    abstract double
    Returns the current compression factor.
    static TDigest
    createAvlTreeDigest(double compression)
    Creates an AVLTreeDigest.
    static TDigest
    createDigest(double compression)
    Creates a TDigest of whichever type is the currently recommended type.
    static TDigest
    createMergingDigest(double compression)
    Creates an MergingDigest.
    double
     
    double
     
    abstract boolean
     
    abstract double
    quantile(double q)
    Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
    abstract TDigest
    Tell this TDigest to record the original data as much as possible for test purposes.
    (package private) void
    setMinMax(double min, double max)
    Over-ride the min and max values for testing purposes
    abstract long
    Returns the number of points that have been added to this TDigest.
    abstract int
    Returns the number of bytes required to encode this TDigest using #asSmallBytes().

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • min

      double min
    • max

      double max
  • Constructor Details

    • TDigest

      public TDigest()
  • Method Details

    • createMergingDigest

      public static TDigest createMergingDigest(double compression)
      Creates an MergingDigest. This is generally the best known implementation right now.
      Parameters:
      compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
      Returns:
      the MergingDigest
    • createAvlTreeDigest

      public static TDigest createAvlTreeDigest(double compression)
      Creates an AVLTreeDigest. AVLTreeDigest is nearly the best known implementation right now.
      Parameters:
      compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
      Returns:
      the AvlTreeDigest
    • createDigest

      public static TDigest createDigest(double compression)
      Creates a TDigest of whichever type is the currently recommended type. MergingDigest is generally the best known implementation right now.
      Parameters:
      compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
      Returns:
      the TDigest
    • add

      public abstract void add(double x, int w)
      Adds a sample to a histogram.
      Parameters:
      x - The value to add.
      w - The weight of this point.
    • checkValue

      final void checkValue(double x)
    • add

      public abstract void add(List<? extends TDigest> others)
    • compress

      public abstract void compress()
      Re-examines a t-digest to determine whether some centroids are redundant. If your data are perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space. The cost is roughly the same as adding as many data points as there are centroids. This is typically < 10 * compression, but could be as high as 100 * compression. This is a destructive operation that is not thread-safe.
    • size

      public abstract long size()
      Returns the number of points that have been added to this TDigest.
      Returns:
      The sum of the weights on all centroids.
    • cdf

      public abstract double cdf(double x)
      Returns the fraction of all points added which are <= x.
    • quantile

      public abstract double quantile(double q)
      Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
      Parameters:
      q - The desired fraction
      Returns:
      The value x such that cdf(x) == q
    • centroids

      public abstract Collection<Centroid> centroids()
      A Collection that lets you go through the centroids in ascending order by mean. Centroids returned will not be re-used, but may or may not share storage with this TDigest.
      Returns:
      The centroids in the form of a Collection.
    • compression

      public abstract double compression()
      Returns the current compression factor.
      Returns:
      The compression factor originally used to set up the TDigest.
    • byteSize

      public abstract int byteSize()
      Returns the number of bytes required to encode this TDigest using #asBytes().
      Returns:
      The number of bytes required.
    • smallByteSize

      public abstract int smallByteSize()
      Returns the number of bytes required to encode this TDigest using #asSmallBytes(). Note that this is just as expensive as actually compressing the digest. If you don't care about time, but want to never over-allocate, this is fine. If you care about compression and speed, you pretty much just have to overallocate by using allocating #byteSize() bytes.
      Returns:
      The number of bytes required.
    • asBytes

      public abstract void asBytes(ByteBuffer buf)
      Serialize this TDigest into a byte buffer. Note that the serialization used is very straightforward and is considerably larger than strictly necessary.
      Parameters:
      buf - The byte buffer into which the TDigest should be serialized.
    • asSmallBytes

      public abstract void asSmallBytes(ByteBuffer buf)
      Serialize this TDigest into a byte buffer. Some simple compression is used such as using variable byte representation to store the centroid weights and using delta-encoding on the centroid means so that floats can be reasonably used to store the centroid means.
      Parameters:
      buf - The byte buffer into which the TDigest should be serialized.
    • recordAllData

      public abstract TDigest recordAllData()
      Tell this TDigest to record the original data as much as possible for test purposes.
      Returns:
      This TDigest so that configurations can be done in fluent style.
    • isRecording

      public abstract boolean isRecording()
    • add

      public abstract void add(double x)
      Add a sample to this TDigest.
      Parameters:
      x - The data value to add
    • add

      public abstract void add(TDigest other)
      Add all of the centroids of another TDigest to this one.
      Parameters:
      other - The other TDigest
    • centroidCount

      public abstract int centroidCount()
    • getMin

      public double getMin()
    • getMax

      public double getMax()
    • setMinMax

      void setMinMax(double min, double max)
      Over-ride the min and max values for testing purposes