C# Dictionary 的实际工作原理

作者:微信公众号:【架构师老卢】
9-17 18:11
121

Dictionary<TKey, TValue>是 C# 中非常流行的数据结构,也是面试问题的热门选择。我已经使用了 10 亿次,我非常确定我了解它们的工作原理。但是,当我更深入地研究它们并检查实际代码时,我发现它们比我想象的还要好(也许您也是如此)。在本文中,我们将一起进行深入研究,甚至编写我们自己的词典教育副本。所以和我一起开始吧!Dictionary

为了确保我们的副本与实际代码匹配,我们将首先探索原始代码中的内容,然后删除所有不必要的内容,并在需要的地方添加日志 ()。只需复制两个主要方法就足够了:AddGetValueOrDefault 来重新创建 的所有基本要素,因此这就是我们要执行的操作。但首先,让我们看看 a 中的字段 :Console.WriteLineDictionaryDictionary

我将使用参考源中的 .NET Framework 4.8 中的代码。尽管现代 .NET 代码稍微复杂一些,但基本要素仍然相同,因此 Framework 版本应该可以。

private struct Entry {  
    public int hashCode;  
    public int next;  
    public TKey key;  
    public TValue value;  
}  
   
private int[] buckets;  
private Entry[] entries;  
private int count;  
private int version;  
private int freeList;  
private int freeCount;  
private IEqualityComparer<TKey> comparer;

和 properties 是优化技术的一部分,对于 a 的运行不是必需的 - 我们将把它们传递给我们的 replica。 只是一个 changes 计数器,我们也会传递它。为简单起见,我们还将使用而不是允许外部配置。添加打印当前状态(所有属性的值)的方法,我们将得到:freeListfreeCountDictionaryversionEqualityComparer<TKey>.Default

public class EducationalDictionary<TKey, TValue>
{
    private struct Entry {
        public int hashCode;
        public int next;
        public TKey key;
        public TValue? value;
        
        string ValueString => value == null ? "null" : value!.ToString()!;
        public override string ToString()
        {
            return $"{key} - {ValueString} + (next = {next})";
        }
    }
 
    private int[] buckets;
    private Entry[] entries;
    private int count;
    private readonly EqualityComparer<TKey> comparer = EqualityComparer<TKey>.Default;
    public void PrintFullState(string preface)
    {
        Console.WriteLine();
        Console.Write(this.ToString(preface));
    }
    public string ToString(string preface)
    {
        StringBuilder result = new();
        
        result.AppendLine($"{preface}, state:");
        result.AppendLine();
        result.AppendLine($"buckets: [{String.Join(", ", buckets!)}]");
        result.AppendLine($"entries:");
        for (int i = 0; i < entries.Length; i++)
        {
            result.AppendLine($" [{i}] = {entries[i]}");
        }
        result.AppendLine($"count: {count}");
        return result.ToString();
    }
}

源代码中的方法实质上是对另一个方法的调用:Add

public void Add(TKey key, TValue value) {  
    Insert(key, value, true);  
}

该方法可能看起来很可怕,但别担心,我们会一点一点地剖析它:Insert

private void Insert(TKey key, TValue value, bool add) {
    if( key == null ) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
    }
 
    if (buckets == null) Initialize(0);
    int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
    int targetBucket = hashCode % buckets.Length;
 
#if FEATURE_RANDOMIZED_STRING_HASHING
    int collisionCount = 0;
#endif
 
    for (int i = buckets[targetBucket]; i >= 0; i = entries[i].next) {
        if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) {
            if (add) { 
                ThrowHelper.ThrowArgumentException(ExceptionResource.Argument_AddingDuplicate);
            }
            entries[i].value = value;
            version++;
            return;
        } 
 
#if FEATURE_RANDOMIZED_STRING_HASHING
                collisionCount++;
#endif
    }
    int index;
    if (freeCount > 0) {
        index = freeList;
        freeList = entries[index].next;
        freeCount--;
    }
    else {
        if (count == entries.Length)
        {
            Resize();
            targetBucket = hashCode % buckets.Length;
        }
        index = count;
        count++;
    }
 
    entries[index].hashCode = hashCode;
    entries[index].next = buckets[targetBucket];
    entries[index].key = key;
    entries[index].value = value;
    buckets[targetBucket] = index;
    version++;
 
#if FEATURE_RANDOMIZED_STRING_HASHING
 
#if FEATURE_CORECLR
    // In case we hit the collision threshold we'll need to switch to the comparer which is using randomized string hashing
    // in this case will be EqualityComparer<string>.Default.
    // Note, randomized string hashing is turned on by default on coreclr so EqualityComparer<string>.Default will 
    // be using randomized string hashing
 
    if (collisionCount > HashHelpers.HashCollisionThreshold && comparer == NonRandomizedStringEqualityComparer.Default) 
    {
        comparer = (IEqualityComparer<TKey>) EqualityComparer<string>.Default;
        Resize(entries.Length, true);
    }
#else
    if(collisionCount > HashHelpers.HashCollisionThreshold && HashHelpers.IsWellKnownEqualityComparer(comparer)) 
    {
        comparer = (IEqualityComparer<TKey>) HashHelpers.GetRandomizedEqualityComparer(comparer);
        Resize(entries.Length, true);
    }
#endif // FEATURE_CORECLR
 
#endif 
}

它还调用其他方法:

private void Initialize(int capacity) {
    int size = HashHelpers.GetPrime(capacity);
    buckets = new int[size];
    for (int i = 0; i < buckets.Length; i++) buckets[i] = -1;
    entries = new Entry[size];
    freeList = -1;
}
private void Resize(int newSize, bool forceNewHashCodes) {
    Contract.Assert(newSize >= entries.Length);
    int[] newBuckets = new int[newSize];
    for (int i = 0; i < newBuckets.Length; i++) newBuckets[i] = -1;
    Entry[] newEntries = new Entry[newSize];
    Array.Copy(entries, 0, newEntries, 0, count);
    if(forceNewHashCodes) {
        for (int i = 0; i < count; i++) {
            if(newEntries[i].hashCode != -1) {
                newEntries[i].hashCode = (comparer.GetHashCode(newEntries[i].key) & 0x7FFFFFFF);
            }
        }
    }
    for (int i = 0; i < count; i++) {
        if (newEntries[i].hashCode >= 0) {
            int bucket = newEntries[i].hashCode % newSize;
            newEntries[i].next = newBuckets[bucket];
            newBuckets[bucket] = i;
        }
    }
    buckets = newBuckets;
    entries = newEntries;
}

最后,它们调用 上的方法。我们将复制这些,因为如果您不处理边缘情况,它们将非常微不足道。我们的版本将如下所示:HashHelpers

public static class HashHelpers
{
    public static readonly int[] primes = {
        3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
        1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
        17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
        187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
        1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};
    
    public static int GetPrime(int min)
    {
        foreach (var prime in primes)
        {
            if (prime >= min) return prime;
        }

        return min;
    }
    
    public static int ExpandPrime(int oldSize)
    {
        return GetPrime(2 * oldSize);
    }
}

因为我们将在调整大小完成后删除 logic 和 print state 。老实说,代码并不是很重要,它本质上是重新排列和针对新大小。这是我们的结果:Resizeif (forceNewHashCodes)entriesbuckets

private void Resize(int newSize) {
    var newBuckets = new int[newSize];
    for (var i = 0; i < newBuckets.Length; i++) newBuckets[i] = -1;
    var newEntries = new Entry[newSize];
    Array.Copy(entries, 0, newEntries, 0, count);

    for (var i = 0; i < count; i++) {
        if (newEntries[i].hashCode >= 0) {
            var bucket = newEntries[i].hashCode % newSize;
            newEntries[i].next = newBuckets[bucket];
            newBuckets[bucket] = i;
        }
    }
    buckets = newBuckets;
    entries = newEntries;
    
    PrintFullState("\u2194\ufe0f Resize");
}

我们将初始化逻辑从 straight 移动到构造函数(因为它将更容易处理可空性):if (buckets == null) Initialize(0);

以及初始化完成后的打印状态

public EducationalDictionary()
{
    var size = HashHelpers.GetPrime(0);
    buckets = new int[size];
    for (var i = 0; i < buckets.Length; i++) buckets[i] = -1;
    entries = new Entry[size];
    
    PrintFullState("\ud83d\ude80 Initialized");
}

以下是我们将执行的更改列表:Add

  • 删除参数验证
  • 删除#FEATURE_RANDOMIZED_STRING_HASHING
  • 删除 logic under,因为我们还是删除了参数if (freeCount > 0)
  • 删除 loop,检查已设置的 key:for (int i = buckets[targetBucket]; i >= 0; i = entries[i].next) {
  • 添加后的打印状态
public void Add(TKey key, TValue value) {
    var hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
    var targetBucket = hashCode % buckets!.Length;
    
    if (count == entries!.Length)
    {
        Resize(HashHelpers.ExpandPrime(count));
        targetBucket = hashCode % buckets.Length;
    }
    
    var index = count;
    count++;
    
    entries[index].hashCode = hashCode;
    entries[index].next = buckets[targetBucket];
    entries[index].key = key;
    entries[index].value = value;
    buckets[targetBucket] = index;
    
    Console.WriteLine();
    PrintFullState($"\ud83d\udce5 Add: {key} - {value}. hashCode = {hashCode}, targetBucket = {targetBucket}");
}

现在看起来简单多了,对吧?现在让我们完成我们的类,实现 .GetValueOrDefault

获取价值

这一次的实现相当简单,甚至在原始代码中也是如此:GetValueOrDefault

internal TValue GetValueOrDefault(TKey key) {
    int i = FindEntry(key);
    if (i >= 0) {
        return entries[i].value;
    }
    return default(TValue);
}

在搜索本身(在方法中):FindEntry

private int FindEntry(TKey key) {
    if( key == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
    }
 
    if (buckets != null) {
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next) {
            if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) return i;
        }
    }
    return -1;
}

这里没有太多需要清理的地方。但是我们将添加大量日志,因为这是我们最重要的逻辑

private int FindEntry(TKey key)
{
    var hashCode = comparer.GetHashCode(key!) & 0x7FFFFFFF;
    var initialBucketIndex = hashCode % buckets.Length;
    
    Console.WriteLine();
    Console.WriteLine($"\ud83d\udd0e Search. Key = {key}. Initial Bucket Index {initialBucketIndex}");
    
    for (var i = buckets[initialBucketIndex]; i >= 0; i = entries[i].next)
    {
        Console.WriteLine();
        Console.WriteLine($"Comparing key from entries[{i}] ({entries[i]}) to {key}");
        
        if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key))
        {
            Console.WriteLine($"Key is equal returning {i}");
            
            return i;
        }
        else {
            Console.WriteLine($"Key is not equal, moving to the next linked index ({entries[i].next})");
        }
    }
    
    Console.WriteLine();
    Console.WriteLine("Search exit condition met (i >= 0). Returning -1 (as not found)");
    return -1;
}

这样就完成了我们的 .现在让我们使用它并查看我们准备的详细日志!EducationalDictionary

执行测试

我们将添加 4 条记录,搜索每条记录,外加一个不存在的键,该键应返回 ,表示找不到它。测试可能如下所示:null

[TestMethod]
public void FindsMultipleKeys()
{
    var dict = new EducationalDictionary<int, string>();
    dict.Add(48, "John");
    dict.Add(34, "Josh");
    dict.Add(22, "Jack");
    dict.Add(11, "Alex");

    dict.GetValueOrDefault(48).Should().Be("John");
    dict.GetValueOrDefault(34).Should().Be("Josh");
    dict.GetValueOrDefault(22).Should().Be("Jack");
    dict.GetValueOrDefault(11).Should().Be("Alex");
    dict.GetValueOrDefault(50).Should().Be(null);
}

这是我们从运行测试中得到的日志:

Initialized, state:
buckets: [-1, -1, -1]
entries:
 [0] = 0 - null + (next = 0)
 [1] = 0 - null + (next = 0)
 [2] = 0 - null + (next = 0)
count: 0

Add: 48 - John. hashCode = 48, targetBucket = 0, state:
buckets: [0, -1, -1]
entries:
 [0] = 48 - John + (next = -1)
 [1] = 0 - null + (next = 0)
 [2] = 0 - null + (next = 0)
count: 1

Add: 34 - Josh. hashCode = 34, targetBucket = 1, state:
buckets: [0, 1, -1]
entries:
 [0] = 48 - John + (next = -1)
 [1] = 34 - Josh + (next = -1)
 [2] = 0 - null + (next = 0)
count: 2

Add: 22 - Jack. hashCode = 22, targetBucket = 1, state:
buckets: [0, 2, -1]
entries:
 [0] = 48 - John + (next = -1)
 [1] = 34 - Josh + (next = -1)
 [2] = 22 - Jack + (next = 1)
count: 3

Resize, state:
buckets: [-1, 2, -1, -1, -1, -1, 1]
entries:
 [0] = 48 - John + (next = -1)
 [1] = 34 - Josh + (next = 0)
 [2] = 22 - Jack + (next = -1)
 [3] = 0 - null + (next = 0)
 [4] = 0 - null + (next = 0)
 [5] = 0 - null + (next = 0)
 [6] = 0 - null + (next = 0)
count: 3

Add: 11 - Alex. hashCode = 11, targetBucket = 4, state:
buckets: [-1, 2, -1, -1, 3, -1, 1]
entries:
 [0] = 48 - John + (next = -1)
 [1] = 34 - Josh + (next = 0)
 [2] = 22 - Jack + (next = -1)
 [3] = 11 - Alex + (next = -1)
 [4] = 0 - null + (next = 0)
 [5] = 0 - null + (next = 0)
 [6] = 0 - null + (next = 0)
count: 4

Search. Key = 48. Initial Bucket Index 6

Comparing key from entries[1] (34 - Josh + (next = 0)) to 48
Key is not equal, moving to the next linked index (0)

Comparing key from entries[0] (48 - John + (next = -1)) to 48
Key is equal returning 0

Search. Key = 34. Initial Bucket Index 6

Comparing key from entries[1] (34 - Josh + (next = 0)) to 34
Key is equal returning 1

Search. Key = 22. Initial Bucket Index 1

Comparing key from entries[2] (22 - Jack + (next = -1)) to 22
Key is equal returning 2

Search. Key = 11. Initial Bucket Index 4

Comparing key from entries[3] (11 - Alex + (next = -1)) to 11
Key is equal returning 3

Search. Key = 50. Initial Bucket Index 1

Comparing key from entries[2] (22 - Jack + (next = -1)) to 50
Key is not equal, moving to the next linked index (-1)

Search exit condition met (i >= 0). Returning -1 (as not found)

有了代码和日志,让我们把我们发现的 A 作品Dictionary

通过上面的代码和日志,我们可以解释 a 如何搜索值:Dictionary

  1. A 存储一组 和 。Dictionarybucketsentries
  2. 存储桶的索引 — - 是根据 key 的哈希码计算的。i
  3. 的值包含bucket[i]entries
  4. 匹配的条目要么与键匹配,要么在链接链中具有另一个匹配键的条目(或者该键在字典中不存在)
  5. 要查找实际值,则按链接循环,直到找到匹配的键或停在死胡同( 或 in )处。Dictionary0-1next

使用该算法,搜索和插入操作都非常快,并且不需要太多内存!

相关留言评论
昵称:
邮箱:
阅读排行