2023-12-24

Apache Paimon File Layout Design

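All files of a table are stored under a single base directory, and Paimon organizes them in layers. Starting from a snapshot file, every record in the table can be reached recursively.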
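The layering can be pictured as a handful of path rules under the table's base directory. The sketch below is only an illustration: the class `TableLayoutSketch` is hypothetical, and the `schema` directory plus the `snapshot-<id>`, `schema-<id>` and `bucket-<n>` names are assumptions based on how Paimon tables commonly look on disk, not something this post specifies.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that mirrors the layout described in this post.
// It is not part of the Paimon API; directory and file names are assumptions.
final class TableLayoutSketch {
    private final Path tableRoot;

    TableLayoutSketch(String tableRoot) {
        this.tableRoot = Paths.get(tableRoot);
    }

    // Snapshot files live in the snapshot directory.
    Path snapshotFile(long snapshotId) {
        return tableRoot.resolve("snapshot").resolve("snapshot-" + snapshotId);
    }

    // Schema files are referenced from a snapshot by schema id.
    Path schemaFile(long schemaId) {
        return tableRoot.resolve("schema").resolve("schema-" + schemaId);
    }

    // Manifest lists and manifest files share the manifest directory.
    Path manifestFile(String manifestName) {
        return tableRoot.resolve("manifest").resolve(manifestName);
    }

    // Data files are grouped by partition, then by bucket.
    Path bucketDir(String partition, int bucket) {
        return tableRoot.resolve(partition).resolve("bucket-" + bucket);
    }

    public static void main(String[] args) {
        TableLayoutSketch layout = new TableLayoutSketch("warehouse/my_table");
        System.out.println(layout.snapshotFile(1));
        System.out.println(layout.schemaFile(0));
        System.out.println(layout.bucketDir("dt=20231224", 0));
    }
}
```

A reader would start at `snapshotFile(id)`, follow the manifest list names stored in it, and end up in the bucket directories that hold the data files.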

Snapshot Files

All snapshot files are stored in the snapshot directory. A snapshot file is a JSON file that contains information about the snapshot:

The schema file in use
The manifest lists that contain all changes of this snapshot

public class Snapshot {
    private final Integer version;

    private final long id;

    private final long schemaId;

    // a manifest list recording all changes from the previous snapshots
    private final String baseManifestList;

    // a manifest list recording all new changes occurred in this snapshot
    // for faster expire and streaming reads
    private final String deltaManifestList;

    // a manifest list recording all changelog produced in this snapshot
    // null if no changelog is produced, or for paimon <= 0.2
    private final String changelogManifestList;

    // a manifest recording all index files of this table
    // null if no index file
    private final String indexManifest;

    private final String commitUser;

    // Mainly for snapshot deduplication.
    //
    // If multiple snapshots have the same commitIdentifier, reading from any of these snapshots
    // must produce the same table.
    //
    // If snapshot A has a smaller commitIdentifier than snapshot B, then snapshot A must be
    // committed before snapshot B, and thus snapshot A must contain older records than snapshot B.
    private final long commitIdentifier;

    private final CommitKind commitKind;

    private final long timeMillis;

    private final Map<Integer, Long> logOffsets;

    // record count of all changes occurred in this snapshot
    // null for paimon <= 0.3
    private final Long totalRecordCount;

    // record count of all new changes occurred in this snapshot
    // null for paimon <= 0.3
    private final Long deltaRecordCount;

    // record count of all changelog produced in this snapshot
    // null for paimon <= 0.3
    private final Long changelogRecordCount;

    // watermark for input records
    // null for paimon <= 0.3
    // null if there is no watermark in new committing, and the previous snapshot does not have a
    // watermark
    private final Long watermark;
}

Manifest Files

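All manifest lists and manifest files are stored in the manifest directory. A manifest list is a list of manifest file names. A manifest file records changes to LSM data files and changelog files. For example, it records which LSM data files were created and which were deleted in the corresponding snapshot.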
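Because manifests record file additions and deletions, the set of data files that belong to a snapshot can be derived by replaying those entries in order. The following is a minimal sketch of that idea; `ManifestReplaySketch`, `Entry` and `Kind` are simplified, hypothetical stand-ins and not Paimon's real manifest classes.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model; not Paimon's real ManifestEntry or ManifestFile classes.
final class ManifestReplaySketch {
    enum Kind { ADD, DELETE }

    // One change recorded in a manifest: a data file was added or deleted.
    static final class Entry {
        final Kind kind;
        final String fileName;

        Entry(Kind kind, String fileName) {
            this.kind = kind;
            this.fileName = fileName;
        }
    }

    // Replays all manifests in order and returns the files that are still live.
    static Set<String> liveFiles(List<List<Entry>> manifests) {
        Set<String> live = new LinkedHashSet<>();
        for (List<Entry> manifest : manifests) {
            for (Entry entry : manifest) {
                if (entry.kind == Kind.ADD) {
                    live.add(entry.fileName);
                } else {
                    live.remove(entry.fileName);
                }
            }
        }
        return live;
    }

    static Set<String> demo() {
        List<Entry> first = Arrays.asList(
                new Entry(Kind.ADD, "data-a"), new Entry(Kind.ADD, "data-b"));
        List<Entry> second = Arrays.asList(
                new Entry(Kind.DELETE, "data-a"), new Entry(Kind.ADD, "data-c"));
        return liveFiles(Arrays.asList(first, second));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [data-b, data-c]
    }
}
```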

Schema:

public class Schema {
    private final List<DataField> fields;
    private final List<String> partitionKeys;
    private final List<String> primaryKeys;
    private final Map<String, String> options;
    private final String comment;
}

FileKind:

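public enum FileKind {
    ADD((byte) 0),
    DELETE((byte) 1);

    // byte value of each kind, required by the constants above
    private final byte value;

    FileKind(byte value) { this.value = value; }
}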

IndexFileMeta:

public class IndexFileMeta {
    private final String indexType;
    private final String fileName;
    private final long fileSize;
    private final long rowCount;
}

IndexManifestEntry:

public class IndexManifestEntry {
    private final FileKind kind;
    private final BinaryRow partition;
    private final int bucket;
    private final IndexFileMeta indexFile;
}

ManifestFileMeta:

public class ManifestFileMeta {
    private final String fileName;
    private final long fileSize;
    private final long numAddedFiles;
    private final long numDeletedFiles;
    private final BinaryTableStats partitionStats;
    private final long schemaId;
}

ManifestCommittable:

// Manifest commit message.
public class ManifestCommittable {
    private final long identifier;
    @Nullable private final Long watermark;
    private final Map<Integer, Long> logOffsets;
    private final List<CommitMessage> commitMessages;
}

ManifestFile:

/**
 * This file includes several ManifestEntry, representing the additional changes since last snapshot.
 */
public class ManifestFile extends ObjectsFile<ManifestEntry> {
    private final SchemaManager schemaManager;
    private final RowType partitionType;
    private final FormatWriterFactory writerFactory;
    private final long suggestedFileSize;
}

// ManifestFile provides a write method that writes ManifestEntry objects into a manifest file and collects the corresponding metadata while doing so.

ManifestList:

// This file includes several ManifestFileMeta, representing all data of the whole table at the corresponding snapshot.

public class ManifestList extends ObjectsFile<ManifestFileMeta> {
    public String write(List<ManifestFileMeta> metas) {
        return super.writeWithoutRolling(metas);
    }
}

Data Files

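Data files are grouped by partition and bucket. Each bucket directory contains one LSM tree and its corresponding changelog files.
Paimon currently supports orc (the default), parquet and avro as data file formats.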

LSM Trees

Paimon uses the LSM tree (log-structured merge-tree) as its data structure for file storage. The LSM tree concepts involved are introduced briefly below.

Sorted Runs

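An LSM tree organizes files into several sorted runs. A sorted run consists of one or more data files, and each data file belongs to exactly one sorted run.
Records within a data file are sorted by their primary keys. Within one sorted run, the primary key ranges of the data files do not overlap.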
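The non-overlap property of a sorted run can be stated as a simple check over the key ranges of its files. This is a sketch under assumptions: `DataFile` with integer `minKey` and `maxKey` is a hypothetical simplification, and the files are assumed to be listed in ascending order of `minKey`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a sorted run: each file covers an inclusive key range.
final class SortedRunCheckSketch {
    static final class DataFile {
        final String name;
        final int minKey;
        final int maxKey;

        DataFile(String name, int minKey, int maxKey) {
            this.name = name;
            this.minKey = minKey;
            this.maxKey = maxKey;
        }
    }

    // Returns true if no two neighbouring files share a key.
    // Assumes the files are ordered by minKey.
    static boolean isValidSortedRun(List<DataFile> files) {
        for (int i = 1; i < files.size(); i++) {
            if (files.get(i).minKey <= files.get(i - 1).maxKey) {
                return false;
            }
        }
        return true;
    }

    static boolean demoValid() {
        return isValidSortedRun(Arrays.asList(
                new DataFile("f1", 1, 10), new DataFile("f2", 11, 20)));
    }

    static boolean demoOverlapping() {
        return isValidSortedRun(Arrays.asList(
                new DataFile("f1", 1, 10), new DataFile("f2", 10, 20)));
    }

    public static void main(String[] args) {
        System.out.println(demoValid());       // prints true
        System.out.println(demoOverlapping()); // prints false
    }
}
```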

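Different sorted runs may have overlapping primary key ranges and may even contain the same primary key. When querying the LSM tree, all sorted runs must be combined, and records with the same primary key are merged according to the user-specified merge engine and the timestamp of each record.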
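Combining sorted runs at read time is essentially a k-way merge. The sketch below assumes a "keep the latest record per key" rule and uses a plain sequence number `seq` in place of the record timestamp; both are illustrative assumptions, and the classes `Rec` and `Cursor` are hypothetical. It also assumes each run holds at most one record per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of merging sorted runs; not Paimon's reader implementation.
final class SortedRunMergeSketch {
    static final class Rec {
        final int key;
        final long seq; // stand-in for the record timestamp
        final String value;

        Rec(int key, long seq, String value) {
            this.key = key;
            this.seq = seq;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + "=" + value;
        }
    }

    // Cursor over one non-empty sorted run; head is the current record.
    static final class Cursor {
        final Iterator<Rec> it;
        Rec head;

        Cursor(List<Rec> run) {
            this.it = run.iterator();
            this.head = it.next();
        }

        boolean advance() {
            if (it.hasNext()) {
                head = it.next();
                return true;
            }
            return false;
        }
    }

    // Merges runs by key; for equal keys the record with the highest seq wins.
    static List<Rec> mergeKeepLatest(List<List<Rec>> runs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) ->
                a.head.key != b.head.key
                        ? Integer.compare(a.head.key, b.head.key)
                        : Long.compare(a.head.seq, b.head.seq));
        for (List<Rec> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<Rec> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor cursor = heap.poll();
            Rec rec = cursor.head;
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).key == rec.key) {
                if (rec.seq >= out.get(last).seq) {
                    out.set(last, rec); // newer version of the same key
                }
            } else {
                out.add(rec);
            }
            if (cursor.advance()) {
                heap.add(cursor);
            }
        }
        return out;
    }

    static List<Rec> demo() {
        List<Rec> older = Arrays.asList(new Rec(1, 1, "a"), new Rec(3, 1, "c"));
        List<Rec> newer = Arrays.asList(new Rec(1, 2, "A"), new Rec(2, 2, "b"));
        return mergeKeepLatest(Arrays.asList(older, newer));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1=A, 2=b, 3=c]
    }
}
```

Other merge rules would change only the branch that handles equal keys; the heap-based traversal stays the same.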

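New records written to the LSM tree are first buffered in memory. When the memory buffer is full, all records in memory are sorted and flushed to disk, which creates a new sorted run.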
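The buffer-then-flush behaviour can be sketched in a few lines. This is an illustration only: `WriteBufferSketch` is hypothetical, "full" is modelled as a record-count limit rather than a memory size, the flushed runs stay in memory instead of going to disk, and the `TreeMap` keeps just the latest value per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical write buffer; capacity is counted in records, not bytes.
final class WriteBufferSketch {
    private final int capacity;
    private final TreeMap<Integer, String> buffer = new TreeMap<>();
    private final List<TreeMap<Integer, String>> sortedRuns = new ArrayList<>();

    WriteBufferSketch(int capacity) {
        this.capacity = capacity;
    }

    // Buffers a record and flushes once the buffer is full.
    void write(int key, String value) {
        buffer.put(key, value);
        if (buffer.size() >= capacity) {
            flush();
        }
    }

    // Turns the buffered records, already sorted by key, into a new sorted run.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sortedRuns.add(new TreeMap<>(buffer));
        buffer.clear();
    }

    List<TreeMap<Integer, String>> sortedRuns() {
        return sortedRuns;
    }

    public static void main(String[] args) {
        WriteBufferSketch writer = new WriteBufferSketch(2);
        writer.write(5, "e");
        writer.write(1, "a"); // buffer is full, first run is flushed
        writer.write(3, "c"); // stays in memory
        System.out.println(writer.sortedRuns()); // prints [{1=a, 5=e}]
    }
}
```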

Compaction

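As more and more records are written to the LSM tree, the number of sorted runs grows. Because querying the LSM tree requires combining all sorted runs, too many sorted runs degrade query performance and can even cause out-of-memory errors.
To limit the number of sorted runs, several sorted runs must periodically be merged into one large sorted run. This process is called compaction.
However, compaction is resource intensive and consumes CPU time and disk IO, so compacting too frequently can slow down writes. This is a trade-off between query performance and write performance. Paimon currently uses a compaction strategy similar to RocksDB's universal compaction.
By default, Paimon performs compaction as needed when it appends records to the LSM tree. Users can also choose to run all compactions in a dedicated compaction job.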
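To make the trade-off concrete, here is a deliberately naive sketch: once the number of sorted runs exceeds a threshold, all runs are merged into one. This is not universal compaction and not Paimon's strategy; `CompactionSketch` and the `maxRuns` parameter are hypothetical, and runs are assumed to be listed from oldest to newest so that newer values overwrite older ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, naive compaction trigger; real strategies pick runs more selectively.
final class CompactionSketch {
    // Merges all runs into one when there are more than maxRuns of them.
    // Runs are ordered oldest to newest, so later values win on equal keys.
    static List<TreeMap<Integer, String>> maybeCompact(
            List<TreeMap<Integer, String>> runs, int maxRuns) {
        if (runs.size() <= maxRuns) {
            return runs;
        }
        TreeMap<Integer, String> merged = new TreeMap<>();
        for (TreeMap<Integer, String> run : runs) {
            merged.putAll(run);
        }
        List<TreeMap<Integer, String>> result = new ArrayList<>();
        result.add(merged);
        return result;
    }

    static TreeMap<Integer, String> run(Object... keyValues) {
        TreeMap<Integer, String> run = new TreeMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            run.put((Integer) keyValues[i], (String) keyValues[i + 1]);
        }
        return run;
    }

    static List<TreeMap<Integer, String>> demo(int maxRuns) {
        List<TreeMap<Integer, String>> runs = Arrays.asList(
                run(1, "a", 4, "d"), run(2, "b", 4, "D"), run(1, "A"));
        return maybeCompact(runs, maxRuns);
    }

    public static void main(String[] args) {
        System.out.println(demo(2)); // prints [{1=A, 2=b, 4=D}]
        System.out.println(demo(3).size()); // prints 3, no compaction needed
    }
}
```

A lower `maxRuns` keeps reads cheap at the cost of more merge work during writes, which is the trade-off described above.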