[AURON #2366] fix: Handle Paimon metadata columns in V2 native scan by lyne7-sc · Pull Request #2367 · apache/auron

lyne7-sc · 2026-06-26T09:44:48Z

Which issue does this PR close?

Rationale for this change

Paimon metadata columns are produced by the Paimon scan layer rather than stored as physical columns in data files. The Paimon V2 native scan was passing these columns to the native Parquet/ORC reader as file columns, which can return incorrect values.

For example:

create table paimon.db.t_metadata (id int, v string) using paimon;
insert into paimon.db.t_metadata values (1, 'a');
select id, __paimon_file_path from paimon.db.t_metadata;

The native path returned null for __paimon_file_path, while Spark/Paimon's scan path returns the actual file path.

What changes are included in this PR?

Recognize Paimon metadata columns using PaimonMetadataColumn.
Materialize supported file-level metadata columns (__paimon_file_path, __paimon_bucket) as per-file constants.
Keep unsupported Paimon metadata columns on Spark/Paimon's scan path instead of reading them from Parquet/ORC files.
Cover metadata columns both with and without table partition columns.
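
The classification described in the list above can be sketched in plain Scala. This is a minimal model, not the PR's code: the object name, the `Placement` type, the `__paimon_` prefix constant, and the case-insensitive matching are assumptions made for illustration. Only the two supported column names come from the description.

```scala
import java.util.Locale

// Hypothetical model of how a projected column could be routed.
object MetadataColumnClassification {
  // Supported names are taken from the PR description; the prefix is an assumption.
  val SupportedMetadata: Set[String] = Set("__paimon_file_path", "__paimon_bucket")
  val MetadataPrefix = "__paimon_"

  sealed trait Placement
  case object FileColumn extends Placement      // read from the Parquet/ORC file
  case object PerFileConstant extends Placement // partition value or supported metadata column
  case object Unsupported extends Placement     // keeps the query on Spark/Paimon's scan path

  def classify(name: String, partitionKeys: Set[String]): Placement = {
    val lower = name.toLowerCase(Locale.ROOT)
    val isPartitionKey = partitionKeys.exists(_.equalsIgnoreCase(name))
    if (isPartitionKey || SupportedMetadata.contains(lower)) PerFileConstant
    else if (lower.startsWith(MetadataPrefix)) Unsupported
    else FileColumn
  }

  // The native scan is only usable when no projected column is unsupported.
  def canUseNativeScan(projected: Seq[String], partitionKeys: Set[String]): Boolean =
    !projected.exists(name => classify(name, partitionKeys) == Unsupported)
}
```

Under these assumptions, a projection of `id` and `__paimon_file_path` stays on the native path, while a projection that includes any other `__paimon_`-prefixed name does not.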

Are there any user-facing changes?

No API changes. This is a correctness fix for Paimon V2 native scan.

How was this patch tested?

Adds Paimon V2 integration tests

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

SteNicholas

@lyne7-sc, thanks for the fix! The overall approach is sound: materialize __paimon_file_path/__paimon_bucket as per-file constants via partitionSchema, and fall back to Spark for unsupported metadata columns. The functional Test Paimon 1.2 CI job (which runs the new integration tests) is green.

SteNicholas · 2026-06-28T08:34:57Z

+
+  private def isPaimonMetadataColumn(name: String): Boolean = {
+    containsName(PaimonMetadataColumns, name) ||
+      name.toLowerCase(Locale.ROOT).startsWith(PaimonMetadataColumnPrefix)


Style / CI blocker. spotless scalafmt rejects this line — the || continuation should be indented 4 spaces, not 6. This is one of the two violations turning the Style job red (-······name.toLowerCase → +····name.toLowerCase). mvn spotless:apply fixes it:

private def isPaimonMetadataColumn(name: String): Boolean = {
  containsName(PaimonMetadataColumns, name) ||
  name.toLowerCase(Locale.ROOT).startsWith(PaimonMetadataColumnPrefix)
}

Thanks for pointing this out. Fixed by running Spotless.

SteNicholas · 2026-06-28T08:34:58Z

+      assert(df.collect().length === 1)
+    }
+  }
+


Style / CI blocker. This blank line has trailing whitespace (8 spaces), which spotless rejects (-········ → +). It is the second cause of the failing Style job. mvn spotless:apply removes it.

Removed the trailing whitespace; CI is green now.

SteNicholas · 2026-06-28T08:34:58Z

-      }
      split.dataFiles().asScala.map { dataFile =>
        val filePath = s"${split.bucketPath()}/${dataFile.fileName()}"
+        val partitionValues = if (partitionSchema.isEmpty) {


Efficiency: partitionValues is now computed inside split.dataFiles().map, so partitionConverter.convert(split.partition()), indexByName, and the per-field DataConverter.fromPaimon conversions all run once per data file — even though everything except __paimon_file_path is constant across the files of a split (split.partition() and split.bucket() are split-level). For a split with N data files this rebuilds the whole partition row N times. Consider computing the split-invariant portion once per split and only filling the per-file file_path slot inside the loop.

Makes sense. Addressed by computing the split-invariant partition/metadata values once per split, and only filling the per-file __paimon_file_path inside the data-file loop.
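
The restructuring discussed here can be sketched as follows. This is a simplified model under stated assumptions: `Split`, `DataFile`, and `constantRows` are hypothetical stand-ins for the Paimon split types and the PR's conversion code, and partition values are modeled as a plain `Map` rather than going through `partitionConverter` or `DataConverter.fromPaimon`.

```scala
// Hypothetical, simplified stand-ins for the Paimon split and data-file types.
object SplitConstants {
  final case class DataFile(fileName: String)
  final case class Split(
      bucketPath: String,
      bucket: Int,
      partition: Map[String, Any],
      dataFiles: Seq[DataFile])

  val FilePathColumn = "__paimon_file_path"
  val BucketColumn = "__paimon_bucket"

  // Returns one (filePath, constant values) pair per data file,
  // with values ordered like constantFields.
  def constantRows(split: Split, constantFields: Seq[String]): Seq[(String, Vector[Any])] = {
    // Split-invariant work: done once per split, not once per data file.
    val template: Vector[Any] = constantFields.map {
      case BucketColumn   => split.bucket
      case FilePathColumn => null // placeholder, filled per file below
      case partitionKey   => split.partition.getOrElse(partitionKey, null)
    }.toVector
    val filePathSlot = constantFields.indexOf(FilePathColumn)

    // Per-file work: only the file path slot changes.
    split.dataFiles.map { dataFile =>
      val filePath = s"${split.bucketPath}/${dataFile.fileName}"
      val values =
        if (filePathSlot >= 0) template.updated(filePathSlot, filePath) else template
      filePath -> values
    }
  }
}
```

Because `Vector.updated` returns a new vector, the shared template is never mutated, so files in the same split cannot observe each other's paths.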

SteNicholas · 2026-06-28T08:34:58Z

+    def isPartitionValueField(name: String): Boolean =
+      containsName(partitionKeys, name) || isSupportedMetadataColumn(name)
+    val partitionFields = readSchema.fields.filter(f => isPartitionValueField(f.name))
+    val fileFields = readSchema.fields.filterNot(f => isPartitionValueField(f.name))


Coverage gap worth a test: when only metadata columns are projected from a non-partitioned table (e.g. select __paimon_file_path from t), every field is classified as a partition/metadata constant, so fileFields is empty and fileSchema is empty. The native Parquet/ORC scan is then asked to read zero data columns but must still emit one row per record so the constant columns get the right cardinality. All three new tests also select id, so the empty-fileSchema path is never exercised (and there's no existing partition-only/count(*) test either). Please add a metadata-only projection test on a multi-row non-partitioned table to confirm the row count is correct on this path.

Good point. Added a regression test for metadata-only projection on a multi-row non-partitioned table.

SteNicholas · 2026-06-28T08:34:58Z

-    val partitionFields = readSchema.fields.filter(f => containsName(partitionKeys, f.name))
-    val fileFields = readSchema.fields.filterNot(f => containsName(partitionKeys, f.name))
+    def isPartitionValueField(name: String): Boolean =
+      containsName(partitionKeys, name) || isSupportedMetadataColumn(name)


Minor / edge case: classification here is purely by name (resolver against __paimon_file_path/__paimon_bucket). Paimon's schema validation reserves only the _KEY_ prefix and the core system field names — not the __paimon_ prefix — so a user could in principle define a real physical column named __paimon_bucket. It would then be treated as a per-file constant and return split.bucket() instead of the stored value (a silent wrong result rather than a fallback). Very unlikely in practice, but flagging it.

Fixed by making physical table columns take precedence over metadata name matching. A real column named __paimon_bucket now stays in fileSchema and is read from the data file. Added a regression test for this case.
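
One way to express the precedence rule described in this reply is sketched below. It is an illustration only: the object and method names are hypothetical, names are matched exactly rather than through Spark's resolver, and the table's physical columns are passed in as a plain set.

```scala
// Hypothetical sketch: physical table columns win over metadata name matching.
object PhysicalColumnPrecedence {
  val SupportedMetadata: Set[String] = Set("__paimon_file_path", "__paimon_bucket")

  def isConstantField(
      name: String,
      partitionKeys: Set[String],
      physicalColumns: Set[String]): Boolean = {
    val isPartitionKey = partitionKeys.contains(name)
    // A stored, non-partition column must be read from the data file,
    // even if its name collides with a metadata column name.
    val isStoredDataColumn = physicalColumns.contains(name) && !isPartitionKey
    isPartitionKey || (!isStoredDataColumn && SupportedMetadata.contains(name))
  }

  // Returns (constant fields, file fields), preserving projection order within each group.
  def splitSchema(
      readFields: Seq[String],
      partitionKeys: Set[String],
      physicalColumns: Set[String]): (Seq[String], Seq[String]) =
    readFields.partition(isConstantField(_, partitionKeys, physicalColumns))
}
```

With a table that really stores a column named `__paimon_bucket`, that name lands in the file fields, while `__paimon_file_path` is still treated as a per-file constant.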

SteNicholas · 2026-06-28T08:34:58Z

      s"plan should use native paimon scan:\n$plan")
  }

+  private def checkSparkAnswerAndNativePaimonScan(sqlText: String): DataFrame = {


Minor test cleanup: the DataFrame return value is ignored by both callers (and the third metadata test doesn't use this helper), so Unit would be clearer. Also var expected: Seq[Row] = Nil reassigned inside withSQLConf can be a val, since withSQLConf returns its block value:

val expected = withSQLConf("spark.auron.enable.paimon.scan" -> "false") {
  sql(sqlText).collect().toSeq
}

Updated the helper to return Unit.

But I kept the existing var expected pattern because val expected = withSQLConf { ... } does not compile on Spark 3.x: withSQLConf returns Unit there.
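
The constraint described in this reply can be reproduced without Spark. In the sketch below, `withConf` is a hypothetical stand-in for a `withSQLConf` whose result type is `Unit`, as the author reports for Spark 3.x; the config store and the `expectedWithScanDisabled` helper are likewise invented for illustration.

```scala
// Hypothetical stand-in for a test helper whose block result is discarded.
object CaptureFromUnitBlock {
  private var conf = Map.empty[String, String]

  // Applies the overrides, runs the body, then restores the previous config.
  // The result type is Unit, so the body's value cannot be returned to the caller.
  def withConf(pairs: (String, String)*)(body: => Unit): Unit = {
    val saved = conf
    conf = conf ++ pairs
    try body
    finally conf = saved
  }

  def currentValue(key: String): Option[String] = conf.get(key)

  def expectedWithScanDisabled(): Seq[String] = {
    // `val expected = withConf(...) { ... }` would bind Unit, not the rows,
    // so the value is captured through a var assigned inside the block.
    var expected: Seq[String] = Nil
    withConf("spark.auron.enable.paimon.scan" -> "false") {
      expected = Seq(currentValue("spark.auron.enable.paimon.scan").getOrElse("unset"))
    }
    expected
  }
}
```

If the helper instead returned its block's value, the `val` form suggested in the review would work and the `var` could be dropped.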

lyne7-sc · 2026-06-28T14:52:21Z

@SteNicholas Thanks for the careful review! Addressed the comments in the latest update, and the relevant CI jobs are green now.

lyne7-sc added 2 commits June 26, 2026 17:23

test: add paimon metadata columns suite

3995864

support paimon file-level metadata

b05f5a6

github-actions Bot added the thirdparty-paimon label Jun 26, 2026

SteNicholas requested a review from Copilot June 28, 2026 06:27

Copilot AI reviewed Jun 28, 2026

SteNicholas reviewed Jun 28, 2026

SteNicholas self-assigned this Jun 28, 2026

apply suggestions

ea31cda
