TabPFN Always Wins on Benchmarks. Real-World Data Tells a Different Story
The paper says it beats XGBoost without tuning. An experiment on a diabetes dataset with 253K respondents gives us a fresh perspective.
In the previous post, I discussed some of the claims from the TabPFN paper. A 100% win rate on small datasets, 230 times faster than AutoML, and no preprocessing needed. All of these numbers are true. But it is important to remember that these claims were measured on curated benchmarks, where datasets are carefully selected, grouped, and the results are averaged. So, what happens if we take this model and throw it at a single, messy dataset straight out of the real world? It turns out the story can be a little different.
A reader requested a direct comparison between XGBoost and TabPFN v2.5 on a real dataset from the field, rather than just another benchmark. This reader also recommended the data we should use, the Diabetes Health Indicators BRFSS 2015 from the CDC. This dataset has 253,680 survey responses with 21 features. The biggest challenge here is the severe 86/14 class imbalance between respondents who do not have diabetes compared to those who have diabetes or prediabetes. This experiment was not set up to prove the TabPFN paper wrong, but rather to find out which claims actually hold up in the field and which ones need a little more context.
The bottom line is that the paper is not lying. However, there are important details that often get hidden when we only look at aggregate numbers from benchmarks.
This dataset comes from the Behavioral Risk Factor Surveillance System (BRFSS). It is an annual health survey conducted by the CDC in the United States, running since 1984. Every year, this survey collects responses from more than 400,000 people regarding health risks, chronic diseases, and preventive services.
This data is publicly available on Kaggle under the name Diabetes Health Indicators Dataset. From the original 2015 dataset, which has over 400,000 responses and 330 features, the data has been cleaned and split into three main files.
For this experiment, I chose the file where the target is binary (0 for no diabetes, 1 for diabetes or prediabetes) containing a total of 253,680 responses, without any artificial balancing. There are several strong reasons behind this choice.
First, it preserves the original class distribution. There is actually a version on Kaggle that is perfectly balanced 50 to 50, but that scenario is just not realistic. In the real world, people with diabetes are a minority of the population. If a model is tested on data that was intentionally balanced, its performance will look much better than it actually is. The file I used keeps the original 86/14 distribution, which is a much more accurate representation of actual field conditions.
This 86/14 imbalance also completely changes how we look at evaluation metrics. With a distribution this skewed, a model that simply guesses “no diabetes” for every single person automatically gets 86% accuracy. Because of this, accuracy cannot be trusted here. In a medical context, what matters far more is recall, the model’s ability to detect the patients who truly have diabetes out of all the patients who actually have it. Missing a diabetes patient (a false negative) carries a far more serious risk than a false alarm. The F1 score also matters because it balances recall against precision. These two metrics will be our main focus; a short code sketch a few paragraphs below shows how to compute them.
Second, the full sample size. The pre-balanced version of the data throws away almost 75% of the rows. By using the file containing the full 253,680 rows, we can test the upper capacity limit of TabPFN v2.5, which happens to have a hard limit at exactly 50,000 rows. This is important so we know how it performs on larger data, not just on the small datasets where it usually shines.
Third, the format is binary classification. Figuring out whether someone has diabetes or not is usually the very first question asked during screening, making it a much more relevant and common baseline for comparing models.
Lastly, the features and data are real. This is direct survey data from real people, so the answers vary, there is noise, and there might be bias. The types of features inside are also a mix of binary, numeric, and ordinal values without any special normalization. It perfectly mimics the tabular data conditions practitioners face every single day.
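To make the earlier metric discussion concrete, here is a minimal, self-contained sketch of how the diabetes-class recall and F1 can be read off with scikit-learn. The toy labels are made up purely for illustration; they are not taken from the experiment.

```python
from sklearn.metrics import classification_report, f1_score, recall_score

# Toy labels standing in for real fold predictions
# (1 = diabetes or prediabetes, 0 = no diabetes).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # a model that misses most positives

# Accuracy is 80% here even though only 1 of 3 diabetes cases is caught.
print(classification_report(y_true, y_pred, target_names=["no diabetes", "diabetes"]))
print("diabetes recall:", recall_score(y_true, y_pred, pos_label=1))  # 0.33
print("diabetes F1:", f1_score(y_true, y_pred, pos_label=1))          # 0.50
```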
Considering TabPFN’s capacity limits, this experiment does not use all 253,000 rows at once. I split the data into two sample sizes with different goals.
The first 10,000 rows were used to test the claim from the TabPFN v2.5 paper stating they have a 100% win rate against default XGBoost on datasets with 10,000 rows or fewer. If that claim holds true, TabPFN should win by a landslide here.
The first 50,000 rows were used to see TabPFN’s upper capacity limit. The TabPFN v2.5 paper mentions that the model’s capacity has been increased to 50,000 rows, so we can directly verify their claim that they still maintain an 87% win rate for medium-sized datasets.
In both sample sizes, the 86/14 class imbalance ratio was strictly maintained. No undersampling, oversampling, or SMOTE was applied. This was intentional, so we could see how each model’s built-in mechanisms handle truly skewed data, rather than how preprocessing techniques paper over the imbalance.
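For reference, here is a minimal sketch of one way to draw size-limited subsamples that keep the 86/14 ratio intact. The CSV file name and the Diabetes_binary column follow the Kaggle release and may differ in your copy, and the article’s own pipeline may slice the rows differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# File and column names as published on Kaggle; adjust if your copy differs.
df = pd.read_csv("diabetes_binary_health_indicators_BRFSS2015.csv")
X = df.drop(columns="Diabetes_binary")
y = df["Diabetes_binary"]

# Stratified subsamples preserve the original class ratio at both sizes.
X_10k, _, y_10k, _ = train_test_split(X, y, train_size=10_000, stratify=y, random_state=42)
X_50k, _, y_50k, _ = train_test_split(X, y, train_size=50_000, stratify=y, random_state=42)

print(y_10k.value_counts(normalize=True))  # roughly 0.86 / 0.14
```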
For evaluation, I used 5-Fold Stratified Cross-Validation. For those unfamiliar, this method divides the data into five sections or folds. The model is trained and tested alternately on each section. The word “Stratified” ensures that every fold has the exact same class proportions, making the final averaged results much more stable.
Both models were then tested across several configurations to keep things fair (a code sketch of this setup follows the list):
XGBoost Vanilla, which is the default XGBoost without any tuning or imbalance handling at all.
XGBoost Class Weight, which is XGBoost given the scale_pos_weight instruction to assign more weight to the diabetes class during training.
XGBoost Optuna, which is XGBoost tuned using Optuna for 100 trials to find the best hyperparameters. This configuration still uses class weights.
On the other side:
TabPFN Vanilla, which is the default TabPFN without imbalance handling.
TabPFN Balanced, which is TabPFN using the balanced_probabilities=True instruction. This adjusts the output threshold so predictions across classes are more balanced. This is the only built-in feature TabPFN has to handle class imbalance.
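Here is a minimal sketch of how these configurations and the 5-fold stratified evaluation can be wired together, continuing from the 10K subsample drawn earlier. The balanced_probabilities keyword is taken from the article’s description of TabPFN’s option; its exact spelling and placement may vary between tabpfn releases, so treat it as an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
from tabpfn import TabPFNClassifier

X, y = X_10k, y_10k  # the 10K stratified subsample from the earlier sketch

# Weight the minority class by the negative/positive ratio (~6 for an 86/14 split).
pos_weight = (y == 0).sum() / (y == 1).sum()

models = {
    "XGBoost Vanilla": XGBClassifier(eval_metric="logloss"),
    "XGBoost Class Weight": XGBClassifier(eval_metric="logloss", scale_pos_weight=pos_weight),
    "TabPFN Vanilla": TabPFNClassifier(),
    # Keyword per the article; check your tabpfn version for the exact argument name.
    "TabPFN Balanced": TabPFNClassifier(balanced_probabilities=True),
}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    recalls, f1s = [], []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict(X.iloc[test_idx])
        recalls.append(recall_score(y.iloc[test_idx], pred, pos_label=1))
        f1s.append(f1_score(y.iloc[test_idx], pred, pos_label=1))
    print(f"{name}: diabetes recall={np.mean(recalls):.2f}, diabetes F1={np.mean(f1s):.2f}")
```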
Let’s look at the results on the first 10,000 rows. This is TabPFN’s comfort zone, where their paper claims they always win.
| Model | Diabetes Recall | Diabetes F1 | Diabetes Precision | Macro F1 | Accuracy | Time |
|---|---|---|---|---|---|---|
| XGBoost (Vanilla) | 22% | 29% | 45% | 61% | 85% | 0.92s |
| TabPFN v2.5 (Vanilla) | 13% | 22% | 59% | 57% | 87% | 48.81s |
Without the help of imbalance handling, both models completely fail to detect diabetes. XGBoost only manages to catch 22% of the cases, while TabPFN catches a mere 13%. Their accuracy looks high at 85% to 87%, but this metric is very misleading because it simply reflects their success in guessing the “no diabetes” majority class.
TabPFN does have a better precision at 59% compared to XGBoost at 45%. This means that if TabPFN says someone has diabetes, its guess is more trustworthy. Unfortunately, out of 100 people who actually have diabetes, TabPFN only detects 13 of them. For a medical screening context, a recall this low is simply unusable. This proves that the balanced_probabilities=True parameter in TabPFN is absolutely mandatory when dealing with imbalanced data.
| Model | Diabetes Recall | Diabetes F1 | Diabetes Precision | Macro F1 | Accuracy | Time |
|---|---|---|---|---|---|---|
| XGBoost (Class Weight) | 48% | 39% | 33% | 63% | 79% | 0.96s |
| TabPFN v2.5 (Balanced) | 74% | 45% | 32% | 64% | 74% | 45.59s |
The story changes drastically once class weights are activated. TabPFN’s recall jumps immediately from 13% to 74%. This model now successfully catches 74% of diabetes patients, leaving XGBoost far behind at 48%.
TabPFN also has a slight edge in the F1 score for the diabetes class, sitting at 45% compared to 39%. XGBoost does win in overall accuracy, but as discussed earlier, that is heavily assisted by easy predictions in the majority class. For a task where the priority is to prevent missed patients, TabPFN’s 74% recall is clearly much more valuable than XGBoost’s 79% accuracy.
| Model | Diabetes Recall | Diabetes F1 | Diabetes Precision | Macro F1 | Accuracy | CV Time | Tuning Time | Total Pipeline |
|---|---|---|---|---|---|---|---|---|
| XGBoost (Optuna) | 59% | 45% | 37% | 66% | 80% | 8.88s | ~99s | ~108s |
| TabPFN v2.5 (Balanced) | 74% | 45% | 32% | 64% | 74% | 45.11s | 0s | 45.11s |
Even after going through 100 tuning trials with Optuna, XGBoost’s recall climbed to 59% but still loses to the untuned TabPFN at 74%. Impressively, TabPFN matches the F1 score of an XGBoost model that was painstakingly tuned, without any tuning effort whatsoever.
What is even more interesting is the total time required. XGBoost took about 108 seconds from start to finish for tuning. This is 2.4 times longer than TabPFN, which only needed 45 seconds for inference. If you need a competitive model and do not want the hassle of wasting time on hyperparameter searches, TabPFN really works well at this scale.
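For a sense of what those roughly 99 seconds of tuning involve, here is a minimal sketch of a 100-trial Optuna search over a few common XGBoost hyperparameters, reusing X, y, and pos_weight from the earlier sketches. The search space is illustrative, not the exact one used in this experiment.

```python
import optuna
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    # A small, common search space; the real experiment's space may differ.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = XGBClassifier(eval_metric="logloss", scale_pos_weight=pos_weight, **params)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # Optimize the F1 score of the diabetes (positive) class across folds.
    return cross_val_score(model, X, y, cv=skf, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```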
Now let’s multiply the data volume by five, bringing it to 50,000 rows. The goal here is to test TabPFN’s upper limits and see how the model scales compared to XGBoost.
| Model | Diabetes Recall | Diabetes F1 | Diabetes Precision | Macro F1 | Accuracy | Time |
|---|---|---|---|---|---|---|
| XGBoost (Vanilla) | 17% | 26% | 50% | 59% | 86% | 1.01s |
| TabPFN v2.5 (Vanilla) | 15% | 23% | 57% | 58% | 87% | 862.66s |
The same pattern repeats itself. Without imbalance handling, both models struggle to detect the minority class. What is shocking is how extreme the speed gap becomes: TabPFN needs 862 seconds, roughly 14 minutes, while XGBoost finishes the job in just one second.
| Model | Diabetes Recall | Diabetes F1 | Diabetes Precision | Macro F1 | Accuracy | Time |
|---|---|---|---|---|---|---|
| XGBoost (Class Weight) | 69% | 43% | 31% | 63% | 74% | 1.11s |
| TabPFN v2.5 (Balanced) | 73% | 45% | 33% | 65% | 75% | 864.13s |
In the balanced configuration, TabPFN still leads across all the crucial metrics. Its recall reaches 73% compared to XGBoost at 69%, plus it takes the lead in F1 and accuracy.
There is an interesting observation here. XGBoost’s recall slowly crawls upward as more data is added. From just 48% on the 10K data, it now sits at 69% on the 50K data because the model has more examples to learn from. On the flip side, TabPFN’s performance remains incredibly stable and consistent around 73% to 74% regardless of how much the dataset grows. Even though the performance gap starts to narrow, TabPFN remains ahead.
| Model | Diabetes Recall | Diabetes F1 | Diabetes Precision | Macro F1 | Accuracy | CV Time | Tuning Time | Total Pipeline |
|---|---|---|---|---|---|---|---|---|
| XGBoost (Optuna) | 63% | 44% | 34% | 65% | 78% | 11.99s | 187.48s | ~199s |
| TabPFN v2.5 (Balanced) | 73% | 45% | 33% | 65% | 75% | 863.76s | 0s | 863.76s |
Even after getting over three minutes of help from Optuna, XGBoost’s recall of 63% still lags 10 points behind TabPFN. TabPFN earns all of this without any extra tuning at all.
To be honest, at this data scale, speed becomes a major roadblock for TabPFN. XGBoost finishes the entire process, including tuning, in just 199 seconds. TabPFN, on the other hand, needs almost 14 minutes just for inference. If you are working on a project that demands super fast model iteration, this performance time gap simply cannot be ignored.
From the experiment we just ran, we can re-evaluate the claims from the TabPFN paper that I summarized in my previous post. Which ones are relevant, and which ones need extra context?
As a quick side note, it is important to know that each release version of TabPFN was tested on a different set of benchmarks. So, the claims do not all come from the exact same research source. TabPFN v1 was tested on 18 OpenML datasets, focusing their claims on beating AutoML with extreme speedups. TabPFN v2 was tested on dozens of AutoML Benchmark datasets, focusing heavily on a maximum limit of 10,000 samples. Meanwhile, TabPFN v2.5 was tested on the TabArena benchmark where the capacity was expanded up to 50,000 samples.
Now, let’s break down the claims.
Claim: a 100% win rate against default XGBoost on datasets with 10,000 rows or fewer. Source: TabPFN v2.5 paper. The experiment results show that this claim needs context. TabPFN does win in recall and F1 if the parameters are set correctly, but XGBoost wins in accuracy and is, of course, far faster. The paper calculates its “win rate” from average metrics aggregated across many datasets at once, so specific trade-offs like speed versus accuracy become invisible. The conclusion here is that the winner depends entirely on which metric matters most for the problem you are trying to solve.
Claim: 230 times faster than AutoML. Source: TabPFN v1 paper. The comparison in their paper is valid on paper, but it does not reflect how most practitioners work day to day. Most practitioners use default XGBoost as their first baseline, not AutoML, which takes a long time. Compared with default XGBoost in our experiment, XGBoost is actually 50 to 850 times faster than TabPFN.
Claim: capacity extended to 50,000 rows, with an 87% win rate on medium-sized datasets. Source: TabPFN v2.5 paper. TabPFN definitely runs on 50,000 rows of data, and its prediction quality remains very good. The main problem is the compute time: while the data grew by a factor of five, the inference time ballooned by almost 19 times. Its capacity did increase, but the computational scaling issue is still unresolved homework.
Claim: no tuning and no preprocessing needed. Source: general conclusion from the paper presentations.
The “no tuning” part of this statement is 100% true. TabPFN successfully matches and even exceeds the quality of a painstakingly tuned XGBoost model. However, the “no preprocessing” part comes with terms and conditions. If you face data with a severe imbalance, you at least need to pass balanced_probabilities=True. Without that step, the model largely fails to detect the minority class. And to push the F1 score even further, data-level techniques like SMOTE or oversampling might still be necessary.
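If you want to try that data-level route, here is a minimal sketch using imbalanced-learn’s SMOTE inside a pipeline, so oversampling is applied only to the training folds. This was not part of the experiment above and reuses X and y from the earlier sketches.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# SMOTE runs inside each training fold only; test folds keep the real 86/14 ratio.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", XGBClassifier(eval_metric="logloss")),
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=skf, scoring="f1")
print("diabetes F1 per fold:", scores.round(2))
```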
This experiment clearly is not meant to invalidate the research behind the TabPFN paper. The results on this diabetes dataset also cannot be blindly applied to every type of data out there. This is purely a single case study to observe the model’s behavior on medical data with a fairly severe imbalance.
TabPFN successfully finds more diabetes cases consistently without the hassle of tuning. This is a highly relevant achievement in a healthcare context. But the trade-offs are equally real.
TabPFN is very slow. Waiting 14 minutes for a single 5-fold run on 50,000 rows feels like too much of a burden for projects that require rapid iteration. If speed is your priority, an XGBoost model that finishes the task in one second is still hard to beat.
TabPFN’s precision is also relatively low. Out of the positive predictions generated by TabPFN, a lot of them are false alarms. Even though XGBoost suffers from a similar issue on this dataset, you still need to be careful when accepting positive guesses from the model.
Despite all of this, the old saying that tree-based algorithms always win on tabular data is starting to be proven wrong. But that does not mean TabPFN is a magic bullet for every problem either. The answer, once again, relies heavily on your data’s characteristics, compute time constraints, and the metrics you are trying to chase.
The most interesting thing about TabPFN’s arrival is the shift in mindset. For years, we have always designed and adjusted algorithms manually. TabPFN provides proof that the capability of the algorithm itself can actually be “learned” by a Transformer model. The question is no longer just which model will always win, but rather how we frame the data distribution problem so the model can become the most fitting algorithm to solve it.
This experiment was run on a machine equipped with an NVIDIA GeForce RTX 5060 Laptop GPU. All the code and data pipelines are available for you to check out on GitHub.