Comments Page - Avoiding duplicate objects in Django querysets

« Back Avoiding duplicate objects in Django querysetsjohnnymetz.comSubmitted by johnnymetz 4 days ago

jiaaro 42 minutes ago
Exists is a useful tool that you should certainly know how to use. Whether or not it's faster than distinct depends on the rest of the query. I've optimized queries where distinct is faster than exists. It's been some time, but I think it boils down to the relative sizes of the tables and how many of the exists queries actually find something (and how often they find more than one something).
Also, some databases (like clickhouse) allow for `any` joins which avoid producing duplicate rows. For example:
```
    select author.*
    from author
    inner any join book on (
        book.author_id = author.id 
        and book.title like 'Book%'
    )
```
tecoholic 29 minutes ago
Nice write up showcasing Exists. I would say, if ORM abstraction “distinct()” is a performance issue, then it’s probably time to switch to SQL. I find it simpler to either use the ORM or the SQL than to bend ORM into SQL.
- sgarland 21 minutes ago
  The ORM isn’t the performance issue here, it’s the DB. DISTINCT is a form of GROUP BY, and so it brings with it the various limitations imposed by the RDBMS. For example, look at what MySQL requires to use an index to perform a GROUP BY.
  0: https://dev.mysql.com/doc/refman/8.4/en/group-by-optimizatio...
augusteo 4 hours ago
Nice writeup. The Exists subquery approach is definitely the cleanest.
One thing worth mentioning: if you're hitting this problem frequently, it might be worth reconsidering the query patterns themselves. We had a similar issue at work where we kept adding `.distinct()` everywhere, and eventually realized we were doing the filtering wrong upstream.
The PostgreSQL-specific `distinct(*fields)` with the ORDER BY restriction is one of those things that trips people up. The error message isn't great either. "SELECT DISTINCT ON expressions must match initial ORDER BY expressions" is technically correct but doesn't explain why or what to do about it.
Good call recommending Exists as the default approach. It's more explicit about intent too.
ducdetronquito 3 hours ago
Good read, TIL!
That being said, I use Django daily for 10 years but I don’t understand the ORM besides basic CRUD. Even a simple group by looks weird.
Writing plain SQL feels easier and more maintainable in the long run.
- storystarling 35 minutes ago
  I've mostly switched to raw SQL for these kinds of queries too. It seems like the maintenance burden is actually lower when you can see the explicit query plan, rather than trying to reverse engineer what the ORM is doing with the joins.