• rmunn 8 hours ago

    I love property-based testing, especially the way it can uncover edge cases you wouldn't have thought about. Haven't used Hypothesis yet, but I once had FsCheck (property-based testing for F#) find a case where the data structure I was writing failed when there were exactly 24 items in the list and you tried to append a 25th. That was a test case I wouldn't have thought to write on my own, but the particular number (it was always the 25th item that failed) quickly led me to find the bug. Once my property tests were running overnight and not finding any failures after thousands and thousands of random cases, I started to feel a lot more confident that I'd nailed down the bugs.

    • tombert 8 hours ago

      I had a similar thing, with F# as well actually.

      We had some code that used a square root, and in some very rare circumstances we could get a negative number, which would throw an exception. I don't think I would have even considered that possibility if FsCheck hadn't generated it.

      • teiferer 7 hours ago

        That example caught my attention. What was it in your code that made length 24 special?

      • NortySpock 8 hours ago

        I keep thinking I have a possible use case for property-based testing, and then I am up to my armpits in trying to understand the on-the-ground problem and don't feel like I have time to learn a DSL for describing all possible inputs and outputs when I already have an existing function (the subject-under-test) that I don't understand.

        So rather than try to learn two black boxes at the same time, I fall back to "several more unit tests to document more edge cases to defensibly guard against".

        Is there some simple way to describe this defensive programming iteration pattern in Hypothesis? Normally we just null-check and return early and have to deal with the early-return case. How do I quickly write property tests to check that my code handles the most obvious edge cases?

        • eru 7 hours ago

          In addition to what other people have said:

          > [...] time to learn a DSL for describing all possible inputs and outputs when I already had an existing function [...]

          You don't have to describe all possible inputs and outputs. Even just being able to describe some classes of inputs can be useful.

          As a really simple example: many example-based tests have some values that are arbitrary and the test shouldn't care about them, like eg employee names when you are populating a database or whatever. Instead of just hard-coding 'foo' and 'bar', you can have hypothesis create arbitrary values there.

          Just like learning how to write (unit) testable code is a skill that needs to be learned, learning how to write property-testable code is also a skill that needs practice.

          What's less obvious: retro-fitting property-based tests on an existing codebase with existing example-based tests is almost a separate skill. It's harder than writing your code with property based tests in mind.

          ---

          Some common properties to test (two of these are sketched in code after the list):

          * Your code doesn't crash on random inputs (or only throws a short whitelist of allowed exceptions).

          * Applying some piece of functionality should be idempotent, ie doing the operation multiple times should give the same result as applying it only once.

          * Order of input doesn't matter (for some functionality)

          * Testing your prod implementation against a simpler implementation that's perhaps too slow for prod or only works on a restricted subset of the real problem. The reference implementation doesn't even have to be simpler: just having a different approach is often enough.
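
          To make two of these concrete, here is a minimal Hypothesis sketch (dedupe is a made-up function standing in for whatever you are testing):

              from hypothesis import given, strategies as st

              def dedupe(xs):
                  # Hypothetical function under test: drop duplicates, return in order.
                  return sorted(set(xs))

              @given(st.lists(st.integers()))
              def test_dedupe_is_idempotent(xs):
                  once = dedupe(xs)
                  assert dedupe(once) == once

              @given(st.lists(st.integers()))
              def test_input_order_does_not_matter(xs):
                  assert dedupe(list(reversed(xs))) == dedupe(xs)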

          • wodenokoto 6 hours ago

            But let's say employee names fail on apostrophes. Won't you just have a unit test that sometimes fails, but only when the testing tool randomly happens to add an apostrophe to the employee name?

            • bluGill 3 minutes ago

              If you know it will fail on an apostrophe, you should have a specific test for that. However, if that detail is buried in some function 3 levels deep that you don't even realize is used, you wouldn't write the test or handle it even though it matters. This should find those issues too.

              • IanCal 5 hours ago

                Hypothesis keeps a database of failures to use locally and you can add a decorator to mark a specific case that failed. So you run it, see the failure, add it as a specific case and then that’s committed to the codebase.

                The randomness can bite a little if that test failure happens on an unrelated branch, but it’s not much different to someone just discovering a bug.

                edit - here's the relevant part of the hypothesis guide https://hypothesis.readthedocs.io/en/latest/tutorial/replayi...

                • movpasd an hour ago

                  You can either use the @example decorator to force Hypothesis to check an edge case you've thought of, or just let Hypothesis uncover the edge cases itself. Hypothesis won't fail a test once and then pass it the next time; it keeps track of which examples failed and will re-run them. The generated inputs aren't uniformly randomly distributed and will tend to check pathological cases (complex symbols, NaNs, etc.) with priority.

                  You shouldn't think of Hypothesis as a random input generator but as an abstraction over thinking about the input cases. It's not perfect: you'll often need to .map() to get the distribution to reflect the usage of the interface being tested and that requires some knowledge of the shrinking behaviour. However, I was really surprised how easy it was to use.
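
                  For the apostrophe example upthread, the combination looks something like this (quote_name is a made-up stand-in for whatever chokes on apostrophes):

                      from hypothesis import example, given, strategies as st

                      def quote_name(s):
                          # Hypothetical function under test: SQL-style quoting.
                          return "'" + s.replace("'", "''") + "'"

                      @given(st.text())
                      @example("O'Brien")  # pinned edge case, re-checked on every run
                      def test_quoting_roundtrips(s):
                          assert quote_name(s)[1:-1].replace("''", "'") == s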

                  • jcattle 6 hours ago

                    As far as I remember, hypothesis tests smartly, meaning that likely-problematic strings are tested first. It then narrows down which exact part of the tested string caused the failure.

                    So it might as well just throw the kitchen sink at the function. If it handles that: great. If not: the input will get narrowed down until you arrive at a minimal set of failing inputs.

                    • vlovich123 6 hours ago

                      Either your code shouldn’t fail or the apostrophe isn’t a valid case.

                      In the former, hypothesis and other similar frameworks are deterministic and will replay the failing test on request or remember the failing tests in a file to rerun in the future to catch regressions.

                      In the latter, you just tell the framework to not generate such values or at least to skip those test cases (better to not generate in terms of testing performance).

                      • reverius42 6 hours ago

                        I think what they meant is, "won't Hypothesis sometimes fail to generate input with an apostrophe, thus giving you false confidence that your code can handle apostrophes?"

                        I think the answer is that, in practice, it will not fail to generate such input. My understanding is that it's pretty good at mutating input to cover a large amount of surface area with as few examples as possible.

                        • eru 5 hours ago

                          Hypothesis is pretty good, but it's not magic. There's only so many corner cases it can cover in the 200 (or so) cases per test it's running by default.

                          But by default you also start with a new random seed every time you run the tests, so you can build up more confidence over the older tests and older code, even if you haven't done anything specifically to address this problem.

                          Also, even with Hypothesis you can and should still write specific tests or even just specific generators to cover specific classes of corners cases you are worried about in more detail.

                          • thunky 32 minutes ago

                            > But by default you also start with a new random seed every time you run the tests, so you can build up more confidence over the older tests and older code

                            Is it common practice to use the same seed and run a ton of tests until you're satisfied it tested it thoroughly?

                            Because I think I would prefer that. With non-deterministic tests I would always wonder if it's going to fail randomly after the code is already in production.

                            • bluGill a minute ago

                              The more the test runs, the less likely it is that there's an uncovered case left. So your confidence grows. Remember too that anything found before release is something a customer won't find.

                      • Balinares 5 hours ago

                        No, Hypothesis iterates on test failures to isolate the simplest input that triggers it, so that it can report it to you explicitly.

                    • akshayshah 7 hours ago

                      Sibling comments have already mentioned some common strategies - but if you have half an hour to spare, the property-based testing series on the F# for Fun and Profit blog is well worth your time. The material isn’t really specific to F#.

                      https://fsharpforfunandprofit.com/series/property-based-test...

                      • sunshowers 8 hours ago

                        The simplest practical property-based tests are where you serialize some randomly generated data of a particular shape to JSON, then deserialize it, and ensure that the output is the same.

                        A more complex kind of PBT is if you have two implementations of an algorithm or data structure, one that's fast but tricky and the other one slow but easy to verify. (Say, quick sort vs bubble sort.) Generate data or operations randomly and ensure the results are the same.
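
                        For the JSON case, the whole test can be as small as this sketch (floats left out to dodge NaN comparisons):

                            import json

                            from hypothesis import given, strategies as st

                            # JSON-like values: scalars plus nested lists and string-keyed dicts.
                            json_values = st.recursive(
                                st.none() | st.booleans() | st.integers() | st.text(),
                                lambda children: st.lists(children) | st.dictionaries(st.text(), children),
                            )

                            @given(json_values)
                            def test_json_roundtrips(value):
                                assert json.loads(json.dumps(value)) == value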

                        • eru 6 hours ago

                          > The simplest practical property-based tests are where you serialize some randomly generated data of a particular shape to JSON, then deserialize it, and ensure that the output is the same.

                          Testing that f(g(x)) == x for all x and some f and g that are supposed to be inverses of each other is a good test, but it's probably not the simplest.

                          The absolute simplest I can think of is just running your functionality on some randomly generated input and seeing that it doesn't crash unexpectedly.

                          For things like sorting, testing against an oracle is great. But even when you don't have an oracle, there's lots of other possibilities (a few of these are sketched in code after the list):

                          * Test that sorting twice has the same effect as sorting once.

                          * Start with a known already in-order input like [1, 2, 3, ..., n]; shuffle it, and then check that your sorting algorithm re-creates the original.

                          * Check that the output of your sorting algorithm is in-order.

                          * Check that input and output of your sorting algorithm have the same elements in the same multiplicity. (If you don't already have a datastructure / algorithm that does this efficiently, you can probe it with more randomness: create a random input (say, a list of numbers), pick a random number X, count how many times X appears in your list (via a linear scan); then check that you get the same count after sorting.)

                          * Check that permuting your input doesn't make a difference.

                          * Etc.
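
                          A sketch of three of these (my_sort is the function under test; sorted stands in for it here so the snippet runs):

                              from collections import Counter

                              from hypothesis import given, strategies as st

                              my_sort = sorted  # stand-in for the implementation under test

                              @given(st.lists(st.integers()))
                              def test_sort_properties(xs):
                                  out = my_sort(xs)
                                  assert my_sort(out) == out  # sorting twice == sorting once
                                  assert all(a <= b for a, b in zip(out, out[1:]))  # output is in order
                                  assert Counter(out) == Counter(xs)  # same elements, same multiplicity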

                          • sunshowers 6 hours ago

                            Speaking for myself — those are definitely all simpler cases, but I never found them compelling enough (beyond the "it doesn't crash" property). The simplest case that truly motivated PBT for me was roundtrip serialization. Now I use PBT quite a lot, and most of my tests are either serialization roundtrips or oracle/model-based tests.

                            • eru 5 hours ago

                              Oh, yes, I was just listing simple examples. I wasn't trying to find a use case that's compelling enough to make you want to get started.

                              I got started out of curiosity, and because writing property based tests is a lot more fun than writing example based tests.

                            • jgalt212 43 minutes ago

                              > The absolute simplest I can think of is just running your functionality on some randomly generated input and seeing that it doesn't crash unexpectedly.

                              For this use case, we've found it best to just use a fuzzer, and work off the tracebacks.

                              That being said, we have used hypothesis to test data validation and normalizing code to decent success. We use it on a one-off basis, when starting something new or making a big change. We don't run these tests every day.

                              Also, I don't like how hypothesis integrates much better with pytest than unittest.

                          • disambiguation 8 hours ago

                            I've only used it once before, not as unit testing, but as stress testing for a new customer facing api. I wanted to say with confidence "this will never throw an NPE". Also the logic was so complex (and the deadline so short) the only reasonable way to test was to generate large amounts of output data and review it manually for anomalies.

                            • meejah 8 hours ago

                              Here are some fairly simple examples: testing port parsing https://github.com/meejah/fowl/blob/e8253467d7072cd05f21de7c...

                              ...and https://github.com/magic-wormhole/magic-wormhole/blob/1b4732...

                                The simplest ones to get started with are "strings", IMO, and they also give you lots of mileage (because it'll definitely test some weird unicode). So, somewhere in your API where you take some user-entered strings -- even something "open ended" like "a name" -- you can make use of Hypothesis to try a few things. This has definitely uncovered unicode bugs for me.

                              Some more complex things can be made with some custom strategies. The most-Hypothesis-heavy tests I've personally worked with are from Magic Folder strategies: https://github.com/tahoe-lafs/magic-folder/blob/main/src/mag...

                                The only real downside is that a Hypothesis-heavy test-suite like the above can take a while to run (but you can instruct it to only produce one example per test). Obviously, one example per test won't catch everything, but it is way faster when developing, and Hypothesis remembers "bad" examples, so if you occasionally do a longer run it will re-try things that caused errors before.

                              • fwip 8 hours ago

                                I think the easiest way is to start with general properties and general input, and tighten them up as needed. The property might just be "doesn't throw an exception", in some cases.

                                If you find yourself writing several edge cases manually with a common test logic, I think the @example decorator in Hypothesis is a quick way to do that: https://hypothesis.readthedocs.io/en/latest/reference/api.ht...

                              • benrutter 4 hours ago

                                  I love the idea of hypothesis! I haven't found a lot of use cases for it yet, and I think the quick start example helps explain why.

                                Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

                                Obviously it's just a tiny example, but in general I often find writing a test that will cover every hypothetical case isn't realistic.

                                Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?

                                • arnsholt an hour ago

                                    I've only used PBT a few times, but when it fits it's been extremely useful. A concrete example from my own practice of something that has been pointed out elsewhere in this thread - that you want to test properties of your function's output rather than the output itself: I was implementing a fairly involved algorithm (the Zhang-Shasha edit distance algorithm for ordered trees), and PBT was extremely useful in weeding out bugs. What I did was write a function that generated random tree structures in the form I needed for my code, and then tested the four properties that all distance functions should have (a code sketch follows):

                                    1. d(x, x) = 0 for all x

                                    2. d(x, y) >= 0 for all x, y

                                    3. d(x, y) = d(y, x) for all x, y

                                    4. d(x, z) <= d(x, y) + d(y, z) for all x, y, z (the triangle inequality)

                                  Especially the triangle inequality check weeded out some tricky corner cases I probably wouldn't have figured out on my own. Some will object that you're not guaranteed to find bugs with this kind of random-generation strategy, but if you blast through a few thousand cases every time you run your test suite, and the odd overnight run testing a few million, you quickly get fairly confident that the properties you test actually hold. Of course any counterexamples the PBT finds should get lifted to regression tests in addition, to make sure they're always caught if they crop up again. And as with any testing approach, there are no guarantees, but it adds a nice layer of defense in depth IMO.
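
                                    In Hypothesis, those four checks look roughly like this (abs-difference on integers stands in for the real tree edit distance, and the random tree generator is elided):

                                        from hypothesis import given, strategies as st

                                        def d(x, y):
                                            # Stand-in metric; the real case used randomly generated
                                            # trees and the Zhang-Shasha edit distance.
                                            return abs(x - y)

                                        points = st.integers()

                                        @given(points, points, points)
                                        def test_metric_axioms(x, y, z):
                                            assert d(x, x) == 0
                                            assert d(x, y) >= 0
                                            assert d(x, y) == d(y, x)
                                            assert d(x, z) <= d(x, y) + d(y, z)  # triangle inequality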

                                  • aflukasz 4 hours ago

                                    > Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?

                                    One example would be when you have a naive implementation of some algorithm and you want to introduce a second one, optimized but with much more complex implementation. Then this naive one will act as a model for comparisons.

                                      Another case that comes to mind is when you have rather simple properties to test (like: does it finish without crashing within a given time? does the output stay within some bounds?) and want to easily run over a sensible set of varying inputs.

                                    • masklinn 4 hours ago

                                          What you're talking about is using an oracle (a different implementation of what you're testing). It's an option for property (or exhaustive) testing, but is by no means a requirement. Even for a sort function there are plenty of properties you can check without needing an oracle, e.g.

                                      - that the output sequence is the same length as the input

                                      - that the output sequence is sorted

                                      - that the population counts are the same in the input and the output

                                      > In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

                                      Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set e.g. you might be in a situation where a radix sort is more appropriate.

                                      • sevensor 2 hours ago

                                            I use hypothesis to test that the happy path is really happy. I use it to define the set of expected inputs for a function, and it looks for the nastiest possible sets of inputs. It’s kind of the opposite of how I write ordinary tests, where I pick challenging but representative examples. Often I’m just checking some basic property, like “function f is the inverse of function g” for round-tripping data through a different representation. Sometimes it’s just “function f does not divide by zero”. Hypothesis seems to have a library of corner cases that it checks first. If you don’t say you can’t handle floats above 2^53, it will try 2^53, for instance. (That’s the threshold where doubles stop being able to represent every integer.) Often the result of running hypothesis is not a code change but a more precise description of what inputs are valid.

                                        • wongarsu 4 hours ago

                                              Imho it's just not a great example. A better example would be testing the invariants of the function. "For any list of numbers, my_sort does not throw an exception" is trivial to test. "For any list of numbers, in the list returned by my_sort the item at index n is no larger than the item at index n+1" would be another test. Those two probably capture the full specification of my_sort, and you don't need a sort function to test either. In a real-world example you would more likely be testing just some subset of the specification: those things that are easy to assert.

                                          • lgas 3 hours ago

                                            Always returning the empty list meets your spec.

                                            • wongarsu 2 hours ago

                                                  Good point. I suppose we should add "number of input elements equals number of output elements" and "every input element is present in the output". Translated into a straightforward test that still allows my_sort([1,1,2]) to return [1,2,2], but we have to draw the line somewhere.

                                          • thekoma 4 hours ago

                                                Indeed. For a sorting algorithm it would be more insightful to test the actual property of something that is sorted: every consecutive element is at least as large as the previous one (or the other way around). You don’t need a sorting function to test the “sorted” property.

                                            • IanCal 4 hours ago

                                                  I definitely had the same issue when I started; I think you've hit on the two main mental blocks people run into.

                                              > this means you need a second function that acts the same as the function you wrote.

                                              Actually no, you need to check that outcomes match certain properties. The example there works for a reimplementation (for all inputs, the output of the existing function and new one must be identical) but you can break down a sorting function into other properties.

                                              Here are some for a sorting function that sorts list X into list Y.

                                              Lists X and Y should be the same length.

                                              All of the elements in list X should be in list Y.

                                                  Y[n] <= Y[n+m] for all 0 <= n < n+m < len(Y)

                                              If you support a reverse parameter, iterating backwards over the sorted list should match the reverse sorted list.

                                              Sorting functions are a good simple case but we don't write them a lot and so it can be hard to take that idea and implement it elsewhere.

                                              Here's another example:

                                                  In the app focus ELEMENT A. Press tab X times. Press shift-tab X times. Focus should be on ELEMENT A.

                                              In an app with two or more elements, focus ELEMENT A. Press tab. The focus should have changed.

                                              In an app with N elements. Focus ELEMENT A. Press tab N times. The focus should be on ELEMENT A.

                                                  Those are simple properties that should be true and are easy to test. Maybe you'd have some specific set of tests for certain cases and numbers, but with only those there's a bigger chance you miss that focus gets trapped in some part of your UI.

                                                  I had a library for making UIs for TVs that could be shaped in any way (so a carousel with 2 vertical elements, then one larger one, then two again, for example). I had a test that, for a UI, if you press a direction and the focus changes, then pressing the opposite direction should take you back to where you were. Extending that, I had that same test run but combined with "for any sequence of UI construction API calls...". This single test found a lot of bugs in edge cases. But it was a very simple statement about using the navigation.

                                                  I had a similar test - no matter what API calls are made to change the UI, either there are no elements on screen, or exactly one has focus. This was defined in the spec, but the spec also said that if there was focus, removing everything and then adding items back meant nothing should have focus. These parts of the spec were independently tested with unit tests, and nobody spotted the contradiction until we added a single test that checked the first part automatically.

                                              This I find exceptionally powerful when you have APIs. For any way people might use them, some basic things hold true.

                                              I've had cases where we were identifying parts of text and things like "given a random string, adding ' thethingtofind ' results in getting at least a match for ' thethingtofind ', with start and end points, and when we extract that part of the string we get ' thethingtofind '". Sounds basic, right? Almost pointless to add when you've seen the output work just fine and you have explicit test cases for this? It failed. At one step we lowercased things for normalisation and we took the positions from this lowercased string. If there was a Turkish I in the string, when lowercased it became two characters:

                                                  >>> len("İ") == len("İ".lower())
                                                  False
                                              
                                              So anything where this appeared before the match string would throw off the positions.
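
                                                  Stated as a property, that test is tiny (this sketch deliberately reproduces the bug described, taking positions from a lowercased copy):

                                                      from hypothesis import given, strategies as st

                                                      NEEDLE = "thethingtofind"

                                                      def find_span(haystack):
                                                          # Buggy: index comes from the lowercased copy, whose
                                                          # length can differ (e.g. len("İ".lower()) == 2).
                                                          start = haystack.lower().find(NEEDLE)
                                                          return start, start + len(NEEDLE)

                                                      @given(st.text())
                                                      def test_span_extracts_needle(prefix):
                                                          haystack = prefix + " " + NEEDLE + " "
                                                          start, end = find_span(haystack)
                                                          assert haystack[start:end] == NEEDLE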

                                                  Have a look at your tests. I imagine you have some like "doing X results in Y" and "doing X twice results in Y happening once". These are two tests expressing "doing X n times, where n >= 1, Y happens". That much more explicitly describes the actual property you want to test.

                                              You probably have some tests where you parameterise a set of inputs but actually check the same output - those could be randomly generated inputs testing a wider variety of cases.

                                              I'd also suggest that if you have a library or application and you cannot give a general statement that always holds true, it's probably quite hard to use.

                                                  Worst case, point hypothesis at an HTTP API and have it auto-generate some tests that say "no matter what I send to this, it doesn't return a 500".

                                              In summary (and no this isn't LLM generated):

                                              1. Don't test every case.

                                              2. Don't check the exact output, check something about the output.

                                              3. Ask where you have general things that are true. How would you explain something to a user without a very specific setup (you can't do X more than once, you always get back Y, it's never more than you put in Z...)

                                              4. See if you have numbers in tests where the specific number isn't relevant.

                                              5. See if you have inputs in your tests where the input is a developer putting random things in (sets of short passwords, whatever)

                                              • rowanG077 4 hours ago

                                                On more complex logic you can often reason about many invariants. I don't think using sorting and then comparing to the existing sort is a good example. Precisely for the reason you mention.

                                                    A sort itself is a good example though. You can do much better than just comparing to an existing sort implementation, because you can easily check if an array is sorted with a linear scan. If your sort is stable, you can additionally check that for all pairs of adjacent elements that compare as equal, their order is the same as in the input array.

                                              • rtu8lu an hour ago

                                                    Great project. I used it a lot, but now I mostly prefer ad hoc generators. Hypothesis combinators quickly become an unmaintainable mess for non-trivial objects. Also, shrinking is not such a big deal when you can generate your data in complexity-sorted order.

                                                • ctenb an hour ago

                                                  What is complexity sorted order?

                                                • ed_blackburn 4 hours ago

                                                    I'm using sqlglot to parse hundreds of old mysql backup files to find diffs of schemas. The joys of legacy code. I've found hypothesis to be super helpful for tightening up my parser. I've identified properties (invariants) and built some strategies. I can now generate many more permutations of DDL than I'd thought of before. And I have absolute confidence in what I'm running.

                                                    I started off with TDD and covered the basics. I learned what I needed to learn about the files I'm dealing with, the edge cases, and sqlglot, and then I moved on to Hypothesis for extra confidence.

                                                    I'm curious to see if it'll help with commands for APIs. If nothing else, it'll help me appreciate how liberal my API is when perhaps I don't want it to be.

                                                  • groundhogstate 4 hours ago

                                                    You might find schemathesis useful for API testing (which IIRC is built on Hypothesis). Certainly helped me find a stack of unhandled edge cases.

                                                      YMMV, but when I first set up Schemathesis on one of my MVPs, it took several hours to iron out the kinks. But thereafter, built into CI, I've found it guards against regressions quite effectively.

                                                • dbcurtis 8 hours ago

                                                  It’s been quite some time since I’ve been in the business of writing lots of unit tests, but back in the day, I found hypothesis to be a big force multiplier and it uncovered many subtle/embarrassing bugs for me. Recommend. Also easy and intuitive to use.

                                                  • aethor 6 hours ago

                                                    I concur. Hypothesis saved me many times. It also helped me prove the existence of bugs in third party code, since I was able to generate examples showing that a specific function was not respecting certain properties. Without that I would have spent a lot of time trying to manually find an example, let alone the simplest possible example.

                                                    • IanCal 5 hours ago

                                                      Huge second.

                                                        I’ve never used PBT and failed to find a new bug. I recommended it in a job interview; they used it and discovered a pretty clear bug on their first test. It’s really powerful.

                                                      • eru 7 hours ago

                                                        Hypothesis is also a lot better at giving you 'nasty' floats etc than Haskell's QuickCheck or the relevant Rust and OCaml libraries are. (Or at least used to be, I haven't checked on all of them recently.)

                                                      • spooky_deep 5 hours ago

                                                        Property based testing is fantastic.

                                                        Why is it not more popular?

                                                        My theory is that only code written in functional languages has complex properties you can actually test.

                                                        In imperative programs, you might have a few utils that are appropriate for property testing - things like to_title_case(str) - but the bulk of program logic can only be tested imperatively with extensive mocking.

                                                        • chamomeal 18 minutes ago

                                                          I think testing culture in general is suffering because the most popular styles/runtimes don’t support it easily.

                                                          Most apps (at least in my part of the world) these days are absolutely peppered with side effects. At work our code is mostly just endpoints that trigger tons of side effects, then return some glob of data returned from some of those effects. The joys of micro services!!

                                                          If you’re designing from the ground up with testing in mind, you can make things super testable. Defer the actual execution of side effects. Group them together and move local biz logic to a pure function. But when you have a service that’s just a 10,000 line tangle of reading and writing to queues, databases and other services, it’s really hard to do ANY kind of testing.

                                                          I think that’s why unit testing and full on browser based E2E testing are popular these days. Unit testing pretends the complexity isn’t there via mocks, and lets you get high test coverage to pass your 85% coverage requirement. Then the E2E tests actually test user stories.

                                                          I’m really hoping there’s a shift. There are SO many interesting and comprehensive testing strategies available that can give you such high confidence in your program. But it mostly feels like an afterthought. My job has 90% coverage requirements, but not a single person writes useful tests. We have like 10,000 unit tests literally just mocking functions and then spying on the mocked return.

                                                          For anybody wanting to see a super duper interesting use of property based testing, check out “Breaking the Bank with test contract”, a talk by Allen Rohner. He pretty much uses property based testing to verify that mocks of services behave identically to the actual services (for the purpose of the program) so that you can develop and test against those mocks. I’ve started implementing a shitty version of this at work, and it’s wicked cool!!

                                                          • IanCal 3 hours ago

                                                            I strongly disagree, but I think there are a few problems

                                                            1. Lots of devs just don't know about it.

                                                            2. It's easy to do it wrong and try and reimplement the thing you're testing.

                                                            3. It's easy to try and redo all your testing like that and fail and then give up.

                                                            Here's some I've used:

                                                            Tabbing changes focus for UIs with more than 1 element. Shift tabbing the same number of times takes you back to where you came from.

                                                            This one on TVs with u/d/l/r nav -> if pressing a direction changes focus, pressing the opposite direction takes you back to where you came from.

                                                            An extension of the last -> regardless of the set of API calls used to make the user interface, the same is true.

                                                            When finding text ABC in a larger string and getting back `Match(x, start, end)`, if I take the string and chop out string[start:end] then I get back exactly ABC. This failed because of a dotted I that when lowercased during a normalisation step resulted in two characters - so all the positions were shifted. Hypothesis found this and was able to give me a test like "find x in 'İx' -> fail".

                                                             No input to the API should result in a 500 error. N PUT requests, where N>0, result in exactly one item created.

                                                            Clicking around the application should not result in a 404 page or error page.

                                                            Overall I think there's lots of wider things you can check, because we should have UIs and tools that give simple rules and guarantees to users.

                                                            • ibizaman 4 hours ago

                                                               I actually used property testing very successfully to test a DB driver and a migration to another DB driver in Go. I wrote it up here https://blog.tiserbox.com/posts/2024-02-27-stateful-property...

                                                              • imiric 4 hours ago

                                                                Thanks for sharing! Your article illustrates well the benefits of this approach.

                                                                 One drawback I see is that property-based tests inevitably need to be much more complex than example-based ones. This means that bugs in the tests themselves are more likely, they're more difficult to maintain, etc. You do mention that it's a lot of code, but I wonder if the complexity is worth it in the long run. I suppose that since testing these scenarios any other way would be even more tedious and error-prone, the answer is "yes". But it's something to keep in mind.

                                                              • vrnvu 4 hours ago

                                                                >> Why is it not more popular?

                                                                 Property, fuzz, and snapshot testing: great tools that make software more correct and reliable.

                                                                The challenge for most developers is that they need to change how they design code and think about testing.

                                                                I’ve always said the hardest part of programming isn’t learning, it’s unlearning what you already know…

                                                                • maweki 4 hours ago

                                                                   I think the core problem in property-based testing is that the property/specification needs to be quite simple compared to the implementation.

                                                                   I did some property-based testing in Haskell, and in some cases the implementation was the specification verbatim. So what properties should I test? It was clearer where my function should be symmetric in its arguments, or where there is a neutral element, etc.

                                                                  If the property is basically your specification which (as the language is very expressive) is your implementation then you're just going in circles.

                                                                  • eru 5 hours ago

                                                                    But wouldn't that apply just as much to example based testing?

                                                                  • cat-whisperer 7 hours ago

                                                                     A little off topic, but there’s this great video by Jon about property testing in Rust: https://youtu.be/64t-gPC33cc

                                                                    • __mharrison__ 7 hours ago

                                                                       I've taught this in my testing courses. I find that (pytest) fixtures are often good enough for coming up with multiple tests, while being simple to implement.

                                                                      • vlade11115 4 hours ago

                                                                        One cool application of this is schemathesis. I really enjoyed it, and I found more input validation bugs in my code than I can count. Very useful for public APIs.
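
                                                                         The pytest setup is small; roughly this, based on the schemathesis docs (entry points vary a bit between versions, and the URL is a placeholder):

                                                                             import schemathesis

                                                                             # Placeholder: point this at your service's OpenAPI schema.
                                                                             schema = schemathesis.from_uri("http://localhost:8000/openapi.json")

                                                                             @schema.parametrize()
                                                                             def test_no_server_errors(case):
                                                                                 response = case.call()
                                                                                 assert response.status_code < 500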

                                                                        • thdhhghgbhy 6 hours ago

                                                                          Is there something this nice for JS, with the decorators like that?

                                                                          • ngruhn 6 hours ago

                                                                            No decorators, but fast-check has add-ons to various test frameworks. E.g. if you use Vitest you can write:

                                                                                import { test, fc } from '@fast-check/vitest'
                                                                                import { expect } from 'vitest'

                                                                                test.prop([fc.array(fc.double({ noNaN: true }))])('sort output is in order', (lst) => {
                                                                                  const sorted = lst.toSorted((a, b) => a - b)
                                                                                  sorted.forEach((x, i) => { if (i > 0) expect(sorted[i - 1]).toBeLessThanOrEqual(x) })
                                                                                })
                                                                            
                                                                            https://www.npmjs.com/package/@fast-check/vitest?activeTab=r...
                                                                            • chamomeal 15 minutes ago

                                                                              Fast check is fantastic!! I found it to be pretty verbose but I think that’s just a typescript limitation. It’s VERY well typed, which was a nice surprise. Such a great library. Shrinking, model based testing, it’s really comprehensive

                                                                            • aidos 5 hours ago

                                                                              Not decorators (or at least not last time I looked) but we use fast-check.

                                                                              Was already familiar with and using Hypothesis in Python so went in search of something with similar nice ergonomics. Am happy with fast-check in that regard.

                                                                              https://fast-check.dev/

                                                                              • eru 5 hours ago

                                                                                The decorators are a nice approach in Python, but they aren't really core to what Hypothesis does, nor what makes it better than eg Haskell's QuickCheck.

                                                                                • greener_grass 3 hours ago

                                                                                  Decorators are just function application with more syntax.

                                                                                • pyuser583 8 hours ago

                                                                                  Make sure to read the docs and understand this well. It has its own vocabulary that can be very counterintuitive.

                                                                                  • imiric 4 hours ago

                                                                                    Coincidentally, I recently stumbled upon a similar library for Go[1].

                                                                                    I haven't used it, or property-based testing, but I can see how it could be useful.

                                                                                    [1]: https://github.com/flyingmutant/rapid

                                                                                    • klntsky 8 hours ago

                                                                                      It seems to only implement half of the QuickCheck idea, because there is no counterexample shrinking. Good effort though! I wonder how hard it would be to derive generators for any custom types in python - probably not too hard, because types are just values

                                                                                      • sunshowers 8 hours ago

                                                                                        Shrinking is by far the most important and impressive part of Hypothesis. Compared to how good it is in Hypothesis, it might as well not exist in QuickCheck.

                                                                                        Proptest in Rust is mostly there but has many more issues with monadic bind than Hypothesis does (I wrote about this in https://sunshowers.io/posts/monads-through-pbt/).

                                                                                        • eru 6 hours ago

                                                                                          Python's Hypothesis has some very clever features to deal with shrinking past a monadic bind.

                                                                                          If I remember right, it basically uses a binary 'tape' of random decisions. Shrinking is expressed as manipulations of that tape. Your generators (implicitly) define a projection from that tape to your desired types. When an early part of the tape is shrunk, the later sub-generators try to re-use the later parts of the tape.

                                                                                          That's not guaranteed to work. But it doesn't have to work reliably for every shrink operation the library tries! It's sufficient if you merely have a good enough chance of recovering enough of the previous structure to trigger the bug again.
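
                                                                                          A toy illustration of the idea (not Hypothesis's actual code, just the shape of it): generation reads from the tape, and shrinking edits the tape and re-runs the projection.

                                                                                              import random

                                                                                              def draw_int(tape, lo, hi):
                                                                                                  # A generator is a deterministic projection from tape bytes to a value.
                                                                                                  return lo + tape.pop(0) % (hi - lo + 1)

                                                                                              def gen_list(tape):
                                                                                                  n = draw_int(tape, 0, 5)
                                                                                                  return [draw_int(tape, 0, 100) for _ in range(n)]

                                                                                              print(gen_list([random.randrange(256) for _ in range(8)]))
                                                                                              print(gen_list([0] * 8))  # the all-zeros tape gives the minimal value: []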

                                                                                          • ctenb 29 minutes ago

                                                                                            > Shrinking is expressed as manipulations of that tape.

                                                                                            How do you do that in general? I can't find any documentation on that.

                                                                                            • sunshowers 6 hours ago

                                                                                              I've always wondered if there could be a small machine learning model trained on shrinking.

                                                                                              • eru 5 hours ago

                                                                                                I'm not sure whether it would be useful, but it would definitely get you a grant (if done academically) or VC money (if done as a company) these days.

                                                                                          • Jtsummers 8 hours ago

                                                                                            > because there is no counterexample shrinking

                                                                                            Hypothesis does shrink the examples, though.

                                                                                            • eru 7 hours ago

                                                                                              And Hypothesis is miles ahead of QuickCheck in how it handles shrinking! Not only does it shrink automatically, it has no problem preserving invariants from generation in your shrinking; like only prime numbers or only strings that begin with a vowel etc.

                                                                                              • lgas 3 hours ago

                                                                                                QuickCheck also shrinks automatically and preserves invariants though?

                                                                                            • pfdietz 8 hours ago

                                                                                              The way it does counterexample shrinking is the most clever part of Hypothesis.

                                                                                              • ctenb 28 minutes ago

                                                                                                Do you have a reference where it is explained? It's not part of the docs as far as I can tell

                                                                                            • teiferer 7 hours ago

                                                                                              This approach has two fundamental problems.

                                                                                              1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert on it. Is your function doing a+b? Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b, since the framework provides a and b. You can do a simpler version that's less efficient, but at the end of the day you somehow need to derive the expected outputs from input arguments, just like your SUT does. Any logical error that might slip into your SUT implementation has a high risk of also slipping into your test, where it will be hidden by the complexity, even though it would be obvious from just looking at a few well-thought-through examples.

                                                                                              2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge-case finding to chance. 0, -1, or 1-more-than-list-length are obvious cases which both you, the human test writer, and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. You as the developer know the implementation and have a chance of coming up with the edge cases. You know the dark corners of your code. Random tests are just playing the lottery, replacing thinking hard.

                                                                                              • ChadNauseam 6 hours ago

                                                                                                > Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b since the framework provides a and b. You can do a simpler version that's less efficient, but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does.

                                                                                                Not true. For example, if `f` is `+`, you can assert that f(x,y) == f(y,x). Or that f(x, 0) == x. Or that f(x, f(y, z)) == f(f(x, y), z).

                                                                                                Even a test as simple as "don't crash for any input" is actually extremely useful. This is fuzz testing, and it's standard practice for any safety-critical code, e.g. you can bet the JPEG parser on the device you're reading this on has been fuzz tested.

                                                                                                > You basically just give up and leave edge case finding to chance.

                                                                                                I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is that the generator can actually inspect your runtime binary, see which branches are being triggered, and try to find inputs that will cause all branches to be executed. Doing this for a JPEG parser actually causes it to produce valid images, which you would never expect to happen by chance. See: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...

                                                                                                > Such a fuzzing run would be normally completely pointless: there is essentially no chance that a "hello" could be ever turned into a valid JPEG by a traditional, format-agnostic fuzzer, since the probability that dozens of random tweaks would align just right is astronomically low.

                                                                                                > Luckily, afl-fuzz can leverage lightweight assembly-level instrumentation to its advantage - and within a millisecond or so, it notices that although setting the first byte to 0xff does not change the externally observable output, it triggers a slightly different internal code path in the tested app. Equipped with this information, it decides to use that test case as a seed for future fuzzing rounds:

                                                                                                • eru 6 hours ago

                                                                                                  > I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is because the generator can actually inspect your runtime binary and see what branches are being triggered and try to find inputs that will cause all branches to be executed.

                                                                                                  The author of Hypothesis experimented with this feature once, but people usually want their unit tests to run really quickly, regardless of whether property based or example based. And the AFL style exploration of branch space typically takes quite a lot longer than what people have patience for in a unit test that runs eg on every update to every Pull Request.

                                                                                                  • tybug 6 hours ago

                                                                                                    (Hypothesis maintainer here)

                                                                                                    Yup, a standard test suite just doesn't run for long enough for coverage guidance to be worthwhile by default.

                                                                                                    That said, coverage-guided fuzzing can be a really valuable and effective form of testing (see eg https://hypofuzz.com/).

                                                                                                    • sevensor 2 hours ago

                                                                                                      Thank you, Hypothesis is brilliant!

                                                                                                      • eru 3 hours ago

                                                                                                        Thanks for the good work!

                                                                                                  • vlovich123 7 hours ago

                                                                                                    I have not met anyone who says you should only fuzz/property test, but claiming it can't possibly find bugs, or is unlikely to, is silly. I've caught numerous non-obvious problems this way, including a non-fatal but undesirable off-by-1 error in math-heavy code. Property testing works well for "NP"-hard-style problems, where writing the code is harder than verifying its output. It does not work well for a+b, but for most problems it's generally easier to write assertions that have to hold when executing your function than to compute expected outputs. And if it's not, don't use it - like all testing, it's an art to determine when it's useful and how to write it well.

                                                                                                    Hypothesis in particular does something neat where it tries to generate random inputs that are more likely to execute novel paths within the code under test. That isn't replicated in Rust, but it's super helpful for reaching more paths of your code, which you simply can't do manually if you have a lot of non-obvious boundary conditions.

                                                                                                    • eru 7 hours ago

                                                                                                      Yes, NP-style verification is a prime candidate.

                                                                                                      But even for something like a+b, you have lots of properties you can test. All the group theory axioms (insofar as they are supposed to hold) for example. See https://news.ycombinator.com/item?id=45820009 for more.

                                                                                                    • eru 7 hours ago

                                                                                                      > 1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert

                                                                                                      No. That's one valid approach, especially if you have a simpler alternative implementation. But testing against an oracle is far from the only property you can check.

                                                                                                      For your example: suppose you have implemented an add function for your fancy new data type (perhaps it's a crazy vector/tensor thing, whatever).

                                                                                                      Here are some properties that you might want to check:

                                                                                                      a + b == b + a

                                                                                                      a + (b + c) == (a + b) + c

                                                                                                      a + (-a) == 0

                                                                                                      These should hold for all a, b, and c, assuming the properties are actually supposed to hold in your domain and that you have an additive inverse (-). E.g. many of them don't hold for floating-point numbers in general (float addition isn't associative), so it's good to note that down explicitly.

                                                                                                      Depending on your domain (eg https://en.wikipedia.org/wiki/Tropical_semiring), you might also have idempotence in your operation, so a + b + b = a + b is also a good one to check, where it applies.

                                                                                                      You can also have an alternative implementation that only works for some classes of cases. Or sometimes it's easier to construct a challenge with a known bound than to solve one: e.g. you can randomly move around in a graph quite easily, and then check that the A* implementation you are working on finds a route that's at most as long as the number of random steps you took.
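
                                                                                                      In Hypothesis, those axioms translate almost verbatim. A sketch, where vec_add and vec_neg are hypothetical stand-ins for the operation under test:

                                                                                                        from hypothesis import given, strategies as st

                                                                                                        # Integers sidestep the floating-point caveats mentioned above.
                                                                                                        vectors = st.lists(st.integers(), min_size=3, max_size=3)

                                                                                                        def vec_add(a, b):  # hypothetical subject-under-test
                                                                                                            return [x + y for x, y in zip(a, b)]

                                                                                                        def vec_neg(a):  # hypothetical additive inverse
                                                                                                            return [-x for x in a]

                                                                                                        @given(vectors, vectors)
                                                                                                        def test_commutative(a, b):
                                                                                                            assert vec_add(a, b) == vec_add(b, a)

                                                                                                        @given(vectors, vectors, vectors)
                                                                                                        def test_associative(a, b, c):
                                                                                                            assert vec_add(a, vec_add(b, c)) == vec_add(vec_add(a, b), c)

                                                                                                        @given(vectors)
                                                                                                        def test_additive_inverse(a):
                                                                                                            assert vec_add(a, vec_neg(a)) == [0, 0, 0]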

                                                                                                      > 2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Testing for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. [...]

                                                                                                      You'd be surprised how often the generic heuristics for edge cases actually work, how often manual test writers forget that zero is also a number, and how often the randomness lottery covers a lot of the rest.

                                                                                                      Having said this: Python's Hypothesis is a lot better at its heuristics for these edge cases than eg Haskell's QuickCheck.

                                                                                                      • teiferer an hour ago

                                                                                                        > a + b == b + a

                                                                                                        > a + (b + c) == (a + b) + c

                                                                                                        > a + (-a) == 0

                                                                                                        Great! Now I have a stupid bug that always returns 0, so these all pass. And since I didn't think about this case (otherwise I'd not have written that stupid bug in the first place), I didn't add a property that a + b is only 0 if a == -b. Boom: the test is happy, and there is nothing the framework can do about it.

                                                                                                        Coming up with those properties is hard for real-life code, which is my main gripe with formal-methods approaches too, like model checking or deductive proofs. They move the bugs from the (complicated) code to the list of properties, which ends up just as complicated and error-prone, and is entirely un...tested.

                                                                                                        Contrast that with an explicit, dead-simple test. Test code doesn't have tests. It needs to be orders of magnitude simpler than the system it's testing; its correctness must be obvious. Yes, it is really hard to write a good test. So hard that it should steer how you architect your system under test and how you write code. "How can I test this to have confidence that it's correct?" must be a guiding principle from the first line of code. Just doing this as an afterthought, by playing the lottery and trying to come up with smart properties after the fact, is not going to get you the best outcome.

                                                                                                      • 12_throw_away 6 hours ago

                                                                                                        > Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b

                                                                                                        Not really, no, it's right there in the name: you should be testing properties (you can call them "invariants" if you want to sound fancy).

                                                                                                        In the example of testing an addition operator, you could test:

                                                                                                        1. f(x,y) >= max(x,y) if x and y are non-negative

                                                                                                        2. f(x,y) is even iff x and y have the same parity

                                                                                                        3. f(x, y) = 0 iff x=-y

                                                                                                        etc. etc.

                                                                                                        The great thing is that these tests are very easy and fast to write, precisely because you don't have to re-model the entire domain. (Although it's also a great tool if you have 2 implementations, or are trying to match a reference implementation)
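
                                                                                                        As a sketch of how little code that takes in Hypothesis (f here is just a stand-in for the addition operator under test):

                                                                                                          from hypothesis import given, strategies as st

                                                                                                          def f(x, y):  # stand-in for the addition operator under test
                                                                                                              return x + y

                                                                                                          @given(st.integers(min_value=0), st.integers(min_value=0))
                                                                                                          def test_at_least_max_when_nonnegative(x, y):
                                                                                                              assert f(x, y) >= max(x, y)

                                                                                                          @given(st.integers(), st.integers())
                                                                                                          def test_even_iff_same_parity(x, y):
                                                                                                              assert (f(x, y) % 2 == 0) == (x % 2 == y % 2)

                                                                                                          @given(st.integers(), st.integers())
                                                                                                          def test_zero_iff_opposites(x, y):
                                                                                                              assert (f(x, y) == 0) == (x == -y)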

                                                                                                        • robertfw 7 hours ago

                                                                                                          I feel like this talk by John Hughes shows that there is real value in this approach for production systems of varying levels of complexity; he walks through two different examples of using it to find very low-level bugs that you'd never think to test for with traditional approaches.

                                                                                                          https://www.youtube.com/watch?v=zi0rHwfiX1Q

                                                                                                          • locknitpicker 6 hours ago

                                                                                                            > (...) but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does.

                                                                                                            I think your comment reflects some misconceptions about property-based testing.

                                                                                                            Property-based testing is still automated testing. You still have a SUT, and you still exercise it to verify and validate invariants. That does not change.

                                                                                                            The core trait of property-based testing is that instead of defining and maintaining hard-coded test data, i.e. specific realizations of the input space, you generate sequences of random inputs; when a test fails, the framework applies reduction (shrinking) strategies to distil the failing input down to a minimal reproducible example.

                                                                                                            As a consequence, tests don't focus on which specific value the SUT returns for a specific input. Instead, they verify more general properties of the SUT.

                                                                                                            Perhaps the main advantage of property-based testing is that developers no longer need to maintain test data, and tests are no longer green just because you forgot to update the test data to cover a scenario or reflect an edge case. Developers instead define test-data generators, and the property-based testing framework implements the hard parts, such as the input distillation step.
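
                                                                                                            For instance, a generator for a hypothetical Employee record might look like this (the record type is made up; st.builds and the automatic shrinking are real Hypothesis features):

                                                                                                              from dataclasses import dataclass
                                                                                                              from hypothesis import given, strategies as st

                                                                                                              @dataclass
                                                                                                              class Employee:  # hypothetical domain type
                                                                                                                  name: str
                                                                                                                  age: int

                                                                                                              # A reusable generator: the framework, not the developer, owns the data.
                                                                                                              employees = st.builds(
                                                                                                                  Employee,
                                                                                                                  name=st.text(min_size=1),
                                                                                                                  age=st.integers(min_value=16, max_value=99),
                                                                                                              )

                                                                                                              @given(employees)
                                                                                                              def test_employee_roundtrips(e):
                                                                                                                  # On failure, Hypothesis shrinks e to a minimal counterexample.
                                                                                                                  assert Employee(e.name, e.age) == e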

                                                                                                            Property-based testing is no silver bullet though.

                                                                                                            > Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of.

                                                                                                            Your comment completely misses the point of property-based testing. You still need to exercise your SUT to cover scenarios. Where property-based testing excels is that you no longer have to maintain curated sets of test data or update them whenever you update a component: your inputs are already randomly generated following the strategy you specified.