NP-hard isn't the end of the world

2018-05-31 (last updated 2018-06-12)

We've typically considered it a deal-breaker to discover that a problem we care about is NP-hard. We'll go to great lengths to simplify the problem domain so that we can switch to a polynomial-time algorithm. But if we simplify too much, we run the risk that our solution is no longer useful. Luckily, we might not have to! NP-hard is a worst-case bound. If we can convince ourselves that we won't encounter pathological inputs, an algorithm for the NP-hard problem might be just fine in practice.

Complexity theory is the study of how difficult certain problems are. We typically measure a problem's complexity by estimating the amount of time and memory that a computer would need to solve the problem for an arbitrary input. We also like to lump together problems with basically the same difficulty into complexity classes. For instance, there is a complexity class named P that contains all of the problems that are "easy enough" for computers to solve. (More formally, that can be solved in "polynomial time".) There are important differences between problems in P — for instance, a constant-time algorithm is definitely better than a linear-time one. But if you zoom out far enough, those differences stop mattering — an exponential algorithm is so much worse than either that we can throw them both into P and call it a day.
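
To get a feel for the size of that gap, here's a quick back-of-the-envelope comparison of step counts. The particular exponents are arbitrary, but the pattern isn't: the distance between a linear and a cubic algorithm is nothing compared to the distance between either of them and an exponential one.

```python
# Step counts for a linear, a cubic, and an exponential algorithm, just to
# show how quickly the exponential one leaves the polynomial ones behind.
for n in (10, 20, 40, 80):
    print(f"n = {n:>2}   n^3 = {n**3:>9,}   2^n = {2**n:>30,}")
```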

Another important complexity class is NP, which is where things first start to get difficult for our poor little computers. A problem in NP can be hard to solve (as in, if you hand it the wrong input, there isn't enough time left in the life of the universe for your computer to finish churning away!) But somewhat surprisingly, these problems are easy to verify. That is, if some oracle were able to magically produce a solution out of thin air, our computer could easily (again, "in polynomial time") check that the solution is correct. This means that there's always a simple, but long-running, algorithm for any NP problem: list out every single candidate solution, and check them one by one. There are typically exponentially many candidates, so checking each one in turn will take exponential (i.e., really f---ing long) time.
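
To make "easy to verify, hard to find" concrete, consider subset sum, a classic NP problem: given a set of numbers, is there a subset that adds up to some target? Verifying a proposed subset is a single pass, but without an oracle the obvious algorithm is exactly the one described above. A rough sketch in Python (the numbers are arbitrary):

```python
from itertools import chain, combinations

def verify(subset, target):
    # Verifying a proposed solution is cheap: one pass to add it up.
    return sum(subset) == target

def brute_force(numbers, target):
    # Without an oracle, the obvious algorithm tries all 2^n subsets in
    # turn, which takes exponential time in the worst case.
    all_subsets = chain.from_iterable(
        combinations(numbers, k) for k in range(len(numbers) + 1))
    return any(verify(subset, target) for subset in all_subsets)

print(brute_force([3, 34, 4, 12, 5, 2], 9))   # True: 4 + 5 == 9
print(brute_force([3, 34, 4, 12, 5, 2], 30))  # False: no subset adds to 30
```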

The last wrinkle is that even after decades upon decades, we still don't know whether or not P and NP are the same thing. We believe that the hardest problems in NP have no easy solutions, but we haven't been able to prove it. That leads us to the NP-hard problems. These problems have some kind of fundamental difficulty: if we were ever able to find an easy solution to an NP-hard problem, we could use it to construct easy solutions to every problem in NP. That, in turn, would prove that P = NP. We believe so strongly that P ≠ NP that we're willing to base entire fields like cryptography on that hunch.

All of this means that we have an (understandably) innate fear of NP-hard problems. With a problem that's merely in NP, there's always a chance that we just haven't thought about things long enough or hard enough: if we keep at it, we'd see that while our current algorithm is slow, the underlying problem is not actually hard, and we'd find a different algorithm that solves it more easily. On the other hand, if we discover that the problem itself is NP-hard, we're hosed. To crack this nut, we'd have to do something that none of the greatest minds in the history of our field have done.

So instead of trying to solve the unsolvable, we look for a simpler problem to solve. For example, dependency management (the kind provided by build tools like cargo and npm) is a provably NP-hard problem: you can construct a set of libraries whose dependency relationships with each other would cause npm install to run effectively forever. This is...not the best usability story, to say the least!

Proof of NP-hardness
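
The NP-hardness of dependency resolution is typically shown by observing that version constraints are expressive enough to encode boolean satisfiability, so any input that's hard for a SAT solver can be dressed up as a dependency graph that's hard for a package manager. Here's a toy sketch of that flavor of encoding; the package names and the miniature "resolver" are invented for illustration, and the linked proof has the real construction.

```python
from itertools import product

# Each boolean variable becomes a package with two versions (1 = "false",
# 2 = "true"); each clause becomes a requirement that at least one of its
# literals' preferred versions gets picked.

# Encode (x OR NOT y) AND (NOT x OR y) AND (NOT x OR NOT y):
clauses = {
    "c1": [("x", 2), ("y", 1)],   # x is true, or y is false
    "c2": [("x", 1), ("y", 2)],   # x is false, or y is true
    "c3": [("x", 1), ("y", 1)],   # x is false, or y is false
}
variables = ["x", "y"]

def resolve(clauses, variables):
    # The only general-purpose strategy we have: try every combination of
    # versions, which is exponential in the number of packages.
    for versions in product([1, 2], repeat=len(variables)):
        chosen = dict(zip(variables, versions))
        if all(any(chosen[pkg] == ver for pkg, ver in alternatives)
               for alternatives in clauses.values()):
            return chosen
    return None

print(resolve(clauses, variables))  # {'x': 1, 'y': 1}: the formula is satisfiable
```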

While researching this problem for the Go ecosystem, Russ Cox pointed out that you can simplify the problem by prohibiting certain kinds of negative dependencies (e.g., "the kitten package doesn't work with version 1.2.6 of the catnip package"). That, in turn, moves the dependency management problem from NP-hard to P. Perfect! Now our tool will never blow up on us.

Minimal Version Selection [Cox]
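
The reason the restricted problem is easy is that every remaining constraint has the form "I need at least version X of this module", so the answer for each module is simply the newest of the minimums anyone asks for. A rough sketch of that idea follows; the module graph and the code are invented for illustration, and Cox's post has the real algorithm.

```python
# Every requirement is "module M needs at least version V of module D", and
# there are no "doesn't work with" constraints, so for each module we just
# keep the highest minimum version we've seen: one walk over the graph, no
# backtracking, polynomial time.

requirements = {
    # (module, version) -> list of (dependency, minimum version) pairs
    ("app", 1):    [("kitten", 2), ("catnip", 1)],
    ("kitten", 2): [("catnip", 3)],
    ("catnip", 1): [],
    ("catnip", 3): [],
}

def select_versions(root):
    selected = {}                  # module -> highest minimum version seen
    worklist = [root]
    while worklist:
        module, version = worklist.pop()
        if selected.get(module, 0) >= version:
            continue               # already selected this version or newer
        selected[module] = version
        worklist.extend(requirements[(module, version)])
    return selected

print(select_versions(("app", 1)))  # {'app': 1, 'catnip': 3, 'kitten': 2}
```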

Unfortunately, by simplifying the problem, we run the risk that our solution is no longer useful. Sam Boyer suggests this is the case for Go dependency management: that there are real use cases where the community will need the kinds of negative dependencies that would be outlawed.

Failure modes [Boyer]

That leaves us with a conundrum. On the one hand, we can't simplify our problem without sacrificing real use cases. On the other, we have an algorithm that might not finish in any reasonable amount of time. What can we do?

I would argue that there's a third possibility, which is to just...shrug. NP-hardness is a bound on worst-case behavior. It might only apply to “pathological” inputs. For dependency management, for instance, we said that “you could construct” a dependency graph that causes npm install to blow up. But how many hoops do you have to jump through to construct that graph? Are you really going to encounter a graph like that in practice? If not, we don't really have to worry about the worst-case complexity.

Now, this doesn't necessarily apply to all NP-hard problems. You might be dealing with something so fundamentally difficult that all interesting inputs take exponential time to solve. If that's the case...well, this post isn't going to help you all that much. You'll still have to simplify your problem domain somehow to make any forward progress.

But before you go down that road, it's worth thinking through some real-world use cases. You might be pleasantly surprised!

If you have a hankering for formal methods, check out [Creager2007b]: a paper I wrote during my grad school days, where I use a CSP refinement checker to explore the complexity space of an interesting NP-hard problem, trying to find what kinds of inputs lead to the worst-case exponential behavior.

[Creager2007b] Empirical analysis and optimization of an NP-hard algorithm using CSP and FDR