Ignoring details like hyper-threading and caching: yes, 1.0 per core means there is, on average, one runnable process per core over the time-frame of that average. You might get more throughput at a higher load, around 2.0 per core (or cores + 1 jobs), which I know the Gentoo wiki used to recommend to minimise compile time because it makes the most of hyper-threading and covers the time jobs spend blocked on I/O. At that load you'll also start to see a loss of responsiveness, though, and that's how I've always chosen how many CPU-intensive processes to run at the same time.
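As a rough sketch of that heuristic (Python; the function names are mine, not from any particular tool), you can read the load average and core count from the OS and compare them:

```python
import os

def suggested_jobs() -> int:
    """The cores + 1 rule of thumb (the old Gentoo wiki advice for
    make -jN): one extra job so every core stays busy even while
    one job is blocked on I/O."""
    return (os.cpu_count() or 1) + 1

def load_per_core() -> float:
    """1-minute load average divided by core count; ~1.0 means on
    average one runnable process per core."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)
```

`os.getloadavg()` is only available on Unix-like systems; on Linux it reads the same numbers you see in `/proc/loadavg`.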
The idea is to make sure every processor always has something to do without creating unnecessary contention for the cache space inside the processor. Modern processors and kernels have ways of mitigating cache contention, though, and those strategies are a bit beyond my knowledge, so my advice may be outdated as to exactly what's going on with the load average.
It would be very interesting to graph load average against actual throughput in something like siege (run locally to avoid measuring network speed).
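A minimal sketch of the harness side of that experiment (Python; `run_benchmark` is a placeholder for whatever drives siege or another local benchmark and returns its throughput): sample the load average in a background thread while the benchmark runs, giving one (throughput, load-per-core) point per run to plot.

```python
import os
import threading
import time

def sample_load(samples, stop, interval=1.0):
    # Record (timestamp, 1-minute load average) tuples until told to stop.
    while not stop.is_set():
        samples.append((time.time(), os.getloadavg()[0]))
        stop.wait(interval)

def measure(run_benchmark):
    """Run a benchmark callable while sampling the load average.
    Returns (throughput, mean load per core) -- one data point
    for the load-vs-throughput graph."""
    samples, stop = [], threading.Event()
    sampler = threading.Thread(target=sample_load, args=(samples, stop))
    sampler.start()
    try:
        throughput = run_benchmark()
    finally:
        stop.set()
        sampler.join()
    mean_load = sum(load for _, load in samples) / max(len(samples), 1)
    return throughput, mean_load / (os.cpu_count() or 1)
```

Repeating `measure()` at increasing concurrency levels would give the curve; I'd expect throughput to flatten (and responsiveness to drop) somewhere past 1.0 load per core.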