Relevant suggestion, Robert Lancer. I’m by no means married to Python.
But I don’t think Python itself is the issue here. In my code it’s really just a wrapper around NumPy, and NumPy matrices seem to be fairly close to the metal. From what I can find by Googling, they just allocate contiguous memory, so the size should be close to number of entries x bytes/entry. That works out. I was topping out at around 30,000 word dimensions. Calculated naively at 8 bytes/entry (a 64-bit float, NumPy’s default), 30k x 30k is 900,000,000 entries, or 7,200,000,000 bytes, about 7.2GB. Including swap, that is close to what my 6GB RAM system can hold. So for 30k x 30k, somewhere around 6GB is about where I would expect to hit the wall.
I don’t actually need all 8 bytes/entry. But that’s an optimization for production rather than testing. Maybe, I don’t know. I might check using fewer bytes/entry.
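As a rough check of how much a smaller entry type would save, here is a minimal sketch. It assumes a plain dense NumPy array, and the helper name `matrix_nbytes` is just something I made up for illustration. It uses a small 1,000 x 1,000 matrix so it runs on any machine; the ratios should carry over to the big case.

```python
import numpy as np

def matrix_nbytes(n, dtype):
    """Bytes used by the data buffer of a dense n x n array of the given dtype."""
    return np.zeros((n, n), dtype=dtype).nbytes

n = 1000  # small stand-in for the real vocabulary size
for dtype in (np.float64, np.float32, np.uint8):
    per_entry = np.dtype(dtype).itemsize  # bytes per entry for this dtype
    total = matrix_nbytes(n, dtype)
    print(f"{np.dtype(dtype).name}: {per_entry} bytes/entry, {total:,} bytes total")
```

So going from 64-bit floats to 32-bit floats should halve the memory, and a 1-byte integer type would cut it to an eighth, assuming the values actually fit in that range.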
Production will probably represent words as a distribution of bits anyway, so fewer overall dimensions.
For testing, if I can’t find a good sparse format, I may just have to rent a big server. I think 100k x 100k should be enough. That might be 80GB. But Amazon seem to have as much as 244GB on offer, for $2.66/hr.
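A quick back-of-the-envelope check on those sizes, as a sketch only. It assumes a dense matrix with a fixed number of bytes per entry (8 bytes for a 64-bit float) and ignores any overhead. The function name `dense_matrix_gb` is my own, not from any library.

```python
def dense_matrix_gb(rows, cols, bytes_per_entry=8):
    """Approximate size in GB (10**9 bytes) of a dense rows x cols matrix."""
    return rows * cols * bytes_per_entry / 1e9

for n in (30_000, 100_000):
    for bytes_per_entry in (8, 4, 1):
        size = dense_matrix_gb(n, n, bytes_per_entry)
        print(f"{n:>7,} x {n:<7,} at {bytes_per_entry} bytes/entry: {size:6.1f} GB")
```

At 8 bytes/entry this gives about 7.2GB for 30k x 30k and 80GB for 100k x 100k, which is where the 80GB estimate comes from.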
I’m also looking at something called PyTables:
Though that seems to only work where you access just a part of the matrix at one time. I’m not sure that would apply in my case.
There are things to explore.
My only issue is I’d rather be working on the logic of the solution than the engineering!