Sarthak Sharma
· Guwahati, India

Hi Rob, if you could specifically tell me the input and desired output, I could write a python program for that. I have worked on a neural simulation software in the past and would like to contribute to your project as well.

Rob Freeman
· Los Angeles, U.S.
``````Input: "hello world"

Output:
[[ 0.  1.]
[ 0.  0.]]

index 0 = hello
index 1 = world
``````

So just 1’s marking where a column word follows a row word: the 1 at row 0, column 1 says that "world" follows "hello".
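A minimal sketch of that encoding, assuming whitespace tokenization and word indices assigned in order of first appearance. The function name `follower_matrix` is hypothetical, not something from the project:

```python
import numpy as np

def follower_matrix(text):
    """Return (matrix, vocab) where matrix[i, j] == 1 if word j follows word i."""
    words = text.split()  # assumption: words are separated by whitespace
    # Assign indices in order of first appearance: hello -> 0, world -> 1.
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    matrix = np.zeros((len(vocab), len(vocab)))
    # Mark every adjacent (row word, column word) pair.
    for prev, nxt in zip(words, words[1:]):
        matrix[vocab[prev], vocab[nxt]] = 1.0
    return matrix, vocab

matrix, vocab = follower_matrix("hello world")
print(matrix)
```

For "hello world" this reproduces the 2 x 2 matrix above, with the single 1 at row 0 (hello), column 1 (world).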

Rob Freeman
· Los Angeles, U.S.

If you’re still looking at this, Sarthak Sharma, I’ve tried a few things.

I tried just declaring a numpy array with enough dimensions. But there are 60,000+ words, and numpy matrices with 60,000+ rows and columns are too big for my RAM. (I have 6GB, and it seems to top out at around 40,000 rows and columns. I don’t know how it calculates that. Maybe by closing some programs and using a lighter GUI I could get more.)

Currently I’m looking at scipy sparse dok_matrix matrices, but I’m blocked trying to find equivalents for operations like:

``````matrix[matrix<0]=0
``````

Which zeros all numpy matrix entries less than zero. And:

``````numpy.fill_diagonal(matrix,0)
``````

Which zeros the diagonal of numpy matrix “matrix”.
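One possible route, sketched here under the assumption that the matrix can be converted to CSR format first (for a dok_matrix, via `tocsr()`): CSR keeps its stored values in a flat `data` array that accepts the same boolean-mask assignment, and scipy sparse matrices provide `setdiag` for the diagonal. The small example matrix is made up for illustration.

```python
import numpy as np
from scipy import sparse

# Made-up example values; the real matrix would come from the word data.
dense = np.array([[5.0, -1.0,  0.0],
                  [2.0,  3.0, -4.0],
                  [0.0,  0.0,  7.0]])
m = sparse.csr_matrix(dense)

# Counterpart of matrix[matrix < 0] = 0: only stored entries can be
# negative, so masking the flat data array is enough.
m.data[m.data < 0] = 0

# Counterpart of numpy.fill_diagonal(matrix, 0).
m.setdiag(0)

# Drop the explicit zeros left behind so they stop using memory.
m.eliminate_zeros()

print(m.toarray())
```

Both operations touch only the stored entries, so their cost scales with the number of nonzeros rather than with the full rows x columns size.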

It is also very slow. It took about two hours to read in just 4% of a 5 million word sample text.
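Setting sparse entries one at a time is a likely cause of that kind of slowness. A hedged sketch of an alternative: collect the distinct (row word, column word) pairs in plain Python first, then build the sparse matrix in a single call. The function name `build_follow_matrix` and the sample sentence are hypothetical.

```python
import numpy as np
from scipy import sparse

def build_follow_matrix(words):
    """Build a sparse 0/1 matrix where entry (i, j) marks word j following word i."""
    vocab = {}
    # Map each word to an integer id, assigned in order of first appearance.
    ids = [vocab.setdefault(w, len(vocab)) for w in words]
    # Collect each distinct adjacent pair once; no sparse writes happen here.
    pairs = sorted(set(zip(ids, ids[1:])))
    rows = [i for i, _ in pairs]
    cols = [j for _, j in pairs]
    vals = np.ones(len(pairs))
    n = len(vocab)
    # One bulk construction in COO format, then convert to CSR for arithmetic.
    m = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    return m, vocab

m, vocab = build_follow_matrix("the cat sat on the cat".split())
print(m.toarray())
```

The design choice is to keep the per-word work in a dict and a set, which are fast in Python, and to touch the sparse structure only once at the end.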

Rob Freeman
· Los Angeles, U.S.

Update:

Seems I can get the functionality I had with numpy:

``````matrix[matrix<0]=0
``````

by iterating over all terms in the dok_matrix sparse matrix:

``````# Copy the items first: assigning 0 can remove the key while we iterate.
for key, value in list(matrix.items()):
    if value < 0:
        matrix[key] = 0
``````
Steven Reubenstone
· New York, U.S.

Robert Lancer any ideas here?

Robert Lancer
Chief Technology Officer at Collaborizm
· New York, U.S.

I’d say that if memory is an issue, maybe switch from Python to a more performant language, maybe #Rust: https://www.rust-lang.org/

Rob Freeman
· Los Angeles, U.S.

Relevant suggestion, +Robert Lancer. I’m by no means married to Python.

But I don’t think Python itself is the issue here. It’s just a wrapper for Numpy, and Numpy matrices seem to be fairly close to the metal. From googling, they just allocate contiguous memory, and the size should be something close to rows x columns x bytes/entry. That works out. I was topping out at 30,000 word dimensions. Even naively calculated at 8 bytes/entry (the float64 default), that’s 7,200,000,000 bytes, or about 7.2GB. Including swap, that’s close to my 6GB RAM system. It seems naive, but basically 30k x 30k is probably about where 6GB is going to top out.

I don’t actually need all 8 bytes/entry. But that’s an optimization for production rather than testing. Maybe, I don’t know. I might check using fewer bytes/entry.
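As a rough illustration of what a smaller entry type buys, the sketch below allocates the same small matrix with three NumPy dtypes and compares the memory used. The size `n = 1_000` is a stand-in chosen so the example runs anywhere. A signed type such as `int8` is used because the matrix can hold negative entries, but it only covers -128 to 127, so it suits flags or small counts only.

```python
import numpy as np

n = 1_000  # small stand-in for the vocabulary size

sizes = {}
for dtype in (np.float64, np.float32, np.int8):
    m = np.zeros((n, n), dtype=dtype)
    # nbytes is the size of the array's data buffer: n * n * bytes per entry.
    sizes[np.dtype(dtype).name] = m.nbytes

for name, nbytes in sizes.items():
    print(f"{name:>8}: {nbytes / 1e6:.1f} MB")
```

Going from `float64` to `int8` cuts the footprint by a factor of eight, which at 30,000 x 30,000 would be 0.9GB instead of 7.2GB.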

Production will probably represent words as a distribution of bits anyway, so fewer overall dimensions.

For testing, if I can’t find a good sparse format, I may just have to rent a big server. I think 100k x 100k should be enough. That might be 80GB. But Amazon seem to have as much as 244GB on offer, for $2.66/hr.
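The sizing estimate can be checked with a few lines, assuming a dense square matrix with NumPy's default `float64` entries (8 bytes each). The helper name `dense_matrix_bytes` is hypothetical.

```python
import numpy as np

def dense_matrix_bytes(n_words, dtype=np.float64):
    """Bytes needed for a dense n_words x n_words matrix of the given dtype."""
    return n_words * n_words * np.dtype(dtype).itemsize

for n in (30_000, 60_000, 100_000):
    gb = dense_matrix_bytes(n) / 1e9
    print(f"{n:>7,} x {n:<7,} float64: {gb:6.1f} GB")
```

This gives 7.2 GB for 30,000 words, 28.8 GB for 60,000 and 80 GB for 100,000, before any overhead from the rest of the program.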

I’m also looking at something called PyTables:

https://kastnerkyle.github.io/posts/using-pytables-for-larger-than-ram-data-processing/

Though that seems to only work where you access just a part of the matrix at one time. I’m not sure that would apply in my case.

There are things to explore.

My only issue is I’d rather be working on the logic of the solution than the engineering!