Creating a big matrix from a big dictionary efficiently

I have a dictionary where the keys are document ids and the values are lists of tuples; the first element of each tuple is a term (a word) and the second element is a number (a score).

Basically, each document has some terms, and a value for each of them.
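For concreteness, a dictionary of this shape might look like the following (the document ids, terms, and scores here are purely illustrative):

```python
# Illustrative example of the dictionary described above;
# each key is a document id, each value a list of (term, score) tuples.
finalDict = {
    "document_one": [("term_1", 0.5), ("term_6", 0.8)],
    "document_two": [("term_1", 0.2), ("term_8", 0.9)],
}
```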

Now, I am trying to create a sparse matrix from this dictionary where the rows are the document ids from the dictionary, the columns are the terms, and the values of the matrix are the corresponding scores (value_1, ..., value_k) from the dict, i.e.:

 matrix[document_one, term_1] = 0.5  
 matrix[document_one, term_6] = 0.8

But if document_one does not contain term_8, then:

 matrix[document_one, term_8] = 0

Basically I will have a sparse matrix where, if a document contains a term, the corresponding cell holds a value greater than zero; otherwise it holds zero.

The size of the matrix will be number_of_rows x number_of_columns, where:

number_of_rows: the number of documents (the number of keys in the dict)
number_of_columns: the number of distinct terms

I am doing this with pandas, and it works, but it is far too slow: in my case it takes about 20 minutes to build this "matrix". Is there any faster way to do this?

My code for this is:

import pandas as pd

data_frame = pd.DataFrame()

# Assign one cell at a time; each .loc assignment may enlarge the
# frame, which is what makes this approach slow.
for doc_id, list_values in finalDict.items():
    for tpl in list_values:
        data_frame.loc[doc_id, tpl[0]] = tpl[1]

# Replace missing (document, term) cells with zero.
data_frame = data_frame.fillna(0)
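One way to avoid the per-cell `.loc` assignment, sketched below under the assumption that `finalDict` has the shape described above (the sample ids and terms are hypothetical), is to build the whole frame in one call with `pd.DataFrame.from_dict`, or, for a genuinely sparse result, to collect (row, col, value) triples and construct a `scipy.sparse.coo_matrix` in a single step:

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical input of the shape described in the question.
finalDict = {
    "doc_a": [("term_1", 0.5), ("term_6", 0.8)],
    "doc_b": [("term_1", 0.2), ("term_8", 0.9)],
}

# Pandas-only: convert each tuple list to a dict and build the entire
# frame in one call instead of assigning cell by cell with .loc.
nested = {doc: dict(tuples) for doc, tuples in finalDict.items()}
data_frame = pd.DataFrame.from_dict(nested, orient="index").fillna(0)

# Truly sparse alternative: map document ids and distinct terms to
# row/column indices, then build a scipy COO matrix in one call.
doc_index = {doc: i for i, doc in enumerate(finalDict)}
terms = sorted({t for tuples in finalDict.values() for t, _ in tuples})
term_index = {t: j for j, t in enumerate(terms)}

rows, cols, vals = [], [], []
for doc, tuples in finalDict.items():
    for term, score in tuples:
        rows.append(doc_index[doc])
        cols.append(term_index[term])
        vals.append(score)

matrix = coo_matrix((vals, (rows, cols)),
                    shape=(len(doc_index), len(terms)))
```

Both versions touch each (document, term, score) triple exactly once, so they should scale far better than growing a DataFrame one cell at a time; the COO matrix also never materializes the zeros.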
