I am trying to work with gensim for topic modeling. From what I can tell from the module's documentation, gensim expects to receive its input as a list, with each item in the list being a text:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system"]
I have a collection of texts in a directory that I would like to use with gensim, and so I need to read those files into a list. Each text needs to be one item in the list, even though some of them consist of multiple lines (the texts range in size from a little under 100 words to a little over 1000 words). If stripping newlines out is required, I think I can figure out how to do that, but embedding it into a loop is where I fail ... completely. (In fact, I am taking myself to loop school over the weekend, but I regularly mess that part up.)
I have found all kinds of useful information on how to read a single file into a list -- by line or by word or by whatever -- but I can't figure out how to read a series of text files into a series of strings all contained within a single list -- this is the important bit:
textfile1.txt
textfile2.txt
need to become
list = ['contents of textfile1', 'contents of textfile2']
Here's what I have so far:
# get to the files, open an empty list
import glob
file_list = glob.glob('./texts' + '/*.txt')
documents = []

# Now to read the files into a list:
for file in file_list:
    documents.append()

print documents
The print documents is obviously a throwaway line so I could check my work, and you can see that I didn't get very far with the loop.
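For reference, here is a minimal sketch of how the missing loop body might look: open each file, read its whole contents as one string, collapse the newlines, and append that string to the list. The helper name read_documents and its pattern parameter are my own inventions for illustration, not anything from gensim. The sketch also assumes the files can be decoded with the platform's default encoding, and it uses the print() function form so that it should run under Python 3 as well.

```python
import glob
import os


def read_documents(directory, pattern="*.txt"):
    """Return one string per matching file, with newlines collapsed to spaces.

    Hypothetical helper for illustration; not part of gensim.
    """
    documents = []
    # sorted() keeps the order stable, since glob makes no ordering promise
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # Assumption: the platform's default text encoding fits these files
        with open(path) as handle:
            text = handle.read()
        # split() with no arguments splits on any run of whitespace,
        # including newlines, so joining on a space gives one flat string
        documents.append(" ".join(text.split()))
    return documents


documents = read_documents("./texts")
print(documents)
```

If the directory is missing or holds no .txt files, glob returns nothing and the result is simply an empty list, so it is probably worth checking the length of documents before handing it on.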