python: Python: How to read a directory of texts into a list

Friday, 8 May 2015

I am trying to work with gensim for topic modeling. From what I can tell from the module's documentation, gensim expects to receive its input as a list, with each item in the list being a text:

documents = ["Human machine interface for lab abc computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system"]

I have a collection of texts in a directory that I would like to use with gensim, so I need to read those files into a list. Each text needs to be one item in the list, even though some of them span multiple lines (they range in size from a little under 100 words to a little over 1,000 words). If stripping out newlines is required, I think I can figure out how to do that, but embedding it in a loop is where I fail ... completely. (In fact, I am taking myself to loop school over the weekend, but I regularly mess that part up.)

I have found all kinds of useful information on how to read a single file into a list (by line, by word, or by whatever), but I can't figure out how to read a series of text files into a series of strings all contained within a single list. This is the important bit:

textfile1.txt
textfile2.txt

need to become

documents = ['contents of textfile1', 'contents of textfile2']

Here's what I have so far:

# get to the files, open an empty list

import glob

file_list = glob.glob('./texts' + '/*.txt')
documents = []

# Now to read the files into a list:

for file in file_list:
    documents.append()

print documents

The print documents is obviously a throwaway line so I could check my work, and you can see that I didn't get very far with the loop.
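One way the loop could be completed is sketched below. It is only a sketch under some assumptions that are not in the question: Python 3, files encoded as UTF-8, and a helper name, read_texts, that is made up for illustration. It opens each file that glob finds, reads the whole file as one string, collapses newlines and other runs of whitespace into single spaces, and appends the result to the list. The paths are sorted because glob.glob() does not promise any particular order.

```python
import glob
import os


def read_texts(directory, pattern="*.txt", encoding="utf-8"):
    """Return a list with one string per matching file in `directory`.

    `read_texts` is a hypothetical helper name, and UTF-8 is an
    assumption about how the files were saved.
    """
    # Sort the paths so the order of the resulting list is predictable.
    file_list = sorted(glob.glob(os.path.join(directory, pattern)))

    documents = []
    for path in file_list:
        # Read the whole file as a single string; the with-block
        # closes the file again once the read is done.
        with open(path, encoding=encoding) as handle:
            text = handle.read()
        # split() with no arguments breaks on any run of whitespace,
        # so joining with single spaces removes the newlines.
        documents.append(" ".join(text.split()))
    return documents
```

With the directory from the question, the call would be documents = read_texts('./texts'), and print(len(documents)) is a quick way to check that every file was picked up. If the newlines should be kept, appending text instead of the joined version leaves each file's contents untouched.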
