How can I get my app to read a Word doc line-by-line?

I have an ASP.NET app (using VB.NET) that allows a user to upload a file, and then the program should parse certain info out of the file and insert it into a database. The file is a Word document, so when I try to just use a FileStream, there is so much junk in the file.

So, I figured I would use a Word.Document and try to read the data there. I have found NUMEROUS examples of how to manipulate data and create new Word documents using .NET, but I have not seen an example where I can just read the document line by line. This is important to my program.

I have also explored the Word.Document and Word.ApplicationClass functions and can't find an answer.

Do you know of a good way to READ in a Word document without that document being converted into another format from the user? I don't need to write or manipulate the data. Thank you.
One way of doing this is using the clipboard. You will need a reference to System.Windows.Forms, and then you can use something like the following to get the text version of the document:

Word.Application app = new Word.ApplicationClass();
//Refs to pass to the Open method for parameters we're not interested in.
object nullobj = System.Reflection.Missing.Value;

//Full file path.
object file = @"C:\MyDoc.doc";

Word.Document doc = app.Documents.Open(
 ref file, ref nullobj, ref nullobj, 
 ref nullobj, ref nullobj, ref nullobj, 
 ref nullobj, ref nullobj, ref nullobj, 
 ref nullobj, ref nullobj, ref nullobj, 
 ref nullobj, ref nullobj, ref nullobj);

//Select and copy text to clipboard.

IDataObject data = Clipboard.GetDataObject();
//Do whatever with the text.
string text = data.GetData(DataFormats.Text).ToString();

//Close doc and shutdown Word application.
doc.Close(ref nullobj, ref nullobj, ref nullobj);
app.Quit(ref nullobj, ref nullobj, ref nullobj);

Beware, though, that using these automation objects actually instantiates Word, therefore, I wouldn't recommend it for a high-traffic app. And, remember each instance created uses somewhere in between 10-20 MB of memory (Word XP in my case).

This was last published in October 2003

Have a question for an expert?

Please add a title for your question

Get answers from a TechTarget expert on whatever's puzzling you.

You will be able to add details on the next page.



Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to: