I have an ASP.NET app (using VB.NET) that allows a user to upload a file, and then the program should parse certain info out of the file and insert it into a database. The file is a Word document, so when I try to just use a FileStream, there is so much junk in the file.
So, I figured I would use a Word.Document and try to read the data there. I have found NUMEROUS examples of how to manipulate data and create new Word documents using .NET, but I have not seen an example where I can just read the document line by line. This is important to my program.
I have also explored the Word.Document and Word.ApplicationClass functions and can't find an answer.
Do you know of a good way to READ in a Word document without that document being converted into another format from the user? I don't need to write or manipulate the data. Thank you.
One way of doing this is using the clipboard. You will need a reference to System.Windows.Forms, and then you can use something like the following to get the text version of the document:
Word.Application app = new Word.ApplicationClass(); //Refs to pass to the Open method for parameters we're not interested in. object nullobj = System.Reflection.Missing.Value; //Full file path. object file = @"C:\MyDoc.doc"; Word.Document doc = app.Documents.Open( ref file, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj); //Select and copy text to clipboard. doc.ActiveWindow.Selection.WholeStory(); doc.ActiveWindow.Selection.Copy(); IDataObject data = Clipboard.GetDataObject(); //Do whatever with the text. string text = data.GetData(DataFormats.Text).ToString(); Console.WriteLine(text); //Close doc and shutdown Word application. doc.Close(ref nullobj, ref nullobj, ref nullobj); app.Quit(ref nullobj, ref nullobj, ref nullobj);
Beware, though, that using these automation objects actually instantiates Word, therefore, I wouldn't recommend it for a high-traffic app. And, remember each instance created uses somewhere in between 10-20 MB of memory (Word XP in my case).
Dig Deeper on Win Development Resources
Related Q&A from Daniel Cazzulino
Here Daniel Cazzulino explains how to load a DSL (domain specific language) domain model instance file programmatically. This requires the .NET type ... Continue Reading
Here we offer a glimpse at 12 of .NET development expert Danny Cazzulino's top ASP.NET questions and answers. Continue Reading
C# developers should NOT be modifying InitializeComponent method in the code-behind (or any of the variable definitions) by hand. Continue Reading