Convert a docx document to HTML

11

I have a docx document already stored as a byte[] and I need to convert it to HTML so I can display it on a page.

I'm developing this in C# with .NET in Visual Studio.

I already do this for PDF, which is easy to convert to HTML, but that is not the case for docx or any other Microsoft format: I can't use the native Interop library, because there is no guarantee that it is installed on the server.

The final result is:

strFinalDoc = strFinalDoc.Replace("<body>", "<body>" + documentInfoHtml + "<BR /><BR />");

Here documentInfoHtml is the result of converting the byte[] to HTML, and strFinalDoc is simply the content that replaces the body of a page.

I have found some solutions, but practically all of them use Interop or paid libraries.

Do you know of a way to do this with free software or open-source projects?

I also need to do the same for xls and xlsx files.

The current answer is very good, but it only covers doc files, not docx.

It is also important to preserve the existing CSS styles as much as possible, so answers that simply extract the content and emit it as HTML are not enough, since I would lose all the formatting.

    
asked by Miquel Coll 30.06.2016 at 16:57

3 answers

10

This is relatively easy to do with Apache POI, so in C# we can use NPOI, its .NET port, to do the conversion.

The code is based on this answer to "Convert Word to HTML with Apache POI":

  

Java version

// Load the .doc file (backslashes must be escaped in Java string literals)
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(
        new FileInputStream("D:\\temp\\seo\\1.doc"));

// Convert the Word document into an in-memory HTML DOM
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();

// Serialize the DOM as indented UTF-8 HTML
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);

TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();

// Decode with the same charset that was used for serialization
String result = new String(out.toByteArray(), "UTF-8");
System.out.println(result);

Translated to C#:

HWPFDocumentCore wordDocument = WordToHtmlUtils.LoadDoc(@"D:\Hola.doc"); 

WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
    new XmlDocument());

wordToHtmlConverter.ProcessDocument(wordDocument);

XmlDocument htmlDocument = wordToHtmlConverter.Document;

htmlDocument.Save(@"D:\Hola.html");
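Since the question needs the HTML as a string (documentInfoHtml) rather than as a file on disk, the following is a minimal sketch of serializing the resulting XmlDocument in memory. The class and method names (XmlToStringSketch, Serialize) are made up for illustration. It uses only System.Xml, so the output is XML-style serialization, not a true equivalent of the Java Transformer's "html" output method.

```csharp
using System.IO;
using System.Xml;

public static class XmlToStringSketch
{
    // Hypothetical helper: writes the document to a string instead of a file.
    public static string Serialize(XmlDocument document)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,             // roughly OutputKeys.INDENT = "yes"
            OmitXmlDeclaration = true  // no <?xml ...?> line in front of the markup
        };

        using (var text = new StringWriter())
        {
            using (XmlWriter writer = XmlWriter.Create(text, settings))
            {
                document.Save(writer);
            }
            // The writer is disposed (and therefore flushed) before reading the result.
            return text.ToString();
        }
    }
}
```

With this, documentInfoHtml could be assigned from XmlToStringSketch.Serialize(htmlDocument) instead of reading back a saved file.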

I recommend not installing NPOI from NuGet (current version 2.2.1). Instead, use version 2.1.3.1 from the official page, because you need two extra files that the NuGet package does not include: NPOI.ScratchPad.HSSF.dll and NPOI.ScratchPad.HWPF.dll. Both are compiled against .NET Framework 2.x, so the other NPOI libraries need to be the 2.x builds as well. You can download these two files from the NPOI GitHub repository.

In my tests, the NPOI version seems to have a bug in the generated HTML. To reproduce the formatting, the converter creates CSS classes named with the first letter of the tag and an incremental number:

<!-- POI example (Java) -->
span.s1{color:red;}
...
<span class="s1">Hola</span>

but for some reason the .NET version defines the styles without applying them to the elements:

<!-- NPOI example (C#) -->
span.s1{color:red;}
...
<span>Hola</span>

It may have to do with the Transformer, but I do not know what the equivalent would be in C#.

Doing the count manually may be all you need for the output to display properly:

    ....
    XmlNode node = htmlDocument.FirstChild.LastChild; // find the body
    EditNode(node); // recursive edit method
    htmlDocument.Save(@"D:\tmp18\Hola.html");
}

// Keeps a running count per element name (e.g. "span", "p")
Dictionary<string, int> cuenta = new Dictionary<string, int>();

private void EditNode(XmlNode node) {
    // Only elements can carry a class attribute; skip text nodes, comments, etc.
    XmlElement xe = node as XmlElement;
    if (xe == null) { return; }

    // Start each element name at 1 the first time it is seen
    int n;
    if (!cuenta.TryGetValue(xe.LocalName, out n)) { n = 1; }

    // Class name = first letter of the tag + running number, e.g. "s1" for the first span
    xe.SetAttribute("class", xe.LocalName.Substring(0, 1) + n);
    cuenta[xe.LocalName] = n + 1;

    foreach (XmlNode x in node.ChildNodes) {
        EditNode(x);
    }
}
    
answered by 30.06.2016 / 20:01
4

Since a Word document is made up of XML, why not start from there and simply convert the XML to HTML? The MSDN page shows the structure of a Word document in XML; here is a sample from it:

 <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
  <CoreProperties xmlns="http://schemas.microsoft.com/package/2005/06/md/core-properties"> 
   <Title>Word Document Sample</Title> 
   <Subject>Microsoft Office Word 2007</Subject> 
   <Creator>2007 Microsoft Office System User</Creator> 
   <Keywords/> 
   <Description>2007 Microsoft Office system .docx file</Description> 
   <LastModifiedBy>2007 Microsoft Office System User</LastModifiedBy> 
   <Revision>2</Revision> 
   <DateCreated>2005-05-05T20:01:00Z</DateCreated> 
   <DateModified>2005-05-05T20:02:00Z</DateModified> 
  </CoreProperties>

MSDN also gives an example of using the XmlDocument class:

 using System;
 using System.IO;
 using System.Xml;

 public class Sample
 {
   public static void Main()
   {
     //Create the XmlDocument.
     XmlDocument doc = new XmlDocument();
     doc.LoadXml("<?xml version='1.0' ?>" +
            "<book genre='novel' ISBN='1-861001-57-5'>" +
            "<title>Pride And Prejudice</title>" +
            "</book>");

     //Display the document element.
     Console.WriteLine(doc.DocumentElement.OuterXml);
  }
 }

Now, to access the nodes, you can do it like this:

  public XmlNode GetBook(string uniqueAttribute, XmlDocument doc)
  {
      XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
      nsmgr.AddNamespace("bk", "http://www.contoso.com/books");
      string xPathString = "//bk:books/bk:book[@ISBN='" + uniqueAttribute + "']";
      XmlNode xmlNode = doc.DocumentElement.SelectSingleNode(xPathString, nsmgr);
      return xmlNode;
  }

From there you can concatenate all of your HTML. The code samples above come from the MSDN XmlDocument Class page.
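To make the idea concrete, here is a rough sketch of walking WordprocessingML with XmlDocument and an XmlNamespaceManager and concatenating HTML from it. The class and method names (WordXmlToHtmlSketch, Convert) are invented for illustration. It assumes the input is the XML text of the document body part, handles only paragraphs, runs and a simple bold flag, and so falls well short of the style fidelity the question asks for.

```csharp
using System.Net;
using System.Text;
using System.Xml;

public static class WordXmlToHtmlSketch
{
    // Main WordprocessingML namespace, conventionally bound to the "w" prefix.
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: maps each w:p to <p> and each bold w:r to <b>.
    public static string Convert(string documentXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(documentXml);

        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", W);

        var html = new StringBuilder();
        foreach (XmlNode p in doc.SelectNodes("//w:p", ns))
        {
            html.Append("<p>");
            // Only runs that are direct children of the paragraph are visited;
            // runs nested in hyperlinks and similar wrappers are skipped.
            foreach (XmlNode run in p.SelectNodes("w:r", ns))
            {
                var text = new StringBuilder();
                foreach (XmlNode t in run.SelectNodes("w:t", ns))
                {
                    text.Append(t.InnerText);
                }
                string encoded = WebUtility.HtmlEncode(text.ToString());

                // Simplification: any w:b element counts as bold, even w:val="false".
                bool bold = run.SelectSingleNode("w:rPr/w:b", ns) != null;
                html.Append(bold ? "<b>" + encoded + "</b>" : encoded);
            }
            html.Append("</p>");
        }
        return html.ToString();
    }
}
```

Every other formatting feature (fonts, colours, tables, images, numbering) would need its own mapping, which is why a converter library is usually less work than this approach.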

    
answered by 30.06.2016 at 20:29
1

Conversion

As you have already noticed, a docx is nothing more than zipped XML, so it can be converted to HTML.
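As a small illustration of the "zipped XML" point, the sketch below opens the byte[] as a zip archive and returns the XML of the main document part. The names DocxPackageSketch and ReadMainPart are hypothetical. It assumes the main part is at word/document.xml, which is where Word normally writes it, and it needs the System.IO.Compression assembly (.NET 4.5 or later).

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class DocxPackageSketch
{
    // Hypothetical helper: returns the raw XML of the main document part.
    public static string ReadMainPart(byte[] docxBytes)
    {
        using (var stream = new MemoryStream(docxBytes))
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            // Assumption: the main part sits at the conventional path.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found in the package");
            }

            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

The returned string is WordprocessingML, not HTML, so it still has to be transformed before it can be placed in the page.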

Sending to the client

To send the result to the client (and make sure the browser does not treat it as plain text), remember to send the headers first:

Content-Type: text/html; charset=utf-8
Content-Length: 12345

Set charset to whatever encoding you actually use, and Content-Length to the real size in bytes, not characters (remember that a UTF-8 character can take more than one byte). The length lets the browser know how many bytes to expect, so it can show a progress bar when the document is long.
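The difference between characters and bytes can be checked with a short sketch like the one below. The class and method names (ContentLengthSketch, Utf8ByteCount) are illustrative only, and it assumes the response is encoded as UTF-8.

```csharp
using System.Text;

public static class ContentLengthSketch
{
    // Number of bytes the HTML occupies once encoded as UTF-8.
    // This, not html.Length, is the value Content-Length should carry.
    public static int Utf8ByteCount(string html)
    {
        return Encoding.UTF8.GetByteCount(html);
    }
}
```

For example, "<p>A\u00F1adir</p>" has a Length of 13 characters, but Utf8ByteCount reports 14 because the letter ñ takes two bytes in UTF-8.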

Compression

Once it works without compression, you could consider using middleware or a server module to send the response compressed (gzip, for example).
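If no middleware or module is available, the compression itself can be sketched with GZipStream from the framework. The names GzipSketch, Compress and Decompress are made up for this example. A compressed response also has to carry the header Content-Encoding: gzip, and Content-Length then refers to the compressed size.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipSketch
{
    // Encodes the HTML as UTF-8 and gzips it.
    public static byte[] Compress(string html)
    {
        byte[] raw = Encoding.UTF8.GetBytes(html);
        using (var output = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the gzip stream
            // is disposed; disposing is what flushes the final gzip block.
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    // Inverse operation, included only to check the round trip.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

HTML generated from documents tends to be repetitive, so it usually compresses well, but the gain is worth measuring on real files before adding the extra step.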

    
answered by 09.07.2016 at 15:59