With Apache POI this conversion is relatively easy, so in C# we can use NPOI, the .NET port of POI, to do the same transformation.
Starting from this answer to "Convert Word to HTML with Apache POI":
Java version
// Backslashes must be escaped in a Java string literal
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(
        new FileInputStream("D:\\temp\\seo\\1.doc"));
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();
// Decode with the same encoding the serializer was told to write
String result = new String(out.toByteArray(), "UTF-8");
System.out.println(result);
Translated to C#
HWPFDocumentCore wordDocument = WordToHtmlUtils.LoadDoc(@"D:\Hola.doc");
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
new XmlDocument());
wordToHtmlConverter.ProcessDocument(wordDocument);
XmlDocument htmlDocument = wordToHtmlConverter.Document;
htmlDocument.Save(@"D:\Hola.html");
I recommend that you do not install NPOI from NuGet (current version 2.2.1). Use version 2.1.3.1 from the official page instead, because you need two additional files that the NuGet package does not include: NPOI.ScratchPad.HSSF.dll and NPOI.ScratchPad.HWPF.dll.
Both are compiled against .NET Framework 2.x, so the other NPOI libraries must be the 2.x builds as well. You can download these two files from the NPOI GitHub repository.
In testing, the NPOI version seems to have a bug in the final HTML. To reproduce the formatting, the converter creates CSS classes named with the first letter of the tag plus an incrementing number:
<!-- POI (Java) example -->
span.s1{color:red;}
...
<span class="s1">Hola</span>
but for some reason the .NET version writes the style rule and leaves the class attribute off the element:
<!-- NPOI (C#) example -->
span.s1{color:red;}
...
<span>Hola</span>
It may have to do with the Transformer step, but I do not know what the C# equivalent would be.
Adding the classes yourself with a manual count may be all you need for the output to display properly:
....
XmlNode node = htmlDocument.FirstChild.LastChild; // locate the body
EditNode(node); // recursive editing method
htmlDocument.Save(@"D:\tmp18\Hola.html");
}
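// Running count per element name; must be initialized before the first call
Dictionary<string, int> cuenta = new Dictionary<string, int>();
private void EditNode(XmlNode node) {
    XmlElement xe = node as XmlElement;
    if (xe == null) return; // text nodes and comments have no attributes
    int n;
    cuenta.TryGetValue(xe.LocalName, out n); // n stays 0 the first time a name is seen
    n += 1;
    cuenta[xe.LocalName] = n;
    // LocalName would be "span" or "p", for example, so the class becomes "s1", "p1", ...
    // This assumes the generated style rules are numbered in document order.
    xe.SetAttribute("class", xe.LocalName.Substring(0, 1) + n);
    foreach (XmlNode x in node.ChildNodes) {
        EditNode(x);
    }
}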
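For anyone comparing against the Java side, the same manual count can be sketched with only the JDK's DOM classes. This is an illustration of the naming scheme described above (first letter of the tag plus a running number), not code taken from POI or NPOI; the class name ClassNumbering and the methods addClasses and parse are hypothetical, and it assumes one class per element in document order.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ClassNumbering {

    // Walks the tree depth-first and gives every element a class made of the
    // first letter of its tag name plus a per-tag running count.
    public static void addClasses(Node node, Map<String, Integer> counts) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // text nodes and comments carry no attributes
        }
        Element element = (Element) node;
        String tag = element.getTagName();
        int n = counts.merge(tag, 1, Integer::sum); // 1 the first time a tag is seen
        element.setAttribute("class", tag.substring(0, 1) + n);
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            addClasses(child, counts);
        }
    }

    // Helper for the demo: parses a small XML string into a DOM document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<body><p><span>Hola</span></p><p><span>Mundo</span></p></body>");
        addClasses(doc.getDocumentElement(), new HashMap<>());
        Element secondSpan = (Element) doc.getElementsByTagName("span").item(1);
        System.out.println(secondSpan.getAttribute("class")); // prints "s2"
    }
}
```

Checking the node type instead of catching a failed cast keeps real errors visible, and the counter map is passed in so each document starts from a fresh count.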