Parsing Word, PowerPoint and Other Office Files on .NET Without Office (Final Part)

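【Preamble】

This is the last article in the series. So that the series does not feel incomplete, I will cover OOXML here as well. To be honest, OOXML is very easy to parse, and both documentation and mature open-source libraries for it are plentiful, so I will only go over it briefly. The article links to quite a few resources for anyone who wants to dig deeper.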

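【Series Index】

The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on .NET Without Office (Part 1)
Reading the DocumentSummaryInformation and SummaryInformation of Office binary documents

The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on .NET Without Office (Part 2)
Extracting the text of Word binary documents (.doc), including the body, headers, footers, comments and so on

The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on .NET Without Office (Part 3)
A detailed look at the storage structure of Office binary documents, and extracting the text of PowerPoint binary documents (.ppt)

The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on .NET Without Office (Final Part)
How to parse Office Open XML documents (.docx, .pptx), plus common open-source libraries for parsing Office files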

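【In This Article】

A First Look at Office Open XML (OOXML)
Parsing OOXML Document Properties
Parsing Word 2007 Files
Parsing PowerPoint 2007 Files
Open-Source Libraries for Common Office Documents (Word, PowerPoint, Excel)
Related Links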

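【1. A First Look at Office Open XML (OOXML)】

Microsoft's official description of Office Open XML is a good place to start (see http://office.microsoft.com/zh-cn/support/HA010205815.aspx?CTT=3 for details).

As that description makes clear, unlike Windows compound documents, OOXML was open from the start, and because it is based on zip + xml it is much easier to read. If all we want is to extract text, we do not even need to read any of the document's parameters!

If you have not looked at OOXML before, take a docx, pptx or xlsx file you have at hand, change its extension to zip, and open it with an archive tool.

Doing this with a docx, a pptx and an xlsx file shows a clearly laid-out directory structure in each, so all we need to do is read the zip file with a zip library and then parse the XML files inside. With .NET Framework 3.0 or later you can use the built-in Package class (System.IO.Packaging, in WindowsBase.dll) to unpack the file; in my experience, if you only need to read file streams or contents out of the zip, the Package class in WindowsBase works very well. For .NET CF, or for CLR 2.0 and below, you can use SharpZipLib (supports CLR 1.1, 2.0 and 4.0; official site http://www.icsharpcode.net/ ) or DotNetZip (supports CLR 2.0; official site http://dotnetzip.codeplex.com/ ). Personally I find the latter's license friendlier.

For example, opening an OOXML file with the built-in Package class: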


// Needs: using System; using System.IO; using System.IO.Packaging;

#region Fields
protected FileStream m_stream;
protected Package m_package;
#endregion

#region Constructor
/// <summary>
/// Initializes an OfficeOpenXMLFile
/// </summary>
/// <param name="filePath">Path of the file</param>
public OfficeOpenXMLFile(String filePath)
{
    try
    {
        // Open the file read-only and wrap the stream in a Package
        this.m_stream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
        this.m_package = Package.Open(this.m_stream);

        // Everything is read eagerly, so the file can be closed again right away
        this.ReadProperties();
        this.ReadCoreProperties();
        this.ReadContent();
    }
    finally
    {
        if (this.m_package != null)
        {
            this.m_package.Close();
        }

        if (this.m_stream != null)
        {
            this.m_stream.Close();
        }
    }
}
#endregion
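To see the "zip of parts" structure from code rather than from an archive tool, the sketch below lists every part in a package. It is only an illustration: the class name PackageListing and both method names are made up for this example, and the sample package is built in memory as a stand-in for a real .docx. It assumes System.IO.Packaging is available (WindowsBase.dll on .NET Framework; on newer .NET versions it ships as the System.IO.Packaging NuGet package).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Text;

// Hypothetical helper, not part of the article's code base.
public static class PackageListing
{
    // Builds a tiny in-memory package containing a single XML part.
    public static MemoryStream BuildSamplePackage()
    {
        MemoryStream buffer = new MemoryStream();
        using (Package package = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
        {
            Uri uri = new Uri("/word/document.xml", UriKind.Relative);
            PackagePart part = package.CreatePart(uri, "application/xml");
            byte[] xml = Encoding.UTF8.GetBytes("<document/>");
            using (Stream s = part.GetStream(FileMode.Create, FileAccess.Write))
            {
                s.Write(xml, 0, xml.Length);
            }
        }
        // Copy the bytes so the result is usable whether or not the package closed the buffer.
        return new MemoryStream(buffer.ToArray());
    }

    // Returns one "uri (content type)" string per part in the package.
    public static List<string> ListParts(Stream stream)
    {
        List<string> result = new List<string>();
        using (Package package = Package.Open(stream, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                result.Add(part.Uri + " (" + part.ContentType + ")");
            }
        }
        return result;
    }
}
```

Passing a FileStream opened on a real .docx or .pptx to ListParts should show the same directory layout that the archive tool displays, with one entry per XML file.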

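【2. Parsing OOXML Document Properties】

The document properties of an OOXML file live in the docProps directory. Three files matter most:

app.xml: the application-level properties of the document, similar in content to the earlier DocumentSummaryInformation.
core.xml: the core properties of the document, such as creation time and last-modified time, similar in content to the earlier SummaryInformation.
thumbnail.*: the document thumbnail. The format differs by file type, for example emf for Word, wmf for Excel and jpeg for PowerPoint.

To read every property we only need to walk all child nodes of the XML file. For consistency, the property names from Windows compound files are reused here: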


// Needs: using System; using System.Collections.Generic; using System.IO.Packaging; using System.Xml;

#region Constants
private const String PropertiesNameSpace = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";
private const String CorePropertiesNameSpace = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
#endregion

#region Fields
protected Dictionary<String, String> m_properties;
protected Dictionary<String, String> m_coreProperties;
#endregion

#region Properties
/// <summary>
/// Gets the DocumentSummaryInformation
/// </summary>
public override Dictionary<String, String> DocumentSummaryInformation
{
    get
    {
        return this.m_properties;
    }
}

/// <summary>
/// Gets the SummaryInformation
/// </summary>
public override Dictionary<String, String> SummaryInformation
{
    get
    {
        return this.m_coreProperties;
    }
}
#endregion

#region Read Properties
private void ReadProperties()
{
    if (this.m_package == null)
    {
        return;
    }

    // GetPart throws when the part is missing instead of returning null, so check first
    Uri partUri = new Uri("/docProps/app.xml", UriKind.Relative);
    if (!this.m_package.PartExists(partUri))
    {
        return;
    }

    PackagePart part = this.m_package.GetPart(partUri);
    XmlDocument doc = new XmlDocument();
    doc.Load(part.GetStream());

    XmlNodeList nodes = doc.GetElementsByTagName("Properties", PropertiesNameSpace);
    if (nodes.Count < 1)
    {
        return;
    }

    // Each child element of <Properties> becomes one name/value pair
    this.m_properties = new Dictionary<String, String>();
    foreach (XmlElement element in nodes[0])
    {
        this.m_properties.Add(element.LocalName, element.InnerText);
    }
}
#endregion

#region Read CoreProperties
private void ReadCoreProperties()
{
    if (this.m_package == null)
    {
        return;
    }

    // GetPart throws when the part is missing instead of returning null, so check first
    Uri partUri = new Uri("/docProps/core.xml", UriKind.Relative);
    if (!this.m_package.PartExists(partUri))
    {
        return;
    }

    PackagePart part = this.m_package.GetPart(partUri);
    XmlDocument doc = new XmlDocument();
    doc.Load(part.GetStream());

    XmlNodeList nodes = doc.GetElementsByTagName("coreProperties", CorePropertiesNameSpace);
    if (nodes.Count < 1)
    {
        return;
    }

    // Children use several namespaces (dc, dcterms, cp); only the local name is kept
    this.m_coreProperties = new Dictionary<String, String>();
    foreach (XmlElement element in nodes[0])
    {
        this.m_coreProperties.Add(element.LocalName, element.InnerText);
    }
}
#endregion
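The child elements of core.xml come from several namespaces, which is why the listing above keys the dictionary on LocalName. The sketch below runs the same "walk the children of the root" idea against a hand-written sample string instead of a real file. The class name CorePropertiesSketch, the method ReadChildren and the sample values are all invented for illustration; only the namespace URIs are the standard ones.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of the article's code base.
public static class CorePropertiesSketch
{
    public const string CoreNs =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";

    // Hand-written sample resembling a docProps/core.xml file (values are made up).
    public const string SampleCoreXml =
        "<cp:coreProperties xmlns:cp='http://schemas.openxmlformats.org/package/2006/metadata/core-properties'" +
        " xmlns:dc='http://purl.org/dc/elements/1.1/'" +
        " xmlns:dcterms='http://purl.org/dc/terms/'>" +
        "<dc:title>Demo</dc:title>" +
        "<dc:creator>Someone</dc:creator>" +
        "<dcterms:created>2013-01-01T00:00:00Z</dcterms:created>" +
        "</cp:coreProperties>";

    // Collects LocalName -> InnerText for every child element of the named root element.
    public static Dictionary<string, string> ReadChildren(string xml, string rootName, string rootNamespace)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNodeList roots = doc.GetElementsByTagName(rootName, rootNamespace);
        if (roots.Count < 1)
        {
            return result;
        }

        foreach (XmlNode child in roots[0].ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element)
            {
                continue; // skip comments and other non-element nodes
            }
            // The indexer overwrites on a repeated name instead of throwing like Add would
            result[child.LocalName] = child.InnerText;
        }
        return result;
    }
}
```

Calling ReadChildren(SampleCoreXml, "coreProperties", CoreNs) yields the keys title, creator and created, with the namespace prefixes dropped. Using the indexer rather than Add is a deliberate difference from the listing above: it trades an exception on duplicate names for last-value-wins behaviour.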

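【3. Parsing Word 2007 Files】

The main content of a Word file (.docx) lives in the word directory. The most important parts are:

document.xml: the body text of the Word document
footer*.xml: the footers of the Word document
header*.xml: the headers of the Word document
comments.xml: the comments of the Word document
endnotes.xml: the endnotes of the Word document

Here we only read the body text. OOXML stores text in a nested structure: in Word, a <w:p></w:p> pair holds a paragraph, and nested inside the paragraph are <w:t></w:t> elements, which hold the text itself. In addition, <w:tab/> is a tab character, <w:br w:type="page"/> is a page break, and so on, so we need a method that processes these tags recursively: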


// Needs: using System; using System.Text; using System.Xml;

/// <summary>
/// Extracts the text inside a node
/// </summary>
/// <param name="node">XmlNode</param>
/// <returns>The text inside the node</returns>
public static String ReadNode(XmlNode node)
{
    if ((node == null) || (node.NodeType != XmlNodeType.Element)) // node is null or not an element
    {
        return String.Empty;
    }

    StringBuilder nodeContent = new StringBuilder();

    foreach (XmlNode child in node.ChildNodes)
    {
        if (child.NodeType != XmlNodeType.Element)
        {
            continue;
        }

        switch (child.LocalName)
        {
            case "t": // text
                // With xml:space="preserve" the text is kept verbatim; otherwise trailing whitespace is dropped
                String space = ((XmlElement)child).GetAttribute("xml:space");
                nodeContent.Append(space == "preserve" ? child.InnerText : child.InnerText.TrimEnd());
                break;
            case "cr": // carriage return
            case "br": // break (line break by default, page break when w:type="page")
                nodeContent.Append(Environment.NewLine);
                break;
            case "tab": // tab
                nodeContent.Append("\t");
                break;
            case "p": // paragraph
                nodeContent.Append(ReadNode(child));
                nodeContent.Append(Environment.NewLine);
                break;
            default: // anything else: recurse into the children
                nodeContent.Append(ReadNode(child));
                break;
        }
    }

    return nodeContent.ToString();
}

Then we simply start reading from the root element of the body:


// Needs: using System; using System.IO.Packaging; using System.Text; using System.Xml;

#region Constants
private const String WordNameSpace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
#endregion

#region Fields
private String m_paragraphText;
#endregion

#region Properties
/// <summary>
/// Gets the body text of the document
/// </summary>
public String ParagraphText
{
    get { return this.m_paragraphText; }
}
#endregion

#region Read content
protected override void ReadContent()
{
    if (this.m_package == null)
    {
        return;
    }

    // GetPart throws when the part is missing instead of returning null, so check first
    Uri partUri = new Uri("/word/document.xml", UriKind.Relative);
    if (!this.m_package.PartExists(partUri))
    {
        return;
    }

    PackagePart part = this.m_package.GetPart(partUri);
    StringBuilder content = new StringBuilder();
    XmlDocument doc = new XmlDocument();
    doc.Load(part.GetStream());

    // Map the "w" prefix so the XPath below can address WordprocessingML elements
    XmlNamespaceManager nsManager = new XmlNamespaceManager(doc.NameTable);
    nsManager.AddNamespace("w", WordNameSpace);

    XmlNode node = doc.SelectSingleNode("/w:document/w:body", nsManager);

    if (node == null)
    {
        return;
    }

    content.Append(NodeHelper.ReadNode(node));

    this.m_paragraphText = content.ToString();
}
#endregion
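The paragraph and text nesting described in this section can also be seen with a much smaller, XPath-based sketch. Note that this is a different and cruder technique than the recursive ReadNode method: it only concatenates the <w:t> elements of each <w:p> and ignores tabs, breaks and paragraphs nested inside other paragraphs (for example in text boxes). The class name DocxTextSketch, the method Paragraphs and the sample XML are invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the article's code base.
public static class DocxTextSketch
{
    private const string WordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written stand-in for word/document.xml: two paragraphs, the first split over two runs.
    public const string SampleDocumentXml =
        "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'><w:body>" +
        "<w:p><w:r><w:t xml:space='preserve'>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Second</w:t></w:r></w:p>" +
        "</w:body></w:document>";

    // Returns one string per <w:p>, built from the <w:t> elements inside it.
    public static List<string> Paragraphs(string documentXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(documentXml);

        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", WordNs);

        List<string> result = new List<string>();
        foreach (XmlNode paragraph in doc.SelectNodes("//w:p", ns))
        {
            StringBuilder text = new StringBuilder();
            foreach (XmlNode run in paragraph.SelectNodes(".//w:t", ns))
            {
                text.Append(run.InnerText);
            }
            result.Add(text.ToString());
        }
        return result;
    }
}
```

For the sample above this gives two entries, "Hello world" and "Second". The recursive approach remains the better fit once tabs, breaks and other inline elements need to show up in the output.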

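【4. Parsing PowerPoint 2007 Files】

The main content of a PowerPoint file (.pptx) lives in the ppt directory, and the slides themselves are in the slides subdirectory, stored under names of the form slide + page number + .xml. We can read them one by one. Note, however, that string comparison puts "slide10.xml" before "slide2.xml", so reading the parts in name order can scramble the pages. To avoid that, we sort by the numeric page number first and then read each slide from its root element.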


// Needs: using System; using System.Collections.Generic; using System.IO.Packaging; using System.Text; using System.Xml;

#region Constants
private const String PowerPointNameSpace = "http://schemas.openxmlformats.org/presentationml/2006/main";
#endregion

#region Fields
private StringBuilder m_allText;
#endregion

#region Properties
/// <summary>
/// Gets all the text in the PowerPoint slides
/// </summary>
public String AllText
{
    get { return this.m_allText.ToString(); }
}
#endregion

#region Constructor
/// <summary>
/// Initializes a PptxFile
/// </summary>
/// <param name="filePath">Path of the file</param>
public PptxFile(String filePath) :
    base(filePath) { }
#endregion

#region Read content
protected override void ReadContent()
{
    if (this.m_package == null)
    {
        return;
    }

    this.m_allText = new StringBuilder();

    PackagePartCollection col = this.m_package.GetParts();
    // Keyed by slide number so that slide10 sorts after slide2
    SortedList<Int32, XmlDocument> list = new SortedList<Int32, XmlDocument>();

    foreach (PackagePart part in col)
    {
        if (part.Uri.ToString().IndexOf("ppt/slides/slide", StringComparison.OrdinalIgnoreCase) > -1)
        {
            XmlDocument doc = new XmlDocument();
            doc.Load(part.GetStream());

            // "/ppt/slides/slide12.xml" -> "12"
            String pageName = part.Uri.ToString().Replace("/ppt/slides/slide", "").Replace(".xml", "");
            Int32 index = 0;
            Int32.TryParse(pageName, out index);

            list.Add(index, doc);
        }
    }

    foreach (KeyValuePair<Int32, XmlDocument> pair in list)
    {
        // Use the name table of the slide being queried, not of the last slide loaded
        XmlNamespaceManager nsManager = new XmlNamespaceManager(pair.Value.NameTable);
        nsManager.AddNamespace("p", PowerPointNameSpace);

        XmlNode node = pair.Value.SelectSingleNode("/p:sld", nsManager);

        if (node == null)
        {
            continue;
        }

        this.m_allText.Append(NodeHelper.ReadNode(node));
    }
}
#endregion
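The ordering pitfall mentioned at the start of this section is easy to check in isolation. The sketch below is a minimal illustration, not the article's implementation: the class SlideOrder and its methods are hypothetical names, and it works on plain part URI strings so that it runs without a .pptx file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the article's code base.
public static class SlideOrder
{
    // Extracts N from "/ppt/slides/slideN.xml"; returns -1 for anything else.
    public static int SlideNumber(string partUri)
    {
        Match match = Regex.Match(partUri, @"^/ppt/slides/slide([0-9]+)\.xml$", RegexOptions.IgnoreCase);
        int number;
        if (match.Success && int.TryParse(match.Groups[1].Value, out number))
        {
            return number;
        }
        return -1;
    }

    // Keeps only slide parts and orders them by slide number rather than by name.
    public static List<string> SortSlides(IEnumerable<string> partUris)
    {
        List<string> slides = new List<string>();
        foreach (string uri in partUris)
        {
            if (SlideNumber(uri) >= 0)
            {
                slides.Add(uri);
            }
        }
        slides.Sort((a, b) => SlideNumber(a).CompareTo(SlideNumber(b)));
        return slides;
    }
}
```

string.CompareOrdinal("slide10.xml", "slide2.xml") is negative because '1' sorts before '2', which is exactly the problem; SortSlides compares the parsed numbers instead. One caveat worth knowing: the order in which PowerPoint actually presents slides is defined by the slide ID list in ppt/presentation.xml, so file-name order is an approximation that holds for typical files but is not guaranteed.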

Appendix: the complete code for this series can be downloaded from https://files.cnblogs.com/mayswind/DotMaysWind.OfficeReader_4.rar

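【5. Open-Source Libraries for Common Office Documents (Word, PowerPoint, Excel)】

1. NPOI: http://npoi.codeplex.com

In my view this is the best Office document library on .NET, bar none. It provides complete reading and editing of Excel files and currently supports both the binary (.xls) and OOXML (.xlsx) formats. If you have used Apache's Java library POI, NPOI offers an almost identical API. In practice, for ASP.NET, most Office documents that need editing are Excel files, or can be replaced by Excel files, so NPOI covers nearly every need. It already supports docx files, while doc support lives in NPOI.ScratchPad, which you can download from Source Code and compile yourself. Without OOXML support the library is only 1.5 MB, and it supports .NET CLR 2.0 and 4.0.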

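2. Open XML SDK 2.0 for Microsoft Office: http://msdn.microsoft.com/en-us/library/bb448854(office.14).aspx

Microsoft's own Open XML SDK can read and write any OOXML document. It also comes with a tool that opens an Office document and generates the program code that would produce that document using the library. The library is on the large side, though, at about 5 MB, and it requires .NET Framework 3.5.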

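3. Office Binary Translator to Open XML: http://b2xtranslator.sourceforge.net/

I only learned of this library recently, although it has been around for a long time. It converts Windows compound documents (.doc, .ppt, .xls) to the corresponding OOXML formats (.docx, .pptx, .xlsx), and of course you can also use it to get at the content stored in a file. For some reason the site is blocked from here. If you want to study Windows compound documents, this is the library I recommend: NPOI is so complete that stepping through its whole file-reading flow is quite complicated, whereas single-stepping through this library is easy to follow. It splits support for each file type (and for each supported module) into separate projects, so supporting one file type takes only a few hundred KB, and it targets .NET CLR 2.0.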

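4. EPPlus: http://epplus.codeplex.com

Back in 2010, when NPOI did not yet support OOXML, I felt EPPlus was the best library for handling .xlsx files. It is only a few hundred KB, very lightweight, and for reading the zip file it chose neither SharpZipLib nor DotNetZip. Older versions needed only .NET Framework 3.0; I just checked, and the new version requires .NET Framework 3.5.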

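5. ExcelDataReader: http://exceldatareader.codeplex.com

Another very lightweight and handy library, which reads both .xls and .xlsx. I used it before EPPlus; I no longer remember what problem made me switch to EPPlus, nor whether that problem has since been fixed. Its advantage is that it needs only .NET CLR 2.0 and supports .NET CF, although there is no longer any need to develop Windows Mobile applications.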

【6. Related Links】

1. OpenXMLDeveloper.org: http://openxmldeveloper.org
2. How to: Retrieve Paragraphs from an Office Open XML Document: http://msdn.microsoft.com/zh-cn/library/bb669175.aspx
3. How to Manipulate Office Open XML Format Documents: http://www.microsoft.com/china/msdn/library/office/office/howManipulateOfficexml.mspx
4. How Do I... (Open XML SDK): http://msdn.microsoft.com/zh-cn/library/bb491088.aspx

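【Afterword】

This is finally the last article, and the series ends here. Thank you all for your support; I have at last fulfilled a wish from two years ago. Honestly, I did not expect the first article to get so many visits and recommendations, since only a minority of people ever need to parse Office documents. I hope these four articles serve as a starting point: they should give you at least a basic understanding of Office documents and make it much easier to go deeper later, which is why I wanted to write all of this down.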

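Category: C#

Tags: .NET, C#, Office Open XML, file text parsing and reading

Author: Leo_wl

Source: http://www.cnblogs.com/Leo_wl/

The copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a link to the original must be placed prominently on the article page; otherwise the right to pursue legal liability is reserved.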
