    admin (W3China webmaster) wrote:

    Export a Word Document to XML (an article from MSDN)


    Kevin McDowell
    Microsoft Corporation

    May 2001

    Applies to:
       Microsoft® Word 2000 and Microsoft Word 2002

    Summary: This solution allows you to export a Word document to an XML file. (12 printed pages)

    Download ODC_ExpWordToXML.exe.

    Contents
    The Reasons
       The Case for Styles
       Querying
       Reusing Documents
    The Solution
       The Mechanism
       The Options
          Properties
          Styles
          Graphics
       Additional Files
    The Roadblocks
       The Style Object
       Hyperlinks
       Graphics
       Tables
       Lists
    XML Output
    Conclusion

    The Reasons
    Microsoft Word documents are not usually thought of as a data source in the traditional sense. However, if you author a document in Word with an eye towards converting it to XML, you can turn the document into a data source that can be easily queried and reused. XML is an ideal format because XML is data centric, XML is easily manipulated and displayed, and XML is accessible programmatically.

    The Case for Styles
    Converting any data to XML requires parsing the data and tagging it with descriptors. Within a Word document, text and hyperlinks are already tagged by their formatting. Most documents contain multiple structural elements, such as headings, bylines, footnotes, and quotations, and all types of formatting can be applied to indicate what those elements are. For example, most headings are not the same size, weight, or even font as paragraph-level text.

    Within a Word document, you alter text by one of two methods: by applying a style or by applying formatting manually. A style in Word is nothing more than a named set of specific instructions describing the formatting to apply. When you apply a style, you are basically tagging that text as something: a heading, a subheading, a code block, a quotation, or some other document element. When you apply formatting manually, you are tagging that text as something special, but that something is not defined. If you were to attempt to parse a manually formatted document, you would know how the text appears in the document, but you wouldn't know what the text is. However, if you apply formatting only through styles, then when you parse the document you know not only how the text appears but also, from the style name, what the text is.

    Creating a document in this manner requires that you know what your formatting represents. Instead of making text bold for emphasis, you apply a style that not only bolds the text but also describes why the text is bold to begin with. For example, suppose you are creating a document and type out a quotation. Rather than applying italic formatting to the quotation to highlight it, you should create a style called "Quotation" that includes italic formatting.

    Querying
    After you author a document by using styles and then convert it to XML, it becomes a queryable data source. If you have a folder of XML documents, it is essentially a database. Using the FileSystemObject object in the Microsoft Scripting Runtime object model to loop through all of the files in the folder, you could apply an Extensible Stylesheet Language (XSL) query to pull out all of the headings, author information, quotations, or whatever you want, from each of the XML articles. Possible uses for this approach are to create synopses of articles, create a code library from developer articles, or retrieve all of the references in an article set.
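
    The folder-as-database idea can be sketched as follows. This is a minimal illustration in Python rather than the FileSystemObject and XSL approach the article describes. The element name <StyledText> comes from the XML template shown later in this article, but the `style` attribute is an assumption made for the example; the article does not list the attribute names its converter writes.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_styled_text(folder, style_name):
    """Return (file name, text) pairs for every <StyledText> with the given style.

    Assumes a hypothetical `style` attribute on <StyledText>.
    """
    results = []
    for path in sorted(Path(folder).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        for node in root.iter("StyledText"):
            if node.get("style") == style_name:
                results.append((path.name, (node.text or "").strip()))
    return results


# Build a tiny sample folder so the sketch runs on its own.
SAMPLE = (
    "<Document><DocumentBody><Paragraph>"
    '<StyledText style="Heading 1">{title}</StyledText>'
    '<StyledText style="Quotation">{quote}</StyledText>'
    "</Paragraph></DocumentBody></Document>"
)

with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.xml").write_text(
        SAMPLE.format(title="Intro", quote="To be"), encoding="utf-8")
    Path(folder, "b.xml").write_text(
        SAMPLE.format(title="Usage", quote="Or not"), encoding="utf-8")
    headings = collect_styled_text(folder, "Heading 1")

print(headings)
```

    Swapping "Heading 1" for "Quotation" pulls the quotations instead, which is the kind of per-style query the paragraph above has in mind.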

    Reusing Documents
    The best document is one that can be used multiple times in multiple ways. When you create a document in Word, to reuse it, you usually copy and paste sections of text, reformat the text, resave the document with a different file name, or republish the document as a Hypertext Markup Language (HTML) file.

    With a document saved in XML format, you can build different documents from the original document using XSL transformations. XSL allows you to pick and choose which pieces of data in the XML document you want to display. A full discussion of XSL is outside the scope of this article, but there are many great resources discussing XSL on MSDN. The solution described later in this article allows you to output an HTML file that uses an XSL transformation of your XML source.

    The Solution
    Now that you know some of the reasons to put Word documents into an XML format, let me describe how to do it. I will describe how I did it, what options I discovered along the way, and some of the problem areas I encountered. Finally, I will list a template of the XML that is output by my solution.

    The Mechanism
    The goal of this solution is to convert a Word document into a well-formed XML document. As noted above, the most logical way to do this is to tag data in a Word document by using styles. An overly simplified description of how this solution works is that it parses through the document paragraph by paragraph and identifies text with styles applied, and then tags that text. The complications with this approach are discussed later in this article.

    The following table outlines the object library references needed for this solution:

    Control or object model reference | Implementation
    Microsoft Office 9.0 Object Library | Used to handle graphics.
    Microsoft Word 9.0 Object Library | Used for document access.
    Microsoft Excel 9.0 Object Library | Used for sorting arrays.
    Microsoft Scripting Runtime | Used to write out the XML, XSL, and HTML files.

    The following table outlines the custom classes used in this solution:

    Class or Form | Description
    frmWizard | This form contains all of the graphical elements for this solution. Each step is contained in a separate frame control.
    XMLConverter | This object stores all of the options that are set on the forms and contains all the methods directly involved in converting the document to XML.
    StyleInstance | This object contains information about a specific instance of a style in the document.
    StyleInstances | This object contains a collection of StyleInstance objects.
    DocumentStyleInformation | This object contains a StyleInstances collection and the function to retrieve a StyleInstances collection.
    GraphicInstance | This object contains information about a specific instance of a graphic in the document.
    GraphicInstances | This object contains a collection of GraphicInstance objects.
    DocumentGraphicInformation | This object contains a GraphicInstances collection and the function to retrieve a GraphicInstances collection.

    The following table shows the primary functions (and their parents) used in this solution:

    Function | Description | Also calls
    ConvertToXML | Parent: XMLConverter. This function handles the overall build process of the XML output. | ParseParagraph, CleanString, WriteShape, WriteInlineShape, GetCSSArray, ParseTable
    ParseParagraph | Parent: XMLConverter. This function handles individual paragraphs that are not within list paragraphs or tables. | CleanString, WriteShape, WriteInlineShape, ParseForGraphics
    CleanString | Parent: XMLConverter. This function removes nonprinting characters from strings, and replaces characters that cause problems in XML with their appropriate escape codes. | (none)
    WriteShape | Parent: XMLConverter. This function writes an XML string that represents a Shape object. This function is only called when the user selects to include graphics in a separate list. | (none)
    WriteInlineShape | Parent: XMLConverter. This function writes an XML string representing an InlineShape object. This function is only called when the user selects to include graphics in a separate list. | (none)
    ParseForGraphics | Parent: XMLConverter. This function parses a Range passed to it that represents Shape or InlineShape objects. This function is only called when graphics are written out inline. | WriteGraphicInfo
    WriteGraphicInfo | Parent: XMLConverter. This function writes an XML string that represents a GraphicInstance object. This function is only called when graphics are written out inline. | (none)
    GetCSSArray | Parent: XMLConverter. Using the HTML Document Object Model, this function gets a list of the Cascading Style Sheet (CSS) formatting associated with each style in the document. | (none)
    ParseTable | Parent: XMLConverter. This function handles tables. | ParseCell
    ParseCell | Parent: XMLConverter. This function parses individual cells within a table. This function calls the ParseTable function when a cell contains a nested table. The function calls the ParseParagraph function for each paragraph within a cell. | ParseParagraph, ParseTable
    GetStyleInformation | Parent: DocumentStyleInformation. This function populates the StyleInstances collection of the DocumentStyleInformation object. | ExcelSort
    GetGraphicInformation | Parent: DocumentGraphicInformation. This function populates the GraphicInstances collection of the DocumentGraphicInformation object. | ExcelSort
    ExcelSort | This function uses Microsoft Excel to sort an array by the specified field. This function is used to sort the style and graphics in the order they appear in the document. | (none)
    WriteXSL | This routine creates a skeleton XSL file based on the XML file structure created by this application. The XSL file is stored in the same location and with the same name as the XML file, but with an .xsl extension. | (none)
    WriteHTM | This routine creates an HTML file that uses script to write out the contents of the XML file based on the associated XSL file. The HTML file is stored in the same location and with the same name as the XML file, but with an .htm extension. | (none)
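
    To make the CleanString row above concrete, here is a minimal Python sketch of the same idea: drop nonprinting characters, then escape the characters that are unsafe in XML. It is not the article's VBA routine; the function name `clean_string` and the choice to keep tabs and newlines are assumptions made for the example.

```python
import unicodedata
from xml.sax.saxutils import escape


def clean_string(text):
    """Drop nonprinting characters, then escape XML-special characters."""
    printable = "".join(
        ch for ch in text
        # Keep tab and newline; drop every other control/format character.
        if ch in "\t\n" or unicodedata.category(ch)[0] != "C"
    )
    # escape() handles &, < and > by default; quotes are added so the
    # result is also safe inside attribute values.
    return escape(printable, {'"': "&quot;", "'": "&apos;"})


cleaned = clean_string('Tom & "Jerry" <b>\x07\r')
print(cleaned)
```

    Escaping happens after the filtering step so that the ampersands introduced by the escape codes are never themselves re-escaped.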

    The Options
    The structure of a Word document creates several interesting parsing options. The first option is the granularity with which the document is parsed. I give the user the option to export the document only, or to export an enhanced XML document. If the first option is chosen, all the text is written out in <Paragraph> elements. The second option creates a much more useful document. With an enhanced document, the options all revolve around the properties, styles, and graphics contained in the document. Additionally, you have the option to output an HTML and XSL file.

    Properties
    Users may want to know not just about the data in the document, but also data about the document. The document properties options let the user choose among the following:

       No document properties.
       All document properties (both built-in and custom).
       Only built-in document properties.
       Only custom document properties (user-defined properties).

    Styles
    Users have the option of deciding to what level styles are handled:

       Paragraph and character styles are handled: all style formatting is represented in the XML document, including tables and lists.
       Only paragraph level styles are handled: only the style applied to the paragraph is exported. Styles applied to text within the paragraphs are ignored. Tables and lists are not exported as anything but paragraphs.

    Additionally, since we are using styles to tag information in the document, it may be useful to know what formatting each style represents. Users can choose from the following options:

       List only styles actually in use in the document (this option is faster).
       List all styles available in the document.

    Graphics
    If pictures are worth a thousand words, then they should also be converted to XML. However, unless the picture is an external file being linked to, there is no way to take the binary information that represents the picture and convert it to XML. Microsoft clipart pictures can be converted to a Vector Markup Language (VML) format, but that is outside the scope of this project. Even if the picture can't be converted, users may want to see a placeholder for the graphic, so I provide the following options:

       No graphic information is written out.
       All graphic information is written out inline.
       Only linked graphics are written out inline.
       All graphics are written out in a separate list (not inline).

    Additional Files
    When you choose to output an HTML and XSL file, the solution builds an HTML wrapper for the XML file. The HTML file contains script that applies the XSL file to the XML file. The XSL is very basic but should give you a good starting point for building a more elaborate XSL transformation.

    The Roadblocks
    When I first undertook this project, I thought it would be even easier than converting an Access table or Excel data to XML. I was wrong. The intricacies and limitations of the Word object model make this a very interesting exercise. The biggest areas of concern were the Style object, graphics, tables, and lists.

    The Style Object
    The Style object in the Word object model represents a style definition, not an instance of the style in the document. While the searching mechanism in Word is powerful and lets you search for instances of a style, it does not let you search for the next instance of any style - only a specific style. When you are parsing out a document sequentially, knowing when the next instance of a specific style occurs is useless because you may have passed over an instance of another style.

    To get around this problem, I created three custom classes: the StyleInstance object, the StyleInstances collection, and the DocumentStyleInformation object. The StyleInstance object has six properties, outlined in the following table:

    Property | Description
    StartPosition | Start position of the style within the document. Retrieved from the Start property (Range object).
    EndPosition | End position of the style within the document. Retrieved from the End property (Range object).
    StyleName | Name of the style. Retrieved from the NameLocal property (Style object).
    StyleType | The style type: character or paragraph. Retrieved from the Type property (Style object).
    Target | Hyperlink address. Only written when the text is formatted with the Hyperlink style.
    Index | The index of the StyleInstance within the StyleInstances collection.

    The DocumentStyleInformation object has two important pieces: a StyleInstances collection and the GetStyleInformation function. The function is simple: it loops through every style in the document style list and searches for instances of the style within the document. If it finds that style, it adds the start and end position, the style name, the style type, and the target address (if it is hyperlinked) to an internal array. After it has looped through the entire style list, the array is sorted by the start position and then placed into the StyleInstances collection. The result is a sequential list of all instances where the styles have been applied. When you then parse the document sequentially, you can access your styles in sequential order.
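
    The collect-then-sort step described above can be sketched as follows. This is a Python model, not the article's VBA: the `found_by_style` dictionary is a hypothetical stand-in for the results of Word's per-style search, and the built-in list sort replaces the Excel-based ExcelSort helper the article uses.

```python
from dataclasses import dataclass


@dataclass
class StyleInstance:
    start_position: int
    end_position: int
    style_name: str
    style_type: str      # "paragraph" or "character"
    target: str = ""     # hyperlink address, only for the Hyperlink style
    index: int = 0       # position within the sorted collection


def get_style_information(found_by_style):
    """Flatten per-style search results into one list ordered by start position.

    `found_by_style` maps (style name, style type) to the
    (start, end, target) ranges found for that style.
    """
    rows = [
        StyleInstance(start, end, name, kind, target)
        for (name, kind), ranges in found_by_style.items()
        for start, end, target in ranges
    ]
    rows.sort(key=lambda inst: inst.start_position)
    for i, inst in enumerate(rows, start=1):
        inst.index = i
    return rows


found = {
    ("Quotation", "paragraph"): [(120, 180, "")],
    ("Heading 1", "paragraph"): [(0, 12, ""), (200, 215, "")],
    ("Hyperlink", "character"): [(40, 52, "http://example.com")],
}
instances = get_style_information(found)
print([(i.index, i.style_name, i.start_position) for i in instances])
```

    Searching style by style yields results grouped by style; sorting by start position is what turns them into the document-order list that the sequential parser needs.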

    Hyperlinks
    It is possible in Word to set the style of hyperlinked text to something other than Hyperlink, or even change hyperlinked text to multiple styles. Because XML must be well-formed, it is easier to handle a hyperlink as one consecutive text stream, rather than try to break it into various style pieces. For this solution, hyperlinks are only handled if they have the Hyperlink style applied. When text with the Hyperlink style is encountered, the hyperlink's address is written as an attribute of the element.

    Graphics
    Graphics in Word documents pose a problem similar to that of styles. Pictures in Word documents are contained in two types of objects: Shape or InlineShape objects. If you sequentially loop through a collection of Shape objects, you could pass over instances of InlineShape objects, and vice versa.

    This problem was handled in the same way as styles. I created three custom classes to handle all of the graphics: the GraphicInstance object, the GraphicInstances collection, and the DocumentGraphicInformation object. The GraphicInstance object has nine properties, outlined in the following table:

    Property | Description
    WordShapeType | 1 for Shape objects, 2 for InlineShape objects.
    StartPosition | Start position of the graphic within the document. Retrieved from the Start property (Anchor object) for Shape objects and the Start property (Range object) for InlineShape objects.
    EndPosition | End position of the graphic within the document. Retrieved from the End property (Anchor object) for Shape objects and the End property (Range object) for InlineShape objects.
    GraphicType | The type of Shape or InlineShape object; maps to a wdShapeType or wdInlineShapeType constant value.
    LinkedPath | The path to the external file represented by the graphic. Only written for linked objects.
    GraphicSubType | The AutoShape type; maps to a msoAutoShapeType constant value. Only written if the Shape object is an AutoShape.
    OLEClass | The OLE class type.
    Target | The target address if the graphic is a hyperlink.
    Index | The index of the GraphicInstance in the GraphicInstances collection.

    The DocumentGraphicInformation object is similar to the DocumentStyleInformation object. It contains a GraphicInstances collection and the GetGraphicInformation function. The GetGraphicInformation function builds an array of information about all the Shape and InlineShape objects, sorts the array by the start position, and then places the array into the GraphicInstances collection.

    One quirk with Shape and InlineShape objects is that they don't always return the same value for the Hyperlink property if they aren't hyperlinks. When an InlineShape object is not a hyperlink, the Hyperlink property returns Nothing, which is easy enough to check. However, when a Shape object is not a hyperlink, the Hyperlink property returns a Hyperlink object, but accessing any properties of the Hyperlink object returns "Run-time error 4198: Command failed." In the GetGraphicInformation routine, this error is trapped and handled appropriately.

    The DocumentGraphicInformation object stores information not only about the pictures in a document, but also about any embedded or linked ActiveX® object. Since the primary use of the Shape and InlineShape objects is to represent pictures, I used "Graphic" in naming these objects, but you could just as easily use OLE or ActiveX in place of "Graphic".

    You may have also noticed that the DocumentStyleInformation and DocumentGraphicInformation objects are very similar. It is possible to put all of the information into a single object, but to make it clearer what type of object I was working with, I wanted to separate the classes. Since a picture can occur in the middle of a block of styled text, it is also easier to parse with two separate classes, rather than one.
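
    Keeping styles and graphics in two separately sorted collections still allows a single pass in document order. The following Python sketch shows one way to interleave them; the dictionary fields and the function name are assumptions for illustration, and the article's VBA code may consume the two collections differently.

```python
import heapq


def walk_in_document_order(styles, graphics):
    """Merge two lists, each already sorted by start position, into one stream."""
    tagged_styles = (("style", item) for item in styles)
    tagged_graphics = (("graphic", item) for item in graphics)
    # key= keeps heapq.merge from ever comparing the dictionaries themselves.
    return list(heapq.merge(tagged_styles, tagged_graphics,
                            key=lambda pair: pair[1]["start"]))


styles = [{"start": 0, "name": "Heading 1"}, {"start": 90, "name": "Quotation"}]
graphics = [{"start": 30, "kind": "InlineShape"}, {"start": 150, "kind": "Shape"}]
order = [(kind, item["start"])
         for kind, item in walk_in_document_order(styles, graphics)]
print(order)
```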

    Tables
    I initially created the parsing routine (ConvertToXML) to walk through the document paragraph by paragraph. The first thing that the paragraph loop does is check to see if the range of the current paragraph contains a table. If it does, it sets a flag (blnResumeParagraphProcessing) not to parse out the following ranges as paragraphs, but rather table cells. Once it is determined that a range is no longer part of a table, it reverts to processing ranges as paragraphs. The downside of this mechanism is that it only parses top-level tables, not nested tables.

    A quirk within the Word object model makes it very difficult to handle nested tables. When you check a table cell to see if it contains a nested table, using the following syntax, it returns 0:

    ThisDocument.Tables(1).Cell(1,1).Tables.Count

    However, when you check the range contained within that cell for nested tables, using the following syntax, it always returns 1:

    ThisDocument.Tables(1).Cell(1,1).Range.Tables.Count

    There is no way to use the Range object to determine if there is a nested table, because the Range object of a cell always contains the parent table. Since the top-level parsing mechanism first checked the range to see if it contained a table, this proved to be troublesome. I got around this limitation by using two functions, ParseTable and ParseCell. The top-level parser determines that a table has been encountered, and then calls the ParseTable function. The ParseTable function then loops through each cell and writes out the appropriate tags to wrap each cell, and then calls the ParseCell function to handle the data within the cell. The ParseCell function loops through each paragraph object in a cell and then calls the ParseParagraph function to write it out. If the ParseParagraph function encounters a table, it in turn calls the ParseTable function.
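
    The mutual recursion between ParseTable, ParseCell, and ParseParagraph can be sketched as follows. This is a Python model under stated assumptions, not the article's VBA: the `Table` class and the nested-list data layout are hypothetical stand-ins for Word's Table, Cell, and Paragraph objects, and the element names follow the XML template given later in this article.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Table:
    # rows -> cells -> items; an item is a str (paragraph text) or a Table.
    rows: list = field(default_factory=list)


def parse_paragraph(item):
    """Write a paragraph, handing off to parse_table for a nested table."""
    if isinstance(item, Table):
        return "<Paragraph>" + parse_table(item) + "</Paragraph>"
    return "<Paragraph><Text>" + escape(item) + "</Text></Paragraph>"


def parse_cell(cell):
    """Wrap every item of one cell in <Cell> tags."""
    return "<Cell>" + "".join(parse_paragraph(item) for item in cell) + "</Cell>"


def parse_table(table):
    """Wrap every row and cell of one table, recursing through parse_cell."""
    rows = "".join(
        "<Row>" + "".join(parse_cell(cell) for cell in row) + "</Row>"
        for row in table.rows
    )
    return "<Table>" + rows + "</Table>"


inner = Table(rows=[[["x"]]])                      # one row, one cell
outer = Table(rows=[[["a & b", inner], ["c"]]])    # one row, two cells
xml_out = parse_table(outer)

# Parsing the result confirms it is well-formed and holds one nested table.
nested_count = len(ET.fromstring(xml_out).findall(".//Table"))
print(xml_out)
print(nested_count)
```

    Because each function only opens and closes its own element, the output stays well-formed however deeply tables are nested.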

    There are two caveats to the way tables are handled by this solution. First, styles applied within the table are ignored. Second, lists and graphics nested in a table are also ignored. It would not be difficult to handle these cases, but they fall outside the scope of this project. To handle them, you would need to pull the loop that parses paragraphs (marked by the comment "Handles paragraph XML output") into a separate function, and then call the new function from within the ParseCell function.

    Lists
    Lists in Word documents present a challenge similar to that of tables. Each item in a list is a Paragraph object. You determine whether the paragraph is part of a list by using the ListParagraphs property of the Range object:

    If ThisDocument.Paragraphs(1).Range.ListParagraphs.Count > 0 Then
    ...
    End If

    This routine also sets the blnResumeParagraphProcessing flag. When a paragraph contains a list item, the routine stops processing as text paragraphs, and processes the list. The portion of the routine that handles the list also includes the type of list as an attribute of the list element. The value of the attribute maps to the wdListType constant. If the list contains outline indenting, the level of indenting is also written out in the level attribute.
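
    The switch between paragraph processing and list processing can be sketched as follows. This is a Python model, not the article's VBA. The <List> element and the level attribute come from the article; the `type` attribute name, the <ListItem> element, and the dictionary fields are assumptions made for the example, and `list_type` simply holds an integer in the style of a wdListType value.

```python
from itertools import groupby
from xml.sax.saxutils import escape, quoteattr


def write_body(paragraphs):
    """Group consecutive list paragraphs into one <List>; others become <Paragraph>."""
    out = []
    for list_type, group in groupby(paragraphs, key=lambda p: p.get("list_type")):
        if list_type is None:
            # Ordinary text paragraphs.
            out.extend(
                "<Paragraph><Text>" + escape(p["text"]) + "</Text></Paragraph>"
                for p in group
            )
        else:
            # A run of list paragraphs: one <List>, one item per paragraph.
            items = "".join(
                "<ListItem level=" + quoteattr(str(p.get("level", 1))) + ">"
                + escape(p["text"]) + "</ListItem>"
                for p in group
            )
            out.append("<List type=" + quoteattr(str(list_type)) + ">" + items + "</List>")
    return "".join(out)


paragraphs = [
    {"text": "Intro"},
    {"text": "one", "list_type": 2, "level": 1},
    {"text": "two", "list_type": 2, "level": 2},
    {"text": "Outro"},
]
body = write_body(paragraphs)
print(body)
```

    Grouping consecutive paragraphs by list type plays the role of the blnResumeParagraphProcessing flag: paragraph output pauses while a run of list items is written and resumes when the run ends.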

    XML Output
    The XML output by this application is very straightforward and very similar to the HTML output by Word itself, but it fully accounts for all styled text, tables, and lists. Listed below is an XML representation of this structure, without any data included:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <Document>
        <DocumentProperties>
            <BuiltInProperties>
                <Property/>
            </BuiltInProperties>
            <CustomProperties>
                <Property/>
            </CustomProperties>
        </DocumentProperties>
        <DocumentStyles>
            <Style/>
        </DocumentStyles>
        <DocumentBody>
            <Paragraph>
                <Text/>
                <StyledText/>
                <InlineShape/>
                <Shape/>
                <Table>
                    <Row>
                        <Cell>
                            <Paragraph>
                                <Text/>
                                <StyledText/>
                                <Table/>
                            </Paragraph>
                        </Cell>
                    </Row>
                </Table>
                <List/>
            </Paragraph>
        </DocumentBody>
    </Document>

    Conclusion
    This solution provides a starting point to build an XML parser for Word documents. In addition to the XML functionality, it shows how to build custom objects to handle sequential instances of all styles and graphics and how to loop through tables and lists. Remember, documents shouldn't be converted to XML merely for the sake of putting them in XML. The best document to convert to XML is one that makes use of styles and will be reused in other ways.


    Posted: 2005/2/24 0:14:00
     
    qltouming wrote:
    Whoa, boss, is there a version that isn't in English?
    Posted: 2005/4/28 14:10:00
     
    cxh0926 wrote:
    Bump. I support this! It would be nice if it were in Chinese, though.
    Posted: 2005/5/5 22:16:00
     
    cxh0926 wrote:
    Bump. I support this!
    Posted: 2005/5/5 22:17:00
     
    zx516 wrote:
    Is there a Chinese version?
    I can't quite follow this!!
    Posted: 2005/5/17 13:11:00
     
    blackhorseyyz wrote:
    I can't understand it!
    I'm new here, so I'll read around first.
    Posted: 2005/5/22 17:13:00
     
    多俘双轨制 wrote:
    I can't understand it either.
    Posted: 2005/7/5 16:26:00
     