Extracting XML document content with Java + dom4j.jar

Updated: 2019-08-30 10:27:12   Author: 静远小和尚

This article shares working code for extracting the content of XML documents with Java and dom4j.jar. The details are as follows.


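This example extracts content from XML documents mainly by means of several traversal operations. It is not the most efficient approach; the main purpose is to practice using those traversal operations.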

Content of the XML document:

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd"> 
<nitf version="-//IPTC//DTD NITF 3.3//EN" change.time="19:30" change.date="June 10, 2005">
<head>
<title>An End to Nuclear Testing</title>
<meta name="publication_day_of_month" content="7"/> 
<meta name="publication_month" content="7"/> 
<meta name="publication_year" content="1993"/> 
<meta name="publication_day_of_week" content="Wednesday"/> 
<meta name="dsk" content="Editorial Desk"/> 
<meta name="print_page_number" content="14"/> 
<meta name="print_section" content="A"/> 
<meta name="print_column" content="1"/>
<meta name="online_sections" content="Opinion"/>
<docdata>
<doc-id id-string="619929"/>
<doc.copyright year="1993" holder="The New York Times"/>
<identified-content>
<classifier type="descriptor" class="indexing_service">ATOMIC WEAPONS</classifier> 
<classifier type="descriptor" class="indexing_service">NUCLEAR TESTS</classifier> 
<classifier type="descriptor" class="indexing_service">TESTS AND TESTING</classifier> 
<classifier type="descriptor" class="indexing_service">EDITORIALS</classifier> 
<person class="indexing_service">CLINTON, BILL (PRES)</person> 
<classifier type="types_of_material" class="online_producer">Editorial</classifier> 
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion</classifier> 
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion/Opinion</classifier> 
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion/Opinion/Editorials</classifier> 
<classifier type="general_descriptor" class="online_producer">Nuclear Tests</classifier> 
<classifier type="general_descriptor" class="online_producer">Atomic Weapons</classifier> 
<classifier type="general_descriptor" class="online_producer">Tests and Testing</classifier> 
<classifier type="general_descriptor" class="online_producer">Armament, Defense and Military Forces</classifier>
 
</identified-content> 
</docdata> 
<pubdata name="The New York Times" unit-of-measure="word" item-length="390" ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9F0CEFDF1439F934A35754C0A965958260" date.publication="19930707T000000"/>
</head>
<body>
<body.head>
<hedline>
<hl1>An End to Nuclear Testing</hl1>
</hedline>
</body.head>
<body.content>
<block class="lead_paragraph">
<p>For nearly half a century, test explosions in the Nevada desert were a reverberating reminder of cold war insecurity. Now the biggest worry is nuclear proliferation, not the Soviet threat. That's why President Clinton has quietly decided to extend the moratorium on tests of nuclear arms for at least 15 months.</p> 
<p>To persuade nuclear have-nots to stay out of the bomb-making business, it makes more sense to halt testing and try to get others to do likewise than to conduct more demonstrations of America's deterrent power.</p>
 
</block>
<block class="full_text">
<p>For nearly half a century, test explosions in the Nevada desert were a reverberating reminder of cold war insecurity. Now the biggest worry is nuclear proliferation, not the Soviet threat. That's why President Clinton has quietly decided to extend the moratorium on tests of nuclear arms for at least 15 months.</p>
<p>To persuade nuclear have-nots to stay out of the bomb-making business, it makes more sense to halt testing and try to get others to do likewise than to conduct more demonstrations of America's deterrent power.</p> 
<p>Not that nuclear wannabes will necessarily follow America's lead. Nor will an end to all testing assure an end to bomb-making; states like Pakistan have developed nuclear devices without testing them first.</p>
<p>But calling a halt to U.S. nuclear testing makes it easier for leaders in Russia and France to extend the moratoriums they are now observing and improve the atmosphere for prompt negotiation of a treaty to ban all tests.</p>
<p>That test ban in turn should shore up international support for the 1968 Nonproliferation Treaty, linchpin of efforts to stop the spread of nuclear arms, when it comes up for review in 1995. It will also bolster the backing for tighter controls on exports used in bomb-making.</p>
<p>Mr. Clinton has taken three helpful steps. He has extended the Congressionally mandated moratorium on U.S. tests that was due to expire last week. He has declared that the U.S. will not test unless another nation does so first. And he wants to negotiate a total ban on testing.</p>
<p>But the President also wants the nuclear labs to be prepared for a prompt resumption of warhead safety and reliability tests. This could cost millions of dollars and doesn't make much sense, since in Mr. Clinton's own words, "After a thorough review, my Administration has determined that the nuclear weapons in the United States' arsenal are safe and reliable."</p>
<p>Moreover, preparations for testing can take on a life of their own: 30 years after the Limited Test Ban Treaty put an end to above-ground tests, the U.S. still spends $20 million a year on Safeguard C, a program to keep test sites ready.</p>
<p>American security no longer rests on that sort of eternal nuclear vigilance. Mr. Clinton's moratorium may make America safer than all the tests and preparations for tests that the nuclear labs can dream up.</p>
 
</block>
</body.content>
</body>
</nitf>
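The fields of interest in this sample (title, document id, lead paragraph, full text) each sit at a fixed location in the tree, so each can be addressed by an XPath expression. The following is only a sketch of that idea and is not the author's method: it uses the XPath support built into the JDK (javax.xml.xpath) instead of dom4j so that it runs without extra jars. The class name NitfXPathSketch, its method names, and the shortened SAMPLE string are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original article.
final class NitfXPathSketch {

    // Shortened stand-in for the sample document (no DOCTYPE, so no DTD is fetched).
    static final String SAMPLE =
          "<nitf><head><title>An End to Nuclear Testing</title>"
        + "<docdata><doc-id id-string=\"619929\"/></docdata></head>"
        + "<body><body.content>"
        + "<block class=\"lead_paragraph\"><p>First.</p></block>"
        + "<block class=\"full_text\"><p>First.</p><p>Second.</p></block>"
        + "</body.content></body></nitf>";

    // Parses an XML string into a W3C DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
    }

    // Evaluates an XPath expression and returns its string value.
    static String text(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("bad XPath: " + expression, e);
        }
    }

    // Joins the text of every <p> inside the <block> with the given class.
    static String joinParagraphs(Document doc, String blockClass) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList paragraphs = (NodeList) xpath.evaluate(
                    "//block[@class='" + blockClass + "']/p", doc, XPathConstants.NODESET);
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                if (i > 0) {
                    joined.append(' ');
                }
                joined.append(paragraphs.item(i).getTextContent());
            }
            return joined.toString();
        } catch (Exception e) {
            throw new IllegalStateException("cannot read block " + blockClass, e);
        }
    }
}
```

For instance, text(doc, "/nitf/head/title") addresses the title and text(doc, "//doc-id/@id-string") addresses the document id, while joinParagraphs(doc, "full_text") concatenates the body paragraphs with single spaces.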

Extraction code:

To process many files, first traverse the directory and store every file path in a list, then process the paths in that list one by one.

// SearchFile.java
package com.njupt.ymh;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import edu.princeton.cs.algs4.In; // from algs4.jar; used only by the demo in main

/**
 * Returns the list of file paths under a directory.
 * @author 11860
 */
public class SearchFile {

    /**
     * Recursively collects the absolute paths of all files under a directory.
     * @param directoryPath  directory to traverse
     * @param isAddDirectory whether the paths of subdirectories are also added to the list
     * @return the collected paths (empty if the path is a file or does not exist)
     */
    public static List<String> getAllFile(String directoryPath, boolean isAddDirectory) {
        List<String> list = new ArrayList<String>(); // collected file paths
        File baseFile = new File(directoryPath);     // current directory

        if (baseFile.isFile() || !baseFile.exists())
            return list;

        File[] files = baseFile.listFiles(); // children of the current directory
        if (files == null)                   // listFiles() returns null on an I/O error
            return list;

        for (File file : files) {
            if (file.isDirectory()) {
                if (isAddDirectory)
                    list.add(file.getAbsolutePath()); // full path of the subdirectory

                list.addAll(getAllFile(file.getAbsolutePath(), isAddDirectory));
            } else {
                list.add(file.getAbsolutePath());
            }
        }
        return list;
    }

    public static void main(String[] args) {
        List<String> listFile = SearchFile.getAllFile("E:\\huadai", false);
        System.out.println(listFile.size());
        if (listFile.size() < 5) // the demo below reads the 4th and 5th entries
            return;

        File file = new File(listFile.get(3));
        In in = new In(listFile.get(4));
        while (in.hasNextLine()) {
            String readLine = in.readLine().trim(); // read the current line
            System.out.println(readLine);
        }
        System.out.println(file.length());
    }
}
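The recursive traversal in SearchFile can also be expressed with the JDK's own directory walker. The sketch below is an alternative under stated assumptions, not the author's code: it uses java.nio.file.Files.walk, and the class name WalkFilesSketch and the parameter name includeDirectories are made up for illustration. Unlike the original, it returns the paths sorted and keeps only regular files when directories are excluded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical alternative to SearchFile.getAllFile, for illustration only.
final class WalkFilesSketch {

    // Lists absolute paths below directoryPath; returns an empty list
    // when the path is not an existing directory.
    static List<String> listFiles(String directoryPath, boolean includeDirectories) {
        Path base = Paths.get(directoryPath);
        if (!Files.isDirectory(base)) {
            return Collections.emptyList();
        }
        // Files.walk keeps directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(base)) {
            return paths
                    .filter(p -> !p.equals(base)) // skip the starting directory itself
                    .filter(p -> includeDirectories || Files.isRegularFile(p))
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as WalkFilesSketch.listFiles("E:\\huadai", false) would play the same role as SearchFile.getAllFile("E:\\huadai", false) in the listing above.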
// NewsPaper.java
package com.njupt.ymh;

import java.io.File;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

public class NewsPaper {
    int doc_id;            // article id
    String doc_title;      // article title
    String lead_paragraph; // lead paragraph
    String full_text;      // full text
    String date;           // article date

    public NewsPaper(String xml) {
        doc_id = -1;
        doc_title = null;
        lead_paragraph = null;
        full_text = null;
        date = null;
        searchValue(xml);
    }

    /**
     * Loads the XML file into a Document.
     * @param fileName path of the XML file
     * @return the parsed Document, or null if parsing failed
     */
    private Document load(String fileName) {
        Document document = null;
        SAXReader saxReader = new SAXReader(); // reads the file stream

        try {
            document = saxReader.read(new File(fileName));
        } catch (DocumentException e) {
            e.printStackTrace();
        }

        return document;
    }

    /**
     * Returns the root element of the Document.
     * @param document the parsed document
     */
    private Element getRootNode(Document document) {
        return document.getRootElement();
    }

    /**
     * Extracts the required node values.
     * @param xml path of the XML file
     */
    private void searchValue(String xml) {
        Document document = load(xml);
        if (document == null) // parsing failed; doc_id stays -1, so isUseful() returns false
            return;
        Element root = getRootNode(document); // root element

        // Article date: characters 10 to 19 of the file path, so this depends on
        // the fixed path prefix used in main
        if (xml.length() >= 20)
            date = xml.substring(10, 20);

        // Article title
        doc_title = root.valueOf("//head/title");

        // Article id
        List<Node> list_doc_id = document.selectNodes("//doc-id/@id-string");
        for (Node ele : list_doc_id) {
            doc_id = Integer.parseInt(ele.getText());
        }

        // Article content
        for (Iterator<Element> i = root.elementIterator(); i.hasNext();) {
            Element el = i.next(); // head, body

            // Strings must be compared with equals(), not ==
            if (!"body".equals(el.getName()))
                continue;

            for (Iterator<Element> body = el.elementIterator(); body.hasNext();) {
                Element elbody = body.next();
                if (!"body.content".equals(elbody.getName()))
                    continue;

                for (Iterator<Element> block = elbody.elementIterator(); block.hasNext();) {
                    Element block_class = block.next();
                    String blockType = block_class.attributeValue("class"); // may be null

                    if ("full_text".equals(blockType)) {
                        List<Node> list_text = block_class.selectNodes("p");
                        for (Node text : list_text) {
                            if (full_text == null)
                                full_text = text.getStringValue();
                            else
                                full_text = full_text + " " + text.getStringValue();
                        }
                    } else if ("lead_paragraph".equals(blockType)) {
                        List<Node> list_lead = block_class.selectNodes("p");
                        for (Node lead : list_lead) {
                            if (lead_paragraph == null)
                                lead_paragraph = lead.getStringValue();
                            else
                                lead_paragraph = lead_paragraph + " " + lead.getStringValue();
                        }
                    }
                }
            }
        }
    }

    /** Returns the article title. */
    public String getTitle() {
        return doc_title;
    }

    /** Returns the article id. */
    public int getID() {
        return doc_id;
    }

    /** Returns the lead paragraph. */
    public String getLead() {
        // Articles before 1990-10-22 (id < 394070): drop the first 6 characters
        if (getID() < 394070 && lead_paragraph != null && lead_paragraph.length() > 6)
            return lead_paragraph.substring(6);
        else // 1990-10-22 and later
            return lead_paragraph;
    }

    /** Returns the full text of the article. */
    public String getfull() {
        // Articles before 1990-10-22 (id < 394070): drop the first 6 characters
        if (getID() < 394070 && full_text != null && full_text.length() > 6)
            return full_text.substring(6);
        else
            return full_text;
    }

    /** Returns the article date. */
    public String getDate() {
        return date;
    }

    /**
     * Checks whether the extracted information is usable.
     * @return true if every field is present and within the length limits
     */
    public boolean isUseful() {
        if (getID() == -1)
            return false;
        if (getDate() == null)
            return false;
        if (getTitle() == null || getTitle().length() >= 255)
            return false;
        if (getLead() == null || getLead().length() >= 65535)
            return false;
        if (getfull() == null || getfull().length() >= 65535)
            return false;

        return !isnum();
    }

    /**
     * Detects the numeric articles that start with a special prefix.
     * @return true if the article is a numeric company report
     */
    private boolean isnum() {
        String full = getfull();
        // Filter out the numeric articles
        return full != null && full.length() > 24 && full.startsWith("*3*** COMPANY REPORT");
    }

    public static void main(String[] args) {
        List<String> listFile = SearchFile.getAllFile("E:\\huadai\\1989\\10", false); // file list
        int count = 0; // files processed
        int i = 0;     // files rejected by isUseful()
        for (String string : listFile) {
            NewsPaper newsPaper = new NewsPaper(string);
            count++;
            if (!newsPaper.isUseful()) {
                i++;
                System.out.println(newsPaper.getLead());
            }
        }

        System.out.println(i + " " + count);
    }
}
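The listing above takes the article date from a fixed slice of the file path, which only works for one directory layout. The sample XML also carries the date in the date.publication attribute of the pubdata element ("19930707T000000"). The sketch below shows how such a value could be parsed with java.time; it is an assumption-laden illustration rather than the author's approach, and the class name PublicationDateSketch and its parse method are hypothetical. With dom4j the raw attribute value could presumably be read with something like root.valueOf("//pubdata/@date.publication").

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of the original article.
final class PublicationDateSketch {

    // Matches values shaped like 19930707T000000 (date, literal T, time).
    private static final DateTimeFormatter NITF_STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss");

    // Returns the date part, or null when the value is missing or malformed.
    static LocalDate parse(String stamp) {
        if (stamp == null) {
            return null;
        }
        try {
            return LocalDateTime.parse(stamp.trim(), NITF_STAMP).toLocalDate();
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```

Returning null for a bad value mirrors how isUseful() already rejects articles whose date is null.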

