How to Extract Images and Text from a PDF with JavaScript

Updated: 2024-11-27 10:59:45 Author: haorooms

This article shows how to extract the images and text from a PDF in JavaScript, with commented example code for each step.

Extracting text from a PDF: core code

The core code relies on the pdf.js library, which the previous article also mentioned. Its main use is previewing PDFs in the browser.

Documentation: mozilla.github.io/pdf.js/api/draft/module-pdfjsLib-PDFPageProxy.html

/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {number} pageNum The 1-based number of the page
 * @param {PDFDocumentProxy} PDFDocumentInstance The PDF document obtained
 * @returns {Promise<string>} Resolves with the text of the page
 **/
function getPageText(pageNum, PDFDocumentInstance) {
  // getPage() already returns a promise, so chain on it directly instead of
  // wrapping it in a new Promise; errors from getPage() or getTextContent()
  // then reach the caller instead of being silently dropped
  return PDFDocumentInstance.getPage(pageNum)
    .then(function (pdfPage) {
      // The main trick to obtain the text of the PDF page: the getTextContent method
      return pdfPage.getTextContent();
    })
    .then(function (textContent) {
      var textItems = textContent.items;
      var finalString = '';

      // Concatenate the string of each item to the final string
      for (var i = 0; i < textItems.length; i++) {
        finalString += textItems[i].str + ' ';
      }

      // Resolve with the text retrieved from the page
      return finalString;
    });
}
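
The function above handles a single page. As a minimal sketch of how the same idea extends to a whole document, the helper below walks every page and collects its text. The name `getAllPagesText` is hypothetical and not part of pdf.js; the only things it assumes about the document object are the `numPages` property and the `getPage()` / `getTextContent()` methods already used above.

```javascript
// Sketch: gather the text of every page of a pdf.js-style document.
// `getAllPagesText` is a hypothetical helper name, not a pdf.js API.
async function getAllPagesText(pdfDocument) {
  const pages = [];
  // Page numbers are 1-based and run up to numPages.
  for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
    const page = await pdfDocument.getPage(pageNum);
    const textContent = await page.getTextContent();
    // Keep only items that carry a string; anything else is skipped.
    const text = textContent.items
      .filter((item) => typeof item.str === 'string')
      .map((item) => item.str)
      .join(' ');
    pages.push(text);
  }
  // One string per page, in page order.
  return pages;
}
```

With a loaded document this could be called as `pdf.getDocument('haorooms.pdf').promise.then(getAllPagesText)`, where `pdf` is the pdf.js module as in the image example later in this article. Pages are read one after another to keep the sketch simple.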

Extracting images from a PDF

The core code is as follows:

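// Setup assumed here, because the original snippet does not show it:
// Node.js, with `pdf` as the pdf.js module and `ccc` as the node-canvas module
const fs = require('fs');
const pdf = require('pdfjs-dist/legacy/build/pdf.js'); // path varies by pdfjs-dist version
const ccc = require('canvas');

// First, open the document
pdf.getDocument('haorooms.pdf').promise.then(async function (pdfObj) {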
  // because I am testing, I just wanted to get page 7
  const page = await pdfObj.getPage(7);

  // now I need to get the image information and for that I get the operator list
  const operators = await page.getOperatorList();

  // Collect the index of every paintImageXObject operator, i.e. every place
  // where an image is painted. Other image operators exist (for example
  // paintInlineImageXObject) and can presumably be handled the same way
  const rawImgOperator = operators.fnArray
    .map((f, index) => (f === pdf.OPS.paintImageXObject ? index : null))
    .filter((n) => n !== null);

  // Next comes the image's object name, stored in argsArray: the first
  // argument is the name, followed by the width and height. This example
  // just takes the first image, which is known to exist on page 7; real code
  // should loop over the array and handle the case where it is empty
  const filename = operators.argsArray[rawImgOperator[0]][0];

  // now we get the object itself from page.objs using the filename
  page.objs.get(filename, async (arg) => {
    // and here is where we need the canvas, the object contains information such as width and height
    const canvas = ccc.createCanvas(arg.width, arg.height);
    const ctx = canvas.getContext('2d');

    // The decoded pixels may be RGB (width * height * 3 bytes) or RGBA
    // (width * height * 4 bytes). ImageData needs RGBA, so RGB data is
    // expanded with an opaque alpha channel; other layouts are not handled here
    const data = new Uint8ClampedArray(arg.width * arg.height * 4);
    if (arg.data.length === data.length) {
      data.set(arg.data); // already RGBA, use as-is
    } else {
      for (let i = 0, k = 0; i < arg.data.length; i += 3, k += 4) {
        data[k] = arg.data[i]; // r
        data[k + 1] = arg.data[i + 1]; // g
        data[k + 2] = arg.data[i + 2]; // b
        data[k + 3] = 255; // a
      }
    }

    // Create an ImageData object, fill it with the RGBA pixels and draw it onto the canvas
    const imgData = ctx.createImageData(arg.width, arg.height);
    imgData.data.set(data);
    ctx.putImageData(imgData, 0, 0);

    // Get a buffer holding the PNG-encoded image
    const buff = canvas.toBuffer();

    // Write the file. PNG output can be rather large; an image conversion
    // library such as sharp may give better results by compressing it
    fs.writeFileSync('test.png', buff);
  });
});
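
The comments in the snippet above mention two things that real code would need: looping over every image on the page, and checking whether the pixel data is RGB or RGBA before conversion. The sketch below isolates both as plain functions that work on ordinary arrays, so they can be tried without a PDF. The names `collectImageNames` and `toRgba` are hypothetical helpers, not pdf.js APIs, and the sketch assumes the pixel data is either 3 or 4 bytes per pixel.

```javascript
// Hypothetical helper: list the object names of all images painted on a page.
// `operatorList` has the shape returned by page.getOperatorList();
// `imageOp` is the operator code to match, e.g. pdf.OPS.paintImageXObject.
function collectImageNames(operatorList, imageOp) {
  const names = [];
  operatorList.fnArray.forEach((fn, index) => {
    if (fn === imageOp) {
      // The first argument of the operator is the image's object name.
      names.push(operatorList.argsArray[index][0]);
    }
  });
  // The same image can be painted more than once; keep each name once.
  return [...new Set(names)];
}

// Hypothetical helper: return RGBA pixels for a width x height image whose
// `data` holds either 3 (RGB) or 4 (RGBA) bytes per pixel.
function toRgba(data, width, height) {
  const pixels = width * height;
  if (data.length === pixels * 4) {
    return new Uint8ClampedArray(data); // already RGBA, copy as-is
  }
  if (data.length !== pixels * 3) {
    throw new Error('Unsupported pixel layout: ' + data.length + ' bytes');
  }
  const rgba = new Uint8ClampedArray(pixels * 4);
  for (let i = 0, k = 0; i < data.length; i += 3, k += 4) {
    rgba[k] = data[i]; // r
    rgba[k + 1] = data[i + 1]; // g
    rgba[k + 2] = data[i + 2]; // b
    rgba[k + 3] = 255; // a, fully opaque
  }
  return rgba;
}
```

In the snippet above, `collectImageNames(operators, pdf.OPS.paintImageXObject)` would give the names to pass to `page.objs.get`, and `toRgba(arg.data, arg.width, arg.height)` would give the array to copy into the ImageData object. Throwing on an unexpected size is a deliberate choice here: it avoids writing a corrupted image when the data is in some other layout.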

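Summary

This article covered how to extract the text and images from a PDF with JavaScript. Converting a PDF to Word follows roughly the same approach: extract the text and images, then place them into a Word document. The code relies on the pdf.js library and draws on this issue: github.com/mozilla/pdf.js/issues/13541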
