Open a PDF without text with Python


I want to open a PDF in a Django view, but my PDF contains no text and Python returns a blank PDF. Each page is a scan of a page: link

from django.http import HttpResponse

def views_pdf(request, path):
    with open(path) as pdf:
        response = HttpResponse(pdf.read(), content_type='application/pdf')
        response['Content-Disposition'] = 'inline;elec'
        return response

Exception Type: UnicodeDecodeError

Exception Value: 'charmap' codec can't decode byte 0x9d in position 373: character maps to <undefined>

Unicode error hint

The string that could not be encoded/decoded was: � ��`����

How do I tell Python that the file is not text but an image?

Answer

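By default, Python 3 opens files in text mode, that is, it tries to decode the contents of the file as text using the platform's default encoding. This is what causes the exception that you see; the codec named in the message depends on the platform, which is why it is 'charmap' in your traceback and 'utf-8' in the example below.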

Since a PDF file is (generally) a binary file, try opening the file in binary mode. In that case, read() will return a bytes object.

Here’s an example (in IPython). First, opening as text:

In [1]: with open('2377_001.pdf') as pdf:
   ...:     data = pdf.read()
   ...:     
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-d807b6ccea6e> in <module>()
      1 with open('2377_001.pdf') as pdf:
----> 2     data = pdf.read()
      3 

/usr/local/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Next, reading the same file in binary mode:

In [2]: with open('2377_001.pdf', 'rb') as pdf:
   ...:     data = pdf.read()
   ...: 
In [3]: type(data)
Out[3]: bytes

In [4]: len(data)
Out[4]: 45659

In [5]: data[:10]
Out[5]: b'%PDF-1.4\n%'

That solves the first part, how to read the data.
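As a small sketch of the same idea outside IPython, the helper below reads a file in binary mode and checks that it begins with the PDF header. The name read_pdf_bytes and the constant PDF_MAGIC are made up for illustration, and the check assumes the %PDF- marker sits at the very start of the file, which is the usual case but not something every PDF producer guarantees.

```python
from pathlib import Path

# Hypothetical constant: the marker that normally opens a PDF file.
PDF_MAGIC = b'%PDF-'


def read_pdf_bytes(path):
    """Return the raw bytes of the file at `path`, checking the PDF header."""
    # read_bytes() opens the file in binary mode, so nothing is decoded.
    data = Path(path).read_bytes()
    if not data.startswith(PDF_MAGIC):
        raise ValueError(f'{path} does not start with {PDF_MAGIC!r}')
    return data
```

Calling read_pdf_bytes('2377_001.pdf') would return the same bytes object as the open(..., 'rb') example above, and a file that is not a PDF raises ValueError instead of being served by mistake.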

The second part is how to pass it to an HttpResponse. According to the Django documentation:

“Typical usage is to pass the contents of the page, as a string, to the HttpResponse constructor”

So passing bytes might or might not work (I don’t have Django installed to test). The Django book says:

“content should be an iterator or a string.”

I found the following gist to write binary data:

from django.http import HttpResponse

def django_file_download_view(request):
    filepath = '/path/to/file.xlsx'
    with open(filepath, 'rb') as fp:  # Read as binary, not text.
        data = fp.read()
    filename = 'some-filename.xlsx'
    # The mimetype argument was removed in Django 1.7; use content_type.
    response = HttpResponse(content_type="application/ms-excel")
    # Force the browser to download the file.
    response['Content-Disposition'] = 'attachment; filename=%s' % filename
    response.write(data)
    return response