Open a PDF without text with Python
I want to open a PDF in a Django view, but my PDF contains no text and Python returns a blank PDF. Each page of the PDF is a scan of a page: link
    from django.http import HttpResponse

    def views_pdf(request, path):
        with open(path) as pdf:
            response = HttpResponse(pdf.read(), content_type='application/pdf')
            response['Content-Disposition'] = 'inline;elec'
            return response
Exception Type: UnicodeDecodeError
Exception Value: 'charmap' codec can't decode byte 0x9d in position 373: character maps to <undefined>
Unicode error hint
The string that could not be encoded/decoded was: � ��`����
How do I tell Python that this is not text but an image?
Answer
By default, Python 3 opens files in text mode, that is, it tries to interpret the contents of a file as text. This is what causes the exception that you see.
Since a PDF file is (generally) a binary file, try opening the file in binary mode. In that case, read() will return a bytes object.
Here’s an example (in IPython). First, opening as text:
    In [1]: with open('2377_001.pdf') as pdf:
       ...:     data = pdf.read()
       ...:
    ---------------------------------------------------------------------------
    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-1-d807b6ccea6e> in <module>()
          1 with open('2377_001.pdf') as pdf:
    ----> 2     data = pdf.read()
          3

    /usr/local/lib/python3.6/codecs.py in decode(self, input, final)
        319         # decode input (taking the buffer into account)
        320         data = self.buffer + input
    --> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
        322         # keep undecoded input until the next call
        323         self.buffer = data[consumed:]

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Next, reading the same file in binary mode:
    In [2]: with open('2377_001.pdf', 'rb') as pdf:
       ...:     data = pdf.read()
       ...:

    In [3]: type(data)
    Out[3]: bytes

    In [4]: len(data)
    Out[4]: 45659

    In [5]: data[:10]
    Out[5]: b'%PDF-1.4\n%'
That solves the first part, how to read the data.
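The binary read above can be wrapped in a small helper. This is only a sketch: the name `read_pdf_bytes` and the sample bytes are made up for illustration, and the header check relies on the `%PDF-` marker visible in the output above.

```python
import os
import tempfile


def read_pdf_bytes(path):
    """Read a file in binary mode and check that it starts with the PDF marker."""
    # 'rb' means read + binary: Python returns raw bytes and attempts no text decoding.
    with open(path, 'rb') as pdf:
        data = pdf.read()
    if not data.startswith(b'%PDF-'):
        raise ValueError('%s does not look like a PDF file' % path)
    return data


# Hypothetical stand-in for a scanned PDF: a PDF header followed by bytes
# that are not valid UTF-8 text.
sample = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n'
with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
    tmp.write(sample)

try:
    data = read_pdf_bytes(tmp.name)
    print(type(data), len(data))
finally:
    os.remove(tmp.name)
```

The helper returns the bytes unchanged, so whatever serves the file afterwards gets exactly what is on disk.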
The second part is how to pass it to an HttpResponse. According to the Django documentation:
“Typical usage is to pass the contents of the page, as a string, to the HttpResponse constructor”
So passing bytes might or might not work (I don’t have Django installed to test). The Django book says:
“content should be an iterator or a string.”
I found the following gist to write binary data:
    from django.http import HttpResponse

    def django_file_download_view(request):
        filepath = '/path/to/file.xlsx'
        with open(filepath, 'rb') as fp:  # Small fix to read as binary.
            data = fp.read()
        filename = 'some-filename.xlsx'
        # content_type replaces the old mimetype argument, which Django 1.7 removed.
        response = HttpResponse(content_type="application/ms-excel")
        response['Content-Disposition'] = 'attachment; filename=%s' % filename  # force browser to download file
        response.write(data)
        return response