Hi. I am new to the forum. Hello.
As many people have likely experienced, the full website documentation at https://openwrt.ifw.cn/docs/ cannot be read offline through the usual routes: from the device's web server page, via a software package (opkg install 'documentation'), or as a standalone PDF. The OpenWrt documentation is excellent. For most OpenWrt users and developers, it not only describes how to operate our OpenWrt devices but is also a good reference for a number of networking and security topics.
I would like to know if we can programmatically download the website documentation HTML pages and convert them into a PDF. Obviously the resulting PDF will be for personal use only. I know I can do this, at least in a technical sense, but do the maintainers and owners of OpenWrt allow this?
This query is directed to other users on the forum, and especially the forum administrators.
The documentation won't fit in the device's flash memory; that's why it's not included.
Yes, you can freely download the OpenWrt wiki contents and make a PDF. The information there is shared freely under this Creative Commons license (there is a link to it at the bottom of each page): https://creativecommons.org/licenses/by-sa/4.0/deed.en
You know, this is an interesting thing to consider - I’m not sure how much value it would have, but if someone could download basically the entire site documentation (or subsections) as a package to their computer, that could be good for when they are planning to be offline due to router swaps / maintenance / configuration.
Probably not all that necessary given that many users will have at least a mobile device for access while their home connection is down, but I wonder if anyone would use such a feature and how it would be implemented.
For Windows there are applications that do this, and it can also be done from a Linux/macOS/Windows system with the wget command line:
wget --mirror --page-requisites --convert-links --adjust-extension --compression=auto --reject-regex "/search|/rss" --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36" --wait 1s --restrict-file-names=windows openwrt.ifw.cn
I'm doing some tests to see if it works and I can implement this in the wiki server itself to provide a package users can download and browse offline with their web browser.
Take care that wget doesn't choke on the ToH (Table of Hardware) views. They contain links that only change the sorting or filtering of the same table, so simple crawlers such as search engines or wget can follow them endlessly and get stuck there. That never finishes and puts pointless load on the server, so it should be avoided.
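One way to keep a crawler out of such loops is to skip any URL whose query string only re-sorts or re-filters a table. The following is a minimal sketch, not a tested crawler rule: the parameter names datasrt, dataflt and dataofs are an assumption about how the views encode sorting and filtering, so check them against real ToH links first. The function name is made up for illustration.

```shell
# Return 0 (true) if a URL looks like a sort/filter variant of a table view.
# ASSUMPTION: datasrt, dataflt and dataofs are the query parameters involved.
is_view_variant() {
    printf '%s\n' "$1" | grep -Eq '[?&](datasrt|dataflt|dataofs)'
}
```

If the assumption holds, the same expression could be appended to the existing wget option, for example --reject-regex "/search|/rss|[?&](datasrt|dataflt|dataofs)".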
If anyone ends up scraping the wiki into HTML files, I have used a simple pipeline to generate PDFs using pandoc and tectonic.
Basic steps are:
- gather markup friendly source files
- for each file:
  - convert from markup to PDF (using tectonic or other PDF generator)
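The steps above could be sketched roughly as follows. This is not the poster's actual pipeline, only an illustration under assumptions: the sources are .html files in one directory tree, pandoc and tectonic are both installed, and pandoc is asked to drive tectonic itself through its --pdf-engine option. The function names are hypothetical.

```shell
# Map an HTML file name to its PDF name: docs/start.html -> docs/start.pdf
pdf_name() {
    printf '%s\n' "${1%.html}.pdf"
}

# Convert every .html file under a directory to PDF, one PDF per page.
# Requires pandoc and tectonic on PATH; failures are reported and skipped.
convert_tree() {
    find "$1" -type f -name '*.html' | while IFS= read -r src; do
        pandoc "$src" --pdf-engine=tectonic -o "$(pdf_name "$src")" \
            || echo "failed: $src" >&2
    done
}
```

Calling convert_tree on the mirror directory would write each PDF next to its source page. Producing a single combined PDF would need extra work to order the pages and rewrite the links between them.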
There has to be a better way than wget. It's a very messy way to keep the docs synced.
rsync access would be nice, for example.
There is a better way than wget, but it's also much more complicated. I'm not a fan of documentation spread across hundreds of HTML pages because it's practically impossible to do a regular expression search on them. A few years ago I developed a small system that can recursively walk a set of HTML pages and convert the lot of them to markdown text. The result certainly isn't as pretty as HTML, but I can search the entire set for something in one go.
For the current OpenWrt documentation, my system downloaded 57 megabytes in 590 files. After converting to text and removing boilerplate header/footer matter, I got 6.6 megs of text containing some 775,000 words.
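To illustrate why a flat text copy is convenient, here is a small sketch of searching and measuring such a set with standard tools. It is not the system described above; it assumes the converted files already sit in one directory tree, and both function names are made up.

```shell
# Case-insensitive extended-regex search across every file under a directory.
search_docs() {
    grep -rniE -- "$1" "$2"
}

# Print "<files> <words>" for a directory tree of text files.
corpus_stats() {
    files=$(find "$1" -type f | wc -l)
    words=$(find "$1" -type f -exec cat {} + | wc -w)
    # Arithmetic expansion strips the padding some wc versions add.
    echo "$((files)) $((words))"
}
```

For example, search_docs 'vlan|bridge' ./docs-text lists every matching line with its file name and line number in one go.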
I haven't tried to convert it to PDF, although I have tools that can do that as well.
I have no idea how to make this available to a larger audience.
@oscar_wiley - I'm not sure about the best way to distribute such a document, either... a PDF would probably be the easiest method... you could share that with me in a PM and I'd be happy to give you my thoughts about the readability and ease of navigating the document after it has been distilled.
The biggest issue I see is that the wiki articles are constantly being updated (some are old and stale, of course, others are actively being edited), so this would have to be something that could be run dynamically and on-demand. I could imagine a button on the main page that says "generate a PDF for offline reading" -- clicking that button would scrape the pages and build that PDF for download. It would be somewhat similar to the way the firmware-selector can generate custom firmware images for download.
It should be a shell script using free tools to "compile" human readable docs from the "source".
The server admins can set up a cron task to run the script on a schedule and provide a link to the compiled offline docs.
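A rough sketch of the packaging half of such a script is shown below. It assumes the offline docs have already been generated into a directory; the archive name, the paths and the crontab line are all hypothetical and would be chosen by the server admins.

```shell
# Pack a directory of generated docs into a dated tarball and print its path.
# Both arguments are hypothetical paths; keep the output outside the source.
build_archive() {
    src=$1
    dest=$2
    out="$dest/offline-docs-$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest" && tar -czf "$out" -C "$src" . && printf '%s\n' "$out"
}

# Example crontab entry (hypothetical script path), run daily at 03:00:
# 0 3 * * * /usr/local/bin/build-offline-docs.sh
```

The download link would then point at the newest archive in the output directory.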
I've played around a bit with my scraping system. Due to its design it produces markdown files, but they don't do a very good job of regenerating the HTML they came from. The resulting PDF looks terrible.
I notice OpenWrt is using DokuWiki, which has a plugin called dw2pdf. This appears to be the best option, and relieves me of having to reinvent the wheel. There's also a Site Export plugin that can export a namespace to HTML, including images and media.
Both plugins can be installed using DokuWiki's Extension Manager, and both look like they'll address this issue.
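For reference, dw2pdf exports a page when do=export_pdf is added to its URL. The helper below is a small sketch of building that URL; it assumes the plugin is installed and enabled on the wiki in question, which nothing in this thread confirms. The function name is made up.

```shell
# Build the dw2pdf export URL for a wiki page.
# ASSUMPTION: the wiki has the dw2pdf plugin installed and enabled.
export_pdf_url() {
    case "$1" in
        *\?*) printf '%s&do=export_pdf\n' "$1" ;;  # URL already has a query
        *)    printf '%s?do=export_pdf\n' "$1" ;;
    esac
}
```

The resulting URL could be opened in a browser or passed to a download tool to save a single page as a PDF.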