{"id":89,"date":"2022-03-28T21:10:04","date_gmt":"2022-03-29T01:10:04","guid":{"rendered":"http:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/?p=89"},"modified":"2024-10-11T09:32:59","modified_gmt":"2024-10-11T13:32:59","slug":"downloading-via-command-line-with-wget","status":"publish","type":"post","link":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/?p=89","title":{"rendered":"downloading via command line with wget"},"content":{"rendered":"\n<pre class=\"wp-block-preformatted\">\/etc\/wgetrc \tDefault location of the global startup file.\n.wgetrc \tUser startup file.\n\u00a0<strong>\n#How to Download a Website Using wget <\/strong><\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">wget -r www.dlsite.com<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">#This downloads the pages recursively up to a maximum of 5 levels deep.\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">#Five levels deep might not be enough to get everything from the site. You can use the -l switch to set the number of levels you wish to go to as follows:<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">wget -r -l10 www.dlsite.com\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">#If you want infinite recursion you can use the following:<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">wget -r -l inf www.dlsite.com\n<\/pre>\n\n\n\n<pre id=\"ext-gen3861\" class=\"wp-block-preformatted\">#  How to Download Certain File Types \n\nwget -A \"*.mp3\" -r<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">#The reverse of this is to ignore certain files. Perhaps you don't want to download executables. In this case, you would use the following syntax:<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">wget -R \"*.exe\" -r\n<\/pre>\n\n\n\n<pre id=\"ext-gen1707\" class=\"wp-block-preformatted\">\n<strong>#Other Parameters<\/strong><\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-b, --background\tGo to background immediately after startup. 
If no output file is specified via the -o option, output is redirected to wget-log.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-o logfile, --output-file=logfile\tLog all messages to logfile. The messages are normally reported to standard error.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-a logfile, --append-output=logfile \tAppend to logfile. This option is the same as -o, only it appends to logfile instead of overwriting the old log file. If logfile does not exist, a new file is created.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-q, --quiet \tTurn off wget's output.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-v, --verbose \tTurn on verbose output, with all the available data. The default output is verbose.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-nv, --no-verbose \tNon-verbose output. Turn off verbose without being completely quiet (use -q for that), which means that error messages and basic information still get printed.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-i file, --input-file=file \tRead URLs from a local or external file. If \"-\" is specified as file, URLs are read from the standard input. (Use \".\/-\" to read from a file literally named \"-\".)<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-F, --force-html \tWhen input is read from a file, force it to be treated as an HTML file. This enables you to retrieve relative links from existing HTML files on your local disk, by adding &lt;base href=\"url\"&gt; to HTML, or using the --base command-line option.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-t number, --tries=number \tSet number of retries to number. Specify 0 or inf for infinite retrying. 
The default is to retry 20 times, with the exception of fatal errors like \"connection refused\" or \"not found\" (404), which are not retried.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-O file, --output-document=file \tThe documents will not be written to the appropriate files, but all will be concatenated together and written to file.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-c, --continue \tContinue getting a partially-downloaded file. This option is useful when you want to finish up a download started by a previous instance of wget, or by another program. For instance: wget -c ftp:\/\/dlsite\/filename<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--progress=type \tSelect the progress indicator you want to use. Legal indicators are \"dot\" and \"bar\".<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-N, --timestamping \tTurn on time stamping. The output file will have a timestamp matching the remote copy; if the file already exists locally and the remote file is not newer, no download will occur.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--no-use-server-timestamps \tDon't set the local file's timestamp by the one on the server.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-S, --server-response \tPrint the headers sent by HTTP servers and responses sent by FTP servers.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--spider \tWhen invoked with this option, wget will behave as a web spider, which means that it will not download the pages, just check that they are there. For example, you can use wget to check your bookmarks: wget --spider --force-html -i bookmarks.html<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-T seconds, --timeout=seconds \tSet the network timeout to seconds seconds. 
This option is equivalent to specifying --dns-timeout, --connect-timeout, and --read-timeout, all at the same time.\u00a0\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--dns-timeout=seconds \tSet the DNS lookup timeout to seconds seconds. DNS lookups that don't complete within the specified time will fail. By default, there is no timeout on DNS lookups, other than that implemented by system libraries.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--connect-timeout=seconds \tSet the connect timeout to seconds seconds. TCP connections that take longer to establish will be aborted. By default, there is no connect timeout, other than that implemented by system libraries.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--read-timeout=seconds \tSet the read (and write) timeout to seconds seconds. Reads that take longer will fail. The default value for read timeout is 900 seconds.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--limit-rate=amount \tLimit the download speed to amount bytes per second. The amount may be expressed in bytes, kilobytes (with the k suffix), or megabytes (with the m suffix). For example, --limit-rate=20k will limit the retrieval rate to 20 KB\/s. This option is useful when, for whatever reason, you don't want wget to consume the entire available bandwidth.\u00a0\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-w seconds, --wait=seconds \tWait the specified number of seconds between the retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent. Instead of in seconds, the time can be specified in minutes using the m suffix, in hours using h suffix, or in days using d suffix.\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--waitretry=seconds \tIf you don't want wget to wait between every retrieval, but only between retries of failed downloads, you can use this option. 
wget will use linear backoff, waiting 1 second after the first failure on a given file, then waiting 2 seconds after the second failure on that file, up to the maximum number of seconds you specify. Therefore, a value of 10 will actually make wget wait up to (1 + 2 + ... + 10) = 55 seconds per file. By default, wget will assume a value of 10 seconds.\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--random-wait \tSome websites may perform log analysis to identify retrieval programs such as wget by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0 and 2*wait seconds, where wait was specified using the --wait option, to mask wget's presence from such analysis.\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--no-dns-cache \tTurn off caching of DNS lookups. Normally, wget remembers the addresses it looked up from DNS so it doesn't have to repeatedly contact the DNS server for the same (typically small) set of addresses it retrieves. This cache exists in memory only; a new wget run will contact DNS again.\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--retry-connrefused \tConsider \"connection refused\" a transient error and try again. Normally wget gives up on a URL when it is unable to connect to the site because failure to connect is taken as a sign that the server is not running at all and that retries would not help. This option is for mirroring unreliable sites whose servers tend to disappear for short periods of time.\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--user=user, --password=password \tSpecify the username user and password for both FTP and HTTP file retrieval. 
These parameters can be overridden using the --ftp-user and --ftp-password options for FTP connections and the --http-user and --http-password options for HTTP connections.\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--ask-password \tPrompt for a password for each connection established. Cannot be specified when --password is being used, because they are mutually exclusive.\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--unlink \tForce wget to unlink file instead of clobbering existing file. This option is useful for downloading to the directory with hardlinks.\u00a0<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-nd, --no-directories \tDo not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the file names will get extensions .n).<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-x, --force-directories \tThe opposite of -nd; create a hierarchy of directories, even if one would not have been created otherwise. For example, wget -x http:\/\/fly.srk.fer.hr\/robots.txt will save the downloaded file to fly.srk.fer.hr\/robots.txt.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-nH, --no-host-directories \tDisable generation of host-prefixed directories. By default, invoking wget with -r http:\/\/dlsite\/ will create a structure of directories beginning with dlsite\/. This option disables such behaviour.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--protocol-directories \tUse the protocol name as a directory component of local file names. For example, with this option, wget -r http:\/\/host will save to http\/host\/... rather than just to host\/....<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--cut-dirs=number \tIgnore number directory components. 
This option is useful for getting fine-grained control over the directory where recursive retrieval will be saved.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--http-user=user, --http-password=password \tSpecify the username user and password on an HTTP server. According to the challenge, wget will encode them using either the \"basic\" (insecure) or the \"digest\" authentication scheme.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--ignore-length \tUnfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus \"Content-Length\" headers, which makes wget start to bray like a stuck pig, as it thinks not all the document was retrieved. You can spot this syndrome if wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte. With this option, wget ignores the \"Content-Length\" header, as if it never existed.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--private-key=file \tRead the private key from file. This option allows you to provide the private key in a file separate from the certificate.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">--private-key-type=type \tSpecify the type of the private key. Accepted values are PEM (the default) and DER.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-r, --recursive \tTurn on recursive retrieving.\n\n-l depth, --level=depth \tSpecify recursion maximum depth level depth. The default maximum depth is 5.<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-K, --backup-converted \tWhen converting a file, back up the original version with an .orig suffix. Affects the behavior of -N.\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-m, --mirror \tTurn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. 
It is currently equivalent to -r -N -l inf -nr.\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">-p, --page-requisites \tThis option causes wget to download all the files that are necessary to properly display a given HTML page, including such things as inlined images, sounds, and referenced stylesheets. Ordinarily, when downloading a single HTML page, any requisite documents that may be needed to display it properly are not downloaded. Using -r together with -l can help, but since wget does not ordinarily distinguish between external and inlined documents, one is generally left with \"leaf documents\" that are missing their requisites.\n\n-A acclist, --accept acclist; -R rejlist, --reject rejlist \tSpecify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.\n\n-D domain-list, --domains=domain-list \tSet domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.\n\n--exclude-domains domain-list \tSpecify the domains that are not to be followed.\n\n--follow-ftp \tFollow FTP links from HTML documents. Without this option, wget will ignore all the FTP links.\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>\/etc\/wgetrc Default location of the global startup file. .wgetrc User startup file. \u00a0 #How to Download a Website Using wget wget -r www.dlsite.com #This downloads the pages recursively up to a maximum of 5 levels deep. #Five levels deep might not be enough to get everything from the site. 
You can use the -l switch [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[15],"tags":[],"class_list":["post-89","post","type-post","status-publish","format-standard","hentry","category-file-management"],"_links":{"self":[{"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/89","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=89"}],"version-history":[{"count":1,"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/89\/revisions"}],"predecessor-version":[{"id":90,"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/89\/revisions\/90"}],"wp:attachment":[{"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=89"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=89"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/triosdevelopers.com\/J.Smith\/rjeffsmith.ca\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=89"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}