You can use git-annex as a podcatcher, to download podcast contents.
No additional software is required, but your git-annex must be built
with the Feeds feature (run git annex version to check).
All you need to do is put something like this in a cron job:
cd somerepo && git annex importfeed http://url/to/podcast http://other/podcast/url
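For example, a crontab entry along these lines should work; the schedule and repository path here are only placeholders:
# fetch new episodes from both example feeds once a day at 04:00
0 4 * * * cd /home/you/podcasts && git annex importfeed http://url/to/podcast http://other/podcast/url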
This downloads the urls, and parses them as RSS, Atom, or RDF feeds.
All enclosures are downloaded and added to the repository, the same as if you
had manually run git annex addurl on each of them.
git-annex will avoid downloading a file from a feed if its url has already
been stored in the repository before. So once a file is downloaded,
you can move it around, delete it, git annex drop its content, etc,
and it will not be downloaded again by repeated runs of
git annex importfeed. Just how a podcatcher should behave. (git-annex versions
since 2015 also track the podcast guid values, as metadata, to help avoid
duplication if the media file url changes; use git annex metadata ... to inspect.)
templates
To control the filenames used for items downloaded from a feed,
there's a --template option. The default is
--template='${feedtitle}/${itemtitle}${extension}'
Other available template variables:
feedauthor, itemauthor, itemsummary, itemdescription, itemrights, itemid,
itempubdate, author, title.
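For example, to put the publication date at the start of each filename, something like this should work (the feed url is just the example one from above):
git annex importfeed --template='${feedtitle}/${itempubdate}-${itemtitle}${extension}' http://url/to/podcast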
catching up
To catch up on a feed without downloading its contents,
use git annex importfeed --relaxed, and delete the symlinks it creates.
Next time you run git annex importfeed it will only fetch any new items.
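A sketch of that, using the example feed url from above; the directory name will be whatever the feed's title is:
git annex importfeed --relaxed http://url/to/podcast
rm -r 'Some Feed Title'    # delete the placeholder symlinks it created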
fast mode
To add a feed without downloading its contents right now,
use git annex importfeed --fast. Then you can use git annex get as
usual to download the content of an item.
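For example, again with the example feed url; the episode path depends on your filename template:
git annex importfeed --fast http://url/to/podcast
git annex get 'Some Feed Title/Some Episode.mp3'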
storing the podcast list in git
You can check the list of podcast urls into git right next to the files it downloads. Just make a file named feeds and add one podcast url per line.
Then you can run git-annex on all the feeds:
xargs git-annex importfeed < feeds
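For example, a feeds file containing the two example urls from above:
http://url/to/podcast
http://other/podcast/url
It can be committed like any other file:
git add feeds
git commit -m 'track podcast feed list'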
recreating lost episodes
If git-annex refuses to download files that you are certain are in the podcast, most likely they have already been downloaded before. In any case, you can use --force to redownload them:
git-annex importfeed --force http://example.com/feed
distributed podcatching
A nice benefit of using git-annex as a podcatcher is that you can
run git annex importfeed on the same url in different clones
of a repository, and git annex sync will sync it all up.
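A minimal sketch, assuming the two clones are configured as git remotes of each other:
# in either clone, whenever convenient
git annex importfeed http://url/to/podcast
git annex sync
Adding --content to the sync can also transfer the downloaded episodes between the clones.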
centralized podcatching
You can also have a designated machine which always fetches all podcasts to local disk and stores them. That way, you can archive podcasts with time-delayed deletion of upstream content. You can also work around slow downloads upstream by podcatching to a server with ample bandwidth, or work around a slow local Internet connection by podcatching to your home server and transferring to your laptop on demand.
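For example, if the archiving machine is configured as a remote named homeserver in your laptop's clone (the remote name and episode path are placeholders), you could pull a single episode on demand:
git annex sync homeserver
git annex get --from homeserver 'Some Feed Title/Some Episode.mp3'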
youtube channels
You can also use git annex importfeed on youtube channels.
It will use yt-dlp to automatically download the videos.
You can either use git-annex importfeed --scrape with the url to the
channel, or you can find the RSS feed for the channel, and
git-annex importfeed that url (without --scrape).
Use of yt-dlp is disabled by default as it can be a security risk. See the documentation of annex.security.allowed-ip-addresses in git-annex for details.
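For example, assuming you have read that documentation and are comfortable loosening the restriction, something like this should work (the channel url is just a placeholder):
git config annex.security.allowed-ip-addresses all
git annex importfeed --scrape https://www.youtube.com/@somechannel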
metadata
As well as storing the urls for items imported from a feed, git-annex can store additional metadata, such as the author and itemdescription. This can then be looked up later, used in metadata driven views, etc.
To make all available metadata from the feed be stored:
git config annex.genmetadata true
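With that set, a subsequent import stores the feed's metadata alongside each item, and it can be queried later; the episode path below is a placeholder:
git annex importfeed http://url/to/podcast
git annex metadata 'Some Feed Title/Some Episode.mp3'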

It seems that some of my feeds get stored into keys that generate a filename that is too long:
Is there a way to work around this?
git-annex addurl already deals with this sort of problem by limiting the filename to 255 characters. If you'd like to file a bug report with details about your system, I can try to make git-annex support its limitations, I suppose.

Looking forward to seeing it in Debian unstable, where it will definitely replace my hpodder setup.
I guess there is no easy way to re-use the files already downloaded with hpodder? At first I thought that git annex importfeed --relaxed followed by adding the files to the git annex would work, but importfeed stores URLs, not content-based hashes, so it wouldn't match up.

@nomeata, well, you can, but it has to download the files again.
When run without --fast, importfeed does use content-based hashes, so if you run it in a temporary directory, it will download the content redundantly, hash it and see it's the same, and add the url to that hash. You can then delete the temporary directory, and the files hpodder had downloaded will have the url attached to them now. I don't know if this really buys you anything over deleting the hpodder files and starting over though.

The only way it can skip downloading a file is if its url has already been seen before. Perhaps you deleted them?
I've made importfeed --force re-download files it's seen before.

Joey - your initial post said:
...but how do I actually switch on the feeds feature?
I install git-annex from cabal, so I do
which I did this morning, and now git annex version gives me:

So it is the latest version, but without Feeds.
cabal install feed should get the necessary library installed so that git-annex will build with feeds support.

Then I reinstalled git-annex but it still doesn't find the feeds flag.

Do I need to do something like:
...but what are the default flags to include in addition to -feed?

-f-Feed will disable the feature. -fFeed will try to force it on.
You can probably work out what's going wrong using cabal install -v3
So I ran cabal install -v3 and looked at the output:

This looks like feed should be on.
There don't appear to be any errors in the compile either.
Is it as simple as a bug where this flag just doesn't show in the git annex version command?

I tried http://user:pass@site.com/rss.xml but it didn't work.

Hi,
the explanations of --fast and --relaxed on this page could be extended a bit. I looked them up in the man page, but it is not yet clear to me when I would use one or the other with feeds. Also, does “Next time you run git annex addurl it will only fetch any new items.” really only apply to --relaxed, and not --fast?
Furthermore, it would be good if there were a template variable itemnum that I can use to ensure that ls prints the casts in the right order, even when the titles of the items are not helpful.

Greetings, Joachim
importfeed just runs wget (or curl) to do all downloads, and wget's documentation says that works. It also says you can use ~/.netrc to store the password for a site.

The git-annex man page has a bit more to say about --relaxed and --fast. Their behavior when used with importfeed is the same as with addurl.

If the podcast feed provides an itemid, you can use that in the filename template. I don't know how common that is. Due to the way importfeed works, it cannot keep track of, eg, an incrementing item number itself.

itemdescription is something I can include in the template for the filename, but the descriptions can be really long... doesn't seem very elegant to have that in the file name. Could the description for example be included as metadata of the item?

Good idea, Sazius!
I've made importfeed store the metadata, as long as annex.genmetadata is set in .git/config.
Using a --template='${feedtitle}/${itempubdate}-${itemtitle}${extension}' with a libsyn RSS feed (eg, Poly Weekly), I found that itempubdate was expanding to "none", even though there is a date with each entry in the RSS, eg:

Maybe the date string cannot be parsed? But it does look like a fairly typical datestamp to me. If the cause is the mixed case in the tag, could pubDate be supported in addition to pubdate? (AFAICT pubDate is the standardised mix of lower/upper case, but maybe not the most common, in which case supporting both pubDate and pubdate might help?)

As seen with git-annex version 5.20141024~bpo70+1, installed from Debian Backports; AFAICT it's still the latest release to make it to backports. For now I'm just omitting "itempubdate" from my template.
Ewen
@ewen, I tested that feed and it is able to get the pubDate from it and parse it ok.
Most likely, your version of git-annex is not built with a new enough version of the haskell feed library. Version 0.3.9 or newer is needed to be able to extract pubdates. For example, Debian stable doesn't have a new enough version.
While tracking podcast media URLs usually works to avoid duplicate downloads, when it fails it usually fails spectacularly. In particular, if a podcast feed decides to update all the URLs (for old and new podcasts) to use a different URL scheme, then suddenly that looks like a huge volume of new URLs, and all of them get downloaded again -- even if the content has actually already been retrieved from a different URL (ie, older URL scheme). For instance the acast.com service has changed their URL scheme a couple of times in the last 1-2 years, rewriting all the historical URLs, so I have three copies of many of the episodes on podcasts on their service. (Many downloaded; some skipped once I caught the bulk download and stopped it/reran with --fast or --relaxed to make placeholders instead. acast.com seem to have managed to cause even more confusion by rewriting many of the older mp3 files with new id3 tags, thus changing the file size/hashes -- it definitely made cleaning up more complicated.)

Some (all?) podcast feeds also have a guid field, which specifies what should be a unique per-episode and unchanging value, that other podcatchers use to track "seen this" content. In theory that guid value should be stable even across media URL changes -- at least if it isn't, then a podcaster changing the guid and media URL will almost certainly induce re-downloads in most podcatchers, and thus hopefully realise that early on (eg, during testing) rather than in production.

Can git-annex be extended to track the guid values as well as the filenames, so git annex importfeed can avoid downloading episodes where it has already processed that guid, and instead just add the newly listed url as an alternate web URL for that specific episode (which has been my manual workaround)? Perhaps the episode guid could be stored as additional metadata, along with some sort of feed unique ID (link?), and then an index built/consulted when importfeed runs (although that "feed unique ID" would probably also have to be updatable by the user, to cope with "the feed URL has now changed from http:// to https://", which also seems to be happening a bunch at present).

Ewen
PS: Apologies for duplicate partial comment; I think my browser decided some key combination meant "do default form action", which is post -- and I wasn't finished writing. I couldn't see a way to edit the comment, hence deleting/readding.
@ewen importfeed already tracks guids, since 2015. Relevant commit is f95a8c867223b2e17d036d0d3377bf0fc9d3adff
You may well have an older version of git-annex that didn't do that. But there are probably also feeds that lack a useful guid, or that even make a change that changes the guid of an existing item.
With git annex metadata, you can see the itemid, which is where the guid is stored.

PS, please post in todo when you have a request.
@joey - thanks, that's prompt feature request fulfilment
Looking more closely at the duplicates, it turns out that not everything got duplicated, just the "older" episodes. It turns out the newer episodes do have guid values saved (as itemid in the metadata) and the older episodes do not. I think this is most likely because I was running a fairly old git-annex until about October 2016, on a fairly old OS install, but then upgraded to a more recent one (now about 6 months old) which does track them. My assumption (without checking every file) is that the episodes downloaded before October 2016 are the ones that got duplicated.

I've edited the main page and added a note that GUIDs are tracked in versions since 2015, since I didn't obviously find that listed anywhere before.
Ewen
playlist_id instead of channel_id in the https://www.youtube.com/playlist?<query> URL. My immediate problem was that it fetched only the last videos... I will need to resort to some manual treatment or the like to feed in the rest.