Finally, I've found some time to make some improvements to this blog engine.
The first thing I was missing more and more was reasonable content for the description in the HTML headers. Without it, sites like Google+ always fetched the same constant text, and that did not look that good. So now there is something really relevant to the content: either the first paragraph (which is the case for all old entries) or something I can edit manually (in case I'm not too lazy).
For the automatic extraction I tried several approaches. The first attempt was simply to extract the first paragraph from the Markdown source, but I quickly remembered that quite a lot of older posts were written directly in HTML, which made this a big failure. Next I tried BeautifulSoup on the rendered HTML, which led to missing spaces around text that was originally in links. Finally I discovered html2text, which was pretty easy to use and does what I need.
In the end the code looks like the following:
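    import html2text

    def get_auto_summary(self):
        h2t = html2text.HTML2Text()
        # Disable line wrapping so each paragraph stays on a single line
        h2t.body_width = 0
        # Drop images, link targets and emphasis markers from the output
        h2t.ignore_images = True
        h2t.ignore_links = True
        h2t.ignore_emphasis = True
        # Convert the rendered HTML body to plain text
        text = h2t.handle(self.body.rendered)
        return text.splitlines()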
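To illustrate how the description gets chosen from that output, here is a rough sketch using only the standard library. It is not the engine's actual code: `first_paragraph` and `get_description` are made-up names, and it assumes the plain text has one paragraph per line (which is what disabling wrapping with `body_width = 0` should give) and that a manually written summary, when present, wins over the automatic one.

```python
def first_paragraph(text: str) -> str:
    """Return the first non-blank line of text, stripped of surrounding whitespace."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    # No non-blank line found (empty or whitespace-only input)
    return ""


def get_description(manual_summary: str, plain_text: str) -> str:
    """Prefer a hand-written summary; otherwise fall back to the first paragraph.

    Both parameter names are hypothetical and only serve this example.
    """
    manual = manual_summary.strip()
    return manual if manual else first_paragraph(plain_text)


# Sample input shaped like unwrapped plain text: one paragraph per line.
sample = "\nFirst paragraph of the post.\n\nSecond paragraph.\n"
print(get_description("", sample))
print(get_description("Hand-written summary.", sample))
```

Skipping blank lines instead of blindly taking the first list item keeps the helper safe even if the converted text happens to start with an empty line.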