
I am at a monastery which includes a publishing house, and part of what we have is a shelf's worth of hard drives containing the work product of monks past and possibly present. I am expecting to find word processing and publishing documents in more than one (natural) language, organized (or not) in ways that made sense to the individual contributors. I am expecting Macintosh disks to be among the most common, and also Windows, and for that matter possibly an MS-DOS disk or two. (I am also expecting, based on what I know of anthropology, to find out I was significantly wrong in one or more of the assumptions I brought to the table.)

The endgame of what I would like to achieve is an intranet server, running on the same engine as https://orthodoxchurchfathers.com, which indexes HTML output having a link to the original media file and a POSH copy of the text content of the media file, for those documents I was able to successfully access.

--EDIT--

The original questions asked were something of a catch-all, and the question was closed for not being focused.

So I want to ask the one question that most motivated me to ask.

In the theory of relational databases, one principle is that you don’t store a person’s age; you may, however, store a person’s birthdate, and perhaps make a view that calculates the person’s age on the fly. In system administration, it is a principle that stinginess with privileges is kindness in disguise, and it is a related principle that work should be done at the lowest level of privilege that is sufficient. In software engineering, it is a micro-principle that you don’t change the counter or the endpoint in a count-controlled loop, and if you have what is otherwise a valid use case for doing it, you should use a while loop instead; it is a macro-principle that you should be aware of patterns and reuse them as appropriate to the situation. All of these are important but not obvious principles that distinguish doing something well from doing an autodidact’s first take.

So, my question is: what are the principles, perhaps in knowledge management, that distinguish copying well-engineered wheels from reinventing the wheel?

Thanks,

Christos Hayward

1 Answer


I worked at a company that did work similar to this. It was for legal discovery, which encounters many of the same problems: having to render to PDF and full-text index any file created by a company, going back 20+ years. It's hard work, but you have some constraints that make it easier. You have two problems to solve: to_html and to_text.

  • Modern software handles ancient formats pretty well, if it still ships the converter.
  • Sometimes you may have to go back to a 10-year-old version of the software to get a converter for a very old format.
  • The native application is always better if you care about print quality.
  • HTML has changed a lot over the years. The HTML output of Word 97 likely won't pass modern validators, but Word 2021 exporting an old .doc file to HTML will produce more valid markup.
  • Mac Office formats are different from PC formats, and the difference gets worse the further back in time you go.

Regarding connecting old hard drives, you're at the mercy of your operating system. Linux is generally polyglot in that way, but there are clear cases where mounting read-only is your best choice. We used Linux by preference specifically because it could just do things. However, Windows has many off-the-shelf tools that will let you read from Mac-formatted drives (with various caveats).

Because I worked in legal discovery, I know about write-blocker cables for SATA. I believe they also exist for IDE and SCSI, but I could be wrong. Write blocking for USB is a moving target, and I don't have advice.

What we did when working on legal preservation was copy files off the source media, and then do our conversion work on the copy, on media that was native to our platform. Sometimes that meant extracting on Linux, then copying to Windows to do Office operations.
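The copy-first step can be sketched roughly as below. This is a minimal illustration, not what we actually ran: the function names `sha256_of` and `preserve_copy` are hypothetical, and it assumes the source drive is already mounted read-only at `source_root` and that `work_root` lives on separate working media.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_copy(source_root: Path, work_root: Path) -> dict[str, str]:
    """Copy every regular file under source_root into work_root.

    Returns a manifest mapping each relative path to its checksum, so
    later conversion steps can show they started from a faithful copy.
    """
    manifest: dict[str, str] = {}
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue  # directories are recreated as needed below
        rel = src.relative_to(source_root)
        dst = work_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also tries to keep timestamps
        src_hash = sha256_of(src)
        if sha256_of(dst) != src_hash:
            raise IOError(f"checksum mismatch for {rel}")
        manifest[rel.as_posix()] = src_hash
    return manifest
```

Keeping the manifest next to the working copy means all later conversion work touches only the copy, and the original media never needs to be read again unless a checksum is questioned.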

This will get you files. Full-text extraction is different for every file format, and you're going to have to do the software side of that yourself. HTML exporting can sometimes be scripted from the command line, depending on what your editor supports. In other cases, dumping to .rtf will give you an artifact that can later be converted to HTML. In fact, if you're looking to standardize your HTML presentation, the to_rtf() --> to_html() pipeline is more likely to give you that.
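As a rough sketch of that pipeline: the code below assumes LibreOffice's `soffice` command-line converter is installed and can open the format in question, which is my assumption and not something that holds for every old file. The helper names (`convert_command`, `convert`, `to_html_via_rtf`, `to_text`) are hypothetical, and the text extraction is deliberately naive.

```python
import subprocess
from html.parser import HTMLParser
from pathlib import Path


def convert_command(source: Path, target_format: str, out_dir: Path) -> list[str]:
    """Build a headless LibreOffice conversion command (assumed tool)."""
    return ["soffice", "--headless", "--convert-to", target_format,
            "--outdir", str(out_dir), str(source)]


def convert(source: Path, target_format: str, out_dir: Path) -> Path:
    """Run the converter and return the path the output is expected at."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(convert_command(source, target_format, out_dir), check=True)
    return out_dir / f"{source.stem}.{target_format}"


def to_html_via_rtf(source: Path, work_dir: Path) -> Path:
    """to_rtf() --> to_html(): normalise through RTF before exporting HTML."""
    rtf = convert(source, "rtf", work_dir / "rtf")
    return convert(rtf, "html", work_dir / "html")


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def to_text(html: str) -> str:
    """Naive to_text(): one line per run of text between tags."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.parts)
```

Files the converter cannot open will make `subprocess.run` raise `CalledProcessError`; logging those failures per file gives you the list of documents that need an older, native application instead.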

sysadmin1138