Creating Reusable Code

Creating reusable software is challenging, especially when that software may be reused in situations or scenarios for which it was never designed. We’ve all had that meeting where a boss or manager asked the question: “What you’ve designed is great, but can we also use it here?”

In the last month I’ve had this exact experience, from which I’ve learned a number of valuable lessons about crafting reusable software.

eTexts and annotations

When I first started working for eNotes, my initial task was to fix some code related to electronic texts that we displayed on our site (e.g., Shakespeare, Poe, Twain, etc.). We have a significant collection of annotations for many texts, and those annotations were displayed to users when highlights in the text were clicked. A couple of years ago we spun this technology off into a separate product, Owl Eyes, with additional teacher tools and classroom management features. Because of my experience with the existing eText and annotation code, and because I am primarily responsible for front-end JavaScript, I was tasked with building a “Kindle-like” experience in the browser for these eTexts. (This is one of the highlights of my career. The work was hard, and the edge cases were many, but it works very well across devices, and has some pretty cool features.)

Filtering, serializing, and fetching annotation data

The teacher and classroom features introduced some additional challenges that were not present when the eText content was first hosted on enotes.com. First, classrooms had to be isolated from one another, meaning that if a teacher or student left an annotation in an eText for their classroom, it would not be visible to anyone outside the classroom. Also, a teacher needed the ability to duplicate annotations across classrooms if they taught multiple courses with the same eText. Eventually we introduced paid subscriptions for premium features, which made annotation visibility rules even more complicated. All Owl Eyes Official annotations are available for free, public viewing, but certain premium educator annotations are restricted to paid subscribers. (Also, students in a classroom taught by a teacher with a paid subscription are considered subscribers, but only within that classroom’s texts!) It was complicated.

We devised a strategy whereby a chain of composable rules could be applied to any set of annotations to filter them by our business requirements. These rules implemented a simple, identical interface, and each could be passed as an argument to another to form aggregates. The filtered annotation data was then serialized as JSON and emitted onto the page server-side. When the reader renders in the client, this data is deserialized and the client-side application script takes over.
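The real rule classes aren’t shown here, but the pattern can be sketched in JavaScript. The names below (PublicOnlyRule, ClassroomRule) are hypothetical stand-ins for our actual rules:

```javascript
// Every rule implements the same interface -- a single apply(annotations)
// method -- so any rule can wrap another to form an aggregate.
class PublicOnlyRule {
  apply(annotations) {
    return annotations.filter((a) => a.isPublic);
  }
}

class ClassroomRule {
  // Rules accept other rules, forming a chain: this rule filters by
  // classroom, then delegates to the inner rule it was given.
  constructor(classroomId, innerRule) {
    this.classroomId = classroomId;
    this.innerRule = innerRule;
  }
  apply(annotations) {
    const scoped = annotations.filter(
      (a) => a.classroomId === null || a.classroomId === this.classroomId
    );
    return this.innerRule ? this.innerRule.apply(scoped) : scoped;
  }
}

// One composed rule can filter any set of annotations.
const rule = new ClassroomRule(42, new PublicOnlyRule());
const visible = rule.apply([
  { id: 1, isPublic: true, classroomId: null },
  { id: 2, isPublic: true, classroomId: 7 },
  { id: 3, isPublic: false, classroomId: 42 },
]);
// visible contains only annotation 1
```

Because every rule honors the same interface, new business requirements become new rule classes rather than changes to existing ones.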

The role that a given user possesses in the system often determines whether they can see additional meta-data related to annotations, or whether they can perform certain actions on those annotations. These permissions were communicated to the front-end to enable/disable features as needed, and then enforced on the back-end should a clever user attempt to subvert the limitations of their own role. To keep the data footprint as light as possible on the page, we developed a composable serialization scheme that could be applied to any entity in our application. The generic serialization classes break down an entity’s data into a JSON structure, while more specialized serialization classes add or remove data based on a user’s role and permissions. In this way a given annotation might contain meta-data of interest to teachers, but would exclude that meta-data for students. Additional information is added if the user is an administrator, to give them better control over the data on the front-end.
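A minimal JavaScript sketch of this layering, with hypothetical names standing in for our real serialization classes:

```javascript
// A generic serializer produces the base JSON structure; role-aware
// serializers wrap it and add (or remove) fields. Illustrative only.
class AnnotationSerializer {
  serialize(annotation) {
    return { id: annotation.id, body: annotation.body };
  }
}

class TeacherAnnotationSerializer {
  constructor(inner) {
    this.inner = inner;
  }
  serialize(annotation) {
    // Start from the generic structure, then add teacher-only metadata.
    const data = this.inner.serialize(annotation);
    data.studentEngagement = annotation.studentEngagement;
    return data;
  }
}

const base = new AnnotationSerializer();
const forTeachers = new TeacherAnnotationSerializer(base);

const annotation = { id: 7, body: 'To be...', studentEngagement: 12 };
const studentView = base.serialize(annotation);        // no metadata
const teacherView = forTeachers.serialize(annotation); // extra metadata
```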

The end result is that, from a user’s perspective, the annotations visible to them, and the data within those annotations, are tailor-made to the user when the eText reader is opened.

Fast-forward to the present day. I have recently been tasked with bringing our eTexts and annotations full circle, back to enotes.com. We brainstormed about the best way to make this happen, as enotes.com lacks the full eText and annotation data, as well as the rich front-end reading experience.

We decided that since the eText and annotation data was already being serialized as JSON for client-side consumption in owleyes.org, it would be trivial to make that same data available via an API. I implemented a simple controller that made use of Symfony’s authentication mechanisms for authenticating signed requests via API key pair, and returned annotation JSON data in the exact same manner that would be used for rendering that data in the eText reader. On inspection, I realized that some of the annotation data wasn’t relevant to what we wanted to display on enotes.com, so I quickly created new serialization classes that made use of existing serialization classes, but plucked unwanted data from their generated JSON structures before returning them. No changes were necessary to the annotation filtering rules, as an API user is, from the ruleset’s perspective, a “public user”, and so would see the same annotation data that users who aren’t logged in on the site would see.
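The “pluck” technique can be sketched like so (hypothetical names; our real classes are PHP, but the pattern is the same):

```javascript
// An API serializer reuses an existing serializer, then removes fields
// the API consumer does not need. Illustrative only.
class ApiAnnotationSerializer {
  constructor(inner, excludedFields) {
    this.inner = inner;
    this.excludedFields = excludedFields;
  }
  serialize(annotation) {
    const data = this.inner.serialize(annotation);
    for (const field of this.excludedFields) {
      delete data[field];
    }
    return data;
  }
}

// A stand-in for an existing serializer that produces the full structure.
const fullSerializer = {
  serialize: (a) => ({ id: a.id, body: a.body, classroomNotes: a.classroomNotes }),
};
const apiSerializer = new ApiAnnotationSerializer(fullSerializer, ['classroomNotes']);
const out = apiSerializer.serialize({ id: 1, body: 'hi', classroomNotes: 'secret' });
// out === { id: 1, body: 'hi' }
```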

Fetching this data on enotes.com was a simple matter of using PHP’s cURL extension to request data from the owleyes.org endpoint.

The user interface

The eText reader JavaScript code on owleyes.org is complex; it is composed of many different modules – view modules, state modules, utility modules, messaging modules, etc. – that interact together to form a smooth reading experience. It is far more interactive than the pages we wanted to display on enotes.com, so I initially worried that the code would not be entirely reusable because of its complexity.

I was pleasantly wrong.

When I write software I take great pains to decouple code, favor composition over inheritance, and observe clear, strict, and coarse API boundaries in my modules and classes. I, as every programmer does, have a particular “style” of programming – the way I think about and model problems – which, in this case, served me very well.

I copied modules from the owleyes.org codebase into the enotes.com codebase that I knew would be necessary for the new eText pages to function. With some minor adjustments (mostly related to DOM element identifiers and classes) the code worked almost flawlessly. Where I needed to introduce new code (we’re using a popup to display annotations in enotes.com, whereas in owleyes.org we use a footer “flyout” that cycles through annotations in a carousel) the APIs in existing code were so well defined that I was able to adapt to them with few issues. Where differing page behavior was desired (e.g., the annotation popup shifts below the annotation when it gets too close to the top of the screen as the reader scrolls, and above otherwise) the decoupled utility modules that track window and page state already provided me with the events and information I needed to painlessly implement those behaviors. And because the schema of the serialized annotation data delivered over the API was identical to the JSON data embedded in the owleyes.org reader, the modules that filtered, sorted, and otherwise manipulated that data did not change at all.

Why it worked

Needless to say, this project left me very satisfied as a developer. When your code is painlessly reused in other contexts it means you’ve done something right. I’ve made some observations about what made this reuse possible.

First, reusable code should model a problem, or a system, in such a way that the constituent components of that model can act together, or be used in isolation, without affecting the other parts of the model. Modules, classes, and functions are the tangible building blocks we use to express these models in software, and they should correspond with the way we think about these models in our heads. Each should be named appropriately, corresponding to some concept in the model, and the connections between them should be well understood and obvious. For example, in the eText reader, a tooltip is a highlighted portion of text that may be clicked on to display an annotation popup, which displays annotation information. The tooltip and annotation popup are components in the visual model; they are named appropriately, and the relationship between them is one-way, from tooltip to popup.
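As a sketch (with hypothetical names), the one-way tooltip-to-popup relationship might look like:

```javascript
// The tooltip knows about the popup; the popup knows nothing about the
// tooltip. Illustrative stand-ins for the real reader modules.
class AnnotationPopup {
  constructor() {
    this.visibleAnnotation = null;
  }
  show(annotation) {
    this.visibleAnnotation = annotation;
  }
}

class Tooltip {
  constructor(annotation, popup) {
    this.annotation = annotation;
    this.popup = popup;
  }
  // Invoked when the highlighted text is clicked.
  handleClick() {
    this.popup.show(this.annotation);
  }
}

const popup = new AnnotationPopup();
const tooltip = new Tooltip({ id: 3, body: 'Note' }, popup);
tooltip.handleClick();
```

Because the dependency points only one way, the popup can be replaced (as it was on enotes.com) without touching the tooltip’s internals.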

Second, a given problem may in fact be composed of multiple models that are being run at the same time. Modules that control the UI are part of the visual or display model; modules that control the access to, and filtering of, data are part of the domain model. Modules that track mouse movements, or enable/disable features based on user interaction, are part of the interaction model. Within these models, objects or modules should only perform work that makes sense within the purpose of the model. Objects in the visual model should not apply business rules to data, for example. When one or more objects exhibit behaviors from multiple models, extracting and encapsulating the behavior that is not part of each object’s primary model makes that object more reusable.
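A small sketch of such an extraction, with hypothetical names:

```javascript
// Domain model: a business rule, extracted so it is reusable anywhere
// and testable without any UI.
function canView(annotation, user) {
  return annotation.isFree || user.isSubscriber;
}

// Display model: no business logic, only presentation. The view renders
// whatever the domain model says is visible.
function renderAnnotationList(annotations, user) {
  return annotations
    .filter((a) => canView(a, user))
    .map((a) => `<li>${a.title}</li>`)
    .join('');
}

const html = renderAnnotationList(
  [
    { title: 'Free note', isFree: true },
    { title: 'Premium note', isFree: false },
  ],
  { isSubscriber: false }
);
// html === '<li>Free note</li>'
```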

Third, objects within a model should have well-defined, coarse APIs. (In the context of objects, an API is an object’s “public” methods to outside callers, or to the objects that extend it.) A coarse API is one that provides the least amount of functionality that its responsibilities require. Yes, the least. An object either stands alone, or makes use of other objects to do its work. If the methods on an object are numerous the object can likely be broken down into several smaller objects to which it will delegate and on which it will depend to do its work internally. Ask: what abstraction does this object represent, and which methods fulfill that abstraction? Likewise, the parameters to an object’s methods can often be reduced by passing known state to the object’s constructor (or factory function, or whatever means are used to create the object). This chains the behavior of the object to a predetermined state – all remaining method arguments are only augmentations to this state. If the state needs to change, another object of the same type, with different state, is created and used in its stead. The API is coarse because the methods are few, and their parameters are sparse.
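A sketch of this technique with hypothetical names:

```javascript
// The classroom and user never change during a reading session, so they
// are fixed at construction time rather than passed to every method.
class AnnotationFetcher {
  constructor(classroomId, userId) {
    this.classroomId = classroomId;
    this.userId = userId;
  }
  // Coarse API: one method, one parameter that augments the fixed state.
  buildQuery(chapterId) {
    return {
      classroom: this.classroomId,
      user: this.userId,
      chapter: chapterId,
    };
  }
}

// If the classroom changes, create a new fetcher instead of mutating.
const fetcher = new AnnotationFetcher(42, 7);
const query = fetcher.buildQuery(3);
```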

Fourth, an object’s state should be stable at all times. Its initial state should be set, completely, through the object’s source of construction (whether by data provided via parameters, or sensible defaults, or both). Properties on objects should be considered read-only, as they represent a “window” into the object’s state. Computed properties should be calculated whenever an object’s relevant internal state changes, usually the result of a method invocation. I avoid exposing objects that can be manipulated by reference through properties; properties are always primitives that can be reproduced or recalculated, or collections of other “data” objects that have the same characteristics (usually cloned or reduced from some other source). If an object needs to expose information from one of its internal children, I copy that information from the internal source to a primitive property on the external object itself. If the information is itself in the form of an object with multiple properties, I flatten those into individual properties on the external object. The end result is that an object’s state is always generated internally, as a consequence of method invocations, and cannot be manipulated externally, except by way of its public API (methods).
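A sketch of these rules applied to a hypothetical annotation object:

```javascript
// State is set completely at construction, exposed only as primitives,
// and changed only as a consequence of method invocations.
class Annotation {
  constructor(data) {
    this._body = data.body;
    this._replyCount = 0;
    // Flatten the nested author object into primitive properties rather
    // than exposing it by reference.
    this._authorName = data.author.name;
    this._authorRole = data.author.role;
  }
  get body() { return this._body; }
  get replyCount() { return this._replyCount; }
  get authorName() { return this._authorName; }
  get authorRole() { return this._authorRole; }
  // The only way to change state is through the public API.
  addReply() { this._replyCount += 1; }
}

const a = new Annotation({
  body: 'Note the meter here.',
  author: { name: 'Owl Eyes Staff', role: 'editor' },
});
a.addReply();
```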

Finally, shared data should exist in “bags” – objects that jealously guard data and only deliver data by value to callers when asked. For example, on owleyes.org a given chapter in Hamlet may contain hundreds of annotations. Annotations may be created, edited, deleted, and receive replies in client code. The annotation bag is responsible for holding the annotation data and delivering it, in read-only format, to other modules as requested so that they can render themselves (or perform computations) accordingly. When an annotation changes – when an owleyes.org PUT request is sent to the API and a successful response is received – a method on the bag is invoked to update the annotation. Because annotations are only fetched by value, it does no good for the module that initiated the update to directly manipulate the properties on its own annotation object. No other module will receive the change. Instead, the responsible module tells the bag to update the annotation by passing it the new annotation deserialized from the API response. The bag replaces the annotation in its internal collection and then raises an event to notify listening modules that the given annotation has changed. Any module interested in that annotation – or all annotations – then requests the updated data (in read-only format) and re-renders itself (or re-computes its internal state). The bag, then, is the shared resource among modules (not the data, directly) and it is the source of truth for all data requests.
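A minimal sketch of such a bag (hypothetical names, simplified events):

```javascript
// The bag owns the data, hands out copies by value, and raises an event
// when an annotation is replaced. Illustrative only.
class AnnotationBag {
  constructor(annotations) {
    this._annotations = new Map(annotations.map((a) => [a.id, a]));
    this._listeners = [];
  }
  onChange(listener) {
    this._listeners.push(listener);
  }
  // Deliver by value: callers get a copy, not a live reference.
  get(id) {
    const found = this._annotations.get(id);
    return found ? { ...found } : null;
  }
  // The only way to change data: replace it and notify listeners.
  update(annotation) {
    this._annotations.set(annotation.id, { ...annotation });
    this._listeners.forEach((listener) => listener(annotation.id));
  }
}

const bag = new AnnotationBag([{ id: 1, body: 'original' }]);
let changedId = null;
bag.onChange((id) => { changedId = id; });

const copy = bag.get(1);
copy.body = 'mutated locally'; // does NOT affect the bag
bag.update({ id: 1, body: 'updated via the bag' });
```

Mutating the fetched copy changes nothing; only `update` alters the bag’s data, and every listening module hears about it.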

Epilogue

There is more I could say on the patterns and principles that arose during the execution of this project, but those enumerated above were of the greatest import and consequence while porting existing code into its new context. Reusable code is not easy to write. It is not automatic. It is the result of thought and discipline that slowly become habit as they are exercised.

Not all code will be reused; most won’t, in fact. But writing code with a view to extension and reuse can pay off in time and effort in the long run. There is a trade-off, though. The more reusable code tends to be, the more layers of indirection it will possess, necessitating an increase in the number of modules, classes, functions, etc. that need to be created. This cost can be mitigated by keeping code as simple as possible. Code can be navigated with relative ease if one can reason about it, inferring what modules (etc.) do and how they are related.

While I can’t guarantee your experience will be as pleasant as mine, I do believe that if you think about these patterns and principles and put them into action, you will one day experience the joy of truthfully telling your manager, “oh, that will only take two weeks!” because your diligence produced well-crafted, reusable code.

My Favorite Book of 2018

I was asked to write about the best book I’ve read in 2018 in 200 words or less. Here we go.

My current obsession is the authors Will and Ariel Durant, two of the last century’s most prolific historians and pure joys to read. This year I read The Story of Philosophy by Will Durant, a work which reminds me that the past is not so unlike the present, and that the problems of humanity now are the same problems humanity has always faced. Philosophy tells the story of fifteen Western philosophers, from Plato to Dewey, explaining the ideas of each through the lenses of their personal experiences and cultures. Of these, my favorite is Spinoza, a Jewish philosopher who envisioned a god beyond that of his youth. For that he and his progeny were literally cursed by his peers and his people, forcing him, as an exile, to seek refuge among the Dutch. While I disagree with Spinoza’s metaphysics, Durant so masterfully presents a paramount human that I cannot but fall in love with his ethos: tolerance and benevolence that leaves humans free to express convictions peacefully. Philosophy stimulates the mind with both rich ideas and eloquent prose as it brings great ideas and great thinkers to the layman. It has earned its place well on my list of favorite books.

READ IT.

Politicians don't understand the Internet (or anything else)

The internet is justifiably ablaze with criticism of Rudy Giuliani’s recent Tweet in which he blames Twitter for linking his own fat-fingered mistype to an anti-Trump website. I feel bad for defending Twitter because the company is no friend of actual free speech, but my loathing for semi-private companies is only eclipsed by my loathing of politicians, so here it goes.

Twitter auto-links anything that appears to be a URL, roughly defined as text that’s not a dot or a space, followed by a dot, followed by text that’s not a space. The Internet is, of course, defined by URLs, so this feature makes sense for a technology company that hosts Internet-based content.
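That rough definition can be expressed as a regular expression. This is my own approximation for illustration, not Twitter’s actual implementation:

```javascript
// A rough approximation of the auto-linking heuristic described above
// (illustrative only -- not Twitter's actual rule): one or more
// characters that are neither dots nor spaces, a dot, then one or more
// non-space characters.
const urlLike = /[^.\s]+\.\S+/;

urlLike.test('G-20.In November');  // true: a missing space creates a "URL"
urlLike.test('G-20. In November'); // false: the space breaks the pattern
```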

Giuliani’s tweet contained a poorly punctuated reference to the G-20 summit – a reference which triggered Twitter’s auto-linking feature, and which prompted a tech-savvy individual to register the URL to which the tweet auto-linked. At this URL, that irreverent middling pleb sought to get under aristocrat Giuliani’s skin by hosting an anti-Trump message, for which Giuliani blamed Twitter, proving his own ignorance of how the Internet (and likely most other technology) works. Of course he’s not alone. Who can forget the Net Neutrality debate in which one of our revered statesmen referred to the Internet as “a series of tubes”?

If Giuliani were my grandfather I could perhaps forgive him for this faux pas, but he’s not. He’s a politician who has a direct impact on the “relationship” between government and the private technology sector. The problem is obvious. Since the government’s only hammer is violence and force, the last person who should be in an authoritative position is someone who has no idea how to identify an actual nail.

As someone who grew up in the Internet era, it’s easy to shrug off this ignorance as just the province of an aged aristocracy who did not. But this is not just a case of generational ignorance. I recall, as a middle-school student, visiting our legislators in Jefferson City. I sat in on a committee meeting in which my representatives discussed the exact content of a public school curriculum that they considered necessary for the education of a younger populace. And I realized, in that moment, that very few (if any) of my representatives had any clue about what constituted actual education, or possessed reasons for their opinions other than it would sound good to their constituents and thus ensure their re-election.

The truth is that politicians are rarely qualified to judge the milieu in which they legislate. And that’s because they are normal human beings, and like the rest of us, their ignorance outweighs their knowledge in most things. “But they have access to specialists who advise them!”, some say. True. And most people have access to the Internet – the largest repository of information ever collected, indexed, and formatted in a user-friendly way. And yet most of us would agree that access to information and expert opinion does not a wise person make.

Human life and social interaction are complex, and cannot be reduced to committee-meeting decisions. Even the most qualified professionals are limited to their own experiences and knowledge.

Economist Friedrich von Hayek observed that,

“The curious task of economics is to demonstrate to men how little they really know about what they imagine they can design.”

His observation can, and should, be extended to legislation and regulation. Who would go to a senator or representative for dental work, solely on the basis of their access to “professional opinion”? Nobody. We should limit government severely. Not because there aren’t good people in government positions (maybe like three), but because no matter how virtuous, well-intentioned, or smart they may be, they are still at most only capable of making the coarsest of decisions for the 325 million people they represent.

We need to invent technology that's never even been invented yet.

Microsoft is building a Chromium web browser to replace Edge on Windows 10

From Windows Central:

Microsoft’s Edge web browser has seen little success since its debut on Windows 10 in 2015…

I’m told that Microsoft is throwing in the towel with EdgeHTML and is instead building a new web browser powered by Chromium, which uses a similar rendering engine first popularized by Google’s Chrome browser known as Blink. Codenamed “Anaheim,” this new browser for Windows 10 will replace Edge as the default browser on the platform, according to my sources, who wish to remain anonymous…

Using Chromium means websites should behave just like they do on Google Chrome in Microsoft’s new Anaheim browser, meaning users shouldn’t suffer from the same instability and performance issues found in Edge today.

If this is true then, like most things Microsoft does now, it’s too little, too late. The people who care about their browsing experience already use alternative browsers (and tell their family members to do the same). At most it will alleviate some enterprise developer suffering, as they now have a management-friendly argument to ditch IE support in favor of Anaheim.

I really don’t understand why MS even bothers with browsers anymore. Why not just strike deals with Google, Mozilla, and Apple to pre-install their browsers for cash? Does the Edge feature-set (which I assume will be ported to Anaheim) offer more than these browsers and their relatively mature extension communities? I doubt it.

Time will tell.

Using the find command

The find command is used to recursively locate files in a directory hierarchy. Since programmers and system administrators spend a great deal of time working with files, familiarity with this command can make each more efficient at the terminal.

The command is composed of four parts:

  • the command name
  • options that control how the command searches (optional)
  • the path(s) to search (required)
  • expressions (composed of “primaries” and “operators”) that filter results (optional)

A primary is a switch, such as -name or -regex, that may or may not have additional arguments. An operator is a primary, such as -or or -not, that combines expressions in logical ways.

Basic usage

Finding files by name is perhaps the most common use of the find command. Its output consists of all paths in which the file name appears within the directory structure it searches.

$ find . -name README.md
./node_modules/hexo-renderer-stylus/README.md
./node_modules/is-extendable/README.md
./node_modules/striptags/README.md
...
./README.md

NOTE: All terminal examples were generated within the directory structure of my blog, created by the Hexo static site generator.

The path given to find in the first argument is the path prepended to each result. The . path instructs find to search under the working directory and generate relative paths in the output. If an absolute path to the working directory is used, however, the full path will appear in results. Command substitution may be used to produce absolute paths in the output without typing the full directory path.

$ find `pwd` -name README.md
/Users/nicholascloud/projects/nicholascloud.com/node_modules/hexo-renderer-stylus/README.md
/Users/nicholascloud/projects/nicholascloud.com/node_modules/is-extendable/README.md
/Users/nicholascloud/projects/nicholascloud.com/node_modules/striptags/README.md
...
/Users/nicholascloud/projects/nicholascloud.com/README.md

If a filtering expression is omitted, find will return all file paths within its search purview.

$ find .
.
./scaffolds
./scaffolds/draft.md
./scaffolds/post.md
./scaffolds/page.md
...

Other scenarios

The find command can do much more than locate files by exact name, however. It can find files according to a wide swath of criteria in many different scenarios.

We may not know the exact name of the file for which we are searching

The -name primary supports wildcard searches.

  • an asterisk (*) can replace any consecutive number of characters
  • a question mark (?) can replace any single character
$ find . -name '*javascript*.md'
./source/_posts/javascript-frameworks-for-modern-web-dev.md
./source/_posts/l33t-literals-in-javascript.md
./source/_posts/historical-javascript-objects.md
./source/_posts/maintainable-javascript-book-review.md

By default, the -name primary is case-sensitive. To conduct a case-insensitive search, we can use the -iname primary.

$ find . -iname '*JAVA*.md'
./source/_posts/java-4-ever.md
./source/_posts/javascript-frameworks-for-modern-web-dev.md
./source/_posts/l33t-literals-in-javascript.md
./source/_posts/historical-javascript-objects.md
./source/_posts/maintainable-javascript-book-review.md

For more complex searches, we can use the power of regular expressions with the -regex and -iregex (case-insensitive) primaries.

NOTE: Use the -E option to specify that extended regular expressions should be used instead of basic regular expressions.

$ find -E . -regex '.*package(-lock)?\.json'
./node_modules/hexo-renderer-stylus/package.json
./node_modules/is-extendable/package.json
./node_modules/striptags/package.json
./node_modules/babylon/package.json
...
./package-lock.json
./package.json

We may want to limit our results to a specific path, or a pattern that matches multiple paths

To filter results by a path mask, we can specify a pattern with -path and -ipath (case-insensitive). Both support the asterisk and question mark wildcards.

$ find . -path './node_modules/*/lib/*' -regex '.*hexo.*'
./node_modules/hexo-renderer-stylus/lib/renderer.js
./node_modules/hexo-renderer-marked/lib/renderer.js
./node_modules/hexo-generator-archive/lib/generator.js
./node_modules/hexo-migrator-wordpress/node_modules/async/lib/async.js
./node_modules/hexo-log/lib/log.js
./node_modules/hexo-generator-category/lib/generator.js
./node_modules/hexo-i18n/lib/i18n.js
./node_modules/hexo-pagination/lib/pagination.js
./node_modules/hexo-generator-index/lib/generator.js
./node_modules/hexo-util/lib/pattern.js

Using the -path primary does not change the top-level directory in which find begins its search; it merely filters results by sub-directory.

We may want detailed information about the files we find

To see detailed information about a file, in a manner similar to ls -l, the -ls primary may be appended to the list of primaries.

$ find . -name '*.md' -path '*/_posts/*' -ls
8636130637 8 -rw-r--r-- 1 nicholascloud staff 490 Oct 24 12:25 ./source/_posts/strange-loop-2010-video-release-schedule-posted.md
8637286945 8 -rw-r--r-- 1 nicholascloud staff 1570 Oct 24 12:45 ./source/_posts/god-mode-in-windows-7-not-as-cool-as-rise-of-the-triad.md
8636130716 8 -rw-r--r-- 1 nicholascloud staff 204 Oct 24 12:25 ./source/_posts/what-writing-fiction-taught-me-about-writing-software.md
...

(As an alternative to the -ls primary, the -exec primary may be used to invoke ls, or the xargs command may be used for the same purpose.)

We may want to stop descending into a hierarchy once we’ve found the file(s) for which we’ve searched

The -prune primary prevents find from descending into any directory that matches its expression; the matching directory itself appears in the results, but nothing beneath it does. find will, however, continue to search at the same directory level as a found result for other potential matches.

$ find . -regex '.*middleware.*' -prune
./source/_posts/new-appendto-blog-post-streams-and-middleware-in-strata-js.md
./node_modules/stylus/lib/middleware.js
./node_modules/hexo-server/lib/middlewares
./public/2013/06/new-appendto-blog-post-streams-and-middleware-in-strata-js
./.deploy_git/2013/06/new-appendto-blog-post-streams-and-middleware-in-strata-js

By using the diff tool and IO redirection we can compare the output of a “pruned” result set with the output of unpruned results to see what paths were omitted. For example, in the diff below, the remaining paths that matched /node_modules/hexo-server/lib/middlewares/* were omitted once /node_modules/hexo-server/lib/middlewares had been added to the result set.

$ diff <(find . -regex '.*middleware.*') <(find . -regex '.*middleware.*' -prune)
4,9d3
< ./node_modules/hexo-server/lib/middlewares/route.js
< ./node_modules/hexo-server/lib/middlewares/redirect.js
< ./node_modules/hexo-server/lib/middlewares/logger.js
< ./node_modules/hexo-server/lib/middlewares/gzip.js
< ./node_modules/hexo-server/lib/middlewares/header.js
< ./node_modules/hexo-server/lib/middlewares/static.js
11d4
< ./public/2013/06/new-appendto-blog-post-streams-and-middleware-in-strata-js/index.html
13d5
< ./.deploy_git/2013/06/new-appendto-blog-post-streams-and-middleware-in-strata-js/index.html

We may only want to search to a particular depth OR search beyond a particular depth

Several primaries control depth traversal, or how far find will go to locate results.

-maxdepth controls the path depth to which find will traverse before stopping.

$ find . -name '*.css' -maxdepth 3
./public/fancybox/jquery.fancybox.css
./public/css/style.css
./.deploy_git/fancybox/jquery.fancybox.css
./.deploy_git/css/style.css

-mindepth controls the path depth at which find will start to search.

$ find . -name '*.css' -mindepth 6
./node_modules/hexo/node_modules/hexo-cli/assets/themes/landscape/source/fancybox/jquery.fancybox.css
./node_modules/hexo/node_modules/hexo-cli/assets/themes/landscape/source/fancybox/helpers/jquery.fancybox-thumbs.css
./node_modules/hexo/node_modules/hexo-cli/assets/themes/landscape/source/fancybox/helpers/jquery.fancybox-buttons.css
./themes/landscape/source/fancybox/helpers/jquery.fancybox-thumbs.css
./themes/landscape/source/fancybox/helpers/jquery.fancybox-buttons.css

-depth specifies the exact depth at which find will search.

$ find . -name '*.css' -depth 5
./node_modules/async-limiter/coverage/lcov-report/prettify.css
./node_modules/async-limiter/coverage/lcov-report/base.css
./themes/landscape/source/fancybox/jquery.fancybox.css

We may want to find files that are newer/older relative to another file

The -newer primary will find files that are newer than the specified file by comparing the modification times of each.

$ ls -l source/_drafts/
-rw-r--r-- 1 nicholascloud staff 189 Nov 7 11:51:20 2018 the-importance-of-names.md
-rw-r--r-- 1 nicholascloud staff 353 Nov 7 11:50:49 2018 the-most-satisfying-thing.md
-rw-r--r-- 1 nicholascloud staff 10812 Nov 8 19:13:09 2018 using-the-find-command.md

$ find . -newer source/_drafts/the-importance-of-names.md -path '*_drafts*'
./source/_drafts/using-the-find-command.md

For more fine-grained control, use the -newerXY primary, where the values of X and Y represent different kinds of file timestamps (see the table below). The timestamp for X applies to the files that find evaluates; that of Y applies to the file path argument supplied for comparison.

X/Y flag Meaning
a access time
B inode creation time
c change time (file attributes)
m modification time (file contents)
t (Y only) the file argument is interpreted as a date understood by cvs(1)

For example, the command find . -neweram foo.txt will find all files that have a newer access time than the modification time of foo.txt.

For each X flag there are shortcut primaries that make a comparison against the modification time of the file argument.

  • -anewer compares the access time of each file in the result set to the modification time of the specified file.
  • -Bnewer compares the inode creation time of each file in the result set to the modification time of the specified file.
  • -cnewer compares the change time of each file in the result set to the modification time of the specified file.
  • -mnewer compares the modification time of each file in the result set to the modification time of the specified file, and is identical to -newer.

We may want to filter results by file type

In Unix-like systems, “everything is a file”, and these files have types. The find command can detect file type and filter results accordingly. Regular files (for which we search most often) have a type of f; directories have a type of d. Block files – disks, for example – have a type of b.

In OSX it is easy to find all block files that represent disks (physical and logical).

$ find /dev -name 'disk*' -type b
/dev/disk0
/dev/disk0s1
/dev/disk0s2
/dev/disk1
/dev/disk1s2
/dev/disk1s3
/dev/disk1s1
/dev/disk1s4

The table below lists each file type that the find command may detect.

Flag Meaning
b block special
c character special
d directory
f regular file
l symbolic link
p FIFO
s socket
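
A quick way to see the type filters in action is to build a scratch directory containing a few different file types (the paths here are illustrative only):

```shell
# A scratch directory with a regular file, a subdirectory, and a symlink.
dir=$(mktemp -d)
mkdir "$dir/subdir"
touch "$dir/subdir/file.txt"
ln -s subdir/file.txt "$dir/link.txt"

find "$dir" -type f    # regular files only
find "$dir" -type d    # directories (including $dir itself)
find "$dir" -type l    # symbolic links
```

Note that find does not follow symbolic links by default, so link.txt matches -type l rather than -type f.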

We may want to search for files that a particular user or group owns (or inversely, that are not owned by a known user or group)

Users and groups are identified by name and numeric ID on Unix-like systems. In OSX the id command tells me my user ID and group ID(s).

$ id
uid=501(nicholascloud) gid=20(staff)...

The find command accepts primaries that filter file results by user and/or group.

  • -uid <id> and -user <username> filter results by owning user. If the argument to -user is numeric and no user exists with that name, it is treated as a user ID.
  • -gid <id> and -group <groupname> filter results by owning group; the same numeric caveat applies to -group.

I write code for a website called OwlEyes.org, which is a PHP application served by the apache2 web server. If I search for files in my home directory owned by the www-data user (the typical apache2 user), I see some interesting results.

$ find . -user www-data
./projects/enotes/owleyesorg/app/logs/apache-error.log
./projects/enotes/owleyesorg/app/logs/apache-custom.log

Every other file in my project directory is owned by my user, but the apache2 log files are written by the web server, and are therefore owned by its user.

To find files that aren’t owned by any known user and/or group, the inverse primaries may be used.

  • -nouser shows results that do not belong to any known user (the owning user ID has no entry in the user database). It takes no argument.
  • -nogroup likewise shows results that do not belong to any known group.

We may want to find empty files or directories

To find empty files (zero-byte regular files, or directories that contain no entries), append the -empty primary to the find command. This can be useful, for example, to see which log files are empty on your system.

$ cd /var/log; sudo find . -empty
./appfirewall.log
./ppp
./alf.log
./apache2
./com.apple.xpc.launchd
./cups
./CoreDuet
./uucp

We may want to run a utility on the files identified by find

The -exec and -ok primaries may be used to run a command on each file in find’s result set. The two primaries are otherwise identical, but -ok requests user confirmation for each file before executing the specified command.

The syntax for executing a command with find is:

find <expression(s)> -exec <command> \;

The command is written in standard form, as you would type it in a terminal. If the string '{}' appears anywhere in the command, it will be replaced by the file path of each result as find iterates over them. Commands must be terminated by \;. (The backslash is necessary to prevent the shell from interpreting the semicolon itself.)

The command find . -newer db.json -type f -exec cp '{}' ~/tmp \;:

  • starts in the current directory
  • finds files that were modified after db.json (the database file that stores blog post information)
  • finds files of type “regular file”
  • and copies each one to ~/tmp
$ ls -lh db.json
-rw-r--r-- 1 nicholascloud staff 2.6M Nov 8 19:15 db.json

$ find . -newer db.json -type f -exec cp '{}' ~/tmp \;

$ ls -lh ~/tmp
-rw-r--r-- 1 nicholascloud staff 61B Nov 9 10:46 README.md
-rw-r--r-- 1 nicholascloud staff 14K Nov 9 10:46 using-the-find-command.md

Two corresponding primaries, -execdir and -okdir, behave like -exec and -ok, except that the command is executed from the directory that holds each matched file, with '{}' replaced by the file’s base name rather than its full path. (To pass as many file paths as possible to a single command invocation, akin to xargs, terminate the command with + instead of \;.) For example, one could use -execdir to archive files in a find result set.

$ find . -newer db.json -type f -execdir tar cvzf ~/tmp/back.tar.gz '{}' \;
a using-the-find-command.md
a README.md

Note that with the \; terminator, tar runs once per file and each invocation overwrites ~/tmp/back.tar.gz; a single -exec tar cvzf ~/tmp/back.tar.gz '{}' + invocation is a better fit for collecting the whole result set into one archive.

We may want to format find’s output

The output from find can be formatted in two ways.

By specifying the -print primary, the file path of each result in find’s result set is printed to standard output, terminated by a newline. This is how find displays results by default. However, some primaries, such as -exec, suppress this default behavior. The command find . -newer db.json -type f -exec cp '{}' ~/tmp \; will copy all files newer than db.json to ~/tmp, but will print nothing (cp itself produces no output). To force each file path to be displayed, the -print primary may be added before -exec.

$ find . -newer db.json -type f -print -exec cp '{}' ~/tmp \;
./source/_drafts/using-the-find-command.md
./README.md

The -print0 primary terminates each file path with an ASCII NUL character instead of a newline, and is useful when piping the output of find to xargs -0 or some similar command that expects NUL-delimited input (file names may themselves contain spaces or newlines).
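
A short sketch (with invented file names) shows why NUL delimiters matter: file names may contain spaces, which naive whitespace splitting would break into separate arguments.

```shell
# A scratch file whose name contains spaces, plus an ordinary one.
dir=$(mktemp -d)
touch "$dir/notes for class.md" "$dir/plain.md"

# -print0 terminates each path with a NUL byte; xargs -0 splits on NUL,
# so the space-laden name arrives at ls as a single argument.
find "$dir" -type f -name '*.md' -print0 | xargs -0 ls -l
```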

By default, primaries are combined with a logical AND to form an expression, but find supports two operators that change the way expressions are applied. If two expressions are separated by the -or operator, they are applied in a boolean OR fashion: results are returned that match either expression, or both.

$ find . -name '*eclipse*' -or -name '*clean*'
./source/images/2011/10/eclipse-example.png
./source/images/2011/10/eclipse-example-150x127.png
...
./source/images/2011/08/clean-coders1.png
./source/images/2011/08/clean-coders-150x117.png
...

If the -not (or !) operator precedes an expression, it negates it, removing matching file paths from the result set.

$ find . -name '*eclipse*'
./source/images/2011/10/eclipse-example.png
./source/images/2011/10/eclipse-example-150x127.png
./source/images/2011/10/eclipse-example-2-300x97.png
./source/images/2011/10/eclipse-example-2.png

$ find -E . -name '*eclipse*' ! -regex '.*[0-9]+x[0-9]+.*'
./source/images/2011/10/eclipse-example.png
./source/images/2011/10/eclipse-example-2.png

(Recall that the -E option in the example above forces find to use extended regular expressions when evaluating the -regex primary.)

We may want to delete found files

While it is possible to use -execdir rm '{}' \; to delete files in a result set, find supports a shorter primary, -delete, that accomplishes the same task. (Note that -delete implies the -depth option, so the contents of a directory are processed before the directory itself.) By default, -delete shows no output for each file that is removed; use the -print primary in conjunction with -delete to see which files were removed from the file system.
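
As a sketch (scratch paths invented for the demonstration), combining -empty, -print, and -delete removes zero-byte files while reporting each deletion:

```shell
# One empty file and one non-empty file in a scratch directory.
dir=$(mktemp -d)
touch "$dir/empty.log"
echo 'some content' > "$dir/full.log"

# Print and delete only the empty files.
find "$dir" -type f -empty -print -delete
```

Afterward only full.log remains in the directory.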

ES2015 Generators

I have written a guest blog post, “ES2015 Generators”, on the eNotes developer blog:

Recently I had the opportunity to re-write the content tree control that we use to manage content nodes in www.enotes.com. We’ve all worked with the DOM, which represents HTML nodes in a tree structure but has some challenging deficiencies and a relatively grumpy API, the tortures of which prompted me to take a stab at a smoother tree-like design. During this process I experimented with several approaches to managing nodes in a tree structure, the first of which was a “flat hierarchy” which is as obtuse as it sounds and didn’t get much traction. I then opted for the more traditional parent/child approach, but still wanted a way to treat an entire tree of nodes in a “flat” manner. The ES2015 generator function was my solution…

Read the rest of this article on the eNotes developer blog.

LaunchCode 101, Unit 1

The first unit of LaunchCode 101 is almost complete. Class size has probably halved (based on visual inspection, not exact numbers) since the beginning because, though the material is introductory, it is also challenging. Those who remain have made a serious personal effort to stay on track, and the buds of knowledge are finally beginning to bloom.

What’s really interesting to me is how different minds approach the same problems. Just tonight I reviewed three assignments wherein each student achieved the same goal by taking a unique approach. And all three were different from my solution.

The class also stretches my communication skills. When I explain things to other seasoned programmers, I can make assumptions about the knowledge that we already share and talk at a higher conceptual level. For students who have no prior programming knowledge, it is necessary to build a hierarchy of concepts from the ground up. Talking about complicated things in simple terms is both trying and rewarding.

This unit has also been more math intensive than I anticipated, which is good for *me* because I am mathematically weak. Working problems along with students gives me an opportunity to expand my own knowledge beyond the realm of code. And that’s music to a nerd’s ears.

Other than some technical difficulties, LC101 has been a success. Everyone is ready for winter break, but I think students and TFs will return for a strong start in January.